0% found this document useful (0 votes)

34 views291 pages

MIT18 05S14 Readings

Uploaded by

pucalerrone

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

34 views291 pages

MIT18 05S14 Readings

Uploaded by

pucalerrone

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 291

Introduction

Class 1, 18.05
Jeremy Orlo↵ and Jonathan Bloom

1 Probability vs. Statistics

In this introduction we will preview what we will be studying in 18.05. Don’t worry if many
of the terms are unfamiliar, they will be explained as the course proceeds.
Probability and statistics are deeply connected because all statistical statements are at bot-
tom statements about probability. Despite this the two sometimes feel like very di↵erent
subjects. Probability is logically self-contained; there are a few rules and answers all follow
logically from the rules, though computations can be tricky. In statistics we apply proba-
bility to draw conclusions from data. This can be messy and usually involves as much art
as science.
Probability example
You have a fair coin (equal probability of heads or tails). You will toss it 100 times. What
is the probability of 60 or more heads? There is only one answer (about 0.028444) and we
will learn how to compute it.
Statistics example
You have a coin of unknown provenance. To investigate whether it is fair you toss it 100
times and count the number of heads. Let’s say you count 60 heads. Your job as a statis-
tician is to draw a conclusion (inference) from this data. There are many ways to proceed,
both in terms of the form the conclusion takes and the probability computations used to
justify the conclusion. In fact, di↵erent statisticians might draw di↵erent conclusions.
Note that in the first example the random process is fully known (probability of heads =
.5). The objective is to find the probability of a certain outcome (at least 60 heads) arising
from the random process. In the second example, the outcome is known (60 heads) and the
objective is to illuminate the unknown random process (the probability of heads).

2 Frequentist vs. Bayesian Interpretations

There are two prominent and sometimes conflicting schools of statistics: Bayesian and
frequentist. Their approaches are rooted in di↵ering interpretations of the meaning of
probability.
Frequentists say that probability measures the frequency of various outcomes of an ex-
periment. For example, saying a fair coin has a 50% probability of heads means that if we
toss it many times then we expect about half the tosses to land heads.
Bayesians say that probability is an abstract concept that measures a state of knowledge
or a degree of belief in a given proposition. In practice Bayesians do not assign a single
value for the probability of a coin coming up heads. Rather they consider a range of values
each with its own probability of being true.
In 18.05 we will study and compare these approaches. The frequentist approach has long

1
18.05 class 1, Introduction, Spring 2014 2

been dominant in fields like biology, medicine, public health and social sciences. The
Bayesian approach has enjoyed a resurgence in the era of powerful computers and big
data. It is especially useful when incorporating new data into an existing statistical model,
for example, when training a speech or face recognition system. Today, statisticians are
creating powerful tools by using both approaches in complementary ways.

3 Applications, Toy Models, and Simulation

Probability and statistics are used widely in the physical sciences, engineering, medicine, the
social sciences, the life sciences, economics and computer science. The list of applications is
essentially endless: tests of one medical treatment against another (or a placebo), measures
of genetic linkage, the search for elementary particles, machine learning for vision or speech,
gambling probabilities and strategies, climate modeling, economic forecasting, epidemiology,
marketing, googling... We will draw on examples from many of these fields during this
course.
Given so many exciting applications, you may wonder why we will spend so much time
thinking about toy models like coins and dice. By understanding these thoroughly we will
develop a good feel for the simple essence inside many complex real-world problems. In
fact, the modest coin is a realistic model for any situations with two possible outcomes:
success or failure of a treatment, an airplane engine, a bet, or even a class.
Sometimes a problem is so complicated that the best way to understand it is through
computer simulation. Here we use software to run virtual experiments many times in order
to estimate probabilities. In this class we will use R for simulation as well as computation
and visualization. Don’t worry if you’re new to R; we will teach you all you need to know.
Counting and Sets
Class 1, 18.05
Jeremy Orlo↵ and Jonathan Bloom

1 Learning Goals

1. Know the definitions and notation for sets, intersection, union, complement.

2. Be able to visualize set operations using Venn diagrams.

3. Understand how counting is used computing probabilities.

4. Be able to use the rule of product, inclusion-exclusion principle, permutations and com-
binations to count the elements in a set.

2 Counting

2.1 Motivating questions

Example 1. A coin is fair if it comes up heads or tails with equal probability.

You flip a fair coin three times. What is the probability that exactly one of the flips results
in a head?
answer: With three flips, we can easily list the eight possible outcomes:

{T T T, T T H, T HT, T HH, HT T, HT H, HHT, HHH}

Three of these outcomes have exactly one head:

{T T H, T HT, HT T }

Since all outcomes are equally probable, we have

number of outcomes with 1 head 3
P (1 head in 3 flips) = = .
total number of outcomes 8

Think: Would listing the outcomes be practical with 10 flips?

A deck of 52 cards has 13 ranks (2, 3, . . . , 9, 10, J, Q, K, A) and 4 suits (~, , }, |,). A
poker hand consists of 5 cards. A one-pair hand consists of two cards having one rank and
three cards having three other ranks, e.g., {2~, 2, 5~, 8|, K}}

Test your intuition: the probability of a one-pair hand is:

(a) less than 5%
(b) between 5% and 10%
(c) between 10% and 20%

1
18.05 class 1, Counting and Sets, Spring 2014 2

(d) between 20% and 40%

(e) greater than 40%

At this point we can only guess the probability. One of our goals is to learn how to compute
it exactly. To start, we note that since every set of five cards is equally probable, we can
compute the probability of a one-pair hand as
number of one-pair hands
P (one-pair) =
total number of hands
So, to find the exact probability, we need to count the number of elements in each of these
sets. And we have to be clever about it, because there are too many elements to simply
list them all. We will come back to this problem after we have learned some counting
techniques.
Several times already we have noted that all the possible outcomes were equally probable
and used this to find a probability by counting. Let’s state this carefully in the following
principle.
Principle: Suppose there are n possible outcomes for an experiment and each is equally
probable. If there are k desirable outcomes then the probability of a desirable outcome is
k/n. Of course we could replace the word desirable by any other descriptor: undesirable,
funny, interesting, remunerative, . . .
Concept question: Can you think of a scenario where the possible outcomes are not
equally probable?
Here’s one scenario: on an exam you can get any score from 0 to 100. That’s 101 di↵erent
possible outcomes. Is the probability you get less than 50 equal to 50/101?

2.2 Sets and notation

Our goal is to learn techniques for counting the number of elements of a set, so we start
with a brief review of sets. (If this is new to you, please come to office hours).

2.2.1 Definitions

A set S is a collection of elements. We use the following notation.

Element: We write x 2 S to mean the element x is in the set S.
Subset: We say the set A is a subset of S if all of its elements are in S. We write this as
A ⇢ S.
Complement:: The complement of A in S is the set of elements of S that are not in A.
We write this as Ac or S A.
Union: The union of A and B is the set of all elements in A or B (or both). We write this
as A [ B.
Intersection: The intersection of A and B is the set of all elements in both A and B. We
write this as A \ B.
Empty set: The empty set is the set with no elements. We denote it ;.
18.05 class 1, Counting and Sets, Spring 2014 3

Disjoint: A and B are disjoint if they have no common elements. That is, if A \ B = ;.
Di↵erence: The di↵erence of A and B is the set of elements in A that are not in B. We
write this as A B.

Let’s illustrate these operations with a simple example.

Example 2. Start with a set of 10 animals
S = {Antelope, Bee, Cat, Dog, Elephant, Frog, Gnat, Hyena, Iguana, Jaguar}.
Consider two subsets:
M = the animal is a mammal = {Antelope, Cat, Dog, Elephant, Hyena, Jaguar}
W = the animal lives in the wild = {Antelope, Bee, Elephant, Frog, Gnat, Hyena, Iguana, Jaguar}.
Our goal here is to look at di↵erent set operations.
Intersection: M \ W contains all wild mammals: M \ W = {Antelope, Elephant, Hyena, Jaguar}.
Union: M [ W contains all animals that are mammals or wild (or both).
M [ W = {Antelope, Bee, Cat, Dog, Elephant, Frog, Gnat, Hyena, Iguana, Jaguar}.
Complement: M c means everything that is not in M , i.e. not a mammal. M c =
{Bee, Frog, Gnat, Iguana}.
Di↵erence: M W means everything that’s in M and not in W .
So, M W = {Cat, Dog}.
There are often many ways to get the same set, e.g. Mc = S M, M W = M \ Lc .

The relationship between union, intersection, and complement is given by DeMorgan’s laws:
(A [ B)c = Ac \ B c
(A \ B)c = Ac [ B c
In words the first law says everything not in (A or B) is the same set as everything that’s
(not in A) and (not in B). The second law is similar.

2.2.2 Venn Diagrams

Venn diagrams o↵er an easy way to visualize set operations.

In all the figures S is the region inside the large rectangle, L is the region inside the left
circle and R is the region inside the right circle. The shaded region shows the set indicated
underneath each figure.

S L R

L[R L\R Lc L R
18.05 class 1, Counting and Sets, Spring 2014 4

Proof of DeMorgan’s Laws

(L [ R)c Lc Rc Lc \ R c

(L \ R)c Lc Rc Lc [ R c

Example 3. Verify DeMorgan’s laws for the subsets A = {1, 2, 3} and B = {3, 4} of the
set S = {1, 2, 3, 4, 5}.
answer: For each law we just work through both sides of the equation and show they are
the same.
1. (A [ B)c = Ac \ B c :
Right hand side: A [ B = {1, 2, 3, 4} ) (A [ B)c = {5}.
Left hand side: Ac = {4, 5}, B c = {1, 2, 5} ) Ac \ B c = {5}.
The two sides are equal. QED
2. (A \ B)c = Ac [ B c :
Right hand side: A \ B = {3} ) (A \ B)c = {1, 2, 4, 5}.
Left hand side: Ac = {4, 5}, B c = {1, 2, 5} ) Ac [ B c = {1, 2, 4, 5}.
The two sides are equal. QED
Think: Draw and label a Venn diagram with A the set of Brain and Cognitive Science
majors and B the set of sophomores. Shade the region illustrating the first law. Can you
express the first law in this case as a non-technical English sentence?

2.2.3 Products of sets

The product of sets S and T is the set of ordered pairs:

S ⇥ T = {(s, t) | s 2 S, t 2 T }.

In words the right-hand side reads “the set of ordered pairs (s, t) such that s is in S and t
is in T .
The following diagrams show two examples of the set product.
18.05 class 1, Counting and Sets, Spring 2014 5

4
⇥ 1 2 3 4 3
1 (1,1) (1,2) (1,3) (1,4)
2 (2,1) (2,2) (2,3) (2,4)
3 (3,1) (3,2) (3,3) (3,4)
1
{1, 2, 3} ⇥ {1, 2, 3, 4}

1 4 5
[1, 4] ⇥ [1, 3] ⇢ [0, 5] ⇥ [0, 4]

The right-hand figure also illustrates that if A ⇢ S and B ⇢ T then A ⇥ B ⇢ S ⇥ T .

2.3 Counting

If S is finite, we use |S| or #S to denote the number of elements of S.

Two useful counting principles are the inclusion-exclusion principle and the rule of product.

2.3.1 Inclusion-exclusion principle

The inclusion-exclusion principle says

|A [ B| = |A| + |B| |A \ B|.

We can illustrate this with a Venn diagram. S is all the dots, A is the dots in the blue
circle, and B is the dots in the red circle.

B A\B A

|A| is the number of dots in A and likewise for the other sets. The figure shows that |A|+|B|
double-counts |A \ B|, which is why |A \ B| is subtracted o↵ in the inclusion-exclusion
formula.

Example 4. In a band of singers and guitarists, seven people sing, four play the guitar,
and two do both. How big is the band?
18.05 class 1, Counting and Sets, Spring 2014 6

answer: Let S be the set singers and G be the set guitar players. The inclusion-exclusion
principle says
size of band = |S [ G| = |S| + |G| |S \ G| = 7 + 4 2 = 9.

2.3.2 Rule of Product

The Rule of Product says:

If there are n ways to perform action 1 and then by m ways to perform action
2, then there are n · m ways to perform action 1 followed by action 2.

We will also call this the multiplication rule.

Example 5. If you have 3 shirts and 4 pants then you can make 3 · 4 = 12 outfits.
Think: An extremely important point is that the rule of product holds even if the ways to
perform action 2 depend on action 1, as long as the number of ways to perform action 2 is
independent of action 1. To illustrate this:
Example 6. There are 5 competitors in the 100m final at the Olympics. In how many
ways can the gold, silver, and bronze medals be awarded?

answer: There are 5 ways to award the gold. Once that is awarded there are 4 ways to
award the silver and then 3 ways to award the bronze: answer 5 · 4 · 3 = 60 ways.
Note that the choice of gold medalist a↵ects who can win the silver, but the number of
possible silver medalists is always four.

2.4 Permutations and combinations

2.4.1 Permutations

A permutation of a set is a particular ordering of its elements. For example, the set {a, b, c}
has six permutations: abc, acb, bac, bca, cab, cba. We found the number of permutations by
listing them all. We could also have found the number of permutations by using the rule
of product. That is, there are 3 ways to pick the first element, then 2 ways for the second,
and 1 for the first. This gives a total of 3 · 2 · 1 = 6 permutations.
In general, the rule of product tells us that the number of permutations of a set of k elements
is
k! = k · (k 1) · · · 3 · 2 · 1.

We also talk about the permutations of k things out of a set of n things. We show what
this means with an example.
Example 7. List all the permutations of 3 elements out of the set {a, b, c, d}. answer:
This is a longer list,
abc acb bac bca cab cba
abd adb bad bda dab dba
acd adc cad cda dac dca
bcd bdc cbd cdb dbc dcb
18.05 class 1, Counting and Sets, Spring 2014 7

Note that abc and acb count as distinct permutations. That is, for permutations the order
matters.
There are 24 permutations. Note that the rule or product would have told us there are
4 · 3 · 2 = 24 permutations without bothering to list them all.

2.4.2 Combinations

In contrast to permutations, in combinations order does not matter: permutations are lists
and combinations are sets. We show what we mean with an example
Example 8. List all the combinations of 3 elements out of the set {a, b, c, d}.
answer: Such a combination is a collection of 3 elements without regard to order. So, abc
and cab both represent the same combination. We can list all the combinations by listing
all the subsets of exactly 3 elements.
{a, b, c} {a, b, d} {a, c, d} {b, c, d}
There are only 4 combinations. Contrast this with the 24 permutations in the previous
example. The factor of 6 comes because every combination of 3 things can be written in 6
di↵erent orders.

2.4.3 Formulas

We’ll use the following notations.

n Pk = number of permutations (lists) of k distinct elements from a set of size n
n
n Ck = k = number of combinations (subsets) of k elements from a set of size n
We emphasise that by the number of combinations of k elements we mean the number of
subsets of size k.
These have the following notation and formulas:
n!
Permutations: n Pk = = n(n 1) · · · (n k + 1)
(n k)!
n! n Pk
Combinations: n Ck = =
k!(n k)! k!
The notation n Ck is read “n choose k”. The formula for n Pk follows from the rule of
product. It also implies the formula for n Ck because a subset of size k can be ordered in k!
ways.
We can illustrate the relation between permutations and combinations by lining up the
results of the previous two examples.
abc acb bac bca cab cba {a, b, c}
abd adb bad bda dab dba {a, b, d}
acd adc cad cda dac dca {a, c, d}
bcd bdc cbd cdb dbc dcb {b, c, d}
Permutations: 4 P3 Combinations: 4 C3
Notice that each row in the permutations list consists of all 3! permutations of the corre-
sponding set in the combinations list.
18.05 class 1, Counting and Sets, Spring 2014 8

2.4.4 Examples

Example 9. Count the following:

(i) The number of ways to choose 2 out of 4 things (order does not matter).
(ii) The number of ways to list 2 out of 4 things.
(iii) The number of ways to choose 3 out of 10 things.
answer: (i) This is asking for combinations: 42 = 2!4!2! = 6.
4!
(ii) This is asking for permuations: 4 P2 = 2! = 12.
10
(iii) This is asking for combinations: 3 = 3!10!7! = 10·9·8
3·2·1 = 120.

Example 10. (i) Count the number of ways to get 3 heads in a sequence of 10 flips of a
coin.
(ii) If the coin is fair, what is the probability of exactly 3 heads in 10 flips.
answer: (i) This asks for the number sequences of 10 flips (heads or tails) with exactly 3
heads. That is, we have to choose exactly 3 out of 10 flips to be heads. This is the same
question as in the previous example.
✓ ◆
10 10! 10 · 9 · 8
= = = 120.
3 3! 7! 3·2·1

(ii) Each flip has 2 possible outcomes (heads or tails). So the rule of product says there are
210 = 1024 sequences of 10 flips. Since the coin is fair each sequence is equally probable.
So the probability of 3 heads is
120
= .117 .
1024
Probability: Terminology and Examples
Class 2, 18.05
Jeremy Orlo↵ and Jonathan Bloom

1 Learning Goals

1. Know the definitions of sample space, event and probability function.

2. Be able to organize a scenario with randomness into an experiment and sample space.

3. Be able to make basic computations using a probability function.

2 Terminology

2.1 Probability cast list

• Experiment: a repeatable procedure with well-defined possible outcomes.

• Sample space: the set of all possible outcomes. We usually denote the sample space by
⌦, sometimes by S.

• Event: a subset of the sample space.

• Probability function: a function giving the probability for each outcome.

Later in the course we will learn about

• Probability density: a continuous distribution of probabilities.

• Random variable: a random numerical outcome.

2.2 Simple examples

Example 1. Toss a fair coin.

Experiment: toss the coin, report if it lands heads or tails.
Sample space: ⌦ = {H, T }.
Probability function: P (H) = .5, P (T ) = .5.

Example 2. Toss a fair coin 3 times.

Experiment: toss the coin 3 times, list the results.
Sample space: ⌦ = {HHH, HHT, HT H, HT T, T HH, T HT, T T H, T T T }.
Probability function: Each outcome is equally likely with probability 1/8.
For small sample spaces we can put the set of outcomes and probabilities into a probability
table.
Outcomes HHH HHT HTH HTT THH THT TTH TTT
Probability 1/8 1/8 1/8 1/8 1/8 1/8 1/8 1/8

1
18.05 class 2, Probability: Terminology and Examples, Spring 2014 2

Example 3. Measure the mass of a proton

Experiment: follow some defined procedure to measure the mass and report the result.
Sample space: ⌦ = [0, 1), i.e. in principle we can get any positive value.
Probability function: since there is a continuum of possible outcomes there is no probability
function. Instead we need to use a probability density, which we will learn about later in
the course.

Example 4. Taxis (An infinite discrete sample space)

Experiment: count the number of taxis that pass 77 Mass. Ave during an 18.05 class.
Sample space: ⌦ = {0, 1, 2, 3, 4, . . . }.
This is often modeled with the following probability function known as the Poisson distri-
bution. (Do not worry about mastering the Poisson distribution just yet):
k
P (k) = e ,
k!
where is the average number of taxis. We can put this in a table:
Outcome 0 1 2 3 ... k ...

Probability e e e 2 /2 e 3 /3! ... e k /k! ...

1
X k
Question: Accepting that this is a valid probability function, what is ? e
k!
k=0
answer: This is the total probability of all possible outcomes, so the sum equals 1. (Note,
X1 n
this also follows from the Taylor series e = .)
n!
n=0

In a given setup there can be more than one reasonable choice of sample space. Here is a
simple example.
Example 5. Two dice (Choice of sample space)
Suppose you roll one die. Then the sample space and probability function are
Outcome 1 2 3 4 5 6
Probability: 1/6 1/6 1/6 1/6 1/6 1/6
Now suppose you roll two dice. What should be the sample space? Here are two options.
1. Record the pair of numbers showing on the dice (first die, second die).
2. Record the sum of the numbers on the dice. In this case there are 11 outcomes
{2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}. These outcomes are not all equally likely.
As above, we can put this information in tables. For the first case, the sample space is the
product of the sample spaces for each die

{(1, 1), (2, 1), (3, 1), . . . (6, 6)}

Each of the 36 outcomes is equally likely. (Why 36 outcomes?) For the probability function
we will make a two dimensional table with the rows corresponding to the number on the
first die, the columns the number on the second die and the entries the probability.
18.05 class 2, Probability: Terminology and Examples, Spring 2014 3

Die 2
1 2 3 4 5 6
1 1/36 1/36 1/36 1/36 1/36 1/36
2 1/36 1/36 1/36 1/36 1/36 1/36
Die 1 3 1/36 1/36 1/36 1/36 1/36 1/36
4 1/36 1/36 1/36 1/36 1/36 1/36
5 1/36 1/36 1/36 1/36 1/36 1/36
6 1/36 1/36 1/36 1/36 1/36 1/36
Two dice in a two dimensional table

In the second case we can present outcomes and probabilities in our usual table.

outcome 2 3 4 5 6 7 8 9 10 11 12
probability 1/36 2/36 3/36 4/36 5/36 6/36 5/36 4/36 3/36 2/36 1/36

The sum of two dice

Think: What is the relationship between the two probability tables above?
We will see that the best choice of sample space depends on the context. For now, simply
note that given the outcome as a pair of numbers it is easy to find the sum.
Note. Listing the experiment, sample space and probability function is a good way to start
working systematically with probability. It can help you avoid some of the common pitfalls
in the subject.

Events.
An event is a collection of outcomes, i.e. an event is a subset of the sample space ⌦. This
sounds odd, but it actually corresponds to the common meaning of the word.
Example 6. Using the setup in Example 2 we would describe the event that you get
exactly two heads in words by E = ‘exactly 2 heads’. Written as a subset this becomes

E = {HHT, HT H, T HH}.

You should get comfortable moving between describing events in words and as subsets of
the sample space.
The probability of an event E is computed by adding up the probabilities of all of the
outcomes in E. In this example each outcome has probability 1/8, so we have P (E) = 3/8.

2.3 Definition of a discrete sample space

Definition. A discrete sample space is one that is listable, it can be either finite or infinite.
Examples. {H, T}, {1, 2, 3}, {1, 2, 3, 4, . . . }, {2, 3, 5, 7, 11, 13, 17, . . . } are all
discrete sets. The first two are finite and the last two are infinite.
Example. The interval 0  x  1 is not discrete, rather it is continuous. We will deal
with continuous sample spaces in a few days.
18.05 class 2, Probability: Terminology and Examples, Spring 2014 4

2.4 The probability function

So far we’ve been using a casual definition of the probability function. Let’s give a more
precise one.
Careful definition of the probability function.
For a discrete sample space S a probability function P assigns to each outcome ! a number
P (!) called the probability of !. P must satisfy two rules:

• Rule 1. 0  P (!)  1 (probabilities are between 0 and 1).

• Rule 2. The sum of the probabilities of all possible outcomes is 1 (something must
occur)

In symbols Rule 2 says: if S = {!1 , !2 , . . . , !n } then P (!1 ) + P (!2 ) + . . . + P (!n ) = 1. Or,

Xn
using summation notation: P (!j ) = 1.
j=1
The probability of an event E is the sum of the probabilities of all the outcomes in E. That
is, X
P (E) = P (!).
!2E

Think: Check Rules 1 and 2 on Examples 1 and 2 above.

Example 7. Flip until heads (A classic example)

Suppose we have a coin with probability p of heads and we have the following scenario.
Experiment: Toss the coin until the first heads. Report the number of tosses.
Sample space: ⌦ = {1, 2, 3, . . . }.
Probability function: P (n) = (1 p)n 1 p.
Challenge 1: show the sum of all the probabilities equals 1 (hint: geometric series).
Challenge 2: justify the formula for P (n) (we will do this soon).
Stopping problems. The previous toy example is an uncluttered version of a general
class of problems called stopping rule problems. A stopping rule is a rule that tells you
when to end a certain process. In the toy example above the process was flipping a coin
and we stopped after the first heads. A more practical example is a rule for ending a series
of medical treatments. Such a rule could depend on how well the treatments are working,
how the patient is tolerating them and the probability that the treatments would continue
to be e↵ective. One could ask about the probability of stopping within a certain number of
treatments or the average number of treatments you should expect before stopping.

3 Some rules of probability

For events A, L and R contained in a sample space ⌦.

Rule 1. P (Ac ) = 1 P (A).
Rule 2. If L and R are disjoint then P (L [ R) = P (L) + P (R).
18.05 class 2, Probability: Terminology and Examples, Spring 2014 5

Rule 3. If L and R are not disjoint, we have the inclusion-exclusion principle:

P (L [ R) = P (L) + P (R) P (L \ R)

We visualize these rules using Venn diagrams.

Ac
L R L R
A

⌦ = A [ Ac , no overlap L [ R, no overlap L [ R, overlap = L \ R

We can also justify them logically.

Rule 1: A and Ac split ⌦ into two non-overlapping regions. Since the total probability
P (⌦) = 1 this rule says that the probabiity of A and the probability of ’not A’ are comple-
mentary, i.e. sum to 1.
Rule 2: L and R split L [ R into two non-overlapping regions. So the probability of L [ R
is is split between P (L) and P (R)
Rule 3: In the sum P (L) + P (R) the overlap P (L \ R) gets counted twice. So P (L) +
P (R) P (L \ R) counts everything in the union exactly once.
Think: Rule 2 is a special case of Rule 3.
For the following examples suppose we have an experiment that produces a random integer
between 1 and 20. The probabilities are not necessarily uniform, i.e., not necessarily the
same for each outcome.
Example 8. If the probability of an even number is .6 what is the probability of an odd
number?
answer: Since being odd is complementary to being even, the probability of being odd is
1-.6 = .4.
Let’s redo this example a bit more formally, so you see how it’s done. First, so we can refer
to it, let’s name the random integer X. Let’s also name the event ‘X is even’ as A. Then
the event ‘X is odd’ is Ac . We are given that P (A) = .6. Therefore P (Ac ) = 1 .6 = .4 .
Example 9. Consider the 2 events, A: ‘X is a multiple of 2’; B: ‘X is odd and less than
10’. Suppose P (A) = .6 and P (B) = .25.
(i) What is A \ B?
(ii) What is the probability of A [ B?
answer: (i) Since all numbers in A are even and all numbers in B are odd, these events are
disjoint. That is, A \ B = ;.
(ii) Since A and B are disjoint P (A [ B) = P (A) + P (B) = .85.

Example 10. Let A, B and C be the events X is a multiple of 2, 3 and 6 respectively. If

P (A) = .6, P (B) = .3 and P (C) = .2 what is P (A or B)?
answer: Note two things. First we used the word ‘or’ which means union: ‘A or B’ =
A [ B. Second, an integer is divisible by 6 if and only if it is divisible by both 2 and 3.
18.05 class 2, Probability: Terminology and Examples, Spring 2014 6

This translates into C = A \ B. So the inclusion-exclusion principle says

P (A [ B) = P (A) + P (B) P (A \ B) = .6 + .3 .2 = .7 .
Conditional Probability, Independence and Bayes’ Theorem
Class 3, 18.05
Jeremy Orlo↵ and Jonathan Bloom

1 Learning Goals
1. Know the definitions of conditional probability and independence of events.

2. Be able to compute conditional probability directly from the definition.

3. Be able to use the multiplication rule to compute the total probability of an event.

4. Be able to check if two events are independent.

5. Be able to use Bayes’ formula to ‘invert’ conditional probabilities.

6. Be able to organize the computation of conditional probabilities using trees and tables.

7. Understand the base rate fallacy thoroughly.

2 Conditional Probability

Conditional probability answers the question ‘how does the probability of an event change
if we have extra information’. We’ll illustrate with an example.
Example 1. Toss a fair coin 3 times.
(a) What is the probability of 3 heads?
answer: Sample space ⌦ = {HHH, HHT, HT H, HT T, T HH, T HT, T T H, T T T }.
All outcomes are equally likely, so P (3 heads) = 1/8.
(b) Suppose we are told that the first toss was heads. Given this information how should
we compute the probability of 3 heads?
answer: We have a new (reduced) sample space: ⌦0 = {HHH, HHT, HT H, HT T }.
All outcomes are equally likely, so
P (3 heads given that the first toss is heads) = 1/4.
This is called conditional probability, since it takes into account additional conditions. To
develop the notation, we rephrase (b) in terms of events.
Rephrased (b) Let A be the event ‘all three tosses are heads’ = {HHH}.
Let B be the event ‘the first toss is heads’ = {HHH, HHT, HT H, HT T }.
The conditional probability of A knowing that B occurred is written

P (A|B)

This is read as
‘the conditional probability of A given B’
or
‘the probability of A conditioned on B’

1
18.05 class 3, Conditional Probability, Independence and Bayes’ Theorem, Spring 2014 2

or simply
‘the probability of A given B’.

We can visualize conditional probability as follows. Think of P (A) as the proportion of the
area of the whole sample space taken up by A. For P (A|B) we restrict our attention to B.
That is, P (A|B) is the proportion of area of B taken up by A, i.e. P (A \ B)/P (B).

B
A
A\B

Conditional probability: Abstract visualization and coin example

Note, A ⇢ B in the right-hand figure, so there are only two colors shown.
The formal definition of conditional probability catches the gist of the above example and
visualization.
Formal definition of conditional probability
Let A and B be events. We define the conditional probability of A given B as
P (A \ B)
P (A|B) = 6 0.
, provided P (B) = (1)
P (B)

Let’s redo the coin tossing example using the definition in Equation (1). Recall A = ‘3 heads’
and B = ‘first toss is heads’. We have P (A) = 1/8 and P (B) = 1/2. Since A \ B = A, we
also have P (A \ B) = 1/8. Now according to (1), P (A|B) = 1/8 1/2 = 1/4, which agrees with
our answer in Example 1b.

3 Multiplication Rule

The following formula is called the multiplication rule.

P (A \ B) = P (A|B) · P (B). (2)
This is simply a rewriting of the definition in Equation (1) of conditional probability. We
will see that our use of the multiplication rule is very similar to our use of the rule of
product in counting. In fact, the multiplication rule is just a souped up version of the rule
of product.
We start with a simple example where we can check all the probabilities directly by counting.
Example 2. Draw two cards from a deck. Define the events: S1 = ‘first card is a spade’
˙
and S2 = ‘second card is a spade’. What is the P (S2 |S1 )?
answer: We can do this directly by counting: if the first card is a spade then of the 51
cards remaining, 12 are spades.
P (S2 |S1 ) = 12/51.
18.05 class 3, Conditional Probability, Independence and Bayes’ Theorem, Spring 2014 3

Now, let’s recompute this using formula (1). We have to compute P (S1 ), P (S2 ) and
P (S1 \ S2 ): We know that P (S1 ) = 1/4 because there are 52 equally likely ways to draw
the first card and 13 of them are spades. The same logic says that there are 52 equally
likely ways the second card can be drawn, so P (S2 ) = 1/4.
Aside: The probability P (S2 ) = 1/4 may seem surprising since the value of first card
certainly a↵ects the probabilities for the second card. However, if we look at all possible
two card sequences we will see that every card in the deck has equal probability of being
the second card. Since 13 of the 52 cards are spades we get P (S2 ) = 13/52 = 1/4. Another
way to say this is: if we are not given value of the first card then we have to consider all
possibilities for the second card.
Continuing, we see that
13 · 12
P (S1 \ S2 ) = = 3/51.
52 · 51
This was found by counting the number of ways to draw a spade followed by a second spade
and dividing by the number of ways to draw any card followed by any other card). Now,
using (1) we get
P (S2 \ S1 ) 3/51
P (S2 |S1 ) = = = 12/51.
P (S1 ) 1/4
Finally, we verify the multiplication rule by computing both sides of (2).
13 · 12 3 12 1 3
P (S1 \ S2 ) = = and P (S2 |S1 ) · P (S1 ) = · = . QED
52 · 51 51 51 4 51

Think: For S1 and S2 in the previous example, what is P (S2 |S1c )?

4 Law of Total Probability

The law of total probability will allow us to use the multiplication rule to find probabilities
in more interesting examples. It involves a lot of notation, but the idea is fairly simple. We
state the law when the sample space is divided into 3 pieces. It is a simple matter to extend
the rule when there are more than 3 pieces.
Law of Total Probability
Suppose the sample space ⌦ is divided into 3 disjoint events B1 , B2 , B3 (see the figure
below). Then for any event A:

P (A) = P (A \ B1 ) + P (A \ B2 ) + P (A \ B3 )
P (A) = P (A|B1 ) P (B1 ) + P (A|B2 ) P (B2 ) + P (A|B3 ) P (B3 ) (3)

The top equation says ‘if A is divided into 3 pieces then P (A) is the sum of the probabilities
of the pieces’. The bottom equation (3) is called the law of total probability. It is just a
rewriting of the top equation using the multiplication rule.
18.05 class 3, Conditional Probability, Independence and Bayes’ Theorem, Spring 2014 4

⌦
B1
A \ B1

A \ B2 A \ B3

B2 B3
The sample space ⌦ and the event A are each divided into 3 disjoint pieces.
The law holds if we divide ⌦ into any number of events, so long as they are disjoint and
cover all of ⌦. Such a division is often called a partition of ⌦.
Our first example will be one where we already know the answer and can verify the law.
Example 3. An urn contains 5 red balls and 2 green balls. Two balls are drawn one after
the other. What is the probability that the second ball is red?
answer: The sample space is ⌦ = {rr, rg, gr, gg}.
Let R1 be the event ‘the first ball is red’, G1 = ‘first ball is green’, R2 = ‘second ball is
red’, G2 = ‘second ball is green’. We are asked to find P (R2 ).
The fast way to compute this is just like P (S2 ) in the card example above. Every ball is
equally likely to be the second ball. Since 5 out of 7 balls are red, P (R2 ) = 5/7.
Let’s compute this same value using the law of total probability (3). First, we’ll find the
conditional probabilities. This is a simple counting exercise.

P (R2 |R1 ) = 4/6, P (R2 |G1 ) = 5/6.

Since R1 and G1 partition ⌦ the law of total probability says

P (R2 ) = P (R2 |R1 )P (R1 ) + P (R2 |G1 )P (G1 ) (4)

4 5 5 2
= · + ·
6 7 6 7
30 5
= = .
42 7

Probability urns
The example above used probability urns. Their use goes back to the beginning of the
subject and we would be remiss not to introduce them. This toy model is very useful. We
quote from Wikipedia: http://en.wikipedia.org/wiki/Urn_problem

In probability and statistics, an urn problem is an idealized mental exercise

in which some objects of real interest (such as atoms, people, cars, etc.) are
represented as colored balls in an urn or other container. One pretends to draw
(remove) one or more balls from the urn; the goal is to determine the probability
of drawing one color or another, or some other properties. A key parameter is
whether each ball is returned to the urn after each draw.
18.05 class 3, Conditional Probability, Independence and Bayes’ Theorem, Spring 2014 5

It doesn’t take much to make an example where (3) is really the best way to compute the
probability. Here is a game with slightly more complicated rules.
Example 4. An urn contains 5 red balls and 2 green balls. A ball is drawn. If it’s green
a red ball is added to the urn and if it’s red a green ball is added to the urn. (The original
ball is not returned to the urn.) Then a second ball is drawn. What is the probability the
second ball is red?
answer: The law of total probability says that P (R2 ) can be computed using the expression
in Equation (4). Only the values for the probabilities will change. We have

P (R2 |R1 ) = 4/7, P (R2 |G1 ) = 6/7.

Therefore,
4 5 6 2 32
P (R2 ) = P (R2 |R1 )P (R1 ) + P (R2 |G1 )P (G1 ) = · + · = .
7 7 7 7 49

5 Using Trees to Organize the Computation

Trees are a great way to organize computations with conditional probability and the law of
total probability. The figures and examples will make clear what we mean by a tree. As
with the rule of product, the key is to organize the underlying process into a sequence of
actions.
We start by redoing Example 4. The sequence of actions are: first draw ball 1 (and add the
appropriate ball to the urn) and then draw ball 2.

5/7 2/7
R1 G1
4/7 3/7 6/7 1/7

R2 G2 R2 G2

You interpret this tree as follows. Each dot is called a node. The tree is organized by levels.
The top node (root node) is at level 0. The next layer down is level 1 and so on. Each level
shows the outcomes at one stage of the game. Level 1 shows the possible outcomes of the
first draw. Level 2 shows the possible outcomes of the second draw starting from each node
in level 1.
Probabilities are written along the branches. The probability of R1 (red on the first draw)
is 5/7. It is written along the branch from the root node to the one labeled R1 . At the
next level we put in conditional probabilities. The probability along the branch from R1 to
R2 is P (R2 |R1 ) = 4/7. It represents the probability of going to node R2 given that you are
already at R1 .
The muliplication rule says that the probability of getting to any node is just the product of
the probabilities along the path to get there. For example, the node labeled R2 at the far left
really represents the event R1 \ R2 because it comes from the R1 node. The multiplication
rule now says
5 4
P (R1 \ R2 ) = P (R1 ) · P (R2 |R1 ) = · ,
7 7
18.05 class 3, Conditional Probability, Independence and Bayes’ Theorem, Spring 2014 6

which is exactly multiplying along the path to the node.

The law of total probability is just the statement that P (R2 ) is the sum of the probabilities
of all paths leading to R2 (the two circled nodes in the figure). In this case,
5 4 2 6 32
P (R2 ) = · + · = ,
7 7 7 7 49
exactly as in the previous example.

5.1 Shorthand vs. precise trees

The tree given above involves some shorthand. For example, the node marked R2 at the
far left really represents the event R1 \ R2 , since it ends the path from the root through
R1 to R2 . Here is the same tree with everything labeled precisely. As you can see this tree
is more cumbersome to make and use. We usually use the shorthand version of trees. You
should make sure you know how to interpret them precisely.

P (R1 ) = 5/7 P (G1 ) = 2/7

R1 G1
P (R2 |R1 ) = 4/7 P (G2 |R1 ) = 3/7 P (R2 |G1 ) = 6/7 P (G2 |G1 ) = 1/7

R 1 \ R2 R1 \ G2 G1 \ R2 G1 \ G2

6 Independence

Two events are independent if knowledge that one occurred does not change the probability
that the other occurred. Informally, events are independent if they do not influence one
another.
Example 5. Toss a coin twice. We expect the outcomes of the two tosses to be independent
of one another. In real experiments this always has to be checked. If my coin lands in honey
and I don’t bother to clean it, then the second toss might be a↵ected by the outcome of the
first toss.
More seriously, the independence of experiments can by undermined by the failure to clean or
recalibrate equipment between experiments or to isolate supposedly independent observers
from each other or a common influence. We’ve all experienced hearing the same ‘fact’ from
di↵erent people. Hearing it from di↵erent sources tends to lend it credence until we learn
that they all heard it from a common source. That is, our sources were not independent.
Translating the verbal description of independence into symbols gives

A is independent of B if P (A|B) = P (A). (5)

That is, knowing that B occurred does not change the probability that A occurred. In
terms of events as subsets, knowing that the realized outcome is in B does not change the
probability that it is in A.
18.05 class 3, Conditional Probability, Independence and Bayes’ Theorem, Spring 2014 7

If A and B are independent in the above sense, then the multiplication rule gives P (A \
B) = P (A|B) · P (B) = P (A) · P (B). This justifies the following technical definition of
independence.
Formal definition of independence: Two events A and B are independent if

P (A \ B) = P (A) · P (B) (6)

This is a nice symmetric definition which makes clear that A is independent of B if and only
if B is independent of A. Unlike the equation with conditional probabilities, this definition
makes sense even when P (B) = 0. In terms of conditional probabilities, we have:
1. If P (B) 6= 0 then A and B are independent if and only if P (A|B) = P (A).
2. If P (A) 6= 0 then A and B are independent if and only if P (B|A) = P (B).
Independent events commonly arise as di↵erent trials in an experiment, as in the following
example.
Example 6. Toss a fair coin twice. Let H1 = ‘heads on first toss’ and let H2 = ‘heads on
second toss’. Are H1 and H2 independent?
answer: Since H1 \ H2 is the event ‘both tosses are heads’ we have

P (H1 \ H2 ) = 1/4 = P (H1 )P (H2 ).

Therefore the events are independent.

We can ask about the independence of any two events, as in the following two examples.
Example 7. Toss a fair coin 3 times. Let H1 = ‘heads on first toss’ and A = ‘two heads
total’. Are H1 and A independent?
answer: We know that P (A) = 3/8. Since this is not 0 we can check if the formula in
Equation 5 holds. Now, H1 = {HHH, HHT, HTH, HTT} contains exactly two outcomes
(HHT, HT H) from A, so we have P (A|H1 ) = 2/4. Since P (A|H1 ) 6= P (A) these events
are not independent.
Example 8. Draw one card from a standard deck of playing cards. Let’s examine the
independence of 3 events ‘the card is an ace’, ‘the card is a heart’ and ‘the card is red’.
Define the events as A = ‘ace’, H = ‘hearts’, R = ‘red’.
(a) We know that P (A) = 4/52 = 1/13, P (A|H) = 1/13. Since P (A) = P (A|H) we have
that A is independent of H.
(b) P (A|R) = 2/26 = 1/13. So A is independent of R. That is, whether the card is an ace
is independent of whether it’s red.
(c) Finally, what about H and R? Since P (H) = 1/4 and P (H|R) = 1/2, H and R are not
independent. We could also see this the other way around: P (R) = 1/2 and P (R|H) = 1,
so H and R are not independent.

6.1 Paradoxes of Independence

An event A with probability 0 is independent of itself, since in this case both sides of
equation (6) are 0. This appears paradoxical because knowledge that A occurred certainly
18.05 class 3, Conditional Probability, Independence and Bayes’ Theorem, Spring 2014 8

gives information about whether A occurred. We resolve the paradox by noting that since
P (A) = 0 the statement ‘A occurred’ is vacuous.
Think: For what other value(s) of P (A) is A independent of itself?

7 Bayes’ Theorem

Bayes’ theorem is a pillar of both probability and statistics and it is central to the rest of
this course. For two events A and B Bayes’ theorem (also called Bayes’ rule and Bayes’
formula) says
P (A|B) · P (B)
P (B|A) = . (7)
P (A)
Comments: 1. Bayes’ rule tells us how to ‘invert’ conditional probabilities, i.e. to find
P (B|A) from P (A|B).
2. In practice, P (A) is often computed using the law of total probability.
Proof of Bayes’ rule
The key point is that A \ B is symmetric in A and B. So the multiplication rule says

P (B|A) · P (A) = P (A \ B) = P (A|B) · P (B).

Now divide through by P (A) to get Bayes’ rule.

A common mistake is to confuse P (A|B) and P (B|A). They can be very di↵erent. This is
illustrated in the next example.
Example 9. Toss a coin 5 times. Let H1 = ‘first toss is heads’ and let HA = ‘all 5 tosses
are heads’. Then P (H1 |HA ) = 1 but P (HA |H1 ) = 1/16.
For practice, let’s use Bayes’ theorem to compute P (H1 |HA ) using P (HA |A1 ).The terms
are P (HA |H1 ) = 1/16, P (H1 ) = 1/2, P (HA ) = 1/32. So,

P (HA |H1 )P (H1 ) (1/16) · (1/2)

P (H1 |HA ) = = = 1,
P (HA ) 1/32
which agrees with our previous calculation.

7.1 The Base Rate Fallacy

The base rate fallacy is one of many examples showing that it’s easy to confuse the meaning
of P (B|A) and P (A|B) when a situation is described in words. This is one of the key
examples from probability and it will inform much of our practice and interpretation of
statistics. You should strive to understand it thoroughly.
Example 10. The Base Rate Fallacy
Consider a routine screening test for a disease. Suppose the frequency of the disease in the
population (base rate) is 0.5%. The test is highly accurate with a 5% false positive rate
and a 10% false negative rate.
You take the test and it comes back positive. What is the probability that you have the
disease?
18.05 class 3, Conditional Probability, Independence and Bayes’ Theorem, Spring 2014 9

answer: We will do the computation three times: using trees, tables and symbols. We’ll
use the following notation for the relevant events:
D+ = ‘you have the disease’
D = ‘you do not have the disease
T + = ‘you tested positive’
T = ‘you tested negative’.
We are given P (D+) = .005 and therefore P (D ) = .995. The false positive and false
negative rates are (by definition) conditional probabilities.

P (false positive) = P (T + |D ) = .05 and P (false negative) = P (T |D+) = .1.

The complementary probabilities are known as the true negative and true positive rates:

P (T |D ) = 1 P (T + |D ) = .95 and P (T + |D+) = 1 P (T |D+) = .9.

Trees: All of these probabilities can be displayed quite nicely in a tree.

.995 .005
D D+
.05 .95 .9 .1

T+ T T+ T

The question asks for the probability that you have the disease given that you tested positive,
i.e. what is the value of P (D+|T +). We aren’t given this value, but we do know P (T +|D+),
so we can use Bayes’ theorem.
P (T + |D+) · P (D+)
P (D + |T +) = .
P (T +)

The two probabilities in the numerator are given. We compute the denominator P (T +)
using the law of total probability. Using the tree we just have to sum the probabilities for
each of the nodes marked T +

P (T +) = .995 ⇥ .05 + .005 ⇥ .9 = .05425

Thus,
.9 ⇥ .005
P (D + |T +) = = 0.082949 ⇡ 8.3%.
.05425
Remarks: This is called the base rate fallacy because the base rate of the disease in the
population is so low that the vast majority of the people taking the test are healthy, and
even with an accurate test most of the positives will be healthy people. Ask your doctor
for his/her guess at the odds.
To summarize the base rate fallacy with specific numbers

95% of all tests are accurate does not imply 95% of positive tests are accurate

We will refer back to this example frequently. It and similar examples are at the heart of
many statistical misunderstandings.
18.05 class 3, Conditional Probability, Independence and Bayes’ Theorem, Spring 2014 10

Other ways to work Example 10

Tables: Another trick that is useful for computing probabilities is to make a table. Let’s
redo the previous example using a table built with 10000 total people divided according to
the probabilites in this example.
We construct the table as follows. Pick a number, say 10000 people, and place it as the
grand total in the lower right. Using P (D+) = .005 we compute that 50 out of the 10000
people are sick (D+). Likewise 9950 people are healthy (D ). At this point the table looks
like:
D+ D total
T+
T
total 50 9950 10000
Using P (T + |D+) = .9 we can compute that the number of sick people who tested positive
as 90% of 50 or 45. The other entries are similar. At this point the table looks like the
table below on the left. Finally we sum the T + and T rows to get the completed table
on the right.
D+ D total D+ D total
T+ 45 498 T+ 45 498 543
T 5 9452 T 5 9452 9457
total 50 9950 10000 total 50 9950 10000
Using the complete table we can compute

|D + \ T + | 45
P (D + |T +) = = = 8.3%.
|T + | 543

Symbols: For completeness, we show how the solution looks when written out directly in
symbols.

Visualization: The figure below illustrates the base rate fallacy. The large blue area
represents all the healthy people. The much smaller red area represents the sick people.
The shaded rectangle represents the the people who test positive. The shaded area covers
most of the red area and only a small part of the blue area. Even so, the most of the shaded
area is over the blue. That is, most of the positive tests are of healthy people.
18.05 class 3, Conditional Probability, Independence and Bayes’ Theorem, Spring 2014 11

D D+

7.2 Bayes’ rule in 18.05

As we said at the start of this section, Bayes’ rule is a pillar of probability and statistics.
We have seen that Bayes’ rule allows us to ‘invert’ conditional probabilities. When we learn
statistics we will see that the art of statistical inference involves deciding how to proceed
when one (or more) of the terms on the right side of Bayes’ rule is unknown.
Discrete Random Variables
Class 4, 18.05
Jeremy Orlo↵ and Jonathan Bloom

1 Learning Goals

1. Know the definition of a discrete random variable.

2. Know the Bernoulli, binomial, and geometric distributions and examples of what they
model.

3. Be able to describe the probability mass function and cumulative distribution function
using tables and formulas.

4. Be able to construct new random variables from old ones.

5. Know how to compute expected value (mean).

2 Random Variables

This topic is largely about introducing some useful terminology, building on the notions of
sample space and probability function. The key words are

1. Random variable

2. Probability mass function (pmf)

3. Cumulative distribution function (cdf)

2.1 Recap

A discrete sample space ⌦ is a finite or listable set of outcomes {!1 , !2 . . .}. The probability
of an outcome ! is denoted P (!).
X
An event E is a subset of ⌦. The probability of an event E is P (E) = P (!).
!2E

2.2 Random variables as payo↵ functions

Example 1. A game with 2 dice.

Roll a die twice and record the outcomes as (i, j), where i is the result of the first roll and
j the result of the second. We can take the sample space to be

⌦ = {(1, 1), (1, 2), (1, 3), . . . , (6, 6)} = {(i, j) | i, j = 1, . . . 6}.

The probability function is P (i, j) = 1/36.

1
18.05 class 4, Discrete Random Variables, Spring 2014 2

In this game, you win $500 if the sum is 7 and lose $100 otherwise. We give this payo↵
function the name X and describe it formally by
(
500 if i + j = 7
X(i, j) =
100 if i + j 6= 7.

Example 2. We can change the game by using a di↵erent payo↵ function. For example

Y (i, j) = ij 10.

In this example if you roll (6, 2) then you win $2. If you roll (2, 3) then you win -$4 (i.e.,
lose $4).
Question: Which game is the better bet?
answer: We will come back to this once we learn about expectation.

These payo↵ functions are examples of random variables. A random variable assigns a
number to each outcome in a sample space. More formally:
Definition: Let ⌦ be a sample space. A discrete random variable is a function

X: ⌦!R

that takes a discrete set of values. (Recall that R stands for the real numbers.)
Why is X called a random variable? It’s ‘random’ because its value depends on a random
outcome of an experiment. And we treat X like we would a usual variable: we can add it
to other random variables, square it, and so on.

2.3 Events and random variables

For any value a we write X = a to mean the event consisting of all outcomes ! with
X(!) = a.
Example 3. In Example 1 we rolled two dice and X was the random variable
(
500 if i + j = 7
X(i, j) =
100 if i + j 6= 7.

The event X = 500 is the set {(1,6), (2,5), (3,4), (4,3), (5,2), (6,1)}, i.e. the set of all
outcomes that sum to 7. So P (X = 500) = 1/6.
We allow a to be any value, even values that X never takes. In Example 1, we could look
at the event X = 1000. Since X never equals 1000 this is just the empty event (or empty
set)
‘X = 10000 = {} = ; P (X = 1000) = 0.
18.05 class 4, Discrete Random Variables, Spring 2014 3

2.4 Probability mass function and cumulative distribution function

It gets tiring and hard to read and write P (X = a) for the probability that X = a. When
we know we’re talking about X we will simply write p(a). If we want to make X explicit
we will write pX (a). We spell this out in a definition.
Definition: The probability mass function (pmf) of a discrete random variable is the
function p(a) = P (X = a).
Note:
1. We always have 0  p(a)  1.
2. We allow a to be any number. If a is a value that X never takes, then p(a) = 0.

Example 4. Let ⌦ be our earlier sample space for rolling 2 dice. Define the random
variable M to be the maximum value of the two dice:

M (i, j) = max(i, j).

For example, the roll (3,5) has maximum 5, i.e. M (3, 5) = 5.

We can describe a random variable by listing its possible values and the probabilities asso-
ciated to these values. For the above example we have:
value a: 1 2 3 4 5 6
pmf p(a): 1/36 3/36 5/36 7/36 9/36 11/36
For example, p(2) = 3/36.
Question: What is p(8)? answer: p(8) = 0.
Think: What is the pmf for Z(i, j) = i + j? Does it look familiar?

2.5 Events and inequalities

Inequalities with random variables describe events. For example X  a is the set of all
outcomes ! such that X(w)  a.
Example 5. If our sample space is the set of all pairs of (i, j) coming from rolling two dice
and Z(i, j) = i + j is the sum of the dice then

Z  4 = {(1, 1), (1, 2), (1, 3), (2, 1), (2, 2), (3, 1)}

2.6 The cumulative distribution function (cdf )

Definition: The cumulative distribution function (cdf) of a random variable X is the

function F given by F (a) = P (X  a). We will often shorten this to distribution function.
Note well that the definition of F (a) uses the symbol less than or equal. This will be
important for getting your calculations exactly right.

Example. Continuing with the example M , we have

value a: 1 2 3 4 5 6
pmf p(a): 1/36 3/36 5/36 7/36 9/36 11/36
cdf F (a): 1/36 4/36 9/36 16/36 25/36 36/36
18.05 class 4, Discrete Random Variables, Spring 2014 4

F (a) is called the cumulative distribution function because F (a) gives the total probability
that accumulates by adding up the probabilities p(b) as b runs from 1 to a. For example,
in the table above, the entry 16/36 in column 4 for the cdf is the sum of the values of the
pmf from column 1 to column 4. In notation:

As events: ‘M  4’ = {1, 2, 3, 4}; F (4) = P (M  4) = 1/36+3/36+5/36+7/36 = 16/36.

Just like the probability mass function, F (a) is defined for all values a. In the above
example, F (8) = 1, F ( 2) = 0, F (2.5) = 4/36, and F (⇡) = 9/36.

2.7 Graphs of p(a) and F (a)

We can visualize the pmf and cdf with graphs. For example, let X be the number of heads
in 3 tosses of a fair coin:
value a: 0 1 2 3
pmf p(a): 1/8 3/8 3/8 1/8
cdf F (a): 1/8 4/8 7/8 1
The colored graphs show how the cumulative distribution function is built by accumulating
probability as a increases. The black and white graphs are the more standard presentations.

3/8 3/8

1/8 1/8
a a
0 1 2 3 0 1 2 3

Probability mass function for X

1 1
7/8 7/8

4/8 4/8

1/8 1/8
a a
0 1 2 3 0 1 2 3

Cumulative distribution function for X

18.05 class 4, Discrete Random Variables, Spring 2014 5

1
33/36

30/36

26/36

21/36

15/36

10/36

6/36 6/36
5/36
4/36
3/36 3/36
2/36
1/36 1/36
a a
1 2 3 4 5 6 7 8 9 10 11 12 2 3 4 5 6 7 8 9 10 11 12

pmf and cdf for the maximum of two dice (Example 4)

Histograms: Later we will see another way to visualize the pmf using histograms. These
require some care to do right, so we will wait until we need them.

2.8 Properties of the cdf F

The cdf F of a random variable satisfies several properties:

1. F is non-decreasing. That is, its graph never goes down, or symbolically if a  b then
F (a)  F (b).

2. 0  F (a)  1.

3. lim F (a) = 1, lim F (a) = 0.

a!1 a! 1

In words, (1) says the cumulative probability F (a) increases or remains constant as a
increases, but never decreases; (2) says the accumulated probability is always between 0
and 1; (3) says that as a gets very large, it becomes more and more certain that X  a and
as a gets very negative it becomes more and more certain that X > a.
Think: Why does a cdf satisfy each of these properties?

3 Specific Distributions

3.1 Bernoulli Distributions

Model: The Bernoulli distribution models one trial in an experiment that can result in
either success or failure This is the most important distribution is also the simplest. A
random variable X has a Bernoulli distribution with parameter p if:
18.05 class 4, Discrete Random Variables, Spring 2014 6

1. X takes the values 0 and 1.

2. P (X = 1) = p and P (X = 0) = 1 p.

We will write X ⇠ Bernoulli(p) or Ber(p), which is read “X follows a Bernoulli distribution

with parameter p” or “X is drawn from a Bernoulli distribution with parameter p”.
A simple model for the Bernoulli distribution is to flip a coin with probability p of heads,
with X = 1 on heads and X = 0 on tails. The general terminology is to say X is 1 on
success and 0 on failure, with success and failure defined by the context.
Many decisions can be modeled as a binary choice, such as votes for or against a proposal.
If p is the proportion of the voting population that favors the proposal, than the vote of a
random individual is modeled by a Bernoulli(p).
Here are the table and graphs of the pmf and cdf for the Bernoulli(1/2) distribution and
below that for the general Bernoulli(p) distribution.

p(a) F (a)
1
value a: 0 1
pmf p(a): 1/2 1/2 1/2 1/2
cdf F (a): 1/2 1
a a
0 1 0 1

Table, pmf and cmf for the Bernoulli(1/2) distribution

p(a) F (a)
1
values a: 0 1 p
pmf p(a): 1-p p
cdf F (a): 1-p 1 1 p 1 p
a a
0 1 0 1

Table, pmf and cmf for the Bernoulli(p) distribution

3.2 Binomial Distributions

The binomial distribution Binomial(n,p), or Bin(n,p), models the number of successes in n

independent Bernoulli(p) trials.
There is a hierarchy here. A single Bernoulli trial is, say, one toss of a coin. A single
binomial trial consists of n Bernoulli trials. For coin flips the sample space for a Bernoulli
trial is {H, T }. The sample space for a binomial trial is all sequences of heads and tails of
length n. Likewise a Bernoulli random variable takes values 0 and 1 and a binomial random
variables takes values 0, 1, 2, . . . , n.

Example 6. Binomial(1,p) is the same as Bernoulli(p).

18.05 class 4, Discrete Random Variables, Spring 2014 7

Example 7. The number of heads in n flips of a coin with probability p of heads follows
a Binomial(n, p) distribution.
We describe X ⇠ Binomial(n, p) by giving its values and probabilities. For notation we will
use k to mean an arbitrary number between 0 and n.
✓ ◆
n
We remind you that ‘n choose k = = n Ck is the number of ways to choose k things
k
out of a collection of n things and it has the formula
✓ ◆
n n!
= . (1)
k k! (n k)!
(It is also called a binomial coefficient.) Here is a table for the pmf of a Binomial(n, k) ran-
dom variable. We will explain how the binomial coefficients enter the pmf for the binomial
distribution after a simple example.
values a: 0 1 2 ··· k ··· n
✓ ◆ ✓ ◆ ✓ ◆
n 1 n 2 n k
pmf p(a): (1 p)n p (1 p)n 1
p (1 p)n 2
··· p (1 p)n k
··· pn
1 2 k

Example 8. What is the probability of 3 or more heads in 5 tosses of a fair coin?

answer: The binomial coefficients associated with n = 5 are
✓ ◆ ✓ ◆ ✓ ◆
5 5 5! 5·4·3·2·1 5 5! 5·4·3·2·1 5·4
= 1, = = = 5, = = = = 10,
0 1 1! 4! 4·3·2·1 2 2! 3! 2·1·3·2·1 2
and similarly ✓ ◆ ✓ ◆ ✓ ◆
5 5 5
= 10, = 5, = 1.
3 4 5
Using these values we get the following table for X ⇠ Binomial(5,p).

values a: 0 1 2 3 4 5
pmf p(a): (1 p)5 5p(1 p)4 10p2 (1 p)3 10p3 (1 p)2 5p4 (1 p) p5

We were told p = 1/2 so

✓ ◆3 ✓ ◆2 ✓ ◆4 ✓ ◆ 1 ✓ ◆5
1 1 1 1 1 16 1
P (X 3) = 10 +5 + = = .
2 2 2 2 2 32 2

Think: Why is the value of 1/2 not surprising?

3.3 Explanation of the binomial probabilities

For concreteness, let n = 5 and k = 2 (the argument for arbitrary n and k is identical.) So
X ⇠ binomial(5, p) and we want to compute p(2). The long way to compute p(2) is to list
all the ways to get exactly 2 heads in 5 coin flips and add up their probabilities. The list
has 10 entries:
HHTTT, HTHTT, HTTHT, HTTTH, THHTT, THTHT, THTTH, TTHHT, TTHTH,
TTTHH
18.05 class 4, Discrete Random Variables, Spring 2014 8

Each entry has the same probability of occurring, namely

p2 (1 p)3 .
This is because each of the two heads has probability p and each of the 3 tails has proba-
bility 1 p. Because the individual tosses are independent we can multiply probabilities.
Therefore, the total probability of exactly 2 heads is the sum of 10 identical probabilities,
i.e. p(2) = 10p2 (1 p)3 , as shown in the table.
This guides us to the shorter way to do the computation. We have to count the number of
sequences with exactly 2 heads. To do this we need to choose 2 of the tosses to be heads
and the remaining 3 to be tails. The number of such sequences is the number of ways to
choose 2 out of 5 things, that is 52 . Since each such sequence has the same probability,
p2 (1 p)3 , we get the probability of exactly 2 heads p(2) = 52 p2 (1 p)3 .

Here are some binomial probability mass function (here, frequency is the same as probabil-
ity).

3.4 Geometric Distributions

A geometric distribution models the number of tails before the first head in a sequence of
coin flips (Bernoulli trials).
Example 9. (a) Flip a coin repeatedly. Let X be the number of tails before the first heads.
So, X can equal 0, i.e. the first flip is heads, 1, 2, . . . . In principle it take any nonnegative
integer value.
(b) Give a flip of tails the value 0, and heads the value 1. In this case, X is the number of
0’s before the first 1.
18.05 class 4, Discrete Random Variables, Spring 2014 9

(c) Give a flip of tails the value 1, and heads the value 0. In this case, X is the number of
1’s before the first 0.
(d) Call a flip of tails a success and heads a failure. So, X is the number of successes before
the first failure.
(e) Call a flip of tails a failure and heads a success. So, X is the number of failures before
the first success.
You can see this models many di↵erent scenarios of this type. The most neutral language
is the number of tails before the first head.
Formal definition. The random variable X follows a geometric distribution with param-
eter p if

• X takes the values 0, 1, 2, 3, . . .

• its pmf is given by p(k) = P (X = k) = (1 p)k p.

We denote this by X ⇠ geometric(p) or geo(p). In table form we have:

value a: 0 1 2 3 ... k ...
2 3
pmf p(a): p (1 p)p (1 p) p (1 p) p . . . (1 p) p . . . k

Table: X ⇠ geometric(p): X = the number of 0s before the first 1.

We will show how this table was computed in an example below.
The geometric distribution is an example of a discrete distribution that takes an infinite
number of possible values. Things can get confusing when we work with successes and
failure since we might want to model the number of successes before the first failure or we
might want the number of failures before the first success. To keep straight things straight
you can translate to the neutral language of the number of tails before the first heads.
1

0.4 0.8

0.3 0.6

0.2 0.4

0.1 0.2

0.0 a 0.0 a
0 1 5 10 0 1 5 10

pmf and cdf for the geometric(1/3) distribution

Example 10. Computing geometric probabilities. Suppose that the inhabitants of an

island plan their families by having babies until the first girl is born. Assume the probability
of having a girl with each pregnancy is 0.5 independent of other pregnancies, that all babies
survive and there are no multiple births. What is the probability that a family has k boys?
18.05 class 4, Discrete Random Variables, Spring 2014 10

answer: In neutral language we can think of boys as tails and girls as heads. Then the
number of boys in a family is the number of tails before the first heads.
Let’s practice using standard notation to present this. So, let X be the number of boys in
a (randomly-chosen) family. So, X is a geometric random variable. We are asked to find
p(k) = P (X = k). A family has k boys if the sequence of children in the family from oldest
to youngest is
BBB . . . BG
with the first k children being boys. The probability of this sequence is just the product
of the probability for each child, i.e. (1/2)k · (1/2) = (1/2)k+1 . (Note: The assumptions of
equal probability and independence are simplifications of reality.)

Think: What is the ratio of boys to girls on the island?

More geometric confusion. Another common definition for the geometric distribution is the
number of tosses until the first heads. In this case X can take the values 1, i.e. the first
flip is heads, 2, 3, . . . . This is just our geometric random variable plus 1. The methods of
computing with it are just like the ones we used above.

3.5 Uniform Distribution

The uniform distribution models any situation where all the outcomes are equally likely.

X ⇠ uniform(N ).

X takes values 1, 2, 3, . . . , N , each with probability 1/N . We have already seen this distribu-
tion many times when modeling to fair coins (N = 2), dice (N = 6), birthdays (N = 365),
and poker hands (N = 52 5 ).

3.6 Discrete Distributions Applet

The applet at http://mathlets.org/mathlets/probability-distributions/ gives a dy-

namic view of some discrete distributions. The graphs will change smoothly as you move
the various sliders. Try playing with the di↵erent distributions and parameters.
This applet is carefully color-coded. Two things with the same color represent the same or
closely related notions. By understanding the color-coding and other details of the applet,
you will acquire a stronger intuition for the distributions shown.

3.7 Other Distributions

There are a million other named distributions arising is various contexts. We don’t expect
you to memorize them (we certainly have not!), but you should be comfortable using a
resource like Wikipedia to look up a pmf. For example, take a look at the info box at the
top rightof http://en.wikipedia.org/wiki/Hypergeometric_distribution. The info
box lists many (surely unfamiliar) properties in addition to the pmf.
18.05 class 4, Discrete Random Variables, Spring 2014 11

4 Arithmetic with Random Variables

We can do arithmetic with random variables. For example, we can add subtract, multiply
or square them.
There is a simple, but extremely important idea for counting. It says that if we have a
sequence of numbers that are either 0 or 1 then the sum of the sequence is the number of
1s.
Example 11. Consider the sequence with five 1s

1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0.

It is easy to see that the sum of this sequence is 5 the number of 1s.
We illustrates this idea by counting the number of heads in n tosses of a coin.
Example 12. Toss a fair coin n times. Let Xj be 1 if the jth toss is heads and 0 if it’s
tails. So, Xj is a Bernoulli(1/2) random variable. Let X be the total number of heads in
the n tosses. Assuming the tosses are independence we know X ⇠ binomial(n, 1/2). We
can also write
X = X 1 + X2 + X3 + . . . + Xn .
Again, this is because the terms in the sum on the right are all either 0 or 1. So, the sum
is exactly the number of Xj that are 1, i.e. the number of heads.
The important thing to see in the example above is that we’ve written the more complicated
binomial random variable X as the sum of extremely simple random variables Xj . This
will allow us to manipulate X algebraically.

Think: Suppose X and Y are independent and X ⇠ binomial(n, 1/2) and Y ⇠ binomial(m, 1/2).
What kind of distribution does X + Y follow? (Answer: binomial(n + m, 1/2). Why?)

Example 13. Suppose X and Y are independent random variables with the following
tables.
Values of X x: 1 2 3 4
pmf pX (x): 1/10 2/10 3/10 4/10

Values of Y y: 1 2 3 4 5
pmf pY (y): 1/15 2/15 3/15 4/15 5/15
Check that the total probability for each random variable is 1. Make a table for the random
variable X + Y .
answer: The first thing to do is make a two-dimensional table for the product sample space
consisting of pairs (x, y), where x is a possible value of X and y one of Y . To help do the
computation, the probabilities for the X values are put in the far right column and those
for Y are in the bottom row. Because X and Y are independent the probability for (x, y)
pair is just the product of the individual probabilities.
18.05 class 4, Discrete Random Variables, Spring 2014 12

Y values

1 2 3 4 5

X values 1 1/150 2/150 3/150 4/150 5/150 1/10

2 2/150 4/150 6/150 8/150 10/150 2/10

3 3/150 6/150 9/150 12/150 15/150 3/10

4 4/150 8/150 12/150 16/150 20/150 4/10

1/15 2/15 3/15 4/15 5/15

The diagonal stripes show sets of squares where X + Y is the same. All we have to do to
compute the probability table for X + Y is sum the probabilities for each stripe.
X + Y values: 2 3 4 5 6 7 8 9
pmf: 1/150 4/150 10/150 20/150 30/150 34/150 31/150 20/150

When the tables are too big to write down we’ll need to use purely algebraic techniques to
compute the probabilities of a sum. We will learn how to do this in due course.
Discrete Random Variables: Expected Value
Class 4, 18.05
Jeremy Orlo↵ and Jonathan Bloom

1 Expected Value

In the R reading questions for this lecture, you simulated the average value of rolling a die
many times. You should have gotten a value close to the exact answer of 3.5. To motivate
the formal definition of the average, or expected value, we first consider some examples.
Example 1. Suppose we have a six-sided die marked with five 5 3’s and one 6. (This was
the red one from our non-transitive dice.) What would you expect the average of 6000 rolls
to be?
answer: If we knew the value of each roll, we could compute the average by summing
the 6000 values and dividing by 6000. Without knowing the values, we can compute the
expected average as follows.
Since there are five 3’s and one six we expect roughly 5/6 of the rolls will give 3 and 1/6 will
give 6. Assuming this to be exactly true, we have the following table of values and counts:
value: 3 6
expected counts: 5000 1000
The average of these 6000 values is then
5000 · 3 + 1000 · 6 5 1
= · 3 + · 6 = 3.5
6000 6 6
We consider this the expected average in the sense that we ‘expect’ each of the possible
values to occur with the given frequencies.

Example 2. We roll two standard 6-sided dice. You win $1000 if the sum is 2 and lose
$100 otherwise. How much do you expect to win on average per trial?
1
answer: The probability of a 2 is 1/36. If you play N times, you can ‘expect’ · N of the
36
35
trials to give a 2 and 36 · N of the trials to give something else. Thus your total expected
winnings are
N 35N
1000 · 100 · .
36 36
To get the expected average per trial we divide the total by N :
1 35
expected average = 1000 · 100 · = 69.44.
36 36

Think: Would you be willing to play this game one time? Multiple times?

Notice that in both examples the sum for the expected average consists of terms which are
a value of the random variable times its probabilitiy. This leads to the following definition.
Definition: Suppose X is a discrete random variable that takes values x1 , x2 , . . . , xn with
probabilities p(x1 ), p(x2 ), . . . , p(xn ). The expected value of X is denoted E(X) and defined

1
18.05 class 4, Discrete Random Variables: Expected Value, Spring 2014 2

by
n
X
E(X) = p(xj ) xj = p(x1 )x1 + p(x2 )x2 + . . . + p(xn )xn .
j=1

Notes:

1. The expected value is also called the mean or average of X and often denoted by µ
(“mu”).

2. As seen in the above examples, the expected value need not be a possible value of the
random variable. Rather it is a weighted average of the possible values.

3. Expected value is a summary statistic, providing a measure of the location or central

tendency of a random variable.

4. If all the values are equally probable then the expected value is just the usual average of
the values.

Example 3. Find E(X) for the random variable X with table:

values of X: 1 3 5
pmf: 1/6 1/6 2/3
1 1 2 24
answer: E(X) = · 1 + · 3 + · 5 = =4
6 6 3 6
Example 4. Let X be a Bernoulli(p) random variable. Find E(X).
answer: X takes values 1 and 0 with probabilities p and 1 p, so

E(X) = p · 1 + (1 p) · 0 = p.

Important: This is an important example. Be sure to remember that the expected value of
a Bernoulli(p) random variable is p.
Think: What is the expected value of the sum of two dice?

1.1 Mean and center or mass

You may have wondered why we use the name ‘probability mass function’. Here’s the
reason: if we place an object of mass p(xj ) at position xj for each j, then E(X) is the
position of the center of mass. Let’s recall the latter notion via an example.
Example 5. Suppose we have two masses along the x-axis, mass m1 = 500 at position
x1 = 3 and mass m2 = 100 at position x2 = 6. Where is the center of mass?
answer: Intuitively we know that the center of mass is closer to the larger mass.
m1 m2
x
3 6
18.05 class 4, Discrete Random Variables: Expected Value, Spring 2014 3

From physics we know the center of mass is

m 1 x 1 + m2 x 2 500 · 3 + 100 · 6
x= = = 3.5.
m 1 + m2 600
We call this formula a ‘weighted’ average of the x1 and x2 . Here x1 is weighted more heavily
because it has more mass.
Now look at the definition of expected value E(X). It is a weighted average of the values of
X with the weights being probabilities p(xi ) rather than masses! We might say that “The
expected value is the point at which the distribution would balance”. Note the similarity
between the physics example and Example 1.

1.2 Algebraic properties of E(X)

When we add, scale or shift random variables the expected values do the same. The
shorthand mathematical way of saying this is that E(X) is linear.
1. If X and Y are random variables on a sample space ⌦ then
E(X + Y ) = E(X) + E(Y )
2. If a and b are constants then
E(aX + b) = aE(X) + b.
We will think of aX + b as scaling X by a and shifting it by b.

Before proving these properties, let’s consider a few examples.

Example 6. Roll two dice and let X be the sum. Find E(X).
answer: Let X1 be the value on the first die and let X2 be the value on the second
die. Since X = X1 + X2 we have E(X) = E(X1 ) + E(X2 ). Earlier we computed that
E(X1 ) = E(X2 ) = 3.5, therefore E(X) = 7.

Example 7. Let X ⇠ binomial(n, p). Find E(X).

answer: Recall that X models the number of successes in n Bernoulli(p) random variables,
which we’ll call X1 , . . . Xn . The key fact, which we highlighted in the previous reading for
this class, is that
Xn
X= Xj .
j=1

Now we can use the Algebraic Property (1) to make the calculation simple.
n
X X X
X= Xj ) E(X) = E(Xj ) = p = np .
j=1 j j

We could have computed E(X) directly as

n
X n
X ✓ ◆
n k
E(X) = kp(k) = k p (1 p)n k
.
k
k=0 k=0
18.05 class 4, Discrete Random Variables: Expected Value, Spring 2014 4

It is possible to show that the sum of this series is indeed np. We think you’ll agree that
the method using Property (1) is much easier.

Example 8. (For infinite random variables the mean does not always exist.) Suppose X
has an infinite number of values according to the following table
values x: 2 22 23 ... 2k ...
2 3 Try to compute the mean.
pmf p(x): 1/2 1/2 1/2 . . . 1/2k . . .
answer: The mean is
1
X X 1
k 1
E(X) = 2 k = 1 = 1.
2
k=1 k=1
The mean does not exist! This can happen with infinite series.

1.3 Proofs of the algebraic properties of E(X)

The proof of Property (1) is simple, but there is some subtlety in even understanding what
it means to add two random variables. Recall that the value of random variable is a number
determined by the outcome of an experiment. To add X and Y means to add the values of
X and Y for the same outcome. In table form this looks like:
outcome !: !1 !2 !3 ... !n
value of X: x1 x2 x3 ... xn
value of Y : y1 y2 y3 ... yn
value of X + Y : x1 + y1 x2 + y2 x3 + y3 . . . xn + yn
prob. P (!): P (!1 ) P (!2 ) P (!3 ) . . . P (!n )
The proof of (1) follows immediately:
X X X
E(X + Y ) = (xi + yi )P (!i ) = xi P (!i ) + yi P (!i ) = E(X) + E(Y ).

The proof of Property (2) only takes one line.

X X X
E(aX + b) = p(xi )(axi + b) = a p(xi )xi + b p(xi ) = aE(X) + b.
P
The b term in the last expression follows because p(xi ) = 1.
Example 9. Mean of a geometric distribution
Let X ⇠ geo(p). Recall this means X takes values k = 0, 1, 2, . . . with probabilities
p(k) = (1 p)k p. (X models the number of tails before the first heads in a sequence of
Bernoulli trials.) The mean is given by
1 p
E(X) = .
p
To see this requires a clever trick. Mathematicians love this sort of thing and we hope you
are able to follow the logic. In this class we will not ask you to come up with something
like this on an exam.
Here’s the trick.: to compute E(X) we have to sum the infinite series
1
X
E(X) = k(1 p)k p.
k=0
18.05 class 4, Discrete Random Variables: Expected Value, Spring 2014 5

1
X 1
Here is the trick. We know the sum of the geometric series: xk = .
1 x
k=0
1
X 1
Di↵erentiate both sides: kxk 1
= .
(1 x)2
k=0
1
X x
Multiply by x: kxk = .
(1 x)2
k=0
1
X 1 p
Replace x by 1 p: k(1 p)k = .
p2
k=0
1
X 1 p
Multiply by p: k(1 p)k p = .
p
k=0
This last expression is the mean.
1 p
E(X) = .
p

Example 10. Flip a fair coin until you get heads for the first time. What is the expected
number of times you flipped tails?
answer: The number of tails before the first head is modeled by X ⇠ geo(1/2). From the
1/2
previous example E(X) = = 1. This is a surprisingly small number.
1/2

Example 11. Michael Jordan, the greatest basketball player ever, made 80% of his free
throws. In a game what is the expected number he would make before his first miss.
answer: Here is an example where we want the number of successes before the first failure.
Using the neutral language of heads and tails: success is tails (probability 1 p) and failure
is heads (probability = p). Therefore p = .2 and the number of tails (made free throws)
before the first heads (missed free throw) is modeled by a X ⇠ geo(.2). We saw in Example
9 that this is
1 p .8
E(X) = = = 4.
p .2

1.4 Expected values of functions of a random variable

(The change of variables formula.)

If X is a discrete random variable taking values x1 , x2 , . . . and h is a function the h(X) is
a new random variable. Its expected value is
X
E(h(X)) = h(xj )p(xj ).
j

We illustrate this with several examples.

Example 12. Let X be the value of a roll of one die and let Y = X 2 . Find E(Y ).
answer: Since there are a small number of values we can make a table.
X 1 2 3 4 5 6
Y 1 4 9 16 25 36
prob 1/6 1/6 1/6 1/6 1/6 1/6
18.05 class 4, Discrete Random Variables: Expected Value, Spring 2014 6

Notice the probability for each Y value is the same as that of the corresponding X value.
So,
1 1 1
E(Y ) = E(X 2 ) = 12 · + 22 · + . . . + 62 · = 15.167.
6 6 6
Example 13. Roll two dice and let X be the sum. Suppose the payo↵ function is given
by Y = X 2 6X + 1. Is this a good bet?
X12
answer: We have E(Y ) = (j 2 6j + 1)p(j), where p(j) = P (X = j).
j=2
We show the table, but really we’ll use R to do the calculation.
X 2 3 4 5 6 7 8 9 10 11 12
Y -7 -8 -7 -4 1 8 17 28 41 56 73
prob 1/36 2/36 3/36 4/36 5/36 6/36 5/36 4/36 3/36 2/36 1/36
Here’s the R code I used to compute E(Y ) = 13.833.
x = 2:12
y = x^2 - 6*x + 1
p = c(1 2 3 4 5 6 5 4 3 2 1)/36
ave = sum(p*y)
It gave ave = 13.833.
To answer the question above: since the expected payo↵ is positive it looks like a bet worth
taking.

Quiz: If Y = h(X) does E(Y ) = h(E(X))? answer: NO!!! This is not true in general!
Think: Is it true in the previous example?
Quiz: If Y = 3X + 77 does E(Y ) = 3E(X) + 77?
answer: Yes. By property (2), scaling and shifting does behave like this.
Variance of Discrete Random Variables
Class 5, 18.05
Jeremy Orlo↵ and Jonathan Bloom

1 Learning Goals

1. Be able to compute the variance and standard deviation of a random variable.

2. Understand that standard deviation is a measure of scale or spread.

3. Be able to compute variance using the properties of scaling and linearity.

2 Spread

The expected value (mean) of a random variable is a measure of location or central tendency.
If you had to summarize a random variable with a single number, the mean would be a good
choice. Still, the mean leaves out a good deal of information. For example, the random
variables X and Y below both have mean 0, but their probability mass is spread out about
the mean quite di↵erently.
values X -2 -1 0 1 2 values Y -3 3
pmf p(x) 1/10 2/10 4/10 2/10 1/10 pmf p(y) 1/2 1/2

It’s probably a little easier to see the di↵erent spreads in plots of the probability mass
functions. We use bars instead of dots to give a better sense of the mass.
p(x) pmf for X pmf for Y p(y)
1/2
4/10

2/10
1/10
x y
-2 -1 0 1 2 -3 0 3
pmf’s for two di↵erent distributions both with mean 0
In the next section, we will learn how to quantify this spread.

3 Variance and standard deviation

Taking the mean as the center of a random variable’s probability distribution, the variance
is a measure of how much the probability mass is spread out around this center. We’ll start
with the formal definition of variance and then unpack its meaning.
Definition: If X is a random variable with mean E(X) = µ, then the variance of X is
defined by
Var(X) = E((X µ)2 ).

1
18.05 class 5, Variance of Discrete Random Variables, Spring 2014 2

The standard deviation of X is defined by

p
= Var(X).

If the relevant random variable is clear from context, then the variance and standard devi-
ation are often denoted by 2 and (‘sigma’), just as the mean is µ (‘mu’).
What does this mean? First, let’s rewrite the definition explicitly as a sum. If X takes
values x1 , x2 , . . . , xn with probability mass function p(xi ) then
n
X
Var(X) = E((X µ)2 ) = p(xi )(xi µ)2 .
i=1

In words, the formula for Var(X) says to take a weighted average of the squared distance
to the mean. By squaring, we make sure we are averaging only non-negative values, so that
the spread to the right of the mean won’t cancel that to the left. By using expectation,
we are weighting high probability values more than low probability values. (See Example 2
below.)
Note on units:
1. has the same units as X.
2. Var(X) has the same units as the square of X. So if X is in meters, then Var(X) is in
meters squared.
Because and X have the same units, the standard deviation is a natural measure of
spread.
Let’s work some examples to make the notion of variance clear.
Example 1. Compute the mean, variance and standard deviation of the random variable
X with the following table of values and probabilities.
value x 1 3 5
pmf p(x) 1/4 1/4 1/2
answer: First we compute E(X) = 7/2. Then we extend the table to include (X 7/2)2 .
value x 1 3 5
p(x) 1/4 1/4 1/2
(x 7/2)2 25/4 1/4 9/4
Now the computation of the variance is similar to that of expectation:
25 1 1 1 9 1 11
Var(X) = · + · + · = .
4 4 4 4 4 2 4
p
Taking the square root we have the standard deviation = 11/4.

Example 2. For each random variable X, Y , Z, and W plot the pmf and compute the
mean and variance.

(i) value x 1 2 3 4 5
pmf p(x) 1/5 1/5 1/5 1/5 1/5
(ii) value y 1 2 3 4 5
pmf p(y) 1/10 2/10 4/10 2/10 1/10
18.05 class 5, Variance of Discrete Random Variables, Spring 2014 3

(iii) value z 1 2 3 4 5
pmf p(z) 5/10 0 0 0 5/10
(iv) value w 1 2 3 4 5
pmf p(w) 0 0 1 0 0
answer: Each random variable has the same mean 3, but the probability is spread out
di↵erently. In the plots below, we order the pmf’s from largest to smallest variance: Z, X,
Y , W.
p(z) pmf for Z p(x) pmf for X
.5

1/5

z x
1 2 3 4 5 1 2 3 4 5

p(w)
1

p(y) pmf for Y pmf for W

.1
y W
1 2 3 4 5 1 2 3 4 5

Next we’ll verify our visual intuition by computing the variance of each of the variables.
All of them have mean µ = 3. Since the variance is defined as an expected value, we can
compute it using the tables.
(i) value x 1 2 3 4 5
pmf p(x) 1/5 1/5 1/5 1/5 1/5
(X µ)2 4 1 0 1 4
4 1 0 1 4
Var(X) = E((X µ)2 ) = 5 + 5 + 5 + 5 + 5 = 2 .

(ii) value y 1 2 3 4 5
p(y) 1/10 2/10 4/10 2/10 1/10
(Y µ)2 4 1 0 1 4
4 2 0 2 4
Var(Y ) = E((Y µ)2 ) = 10 + 10 + 10 + 10 + 10 = 1.2 .

(iii) value z 1 2 3 4 5
pmf p(z) 5/10 0 0 0 5/10
(Z µ)2 4 1 0 1 4
20 20
Var(Z) = E((Z µ)2 ) = 10 + 10 = 4 .
18.05 class 5, Variance of Discrete Random Variables, Spring 2014 4

(iv) value w 1 2 3 4 5
pmf p(w) 0 0 1 0 0
(W µ)2 4 1 0 1 4

Var(W ) = 0 . Note that W doesn’t vary, so it has variance 0!

3.1 The variance of a Bernoulli(p) random variable.

Bernoulli random variables are fundamental, so we should know their variance.

If X ⇠ Bernoulli(p) then
Var(X) = p(1 p).

Proof: We know that E(X) = p. We compute Var(X) using a table.

values X 0 1
pmf p(x) 1 p p
(X µ)2 (0 p)2 (1 p)2
Var(X) = (1 p)p2 + p(1 p)2 = (1 p)p(1 p + p) = (1 p)p.

As with all things Bernoulli, you should remember this formula.

Think: For what value of p does Bernoulli(p) have the highest variance? Try to answer
this by plotting the PMF for various p.

3.2 A word about independence

So far we have been using the notion of independent random variable without ever carefully
defining it. For example, a binomial distribution is the sum of independent Bernoulli trials.
This may (should?) have bothered you. Of course, we have an intuitive sense of what inde-
pendence means for experimental trials. We also have the probabilistic sense that random
variables X and Y are independent if knowing the value of X gives you no information
about the value of Y .
In a few classes we will work with continuous random variables and joint probability func-
tions. After that we will be ready for a full definition of independence. For now we can use
the following definition, which is exactly what you expect and is valid for discrete random
variables.
Definition: The discrete random variables X and Y are independent if

P (X = a, Y = b) = P (X = a)P (Y = b)

for any values a, b. That is, the probabilities multiply.

3.3 Properties of variance

The three most useful properties for computing variance are:

1. If X and Y are independent then Var(X + Y ) = Var(X) + Var(Y ).
18.05 class 5, Variance of Discrete Random Variables, Spring 2014 5

2. For constants a and b, Var(aX + b) = a2 Var(X).

3. Var(X) = E(X 2 ) E(X)2 .
For Property 1, note carefully the requirement that X and Y are independent. We will
return to the proof of Property 1 in a later class.

Property 3 gives a formula for Var(X) that is often easier to use in hand calculations. The
computer is happy to use the definition! We’ll prove Properties 2 and 3 after some examples.

Example 3. Suppose X and Y are independent and Var(X) = 3 and Var(Y ) = 5. Find:
(i) Var(X + Y ), (ii) Var(3X + 4), (iii) Var(X + X), (iv) Var(X + 3Y ).
answer: To compute these variances we make use of Properties 1 and 2.
(i) Since X and Y are independent, Var(X + Y ) = Var(X) + Var(Y ) = 8.
(ii) Using Property 2, Var(3X + 4) = 9 · Var(X) = 27.
(iii) Don’t be fooled! Property 1 fails since X is certainly not independent of itself. We can
use Property 2: Var(X + X) = Var(2X) = 4 · Var(X) = 12. (Note: if we mistakenly used
Property 1, we would the wrong answer of 6.)
(iv) We use both Properties 1 and 2.
Var(X + 3Y ) = Var(X) + Var(3Y ) = 3 + 9 · 5 = 48.

Example 4. Use Property 3 to compute the variance of X ⇠ Bernoulli(p).

answer: From the table
X 0 1
p(x) 1 p p
X2 0 1
we have E(X 2 ) = p. So Property 3 gives
Var(X) = E(X 2 ) E(X)2 = p p2 = p(1 p).
This agrees with our earlier calculation.
Example 5. Redo Example 1 using Property 3.
answer: From the table
X 1 3 5
p(x) 1/4 1/4 1/2
X2 1 9 2
we have E(X) = 7/2 and
1 1 1 60
E(X 2 ) = 12 · + 32 · + 52 · = = 15.
4 4 2 4
So Var(X) = 15 (7/2)2 = 11/4 –as before in Example 1.

3.4 Variance of binomial(n,p)

Suppose X ⇠ binomial(n, p). Since X is the sum of independent Bernoulli(p) variables and
each Bernoulli variable has variance p(1 p) we have
X ⇠ binomial(n, p) ) Var(X) = np(1 p).
18.05 class 5, Variance of Discrete Random Variables, Spring 2014 6

3.5 Proof of properties 2 and 3

Proof of Property 2: This follows from the properties of E(X) and some algebra.
Let µ = E(X). Then E(aX + b) = aµ + b and

Var(aX+b) = E((aX+b (aµ+b))2 ) = E((aX aµ)2 ) = E(a2 (X µ)2 ) = a2 E((X µ)2 ) = a2 Var(X).

Proof of Property 3: We use the properties of E(X) and a bit of algebra. Remember
that µ is a constant and that E(X) = µ.

E((X µ)2 ) = E(X 2 2µX + µ2 )

= E(X 2 ) 2µE(X) + µ2
= E(X 2 ) 2µ2 + µ2
= E(X 2 ) µ2
= E(X 2 ) E(X)2 . QED

4 Tables of Distributions and Properties

Distribution range X pmf p(x) mean E(X) variance Var(X)

Bernoulli(p) 0, 1 p(0) = 1 p, p(1) = p p p(1 p)

✓ ◆
n k
Binomial(n, p) 0, 1,. . . , n p(k) = p (1 p)n k np np(1 p)
k
1 n+1 n2 1
Uniform(n) 1, 2, . . . , n p(k) =
n 2 12
1 p 1 p
Geometric(p) 0, 1, 2,. . . p(k) = p(1 p)k
p p2

Let X be a discrete random variable with range x1 , x2 , . . . and pmf p(xj ).

Expected Value: Variance:
Synonyms: mean, average
Notation: E(X), µ Var(X), 2

X X
Definition: E(X) = p(xj )xj E((X µ)2 ) = p(xj )(xj µ)2
j j
Scale and shift: E(aX + b) = aE(X) + b Var(aX + b) = a2 Var(X)
Linearity: (for any X, Y ) (for X, Y independent)
E(X + Y ) = E(X) + E(Y ) Var(X + Y ) = Var(X) + Var(Y )
P
Functions of X: E(h(X)) = p(xj ) h(xj )
Alternative formula: Var(X) = E(X 2 ) E(X)2 = E(X 2 ) µ2
Continuous Random Variables
Class 5, 18.05
Jeremy Orlo↵ and Jonathan Bloom

1 Learning Goals

1. Know the definition of a continuous random variable.

2. Know the definition of the probability density function (pdf) and cumulative distribution
function (cdf).

3. Be able to explain why we use probability density for continuous random variables.

2 Introduction

We now turn to continuous random variables. All random variables assign a number to
each outcome in a sample space. Whereas discrete random variables take on a discrete set
of possible values, continuous random variables have a continuous set of values.
Computationally, to go from discrete to continuous we simply replace sums by integrals. It
will help you to keep in mind that (informally) an integral is just a continuous sum.
Example 1. Since time is continuous, the amount of time Jon is early (or late) for class is
a continuous random variable. Let’s go over this example in some detail.
Suppose you measure how early Jon arrives to class each day (in units of minutes). That
is, the outcome of one trial in our experiment is a time in minutes. We’ll assume there are
random fluctuations in the exact time he shows up. Since in principle Jon could arrive, say,
3.43 minutes early, or 2.7 minutes late (corresponding to the outcome -2.7), or at any other
time, the sample space consists of all real numbers. So the random variable which gives the
outcome itself has a continuous range of possible values.
It is too cumbersome to keep writing ‘the random variable’, so in future examples we might
write: Let T = “time in minutes that Jon is early for class on any given day.”

3 Calculus Warmup

While we will assume you can compute the most familiar forms of derivatives and integrals
by hand, we do not expect you to be calculus whizzes. For tricky expressions, we’ll let the
computer do most of the calculating. Conceptually, you should be comfortable with two
views of a definite integral.
Z b
1. f (x) dx = area under the curve y = f (x).
a
Z b
2. f (x) dx = ‘sum of f (x) dx’.
a

1
18.05 class 5, Continuous Random Variables, Spring 2014 2

The connection between the two is:

n
X
area ⇡ sum of rectangle areas = f (x1 ) x + f (x2 ) x + . . . + f (xn ) x = f (xi ) x.
1
As the width x of the intervals gets smaller the approximation becomes better.
y
y y = f (x)
y = f (x)
Area = f (xi ) x

x
x
x x0 x1 x2 ··· xn
a b a b

Area is approximately the sum of rectangles

Note: In calculus you learned to compute integrals by finding antiderivatives. This is

important for calculations, but don’t confuse this method for the reason we use integrals.
Our interest in integrals comes primarily from its interpretation as a ‘sum’ and to a much
lesser extent its interpretation as area.

4 Continuous Random Variables and Probability Density Func-

tions

A continuous random variable takes a range of values, which may be finite or infinite in
extent. Here are a few examples of ranges: [0, 1], [0, 1), ( 1, 1), [a, b].
Definition: A random variable X is continuous if there is a function f (x) such that for
any c  d we have
Z d
P (c  X  d) = f (x) dx. (1)
c
The function f (x) is called the probability density function (pdf).
The pdf always satisfies the following properties:

1. f (x) 0 (f is nonnegative).
Z 1
2. f (x) dx = 1 (This is equivalent to: P ( 1 < X < 1) = 1).
1

The probability density function f (x) of a continuous random variable is the analogue of
the probability mass function p(x) of a discrete random variable. Here are two important
di↵erences:

1. Unlike p(x), the pdf f (x) is not a probability. You have to integrate it to get proba-
bility. (See section 4.2 below.)

2. Since f (x) is not a probability, there is no restriction that f (x) be less than or equal
to 1.
18.05 class 5, Continuous Random Variables, Spring 2014 3

Note: In Property 2, we integrated over ( 1, 1) since we did not know the range of values
taken by X. Formally, this makes sense because we just define f (x) to be 0 outside of the
range of X. In practice, we would integrate between bounds given by the range of X.

4.1 Graphical View of Probability

If you graph the probability density function of a continuous random variable X then
P (c  X  d) = area under the graph between c and d.

f (x)

P (c  X  d)

x
c d

Think: What is the total area under the pdf f (x)?

4.2 The terms ‘probability mass’ and ‘probability density’

Why do we use the terms mass and density to describe the pmf and pdf? What is the
di↵erence between the two? The simple answer is that these terms are completely analogous
to the mass and density you saw in physics and calculus. We’ll review this first for the
probability mass function and then discuss the probability density function.
Mass as a sum:
If masses m1 , m2 , m3 , and m4 are set in a row at positions x1 , x2 , x3 , and x4 , then the
total mass is m1 + m2 + m3 + m4 .
m1 m2 m3 m4
x
x1 x2 x3 x4

We can define a ‘mass function’ p(x) with p(xj ) = mj for j = 1, 2, 3, 4, and p(x) = 0
otherwise. In this notation the total mass is p(x1 ) + p(x2 ) + p(x3 ) + p(x4 ).
The probability mass function behaves in exactly the same way, except it has the dimension
of probability instead of mass.
Mass as an integral of density:
Suppose you have a rod of length L meters with varying density f (x) kg/m. (Note the units
are mass/length.)

x
x
0 x1 x2 x3 xi xn = L
mass of ith piece ⇡ f (xi ) x
18.05 class 5, Continuous Random Variables, Spring 2014 4

If the density varies continuously, we must find the total mass of the rod by integration:
Z L
total mass = f (x) dx.
0

This formula comes from dividing the rod into small pieces and ’summing’ up the mass of
each piece. That is:
X n
total mass ⇡ f (xi ) x
i=1
In the limit as x goes to zero the sum becomes the integral.
The probability density function behaves exactly the same way, except it has units of
probability/(unit x) instead of kg/m. Indeed, equation (1) is exactly analogous to the
above integral for total mass.
While we’re on a physics kick, note that for both discrete and continuous random variables,
the expected value is simply the center of mass or balance point.

Example 2. Suppose X has pdf f (x) = 3 on [0, 1/3] (this means f (x) = 0 outside of
[0, 1/3]). Graph the pdf and compute P (.1  X  .2) and P (.1  X  1).
answer: P (.1  X  .2) is shown below at left. We can compute the integral:
Z .2 Z .2
P (.1  X  .2) = f (x) dx = 3 dx = .3.
.1 .1

Or we can find the area geometrically:

area of rectangle = 3 · .1 = .3.

P (.1  X  1) is shown below at right. Since there is only area under f (x) up to 1/3, we
have P (.1  X  1) = 3 · (1/3 .1) = .7.

3 f (x) 3 f (x)

x x
.1 .2 1/3 .1 1/3 1

P (.1  X  .2) P (.1  X  1)

Think: In the previous example f (x) takes values greater than 1. Why does this not
violate the rule that probabilities are always between 0 and 1?

Note on notation. We can define a random variable by giving its range and probability
density function. For example we might say, let X be a random variable with range [0,1]
18.05 class 5, Continuous Random Variables, Spring 2014 5

and pdf f (x) = x/2. Implicitly, this means that X has no probability density outside of the
given range. If we wanted to be absolutely rigorous, we would say explicitly that f (x) = 0
outside of [0,1], but in practice this won’t be necessary.
Example 3. Let X be a random variable with range [0,1] and pdf f (x) = Cx2 . What is
the value of C?
answer: Since the total probability must be 1, we have
Z 1 Z 1
f (x) dx = 1 , Cx2 dx = 1.
0 0

By evaluating the integral, the equation at right becomes

C/3 = 1 ) C=3.

Note: We say the constant C above is needed to normalize the density so that the total
probability is 1.

Example 4. Let X be the random variable in the Example 3. Find P (X  1/2).

Z 1/2
1/2 1
answer: P (X  1/2) = 3x2 dx = x3 0 = .
0 8

Think: For this X (or any continuous random variable):

• What is P (a  X  a)?

• What is P (X = 0)?

• Does P (X = a) = 0 mean that X can never equal a?

In words the above questions get at the fact that the probability that a random person’s
height is exactly 5’9” (to infinite precision, i.e. no rounding!) is 0. Yet it is still possible
that someone’s height is exactly 5’9”. So the answers to the thinking questions are 0, 0,
and No.

4.3 Cumulative Distribution Function

The cumulative distribution function (cdf ) of a continuous random variable X is defined

in exactly the same way as the cdf of a discrete random variable.

F (b) = P (X  b).

Note well that the definition is about probability. When using the cdf you should first think
of it as a probability. Then when you go to calculate it you can use
Z b
F (b) = P (X  b) = f (x) dx, where f (x) is the pdf of X.
1

Notes:
1. For discrete random variables, we defined the cumulative distribution function but did
18.05 class 5, Continuous Random Variables, Spring 2014 6

not have much occasion to use it. The cdf plays a far more prominent role for continuous
random variables.
2. As before, we started the integral at 1 because we did not know the precise range of
X. Formally, this still makes sense since f (x) = 0 outside the range of X. In practice, we’ll
know the range and start the integral at the start of the range.
3. In practice we often say ‘X has distribution F (x)’ rather than ‘X has cumulative distri-
bution function F (x).’

Example 5. Find the cumulative distribution function for the density in Example 2.
Z a Z a
answer: For a in [0,1/3] we have F (a) = f (x) dx = 3 dx = 3a.
0 0
Since f (x) is 0 outside of [0,1/3] we know F (a) = P (X  a) = 0 for a < 0 and F (a) = 1
for a > 1/3. Putting this all together we have
8
>
<0 if a < 0
F (a) = 3a if 0  a  1/3
>
:
1 if 1/3 < a.

Here are the graphs of f (x) and F (x).

F (x)
1
3 f (x)

x x
1/3 1/3

Note the di↵erent scales on the vertical axes. Remember that the vertical axis for the pdf
represents probability density and that of the cdf represents probability.

Example 6. Find the cdf for the pdf in Example 3, f (x) = 3x2 on [0, 1]. Suppose X is a
random variable with this distribution. Find P (X < 1/2).
Z a
answer: f (x) = 3x2 on [0,1] ) F (a) = 3x2 dx = a3 on [0,1]. Therefore,
0
8
>
<0 if a < 0
F (a) = a3 if 0  a  1
>
:
1 if 1 < a

Thus, P (X < 1/2) = F (1/2) = 1/8. Here are the graphs of f (x) and F (x):
1
3 f (x) F (x)

x x
1 1
18.05 class 5, Continuous Random Variables, Spring 2014 7

4.4 Properties of cumulative distribution functions

Here is a summarry of the most important properties of cumulative distribution functions

(cdf)

1. (Definition) F (x) = P (X ≤ x)
2. 0 ≤ F (x) ≤ 1
3. F (x) is non-decreasing, i.e. if a ≤ b then F (a) ≤ F (b).
4. lim F (x) = 1 and lim F (x) = 0
x→∞ x→−∞

5. P (a ≤ X ≤ b) = F (b) − F (a)
6. F ′ (x) = f (x).

Properties 2, 3, 4 are identical to those for discrete distributions. The graphs in the previous
examples illustrate them.
Property 5 can be seen algebraically:
! b ! a ! b
f (x) dx = f (x) dx + f (x) dx
−∞ −∞ a
! b ! b ! a
⇔ f (x) dx = f (x) dx − f (x) dx
a −∞ −∞
⇔ P (a ≤ X ≤ b) = F (b) − F (a).
Property 5 can also be seen geometrically. The orange region below represents F (b) and
the striped region represents F (a). Their diﬀerence is P (a ≤ X ≤ b).

P (a ≤ X ≤ b)
x
a b

Property 6 is the fundamental theorem of calculus.

4.5 Probability density as a dartboard

We find it helpful to think of sampling values from a continuous random variable as throw-
ing darts at a funny dartboard. Consider the region underneath the graph of a pdf as a
dartboard. Divide the board into small equal size squares and suppose that when you throw
a dart you are equally likely to land in any of the squares. The probability the dart lands
in a given region is the fraction of the total area under the curve taken up by the region.
Since the total area equals 1, this fraction is just the area of the region. If X represents
the x-coordinate of the dart, then the probability that the dart lands with x-coordinate
between a and b is just
! b
P (a ≤ X ≤ b) = area under f (x) between a and b = f (x) dx.
a
Gallery of Continuous Random Variables
Class 5, 18.05, Spring 2014
Jeremy Orloﬀ and Jonathan Bloom

1 Learning Goals
1. Be able to give examples of what uniform, exponential and normal distributions are used
to model.

2. Be able to give the range and pdf’s of uniform, exponential and normal distributions.

2 Introduction

Here we introduce a few fundamental continuous distributions. These will play important
roles in the statistics part of the class. For each distribution, we give the range, the pdf,
the cdf, and a short description of situations that it models. These distributions all depend
on parameters, which we specify.
As you look through each distribution do not try to memorize all the details; you can always
look those up. Rather, focus on the shape of each distribution and what it models.
Although it comes towards the end, we call your attention to the normal distribution. It is
easily the most important distribution defined here.

3 Uniform distribution
1. Parameters: a, b.

2. Range: [a, b].

3. Notation: uniform(a, b) or U(a, b).

1
4. Density: f (x) = for a ≤ x ≤ b.
b−a
5. Distribution: F (x) = (x − a)/(b − a) for a ≤ x ≤ b.

6. Models: All outcomes in the range have equal probability (more precisely all out-
comes have the same probability density).

Graphs:
1
b−a f (x)
F (x)
1

x x
a b a b
pdf and cdf for uniform(a,b) distribution.

1
18.05 class 5, Gallery of Continuous Random Variables, Spring 2014 2

Examples. 1. Suppose we have a tape measure with markings at each millimeter. If we

measure (to the nearest marking) the length of items that are roughly a meter long, the
rounding error will uniformly distributed between -0.5 and 0.5 millimeters.
2. Many boardgames use spinning arrows (spinners) to introduce randomness. When spun,
the arrow stops at an angle that is uniformly distributed between 0 and 2π radians.
3. In most pseudo-random number generators, the basic generator simulates a uniform
distribution and all other distributions are constructed by transforming the basic generator.

4 Exponential distribution

1. Parameter: λ.

2. Range: [0, ∞).

3. Notation: exponential(λ) or exp(λ).

4. Density: f (x) = λe−λx for 0 ≤ x.

5. Distribution: (easy integral)

F (x) = 1 − e−λx for x ≥ 0

6. Right tail distribution: P (X > x) = 1 − F (x) = e−λx .

7. Models: The waiting time for a continuous process to change state.

Examples. 1. If I step out to 77 Mass Ave after class and wait for the next taxi, my
waiting time in minutes is exponentially distributed. We will see that in this case λ is given
by one over the average number of taxis that pass per minute (on weekday afternoons).
2. The exponential distribution models the waiting time until an unstable isotope undergoes
nuclear decay. In this case, the value of λ is related to the half-life of the isotope.

Memorylessness: There are other distributions that also model waiting times, but the
exponential distribution has the additional property that it is memoryless. Here’s what
this means in the context of Example 1. Suppose that the probability that a taxi arrives
within the first five minutes is p. If I wait five minutes and in fact no taxi arrives, then the
probability that a taxi arrives within the next five minutes is still p.
By contrast, suppose I were to instead go to Kendall Square subway station and wait for
the next inbound train. Since the trains are coordinated to follow a schedule (e.g., roughly
12 minutes between trains), if I wait five minutes without seeing a train then there is a far
greater probability that a train will arrive in the next five minutes. In particular, waiting
time for the subway is not memoryless, and a better model would be the uniform distribution
on the range [0,12].
The memorylessness of the exponential distribution is analogous to the memorylessness
of the (discrete) geometric distribution, where having flipped 5 tails in a row gives no
information about the next 5 flips. Indeed, the exponential distribution is the precisely the
18.05 class 5, Gallery of Continuous Random Variables, Spring 2014 3

continuous counterpart of the geometric distribution, which models the waiting time for a
discrete process to change state. More formally, memoryless means that the probability of
waiting t more minutes is unaﬀected by having already waited s minutes without incident.
In symbols, P (X > s + t | X > s) = P (X > t).
Proof of memorylessness: Since (X > s + t) ∩ (X > s) = (X > s + t) we have

P (X > s + t) e−λ(s+t)
P (X > s + t | X > s) = = = e−λt = P (X > t). QED
P (X > s) e−λs

Graphs:

5 Normal distribution

In 1809, Carl Friedrich Gauss published a monograph introducing several notions that have
become fundamental to statistics: the normal distribution, maximum likelihood estimation,
and the method of least squares (we will cover all three in this course). For this reason,
the normal distribution is also called the Gaussian distribution, and it the most important
continuous distribution.

1. Parameters: µ, σ.

2. Range: (−∞, ∞).

3. Notation: normal(µ, σ 2 ) or N (µ, σ 2 ).

1 2 2
4. Density: f (x) = √ e−(x−µ) /2σ .
σ 2π
5. Distribution: F (x) has no formula, so use tables or software such as pnorm in R to
compute F (x).

6. Models: Measurement error, intelligence/ability, height, averages of lots of data.

18.05 class 5, Gallery of Continuous Random Variables, Spring 2014 4

The standard normal distribution N (0, 1) has mean 0 and variance 1. We reserve Z for
1 2
a standard normal random variable, φ(z) = √ e−x /2 for the standard normal density,
2π
and Φ(z) for the standard normal distribution.
Note: we will define mean and variance for continuous random variables next time. They
have the same interpretations as in the discrete case. As you might guess, the normal
distribution N (µ, σ 2 ) has mean µ, variance σ 2 , and standard deviation σ.
Here are some graphs of normal distributions. Note they are shaped like a bell curve. Note
also that as σ increases they become more spread out.
Graphs: (the bell curve):

5.1 Normal probabilities

To make approximations it is useful to remember the following rule of thumb for three
approximate probabilities
P (−1 ≤ Z ≤ 1) ≈ .68, P (−2 ≤ Z ≤ 2) ≈ .95, P (−3 ≤ Z ≤ 3) ≈ .99
within 1 · σ ≈ 68%

Normal PDF within 2 · σ ≈ 95%

within 3 · σ ≈ 99%
68%

95%

99%
z
−3σ −2σ −σ σ 2σ 3σ

Symmetry calculations
We can use the symmetry of the standard normal distribution about x = 0 to make some
calculations.
Example 1. The rule of thumb says P (−1 ≤ Z ≤ 1) ≈ .68. Use this to estimate Φ(1).
answer: Φ(1) = P (Z ≤ 1). In the figure, the two tails (in red) have combined area 1-.68 =
.32. By symmetry the left tail has area .16 (half of .32), so P (Z ≤ 1) ≈ .68 + .16 = .84.
18.05 class 5, Gallery of Continuous Random Variables, Spring 2014 5

P (−1 ≤ Z ≤ 1)

P (Z ≤ −1) P (Z ≥ 1)
.34 .34
.16 .16
z
−1 1

5.2 Using R to compute Φ(z).

# Use the R function pnorm(x, µ, σ) to compute F (x) for N(µ, σ 2 )

pnorm(1,0,1)
[1] 0.8413447

pnorm(0,0,1)
[1] 0.5
pnorm(1,0,2)
[1] 0.6914625

pnorm(1,0,1) - pnorm(-1,0,1)
[1] 0.6826895
pnorm(5,0,5) - pnorm(-5,0,5)
[1] 0.6826895
# Of course z can be a vector of values
pnorm(c(-3,-2,-1,0,1,2,3),0,1)
[1] 0.001349898 0.022750132 0.158655254 0.500000000 0.841344746 0.977249868 0.998650102

Note: The R function pnorm(x, µ, σ) uses σ whereas our notation for the normal distri-
bution N(µ, σ 2 ) uses σ 2 .
Here’s a table of values with fewer decimal points of accuracy
z: -2 -1 0 .3 .5 1 2 3
Φ(z): 0.0228 0.1587 0.5000 0.6179 0.6915 0.8413 0.9772 0.9987

Example 2. Use R to compute P (−1.5 ≤ Z ≤ 2).

answer: This is Φ(2) − Φ(−1.5) = pnorm(2,0,1) - pnorm(-1.5,0,1) = 0.91044

6 Pareto and other distributions

In 18.05, we only have time to work with a few of the many wonderful distributions that are
used in probability and statistics. We hope that after this course you will feel comfortable
learning about new distributions and their properties when you need them. Wikipedia is
often a great starting point.
The Pareto distribution is one common, beautiful distribution that we will not have time
to cover in depth.
18.05 class 5, Gallery of Continuous Random Variables, Spring 2014 6

1. Parameters: m > 0 and α > 0.

2. Range: [m, ∞).

3. Notation: Pareto(m, α).

α mα
4. Density: f (x) = .
xα+1
5. Distribution: (easy integral)

mα
F (x) = 1 − , for x ≥ m
xα

6. Tail distribution: P (X > x) = mα /xα , for x ≥ m.

7. Models: The Pareto distribution models a power law, where the probability that
an event occurs varies as a power of some attribute of the event. Many phenomena
follow a power law, such as the size of meteors, income levels across a population, and
population levels across cities. See Wikipedia for loads of examples:
http://en.wikipedia.org/wiki/Pareto_distribution#Applications
Manipulating Continuous Random Variables
Class 5, 18.05, Spring 2014
Jeremy Orloﬀ and Jonathan Bloom

1 Learning Goals
1. Be able to find the pdf and cdf of a random variable defined in terms of a random variable
with known pdf and cdf.

2 Transformations of Random Variables

If Y = aX +b then the properties of expectation and variance tell us that E(Y ) = aE(X)+b
and Var(Y ) = a2 Var(X). But what is the distribution function of Y ? If Y is continuous,
what is its pdf?
Often, when looking at transforms of discrete random variables we work with tables.
For continuous random variables transforming the pdf is just change of variables (‘u-
substitution’) from calculus. Transforming the cdf makes direct use of the definition of
the cdf.
Let’s remind ourselves of the basics:
1. The cdf of X is FX (x) = P (X ≤ x).
2. The pdf of X is related to FX by fX (x) = FX′ (x).

Example 1. Let X ∼ U (0, 2), so fX (x) = 1/2 and FX (x) = x/2 on [0,2]. What is the
range, pdf and cdf of Y = X 2 ?
answer: The range is easy: [0, 4].
To find the cdf we work systematically from the definition.
√ √ √
FY (y) = P (Y ≤ y) = P (X 2 ≤ y) = P (X ≤ y) = FX ( y) = y/2.

To find the pdf we can just diﬀerentiate the cdf

d 1
fY (y) = FY (y) = √ .
dy 4 y

An alternative way to find the pdf directly is by change of variables. The trick here is to
remember that it is fX (x)dx which gives probability (fX (x) by itself is probability density).
Here is how the calculation goes in this example.
dy
y = x2 ⇒ dy = 2x dx ⇒ dx = √
2 y
dx dy
fX (x) dx = = √ = fY (y) dy
2 4 y

dy
Therefore fY (y) = √
4 y

1
18.05 class 5, Manipulating Continuous Random Variables, Spring 2014 2

Example 2. Let X ∼ exp(λ), so fX (x) = λe−λx on [0, ∞]. What is the density of Y = X 2 ?
answer: Let’s do this using the change of variables.
dy
y = x2 ⇒ dy = 2x dx ⇒ dx = √
2 y
√ dy
fX (x) dx = λe−λx dx = λe−λ y √ = fY (y) dy
2 y

λ √
Therefore fY (y) = √ e−λ y .
2 y

X −5
Example 3. Assume X ∼ N(5, 32 ). Show that Z = is standard normal, i.e.,
3
Z ∼ N(0, 1).
answer: Again using the change of variables and the formula for fX (x) we have

x−5 dx
z= ⇒ dz = ⇒ dx = 3 dz
3 3
1 2 2 1 2 1 2
fX (x) dx = √ e−(x−5) /(2·3 ) dx = √ e−z /2 3 dz = √ e−z /2 dz = fZ (z) dz
3 2π 3 2π 2π
1 2
Therefore fZ (z) = √ e−z /2 . Since this is exactly the density for N(0, 1) we have shown
2π
that Z is standard normal.

This example shows an important general property of normal random variables which we
give in the next example.
X −µ
Example 4. Assume X ∼ N(µ, σ 2 ). Show that Z = is standard normal, i.e.,
σ
Z ∼ N(0, 1).
answer: This is exactly the same computation as the previous example with µ replacing 5
and σ replacing 3. We show the computation without comment.
x−µ dx
z= ⇒ dz = ⇒ dx = σ dz
σ σ
1 2 2 1 2 1 2
fX (x) dx = √ e−(x−µ) /(2·σ ) dx = √ e−z /2 σ dz = √ e−z /2 dz = fZ (z) dz
σ 2π σ 2π 2π
1 2
Therefore fZ (z) = √ e−z /2 . This shows Z is standard normal.
2π
Expectation, Variance and Standard Deviation for
Continuous Random Variables
Class 6, 18.05
Jeremy Orlo↵ and Jonathan Bloom

1 Learning Goals

1. Be able to compute and interpret expectation, variance, and standard deviation for
continuous random variables.

2. Be able to compute and interpret quantiles for discrete and continuous random variables.

2 Introduction

So far we have looked at expected value, standard deviation, and variance for discrete
random variables. These summary statistics have the same meaning for continuous random
variables:

• The expected value µ = E(X) is a measure of location or central tendency.

• The standard deviation is a measure of the spread or scale.

• The variance 2 = Var(X) is the square of the standard deviation.

To move from discrete to continuous, we will simply replace the sums in the formulas by
integrals. We will do this carefully and go through many examples in the following sections.
In the last section, we will introduce another type of summary statistic, quantiles. You may
already be familiar with the .5 quantile of a distribution, otherwise known as the median
or 50th percentile.

3 Expected value of a continuous random variable

Definition: Let X be a continuous random variable with range [a, b] and probability
density function f (x). The expected value of X is defined by
Z b
E(X) = xf (x) dx.
a

Let’s see how this compares with the formula for a discrete random variable:
n
X
E(X) = xi p(xi ).
i=1

The discrete formula says to take a weighted sum of the values xi of X, where the weights are
the probabilities p(xi ). Recall that f (x) is a probability density. Its units are prob/(unit of X).

1
18.05 class 6, Expectation and Variance for Continuous Random Variables 2

So f (x) dx represents the probability that X is in an infinitesimal range of width dx around

x. Thus we can interpret the formula for E(X) as a weighted integral of the values x of X,
where the weights are the probabilities f (x) dx.
As before, the expected value is also called the mean or average.

3.1 Examples

Let’s go through several example computations. Where the solution requires an integration
technique, we push the computation of the integral to the appendix.
Example 1. Let X ⇠ uniform(0, 1). Find E(X).
answer: X has range [0, 1] and density f (x) = 1. Therefore,
Z 1 1
x2 1
E(X) = x dx = = .
0 2 0 2
Not surprisingly the mean is at the midpoint of the range.

Example 2. Let X have range [0, 2] and density 38 x2 . Find E(X).

answer:
Z 2 Z 2 2
3 3 3x4 3
E(X) = xf (x) dx = x dx = = .
0 0 8 32 0 2
Does it make sense that this X has mean is in the right half of its range?
answer: Yes. Since the probability density increases as x increases over the range, the
average value of x should be in the right half of the range.
f (x)

x
1 µ = 1.5

µ is “pulled” to the right of the midpoint 1 because there is more mass to the right.

Example 3. Let X ⇠ exp( ). Find E(X).

answer: The range of X is [0, 1) and its pdf is f (x) = e x. So (details in appendix)
Z 1 x 1
e 1
E(X) = e x dx = e x = .
0 0

x
f (x) = e

x
µ = 1/

Mean of an exponential random variable

18.05 class 6, Expectation and Variance for Continuous Random Variables 3

Example 4. Let Z ⇠ N(0, 1). Find E(Z).

1 z 2 /2
answer: The range of Z is ( 1, 1) and its pdf is (z) = p e . So (details in
2⇡
appendix) Z 1 1
1 z 2 /2 1 z 2 /2
E(Z) = p ze dz = p e = 0.
1 2⇡ 2⇡ 1

(z)

z
µ=0

The standard normal distribution is symmetric and has mean 0.

3.2 Properties of E(X)

The properties of E(X) for continuous random variables are the same as for discrete ones:
1. If X and Y are random variables on a sample space ⌦ then
E(X + Y ) = E(X) + E(Y ). (linearity I)
2. If a and b are constants then
E(aX + b) = aE(X) + b. (linearity II)

Example 5. In this example we verify that for X ⇠ N(µ, ) we have E(X) = µ.

answer: Example (4) showed that for standard normal Z, E(Z) = 0. We could mimic
the calculation there to show that E(X) = µ. Instead we will use the linearity properties
of E(X). In the class 5 notes on manipulating random variables we showed that if X ⇠
N(µ, 2 ) is a normal random variable we can standardize it:
X µ
Z= ⇠ N(0, 1).

Inverting this formula we have X = Z + µ. The linearity of expected value now gives
E(X) = E( Z + µ) = E(Z) + µ = µ

3.3 Expectation of Functions of X

This works exactly the same as the discrete case. if h(x) is a function then Y = h(X) is a
random variable and Z 1
E(Y ) = E(h(X)) = h(x)fX (x) dx.
1

Example 6. Let X ⇠ exp( ). Find E(X 2 ).

answer: Using integration by parts we have
Z 1  1
2 2 x 2x 2 2
E(X ) = x e dx = x2 e x
e x
2
e x
= 2
.
0 0
18.05 class 6, Expectation and Variance for Continuous Random Variables 4

4 Variance

Now that we’ve defined expectation for continuous random variables, the definition of vari-
ance is identical to that of discrete random variables.
Definition: Let X be a continuous random variable with mean µ. The variance of X is

Var(X) = E((X µ)2 ).

4.1 Properties of Variance

These are exactly the same as in the discrete case.

1. If X and Y are independent then Var(X + Y ) = Var(X) + Var(Y ).
2. For constants a and b, Var(aX + b) = a2 Var(X).
3. Theorem: Var(X) = E(X 2 ) E(X)2 = E(X 2 ) µ2 .
For Property 1, note carefully the requirement that X and Y are independent.
Property 3 gives a formula for Var(X) that is often easier to use in hand calculations. The
proofs of properties 2 and 3 are essentially identical to those in the discrete case. We will
not give them here.

Example 7. Let X ⇠ uniform(0, 1). Find Var(X) and X.

answer: In Example 1 we found µ = 1/2. Next we compute

Z 1
2 1
Var(X) = E((X µ) ) = (x 1/2)2 dx = .
0 12

Example 8. Let X ⇠ exp( ). Find Var(X) and X.

answer: In Examples 3 and 6 we computed

Z 1 Z 1
1 2
E(X) = x e x dx = and E(X ) = 2
x2 e x
dx = 2
.
0 0

So by Property 3,
2 1 1 1
Var(X) = E(X 2 ) E(X)2 = 2 2
= 2
and X = .
R1
We could have skipped Property 3 and computed this directly from Var(X) = 0 (x 1/ )2 e x dx.

Example 9. Let Z ⇠ N(0, 1). Show Var(Z) = 1.

Note: The notation for normal variables is X ⇠ N(µ, 2 ). This is certainly suggestive, but
as mathematicians we need to prove that E(X) = µ and Var(X) = 2 . Above we showed
E(X) = µ. This example shows that Var(Z) = 1, just as the notation suggests. In the next
example we’ll show Var(X) = 2 .
answer: Since E(Z) = 0, we have
Z 1
1 z 2 /2
Var(Z) = E(Z 2 ) = p z2e dz.
2⇡ 1
18.05 class 6, Expectation and Variance for Continuous Random Variables 5

(using integration by parts with u = z, v 0 = ze z 2 /2 ) u0 = 1, v = e z 2 /2 )

✓ ◆ Z 1
1 1 1
z 2 /2 z 2 /2
=p ze +p e dz.
2⇡ 1 2⇡ 1

The first term equals 0 because the exponential goes to zero much faster than z grows at
both ±1. The second term equals 1 because it is exactly the total probability integral of
the pdf '(z) for N(0, 1). So Var(X) = 1.

Example 10. Let X ⇠ N(µ, 2 ). Show Var(X) = 2.

answer: This is an exercise in change of variables. Letting z = (x µ)/ , we have

Z 1
1 2 2
Var(X) = E((X µ)2 ) = p (x µ)2 e (x µ) /2 dx
2⇡ 1
2 Z 1
2
=p z 2 e z /2 dz = 2 .
2⇡ 1

The integral in the last line is the same one we computed for Var(Z).

5 Quantiles

Definition: The median of X is the value x for which P (X  x) = 0.5, i.e. the value
of x such that P (X  X) = P (X x). In other words, X has equal probability of
being above or below the median, and each probability is therefore 1/2. In terms of the
cdf F (x) = P (X  x), we can equivalently define the median as the value x satisfying
F (x) = 0.5.
Think: What is the median of Z?
answer: By symmetry, the median is 0.

Example 11. Find the median of X ⇠ exp( ).

answer: The cdf of X is F (x) = 1 e x . So the median is the value of x for which
F (x) = 1 e x = 0.5.. Solving for x we find: x = (ln 2)/ .
Think: In this case the median does not equal the mean of µ = 1/ . Based on the graph
of the pdf of X can you argue why the median is to the left of the mean.

Definition: The pth quantile of X is the value qp such that P (X  qp ) = p.

Notes. 1. In this notation the median is q0.5 .
2. We will usually write this in terms of the cdf: F (qp ) = p.
With respect to the pdf f (x), the quantile qp is the value such that there is an area of p to
the left of qp and an area of 1 p to the right of qp . In the examples below, note how we
can represent the quantile graphically using either area of the pdf or height of the cdf.

Example 12. Find the 0.6 quantile for X ⇠ U (0, 1).

18.05 class 6, Expectation and Variance for Continuous Random Variables 6

answer: The cdf for X is F (x) = x on the range [0,1]. So q0.6 = 0.6.
f (x) F (x)

left tail area = prob = 0.6

F (q0.6 ) = 0.6

x x
q0.6 = 0.6 q0.6 = 0.6

q0.6 : left tail area = 0.6 , F (q0.6 ) = 0.6

Example 13. Find the 0.6 quantile of the standard normal distribution.
answer: We don’t have a formula for the cdf, so we use the R ‘quantile function’ qnorm.

q0.6 = qnorm(0.6, 0, 1) = 0.25335

(z)
(z)
1
left tail area = prob. = .6 F (q0.6 ) = 0.6

z z
q0.6 = 0.253 q0.6 = 0.253

q0.6 : left tail area = 0.6 , F (q.6 ) = 0.6

Quantiles give a useful measure of location for a random variable. We will use them more
in coming lectures.

5.1 Percentiles, deciles, quartiles

For convenience, quantiles are often described in terms of percentiles, deciles or quartiles.
The 60th percentile is the same as the 0.6 quantile. For example you are in the 60th percentile
for height if you are taller than 60 percent of the population, i.e. the probability that you
are taller than a randomly chosen person is 60 percent.
Likewise, deciles represent steps of 1/10. The third decile is the 0.3 quantile. Quartiles are
in steps of 1/4. The third quartile is the 0.75 quantile and the 75th percentile.

6 Appendix: Integral Computation Details

Example 3: Let X ⇠ exp( ). Find E(X).

The range of X is [0, 1) and its pdf is f (x) = e x . Therefore
Z 1 Z 1
E(X) = xf (x) dx = xe x dx
0 0
18.05 class 6, Expectation and Variance for Continuous Random Variables 7

(using integration by parts with u = x, v 0 = e x ) u0 = 1, v = e x)

Z 1
1
x x
= xe + e dx
0 0
e x 1 1
=0 = .
0

We used the fact that xe x and e x go to 0 as x ! 1.

Example 4: Let Z ⇠ N(0, 1). Find E(Z).

1 2
The range of Z is ( 1, 1) and its pdf is (z) = p e z /2 . By symmetry the mean must
2⇡
be 0. The only mathematically tricky part is to show that the integral converges, i.e. that
the mean exists at all (some random variable do not have means, but we will not encounter
this very often.) For completeness we include the argument, though this is not something
we will ask you to do. We first compute the integral from 0 to 1:
Z 1 Z 1
1 2
z (z) dz = p ze z /2 dz.
0 2⇡ 0

The u-substitution u = z 2 /2 gives du = z dz. So the integral becomes

Z 1 Z 1
1 z 2 /2 1 1
p ze dz. = p e u du = e u 0 = 1
2⇡ 0 2⇡ 0
Z 0
Similarly, z (z) dz = 1. Adding the two pieces together gives E(Z) = 0.
1

Example 6: Let X ⇠ exp( ). Find E(X 2 ).

Z 1 Z 1
2 2
E(X ) = x f (x) dx = x2 e x
dx
0 0

(using integration by parts with u = x2 , v 0 = e x ) u0 = 2x, v = e x)

Z 1
1
= x2 e x
+ 2xe x
dx
0 0

(the first term is 0, for the second term use integration by parts: u = 2x, v 0 = e x )
x
u0 = 2, v = e )

x 1 Z 1 x
e e
= 2x + dx
0 0
e x 1 2
=0 2 2
= 2
.
0
Central Limit Theorem and the Law of Large Numbers
Class 6, 18.05
Jeremy Orlo↵ and Jonathan Bloom

1 Learning Goals

1. Understand the statement of the law of large numbers.

2. Understand the statement of the central limit theorem.

3. Be able to use the central limit theorem to approximate probabilities of averages and
sums of independent identically-distributed random variables.

2 Introduction

We all understand intuitively that the average of many measurements of the same unknown
quantity tends to give a better estimate than a single measurement. Intuitively, this is
because the random error of each measurement cancels out in the average. In these notes
we will make this intuition precise in two ways: the law of large numbers (LoLN) and the
central limit theorem (CLT).
Briefly, both the law of large numbers and central limit theorem are about many independent
samples from same distribution. The LoLN tells us two things:

1. The average of many independent samples is (with high probability) close to the mean
of the underlying distribution.

2. This density histogram of many independent samples is (with high probability) close
to the graph of the density of the underlying distribution.

To be absolutely correct mathematically we need to make these statements more precise,

but as stated they are a good way to think about the law of large numbers.
The central limit theorem says that the sum or average of many independent copies of a
random variable is approximately a normal random variable. The CLT goes on to give
precise values for the mean and standard deviation of the normal variable.
These are both remarkable facts. Perhaps just as remarkable is the fact that often in practice
n does not have to all that large. Values of n > 30 often suffice.

2.1 There is more to experimentation than mathematics

The mathematics of the LoLN says that the average of a lot of independent samples from a
random variable will almost certainly approach the mean of the variable. The mathematics
cannot tell us if the tool or experiment is producing data worth averaging. For example,
if the measuring device is defective or poorly calibrated then the average of many mea-
surements will be a highly accurate estimate of the wrong thing! This is an example of

1
18.05 class 6, Central Limit Theorem and the Law of Large Numbers, Spring 2014 2

systematic error or sampling bias, as opposed to the random error controlled by the law of
large numbers.

3 The law of large numbers

Suppose X1 , X2 , . . . , Xn are independent random variables with the same underlying

distribution. In this case, we say that the Xi are independent and identically-distributed,
or i.i.d. In particular, the Xi all have the same mean µ and standard deviation .
Let X n be the average of X1 , . . . , Xn :
n
X1 + X2 + · · · + Xn 1 X
Xn = = Xi .
n n
i=1

Note that X n is itself a random variable. The law of large numbers and central limit
theorem tell us about the value and distribution of X n , respectively.
LoLN: As n grows, the probability that X n is close to µ goes to 1.
CLT: As n grows, the distribution of X n converges to the normal distribution N (µ, 2 /n).

Before giving a more formal statement of the LoLN, let’s unpack its meaning through a
concrete example (we’ll return to the CLT later on).

Example 1. Averages of Bernoulli random variables

Suppose each Xi is an independent flip of a fair coin, so Xi ⇠ Bernoulli(0.5) and µ = 0.5.
Then X n is the proportion of heads in n flips, and we expect that this proportion is close to
0.5 for large n. Randomness being what it is, this is not guaranteed; for example we could
get 1000 heads in 1000 flips, though the probability of this occurring is very small.
So our intuition translates to: with high probability the sample average X n is close to the
mean 0.5 for large n. We’ll demonstrate by doing some calculations in R. You can find the
code used for ‘class 6 prep’ in the usual place on our site.
To start we’ll look at the probability of being within 0.1 of the mean. We can express this
probability as

P (|X n 0.5| < 0.1) or equivalently P (0.4  X n  0.6)

The law of large numbers says that this probability goes to 1 as the number of flips n gets
large. Our R code produces the following values for P (0.4  X n  0.6).
n = 10: pbinom(6, 10, 0.5) - pbinom(3, 10, 0.5) = 0.65625
n = 50: pbinom(30, 50, 0.5) - pbinom(19, 50, 0.5) = 0.8810795
n = 100: pbinom(60, 100, 0.5) - pbinom(39, 100, 0.5) = 0.9647998
n = 500: pbinom(300, 500, 0.5) - pbinom(199, 500, 0.5) = 0.9999941
n = 1000: pbinom(600, 1000, 0.5) - pbinom(399, 1000, 0.5) = 1
As predicted by the LoLN the probability goes to 1 as n grows.
We redo the computations to see the probability of being within 0.01 of the mean. Our R
code produces the following values for P (0.49  X n  0.51).
18.05 class 6, Central Limit Theorem and the Law of Large Numbers, Spring 2014 3

n = 10: pbinom(5, 10, 0.5) - pbinom(4, 10, 0.5) = 0.2460937

n = 100: pbinom(51, 100, 0.5) - pbinom(48, 100, 0.5) = 0.2356466
n = 1000: pbinom(510, 1000, 0.5) - pbinom(489, 1000, 0.5) = 0.49334
n = 10000: pbinom(5100, 10000, 0.5) - pbinom(4899, 10000, 0.5) = 0.9555742
Again we see the probability of being close to the mean going to 1 as n grows. Since 0.01
is smaller than 0.1 it takes larger values of n to raise the probability to near 1.
This convergence of the probability to 1 is the LoLN in action! Whenever you’re confused,
it will help you to keep this example in mind. So we see that the LoLN says that with high
probability the average of a large number of independent trials from the same distribution
will be very close to the underlying mean of the distribution. Now we’re ready for the
formal statement.

3.1 Formal statement of the law of large numbers

Theorem (Law of Large Numbers): Suppose X1 , X2 , . . . , Xn , . . . are i.i.d. random

variables with mean µ and variance 2 . For each n, let X n be the average of the first n
variables. Then for any a > 0, we have

lim P (|X n µ| < a) = 1.

n!1

This says precisely that as n increases the probability of being within a of the mean goes
to 1. Think of a as a small tolerance of error from the true mean µ. In our example, if we
want the probability to be at least p = 0.99999 that the proportion of heads X ¯ n is within
a = 0.1 of µ = 0.5, then n > N = 500 is large enough. If we decrease the tolerance a and/or
increase the probability p, then N will need to be larger.

4 Histograms

We can summarize multiple samples x1 , . . . , xn of a random variable in a histogram. Here

we want to carefully construct histograms so that they resemble the area under the pdf.
We will then see how the LoLN applies to histograms.
The step-by-step instructions for constructing a density histogram are as follows.

1. Pick an interval of the real line and divide it into m intervals, with endpoints b0 , b1 , . . . ,
bm . Usually these are equally sized, so let’s assume this to start.
x
b0 b1 b2 b3 b4 b5 b6

Equally-sized bins
Each of the intervals is called a bin. For example, in the figure above the first bin is
[b0 , b1 ] and the last bin is [b5 , b6 ]. Each bin has a bin width, e.g. b1 b0 is the first bin
width. Usually the bins all have the same width, called the bin width of the histogram.

2. Place each xi into the bin that contains its value. If xi lies on the boundary of two bins,
we’ll put it in the left bin (this is the R default, though it can be changed).
18.05 class 6, Central Limit Theorem and the Law of Large Numbers, Spring 2014 4

3. To draw a frequency histogram: put a vertical bar above each bin. The height of the
bar should equal the number of xi in the bin.

4. To draw a density histogram: put a vertical bar above each bin. The area of the bar
should equal the fraction of all data points that lie in the bin.

Notes:
1. When all the bins have the same width, the frequency histogram bars have area propor-
tional to the count. So the density histogram results from simply by dividing the height of
each bar by the total area of the frequency histogram. Ignoring the vertical scale, the two
histograms look identical.
2. Caution: if the bin widths di↵er, the frequency and density histograms may look very
di↵erent. There is an example below. Don’t let anyone fool you by manipulating bin widths
to produce a histogram that suits their mischievous purposes!
In 18.05, we’ll stick with equally-sized bins. In general, we prefer the density histogram
since its vertical scale is the same as that of the pdf.
Examples. Here are some examples of histograms, all with the data [0.5,1,1,1.5,1.5,1.5,2,2,2,2].
The R code that drew them is in the R file ’class6-prep.r’. You can find the file in the usual
place on our site.
1. Here the frequency and density plots look the same but have di↵erent vertical scales.
Histogram of x
Histogram of x
4

0.8
3

0.6
Frequency

Density

0.4
2

0.2
1

0.0
0

0.5 1.0 1.5 2.0

x x

Bins centered at 0.5, 1, 1.5, 2, i.e. width 0.5, bounds at 0.25, 0.75, 1.25, 1.75, 2.25.

2. Here each value is on a bin boundary. Note the values are all on the bin boundaries
and are put into the left-hand bin. That is, the bins are right-closed, e.g the first bin is for
values in the right-closed interval (0, 0.5].
18.05 class 6, Central Limit Theorem and the Law of Large Numbers, Spring 2014 5

Histogram of x Histogram of x

0.8
4

0.6
3
Frequency

Density

0.4
2

0.2
1

0.0
0

0.0 0.5 1.0 1.5 2.0 0.0 0.5 1.0 1.5 2.0

x x

Bin bounds at 0, 0.5, 1, 1.5, 2.

3. Here we show density histograms based on di↵erent bin widths. Note that the scale
keeps the total area equal to 1. The gaps are bins with zero counts.

Histogram of x Histogram of x
1.5
0.6

1.0
Density

Density
0.4

0.5
0.2
0.0

0.0

0.0 0.5 1.0 1.5 2.0 0.5 1.0 1.5 2.0

x x

Left: wide bins; Right: narrow bins.

4. Here we use unqual bin widths, so the frequency and density histograms look di↵erent

Histogram of x Histogram of x
0.8
5
4

0.6
Frequency

Density
3

0.4
2

0.2
1

0.0
0

0.0 0.5 1.0 1.5 2.0 0.0 0.5 1.0 1.5 2.0

x x

Don’t be fooled! These are based on the same data.

The density histogram is the better choice with unequal bin widths. In fact, R will complain
18.05 class 6, Central Limit Theorem and the Law of Large Numbers, Spring 2014 6

if you try to make a frequency histogram with unequal bin widths. Compare the frequency
histogram with unequal bin widths with all the other histograms we drew for this data. It
clearly looks di↵erent. What happened is that by combining the data in bins (0.5, 1] and
(1, 1.5] into one bin (0.5, 1.5) we e↵ectively made the height of both smaller bins greater.
The reason the density histogram is nice is discussed in the next section.

4.1 The law of large numbers and histograms

The law of large number has an important consequence for density histograms.
LoLN for histograms: With high probability the density histogram of a large number
of samples from a distribution is a good approximation of the graph of the underlying pdf
f (x).
Let’s illustrate this by generating a density histogram with bin width 0.1 from 100000 draws
from a standard normal distribution. As you can see, the density histogram very closely
tracks the graph of the standard normal pdf (z).

0.5

0.4

0.3

0.2

0.1

0
-4 -3 -2 -1 0 1 2 3 4

Density histogram of 10000 draws from a standard normal distribution, with (z) in red.

5 The Central Limit Theorem

We now prepare for the statement of the CLT.

5.1 Standardization

Given a random variable X with mean µ and standard deviation , we define its standard-
ization of X as the new random variable
X µ
Z= .

Note that Z has mean 0 and standard deviation 1. Note also that if X has a normal
distribution, then the standardization of X is the standard normal distribution Z with
18.05 class 6, Central Limit Theorem and the Law of Large Numbers, Spring 2014 7

mean 0 and variance 1. This explains the term ‘standardization’ and the notation of Z
above.

5.2 Statement of the Central Limit Theorem

Suppose X1 , X2 , . . . , Xn , . . . are i.i.d. random variables each having mean µ and standard
deviation . For each n let Sn denote the sum and let X n be the average of X1 , . . . , Xn .
n
X
Sn = X 1 + X 2 + . . . + X n = Xi
i=1
X1 + X2 + . . . + Xn Sn
Xn = = .
n n
The properties of mean and variance show
2, p
E(Sn ) = nµ, Var(Sn ) =n Sn = n
2
E(X n ) = µ, Var(X n ) = , Xn =p .
n n

Since they are multiples of each other, Sn and X n have the same standardization

Sn nµ Xn µ
Zn = p = p
n / n
Central Limit Theorem: For large n,
2 2
X n ⇡ N(µ, /n), Sn ⇡ N(nµ, n ), Zn ⇡ N(0, 1).

Notes: 1. In words: X n is approximately a normal distribution with the same mean as X

but a smaller variance.
2. Sn is approximately normal.
3. Standardized X n and Sn are approximately standard normal.
The central limit theorem allows us to approximate a sum or average of i.i.d random vari-
ables by a normal random variable. This is extremely useful because it is usually easy to
do computations with the normal distribution.

A precise statement of the CLT is that the cdf’s of Zn converge to (z):

lim FZn (z) = (z).

n!1

The proof of the Central Limit Theorem is more technical than we want to get in 18.05. It
is accessible to anyone with a decent calculus background.

5.3 Standard Normal Probabilities

To apply the CLT, we will want to have some normal probabilities at our fingertips. The
following probabilities appeared in Class 5. Let Z ⇠ N (0, 1), a standard normal random
variable. Then with rounding we have:
18.05 class 6, Central Limit Theorem and the Law of Large Numbers, Spring 2014 8

1. P (|Z| < 1) = 0.68

2. P (|Z| < 2) = 0.95; more precisely P (|Z| < 1.96) ⇡ .95.
3. P (|Z| < 3) = 0.997
These numbers are easily compute in R using pnorm. However, they are well worth remem-
bering as rules of thumb. You should think of them as:

1. The probability that a normal random variable is within 1 standard deviation of its
mean is 0.68.
2. The probability that a normal random variable is within 2 standard deviations of its
mean is 0.95.
3. The probability that a normal random variable is within 3 standard deviations of its
mean is 0.997.

This is shown graphically in the following figure.

within 1 · ⇡ 68%

Normal PDF within 2 · ⇡ 95%

within 3 · ⇡ 99%
68%

95%

99%
z
3 2 2 3

Claim: From these numbers we can derive:

1. P (Z < 1) ⇡ 0.84
2. P (Z < 2) ⇡ 0.977
3. P (Z < 3) ⇡ 0.999
Proof: We know P (|Z| < 1) = 0.68. The remaining probability of 0.32 is in the two regions
Z > 1 and Z < 1. These regions are referred to as the right-hand tail and the left-hand
tail respectively. By symmetry each tail has area 0.16. Thus,
P (Z < 1) = P (|Z| < 1) + P (left-hand tail) = 0.84
The other two cases are handled similarly.

5.4 Applications of the CLT

Example 2. Flip a fair coin 100 times. Estimate the probability of more than 55 heads.
answer: Let Xj be the result of the j th flip, so Xj = 1 for heads and Xj = 0 for tails. The
total number of heads is
S = X1 + X2 + . . . + X100 .
We know E(Xj ) = 0.5 and Var(Xj ) = 1/4. Since n = 100, we have
E(S) = 50, Var(S) = 25 and S = 5.
18.05 class 6, Central Limit Theorem and the Law of Large Numbers, Spring 2014 9

The central limit theorem says that the standardization of S is approximately N(0, 1). The
question asks for P (S > 55). Standardizing and using the CLT we get
✓ ◆
S 50 55 50
P (S > 55) = P > ⇡ P (Z > 1) = 0.16.
5 5

Here Z is a standard normal random variable and P (Z > 1) = 1 P (Z < 1) ⇡ 0.16.

Example 3. Estimate the probability of more than 220 heads in 400 flips.
answer: This is nearly identical to the previous example. Now µS = 200 and S = 10 and
we want P (S > 220). Standardizing and using the CLT we get:
✓ ◆
S µS 220 200
P (S > 220) = P > ⇡ P (Z > 2) = .025.
S 10

Again, Z ⇠ N(0, 1) and the rules of thumb show P (Z > 2) = .025.

Note: Even though 55/100 = 220/400, the probability of more than 55 heads in 100 flips
is larger than the probability of more than 220 heads in 400 flips. This is due to the LoLN
and the larger value of n in the latter case.
Example 4. Estimate the probability of between 40 and 60 heads in 100 flips.
answer: As in the first example, E(S) = 50, Var(S) = 25 and S = 5. So
✓ ◆
40 100 S 50 60 50
P (40  S  60) = P   ⇡ P ( 2  Z  2)
5 5 5

We can compute the right-hand side using our rule of thumb. For a more accurate answer
we use R:
pnorm(2) - pnorm(-2) = 0.954 . . .

Recall that in Section 3 we used the binomial distribution to compute an answer of 0.965. . . .
So our approximate answer using CLT is o↵ by about 1%.
Think: Would you expect the CLT method to give a better or worse approximation of
P (200 < S < 300) with n = 500?
We encourage you to check your answer using R.

Example 5. Polling. When taking a political poll the results are often reported as a
number with a margin of error. For example 52% ± 3% favor candidate A. The rule of
p
thumb is that if you poll n people then the margin of error is ±1/ n. We will now see
exactly what this means and that it is an application of the central limit theorem.
Suppose there are 2 candidates A and B. Suppose further that the fraction of the population
who prefer A is p0 . That is, if you ask a random person who they prefer then the probability
they’ll answer A is po
To run the poll a pollster selects n people at random and asks ’Do you support candidate
A or candidate B. Thus we can view the poll as a sequence of n independent Bernoulli(p0 )
trials, X1 , X2 , . . . , Xn , where Xi is 1 if the ith person prefers A and 0 if they prefer B. The
fraction of people polled that prefer A is just the average X.
18.05 class 6, Central Limit Theorem and the Law of Large Numbers, Spring 2014 10

We know that each Xi ⇠ Bernoulli(p0 ) so,

p
E(Xi ) = p0 and Xi = p0 (1 p0 ).

Therefore, the central limit theorem tells us that

p p
X ⇡ N(p0 , / n), where = p0 (1 p0 ).

In a normal distribution 95% of the probability is within 2 standard deviations of the mean.
p
This means that in 95% of polls of n people the sample mean X will be within 2 / n of
the true mean p0 . The final step is to note that for any value of p0 we have  1/2. (It is
an easy calculus exercise to see that 1/4 is the maximum value of 2 = p0 (1 p0 ).) This
means that we can conservatively say that in 95% of polls of n people the sample mean
p
X is within 1/ n of the true mean. The frequentist statistician then takes the interval
p
X ± 1/ n and calls it the 95% confidence interval for p0 .
A word of caution: it is tempting and common, but wrong, to think that there is a 95%
probability the true fraction p0 is in the confidence interval. This is subtle, but the error
is the same one as thinking you have a disease if a 95% accurate test comes back positive.
It’s true that 95% of people taking the test get the correct result. It’s not necessarily true
that 95% of positive tests are correct.

5.5 Why use the CLT

Since the probabilities in the above examples can be computed exactly using the binomial
distribution, you may be wondering what is the point of finding an approximate answer
using the CLT. In fact, we were only able to compute these probabilities exactly because
the Xi were Bernoulli and so the sum S was binomial. In general, the distribution of the
S will not be familiar, so you will not be able to compute the probabilities for S exactly; it
can also happen that the exact computation is possible in theory but too computationally
intensive in practice, even for a computer. The power of the CLT is that it applies when Xi
has almost any distribution. Though we will see in the next section that some distributions
may require larger n for the approximation to be a good one).

5.6 How big does n have to be to apply the CLT?

Short answer: often, not that big.

The following sequences of pictures show the convergence of averages to a normal distribu-
tion.
First we show the standardized average of n i.i.d. uniform random variables with n =
1, 2, 4, 8, 12. The pdf of the average is in blue and the standard normal pdf is in red. By
the time n = 12 the fit between the standardized average and the true normal looks very
good.
18.05 class 6, Central Limit Theorem and the Law of Large Numbers, Spring 2014 11

0.4 0.5 0.4

0.35 0.35
0.4
0.3 0.3
0.25 0.3 0.25
0.2 0.2
0.15 0.2 0.15
0.1 0.1
0.1
0.05 0.05
0 0 0
-3 -2 -1 0 1 2 3 -3 -2 -1 0 1 2 3 -3 -2 -1 0 1 2 3

0.4 0.4
0.35 0.35
0.3 0.3
0.25 0.25
0.2 0.2
0.15 0.15
0.1 0.1
0.05 0.05
0 0
-3 -2 -1 0 1 2 3 -3 -2 -1 0 1 2 3

Next we show the standardized average of n i.i.d. exponential random variables with
n = 1, 2, 4, 8, 16, 64. Notice that this asymmetric density takes more terms to converge to
the normal density.
1 0.7 0.5
0.6
0.8 0.4
0.5
0.6 0.4 0.3

0.4 0.3 0.2

0.2
0.2 0.1
0.1
0 0 0
-3 -2 -1 0 1 2 3 -3 -2 -1 0 1 2 3 -3 -2 -1 0 1 2 3

0.5 0.5 0.5

0.4 0.4 0.4

0.3 0.3 0.3

0.2 0.2 0.2

0.1 0.1 0.1

0 0 0
-3 -2 -1 0 1 2 3 -3 -2 -1 0 1 2 3 -3 -2 -1 0 1 2 3

Next we show the (non-standardized) average of n exponential random variables with

n = 1, 2, 4, 16, 64. Notice how this standard deviation shrinks as n grows, resulting in a
spikier (more peaked) density.
1 0.8 1
0.7
0.8 0.8
0.6
0.6 0.5 0.6
0.4
0.4 0.3 0.4
0.2
0.2 0.2
0.1
0 0 0
-2 -1 0 1 2 3 4 -2 -1 0 1 2 3 4 -1 0 1 2 3 4
18.05 class 6, Central Limit Theorem and the Law of Large Numbers, Spring 2014 12

2 3.5
3
1.5 2.5
2
1
1.5

0.5 1
0.5
0 0
0 0.5 1 1.5 2 2.5 3 3.5 4 0 0.5 1 1.5 2 2.5 3 3.5 4

The central limit theorem works for discrete variables also. Here is the standardized average
of n i.i.d. Bernoulli(.5) random variables with n = 1, 2, 12, 64. Notice that as n grows, the
average can take more values, which allows the discrete distribution to ’fill in’ the normal
density.
0.4 0.4 0.4
0.35 0.35 0.35
0.3 0.3 0.3
0.25 0.25 0.25
0.2 0.2 0.2
0.15 0.15 0.15
0.1 0.1 0.1
0.05 0.05 0.05
0 0 0
-3 -2 -1 0 1 2 3 -3 -2 -1 0 1 2 3 -3 -2 -1 0 1 2 3

0.4
0.35
0.3
0.25
0.2
0.15
0.1
0.05
0
-4 -3 -2 -1 0 1 2 3 4

Finally we show the (non-standardized) average of n Bernoulli(.5) random variables, with

n = 4, 12, 64. Notice how the standard deviation gets smaller resulting in a spikier (more
peaked) density.
1.4 3 7
1.2 2.5 6
1 5
2
0.8 4
1.5
0.6 3
1
0.4 2
0.2 0.5 1
0 0 0
-1 -0.5 0 0.5 1 1.5 2 -0.2 0 0.2 0.4 0.6 0.8 1 1.2 1.4 -0.2 0 0.2 0.4 0.6 0.8 1 1.2 1.4
Appendix
Class 6, 18.05
Jeremy Orlo↵ and Jonathan Bloom

1 Introduction

In this appendix we give more formal mathematical material that is not strictly a part of
18.05. This will not be on homework or tests. We give this material to emphasize that in
doing mathematics we should be careful to specify our hypotheses completely and give clear
deductive arguments to prove our claims. We hope you find it interesting and illuminating.

2 With high probability the density histogram resembles the

graph of the probability density function:

We stated that one consequence of the law of large numbers is that as the number of samples
increases the density histogram of the samples has an increasing probability of matching the
graph of the underlying pdf or pmf. This is a good rule of thumb, but it is rather imprecise.
It is possible to make more precise statements It will take some care to make a sensible and
precise statement, which will not be quite so sweeping.
Suppose we have an experiment that produces data according to the random variable X
and suppose we generate n independent samples from X. Call them

x1 , x 2 , . . . , x n .

By a bin we mean a range of values, i.e. [xk , xk+1 ). To make a density histogram of the
data we divide the range of X into m bins and calculate the fraction of the data in each
bin.
Now, let pk be the probability a random data point is in the kth bin. This is this probability
for an indicator (Bernoulli) random variable Bk,j which is 1 if the jth data point is in the
bin and and 0 otherwise.
Statement 1. Let p̄k be the fraction of the data in bin k. As the number n of data points
gets large the probability that x̄k is close to pk approaches 1. Said di↵erently, given any
small number, call it a the probability P (|p̄k pk | < a) depends on n, and as n goes to
infinity this probability goes to 1.
Proof. Let B ¯k be the average of Bk,j . Since E(Bk,j ) = pk , the law of large number says
exactly that
P (|B̄k pk | < a) approaches 1 as n goes to infinity.
But, since the Bk,j are indicator variables, their average is exactly p̄k , the fraction of the
data in bin k. Replacing B¯k by p̄k in the above equation gives

P (|p̄k pk | < a) approaches 1 as n goes to infinity.

This is exactly what statement 1 claimed.

1
18.05 class 6, Appendix, Spring 2014 2

Statement 2. The same statement holds for a finite number of bins simultaneously. That
is, for bins 1 to m we have
¯2 p2 | < a), . . . , (|B
P ( (|B̄1 p1 | < a), (|B ¯m pm | < a) ) approaches 1 as n goes to infinity.

Proof. First we note the following probability rule, which is a consequence of the inclusion
exclusion principle: If two events A and B have P (A) = 1 ↵1 and P (B) = 1 ↵2 then
P (A \ B) 1 (↵1 + ↵2 ).
Now, Statement 1 says that for any ↵ we can find n large enough that P (|B ¯k pk | < a) >
1 ↵/m for each bin separately. By the probability rule, the probability of the intersection
of all these events is at least 1 ↵. Since we can let ↵ be as small as we want by letting n
go to infinity, in the limit we get probability 1 as claimed.
Statement 3. If f (x) is a continuous probability density with range [a, b] then by taking
enough data and having a small enough bin width we can insure that with high probability
the density histogram is as close as we want to the graph of f (x).
Proof. We will only sketch the argument. Assume the bin around x has width is x. If
x is small enough then the probability a data point is in the bin is approximately f (x) x.
Statement 2 guarantees that if n is large enough then with high probability the fraction
of data in the bin is also approximately f (x) x. Since this is the area of the bin we see
that its height will be approximately f (x). That is, with high probability the height of the
histogram over any point x is close to f (x). This is what Statement 3 claimed.
Note. If the range is infinite or the density goes to infinity at some point we need to be
more careful. There are statements we could make for these cases.

3 The Chebyshev inequality

One proof of the LoLN follows from the following key inequality.
The Chebyshev inequality. Suppose Y is a random variable with mean µ and variance 2.

Then for any positive value a, we have

Var(Y )
P (|Y µ| a)  .
a2

In words, the Chebyshev inequality says that the probability that Y di↵ers from the mean
by more than a is bounded by Var(Y )/a2 . Morally, the smaller the variance of Y , the
smaller the probability that Y is far from its mean.

Proof of the LoLN: Since Var(X ¯ n ) = Var(X)/n, the variance of the average X¯ n goes to
¯
zero as n goes to infinity. So the Chebyshev inequality for Y = Xn and fixed a implies
that as n grows, the probability that X ¯ n is farther than a from µ goes to 0. Hence the
¯
probability that Xn is within a of µ goes to 1, which is the LoLN.
Proof of the Chebyshev inequality: The proof is essentially the same for discrete and
continuous Y . We’ll assume Y is continuous and also that µ = 0, since replacing Y by
18.05 class 6, Appendix, Spring 2014 3

Y µ does not change the variance. So

Z a Z 1 Z a Z 1
y2 y2
P (|Y | a) = f (y) dy + f (y) dy  f (y) dy + f (y) dy
1 a 1 a2 a a2
Z 1 2
y Var(Y )
 2
f (y) dy = .
1 a a2

The first inequality uses that y 2 /a2 1 on the intervals of integration. The second inequality
follows because including the range [ a, a] only makes the integral larger, since the integrand
is positive.

4 The need for variance

We didn’t lie to you, but we did gloss over one technical fact. Throughout we assumed
that the underlying distributions had a variance. For example, the proof of the law of
large numbers made use of the variance by way of the Chebyshev inequality. But there are
distributions which do not have a variance because the sum or integral for the variance does
not converge to a finite number. For such distributions the law of large numbers may not
be true. In 18.05 we won’t have to worry about this, but if you go deeper into statistics
this may become important.
Joint Distributions, Independence
Class 7, 18.05
Jeremy Orlo↵ and Jonathan Bloom

1 Learning Goals

1. Understand what is meant by a joint pmf, pdf and cdf of two random variables.

2. Be able to compute probabilities and marginals from a joint pmf or pdf.

3. Be able to test whether two random variables are independent.

2 Introduction

In science and in real life, we are often interested in two (or more) random variables at the
same time. For example, we might measure the height and weight of gira↵es, or the IQ
and birthweight of children, or the frequency of exercise and the rate of heart disease in
adults, or the level of air pollution and rate of respiratory illness in cities, or the number of
Facebook friends and the age of Facebook members.
Think: What relationship would you expect in each of the five examples above? Why?
In such situations the random variables have a joint distribution that allows us to compute
probabilities of events involving both variables and understand the relationship between the
variables. This is simplest when the variables are independent. When they are not, we use
covariance and correlation as measures of the nature of the dependence between them.

3 Joint Distribution

3.1 Discrete case

Suppose X and Y are two discrete random variables and that X takes values {x1 , x2 , . . . , xn }
and Y takes values {y1 , y2 , . . . , ym }. The ordered pair (X, Y ) take values in the product
{(x1 , y1 ), (x1 , y2 ), . . . (xn , ym )}. The joint probability mass function (joint pmf) of X and Y
is the function p(xi , yj ) giving the probability of the joint outcome X = xi , Y = yj .
We organize this in a joint probability table as shown:

1
18.05 class 7, Joint Distributions, Independence, Spring 2014 2

X\Y y1 y2 ... yj ... ym

x1 p(x1 , y1 ) p(x1 , y2 ) · · · p(x1 , yj ) · · · p(x1 , ym )
x2 p(x2 , y1 ) p(x2 , y2 ) · · · p(x2 , yj ) · · · p(x2 , ym )
··· ··· ··· ··· ··· ··· ···
··· ··· ··· ··· ··· ··· ···
xi p(xi , y1 ) p(xi , y2 ) ··· p(xi , yj ) ··· p(xi , ym )
··· ··· ··· ··· ··· ···
xn p(xn , y1 ) p(xn , y2 ) · · · p(xn , yj ) · · · p(xn , ym )

Example 1. Roll two dice. Let X be the value on the first die and let Y be the value on
the second die. Then both X and Y take values 1 to 6 and the joint pmf is p(i, j) = 1/36
for all i and j between 1 and 6. Here is the joint probability table:

X\Y 1 2 3 4 5 6
1 1/36 1/36 1/36 1/36 1/36 1/36
2 1/36 1/36 1/36 1/36 1/36 1/36
3 1/36 1/36 1/36 1/36 1/36 1/36
4 1/36 1/36 1/36 1/36 1/36 1/36
5 1/36 1/36 1/36 1/36 1/36 1/36
6 1/36 1/36 1/36 1/36 1/36 1/36

Example 2. Roll two dice. Let X be the value on the first die and let T be the total on
both dice. Here is the joint probability table:

X\T 2 3 4 5 6 7 8 9 10 11 12
1 1/36 1/36 1/36 1/36 1/36 1/36 0 0 0 0 0
2 0 1/36 1/36 1/36 1/36 1/36 1/36 0 0 0 0
3 0 0 1/36 1/36 1/36 1/36 1/36 1/36 0 0 0
4 0 0 0 1/36 1/36 1/36 1/36 1/36 1/36 0 0
5 0 0 0 0 1/36 1/36 1/36 1/36 1/36 1/36 0
6 0 0 0 0 0 1/36 1/36 1/36 1/36 1/36 1/36

A joint probability mass function must satisfy two properties:

1. 0  p(xi , yj )  1
2. The total probability is 1. We can express this as a double sum:
n X
X m
p(xi , yj ) = 1
i=1 j=1
18.05 class 7, Joint Distributions, Independence, Spring 2014 3

3.2 Continuous case

The continuous case is essentially the same as the discrete case: we just replace discrete sets
of values by continuous intervals, the joint probability mass function by a joint probability
density function, and the sums by integrals.
If X takes values in [a, b] and Y takes values in [c, d] then the pair (X, Y ) takes values in
the product [a, b] ⇥ [c, d]. The joint probability density function (joint pdf) of X and Y
is a function f (x, y) giving the probability density at (x, y). That is, the probability that
(X, Y ) is in a small rectangle of width dx and height dy around (x, y) is f (x, y) dx dy.
y
d
Prob. = f (x, y) dx dy

x
a b

A joint probability density function must satisfy two properties:

1. 0  f (x, y)
2. The total probability is 1. We now express this as a double integral:
Z dZ b
f (x, y) dx dy = 1
c a

Note: as with the pdf of a single random variable, the joint pdf f (x, y) can take values
greater than 1; it is a probability density, not a probability.
In 18.05 we won’t expect you to be experts at double integration. Here’s what we will
expect.

• You should understand double integrals conceptually as double sums.

• You should be able to compute double integrals over rectangles.

• For a non-rectangular region, when f (x, y) = c is constant, you should know that the
double integral is the same as the c ⇥ (the area of the region).

3.3 Events

Random variables are useful for describing events. Recall that an event is a set of outcomes
and that random variables assign numbers to outcomes. For example, the event ‘X > 1’
is the set of all outcomes for which X is greater than 1. These concepts readily extend to
pairs of random variables and joint outcomes.
18.05 class 7, Joint Distributions, Independence, Spring 2014 4

Example 3. In Example 1, describe the event B = ‘Y X 2’ and find its probability.

answer: We can describe B as a set of (X, Y ) pairs:

B = {(1, 3), (1, 4), (1, 5), (1, 6), (2, 4), (2, 5), (2, 6), (3, 5), (3, 6), (4, 6)}.

We can also describe it visually

The event B consists of the outcomes in the shaded squares.

The probability of B is the sum of the probabilities in the orange shaded squares, so
P (B) = 10/36.

Example 4. Suppose X and Y both take values in [0,1] with uniform density f (x, y) = 1.
Visualize the event ‘X > Y ’ and find its probability.
answer: Jointly X and Y take values in the unit square. The event ‘X > Y ’ corresponds
to the shaded lower-right triangle below. Since the density is constant, the probability is
just the fraction of the total area taken up by the event. In this case, it is clearly 0.5.
y
1

‘X > Y ’

x
1
The event ‘X > Y ’ in the unit square.
Example 5. Suppose X and Y both take values in [0,1] with density f (x, y) = 4xy. Show
f (x, y) is a valid joint pdf, visualize the event A = ‘X < 0.5 and Y > 0.5’ and find its
probability.
answer: Jointly X and Y take values in the unit square.
18.05 class 7, Joint Distributions, Independence, Spring 2014 5

y
1

x
1
The event A in the unit square.
To show f (x, y) is a valid joint pdf we must check that it is positive (which it clearly is)
and that the total probability is 1.
Z 1Z 1 Z 1 Z 1
⇥ 2 ⇤1
Total probability = 4xy dx dy = 2x y 0 dy = 2y dy = 1. QED
0 0 0 0

The event A is just the upper-left-hand quadrant. Because the density is not constant we
must compute an integral to find the probability.
Z .5 Z 1 Z .5 ⇥ Z .5
⇤1 3x 3
P (A) = 4xy dy dx = 2xy 2 .5 dx = dx = .
0 .5 0 0 2 16

3.4 Joint cumulative distribution function

Suppose X and Y are jointly-distributed random variables. We will use the notation ‘X 
x, Y  y’ to mean the event ‘X  x and Y  y’. The joint cumulative distribution function
(joint cdf) is defined as
F (x, y) = P (X  x, Y  y)

Continuous case: If X and Y are continuous random variables with joint density f (x, y)
over the range [a, b] ⇥ [c, d] then the joint cdf is given by the double integral
Z yZ x
F (x, y) = f (u, v) du dv.
c a

To recover the joint pdf, we di↵erentiate the joint cdf. Because there are two variables we
need to use partial derivatives:

@2F
f (x, y) = (x, y).
@x@y

Discrete case: If X and Y are discrete random variables with joint pmf p(xi , yj ) then the
joint cdf is give by the double sum
X X
F (x, y) = p(xi , yj ).
xi x yj y
18.05 class 7, Joint Distributions, Independence, Spring 2014 6

3.5 Properties of the joint cdf

The joint cdf F (x, y) of X and Y must satisfy several properties:

1. F (x, y) is non-decreasing: i.e. if x or y increase then F (x, y) must stay constant or

increase.
2. F (x, y) = 0 at the lower-left of the joint range.
If the lower left is ( 1, 1) then this means lim F (x, y) = 0.
(x,y)!( 1, 1)

3. F (x, y) = 1 at the upper-right of the joint range.

If the upper-right is (1, 1) then this means lim F (x, y) = 1.
(x,y)!(1,1)

Example 6. Find the joint cdf for the random variables in Example 5.
answer: The event ‘X  x and Y  y’ is a rectangle in the unit square.
y
1

(x, y)

‘X  x & Y  y’

x
1

To find the cdf F (x, y) we compute a double integral:

Z yZ x
F (x, y) = 4uv du dv = x2 y 2 .
0 0

Example 7. In Example 1, compute F (3.5, 4).

answer: We redraw the joint probability table. Notice how similar the picture is to the one
in the previous example.
F (3.5, 4) is the probability of the event ‘X  3.5 and Y  4’. We can visualize this event
as the shaded rectangles in the table:

The event ‘X  3.5 and Y  4’.

Adding up the probability in the shaded squares we get F (3.5, 4) = 12/36 = 1/3.
Note. One unfortunate di↵erence between the continuous and discrete visualizations is that
for continuous variables the value increases as we go up in the vertical direction while the
opposite is true for the discrete case. We have experimented with changing the discrete
tables to match the continuous graphs, but it causes too much confusion. We will just have
to live with the di↵erence!

3.6 Marginal distributions

When X and Y are jointly-distributed random variables, we may want to consider only one
of them, say X. In that case we need to find the pmf (or pdf or cdf) of X without Y . This
is called a marginal pmf (or pdf or cdf). The next example illustrates the way to compute
this and the reason for the term ‘marginal’.

3.7 Marginal pmf

Example 8. In Example 2 we rolled two dice and let X be the value on the first die and
T be the total on both dice. Compute the marginal pmf of X and of T .
answer: In the table each row represents a single value of X. So the event ‘X = 3’ is the
third row of the table. To find P (X = 3) we simply have to sum up the probabilities in this
row. We put the sum in the right-hand margin of the table. Likewise P (T = 5) is just the
sum of the column with T = 5. We put the sum in the bottom margin of the table.

X\T 2 3 4 5 6 7 8 9 10 11 12 p(xi )
1 1/36 1/36 1/36 1/36 1/36 1/36 0 0 0 0 0 1/6
2 0 1/36 1/36 1/36 1/36 1/36 1/36 0 0 0 0 1/6
3 0 0 1/36 1/36 1/36 1/36 1/36 1/36 0 0 0 1/6
4 0 0 0 1/36 1/36 1/36 1/36 1/36 1/36 0 0 1/6
5 0 0 0 0 1/36 1/36 1/36 1/36 1/36 1/36 0 1/6
6 0 0 0 0 0 1/36 1/36 1/36 1/36 1/36 1/36 1/6
p(tj ) 1/36 2/36 3/36 4/36 5/36 6/36 5/36 4/36 3/36 2/36 1/36 1

Computing the marginal probabilities P (X = 3) = 1/6 and P (T = 5) = 4/36.

Note: Of course in this case we already knew the pmf of X and of T . It is good to see that
our computation here is in agreement!
As motivated by this example, marginal pmf’s are obtained from the joint pmf by summing:
X X
pX (xi ) = p(xi , yj ), pY (yj ) = p(xi , yj )
j i

The term marginal refers to the fact that the values are written in the margins of the table.
18.05 class 7, Joint Distributions, Independence, Spring 2014 8

3.8 Marginal pdf

For a continous joint density f (x, y) with range [a, b] ⇥ [c, d], the marginal pdf’s are:
Z d Z b
fX (x) = f (x, y) dy, fY (y) = f (x, y) dx.
c a

Compare these with the marginal pmf’s above; as usual the sums are replaced by integrals.
We say that to obtain the marginal for X, we integrate out Y from the joint pdf and vice
versa.
Example 9. Suppose (X, Y ) takes values on the square [0, 1]⇥[1, 2] with joint pdf f (x, y) =
8 3
3 x y. Find the marginal pdf’s fX (x) and fY (y).
answer: To find fX (x) we integrate out y and to find fY (y) we integrate out x.
Z 2 
8 3 4 3 2 2
fX (x) = x y dy = x y = 4x3
1 3 3 1
Z 1 
8 3 2 4 1 1 2
fY (y) = x y dx = x y = y.
0 3 3 0 3

Example 10. Suppose (X, Y ) takes values on the unit square [0, 1] ⇥ [0, 1] with joint pdf
f (x, y) = 32 (x2 + y 2 ). Find the marginal pdf fX (x) and use it to find P (X < 0.5).
answer:
Z 1  1
3 2 2 3 2 y3 3 2 1
fX (x) = (x + y ) dy = x y+ = x + .
0 2 2 2 0 2 2
Z 0.5 Z 0.5  0.5
3 2 1 1 3 1 5
P (X < 0.5) = fX (x) dx = x + dx = x + x = .
0 0 2 2 2 2 0 16

3.9 Marginal cdf

Finding the marginal cdf from the joint cdf is easy. If X and Y jointly take values on
[a, b] ⇥ [c, d] then
FX (x) = F (x, d), FY (y) = F (b, y).
If d is 1 then this becomes a limit FX (x) = lim F (x, y). Likewise for FY (y).
y!1

Example 11. The joint cdf in the last example was F (x, y) = 12 (x3 y + xy 3 ) on [0, 1] ⇥ [0, 1].
Find the marginal cdf’s and use FX (x) to compute P (X < 0.5).
answer: We have FX (x) = F (x, 1) = 12 (x3 + x) and FY (y) = F (1, y) = 1
2 (y + y 3 ). So
P (X < 0.5) = FX (0.5) = 12 (0.53 + 0.5) = 16
5
: exactly the same as before.

3.10 3D visualization

We visualized P (a < X < b) as the area under the pdf f(x) over the interval [a, b]. Since
the range of values of (X, Y ) is already a two dimensional region in the plane, the graph of
18.05 class 7, Joint Distributions, Independence, Spring 2014 9

f (x, y) is a surface over that region. We can then visualize probability as volume under the
surface.
Think: Summoning your inner artist, sketch the graph of the joint pdf f (x, y) = 4xy and
visualize the probability P (A) as a volume for Example 5.

4 Independence

We are now ready to give a careful mathematical definition of independence. Of course, it

will simply capture the notion of independence we have been using up to now. But, it is nice
to finally have a solid definition that can support complicated probabilistic and statistical
investigations.
Recall that events A and B are independent if

P (A \ B) = P (A)P (B).

Random variables X and Y define events like ‘X  2’ and ‘Y > 5’. So, X and Y are
independent if any event defined by X is independent of any event defined by Y . The
formal definition that guarantees this is the following.
Definition: Jointly-distributed random variables X and Y are independent if their joint
cdf is the product of the marginal cdf’s:

F (X, Y ) = FX (x)FY (y).

For discrete variables this is equivalent to the joint pmf being the product of the marginal
pmf’s.:
p(xi , yj ) = pX (xi )pY (yj ).
For continous variables this is equivalent to the joint pdf being the product of the marginal
pdf’s.:
f (x, y) = fX (x)fY (y).

Once you have the joint distribution, checking for independence is usually straightforward
although it can be tedious.
Example 12. For discrete variables independence means the probability in a cell must be
the product of the marginal probabilities of its row and column. In the first table below
this is true: every marginal probability is 1/6 and every cell contains 1/36, i.e. the product
of the marginals. Therefore X and Y are independent.
In the second table below most of the cell probabilities are not the product of the marginal
probabilities. For example, none of marginal probabilities are 0, so none of the cells with 0
probability can be the product of the marginals.
18.05 class 7, Joint Distributions, Independence, Spring 2014 10

X\Y 1 2 3 4 5 6 p(xi )
1 1/36 1/36 1/36 1/36 1/36 1/36 1/6
2 1/36 1/36 1/36 1/36 1/36 1/36 1/6
3 1/36 1/36 1/36 1/36 1/36 1/36 1/6
4 1/36 1/36 1/36 1/36 1/36 1/36 1/6
5 1/36 1/36 1/36 1/36 1/36 1/36 1/6
6 1/36 1/36 1/36 1/36 1/36 1/36 1/6
p(yj ) 1/6 1/6 1/6 1/6 1/6 1/6 1

X\T 2 3 4 5 6 7 8 9 10 11 12 p(xi )
1 1/36 1/36 1/36 1/36 1/36 1/36 0 0 0 0 0 1/6
2 0 1/36 1/36 1/36 1/36 1/36 1/36 0 0 0 0 1/6
3 0 0 1/36 1/36 1/36 1/36 1/36 1/36 0 0 0 1/6
4 0 0 0 1/36 1/36 1/36 1/36 1/36 1/36 0 0 1/6
5 0 0 0 0 1/36 1/36 1/36 1/36 1/36 1/36 0 1/6
6 0 0 0 0 0 1/36 1/36 1/36 1/36 1/36 1/36 1/6
p(yj ) 1/36 2/36 3/36 4/36 5/36 6/36 5/36 4/36 3/36 2/36 1/36 1

Example 13. For continuous variables independence means you can factor the joint pdf
or cdf as the product of a function of x and a function of y.
(i) Suppose X has range [0, 1/2], Y has range [0, 1] and f (x, y) = 96x2 y 3 then X and Y
are independent. The marginal densities are fX (x) = 24x2 and fY (y) = 4y 3 .
(ii) If f (x, y) = 1.5(x2 +y 2 ) over the unit square then X and Y are not independent because
there is no way to factor f (x, y) into a product fX (x)fY (y).
(iii) If F (x, y) = 12 (x3 y + xy 3 ) over the unit square then X and Y are not independent
because the cdf does not factor into a product FX (x)FY (y).
Covariance and Correlation
Class 7, 18.05
Jeremy Orlo↵ and Jonathan Bloom

1 Learning Goals

1. Understand the meaning of covariance and correlation.

2. Be able to compute the covariance and correlation of two random variables.

2 Covariance

Covariance is a measure of how much two random variables vary together. For example,
height and weight of gira↵es have positive covariance because when one is big the other
tends also to be big.
Definition: Suppose X and Y are random variables with means µX and µY . The
covariance of X and Y is defined as

Cov(X, Y ) = E((X µX )(Y µY )).

2.1 Properties of covariance

1. Cov(aX + b, cY + d) = acCov(X, Y ) for constants a, b, c, d.

2. Cov(X1 + X2 , Y ) = Cov(X1 , Y ) + Cov(X2 , Y ).

3. Cov(X, X) = Var(X)

4. Cov(X, Y ) = E(XY ) µX µY .

5. Var(X + Y ) = Var(X) + Var(Y ) + 2Cov(X, Y ) for any X and Y .

6. If X and Y are independent then Cov(X, Y ) = 0.

Warning: The converse is false: zero covariance does not always imply independence.

Notes. 1. Property 4 is like the similar property for variance. Indeed, if X = Y it is exactly
that property: Var(X) = E(X 2 ) µ2X .
By Property 5, the formula in Property 6 reduces to the earlier formula Var(X + Y ) =
Var(X) + Var(Y ) when X and Y are independent.
We give the proofs below. However, understanding and using these properties is more
important than memorizing their proofs.

1
18.05 class 7, Covariance and Correlation, Spring 2014 2

2.2 Sums and integrals for computing covariance

Since covariance is defined as an expected value we compute it in the usual way as a sum
or integral.
Discrete case: If X and Y have joint pmf p(xi , yj ) then
0 1
n X
X m X n X
m
Cov(X, Y ) = p(xi , yj )(xi µX )(yj µY ) = @ p(xi , yj )xi yj A µX µY .
i=1 j=1 i=1 j=1

Continuous case: If X and Y have joint pdf f (x, y) over range [a, b] ⇥ [c, d] then
Z dZ b ✓Z dZ b ◆
Cov(X, Y ) = (x µx )(y µy )f (x, y) dx dy = xyf (x, y) dx dy µx µy .
c a c a

2.3 Examples

Example 1. Flip a fair coin 3 times. Let X be the number of heads in the first 2 flips
and let Y be the number of heads on the last 2 flips (so there is overlap on the middle flip).
Compute Cov(X, Y ).
answer: We’ll do this twice, first using the joint probability table and the definition of
covariance, and then using the properties of covariance.
With 3 tosses there are only 8 outcomes {HHH, HHT,...}, so we can create the joint prob-
ability table directly.

X\Y 0 1 2 p(xi )
0 1/8 1/8 0 1/4
1 1/8 2/8 1/8 1/2
2 0 1/8 1/8 1/4
p(yj ) 1/4 1/2 1/4 1

From the marginals we compute E(X) = 1 = E(Y ). Now we use use the definition:
X
Cov(X, Y ) = E((X µx )(Y µY )) = p(xi , yj )(xi 1)(yj 1)
i,j

We write out the sum leaving out all the terms that are 0, i.e. all the terms where xi = 1
or yi = 1 or the probability is 0.
1 1 1
Cov(X, Y ) = (0 1)(0 1) + (2 1)(2 1) = .
8 8 4
We could also have used property 4 to do the computation: From the full table we compute
2 1 1 1 5
E(XY ) = 1 · +2 +2 +4 = .
8 8 8 8 4
18.05 class 7, Covariance and Correlation, Spring 2014 3

5 1
So Cov(XY ) = E(XY ) µX µY = 1= .
4 4
Next we redo the computation of Cov(X, Y ) using the properties of covariance. As usual,
let Xi be the result of the ith flip, so Xi ⇠ Bernoulli(0.5). We have
X = X1 + X2 and Y = X2 + X3 .
We know E(Xi ) = 1/2 and Var(Xi ) = 1/4. Therefore using Property 2 of covariance, we
have
Cov(X, Y ) = Cov(X1 +X2 , X2 +X3 ) = Cov(X1 , X2 )+Cov(X1 , X3 )+Cov(X2 , X2 )+Cov(X2 , X3 ).
Since the di↵erent tosses are independent we know
Cov(X1 , X2 ) = Cov(X1 , X3 ) = Cov(X2 , X3 ) = 0.
Looking at the expression for Cov(X, Y ) there is only one non-zero term

1
Cov(X, Y ) = Cov(X2 , X2 ) = Var(X2 ) = .
4

Example 2. (Zero covariance does not imply independence.) Let X be a random variable
that takes values 2, 1, 0, 1, 2; each with probability 1/5. Let Y = X 2 . Show that
Cov(X, Y ) = 0 but X and Y are not independent.
answer: We make a joint probability table:

Y \X -2 -1 0 1 2 p(yj )
0 0 0 1/5 0 0 1/5
1 0 1/5 0 1/5 0 2/5
4 1/5 0 0 0 1/5 2/5
p(xi ) 1/5 1/5 1/5 1/5 1/5 1

Using the marginals we compute means E(X) = 0 and E(Y ) = 2.

Next we show that X and Y are not independent. To do this all we have to do is find one
place where the product rule fails, i.e. where p(xi , yj ) 6= p(xi )p(xj ):
P (X = 2, Y = 0) = 0 but P (X = 2) · P (Y = 0) = 1/25.
Since these are not equal X and Y are not independent. Finally we compute covariance
using Property 4:
1
Cov(X, Y ) = ( 8 1 + 1 + 8) µX µy = 0.
5
Discussion: This example shows that Cov(X, Y ) = 0 does not imply that X and Y are
independent. In fact, X and X 2 are as dependent as random variables can be: if you know
the value of X then you know the value of X 2 with 100% certainty.
The key point is that Cov(X, Y ) measures the linear relationship between X and Y . In
the above example X and X 2 have a quadratic relationship that is completely missed by
Cov(X, Y ).
18.05 class 7, Covariance and Correlation, Spring 2014 4

2.4 Proofs of the properties of covariance

1 and 2 follow from similar properties for expected value.

3. This is the definition of variance:

Cov(X, X) = E((X µX )(X µX )) = E((X µX )2 ) = Var(X).

4. Recall that E(X µx ) = 0. So

Cov(X, Y ) = E((X µX )(Y µY ))

= E(XY µX Y µY X + µX µY )
= E(XY ) µX E(Y ) µY E(X) + µX µY
= E(XY ) µX µY µX µY + µX µY
= E(XY ) µX µY .

5. Using properties 3 and 2 we get

Var(X+Y ) = Cov(X+Y, X+Y ) = Cov(X, X)+2Cov(X, Y )+Cov(Y, Y ) = Var(X)+Var(Y )+2Cov(X, Y ).

6. If X and Y are independent then f (x, y) = fX (x)fY (y). Therefore

Z Z
Cov(X, Y ) = (x µX )(y µY )fX (x)fY (y) dx dy
Z Z
= (x µX )fX (x) dx (y µY )fY (y) dy

= E(X µX )E(Y µY )
= 0.

3 Correlation

The units of covariance Cov(X, Y ) are ‘units of X times units of Y ’. This makes it hard to
compare covariances: if we change scales then the covariance changes as well. Correlation
is a way to remove the scale from the covariance.
Definition: The correlation coefficient between X and Y is defined by
Cov(X, Y )
Cor(X, Y ) = ⇢ = .
X Y

3.1 Properties of correlation

1. ⇢ is the covariance of the standardizations of X and Y .

2. ⇢ is dimensionless (it’s a ratio!).

3. 1  ⇢  1. Furthermore,
⇢ = +1 if and only if Y = aX + b with a > 0,
⇢ = 1 if and only if Y = aX + b with a < 0.
18.05 class 7, Covariance and Correlation, Spring 2014 5

Property 3 shows that ⇢ measures the linear relationship between variables. If the corre-
lation is positive then when X is large, Y will tend to large as well. If the correlation is
negative then when X is large, Y will tend to be small.

Example 2 above shows that correlation can completely miss higher order relationships.

3.2 Proof of Property 3 of correlation

(This is for the mathematically interested.)

✓ ◆ ✓ ◆ ✓ ◆ ✓ ◆
X Y X Y X Y
0  Var = Var + Var 2Cov , =2 2⇢
X Y X Y X Y
) ⇢1
✓ ◆
X Y
Likewise 0  Var + ) 1  ⇢.
X Y
✓ ◆
X Y X Y
If ⇢ = 1 then 0 = Var ) = c.
X Y X Y

Example. We continue Example 1. To compute the correlation we divide the covariance

by the standard deviations. pIn Example 1 we found
p Cov(X, Y ) = 1/4 and Var(X) =
2Var(Xj ) = 1/2. So, X = 1/ 2. Likewise Y = 1/ 2. Thus

Cov(X, Y ) 1/4 1
Cor(X, Y ) = = = .
X Y 1/2 2

We see a positive correlation, which means that larger X tend to go with larger Y and
smaller X with smaller Y . In Example 1 this happens because toss 2 is included in both X
and Y , so it contributes to the size of both.

3.3 Bivariate normal distributions

The bivariate normal distribution has density


1 (x µX )2 (y µY )2 2⇢(x µx )(y µy )
+
2(1 ⇢2 ) 2 2 x y
e X Y
f (x, y) = p
2⇡ X Y 1 ⇢2

For this distribution, the marginal distributions for X and Y are normal and the correlation
between X and Y is ⇢.
In the figures below we used R to simulate the distribution for various values of ⇢. Individ-
ually X and Y are standard normal, i.e. µX = µY = 0 and X = Y = 1. The figures show
scatter plots of the results.
These plots and the next set show an important feature of correlation. We divide the data
into quadrants by drawing a horizontal and a verticle line at the means of the y data and
x data respectively. A positive correlation corresponds to the data tending to lie in the 1st
and 3rd quadrants. A negative correlation corresponds to data tending to lie in the 2nd
and 4th quadrants. You can see the data gathering about a line as ⇢ becomes closer to ±1.
18.05 class 7, Covariance and Correlation, Spring 2014 6

3
rho=0.00 rho=0.30

● ●

3
●
● ●
● ● ● ● ●
● ●● ● ● ● ●
●●
● ● ● ● ● ●
●● ● ● ● ●
● ●
●● ●
2

● ●

2
●● ●
●● ● ● ●● ● ●
● ● ● ●● ● ●● ●
●● ● ●● ●
● ● ● ●● ●● ● ●
● ● ● ● ●● ● ●● ●●● ●
●● ● ●● ● ● ● ● ● ●●
● ●● ●● ● ● ● ● ●
● ● ●● ● ●●
● ● ● ●● ● ●
● ●●
● ● ● ●●● ●● ●●●●● ● ●
● ●●● ● ● ● ●● ●
● ●
● ● ● ●● ●● ● ●● ● ● ● ●● ● ●● ●●● ●● ● ●
● ●●● ● ●● ●● ●●● ● ●● ●●● ● ● ● ●●
● ● ● ●●●●
● ●● ●
●
● ●●●●
●● ● ●● ● ●●● ● ● ● ●●● ● ●●
● ●● ● ●
●●● ●● ●●● ●●●
● ●●
●●
●● ●
● ●
1

● ● ●●● ● ● ● ●● ● ●● ●

1
●●●● ●●●
● ● ● ● ●
●●
● ● ● ● ●● ● ●●● ● ●●● ●●●
●● ●●● ●● ● ● ●● ●
● ● ● ● ● ● ● ●
●
● ●● ●● ● ●● ●●● ● ● ●● ●
● ●● ●●●● ●● ● ● ●● ●●● ● ●● ●
●● ● ●●●● ● ● ●
●● ● ● ●●
●●
●
● ●
●●
● ●●●
● ●● ●● ● ●
● ●●● ●● ● ● ● ● ● ● ●● ●● ● ●●
● ●● ●●● ● ●● ●● ●●
●● ●
●
● ●● ● ●
●●●
●●●
●●
●
●
●●●● ●●
●● ● ● ●●
● ● ● ● ● ● ● ● ● ● ●
● ●● ●
● ● ● ●● ● ●●● ●●●●
●●●●●● ●
● ● ● ●● ● ●●● ●● ● ● ● ●● ●● ●● ●●●●
● ● ● ● ●● ●●●●●●●● ● ●● ● ● ●●
●● ● ●
● ●● ● ●● ●●● ● ●● ●
●●
●●●
●● ● ● ● ●●● ●●● ●● ●● ●●
●● ● ● ● ● ● ●● ● ● ● ●● ●●● ●
● ●●● ●●● ● ●
●●● ● ● ●● ●● ●
●
●● ● ●
● ● ● ●● ● ●● ●
● ●● ●● ●● ●
● ●●●●
●●
●●●●● ● ●● ● ●● ● ● ●● ●●●● ●●●●●● ●●● ●●●● ●● ● ●●
●● ●●●●
●● ● ●●
● ● ●●● ●●● ● ● ●● ● ●●●
● ● ●
● ●
●●●● ●●●● ●● ● ●●
● ●● ●●
● ● ●●● ●●● ●
● ●●●●●●●●● ●●●●
●●●
●
● ●●
●●●
●●●
● ● ●●
●
●
● ●● ● ●●● ● ● ●● ● ● ●●● ● ● ●●●
● ● ● ● ●● ●
●●
●●● ● ●
●
● ●●●● ● ●●
●● ●
● ●●● ●●● ● ●●●●
● ●●●●●● ●
● ● ●● ● ●●
●● ● ●●●
●●●● ● ●●●●● ● ●● ●●
●● ●●●●●
●● ●● ●
−1 0

● ● ●
●● ●●

−1 0
y

y
● ●●● ● ●●● ● ●● ●● ● ●
●● ●
●●
●●●●●
●●●
●●
● ●● ● ●● ●●
●
●●
● ●●
●
●
●●● ●●
●●●●●●● ●
● ●
●●● ●
● ● ● ●
●
● ●● ● ●●
●●●● ● ● ● ● ●●● ●●●● ●
●
●●●●● ●
●
●●●●
● ●● ●
●● ●● ●
● ● ●● ● ●●●●● ●● ●●●● ●●● ●● ●● ● ●●● ●●
● ●● ● ●●
●●●● ●●
● ● ●●●●
● ●●●● ●●●●● ●
●
● ●●
●
●●● ●● ●
●
● ●●●●●●●
● ● ●●
●●
●
● ●● ●●
●
●●●● ● ●● ●●●● ● ● ● ● ● ● ●●● ●● ●●●● ●● ●
● ●●
●● ●●● ●●●●● ●
●
●●●
●
●●
●
● ● ●● ● ● ●
● ●
● ●
●● ● ● ●●●
●● ●●● ● ●● ● ● ●● ● ●●●●
●● ●● ●● ●● ● ● ● ● ● ●● ● ●
●
●●●●●●● ● ● ● ● ● ●●● ●● ●●
●
● ●
●
●● ● ● ●●● ●
●● ● ●●●
● ●●● ●●
● ●●●● ●●● ●
●●● ● ● ●●●●
●
● ●●● ●●●●● ●● ● ●● ●●●●
●●
● ●
●● ●
● ● ●●
● ●
●●
●
● ● ● ●●
●● ●● ●● ● ●●● ● ● ●●
● ●
● ●
●● ● ● ● ●● ●●● ● ● ● ●●●● ●● ● ●●
●●●●●●●● ●●
● ●●●●
● ●
● ●● ● ● ● ●● ●● ●
● ● ●●●● ●● ● ● ●●● ● ●●● ●●●● ● ●●●
●
● ● ●● ●● ● ● ●●● ●● ● ● ● ● ●● ●●
● ●●●●
●●●●●
●●●●
● ●●●● ●● ●●● ●
●● ●●
● ●●● ●
●● ● ●● ● ● ●●
● ● ●
●●● ●●
●
●●
●● ●
●●●●●● ●● ● ● ● ●● ●● ●●
● ●● ●●●●● ●●● ● ●●●
● ● ●
●●●● ● ● ● ● ●● ●●
● ●● ●●●
● ● ● ●● ● ● ● ● ●● ●●●
●●●● ● ●●●●
●● ●
●●●
●●
● ●● ●●● ●●●● ● ●
● ● ● ● ● ● ● ●●● ●●● ● ● ● ●●●●
● ● ●●● ● ●●●● ● ●
●
●
● ●
● ●
● ● ● ●●●
●
●●● ● ●● ●
●●● ●● ●● ● ● ●●● ●● ●● ●●
● ●● ●●●●● ● ●
● ● ● ●●●
●● ● ● ●
●
● ● ● ●●●●●
● ● ●●●● ● ●●
●● ● ●
● ● ● ● ● ● ●
● ●● ● ● ● ● ●●● ●●● ●● ● ●● ● ●● ● ●
● ● ●● ● ● ● ● ●●● ● ●●● ● ●● ● ● ● ● ●● ●● ●● ● ● ●
● ●
● ● ● ● ●● ● ● ● ● ● ● ● ● ●● ● ●● ●● ●
●● ● ●● ●
●●●●● ● ●● ● ● ●● ● ● ● ● ● ● ● ●
● ● ●
●
● ●
●
● ● ●
●●● ●●
● ● ● ● ●● ●
● ● ●● ●● ●
● ●● ●
● ●●
●● ●
● ● ● ● ● ● ● ● ● ● ● ●
● ● ●
● ● ● ● ● ●
● ● ●
● ● ● ●
● ●
●
−3

−3
● ●● ●

−3 −2 −1 0 1 2 3 −4 −3 −2 −1 0 1 2 3
x x

rho=0.70 rho=1.00

● ●● ● ●
● ●
●●●● ● ●● ● ●● ● ● ●●
●● ●
2

● ●
● ● ●●● ●
●
● ● ●
●● ●
●●●
●
●
●● ● ● ● ● ●● ●● ●●● ● ● ●
●●
●
2

●
●● ●● ●● ●● ● ● ●● ● ● ●
●
●●
●
●●● ●● ●●●●
●
●●●● ●●
●● ●●●●● ● ● ● ●
●
●
●●●
●
● ●●● ● ●●● ●●● ●● ● ● ●●
●
● ● ● ● ●● ● ●● ● ● ●● ● ●
●●
●
●●
●●● ● ● ● ●● ● ●●● ●● ●●
●●●● ● ●● ● ●● ● ●●
●●
1

●● ● ● ●
●
●●
●
●●●● ● ● ●● ●●●●●● ●
●●● ● ● ● ● ●
●
●●
●
●●
● ● ●● ●● ● ●● ●●
● ●●
●
● ●●
●
●●●
●●● ●●●● ● ●●
●
●
●●
●●
● ●●● ● ● ●●● ●
●●●
●●
● ●●●● ● ●
● ●● ● ● ●●
●
●
●●
● ● ● ●●
●● ●●
●●●●●
●
●● ●●●●●●●
●
●●● ●●●●●● ● ●●●●● ●
●●
●
●
●●
●
●
●● ● ●●● ● ●● ●●●● ●●● ●● ●
●●●●● ●● ●●● ● ● ● ●
●
●●
●
●
●●
●● ●● ● ●●●●
●●● ●●
●●●
●● ●
●
● ● ●
● ● ●● ●● ● ● ●● ●
● ● ● ● ●
●
●
●
●
●
●
●
●
●●●● ●● ● ●●●
● ●●●● ●● ●
●● ● ●●
●
●●
●
●
● ● ●● ●● ●●● ●● ●
●●●●
●
●●●●● ● ●● ●●●
● ● ● ●
●
●●
●
●
●● ● ●
● ●● ●
●●●
●●●
●●● ●●
●
●●● ● ●●
● ●● ● ● ●● ●
●●
●
●
●
●● ●● ●●
0

● ●
0

●● ●●●● ●● ●●● ● ●● ●
●●●●
● ●● ● ●● ●●●●●●● ● ●● ● ● ●●
●
●●
●
● ● ● ●● ●●●●● ● ●●●●
● ●●●● ●● ●● ●
●
y

●
y

●
●
● ●●● ●● ● ● ●
● ● ●●●●● ● ● ● ● ● ●
●●●
●
●
● ●● ●● ●●
●
● ● ● ●● ● ●
●
●●
●● ● ●●● ● ● ●
●
●
● ● ● ●
● ●● ●
● ●● ●●
●
●● ● ● ● ● ●●
●
●● ● ●●
● ●●
●●
● ●● ●● ●
● ● ●
● ●
●
−3

−4

● ●

−3 −2 −1 0 1 2 3 −4 −2 0 2
x x

rho=−0.50 rho=−0.90
3

● ●
● ●●
● ●
● ● ● ● ●
●● ● ●●
● ● ●
2

● ● ● ● ● ● ● ● ●● ● ● ●●● ● ●
● ●
● ● ●●● ● ●●● ●●●● ●
2

●● ● ●● ●
● ● ●● ●●●●●●● ●●●● ●●
●● ●●● ● ●
●●
● ● ● ● ● ● ●● ●
● ● ● ● ● ● ●● ●●
●●
●
●
●●●●
●●●●●●●● ●●
● ● ●
●● ● ●● ●● ●●● ●●●
●●● ● ● ● ● ●
●●● ● ●● ●●
● ●●
●●●●● ●●
●
●
● ●●
●● ● ● ● ●●●●● ●● ● ● ● ●● ● ●● ●
● ●●● ●
● ●●●●● ●
● ●●
●
●●●●
●●●●●
●●●
●
●●● ●● ●
●
● ● ● ●● ●●
●●● ● ● ●● ● ●●●● ● ● ● ●●● ●
●
●● ●● ●
● ● ●● ●
● ●
●
● ● ● ● ● ● ● ● ●●
●● ●
●●
●●●
●●●
● ●●●●●
●●
● ●●
●●●● ●
● ●● ●●
●●
● ● ● ●●● ●
●●●● ● ●●● ●
●● ● ● ● ●● ●●
● ●●●● ●●●●●
●●
●
●
●●
● ●
●●●●● ● ● ●● ●● ● ●
● ●●● ● ●● ●● ●● ● ● ●●●
●●● ●
●● ●●
●●
●● ● ●
1

● ●● ●
● ●
●●
●● ● ●●● ●● ●
●● ●●●● ●● ● ●●●●
●● ● ●● ● ● ● ●
● ●●●
●●●
●●
●
●
●
●●
●●●●
●●
● ●
●
● ●●●
●●●●● ● ●
● ●●●●● ●● ● ●
●● ● ●
●● ●●●● ●● ● ●
● ● ● ●
●● ●● ● ●● ● ● ● ● ●●●
●●
●●
●
●
●●
● ●●●
●
● ●●●
●●●
●●
● ●
●●
● ●●●●
●●●
● ●●●
●●● ● ●●
●● ●●● ●● ● ●
● ●●●● ●●● ●●● ●
●●●
● ●●● ●● ● ●● ● ● ●● ● ● ●●●
●●● ●
●●
●
● ●●●●
●
●
●
●
●
●
●
●
●
●
●●●
● ●
●●
●●
● ●
●
●
●
●
●
●●●
●●
● ●●
●
●
●●
●● ●● ●
●● ● ●●●
● ●● ● ● ● ●
●● ● ● ●●
● ●●● ●
●● ● ●● ●●●
● ● ● ●●●
●●●●● ●
● ●● ●●●
●●● ● ● ●
●● ● ●● ●● ● ●● ●● ●● ●●●●● ●
●● ●●●●● ●● ● ●●●
●● ●●
y

● ● ● ● ●● ●●
●● ● ● ●
● ●●●
●
● ●●
● ●
● ● ●
●●●●
● ●
●
● ● ●● ●
●●● ●●●●● ● ● ● ● ● ● ●● ●●
● ●●●
●
●●
●
●●
●
●●
●●●●
●●
●
●●
●
●●
●
●● ●
●
●
●
●
●●
●●●●●●●● ●
● ●
● ●● ●●● ● ●● ● ●●● ● ● ●●● ● ●● ● ● ●●● ●●●●●
0

●●● ●●●●●●●●● ● ●●● ●●●●● ● ● ● ●

● ● ● ● ●●● ●●● ●●●●
●● ● ●● ●●● ●
●●●● ●●● ●
● ●● ● ● ●●● ●
●● ●
●●●
●
●●
●
●●
●●● ●
●●
●● ●● ●● ●
● ● ●● ● ● ●● ●
● ● ●● ● ●
●●●●
● ●
●
● ● ● ●● ● ●
● ●● ●
● ●
●
●● ● ●
●●● ●●
●
● ●
●●
●● ●
●●●● ●
● ● ● ●●●●●●●●●● ● ●● ●●
●●● ●● ● ● ● ●● ●
● ● ●● ● ● ●● ●
●●●●●●●
● ●●
●●
●
●●●●●●●●●●●● ●
● ● ●● ● ●●●●●●●●● ●●●●●● ●●●●● ●
●
● ●●●
●●
●
●
●● ●●●●
●●● ● ●● ● ● ●
●●●●
● ● ●
● ● ●● ●●● ● ● ● ●
● ● ●●● ●● ●
● ● ●●●● ●●
●●● ●
●
●●●●●● ●●●
● ●●
●●●●● ● ● ●● ●
●● ●● ●●●
●
● ●● ●●
● ● ●● ●●●
●●
●
●
●●● ●●●
●● ●●● ●●●●●
● ● ● ● ●● ●●● ● ● ●
●●● ●● ●●● ●●● ● ●● ●●
●●
● ●●
● ● ●●
● ● ●
●● ●●●●● ●●● ● ●
●● ● ●● ● ●● ●● ●● ●●
●●● ●●●● ● ●●
● ●
●●●●
● ●●●
−2

●●● ● ● ● ●
● ● ● ●● ● ●● ●
● ● ●
● ● ● ● ●
−4

● ●
● ● ●
● ●

−3 −2 −1 0 1 2 3 −3 −2 −1 0 1 2 3 4
x x
18.05 class 7, Covariance and Correlation, Spring 2014 7

3.4 Overlapping uniform distributions

We ran simulations in R of the following scenario. X1 , X2 , . . . , X20 are i.i.d and follow a
U (0, 1) distribution. X and Y are both sums of the same number of Xi . We call the number
of Xi common to both X and Y the overlap. The notation in the figures below indicates
the number of Xi being summed and the number which overlap. For example, 5,3 indicates
that X and Y were each the sum of 5 of the Xi and that 3 of the Xi were common to both
sums. (The data was generated using rand(1,1000);)
Using the linearity of covariance it is easy to compute the theoretical correlation. For
each plot we give both the theoretical correlation and the correlation of the data from the
simulated sample.

(1, 0) cor=0.00, sample_cor=−0.07 (2, 1) cor=0.50, sample_cor=0.48

2.0
● ● ● ●● ● ● ● ●
● ● ●● ● ● ●● ● ●● ● ● ●
● ●● ● ● ● ● ●●
● ●
● ●
● ● ● ●● ●●
● ● ● ●
● ●●● ●●●● ● ● ● ● ● ● ●● ● ●
● ●●
●● ●● ●● ●● ●● ● ● ●● ●● ● ● ●
● ● ●
● ● ●● ● ●
● ● ● ● ●● ●● ●● ●
●● ●● ● ● ● ● ●
●●● ● ● ●● ●● ●● ● ● ●● ● ●● ●●● ● ● ●●●● ● ●
● ●● ●● ● ●
● ● ● ● ● ●● ● ● ● ● ●●
●●
●●
● ●
● ●● ● ● ● ● ●● ● ● ● ● ● ● ●● ●● ●● ● ● ● ●● ●
0.8

● ●
● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ●● ●
● ● ●● ● ● ● ● ● ● ● ●
● ●
● ● ●● ● ● ●● ●●● ● ● ●●● ●● ● ● ●

1.5
●
● ● ● ● ● ● ● ● ● ● ●
● ● ●● ● ● ● ● ●●●● ● ● ● ●● ● ●●
● ●
●
● ●●● ● ● ●●● ● ●●● ● ● ●● ● ●● ● ●● ●● ●● ● ● ●●● ● ● ● ● ● ● ●● ● ●
● ● ●●● ● ● ● ●●●● ●● ●●● ●
● ●● ●● ●● ●● ● ● ● ● ●●●● ● ●● ●●● ●
●● ● ● ●●●● ● ● ●
● ●● ● ●
● ●
●
● ● ●●
●● ● ● ● ● ●● ●●●
●
●● ● ●● ● ●
●● ● ●
● ● ● ● ● ●●● ● ● ● ●●● ● ● ●● ● ●●● ● ●● ●
● ● ●● ● ● ●
● ●● ● ● ● ● ●●● ●● ● ●● ●● ● ● ● ●●● ●
● ●
● ● ● ● ● ●● ● ● ●● ●
● ● ● ● ● ●● ●●●●●●●
●
● ●●● ●
●● ● ● ● ●● ● ●● ● ●
●●●● ● ●● ● ● ●● ●● ●●● ● ●●● ● ● ●
● ●● ● ● ● ●● ● ● ● ● ● ● ● ●● ● ● ● ●● ●
●● ● ●● ● ●
●● ●● ●● ●●
●
●
● ●●● ● ●
● ● ●
● ● ●● ● ●●●● ● ● ●● ● ● ● ● ● ●
● ●●● ●● ● ●● ● ● ●● ●● ●● ●● ● ● ● ● ●
●
● ● ● ● ● ● ●● ● ● ● ● ● ● ●● ● ●●
● ● ● ●●● ● ●
● ● ● ● ● ● ●● ●
● ● ● ● ●●● ● ● ● ●● ● ● ●● ● ● ●● ●
●● ● ● ●● ●● ●●
● ●
●● ●
●●● ●●● ●● ●●● ●● ● ●
●
●●● ● ●● ●●
● ● ● ●● ●
● ● ● ● ●●●● ● ● ●●● ●●● ● ●● ●●● ●●● ● ●●
● ●● ●● ●● ●
● ● ●●●● ● ● ●● ●●● ● ● ● ● ● ● ●● ●●●● ●● ●●●● ● ● ●
● ● ●● ● ● ● ● ● ●● ● ●●● ● ● ● ●● ● ●● ●● ●●● ● ●● ● ● ●● ●● ● ● ● ● ● ●● ●

1.0
●● ● ●● ● ●● ●● ●
● ● ● ● ●● ● ● ● ●●●● ●
●●● ●●●● ● ● ● ●
● ● ● ● ●● ● ● ●● ●
●● ●● ● ●● ● ● ● ●● ● ● ●●
● ● ●● ● ● ●●● ● ● ●●
● ●● ●● ●● ● ●● ● ● ●●
● ●
y

● ●● ● ● ●
● ●● ●
● ● ● ● ●● ● ●
● ●● ●● ●
●
● ●●● ●
● ●● ● ●●●●
●● ●
● ● ● ●
● ● ● ● ●
● ●● ● ● ● ●
●●●● ●● ●●● ● ● ●● ● ● ●
● ●●
● ● ●
● ●● ● ● ● ●●● ● ● ●● ● ●● ●● ●● ● ●● ● ● ●● ● ●● ●●●●● ●● ●
● ●
● ●● ●
●●● ● ● ● ●● ●●●● ●● ●● ● ● ● ● ● ●● ● ●●● ●●
● ● ●● ● ●●
●● ●●
●● ●●● ● ●●●
● ● ● ●● ● ● ●●
● ● ●● ●●
● ● ● ● ●
●● ● ●● ● ● ●●●●● ●●
● ●● ● ●
● ● ● ●
●● ● ● ● ●●
● ●
●
● ● ● ● ● ● ●●● ● ●● ● ● ●●
● ● ●● ●● ● ● ● ● ● ● ● ● ●●
● ●
● ● ●● ●● ● ●● ● ●
●● ●● ●
● ● ●● ●● ● ● ● ●● ●● ●● ● ● ●● ● ●
● ● ●● ●● ●● ●
●● ●
● ● ●● ●
● ●● ●●
● ●●
● ●●● ● ●● ●●●● ● ●● ●●
● ● ● ● ●
0.5

● ● ●
● ● ● ●
● ●
●
● ●●
●●● ●
●● ● ● ● ●● ● ● ● ● ●● ●● ●
●
● ●
● ● ● ● ●
● ● ● ● ● ●●
● ● ● ● ● ● ● ● ● ●
●● ●
● ● ●●● ●● ●● ● ● ● ● ● ●●●●●● ● ● ●● ● ●
●
● ● ● ●
● ● ● ●
●●
●● ●●● ● ●● ●
●
●●
● ● ● ● ●● ● ●● ●
●● ● ●
● ● ●● ●● ● ● ● ● ● ● ● ● ● ●● ● ●
● ● ●● ● ●● ● ● ● ● ● ● ● ●● ● ●● ● ●●● ●●●● ●●● ●●●●● ● ●●
● ●
● ●● ● ●● ● ● ● ● ●● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ●● ● ● ●● ● ●● ●●
● ●●●
● ●● ● ● ●● ●● ● ● ●●●● ● ●● ● ● ● ●● ●● ● ●● ● ●● ●
● ● ● ●●●● ● ●● ● ●● ● ● ●●
● ● ● ● ●● ●● ● ● ●● ● ●● ● ●
●● ●● ●
●
● ●●● ●● ● ● ● ●
● ●●
●●
● ●●● ●● ● ●● ● ● ● ● ●
●●●
● ● ● ● ●● ● ●●
● ● ●
● ● ● ●● ● ● ● ●●● ● ●●
● ●
●
●● ● ●●● ●●●
● ●●● ●
● ● ● ●● ● ● ● ●● ● ●
● ●●
●● ● ● ● ● ●
0.0

● ●● ● ● ● ● ● ●●● ● ●
●● ●
0.0

● ●● ● ● ●● ● ●● ● ● ●
● ●● ●● ● ●● ● ● ● ● ●● ●

0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.5 1.0 1.5 2.0
x x

(5, 1) cor=0.20, sample_cor=0.21 (5, 3) cor=0.60, sample_cor=0.63

● ●
●
● ●
● ●
●● ● ● ●
●
4

● ● ● ●
4

● ● ● ● ●
●
● ● ● ● ●
● ● ●●
● ● ● ● ●● ● ●
● ● ●● ●● ●● ● ●● ● ● ●
●●●●● ●● ●● ● ●
●●
● ● ● ● ● ● ● ● ●● ●
● ● ●
● ● ● ● ●●●● ● ●● ●● ● ● ● ● ●● ● ●
●●● ● ● ● ●● ● ●●● ●● ● ● ● ● ●
● ●
● ● ● ● ●●● ●● ●● ●
● ● ●●●● ● ● ●● ●● ●● ● ● ● ● ●● ●●● ●● ● ●●●
● ● ●
● ● ● ● ●●● ● ●
●● ● ● ●●● ● ● ● ● ● ●● ● ● ● ●●● ●●● ● ● ● ●● ●● ●●●● ● ●
● ●
●●● ●● ●●
●●●
●● ● ● ● ●● ● ●● ●
●● ●
●
● ● ● ●● ● ●●● ● ● ● ● ●●
●
●
● ●● ● ● ● ●●
● ●●● ●● ●● ● ●● ● ● ●
●
● ● ● ● ●● ● ●●●● ●
● ●●● ● ●● ●● ●
● ●
●● ● ● ● ● ● ● ●●●●● ●● ● ● ●●● ●● ●●● ● ●●● ● ●●● ●● ● ● ●
● ●● ● ●● ●●●● ● ●
●● ●
● ● ●●●●●● ●● ●● ●●●
● ●● ●● ●● ●●
● ●●● ●●●● ● ● ●● ●
3

● ● ● ●● ● ● ● ●● ●● ● ●●
● ● ●● ● ●●● ●● ●● ●●●● ● ● ● ●●
●●●
●● ● ● ●● ●● ●● ● ● ● ●●●●●● ●● ● ●
3

●●
● ●● ●●●● ● ●●● ● ● ●● ● ●●● ● ●● ● ●●● ● ●● ●● ●●●
●● ● ●
● ●● ● ● ● ●●
●
● ● ●●● ●●●●●●● ●
●
●
●●
●●●● ●● ●● ●● ● ●●● ●
●
● ●● ●●●●
● ●●
● ●●●●●●●●●●●●● ●
●● ●●●●●● ●● ●● ●
●
● ● ● ●● ●●●● ●●●● ● ● ●●●● ●● ● ●●●●● ●● ●● ●● ●●● ●● ● ●
● ●● ● ●● ●● ● ●
●●●●● ●●●
●● ● ●●
●● ●●●●●● ● ● ●
● ●
● ● ● ●● ●● ●● ●
●●●
●● ●●
●
● ● ●● ●
● ● ● ● ●● ●● ●● ● ● ●● ●
●●●● ●●●
●●●
● ●● ●
●● ●●● ● ●●●
● ●● ● ●●●● ●
●
●
● ●
● ●● ●● ●
●●● ●● ●● ●●●●●●●●●● ●●●● ● ● ●
● ●
●
● ●● ●●●●●● ● ● ●●
● ●● ●
● ●●
●● ●● ●●●●
●●●●● ● ●●
●
● ● ●● ●● ●●● ● ● ●●● ●●
●
●●●
●● ●●● ●●●● ●●● ●
●●●● ●●●
●●●●●
●
●●
●● ● ● ● ●● ●● ● ●
●● ● ●
● ● ● ●●
●● ● ●● ● ● ● ●●
●● ●
●
● ●●●● ●●
● ●
● ●● ● ●● ● ●● ● ● ●●● ●● ●
●●
●●● ●●●●
●●● ●●●● ●● ●●●● ●
y

●● ● ●● ●● ●
● ●● ●●
●●● ●● ●
● ●●●
●● ● ● ● ●●●●●●●●●● ● ●●● ● ● ●
●●●● ●●●
●● ●●
●●●● ● ●
● ●●●●● ●● ● ●●●●●●●● ●
● ●●●●●● ●●
● ●●● ● ● ● ● ● ● ●● ●●●●●●● ●● ●
●●●●●●●
●●● ● ● ●
●●●
●
●● ● ● ●●
● ● ●●●● ●
●●●
● ● ●●●● ●
● ● ● ● ●
● ● ● ● ●● ● ●●● ●● ● ● ●
● ●●●●●●
● ●● ● ● ● ● ● ● ● ● ●
●● ● ●●● ●
●
●
● ●●
●●● ●● ●●
● ●
●●●●●
● ●●
● ● ● ●● ● ●● ●●
● ●
● ●●
●●
●●●●●● ●●● ●● ●●● ●● ● ●
● ● ● ●● ●● ●●●● ●●●●
●
● ●●
● ●●
●●
● ●
●
● ● ●
●
●
● ●●
● ●● ● ●● ●●
● ●
● ● ●● ● ● ●●● ●●
●●
●● ●
●●● ●●●
● ●● ●●●●●
●●●●●●● ● ● ● ● ●
● ●●●
● ● ●●●●
● ●● ● ●
●●● ●● ●
●●●
● ● ●●● ●● ●● ● ●
● ● ●●● ●● ● ● ● ●● ●●
●● ● ● ●●● ● ● ●●●●●●● ●
●●
●●●
●● ● ●
● ●● ● ●
●●
● ●●● ●●
●●● ●●●● ●
●● ● ● ●●●
●●● ● ●●●● ●
● ● ●● ●●●
2

●● ● ● ●● ● ● ● ● ●● ● ● ●
● ● ●●●● ●●●● ●● ●● ●●● ●● ●●● ●● ●
2

● ● ● ●●●● ●
● ● ● ●● ●●
●●● ●● ● ●●●●●● ●● ● ● ● ● ●● ●●
● ● ●● ●●● ●● ●●●● ● ●
● ●●● ● ●● ● ●
● ● ●● ● ●
● ●●● ● ● ●●
●
●●●●●● ● ● ● ● ● ● ●
● ● ●● ●● ●●● ●
● ●● ●●● ● ● ●
●●● ● ● ●
● ● ●● ●● ● ● ●●● ● ● ●●●●
● ●●● ●
● ●● ● ●
●●●●●●●●● ● ●
● ● ●
● ● ●● ● ● ● ●● ●●●●
●●● ● ● ●
● ● ●●●● ● ●●
●
●
●
●
●● ●●● ●● ●● ● ●●
●●
● ● ● ● ● ●●● ●●
● ●●●●● ●● ●● ●● ●
● ●
● ● ●
●
●● ●● ● ● ●●
●● ●●●●
● ●●● ● ●●
●●●●●●
● ● ●● ●● ● ●● ● ●● ● ● ● ● ●● ● ●● ● ● ●● ● ●●
● ●
● ● ●● ● ● ●●
● ●●
● ● ●● ● ● ● ●●● ● ●
● ●●●● ● ●● ●● ●● ● ●
● ● ● ●● ●●
● ●● ● ● ●
● ● ● ● ●● ●● ● ● ●●
● ● ● ● ●● ● ●●● ●● ●
● ● ● ● ●● ●
● ●
● ● ● ● ●● ● ● ● ●
● ●●●● ● ● ●
● ● ● ● ●● ● ● ●
1

●●
1

● ● ● ● ●● ● ●●
● ● ● ● ●●●
● ● ● ●
● ● ●
● ●
● ●
●
● ●

1 2 3 4 1 2 3 4
x x
18.05 class 7, Covariance and Correlation, Spring 2014 8

(10, 5) cor=0.50, sample_cor=0.53 (10, 8) cor=0.80, sample_cor=0.81

8
● ● ●
● ●
● ● ●
●●● ● ● ●
● ● ●
● ● ● ●
●
7

●● ● ●
● ●●

7
● ● ● ● ● ● ● ●● ●● ●
● ● ● ● ● ●
● ●● ● ● ●
● ● ● ●● ● ● ● ●
● ●● ● ●● ● ● ●● ● ● ● ● ●
● ● ●
●
●● ●● ● ●● ●●
● ● ●● ● ● ● ● ● ● ●●
●
● ●
●
● ● ●
● ●● ● ● ● ● ● ● ● ● ●●●● ●●●● ● ●●
● ●●
●
● ●● ● ● ●●●
●●
● ● ● ● ●● ●●● ●● ● ● ●
●
● ●● ●●
●
●● ●
● ●
● ●● ●●
● ●
●●●● ●● ●●●● ●
●●
●●
●● ●●●
●●● ●● ● ● ●
● ●●
● ● ●●●
6

●● ● ● ●● ● ●● ●● ● ●●● ● ● ●●● ● ●
● ●
● ● ●●●●● ● ●●● ● ● ● ● ●

6
●
●●●● ● ● ● ● ● ● ● ● ● ● ●● ●
● ●●●●●●
● ●
● ●● ● ●
● ● ● ● ●●●●●
●● ●● ● ● ● ● ●●●● ●
●●● ● ●● ● ● ● ●●
● ●
● ●● ● ● ● ●●● ●●● ●●● ●●● ●
● ● ● ● ● ● ● ●
●●● ●●●●● ●● ●
●● ●
●●●
●● ●
●●●●
● ● ●●
●
●●●
●
●
●●●
●●
●●●●●
●●● ●●● ●●● ●● ●
●●●
●●●
●
● ●●● ●●●
●●● ●● ● ●● ● ●●●
● ● ●●● ● ●●
● ● ●
●●
●●●●●● ● ● ● ● ● ● ●
●● ● ●●● ●●●●●
●● ●●●●
●● ●
● ● ●● ●●
●●● ● ●
● ● ● ●● ● ●
●● ●●●●●● ● ●● ●●
●●●●
● ● ●● ●● ●●● ● ●
●●
● ●● ● ● ●● ●●●●
● ●●● ● ●● ●
● ●●●● ●
● ● ●● ●●● ●●● ● ● ●●
●
●●
● ●● ●●
● ●●
● ● ●●●
● ●
●● ●● ●●●● ●● ●●
● ●●●
● ●
● ● ● ● ● ● ●
●●●●
●●●●
●
●●
●●
●● ●● ● ●●●●
●● ●●●
● ●
● ●● ●●● ● ●●● ●
● ●● ●●● ● ●●● ●
●●
●● ● ●●
●●●● ●●●
● ●●
● ● ● ●●
●●●
● ●● ●●● ● ● ● ●
●● ●●
●●
●
●● ●●● ●
● ●● ●●●● ●
●
● ●
●● ● ● ● ●●● ●●● ● ●● ●●●●●
●●
●●●●●● ●●●
●
● ● ● ● ●●● ● ●●● ●● ●
●● ●●●●
●●
●● ●●● ●●
● ●
●● ●●
● ● ●● ●● ● ●
● ●
●● ●● ●● ●● ●● ●
●●●● ●●
●● ●●
●●●●● ●●●● ●● ● ●●●●●●● ●● ●
●●● ●● ● ● ● ●
●
●●
●
●
●●
●● ●●●●●●
●●●
● ●●●●●●●● ●●●●●●●
● ●●
5

● ● ● ●● ● ●

5
● ● ● ● ●● ● ● ●
● ●
●●● ●●●
● ● ●
●●●●● ●●● ●● ●
● ● ●●
● ●● ● ● ● ● ●● ●● ●● ● ● ●
●
y

4
● ●● ●● ● ●● ●● ●●
4

3
●● ● ● ● ● ●
3

●● ● ● ●
● ●● ● ● ●●
● ● ●
●
● ● ● ● ● ●

2
● ●
2

2 3 4 5 6 7 3 4 5 6 7 8
x x
Class 8 Review Problems
18.05, Spring 2014

1 Counting and Probability

1. (a) How many ways can you arrange the letters in the word STATISTICS? (e.g.
SSSTTTIIAC counts as one arrangement.)
(b) If all arrangements are equally likely, what is the probabilitiy the two ’i’s are next to
each other.

2 Conditional Probability and Bayes’ Theorem

2. Corrupted by their power, the judges running the popular game show America’s Next
Top Mathematician have been taking bribes from many of the contestants. Each episode,
a given contestant is either allowed to stay on the show or is kicked o↵.
If the contestant has been bribing the judges she will be allowed to stay with probability 1.
If the contestant has not been bribing the judges, she will be allowed to stay with probability
1/3.
Over two rounds, suppose that 1/4 of the contestants have been bribing the judges. The
same contestants bribe the judges in both rounds, i.e., if a contestant bribes them in the
first round, she bribes them in the second round too (and vice versa).
(a) If you pick a random contestant who was allowed to stay during the first episode, what
is the probability that she was bribing the judges?
(b) If you pick a random contestant, what is the probability that she is allowed to stay
during both of the first two episodes?
(c) If you pick random contestant who was allowed to stay during the first episode, what
is the probability that she gets kicked o↵ during the second episode?

3 Independence

3. You roll a twenty-sided die. Determine whether the following pairs of events are
independent.
(a) ‘You roll an even number’ and ‘You roll a number less than or equal to 10’.
(b) ‘You roll an even number’ and ‘You roll a prime number’.

4 Expectation and Variance

4. The random variable X takes values -1, 0, 1 with probabilities 1/8, 2/8, 5/8 respectively.
(a) Compute E(X).

1
Class 8 review, Spring 2014 2

(b) Give the pmf of Y = X 2 and use it to compute E(Y ).

5. Compute the expectation and variance of a Bernoulli(p) random variable.

6. Suppose 100 people all toss a hat into a box and then proceed to randomly pick out a
hat. What is the expected number of people to get their own hat back.
Hint: express the number of people who get their own hat as a sum of random variables
whose expected value is easy to compute.

5 Probability Mass Functions, Probability Density Functions

and Cumulative Distribution Functions

7. (a) Suppose that X has probability density function fX (x) = e x for x 0. Compute
the cdf, FX (x).
(b) If Y = X 2 , compute the pdf and cdf of Y.

8. Suppose you roll a fair 6-sided die 100 times (independently), and you get $3 every
time you roll a 6. Let X1 be the number of dollars you win on rolls 1 through 25.
Let X2 be the number of dollars you win on rolls 26 through 50.
Let X3 be the number of dollars you win on rolls 51 through 75.
Let X4 be the number of dollars you win on rolls 76 throught 100.
Let X = X1 + X2 + X3 + X4 be the total number of dollars you win over all 100 rolls.
(a) What is the probability mass function of X?
(b) What is the expectation and variance of X?
(c) Let Y = 4X1 . (So instead of rolling 100 times, you just roll 25 times and multiply your
winnings by 4.) (i) What are the expectation and variance of Y ?
(ii) How do the expectation and variance of Y compare to those of X? (I.e., are they bigger,
smaller, or equal?) Explain (briefly) why this makes sense.

6 Joint Probability, Covariance, Correlation

9. (Arithmetic Puzzle) The joint and marginal pmf’s of X and Y are partly given in
the following table.

Y
X\ 1 2 3
1 1/6 0 . . . 1/3
2 . . . 1/4 . . . 1/3
3 . . . . . . 1/4 . . .
1/6 1/3 . . . 1
(a) Complete the table.
Class 8 review, Spring 2014 3

(b) Are X and Y independent?

10. Covariance and Independence

Let X be a random variable that takes values -2, -1, 0, 1, 2; each with probability 1/5.
Let Y = X 2 .
(a) Fill out the following table giving the joint frequency function for X and Y . Be sure
to include the marginal probabilities.
X -2 -1 0 1 2 total
Y
0
1
4
total
(b) Find E(X) and E(Y ).
(c) Show X and Y are not independent.
(d) Show Cov(X, Y ) = 0.
This is an example of uncorrelated but non-independent random variables. The reason
this can happen is that correlation only measures the linear dependence between the two
variables. In this case, X and Y are not at all linearly related.

11. Continuous Joint Distributions

Suppose X and Y are continuous random variables with joint density function f (x, y) = x+y
on the unit square [0, 1] ⇥ [0, 1].
(a) Let F (x, y) be the joint CDF. Compute F (1, 1). Compute F (x, y).
(b) Compute the marginal densities for X and Y .
(c) Are X and Y independent?
(d) Compute E(X), (Y ), E(X 2 + Y 2 ), Cov(X, Y ).

7 Law of Large Numbers, Central Limit Theorem

12. Suppose X1 , . . . , XP
100 are i.i.d. with mean 1/5 and variance 1/9. Use the central limit
theorem to estimate P ( Xi < 30).

13. (More Central Limit Theorem)

The average IQ in a population is 100 with standard deviation 15 (by definition, IQ is
normalized so this is the case). What is the probability that a randomly selected group of
100 people has an average IQ above 115?
Class 8 review, Spring 2014 4

Standard normal table of left tail probabilities.

z (z) z (z) z (z) z (z)

-4.00 0.0000 -2.00 0.0228 0.00 0.5000 2.00 0.9772 (z) = P (Z  z) for N(0, 1).
-3.95 0.0000 -1.95 0.0256 0.05 0.5199 2.05 0.9798
(Use interpolation to estimate
-3.90 0.0000 -1.90 0.0287 0.10 0.5398 2.10 0.9821
z values to a 3rd decimal
-3.85 0.0001 -1.85 0.0322 0.15 0.5596 2.15 0.9842
place.)
-3.80 0.0001 -1.80 0.0359 0.20 0.5793 2.20 0.9861
-3.75 0.0001 -1.75 0.0401 0.25 0.5987 2.25 0.9878
-3.70 0.0001 -1.70 0.0446 0.30 0.6179 2.30 0.9893
-3.65 0.0001 -1.65 0.0495 0.35 0.6368 2.35 0.9906
-3.60 0.0002 -1.60 0.0548 0.40 0.6554 2.40 0.9918
-3.55 0.0002 -1.55 0.0606 0.45 0.6736 2.45 0.9929
-3.50 0.0002 -1.50 0.0668 0.50 0.6915 2.50 0.9938
-3.45 0.0003 -1.45 0.0735 0.55 0.7088 2.55 0.9946
-3.40 0.0003 -1.40 0.0808 0.60 0.7257 2.60 0.9953
-3.35 0.0004 -1.35 0.0885 0.65 0.7422 2.65 0.9960
-3.30 0.0005 -1.30 0.0968 0.70 0.7580 2.70 0.9965
-3.25 0.0006 -1.25 0.1056 0.75 0.7734 2.75 0.9970
-3.20 0.0007 -1.20 0.1151 0.80 0.7881 2.80 0.9974
-3.15 0.0008 -1.15 0.1251 0.85 0.8023 2.85 0.9978
-3.10 0.0010 -1.10 0.1357 0.90 0.8159 2.90 0.9981
-3.05 0.0011 -1.05 0.1469 0.95 0.8289 2.95 0.9984
-3.00 0.0013 -1.00 0.1587 1.00 0.8413 3.00 0.9987
-2.95 0.0016 -0.95 0.1711 1.05 0.8531 3.05 0.9989
-2.90 0.0019 -0.90 0.1841 1.10 0.8643 3.10 0.9990
-2.85 0.0022 -0.85 0.1977 1.15 0.8749 3.15 0.9992
-2.80 0.0026 -0.80 0.2119 1.20 0.8849 3.20 0.9993
-2.75 0.0030 -0.75 0.2266 1.25 0.8944 3.25 0.9994
-2.70 0.0035 -0.70 0.2420 1.30 0.9032 3.30 0.9995
-2.65 0.0040 -0.65 0.2578 1.35 0.9115 3.35 0.9996
-2.60 0.0047 -0.60 0.2743 1.40 0.9192 3.40 0.9997
-2.55 0.0054 -0.55 0.2912 1.45 0.9265 3.45 0.9997
-2.50 0.0062 -0.50 0.3085 1.50 0.9332 3.50 0.9998
-2.45 0.0071 -0.45 0.3264 1.55 0.9394 3.55 0.9998
-2.40 0.0082 -0.40 0.3446 1.60 0.9452 3.60 0.9998
-2.35 0.0094 -0.35 0.3632 1.65 0.9505 3.65 0.9999
-2.30 0.0107 -0.30 0.3821 1.70 0.9554 3.70 0.9999
-2.25 0.0122 -0.25 0.4013 1.75 0.9599 3.75 0.9999
-2.20 0.0139 -0.20 0.4207 1.80 0.9641 3.80 0.9999
-2.15 0.0158 -0.15 0.4404 1.85 0.9678 3.85 0.9999
-2.10 0.0179 -0.10 0.4602 1.90 0.9713 3.90 1.0000
-2.05 0.0202 -0.05 0.4801 1.95 0.9744 3.95 1.0000
Class 8 Review Problems –solutions, 18.05, Spring 2014

1 Counting and Probability

1. (a) Create an arrangement in stages and count the number of possibilities at each
stage: ✓ ◆
10
Stage 1: Choose three of the 10 slots to put the S’s:
3 ✓ ◆
7
Stage 2: Choose three of the remaining 7 slots to put the T’s:
✓ ◆3
4
Stage 3: Choose two of the remaining 4 slots to put the I’s:
✓ 2◆
2
Stage 4: Choose one of the remaining 2 slots to put the A:
✓ ◆ 1
1
Stage 5: Use the last slot for the C:
1
Number of arrangements:
✓ ◆✓ ◆✓ ◆✓ ◆✓ ◆
10 7 4 2 1
= 50400.
3 3 2 1 1
✓ ◆
10
(b) The are = 45 equally likely ways to place the two I’s.
2
There are 9 ways to place them next to each other, i.e. in slots 1 and 2, slots 2 and 3, . . . ,
slots 9 and 10.
So the probability the I’s are adjacent is 9/45 = 0.2.

2 Conditional Probability and Bayes’ Theorem

2. The following tree shows the setting. Stay1 means the contestant was allowed to stay
during the first episode and stay2 means the they were allowed to stay during the second.

1/4 3/4

Bribe Honest
1 0 1/3 2/3

Stay1 Leave1 Stay1 Leave1

1 0 1/3 2/3

Stay2 Leave2 Stay2 Leave2

Let’s name the relevant events:

B = the contestant is bribing the judges
H = the contestant is honest (not bribing the judges)
S1 = the contestant was allowed to stay during the first episode
S2 = the contestant was allowed to stay during the second episode

1
Class 8 review, Spring 2014 2

L1 = the contestant was asked to leave during the first episode

L2 = the contestant was asked to leave during the second episode

(a) We first compute P (S1 ) using the law of total probability.

1 1 3 1
P (S1 ) = P (S1 |B)P (B) + P (S1 |H)P (H) = 1 · + · = .
4 3 4 2

P (B) 1/4 1
We therefore have (by Bayes’ rule) P (B|S1 ) = P (S1 |B) =1· = .
P (S1 ) 1/2 2
(b) Using the tree we have the total probability of S2 is
1 3 1 1 1
P (S2 ) = + · · =
4 4 3 3 3
P (L2 \ S1 )
(c) We want to compute P (L2 |S1 ) = .
P (S1 )
From the calculation we did in part (a), P (S1 ) = 1/2. For the numerator, we have (see the
tree)
1 2 3 1
P (L2 \ S1 ) = P (L2 \ S1 |B)P (B) + P (L2 \ S1 |H)P (H) = 0 · + · =
3 9 4 6
1/6 1
Therefore P (L2 |S1 ) = = .
1/2 3

3 Independence

3. E = even numbered = {2, 4, 6, 8, 10, 12, 14, 16, 18, 20}.

L = roll  10 = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}.
B = roll is prime = {2, 3, 5, 7, 11, 13, 17, 19} (We use B because P is not a good choice.)
(a) P (E) = 10/20, P (E|L) = 5/10. These are the same, so the events are independent.
(b) P (E) = 10/20. P (E|B) = 1/8. These are not the same so the events are not indepen-
dent.

4 Expectation and Variance

4. (a) We have

X values: -1 0 1
prob: 1/8 2/8 5/8
X2 1 0 1
So, E(X) = 1/8 + 5/8 = 1/2.
Y values: 0 1
(b) ) E(Y ) = 6/8 = 3/4.
prob: 2/8 6/8
Class 8 review, Spring 2014 3

(c) The change of variables formula just says to use the bottom row of the table in part
(a): E(X 2 ) = 1 · (1/8) + 0 · (2/8) + 1 · (5/8) = 3/4 (same as part (b)).
(d) Var(X) = E(X 2 ) E(X)2 = 3/4 1/4 = 1/2.

5. Make a table
X: 0 1
prob: (1-p) p
X2 0 1.
From the table, E(X) = 0 · (1 p) + 1 · p = p.
Since X and X2 have the same table E(X 2 ) = E(X) = p.
Therefore, Var(X) = p p2 = p(1 p).

6. Let X be the number of people who get their own hat.

Following the hint: let Xj represent whether person j gets their own hat. That is, Xj = 1
if person j gets their hat and 0 if not.
100
X 100
X
We have, X = Xj , so E(X) = E(Xj ).
j=1 j=1
Since person j is equally likely to get any hat, we have P (Xj = 1) = 1/100. Thus, Xj ⇠
Bernoulli(1/100) ) E(Xj ) = 1/100 ) E(X) = 1.

5 Probability Mass Functions, Probability Density Functions

and Cumulative Distribution Functions

7. (a) We have cdf of X,

Z x
x x
FX (x) = e dx = 1 e .
0

Now for y 0, we have

(b)
p p
FY (y) = P (Y  y) = P (X 2  y) = P (X  y) = 1 e y
.
Di↵erentiating FY (y) with respect to y, we have

1 p
y
fY (y) = y 2 e .
2

8. (a) There are a number of ways to present this.

Let T be the total number of times you roll a 6 in the 100 rolls. We know T ⇠ Binomial(100, 1/6).
Since you win $3 every time you roll a 6, we have X = 3T . So, we can write
✓ ◆ ✓ ◆k ✓ ◆100 k
100 1 5
P (X = 3k) = , for k = 0, 1, 2, . . . , 100.
k 6 6
Class 8 review, Spring 2014 4

Alternatively we could write

✓ ◆ ✓ ◆x/3 ✓ ◆100 x/3
100 1 5
P (X = x) = , for x = 0, 3, 6, . . . , 300.
x/3 6 6

(b) E(X) = E(3T ) = 3E(T ) = 3 · 100 · 16 = 50,

Var(X) = Var(3T ) = 9Var(T ) = 9 · 100 · 16 · 56 = 125.
(c) (i) Let T1 be the total number of times you roll a 6 in the first 25 rolls. So, X1 = 3T1
and Y = 12T1 .
Now, T1 ⇠ Binomial(25, 1/6), so

E(Y ) = 12E(T1 ) = 12 · 25 · 16 = 50.

and
1 5
Var(Y ) = 144Var(T1 ) = 144 · 25 ·· = 500.
6 6
(ii) The expectations are the same by linearity because X and Y are the both
3 ⇥ 100 ⇥ a Bernoulli(1/6) random variable.
For the variance, Var(X) = 4Var(X1 ) because X is the sum of 4 independent variables all
identical to X1 . However Var(Y ) = Var(4X1 ) = 16Var(X1 ). So, the variance of Y is 4
times that of X. This should make some intuitive sense because X is built out of more
independent trials than X1 .
Another way of thinking about it is that the di↵erence between Y and its expectation is
four times the di↵erence between X1 and its expectation. However, the di↵erence between
X and its expectation is the sum of such a di↵erence for X1 , X2 , X3 , and X4 . Its probably
the case that some of these deviations are positive and some are negative, so the absolute
value of this di↵erence for the sum is probably less than four times the absolute value of this
di↵erence for one of the variables. (I.e., the deviations are likely to cancel to some extent.)

6 Joint Probability, Covariance, Correlation

9. (Arithmetic Puzzle) (a) The marginal probabilities have to add up to 1, so the two
missing marginal probabilities can be computed: P (X = 3) = 1/3, P (Y = 3) = 1/2. Now
each row and column has to add up to its respective margin. For example, 1/6 + 0 + P (X =
1, Y = 3) = 1/3, so P (X = 1, Y = 3) = 1/6. Here is the completed table.
Y
X\ 1 2 3
1 1/6 0 1/6 1/3
2 0 1/4 1/12 1/3
3 0 1/12 1/4 1/3
1/6 1/3 1/2 1

(b) No, X and Y are not independent.

For example, P (X = 2, Y = 1) = 0 6= P (X = 2) · P (Y = 1).
Class 8 review, Spring 2014 5

10. Covariance and Independence

(a)
X -2 -1 0 1 2
Y
0 0 0 1/5 0 0 1/5
1 0 1/5 0 1/5 0 2/5
4 1/5 0 0 0 1/5 2/5
1/5 1/5 1/5 1/5 1/5 1
Each column has only one nonzero value. For example, when X = 2 then Y = 4, so in
the X = 2 column, only P (X = 2, Y = 4) is not 0.
(b) Using the marginal distributions: E(X) = 15 ( 2 1 + 0 + 1 + 2) = 0.
1 2 2
E(Y ) = 0 · + 1 · + 4 · = 2.
5 5 5
(c) We show the probabilities don’t multiply:
P (X = 2, Y = 0) = 0 6= P (X = 2) · P (Y = 0) = 1/25.
Since these are not equal X and Y are not independent. (It is obvious that X 2 is not
independent of X.)
(d) Using the table from part (a) and the means computed in part (d) we get:

Cov(X, Y ) = E(XY ) E(X)E(Y )

1 1 1 1 1
= ( 2)(4) + ( 1)(1) + (0)(0) + (1)(1) + (2)(4)
5 5 5 5 5
= 0.

11. Continuous Joint DistributionsZ aZ b

(a) F (a, b) = P (X  a, Y  b) = (x + y) dy dx.
0 0
b a
y2 b 2 x2 b2 a2 b + ab2
Inner integral: xy + = xb + . Outer integral: b+ x = .
2 0 2 2 2 0 2
x2 y + xy 2
So F (x, y) = and F (1, 1) = 1.
2
Z 1 Z 1 1
y2 1
(b) fX (x) = f (x, y) dy = (x + y) dy = xy + = x+ .
0 0 2 0 2
By symmetry, fY (y) = y + 1/2.
(c) To see if they are independent we check if the joint density is the product of the
marginal densities.
f (x, y) = x + y, fX (x) · fY (y) = (x + 1/2)(y + 1/2).
Since these are not equal, X and Y are not independent.
Z 1Z 1 Z 1" 1
# Z 1
2 y2 x 7
(d) E(X) = x(x + y) dy dx = x y+x dx = x2 + dx = .
0 0 0 2 0 0 2 12
Class 8 review, Spring 2014 6

Z 1 Z 1
(Or, using (b), E(X) = xfX (x) dx = x(x + 1/2) dx = 7/12.)
0 0
By symmetry E(Y ) = 7/12.
Z 1Z 1
2 2 5
E(X + Y ) = (x2 + y 2 )(x + y) dy dx = .
0 0 6
Z 1Z 1
1
E(XY ) = xy(x + y) dy dx = .
0 0 3
1 49 1
Cov(X, Y ) = E(XY ) E(X)E(Y ) = = .
3 144 144

7 Law of Large Numbers, Central Limit Theorem

12. Standardize:
! P !
X Xi µ 1
30/n µ
p n p
P Xi < 30 =P <
/ n / n
i
✓ ◆
30/100 1/5
⇡P Z< (by the central limit theorem)
1/30
= P (Z < 3)
= 0.9987 (from the table of normal probabilities)

13. (More Central Limit Theorem)

Let Xj be the IQ of a randomly selected person. We are given E(Xj ) = 100 and Xj = 15.
Let X be the average of the IQ’s of 100 randomly selected people. Then we know
p
E(X) = 100 and X = 15/ 100 = 1.5.

The problem asks for P (X > 115). Standardizing we get P (X > 115) ⇡ P (Z > 10).
This is e↵ectively 0.
Introduction to Statistics
Class 10, 18.05
Jeremy Orlo↵ and Jonathan Bloom

1 Learning Goals

1. Know the three overlapping “phases” of statistical practice.

2. Know what is meant by the term statistic.

2 Introduction to statistics

Statistics deals with data. Generally speaking, the goal of statistics is to make inferences
based on data. We can divide this the process into three phases: collecting data, describing
data and analyzing data. This fits into the paradigm of the scientific method. We make
hypotheses about what’s true, collect data in experiments, describe the results, and then
infer from the results the strength of the evidence concerning our hypotheses.

2.1 Experimental design

The design of an experiment is crucial to making sure the collected data is useful. The
adage ‘garbage in, garbage out’ applies here. A poorly designed experiment will produce
poor quality data, from which it may be impossible to draw useful, valid inferences. To
quote R.A. Fisher one of the founders of modern statistics,

To consult a statistician after an experiment is finished is often merely to ask

him to conduct a post-mortem examination. He can perhaps say what the
experiment died of.

2.2 Descriptive statistics

Raw data often takes the form of a massive list, array, or database of labels and numbers.
To make sense of the data, we can calculate summary statistics like the mean, median, and
interquartile range. We can also visualize the data using graphical devices like histograms,
scatterplots, and the empirical cdf. These methods are useful for both communicating and
exploring the data to gain insight into its structure, such as whether it might follow a
familiar probability distribution.

2.3 Inferential statistics

Ultimately we want to draw inferences about the world. Often this takes the form of
specifying a statistical model for the random process by which the data arises. For example,
suppose the data takes the form of a series of measurements whose error we believe follows
a normal distribution. (Note this is always an approximation since we know the error must

1
18.05 class 10, Introduction to Statistics, Spring 2014 2

have some bound while a normal distribution has range ( 1, 1).) We might then use the
data to provide evidence for or against this hypothesis. Our focus in 18.05 will be on how
to use data to draw inferences about model parameters. For example, assuming gestational
length follows a N (µ, ) distribution, we’ll use the data of the gestational lengths of, say,
500 pregnancies to draw inferences about the values of the parameters µ and . Similarly,
we may model the result of a two-candidate election by a Bernoulli(p) distribution, and use
poll data to draw inferences about the value of p.
We can rarely make definitive statements about such parameters because the data itself
comes from a random process (such as choosing who to poll). Rather, our statistical evidence
will always involve probability statements. Unfortunately, the media and public at large
are wont to misunderstand the probabilistic meaning of statistical statements. In fact,
researchers themselves often commit the same errors. In this course, we will emphasize the
meaning of statistical statements alongside the methods which produce them.
Example 1. To study the e↵ectiveness of new treatment for cancer, patients are recruited
and then divided into an experimental group and a control group. The experimental group
is given the new treatment and the control group receives the current standard of care.
Data collected from the patients might include demographic information, medical history,
initial state of cancer, progression of the cancer over time, treatment cost, and the e↵ect of
the treatment on tumor size, remission rates, longevity, and quality of life. The data will
be used to make inferences about the e↵ectiveness of the new treatment compared to the
current standard of care.
Notice that this study will go through all three phases described above. The experimental
design must specify the size of the study, who will be eligible to join, how the experimental
and control groups will be chosen, how the treatments will be administered, whether or
not the subjects or doctors know who is getting which treatment, and precisely what data
will be collected, among other things. Once the data is collected it must be described and
analyzed to determine whether it supports the hypothesis that the new treatment is more
(or less) e↵ective than the current one(s), and by how much. These statistical conclusions
will be framed as precise statements involving probabilities.
As noted above, misinterpreting the exact meaning of statistical statements is a common
source of error which has led to tragedy on more than one occasion.
Example 2. In 1999 in Great Britain, Sally Clark was convicted of murdering her two
children after each child died weeks after birth (the first in 1996, the second in 1998).
Her conviction was largely based on a faulty use of statistics to rule out sudden infant
death syndrome. Though her conviction was overturned in 2003, she developed serious
psychiatric problems during and after her imprisonment and died of alcohol poisoning in
2007. See http://en.wikipedia.org/wiki/Sally_Clark
This TED talk discusses the Sally Clark case and other instances of poor statistical intuition:
http://www.youtube.com/watch?v=kLmzxmRcUTo

2.4 What is a statistic?

We give a simple definition whose meaning is best elucidated by examples.

Definition. A statistic is anything that can be computed from the collected data.
18.05 class 10, Introduction to Statistics, Spring 2014 3

Example 3. Consider the data of 1000 rolls of a die. All of the following are statistics:
the average of the 1000 rolls; the number of times a 6 was rolled; the sum of the squares
of the rolls minus the number of even rolls. It’s hard to imagine how we would use the
last example, but it is a statistic. On the other hand, the probability of rolling a 6 is not a
statistic, whether or not the die is truly fair. Rather this probability is a property of the die
(and the way we roll it) which we can estimate using the data. Such an estimate is given
by the statistic ‘proportion of the rolls that were 6’.

Example 4. Suppose we treat a group of cancer patients with a new procedure and collect
data on how long they survive post-treatment. From the data we can compute the average
survival time of patients in the group. We might employ this statistic as an estimate of the
average survival time for future cancer patients following the new procedure. The actual
survival is not a statistic.

Example 5. Suppose we ask 1000 residents whether or not they support the proposal to
legalize marijuana in Massachusetts. The proportion of the 1000 who support the proposal
is a statistic. The proportion of all Massachusetts residents who support the proposal is
not a statistic since we have not queried every single one (note the word “collected” in the
definition). Rather, we hope to draw a statistical conclusion about the state-wide proportion
based on the data of our random sample.

The following are two general types of statistics we will use in 18.05.

1. Point statistics: a single value computed from data, such as the sample average xn or
the sample standard deviation sn .

2. Interval statistics: an interval [a, b] computed from the data. This is really just a pair of
point statistics, and will often be presented in the form x ± s.

3 Review of Bayes’ theorem

We cannot stress strongly enough how important Bayes’ theorem is to our view of inferential
statistics. Recall that Bayes’ theorem allows us to ‘invert’ conditional probabilities. That
is, if H and D are events, then Bayes’ theorem says
P (D|H )P (H)
P (H|D) = .
P (D)
In scientific experiments we start with a hypothesis and collect data to test the hypothesis.
We will often let H represent the event ‘our hypothesis is true’ and let D be the collected
data. In these words Bayes’ theorem says
P (data |hypothesis is true) · P (hypothesis is true)
P (hypothesis is true | data) =
P (data)
The left-hand term is the probability our hypothesis is true given the data we collected.
This is precisely what we’d like to know. When all the probabilities on the right are known
exactly, we can compute the probability on the left exactly. This will be our focus next
week. Unfortunately, in practice we rarely know the exact values of all the terms on the
18.05 class 10, Introduction to Statistics, Spring 2014 4

right. Statisticians have developed a number of ways to cope with this lack of knowledge
and still make useful inferences. We will be exploring these methods for the rest of the
course.
Example 6. Screening for a disease redux
Suppose a screening test for a disease has a 1% false positive rate and a 1% false negative
rate. Suppose also that the rate of the disease in the population is 0.002. Finally suppose
a randomly selected person tests positive. In the language of hypothesis and data we have:
Hypothesis: H = ‘the person has the disease’
Data: D = ‘the test was positive.’
What we want to know: P (H|D) = P (the person has the disease | a positive test)
In this example all the probabilities on the right are known so we can use Bayes’ theorem
to compute what we want to know.

P (hypothesis | data) = P (the person has the disease | a positive test)

= P (H|D)
P (D|H)P (H)
=
P (D)
.99 · .002
=
.99 · .002 + .01 · .998
= 0.166

Before the test we would have said the probability the person had the disease was 0.002.
After the test we see the probability is 0.166. That is, the positive test provides some
evidence that the person has the disease.
Maximum Likelihood Estimates
Class 10, 18.05
Jeremy Orlo↵ and Jonathan Bloom

1 Learning Goals
1. Be able to define the likelihood function for a parametric model given data.
2. Be able to compute the maximum likelihood estimate of unknown parameter(s).

2 Introduction

Suppose we know we have data consisting of values x1 , . . . , xn drawn from an exponential

distribution. The question remains: which exponential distribution?!
We have casually referred to the exponential distribution or the binomial distribution or the
normal distribution. In fact the exponential distribution exp( ) is not a single distribution
but rather a one-parameter family of distributions. Each value of defines a di↵erent dis-
tribution in the family, with pdf f (x) = e x on [0, 1). Similarly, a binomial distribution
bin(n, p) is determined by the two parameters n and p, and a normal distribution N (µ, 2 )
is determined by the two parameters µ and 2 (or equivalently, µ and ). Parameterized
families of distributions are often called parametric distributions or parametric models.
We are often faced with the situation of having random data which we know (or believe)
is drawn from a parametric model, whose parameters we do not know. For example, in
an election between two candidates, polling data constitutes draws from a Bernoulli(p)
distribution with unknown parameter p. In this case we would like to use the data to
estimate the value of the parameter p, as the latter predicts the result of the election.
Similarly, assuming gestational length follows a normal distribution, we would like to use
the data of the gestational lengths from a random sample of pregnancies to draw inferences
about the values of the parameters µ and 2 .
Our focus so far has been on computing the probability of data arising from a parametric
model with known parameters. Statistical inference flips this on its head: we will estimate
the probability of parameters given a parametric model and observed data drawn from it.
In the coming weeks we will see how parameter values are naturally viewed as hypotheses,
so we are in fact estimating the probability of various hypotheses given the data.

3 Maximum Likelihood Estimates

There are many methods for estimating unknown parameters from data. We will first
consider the maximum likelihood estimate (MLE), which answers the question:

For which parameter value does the observed data have the biggest probability?

The MLE is an example of a point estimate because it gives a single value for the unknown
parameter (later our estimates will involve intervals and probabilities). Two advantages of

1
18.05 class 10, Maximum Likelihood Estimates , Spring 2014 2

the MLE are that it is often easy to compute and that it agrees with our intuition in simple
examples. We will explain the MLE through a series of examples.

Example 1. A coin is flipped 100 times. Given that there were 55 heads, find the maximum
likelihood estimate for the probability p of heads on a single toss.
Before actually solving the problem, let’s establish some notation and terms.
We can think of counting the number of heads in 100 tosses as an experiment. For a given
value of p, the probability of getting 55 heads in this experiment is the binomial probability
✓ ◆
100 55
P (55 heads) = p (1 p)45 .
55
The probability of getting 55 heads depends on the value of p, so let’s include p in by using
the notation of conditional probability:
✓ ◆
100 55
P (55 heads | p) = p (1 p)45 .
55

You should read P (55 heads | p) as:

‘the probability of 55 heads given p,’
or more precisely as
‘the probability of 55 heads given that the probability of heads on a single toss is p.’
Here are some standard terms we will use as we do statistics.

• Experiment: Flip the coin 100 times and count the number of heads.

• Data: The data is the result of the experiment. In this case it is ‘55 heads’.

• Parameter(s) of interest: We are interested in the value of the unknown parameter p.

• Likelihood, or likelihood function: this is P (data | p). Note it is a function of both the
data and the parameter p. In this case the likelihood is
✓ ◆
100 55
P (55 heads | p) = p (1 p)45 .
55

Notes: 1. The likelihood P (data | p) changes as the parameter of interest p changes.

2. Look carefully at the definition. One typical source of confusion is to mistake the likeli-
hood P (data | p) for P (p | data). We know from our earlier work with Bayes’ theorem that
P (data | p) and P (p | data) are usually very di↵erent.

Definition: Given data the maximum likelihood estimate (MLE) for the parameter p is
the value of p that maximizes the likelihood P (data | p). That is, the MLE is the value of
p for which the data is most likely.
answer: For the problem at hand, we saw above that the likelihood
✓ ◆
100 55
P (55 heads | p) = p (1 p)45 .
55
18.05 class 10, Maximum Likelihood Estimates , Spring 2014 3

We’ll use the notation p̂ for the MLE. We use calculus to find it by taking the derivative of
the likelihood function and setting it to 0.
✓ ◆
d 100
P (data |p) = (55p54 (1 p)45 45p55 (1 p)44 ) = 0.
dp 55
Solving this for p we get

55p54 (1 p)45 = 45p55 (1 p)44

55(1 p) = 45p
55 = 100p
the MLE is p̂ = .55

Note: 1. The MLE for p turned out to be exactly the fraction of heads we saw in our data.
2. The MLE is computed from the data. That is, it is a statistic.
3. Officially you should check that the critical point is indeed a maximum. You can do this
with the second derivative test.

3.1 Log likelihood

If is often easier to work with the natural log of the likelihood function. For short this is
simply called the log likelihood. Since ln(x) is an increasing function, the maxima of the
likelihood and log likelihood coincide.

Example 2. Redo the previous example using log likelihood.

answer: We had the likelihood P (55 heads | p) = 100 55
55 p (1 p)45 . Therefore the log
likelihood is
✓✓ ◆◆
100
ln(P (55 heads | p) = ln + 55 ln(p) + 45 ln(1 p).
55
Maximizing likelihood is the same as maximizing log likelihood. We check that calculus
gives us the same answer as before:
 ✓✓ ◆◆
d d 100
(log likelihood) = ln + 55 ln(p) + 45 ln(1 p)
dp dp 55
55 45
= =0
p 1 p
) 55(1 p) = 45p
) p̂ = .55

3.2 Maximum likelihood for continuous distributions

For continuous distributions, we use the probability density function to define the likelihood.
We show this in a few examples. In the next section we explain how this is analogous to
what we did in the discrete case.
18.05 class 10, Maximum Likelihood Estimates , Spring 2014 4

Example 3. Light bulbs

Suppose that the lifetime of Badger brand light bulbs is modeled by an exponential distri-
bution with (unknown) parameter . We test 5 bulbs and find they have lifetimes of 2, 3,
1, 3, and 4 years, respectively. What is the MLE for ?
answer: We need to be careful with our notation. With five di↵erent values it is best to
use subscripts. Let Xj be the lifetime of the ith bulb and let xi be the value Xi takes. Then
each Xi has pdf fXi (xi ) = e xi . We assume the lifetimes of the bulbs are independent,
so the joint pdf is the product of the individual densities:
x1 x2 x3 x4 x5 5 (x1 +x2 +x3 +x4 +x5 )
f (x1 , x2 , x3 , x4 , x5 | ) = ( e )( e )( e )( e )( e )= e .

Note that we write this as a conditional density, since it depends on . Viewing the data
as fixed and as variable, this density is the likelihood function. Our data had values

x1 = 2, x2 = 3, x3 = 1, x4 = 3, x5 = 4.

So the likelihood and log likelihood functions with this data are
5 13
f (2, 3, 1, 3, 4 | ) = e , ln(f (2, 3, 1, 3, 4 | ) = 5 ln( ) 13

Finally we use calculus to find the MLE:

d 5 5
(log likelihood) = 13 = 0 ) ˆ = .
d 13
Note: 1. In this example we used an uppercase letter for a random variable and the
corresponding lowercase letter for the value it takes. This will be our usual practice.
2. The MLE for turned out to be the reciprocal of the sample mean x̄, so X ⇠ exp( ˆ )
satisfies E(X) = x̄.

The following example illustrates how we can use the method of maximum likelihood to
estimate multiple parameters at once.
Example 4. Normal distributions
Suppose the data x1 , x2 , . . . , xn is drawn from a N(µ, 2 ) distribution, where µ and are
unknown. Find the maximum likelihood estimate for the pair (µ, 2 ).
answer: Let’s be precise and phrase this in terms of random variables and densities. Let
uppercase X1 , . . . , Xn be i.i.d. N(µ, 2 ) random variables, and let lowercase xi be the value
Xi takes. The density for each Xi is
1 (xi µ)2
fXi (xi ) = p e 2 2 .
2⇡
Since the Xi are independent their joint pdf is the product of the individual pdf’s:
✓ ◆n P
1 n (xi µ)2
f (x1 , . . . , xn | µ, ) = p e i=1 2 2 .
2⇡
For the fixed data x1 , . . . , xn , the likelihood and log likelihood are
✓ ◆n P n
X
1 n (xi µ)2 p (xi µ)2
f (x1 , . . . , xn |µ, ) = p e i=1 2 2 , ln(f (x1 , . . . , xn |µ, )) = n ln( 2⇡) n ln( ) 2
.
2⇡ 2
i=1
18.05 class 10, Maximum Likelihood Estimates , Spring 2014 5

Since ln(f (x1 , . . . , xn |µ, )) is a function of the two variables µ, we use partial derivatives
to find the MLE. The easy value to find is µ̂:
n n Pn
@f (x1 , . . . , xn |µ, ) X (xi µ) X
i=1 xi
= 2
=0 ) xi = nµ ) µ̂ = = x.
@µ n
i=1 i=1

To find ˆ we di↵erentiate and solve for :

n
X Pn
@f (x1 , . . . , xn |µ, ) n (xi µ)2 2 i=1 (xi µ)2
= + 3
=0 ) ˆ = .
@ n
i=1

We already know µ̂ = x, so we use that as the value for µ in the formula for ˆ. We get the
maximum likelihood estimates
µ̂ =x = the mean of the data
n
X n
X
1 1
ˆ2 = (xi µ̂)2 = (xi x)2 = the variance of the data.
n n
i=1 i=1

Example 5. Uniform distributions

Suppose our data x1 , . . . xn are independently drawn from a uniform distribution U (a, b).
Find the MLE estimate for a and b.
answer: This example is di↵erent from the previous ones in that we won’t use calculus to
find the MLE. The density for U (a, b) is b 1 a on [a, b]. Therefore our likelihood function is
(⇣ ⌘n
1
b a if all xi are in the interval [a, b]
f (x1 , . . . , xn | a, b) =
0 otherwise.

This is maximized by making b a as small as possible. The only restriction is that the
interval [a, b] must include all the data. Thus the MLE for the pair (a, b) is

â = min(x1 , . . . , xn ) ˆb = max(x1 , . . . , xn ).

Example 6. Capture/recapture method

The capture/recapture method is a way to estimate the size of a population in the wild.
The method assumes that each animal in the population is equally likely to be captured by
a trap.
Suppose 10 animals are captured, tagged and released. A few months later, 20 animals are
captured, examined, and released. 4 of these 20 are found to be tagged. Estimate the size
of the wild population using the MLE for the probability that a wild animal is tagged.
answer: Our unknown parameter n is the number of animals in the wild. Our data is that
4 out of 20 recaptured animals were tagged (and that there are 10 tagged animals). The
likelihood function is
n 10 10
16 4
P (data | n animals) = n
20
(The numerator is the number of ways to choose 16 animals from among the n 10 untagged
ones times the number of was to choose 4 out of the 10 tagged animals. The denominator
18.05 class 10, Maximum Likelihood Estimates , Spring 2014 6

is the number of ways to choose 20 animals from the entire population of n.) We can use
R to compute that the likelihood function is maximized when n = 50. This should make
some sense. It says our best estimate is that the fraction of all animals that are tagged is
10/50 which equals the fraction of recaptured animals which are tagged.

Example 7. Hardy-Weinberg. Suppose that a particular gene occurs as one of two

alleles (A and a), where allele A has frequency ✓ in the population. That is, a random copy
of the gene is A with probability ✓ and a with probability 1 ✓. Since a diploid genotype
consists of two genes, the probability of each genotype is given by:

genotype AA Aa aa
probability ✓2 2✓(1 ✓) (1 ✓)2

Suppose we test a random sample of people and find that k1 are AA, k2 are Aa, and k3 are
aa. Find the MLE of ✓.
answer: The likelihood function is given by
✓ ◆✓ ◆✓ ◆
k1 + k2 + k3 k2 + k3 k3 2k1
P (k1 , k2 , k3 | ✓) = ✓ (2✓(1 ✓))k2 (1 ✓)2k3 .
k1 k2 k3
So the log likelihood is given by

constant + 2k1 ln(✓) + k2 ln(✓) + k2 ln(1 ✓) + 2k3 ln(1 ✓)

We set the derivative equal to zero:

2k1 + k2 k2 + 2k3
=0
✓ 1 ✓
Solving for ✓, we find the MLE is
2k1 + k2
✓ˆ = ,
2k1 + 2k2 + 2k3
which is simply the fraction of A alleles among all the genes in the sampled population.

4 Why we use the density to find the MLE for continuous

distributions

The idea for the maximum likelihood estimate is to find the value of the parameter(s) for
which the data has the highest probability. In this section we ’ll see that we’re doing this
is really what we are doing with the densities. We will do this by considering a smaller
version of the light bulb example.
Example 8. Suppose we have two light bulbs whose lifetimes follow an exponential( )
distribution. Suppose also that we independently measure their lifetimes and get data
x1 = 2 years and x2 = 3 years. Find the value of that maximizes the probability of this
data.
answer: The main paradox to deal with is that for a continuous distribution the probability
of a single value, say x1 = 2, is zero. We resolve this paradox by remembering that a single
18.05 class 10, Maximum Likelihood Estimates , Spring 2014 7

measurement really means a range of values, e.g. in this example we might check the light
bulb once a day. So the data x1 = 2 years really means x1 is somewhere in a range of 1 day
around 2 years.
If the range is small we call it dx1 . The probability that X1 is in the range is approximated
by fX1 (x1 | ) dx1 . This is illustrated in the figure below. The data value x2 is treated in
exactly the same way.
density fX1 (x1 | ) density fX2 (x2 | )

probability ⇡ fX1 (x1 | ) dx1 probability ⇡ fX2 (x2 | ) dx2

dx1
dx2
x x
x1 x2

The usual relationship between density and probability for small ranges.

Since the data is collected independently the joint probability is the product of the individual
probabilities. Stated carefully

P (X1 in range, X2 in range| ) ⇡ fX1 (x1 | ) dx1 · fX2 (x2 | ) dx2

Finally, using the values x1 = 2 and x2 = 3 and the formula for an exponential pdf we have
2 3 2 5
P (X1 in range, X2 in range| ) ⇡ e dx1 · e dx2 = e dx1 dx2 .

Now that we have a genuine probability we can look for the value of that maximizes it.
Looking at the formula above we see that the factor dx1 dx2 will play no role in finding the
maximum. So for the MLE we drop it and simply call the density the likelihood:
2 5
likelihood = f (x1 , x2 | ) = e .

The value of that maximizes this is found just like in the example above. It is ˆ = 2/5.

5 Appendix: Properties of the MLE

For the interested reader, we note several nice features of the MLE. These are quite technical
and will not be on any exams.
The MLE behaves well under transformations. That is, if p̂ is the MLE for p and g is a
one-to-one function, then g(p̂) is the MLE for g(p). For example, if ˆ is the MLE for the
standard deviation then (ˆ)2 is the MLE for the variance 2 .
Furthermore, the MLE is asymptotically unbiased and has asymptotically minimal variance.
To explain these notions, note that the MLE is itself a random variable since the data is
random and the MLE is computed from the data. Let x1 , x2 , . . . be an infinite sequence of
samples from a distribution with parameter p. Let p̂n be the MLE for p based on the data
x1 , . . . , x n .
Asymptotically unbiased means that as the amount of data grows, the mean of the MLE
converges to p. In symbols: E(p̂n ) ! p as n ! 1. Of course, we would like the MLE to be
18.05 class 10, Maximum Likelihood Estimates , Spring 2014 8

close to p with high probability, not just on average, so the smaller the variance of the MLE
the better. Asymptotically minimal variance means that as the amount of data grows, the
MLE has the minimal variance among all unbiased estimators of p. In symbols: for any
unbiased estimator p̃n and ✏ > 0 we have that Var(p̃n ) + ✏ > Var(p̂n ) as n ! 1.
Bayesian Updating with Discrete Priors
Class 11, 18.05
Jeremy Orlo↵ and Jonathan Bloom

1 Learning Goals
1. Be able to apply Bayes’ theorem to compute probabilities.
2. Be able to define the and to identify the roles of prior probability, likelihood (Bayes
term), posterior probability, data and hypothesis in the application of Bayes’ Theorem.
3. Be able to use a Bayesian update table to compute posterior probabilities.

2 Review of Bayes’ theorem

Recall that Bayes’ theorem allows us to ‘invert’ conditional probabilities. If H and D are
events, then:
P (D | H)P (H)
P (H | D) =
P (D)
Our view is that Bayes’ theorem forms the foundation for inferential statistics. We will
begin to justify this view today.

2.1 The base rate fallacy

When we first learned Bayes’ theorem we worked an example about screening tests showing
that P (D|H) can be very di↵erent from P (H|D). In the appendix we work a similar example.
If you are not comfortable with Bayes’ theorem you should read the example in the appendix
now.

3 Terminology and Bayes’ theorem in tabular form

We now use a coin tossing problem to introduce terminology and a tabular format for Bayes’
theorem. This will provide a simple, uncluttered example that shows our main points.
Example 1. There are three types of coins which have di↵erent probabilities of landing
heads when tossed.

• Type A coins are fair, with probability 0.5 of heads

• Type B coins are bent and have probability 0.6 of heads
• Type C coins are bent and have probability 0.9 of heads

Suppose I have a drawer containing 5 coins: 2 of type A, 2 of type B, and 1 of type C. I

reach into the drawer and pick a coin at random. Without showing you the coin I flip it
once and get heads. What is the probability it is type A? Type B? Type C?

1
18.05 class 11, Bayesian Updating with Discrete Priors, Spring 2014 2

answer: Let A, B, and C be the event that the chosen coin was type A, type B, and type
C. Let D be the event that the toss is heads. The problem asks us to find
P (A|D), P (B|D), P (C|D).
Before applying Bayes’ theorem, let’s introduce some terminology.

• Experiment: pick a coin from the drawer at random, flip it, and record the result.
• Data: the result of our experiment. In this case the event D = ‘heads’. We think of
D as data that provides evidence for or against each hypothesis.
• Hypotheses: we are testing three hypotheses: the coin is type A, B or C.
• Prior probability: the probability of each hypothesis prior to tossing the coin (collect-
ing data). Since the drawer has 2 coins of type A, 2 of type B and 1 or type C we
have
P (A) = 0.4, P (B) = 0.4, P (C) = 0.2.

• Likelihood: (This is the same likelihood we used for the MLE.) The likelihood function
is P (D|H), i.e., the probability of the data assuming that the hypothesis is true. Most
often we will consider the data as fixed and let the hypothesis vary. For example,
P (D|A) = probability of heads if the coin is type A. In our case the likelihoods are
P (D|A) = 0.5, P (D|B) = 0.6, P (D|C) = 0.9.

The name likelihood is so well established in the literature that we have to teach
it to you. However in colloquial language likelihood and probability are synonyms.
This leads to the likelihood function often being confused with the probabity of a
hypothesis. Because of this we’d prefer to use the name Bayes’ term. However since
we are stuck with ‘likelihood’ we will try to use it very carefully and in a way that
minimizes any confusion.
• Posterior probability: the probability (posterior to) of each hypothesis given the data
from tossing the coin.
P (A|D), P (B|D), P (C|D).
These posterior probabilities are what the problem asks us to find.

We now use Bayes’ theorem to compute each of the posterior probabilities. We are going
to write this out in complete detail so we can pick out each of the parts (Remember that
the data D is that the toss was heads.)
First we organize the probabilities into a tree:

0.4 0.4 0.2

A B C
0.5 0.5 0.6 0.4 0.9 0.1
H T H T H T

Probability tree for choosing and tossing a coin.

18.05 class 11, Bayesian Updating with Discrete Priors, Spring 2014 3

P (D|A)P (A)
Bayes’ theorem says, e.g. P (A|D) = . The denominator P (D) is computed
P (D)
using the law of total probability:

P (D) = P (D|A)P (A) + P (D|B)P (B) + P (D|C)P (C) = 0.5 · 0.4 + 0.6 · 0.4 + 0.9 · 0.2 = 0.62.

Now each of the three posterior probabilities can be computed:

P (D|A)P (A) 0.5 · 0.4 0.2

Notice that the total probability P (D) is the same in each of the denominators and that it
is the sum of the three numerators. We can organize all of this very neatly in a Bayesian
update table:
Bayes
hypothesis prior likelihood numerator posterior
H P (H) P (D|H) P (D|H)P (H) P (H|D)
A 0.4 0.5 0.2 0.3226
B 0.4 0.6 0.24 0.3871
C 0.2 0.9 0.18 0.2903
total 1 0.62 1
The Bayes numerator is the product of the prior and the likelihood. We see in each of the
Bayes’ formula computations above that the posterior probability is obtained by dividing
the Bayes numerator by P (D) = 0.625. We also see that the law of law of total probability
says that P (D) is the sum of the entries in the Bayes numerator column.
Bayesian updating: The process of going from the prior probability P (H) to the pos-
terior P (H|D) is called Bayesian updating. Bayesian updating uses the data to alter our
understanding of the probability of each of the possible hypotheses.

3.1 Important things to notice

1. There are two types of probabilities: Type one is the standard probability of data, e.g.
the probability of heads is p = 0.9. Type two is the probability of the hypotheses, e.g.
the probability the chosen coin is type A, B or C. This second type has prior (before
the data) and posterior (after the data) values.

2. The posterior (after the data) probabilities for each hypothesis are in the last column.
We see that coin B is now the most probable, though its probability has decreased from
a prior probability of 0.4 to a posterior probability of 0.39. Meanwhile, the probability
of type C has increased from 0.2 to 0.29.

3. The Bayes numerator column determines the posterior probability column. To compute
the latter, we simply rescaled the Bayes numerator so that it sums to 1.
18.05 class 11, Bayesian Updating with Discrete Priors, Spring 2014 4

4. If all we care about is finding the most likely hypothesis, the Bayes numerator works as
well as the normalized posterior.

5. The likelihood column does not sum to 1. The likelihood function is not a probability
function.

6. The posterior probability represents the outcome of a ‘tug-of-war’ between the likelihood
and the prior. When calculating the posterior, a large prior may be deflated by a small
likelihood, and a small prior may be inflated by a large likelihood.

7. The maximum likelihood estimate (MLE) for Example 1 is hypothesis C, with a likeli-
hood P (D|C) = 0.9. The MLE is useful, but you can see in this example that it is not
the entire story, since type B has the greatest posterior probability.

Terminology in hand, we can express Bayes’ theorem in various ways:

P (D|H)P (H)
P (H|D) =
P (D)
P (data|hypothesis)P (hypothesis)
P (hypothesis|data) =
P (data)

With the data fixed, the denominator P (D) just serves to normalize the total posterior prob-
ability to 1. So we can also express Bayes’ theorem as a statement about the proportionality
of two functions of H (i.e, of the last two columns of the table).

P (hypothesis|data) / P (data|hypothesis)P (hypothesis)

This leads to the most elegant form of Bayes’ theorem in the context of Bayesian updating:

posterior / likelihood ⇥ prior

3.2 Prior and posterior probability mass functions

Earlier in the course we saw that it is convenient to use random variables and probability
mass functions. To do this we had to assign values to events (head is 1 and tails is 0). We
will do the same thing in the context of Bayesian updating.
Our standard notations will be:

• ✓ is the value of the hypothesis.

• p(✓) is the prior probability mass function of the hypothesis.

• p(✓|D) is the posterior probability mass function of the hypothesis given the data.

• p(D|✓) is the likelihood function. (This is not a pmf!)

In Example 1 we can represent the three hypotheses A, B, and C by ✓ = 0.5, 0.6, 0.9. For
the data we’ll let x = 1 mean heads and x = 0 mean tails. Then the prior and posterior
probabilities in the table define the prior and posterior probability mass functions.
18.05 class 11, Bayesian Updating with Discrete Priors, Spring 2014 5

Hypothesis ✓ prior pmf p(✓) poster pmf p(✓|x = 1)

A 0.5 P (A) = p(0.5) = 0.4 P (A|D) = p(0.5|x = 1) = 0.3226
B 0.6 P (B) = p(0.6) = 0.4 P (B|D) = p(0.6|x = 1) = 0.3871
C 0.9 P (C) = p(0.9) = 0.2 P (C |D) = p(0.9|x = 1) = 0.2903
Here are plots of the prior and posterior pmf’s from the example.
p(✓) p(✓|x = 1)

.4 .4

.2 .2

✓ ✓
.5 .6 .9 .5 .6 .9
Prior pmf p(✓) and posterior pmf p(✓|x = 1) for Example 1

If the data was di↵erent then the likelihood column in the Bayesian update table would be
di↵erent. We can plan for di↵erent data by building the entire likelihood table ahead of
time. In the coin example there are two possibilities for the data: the toss is heads or the
toss is tails. So the full likelihood table has two likelihood columns:
hypothesis likelihood p(x|✓)
✓ p(x = 0|✓) p(x = 1|✓)
0.5 0.5 0.5
0.6 0.4 0.6
0.9 0.1 0.9
Example 2. Using the notation p(✓), etc., redo Example 1 assuming the flip was tails.
answer: Since the data has changed, the likelihood column in the Bayesian update table is
now for x = 0. That is, we must take the p(x = 0|✓) column from the likelihood table.
Bayes
hypothesis prior likelihood numerator posterior
✓ p(✓) p(x = 0 | ✓) p(x = 0 | ✓)p(✓) p(✓ | x = 0)
0.5 0.4 0.5 0.2 0.5263
0.6 0.4 0.4 0.16 0.4211
0.9 0.2 0.1 0.02 0.0526
total 1 0.38 1
Now the probability of type A has increased from 0.4 to 0.5263, while the probability of
type C has decreased from 0.2 to only 0.0526. Here are the corresponding plots:
18.05 class 11, Bayesian Updating with Discrete Priors, Spring 2014 6

p(✓) p(✓|x = 0)

.4 .4

.2 .2

✓ ✓
.5 .6 .9 .5 .6 .9

Prior pmf p(✓) and posterior pmf p(✓|x = 0)

3.3 Food for thought.

Suppose that in Example 1 you didn’t know how many coins of each type were in the
drawer. You picked one at random and got heads. How would you go about deciding which
hypothesis (coin type) if any was most supported by the data?

4 Updating again and again

In life we are continually updating our beliefs with each new experience of the world. In
Bayesian inference, after updating the prior to the posterior, we can take more data and
update again! For the second update, the posterior from the first data becomes the prior
for the second data.
Example 3. Suppose you have picked a coin as in Example 1. You flip it once and get
heads. Then you flip the same coin and get heads again. What is the probability that the
coin was type A? Type B? Type C?
answer: As we update several times the table gets big, so we use a smaller font to fit it in:
Bayes Bayes
hypothesis prior likelihood 1 numerator 1 likelihood 2 numerator 2 posterior 2
✓ p(✓) p(x1 = 1|✓) p(x1 = 1|✓)p(✓) p(x2 = 1|✓) p(x2 = 1|✓)p(x1 = 1|✓)p(✓) p(✓|x1 = 1, x2 = 1)
0.5 0.4 0.5 0.2 0.5 0.1 0.2463
0.6 0.4 0.6 0.24 0.6 0.144 0.3547
0.9 0.2 0.9 0.18 0.9 0.162 0.3990
total 1 0.406 1

Note that the second Bayes numerator is computed by multiplying the first Bayes numerator
and the second likelihood; since we are only interested in the final posterior, there is no
need to normalize until the last step. As shown in the last column and plot, after two heads
the type C hypothesis has finally taken the lead!
18.05 class 11, Bayesian Updating with Discrete Priors, Spring 2014 7

p(✓) p(✓|x = 1) p(✓|x1 = 1, x2 = 1)

.4 .4 .4

.2 .2 .2

✓ ✓ ✓
.5 .6 .9 .5 .6 .9 .5 .6 .9

prior p(✓), first posterior p(✓|x1 = 1), second posterior p(✓|x1 = 1, x2 = 1)

5 Appendix: the base rate fallacy

Example 4. A screening test for a disease is both sensitive and specific. By that we mean
it is usually positive when testing a person with the disease and usually negative when
testing someone without the disease. Let’s assume the true positive rate is 99% and the
false positive rate is 2%. Suppose the prevalence of the disease in the general population is
0.5%. If a random person tests positive, what is the probability that they have the disease?
answer: As a review we first do the computation using trees. Next we will redo the
computation using tables.
Let’s use notation established above for hypotheses and data: let H+ be the hypothesis
(event) that the person has the disease and let H be the hypothesis they do not. Likewise,
let T+ and T represent the data of a positive and negative screening test respectively. We
are asked to compute P (H+ |T+ ).
We are given

P (T+ |H+ ) = 0.99, P (T+ |H ) = 0.02, P (H+ ) = 0.005.

From these we can compute the false negative and true negative rates:

P (T |H+ ) = 0.01, P (T |H ) = 0.98

All of these probabilities can be displayed quite nicely in a tree.

0.005 0.995
H+ H
0.99 0.01 0.02 0.98

T+ T T+ T

Bayes’ theorem yields

P (T+ |H+ )P (H+ ) 0.99 · 0.005

P (H+ |T+ ) = = = 0.19920 ⇡ 20%
P (T+ ) 0.99 · 0.005 + 0.02 · 0.995

Now we redo this calculation using a Bayesian update table:

18.05 class 11, Bayesian Updating with Discrete Priors, Spring 2014 8

Bayes
hypothesis prior likelihood numerator posterior
H P (H) P (T+ |H) P (T+ |H)P (H) P (H|T+ )
H+ 0.005 0.99 0.00495 0.19920
H 0.995 0.02 0.01990 0.80080
total 1 NO SUM 0.02485 1
The table shows that the posterior probability P (H+ |T+ ) that a person with a positive test
has the disease is about 20%. This is far less than the sensitivity of the test (99%) but
much higher than the prevalence of the disease in the general population (0.5%).
Bayesian Updating: Probabilistic Prediction
Class 12, 18.05
Jeremy Orlo↵ and Jonathan Bloom

1 Learning Goals

1. Be able to use the law of total probability to compute prior and posterior predictive
probabilities.

2 Introduction

In the previous class we looked at updating the probability of hypotheses based on data.
We can also use the data to update the probability of each possible outcome of a future
experiment. In this class we will look at how this is done.

2.1 Probabilistic prediciton; words of estimative probability (WEP)

There are many ways to word predictions:

• Prediction: “It will rain tomorrow.”

• Prediction using words of estimative probability (WEP): “It is likely to rain tomor-
row.”

• Probabilistic prediction: “Tomorrow it will rain with probability 60% (and not rain
with probability 40%).”

Each type of wording is appropriate at di↵erent times.

In this class we are going to focus on probabilistic prediction and precise quantitative state-
ments. You can see http://en.wikipedia.org/wiki/Words_of_Estimative_Probability
for an interesting discussion about the appropriate use of words of estimative probability.
The article also contains a list of weasel words such as ‘might’, ‘cannot rule out’, ‘it’s
conceivable’ that should be avoided as almost certain to cause confusion.
There are many places where we want to make a probabilistic prediction. Examples are

• Medical treatment outcomes

• Weather forecasting

• Climate change

• Sports betting

• Elections

• ...

1
18.05 class 12, Bayesian Updating: Probabilistic Prediction, Spring 2014 2

These are all situations where there is uncertainty about the outcome and we would like as
precise a description of what could happen as possible.

3 Predictive Probabilities

Probabilistic prediction simply means assigning a probability to each possible outcomes of

an experiment.
Recall the coin example from the previous class notes: there are three types of coins which
are indistinguishable apart from their probability of landing heads when tossed.

• Type A coins are fair, with probability 0.5 of heads

• Type B coins have probability 0.6 of heads
• Type C coins have probability 0.9 of heads

You have a drawer containing 4 coins: 2 of type A, 1 of type B, and 1 of type C. You reach
into the drawer and pick a coin at random. We let A stand for the event ‘the chosen coin
is of type A’. Likewise for B and C.

3.1 Prior predictive probabilities

Before taking data we can compute the probability that our chosen coin will land heads (or
tails) if flipped. Let DH be the event it lands heads and let DT the event it lands tails. We
can use the law of total probability to determine the probabilities of these events. Either
by drawing a tree or directly proceeding to the algebra, we get:

.5 .25 .25
A B C Coin type
.5 .5 .6 .4 .9 .1
Flip result
DH DT DH DT DH DT

Definition: These probabilities give a (probabilistic) prediction of what will happen if the
coin is tossed. Because they are computed before we collect any data they are called prior
predictive probabilities.

3.2 Posterior predictive probabilities

Suppose we flip the coin once and it lands heads. We now have data D, which we can use
to update the prior probabilities of our hypotheses to posterior probabilities. Last class we
learned to use a Bayes table to facilitate this computation:
18.05 class 12, Bayesian Updating: Probabilistic Prediction, Spring 2014 3

Bayes
hypothesis prior likelihood numerator posterior
H P (H) P (D|H) P (D|H)P (H) P (H|D)
A 0.5 0.5 0.25 0.4
B 0.25 0.6 0.15 0.24
C 0.25 0.9 0.225 0.36
total 1 0.625 1
Having flipped the coin once and gotten heads, we can compute the probability that our
chosen coin will land heads (or tails) if flipped a second time. We proceed just as before, but
using the posterior probabilities P (A|D), P (B|D), P (C|D) in place of the prior probabilities
P (A), P (B), P (C).

.4 .24 .36
A B C Coin type
.5 .5 .6 .4 .9 .1
Flip result
DH DT DH DT DH DT

P (DH |D) = P (DH |A)P (A|D) + P (DH |B)P (B|D) + P (DH |C)P (C|D)
= 0.5 · 0.4 + 0.6 · 0.24 + 0.9 · 0.36 = 0.668
P (DT |D) = P (DT |A)P (A|D) + P (DT |B)P (B|D) + P (DT |C)P (C|D)
= 0.5 · 0.4 + 0.4 · 0.24 + 0.1 · 0.36 = 0.332

Definition: These probabilities give a (probabilistic) prediction of what will happen if the
coin is tossed again. Because they are computed after collecting data and updating the
prior to the posterior, they are called posterior predictive probabilities.
Note that heads on the first toss increases the probability of heads on the second toss.

3.3 Review

Here’s a succinct description of the preceding sections that may be helpful:

Each hypothesis gives a di↵erent probability of heads, so the total probability of heads is
a weighted average. For the prior predictive probability of heads, the weights are given by
the prior probabilities of the hypotheses. For the posterior predictive probability of heads,
the weights are given by the posterior probabilities of the hypotheses.
Remember: Prior and posterior probabilities are for hypotheses. Prior predictive and
posterior predictive probabilities are for data. To keep this straight, remember that the
latter predict future data.
Bayesian Updating: Odds
Class 12, 18.05
Jeremy Orlo↵ and Jonathan Bloom

1 Learning Goals

1. Be able to convert between odds and probability.

2. Be able to update prior odds to posterior odds using Bayes factors.

3. Understand how Bayes factors measure the extent to which data provides evidence for
or against a hypothesis.

2 Odds

When comparing two events, it common to phrase probability statements in terms of odds.
Definition The odds of event E versus event E 0 are the ratio of their probabilities P (E)/P (E 0 ).
If unspecified, the second event is assumed to be the complement E c . So the odds of E are:

P (E)
O(E) = .
P (E c )

For example, O(rain) = 2 means that the probability of rain is twice the probability of no
rain (2/3 versus 1/3). We might say ‘the odds of rain are 2 to 1.’
1/2
Example. For a fair coin, O(heads) = = 1. We might say the odds of heads are 1 to
1/2
1 or fifty-fifty.
1/6 1
Example. For a standard die, the odds of rolling a 4 are = . We might say the odds
5/6 5
are ‘1 to 5 for’ or ‘5 to 1 against’ rolling a 4.
Example. The probability of a pair in a five card poker hand is 0.42257. So the odds of a
pair are 0.42257/(1-0.42257) = 0.73181.

We can go back and forth between probability and odds as follows.

p q
Conversion formulas: if P (E) = p then O(E) = . If O(E) = q then P (E) = .
1 p 1+q
Notes:
1. The second formula simply solves q = p/(1 p) for p.
2. Probabilities are between 0 and 1, while odds are between 0 to 1.
3. The property P (E c ) = 1 P (E) becomes O(E c ) = 1/O(E).

Example. Let F be the event that a five card poker hand is a full house. Then P (F ) =
0.00145214 so O(F ) = 0.0014521/(1 0.0014521) = 0.0014542.
The odds not having a full house are O(F c ) = (1 0.0014521)/0.0014521 = 687 = 1/O(F ).

1
18.05 class 12, Bayesian Updating: Odds, Spring 2014 2

4. If P (E) or O(E) is small then O(E) ⇡ P (E). This follows from the conversion formulas.

Example. In the poker example where F = ‘full house’ we saw that P (F ) and O(F ) di↵er
only in the fourth significant digit.

3 Updating odds

3.1 Introduction

In Bayesian updating, we used the likelihood of data to update prior probabilities of hy-
potheses to posterior probabilities. In the language of odds, we will update prior odds to
posterior odds. One of our key points will be that the data can provide evidence supporting
or negating a hypothesis depending on whether its posterior odds are greater or less than
its prior odds.

3.2 Example: Marfan syndrome

Marfan syndrome is a genetic disease of connective tissue that occurs in 1 of every 15000
people. The main ocular features of Marfan syndrome include bilateral ectopia lentis (lens
dislocation), myopia and retinal detachment. About 70% of people with Marfan syndrome
have a least one of these ocular features; only 7% of people without Marfan syndrome do.
(We don’t guarantee the accuracy of these numbers, but they will work perfectly well for
our example.)
If a person has at least one of these ocular features, what are the odds that they have
Marfan syndrome?
answer: This is a standard Bayesian updating problem. Our hypotheses are:
M = ‘the person has Marfan syndrome’
M c = ‘the person does not have Marfan syndrome’
The data is:
F = ‘the person has at least one ocular feature’.
We are given the prior probability of M and the likelihoods of F given M or M c :

P (M ) = 1/15000, P (F |M ) = 0.7, P (F |M c ) = 0.07.

As before, we can compute the posterior probabilities using a table:

Bayes
hypothesis prior likelihood numerator posterior
H P (H) P (F |H) P (F |H)P (H) P (H|F )
M 0.000067 0.7 0.0000467 0.00066
Mc 0.999933 0.07 0.069995 0.99933
total 1 0.07004 1
First we find the prior odds:

P (M ) 1/15000 1
O(M ) = c
= = ⇡ 0.000067.
P (M ) 14999/15000 14999
18.05 class 12, Bayesian Updating: Odds, Spring 2014 3

The posterior odds are given by the ratio of the posterior probabilities or the Bayes numer-
ators, since the normalizing factor will be the same in both numerator and denominator.

P (M |F ) P (F |M )P (M )
O(M |F ) = c
= = 0.000667.
P (M |F ) P (F |M c )P (M c )

The posterior odds are a factor of 10 larger than the prior odds. In that sense, having an
ocular feature is strong evidence in favor of the hypothesis M . However, because the prior
odds are so small, it is still highly unlikely the person has Marfan syndrome.

4 Bayes factors and strength of evidence

The factor of 10 in the previous example is called a Bayes factor. The exact definition is
the following.
Definition: For a hypothesis H and data D, the Bayes factor is the ratio of the likelihoods:

P (D|H )
Bayes factor = .
P (D|H c )

Let’s see exactly where the Bayes factor arises in updating odds. We have

P (D|H)
= · O(H)
P (D|H c )

posterior odds = Bayes factor ⇥ prior odds

From this formula, we see that the Bayes’ factor (BF ) tells us whether the data provides
evidence for or against the hypothesis.

• If BF > 1 then the posterior odds are greater than the prior odds. So the data
provides evidence for the hypothesis.

• If BF < 1 then the posterior odds are less than the prior odds. So the data provides
evidence against the hypothesis.

• If BF = 1 then the prior and posterior odds are equal. So the data provides no
evidence either way.

The following example is taken from the textbook Information Theory, Inference, and
Learning Algorithms by David J. C. Mackay, who has this to say regarding trial evidence.
18.05 class 12, Bayesian Updating: Odds, Spring 2014 4

In my view, a jury’s task should generally be to multiply together carefully

evaluated likelihood ratios from each independent piece of admissible evidence
with an equally carefully reasoned prior probability. This view is shared by many
statisticians but learned British appeal judges recently disagreed and actually
overturned the verdict of a trial because the jurors had been taught to use Bayes’
theorem to handle complicated DNA evidence.

Example 1. Two people have left traces of their own blood at the scene of a crime. A
suspect , Oliver, is tested and found to have type ‘O’ blood. The blood groups of the two
traces are found to be of type ‘O’ (a common type in the local population, having frequency
60%) and type ‘AB’ (a rare type, with frequency 1%). Does this data (type ‘O’ and ‘AB’
blood were found at the scene) give evidence in favor of the proposition that Oliver was one
of the two people present at the scene of the crime?”
answer: There are two hypotheses:
S = ‘Oliver and another unknown person were at the scene of the crime
S c = ‘two unknown people were at the scene of the crime’
The data is:
D = ‘type ‘O’ and ‘AB’ blood were found’
P (D|S)
The Bayes factor for Oliver’s presence is BFOliver = . We compute the numerator
P (D|S c )
and denominator of this separately.
The data says that both type O and type AB blood were found. If Oliver was at the scene
then ‘type O’ blood would be there. So P (D|S) is the probability that the other person
had type AB blood. We are told this is .01, so P (D|S) = 0.01.
If Oliver was not at the scene then there were two random people one with type O and one
with type AB blood. The probability of this is 2 · 0.6 · 0.01. The factor of 2 is because there
are two ways this can happen –the first person is type O and the second is type AB or vice
versa.*
Thus the Bayes factor for Oliver’s presence is

P (D|S ) 0.01
BFOliver = c
= = 0.83.
P (D|S ) 2 · 0.6 · 0.01

Since BFOliver < 1, the data provides (weak) evidence against Oliver being at the scene.

*We have assumed the blood types of the two people are independent. This is not precisely true,
2 · NO · NAB
but for a large population it is close enough. The exact probability is where NO
N · (N 1)
is the number of people with type O blood, NAB the number with type AB blood and N the size
NO NAB
of the population. We have = 0.6. For large N we have N ⇡ N 1, so ⇡ 0.01. This
N N 1
shows the probability is approximately 2 · 0.6 · 0.01 as claimed.

Example 2. Another suspect Alberto is found to have type ‘AB’ blood. Do the same data
give evidence in favor of the proposition that Alberto was one of the two people present at
the crime?
18.05 class 12, Bayesian Updating: Odds, Spring 2014 5

answer: Reusing the above notation with Alberto in place of Oliver we have:
P (D|S) 0.6
BFAlberto = c
= = 50.
P (D|S ) 2 · 0.6 · 0.01
Since BFAlberto 1, the data provides strong evidence in favor of Alberto being at the
scene.
Notes:
1. In both examples, we have only computed the Bayes factor, not the posterior odds. To
compute the latter, we would need to know the prior odds that Oliver (or Alberto) was at
the scene based on other evidence.
2. Note that if 50% of the population had type O blood instead of 60%, then the Oliver’s
Bayes factor would be 1 (neither for nor against). More generally, the break-even point
for blood type evidence is when the proportion of the suspect’s blood type in the general
population equals the proportion of the suspect’s blood type among those who left blood
at the scene.

4.1 Updating again and again

Suppose we collect data in two stages, first D1 , then D2 . We have seen in our dice and coin
examples that the final posterior can be computed all at once or in two stages where we
first update the prior using the likelihoods for D1 and then update the resulting posterior
using the likelihoods for D2 . The latter approach works whenever likelihoods multiply:

P (D1 , D2 |H) = P (D1 |H)P (D2 |H).

Since likelihoods are conditioned on hypotheses, we say that D1 and D2 are conditionally
independent if the above equation holds for every hypothesis H.
Example. There are five dice in a drawer, with 4, 6, 8, 12, and 20 sides (these are the
hypotheses). I pick a die at random and roll it twice. The first roll gives 7. The second roll
gives 11. Are these results conditionally independent? Are they independent?
answer: These results are conditionally independent. For example, for the hypothesis of
the 8-sided die we have:

P (7 on roll 1 | 8-sided die) = 1/8

P (11 on roll 2 | 8-sided die) = 0
P (7 on roll 1, 11 on roll 2 | 8-sided die) = 0

For the hypothesis of the 20-sided die we have:

P (7 on roll 1 | 20-sided die) = 1/20

P (11 on roll 2 | 20-sided die) = 1/20
P (7 on roll 1, 11 on roll 2 | 20-sided die) = (1/20)2

However, the results of the rolls are not independent. That is:

P (7 on roll 1, 11 on roll 2) 6= P (7 on roll 1)P (11 on roll 2).

18.05 class 12, Bayesian Updating: Odds, Spring 2014 6

Intuitively, this is because a 7 on the roll 1 allows us to rule out the 4- and 6-sided dice,
making an 11 on roll 2 more likely. Let’s check this intuition by computing both sides
precisely. On the righthand side we have:
1 1 1 1 1 1 31
P (7 on roll 1) = · + · + · =
5 8 5 12 5 20 600
1 1 1 1 2
P (11 on roll 2) = · + · =
5 12 5 20 75
On the lefthand side we have:

P (7 on roll 1, 11 on roll 2) = P (11 on roll 2 | 7 on roll 1)P (7 on roll 1)

✓ ◆
30 1 6 1 31
= · + · ·
93 12 31 20 600
17 31 17
= · =
465 600 9000
30 6
Here 93 and 31 are the posterior probabilities of the 12- and 20-sided dice given a 7 on roll
1. We conclude that, without conditioning on hypotheses, the rolls are not independent.
Returning the to general setup, if D1 and D2 are conditionally independent for H and H c
then it makes sense to consider each Bayes factor independently:

P (Di |H)
BFi = .
P (Di |H c )

The prior odds of H are O(H). The posterior odds after D1 are

O(H|D1 ) = BF1 · O(H).

And the posterior odds after D1 and D2 are

O(H|D1 , D2 ) = BF2 · O(H|D1 )

= BF2 · BF1 · O(H)

We have the beautifully simple notion that updating with new data just amounts to mul-
tiplying the current posterior odds by the Bayes factor of the new data.

Example 3. Other symptoms of Marfan Syndrome

Recall from the earlier example that the Bayes factor for a least one ocular feature (F ) is

P (F |M ) 0.7
BFF = c
= = 10.
P (F |M ) 0.07

The wrist sign (W ) is the ability to wrap one hand around your other wrist to cover your
pinky nail with your thumb. Assume 10% of the population have the wrist sign, while 90%
of people with Marfan’s have it. Therefore the Bayes factor for the wrist sign is

P (W |M ) 0.9
BFW = = = 9.
P (W |M c ) 0.1
18.05 class 12, Bayesian Updating: Odds, Spring 2014 7

We will assume that F and W are conditionally independent symptoms. That is, among
people with Marfan syndrome, ocular features and the wrist sign are independent, and
among people without Marfan syndrome, ocular features and the wrist sign are independent.
Given this assumption, the posterior odds of Marfan syndrome for someone with both an
ocular feature and the wrist sign are
1 6
O(M |F, W ) = BFW · BFF · O(M ) = 9 · 10 · ⇡ .
14999 1000
We can convert the posterior odds back to probability, but since the odds are so small the
result is nearly the same:
6
P (M |F, W ) ⇡ ⇡ 0.596%.
1000 + 6
So ocular features and the wrist sign are both strong evidence in favor of the hypothesis
M , and taken together they are very strong evidence. Again, because the prior odds are so
small, it is still unlikely that the person has Marfan syndrome, but at this point it might be
worth undergoing further testing given potentially fatal consequences of the disease (such
as aortic aneurysm or dissection).
Note also that if a person has exactly one of the two symptoms, then the product of the
Bayes factors is near 1 (either 9/10 or 10/9). So the two pieces of data essentially cancel
each other out with regard to the evidence they provide for Marfan’s syndrome.

5 Log odds

In practice, people often find it convenient to work with the natural log of the odds in place
of odds. Naturally enough these are called the log odds. The Bayesian update formula

O(H|D1 , D2 ) = BF2 · BF1 · O(H)

becomes
ln(O(H|D1 , D2 )) = ln(BF2 ) + ln(BF1 ) + ln(O(H)).
We can interpret the above formula for the posterior log odds as the sum of the prior log
odds and all the evidence ln(BFi ) provided by the data. Note that by taking logs, evidence
in favor (BFi > 1) is positive and evidence against (BFi < 1) is negative.
To avoid lengthier computations, we will work with odds rather than log odds in this course.
Log odds are nice because sums are often more intuitive then products. Log odds also play a
central role in logistic regression, an important statistical model related to linear regression.
Bayesian Updating with Continuous Priors
Class 13, 18.05
Jeremy Orlo↵ and Jonathan Bloom

1 Learning Goals

1. Understand a parameterized family of distributions as representing a continuous range

of hypotheses for the observed data.

2. Be able to state Bayes’ theorem and the law of total probability for continous densities.

3. Be able to apply Bayes’ theorem to update a prior probability density function to a

posterior pdf given data and a likelihood function.

4. Be able to interpret and compute posterior predictive probabilities.

2 Introduction

Up to now we have only done Bayesian updating when we had a finite number of hypothesis,
e.g. our dice example had five hypotheses (4, 6, 8, 12 or 20 sides). Now we will study
Bayesian updating when there is a continuous range of hypotheses. The Bayesian update
process will be essentially the same as in the discrete case. As usual when moving from
discrete to continuous we will need to replace the probability mass function by a probability
density function, and sums by integrals.
The first few sections of this note are devoted to working with pdfs. In particular we will
cover the law of total probability and Bayes’ theorem. We encourage you to focus on how
these are essentially identical to the discrete versions. After that, we will apply Bayes’
theorem and the law of total probability to Bayesian updating.

3 Examples with continuous ranges of hypotheses

Here are three standard examples with continuous ranges of hypotheses.

Example 1. Suppose you have a system that can succeed or fail with probability p. Then
we can hypothesize that p is anywhere in the range [0, 1]. That is, we have a continuous
range of hypotheses. We will often model this example with a ‘bent’ coin with unknown
probability p of heads.

Example 2. The lifetime of a certain isotope is modeled by an exponential distribution

exp( ). In principal, the mean lifetime 1/ can be any real number in (0, 1).

Example 3. We are not restricted to a single parameter. In principle, the parameters µ

and of a normal distribution can be any real numbers in ( 1, 1) and (0, 1), respectively.
If we model gestational length for single births by a normal distribution, then from millions
of data points we know that µ is about 40 weeks and is about one week.

1
18.05 class 13, Bayesian Updating with Continuous Priors, Spring 2014 2

In all of these examples we modeled the random process giving rise to the data by a dis-
tribution with parameters –called a parametrized distribution. Every possible choice of the
parameter(s) is a hypothesis, e.g. we can hypothesize that the probability of succcess in
Example 1 is p = 0.7313. We have a continuous set of hypotheses because we could take
any value between 0 and 1.

4 Notational conventions

4.1 Parametrized models

As in the examples above our hypotheses often take the form a certain parameter has value
✓. We will often use the letter ✓ to stand for an arbitrary hypothesis. This will leave
symbols like p, f , and x to take there usual meanings as pmf, pdf, and data. Also, rather
than saying ‘the hypothesis that the parameter of interest has value ✓’ we will simply say
the hypothesis ✓.

4.2 Big and little letters

We have two parallel notations for outcomes and probability:

1. (Big letters) Event A, probability function P (A).
2. (Little letters) Value x, pmf p(x) or pdf f (x).
These notations are related by P (X = x) = p(x), where x is a value the discrete random
variable X and ‘X = x’ is the corresponding event.
We carry these notations over to the probabilities used in Bayesian updating.
1. (Big letters) From hypotheses H and data D we compute several associated probabilities

P (H), P (D), P (H|D), P (D|H).

In the coin example we might have H = ‘the chosen coin has probability 0.6 of heads’, D
= ‘the flip was heads’, and P (D|H) = 0.6
2. (Small letters) Hypothesis values ✓ and data values x both have probabilities or proba-
bility densities:
p(✓) p(x) p(✓|x) p(x|✓)
f (✓) f (x) f (✓|x) f (x|✓)
In the coin example we might have ✓ = 0.6 and x = 1, so p(x|✓) = 0.6. We might also write
p(x = 1|✓ = 0.6) to emphasize the values of x and ✓, but we will never just write p(1|0.6)
because it is unclear which value is x and which is ✓.
Although we will still use both types of notation, from now on we will mostly use the small
letter notation involving pmfs and pdfs. Hypotheses will usually be parameters represented
by Greek letters (✓, , µ, , . . . ) while data values will usually be represented by English
letters (x, xi , y, . . . ).
18.05 class 13, Bayesian Updating with Continuous Priors, Spring 2014 3

5 Quick review of pdf and probability

Suppose X is a random variable with pdf f (x). Recall f (x) is a density; its units are
probability/(units of x).
f (x) f (x)

probability f (x)dx

P (c  X  d)

x dx
c x
d x

The probability that the value of X is in [c, d] is given by

Z d
f (x) dx.
c

The probability that X is in an infinitesimal range dx around x is f (x) dx. In fact, the
integral formula is just the ‘sum’ of these infinitesimal probabilities. We can visualize these
probabilities by viewing the integral as area under the graph of f (x).
In order to manipulate probabilities instead of densities in what follows, we will make
frequent use of the notion that f (x) dx is the probability that X is in an infinitesimal range
around x of width dx. Please make sure that you fully understand this notion.

6 Continuous priors, discrete likelihoods

In the Bayesian framework we have probabilities of hypotheses –called prior and posterior
probabilities– and probabilities of data given a hypothesis –called likelihoods. In earlier
classes both the hypotheses and the data had discrete ranges of values. We saw in the
introduction that we might have a continuous range of hypotheses. The same is true for
the data, but for today we will assume that our data can only take a discrete set of values.
In this case, the likelihood of data x given hypothesis ✓ is written using a pmf: p(x|✓).
We will use the following coin example to explain these notions. We will carry this example
through in each of the succeeding sections.
Example 4. Suppose we have a bent coin with unknown probability ✓ of heads. The
value of of ✓ is random and could be anywhere between 0 and 1. For this and the examples
that follow we’ll suppose that the value of ✓ follows a distribution with continuous prior
probability density f (✓) = 2✓. We have a discrete likelihood because tossing a coin has only
two outcomes, x = 1 for heads and x = 0 for tails.

p(x = 1|✓) = ✓, p(x = 0|✓) = 1 ✓.

Think: This can be tricky to wrap your mind around. We have a coin with an unknown
probability ✓ of heads. The value of the parameter ✓ is itself random and has a prior pdf
f (✓). It may help to see that the discrete examples we did in previous classes are similar.
For example, we had a coin that might have probability of heads 0.5, 0.6, or 0.9. So,
18.05 class 13, Bayesian Updating with Continuous Priors, Spring 2014 4

we called our hypotheses H0.5 , H0.6 , H0.9 and these had prior probabilities P (H0.5 ) etc. In
other words, we had a coin with an unknown probability of heads, we had hypotheses about
that probability and each of these hypotheses had a prior probability.

7 The law of total probability

The law of total probability for continuous probability distributions is essentially the same
as for discrete distributions. We replace the prior pmf by a prior pdf and the sum by an
integral. We start by reviewing the law for the discrete case.
Recall that for a discrete set of hypotheses H1 , H2 , . . . Hn the law of total probability says
n
X
P (D) = P (D|Hi )P (Hi ). (1)
i=1

This is the total prior probability of D because we used the prior probabilities P (Hi )
In the little letter notation with ✓1 , ✓2 , . . . , ✓n for hypotheses and x for data the law of total
probability is written
X n
p(x) = p(x|✓i )p(✓i ). (2)
i=1
We also called this the prior predictive probability of the outcome x to distinguish it from
the prior probability of the hypothesis ✓.
Likewise, there is a law of total probability for continuous pdfs. We state it as a theorem
using little letter notation.
Theorem. Law of total probability. Suppose we have a continuous parameter ✓ in the
range [a, b], and discrete random data x. Assume ✓ is itself random with density f (✓) and
that x and ✓ have likelihood p(x|✓). In this case, the total probability of x is given by the
formula. Z b
p(x) = p(x|✓)f (✓) d✓ (3)
a
Proof. Our proof will be by analogy to the discrete version: The probability term p(x|✓)f (✓) d✓
is perfectly analogous to the term p(x|✓i )p(✓i ) in Equation 2 (or the term P (D|Hi )P (Hi )
in Equation 1). Continuing the analogy: the sum in Equation 2 becomes the integral in
Equation 3
As in the discrete case, when we think of ✓ as a hypothesis explaining the probability of the
data we call p(x) the prior predictive probability for x.
Example 5. (Law of total probability.) Continuing with Example 4. We have a bent coin
with probability ✓ of heads. The value of ✓ is random with prior pdf f (✓) = 2✓ on [0, 1].
Suppose I flip the coin once. What is the total probability of heads?
answer: In Example 4 we noted that the likelihoods are p(x = 1|✓) = ✓ and p(x = 0|✓) =
1 ✓. So the total probability of x = 1 is
Z 1 Z 1 Z 1
2
p(x = 1) = p(x = 1|✓) f (✓) d✓ = ✓ · 2✓ d✓ = 2✓ 2 d✓ = .
0 0 0 3
Since the prior is weighted towards higher probabilities of heads, so is the total probability.
18.05 class 13, Bayesian Updating with Continuous Priors, Spring 2014 5

8 Bayes’ theorem for continuous probability densities

The statement of Bayes’ theorem for continuous pdfs is essentially identical to the statement
for pmfs. We state it including d✓ so we have genuine probabilities:
Theorem. Bayes’ Theorem. Use the same assumptions as in the law of total probability,
i.e. ✓ is a continuous parameter with pdf f (✓) and range [a, b]; x is random discrete data;
together they have likelihood p(x|✓). With these assumptions:
p(x|✓)f (✓) d✓ p(x|✓)f (✓) d✓
f (✓|x) d✓ = = Rb . (4)
p(x)
a p(x|✓)f (✓) d✓

Proof. Since this is a statement about probabilities it is just the usual statement of Bayes’
theorem. This is important enough to warrant spelling it out in words: Let ⇥ be the random
variable that produces the value ✓. Consider the events

H = ‘⇥ is in an interval of width d✓ around the value ✓’

and
D = ‘the value of the data is x’.
Then P (H) = f (✓) d✓, P (D) = p(x), and P (D|H) = p(x|✓). Now our usual form of Bayes’
theorem becomes
P (D|H)P (H) p(x|✓)f (✓) d✓
f (✓|x) d✓ = P (H|D) = =
P (D) p(x)
Looking at the first and last terms in this equation we see the new form of Bayes’ theorem.
Finally, we firmly believe that is is more conducive to careful thinking about probability
to keep the factor of d✓ in the statement of Bayes’ theorem. But because it appears in the
numerator on both sides of Equation 4 many people drop the d✓ and write Bayes’ theorem
in terms of densities as
p(x|✓)f (✓) p(x|✓)f (✓)
f (✓|x) = = Rb .
p(x)
a p(x|✓)f (✓) d✓

9 Bayesian updating with continuous priors

Now that we have Bayes’ theorem and the law of total probability we can finally get to
Bayesian updating. Before continuing with Example 4, we point out two features of the
Bayesian updating table that appears in the next example:
1. The table for continuous priors is very simple: since we cannot have a row for each of
an infinite number of hypotheses we’ll have just one row which uses a variable to stand for
all hypotheses ✓.
2. By including d✓, all the entries in the table are probabilities and all our usual probability
rules apply.
Example 6. (Bayesian updating.) Continuing Examples 4 and 5. We have a bent coin
with unknown probability ✓ of heads. The value of ✓ is random with prior pdf f (✓) = 2✓.
Suppose we flip the coin once and get heads. Compute the posterior pdf for ✓.
18.05 class 13, Bayesian Updating with Continuous Priors, Spring 2014 6

answer: We make an update table with the usual columns. Since this is our first example
the first row is the abstract version of Bayesian updating in general and the second row is
Bayesian updating for this particular example.
hypothesis prior likelihood Bayes numerator posterior

✓ f (✓) d✓ p(x = 1|✓) p(x = 1|✓)f (✓) d✓ f (✓|x = 1) d✓

✓ 2✓ d✓ ✓ 2✓2 d✓ 3✓2 d✓
Rb R1
total a f (✓) d✓ = 1 p(x = 1) = 0 2✓2 d✓ = 2/3 1

Therefore the posterior pdf (after seeing 1 heads) is f (✓|x) = 3✓2 .

We have a number of comments:
1. Since we used the prior probability f (✓) d✓, the hypothesis should have been:
’the unknown paremeter is in an interval of width d✓ around ✓’.
Even for us that is too much to write, so you will have to think it everytime we write that
the hypothesis is ✓.
2. The posterior pdf for ✓ is found by removing the d✓ from the posterior probability in
the table.
f (✓|x) = 3✓2 .

3. (i) As always p(x) is the total probability. Since we have a continuous distribution
instead of a sum we compute an integral.
(ii) Notice that by including d✓ in the table, it is clear what integral we need to compute
to find the total probability p(x).
4. The table organizes the continuous version of Bayes’ theorem. Namely, the posterior pdf
is related to the prior pdf and likelihood function via:
p(x|✓) f (✓)d✓ p(x|✓) f (✓)
f (✓|x)d✓ = R b =
p(x)
a p(x|✓)f (✓) d✓

Removing the d✓ in the numerator of both sides we have the statement in terms of densities.
5. Regarding both sides as functions of ✓, we can again express Bayes’ theorem in the form:
f (✓|x) / p(x|✓) · f (✓)
posterior / likelihood ⇥ prior.

9.1 Flat priors

One important prior is called a flat or uniform prior. A flat prior assumes that every
hypothesis is equally probable. For example, if ✓ has range [0, 1] then f (✓) = 1 is a flat
prior.
Example 7. (Flat priors.) We have a bent coin with unknown probability ✓ of heads.
Suppose we toss it once and get tails. Assume a flat prior and find the posterior probability
for ✓.
18.05 class 13, Bayesian Updating with Continuous Priors, Spring 2014 7

answer: This is the just Example 6 with a change of prior and likelihood.
hypothesis prior likelihood Bayes numerator posterior
✓ f (✓) d✓ p(x = 0|✓) f (✓|x = 0) d✓

✓ 1 · d✓ 1 ✓ (1 ✓) d✓ 2(1 ✓) d✓
Z 1
Rb
total a f (✓) d✓ = 1 p(x = 0) = (1 ✓) d✓ = 1/2 1
0

9.2 Using the posterior pdf

Example 8. In the previous example the prior probability was flat. First show that this
means that a priori the coin is equally like to be biased towards heads or tails. Then, after
observing one heads, what is the (posterior) probability that the coin is biased towards
heads?
answer: Since the parameter ✓ is the probability the coin lands heads, the first part of the
problem asks us to show P (✓ > .5) = 0.5 and the second part asks for P (✓ > .5 | x = 1).
These are easily computed from the prior and posterior pdfs respectively.
The prior probability that the coin is biased towards heads is
Z 1 Z 1
1
P (✓ > .5) = f (✓) d✓ = 1 · d✓ = ✓|1.5 = .
.5 .5 2

The probability of 1/2 means the coin is equally likely to be biased toward heads or tails.
The posterior probabilitiy that it’s biased towards heads is
Z 1 Z 1
1 3
P (✓ > .5|x = 1) = f (✓|x = 1) d✓ = 2✓ d✓ = ✓ 2 .5 = .
.5 .5 4
We see that observing one heads has increased the probability that the coin is biased towards
heads from 1/2 to 3/4.

10 Predictive probabilities

Just as in the discrete case we are also interested in using the posterior probabilities of the
hypotheses to make predictions for what will happen next.
Example 9. (Prior and posterior prediction.) Continuing Examples 4, 5, 6: we have a
coin with unknown probability ✓ of heads and the value of ✓ has prior pdf f (✓) = 2✓. Find
the prior predictive probability of heads. Then suppose the first flip was heads and find the
posterior predictive probabilities of both heads and tails on the second flip.
answer: For notation let x1 be the result of the first flip and let x2 be the result of the
second flip. The prior predictive probability is exactly the total probability computed in
Examples 5 and 6.
Z 1 Z 1
2
p(x1 = 1) = p(x1 = 1|✓)f (✓) d✓ = 2✓2 d✓ = .
0 0 3
18.05 class 13, Bayesian Updating with Continuous Priors, Spring 2014 8

The posterior predictive probabilities are the total probabilities computed using the poste-
rior pdf. From Example 6 we know the posterior pdf is f (✓|x1 = 1) = 3✓2 . So the posterior
predictive probabilities are
Z 1 Z 1
p(x2 = 1|x1 = 1) = p(x2 = 1|✓, x1 = 1)f (✓|x1 = 1) d✓ = ✓ · 3✓2 d✓ = 3/4
0 0
Z 1 Z 1
p(x2 = 0|x1 = 1) = p(x2 = 0|✓, x1 = 1)f (✓|x1 = 1) d✓ = (1 ✓) · 3✓2 d✓ = 1/4
0 0

(More simply, we could have computed p(x2 = 0|x1 = 1) = 1 p(x2 = 1|x1 = 1) = 1/4.)

11 From discrete to continuous Bayesian updating

To develop intuition for the transition from discrete to continuous Bayesian updating, we’ll
walk a familiar road from calculus. Namely we will:
(i) approximate the continuous range of hypotheses by a finite number.
(ii) create the discrete updating table for the finite number of hypotheses.
(iii) consider how the table changes as the number of hypotheses goes to infinity.
In this way, will see the prior and posterior pmf’s converge to the prior and posterior pdf’s.
Example 10. To keep things concrete, we will work with the ‘bent’ coin with a flat prior
f (✓) = 1 from Example 7. Our goal is to go from discrete to continuous by increasing the
number of hypotheses
4 hypotheses. We slice [0, 1] into 4 equal intervals: [0, 1/4], [1/4, 1/2], [1/2, 3/4], [3/4, 1].
Each slice has width ✓ = 1/4. We put our 4 hypotheses ✓i at the centers of the four slices:
✓1 : ‘✓ = 1/8’, ✓2 : ‘✓ = 3/8’, ✓3 : ‘✓ = 5/8’, ✓4 : ‘✓ = 7/8’.
The flat prior gives each hypothesis a probability of 1/4 = 1 · ✓. We have the table:
hypothesis prior likelihood Bayes num. posterior

✓ = 1/8 1/4 1/8 (1/4) ⇥ (1/8) 1/16

✓ = 3/8 1/4 3/8 (1/4) ⇥ (3/8) 3/16

✓ = 5/8 1/4 5/8 (1/4) ⇥ (5/8) 5/16

✓ = 7/8 1/4 7/8 (1/4) ⇥ (7/8) 7/16

n
X
Total 1 – ✓i ✓ 1
i=1

Here are the density histograms of the prior and posterior pmf. The prior and posterior
pdfs from Example 7 are superimposed on the histograms in red.
18.05 class 13, Bayesian Updating with Continuous Priors, Spring 2014 9

2 density 2 density

1.5 1.5

1 1

.5 .5

x x
1/8 3/8 5/8 7/8 1/8 3/8 5/8 7/8

8 hypotheses. Next we slice [0,1] into 8 intervals each of width ✓ = 1/8 and use the
center of each slice for our 8 hypotheses ✓i .
✓1 : ’✓ = 1/16’, ✓2 : ’✓ = 3/16’, ✓3 : ’✓ = 5/16’, ✓4 : ’✓ = 7/16’
✓5 : ’✓ = 9/16’, ✓6 : ’✓ = 11/16’, ✓7 : ’✓ = 13/16’, ✓8 : ’✓ = 15/16’
The flat prior gives each hypothesis the probablility 1/8 = 1 · ✓. Here are the table and
density histograms.
hypothesis prior likelihood Bayes num. posterior
✓ = 1/16 1/8 1/16 (1/8) ⇥ (1/16) 1/64
✓ = 3/16 1/8 3/16 (1/8) ⇥ (3/16) 3/64
✓ = 5/16 1/8 5/16 (1/8) ⇥ (5/16) 5/64
✓ = 7/16 1/8 7/16 (1/8) ⇥ (7/16) 7/64
✓ = 9/16 1/8 9/16 (1/8) ⇥ (9/16) 9/64
✓ = 11/16 1/8 11/16 (1/8) ⇥ (11/16) 11/64
✓ = 13/16 1/8 13/16 (1/8) ⇥ (13/16) 13/64
✓ = 15/16 1/8 15/16 (1/8) ⇥ (15/16) 15/64
n
X
Total 1 – ✓i ✓ 1
i=1

2 density 2 density

1.5 1.5

1 1

.5 .5

x x
1/16 3/16 5/16 7/16 9/16 11/16 13/16 15/16 1/16 3/16 5/16 7/16 9/16 11/16 13/16 15/16

20 hypotheses. Finally we slice [0,1] into 20 pieces. This is essentially identical to the
previous two cases. Let’s skip right to the density histograms.
2 density 2 density

1.5 1.5

1 1

.5 .5

x x

Looking at the sequence of plots we see how the prior and posterior density histograms
converge to the prior and posterior probability density functions.
Notational conventions
Class 13, 18.05
Jeremy Orlo↵ and Jonathan Bloom

1 Learning Goals

1. Be able to work with the various notations and terms we use to describe probabilities
and likelihood.

2 Introduction

We’ve introduced a number of di↵erent notations for probability, hypotheses and data. We
collect them here, to have them in one place.

3 Notation and terminology for data and hypotheses

The problem of labeling data and hypotheses is a tricky one. When we started the course
we talked about outcomes, e.g. heads or tails. Then when we introduced random variables
we gave outcomes numerical values, e.g. 1 for heads and 0 for tails. This allowed us to do
things like compute means and variances. We need to do something similar now. Recall
our notational conventions:

• Events are labeled with capital letters, e.g. A, B, C.

• A random variable is capital X and takes values small x.

• The connection between values and events: ‘X = x’ is the event that X takes the
value x.

• The probability of an event is capital P (A).

• A discrete random variable has a probability mass function small p(x) The connection
between P and p is that P (X = x) = p(x).

• A continuous random variable has a probability density function f (x) The connection
Rb
between P and f is that P (a  X  b) = a f (x) dx.

• For a continuous random variable X the probability that X is in an infinitesimal

interval of width dx round x is f (x) dx.

In the context of Bayesian updating we have similar conventions.

• We use capital letters, especially H, to indicate a hypothesis, e.g. H = ’the coin is

fair’.

1
18.05 class 13, Notational conventions, Spring 2014 2

• We use lower case letters, especially ✓, to indicate the hypothesized value of a model
parameter, e.g. the probability the coin lands heads is ✓ = 0.5.

• We use upper case letters, especially D, when talking about data as events. For
example, D = ‘the sequence of tosses was HTH.

• We use lower case letters, especially x, when talking about data as values. For exam-
ple, the sequence of data was x1 , x2 , x3 = 1, 0, 1.

• When the set of hypotheses is discrete we can use the probability of individual hy-
potheses, e.g. p(✓). When the set is continuous we need to use the probability for an
infinitesimal range of hypotheses, e.g. f (✓) d✓.

The following table summarizes this for discret ✓ and continuous ✓. In both cases we
are assuming a discrete set of possible outcomes (data) x. Tomorrow we will deal with a
continuous set of outcomes.
Bayes
hypothesis prior likelihood numerator posterior
H P (H) P (D|H) P (D|H)P (H) P (H|D)
Discrete ✓: ✓ p(✓) p(x|✓) p(x|✓)p(✓) p(✓|x)
Continuous ✓: ✓ f (✓) d✓ p(x|✓) p(x|✓)f (✓) d✓ f (✓|x) d✓
Remember the continuous hypothesis ✓ is really a shorthand for ‘the parameter ✓ is in an
interval of width d✓ around ✓’.
Beta Distributions
Class 14, 18.05
Jeremy Orlo↵ and Jonathan Bloom

1 Learning Goals

1. Be familiar with the 2-parameter family of beta distributions and its normalization.

2. Be able to update a beta prior to a beta posterior in the case of a binomial likelihood.

2 Beta distribution

The beta distribution beta(a, b) is a two-parameter distribution with range [0, 1] and pdf

(a + b 1)! a 1
f (✓) = ✓ (1 ✓ )b 1
(a 1)!(b 1)!

We have made an applet so you can explore the shape of the Beta distribution as you vary
the parameters:
http://mathlets.org/mathlets/beta-distribution/.
As you can see in the applet, the beta distribution may be defined for any real numbers
a > 0 and b > 0. In 18.05 we will stick to integers a and b, but you can get the full story
here: http://en.wikipedia.org/wiki/Beta_distribution
In the context of Bayesian updating, a and b are often called hyperparameters to distinguish
them from the unknown parameter ✓ representing our hypotheses. In a sense, a and b are
‘one level up’ from ✓ since they parameterize its pdf.

2.1 A simple but important observation!

If a pdf f (✓) has the form c✓a 1 (1 ✓)b 1 then f (✓) is a beta(a, b) distribution and the
normalizing constant must be
(a + b 1)!
c= .
(a 1)! (b 1)!
This follows because the constant c must normalize the pdf to have total probability 1.
There is only one such constant and it is given in the formula for the beta distribution.
A similar observation holds for normal distributions, exponential distributions, and so on.

2.2 Beta priors and posteriors for binomial random variables

Example 1. Suppose we have a bent coin with unknown probability ✓ of heads. We toss
it 12 times and get 8 heads and 4 tails. Starting with a flat prior, show that the posterior
pdf is a beta(9, 5) distribution.

1
18.05 class 14, Beta Distributions, Spring 2014 2

answer: This is nearly identical to examples from the previous class. We’ll call the data
from all 12 tosses x1 . In the following table we call the leading constant factor in the
posterior column c2 . Our simple observation will tell us that it has to be the constant
factor from the beta pdf.
The data is 8 heads ✓ and◆4 tails. Since this comes from a binomial(12, ✓) distribution, the
12 8
likelihood p(x1 |✓) = ✓ (1 ✓)4 . Thus the Bayesian update table is
8

Bayes
hypothesis prior likelihood numerator posterior
12 12
✓ 1 · d✓ 8 ✓8 (1 ✓)4 8 ✓8 (1 ✓)4 d✓ c2 ✓8 (1 ✓)4 d✓
✓ ◆Z 1
12
total 1 T = ✓8 (1 ✓)4 d✓ 1
8 0

Our simple observation above holds with a = 9 and b = 5. Therefore the posterior pdf

f (✓|x1 ) = c2 ✓8 (1 ✓)4

follows a beta(9, 5) distribution and the normalizing constant c2 must be

13!
c2 = .
8! 4!

Note: We explicitly included the binomial coefficient 12

8 in the likelihood. We could just
as easily have given it a name, say c1 and not bothered making its value explicit.

Example 2. Now suppose we toss the same coin again, getting n heads and m tails. Using
the posterior pdf of the previous example as our new prior pdf, show that the new posterior
pdf is that of a beta(9 + n, 5 + m) distribution.
answer: It’s all in the table. We’ll call the data of these n + m additional tosses x2 . This
time we won’t make the binomial coefficient explicit. Instead we’ll just call it c3 . Whenever
we need a new label we will simply use c with a new subscript.
Bayes
hyp. prior likelihood posterior numerator
n+8
✓ c2 ✓8 (1 ✓)4 d✓ c3 ✓n (1 ✓)m c2 c3 ✓n+8 (1 ✓)m+4 d✓ c4 ✓ (1 ✓)m+4 d✓
Z 1
total 1 T = c2 c3 ✓n+8 (1 ✓)m+4 d✓ 1
0

Again our simple observation holds and therefore the posterior pdf

f (✓|x1 , x2 ) = c4 ✓n+8 (1 ✓)m+4

follows a beta(n + 9, m + 5) distribution.

Note: Flat beta. The beta(1, 1) distribution is the same as the uniform distribution on
[0, 1], which we have also called the flat prior on ✓. This follows by plugging a = 1 and
b = 1 into the definition of the beta distribution, giving f (✓) = 1.
18.05 class 14, Beta Distributions, Spring 2014 3

Summary: If the probability of heads is ✓, the number of heads in n + m tosses follows a

binomial(n + m, ✓) distribution. We have seen that if the prior on ✓ is a beta distribution
then so is the posterior; only the parameters a, b of the beta distribution change! We
summarize precisely how they change in a table. We assume the data is n heads in n + m
tosses.
hypothesis data prior likelihood posterior

✓ x=n beta(a, b) binomial(n + m, ✓) beta(a + n, b + m)

✓ x=n c1 ✓ a 1 (1 ✓)b 1 d✓ c2 ✓n (1 ✓)m c3 ✓a+n 1 (1 ✓)b+m 1 d✓

2.3 Conjugate priors

In the literature you’ll see that the beta distribution is called a conjugate prior for the
binomial distribution. This means that if the likelihood function is binomial, then a beta
prior gives a beta posterior. In fact, the beta distribution is a conjugate prior for the
Bernoulli and geometric distributions as well.
We will soon see another important example: the normal distribution is its own conjugate
prior. In particular, if the likelihood function is normal with known variance, then a normal
prior gives a normal posterior.
Conjugate priors are useful because they reduce Bayesian updating to modifying the param-
eters of the prior distribution (so-called hyperparameters) rather than computing integrals.
We saw this for the beta distribution in the last table. For many more examples see:
http://en.wikipedia.org/wiki/Conjugate_prior_distribution
Continuous Data with Continuous Priors
Class 14, 18.05
Jeremy Orlo↵ and Jonathan Bloom

1 Learning Goals

1. Be able to construct a Bayesian update table for continuous hypotheses and continuous
data.

2. Be able to recognize the pdf of a normal distribution and determine its mean and variance.

2 Introduction

We are now ready to do Bayesian updating when both the hypotheses and the data take
continuous values. The pattern is the same as what we’ve done before, so let’s first review
the previous two cases.

3 Previous cases

1. Discrete hypotheses, discrete data

Notation

• Hypotheses H

• Data x

• Prior P (H)

• Likelihood p(x | H)

• Posterior P (H | x).

Example 1. Suppose we have data x and three possible explanations (hypotheses) for the
data that we’ll call A, B, C. Suppose also that the data can take two possible values, -1
and 1.
In order to use the data to help estimate the probabilities of the di↵erent hypotheses we
need a prior pmf and a likelihood table. Assume the prior and likelihoods are given in
the following table. (For this example we are only concerned with the formal process of of
Bayesian updating. So we just made up the prior and likelihoods.)

1
18.05 class 14, Continuous Data with Continuous Priors, Spring 2014 2

hypothesis prior hypothesis likelihood p(x | H)

H P (H) H x= 1 x=1
A 0.1 A 0.2 0.8
B 0.3 B 0.5 0.5
C 0.6 C 0.7 0.3
Prior probabilities Likelihoods
Naturally, each entry in the likelihood table is a likelihood p(x | H). For instance the 0.2
row A and column x = 1 is the likelihood p(x = 1 | A).

Question: Suppose we run one trial and obtain the data x1 = 1. Use this to find the
posterior probabilities for the hypotheses.
answer: The data picks out one column from the likelihood table which we then use in our
Bayesian update table.

Bayes
hypothesis prior likelihood numerator posterior
p(x | H)P (H)
H P (H) p(x = 1 | H) p(x | H)P (H) P (H | x) =
p(x)
A 0.1 0.8 0.08 0.195
B 0.3 0.5 0.15 0.366
C 0.6 0.3 0.18 0.439
total 1 p(x) = 0.41 1

To summarize: the prior probabilities of hypotheses and the likelihoods of data given hy-
pothesis were given; the Bayes numerator is the product of the prior and likelihood; the
total probability p(x) is the sum of the probabilities in the Bayes numerator column; and
we divide by p(x) to normalize the Bayes numerator.

2. Continuous hypotheses, discrete data

Now suppose that we have data x that can take a discrete set of values and a continuous
parameter ✓ that determines the distribution the data is drawn from.
Notation

• Hypotheses ✓

• Data x

• Prior f (✓) d✓

• Likelihood p(x | ✓)

• Posterior f (✓ | x) d✓.

Note: Here we multiplied by d✓ to express the prior and posterior as probabilities. As

densities, we have the prior pdf f (✓) and the posterior pdf f (✓ | x).
Example 2. Assume that x ⇠ Binomial(5, ✓). So ✓ is in the range [0, 1] and the data x
can take six possible values, 0, 1, . . . , 5.
18.05 class 14, Continuous Data with Continuous Priors, Spring 2014 3

Since there is a continuous range of values we use a pdf to describe the prior on ✓. Let’s
suppose the prior is f (✓) = 2✓. We can still make a likelihood table, though it only has one
row representing an arbitrary hypothesis ✓.

hypothesis likelihood p(x | ✓)

x=0 x=1 x=2 x=3 x=4 x=5

5 5 5 5 5 5
✓ 0 (1 ✓)5 1 ✓(1 ✓)4 2 ✓2 (1 ✓)3 3 ✓3 (1 ✓)2 4 ✓4 (1 ✓) 5 ✓5

Likelihoods

Question: Suppose we run one trial and obtain the data x1 = 2. Use this to find the
posterior pdf for the parameter (hypotheses) ✓.
answer: As before, the data picks out one column from the likelihood table which we can
use in our Bayesian update table. Since we want to work with probabilities we write f (✓)d ✓
and f (✓ | x1 ) d✓ for the pdf’s.

Bayes
hypothesis prior likelihood numerator posterior

p(x | ✓)f (✓) d✓

✓ f (✓) d✓ p(x = 2 | ✓) p(x | ✓)f (✓) d✓ f (✓ | x) d✓ =
p(x)

5 5 3! 3! 3
✓ 2✓ d✓ 2 ✓2 (1 ✓)3 2 2 ✓3 (1 ✓)3 d✓ f (✓ | x) d✓ = ✓ (1 ✓)3 d✓
7!

R1 5 5 3! 3!
total 1 p(x) = 0 2 2 ✓2 (1 ✓)3 d✓ = 2 2 7! 1

To summarize: the prior probabilities of hypotheses and the likelihoods of data given hy-
pothesis were given; the Bayes numerator is the product of the prior and likelihood; the
total probability p(x) is the integral of the probabilities in the Bayes numerator column;
and we divide by p(x) to normalize the Bayes numerator.

4 Continuous hypotheses and continuous data

When both data and hypotheses are continuous, the only change to the previous example is
that the likelihood function uses a pdf f (x | ✓) instead of a pmf p(x | ✓). The general shape
of the Bayesian update table is the same.
Notation

• Hypotheses ✓

• Data x

• Prior f (✓)d✓
18.05 class 14, Continuous Data with Continuous Priors, Spring 2014 4

• Likelihood f (x | ✓) dx

• Posterior f (✓ | x) d✓.

Simplifying the notation. In the previous cases we included d✓ so that we were working
with probabilities instead of densities. When both data and hypotheses are continuous
we will need both d✓ and dx. This makes things conceptually simpler, but notationally
cumbersome. To simplify the notation we will allow ourselves to dx in our tables. This is
fine because the data x is a fixed. We keep the d✓ because the hypothesis ✓ is allowed to
vary.
For comparison, we first show the general table in simplified notation followed immediately
afterward by the table showing the infinitesimals.

Bayes
hypoth. prior likelihood numerator posterior

f (x | ✓)f (✓) d✓
✓ f (✓) d✓ f (x | ✓) f (x | ✓)f (✓) d✓ f (✓ | x) =
f (x)

R
total 1 f (x) = f (x | ✓)f (✓) d✓ 1

Bayesian update table without dx

Bayes
hypoth. prior likelihood numerator posterior

f (x | ✓)f (✓) d✓ dx f (x | ✓)f (✓) d✓

✓ f (✓) d✓ f (x | ✓) dx f (x | ✓)f (✓) d✓ dx f (✓ | x) d✓ = =
f (x) dx f (x)

R
total 1 f (x) dx = f (x | ✓)f (✓) d✓ dx 1

Bayesian update table with d✓ and dx

To summarize: the prior probabilities of hypotheses and the likelihoods of data given hy-
pothesis were given; the Bayes numerator is the product of the prior and likelihood; the
total probability f (x) dx is the integral of the probabilities in the Bayes numerator column;
we divide by f (x) dx to normalize the Bayes numerator.

5 Normal hypothesis, normal data

A standard example of continuous hypotheses and continuous data assumes that both the
data and prior follow normal distributions. The following example assumes that the variance
of the data is known.

Example 3. Suppose we have data x = 5 which wass drawn from a normal distribution
18.05 class 14, Continuous Data with Continuous Priors, Spring 2014 5

with unknown mean ✓ and standard deviation 1.

x ⇠ N(✓, 1)

Suppose further that our prior distribution for ✓ is ✓ ⇠ N(2, 1).

Let x represent an arbitrary data value.
(a) Make a Bayesian table with prior, likelihood, and Bayes numerator.
(b) Show that the posterior distribution for ✓ is normal as well.
(c) Find the mean and variance of the posterior distribution.
answer: As we did with the tables above, a good compromise on the notation is to include
d✓ but not dx. The reason for this is that the total probability is computed by integrating
over ✓ and the d✓ reminds of us that.
Our prior pdf is
1 (✓ 2)2 /2
f (✓) = p e .
2⇡
The likelihood function is
1 (5 ✓)2 /2
f (x = 5 | ✓) = p e .
2⇡
We know we are going to multiply the prior and the likelihood, so we carry out that algebra
first. In the very last step we simplify the constant factor into one constant we call c1 .
1 2 1 2
prior · likelihood = p e (✓ 2) /2 · p e (5 ✓) /2
2⇡ 2⇡
1 (2✓2 14✓+29)/2
= e
2⇡
1 (✓2 7✓+29/2)
= e (complete the square)
2⇡
1 ((✓ 7/2)2 +9/4)
= e
2⇡
e 9/4 (✓ 7/2)2 )
= e
2⇡
2
= c1 e (✓ 7/2)

In the last step we replaced the complicated constant factor by the simpler expression c1 .

Bayes posterior
hypothesis prior likelihood numerator f (✓ | x = 5) d✓

f (x = 5 | ✓)f (✓) d✓
✓ f (✓) d✓ f (x = 5 | ✓) f (x = 5 | ✓)f (✓) d✓
f (x = 5)

2 2 (✓ 7/2)2 (✓ 7/2)2
✓ p1 e (✓ 2) /2 d✓ p1 e (5 ✓) /2 c1 e c2 e
2⇡ 2⇡

R
total 1 f (x = 5) = f (x = 5 | ✓)f (✓) d✓ 1
18.05 class 14, Continuous Data with Continuous Priors, Spring 2014 6

We can see by the form of the posterior pdf that it is a normal distribution. Because the
2 2
exponential for a normal distribution is e (✓ µ) /2 we have mean µ = 7/2 and 2 2 = 1,
so variance 2 = 1/2.
We don’t need to bother computing the total probability; it is just used for normalization
and we already know the normalization constant p12⇡ for a normal distribution.
Here is the graph of the prior and the posterior pdf’s for this example. Note how the data
‘pulls’ the prior towards the data.

prior = blue; posterior = purple; data = red

Now we’ll repeat the previous example for general x. When reading this if you mentally
substitute 5 for x you will understand the algebra.
Example 4. Suppose our data x is drawn from a normal distribution with unknown mean
✓ and standard deviation 1.
x ⇠ N(✓, 1)

answer: As before, we show the algebra used to simplify the Bayes numerator: The prior
pdf and likelihood function are
1 (✓ 2)2 /2 1 (x ✓)2 /2
f (✓) = p e f (x | ✓) = p e .
2⇡ 2⇡
The Bayes numerator is the product of the prior and the likelihood:
1 2 1 2
prior · likelihood = p e (✓ 2) /2 · p e (x ✓) /2
2⇡ 2⇡
1 (2✓2 (4+2x)✓+4+x2 )/2
= e
2⇡
1 (✓2 (2+x)✓+(4+x2 )/2)
= e (complete the square)
2⇡
1 ((✓ (1+x/2))2 (1+x/2)2 +(4+x2 )/2)
= e
2⇡
2
= c1 e (✓ (1+x/2))

Just as in the previous example, in the last step we replaced all the constants, including
the exponentials that just involve x, by by the simple constant c1 .
18.05 class 14, Continuous Data with Continuous Priors, Spring 2014 7

Now the Bayesian update table becomes

Bayes posterior
hypothesis prior likelihood numerator f (✓ | x) d✓

f (x | ✓)f (✓) d✓
✓ f (✓) d✓ f (x | ✓) f (x | ✓)f (✓) d✓
f (x)

2 2 (✓ (1+x/2))2 (✓ (1+x/2))2
✓ p1 e (✓ 2) /2 d✓ p1 e (x ✓) /2 c1 e c2 e
2⇡ 2⇡

R
total 1 f (x) = f (x | ✓)f (✓) d✓ 1

As in the previous example we can see by the form of the posterior that it must be a normal
distribution with mean 1 + x/2 and variance 1/2. (Compare this with the case x = 5 in the
previous example.)

6 Predictive probabilities

Since the data x is continuous it has prior and posterior predictive pdfs. The prior predictive
pdf is the total probability density computed at the bottom of the Bayes numerator column:
Z
f (x) = f (x|✓)f (✓) d✓,

where the integral is computed over the entire range of ✓.

The posterior predictive pdf has the same form as the prior predictive pdf, except it use
the posterior probabilities for ✓:
Z
f (x2 |x1 ) = f (x2 |✓, x1 )f (✓|x1 ) d✓,

As usual, we usually assume x1 and x2 are conditionally independent. That is,

f (x2 |✓, x1 ) = f (x2 |✓).

In this case the formula for the posterior predictive pdf is a little simpler:
Z
f (x2 |x1 ) = f (x2 |✓)f (✓|x1 ) d✓,
Conjugate priors: Beta and normal
Class 15, 18.05
Jeremy Orlo↵ and Jonathan Bloom

1 Learning Goals

1. Understand the benefits of conjugate priors.

2. Be able to update a beta prior given a Bernoulli, binomial, or geometric likelihood.

3. Understand and be able to use the formula for updating a normal prior given a normal
likelihood with known variance.

2 Introduction and definition

In this reading, we will elaborate on the notion of a conjugate prior for a likelihood function.
With a conjugate prior the posterior is of the same type, e.g. for binomial likelihood the beta
prior becomes a beta posterior. Conjugate priors are useful because they reduce Bayesian
updating to modifying the parameters of the prior distribution (so-called hyperparameters)
rather than computing integrals.
Our focus in 18.05 will be on two important examples of conjugate priors: beta and normal.
For a far more comprehensive list, see the tables herein:
http://en.wikipedia.org/wiki/Conjugate_prior_distribution
We now give a definition of conjugate prior. It is best understood through the examples in
the subsequent sections.
Definition. Suppose we have data with likelihood function f (x|✓) depending on a hypothe-
sized parameter. Also suppose the prior distribution for ✓ is one of a family of parametrized
distributions. If the posterior distribution for ✓ is in this family then we say the the prior
is a conjugate prior for the likelihood.

3 Beta distribution

In this section, we will show that the beta distribution is a conjugate prior for binomial,
Bernoulli, and geometric likelihoods.

3.1 Binomial likelihood

We saw last time that the beta distribution is a conjugate prior for the binomial distribution.
This means that if the likelihood function is binomial and the prior distribution is beta then
the posterior is also beta.

1
18.05 class 15, Conjugate priors: Beta and normal, Spring 2014 2

More specifically, suppose that the likelihood follows a binomial(N, ✓) distribution where N
is known and ✓ is the (unknown) parameter of interest. We also have that the data x from
one trial is an integer between 0 and N . Then for a beta prior we have the following table:
hypothesis data prior likelihood posterior
✓ x beta(a, b) binomial(N, ✓) beta(a + x, b + N x)
✓ x c1 ✓a 1 (1 ✓)b 1 c2 ✓x (1 ✓)N x c3 ✓a+x 1 (1 ✓)b+N x 1

The table is simplified by writing the normalizing coefficient as c1 , c2 and c3 respectively.

If needed, we can recover the values of the c1 and c2 by recalling (or looking up) the
normalizations of the beta and binomial distributions.
✓ ◆
(a + b 1)! N N! (a + b + N 1)!
c1 = c2 = = c3 =
(a 1)! (b 1)! x x! (N x)! (a + x 1)! (b + N x 1)!

3.2 Bernoulli likelihood

The beta distribution is a conjugate prior for the Bernoulli distribution. This is actually
a special case of the binomial distribution, since Bernoulli(✓) is the same as binomial(1,
✓). We do it separately because it is slightly simpler and of special importance. In the
table below, we show the updates corresponding to success (x = 1) and failure (x = 0) on
separate rows.

hypothesis data prior likelihood posterior

✓ x beta(a, b) Bernoulli(✓) beta(a + 1, b) or beta(a, b + 1)
✓ x=1 c1 ✓a 1 (1 ✓)b 1 ✓ c3 ✓a (1 ✓)b 1
✓ x=0 c1 ✓a 1 (1 ✓)b 1 1 ✓ c3 ✓a 1 (1 ✓)b
The constants c1 and c3 have the same formulas as in the previous (binomial likelihood
case) with N = 1.

3.3 Geometric likelihood

Recall that the geometric(✓) distribution describes the probability of x successes before
the first failure, where the probability of success on any single independent trial is ✓. The
corresponding pmf is given by p(x) = ✓x (1 ✓).
Now suppose that we have a data point x, and our hypothesis ✓ is that x is drawn from a
geometric(✓) distribution. From the table we see that the beta distribution is a conjugate
prior for a geometric likelihood as well:

hypothesis data prior likelihood posterior

✓ x beta(a, b) geometric(✓) beta(a + x, b + 1)
✓ x c1 ✓a 1 (1 ✓)b 1 ✓x (1 ✓) c3 ✓a+x 1 (1 ✓)b

At first it may seem strange that the beta distribution is a conjugate prior for both the
binomial and geometric distributions. The key reason is that the binomial and geometric
likelihoods are proportional as functions of ✓. Let’s illustrate this in a concrete example.
Example 1. While traveling through the Mushroom Kingdom, Mario and Luigi find some
rather unusual coins. They agree on a prior of f (✓) ⇠ beta(5,5) for the probability of heads,
18.05 class 15, Conjugate priors: Beta and normal, Spring 2014 3

though they disagree on what experiment to run to investigate ✓ further.

a) Mario decides to flip a coin 5 times. He gets four heads in five flips.
b) Luigi decides to flip a coin until the first tails. He gets four heads before the first tail.
Show that Mario and Luigi will arrive at the same posterior on ✓, and calculate this posterior.
answer: We will show that both Mario and Luigi find the posterior pdf for ✓ is a beta(9, 6)
distribution.
Mario’s table
hypothesis data prior likelihood posterior
✓ x=4 beta(5, 5) binomial(5,✓) ???
5 4
✓ x=4 c1 ✓4 (1 ✓)4 4 ✓ (1 ✓) c3 ✓8 (1 ✓)5
Luigi’s table
hypothesis data prior likelihood posterior
✓ x=4 beta(5, 5) geometric(✓) ???
✓ x=4 c1 ✓4 (1 ✓)4 ✓4 (1 ✓) c3 ✓8 (1 ✓)5
Since both Mario and Luigi’s posterior has the form of a beta(9, 6) distribution that’s what
they both must be. The normalizing factor is the same in both cases because it’s determined
by requiring the total probability to be 1.

4 Normal begets normal

We now turn to another important example: the normal distribution is its own conjugate
prior. In particular, if the likelihood function is normal with known variance, then a normal
prior gives a normal posterior. Now both the hypotheses and the data are continuous.
Suppose we have a measurement x ⇠ N (✓, 2 ) where the variance 2 is known. That is, the
mean ✓ is our unknown parameter of interest and we are given that the likelihood comes
from a normal distribution with variance 2 . If we choose a normal prior pdf
2
f (✓) ⇠ N(µprior , prior )

then the posterior pdf is also normal: f (✓|x) ⇠ N(µpost , 2

post ) where
µpost µprior x 1 1 1
2 = 2 + 2
, 2 = 2 + 2
(1)
post prior post prior

The following form of these formulas is easier to read and shows that µpost is a weighted
average between µprior and the data x.
1 1 aµprior + bx 2 1
a= 2 b= 2
, µpost = , post = . (2)
prior a+b a+b

With these formulas in mind, we can express the update via the table:

hypothesis data prior likelihood posterior

✓ x 2
f (✓) ⇠ N(µprior , prior ) f (x|✓) ⇠ N(✓, 2 ) 2 )
f (✓|x) ⇠ N(µpost , post
✓
✓ ◆
◆ ⇣ ⌘ ⇣ ⌘
2
(✓ µprior ) 2 (✓ µpost )2
✓ x c1 exp 2 2
c2 exp (x2 2✓) c3 exp 2 2
prior post
18.05 class 15, Conjugate priors: Beta and normal, Spring 2014 4

We leave the proof of the general formulas to the problem set. It is an involved algebraic
manipulation which is essentially the same as the following numerical example.

Example 2. Suppose we have prior ✓ ⇠ N(4, 8), and likelihood function likelihood x ⇠
N(✓, 5). Suppose also that we have one measurement x1 = 3. Show the posterior distribution
is normal.
answer: We will show this by grinding through the algebra which involves completing the
square.
(✓ 4)2 /16 (x1 ✓)2 /10 (3 ✓)2 /10
prior: f (✓) = c1 e ; likelihood: f (x1 |✓) = c2 e = c2 e

We multiply the prior and likelihood to get the posterior:

(✓ 4)2 /16 (3 ✓)2 /10
f (✓|x1 ) = c3 e e
✓ ◆
(✓ 4)2 (3 ✓)2
= c3 exp
16 10
We complete the square in the exponent

(✓ 4)2 (3 ✓)2 5(✓ 4)2 + 8(3 ✓)2

=
16 10 80
13✓2 88✓ + 152
=
80
2 88 152
✓ 13 ✓ + 13
=
80/13
(✓ 44/13)2 + 152/13 (44/13)2
= .
80/13
Therefore the posterior is
(✓ 44/13)2 +152/13 (44/13)2 (✓ 44/13)2
f (✓|x1 ) = c3 e 80/13 = c4 e 80/13 .

This has the form of the pdf for N(44/13, 40/13). QED

For practice we check this against the formulas (2).

2 2 1 1
µprior = 4, prior = 8, =5 ) a= , b= .
8 5
Therefore
aµprior + bx 44
µpost = = = 3.38
a+b 13
2 1 40
post = = = 3.08.
a+b 13

Example 3. Suppose that we know the data x ⇠ N(✓, 1) and we have prior N(0, 1). We
get one data value x = 6.5. Describe the changes to the pdf for ✓ in updating from the
prior to the posterior.
18.05 class 15, Conjugate priors: Beta and normal, Spring 2014 5

answer: Here is a graph of the prior pdf with the data point marked by a red line.

Prior in blue, posterior in magenta, data in red

The posterior mean will be a weighted average of the prior mean and the data. So the peak
of the posterior pdf will be be between the peak of the prior and the read line. A little
algebra with the formula shows

2 1 2 2
post = 2 2
= prior · 2 2
< prior
1/ prior + 1/ prior +

That is the posterior has smaller variance than the prior, i.e. data makes us more certain
about where in its range ✓ lies.

4.1 More than one data point

Example 4. Suppose we have data x1 , x2 , x3 . Use the formulas (1) to update sequentially.
answer: Let’s label the prior mean and variance as µ0 and 2
0. The updated means and
variances will be µi and i2 . In sequence we have

1 1 1 µ1 µ0 x1
2 = 2 + 2
; 2 = 2 + 2
1 0 1 0
1 1 1 1 2 µ2 µ1 x2 µ0 x1 + x2
2 = 2 + 2
= 2 + 2
; 2 = 2 + 2
= 2 + 2
2 1 0 2 1 0
1 1 1 1 3 µ3 µ2 x3 µ0 x1 + x2 + x3
2 = 2 + 2
= 2 + 2
; 2 = 2 + 2
= 2 + 2
3 2 0 3 2 0

The example generalizes to n data values x1 , . . . , xn :

18.05 class 15, Conjugate priors: Beta and normal, Spring 2014 6

Normal-normal update formulas for n data points

µpost µprior nx̄ 1 1 n x1 + . . . + xn
2 = 2 + 2
, 2 = 2 + 2
, x̄ = . (3)
post prior post prior n

Again we give the easier to read form, showing µpost is a weighted average of µprior and the
sample average x̄:
1 n aµprior + bx̄ 2 1
a= 2 b= 2
, µpost = , post = . (4)
prior a+b a+b

Interpretation: µpost is a weighted average of µprior and x̄. If the number of data points is
2
large then the weight b is large and x̄ will have a strong influence on the posterior. If prior
is small then the weight a is large and µprior will have a strong influence on the posterior.
To summarize:
1. Lots of data has a big influence on the posterior.
2. High certainty (low variance) in the prior has a big influence on the posterior.
The actual posterior is a balance of these two influences.
Choosing priors
Class 15, 18.05
Jeremy Orlo↵ and Jonathan Bloom

1 Learning Goals
1. Learn that the choice of prior a↵ects the posterior.
2. See that too rigid a prior can make it difficult to learn from the data.
3. See that more data lessons the dependence of the posterior on the prior.
4. Be able to make a reasonable choice of prior, based on prior understanding of the system
under consideration.

2 Introduction

Up to now we have always been handed a prior pdf. In this case, statistical inference from
data is essentially an application of Bayes’ theorem. When the prior is known there is no
controversy on how to proceed. The art of statistics starts when the prior is not known
with certainty. There are two main schools on how to proceed in this case: Bayesian and
frequentist. For now we are following the Bayesian approach. Starting next week we will
learn the frequentist approach.
Recall that given data D and a hypothesis H we used Bayes’ theorem to write
P (D|H) · P (H)
P (H|D) =
P (D)
posterior / likelihood · prior.

Bayesian: Bayesians make inferences using the posterior P (H|D), and therefore always
need a prior P (H). If a prior is not known with certainty the Bayesian must try to make
a reasonable choice. There are many ways to do this and reasonable people might make
di↵erent choices. In general it is good practice to justify your choices and to explore a range
of priors to see if they all point to the same conclusion.
Frequentist: Very briefly, frequentists do not try to create a prior. Instead, they make
inferences using the likelihood P (D|H).
We will compare the two approaches in detail once we have more experience with each. For
now we simply list two benefits of the Bayesian approach.
1. The posterior probability P (H|D) for the hypothesis given the evidence is usually exactly
what we’d like to know. The Bayesian can say something like ‘the parameter of interest has
probability 0.95 of being between 0.49 and 0.51.’
2. The assumptions that go into choosing the prior can be clearly spelled out.

More good data: It is always the case that more good data allows for stronger conclusions
and lessens the influence of the prior. The emphasis should be as much on good data
(quality) as on more data (quantity).

1
18.05 class 15, Choosing priors, Spring 2014 2

3 Example: Dice

Suppose we have a drawer full of dice, each of which has either 4, 6, 8, 12, or 20 sides. This
time, we do not know how many of each type are in the drawer. A die is picked at random
from the drawer and rolled 5 times. The results in order are 4, 2, 4, 7, and 5.

3.1 Uniform prior

Suppose we have no idea what the distribution of dice in the drawer might be. In this case
it’s reasonable to use a flat prior. Here is the update table for the posterior probabilities
that result from updating after each roll. In order to fit all the columns, we leave out the
unnormalized posteriors.

hyp. prior lik1 post1 lik2 post2 lik3 post3 lik4 post4 lik5 post5
H4 1/5 1/4 0.370 1/4 0.542 1/4 0.682 0 0.000 0 0.000
H6 1/5 1/6 0.247 1/6 0.241 1/6 0.202 0 0.000 1/6 0.000
H8 1/5 1/8 0.185 1/8 0.135 1/8 0.085 1/8 0.818 1/8 0.876
H12 1/5 1/12 0.123 1/12 0.060 1/12 0.025 1/12 0.161 1/12 0.115
H20 1/5 1/20 0.074 1/20 0.022 1/20 0.005 1/20 0.021 1/20 0.009

This should look familiar. Given the data the final posterior is heavily weighted towards
hypthesis H8 that the 8-sided die was picked.

3.2 Other priors

To see how much the above posterior depended on our choice of prior, let’s try some other
priors. Suppose we have reason to believe that there are ten times as many 20-sided dice
in the drawer as there are each of the other types. The table becomes:

hyp. prior lik1 post1 lik2 post2 lik3 post3 lik4 post4 lik5 post5
H4 0.071 1/4 0.222 1/4 0.453 1/4 0.650 0 0.000 0 0.000
H6 0.071 1/6 0.148 1/6 0.202 1/6 0.193 0 0.000 1/6 0.000
H8 0.071 1/8 0.111 1/8 0.113 1/8 0.081 1/8 0.688 1/8 0.810
H12 0.071 1/12 0.074 1/12 0.050 1/12 0.024 1/12 0.136 1/12 0.107
H20 0.714 1/20 0.444 1/20 0.181 1/20 0.052 1/20 0.176 1/20 0.083

Even here the final posterior is heavily weighted to the hypothesis H8 .

What if the 20-sided die is 100 times more likely than each of the others?

hyp. prior lik1 post1 lik2 post2 lik3 post3 lik4 post4 lik5 post5
H4 0.0096 1/4 0.044 1/4 0.172 1/4 0.443 0 0.000 0 0.000
H6 0.0096 1/6 0.030 1/6 0.077 1/6 0.131 0 0.000 1/6 0.000
H8 0.0096 1/8 0.022 1/8 0.043 1/8 0.055 1/8 0.266 1/8 0.464
H12 0.0096 1/12 0.015 1/12 0.019 1/12 0.016 1/12 0.053 1/12 0.061
H20 0.9615 1/20 0.889 1/20 0.689 1/20 0.354 1/20 0.681 1/20 0.475

With such a strong prior belief in the 20-sided die, the final posterior gives a lot of weight
to the theory that the data arose from a 20-sided die, even though it extremely unlikely the
18.05 class 15, Choosing priors, Spring 2014 3

20-sided die would produce a maximum of 7 in 5 roles. The posterior now gives roughly
even odds that an 8-sided die versus a 20-sided die was picked.

3.3 Rigid priors

Mild cognitive dissonance. Too rigid a prior belief can overwhelm any amount of data.
Suppose I’ve got it in my head that the die has to be 20-sided. So I set my prior to
P (H20 ) = 1 with the other 4 hypotheses having probability 0. Look what happens in the
update table.

hyp. prior lik1 post1 lik2 post2 lik3 post3 lik4 post4 lik5 post5
H4 0 1/4 0 1/4 0 1/4 0 0 0 0 0
H6 0 1/6 0 1/6 0 1/6 0 0 0 1/6 0
H8 0 1/8 0 1/8 0 1/8 0 1/8 0 1/8 0
H12 0 1/12 0 1/12 0 1/12 0 1/12 0 1/12 0
H20 1 1/20 1 1/20 1 1/20 1 1/20 1 1/20 1

No matter what the data, a hypothesis with prior probability 0 will have posterior probabil-
ity 0. In this case I’ll never get away from the hypothesis H20 , although I might experience
some mild cognitive dissonance.

Severe cognitive dissonance. Rigid priors can also lead to absurdities. Suppose I now
have it in my head that the die must be 4-sided. So I set P (H4 ) = 1 and the other prior
probabilities to 0. With the given data on the fourth roll I reach an impasse. A roll of
7 can’t possibly come from a 4-sided die. Yet this is the only hypothesis I’ll allow. My
unnormalized posterior is a column of all zeros which cannot be normalized.

hyp. prior lik1 post1 lik2 post2 lik3 post3 lik4 unnorm. post4 post4
H4 1 1/4 1 1/4 1 1/4 1 0 0 ???
H6 0 1/6 0 1/6 0 1/6 0 0 0 ???
H8 0 1/8 0 1/8 0 1/8 0 1/8 0 ???
H12 0 1/12 0 1/12 0 1/12 0 1/12 0 ???
H20 0 1/20 0 1/20 0 1/20 0 1/20 0 ???

I must adjust my belief about what is possible or, more likely, I’ll suspect you of accidently
or deliberately messing up the data.

4 Example: Malaria

Here is a real example adapted from Statistics, A Bayesian Perspective by Donald Berry:

By the 1950’s scientists had begun to formulate the hypothesis that carriers of the sickle-cell
gene were more resistant to malaria than noncarriers. There was a fair amount of circum-
stantial evidence for this hypothesis. It also helped explain the persistance of an otherwise
deleterious gene in the population. In one experiment scientists injected 30 African volun-
teers with malaria. Fifteen of the volunteers carried one copy of the sickle-cell gene and the
other 15 were noncarriers. Fourteen out of 15 noncarriers developed malaria while only 2
18.05 class 15, Choosing priors, Spring 2014 4

out of 15 carriers did. Does this small sample support the hypothesis that the sickle-cell
gene protects against malaria?
Let S represent a carrier of the sickle-cell gene and N represent a non-carrier. Let D+
indicate developing malaria and D indicate not developing malaria. The data can be put
in a table.
D+ D
S 2 13 15
N 14 1 15
16 14 30
Before analysing the data we should say a few words about the experiment and experimental
design. First, it is clearly unethical: to gain some information they infected 16 people with
malaria. We also need to worry about bias. How did they choose the test subjects. Is
it possible the noncarriers were weaker and thus more susceptible to malaria than the
carriers? Berry points out that it is reasonable to assume that an injection is similar to
a mosquito bite, but it is not guaranteed. This last point means that if the experiment
shows a relation between sickle-cell and protection against injected malaria, we need to
consider the hypothesis that the protection from mosquito transmitted malaria is weaker or
non-existent. Finally, we will frame our hypothesis as ’sickle-cell protects against malaria’,
but really all we can hope to say from a study like this is that ’sickle-cell is correlated with
protection against malaria’.
Model. For our model let ✓S be the probability that an injected carrier S develops malaria
and likewise let ✓N be the probabilitiy that an injected noncarrier N develops malaria. We
assume independence between all the experimental subjects. With this model, the likelihood
is a function of both ✓S and ✓N :

P (data|✓S , ✓N ) = c ✓S2 (1 ✓S )13 ✓N

14
(1 ✓N ).

As usual we leave the constant factor c as a letter. (It is a product of two binomial coeffi-
cients: c = 15
2
15
14 .)
Hypotheses. Each hypothesis consists of a pair (✓N , ✓S ). To keep things simple we will
only consider a finite number of values for these probabilities. We could easily consider
many more values or even a continuous range of hypotheses. Assume ✓S and ✓N are each
one of 0, 0.2, 0.4, 0.6, 0.8, 1. This leads to two-dimensional tables.
First is a table of hypotheses. The color coding indicates the following:
1. Light orange squares along the diagonal are where ✓S = ✓N , i.e. sickle-cell makes no
di↵erence one way or the other.
2. Pink and red squares above the diagonal are where ✓N > ✓S , i.e. sickle-cell provides
some protection against malaria.
3. In the red squares ✓N ✓S 0.6, i.e. sickle-cell provides a lot of protection.
4. White squares below diagonal are where ✓S > ✓N , i.e. sickle-cell actually increases the
probability of developing malaria.
18.05 class 15, Choosing priors, Spring 2014 5

✓N \✓S 0 0.2 0.4 0.6 0.8 1

1 (0,1) (.2,1) (.4,1) (.6,1) (.8,1) (1,1)
0.8 (0,.8) (.2,.8) (.4,.8) (.6,.8) (.8,.8) (1,.8)
0.6 (0,.6) (.2,.6) (.4,.6) (.6,.6) (.8,.6) (1,.6)
0.4 (0,.4) (.2,.4) (.4,.4) (.6,.4) (.8,.4) (1,.4)
0.2 (0,.2) (.2,.2) (.4,.2) (.6,.2) (.8,.2) (1,.2)
0 (0,0) (.2,0) (.4,0) (.6,0) (.8,0) (1,0)

Hypotheses on level of protection due to S:

red = strong; pink = some; orange = none; white = negative.
Next is the table of likelihoods. (Actually we’ve taken advantage of our indi↵erence to scale
and scaled all the likelihoods by 100000/c to make the table more presentable.) Notice that,
to the precision of the table, many of the likelihoods are 0. The color coding is the same as
in the hypothesis table. We’ve highlighted the biggest likelihoods with a blue border.

✓N \✓S 0 0.2 0.4 0.6 0.8 1

1 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000
0.8 0.00000 1.93428 0.18381 0.00213 0.00000 0.00000
0.6 0.00000 0.06893 0.00655 0.00008 0.00000 0.00000
0.4 0.00000 0.00035 0.00003 0.00000 0.00000 0.00000
0.2 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000
0 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000

Likelihoods p(data|✓S , ✓N ) scaled by 100000/c

4.1 Flat prior

Suppose we have no opinion whatsoever on whether and to what degree sickle-cell protects
against malaria. In this case it is reasonable to use a flat prior. Since there are 36 hypotheses
each one gets a prior probability of 1/36. This is given in the table below. Remember each
square in the table represents one hypothesis. Because it is a probability table we include
the marginal pmf.

✓N \✓S 0 0.2 0.4 0.6 0.8 1 p(✓N )

1 1/36 1/36 1/36 1/36 1/36 1/36 1/6
0.8 1/36 1/36 1/36 1/36 1/36 1/36 1/6
0.6 1/36 1/36 1/36 1/36 1/36 1/36 1/6
0.4 1/36 1/36 1/36 1/36 1/36 1/36 1/6
0.2 1/36 1/36 1/36 1/36 1/36 1/36 1/6
0 1/36 1/36 1/36 1/36 1/36 1/36 1/6
p(✓S ) 1/6 1/6 1/6 1/6 1/6 1/6 1

Flat prior p(✓S , ✓N ): every hypothesis (square) has equal probability

To compute the posterior we simply multiply the likelihood table by the prior table and
18.05 class 15, Choosing priors, Spring 2014 6

normalize. Normalization means making sure the entire table sums to 1.

✓N \✓S 0 0.2 0.4 0.6 0.8 1 p(✓N |data)

1 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000
0.8 0.00000 0.88075 0.08370 0.00097 0.00000 0.00000 0.96542
0.6 0.00000 0.03139 0.00298 0.00003 0.00000 0.00000 0.03440
0.4 0.00000 0.00016 0.00002 0.00000 0.00000 0.00000 0.00018
0.2 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000
0 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000
p(✓S |data) 0.00000 0.91230 0.08670 0.00100 0.00000 0.00000 1.00000

Posterior to flat prior: p(✓S , ✓N |data)

To decide whether S confers protection against malaria, we compute the posterior proba-
bilities of ‘some protection’ and of ‘strong protection’. These are computed by summing
the corresponding squares in the posterior table.

Some protection: P (✓N > ✓S ) = sum of pink and red = .99995

Strong protection: P (✓N ✓S > .6) = sum of red = .88075

Working from the flat prior, it is e↵ectively certain that sickle-cell provides some protection
and very probable that it provides strong protection.

4.2 Informed prior

The experiment was not run without prior information. There was a lot of circumstantial
evidence that the sickle-cell gene o↵ered some protection against malaria. For example it
was reported that a greater percentage of carriers survived to adulthood.
Here’s one way to build an informed prior. We’ll reserve a reasonable amount of probability
for the hypotheses that S gives no protection. Let’s say 24% split evenly among the 6
(orange) cells where ✓N = ✓S . We know we shouldn’t set any prior probabilities to 0, so
let’s spread 6% of the probability evenly among the 15 white cells below the diagonal. That
leaves 70% of the probability for the 15 pink and red squares above the diagonal.

✓N \✓S 0 0.2 0.4 0.6 0.8 1 p(✓N )

1 0.04667 0.04667 0.04667 0.04667 0.04667 0.04000 0.27333
0.8 0.04667 0.04667 0.04667 0.04667 0.04000 0.00400 0.23067
0.6 0.04667 0.04667 0.04667 0.04000 0.00400 0.00400 0.18800
0.4 0.04667 0.04667 0.04000 0.00400 0.00400 0.00400 0.14533
0.2 0.04667 0.04000 0.00400 0.00400 0.00400 0.00400 0.10267
0 0.04000 0.00400 0.00400 0.00400 0.00400 0.00400 0.06000
p(✓S ) 0.27333 0.23067 0.18800 0.14533 0.10267 0.06000 1.0

Informed prior p(✓S , ✓N ): makes use of prior information that sickle-cell is protective.
We then compute the posterior pmf.
18.05 class 15, Choosing priors, Spring 2014 7

✓N \✓S 0 0.2 0.4 0.6 0.8 1 p(✓N |data)

1 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000
0.8 0.00000 0.88076 0.08370 0.00097 0.00000 0.00000 0.96543
0.6 0.00000 0.03139 0.00298 0.00003 0.00000 0.00000 0.03440
0.4 0.00000 0.00016 0.00001 0.00000 0.00000 0.00000 0.00017
0.2 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000
0 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000
p(✓S |data) 0.00000 0.91231 0.08669 0.00100 0.00000 0.00000 1.00000

Posterior to informed prior: p(✓S , ✓N |data)

We again compute the posterior probabilities of ‘some protection’ and ‘strong protection’.

Some protection: P (✓N > ✓S ) = sum of pink and red = .99996

Strong protection: P (✓N ✓S > .6) = sum of red = .88076

Note that the informed posterior is nearly identical to the flat posterior.

4.3 PDALX

The following plot is based on the flat prior. For each x, it gives the probability that
✓N ✓S x. To make it smooth we used many more hypotheses.
Prob. diff. at least x

0.8

0.6

0.4

0.2

0
0 0.2 0.4 0.6 0.8 1
x
Probability the di↵erence ✓N ✓S is at least x (PDALX).

Notice that it is virtually certain that the di↵erence is at least .4.

Probability intervals
Class 16, 18.05
Jeremy Orlo↵ and Jonathan Bloom

1 Learning Goals

1. Be able to find probability intervals given a pmf or pdf.

2. Understand how probability intervals summarize belief in Bayesian updating.

3. Be able to use subjective probability intervals to construct reasonable priors.

4. Be able to construct subjective probability intervals by systematically estimating quan-

tiles.

2 Probability intervals

Suppose we have a pmf p(✓) or pdf f (✓) describing our belief about the value of an unknown
parameter of interest ✓.
Definition: A p-probability interval for ✓ is an interval [a, b] with P (a  ✓  b) = p.
Notes. P
1. In the discrete case with pmf p(✓), this means a✓i b p(✓i ) = p.
Rb
2. In the continuous case with pdf f (✓), this means a f (✓) d✓ = p.
3. We may say 90%-probability interval to mean 0.9-probability interval. Probability
intervals are also called credible intervals to contrast them with confidence intervals, which
we’ll introduce in the frequentist unit.

Example 1. Between the 0.05 and 0.55 quantiles is a 0.5 probability interval. There are
many 50% probability intervals, e.g. the interval from the 0.25 to the 0.75 quantiles.
In particular, notice that the p-probability interval for ✓ is not unique.

Q-notation. We can phrase probability intervals in terms of quantiles. Recall that the
s-quantile for ✓ is the value qs with P (✓  qs ) = s. So for s  t, the amount of probability
between the s-quantile and the t-quantile is just t s. In these terms, a p-probability
interval is any interval [qs , qt ] with t s = p.
Example 2. We have 0.5 probability intervals [q0.25 , q0.75 ] and [q0.05 , q0.55 ].

Symmetric probability intervals.

The interval [q0.25 , q0.75 ] is symmetric because the amount of probability remaining on either
side of the interval is the same, namely 0.25. If the pdf is not too skewed, the symmetric
interval is usually a good default choice.
More notes.
1. Di↵erent p-probability intervals for ✓ may have di↵erent widths. We can make the width

1
18.05 class 16, Probability intervals, Spring 2014 2

smaller by centering the interval under the highest part of the pdf. Such an interval is
usually a good choice since it contains the most likely values. See the examples below for
normal and beta distributions.
2. Since the width can vary for fixed p, a larger p does not always mean a larger width.
Here’s what is true: if a p1 -probability interval is fully contained in a p2 -probability interval,
then p1 is bigger than p2 .

Probability intervals for a normal distribution. The figure shows a number of prob-
ability intervals for the standard normal.
1. All of the red bars span a 0.68-probability interval. Notice that the smallest red bar
runs between -1 and 1. This runs from the 16th percential to the 84th percentile so it is a
symmetric interval.
2. All the magenta bars span a 0.9-probability interval. They are longer than the red
bars because they include more probability. Note again that the shortest magenta bar is
symmetric.

red = 0.68, magenta = 0.9, green = 0.5

Probabilitiy intervals for a beta distribution. The following figure shows probability
intervals for a beta distribution. Notice how the two red bars have very di↵erent lengths
yet cover the same probability p = 0.68.
18.05 class 16, Probability intervals, Spring 2014 3

red = 0.68, magenta = 0.9, green = 0.5

3 Uses of probability intervals

3.1 Summarizing and communicating your beliefs

Probability intervals are an intuitive and e↵ective way to summarize and communicate your
beliefs. It’s hard to describe an entire function f (✓) to a friend in words. If the function isn’t
from a parameterized family then it’s especially hard. Even with a beta distribution, it’s
easier to interpret “I think ✓ is between 0.45 and 0.65 with 50% probability” than “I think ✓
follows a beta(8,6) distribution”. An exception to this rule of communication might be the
normal distribution, but only if the recipient is also comfortable with standard deviation.
Of course, what we gain in clarity we lose in precision, since the function contains more
information than the probability interval.
Probability intervals also play well with Bayesian updating. If we update from the prior
f (✓) to the posterior f (✓|x), then the p-probability interval for the posterior will tend to be
shorter than than the p-probability interval for the prior. In this sense, the data has made
us more certain. See for example the election example below.

4 Constructing a prior using subjective probability intervals

Probability intervals are also useful when we do not have a pmf or pdf at hand. In this
case, subjective probability intervals give us a method for constructing a reasonable prior
for ✓ “from scratch”. The thought process is to ask yourself a series of questions, e.g., ‘what
is my expected value for ✓?’; ‘my 0.5-probability interval?’; ‘my 0.9-probability interval?’
Then build a prior that is consistent with these intervals.
18.05 class 16, Probability intervals, Spring 2014 4

4.1 Estimating the intervals directly

Example 3. Building priors

In 2013 there was a special election for a congressional seat in a district in South Carolina.
The election pitted Republican Mark Sanford against Democrat Elizabeth Colbert Busch.
Let ✓ be the fraction of the population who favored Busch. Our goal in this example is to
build a subjective prior for ✓. We’ll use the following prior evidence.

• Sanford is a former S. Carolina Congressman and Governor

• He had famously resigned after having an a↵air in Argentina while he claimed to be

hiking the Appalachian trail.

• In 2013 Sanford won the Republican primary over 15 primary opponents.

• In the district in the 2012 presidential election the Republican Romney beat the
Democrat Obama 58% to 40%.

• The Colbert bump: Elizabeth Colbert Busch is the sister of well-known comedian
Stephen Colbert.

Our strategy will be to use our intuition to construct some probability intervals and then
find a beta distribution that approximately matches these intervals. This is subjective so
someone else might give a di↵erent answer.
Step 1. Use the evidence to construct 0.5 and 0.9 probability intervals for ✓.
We’ll start by thinking about the 90% interval. The single strongest prior evidence is the
58% to 40% of Romney over Obama. Given the negatives for Sanford we don’t expect he’ll
win much more than 58% of the vote. So we’ll put the top of the 0.9 interval at 0.65. With
all of Sanford’s negatives he could lose big. So we’ll put the bottom at 0.3.

0.9 interval: [0.3, 0.65]

For the 0.5 interval we’ll pull these endpoints in. It really seems unlikely Sanford will get
more votes than Romney, so we can leave 0.25 probability that he’ll get above 57%. The
lower limit seems harder to predict. So we’ll leave 0.25 probability that he’ll get under 42%.

0.5 interval: [0.42, 0.57]

Step 2. Use our 0.5 and 0.9 probability intervals to pick a beta distribution that approx-
imats these intervals. We used the R function pbeta and a little trial and error to choose
beta(11,12). Here is our R code.
a = 11
b = 12
pbeta(0.65, a, b) - pbeta(0.3, a, b)
pbeta(0.57, a, b) - pbeta(0.42, a, b)

This computed P ([0.3, 0.65]) = 0.91 and P ([0.42, 0.57]) = 0.52. So our intervals are actually
0.91 and 0.52-probability intervals. This is pretty close to what we wanted!
18.05 class 16, Probability intervals, Spring 2014 5

At right is a graph of the density of beta(11,12). The red line shows our interval [0.42, 0.57]
and the blue line shows our interval [0.3, 0.65].

PDF for beta(11,12) PDF for beta(9.9,11.0)

3
3

q0.25 = 0.399
q0.25 = 0.472
q0.25 = 0.547

2
2

1
1
0

0
0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0
θ θ
beta(11,12) found using probability intervals and beta(9.9,11.0) found using quantiles

4.2 Constructing a prior by estimating quantiles

The method in Example 3 gives a good feel for building priors from probability intervals.
Here we illustrate a slightly di↵erent way of building a prior by estimating quantiles. The
basic strategy is to first estimate the median, then divide and conquer to estimate the first
and third quantiles. Finally you choose a prior distribution that fits these estimates.
Example 4. Redo the Sanford vs. Colbert-Busch election example using quantiles.
answer: We start by estimating the median. Just as before the single strongest evidence is
the 58% to 40% victory of Romney over Obama. However, given Sanford’s negatives and
Busch’s Colbert bump we’ll estimate the median at 0.47.
In a district that went 58 to 40 for the Republican Romney it’s hard to imagine Sanford’s
vote going a lot below 40%. So we’ll estimate Sanford 25th percentile as 0.40. Likewise,
given his negatives it’s hard to imagine him going above 58%, so we’ll estimate his 75th
percentile as 0.55.
We used R to search through values of a and b for the beta distribution that matches these
quartiles the best. Since the beta distribution does not require a and b to be integers we
looked for the best fit to 1 decimal place. We found beta(9.9, 11.0). Above is a plot of
beta(9.9,11.0) with its actual quartiles shown. These match the desired quartiles pretty
well.

Historic note. In the election Sanford won 54% of the vote and Busch won 45.2%. (Source:
http://elections.huffingtonpost.com/2013/mark-sanford-vs-elizabeth-colbert-busch-sc1
The Frequentist School of Statistics
Class 17, 18.05
Jeremy Orlo↵ and Jonathan Bloom

1 Learning Goals

1. Be able to explain the di↵erence between the frequentist and Bayesian approaches to
statistics.

2. Know our working definition of a statistic and be able to distinguish a statistic from a
non-statistic.

2 Introduction

After much foreshadowing, the time has finally come to switch from Bayesian statistics to
frequentist statistics. For much of the twentieth century, frequentist statistics has been the
dominant school. If you’ve ever encountered confidence intervals, p-values, t-tests, or 2 -
tests, you’ve seen frequentist statistics. With the rise of high-speed computing and big data,
Bayesian methods are becoming more common. After we’ve studied frequentist methods
we will compare the strengths and weaknesses of the two approaches.

2.1 The fork in the road

Both schools of statistics start with probability. In particular both know and love Bayes’
theorem:
P (D|H) P (H)
P (H|D) = .
P (D)
When the prior is known exactly all statisticians will use this formula. For Bayesian inference
we take H to be a hypothesis and D some data. Over the last few weeks we have seen that,
given a prior and a likelihood model, Bayes’ theorem is a complete recipe for updating our
beliefs in the face of new data. This works perfectly when the prior was known perfectly. We
saw this in our dice examples. We also saw examples of a disease with a known frequency
in the general population and a screening test of known accuracy.
In practice we saw that there is usually no universally-accepted prior – di↵erent people
will have di↵erent a priori beliefs – but we would still like to make useful inferences from
data. Bayesians and frequentists take fundamentally di↵erent approaches to this challenge,
as summarized in the figure below.

1
18.05 class 17, The Frequentist School of Statistics, Spring 2014 2

Everyone uses Bayes’

P (D|H)P (H)
Probability P (H|D) = formula when the prior
(mathematics) P (D) P (H) is known.

Bayesian path Frequentist path

Statistics
(art)
P (D|H)Pprior (H)
PPosterior (H|D) = Likelihood L(H; D) = P (D|H)
P (D)
Bayesians require a prior, so Without a known prior frequen-
they develop one from the best tists draw inferences from just
information they have. the likelihood function.

The reasons for this split are both practical (ease of implementation and computation) and
philosophical (subjectivity versus objectivity and the nature of probability).

2.2 What is probability?

The main philosophical di↵erence concerns the meaning of probability. The term frequentist
refers to the idea that probabilities represent longterm frequencies of repeatable random
experiments. For example, ‘a coin has probability 1/2 of heads’ means that the relative
frequency of heads (number of heads out of number of flips) goes to 1/2 as the number of
flips goes to infinity. This means the frequentist finds it non-sensical to specify a probability
distribution for a parameter with a fixed value. While Bayesians are happy to use probability
to describe their incomplete knowledge of a fixed parameter, frequentists reject the use of
probability to quantify degree of belief in hypotheses.
Example 1. Suppose I have a bent coin with unknown probability ✓ of heads. The value
of ✓ may be unknown, but it is a fixed value. Thus, to the frequentist there can be no prior
pdf f (✓). By comparison the Bayesian may agree that ✓ has a fixed value, but interprets
f (✓) as representing uncertainty about that value. Both the Bayesian and the frequentist
are perfectly happy with p(heads | ✓) = ✓, since the longterm frequency of heads given ✓ is
✓.
In short, Bayesians put probability distributions on everything (hypotheses and data), while
frequentists put probability distributions on (random, repeatable, experimental) data given
a hypothesis. For the frequentist when dealing with data from an unknown distribution
only the likelihood has meaning. The prior and posterior do not.

3 Working definition of a statistic

Our view of statistics is that it is the art of drawing conclusions (making inferences) from
data. With that in mind we can make a simple working definition of a statistic. There is a
more formal definition, but we don’t need to introduce it at this point.
18.05 class 17, The Frequentist School of Statistics, Spring 2014 3

Statistic. A statistic is anything that can be computed from data. Sometimes to be more
precise we’ll say a statistic is a rule for computing something from data and the value of the
statistic is what is computed. This can include computing likelihoods where we hypothesize
values of the model parameters. But it does not include anything that requires we know
the true value of a model parameter with unknown value.
Examples. 1. The mean of data is a statistic. It is a rule that says given data x1 , . . . , xn
compute x1 +...+x
n
n
.
2. The maximum of data is a statistic. It is a rule that says to pick the maximum value of
the data x1 , . . . , xn .
3. Suppose x ⇠ N(µ, 9) where µ is unknown. Then the likelihood

1 (x 7)2
p(x|µ = 7) = p e 18
3 2⇡
is a statistic. However, the distance of x from the true mean µ is not a statistic since we
cannot compute it without knowing µ
Point statistic. A point statistic is a single value computed from data. For example, the
mean and the maximum are both point statistics. The maximum likelihood estimate is also
a point statistic since it is computed directly from the data based on a likelihood model.
Interval statistic. An interval statistic is an interval computed from data. For example,
the range from the minimum to maximum of x1 , . . . , xn is an interval statistic, e.g. the data
0.5, 1.0, 0.2, 3.0, 5.0 has range [0.2, 5.0].
Set statistic. A set statistic is a set computed from data.
Example. Suppose we have five dice: 4, 6, 8, 12 and 20-sided. We pick one at random and
roll it once. The value of the roll is the data. The set of dice for which this roll is possible
is a set statistic. For example, if the roll is a 10 then the value of this set statistic is {12,
20}. If the roll is a 7 then this set statistic has value {8, 12, 20}.
It’s important to remember that a statistic is itself a random variable since it is computed
from random data. For example, if data is drawn from N(µ, 2 ) then the mean of n data
points follows N(µ, 2 /n)).
Sampling distribution. The probability distribution of a statistic is called its sampling
distribution.
Point estimate. We can use statistics to make a point estimate of a parameter ✓. For
example, if the parameter ✓ represents the true mean then the data mean x̄ is a point
estimate of ✓.
Null Hypothesis Significance Testing I
Class 17, 18.05
Jeremy Orloﬀ and Jonathan Bloom

1 Learning Goals

1. Know the definitions of the significance testing terms: NHST, null hypothesis, alternative
hypothesis, simple hypothesis, composite hypothesis, significance level, power.

2. Be able to design and run a significance test for Bernoulli or binomial data.

3. Be able to compute a p-value for a normal hypothesis and use it in a significance test.

2 Introduction

Frequentist statistics is often applied in the framework of null hypothesis significance testing
(NHST). We will look at the Neyman-Pearson paradigm which focuses on one hypothesis
called the null hypothesis. There are other paradigms for hypothesis testing, but Neyman-
Pearson is the most common. Stated simply, this method asks if the data is well outside
the region where we would expect to see it under the null hypothesis. If so, then we reject
the null hypothesis in favor of a second hypothesis called the alternative hypothesis.
The computations done here all involve the likelihood function. There are two main diﬀer-
ences between what we’ll do here and what we did in Bayesian updating.
1. The evidence of the data will be considered purely through the likelihood function it
will not be weighted by our prior beliefs.
2. We will need a notion of extreme data, e.g. 95 out of 100 heads in a coin toss or a Mayfly
that lives for a month.

2.1 Motivating examples

Example 1. Suppose you want to decide whether a coin is fair. If you toss it 100 times
and get 85 heads, would you think the coin is likely to be unfair? What about 60 heads? Or
52 heads? Most people would guess that 85 heads is strong evidence that the coin is unfair,
whereas 52 heads is no evidence at all. Sixty heads is less clear. Null hypothesis significance
testing (NHST) is a frequentist approach to thinking quantitatively about these questions.

Example 2. Suppose you want to compare a new medical treatment to a placebo or

the current standard of care. What sort of evidence would convince you that the new
treatment is better than the placebo or the current standard? Again, NHST is a quantitative
framework for answering these questions.

1
18.05 class 17, Null Hypothesis Significance Testing I, Spring 2014 2

3 Significance testing

We’ll start by listing the ingredients for NHST. Formally they are pretty simple. There is
an art to choosing good ingredients. We will explore the art in examples. If you have never
seen NHST before just scan this list now and come back to it after reading through the
examples and explanations given below.

3.1 Ingredients

• H0 : the null hypothesis. This is the default assumption for the model generating the
data.

• HA : the alternative hypothesis. If we reject the null hypothesis we accept this alter-
native as the best explanation for the data.

• X: the test statistic. We compute this from the data.

• Null distribution: the probability distribution of X assuming H0 .

• Rejection region: if X is in the rejection region we reject H0 in favor of HA .

• Non-rejection region: the complement to the rejection region. If X is in this region

we do not reject H0 . Note that we say ‘do not reject’ rather than ‘accept’ because
usually the best we can say is that the data does not support rejecting H0 .

The null hypothesis H0 and the alternative hypothesis HA play diﬀerent roles. Typically
we choose H0 to be either a simple hypothesis or the default which we’ll only reject if we
have enough evidence against it. The examples below will clarify this.

4 NHST Terminology

In this section we will use one extended example to introduce and explore the terminology
used in null hypothesis significance testing (NHST).
Example 3. To test whether a coin is fair we flip it 10 times. If we get an unexpectedly
large or small number of heads we’ll suspect the coin is unfair. To make this precise in the
language of NHST we set up the ingredients as follows. Let θ be the probability that the
coin lands heads when flipped.
1.Null hypothesis: H0 = ‘the coin is fair’, i.e. θ = 0.5.
2.Alternative hypothesis: HA = ‘the coin is not fair’, i.e. θ ̸= .5
3.Test statistic: X = number of heads in 10 flips
4.Null distribution: This is the probability function based on the null hypothesis
p(x | θ = 0.5) ∼ binomial(10, 0.5).
Here is the probability table for the null distribution.
x 0 1 2 3 4 5 6 7 8 9 10
p(x | H0 ) .001 .010 .044 .117 .205 .246 .205 .117 .044 .010 .001
5. Rejection region: under the null hypothesis we expect to get about 5 heads in 10 tosses.
18.05 class 17, Null Hypothesis Significance Testing I, Spring 2014 3

We’ll reject H0 if the number of heads is much fewer or greater than 5. Let’s set the rejection
region as {0, 1, 2, 8, 9, 10}. That is, if the number of heads in 10 tosses is in this region we
will reject the hypothesis that the coin is fair in favor of the hypothesis that it is not.
We can summarize all this in the graph and probability table below. The rejection region
consists of those values of x in red. The probabilities corresponding to it are shaded in red.
We also show the null distribution as a stem plot with the rejection values of x in red.

x 0 1 2 3 4 5 6 7 8 9 10
p(x|H0 ) .001 .010 .044 .117 .205 .246 .205 .117 .044 .010 .001

Rejection region and null probabilities as a table for Example 3.

p(x | H0 )
.25

.15

.05
x
0 1 2 3 4 5 6 7 8 9 10

Rejection region and null probabilities as a stemp plot for Example 3.

Notes for Example 3:
1. The null hypothesis is the cautious default: we won’t claim the coin is unfair unless we
have compelling evidence.
2. The rejection region consists of data that is extreme under the null hypothesis. That is,
it consists of the outcomes that are in the tail of the null distribution away from the high
probability center. As we’ll discuss soon, how far away depends on the significance level α
of the test.
3. If we get 3 heads in 10 tosses, then the test statistic is in the non-rejection region. The
usual scientific language would be to say that the data ‘does not support rejecting the null
hypothesis’. Even if we got 5 heads, we would not claim that the data proves the null
hypothesis is true.
Question: If we have a fair coin what is the probability that we will decide incorrectly it
is unfair?
answer: The null hypothesis is that the coin is fair. The question asks for the probability
the data from a fair coin will be in the rejection region. That is, the probability that we
will get 0, 1, 2, 8, 9 or 10 heads in 10 tosses. This is the sum of the probabilities in red.
That is,
P (rejecting H0 | H0 is true) = 0.11
Below we will continue with Example 3, define more terms used in NHST and see how to
quantify properties of the significance test.

4.1 Simple and composite hypotheses

Definition: simple hypothesis: A simple hypothesis is one for which we can specify its
distribution completely. A typical simple hypothesis is that a parameter of interest takes a
specific value.
18.05 class 17, Null Hypothesis Significance Testing I, Spring 2014 4

Definition: composite hypotheses: If its distribution cannot be fully specified, we say

that the hypothesis is composite. A typical composite hypothesis is that a parameter of
interest lies in a range of values.
In Example 3 the null hypothesis is that θ = 0.5, so the null distribution is binomial(10, 0.5).
Since the null distribution is fully specified, H0 is simple. The alternative hypothesis is that
θ ̸= 0.5. This is really many hypotheses in one: θ could be 0.51, 0.7, 0.99, etc. Since the
alternative distribution binomial(10, θ) is not fully specified, HA is composite.
Example 4. Suppose we have data x1 , . . . , xn . Suppose also that our hypotheses are
H0 : the data is drawn from N (0, 1)
HA : the data is drawn from N (1, 1).
These are both simple hypotheses – each hypothesis completely specifies a distribution.
Example 5. (Composite hypotheses.) Now suppose that our hypotheses are
H0 : the data is drawn from a Poisson distribution of unknown parameter.
HA : the data is not drawn from a Poisson distribution.
These are both composite hypotheses, as they don’t fully specify the distribution.
Example 6. In an ESP experiment a subject is asked to identify the suits of 100 cards
drawn (with replacement) from a deck of cards. Let T be the number of successes. The
(simple) null hypothesis that the subject does not have ESP is given by
H0 : T ∼ binomial(100, 0.25)
The (composite) alternative hypothesis that the subject has ESP is given by
HA : T ∼ binomial(100, p) with p > 0.25
Another (composite) alternative hypothesis that something besides pure chance is going on,
i.e. the subject has ESP or anti-ESP. This is given by
HA : T ∼ binomial(100, p), with p ̸= 0.25
Values of p < 0.25 represent hypotheses that the subject has a kind of anti-esp.

4.2 Types of error

There are two types of errors we can make. We can incorrectly reject the null hypothesis
when it is true or we can incorrectly fail to reject it when it is false. These are unimagina-
tively labeled type I and type II errors. We summarize this in the following table.
True state of nature
H0 HA
Our Reject H0 Type I error correct decision
decision ‘Don’t reject’ H0 correct decision Type II error

Type I: false rejection of H0

Type II: false non-rejection (‘acceptance’) of H0

4.3 Significance level and power

Significance level and power are used to quantify the quality of the significance test. Ideally
a significance test would not make errors. That is, it would not reject H0 when H0 was true
18.05 class 17, Null Hypothesis Significance Testing I, Spring 2014 5

and would reject H0 in favor of HA when HA was true. Altogether there are 4 important
probabilities corresponding to the 2 × 2 table just above.
P (reject H0 |H0 ) P (reject H0 |HA )
P (do not reject H0 |H0 ) P (do not reject H0 |HA )
The two probabilities we focus on are:
Significance level = P (reject H0 |H0 )
= probability we incorrectly reject H0
= P (type I error).
Power = probability we correctly reject H0
= P (reject H0 |HA )
= 1 − P (type II error).
Ideally, a hypothesis test should have a small significance level (near 0) and a large power
(near 1). Here are two analogies to help you remember the meanings of significance and
power.
Some analogies
1. Think of H0 as the hypothesis ‘nothing noteworthy is going on’, i.e. ‘the coin is fair’,
‘the treatment is no better than placebo’ etc. And think of HA as the opposite: ‘something
interesting is happening’. Then power is the probability of detecting something interesting
when it’s present and significance level is the probability of mistakenly claiming something
interesting has occured.
2. In the U.S. criminal defendents are presumed innocent until proven guilty beyond a
reasonable doubt. We can phrase this in NHST terms as
H0 : the defendent is innocent (the default)
HA : the defendent is guilty.
Significance level is the probability of finding and innocent person guilty. Power is the
probability of correctly finding a guilty party guilty. ‘Beyond a reasonable doubt’ means
we should demand the significance level be very small.
Composite hypotheses
HA is composite in Example 3, so the power is diﬀerent for diﬀerent values of θ. We expand
the previous probability table to include some alternate values of θ. We do the same with
the stem plots. As always in the NHST game, we look at likelihoods: the probability of the
data given a hypothesis.

x 0 1 2 3 4 5 6 7 8 9 10
H0 : p(x|θ = 0.5) .001 .010 .044 .117 .205 .246 .205 .117 .044 .010 .001
HA : p(x|θ = 0.6) .000 .002 .011 .042 .111 .201 .251 .215 .121 .040 .006
HA : p(x|θ = 0.7) .000 .0001 .001 .009 .037 .103 .200 .267 .233 .121 .028
18.05 class 17, Null Hypothesis Significance Testing I, Spring 2014 6

p(x | H0 ) p(x | θ = .6) p(x | θ = .7)

.25 .25 .25

.15 .15 .15

.05 .05 .05

x
0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10

Rejection region and null and alternative probabilities for example 3

We use the probability table to compute the significance level and power of this test.
Significance level = probability we reject H0 when it is true
= probability the test statistic is in the rejection region when H0 is true
= probability the test stat. is in the rejection region of the H0 row of the table
= sum of red boxes in the θ = 0.5 row
= 0.11
Power when θ = 0.6 = probability we reject H0 when θ = 0.6
= probability the test statistic is in the rejection region when θ = 0.6
= probability the test stat. is in the rejection region of the θ = 0.6 row of the table
= sum of dark blue boxes in the θ = 0.6 row
= 0.180
Power when θ = 0.7 = probability we reject H0 when θ = 0.7
= probability the test statistic is in the rejection region when θ = 0.7
= probability the test stat. is in the rejection region of the θ = 0.7 row of the table
= sum of dark green boxes in the θ = 0.7 row
= 0.384

We see that the power is greater for θ = 0.7 than for θ = 0.6. This isn’t surprising since we
expect it to be easier to recognize that a 0.7 coin is unfair than is is to recognize 0.6 coin
is unfair. Typically, we get higher power when the alternate hypothesis is farther from the
null hypothesis. In Example 3, it would be quite hard to distinguish a fair coin from one
with θ = 0.51.

4.4 Conceptual sketches

We illustrate the notions of null hypothesis, rejection region and power with some sketches
of the pdfs for the null and alternative hypotheses.

4.4.1 Null distribution: rejection and non-rejection regions

The first diagram below illustrates a null distribution with rejection and non-rejection re-
gions. Also shown are two possible test statistics: x1 and x2 .
18.05 class 17, Null Hypothesis Significance Testing I, Spring 2014 7

f (x|H0 )

x
x2 -3 0 x1 3
reject H0 don’t reject H0 reject H0

The test statistic x1 is in the non-rejection region. So, if our data produced the test statistic
x1 then we would not reject the null hypothesis H0 . On the other hand the test statistic x2
is in the rejection region, so if our data produced x2 then we would reject the null hypothesis
in favor of the alternative hypothesis.
There are several things to note in this picture.
1. The rejection region consists of values far from the center of the null distribution.
2. The rejection region is two-sided. We will also see examples of one-sided rejection regions
as well.
3. The alternative hypothesis is not mentioned. We reject or don’t reject H0 based only
on the likelihood f (x|H0 ), i.e. the probability of the test statistic conditioned on H0 . As
we will see, the alternative hypothesis HA should be considered when choosing a rejection
region, but formally it does not play a role in rejecting or not rejection H0 .
4. Sometimes we rather lazily call the non-rejection region the acceptance region. This is
technically incorrect because we never truly accept the null hypothesis. We either reject or
say the data does not support rejecting H0 . This is often summarized by the statement:
you can never prove the null hypothesis.

4.4.2 High and low power tests

The next two figures show high and low power tests.
The shaded area under f (x|H0 ) represents the significance level. Remember the significance
level is

• The probability of falsely rejecting the null hypothesis when it is true.

• The probabilitiy the test statistic falls in the rejection region even though H0 is true.

Likewise, the shaded area under f (x|HA ) represents the power, i.e. the probability that the
test statistic is in the rejection (of H0 ) region when HA is true. Both tests have the same
significance level, but if f (x|HA ) has considerable overlap with f (x|H0 ) the power is much
lower. It is well worth your while to thoroughly understand these graphical representations
of significance testing.
18.05 class 17, Null Hypothesis Significance Testing I, Spring 2014 8

f (x|HA ) f (x|H0 )

x
x1 -4 x2 0 x3
reject H0 region . non-reject H0 region

High power test

f (x|HA ) f (x|H0 )

x
x1 x2 -0.4 0 x3
reject H0 region . non-reject H0 region

Low power test

In both tests both distributions are standard normal. The null distribution, rejection region
and significance level are all the same. (The significance level is the red/purple area under
f (x | H0 and above the rejection region.) In the top figure we see the means of the two
distributions are 4 standard deviations apart. Since the areas under the densities have very
little overlap the test has high power. That is if the data x is drawn from HA it will almost
certainly be in the rejection region. For example x3 would be a very surprising outcome for
the HA distribution.
In the bottom figure we see the means of the two distributions are just 0.4 standard devia-
tions apart. Since the areas under the densities have a lot of overlap the test has low power.
That is if the data x is drawn from HA it is highly likely to be in the non-rejection region.
For example x3 would be not be a very surprising outcome for the HA distribution.
Typically we can increase the power of a test by increasing the amount of data and thereby
decreasing the variance of the null and alternative distributions. In experimental design it
is important to determine ahead of time the number of trials or subjects needed to achieve
a desired power.

Example 7. Suppose a drug for a disease is being compared to a placebo. We choose our
null and alternative hypotheses as
H0 = the drug does not work better than the placebo
HA = the drug works better than the placebo
The power of the hypothesis test is the probability that the test will conclude that the drug
is better, if it is indeed truly better. The significance level is the probability that the test
will conclude that the drug works better, when in fact it does not.
18.05 class 17, Null Hypothesis Significance Testing I, Spring 2014 9

5 Designing a hypothesis test

Formally all a hypothesis test requires is H0 , HA , a test statistic and a rejection region. In
practice the design is often done using the following steps.
1. Pick the null hypothesis H0 .
The choice of H0 and HA is not mathematics. It’s art and custom. We often choose H0 to
be simple. Or we often choose H0 to be the simplest or most cautious explanation, i.e. no
eﬀect of drug, no ESP, no bias in the coin.
2. Decide if HA is one-sided or two-sided.
In the example 3 we wanted to know if the coin was unfair. An unfair coin could be biased
for or against heads, so HA : θ ̸= 0.5 is a two-sided hypothesis. If we only care whether or
not the coin is biased for heads we could use the one-sided hypothesis HA : θ > 0.5.
3. Pick a test statistic.
For example, the sample mean, sample total, or sample variance. Often the choice is obvious.
Some standard statistics that we will encounter are z, t, and χ2 . We will learn to use these
statistics as we work examples over the next few classes. One thing we will say repeatedly
is that the distributions that go with these statistics are always conditioned on the null
hypothesis. That is, we will compute likelihoods such as f (z | H0 ).
4. Pick a significance level and determine the rejection region.
We will usually use α to denote the significance level. The Neyman-Pearson paradigm is to
pick α in advance. Typical values are 0.1, 0.05, 0.01. Recall that the significance level is
the probability of a type I error, i.e. of incorrectly rejecting the null hypothesis when it is
true. The value we choose will depend on the consequences of a type I error.
Once the significance level is chosen we can determine the rejection region in the tail(s) of
the null distribution. In Example 3, HA is two sided so the rejection region is split between
the two tails of the null distribution. This distribution is given in the following table:

x 0 1 2 3 4 5 6 7 8 9 10
p(x|H0 ) .001 .010 .044 .117 .205 .246 .205 .117 .044 .010 .001

If we set α = 0.05 then the rejection region must contain at most .05 probability. For a
two-sided rejection region we get
{0, 1, 9, 10}.
If we set α = 0.01 the rejection region is
{0, 10}.

Suppose we change HA to ‘the coin is biased in favor of heads’. We now have a one-sided
hypothesis θ > 0.5. Our rejection region will now be in the right-hand tail since we don’t
want to reject H0 in favor of HA if we get a small number of heads. Now if α = 0.05 the
rejection region is the one-sided range
{9, 10}.
If we set α = 0.01 then the rejection region is
{10}.
18.05 class 17, Null Hypothesis Significance Testing I, Spring 2014 10

5. Determine the power(s).

As we saw in Example 3, once the rejection region is set we can determine the power of the
test at various values of the alternate hypothesis.

Example 8. (Consequences of significance) If α = 0.1 then we’d expect a 10% type

I error rate. That is, we expect to reject the null hypothesis in 10% of those experiments
where the null hypothesis is true. Whether 0.1 is a reasonable signficance level depends on
the decisions that will be made using it.
For example, if you were running an experiminent to determine if your chocolate is more
than 72% cocoa then a 10% error type I error rate is probably okay. That is, falsely believing
some 72% chocalate is greater that 72%, is probably acceptable. On the other hand, if your
forensic lab is identifying fingerprints for a murder trial then a 10% type I error rate, i.e.
mistakenly claiming that fingerprints found at the crime scene belonged to someone who
was truly innocent, is definitely not acceptable.
Significance for a composite null hypothesis. If H0 is composite then P(type I error) depends
on which member of H0 is true. In this case the significance level is defined as the maximum
of these probabilities.

6 Critical values

Critical values are like quantiles except they refer to the probability to the right of the value
instead of the left.
Example 9. Use R to find the 0.05 critical value for the standard normal distribution.
answer: We label this critical value z0.05 . The critical value z0.05 is just the 0.95 quantile,
i.e. it has 5% probability to its right and therefore 95% probability to its left. We computed
it with the R function qnorm: qnorm(0.95, 0, 1), which returns 1.64.
In a typical significance test the rejection region consists of one or both tails of the null
distribution. The value of the test significant that marks the start of the rejection region is
a critical value. We show this and the notation used in some examples.
Example 10. Critical values and rejection regions. Suppose our test statistic x has null
distribution is N(100, 152 ), i.e. f (x|H0 ) ∼ N(100, 152 ). Suppose also that our rejection
region is right-sided and we have a significance level of 0.05. Find the critical value and
sketch the null distribution and rejection region.
answer: The notation used for the critical value with right tail containing probability 0.05
is x0.05 . The critical value x0.05 is just the 0.95 quantile, i.e. it has 5% probability to its
right and therefore 95% probability to its left. We computed it with the R function qnorm:
qnorm(0.95, 100, 15), which returned 124.7. This is shown in the figure below.
18.05 class 17, Null Hypothesis Significance Testing I, Spring 2014 11

f (x|H0 ) ∼ N(100, 152 )

x0.05 = 124.7
α = red = 0.05

x
x0.05
100
non-reject H0 reject H0

Example 11. Critical values and rejection regions. Repeat the previous example for a
left-sided rejection region with significance level 0.05. In this case, the start of the rejection
region is at the 0.05 quantile. Since there is 95%
answer: In this case the critical value has 0.05 probability to its left and therefore 0.95
probability to its right. So we label it x0.95 . Since it is the 0.05 quantile compute it with
the R function: qnorm(0.05, 100, 15), which returned 75.3.

f (x|H0 ) ∼ N(100, 152 )

x0.95 = 75.3
α = red = 0.05

x
x0.95
100
reject H0 non-reject H0

Example 12. Critical values. Repeat the previous example for a two-sided rejection region.
Put half the significance in each tail.
answer: To have a total significance of 0.05 we put 0.025 in each tail. That is, the left tail
starts at x0.975 = q0.025 and the right tail starts at x0.025 = q0.975 . We compute these values
with qnorm(0.025, 100, 15) and qnorm(0.975, 100, 15). The values are shown in the
figure below.

f (x|H0 ) ∼ N(100, 152 )

x0.025 = 129.4
x0.975 = 70.6
α = red = 0.05

x
x0.975 x0.025
100
reject H0 non-reject H0 reject H0

7 p-values

In practice people often specify the significance level and do the significance test using what
are called p-values. We will first define p-value and the see that
If the p-value is less than the significance level α then qw reject H0 . Other-
wise we do not reject H0 .
Definition. The p-value is the probability, assuming the null hypothesis, of seeing data at
least as extreme as the experimental data. What ‘at least as extreme’ means depends on
18.05 class 17, Null Hypothesis Significance Testing I, Spring 2014 12

the experimental design.

We illustrate the definition and use of p-values with a simple one-sided example. In later
classes we will look at two-sided examples. This example also introduces the z-test. All this
means is that our test statistic is standard normal (or approximately standard normal).
Example 13. The z-test for normal hypotheses
IQ is normally distributed in the population according to a N(100, 152 ) distribution. We
suspect that most MIT students have above average IQ so we frame the following hypothe-
ses.
H0 = MIT student IQs are distributed identically to the general population
= MIT IQ’s follow a N(100, 152 ) distribution.
HA = MIT student IQs tend to be higher than those of the general population
= the average MIT student IQ is greater than 100.
Notice that HA is one-sided.
Suppose we test 9 students and find they have an average IQ of x = 112. Can we reject H0
at a significance level α = 0.05?
answer: To compute p we first standardize the data: Under the null hypothesis x̄ ∼
N(100, 152 /9) and therefore
x̄ − 100 36
z= √ = = 2.4 ∼ N(0, 1).
15/ 9 15

That is, the null distribution for z is standard normal. We call z a z-statistic, we will use
it as our test statistic.
For a right-sided alternative hypothesis the phrase ‘data at least as extreme’ is a one-sided
tail to the right of z. The p-value is then

p = P (Z ≥ 2.4) = 1- pnorm(2.4,0,1) = 0.0081975.

Since p ≤ α we reject the null hypothesis. The reason this works is explained below. We
phrase our conclusion as
We reject the null hypothesis in favor of the alternative hypothesis that MIT
students have higher IQs on average. We have done this at significance level
0.05 with a p-value of 0.008.
Notes: 1. The average x = 112 is random: if we ran the experiment again we could get a
diﬀerent value for x.
2. We could use the statistic x directly. Standardizing is fairly standard because, with
practice, we will have a good feel for the meaning of diﬀerent z-values.
The justification for rejecting H0 when p ≤ α is given in the following figure.
18.05 class 17, Null Hypothesis Significance Testing I, Spring 2014 13

f (z|H0 ) ∼ N(0, 1)

z0.05 = 1.64
α = pink + red = 0.05
p = red = 0.008

z
z0.05 2.4
non-reject H0 reject H0

In this example α = 0.05, z0.05 = 1.64 and the rejection region is the range to the right
of z0.05 . Also, z = 2.4 and the p-value is the probability to the right of z. The picture
illustrates that

• z = 2.64 is in the rejection region

• is the same as z is to the right of z0.05

• is the same as the probability to the right of z is less than 0.05

• which means p < 0.05.

8 More examples

Hypothesis testing is widely used in inferential statistics. We don’t expect that the following
examples will make perfect sense at this time. Read them quickly just to get a sense of how
hypothesis testing is used. We will explore the details of these examples in class.
Example 14. The chi-square statistic and goodness of fit. (Rice, example B, p.313)
To test the level of bacterial contamination, milk was spread over a grid with 400 squares.
The amount of bacteria in each square was counted. We summarize in the table below.
The bottom row of the table is the number of diﬀerent squares that had a given amount of
bacteria.
Amount of bacteria 0 1 2 3 4 5 6 7 8 9 10 19
Number of squares 56 104 80 62 42 27 9 9 5 3 2 1

We compute that the average amount of bacteria per square is 2.44. Since the Poisson(λ)
distribution is used to model counts of relatively rare events and the parameter λ is the
expected value of the distribution. we decide to see if these counts could come from a
Poisson distribution. To do this we first graphically compare the observed frequencies with
those expected from Poisson(2.44).
18.05 class 17, Null Hypothesis Significance Testing I, Spring 2014 14

● ●

Number of squares Poisson(2.44)

● ● ●

80
●
● Observed
60 ●
●
●
40

●
●
20

●
● ●
● ●
● ● ● ● ● ● ● ● ● ● ● ●
● ●
0

5 10 15
Number of bacteria in square

The picture is suggestive, so we do a hypothesis test with

H0 : the samples come from a Poisson(2.44) distribution.
HA : the samples come from a diﬀerent distribution.
We use a chi-square statistic, so called because it (approximately) follows a chi-square
distribution. To compute X 2 we first combine the last few cells in the table so that the
minimum expected count is around 5 (a general rule-of-thumb in this game.)
The expected number of squares with a certain amount of bacteria comes from considering
400 trials from a Poisson(2.44) distribution, e.g., with l = 2.44 the expected number of
l3
squares with 3 bacteria is 400 × e−l = 84.4.
3!
! (Oi − Ei )2
The chi-square statistic is , where Oi is the observed number and Ei is the
Ei
expected number.

Number per square 0 1 2 3 4 5 6 >6

Observed 56 104 80 62 42 27 9 20
Expected 34.9 85.1 103.8 84.4 51.5 25.1 10.2 5.0
Component of X 2 12.8 4.2 5.5 6.0 1.7 0.14 0.15 44.5

Summing up we get X 2 = 74.9.

Since the mean (2.44) and the total number of trials (400) are fixed, the 8 cells only have
6 degrees of freedom. So, assuming H0 , our chi-square statistic follows (approximately) a
χ26 distribution. Using this distribution, P (X 2 > 74.59) = 0 (to at least 6 decimal places).
Thus we decisively reject the null hpothesis in favor of the alternate hypothesis that the
distribution is not Poisson(2.44).
To analyze further, look at the individual components of X 2 . There are large contributions
in the tail of the distribution, so that is where the fit goes awry.
18.05 class 17, Null Hypothesis Significance Testing I, Spring 2014 15

Example 15. Student’s t test.

Suppose we want to compare a medical treatment for increasing life expectancy with a
placebo. We give n people the treatment and m people the placebo. Let X1 , . . . , Xn be the
number of years people live after receiving the treatment. Likewise, let Y1 , . . . , Ym be the
number of years people live after receiving the placebo. Let X¯ and Y¯ be the sample means.
We want to know if the diﬀerence between X ¯ and Y¯ is statistically significant. We frame
this as a hypothesis test. Let µX and µY be the (unknown) means.
H0 : µX = µY , HA : µX ̸= µY .
With certain assumptions and a proper formula for the pooled standard error sp the test
X̄ − Y¯
statistic t = follow a t distribution with n + m − 2 degrees of freedom. So our
sp
rejection region is determined by a threshold t0 with P (t > t0 ) = α.
Null Hypothesis Significance Testing II
Class 18, 18.05
Jeremy Orlo↵ and Jonathan Bloom

1 Learning Goals

1. Be able to list the steps common to all null hypothesis significance tests.

2. Be able to define and compute the probability of Type I and Type II errors.

3. Be able to look up and apply one and two sample t-tests.

2 Introduction

We continue our study of significance tests. In these notes we will introduce two new tests:
one-sample t-tests and two-sample t-tests. You should pay careful attention to the fact that
every test makes some assumptions about the data – often that is drawn from a normal
distribution. You should also notice that all the tests follow the same pattern. It is just the
computation of the test statistic and the type of the null distribution that changes.

3 Review: setting up and running a significance test

There is a fairly standard set of steps one takes to set up and run a null hypothesis signifi-
cance test.

1. Design an experiment to collect data and choose a test statistic x to be computed

from the data. The key requirement here is to know the null distribution f (x|H0 ).
To compute power, one must also know the alternative distribution f (x|HA ).

2. Decide if the test is one or two-sided based on HA and the form of the null distribution.

3. Choose a significance level ↵ for rejecting the null hypothesis. If applicable, compute
the corresponding power of the test.

4. Run the experiment to collect data x1 , x2 , . . . , xn .

5. Compute the test statistic x.

6. Compute the p-value corresponding to x using the null distribution.

7. If p < ↵, reject the null hypothesis in favor of the alternative hypothesis.

Notes.
1. Rather than choosing a significance level, you could instead choose a rejection region and
reject H0 if x falls in this region. The corresponding significance level is then the probability
that x falls in the rejection region.

1
18.05 class 18, Null Hypothesis Significance Testing II, Spring 2014 2

2. The null hypothesis is often the ‘cautious hypothesis’. The lower we set the significance
level, the more “evidence” we will require before rejecting our cautious hypothesis in favor
of a more sensational alternative. It is standard practice to publish the p value itself so that
others may draw their own conclusions.
3. A key point of confusion: A significance level of 0.05 does not mean the test only
makes mistakes 5% of the time. It means that if the null hypothesis is true, then the
probability the test will mistakenly reject it is 5%. The power of the test measures the
accuracy of the test when the alternative hypothesis is true. Namely, the power of the
test is the probability of rejecting the null hypothesis if the alternative hypothesis is true.
Therefore the probability of falsely failing to reject the null hypothesis is 1 minus the power.
Errors. We can summarize these two types of errors and their probabilities as follows:
Type I error = rejecting H0 when H0 is true.
Type II error = failing to reject H0 when HA is true.

P(type I error) = probability of falsely rejecting H0

= P(test statistic is in the rejection region | H0 )
= significance level of the test
P(type II error) = probability of falsely not rejecting H0
= P(test statistic is in the acceptance region | HA )
= 1 - power.
Helpful analogies. In terms of medical testing for a disease, a Type I error is a false
positive and a Type II error is a false negative. In terms of a jury trial, a Type I error is
convicting an innocent defendant and a Type II error is acquitting a guilty defendant.

4 Understanding a significance test

Questions to ask:

1. How did they collect data? What is the experimental setup?

2. What are the null and alternative hypotheses?

3. What type of significance test was used?

Does their data match the criteria needed to use this type of test?
How robust is the test to deviations from this criteria.

4. For example, some tests comparing two groups of data assume that the groups are
drawn from distributions that have the same variance. This needs to be verified before
applying the test. Often the check is done using another significance test designed to
compare the variances of two groups of data.

5. How is the p-value computed?

A significance test comes with a test statistic and a null distribution. In most tests
the p-value is

p = P (data at least as extreme as what we got | H0 )

18.05 class 18, Null Hypothesis Significance Testing II, Spring 2014 3

What does ‘data at least as extreme as the data we saw,’ mean? I.e. is the test one
or two-sided.

6. What is the significance level ↵ for this test? If p < ↵ then the experimenter will
reject H0 in favor of HA .

5 t tests

Many significance tests assume that the data are drawn from a normal distribution, so
before using such a test you should examine the data to see if the normality assumption is
reasonable. We will describe how to do this in more detail later, but plotting a histogram
is a good start. Like the z-test, the one-sample and two-sample t-tests we’ll consider below
start from this normality assumption.
We don’t expect you to memorize all the computational details of these tests and those to
follow. In real life, you have access to textbooks, google, and wikipedia; on the exam, you’ll
have your notecard. Instead, you should be able to identify when a t test is appropriate
and apply this test after looking up the details and using a table or software like R.

5.1 z-test

Let’s first review the z-test.

2
• Data: we assume x1 , x2 , . . . , xn ⇠ N (µ, ), where µ is unknown and is known.

• Null hypothesis: µ = µ0 for some specific value µ0

x µ
• Test statistic: z= p 0 = standardized mean
/ n
• Null distribution: f (z | H0 ) is the pdf of Z ⇠ N (0, 1)

• One-sided p-value (right side): p = P (Z > z | H0 )

One-sided p-value (left side): p = P (Z < z | H0 )
Two-sided p-value: p = P (|Z| > |z|).

Example 1. Suppose that we have data that follows a normal distribution of unknown
mean µ and known variance 4. Let the null hypothesis H0 be that µ = 0. Let the alternative
hypothesis HA be that µ > 0. Suppose we collect the following data:

1, 2, 3, 6, 1

At a significance level of ↵ = 0.05, should we reject the null hypothesis?

answer: There are 5 data points with average x = 2.2. Because we have normal data with
a known variance we should use a z test. Our z statistic is
x µ 2.2 0
z = p0 = p = 2.460
/ n 2/ 5
18.05 class 18, Null Hypothesis Significance Testing II, Spring 2014 4

Our test is one-sided because the alternative hypothesis is one-sided. So (using R) our
p-value is
p = P (Z > z) = P (Z > 2.460) = 0.007
Since p < .05, we reject the null hypothesis in favor of the alternative hypothesis µ > 0.
We can visualize the test as follows:
f (z|H0 ) ⇠ Norm(0, 1)

Rejection region starts at q.95 = 1.645.

↵ = pink + red = .05
z = black dot = 2.46
p = red = .007

z
1.645 2.46
don’t reject H0 reject H0

5.2 The Student t distribution

‘Student’ is the pseudonym used by the William Gosset who first described this test and
this test and distribution. See http://en.wikipedia.org/wiki/Student’s_t-test
The t-distribution is symmetric and bell-shaped like the normal distribution. It has a
parameter df which stands for degrees of freedom. For df small the t-distribution has more
probability in its tails than the standard normal distribution. As df increases t(df ) becomes
more and more like the standard normal distribution.
Here is a simple applet that shows t(df ) and compares it to the standard normal distribution:
http://mathlets.org/mathlets/t-distribution/
As usual in R, the functions pt, dt, qt, rt correspond to cdf, pdf, quantiles, and random
sampling for a t distribution. Remember that you can type ?dt in RStudio to view the help
file specifying the parameters of dt. For example, pt(1.65,3) computes the probability
that x is less than or equal 1.65 given that x is sampled from the t distribution with 3
degrees of freedom, i.e. P (x  1.65) given that x ⇠ t(3)).

5.3 One sample t-test

For the z-test, we assumed that the variance of the underlying distribution of the data was
known. However, it is often the case that we don’t know and therefore we must estimate
it from the data. In these cases, we use a one sample t-test instead of a z-test and the
studentized mean in place of the standardized mean

2
• Data: we assume x1 , x2 , . . . , xn ⇠ N (µ, ), where both µ and are unknown.

• Null hypothesis: µ = µ0 for some specific value µ0

• Test statistic:
x µ0
t= p
s/ n
18.05 class 18, Null Hypothesis Significance Testing II, Spring 2014 5

where
n
X
1
s2 = (xi x)2 .
n 1
i=1
Here t is called the Studentized mean and s2 is called the sample variance. The latter
is an estimate of the true variance 2 .
• Null distribution: f (t | H0 ) is the pdf of T ⇠ t(n 1), the t distribution with n 1
degrees of freedom.*
• One-sided p-value (right side): p = P (T > t | H0 )
One-sided p-value (left side): p = P (T < t | H0 )
Two-sided p-value: p = P (|T | > |t|).

*It’s a theorem (not an assumption) that if the data is normal with mean µ0 then the
Studentized mean follows a t-distribution. A proof would take us too far afield, but you can
look it up if you want: http://en.wikipedia.org/wiki/Student’s_t-distribution#
Derivation
Example 2. Now suppose that in the previous example the variance is unknown. That
is, we have data that follows a normal distribution of unknown mean µ and and unknown
variance . Suppose we collect the same data as before:
1, 2, 3, 6, 1
As above, let the null hypothesis H0 be that µ = 0 and the alternative hypothesis HA be
that µ > 0. At a significance level of ↵ = 0.05, should we reject the null hypothesis?
answer: There are 5 data points with average x = 2.2. Because we have normal data with
unknown mean and unknown variance we should use a one-sample t test. Computing the
sample variance we get
1
s2 = (1 2.2)2 + (2 2.2)2 + (3 2.2)2 + (6 2.2)2 + ( 1 2.2)2 = 6.7
4
Our t statistic is
x µ0 2.2 0
t = p = p p = 1.901
s/ n 6.7/ 5
Our test is one-sided because the alternative hypothesis is one-sided. So (using R) the
p-value is
p = P (T > t) = P (T > 1.901) = 1-pt(1.901,4) = 0.065
Since p > .05, we do not reject the null hypothesis.
We can visualize the test as follows:
f (t|H0 ) ⇠ t(4)

Rejection region starts at q.95 = 2.13.

↵ = red = .05
t = black dot = 1.90
p = pink + red = 0.065

t
1.90 2.13
don’t reject H0 reject H0
18.05 class 18, Null Hypothesis Significance Testing II, Spring 2014 6

5.4 Two-sample t-test with equal variances

We next consider the case of comparing the means of two samples. For example, we might
be interested in comparing the mean efficacies of two medical treatments.

• Data: We assume we have two sets of data drawn from normal distributions
2
x1 , x2 , . . . , xn ⇠ N (µ1 , )
2
y1 , y2 , . . . , ym ⇠ N (µ2 , )

where the means µ1 and µ2 and the variance 2 are all unknown. Notice the assump-
tion that the two distributions have the same variance. Also notice that there are n
samples in the first group and m samples in the second.

• Null hypothesis: µ1 = µ2 (the values of µ1 and µ2 are not specified)

• Test statistic:
x ȳ
t= ,
sp
where sp2 is the pooled variance
✓ ◆
(n 1)s2x + (m 1)sy2 1 1
s2p = +
n+m 2 n m

Here s2x and sy2 are the sample variances of the xi and yj respectively. The expression
for t is somewhat complicated, but the basic idea remains the same and it still results
in a known null distribution.

• Null distribution: f (t | H0 ) is the pdf of T ⇠ t(n + m 2).

• One-sided p-value (right side): p = P (T > t | H0 )

One-sided p-value (left side): p = P (T < t | H0 )
Two-sided p-value: p = P (|T | > |t|).

Note 1: Some authors use a di↵erent notation. They define the pooled variance as

(n 1)s2x + (m 1)s2y
sp2-other-authors =
n+m 2
and what we called the pooled variance they point out is the estimated variance of x ȳ.
That is,
s2p = sp-other-authors ⇥ (1/n + 1/m) ⇡ s2x y¯

Note 2: There is a version of the two-sample t-test that allows the two groups to have
di↵erent variances. In this case the test statistic is a little more complicated but R will
handle it with equal ease.
Example 3. The following data comes from a real study in which 1408 women were
admitted to a maternity hospital for (i) medical reasons or through (ii) unbooked emergency
18.05 class 18, Null Hypothesis Significance Testing II, Spring 2014 7

admission. The duration of pregnancy is measured in complete weeks from the beginning
of the last menstrual period. We can summarize the data as follows:
Medical: 775 observations with x̄M = 39.08 and s2M = 7.77.
Emergency: 633 observations with x̄E = 39.60 and s2E = 4.95
Set up and run a two-sample t-test to investigate whether the mean duration di↵ers for the
two groups.
What assumptions did you make?
answer: The pooled variance for this data is
✓
✓ ◆
◆
774(7.77) + 632(4.95) 1 1
s2p = + = .0187
1406 775 633

The t statistic for the null distribution is

x¯M y¯E
= 3.8064
sp

We have 1406 degrees of freedom. Using R to compute the two-sided p-value we get

p = P (|T | > |t|) = 2*dt(-3.8064, 1406) = 0.00015

p is very small, much smaller than ↵ = .05 or ↵ = .01. Therefore we reject the null
hypothesis in favor of the alternative that there is a di↵erence in the mean durations.
Rather than compute the two-sided p-value exactly using a t-distribution we could have
noted that with 1406 degrees of freedom the t distribution is essentially standard normal
and 3.8064 is almost 4 standard deviations. So

P (|t| 3.8064) ⇡ P (|z| 3.8064) < .001

We assumed the data was normal and that the two groups had equal variances. Given the
large di↵erence between the sample variances this assumption may not be warranted.
In fact, there are other significance tests that test whether the data is approximately normal
and whether the two groups have the same variance. In practice one might apply these first
to determine whether a t test is appropriate in the first place. We don’t have time to go
into normality tests here, but we will see the F distribution used for equality of variances
next week.
http://en.wikipedia.org/wiki/Normality_test
http://en.wikipedia.org/wiki/F-test_of_equality_of_variances
Null Hypothesis Significance Testing III
Class 19, 18.05
Jeremy Orlo↵ and Jonathan Bloom

1 Learning Goals

1. Given hypotheses and data, be able to identify to identify an appropriate significance

test from a list of common ones.

2. Given hypotheses, data, and a suggested significance test, know how to look up details
and apply the significance test.

2 Introduction

In these notes we will collect together some of the most common significance tests, though
by necessity we will leave out many other useful ones. Still, all significance tests follow the
same basic pattern in their design and implementation, so by learning the ones we include
you should be able to easily apply other ones as needed.
Designing a null hypothesis significance test (NHST):

• Specify null and alternative hypotheses.

• Choose a test statistic whose null distribution and alternative distribution(s) are
known.

• Specify a rejection region. Most often this is done implicitly by specifying a signif-
icance level ↵ and a method for computing p-values based on the tails of the null
distribution.

• Compute power using the alternative distribution(s).

Running a NHST:

• Collect data and compute the test statistic.

• Check if the test statistic is in the rejection region. Most often this is done implicitly
by checking if p < ↵. If so, we ‘reject the null hypothesis in favor of the alternative
hypothesis’. Otherwise we conclude ‘the data does not support rejecting the null
hypothesis’.

Note the careful phrasing: when we fail to reject H0 , we do not conclude that H0 is true.
The failure to reject may have other causes. For example, we might not have enough data
to clearly distinguish H0 and HA , whereas more data would indicate that we should reject
H0 .

1
18.05 class 19, Null Hypothesis Significance Testing III, Spring 2014 2

3 Population parameters and sample statistics

Example 1. If we randomly select 10 men from a population and measure their heights
we say we have sampled the heights from the population. In this case the sample mean, say
x, is the mean of the sampled heights. It is a statistic and we know its value explicitly. On
the other hand, the true average height of the population, say µ, is unknown and we can
only estimate its value. We call µ a population parameter.
The main purpose of significance testing is to use sample statistics to draw conlusions about
population parameters. For example, we might test if the average height of men in a given
population is greater than 70 inches.

4 A gallery of common significance tests related to the nor-

mal distribution

We will show a number of tests that all assume normal data. For completeness we will
include the z and t tests we’ve already explored.
You shouldn’t try to memorize these tests. It is a hopeless task to memorize the tests given
here and even more hopeless to memorize all the tests we’ve left out. Rather, your goal
should be to be able to find the correct test when you need it. Pay attention to the types
of hypotheses the tests are designed to distinguish and the assumptions about the data
needed for the test to be valid. We will work through the details of these tests in class and
on homework.
The null distributions for all of these tests are all related to the normal distribution by
explicit formulas. We will not go into the details of these distributions or the arguments
showing how they arise as the null distributions in our significance tests. However, the
arguments are accessible to anyone who knows calculus and is interested in undersdanding
them. Given the name of any distribution, you can easily look up the details of its con-
struction and properties online. You can also use R to explore the distribution numerically
and graphically.
When analyzing data with any of these tests one thing of key importance is to verify that
the assumptions are true or at least approximately true. For example, you shouldn’t use a
test that assumes the data is normal unless you’ve checked that the data is approximately
normal.
The script class19.r contains examples of using R to run some of these tests. It is posted in
our usual place for R code.

4.1 z-test

• Use: Test if the population mean equals a hypothesized mean.

• Data: x1 , x2 , . . . , xn .
• Assumptions: The data are independent normal samples:
xi ⇠ N (µ, 2 ) where µ is unknown, but is known.
• H0 : For a specified µ0 , µ = µ0 .
18.05 class 19, Null Hypothesis Significance Testing III, Spring 2014 3

• HA :
Two-sided: µ=6 µ0
one-sided-greater: µ > µ0
one-sided-less: µ < µ0
x µ
• Test statistic: z = p0
/ n
• Null distribution: f (z | H0 ) is the pdf of Z ⇠ N (0, 1).
• p-value:
Two-sided: p = P (|Z| > z) = 2*(1-pnorm(abs(z), 0, 1))
one-sided-greater: p = P (Z > z) = 1 - pnorm(z, 0, 1)
one-sided-less: p = P (Z < z) = pnorm(z, 0, 1)
• R code: There does not seem to be a single R function to run a z-test. Of course it
is easy enough to get R to compute the z score and p-value.

Example 2. We quickly reprise our example from the class 17 notes.

IQ is normally distributed in the population according to a N(100, 152 ) distribution. We
suspect that most MIT students have above average IQ so we frame the following hypothe-
ses.
H0 = MIT student IQs are distributed identically to the general population
= MIT IQ’s follow a N(100, 152 ) distribution.
HA = MIT student IQs tend to be higher than those of the general population
= the average MIT student IQ is greater than 100.
Notice that HA is one-sided.
Suppose we test 9 students and find they have an average IQ of x̄ = 112. Can we reject H0
at a significance level ↵ = 0.05?
answer: Our test statistic is
x̄ 100 36
z= p = = 2.4.
15/ 9 15

The right-sided p-value is thereofre

p = P (Z 2.4) = 1- pnorm(2.4,0,1) = 0.0081975.

Since p  ↵ we reject the null hypothesis in favor of the alternative hypothesis that MIT
students have higher IQs on average.

4.2 One-sample t-test of the mean

• Use: Test if the population mean equals a hypothesized mean.

• Data: x1 , x2 , . . . , xn .
• Assumptions: The data are independent normal samples:
xi ⇠ N (µ, 2 ) where both µ and are unknown.
• H0 : For a specified µ0 , µ = µ0
18.05 class 19, Null Hypothesis Significance Testing III, Spring 2014 4

• HA :
Two-sided: µ=6 µ0
one-sided-greater: µ > µ0
one-sided-less: µ < µ0
x µ0
• Test statistic: t = p ,
s/ n
n
X
2 1
where s2 is the sample variance: s = (xi x)2
n 1
i=1
• Null distribution: f (t | H0 ) is the pdf of T ⇠ t(n 1).
(Student t-distribution with n 1 degrees of freedom)
• p-value:
Two-sided: p = P (|T | > t) = 2*(1-pt(abs(t), n-1))
one-sided-greater: p = P (T > t) = 1 - pt(t, n-1)
one-sided-less: p = P (T < t) = pt(t, n-1)
• R code example: For data x = 1, 3, 5, 7, 2 we can run a one-sample t-test with H0 :
µ = 2.5 using the R command:
t.test(x, mu = mu0, alternative=t́wo.sided´ =TRUE)
This will return a several pieces of information including the mean of the data, t-value
and the two-sided p-value. See the help for this function for other argument settings.

Example 3. Look in the class 18 notes or slides for an example of this test. The class 19
example R code also gives an example.

4.3 Two-sample t-test for comparing means

4.3.1 The case of equal variances

We start by describing the test assuming equal variances.

• Use: Test if the population means from two populations di↵er by a hypothesized
amount.
• Data: x1 , x2 , . . . , xn and y1 , y2 , . . . , ym .
• Assumptions: Both groups of data are independent normal samples:
xi ⇠ N (µx , 2)

yj ⇠ N (µy , 2)

where both µx and µy are unknown and possibly di↵erent. The variance 2 is un-
known, but the same for both groups.
• H0 : For a specified µ0 : µx µ y = µ0
• HA :
Two-sided: µx µy 6= µ0
one-sided-greater: µx µ y > µ0
one-sided-less: µx µ y < µ0
18.05 class 19, Null Hypothesis Significance Testing III, Spring 2014 5

x
ȳ µ0
• Test statistic: t = ,
sP
where sx2 and sy2 are the sample variances of the x and y data respectively, and sP2
is (sometimes called) the pooled sample variance:
✓ ◆
(n 1)sx2 + (m 1)sy2 1 1
s2p = + and df = n + m 2
n+m 2 n m

• Null distribution: f (t | H0 ) is the pdf of T ⇠ t(df ), the t-distribution with df =

n + m 2 degrees of freedom.
• p-value:
Two-sided: p = P (|T | > t) = 2*(1-pt(abs(t), df))
one-sided-greater: p = P (T > t) = 1 - pt(t, df)
one-sided-less: p = P (T < t) = pt(t, df)
• R code: The R function t.test will run a two-sample t-test. See the example code
in class19.r

Example 4. Look in the class 18 notes or slides for an example of the two-sample t-test.
Notes: 1. Most often the test is done with µ0 = 0. That is, the null hypothesis is the the
means are equal, i.e. µx µy = 0.
2. If the x and y data have the same length, n, then the formula for s2p becomes simpler:

s2x + sy2
s2p =
n

4.3.2 The case of unequal variances

There is a form of the t-test for when the variances are not assumed equal. It is sometimes
called Welch’s t-test.
This looks exactly the same as the case of equal except for a small change in the assumptions
and the formula for the pooled variance:

where both µx and µy are unknown and possibly di↵erent. The variances 2 and 2
x y
are unknown and not assumed to be equal.
• H0 , HA : Exactly the same as the case of equal variances.
x ȳ µ0
• Test statistic: t = ,
sP
where sx2 and s2y are the sample variances of the x and y data respectively, and s2P
18.05 class 19, Null Hypothesis Significance Testing III, Spring 2014 6

is (sometimes called) the pooled sample variance:

s2x s2y (s2x /n + sy2 /m)2

s2p = + and df =
n m (s2x /n)2 /(n 1) + (s2y /m)2 /(m 1)

• Null distribution: f (t | H0 ) is the pdf of T ⇠ t(df ), the t distribution with df degrees

of freedom.
• p-value: Exactly the same as the case of equal variances.
• R code: The function t.test also handles this case by setting the argument var.equal=FALSE.

4.3.3 The paired two-sample t-test

When the data naturally comes in pairs (xi , yi ), we can us the paired two-sample t-test.
(After checking the assumptions are valid!)
Example 5. To measure the e↵ectiveness of a cholesterol lowering medication we might
test each subject before and after treatment with the drug. So for each subject we have
a pair of measurements: xi = cholesterol level before treatment and yi = cholesterol level
after treatment.
Example 6. To measure the e↵ectiveness of a cancer treatment we might pair each subject
who received the treatment with one who did not. In this case we would want to pair subjects
who are similar in terms of stage of the disease, age, sex, etc.

• Use: Test if the average di↵erence between paired values in a population equals a
hypothesized value.
• Data: x1 , x2 , . . . , xn and y1 , y2 , . . . , yn must have the same length.
• Assumptions: The di↵erences wi = xi yi between the paired samples are independent
draws from a normal distribution N(µ, 2 ), where µ and are unknown.
• NOTE: This is just a one-sample t-test using wi .
• H0 : For a specified µ0 , µ = µ0 .
• HA :
Two-sided: µ 6= µ0
one-sided-greater: µ > µ0
one-sided-less: µ < µ0
w µ0
• Test statistic: t = p ,
s/ n
n
X
2 1
where s2 is the sample variance: s = (wi w )2
n 1
i=1
• Null distribution: f (t | H0 ) is the pdf of T ⇠ t(n 1).
(Student t-distribution with n 1 degrees of freedom)
• p-value:
Two-sided: p = P (|T | > t) = 2*(1-pt(abs(t), n-1))
one-sided-greater: p = P (T > t) = 1 - pt(t, n-1)
one-sided-less: p = P (T < t) = pt(t, n-1)
18.05 class 19, Null Hypothesis Significance Testing III, Spring 2014 7

• R code: The R function t.test will do a paired two-sample test if you set the argu-
ment paired=TRUE. You can also run a one-sample t-test on x y. There are examples
of both of these in class19.r

Example 7. The following example is taken from Rice 1

To study the e↵ect of cigarette smoking on platelet aggregation Levine (1973) drew blood
samples from 11 subjects before and after they smoked a cigarette and measured the extent
to which platelets aggregated. Here is the data:
Before 25 25 27 44 30 67 53 53 52 60 28
After 27 29 37 56 46 82 57 80 61 59 43
Di↵erence 2 4 10 12 16 15 4 27 9 -1 15
The null hypothesis is that smoking had no e↵ect on platelet aggregation, i.e. that the dif-
ference should have mean µ0 = 0. We ran a paired two-sample t-test to test this hypothesis.
Here is the R code: (It’s also in class19.r.)
before.cig = c(25,25,27,44,30,67,53,53,52,60,28)
after.cig = c(27,29,37,56,46,82,57,80,61,59,43)
mu0 = 0
result = t.test(after.cig, before.cig, alternative="two.sided", mu=mu0, paired=TRUE)
print(result)
Here is the output:
Paired t-test
data: after.cig and before.cig
t = 4.2716, df = 10, p-value = 0.001633
alternative hypothesis: true difference in means is not equal to 0
mean of the differences: 10.27273
We got the same results with the one-sample t-test:
t.test(after.cig - before.cig, mu=0)

4.4 One-way ANOVA (F -test for equal means)

• Use: Test if the population means from n groups are all the same.
• Data: (n groups, m samples from each group)
x1,1 , x1,2 , . . . , x1,m
x2,1 , x2,2 , . . . , x2,m
...
xn,1 , xn,2 , . . . , xn,m
• Assumptions: Data for each group is an independent normal sample drawn from
distributions with (possibly) di↵erent means but the same variance:
x1,j ⇠ N (µ1 , 2)

x2,j ⇠ N (µ2 , 2)

...
xn,j ⇠ N (µn , 2)

1
John Rice, Mathematical Statistics and Data Analysis, 2nd edition, p. 412. This example references P.H
Levine (1973) An acute e↵ect of cigarette smoking on platelet function. Circulation, 48, 619-623.
18.05 class 19, Null Hypothesis Significance Testing III, Spring 2014 8

The group means µi are unknown and possibly di↵erent. The variance is unknown,
but the same for all groups.
• H0 : All the means are identical µ1 = µ2 = . . . = µn .
• HA : Not all the means are the same.
MSB
• Test statistic: w = MS W
, where
x̄i = mean of group i
xi,1 + xi,2 + . . . + xi,m
= .
m
x = grand mean of all the data.
s2i = sample variance of group i
m
1 X
= (xi,j x ¯ i )2 .
m 1
j=1
MSB = between group variance
= m ⇥ sample variance of group means
n
m X
= (x̄i x)2 .
n 1
i=1
MSW = average within group variance
= sample mean of s21 , . . . , sn2
s2 + s22 + . . . + s2n
= 1
n
• Idea: If the µi are all equal, this ratio should be near 1. If they are not equal then
MSB should be larger while MSW should remain about the same, so w should be
larger. We won’t give a proof of this.
• Null distribution: f (w | H0 ) is the pdf of W ⇠ F (n 1, n(m 1)).
This is the F -distribution with (n 1) and n(m 1) degrees of freedom. Several
F -distributions are plotted below.
• p-value: p = P (W > w) = 1- pf(w, n-1, n*(m-1)))
1.0

F(3,4)
F(10,15)
0.8

F(30,15)
0.6
0.4
0.2
0.0

0 2 4 6 8 10
x
Notes: 1. ANOVA tests whether all the means are the same. It does not test whether
some subset of the means are the same.
2. There is a test where the variances are not assumed equal.
3. There is a test where the groups don’t all have the same number of samples.
4. R has a function aov() to run ANOVA tests. See:
https://personality-project.org/r/r.guide/r.anova.html#oneway
http://en.wikipedia.org/wiki/F-test
18.05 class 19, Null Hypothesis Significance Testing III, Spring 2014 9

Example 8. The table shows patients’ perceived level of pain (on a scale of 1 to 6) after
3 di↵erent medical procedures.

T1 T2 T3
2 3 2
4 4 1
1 6 3
5 1 3
3 4 5

(1) Set up and run an F-test comparing the means of these 3 treatments.
(2) Based on the test, what might you conclude about the treatments?
answer: Using the code below, the F statistic is 0.325 and the p-value is 0.729 At any
reasonable significance level we will fail to reject the null hypothesis that the average pain
level is the same for all three treatments..
Note, it is not reasonable to conclude the the null hypothesis is true. With just 5 data
points per procedure we might simply lack the power to distinguish di↵erent means.
R code to perform the test
# DATA ----
T1 = c(2,4,1,5,3)
T2 = c(3,4,6,1,4)
T3 = c(2,1,3,3,5)
procedure = c(rep(’T1’,length(T1)),rep(’T2’,length(T2)),rep(’T3’,length(T3)))
pain = c(T1,T2,T3)
data.pain = data.frame(procedure,pain)
aov.data = aov(pain⇠procedure,data=data.pain) # do the analysis of variance
print(summary(aov.data)) # show the summary table
# class19.r also show code to compute the ANOVA by hand.
The summary shows a p-value (shown as Pr(>F)) of 0.729. Therefore we do not reject the
null hypothesis that all three group population means are the same.

4.5 Chi-square test for goodness of fit

This is a test of how well a hypothesized probability distribution fits a set of data. The test
statistic is called a chi-square statistic and the null distribution associated of the chi-square
statistic is the chi-square distribution. It is denoted by 2 (df ) where the parameter df is
called the degrees of freedom.
Suppose we have an unknown probability mass function given by the following table.
Outcomes !1 !2 ... !n
Probabilities p1 p2 ... pn
In the chi-square test for goodness of fit we hypothesize a set of values for the probabilities.
Typically we will hypothesize that the probabilities follow a known distribution with certain
parameters, e.g. binomial, Poisson, multinomial. The test then tries to determine if this
set of probabilities could have reasonably generated the data we collected.
18.05 class 19, Null Hypothesis Significance Testing III, Spring 2014 10

• Use: Test whether discrete data fits a specific finite probability mass function.
• Data: An observed count Oi for each possible outcome !i .
• Assumptions: None
• H0 : The data was drawn from a specific discrete distribution.
• HA : The data was drawn from a di↵erent distribution.
• Test statistic: The data consists of observed counts Oi for each !i . From the null hy-
pothesis probability table we get a set of expected counts Ei . There are two statistics
that we can use:
X ✓ ◆
Oi
Likelihood ratio statistic G = 2 ⇤ Oi ln
Ei
X (Oi Ei )2
Pearson’s chi-square statistic X 2 = .
Ei

It is a theorem that under the null hypthesis X 2 ⇡ G and both are approximately
chi-square. Before computers, X 2 was used because it was easier to compute. Now,
it is better to use G although you will still see X 2 used quite often.
• Degrees of freedom df : For chi-square tests the number of degrees of freedom can be
a bit tricky. In this case df = n 1. It is computed as the number of cell counts
that can be freely set under HA consistent with the statistics needed to compute the
expected cell counts assuming H0 .
• Null distribution: Assuming H0 , both statistics (approximately) follow a chi-square
distribution with df degrees of freedom. That is both f (G | H0 ) and f (X 2 | H0 ) have
the same pdf as Y ⇠ 2 (df ).
• p-value:
p = P (Y > G) = 1 - pchisq(G, df)
p = P (Y > X 2 ) = 1 - pchisq(X 2 , df)
• R code: The R function chisq.test can be used to do the computations for a chi-
square test use X 2 . For G you either have to do it by hand or find a package that has
a function. (It will probably be called likelihood.test or G.test.

Notes. 1. When the likelihood ratio statistic G is used the test is also called a G-test or
a likelihood ratio test.
Example 9. First chi-square example. Suppose we have an experiment that produces
numerical data. For this experiment the possible outcomes are 0, 1, 2, 3, 4, 5 or more. We
run 51 trials and count the frequency of each outcome, getting the following data:
Outcomes 0 1 2 3 4 5
Observed counts 3 10 15 13 7 3
Suppose our null hypothesis H0 is that the data is drawn from 51 trials of a binomial(8,
0.5) distribution and our alternative hypothesis HA is that the data is drawn from some
other distribution. Do all of the following:
1. Make a table of the observed and expected counts.
2. Compute both the likelihood ratio statistic G and Pearson’s chi-square statistic X 2 .
18.05 class 19, Null Hypothesis Significance Testing III, Spring 2014 11

3. Compute the degrees of freedom of the null distribution.

4. Compute the p-values corresponding to G and X 2 .
answer: All of the R code used for this example is in class19.r.
1. Assuming H0 the data truly comes from a binomial(8, 0.5) distribution. We have 51
total observations, so the expected count for each outcome is just 51 times its probability.
We computed the binomial(8, 0.5) probabilities and expected counts in R:
Outcomes 0 1 2 3 4 5
Observed counts 3 10 15 13 7 3
H0 probabilities 0.0039 0.0313 0.1094 0.2188 0.2734 0.3633
Expected counts 0.19 1.53 5.36 10.72 13.40 17.80
2. Using the formulas above we compute that X 2 = 116.41 and G = 66.08
3. The only statistic used in computing the expected counts was the total number of
observations 51. So, the degrees of freedom is 5, i.e we can set 5 of the cell counts freely
and the last is determined by requiring that the total number is 51.
4. The p-values are pG =1 - pchisq(G, 5) and pX2 = 1 - pchisq(X 2 , 5). Both p-
values are e↵ectively 0. For almost any significance level we would reject H0 in favor of
HA .
Example 10. (Degrees of freedom.) Suppose we have the same data as in the previous
example, but our null hypothesis is that the data comes from independent trials of bino-
mial(8, ✓) distribution, where ✓ can be anything. (HA is that the data comes from some
other distribution.) In this case we must estimate ✓ from the data, e.g. using the MLE.
In total we have computed two values from the data: the total number of counts and the
estimate of ✓. So, the degrees of freedom is 6 2 = 4.
Example 11. Mendel’s genetic experiments (Adapted from Rice Mathematical Statis-
tics and Data Analysis, 2nd ed., example C, p.314)
In one of his experiments on peas Mendel crossed 556 smooth, yellow male peas with
wrinkled green female peas. Assuming the smooth and wrinkled genes occur with equal
frequency we’d expect 1/4 of the pea population to have two smooth genes (SS), 1/4 to
have two wrinkled genes (ss), and the remaining 1/2 would be heterozygous Ss. We also
expect these fractions for yellow (Y ) and green (y) genes. If the color and smoothness
genes are inherited independently and smooth and yellow are both dominant we’d expect
the following table of frequencies for phenotypes.
Yellow Green
Smooth 9/16 3/16 3/4
Wrinkled 3/16 1/16 1/4
3/4 1/4 1
Probability table for the null hypothesis

So from the 556 crosses the expected number of smooth yellow peas is 556 ⇥ 9/16 = 312.75.
Likewise for the other possibilities. Here is a table giving the observed and expected counts
from Mendel’s experiments.
18.05 class 19, Null Hypothesis Significance Testing III, Spring 2014 12

Observed count Expected count

Smooth yellow 315 312.75
Smooth green 108 104.25
Wrinkled yellow 102 104.25
Wrinkled green 31 34.75
The null hypothesis is that the observed counts are random samples distributed according
to the frequency table given above. We use the counts to compute our statistics
The likelihood ratio statistic is
X ✓ ◆
Oi
G = 2⇤ Oi ln
Ei
✓ ✓ ◆ ✓ ◆ ✓ ◆ ✓ ◆◆
315 108 102 31
= 2 ⇤ 315 ln + 108 ln + 102 ln + 31 ln
412.75 104.25 104.25 34.75
= 0.618

Pearson’s chi-square statistic is

X (Oi Ei ) 2 2.75 3.75 2.25 3.75
X2 = = + + + = 0.604
Ei 312.75 104.25 104.25 34.75
You can see that the two statistics are very close. This is usually the case. In general the
likelihood ratio statistic is more robust and should be preferred.
The degrees of freedom is 3, because there are 4 observed quantities and one relation between
them, i.e. they sum to 556. So, under the null hypothesis G follows a 2 (3) distribution.
Using R to compute the p-value we get

p = 1- pchisq(0.618, 3) = 0.892

Assuming the null hypothesis we would see data at least this extreme almost 90% of the
time. We would not reject the null hypothesis for any reasonable significance level.
The p-value using Pearson’s statistic is 0.985 –nearly identical.
The script class19.r shows these calculations and also how to use chisq.test to run a
chi-square test directly.

4.6 Chi-square test for homogeneity

This is a test to see if several independent sets of random data are all drawn from the same
distribution. (The meaning of homogeneity in this case is that all the distributions are the
same.)

• Use: Test whether m di↵erent independent sets of discrete data are drawn from the
same distribution.
• Outcomes: !1 , !2 , . . . , !n are the possible outcomes. These are the same for each set
of data.
• Data: We assume m independent sets of data giving counts for each of the possible
outcomes. That is, for data set i we have an observed count Oi,j for each possible
outcome !j .
18.05 class 19, Null Hypothesis Significance Testing III, Spring 2014 13

• Assumptions: None
• H0 : Each data set is drawn from the same distribution. (We don’t specify what this
distribution is.)
• HA : The data sets are not all drawn from the same distribution.
• Test statistic: See the example below. There are mn cells containing counts for each
outcome for each data set. Using the null distribution we can estimate expected counts
for each of the data sets. The statistics X 2 and G are computed exactly as above.
• Degrees of freedom df : (m 1)(n 1). (See the example below.)
• The null distribution 2 (df ). The p-values are computed just as in the chi-square test
for goodness of fit.
• R code: The R function chisq.test can be used to do the computations for a chi-
square test use X 2 . For G you either have to do it by hand or find a package that has
a function. (It will probably be called likelihood.test or G.test.

Example 12. Someone claims to have found a long lost work by William Shakespeare.
She asks you to test whether or not the play was actually written by Shakespeare .
You go to http://www.opensourceshakespeare.org and pick a random 12 pages from
King Lear and count the use of common words. You do the same thing for the ‘long lost
work’. You get the following table of counts.

Word a an this that

King Lear 150 30 30 90
Long lost work 90 20 10 80
Using this data, set up and evaluate a significance test of the claim that the long lost book
is by William Shakespeare. Use a significance level of 0.1.
answer: The null hypothesis H0 : For the 4 words counted the long lost book has the same
relative frequencies as the counts taken from King Lear.
The total word count of both books combined is 500, so the the maximum likelihood estimate
of the relative frequencies assuming H0 is simply the total count for each word divided by
the total word count.
Word a an this that Total count
King Lear 150 30 30 90 300
Long lost work 90 20 10 80 200
totals 240 50 40 170 500
rel. frequencies under H0 240/500 50/500 40/500 170/500 500/500
Now the expected counts for each book under H0 are the total count for that book times
the relative frequencies in the above table. The following table gives the counts: (observed,
expected) for each book.
Word a an this that Totals
King Lear (150, 144) (30, 30) (30, 24) (90, 102) (300, 300)
Long lost work (90, 96) (20, 20) (10, 16) (80, 68) (200, 200)
Totals (249, 240) (50, 50) (40, 40) (170, 170) (500, 500)
18.05 class 19, Null Hypothesis Significance Testing III, Spring 2014 14

The chi-square statistic is

X (Oi Ei ) 2
X2 =
Ei
6202 62 122 62 02 62 122
= + + + + + + +
144 30 24 102 96 20 16 68
⇡ 7.9

There are 8 cells and all the marginal counts are fixed because they were needed to determine
the expected counts. To be consistent with these statistics we could freely set the values
in 3 cells in the table, e.g. the 3 blue cells, then the rest of the cells are determined
in order to make the marginal totals correct. Thus df = 3. (Or we could recall that
df = (m 1)(n 1) = (3)(1) = 3, where m is the number of columns and n is the number
of rows.)
Using R we find p = 1-pchisq(7.9,3) = 0.048. Since this is less than our significance
level of 0.1 we reject the null hypothesis that the relative frequencies of the words are the
same in both books.
If we make the further assumption that all of Shakespeare’s plays have similar word fre-
quencies (which is something we could check) we conclude that the book is probably not
by Shakespeare.

4.7 Other tests

There are far too many other tests to even make a dent. We will see some of them in
class and on psets. Again, we urge you to master the paradigm of NHST and recognize the
importance of choosing a test statistic with a known null distribution.
Comparison of frequentist and Bayesian inference.
Class 20, 18.05
Jeremy Orlo↵ and Jonathan Bloom

1 Learning Goals

1. Be able to explain the di↵erence between the p-value and a posterior probability to a
doctor.

2 Introduction

We have now learned about two schools of statistical inference: Bayesian and frequentist.
Both approaches allow one to evaluate evidence about competing hypotheses. In these notes
we will review and compare the two approaches, starting from Bayes’ formula.

3 Bayes’ formula as touchstone

In our first unit (probability) we learned Bayes’ formula, a perfectly abstract statement
about conditional probabilities of events:

P (B | A)P (A)
P (A | B) = .
P (B)

We began our second unit (Bayesian inference) by reinterpreting the events in Bayes’ for-
mula:
P (D | H)P (H)
P (H | D) = .
P (D)
Now H is a hypothesis and D is data which may give evidence for or against H. Each term
in Bayes’ formula has a name and a role.

• The prior P (H) is the probability that H is true before the data is considered.

• The posterior P (H | D) is the probability that H is true after the data is considered.

• The likelihood P (D | H) is the evidence about H provided by the data D.

• P (D) is the total probability of the data taking into account all possible hypotheses.

If the prior and likelihood are known for all hypotheses, then Bayes’ formula computes the
posterior exactly. Such was the case when we rolled a die randomly selected from a cup
whose contents you knew. We call this the deductive logic of probability theory, and it gives
a direct way to compare hypotheses, draw conclusions, and make decisions.
In most experiments, the prior probabilities on hypotheses are not known. In this case, our
recourse is the art of statistical inference: we either make up a prior (Bayesian) or do our
best using only the likelihood (frequentist).

1
18.05 class 20, Comparison of frequentist and Bayesian inference., Spring 2014 2

The Bayesian school models uncertainty by a probability distribution over hypotheses.

One’s ability to make inferences depends on one’s degree of confidence in the chosen prior,
and the robustness of the findings to alternate prior distributions may be relevant and
important.
The frequentist school only uses conditional distributions of data given specific hypotheses.
The presumption is that some hypothesis (parameter specifying the conditional distribution
of the data) is true and that the observed data is sampled from that distribution. In
particular, the frequentist approach does not depend on a subjective prior that may vary
from one investigator to another.
These two schools may be further contrasted as follows:
Bayesian inference

• uses probabilities for both hypotheses and data.

• depends on the prior and likelihood of observed data.
• requires one to know or construct a ‘subjective prior’.
• dominated statistical practice before the 20th century.
• may be computationally intensive due to integration over many parameters.

Frequentist inference (NHST)

• never uses or gives the probability of a hypothesis (no prior or posterior).

• depends on the likelihood P (D | H)) for both observed and unobserved data.
• does not require a prior.
• dominated statistical practice during the 20th century.
• tends to be less computationally intensive.

Frequentist measures like p-values and confidence intervals continue to dominate research,
especially in the life sciences. However, in the current era of powerful computers and
big data, Bayesian methods have undergone an enormous renaissance in fields like ma-
chine learning and genetics. There are now a number of large, ongoing clinical trials using
Bayesian protocols, something that would have been hard to imagine a generation ago.
While professional divisions remain, the consensus forming among top statisticians is that
the most e↵ective approaches to complex problems often draw on the best insights from
both schools working in concert.

4 Critiques and defenses

4.1 Critique of Bayesian inference

1. The main critique of Bayesian inference is that a subjective prior is, well, subjective.
There is no single method for choosing a prior, so di↵erent people will produce di↵erent
priors and may therefore arrive at di↵erent posteriors and conclusions.
18.05 class 20, Comparison of frequentist and Bayesian inference., Spring 2014 3

2. Furthermore, there are philosophical objections to assigning probabilities to hypotheses,

as hypotheses do not constitute outcomes of repeatable experiments in which one can mea-
sure long-term frequency. Rather, a hypothesis is either true or false, regardless of whether
one knows which is the case. A coin is either fair or unfair; treatment 1 is either better or
worse than treatment 2; the sun will or will not come up tomorrow.

4.2 Defense of Bayesian inference

1. The probability of hypotheses is exactly what we need to make decisions. When the
doctor tells me a screening test came back positive I want to know what is the probability
this means I’m sick. That is, I want to know the probability of the hypothesis “I’m sick”.
2. Using Bayes’ theorem is logically rigorous. Once we have a prior all our calculations
have the certainty of deductive logic.
3. By trying di↵erent priors we can see how sensitive our results are to the choice of prior.
4. It is easy to communicate a result framed in terms of probabilities of hypotheses.
5. Even though the prior may be subjective, one can specify the assumptions used to arrive
at it, which allows other people to challenge it or try other priors.
6. The evidence derived from the data is independent of notions about ‘data more extreme’
that depend on the exact experimental setup (see the “Stopping rules” section below).
7. Data can be used as it comes in. There is no requirement that every contingency be
planned for ahead of time.

4.3 Critique of frequentist inference

1. It is ad-hoc and does not carry the force of deductive logic. Notions like ‘data more
extreme’ are not well defined. The p-value depends on the exact experimental setup (see
the “Stopping rules” section below).
2. Experiments must be fully specified ahead of time. This can lead to paradoxical seeming
results. See the ‘voltmeter story’ in:
http://en.wikipedia.org/wiki/Likelihood_principle
3. The p-value and significance level are notoriously prone to misinterpretation. Careful
statisticians know that a significance level of 0.05 means the probability of a type I error
is 5%. That is, if the null hypothesis is true then 5% of the time it will be rejected due to
randomness. Many (most) other people erroneously think a p-value of 0.05 means that the
probability of the null hypothesis is 5%.
Strictly speaking you could argue that this is not a critique of frequentist inference but,
rather, a critique of popular ignorance. Still, the subtlety of the ideas certainly contributes
to the problem. (see “Mind your p’s” below).

4.4 Defense of frequentist inference

1. It is objective: all statisticians will agree on the p-value. Any individual can then decide
if the p-value warrants rejecting the null hypothesis.
18.05 class 20, Comparison of frequentist and Bayesian inference., Spring 2014 4

2. Hypothesis testing using frequentist significance testing is applied in the statistical anal-
ysis of scientific investigations, evaluating the strength of evidence against a null hypothesis
with data. The interpretation of the results is left to the user of the tests. Di↵erent users
may apply di↵erent significance levels for determining statistical significance. Frequentist
statistics does not pretend to provide a way to choose the significance level; rather it ex-
plicitly describes the trade-o↵ between type I and type II errors.
3. Frequentist experimental design demands a careful description of the experiment and
methods of analysis before starting. This helps control for experimenter bias.
4. The frequentist approach has been used for over 100 years and we have seen tremendous
scientific progress. Although the frequentist herself would not put a probability on the belief
that frequentist methods are valuable shoudn’t this history give the Bayesian a strong prior
belief in the utility of frequentist methods?

5 Mind your p’s.

We run a two-sample t-test for equal means, with ↵ = 0.05, and obtain a p-value of 0.04.
What are the odds that the two samples are drawn from distributions with the same mean?
(a) 19/1 (b) 1/19 (c) 1/20 (d) 1/24 (e) unknown
answer: (e) unknown. Frequentist methods only give probabilities of statistics conditioned
on hypotheses. They do not give probabilities of hypotheses.

6 Stopping rules

When running a series of trials we need a rule on when to stop. Two common rules are:
1. Run exactly n trials and stop.
2. Run trials until you see a certain result and then stop.
In this example we’ll consider two coin tossing experiments.
Experiment 1: Toss the coin exactly 6 times and report the number of heads.
Experiment 2: Toss the coin until the first tails and report the number of heads.
Jon is worried that his coin is biased towards heads, so before using it in class he tests it
for fairness. He runs an experiment and reports to Jerry that his sequence of tosses was
HHHHHT . But Jerry is only half-listening, and he forgets which experiment Jon ran to
produce the data.
Frequentist approach.
Since he’s forgotten which experiment Jon ran, Jerry the frequentist decides to compute
the p-values for both experiments given Jon’s data.
Let ✓ be the probability of heads. We have the null and one-sided alternative hypotheses
H0 : ✓ = 0.5, HA : ✓ > 0.5.

Experiment 1: The null distribution is binomial(6, 0.5) so, the one sided p-value is the
probability of 5 or 6 heads in 6 tosses. Using R we get
p = 1 - pbinom(4, 6, 0.5) = 0.1094.
18.05 class 20, Comparison of frequentist and Bayesian inference., Spring 2014 5

Experiment 2: The null distribution is geometric(0.5) so, the one sided p-value is the prob-
ability of 5 or more heads before the first tails. Using R we get

p = 1 - pgeom(4, 0.5) = 0.0313.

Using the typical significance level of 0.05, the same data leads to opposite conclusions! We
would reject H0 in experiment 2, but not in experiment 1.
The frequentist is fine with this. The set of possible outcomes is di↵erent for the di↵erent
experiments so the notion of extreme data, and therefore p-value, is di↵erent. For example,
in experiment 1 we would consider T HHHHH to be as extreme as HHHHHT . In ex-
periment 2 we would never see T HHHHH since the experiment would end after the first
tails.
Bayesian approach.
Jerry the Bayesian knows it doesn’t matter which of the two experiments Jon ran, since
the binomial and geometric likelihood functions (columns) for the data HHHHHT are
proportional. In either case, he must make up a prior, and he chooses Beta(3,3). This is a
relatively flat prior concentrated over the interval 0.25  ✓  0.75.
See http://mathlets.org/mathlets/beta-distribution/
Since the beta and binomial (or geometric) distributions form a conjugate pair the Bayesian
update is simple. Data of 5 heads and 1 tails gives a posterior distribution Beta(8,4). Here
is a graph of the prior and the posterior. The blue lines at the bottom are 50% and 90%
probability intervals for the posterior.
3.0

Prior Beta(3,3)
Posterior Beta(8,4)
2.0
1.0
0.0

0 .25 .50 .75 1.0

θ
Prior and posterior distributions with 0.5 and 0.9 probability intervals
Here are the relevant computations in R:

Posterior 50% probability interval: qbeta(c(0.25, 0.75), 8, 4) = [0.58 0.76]

Posterior 90% probability interval: qbeta(c(0.05, 0.95), 8, 4) = [0.44 0.86]
P (✓ > 0.50 | data) = 1- pbeta(0.5, posterior.a, posterior.b) = 0.89
Starting from the prior Beta(3,3), the posterior probability that the coin is biased toward
heads is 0.89.
18.05 class 20, Comparison of frequentist and Bayesian inference., Spring 2014 6

7 Making decisions

Quite often the goal of statistical inference is to help with making a decision, e.g. whether
or not to undergo surgery, how much to invest in a stock, whether or not to go to graduate
school, etc.
In statistical decision theory, consequences of taking actions are measured by a utility
function. The utility function assigns a weight to each possible outcome; in the language of
probability, it is simply a random variable.
For example, in my investments I could assign a utility of d to the outcome of a gain of
d dollars per share of a stock (if d < 0 my utility is negative). On the other hand, if my
tolerance for risk is low, I will assign a more negative utility to losses than to gains (say,
d2 if d < 0 and d if d 0).
A decision rule combines the expected utility with evidence for each hypothesis given by
the data (e.g., p-values or posterior distributions) into a formal statistical framework for
making decisions.
In this setting, the frequentist will consider the expected utility given a hypothesis

E(U | H)

where U is the random variable representing utility. There are frequentist methods for
combining the expected utility with p-values of hypotheses to guide decisions.
The Bayesian can combine E(U | H) with the posterior (or prior if it’s before data is col-
lected) to create a Bayesian decision rule.
In either framework, two people considering the same investment may have di↵erent utility
functions and make di↵erent decisions. For example, a riskier stock (with higher potential
upside and downside) will be more appealing with respect to the first utility function above
than with respect to the second (loss-averse) one.
A significant theoretical result is that for any decision rule there is a Bayesian decision rule
which is, in a precise sense, at least as good a rule.
Confidence intervals based on normal data
Class 22, 18.05
Jeremy Orlo↵ and Jonathan Bloom

1 Learning Goals

1. Be able to determine whether an expression defines a valid interval statistic.

2. Be able to compute z and t confidence intervals for the mean given normal data.

3. Be able to compute the 2 confidence interval for the variance given normal data.

4. Be able to define the confidence level of a confidence interval.

5. Be able to explain the relationship between the z confidence interval (and confidence
level) and the z non-rejection region (and significance level) in NHST.

2 Introduction

We continue to survey the tools of frequentist statistics. Suppose we have a model (proba-
bility distribution) for observed data with an unknown parameter. We have seen how NHST
uses data to test the hypothesis that the unknown parameter has a particular value.
We have also seen how point estimates like the MLE use data to provide an estimate of the
unknown parameter. On its own, a point estimate like x̄ = 2.2 carries no information about
its accuracy; it’s just a single number, regardless of whether its based on ten data points or
one million data points.
For this reason, statisticians augment point estimates with confidence intervals. For exam-
ple, to estimate an unknown mean µ we might be able to say that our best estimate of
the mean is x = 2.2 with a 95% confidence interval [1.2, 3.2]. Another way to describe the
interval is: x ± 1.
We will leave to later the explanation of exactly what the 95% confidence level means.
For now, we’ll note that taken together the width of the interval and the confidence level
provide a measure on the strength of the evidence supporting the hypothesis that the µ is
close to our estimate x. You should think of the confidence level of an interval as analogous
to the significance level of a NHST. As explained below, it is no accident that we often see
significance level ↵ = 0.05 and confidence level 0.95 = 1 ↵.
We will first explore confidence intervals in situations where you will easily be able to com-
pute by hand: z and t confidence intervals for the mean and 2 confidence intervals for the
variance. We will use R to handle all the computations in more complicated cases. Indeed,
the challenge with confidence intervals is not their computation, but rather interpreting
them correctly and knowing how to use them in practice.

1
18.05 class 22, Confidence intervals based on normal data, Spring 2014 2

3 Interval statistics

Recall that our working definition of a statistic is anything that can be computed from
data. In particular, the formula for a statistic cannot include unknown quantities.
Example 1. Suppose x1 , . . . , xn is drawn from N(µ, 2) where µ and are unknown.
(i) x and x 5 are statistics.
(ii) x µ is not a statistic since µ is unknown.
(iii) If µ0 a known value, then x µ0 is a statistic. This case arises when we consider the
null hypothesis µ = µ0 . For example, if the null hypothesis is µ = 5, then the statistic
x µ0 is just x 5 from (i).
We can play the same game with intervals to define interval statistics
Example 2. Suppose x1 , . . . , xn is drawn from N(µ, 2) where µ is unknown.
(i) The interval [x 2.2, x + 2.2] = x ± 2.2 is an interval statistic.

2 2
(ii) If is known, then x p , x + p is an interval statistic.
n n

2 2
(iii) On the other hand, if is unknown then x p , x + p is not an interval statistic.
n n

2s 2s
(iv) If s2 is the sample variance, then x p , x + p is an interval statistic because s2
n n
is computed from the data.
We will return to (ii) and (iv), as these are respectively the z and t confidence intervals for
estimating µ.
Technically an interval statistic is nothing more than a pair of point statistics giving the
lower and upper bounds of the interval. Our reason for emphasizing that the interval is a
statistic is to highlight the following:

1. The interval is random – new random data will produce a new interval.

2. As frequentists we are perfectly happy using it because it doesn’t depend on the value
of an unknown parameter or hypothesis.

3. As usual with frequentist statistics we have to assume a certain hypothesis, e.g. value
of µ, before we can compute probabilities about the interval.
Example 3. Suppose we draw n samples x1 , . . . , xn from a N(µ, 1) distribution,
where µ is unknown. Suppose we wish to know the probability that 0 is in the
interval [x 2, x + 2]. Without knowing the value of µ this is impossible. However,
we can compute this probability for any given (hypothesized) value of µ.

4. A warning which will be repeated: Be careful in your thinking about these probabili-
ties. Confidence intervals are a frequentist notion. Since frequentists do not compute
probabilities of hypotheses, the confidence level is never a probability that the un-
known parameter is in the confidence interval.
18.05 class 22, Confidence intervals based on normal data, Spring 2014 3

4 z confidence intervals for the mean

Throughout this section we will assume that we have normally distributed data:
2
x1 , x2 , . . . , xn ⇠ N(µ, ).

As we often do, we will introduce the main ideas through examples, building on what
we know about rejection and non-rejection regions in NHST until we have constructed a
confidence interval.

4.1 Definition of z confidence intervals for the mean

We start with z confidence intervals for the mean. First we’ll give the formula. Then we’ll
walk through the derivation in one entirely numerical example. This will give us the basic
idea. Then we’ll repeat this example, replacing the explicit numbers by symbols. Finally
we’ll work through a computational example.
Definition: Suppose the data x1 , . . . , xn ⇠ N(µ, 2 ), with unknown mean µ and known
variance 2 . The (1 ↵) confidence interval for µ is

z↵/2 · z↵/2 ·
x p , x + p , (1)
n n

where z↵/2 is the right critical value P (Z > z↵/2 ) = ↵/2.

For example, if ↵ = 0.05 then z↵/2 = 1.96 so the 0.95 (or 95%) confidence interval is

1.96 1.96
x p , x+ p .
n n

We’ve created an applet that generates normal data and displays the corresponding z con-
fidence interval for the mean. It also shows the t-confidence interval, as discussed in the
next section. Play around to get a sense for random intervals!
http://mathlets.org/mathlets/confidence-intervals/
Example 4. Suppose we collect 100 data points from a N(µ, 32 ) distribution and the
sample mean is x = 12. Give the 95 % confidence interval for µ.
answer: Using the formula this is trivial to compute: the 95% confidence interval for µ is
 
1.96 1.96 1.96 · 3 1.96 · 3
x p , x+ p = 12 , 12 +
n n 10 10

4.2 Explaining the definition part 1: rejection regions

Our next goal is to explain the definition (1) starting from our knowledge of rejection/non-
rejection regions. The phrase ‘non-rejection region’ is not pretty, but we will discipline
ourselves to use it instead of the inacurate phrase ‘acceptance region’.
18.05 class 22, Confidence intervals based on normal data, Spring 2014 4

Example 5. Suppose that n = 12 data points are drawn from N(µ, 52 ) where µ is unknown.
Set up a two-sided significance test of H0 : µ = 2.71 using the statistic x at significance
level ↵ = 0.05. Describe the rejection and non-rejection regions.
answer: Under the null hypothesis µ = 2.71 we have xi ⇠ N(2.71, 52 ) and thus

x ⇠ N(2.71, 52 /12)

where 52 /12 is the variance ( x )2 of x. We know that significance ↵ = 0.05 corresponds to

a rejection region outside 1.96 standard deviations from the hypothesized mean. That is,
the non-rejection and rejection regions are separated by the critical values x ± 1.96 x .
Non-rejection region:

1.96 · 5 1.96 · 5
2.71 p , 2.71 + p = [ 0.12, 5.54].
12 12

✓  ◆
1.96 · 5 1.96 · 5
Rejection region: 1, 2.71 p [ 2.71 + p , 1 = ( 1, 0.12] [
12 12
[5.54, 1)

The following figure shows the rejection and non-rejection regions for x. The regions repre-
sent ranges of x so they are represented by the colored bars on the x axis. The area of the
shaded region is the significance level.

N (2.71, 52 /12)

x
.12 2.71 5.54

The rejection (orange) and non-rejection (blue) regions for x.

Let’s redo the previous example using symbols for the known quantities as well as for µ.
Example 6. Suppose that n data points are drawn from N(µ, 2 ) where µ is unknown
and is known. Set up a two-sided significance test of H0 : µ = µ0 using the statistic x at
significance level ↵ = 0.05. Describe the rejection and non-rejection regions.
answer: Under the null hypothesis µ = µ0 we have xi ⇠ N(µ0 , 2) and thus
2
x ⇠ N(µ0 , /n),

where 2 /n is the variance ( 2

x) of x and µ0 , and n are all known values.
Let z↵/2 be the critical value: P (Z > z↵ /2) = ↵/2. Then the non-rejection and rejection
regions are separated by the values of x that are z↵/2 · x from the hypothesized mean.
Since x =p we have
n
Non-rejection region: 
z↵/2 · z↵/2 ·
µ0 p , µ0 + p (2)
n n
18.05 class 22, Confidence intervals based on normal data, Spring 2014 5

Rejection region:
✓  ◆
z↵/2 · z↵/2 ·
1, µ0 p [ µ0 + p , 1 .
n n

We get the same figure as above, with the explicit numbers replaced by symbolic values.
2
N (µ0 , /n)

x
µ0
z↵/2 ·
p µ0 µ0 +
z↵/2 ·
p
n n

The rejection (orange) and non-rejection (blue) regions for x.

4.3 Manipulating intervals: pivoting

We need to get comfortable manipulating intervals. In general, we will make use of the type
of ‘obvious’ statements that are very hard to get across. One key is to be clear about the
various items.
Here is a quick summary of intervals around x and µ0 and what is called pivoting. Pivoting
is the idea the x is in µ0 ± a says exactly the same thing as µ0 is in x ± a.
Example 7. Suppose we have the sample mean x and hypothesized mean µ0 = 2.71.
Suppose also that the null distribution is N(µ0 , 32 ). Then with a significance level of 0.05
we have:

• µ0 + 1.96 = 2.71 + 1.96(3) = 2.71 + 5.88 is the 0.025 critical value

• µ0 1.96 = 2.71 1.96(3) = 2.71 5.88 is the 0.975 critical value

• The non-rejection region is centered on µ0 = 2.71. That is, we don’t reject H0 if x is

in the interval

[µ0 1.96 , µ0 + 1.96 ] = [2.71 5.88, 2.71 + 5.88]

• The confidence interval is centered on x. The 0.95 confidence interval uses the same
width as the non-rejection region. It is the interval

[x 1.96 , x + 1.96 ] = [x 5.88, x + 5.88]

There is a symmetry here: x is in the interval [2.71 1.96 , 2.71 + 1.96 ] is equivalent to
2.71 is in the interval [x 1.96 , x + 1.96 ].
This symmetry is called pivoting. Here are some simple numerical examples of pivoting.

Example 8. (i) 1.5 is in the interval [0 2.3, 0+2.3], so 0 is in the interval [1.5 2.3, 1.5+2.3]
(ii) Likewise 1.5 is not in the interval [0 1, 0 + 1], so 0 is not in the interval [1.5 1, 1.5 + 1].
18.05 class 22, Confidence intervals based on normal data, Spring 2014 6

The symmetry might be most clear if we talk In terms of distances: the statement
’1.5 is in the interval [0 2.3, 0 + 2.3]’
says that the distance from 1.5 to 0 is at most 2.3. Likewise, the statement
’0 is in the interval [1.5 2.3, 1.5 + 2.3]’
says exactly the same thing, i.e. the distance from 0 to 1.5 is less than 2.3.
Here is a visualization of pivoting from intervals around µ0 to intervals around x.
2 1 0 1 2 3 4
µ0 x
µ0 ± 1 this interval does not contain x
x±1 this interval does not contain µ0
µ0 ± 2.3 this interval contains x
x ± 2.3 this interval contains µ0

The distance between x and µ is 1.5. Now, since 1 < 1.5, µ ± 1, does not stretch far enough
to contain x. Likewise the inteval x ± 1 does not stretch far enough to contain µ0 . In
contrast, since 2.3 > 1.5, we have x is in the interval µ0 ± 2.3 and µ0 is in the interval
x ± 2.3.

4.4 Explaining the definition part 2: translating the non-rejection region

to a confidence interval

The previous examples are nice if we happen to have a null hypothesis. But what if we
don’t have a null hypothesis? In this case, we have the point estimate x but we still want
to use the data to estimate an interval range for the unknown mean. That is, we want an
interval statistic. This is given by a confidence interval.
Here we will show how to translate the notion of an non-rejection region to that of a
confidence interval. The confidence level will control the rate of certain types of errors in
much the same way the significance level does for NHST.
The trick is to give a little thought to the non-rejection region. Using the numbers from
Example 5 we would say that at significance level 0.05 we don’t reject if
1.96 · 5 p
x is in the interval 2.71 ± p = 2.71 ± 1.96 · 5/ 12. (3)
12
The rolesp of x and 2.71 are symmetric. The equation just above can be read as x is within
1.96 · 5/ 12 of 2.71. This is exactly equivalent to saying that we don’t reject if
1.96 · 5
2.71 is in the interval x ± p , (4)
12
p
i.e. 2.71 is within 1.96 · 5/ 12 of x.
Now we have magically arrived at our goal of an interval statistic estimating the unknown
mean. We can rewrite equation (4) as: at significance level 0.05 we don’t reject if

1.96 · 5 1.96 · 5
the interval x p , x+ p contains 2.71. (5)
12 12
Thus, di↵erent values of x generate di↵erent intervals.
18.05 class 22, Confidence intervals based on normal data, Spring 2014 7

The interval in equation (5) is exactly the confidence interval defined in Equation (1). We
make a few observations about this confidence interval.

1. It only depends on x, so it is a statistic.

2. The significance level ↵ = 0.05 means that, assuming the null hypothesis that µ = 2.71
is true, random data will lead us to reject the null hypothesis 5% of the time (a Type
I error).

3. Again assuming that µ = 2.71, then 5% of the time the confidence interval will not
contain 2.71, and conversely, 95% of the time it will contain 2.71

The following figure illustrates how we don’t reject H0 if the confidence interval around it
contains µ0 and we reject H0 if the confidence interval doesn’t contain µ0 . There is a lot in
the figure so we will list carefully what you are seeing:
1. We started with the figure from Example 5 which shows the null distribution for µ0 = 2.71
and the rejection and non-rejection regions.
2. We added two possible values of the statistic x, i.e. x1 and x2 , and their confidence
intervals. Note that the width of each interval is exactly the same as the width of the
1.96 · 5
non-rejection region since both use ± p .
12
The first value, x1 , is in the non-rejection region and its interval includes the null hypothesis
µ0 = 2.71. This illustrates that not rejecting H0 corresponds to the confidence interval
containing µ0 .
The second value, x2 , is in the rejection region and its interval does not contain µ0 . This
illustrates that rejecting H0 corresponds to the confidence interval not containing µ0 .

N (2.71, 52 /12)

x
x2 .12 2.71 x1 5.54

The non-rejection region (blue) and two confidence intervals (green).

We can still wring one more essential observation out of this example. Our choice of
null hypothesis µ = 2.71 was completely arbitrary. If we replace µ = 2.71 by any other
hypothesis µ = µ0 then the interval (5) will come out the same.
We call the interval (5) a 95% confidence interval because, assuming µ = µ0 , on average it
will contain µ0 in 95% of random trials.

4.5 Explaining the definition part 3: translating a general non-rejection

region to a confidence interval

Note that the specific values of and n in the preceding example were of no particular
consequence, so they can be replaced by their symbols. In this way we can take Example
(6) quickly through the same steps as Example (5).
18.05 class 22, Confidence intervals based on normal data, Spring 2014 8

In words, Equation (2) and the corresponding figure say that we don’t reject if
z↵/2
x is in the interval µ0 ± p .
n
This is exactly equivalent to saying that we don’t reject if
z↵/2
µ0 is in the interval x ± p . (6)
n
We can rewrite equation (6) as: at significance level ↵ we don’t reject if

z↵/2 · z↵/2 ·
the interval x p , x+ p contains µ0 . (7)
n n
We call the interval (7) a (1 ↵) confidence interval because, assuming µ = µ0 , on average
it will contain µ0 in the fraction (1 ↵) of random trials.

The following figure illustrates the point that µ0 is in the (1 ↵) confidence interval around
x is equivalent to x is in the non-rejection region (at significance level ↵) for H0 : µ0 = µ.
2
N (µ0 , /n)

x
x2 µ0 z↵/2 · p µ0 x1 µ0 + z↵/2 · p
n n

x1 is in non-rejection region for µ0 , the confidence interval around x1 contains µ0 .

4.6 Computational example

Example 9. Suppose the data 2.5, 5.5, 8.5, 11.5 was drawn from a N(µ, 102 ) distribution
with unknown mean µ.
(a) Compute the point estimate x for µ and the corresponding 50%, 80% and 95% confidence
intervals.
(b) Consider the null hypothesis µ = 1. Would you reject H0 at ↵ = 0.05? ↵ = 0.20?
↵ = 0.50? Do these two ways: first by checking if the hypothesized value of µ is in the
relevant confidence interval and second by constructing a rejection region.
answer: (a) We compute that x = 7.0. The critical points are
z0.025 = qnorm(0.975) = 1.96, z0.1 = qnorm(0.9) = 1.28, z0.25 = qnorm(0.75) = 0.67.
Since n = 4 we have x ⇠ N(µ, 102 /4), i.e. x = 5. So we have:
95% conf. interval = [x z0.025 x , x + z0.025 x ] = [7 1.96 · 5, 7 + 1.96 · 5] = [ 2.8, 16.8]
80% conf. interval = [x z0.1 x , x + z0.1 x ] = [7 1.28 · 5, 7 + 1.28 · 5] = [ 0.6, 13.4]
50% conf. interval = [x z0.75 x , x + z0.75 x ] = [7 0.67 · 5, 7 + 0.67 · 5] = [ 3.65, 10.35]
Each of these intervals is a range estimate of µ. Notice that the higher the confidence level,
the wider the interval needs to be.
(b) Since µ = 1 is in the 95% and 80% confidence intervals, we would not reject the null
hypothesis at the ↵ = 0.05 or ↵ = 0.20 levels. Since µ = 1 is not in the 50% confidence
interval, we would reject H0 at the ↵ = 0.5 level.
18.05 class 22, Confidence intervals based on normal data, Spring 2014 9

We construct the rejection regions using the same critical values as in part (a). The di↵er-
ence is that rejection regions are intervals centered on the hypothesized value for µ: µ0 = 1
and confidence intervals are centered on x. Here are the rejection regions.
↵ = 0.05 ) ( 1, µ0 z0.025 x ] [ [µ0 + z0.025 x , 1) = ( 1, 8.8] [ [10.8, 1)
↵ = 0.20 ) ( 1, µ0 z0.1 x ] [ [µ0 + z0.1 x , 1) = ( 1, 5.4] [ [7.4, 1)
↵ = 0.25 ) ( 1, µ0 z0.25 x ] [ [µ0 + z0.25 x , 1) = ( 1, 2.35] [ [4.35, 1)
To to do the NHST we must check whether or not x = 7 is in the rejection region.
↵ = 0.05: 7 < 10.8 is not in the rejection region.
We do not reject the hypothesis that µ = 1 at a significance level of 0.05.
↵ = 0.2: 7 < 7.4 is not in the rejection region.
We do not reject the hypothesis that µ = 1 at a significance level of 0.2.
↵ = 0.5: 7 > 4.35 is in the rejection region.
We reject the hypothesis that µ = 1 at a significance level 0.5.
We get the same answers using either method.

5 t-confidence intervals for the mean

This will be nearly identical to normal confidence intervals. In this setting is not known,
so we have to make the following replacements.
s
1. Use sx = p instead of x = p . Here s is the sample variance we used before in
n n
t-tests

2. Use t-critical values instead of z-critical values.

5.1 Definition of t-confidence intervals for the mean

Definition: Suppose that x1 , . . . , xn ⇠ N(µ, 2 ), where the values of the mean µ and the
standard deviation are both unknown. . The (1 ↵) confidence interval for µ is

t↵/2 · s t↵/2 · s
x p , x + p , (8)
n n

here t↵/2 is the right critical value P (T > t↵/2 ) = ↵/2 for T ⇠ t(n 1) and s2 is the sample
variance of the data.

5.2 Construction of t confidence intervals

Suppose that n data points are drawn from N(µ, 2 ) where µ and are unknown. We’ll
derive the t confidence interval following the same pattern as for the z confidence interval.
Under the null hypothesis µ = µ0 , we have xi ⇠ N(µ0 , 2 ). So the studentized mean follows
a Student t distribution with n 1 degrees of freedom:
x µ0
t= p ⇠ t(n 1).
s/ n
18.05 class 22, Confidence intervals based on normal data, Spring 2014 10

Let t↵/2 be the critical value: P (T > t↵/2 ) = ↵/2, where T ⇠ t(n 1). We know from
running one-sample t-tests that the non-rejection region is given by

|t|  t↵/2

Using the definition of the t-statistic to write the rejection region in terms of x we get: at
significance level ↵ we don’t reject if

|x µ0 | s
p  t↵/2 , |x µ0 |  t↵/2 · p .
s/ n n

Geometrically, the right hand side says that we don’t reject if

s
µ0 is within t↵/2 · p of x.
n

This is exactly equivalent to saying that we don’t reject if


t↵/2 · s t↵/2 · s
the interval x p , x + p contains µ0 .
n n

This interval is the confidence interval defined in (8).

Example 10. Suppose the data 2.5, 5.5, 8.5, 11.5 was drawn from a N(µ, 2) distribution
with µ and both unknown.
Give interval estimates for µ by finding the 95%, 80% and 50% confidence intervals.
answer: By direct computation we have x = 7 and s2 = 15. The critical points are
t0.025 = qt(0.975) = 3.18, t0.1 = qt(0.9) = 1.64, and t0.25 = qt(0.75) = 0.76.

s s
95% conf. interval = x t0.025 · p , x + t0.025 · p = [0.84, 13.16]
 n n
s s
80% conf. interval = x t0.1 · p , x + t0.1 · p = [3.82, 10.18]
 n n
s s
50% conf. interval = x t0.25 · p , x + t0.25 · p = [5.53, 8.47]
n n
All of these confidence intervals give interval estimates for the value of µ. Again, notice
that the higher the confidence level, the wider the corresponding interval.

6 Chi-square confidence intervals for the variance

We now turn to an interval estimate for the unknown variance.

Definition: Suppose the data x1 , . . . , xn is drawn from N(µ, 2 ) with mean µ and standard
deviation both unknown. The (1 ↵) confidence interval for the variance 2 is

(n 1)s2 (n 1)s2
, . (9)
c↵/2 c1 ↵/2

Here c↵/2 is the right critical value P (X 2 > c↵/2 ) = ↵/2 for X 2 ⇠ 2 (n 1) and s2 is the
sample variance of the data.
18.05 class 22, Confidence intervals based on normal data, Spring 2014 11

The derivation of this interval is nearly identical to that of the previous derivations, now
starting from the chi-square test for variance. The basic fact we need is that, for data drawn
from N(µ, 2 ) with known , the statistic

(n 1)s2
2

follows a chi-square distribution with n 1 degrees of freedom. So given the null hypothesis
H0 : = 0 , the test statistic is (n 1)s2 / 02 and the non-rejection region at significance
level ↵ is
(n 1)s2
c1 ↵/2 < 2 < c↵/2 .
0
A little algebra converts this to

(n 1)s2 2 (n 1)s2
> 0 > .
c1 ↵/2 c↵/2

This says we don’t reject if


(n 1)s2 (n 1)s2 2
the interval , contains 0
c↵/2 c1 ↵/2

This is our (1 ↵) confidence interval.

We will continue our exploration of confidence intervals next class. In the meantime, truly
the best way is to internalize the meaning of the confidence level is to experiment with the
confidence interval applet:
http://mathlets.org/mathlets/confidence-intervals/
Confidence Intervals: Three Views
Class 23, 18.05
Jeremy Orlo↵ and Jonathan Bloom

1 Learning Goals

1. Be able to produce z, t and 2 confidence intervals based on the corresponding stan-

dardized statistics.

2. Be able to use a hypothesis test to construct a confidence interval for an unknown

parameter.

3. Refuse to answer questions that ask, in essence, ‘given a confidence interval what is
the probability or odds that it contains the true value of the unknown parameter?’

2 Introduction

Our approach to confidence intervals in the previous reading was a combination of stan-
dardized statistics and hypothesis testing. Today we will consider each of these perspectives
separately, as well as introduce a third formal viewpoint. Each provides its own insight.
1. Standardized statistic. Most confidence intervals are based on standardized statistics
with known distributions like z, t or 2 . This provides a straightforward way to construct
and interpret confidence intervals as a point estimate plus or minus some error.
2. Hypothesis testing. Confidence intervals may also be constructed from hypothesis
tests. In cases where we don’t have a standardized statistic this method will still work. It
agrees with the standardized statistic approach in cases where they both apply.
This view connects the notions of significance level ↵ for hypothesis testing and confidence
level 1 ↵ for confidence intervals; we will see that in both cases ↵ is the probability of
making a ‘type 1’ error. This gives some insight into the use of the word confidence. This
view also helps to emphasize the frequentist nature of confidence intervals.
3. Formal. The formal definition of confidence intervals is perfectly precise and general.
In a mathematical sense it gives insight into the inner workings of confidence intervals.
However, because it is so general it sometimes leads to confidence intervals without useful
properties. We will not dwell on this approach. We o↵er it mainly for those who are
interested.

3 Confidence intervals via standardized statistics

The strategy here is essentially the same as in the previous reading. Assuming normal data
we have what we called standardized statistics like the standardized mean, Studentized
mean, and standardized variance. These statistics have well known distributions which
depend on hypothesized values of µ and . We then use algebra to produce confidence
intervals for µ or .

1
18.05 class 23, Confidence Intervals: Three Views, Spring 2014 2

Don’t let the algebraic details distract you from the essentially simple idea underlying
confidence intervals: we start with a standardized statistic (e.g., z, t or 2 ) and use some
algebra to get an interval that depends only on the data and known parameters.

3.1 z-confidence intervals for µ: normal data with known

z-confidence intervals for the mean of normal data are based on the standardized mean, i.e.
the z-statistic. We start with n independent normal samples
2
x1 , x2 , . . . , xn ⇠ N(µ, ).
We assume that µ is the unknown parameter of interest and is known.
We know that the standardized mean is standard normal:
x µ
z = p ⇠ N(0, 1).
/ n

For the standard normal critical value z↵/2 we have: P ( z↵/2 < Z < z↵/2 ) = 1 ↵.
Thus, ✓ ◆
x µ
P z↵/2 < p < z↵/2 | µ = 1 ↵
/ n
A little bit of algebra puts this in the form of an interval around µ:
✓ ◆
P x z↵/2 · p < µ < x + z↵/2 · p | µ = 1 ↵
n n
We can emphasize that the interval depends only on the statistic x and the known value
by writing this as
✓ ◆
P x z↵/2 · p , x + z↵/2 · p contains µ | µ = 1 ↵.
n n
This is the (1 ↵) z-confidence interval for µ. We often write it using the shorthand

x ± z↵/2 · p
n
Think of it as x ± error.
Make sure you notice that the probabilities are conditioned on µ. As with all frequen-
tist statistics, we have to fix hypothesized values of the parameters in order to compute
probabilities.

3.2 t-confidence intervals for µ: normal data with unknown µ and

t-confidence intervals for the mean of normal data are based on the Studentized mean, i.e.
the t-statistic.
Again we have x1 , x2 , . . . , xn ⇠ N(µ, 2 ), but now we assume both µ and are unknown.
We know that the Studentized mean follows a Student t distribution with n 1 degrees of
freedom. That is,
x µ
t = p ⇠ t(n 1),
s/ n
18.05 class 23, Confidence Intervals: Three Views, Spring 2014 3

where s2 is the sample variance.

Now all we have to do is replace the standardized mean by the Studentized mean and the
same logic we used for z gives us the t-confidence interval: start with
✓ ◆
x µ
P t↵/2 < p < t↵/2 | µ = 1 ↵.
s/ n
A little bit of algebra isolates µ in the middle of an interval:
✓ ◆
s s
P x t↵/2 · p < µ < x + t↵/2 · p | µ = 1 ↵
n n
We can emphasize that the interval depends only on the statistics x and s by writing this
as ✓ ◆
s s
P x t↵/2 · p , x + t↵/2 · p contains µ | µ = 1 ↵.
n n

This is the (1 ↵) t-confidence interval for µ. We often write it using the shorthand
s
x ± t↵/2 · p
n
Think of it as x ± error.

2 2
3.3 -confidence intervals for : normal data with unknown µ and

You guessed it: 2 -confidence intervals for the variance of normal data are based on the
standardized variance, i.e. the 2 -statistic.
We follow the same logic as above to get a 2 -confidence interval for 2. Because this is
the third time through it we’ll move a little more quickly.
We assume we have n independent normal samples: x1 , x2 , . . . , xn ⇠ N(µ, 2 ). We assume
that µ and are both unknown. The standardized variance is
(n 1)s2
X2 = 2
⇠ 2
(n 1).

We know that the X 2 statistic follows a 2 distribution with n 1 degrees of freedom.

For Z and t we used, without comment, the symmetry of the distributions to replace z1 ↵/2
by z↵/2 and t1 ↵/2 by t↵/2 . Because the 2 distribution is not symmetric we need to be
explicit about the critical values on both the left and the right. That is,

P (c1 ↵/2 < X 2 < c↵/2 ) = 1 ↵,

where c↵/2 and c1 ↵/2 are right tail critical values. Thus,
✓ ◆
(n 1)s2
P c1 ↵/2 < 2
< c↵/2 | =1 ↵

A little bit of algebra puts this in the form of an interval around 2 :

✓ ◆
(n 1)s2 (n 1)s2
P < 2< | =1 ↵
c↵/2 c1 ↵/2
18.05 class 23, Confidence Intervals: Three Views, Spring 2014 4

We can emphasize that the interval depends only on the statistic s2 by writing this as
✓ ◆
(n 1)s2 (n 1)s2 2 2
P , contains | = 1 ↵.
c↵/2 c1 ↵/2

This is the (1 ↵) 2 -confidence interval for 2.

4 Confidence intervals via hypothesis testing

Suppose we have data drawn from a distribution with a parameter ✓ whose value is unknown.
A significance test for the value ✓ has the following short description.
1. Set the null hypothesis H0 : ✓ = ✓0 for some special value ✓0 , e.g. we often have H0 :
✓ = 0.
2. Use the data to compute the value of a test statistic, call it x.
3. If x is far enough into the tail of the null distribution (the distribution assuming the null
hypothesis) then we reject H0 .
In the case where there is no special value to test we may still want to estimate ✓. This is
the reverse of significance testing; rather than seeing if we should reject a specific value of
✓ because it doesn’t fit the data we want to find the range of values of ✓ that do, in some
sense, fit the data. This gives us the following definitions.
Definition. Given a value x of the test statistic, the (1 ↵) confidence interval contains all
values ✓0 which are not rejected (at significance level ↵) when they are the null hypothesis.
Definition. A type 1 CI error occurs when the confidence interval does not contain the
true value of ✓.
For a (1 ↵) confidence interval the type 1 CI error rate is ↵.
Example 1. Here is an example relating confidence intervals and hypothesis tests. Suppose
data x is drawn from a binomial(12, ✓) distribution with ✓ unknown. Let ↵ = 0.1 and create
the (1 ↵) = 90% confidence interval for each possible value of x.
Our strategy is to look at one possible value of ✓ at a time and choose rejection regions for
a significance test with ↵ = 0.1. Once this is done, we will know, for each value of x, which
values of ✓ are not rejected, i.e. the confidence interval associated with x.
To start we set up a likelihood table for binomial(12, ✓) in Table 1. Each row shows the
probabilities p(x|✓) for one value of ✓. To keep the size manageable we only show ✓ in
increments of 0.1.
18.05 class 23, Confidence Intervals: Three Views, Spring 2014 5

✓\x 0 1 2 3 4 5 6 7 8 9 10 11 12
1.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00
0.9 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.02 0.09 0.23 0.38 0.28
0.8 0.00 0.00 0.00 0.00 0.00 0.00 0.02 0.05 0.13 0.24 0.28 0.21 0.07
0.7 0.00 0.00 0.00 0.00 0.01 0.03 0.08 0.16 0.23 0.24 0.17 0.07 0.01
0.6 0.00 0.00 0.00 0.01 0.04 0.10 0.18 0.23 0.21 0.14 0.06 0.02 0.00
0.5 0.00 0.00 0.02 0.05 0.12 0.19 0.23 0.19 0.12 0.05 0.02 0.00 0.00
0.4 0.00 0.02 0.06 0.14 0.21 0.23 0.18 0.10 0.04 0.01 0.00 0.00 0.00
0.3 0.01 0.07 0.17 0.24 0.23 0.16 0.08 0.03 0.01 0.00 0.00 0.00 0.00
0.2 0.07 0.21 0.28 0.24 0.13 0.05 0.02 0.00 0.00 0.00 0.00 0.00 0.00
0.1 0.28 0.38 0.23 0.09 0.02 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
0.0 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Table 1. Likelihood table for Binomial(12, ✓)

Tables 2-4 below show the rejection region (in orange) and non-rejection region (in blue) for the
various values of ✓. To emphasize the row-by-row nature of the process the Table 2 just shows these
regions for ✓ = 1.0, then Table 3 adds in regions for ✓ = 0.9 and Table 4 shows them for all the
values of ✓.
Immediately following the tables we give a detailed explanation of how the rejection/non-rejection
regions were chosen.
✓\x 0 1 2 3 4 5 6 7 8 9 10 11 12 significance
1.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.000
0.9 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.02 0.09 0.23 0.38 0.28
0.8 0.00 0.00 0.00 0.00 0.00 0.00 0.02 0.05 0.13 0.24 0.28 0.21 0.07
0.7 0.00 0.00 0.00 0.00 0.01 0.03 0.08 0.16 0.23 0.24 0.17 0.07 0.01
0.6 0.00 0.00 0.00 0.01 0.04 0.10 0.18 0.23 0.21 0.14 0.06 0.02 0.00
0.5 0.00 0.00 0.02 0.05 0.12 0.19 0.23 0.19 0.12 0.05 0.02 0.00 0.00
0.4 0.00 0.02 0.06 0.14 0.21 0.23 0.18 0.10 0.04 0.01 0.00 0.00 0.00
0.3 0.01 0.07 0.17 0.24 0.23 0.16 0.08 0.03 0.01 0.00 0.00 0.00 0.00
0.2 0.07 0.21 0.28 0.24 0.13 0.05 0.02 0.00 0.00 0.00 0.00 0.00 0.00
0.1 0.28 0.38 0.23 0.09 0.02 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
0.0 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

Table 2. Likelihood table for binomial(12, ✓) with rejection/non-rejection regions for ✓ = 1.0

✓\x 0 1 2 3 4 5 6 7 8 9 10 11 12 significance
1.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.000
0.9 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.02 0.09 0.23 0.38 0.28 0.026
0.8 0.00 0.00 0.00 0.00 0.00 0.00 0.02 0.05 0.13 0.24 0.28 0.21 0.07
0.7 0.00 0.00 0.00 0.00 0.01 0.03 0.08 0.16 0.23 0.24 0.17 0.07 0.01
0.6 0.00 0.00 0.00 0.01 0.04 0.10 0.18 0.23 0.21 0.14 0.06 0.02 0.00
0.5 0.00 0.00 0.02 0.05 0.12 0.19 0.23 0.19 0.12 0.05 0.02 0.00 0.00
0.4 0.00 0.02 0.06 0.14 0.21 0.23 0.18 0.10 0.04 0.01 0.00 0.00 0.00
0.3 0.01 0.07 0.17 0.24 0.23 0.16 0.08 0.03 0.01 0.00 0.00 0.00 0.00
0.2 0.07 0.21 0.28 0.24 0.13 0.05 0.02 0.00 0.00 0.00 0.00 0.00 0.00
0.1 0.28 0.38 0.23 0.09 0.02 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
0.0 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

Table 3. Likelihood table with rejection/non-rejection regions shown for ✓ = 1.0 and 0.9
18.05 class 23, Confidence Intervals: Three Views, Spring 2014 6

✓\x 0 1 2 3 4 5 6 7 8 9 10 11 12 significance
1.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.000
0.9 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.02 0.09 0.23 0.38 0.28 0.026
0.8 0.00 0.00 0.00 0.00 0.00 0.00 0.02 0.05 0.13 0.24 0.28 0.21 0.07 0.073
0.7 0.00 0.00 0.00 0.00 0.01 0.03 0.08 0.16 0.23 0.24 0.17 0.07 0.01 0.052
0.6 0.00 0.00 0.00 0.01 0.04 0.10 0.18 0.23 0.21 0.14 0.06 0.02 0.00 0.077
0.5 0.00 0.00 0.02 0.05 0.12 0.19 0.23 0.19 0.12 0.05 0.02 0.00 0.00 0.092
0.4 0.00 0.02 0.06 0.14 0.21 0.23 0.18 0.10 0.04 0.01 0.00 0.00 0.00 0.077
0.3 0.01 0.07 0.17 0.24 0.23 0.16 0.08 0.03 0.01 0.00 0.00 0.00 0.00 0.052
0.2 0.07 0.21 0.28 0.24 0.13 0.05 0.02 0.00 0.00 0.00 0.00 0.00 0.00 0.073
0.1 0.28 0.38 0.23 0.09 0.02 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.026
0.0 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.000

Table 4. Likelihood table with rejection/non-rejection regions for ✓ = 0.0 to 1.0

Choosing the rejection and non-rejection regions in the tables

The first problem we confront is how exactly to choose the rejection region. We used two rules:
1. The total probabilitiy of the rejection region, i.e. the significance, should be less than or equal
to 0.1. (Since we have a discrete distribution it is impossible to make the significance exactly 0.1.)
2. We build the rejection region by choosing values of x one at a time, always picking the unused
value with the smallest probability. We stop when the next value would make the significance more
that 0.1.
There are other ways to choose the rejection region which would result in slight di↵erences. Our
method is one reasonable way.
Table 2 shows the rejection (orange) and non-rejection (blue) regions for ✓ = 1.0. This is a special
case because most of the probabilities in this row are 0.0. We’ll move right on to the next table and
step through the process for that.
In Table 3, let’s walk through the steps used to find these regions for ✓ = 0.9.

• The smallest probability is when x = 0, so x = 0 is in the rejection region.

• The next smallest is when x = 1, so x = 1 is in the rejection region.
• We continue with x = 2, . . . , 8. At this point the total probability in the rejection region is
0.026.
• The next smallest probability is when x = 9. Adding this probability (0.09) to 0.026 would
put the total probability over 0.1. So we leave x = 9 out of the rejection region and stop the
process.

Note three things for the ✓ = 0.9 row:

1. None of the probabilities in this row are truly zero, though some are small enough that they equal
0 to 2 decimal places.
2. We show the significance for this value of ✓ in the right hand margin. More precisely, we show
the significance level of the NHST with null hypothesis ✓ = 0.9 and the given rejection region.
3. The rejection region consists of values of x. When we say the rejection region is shown in orange
we really mean the rejection region contains the values of x corresponding to the probabilities
highlighted in orange.
Think: Look back at the ✓ = 1.0 row and make sure you understand why the rejection region is
x = 0, . . . , 11 and the significance is 0.000.
Example 2. Using Table 4 determine the 0.90 confidence interval when x = 8.
18.05 class 23, Confidence Intervals: Three Views, Spring 2014 7

answer: The 90% confidence interval consists of all those ✓ that would not be rejected by an ↵ = 0.1
hypothesis test when x = 8. Looking at the table, the blue (non-rejected) entries in the column
x = 8 correspond to 0.5  ✓  0.8: the confidence interval is [0.5, 0.8].
Remark: The point of this example is to show how confidence intervals and hypothesis tests are
related. Since Table 4 has only finitely many values of ✓, our answer is close but not exact. Using a
computer we could look at many more values of ✓. For this problem we used R to find that, correct
to 2 decimal places, the confidence interval is [0.42, 0.85].
Example 3. Explain why the expected type one CI error rate will be at most 0.092, provided that
the true value of ✓ is in the table.
answer: The short answer is that this is the maximum significance for any ✓ in Table 4. Expanding
on that slightly: we make a type one CI error if the confidence interval does not contain the true
value of ✓, call it ✓true . This happens exactly when the data x is in the rejection region for ✓true .
The probability of this happening is the significance for ✓true and this is at most 0.092.
Remark: The point of this example is to show how confidence level, type one CI error rate and
significance for each hypothesis are related. As in the previous example, we can use R to compute
the significance for many more values of ✓. When we do this we find that the maximum significance
for any ✓ is 0.1 ocurring when ✓ ⇡ 0.0452.
Summary notes:
1. We start with a test statistic x. The confidence interval is random because it depends on x.
2. For each hypothesized value of ✓ we make a significance test with significance level ↵ by choosing
rejection regions.
3. For a specific value of x the associated confidence interval for ✓ consists of all ✓ that aren’t rejected
for that value, i.e. all ✓ that have x in their non-rejection regions.
4. Because the distribution is discrete we can’t always achieve the exact significance level, so our
confidence interval is really an ‘at least 90% confidence interval’.

Example 4. Open the applet http://mathlets.org/mathlets/confidence-intervals/. We

want you to play with the applet to understand the random nature of confidence intervals and the
meaning of confidence as (1 - type I CI error rate).
(a) Read the help. It is short and will help orient you in the applet. Play with di↵erent settings of
the parameters to see how they a↵ect the size of the confidence intervals.
(b) Set the number of trials to N = 1. Click the ‘Run N trials’ button repeatedly and see that each
time data is generated the confidence intervals jump around.
(c) Now set the confidence level to c = .5. As you click the ‘Run N trials’ button you should see that
about 50% of the confidence intervals include the true value of µ. The ‘Z correct’ and ‘t correct’
values should change accordingly.
(d) Now set the number of trials to N = 100. With c = .8. The ‘Run N trials’ button will now run
100 trials at a time. Only the last confidence interval will be shown in the graph, but the trials all
run and the ‘percent correct’ statistics will be updated based on all 100 trials.
Click the run trials button repeatedly. Watch the correct rates start to converge to the confidence
level. To converge even faster, set N = 1000.

5 Formal view of confidence intervals

Recall: An interval statistic is an interval Ix computed from data x. An interval is determined by
its lower and upper bounds, and these are random because x is random.
18.05 class 23, Confidence Intervals: Three Views, Spring 2014 8

We suppose that x is drawn from a distribution with pdf f (x|✓) where the parameter ✓ is unknown.
Definition: A (1 ↵) confidence interval for ✓ is an interval statistic Ix such that

P (Ix contains ✓0 | ✓ = ✓0 ) = 1 ↵

for all possible values of ✓0 .

We wish this was simpler, but a definition is a definition and this definition is one way to weigh the
evidence provided by the data x. Let’s unpack it a bit.
The confidence level of an interval statistic is a probability concerning a random interval and a
hypothesized value ✓0 for the unknown parameter. Precisely, it is the probability that the random
interval (computed from random data) contains the value ✓0 , given that the model parameter truly
is ✓0 . Since the true value of ✓ is unknown, the frequentist statistician defines a, say, 95% confidence
intervals so that the 0.95 probability is valid no matter which hypothesized value of the parameter
is actually true.

6 Comparison with Bayesian probability intervals

Confidence intervals are a frequentist notion, and as we’ve repeated many times, frequentists don’t
assign probabilities to hypotheses, e.g. the value of an unknown parameter. Rather they compute
likelihoods; that is, probabilities about data or associated statistics given a hypothesis (note the
condition ✓ = ✓0 in the formal view of confidence intervals). Note that the construction of confidence
intervals proceeds entirely from the full likelihood table.
In contrast Bayesian posterior probability intervals are truly the probability that the value of the
unknown parameter lies in the reported range. We add the usual caveat that this depends on the
specific choice of a (possibly subjective) Bayesian prior.
This distinction between the two is subtle because Bayesian posterior probability intervals and
frequentist confidence intervals share the following properties:
1. They start from a model f (x|✓) for observed data x with unknown parameter ✓.
2. Given data x, they give an interval I(x) specifying a range of values for ✓.
3. They come with a number (say .95) that is the probability of something.
In practice, many people misinterpret confidence intervals as Bayesian probability intervals, forget-
ting that frequentists never place probabilities on hypotheses (this is analogous to mistaking the
p-value in NHST for the probability that H0 is false). The harm of this misinterpretation is some-
what mitigated by that fact that, given enough data and a reasonable prior, Bayesian and frequentist
intervals often work out to be quite similar.
For an amusing example illustrating how they can be quite di↵erent, see the first answer here
(involving chocolate chip cookies!):
http://stats.stackexchange.com/questions/2272/whats-the-difference-between-a-confidence-
interval-and-a-

This example uses the formal definitions and is really about confidence sets instead of confidence
intervals.
Confidence Intervals for the Mean of Non-normal Data
Class 23, 18.05
Jeremy Orlo↵ and Jonathan Bloom

1 Learning Goals

1. Be able to derive the formula for conservative normal confidence intervals for the
proportion ✓ in Bernoulli data.

2. Be able to compute rule-of-thumb 95% confidence intervals for the proportion ✓ of a

Bernoulli distribution.

3. Be able to compute large sample confidence intervals for the mean of a general distri-
bution.

2 Introduction

So far, we have focused on constructing confidence intervals for data drawn from a normal
distribution. We’ll now will switch gears and learn about confidence intervals for the mean
when the data is not necessarily normal.
We will first look carefully at estimating the probability ✓ of success when the data is drawn
from a Bernoulli(✓) distribution –recall that ✓ is also the mean of the Bernoulli distribution.
Then we will consider the case of a a large sample from an unknown distribution; in this
case we can appeal to the central limit theorem to justify the use z-confidence intervals.

3 Bernoulli data and polling

One common use of confidence intervals is for estimating the proportion ✓ in a Bernoulli(✓)
distribution. For example, suppose we want to use a political poll to estimate the proportion
of the population that supports candidate A, or equivalent the probability ✓ that a random
person supports candidate A. In this case we have a simple rule-of-thumb that allows us to
quickly compute a confidence interval.

3.1 Conservative normal confidence intervals

Suppose we have i.i.d. data x1 , x2 , . . . , xn all drawn from a Bernoulli(✓) distribution. then
a conservative normal (1 ↵) confidence interval for ✓ is given by
1
x ± z↵/2 · p . (1)
2 n
p
The proof given below uses the central limit theorem and the observation that = ✓(1 ✓) 
1/2.

1
18.05 class 23, Confidence Intervals for the Mean of Non-normal Data , Spring 2014 2

You will also see in the derivation below that this formula is conservative, providing an ‘at
least (1 ↵)’ confidence interval.
Example 1. A pollster asks 196 people if they prefer candidate A to candidate B and finds
that 120 prefer A and 76 prefer B. Find the 95% conservative normal confidence interval
for ✓, the proportion of the population that prefers A.
answer: We have x = 120/196 = 0.612, ↵ = 0.05 and z.025 = 1.96. The formula says a 95%
confidence interval is
1.96
I ⇡ 0.612 ± = 0.612 ± 0.007.
2 · 14

3.2 Proof of Formula 1

The proof of Formula 1 will rely on the following fact.

Fact. The standard deviation of a Bernoulli(✓) distribution is at most 0.5.
Proof of fact: Let’s denote this standard deviation by ✓ to emphasize its dependence on
✓. The variance is then ✓2 = ✓(1 ✓). It’s easy to see using calculus or by graphing this
parabola that the maximum occurs when ✓ = 1/2. Therefore
p the maximum variance is 1/4,
which implies that the standard deviation p is less the 1/4 = 1/2.
Proof of formula (1). The proof relies on the central limit theorem which says that (for
large n) the distribution of x is approximately normal with mean ✓ and standard deviation
p
✓ / n. For normal data we have the (1 ↵) z-confidence interval
✓
x ± z↵/2 · p
n

The trick now is to replace ✓ by 12 : since ✓  1

2 the resulting interval around x

1
x ± z↵/2 · p
2 n
p
is always at least as wide as the interval using ± ✓ / n. A wider interval is more likely to
contain the true value of ✓ so we have a ‘conservative’ (1 ↵) confidence interval for ✓.
Again, we call this conservative because 2p1 n overestimates the standard deviation of x̄,
resulting in a wider interval than is necessary to achieve a (1 ↵) confidence level.

3.3 How political polls are reported

Political polls are often reported as a value with a margin-of-error. For example you might
hear
52% favor candidate A with a margin-of-error of ±5%.
The actual precise meaning of this is
if ✓ is the proportion of the population that supports A then the point
estimate for ✓ is 52% and the 95% confidence interval is 52% ± 5%.
Notice that reporters of polls in the news do not mention the 95% confidence. You just
have to know that that’s what pollsters do.
18.05 class 23, Confidence Intervals for the Mean of Non-normal Data , Spring 2014 3

The 95% rule-of-thumb confidence interval.

Recall that the (1 ↵) conservative normal confidence interval is
1
x ± z↵/2 · p .
2 n
If we use the standard approximation z.025 = 2 (instead of 1.96) we get the rule-of thumb
95% confidence interval for ✓:
1
x± p .
n
Example 2. Polling. Suppose there will soon be a local election between candidate A and
candidate B. Suppose that the fraction of the voting population that supports A is ✓.
Two polling organizations ask voters who they prefer.
1. The firm of Fast and First polls 40 random voters and finds 22 support A.
2. The firm of Quick but Cautious polls 400 random voters and finds 190 support A.
Find the point estimates and 95% rule-of-thumb confidence intervals for each poll. Explain
how the statistics reflect the intuition that the poll of 400 voters is more accurate.
answer: For poll 1 we have
Point estimate: x = 22/40 = 0.55
1 1
Confidence interval: x ± p = 0.55 ± p = 0.55 ± 0.16 = 55% ± 16%.
n 40
For poll 2 we have
Point estimate: x = 190/400 = 0.475
1 1
Confidence interval: x ± p = 0.475 ± p = 0.475 ± 0.05 = 47.5% ± 5%.
n 400
The greater accuracy of the poll of 400 voters is reflected in the smaller margin of error, i.e.
5% for the poll of 400 voters vs. 16% for the poll of 40 voters.

Other binomial proportion confidence intervals

There are many methods of producing confidence intervals for the proportion p of a binomial(n,
p) distribution. For a number of other common approaches, see:
http://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval

4 Large sample confidence intervals

One typical goal in statistics is to estimate the mean of a distribution. When the data follows
a normal distribution we could use confidence intervals based on standardized statistics to
estimate the mean.
But suppose the data x1 , x2 , . . . , xn is drawn from a distribution with pmf or pdf f (x) that
may not be normal or even parametric. If the distribution has finite mean and variance
and if n is sufficiently large, then the following version of the central limit theorem shows
we can still use a standardized statistic.
Central Limit Theorem: For large n, the sampling distribution of the studentized mean
x̄ µ
is approximately standard normal: p ⇡ N(0, 1).
s/ n
18.05 class 23, Confidence Intervals for the Mean of Non-normal Data , Spring 2014 4

So for large n the (1 ↵) confidence interval for µ is approximately


s s
x̄ p · z↵/2 , x̄ + p · z↵/2
n n

where z↵/2 is the ↵/2 critical value for N(0, 1). This is called the large sample confidence
interval.
Example 3. How large must n be?
Recall that a type 1 CI error occurs when the confidence interval does not contain the true
value of the parameter, in this case the mean. Let’s call the value (1 ↵) the nominal
confidence level. We say nominal because unless n is large we shouldn’t expect the true
type 1 CI error rate to be ↵.
We can run numerical simulations to approximate of the true confidence level. We expect
that as n gets larger the true confidence level of the large sample confidence interval will
converge to the nominal value.
We ran such simulations for x drawn from the exponential distribution exp(1) (which is far
from normal). For several values of n and nominal confidence level c we ran 100,000 trials.
Each trial consisted of the following steps:
1. draw n samples from exp(1).
2. compute the sample mean x̄ and sample standard deviation s.
s
3. construct the large sample c confidence interval: x ± z↵/2 · p .
n
4. check for a type 1 CI error, i.e. see if the true mean µ = 1 is not in the interval.

With 100,000 trials, the empirical confidence level should closely approximate the true level.
For comparison we ran the same tests on data drawn from a standard normal distribution.
Here are the results.
nominal conf. nominal conf.
n 1 ↵ simulated conf. n 1 ↵ simulated conf.
20 0.95 0.905 20 0.95 0.936
20 0.90 0.856 20 0.90 0.885
20 0.80 0.762 20 0.80 0.785
50 0.95 0.930 50 0.95 0.944
50 0.90 0.879 50 0.90 0.894
50 0.80 0.784 50 0.80 0.796
100 0.95 0.938 100 0.95 0.947
100 0.90 0.889 100 0.900 0.896
100 0.80 0.792 100 0.800 0.797
400 0.95 0.947 400 0.950 0.949
400 0.90 0.897 400 0.900 0.898
400 0.80 0.798 400 0.800 0.798
Simulations for exp(1) Simulations for N(0, 1).
For the exp(1) distribution we see that for n = 20 the simulated confidence of the large
sample confidence interval is less than the nominal confidence 1 ↵. But for n = 100 the
18.05 class 23, Confidence Intervals for the Mean of Non-normal Data , Spring 2014 5

simulated confidence and nominal confidence are quite close. So for exp(1), n somewhere
between 50 and 100 is large enough for most purposes.
Think: For n = 20 why is the simulated confidence for the N(0, 1) distribution is smaller
than the nominal confidence?
This is because we used z↵/2 instead of t↵/2 . For large n these are quite close, but for n = 20
there is a noticable di↵erence, e.g. z.025 = 1.96 and t.025 = 2.09.
Bootstrap confidence intervals
Class 24, 18.05
Jeremy Orlo↵ and Jonathan Bloom

1 Learning Goals
1. Be able to construct and sample from the empirical distribution of data.

2. Be able to explain the bootstrap principle.

3. Be able to design and run an empirical bootstrap to compute confidence intervals.

4. Be able to design and run a parametric bootstrap to compute confidence intervals.

2 Introduction

The empirical bootstrap is a statistical technique popularized by Bradley Efron in 1979.

Though remarkably simple to implement, the bootstrap would not be feasible without
modern computing power. The key idea is to perform computations on the data itself
to estimate the variation of statistics that are themselves computed from the same data.
That is, the data is ‘pulling itself up by its own bootstrap.’ (A google search of ‘by ones
own bootstraps’ will give you the etymology of this metaphor.) Such techniques existed
before 1979, but Efron widened their applicability and demonstrated how to implement the
bootstrap e↵ectively using computers. He also coined the term ‘bootstrap’ 1 .
Our main application of the bootstrap will be to estimate the variation of point estimates;
that is, to estimate confidence intervals. An example will make our goal clear.
Example 1. Suppose we have data

x 1 , x2 , . . . , x n

If we knew the data was drawn from N(µ, 2 ) with the unknown mean µ and known variance
2 then we have seen that

x 1.96 p , x + 1.96 p
n n
is a 95% confidence interval for µ.
Now suppose the data is drawn from some completely unknown distribution. To have a
name we’ll call this distribution F and its (unknown) mean µ. We can still use the sample
mean x as a point estimate of µ. But how can we find a confidence interval for µ around
x? Our answer will be to use the bootstrap!
In fact, we’ll see that the bootstrap handles other statistics as easily as it handles the mean.
For example: the median, other percentiles or the trimmed mean. These are statistics
where, even for normal distributions, it can be difficult to compute a confidence interval
from theory alone.
1
Paraphrased from Dekking et al. A Modern Introduction to Probabilty and Statistics, Springer, 2005,
page 275.

1
18.05 class 24, Bootstrap confidence intervals, Spring 2014 2

3 Sampling

In statistics to sample from a set is to choose elements from that set. In a random sample
the elements are chosen randomly. There are two common methods for random sampling.
Sampling without replacement
Suppose we draw 10 cards at random from a deck of 52 cards without putting any of the
cards back into the deck between draws. This is called sampling without replacement or
simple random sampling. With this method of sampling our 10 card sample will have no
duplicate cards.
Sampling with replacement
Now suppose we draw 10 cards at random from the deck, but after each draw we put the
card back in the deck and shu✏e the cards. This is called sampling with replacement. With
this method, the 10 card sample might have duplicates. It’s even possible that we would
draw the 6 of hearts all 10 times.
Think: What’s the probability of drawing the 6 of hearts 10 times in a row?
Example 2. We can view rolling an 8-sided die repeatedly as sampling with replacement
from the set {1,2,3,4,5,6,7,8}. Since each number is equally likely, we say we are sampling
uniformly from the data. There is a subtlety here: each data point is equally probable, but
if there are repeated values within the data those values will have a higher probability of
being chosen. The next example illustrates this.

Note. In practice if we take a small number from a very large set then it doesn’t matter
whether we sample with or without replacement. For example, if we randomly sample 400
out of 300 million people in the U.S. then it is so unlikely that the same person will be
picked twice that there is no real di↵erence between sampling with or without replacement.

4 The empirical distribution of data

The empirical distribution of data is simply the distribution that you see in the data. Let’s
illustrate this with an example.
Example 3. Suppose we roll an 8-sided die 10 times and get the following data, written
in increasing order:
1, 1, 2, 3, 3, 3, 3, 4, 7, 7.
Imagine writing these values on 10 slips of paper, putting them in a hat and drawing one
at random. Then, for example, the probability of drawing a 3 is 4/10 and the probability
of drawing a 4 is 1/10. The full empirical distribution can be put in a probability table
value x 1 2 3 4 7
p(x) 2/10 1/10 4/10 1/10 2/10
Notation. If we label the true distribution the data is drawn from as F , then we’ll label
the empirical distribution of the data as F ⇤ . If we have enough data then the law of large
numbers tells us that F ⇤ should be a good approximation of F .
Example 4. In the dice example just above, the true and empirical distributions are:
18.05 class 24, Bootstrap confidence intervals, Spring 2014 3

value x 1 2 3 4 5 5 7 8
true p(x) 1/8 1/8 1/8 1/8 1/8 1/8 1/8 1/8
empirical p(x) 2/10 1/10 4/10 1/10 0 0 2/10 0
The true distribution F and the empirical distribution F ⇤ of the 8-sided die.

Because F ⇤ is derived strictly from data we call it the empirical distribution of the data.
We will also call it the resampling distribution. Notice that we always know F ⇤ explicitly.
In particular the expected value of F ⇤ is just the sample mean x.

5 Resampling

The empirical bootstrap proceeds by resampling from the data. We continue the dice
example above.
Example 5. Suppose we have 10 data points, given in increasing order:

1, 1, 2, 3, 3, 3, 3, 4, 7, 7

We view this as a sample taken from some underlying distribution. To resample is to sample
with replacement from the empirical distribution, e.g. put these 10 numbers in a hat and
draw one at random. Then put the number back in the hat and draw again. You draw as
many numbers as the desired size of the resample.
To get us a little closer to implementing this on a computer we rephrase this in the following
way. Label the 10 data points x1 , x2 , . . . , x10 . To resample is to draw a number j from the
uniform distribution on {1, 2, . . . , 10} and take xj as our resampled value. In this case we
could do so by rolling a 10-sided die. For example, if we roll a 6 then our resampled value
is 3, the 6th element in our list.
If we want a resampled data set of size 5, then we roll the 10-sided die 5 times and choose
the corresponding elements from the list of data. If the 5 rolls are

5, 3, 6, 6, 1

then the resample is

3, 2, 3, 3, 1.

Notes: 1. Because we are sampling with replacement, the same data point can appear
multiple times when we resample.
2. Also because we are sampling with replacement, we can have a resample data set of any
size we want, e.g. we could resample 1000 times.

Of course, in practice one uses a software package like R to do the resampling.

5.1 Star notation

If we have sample data of size n

x 1 , x2 , . . . , x n
18.05 class 24, Bootstrap confidence intervals, Spring 2014 4

then we denote a resample of size m by adding a star to the symbols

x⇤1 , x2⇤ , . . . , x⇤m
Similarly, just as x is the mean of the original data, we write x⇤ for the mean of the
resampled data.

6 The empirical bootstrap

Suppose we have n data points

x1 , x2 , . . . , xn
drawn from a distribution F . An empirical bootstrap sample is a resample of the same size
n:
x⇤1 , x2⇤ , . . . , x⇤n .
You should think of the latter as a sample of size n drawn from the empirical distribution
F ⇤ . For any statistic v computed from the original sample data, we can define a statistic v ⇤
by the same formula but computed instead using the resampled data. With this notation
we can state the bootstrap principle.

6.1 The bootstrap principle

The bootstrap setup is as follows:

1. x1 , x2 , . . . , xn is a data sample drawn from a distribution F .
2. u is a statistic computed from the sample.
3. F ⇤ is the empirical distribution of the data (the resampling distribution).
4. x⇤1 , x2⇤ , . . . , x⇤n is a resample of the data of the same size as the original sample
5. u⇤ is the statistic computed from the resample.
Then the bootstrap principle says that
1. F ⇤ ⇡ F .
2. The variation of u is well-approximated by the variation of u⇤ .
Our real interest is in point 2: we can approximate the variation of u by that of u⇤ . We
will exploit this to estimate the size of confidence intervals.

6.2 Why the resample is the same size as the original sample

This is straightforward: the variation of the statistic u will depend on the size of the sample.
If we want to approximate this variation we need to use resamples of the same size.

6.3 Toy example of an empirical bootstrap confidence interval

Example 6. Toy example. We start with a made-up set of data that is small enough to
show each step explicitly. The sample data is
30, 37, 36, 43, 42, 43, 43, 46, 41, 42
18.05 class 24, Bootstrap confidence intervals, Spring 2014 5

Problem: Estimate the mean µ of the underlying distribution and give an 80% bootstrap
confidence interval.
Note: R code for this example is shown in the section ‘R annotated transcripts’ below. The
code is also implemented in the R script class24-empiricalbootstrap.r which is posted
with our other R code.
answer: The sample mean is x = 40.3. We use this as an estimate of the true mean µ
of the underlying distribution. As in Example 1, to make the confidence interval we need
to know how much the distribution of x varies around µ. That is, we’d like to know the
distribution of
= x µ.
If we knew this distribution we could find .1 and .9 , the 0.1 and 0.9 critical values of .
Then we’d have

P( .9 x µ .1 | µ) = 0.8 , P (x .9 µ x .1 | µ) = 0.8

which gives an 80% confidence interval of

[x .1 , x .9 ] .

As always with confidence intervals, we hasten to point out that the probabilities computed
above are probabilities concerning the statistic x given that the true mean is µ.
The bootstrap principle o↵ers a practical approach to estimating the distribution of =
x µ. It says that we can approximate it by the distribution of
⇤
= x⇤ x

where x⇤ is the mean of an empirical bootstrap sample.

Here’s the beautiful key to this: since ⇤ is computed by resampling the original data, we
we can have a computer simulate ⇤ as many times as we’d like. Hence, by the law of large
numbers, we can estimate the distribution of ⇤ with high precision.
Now let’s return to the sample data with 10 points. We used R to generate 20 bootstrap
samples, each of size 10. Each of the 20 columns in the following array is one bootstrap
sample.
43 36 46 30 43 43 43 37 42 42 43 37 36 42 43 43 42 43 42 43
43 41 37 37 43 43 46 36 41 43 43 42 41 43 46 36 43 43 43 42
42 43 37 43 46 37 36 41 36 43 41 36 37 30 46 46 42 36 36 43
37 42 43 41 41 42 36 42 42 43 42 43 41 43 36 43 43 41 42 46
42 36 43 43 42 37 42 42 42 46 30 43 36 43 43 42 37 36 42 30
36 36 42 42 36 36 43 41 30 42 37 43 41 41 43 43 42 46 43 37
43 37 41 43 41 42 43 46 46 36 43 42 43 30 41 46 43 46 30 43
41 42 30 42 37 43 43 42 43 43 46 43 30 42 30 42 30 43 43 42
46 42 42 43 41 42 30 37 30 42 43 42 43 37 37 37 42 43 43 46
42 43 43 41 42 36 43 30 37 43 42 43 41 36 37 41 43 42 43 43
Next we compute ⇤ = x⇤ x for each bootstrap sample (i.e. each column) and sort them
from smallest to biggest:

-1.6, -1.4, -1.4, -0.9, -0.5, -0.2, -0.1, 0.1, 0.2, 0.2, 0.4, 0.4, 0.7, 0.9, 1.1, 1.2, 1.2, 1.6, 1.6, 2.0
18.05 class 24, Bootstrap confidence intervals, Spring 2014 6

We will approximate the critical values .1 and .9 by .1 ⇤ and ⇤ . Since ⇤ is at the 90th
.9 .1
percentile we choose the 18th element in the list, i.e. 1.6. Likewise, since ⇤ is at the 10th
.9
percentile we choose the 2nd element in the list, i.e. -1.4.
Therefore our bootstrap 80% confidence interval for µ is
⇤ ⇤
[x .1 , x .9 ] = [40.3 1.6, 40.3 + 1.4] = [38.7, 41.7]

In this example we only generated 20 bootstrap samples so they would fit on the page.
Using R, we would generate 10000 or more bootstrap samples in order to obtain a very
accurate estimate of .⇤1 and .⇤9 .

6.4 Justification for the bootstrap principle

The bootstrap is remarkable because resampling gives us a decent estimate on how the
point estimate might vary. We can only give you a ‘hand-waving’ explanation of this, but
it’s worth a try. The bootstrap is based roughly on the law of large numbers, which says,
in short, that with enough data the empirical distribution will be a good approximation of
the true distribution. Visually it says that the histogram of the data should approximate
the density of the true distribution.
First let’s note what resampling can’t do for us: it can’t improve our point estimate. For
example, if we estimate the mean µ by x then in the bootstrap we would compute x⇤ for
many resamples of the data. If we took the average of all the x⇤ we would expect it to be
very close to x. This wouldn’t tell us anything new about the true value of µ.
Even with a fair amount of data the match between the true and empirical distributions
is not perfect, so there will be error in estimating the mean (or any other value). But
the amount of variation in the estimates is much less sensitive to di↵erences between the
density and the histogram. As long as they are reasonably close both the empirical and
true distributions will exhibit the similar amounts of variation. So, in general the bootstrap
principle is more robust when approximating the distribution of relative variation than
when approximating absolute distributions.
What we have in mind is the scenario of our examples. The distribution (over di↵erent
sets of experimental data) of x is ‘centered’ at µ and the distribution of x⇤ is centered at
x If there is a significant separation between x and µ then these two distributions will also
di↵er significantly. On the other hand the distribution of = x µ describes the variation
of x about its center. Likewise the distribution of ⇤ = x⇤ x describes the variation of
x⇤ about its center. So even if the centers are quite di↵erent the two variations about the
centers can be approximately equal.
The figure below illustrates how the empirical distribution approximates the true distribu-
tion. To make the figure we generate 100 random values from a chi-square distribution with
3 degrees of freedom. The figure shows the pdf of the true distribution as a blue line and a
histogram of the empirical distribuion in orange.
18.05 class 24, Bootstrap confidence intervals, Spring 2014 7

0.20
0.10
0.00

0 5 10 15

The true and empirical distributions are approximately equal.

7 Other statistics

So far in this class we’ve avoided confidence intervals for the median and other statistics
because their sample distributions are hard to describe theoretically. The bootstrap has no
such problem. In fact, to handle the median all we have to do is change ‘mean’ to ‘median’
in the R code from Example 6.
Example 7. Old Faithful: confidence intervals for the median
Old Faithful is a geyser in Yellowstone National Park in Wyoming:
http://en.wikipedia.org/wiki/Old_Faithful
There is a publicly available data set which gives the durations of 272 consecutive eruptions.
Here is a histogram of the data.

Question: Estimate the median length of an eruption and give a 90% confidence interval
for the median.
answer: The full answer to this question is in the R file oldfaithful simple.r and the
Old Faithful data set. Both are posted on the class R code page. (Look under ‘Other R
code’ for the old faithful script and data.)
18.05 class 24, Bootstrap confidence intervals, Spring 2014 8

Note: the code in oldfaithful simple.r assumes that the data oldfaithful.txt is in
the current working directory.
Let’s walk through a summary of the steps needed to answer the question.
1. Data: x1 , . . . , x272
2. Data median: xmedian = 240
3. Find the median x⇤median of a bootstrap sample x⇤1 , . . . , x272
⇤ . Repeat 1000 times.

4. Compute the bootstrap di↵erences

⇤
= x⇤median xmedian
Put these 1000 values in order and pick out the .95 and .05 critical values, i.e. the 50th and
⇤ .
950th biggest values. Call these .⇤95 and .05
⇤ and ⇤ as estimates of
5. The bootstrap principle says that we can use .95 and
.05 .95 .05 .
So our estimated 90% bootstrap confidence interval for the median is
⇤ ⇤
[xmedian .05 , xmedian .95 ]

The bootstrap 90% CI we found for the Old Faithful data was [235, 250]. Since we used 1000
bootstrap samples a new simulation starting from the same sample data should produce a
similar interval. If in Step 3 we increase the number of bootstrap samples to 10000, then the
intervals produced by simulation would vary even less. One common strategy is to increase
the number of bootstrap samples until the resulting simulations produce intervals that vary
less than some acceptable level.

Example 8. Using the Old Faithful data, estimate P (|x µ| > 5 | µ).
answer: We proceed exactly as in the previous example except using the mean instead of
the median.
1. Data: x1 , . . . , x272
2. Data mean: x = 209.27
3. Find the mean x⇤ of 1000 empirical bootstrap samples: x⇤1 , . . . , x272
⇤ .

4. Compute the bootstrap di↵erences

⇤
= x⇤ x

5. The bootstrap principle says that we can use the distribution of ⇤ as an approximation
for the distribution = x µ. Thus,
P (|x µ| > 5 | µ) = P (| | > 5 | µ) ⇡ P (| ⇤ | > 5)
Our bootstrap simulation for the Old Faithful data gave 0.225 for this probability.

8 Parametric bootstrap

The examples in the previous sections all used the empirical bootstrap, which makes no as-
sumptions at all about the underlying distribution and draws bootstrap samples by resam-
pling the data. In this section we will look at the parametric bootstrap. The only di↵erence
18.05 class 24, Bootstrap confidence intervals, Spring 2014 9

between the parametric and empirical bootstrap is the source of the bootstrap sample. For
the parametric bootstrap, we generate the bootstrap sample from a parametrized distribu-
tion.
Here are the elements of using the parametric bootstrap to estimate a confidence interval
for a parameter.
0. Data: x1 , . . . , xn drawn from a distribution F (✓) with unknown parameter ✓.
1. A statistic ✓ˆ that estimates ✓.
2. Our bootstrap samples are drawn from F (✓ˆ).
3. For each bootstrap sample
x⇤1 , . . . , xn⇤
we compute ✓ˆ⇤ and the bootstrap di↵erence ⇤ = ✓ˆ⇤ ✓ˆ.
4. The bootstrap principle says that the distribution of ⇤ approximates the distribution of
= ✓ˆ ✓.
5. Use the bootstrap di↵erences to make a bootstrap confidence interval for ✓.

Example 9. Suppose the data x1 , . . . , x300 is drawn from an exp( ) distribution. Assume
also that the data mean x = 2. Estimate and give a 95% parametric bootstrap confidence
interval for .
answer: This is implemented in the R script class24-parametricbootstrap.r which is
posted with our other R code.
It’s will be easiest to explain the solution using commented code.
# Parametric bootstrap
# Given 300 data points with mean 2.
# Assume the data is exp(lambda)
# PROBLEM: Compute a 95% parametric bootstrap confidence interval for lambda
# We are given the number of data points and mean
n = 300
xbar = 2
# The MLE for lambda is 1/xbar
lambdahat = 1.0/xbar
# Generate the bootstrap samples
# Each column is one bootstrap sample (of 300 resampled values)
nboot = 1000
# Here’s the key difference with the empirical bootstrap:
# We draw the bootstrap sample from Exponential(lambdahat)
x = rexp(n*nboot, lambdahat)
bootstrapsample = matrix(x, nrow=n, ncol=nboot)
# Compute the bootstrap lambdastar
lambdastar = 1.0/colMeans(bootstrapsample)
# Compute the differences
deltastar = lambdastar - lambdahat
18.05 class 24, Bootstrap confidence intervals, Spring 2014 10

# Find the 0.05 and 0.95 quantile for deltastar

d = quantile(deltastar, c(0.05,0.95))
# Calculate the 95% confidence interval for lambda.
ci = lambdahat - c(d[2], d[1])
# This line of code is just one way to format the output text.
# sprintf is an old C function for doing this. R has many other
# ways to do the same thing.
s = sprintf("Confidence interval for lambda: [%.3f, %.3f]", ci[1], ci[2])
cat(s)

9 The bootstrap percentile method (should not be used)

Instead of computing the di↵erences ⇤ , the bootstrap percentile method uses the distribu-
tion of the bootstrap sample statistic as a direct approximation of the data sample statistic.
Example 10. Let’s redo Example 6 using the bootstrap percentile method.
We first compute x⇤ from the bootstrap samples given in Example 6. After sorting we get
35.7 37.4 38.0 39.5 39.7 39.8 39.8 40.1 40.1 40.6 40.7 40.8 41.1 41.1 41.7 42.0
42.1 42.4 42.4 42.4
The percentile method says to use the distribution of x⇤ as an approximation to the dis-
tribution of x. The 0.9 and 0.1 critical values are given by the 2nd and 18th elements.
Therefore the 80% confidence interval is [37.4, 42.4]. This is a bit wider than our answer
to Example 6.
The bootstrap percentile method is appealing due to its simplicity. However it depends on
the bootstrap distribution of x⇤ based on a particular sample being a good approximation to
the true distribution of x. Rice says of the percentile method, “Although this direct equation
of quantiles of the bootstrap sampling distribution with confidence limits may seem initially
appealing, it’s rationale is somewhat obscure.” 2 In short, don’t use the bootstrap
percentile method. Use the empirical bootstrap instead (we have explained both in the
hopes that you won’t confuse the empirical bootstrap for the percentile bootstrap).

10 R annotated transcripts

10.1 Using R to generate an empirical bootstrap confidence interval

This code only generates 20 bootstrap samples. In real practice we would generate many
more bootstrap samples. It is making a bootstrap confidence interval for the mean. This
code is implemented in the R script class24-empiricalbootstrap.r which is posted with
our other R code.
# Data for the example 6
x = c(30,37,36,43,42,43,43,46,41,42)
n = length(x)
2
John Rice, Mathematical Statistics and Data Analysis, 2nd edition, p. 272.
18.05 class 24, Bootstrap confidence intervals, Spring 2014 11

# sample mean
xbar = mean(x)
nboot = 20
# Generate 20 bootstrap samples, i.e. an n x 20 array of
# random resamples from x
tmpdata = sample(x,n*nboot, replace=TRUE)
bootstrapsample = matrix(tmpdata, nrow=n, ncol=nboot)
# Compute the means x⇤
bsmeans = colMeans(bootstrapsample)
# Compute ⇤ for each bootstrap sample
deltastar = bsmeans - xbar
# Find the 0.1 and 0.9 quantile for deltastar
d = quantile(deltastar, c(0.1, 0.9))
# Calculate the 80% confidence interval for the mean.
ci = xbar - c(d[2], d[1])
cat(’Confidence interval: ’,ci, ’\n’)
# ALTERNATIVE: the quantile() function is sophisticated about
# choosing a quantile between two data points. A less sophisticated
# approach is to pick the quantiles by sorting deltastar and
# choosing the index that corresponds to the desired quantiles.
# We do this below.
# Sort the results
sorteddeltastar = sort(deltastar)
# Look at the sorted results
hist(sorteddeltastar, nclass=6)
print(sorteddeltastar)
# Find the .1 and .9 critical values of deltastar
d9alt = sorteddeltastar[2]
d1alt = sorteddeltastar[18]
# Find and print the 80% confidence interval for the mean
ciAlt = xbar - c(d1alt,d9alt)
cat(’Alternative confidence interval: ’,ciAlt, ’\n’)
Linear regression
Class 25, 18.05
Jeremy Orlo↵ and Jonathan Bloom

1 Learning Goals
1. Be able to use the method of least squares to fit a line to bivariate data.

2. Be able to give a formula for the total squared error when fitting any type of curve to
data.

3. Be able to say the words homoscedasticity and heteroscedasticity.

2 Introduction

Suppose we have collected bivariate data (xi , yi ), i = 1, . . . , n. The goal of linear regression
is to model the relationship between x and y by finding a function y = f (x) that is a
close fit to the data. The modeling assumptions we will use are that xi is not random and
that yi is a function of xi plus some random noise. With these assumptions x is called the
independent or predictor variable and y is called the dependent or response variable.

Example 1. The cost of a first class stamp in cents over time is given in the following list.
.05 (1963) .06 (1968) .08 (1971) .10 (1974) .13 (1975) .15 (1978) .20 (1981) .22 (1985)
.25 (1988) .29 (1991) .32 (1995) .33 (1999) .34 (2001) .37 (2002) .39 (2006) .41 (2007)
.42 (2008) .44 (2009) .45 (2012) .46 (2013) .49 (2014)

Using the R function lm we found the ‘least squares fit’ for a line to this data is

y= 0.06558 + 0.87574x,

where x is the number of years since 1960 and y is in cents.

Using this result we ‘predict’ that in 2016 (x = 56) the cost of a stamp will be 49 cents
(since 0.06558 + 0.87574x · 56 = 48.98).
50

● ●

●
●
●

●
40

●
●
●
30

●
y

●
20

●
10

●
●

0 10 20 30 40 50 60
x

Stamp cost (cents) vs. time (years since 1960). Red dot is predicted cost in 2016.

1
18.05 class 25, Linear regression, Spring 2014 2

Note that none of the data points actually lie on the line. Rather this line has the ‘best fit’
with respect to all the data, with a small error for each data point.

Example 2. Suppose we have n pairs of fathers and adult sons. Let xi and yi be the
heights of the ith father and son, respectively. The least squares line for this data could be
used to predict the adult height of a young boy from that of his father.

Example 3. We are not limited to best fit lines. For all positive d, the method of least
squares may be used to find a polynomial of degree d with the ‘best fit’ to the data. Here’s
a figure showing the least squares fit of a parabola (d = 2).

Fitting a parabola, b2 x2 + b1 x + b0 , to data

3 Fitting a line using least squares

Suppose we have data (xi , yi ) as above. The goal is to find the line

y= 1x + 0

that ‘best fits’ the data. Our model says that each yi is predicted by xi up to some error ✏i :

yi = 1 xi + 0 + ✏i .

So
✏i = yi 1 xi 0.

The method of least squares finds the values ˆ0 and ˆ1 of 0 and 1 that minimize the sum
of the squared errors:
X X
S( 0 , 1 ) = ✏2i = (yi 1 xi 0)
2
.
i

Using calculus or linear algebra (details in the appendix), we find

ˆ1 = sxy ˆ0 = ȳ ˆ1 x̄ (1)
sxx
18.05 class 25, Linear regression, Spring 2014 3

where
1X 1X 1 X 1 X
x̄ = xi , ȳ = yi , sxx = (xi x̄)2 , sxy = (xi x̄)(yi ȳ).
n n (n 1) (n 1)

Here x̄ is the sample mean of x, ȳ is the sample mean of y, sxx is the sample variance of x,
and sxy is the sample covariance of x and y.

Example 4. Use least squares to fit a line to the following data: (0,1), (2,1), (3,4).
answer: In our case, (x1 , y1 ) = (0, 1), (x2 , y2 ) = (2, 1) and (x3 , y3 ) = (3, 4). So
5 14 4
x̄ = , ȳ = 2, sxx = , sxy =
3 9 3
Using the above formulas we get

ˆ1 = 6 , ˆ0 = 4 .
7 7
4 6
So the least squares line has equation y = + x. This is shown as the green line in the
7 7
following figure. We will discuss the blue parabola soon.
y
4

1
x
1 2 3
Least squares fit of a line (green) and a parabola (blue)

Simple linear regression: It’s a little confusing, but the word linear in ‘linear regression’
does not refer to fitting a line. We will explain its meaning below. However, the most
common curve to fit is a line. When we fit a line to bivariate data it is called simple linear
regression.

3.1 Residuals

For a line the model is

yi = ˆ1 x + ˆ0 + ✏i .
We think of ˆ1 xi + ˆ0 as predicting or explaining yi . The left-over term ✏i is called the
residual, which we think of as random noise or measurement error. A useful visual check of
the linear regression model is to plot the residuals. The data points should hover near the
regression line. The residuals should look about the same across the range of x.
18.05 class 25, Linear regression, Spring 2014 4

20
●
●

4
● ●
●
● ● ● ●

3
●● ● ● ●
● ●
● ●
15
● ● ●
●●
●● ● ● ● ●
● ● ●
●● ●
● ● ● ●

2
● ● ● ●
●● ● ●
● ●
● ●
● ● ●
●
●
●●● ●● ● ● ●
● ● ● ● ●●
● ● ●

1
● ● ● ●
10

● ● ●● ● ●
● ●

e
y

● ● ● ●
● ● ● ● ● ● ● ●
● ●
● ● ● ● ● ●
● ●
● ● ● ●● ●
● ● ● ● ●
● ● ●

0
● ●
● ●
● ●
●● ●
● ●
● ●● ●● ● ●
● ●● ● ● ● ● ●
●● ● ●
● ●● ● ● ●
● ● ●● ●
●●

−1
●
5

● ● ●
● ●● ● ● ●
● ●
● ● ● ●
● ● ● ●
● ● ●●
● ●
● ●
● ●

−2
●
● ●
●
●
●
●
0

−3
0 2 4 6 8 10 0 2 4 6 8 10
x x

Data with regression line (left) and residuals (right). Note the homoscedasticity.

3.2 Homoscedasticity

An important assumption of the linear regression model is that the residuals ✏i have the
same variance for all i. This is called homoscedasticity. You can see this is the case for
both figures above. The data hovers in the band of fixed width around the regression line
and at every x the residuals have about the same vertical spread.
Below is a figure showing heteroscedastic data. The vertical spread of the data increases as
x increases. Before using least squares on this data we would need to transform the data
to be homoscedastic.
20

●
●
● ●
●

●
●
●
● ●
● ●
15

●● ● ●
● ●
● ●
●
●
● ● ● ● ● ●●●
● ● ●
● ●
● ● ● ●
● ● ●
● ●
● ● ●● ● ●
● ●● ●●
● ●
● ●
● ● ●
10

●● ● ●
●● ● ● ●
y

● ●
● ●● ●
● ● ● ●●
● ● ●
●
● ● ●
● ● ●
● ● ● ●
● ● ●●● ●●
● ● ●
● ● ●● ● ●
● ● ●
5

●
● ●● ● ●● ●
● ● ●
●● ●●
● ● ●
● ● ●●
●● ●
●● ●
0

0 2 4 6 8 10
x

Heteroscedastic Data

4 Linear regression for fitting polynomials

When we fit a line to data it is called simple linear regression. We can also use linear
regression to fit polynomials to data. The use of the word linear in both cases may seem
confusing. This is because the word ‘linear’ in linear regression does not refer to fitting a
line. Rather it refers to the linear algebraic equations for the unknown parameters i , i.e.
each i has exponent 1.

Example 5. Take the same data as in the Example 4 and use least squares to find the
18.05 class 25, Linear regression, Spring 2014 5

best fitting parabola to the data.

answer: A parabola has the formula y = 0 + 1 x + 2 x2 . The squared error is
X
S( 0 , 1 , 2 ) = (yi ( 0 + 1 xi + 2 x2i ))2 .

After substituting the given values for each xi and yi , we can use calculus to find the triple
( 0 , 1 , 2 ) that minimizes S. With this data, we find that the least squares parabola has
equation
y = 1 2x + x2 .
Note that for 3 points the quadratic fit is perfect.
y
4

1
x
1 2 3
Least squares fit of a line (green) and a parabola (blue)
Example 6. The pairs (xi , yi ) may give the the age and vocabulary size of a n children.
Since we expect that young children acquire new words at an accelerating pace, we might
guess that a higher order polynomial might best fit the data.
Example 7. (Transforming the data) Sometimes it is necessary to transform the data
before using linear regression. For example, let’s suppose the relationship is exponential,
i.e. y = ceax . Then
ln(y) = ax + ln(c).
So we can use simple linear regression to obtain a model

ln(yi ) = ˆ0 + ˆ1 xi

and then exponentiate to obtain the exponential model

ˆ ˆ1 x
yi = e 0 e .

4.1 Overfitting

You can always achieve a better fit by using a higher order polynomial. For instance, given 6
data points (with distinct xi ) one can always find a fifth order polynomial that goes through
all of them. This can result in what’s called overfitting. That is, fitting the noise as well
as the true relationship between x and y. An overfit model will fit the original data better
but perform less well on predicting y for new values of x. Indeed, a primary challenge of
statistical modeling is balancing model fit against model complexity.
Example 8. In the plot below, we fit polynomials of degree 1, 3, and 9 to bivariate data
consisting of 10 data points. The degree 2 model (maroon) gives a significantly better fit
18.05 class 25, Linear regression, Spring 2014 6

than the degree 1 model (blue). The degree 10 model (orange) gives fits the data exactly,
but at a glance we would guess it is overfit. That is, we don’t expect it to do a good job
fitting the next data point we see.
In fact, we generated this data using a quadratic model, so the degree 2 model will tend to
perform best fitting new data points.

10
y
5
0

0 2 4 6 8 10
x

4.2 R function lm

As you would expect we don’t actually do linear regression by hand. Computationally,

linear regression reduces to solving simultaneous equations, i.e. to matrix calculations. The
R function lm can be used to fit any order polynomial to data. (lm stands for linear model).
We will explore this in the next studio class. In fact lm can fit many types of functions
besides polynomials, as you can explore using R help or google.

5 Multiple linear regression

Data is not always bivariate. It can be trivariate or even of some higher dimension. Suppose
we have data in the form of tuples

(yi , x1,i , x2,i , . . . xm,i )

We can analyze this in a manner very similar to linear regression on bivariate data. That
is, we can use least squares to fit the model

y= 0 + 1 x1 + 2 x2 + ... + m xm .

Here each xj is a predictor variable and y is the response variable. For example, we might
be interested in how a fish population varies with measured levels of several pollutants, or
we might want to predict the adult height of a son based on the height of the mother and
the height of the father.
We don’t have time in 18.05 to study multiple linear regression, but we wanted you to see
the name.
18.05 class 25, Linear regression, Spring 2014 7

6 Least squares as a statistical model

The linear regression model for fitting a line says that the value yi in the pair (xi , yi ) is
drawn from a random variable
Y i = 0 + 1 x i + "i
where the ‘error’ terms "i are independent random variables with mean 0 and standard
deviation . The standard assumption is that the "i are i.i.d. with distribution N (0, 2 ).
In any case, the mean of Yi is given by:

E(Yi ) = 0 + 1 xi + E("i ) = 0 + 1 xi .

From this perspective, the least squares method chooses the values of 0 and 1 which
minimize the sample variance about the line.
In fact, the least square estimate ( ˆ0 , ˆ1 ) coincides with the maximum likelihood estimate
for the parameters ( 0 , 1 ); that is, among all possible coefficients, ( ˆ0 , ˆ1 ) are the ones that
make the observed data most probable.

7 Regression to the mean

The reason for the term ‘regression’ is that the predicted response variable y will tend to
be ‘closer’ to (i.e., regress to) its mean than the predictor variable x is to its mean. Here
closer is in quotes because we have to control for the scale (i.e. standard deviation) of each
variable. The way we control for scale is to first standardize each variable.
xi x̄ yi ȳ
ui = p , vi = p .
sxx syy

Standardization changes the mean to 0 and variance to 1:

ū = v̄ = 0, suu = svv = 1.

The algebraic properties of covariance show

sxy
suv = p = ⇢,
sxx syy

the correlation coefficient. Thus the least squares fit to v = 0 + 1u has

ˆ1 = suv = ⇢ and ˆ0 = v̄ ˆ1 ū = 0.
suu
So the least squares line is v = ⇢u. Since ⇢ is the correlation coefficient, it is between -1 and
1. Let’s assume it is positive and less than 1 (i.e., x and y are positively but not perfectly
correlated). Then the formula v = ⇢u means that if u is positive then the predicted value
of v is less than u. That is, v is closer to 0 than u. Equivalently,
y ȳ x x̄
p < p
syy sxx

i.e., y regresses to ȳ. Notice how the standardization takes care of controlling the scale.
18.05 class 25, Linear regression, Spring 2014 8

Consider the extreme case of 0 correlation between x and y. Then, no matter what the x
value, the predicted value of y is always ȳ. That is, y has regressed all the way to its mean.
Note also that the regression line always goes through the point (x̄, ȳ).

Example 9. Regression to the mean is important in longitudinal studies. Rice (Mathemat-

ical Statistics and Data Analysis) gives the following example. Suppose children are given
an IQ test at age 4 and another at age 5 we expect the results will be positively correlated.
The above analysis says that, on average, those kids who do poorly on the first test will
tend to show improvement (i.e. regress to the mean) on the second test. Thus, a useless
intervention might be misinterpreted as useful since it seems to improve scores.

Example 10. Another example with practical consequences is reward and punishment.
Imagine a school where high performance on an exam is rewarded and low performance is
punished. Regression to the mean tells us that (on average) the high performing students
will do slighly worse on the next exam and the low performing students will do slightly
better. An unsophisticated view of the data will make it seem that punishment improved
performance and reward actually hurt performance. There are real consequences if those in
authority act on this idea.

8 Appendix

We collect in this appendix a few things you might find interesting. You will not be asked
to know these things for exams.

8.1 Proof of the formula for least square fit of a line

The most straightforward proof is to use calculus. The sum of the squared errors is
n
X
2
S( 0, 1) = (yi 1 xi 0) .
i=1

Taking partial derivatives (and remembering that xi and yi are the data, hence constant)

X n
@S
= 2(yi 1 xi 0) =0
@ 0
i=1
Xn
@S
= 2xi (yi 1 xi 0) =0
@ 1
i=1

Summing this up we get two linear equations in the unknowns 0 and 1:

⇣X ⌘ X
xi 1 + n 0 = yi
⇣X ⌘ ⇣X ⌘ X
x2i 1 + xi 0 = xi yi

Solving for 1 and 0 gives the formulas in Equation (1).

18.05 class 25, Linear regression, Spring 2014 9

A sneakier approach which avoids calculus is to standardize the data, find the best fit line,
and then unstandardize. We omit the details.

For a deluge of applications across disciplines see:

http://en.wikipedia.org/wiki/Linear_regression#Applications_of_linear_regression

8.2 Measuring the fit

Once one computes the regression coefficients, it is important to check how well the regres-
sion model fits the data (i.e., how closely the best fit line tracks the data). A common but
crude ‘goodness of fit’ measure is the coefficient of determination, denoted R2 . We’ll need
some notation to define it. The total sum of squares is given by:
X
TSS = (yi ȳ)2 .

The residual sum of squares is given by the sum of the squares of the residuals. When
fitting a line, this is: X
RSS = (yi ˆ0 ˆ1 xi )2 .

The RSS is the “unexplained” portion of the total sum of squares, i.e. unexplained by the
regression equation. The di↵erence TSS RSS is the “explained” portion of the total sum
of squares. The coefficient of determination R2 is the ratio of the “explained” portion to
the total sum of squares:
TSS RSS
R2 = .
TSS
In other words, R2 measures the proportion of the variability of the data that is accounted
for by the regression model. A value close to 1 indicates a good fit, while a value close to 0
indicates a poor fit. In the case of simple linear regression, R2 is simply the square of the
correlation coefficient between the observed values yi and the predicted values 0 + 1 xi .
Example 11. In the overfitting example (8), the values of R2 are:
degree R2
1 0.3968
2 0.9455
9 1.0000
Notice the goodness of fit measure increases as n increases. The fit is better, but the
model also becomes more complex, since it takes more coefficients to describe higher order
polynomials.

5 C 9 A 39521 D 42
No ratings yet
5 C 9 A 39521 D 42
426 pages
Chapter 3 - Bayesian Inference
No ratings yet
Chapter 3 - Bayesian Inference
114 pages
Introduction
No ratings yet
Introduction
106 pages
Lectures Ma 2203
No ratings yet
Lectures Ma 2203
209 pages
Introduction To Probability, Statistics, and Random Processes - Hossein Pishro-Nik
100% (1)
Introduction To Probability, Statistics, and Random Processes - Hossein Pishro-Nik
1,007 pages
Hossein Pishro-Nik - Introduction To Probability, Statistics, and Random Processes (2014, Kappa Research, LLC) - Libgen - Li
No ratings yet
Hossein Pishro-Nik - Introduction To Probability, Statistics, and Random Processes (2014, Kappa Research, LLC) - Libgen - Li
1,007 pages
Probability and Probability Distn
100% (2)
Probability and Probability Distn
138 pages
Probability
100% (4)
Probability
10 pages
Chapter 5 Slides
No ratings yet
Chapter 5 Slides
90 pages
Slides-Sksk
100% (1)
Slides-Sksk
151 pages
Unit 2
No ratings yet
Unit 2
102 pages
Stat Merge
No ratings yet
Stat Merge
162 pages
Lecture 01 Probability
No ratings yet
Lecture 01 Probability
51 pages
Mit18 05 s22 Probability
No ratings yet
Mit18 05 s22 Probability
112 pages
Unit V
No ratings yet
Unit V
57 pages
Lecture 02 IUB MAT 212
0% (1)
Lecture 02 IUB MAT 212
35 pages
Math 224 Module
No ratings yet
Math 224 Module
62 pages
Chap3 Probability STAT320
No ratings yet
Chap3 Probability STAT320
25 pages
Introduction To Probability and Statistics Course ID:MA2203: Course Teacher: Dr. Manas Ranjan Tripathy
No ratings yet
Introduction To Probability and Statistics Course ID:MA2203: Course Teacher: Dr. Manas Ranjan Tripathy
184 pages
Statistical Modeling & Intro To Probability
No ratings yet
Statistical Modeling & Intro To Probability
31 pages
Bayesian Reasoning and Methods
No ratings yet
Bayesian Reasoning and Methods
341 pages
Lecture 3 - Probability Final
No ratings yet
Lecture 3 - Probability Final
96 pages
UNIT 5 - Uncertainty
No ratings yet
UNIT 5 - Uncertainty
36 pages
Introduction to Gambling Theory – Know the Odds!
From Everand
Introduction to Gambling Theory – Know the Odds!
stanbook449
3.5/5 (2)
Legacies and Extensions of Charles Tilly
No ratings yet
Legacies and Extensions of Charles Tilly
40 pages
PrelimsProb MT23 26sep2023
No ratings yet
PrelimsProb MT23 26sep2023
71 pages
Chapter 5. Probability and Probability Distributions Part 2
No ratings yet
Chapter 5. Probability and Probability Distributions Part 2
82 pages
1st (1-2 Weeks)
No ratings yet
1st (1-2 Weeks)
60 pages
BK Chap12
No ratings yet
BK Chap12
74 pages
Bayesian Statistical Modeling With Stan, R, and Python (Kentaro Matsuura) (Z-Library)
No ratings yet
Bayesian Statistical Modeling With Stan, R, and Python (Kentaro Matsuura) (Z-Library)
395 pages
Slides 11 09 PDF
No ratings yet
Slides 11 09 PDF
105 pages
1 Intro
No ratings yet
1 Intro
9 pages
Slide 2023.2 MI2036 Chap1
No ratings yet
Slide 2023.2 MI2036 Chap1
109 pages
Making Models With Bayes
No ratings yet
Making Models With Bayes
51 pages
MTPDF2 Probability
No ratings yet
MTPDF2 Probability
107 pages
Probability
No ratings yet
Probability
65 pages
Uncertainty Management in Rule-Based Expert Systems
No ratings yet
Uncertainty Management in Rule-Based Expert Systems
46 pages
Week 1: Learning Objectives: Introduction To Probability
No ratings yet
Week 1: Learning Objectives: Introduction To Probability
30 pages
Mit18 05 s22 Class01-Prep-B
No ratings yet
Mit18 05 s22 Class01-Prep-B
9 pages
Probability & Statistics For Engineers: An Introduction and Overview
No ratings yet
Probability & Statistics For Engineers: An Introduction and Overview
93 pages
Probability MIT
No ratings yet
Probability MIT
116 pages
Osta L5
No ratings yet
Osta L5
26 pages
Basic Concepts of Probability: AID-521 Mathematics For Data Science
No ratings yet
Basic Concepts of Probability: AID-521 Mathematics For Data Science
16 pages
1.3 Counting
No ratings yet
1.3 Counting
13 pages
Lecture 2
No ratings yet
Lecture 2
24 pages
Chapter 4
No ratings yet
Chapter 4
52 pages
Bayes Theorem (Class Notes)
No ratings yet
Bayes Theorem (Class Notes)
11 pages
Introduction To Probability and Statistics
No ratings yet
Introduction To Probability and Statistics
5 pages
Probabailty Notes 1
No ratings yet
Probabailty Notes 1
25 pages
Probability
No ratings yet
Probability
78 pages
Probablity
No ratings yet
Probablity
10 pages
Materi Uts (Baru)
No ratings yet
Materi Uts (Baru)
92 pages
Lemh 207
No ratings yet
Lemh 207
33 pages
Attacking Probability and Statistics Problems
From Everand
Attacking Probability and Statistics Problems
David S. Kahn
No ratings yet
Aiml Iii
No ratings yet
Aiml Iii
28 pages
Week 1
No ratings yet
Week 1
71 pages
Mit18 05 s22 Class01-Prep-A
No ratings yet
Mit18 05 s22 Class01-Prep-A
3 pages
Maths
No ratings yet
Maths
13 pages
SLIDES Probability-Part1
No ratings yet
SLIDES Probability-Part1
27 pages
MA-2203: Introduction To Probability and Statistics: Lectures Notes
No ratings yet
MA-2203: Introduction To Probability and Statistics: Lectures Notes
27 pages
Dipmaths
No ratings yet
Dipmaths
15 pages
2021 - Nature - Bayesian Statistics and Modelling
100% (1)
2021 - Nature - Bayesian Statistics and Modelling
26 pages
COMP4610 Notes Set 1
No ratings yet
COMP4610 Notes Set 1
12 pages
STAT 328: Probabilities & Statistics: Lecture-5
No ratings yet
STAT 328: Probabilities & Statistics: Lecture-5
28 pages
STT 430/630/ES 760 Lecture Notes: Chapter 3: Probability
No ratings yet
STT 430/630/ES 760 Lecture Notes: Chapter 3: Probability
15 pages
Class 1, 18.05 Jeremy Orloff and Jonathan Bloom 1 Probability vs. Statistics
No ratings yet
Class 1, 18.05 Jeremy Orloff and Jonathan Bloom 1 Probability vs. Statistics
20 pages
Chapter # 4 Exhaustive Events
No ratings yet
Chapter # 4 Exhaustive Events
5 pages
7th Lecture (Introduction To Probability)
No ratings yet
7th Lecture (Introduction To Probability)
17 pages
Chapter 4
No ratings yet
Chapter 4
7 pages
Introduction To Probability: Unit 1
No ratings yet
Introduction To Probability: Unit 1
22 pages
Stats Prob Week 1
No ratings yet
Stats Prob Week 1
13 pages
Probability For GRE (Level 4-5) / Quantum CAT (Level 2) / GMAT
No ratings yet
Probability For GRE (Level 4-5) / Quantum CAT (Level 2) / GMAT
23 pages
4 - Mobile Cloud Computing
No ratings yet
4 - Mobile Cloud Computing
3 pages
Probabilistic Methods in Engineering: Lecture 3: Counting/Conditional
No ratings yet
Probabilistic Methods in Engineering: Lecture 3: Counting/Conditional
44 pages
Probability
No ratings yet
Probability
10 pages
Lecture 2
No ratings yet
Lecture 2
9 pages
Probability
No ratings yet
Probability
3 pages
Counting and Sets Class 1, 18.05 Jeremy Orloff and Jonathan Bloom 1 Learning Goals
No ratings yet
Counting and Sets Class 1, 18.05 Jeremy Orloff and Jonathan Bloom 1 Learning Goals
9 pages
Chapter 3 Probability and Counting Rules
100% (2)
Chapter 3 Probability and Counting Rules
77 pages
Robiel H. Statistics For Management
No ratings yet
Robiel H. Statistics For Management
18 pages
Chapter 2: Conditional Probability and Bayes Formula
No ratings yet
Chapter 2: Conditional Probability and Bayes Formula
12 pages
Possible Outcomes With Uncertainty As To Which Will Occur
No ratings yet
Possible Outcomes With Uncertainty As To Which Will Occur
2 pages
Table
No ratings yet
Table
10 pages
Chapter 7 Statistics
No ratings yet
Chapter 7 Statistics
7 pages
Probabilistic Reasoning in Artificial Intelligence
No ratings yet
Probabilistic Reasoning in Artificial Intelligence
5 pages
Notes 1
No ratings yet
Notes 1
4 pages
Functions and Probability for Sixth Graders
From Everand
Functions and Probability for Sixth Graders
Home School Brew
No ratings yet
Probability LP
No ratings yet
Probability LP
2 pages