
Bayesian Updating with Continuous Priors

Class 13, 18.05

Jeremy Orloff and Jonathan Bloom

1 Learning Goals

1. Understand a parameterized family of distributions as representing a continuous range of hypotheses for the observed data.

2. Be able to state Bayes’ theorem and the law of total probability for continuous densities.

3. Be able to apply Bayes’ theorem to update a prior probability density function to a posterior pdf given data and a likelihood function.

4. Be able to interpret and compute posterior predictive probabilities.

2 Introduction

Up to now we have only done Bayesian updating when we had a finite number of hypotheses,
e.g. our dice example had five hypotheses (4, 6, 8, 12 or 20 sides). Now we will study
Bayesian updating when there is a continuous range of hypotheses. The Bayesian update
process will be essentially the same as in the discrete case. As usual when moving from
discrete to continuous we will need to replace the probability mass function by a probability
density function, and sums by integrals.
The first few sections of this note are devoted to working with pdfs. In particular we will
cover the law of total probability and Bayes’ theorem. We encourage you to focus on how
these are essentially identical to the discrete versions. After that, we will apply Bayes’
theorem and the law of total probability to Bayesian updating.

3 Examples with continuous ranges of hypotheses

Here are three standard examples with continuous ranges of hypotheses.

Example 1. Suppose you have a system that can succeed or fail with probability p. Then
we can hypothesize that p is anywhere in the range [0, 1]. That is, we have a continuous
range of hypotheses. We will often model this example with a ‘bent’ coin with unknown
probability p of heads.

Example 2. The lifetime of a certain isotope is modeled by an exponential distribution exp(λ). In principle, the mean lifetime 1/λ can be any real number in (0, ∞).

Example 3. We are not restricted to a single parameter. In principle, the parameters µ and σ of a normal distribution can be any real numbers in (−∞, ∞) and (0, ∞), respectively.
If we model gestational length for single births by a normal distribution, then from millions
of data points we know that µ is about 40 weeks and σ is about one week.


In all of these examples we modeled the random process giving rise to the data by a distribution with parameters, called a parametrized distribution. Every possible choice of the parameter(s) is a hypothesis, e.g. we can hypothesize that the probability of success in Example 1 is p = 0.7313. We have a continuous set of hypotheses because we could take any value between 0 and 1.

4 Notational conventions

4.1 Parametrized models

As in the examples above, our hypotheses often take the form ‘a certain parameter has value θ’. We will often use the letter θ to stand for an arbitrary hypothesis. This will leave symbols like p, f, and x to take their usual meanings as pmf, pdf, and data. Also, rather than saying ‘the hypothesis that the parameter of interest has value θ’ we will simply say ‘the hypothesis θ’.

4.2 Big and little letters

We have two parallel notations for outcomes and probability:

1. (Big letters) Event A, probability function P(A).
2. (Little letters) Value x, pmf p(x) or pdf f(x).

These notations are related by P(X = x) = p(x), where x is a value of the discrete random variable X and ‘X = x’ is the corresponding event.

We carry these notations over to the probabilities used in Bayesian updating.

1. (Big letters) From hypotheses H and data D we compute several associated probabilities

P (H), P (D), P (H|D), P (D|H).

In the coin example we might have H = ‘the chosen coin has probability 0.6 of heads’, D
= ‘the flip was heads’, and P(D|H) = 0.6.
2. (Small letters) Hypothesis values θ and data values x both have probabilities or probability densities:
p(θ) p(x) p(θ|x) p(x|θ)
f (θ) f (x) f (θ|x) f (x|θ)
In the coin example we might have θ = 0.6 and x = 1, so p(x|θ) = 0.6. We might also write
p(x = 1|θ = 0.6) to emphasize the values of x and θ, but we will never just write p(1|0.6)
because it is unclear which value is x and which is θ.
Although we will still use both types of notation, from now on we will mostly use the small
letter notation involving pmfs and pdfs. Hypotheses will usually be parameters represented
by Greek letters (θ, λ, µ, σ, . . . ) while data values will usually be represented by English
letters (x, xi , y, . . . ).

5 Quick review of pdf and probability

Suppose X is a random variable with pdf f (x). Recall f (x) is a density; its units are
probability/(units of x).
[Figure: two graphs of the pdf f(x). On the left, the shaded area between c and d is P(c ≤ X ≤ d); on the right, a thin strip of width dx at x has area f(x) dx.]

The probability that the value of X is in [c, d] is given by

P(c ≤ X ≤ d) = ∫_c^d f(x) dx.

The probability that X is in an infinitesimal range dx around x is f (x) dx. In fact, the
integral formula is just the ‘sum’ of these infinitesimal probabilities. We can visualize these
probabilities by viewing the integral as area under the graph of f (x).
In order to manipulate probabilities instead of densities in what follows, we will make
frequent use of the notion that f (x) dx is the probability that X is in an infinitesimal range
around x of width dx. Please make sure that you fully understand this notion.
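Since everything that follows leans on this notion, here is a minimal numerical sketch (our own illustration, not part of the original notes; the density and grid size are just example choices) of ‘probability as a sum of f(x) dx terms’, using the density f(x) = 2x on [0, 1] that reappears in the coin examples below.

```python
import numpy as np

# Approximate P(0.5 <= X <= 1) for the pdf f(x) = 2x on [0, 1] by summing
# the infinitesimal probabilities f(x) dx over a fine grid.
f = lambda x: 2 * x
dx = 1e-5
x = np.arange(0.5, 1.0, dx)   # left endpoints of the small slices
prob = np.sum(f(x) * dx)      # Riemann sum approximating the integral
print(prob)                   # ~0.75, the area under f(x) between 0.5 and 1
```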

6 Continuous priors, discrete likelihoods

In the Bayesian framework we have probabilities of hypotheses –called prior and posterior
probabilities– and probabilities of data given a hypothesis –called likelihoods. In earlier
classes both the hypotheses and the data had discrete ranges of values. We saw in the
introduction that we might have a continuous range of hypotheses. The same is true for
the data, but for today we will assume that our data can only take a discrete set of values.
In this case, the likelihood of data x given hypothesis θ is written using a pmf: p(x|θ).
We will use the following coin example to explain these notions. We will carry this example
through in each of the succeeding sections.
Example 4. Suppose we have a bent coin with unknown probability θ of heads. The
value of θ is random and could be anywhere between 0 and 1. For this and the examples
that follow we’ll suppose that the value of θ follows a distribution with continuous prior
probability density f (θ) = 2θ. We have a discrete likelihood because tossing a coin has only
two outcomes, x = 1 for heads and x = 0 for tails.

p(x = 1|θ) = θ, p(x = 0|θ) = 1 − θ.

Think: This can be tricky to wrap your mind around. We have a coin with an unknown
probability θ of heads. The value of the parameter θ is itself random and has a prior pdf
f (θ). It may help to see that the discrete examples we did in previous classes are similar.
For example, we had a coin that might have probability of heads 0.5, 0.6, or 0.9. So,

we called our hypotheses H0.5 , H0.6 , H0.9 and these had prior probabilities P (H0.5 ) etc. In
other words, we had a coin with an unknown probability of heads, we had hypotheses about
that probability and each of these hypotheses had a prior probability.

7 The law of total probability

The law of total probability for continuous probability distributions is essentially the same
as for discrete distributions. We replace the prior pmf by a prior pdf and the sum by an
integral. We start by reviewing the law for the discrete case.
Recall that for a discrete set of hypotheses H1, H2, . . . , Hn the law of total probability says

P(D) = Σ_{i=1}^n P(D|Hi) P(Hi).     (1)

This is the total prior probability of D because we used the prior probabilities P(Hi).
In the little letter notation, with θ1, θ2, . . . , θn for hypotheses and x for data, the law of total probability is written

p(x) = Σ_{i=1}^n p(x|θi) p(θi).     (2)
We also called this the prior predictive probability of the outcome x to distinguish it from
the prior probability of the hypothesis θ.
Likewise, there is a law of total probability for continuous pdfs. We state it as a theorem
using little letter notation.
Theorem. Law of total probability. Suppose we have a continuous parameter θ in the range [a, b], and discrete random data x. Assume θ is itself random with density f(θ) and that x and θ have likelihood p(x|θ). In this case, the total probability of x is given by the formula

p(x) = ∫_a^b p(x|θ) f(θ) dθ.     (3)
Proof. Our proof will be by analogy to the discrete version: The probability term p(x|θ)f (θ) dθ
is perfectly analogous to the term p(x|θi )p(θi ) in Equation 2 (or the term P (D|Hi )P (Hi )
in Equation 1). Continuing the analogy: the sum in Equation 2 becomes the integral in
Equation 3.
As in the discrete case, when we think of θ as a hypothesis explaining the probability of the
data we call p(x) the prior predictive probability for x.
Example 5. (Law of total probability.) Continuing with Example 4. We have a bent coin
with probability θ of heads. The value of θ is random with prior pdf f (θ) = 2θ on [0, 1].
Suppose I flip the coin once. What is the total probability of heads?
answer: In Example 4 we noted that the likelihoods are p(x = 1|θ) = θ and p(x = 0|θ) =
1 − θ. So the total probability of x = 1 is
p(x = 1) = ∫_0^1 p(x = 1|θ) f(θ) dθ = ∫_0^1 θ · 2θ dθ = ∫_0^1 2θ² dθ = 2/3.
Since the prior is weighted towards higher probabilities of heads, so is the total probability.
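For readers who like a numerical check, here is a small sketch (ours, with an arbitrarily chosen grid size) that approximates the integral in Example 5 by the corresponding Riemann sum.

```python
import numpy as np

# Approximate p(x = 1) = integral over [0, 1] of p(x = 1 | θ) f(θ) dθ,
# with prior f(θ) = 2θ and likelihood p(x = 1 | θ) = θ.
dtheta = 1e-5
theta = np.arange(0, 1, dtheta) + dtheta / 2   # midpoints of the small slices
prior = 2 * theta                              # f(θ) = 2θ
likelihood = theta                             # p(x = 1 | θ) = θ
p_heads = np.sum(likelihood * prior * dtheta)  # sum of p(x|θ) f(θ) dθ terms
print(p_heads)                                 # ~0.6667, i.e. 2/3
```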


8 Bayes’ theorem for continuous probability densities

The statement of Bayes’ theorem for continuous pdfs is essentially identical to the statement
for pmfs. We state it including dθ so we have genuine probabilities:
Theorem. Bayes’ Theorem. Use the same assumptions as in the law of total probability,
i.e. θ is a continuous parameter with pdf f (θ) and range [a, b]; x is random discrete data;
together they have likelihood p(x|θ). With these assumptions:
f(θ|x) dθ = p(x|θ) f(θ) dθ / p(x) = p(x|θ) f(θ) dθ / ∫_a^b p(x|θ) f(θ) dθ.     (4)

Proof. Since this is a statement about probabilities it is just the usual statement of Bayes’
theorem. This is important enough to warrant spelling it out in words: Let Θ be the random
variable that produces the value θ. Consider the events

H = ‘Θ is in an interval of width dθ around the value θ’

and
D = ‘the value of the data is x’.
Then P (H) = f (θ) dθ, P (D) = p(x), and P (D|H) = p(x|θ). Now our usual form of Bayes’
theorem becomes
f(θ|x) dθ = P(H|D) = P(D|H) P(H) / P(D) = p(x|θ) f(θ) dθ / p(x).
Looking at the first and last terms in this equation we see the new form of Bayes’ theorem.
Finally, we firmly believe that it is more conducive to careful thinking about probability to keep the factor of dθ in the statement of Bayes’ theorem. But because it appears in the numerator on both sides of Equation 4, many people drop the dθ and write Bayes’ theorem in terms of densities as

f(θ|x) = p(x|θ) f(θ) / p(x) = p(x|θ) f(θ) / ∫_a^b p(x|θ) f(θ) dθ.
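For instance, with the prior f(θ) = 2θ from Example 4 and one observed head (x = 1), the density form gives f(θ|x = 1) = θ · 2θ / (2/3) = 3θ², which is exactly the posterior computed in Example 6 in the next section.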

9 Bayesian updating with continuous priors

Now that we have Bayes’ theorem and the law of total probability we can finally get to
Bayesian updating. Before continuing with Example 4, we point out two features of the
Bayesian updating table that appears in the next example:
1. The table for continuous priors is very simple: since we cannot have a row for each of
an infinite number of hypotheses, we’ll have just one row, which uses the variable θ to stand for all hypotheses.
2. By including dθ, all the entries in the table are probabilities and all our usual probability
rules apply.
Example 6. (Bayesian updating.) Continuing Examples 4 and 5. We have a bent coin
with unknown probability θ of heads. The value of θ is random with prior pdf f (θ) = 2θ.
Suppose we flip the coin once and get heads. Compute the posterior pdf for θ.

answer: We make an update table with the usual columns. Since this is our first example
the first row is the abstract version of Bayesian updating in general and the second row is
Bayesian updating for this particular example.
hypothesis | prior | likelihood | Bayes numerator | posterior
θ | f(θ) dθ | p(x = 1|θ) | p(x = 1|θ) f(θ) dθ | f(θ|x = 1) dθ
θ | 2θ dθ | θ | 2θ² dθ | 3θ² dθ
total | ∫_a^b f(θ) dθ = 1 | | p(x = 1) = ∫_0^1 2θ² dθ = 2/3 | 1

Therefore the posterior pdf (after seeing 1 heads) is f(θ|x) = 3θ².


We have a number of comments:
1. Since we used the prior probability f(θ) dθ, the hypothesis should really have been ‘the unknown parameter is in an interval of width dθ around θ’. Even for us that is too much to write, so you will have to think it every time we write that the hypothesis is θ.

2. The posterior pdf for θ is found by removing the dθ from the posterior probability in the table:

f(θ|x) = 3θ².

3. (i) As always, p(x) is the total probability. Since we have a continuous distribution, we compute an integral instead of a sum.
(ii) Notice that by including dθ in the table, it is clear what integral we need to compute
to find the total probability p(x).
4. The table organizes the continuous version of Bayes’ theorem. Namely, the posterior pdf
is related to the prior pdf and likelihood function via:
f(θ|x) dθ = p(x|θ) f(θ) dθ / ∫_a^b p(x|θ) f(θ) dθ = p(x|θ) f(θ) dθ / p(x).

Removing the dθ in the numerator of both sides we have the statement in terms of densities.

5. Regarding both sides as functions of θ, we can again express Bayes’ theorem in the form:

f (θ|x) ∝ p(x|θ) · f (θ)

posterior ∝ likelihood × prior.
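The proportionality form also suggests a simple computational recipe. The following sketch (our own illustration, not part of the notes; the grid size is an arbitrary choice) carries out Example 6 on a fine grid of θ values: multiply prior by likelihood, normalize, and recover the posterior density 3θ².

```python
import numpy as np

dtheta = 1e-4
theta = np.arange(dtheta / 2, 1, dtheta)       # midpoints of small slices of [0, 1]
prior = 2 * theta                              # prior pdf f(θ) = 2θ
likelihood = theta                             # p(x = 1 | θ) = θ, one observed head

bayes_numerator = likelihood * prior * dtheta  # p(x|θ) f(θ) dθ for each slice
p_x = bayes_numerator.sum()                    # total probability p(x = 1)
posterior = bayes_numerator / p_x              # posterior probabilities (sum to 1)
posterior_pdf = posterior / dtheta             # divide out dθ to get a density

print(p_x)                                            # ~0.6667 = 2/3
print(np.max(np.abs(posterior_pdf - 3 * theta**2)))   # tiny: matches f(θ|x) = 3θ²
```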

9.1 Flat priors

One important prior is called a flat or uniform prior. A flat prior assumes that every
hypothesis is equally probable. For example, if θ has range [0, 1] then f (θ) = 1 is a flat
prior.
Example 7. (Flat priors.) We have a bent coin with unknown probability θ of heads.
Suppose we toss it once and get tails. Assume a flat prior and find the posterior probability
for θ.

answer: This is just Example 6 with a change of prior and likelihood.

hypothesis | prior | likelihood | Bayes numerator | posterior
θ | f(θ) dθ | p(x = 0|θ) | p(x = 0|θ) f(θ) dθ | f(θ|x = 0) dθ
θ | 1 · dθ | 1 − θ | (1 − θ) dθ | 2(1 − θ) dθ
total | ∫_a^b f(θ) dθ = 1 | | p(x = 0) = ∫_0^1 (1 − θ) dθ = 1/2 | 1

So the posterior pdf is f(θ|x = 0) = 2(1 − θ).

9.2 Using the posterior pdf

Example 8. In the previous example the prior probability was flat. First show that this means that a priori the coin is equally likely to be biased towards heads or tails. Then, after observing one heads, what is the (posterior) probability that the coin is biased towards heads?
answer: Since the parameter θ is the probability the coin lands heads, the first part of the problem asks us to show P(θ > 0.5) = 0.5 and the second part asks for P(θ > 0.5 | x = 1). Updating the flat prior on one observed head (just as in Example 7, but with likelihood θ instead of 1 − θ) gives the posterior pdf f(θ|x = 1) = 2θ. Both probabilities are easily computed from the prior and posterior pdfs respectively.
The prior probability that the coin is biased towards heads is
P(θ > 0.5) = ∫_{0.5}^1 f(θ) dθ = ∫_{0.5}^1 1 · dθ = θ |_{0.5}^1 = 1/2.

A probability of 1/2 means the coin is equally likely to be biased toward heads or tails. The posterior probability that it’s biased towards heads is
P(θ > 0.5 | x = 1) = ∫_{0.5}^1 f(θ|x = 1) dθ = ∫_{0.5}^1 2θ dθ = θ² |_{0.5}^1 = 3/4.
We see that observing one heads has increased the probability that the coin is biased towards
heads from 1/2 to 3/4.
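A quick numerical check of Example 8 (our own sketch, with an assumed grid size) uses the same grid idea for both probabilities.

```python
import numpy as np

dtheta = 1e-5
theta = np.arange(0.5 + dtheta / 2, 1, dtheta)      # grid over (0.5, 1)
prior_prob = np.sum(np.ones_like(theta) * dtheta)   # ~1/2: integral of the flat prior
posterior_prob = np.sum(2 * theta * dtheta)         # ~3/4: integral of the posterior 2θ
print(prior_prob, posterior_prob)
```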

10 Predictive probabilities

Just as in the discrete case we are also interested in using the posterior probabilities of the
hypotheses to make predictions for what will happen next.
Example 9. (Prior and posterior prediction.) Continuing Examples 4, 5, 6: we have a
coin with unknown probability θ of heads and the value of θ has prior pdf f (θ) = 2θ. Find
the prior predictive probability of heads. Then suppose the first flip was heads and find the
posterior predictive probabilities of both heads and tails on the second flip.
answer: For notation let x1 be the result of the first flip and let x2 be the result of the
second flip. The prior predictive probability is exactly the total probability computed in
Examples 5 and 6.
p(x1 = 1) = ∫_0^1 p(x1 = 1|θ) f(θ) dθ = ∫_0^1 2θ² dθ = 2/3.

The posterior predictive probabilities are the total probabilities computed using the poste­
rior pdf. From Example 6 we know the posterior pdf is f (θ|x1 = 1) = 3θ2 . So the posterior
predictive probabilities are
p(x2 = 1|x1 = 1) = ∫_0^1 p(x2 = 1|θ, x1 = 1) f(θ|x1 = 1) dθ = ∫_0^1 θ · 3θ² dθ = 3/4,

p(x2 = 0|x1 = 1) = ∫_0^1 p(x2 = 0|θ, x1 = 1) f(θ|x1 = 1) dθ = ∫_0^1 (1 − θ) · 3θ² dθ = 1/4.

(More simply, we could have computed p(x2 = 0|x1 = 1) = 1 − p(x2 = 1|x1 = 1) = 1/4.)
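Here is a small sketch (ours, with an assumed grid size) checking the posterior predictive probabilities against the posterior pdf f(θ|x1 = 1) = 3θ².

```python
import numpy as np

dtheta = 1e-5
theta = np.arange(dtheta / 2, 1, dtheta)
posterior = 3 * theta**2                                  # posterior pdf after one head
p_heads_next = np.sum(theta * posterior * dtheta)         # integral of θ · 3θ² dθ = 3/4
p_tails_next = np.sum((1 - theta) * posterior * dtheta)   # integral of (1 − θ) · 3θ² dθ = 1/4
print(p_heads_next, p_tails_next)                         # ~0.75 and ~0.25
```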

11 From discrete to continuous Bayesian updating

To develop intuition for the transition from discrete to continuous Bayesian updating, we’ll
walk a familiar road from calculus. Namely we will:
(i) approximate the continuous range of hypotheses by a finite number.
(ii) create the discrete updating table for the finite number of hypotheses.
(iii) consider how the table changes as the number of hypotheses goes to infinity.
In this way, we will see the prior and posterior pmfs converge to the prior and posterior pdfs.
Example 10. To keep things concrete, we will work with the ‘bent’ coin with a flat prior f(θ) = 1 from Example 7. Our goal is to go from discrete to continuous by increasing the number of hypotheses.

4 hypotheses. We slice [0, 1] into 4 equal intervals: [0, 1/4], [1/4, 1/2], [1/2, 3/4], [3/4, 1].

Each slice has width Δθ = 1/4. We put our 4 hypotheses θi at the centers of the four slices:

θ1 : ‘θ = 1/8’, θ2 : ‘θ = 3/8’, θ3 : ‘θ = 5/8’, θ4 : ‘θ = 7/8’.


The flat prior gives each hypothesis a probability of 1/4 = 1 · Δθ. We have the table:
hypothesis | prior | likelihood | Bayes num. | posterior
θ = 1/8 | 1/4 | 1/8 | (1/4) × (1/8) | 1/16
θ = 3/8 | 1/4 | 3/8 | (1/4) × (3/8) | 3/16
θ = 5/8 | 1/4 | 5/8 | (1/4) × (5/8) | 5/16
θ = 7/8 | 1/4 | 7/8 | (1/4) × (7/8) | 7/16
Total | 1 | – | Σ_{i=1}^n θi Δθ | 1

Here are the density histograms of the prior and posterior pmf. The prior and posterior
pdfs from Example 7 are superimposed on the histograms in red.

[Figure: density histograms of the prior (left) and posterior (right) pmfs over the 4 hypotheses θ = 1/8, 3/8, 5/8, 7/8, with the continuous prior and posterior pdfs superimposed in red.]

8 hypotheses. Next we slice [0,1] into 8 intervals each of width Δθ = 1/8 and use the
center of each slice for our 8 hypotheses θi .
θ1 : ’θ = 1/16’, θ2 : ’θ = 3/16’, θ3 : ’θ = 5/16’, θ4 : ’θ = 7/16’
θ5 : ’θ = 9/16’, θ6 : ’θ = 11/16’, θ7 : ’θ = 13/16’, θ8 : ’θ = 15/16’
The flat prior gives each hypothesis the probability 1/8 = 1 · Δθ. Here are the table and density histograms.
hypothesis | prior | likelihood | Bayes num. | posterior
θ = 1/16 | 1/8 | 1/16 | (1/8) × (1/16) | 1/64
θ = 3/16 | 1/8 | 3/16 | (1/8) × (3/16) | 3/64
θ = 5/16 | 1/8 | 5/16 | (1/8) × (5/16) | 5/64
θ = 7/16 | 1/8 | 7/16 | (1/8) × (7/16) | 7/64
θ = 9/16 | 1/8 | 9/16 | (1/8) × (9/16) | 9/64
θ = 11/16 | 1/8 | 11/16 | (1/8) × (11/16) | 11/64
θ = 13/16 | 1/8 | 13/16 | (1/8) × (13/16) | 13/64
θ = 15/16 | 1/8 | 15/16 | (1/8) × (15/16) | 15/64
Total | 1 | – | Σ_{i=1}^n θi Δθ | 1

[Figure: density histograms of the prior (left) and posterior (right) pmfs over the 8 hypotheses θ = 1/16, 3/16, . . . , 15/16, with the continuous pdfs superimposed in red.]

20 hypotheses. Finally we slice [0,1] into 20 pieces. This is essentially identical to the
previous two cases. Let’s skip right to the density histograms.
[Figure: density histograms of the prior (left) and posterior (right) pmfs for 20 hypotheses, with the continuous pdfs superimposed in red.]
Looking at the sequence of plots we see how the prior and posterior density histograms
converge to the prior and posterior probability density functions.
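The discretization in this section is easy to automate. The sketch below (our own illustration; the function name and the choice of n are ours) reproduces the 4-hypothesis table above and shows that the posterior density-histogram heights lie on the line 2θ, the posterior pdf for a flat prior updated on one observed head.

```python
import numpy as np

def discrete_update(n):
    # Discrete Bayesian update for the flat-prior bent coin after one observed head.
    dtheta = 1.0 / n
    theta = (np.arange(n) + 0.5) * dtheta    # slice midpoints: 1/8, 3/8, ... when n = 4
    prior = np.full(n, dtheta)               # flat prior: each hypothesis gets 1 · Δθ
    likelihood = theta                       # p(x = 1 | θ) = θ
    numerator = likelihood * prior           # Bayes numerators p(x|θ) f(θ) Δθ
    posterior = numerator / numerator.sum()  # posterior pmf over the n hypotheses
    return theta, posterior

theta, posterior = discrete_update(4)
print(posterior)             # [0.0625 0.1875 0.3125 0.4375] = 1/16, 3/16, 5/16, 7/16
print(posterior / (1 / 4))   # density heights 0.25, 0.75, 1.25, 1.75, i.e. 2θ at the midpoints
```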
MIT OpenCourseWare
https://ocw.mit.edu

18.05 Introduction to Probability and Statistics


Spring 2014

For information about citing these materials or our Terms of Use, visit: https://ocw.mit.edu/terms.
