
An Introduction to Bayesian Reasoning and

Methods

Kevin Ross

2022-02-19
Contents

Preface

1 Introductory Example
2 Ideas of Bayesian Reasoning
3 Interpretations of Probability and Statistics
   3.1 Instances of randomness
   3.2 Interpretations of probability
   3.3 Working with probabilities
   3.4 Interpretations of Statistics
4 Bayes’ Rule
5 Introduction to Estimation
   5.1 Point estimation
6 Introduction to Inference
   6.1 Comparing Bayesian and frequentist interval estimates
7 Introduction to Prediction
   7.1 Posterior predictive checking
   7.2 Prior predictive tuning
8 Introduction to Continuous Prior and Posterior Distributions
   8.1 A brief review of continuous distributions
   8.2 Continuous distributions for a population proportion
9 Considering Prior Distributions
   9.1 What NOT to do when considering priors
10 Introduction to Posterior Simulation and JAGS
   10.1 Introduction to JAGS
11 Odds and Bayes Factors
12 Introduction to Bayesian Model Comparison
13 Bayesian Analysis of Poisson Count Data
14 Introduction to Multi-Parameter Models
15 Bayesian Analysis of a Numerical Variable
16 Comparing Two Samples
17 Introduction to Markov Chain Monte Carlo (MCMC) Simulation
18 Some Diagnostics for MCMC Simulation

Preface

Statistics is the science of learning from data. Statistics involves

• Asking questions
• Formulating conjectures
• Designing studies
• Collecting data
• Wrangling data
• Summarizing data
• Visualizing data
• Analyzing data
• Developing models
• Drawing conclusions
• Communicating results

We will assume some familiarity with many of these aspects, and we will focus
on the items in italics. That is, we will focus on statistical inference, the
process of using data analysis to draw conclusions about a population or process
beyond the existing data. “Traditional” hypothesis tests and confidence intervals
that you are familiar with are components of “frequentist” statistics. This
book will introduce aspects of “Bayesian” statistics. We will focus on analyzing
data, developing models, drawing conclusions, and communicating results
from a Bayesian perspective. We will also discuss some similarities and differences
between frequentist and Bayesian approaches, and some advantages and
disadvantages of each approach.
We want to make clear from the start: Bayesian versus frequentist is NOT a
question of “right versus wrong”. Both Bayesian and frequentist are valid ap-
proaches to statistical analyses, each with advantages and disadvantages. We’ll
address some of the issues along the way. But at no point in your career do
you need to make a definitive decision to be a Bayesian or a frequentist; a good
modern statistician is probably a bit of both.
While our focus will be on statistical inference, remember that the other parts
of Statistics are equally important, if not more important. In particular, any
statistical analysis is only as good as the data upon which it is based.


The exercises in this book are used to both motivate new topics and to help you
practice your understanding of the material. You should attempt the exercises
on your own before reading the solutions. To encourage you to do so, the
solutions have been hidden. You can reveal the solution by clicking on the
Show/hide solution button.
Show/hide solution
Here is where a solution would be, but be sure to think about the problem on
your own first!
(Careful: in your browser, the triangle for the Show/hide solution button might
be close to the back button, so clicking on Show/hide might take you to the
previous page. To avoid this, click on the words Show/hide.)
Chapter 1

Introductory Example

Statistics is the science of learning from data. But what is “Bayesian” statistics?
This chapter provides a relatively simple and brief example of a Bayesian sta-
tistical analysis. As you work through the example, think about: What aspects
are familiar? What features are new or different? Think big picture for now;
we’ll fill in lots of details later.

Example 1.1. Suppose we’re interested in the proportion of all current


Cal Poly students who have ever read at least one book in the Harry
Potter series. We’ll refer to this proportion as the “population proportion”.

1. What are some challenges to computing the population proportion? How


could we estimate it?

2. What are the possible values of the population proportion?

3. Which one of the following do you think is the most plausible value of the
population proportion? Record your value in the plot on the board.

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

4. Sketch the plot of guesses here. What seems to be the consensus? What
do we think are the most plausible values of the population proportion?
Somewhat plausible? Not plausible?

5. The plot just shows our guesses for the population proportion. How could
we estimate the actual population proportion based on data?

6. We will treat the roughly 30 students in our class as a random sample


from the population of current Cal Poly students. But before collecting
the data, let’s consider what might happen.


Suppose that the actual population proportion is 0.5. That is, suppose
that 50% of current Cal Poly students have read at least one Harry Potter
book. How many students in a random sample of 30 students would you
expect to have read at least one Harry Potter book? Would it necessarily
be 15 students? How could you use a coin to simulate how many students
in a random sample of 30 students might have read at least one Harry
Potter book?

7. Now suppose the actual population proportion is 0.1. How would the
previous part change?

8. Using your choice for the most plausible value of the population propor-
tion, simulate how many students in a random sample of 30 students might
have read at least one Harry Potter book. Repeat to get a few hypothetical
samples, using your guess for the most plausible value of the population
proportion, and record your results in the plot. (Here is one applet you
can use.)

9. Why are there more dots corresponding to a proportion of 0.4 than to a


proportion of 0.9?

10. How could we get an even clearer picture of what might happen?

11. Sketch the plot that we created. The plot illustrates two sources of uncer-
tainty or variability. What are these two sources?

12. So far, everything we’ve considered is what might happen in a class of


30 students. Now let’s see what is actually true for our class. What
proportion of students in the class have read at least one Harry Potter
book? Is the proportion of all current Cal Poly students who have read at
least one Harry Potter book necessarily equal to the sample proportion?

13. Remember that we started with guesses about which values of the popula-
tion proportion were more plausible than others, and we used these guesses
to get a picture of what might happen in samples. How can we reconsider
the plausibility of the possible values of the population proportion in light
of the sample data that we actually observed?

14. Given the observed sample proportion, what can we say about the plau-
sible values of the population proportion? How has our assessment of
plausibility changed from before observing the sample data?

15. What elements of the analysis are similar to the kinds of statistical analysis
you have done before? What elements are new or different?

Solution to Example 1.1

Show/hide solution

1. It would be extremely challenging to survey all Cal Poly students. Even if


we were able to obtain contact information for all students, many students
would not respond to the survey. Instead, we can take a sample of Cal Poly
students, collect data for students in the sample, and use the proportion
of students in the sample who have read at least one Harry Potter book
as a starting point to estimate the population proportion.

2. The population proportion could possibly be any value in the interval [0,
1]. Between 0% and 100% of current Cal Poly students have read at least
one Harry Potter book.
3. There is no right answer for what you think is most plausible. Maybe
you have a lot of friends that have read at least one Harry Potter book,
so you might think the population proportion is 0.8. Maybe you don’t
know anyone who has read at least one Harry Potter book, so you might
think the population proportion is 0.1. Maybe you have no idea and
you just guess that the population proportion is 0.5. Everyone has their
own background information which influences their initial assessment of
plausibility.
4. Results for the class will vary, but Figure 1.1 shows an example. The
consensus for the class in Figure 1.1 is that values of 0.3, 0.4, and 0.5
are most plausible, 0.2 and 0.6 less so, and values close to 0 or 1 are not
plausible.

5. We could use our class of 30 students as a sample, ask each student if


they have read at least one Harry Potter book, and find the proportion of
students in our class who have read at least one Harry Potter book.
6. If the actual population proportion is 0.5, we would expect around 15
students in a random sample of 30 students to have read at least one
Harry Potter book. However, there would be natural sample-to-sample
variability. To get a sense of this variability we could:

• Flip a fair coin. Heads represents a student who has read at least
one Harry Potter book; Tails, not.
• A set of 30 flips represents one hypothetical random sample of 30
students.
• The number of the 30 flips that land on Heads represents one hypo-
thetical value of the number of students in a random sample of 30
students who have read at least one Harry Potter book.
• Repeat the above process to get many hypothetical values of the
number of students in a random sample of 30 students who have
read at least one Harry Potter book, assuming that the population
proportion is 0.5.
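The coin-flipping process above can be sketched in R, the language used throughout this book. This is an illustrative sketch, not the book’s own code: rbinom plays the role of counting the Heads in 30 flips of a fair coin.

```r
# One hypothetical sample of 30 students is equivalent to 30 coin flips;
# rbinom(n_rep, 30, 0.5) counts the Heads in each of n_rep sets of 30 flips.
set.seed(111)
n_rep <- 10000  # number of hypothetical samples

# Number of students (out of 30) who have read at least one HP book in each
# hypothetical sample, assuming the population proportion is 0.5
sample_counts <- rbinom(n_rep, size = 30, prob = 0.5)

mean(sample_counts)    # close to 30 * 0.5 = 15 students on average
table(sample_counts)   # but with natural sample-to-sample variability
```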

7. If the population proportion is 0.1 we would expect around 3 students in
a random sample of 30 students to have read at least one Harry Potter
book. Again, there would be natural sample-to-sample variability. To get
a sense of this variability we could:

• Roll a fair 10-sided die. A roll of 1 represents a student who has read
at least one Harry Potter book; all other rolls, not.
• A set of 30 rolls represents one hypothetical random sample of 30
students.
• The number of the 30 rolls that land on 1 represents one hypothetical
value of the number of students in a random sample of 30 students
who have read at least one Harry Potter book.
• Repeat the above process to get many hypothetical values of the
number of students in a random sample of 30 students who have
read at least one Harry Potter book, assuming that the population
proportion is 0.1.

8. Figure 1.2 shows the number of students who have read at least one Harry
Potter book in 5 hypothetical samples assuming the population proportion
is 0.5, and in 5 hypothetical samples assuming the population proportion
is 0.1.
9. Results for the class will vary. In the scenario in Figure 1.1, a value of 0.4
was initially more plausible than a value of 0.9. There were more students
who thought 0.4 was the most plausible value than 0.9. So the value 0.4
gets more “weight” in the simulation than 0.9. The plot on the left in
Figure 1.3 reflects the results of a simulation where every student who
plotted a dot in Figure 1.1 simulates 5 random samples of size 30, using
their guess for the population proportion.
10. Repeat the simulation process to get many hypothetical samples for each
value for the population proportion, reflecting differences in initial plau-
sibility. Imagine each student simulated 10000 samples instead of 5. The
plot on the right in Figure 1.3 displays the results.
11. The plot illustrates natural sample-to-sample variability in the sample
proportion for a given value of the population proportion. The plot also
illustrates the uncertainty in the value of the population proportion. That
is, the population proportion has a distribution of values determined by
our relative initial plausibilities.
12. Results will vary. We’ll assume that 9 out of 30 students have read at least
one Harry Potter book, for a sample proportion of 9/30 = 0.3. While we
hope that 0.3 is close to the proportion of all current Cal Poly students
who have read at least one Harry Potter book, because of natural sample-
to-sample variability the sample proportion is not necessarily equal to the
population proportion.
13. The simulation demonstrated what might happen in a sample of size 30.
Now we can zoom in on what actually did happen. Among the samples
in the simulation that resulted in 9 students having read a Harry Potter
book, what were the corresponding population proportions?

14. Figure 1.4 displays the results based on the smaller scale simulation in the
plot on the left in Figure 1.3, in which every initial guess for the sample
proportion generated five hypothetical samples of size 30. Now we focus
on samples that resulted in a sample proportion of 9/30, the observed
sample proportion. The middle plot displays the population proportions
corresponding to samples with a sample proportion of 9/30. The distribution
of all the dots in the middle plot illustrates our initial plausibility.
The plot on the right displays only the green dots, which correspond to
samples with a sample proportion of 9/30. The distribution in the plot on
the right reflects a reassessment of the plausibilities of possible values of
the population proportion given the observed sample proportion of 9/30
and the simulation results. Among the simulated samples that resulted in
a sample proportion of 9/30, the population proportion was much more
likely to be 0.3 than to be 0.5.
Figure 1.5 displays the same analysis based on the full simulation from the
plot on the right in Figure 1.3. The plot on the right in Figure 1.5 compares
the initial plausibilities to the plausibilities revised upon observing
a sample proportion of 9/30. Initially, the values 0.3, 0.4, and 0.5 were
roughly equally plausible, and more plausible than any other value. After
observing a sample proportion of 9/30:

• 0.3 is the most plausible value of the population proportion


• 0.3 is about two times more plausible than the next most plausible
value, 0.4
• 0.3 and 0.4 together account for the bulk of plausibility.
• Initially, 0.5 was much more plausible than 0.2, but given the ob-
served data 0.2 is now more plausible than 0.5 (though neither is
very plausible)
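The whole process, from drawing values of the population proportion according to initial plausibility to conditioning on the observed count of 9, can be sketched in R. The prior weights below are hypothetical stand-ins for the class guesses, not the actual data from Figure 1.1.

```r
# Sketch of simulate-then-condition, with made-up prior weights
# (hypothetical counts of student guesses for each candidate value)
set.seed(222)
theta_grid   <- c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6)
prior_weight <- c(1, 2, 8, 8, 6, 5)   # hypothetical, for illustration only

n_rep <- 100000
# Step 1: draw a population proportion according to initial plausibility
theta_sim <- sample(theta_grid, n_rep, replace = TRUE, prob = prior_weight)
# Step 2: simulate the sample count for a class of 30 students
count_sim <- rbinom(n_rep, size = 30, prob = theta_sim)

# Step 3: condition on the observed data, keeping only repetitions in
# which 9 of the 30 students have read at least one HP book
theta_given_9 <- theta_sim[count_sim == 9]
# Approximate revised (posterior) plausibility of each value of theta
round(table(theta_given_9) / length(theta_given_9), 3)
```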

15. Familiar elements include:

• using sample statistics to make inference about population parame-


ters
• reflecting sample-to-sample variability of statistics, for a given value
of the population parameter
• using simulation to analyze data and understand ideas

New elements include:

• Quantifying the uncertainty of the population proportion with rela-


tive plausibilities of possible values
• Treating the population proportion as a variable with a distribution
determined by the relative plausibilities
• Conditioning on the observed data and revising our assessment of
plausibilities

Figure 1.1: Example plot of the guesses of 30 students for the most plausible
value of the proportion of current Cal Poly students who have read at least one
Harry Potter book.
Figure 1.2: Number of students who have read at least one Harry Potter book in
hypothetical samples of size 30. Five samples simulated assuming the population
proportion is 0.1 (yellow), and five samples simulated assuming the population
proportion is 0.5 (purple).

Figure 1.3: Simulation of the number of students who have read at least one
Harry Potter book in hypothetical samples of size 30, reflecting initial plausibility
of values of the population proportion from Figure 1.1. Left: 5 hypothetical
samples for each guess for the population proportion. Right: 10000 hypothetical
samples for each guess for the population proportion.

Figure 1.4: Left: Simulation results from the plot on the left in Figure 1.3
highlighting samples with a sample proportion of 9/30. Middle: Comparison
of initial distribution of population proportion with conditional distribution of
population proportion given a sample proportion of 9/30. Right: Distribution
reflecting relative plausibility of possible values of the population proportion
after observing a sample of 30 students in which 9 have read at least one Harry
Potter book.

Figure 1.5: Left: Simulation results from the plot on the right in Figure 1.3
highlighting samples with a sample proportion of 9/30. Right: Distribution
reflecting relative plausibility of possible values of the population proportion,
both “prior” plausibility (blue) and “posterior” plausibility after observing a
sample of 30 students in which 9 have read at least one Harry Potter book
(green).
Chapter 2

Ideas of Bayesian Reasoning

In this section we’ll take a closer look at the example from the previous section.
In particular, we’ll inspect the simulation process and results in more detail.
We’ll also consider a more “realistic” scenario.
In the previous section, each student identified a “most plausible” value. We
collected these guesses to form a “wisdom of crowds” measure of initial plausi-
bility.
However, your own initial assessment of the plausibility of the different values
could involve much more than just identifying the most plausible value. What
was your next most plausible value? How much less plausible was it? What
about the other values and their relative plausibilities?
In the example below we’ll start with an assessment of plausibility. We’ll discuss
later how you might obtain such an assessment. For now, focus on the big
picture: we start with some initial assessment of plausibility before observing
data, and we want to update that assessment upon observing some data.

Example 2.1. Suppose we’re interested in the proportion of all current


Cal Poly students who have ever read at least one book in the Harry
Potter series. We’ll refer to this proportion as the “population proportion”
and denote it as 𝜃 (the Greek letter “theta”).

1. We’ll start by considering only the values 0.1, 0.2, 0.3, 0.4, 0.5, 0.6 as
initially plausible for the population proportion 𝜃. Suppose that before
collecting sample data, our prior assessment is that:

• 0.2 is four times more plausible than 0.1


• 0.3 is two times more plausible than 0.2
• 0.3 and 0.4 are equally plausible
• 0.2 and 0.5 are equally plausible


• 0.1 and 0.6 are equally plausible

Construct a table to represent this prior distribution and sketch a plot of


it.

2. Discuss what the prior distribution from the previous part represents.

3. If we simulate many values of the population proportion according to the


above distribution, on what proportion of repetitions would we expect to
see a value of 0.1? If we conduct 260000 repetitions of the simulation,
on how many repetitions would we expect to see a value of 0.1? Repeat
for the other plausible values to make a table of what we would expect
the simulation results to look like. (For now, we’re ignoring simulation
variability; just consider expected values.)

4. The “prior” describes our initial assessment of plausibility of possible val-


ues of the population proportion prior to observing data. Now suppose we
observe a sample of 30 students in which 9 students have read at least one
HP book. We want to update our assessment of the population proportion
in light of the observed data.
Suppose that the actual population proportion is 𝜃 = 0.1. How could we
use simulation to determine the likelihood of observing 9 students who
have read at least one HP book in a sample of 30 students? How could
we use math? (Hint: what is the probability distribution of the number
of “successes” in a sample of size 30 when 𝜃 = 0.1?)

5. Recall the simulation with 260000 repetitions that we started above. Con-
sider the 10000 repetitions in which the population proportion is 0.1. Sup-
pose that for each of these repetitions we simulate the number of students
in a class of 30 who have read at least one HP book. On what proportion
of these repetitions would we expect to see a sample count of 9? On how
many of these 10000 repetitions would we expect to see a sample count of
9?

6. Repeat the previous part for each of the possible values of 𝜃: 0.1, … , 0.6.
Add two columns to the table: one column for the likelihood of observing
a count of 9 in a sample of size 30 for each value of 𝜃, and one column for
the expected number of repetitions in the simulation which would result
in the count of 9.

7. Consider just the repetitions that resulted in a simulated sample count


of 9. What proportion of these repetitions correspond to a population
proportion of 0.1? Of 0.2? Continue for the other possible values of 𝜃 to
construct this posterior distribution, and sketch a plot of it.

8. After observing a sample of 30 Cal Poly students with a proportion of
9/30 = 0.3 who have read at least one Harry Potter book, what can we
say about the plausibility of possible values of the population proportion?
How has our assessment of plausibility changed from before observing the
sample data?

9. Prior to observing data, how many times more plausible is a value of 0.3
than 0.2 for the population proportion 𝜃?

10. Recall that we observed a sample count of 9 in a sample of size 30. How
many times more likely is a count of 9 in a sample of size 30 when the
population proportion 𝜃 is 0.3 than when it is 0.2?

11. After observing data, how many times more plausible is 0.3 than 0.2 for
the population proportion 𝜃?

12. How are the values from the three previous parts related?

Solution to Example 2.1

1. The relative plausibilities allow us to draw the shape of the plot below:
the spike for 0.2 is four times as high as the one for 0.1; the spike for 0.3
is two times higher than the spike for 0.2, etc. To make a distribution we
need to rescale the heights, maintaining the relative ratios, so that they
add up to 1. It helps to consider one value as a “baseline”; we’ll choose
0.1 (but it doesn’t matter which value is the baseline). Assign 1 “unit” of
plausibility to the value 0.1. Then 0.6 also gets 1 unit of plausibility. The
values 0.2 and 0.5 each receive 4 units of plausibility, and the values 0.3
and 0.4 each receive 8 units of plausibility. The six values account for 26
total units of plausibility. Divide the units by 26 to obtain values that sum
to 1. See the “Prior” column in the table below. Check that the relative
ratios are maintained; for example, the rescaled prior plausibility of 0.308
for 0.3 is two times larger than the rescaled prior plausibility of 0.154 for
0.2.

Population proportion   Prior “units”   Prior
0.1                     1               0.0385
0.2                     4               0.1538
0.3                     8               0.3077
0.4                     8               0.3077
0.5                     4               0.1538
0.6                     1               0.0385
Total                   26              1.0000
[Plot of the prior plausibility (spike heights) for each of the six possible values
of the population proportion.]
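The rescaling step described above can be checked with a few lines of R:

```r
# "Units" of plausibility for theta = 0.1, 0.2, 0.3, 0.4, 0.5, 0.6
units <- c(1, 4, 8, 8, 4, 1)
prior <- units / sum(units)   # rescale so the plausibilities sum to 1
round(prior, 4)   # 0.0385 0.1538 0.3077 0.3077 0.1538 0.0385

sum(prior)            # 1
prior[3] / prior[2]   # 2: 0.3 is still two times more plausible than 0.2
```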

2. The parameter 𝜃 is an uncertain quantity. We are considering this quantity


as a random variable, and using distributions to describe our degree of
uncertainty. The prior distribution represents our degree of uncertainty, or
our assessment of relative plausibility of possible values of the parameter,
prior to observing data.
3. The relative prior plausibility of 0.1 is 1/26, so we would expect to see
a value of 0.1 for 𝜃 on about 3.8% of repetitions. If we conduct 260000
repetitions of the simulation, we would expect to see a value of 0.1 for 𝜃
on about 10000 repetitions.
Population proportion   Prior “units”   Prior    Number of reps
0.1                     1               0.0385   10000
0.2                     4               0.1538   40000
0.3                     8               0.3077   80000
0.4                     8               0.3077   80000
0.5                     4               0.1538   40000
0.6                     1               0.0385   10000
Total                   26              1.0000   260000

4. If the population proportion is 0.1 we would expect around 3 students in


a random sample of 30 students to have read at least one Harry Potter
book. Again, there would be natural sample-to-sample variability. To get
a sense of this variability we could:

• Roll a fair 10-sided die. A roll of 1 represents a student who has read
at least one Harry Potter book; all other rolls, not.
• A set of 30 rolls represents one hypothetical random sample of 30
students.
• The number of the 30 rolls that land on 1 represents one hypothetical
value of the number of students in a random sample of 30 students
who have read at least one Harry Potter book.
• Repeat the above process to get many hypothetical values of the
number of students in a random sample of 30 students who have
read at least one Harry Potter book, assuming that the population
proportion is 0.1.

If 𝑌 is the number of students in the sample who have read at least one HP
book, then 𝑌 has a Binomial(30, 𝜃) distribution. If 𝜃 = 0.1 then 𝑌 has
a Binomial(30, 0.1) distribution. If 𝜃 = 0.1 the probability that 𝑌 = 9 is
(30 choose 9)(0.1)^9 (0.9)^21 = 0.0016, which can be computed using dbinom(9,
30, 0.1) in R.

dbinom(9, 30, 0.1)

## [1] 0.001565

5. From the previous part, the likelihood of observing a count of 9 in a


sample of size 30 when 𝜃 = 0.1 is 0.0016. If 𝜃 = 0.1 then we would expect
to observe a sample count of 9 in about 0.16% of samples. In the 10000
repetitions with 𝜃 = 0.1, we would expect to observe a count of 9 in about
10000 × 0.0016 = 16 repetitions.

6. See the table below. For example, if 𝜃 = 0.2 then the likelihood of a
sample count of 9 in a sample of size 30 is (30 choose 9)(0.2)^9 (0.8)^21 = 0.068,
which can be computed using dbinom(9, 30, 0.2). If 𝜃 = 0.2 then we would
expect to observe a sample count of 9 in about 6.8% of samples. In the
40000 repetitions with 𝜃 = 0.2, we would expect to observe a count of 9
in about 40000 × 0.068 = 2703 repetitions.
In the table below, “Likelihood of 9” represents the probability of a sample
count of 9 in a sample of size 30 computed for each possible value of 𝜃. Note
that this column does not sum to 1, as the values in this column do not
comprise a probability distribution. Rather, the values in the likelihood
column represent the probability of the same event (sample count of 9)
computed under various different scenarios (different possible values of 𝜃).
The “Repetitions with a count of 9” column corresponds to the green dots
in Figure 1.5. The prior plausibilities and total number of repetitions are
different between the two examples, but the process is the same. (The
overall “Number of reps” column corresponds to all the dots.)
Population proportion   Prior “units”   Prior    Number of reps   Likelihood of count of 9   Reps with count of 9
0.1                     1               0.0385   10000            0.0016                     16
0.2                     4               0.1538   40000            0.0676                     2703
0.3                     8               0.3077   80000            0.1573                     12583
0.4                     8               0.3077   80000            0.0823                     6582
0.5                     4               0.1538   40000            0.0133                     533
0.6                     1               0.0385   10000            0.0006                     6
Total                   26              1.0000   260000           0.3227                     22423
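The likelihood column and the expected repetition counts can be reproduced in R with dbinom:

```r
theta <- c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6)

# Likelihood: probability of a sample count of 9 in a sample of size 30,
# computed for each possible value of theta
likelihood <- dbinom(9, size = 30, prob = theta)
round(likelihood, 4)   # 0.0016 0.0676 0.1573 0.0823 0.0133 0.0006

# Expected number of repetitions (out of 260000) resulting in a count of 9
n_reps <- c(10000, 40000, 80000, 80000, 40000, 10000)
round(n_reps * likelihood)   # 16 2703 12583 6582 533 6
```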

7. There were 22423 repetitions that resulted in a simulated sample count of


9. Of these, 16 correspond to a population proportion of 0.1. Therefore,
the proportion of repetitions that resulted in a count of 9 that correspond
to a proportion of 0.1 is 16 / 22423 = 0.0007. The proportion of repetitions
that resulted in a count of 9 that correspond to a proportion of 0.2 is 2703
/ 22423 = 0.1205. See the “Posterior” column in the table below.

Population proportion   Prior “units”   Prior    Number of reps   Likelihood of count of 9   Reps with count of 9   Posterior
0.1                     1               0.0385   10000            0.0016                     16                     0.0007
0.2                     4               0.1538   40000            0.0676                     2703                   0.1205
0.3                     8               0.3077   80000            0.1573                     12583                  0.5612
0.4                     8               0.3077   80000            0.0823                     6582                   0.2936
0.5                     4               0.1538   40000            0.0133                     533                    0.0238
0.6                     1               0.0385   10000            0.0006                     6                      0.0003
Total                   26              1.0000   260000           0.3227                     22423                  1.0000
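The same posterior distribution can be computed directly, without counting repetitions: multiply prior by likelihood for each value of 𝜃, then rescale so the results sum to one. A short R sketch:

```r
theta      <- c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6)
prior      <- c(1, 4, 8, 8, 4, 1) / 26
likelihood <- dbinom(9, size = 30, prob = theta)

product   <- prior * likelihood       # unnormalized posterior
posterior <- product / sum(product)   # rescale so the values sum to 1
round(posterior, 4)   # 0.0007 0.1205 0.5612 0.2936 0.0238 0.0003
```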

8. The plot below compares our prior plausibility (blue) and our posterior
plausibility (green) after observing the data. The values 0.3 and 0.4 initially
were equally plausible for 𝜃, but after observing a sample proportion
of 0.3 the value 0.3 is now almost 2 times more plausible than a value of 0.4.
The values 0.2, 0.3, 0.4 together accounted for about 77% of our initial
plausibility, but after observing the data these three values now account
for over 97% of our plausibility.
[Plot comparing prior plausibility (blue) and posterior plausibility (green) for
each value of the population proportion.]

9. Our prior assessment was that a value of 0.3 is 2 times more plausible
than 0.2 for the population proportion 𝜃.

10. A count of 9 in a sample of size 30 is 0.1573 / 0.0676 = 2.33 times more
likely when the population proportion 𝜃 is 0.3 than when it is 0.2.

11. After observing a count of 9 in a sample of size 30, a value of 0.3 is 0.5612 /
0.1205 = 4.66 times more plausible than 0.2 for the population proportion
𝜃.

12. The ratio of the posterior plausibilities (4.66) is the product of the ratio
of the prior plausibilities (2) and the ratio of the likelihoods (2.33). In short,
the posterior is proportional to the product of the prior and the likelihood.
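This relationship can be checked numerically in R:

```r
prior_ratio <- (8 / 26) / (4 / 26)                  # 0.3 vs 0.2, before data
lik_ratio   <- dbinom(9, 30, 0.3) / dbinom(9, 30, 0.2)
post_ratio  <- prior_ratio * lik_ratio              # 0.3 vs 0.2, after data

round(c(prior_ratio, lik_ratio, post_ratio), 2)     # 2.00 2.33 4.66
```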

Example 2.2. In the previous example we only considered the values 0.1, 0.2,
…, 0.6 as plausible. Now we’ll consider a more realistic scenario.

1. What is the problem with considering only the values 0.1, 0.2, …, 0.6 as
plausible? How could we resolve this issue?

2. Suppose we have a prior distribution that assigns initial relative plausibil-


ity to a fine grid of possible values of the population proportion 𝜃, e.g., 0,
0.0001, 0.0002, 0.0003, …, 0.9999, 1. We then observe a sample count of
9 in a sample of size 30. Explain the process we would use to construct a
table like the one in the previous example to find the posterior distribution
of the population proportion 𝜃.

3. How could we use the posterior distribution to fill in the blank in the
following: “There is a [blank] percent chance that fewer than 50 percent
of current Cal Poly students have read at least one HP book.”
4. What are some other questions of interest regarding 𝜃? How could you
use the posterior distribution to answer them?
Solution to Example 2.2

1. These six values are not the only possible values of 𝜃. The parameter 𝜃 is
a proportion, which could take any value in [0, 1]. We really want a prior
that assigns relative plausibility to all values in the continuous interval [0,
1]. One way to bridge the gap is to consider a fine grid of values in [0, 1],
rather than all possible values.
We’ll consider the possible values of 𝜃 to be 0, 0.0001, 0.0002, 0.0003, … , 0.9998, 0.9999, 1
and assign a relative plausibility to each of these values. We’ll start
with our assessment from the previous example: 0.1 and 0.6 are equally
plausible, 0.2 is four times more plausible than 0.1, etc. We’ll assign
plausibility to in between values by “smoothly connecting the dots”. In
the plot below this is achieved with a Normal distribution, but the details
are not important for now. Just understand that (1) we have expanded
our grid of possible values of 𝜃, and (2) we have assigned a relative
plausibility to each of the possible values.
(Plot: prior plausibility versus population proportion, over the grid from 0 to 1.)

2. The table has one row for each possible value of 𝜃: 0, 0.0001, 0.0002, … , 0.9999, 1.

• Prior: There would be a column for prior plausibility, say corresponding
to the plot above.

• Likelihood: For each value of 𝜃, we would compute the likelihood of
observing a sample count of 9 in a sample of size 30: (30 choose 9) 𝜃⁹ (1 − 𝜃)²¹,
or dbinom(9, 30, theta) in R.


• Product: For each value of 𝜃, compute the product of prior and
likelihood. This is essentially what we did in the previous example in
the “Reps with a count of 9” column. Here we’re just not multiplying
by the total number of repetitions.
• Posterior: The product column gives us the relative ratios. For ex-
ample, the product column tells us that the posterior plausibility of
0.3 is 4.66 times greater than the posterior plausibility of 0.2. We
simply need to rescale these values — by dividing by the sum of the
product column — to obtain posterior plausibilities in the proper
ratios that add up to one.

The following is some code; think of this as creating a spreadsheet. (Note
that only a few select rows of the spreadsheet are displayed below.) We
will explore code like this in much more detail as we go. For now, just
notice that we can accomplish what we wanted in just a few lines of code.

# Possible values of theta
theta = seq(0, 1, 0.0001)

# Prior distribution
# Smoothly connect the dots using a Normal distribution
# Then rescale to sum to 1
prior = dnorm(theta, 0.35, 0.12) # prior "units" - relative values
prior = prior / sum(prior) # rescale to sum to 1

# Likelihood
# Likelihood of observing sample count of 9 out of 30
# for each theta
likelihood = dbinom(9, 30, theta)

# Posterior
# Product gives relative posterior plausibilities
# Then rescale to sum to 1
product = prior * likelihood
posterior = product / sum(product)

# Put the columns together
bayes_table = data.frame(theta,
                         prior,
                         likelihood,
                         product,
                         posterior)

# Display a portion of the table
# (uses slice() from the dplyr package and kable() from knitr)
bayes_table %>%
  slice(seq(2001, 4001, 250)) %>% # selects a few rows to display
  kable(digits = 8)

theta prior likelihood product posterior


0.200 0.0001525 0.06756 0.00001030 0.0001204
0.225 0.0001936 0.10012 0.00001938 0.0002265
0.250 0.0002353 0.12981 0.00003055 0.0003570
0.275 0.0002740 0.15019 0.00004115 0.0004808
0.300 0.0003054 0.15729 0.00004803 0.0005613
0.325 0.0003259 0.15062 0.00004909 0.0005736
0.350 0.0003330 0.13285 0.00004424 0.0005170
0.375 0.0003259 0.10847 0.00003535 0.0004131
0.400 0.0003054 0.08228 0.00002512 0.0002936

The plot below displays the prior, likelihood1, and posterior. Notice that the
likelihood of the observed data is highest for 𝜃 near 0.3, so our plausibility
has “moved” in the direction of 𝜃 near 0.3 after observing the data.

# The code below plots three curves using ggplot2,
# one for each of prior, likelihood, posterior
# (there are easier/better ways to do this)
ggplot(bayes_table, aes(x = theta)) +
  geom_line(aes(y = posterior), col = "seagreen") +
  geom_line(aes(y = prior), col = "skyblue") +
  geom_line(aes(y = likelihood / sum(likelihood)), col = "orange") +
  labs(x = "Population proportion (theta)",
       y = "Plausibility",
       title = "Prior (blue), (Scaled) Likelihood (orange), Posterior (green)") +
  theme_bw()

1 Prior and posterior are distributions which sum to 1, so prior and posterior are on the
same scale. However, the likelihood does not sum to anything in particular. In order to plot
the likelihood on the same scale, it has been rescaled to sum to 1. Only the relative shape of
the likelihood matters, not its absolute scale.

(Plot: prior (blue), scaled likelihood (orange), and posterior (green) plausibility
versus population proportion theta.)

3. Sum the posterior plausibilities for 𝜃 values less than 0.5. We can see from
the plot that almost all of our plausibility is placed on values of 𝜃 less than
0.5.

sum(posterior[theta < 0.5])

## [1] 0.9939

4. The posterior distribution gives us lots of information. We might be interested
in questions like: “There is a [blank1] percent chance that between
[blank2] and [blank3] percent of Cal Poly students have read a HP book.”
For example, for 80 percent in blank1, we might compute the 10th per-
centile for blank2 and the 90th percentile for blank3. Using the spread-
sheet, start from 𝜃 = 0 and go down the table summing the posterior
probabilities until they reach 0.1; the corresponding 𝜃 value is the 10th
percentile.

theta[max(which(cumsum(posterior) < 0.1))]

## [1] 0.235

We can find the 90th percentile similarly.

theta[max(which(cumsum(posterior) < 0.9))]



## [1] 0.411

There is an 80% chance that between 24% and 41% of Cal Poly students
have read a HP book.
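Putting the pieces of this example together, here is a self-contained sketch (our own consolidation of the chapter's grid approach) that computes the central 80% interval in one pass:

```r
# Grid of possible values of theta, with prior, likelihood, and posterior
theta = seq(0, 1, 0.0001)
prior = dnorm(theta, 0.35, 0.12)
prior = prior / sum(prior)              # rescale to sum to 1
likelihood = dbinom(9, 30, theta)       # likelihood of a count of 9 out of 30
posterior = prior * likelihood
posterior = posterior / sum(posterior)  # rescale to sum to 1

# 10th and 90th percentiles of the posterior: endpoints of a central
# 80% posterior interval for the population proportion
lower = theta[max(which(cumsum(posterior) < 0.1))]
upper = theta[max(which(cumsum(posterior) < 0.9))]
c(lower, upper)  # about 0.235 and 0.411, matching the output above
```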
Chapter 3

Interpretations of Probability and Statistics

You have some familiarity with “probability” or “chance” or “odds”. But what
do we really mean when we talk about “probability”? It turns out there are two
main interpretations: relative frequency and “subjective” probability. These two
interpretations provide the philosophical foundation for two schools of statistics:
frequentist (hypothesis tests and confidence intervals that you’ve seen before)
and Bayesian (what this book is about). This chapter introduces the two inter-
pretations.

3.1 Instances of randomness


A wide variety of situations involve probability. Consider just a few examples.

1. The probability that you roll doubles in a turn of a board game.


2. The probability you win the next Powerball lottery if you purchase a single
ticket, 4-8-15-16-42, plus the Powerball number, 23.
3. The probability that a “randomly selected” Cal Poly student is a California
resident.
4. The probability that the high temperature in San Luis Obispo tomorrow
is above 90 degrees F.
5. The probability that Hurricane Peter makes landfall in the U.S.
6. The probability that the San Francisco 49ers win the next Superbowl.
7. The probability that President Biden wins the 2024 U.S. Presidential Elec-
tion.
8. The probability that extraterrestrial life currently exists somewhere in the
universe.


9. The probability that Alexander Hamilton actually wrote 51 of the Federalist
Papers. (The papers were published under a common pseudonym
and authorship of some of the papers is disputed.)
10. The probability that you ate an apple on April 17, 2009.

Example 3.1. How are the situations above similar, and how are they differ-
ent? What is one feature that all of the situations have in common? Is the
interpretation of “probability” the same in all situations? Take some time to
consider these questions before looking at the solution. The goal here is to just
think about these questions, and not to compute any probabilities (or to even
think about how you would).

Solution. to Example 3.1

Show/hide solution
This exercise is intended to motivate discussion, so you might have thought of
some other ideas we don’t address here. That’s good! And some of the things
you considered might come up later in the book. But here are a few thoughts
we specifically want to mention now.
The one feature that all of the situations have in common is uncertainty. Sometimes
the uncertainty arises from a repeatable physical phenomenon that can
result in multiple potential outcomes, like rolling dice or drawing the winning
Powerball number. In other cases, there is uncertainty because the probability
concerns the future, like tomorrow’s high temperature or the result of the next
Superbowl. But there can also be uncertainty about the past: there are some
Federalist papers for which the author is unknown, and you probably don’t
know for sure whether or not you ate an apple on April 17, 2009.
Whenever there is uncertainty, it is reasonable to consider relative likelihoods
of potential outcomes. For example, even though you don’t know for certain
whether you ate an apple on April 17, 2009, if you’re usually an apple-a-day
person (or were when you were younger) you might think the probability is high.
We don’t know for sure what team will win the next Superbowl, but we might
think that the 49ers are more likely than the Eagles to be the winner.
While all of the situations in the example involve uncertainty, it seems that there
are different “types” of uncertainty. Even though we don’t know which side a
die will land on, the notion of “fairness” implies that the sides are “equally
likely”. Likewise, there are some rules to how the Powerball drawing works,
and it seems like these rules should determine the probability of drawing that
particular winning number.
However, there aren’t any specific “rules of uncertainty” that govern whether or
not you ate an apple on April 17, 2009. You either did or you didn’t, but that
doesn’t mean the two outcomes are necessarily equally likely. Regarding the
Superbowl, of course there are rules that govern the NFL season and playoffs,
but there are no “rules of uncertainty” that tell us precisely how likely any

particular team is to win any particular game, let alone how likely a team is to
advance to and win the Superbowl.
It also seems that there are different interpretations of probability. Given that
a six-sided die is fair, we might all agree that the probability that it lands on
any particular side is 1/6. Similarly, given the rules of the Powerball lottery,
we might all agree on the probability that a drawing results in a particular win-
ning number. However, there isn’t necessarily consensus about what the high
temperature will be in San Luis Obispo tomorrow. Different weather prediction
models, forecasters, or websites might provide different values for the probabil-
ity that the high temperature will be above 90 degrees Fahrenheit. Similarly,
Superbowl odds might vary by source. Situations like tomorrow’s weather or the
Superbowl where there is no consensus about the “rules of uncertainty” require
some subjectivity in determining probabilities.
Finally, some of these situations are repeatable. We could (in principle) roll a
pair of dice many times and see how often we get doubles, or repeat the Power-
ball drawing over and over to see how the winning numbers behave. However,
many of these situations involve something that only happens once, like tomorrow,
or April 17, 2009, or the next Superbowl. Even when the phenomenon
happens only once in reality, we can still develop models of what might happen
if we were to hypothetically repeat the phenomenon many times. For exam-
ple, meteorologists use historical data and meteorological models to forecast
potential paths of a hurricane.
The subject of probability concerns random phenomena. A phenomenon is
random1 if there are multiple potential outcomes, and there is uncertainty
about which outcome will occur. Uncertainty is understood in broad terms, and
in particular does not only concern future occurrences.
Some phenomena involve physical randomness2 , like flipping coins, rolling dice,
drawing Powerballs at random from a bin, or randomly selecting Cal Poly stu-
dents. In many other situations randomness just vaguely reflects uncertainty.
Contrary to colloquial uses of the word, random does not mean haphazard. In a
random phenomenon, while individual outcomes are uncertain, we will see that
there is a regular distribution of outcomes over a large number of (hypothetical)
repetitions. For example,
1 In this book, “random” and “uncertain” are synonyms; the opposite of “random” is
“certain”. (Later we will encounter random variables; “constant” is an antonym of “random
variable”.) The word “random” has many uses in everyday life, which have evolved over
time. Unfortunately, some of the everyday meanings of “random”, like “haphazard” or
“unexpected”, are contrary to what we mean by “random” in this book. For example, we
would consider Steph Curry shooting a free throw to be a random phenomenon because
we’re not certain if he’ll make it or miss it; but we would not consider this process to be
haphazard or unexpected.
2 We will refer to as “random” any scenario that involves a reasonable degree of uncertainty.
We’re avoiding philosophical questions about what is “true” randomness, like the following:
Is a coin flip really random? If all factors that affect the trajectory of the coin were known
precisely, then wouldn’t the outcome be determined? Does true randomness only exist in
quantum mechanics?

• In two flips of a fair coin we wouldn’t necessarily see one head and one
tail. But in 10000 flips of a fair coin, we might expect to see close to 5000
heads and 5000 tails.
• We don’t know who will win the next Superbowl, but we can and should
consider some teams as more likely to win than others. We could imagine
a large number of hypothetical 2021-2022 seasons; how often would we
expect the 49ers to win? The Eagles?

Random also does not necessarily mean equally likely. In a random phenomenon,
certain outcomes or events might be more or less likely than others. For example,

• It’s much more likely than not that a randomly selected Cal Poly student
is a California resident.
• Not all NFL teams are equally likely to win the next Superbowl.

Finally, randomness is also not necessarily undesirable. In particular, many


statistical applications often employ the planned use of randomness with the
goal of collecting “good” data. For example,

• Random selection involves selecting a sample of individuals “at random”


from a population (e.g., via random digit dialing), with the goal of
selecting a representative sample.

• Random assignment involves assigning individuals at random to groups


(e.g., in a randomized experiment), with the goal of constructing groups
that are similar in all aspects so that the effect of a treatment (like a new
vaccine) can be isolated.
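As a quick illustration (our own sketch, with hypothetical IDs and group labels), R's sample() function can carry out both random selection and random assignment:

```r
set.seed(123)

# Random selection: choose 50 individuals at random from a population
population = 1:1000            # hypothetical population of individual IDs
selected = sample(population, 50)

# Random assignment: split the 50 selected individuals at random
# into two groups of 25
groups = sample(rep(c("control", "treatment"), each = 25))
table(groups)
```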

The probability of an event associated with a random phenomenon is a number


in the interval [0, 1] measuring the event’s likelihood or degree of uncertainty.
A probability can take any value on the continuous scale from 0% to 100%3 . In
particular, a probability requires much more interpretation than “is the proba-
bility greater than, less than, or equal to 50%?” As Example 3.1 suggests, there
can be different interpretations of “probability”, which we’ll start to explore in
the next section.

3.2 Interpretations of probability


In the previous section we encountered a variety of scenarios which involved
uncertainty, a.k.a. randomness. Just as there are a few “types” of randomness,
there are a few ways of interpreting probability, most notably, long run relative
frequency and subjective probability.
3 Probabilities are usually defined as decimals, but are often colloquially referred to as
percentages. We’re not sticklers; we’ll refer to probabilities both as decimals and as percentages.

3.2.1 Long run relative frequency

One of the oldest documented4 problems in probability is the following: If three


fair six-sided dice are rolled, what is more likely: a sum of 9 or a sum of 10?
Let’s try to answer this question by simply rolling dice and seeing if a sum of
9 or 10 happens more frequently. Roll three fair six-sided dice, find the sum,
repeat many times, and see how often we get a sum of 9 versus a sum of 10.
Of course, this would be a time consuming process by hand, but it’s quick and
easy on a computer. Figure 3.1 displays the result of one million repetitions
of this process, each repetition resulting in the sum of three rolls. A sum of 9
occurred in 115384 repetitions and a sum of 10 occurred in 125005 repetitions.
Comparing these frequencies, our results suggest that a sum of 10 is more likely
than a sum of 9.
Figure 3.1: Results of one million sets of three rolls of fair six-sided dice. Sets
in which the sum of the dice is 9 (10) are represented by an orange (blue) spike.
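A simulation along these lines takes only a few lines of R. Here is a sketch of our own (the counts will differ slightly from those above, since each run of the simulation is random):

```r
set.seed(314)
n_rep = 1e6

# Each repetition: roll three fair six-sided dice and record the sum
rolls = matrix(sample(1:6, 3 * n_rep, replace = TRUE), ncol = 3)
sums = rowSums(rolls)

# Relative frequencies of a sum of 9 versus a sum of 10
sum(sums == 9) / n_rep   # close to 25/216, about 0.116
sum(sums == 10) / n_rep  # close to 27/216, about 0.125
```

The long run relative frequencies settle near the exact probabilities: of the 216 equally likely ordered outcomes, 25 give a sum of 9 and 27 give a sum of 10.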

In the previous problem we assessed relative likelihoods by repeating the process


many times. This is the idea behind the relative frequency interpretation of
probability. We’ll investigate this idea further in the context of what is probably
the most iconic random process: coin flipping.
4 The Grand Duke of Tuscany posed this problem to Galileo, who published his solution in

1620. However, unbeknownst to Galileo, the same problem had been solved almost 100 years
earlier by Gerolamo Cardano, one of the first mathematicians to study probability.

Table 3.1: Results and running proportion of H for 10 flips of a fair coin.

Flip Result Running count of H Running proportion of H


1 T 0 0.000
2 H 1 0.500
3 T 1 0.333
4 H 2 0.500
5 H 3 0.600
6 H 4 0.667
7 H 5 0.714
8 T 5 0.625
9 T 5 0.556
10 T 5 0.500

We might all agree that the probability that a single flip of a fair coin lands on
heads is 1/2, a.k.a., 0.5, a.k.a, 50%. After all, the notion of “fairness” implies
that the two outcomes, heads and tails, should be equally likely, so we have a
“50/50 chance” of heads. But how else can we interpret this 50%? As in the
dice rolling problem, we can consider what would happen if we flipped the coin
many times. Now, if we flipped the coin twice, we wouldn’t expect to
necessarily see one head and one tail. But in many flips, we might expect to see
heads on something close to 50% of flips.
Let’s try this out. Table 3.1 displays the results of 10 flips of a fair coin. The
first column is the flip number and the second column is the result of the flip.
The third column displays the running proportion of flips that result in H. For
example, the first flip results in T so the running proportion of H after 1 flip
is 0/1; the first two flips result in (T, H) so the running proportion of H after
2 flips is 1/2; and so on. Figure 3.2 plots the running proportion of H by the
number of flips. We see that with just a small number of flips, the proportion
of H fluctuates considerably and is not guaranteed to be close to 0.5. Of course,
the results depend on the particular sequence of coin flips. We encourage you
to flip a coin 10 times and compare your results.
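The running proportion in Table 3.1 is just a cumulative count of heads divided by the number of flips so far. Here is a short sketch of our own that simulates flips and computes it:

```r
set.seed(42)
n_flips = 10000

# Simulate flips of a fair coin
flips = sample(c("H", "T"), n_flips, replace = TRUE)

# Running proportion of H after each flip
running_prop = cumsum(flips == "H") / (1:n_flips)

# Early values fluctuate considerably; later values tend to settle near 0.5
running_prop[c(10, 100, 1000, 10000)]
```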
Now we’ll flip the coin 90 more times for a total of 100 flips. The plot on the left
in Figure 3.3 summarizes the results, while the plot on the right also displays
the results for 3 additional sets of 100 flips. The running proportion fluctuates
considerably in the early stages, but settles down and tends to get closer to
0.5 as the number of flips increases. However, each of the four sets results in
a different proportion of heads after 100 flips: 0.5 (blue), 0.44 (orange), 0.56
(green), 0.56 (purple). Even after 100 flips the proportion of flips that result in
H isn’t guaranteed to be very close to 0.5.
Now for each set of 100 flips, we’ll flip the coin 900 more times for a total of 1000
flips in each of the four sets. The plot on the left in Figure 3.4 summarizes the

Figure 3.2: Running proportion of H versus number of flips for the 10 coin flips
in Table 3.1.
Figure 3.3: Running proportion of H versus number of flips for four sets of 100
coin flips.

results for our original set, while the plot on the right also displays the results
for the three additional sets from Figure 3.3. Again, the running proportion
fluctuates considerably in the early stages, but settles down and tends to get
closer to 0.5 as the number of flips increases. Compared to the results after 100
flips, there is less variability between sets in the proportion of H after 1000 flips:
0.51 (blue), 0.488 (orange), 0.525 (green), 0.492 (purple). Now, even after 1000
flips the proportion of flips that result in H isn’t guaranteed to be exactly 0.5,
but we see a tendency for the proportion to get closer to 0.5 as the number of
flips increases.
Figure 3.4: Running proportion of H versus number of flips for four sets of 1000
coin flips.

In summary, in a large number of flips of a fair coin we expect about 50% of


flips to result in H. That is, the probability that a flip of a fair coin results in
H can be interpreted as the long run proportion of flips that result in H, or in
other words, the long run relative frequency of H.
In general, the probability of an event associated with a random phenomenon
can be interpreted as a long run proportion or long run relative frequency:
the probability of the event is the proportion of times that the event would occur
in a very large number5 of hypothetical repetitions of the random phenomenon.
The long run relative frequency interpretation of probability can be applied
when a situation can be repeated numerous times, at least conceptually, and
an outcome can be observed for each repetition. One benefit of the relative fre-
quency interpretation is that the probability of an event can be approximated by
simulating the random phenomenon a large number of times and determining
the proportion of simulated repetitions on which the event occurred out of the
total number of repetitions in the simulation. A simulation involves an arti-
ficial recreation of the random phenomenon, usually using a computer. After
many repetitions the relative frequency of the event will settle down to a single
constant value, and that value is approximately the probability of the event.
5 A natural question is: “how many repetitions are required to represent the long run?”

We’ll consider this question when we discuss MCMC methods.



Of course, the accuracy of simulation-based approximations of probabilities
depends on how well the simulation represents the actual random phenomenon.
Conducting a simulation can involve many assumptions which influence the re-
sults. Simulating many flips of a fair coin is one thing; simulating an entire
NFL season and the winner of the Superbowl is an entirely different story.

3.2.2 Subjective probability

The long run relative frequency interpretation is natural in repeatable situations


like flipping coins, rolling dice, drawing Powerballs, or randomly selecting Cal
Poly students.
On the other hand, it is difficult to conceptualize some scenarios in the long
run. The next Superbowl will only be played once, the 2024 U.S. Presidential
Election will only be conducted once (we hope), and there was only one April 17,
2009 on which you either did or did not eat an apple. But while these situations
are not naturally repeatable they still involve randomness (uncertainty) and it
is still reasonable to assign probabilities. At this point in time we might think
that the Kansas City Chiefs are more likely than the Philadelphia Eagles to win
Superbowl 2022 and that President Biden is more likely than Dwayne Johnson
to win the U.S. 2024 Presidential Election. If you’ve always been an apple-a-day
person, you might think there’s a good chance you ate one on April 17, 2009.
It is still reasonable to assign probabilities to quantify such assessments even
when an uncertain phenomenon is not repeated.
However, the meaning of probability does seem different in physically repeatable
situations like coin flips than in single occurrences like the 2022 Superbowl.
For example, as of Dec 30, 2021,

• According to FiveThirtyEight, the Kansas City Chiefs have a 26% chance


of winning the 2022 Superbowl, and the Green Bay Packers have a 24%
chance.
• According to Football Outsiders, the Kansas City Chiefs have a 19.4%
chance of winning the 2022 Superbowl, and the Green Bay Packers have
a 14% chance.
• As reported by CBS Sports, the Kansas City Chiefs have a 20% chance
of winning the 2022 Superbowl, and the Green Bay Packers have a 21%
chance.

Each source, as well as many others, assigns different probabilities to the Chiefs
and Packers winning. Which source, if any, is “correct”?
When the situation involves a fair coin flip, we could perform a simulation to
see that the long run proportion of flips that land on H is 0.5, and so the
probability that a fair coin flip lands on H is 0.5. Even though the actual
2022 Superbowl will only happen once, we could still perform a simulation

involving hypothetical repetitions. However, simulating the Superbowl involves


first simulating the 2021-2022 season to determine the playoff matchups, then
simulating the playoffs to see which teams make the Superbowl, then simulating
the Superbowl matchup itself. And simulating the season involves simulating
all the individual games. Even just simulating a single game involves many
assumptions; differences in opinions with regards to these assumptions can lead
to different probabilities. For example, on Dec 30, according to FiveThirtyEight
the Eagles had a 55% chance of beating the Washington Football Team in their
game on Jan 2, but according to numberFire it was 65%. (Let’s hope the
Eagles won.) Even though the differences in probabilities between sources are
often small, many small differences over the course of the season could result in
large differences in predictions for the Superbowl champion.
Unlike physically repeatable situations such as flipping a coin, there is no single
set of “rules” for conducting a simulation of a season of football games or the
Superbowl champion. Therefore, there is no single long run relative frequency
that determines the probability. Instead we consider subjective probability.
A subjective (a.k.a. personal) probability describes the degree of like-
lihood a given individual assigns to a certain event. As the name suggests,
different individuals (or probabilistic models) might have different subjective
probabilities for the same event. In contrast, in the long run relative frequency
interpretation the probability is agreed to be defined as the long run relative
frequency, a single number.
Think of subjective probabilities as measuring relative degrees of like-
lihood, uncertainty, or plausibility rather than long run relative frequen-
cies. For example, in the FiveThirtyEight forecast (as of Dec 30), the Chiefs
(26% chance) are about 3.25 times more likely to win the 2022 Superbowl
than the Cowboys (8% chance); 3.25 = 26/8. Relative likelihoods can also be
compared across different forecasts or scenarios. For example, FiveThirtyEight
believes that the Packers are about 1.7 times more likely to win the Superbowl
than Football Outsiders does (24% versus 14%). Also, FiveThirtyEight believes
that the likelihood that a fair coin lands on H is about 1.92 times larger than
the likelihood that the Chiefs win the 2022 Superbowl.
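The relative likelihoods quoted above are just ratios of the stated probabilities. A quick check in R, using the Dec 30 figures from the text:

```r
# FiveThirtyEight's probabilities (as of Dec 30)
chiefs = 0.26
cowboys = 0.08
packers_538 = 0.24

# Football Outsiders' probability for the Packers
packers_fo = 0.14

chiefs / cowboys          # 3.25: Chiefs vs Cowboys, within FiveThirtyEight
packers_538 / packers_fo  # about 1.7: FiveThirtyEight vs Football Outsiders
0.50 / chiefs             # about 1.92: fair coin lands H vs Chiefs win
```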
The FiveThirtyEight NFL predictions are the output of a probabilistic fore-
cast. A probabilistic forecast combines observed data and statistical models
to make predictions. Rather than providing a single prediction (such as “the
Chiefs will win the 2022 Superbowl”), probabilistic forecasts provide a range of
scenarios and their relative likelihoods. Such forecasts are subjective in nature,
relying upon the data used and assumptions of the model. Changing the data
or assumptions can result in different forecasts and probabilities. In particu-
lar, probabilistic forecasts are usually revised over time as more data becomes
available.
Simulations can also be based on subjective probabilities. If we were to conduct
a simulation consistent with FiveThirtyEight’s model (as of Dec 30), then in

about 26% of repetitions the Chiefs would win the Superbowl, and in about
8% of repetitions the Cowboys would win. Of course, different sets of sub-
jective probabilities correspond to different assumptions and different ways of
conducting the simulation.
Subjective probabilities can be calibrated by weighing the relative favorability
of different bets, as in the following example.
Example 3.2. What is your subjective probability that Professor Ross has a
TikTok account? Consider the following two bets, and suppose you must choose
only one6 .

A) You win $100 if Professor Ross has a TikTok account, and you win nothing
otherwise.
B) A box contains 40 green and 60 gold marbles that are otherwise identical.
The marbles are thoroughly mixed and one marble is selected at random.
You win $100 if the selected marble is green, and you win nothing other-
wise.

1. Which of the above bets would you prefer? Or are you completely indiffer-
ent? What does this say about your subjective probability that Professor
Ross has a TikTok account?
2. If you preferred bet B to bet A, consider bet C which has a similar setup
to B but now there are 20 green and 80 gold marbles. Do you prefer bet
A or bet C? What does this say about your subjective probability that
Professor Ross has a Tik Tok account?
3. If you preferred bet A to bet B, consider bet D which has a similar setup
to B but now there are 60 green and 40 gold marbles. Do you prefer bet
A or bet D? What does this say about your subjective probability that
Professor Ross has a Tik Tok account?
4. Continue to consider different numbers of green and gold marbles. Can
you zero in on your subjective probability?

Solution. to Example 3.2

Show/hide solution

1. Since the two bets have the same payouts, you should prefer the one that
gives you a greater chance of winning! If you choose bet B you have a
40% chance of winning.
• If you prefer bet B to bet A, then your subjective probability that
Professor Ross has a TikTok account is less than 40%.
• If you prefer bet A to bet B, then your subjective probability that
Professor Ross has a TikTok account is greater than 40%.
6 We do not advocate gambling. We merely use gambling contexts to motivate probability

concepts.

• If you’re indifferent between bets A and B, then your subjective


probability that Professor Ross has a TikTok account is equal to 40%.

2. If you choose bet C you have a 20% chance of winning.


• If you prefer bet C to bet A, then your subjective probability that
Professor Ross has a TikTok account is less than 20%.
• If you prefer bet A to bet C, then your subjective probability that
Professor Ross has a TikTok account is greater than 20%.
• If you’re indifferent between bets A and C, then your subjective
probability that Professor Ross has a TikTok account is equal to 20%.

3. If you choose bet D you have a 60% chance of winning.


• If you prefer bet D to bet A, then your subjective probability that
Professor Ross has a TikTok account is less than 60%.
• If you prefer bet A to bet D, then your subjective probability that
Professor Ross has a TikTok account is greater than 60%.
• If you’re indifferent between bets A and D, then your subjective
probability that Professor Ross has a TikTok account is equal to 60%.

4. Continuing in this way you can narrow down your subjective probability.
   For example, if you prefer bet B to bet A and bet A to bet C, your
   subjective probability is between 20% and 40%. Then you might consider
   bet E corresponding to 30 green marbles and 70 gold to determine if your
   subjective probability is greater than or less than 30%. At some point it
   will be hard to choose, and you will be in the ballpark of your subjective
   probability. (Think of it like going to the eye doctor: “which is better: 1
   or 2?” At some point you can’t really see a difference.)
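The zeroing-in process described in part 4 is essentially a bisection search. The sketch below (purely illustrative, not from the text) makes that concrete; the function passed in as `prefers_marbles` is a hypothetical stand-in for the person answering, here simulated as if their hidden subjective probability were 0.35.

```python
def elicit_probability(prefers_marbles, tol=0.01):
    """Bisection search for a subjective probability.

    prefers_marbles(p) stands in for the person: True means they prefer
    a marble bet with win probability p over the bet on the event itself.
    """
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if prefers_marbles(mid):
            hi = mid  # marble bet preferred: subjective probability is below mid
        else:
            lo = mid  # event bet preferred: subjective probability is above mid
    return (lo + hi) / 2

# Pretend the person's (unknown to us) subjective probability is 0.35.
hidden = 0.35
estimate = elicit_probability(lambda p: p > hidden)
```

Each question halves the interval of candidate probabilities, mirroring the eye-doctor analogy: the interval shrinks until further comparisons are too close to call.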

Of course, the strategy in the above example isn’t an exact science, and there is
a lot of behavioral psychology behind how people make choices in situations like
this, especially when betting with real money. But the example provides a very
rough idea of how you might discern a subjective probability of an event. The
example also illustrates that probabilities can be “personal”; your information
or assumptions will influence your assessment of the likelihood.
We close this section with some brief comments about subjectivity. Subjec-
tivity is not bad; “subjective” is not a “dirty” word. Any probability model
involves some subjectivity, even when probabilities can be interpreted naturally
as long run relative frequencies. For example, assuming a die is fair does not
codify an objective truth about the die. Instead, “fairness” reflects a reason-
able and tractable mathematical model. In the real world, any “fair” six-sided
die has small physical imperfections that cause the six faces to have different
probabilities. However, the differences are usually small enough to be ignored
for most practical purposes. Assuming that the probability that the die lands
Figure 3.5: The three marble bins in Example 3.2. Left: Bet B, 40% chance
of selecting green. Middle: Bet C, 20% chance of selecting green. Right: Bet
D, 60% chance of selecting green.

on each side is 1/6 is much more tractable than assuming the probability of a
1 is 0.1666666668, the probability of a 2 is 0.1666666665, etc. (Furthermore,
measuring the probability of each side so precisely would be extremely diffi-
cult.) But assuming that the probability that the die lands on each side is 1/6
is also subjective. We might readily agree to assume that the probability that
a six-sided die lands on 1 is 1/6, but we might not reach a consensus on the
probability that the Chiefs win the Superbowl. But the fact that there can be
many reasonable probability models for a situation like the 2022 Superbowl does
not make the corresponding subjective probabilities any less valid than long run
relative frequencies.
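The tractability point can be illustrated with a small simulation. The slightly unequal face probabilities below are invented for illustration (and exaggerated relative to the text's 10th-decimal-place example so the imperfection is visible in the code); the long-run relative frequencies are still practically indistinguishable from 1/6.

```python
import random

random.seed(42)

# Hypothetical slightly-imperfect die: the weights are made up for
# illustration and differ from 1/6 only slightly.
probs = [0.167, 0.166, 0.167, 0.166, 0.167, 0.167]
rolls = random.choices(range(1, 7), weights=probs, k=100_000)

# Long-run relative frequency of each face.
freqs = [rolls.count(face) / len(rolls) for face in range(1, 7)]
# Every frequency lands close to 1/6, so for most practical purposes
# the fair-die model is indistinguishable from the "true" one.
```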

3.3 Working with probabilities

In the previous section we encountered two interpretations of probability: long
run relative frequency and subjective. We will use these interpretations inter-
changeably. With subjective probabilities it is often helpful to consider what
might happen in a simulation. It is also useful to consider long run relative fre-
quencies in terms of relative degrees of likelihood. Fortunately, the mathematics
of probability work the same way regardless of the interpretation.

3.3.1 Consistency requirements

With either the long run relative frequency or subjective probability interpre-
tation there are some basic logical consistency requirements which probabilities

need to satisfy. Roughly, probabilities cannot be negative and the sum of prob-
abilities over all possible outcomes must be 100%.
Example 3.3. As of Dec 30, FiveThirtyEight listed the following probabilities
for who will win the 2022 Superbowl.

Team Probability
Kansas City Chiefs 26%
Green Bay Packers 24%
Tampa Bay Buccaneers 9%
Dallas Cowboys 8%
Other

According to FiveThirtyEight (as of Dec 30):

1. What would you expect the results of 10000 repetitions of a simulation
   of the Superbowl champion to look like? Construct a table summarizing
   what you expect. Is this necessarily what would happen?
2. What must be the probability that the Chiefs do not win the 2022 Super-
bowl?
3. What must be the probability that one of the above four teams is the
Superbowl champion?
4. What must be the probability that a team other than the above four teams
is the Superbowl champion? That is, what value goes in the “Other” row
in the table?

Solution. to Example 3.3

Show/hide solution

1. While these particular probabilities are subjective, imagining probabilities
   as relative frequencies often helps our intuition. If we think of this as a
   simulation, each repetition results in a Superbowl champion and in the
   long run we would expect the Chiefs would be the champion in 26%, or
   2600, of the 10000 repetitions. We would expect the simulation results to
   look like

Team                    Repetitions as winner

Kansas City Chiefs      2600
Green Bay Packers       2400
Tampa Bay Buccaneers    900
Dallas Cowboys          800
Other                   3300

Of course, there would be some variability from simulation to simulation,
just like in the sets of 1000 coin flips in Figure 3.4. But the above counts
represent about what we would expect.

2. 74%. Either the Chiefs win or they don’t; if there’s a 26% chance that the
Chiefs win, there must be a 74% chance that they do not win. If we think
of this as a simulation with 10000 repetitions, each repetition results in
either the Chiefs winning or not, so if they win in 2600 of repetitions then
they must not win in the other 7400.

3. 67%. There is only one Superbowl champion, so if say the Chiefs win then
no other team can win. Thinking again of the simulation, the repetitions
in which the Chiefs win are distinct from those in which the Cowboys
win. So if the Chiefs win in 2600 repetitions and the Cowboys win in
800 repetitions, then on a total of 3400 repetitions either the Chiefs or
Cowboys win. Adding the four probabilities, we see that the probability
that one of the four teams above wins must be 67%.

4. 33%. Either one of the four teams above wins, or some other team wins.
If one of the four teams above wins in 6700 repetitions, then in 3300
repetitions the winner is not one of these four teams.
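The hypothetical simulation in this solution can be sketched as follows. The probabilities are FiveThirtyEight's from Example 3.3; everything else is just an illustration of how one run of the simulation might be coded.

```python
import random

random.seed(0)

teams = ["Chiefs", "Packers", "Buccaneers", "Cowboys", "Other"]
probs = [0.26, 0.24, 0.09, 0.08, 0.33]
reps = 10_000

# Each repetition produces exactly one champion.
winners = random.choices(teams, weights=probs, k=reps)
counts = {team: winners.count(team) for team in teams}

# Counts should be near the expected 2600, 2400, 900, 800, 3300,
# with some simulation-to-simulation variability.
p_not_chiefs = 1 - counts["Chiefs"] / reps                 # about 0.74
p_top_four = sum(counts[t] for t in teams[:4]) / reps      # about 0.67
```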

Example 3.4. Suppose your subjective probabilities for the 2022 Superbowl
champion satisfy the following conditions.

• The Cowboys and Buccaneers are equally likely to win
• The Packers are 1.5 times more likely than the Cowboys to win
• The Chiefs are 2 times more likely than the Packers to win
• The winner is as likely to be among these four teams — Chiefs, Packers,
Buccaneers, Cowboys — as not

Construct a table of your subjective probabilities like the one in Example 3.3.

Solution. to Example 3.4

Show/hide solution
Here, probabilities are specified indirectly via relative likelihoods. We need to
find probabilities that are in the given ratios and add up to 100%. It helps
to designate one outcome as the “baseline”. It doesn’t matter which one; we’ll
choose the Cowboys.

• Suppose the Cowboys account for 1 “unit”. It doesn’t really matter what
a unit is, but let’s say it corresponds to 1000 repetitions of the simulation.
That is, the Cowboys win in 1000 repetitions. Careful: we haven’t yet
specified how many total repetitions we have done, or how many units the
entire simulation accounts for. We’re just starting with a baseline of what
happens for the Cowboys.
• The Cowboys and Buccaneers are equally likely to win, so the Buccaneers
also account for 1 unit.
• The Packers are 1.5 times more likely than the Cowboys to win, so the
Packers account for 1.5 units. If 1 unit is 1000 repetitions, then the Packers
win in 1500 repetitions, 1.5 times more often than the Cowboys.
• The Chiefs are 2 times more likely than the Packers to win, so the Chiefs
account for 2 × 1.5 = 3 units. If 1 unit is 1000 repetitions, then the Chiefs
win in 3000 repetitions.
• The four teams account for a total of 1 + 1 + 1.5 + 3 = 6.5 units. Since the
winner is as likely to be among these four teams as not, “Other” also
accounts for 6.5 units.
• In total, there are 13 units which account for 100% of the probability.
The Cowboys account for 1 unit, so their probability of winning is 1/13
or about 7.7%. Likewise, the probability that the Chiefs win is 3/13 or
about 23.1%.

Team                    Units   Repetitions   Probability

Kansas City Chiefs      3.0     3000          23.1%
Green Bay Packers       1.5     1500          11.5%
Tampa Bay Buccaneers    1.0     1000          7.7%
Dallas Cowboys          1.0     1000          7.7%
Other                   6.5     6500          50.0%
Total                   13.0    13000         100.0%

You should verify that all of the probabilities are in the specified ratios. For
example, the Chiefs are 2 times more likely (2 ≈ 23.1/11.5) than the Packers
to win, and the Packers are 1.5 times more likely (1.5 ≈ 11.5/7.7) than the
Cowboys to win.
We could have also solved this problem using algebra. Let 𝑥 be the probability,
as a decimal, that the Cowboys are the winner. (Again, it doesn’t matter which
team is the baseline.) Then 𝑥 is also the probability that the Buccaneers are
the winner, 1.5𝑥 for the Packers, and 3𝑥 for the Chiefs. The probability that
one of the four teams wins is 𝑥 + 𝑥 + 1.5𝑥 + 3𝑥 = 6.5𝑥, so the probability of
Other is also 6.5𝑥. The probabilities in decimal form must sum to 1 (that is,
100%), so 1 = 𝑥 + 𝑥 + 1.5𝑥 + 3𝑥 + 6.5𝑥 = 13𝑥. Solve for 𝑥 = 1/13 and then plug
in 𝑥 = 1/13 to find the other probabilities.
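Both the "units" approach and the algebraic approach amount to the same normalization computation, which can be sketched as:

```python
# Relative likelihoods ("units") for Example 3.4, specified against the
# Cowboys as the baseline.
units = {
    "Chiefs": 3.0,        # 2 times the Packers (2 x 1.5)
    "Packers": 1.5,       # 1.5 times the Cowboys
    "Buccaneers": 1.0,    # same as the Cowboys
    "Cowboys": 1.0,       # baseline
}
# The winner is as likely to be among these four teams as not.
units["Other"] = sum(units.values())   # 6.5 units

# Normalize: divide by the total so the probabilities sum to 1
# while preserving all the ratios.
total = sum(units.values())            # 13 units
probs = {team: u / total for team, u in units.items()}
```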

Example 3.4 illustrates one way of formulating probabilities. We start by
specifying probabilities in relative terms, and then “normalize” these probabilities
so that they add up to 100% while maintaining the ratios. As in the example,
it helps to consider one outcome as a “baseline” and to specify all likelihoods
relative to the baseline.
Figure 3.6 provides a visual representation of Example 3.4. The ratios provided
in the problem setup are enough to draw the shape of the plot, represented by
the plot on the left without a scale on the vertical axis. The heights are equal
for the Cowboys and Buccaneers, the height for the Packers is 1.5 times higher,
etc. The plot on the right simply adds a probability axis to ensure the values
add to 1. The plot on the right represents the “normalization” step, but it does
not affect the shape of the plot or the relative heights of the bars.


Figure 3.6: Bar chart representation of the subjective probabilities in Example
3.4. Left: Relative heights without absolute scale. Right: Heights scaled to sum
to 1 to represent probabilities.

3.4 Interpretations of Statistics

In the previous sections we have seen two interpretations of probability: relative
frequency and subjective. These two interpretations provide the philosophical
foundation for two schools of statistics: frequentist (hypothesis tests and con-
fidence intervals that you’ve seen before) and Bayesian. This section provides
a very brief introduction to some of the main ideas in Bayesian statistics. The
examples in this section only motivate ideas. We will fill in lots more details
throughout the book.

Example 3.5. How old do you think your instructor (Professor Ross) currently
is?7 Consider age on a continuous scale, e.g., you might be 20.73 or 21.36 or
19.50.

7 You could probably get a pretty good idea by searching online, but don’t do that. Instead,
answer the questions based on what you already know about me.

In this example, you will use probability to quantify your uncertainty about your
instructor’s age. You only need to give ballpark estimates of your subjective
probabilities, but you might consider what kinds of bets you would be willing
to accept like in Example 3.2. (This exercise just motivates some ideas. We’ll
fill in lots of details later.)

1. What is your subjective probability that your instructor is at most 30
   years old? More than 30 years old? (What must be true about these two
   probabilities?)
2. What is your subjective probability that your instructor is at most 60
years old? More than 60 years old?
3. What is your subjective probability that your instructor is at most 40
years old? More than 40 years old?
4. What is your subjective probability that your instructor is at most 50
years old? More than 50 years old?
5. Fill in the blank: your subjective probability that your instructor is at
most [blank] years old is equal to 50%.
6. Fill in the blanks: your subjective probability that your instructor is be-
tween [blank] and [blank] years old is equal to 95%.
7. Let 𝜃 represent your instructor’s age at noon on Jan 6, 2022. Use your
answers to the previous parts to sketch a continuous probability density
function to represent your subjective probability distribution for 𝜃.
8. If you ascribe a probability distribution to 𝜃, then are you treating 𝜃 as a
constant or a random variable?

Solution. to Example 3.5

Show/hide solution
Even though in reality your instructor’s current age is a fixed number, its value
is unknown or uncertain to you, and you can use probability to quantify this
uncertainty. You would probably be willing to bet any amount of money that
your instructor is over 20 years old, so you would assign a probability of 100%
to that event, and 0% to the event that he’s at most 20 years old. Let’s say
you’re pretty sure that he’s over 30, but you don’t know that for a fact, so you
assign a probability of 99% to that event (and 1% to the event that he’s at most
30). You think he’s over 40, but you’re even less sure about that, so maybe you
assign the event that he’s over 40 a probability of 67% (say you’d accept a bet
at 2 to 1 odds.) You think there’s a 50/50 chance that he’s over 50. You’re
95% sure that he’s between 35 and 60. And so on. Continuing in this way, you
can start to determine a probability distribution to represent your beliefs about
the instructor’s age. Your distribution should correspond to your subjective
probabilities. For example, the distribution should assign a probability of 67%
to values over 40.
This is just one example. Different students will have different distributions
depending upon (1) how much information you know about the instructor, and

(2) how that information informs your beliefs about the instructor’s age. We’ll
see some example plots in the next exercise.
Regarding the last question, since we are using a probability distribution to
quantify our uncertainty about 𝜃, we are treating 𝜃 as a random variable.
Recall that a random variable is a numerical quantity whose value is deter-
mined by the outcome of a random or uncertain phenomenon. The random
phenomenon might involve physically repeatable randomness, as in “flip a coin
10 times and count the number of heads.” But remember that “random” just
means “uncertain” and there are lots of different kinds of uncertainty. For ex-
ample, the total number of points scored in the 2022 Superbowl will be one and
only one number, but since we don’t know what that number is we can treat
it as a random variable. Treating the number of points as a random variable
allows us to quantify our uncertainty about it through probability statements
like “there is a 60% chance that fewer than 45 points will be scored in Superbowl
2022”.
The (probability) distribution of a random variable specifies the possible val-
ues of the random variable and a way of determining corresponding probabilities.
Like probabilities themselves, probability distributions of random variables can
also be interpreted as:

• relative frequency distributions, e.g., what pattern would emerge if I
simulated many values of the random variable? or as
• subjective probability distributions, e.g., which potential values of this
uncertain quantity are relatively more plausible than others?

As the name suggests, different individuals might have different subjective prob-
ability distributions for the same random variable.

Example 3.6. Continuing Example 3.5, Figure 3.7 displays the subjective prob-
ability distribution of the instructor’s age for four students.

1. Since age is treated as a continuous random variable, each of the above
   plots is a probability “density”. Explain briefly what this means. How is
   probability represented in density plots like these?
2. Rank the students in terms of their subjective probability that the instruc-
tor is at most 40 years old.
3. Rank the students in terms of their answers to the question: your subjec-
tive probability that your instructor is at most [blank] years old is equal
to 50%.
4. Rank the students in terms of their uncertainty about the instructor’s age.
Who is the most uncertain? The least?

Solution. to Example 3.6




Figure 3.7: Subjective probability distributions of instructor age for four
students in Example 3.6.

Show/hide solution

1. In a density plot, probability is represented by area under the curve. The
   total area under each curve is 1, corresponding to 100% probability. The
   density height at any particular value 𝑥 represents the relative likelihood
   that the random variable takes a value “close to” 𝑥. (We’ll consider
   densities in more detail later.)
2. Each student’s subjective probability that the instructor is at most 40 is
equal to the area under her subjective probability density over the range
of values less than 40. Billie has the smallest probability, then Dua, then
Ariana, then Cardi has the largest probability.
3. Now we want to find the “equal areas point” of each distribution. From
   smallest to largest: Cardi then Billie, and Ariana and Dua appear to be
   about the same. The equal areas point appears to be around 40 or so for
   Cardi. It’s definitely less than 45, which appears to be the equal areas
   point for Billie. The equal areas point for Ariana is 50 (halfway between
   25 and 75), and Dua’s appears to be about 50 also.
4. Ariana is most uncertain, then Dua, then Cardi, then Billie is the least
   uncertain. Each distribution represents 100% probability, but Ariana
   stretches this probability over the largest range of possible values, while
   Billie stretches this over the shortest. Ariana is basically saying the
   instructor can be any age between 25 and 75. Billie is fairly certain that
   the instructor is close to 45, and she’s basically 100% certain that the
   instructor is between 35 and 55.
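Reading probabilities off a density curve amounts to computing areas, which can be done by numerical integration. In this sketch the Normal(45, 3.5) density is a hypothetical stand-in with roughly the shape of Billie's curve; it is not specified anywhere in the text.

```python
import math

def density(x, mu=45, sigma=3.5):
    # Normal density: an assumed, illustrative curve for Billie.
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def area(lo, hi, n=10_000):
    # Trapezoidal rule for the area under the density between lo and hi.
    h = (hi - lo) / n
    ys = [density(lo + i * h) for i in range(n + 1)]
    return h * (sum(ys) - 0.5 * (ys[0] + ys[-1]))

p_at_most_40 = area(20, 40)   # P(age <= 40): area to the left of 40
p_35_to_55 = area(35, 55)     # nearly all of this curve's probability

# "Equal areas point" (the median): the value m with half the area on
# each side, found by bisection.
lo_m, hi_m = 20.0, 70.0
while hi_m - lo_m > 0.01:
    m = (lo_m + hi_m) / 2
    if area(20, m) < 0.5:
        lo_m = m
    else:
        hi_m = m
median = (lo_m + hi_m) / 2    # about 45 for this symmetric curve
```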
Example 3.7. Consider Ariana’s subjective probability distribution in Figure
3.7. Ariana learns that her instructor received a Ph.D. in 2006. How would her
subjective probability distribution change?
Solution. to Example 3.7

Show/hide solution
Ariana’s original subjective probability distribution reflects very little knowledge
about her instructor. Ariana now reasons that her instructor was probably be-
tween 25 and 35 when he received his Ph.D. in 2006, so she revises her subjective
probability distribution to place almost 100% probability on ages between 40
and 50. Ariana’s subjective probability distribution now looks more like Billie’s
in Figure 3.7.
The previous examples introduce how probability can be used to quantify uncer-
tainty about unknown numbers. One key aspect of Bayesian analyses is applying
a subjective probability distribution to a parameter in a statistical model.
Example 3.8. Let 𝜃𝑏 represent the proportion of current Cal Poly students who
have ever read any of the books in the Harry Potter series. Let 𝜃𝑚 represent the
proportion of current Cal Poly students who have ever seen any of the movies
in the Harry Potter series.

1. Are 𝜃𝑏 and 𝜃𝑚 parameters or statistics? Why?


2. Are the values of 𝜃𝑏 and 𝜃𝑚 known or unknown, certain or uncertain?
3. What are the possible values of 𝜃𝑏 and 𝜃𝑚 ?
4. Sketch a probability distribution representing what you think are
more/less credible values of 𝜃𝑏 . Repeat for 𝜃𝑚 .
5. Are you more certain about the value of 𝜃𝑏 or 𝜃𝑚 ? How is this reflected
in your distributions?
6. Suppose that in a class of 35 Cal Poly students, 21 have read a Harry
Potter book, and 30 have seen a Harry Potter movie. Now that we have
observed some data, sketch a probability distribution representing what
you think are more/less credible values of 𝜃𝑏 . Repeat for 𝜃𝑚 . How do
your distributions after observing data compare to the distributions you
sketched before?
Solution. to Example 3.8

Show/hide solution

1. The population of interest is current Cal Poly students, so 𝜃𝑏 and 𝜃𝑚 are
   parameters. We don’t have relevant data for the entire population, but we
   could collect data on a sample.

2. Since we don’t have data on the entire population, the values of 𝜃𝑏 and
𝜃𝑚 are unknown, uncertain.
3. 𝜃𝑏 and 𝜃𝑚 are proportions so they take values between 0 and 1. Any value
on the continuous scale between 0 and 1 is theoretically possible, though
the values are not equally plausible.
4. Results will vary, but here’s my thought process. I think that a strong
majority of Cal Poly students have seen at least one Harry Potter movie,
maybe 80% or so. I wouldn’t be that surprised if it were even close to
100%, but I would be pretty surprised if it were less than 60%.
However, I’m less certain about 𝜃𝑏 . I suspect that fewer than 50% of
students have read at least one Harry Potter book, but I’m not very sure
and I wouldn’t be too surprised if it were actually more than 50%.
See Figure 3.8 for what my subjective probability distributions might look
like.
5. I’m less certain about 𝜃𝑏 , so its density is “spread out” over a wider range
of values.
6. The values of 𝜃𝑏 and 𝜃𝑚 are still unknown, but I am less uncertain about
their values now that I have observed some data. The sample proportion
who have watched a Harry Potter movie is 30/35 = 0.857, which is pretty
consistent with my initial beliefs. But now I update my subjective distri-
bution to concentrate even more of my subjective probability on values in
the 80 percent range.
I had suspected that 𝜃𝑏 was less than 0.5, so the observed sample proportion
of 21/35 = 0.6 goes against my expectations. However, I was fairly
uncertain about the value of 𝜃𝑏 prior to observing the data, so 0.6 is not
too surprising to me. I update my subjective distribution so that it’s
centered closer to 0.6, while still allowing for my suspicion that 𝜃𝑏 is less
than 0.5.
See Figure 3.9 for what my subjective probability distributions might look
like after observing the sample data. Of course, the sample proportions
are not necessarily equal to the population proportions. But if the sam-
ples are reasonably representative, I would hope that the observed sample
proportions are close to the respective population proportions. Even after
observing data, there is still uncertainty about the parameters 𝜃𝑏 and 𝜃𝑚 ,
and my subjective distributions quantify this uncertainty.
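The "revise the distribution after observing data" step can be made concrete with a grid approximation: multiply prior weights by the likelihood of the observed data and renormalize. The triangular prior below is an invented stand-in for the sketched prior on 𝜃𝑏 (favoring values below 0.5); the book develops the formal machinery in later chapters.

```python
from math import comb

# Grid of candidate values for theta_b.
n_grid = 1001
grid = [i / (n_grid - 1) for i in range(n_grid)]

# Hypothetical prior: roughly triangular, peaked near 0.4 (illustrative only).
prior = [max(0.0, 1 - abs(theta - 0.4) / 0.4) for theta in grid]
prior_sum = sum(prior)
prior = [p / prior_sum for p in prior]

# Likelihood of the observed data: 21 readers in a sample of 35.
likelihood = [comb(35, 21) * theta**21 * (1 - theta)**14 for theta in grid]

# Posterior is proportional to prior x likelihood; normalize to sum to 1.
posterior = [p * l for p, l in zip(prior, likelihood)]
post_sum = sum(posterior)
posterior = [p / post_sum for p in posterior]

# The posterior is pulled toward the sample proportion 0.6 but still
# reflects the prior's lean toward smaller values.
post_mean = sum(t * p for t, p in zip(grid, posterior))
p_less_half = sum(p for t, p in zip(grid, posterior) if t < 0.5)
```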

Recall some statistical terminology.

• Observational units (a.k.a., cases, individuals, subjects) are the people,
places, things, etc. we collect information on.
• A variable is any characteristic of an observational unit that we can
measure.


Figure 3.8: Example subjective distributions in Example 3.8, prior to observing
sample data.

• Statistical inference involves using data collected on a sample to make
conclusions about a population.
• Inference often concerns specific numerical summaries, using values of
statistics to make conclusions about parameters.
• A parameter is a number that describes the population, e.g., population
mean, population proportion. The actual value of a parameter is almost
always unknown.
– Parameters are often denoted with Greek letters. We’ll often use the
Greek letter 𝜃 (“theta”) to denote a generic parameter.
• A statistic is a number that describes the sample, e.g., sample mean,
sample proportion.

Parameters are unknown numbers. In “traditional”, frequentist statistical
analysis, parameters are treated as fixed — that is, not random — constants. Any
randomness in a frequentist analysis arises from how the data were collected,
e.g., via random sampling or random assignment. In a frequentist analysis,
statistics are random variables; parameters are fixed numbers.
For example, a frequentist 95% confidence interval for 𝜃𝑏 in the previous example
is [0.434, 0.766]. We estimate with 95% confidence that the proportion of Cal


Figure 3.9: Example subjective distributions in Example 3.8, after observing
sample data.

Poly students that have read any of the books in the Harry Potter series is
between 0.434 and 0.766. Does this mean that there is a 95% probability that
𝜃𝑏 is between 0.434 and 0.766? No! In a frequentist analysis, the parameter 𝜃𝑏
is treated like a fixed constant. That constant is either between 0.434 and 0.766
or it’s not; we don’t know which it is, but there’s no probability to it. In a
frequentist analysis, it doesn’t make sense to say “what is the probability that
𝜃𝑏 (a number) is between 0.434 and 0.766?” just like it doesn’t make sense to
say “what is the probability that 0.5 is between 0.434 and 0.766?” Remember
that 95% confidence derives from the fact that for 95% of samples the procedure
that was used to produce the interval [0.434, 0.766] will produce intervals that
contain the true parameter 𝜃𝑏 . It is the samples and the intervals that are
changing from sample to sample; 𝜃𝑏 stays constant at its fixed but unknown
value. In a frequentist analysis, probability quantifies the randomness in the
sampling procedure.
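The repeated-sampling meaning of "95% confidence" can be checked by simulation: fix the parameter, draw many samples, and count how often the interval procedure captures it. This sketch uses a simple Wald interval with arbitrary illustrative values 𝜃 = 0.6 and 𝑛 = 35 (so the coverage is only approximately 95%).

```python
import random

random.seed(1)

theta = 0.6      # the fixed, "unknown" population proportion
n = 35           # sample size
sims = 2_000     # number of simulated samples
covered = 0

for _ in range(sims):
    # Draw a sample and compute the sample proportion.
    x = sum(random.random() < theta for _ in range(n))
    p_hat = x / n
    # Wald 95% interval: p_hat +/- 1.96 standard errors.
    se = (p_hat * (1 - p_hat) / n) ** 0.5
    lo, hi = p_hat - 1.96 * se, p_hat + 1.96 * se
    # The interval changes from sample to sample; theta stays fixed.
    covered += (lo <= theta <= hi)

coverage = covered / sims   # roughly 0.95
```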
On the other hand, in a Bayesian statistical analysis, since a parameter 𝜃 is
unknown — that is, its value is uncertain to the observer — 𝜃 is treated as
a random variable. That is, in Bayesian statistical analyses unknown
parameters are random variables that have probability distributions.
The probability distribution of a parameter quantifies the degree of uncertainty
about the value of the parameter. Therefore, the Bayesian perspective allows
for probability statements about parameters. For example, a Bayesian analysis

of the previous example might conclude that there is a 95% chance that 𝜃𝑏 is
between 0.426 and 0.721. Such a statement is valid in the Bayesian context, but
nonsensical in the frequentist context.
In the previous example, we started with distributions that represented our
uncertainty about 𝜃𝑏 and 𝜃𝑚 based on our “beliefs”, then we revised these
distributions after observing some data. If we were to observe more data, we
could revise again. In this course we will see (among other things) (1) how to
quantify uncertainty about parameters using probability distributions, and (2)
how to update those distributions to reflect new data.
Throughout these notes we will focus on Bayesian statistical analyses. We will
occasionally compare Bayesian and frequentist analyses and viewpoints. But we
want to make clear from the start: Bayesian versus frequentist is NOT a question
of right versus wrong. Both Bayesian and frequentist are valid approaches to
statistical analyses, each with advantages and disadvantages. We’ll address
some of the issues along the way. But at no point in your career do you need
to make a definitive decision to be a Bayesian or a frequentist; a good modern
statistician is probably a bit of both.
Chapter 4

Bayes’ Rule

The mechanism that underpins all of Bayesian statistical analysis is Bayes’
rule1, which describes how to update uncertainty in light of new information,
evidence, or data.

Example 4.1. A recent survey of American adults asked: “Based on what you
have heard or read, which of the following two statements best describes the
scientific method?”

• 70% selected “The scientific method produces findings meant to be
continually tested and updated over time”. (We’ll call this the “iterative”
opinion.)
• 14% selected “The scientific method identifies unchanging core principles
and truths”. (We’ll call this the “unchanging” opinion).
• 16% were not sure which of the two statements was best.

How does the response to this question change based on education level? Sup-
pose education level is classified as: high school or less (HS), some college but
no Bachelor’s degree (college), Bachelor’s degree (Bachelor’s), or postgraduate
degree (postgraduate). The education breakdown is

• Among those who agree with “iterative”: 31.3% HS, 27.6% college, 22.9%
Bachelor’s, and 18.2% postgraduate.
• Among those who agree with “unchanging”: 38.6% HS, 31.4% college,
19.7% Bachelor’s, and 10.3% postgraduate.
• Among those “not sure”: 57.3% HS, 27.2% college, 9.7% Bachelor’s, and
5.8% postgraduate
1 This section only covers Bayes’ rule for events. We’ll see Bayes’ rule for
distributions of random variables later. But the ideas are analogous.

1. Use the information to construct an appropriate two-way table.
2. Overall, what percentage of adults have a postgraduate degree? How is
   this related to the values 18.2%, 10.3%, and 5.8%?
3. What percent of those with a postgraduate degree agree that the scientific
   method is “iterative”? How is this related to the values provided?
Solution. to Example 4.1

Show/hide solution

1. Suppose there are 100000 hypothetical American adults. Of these 100000,
   100000 × 0.7 = 70000 agree with the “iterative” statement. Of the 70000
   who agree with the “iterative” statement, 70000 × 0.182 = 12740 also have
   a postgraduate degree. Continue in this way to complete the table below.
2. Overall 15.11% of adults have a postgraduate degree (15110/100000 in the
table). The overall percentage is a weighted average of the three percent-
ages; 18.2% gets the most weight in the average because the “iterative”
statement has the highest percentage of people that agree with it com-
pared to “unchanging” and “not sure”.
0.1511 = (0.70)(0.182) + (0.14)(0.103) + (0.16)(0.058)
3. Of the 15110 who have a postgraduate degree, 12740 agree with the
   “iterative” statement, and 12740/15110 = 0.843. So 84.3% of those with a
   postgraduate degree agree that the scientific method is “iterative”. The
   value 0.843 is equal to the product of (1) 0.70, the overall proportion who
   agree with the “iterative” statement, and (2) 0.182, the proportion of those
   who agree with the “iterative” statement that have a postgraduate degree;
   divided by 0.1511, the overall proportion who have a postgraduate degree.
0.843 = (0.182 × 0.70)/0.1511

             HS      college   Bachelors   postgrad   total

iterative    21910   19320     16030       12740      70000
unchanging   5404    4396      2758        1442       14000
not sure     9168    4352      1552        928        16000
total        36482   28068     20340       15110      100000
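The construction of the two-way table can be expressed directly in code. This is just a sketch; all the numbers come from the example.

```python
# Marginal probabilities of each opinion, and conditional education
# breakdowns given each opinion (from the survey in Example 4.1).
opinions = {"iterative": 0.70, "unchanging": 0.14, "not sure": 0.16}
edu_given_opinion = {
    "iterative":  {"HS": 0.313, "college": 0.276, "Bachelors": 0.229, "postgrad": 0.182},
    "unchanging": {"HS": 0.386, "college": 0.314, "Bachelors": 0.197, "postgrad": 0.103},
    "not sure":   {"HS": 0.573, "college": 0.272, "Bachelors": 0.097, "postgrad": 0.058},
}

# Counts among 100000 hypothetical adults: N x P(opinion) x P(edu | opinion).
N = 100_000
table = {
    op: {edu: round(N * p_op * p_edu)
         for edu, p_edu in edu_given_opinion[op].items()}
    for op, p_op in opinions.items()
}

# Column total for "postgrad" (15110), and the conditional proportion
# P(iterative | postgrad) = 12740/15110.
postgrad_total = sum(table[op]["postgrad"] for op in opinions)
p_iter_given_postgrad = table["iterative"]["postgrad"] / postgrad_total
```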
Bayes’ rule for events specifies how a prior probability 𝑃 (𝐻) of event 𝐻
is updated in response to the evidence 𝐸 to obtain the posterior probability
𝑃 (𝐻|𝐸).
   𝑃 (𝐻|𝐸) = 𝑃 (𝐸|𝐻)𝑃 (𝐻)/𝑃 (𝐸)

• Event 𝐻 represents a particular hypothesis² (or model or case)

² We’re using “hypothesis” in the sense of a general scientific hypothesis, not necessarily a statistical null or alternative hypothesis.


55

• Event 𝐸 represents observed evidence (or data or information)


• 𝑃 (𝐻) is the unconditional or prior probability of 𝐻 (prior to observing
evidence 𝐸)
• 𝑃 (𝐻|𝐸) is the conditional or posterior probability of 𝐻 after observing
evidence 𝐸.
• 𝑃 (𝐸|𝐻) is the likelihood of evidence 𝐸 given hypothesis (or model or
case) 𝐻

Example 4.2. Continuing the previous example. Randomly select an American adult.

1. Consider the conditional probability that a randomly selected American


adult agrees that the scientific method is “iterative” given that they have a
postgraduate degree. Identify the prior probability, hypothesis, evidence,
likelihood, and posterior probability, and use Bayes’ rule to compute the
posterior probability.
2. Find the conditional probability that a randomly selected American adult
with a postgraduate degree agrees that the scientific method is “unchang-
ing”.
3. Find the conditional probability that a randomly selected American adult
with a postgraduate degree is not sure about which statement is best.
4. How many times more likely is it for an American adult to have a post-
graduate degree and agree with the “iterative” statement than to have a
postgraduate degree and agree with the “unchanging” statement?
5. How many times more likely is it for an American adult with a postgraduate
degree to agree with the “iterative” statement than to agree with the
“unchanging” statement?
6. What do you notice about the answers to the two previous parts?
7. How many times more likely is it for an American adult to agree with the
“iterative” statement than to agree with the “unchanging” statement?
8. How many times more likely is it for an American adult to have a postgraduate degree when the adult agrees with the iterative statement than when the adult agrees with the unchanging statement?
9. How many times more likely is it for an American adult with a postgraduate
degree to agree with the “iterative” statement than to agree with the
“unchanging” statement?
10. How are the values in the three previous parts related?

Solution to Example 4.2


1. This is essentially the same question as the last part of the previous problem, just with different terminology.
• The hypothesis is 𝐻1 , the event that the randomly selected adult
agrees with the “iterative” statement.

• The prior probability is 𝑃 (𝐻1 ) = 0.70, the overall or unconditional probability that a randomly selected American adult agrees with the “iterative” statement.
• The given “evidence” 𝐸 is the event that the randomly selected adult
has a postgraduate degree. The marginal probability of the evidence
is 𝑃 (𝐸) = 0.1511, which can be obtained by the law of total proba-
bility as in the previous problem.
• The likelihood is 𝑃 (𝐸|𝐻1 ) = 0.182, the conditional probability that
the adult has a postgraduate degree (the evidence) given that the
adult agrees with the “iterative” statement (the hypothesis).
• The posterior probability is 𝑃 (𝐻1 |𝐸) = 0.843, the conditional proba-
bility that a randomly selected American adult agrees that the scien-
tific method is “iterative” given that they have a postgraduate degree.
By Bayes’ rule,

   𝑃 (𝐻1 |𝐸) = 𝑃 (𝐸|𝐻1 )𝑃 (𝐻1 )/𝑃 (𝐸) = (0.182 × 0.70)/0.1511 = 0.843
2. Let 𝐻2 be the event that the randomly selected adult agrees with the
“unchanging” statement; the prior probability is 𝑃 (𝐻2 ) = 0.14. The
evidence 𝐸 is still “postgraduate degree” but now the likelihood of this
evidence is 𝑃 (𝐸|𝐻2 ) = 0.103 under the “unchanging” hypothesis. The
conditional probability that a randomly selected adult with a postgraduate
degree agrees that the scientific method is “unchanging” is
   𝑃 (𝐻2 |𝐸) = 𝑃 (𝐸|𝐻2 )𝑃 (𝐻2 )/𝑃 (𝐸) = (0.103 × 0.14)/0.1511 = 0.095
3. Let 𝐻3 be the event that the randomly selected adult is “not sure”; the
prior probability is 𝑃 (𝐻3 ) = 0.16. The evidence 𝐸 is still “postgraduate
degree” but now the likelihood of this evidence is 𝑃 (𝐸|𝐻3 ) = 0.058 under
the “not sure” hypothesis. The conditional probability that a randomly
selected adult with a postgraduate degree is “not sure” is
   𝑃 (𝐻3 |𝐸) = 𝑃 (𝐸|𝐻3 )𝑃 (𝐻3 )/𝑃 (𝐸) = (0.058 × 0.16)/0.1511 = 0.061
4. The probability that an American adult has a postgraduate degree and
agrees with the “iterative” statement is 𝑃 (𝐸 ∩ 𝐻1 ) = 𝑃 (𝐸|𝐻1 )𝑃 (𝐻1 ) =
0.182 × 0.70 = 0.1274. The probability that an American adult has a
postgraduate degree and agrees with the “unchanging” statement is 𝑃 (𝐸 ∩
𝐻2 ) = 𝑃 (𝐸|𝐻2 )𝑃 (𝐻2 ) = 0.103 × 0.14 = 0.01442. Since
   𝑃 (𝐸 ∩ 𝐻1 )/𝑃 (𝐸 ∩ 𝐻2 ) = (0.182 × 0.70)/(0.103 × 0.14) = 0.1274/0.01442 = 8.835
an American adult is 8.835 times more likely to have a postgraduate degree
and agree with the “iterative” statement than to have a postgraduate
degree and agree with the “unchanging” statement.

5. The conditional probability that an American adult with a postgraduate degree agrees with the “iterative” statement is 𝑃 (𝐻1 |𝐸) = 𝑃 (𝐸|𝐻1 )𝑃 (𝐻1 )/𝑃 (𝐸) = 0.182 × 0.70/0.1511 = 0.843. The conditional probability that an American adult with a postgraduate degree agrees with the “unchanging” statement is 𝑃 (𝐻2 |𝐸) = 𝑃 (𝐸|𝐻2 )𝑃 (𝐻2 )/𝑃 (𝐸) = 0.103 × 0.14/0.1511 = 0.09543. Since

   𝑃 (𝐻1 |𝐸)/𝑃 (𝐻2 |𝐸) = (0.182 × 0.70/0.1511)/(0.103 × 0.14/0.1511) = 0.84315/0.09543 = 8.835

   an American adult with a postgraduate degree is 8.835 times more likely to agree with the “iterative” statement than to agree with the “unchanging” statement.
6. The ratios are the same! Conditioning on having a postgraduate degree
just “slices” out the Americans who have a postgraduate degree. The
ratios are determined by the overall probabilities for Americans. The
conditional probabilities, given postgraduate degree, simply rescale the probabilities for Americans who have a postgraduate degree to add up to 1 (by dividing by 0.1511).
7. This is a ratio of prior probabilities: 0.70 / 0.14 = 5. An American adult
is 5 times more likely to agree with the “iterative” statement than to agree
with the “unchanging” statement.
8. This is a ratio of likelihoods: 0.182 / 0.103 = 1.767. An American adult is 1.767 times more likely to have a postgraduate degree when the adult agrees with the iterative statement than when the adult agrees with the unchanging statement.
9. This is a ratio of posterior probabilities: 0.8432 / 0.0954 = 8.835. An
American adult with a postgraduate degree is 8.835 times more likely to
agree with the “iterative” statement than to agree with the “unchanging”
statement.
10. The ratio of the posterior probabilities is equal to the product of the ratio
of the prior probabilities and the ratio of the likelihoods: 8.835 = 5×1.767.
Posterior is proportional to the product of prior and likelihood.
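The last relationship can be verified directly with the numbers from this example (a quick sketch):

```r
prior_ratio = 0.70 / 0.14        # ratio of prior probabilities: 5
likelihood_ratio = 0.182 / 0.103 # ratio of likelihoods: about 1.767

# ratio of posterior probabilities; the common denominator P(E) cancels
posterior_ratio = (0.182 * 0.70) / (0.103 * 0.14)

# posterior ratio = prior ratio * likelihood ratio
posterior_ratio                # about 8.835
prior_ratio * likelihood_ratio # same value
```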

Bayes’ rule is often used when there are multiple hypotheses or cases. Suppose 𝐻1 , … , 𝐻𝑘 is a series of distinct hypotheses which together account for all possibilities³, and 𝐸 is any event (evidence). Then Bayes’ rule implies that the posterior probability of any particular hypothesis 𝐻𝑗 satisfies

   𝑃 (𝐻𝑗 |𝐸) = 𝑃 (𝐸|𝐻𝑗 )𝑃 (𝐻𝑗 )/𝑃 (𝐸)

The marginal probability of the evidence, 𝑃 (𝐸), in the denominator can be calculated using the law of total probability

   𝑃 (𝐸) = ∑ᵏᵢ₌₁ 𝑃 (𝐸|𝐻𝑖 )𝑃 (𝐻𝑖 )

³ More formally, 𝐻1 , … , 𝐻𝑘 form a partition: 𝑃 (∪ᵏᵢ₌₁ 𝐻𝑖 ) = 1 and 𝐻1 , … , 𝐻𝑘 are disjoint, 𝐻𝑖 ∩ 𝐻𝑗 = ∅ for 𝑖 ≠ 𝑗.

The law of total probability says that we can interpret the unconditional prob-
ability 𝑃 (𝐸) as a probability-weighted average of the case-by-case conditional
probabilities 𝑃 (𝐸|𝐻𝑖 ) where the weights 𝑃 (𝐻𝑖 ) represent the probability of en-
countering each case.
Combining Bayes’ rule with the law of total probability,
   𝑃 (𝐻𝑗 |𝐸) = 𝑃 (𝐸|𝐻𝑗 )𝑃 (𝐻𝑗 )/𝑃 (𝐸)
             = 𝑃 (𝐸|𝐻𝑗 )𝑃 (𝐻𝑗 ) / ∑ᵏᵢ₌₁ 𝑃 (𝐸|𝐻𝑖 )𝑃 (𝐻𝑖 )

   𝑃 (𝐻𝑗 |𝐸) ∝ 𝑃 (𝐸|𝐻𝑗 )𝑃 (𝐻𝑗 )

The symbol ∝ is read “is proportional to”. The relative ratios of the posterior
probabilities of different hypotheses are determined by the product of the prior
probabilities and the likelihoods, 𝑃 (𝐸|𝐻𝑗 )𝑃 (𝐻𝑗 ). The marginal probability of
the evidence, 𝑃 (𝐸), in the denominator simply normalizes the numerators to
ensure that the updated probabilities sum to 1 over all the distinct hypotheses.
In short, Bayes’ rule says⁴

posterior ∝ likelihood × prior

In the previous examples, the prior probabilities for an American adult’s perception of the scientific method are 0.70 for “iterative”, 0.14 for “unchanging”, and 0.16 for “not sure”. After observing that the American has a postgraduate degree, the posterior probabilities for an American adult’s perception of the scientific method become 0.8432 for “iterative”, 0.0954 for “unchanging”, and 0.0614 for “not sure”. The following organizes the calculations in a Bayes’ table which illustrates “posterior is proportional to likelihood times prior”.
hypothesis   prior   likelihood   product   posterior
iterative     0.70        0.182    0.1274      0.8432
unchanging    0.14        0.103    0.0144      0.0954
not sure      0.16        0.058    0.0093      0.0614
sum           1.00           NA    0.1511      1.0000

⁴ “Posterior is proportional to likelihood times prior” summarizes the whole course in a single sentence.

The likelihood column depends on the evidence, in this case, observing that the American has a postgraduate degree. This column contains the probability of the same event, 𝐸 = “the American has a postgraduate degree”, under each of the distinct hypotheses:

• 𝑃 (𝐸|𝐻1 ) = 0.182, given the American agrees with the “iterative” state-
ment
• 𝑃 (𝐸|𝐻2 ) = 0.103, given the American agrees with the “unchanging” state-
ment
• 𝑃 (𝐸|𝐻3 ) = 0.058, given the American is “not sure”

Since each of these probabilities is computed under a different case, these values
do not need to add up to anything in particular. The sum of the likelihoods
is meaningless, which is why we have listed a sum of “NA” for the likelihood
column.
The “product” column contains the product of the values in the prior and likelihood columns. The product of prior and likelihood for “iterative” (0.1274) is 8.835 (0.1274/0.0144) times higher than the product of prior and likelihood for “unchanging” (0.0144). Therefore, Bayes’ rule implies that the conditional probability that an American with a postgraduate degree agrees with “iterative” should be 8.835 times higher than the conditional probability that an American with a postgraduate degree agrees with “unchanging”. Similarly, the conditional probability that an American with a postgraduate degree agrees with “iterative” should be 0.1274/0.0093 = 13.73 times higher than the conditional probability that an American with a postgraduate degree is “not sure”, and the conditional probability that an American with a postgraduate degree agrees with “unchanging” should be 0.0144/0.0093 = 1.55 times higher than the conditional probability that an American with a postgraduate degree is “not sure”. The last column just translates these relative relationships into probabilities that sum to 1.
The sum of the “product” column is 𝑃 (𝐸), the marginal probability of the
evidence. The sum of the product column represents the result of the law of total
probability calculation. However, for the purposes of determining the posterior
probabilities, it isn’t really important what 𝑃 (𝐸) is. Rather, it is the ratio of
the values in the “product” column that determine the posterior probabilities.
𝑃 (𝐸) is whatever it needs to be to ensure that the posterior probabilities sum
to 1 while maintaining the proper ratios.
The process of conditioning can be thought of as “slicing and renormalizing”.

• Extract the “slice” corresponding to the event being conditioned on (and discard the rest). For example, a slice might correspond to a particular row or column of a two-way table.

• “Renormalize” the values in the slice so that corresponding probabilities add up to 1.

We will see that the “slicing and renormalizing” interpretation also applies when
dealing with conditional distributions of random variables, and corresponding
plots. Slicing determines the shape; renormalizing determines the scale. Slicing
determines relative probabilities; renormalizing just makes sure they “add up”
to 1 while maintaining the proper ratios.
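For instance, conditioning on “postgraduate degree” slices out the “postgrad” column of the two-way table of counts and renormalizes it (a sketch using the counts from Example 4.1):

```r
# the "postgrad" slice of the two-way table of counts
postgrad = c(iterative = 12740, unchanging = 1442, not_sure = 928)

# renormalize so the conditional probabilities add up to 1
postgrad / sum(postgrad) # 0.8432, 0.0954, 0.0614
```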
Example 4.3. Now suppose we want to compute the posterior probabilities for
an American adult’s perception of the scientific method given that the randomly
selected American adult has some college but no Bachelor’s degree (“college”).

1. Before computing, make an educated guess for the posterior probabilities. In particular, will the changes from prior to posterior be more or less extreme given the American has some college but no Bachelor’s degree than when given the American has a postgraduate degree? Why?
2. Construct a Bayes table and compute the posterior probabilities. Compare
to the posterior probabilities given postgraduate degree from the previous
examples.

Solution to Example 4.3


1. We start with the same prior probabilities as before: 0.70 for iterative, 0.14 for unchanging, 0.16 for not sure. Now the evidence is that the American has some college but no Bachelor’s degree. The likelihood of the evidence (“college”) is 0.276 under the iterative hypothesis, 0.314 under the unchanging hypothesis, and 0.272 under the not sure hypothesis. The likelihood of the evidence does not change as much across the different hypotheses when the evidence is “college” as when the evidence was “postgraduate degree”. Therefore, the changes from prior to posterior should be less extreme when the evidence is “college” than when the evidence was “postgraduate degree”. Furthermore, since the likelihood doesn’t vary much across hypotheses when the evidence is “college”, we expect the posterior probabilities to be close to the prior probabilities.
2. See the table below. As expected, the posterior probabilities are closer
to the prior probabilities when the evidence is “college” than when the
evidence is “postgraduate degree”.

hypothesis = c("iterative", "unchanging", "not sure")

prior = c(0.70, 0.14, 0.16)

likelihood = c(0.276, 0.314, 0.272) # likelihood of college

product = prior * likelihood

posterior = product / sum(product)

bayes_table = data.frame(hypothesis,
                         prior,
                         likelihood,
                         product,
                         posterior) %>%
  add_row(hypothesis = "sum",
          prior = sum(prior),
          likelihood = NA,
          product = sum(product),
          posterior = sum(posterior))

kable(bayes_table, digits = 4, align = 'r')

hypothesis   prior   likelihood   product   posterior
iterative     0.70        0.276    0.1932      0.6883
unchanging    0.14        0.314    0.0440      0.1566
not sure      0.16        0.272    0.0435      0.1551
sum           1.00           NA    0.2807      1.0000
Like the scientific method, Bayesian analysis is often an iterative process. Posterior probabilities are updated after observing some information or data. These probabilities can then be used as prior probabilities before observing new data. Posterior probabilities can be sequentially updated as new data becomes available, with the posterior probabilities after the previous stage serving as the prior probabilities for the next stage. The final posterior probabilities only depend upon the cumulative data. It doesn’t matter if we sequentially update the posterior after each new piece of data or only once after all the data is available; the final posterior probabilities will be the same either way. Also, the final posterior probabilities are not impacted by the order in which the data are observed.
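A small sketch illustrates this equivalence. Here we reuse the numbers from the earlier examples purely as toy inputs, pretending the two likelihood vectors correspond to two pieces of evidence that are conditionally independent given each hypothesis (an assumption made for illustration only):

```r
prior = c(0.70, 0.14, 0.16)   # prior over three hypotheses
lik1 = c(0.182, 0.103, 0.058) # likelihood of a first observation
lik2 = c(0.276, 0.314, 0.272) # likelihood of a second observation

# sequential updating: posterior after the first observation
# serves as the prior for the second
post1 = prior * lik1 / sum(prior * lik1)
post_seq = post1 * lik2 / sum(post1 * lik2)

# one-shot updating using both observations at once
post_all = prior * lik1 * lik2 / sum(prior * lik1 * lik2)

all.equal(post_seq, post_all) # TRUE
```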
Chapter 5

Introduction to Estimation

A parameter is a number that describes the population, e.g., population mean, population proportion. The actual value of a parameter is almost always unknown.

A statistic is a number that describes the sample, e.g., sample mean, sample proportion. We can use observed sample statistics to estimate unknown population parameters.
Example 5.1. Most people are right-handed, and even the right eye is dominant
for most people. In a 2003 study reported in Nature, a German bio-psychologist
conjectured that this preference for the right side manifests itself in other ways
as well. In particular, he investigated if people have a tendency to lean their
heads to the right when kissing. The researcher observed kissing couples in
public places and recorded whether the couple leaned their heads to the right
or left. (We’ll assume this represents a randomly selected representative sample
of kissing couples.)
The parameter of interest in this study is the population proportion of kissing
couples who lean their heads to the right. Denote this unknown parameter 𝜃;
our goal is to estimate 𝜃 based on sample data.
Let 𝑌 be the number of couples in a random sample of 𝑛 kissing couples that lean to the right. Suppose that in a sample of 𝑛 = 12 couples 𝑦 = 8 leaned to the right. We’ll start with a non-Bayesian analysis.

1. If you were to estimate 𝜃 with a single number based on this sample data
alone, intuitively what number would you pick?
2. For a general 𝑛 and 𝜃, what is the distribution of 𝑌 ?
3. For the next few parts suppose 𝑛 = 12. For a moment we’ll only consider
these potential values for 𝜃: 0.1, 0.3, 0.5, 0.7, 0.9. If 𝜃 = 0.1 what is the
distribution of 𝑌 ? Compute and interpret the probability that 𝑌 = 8 if
𝜃 = 0.1.


4. If 𝜃 = 0.3 what is the distribution of 𝑌 ? Compute and interpret the


probability that 𝑌 = 8 if 𝜃 = 0.3.
5. If 𝜃 = 0.5 what is the distribution of 𝑌 ? Compute and interpret the
probability that 𝑌 = 8 if 𝜃 = 0.5.
6. If 𝜃 = 0.7 what is the distribution of 𝑌 ? Compute and interpret the
probability that 𝑌 = 8 if 𝜃 = 0.7.
7. If 𝜃 = 0.9 what is the distribution of 𝑌 ? Compute and interpret the
probability that 𝑌 = 8 if 𝜃 = 0.9.
8. Now remember that 𝜃 is unknown. If you had to choose your estimate of 𝜃 from the values 0.1, 0.3, 0.5, 0.7, 0.9, which one of these values would you choose based on observing 𝑦 = 8 couples leaning to the right in a sample of 12 kissing couples? Why?
9. Obviously our choice is not restricted to those five values of 𝜃. Describe in principle the process you would follow to find the estimate of 𝜃 based on observing 𝑦 = 8 couples leaning to the right in a sample of 12 kissing couples.
10. Let 𝑓(𝑦|𝜃) denote the probability of observing 𝑦 couples leaning to the
right in a sample of 12 kissing couples. Determine 𝑓(𝑦 = 8|𝜃) and sketch
a graph of it. What is this a function of? What is an appropriate name
for this function?
11. From the previous part, what seems like a reasonable estimate of 𝜃 based
solely on the data of observing 𝑦 = 8 couples leaning to the right in a
sample of 12 kissing couples?

Solution to Example 5.1


1. Seems reasonable to use the sample proportion 8/12 ≈ 0.667.
2. 𝑌 has a Binomial(𝑛, 𝜃) distribution.
3. If 𝜃 = 0.1 then 𝑌 has a Binomial(12, 0.1) distribution and 𝑃 (𝑌 = 8|𝜃 = 0.1) = (12 choose 8) 0.1⁸(1 − 0.1)¹²⁻⁸ ≈ 0.000; dbinom(8, 12, 0.1).
4. If 𝜃 = 0.3 then 𝑌 has a Binomial(12, 0.3) distribution and 𝑃 (𝑌 = 8|𝜃 = 0.3) = (12 choose 8) 0.3⁸(1 − 0.3)¹²⁻⁸ ≈ 0.008; dbinom(8, 12, 0.3).
5. If 𝜃 = 0.5 then 𝑌 has a Binomial(12, 0.5) distribution and 𝑃 (𝑌 = 8|𝜃 = 0.5) = (12 choose 8) 0.5⁸(1 − 0.5)¹²⁻⁸ ≈ 0.121; dbinom(8, 12, 0.5).
6. If 𝜃 = 0.7 then 𝑌 has a Binomial(12, 0.7) distribution and 𝑃 (𝑌 = 8|𝜃 = 0.7) = (12 choose 8) 0.7⁸(1 − 0.7)¹²⁻⁸ ≈ 0.231; dbinom(8, 12, 0.7).
7. If 𝜃 = 0.9 then 𝑌 has a Binomial(12, 0.9) distribution and 𝑃 (𝑌 = 8|𝜃 = 0.9) = (12 choose 8) 0.9⁸(1 − 0.9)¹²⁻⁸ ≈ 0.021; dbinom(8, 12, 0.9).
8. Compare the above values, and see the plots below. The probability of observing 𝑦 = 8 is greatest when 𝜃 = 0.7, so in some sense the data seem most “consistent” with 𝜃 = 0.7. The sample that we actually observed has the greatest likelihood of occurring when 𝜃 = 0.7 (among these choices for 𝜃). From a different perspective, if 𝜃 = 0.1 then the likelihood of observing 8 successes in a sample of size 12 is very small. Therefore, since we actually did observe 8 successes in a sample of size 12, the data do not seem consistent with 𝜃 = 0.1.
9. For each value of 𝜃 between 0 and 1 compute the probability of observing
𝑦 = 8, 𝑃 (𝑌 = 8|𝜃), and find which value of 𝜃 maximizes this probability.

10. 𝑓(𝑦 = 8|𝜃) = 𝑃 (𝑌 = 8|𝜃) = (12 choose 8) 𝜃⁸(1 − 𝜃)¹²⁻⁸. This is a function of 𝜃, with the data 𝑦 = 8 fixed. Since this function computes the likelihood of observing the data (evidence) under different values of 𝜃, “likelihood function” seems like an appropriate name. See the plots below.
11. The value which maximizes the likelihood of 𝑦 = 8 is 8/12. So the maximum likelihood estimate of 𝜃 is 8/12.

• For given data 𝑦, the likelihood function 𝑓(𝑦|𝜃) is the probability (or
density for continuous data) of observing the sample data 𝑦 viewed as a
function of the parameter 𝜃.
• In the likelihood function, the observed value of the data 𝑦 is treated as a
fixed constant.
• The value of a parameter that maximizes the likelihood function is called
a maximum likelihood estimate (MLE).
• The MLE depends on the data 𝑦. For given data 𝑦, the MLE is the value
of 𝜃 which gives the largest likelihood of having produced the observed
data 𝑦.
• Maximum likelihood estimation is a common frequentist technique for estimating the value of a parameter based on data from a sample.
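For the kissing data, the MLE can also be found numerically (a sketch; optimize is base R’s one-dimensional optimizer):

```r
# likelihood of y = 8 right-leaning couples in n = 12, as a function of theta
likelihood = function(theta) dbinom(8, 12, theta)

# maximize the likelihood over theta in (0, 1)
mle = optimize(likelihood, interval = c(0, 1), maximum = TRUE)$maximum

mle # approximately 8/12 = 0.667
```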
Example 5.2. We’ll now take a Bayesian approach to estimating 𝜃 in Example
5.1. We treat the unknown parameter 𝜃 as a random variable and wish to find
its posterior distribution after observing 𝑦 = 8 couples leaning to the right in a
sample of 12 kissing couples.
We will start with a very simplified, unrealistic prior distribution that assumes
only five possible, equally likely values for 𝜃: 0.1, 0.3, 0.5, 0.7, 0.9.

1. Sketch a plot of the prior distribution and fill in the prior column of the
Bayes table.
2. Now suppose that 𝑦 = 8 couples in a sample of size 𝑛 = 12 lean right.
Sketch a plot of the likelihood function and fill in the likelihood column
in the Bayes table.
3. Complete the Bayes table and sketch a plot of the posterior distribution.
What does the posterior distribution say about 𝜃? How does it compare to
the prior and the likelihood? If you had to estimate 𝜃 with a single number
based on this posterior distribution, what number might you pick?
4. Now consider a prior distribution which places probability 1/9, 2/9, 3/9,
2/9, 1/9 on the values 0.1, 0.3, 0.5, 0.7, 0.9, respectively. Redo the previous
parts. How does the posterior distribution change?
5. Now consider a prior distribution which places probability 5/15, 4/15,
3/15, 2/15, 1/15 on the values 0.1, 0.3, 0.5, 0.7, 0.9, respectively. Redo
the previous parts. How does the posterior distribution change?

Solution to Example 5.2

1. See plot below; the prior is “flat”.

2. The likelihood is computed as in Example 5.1. See the plots above.

3. See the Bayes table below. Since the prior is flat, the posterior is proportional to the likelihood. The values 0.7 and 0.5 account for the bulk of posterior plausibility, and 0.7 is about twice as plausible as 0.5. If we had to estimate 𝜃 with a single number, we might pick 0.7 because that has the highest posterior probability.

theta = seq(0.1, 0.9, 0.2)

# prior
prior = rep(1, length(theta))
prior = prior / sum(prior)

# data
n = 12 # sample size
y = 8 # sample count of success

# likelihood, using binomial
likelihood = dbinom(y, n, theta) # function of theta

# posterior
product = likelihood * prior
posterior = product / sum(product)

# bayes table
bayes_table = data.frame(theta,
                         prior,
                         likelihood,
                         product,
                         posterior)

kable(bayes_table %>%
        adorn_totals("row"),
      digits = 4,
      align = 'r')

theta   prior   likelihood   product   posterior
  0.1     0.2       0.0000    0.0000      0.0000
  0.3     0.2       0.0078    0.0016      0.0205
  0.5     0.2       0.1208    0.0242      0.3171
  0.7     0.2       0.2311    0.0462      0.6065
  0.9     0.2       0.0213    0.0043      0.0559
Total     1.0       0.3811    0.0762      1.0000

# plots
plot(theta-0.01, prior, type='h', xlim=c(0, 1), ylim=c(0, 1), col="skyblue", xlab='theta', ylab='')
par(new=T)
plot(theta+0.01, likelihood/sum(likelihood), type='h', xlim=c(0, 1), ylim=c(0, 1), col="orange", xlab='', ylab='')
par(new=T)
plot(theta, posterior, type='h', xlim=c(0, 1), ylim=c(0, 1), col="seagreen", xlab='', ylab='')
legend("topleft", c("prior", "scaled likelihood", "posterior"), lty=1, col=c("skyblue", "orange", "seagreen"))

[Plot: prior, scaled likelihood, and posterior versus theta.]

4. See table and plot below. Because the prior probability is now greater
for 0.5 than for 0.7, the posterior probability of 𝜃 = 0.5 is greater than
in the previous part, and the posterior probability of 𝜃 = 0.7 is less than
in the previous part. The values 0.7 and 0.5 still account for the bulk of
posterior plausibility, but now 0.7 is only about 1.3 times more plausible
than 0.5. If we had to estimate 𝜃 with a single number, we might pick 0.7
because that has the highest posterior probability.

theta    prior   likelihood   product   posterior
  0.1   0.1111       0.0000    0.0000      0.0000
  0.3   0.2222       0.0078    0.0017      0.0181
  0.5   0.3333       0.1208    0.0403      0.4207
  0.7   0.2222       0.2311    0.0514      0.5365
  0.9   0.1111       0.0213    0.0024      0.0247
Total   1.0000       0.3811    0.0957      1.0000
[Plot: prior, scaled likelihood, and posterior versus theta.]

5. See the table and plot below. The prior probability is large for 0.1 and
0.3, but since the likelihood corresponding to these values is so small, the
posterior probabilities are small. This posterior distribution is similar to
the one from the previous part.

theta    prior   likelihood   product   posterior
  0.1   0.3333       0.0000    0.0000      0.0000
  0.3   0.2667       0.0078    0.0021      0.0356
  0.5   0.2000       0.1208    0.0242      0.4132
  0.7   0.1333       0.2311    0.0308      0.5269
  0.9   0.0667       0.0213    0.0014      0.0243
Total   1.0000       0.3811    0.0585      1.0000

[Plot: prior, scaled likelihood, and posterior versus theta.]

Bayesian estimation

• Regards parameters as random variables with probability distributions


• Assigns a subjective prior distribution to parameters
• Conditions on the observed data
• Applies Bayes’ rule to produce a posterior distribution for parameters

posterior ∝ likelihood × prior

• Determines parameter estimates from the posterior distribution

In a Bayesian analysis, the posterior distribution contains all relevant information about parameters. That is, all Bayesian inference is based on the posterior distribution. The posterior distribution is a compromise between

• prior “beliefs”, as represented by the prior distribution


• data, as represented by the likelihood function

In contrast, a frequentist approach regards parameters as unknown but fixed (not random) quantities. Frequentist estimates are commonly determined by the likelihood function.
It is helpful to plot prior, likelihood, and posterior on the same plot. Since prior and posterior are probability distributions, they are on the same scale. However, remember that the likelihood does not add up to anything in particular. To put the likelihood on the same scale as prior and posterior, it is helpful to rescale the likelihood so that it adds up to 1. Such a rescaling does not change the shape of the likelihood; it merely allows for easier comparison with prior and posterior.
Example 5.3. Continuing Example 5.2. While the previous exercise introduced
the main ideas, it was unrealistic to consider only five possible values of 𝜃.

1. What are the possible values of 𝜃? Does the parameter 𝜃 take values on a
continuous or discrete scale? (Careful: we’re talking about the parameter
and not the data.)
2. Let’s assume that any multiple of 0.0001 is a possible value of 𝜃:
0, 0.0001, 0.0002, … , 0.9999, 1. Assume a discrete uniform prior distribution on these values. Suppose again that 𝑦 = 8 couples in a sample
of 𝑛 = 12 kissing couples lean right. Use software to plot the prior
distribution, the (scaled) likelihood function, and then find the posterior
and plot it. Describe the posterior distribution. What does it say about
𝜃? If you had to estimate 𝜃 with a single number based on this posterior
distribution, what number might you pick?
3. Now assume a prior distribution which is proportional to 1 − 2|𝜃 − 0.5|
for 𝜃 = 0, 0.0001, 0.0002, … , 0.9999, 1. Use software to plot this prior;
what does it say about 𝜃? Then suppose again that 𝑦 = 8 couples in a
sample of 𝑛 = 12 kissing couples lean right. Use software to plot the prior
distribution, the (scaled) likelihood function, and then find the posterior
and plot it. What does the posterior distribution say about 𝜃? If you had
to estimate 𝜃 with a single number based on this posterior distribution,
what number might you pick?
4. Now assume a prior distribution which is proportional to 1 − 𝜃 for 𝜃 =
0, 0.0001, 0.0002, … , 0.9999, 1. Use software to plot this prior; what does it
say about 𝜃? Then suppose again that 𝑦 = 8 couples in a sample of 𝑛 = 12
kissing couples lean right. Use software to plot the prior distribution, the
(scaled) likelihood function, and then find the posterior and plot it. What
does the posterior distribution say about 𝜃? If you had to estimate 𝜃 with
a single number based on this posterior distribution, what number might
you pick?
5. Compare the posterior distributions corresponding to the three different
priors. How does each posterior distribution compare to the prior and the
likelihood? Does the prior distribution influence the posterior distribu-
tion?
Solution to Example 5.3

1. The parameter 𝜃 is a proportion, so it can possibly take any value in the continuous interval from 0 to 1.
2. See plot below. Since the prior is flat, the posterior is proportional to the likelihood. So the posterior distribution places highest posterior probability on values near the sample proportion 8/12. The interval of values from about 0.4 to 0.9 accounts for almost all of the posterior plausibility. If we had to estimate 𝜃 with a single number, we might pick 8/12 ≈ 0.667 because that has the highest posterior probability.

theta = seq(0, 1, 0.0001)

# prior
prior = rep(1, length(theta))
prior = prior / sum(prior)

# data
n = 12 # sample size
y = 8 # sample count of success

# likelihood, using binomial
likelihood = dbinom(y, n, theta) # function of theta

# plots
plot_posterior <- function(theta, prior, likelihood){

  # posterior
  product = likelihood * prior
  posterior = product / sum(product)

  ylim = c(0, max(c(prior, posterior, likelihood / sum(likelihood))))

  plot(theta, prior, type='l', xlim=c(0, 1), ylim=ylim, col="skyblue", xlab='theta', ylab='')
  par(new=T)
  plot(theta, likelihood/sum(likelihood), type='l', xlim=c(0, 1), ylim=ylim, col="orange", xlab='', ylab='')
  par(new=T)
  plot(theta, posterior, type='l', xlim=c(0, 1), ylim=ylim, col="seagreen", xlab='', ylab='')
  legend("topleft", c("prior", "scaled likelihood", "posterior"), lty=1, col=c("skyblue", "orange", "seagreen"))
}

plot_posterior(theta, prior, likelihood)


[Plot: prior, scaled likelihood, and posterior versus theta.]

3. See plot below. The posterior is a compromise between the “triangular” prior, which places highest prior probability near 0.5, and the likelihood. For this posterior, the posterior probability is greater near 0.5 than for the one in the previous part; the posterior here has been “shifted” towards 0.5 relative to the posterior from the previous part. If we had to estimate 𝜃 with a single number, we might pick 0.615 because that has the highest posterior probability.

# prior
theta = seq(0, 1, 0.0001)
prior = 1 - 2 * abs(theta - 0.5)
prior = prior / sum(prior)

# data
n = 12 # sample size
y = 8 # sample count of success

# likelihood, using binomial


likelihood = dbinom(y, n, theta) # function of theta

# plots
plot_posterior(theta, prior, likelihood)

[Plot: prior, scaled likelihood, and posterior versus theta.]

4. Again the posterior is a compromise between prior and likelihood. The
prior probabilities are greatest for values of 𝜃 near 0; however, the likeli-
hood corresponding to these values is small, so the posterior probabilities
are close to 0. As in the previous part, some of the posterior probability is
shifted towards 0.5, as opposed to what happens with the uniform prior.
The posterior here is fairly similar to the one from the previous part, and
the maximum posterior probability still occurs around 0.61.

theta = seq(0, 1, 0.0001)

# prior
prior = 1 - theta
prior = prior / sum(prior)

# data
n = 12 # sample size
y = 8 # sample count of success

# likelihood, using binomial


likelihood = dbinom(y, n, theta) # function of theta

# plots
plot_posterior(theta, prior, likelihood)
(Plot: decreasing prior proportional to 1 − theta, scaled likelihood, and posterior versus theta.)

5. For the “flat” prior, the posterior is proportional to the likelihood. For the
other priors, the posterior is a compromise between prior and likelihood.
The prior does have some influence. We do see three somewhat different
posterior distributions corresponding to these three prior distributions.

• Even in situations where the data are discrete (e.g., binary success/failure
data, count data), most statistical parameters take values on a continuous
scale.
• Thus in a Bayesian analysis, parameters are usually continuous random
variables, and have continuous probability distributions, a.k.a., densities.
• An alternative to dealing with continuous distributions is to use grid
approximation: Treat the parameter as discrete, on a sufficiently fine
grid of values, and use discrete distributions.

Example 5.4. Continuing Example 5.1. Now we’ll perform a Bayesian analysis
on the actual study data in which 80 couples out of a sample of 124 leaned right.
We’ll again use a grid approximation and assume that any multiple of 0.0001
between 0 and 1 is a possible value of 𝜃: 0, 0.0001, 0.0002, … , 0.9999, 1.

1. Before performing the Bayesian analysis, use software to plot the likelihood
when 𝑦 = 80 couples in a sample of 𝑛 = 124 kissing couples lean right,
and compute the maximum likelihood estimate of 𝜃 based on this data.
How does the likelihood for this sample compare to the likelihood based
on the smaller sample (8/12) from previous exercises?
2. Now back to Bayesian analysis. Assume a discrete uniform prior distri-
bution for 𝜃. Suppose that 𝑦 = 80 couples in a sample of 𝑛 = 124 kissing
couples lean right. Use software to plot the prior distribution, the like-
lihood function, and then find the posterior and plot it. Describe the
posterior distribution. What does it say about 𝜃?
3. Now assume a prior distribution which is proportional to 1 − 2|𝜃 − 0.5|
for 𝜃 = 0, 0.0001, 0.0002, … , 0.9999, 1. Then suppose again that 𝑦 = 80
couples in a sample of 𝑛 = 124 kissing couples lean right. Use software
to plot the prior distribution, the likelihood function, and then find the
posterior and plot it. What does the posterior distribution say about 𝜃?
4. Now assume a prior distribution which is proportional to 1 − 𝜃 for 𝜃 =
0, 0.0001, 0.0002, … , 0.9999, 1. Then suppose again that 𝑦 = 80 couples in
a sample of 𝑛 = 124 kissing couples lean right. Use software to plot the
prior distribution, the likelihood function, and then find the posterior and
plot it. What does the posterior distribution say about 𝜃?
5. Compare the posterior distributions corresponding to the three different
priors. How does each posterior distribution compare to the prior and
the likelihood? Comment on the influence that the prior distribution has.
Does the Bayesian inference for these data appear to be highly sensitive
to the choice of prior? How does this compare to the 𝑛 = 12 situation?
6. If you had to produce a single number Bayesian estimate of 𝜃 based on
the sample data, what number might you pick?

Solution to Example 5.4.

1. See plot below. The likelihood function is 𝑓(𝑦 = 80|𝜃) = (124 choose 80) 𝜃^80 (1 − 𝜃)^(124−80),
0 ≤ 𝜃 ≤ 1, the likelihood of observing a value of 𝑦 = 80 from
a Binomial(124, 𝜃) distribution (dbinom(80, 124, theta)). The maxi-
mum likelihood estimate of 𝜃 is the sample proportion 80/124 = 0.645.
With the larger sample, the likelihood function is more “peaked” around
its maximum.
2. See plot below. Since the prior is flat, the posterior is proportional to
the likelihood. The posterior places almost all of its probability on 𝜃
values between about 0.55 and 0.75, with the highest probability near the
observed sample proportion of 0.645.

# prior
theta = seq(0, 1, 0.0001)
prior = rep(1, length(theta))
prior = prior / sum(prior)

# data
n = 124 # sample size
y = 80 # sample count of success

# likelihood, using binomial


likelihood = dbinom(y, n, theta) # function of theta

# plots
plot_posterior(theta, prior, likelihood)

(Plot: flat prior, scaled likelihood, and posterior versus theta for y = 80, n = 124.)

3. See the plot below. The posterior is very similar to the one from the
previous part.

(Plot: triangular prior, scaled likelihood, and posterior versus theta.)

4. See the plot below. The posterior is very similar to the one from the
previous part.

(Plot: decreasing prior, scaled likelihood, and posterior versus theta.)

5. Even though the priors are different, the posterior distributions are all
similar to each other and all similar to the shape of the likelihood. Com-
paring these priors it does not appear that the posterior is highly sensitive
to choice of prior. The data carry more weight when 𝑛 = 124 than they
did when 𝑛 = 12. In other words, the prior has less influence when the
sample size is larger. When the sample size is larger, the likelihood is
more “peaked” and so the likelihood, and hence posterior, is small outside
a narrower range of values than when the sample size is small.

6. It seems that regardless of the prior, the posterior distribution is about
the same, yielding the highest posterior probability around the sample
proportion 0.645. So if we had to estimate 𝜃 with a single number, we
might just choose the sample proportion. In this case, we end up with the
same numerical estimate of 𝜃 in both the Bayesian and frequentist analysis.
But the process and interpretation differs between the two approaches;
we’ll discuss in more detail soon.

5.1 Point estimation

In a Bayesian analysis, the posterior distribution contains all relevant information
about parameters after observing sample data. We often use certain
summary characteristics of the posterior distribution to make inferences about
parameters.

Example 5.5. Continuing the kissing study in Example 5.2 where 𝜃 can only
take values 0.1, 0.3, 0.5, 0.7, 0.9. Consider a prior distribution which places
probability 5/15, 4/15, 3/15, 2/15, 1/15 on the values 0.1, 0.3, 0.5, 0.7, 0.9,
respectively. Suppose we want a single number point estimate of 𝜃. What are
some reasonable choices?

1. Suppose we want a single number point estimate of 𝜃 before observing
sample data. Find the mode of the prior distribution of 𝜃, a.k.a., the
“prior mode”.
2. Find the median of the prior distribution of 𝜃, a.k.a., the “prior median”.
3. Find the expected value of the prior distribution of 𝜃, a.k.a., the “prior
mean”.
Now suppose that 𝑦 = 8 couples in a sample of size 𝑛 = 12 lean right.
Recall the Bayes table.
theta prior likelihood product posterior
0.1 0.3333 0.0000 0.0000 0.0000
0.3 0.2667 0.0078 0.0021 0.0356
0.5 0.2000 0.1208 0.0242 0.4132
0.7 0.1333 0.2311 0.0308 0.5269
0.9 0.0667 0.0213 0.0014 0.0243
Total 1.0000 0.3811 0.0585 1.0000

4. Find the mode of the posterior distribution of 𝜃, a.k.a., the “posterior
mode”.
5. Find the median of the posterior distribution of 𝜃, a.k.a., the “posterior
median”.
6. Find the expected value of the posterior distribution of 𝜃, a.k.a., the “pos-
terior mean”.
7. How have the posterior values changed from the respective prior values?

Solution to Example 5.5.


1. The prior mode is 0.1, the value of 𝜃 with the greatest prior probability.
2. The prior median is 0.3. Start with the smallest possible value of 𝜃 and
add up the prior probabilities until they go from below 0.5 to above 0.5.
This happens when you add in the prior probability for 𝜃 = 0.3.
3. The prior mean is 0.367. Remember that an expected value is a
probability-weighted average value

0.1(5/15) + 0.3(4/15) + 0.5(3/15) + 0.7(2/15) + 0.9(1/15) = 0.367.



4. The posterior mode is 0.7, the value of 𝜃 with the greatest posterior prob-
ability.
5. The posterior median is 0.7. Start with the smallest possible value of 𝜃
and add up the posterior probabilities until they go from below 0.5 to
above 0.5. This happens when you add in the posterior probability for
𝜃 = 0.7.
6. The posterior mean is 0.608. Now the posterior probabilities are used in
the probability-weighted average value

0.1(0.000) + 0.3(0.036) + 0.5(0.413) + 0.7(0.527) + 0.9(0.024) = 0.608.

7. The point estimates (mode, median, mean) shift from their prior values
(0.1, 0.3, 0.367) towards the observed sample proportion of 8/12. However,
the posterior distribution is not symmetric, and the posterior mean is less
than the posterior median. In particular, note that the posterior mean
(0.608) lies between the prior mean (0.367) and the sample proportion
(0.667).
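The hand calculations above can be reproduced in R using the same grid style as the rest of the chapter. This is a sketch, not code from the text, so the variable names are ours:

```r
# Bayes table for Example 5.5: five possible values of theta
theta = c(0.1, 0.3, 0.5, 0.7, 0.9)
prior = c(5, 4, 3, 2, 1) / 15

# data: y = 8 successes in n = 12 trials
likelihood = dbinom(8, 12, theta)

# posterior is proportional to likelihood * prior
product = likelihood * prior
posterior = product / sum(product)

# prior point estimates
theta[which.max(prior)]              # prior mode: 0.1
min(theta[cumsum(prior) >= 0.5])     # prior median: 0.3
sum(theta * prior)                   # prior mean: 0.367

# posterior point estimates
theta[which.max(posterior)]          # posterior mode: 0.7
min(theta[cumsum(posterior) >= 0.5]) # posterior median: 0.7
sum(theta * posterior)               # posterior mean: 0.608
```

The median computation mirrors the "add up probabilities until they cross 0.5" rule described in the solution.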

A point estimate of an unknown parameter is a single-number estimate of
the parameter. Given a posterior distribution of a parameter 𝜃, three possible
Bayesian point estimates of 𝜃 are:

• the posterior mean
• the posterior median
• the posterior mode.

In particular, the posterior mean is the expected value of 𝜃 according to the
posterior distribution.
Recall that the expected value, a.k.a., mean, of a discrete random variable 𝑈 is
its probability-weighted average value

E(𝑈) = ∑_𝑢 𝑢 𝑃(𝑈 = 𝑢)

In the calculation of a posterior mean, the parameter 𝜃 plays the role of the ran-
dom variable 𝑈 and the posterior distribution provides the probability-weights.
In many situations, the posterior distribution will be roughly symmetric with a
single peak, in which case posterior mean, median, and mode will all be about
the same.
Reducing the posterior distribution to a single-number point estimate loses a
lot of the information the posterior distribution provides. The entire posterior
distribution quantifies the uncertainty about 𝜃 after observing sample data. We
will soon see how to more fully use the posterior distribution in making inference
about 𝜃.

Example 5.6. Continuing the kissing study in Example 5.3. Now
assume a prior distribution which is proportional to 1 − 𝜃 for 𝜃 =
0, 0.0001, 0.0002, … , 0.9999, 1. Use software to answer the following.

1. Find the mode of the prior distribution of 𝜃, a.k.a., the “prior mode”.

2. Find the median of the prior distribution of 𝜃, a.k.a., the “prior median”.

3. Find the expected value of the prior distribution of 𝜃, a.k.a., the “prior
mean”.

Now suppose that 𝑦 = 8 couples in a sample of size 𝑛 = 12 lean right.
Recall the prior, likelihood, and posterior.

theta = seq(0, 1, 0.0001)

# prior
prior = 1 - theta # shape of prior
prior = prior / sum(prior) # scales so that prior sums to 1

# data
n = 12 # sample size
y = 8 # sample count of success

# likelihood, using binomial


likelihood = dbinom(y, n, theta) # function of theta

# posterior
product = likelihood * prior
posterior = product / sum(product)
(Plot: prior proportional to 1 − theta, scaled likelihood, and posterior versus theta.)

4. Find the mode of the posterior distribution of 𝜃, a.k.a., the “posterior
mode”.

5. Find the median of the posterior distribution of 𝜃, a.k.a., the “posterior
median”.

6. Find the expected value of the posterior distribution of 𝜃, a.k.a., the “pos-
terior mean”.

7. How have the posterior values changed from the respective prior values?

Solution to Example 5.6.

1. See the code below. The prior mode is 0.

2. See the code below. The prior median is 0.293.

3. See the code below. The prior mean is 0.333.

## prior

# prior mode
theta[which.max(prior)]

## [1] 0

# prior median
min(theta[which(cumsum(prior) >= 0.5)])

## [1] 0.2929

# prior mean
sum(theta * prior)

## [1] 0.3333

4. See the code below. The posterior mode is 0.615.

5. See the code below. The posterior median is 0.605.

6. See the code below. The posterior mean is 0.6.

7. Each of the posterior point estimates has shifted from its prior value to-
wards the sample proportion of 0.667. But note that each of the posterior
point estimates is in between the prior point estimate and the sample
proportion.

## posterior

# posterior mode
theta[which.max(posterior)]

## [1] 0.6154

# posterior median
min(theta[which(cumsum(posterior) >= 0.5)])

## [1] 0.6046

# posterior mean
sum(theta * posterior)

## [1] 0.6

Example 5.7. Continuing Example 5.6, now suppose that 𝑦 = 80 couples in
a sample of size 𝑛 = 124 lean right (the actual study data). Recall the prior,
likelihood, and posterior.

# data
n = 124 # sample size
y = 80 # sample count of success

# likelihood, using binomial


likelihood = dbinom(y, n, theta) # function of theta

# posterior
product = likelihood * prior
posterior = product / sum(product)

(Plot: prior proportional to 1 − theta, scaled likelihood, and posterior versus theta for y = 80, n = 124.)

1. Find the mode of the posterior distribution of 𝜃, a.k.a., the “posterior
mode”.
2. Find the median of the posterior distribution of 𝜃, a.k.a., the “posterior
median”.
3. Find the expected value of the posterior distribution of 𝜃, a.k.a., the “pos-
terior mean”.
4. How have the posterior values changed from the respective prior values?
How does this compare to the smaller sample (8 out of 12)?

Solution to Example 5.7.

1. See the code below. The posterior mode is 0.64.



2. See the code below. The posterior median is 0.639.


3. See the code below. The posterior mean is 0.638.
4. The posterior distribution is roughly symmetric, and posterior mean, me-
dian, and mode are about the same. Each of the posterior point estimates
has shifted from its prior value towards the sample proportion of 0.645.
The posterior point estimates are now closer to the sample proportion
than they were with the smaller sample size. With the larger sample size,
the data carry more weight.

## posterior

# posterior mode
theta[which.max(posterior)]

## [1] 0.64

# posterior median
min(theta[which(cumsum(posterior) >= 0.5)])

## [1] 0.6385

# posterior mean
sum(theta * posterior)

## [1] 0.6378
Chapter 6

Introduction to Inference

In a Bayesian analysis, the posterior distribution contains all relevant information
about parameters after observing sample data. We can use the posterior
distribution to make inferences about parameters.
Example 6.1. Suppose we want to estimate 𝜃, the population proportion of
American adults who have read a book in the last year.

1. Sketch your prior distribution for 𝜃. Make a guess for your prior mode.
2. Suppose Henry formulates a Normal distribution prior for 𝜃. Henry’s prior
mean is 0.4 and prior standard deviation is 0.1. What does Henry’s prior
say about 𝜃?
3. Suppose Mudge formulates a Normal distribution prior for 𝜃. Mudge’s
prior mean is 0.4 and prior standard deviation is 0.05. Who has more
prior certainty about 𝜃? Why?

Solution to Example 6.1.


1. Your prior distribution is whatever it is and represents your assessment of
the degree of uncertainty of 𝜃.
2. The prior mean, median, and mode are all 0.4. A Normal distribution
follows the empirical rule. In particular, the interval [0.3, 0.5] accounts
for 68% of prior plausibility, [0.2, 0.6] for 95%, and [0.1, 0.7] for 99.7% of
prior plausibility. Henry thinks 𝜃 is around 0.4, about twice as likely to
lie inside the interval [0.3, 0.5] than to lie outside, and he would be pretty
surprised if 𝜃 were outside of [0.1, 0.7].
3. Mudge has more prior certainty about 𝜃 due to the smaller prior standard
deviation. The interval [0.3, 0.5] accounts for 95% of Mudge’s plausibility,
versus 68% for Henry.
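The empirical rule claims in the solution can be checked numerically with pnorm. A quick sketch using the prior means and standard deviations from the example:

```r
# prior probability each analyst assigns to the interval [0.3, 0.5]
henry = pnorm(0.5, mean = 0.4, sd = 0.1) - pnorm(0.3, mean = 0.4, sd = 0.1)
mudge = pnorm(0.5, mean = 0.4, sd = 0.05) - pnorm(0.3, mean = 0.4, sd = 0.05)
henry  # about 0.68: [0.3, 0.5] is within 1 prior SD of Henry's mean
mudge  # about 0.95: [0.3, 0.5] is within 2 prior SDs of Mudge's mean
```

Both probabilities match the empirical rule values quoted above.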


The previous section discussed Bayesian point estimates of parameters, including
the posterior mean, median, and mode. In some sense these values provide
a single number “best guess” of the value of 𝜃. However, reducing the posterior
distribution to a single-number point estimate loses a lot of the information
the posterior distribution provides. In particular, the posterior distribution
quantifies the uncertainty about 𝜃 after observing sample data. The posterior
standard deviation summarizes in a single number the degree of uncertainty
about 𝜃 after observing sample data. The smaller the posterior standard de-
viation, the more certainty we have about the value of the parameter after
observing sample data. (Similar considerations apply for the prior distribution.
The prior standard deviation summarizes in a single number the degree of
uncertainty about 𝜃 before observing sample data.)
Recall that the variance of a random variable 𝑈 is its probability-weighted
average squared distance from its expected value
Var(𝑈) = E[(𝑈 − E(𝑈))²]

The following is an equivalent “shortcut” formula for variance: “expected value
of the square minus the square of the expected value.”

Var(𝑈) = E(𝑈²) − (E(𝑈))²

The standard deviation of a random variable is the square root of its variance:
SD(𝑈) = √Var(𝑈). Standard deviation is measured in the same measurement
units as the variable itself, while variance is measured in squared units.
In the calculation of a posterior standard deviation, 𝜃 plays the role of the
variable 𝑈 and the posterior distribution provides the probability-weights.
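As a quick sanity check that the definition and the shortcut formula agree, here is a small sketch; the values and probability weights are ours, chosen only for illustration:

```r
u = c(0.1, 0.3, 0.5, 0.7, 0.9)       # possible values of a discrete variable
p = c(0.05, 0.20, 0.40, 0.25, 0.10)  # probability weights (sum to 1)

mu = sum(u * p)                      # expected value E(U)
var_def = sum((u - mu)^2 * p)        # probability-weighted squared distance
var_short = sum(u^2 * p) - mu^2      # E(U^2) - (E(U))^2
sd_u = sqrt(var_def)                 # SD, in the same units as U

var_def - var_short                  # essentially 0 (floating point): the formulas agree
```

In a posterior standard deviation calculation, `u` would be the grid of 𝜃 values and `p` the posterior probabilities.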

Example 6.2. Continuing Example 6.1, we’ll assume a Normal prior distribu-
tion for 𝜃 with prior mean 0.4 and prior standard deviation 0.1.

1. Compute and interpret the prior probability that 𝜃 is greater than 0.7.

2. Find the 25th and 75th percentiles of the prior distribution. What is
the prior probability that 𝜃 lies in the interval with these percentiles as
endpoints? According to the prior, how plausible is it for 𝜃 to lie inside
this interval relative to outside it? (Hint: use qnorm)

3. Repeat the previous part with the 10th and 90th percentiles of the prior
distribution.

4. Repeat the previous part with the 1st and 99th percentiles of the prior
distribution.
In a sample of 150 American adults, 75% have read a book in the past
year. (The 75% value is motivated by a real sample we’ll see in a later
example.)

5. Find the posterior distribution based on this data, and make a plot of
prior, likelihood, and posterior.

6. Compute the posterior standard deviation. How does it compare to the
prior standard deviation? Why?

7. Compute and interpret the posterior probability that 𝜃 is greater than 0.7.
Compare to the prior probability.

8. Find the 25th and 75th percentiles of the posterior distribution. What is
the posterior probability that 𝜃 lies in the interval with these percentiles
as endpoints? According to the posterior, how plausible is it for 𝜃 to lie
inside this interval relative to outside it? Compare to the prior interval.

9. Repeat the previous part with the 10th and 90th percentiles of the poste-
rior distribution.

10. Repeat the previous part with the 1st and 99th percentiles of the posterior
distribution.

Solution to Example 6.2.


1. We can use software (1 - pnorm(0.7, 0.4, 0.1)) but we can also use
the empirical rule. For a Normal(0.4, 0.1) distribution, the value 0.7 is
(0.7 − 0.4)/0.1 = 3 SDs above the mean, so the probability is about 0.0015 (since
about 99.7% of values are within 3 SDs of the mean). According to the
prior, there is about a 0.1% chance that more than 70% of Americans
adults have read a book in the last year.

2. We can use qnorm(0.75) = 0.6745 to find that the 75th percentile of
a Normal distribution is about 0.67 SDs above the mean, so the 25th
percentile is about 0.67 SDs below the mean. For the prior distribution,
the 25th percentile is about 0.33 and the 75th percentile is about 0.47.
The prior probability that 𝜃 lies in the interval [0.33, 0.47] is about 50%.
According to the prior, it is equally plausible for 𝜃 to lie inside the interval
[0.33, 0.47] as to lie outside this interval.

3. We can use qnorm(0.9) = 1.28 to find that the 90th percentile of a Normal
distribution is about 1.28 SDs above the mean, so the 10th percentile is
about 1.28 SDs below the mean. For the prior distribution, the 10th
percentile is about 0.27 and the 90th percentile is about 0.53. The prior
probability that 𝜃 lies in the interval [0.27, 0.53] is about 80%. According
to the prior, it is four times more plausible for 𝜃 to lie inside the interval
[0.27, 0.53] than to lie outside this interval.

4. We can use qnorm(0.99) = 2.33 to find that the 99th percentile of a Nor-
mal distribution is about 2.33 SDs above the mean, so the 1st percentile
is about 2.33 SDs below the mean. For the prior distribution, the 1st
percentile is about 0.167 and the 99th percentile is about 0.633. The prior
probability that 𝜃 lies in the interval [0.167, 0.633] is about 98%. Accord-
ing to the prior, it is 49 times more plausible for 𝜃 to lie inside the interval
[0.167, 0.633] than to lie outside this interval.

5. See below for a plot. Our prior gave very little plausibility to a sample
like the one we actually observed. However, given our sample data, the
likelihood corresponding to the values of 𝜃 we initially deemed most plau-
sible is very low. Therefore, our posterior places most of the plausibility
on values in the neighborhood of the observed sample proportion, even
though the prior probability for many of these values was low. The prior
does still have some influence; the posterior mean is 0.709 so we haven’t
shifted all the way towards the sample proportion yet.

6. Compute the posterior variance first using either the definition or the
shortcut version, then take the square root; see code below. The posterior
SD is 0.036, almost 3 times smaller than the prior SD. After observing data
we have more certainty about the value of the parameter, resulting in a
smaller posterior SD. The posterior distribution is approximately Normal
with posterior mean 0.709 and posterior SD 0.036.

7. We can use the grid approximation; just sum the posterior probabilities for
𝜃 > 0.7 to see that the posterior probability is about 0.603. Since the pos-
terior distribution is approximately Normal, we can also use the empirical
rule: the standardized value for 0.7 is (0.7 − 0.709)/0.036 = −0.24, or 0.24 SDs below
the mean. Using the empirical rule (or software 1 - pnorm(-0.24)) gives
about 0.596, similar to the grid calculation.
We started with a very low prior probability that more than 70% of Amer-
ican adults have read at least one book in the last year. But after observ-
ing a sample in which more than 70% have read at least one book in the
last year, we assign a much higher plausibility to more than 70% of all
American adults having read at least one book in the last year. Seeing is
believing.

8. See code below for calculations based on the grid approximation. But we
can also use the fact that the posterior distribution is approximately Normal;
e.g., the 25th percentile is about 0.67 SDs below the mean: 0.709 − 0.67 ×
0.036 = 0.684. For the posterior distribution, the 25th percentile is about
0.684 and the 75th percentile is about 0.733. The posterior probability
that 𝜃 lies in the interval [0.684, 0.733] is about 50%. According to the
posterior, it is equally plausible for 𝜃 to lie inside the interval [0.684, 0.733]
as to lie outside this interval. This 50% interval is both (1) narrower than
the prior interval, due to the smaller posterior SD, and (2) shifted towards
higher values of 𝜃 relative to the prior interval, due to the larger posterior
mean.
9. For the posterior distribution, the 10th percentile is about 0.662 and the
90th percentile is about 0.754. The posterior probability that 𝜃 lies in the
interval [0.662, 0.754] is about 80%. According to the posterior, it is four
times more plausible for 𝜃 to lie inside the interval [0.662, 0.754] than to lie
outside this interval. This 80% interval is both (1) narrower than the prior
interval, due to the smaller posterior SD, and (2) shifted towards higher
values of 𝜃 relative to the prior interval, due to the larger posterior mean.
10. For the posterior distribution, the 1st percentile is about 0.622 and the
99th percentile is about 0.789. The posterior probability that 𝜃 lies in
the interval [0.622, 0.789] is about 98%. According to the posterior, it is
49 times more plausible for 𝜃 to lie inside the interval [0.622, 0.789] than to
lie outside this interval. This interval is both (1) narrower than the prior
interval, due to the smaller posterior SD, and (2) shifted towards higher
values of 𝜃 relative to the prior interval, due to the larger posterior mean.

theta = seq(0, 1, 0.0001)

# prior
prior = dnorm(theta, 0.4, 0.1) # shape of prior
prior = prior / sum(prior) # scales so that prior sums to 1

# data
n = 150 # sample size
y = round(0.75 * n, 0) # sample count of success

# likelihood, using binomial


likelihood = dbinom(y, n, theta) # function of theta

# posterior
product = likelihood * prior
posterior = product / sum(product)

# posterior mean
posterior_mean = sum(theta * posterior)
posterior_mean

## [1] 0.7024

# posterior_variance - "shortcut" formula


posterior_var = sum(theta ^ 2 * posterior) - posterior_mean ^ 2
posterior_sd = sqrt(posterior_var)
posterior_sd

## [1] 0.03597

# posterior probability that theta is greater than 0.7


sum(posterior[theta > 0.7])

## [1] 0.5345

# posterior percentiles - central 50% interval


theta[max(which(cumsum(posterior) < 0.25))]

## [1] 0.6783

theta[max(which(cumsum(posterior) < 0.75))]

## [1] 0.727

# posterior percentiles - central 80% interval


theta[max(which(cumsum(posterior) < 0.1))]

## [1] 0.6557

theta[max(which(cumsum(posterior) < 0.9))]

## [1] 0.7481

# posterior percentiles - central 98% interval


theta[max(which(cumsum(posterior) < 0.01))]

## [1] 0.6161

theta[max(which(cumsum(posterior) < 0.99))]

## [1] 0.7828
(Plot: Normal(0.4, 0.1) prior, scaled likelihood, and posterior versus theta.)

Bayesian inference for a parameter is based on its posterior distribution. Since
a Bayesian analysis treats parameters as random variables, it is possible to make
probability statements about a parameter.
A Bayesian credible interval is an interval of values for the parameter that
has at least the specified probability, e.g., 50%, 80%, 98%. Credible intervals
can be computed based on both the prior and the posterior distribution, though
we are primarily interested in intervals based on the posterior distribution. For
example,

• With a 50% credible interval, it is equally plausible that the parameter
lies inside the interval as outside

• With an 80% credible interval, it is 4 times more plausible that the pa-
rameter lies inside the interval than outside
• With a 98% credible interval, it is 49 times more plausible that the pa-
rameter lies inside the interval than outside

Central credible intervals split the complementary probability evenly between
the two tails. For example,

• The endpoints of a 50% central posterior credible interval are the 25th
and the 75th percentiles of the posterior distribution.
• The endpoints of an 80% central posterior credible interval are the 10th
and the 90th percentiles of the posterior distribution.

• The endpoints of a 98% central posterior credible interval are the 1st and
the 99th percentiles of the posterior distribution.
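With a grid approximation, the endpoints of a central credible interval can be read directly off the cumulative posterior, mirroring the percentile computations used earlier in the chapter. A small helper sketch (the function name is ours, not from the text):

```r
# central credible interval from a grid posterior
central_credible_interval <- function(theta, posterior, prob = 0.5) {
  tail_prob = (1 - prob) / 2
  cdf = cumsum(posterior)
  lower = min(theta[cdf >= tail_prob])      # e.g., 25th percentile for prob = 0.5
  upper = min(theta[cdf >= 1 - tail_prob])  # e.g., 75th percentile for prob = 0.5
  c(lower, upper)
}
```

For example, `central_credible_interval(theta, posterior, 0.98)` returns the 1st and 99th posterior percentiles.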

There is nothing special about the values 50%, 80%, 98%. These are just a few
convenient choices¹ whose endpoints correspond to “round number” percentiles
(1st, 10th, 25th, 75th, 90th, 99th) and inside/outside ratios (1-to-1, 4-to-1,
about 50-to-1). You could also throw in, say 70% (15th and 85th percentiles,
about 2-to-1) or 90% (5th and 95th percentiles, about 10-to-1), if you wanted.
As the previous example illustrates, it’s not necessary to just select a single
credible interval (e.g., 95%). Bayesian inference is based on the full posterior
distribution. Credible intervals simply provide a summary of this distribution.
Reporting a few credible intervals, rather than just one, provides a richer picture
of how the posterior distribution represents the uncertainty in the parameter.
In many situations, the posterior distribution of a single parameter is approx-
imately Normal, so posterior probabilities can be approximated with Normal
distribution calculations — standardizing and using the empirical rule. In par-
ticular, an approximate central credible interval has endpoints

posterior mean ± 𝑧* × posterior SD

where 𝑧* is the appropriate multiple for a standard Normal distribution corresponding
to the specified probability. For example,

Central credibility   50%    80%    95%    98%
Normal 𝑧 multiple     0.67   1.28   1.96   2.33
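The 𝑧 multiples in the table are standard Normal percentiles and can be recovered with qnorm:

```r
central_prob = c(0.50, 0.80, 0.95, 0.98)
# upper percentile leaving (1 - prob)/2 probability in each tail
z = qnorm(1 - (1 - central_prob) / 2)
round(z, 2)  # 0.67 1.28 1.96 2.33
```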

Central credible intervals are easier to compute, but are not the only or most
widely used credible intervals. A highest posterior density interval is the
interval of values that has the specified posterior probability and is such that the
posterior density within the interval is never lower than the posterior density
outside the interval. If the posterior distribution is relatively symmetric with
a single peak, central posterior credible intervals and highest posterior density
intervals are similar.
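For a grid posterior, one way to approximate a highest posterior density interval is to accumulate grid values from highest posterior probability downward until the specified probability is reached. A sketch (our own helper, assuming a unimodal posterior so the accumulated values form an interval):

```r
# approximate highest posterior density interval from a grid posterior
hpd_interval <- function(theta, posterior, prob = 0.95) {
  ord = order(posterior, decreasing = TRUE)          # highest density first
  n_keep = which(cumsum(posterior[ord]) >= prob)[1]  # smallest set reaching prob
  range(theta[ord[1:n_keep]])                        # endpoints of that set
}
```

For a symmetric single-peaked posterior this agrees closely with the central credible interval; for a skewed posterior it shifts toward the peak.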
Example 6.3. Continuing Example 6.1, we’ll assume a Normal prior distribu-
tion for 𝜃 with prior mean 0.4 and prior standard deviation 0.1.
In a recent survey of 1502 American adults conducted by the Pew Research
Center, 75% of those surveyed said they have read a book in the past year.
¹ In Section 3.2.2 of Statistical Rethinking (McElreath (2020)), the author suggests 67%,
89%, and 97%: “a series of nested intervals may be more useful than any one interval. For
example, why not present 67%, 89%, and 97% intervals, along with the median? Why these
values? No reason. They are prime numbers, which makes them easy to remember. But all
that matters is they be spaced enough to illustrate the shape of the posterior. And these values
avoid 95%, since conventional 95% intervals encourage many readers to conduct unconscious
hypothesis tests.”

1. Find the posterior distribution based on this data, and make a plot of
prior, likelihood, and posterior. Describe the posterior distribution. How
does this posterior compare to the one based on the smaller sample size
(𝑛 = 150)?
2. Compute and interpret the posterior probability that 𝜃 is greater than 0.7.
Compare to the prior probability.
3. Compute and interpret in context 50%, 80%, and 98% central posterior
credible intervals.
4. Here is how the survey question was worded: “During the past 12 months,
about how many BOOKS did you read either all or part of the way
through? Please include any print, electronic, or audiobooks you may
have read or listened to.” Does this change your conclusions? Explain.
Solution to Example 6.3.


1. See below for code and plots. The posterior distribution is approximately
Normal with posterior mean 0.745 and posterior SD 0.011. Despite our
prior beliefs that 𝜃 was in the 0.4 range, enough data has convinced us
otherwise. With a large sample size, the prior has little influence on the
posterior; much less than with the smaller sample size. Compared to the
posterior based on the small sample size, the posterior now (1) has shifted
to the neighborhood of the sample data, (2) exhibits a smaller degree of
uncertainty about the parameter.
2. The posterior probability that 𝜃 is greater than 0.7 is about 0.9999. We
started with only a 0.1% chance that more than 70% of American adults
have read a book in the last year, but the large sample has convinced us
otherwise.
3. There is a posterior probability of:

• 50% that the population proportion of American adults who have
read a book in the past year is between 0.737 and 0.753. We believe
that the population proportion is as likely to be inside this interval
as outside.
• 80% that the population proportion of American adults who have
read a book in the past year is between 0.730 and 0.759. We believe
that the population proportion is four times more likely to be inside
this interval than to be outside it.
• 98% that the population proportion of American adults who have
read a book in the past year is between 0.718 and 0.771. We believe
that the population proportion is 49 times more likely to be inside
this interval than to be outside it.

In short, our conclusion is that somewhere-in-the-70s percent of American
adults have read a book in the past year. But see the next part…

4. It depends on what our goal is. Do we want to only count completed
books? Does there have to be a certain word count? Does it count if
the adult read a children’s book? Does listening to an audiobook count?
Does it have to be for “fun” or does reading books for work count? If our
goal is to estimate the population proportion of Americans who have read
completely a 100,000 word non-audiobook book in the last year, then this
particular sample data would be fairly biased from our perspective.

theta = seq(0, 1, 0.0001)

# prior
prior = dnorm(theta, 0.4, 0.1) # shape of prior
prior = prior / sum(prior) # scales so that prior sums to 1

# data
n = 1502 # sample size
y = round(0.75 * n, 0) # sample count of success

# likelihood, using binomial
likelihood = dbinom(y, n, theta) # function of theta

# posterior
product = likelihood * prior
posterior = product / sum(product)

# posterior mean
posterior_mean = sum(theta * posterior)
posterior_mean

## [1] 0.745

# posterior variance - "shortcut" formula
posterior_var = sum(theta ^ 2 * posterior) - posterior_mean ^ 2
posterior_sd = sqrt(posterior_var)
posterior_sd

## [1] 0.01123

# posterior probability that theta is greater than 0.7
sum(posterior[theta > 0.7])

## [1] 0.9999

# posterior percentiles - central 50% interval
theta[max(which(cumsum(posterior) < 0.25))]

## [1] 0.7374

theta[max(which(cumsum(posterior) < 0.75))]

## [1] 0.7525

# posterior percentiles - central 80% interval
theta[max(which(cumsum(posterior) < 0.1))]

## [1] 0.7304

theta[max(which(cumsum(posterior) < 0.9))]

## [1] 0.7592

# posterior percentiles - central 98% interval
theta[max(which(cumsum(posterior) < 0.01))]

## [1] 0.7183

theta[max(which(cumsum(posterior) < 0.99))]

## [1] 0.7705

(Plot: prior, scaled likelihood, and posterior as functions of theta.)

The quality of any statistical analysis depends very heavily on the quality of
the data. Always investigate how the data were collected to determine what
conclusions are appropriate. Is the sample reasonably representative of the
population? Were the variables reliably measured?
Example 6.4. Continuing Example 6.3, we’ll use the same sample data (𝑛 =
1502, 75%) but now we’ll consider different priors.
For each of the priors below, plot prior, likelihood, and posterior, and compute
the posterior probability that 𝜃 is greater than 0.7. Compare to Example 6.3.

1. Normal distribution prior with prior mean 0.4 and prior SD 0.05.
2. Uniform distribution prior on the interval [0, 0.7]

Solution. to Example 6.4


1. The Normal(0.4, 0.05) prior concentrates almost all prior plausibility in a
fairly narrow range of values (0.25 to 0.55 or so) and represents more prior
certainty about 𝜃 than the Normal(0.4, 0.1) prior. Even with the large
sample size, we see that the Normal(0.4, 0.05) prior has more influence
on the posterior than the Normal(0.4, 0.1). However, the two posterior
distributions are not that different: Normal(0.73, 0.011) here compared
with Normal(0.745, 0.011) from the previous problem. Both posteriors
assign almost all posterior credibility to values in the low to mid 70s
percent. In particular, the posterior probability that 𝜃 is greater than 0.7
is 0.997 (compared with 0.9999 from the previous problem).
2. The Uniform prior distribution spreads prior plausibility over a fairly wide
range of values, [0, 0.7]. However, the prior probability that 𝜃 is greater
than 0.7 is 0. Even when we observed a large sample with a sample
proportion greater than 0.7, the posterior probability that 𝜃 is greater than
0.7 remains 0. See the plot below; the posterior distribution is basically
a spike that puts almost all of the posterior credibility on the value 0.7.
Assigning 0 prior probability for 𝜃 values greater than 0.7 has essentially
identified such 𝜃 values as impossible, and no amount of data can make
the impossible possible.

(Plot: prior, scaled likelihood, and posterior for the Normal(0.4, 0.05) prior.)

(Plot: prior, scaled likelihood, and posterior for the Uniform[0, 0.7] prior.)

You have a great deal of flexibility in choosing a prior, and there are many
reasonable approaches. However, do NOT choose a prior that assigns 0
probability/density to possible values of the parameter regardless of how initially
implausible the values are. Even very stubborn priors can be overturned with
enough data, but no amount of data can turn a prior probability of 0 into a
positive posterior probability. Always consider the range of possible values of
the parameter, and be sure the prior density is non-zero over the entire range
of possible values.

6.1 Comparing Bayesian and frequentist interval estimates
The most widely used elements of “traditional” frequentist inference are
confidence intervals and hypothesis tests (a.k.a., null hypothesis significance tests).
The numerical results of Bayesian and frequentist analysis are often similar.
However, the interpretations are very different.

Example 6.5. We’ll now compare the Bayesian credible intervals in Example
6.4 to frequentist confidence intervals. Recall the actual study data in which
75% of the 1502 American adults surveyed said they read a book in the last
year.

1. Compute a 98% confidence interval for 𝜃.



2. Write a clearly worded sentence reporting the confidence interval in context.
3. Explain what “98% confidence” means.
4. Compare the numerical results of the Bayesian and frequentist analysis.
Are they similar or different?
5. How does the interpretation of these results differ between the two approaches?

Solution. to Example 6.5


1. The observed sample proportion is 𝑝̂ = 0.75 and its standard error is
√𝑝̂(1 − 𝑝̂)/𝑛 = √0.75(1 − 0.75)/1502 = 0.011. The usual formula for a
confidence interval for a population proportion is

𝑝̂ ± 𝑧∗ √𝑝̂(1 − 𝑝̂)/𝑛

where 𝑧∗ is the multiple from a standard Normal distribution corresponding
to the level of confidence (e.g., 𝑧∗ = 2.33 for 98% confidence). A 98%
confidence interval for 𝜃 is [0.724, 0.776].
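The arithmetic above can be checked with a few lines of R (a quick sketch; `qnorm(0.99)` returns the exact multiplier, about 2.326, rather than the rounded 2.33):

```r
p_hat = 0.75
n = 1502

# standard error of the sample proportion
se = sqrt(p_hat * (1 - p_hat) / n)

# multiplier for 98% confidence: 1% in each tail
z_star = qnorm(0.99)

# 98% confidence interval
round(p_hat + c(-1, 1) * z_star * se, 3)

## [1] 0.724 0.776
```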

2. We estimate with 98% confidence that the population proportion of American
adults who have read a book in the last year is between 0.724 and
0.776.

3. Confidence is in the estimation procedure. Over many samples, 98% of
samples will yield confidence intervals, computed using the above formula,
that contain the true parameter value (a fixed number). The intervals
change from sample to sample; the parameter is fixed.

4. The numerical results are similar: the 98% posterior credible interval is
similar to the 98% confidence interval. Both reflect a conclusion that we
think that somewhere-in-the-70s percent of American adults have read at
least one book in the past year.

5. However, the interpretation of these results is very different between the
two approaches.
The Bayesian approach provides probability statements about the parameter:
There is a 98% chance that 𝜃 is between 0.718 and 0.771; our assessment
is that 𝜃 is 49 times more likely to lie inside the interval [0.718, 0.771]
than outside.
In the frequentist approach such a probability statement makes no sense.
From the frequentist perspective, 𝜃 is an unknown number: either that
number is in the interval [0.724, 0.776] or it’s not; there’s no probability
to it. Rather, the frequentist approach develops procedures based on the

probability of what might happen over many samples. Notice that in the
interpretation of what 98% confidence means above, the actual numbers
[0.724, 0.776] did not appear. The confidence is in the procedure that
produced the interval, and not in the interval itself.

Example 6.6. Have more than 70% of Americans read a book in the last year?
We’ll now compare the Bayesian analysis in Example 6.4 to a frequentist (null)
hypothesis (significance) test. Recall the actual study data in which 75% of the
1502 American adults surveyed said they read a book in the last year.

1. Conduct an appropriate hypothesis test.


2. Write a clearly worded sentence reporting the conclusion of the hypothesis
test in context.
3. Write a clearly worded sentence interpreting the p-value in context.
4. Now back to the Bayesian analysis of Example 6.4. Compute the posterior
probability that 𝜃 is less than or equal to 0.70.
5. Compare the numerical values of the posterior probability and the p-value.
Are they similar or different?
6. How does the interpretation of these results differ between the two approaches?

Solution. to Example 6.6


1. The null hypothesis is 𝐻0 ∶ 𝜃 = 0.7. The alternative hypothesis is
𝐻𝑎 ∶ 𝜃 > 0.7. The standard deviation of the null distribution is
√0.7(1 − 0.7)/1502 = 0.0118. The standardized (test) statistic is
(0.75 − 0.7)/0.0118 = 4.23. With such a large sample size, the null
distribution of sample proportions is approximately Normal, so the
p-value is approximately 1 - pnorm(4.23) = 0.000012.
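The test statistic and p-value can be reproduced in R (a sketch of the large-sample one-proportion z-test described above):

```r
p_hat = 0.75
n = 1502
theta_0 = 0.70

# standard deviation of the null distribution
se_null = sqrt(theta_0 * (1 - theta_0) / n)

# standardized test statistic
z = (p_hat - theta_0) / se_null   # about 4.23

# one-sided p-value from the Normal approximation
1 - pnorm(z)                      # about 0.000012
```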

2. With a p-value of 0.000012 we have extremely strong evidence to reject
the null hypothesis and conclude that more than 70% of Americans have
read a book in the last year.

3. Interpreting the p-value

• If the population proportion of Americans who have read a book in
the last year is equal to 0.7,
• Then we would observe a sample proportion of 0.75 or more in about
0.0012% (about 1 in 100,000) of random samples of size 1502.
• Since we actually observed a sample proportion of 0.75, which would
be extremely unlikely if the population proportion were 0.7,
• The data provide evidence that the population proportion is not 0.7.

4. See Example 6.4 where we computed the posterior probability that 𝜃 is
greater than 0.7. The posterior probability that 𝜃 is less than or equal to
0.7 is 0.000051.
Note: in the frequentist hypothesis test, the null hypothesis 𝐻0 ∶ 𝜃 = 0.7
is operationally the same as 𝐻0 ∶ 𝜃 ≤ 0.7; the test is conducted the same
way and results in the same p-value. Computing the posterior probability
that 𝜃 ≤ 0.7 is like computing the probability that the null hypothesis is
true. Now, the p-value is not the probability that the null hypothesis is
true, even though that is a common misinterpretation. But there is no
direct Bayesian analog of a p-value, so this will have to do.

5. The numerical results are similar; both the p-value and the posterior
probability are on the order of 1/100000. Both reflect a strong endorsement of
the conclusion that more than 70% of Americans have read a book in the
past year.

6. However, the interpretation of these results is very different between the
two approaches.
The Bayesian analysis computes a probability that 𝜃 < 0.7: there’s an
extremely small probability that 𝜃 is less than 0.7, so we’d be willing to
bet a very large amount of money that it’s not.
But such a probability makes no sense from a frequentist perspective. From
the frequentist perspective, the unknown parameter 𝜃 is a number: either
that number is greater than 0.7 or it’s not; there’s no probability to it.
The p-value is a probability referring to what would happen over many
samples.

Since a Bayesian analysis treats parameters as random variables, it is possible to
make probability statements about parameters. In contrast, a frequentist
analysis treats unknown parameters as fixed — that is, not random — so probability
statements do not apply to parameters. In a frequentist approach, probability
statements (like “95% confidence”) are based on how the sample data would
behave over many hypothetical samples.
In a Bayesian approach

• Parameters are random variables and have distributions.


• Observed data are treated as fixed, not random.
• All inference is based on the posterior distribution of parameters which
quantifies our uncertainty about the parameters.
• The posterior distribution quantifies our uncertainty in the parameters,
after observing the sample data.
• The posterior (or prior) distribution can be used to make probability
statements about parameters.

• For example, “95% credible” quantifies our assessment that the parameter
is 19 times more likely to lie inside the credible interval than outside.
(Roughly, we’d be willing to bet at 19-to-1 odds on whether 𝜃 is inside the
interval [0.718, 0.771].)

In a frequentist approach

• Parameters are treated as fixed (not random), but unknown numbers


• Data are treated as random
• All inference is based on the sampling distribution of the data which
quantifies how the data behaves over many hypothetical samples.
• For example, “95% confidence” is confidence in the procedure: confidence
intervals vary from sample-to-sample; over many samples 95% of
confidence intervals contain the parameter being estimated.
Chapter 7

Introduction to Prediction

A Bayesian analysis leads directly and naturally to making predictions about
future observations from the random process that generated the data.
Prediction is also useful for checking if model assumptions seem reasonable in
light of observed data.

Example 7.1. Do people prefer to use the word “data” as singular or plural?
Data journalists at FiveThirtyEight conducted a poll to address this question
(and others). Rather than simply ask whether the respondent considered “data”
to be singular or plural, they asked which of the following sentences they prefer:

a. Some experts say it’s important to drink milk, but the data is inconclusive.
b. Some experts say it’s important to drink milk, but the data are inconclusive.

Suppose we wish to study the opinions of students in Cal Poly statistics classes
regarding this issue. That is, let 𝜃 represent the population proportion of
students in Cal Poly statistics classes who prefer to consider data as a singular
noun, as in option a) above.
To illustrate ideas, we’ll start with a prior distribution which places probability
0.01, 0.05, 0.15, 0.30, 0.49 on the values 0.1, 0.3, 0.5, 0.7, 0.9, respectively.

1. Before observing any data, suppose we plan to randomly select a single Cal
Poly statistics student. Consider the unconditional prior probability that
the selected student prefers data as singular. (This is called a prior
predictive probability.) Explain how you could use simulation to approximate
this probability.

2. Compute the prior predictive probability from the previous part.


3. Before observing any data, suppose we plan to randomly select a sample
of 35 Cal Poly statistics students. Consider the unconditional prior
distribution of the number of students in the sample who prefer data as
singular. (This is called a prior predictive distribution.) Explain how you
could use simulation to approximate this distribution. In particular, how
could you use simulation to approximate the prior predictive probability
that at least 34 students in the sample prefer data as singular?

4. Compute and interpret the prior predictive probability that at least 34
students in a sample of size 35 prefer data as singular.
For the remaining parts, suppose that 31 students in a sample
of 35 Cal Poly statistics students prefer data as singular.

5. Find the posterior distribution of 𝜃.

6. Now suppose we plan to randomly select an additional Cal Poly statistics
student. Consider the posterior predictive probability that this student
prefers data as singular. Explain how you could use simulation to estimate
this probability.

7. Compute the posterior predictive probability from the previous part.

8. Suppose we plan to collect data on another sample of 35 Cal Poly statistics
students. Consider the posterior predictive distribution of the number of
students in the new sample who prefer data as singular. Explain how you
could use simulation to approximate this distribution. In particular, how
could you use simulation to approximate the posterior predictive probability
that at least 34 students in the sample prefer data as singular? (Of course,
the sample size of the new sample does not have to be 35. However, we’re
keeping it the same so we can compare the prior and posterior predictions.)

9. Compute and interpret the posterior predictive probability that at least
34 students in a sample of size 35 prefer data as singular.

Solution. to Example 7.1

1. If we knew what 𝜃 was, this probability would just be 𝜃. For example,
if 𝜃 = 0.9, then there is a probability of 0.9 that a randomly selected
student prefers data singular. If 𝜃 were 0.9, we could approximate the
probability by constructing a spinner with 90% of the area marked as
“success”, spinning it many times, and recording the proportion of spins
that land on success, which should be roughly 90%. However, 0.9 is only
one possible value of 𝜃. Since we don’t know what 𝜃 is, we need to first
simulate a value of it, giving more weight to 𝜃 values with high prior
probability. Therefore, we

1. Simulate a value of 𝜃 from the prior distribution.



2. Given the value of 𝜃, construct a spinner that lands on success with
probability 𝜃. Spin the spinner once and record the result, success or
not.
3. Repeat steps 1 and 2 many times, and find the proportion of repetitions
which result in success. This proportion approximates the
unconditional prior probability of success.
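These three steps can be sketched in R; `sample()` plays the role of choosing the spinner and `rbinom()` of spinning it (a minimal simulation; the seed, repetition count, and variable names are illustrative choices):

```r
set.seed(111)  # for reproducibility

theta_values = c(0.1, 0.3, 0.5, 0.7, 0.9)
prior = c(0.01, 0.05, 0.15, 0.30, 0.49)

n_rep = 100000

# Step 1: simulate a value of theta from the prior distribution
theta_sim = sample(theta_values, n_rep, replace = TRUE, prob = prior)

# Step 2: given theta, spin the success/failure spinner once
y_sim = rbinom(n_rep, 1, theta_sim)

# Step 3: proportion of repetitions resulting in success
mean(y_sim)  # approximately 0.742
```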

2. Use the law of total probability, where the weights are given by the prior
probabilities.

0.1(0.01) + 0.3(0.05) + 0.5(0.15) + 0.7(0.30) + 0.9(0.49) = 0.742

(This calculation is equivalent to the expected value of 𝜃 according to its
prior distribution, that is, the prior mean.)
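The same weighted average, computed in R:

```r
theta_values = c(0.1, 0.3, 0.5, 0.7, 0.9)
prior = c(0.01, 0.05, 0.15, 0.30, 0.49)

# law of total probability: prior predictive probability = prior mean
sum(theta_values * prior)  # 0.742
```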
3. If we knew what 𝜃 was, we could construct a spinner that lands on success
with probability 𝜃, spin it 35 times, and count the number of successes.
Since we don’t know what 𝜃 is, we need to first simulate a value of it,
giving more weight to 𝜃 values with high prior probability. Therefore, we

1. Simulate a value of 𝜃 from the prior distribution.


2. Given the value of 𝜃, construct a spinner that lands on success with
probability 𝜃. Spin the spinner 35 times and count 𝑦, the number of
spins that land on success.
3. Repeat steps 1 and 2 many times, and record the number of successes
(out of 35) for each repetition. Summarize the simulated 𝑦 values to
approximate the prior predictive distribution. To approximate the
prior predictive probability that at least 34 students in a sample
of size 35 prefer data as singular, count the number of simulated
repetitions that result in at least 34 successes (𝑦 ≥ 34) and divide by
the total number of simulated repetitions.
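A sketch of this simulation in R, now spinning the spinner 35 times per repetition (seed and repetition count are again illustrative):

```r
set.seed(222)

theta_values = c(0.1, 0.3, 0.5, 0.7, 0.9)
prior = c(0.01, 0.05, 0.15, 0.30, 0.49)

n_rep = 100000

# Step 1: simulate a value of theta from the prior distribution
theta_sim = sample(theta_values, n_rep, replace = TRUE, prob = prior)

# Step 2: given theta, count successes in a sample of size 35
y_sim = rbinom(n_rep, 35, theta_sim)

# approximate prior predictive probability of at least 34 successes
mean(y_sim >= 34)  # approximately 0.06
```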

4. If we knew 𝜃, the probability of at least 34 (out of 35) successes is, from
a Binomial distribution,

35𝜃³⁴(1 − 𝜃) + 𝜃³⁵

Use the law of total probability again.

(35(0.1)³⁴(1 − 0.1) + 0.1³⁵) (0.01) + (35(0.3)³⁴(1 − 0.3) + 0.3³⁵) (0.05)
+ (35(0.5)³⁴(1 − 0.5) + 0.5³⁵) (0.15) + (35(0.7)³⁴(1 − 0.7) + 0.7³⁵) (0.30)
+ (35(0.9)³⁴(1 − 0.9) + 0.9³⁵) (0.49) = 0.06
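The same law-of-total-probability computation in R, with `pbinom()` handling the Binomial tail probabilities:

```r
theta_values = c(0.1, 0.3, 0.5, 0.7, 0.9)
prior = c(0.01, 0.05, 0.15, 0.30, 0.49)

# P(Y >= 34) given each value of theta, for Y ~ Binomial(35, theta)
prob_at_least_34 = 1 - pbinom(33, 35, theta_values)

# weight by the prior probabilities
sum(prob_at_least_34 * prior)  # approximately 0.06
```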

According to this model, about 6% of samples of size 35 would result
in at least 34 successes. The value 0.06 accounts for both (1) our prior
uncertainty about 𝜃, and (2) sample-to-sample variability in the number of
successes 𝑦.

5. The likelihood is (35 choose 31) 𝜃³¹(1 − 𝜃)⁴, a function of 𝜃; dbinom(31, 35, theta).
The posterior places almost all probability on 𝜃 = 0.9, due to both its high
prior probability and high likelihood.

theta = seq(0.1, 0.9, 0.2)

# prior
prior = c(0.01, 0.05, 0.15, 0.30, 0.49)

# data
n = 35 # sample size
y = 31 # sample count of success

# likelihood, using binomial
likelihood = dbinom(y, n, theta) # function of theta

# posterior
product = likelihood * prior
posterior = product / sum(product)

# bayes table
bayes_table = data.frame(theta,
prior,
likelihood,
product,
posterior)

bayes_table %>%
adorn_totals("row") %>%
kable(digits = 4, align = 'r')

theta  prior  likelihood  product  posterior
  0.1   0.01      0.0000   0.0000     0.0000
  0.3   0.05      0.0000   0.0000     0.0000
  0.5   0.15      0.0000   0.0000     0.0000
  0.7   0.30      0.0067   0.0020     0.0201
  0.9   0.49      0.1998   0.0979     0.9799
Total   1.00      0.2065   0.0999     1.0000

6. The simulation would be similar to the prior simulation, but now we
simulate 𝜃 from its posterior distribution rather than the prior distribution.

1. Simulate a value of 𝜃 from the posterior distribution.


2. Given the value of 𝜃, construct a spinner that lands on success with
probability 𝜃. Spin the spinner once and record the result, success or
not.

3. Repeat steps 1 and 2 many times, and find the proportion of repetitions
which result in success. This proportion approximates the
unconditional posterior probability of success.

7. Use the law of total probability, where the weights are given by the
posterior probabilities.

0.1(0.0000)+0.3(0.0000)+0.5(0.0000)+0.7(0.0201)+0.9(0.9799) = 0.8960

(This calculation is equivalent to the expected value of 𝜃 according to its
posterior distribution, that is, the posterior mean.)
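In R, using the (rounded) posterior probabilities from the Bayes table:

```r
theta_values = c(0.1, 0.3, 0.5, 0.7, 0.9)
posterior = c(0.0000, 0.0000, 0.0000, 0.0201, 0.9799)

# posterior predictive probability of success = posterior mean of theta
sum(theta_values * posterior)  # approximately 0.896
```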
8. The simulation would be similar to the prior simulation, but now we
simulate 𝜃 from its posterior distribution rather than the prior distribution.

1. Simulate a value of 𝜃 from the posterior distribution.


2. Given the value of 𝜃, construct a spinner that lands on success with
probability 𝜃. Spin the spinner 35 times and count the number of
spins that land on success.
3. Repeat steps 1 and 2 many times, and record the number of successes
(out of 35) for each repetition. Summarize the simulated values to
approximate the posterior predictive distribution. To approximate
the posterior predictive probability that at least 34 students in a
sample of size 35 prefer data as singular, count the number of
simulated repetitions that result in at least 34 successes and divide by
the total number of simulated repetitions.

Since the posterior probability that 𝜃 equals 0.9 is close to 1, the posterior
predictive distribution would be close to, but not quite, the Binomial(35,
0.9) distribution.
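A sketch of the posterior predictive simulation in R, using the rounded posterior probabilities from the Bayes table (seed and repetition count are illustrative):

```r
set.seed(333)

theta_values = c(0.1, 0.3, 0.5, 0.7, 0.9)
posterior = c(0.0000, 0.0000, 0.0000, 0.0201, 0.9799)

n_rep = 100000

# Step 1: simulate a value of theta from the posterior distribution
theta_sim = sample(theta_values, n_rep, replace = TRUE, prob = posterior)

# Step 2: given theta, count successes in a new sample of size 35
y_sim = rbinom(n_rep, 35, theta_sim)

# approximate posterior predictive probability of at least 34 successes
mean(y_sim >= 34)  # approximately 0.12
```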
9. Use the law of total probability again, but with the posterior probabilities
rather than the prior probabilities as the weights.

(35(0.1)³⁴(1 − 0.1) + 0.1³⁵) (0.0000) + (35(0.3)³⁴(1 − 0.3) + 0.3³⁵) (0.0000)
+ (35(0.5)³⁴(1 − 0.5) + 0.5³⁵) (0.0000) + (35(0.7)³⁴(1 − 0.7) + 0.7³⁵) (0.0201)
+ (35(0.9)³⁴(1 − 0.9) + 0.9³⁵) (0.9799) = 0.1199
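The same computation in R:

```r
theta_values = c(0.1, 0.3, 0.5, 0.7, 0.9)
posterior = c(0.0000, 0.0000, 0.0000, 0.0201, 0.9799)

# P(Y >= 34) given each theta, weighted by the posterior probabilities
sum((1 - pbinom(33, 35, theta_values)) * posterior)  # approximately 0.1199
```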

According to this posterior model, about 12% of samples of size 35 would
result in at least 34 successes. The value 0.12 accounts for both (1) our
posterior uncertainty about 𝜃, after observing the sample data, and (2)
sample-to-sample variability in the number of successes 𝑦 for the
yet-to-be-observed sample.

The plots below illustrate the distributions from the previous example.
The first plot below illustrates the conditional distribution of 𝑌 given each value
of 𝜃.

Figure 7.1: Sample-to-sample distribution of 𝑌 , the number of successes in
samples of size 𝑛 = 35 for different values of 𝜃 in Example 7.1. Color represents
the prior probability of 𝜃, with darker colors corresponding to higher prior
probability.

In the plot above, the prior distribution of 𝜃 is represented by color. The prior
predictive distribution of 𝑌 mixes the five conditional distributions in the
previous plot, weighting by the prior distribution of 𝜃, to obtain the unconditional
prior predictive distribution of 𝑌 .

Figure 7.2: Prior predictive distribution of 𝑌 , the number of successes in samples
of size 𝑛 = 35, in Example 7.1.

The prior predictive distribution reflects the sample-to-sample variability of the
number of successes over many samples of size 𝑛 = 35, accounting for the prior
uncertainty of 𝜃.
After observing a sample, we compute the posterior distribution of 𝜃 as usual.
The following plot is similar to Figure 7.1, with the colors revised to reflect the
posterior probabilities of the values of 𝜃.
The posterior predictive distribution of 𝑌 mixes the five conditional
distributions in the previous plot, weighting by the posterior distribution of 𝜃, to
obtain the unconditional posterior predictive distribution of 𝑌 . Since the posterior
distribution of 𝜃 places almost all posterior probability on the value 0.9, the
posterior predictive distribution of 𝑌 is very similar to the conditional
distribution of 𝑌 given 𝜃 = 0.9.
The predictive distribution of a random variable is the marginal distribution
(of the unobserved values) after accounting for the uncertainty in the
parameters. A prior predictive distribution is calculated using the prior distribution

Figure 7.3: Sample-to-sample distribution of 𝑌 , the number of successes in
samples of size 𝑛 = 35 for different values of 𝜃 in Example 7.1. Color represents
the posterior probability of 𝜃, after observing data from a different sample of
size 35, with darker colors corresponding to higher posterior probability.

Figure 7.4: Posterior predictive distribution of 𝑌 , the number of successes in
samples of size 𝑛 = 35, in Example 7.1 after observing data from a different
sample of size 35.

of the parameters. A posterior predictive distribution is calculated using
the posterior distribution of the parameters, conditional on the observed data.
Be sure to carefully distinguish between posterior distributions and posterior
predictive distributions (or between prior distributions and prior predictive
distributions).
Prior and posterior distributions are distributions on values of the parameters.
These distributions quantify the degree of uncertainty about the unknown
parameter 𝜃 (before and after observing data).
On the other hand, prior and posterior predictive distributions are distributions
on potential values of the data. Predictive distributions reflect sample-to-sample
variability of the sample data, while accounting for the uncertainty in the
parameters.
Predictive probabilities can be computed via the law of total probability, as
weighted averages over possible values of 𝜃. However, even when conditional
distributions of data given the parameters are well known (e.g., Binomial(𝑛,
𝜃)), marginal distributions of the data are often not. Simulation is an effective
tool in approximating predictive distributions.

• Step 1: Simulate a value of 𝜃 from its posterior distribution (or prior
distribution).
• Step 2: Given this value of 𝜃, simulate a value of 𝑦 from 𝑓(𝑦|𝜃), the data
model conditional on 𝜃.
• Repeat many times to simulate many (𝜃, 𝑦) pairs, and summarize the
values of 𝑦 to approximate the posterior predictive distribution (or prior
predictive distribution).

Example 7.2. Continuing the previous example. We’ll use a grid
approximation and assume that any multiple of 0.0001 is a possible value of 𝜃:
0, 0.0001, 0.0002, … , 0.9999, 1.

1. Consider the context of this problem and sketch your prior distribution
for 𝜃. What are the main features of your prior?
2. Assume the prior distribution for 𝜃 is proportional to 𝜃². Plot this prior
distribution and describe its main features.
3. Given the shape of the prior distribution, explain why we might not want
to compute central prior credible intervals. Suggest an alternative
approach, and compute and interpret 50%, 80%, and 98% prior credible
intervals for 𝜃.
4. Before observing any data, suppose we plan to randomly select a sample
of 35 Cal Poly statistics students. Let 𝑌 represent the number of students
in the selected sample who prefer data as singular. Use simulation to
approximate the prior predictive distribution of 𝑌 and plot it.

5. Use software to compute the prior predictive distribution of 𝑌 . Compare
to the simulation results.
6. Find a 95% prior prediction interval for 𝑌 . Write a clearly worded sentence
interpreting this interval in context.
For the remaining parts, suppose that 31 students in a sample
of 35 Cal Poly statistics students prefer data as singular.
7. Use software to plot the prior distribution and the (scaled) likelihood,
then find the posterior distribution of 𝜃 and plot it and describe its main
features.
8. Find and interpret 50%, 80%, and 98% central posterior credible intervals
for 𝜃.
9. Suppose we plan to randomly select another sample of 35 Cal Poly
statistics students. Let 𝑌 ̃ represent the number of students in the selected
sample who prefer data as singular. Use simulation to approximate the
posterior predictive distribution of 𝑌 ̃ and plot it. (Of course, the sample
size of the new sample does not have to be 35. However, we’re keeping it
the same so we can compare the prior and posterior predictions.)

10. Use software to compute the posterior predictive distribution of 𝑌 ̃ .
Compare to the simulation results.
11. Find a 95% posterior prediction interval for 𝑌 ̃ . Write a clearly worded
sentence interpreting this interval in context.
12. Now suppose instead of using the Cal Poly sample data (31/35) to form the
posterior distribution of 𝜃, we had used the data from the FiveThirtyEight
study in which 865 out of 1093 respondents preferred data as singular. Use
software to plot the prior distribution and the (scaled) likelihood, then find
the posterior distribution of 𝜃 and plot it and describe its main features.
In particular, find and interpret a 98% central posterior credible interval
for 𝜃. How does the posterior based on the FiveThirtyEight data compare
to the posterior distribution based on the Cal Poly sample data (31/35)?
Why?
13. Again, suppose we use the FiveThirtyEight data to form the posterior
distribution of 𝜃. Suppose we plan to randomly select a sample of 35
Cal Poly statistics students. Let 𝑌 ̃ represent the number of students
in the selected sample who prefer data as singular. Use simulation to
approximate the posterior predictive distribution of 𝑌 ̃ and plot it. In
particular, find and interpret a 95% posterior prediction interval for 𝑌 ̃ .
How does the predictive distribution which uses the posterior distribution
based on the FiveThirtyEight data compare to the one based on the Cal
Poly sample data (31/35)? Why?

Solution to Example 7.2.



1. Results will of course vary, but do consider what your prior would look
like.
2. We believe a majority, and probably a strong majority, of students will
prefer data as singular. The prior mode is 1, the prior mean is 0.75, and
the prior standard deviation is 0.19.

theta = seq(0, 1, 0.0001)

# prior
prior = theta ^ 2
prior = prior / sum(prior)

ylim = c(0, max(prior))


plot(theta, prior, type='l', xlim=c(0, 1), ylim=ylim, col="skyblue", xlab='theta', ylab='')
[Plot of the prior distribution of 𝜃.]

# prior mean
prior_ev = sum(theta * prior)
prior_ev

## [1] 0.75

# prior variance
prior_var = sum(theta ^ 2 * prior) - prior_ev ^ 2

# prior sd
sqrt(prior_var)

## [1] 0.1937

3. Central credible intervals would exclude 𝜃 values near 1, but these are the values
with highest prior probability. For example, a central 50% prior credible
interval is [0.630, 0.909], but this excludes values of 𝜃 with the highest prior
probability. An alternative is to use highest prior probability intervals.
For this prior, it seems reasonable to just fix the upper endpoint of the
credible intervals to be 1, and to find the lower endpoint corresponding to
the desired probability. The lower bound of such a 50% credible interval
is the 50th percentile; of an 80% credible interval is the 20th percentile; of
a 98% credible interval is the 2nd percentile. There is a prior probability
of 50% that at least 79.4% of Cal Poly students prefer data as singular;
it’s equally plausible that 𝜃 is above 0.794 as below.
There is a prior probability of 80% that at least 58.5% of Cal Poly students
prefer data as singular; it’s four times more plausible that 𝜃 is above 0.585
than below. There is a prior probability of 98% that at least 27.1% of Cal
Poly students prefer data as singular; it’s 49 times more plausible that 𝜃
is above 0.271 than below.

prior_cdf = cumsum(prior)

# 50th percentile
theta[max(which(prior_cdf <= 0.5))]

## [1] 0.7936

# 20th percentile
theta[max(which(prior_cdf <= 0.2))]

## [1] 0.5847

# 2nd percentile
theta[max(which(prior_cdf <= 0.02))]

## [1] 0.2714

4. We use the sample function with the prob argument to simulate a value
of 𝜃 from its prior distribution, and then use rbinom to simulate a sample.
The table below displays the results of a few repetitions of the simulation.

n = 35

n_sim = 10000

theta_sim = sample(theta, n_sim, replace = TRUE, prob = prior)

y_sim = rbinom(n_sim, n, theta_sim)

data.frame(theta_sim, y_sim) %>%
  head(10) %>%
  kable(digits = 5)

theta_sim y_sim
0.6711 26
0.7621 26
0.8947 33
0.8719 30
0.5030 16
0.9272 34
0.8771 31
0.6362 25
0.8140 29
0.1598 3

5. We program the law of total probability calculation for each possible value
of 𝑦. (There are better ways of doing this than a for loop, but it’s good
enough.)

# Predictive distribution
y_predict = 0:n

py_predict = rep(NA, length(y_predict))

for (i in 1:length(y_predict)) {
py_predict[i] = sum(dbinom(y_predict[i], n, theta) * prior) # prior
}
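As an aside (a sketch, not from the text), the same law of total probability calculation can be written without an explicit loop; the setup is restated here so the chunk stands alone:

```r
# Vectorized law of total probability calculation (same result as the loop)
theta = seq(0, 1, 0.0001)
prior = theta ^ 2
prior = prior / sum(prior)

n = 35
y_predict = 0:n

py_predict = vapply(y_predict,
                    function(y) sum(dbinom(y, n, theta) * prior),
                    numeric(1))

sum(py_predict) # a valid distribution: probabilities sum to 1
```

`vapply` applies the weighted-sum computation to each possible value of 𝑦 at once, which is faster and avoids pre-allocating `py_predict` by hand.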

[Plot: prior predictive distribution of 𝑌 ̃ — theoretical and simulated probabilities.]

# Prediction interval
py_predict_cdf = cumsum(py_predict)
c(y_predict[max(which(py_predict_cdf <= 0.025))], y_predict[min(which(py_predict_cdf >= 0.975))])

## [1] 8 35

6. There is a prior predictive probability of 95% that between 8 and 35 students in a sample of 35 students will prefer data as singular.

For the remaining parts, suppose that 31 students in a sample of 35 Cal Poly statistics students prefer data as singular.

7. The observed sample proportion is 31/35 = 0.886. The posterior distribution is slightly skewed to the left with a posterior mean of 0.872 and a posterior standard deviation of 0.053.

# data
n = 35 # sample size
y = 31 # sample count of success

# likelihood, using binomial


likelihood = dbinom(y, n, theta) # function of theta

# posterior
product = likelihood * prior
posterior = product / sum(product)

# posterior mean
posterior_ev = sum(theta * posterior)
posterior_ev

## [1] 0.8718

# posterior variance
posterior_var = sum(theta ^ 2 * posterior) - posterior_ev ^ 2

# posterior sd
sqrt(posterior_var)

## [1] 0.05286

# posterior cdf
posterior_cdf = cumsum(posterior)

# posterior 50% central credible interval
c(theta[max(which(posterior_cdf <= 0.25))], theta[min(which(posterior_cdf >= 0.75))])

## [1] 0.8397 0.9106

# posterior 80% central credible interval
c(theta[max(which(posterior_cdf <= 0.1))], theta[min(which(posterior_cdf >= 0.9))])

## [1] 0.8005 0.9346

# posterior 98% central credible interval
c(theta[max(which(posterior_cdf <= 0.01))], theta[min(which(posterior_cdf >= 0.99))])

## [1] 0.7238 0.9651



[Plot: prior, scaled likelihood, and posterior distributions of 𝜃.]

8. There is a posterior probability of 50% that between 84.0% and 91.1% of Cal Poly students prefer data as singular; after observing the sample data, it's equally plausible that 𝜃 is inside [0.840, 0.911] as outside. There is a posterior probability of 80% that between 80.0% and 93.5% of Cal Poly students prefer data as singular; after observing the sample data, it's four times more plausible that 𝜃 is inside [0.800, 0.935] than outside. There is a posterior probability of 98% that between 72.4% and 96.5% of Cal Poly students prefer data as singular; after observing the sample data, it's 49 times more plausible that 𝜃 is inside [0.724, 0.965] than outside.

9. Similar to the prior simulation, but now we simulate 𝜃 based on its poste-
rior distribution. The table below displays the results of a few repetitions
of the simulation.

theta_sim = sample(theta, n_sim, replace = TRUE, prob = posterior)

y_sim = rbinom(n_sim, n, theta_sim)

data.frame(theta_sim, y_sim) %>%
  head(10) %>%
  kable(digits = 5)

theta_sim y_sim
0.8304 30
0.8823 32
0.9251 30
0.9442 35
0.9069 32
0.9083 32
0.9564 34
0.8123 32
0.7888 30
0.8705 29
10. Similar to the prior calculation, but now we use the posterior probabilities
as the weights in the law of total probability calculation.
# Predictive distribution
y_predict = 0:n

py_predict = rep(NA, length(y_predict))

for (i in 1:length(y_predict)) {
py_predict[i] = sum(dbinom(y_predict[i], n, theta) * posterior) # posterior
}

[Plot: posterior predictive distribution of 𝑌 ̃ — theoretical and simulated probabilities.]

# Prediction interval
py_predict_cdf = cumsum(py_predict)
c(y_predict[max(which(py_predict_cdf <= 0.025))], y_predict[min(which(py_predict_cdf >= 0.975))])

## [1] 23 35

11. There is a posterior predictive probability of 95% that between 23 and 35 students in a sample of 35 students will prefer data as singular.
12. The observed sample proportion is 865/1093 = 0.791. The posterior mean is 0.791, and the posterior standard deviation is 0.012. There is a posterior probability of 98% that between 76.2% and 81.9% of Cal Poly students prefer data as singular. The posterior SD is much smaller and the 98% credible interval is narrower based on the FiveThirtyEight data due to the much larger sample size. (The posterior means and locations of the credible intervals are also different due to the difference in sample proportions.)

# data
n = 1093 # sample size
y = 865 # sample count of success

# likelihood, using binomial


likelihood = dbinom(y, n, theta) # function of theta

# posterior
product = likelihood * prior
posterior = product / sum(product)

# posterior mean
posterior_ev = sum(theta * posterior)
posterior_ev

## [1] 0.7912

# posterior variance
posterior_var = sum(theta ^ 2 * posterior) - posterior_ev ^ 2

# posterior sd
sqrt(posterior_var)

## [1] 0.01227

# posterior 98% credible interval


posterior_cdf = cumsum(posterior)
c(theta[max(which(posterior_cdf <= 0.01))], theta[min(which(posterior_cdf >= 0.99))])

## [1] 0.7619 0.8190



[Plot: prior, scaled likelihood, and posterior distributions of 𝜃.]

13. There is a posterior predictive probability of 95% that between 22 and 32 students in a sample of 35 students will prefer data as singular. Despite the fact that the posterior distributions of 𝜃 are different in the two scenarios, the posterior predictive distributions are fairly similar. Even though there is less uncertainty about 𝜃 in the FiveThirtyEight case, the predictive distribution reflects the sample-to-sample variability of the number of students who prefer data as singular, which is mainly impacted by the size of the sample being "predicted". The table below displays a few repetitions of the posterior predictive simulation. Notice that all the 𝜃 values are around 0.79 or so, but there is still sample-to-sample variability in the 𝑌 values.

n = 35

# Predictive simulation
theta_sim = sample(theta, n_sim, replace = TRUE, prob = posterior)

y_sim = rbinom(n_sim, n, theta_sim)

data.frame(theta_sim, y_sim) %>%
  head(10) %>%
  kable(digits = 5)

theta_sim y_sim
0.7979 32
0.7768 27
0.7559 26
0.8000 33
0.7844 28
0.7823 30
0.7811 25
0.7660 23
0.8057 32
0.7910 27

# Predictive distribution
y_predict = 0:n

py_predict = rep(NA, length(y_predict))

for (i in 1:length(y_predict)) {
py_predict[i] = sum(dbinom(y_predict[i], n, theta) * posterior) # posterior
}

# Prediction interval
py_predict_cdf = cumsum(py_predict)
c(y_predict[max(which(py_predict_cdf <= 0.025))], y_predict[min(which(py_predict_cdf >= 0.975))])

## [1] 22 32

[Plot: posterior predictive distribution of 𝑌 ̃ — theoretical and simulated probabilities.]

Be sure to distinguish between a prior/posterior distribution and a prior/posterior predictive distribution.

• A prior/posterior distribution is a distribution on potential values of the parameters 𝜃. These distributions quantify the degree of uncertainty about the unknown parameter 𝜃 (before and after observing data).
• A prior/posterior predictive distribution is a distribution on potential values of the data 𝑦. Predictive distributions reflect sample-to-sample variability of the sample data, while accounting for the uncertainty in the parameters.

Even if parameters are essentially “known” — that is, even if the prior/posterior
variance of parameters is small — there will still be sample-to-sample variability
reflected in the predictive distribution of the data, mainly influenced by the size
𝑛 of the sample being “predicted”.
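As a quick numerical sketch of this point (the values 0.79 and 0.012 below are illustrative stand-ins for the FiveThirtyEight-based posterior mean and SD, not results from the text): compare the predictive SD when 𝜃 is known exactly to the predictive SD under a nearly-point-mass posterior.

```r
# Sketch: predictive variability when theta is (essentially) known.
# 0.79 and 0.012 are illustrative stand-ins for the FiveThirtyEight-based
# posterior mean and SD.
n = 35

y_known = rbinom(10000, n, 0.79)       # theta known exactly
sd(y_known)                            # about sqrt(35 * 0.79 * 0.21) = 2.4

theta_sim = rnorm(10000, 0.79, 0.012)  # small residual uncertainty in theta
y_post = rbinom(10000, n, theta_sim)
sd(y_post)                             # only slightly larger
```

Almost all of the predictive variability comes from the Binomial sampling of the 𝑛 = 35 students being "predicted", not from the residual uncertainty about 𝜃.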

7.1 Posterior predictive checking


Example 7.3. Continuing the previous example, suppose that before collecting
data for our sample of Cal Poly students, we had based our prior distribution
off the FiveThirtyEight data. Suppose we assume a prior distribution that is
proportional to 𝜃^864 (1 − 𝜃)^227 for 𝜃 values in the grid. (We will see where such
a distribution might come from later.)

1. Plot the prior distribution. What does this say about our prior beliefs?
2. Now suppose we randomly select a sample of 35 Cal Poly students and 21
students prefer data as singular. Plot the prior and likelihood, and find
the posterior distribution and plot it. Have our beliefs about 𝜃 changed?
Why?
3. Find the posterior predictive distribution corresponding to samples of size
35. Compare the observed sample value of 21/35 with the posterior pre-
dictive distribution. What do you notice? Does this indicate problems
with the model?

Solution to Example 7.3.

1. We have a very strong prior belief that 𝜃 is close to 0.79; the prior SD
is only 0.012. There is a prior probability of 98% that between 76% and
82% of Cal Poly students prefer data as singular.

# prior
theta = seq(0, 1, 0.0001)
prior = theta ^ 864 * (1 - theta) ^ 227

prior = prior / sum(prior)

ylim = c(0, max(prior))


plot(theta, prior, type='l', xlim=c(0, 1), ylim=ylim, col="skyblue", xlab='theta', ylab='')
[Plot of the prior distribution of 𝜃.]

# prior mean
prior_ev = sum(theta * prior)
prior_ev

## [1] 0.7914

# prior variance
prior_var = sum(theta ^ 2 * prior) - prior_ev ^ 2

# prior sd
sqrt(prior_var)

## [1] 0.01228

# prior 98% credible interval


prior_cdf = cumsum(prior)
c(theta[max(which(prior_cdf <= 0.01))], theta[min(which(prior_cdf >= 0.99))])

## [1] 0.7620 0.8192



2. Our posterior distribution has barely changed from the prior. Even though
the sample proportion is 21/35 = 0.61, our prior beliefs were so strong
(represented by the small prior SD) that a sample of size 35 isn’t very
convincing.

# data
n = 35 # sample size
y = 21 # sample count of success

# likelihood, using binomial


likelihood = dbinom(y, n, theta) # function of theta

# posterior
product = likelihood * prior
posterior = product / sum(product)

# posterior mean
posterior_ev = sum(theta * posterior)
posterior_ev

## [1] 0.7855

# posterior variance
posterior_var = sum(theta ^ 2 * posterior) - posterior_ev ^ 2

# posterior sd
sqrt(posterior_var)

## [1] 0.01222

# posterior 98% credible interval
posterior_cdf = cumsum(posterior)
c(theta[max(which(posterior_cdf <= 0.01))], theta[min(which(posterior_cdf >= 0.99))])

## [1] 0.7562 0.8131



[Plot: prior, scaled likelihood, and posterior distributions of 𝜃.]

3. According to the posterior predictive distribution, it is very unlikely to observe a sample with only 21 students preferring data as singular; only about 1% of samples are this extreme. However, remember that the posterior predictive distribution is based on the observed data. So we're saying that, based on the fact that we observed 21 students in a sample of 35 preferring data as singular, it would be unlikely to observe 21 students in a sample of 35 preferring data as singular? That seems problematic. In this case, the problem is that the prior is way too strict, and it doesn't give the data enough say.

n = 35

# Predictive simulation
theta_sim = sample(theta, n_sim, replace = TRUE, prob = posterior)

y_sim = rbinom(n_sim, n, theta_sim)

# Predictive distribution
y_predict = 0:n

py_predict = rep(NA, length(y_predict))

for (i in 1:length(y_predict)) {
py_predict[i] = sum(dbinom(y_predict[i], n, theta) * posterior) # posterior
}

# posterior predictive probability of a count at least as low as the observed y
sum(py_predict[y_predict <= y])

## [1] 0.01105

[Plot: posterior predictive distribution of 𝑌 ̃ — theoretical and simulated probabilities.]

A Bayesian model is composed of both a model for the data (likelihood) and a
prior distribution on model parameters.
Predictive distributions can be used as tools in model checking. Posterior pre-
dictive checking involves comparing the observed data to simulated samples
(or some summary statistics) generated from the posterior predictive distribu-
tion. We’ll focus on graphical checks: Compare plots for the observed data with
those for simulated samples. Systematic differences between simulated samples
and observed data indicate potential shortcomings of the model.
If the model fits the data, then replicated data generated under the model should
look similar to the observed data. If the observed data is not plausible under
the posterior predictive distribution, then this could indicate that the model is
not a good fit for the data. (“Based on the data we observed, we conclude that
it would be unlikely to observe the data we observed???”)
However, a problematic model isn’t necessarily due to the prior. Remember
that a Bayesian model consists of both a prior and a likelihood, so model mis-
specification can occur in the prior or likelihood or both. The form of the
likelihood is also based on subjective assumptions about the variables being
measured and how the data are collected. Posterior predictive checking can

help assess whether these assumptions are reasonable in light of the observed
data.
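To make this concrete, here is a small sketch (restating the setup of Example 7.3 so the chunk stands alone; not code from the text) that generates a handful of replicated counts from the posterior predictive distribution for comparison with the observed count:

```r
# Sketch of a posterior predictive check for Example 7.3
theta = seq(0, 1, 0.0001)
prior = theta ^ 864 * (1 - theta) ^ 227
prior = prior / sum(prior)

n = 35 # sample size
y = 21 # observed count

likelihood = dbinom(y, n, theta)
posterior = likelihood * prior
posterior = posterior / sum(posterior)

# replicated counts from the posterior predictive distribution
theta_rep = sample(theta, 20, replace = TRUE, prob = posterior)
y_rep = rbinom(20, n, theta_rep)
y_rep # most replicated counts are well above the observed y = 21
```

A plot of the replicated counts with the observed count marked would show the observed data sitting far in the lower tail, the graphical version of the roughly 1% tail probability computed above.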

Example 7.4. A basketball player will attempt a sequence of 20 free throws. Our model assumes

• The probability that the player successfully makes any particular free
throw attempt is 𝜃.
• A Uniform prior distribution for 𝜃 values in a grid from 0 to 1.
• Conditional on 𝜃, the number of successfully made attempts has a Bino-
mial(20, 𝜃) distribution. (This determines the likelihood.)

1. Suppose the player misses her first 10 attempts and makes her second 10
attempts. Does this data seem consistent with the model?
2. Explain how you could use posterior predictive checking to check the fit
of the model.

Solution to Example 7.4.

1. Remember that one condition of a Binomial model is independence of trials: the probability of success on a shot should not depend on the results of previous shots. However, independence seems to be violated here, since the shooter has a long cold streak followed by a long hot streak. So a Binomial model might not be appropriate.

2. We’re particularly concerned about the independence assumption, so how


could we check that? For example, the data seems consistent with a value
of 𝜃 = 0.5, but if the trials were independent, you would expect to see
more alterations between makes and misses. So one way to measure de-
gree of dependence is to count the number of “switches” between makes
and misses. For the observed data there is only 1 switch. We can use sim-
ulation to approximate the posterior predictive distribution of the number
of switches assuming the model is true, and then we can see if a value of
1 (the observed number of switches) would be consistent with the model.

1. Find the posterior distribution of 𝜃. Simulate a value of 𝜃 from its posterior distribution.
2. Given 𝜃, simulate a sequence of 20 independent success/failure trials with probability of success 𝜃 on each trial. Compute the number of switches for the sequence. (Since we're interested in the number of switches, we have to generate the individual success/failure results, and not just the total number of successes.)
3. Repeat many times, recording the total number of switches each time. Summarize the values to approximate the posterior predictive distribution of the number of switches.

See the simulation results below. It would be very unlikely to observe only 1 switch in 20 independent trials. Therefore, the proposed model does not fit the observed data well. There is evidence that the assumption of independence is violated.

theta = seq(0, 1, 0.0001)

# prior
prior = rep(1, length(theta))
prior = prior / sum(prior)

# data
n = 20 # sample size
y = 10 # sample count of success

# likelihood, using binomial


likelihood = dbinom(y, n, theta) # function of theta

# posterior
product = likelihood * prior
posterior = product / sum(product)

# predictive simulation

n_sim = 10000

switches = rep(NA, n_sim)

for (r in 1:n_sim){
theta_sim = sample(theta, 1, replace = TRUE, prob = posterior)
trials_sim = rbinom(n, 1, theta_sim)
switches[r] = length(rle(trials_sim)$lengths) - 1 # built in function
}

plot(table(switches) / n_sim,
xlab = "Number of switches",
ylab = "Posterior predictive probability",
panel.first = rect(0, 0, 1, 1, col='gray', border=NA))

[Plot: posterior predictive distribution of the number of switches.]

sum(switches <= 1) / n_sim

## [1] 0.0005

7.2 Prior predictive tuning


Prior distributions of parameters quantify uncertainty about parameters before
observing data. Considering prior predictive distributions of possible samples
under the proposed model can help tune prior distributions of parameters.
Example 7.5. Suppose we want to estimate 𝜃, the population mean hours of
sleep on a typical night for Cal Poly students. Assume that sleep hours for
individual students follow a Normal distribution with unknown mean 𝜃 and
known standard deviation 1.5 hours. (Known population SD is an unrealistic
assumption that we use for simplicity here.)
Suppose we want to use a fairly uninformative prior for 𝜃, so we choose a Uniform
distribution on the interval [4, 12].

1. Simulate sleep hours for 10000 Cal Poly students under this model and
make a histogram of the simulated values.
2. According to this model, (approximately) what percent of students sleep
less than 5 hours a night? More than 11? Do these values seem reasonable?

Solution to Example 7.5.



1. First simulate a value 𝜃 from the Uniform(4, 12) distribution. Then given
𝜃 simulate a value 𝑦 from a Normal(𝜃, 1.5) distribution. Repeat many
times to get many (𝜃, 𝑦) pairs and summarize the 𝑦 values.

N_sim = 10000

theta = runif(N_sim, 4, 12)

sigma = 1.5
y = rnorm(N_sim, theta, sigma)

hist(y, xlab = "Sleep hours")

[Histogram of simulated sleep hours 𝑦.]

sum(y < 5) / N_sim

## [1] 0.1499

sum(y > 11) / N_sim

## [1] 0.1552

2. According to this model, about 15 percent of students sleep fewer than 5 hours on a typical night, and about 16 percent of students sleep more than 11 hours on a typical night. These values seem to be overestimates, indicating that perhaps the model isn't the greatest.

It could be that the prior distribution is too uninformative. But it could also be that the assumptions of the data model are inadequate; perhaps a Normal distribution isn't appropriate for sleep times. (Of course, the value 𝜎 could also be wrong, but here we're assuming it's known.)

In the previous example, it was helpful to think about the distribution of sleep
hours for individual students when formulating prior beliefs about the popula-
tion mean. In general, it is often easier to think in terms of the scale of the data
(individual sleep hours) rather than the scale of the parameters (mean sleep
hours).
Prior predictive distributions “live” on the scale of the data, and are sometimes
easier to interpret than prior distributions themselves. It is often helpful to
tune prior distributions indirectly via prior predictive distributions rather than
directly. We can choose a prior distribution for parameters, simulate a prior
predictive distribution for the data given this prior, and consider if the distribu-
tion of possible data values seems reasonable given our background knowledge
about the variable and context. If not, we can choose another prior and repeat
the process until we have suitably “tuned” the prior.
Remember, the prior does not have to be perfect; there is no perfect prior.
However, if a particular prior gives rise to obviously unreasonable data values
(e.g., negative sleep hours) we should try to improve it. It’s always a good idea
to consider prior predictive distributions when formulating a prior distribution
for parameters.
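As an illustration of this tuning process (the Normal(8, 1) prior below is a hypothetical choice, not from the text), we can re-run the prior predictive simulation of Example 7.5 with a narrower prior and check whether the tail proportions look more reasonable:

```r
# Hypothetical tuned prior: theta ~ Normal(8, 1) instead of Uniform(4, 12)
N_sim = 10000

theta = rnorm(N_sim, 8, 1)
sigma = 1.5
y = rnorm(N_sim, theta, sigma)

sum(y < 5) / N_sim  # much smaller than the 15% under the Uniform prior
sum(y > 11) / N_sim # much smaller than the 16% under the Uniform prior
```

If the resulting proportions still seemed implausible, we would adjust the prior again and repeat, tuning on the scale of the data rather than the scale of the parameter.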
Chapter 8

Introduction to Continuous Prior and Posterior Distributions

Bayesian analysis is based on the posterior distribution of parameters 𝜃 given data 𝑦. The data 𝑦 might be discrete (e.g., count data) or continuous (e.g., measurement data). However, parameters 𝜃 almost always take values on a continuous scale, even when the data are discrete. For example, in a Binomial situation, the number of successes 𝑦 takes values on a discrete scale, but the probability of success on any single trial 𝜃 can potentially take any value in the continuous interval (0, 1).

Recall that the posterior distribution is proportional to the product of the prior
distribution and the likelihood. Thus, there are two probability distributions
which will influence the posterior distribution.

• The (unconditional) prior distribution of parameters 𝜃, which is (almost always) a continuous distribution
• The conditional distribution of the data 𝑦 given parameters 𝜃, which determines the likelihood function. Viewed as a conditional distribution of 𝑦 given 𝜃, the distribution can be discrete or continuous, corresponding to the data type of 𝑦. However, the likelihood function treats the data 𝑦 as fixed and the parameters 𝜃 as varying, and therefore the likelihood function is (almost always) a function of continuous 𝜃.

This section provides an introduction to using continuous prior and posterior distributions to quantify uncertainty about parameters. Some general notation:


• 𝜃 represents¹ parameters of interest usually taking values on a continuous scale
• 𝑦 denotes observed sample data (discrete or continuous)
• 𝜋(𝜃) denotes the prior distribution of 𝜃, usually a probability density function (pdf) over possible values of 𝜃
• 𝑓(𝑦|𝜃) denotes the likelihood function, a function of continuous 𝜃 for fixed 𝑦
• 𝜋(𝜃|𝑦) denotes the posterior distribution of 𝜃, the conditional distribution of 𝜃 given the data 𝑦.

Bayes' rule works analogously for a continuous parameter 𝜃, given data 𝑦:

𝜋(𝜃|𝑦) = 𝑓(𝑦|𝜃)𝜋(𝜃) / 𝑓𝑌(𝑦)

𝜋(𝜃|𝑦) ∝ 𝑓(𝑦|𝜃)𝜋(𝜃)

posterior ∝ likelihood × prior

The continuous analog of the law of total probability is

𝑓𝑌(𝑦) = ∫_{−∞}^{∞} 𝑓(𝑦|𝜃)𝜋(𝜃) 𝑑𝜃

8.1 A brief review of continuous distributions


This section provides a brief review of continuous probability distributions.
Throughout, 𝑈 represents a continuous random variable that takes values de-
noted 𝑢. In a Bayesian framework, 𝑢 can represent either values of parameters
𝜃 or values of data 𝑦.
The probability distribution of a continuous random variable is (usually) specified by its probability density function (pdf) (a.k.a. density), usually denoted 𝑓 or 𝑓𝑈. A pdf 𝑓 must satisfy

𝑓(𝑢) ≥ 0 for all 𝑢

∫_{−∞}^{∞} 𝑓(𝑢) 𝑑𝑢 = 1

For a continuous random variable 𝑈 with pdf 𝑓, the probability that the random variable falls between any two values 𝑎 and 𝑏 is given by the area under the density between those two values:

𝑃(𝑎 ≤ 𝑈 ≤ 𝑏) = ∫_{𝑎}^{𝑏} 𝑓(𝑢) 𝑑𝑢
¹ 𝜃 is used to denote both: (1) the actual parameter (i.e., the random variable) 𝜃 itself, and (2) possible values of 𝜃.

A pdf will assign zero probability to intervals where the density is 0. A pdf
is usually defined for all real values, but is often nonzero only for some subset
of values, the possible values of the random variable. Given a specific pdf, the
generic bounds (−∞, ∞) should be replaced by the range of possible values,
that is, those values 𝑢 for which 𝑓(𝑢) > 0.
For example, if 𝑈 can only take positive values we can write its pdf as

𝑓(𝑢) = some function of 𝑢 for 𝑢 > 0, and 𝑓(𝑢) = 0 otherwise.

The "0 otherwise" part is often omitted, but be sure to specify the range of values where 𝑓 is positive.
The expected value of a continuous random variable 𝑈 with pdf 𝑓 is

𝐸(𝑈) = ∫_{−∞}^{∞} 𝑢 𝑓(𝑢) 𝑑𝑢

The probability that a continuous random variable 𝑈 equals any particular value is 0: 𝑃(𝑈 = 𝑢) = 0 for all 𝑢. A continuous random variable can take uncountably many distinct values, e.g., 0.500000000… is different from 0.50000000010… is different from 0.500000000000001…, etc. Simulating values of a continuous random variable corresponds to an idealized spinner with an infinitely precise needle which can land on any value in a continuous scale.
A density is an idealized mathematical model for the entire population distri-
bution of infinitely many distinct values of the random variable. In practical
applications, there is some acceptable degree of precision, and events like "𝑈, rounded to 4 decimal places, equals 0.5" correspond to intervals that do have positive probability.
sense to talk about the probability that the random value equals a particular
value. However, we can consider the probability that a random variable is close
to a particular value.
The density 𝑓(𝑢) at value 𝑢 is not a probability. But the density 𝑓(𝑢) at value 𝑢 is related to the probability that the random variable 𝑈 takes a value "close to 𝑢" in the following sense:

𝑃(𝑢 − 𝜖/2 ≤ 𝑈 ≤ 𝑢 + 𝜖/2) ≈ 𝑓(𝑢)𝜖, for small 𝜖
So a random variable 𝑈 is more likely to take values close to those with greater
density.
In general, a pdf is often defined only up to some multiplicative constant 𝑐, for example

𝑓(𝑢) = 𝑐 × (some function of 𝑢), or equivalently 𝑓(𝑢) ∝ some function of 𝑢

The constant 𝑐 does not affect the shape of the density as a function of 𝑢, only the scale on the density (vertical) axis. The absolute scaling on the density axis is somewhat irrelevant; it is whatever it needs to be to provide the proper area. In particular, the total area under the pdf must be 1. The scaling constant is determined by the requirement that ∫_{−∞}^{∞} 𝑓(𝑢) 𝑑𝑢 = 1. (Remember to replace the generic (−∞, ∞) bounds with the range of possible values.)

What is important about the pdf is relative height. For example, if two values 𝑢 and 𝑢̃ satisfy 𝑓(𝑢̃) = 2𝑓(𝑢), then 𝑈 is roughly "twice as likely to be near 𝑢̃ as near 𝑢":

2 = 𝑓(𝑢̃)/𝑓(𝑢) = 𝑓(𝑢̃)𝜖 / (𝑓(𝑢)𝜖) ≈ 𝑃(𝑢̃ − 𝜖/2 ≤ 𝑈 ≤ 𝑢̃ + 𝜖/2) / 𝑃(𝑢 − 𝜖/2 ≤ 𝑈 ≤ 𝑢 + 𝜖/2)


Figure 8.1: Illustration of 𝑃(1 < 𝑈 < 2.5) (left) and 𝑃(0.995 < 𝑈 < 1.005) and 𝑃(1.695 < 𝑈 < 1.705) (right) for 𝑈 with an Exponential(1) distribution, with pdf 𝑓𝑈(𝑢) = 𝑒^(−𝑢), 𝑢 > 0. The plot on the left displays the true area under the curve over (1, 2.5). The plot on the right illustrates how the probability that 𝑈 is "close to" 𝑢 can be approximated by the area of a rectangle with height equal to the density at 𝑢, 𝑓𝑈(𝑢). The density height at 𝑢 = 1 is twice as large as the density height at 𝑢 = 1.7, so the probability that 𝑈 is "close to" 1 is (roughly) twice as large as the probability that 𝑈 is "close to" 1.7.
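These approximations can be checked numerically for the Exponential(1) distribution in Figure 8.1 (a quick sketch using base R's `pexp` and `dexp`):

```r
# Check: P(U close to u) is approximately f(u) * epsilon
u = 1
eps = 0.01
pexp(u + eps / 2) - pexp(u - eps / 2) # exact probability near u
dexp(u) * eps                         # density-height approximation

# Check: the ratio of density heights approximates the ratio of probabilities
dexp(1) / dexp(1.7)                                       # about 2
(pexp(1.005) - pexp(0.995)) / (pexp(1.705) - pexp(1.695)) # about 2
```

The rectangle approximation agrees with the exact probability to several decimal places, and the ratio of "close to" probabilities matches the ratio of density heights.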

A sample of values of a continuous random variable is often displayed in a histogram which displays the frequencies of values falling in interval "bins". The vertical axis of a histogram is typically on the density scale, so that areas of the bars correspond to relative frequencies.

[Histogram of a sample of values on the density scale.]
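As a brief sketch (not from the text): plotting a sample on the density scale with `freq = FALSE` makes the bar areas sum to 1, matching the idealized density.

```r
# Histogram of an Exponential(1) sample on the density scale
u = rexp(10000)
h = hist(u, freq = FALSE, xlab = "u", main = "")
curve(dexp(x), add = TRUE)      # overlay the idealized density
sum(h$density * diff(h$breaks)) # total bar area is 1
```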

8.2 Continuous distributions for a population proportion
We have seen a few examples where we used Normal distributions as prior
distributions for a population proportion 𝜃. Normal distributions are commonly
used as priors, but they do not allow for asymmetric prior distributions. We’ll
now consider Beta distributions, a family of distributions that are commonly
used as prior distributions for population proportions.

Example 8.1. Continuing Example 7.1 where 𝜃 represents the population pro-
portion of students in Cal Poly statistics classes who prefer to consider data as
a singular noun.

1. Assume a continuous prior distribution for 𝜃 which is proportional to 𝜃^2, 0 < 𝜃 < 1. Sketch this distribution.
2. The previous part implies that 𝜋(𝜃) = 𝑐𝜃^2, 0 < 𝜃 < 1, for an appropriate constant 𝑐. Find 𝑐.
3. Compute the prior mean of 𝜃.
4. Now we'll consider a few more prior distributions. Sketch each of the following priors. How do they compare?
a. proportional to 𝜃^2, 0 < 𝜃 < 1. (from previous)
b. proportional to 𝜃^5, 0 < 𝜃 < 1.

c. proportional to (1 − 𝜃)^2, 0 < 𝜃 < 1.
d. proportional to 𝜃^2 (1 − 𝜃)^2, 0 < 𝜃 < 1.
e. proportional to 𝜃^5 (1 − 𝜃)^2, 0 < 𝜃 < 1.

Solution to Example 8.1.

1. See the plot below. The distribution is similar to the discrete grid approximation in Example 7.2.
2. Set the total area under the curve equal to 1 and solve for 𝑐 = 3:

1 = ∫_{0}^{1} 𝑐𝜃^2 𝑑𝜃 = 𝑐 ∫_{0}^{1} 𝜃^2 𝑑𝜃 = 𝑐(1/3)  ⇒  𝑐 = 3

3. Since 𝜃 is continuous we use calculus:

𝐸(𝜃) = ∫_{0}^{1} 𝜃 𝜋(𝜃) 𝑑𝜃 = ∫_{0}^{1} 𝜃(3𝜃^2) 𝑑𝜃 = 3/4

4. See the plot below. The prior proportional to (1 − 𝜃)^2 is the mirror image of the prior proportional to 𝜃^2, reflected about 0.5. As the exponent on 𝜃 increases, more density is shifted towards 1. As the exponent on 1 − 𝜃 increases, more density is shifted towards 0. When the exponents are the same, the density is symmetric about 0.5.

[Figure: densities proportional to 𝜃^2, 𝜃^5, (1 − 𝜃)^2, 𝜃^2 (1 − 𝜃)^2, and 𝜃^5 (1 − 𝜃)^2 on (0, 1)]
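The normalizing constant and prior mean from parts 2 and 3 can be checked numerically in R with `integrate`:

```r
# Total area under theta^2 on (0, 1) is 1/3, so the normalizing constant is c = 3
area <- integrate(function(theta) theta^2, lower = 0, upper = 1)$value
c_const <- 1 / area
c_const

# Prior mean: integral of theta * pi(theta), with pi(theta) = 3 * theta^2
prior_mean <- integrate(function(theta) theta * 3 * theta^2, lower = 0, upper = 1)$value
prior_mean
```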

A continuous random variable 𝑈 has a Beta distribution with shape parameters 𝛼 > 0 and 𝛽 > 0 if its density satisfies2

𝑓(𝑢) ∝ 𝑢^(𝛼−1) (1 − 𝑢)^(𝛽−1), 0 < 𝑢 < 1,

and 𝑓(𝑢) = 0 otherwise.

• If 𝛼 = 𝛽 the distribution is symmetric about 0.5


• If 𝛼 > 𝛽 the distribution is skewed to the left (with greater density above
0.5 than below)
• If 𝛼 < 𝛽 the distribution is skewed to the right (with greater density below
0.5 than above)
• If 𝛼 = 1 and 𝛽 = 1, the Beta(1, 1) distribution is the Uniform distribution
on (0, 1).

It can be shown that a Beta(𝛼, 𝛽) density has

Mean (EV): 𝛼/(𝛼 + 𝛽)

Variance: (𝛼/(𝛼 + 𝛽))(1 − 𝛼/(𝛼 + 𝛽)) / (𝛼 + 𝛽 + 1)

Mode: (𝛼 − 1)/(𝛼 + 𝛽 − 2), (if 𝛼 > 1, 𝛽 ≥ 1 or 𝛼 ≥ 1, 𝛽 > 1)
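These formulas are easy to code directly; here is a quick check against the prior proportional to 𝜃^2 from the previous example (the function names below are our own, not from any package):

```r
# Mean, variance, and mode of a Beta(alpha, beta) distribution
beta_mean <- function(a, b) a / (a + b)
beta_var  <- function(a, b) beta_mean(a, b) * (1 - beta_mean(a, b)) / (a + b + 1)
beta_mode <- function(a, b) (a - 1) / (a + b - 2)

beta_mean(3, 1)       # 0.75, matching the calculus result above
sqrt(beta_var(3, 1))  # about 0.194
```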
Example 8.2. Continuing Example 8.1

1. Each of the distributions in the previous example was a Beta
distribution. For each distribution, identify the shape parameters and the
prior mean and standard deviation.
a. proportional to 𝜃^2, 0 < 𝜃 < 1.
b. proportional to 𝜃^5, 0 < 𝜃 < 1.
c. proportional to (1 − 𝜃)^2, 0 < 𝜃 < 1.
d. proportional to 𝜃^2 (1 − 𝜃)^2, 0 < 𝜃 < 1.
e. proportional to 𝜃^5 (1 − 𝜃)^2, 0 < 𝜃 < 1.
2. Now suppose that 31 students in a sample of 35 Cal Poly statistics students
prefer data as singular. Specify the shape of the likelihood as a function
of 𝜃, 0 < 𝜃 < 1.
2 The expression defines the shape of the Beta density. All that's missing is the scaling
constant which ensures that the total area under the density is 1. The actual Beta density
formula, including the normalizing constant, is

𝑓(𝑢) = [Γ(𝛼 + 𝛽)/(Γ(𝛼)Γ(𝛽))] 𝑢^(𝛼−1) (1 − 𝑢)^(𝛽−1), 0 < 𝑢 < 1,

where Γ(𝛼) = ∫_0^∞ 𝑒^(−𝑣) 𝑣^(𝛼−1) 𝑑𝑣 is the Gamma function. For a positive integer 𝑘, Γ(𝑘) = (𝑘 − 1)!.
Also, Γ(1/2) = √𝜋.

3. Starting with each of the prior distributions from the first part, find the
posterior distribution of 𝜃 based on this sample, and identify it as a Beta
distribution by specifying the shape parameters 𝛼 and 𝛽.
a. proportional to 𝜃^2, 0 < 𝜃 < 1.
b. proportional to 𝜃^5, 0 < 𝜃 < 1.
c. proportional to (1 − 𝜃)^2, 0 < 𝜃 < 1.
d. proportional to 𝜃^2 (1 − 𝜃)^2, 0 < 𝜃 < 1.
e. proportional to 𝜃^5 (1 − 𝜃)^2, 0 < 𝜃 < 1.
4. For each of the posterior distributions in the previous part, compute the
posterior mean and standard deviation. How does each posterior distri-
bution compare to its respective prior distribution?
Solution. to Example 8.2

1. Be careful with the exponents. For example, 𝜃^2 = 𝜃^2 (1 − 𝜃)^0 = 𝜃^(3−1) (1 − 𝜃)^(1−1),
which corresponds to a Beta(3, 1) distribution.

  Distribution   𝛼   𝛽   Proportional to               Mean    SD
a Beta(3, 1)     3   1   𝜃^2, 0 < 𝜃 < 1                0.750   0.194
b Beta(6, 1)     6   1   𝜃^5, 0 < 𝜃 < 1                0.857   0.124
c Beta(1, 3)     1   3   (1 − 𝜃)^2, 0 < 𝜃 < 1          0.250   0.194
d Beta(3, 3)     3   3   𝜃^2 (1 − 𝜃)^2, 0 < 𝜃 < 1      0.500   0.189
e Beta(6, 3)     6   3   𝜃^5 (1 − 𝜃)^2, 0 < 𝜃 < 1      0.667   0.149

2. Given 𝜃, the number of students in the sample who prefer data as singular,
𝑌, follows a Binomial(35, 𝜃) distribution. The likelihood is the probability
of observing 𝑌 = 31 viewed as a function of 𝜃:

   𝑓(31|𝜃) = (35 choose 31) 𝜃^31 (1 − 𝜃)^4, 0 < 𝜃 < 1
           ∝ 𝜃^31 (1 − 𝜃)^4, 0 < 𝜃 < 1

The constant (35 choose 31) does not affect the shape of the likelihood as a function
of 𝜃.
3. As always, the posterior distribution is proportional to the product of the
prior distribution and the likelihood. For the Beta(3, 1) prior, the prior
density is proportional to 𝜃^2, 0 < 𝜃 < 1, and for the observed data 𝑦 = 31
with 𝑛 = 35, the likelihood is proportional to 𝜃^31 (1 − 𝜃)^4, 0 < 𝜃 < 1.
Therefore, the posterior density, as a function of 𝜃, is proportional to

   𝜋(𝜃|𝑦 = 31) ∝ (𝜃^2) (𝜃^31 (1 − 𝜃)^4), 0 < 𝜃 < 1
               ∝ 𝜃^33 (1 − 𝜃)^4, 0 < 𝜃 < 1
               ∝ 𝜃^(34−1) (1 − 𝜃)^(5−1), 0 < 𝜃 < 1

Therefore, the posterior distribution of 𝜃 is the Beta(3 + 31, 1 + 35 - 31),


that is, the Beta(34, 5) distribution. The other situations are similar. The
prior changes but the likelihood stays the same, based on a sample with
31 successes and 35 − 31 = 4 failures. If the prior distribution is Beta(𝛼,
𝛽) then the posterior distribution is Beta(𝛼 + 31, 𝛽 + 35 − 31).

  Prior          Posterior proportional to            Posterior      Posterior   Posterior
  Distribution                                        Distribution   Mean        SD
a Beta(3, 1)     𝜃^(2+31) (1 − 𝜃)^(0+4), 0 < 𝜃 < 1    Beta(34, 5)    0.872       0.053
b Beta(6, 1)     𝜃^(5+31) (1 − 𝜃)^(0+4), 0 < 𝜃 < 1    Beta(37, 5)    0.881       0.049
c Beta(1, 3)     𝜃^(0+31) (1 − 𝜃)^(2+4), 0 < 𝜃 < 1    Beta(32, 7)    0.821       0.061
d Beta(3, 3)     𝜃^(2+31) (1 − 𝜃)^(2+4), 0 < 𝜃 < 1    Beta(34, 7)    0.829       0.058
e Beta(6, 3)     𝜃^(5+31) (1 − 𝜃)^(2+4), 0 < 𝜃 < 1    Beta(37, 7)    0.841       0.055

4. See the table above. Each posterior distribution concentrates more prob-
ability towards the observed sample proportion 31/35 = 0.886, though
there are some small differences due to the prior. The posterior SD is less
than the prior SD; there is less uncertainty about 𝜃 after observing some
data.

[Figure: the five prior densities Beta(3, 1), Beta(6, 1), Beta(1, 3), Beta(3, 3), and Beta(6, 3) (left) and the corresponding posterior densities Beta(34, 5), Beta(37, 5), Beta(32, 7), Beta(34, 7), and Beta(37, 7) (right)]
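The posterior means and SDs in the table can be reproduced with a few lines of R, using the Beta mean and variance formulas:

```r
# Each prior Beta(a, b) is updated with 31 successes and 4 failures
priors <- rbind(c(3, 1), c(6, 1), c(1, 3), c(3, 3), c(6, 3))
a_post <- priors[, 1] + 31
b_post <- priors[, 2] + 4

post_mean <- a_post / (a_post + b_post)
post_sd <- sqrt(post_mean * (1 - post_mean) / (a_post + b_post + 1))
round(cbind(a_post, b_post, post_mean, post_sd), 3)
```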

Beta distributions are often used in Bayesian models involving population pro-
portions. Consider some binary (“success/failure”) variable and let 𝜃 be the
population proportion of success. Select a random sample of size 𝑛 from the
population and let 𝑌 count the number of successes in the sample.
Beta-Binomial model. If 𝜃 has a Beta(𝛼, 𝛽) prior distribution and the con-
ditional distribution of 𝑌 given 𝜃 is the Binomial(𝑛, 𝜃) distribution, then the

posterior distribution of 𝜃 given 𝑦 is the Beta(𝛼 + 𝑦, 𝛽 + 𝑛 − 𝑦) distribution.

prior: 𝜋(𝜃) ∝ 𝜃^(𝛼−1) (1 − 𝜃)^(𝛽−1), 0 < 𝜃 < 1,

likelihood: 𝑓(𝑦|𝜃) ∝ 𝜃^𝑦 (1 − 𝜃)^(𝑛−𝑦), 0 < 𝜃 < 1,

posterior: 𝜋(𝜃|𝑦) ∝ 𝜃^(𝛼+𝑦−1) (1 − 𝜃)^(𝛽+𝑛−𝑦−1), 0 < 𝜃 < 1.

Try this applet which illustrates the Beta-Binomial model.


In a sense, you can interpret 𝛼 as “prior successes” and 𝛽 as “prior failures”, but
these are only “pseudo-observations”. Also, 𝛼 and 𝛽 are not necessarily integers.

Prior Data Posterior


Successes 𝛼 𝑦 𝛼+𝑦
Failures 𝛽 𝑛−𝑦 𝛽+𝑛−𝑦
Total 𝛼+𝛽 𝑛 𝛼+𝛽+𝑛

When the prior and posterior distribution belong to the same family, that fam-
ily is called a conjugate prior distribution for the likelihood. So, the Beta
distributions form a conjugate prior family for Binomial distributions.
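A quick numerical sketch of this conjugacy: renormalizing prior × likelihood on a fine grid recovers the Beta(𝛼 + 𝑦, 𝛽 + 𝑛 − 𝑦) density. The values below are from the data-as-singular example; the grid step is an arbitrary choice.

```r
theta <- seq(0.0005, 0.9995, 0.001)   # fine grid on (0, 1)
a <- 3; b <- 1; n <- 35; y <- 31

# prior * likelihood, renormalized so the grid values integrate to 1
unnorm <- dbeta(theta, a, b) * dbinom(y, n, theta)
grid_post <- unnorm / (sum(unnorm) * 0.001)

# agrees closely with the conjugate Beta(a + y, b + n - y) density
max(abs(grid_post - dbeta(theta, a + y, b + n - y)))
```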

Example 8.3. In Example 7.2 we used a grid approximation to the prior dis-
tribution of 𝜃. Now we will assume a continuous prior distribution. Assume
that 𝜃 has a Beta(3, 1) prior distribution and that 31 students in a sample of
35 Cal Poly statistics students prefer data as singular.

1. Plot the prior distribution, (scaled) likelihood, and posterior distribution.


2. Use software to find 50%, 80%, and 98% central posterior credible inter-
vals.
3. Compare the results to those using the grid approximation in Example
7.2.
4. Express the posterior mean as a weighted average of the prior mean and
sample proportion. Describe what the weights are, and explain why they
make sense.

Solution. to Example 8.3

1. See plot below. The posterior distribution is the Beta(34, 5) distribution.


Note that the grid in the code is just to plot things in R. In particular,
the posterior is computed using the Beta-Binomial model, not the grid.

theta = seq(0, 1, 0.0001) # the grid is just for plotting

# prior
alpha_prior = 3
beta_prior = 1
prior = dbeta(theta, alpha_prior, beta_prior)

# data
n = 35
y = 31

# likelihood
likelihood = dbinom(y, n, theta)

# posterior
alpha_post = alpha_prior + y
beta_post = beta_prior + n - y
posterior = dbeta(theta, alpha_post, beta_post)

# plot
ymax = max(c(prior, posterior))
scaled_likelihood = likelihood * ymax / max(likelihood)

plot(theta, prior, type='l', col='skyblue', xlim=c(0, 1), ylim=c(0, ymax), ylab='', yaxt='n')
par(new=T)
plot(theta, scaled_likelihood, type='l', col='orange', xlim=c(0, 1), ylim=c(0, ymax), ylab='', yaxt='n')
par(new=T)
plot(theta, posterior, type='l', col='seagreen', xlim=c(0, 1), ylim=c(0, ymax), ylab='', yaxt='n', xlab='theta')
legend("topleft", c("prior", "scaled likelihood", "posterior"), lty=1, col=c("skyblue", "orange", "seagreen"))

[Figure: the prior (Beta(3, 1)), scaled likelihood, and posterior (Beta(34, 5)) densities of theta]

# 50% posterior credible interval


qbeta(c(0.25, 0.75), alpha_post, beta_post)

## [1] 0.8398 0.9106

# 80% posterior credible interval


qbeta(c(0.1, 0.9), alpha_post, beta_post)

## [1] 0.8006 0.9346

# 98% posterior credible interval


qbeta(c(0.01, 0.99), alpha_post, beta_post)

## [1] 0.7239 0.9651

2. We can use qbeta to compute quantiles (a.k.a. percentiles). The pos-
terior mean is 0.872, and the posterior standard deviation is 0.053. There
is a posterior probability of 50% that between 84.0% and 91.1% of Cal
Poly students prefer data as singular; after observing the sample data,
it's equally plausible that 𝜃 is inside [0.840, 0.911] as outside. There is a
posterior probability of 80% that between 80.0% and 93.5% of Cal Poly
students prefer data as singular; after observing the sample data, it's four
times more plausible that 𝜃 is inside [0.800, 0.935] as outside. There is a
posterior probability of 98% that between 72.4% and 96.5% of Cal Poly
students prefer data as singular; after observing the sample data, it's 49
times more plausible that 𝜃 is inside [0.724, 0.965] as outside.

3. The results based on continuous distributions are the same as those for
the grid approximation. The grid is just an approximation of the “true”
Beta-Binomial theory.
4. The prior mean is 3/(3 + 1) = 0.75. The sample proportion is 31/35 = 0.886. The
posterior mean is 34/39 = 0.872. We can write

   34/39 = (3/4) × (4/39) + (31/35) × (35/39)
         = (3/4) × (4/(4 + 35)) + (31/35) × (35/(4 + 35))

The posterior mean is a weighted average of the prior mean and the sample
proportion where the weights are given by the relative "sample sizes". The
"prior sample size" is 3 + 1 = 4. The actual observed sample size is 35.

In the Beta-Binomial model, the posterior mean 𝐸(𝜃|𝑦) can be expressed as
a weighted average of the prior mean 𝐸(𝜃) = 𝛼/(𝛼 + 𝛽) and the sample proportion
𝑝̂ = 𝑦/𝑛:

   𝐸(𝜃|𝑦) = ((𝛼 + 𝛽)/(𝛼 + 𝛽 + 𝑛)) 𝐸(𝜃) + (𝑛/(𝛼 + 𝛽 + 𝑛)) 𝑝̂

As more data are collected, more weight is given to the sample proportion (and
less weight to the prior mean). The prior "weight" is determined by 𝛼 + 𝛽, which
is sometimes called the concentration and measured in "pseudo-observations".
Larger values of 𝛼 + 𝛽 indicate stronger prior beliefs, due to smaller prior vari-
ance, and give more weight to the prior mean.
The posterior variance generally gets smaller as more data are collected:

   Var(𝜃|𝑦) = 𝐸(𝜃|𝑦)(1 − 𝐸(𝜃|𝑦)) / (𝛼 + 𝛽 + 𝑛 + 1)
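Here is the weighted-average identity checked in R for the Beta(3, 1) prior and the 31/35 sample:

```r
a <- 3; b <- 1; n <- 35; y <- 31

prior_mean <- a / (a + b)   # 0.75
p_hat <- y / n              # 0.886

# weighted average of prior mean and sample proportion
post_mean <- (a + b) / (a + b + n) * prior_mean + n / (a + b + n) * p_hat
post_mean

# identical to the Beta(34, 5) posterior mean
(a + y) / (a + b + n)
```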

Example 8.4. Now let’s reconsider the posterior prediction parts of Example
7.2, treating 𝜃 as continuous. Assume that 𝜃 has a Beta(3, 1) prior distribution
and that 31 students in a sample of 35 Cal Poly statistics students prefer data as
singular, so that the posterior distribution of 𝜃 is the Beta(34, 5) distribution.

1. Suppose we plan to randomly select another sample of 35 Cal Poly statis-


tics students. Let 𝑌 ̃ represent the number of students in the selected
sample who prefer data as singular. How could we use simulation to ap-
proximate the posterior predictive distribution of 𝑌 ̃ ?
2. Use software to run the simulation and plot the posterior predictive dis-
tribution3 . Compare to Example 7.2.
3 The posterior predictive distribution can be found analytically in the Beta-Binomial sit-

3. Use the simulation results to approximate a 95% posterior prediction in-


terval for 𝑌 ̃ . Write a clearly worded sentence interpreting this interval in
context.

Solution. to Example 8.4

1. Simulate a value of 𝜃 from the posterior Beta(34, 5) distribution. Given


this value of 𝜃, simulate a value 𝑦 ̃ from a Binomial(35, 𝜃) distribution. Re-
peat many times, simulating many (𝜃, 𝑦)̃ pairs. The simulated distribution
of 𝑦 ̃ values will approximate the posterior predictive distribution.

2. We can use rbeta to simulate from a Beta distribution. The simulation


results are similar to those from the grid approximation.

n_sim = 10000

theta_sim = rbeta(n_sim, 34, 5)

y_sim = rbinom(n_sim, 35, theta_sim)

plot(table(y_sim) / n_sim, xlab = "y", ylab = "Posterior predictive probability")

uation. If 𝜃 ∼ Beta(𝛼, 𝛽) and (𝑌|𝜃) ∼ Binomial(𝑛, 𝜃) then the marginal distribution of 𝑌 is
the Beta-Binomial distribution with

   𝑃(𝑌 = 𝑦) = (𝑛 choose 𝑦) 𝐵(𝛼 + 𝑦, 𝛽 + 𝑛 − 𝑦) / 𝐵(𝛼, 𝛽), 𝑦 = 0, 1, …, 𝑛,

where 𝐵(𝛼, 𝛽) is the beta function, for which 𝐵(𝛼, 𝛽) = (𝛼 − 1)!(𝛽 − 1)!/(𝛼 + 𝛽 − 1)! if 𝛼, 𝛽 are positive integers. (For
general 𝛼, 𝛽 > 0, 𝐵(𝛼, 𝛽) = ∫_0^1 𝑢^(𝛼−1) (1 − 𝑢)^(𝛽−1) 𝑑𝑢 = Γ(𝛼)Γ(𝛽)/Γ(𝛼 + 𝛽).) The mean is 𝑛(𝛼/(𝛼 + 𝛽)). In R:
dbbinom, rbbinom, pbbinom in the extraDistr package.

[Figure: spike plot of the simulated posterior predictive distribution of ỹ, with values ranging from about 17 to 35]

quantile(y_sim, c(0.025, 0.975))

## 2.5% 97.5%
## 24 35

3. The interval is similar to the one from the grid approximation, and the
interpretation is the same. There is posterior predictive probability of 95%
that between 24 and 35 students in a sample of 35 students will prefer data
as singular.
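The footnote's Beta-Binomial formula can also be evaluated in base R with the `beta` function (no extra packages), as a check on the simulation:

```r
n <- 35
a_post <- 34; b_post <- 5   # posterior Beta(34, 5)

y_tilde <- 0:n
pp <- choose(n, y_tilde) * beta(a_post + y_tilde, b_post + n - y_tilde) / beta(a_post, b_post)

sum(pp)             # a valid pmf: probabilities sum to 1
sum(y_tilde * pp)   # mean n * a / (a + b) = 35 * 34 / 39
```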

You can tune the shape parameters — 𝛼 (like “prior successes”) and 𝛽 (like
“prior failures”) — of a Beta distribution to your prior beliefs in a few ways.
Recall that 𝜅 = 𝛼 + 𝛽 is the “concentration” or “equivalent prior sample size”.

• If prior mean 𝜇 and prior concentration 𝜅 are specified then

𝛼 = 𝜇𝜅
𝛽 = (1 − 𝜇)𝜅

• If prior mode 𝜔 and prior concentration 𝜅 (with 𝜅 > 2) are specified then

𝛼 = 𝜔(𝜅 − 2) + 1
𝛽 = (1 − 𝜔)(𝜅 − 2) + 1

• If prior mean 𝜇 and prior sd 𝜎 are specified then

𝛼 = 𝜇 (𝜇(1 − 𝜇)/𝜎^2 − 1)
𝛽 = (1 − 𝜇) (𝜇(1 − 𝜇)/𝜎^2 − 1)

• You can also specify two percentiles and use software to find 𝛼 and 𝛽. For
example, you could specify the endpoints of a prior 98% credible interval.
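These elicitation formulas are easy to wrap in small helper functions (the function names below are our own, not from any package):

```r
# prior mean mu and concentration kappa
beta_from_mean_conc <- function(mu, kappa) {
  c(alpha = mu * kappa, beta = (1 - mu) * kappa)
}

# prior mode omega and concentration kappa (requires kappa > 2)
beta_from_mode_conc <- function(omega, kappa) {
  c(alpha = omega * (kappa - 2) + 1, beta = (1 - omega) * (kappa - 2) + 1)
}

# prior mean mu and prior sd sigma
beta_from_mean_sd <- function(mu, sigma) {
  alpha <- mu * (mu * (1 - mu) / sigma^2 - 1)
  c(alpha = alpha, beta = alpha * (1 / mu - 1))
}

beta_from_mean_sd(0.15, 0.08)  # roughly alpha = 2.84, beta = 16.08
```

The last call reproduces the left-handedness prior tuned in Example 8.5 below.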

Example 8.5. Suppose we want to estimate 𝜃, the proportion of Cal Poly


students that are left-handed.

1. Sketch your Beta prior distribution for 𝜃. Describe its main features and
your reasoning. Then translate your prior into a Beta distribution by
specifying the shape parameters 𝛼 and 𝛽.
2. Assume a prior Beta distribution for 𝜃 with prior mean 0.15 and prior SD
0.08. Find 𝛼 and 𝛽, and a prior 98% credible interval for 𝜃.

Solution. to Example 8.5

1. Of course, choices will vary, based on what you know about left-
handedness. But do think about what your prior might look like, and use
one of the methods to translate it to a Beta distribution.
2. Let’s say we’ve heard that about 15% of people in general are left-handed,
but we’ve also heard 10% so we’re not super sure, and we also don’t know
how Cal Poly students compare to the general population. So we’ll assume
a prior Beta distribution for 𝜃 with prior mean 0.15 (our “best guess”) and
a prior SD of 0.08 to reflect our degree of uncertainty. This translates to
a Beta(2.8, 16.1) prior, with a central 98% prior credible interval for 𝜃
indicating that between 2.2% and 38.1% of Cal Poly students are left-handed. We
could probably go with more prior certainty than this, but it seems at
least like a reasonable starting place before observing data. We can (and
should) use prior predictive tuning to aid in choosing 𝛼 and 𝛽 for our Beta
distribution prior.

mu = 0.15
sigma = 0.08

alpha = mu ^ 2 * ((1 - mu) / sigma ^ 2 - 1 / mu); alpha

## [1] 2.838

beta <- alpha * (1 / mu - 1); beta

## [1] 16.08

qbeta(c(0.01, 0.99), alpha, beta)

## [1] 0.02222 0.38104


Chapter 9

Considering Prior
Distributions

One of the most commonly asked questions when one first encounters Bayesian
statistics is “how do we choose a prior?” While there is never one “perfect”
prior in any situation, we’ll discuss in this chapter some issues to consider when
choosing a prior. But first, here are a few big picture ideas to keep in mind.

• Bayesian inference is based on the posterior distribution, not the prior.


Therefore, the posterior requires much more attention than the prior.
• The prior is only one part of the Bayesian model. The likelihood is the
other part. And there is the data that is used to fit the model. Choice of
prior is just one of many modeling assumptions that should be evaluated
and checked.
• In many situations, the posterior distribution is not too sensitive to rea-
sonable changes in prior. In these situations, the important question isn’t
“what is the prior?” but rather “is there a prior at all”? That is, are you
adopting a Bayesian approach, treating parameters as random variables,
and quantifying uncertainty about parameters with probability distribu-
tions?
• One criticism of Bayesian statistics in general and priors in particular is
that they are subjective. However, any statistical analysis is inherently
subjective, filled with many assumptions and decisions along the way. Ex-
cept in the simplest situations, if you ask five statisticians how to approach
a particular problem, you will likely get five different answers. Priors and
Bayesian data analysis are no more inherently subjective than any of the
myriad other assumptions made in statistical analysis.

Subjectivity is OK, and often beneficial. Choosing a subjective prior allows us


to explicitly incorporate a wealth of past experience into our analysis.

155

Example 9.1. Xiomara claims that she can predict which way a coin flip will
land. Rogelio claims that he can taste the difference between Coke and Pepsi.
Before reading further, stop to consider: whose claim - Xiomara’s or Rogelio’s
- is initially more convincing? Or are you equally convinced? Why? To put
it another way, whose claim are you initially more skeptical of? Or are you
equally skeptical? To put it one more way, whose claim would require more
data to convince you?1
To test Xiomara’s claim, you flip a fair coin 10 times, and she correctly predicts
the result of 9 of the 10 flips. (You can assume the coin is fair, the flips are
independent, and there is no funny business in data collection.)
To test Rogelio’s claim, you give him a blind taste test of 10 cups, flipping
a coin for each cup to determine whether to serve Coke or Pepsi. Rogelio
correctly identifies 9 of the 10 cups. (You can assume the coin is fair, the flips
are independent, and there is no funny business in data collection.)
Let 𝜃𝑋 be the probability that Xiomara correctly guesses the result of a fair coin
flip. Let 𝜃𝑅 be the probability that Rogelio correctly guesses the soda (Coke or
Pepsi) in a randomly selected cup.

1. How might a frequentist address this situation? What would the conclu-
sion be?
2. Consider a Bayesian approach. Describe, in general terms, your prior
distributions for the two parameters. How do they compare? How would
this impact your conclusions?

Solution. to Example 9.1

1. For Xiomara, a frequentist might conduct a hypothesis test of the null


hypothesis 𝐻0 ∶ 𝜃𝑋 = 0.5 versus the alternative hypothesis: 𝐻𝑎 ∶ 𝜃𝑋 > 0.5.
The p-value would be about 0.01, the probability of observing at least 9
out of 10 successes from a Binomial distribution with parameters 10 and
0.5 (1 - pbinom(8, 10, 0.5)). Rogelio’s set up would be similar and
would yield the same p-value. So a strict frequentist would be equally
convinced of the two claims.
2. Prior to observing data, we are probably more skeptical of Xiomara’s
claim than Rogelio’s. Since coin flips are unpredictable, we would have a
strong prior belief that 𝜃𝑋 is close to 0.5 (what it would be if she were just
guessing). Our prior for 𝜃𝑋 would have a mean of 0.5 and a small prior
SD, to reflect that only values close to 0.5 seem plausible. Therefore, it
would require a lot of evidence to sway our prior beliefs.
On the other hand, we might be familiar with people who can tell the
difference between Coke and Pepsi; maybe we even can ourselves. Our
prior for 𝜃𝑅 would have a larger prior SD than that of 𝜃𝑋 to allow for
1 This example is motivated by an example in Section 1.1 of Dogucu et al. (2022).

a wider range of plausible values. We might even have a prior mean for
𝜃𝑅 above 0.5 if we have experience with a lot of people who can tell the
difference between Coke and Pepsi. Given the sample data, our posterior
probability that 𝜃𝑅 > 0.5 would be larger than the posterior probability
that 𝜃𝑋 > 0.5, and we would be more convinced by Rogelio’s claim than
by Xiomara’s.
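To make this concrete, here is a sketch with two illustrative priors (the specific Beta parameters are our own choices, not prescribed by the example): a tight Beta(50, 50) prior for 𝜃𝑋 and a more diffuse Beta(2, 2) prior for 𝜃𝑅, each updated with 9 successes in 10 trials:

```r
# Skeptical, concentrated prior for Xiomara; diffuse prior for Rogelio (illustrative choices)
post_X <- c(50 + 9, 50 + 1)   # Beta(59, 51) posterior after 9/10
post_R <- c(2 + 9, 2 + 1)     # Beta(11, 3) posterior after 9/10

# posterior probability that each theta exceeds 0.5
p_X <- 1 - pbeta(0.5, post_X[1], post_X[2])
p_R <- 1 - pbeta(0.5, post_R[1], post_R[2])
c(Xiomara = p_X, Rogelio = p_R)   # p_R is much larger than p_X
```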

[Figure: prior distributions (left) and prior and posterior distributions (right) of a population proportion for two people, Mabel and Dipper]

Even if a prior does not represent strong prior beliefs, just having a prior dis-
tribution at all allows for Bayesian analysis. Remember, both Bayesian and
frequentist are valid approaches to statistical analyses, each with advantages
and disadvantages. That said, there are some issues with frequentist approaches
that incorporating a prior distribution and adopting a Bayesian approach alle-
viates. (To be fair, an upcoming investigation will address some disadvantages
of the Bayesian approach compared with the frequentist approach.)
Example 9.2. Tamika is a basketball player who throughout her career has
had a probability of 0.5 of making any three point attempt. However, her coach
is afraid that her three point shooting has gotten worse. To check this, the
coach has Tamika shoot a series of three pointers; she makes 7 out of 24. Does
the coach have evidence that Tamika has gotten worse?
Let 𝜃 be the probability that Tamika successfully makes any three point attempt.
Assume attempts are independent.

1. Prior to collecting data, the coach decides that he’ll have convincing ev-
idence that Tamika has gotten worse if the p-value is less than 0.025.
Suppose the coach told Tamika to shoot 24 attempts and then stop and
count the number of successful attempts. Use software to compute the
p-value. Is the coach convinced that Tamika has gotten worse?
2. Prior to collecting data, the coach decides that he’ll have convincing ev-
idence that Tamika has gotten worse if the p-value is less than 0.025.
Suppose the coach told Tamika to shoot until she makes 7 three point-
ers and then stop and count the number of total attempts. Use software
to compute the p-value. Is the coach convinced that Tamika has gotten
worse? (Hint: the total number of attempts has a Negative Binomial
distribution.)

3. Now suppose the coach takes a Bayesian approach and assumes a Beta(𝛼,
𝛽) prior distribution for 𝜃. Suppose the coach told Tamika to shoot 24
attempts and then stop and count the number of successful attempts. Iden-
tify the likelihood function and the posterior distribution of 𝜃.
4. Now suppose the coach takes a Bayesian approach and assumes a Beta(𝛼,
𝛽) prior distribution for 𝜃. Suppose the coach told Tamika to shoot until
she makes 7 three pointers and then stop and count the number of total
attempts. Identify the likelihood function and the posterior distribution
of 𝜃.
5. Compare the Bayesian and frequentist approaches in this example. Does
the “strength of the evidence” depend on how the data were collected?

Solution. to Example 9.2

1. The null hypothesis is 𝐻0 ∶ 𝜃 = 0.5 and the alternative hypothesis is


𝐻𝑎 ∶ 𝜃 < 0.5. If the null hypothesis is true and Tamika has not gotten
worse, then 𝑌 , the number of successful attempts, has a Binomial(24,
0.5) distribution. The p-value is 𝑃 (𝑌 ≤ 7) = 0.032 from pbinom(7, 24,
0.5). Using a strict threshold of 0.025, the coach has NOT been convinced
that Tamika has gotten worse.
2. The null hypothesis is 𝐻0 ∶ 𝜃 = 0.5 and the alternative hypothesis is
𝐻𝑎 ∶ 𝜃 < 0.5. If the null hypothesis is true and Tamika has not got-
ten worse, then 𝑁 , the number of total attempts required to achieve 7
successful attempts, has a Negative Binomial(7, 0.5) distribution. The
p-value is 𝑃 (𝑁 ≥ 24) = 0.017 from 1 - pnbinom(23 - 7, 7, 0.5). (In
R, nbinom only counts the total number of failures, not the total number
of trials.) Using a strict threshold of 0.025, the coach has been convinced
that Tamika has gotten worse.
3. The data is 𝑌 , the number of successful attempts in 24 attempts, which
follows a Binomial(24, 𝜃) distribution. The likelihood is 𝑃 (𝑌 = 7|𝜃)

𝑓(𝑦 = 7|𝜃) = (24 choose 7) 𝜃^7 (1 − 𝜃)^17 ∝ 𝜃^7 (1 − 𝜃)^17, 0 < 𝜃 < 1.

The posterior distribution is the Beta(𝛼 + 7, 𝛽 + 17) distribution.


4. The data is 𝑁 , the number of total attempts required to achieve 7 success-
ful attempts, which follows a Negative Binomial(7, 𝜃) distribution. The
likelihood is 𝑃 (𝑁 = 24|𝜃)

𝑓(𝑛 = 24|𝜃) = (24 − 1 choose 7 − 1) 𝜃^7 (1 − 𝜃)^17 ∝ 𝜃^7 (1 − 𝜃)^17, 0 < 𝜃 < 1.

(The (24 − 1 choose 7 − 1) coefficient follows from the fact that the last attempt has to be a success.)
Note that the shape of the likelihood as a function of 𝜃 is the same as in
the previous part. Therefore, the posterior distribution is the Beta(𝛼 + 7,
𝛽 + 17) distribution.

5. Even though both frequentist scenarios involve 7 successes in 24 attempts,


the p-value measuring the strength of the evidence to reject the null hy-
pothesis differed depending on how the data were collected. Using a strict
cutoff of 0.025 led the coach to reject the null hypothesis in one scenario
but not the other. However, the Bayesian analysis is the same in either
scenario since the posterior distributions were the same. For the Bayesian
analysis, all that mattered about the data was that there were 7 successes
in 24 attempts.
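The two p-value computations side by side:

```r
# Stopping rule 1: fixed n = 24 attempts, count successes
p_binomial <- pbinom(7, 24, 0.5)              # P(Y <= 7), about 0.032

# Stopping rule 2: shoot until 7 makes, count total attempts
# (pnbinom counts failures; N >= 24 is equivalent to more than 23 - 7 = 16 failures)
p_negbinomial <- 1 - pnbinom(23 - 7, 7, 0.5)  # P(N >= 24), about 0.017

c(p_binomial, p_negbinomial)
```

Same data, different p-values; only the second falls below the coach's 0.025 threshold.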

Bayesian data analysis treats parameters as random variables with probabil-


ity distributions. The prior distribution quantifies the researcher’s uncertainty
about parameters before observing data. Some issues to consider when choosing
a prior include, in no particular order:

• The researcher’s prior beliefs! A prior distribution is part of a statisti-


cal model, and should be consistent with knowledge about the underlying
scientific problem. Researchers are often experts with a wealth of past ex-
perience that can be explicitly incorporated into the analysis via the prior
distribution. Such a prior is called an informative or weakly informative
prior.
• A regularizing prior. A prior which, when tuned properly, reduces over-
fitting or “overreacting” to the data.
• Noninformative prior, a.k.a. reference, vague, or flat prior. A prior is sought
that plays a minimal role in inference so that "the data can speak for itself".
• Mathematical convenience. The prior is chosen so that computation of
the posterior is simplified, as in the case of conjugate priors.
• Interpretation. The posterior is a compromise between the data and prior.
Some priors allow for easy interpretation of the relative contributions of
data and prior to the posterior. For example, think of the “prior successes
and prior failures” interpretation in the Beta-Binomial model.
• Prior based on past data. Bayesian updating can be viewed as an itera-
tive process. The posterior distribution obtained from one round of data
collection can inform the prior distribution for another round.

For those initially skeptical of prior distributions at all, the strategy of always
choosing a noninformative or flat prior might be appealing. Flat priors are
common, but are rarely ever the best choices from a modeling perspective. Just
like you would not want to assume a Normal distribution for the likelihood in
every problem, you would not want to use a flat prior in every problem.
Furthermore, there are some subtle issues that arise when attempting to choose
a noninformative prior.

Example 9.3. Suppose we want to estimate 𝜃, the population proportion of


Cal Poly students who wore socks at any point yesterday.

1. What are the possible values for 𝜃? What prior distribution might you
consider a noninformative prior distribution?
2. You might choose a Uniform(0, 1) prior, a.k.a., a Beta(1, 1) prior. Recall
how we interpreted the parameters 𝛼 and 𝛽 in the Beta-Binomial model.
Does the Beta(1, 1) distribution represent “no prior information”?

3. Suppose in a sample of 20 students, 4 wore socks yesterday. How would


you estimate 𝜃 with a single number based only on the data?
4. Assume a Beta(1, 1) prior and the 4/20 sample data. Identify the posterior
distribution. Recall that one Bayesian point estimate of 𝜃 is the posterior
mean. Find the posterior mean of 𝜃. Does this estimate let the “data
speak entirely for itself”?
5. How could you change 𝛼 and 𝛽 in the Beta distribution prior to repre-
sent no prior information? Sketch the prior. Do you see any potential
problems?
6. Assume a Beta(0, 0) prior for 𝜃 and the 4/20 sample data. Identify the
posterior distribution. Find the posterior mode of 𝜃. Does this estimate
let the “data speak entirely for itself”?
7. Now suppose the parameter you want to estimate is the odds that a student
wore socks yesterday, 𝜙 = 𝜃/(1 − 𝜃). What are the possible values of 𝜙? What
might a non-informative prior look like? Is this a proper prior?
8. Assume a Beta(1, 1) prior for 𝜃. Use simulation to approximate the prior
distribution of the odds 𝜙. Would you say this is a noninformative prior
for 𝜙?
Solution. to Example 9.3

1. 𝜃 takes values in (0, 1). We might assume a flat prior on (0, 1), that is a
Uniform(0, 1) prior.
2. We interpreted 𝛼 as “prior successes” and 𝛽 as “prior failures”. So a Beta(1,
1) is in some sense equivalent to a "prior sample size" of 2. Certainly not
a lot of prior information, but it’s not “no prior information” either.
3. The sample proportion, 4/20 = 0.2.
4. With a Beta(1, 1) prior and the 4/20 sample data, the posterior distri-
bution is Beta(5, 17). The posterior mean of 𝜃 is 5/22 = 0.227. The
posterior mean is a weighted average of the prior mean and the sample
proportion: 0.227 = (0.5)(2/22) + (0.2)(20/22). The “noninformative”
prior does have influence; the data does not “speak entirely for itself”.
5. If 𝛼 + 𝛽 represents “prior sample size”, we could try a Beta(0, 0) prior.
Unfortunately, such a probability distribution does not actually exist. For
a Beta distribution, the parameters 𝛼 and 𝛽 have to be strictly positive in
order to have a valid pdf. The Beta(0, 0) density would be proportional
to
𝜋(𝜃) ∝ 𝜃^(−1) (1 − 𝜃)^(−1), 0 < 𝜃 < 1.

However, this is not a valid pdf since ∫_0^1 𝜃^(−1) (1 − 𝜃)^(−1) 𝑑𝜃 = ∞, so there is
no constant that can normalize it to integrate to 1. Even so, here is a plot
of the “density”.

[Figure: the improper Beta(0, 0) "density" as a function of theta, with nearly all mass concentrated near 0 and 1]

Would you say this is a “noninformative” prior? It seems to concentrate


almost all prior “density” near 0 and 1.

6. Beta(0, 0) is an “improper” prior. It’s not a proper prior distribution,


but it can lead to a proper posterior distribution. The likelihood is 𝑓(𝑦 = 4|𝜃) ∝ 𝜃^4 (1 − 𝜃)^16, 0 < 𝜃 < 1. If we assume the prior is 𝜋(𝜃) ∝ 𝜃^(−1) (1 − 𝜃)^(−1), 0 < 𝜃 < 1, then the posterior is

𝜋(𝜃|𝑦 = 4) ∝ (𝜃^(−1) (1 − 𝜃)^(−1)) (𝜃^4 (1 − 𝜃)^16) = 𝜃^(4−1) (1 − 𝜃)^(16−1), 0 < 𝜃 < 1

That is, the posterior distribution is the Beta(4, 16) distribution. The
posterior mean is 4/20 = 0.2, the sample proportion. However, the posterior
mode is (4 − 1)/(4 + 16 − 2) = 3/18 ≈ 0.167. So the posterior mode does not let the
"data speak entirely for itself".

7. If 𝜃 = 0 then 𝜙 = 0; if 𝜃 = 1 then 𝜙 = ∞. So 𝜙 takes values in (0, ∞). We
might choose a flat prior on (0, ∞), 𝜋(𝜙) ∝ 1, 𝜙 > 0. However, this would
be an improper prior.
8. Simulate a value of 𝜃 from a Beta(1, 1) distribution, compute 𝜙 = 𝜃/(1 − 𝜃), and
repeat many times. The simulation results are below. (The distribution
is extremely skewed to the right, so we’re only plotting values in (0, 50).)

theta = rbeta(1000000, 1, 1)
odds = theta / (1 - theta)
hist(odds[odds < 50], breaks = 100, xlab = "odds", freq = FALSE,
     ylab = "density",
     main = "Prior distribution of odds if prior distribution of probability is Uniform(0, 1)")

[Figure: “Prior distribution of odds if prior distribution of probability is Uniform(0, 1)”; histogram of the simulated odds over (0, 50), with density highest near 0 and decreasing as the odds increase.]

Even though the prior for 𝜃 was flat, the prior for a transformation of 𝜃 is
not.
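The weighted-average identity from part 4 of the solution above can be verified numerically; here is a quick check in R using the quantities from this example:

```r
# Beta(1, 1) prior with y = 4 successes in n = 20 trials
alpha = 1
beta = 1
y = 4
n = 20

# posterior mean of the Beta(5, 17) posterior
post_mean = (alpha + y) / (alpha + beta + n)

# weighted average of the prior mean and the sample proportion
prior_mean = alpha / (alpha + beta)
prior_weight = (alpha + beta) / (alpha + beta + n)
weighted = prior_mean * prior_weight + (y / n) * (1 - prior_weight)

c(post_mean, weighted)  # both equal 5/22 = 0.227
```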

An improper prior distribution is a prior distribution that does not integrate to 1, so it is not a proper probability density. However, an improper prior often results in a proper posterior distribution. Thus, improper prior distributions
are sometimes used in practice.
Flat priors are common choices in some situations, but are rarely ever the best
choices from a modeling perspective. Furthermore, flat priors are generally not
preserved under transformations of parameters. So a prior that is flat under one
parametrization of the problem will generally not be flat under another. For
example, when trying to estimate a population SD 𝜎, assuming a flat prior for
𝜎 will result in a non-flat prior for the population variance 𝜎², and vice versa.
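As a quick simulation illustration (with a hypothetical Uniform(0, 10) prior for 𝜎), the induced prior on 𝜎² is far from flat:

```r
# If sigma has a flat Uniform(0, 10) prior, what prior does that induce on sigma^2?
set.seed(1)
sigma = runif(100000, 0, 10)
sigma_sq = sigma^2

# Under a flat prior on sigma^2 over (0, 100), P(sigma^2 < 50) would be 0.5;
# here it is about 0.71, since P(sigma^2 < 50) = P(sigma < sqrt(50))
mean(sigma_sq < 50)

hist(sigma_sq, breaks = 50, freq = FALSE, xlab = "sigma^2",
     main = "Induced prior on sigma^2 when sigma is Uniform(0, 10)")
```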
Example 9.4. Suppose that 𝜃 represents the population proportion of adults
who have a particular rare disease.

1. Explain why you might not want to use a flat Uniform(0, 1) prior for 𝜃.
2. Assume a Uniform(0, 1) prior. Suppose you will test 𝑛 = 100 suspected
cases. Use simulation to approximate the prior predictive distribution of

the number in the sample who have the disease. Does this seem reason-
able?
3. Assume a Uniform(0, 1) prior. Suppose that in 𝑛 = 100 suspected cases,
none actually has the disease. Find and interpret the posterior median.
Does this seem reasonable?

Solution. to Example 9.4

1. We know it’s a rare disease! We want to concentrate most of our prior probability for 𝜃 near 0.
2. If the disease is rare, we might not expect any actual cases in a sample
of 100, maybe 1 or 2. However, the prior predictive distribution says that
any value between 0 and 100 actual cases is equally likely! This seems
very unreasonable given that the disease is rare.

theta_sim = runif(10000)
y_sim = rbinom(10000, 100, theta_sim)
hist(y_sim,
xlab = "Simulated number of successes",
main = "Prior predictive distribution")

[Figure: histogram of the prior predictive distribution; the simulated number of successes is spread roughly evenly over 0 to 100.]

3. The posterior distribution is the Beta(1, 101) distribution. The posterior
median is 0.007 (qbeta(0.5, 1, 101)). Based on a sample of 100 suspected
cases with no actual cases, there is a posterior probability of 50%
that more than 0.7% of people have the disease. A rate of 7 actual cases in
1000 is not a very rare disease, and we think there’s a 50% chance that the

rate is even greater than this? Again, this does not seem very reasonable
based on our knowledge that the disease is rare.
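The posterior median in part 3 can be checked directly; for a Beta(1, 𝑏) distribution, the median has the closed form 1 − 0.5^(1/𝑏):

```r
# posterior median of the Beta(1, 101) distribution
qbeta(0.5, 1, 101)  # about 0.0068, i.e., roughly 0.7%

# closed form for the median of a Beta(1, b) distribution:
# solve 1 - (1 - x)^b = 0.5 for x, with b = 101
1 - 0.5^(1 / 101)
```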

Prior predictive distributions can be used to check the reasonableness of a prior for a given situation before observing sample data. Do the simulated
samples seem consistent with what you might expect of the data based on your
background knowledge of the situation? If not, another prior might be more
reasonable.
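For example, here is the same prior predictive check with a hypothetical Beta(1, 99) prior (prior mean 0.01), which concentrates 𝜃 near 0 as befits a rare disease:

```r
set.seed(1)
theta_sim = rbeta(10000, 1, 99)  # hypothetical informative prior, mean 0.01
y_sim = rbinom(10000, 100, theta_sim)
hist(y_sim,
     xlab = "Simulated number of successes",
     main = "Prior predictive distribution, Beta(1, 99) prior")
```

Now almost all simulated samples contain only a few actual cases, which is much more consistent with background knowledge that the disease is rare.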

9.1 What NOT to do when considering priors


You have a great deal of flexibility in choosing a prior, and there are many
reasonable approaches. However, there are a few things that you should NOT
do.
Do NOT choose a prior that assigns 0 probability/density to possible
values of the parameter regardless of how initially implausible the values are.
Even very stubborn priors can be overturned with enough data, but no amount
of data can turn a prior probability of 0 into a positive posterior probability.
Always consider the range of possible values of the parameter, and be sure the
prior density is non-zero over that range of values.
Do NOT base the prior on the observed data. The prior reflects the
degree of uncertainty about parameters before observing data. Adjusting the
prior to reflect observed data to achieve some desired result is akin to “data
snooping” or “p-hacking” and is bad statistics. (Of course, the posterior is
based on the observed data. But not the prior.)
Do NOT feel like you have to find that one, perfect prior. The prior
is just one assumption of the model and should be considered just like other
assumptions. In practice, no assumption of a statistical model is ever satisfied
exactly. We only hope that our set of assumptions provides a reasonable model
for reality. No one prior will ever be just right for a situation, but some might
be more reasonable than others. You are not only allowed but encouraged to
try different priors to see how sensitive the results are to the choice of prior.
(Remember, you should check the other assumptions too!) There is also no
requirement that you have to choose a single prior. It’s possible to consider
several models, each consisting of its own prior, and average over these models.
(We’ll see a little more detail about model averaging later.)
Do NOT worry too much about the prior! In general, in Bayesian esti-
mation the larger the sample size the smaller the role that the prior plays. But
it is often desirable for the prior to play some role. You should not feel the need
to apologize for using an informative prior when significant prior knowledge is available.
Chapter 10

Introduction to Posterior
Simulation and JAGS

In the Beta-Binomial model there is a simple expression for the posterior dis-
tribution. However, in most problems it is not possible to find the posterior
distribution analytically, and therefore we must approximate it.

Example 10.1. Consider Example 8.5 again, in which we wanted to estimate the proportion of Cal Poly students that are left-handed. In that example we specified a prior by first specifying a prior mean of 0.15 and a prior SD of 0.08
and then we found the corresponding Beta prior. However, when dealing with
means and SDs, it is natural — but by no means necessary — to work with
Normal distributions. Suppose we want to assume a Normal distribution prior
for 𝜃 with mean 0.15 and SD 0.08. Also suppose that in a sample of 25 Cal Poly
students 5 are left-handed. We want to find the posterior distribution.
Note: the Normal distribution prior assigns positive (but small) density outside
of (0, 1). So we can either truncate the prior to 0 outside of (0, 1) or just rely on
the fact that the likelihood will be 0 for 𝜃 outside of (0, 1) to assign 0 posterior
density outside (0, 1).

1. Write an expression for the shape of the posterior density. Is this a rec-
ognizable probability distribution?
2. We have seen one method for approximating a posterior distribution. How
could you employ it here?

Solution. to Example 10.1

1. As always, the posterior density is proportional to the product of the prior


density and the likelihood function.

Prior: 𝜋(𝜃) ∝ (1/0.08) exp(−(𝜃 − 0.15)² / (2(0.08²)))
Likelihood: 𝑓(𝑦|𝜃) ∝ 𝜃⁵(1 − 𝜃)²⁰
Posterior: 𝜋(𝜃|𝑦) ∝ (𝜃⁵(1 − 𝜃)²⁰)((1/0.08) exp(−(𝜃 − 0.15)² / (2(0.08²))))

This is not a recognizable probability density.

2. We can use grid approximation and treat the continuous parameter 𝜃 as discrete.

theta = seq(0, 1, 0.0001)

# prior
prior = dnorm(theta, 0.15, 0.08)
prior = prior / sum(prior)

# data
n = 25 # sample size
y = 5 # sample count of success

# likelihood, using binomial


likelihood = dbinom(y, n, theta) # function of theta

# posterior
product = likelihood * prior
posterior = product / sum(product)

# plot
ylim = c(0, max(c(prior, posterior, likelihood / sum(likelihood))))
plot(theta, prior, type='l', xlim=c(0, 1), ylim=ylim, col="skyblue", xlab='theta', ylab='')
par(new=T)
plot(theta, likelihood / sum(likelihood), type='l', xlim=c(0, 1), ylim=ylim, col="orange", xlab='', ylab='')
par(new=T)
plot(theta, posterior, type='l', xlim=c(0, 1), ylim=ylim, col="seagreen", xlab='', ylab='')
legend("topright", c("prior", "scaled likelihood", "posterior"), lty=1, col=c("skyblue", "orange", "seagreen"))

[Figure: prior (skyblue), scaled likelihood (orange), and posterior (seagreen) plotted against theta.]

Grid approximation is one method for approximating a posterior distribution.
However, finding a sufficiently fine grid approximation suffers from the “curse
of dimensionality” and does not work well in multi-parameter problems. For
example, suppose you use a grid of 1000 points to approximate the distribution
of any single parameter. Then you would need a grid of 1000² points to
approximate the joint distribution of any two parameters, 1000³ points for three
parameters, and so on. The size of the grid increases exponentially with the
number of parameters and becomes computationally infeasible in problems with
more than a few parameters. (And later we’ll see some examples that include
hundreds of parameters.) Furthermore, if the posterior density changes very
quickly over certain regions, then even finer grids might be needed to provide
reliable approximations of the posterior in these regions. (Though if the posterior
density is relatively smooth over some regions, then we might be able to get
away with a coarser grid in these regions.)
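The exponential growth described above is easy to tabulate, assuming 1000 grid points per parameter:

```r
# total number of grid points needed as the number of parameters grows,
# with 1000 grid points per parameter
p = 1:5
data.frame(parameters = p, grid_points = 1000^p)
```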
The most common way to approximate a posterior distribution is via simulation.
The inputs to the simulation are

• Observed data 𝑦
• Model for the data, 𝑓(𝑦|𝜃) which depends on parameters 𝜃. (This model
determines the likelihood function.)
• Prior distribution for parameters 𝜋(𝜃)

We then employ some simulation algorithm to approximate the posterior distribution of 𝜃 given the observed data 𝑦, 𝜋(𝜃|𝑦), without computing the posterior distribution analytically.

Careful: we have already used simulation to approximate predictive distributions. Here we are primarily focusing on using simulation to approximate the posterior distribution of parameters.
Let’s consider a discrete example first.

Example 10.2. Continuing the kissing study in Example 5.2 where 𝜃 can only
take values 0.1, 0.3, 0.5, 0.7, 0.9. Consider a prior distribution which places
probability 1/9, 2/9, 3/9, 2/9, 1/9 on the values 0.1, 0.3, 0.5, 0.7, 0.9, respec-
tively. Suppose that 𝑦 = 8 couples in a sample of size 𝑛 = 12 lean right.

1. Describe in detail how you could use simulation to approximate the pos-
terior distribution of 𝜃, without first computing the posterior distribution.
2. Code and run the simulation. Compare the simulation-based approxima-
tion to the true posterior distribution from Example 5.2.
3. How would the simulation/code change if 𝜃 had a Beta prior distribution,
say Beta(3, 3)?
4. Suppose that 𝑛 = 1200 and 𝑦 = 800. What would be the problem with
running the above simulation in this situation? Hint: compute the proba-
bility that 𝑌 equals 800 for a Binomial distribution with parameters 1200
and 0.667.

Solution. to Example 10.2

1. Remember that the posterior distribution is the conditional distribution of
parameters 𝜃 given the observed data 𝑦. Therefore, we need to approximate
the conditional distribution of 𝜃 given 𝑦 = 8 successes in a sample of size
𝑛 = 12.

• Simulate a value of 𝜃 from the prior distribution.
• Given 𝜃, simulate a value of 𝑌 from the Binomial distribution with
parameters 𝑛 = 12 and 𝜃.
• Repeat the above steps many times, generating many (𝜃, 𝑌 ) pairs.
• To condition on 𝑦 = 8, discard any (𝜃, 𝑌 ) pairs for which 𝑌 is not 8.
Summarize the 𝜃 values for the remaining pairs to approximate the
posterior distribution of 𝜃. For example, to approximate the posterior
probability that 𝜃 equals 0.7, count the number of repetitions in which
𝜃 equals 0.7 and 𝑌 equals 8 and divide by the count of repetitions in
which 𝑌 equals 8.

2. See code below. The simulation approximates the posterior distribution
fairly well in this case. Notice that we simulate 100,000 (𝜃, 𝑌 ) pairs, but
only around 10,000 or so yield a value of 𝑌 equal to 8. Therefore, the
posterior approximation is based on roughly 10,000 values, not 100,000.

n_sim = 100000

theta_prior_sim = sample(c(0.1, 0.3, 0.5, 0.7, 0.9),
                         size = n_sim,
                         replace = TRUE,
                         prob = c(1, 2, 3, 2, 1) / 9)

y_sim = rbinom(n_sim, 12, theta_prior_sim)

kable(head(data.frame(theta_prior_sim, y_sim), 20))

theta_prior_sim y_sim
0.1 0
0.5 2
0.5 5
0.7 10
0.1 2
0.5 6
0.5 8
0.1 1
0.7 7
0.5 7
0.7 10
0.3 3
0.5 5
0.5 7
0.9 11
0.7 12
0.3 6
0.5 8
0.3 5
0.9 10

theta_post_sim = theta_prior_sim[y_sim == 8]

table(theta_post_sim)

## theta_post_sim
## 0.3 0.5 0.7 0.9
## 177 4070 5031 253

plot(table(theta_post_sim) / length(theta_post_sim),
xlab = "theta",
ylab = "Relative frequency")

# true posterior for comparison
par(new = T)
plot(c(0.3, 0.5, 0.7, 0.9), c(0.0181, 0.4207, 0.5365, 0.0247),
col = "orange", type = "o",
xaxt = 'n', yaxt = 'n', xlab = "", ylab = "")

[Figure: simulated posterior relative frequencies of theta compared with the true posterior probabilities (orange).]

3. The only difference is that we would first simulate a value of 𝜃 from its
Beta(3, 3) prior distribution (using rbeta). Now any value between 0 and
1 is a possible value of 𝜃. But we would still approximate the posterior
distribution by discarding any (𝜃, 𝑌 ) pairs for which 𝑌 is not equal to 8.
Since 𝜃 is continuous, we could summarize the simulated values with a
histogram or density plot.

n_sim = 100000

theta_prior_sim = rbeta(n_sim, 3, 3)

y_sim = rbinom(n_sim, 12, theta_prior_sim)

kable(head(data.frame(theta_prior_sim, y_sim), 20))



theta_prior_sim y_sim
0.4965 6
0.3985 6
0.4446 6
0.4233 7
0.4427 5
0.5821 6
0.1518 4
0.6136 6
0.7128 6
0.3245 5
0.7393 8
0.3007 4
0.9726 12
0.3001 3
0.2044 1
0.3401 2
0.6289 6
0.7696 11
0.7605 10
0.3823 4

theta_post_sim = theta_prior_sim[y_sim == 8]

hist(theta_post_sim, freq = FALSE,
     xlab = "theta",
     ylab = "Density")
lines(density(theta_post_sim))

# true posterior for comparison
lines(density(rbeta(100000, 3 + 8, 3 + 4)), col = "orange")

[Figure: histogram and density of theta_post_sim, with the true Beta(11, 7) posterior density overlaid (orange).]

4. Now we need to approximate the conditional distribution of 𝜃 given 800
successes in a sample of size 𝑛 = 1200. The probability that 𝑌 equals 800
for a Binomial distribution with parameters 1200 and 2/3 is about 0.024
(dbinom(800, 1200, 2 / 3)). Since the sample proportion 800/1200 =
2/3 maximizes the likelihood of 𝑦 = 800, the probability is even smaller
for the other values of 𝜃.
Therefore, if we generate 100,000 (𝜃, 𝑌 ) pairs, only a few hundred or so
of them would yield 𝑦 = 800 and so the posterior approximation would
be unreliable. If we wanted the posterior approximation to be based on
10,000 simulated values from the conditional distribution of 𝜃 given 𝑦 = 800,
we would first have to generate about 10 million (𝜃, 𝑌 ) pairs.

In principle, the posterior distribution 𝜋(𝜃|𝑦) given observed data 𝑦 can be found
by

• simulating many 𝜃 values from the prior distribution
• simulating, for each simulated value of 𝜃, a 𝑌 value from the corresponding
conditional distribution of 𝑌 given 𝜃 (𝑌 could be a sample or the value of
a sample statistic)
• discarding (𝜃, 𝑌 ) pairs for which the simulated 𝑌 value is not equal to the
observed 𝑦 value
• summarizing the simulated 𝜃 values for the remaining pairs with 𝑌 = 𝑦.

However, this is a very computationally inefficient way of approximating the posterior distribution. Unless the sample size is really small, the simulated sample statistic 𝑌 will only match the observed 𝑦 value in relatively few samples,

simply because in large samples there are just many more possibilities. For
example, in 1000 flips of a fair coin, the most likely value of the number of
heads is 500, but the probability of exactly 500 heads in 1000 flips is only 0.025.
When there are many possibilities, the probability gets stretched fairly thin.
Therefore, if we want say 10,000 simulated values of 𝜃 given 𝑦, we would
first have to simulate many, many more values.
The situation is even more extreme when the data is continuous, where the
probability of replicating the observed sample is essentially 0.
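Both probabilities quoted in this discussion can be checked directly in R:

```r
# P(exactly 500 heads in 1000 flips of a fair coin)
dbinom(500, 1000, 0.5)    # about 0.025

# P(Y = 800) for Y ~ Binomial(1200, 2/3), the likelihood-maximizing theta
dbinom(800, 1200, 2 / 3)  # about 0.024
```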
Therefore, we need more efficient simulation algorithms for approximating posterior distributions. Markov chain Monte Carlo (MCMC) methods¹ provide powerful and widely applicable algorithms for simulating from probability distributions, including complex and high-dimensional distributions. These
algorithms include Metropolis-Hastings, Gibbs sampling, Hamiltonian Monte
Carlo, among others. We will see later some of the ideas behind MCMC algo-
rithms. However, we will rely on software to carry out MCMC simulations.

10.1 Introduction to JAGS

JAGS² (“Just Another Gibbs Sampler”) is a stand-alone program for performing
MCMC simulations. JAGS takes as input a Bayesian model description — prior
plus likelihood — and data and returns an MCMC sample from the posterior
distribution. JAGS uses a combination of Metropolis sampling, Gibbs sampling,
and other MCMC algorithms.
A few JAGS resources:

• JAGS User Manual


• JAGS documentation
• Some notes about JAGS error messages
• Doing Bayesian Data Analysis textbook website

The basic steps of a JAGS program are:

1. Load the data


2. Define the model: likelihood and prior
3. Compile the model in JAGS
4. Simulate values from the posterior distribution
5. Summarize simulated values and check diagnostics
¹ For some history, and an origin of the use of “Monte Carlo”, see Wikipedia.
² If you’ve ever heard of BUGS (or WinBUGS), JAGS is very similar but with a few nicer
features.

This section provides a brief introduction to JAGS in some relatively simple situations.
Using the rjags package, one can interact with JAGS entirely within R.

library(rjags)

10.1.1 Load the data


We’ll use the “data is singular” context as an example. Compare the results of
JAGS simulations to the results in Chapter 7.
The data could be loaded from a file, or specified via sufficient summary statis-
tics. Here we’ll just load the summary statistics and in later examples we’ll
show how to load individual values.

n = 35 # sample size
y = 31 # number of successes

10.1.2 Specify the model: likelihood and prior


A JAGS model specification starts with model. The model provides a textual
description of likelihood and prior. This text string will then be passed to JAGS
for translation.
Recall that for the Beta-Binomial model, the prior distribution is 𝜃 ∼ Beta(𝛼, 𝛽)
and the likelihood for the total number of successes 𝑌 in a sample of size 𝑛
corresponds to (𝑌 |𝜃) ∼ Binomial(𝑛, 𝜃). Notice how the following text reflects
the model (prior & likelihood).
Note: JAGS syntax is similar to, but not the same, as R syntax. For example,
compare dbinom(y, n, theta) in R versus y ~ dbinom(theta, n) in JAGS.
See the JAGS user manual for more details. You can use comments with # in
JAGS models, similar to R.

model_string <- "model{

# Likelihood
y ~ dbinom(theta, n)

# Prior
theta ~ dbeta(alpha, beta)
alpha <- 3 # prior successes
beta <- 1 # prior failures

}"

Again, the above is just a text string, which we’ll pass to JAGS for translation.

10.1.3 Compile in JAGS

We pass the model (which is just a text string) and the data to JAGS to be
compiled via jags.model. The model is defined by the text string via the
textConnection function. The model can also be saved in a separate file, with
the file name being passed to JAGS. The data is passed to JAGS in a list. In
dataList below y = y, n = n maps the data defined by y=31, n=35 to the
terms y, n specified in the model_string.

dataList = list(y = y, n = n)

model <- jags.model(file = textConnection(model_string),
                    data = dataList)

## Compiling model graph


## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 1
## Unobserved stochastic nodes: 1
## Total graph size: 5
##
## Initializing model

10.1.4 Simulate values from the posterior distribution

Simulating values in JAGS is completed in essentially two steps. The update
command runs the simulation for a “burn-in” period. The update function
merely “warms-up” the simulation, and the values sampled during the update
phase are not recorded. (We will discuss “burn-in” in more detail later.)

update(model, n.iter = 1000)

After the update phase, we simulate values from the posterior distribution that
we’ll actually keep using coda.samples. Using coda.samples arranges the
output in a format conducive to using coda, a package which contains helpful
functions for summarizing and diagnosing MCMC simulations. The variables to
record simulated values for are specified with the variable.names argument.
Here there is only a single parameter theta, but we’ll see multi-parameter ex-
amples later.

Nrep = 10000 # number of values to simulate

posterior_sample <- coda.samples(model,
                                 variable.names = c("theta"),
                                 n.iter = Nrep)

10.1.5 Summarizing simulated values and diagnostic checking

Standard R functions like summary and plot can be used to summarize re-
sults from coda.samples. We can summarize the simulated values of theta to
approximate the posterior distribution.

summary(posterior_sample)

##
## Iterations = 2001:12000
## Thinning interval = 1
## Number of chains = 1
## Sample size per chain = 10000
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## 0.871382 0.052462 0.000525 0.000777
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## 0.756 0.839 0.878 0.909 0.956

plot(posterior_sample)

[Figure: trace plot and density plot of theta from the posterior sample (N = 10000).]

The Doing Bayesian Data Analysis (DBDA2E) textbook package also has
some nice functions built in, in particular in the DBDA2E-utilities.R file.

For example, the plotPost function creates an annotated plot of the posterior
distribution along with some summary statistics. (See the DBDA2E documentation for additional arguments.)

source("DBDA2E-utilities.R")

##
## *********************************************************************
## Kruschke, J. K. (2015). Doing Bayesian Data Analysis, Second Edition:
## A Tutorial with R, JAGS, and Stan. Academic Press / Elsevier.
## *********************************************************************

plotPost(posterior_sample)

[Figure: plotPost output; posterior mode 0.896, 95% HDI from 0.772 to 0.967.]

## ESS mean median mode hdiMass hdiLow hdiHigh compVal pGtCompVal


## Param. Val. 4563 0.8714 0.8776 0.8964 0.95 0.772 0.9666 NA NA
## ROPElow ROPEhigh pLtROPE pInROPE pGtROPE
## Param. Val. NA NA NA NA NA

The bayesplot R package also provides lots of nice plotting functionality.

library(bayesplot)
mcmc_hist(posterior_sample)

0.6 0.7 0.8 0.9 1.0


theta

mcmc_dens(posterior_sample)

[Figure: mcmc_dens density plot of theta.]

mcmc_trace(posterior_sample)

[Figure: mcmc_trace trace plot of theta over 10000 iterations.]

10.1.6 Posterior prediction

The output from coda.samples is stored in mcmc.list format. The simulated values of the variables identified in the variable.names argument can be extracted as a matrix (or array) and then manipulated as usual R objects.

thetas = as.matrix(posterior_sample)
head(thetas)

## theta
## [1,] 0.8479
## [2,] 0.8715
## [3,] 0.8592
## [4,] 0.8828
## [5,] 0.8846
## [6,] 0.8927

hist(thetas)

[Figure: histogram of the simulated values of theta.]

The matrix would have one column for each variable named in variable.names;
in this case, there is only one column corresponding to the simulated values of
theta.

We can now use the simulated values of theta to simulate replicated samples to
approximate the posterior predictive distribution. To be clear, the code below
is running R commands within R (not JAGS).

(There is a way to simulate predictive values within JAGS itself, but I think
it’s more straightforward in R. Just use JAGS to get a simulated sample from
the posterior distribution. On the other hand, if you’re using Stan there are
functions for simulating and summarizing posterior predicted values.)

ynew = rbinom(Nrep, n, thetas)

plot(table(ynew),
main = "Posterior Predictive Distribution for samples of size 35",
xlab = "y")

[Figure: posterior predictive distribution for samples of size 35, over y values from about 18 to 35.]

10.1.7 Loading data as individual values rather than summary statistics
Instead of the total count (modeled by a Binomial likelihood), the individual
data values (1/0 = S/F) can be provided, which could be modeled by a Bernoulli
(i.e. Binomial(trials=1)) likelihood. That is, (𝑌1 , … , 𝑌𝑛 |𝜃) ∼ i.i.d. Bernoulli(𝜃),
rather than (𝑌 |𝜃) ∼ Binomial(𝑛, 𝜃). The vector y below represents the data
in this format. Notice how the likelihood in the model specification changes in
response; the n observations are specified via a for loop.
# Load the data
y = c(rep(1, 31), rep(0, 4)) # vector of 31 1s and 4 0s
n = length(y)

model_string <- "model{

# Likelihood
for (i in 1:n){
y[i] ~ dbern(theta)
}

# Prior
theta ~ dbeta(alpha, beta)
alpha <- 3

beta <- 1

}"

10.1.8 Simulating multiple chains

The Bernoulli model can be passed to JAGS similar to the Binomial model
above. Below, we have also introduced the n.chains argument, which simu-
lates multiple Markov chains and allows for some additional diagnostic checks.
Simulating multiple chains helps assess convergence of the Markov chain to the
target distribution. (We’ll discuss more details later.) Initial values for the
chains can be provided in a list with the inits argument; otherwise initial
values are generated automatically.

# Compile the model


dataList = list(y = y, n = n)

model <- jags.model(textConnection(model_string),
                    data = dataList,
                    n.chains = 5)

## Compiling model graph


## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 35
## Unobserved stochastic nodes: 1
## Total graph size: 39
##
## Initializing model

# Simulate
update(model, 1000, progress.bar = "none")

Nrep = 10000

posterior_sample <- coda.samples(model,
                                 variable.names = c("theta"),
                                 n.iter = Nrep,
                                 progress.bar = "none")

# Summarize and check diagnostics


summary(posterior_sample)

##
## Iterations = 1001:11000
## Thinning interval = 1
## Number of chains = 5
## Sample size per chain = 10000
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## 0.872245 0.052547 0.000235 0.000238
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## 0.753 0.841 0.878 0.911 0.956

plot(posterior_sample)

[Figure: trace plot and density plot of theta from the five combined chains.]

If multiple chains are simulated, the DBDA2E function diagMCMC can be used
for diagnostics.
Note: Some of the DBDA2E output, in particular from diagMCMC, isn’t always
displayed when the RMarkdown file is knit. You might need to manually run
these cells within RStudio. I’m not sure why; please let me know if you figure
it out.
10.1. INTRODUCTION TO JAGS 185

plotPost(posterior_sample)

[Figure: plotPost output; posterior mode 0.897, 95% HDI from 0.767 to 0.963.]

## ESS mean median mode hdiMass hdiLow hdiHigh compVal


## Param. Val. 50000 0.8722 0.8782 0.8974 0.95 0.7672 0.9635 NA
## pGtCompVal ROPElow ROPEhigh pLtROPE pInROPE pGtROPE
## Param. Val. NA NA NA NA NA NA

diagMCMC(posterior_sample)

10.1.9 ShinyStan

We can use regular R functionality for plotting, or functions from packages like DBDA2E or bayesplot. Another nice tool is ShinyStan, which provides
an interactive utility for exploring the results of MCMC simulations. While
ShinyStan was developed for the Stan package, it can use output from JAGS
and other MCMC packages. You’ll need to install the shinystan package and
its dependencies.
The code below will launch in a browser the ShinyStan GUI for exploring the re-
sults of the JAGS simulation. The as.shinystan command takes coda.samples
output (stored as an mcmc-list) and puts it in the proper format for ShinyStan.
(Note: this code won’t display anything in the notes. You’ll have to actually
run it to see what happens.)

library(shinystan)
my_sso <- launch_shinystan(as.shinystan(posterior_sample,
model_name = "Bortles!!!"))

10.1.10 Back to the left-handed problem

Let’s return again to the problem in Example 10.1, in which we wanted to estimate the proportion of Cal Poly students that are left-handed. Assume a
Normal distribution prior for 𝜃 with mean 0.15 and SD 0.08. Also suppose that
in a sample of 25 Cal Poly students 5 are left-handed. We will use JAGS to find
the (approximate) posterior distribution.
Important note: in JAGS a Normal distribution is parametrized by its precision,
which is the reciprocal of the variance: dnorm(mean, precision). That is, for
a 𝑁(𝜇, 𝜎) distribution, the precision, often denoted 𝜏, is 𝜏 = 1/𝜎². For example,
in JAGS dnorm(0, 1 / 4) corresponds to a precision of 1/4, a variance of 4,
and a standard deviation of 2.
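The conversion between SD and precision can be sketched in plain R, mirroring the prior SD of 0.08 used below:

```r
sigma = 0.08       # prior SD
tau = 1 / sigma^2  # precision, the second argument of JAGS's dnorm(mu, tau)
tau                # 156.25
1 / sqrt(tau)      # converting a precision back to an SD: 0.08
```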

# Data
n = 25
y = 5

# Model
model_string <- "model{

# Likelihood
y ~ dbinom(theta, n)

# Prior
theta ~ dnorm(mu, tau)

mu <- 0.15 # prior mean
tau <- 1 / 0.08 ^ 2 # prior precision; prior SD = 0.08

}"

dataList = list(y = y, n = n)

# Compile
model <- jags.model(file = textConnection(model_string),
data = dataList)

## Compiling model graph


## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 1
## Unobserved stochastic nodes: 1
## Total graph size: 9
##
## Initializing model

model <- jags.model(textConnection(model_string),
                    data = dataList,
                    n.chains = 5)

## Compiling model graph


## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 1
## Unobserved stochastic nodes: 1
## Total graph size: 9
##
## Initializing model

# Simulate
update(model, 1000, progress.bar = "none")

Nrep = 10000

posterior_sample <- coda.samples(model,
                                 variable.names = c("theta"),
                                 n.iter = Nrep,
                                 progress.bar = "none")
progress.bar = "none")

# Summarize and check diagnostics


summary(posterior_sample)

##
## Iterations = 2001:12000
## Thinning interval = 1
## Number of chains = 5
## Sample size per chain = 10000
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## 0.183882 0.052470 0.000235 0.000308
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## 0.0894 0.1468 0.1812 0.2181 0.2930

plot(posterior_sample)

[Figure: trace plot and density plot of theta for the left-handed example.]

The posterior density is similar to what we computed with the grid approximation.
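As a check on the JAGS output, a grid approximation of this posterior can be reproduced in a few lines (sketched in Python for illustration; the book's grid code is in R). The likelihood is Binomial(25, 𝜃) evaluated at 𝑦 = 5 and the prior is Normal(0.15, 0.08), evaluated on a grid over (0, 1):

```python
import math

# Grid approximation: Binomial(n = 25, theta) likelihood at y = 5,
# Normal(mean = 0.15, sd = 0.08) prior, on a fine grid over (0, 1)
n, y = 25, 5
mu, sd = 0.15, 0.08

thetas = [i / 10000 for i in range(1, 10000)]
prior = [math.exp(-0.5 * ((t - mu) / sd) ** 2) for t in thetas]
like = [math.comb(n, y) * t ** y * (1 - t) ** (n - y) for t in thetas]

# posterior is proportional to prior times likelihood; normalize over the grid
product = [p * l for p, l in zip(prior, like)]
total = sum(product)
posterior = [p / total for p in product]

post_mean = sum(t * p for t, p in zip(thetas, posterior))
print(round(post_mean, 3))  # close to the JAGS posterior mean of about 0.184
```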
Chapter 11

Odds and Bayes Factors

Example 11.1. The ELISA test for HIV was widely used in the mid-1990s for
screening blood donations. As with most medical diagnostic tests, the ELISA
test is not perfect. If a person actually carries the HIV virus, experts estimate
that this test gives a positive result 97.7% of the time. (This number is called
the sensitivity of the test.) If a person does not carry the HIV virus, ELISA
gives a negative (correct) result 92.6% of the time (the specificity of the test).
Estimates at the time were that 0.5% of the American public carried the HIV
virus (the base rate).
Suppose that a randomly selected American tests positive; we are interested in
the conditional probability that the person actually carries the virus.

1. Before proceeding, make a guess for the probability in question.

0 − 20% 20 − 40% 40 − 60% 60 − 80% 80 − 100%

2. Denote the probabilities provided in the setup using proper notation


3. Construct an appropriate two-way table and use it to compute the prob-
ability of interest.
4. Construct a Bayes table and use it to compute the probability of interest.
5. Explain why this probability is small, compared to the sensitivity and
specificity.
6. By what factor has the probability of carrying HIV increased, given a
positive test result, as compared to before the test?

Solution. to Example 11.1

Show/hide solution

1. We don’t know what you guessed, but from experience many people guess
80-100%. After all, the test is correct for most people who carry HIV,


and also correct for most people who don’t carry HIV, so it seems like the
test is correct most of the time. But this argument ignores one important
piece of information that has a huge impact on the results: most people
do not carry HIV.

2. Let 𝐻 denote the event that the person carries HIV (hypothesis), and let
𝐸 denote the event that the test is positive (evidence). Therefore, 𝐻 𝑐 is
the event that the person does not carry HIV, another hypothesis. We are
given

• prior probability: 𝑃 (𝐻) = 0.005


• likelihood of testing positive, if the person carries HIV: 𝑃 (𝐸|𝐻) =
0.977
• 𝑃 (𝐸 𝑐 |𝐻 𝑐 ) = 0.926
• likelihood of testing positive, if the person does not carry HIV:
𝑃 (𝐸|𝐻 𝑐 ) = 1 − 𝑃 (𝐸 𝑐 |𝐻 𝑐 ) = 1 − 0.926 = 0.074
• We want to find the posterior probability 𝑃 (𝐻|𝐸).

3. Considering a hypothetical population of Americans (at the time)

• 0.5% of Americans carry HIV


• 97.7% of Americans who carry HIV test positive
• 92.6% of Americans who do not carry HIV test negative
• We want to find the percentage of Americans who test positive that
carry HIV.

4. Assuming 1000000 Americans

Tests positive Does not test positive Total


Carries HIV 4885 115 5000
Does not carry HIV 73630 921370 995000
Total 78515 921485 1000000

Among the 78515 who test positive, 4885 carry HIV, so the probability
that an American who tests positive actually carries HIV is 4885/78515
= 0.062.

5. See the Bayes table below.

6. The result says that only 6.2% of Americans who test positive actually
carry HIV. It is true that the test is correct for most Americans with HIV
(4885 out of 5000) and incorrect only for a small proportion of Americans
who do not carry HIV (73630 out of 995000). But since so few Americans
carry HIV, the sheer number of false positives (73630) swamps the number
of true positives (4885).

7. Prior to observing the test result, the prior probability that an American
carries HIV is 𝑃 (𝐻) = 0.005. The posterior probability that an American
carries HIV given a positive test result is 𝑃 (𝐻|𝐸) = 0.062.

𝑃(𝐻|𝐸)/𝑃(𝐻) = 0.062/0.005 ≈ 12.4

An American who tests positive is about 12.4 times more likely to carry
HIV than an American for whom the test result is not known. So while 0.062
is still small in absolute terms, the posterior probability is much larger
relative to the prior probability.

hypothesis prior likelihood product posterior


Carries HIV 0.005 0.977 0.0049 0.0622
Does not carry HIV 0.995 0.074 0.0736 0.9378
sum 1.000 NA 0.0785 1.0000
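The table arithmetic above can be verified with a few lines of code (a Python sketch of the prior × likelihood, then normalize, recipe):

```python
prior = {"carries HIV": 0.005, "does not carry HIV": 0.995}
# likelihood of a positive test under each hypothesis
likelihood = {"carries HIV": 0.977, "does not carry HIV": 0.074}

# product = prior * likelihood; posterior = product / sum of products
product = {h: prior[h] * likelihood[h] for h in prior}
total = sum(product.values())
posterior = {h: product[h] / total for h in product}

print(round(posterior["carries HIV"], 4))  # about 0.0622
```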
Remember, the conditional probability of 𝐻 given 𝐸, 𝑃(𝐻|𝐸), is not the same
as the conditional probability of 𝐸 given 𝐻, 𝑃(𝐸|𝐻), and they can be vastly
different. It is helpful to think of probabilities as percentages and ask “percent of
what?” For example, the percentage of people who carry HIV that test positive
is a very different quantity than the percentage of people who test positive that
carry HIV. Make sure to properly identify the “denominator” or baseline group
the percentages apply to.
Posterior probabilities can be highly influenced by the original prior probabili-
ties, sometimes called the base rates. The example illustrates that when the
base rate for a condition is very low and the test for the condition is less than
perfect, there will be a relatively high probability that a positive test is a false
positive. Don’t neglect the base rates when evaluating posterior probabilities.

Example 11.2. True story: On a camping trip in 2003, my wife and I were
driving in Vermont when, suddenly, a very large, hairy, black animal lumbered
across the road in front of us and into the woods on the other side. It happened
very quickly, and at first I said “It’s a gorilla!” But then after some thought,
and much derision from my wife, I said “it was probably a bear.”
I think this story provides an anecdote about Bayesian reasoning, albeit bad
reasoning at first but then good. Put the story in a Bayesian context by identi-
fying hypotheses, evidence, prior, and likelihood. What was the mistake I made
initially?

Show/hide solution

• “Type of animal” is playing the role of the hypothesis: gorilla, bear, dog,
squirrel, rabbit, etc.
• That the animal is very large, hairy, and black is the evidence.

• The likelihood value for the animal being very large, hairy, and black is
close to 1 for both a bear and gorilla, maybe more middling for a dog, but
close to 0 for a squirrel, rabbit, etc.

The mistake I made initially was to neglect the base rates and not consider
my prior probabilities. Let’s say the likelihood is 1 for both gorilla and bear
and 0 for all other animals. Then based solely on the likelihoods, the posterior
probability would be 50/50 for gorilla and bear, which maybe is why I guessed
gorilla.
After my initial reaction, I paused to formulate my prior probabilities, which
considering I was in Vermont, gave much higher probability to a bear than a
gorilla. (My prior probabilities should also have given even higher probability
to animals such as dogs, squirrels, and rabbits.)
By combining prior and likelihood in the appropriate way, the posterior prob-
ability is

• very high for a bear, due to high likelihood and not-too-small prior,
• close to 0 for a gorilla, due to the very small prior,
• and very low for a squirrel or rabbit or other small animals because of the
close-to-zero likelihood, even if the prior is large.

Recall that the odds of an event is a ratio involving the probability that the
event occurs and the probability that the event does not occur:

odds(𝐴) = 𝑃(𝐴)/𝑃(𝐴ᶜ) = 𝑃(𝐴)/(1 − 𝑃(𝐴))

In many situations (e.g., gambling) odds are reported as odds against 𝐴, that
is, the odds in favor of 𝐴 not occurring, a.k.a. the odds of 𝐴ᶜ: 𝑃(𝐴ᶜ)/𝑃(𝐴).
The probability of an event can be obtained from its odds:

𝑃(𝐴) = odds(𝐴)/(1 + odds(𝐴))
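These two conversions are inverses of each other, which is easy to verify numerically (a Python sketch; the function names are ours):

```python
def odds(p):
    # odds in favor of an event with probability p
    return p / (1 - p)

def prob(o):
    # recover the probability from the odds
    return o / (1 + o)

print(odds(0.2))        # 0.25, i.e., odds of 1 to 4 in favor
print(prob(odds(0.2)))  # 0.2, the round trip recovers the probability
```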
Example 11.3. Continuing Example 11.1

1. In symbols and words, what does one minus the answer to the probability
in question in Example 11.1 represent?
2. Calculate the prior odds of a randomly selected American having the HIV
virus, before taking an ELISA test.
3. Calculate the posterior odds of a randomly selected American having the
HIV virus, given a positive test result.
4. By what factor has the odds of carrying HIV increased, given a positive
test result, as compared to before the test? This is called the Bayes
factor.

5. Suppose you were given the prior odds and the Bayes factor. How could
you compute the posterior odds?
6. Compute the ratio of the likelihoods of testing positive, for those who
carry HIV and for those who do not carry HIV. What do you notice?
Solution. to Example 11.3
Show/hide solution

1. 1 − 𝑃(𝐻|𝐸) = 𝑃(𝐻ᶜ|𝐸) = 0.938 is the posterior probability that an
American who has a positive test does not carry HIV.
2. The prior probability of carrying HIV is 𝑃(𝐻) = 0.005 and the prior
probability of not carrying HIV is 𝑃(𝐻ᶜ) = 1 − 0.005 = 0.995.

𝑃(𝐻)/𝑃(𝐻ᶜ) = 0.005/0.995 = 1/199 ≈ 0.005025
These are the prior odds in favor of carrying HIV. The prior odds against
carrying HIV are
𝑃(𝐻ᶜ)/𝑃(𝐻) = 0.995/0.005 = 199
That is, prior to taking the test, an American is 199 times more likely to
not carry HIV than to carry HIV.
3. The posterior probability of carrying HIV given a positive test is
𝑃 (𝐻|𝐸) = 0.062 and the posterior probability of not carrying HIV given
a positive test is 𝑃 (𝐻 𝑐 |𝐸) = 1 − 0.062 = 0.938.
𝑃(𝐻|𝐸)/𝑃(𝐻ᶜ|𝐸) = 0.062/0.938 ≈ 0.066
These are the posterior odds in favor of carrying HIV given a positive test.
The posterior odds against carrying HIV given a positive test are
𝑃(𝐻ᶜ|𝐸)/𝑃(𝐻|𝐸) = 0.938/0.062 ≈ 15.1
That is, given a positive test, an American is 15.1 times more likely to not
carry HIV than to carry HIV.
4. Comparing the prior and posterior odds in favor of carrying HIV,
𝐵𝐹 = posterior odds / prior odds = 0.066/0.005025 ≈ 13.2
The odds of carrying HIV are 13.2 times greater given a positive test result
than prior to taking the test. The Bayes Factor is 𝐵𝐹 = 13.2.
5. By definition
𝐵𝐹 = posterior odds / prior odds
Rearranging yields
posterior odds = prior odds × 𝐵𝐹

6. The likelihood of testing positive given HIV is 𝑃 (𝐸|𝐻) = 0.977 and the
likelihood of testing positive given no HIV is 𝑃 (𝐸|𝐻 𝑐 ) = 1−0.926 = 0.074.

𝑃(𝐸|𝐻)/𝑃(𝐸|𝐻ᶜ) = 0.977/0.074 = 13.2

This value is the Bayes factor! So we could have computed the Bayes
factor without first computing the posterior probabilities or odds.

• If 𝑃(𝐻) is the prior probability of 𝐻, the prior odds (in favor) of 𝐻 are
𝑃(𝐻)/𝑃(𝐻ᶜ).
• If 𝑃(𝐻|𝐸) is the posterior probability of 𝐻 given 𝐸, the posterior odds
(in favor) of 𝐻 given 𝐸 are 𝑃(𝐻|𝐸)/𝑃(𝐻ᶜ|𝐸).
• The Bayes factor (BF) is defined to be the ratio of the posterior odds
to the prior odds

𝐵𝐹 = posterior odds / prior odds = [𝑃(𝐻|𝐸)/𝑃(𝐻ᶜ|𝐸)] / [𝑃(𝐻)/𝑃(𝐻ᶜ)]

• The odds form of Bayes rule says

posterior odds = prior odds × Bayes factor

𝑃(𝐻|𝐸)/𝑃(𝐻ᶜ|𝐸) = [𝑃(𝐻)/𝑃(𝐻ᶜ)] × 𝐵𝐹

• Apply Bayes rule to 𝑃(𝐻|𝐸) and 𝑃(𝐻ᶜ|𝐸)

𝑃(𝐻|𝐸)/𝑃(𝐻ᶜ|𝐸) = [𝑃(𝐸|𝐻)𝑃(𝐻)/𝑃(𝐸)] / [𝑃(𝐸|𝐻ᶜ)𝑃(𝐻ᶜ)/𝑃(𝐸)]
= [𝑃(𝐻)/𝑃(𝐻ᶜ)] × [𝑃(𝐸|𝐻)/𝑃(𝐸|𝐻ᶜ)]

posterior odds = prior odds × 𝑃(𝐸|𝐻)/𝑃(𝐸|𝐻ᶜ)

• Therefore, the Bayes factor for hypothesis 𝐻 given evidence 𝐸 can be
calculated as the ratio of the likelihoods

𝐵𝐹 = 𝑃(𝐸|𝐻)/𝑃(𝐸|𝐻ᶜ)

• That is, the Bayes factor can be computed without first computing posterior
probabilities or odds.
• Odds form of Bayes rule

𝑃(𝐻|𝐸)/𝑃(𝐻ᶜ|𝐸) = [𝑃(𝐻)/𝑃(𝐻ᶜ)] × [𝑃(𝐸|𝐻)/𝑃(𝐸|𝐻ᶜ)]

posterior odds = prior odds × Bayes factor
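The full chain of calculations in Example 11.3 can be reproduced in a few lines (a Python sketch for illustration):

```python
# ELISA example: prior odds x Bayes factor = posterior odds
prior_prob = 0.005
prior_odds = prior_prob / (1 - prior_prob)   # 1/199
bayes_factor = 0.977 / 0.074                 # ratio of likelihoods

posterior_odds = prior_odds * bayes_factor
posterior_prob = posterior_odds / (1 + posterior_odds)

print(round(bayes_factor, 1))    # 13.2
print(round(posterior_prob, 3))  # 0.062, matching the Bayes table
```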



Example 11.4. Continuing Example 11.1. Now suppose that 5% of individuals


in a high-risk group carry the HIV virus. Consider a randomly selected person
from this group who takes the test. Suppose the sensitivity and specificity of
the test are the same as in Example 11.1.

1. Compute and interpret the prior odds that a person carries HIV.
2. Use the odds form of Bayes rule to compute the posterior odds that the
person carries HIV given a positive test, and interpret the posterior odds.
3. Use the posterior odds to compute the posterior probability that the per-
son carries HIV given a positive test.

Solution. to Example 11.4

1. 𝑃 (𝐻)/𝑃 (𝐻 𝑐 ) = 0.05/0.95 = 1/19 ≈ 0.0526. A person in this group is 19


times more likely to not carry HIV than to carry HIV.
2. The posterior odds are the product of the prior odds and the Bayes factor.
The Bayes factor is the ratio of the likelihoods. Since the sensitivity and
specificity are the same as in the previous example, the likelihoods are the
same, and the Bayes factor is the same.

𝑃(𝐸|𝐻)/𝑃(𝐸|𝐻ᶜ) = 0.977/0.074 = 13.2

Therefore
posterior odds = prior odds × Bayes factor = (1/19) × 13.2 ≈ 0.695 ≈ 1/1.44
Given a positive test, a person in this group is 1.44 times more likely to
not carry HIV than to carry HIV.
3. The odds are the ratio of the posterior probabilities, and we basically just
rescale so they add to 1. The posterior probability is

𝑃(𝐻|𝐸) = 0.695/(1 + 0.695) = 1/(1 + 1.44) ≈ 0.410
The Bayes table is below; we have added a row for the ratios to illustrate
the odds calculations.

hypothesis prior likelihood product posterior


Carries HIV 0.0500 0.977 0.0489 0.4100
Does not carry HIV 0.9500 0.074 0.0703 0.5900
sum 1.0000 NA 0.1191 1.0000
ratio 0.0526 13.203 0.6949 0.6949
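The odds-form calculation for the high-risk group can be checked the same way (a Python sketch):

```python
prior_prob = 0.05
prior_odds = prior_prob / (1 - prior_prob)   # 1/19
bayes_factor = 0.977 / 0.074                 # unchanged: same sensitivity and specificity

posterior_odds = prior_odds * bayes_factor
posterior_prob = posterior_odds / (1 + posterior_odds)

print(round(posterior_odds, 3))  # about 0.695
print(round(posterior_prob, 2))  # about 0.41
```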
Chapter 12

Introduction to Bayesian
Model Comparison

A Bayesian model is composed of both a model for the data (likelihood) and
a prior distribution on model parameters. Model selection usually refers to
choosing between different models for the data (likelihoods). But it can also
concern choosing between models with the same likelihood but different priors.
In Bayesian model comparison, prior probabilities are assigned to each of the
models, and these probabilities are updated given the data according to Bayes
rule. Bayesian model comparison can be viewed as Bayesian estimation in a
hierarchical model with an extra level for “model”. (We’ll cover hierarchical
models in more detail later.)

Example 12.1. Suppose I have some trick coins, some of which are biased in
favor of landing on heads, and some of which are biased in favor of landing on
tails.1 I will select a trick coin at random; let 𝜃 be the probability that the
selected coin lands on heads in any single flip. I will flip the coin 𝑛 times and
use the data to decide about the direction of its bias. This can be viewed as a
choice between two models

• Model 1: the coin is biased in favor of landing on heads


• Model 2: the coin is biased in favor of landing on tails

1. Assume that in model 1 the prior distribution for 𝜃 is Beta(7.5, 2.5).


Suppose in 𝑛 = 10 flips there are 6 heads. Use simulation to approximate
the probability of observing 6 heads in 10 flips given that model 1 is correct.

1 The examples in the section are motivated by examples in Kruschke (2015).


2. Assume that in model 2 the prior distribution for 𝜃 is Beta(2.5, 7.5).


Suppose in 𝑛 = 10 flips there are 6 heads. Use simulation to approximate
the probability of observing 6 heads in 10 flips given that model 2 is correct.
3. Use the simulation results to approximate and interpret the Bayes Factor
in favor of model 1 given 6 heads in 10 flips.
4. Suppose our prior probability for each model was 0.5. Find the posterior
probability of each model given 6 heads in 10 flips.
5. Suppose I know I have a lot more tail biased coins, so my prior probability
for model 1 was 0.1. Find the posterior probability of each model given 6
heads in 10 flips.
Now suppose I want to predict the number of heads in the next 10 flips of
the selected coin.
6. Use simulation to approximate the posterior predictive distribution of the
number of heads in the next 10 flips given 6 heads in the first 10 flips
given that model 1 is the correct model. In particular, approximate the
posterior predictive probability that there are 7 heads in the next 10 flips
given then model 1 is the correct model.
7. Repeat the previous part assuming model 2 is the correct model.
8. Suppose our prior probability for each model was 0.5. Use simulation to
approximate the posterior predictive distribution of the number of heads
in the next 10 flips given 6 heads in the first 10 flips. In particular,
approximate the posterior predictive probability that there are 7 heads in
the next 10 flips.

Solution. to Example 12.1

1. Given that model 1 is correct, simulate a value of 𝜃 from a Beta(7.5,


2.5) prior, and then given 𝜃 simulate a value of 𝑦 from a Binomial(10, 𝜃)
distribution. Repeat many times. The proportion of simulated repetitions
that yield a 𝑦 value of 6 approximates the probability of observing 6 heads
in 10 flips given that model 1 is correct. The probability is 0.124.

Nrep = 1000000
theta = rbeta(Nrep, 7.5, 2.5)
y = rbinom(Nrep, 10, theta)
sum(y == 6) / Nrep

## [1] 0.1243

2. Similar to the previous part, with the model 2 prior. The probability is


0.042.

Nrep = 1000000
theta = rbeta(Nrep, 2.5, 7.5)
y = rbinom(Nrep, 10, theta)
sum(y == 6) / Nrep

## [1] 0.04191

3. The Bayes factor is the ratio of the likelihoods. The likelihood of 6 heads
in 10 flips under model 1 is 0.124, and under model 2 is 0.042. The Bayes
factor in favor of model 1 is 0.124/0.042 = 2.95. Observing 6 heads in
10 flips is 2.95 times more likely under model 1 than under model 2. Also, the
posterior odds in favor of model 1 given 6 heads in 10 flips are 2.95 times
greater than the prior odds in favor of model 1.
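Because the Beta prior is conjugate to the binomial likelihood, the simulated probabilities in parts 1 and 2 can also be computed exactly from the beta-binomial marginal distribution, 𝑃(𝑦) = C(𝑛, 𝑦) B(𝛼 + 𝑦, 𝛽 + 𝑛 − 𝑦)/B(𝛼, 𝛽). A Python check (for illustration; the function names are ours):

```python
import math

def log_beta(a, b):
    # log of the beta function via log-gamma
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def beta_binomial_pmf(y, n, a, b):
    # marginal probability of y successes in n trials when theta ~ Beta(a, b)
    return math.comb(n, y) * math.exp(log_beta(a + y, b + n - y) - log_beta(a, b))

p1 = beta_binomial_pmf(6, 10, 7.5, 2.5)  # model 1: about 0.124
p2 = beta_binomial_pmf(6, 10, 2.5, 7.5)  # model 2: about 0.042
print(round(p1 / p2, 2))                 # Bayes factor in favor of model 1, about 2.95
```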

4. In this case, the prior odds are 1, so the posterior odds in favor of model 1
are 2.95. The posterior probability of model 1 is 0.747, and the posterior
probability of model 2 is 0.253.

5. Now the prior odds in favor of model 1 are 1/9. So the posterior odds
in favor of model 1 given 6 heads in 10 flips are (1/9)(2.95)=0.328. The
posterior probability of model 1 is 0.247, and the posterior probability of
model 2 is 0.753.
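Parts 4 and 5 follow from posterior odds = prior odds × Bayes factor (a Python sketch using the rounded Bayes factor from part 3):

```python
bf = 2.95  # Bayes factor in favor of model 1 (from part 3)

# Part 4: equal prior probabilities for the models, so prior odds = 1
odds_equal = 1 * bf
prob_equal = odds_equal / (1 + odds_equal)

# Part 5: prior probability 0.1 for model 1, so prior odds = 1/9
odds_skewed = (0.1 / 0.9) * bf
prob_skewed = odds_skewed / (1 + odds_skewed)

print(round(prob_equal, 3), round(prob_skewed, 3))  # 0.747 and 0.247
```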

Now suppose I want to predict the number of heads in the next 10 flips of
the selected coin.

6. If model 1 is correct the prior is Beta(7.5, 2.5) so the posterior after ob-
serving 6 heads in 10 flips is Beta(13.5, 6.5). Simulate a value of 𝜃 from
a Beta(13.5, 6.5) distribution and given 𝜃 simulate a value of 𝑦 from a
Binomial(10, 𝜃) distribution. Repeat many times. Approximate the pos-
terior predictive probability of 7 heads in the 10 flips flips, given model 1
is correct and 6 heads in the first 10 flips, with the proportion of simulated
repetitions that yield a 𝑦 value of 7; the probability is 0.216.

Nrep = 1000000
theta = rbeta(Nrep, 7.5 + 6, 2.5 + 4)
y = rbinom(Nrep, 10, theta)
plot(table(y) / Nrep,
ylab = "Posterior predictive probability",
main = "Given Model 1")

[Figure: posterior predictive distribution of the number of heads in the next 10 flips, given model 1.]

7. The simulation is similar, just using the prior from model 2. The posterior
predictive probability of 7 heads in the next 10 flips, given model 2 is
correct and 6 heads in the first 10 flips, is 0.076.

Nrep = 1000000
theta = rbeta(Nrep, 2.5 + 6, 7.5 + 4)
y = rbinom(Nrep, 10, theta)
plot(table(y) / Nrep,
ylab = "Posterior predictive probability",
main = "Given model 2")

[Figure: posterior predictive distribution of the number of heads in the next 10 flips, given model 2.]

8. We saw in a previous part that with a 0.5/0.5 prior on model and 6 heads
in 10 flips, the posterior probability of model 1 is 0.747 and of model 2 is
0.253. We now add another stage to our simulation

• Simulate a model: model 1 with probability 0.747 and model 2 with


probability 0.253
• Given the model simulate a value of 𝜃 from its posterior distribution:
Beta(13.5, 6.5) if model 1, Beta(8.5, 11.5) if model 2.
• Given 𝜃 simulate a value of 𝑦 from a Binomial(10, 𝜃) distribution

The simulation results are below. We can also find the posterior predictive
probability of 7 heads in the next 10 flips using the law of total probabil-
ity to combine the results from the two previous parts: (0.747)(0.216) +
(0.253)(0.076) = 0.18

Nrep = 1000000
alpha = c(7.5, 2.5) + 6
beta = c(2.5, 7.5) + 4

model = sample(1:2, size = Nrep, replace = TRUE, prob = c(0.747, 0.253))

theta = rbeta(Nrep, alpha[model], beta[model])

y = rbinom(Nrep, 10, theta)

plot(table(y) / Nrep,
     ylab = "Posterior predictive probability",
     main = "Model Average")

[Figure: model-averaged posterior predictive distribution of the number of heads in the next 10 flips.]

When several models are under consideration, the Bayesian model is the full
hierarchical structure which spans all models being compared. Thus, the most
complete posterior prediction takes into account all models, weighted by their
posterior probabilities. That is, prediction is accomplished by taking a weighted
average across the models, with weights equal to the posterior probabilities of
the models. This is called model averaging.
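The model-averaged predictive probability is a law-of-total-probability computation, weighting each model's predictive probability by its posterior model probability (a Python sketch with the values from the example):

```python
# posterior model probabilities and each model's predictive P(7 heads in next 10)
post_model = [0.747, 0.253]
pred_given_model = [0.216, 0.076]

# model-averaged posterior predictive probability
p7 = sum(w * p for w, p in zip(post_model, pred_given_model))
print(round(p7, 2))  # 0.18
```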

Example 12.2. Suppose again I select a coin, but now the decision is whether
the coin is fair. Suppose we consider the two models

• “Must be fair” model: prior distribution for 𝜃 is Beta(500, 500)


• “Anything is possible” model: prior distribution for 𝜃 is Beta(1, 1)

1. Suppose we observe 15 heads in 20 flips. Use simulation to approximate


the Bayes factor in favor of the “must be fair” model given 15 heads in 20
flips. Which model does the Bayes factor favor?
2. Suppose we observe 11 heads in 20 flips. Use simulation to approximate
the Bayes factor in favor of the “must be fair” model given 11 heads in 20
flips. Which model does the Bayes factor favor?
3. The “anything is possible” model has any value available to it, including
0.5 and the sample proportion 0.55. Why then is the “must be fair” option
favored in the previous part?

Solution. to Example 12.2

1. A sample proportion of 15/20 = 0.75 does not seem consistent with the
“must be fair” model, so we expect the Bayes Factor to favor the “anything
is possible” model.
The Bayes Factor is the ratio of the likelihoods (of 15/20). To approximate
the likelihood of 15 heads in 20 flips for the “must be fair” model

• Simulate a value 𝜃 from a Beta(500, 500) distribution


• Given 𝜃, simulate a value 𝑦 from a Binomial(20, 𝜃) distribution
• Repeat many times; the proportion of simulated repetitions that
yield a 𝑦 value of 15 approximates the likelihood.

Approximate the likelihood of 15 heads in 20 flips for the “anything is


possible” model similarly. The Bayes factor is the ratio of the likelihoods,
about 0.323 in favor of the “must be fair” model. That is, the Bayes factor
favors the “anything is possible” model.

Nrep = 1000000

theta1 = rbeta(Nrep, 500, 500)


y1 = rbinom(Nrep, 20, theta1)

theta2 = rbeta(Nrep, 1, 1)
y2 = rbinom(Nrep, 20, theta2)

sum(y1 == 15) / sum(y2 == 15)

## [1] 0.3197

2. Similar to the previous part but with different data; now we compute the
likelihood of 11 heads in 20 flips. The Bayes factor is about 3.34. Thus,
the Bayes factor favors the “must be fair” model.

sum(y1 == 11) / sum(y2 == 11)

## [1] 3.328
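As in Example 12.1, these Bayes factors can also be computed exactly from beta-binomial marginal likelihoods rather than by simulation (a Python sketch; under the Beta(1, 1) prior the marginal probability of every count 𝑦 in 𝑛 flips is 1/(𝑛 + 1)):

```python
import math

def log_beta(a, b):
    # log of the beta function via log-gamma
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def marginal(y, n, a, b):
    # P(y successes in n trials) when theta ~ Beta(a, b)
    return math.comb(n, y) * math.exp(log_beta(a + y, b + n - y) - log_beta(a, b))

p_any = marginal(15, 20, 1, 1)  # uniform prior: 1/21, the same for every y
bf_15 = marginal(15, 20, 500, 500) / p_any
bf_11 = marginal(11, 20, 500, 500) / marginal(11, 20, 1, 1)
print(round(bf_15, 2), round(bf_11, 2))  # close to the simulated values above
```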

3. A central 99% prior credible interval for 𝜃 based on the “must be fair”
model is (0.459, 0.541), which does not include the sample proportion
of 0.55. So you might think that the data would favor the “anything is
possible” model. However, the numerator and denominator in the Bayes
factor are average likelihoods: the likelihood of the data averaged over each
possible value of 𝜃. The “must be fair” model only gives initial plausibility

to 𝜃 values that are close to 0.5, and for such 𝜃 values the likelihood of
11 heads in 20 flips is not so small. Values of 𝜃 that are far from 0.5 are
effectively not included in the average, due to their low prior probability,
so the average likelihood is not so small.
In contrast, the “anything is possible” model stretches the prior probability
over all values in (0, 1). For many 𝜃 values in (0, 1) the likelihood of
observing 11 heads in 20 flips is close to 0, and with the Uniform(0, 1)
prior, each of these 𝜃 values contributes equally to the average likelihood.
Thus, the average likelihood is smaller for the “anything is possible” model
than for the “must be fair” model.

Complex models generally have an inherent advantage over simpler models be-
cause complex models have many more options available, and one of those op-
tions is likely to fit the data better than any of the fewer options in the simpler
model. However, we don’t always want to just choose the more complex model.
Always choosing the more complex model overfits the data.
Bayesian model comparison naturally compensates for discrepancies in model
complexity. In more complex models, prior probabilities are diluted over the
many options available. Even if a complex model has some particular combina-
tion of parameters that fit the data well, the prior probability of that particular
combination is likely to be small because the prior is spread more thinly than
for a simpler model. Thus, in Bayesian model comparison, a simpler model can
“win” if the data are consistent with it, even if the complex model fits well.

Example 12.3. Continuing Example 12.2 where we considered the two models

• “Must be fair” model: prior distribution for 𝜃 is Beta(500, 500)


• “Anything is possible” model: prior distribution for 𝜃 is Beta(1, 1)

Suppose we observe 65 heads in 100 flips.

1. Use simulation to approximate the Bayes factor in favor of the “must be


fair” model given 65 heads in 100 flips. Which model does the Bayes factor
favor?

2. We have discussed different notions of a “non-informative/vague” prior.


We often think of Beta(1, 1) = Uniform(0, 1) as a non-informative prior,
but there are other considerations. In particular, a Beta(0.01, 0.01) is
often used a non-informative prior in this context. Think of a Beta(0.01,
0.01) prior like an approximation to the improper Beta(0, 0) prior based
on “no prior successes or failures”.
Suppose now that the “anything is possible” model corresponds to a
Beta(0.01, 0.01) prior distribution for 𝜃. Use simulation to approximate
the Bayes factor in favor of the “must be fair” model given 65 heads in

100 flips. Which model does the Bayes factor favor? Is the choice of
model sensitive to the change of prior distribution within the “anything is
possible” model?
3. For each of the two “anything is possible” priors, find the posterior dis-
tribution of 𝜃 and a 98% posterior credible interval for 𝜃 given 65 heads
in 100 flips. Is estimation of 𝜃 within the “anything is possible” model
sensitive to the change in the prior distribution for 𝜃?

Solution. to Example 12.3

1. The simulation is similar to the ones in the previous example, just with
different data. The Bayes Factor is about 0.126 in favor of the “must be
fair” model. So the Bayes Factor favors the “anything is possible” model.

Nrep = 1000000

theta1 = rbeta(Nrep, 500, 500)


y1 = rbinom(Nrep, 100, theta1)

theta2 = rbeta(Nrep, 1, 1)
y2 = rbinom(Nrep, 100, theta2)

sum(y1 == 65) / sum(y2 == 65)

## [1] 0.1325

2. The simulation is similar to the one in the previous part, just with a
different prior. The Bayes Factor is about 5.73 in favor of the “must be
fair” model. So the Bayes Factor favors the “must be fair” model. Even
though they are both non-informative priors, the Beta(1, 1) and Beta(0.01,
0.01) priors lead to very different Bayes factors and decisions. The choice
of model does appear to be sensitive to the choice of prior distribution.

Nrep = 1000000

theta1 = rbeta(Nrep, 500, 500)


y1 = rbinom(Nrep, 100, theta1)

theta2 = rbeta(Nrep, 0.01, 0.01)


y2 = rbinom(Nrep, 100, theta2)

sum(y1 == 65) / sum(y2 == 65)

## [1] 5.272

3. For a Beta(1, 1) prior, the posterior of 𝜃 given 65 heads in 100 flips is the
Beta(66, 36) distribution, and a central 98% posterior credible interval
for 𝜃 is (0.534, 0.752). For a Beta(0.01, 0.01) prior, the posterior of 𝜃
given 65 heads in 100 flips is the Beta(65.01, 35.01) distribution, and a
central 98% posterior credible interval for 𝜃 is (0.536, 0.755). The Beta(66,
36) and Beta(65.01, 35.01) distributions are virtually identical, and the
98% credible intervals are practically the same. At least in this case, the
estimation of 𝜃 within the “anything is possible” model does not appear
to be sensitive to the choice of prior.

qbeta(c(0.01, 0.99), 1 + 65, 1 + 35)

## [1] 0.5340 0.7517

qbeta(c(0.01, 0.99), 0.01 + 65, 0.01 + 35)

## [1] 0.5359 0.7553

In Bayesian estimation of continuous parameters within a model, the posterior


distribution is typically not too sensitive to changes in prior (provided that there
is a reasonable amount of data and the prior is not too strict).
In contrast, in Bayesian model comparison, the posterior probabilities of the
models and the Bayes factors can be extremely sensitive to the choice of prior
distribution within each model.
When comparing different models, prior distributions on parameters within each
model should be equally informed. One strategy is to use a small set of “training
data” to inform the prior of each model before comparing.
Example 12.4. Continuing Example 12.3 where we considered two priors in
the “anything is possible” model: Beta(1, 1) and Beta(0.01, 0.01). We will again
compare the “anything is possible model” to the “must be fair” model which
corresponds to a Beta(500, 500) prior.
Suppose we observe 65 heads in 100 flips.

1. Assume the “anything is possible” model corresponds to the Beta(1, 1)


prior. Suppose that in the first 10 flips there were 6 heads. Compute
the posterior distribution of 𝜃 in each of the models after the first 10 flips.
Then use simulation to approximate the Bayes factor in favor of the “must
be fair” model given 65 heads in 100 flips, using the posterior distribution
of 𝜃 after the first 10 flips as the prior distribution in the simulation.
Which model does the Bayes factor favor?
2. Repeat the previous part assuming the “anything is possible” model cor-
responds to the Beta(0.01, 0.01) prior. Compare with the previous part.

Solution. to Example 12.4

1. With the Beta(1, 1) prior in the “anything is possible” model, the posterior
distribution of 𝜃 after 6 heads in the first 10 flips is the Beta(7, 5) distri-
bution. With the Beta(500, 500) prior in the “must be fair” model, the
posterior distribution of 𝜃 after 6 heads in the first 10 flips is the Beta(506,
504) distribution. The simulation to approximate the likelihood in each
model is similar to before, but now we simulate 𝜃 from its posterior distri-
bution after the first 10 flips, and evaluate the likelihood of observing 59
heads in the remaining 90 flips. The Bayes factor is about 0.056 in favor
of the “must be fair” model. So the Bayes Factor favors the “anything is
possible” model.

Nrep = 1000000

theta1 = rbeta(Nrep, 500 + 6, 500 + 4)


y1 = rbinom(Nrep, 90, theta1)

theta2 = rbeta(Nrep, 1 + 6, 1 + 4)
y2 = rbinom(Nrep, 90, theta2)

sum(y1 == 59) / sum(y2 == 59)

## [1] 0.05509

2. With the Beta(0.01, 0.01) prior in the “anything is possible” model, the
posterior distribution of 𝜃 after 6 heads in the first 10 flips is the Beta(6.01,
4.01) distribution. The simulation is similar to the previous part, just with
the different distribution for 𝜃 in the “anything is possible” model. The
Bayes factor is about 0.057 in favor of the “must be fair” model, about the
same as in the previous part. So the Bayes Factor favors the “anything is
possible” model. Notice that after “training” the models on the first 10
observations, the model comparison is no longer so sensitive to the choice
of prior within the “anything is possible” model.

Nrep = 1000000

theta1 = rbeta(Nrep, 500 + 6, 500 + 4)


y1 = rbinom(Nrep, 90, theta1)

theta2 = rbeta(Nrep, 0.01 + 6, 0.01 + 4)


y2 = rbinom(Nrep, 90, theta2)

sum(y1 == 59) / sum(y2 == 59)



## [1] 0.05941

Example 12.5. Consider a null hypothesis significance test of 𝐻0 ∶ 𝜃 = 0.5


versus 𝐻1 ∶ 𝜃 ≠ 0.5. How does this situation resemble the previous problem?
Solution. to Example 12.5

We could treat this as a problem of Bayesian model comparison. The null hypothesis corresponds to a prior distribution which places all prior probability on the null hypothesized value of 0.5. The alternative hypothesis corresponds to a prior distribution over the full range of possible values of 𝜃. Given data, we could compute the posterior probability of each model and use that to make a decision regarding the hypotheses. However, there are infinitely many choices for the prior that corresponds to the alternative hypothesis, and we have already seen that Bayesian model comparison can be very sensitive to the choice of prior within a model.
A null hypothesis significance test can be viewed as a problem of Bayesian
model selection in which one model has a prior distribution that places all its
credibility on the null hypothesized value. However, is it really plausible that
the parameter is exactly equal to the hypothesized value?
Unfortunately, this model-comparison (Bayes factor) approach to testing can
be extremely sensitive to the choice of prior corresponding to the alternative
hypothesis.
An alternative Bayesian approach to testing involves choosing a region of practical equivalence (ROPE). A ROPE indicates a small range of parameter values that are considered to be practically equivalent to the null hypothesized value.

• A hypothesized value is rejected — that is, declared to be not credible — if its ROPE lies outside a posterior credible interval (e.g., 99%) for the parameter.
• A hypothesized value is accepted for practical purposes if its ROPE contains the posterior credible interval (e.g., 99%) for the parameter.
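As a concrete sketch of this decision rule in R, using hypothetical numbers chosen purely for illustration (a Beta(506, 504) posterior for 𝜃, a null value of 0.5, and a ROPE of 0.5 ± 0.01):

```r
# Hypothetical illustration of the ROPE decision rule
rope = c(0.49, 0.51)                   # ROPE: null value 0.5 plus/minus 0.01
ci = qbeta(c(0.005, 0.995), 506, 504)  # central 99% posterior credible interval

if (rope[2] < ci[1] || rope[1] > ci[2]) {
  decision = "reject: ROPE lies entirely outside the credible interval"
} else if (rope[1] <= ci[1] && ci[2] <= rope[2]) {
  decision = "accept for practical purposes: ROPE contains the credible interval"
} else {
  decision = "withhold judgment"
}
decision
```

With these particular numbers the credible interval is wider than the ROPE and overlaps it, so neither decision is reached; more data would be needed.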

How do you choose the ROPE? That depends on the practical application.
In general, traditional testing of point null hypotheses (that is, “no effect/difference”) is not a primary concern in Bayesian statistics. Rather, the posterior distribution provides all relevant information to make decisions about practically meaningful issues. Ask research questions that are important in the context of the problem and use the posterior distribution to answer them.
Chapter 13

Bayesian Analysis of Poisson Count Data

In this chapter we’ll consider Bayesian analysis for count data.

We have covered in some detail the problem of estimating a population proportion for a binary categorical variable. In these situations we assumed a Binomial likelihood for the count of “successes” in the sample. However, a Binomial model has several restrictive assumptions that might not be satisfied in practice. Poisson models are more flexible models for count data.
Example 13.1. Let 𝑌 be the number of home runs hit (in total by both teams)
in a randomly selected Major League Baseball game.

1. In what ways is this like the Binomial situation? (What is a trial? What
is “success”?)
2. In what ways is this NOT like the Binomial situation?
Solution. to Example 13.1

1. Each pitch is a trial, and on each trial either a home run is hit (“success”) or not. The random variable 𝑌 counts the number of home runs (successes) over all the trials.
2. Even though 𝑌 is counting successes, this is not the Binomial situation.
• The number of trials is not fixed. The total number of pitches varies
from game to game. (The average is around 300 pitches per game).
• The probability of success is not the same on each trial. Different
batters have different probabilities of hitting home runs. Also, dif-
ferent pitch counts or game situations lead to different probabilities
of home runs.


• The trials might not be independent, though this is a little more questionable. Make sure you distinguish independence from the previous assumption of unequal probabilities of success; you need to consider conditional probabilities to assess independence. Maybe if a pitcher gives up a home run on one pitch, then the pitcher is “rattled” so the probability that he also gives up a home run on the next pitch increases, or the pitcher gets pulled for a new pitcher, which changes the probability of a home run on the next pitch.

Example 13.2. Let 𝑌 be the number of automobiles that get in accidents on Highway 101 in San Luis Obispo on a randomly selected day.

1. In what ways is this like the Binomial situation? (What is a trial? What
is “success”?)
2. In what ways is this NOT like the Binomial situation?

Solution. to Example 13.2


1. Each automobile on the road during the day is a trial, and each automobile either gets in an accident (“success”) or not. The random variable 𝑌 counts the number of automobiles that get into accidents (successes). (Remember “success” is just a generic label for the event you’re interested in; “success” is not necessarily good.)
2. Even though 𝑌 is counting successes, this is not the Binomial situation.
• The number of trials is not fixed. The total number of automobiles
on the road varies from day to day.
• The probability of success is not the same on each trial. Different
drivers have different probabilities of getting into accidents; some
drivers are safer than others. Also, different conditions increase the
probability of an accident, like driving at night.
• The trials are plausibly not independent. Make sure you distinguish independence from the previous assumption of unequal probabilities of success; you need to consider conditional probabilities to assess independence. If an automobile gets into an accident, then the probability of getting into an accident increases for the automobiles that are driving near it.

Poisson models are models for counts that have more flexibility than Binomial models. Poisson models are parameterized by a single parameter (the mean) and do not require all the assumptions of a Binomial model. Poisson distributions are often used to model the distribution of variables that count the number of “relatively rare” events that occur over a certain interval of time or in a certain location (e.g., number of accidents on a highway in a day, number of car insurance policies that have claims in a week, number of bank loans that go into default, number of mutations in a DNA sequence, number of earthquakes that occur in SoCal in an hour, etc.)

A discrete random variable 𝑌 has a Poisson distribution with parameter 𝜃 > 0 if its probability mass function satisfies

𝑓(𝑦|𝜃) = 𝑒^{−𝜃}𝜃^{𝑦}/𝑦!,  𝑦 = 0, 1, 2, …

If 𝑌 has a Poisson(𝜃) distribution then

𝐸(𝑌) = 𝜃
𝑉𝑎𝑟(𝑌) = 𝜃

For a Poisson distribution, both the mean and variance are equal to 𝜃, but remember that the mean is measured in the count units (e.g., home runs) but the variance is measured in squared units (e.g., (home runs)²).
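These facts are easy to check numerically. A quick illustration with 𝜃 = 2 (the grid 0:100 truncates the infinite support; the omitted tail probability is negligible):

```r
theta = 2
y = 0:100

sum(dpois(y, theta))                  # pmf values sum to (essentially) 1
sum(y * dpois(y, theta))              # mean: approximately 2
sum((y - theta)^2 * dpois(y, theta))  # variance: approximately 2
```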

Poisson distributions have many nice properties, including the following.

Poisson aggregation. If 𝑌₁ and 𝑌₂ are independent, 𝑌₁ has a Poisson(𝜃₁) distribution, and 𝑌₂ has a Poisson(𝜃₂) distribution, then 𝑌₁ + 𝑌₂ has a Poisson(𝜃₁ + 𝜃₂) distribution.¹ That is, if independent component counts each follow a Poisson distribution then the total count also follows a Poisson distribution. Poisson aggregation extends naturally to more than two components. For example, if the number of babies born each day at a certain hospital follows a Poisson distribution — perhaps with different daily rates (e.g., higher for Friday than Saturday) — independently from day to day, then the number of babies born each week at the hospital also follows a Poisson distribution.

¹ If 𝑌₁ has mean 𝜃₁ and 𝑌₂ has mean 𝜃₂ then linearity of expected value implies that 𝑌₁ + 𝑌₂ has mean 𝜃₁ + 𝜃₂. If 𝑌₁ has variance 𝜃₁ and 𝑌₂ has variance 𝜃₂ then independence of 𝑌₁ and 𝑌₂ implies that 𝑌₁ + 𝑌₂ has variance 𝜃₁ + 𝜃₂. What Poisson aggregation says is that if component counts are independent and each with a Poisson shape, then the total count also has a Poisson shape.
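Poisson aggregation is also easy to check by simulation; a quick sketch, with arbitrary illustrative rates of 1.5 and 2.5:

```r
Nrep = 100000
y1 = rpois(Nrep, 1.5)
y2 = rpois(Nrep, 2.5)
total = y1 + y2

# the simulated relative frequency should be close to the Poisson(1.5 + 2.5) pmf
mean(total == 4)
dpois(4, 1.5 + 2.5)
```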

[Figure: probability mass functions of the Poisson(0.5), Poisson(1), Poisson(1.5), and Poisson(2) distributions for 𝑦 = 0, 1, …, 8.]

Example 13.3. Suppose the number of home runs hit per game (by both
teams in total) at a particular Major League Baseball park follows a Poisson
distribution with parameter 𝜃.

1. Sketch your prior distribution for 𝜃 and describe its features. What are
the possible values of 𝜃? Does 𝜃 take values on a discrete or continuous
scale?

2. Suppose 𝑌 represents a home run count for a single game. What are the
possible values of 𝑌 ? Does 𝑌 take values on a discrete or continuous scale?

3. We’ll start with a discrete prior for 𝜃 to illustrate ideas.

𝜃            0.5    1.5    2.5    3.5    4.5
Probability  0.13   0.45   0.28   0.11   0.03

Suppose a single game with 1 home run is observed. Find the posterior distribution of 𝜃. In particular, how do you determine the likelihood column?

4. Now suppose a second game, with 3 home runs, is observed, independently of the first. Find the posterior distribution of 𝜃 after observing these two games, using the posterior distribution from the previous part as the prior distribution in this part.

5. Now consider the original prior again. Find the posterior distribution of
𝜃 after observing 1 home run in the first game and 3 home runs in the
second, without the intermediate updating of the posterior after the first
game. How does the likelihood column relate to the likelihood columns
from the previous parts? How does the posterior distribution compare
with the posterior distribution from the previous part?

6. Now consider the original prior again. Suppose that instead of observing the two individual values, we only observe that there is a total of 4 home runs in 2 games. Find the posterior distribution of 𝜃. In particular, how do you determine the likelihood column? How does the likelihood column compare to the one from the previous part? How does the posterior compare to the previous part?

7. Suppose we’ll observe a third game tomorrow. How could you find — both analytically and via simulation — the posterior predictive probability that this game has 0 home runs?

8. Now let’s consider a continuous prior distribution for 𝜃 which satisfies

𝜋(𝜃) ∝ 𝜃^{4−1}𝑒^{−2𝜃},  𝜃 > 0

Use grid approximation to compute the posterior distribution of 𝜃 given 1 home run in a single game. Plot the prior, (scaled) likelihood, and posterior. (Note: you will need to cut the grid off at some point. While 𝜃 can take any value greater than 0, the interval [0, 8] accounts for 99.99% of the prior probability.)

9. Now let’s consider some real data. Assume home runs per game at Citizens Bank Park (Phillies!) follow a Poisson distribution with parameter 𝜃. Assume that the prior distribution for 𝜃 satisfies

𝜋(𝜃) ∝ 𝜃^{4−1}𝑒^{−2𝜃},  𝜃 > 0

The following summarizes data for the 2020 season². There were 97 home runs in 32 games. Use grid approximation to compute the posterior distribution of 𝜃 given the data. Be sure to specify the likelihood. Plot the prior, (scaled) likelihood, and posterior.

² Source: https://www.baseball-reference.com/teams/PHI/2020.shtml

Home runs Number of games


0 0
1 8
2 8
3 5
4 4
5 3
6 2
7 1
8 1
9 0
[Figure: bar chart of the proportion of games versus number of home runs (0 through 9) for the 2020 data.]

Solution. to Example 13.3

1. Your prior is whatever it is. We’ll discuss how we chose a prior in a later
part. Even though each data value is an integer, the mean number of home
runs per game 𝜃 can be any value greater than 0. That is, the parameter
𝜃 takes values on a continuous scale.
2. 𝑌 can be 0, 1, 2, and so on, taking values on a discrete scale. Technically,
there is no fixed upper bound on what 𝑌 can be.
3. The likelihood is the Poisson probability of 1 home run in a game computed for each value of 𝜃.

𝑓(𝑦 = 1|𝜃) = 𝑒^{−𝜃}𝜃^{1}/1!

For example, the likelihood of 1 home run in a game given 𝜃 = 0.5 is 𝑓(𝑦 = 1|𝜃 = 0.5) = 𝑒^{−0.5}(0.5)^{1}/1! = 0.3033. If on average there are 0.5 home runs per game, then about 30% of games would have exactly 1 home run. As always, the posterior is proportional to the product of prior and likelihood. We see that the posterior distribution puts even greater probability on 𝜃 = 1.5 than the prior.
theta prior likelihood product posterior
0.5 0.13 0.3033 0.0394 0.1513
1.5 0.45 0.3347 0.1506 0.5779
2.5 0.28 0.2052 0.0575 0.2205
3.5 0.11 0.1057 0.0116 0.0446
4.5 0.03 0.0500 0.0015 0.0058
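The table above can be computed in R along the following lines, with dpois supplying the likelihood column:

```r
theta = c(0.5, 1.5, 2.5, 3.5, 4.5)
prior = c(0.13, 0.45, 0.28, 0.11, 0.03)

likelihood = dpois(1, theta)        # probability of 1 home run for each theta
product = likelihood * prior
posterior = product / sum(product)

round(cbind(theta, prior, likelihood, product, posterior), 4)
```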
4. The likelihood is the Poisson probability of 3 home runs in a game computed for each value of 𝜃.

𝑓(𝑦 = 3|𝜃) = 𝑒^{−𝜃}𝜃^{3}/3!

The posterior places about 90% of the probability on 𝜃 being either 1.5 or 2.5.
theta prior likelihood product posterior
0.5 0.1513 0.0126 0.0019 0.0145
1.5 0.5779 0.1255 0.0725 0.5488
2.5 0.2205 0.2138 0.0471 0.3566
3.5 0.0446 0.2158 0.0096 0.0728
4.5 0.0058 0.1687 0.0010 0.0073

5. Since the games are independent,³ the likelihood is the product of the likelihoods from the two previous parts.

𝑓(𝑦 = (1, 3)|𝜃) = (𝑒^{−𝜃}𝜃^{1}/1!)(𝑒^{−𝜃}𝜃^{3}/3!)

Unsurprisingly, the posterior distribution is the same as in the previous part.
theta prior likelihood product posterior
0.5 0.13 0.0038 0.0005 0.0145
1.5 0.45 0.0420 0.0189 0.5488
2.5 0.28 0.0439 0.0123 0.3566
3.5 0.11 0.0228 0.0025 0.0728
4.5 0.03 0.0084 0.0003 0.0073
6. By Poisson aggregation, the total number of home runs in 2 games follows a Poisson(2𝜃) distribution. The likelihood is the probability of a value of 4 (home runs in 2 games) computed using a Poisson(2𝜃) distribution for each value of 𝜃.

𝑓(𝑦̄ = 2|𝜃) = 𝑒^{−2𝜃}(2𝜃)^{4}/4!

For example, the likelihood of 4 home runs in 2 games given 𝜃 = 0.5 is 𝑓(𝑦̄ = 2|𝜃 = 0.5) = 𝑒^{−2×0.5}(2 × 0.5)^{4}/4! = 0.0153. If on average there are 0.5 home runs per game, then about 1.5% of samples of 2 games would have exactly 4 home runs.

³ I keep meaning to say this, but technically the 𝑌 values are not independent. Rather, they are conditionally independent given 𝜃. This is a somewhat subtle distinction, so I’ve glossed over the details.
The likelihood is not the same as in the previous part because there are
more samples of two games that yield a total of 4 home runs than those
that yield 1 home run in the first game and 3 in the second. However, the
likelihoods are proportionally the same. For example, the likelihood for
𝜃 = 2.5 is about 1.92 times greater than the likelihood for 𝜃 = 3.5 in both
this part and the previous part. Therefore, the posterior distribution is
the same as in the previous part.
theta prior likelihood product posterior
0.5 0.13 0.0153 0.0020 0.0145
1.5 0.45 0.1680 0.0756 0.5488
2.5 0.28 0.1755 0.0491 0.3566
3.5 0.11 0.0912 0.0100 0.0728
4.5 0.03 0.0337 0.0010 0.0073
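The claim that the two likelihoods are proportionally the same is easy to verify numerically; the ratio is the same constant for every value of 𝜃:

```r
theta = c(0.5, 1.5, 2.5, 3.5, 4.5)

lik_individual = dpois(1, theta) * dpois(3, theta)  # 1 HR in game 1, 3 HRs in game 2
lik_total = dpois(4, 2 * theta)                     # total of 4 HRs in 2 games

lik_total / lik_individual  # constant (4) for every theta
```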

7. Simulate a value of 𝜃 from its posterior distribution and then, given 𝜃, simulate a value of 𝑌 from a Poisson(𝜃) distribution, and repeat many times. Approximate the probability of 0 home runs by finding the proportion of repetitions that yield a 𝑌 value of 0. (We’ll see some code a little later.)
We can compute the probability using the law of total probability. Find the probability of 0 home runs for each value of 𝜃, that is, 𝑒^{−𝜃}𝜃^{0}/0! = 𝑒^{−𝜃}, and then weight these values by their posterior probabilities to find the predictive probability of 0 home runs, which is 0.163.

𝑒^{−0.5}(0.0145) + 𝑒^{−1.5}(0.5488) + 𝑒^{−2.5}(0.3566) + 𝑒^{−3.5}(0.0728) + 𝑒^{−4.5}(0.0073)
= (0.6065)(0.0145) + (0.2231)(0.5488) + (0.0821)(0.3566) + (0.0302)(0.0728) + (0.0111)(0.0073) = 0.163

According to this model, we predict that about 16% of games would have 0 home runs.
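The two-step simulation just described can be sketched as follows, using the posterior probabilities from the table above:

```r
theta = c(0.5, 1.5, 2.5, 3.5, 4.5)
posterior = c(0.0145, 0.5488, 0.3566, 0.0728, 0.0073)

Nrep = 100000
theta_sim = sample(theta, Nrep, replace = TRUE, prob = posterior)  # simulate theta
y_sim = rpois(Nrep, theta_sim)                                     # then y given theta

mean(y_sim == 0)  # approximately 0.163
```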

8. Now let’s consider a continuous prior distribution for 𝜃 which satisfies

𝜋(𝜃) ∝ 𝜃^{4−1}𝑒^{−2𝜃},  𝜃 > 0

Use grid approximation to compute the posterior distribution of 𝜃 given 1 home run in a single game. Plot the prior, (scaled) likelihood, and posterior. (Note: you will need to cut the grid off at some point. While 𝜃 can take any value greater than 0, the interval [0, 8] accounts for 99.99% of the prior probability.)

# prior
theta = seq(0, 8, 0.001)

prior = theta ^ (4 - 1) * exp(-2 * theta)


prior = prior / sum(prior)

# data
n = 1 # sample size
y = 1 # sample mean

# likelihood
likelihood = dpois(y, theta)

# posterior
product = likelihood * prior
posterior = product / sum(product)

ylim = c(0, max(c(prior, posterior, likelihood / sum(likelihood))))
xlim = range(theta)
plot(theta, prior, type='l', xlim=xlim, ylim=ylim, col="orange", xlab='theta', ylab='', yaxt='n')
par(new=T)
plot(theta, likelihood/sum(likelihood), type='l', xlim=xlim, ylim=ylim, col="skyblue", xlab='', ylab='', yaxt='n')
par(new=T)
plot(theta, posterior, type='l', xlim=xlim, ylim=ylim, col="seagreen", xlab='', ylab='', yaxt='n')
legend("topright", c("prior", "scaled likelihood", "posterior"), lty=1, col=c("orange", "skyblue", "seagreen"))

[Plot: prior, scaled likelihood, and posterior densities of 𝜃 over [0, 8].]

9. By Poisson aggregation, the total number of home runs in 32 games follows a Poisson(32𝜃) distribution. The likelihood is the probability of observing a value of 97 (for the total number of home runs in 32 games) from a Poisson(32𝜃) distribution.

𝑓(𝑦̄ = 97/32|𝜃) = 𝑒^{−32𝜃}(32𝜃)^{97}/97!,  𝜃 > 0
∝ 𝑒^{−32𝜃}𝜃^{97},  𝜃 > 0

The likelihood is centered at the sample mean of 97/32 = 3.03. The posterior distribution follows the likelihood fairly closely, but the prior still has a little influence.

# prior
theta = seq(0, 8, 0.001)

prior = theta ^ (4 - 1) * exp(-2 * theta)


prior = prior / sum(prior)

# data
n = 32 # sample size
y = 97 / 32 # sample mean

# likelihood - for total count


likelihood = dpois(n * y, n * theta)

# posterior
product = likelihood * prior
posterior = product / sum(product)

ylim = c(0, max(c(prior, posterior, likelihood / sum(likelihood))))
xlim = range(theta)
plot(theta, prior, type='l', xlim=xlim, ylim=ylim, col="orange", xlab='theta', ylab='', yaxt='n')
par(new=T)
plot(theta, likelihood/sum(likelihood), type='l', xlim=xlim, ylim=ylim, col="skyblue", xlab='', ylab='', yaxt='n')
par(new=T)
plot(theta, posterior, type='l', xlim=xlim, ylim=ylim, col="seagreen", xlab='', ylab='', yaxt='n')
legend("topright", c("prior", "scaled likelihood", "posterior"), lty=1, col=c("orange", "skyblue", "seagreen"))

[Plot: prior, scaled likelihood, and posterior densities of 𝜃 over [0, 8].]

Gamma distributions are commonly used as prior distributions for parameters that take positive values, 𝜃 > 0.

A continuous RV 𝑈 has a Gamma distribution with shape parameter 𝛼 > 0 and rate parameter⁴ 𝜆 > 0 if its density satisfies⁵

𝑓(𝑢) ∝ 𝑢^{𝛼−1}𝑒^{−𝜆𝑢},  𝑢 > 0

In R: dgamma(u, shape, rate) for density, rgamma to simulate, qgamma for quantiles, etc.
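For example, a quick sketch with shape 𝛼 = 4 and rate 𝜆 = 2 (the prior used below):

```r
dgamma(2, shape = 4, rate = 2)  # density at u = 2

u = rgamma(100000, shape = 4, rate = 2)
mean(u)                         # approximately alpha / lambda = 2
var(u)                          # approximately alpha / lambda^2 = 1

qgamma(c(0.25, 0.75), shape = 4, rate = 2)  # middle 50% of the distribution
```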
It can be shown that a Gamma(𝛼, 𝜆) density has

Mean (EV) = 𝛼/𝜆
Variance = 𝛼/𝜆²
Mode = (𝛼 − 1)/𝜆, if 𝛼 ≥ 1

[Figure: left, Gamma densities with rate parameter 𝜆 = 1 and various shape parameters; right, Gamma densities with shape parameter 𝛼 = 3 and various rate parameters.]

Example 13.4. The plots above show a few examples of Gamma distributions.

1. The plot on the left above contains a few different Gamma densities, all
with rate parameter 𝜆 = 1. Match each density to its shape parameter 𝛼;
the choices are 1, 2, 5, 10.
2. The plot on the right above contains a few different Gamma densities, all
with shape parameter 𝛼 = 3. Match each density to its rate parameter 𝜆;
the choices are 1, 2, 3, 4.

Solution. to Example 13.4


⁴ Sometimes Gamma densities are parametrized in terms of the scale parameter 1/𝜆, in which case the mean is 𝛼 times the scale parameter.

⁵ The expression defines the shape of a Gamma density. All that’s missing is the scaling constant which ensures that the total area under the density is 1. The actual Gamma density formula, including the normalizing constant, is

𝑓(𝑢) = (𝜆^{𝛼}/Γ(𝛼)) 𝑢^{𝛼−1}𝑒^{−𝜆𝑢},  𝑢 > 0,

where Γ(𝛼) = ∫₀^∞ 𝑒^{−𝑢}𝑢^{𝛼−1} 𝑑𝑢 is the Gamma function. For a positive integer 𝑘, Γ(𝑘) = (𝑘 − 1)!. Also, Γ(1/2) = √𝜋.

1. For a fixed 𝜆, as the shape parameter 𝛼 increases, both the mean and the
standard deviation increase.

2. For a fixed 𝛼, as the rate parameter 𝜆 increases, both the mean and the
standard deviation decrease.
Observe that changing 𝜆 doesn’t change the overall shape of the curve,
just the scale of values that it covers. However, changing 𝛼 does change
the shape of the curve; notice the changes in concavity in the plot on the
left.
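These qualitative effects can be confirmed by computing the means and SDs for the densities in the plots directly:

```r
# left plot: rate fixed at lambda = 1, shape varying
alpha = c(1, 2, 5, 10)
cbind(alpha, mean = alpha / 1, sd = sqrt(alpha) / 1)

# right plot: shape fixed at alpha = 3, rate varying
lambda = c(1, 2, 3, 4)
cbind(lambda, mean = 3 / lambda, sd = sqrt(3) / lambda)
```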

[Figure: left, Gamma densities with rate 𝜆 = 1 and shapes 𝛼 = 1, 2, 5, 10; right, Gamma densities with shape 𝛼 = 3 and rates 𝜆 = 1, 2, 3, 4.]

Example 13.5. Assume home runs per game at Citizens Bank Park follow a
Poisson distribution with parameter 𝜃. Assume for 𝜃 a Gamma prior distribution
with shape parameter 𝛼 = 4 and rate parameter 𝜆 = 2.

1. Write an expression for the prior density 𝜋(𝜃). Plot the prior distribution.
Find the prior mean, prior SD, and prior 50%, 80%, and 98% credible
intervals for 𝜃.
2. Suppose a single game with 1 home run is observed. Write the likelihood
function.
3. Write an expression for the posterior distribution of 𝜃 given a single game
with 1 home run. Identify by the name the posterior distribution and
the values of relevant parameters. Plot the prior distribution, (scaled)
likelihood, and posterior distribution. Find the posterior mean, posterior
SD, and posterior 50%, 80%, and 98% credible intervals for 𝜃.
4. Now consider the original prior again. Determine the likelihood of observing 1 home run in game 1 and 3 home runs in game 2 in a sample of 2 games, and the posterior distribution of 𝜃 given this sample. Identify by the name the posterior distribution and the values of relevant parameters. Plot the prior distribution, (scaled) likelihood, and posterior distribution. Find the posterior mean, posterior SD, and posterior 50%, 80%, and 98% credible intervals for 𝜃.
5. Consider the original prior again. Determine the likelihood of observing a total of 4 home runs in a sample of 2 games, and the posterior distribution of 𝜃 given this sample. Identify by the name the posterior distribution and the values of relevant parameters. How does this compare to the previous part?
6. Consider the 2020 data in which there were 97 home runs in 32 games.
Determine the likelihood function, and the posterior distribution of 𝜃 given
this sample. Identify by the name the posterior distribution and the values
of relevant parameters. Plot the prior distribution, (scaled) likelihood,
and posterior distribution. Find the posterior mean, posterior SD, and
posterior 50%, 80%, and 98% credible intervals for 𝜃.
7. Interpret the credible interval from the previous part in context.
8. Express the posterior mean of 𝜃 based on the 2020 data as a weighted
average of the prior mean and the sample mean.
9. While the main parameter is 𝜃, there are other parameters of interest.
For example, 𝜂 = 𝑒−𝜃 is the population proportion of games in which
there are 0 home runs. Assuming that you already have the posterior
distribution of 𝜃 (or a simulation-based approximation), explain how you
could use simulation to approximate the posterior distribution of 𝜂. Run
the simulation and plot the posterior distribution, and find and interpret
50%, 80%, and 98% posterior credible intervals for 𝜂.
10. Use JAGS to approximate the posterior distribution of 𝜃 given this sample.
Compare with the results from the previous example.

Solution. to Example 13.5

1. Remember that in the Gamma(4, 2) prior distribution, 𝜃 is treated as the variable.

𝜋(𝜃) ∝ 𝜃^{4−1}𝑒^{−2𝜃},  𝜃 > 0.

This is the same prior we used in the grid approximation in Example 13.3. See below for a plot.

Prior mean = 𝛼/𝜆 = 4/2 = 2
Prior SD = √(𝛼/𝜆²) = √(4/2²) = 1

Use qgamma to find the endpoints of the credible intervals.

qgamma(c(0.25, 0.75), shape = 4, rate = 2)

## [1] 1.268 2.555

qgamma(c(0.10, 0.90), shape = 4, rate = 2)

## [1] 0.8724 3.3404



qgamma(c(0.01, 0.99), shape = 4, rate = 2)

## [1] 0.4116 5.0226

2. The likelihood is the Poisson probability of 1 home run in a game computed for each value of 𝜃 > 0.

𝑓(𝑦 = 1|𝜃) = 𝑒^{−𝜃}𝜃^{1}/1! ∝ 𝑒^{−𝜃}𝜃,  𝜃 > 0.

3. Posterior is proportional to likelihood times prior:

𝜋(𝜃|𝑦 = 1) ∝ (𝑒^{−𝜃}𝜃)(𝜃^{4−1}𝑒^{−2𝜃}) ∝ 𝜃^{(4+1)−1}𝑒^{−(2+1)𝜃},  𝜃 > 0.

We recognize the above as the Gamma density with shape parameter 𝛼 = 4 + 1 and rate parameter 𝜆 = 2 + 1.

Posterior mean = 𝛼/𝜆 = 5/3 = 1.667
Posterior SD = √(𝛼/𝜆²) = √(5/3²) = 0.745

qgamma(c(0.25, 0.75), shape = 4 + 1, rate = 2 + 1)

## [1] 1.123 2.091

qgamma(c(0.10, 0.90), shape = 4 + 1, rate = 2 + 1)

## [1] 0.8109 2.6645

qgamma(c(0.01, 0.99), shape = 4 + 1, rate = 2 + 1)

## [1] 0.4264 3.8682

theta = seq(0, 8, 0.001) # the grid is just for plotting

# prior
alpha = 4
lambda = 2
prior = dgamma(theta, shape = alpha, rate = lambda)

# likelihood
n = 1 # sample size
y = 1 # sample mean
likelihood = dpois(n * y, n * theta)

# posterior
posterior = dgamma(theta, alpha + n * y, lambda + n)

# plot
plot_continuous_posterior <- function(theta, prior, likelihood, posterior) {
  ymax = max(c(prior, posterior))
  scaled_likelihood = likelihood * ymax / max(likelihood)
  plot(theta, prior, type='l', col='orange', xlim=range(theta), ylim=c(0, ymax), xlab='theta', ylab='', yaxt='n')
  par(new=T)
  plot(theta, scaled_likelihood, type='l', col='skyblue', xlim=range(theta), ylim=c(0, ymax), xlab='', ylab='', yaxt='n')
  par(new=T)
  plot(theta, posterior, type='l', col='seagreen', xlim=range(theta), ylim=c(0, ymax), xlab='', ylab='', yaxt='n')
  legend("topright", c("prior", "scaled likelihood", "posterior"), lty=1, col=c("orange", "skyblue", "seagreen"))
}

plot_continuous_posterior(theta, prior, likelihood, posterior)

[Plot: prior, scaled likelihood, and posterior densities of 𝜃 over [0, 8].]

4. The likelihood is the product of the likelihoods of 𝑦 = 1 and 𝑦 = 3.

𝑓(𝑦 = (1, 3)|𝜃) = (𝑒^{−𝜃}𝜃^{1}/1!)(𝑒^{−𝜃}𝜃^{3}/3!) ∝ 𝑒^{−2𝜃}𝜃^{4},  𝜃 > 0.

The posterior satisfies

𝜋(𝜃|𝑦 = (1, 3)) ∝ (𝑒^{−2𝜃}𝜃^{4})(𝜃^{4−1}𝑒^{−2𝜃}) ∝ 𝜃^{(4+4)−1}𝑒^{−(2+2)𝜃},  𝜃 > 0.

We recognize the above as the Gamma density with shape parameter 𝛼 = 4 + 4 and rate parameter 𝜆 = 2 + 2.

Posterior mean = 𝛼/𝜆 = 8/4 = 2
Posterior SD = √(𝛼/𝜆²) = √(8/4²) = 0.707

qgamma(c(0.25, 0.75), shape = 4 + 4, rate = 2 + 2)

## [1] 1.489 2.421

qgamma(c(0.10, 0.90), shape = 4 + 4, rate = 2 + 2)

## [1] 1.164 2.943

qgamma(c(0.01, 0.99), shape = 4 + 4, rate = 2 + 2)

## [1] 0.7265 4.0000

n = 2 # sample size
y = 2 # sample mean

# likelihood
likelihood = dpois(1, theta) * dpois(3, theta)

# posterior
posterior = dgamma(theta, alpha + n * y, lambda + n)

# plot
plot_continuous_posterior(theta, prior, likelihood, posterior)

[Plot: prior, scaled likelihood, and posterior densities of 𝜃 over [0, 8].]

5. By Poisson aggregation, the total number of home runs in 2 games follows a Poisson(2𝜃) distribution. The likelihood is the probability of a value of 4 (home runs in 2 games) computed using a Poisson(2𝜃) distribution for each value of 𝜃.

𝑓(𝑦̄ = 2|𝜃) = 𝑒^{−2𝜃}(2𝜃)^{4}/4! ∝ 𝑒^{−2𝜃}𝜃^{4},  𝜃 > 0

The shape of the likelihood as a function of 𝜃 is the same as in the previous part; the likelihood functions are proportionally the same regardless of whether you observe the individual values or just the total count. Therefore, the posterior distribution is the same as in the previous part.

# likelihood
n = 2 # sample size
y = 2 # sample mean
likelihood = dpois(n * y, n * theta)

# posterior
posterior = dgamma(theta, alpha + n * y, lambda + n)

# plot
plot_continuous_posterior(theta, prior, likelihood, posterior)

[Plot: prior, scaled likelihood, and posterior densities of 𝜃 over [0, 8].]

6. By Poisson aggregation, the total number of home runs in 32 games follows a Poisson(32𝜃) distribution. The likelihood is the probability of observing a value of 97 (for the total number of home runs in 32 games) from a Poisson(32𝜃) distribution.

𝑓(𝑦̄ = 97/32|𝜃) = 𝑒^{−32𝜃}(32𝜃)^{97}/97! ∝ 𝑒^{−32𝜃}𝜃^{97},  𝜃 > 0

The posterior satisfies

𝜋(𝜃|𝑦̄ = 97/32) ∝ (𝑒^{−32𝜃}𝜃^{97})(𝜃^{4−1}𝑒^{−2𝜃}) ∝ 𝜃^{(4+97)−1}𝑒^{−(2+32)𝜃},  𝜃 > 0.

We recognize the above as the Gamma density with shape parameter 𝛼 = 4 + 97 and rate parameter 𝜆 = 2 + 32.

Posterior mean = 𝛼/𝜆 = 101/34 = 2.97
Posterior SD = √(𝛼/𝜆²) = √(101/34²) = 0.296

The likelihood is centered at the sample mean of 97/32 = 3.03. The posterior distribution follows the likelihood fairly closely, but the prior still has a little influence. The posterior is essentially identical to the one we computed via grid approximation in Example 13.3.

# likelihood
n = 32 # sample size
y = 97 / 32 # sample mean
likelihood = dpois(n * y, n * theta)

# posterior
posterior = dgamma(theta, alpha + n * y, lambda + n)

# plot
plot_continuous_posterior(theta, prior, likelihood, posterior)

[Plot: prior, scaled likelihood, and posterior densities of 𝜃 over [0, 8].]

qgamma(c(0.25, 0.75), alpha + n * y, lambda + n)

## [1] 2.766 3.164

qgamma(c(0.10, 0.90), alpha + n * y, lambda + n)

## [1] 2.599 3.355

qgamma(c(0.01, 0.99), alpha + n * y, lambda + n)

## [1] 2.326 3.701



7. The credible intervals represent conclusions about 𝜃, the mean number of home runs per game at Citizens Bank Park.

There is a posterior probability of 50% that the mean number of home runs per game at Citizens Bank Park is between 2.77 and 3.16. It is equally plausible that 𝜃 is inside this interval as outside.

There is a posterior probability of 80% that the mean number of home runs per game at Citizens Bank Park is between 2.6 and 3.36. It is four times more plausible that 𝜃 is inside this interval than outside.

There is a posterior probability of 98% that the mean number of home runs per game at Citizens Bank Park is between 2.33 and 3.7. It is 49 times more plausible that 𝜃 is inside this interval than outside.

8. The prior mean is 4/2 = 2, based on a “prior sample size” of 2. The sample mean is 97/32 = 3.03, based on a sample size of 32. The posterior mean is (4 + 97)/(2 + 32) = 2.97. The posterior mean is a weighted average of the prior mean and the sample mean with the weights based on the “sample sizes”:

2.97 = (4 + 97)/(2 + 32) = (2/(2 + 32))(4/2) + (32/(2 + 32))(97/32) = (0.0589)(2) + (0.941)(3.03)
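The same weighted average computation in R:

```r
prior_mean = 4 / 2         # alpha / lambda
sample_mean = 97 / 32
w = 2 / (2 + 32)           # weight on the prior, based on "prior sample size" 2

w * prior_mean + (1 - w) * sample_mean  # 2.97, equal to (4 + 97) / (2 + 32)
```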

9. Simulate a value of 𝜃 from its posterior distribution, compute 𝜂 = 𝑒^{−𝜃}, repeat many times, and summarize the simulated values of 𝜂. We can use the quantile function to find the endpoints of credible intervals.

theta_sim = rgamma(10000, alpha + n * y, lambda + n)

eta_sim = exp(-theta_sim)

hist(eta_sim, freq = FALSE,


xlab = "Population proportion of games with 0 HRs",
ylab = "Posterior density",
main = "Posterior distribution of exp(-theta)")

[Histogram: posterior distribution of exp(−theta), the population proportion of games with 0 home runs, concentrated roughly between 0.02 and 0.14.]

quantile(eta_sim, c(0.25, 0.75))

## 25% 75%
## 0.04220 0.06315

quantile(eta_sim, c(0.10, 0.90))

## 10% 90%
## 0.03481 0.07455

quantile(eta_sim, c(0.01, 0.99))

## 1% 99%
## 0.02488 0.09849

There is a posterior probability of 98% that the population proportion of games with 0 home runs is between 0.025 and 0.098.
10. The JAGS code is below. The results are very similar to the theoretical
results from previous parts.

Here is the JAGS code. Note

• The data has been loaded as individual values, the number of home runs in each of the 32 games.
• The likelihood is defined as a loop. For each y[i] value, the likelihood is computed according to a Poisson(𝜃) distribution.
• The prior distribution is a Gamma distribution. (Remember, JAGS syntax for dgamma, dpois, etc., is not the same as in R.)

# data
df = read.csv("_data/citizens-bank-hr-2020.csv")
y = df$hr
n = length(y)

# model
model_string <- "model{

# Likelihood
for (i in 1:n){
y[i] ~ dpois(theta)
}

# Prior
theta ~ dgamma(alpha, lambda)
alpha <- 4
lambda <- 2

}"

# Compile the model


dataList = list(y=y, n=n)

Nrep = 10000
Nchains = 3

model <- jags.model(textConnection(model_string),


data=dataList,
n.chains=Nchains)

## Compiling model graph


## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 32
## Unobserved stochastic nodes: 1
## Total graph size: 36
##
## Initializing model

update(model, 1000, progress.bar="none")

posterior_sample <- coda.samples(model,


variable.names=c("theta"),
n.iter=Nrep,
progress.bar="none")

# Summarize and check diagnostics


summary(posterior_sample)

##
## Iterations = 1001:11000
## Thinning interval = 1
## Number of chains = 3
## Sample size per chain = 10000
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## 2.97290 0.29624 0.00171 0.00173
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## 2.42 2.77 2.96 3.17 3.58

plot(posterior_sample)

[Figure: trace plot and density plot of theta (N = 10000, bandwidth = 0.03995)]

In the previous example we saw that if the values of the measured variable follow
a Poisson distribution with parameter 𝜃 and the prior for 𝜃 follows a Gamma
distribution, then the posterior distribution for 𝜃 given the data also follows a
Gamma distribution.
Gamma-Poisson model.6 Consider a measured variable 𝑌 which, given 𝜃, follows a Poisson(𝜃) distribution. Let ȳ be the sample mean for a random sample of size 𝑛. Suppose 𝜃 has a Gamma(𝛼, 𝜆) prior distribution. Then the posterior distribution of 𝜃 given ȳ is the Gamma(𝛼 + 𝑛ȳ, 𝜆 + 𝑛) distribution.
That is, Gamma distributions form a conjugate prior family for a Poisson like-
lihood.
The posterior distribution is a compromise between prior and likelihood. For the
Gamma-Poisson model, there is an intuitive interpretation of this compromise.
In a sense, you can interpret 𝛼 as “prior total count” and 𝜆 as “prior sample
size”, but these are only “pseudo-observations”. Also, 𝛼 and 𝜆 are not necessarily
integers.
Note that if ȳ is the sample mean count, then 𝑛ȳ = ∑ᵢ₌₁ⁿ 𝑦ᵢ is the sample total count.
6 I’ve been naming these models in the form “Prior-Likelihood”, e.g. Gamma prior and

Poisson likelihood. I would rather do it as “Likelihood-Prior”. In modeling, the likelihood


comes first; what is an appropriate distributional model for the observed data? This likelihood
depends on some parameters, and then a prior distribution is placed on these parameters. So
in modeling the order is likelihood then prior, and it would be nice if the names followed that
pattern. But “Beta-Binomial” is the canonical example, and no one calls that “Binomial-
Beta”. To be consistent, we’ll stick with the “Prior-Likelihood” naming convention.

                 Prior    Data    Posterior
Total count      𝛼        𝑛ȳ      𝛼 + 𝑛ȳ
Sample size      𝜆        𝑛       𝜆 + 𝑛
Mean             𝛼/𝜆      ȳ       (𝛼 + 𝑛ȳ)/(𝜆 + 𝑛)

• The posterior total count is the sum of the “prior total count” 𝛼 and the sample total count 𝑛ȳ.
• The posterior sample size is the sum of the “prior sample size” 𝜆 and the
observed sample size 𝑛.
• The posterior mean is a weighted average of the prior mean and the sample
mean, with weights proportional to the “sample sizes”.

(𝛼 + 𝑛ȳ)/(𝜆 + 𝑛) = (𝜆/(𝜆 + 𝑛)) (𝛼/𝜆) + (𝑛/(𝜆 + 𝑛)) ȳ

• As more data are collected, more weight is given to the sample mean (and
less weight to the prior mean)
• Larger values of 𝜆 indicate stronger prior beliefs, due to smaller prior
variance (and larger “prior sample size”), and give more weight to the
prior mean
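As a quick numerical check, using the home run data from this chapter (a Gamma(4, 2) prior and 97 home runs in 32 games), the conjugate update and the weighted-average form of the posterior mean can be verified directly in R:

```r
# Gamma-Poisson conjugate update for the home run data:
# prior Gamma(alpha = 4, lambda = 2); data: 97 home runs in n = 32 games
alpha = 4; lambda = 2
n = 32; total = 97          # total = n * ybar

alpha_post = alpha + total  # posterior shape: 101
lambda_post = lambda + n    # posterior rate: 34

# posterior mean as a weighted average of prior mean and sample mean
prior_mean = alpha / lambda   # 2
sample_mean = total / n       # 3.03125
alpha_post / lambda_post      # 101 / 34 = 2.9706
(lambda / (lambda + n)) * prior_mean + (n / (lambda + n)) * sample_mean  # same value
```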

Try this applet which illustrates the Gamma-Poisson model.


Rather than specifying 𝛼 and 𝜆, a Gamma distribution prior can be specified by its prior mean and SD directly. If the prior mean is 𝜇 and the prior SD is 𝜎, then
𝜆 = 𝜇/𝜎²
𝛼 = 𝜇𝜆
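For example, these formulas can be wrapped in a small helper function (the name gamma_params is ours, not from the text):

```r
# Convert a desired prior mean (mu) and prior SD (sigma) into the
# shape (alpha) and rate (lambda) of the corresponding Gamma distribution
gamma_params = function(mu, sigma) {
  lambda = mu / sigma ^ 2
  alpha = mu * lambda
  c(shape = alpha, rate = lambda)
}

gamma_params(2, 1)  # shape = 4, rate = 2: the prior used in this chapter
```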

Example 13.6. Continuing the previous example, assume home runs per game
at Citizens Bank Park follow a Poisson distribution with parameter 𝜃. Assume
for 𝜃 a Gamma prior distribution with shape parameter 𝛼 = 4 and rate param-
eter 𝜆 = 2. Consider the 2020 data in which there were 97 home runs in 32
games.

1. How could you use simulation (not JAGS) to approximate the posterior
predictive distribution of home runs in a game?
2. Use the simulation from the previous part to find and interpret a 95%
posterior prediction interval with a lower bound of 0.
3. Is a Poisson model a reasonable model for the data? How could you use
posterior predictive simulation to simulate what a sample of 32 games
might look like under this model. Simulate many such samples. Does the
observed sample seem consistent with the model?

4. Regarding the appropriateness of a Poisson model, we might be concerned


that there are no games in the sample with 0 home runs. Use simulation to
approximate the posterior predictive distribution of the number of games
in a sample of 32 with 0 home runs. From this perspective, does the
observed value of the statistic seem consistent with the Gamma-Poisson
model?
Solution. to Example 13.6

1. Simulate a value of 𝜃 from its Gamma(101, 34) posterior distribution, then


given 𝜃 simulate a value of 𝑦 from a Poisson(𝜃) distribution. Repeat many
times and summarize the 𝑦 values to approximate the posterior predictive
distribution.

Nrep = 10000
theta_sim = rgamma(Nrep, 101, 34)

y_sim = rpois(Nrep, theta_sim)

plot(table(y_sim) / Nrep, type = "h",


xlab = "Number of home runs",
ylab = "Simulated relative frequency",
main = "Posterior predictive distribution")

[Figure: posterior predictive distribution of the number of home runs in a game; spike plot of simulated relative frequencies for 0 to 11 home runs]

2. There is a posterior predictive probability of 95% of between 0 and 6 home


runs in a game. Very roughly, about 95% of games have between 0 and 6
home runs.

quantile(y_sim, 0.95)

## 95%
## 6

3. Simulate a value of 𝜃 from its Gamma(101, 34) posterior distribution, then


given 𝜃 simulate 32 values of 𝑦 from a Poisson(𝜃) distribution. Summarize
each sample. Repeat many times to simulate many samples of size 32.
Compare the observed sample with the simulated samples. Aside from the fact that the sample has no games with 0 home runs, the model seems reasonable.

df = read.csv("_data/citizens-bank-hr-2020.csv")
y = df$hr
n = length(y)

plot(table(y) / n, type = "h", xlim = c(0, 13), ylim = c(0, 0.4),


xlab = "Number of home runs",
ylab = "Observed/Simulated relative frequency",
main = "Posterior predictive distribution")
axis(1, 0:13)

n_samples = 100

# simulate samples
for (r in 1:n_samples){

# simulate theta from posterior distribution


theta_sim = rgamma(1, 101, 34)

# simulate values from Poisson(theta) distribution


y_sim = rpois(n, theta_sim)

# add plot of simulated sample to histogram


par(new = T)
  plot(table(factor(y_sim, levels = 0:13)) / n, type = "o",
       xlim = c(0, 13), ylim = c(0, 0.4),
       xlab = "", ylab = "", xaxt = 'n', yaxt = 'n',
       col = rgb(135, 206, 235, max = 255, alpha = 25))
}

[Figure: observed relative frequencies of home runs per game (spikes) overlaid with 100 simulated samples of size 32 from the posterior predictive distribution]

4. Continuing with the simulation from the previous part, we now record, for each simulated sample, the number of games with 0 home runs. Each “dot” in the plot below represents a sample of size 32 for which we measure the number of games in the sample with 0 home runs. While having no games with 0 home runs in a sample of 32 is less likely than not, such an outcome would not be too surprising. Therefore, the fact that the observed sample contains no games with 0 home runs does not by itself invalidate the model.

n_samples = 10000

zero_count = rep(NA, n_samples)

# simulate samples
for (r in 1:n_samples){

# simulate theta from posterior distribution


theta_sim = rgamma(1, 101, 34)

# simulate values from Poisson(theta) distribution


y_sim = rpois(n, theta_sim)
zero_count[r] = sum(y_sim == 0)
}

par(mar = c(5, 5, 4, 2) + 0.1)


plot(table(zero_count) / n_samples, type = "h",

     xlab = "Number of games in sample of size 32 with 0 home runs",
     ylab = "Simulated posterior predictive probability\n Proportion of samples of size 32")

[Figure: spike plot of the simulated posterior predictive probability (proportion of samples of size 32) for the number of games, out of 32, with 0 home runs]


Chapter 14

Introduction to Multi-Parameter Models

So far we have considered situations with just a single unknown parameter 𝜃. However, most interesting problems involve multiple unknown parameters.
For example, we have considered the problem of estimating the population mean
of a numerical variable assuming the population standard deviation was known.
However, in practice both the population mean and population standard devi-
ation are unknown. Even if we are only interested in estimating the population
mean, we still need to account for the uncertainty in the population standard
deviation.
When there are two (or more) unknown parameters the prior and posterior
distribution will each be a joint probability distribution over pairs (or tu-
ples/vectors) of possible values of the parameters.
Example 14.1. Assume body temperatures (degrees Fahrenheit) of healthy
adults follow a Normal distribution with unknown mean 𝜇 and unknown stan-
dard deviation 𝜎. Suppose we wish to estimate both 𝜇, the population mean
healthy human body temperature, and 𝜎, the population standard deviation of
body temperatures.

1. Assume a discrete prior distribution according to which


• 𝜇 takes values 97.6, 98.1, 98.6 with prior probability 0.2, 0.3, 0.5,
respectively.
• 𝜎 takes values 0.5, 1 with prior probability 0.25, 0.75, respectively.
• 𝜇 and 𝜎 are independent.
Start to construct the Bayes table. What are the possible values of the
parameter? What are the prior probabilities? (Hint: the parameter 𝜃 is a
pair (𝜇, 𝜎).)


2. Suppose two temperatures of 97.5 and 97.9 are observed, independently.


Identify the likelihood.
3. Complete the Bayes table and find the posterior distribution after observ-
ing these two measurements. Compare to the prior distribution.
4. Suppose that we only observe that in a sample of size 2 the mean is 97.7. Is
this information enough to evaluate the likelihood function and determine
the posterior distribution?
5. The prior assumes that 𝜇 and 𝜎 are independent. Are they independent
according to the posterior distribution?

Solution. to Example 14.1

1. See the table below. There are 3 possible values for 𝜇 and 2 possible
values for 𝜎 so there are (3)(2) = 6 possible (𝜇, 𝜎) pairs. Each row in the
Bayes table represents a (𝜇, 𝜎) pair. Since the prior assumes independence,
the prior probability of any pair is the product of the marginal prior
probabilities of 𝜇 and 𝜎. For example, the prior probability that 𝜇 = 97.6 and 𝜎 = 0.5 is (0.2)(0.25) = 0.05.
2. The likelihood is similar to what we have seen in other examples concern-
ing body temperature, but it is now a function of both 𝜇 and 𝜎. That is,
the likelihood is a function of two variables. The likelihood is determined
by evaluating, for each (𝜇, 𝜎) pair, the Normal(𝜇, 𝜎) density at each of
𝑦 = 97.9 and 𝑦 = 97.5 and then finding the product:
𝑓(𝑦 = (97.9, 97.5) | 𝜇, 𝜎) ∝ [𝜎⁻¹ exp(−½((97.9 − 𝜇)/𝜎)²)] [𝜎⁻¹ exp(−½((97.5 − 𝜇)/𝜎)²)]

3. See the table below. As always, posterior is proportional to likelihood


times prior. For the sample (97.9, 97.5), the sample mean is 97.7 and
the sample standard deviation is 0.283. The posterior distribution pushes
probability away from 𝜇 = 98.6, and pushes more probability towards
𝜎 = 0.5.

mu = c(97.6, 98.1, 98.6)


sigma = c(0.5, 1)
theta = expand.grid(mu, sigma) # all possible (mu, sigma) pairs
names(theta) = c("mu", "sigma")

# prior

prior_mu = c(0.20, 0.30, 0.50)


prior_sigma = c(0.25, 0.75)
prior = apply(expand.grid(prior_mu, prior_sigma), 1, prod)
prior = prior / sum(prior)

# data
y = c(97.9, 97.5) # observed sample of two values

# likelihood
likelihood = dnorm(97.9, mean = theta$mu, sd = theta$sigma) *
dnorm(97.5, mean = theta$mu, sd = theta$sigma)

# posterior
product = likelihood * prior
posterior = product / sum(product)

# bayes table
bayes_table = data.frame(theta,
prior,
likelihood,
product,
posterior)

kable(bayes_table, digits = 4, align = 'r')

mu sigma prior likelihood product posterior


97.6 0.5 0.050 0.5212 0.0261 0.2041
98.1 0.5 0.075 0.2861 0.0215 0.1680
98.6 0.5 0.125 0.0212 0.0027 0.0208
97.6 1.0 0.150 0.1514 0.0227 0.1778
98.1 1.0 0.225 0.1303 0.0293 0.2296
98.6 1.0 0.375 0.0680 0.0255 0.1997

4. Intuitively, knowing only the sample mean would not be sufficient, since it would not give us enough information to estimate the standard deviation 𝜎. In order to evaluate the likelihood we need to compute (𝑦 − 𝜇)/𝜎 for each individual 𝑦 value, so if we only had the sample mean we would not be able to fill in the likelihood column.

5. The posterior distribution represents some dependence between 𝜇 and 𝜎.


For example, consider the pair 𝜇 = 97.6 and 𝜎 = 0.5. The marginal
posterior probability that 𝜇 = 97.6 is 0.3819. The marginal posterior
probability that 𝜎 = 0.5 is 0.3929. But the joint posterior probability
that 𝜇 = 97.6 and 𝜎 = 0.5 is 0.2041, which is not the product of the
marginal probabilities.
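A minimal sketch of this marginalization, using the (rounded) posterior values from the Bayes table above:

```r
# Joint posterior over the six (mu, sigma) pairs, in the same row order
# as the Bayes table (mu varies fastest)
theta = expand.grid(mu = c(97.6, 98.1, 98.6), sigma = c(0.5, 1.0))
theta$posterior = c(0.2041, 0.1680, 0.0208, 0.1778, 0.2296, 0.1997)

# Marginal posterior distributions: sum the joint over the other parameter
aggregate(posterior ~ mu, data = theta, FUN = sum)     # P(mu = 97.6) = 0.3819
aggregate(posterior ~ sigma, data = theta, FUN = sum)  # P(sigma = 0.5) = 0.3929

# The joint probability is not the product of the marginals
0.3819 * 0.3929  # about 0.150, not 0.2041, so mu and sigma are dependent
```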

The plots below compare the prior and posterior distributions from the previous
problem.

ggplot(bayes_table %>%
mutate(mu = factor(mu),
sigma = factor(sigma)),
aes(mu, sigma)) +
geom_tile(aes(fill = prior)) +
scale_fill_viridis(limits = c(0, max(c(prior, posterior))))

ggplot(bayes_table %>%
mutate(mu = factor(mu),
sigma = factor(sigma)),
aes(mu, sigma)) +
geom_tile(aes(fill = posterior)) +
scale_fill_viridis(limits = c(0, max(c(prior, posterior))))

[Figure: heatmaps of the prior (left) and posterior (right) distributions over the (mu, sigma) pairs]

Example 14.2. Continuing Example 14.1, let’s assume a more reasonable,


continuous prior for (𝜇, 𝜎). We have seen that we often work with the precision 𝜏 = 1/𝜎² rather than the SD. Assume a continuous prior distribution which assumes

• 𝜇 has a Normal distribution with mean 98.6 and standard deviation 0.3.
• 𝜏 has a Gamma distribution with shape parameter 5 and rate parameter
2.
• 𝜇 and 𝜏 are independent.

(This problem just concerns the prior distribution. We’ll look at this posterior
distribution in the next example.)

1. Simulate (𝜇, 𝜏 ) pairs from the prior distribution and plot them.
2. Simulate (𝜇, 𝜎) pairs from the prior distribution and plot them. Describe
the prior distribution of 𝜎.
3. Find and interpret a central 98% prior credible interval for 𝜇.
4. Find a central 98% prior credible interval for the precision 𝜏 = 1/𝜎².
5. Find and interpret a central 98% prior credible interval for 𝜎.

6. What is the prior credibility that both 𝜇 and 𝜎 lie within their credible
intervals?

Solution. to Example 14.2

1. We could plot the prior distribution directly. However, distributions are


usually only approximated via simulation, so we’ll just simulate. The prior
distribution is a distribution on (𝜇, 𝜏 ) pairs.

Nrep = 100000

mu_sim_prior = rnorm(Nrep, 98.6, 0.3)


tau_sim_prior = rgamma(Nrep, shape = 5, rate = 2)
sigma_sim_prior = 1 / sqrt(tau_sim_prior)
sim_prior = data.frame(mu_sim_prior, tau_sim_prior, sigma_sim_prior)

ggplot(sim_prior, aes(mu_sim_prior, tau_sim_prior)) +


geom_point(color = "skyblue", alpha = 0.4)

ggplot(sim_prior, aes(mu_sim_prior, tau_sim_prior)) +


stat_density_2d(aes(fill = ..level..),
geom = "polygon", color = "white") +
scale_fill_viridis_c()

[Figure: scatterplot and 2-d density plot of simulated (mu, tau) pairs from the prior distribution]

2. See plots below. The prior distribution on 𝜏 induces a prior distribution on 𝜎 = 1/√𝜏.

ggplot(sim_prior, aes(x = sigma_sim_prior)) +


geom_histogram(aes(y=..density..), color = "black", fill = "white") +
geom_density(size = 1, color = "skyblue")

ggplot(sim_prior, aes(mu_sim_prior, sigma_sim_prior)) +


stat_density_2d(aes(fill = ..level..),
geom = "polygon", color = "white") +
scale_fill_viridis_c()
246 CHAPTER 14. INTRODUCTION TO MULTI-PARAMETER MODELS

[Figure: histogram with density estimate of the induced prior distribution of sigma, and a 2-d density plot of the (mu, sigma) prior]

3. There is a prior probability of 98% that population mean human body


temperature is between 97.9 and 99.3 degrees F.

quantile(mu_sim_prior, c(0.01, 0.99))

## 1% 99%
## 97.9 99.3

4. We can compute a credible interval like usual. Precision just doesn’t have
as practical an interpretation as standard deviation.

quantile(tau_sim_prior, c(0.01, 0.99))

## 1% 99%
## 0.6374 5.8224

5. There is a prior probability of 98% that population standard deviation of


human body temperatures is between 0.41 and 1.25 degrees F.

quantile(sigma_sim_prior, c(0.01, 0.99))

## 1% 99%
## 0.4144 1.2526

6. Since 𝜇 and 𝜎 are independent according to the prior distribution,


the probability that both parameters lie in their respective intervals is
(0.98)(0.98)=0.9604. If we want 98% joint prior credibility, we need a
different region.
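A simulation sketch of this calculation: draw (𝜇, 𝜎) pairs from the prior and check how often both fall in their central 98% intervals; widening each interval to 99% (since 0.99² ≈ 0.98) gives approximately 98% joint credibility.

```r
Nrep = 100000
mu_sim = rnorm(Nrep, 98.6, 0.3)
sigma_sim = 1 / sqrt(rgamma(Nrep, shape = 5, rate = 2))

# indicator that x falls inside its own central 100p% interval
in_interval = function(x, p) {
  endpoints = quantile(x, c((1 - p) / 2, 1 - (1 - p) / 2))
  x > endpoints[1] & x < endpoints[2]
}

# joint coverage of the two central 98% intervals: about 0.96
mean(in_interval(mu_sim, 0.98) & in_interval(sigma_sim, 0.98))

# widening each to a central 99% interval gives about 0.98 joint credibility
mean(in_interval(mu_sim, 0.99) & in_interval(sigma_sim, 0.99))
```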
Example 14.3. Continuing the previous example, we’ll now compute the pos-
terior distribution given a sample of two measurements of 97.9 and 97.5.

1. Assume a grid of 𝜇 values from 96.0 to 100.0 in increments of 0.01, and a


grid of 𝜏 values from 0.1 to 25.0 in increments of 0.01. How many possible
values of the pair (𝜇, 𝜏 ) are there; that is, how many rows are there in the
Bayes table?

2. Use grid approximation to approximate the joint posterior distribution of


(𝜇, 𝜏 ) Simulate values from the joint posterior distribution and plot them.
Compute the posterior correlation between 𝜇 and 𝜏 ; are they independent
according to the posterior distribution?
3. Plot the simulated joint posterior distribution of 𝜇 and 𝜎. Compare to
the prior.
4. Suppose we wanted to approximate the posterior distribution without first
using grid approximation. Describe how, in principle, you would use a
naive (non-MCMC) simulation to approximate the posterior distribution.
In practice, what is the problem with such a simulation?

Solution. to Example 14.3

1. There are (100 − 96)/0.01 = 400 values of 𝜇 in the grid (actually 401 including both endpoints) and (25 − 0.1)/0.01 = 2490 values of 𝜏 in the grid (actually 2491). There are almost 1 million possible values of the pair (𝜇, 𝜏) in the grid.
2. See below. Even though 𝜇 and 𝜏 are independent according to the prior
distribution, there is a negative posterior correlation. (Below the posterior
is computed via grid approximation. After the posterior distribution was
computed, values were simulated from it for plotting.)

# parameters
mu = seq(96.0, 100.0, 0.01)
tau = seq(0.1, 25, 0.01)

theta = expand.grid(mu, tau)


names(theta) = c("mu", "tau")
theta$sigma = 1 / sqrt(theta$tau)

# prior
prior_mu_mean = 98.6
prior_mu_sd = 0.3

prior_precision_shape = 5
prior_precision_rate = 2

prior = dnorm(theta$mu, prior_mu_mean, sd = prior_mu_sd) *


dgamma(theta$tau, shape = prior_precision_shape,
rate = prior_precision_rate)
prior = prior / sum(prior)

# data
y = c(97.9, 97.5)

# likelihood
likelihood = dnorm(97.9, mean = theta$mu, sd = theta$sigma) *
dnorm(97.5, mean = theta$mu, sd = theta$sigma)

# posterior
product = likelihood * prior
posterior = product / sum(product)

# posterior simulation
sim_posterior = theta[sample(1:nrow(theta), 100000, replace = TRUE, prob = posterior), ]

cor(sim_posterior$mu, sim_posterior$tau)

## [1] -0.2888

#plots

ggplot(sim_posterior, aes(mu)) +
geom_histogram(aes(y=..density..), color = "black", fill = "white") +
geom_density(size = 1, color = "seagreen")

ggplot(sim_posterior, aes(tau)) +
geom_histogram(aes(y=..density..), color = "black", fill = "white") +
geom_density(size = 1, color = "seagreen")

ggplot(sim_posterior, aes(mu, tau)) +


stat_density_2d(aes(fill = ..level..),
geom = "polygon", color = "white") +
scale_fill_viridis_c()

[Figure: histograms with density estimates of the marginal posterior distributions of mu and tau, and a 2-d density plot of the joint posterior of (mu, tau)]

3. See below. We see that the posterior shifts the density towards smaller
values of 𝜇 and 𝜎. There is also a slight positive posterior correlation
between 𝜇 and 𝜎.

cor(sim_posterior$mu, sim_posterior$sigma)

## [1] 0.2804

#plots

ggplot(sim_posterior, aes(mu)) +
geom_histogram(aes(y=..density..), color = "black", fill = "white") +
geom_density(size = 1, color = "seagreen")

ggplot(sim_posterior, aes(sigma)) +
geom_histogram(aes(y=..density..), color = "black", fill = "white") +
geom_density(size = 1, color = "seagreen")

ggplot(sim_posterior, aes(mu, sigma)) +


stat_density_2d(aes(fill = ..level..),
geom = "polygon", color = "white") +
scale_fill_viridis_c()

[Figure: histograms with density estimates of the marginal posterior distributions of mu and sigma, and a 2-d density plot of the joint posterior of (mu, sigma)]

4. Simulate a value of (𝜇, 𝜎) from their joint prior distribution, by simulating a value of 𝜇 from a Normal(98.6, 0.3) distribution and, independently, a value of 𝜏 from a Gamma(5, 2) distribution, and setting 𝜎 = 1/√𝜏. Given 𝜇 and 𝜎, simulate two independent 𝑦 values from a Normal(𝜇, 𝜎) distribution. Repeat many times. Condition on the observed data by discarding any repetitions for which the 𝑦 values are not (97.9, 97.5), to some reasonable degree of precision, say rounded to 1 decimal place. Approximate the posterior distribution using the remaining simulated values of (𝜇, 𝜎).
In practice, the probability of seeing a sample with 𝑦 values of 97.9 and
97.5 is extremely small, so almost all repetitions of the simulation would
be discarded and such a simulation would be extremely computationally
inefficient. (For example, the values of 𝜇 and 𝜎 which maximize the like-
lihood of (97.9, 97.5) are 97.7 and 0.2, respectively, and even for those
values and rounding to 1 decimal place the probability of seeing such a
sample is only 0.015.)
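A sketch of this naive rejection approach (for illustration only; as noted above, almost every repetition is discarded):

```r
# Naive rejection sampling: keep only prior draws whose simulated sample of
# two values matches the observed data, rounded to 1 decimal place
y_obs = c(97.9, 97.5)
Nrep = 1000000

mu_sim = rnorm(Nrep, 98.6, 0.3)
sigma_sim = 1 / sqrt(rgamma(Nrep, shape = 5, rate = 2))

y1 = round(rnorm(Nrep, mu_sim, sigma_sim), 1)
y2 = round(rnorm(Nrep, mu_sim, sigma_sim), 1)

keep = (y1 == y_obs[1]) & (y2 == y_obs[2])
sum(keep)  # typically only a handful of the million repetitions survive

# The surviving (mu, sigma) pairs approximate the posterior, but far too
# few remain to summarize the distribution reliably
summary(mu_sim[keep])
```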

The previous problem illustrates that grid approximation can quickly become
computationally infeasible when there are multiple parameters (to obtain suf-
ficient precision). Naively conditioning a simulation on the observed sample
is also computationally infeasible, since except in the simplest situations the
probability of recreating the observed sample in a simulation is essentially 0.

Therefore, we need more efficient computational methods, and MCMC will do


the trick.

Example 14.4. The temperature data file contains 208 measurements of hu-
man body temperature (degrees F). The sample mean is 97.71 degrees F and
the sample SD is 0.75 degrees F. Assuming the same prior distribution as in the
previous problem, use JAGS to approximate the joint posterior distribution of
𝜇 and 𝜎. Summarize the posterior distribution in context.

Solution. to Example 14.4

The JAGS code is below. A few comments on the code

• The data is read as individual values, so the likelihood of each y[i] is


computed via a for loop.
• We have called the parameters by their names mu, tau, sigma, rather than
just a single theta.
• We specify a prior distribution on tau and then define sigma <- 1 /
sqrt(tau).
• In JAGS dnorm is of the form dnorm(mean, precision)
• We are interested in the posterior distribution of 𝜇 and 𝜎, so we include both parameters in the variable.names argument of the coda.samples function.
• The output of coda.samples is a special object called an mcmc.list.
Calling plot on this object produces a trace plot and a density plot for each
parameter included in variable.names. But it does not automatically
produce any joint distribution plots.
• We use mcmc_scatter from the bayesplot package to create a scatter
plot of the joint posterior distribution, to which we can add contours.
• We can also extract the JAGS output as a matrix, put it in a data frame,
and then use R or ggplot commands to create plots.
• The simulated values from an mcmc.list can be extracted as a matrix
with as.matrix and then manipulated as usual, e.g., to compute the cor-
relation.

Nrep = 10000
Nchains = 3

# data
data = read.csv("_data/temperature.csv")
y = data$temperature
n = length(y)

# model
model_string <- "model{

# Likelihood
for (i in 1:n){
y[i] ~ dnorm(mu, 1 / sigma ^ 2)
}

# Prior
mu ~ dnorm(98.6, 1 / 0.3 ^ 2)

sigma <- 1 / sqrt(tau)


tau ~ dgamma(5, 2)

}"

# Compile the model


dataList = list(y=y, n=n)

model <- jags.model(textConnection(model_string),


data=dataList,
n.chains=Nchains)

## Compiling model graph


## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 208
## Unobserved stochastic nodes: 2
## Total graph size: 222
##
## Initializing model

update(model, 1000, progress.bar="none")

posterior_sample <- coda.samples(model,


variable.names=c("mu", "sigma"),
n.iter=Nrep,
progress.bar="none")

# Summarize and check diagnostics


summary(posterior_sample)

##
## Iterations = 2001:12000
## Thinning interval = 1

## Number of chains = 3
## Sample size per chain = 10000
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## mu 97.736 0.0514 0.000297 0.000297
## sigma 0.749 0.0365 0.000211 0.000269
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## mu 97.635 97.701 97.736 97.771 97.836
## sigma 0.682 0.724 0.748 0.773 0.825

plot(posterior_sample)

[Figure: trace plots and density plots of mu (N = 10000, bandwidth = 0.006929) and sigma (N = 10000, bandwidth = 0.004919)]

# Scatterplot from bayesplot package


color_scheme_set("green")
mcmc_scatter(posterior_sample, pars = c("mu", "sigma"), alpha = 0.1) +
stat_ellipse(level = 0.98, color = "black", size = 2) +
stat_density_2d(color = "grey", size = 1)

[Figure: scatterplot of simulated (mu, sigma) posterior pairs with density contours and a 98% posterior credible ellipse]

# posterior summary
posterior_sim = data.frame(as.matrix(posterior_sample))

head(posterior_sim)

## mu sigma
## 1 97.74 0.7210
## 2 97.68 0.7885
## 3 97.77 0.7331
## 4 97.72 0.7154
## 5 97.71 0.7449
## 6 97.74 0.7251

apply(posterior_sim, 2, mean)

## mu sigma
## 97.7362 0.7493

apply(posterior_sim, 2, sd)

## mu sigma
## 0.05138 0.03647

quantile(posterior_sim$mu, c(0.01, 0.99))

## 1% 99%
## 97.62 97.86

quantile(posterior_sim$sigma, c(0.01, 0.99))

## 1% 99%
## 0.6717 0.8423

cor(posterior_sim)

## mu sigma
## mu 1.00000 0.04211
## sigma 0.04211 1.00000

ggplot(posterior_sim, aes(mu)) +
geom_histogram(aes(y=..density..), color = "black", fill = "white") +
geom_density(size = 1, color = "seagreen")

[Figure: histogram with density estimate of the marginal posterior distribution of mu]

ggplot(posterior_sim, aes(sigma)) +
geom_histogram(aes(y=..density..), color = "black", fill = "white") +
geom_density(size = 1, color = "seagreen")

[Figure: histogram with density estimate of the marginal posterior distribution of sigma]

ggplot(posterior_sim, aes(mu, sigma)) +


stat_density_2d(aes(fill = ..level..),
geom = "polygon", color = "white") +
scale_fill_viridis_c()

[Figure: 2-d density plot of the joint posterior distribution of (mu, sigma)]

A few comments about the posterior distribution

• The joint posterior distribution appears to be roughly Bivariate Normal.


The correlation is close to 0, indicating independence¹ between 𝜇 and 𝜎 in the posterior.
• The posterior distribution of 𝜇 is approximately Normal with posterior
mean 97.7 (basically the sample mean) and posterior SD 0.05. There
is a 98% posterior probability that the population mean human body
temperature is between 97.6 and 97.9 degrees F.
• The posterior distribution of 𝜎 is approximately Normal with posterior
mean 0.75 (basically the sample SD) and posterior SD 0.036. There is a
98% posterior probability that the population SD of human body temper-
atures is between 0.67 and 0.84 degrees F.
• Since 𝜇 and 𝜎 are roughly independent in the posterior, there is a posterior
probability of 96% that both of the above statements are true, that is,
that both parameters lie in their respective credible intervals. To have
joint posterior credibility of 98%, we could lengthen each interval (to 99%
for two independent intervals) to obtain a rectangular credibility region.
The scatterplot also shows a 98% posterior credible ellipse (in black) for
both 𝜇 and 𝜎.
Example 14.5. Continuing the previous example, how could you use simulation to approximate the posterior predictive distribution of a single body temperature? Conduct the simulation and compute and interpret a 95% prediction interval.

¹ Remember, if 𝑋 and 𝑌 are independent then the correlation is 0, but the converse is not true in general. However, if 𝑋 and 𝑌 have a Bivariate Normal distribution and their correlation is 0, then 𝑋 and 𝑌 are independent.

Solution. to Example 14.5

• Simulate a (𝜇, 𝜎) pair from the joint posterior distribution.


• Given 𝜇 and 𝜎, simulate a value of 𝑦 from a 𝑁 (𝜇, 𝜎) distribution.
• Repeat many times and summarize the simulated 𝑦 values to approximate
the posterior predictive distribution.

See the code below. JAGS has already returned a simulation from the joint
posterior distribution of (𝜇, 𝜎) For each of these simulated values, simulate a
corresponding 𝑦 value like usual.

theta_sim = as.matrix(posterior_sample)

y_sim = rnorm(nrow(theta_sim), theta_sim[, "mu"], theta_sim[, "sigma"])

hist(y_sim, freq = FALSE, xlab = "Body temperature (degrees F)",
     main = "Posterior predictive distribution")
lines(density(y_sim))
abline(v = quantile(y_sim, c(0.025, 0.975)), col = "orange")

[Figure: histogram of the posterior predictive distribution of body temperature (degrees F), with the 95% prediction interval endpoints marked]



quantile(y_sim, c(0.025, 0.975))

## 2.5% 97.5%
## 96.27 99.23

There is a posterior predictive probability of 95% that a body temperature is between 96.27 and 99.23 degrees F. Roughly, 95% of healthy human body temperatures are between 96.3 and 99.2 degrees F.
Chapter 15

Bayesian Analysis of a Numerical Variable

In this chapter we’ll continue our study of a single numerical variable. When the
distribution of the measured variable is symmetric and unimodal, the population
mean is often the main parameter of interest. However, the population SD also
plays an important role.
In the previous section we assumed that the measured numerical variable fol-
lowed a Normal distribution. That is, we assumed a Normal likelihood function.
However, the assumption of Normality is not always justified, even when the
distribution of the measured variable is symmetric and unimodal. In this chap-
ter we’ll investigate an alternative to a Normal likelihood that is more flexible
and robust to outliers and extreme values.
Example 15.1. In a previous assignment, we assumed that birthweights
(grams) of human babies follow a Normal distribution with unknown mean 𝜇
and known SD 𝜎 = 600. (1 pound ≈ 454 grams.) We assumed a Normal(3400,
100) (in grams, or Normal(7.5, 0.22) in pounds) prior distribution for 𝜇; this
prior distribution places most of its probability on mean birthweight being
between 7 and 8 pounds.
Now we’ll assume, more realistically, that 𝜎 is unknown.

1. What does the parameter 𝜎 represent? What is a reasonable prior mean for
𝜎? What range of values of 𝜎 will account for most of the prior probability?
2. Assume a Gamma prior distribution for 𝜎 with mean 600 and SD 200; this
is a Gamma distribution with shape parameter 𝛼 = 600²/200² and rate parameter 600/200². Also assume that 𝜇 and 𝜎 are independent according
to the prior distribution. Explain how you could use simulation to approx-
imate the prior predictive distribution of birthweights. Run the simulation
and summarize the results. Does the choice of prior seem reasonable?


3. The following summarizes data on a random sample1 of 1000 live births


in the U.S. in 2001.

data = read.csv("_data/birthweight.csv")
y = data$birthweight

hist(y, freq = FALSE, breaks = 50)

[Figure: histogram of the observed birthweights y (grams)]

summary(y)

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 595 3005 3350 3315 3714 5500

sd(y)

## [1] 631.3

n = length(y)
ybar = mean(y)

Does it seem reasonable to assume birthweights follow a Normal distribu-


tion?
1 There are about 4 million live births in the U.S. per year. The data is available at the

CDC website. We’re only using a random sample to cut down on computation time.

4. Regardless of your answer to the previous question, continue to assume


the model above. Use JAGS to find the posterior distribution.
5. How could you use simulation to approximate the posterior predictive
distribution of birthweights? Run the simulation and find a 99% posterior
prediction interval.
6. What percent of values in the observed sample fall outside the prediction
interval? What does that tell you?

Solution to Example 15.1.

1. The parameter 𝜎 represents the population standard deviation of individual
birthweights: how much do birthweights vary from baby to baby? Our prior
for 𝜇 says that a mean birthweight in the 7-8 or so pounds range seems
reasonable. The parameter 𝜎 represents how much individual birthweights
vary about this mean. Let’s say that we think most babies weigh between
5 and 10 pounds; then we might want the interval [5, 10] to account for
95% of birthweights, or the values within 2 SDs of the mean, so we might
want 5 pounds (the length of [5, 10]) to represent 4 SDs. So one reasonable
prior mean of 𝜎 might be around 1.25 pounds, or around 600 grams or
so. Let’s choose a prior SD of 200 grams for 𝜎 to cover a reasonably wide
range of values of 𝜎: values like 100 grams which represent little variability
in birthweights, to values like 1000 grams which represent a great deal of
variability in birthweights. That is, we’ll choose a Gamma prior distribu-
tion for 𝜎 with mean 600 and SD 200; this is a Gamma distribution with
shape parameter 𝛼 = 600²/200² and rate parameter 600/200². Remember,
this is only one choice of prior. There are many other reasonable choices.
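The shape and rate above come from the fact that a Gamma(𝛼, 𝛽) distribution has mean 𝛼/𝛽 and SD √𝛼/𝛽; solving mean = 600 and SD = 200 gives the values used here. A quick check in R:

```r
prior_mean <- 600
prior_sd <- 200
alpha <- prior_mean ^ 2 / prior_sd ^ 2  # shape: 9
beta <- prior_mean / prior_sd ^ 2       # rate: 0.015
# Gamma(alpha, beta) has mean alpha / beta and SD sqrt(alpha) / beta
c(shape = alpha, rate = beta, mean = alpha / beta, sd = sqrt(alpha) / beta)
```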
2. Simulate a value 𝜇 from a Normal(3400, 100) distribution, and indepen-
dently simulate a value of 𝜎 from a Gamma(600²/200², 600/200²) distribu-
tion. Given 𝜇 and 𝜎, simulate a value 𝑦 from a Normal(𝜇, 𝜎) distribution.
Repeat many times and summarize the simulated 𝑦 values to approximate
the prior predictive distribution. The results are below. According to
this prior model, we predict that most babies weigh between about 4.5
and 10.5 pounds, which seems reasonable based on what we know about
birthweights.

n_rep = 10000

mu_sim = rnorm(n_rep, 3400, sd = 100)

sigma_sim = rgamma(n_rep, 600 ^ 2 / 200 ^ 2, 600 / 200 ^ 2)

y_sim = rnorm(n_rep, mu_sim, sigma_sim)



hist(y_sim,
breaks = 50,
freq = FALSE,
main = "Prior Predictive Distribution",
xlab = "birthweight (grams)")

[Figure: histogram of the prior predictive distribution of birthweight (grams)]

quantile(y_sim, c(0.025, 0.975))

## 2.5% 97.5%
## 2086 4685

3. Assuming a Normal distribution doesn’t seem terrible, but it does appear
that maybe the tails are a little heavier than we would expect for a Normal
distribution, especially at the low birthweights. That is, there might
be some evidence in the data that extremely low (and maybe high) birth-
weights don’t quite follow what would be expected if birthweights followed
a Normal distribution.

4. See JAGS code below. A few notes:

• The JAGS code takes the full sample of size 1000 as an input.
• The data consists of 1000 values assumed to each be from a Normal(𝜇,
𝜎) distribution.
• Remember that in JAGS, it’s dnorm(mean, precision).

• To find the likelihood of observing the entire sample, JAGS finds the
likelihood of each of the individual values and then multiplies the
values together for us to find the likelihood of the sample.
• This is accomplished in JAGS by specifying the likelihood via a for
loop which evaluates the likelihood y[i] ~ dnorm(mu, 1 / sigma
^ 2) for each y[i] in the sample.
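To illustrate the product idea outside of JAGS, here is a minimal R sketch with a hypothetical two-value sample (the values 3400 and 3000 grams are made up for illustration); the likelihood of the sample is the product of the individual Normal densities, or equivalently the exponential of the summed log-densities:

```r
y_toy <- c(3400, 3000)    # hypothetical two-baby sample (grams)
mu <- 3300; sigma <- 600  # one candidate (mu, sigma) pair
# product of the individual likelihoods ...
lik_prod <- prod(dnorm(y_toy, mu, sigma))
# ... equals exp of the summed log-likelihoods (numerically safer for n = 1000)
lik_sum <- exp(sum(dnorm(y_toy, mu, sigma, log = TRUE)))
```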

Nrep = 10000
Nchains = 3

# data
# data has already been loaded in previous code
# y is the full sample
# n is the sample size

# model
model_string <- "model{

# Likelihood
for (i in 1:n){
y[i] ~ dnorm(mu, 1 / sigma ^ 2)
}

# Prior
mu ~ dnorm(3400, 1 / 100 ^ 2)

sigma ~ dgamma(600 ^ 2 / 200 ^ 2, 600 / 200 ^ 2)

}"

# Compile the model


dataList = list(y=y, n=n)

model <- jags.model(textConnection(model_string),


data=dataList,
n.chains=Nchains)

## Compiling model graph


## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 1000
## Unobserved stochastic nodes: 2
## Total graph size: 1017

##
## Initializing model

update(model, 1000, progress.bar="none")

posterior_sample <- coda.samples(model,


variable.names=c("mu", "sigma"),
n.iter=Nrep,
progress.bar="none")

# Summarize and check diagnostics


summary(posterior_sample)

##
## Iterations = 2001:12000
## Thinning interval = 1
## Number of chains = 3
## Sample size per chain = 10000
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## mu 3318 19.5 0.1125 0.115
## sigma 632 14.1 0.0815 0.105
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## mu 3280 3305 3318 3331 3356
## sigma 605 622 631 641 661

plot(posterior_sample)

[Figure: trace plots and posterior density plots of mu and sigma]

5. Simulate a (𝜇, 𝜎) pair from the posterior distribution; JAGS has already
done this for you. Then given 𝜇 and 𝜎 simulate a value of 𝑦 from a
Normal(𝜇, 𝜎) distribution. Repeat many times and summarize the sim-
ulated values of 𝑦 to approximate the posterior predictive distribution of
birthweights.

theta_sim = data.frame(as.matrix(posterior_sample))

y_sim = rnorm(Nrep, theta_sim$mu, theta_sim$sigma)

hist(y_sim, freq = FALSE, breaks = 50,


xlab = "Birthweight (grams)",
main = "Posterior predictive distribution")

[Figure: histogram of the posterior predictive distribution of birthweight (grams)]

quantile(y_sim, c(0.005, 0.995))

## 0.5% 99.5%
## 1714 4892

There is a posterior predictive probability of 99% that a randomly selected


birthweight is between 1714 and 4892 grams. Roughly, this model and
data say that 99% of birthweights are between 1714 and 4892 grams (3.8
and 10.8 pounds).

6. Find the proportion of values in the observed sample that lie outside of
the prediction interval

(sum(y < quantile(y_sim, 0.005)) + sum(y > quantile(y_sim, 0.995))) / n

## [1] 0.024

About 2.4 percent of birthweights in the sample fall outside of the 99%
prediction interval, when we would only expect 1%. While not a large dif-
ference in magnitude, we are observing a higher percentage of birthweights
in the tails than we would expect if birthweights followed a Normal dis-
tribution. So we have some evidence that a Normal model — that is, a
Normal likelihood — might not be the best model for birthweights of all
live births as it doesn’t properly account for extreme birthweights.

The code below performs a posterior predictive check by simulating hypothetical


samples of size 1000 from the posterior model, and comparing with the observed
sample of size 1000. The simulation is similar to the posterior predictive sim-
ulation in the previous example, but now every time we simulate a (𝜇, 𝜎) pair,
we simulate a random sample of 1000 𝑦 values. Again, while not a terrible fit,
there do seem to be more values in the tail — the lower tail especially — than
would be expected under this model.

# plot the observed data


hist(y, freq = FALSE, breaks = 50) # observed data

# number of samples to simulate


n_samples = 100

# simulate (mu, sigma) pairs from the posterior


# we just randomly select rows from theta_sim
index_sample = sample(Nrep, n_samples)

# simulate samples
for (r in 1:n_samples){

i = index_sample[r]

# simulate values from N(mu, sigma) distribution


y_sim = rnorm(n, theta_sim[i, "mu"], theta_sim[i, "sigma"])

# add plot of simulated sample to histogram


lines(density(y_sim),
col = rgb(135, 206, 235, max = 255, alpha = 25))
}

[Figure: histogram of observed birthweights with density curves of 100 simulated posterior predictive samples overlaid]

In the previous example we assumed a Normal likelihood. A Normal likelihood


assumes that the population distribution of individual values of the measured
numerical variable is Normal. Posterior predictive checking can be used to
assess whether a Normal likelihood is appropriate for the observed data. If a
Normal likelihood isn’t an appropriate model for the data then other likelihood
functions can be used. In particular, if the observed data is relatively unimodal
and symmetric2 but has more extreme values than can be accommodated by a
Normal likelihood, a 𝑡-distribution or other distribution with heavy tails can be
used to model the likelihood.

Normal distributions don’t allow much room for extreme values. An alternative
is to assume a distribution with heavier tails. For example, t-distributions have
heavier tails than Normal. For t-distributions, the degrees of freedom parameter
𝑑 ≥ 1 controls how heavy the tails are. When 𝑑 is small, the tails are much
heavier than for a Normal distribution, leading to a higher frequency of extreme
values. As 𝑑 increases, the tails get lighter and a 𝑡-distribution gets closer to a
Normal distribution. For 𝑑 greater than 30 or so, there is very little difference
between a 𝑡-distribution and a Normal distribution except in the extreme tails.
The degrees of freedom parameter 𝑑 is sometimes referred to as the “Normality
parameter”, with larger values of 𝑑 indicating a population distribution that is
closer to Normal.
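The heavier tails can be quantified directly: the probability of falling more than 3 units from center is much larger under a t-distribution with small 𝑑 than under the standard Normal distribution. A quick comparison in R:

```r
# P(|X| > 3) for the standard Normal vs t-distributions with small/moderate df
tail_prob <- function(q) {
  c(normal = 2 * pnorm(-q),
    t_3    = 2 * pt(-q, df = 3),
    t_30   = 2 * pt(-q, df = 30))
}
tail_prob(3)
```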

2 If the observed data has multiple modes or is skewed, then other parameters like median

or mode might be more appropriate measures of center than the population mean.

[Figure: density curves of the Normal(0, 1), t(1), and t(3) distributions]

Example 15.2. Continuing the birthweight example, we’ll now model the dis-
tribution of birthweights with a 𝑡(𝜇, 𝜎, 𝑑) distribution.

1. How many parameters is the likelihood function based on? What are
they?
2. What does assigning a prior distribution to the Normality parameter 𝑑
represent?
3. The Normality parameter must satisfy 𝑑 ≥ 1, so we want a distribution
for which only values above 1 are possible. One way to accomplish this is
to let 𝑑0 = 𝑑 − 1, assign a Gamma distribution prior to 𝑑0 ≥ 0, and then
let 𝑑 = 1 + 𝑑0 . One common approach is to let the shape parameter of the
Gamma distribution be 1 and the rate parameter be 1/29, so that the prior
mean of 𝑑0 is 29 and hence the prior mean of 𝑑 is 30. Assume the same priors for 𝜇 and 𝜎 as in
the previous example, and a Gamma(1, 1/29) prior for (𝑑 − 1). Use JAGS
to fit the model to the birthweight data and approximate and summarize
the posterior distribution.
4. Consider the posterior distribution for 𝑑. Based on this posterior distri-
bution, is it plausible that birthweights follow a Normal distribution?
5. Consider the posterior distribution for 𝜎. What seems strange about this
distribution? (Hint: consider the sample SD.)
6. The standard deviation of a Normal(𝜇, 𝜎) distribution is 𝜎. However,
the standard deviation of a 𝑡(𝜇, 𝜎, 𝑑) distribution is not 𝜎; rather it is
𝜎√(𝑑/(𝑑 − 2)) > 𝜎. When 𝑑 is large, √(𝑑/(𝑑 − 2)) ≈ 1, and so the standard
deviation is approximately 𝜎. However, it can make a difference when 𝑑 is
small. Using the JAGS output, create a plot of the posterior distribution of
𝜎√(𝑑/(𝑑 − 2)). Does this posterior distribution of the population standard
deviation seem more reasonable in light of the sample data?

7. How could you use simulation to approximate the posterior predictive


distribution of birthweights? Run the simulation and find a 99% posterior
prediction interval. How does it compare to the predictive interval from
the model with the Normal likelihood?

Solution to Example 15.2.

1. The 𝑡(𝜇, 𝜎, 𝑑) likelihood is based on 3 parameters: the population mean 𝜇,
the variability parameter 𝜎 (which is NOT the population SD; see below), and
the Normality parameter 𝑑.

2. Assigning a prior distribution to 𝑑 allows for a posterior distribution of 𝑑,


which quantifies uncertainty about the degree of Normality versus “heavy-
tailed-ness” of the distribution of birthweights. Assigning a prior distri-
bution on 𝑑 allows the model to explore different values of 𝑑 to see what
values seem most plausible given the observed data.
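As a preview of the prior on 𝑑 from part 3: since a Gamma distribution with shape 1 is an Exponential distribution, the prior on 𝑑 can be simulated as 1 plus an Exponential(rate = 1/29) draw. A quick sketch:

```r
set.seed(1)
d0 <- rexp(100000, rate = 1 / 29)  # Gamma(1, 1/29) = Exponential with rate 1/29
d <- 1 + d0                        # guarantees d >= 1; prior mean is 1 + 29 = 30
c(min = min(d), mean = mean(d))
```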

3. See the JAGS code below and output below. Note that the posterior
distribution for 𝜇 is similar to the posterior distribution for 𝜇 from the
model with the Normal likelihood.

Nrep = 10000
Nchains = 3

# data
# data has already been loaded in previous code
# y is the full sample
# n is the sample size

# model
model_string <- "model{

# Likelihood
for (i in 1:n){
y[i] ~ dt(mu, 1 / sigma ^ 2, tdf)
}

# Prior

mu ~ dnorm(3400, 1 / 100 ^ 2)

sigma ~ dgamma(600 ^ 2 / 200 ^ 2, 600 / 200 ^ 2)

tdf <- 1 + tdf0

tdf0 ~ dexp(1 / 29)

}"

# Compile the model


dataList = list(y=y, n=n)

model <- jags.model(textConnection(model_string),


data=dataList,
n.chains=Nchains)

## Compiling model graph


## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 1000
## Unobserved stochastic nodes: 3
## Total graph size: 1021
##
## Initializing model

update(model, 1000, progress.bar="none")

posterior_sample <- coda.samples(model,


variable.names=c("mu", "sigma", "tdf"),
n.iter=Nrep,
progress.bar="none")

# Summarize and check diagnostics


summary(posterior_sample)

##
## Iterations = 2001:12000
## Thinning interval = 1
## Number of chains = 3
## Sample size per chain = 10000
##
## 1. Empirical mean and standard deviation for each variable,

## plus standard error of the mean:


##
## Mean SD Naive SE Time-series SE
## mu 3363.69 17.124 0.09886 0.12760
## sigma 467.64 17.850 0.10306 0.20079
## tdf 4.32 0.593 0.00342 0.00703
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## mu 3330.05 3352.21 3363.67 3375.15 3397.26
## sigma 433.34 455.48 467.48 479.61 503.24
## tdf 3.33 3.91 4.27 4.68 5.65

plot(posterior_sample)

[Figure: trace plots and posterior density plots of mu, sigma, and tdf]

# Extract the JAGS output as a data frame


theta_sim = data.frame(as.matrix(posterior_sample))

4. The posterior distribution for 𝑑 places most of its probability on relatively
small values of 𝑑. A 98% posterior credible interval for 𝑑 is 3.2 to 6. Values
in this interval indicate relatively heavy tails in comparison to a Normal
distribution. Therefore, the posterior distribution indicates that it is not
plausible that birthweights follow a Normal distribution.
5. The posterior distribution for 𝜎 might seem strange at first. The sample
standard deviation is 631, but this value is not included in the range of
plausible values of 𝜎 according to the posterior. This is an artifact of the
interplay between 𝜎 and 𝑑 in 𝑡-distributions, especially when 𝑑 is small. See
the next part for more details.

6. See below. This posterior distribution seems more reasonable given the
sample SD.

hist(theta_sim$sigma * sqrt(theta_sim$tdf / (theta_sim$tdf - 2)),


freq = FALSE,
breaks = 50,
main = "Posterior distribution",
xlab = expression(paste(sigma, " ", sqrt(frac(d, d-2)))))

[Figure: histogram of the posterior distribution of 𝜎√(𝑑/(𝑑 − 2))]
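As a rough numeric check, we can plug the posterior means 𝜎 ≈ 467.64 and 𝑑 ≈ 4.32 from the JAGS summary above into the formula (this is only a point estimate, not a full posterior summary):

```r
sigma_hat <- 467.64  # posterior mean of sigma from the summary above
d_hat <- 4.32        # posterior mean of the Normality parameter d
# implied population SD of the t(mu, sigma, d) distribution
sd_implied <- sigma_hat * sqrt(d_hat / (d_hat - 2))
sd_implied
```

which is around 638 grams, much closer to the sample SD of 631 than 𝜎 alone is.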

7. Simulate a triple (𝜇, 𝜎, 𝑑) from the joint posterior distribution; JAGS has
already done this for you. Given (𝜇, 𝜎, 𝑑), simulate a value 𝑦 from a
𝑡(𝜇, 𝜎, 𝑑) distribution. Repeat many times and summarize the simulated
𝑦 values to approximate the posterior predictive distribution.
See the code and output below. The posterior predictive distribution now
spans a wider range of birthweights than with the Normal model.

y_sim = theta_sim$mu + theta_sim$sigma * rt(Nrep, theta_sim$tdf)

hist(y_sim, freq = FALSE, breaks = 50,


xlab = "Birthweight (grams)",
main = "Posterior predictive distribution")

[Figure: histogram of the posterior predictive distribution of birthweight (grams) under the t model]

quantile(y_sim, c(0.005, 0.995))

## 0.5% 99.5%
## 1104 5605

The code below performs a posterior predictive check by simulating hypothetical
samples of size 1000 from the posterior model and comparing them with the
observed sample of size 1000. The simulation is similar to the posterior
predictive simulation in the previous example, but now every time we simulate a
(𝜇, 𝜎, 𝑑) triple from the JAGS output, we simulate a random sample of 1000 𝑦
values from the corresponding 𝑡(𝜇, 𝜎, 𝑑) distribution. (rt in R returns values
on a standardized scale, which we then rescale by 𝜎 and shift by 𝜇.)

# plot the observed data


hist(y, freq = FALSE, breaks = 50,
ylim = c(0, 0.001)) # observed data

# number of samples to simulate


n_samples = 100

# simulate (mu, sigma, d) triples from the posterior


# we just randomly select rows from theta_sim
index_sample = sample(Nrep, n_samples)

# simulate samples
for (r in 1:n_samples){

  i = index_sample[r]

  # simulate values from the t-distribution with the simulated (mu, sigma, tdf)
  y_sim = theta_sim[i, "mu"] + theta_sim[i, "sigma"] * rt(n, theta_sim[i, "tdf"])

  # add plot of simulated sample to histogram
  lines(density(y_sim),
        col = rgb(135, 206, 235, max = 255, alpha = 25))
}

[Figure: histogram of observed birthweights with density curves of simulated posterior predictive samples (t model) overlaid]

We see that the actual sample resembles simulated samples more closely for
the model based on the t-distribution likelihood than for the one based on the
Normal likelihood. While we do want a model that fits the data well, we also
do not want to risk overfitting the data. In this case, we do not want a few
extreme outliers to unduly influence the model. However, it does appear that a
model that allows for heavier tails than a Normal distribution could be useful
here. Moreover, accommodating the tails improves the fit in the center of the
distribution too.
When the degrees of freedom are very small, 𝑡-distributions can give rise to
extremely large or small values. We see this in the posterior predictive
distribution for birthweights, where there are some negative birthweights and
birthweights over 10000 grams. The model based on the 𝑡-likelihood seems to fit
well over the range of observed values of birthweight. As usual, we should
refrain from making predictions outside the range of the observed data.


In models with multiple parameters, there can be dependencies between
parameters, so interpreting the marginal posterior distribution of any single
parameter can be difficult. It is often more helpful to consider predictive
distributions, which account for the joint distribution of all parameters.
Interpreting predictive distributions is often more intuitive since predictive
distributions live on the scale of the measured variable.
The observational units in the examples in this section are live births. Many
of the extremely low birthweights are attributed to babies who were born live
but did not survive. Rather than model birthweights with one single likelihood,
it might be more appropriate to first categorize the births and use the “group”
variable in the model. For example, we might include a variable to indicate
whether a baby is full term or not (or even just the length of the pregnancy).
We will see how to compare a numerical variable across groups in upcoming
sections.
Chapter 16

Comparing Two Samples

Most interesting statistical problems involve multiple unknown parameters. For


example, many problems involve comparing two (or more) populations or groups
based on two (or more) samples. In such situations, each population or group
will have its own parameters, and there will often be dependence between pa-
rameters. We are usually interested in difference or ratios of parameters between
groups.
The example below concerns the familiar context of comparing two means. How-
ever, the way independence is treated in the example is not the most common.
Soon we will see hierarchical models for comparing groups.

Example 16.1. Do newborns born to mothers who smoke tend to weigh less
at birth than newborns from mothers who don’t smoke? We’ll investigate this
question using birthweight (pounds) data on a sample of births in North Car-
olina over a one year period.
Assume birthweights follow a Normal distribution with mean 𝜇1 for nonsmokers
and mean 𝜇2 for smokers, and standard deviation 𝜎.
Note: our primary goal will be to compare the means 𝜇1 and 𝜇2 . We’re assuming
a common standard deviation 𝜎 to simplify a little, but we could (and probably
should) also let standard deviation vary by smoking status.

1. The prior distribution will be a joint distribution on (𝜇1 , 𝜇2 , 𝜎) triples.


We could assume a prior under which 𝜇1 and 𝜇2 are independent. But
why might we want to assume some prior dependence between 𝜇1 and 𝜇2 ?
(For some motivation it might help to consider what the frequentist null
hypothesis would be.)

2. One way to incorporate prior dependence is to assume a Multivariate


Normal prior distribution. For the prior assume:


• 𝜎 is independent of (𝜇1 , 𝜇2 )
• 𝜎 has a Gamma(1, 1) distribution
• (𝜇1 , 𝜇2 ) follow a Bivariate Normal distribution with prior means (7.5,
7.5) pounds, prior standard deviations (0.5, 0.5) pounds, and prior
correlation 0.9.

Simulate values of (𝜇1 , 𝜇2 ) from the prior distribution1 and plot them.
Briefly2 describe the prior distribution.

3. How do you interpret the parameter 𝜇1 − 𝜇2 ? Plot the prior distribution


of 𝜇1 − 𝜇2 , and find prior central 50%, 80%, and 98% credible interval.
Also compute the prior probability that 𝜇1 − 𝜇2 > 0.

4. The following code loads and summarizes the sample data. Briefly describe
the data.

data = read.csv("_data/baby_smoke.csv")

ggplot(data, aes(weight, fill = habit)) +


geom_histogram(alpha = 0.3,
aes(y = ..density..),
position = 'identity')

1 Values from a Multivariate Normal distribution can be simulated using mvrnorm from the

MASS package. For Bivariate Normal, the inputs are the mean vector [𝐸(𝜇1 ), 𝐸(𝜇2 )] and the
covariance matrix

[ Var(𝜇1 )        Cov(𝜇1 , 𝜇2 ) ]
[ Cov(𝜇1 , 𝜇2 )   Var(𝜇2 )      ]

where Cov(𝜇1 , 𝜇2 ) = Corr(𝜇1 , 𝜇2 )SD(𝜇1 )SD(𝜇2 ).


2 Why briefly? Because we want to focus on the posterior distribution.

[Figure: overlaid histograms of birthweight (pounds) by smoking habit]

data %>%
group_by(habit) %>%
summarize(n(), mean(weight), sd(weight)) %>%
kable(digits = 2)

habit       n()   mean(weight)   sd(weight)
nonsmoker   873   7.14           1.52
smoker      126   6.83           1.39

5. Is it reasonable to assume that the two samples are independent? (In


this case the 𝑦1 and 𝑦2 samples would be conditionally independent given
(𝜇1 , 𝜇2 , 𝜎).)

6. Describe how you would compute the likelihood. For concreteness, how
would you compute the likelihood if there were only 4 babies in the
sample: 2 non-smokers with birthweights of 8 pounds and 7 pounds, and
2 smokers with birthweights of 8.3 pounds and 7.1 pounds.

7. Use JAGS to approximate the posterior distribution. (The coding is a


little tricky. See the code and some comments below.) Plot the posterior
distribution. How strong is the dependence between 𝜇1 and 𝜇2 in the
posterior? Why do you think that is?

8. Plot the posterior distribution of 𝜇1 − 𝜇2 and describe it. Compute and


interpret posterior central 50%, 80%, and 98% credible intervals. Also
compute and interpret the posterior probability that 𝜇1 − 𝜇2 > 0.

9. If we’re interested in 𝜇1 −𝜇2 , why didn’t we put a prior directly on 𝜇1 −𝜇2


rather than on (𝜇1 , 𝜇2 )?

10. Plot the posterior distribution of 𝜇1 /𝜇2 , describe it, and find and interpret
posterior central 50%, 80%, and 98% credible intervals.

11. Is there some evidence that babies whose mothers smoke tend to weigh
less than those whose mothers don’t smoke?

12. Can we say that smoking is the cause of the difference in mean weights?

13. Is there some evidence that babies whose mothers smoke tend to weigh
much less than those whose mothers don’t smoke? Explain.

14. One quantity of interest is the effect size, which is a way of measuring
the magnitude of the difference between groups. When comparing two
means, a simple measure of effect size (Cohen’s 𝑑) is
(𝜇1 − 𝜇2 )/𝜎
Plot the posterior distribution of this effect size and describe it. Compute
and interpret posterior central 50%, 80%, and 98% credible intervals.
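Once posterior draws of (𝜇1 , 𝜇2 , 𝜎) are available, the posterior distribution of this effect size is just a draw-by-draw transformation. A minimal sketch with a few made-up draws standing in for the JAGS output (the sim data frame here is purely illustrative):

```r
# hypothetical posterior draws standing in for JAGS output
sim <- data.frame(mu1   = c(7.15, 7.10, 7.18),
                  mu2   = c(6.85, 6.90, 6.80),
                  sigma = c(1.50, 1.48, 1.52))
effect_size <- (sim$mu1 - sim$mu2) / sim$sigma  # Cohen's d, one value per draw
effect_size
```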

Solution to Example 16.1.

1. If our prior belief is that there is no difference in mean birthweights
between babies of smokers and non-smokers, then our prior should place
high probability on 𝜇1 being close to 𝜇2 . Even if we want our prior to
allow for different distributions for 𝜇1 and 𝜇2 , there might still be some
dependence. For example, we would assign a different prior conditional
probability to the event that 𝜇1 > 8.5 given 𝜇2 > 8.5 than we would
given 𝜇2 < 7.5. Our prior uncertainty about mean birthweights of babies
of nonsmokers informs our prior uncertainty about mean birthweights of
babies in general, and hence also of babies of smokers.

2. Our main focus is on (𝜇1 , 𝜇2 ). We see that the prior places high density
on (𝜇1 , 𝜇2 ) pairs with 𝜇1 close to 𝜇2 .

mu_prior_mean <- c(7.5, 7.5)


mu_prior_sd <- c(0.5, 0.5)
mu_prior_corr <- 0.9

mu_prior_cov <- matrix(c(mu_prior_sd[1] ^ 2,


mu_prior_corr * mu_prior_sd[1] * mu_prior_sd[2],
mu_prior_corr * mu_prior_sd[1] * mu_prior_sd[2],
mu_prior_sd[2] ^ 2), nrow = 2)

library(MASS)

sim_prior = data.frame(mvrnorm(10000, mu_prior_mean, mu_prior_cov),


rgamma(10000, 1, 1))

names(sim_prior) = c("mu1", "mu2", "sigma")

ggplot(sim_prior, aes(mu1, mu2)) +


geom_point(color = "skyblue", alpha = 0.4) +
stat_ellipse(level = 0.98, color = "black", size = 2) +
stat_density_2d(color = "grey", size = 1) +
geom_abline(intercept = 0, slope = 1)

ggplot(sim_prior, aes(mu1, mu2)) +


stat_density_2d(aes(fill = ..level..),
geom = "polygon", color = "white") +
scale_fill_viridis_c() +
geom_abline(intercept = 0, slope = 1)

[Figure: scatterplot and density contour plots of the prior distribution of (mu1, mu2)]

3. The parameter 𝜇1 − 𝜇2 is the difference in mean birthweights, non-smokers
minus smokers. The prior mean of 𝜇1 − 𝜇2 is 0, reflecting a prior
belief towards no difference in mean birthweight between smokers and
non-smokers. Furthermore, there is a fairly high prior probability that
the mean birthweight for smokers is close to the mean birthweight for
non-smokers, with a difference of at most about 0.5 pounds. Under this
prior, nonsmokers and smokers are equally likely to have the higher mean
birthweight.

sim_prior <- sim_prior %>%


mutate(mu_diff = mu1 - mu2)

ggplot(sim_prior,
aes(mu_diff)) +
geom_histogram(aes(y=..density..), color = "black", fill = "white") +
geom_density(size = 1, color = "skyblue") +
labs(x = "Difference in population mean birthweight (pounds, non-smokers - smokers)",
title = "Prior Distribution")

[Figure: histogram and density of the prior distribution of the difference in population mean birthweight (pounds, non-smokers − smokers)]

quantile(sim_prior$mu_diff, c(0.01, 0.10, 0.25, 0.75, 0.90, 0.99))

## 1% 10% 25% 75% 90% 99%
## -0.5173 -0.2863 -0.1505 0.1540 0.2825 0.5297

sum(sim_prior$mu_diff > 0 ) / 10000

## [1] 0.5059

4. The distributions of birthweights are fairly similar for smokers and non-
smokers. The sample mean birthweight for smokers is about 0.3 pounds
less than the sample mean birthweight for non-smokers. The sample SDs of
birthweights are similar for both groups, around 1.4-1.5 pounds.
5. Yes, it is reasonable to assume that the two samples are independent. The
data for smokers was collected separately from the data for non-smokers.
That is, it is reasonable to assume independence in the data.
6. For each observed value of birthweight for non-smokers, evaluate the like-
lihood based on a 𝑁 (𝜇1 , 𝜎) distribution. For example, if birthweight of
a non-smoker is 8 pounds, the likelihood is dnorm(8, mu1, sigma); if
birthweight of a non-smoker is 7 pounds, the likelihood is dnorm(7, mu1,

sigma). The likelihood for the sample of non-smokers would be the product
— assuming independence within the sample — of the likelihoods of the
individual values, as a function of 𝜇1 and 𝜎: dnorm(8, mu1, sigma) *
dnorm(7, mu1, sigma) * ...
The likelihood for the sample of smokers would be the product of the
likelihoods of the individual values, as a function of 𝜇2 and 𝜎: dnorm(8.3,
mu2, sigma) * dnorm(7.1, mu2, sigma) * ...
The likelihood function for the full sample would be the product — as-
suming independence between samples — of the likelihoods for the two
samples, a function of 𝜇1 , 𝜇2 and 𝜎.
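A minimal R sketch of this computation for the hypothetical 4-baby sample, using a group index to pick out each baby's mean (the same idea as mu[x[i]] in the JAGS code):

```r
y_toy <- c(8, 7, 8.3, 7.1)  # 2 non-smokers, then 2 smokers (pounds)
x_toy <- c(1, 1, 2, 2)      # group index: 1 = non-smoker, 2 = smoker
lik <- function(mu, sigma) {
  # mu = c(mu1, mu2); mu[x_toy] selects each baby's group mean
  prod(dnorm(y_toy, mu[x_toy], sigma))
}
lik(c(7.5, 7.0), 1.5)  # likelihood at one candidate (mu1, mu2, sigma)
```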

7. Here is the code; there are some comments about syntax at the end of this
chapter.

# data
y = data$weight

x = (data$habit == "smoker") + 1

n = length(y)

n_groups = 2

# Prior parameters
mu_prior_mean <- c(7.5, 7.5)
mu_prior_sd <- c(0.5, 0.5)
mu_prior_corr <- 0.9

mu_prior_cov <- matrix(c(mu_prior_sd[1] ^ 2,


mu_prior_corr * mu_prior_sd[1] * mu_prior_sd[2],
mu_prior_corr * mu_prior_sd[1] * mu_prior_sd[2],
mu_prior_sd[2] ^ 2), nrow = 2)

# Model
model_string <- "model{

# Likelihood
for (i in 1:n){
y[i] ~ dnorm(mu[x[i]], 1 / sigma ^ 2)
}

# Prior
mu[1:n_groups] ~ dmnorm.vcov(mu_prior_mean[1:n_groups],
mu_prior_cov[1:n_groups, 1:n_groups])

sigma ~ dgamma(1, 1)

}"

dataList = list(y = y, x = x, n = n, n_groups = n_groups,


mu_prior_mean = mu_prior_mean, mu_prior_cov = mu_prior_cov)

# Compile
Nrep = 10000

n.chains = 5

model <- jags.model(textConnection(model_string),


data = dataList,
n.chains = n.chains)

## Compiling model graph


## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 999
## Unobserved stochastic nodes: 2
## Total graph size: 2016
##
## Initializing model

# Simulate
update(model, 1000, progress.bar = "none")

posterior_sample <- coda.samples(model,


variable.names = c("mu", "sigma"),
n.iter = Nrep,
progress.bar = "none")

sim_posterior = as.data.frame(as.matrix(posterior_sample))
names(sim_posterior) = c("mu1", "mu2", "sigma")
head(sim_posterior)

## mu1 mu2 sigma
## 1 7.050 7.035 1.528
## 2 7.069 7.069 1.457
## 3 7.082 7.091 1.464
## 4 7.090 7.089 1.485
## 5 7.055 7.000 1.476
## 6 7.075 6.931 1.475

ggplot(sim_posterior, aes(mu1, mu2)) +


geom_point(color = "seagreen", alpha = 0.4) +
stat_ellipse(level = 0.98, color = "black", size = 2) +
stat_density_2d(color = "grey", size = 1) +
geom_abline(intercept = 0, slope = 1)

[Figure: scatterplot with density contours of the posterior distribution of (mu1, mu2)]

ggplot(sim_posterior, aes(mu1, mu2)) +
  stat_density_2d(aes(fill = ..level..),
                  geom = "polygon", color = "white") +
  scale_fill_viridis_c() +
  geom_abline(intercept = 0, slope = 1)

cor(sim_posterior$mu1, sim_posterior$mu2)

## [1] 0.127

The posterior mean of 𝜇1 is close to the sample mean birthweight for
non-smokers, and the posterior mean of 𝜇2 is close to the sample mean
birthweight for smokers. The posterior SD of 𝜇1 is smaller than that of
𝜇2 , reflecting the larger sample size for non-smokers than smokers. The
posterior distribution places most of its probability on (𝜇1 , 𝜇2 ) pairs with
𝜇1 > 𝜇2 , representing a much stronger belief (than prior) that mean birth-
weight for smokers is less than for non-smokers. The posterior correlation
between 𝜇1 and 𝜇2 is about 0.13, which is much smaller than the prior
correlation. Even though there was fairly strong dependence between 𝜇1
and 𝜇2 in the prior, there was independence between the samples in the
data, represented in the likelihood. With the large sample sizes (especially
for non-smokers) the data has more influence on the posterior than the
prior does.
8. The code below summarizes the posterior distribution of the difference
in means 𝜇1 − 𝜇2 . JAGS has already simulated (𝜇1 , 𝜇2 ) pairs from the
posterior distribution, so we just need to compute 𝜇1 − 𝜇2 for each pair.

sim_posterior = sim_posterior %>%
  mutate(mu_diff = mu1 - mu2)

ggplot(sim_posterior,
       aes(mu_diff)) +
  geom_histogram(aes(y = ..density..), color = "black", fill = "white") +
  geom_density(size = 1, color = "seagreen") +
  labs(x = "Difference in population mean birthweight (pounds, non-smokers - smokers)",
       title = "Posterior Distribution")


quantile(sim_posterior$mu_diff, c(0.01, 0.10, 0.25, 0.75, 0.90, 0.99))

## 1% 10% 25% 75% 90% 99%
## -0.06925 0.05636 0.13074 0.29190 0.36488 0.48873

sum(sim_posterior$mu_diff > 0 ) / length(sim_posterior$mu_diff)

## [1] 0.9604

The posterior distribution of 𝜇1 − 𝜇2 is approximately Normal. The posterior
mean of 𝜇1 − 𝜇2 is about 0.21 pounds, which is a compromise between
the prior mean of 𝜇1 − 𝜇2 of 0 (no difference) and the difference in sample
means of about 0.31 pounds.
There is a posterior probability of 50% that mean birthweight for non-
smokers is between 0.13 and 0.29 pounds greater than mean birthweight
for smokers.
There is a posterior probability of 80% that mean birthweight for non-
smokers is between 0.06 and 0.36 pounds greater than mean birthweight
for smokers.
288 CHAPTER 16. COMPARING TWO SAMPLES

There is a posterior probability of 98% that mean birthweight for non-
smokers is between 0.07 pounds less and 0.49 pounds greater than mean
birthweight for smokers.

There is a posterior probability of about 96 percent that the mean birth-
weight for non-smokers is greater than the mean birthweight for smokers.

9. Putting a prior directly on 𝜇1 − 𝜇2 does allow us to make inference about
the difference in mean birthweights. But what if we also want to estimate
the mean birthweight for each group? Having a posterior distribution just
on the difference between the two groups does not allow us to estimate the
mean for either group. Also, 𝜇1 − 𝜇2 is the absolute difference in means,
but what if we want to measure the difference in relative terms? Putting a
prior distribution on (𝜇1 , 𝜇2 ) enables us to make posterior inference about
mean birthweight for both non-smokers and smokers and any parameter
(difference, ratio) that depends on the means.

10. See code and output below. JAGS has already simulated (𝜇1 , 𝜇2 ) pairs
from the posterior distribution, so we just need to compute 𝜇1 /𝜇2 for each
pair.

sim_posterior = sim_posterior %>%
  mutate(mu_ratio = mu1 / mu2)

ggplot(sim_posterior,
       aes(mu_ratio)) +
  geom_histogram(aes(y = ..density..), color = "black", fill = "white") +
  geom_density(size = 1, color = "seagreen") +
  labs(x = "Ratio of population mean birthweight (non-smokers / smokers)",
       title = "Posterior Distribution")


quantile(sim_posterior$mu_ratio, c(0.01, 0.10, 0.25, 0.75, 0.90, 0.99))

## 1% 10% 25% 75% 90% 99%
## 0.9903 1.0080 1.0187 1.0426 1.0537 1.0732

The posterior distribution of 𝜇1 /𝜇2 is approximately Normal. The posterior
mean of 𝜇1 /𝜇2 is about 1.03, which is a compromise between the
prior mean of 𝜇1 /𝜇2 = 1 (no difference) and the ratio of sample means of
1.046.
There is a posterior probability of 50% that the mean birthweight of non-
smokers is between 1.02 and 1.04 times the mean birthweight of smokers.
There is a posterior probability of 80% that the mean birthweight of non-
smokers is between 1.01 and 1.05 times the mean birthweight of smokers.
There is a posterior probability of 98% that the mean birthweight of non-
smokers is between 0.99 and 1.07 times the mean birthweight of smokers.

11. Yes, there is some evidence. Even though we started with fairly strong
prior credibility of no difference, with the relatively large sample sizes, the
difference in sample means observed in the data was enough to overturn
the prior beliefs. Now, the 98% credible interval for 𝜇1 − 𝜇2 does contain
0, indicating some plausibility of no difference. But there’s nothing special

about 98% credibility, and we should look at the whole posterior distri-
bution. According to our posterior distribution, we place a high degree of
plausibility on the mean birthweight for smokers being less than the mean
birthweight of non-smokers.

12. The question of causation has nothing to do with whether we are doing
a Bayesian or frequentist analysis. Rather, the question of causation con-
cerns: how were the data collected? In particular, was this an experiment
with random assignment of the explanatory variable? It wasn’t; it was an
observational study (you can’t randomly assign some mothers to smoke).
Therefore, there is potential for confounding variables. Maybe mothers
who smoke tend to be less healthy in general than mothers who don’t
smoke, and maybe some other aspect of health is more closely associated
with lower birthweight than smoking is.

13. The posterior distribution of 𝜇1 − 𝜇2 does not give much plausibility to
large differences in mean birthweight. Almost all of the posterior proba-
bility is placed on the absolute difference being less than 0.5 pounds, and
the relative difference being no more than 1.07 times. Just because we
have evidence that there is a difference, doesn’t necessarily mean that the
difference is large in practical terms.

14. The observed effect size is about 0.3/1.5 = 0.2. Birthweights vary naturally
from baby to baby by about 1.5 pounds, so a difference of 0.3 pounds
seems relatively small. The sample mean birthweight for non-smokers
is 0.2 standard deviations greater than the sample mean birthweight for
smokers.

The following simulates and summarizes the posterior distribution of the
population effect size. JAGS has already simulated (𝜇1 , 𝜇2 , 𝜎) triples for
us; we just need to compute (𝜇1 − 𝜇2 )/𝜎 for each triple.

sim_posterior = sim_posterior %>%
  mutate(effect_size = (mu1 - mu2) / sigma)

ggplot(sim_posterior,
       aes(effect_size)) +
  geom_histogram(aes(y = ..density..), color = "black", fill = "white") +
  geom_density(size = 1, color = "seagreen") +
  labs(x = "Effect size (non-smokers - smokers)",
       title = "Posterior Distribution")

quantile(sim_posterior$effect_size, c(0.01, 0.10, 0.25, 0.75, 0.90, 0.99))

## 1% 10% 25% 75% 90% 99%
## -0.04583 0.03741 0.08694 0.19396 0.24273 0.32664

The posterior mean of (𝜇1 − 𝜇2)/𝜎 is about 0.14, which is a compromise
between the prior mean of (𝜇1 − 𝜇2 )/𝜎 of 0 (no difference) and the sample
effect size of 0.2.
There is a posterior probability of 50% that the mean birthweight of non-
smokers is between 0.09 and 0.19 standard deviations greater than the
mean birthweight of smokers.
There is a posterior probability of 80% that the mean birthweight of non-
smokers is between 0.04 and 0.24 standard deviations greater than the
mean birthweight of smokers.
There is a posterior probability of 98% that the mean birthweight of non-
smokers is between 0.05 standard deviations less than and 0.33 standard
deviations greater than the mean birthweight of smokers.
The posterior distribution indicates that the effect size is pretty small.
The difference between mean birthweight of smokers and non-smokers is
small relative to the variability in birthweights.
(Of course, smoking has many other adverse health effects. But looking at
birthweight alone, based on this data set we cannot conclude that there is
a large difference in mean birthweight between smokers and non-smokers.)

The previous example introduced many important ideas.


Independence in the data versus in the prior/posterior
It is typical to assume independence in the data, e.g., independence of values of
the measured variables within and between samples (conditional on the param-
eters). Whether independence in the data is a reasonable assumption depends
on how the data is collected.
But whether it is reasonable to assume prior independence of parameters is a
completely separate question and is dependent upon our subjective beliefs about
any relationships between parameters.
Transformations of parameters
The primary output of a Bayesian data analysis is the full joint posterior dis-
tribution on all parameters. Given the joint distribution, the distribution of
transformations of the primary parameters is readily obtained.
Effect size for comparing means
When comparing groups, a more important question than “is there a difference?”
is “how large is the difference?” An effect size is a measure of the magnitude of
a difference between groups. A difference in parameters can be used to measure
the absolute size of the difference in the measurement units of the variable, but
effect size can also be measured as a relative difference.
When comparing a numerical variable between two groups, one measure of the
population effect size is Cohen's 𝑑:

𝑑 = (𝜇1 − 𝜇2) / 𝜎

The values of any numerical variable vary naturally from unit to unit. The SD
of the numerical variable measures the degree to which individual values of the
variable vary naturally, so the SD provides a natural “scale” for the variable.
Cohen’s 𝑑 compares the magnitude of the difference in means relative to the
natural scale (SD) for the variable.
Some rough guidelines for interpreting |𝑑|:

d            0.2    0.5     0.8    1.2         2.0
Effect size  Small  Medium  Large  Very Large  Huge

For example, assume the two population distributions are Normal and the two
population standard deviations are equal. Then when the effect size is 1.0 the
median of the distribution with the higher mean is the 84th percentile of the
distribution with the lower mean, which is a very large difference.

d                            0.2    0.5     0.8    1.0   1.2         2.0
Effect size                  Small  Medium  Large        Very Large  Huge
Median of population 1 is
(blank) percentile of pop. 2 58th   69th    79th   84th  89th        98th
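Under these Normality and equal-SD assumptions, the percentile row is just the
standard Normal CDF evaluated at 𝑑 (the median of population 1 is 𝑑 standard
deviations above the mean of population 2), which can be checked in R:

```r
# Percentile of population 2 at the median of population 1,
# assuming Normal populations with equal SDs and means differing by d SDs
d <- c(0.2, 0.5, 0.8, 1.0, 1.2, 2.0)
round(100 * pnorm(d), 1)

## [1] 57.9 69.1 78.8 84.1 88.5 97.7
```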

Notes on the JAGS code.

• You should be able to define the prior parameters for the Multivariate
Normal distribution within JAGS, but I keep getting an error. So I’m
defining prior parameters outside of JAGS and then passing them in with
the data. (I can never remember what you can do in JAGS and what you
need to do in R and pass to JAGS.)


• The Bivariate Normal prior is coded in JAGS using dmnorm.vcov which
has two parameters: a mean vector and a covariance matrix. (There is
also dmnorm, which is parameterized by the precision matrix.)
• The prior mu[1:2] ~ dmnorm(...) creates a vector mu with two com-
ponents mu[1] and mu[2]. When "mu" is called in the variable.names
= c("mu", "sigma") argument of coda.samples JAGS will return the
vector mu — that is, both components mu[1] and mu[2]. See the output
of posterior_sample.
• Group variables (like non-smoker/smoker) need to be coded as numbers in
JAGS, starting with 1. So x recodes smoking status as 1 for non-smokers
and 2 for smokers.
• We have data on individual birthweights, so we evaluate the likelihood of
each individual value y[i] using a Normal distribution and then use a for
loop to find the likelihood for the sample.
• Notice that the mean used in the likelihood depends on the group:
mu[x[i]]. For example, if element i has birthweight y[i] = 8 and is a
non-smoker x[i] = 1, then the likelihood is evaluated using a Normal
distribution with mean 𝜇1 ; for this element y[i] ~ dnorm(mu[x[i]],
...) in JAGS is like calling dnorm(y[i], mu[x[i]], ...) = dnorm(8,
mu[1], ...) in R. If element i has birthweight y[i] = 7.3 and is a
smoker x[i] = 2, then the likelihood is evaluated using a Normal distri-
bution with mean 𝜇2 ; for this element y[i] ~ dnorm(mu[x[i]], ...)
in JAGS is like calling dnorm(y[i], mu[x[i]], ...) = dnorm(7.3,
mu[2], ...) in R.
• The variable.names = c("mu", "sigma") argument of coda.samples
tells JAGS which simulation output to save. Given the joint posterior
distribution of the primary parameters, it is relatively easy to obtain the
posterior distribution of transformations of these parameters outside of
JAGS in R.
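The group-indexing idea behind mu[x[i]] can be seen directly in R; the values
below are made up just to illustrate:

```r
mu <- c(7.2, 6.9)      # mu[1]: non-smoker mean, mu[2]: smoker mean (illustrative values)
x <- c(1, 1, 2, 1, 2)  # group codes for five hypothetical babies
mu[x]                  # the mean used in each baby's likelihood

## [1] 7.2 7.2 6.9 7.2 6.9
```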

Non-Normal Likelihood
The sample data exhibits a long left tail, similar to what we observed in the pre-
vious chapter. Therefore, a non-Normal model might be more appropriate for
the distribution of birthweights. The code and output below uses a 𝑡-distribution
for the likelihood, similar to what was done in the previous section. The poste-
rior distribution of (𝜇1 , 𝜇2 ) is fairly similar to what was computed above for the
model with the Normal likelihood. The model below based on the 𝑡-distribution
likelihood shifts posterior credibility a little more towards mean birthweights for
non-smokers being greater than birthweights for smokers. But the difference is
still small in absolute terms; at most 0.5 pounds or so. In terms of comparing
population means, the choice of likelihood (Normal versus 𝑡) does not make
much of a difference in this example. That is, the inference regarding 𝜇1 − 𝜇2
appears not to be too sensitive to the choice of likelihood. However, if we were
using the model to predict birthweights, then the 𝑡-distribution based model
might be more appropriate, as we observed in the previous chapter.

# Model
model_string <- "model{

# Likelihood
for (i in 1:n){
y[i] ~ dt(mu[x[i]], 1 / sigma ^ 2, tdf)
}

# Prior
mu[1:n_groups] ~ dmnorm.vcov(mu_prior_mean[1:n_groups],
mu_prior_cov[1:n_groups, 1:n_groups])

sigma ~ dgamma(1, 1)

tdf <- 1 + tdf0

tdf0 ~ dexp(1 / 29)

}"

dataList = list(y = y, x = x, n = n, n_groups = n_groups,
                mu_prior_mean = mu_prior_mean, mu_prior_cov = mu_prior_cov)

# Compile
Nrep = 10000

n.chains = 5

model <- jags.model(textConnection(model_string),
                    data = dataList,
                    n.chains = n.chains)

## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 999
## Unobserved stochastic nodes: 3
## Total graph size: 2020
##
## Initializing model

# Simulate
update(model, 1000, progress.bar = "none")

posterior_sample <- coda.samples(model,
                                 variable.names = c("mu", "sigma", "tdf"),
                                 n.iter = Nrep,
                                 progress.bar = "none")

sim_posterior = as.data.frame(as.matrix(posterior_sample))
names(sim_posterior) = c("mu1", "mu2", "sigma", "tdf")
head(sim_posterior)

## mu1 mu2 sigma tdf
## 1 7.225 6.850 1.080 3.953
## 2 7.219 6.982 1.061 4.057
## 3 7.217 6.982 1.094 4.248
## 4 7.232 7.042 1.034 2.978
## 5 7.242 6.958 1.000 2.934
## 6 7.275 6.970 1.012 3.954

ggplot(sim_posterior, aes(mu1, mu2)) +
  geom_point(color = "seagreen", alpha = 0.4) +
  stat_ellipse(level = 0.98, color = "black", size = 2) +
  stat_density_2d(color = "grey", size = 1) +
  geom_abline(intercept = 0, slope = 1)

ggplot(sim_posterior, aes(mu1, mu2)) +
  stat_density_2d(aes(fill = ..level..),
                  geom = "polygon", color = "white") +
  scale_fill_viridis_c() +
  geom_abline(intercept = 0, slope = 1)

cor(sim_posterior$mu1, sim_posterior$mu2)

## [1] 0.06929

sim_posterior = sim_posterior %>%
  mutate(mu_diff = mu1 - mu2)

ggplot(sim_posterior,
       aes(mu_diff)) +
  geom_histogram(aes(y = ..density..), color = "black", fill = "white") +
  geom_density(size = 1, color = "seagreen") +
  labs(x = "Difference in population mean birthweight (pounds, non-smokers - smokers)",
       title = "Posterior Distribution")

quantile(sim_posterior$mu_diff, c(0.01, 0.10, 0.25, 0.75, 0.90, 0.99))

## 1% 10% 25% 75% 90% 99%
## 0.01004 0.12352 0.18782 0.33075 0.39720 0.51219

sum(sim_posterior$mu_diff > 0 ) / length(sim_posterior$mu_diff)

## [1] 0.992
Chapter 17

Introduction to Markov Chain Monte Carlo (MCMC) Simulation

Bayesian data analysis is based on the posterior distribution of relevant
parameters given the data. However, in many situations the posterior distribution
cannot be determined analytically or via grid approximation. Therefore we
use simulation to approximate the posterior distribution and its characteristics.
But in many situations it is difficult or impossible to simulate directly from a
distribution, so we turn to indirect methods.
Markov chain Monte Carlo (MCMC)1 methods provide powerful and
widely applicable algorithms for simulating from probability distributions, in-
cluding complex and high-dimensional distributions.
Example 17.1. A politician campaigns on a long east-west chain of islands2 .
At the end of each day she decides to stay on her current island, move to the
island to the east, or move to the island to the west. Her goal is to visit all
the islands proportional to their population, so that she spends the most days
on the most populated island, and proportionally fewer days on less populated
islands. But, (1) she doesn’t know how many islands there are, and (2) she
doesn’t know the population of each island. However, when she visits an island
she can determine its population. And she can send a scout to the east/west
adjacent islands to determine their population before visiting. How can the
politician achieve her goal in the long run?
1 For some history, and an origin of the use of the term “Monte Carlo”, see Wikipedia.

Monte Carlo methods consist of a broad class of algorithms for obtaining numerical results
based on random numbers, even in problems that don’t explicitly involve probability (e.g.,
Monte Carlo integration).
2 This island hopping example is inspired by Kruschke, Doing Bayesian Data Analysis.


Suppose that every day, the politician makes her travel plans according to the
following algorithm.

• She flips a fair coin to propose travel to the island to the east (heads) or
west (tails). (If there is no island to the east/west, treat it as an island
with population zero below.)
• If the proposed island has a population greater than that of the current
island, then she travels to the proposed island.
• If the proposed island has a population less than that of the current island,
then:
– She computes 𝑎, the ratio of the population of the proposed island
to the current island.
– She travels to the proposed island with probability 𝑎,
– And with probability 1 − 𝑎 she spends another day on the current
island.

1. Suppose there are 5 islands, labeled 1, … , 5 from west to east, and that
island 𝜃 has population 𝜃 (thousand), and that she starts at island 3 on
day 1. How could you use a coin and a spinner to simulate by hand the
politician’s movements over a number of days? Conduct the simulation
and plot the path of her movements.

2. Construct by hand a plot displaying the simulated relative frequencies of
days in each of the 5 islands.
3. Now write code to simulate the politician’s movements for many days.
Plot the island path.

4. Plot the simulated relative frequencies of days in each of the 5 islands.


5. Recall the politician’s goal of visiting each island in proportion to its
population. Given her goal, what should the plot from the previous part
look like? Does the algorithm result in a reasonable approximation?
6. Suppose again that the number of islands and their populations is un-
known, but that the population for island 𝜃 is proportional to 𝜃^2 𝑒^(−0.5𝜃),
where the islands are labeled from west to east 1, 2, …. Based on this
information, can she implement her algorithm? That is, is it sufficient
to know that the populations are in proportion to 𝜃^2 𝑒^(−0.5𝜃) without know-
ing the actual populations? In particular, is there enough information to
compute the “acceptance probability” 𝑎?
7. Write code to run the algorithm for the previous situation and compare
the simulated relative frequencies to the target distribution. Does the
algorithm result in a reasonable approximation?
8. Why doesn’t the politician just always travel to or stay on the island with
the larger population?
9. Is the next island visited dependent on the current island?
10. Is the next island visited dependent on how she got to the current island?
That is, given the current island is her next island independent of her past
history of previous movements?
11. What would happen if there were an island among the east-west chain
with population 0? (For example, suppose there are 10 islands, she starts
on island 1, and island 5 has population 0.) How could she modify her
algorithm to address this issue?

Solution to Example 17.1

1. Starting at island 𝜃, flip a coin to propose a move to either 𝜃 − 1 or
𝜃 + 1. If the proposed move is to 𝜃 + 1, it will be accepted because the
population is larger. If the proposed move is to 𝜃 − 1 it will be accepted
with probability (𝜃 − 1)/𝜃; otherwise the move will be rejected and she
will stay in the current island.
For example, starting in island 3 she proposes a move to either island 2
or island 4. If the proposed move is to island 4, it is accepted and she
moves to island 4. If the proposed move is to island 2, it is accepted with
probability 2/3. If it is accepted she moves to island 2; otherwise she stays
at island 3 for another day.
If she is on island 1 and she proposes a move to the west, the proposal is
rejected (because the population of island “0” is 0) and she spends another
day on island 1. Likewise if she proposes a move to the east from island
5.
304CHAPTER 17. INTRODUCTION TO MARKOV CHAIN MONTE CARLO (MCMC) SIMULATION

The acceptance probabilities are

𝑎(1 → 0) = 0 𝑎(1 → 2) = 1
𝑎(2 → 1) = 1/2 𝑎(2 → 3) = 1
𝑎(3 → 2) = 2/3 𝑎(3 → 4) = 1
𝑎(4 → 3) = 3/4 𝑎(4 → 5) = 1
𝑎(5 → 4) = 4/5 𝑎(5 → 6) = 0

Here is an example plot for 30 days.



2. The plot below corresponds to the path from the previous plot.

3. Some code is below. There are different ways to implement this algorithm,
but note the proposal and acceptance steps below.

n_states = 5
theta = 1:n_states
pi_theta = theta

n_steps = 10000
theta_sim = rep(NA, n_steps)
theta_sim[1] = 3 # initialize

for (i in 2:n_steps){
current = theta_sim[i - 1]
proposed = sample(c(current + 1, current - 1), size = 1, prob = c(0.5, 0.5))
if (!(proposed %in% theta)){ # to correct for proposing moves outside of boundaries
proposed = current
}
a = min(1, pi_theta[proposed] / pi_theta[current])
theta_sim[i] = sample(c(proposed, current), size = 1, prob = c(a, 1-a))
}

# trace plot
plot(1:n_steps, theta_sim, type = "l", ylim = range(theta), xlab = "Day", ylab = "Island")

4. The plot below corresponds to the path from the previous plot.

plot(table(theta_sim) / n_steps, xlab = "Island", ylab = "Proportion of days")

points(theta, theta / 15, type = "o", col = "seagreen")



5. Since the population of island 𝜃 is proportional to 𝜃, the target probability
distribution satisfies 𝜋(𝜃) ∝ 𝜃, 𝜃 = 1, … , 5. That is, 𝜋(𝜃) = 𝜃/15, 𝜃 =
1, … , 5. This distribution is depicted in green in the previous plot. We see
that the algorithm does produce a reasonable approximation.

6. She can still make east-west proposals with a coin flip. And she can still
decide whether to accept the proposal based on relative population.
If she is currently on island 𝜃current and she proposes a move to island
𝜃proposed, she will accept the proposal with probability

𝑎(𝜃current → 𝜃proposed) = (𝜃proposed^2 𝑒^(−0.5 𝜃proposed)) / (𝜃current^2 𝑒^(−0.5 𝜃current))

In terms of running the algorithm it is sufficient to know that the popula-
tions are in proportion to 𝜃^2 𝑒^(−0.5𝜃) without knowing the actual populations.

7. See below; it seems like a reasonable approximation.

n_states = 30
theta = 1:n_states
pi_theta = theta ^ 2 * exp(-0.5 * theta) # notice: not probabilities

n_steps = 10000
theta_sim = rep(NA, n_steps)
theta_sim[1] = 1 # initialize

for (i in 2:n_steps){
current = theta_sim[i - 1]
proposed = sample(c(current + 1, current - 1), size = 1, prob = c(0.5, 0.5))
if (!(proposed %in% theta)){ # to correct for proposing moves outside of boundaries
proposed = current
}
a = min(1, pi_theta[proposed] / pi_theta[current])
theta_sim[i] = sample(c(proposed, current), size = 1, prob = c(a, 1-a))
}

# trace plot
plot(1:n_steps, theta_sim, type = "l", ylim = range(theta_sim), xlab = "Day", ylab = "Island")

plot(table(theta_sim) / n_steps, xlab = "Island", ylab = "Proportion of days")

points(theta, pi_theta / sum(pi_theta), type = "o", col = "seagreen")



8. She wants to visit the islands in proportion to their population. So she
still wants to visit the smaller islands, just not as often as the larger ones.
But if she always moves toward islands with larger populations, she would
not visit the smaller ones at all.
9. Yes, the next island visited is dependent on the current island. For example,
if she is on island 3 today, tomorrow she can only be on island 2, 3, or 4.
10. No, the next island visited is not dependent on how she got to the current
island. The proposals and acceptance probability only depend on the
current state, and not past states (given the current state).
11. With only east-west proposals, since she would never visit an island with
population 0 (because the acceptance probability would be 0), she could
never get to islands on the other side. She could modify her algorithm
to cast a wider net in her proposals, instead of just proposing moves to
adjacent islands.

The goal of a Markov chain Monte Carlo method is to simulate from a proba-
bility distribution of interest. In Bayesian contexts, the distribution of interest
will usually be the posterior distribution of parameters given data.
A Markov chain is a random process that exhibits a special “one-step” depen-
dence structure. Namely, conditional on the most recent value, any future value
is conditionally independent of any past values. In a Markov chain: “Given the
present, the future is conditionally independent of the past.” Roughly, in terms
of simulating the next value of a Markov chain, all that matters is where you
are now, not how you got there.
The idea of MCMC is to build a Markov chain whose long run distribution
— that is, the distribution of state visits after a large number of “steps” —
is the probability distribution of interest. Then we can indirectly simulate a
representative sample from the probability distribution of interest, and use the
simulated values to approximate the distribution and its characteristics, by run-
ning an appropriate Markov chain for a sufficiently large number of steps. The
Markov chain does not need to be fully specified in advance, and is often con-
structed “as you go” via an algorithm like a “modified random walk”. Each step
of the Markov chain typically involves

• A proposal for the next state, which is generated according to some
known probability distribution or mechanism,
• A decision of whether or not to accept the proposal. The decision
usually involves probability:
– With probability 𝑎, accept the proposal and step to the next state
– With probability 1 − 𝑎, reject the proposal and remain in the current
state for the next step.

In principle, proposals can be fairly naive and not related to the target distri-
bution (though in practice choice of proposal is very important since it affects
computational efficiency). Furthermore, the target distribution of interest only
needs to be specified up to a constant of proportionality, and the state space of
possible values does not need to be fully specified in advance.
The island hopping example illustrated an MCMC algorithm for a discrete pa-
rameter 𝜃. Recall that most parameters in statistical models take values on a
continuous scale, so most posterior distributions are continuous distributions.
MCMC simulation can also be used to approximate the posterior distribution
and related characteristics for continuous distributions.
The following example illustrates how MCMC can be used to approximate the
posterior distribution in a Beta-Binomial setting. Of course, this scenario can
be handled analytically and so MCMC is not necessary. However, it will help
to see how the ideas work in a familiar context where an analytical solution is
available.
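For reference, the analytical solution is the usual conjugate Beta-Binomial
update; using the prior and data of the example that follows:

```r
# Beta(alpha, beta) prior with y successes in n trials
# gives a Beta(alpha + y, beta + n - y) posterior
alpha <- 1; beta <- 3
n <- 25; y <- 4
c(alpha + y, beta + n - y)  # posterior is Beta(5, 24)

## [1]  5 24
```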

Example 17.2. Suppose we wish to estimate 𝜃, the proportion of Cal Poly
students who have read a non-school related book in 2021. Assume a Beta(1,
3) prior distribution for 𝜃. In a sample of 25 Cal Poly students, 4 have read a
book in 2021. We’ll use MCMC to approximate the posterior distribution of 𝜃.

1. Without actually computing the posterior distribution, what can we say
about it based on the assumptions of the model?
2. What are the possible “states” that we want our Markov chain to visit?
3. Given a current state 𝜃current how could we propose a new value 𝜃proposed ,
using a continuous analog of “random walk to neighboring states”?
4. How would we decide whether to accept the proposed move? How would
we compute the probability of accepting the proposed move?
5. Suppose the current state is 𝜃current = 0.20 and the proposed state is
𝜃proposed = 0.15. Compute the probability of accepting this proposal.
6. Suppose the current state is 𝜃current = 0.20 and the proposed state is
𝜃proposed = 0.25. Compute the probability of accepting this proposal.
7. Write code to run the algorithm and plot the simulated values of 𝜃.
8. What is the posterior distribution? Does the distribution of simulated
values of 𝜃 provide a reasonable approximation to the posterior distribu-
tion?

Solution to Example 17.2

1. Since posterior is proportional to likelihood times prior, we know that

𝜋(𝜃|𝑦 = 4) ∝ 𝑓(𝑦 = 4|𝜃) 𝜋(𝜃)
∝ (𝜃^4 (1 − 𝜃)^(25−4)) (𝜃^(1−1) (1 − 𝜃)^(3−1))

2. 𝜃 takes values in (0, 1) so each possible value in (0, 1) is a state.



3. There are different approaches, but here’s a common one. We want to pro-
pose a new state in the neighborhood of the current state. Given 𝜃current ,
propose 𝜃proposed from a 𝑁 (𝜃current , 𝛿) distribution where the standard
deviation 𝛿 represents the size of the “neighborhood”. For example, if
𝜃current = 0.5 and 𝛿 = 0.05 then we would draw the proposal from the
𝑁 (0.5, 0.05) distribution, so there’s a 68% chance the proposal is between
0.45 and 0.55 and a 95% chance that it’s between 0.40 and 0.60.

4. Remember, the goal is to approximate the posterior distribution, so we
want to visit the 𝜃 states in proportion to their posterior density. If
the proposed state has higher posterior density than the current state,
𝜋(𝜃proposed |𝑦 = 4) > 𝜋(𝜃current |𝑦 = 4), then we accept the proposal. Oth-
erwise, accept the proposal with a probability based on the relative pos-
terior densities of the proposed and current states. That is, we accept the
proposed move with probability

   𝑎(𝜃current → 𝜃proposed) = min(1, 𝜋(𝜃proposed|𝑦 = 4) / 𝜋(𝜃current|𝑦 = 4))
                            = min(1, [𝜃proposed^4 (1 − 𝜃proposed)^{25−4} 𝜃proposed^{1−1} (1 − 𝜃proposed)^{3−1}] / [𝜃current^4 (1 − 𝜃current)^{25−4} 𝜃current^{1−1} (1 − 𝜃current)^{3−1}])

5. The posterior density is larger for the proposed state, so the proposed
move is accepted with probability 1.
   𝑎(0.20 → 0.15) = min(1, 𝜋(0.15|𝑦 = 4) / 𝜋(0.20|𝑦 = 4)) = min(1, [0.15^4 (1 − 0.15)^{25−4} × 0.15^{1−1} (1 − 0.15)^{3−1}] / [0.20^4 (1 − 0.20)^{25−4} × 0.20^{1−1} (1 − 0.20)^{3−1}]) = 1

(dbeta(0.15, 1, 3) * dbinom(4, 25, 0.15)) / (dbeta(0.2, 1, 3) * dbinom(4, 25, 0.2))

## [1] 1.276

6. The posterior density is smaller for the proposed state, so based on the
ratio of the posterior densities, the proposed move is accepted with prob-
ability 0.553.
   𝑎(0.20 → 0.25) = min(1, 𝜋(0.25|𝑦 = 4) / 𝜋(0.20|𝑦 = 4)) = min(1, [0.25^4 (1 − 0.25)^{25−4} × 0.25^{1−1} (1 − 0.25)^{3−1}] / [0.20^4 (1 − 0.20)^{25−4} × 0.20^{1−1} (1 − 0.20)^{3−1}]) = 0.553

(dbeta(0.25, 1, 3) * dbinom(4, 25, 0.25)) / (dbeta(0.2, 1, 3) * dbinom(4, 25, 0.2))

## [1] 0.5533

7. See below. The Normal distribution proposal can propose values outside
of (0, 1), so we set 𝜋(𝜃|𝑦 = 4) equal to 0 for 𝜃 ∉ (0, 1). This way, proposals
to states outside (0, 1) will never be accepted.

n_steps = 10000
delta = 0.05

theta = rep(NA, n_steps)


theta[1] = 0.5 # initialize

# Posterior is proportional to prior * likelihood


pi_theta <- function(theta) {
if (theta > 0 & theta < 1) dbeta(theta, 1, 3) * dbinom(4, 25, theta) else 0
}

for (n in 2:n_steps){
current = theta[n - 1]
proposed = current + rnorm(1, mean = 0, sd = delta)
accept = min(1, pi_theta(proposed) / pi_theta(current))
theta[n] = sample(c(current, proposed), 1, prob = c(1 - accept, accept))
}

# simulated values of theta
hist(theta, breaks = 50, freq = FALSE,
     xlab = "theta", ylab = "pi(theta|y = 4)", main = "Posterior Distribution")

# plot of theoretical posterior density of theta
x_plot = seq(0, 1, 0.0001)
lines(x_plot, dbeta(x_plot, 1 + 4, 3 + 25 - 4), col = "seagreen", lwd = 2)

[Figure: histogram of the simulated values of theta with the theoretical posterior density overlaid in green (title "Posterior Distribution"; x-axis: theta, y-axis: pi(theta|y = 4))]

plot(1:100, theta[1:100], type = "o", xlab = "n",
     ylab = expression(theta[n]), main = "First 100 steps")

[Figure: trace plot of theta over the first 100 steps (title "First 100 steps"; x-axis: n, y-axis: theta_n)]

plot(1:n_steps, theta, type = "l", xlab = "n",
     ylab = expression(theta[n]), main = "All steps")

[Figure: trace plot of theta over all 10,000 steps (title "All steps")]

8. The theoretical posterior distribution is the Beta(5, 24) distribution,
depicted in green above. The distribution of simulated values of 𝜃 provides
a reasonable approximation to the posterior distribution.

The goal of an MCMC method is to simulate 𝜃 values from a probability
distribution 𝜋(𝜃). One commonly used MCMC method is the Metropolis
algorithm.³ To generate 𝜃new given 𝜃current:

1. Given the current value 𝜃current, propose a new value 𝜃proposed according
to the proposal (or “jumping”) distribution 𝑗.

   𝑗(𝜃current → 𝜃proposed)

   is the conditional density that 𝜃proposed is proposed as the next state given
   that 𝜃current is the current state.
2. Compute the acceptance probability based on the ratio of the target density
at the proposed and current states:

   𝑎(𝜃current → 𝜃proposed) = min(1, 𝜋(𝜃proposed) / 𝜋(𝜃current))

3. Accept the proposal with probability 𝑎(𝜃current → 𝜃proposed ) and set 𝜃new =
𝜃proposed . With probability 1 − 𝑎(𝜃current → 𝜃proposed ) reject the proposal
and set 𝜃new = 𝜃current .
• If 𝜋(𝜃proposed ) ≥ 𝜋(𝜃current ) then the proposal will be accepted with
probability 1.
• Otherwise, there is a positive probability of rejecting the proposal
and remaining in the current state. But this still counts as a “step”
of the MC.
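The three steps above can be sketched generically. Here is a minimal Python sketch (an illustration, not the book's R code) of a single Metropolis update for an arbitrary unnormalized target density; the function and argument names are our own:

```python
import random

def metropolis_step(theta_current, target, delta, rng=random):
    """One Metropolis update with a Normal(theta_current, delta) proposal.

    `target` is the (possibly unnormalized) density of the distribution we
    want to sample from; only the ratio of target values is used. Assumes
    target(theta_current) > 0.
    """
    # Step 1: propose from the symmetric jumping distribution
    theta_proposed = rng.gauss(theta_current, delta)
    # Step 2: acceptance probability min(1, ratio of target densities)
    accept = min(1.0, target(theta_proposed) / target(theta_current))
    # Step 3: accept with probability `accept`; otherwise stay put
    # (staying still counts as a step of the chain)
    return theta_proposed if rng.random() < accept else theta_current
```

Repeatedly feeding the output back in as the current state generates the chain; for example, with target(𝜃) ∝ 𝜃^4 (1 − 𝜃)^23 on (0, 1) (and 0 outside), the simulated values settle into the Beta(5, 24) shape from Example 17.2.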

The Metropolis algorithm assumes the proposal distribution is symmetric. That
is, the algorithm assumes that the proposal density of moving in the direction
𝜃 → 𝜃̃ is equal to the proposal density of moving in the direction 𝜃̃ → 𝜃.

𝑗(𝜃 → 𝜃̃) = 𝑗(𝜃̃ → 𝜃)

A generalization, the Metropolis-Hastings algorithm, allows for asymmetric
proposal distributions, with the acceptance probabilities adjusted to
accommodate the asymmetry:

   𝑎(𝜃current → 𝜃proposed) = min(1, [𝜋(𝜃proposed) 𝑗(𝜃proposed → 𝜃current)] / [𝜋(𝜃current) 𝑗(𝜃current → 𝜃proposed)])

The Metropolis algorithm only uses the target distribution 𝜋 through ratios of
the form 𝜋(𝜃proposed) / 𝜋(𝜃current). Therefore, 𝜋 only needs to be specified up to a constant
of proportionality, since even if the normalizing constant were known it would
3 The algorithm is named after Nicholas Metropolis, a physicist who led the research group

which first proposed the method in the early 1950s, consisting of Arianna Rosenbluth, Marshall
Rosenbluth, Augusta Teller, and Edward Teller. It is disputed whether Metropolis himself
had anything to do with the actual invention of the algorithm.

cancel out anyway. This is especially useful in Bayesian contexts where the
target posterior distribution is only specified up to a constant of proportionality
via
posterior ∝ likelihood × prior
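As a quick numerical check of this cancellation, the Python sketch below (ours, not from the text) compares acceptance probabilities computed from the unnormalized Beta(5, 24) kernel 𝜃^4 (1 − 𝜃)^23 of Example 17.2 and from the same kernel multiplied by an arbitrary constant; they agree exactly, and both reproduce the 0.553 acceptance probability computed earlier:

```python
def accept_prob(target, current, proposed):
    # Metropolis acceptance probability: min(1, ratio of target densities)
    return min(1.0, target(proposed) / target(current))

# Unnormalized posterior kernel from Example 17.2: theta^4 (1 - theta)^23,
# and the same kernel scaled by an arbitrary constant
kernel = lambda t: t ** 4 * (1 - t) ** 23
scaled = lambda t: 1234.5 * kernel(t)

a_kernel = accept_prob(kernel, 0.20, 0.25)   # about 0.553, as in the text
a_scaled = accept_prob(scaled, 0.20, 0.25)   # the constant cancels
```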

We will most often use MCMC methods to simulate values from a posterior
distribution 𝜋(𝜃|𝑦) of parameters 𝜃 given data 𝑦. The Metropolis (or Metropolis-
Hastings algorithm) allows us to simulate from a posterior distribution without
computing the posterior distribution. Recall that the inputs of a Bayesian model
are (1) the data 𝑦, (2) the likelihood 𝑓(𝑦|𝜃), and (3) the prior distribution 𝜋(𝜃).
The target posterior distribution satisfies

𝜋(𝜃|𝑦) ∝ 𝑓(𝑦|𝜃)𝜋(𝜃)

Therefore, the acceptance probability of the Metropolis algorithm can be
computed based on the form of the prior and likelihood alone:

   𝑎(𝜃current → 𝜃proposed) = min(1, 𝜋(𝜃proposed|𝑦) / 𝜋(𝜃current|𝑦)) = min(1, [𝑓(𝑦|𝜃proposed) 𝜋(𝜃proposed)] / [𝑓(𝑦|𝜃current) 𝜋(𝜃current)])

To reiterate:

• Proposed values are simulated according to the proposal distribution 𝑗.


• Proposals are accepted based on probabilities determined by the target
distribution 𝜋.

The Metropolis-Hastings algorithm works for any proposal distribution which
allows for eventual access to all possible values of 𝜃. That is, if we run the
algorithm long enough then the distribution of the simulated values of 𝜃 will
approximate the target distribution 𝜋(𝜃). Thus we can choose a proposal distri-
bution that is easy to simulate from. However, in practice the choice of proposal
distribution is extremely important — especially when simulating from high di-
mensional distributions — because it determines how fast the MC converges
to its long run distribution. There are a wide variety of MCMC methods and
their extensions that strive for computational efficiency by making “smarter”
proposals. These algorithms include Gibbs sampling, Hamiltonian Monte Carlo
(HMC), and No U-Turn Sampling (NUTS). We won’t go into the details of the
many different methods available. Regardless of the details, all MCMC methods
are based on the two core principles of proposal and acceptance.

Example 17.3. We have seen how to estimate a process probability 𝑝 in a
Binomial situation, but what if the number of trials is also random? For
example, suppose we want to estimate both the average number of three point shots
Steph Curry attempts per game (𝜇) and the probability of success on a single
attempt (𝑝). Assume that

• Conditional on 𝜇, the number of attempts in a game 𝑁 has a Poisson(𝜇)
distribution.
• Conditional on 𝑛 and 𝑝, the number of successful attempts in a game 𝑌
has a Binomial(𝑛, 𝑝) distribution.
• Conditional on (𝜇, 𝑝), the values of (𝑁, 𝑌) are independent from game to
game.

For the prior distribution, assume

• 𝜇 has a Gamma(10, 2) distribution
• 𝑝 has a Beta(4, 6) distribution
• 𝜇 and 𝑝 are independent

In his two most recent games, Steph Curry made 4 out of 10 and 6 out of 11
attempts.

1. What is the likelihood function?
2. Without actually computing the posterior distribution, what can we say
2. Without actually computing the posterior distribution, what can we say
about it based on the assumptions of the model?
3. What are the possible “states” that we want our Markov chain to visit?
4. Given a current state 𝜃current how could we propose a new value 𝜃proposed ,
using a continuous analog of “random walk to neighboring states”?
5. Suppose the current state is 𝜃current = (8, 0.5) and the proposed state is
𝜃proposed = (7.5, 0.55). Compute the probability of accepting this proposal.
6. Write code to run the algorithm and plot the simulated values of 𝜃.
7. Write and run JAGS code to approximate the posterior distribution, and
compare with the previous part.

Solution to Example 17.3.

1. For an (𝑛, 𝑦) pair for a single game, the likelihood satisfies

𝑓((𝑛, 𝑦)|𝜇, 𝑝) ∝ (𝑒^{−𝜇} 𝜇^𝑛)(𝑝^𝑦 (1 − 𝑝)^{𝑛−𝑦})

Since the games are assumed to be independent, we evaluate the likelihood
for each observed (𝑛, 𝑦) pair and then find the product:

   𝑓(((10, 4), (11, 6))|𝜇, 𝑝) ∝ (𝑒^{−𝜇} 𝜇^{10} 𝑝^4 (1 − 𝑝)^{10−4})(𝑒^{−𝜇} 𝜇^{11} 𝑝^6 (1 − 𝑝)^{11−6}) = 𝑒^{−2𝜇} 𝜇^{21} 𝑝^{10} (1 − 𝑝)^{21−10}

Notice that the likelihood can be evaluated based on (1) the total number
of games, 2, (2) the total number of attempts, 21, and (3) the total number
of successful attempts, 10.
2. Posterior is proportional to prior times likelihood. The priors for 𝜇 and 𝑝
are independent.

   𝜋(𝜇, 𝑝|((10, 4), (11, 6))) ∝ (𝜇^{10−1} 𝑒^{−2𝜇} 𝑝^{4−1} (1 − 𝑝)^{6−1})(𝑒^{−2𝜇} 𝜇^{21} 𝑝^{10} (1 − 𝑝)^{21−10})



3. Each (𝜇, 𝑝) pair with 𝜇 > 0 and 0 < 𝑝 < 1 is a possible state.
4. Given 𝜃current = (𝜇current , 𝑝current ) we can propose a state using a Bivariate
Normal distribution centered at the current state. The proposed values of
𝜇 and 𝑝 could be chosen independently, but they could also reflect some
dependence.
5. Suppose the current state is 𝜃current = (8, 0.5) and the proposed state is
𝜃proposed = (7.5, 0.55). Compute the probability of accepting this proposal.

pi_theta <- function(mu, p) {
  if (mu > 0 & p > 0 & p < 1) {
    # total attempts 21 ~ Poisson(2 * mu); total successes 10 ~ Binomial(21, p)
    dgamma(mu, 10, 2) * dbeta(p, 4, 6) * dpois(21, 2 * mu) * dbinom(10, 21, p)
  } else {
    0
  }
}

pi_theta(7.5, 0.55) / pi_theta(8, 0.5)

## [1] 0.6819

6. The code and output are below. Now the state is two-dimensional,
(𝜇, 𝑝), and the trace plot shows how the Markov chain explores this
two-dimensional space as it steps. You can’t tell from this trace plot
when the Markov chain rejects a proposal and stays in the same place
since the plot points get overlaid in that case, but you can see where the
step numbers coincide.

n_steps = 11000
delta = c(0.4, 0.05) # mu, p

theta = data.frame(mu = rep(NA, n_steps),
                   p = rep(NA, n_steps))
theta[1, ] = c(10.5, 10 / 21) # initialize

for (n in 2:n_steps){
current = theta[n - 1, ]
proposed = current + rnorm(2, mean = 0, sd = delta)
accept = min(1, pi_theta(proposed$mu, proposed$p) / pi_theta(current$mu, current$p))
accept_ind = sample(0:1, 1, prob = c(1 - accept, accept))
theta[n, ] = proposed * accept_ind + current * (1 - accept_ind)
}

# Trace plot of first 100 steps

ggplot(theta[1:100, ] %>%
         mutate(label = 1:100),
       aes(mu, p)) +
geom_path() +
geom_point(size = 2) +
geom_text(aes(label = label, x = mu + 0.1, y = p + 0.01)) +
labs(title = "Trace plot of first 100 steps")

[Figure: trace plot of the first 100 steps of the (mu, p) chain, with each point labeled by its step number (x-axis: mu, y-axis: p)]

# Delete the first 1000 steps - we'll see why in the next chapter
theta = theta[-(1:1000), ]

ggplot(theta, aes(x = mu)) +
  geom_histogram(aes(y = ..density..), color = "black", fill = "white") +
  geom_density(size = 1, color = "seagreen")

[Figure: histogram and density of the simulated values of mu]

ggplot(theta, aes(x = p)) +
  geom_histogram(aes(y = ..density..), color = "black", fill = "white") +
  geom_density(size = 1, color = "seagreen")

[Figure: histogram and density of the simulated values of p]

ggplot(theta, aes(mu, p)) +
  geom_point(color = "seagreen", alpha = 0.4) +
  stat_ellipse(level = 0.98, color = "black", size = 2) +
  stat_density_2d(color = "grey", size = 1) +
  geom_abline(intercept = 0, slope = 1)

[Figure: scatterplot of the simulated (mu, p) pairs with density contours and a 98% ellipse]

ggplot(theta, aes(mu, p)) +
  stat_density_2d(aes(fill = ..level..),
                  geom = "polygon", color = "white") +
  scale_fill_viridis_c() +
  geom_abline(intercept = 0, slope = 1)

[Figure: filled two-dimensional density plot of the simulated (mu, p) pairs]

7. The JAGS code is below. The results are similar, but not quite the same
as our code from scratch in the previous part. JAGS has a lot of built in
features that improve efficiency. In particular, JAGS is making smarter
proposals and is not rejecting as many proposals as our from-scratch al-
gorithm. The scatterplots of simulated (𝜇, 𝑝) pairs kind of show this; the
plot based on the from-scratch algorithm is “thinner” than the one based
on JAGS because the from-scratch algorithm rejects proposals and sits in
place more often.

# data
n = c(10, 11)
y = c(4, 6)
n_sample = 2

# Model
model_string <- "model{

# Likelihood
for (i in 1:n_sample){

y[i] ~ dbinom(p, n[i])

n[i] ~ dpois(mu)
}
322CHAPTER 17. INTRODUCTION TO MARKOV CHAIN MONTE CARLO (MCMC) SIMULATION

# Prior

mu ~ dgamma(10, 2)

p ~ dbeta(4, 6)

}"

dataList = list(y = y, n = n, n_sample = n_sample)

# Compile
Nrep = 10000

n.chains = 5

model <- jags.model(textConnection(model_string),
                    data = dataList,
                    n.chains = n.chains)

## Compiling model graph


## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 4
## Unobserved stochastic nodes: 2
## Total graph size: 11
##
## Initializing model

# Simulate
update(model, 1000, progress.bar = "none")

posterior_sample <- coda.samples(model,
                                 variable.names = c("mu", "p"),
                                 n.iter = Nrep,
                                 progress.bar = "none")

sim_posterior = as.data.frame(as.matrix(posterior_sample))
head(sim_posterior)

## mu p
## 1 9.034 0.3177
## 2 8.791 0.5250
## 3 8.290 0.3150
## 4 7.839 0.3179
## 5 8.326 0.5670
## 6 8.983 0.5996

ggplot(sim_posterior, aes(x = mu)) +
  geom_histogram(aes(y = ..density..), color = "black", fill = "white") +
  geom_density(size = 1, color = "seagreen")

[Figure: histogram and density of mu from the JAGS posterior sample]

ggplot(sim_posterior, aes(x = p)) +
  geom_histogram(aes(y = ..density..), color = "black", fill = "white") +
  geom_density(size = 1, color = "seagreen")

[Figure: histogram and density of p from the JAGS posterior sample]

ggplot(sim_posterior, aes(mu, p)) +
  geom_point(color = "seagreen", alpha = 0.4) +
  stat_ellipse(level = 0.98, color = "black", size = 2) +
  stat_density_2d(color = "grey", size = 1) +
  geom_abline(intercept = 0, slope = 1)

[Figure: scatterplot of the (mu, p) pairs from the JAGS posterior sample with density contours and a 98% ellipse]

ggplot(sim_posterior, aes(mu, p)) +
  stat_density_2d(aes(fill = ..level..),
                  geom = "polygon", color = "white") +
  scale_fill_viridis_c() +
  geom_abline(intercept = 0, slope = 1)

[Figure: filled two-dimensional density plot of the (mu, p) pairs from the JAGS posterior sample]
Chapter 18

Some Diagnostics for MCMC Simulation

The goal we wish to achieve with MCMC is to simulate from a probability
distribution of interest (e.g., the posterior distribution). The idea of MCMC is to
build a Markov chain whose long run distribution is the probability distribution
of interest. Then, by running the Markov chain for a sufficiently large number
of steps, we can simulate a sample from the probability distribution and use
the simulated values to summarize and investigate its characteristics. In
practice, we stop the chain after some number of steps; how can we tell if the
chain has sufficiently converged?
In this chapter we will introduce some issues to consider in determining if an
MCMC algorithm “works”.

• Does the algorithm produce samples that are representative of the target
distribution of interest?
• Are estimates of characteristics of the distribution (e.g. posterior mean,
posterior standard deviation, central 98% credible region) based on the
simulated Markov chain accurate and stable?
• Is the algorithm efficient, in terms of time or computing power required
to run?

Example 18.1. Recall Example 17.2 in which we used a Metropolis algorithm
to simulate from a Beta(5, 24) distribution. Given current state 𝜃𝑐 , the pro-
posal was generated from a 𝑁 (𝜃𝑐 , 𝛿) distribution, where 𝛿 was a specified value
(determining what constitutes the “neighborhood” in the continuous random
walk analog). The algorithm also needs some initial value of 𝜃 to start with; in
Example 17.2 we used an initial value of 0.5. What is the impact of the initial
value?


The following plots display the values of the first 200 steps and their density,
and the values of 10,000 steps and their density, for 5 different runs of the
Metropolis chain each starting from a different initial value: 0.1, 0.3, 0.5, 0.7,
0.9. The value of 𝛿 is 0.005. (We’re setting this value to be small to illustrate
a point.) What do you notice in the plots? How does the initial value influence
the results?

[Figure: trace plots of the first 200 steps and of the first 10,000 steps, and the corresponding density plots of theta, for 5 different runs of the Metropolis chain started from different initial values]

Solution to Example 18.1.

With such a small 𝛿 value the chain tends to take a long time to move away
from its current value. For example, the chain that starts at a value of 0.9 tends
to stay near 0.9 for the first hundreds of steps. Values near 0.9 are rare in a
Beta(5, 24) distribution, so this chain generates a lot of unrepresentative values
before it warms up to the target distribution. After a thousand or so iterations
all the chains start to overlap and become indistinguishable regardless of the
initial condition. However, the density plots for each of the chains illustrate
that the initial steps of the chain still carry some influence.
The goal of an MCMC simulation is to simulate a representative sample of
values from the target distribution. While an MCMC algorithm should converge
eventually to the target distribution, it might take some time to get there. In
particular, it might take a while for the influence of the initial state to diminish.
Burn in refers to the process of discarding the first several hundred or thousand
steps of the chain to allow for a “warm up” period. Only values simulated after
the burn in period are used to approximate the target distribution.

The update step in rjags runs the MCMC simulation for a burn in period,
consisting of n.iter steps. (The n.iter in update is not the same as the n.iter
in coda.samples.) The update function merely “warms-up” the simulation, and
the values sampled during the update phase are not recorded.
The JAGS code below simulates 5 different chains, from 5 different initial con-
ditions, each with a burn in period of 1000 steps, after which 10,000 steps of
each chain are simulated. The output consists of 50,000 simulated values of 𝜃.

# Data
n = 25
y = 4

# Model
model_string <- "model{

# Likelihood
y ~ dbinom(theta, n)

# Prior
theta ~ dbeta(1, 3)

}"

data_list = list(y = y, n = n)

# Compile
model <- jags.model(textConnection(model_string),
data = data_list,
n.chains = 5)

## Compiling model graph


## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 1
## Unobserved stochastic nodes: 1
## Total graph size: 5
##
## Initializing model

# Simulate
update(model, n.iter = 1000, progress.bar = "none")

Nrep = 10000

posterior_sample <- coda.samples(model,
                                 variable.names = c("theta"),
                                 n.iter = Nrep,
                                 progress.bar = "none")

summary(posterior_sample)

##
## Iterations = 2001:12000
## Thinning interval = 1
## Number of chains = 5
## Sample size per chain = 10000
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## 0.172895 0.069324 0.000310 0.000435
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## 0.0609 0.1223 0.1652 0.2153 0.3286

plotPost(posterior_sample)

[Figure: plotPost output showing the posterior density of theta, with mode 0.153 and 95% HDI (0.0489, 0.308)]

## ESS mean median mode hdiMass hdiLow hdiHigh compVal
## Param. Val. 25109 0.1729 0.1652 0.1535 0.95 0.04886 0.3084 NA
## pGtCompVal ROPElow ROPEhigh pLtROPE pInROPE pGtROPE
## Param. Val. NA NA NA NA NA NA

nrow(as.matrix(posterior_sample))

## [1] 50000

In practice, it is common to use a burn in period of several hundred or thousand
steps. To get a better idea of how long the burn in period should be, run the
chain starting from several dispersed initial conditions to see how long it takes for
the paths to “overlap”. JAGS will generate different initial values, but you can
also specify them with the inits argument in jags.model. After the burn in
period, examine trace plots or density plots for multiple chains; if the plots do
not “overlap” then there is evidence that the chains have not converged, so they
might not be producing representative samples from the target distribution, and
therefore a longer burn in period is needed.

The Gelman-Rubin statistic (a.k.a. shrink factor) is a numerical check of
convergence based on a measure of variability between chains relative to variability
within chains. The idea is that if multiple chains have settled into representative
sampling, then the average difference between chains should be equal to the
average difference within chains (i.e., across steps).
are all producing representative samples from the target distribution, then given
a current value, it shouldn’t matter if you take the next value from the chain or
if you hop to another chain. Thus, after the burn-in period, the shrink factor
should be close to 1. As a rule of thumb, a shrink factor above 1.1 is evidence
that the MCMC algorithm is not producing representative samples.
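To make the between/within comparison concrete, here is a simplified Python sketch of the basic (non-split) Gelman-Rubin calculation; packages such as coda use refinements of this formula, and the function name here is our own:

```python
from statistics import mean, variance

def gelman_rubin(chains):
    """Basic Gelman-Rubin shrink factor for a list of equal-length chains.

    Compares between-chain variability to within-chain variability;
    values near 1 suggest the chains are sampling the same distribution.
    """
    n = len(chains[0])
    chain_means = [mean(c) for c in chains]
    W = mean(variance(c) for c in chains)   # within-chain variance
    B = n * variance(chain_means)           # between-chain variance
    var_hat = (n - 1) / n * W + B / n       # pooled variance estimate
    return (var_hat / W) ** 0.5
```

Chains drawn from the same distribution give a value very close to 1; chains centered at different values give a value well above the 1.1 rule of thumb.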

Recall that there are several packages available for summarizing MCMC output,
and these packages contain various diagnostics. For example, the output of the
diagMCMC function in the DBDA2E-utilities.R file includes a plot of the shrink
factor.

diagMCMC(posterior_sample)

Example 18.2. Continuing with Metropolis sampling from a Beta(5, 24) dis-
tribution, the following plots display the results of three different runs, one each
for 𝛿 = 0.01, 𝛿 = 0.1, 𝛿 = 1, all with an initial value of 0.5. Describe the
differences in the behavior of the chains. Which chain seems “best”? Why?

[Figure: trace plots of the first 200 steps (top row) and the first 10,000 steps (bottom row) for delta = 0.01, delta = 0.1, and delta = 1]

Solution to Example 18.2.

When 𝛿 = 0.01 only values that are close to the current value are proposed. A
proposed value close to the current value will have a density that is close to, if
not greater than, that of the current value. Therefore, most of the proposals
will be accepted, but these proposals don’t really go anywhere. With 𝛿 = 0.01
the chain moves often, but it does not move far.
When 𝛿 = 1 a wide range of values will be proposed, including values outside
of (0, 1). Many proposed values will have density that is much less than that
of the current value, if not 0. Therefore many proposals will be rejected. With
𝛿 = 1 the chain tends to get stuck in a value for a large number of steps before
moving (though when it does move, it can move far.)
Both of the above cases tend to get stuck in place and require a large number
of steps to explore the target distribution. The case 𝛿 = 0.1 is more efficient.
The proposals are neither so narrow that it takes a long time to move nor so
wide that many proposals are rejected. The fast up and down pattern of the
trace plot shows that the chain with 𝛿 = 0.1 explores the target distribution
much more efficiently than the other two cases.

The values of a Markov chain at different steps are dependent. If the degree
of dependence is too high, the chain will tend to get “stuck”, requiring a large
number of steps to fully explore the target distribution of interest. Not only
will the algorithm be inefficient, but it can also produce inaccurate and unstable
estimates of characteristics of the target distribution.
If the MCMC algorithm is working, trace plots should look like a “fat, hairy
caterpillar.”1 Plots of the autocorrelation function (ACF) can also help determine
how “clumpy” the chain is. An autocorrelation measures the correlation between
values at different lags. For example, the lag 1 autocorrelation measures the
correlation between the values and the values from the next step; the lag 2
autocorrelation measures the correlation between the values and the values from
2 steps later.
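A lag-𝑘 autocorrelation can be computed directly from the chain; a minimal Python sketch (illustrative, not the book's code):

```python
from statistics import mean

def autocorrelation(x, lag):
    """Sample autocorrelation of the series x at the given lag."""
    n = len(x)
    xbar = mean(x)
    # Lag-0 autocovariance (the variance), used for normalization
    c0 = sum((v - xbar) ** 2 for v in x) / n
    # Autocovariance between the series and itself shifted by `lag` steps
    ck = sum((x[i] - xbar) * (x[i + lag] - xbar) for i in range(n - lag)) / n
    return ck / c0
```

For an autoregressive chain in which each value is 0.9 times the previous value plus noise, the lag-1 autocorrelation comes out near 0.9 and decays as the lag grows.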

Example 18.3. Continuing with Metropolis sampling from a Beta(5, 24) dis-
tribution, the following plots display, for the case 𝛿 = 0.1, the actual values
of the chain (after burn in) and the values lagged by 1, 5, and 10 time steps.
Are the values at different steps dependent? In what way are they not too
dependent?

[Figure: plots of the chain values against the values lagged by 1, 5, and 10 steps, with sample correlations 0.647, 0.126, and −0.289, respectively]

Solution to Example 18.3.

1 I’ve seen this description in many references, but I don’t know who first used this termi-

nology.

Yes, the values are dependent. In particular, the lag 1 autocorrelation is about
0.8, and the lag 5 autocorrelation is about 0.4. However, the autocorrelation
decays rather quickly as a function of lag. The lag 10 autocorrelation is already
close to 0. In this way, the chain is “not too dependent”; each value is only
correlated with the values in the next few steps.
An autocorrelation plot displays the autocorrelation within a chain as a function
of lag. If the ACF takes too long to decay to 0, the chain exhibits a high
degree of dependence and will tend to get stuck in place.
The plot below displays the ACFs corresponding to each of the 𝛿 values in
Example 18.2. Notice that with 𝛿 = 0.1 the ACF decays fairly quickly to 0,
while in the other cases there is still fairly high autocorrelation even after long
lags.

[Figure: autocorrelation functions for delta = 0.01 (ESS = 44), delta = 0.1 (ESS = 1656), and delta = 1 (ESS = 448)]

Example 18.4. Continuing with Metropolis sampling from a Beta(5, 24) posterior
distribution, we know that the posterior mean is 5/29 ≈ 0.172. But what
if we want to approximate this via simulation?

1. Suppose you simulated 10000 independent values from a Beta(5, 24) dis-
tribution, e.g. using rbeta. How would you use the simulated values to
estimate the posterior mean?
2. What is the standard error of your estimate from the previous part? What
does the standard error measure? How could you use simulation to ap-
proximate the standard error?
3. Now suppose you simulated 10000 values from a Metropolis chain (after burn in).
How would you use the simulated values to estimate the posterior mean?
What does the standard error measure in this case? Could you use the
formula from the previous part to compute the standard error? Why?
4. Consider the three chains in Example 18.2 corresponding to the three 𝛿
values 0.01, 0.1, and 1. Which chain provides the most reliable estimate
of the posterior mean? Which chain yields the smallest standard error of
this estimate?

Solution to Example 18.4.


1. Simulate 10000 values and compute the sample mean of the simulated
values.
2. For the Beta(5, 24) distribution, the population SD is
√[(5/29)(1 − 5/29)/(29 + 1)] ≈ 0.07. The standard error of the sample mean
of 10000 values is 0.07/√10000 = 0.0007. The standard error measures the
sample-to-sample variability of sample means over many samples of size
10000. To approximate the standard error via simulation: sample 10000
values from a Beta(5, 24) distribution and compute the sample mean, then
repeat many times and find the standard deviation of the simulated sample
means.
3. You would still use the sample mean of the 10000 values to approximate
the posterior mean. The standard error measures how much the sample
mean varies from run-to-run of the Markov chain. To approximate the
standard error via simulation: simulate 10000 steps of the Metropolis
chain and compute the sample mean, then repeat many times and find
the standard deviation of the simulated sample means. The standard
error formula from the previous part assumes that the 10000 values are
independent, but the values on the Markov chain are not, so we can’t use
the same formula.
4. Among these three, the chain with 𝛿 = 0.1 provides the most reliable
estimate of the posterior mean since it does the best job of sampling from
the posterior distribution. While there is dependence in all three chains,
the chain with 𝛿 = 0.1 has the least dependence and so comes closest to
independent sampling, so it would have the smallest standard error.
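The simulation recipe in part 2 can be carried out directly. A Python sketch (using random.betavariate in place of R's rbeta, with 100 repetitions to keep it quick):

```python
import random
from statistics import mean, stdev

rng = random.Random(42)

# Approximate the standard error of the sample mean via simulation:
# draw many independent samples of size 10000 from a Beta(5, 24)
# distribution and find the standard deviation of their sample means.
sample_means = [
    mean(rng.betavariate(5, 24) for _ in range(10000))
    for _ in range(100)
]

se_simulated = stdev(sample_means)   # close to the formula value
se_formula = 0.07 / 10000 ** 0.5     # population SD / sqrt(n) = 0.0007
```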

A Markov chain that exhibits a high degree of dependence will tend to get
stuck in place. Even if you simulate 10000 steps of the chain, you don’t really
get 10000 “new” values. The effective sample size (ESS) is a measure of how
much independent information there is in an autocorrelated chain. Roughly, the
effective sample size answers the question: what is the equivalent sample size of
a completely independent chain?

The effective sample size² of a chain with 𝑁 steps (after burn in) is

   ESS = 𝑁 / (1 + 2 ∑_{ℓ=1}^{∞} ACF(ℓ))

where the infinite sum is typically cut off at some upper lag (say ℓ = 20). For
a completely independent chain, the autocorrelation would be 0 for all lags and
the ESS would just be the number of steps 𝑁 . The more quickly the ACF decays
to 0, the larger the ESS. The more slowly the ACF decays to 0, the smaller the
ESS.
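The ESS formula can be implemented in a few lines; a Python sketch (our own helper, with the sum cut off at lag 20 as suggested above):

```python
from statistics import mean

def effective_sample_size(x, max_lag=20):
    """ESS = N / (1 + 2 * sum of sample autocorrelations up to max_lag)."""
    n = len(x)
    xbar = mean(x)
    c0 = sum((v - xbar) ** 2 for v in x) / n
    def acf(lag):
        return sum((x[i] - xbar) * (x[i + lag] - xbar)
                   for i in range(n - lag)) / n / c0
    return n / (1 + 2 * sum(acf(lag) for lag in range(1, max_lag + 1)))
```

For independent draws the ESS comes out near the number of steps, while for a highly autocorrelated chain it is far smaller.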
The larger the ESS of a Markov chain, the more accurate and stable are MCMC-
based estimates of characteristics of the posterior distribution (e.g., posterior
mean, posterior standard deviation, 98% credible region). That is, if the ESS
is large and we run the chain multiple times, then estimates do not vary much
from run to run.
The standard error of a statistic is a measure of its accuracy. The standard error
of a statistic measures the sample-to-sample variability of values of the statistic
over many samples of the same size. A standard error can be approximated via
simulation.

• Simulate a sample and compute the value of the statistic.


• Repeat many times and find the standard deviation of simulated values of
the statistic.

For many statistics (means, proportions) the standard error based on a sample of 𝑛 independent values is on the order of $1/\sqrt{n}$.
For example, the standard error of a sample mean measures the sample-to-sample variability of sample means over many samples of the same size. The standard error of a sample mean based on an independent sample of size 𝑛 is

$$\frac{\text{population SD}}{\sqrt{n}}$$

where the population SD measures the variability of individual values of the variable.
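This formula is easy to check by simulation. In the following Python sketch (the standard normal population and the sample size are arbitrary choices for illustration), the standard deviation of many simulated sample means lands close to $\text{population SD}/\sqrt{n}$.

```python
import math
import random

rng = random.Random(0)
n = 100        # sample size
pop_sd = 1.0   # population SD of a standard normal population

# Simulate many samples; record the sample mean of each.
sample_means = []
for _ in range(2000):
    sample = [rng.gauss(0, pop_sd) for _ in range(n)]
    sample_means.append(sum(sample) / n)

# The SD of the simulated sample means approximates the standard error.
m = sum(sample_means) / len(sample_means)
se_simulated = math.sqrt(sum((x - m) ** 2 for x in sample_means) / len(sample_means))
print(round(se_simulated, 3), round(pop_sd / math.sqrt(n), 3))
```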
The usual $1/\sqrt{n}$ formulas for standard errors are based on samples of 𝑛 independent values. However, in a Markov chain the values will be dependent. The Monte Carlo standard error (MCSE) is the standard error of a statistic generated by an MCMC algorithm. The MCSE of a statistic measures the run-to-run variability of values of the statistic over many runs of the chain with the same number of steps. An MCSE can be approximated via simulation.
² The coda library in R contains a lot of diagnostic tests for MCMC methods, including the function effectiveSize.
CHAPTER 18. SOME DIAGNOSTICS FOR MCMC SIMULATION

• Simulate many steps of the Markov chain and compute the value of the statistic for the simulated chain.

• Repeat many times and find the standard deviation of simulated values of the statistic.

For many statistics (means, proportions) the MCSE based on a chain with effective sample size ESS is on the order of $1/\sqrt{\text{ESS}}$.
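The simulation recipe above can be sketched in Python as follows. The Beta(3, 9) target is again a hypothetical stand-in for Example 18.2, and burn-in is omitted to keep the sketch short; the run-to-run standard deviation of the 200 sample means approximates the MCSE of the posterior mean estimate.

```python
import math
import random

def log_target(theta):
    # Unnormalized log density of a hypothetical Beta(3, 9) target.
    if theta <= 0 or theta >= 1:
        return float("-inf")
    return 2 * math.log(theta) + 8 * math.log(1 - theta)

def run_chain(delta, n_steps, rng):
    # One run of a random-walk Metropolis chain; returns the sample mean
    # of the chain, our estimate of the posterior mean.
    theta, total = 0.3, 0.0
    for _ in range(n_steps):
        proposal = theta + rng.gauss(0, delta)
        if math.log(rng.random()) < log_target(proposal) - log_target(theta):
            theta = proposal
        total += theta
    return total / n_steps

# Repeat many runs of the same length and find the SD of the estimates.
rng = random.Random(0)
estimates = [run_chain(delta=0.1, n_steps=1000, rng=rng) for _ in range(200)]
m = sum(estimates) / len(estimates)
mcse = math.sqrt(sum((x - m) ** 2 for x in estimates) / len(estimates))
print(round(mcse, 3))  # approximate run-to-run SD of the estimates
```

Repeating this with a smaller or larger 𝛿 (and hence a smaller ESS) would yield a larger run-to-run standard deviation.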

The MCSE affects the accuracy of parameter estimates based on the MCMC
method. If the chain is too dependent, the ESS will be small, the MCSE will be
large, and resulting estimates will not be accurate. That is, two different runs
of the chain could produce very different estimates of a particular characteristic
of the target distribution.

How large of an ESS is appropriate depends on the particular characteristic of the posterior distribution being estimated. A larger ESS will be required for accurate estimates of characteristics that depend heavily on sparse regions of the posterior that are visited relatively rarely by the chain, like the endpoints of a 98% credible interval.

The plots below correspond to each of the 𝛿 values in Example 18.2. Each plot represents 500 runs of the chain, each run with 1000 steps (after burn in). For each run we computed both the sample mean (our estimate of the posterior mean) and the 0.5th percentile (our estimate of the lower endpoint of a central 99% credible interval). Therefore, each plot in the top row displays 500 simulated sample means, and each plot in the bottom row displays 500 simulated 0.5th percentiles. The MCSE is represented by the degree of variability in each plot. We see that for both statistics the MCSE is smallest when 𝛿 = 0.1, corresponding to the smallest degree of autocorrelation and the largest ESS.

[Figure: for each 𝛿 (0.01, 0.1, 1), histograms of the 500 simulated estimates. Top row: estimates of the posterior mean; bottom row: estimates of the lower credible interval endpoint.]

For most of the situations we’ll see in this course, standard MCMC algorithms
will run fairly efficiently, and checking diagnostics is simply a matter of due
diligence. However, especially in more complex models, diagnostic checking is
an important step in Bayesian data analysis. Poor diagnostics can indicate the
need for better MCMC algorithms to obtain a more accurate picture of the
posterior distribution. Algorithms that use “smarter” proposals will usually
lead to better results.
