Bayesian Reasoning and Methods
Kevin Ross
2022-02-19
Contents
Preface
1 Introductory Example
2 Ideas of Bayesian Reasoning
3 Interpretations of Probability and Statistics
4 Bayes’ Rule
5 Introduction to Estimation
5.1 Point estimation
6 Introduction to Inference
6.1 Comparing Bayesian and frequentist interval estimates
• Asking questions
• Formulating conjectures
• Designing studies
• Collecting data
• Wrangling data
• Summarizing data
• Visualizing data
• Analyzing data
• Developing models
• Drawing conclusions
• Communicating results
We will assume some familiarity with many of these aspects, and we will focus on the last four items: analyzing data, developing models, drawing conclusions, and communicating results. That is, we will focus on statistical inference, the process of using data analysis to draw conclusions about a population or process beyond the existing data. “Traditional” hypothesis tests and confidence intervals that you are familiar with are components of “frequentist” statistics. This book will introduce aspects of “Bayesian” statistics; we will approach analyzing data, developing models, drawing conclusions, and communicating results from a Bayesian perspective. We will also discuss some similarities and differences between frequentist and Bayesian approaches, and some advantages and disadvantages of each approach.
We want to make clear from the start: Bayesian versus frequentist is NOT a
question of “right versus wrong”. Both Bayesian and frequentist are valid ap-
proaches to statistical analyses, each with advantages and disadvantages. We’ll
address some of the issues along the way. But at no point in your career do
you need to make a definitive decision to be a Bayesian or a frequentist; a good
modern statistician is probably a bit of both.
While our focus will be on statistical inference, remember that the other parts
of Statistics are equally important, if not more important. In particular, any
statistical analysis is only as good as the data upon which it is based.
The exercises in this book are used both to motivate new topics and to help you practice your understanding of the material. You should attempt the exercises
on your own before reading the solutions. To encourage you to do so, the
solutions have been hidden. You can reveal the solution by clicking on the
Show/hide solution button.
Show/hide solution
Here is where a solution would be, but be sure to think about the problem on
your own first!
(Careful: in your browser, the triangle for the Show/hide solution button might
be close to the back button, so clicking on Show/hide might take you to the
previous page. To avoid this, click on the words Show/hide.)
Chapter 1
Introductory Example
Statistics is the science of learning from data. But what is “Bayesian” statistics?
This chapter provides a relatively simple and brief example of a Bayesian sta-
tistical analysis. As you work through the example, think about: What aspects
are familiar? What features are new or different? Think big picture for now;
we’ll fill in lots of details later.
3. Which one of the following do you think is the most plausible value of the
population proportion? Record your value in the plot on the board.
4. Sketch the plot of guesses here. What seems to be the consensus? What
do we think are the most plausible values of the population proportion?
Somewhat plausible? Not plausible?
5. The plot just shows our guesses for the population proportion. How could
we estimate the actual population proportion based on data?
6. Suppose that the actual population proportion is 0.5. That is, suppose
that 50% of current Cal Poly students have read at least one Harry Potter
book. How many students in a random sample of 30 students would you
expect to have read at least one Harry Potter book? Would it necessarily
be 15 students? How could you use a coin to simulate how many students
in a random sample of 30 students might have read at least one Harry
Potter book?
7. Now suppose the actual population proportion is 0.1. How would the
previous part change?
8. Using your choice for the most plausible value of the population propor-
tion, simulate how many students in a random sample of 30 students might
have read at least one Harry Potter book. Repeat to get a few hypothetical
samples, using your guess for the most plausible value of the population
proportion, and record your results in the plot. (Here is one applet you
can use.)
10. How could we get an even clearer picture of what might happen?
11. Sketch the plot that we created. The plot illustrates two sources of uncer-
tainty or variability. What are these two sources?
13. Remember that we started with guesses about which values of the popula-
tion proportion were more plausible than others, and we used these guesses
to get a picture of what might happen in samples. How can we reconsider
the plausibility of the possible values of the population proportion in light
of the sample data that we actually observed?
14. Given the observed sample proportion, what can we say about the plau-
sible values of the population proportion? How has our assessment of
plausibility changed from before observing the sample data?
15. What elements of the analysis are similar to the kinds of statistical analysis
you have done before? What elements are new or different?
Show/hide solution
2. The population proportion could possibly be any value in the interval [0,
1]. Between 0% and 100% of current Cal Poly students have read at least
one Harry Potter book.
3. There is no right answer for what you think is most plausible. Maybe
you have a lot of friends that have read at least one Harry Potter book,
so you might think the population proportion is 0.8. Maybe you don’t
know anyone who has read at least one Harry Potter book, so you might
think the population proportion is 0.1. Maybe you have no idea and
you just guess that the population proportion is 0.5. Everyone has their
own background information which influences their initial assessment of
plausibility.
4. Results for the class will vary, but Figure 1.1 shows an example. The
consensus for the class in Figure 1.1 is that values of 0.3, 0.4, and 0.5
are most plausible, 0.2 and 0.6 less so, and values close to 0 or 1 are not
plausible.
6. One way to use a coin, assuming the population proportion is 0.5:
• Flip a fair coin. Heads represents a student who has read at least one Harry Potter book; Tails, not.
• A set of 30 flips represents one hypothetical random sample of 30
students.
• The number of the 30 flips that land on Heads represents one hypo-
thetical value of the number of students in a random sample of 30
students who have read at least one Harry Potter book.
• Repeat the above process to get many hypothetical values of the
number of students in a random sample of 30 students who have
read at least one Harry Potter book, assuming that the population
proportion is 0.5.
7. One way to use a ten-sided die, assuming the population proportion is 0.1:
• Roll a fair 10-sided die. A roll of 1 represents a student who has read at least one Harry Potter book; all other rolls, not.
• A set of 30 rolls represents one hypothetical random sample of 30
students.
• The number of the 30 rolls that land on 1 represents one hypothetical
value of the number of students in a random sample of 30 students
who have read at least one Harry Potter book.
• Repeat the above process to get many hypothetical values of the
number of students in a random sample of 30 students who have
read at least one Harry Potter book, assuming that the population
proportion is 0.1.
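The two physical simulations described above can also be sketched in R. Since each student in a random sample is a “success” with the assumed probability, `rbinom` simulates the count for a whole sample at once (the seed is an arbitrary choice for reproducibility):

```r
# Number of students (out of 30) who have read at least one HP book,
# in 5 hypothetical samples for each assumed population proportion
set.seed(123)  # arbitrary seed
rbinom(5, size = 30, prob = 0.5)  # like counting Heads in 30 coin flips
rbinom(5, size = 30, prob = 0.1)  # like counting 1s in 30 ten-sided die rolls
```

Each call returns five counts between 0 and 30, one per hypothetical sample.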
8. Figure 1.2 shows the number of students who have read at least one Harry
Potter book in 5 hypothetical samples assuming the population proportion
is 0.5, and in 5 hypothetical samples assuming the population proportion
is 0.1.
9. Results for the class will vary. In the scenario in Figure 1.1, a value of 0.4
was initially more plausible than a value of 0.9. There were more students
who thought 0.4 was the most plausible value than 0.9. So the value 0.4
gets more “weight” in the simulation than 0.9. The plot on the left in
Figure 1.3 reflects the results of a simulation where every student who
plotted a dot in Figure 1.1 simulates 5 random samples of size 30, using
their guess for the population proportion.
10. Repeat the simulation process to get many hypothetical samples for each
value for the population proportion, reflecting differences in initial plau-
sibility. Imagine each student simulated 10000 samples instead of 5. The
plot on the right in Figure 1.3 displays the results.
11. The plot illustrates natural sample-to-sample variability in the sample
proportion for a given value of the population proportion. The plot also
illustrates the uncertainty in the value of the population proportion. That
is, the population proportion has a distribution of values determined by
our relative initial plausibilities.
12. Results will vary. We’ll assume that 9 out of 30 students have read at least
one Harry Potter book, for a sample proportion of 9/30 = 0.3. While we
hope that 0.3 is close to the proportion of all current Cal Poly students
who have read at least one Harry Potter book, because of natural sample-
to-sample variability the sample proportion is not necessarily equal to the
population proportion.
13. The simulation demonstrated what might happen in a sample of size 30. Now we can zoom in on what actually did happen: among the simulated samples, we focus on those that resulted in the observed sample proportion of 9/30.
14. Figure 1.4 displays the results based on the smaller scale simulation in the
plot on the left in Figure 1.3, in which every initial guess for the population proportion generated five hypothetical samples of size 30. Now we focus
on samples that resulted in a sample proportion of 9/30, the observed
sample proportion. The middle plot displays the population proportions
corresponding to samples with a sample proportion of 9/30. The distribution of all the dots in the middle plot illustrates our initial plausibility.
The plot on the right displays only the green dots, which correspond to
samples with a sample proportion of 9/30. The distribution in the plot on
the right reflects a reassessment of the plausibilities of possible values of
the population proportion given the observed sample proportion of 9/30
and the simulation results. Among the simulated samples that resulted in
a sample proportion of 9/30, the population proportion was much more
likely to be 0.3 than to be 0.5.
Figure 1.5 displays the same analysis based on the full simulation from the
plot on the right in Figure 1.3. The plot on the right in Figure 1.5 com-
pares the initial plausibilities to the plausibilities revised upon observing
a sample proportion of 9/30. Initially, the values 0.3, 0.4, and 0.5 were roughly equally plausible, and more plausible than any other value. After observing a sample proportion of 9/30, the plausibility shifts toward values near the observed sample proportion of 0.3.
Figure 1.1: Example plot of the guesses of 30 students for the most plausible
value of the proportion of current Cal Poly students who have read at least one
Harry Potter book.
Figure 1.2: Number of students who have read at least one Harry Potter book in hypothetical samples of size 30. Five samples simulated assuming the population proportion is 0.1 (yellow), and five samples simulated assuming the population proportion is 0.5 (purple).
Figure 1.3: Simulation of the number of students who have read at least one Harry Potter book in hypothetical samples of size 30, reflecting initial plausibility of values of the population proportion from Figure 1.1. Left: 5 hypothetical samples for each guess for the population proportion. Right: 10000 hypothetical samples for each guess for the population proportion.
Figure 1.4: Left: Simulation results from the plot on the left in Figure 1.3 highlighting samples with a sample proportion of 9/30. Middle: Comparison of initial distribution of population proportion with conditional distribution of population proportion given a sample proportion of 9/30. Right: Distribution reflecting relative plausibility of possible values of the population proportion after observing a sample of 30 students in which 9 have read at least one Harry Potter book.
Figure 1.5: Left: Simulation results from the plot on the right in Figure 1.3 highlighting samples with a sample proportion of 9/30. Right: Distribution reflecting relative plausibility of possible values of the population proportion, both “prior” plausibility (blue) and “posterior” plausibility after observing a sample of 30 students in which 9 have read at least one Harry Potter book (green).
Chapter 2
Ideas of Bayesian Reasoning
In this section we’ll take a closer look at the example from the previous section.
In particular, we’ll inspect the simulation process and results in more detail.
We’ll also consider a more “realistic” scenario.
In the previous section, each student identified a “most plausible” value. We
collected these guesses to form a “wisdom of crowds” measure of initial plausi-
bility.
However, your own initial assessment of the plausibility of the different values
could involve much more than just identifying the most plausible value. What
was your next most plausible value? How much less plausible was it? What
about the other values and their relative plausibilities?
In the example below we’ll start with an assessment of plausibility. We’ll discuss
later how you might obtain such an assessment. For now, focus on the big
picture: we start with some initial assessment of plausibility before observing
data, and we want to update that assessment upon observing some data.
1. We’ll start by considering only the values 0.1, 0.2, 0.3, 0.4, 0.5, 0.6 as initially plausible for the population proportion 𝜃. Suppose that before collecting sample data, our prior assessment is that:
• 0.1 and 0.6 are equally plausible;
• 0.2 and 0.5 are equally plausible, and each is four times more plausible than 0.1;
• 0.3 and 0.4 are equally plausible, and each is two times more plausible than 0.2.
2. Discuss what the prior distribution from the previous part represents.
5. Recall the simulation with 260000 repetitions that we started above. Con-
sider the 10000 repetitions in which the population proportion is 0.1. Sup-
pose that for each of these repetitions we simulate the number of students
in a class of 30 who have read at least one HP book. On what proportion
of these repetitions would we expect to see a sample count of 9? On how
many of these 10000 repetitions would we expect to see a sample count of
9?
6. Repeat the previous part for each of the possible values of 𝜃: 0.1, … , 0.6.
Add two columns to the table: one column for the likelihood of observing
a count of 9 in a sample of size 30 for each value of 𝜃, and one column for
the expected number of repetitions in the simulation which would result
in the count of 9.
How has our assessment of plausibility changed from before observing the
sample data?
9. Prior to observing data, how many times more plausible is a value of 0.3
than 0.2 for the population proportion 𝜃?
10. Recall that we observed a sample count of 9 in a sample of size 30. How
many times more likely is a count of 9 in a sample of size 30 when the
population proportion 𝜃 is 0.3 than when it is 0.2?
11. After observing data, how many times more plausible is 0.3 than 0.2 for
the population proportion 𝜃?
12. How are the values from the three previous parts related?
1. The relative plausibilities allow us to draw the shape of the plot below:
the spike for 0.2 is four times as high as the one for 0.1; the spike for 0.3
is two times higher than the spike for 0.2, etc. To make a distribution we
need to rescale the heights, maintaining the relative ratios, so that they
add up to 1. It helps to consider one value as a “baseline”; we’ll choose
0.1 (but it doesn’t matter which value is the baseline). Assign 1 “unit” of
plausibility to the value 0.1. Then 0.6 also gets 1 unit of plausibility. The
values 0.2 and 0.5 each receive 4 units of plausibility, and the values 0.3
and 0.4 each receive 8 units of plausibility. The six values account for 26
total units of plausibility. Divide the units by 26 to obtain values that sum
to 1. See the “Prior” column in the table below. Check that the relative
ratios are maintained; for example, the rescaled prior plausibility of 0.308
for 0.3 is two times larger than the rescaled prior plausibility of 0.154 for
0.2.
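The rescaling just described can be sketched in R:

```r
# Relative plausibility units for theta = 0.1, 0.2, ..., 0.6
theta = seq(0.1, 0.6, by = 0.1)
units = c(1, 4, 8, 8, 4, 1)
# Rescale so the plausibilities sum to 1, preserving the relative ratios
prior = units / sum(units)
round(prior, 3)  # 0.038 0.154 0.308 0.308 0.154 0.038
```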
• Roll a fair 10-sided die. A roll of 1 represents a student who has read
at least one Harry Potter book; all other rolls, not.
If 𝑌 is the number of students in the sample who have read at least one HP book, then 𝑌 has a Binomial(30, 𝜃) distribution. If 𝜃 = 0.1 then 𝑌 has a Binomial(30, 0.1) distribution. If 𝜃 = 0.1 the probability that 𝑌 = 9 is $\binom{30}{9}(0.1)^9(0.9)^{21} = 0.0016$, which can be computed using dbinom(9, 30, 0.1) in R.
## [1] 0.001565
6. See the table below. For example, if 𝜃 = 0.2 then the likelihood of a sample count of 9 in a sample of size 30 is $\binom{30}{9}(0.2)^9(0.8)^{21} = 0.068$, which can be computed using dbinom(9, 30, 0.2). If 𝜃 = 0.2 then we would expect to observe a sample count of 9 in about 6.8% of samples. In the 40000 repetitions with 𝜃 = 0.2, we would expect to observe a count of 9 in about 40000 × 0.0676 ≈ 2703 repetitions.
In the table below, “Likelihood of 9” represents the probability of a sample
count of 9 in a sample of size 30 computed for each possible value of 𝜃. Note
that this column does not sum to 1, as the values in this column do not
comprise a probability distribution. Rather, the values in the likelihood
column represent the probability of the same event (sample count of 9)
computed under various different scenarios (different possible values of 𝜃).
The “Repetitions with a count of 9” column corresponds to the green dots
in Figure 1.5. The prior plausibilities and total number of repetitions are
different between the two examples, but the process is the same. (The
overall “Number of reps” column corresponds to all the dots.)
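As a sketch, the table’s likelihood and expected-count columns can be reproduced in R (the variable names here are our own):

```r
theta = seq(0.1, 0.6, by = 0.1)
prior = c(1, 4, 8, 8, 4, 1) / 26
reps = 260000 * prior               # repetitions allotted to each value of theta
likelihood = dbinom(9, 30, theta)   # probability of a sample count of 9 out of 30
expected_9 = reps * likelihood      # expected repetitions with a count of 9
round(data.frame(theta, prior, reps, likelihood, expected_9), 4)
```

For example, the row for 𝜃 = 0.2 has 40000 repetitions and about 2703 expected repetitions with a count of 9.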
8. The plot below compares our prior plausibility (blue) and our posterior
plausibility (green) after observing the data. The values 0.3 and 0.4 ini-
tially were equally plausible for 𝜃, but after observing a sample proportion
of 0.3 the value 0.3 is now almost 2 times more plausible than a value of 0.4.
The values 0.2, 0.3, 0.4 together accounted for about 77% of our initial
plausibility, but after observing the data these three values now account
for over 97% of our plausibility.
9. Our prior assessment was that a value of 0.3 is 2 times more plausible
than 0.2 for the population proportion 𝜃.
11. After observing a count of 9 in a sample of size 30, a value of 0.3 is 0.5612 /
0.1205 = 4.66 times more plausible than 0.2 for the population proportion
𝜃.
12. The ratio of the posterior plausibilities (4.66) is the product of the ratio of the prior plausibilities (2) and the ratio of the likelihoods (2.33). In short, posterior is proportional to the product of prior and likelihood.
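This relationship can be checked numerically in R, using dbinom for the likelihoods as in the solutions above:

```r
# Prior ratio of 0.3 versus 0.2 (8 units of plausibility versus 4 units)
prior_ratio = 8 / 4
# Likelihood ratio for a count of 9 out of 30
likelihood_ratio = dbinom(9, 30, 0.3) / dbinom(9, 30, 0.2)
# Posterior ratio is the product: about 2 * 2.33 = 4.66
prior_ratio * likelihood_ratio
```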
Example 2.2. In the previous example we only considered the values 0.1, 0.2,
…, 0.6 as plausible. Now we’ll consider a more realistic scenario.
1. What is the problem with considering only the values 0.1, 0.2, …, 0.6 as
plausible? How could we resolve this issue?
3. How could we use the posterior distribution to fill in the blank in the
following: “There is a [blank] percent chance that fewer than 50 percent
of current Cal Poly students have read at least one HP book.”
4. What are some other questions of interest regarding 𝜃? How could you
use the posterior distribution to answer them?
Solution to Example 2.2.
1. These six values are not the only possible values of 𝜃. The parameter 𝜃 is
a proportion, which could take any value in [0, 1]. We really want a prior
that assigns relative plausibility to all values in the continuous interval [0,
1]. One way to bridge the gap is to consider a fine grid of values in [0, 1],
rather than all possible values.
We’ll consider the possible values of 𝜃 to be 0, 0.0001, 0.0002, 0.0003, … , 0.9998, 0.9999, 1
and assign a relative plausibility to each of these values. We’ll start
with our assessment from the previous example: 0.1 and 0.6 are equally
plausible, 0.2 is four times more plausible than 0.1, etc. We’ll assign
plausibility to in between values by “smoothly connecting the dots”. In
the plot below this is achieved with a Normal distribution, but the details
are not important for now. Just understand that (1) we have expanded
our grid of possible values of 𝜃, and (2) we have assigned a relative
plausibility to each of the possible values.
2. The table has one row for each possible value of 𝜃: 0, 0.0001, 0.0002, … , 0.9999, 1.
# Prior distribution
# Grid of possible values of theta
theta = seq(0, 1, 0.0001)
# Smoothly connect the dots using a Normal distribution
# (parameters reconstructed here to match the stated ratios,
# e.g., 0.2 four times as plausible as 0.1)
prior = dnorm(theta, mean = 0.35, sd = sqrt(0.01 / log(2)))
# Then rescale to sum to 1
prior = prior / sum(prior)

# Likelihood
# Likelihood of observing sample count of 9 out of 30
# for each theta
likelihood = dbinom(9, 30, theta)

# Posterior
# Product gives relative posterior plausibilities
product = prior * likelihood
# Then rescale to sum to 1
posterior = product / sum(product)

bayes_table = data.frame(theta,
                         prior,
                         likelihood,
                         product,
                         posterior)
The plot below displays the prior, likelihood1, and posterior. Notice that the likelihood of the observed data is highest for 𝜃 near 0.3, so our plausibility has “moved” in the direction of 𝜃 near 0.3 after observing the data.
1 Prior and posterior are distributions which sum to 1, so prior and posterior are on the same scale. However, the likelihood does not sum to anything in particular. In order to plot the likelihood on the same scale, it has been rescaled to sum to 1. Only the relative shape of the likelihood matters, not its absolute scale.
3. Sum the posterior plausibilities for 𝜃 values less than 0.5. We can see from
the plot that almost all our plausibility is placed on values of 𝜃 less than
0.5.
## [1] 0.9939
## [1] 0.235
## [1] 0.411
There is an 80% chance that between 24% and 41% of Cal Poly students
have read a HP book.
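As a sketch of how such values can be computed from a grid posterior. The Normal prior parameters below are our reconstruction (mean 0.35, with the standard deviation chosen so that 0.2 is four times as plausible as 0.1 and 0.3 twice as plausible as 0.2); the book’s exact choices may differ slightly:

```r
theta = seq(0, 1, 0.0001)
# Assumed prior shape (reconstructed parameters)
prior = dnorm(theta, mean = 0.35, sd = sqrt(0.01 / log(2)))
prior = prior / sum(prior)
# Posterior after observing a count of 9 in a sample of 30
posterior = prior * dbinom(9, 30, theta)
posterior = posterior / sum(posterior)
# Posterior probability that theta is less than 0.5
sum(posterior[theta < 0.5])
# Endpoints of a central 80% interval: 10th and 90th posterior percentiles
theta[min(which(cumsum(posterior) >= 0.10))]
theta[min(which(cumsum(posterior) >= 0.90))]
```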
Chapter 3
Interpretations of Probability and Statistics
You have some familiarity with “probability” or “chance” or “odds”. But what
do we really mean when we talk about “probability”? It turns out there are two
main interpretations: relative frequency and “subjective” probability. These two
interpretations provide the philosophical foundation for two schools of statistics:
frequentist (hypothesis tests and confidence intervals that you’ve seen before)
and Bayesian (what this book is about). This chapter introduces the two inter-
pretations.
Example 3.1. How are the situations above similar, and how are they differ-
ent? What is one feature that all of the situations have in common? Is the
interpretation of “probability” the same in all situations? Take some time to
consider these questions before looking at the solution. The goal here is to just
think about these questions, and not to compute any probabilities (or to even
think about how you would).
Show/hide solution
This exercise is intended to motivate discussion, so you might have thought of
some other ideas we don’t address here. That’s good! And some of the things
you considered might come up later in the book. But here are a few thoughts
we specifically want to mention now.
The one feature that all of the situations have in common is uncertainty. Some-
times the uncertainty arises from a repeatable physical phenomenon that can
result in multiple potential outcomes, like rolling dice or drawing the winning
Powerball number. In other cases, there is uncertainty because the probability
concerns the future, like tomorrow’s high temperature or the result of the next
Superbowl. But there can also be uncertainty about the past: there are some
Federalist papers for which the author is unknown, and you probably don’t
know for sure whether or not you ate an apple on April 17, 2009.
Whenever there is uncertainty, it is reasonable to consider relative likelihoods
of potential outcomes. For example, even though you don’t know for certain
whether you ate an apple on April 17, 2009, if you’re usually an apple-a-day
person (or were when you were younger) you might think the probability is high.
We don’t know for sure what team will win the next Superbowl, but we might
think that the 49ers are more likely than the Eagles to be the winner.
While all of the situations in the example involve uncertainty, it seems that there
are different “types” of uncertainty. Even though we don’t know which side a
die will land on, the notion of “fairness” implies that the sides are “equally
likely”. Likewise, there are some rules to how the Powerball drawing works,
and it seems like these rules should determine the probability of drawing that
particular winning number.
However, there aren’t any specific “rules of uncertainty” that govern whether or
not you ate an apple on April 17, 2009. You either did or you didn’t, but that
doesn’t mean the two outcomes are necessarily equally likely. Regarding the
Superbowl, of course there are rules that govern the NFL season and playoffs,
but there are no “rules of uncertainty” that tell us precisely how likely any
particular team is to win any particular game, let alone how likely a team is to
advance to and win the Superbowl.
It also seems that there are different interpretations of probability. Given that
a six-sided die is fair, we might all agree that the probability that it lands on
any particular side is 1/6. Similarly, given the rules of the Powerball lottery,
we might all agree on the probability that a drawing results in a particular win-
ning number. However, there isn’t necessarily consensus about what the high
temperature will be in San Luis Obispo tomorrow. Different weather prediction
models, forecasters, or websites might provide different values for the probabil-
ity that the high temperature will be above 90 degrees Fahrenheit. Similarly,
Superbowl odds might vary by source. Situations like tomorrow’s weather or the
Superbowl where there is no consensus about the “rules of uncertainty” require
some subjectivity in determining probabilities.
Finally, some of these situations are repeatable. We could (in principle) roll a
pair of dice many times and see how often we get doubles, or repeat the Power-
ball drawing over and over to see how the winning numbers behave. However,
many of these situations involve something that only happens once, like tomorrow or April 17, 2009 or the next Superbowl. Even when the phenomenon
happens only once in reality, we can still develop models of what might happen
if we were to hypothetically repeat the phenomenon many times. For exam-
ple, meteorologists use historical data and meteorological models to forecast
potential paths of a hurricane.
The subject of probability concerns random phenomena. A phenomenon is
random1 if there are multiple potential outcomes, and there is uncertainty
about which outcome will occur. Uncertainty is understood in broad terms, and
in particular does not only concern future occurrences.
Some phenomena involve physical randomness2 , like flipping coins, rolling dice,
drawing Powerballs at random from a bin, or randomly selecting Cal Poly stu-
dents. In many other situations randomness just vaguely reflects uncertainty.
Contrary to colloquial uses of the word, random does not mean haphazard. In a
random phenomenon, while individual outcomes are uncertain, we will see that
there is a regular distribution of outcomes over a large number of (hypothetical)
repetitions. For example,
1 In this book, “random” and “uncertain” are synonyms; the opposite of “random” is “certain”.
2 We’re avoiding philosophical questions about what is “true” randomness, like the following. Is a coin flip really random? If all factors that affect the trajectory of the coin were known precisely, then wouldn’t the outcome be determined? Does true randomness only exist in quantum mechanics?
• In two flips of a fair coin we wouldn’t necessarily see one head and one
tail. But in 10000 flips of a fair coin, we might expect to see close to 5000
heads and 5000 tails.
• We don’t know who will win the next Superbowl, but we can and should
consider some teams as more likely to win than others. We could imagine
a large number of hypothetical 2021-2022 seasons; how often would we
expect the 49ers to win? The Eagles?
Random also does not necessarily mean equally likely. In a random phenomenon,
certain outcomes or events might be more or less likely than others. For example,
• It’s much more likely than not that a randomly selected Cal Poly student
is a California resident.
• Not all NFL teams are equally likely to win the next Superbowl.
We’re not sticklers; we’ll refer to probabilities both as decimals and as percentages.
3.2 Interpretations of Probability
Figure 3.1: Results of one million sets of three rolls of fair six-sided dice. Sets in which the sum of the dice is 9 (10) are represented by the orange (blue) spike.
1620. However, unbeknownst to Galileo, the same problem had been solved almost 100 years
earlier by Gerolamo Cardano, one of the first mathematicians to study probability.
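The simulation behind Figure 3.1 can be sketched in R. With three fair dice there are 25 of the 216 equally likely outcomes that sum to 9 and 27 that sum to 10, so a sum of 10 is slightly more likely than a sum of 9:

```r
set.seed(42)  # arbitrary seed
n = 1000000
# n sets of three rolls of a fair six-sided die
rolls = matrix(sample(1:6, 3 * n, replace = TRUE), ncol = 3)
sums = rowSums(rolls)
mean(sums == 9)   # close to 25/216, about 0.116
mean(sums == 10)  # close to 27/216, about 0.125
```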
Table 3.1: Results and running proportion of H for 10 flips of a fair coin.
We might all agree that the probability that a single flip of a fair coin lands on heads is 1/2, a.k.a. 0.5, a.k.a. 50%. After all, the notion of “fairness” implies that the two outcomes, heads and tails, should be equally likely, so we have a “50/50 chance” of heads. But how else can we interpret this 50%? As in the dice rolling problem, we can consider what would happen if we flipped the coin many times. Now, if we flipped the coin twice, we wouldn’t expect to necessarily see one head and one tail. But in many flips, we might expect to see heads on something close to 50% of flips.
Let’s try this out. Table 3.1 displays the results of 10 flips of a fair coin. The
first column is the flip number and the second column is the result of the flip.
The third column displays the running proportion of flips that result in H. For
example, the first flip results in T so the running proportion of H after 1 flip
is 0/1; the first two flips result in (T, H) so the running proportion of H after
2 flips is 1/2; and so on. Figure 3.2 plots the running proportion of H by the
number of flips. We see that with just a small number of flips, the proportion
of H fluctuates considerably and is not guaranteed to be close to 0.5. Of course,
the results depend on the particular sequence of coin flips. We encourage you
to flip a coin 10 times and compare your results.
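The running proportion column of Table 3.1 can be computed in R with cumsum; here is a sketch for 10 flips of a fair coin:

```r
set.seed(1)  # arbitrary seed
flips = sample(c("H", "T"), 10, replace = TRUE)
# Running proportion of H after each flip
running_prop = cumsum(flips == "H") / seq_along(flips)
running_prop
```

For example, the first two flips in Table 3.1 were (T, H), giving running proportions 0/1 and 1/2.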
Now we’ll flip the coin 90 more times for a total of 100 flips. The plot on the left
in Figure 3.3 summarizes the results, while the plot on the right also displays
the results for 3 additional sets of 100 flips. The running proportion fluctuates
considerably in the early stages, but settles down and tends to get closer to
0.5 as the number of flips increases. However, each of the four sets results in
a different proportion of heads after 100 flips: 0.5 (blue), 0.44 (orange), 0.56
(green), 0.56 (purple). Even after 100 flips the proportion of flips that result in
H isn’t guaranteed to be very close to 0.5.
Now for each set of 100 flips, we’ll flip the coin 900 more times for a total of 1000
flips in each of the four sets. The plot on the left in Figure 3.4 summarizes the
Figure 3.2: Running proportion of H versus number of flips for the 10 coin flips
in Table 3.1.
Figure 3.3: Running proportion of H versus number of flips for four sets of 100
coin flips.
results for our original set, while the plot on the right also displays the results
for the three additional sets from Figure 3.3. Again, the running proportion
fluctuates considerably in the early stages, but settles down and tends to get
closer to 0.5 as the number of flips increases. Compared to the results after 100
flips, there is less variability between sets in the proportion of H after 1000 flips:
0.51 (blue), 0.488 (orange), 0.525 (green), 0.492 (purple). Now, even after 1000
flips the proportion of flips that result in H isn’t guaranteed to be exactly 0.5,
but we see a tendency for the proportion to get closer to 0.5 as the number of
flips increases.
Figure 3.4: Running proportion of H versus number of flips for four sets of 1000
coin flips.
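The running-proportion experiment in Figures 3.2 through 3.4 is easy to reproduce yourself. Here is a minimal simulation sketch (in Python, though the book’s figures are produced in R); the function name and seed are our own choices, not from the text.

```python
import random

def running_proportion_of_heads(num_flips, seed=None):
    """Flip a fair coin num_flips times and return the running proportion of H."""
    rng = random.Random(seed)
    heads = 0
    proportions = []
    for flip in range(1, num_flips + 1):
        heads += rng.random() < 0.5  # each flip lands H with probability 0.5
        proportions.append(heads / flip)
    return proportions

props = running_proportion_of_heads(10000, seed=42)
# Early proportions fluctuate; later ones tend to settle near 0.5
print(props[9], props[99], props[999], props[9999])
```

Re-running with different seeds mimics the different colored paths in Figure 3.3: each set of flips takes its own route, but all of them tend toward 0.5 as the number of flips grows.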
Each source, as well as many others, assigns different probabilities to the Chiefs
and Packers winning. Which source, if any, is “correct”?
When the situation involves a fair coin flip, we could perform a simulation to
see that the long run proportion of flips that land on H is 0.5, and so the
probability that a fair coin flip lands on H is 0.5. Even though the actual
2022 Superbowl will only happen once, we could still perform a simulation
about 26% of repetitions the Chiefs would win the Superbowl, and in about
8% of repetitions the Cowboys would win. Of course, different sets of sub-
jective probabilities correspond to different assumptions and different ways of
conducting the simulation.
Subjective probabilities can be calibrated by weighing the relative favorability
of different bets, as in the following example.
Example 3.2. What is your subjective probability that Professor Ross has a
TikTok account? Consider the following two bets, and suppose you must choose
only one.⁶
A) You win $100 if Professor Ross has a TikTok account, and you win nothing
otherwise.
B) A box contains 40 green and 60 gold marbles that are otherwise identical.
The marbles are thoroughly mixed and one marble is selected at random.
You win $100 if the selected marble is green, and you win nothing other-
wise.
1. Which of the above bets would you prefer? Or are you completely indiffer-
ent? What does this say about your subjective probability that Professor
Ross has a TikTok account?
2. If you preferred bet B to bet A, consider bet C which has a similar setup
to B but now there are 20 green and 80 gold marbles. Do you prefer bet
A or bet C? What does this say about your subjective probability that
Professor Ross has a TikTok account?
3. If you preferred bet A to bet B, consider bet D which has a similar setup
to B but now there are 60 green and 40 gold marbles. Do you prefer bet
A or bet D? What does this say about your subjective probability that
Professor Ross has a TikTok account?
4. Continue to consider different numbers of green and gold marbles. Can
you zero in on your subjective probability?
Show/hide solution
1. Since the two bets have the same payouts, you should prefer the one that
gives you a greater chance of winning! If you choose bet B you have a
40% chance of winning.
• If you prefer bet B to bet A, then your subjective probability that
Professor Ross has a TikTok account is less than 40%.
• If you prefer bet A to bet B, then your subjective probability that
Professor Ross has a TikTok account is greater than 40%.
6 We do not advocate gambling. We merely use gambling contexts to motivate probability
concepts.
4. Continuing in this way you can narrow down your subjective probability.
For example, if you prefer bet B to bet A and bet A to bet C, your
subjective probability is between 20% and 40%. Then you might consider
bet E corresponding to 30 green marbles and 70 gold to determine if your
subjective probability is greater than or less than 30%. At some point it
will be hard to choose, and you will be in the ballpark of your subjective
probability. (Think of it like going to the eye doctor: “which is better: 1
or 2?” At some point you can’t really see a difference.)
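The “zero in” process in part 4 is essentially a bisection search on the fraction of green marbles. The helper below is hypothetical (not from the text): `prefers_bet_A(p)` stands in for your answer to “do you prefer bet A to a marble bet with green fraction p?”

```python
def elicit_probability(prefers_bet_A, tolerance=0.01):
    """Bisection search for the green-marble fraction at which you are
    indifferent between bet A and the marble bet.
    prefers_bet_A(p) -> True if you prefer bet A to a p chance of $100."""
    low, high = 0.0, 1.0
    while high - low > tolerance:
        p = (low + high) / 2
        if prefers_bet_A(p):
            low = p   # your subjective probability exceeds p
        else:
            high = p  # your subjective probability is below p
    return (low + high) / 2

# Example: a respondent whose (unstated) subjective probability is 0.3
estimate = elicit_probability(lambda p: 0.3 > p)
print(round(estimate, 2))
```

The `lambda` plays the role of a perfectly consistent respondent; real answers get fuzzy near the indifference point, which is exactly the “1 or 2?” effect described above.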
Of course, the strategy in the above example isn’t an exact science, and there is
a lot of behavioral psychology behind how people make choices in situations like
this, especially when betting with real money. But the example provides a very
rough idea of how you might discern a subjective probability of an event. The
example also illustrates that probabilities can be “personal”; your information
or assumptions will influence your assessment of the likelihood.
We close this section with some brief comments about subjectivity. Subjec-
tivity is not bad; “subjective” is not a “dirty” word. Any probability model
involves some subjectivity, even when probabilities can be interpreted naturally
as long run relative frequencies. For example, assuming a die is fair does not
codify an objective truth about the die. Instead, “fairness” reflects a reason-
able and tractable mathematical model. In the real world, any “fair” six-sided
die has small physical imperfections that cause the six faces to have different
probabilities. However, the differences are usually small enough to be ignored
for most practical purposes. Assuming that the probability that the die lands
Figure 3.5: The three marble bins in Example 3.2. Left: Bet B, 40% chance
of selecting green. Middle: Bet C, 20% chance of selecting green. Right: Bet D,
60% chance of selecting green.
on each side is 1/6 is much more tractable than assuming the probability of a
1 is 0.1666666668, the probability of a 2 is 0.1666666665, etc. (Furthermore,
measuring the probability of each side so precisely would be extremely diffi-
cult.) But assuming that the probability that the die lands on each side is 1/6
is also subjective. We might readily agree to assume that the probability that
a six-sided die lands on 1 is 1/6, but we might not reach a consensus on the
probability that the Chiefs win the Superbowl. But the fact that there can be
many reasonable probability models for a situation like the 2022 Superbowl does
not make the corresponding subjective probabilities any less valid than long run
relative frequencies.
With either the long run relative frequency or subjective probability interpre-
tation there are some basic logical consistency requirements which probabilities
need to satisfy. Roughly, probabilities cannot be negative and the sum of prob-
abilities over all possible outcomes must be 100%.
Example 3.3. As of Dec 30, FiveThirtyEight listed the following probabilities
for who will win the 2022 Superbowl.
Team Probability
Kansas City Chiefs 26%
Green Bay Packers 24%
Tampa Bay Buccaneers 9%
Dallas Cowboys 8%
Other
Show/hide solution
2. 74%. Either the Chiefs win or they don’t; if there’s a 26% chance that the
Chiefs win, there must be a 74% chance that they do not win. If we think
of this as a simulation with 10000 repetitions, each repetition results in
either the Chiefs winning or not, so if they win in 2600 of repetitions then
they must not win in the other 7400.
3. 67%. There is only one Superbowl champion, so if say the Chiefs win then
no other team can win. Thinking again of the simulation, the repetitions
in which the Chiefs win are distinct from those in which the Cowboys
win. So if the Chiefs win in 2600 repetitions and the Cowboys win in
800 repetitions, then in a total of 3400 repetitions either the Chiefs or
Cowboys win. Adding the four probabilities, we see that the probability
that one of the four teams above wins must be 67%.
4. 33%. Either one of the four teams above wins, or some other team wins.
If one of the four teams above wins in 6700 repetitions, then in 3300
repetitions the winner is not one of these four teams.
Example 3.4. Suppose your subjective probabilities for the 2022 Superbowl
champion satisfy the following conditions.
Construct a table of your subjective probabilities like the one in Example 3.3.
Show/hide solution
Here, probabilities are specified indirectly via relative likelihoods. We need to
find probabilities that are in the given ratios and add up to 100%. It helps
to designate one outcome as the “baseline”. It doesn’t matter which one; we’ll
choose the Cowboys.
• Suppose the Cowboys account for 1 “unit”. It doesn’t really matter what
a unit is, but let’s say it corresponds to 1000 repetitions of the simulation.
That is, the Cowboys win in 1000 repetitions. Careful: we haven’t yet
specified how many total repetitions we have done, or how many units the
entire simulation accounts for. We’re just starting with a baseline of what
happens for the Cowboys.
• The Cowboys and Buccaneers are equally likely to win, so the Buccaneers
also account for 1 unit.
• The Packers are 1.5 times more likely than the Cowboys to win, so the
Packers account for 1.5 units. If 1 unit is 1000 repetitions, then the Packers
win in 1500 repetitions, 1.5 times more often than the Cowboys.
• The Chiefs are 2 times more likely than the Packers to win, so the Chiefs
account for 2 × 1.5 = 3 units. If 1 unit is 1000 repetitions, then the Chiefs
win in 3000 repetitions.
• The four teams account for a total of 1 + 1 + 1.5 + 3 = 6.5 units. Since the
winner is as likely to be among these four teams as not, then “Other” also
accounts for 6.5 units.
• In total, there are 13 units which account for 100% of the probability.
The Cowboys account for 1 unit, so their probability of winning is 1/13
or about 7.7%. Likewise, the probability that the Chiefs win is 3/13 or
about 23.1%.
You should verify that all of the probabilities are in the specified ratios. For
example, the Chiefs are 2 times more likely (2 ≈ 23.1/11.5) than the Packers
to win, and the Packers are 1.5 times more likely (1.5 ≈ 11.5/7.7) than the
Cowboys to win.
We could have also solved this problem using algebra. Let 𝑥 be the probability,
as a decimal, that the Cowboys are the winner. (Again, it doesn’t matter which
team is the baseline.) Then 𝑥 is also the probability that the Buccaneers are
the winner, 1.5𝑥 for the Packers, and 3𝑥 for the Chiefs. The probability that
one of the four teams wins is 𝑥 + 𝑥 + 1.5𝑥 + 3𝑥 = 6.5𝑥, so the probability of
Other is also 6.5𝑥. The probabilities in decimal form must sum to 1 (that is,
100%), so 1 = 𝑥 + 𝑥 + 1.5𝑥 + 3𝑥 + 6.5𝑥 = 13𝑥. Solve for 𝑥 = 1/13 and then plug
in 𝑥 = 1/13 to find the other probabilities.
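Both the “units” bookkeeping and the algebra amount to the same computation: normalize relative weights so they sum to 1. A quick sketch (in Python; team names as in the example):

```python
# Relative likelihoods of winning, with the Cowboys as the baseline (1 unit)
units = {"Cowboys": 1, "Buccaneers": 1, "Packers": 1.5, "Chiefs": 3}
# "Other" is as likely as the four teams combined: 1 + 1 + 1.5 + 3 = 6.5 units
units["Other"] = sum(units.values())

total = sum(units.values())  # 13 units in all
probabilities = {team: u / total for team, u in units.items()}

for team, p in probabilities.items():
    print(f"{team}: {p:.3f}")
# Cowboys and Buccaneers: 1/13 ~ 0.077; Packers: 1.5/13 ~ 0.115;
# Chiefs: 3/13 ~ 0.231; Other: 6.5/13 = 0.5
```

Changing the baseline team only rescales the units; the normalized probabilities come out the same, which is why the choice of baseline “doesn’t matter”.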
Figure 3.6: Subjective probabilities of winning the 2022 Superbowl for each team in Example 3.4.
Example 3.5. How old do you think your instructor (Professor Ross) currently
is?⁷ Consider age on a continuous scale, e.g., you might be 20.73 or 21.36 or
19.50.
7 You could probably get a pretty good idea by searching online, but don’t do that. Instead,
answer the questions based on what you already know about me.
In this example, you will use probability to quantify your uncertainty about your
instructor’s age. You only need to give ballpark estimates of your subjective
probabilities, but you might consider what kinds of bets you would be willing
to accept like in Example 3.2. (This exercise just motivates some ideas. We’ll
fill in lots of details later.)
Show/hide solution
Even though in reality your instructor’s current age is a fixed number, its value
is unknown or uncertain to you, and you can use probability to quantify this
uncertainty. You would probably be willing to bet any amount of money that
your instructor is over 20 years old, so you would assign a probability of 100%
to that event, and 0% to the event that he’s at most 20 years old. Let’s say
you’re pretty sure that he’s over 30, but you don’t know that for a fact, so you
assign a probability of 99% to that event (and 1% to the event that he’s at most
30). You think he’s over 40, but you’re even less sure about that, so maybe you
assign the event that he’s over 40 a probability of 67% (say you’d accept a bet
at 2 to 1 odds). You think there’s a 50/50 chance that he’s over 50. You’re
95% sure that he’s between 35 and 60. And so on. Continuing in this way, you
can start to determine a probability distribution to represent your beliefs about
the instructor’s age. Your distribution should correspond to your subjective
probabilities. For example, the distribution should assign a probability of 67%
to values over 40.
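One way to turn such statements into a full distribution is to posit a parametric family and check it against the elicited probabilities. The sketch below is purely illustrative: it evaluates a hypothetical Normal(50, 10) candidate using only the standard library. If its probabilities disagree with yours, you would adjust the mean and standard deviation and try again.

```python
from math import erf, sqrt

def normal_cdf(x, mean, sd):
    """P(X <= x) for X ~ Normal(mean, sd), via the error function."""
    return 0.5 * (1 + erf((x - mean) / (sd * sqrt(2))))

mean, sd = 50, 10  # candidate distribution (illustrative values)

p_over_50 = 1 - normal_cdf(50, mean, sd)   # matches the "50/50 chance over 50"
p_over_40 = 1 - normal_cdf(40, mean, sd)   # compare to the elicited 67%
p_35_to_60 = normal_cdf(60, mean, sd) - normal_cdf(35, mean, sd)  # compare to 95%

print(round(p_over_50, 3), round(p_over_40, 3), round(p_35_to_60, 3))
```

Here the candidate gets the median right but is too confident about “over 40” and not confident enough about “between 35 and 60”, so you would keep tuning, or conclude that no normal distribution matches all of your stated probabilities exactly.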
This is just one example. Different students will have different distributions
depending upon (1) how much information you know about the instructor, and
(2) how that information informs your beliefs about the instructor’s age. We’ll
see some example plots in the next exercise.
Regarding the last question, since we are using a probability distribution to
quantify our uncertainty about 𝜃, we are treating 𝜃 as a random variable.
Recall that a random variable is a numerical quantity whose value is deter-
mined by the outcome of a random or uncertain phenomenon. The random
phenomenon might involve physically repeatable randomness, as in “flip a coin
10 times and count the number of heads.” But remember that “random” just
means “uncertain” and there are lots of different kinds of uncertainty. For ex-
ample, the total number of points scored in the 2022 Superbowl will be one and
only one number, but since we don’t know what that number is we can treat
it as a random variable. Treating the number of points as a random variable
allows us to quantify our uncertainty about it through probability statements
like “there is a 60% chance that fewer than 45 points will be scored in Superbowl
2022”.
The (probability) distribution of a random variable specifies the possible val-
ues of the random variable and a way of determining corresponding probabilities.
Like probabilities themselves, probability distributions of random variables can
also be interpreted as:
As the name suggests, different individuals might have different subjective prob-
ability distributions for the same random variable.
Example 3.6. Continuing Example 3.5, Figure 3.7 displays the subjective prob-
ability distribution of the instructor’s age for four students.
Figure 3.7: Subjective probability distributions of instructor age for four students (Ariana, Billie, Cardi, and Dua) in Example 3.6.
Show/hide solution
Ariana’s distribution is spread widely: she believes the instructor can be any age between 25 and 75. Billie is fairly certain that
the instructor is close to 45, and she’s basically 100% certain that the
instructor is between 35 and 55.
Example 3.7. Consider Ariana’s subjective probability distribution in Figure
3.7. Ariana learns that her instructor received a Ph.D. in 2006. How would her
subjective probability distribution change?
Show/hide solution
Ariana’s original subjective probability distribution reflects very little knowledge
about her instructor. Ariana now reasons that her instructor was probably be-
tween 25 and 35 when he received his Ph.D. in 2006, so she revises her subjective
probability distribution to place almost 100% probability on ages between 40
and 50. Ariana’s subjective probability distribution now looks more like Billie’s
in Figure 3.7.
The previous examples introduce how probability can be used to quantify uncer-
tainty about unknown numbers. One key aspect of Bayesian analyses is applying
a subjective probability distribution to a parameter in a statistical model.
Example 3.8. Let 𝜃𝑏 represent the proportion of current Cal Poly students who
have ever read any of the books in the Harry Potter series. Let 𝜃𝑚 represent the
proportion of current Cal Poly students who have ever seen any of the movies
in the Harry Potter series.
Show/hide solution
2. Since we don’t have data on the entire population, the values of 𝜃𝑏 and
𝜃𝑚 are unknown, uncertain.
3. 𝜃𝑏 and 𝜃𝑚 are proportions so they take values between 0 and 1. Any value
on the continuous scale between 0 and 1 is theoretically possible, though
the values are not equally plausible.
4. Results will vary, but here’s my thought process. I think that a strong
majority of Cal Poly students have seen at least one Harry Potter movie,
maybe 80% or so. I wouldn’t be that surprised if it were even close to
100%, but I would be pretty surprised if it were less than 60%.
However, I’m less certain about 𝜃𝑏 . I suspect that fewer than 50% of
students have read at least one Harry Potter book, but I’m not very sure
and I wouldn’t be too surprised if it were actually more than 50%.
See Figure 3.8 for what my subjective probability distributions might look
like.
5. I’m less certain about 𝜃𝑏 , so its density is “spread out” over a wider range
of values.
6. The values of 𝜃𝑏 and 𝜃𝑚 are still unknown, but I am less uncertain about
their values now that I have observed some data. The sample proportion
who have watched a Harry Potter movie is 30/35 = 0.857, which is pretty
consistent with my initial beliefs. But now I update my subjective distri-
bution to concentrate even more of my subjective probability on values in
the 80 percent range.
I had suspected that 𝜃𝑏 was less than 0.5, so the observed sample propor-
tion of 21/35 = 0.6 goes against my expectations. However, I was fairly
uncertain about the value of 𝜃𝑏 prior to observing the data, so 0.6 is not
too surprising to me. I update my subjective distribution so that it’s cen-
tered closer to 0.6, while still allowing for my suspicion that 𝜃𝑏 is less than
0.5.
See Figure 3.9 for what my subjective probability distributions might look
like after observing the sample data. Of course, the sample proportions
are not necessarily equal to the population proportions. But if the sam-
ples are reasonably representative, I would hope that the observed sample
proportions are close to the respective population proportions. Even after
observing data, there is still uncertainty about the parameters 𝜃𝑏 and 𝜃𝑚 ,
and my subjective distributions quantify this uncertainty.
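The updating described above can be made concrete with a simple grid approximation: place prior probability on a grid of candidate values of the proportion, weight each by the binomial likelihood of the observed data, and renormalize. The particular grid and prior weights below are illustrative stand-ins for subjective beliefs, not values from the text.

```python
from math import comb

def grid_update(prior, y, n):
    """Posterior over a grid of theta values after observing y successes in n trials.
    prior: dict mapping theta -> prior probability (sums to 1)."""
    product = {theta: p * comb(n, y) * theta**y * (1 - theta)**(n - y)
               for theta, p in prior.items()}
    total = sum(product.values())  # normalizing constant
    return {theta: v / total for theta, v in product.items()}

# Movie proportion: a rough subjective prior concentrated near 0.8 (illustrative)
grid = [0.5, 0.6, 0.7, 0.8, 0.9]
prior = dict(zip(grid, [0.05, 0.10, 0.20, 0.40, 0.25]))

posterior = grid_update(prior, y=30, n=35)  # 30 of 35 sampled students watched a movie
print({t: round(p, 3) for t, p in posterior.items()})
```

Because the sample proportion 30/35 ≈ 0.857 agrees with the prior’s center, the posterior concentrates even more probability on values near 0.8 and 0.9, mirroring the verbal updating in the solution.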
Figure 3.8: Prior distributions for the population proportions of Cal Poly students who have watched a Harry Potter movie and who have read a Harry Potter book.
Figure 3.9: Prior and posterior distributions for the population proportions after observing the sample data.
Poly students that have read any of the books in the Harry Potter series is
between 0.434 and 0.766. Does this mean that there is a 95% probability that
𝜃𝑏 is between 0.434 and 0.766? No! In a frequentist analysis, the parameter 𝜃𝑏
is treated like a fixed constant. That constant is either between 0.434 and 0.766
or it’s not; we don’t know which it is, but there’s no probability to it. In a
frequentist analysis, it doesn’t make sense to say “what is the probability that
𝜃𝑏 (a number) is between 0.434 and 0.766?” just like it doesn’t make sense to
say “what is the probability that 0.5 is between 0.434 and 0.766?” Remember
that 95% confidence derives from the fact that for 95% of samples the procedure
that was used to produce the interval [0.434, 0.766] will produce intervals that
contain the true parameter 𝜃𝑏 . It is the samples and the intervals that are
changing from sample to sample; 𝜃𝑏 stays constant at its fixed but unknown
value. In a frequentist analysis, probability quantifies the randomness in the
sampling procedure.
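The “for 95% of samples” interpretation can be checked by simulation: fix a true 𝜃, draw many samples, build an interval from each, and count how often the intervals cover 𝜃. The sketch below uses the standard Wald interval as the procedure (an assumption; the text does not specify which interval method produced [0.434, 0.766]).

```python
import random
from math import sqrt

def wald_interval(y, n, z=1.96):
    """Approximate 95% confidence interval for a proportion (Wald method)."""
    p_hat = y / n
    half = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half

rng = random.Random(0)
theta, n, reps = 0.6, 200, 2000  # fixed true parameter; many hypothetical samples
covered = 0
for _ in range(reps):
    y = sum(rng.random() < theta for _ in range(n))
    lo, hi = wald_interval(y, n)
    covered += lo <= theta <= hi

print(covered / reps)  # close to 0.95: the intervals vary, theta does not
```

Notice what is random in this simulation: the endpoints change from sample to sample while 𝜃 stays fixed, which is exactly the frequentist reading described above.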
On the other hand, in a Bayesian statistical analysis, since a parameter 𝜃 is
unknown — that is, its value is uncertain to the observer — 𝜃 is treated as
a random variable. That is, in Bayesian statistical analyses unknown
parameters are random variables that have probability distributions.
The probability distribution of a parameter quantifies the degree of uncertainty
about the value of the parameter. Therefore, the Bayesian perspective allows
for probability statements about parameters. For example, a Bayesian analysis
of the previous example might conclude that there is a 95% chance that 𝜃𝑏 is
between 0.426 and 0.721. Such a statement is valid in the Bayesian context, but
nonsensical in the frequentist context.
In the previous example, we started with distributions that represented our
uncertainty about 𝜃𝑏 and 𝜃𝑚 based on our “beliefs”, then we revised these
distributions after observing some data. If we were to observe more data, we
could revise again. In this course we will see (among other things) (1) how to
quantify uncertainty about parameters using probability distributions, and (2)
how to update those distributions to reflect new data.
Throughout these notes we will focus on Bayesian statistical analyses. We will
occasionally compare Bayesian and frequentist analyses and viewpoints. But we
want to make clear from the start: Bayesian versus frequentist is NOT a question
of right versus wrong. Both Bayesian and frequentist are valid approaches to
statistical analyses, each with advantages and disadvantages. We’ll address
some of the issues along the way. But at no point in your career do you need
to make a definitive decision to be a Bayesian or a frequentist; a good modern
statistician is probably a bit of both.
Chapter 4
Bayes’ Rule
Example 4.1. A recent survey of American adults asked: “Based on what you
have heard or read, which of the following two statements best describes the
scientific method?”
How does the response to this question change based on education level? Sup-
pose education level is classified as: high school or less (HS), some college but
no Bachelor’s degree (college), Bachelor’s degree (Bachelor’s), or postgraduate
degree (postgraduate). The education breakdown is
• Among those who agree with “iterative”: 31.3% HS, 27.6% college, 22.9%
Bachelor’s, and 18.2% postgraduate.
• Among those who agree with “unchanging”: 38.6% HS, 31.4% college,
19.7% Bachelor’s, and 10.3% postgraduate.
• Among those “not sure”: 57.3% HS, 27.2% college, 9.7% Bachelor’s, and
5.8% postgraduate.
1 This section only covers Bayes’ rule for events. We’ll see Bayes’ rule for distributions of random variables later.
Show/hide solution
1. This is essentially the same question as the last part of the previous prob-
lem, just with different terminology.
• The hypothesis is 𝐻1 , the event that the randomly selected adult
agrees with the “iterative” statement.
Bayes’ rule is often used when there are multiple hypotheses or cases. Suppose
𝐻1 , … , 𝐻𝑘 is a series of distinct hypotheses which together account for all
possibilities3 , and 𝐸 is any event (evidence). Then Bayes’ rule implies that the
posterior probability of any particular hypothesis 𝐻𝑗 satisfies

𝑃 (𝐻𝑗 |𝐸) = 𝑃 (𝐸|𝐻𝑗 )𝑃 (𝐻𝑗 ) / 𝑃 (𝐸)
The law of total probability says that we can interpret the unconditional prob-
ability 𝑃 (𝐸) as a probability-weighted average of the case-by-case conditional
probabilities 𝑃 (𝐸|𝐻𝑖 ) where the weights 𝑃 (𝐻𝑖 ) represent the probability of en-
countering each case.
Combining Bayes’ rule with the law of total probability,
𝑃 (𝐻𝑗 |𝐸) = 𝑃 (𝐸|𝐻𝑗 )𝑃 (𝐻𝑗 ) / 𝑃 (𝐸)
         = 𝑃 (𝐸|𝐻𝑗 )𝑃 (𝐻𝑗 ) / (𝑃 (𝐸|𝐻1 )𝑃 (𝐻1 ) + ⋯ + 𝑃 (𝐸|𝐻𝑘 )𝑃 (𝐻𝑘 ))
         ∝ 𝑃 (𝐸|𝐻𝑗 )𝑃 (𝐻𝑗 )
The symbol ∝ is read “is proportional to”. The relative ratios of the posterior
probabilities of different hypotheses are determined by the product of the prior
probabilities and the likelihoods, 𝑃 (𝐸|𝐻𝑗 )𝑃 (𝐻𝑗 ). The marginal probability of
the evidence, 𝑃 (𝐸), in the denominator simply normalizes the numerators to
ensure that the updated probabilities sum to 1 over all the distinct hypotheses.
In short, Bayes’ rule says⁴: posterior is proportional to likelihood times prior.
In the previous examples, the prior probabilities for an American adult’s per-
ception of the scientific method are 0.70 for “iterative”, 0.14 for “unchanging”,
and 0.16 for “not sure”. After observing that the American has a postgrad-
uate degree, the posterior probabilities for an American adult’s perception of
the scientific method become 0.8432 for “iterative”, 0.0954 for “unchanging”,
and 0.0614 for “not sure”. The following organizes the calculations in a Bayes’
table which illustrates “posterior is proportional to likelihood times prior”.
hypothesis prior likelihood product posterior
iterative 0.70 0.182 0.1274 0.8432
unchanging 0.14 0.103 0.0144 0.0954
not sure 0.16 0.058 0.0093 0.0614
sum 1.00 NA 0.1511 1.0000
The likelihood column depends on the evidence, in this case, observing that the
American has a postgraduate degree. This column contains the probability of
4 “Posterior is proportional to likelihood times prior” summarizes the whole course in a
single sentence.
the same event, 𝐸 = “the American has a postgraduate degree”, under each of
the distinct hypotheses:
• 𝑃 (𝐸|𝐻1 ) = 0.182, given the American agrees with the “iterative” state-
ment
• 𝑃 (𝐸|𝐻2 ) = 0.103, given the American agrees with the “unchanging” state-
ment
• 𝑃 (𝐸|𝐻3 ) = 0.058, given the American is “not sure”
Since each of these probabilities is computed under a different case, these values
do not need to add up to anything in particular. The sum of the likelihoods
is meaningless, which is why we have listed a sum of “NA” for the likelihood
column.
The “product” column contains the product of the values in the prior and like-
lihood columns. The product of prior and likelihood for “iterative” (0.1274)
is 8.835 (0.1274/0.0144) times higher than the product of prior and likelihood
for “unchanging” (0.0144). Therefore, Bayes’ rule implies that the conditional
probability that an American with a postgraduate degree agrees with “iterative”
should be 8.835 times higher than the conditional probability that an American
with a postgraduate degree agrees with “unchanging”. Similarly, the conditional
probability that an American with a postgraduate degree agrees with “iterative”
should be 0.1274/0.0093 = 13.73 times higher than the conditional probability
that an American with a postgraduate degree is “not sure”, and the condi-
tional probability that an American with a postgraduate degree agrees with
“unchanging” should be 0.0144/0.0093 = 1.55 times higher than the conditional
probability that an American with a postgraduate degree is “not sure”. The last
column just translates these relative relationships into probabilities that sum to
1.
The sum of the “product” column is 𝑃 (𝐸), the marginal probability of the
evidence. The sum of the product column represents the result of the law of total
probability calculation. However, for the purposes of determining the posterior
probabilities, it isn’t really important what 𝑃 (𝐸) is. Rather, it is the ratio of
the values in the “product” column that determine the posterior probabilities.
𝑃 (𝐸) is whatever it needs to be to ensure that the posterior probabilities sum
to 1 while maintaining the proper ratios.
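The whole Bayes table can be reproduced in a few lines from “posterior is proportional to likelihood times prior” (a Python sketch; the numbers are the ones in the table above):

```python
prior = {"iterative": 0.70, "unchanging": 0.14, "not sure": 0.16}
# Likelihood of the evidence (postgraduate degree) under each hypothesis
likelihood = {"iterative": 0.182, "unchanging": 0.103, "not sure": 0.058}

# Product column: prior times likelihood, hypothesis by hypothesis
product = {h: prior[h] * likelihood[h] for h in prior}

# P(E), the marginal probability of the evidence, via the law of total probability
marginal = sum(product.values())

# Normalize so the posterior probabilities sum to 1
posterior = {h: product[h] / marginal for h in prior}

for h in prior:
    print(f"{h:>10}  prior={prior[h]:.2f}  product={product[h]:.4f}  "
          f"posterior={posterior[h]:.4f}")
```

Dividing every product by the same constant preserves the ratios between hypotheses, which is exactly why only the relative sizes of the product column matter.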
The process of conditioning can be thought of as “slicing and renormalizing”.
We will see that the “slicing and renormalizing” interpretation also applies when
dealing with conditional distributions of random variables, and corresponding
plots. Slicing determines the shape; renormalizing determines the scale. Slicing
determines relative probabilities; renormalizing just makes sure they “add up”
to 1 while maintaining the proper ratios.
Example 4.3. Now suppose we want to compute the posterior probabilities for
an American adult’s perception of the scientific method given that the randomly
selected American adult has some college but no Bachelor’s degree (“college”).
Show/hide solution
1. We start with the same prior probabilities as before: 0.70 for iterative,
0.14 for unchanging, 0.16 for not sure. Now the evidence is that the
American has some college but no Bachelor’s degree. The likelihood of
the evidence (“college”) is 0.276 under the iterative hypothesis, 0.314 un-
der the unchanging hypothesis, and 0.272 under the not sure hypothesis.
The likelihood of the evidence does not change as much across the dif-
ferent hypotheses when the evidence is “college” than when the evidence
was “postgraduate degree”. Therefore, the changes from prior to poste-
rior should be less extreme when the evidence is “college” than when the
evidence was “postgraduate degree”. Furthermore, since the likelihood
doesn’t vary much across hypotheses when the evidence is “college” we
expect the posterior probabilities to be close to the prior probabilities.
2. See the table below. As expected, the posterior probabilities are closer
to the prior probabilities when the evidence is “college” than when the
evidence is “postgraduate degree”.
library(dplyr) # for %>% and add_row

hypothesis = c("iterative", "unchanging", "not sure")
prior = c(0.70, 0.14, 0.16)
# likelihood of the evidence ("college") under each hypothesis
likelihood = c(0.276, 0.314, 0.272)
product = prior * likelihood
posterior = product / sum(product)

bayes_table = data.frame(hypothesis,
                         prior,
                         likelihood,
                         product,
                         posterior) %>%
  add_row(hypothesis = "sum",
          prior = sum(prior),
          likelihood = NA,
          product = sum(product),
          posterior = sum(posterior))
Chapter 5
Introduction to Estimation
1. If you were to estimate 𝜃 with a single number based on this sample data
alone, intuitively what number would you pick?
2. For a general 𝑛 and 𝜃, what is the distribution of 𝑌 ?
3. For the next few parts suppose 𝑛 = 12. For a moment we’ll only consider
these potential values for 𝜃: 0.1, 0.3, 0.5, 0.7, 0.9. If 𝜃 = 0.1 what is the
distribution of 𝑌 ? Compute and interpret the probability that 𝑌 = 8 if
𝜃 = 0.1.
Show/hide solution
• For given data 𝑦, the likelihood function 𝑓(𝑦|𝜃) is the probability (or
density for continuous data) of observing the sample data 𝑦 viewed as a
function of the parameter 𝜃.
• In the likelihood function, the observed value of the data 𝑦 is treated as a
fixed constant.
• The value of a parameter that maximizes the likelihood function is called
a maximum likelihood estimate (MLE).
• The MLE depends on the data 𝑦. For given data 𝑦, the MLE is the value
of 𝜃 which gives the largest likelihood of having produced the observed
data 𝑦.
• Maximum likelihood estimation is a common frequentist technique for es-
timating the value of a parameter based on data from a sample.
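As a concrete sketch of maximum likelihood estimation for the binomial setting above (using y = 8 right-leaners in a sample of n = 12 couples), the likelihood can be maximized over a fine grid of candidate 𝜃 values:

```python
from math import comb

def binomial_likelihood(theta, y=8, n=12):
    """Probability of observing y successes in n trials, as a function of theta."""
    return comb(n, y) * theta**y * (1 - theta)**(n - y)

# Evaluate the likelihood on a fine grid of candidate theta values
grid = [i / 1000 for i in range(1, 1000)]
mle = max(grid, key=binomial_likelihood)
print(mle)  # 0.667, the grid value closest to y/n = 8/12
```

The grid search recovers the familiar closed-form answer: for binomial data the likelihood is maximized at the sample proportion y/n.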
Example 5.2. We’ll now take a Bayesian approach to estimating 𝜃 in Example
5.1. We treat the unknown parameter 𝜃 as a random variable and wish to find
its posterior distribution after observing 𝑦 = 8 couples leaning to the right in a
sample of 12 kissing couples.
We will start with a very simplified, unrealistic prior distribution that assumes
only five possible, equally likely values for 𝜃: 0.1, 0.3, 0.5, 0.7, 0.9.
1. Sketch a plot of the prior distribution and fill in the prior column of the
Bayes table.
2. Now suppose that 𝑦 = 8 couples in a sample of size 𝑛 = 12 lean right.
Sketch a plot of the likelihood function and fill in the likelihood column
in the Bayes table.
3. Complete the Bayes table and sketch a plot of the posterior distribution.
What does the posterior distribution say about 𝜃? How does it compare to
the prior and the likelihood? If you had to estimate 𝜃 with a single number
based on this posterior distribution, what number might you pick?
4. Now consider a prior distribution which places probability 1/9, 2/9, 3/9,
2/9, 1/9 on the values 0.1, 0.3, 0.5, 0.7, 0.9, respectively. Redo the previous
parts. How does the posterior distribution change?
5. Now consider a prior distribution which places probability 5/15, 4/15,
3/15, 2/15, 1/15 on the values 0.1, 0.3, 0.5, 0.7, 0.9, respectively. Redo
the previous parts. How does the posterior distribution change?
3. See the Bayes table below. Since the prior is flat, the posterior is propor-
tional to the likelihood. The values 0.7 and 0.5 account for the bulk of
posterior plausibility, and 0.7 is about twice as plausible as 0.5. If we had
to estimate 𝜃 with a single number, we might pick 0.7 because that has
the highest posterior probability.
# prior
theta = c(0.1, 0.3, 0.5, 0.7, 0.9)
prior = rep(1 / 5, 5) # equally likely values
# data
n = 12 # sample size
y = 8 # sample count of success
# likelihood
likelihood = dbinom(y, n, theta)
# posterior
product = likelihood * prior
posterior = product / sum(product)
# bayes table
bayes_table = data.frame(theta,
prior,
likelihood,
product,
posterior)
kable(bayes_table %>%
adorn_totals("row"),
digits = 4,
align = 'r')
# plots (the colors are reconstructed; the original lines were truncated)
plot(theta - 0.01, prior, type = 'h', xlim = c(0, 1), ylim = c(0, 1),
     col = "skyblue", xlab = "theta", ylab = "")
par(new = TRUE)
plot(theta + 0.01, likelihood / sum(likelihood), type = 'h', xlim = c(0, 1),
     ylim = c(0, 1), col = "orange", xlab = "", ylab = "")
par(new = TRUE)
plot(theta, posterior, type = 'h', xlim = c(0, 1), ylim = c(0, 1),
     col = "seagreen", xlab = "", ylab = "")
legend("topleft", c("prior", "scaled likelihood", "posterior"),
       lty = 1, col = c("skyblue", "orange", "seagreen"))
[Plot: prior, scaled likelihood, and posterior versus theta]
4. See table and plot below. Because the prior probability is now greater
for 0.5 than for 0.7, the posterior probability of 𝜃 = 0.5 is greater than
in the previous part, and the posterior probability of 𝜃 = 0.7 is less than
in the previous part. The values 0.7 and 0.5 still account for the bulk of
posterior plausibility, but now 0.7 is only about 1.3 times more plausible
than 0.5. If we had to estimate 𝜃 with a single number, we might pick 0.7
because that has the highest posterior probability.
[Plot: prior, scaled likelihood, and posterior versus theta]
5. See the table and plot below. The prior probability is large for 0.1 and
0.3, but since the likelihood corresponding to these values is so small, the
posterior probabilities are small. This posterior distribution is similar to
the one from the previous part.
[Plot: prior, scaled likelihood, and posterior versus theta]
Bayesian estimation
When plotting, it is helpful to rescale the likelihood so that it adds up to 1. Such a rescaling does not change the shape of the likelihood; it merely allows for easier comparison with the prior and posterior.
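A short sketch of the rescaling (the data are the kissing example, 𝑦 = 8 in 𝑛 = 12):

```r
theta = seq(0, 1, 0.0001)
likelihood = dbinom(8, 12, theta)
scaled_likelihood = likelihood / sum(likelihood)  # now sums to 1
# rescaling preserves the shape: the maximum is at the same theta
theta[which.max(scaled_likelihood)] == theta[which.max(likelihood)]
```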
Example 5.3. Continuing Example 5.2. While the previous exercise introduced
the main ideas, it was unrealistic to consider only five possible values of 𝜃.
1. What are the possible values of 𝜃? Does the parameter 𝜃 take values on a
continuous or discrete scale? (Careful: we’re talking about the parameter
and not the data.)
2. Let’s assume that any multiple of 0.0001 is a possible value of 𝜃:
0, 0.0001, 0.0002, … , 0.9999, 1. Assume a discrete uniform prior distri-
bution on these values. Suppose again that 𝑦 = 8 couples in a sample
of 𝑛 = 12 kissing couples lean right. Use software to plot the prior
distribution, the (scaled) likelihood function, and then find the posterior
and plot it. Describe the posterior distribution. What does it say about
𝜃? If you had to estimate 𝜃 with a single number based on this posterior
distribution, what number might you pick?
3. Now assume a prior distribution which is proportional to 1 − 2|𝜃 − 0.5|
for 𝜃 = 0, 0.0001, 0.0002, … , 0.9999, 1. Use software to plot this prior;
what does it say about 𝜃? Then suppose again that 𝑦 = 8 couples in a
sample of 𝑛 = 12 kissing couples lean right. Use software to plot the prior
distribution, the (scaled) likelihood function, and then find the posterior
and plot it. What does the posterior distribution say about 𝜃? If you had
to estimate 𝜃 with a single number based on this posterior distribution,
what number might you pick?
4. Now assume a prior distribution which is proportional to 1 − 𝜃 for 𝜃 =
0, 0.0001, 0.0002, … , 0.9999, 1. Use software to plot this prior; what does it
say about 𝜃? Then suppose again that 𝑦 = 8 couples in a sample of 𝑛 = 12
kissing couples lean right. Use software to plot the prior distribution, the
(scaled) likelihood function, and then find the posterior and plot it. What
does the posterior distribution say about 𝜃? If you had to estimate 𝜃 with
a single number based on this posterior distribution, what number might
you pick?
5. Compare the posterior distributions corresponding to the three different
priors. How does each posterior distribution compare to the prior and the
likelihood? Does the prior distribution influence the posterior distribu-
tion?
Solution to Example 5.3.
1. The parameter 𝜃, a population proportion, takes values on a continuous scale: any value in the interval [0, 1] is possible.
2. See the plot below. The interval from about 0.4 to 0.9 accounts for almost all of the posterior plausibility. If we had to estimate 𝜃 with a single number, we might pick 8/12 = 0.667 because that has the highest posterior probability.
# prior: discrete uniform on a fine grid
theta = seq(0, 1, 0.0001)
prior = rep(1, length(theta))
prior = prior / sum(prior)
# data
n = 12 # sample size
y = 8 # sample count of success
likelihood = dbinom(y, n, theta)
# plots: completed version of the helper (the body was truncated; colors are ours)
plot_posterior <- function(theta, prior, likelihood){
  # posterior
  product = likelihood * prior
  posterior = product / sum(product)
  plot(theta, prior, type = 'l', ylim = range(0, prior, posterior),
       col = "skyblue", xlab = "theta", ylab = "")
  lines(theta, likelihood / sum(likelihood), col = "orange")
  lines(theta, posterior, col = "seagreen")
  legend("topleft", c("prior", "scaled likelihood", "posterior"),
         lty = 1, col = c("skyblue", "orange", "seagreen"))
}
plot_posterior(theta, prior, likelihood)
[Plot: flat prior, scaled likelihood, and posterior versus theta]
# prior
theta = seq(0, 1, 0.0001)
prior = 1 - 2 * abs(theta - 0.5)
prior = prior / sum(prior)
# data
n = 12 # sample size
y = 8 # sample count of success
likelihood = dbinom(y, n, theta)
# plots
plot_posterior(theta, prior, likelihood)
theta
# prior
prior = 1 - theta
prior = prior / sum(prior)
# data
n = 12 # sample size
y = 8 # sample count of success
likelihood = dbinom(y, n, theta)
# plots
plot_posterior(theta, prior, likelihood)
[Plot: prior, scaled likelihood, and posterior versus theta]
5. For the “flat” prior, the posterior is proportional to the likelihood. For the
other priors, the posterior is a compromise between prior and likelihood.
The prior does have some influence. We do see three somewhat different
posterior distributions corresponding to these three prior distributions.
• Even in situations where the data are discrete (e.g., binary success/failure
data, count data), most statistical parameters take values on a continuous
scale.
• Thus in a Bayesian analysis, parameters are usually continuous random
variables, and have continuous probability distributions, a.k.a., densities.
• An alternative to dealing with continuous distributions is to use grid
approximation: Treat the parameter as discrete, on a sufficiently fine
grid of values, and use discrete distributions.
Example 5.4. Continuing Example 5.1. Now we’ll perform a Bayesian analysis
on the actual study data in which 80 couples out of a sample of 124 leaned right.
We’ll again use a grid approximation and assume that any multiple of 0.0001
between 0 and 1 is a possible value of 𝜃: 0, 0.0001, 0.0002, … , 0.9999, 1.
1. Before performing the Bayesian analysis, use software to plot the likelihood
when 𝑦 = 80 couples in a sample of 𝑛 = 124 kissing couples lean right,
and compute the maximum likelihood estimate of 𝜃 based on this data.
How does the likelihood for this sample compare to the likelihood based
on the smaller sample (8/12) from previous exercises?
2. Now back to Bayesian analysis. Assume a discrete uniform prior distri-
bution for 𝜃. Suppose that 𝑦 = 80 couples in a sample of 𝑛 = 124 kissing
couples lean right. Use software to plot the prior distribution, the like-
lihood function, and then find the posterior and plot it. Describe the
posterior distribution. What does it say about 𝜃?
3. Now assume a prior distribution which is proportional to 1 − 2|𝜃 − 0.5|
for 𝜃 = 0, 0.0001, 0.0002, … , 0.9999, 1. Then suppose again that 𝑦 = 80
couples in a sample of 𝑛 = 124 kissing couples lean right. Use software
to plot the prior distribution, the likelihood function, and then find the
posterior and plot it. What does the posterior distribution say about 𝜃?
4. Now assume a prior distribution which is proportional to 1 − 𝜃 for 𝜃 =
0, 0.0001, 0.0002, … , 0.9999, 1. Then suppose again that 𝑦 = 80 couples in
a sample of 𝑛 = 124 kissing couples lean right. Use software to plot the
prior distribution, the likelihood function, and then find the posterior and
plot it. What does the posterior distribution say about 𝜃?
5. Compare the posterior distributions corresponding to the three different
priors. How does each posterior distribution compare to the prior and
the likelihood? Comment on the influence that the prior distribution has.
Does the Bayesian inference for these data appear to be highly sensitive
to the choice of prior? How does this compare to the 𝑛 = 12 situation?
6. If you had to produce a single number Bayesian estimate of 𝜃 based on
the sample data, what number might you pick?
# prior
theta = seq(0, 1, 0.0001)
prior = rep(1, length(theta))
prior = prior / sum(prior)
# data
n = 124 # sample size
y = 80 # sample count of success
likelihood = dbinom(y, n, theta)
# plots
plot_posterior(theta, prior, likelihood)
[Plot: prior, scaled likelihood, and posterior versus theta]
3. See the plot below. The posterior is very similar to the one from the
previous part.
[Plot: prior, scaled likelihood, and posterior versus theta]
4. See the plot below. The posterior is very similar to the one from the
previous part.
[Plot: prior, scaled likelihood, and posterior versus theta]
5. Even though the priors are different, the posterior distributions are all similar to each other and all similar to the shape of the likelihood. Comparing these priors, it does not appear that the posterior is highly sensitive to the choice of prior. The data carry more weight when 𝑛 = 124 than they did when 𝑛 = 12. In other words, the prior has less influence when the
sample size is larger. When the sample size is larger, the likelihood is
more “peaked” and so the likelihood, and hence posterior, is small outside
a narrower range of values than when the sample size is small.
Example 5.5. Continuing the kissing study in Example 5.2 where 𝜃 can only
take values 0.1, 0.3, 0.5, 0.7, 0.9. Consider a prior distribution which places
probability 5/15, 4/15, 3/15, 2/15, 1/15 on the values 0.1, 0.3, 0.5, 0.7, 0.9,
respectively. Suppose we want a single number point estimate of 𝜃. What are
some reasonable choices?
Solution.
1. The prior mode is 0.1, the value of 𝜃 with the greatest prior probability.
2. The prior median is 0.3. Start with the smallest possible value of 𝜃 and
add up the prior probabilities until they go from below 0.5 to above 0.5.
This happens when you add in the prior probability for 𝜃 = 0.3.
3. The prior mean is 0.367. Remember that an expected value is a probability-weighted average value.
4. The posterior mode is 0.7, the value of 𝜃 with the greatest posterior prob-
ability.
5. The posterior median is 0.7. Start with the smallest possible value of 𝜃
and add up the posterior probabilities until they go from below 0.5 to
above 0.5. This happens when you add in the posterior probability for
𝜃 = 0.7.
6. The posterior mean is 0.608. Now the posterior probabilities are used as the weights in the probability-weighted average value.
7. The point estimates (mode, median, mean) shift from their prior values
(0.1, 0.3, 0.367) towards the observed sample proportion of 8/12. However,
the posterior distribution is not symmetric, and the posterior mean is less
than the posterior median. In particular, note that the posterior mean
(0.608) lies between the prior mean (0.367) and the sample proportion
(0.667).
𝐸(𝑈) = ∑_𝑢 𝑢 𝑃(𝑈 = 𝑢)
In the calculation of a posterior mean, the parameter 𝜃 plays the role of the ran-
dom variable 𝑈 and the posterior distribution provides the probability-weights.
In many situations, the posterior distribution will be roughly symmetric with a
single peak, in which case posterior mean, median, and mode will all be about
the same.
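As a check on the numbers reported above, the point estimates for the Example 5.5 prior can be computed directly (a sketch; the grid idioms match those used elsewhere in the text):

```r
theta = c(0.1, 0.3, 0.5, 0.7, 0.9)
prior = c(5, 4, 3, 2, 1) / 15
likelihood = dbinom(8, 12, theta)    # y = 8 successes in n = 12
posterior = likelihood * prior / sum(likelihood * prior)
theta[which.max(posterior)]          # posterior mode: 0.7
min(theta[cumsum(posterior) >= 0.5]) # posterior median: 0.7
sum(theta * posterior)               # posterior mean: about 0.608
```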
Reducing the posterior distribution to a single-number point estimate loses a
lot of the information the posterior distribution provides. The entire posterior
distribution quantifies the uncertainty about 𝜃 after observing sample data. We
will soon see how to more fully use the posterior distribution in making inference
about 𝜃.
5.1 Point estimation
1. Find the mode of the prior distribution of 𝜃, a.k.a., the “prior mode”.
2. Find the median of the prior distribution of 𝜃, a.k.a., the “prior median”.
3. Find the expected value of the prior distribution of 𝜃, a.k.a., the “prior
mean”.
# prior
theta = seq(0, 1, 0.0001)
prior = 1 - theta # shape of prior
prior = prior / sum(prior) # scales so that prior sums to 1
# data
n = 12 # sample size
y = 8 # sample count of success
likelihood = dbinom(y, n, theta)
# posterior
product = likelihood * prior
posterior = product / sum(product)
[Plot: prior, scaled likelihood, and posterior versus theta]
4. Find the mode of the posterior distribution of 𝜃, a.k.a., the “posterior mode”.
5. Find the median of the posterior distribution of 𝜃, a.k.a., the “posterior median”.
6. Find the expected value of the posterior distribution of 𝜃, a.k.a., the “posterior mean”.
7. How have the posterior values changed from the respective prior values?
## prior
# prior mode
theta[which.max(prior)]
## [1] 0
# prior median
min(theta[which(cumsum(prior) >= 0.5)])
## [1] 0.2929
# prior mean
sum(theta * prior)
## [1] 0.3333
7. Each of the posterior point estimates has shifted from its prior value to-
wards the sample proportion of 0.667. But note that each of the posterior
point estimates is in between the prior point estimate and the sample
proportion.
## posterior
# posterior mode
theta[which.max(posterior)]
## [1] 0.6154
# posterior median
min(theta[which(cumsum(posterior) >= 0.5)])
## [1] 0.6046
# posterior mean
sum(theta * posterior)
## [1] 0.6
# data
n = 124 # sample size
y = 80 # sample count of success
likelihood = dbinom(y, n, theta)
# posterior
product = likelihood * prior
posterior = product / sum(product)
[Plot: prior, scaled likelihood, and posterior versus theta]
## posterior
# posterior mode
theta[which.max(posterior)]
## [1] 0.64
# posterior median
min(theta[which(cumsum(posterior) >= 0.5)])
## [1] 0.6385
# posterior mean
sum(theta * posterior)
## [1] 0.6378
Chapter 6
Introduction to Inference
1. Sketch your prior distribution for 𝜃. Make a guess for your prior mode.
2. Suppose Henry formulates a Normal distribution prior for 𝜃. Henry’s prior
mean is 0.4 and prior standard deviation is 0.1. What does Henry’s prior
say about 𝜃?
3. Suppose Mudge formulates a Normal distribution prior for 𝜃. Mudge’s
prior mean is 0.4 and prior standard deviation is 0.05. Who has more
prior certainty about 𝜃? Why?
Solution.
The standard deviation of a random variable is the square root of its variance: SD(𝑈) = √Var(𝑈). Standard deviation is measured in the same measurement units as the variable itself, while variance is measured in squared units.
In the calculation of a posterior standard deviation, 𝜃 plays the role of the
variable 𝑈 and the posterior distribution provides the probability-weights.
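As a sketch, here is the probability-weighted calculation for a grid approximation of Henry's Normal(0.4, 0.1) prior; the prior SD of the grid should come out close to the specified 0.1 (variable names are ours):

```r
theta = seq(0, 1, 0.0001)
prior = dnorm(theta, 0.4, 0.1)
prior = prior / sum(prior)
prior_mean = sum(theta * prior)                       # probability-weighted average
prior_sd = sqrt(sum((theta - prior_mean)^2 * prior))  # about 0.1, as specified
```

The same two lines, with the posterior in place of the prior, give the posterior mean and posterior SD.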
Example 6.2. Continuing Example 6.1, we’ll assume a Normal prior distribu-
tion for 𝜃 with prior mean 0.4 and prior standard deviation 0.1.
1. Compute and interpret the prior probability that 𝜃 is greater than 0.7.
2. Find the 25th and 75th percentiles of the prior distribution. What is
the prior probability that 𝜃 lies in the interval with these percentiles as
endpoints? According to the prior, how plausible is it for 𝜃 to lie inside
this interval relative to outside it? (Hint: use qnorm)
3. Repeat the previous part with the 10th and 90th percentiles of the prior
distribution.
4. Repeat the previous part with the 1st and 99th percentiles of the prior
distribution.
In a sample of 150 American adults, 75% have read a book in the past
year. (The 75% value is motivated by a real sample we’ll see in a later
example.)
5. Find the posterior distribution based on this data, and make a plot of prior, likelihood, and posterior.
6. Compute the posterior standard deviation. Compare to the prior standard deviation.
7. Compute and interpret the posterior probability that 𝜃 is greater than 0.7. Compare to the prior probability.
8. Find the 25th and 75th percentiles of the posterior distribution. What is
the posterior probability that 𝜃 lies in the interval with these percentiles
as endpoints? According to the posterior, how plausible is it for 𝜃 to lie
inside this interval relative to outside it? Compare to the prior interval.
9. Repeat the previous part with the 10th and 90th percentiles of the poste-
rior distribution.
10. Repeat the previous part with the 1st and 99th percentiles of the posterior
distribution.
Solution.
1. We can use software (1 - pnorm(0.7, 0.4, 0.1)) but we can also use the empirical rule. For a Normal(0.4, 0.1) distribution, the value 0.7 is (0.7 − 0.4)/0.1 = 3 SDs above the mean, so the probability is about 0.0015 (since about 99.7% of values are within 3 SDs of the mean). According to the prior, there is about a 0.1% chance that more than 70% of American adults have read a book in the last year.
2. We can use qnorm(0.75) = 0.67 to find that the 75th percentile of a Normal distribution is about 0.67 SDs above the mean, so the 25th percentile is about 0.67 SDs below the mean. For the prior distribution, the 25th percentile is about 0.33 and the 75th percentile is about 0.47. The prior probability that 𝜃 lies in the interval [0.33, 0.47] is 50%. According to the prior, it is equally plausible for 𝜃 to lie inside the interval [0.33, 0.47] as to lie outside it.
3. We can use qnorm(0.9) = 1.28 to find that the 90th percentile of a Normal
distribution is about 1.28 SDs above the mean, so the 10th percentile is
about 1.28 SDs below the mean. For the prior distribution, the 10th
percentile is about 0.27 and the 90th percentile is about 0.53. The prior
probability that 𝜃 lies in the interval [0.27, 0.53] is about 80%. According
to the prior, it is four times more plausible for 𝜃 to lie inside the interval
[0.27, 0.53] than to lie outside this interval.
4. We can use qnorm(0.99) = 2.33 to find that the 99th percentile of a Nor-
mal distribution is about 2.33 SDs above the mean, so the 1st percentile
is about 2.33 SDs below the mean. For the prior distribution, the 1st
percentile is about 0.167 and the 99th percentile is about 0.633. The prior
probability that 𝜃 lies in the interval [0.167, 0.633] is about 98%. Accord-
ing to the prior, it is 49 times more plausible for 𝜃 to lie inside the interval
[0.167, 0.633] than to lie outside this interval.
5. See below for a plot. Our prior gave very little plausibility to a sample
like the one we actually observed. However, given our sample data, the
likelihood corresponding to the values of 𝜃 we initially deemed most plau-
sible is very low. Therefore, our posterior places most of the plausibility
on values in the neighborhood of the observed sample proportion, even
though the prior probability for many of these values was low. The prior
does still have some influence; the posterior mean is 0.709 so we haven’t
shifted all the way towards the sample proportion yet.
6. Compute the posterior variance first using either the definition or the
shortcut version, then take the square root; see code below. The posterior
SD is 0.036, almost 3 times smaller than the prior SD. After observing data
we have more certainty about the value of the parameter, resulting in a
smaller posterior SD. The posterior distribution is approximately Normal
with posterior mean 0.709 and posterior SD 0.036.
7. We can use the grid approximation; just sum the posterior probabilities for 𝜃 > 0.7 to see that the posterior probability is about 0.603. Since the posterior distribution is approximately Normal, we can also use the empirical rule: the standardized value for 0.7 is (0.7 − 0.709)/0.036 = −0.24, or 0.24 SDs below the mean. Using the empirical rule (or software 1 - pnorm(-0.24)) gives about 0.596, similar to the grid calculation.
We started with a very low prior probability that more than 70% of Amer-
ican adults have read at least one book in the last year. But after observ-
ing a sample in which more than 70% have read at least one book in the
last year, we assign a much higher plausibility to more than 70% of all
American adults having read at least one book in the last year. Seeing is
believing.
8. See code below for calculations based on the grid approximation. But we
can also use the fact the posterior distribution is approximately Normal;
e.g., the 25th percentile is about 0.67 SDs below the mean: 0.709 − 0.67 ×
0.036 = 0.684. For the posterior distribution, the 25th percentile is about
0.684 and the 75th percentile is about 0.733. The posterior probability
that 𝜃 lies in the interval [0.684, 0.733] is about 50%. According to the
posterior, it is equally plausible for 𝜃 to lie inside the interval [0.684, 0.733]
as to lie outside this interval. This 50% interval is both (1) narrower than
the prior interval, due to the smaller posterior SD, and (2) shifted towards
higher values of 𝜃 relative to the prior interval, due to the larger posterior
mean.
9. For the posterior distribution, the 10th percentile is about 0.662 and the
90th percentile is about 0.754. The posterior probability that 𝜃 lies in the
interval [0.662, 0.754] is about 80%. According to the posterior, it is four
times more plausible for 𝜃 to lie inside the interval [0.662, 0.754] as to lie
outside this interval. This 80% interval is both (1) narrower than the prior
interval, due to the smaller posterior SD, and (2) shifted towards higher
values of 𝜃 relative to the prior interval, due to the larger posterior mean.
10. For the posterior distribution, the 1st percentile is about 0.622 and the
99th percentile is about 0.789. The posterior probability that 𝜃 lies in
the interval [0.622, 0.789] is about 98%. According to the posterior, it is
49 times more plausible for 𝜃 to lie inside the interval [0.622, 0.789] as to
lie outside this interval. This interval is both (1) narrower than the prior
interval, due to the smaller posterior SD, and (2) shifted towards higher
values of 𝜃 relative to the prior interval, due to the larger posterior mean.
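The Normal-percentile calculations used in the prior parts can be done directly with qnorm (a sketch, using Henry's Normal(0.4, 0.1) prior):

```r
# prior percentiles for a Normal(0.4, 0.1) distribution
qnorm(c(0.25, 0.75), 0.4, 0.1)  # 50% interval, about [0.33, 0.47]
qnorm(c(0.10, 0.90), 0.4, 0.1)  # 80% interval, about [0.27, 0.53]
qnorm(c(0.01, 0.99), 0.4, 0.1)  # 98% interval, about [0.17, 0.63]
```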
# prior
theta = seq(0, 1, 0.0001)
prior = dnorm(theta, 0.4, 0.1) # shape of prior
prior = prior / sum(prior) # scales so that prior sums to 1
# data
n = 150 # sample size
y = round(0.75 * n, 0) # sample count of success
likelihood = dbinom(y, n, theta)
# posterior
product = likelihood * prior
posterior = product / sum(product)
# posterior mean
posterior_mean = sum(theta * posterior)
posterior_mean

## [1] 0.7024

# posterior SD
## [1] 0.03597

# posterior probability that theta > 0.7
## [1] 0.5345

# 25th and 75th percentiles (50% central interval)
## [1] 0.6783
## [1] 0.727

# 10th and 90th percentiles (80% central interval)
## [1] 0.6557
## [1] 0.7481

# 1st and 99th percentiles (98% central interval)
## [1] 0.6161
## [1] 0.7828
[Plot: prior, scaled likelihood, and posterior versus theta]
• With an 80% credible interval, it is 4 times more plausible that the pa-
rameter lies inside the interval than outside
• With a 98% credible interval, it is 49 times more plausible that the pa-
rameter lies inside the interval than outside
• The endpoints of a 50% central posterior credible interval are the 25th
and the 75th percentiles of the posterior distribution.
• The endpoints of an 80% central posterior credible interval are the 10th
and the 90th percentiles of the posterior distribution.
• The endpoints of a 98% central posterior credible interval are the 1st and
the 99th percentiles of the posterior distribution.
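These endpoints can be read off a grid posterior with the cumulative-sum idiom used earlier (a sketch with the book-reading example; y = 112 is our rounding of 75% of n = 150):

```r
theta = seq(0, 1, 0.0001)
prior = dnorm(theta, 0.4, 0.1)
prior = prior / sum(prior)
likelihood = dbinom(112, 150, theta)
posterior = likelihood * prior / sum(likelihood * prior)
# smallest theta whose cumulative posterior probability reaches p
grid_percentile = function(p) min(theta[cumsum(posterior) >= p])
c(grid_percentile(0.10), grid_percentile(0.90))  # 80% central credible interval
```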
There is nothing special about the values 50%, 80%, 98%. These are just a few convenient choices¹ whose endpoints correspond to “round number” percentiles
(1st, 10th, 25th, 75th, 90th, 99th) and inside/outside ratios (1-to-1, 4-to-1,
about 50-to-1). You could also throw in, say 70% (15th and 85th percentiles,
about 2-to-1) or 90% (5th and 95th percentiles, about 10-to-1), if you wanted.
As the previous example illustrates, it’s not necessary to just select a single
credible interval (e.g., 95%). Bayesian inference is based on the full posterior
distribution. Credible intervals simply provide a summary of this distribution.
Reporting a few credible intervals, rather than just one, provides a richer picture
of how the posterior distribution represents the uncertainty in the parameter.
In many situations, the posterior distribution of a single parameter is approximately Normal, so posterior probabilities can be approximated with Normal distribution calculations: standardizing and using the empirical rule. In particular, an approximate central credible interval has endpoints
posterior mean ± 𝑧∗ × posterior SD
where 𝑧∗ is the multiple from a standard Normal distribution corresponding to the probability level (e.g., 𝑧∗ ≈ 1.28 for 80%, 𝑧∗ ≈ 2.33 for 98%).
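For example, an approximate interval of the form posterior mean ± 𝑧∗ × posterior SD can be computed with the values reported in Example 6.2 (a sketch):

```r
post_mean = 0.709  # approximate posterior mean from Example 6.2
post_sd = 0.036    # approximate posterior SD from Example 6.2
z = qnorm(0.99)    # about 2.33, for a 98% interval
post_mean + c(-1, 1) * z * post_sd  # close to the grid-based 98% interval
```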
Central credible intervals are easier to compute, but are not the only or most
widely used credible intervals. A highest posterior density interval is the
interval of values that has the specified posterior probability and is such that the
posterior density within the interval is never lower than the posterior density
outside the interval. If the posterior distribution is relatively symmetric with
a single peak, central posterior credible intervals and highest posterior density
intervals are similar.
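A simple grid-based sketch of a highest posterior density interval (our construction, assuming a unimodal posterior so that the retained grid values form an interval):

```r
hpd_interval = function(theta, posterior, level = 0.98){
  o = order(posterior, decreasing = TRUE)   # most plausible values first
  keep = o[cumsum(posterior[o]) <= level]   # smallest set reaching the level
  range(theta[keep])
}
# usage with a flat-prior grid posterior for y = 112 in n = 150
theta = seq(0, 1, 0.0001)
posterior = dbinom(112, 150, theta)
posterior = posterior / sum(posterior)
hpd_interval(theta, posterior, 0.98)
```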
Example 6.3. Continuing Example 6.1, we’ll assume a Normal prior distribu-
tion for 𝜃 with prior mean 0.4 and prior standard deviation 0.1.
In a recent survey of 1502 American adults conducted by the Pew Research Center, 75% of those surveyed said they have read a book in the past year.
¹ In Section 3.2.2 of Statistical Rethinking (McElreath (2020)), the author suggests 67%,
89%, and 97%: “a series of nested intervals may be more useful than any one interval. For
example, why not present 67%, 89%, and 97% intervals, along with the median? Why these
values? No reason. They are prime numbers, which makes them easy to remember. But all
that matters is they be spaced enough to illustrate the shape of the posterior. And these values
avoid 95%, since conventional 95% intervals encourage many readers to conduct unconscious
hypothesis tests.”
1. Find the posterior distribution based on this data, and make a plot of
prior, likelihood, and posterior. Describe the posterior distribution. How
does this posterior compare to the one based on the smaller sample size
(𝑛 = 150)?
2. Compute and interpret the posterior probability that 𝜃 is greater than 0.7.
Compare to the prior probability.
3. Compute and interpret in context 50%, 80%, and 98% central posterior
credible intervals.
4. Here is how the survey question was worded: “During the past 12 months,
about how many BOOKS did you read either all or part of the way
through? Please include any print, electronic, or audiobooks you may
have read or listened to.” Does this change your conclusions? Explain.
Solution to Example 6.3.
1. See below for code and plots. The posterior distribution is approximately
Normal with posterior mean 0.745 and posterior SD 0.011. Despite our
prior beliefs that 𝜃 was in the 0.4 range, enough data has convinced us
otherwise. With a large sample size, the prior has little influence on the
posterior; much less than with the smaller sample size. Compared to the
posterior based on the small sample size, the posterior now (1) has shifted
to the neighborhood of the sample data, (2) exhibits a smaller degree of
uncertainty about the parameter.
2. The posterior probability that 𝜃 is greater than 0.7 is about 0.9999. We
started with only a 0.1% chance that more than 70% of American adults
have read a book in the last year, but the large sample has convinced us
otherwise.
3. There is a posterior probability of 50% that 𝜃 lies in [0.737, 0.753], 80% that 𝜃 lies in [0.730, 0.759], and 98% that 𝜃 lies in [0.718, 0.771]; see the output below.
# prior
theta = seq(0, 1, 0.0001)
prior = dnorm(theta, 0.4, 0.1) # shape of prior
prior = prior / sum(prior) # scales so that prior sums to 1
# data
n = 1502 # sample size
y = round(0.75 * n, 0) # sample count of success
likelihood = dbinom(y, n, theta)
# posterior
product = likelihood * prior
posterior = product / sum(product)
# posterior mean
posterior_mean = sum(theta * posterior)
posterior_mean
## [1] 0.745

# posterior SD
## [1] 0.01123

# posterior probability that theta > 0.7
## [1] 0.9999

# 25th and 75th percentiles (50% central interval)
## [1] 0.7374
## [1] 0.7525

# 10th and 90th percentiles (80% central interval)
## [1] 0.7304
## [1] 0.7592

# 1st and 99th percentiles (98% central interval)
## [1] 0.7183
## [1] 0.7705
[Plot: prior, scaled likelihood, and posterior versus theta]
The quality of any statistical analysis depends very heavily on the quality of
the data. Always investigate how the data were collected to determine what
conclusions are appropriate. Is the sample reasonably representative of the
population? Were the variables reliably measured?
Example 6.4. Continuing Example 6.3, we’ll use the same sample data (𝑛 =
1502, 75%) but now we’ll consider different priors.
For each of the priors below, plot prior, likelihood, and posterior, and compute
the posterior probability that 𝜃 is greater than 0.7. Compare to Example 6.3.
1. Normal distribution prior with prior mean 0.4 and prior SD 0.05.
2. Uniform distribution prior on the interval [0, 0.7]
Solution.
[Plot: prior, scaled likelihood, and posterior versus theta, for the Normal(0.4, 0.05) prior]
[Plot: prior, scaled likelihood, and posterior versus theta, for the Uniform prior on [0, 0.7]]
You have a great deal of flexibility in choosing a prior, and there are many
reasonable approaches. However, do NOT choose a prior that assigns 0 prob-
ability/density to possible values of the parameter regardless of how initially
implausible the values are. Even very stubborn priors can be overturned with
enough data, but no amount of data can turn a prior probability of 0 into a
positive posterior probability. Always consider the range of possible values of
the parameter, and be sure the prior density is non-zero over the entire range
of possible values.
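A quick sketch of this pitfall, using the Uniform-on-[0, 0.7] prior from Example 6.4 (y = 1126 is our rounding of 75% of n = 1502):

```r
theta = seq(0, 1, 0.0001)
prior = as.numeric(theta <= 0.7)  # zero prior probability above 0.7
prior = prior / sum(prior)
likelihood = dbinom(1126, 1502, theta)
posterior = likelihood * prior / sum(likelihood * prior)
sum(posterior[theta > 0.7])  # exactly 0: the data cannot overcome a zero prior
```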
Example 6.5. We’ll now compare the Bayesian credible intervals in Example
6.4 to frequentist confidence intervals. Recall the actual study data in which
75% of the 1502 American adults surveyed said they read a book in the last
year.
Solution.
𝑝̂ ± 𝑧∗ √(𝑝̂(1 − 𝑝̂)/𝑛)
where 𝑧∗ is the multiple from a standard Normal distribution corresponding to the level of confidence (e.g., 𝑧∗ = 2.33 for 98% confidence). A 98% confidence interval for 𝜃 is [0.724, 0.776].
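The confidence interval computation, as a sketch:

```r
phat = 0.75      # sample proportion
n = 1502         # sample size
z = qnorm(0.99)  # about 2.33 for 98% confidence
phat + c(-1, 1) * z * sqrt(phat * (1 - phat) / n)  # about [0.724, 0.776]
```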
4. The numerical results are similar: the 98% posterior credible interval is
similar to the 98% confidence interval. Both reflect a conclusion that we
think that somewhere-in-the-70s percent of American adults have read at
least one book in the past year.
In a frequentist approach, "confidence" is a probability of what might happen over many samples. Notice that in the interpretation of what 98% confidence means above, the actual numbers [0.724, 0.776] did not appear. The confidence is in the procedure that produced the interval, and not in the interval itself.
Example 6.6. Have more than 70% of Americans read a book in the last year?
We’ll now compare the Bayesian analysis in Example 6.4 to a frequentist (null)
hypothesis (significance) test. Recall the actual study data in which 75% of the
1502 American adults surveyed said they read a book in the last year.
Solution.
5. The numerical results are similar; both the p-value and the posterior prob-
ability are on the order of 1/100000. Both reflect a strong endorsement of
the conclusion that more than 70% of Americans have read a book in the
past year.
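As a sketch, the one-sided large-sample z-test of H0: 𝜃 = 0.7 versus 𝜃 > 0.7 (our reconstruction of the test being described):

```r
phat = 0.75; n = 1502; theta0 = 0.7
z = (phat - theta0) / sqrt(theta0 * (1 - theta0) / n)  # about 4.2
1 - pnorm(z)  # one-sided p-value, on the order of 1/100000
```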
• For example, “95% credible” quantifies our assessment that the parameter
is 19 times more likely to lie inside the credible interval than outside.
(Roughly, we’d be willing to bet at 19-to-1 odds on whether 𝜃 is inside the
interval [0.718, 0.771].)
In a frequentist approach
Chapter 7
Introduction to Prediction
Example 7.1. Do people prefer to use the word “data” as singular or plural?
Data journalists at FiveThirtyEight conducted a poll to address this question
(and others). Rather than simply ask whether the respondent considered “data”
to be singular or plural, they asked which of the following sentences they prefer:
a. Some experts say it’s important to drink milk, but the data is inconclusive.
b. Some experts say it’s important to drink milk, but the data are inconclu-
sive.
Suppose we wish to study the opinions of students in Cal Poly statistics classes
regarding this issue. That is, let 𝜃 represent the population proportion of stu-
dents in Cal Poly statistics classes who prefer to consider data as a singular
noun, as in option a) above.
To illustrate ideas, we’ll start with a prior distribution which places probability
0.01, 0.05, 0.15, 0.30, 0.49 on the values 0.1, 0.3, 0.5, 0.7, 0.9, respectively.
1. Before observing any data, suppose we plan to randomly select a single Cal
Poly statistics student. Consider the unconditional prior probability that
the selected student prefers data as singular. (This is called a prior pre-
dictive probability.) Explain how you could use simulation to approximate
this probability.
2. Use the law of total probability, where the weights are given by the prior
probabilities.
35𝜃³⁴(1 − 𝜃) + 𝜃³⁵
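The prior predictive probability from parts 1 and 2 can be sketched in a few lines, by the law of total probability and by simulation:

```r
theta = c(0.1, 0.3, 0.5, 0.7, 0.9)
prior = c(0.01, 0.05, 0.15, 0.30, 0.49)
# exact: law of total probability
sum(theta * prior)  # 0.742
# simulation: draw theta from the prior, then one student given theta
theta_sim = sample(theta, 100000, replace = TRUE, prob = prior)
y_sim = rbinom(100000, 1, theta_sim)  # 1 = prefers "data is" (singular)
mean(y_sim)  # approximately 0.742
```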
# prior
theta = c(0.1, 0.3, 0.5, 0.7, 0.9)
prior = c(0.01, 0.05, 0.15, 0.30, 0.49)
# data
n = 35 # sample size
y = 31 # sample count of success
likelihood = dbinom(y, n, theta)
# posterior
product = likelihood * prior
posterior = product / sum(product)
# bayes table (adorn_totals is from the janitor package; kable from knitr)
bayes_table = data.frame(theta,
                         prior,
                         likelihood,
                         product,
                         posterior)

bayes_table %>%
  adorn_totals("row") %>%
  kable(digits = 4, align = 'r')
6. The simulation would be similar to the prior simulation, but now we sim-
ulate 𝜃 from its posterior distribution rather than the prior distribution.
3. Repeat steps 1 and 2 many times, and find the proportion of rep-
etitions which result in success. This proportion approximates the
unconditional posterior probability of success.
7. Use the law of total probability, where the weights are given by the pos-
terior probabilities.
0.1(0.0000)+0.3(0.0000)+0.5(0.0000)+0.7(0.0201)+0.9(0.9799) = 0.8960
Since the posterior probability that 𝜃 equals 0.9 is close to 1, the posterior
predictive distribution would be close to, but not quite, the Binomial(35,
0.9) distribution.
9. Use the law of total probability again, but with the posterior probabilities
rather than the prior probabilities as the weights.
The plots below illustrate the distributions from the previous example.
The first plot below illustrates the conditional distribution of 𝑌 given each value
of 𝜃.
[Figure: conditional distributions 𝑝(𝑦|𝜃) of 𝑌 for each value of 𝜃 (panels labeled theta: 0.1 through theta: 0.9), with the prior probability of each 𝜃 indicated by color.]
In the plot above, the prior distribution of 𝜃 is represented by color. The prior
predictive distribution of 𝑌 mixes the five conditional distributions in the pre-
vious plot, weighting by the prior distribution of 𝜃, to obtain the unconditional
prior predictive distribution of 𝑌 .
[Figure: the prior predictive distribution 𝑝(𝑦) of 𝑌.]
[Figure: conditional distributions 𝑝(𝑦|𝜃) of 𝑌 for each value of 𝜃, with the probability of each 𝜃 indicated by color.]

[Figure: the corresponding predictive distribution 𝑝(𝑦) of 𝑌.]
Example 7.2. Continuing the previous example. We’ll use a grid approx-
imation and assume that any multiple of 0.0001 is a possible value of 𝜃:
0, 0.0001, 0.0002, … , 0.9999, 1.
1. Consider the context of this problem and sketch your prior distribution
for 𝜃. What are the main features of your prior?
2. Assume the prior distribution for 𝜃 is proportional to 𝜃2 . Plot this prior
distribution and describe its main features.
3. Given the shape of the prior distribution, explain why we might not want
to compute central prior credible intervals. Suggest an alternative ap-
proach, and compute and interpret 50%, 80%, and 98% prior credible
intervals for 𝜃.
4. Before observing any data, suppose we plan to randomly select a sample
of 35 Cal Poly statistics students. Let 𝑌 represent the number of students
in the selected sample who prefer data as singular. Use simulation to
approximate the prior predictive distribution of 𝑌 and plot it.
1. Results will of course vary, but do consider what your prior would look
like.
2. We believe a majority, and probably a strong majority, of students will
prefer data as singular. The prior mode is 1, the prior mean is 0.75, and
the prior standard deviation is 0.19.
# prior
theta = seq(0, 1, 0.0001)
prior = theta ^ 2
prior = prior / sum(prior)
# prior mean
prior_ev = sum(theta * prior)
prior_ev
## [1] 0.75
# prior variance
prior_var = sum(theta ^ 2 * prior) - prior_ev ^ 2
# prior sd
sqrt(prior_var)
## [1] 0.1937
3. Central credibles would exclude 𝜃 values near 1, but these are the values
with highest prior probability. For example, a central 50% prior credible
interval is [0.630, 0.909], but this excludes values of 𝜃 with the highest prior
probability. An alternative is to use highest prior probability intervals.
For this prior, it seems reasonable to just fix the upper endpoint of the
credible intervals to be 1, and to find the lower endpoint corresponding to
the desired probability. The lower bound of such a 50% credible interval
is the 50th percentile; of an 80% credible interval is the 20th percentile; of
a 98% credible interval is the 2nd percentile. There is a prior probability
of 50% that at least 79.4% of Cal Poly students prefer data as singular;
it’s equally plausible that 𝜃 is above 0.794 as below.
There is a prior probability of 80% that at least 58.5% of Cal Poly students
prefer data as singular; it’s four times more plausible that 𝜃 is above 0.585
than below. There is a prior probability of 98% that at least 27.1% of Cal
Poly students prefer data as singular; it’s 49 times more plausible that 𝜃
is above 0.271 than below.
prior_cdf = cumsum(prior)
# 50th percentile
theta[max(which(prior_cdf <= 0.5))]
## [1] 0.7936
# 20th percentile
theta[max(which(prior_cdf <= 0.2))]
## [1] 0.5847
# 2nd percentile
theta[max(which(prior_cdf <= 0.02))]
## [1] 0.2714
4. We use the sample function with the prob argument to simulate a value
of 𝜃 from its prior distribution, and then use rbinom to simulate a sample.
The table below displays the results of a few repetitions of the simulation.
n = 35
n_sim = 10000
theta_sim = sample(theta, n_sim, replace = TRUE, prob = prior)
y_sim = rbinom(n_sim, n, theta_sim)
theta_sim y_sim
0.6711 26
0.7621 26
0.8947 33
0.8719 30
0.5030 16
0.9272 34
0.8771 31
0.6362 25
0.8140 29
0.1598 3
5. We program the law of total probability calculation for each possible value
of 𝑦. (There are better ways of doing this than a for loop, but it’s good
enough.)
# Predictive distribution
y_predict = 0:n
py_predict = rep(0, length(y_predict))
for (i in 1:length(y_predict)) {
  py_predict[i] = sum(dbinom(y_predict[i], n, theta) * prior) # prior weights
}
[Figure: prior predictive distribution of 𝑌, theoretical probabilities versus simulation.]
# Prediction interval
py_predict_cdf = cumsum(py_predict)
c(y_predict[max(which(py_predict_cdf <= 0.025))], y_predict[min(which(py_predict_cdf >= 0.975))])

## [1] 8 35
# data
n = 35 # sample size
y = 31 # sample count of success
# likelihood
likelihood = dbinom(y, n, theta)
# posterior
product = likelihood * prior
posterior = product / sum(product)
# posterior mean
posterior_ev = sum(theta * posterior)
posterior_ev
## [1] 0.8718
# posterior variance
posterior_var = sum(theta ^ 2 * posterior) - posterior_ev ^ 2
# posterior sd
sqrt(posterior_var)
## [1] 0.05286
# posterior cdf
posterior_cdf = cumsum(posterior)
[Figure: prior, scaled likelihood, and posterior distributions of 𝜃.]
9. Similar to the prior simulation, but now we simulate 𝜃 based on its poste-
rior distribution. The table below displays the results of a few repetitions
of the simulation.
theta_sim y_sim
0.8304 30
0.8823 32
0.9251 30
0.9442 35
0.9069 32
0.9083 32
0.9564 34
0.8123 32
0.7888 30
0.8705 29
10. Similar to the prior calculation, but now we use the posterior probabilities
as the weights in the law of total probability calculation.
# Predictive distribution
y_predict = 0:n
py_predict = rep(0, length(y_predict))
for (i in 1:length(y_predict)) {
  py_predict[i] = sum(dbinom(y_predict[i], n, theta) * posterior) # posterior weights
}
[Figure: posterior predictive distribution of 𝑌, theoretical probabilities versus simulation.]
# Prediction interval
py_predict_cdf = cumsum(py_predict)
c(y_predict[max(which(py_predict_cdf <= 0.025))], y_predict[min(which(py_predict_cdf >= 0.975))])

## [1] 23 35
# data
n = 1093 # sample size
y = 865 # sample count of success
# likelihood
likelihood = dbinom(y, n, theta)
# posterior
product = likelihood * prior
posterior = product / sum(product)
# posterior mean
posterior_ev = sum(theta * posterior)
posterior_ev
## [1] 0.7912
# posterior variance
posterior_var = sum(theta ^ 2 * posterior) - posterior_ev ^ 2
# posterior sd
sqrt(posterior_var)
## [1] 0.01227
[Figure: prior, scaled likelihood, and posterior distributions of 𝜃.]
n = 35
# Predictive simulation
theta_sim = sample(theta, n_sim, replace = TRUE, prob = posterior)
y_sim = rbinom(n_sim, n, theta_sim)
theta_sim y_sim
0.7979 32
0.7768 27
0.7559 26
0.8000 33
0.7844 28
0.7823 30
0.7811 25
0.7660 23
0.8057 32
0.7910 27
# Predictive distribution
y_predict = 0:n
py_predict = rep(0, length(y_predict))
for (i in 1:length(y_predict)) {
  py_predict[i] = sum(dbinom(y_predict[i], n, theta) * posterior) # posterior weights
}
# Prediction interval
py_predict_cdf = cumsum(py_predict)
c(y_predict[max(which(py_predict_cdf <= 0.025))], y_predict[min(which(py_predict_cdf >= 0.975))])

## [1] 22 32
[Figure: posterior predictive distribution of 𝑌, theoretical probabilities versus simulation, plotted against 𝑦.]
Even if parameters are essentially “known” — that is, even if the prior/posterior
variance of parameters is small — there will still be sample-to-sample variability
reflected in the predictive distribution of the data, mainly influenced by the size
𝑛 of the sample being “predicted”.
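As a quick illustration (a sketch; the value 0.79 is taken from the large-sample posterior above, everything else is assumed): even if 𝜃 were known exactly, the predictive distribution of the sample proportion for 𝑛 = 35 would still have substantial spread.

```r
# With theta fixed (no parameter uncertainty), Y ~ Binomial(n, theta),
# so the SD of the sample proportion Y/n is sqrt(theta * (1 - theta) / n)
n = 35
theta = 0.79
sqrt(theta * (1 - theta) / n)  # about 0.069: sampling variability alone
```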
1. Plot the prior distribution. What does this say about our prior beliefs?
2. Now suppose we randomly select a sample of 35 Cal Poly students and 21
students prefer data as singular. Plot the prior and likelihood, and find
the posterior distribution and plot it. Have our beliefs about 𝜃 changed?
Why?
3. Find the posterior predictive distribution corresponding to samples of size
35. Compare the observed sample value of 21/35 with the posterior pre-
dictive distribution. What do you notice? Does this indicate problems
with the model?
1. We have a very strong prior belief that 𝜃 is close to 0.79; the prior SD
is only 0.012. There is a prior probability of 98% that between 76% and
82% of Cal Poly students prefer data as singular.
7.1. POSTERIOR PREDICTIVE CHECKING

# prior
theta = seq(0, 1, 0.0001)
prior = theta ^ 864 * (1 - theta) ^ 227
prior = prior / sum(prior)
# prior mean
prior_ev = sum(theta * prior)
prior_ev
## [1] 0.7914
# prior variance
prior_var = sum(theta ^ 2 * prior) - prior_ev ^ 2
# prior sd
sqrt(prior_var)
## [1] 0.01228
2. Our posterior distribution has barely changed from the prior. Even though
the sample proportion is 21/35 = 0.61, our prior beliefs were so strong
(represented by the small prior SD) that a sample of size 35 isn’t very
convincing.
# data
n = 35 # sample size
y = 21 # sample count of success
# likelihood
likelihood = dbinom(y, n, theta)
# posterior
product = likelihood * prior
posterior = product / sum(product)
# posterior mean
posterior_ev = sum(theta * posterior)
posterior_ev
## [1] 0.7855
# posterior variance
posterior_var = sum(theta ^ 2 * posterior) - posterior_ev ^ 2
# posterior sd
sqrt(posterior_var)
## [1] 0.01222
[Figure: prior, scaled likelihood, and posterior distributions of 𝜃.]
n = 35
# Predictive simulation
theta_sim = sample(theta, n_sim, replace = TRUE, prob = posterior)
y_sim = rbinom(n_sim, n, theta_sim)
# Predictive distribution
y_predict = 0:n
py_predict = rep(0, length(y_predict))
for (i in 1:length(y_predict)) {
  py_predict[i] = sum(dbinom(y_predict[i], n, theta) * posterior) # posterior weights
}
# Posterior predictive probability of observing 21 or fewer successes
sum(py_predict[y_predict <= y])

## [1] 0.01105
[Figure: posterior predictive distribution of 𝑌 (theoretical and simulation), with the observed value 𝑦 = 21 in the far left tail.]
A Bayesian model is composed of both a model for the data (likelihood) and a
prior distribution on model parameters.
Predictive distributions can be used as tools in model checking. Posterior pre-
dictive checking involves comparing the observed data to simulated samples
(or some summary statistics) generated from the posterior predictive distribu-
tion. We’ll focus on graphical checks: Compare plots for the observed data with
those for simulated samples. Systematic differences between simulated samples
and observed data indicate potential shortcomings of the model.
If the model fits the data, then replicated data generated under the model should
look similar to the observed data. If the observed data is not plausible under
the posterior predictive distribution, then this could indicate that the model is
not a good fit for the data. (“Based on the data we observed, we conclude that
it would be unlikely to observe the data we observed???”)
However, a problematic model isn’t necessarily due to the prior. Remember
that a Bayesian model consists of both a prior and a likelihood, so model mis-
specification can occur in the prior or likelihood or both. The form of the
likelihood is also based on subjective assumptions about the variables being
measured and how the data are collected. Posterior predictive checking can
help assess whether these assumptions are reasonable in light of the observed
data.
• The probability that the player successfully makes any particular free
throw attempt is 𝜃.
• A Uniform prior distribution for 𝜃 values in a grid from 0 to 1.
• Conditional on 𝜃, the number of successfully made attempts has a Bino-
mial(20, 𝜃) distribution. (This determines the likelihood.)
1. Suppose the player misses her first 10 attempts and makes her second 10
attempts. Does this data seem consistent with the model?
2. Explain how you could use posterior predictive checking to check the fit
of the model.
# grid of theta values
theta = seq(0, 1, 0.0001)
# prior
prior = rep(1, length(theta))
prior = prior / sum(prior)
# data
n = 20 # sample size
y = 10 # sample count of success
# likelihood
likelihood = dbinom(y, n, theta)
# posterior
product = likelihood * prior
posterior = product / sum(product)
# predictive simulation
n_sim = 10000
switches = rep(NA, n_sim)
for (r in 1:n_sim){
  theta_sim = sample(theta, 1, replace = TRUE, prob = posterior)
  trials_sim = rbinom(n, 1, theta_sim)
  switches[r] = length(rle(trials_sim)$lengths) - 1 # rle (run length encoding) counts runs
}
plot(table(switches) / n_sim,
xlab = "Number of switches",
ylab = "Posterior predictive probability",
panel.first = rect(0, 0, 1, 1, col='gray', border=NA))
[Figure: posterior predictive distribution of the number of switches, ranging from 0 to 16.]

## [1] 0.0005

7.2. PRIOR PREDICTIVE TUNING
1. Simulate sleep hours for 10000 Cal Poly students under this model and
make a histogram of the simulated values.
2. According to this model, (approximately) what percent of students sleep
less than 5 hours a night? More than 11? Do these values seem reasonable?
1. First simulate a value 𝜃 from the Uniform(4, 12) distribution. Then given
𝜃 simulate a value 𝑦 from a Normal(𝜃, 1.5) distribution. Repeat many
times to get many (𝜃, 𝑦) pairs and summarize the 𝑦 values.
N_sim = 10000
theta = runif(N_sim, 4, 12) # Uniform(4, 12) prior for the population mean
sigma = 1.5
y = rnorm(N_sim, theta, sigma)
[Figure: histogram of the simulated sleep hours y, ranging from about 0 to 15 hours.]
mean(y < 5)

## [1] 0.1499

mean(y > 11)

## [1] 0.1552
In the previous example, it was helpful to think about the distribution of sleep
hours for individual students when formulating prior beliefs about the popula-
tion mean. In general, it is often easier to think in terms of the scale of the data
(individual sleep hours) rather than the scale of the parameters (mean sleep
hours).
Prior predictive distributions “live” on the scale of the data, and are sometimes
easier to interpret than prior distributions themselves. It is often helpful to
tune prior distributions indirectly via prior predictive distributions rather than
directly. We can choose a prior distribution for parameters, simulate a prior
predictive distribution for the data given this prior, and consider if the distribu-
tion of possible data values seems reasonable given our background knowledge
about the variable and context. If not, we can choose another prior and repeat
the process until we have suitably “tuned” the prior.
Remember, the prior does not have to be perfect; there is no perfect prior.
However, if a particular prior gives rise to obviously unreasonable data values
(e.g., negative sleep hours) we should try to improve it. It’s always a good idea
to consider prior predictive distributions when formulating a prior distribution
for parameters.
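For instance, a tuning iteration for the sleep example might look like this sketch; the Normal(8, 3) candidate prior here is a hypothetical alternative, not from the text.

```r
# Try a candidate prior for the population mean and check the implied data
n_sim = 10000
theta = rnorm(n_sim, 8, 3)    # candidate prior for mean sleep hours
y = rnorm(n_sim, theta, 1.5)  # prior predictive draws of sleep hours
mean(y < 0)                   # proportion of impossible negative values
# if this proportion is non-negligible, tighten the prior and repeat
```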
Chapter 8
Introduction to Continuous
Prior and Posterior
Distributions
Recall that the posterior distribution is proportional to the product of the prior
distribution and the likelihood. Thus, two ingredients determine the posterior
distribution: the prior distribution and the likelihood.
𝜋(𝜃|𝑦) ∝ 𝑓(𝑦|𝜃)𝜋(𝜃)
posterior ∝ likelihood × prior
For a continuous random variable 𝑈 with pdf 𝑓 the probability that the random
variable falls between any two values 𝑎 and 𝑏 is given by the area under the
density between those two values.
𝑃(𝑎 ≤ 𝑈 ≤ 𝑏) = ∫_𝑎^𝑏 𝑓(𝑢) 𝑑𝑢
¹𝜃 is used to denote both: (1) the actual parameter (i.e., the random variable) 𝜃
itself, and (2) possible values of 𝜃.
8.1. A BRIEF REVIEW OF CONTINUOUS DISTRIBUTIONS
A pdf will assign zero probability to intervals where the density is 0. A pdf
is usually defined for all real values, but is often nonzero only for some subset
of values, the possible values of the random variable. Given a specific pdf, the
generic bounds (−∞, ∞) should be replaced by the range of possible values,
that is, those values 𝑢 for which 𝑓(𝑢) > 0.
For example, if 𝑈 can only take positive values we can write its pdf as 𝑓(𝑢) for
𝑢 > 0, and 𝑓(𝑢) = 0 otherwise. The “0 otherwise” part is often omitted, but be
sure to specify the range of values where 𝑓 is positive.
The expected value of a continuous random variable 𝑈 with pdf 𝑓 is

𝐸(𝑈) = ∫_{−∞}^{∞} 𝑢 𝑓(𝑢) 𝑑𝑢
The constant 𝑐 does not affect the shape of the density as a function of 𝑢, only
the scale on the density (vertical) axis. The absolute scaling on the density axis
is somewhat irrelevant; it is whatever it needs to be to provide the proper area.
In particular, the total area under the pdf must be 1. The scaling constant is
determined by the requirement that ∫_{−∞}^{∞} 𝑓(𝑢) 𝑑𝑢 = 1. (Remember to replace
the generic (−∞, ∞) bounds with the range of possible values.)
What is important about the pdf is relative height. For example, if two values
𝑢 and 𝑢̃ satisfy 𝑓(𝑢̃) = 2𝑓(𝑢), then 𝑈 is roughly “twice as likely to be near 𝑢̃
as near 𝑢”:

2 = 𝑓(𝑢̃)/𝑓(𝑢) = (𝑓(𝑢̃)𝜖)/(𝑓(𝑢)𝜖) ≈ 𝑃(𝑢̃ − 𝜖/2 ≤ 𝑈 ≤ 𝑢̃ + 𝜖/2) / 𝑃(𝑢 − 𝜖/2 ≤ 𝑈 ≤ 𝑢 + 𝜖/2)
[Figure 8.1: two plots of the Exponential(1) pdf 𝑓_𝑈(𝑢) on (0, 5); see the caption below.]
Figure 8.1: Illustration of 𝑃 (1 < 𝑈 < 2.5) (left) and 𝑃 (0.995 < 𝑈 < 1.005) and
𝑃 (1.695 < 𝑈 < 1.705) (right) for 𝑈 with an Exponential(1) distribution, with
pdf 𝑓𝑈 (𝑢) = 𝑒−𝑢 , 𝑢 > 0. The plot on the left displays the true area under the
curve over (1, 2.5). The plot on the right illustrates how the probability that
𝑈 is “close to” 𝑢 can be approximated by the area of a rectangle with height
equal to the density at 𝑢, 𝑓𝑈 (𝑢). The density height at 𝑢 = 1 is twice as large
as the density height at 𝑢 = 1.7, so the probability that 𝑈 is “close to” 1 is
(roughly) twice as large as the probability that 𝑈 is “close to” 1.7.
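The rectangle approximation described in the caption can be checked numerically (a sketch):

```r
# P(U near u) over a width-0.01 window vs. the density-ratio prediction
p1 = pexp(1.005) - pexp(0.995)  # P(0.995 < U < 1.005)
p2 = pexp(1.705) - pexp(1.695)  # P(1.695 < U < 1.705)
p1 / p2                         # close to dexp(1) / dexp(1.7) = exp(0.7), about 2.01
```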
Example 8.1. Continuing Example 7.1 where 𝜃 represents the population pro-
portion of students in Cal Poly statistics classes who prefer to consider data as
a singular noun.
1. See the plot below. The distribution is similar to the discrete grid approx-
imation in Example 7.2.
2. Set the total area under the curve equal to 1 and solve for 𝑐:

1 = ∫₀¹ 𝑐𝜃² 𝑑𝜃 = 𝑐 ∫₀¹ 𝜃² 𝑑𝜃 = 𝑐(1/3)  ⇒  𝑐 = 3
4. See the plot below. The prior proportional to (1 − 𝜃)2 is the mirror image
of the prior proportional to 𝜃2 , reflected about 0.5. As the exponent on
𝜃 increases, more density is shifted towards 1. As the exponent on 1 − 𝜃
increases, more density is shifted towards 0. When the exponents are the
same, the density is symmetric about 0.5.
[Figure: prior densities proportional to 𝜃², 𝜃⁵, (1 − 𝜃)², 𝜃²(1 − 𝜃)², and 𝜃⁵(1 − 𝜃)² on (0, 1).]
8.2. CONTINUOUS DISTRIBUTIONS FOR A POPULATION PROPORTION
constant which ensures that the total area under the density is 1. The actual Beta density
formula, including the normalizing constant, is

𝑓(𝑢) = [Γ(𝛼 + 𝛽)/(Γ(𝛼)Γ(𝛽))] 𝑢^(𝛼−1) (1 − 𝑢)^(𝛽−1),  0 < 𝑢 < 1,

where Γ(𝛼) = ∫₀^∞ 𝑒^(−𝑣) 𝑣^(𝛼−1) 𝑑𝑣 is the Gamma function. For a positive integer 𝑘,
Γ(𝑘) = (𝑘 − 1)!. Also, Γ(1/2) = √𝜋.
3. Starting with each of the prior distributions from the first part, find the
posterior distribution of 𝜃 based on this sample, and identify it as a Beta
distribution by specifying the shape parameters 𝛼 and 𝛽
a. proportional to 𝜃2 , 0 < 𝜃 < 1.
b. proportional to 𝜃5 , 0 < 𝜃 < 1.
c. proportional to (1 − 𝜃)2 , 0 < 𝜃 < 1.
d. proportional to 𝜃2 (1 − 𝜃)2 , 0 < 𝜃 < 1.
e. proportional to 𝜃5 (1 − 𝜃)2 , 0 < 𝜃 < 1.
4. For each of the posterior distributions in the previous part, compute the
posterior mean and standard deviation. How does each posterior distri-
bution compare to its respective prior distribution?
Solution to Example 8.2.
2. Given 𝜃, the number of students in the sample who prefer data as singular,
𝑌 , follows a Binomial(35, 𝜃) distribution. The likelihood is the probability
of observing 𝑌 = 31 viewed as a function of 𝜃.
𝑓(31|𝜃) = (35 choose 31) 𝜃³¹(1 − 𝜃)⁴,  0 < 𝜃 < 1
        ∝ 𝜃³¹(1 − 𝜃)⁴,  0 < 𝜃 < 1
4. See the table above. Each posterior distribution concentrates more prob-
ability towards the observed sample proportion 31/35 = 0.886, though
there are some small differences due to the prior. The posterior SD is less
than the prior SD; there is less uncertainty about 𝜃 after observing some
data.
[Figure: prior densities Beta(3, 1), Beta(6, 1), Beta(1, 3), Beta(3, 3), Beta(6, 3) (left) and the corresponding posterior densities Beta(34, 5), Beta(37, 5), Beta(32, 7), Beta(34, 7), Beta(37, 7) (right), each on (0, 1).]
Beta distributions are often used in Bayesian models involving population pro-
portions. Consider some binary (“success/failure”) variable and let 𝜃 be the
population proportion of success. Select a random sample of size 𝑛 from the
population and let 𝑌 count the number of successes in the sample.
Beta-Binomial model. If 𝜃 has a Beta(𝛼, 𝛽) prior distribution and the
conditional distribution of 𝑌 given 𝜃 is the Binomial(𝑛, 𝜃) distribution, then
the posterior distribution of 𝜃 given 𝑌 = 𝑦 is the Beta(𝛼 + 𝑦, 𝛽 + 𝑛 − 𝑦)
distribution.
When the prior and posterior distribution belong to the same family, that fam-
ily is called a conjugate prior distribution for the likelihood. So, the Beta
distributions form a conjugate prior family for Binomial distributions.
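The conjugacy can be verified numerically against a grid calculation (a sketch; the Beta(3, 1) prior and the 31/35 data come from the examples in this chapter):

```r
theta = seq(0.001, 0.999, 0.001)
alpha = 3; beta = 1
n = 35; y = 31
# grid posterior: prior density times binomial likelihood, normalized
grid_post = dbeta(theta, alpha, beta) * dbinom(y, n, theta)
grid_post = grid_post / sum(grid_post)
# conjugate posterior: Beta(alpha + y, beta + n - y), normalized on the same grid
conj_post = dbeta(theta, alpha + y, beta + n - y)
conj_post = conj_post / sum(conj_post)
max(abs(grid_post - conj_post))  # essentially zero
```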
Example 8.3. In Example 7.2 we used a grid approximation to the prior dis-
tribution of 𝜃. Now we will assume a continuous prior distribution. Assume
that 𝜃 has a Beta(3, 1) prior distribution and that 31 students in a sample of
35 Cal Poly statistics students prefer data as singular.
# grid of theta values for plotting
theta = seq(0, 1, 0.0001)
# prior
alpha_prior = 3
beta_prior = 1
prior = dbeta(theta, alpha_prior, beta_prior)
# data
n = 35
y = 31
# likelihood
likelihood = dbinom(y, n, theta)
# posterior
alpha_post = alpha_prior + y
beta_post = beta_prior + n - y
posterior = dbeta(theta, alpha_post, beta_post)
# plot
ymax = max(c(prior, posterior))
scaled_likelihood = likelihood * ymax / max(likelihood)
plot(theta, prior, type='l', col='skyblue', xlim=c(0, 1), ylim=c(0, ymax), ylab='', yaxt='n')
par(new=T)
plot(theta, scaled_likelihood, type='l', col='orange', xlim=c(0, 1), ylim=c(0, ymax), ylab='', yaxt='n')
par(new=T)
plot(theta, posterior, type='l', col='seagreen', xlim=c(0, 1), ylim=c(0, ymax), ylab='', yaxt='n')
legend("topleft", c("prior", "scaled likelihood", "posterior"), lty=1, col=c("skyblue", "orange", "seagreen"))
[Figure: prior, scaled likelihood, and posterior densities of 𝜃.]
3. The results based on continuous distributions are the same as those for
the grid approximation. The grid is just an approximation of the “true”
Beta-Binomial theory.
4. The prior mean is 3/(3 + 1) = 0.75. The sample proportion is 31/35 = 0.886.
The posterior mean is 34/39 = 0.872. We can write

34/39 = (3/4) × (4/39) + (31/35) × (35/39)
      = (3/4) × (4/(4 + 35)) + (31/35) × (35/(4 + 35))

The posterior mean is a weighted average of the prior mean and the sample
proportion, where the weights are given by the relative “sample sizes”. The
“prior sample size” is 3 + 1 = 4. The actual observed sample size is 35.
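The weighted-average identity above can be checked directly in R (a quick sketch):

```r
prior_mean = 3 / 4
sample_prop = 31 / 35
# weights are the relative "sample sizes": prior 4, data 35
post_mean = prior_mean * (4 / (4 + 35)) + sample_prop * (35 / (4 + 35))
post_mean  # 34/39 = 0.872
```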
The posterior variance can be written as

Var(𝜃|𝑦) = 𝐸(𝜃|𝑦)(1 − 𝐸(𝜃|𝑦)) / (𝛼 + 𝛽 + 𝑛 + 1)
Example 8.4. Now let’s reconsider the posterior prediction parts of Example
7.2, treating 𝜃 as continuous. Assume that 𝜃 has a Beta(3, 1) prior distribution
and that 31 students in a sample of 35 Cal Poly statistics students prefer data as
singular, so that the posterior distribution of 𝜃 is the Beta(34, 5) distribution.
n_sim = 10000
𝑃(𝑌 = 𝑦) = (𝑛 choose 𝑦) 𝐵(𝛼 + 𝑦, 𝛽 + 𝑛 − 𝑦) / 𝐵(𝛼, 𝛽),  𝑦 = 0, 1, … , 𝑛,

where 𝐵(𝛼, 𝛽) is the beta function, for which 𝐵(𝛼, 𝛽) = (𝛼 − 1)!(𝛽 − 1)!/(𝛼 + 𝛽 − 1)!
if 𝛼, 𝛽 are positive integers. (For general 𝛼, 𝛽 > 0, 𝐵(𝛼, 𝛽) = ∫₀¹ 𝑢^(𝛼−1)(1 − 𝑢)^(𝛽−1) 𝑑𝑢
= Γ(𝛼)Γ(𝛽)/Γ(𝛼 + 𝛽).) The mean is 𝑛(𝛼/(𝛼 + 𝛽)). In R: dbbinom, rbbinom, pbbinom
in the extraDistr package.
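The Beta-Binomial pmf in the footnote can also be written directly with base R's beta() function (a sketch, using the Beta(34, 5) posterior from the example):

```r
n = 35
a = 34; b = 5  # posterior shape parameters
y = 0:n
p = choose(n, y) * beta(a + y, b + n - y) / beta(a, b)
sum(p)          # pmf sums to 1
sum(y * p) / n  # mean / n = a / (a + b) = 0.872
```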
[Figure: posterior predictive distribution of 𝑌 (posterior predictive probability versus 𝑦, for 𝑦 from 17 to 35).]
## 2.5% 97.5%
## 24 35
3. The interval is similar to the one from the grid approximation, and the
interpretation is the same. There is posterior predictive probability of 95%
that between 24 and 35 students in a sample of 35 students will prefer data
as singular.
You can tune the shape parameters — 𝛼 (like “prior successes”) and 𝛽 (like
“prior failures”) — of a Beta distribution to your prior beliefs in a few ways.
Recall that 𝜅 = 𝛼 + 𝛽 is the “concentration” or “equivalent prior sample size”.

• If prior mean 𝜇 and prior concentration 𝜅 are specified then

𝛼 = 𝜇𝜅
𝛽 = (1 − 𝜇)𝜅
• If prior mode 𝜔 and prior concentration 𝜅 (with 𝜅 > 2) are specified then
𝛼 = 𝜔(𝜅 − 2) + 1
𝛽 = (1 − 𝜔)(𝜅 − 2) + 1
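A quick sketch of the mode/concentration conversion above; the mode 0.8 and concentration 10 are hypothetical values chosen for illustration.

```r
omega = 0.8; kappa = 10
alpha = omega * (kappa - 2) + 1        # 7.4
beta = (1 - omega) * (kappa - 2) + 1   # 2.6
# sanity check: the mode of a Beta(alpha, beta) is (alpha - 1) / (alpha + beta - 2)
(alpha - 1) / (alpha + beta - 2)       # recovers omega = 0.8
```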
• If prior mean 𝜇 and prior standard deviation 𝜎 are specified then

𝛼 = 𝜇 (𝜇(1 − 𝜇)/𝜎² − 1)
𝛽 = (1 − 𝜇) (𝜇(1 − 𝜇)/𝜎² − 1)
• You can also specify two percentiles and use software to find 𝛼 and 𝛽. For
example, you could specify the endpoints of a prior 98% credible interval.
1. Sketch your Beta prior distribution for 𝜃. Describe its main features and
your reasoning. Then translate your prior into a Beta distribution by
specifying the shape parameters 𝛼 and 𝛽.
2. Assume a prior Beta distribution for 𝜃 with prior mean 0.15 and prior SD
is 0.08. Find 𝛼 and 𝛽, and a prior 98% credible interval for 𝜃.
1. Of course, choices will vary, based on what you know about left-
handedness. But do think about what your prior might look like, and use
one of the methods to translate it to a Beta distribution.
2. Let’s say we’ve heard that about 15% of people in general are left-handed,
but we’ve also heard 10% so we’re not super sure, and we also don’t know
how Cal Poly students compare to the general population. So we’ll assume
a prior Beta distribution for 𝜃 with prior mean 0.15 (our “best guess”) and
a prior SD of 0.08 to reflect our degree of uncertainty. This translates to
a Beta(2.8, 16.1) prior, with a central 98% prior credible interval for 𝜃 indicating
that between 2.2% and 38.1% of Cal Poly students are left-handed. We
could probably go with more prior certainty than this, but it seems at
least like a reasonable starting place before observing data. We can (and
should) use prior predictive tuning to aid in choosing 𝛼 and 𝛽 for our Beta
distribution prior.
mu = 0.15
sigma = 0.08
alpha = mu * (mu * (1 - mu) / sigma ^ 2 - 1)
alpha

## [1] 2.838

beta = (1 - mu) * (mu * (1 - mu) / sigma ^ 2 - 1)
beta

## [1] 16.08
Chapter 9

Considering Prior
Distributions

One of the most commonly asked questions when one first encounters Bayesian
statistics is “how do we choose a prior?” While there is never one “perfect”
prior in any situation, we’ll discuss in this chapter some issues to consider when
choosing a prior. But first, here are a few big picture ideas to keep in mind.
Example 9.1. Xiomara claims that she can predict which way a coin flip will
land. Rogelio claims that he can taste the difference between Coke and Pepsi.
Before reading further, stop to consider: whose claim - Xiomara’s or Rogelio’s
- is initially more convincing? Or are you equally convinced? Why? To put
it another way, whose claim are you initially more skeptical of? Or are you
equally skeptical? To put it one more way, whose claim would require more
data to convince you?1
To test Xiomara’s claim, you flip a fair coin 10 times, and she correctly predicts
the result of 9 of the 10 flips. (You can assume the coin is fair, the flips are
independent, and there is no funny business in data collection.)
To test Rogelio’s claim, you give him a blind taste test of 10 cups, flipping
a coin for each cup to determine whether to serve Coke or Pepsi. Rogelio
correctly identifies 9 of the 10 cups. (You can assume the coin is fair, the flips
are independent, and there is no funny business in data collection.)
Let 𝜃𝑋 be the probability that Xiomara correctly guesses the result of a fair coin
flip. Let 𝜃𝑅 be the probability that Rogelio correctly guesses the soda (Coke or
Pepsi) in a randomly selected cup.
1. How might a frequentist address this situation? What would the conclu-
sion be?
2. Consider a Bayesian approach. Describe, in general terms, your prior
distributions for the two parameters. How do they compare? How would
this impact your conclusions?
a wider range of plausible values. We might even have a prior mean for
𝜃𝑅 above 0.5 if we have experience with a lot of people who can tell the
difference between Coke and Pepsi. Given the sample data, our posterior
probability that 𝜃𝑅 > 0.5 would be larger than the posterior probability
that 𝜃𝑋 > 0.5, and we would be more convinced by Rogelio’s claim than
by Xiomara’s.
[Figure: prior densities for 𝜃_𝑋 (left) and 𝜃_𝑅 (right), plotted over (0.3, 0.7).]
Even if a prior does not represent strong prior beliefs, just having a prior dis-
tribution at all allows for Bayesian analysis. Remember, both Bayesian and
frequentist are valid approaches to statistical analyses, each with advantages
and disadvantages. That said, there are some issues with frequentist approaches
that incorporating a prior distribution and adopting a Bayesian approach alle-
viates. (To be fair, an upcoming investigation will address some disadvantages
of the Bayesian approach compared with the frequentist approach.)
Example 9.2. Tamika is a basketball player who throughout her career has
had a probability of 0.5 of making any three point attempt. However, her coach
is afraid that her three point shooting has gotten worse. To check this, the
coach has Tamika shoot a series of three pointers; she makes 7 out of 24. Does
the coach have evidence that Tamika has gotten worse?
Let 𝜃 be the probability that Tamika successfully makes any three point attempt.
Assume attempts are independent.
1. Prior to collecting data, the coach decides that he’ll have convincing ev-
idence that Tamika has gotten worse if the p-value is less than 0.025.
Suppose the coach told Tamika to shoot 24 attempts and then stop and
count the number of successful attempts. Use software to compute the
p-value. Is the coach convinced that Tamika has gotten worse?
2. Prior to collecting data, the coach decides that he’ll have convincing ev-
idence that Tamika has gotten worse if the p-value is less than 0.025.
Suppose the coach told Tamika to shoot until she makes 7 three point-
ers and then stop and count the number of total attempts. Use software
to compute the p-value. Is the coach convinced that Tamika has gotten
worse? (Hint: the total number of attempts has a Negative Binomial
distribution.)
3. Now suppose the coach takes a Bayesian approach and assumes a Beta(𝛼,
𝛽) prior distribution for 𝜃. Suppose the coach told Tamika to shoot 24
attempts and then stop and count the number of successful attempts. Iden-
tify the likelihood function and the posterior distribution of 𝜃.
4. Now suppose the coach takes a Bayesian approach and assumes a Beta(𝛼,
𝛽) prior distribution for 𝜃. Suppose the coach told Tamika to shoot until
she makes 7 three pointers and then stop and count the number of total
attempts. Identify the likelihood function and the posterior distribution
of 𝜃.
5. Compare the Bayesian and frequentist approaches in this example. Does
the “strength of the evidence” depend on how the data were collected?
𝑓(𝑦 = 7|𝜃) = (24 choose 7) 𝜃⁷(1 − 𝜃)¹⁷ ∝ 𝜃⁷(1 − 𝜃)¹⁷,  0 < 𝜃 < 1.

𝑓(𝑛 = 24|𝜃) = (23 choose 6) 𝜃⁷(1 − 𝜃)¹⁷ ∝ 𝜃⁷(1 − 𝜃)¹⁷,  0 < 𝜃 < 1.

(The (23 choose 6) = (24 − 1 choose 7 − 1) coefficient follows from the fact that
the last attempt has to be a success.)
Note that the shape of the likelihood as a function of 𝜃 is the same as in
the previous part. Therefore, the posterior distribution is the Beta(𝛼 + 7,
𝛽 + 17) distribution.
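The proportionality of the two likelihoods can be confirmed numerically (a sketch):

```r
theta = seq(0.01, 0.99, 0.01)
lik_binom = dbinom(7, 24, theta)        # stop after 24 attempts, count 7 makes
lik_nbinom = dnbinom(24 - 7, 7, theta)  # stop after 7 makes, count 17 misses
ratio = lik_binom / lik_nbinom
range(ratio)  # constant in theta: choose(24, 7) / choose(23, 6) = 24/7
```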
For those initially skeptical of prior distributions at all, the strategy of always
choosing a noninformative or flat prior might be appealing. Flat priors are
common, but are rarely ever the best choices from a modeling perspective. Just
like you would not want to assume a Normal distribution for the likelihood in
every problem, you would not want to use a flat prior in every problem.
Furthermore, there are some subtle issues that arise when attempting to choose
a noninformative prior.
1. What are the possible values for 𝜃? What prior distribution might you
consider a noninformative prior distribution?
2. You might choose a Uniform(0, 1) prior, a.k.a., a Beta(1, 1) prior. Recall
how we interpreted the parameters 𝛼 and 𝛽 in the Beta-Binomial model.
Does the Beta(1, 1) distribution represent “no prior information”?
1. 𝜃 takes values in (0, 1). We might assume a flat prior on (0, 1), that is a
Uniform(0, 1) prior.
2. We interpreted 𝛼 as “prior successes” and 𝛽 as “prior failures”. So a Beta(1, 1) prior is in some sense equivalent to a “prior sample size” of 2. Certainly not a lot of prior information, but it’s not “no prior information” either.
3. The sample proportion, 4/20 = 0.2.
4. With a Beta(1, 1) prior and the 4/20 sample data, the posterior distri-
bution is Beta(5, 17). The posterior mean of 𝜃 is 5/22 = 0.227. The
posterior mean is a weighted average of the prior mean and the sample
proportion: 0.227 = (0.5)(2/22) + (0.2)(20/22). The “noninformative”
prior does have influence; the data does not “speak entirely for itself”.
5. If 𝛼 + 𝛽 represents “prior sample size”, we could try a Beta(0, 0) prior.
Unfortunately, such a probability distribution does not actually exist. For
a Beta distribution, the parameters 𝛼 and 𝛽 have to be strictly positive in
order to have a valid pdf. The Beta(0, 0) density would be proportional
to
π(θ) ∝ θ^(−1) (1 − θ)^(−1),    0 < θ < 1.
161
However, this is not a valid pdf since ∫₀¹ θ^(−1) (1 − θ)^(−1) dθ = ∞, so there is no constant that can normalize it to integrate to 1. Even so, here is a plot of the “density”.
[Figure: plot of the improper Beta(0, 0) “density” as a function of theta.]
That is, the posterior distribution is the Beta(4, 16) distribution. The posterior mean is 4/20 = 0.2, the sample proportion. However, the posterior mode is (4 − 1)/(4 + 16 − 2) = 3/18 ≈ 0.167. So the posterior mode does not let the “data speak entirely for itself”.
theta = rbeta(1000000, 1, 1)
odds = theta / (1 - theta)

hist(odds[odds < 50], breaks = 100, xlab = "odds", freq = FALSE,
     ylab = "density",
     main = "Prior distribution of odds if prior distribution of probability is Uniform(0, 1)")
Even though the prior for 𝜃 was flat, the prior for a transformation of 𝜃 is
not.
1. Explain why you might not want to use a flat Uniform(0, 1) prior for 𝜃.
2. Assume a Uniform(0, 1) prior. Suppose you will test 𝑛 = 100 suspected
cases. Use simulation to approximate the prior predictive distribution of
the number in the sample who have the disease. Does this seem reasonable?
3. Assume a Uniform(0, 1) prior. Suppose that in 𝑛 = 100 suspected cases,
none actually has the disease. Find and interpret the posterior median.
Does this seem reasonable?
theta_sim = runif(10000)
y_sim = rbinom(10000, 100, theta_sim)
hist(y_sim,
xlab = "Simulated number of successes",
main = "Prior predictive distribution")
rate is even greater than this? Again, this does not seem very reasonable
based on our knowledge that the disease is rare.
Chapter 10

Introduction to Posterior Simulation and JAGS
In the Beta-Binomial model there is a simple expression for the posterior dis-
tribution. However, in most problems it is not possible to find the posterior
distribution analytically, and therefore we must approximate it.
1. Write an expression for the shape of the posterior density. Is this a rec-
ognizable probability distribution?
2. We have seen one method for approximating a posterior distribution. How
could you employ it here?
Prior: π(θ) ∝ (1 / 0.08) exp( −(θ − 0.15)^2 / (2 (0.08^2)) )

Likelihood: f(y|θ) ∝ θ^5 (1 − θ)^20

Posterior: π(θ|y) ∝ ( θ^5 (1 − θ)^20 ) ( (1 / 0.08) exp( −(θ − 0.15)^2 / (2 (0.08^2)) ) )
# grid of values for theta (grid approximation)
theta = seq(0, 1, 0.0001)

# prior
prior = dnorm(theta, 0.15, 0.08)
prior = prior / sum(prior)

# data
n = 25 # sample size
y = 5 # sample count of success

# likelihood of observed data for each theta in the grid
likelihood = dbinom(y, n, theta)

# posterior
product = likelihood * prior
posterior = product / sum(product)

# plot
ylim = c(0, max(c(prior, posterior, likelihood / sum(likelihood))))
plot(theta, prior, type='l', xlim=c(0, 1), ylim=ylim, col="skyblue", xlab='theta', ylab='')
par(new=T)
plot(theta, likelihood / sum(likelihood), type='l', xlim=c(0, 1), ylim=ylim, col="orange", xlab='', ylab='')
par(new=T)
plot(theta, posterior, type='l', xlim=c(0, 1), ylim=ylim, col="seagreen", xlab='', ylab='')
legend("topright", c("prior", "scaled likelihood", "posterior"), lty=1, col=c("skyblue", "orange", "seagreen"))
• Observed data 𝑦
• Model for the data, 𝑓(𝑦|𝜃) which depends on parameters 𝜃. (This model
determines the likelihood function.)
• Prior distribution for parameters 𝜋(𝜃)
Example 10.2. Continuing the kissing study in Example 5.2 where 𝜃 can only
take values 0.1, 0.3, 0.5, 0.7, 0.9. Consider a prior distribution which places
probability 1/9, 2/9, 3/9, 2/9, 1/9 on the values 0.1, 0.3, 0.5, 0.7, 0.9, respec-
tively. Suppose that 𝑦 = 8 couples in a sample of size 𝑛 = 12 lean right.
1. Describe in detail how you could use simulation to approximate the pos-
terior distribution of 𝜃, without first computing the posterior distribution.
2. Code and run the simulation. Compare the simulation-based approxima-
tion to the true posterior distribution from Example 5.2.
3. How would the simulation/code change if 𝜃 had a Beta prior distribution,
say Beta(3, 3)?
4. Suppose that 𝑛 = 1200 and 𝑦 = 800. What would be the problem with
running the above simulation in this situation? Hint: compute the proba-
bility that 𝑌 equals 800 for a Binomial distribution with parameters 1200
and 0.667.
n_sim = 100000

# simulate theta from its prior, then y from Binomial(12, theta)
theta_prior_sim = sample(c(0.1, 0.3, 0.5, 0.7, 0.9), n_sim, replace = TRUE,
                         prob = c(1, 2, 3, 2, 1) / 9)
y_sim = rbinom(n_sim, 12, theta_prior_sim)

theta_prior_sim y_sim
0.1 0
0.5 2
0.5 5
0.7 10
0.1 2
0.5 6
0.5 8
0.1 1
0.7 7
0.5 7
0.7 10
0.3 3
0.5 5
0.5 7
0.9 11
0.7 12
0.3 6
0.5 8
0.3 5
0.9 10
theta_post_sim = theta_prior_sim[y_sim == 8]
table(theta_post_sim)
## theta_post_sim
## 0.3 0.5 0.7 0.9
## 177 4070 5031 253
plot(table(theta_post_sim) / length(theta_post_sim),
xlab = "theta",
ylab = "Relative frequency")
3. The only difference is that we would first simulate a value of 𝜃 from its
Beta(3, 3) prior distribution (using rbeta). Now any value between 0 and
1 is a possible value of 𝜃. But we would still approximate the posterior
distribution by discarding any (𝜃, 𝑌 ) pairs for which 𝑌 is not equal to 8.
Since 𝜃 is continuous, we could summarize the simulated values with a
histogram or density plot.
n_sim = 100000

theta_prior_sim = rbeta(n_sim, 3, 3)
y_sim = rbinom(n_sim, 12, theta_prior_sim)

theta_prior_sim y_sim
0.4965 6
0.3985 6
0.4446 6
0.4233 7
0.4427 5
0.5821 6
0.1518 4
0.6136 6
0.7128 6
0.3245 5
0.7393 8
0.3007 4
0.9726 12
0.3001 3
0.2044 1
0.3401 2
0.6289 6
0.7696 11
0.7605 10
0.3823 4
theta_post_sim = theta_prior_sim[y_sim == 8]
[Figure: histogram of theta_post_sim, approximating the posterior density of theta.]
In principle, the posterior distribution 𝜋(𝜃|𝑦) given observed data 𝑦 can be found
by
simply because in large samples there are just many more possibilities. For
example, in 1000 flips of a fair coin, the most likely value of the number of
heads is 500, but the probability of exactly 500 heads in 1000 flips is only 0.025.
When there are many possibilities, the probability gets stretched fairly thin.
Therefore, if we want, say, 10000 simulated values of 𝜃 given 𝑦, we would have to first simulate many, many more values.
The situation is even more extreme when the data is continuous, where the probability of replicating the observed sample is essentially 0.
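The 0.025 figure above is easy to verify exactly; a quick check in Python (in R, `dbinom(500, 1000, 0.5)` gives the same value):

```python
from math import comb

# exact probability of exactly 500 heads in 1000 flips of a fair coin
p_500 = comb(1000, 500) * 0.5**1000
print(round(p_500, 4))  # 0.0252
```

So a naive simulation that keeps only samples matching the observed data exactly discards roughly 97.5% of its work even in this best case.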
Therefore, we need more efficient simulation algorithms for approximating pos-
terior distributions. Markov chain Monte Carlo (MCMC) methods provide powerful and widely applicable algorithms for simulating from probabil-
ity distributions, including complex and high-dimensional distributions. These
algorithms include Metropolis-Hastings, Gibbs sampling, Hamiltonian Monte
Carlo, among others. We will see later some of the ideas behind MCMC algo-
rithms. However, we will rely on software to carry out MCMC simulations.
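To give a flavor of the ideas before turning to JAGS, here is a minimal Metropolis sampler in Python (an illustrative sketch, not the book's workflow). For concreteness it assumes the data of Example 10.2 — y = 8 successes in n = 12 trials — with a continuous Uniform(0, 1) prior, so the target is proportional to θ^8 (1 − θ)^4 and the true posterior is Beta(9, 5):

```python
import random

random.seed(42)

def target(theta):
    # unnormalized posterior: Uniform(0, 1) prior times binomial likelihood
    if theta <= 0 or theta >= 1:
        return 0.0
    return theta**8 * (1 - theta)**4

theta = 0.5  # arbitrary starting value
samples = []
for _ in range(60000):
    proposal = theta + random.uniform(-0.1, 0.1)  # symmetric random-walk proposal
    # accept with probability min(1, target(proposal) / target(theta))
    if random.random() < target(proposal) / target(theta):
        theta = proposal
    samples.append(theta)

post = samples[10000:]  # discard burn-in
post_mean = sum(post) / len(post)
print(round(post_mean, 2))  # close to the Beta(9, 5) posterior mean 9/14 ≈ 0.64
```

The simulated values form a Markov chain whose long-run distribution is the posterior; JAGS automates this kind of sampling (and much smarter versions of it).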
JAGS (“Just Another Gibbs Sampler”) is a standalone program for performing
MCMC simulations. JAGS takes as input a Bayesian model description — prior
plus likelihood — and data and returns an MCMC sample from the posterior
distribution. JAGS uses a combination of Metropolis sampling, Gibbs sampling,
and other MCMC algorithms.
A few JAGS resources:
library(rjags)

n = 35 # sample size
y = 31 # number of successes

# Model: a text string defining the Bayesian model for JAGS
model_string <- "model{

  # Likelihood
  y ~ dbinom(theta, n)

  # Prior
  theta ~ dbeta(alpha, beta)
  alpha <- 3 # prior successes
  beta <- 1 # prior failures

}"
10.1. INTRODUCTION TO JAGS 175
Again, the above is just a text string, which we’ll pass to JAGS for translation.
We pass the model (which is just a text string) and the data to JAGS to be
compiled via jags.model. The model is defined by the text string via the
textConnection function. The model can also be saved in a separate file, with
the file name being passed to JAGS. The data is passed to JAGS in a list. In
dataList below y = y, n = n maps the data defined by y=31, n=35 to the
terms y, n specified in the model_string.
dataList = list(y = y, n = n)
After the update phase, we simulate values from the posterior distribution that
we’ll actually keep using coda.samples. Using coda.samples arranges the
output in a format conducive to using coda, a package which contains helpful
functions for summarizing and diagnosing MCMC simulations. The variables to
record simulated values for are specified with the variable.names argument.
Here there is only a single parameter theta, but we’ll see multi-parameter ex-
amples later.
Standard R functions like summary and plot can be used to summarize re-
sults from coda.samples. We can summarize the simulated values of theta to
approximate the posterior distribution.
summary(posterior_sample)
##
## Iterations = 2001:12000
## Thinning interval = 1
## Number of chains = 1
## Sample size per chain = 10000
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## 0.871382 0.052462 0.000525 0.000777
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## 0.756 0.839 0.878 0.909 0.956
plot(posterior_sample)
[Figure: trace plot and density plot of the simulated values of theta.]
The Doing Bayesian Data Analysis (DBDA2E) textbook package also has some nice functions built in, in particular in the DBDA2E-utilities.R file. For example, the plotPost function creates an annotated plot of the posterior distribution along with some summary statistics. (See the DBDA2E documentation for additional arguments.)
source("DBDA2E-utilities.R")
##
## *********************************************************************
## Kruschke, J. K. (2015). Doing Bayesian Data Analysis, Second Edition:
## A Tutorial with R, JAGS, and Stan. Academic Press / Elsevier.
## *********************************************************************
plotPost(posterior_sample)
[plotPost output: posterior mode = 0.896, 95% HDI (0.772, 0.967).]
library(bayesplot)
mcmc_hist(posterior_sample)
mcmc_dens(posterior_sample)
mcmc_trace(posterior_sample)
thetas = as.matrix(posterior_sample)
head(thetas)
## theta
## [1,] 0.8479
## [2,] 0.8715
## [3,] 0.8592
## [4,] 0.8828
## [5,] 0.8846
## [6,] 0.8927
hist(thetas)
[Figure: histogram of thetas.]
The matrix would have one column for each variable named in variable.names;
in this case, there is only one column corresponding to the simulated values of
theta.
We can now use the simulated values of theta to simulate replicated samples to
approximate the posterior predictive distribution. To be clear, the code below
is running R commands within R (not JAGS).
(There is a way to simulate predictive values within JAGS itself, but I think
it’s more straightforward in R. Just use JAGS to get a simulated sample from
the posterior distribution. On the other hand, if you’re using Stan there are
functions for simulating and summarizing posterior predicted values.)
# simulate a predicted count of successes (out of 35) for each simulated theta
ynew = rbinom(length(thetas), 35, thetas)

plot(table(ynew),
     main = "Posterior Predictive Distribution for samples of size 35",
     xlab = "y")
model_string <- "model{

  # Likelihood
  for (i in 1:n){
    y[i] ~ dbern(theta)
  }

  # Prior
  theta ~ dbeta(alpha, beta)
  alpha <- 3
  beta <- 1

}"
The Bernoulli model can be passed to JAGS similar to the Binomial model
above. Below, we have also introduced the n.chains argument, which simu-
lates multiple Markov chains and allows for some additional diagnostic checks.
Simulating multiple chains helps assess convergence of the Markov chain to the
target distribution. (We’ll discuss more details later.) Initial values for the
chains can be provided in a list with the inits argument; otherwise initial
values are generated automatically.
# Simulate
update(model, 1000, progress.bar = "none")

Nrep = 10000

posterior_sample <- coda.samples(model,
                                 variable.names = c("theta"),
                                 n.iter = Nrep,
                                 progress.bar = "none")

summary(posterior_sample)
##
## Iterations = 1001:11000
## Thinning interval = 1
## Number of chains = 5
## Sample size per chain = 10000
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## 0.872245 0.052547 0.000235 0.000238
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## 0.753 0.841 0.878 0.911 0.956
plot(posterior_sample)
If multiple chains are simulated, the DBDA2E function diagMCMC can be used
for diagnostics.
Note: Some of the DBDA2E output, in particular from diagMCMC, isn’t always
displayed when the RMarkdown file is knit. You might need to manually run
these cells within RStudio. I’m not sure why; please let me know if you figure
it out.
plotPost(posterior_sample)
[plotPost output: posterior mode = 0.897, 95% HDI (0.767, 0.963).]
diagMCMC(posterior_sample)
10.1.9 ShinyStan
library(shinystan)
my_sso <- launch_shinystan(as.shinystan(posterior_sample,
model_name = "Bortles!!!"))
# Data
n = 25
y = 5

# Model
model_string <- "model{

  # Likelihood
  y ~ dbinom(theta, n)

  # Prior
  theta ~ dnorm(mu, tau)
  mu <- 0.15
  tau <- 1 / 0.08 ^ 2 # JAGS parameterizes Normal by precision = 1 / variance

}"
dataList = list(y = y, n = n)
# Compile
model <- jags.model(file = textConnection(model_string),
                    data = dataList,
                    n.chains = 5)

# Simulate
update(model, 1000, progress.bar = "none")

Nrep = 10000

posterior_sample <- coda.samples(model,
                                 variable.names = c("theta"),
                                 n.iter = Nrep,
                                 progress.bar = "none")

summary(posterior_sample)
##
## Iterations = 2001:12000
## Thinning interval = 1
## Number of chains = 5
## Sample size per chain = 10000
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## 0.183882 0.052470 0.000235 0.000308
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## 0.0894 0.1468 0.1812 0.2181 0.2930
plot(posterior_sample)
The posterior density is similar to what we computed with the grid approximation.
Chapter 11

Odds and Bayes Factors
Example 11.1. The ELISA test for HIV was widely used in the mid-1990s for
screening blood donations. As with most medical diagnostic tests, the ELISA
test is not perfect. If a person actually carries the HIV virus, experts estimate
that this test gives a positive result 97.7% of the time. (This number is called
the sensitivity of the test.) If a person does not carry the HIV virus, ELISA
gives a negative (correct) result 92.6% of the time (the specificity of the test).
Estimates at the time were that 0.5% of the American public carried the HIV
virus (the base rate).
Suppose that a randomly selected American tests positive; we are interested in
the conditional probability that the person actually carries the virus.
Show/hide solution
1. We don’t know what you guessed, but from experience many people guess 80-100%. After all, the test is correct for most people who carry HIV,
and also correct for most people who don’t carry HIV, so it seems like the
test is correct most of the time. But this argument ignores one important
piece of information that has a huge impact on the results: most people
do not carry HIV.
2. Let 𝐻 denote the event that the person carries HIV (hypothesis), and let
𝐸 denote the event that the test is positive (evidence). Therefore, 𝐻 𝑐 is
the event that the person does not carry HIV, another hypothesis. We are
given
Among the 78515 who test positive, 4885 carry HIV, so the probability
that an American who tests positive actually carries HIV is 4885/78515
= 0.062.
6. The result says that only 6.2% of Americans who test positive actually
carry HIV. It is true that the test is correct for most Americans with HIV
(4885 out of 5000) and incorrect only for a small proportion of Americans
who do not carry HIV (73630 out of 995000). But since so few Americans
carry HIV, the sheer number of false positives (73630) swamps the number
of true positives (4885).
7. Prior to observing the test result, the prior probability that an American
carries HIV is 𝑃 (𝐻) = 0.005. The posterior probability that an American
carries HIV given a positive test result is 𝑃 (𝐻|𝐸) = 0.062.
P(H|E) / P(H) = 0.062 / 0.005 = 12.44

An American who tests positive is about 12.4 times more likely to carry HIV than an American for whom the test result is not known. So while 0.062 is still small in absolute terms, the posterior probability is much larger relative to the prior probability.
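The counts used above follow directly from Bayes' rule; a small Python check (illustrative only):

```python
# hypothetical population of 1,000,000 Americans
population = 1_000_000
base_rate = 0.005    # P(H): carries HIV
sensitivity = 0.977  # P(positive test | HIV)
specificity = 0.926  # P(negative test | no HIV)

carriers = population * base_rate                        # 5,000 carry HIV
true_pos = carriers * sensitivity                        # 4,885 true positives
false_pos = (population - carriers) * (1 - specificity)  # 73,630 false positives

posterior = true_pos / (true_pos + false_pos)
print(round(posterior, 3))              # 0.062
print(round(posterior / base_rate, 1))  # 12.4: posterior relative to prior
```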
Example 11.2. True story: On a camping trip in 2003, my wife and I were
driving in Vermont when, suddenly, a very large, hairy, black animal lumbered
across the road in front of us and into the woods on the other side. It happened
very quickly, and at first I said “It’s a gorilla!” But then after some thought,
and much derision from my wife, I said “it was probably a bear.”
I think this story provides an anecdote about Bayesian reasoning, albeit bad
reasoning at first but then good. Put the story in a Bayesian context by identi-
fying hypotheses, evidence, prior, and likelihood. What was the mistake I made
initially?
Show/hide solution
• “Type of animal” is playing the role of the hypothesis: gorilla, bear, dog,
squirrel, rabbit, etc.
• That the animal is very large, hairy, and black is the evidence.
• The likelihood value for the animal being very large, hairy, and black is
close to 1 for both a bear and gorilla, maybe more middling for a dog, but
close to 0 for a squirrel, rabbit, etc.
The mistake I made initially was to neglect the base rates and not consider
my prior probabilities. Let’s say the likelihood is 1 for both gorilla and bear
and 0 for all other animals. Then based solely on the likelihoods, the posterior
probability would be 50/50 for gorilla and bear, which maybe is why I guessed
gorilla.
After my initial reaction, I paused to formulate my prior probabilities, which
considering I was in Vermont, gave much higher probability to a bear than a
gorilla. (My prior probabilities should also have given even higher probability
to animals such as dogs, squirrels, and rabbits.)
By combining prior and likelihood in the appropriate way, the posterior prob-
ability is
• very high for a bear, due to high likelihood and not-too-small prior,
• close to 0 for a gorilla, due to the very small prior,
• and very low for a squirrel or rabbit or other small animals because of the
close-to-zero likelihood, even if the prior is large.
Recall that the odds of an event is a ratio involving the probability that the
event occurs and the probability that the event does not occur
odds(A) = P(A) / P(A^c) = P(A) / (1 − P(A))
In many situations (e.g. gambling) odds are reported as odds against 𝐴, that is, the odds in favor of 𝐴 not occurring, a.k.a., the odds of 𝐴^c: P(A^c)/P(A).
The probability of an event can be obtained from its odds:

P(A) = odds(A) / (1 + odds(A))
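The two conversions are inverses of each other; a quick sketch in Python:

```python
def odds(p):
    # odds in favor of an event that has probability p
    return p / (1 - p)

def prob(o):
    # recover the probability of the event from its odds
    return o / (1 + o)

print(odds(0.75))  # 3.0, i.e., "3 to 1" in favor
print(prob(3.0))   # 0.75
```

For example, the prior odds of a randomly selected American carrying HIV in Example 11.1 are odds(0.005) = 0.005/0.995 ≈ 1/199.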
Example 11.3. Continuing Example 11.1
1. In symbols and words, what does one minus the answer to the probability
in question in Example 11.1 represent?
2. Calculate the prior odds of a randomly selected American having the HIV
virus, before taking an ELISA test.
3. Calculate the posterior odds of a randomly selected American having the
HIV virus, given a positive test result.
4. By what factor has the odds of carrying HIV increased, given a positive
test result, as compared to before the test? This is called the Bayes
factor.
5. Suppose you were given the prior odds and the Bayes factor. How could
you compute the posterior odds?
6. Compute the ratio of the likelihoods of testing positive, for those who
carry HIV and for those who do not carry HIV. What do you notice?
Solution to Example 11.3.
Show/hide solution
6. The likelihood of testing positive given HIV is 𝑃 (𝐸|𝐻) = 0.977 and the
likelihood of testing positive given no HIV is 𝑃 (𝐸|𝐻 𝑐 ) = 1−0.926 = 0.074.
P(E|H) / P(E|H^c) = 0.977 / 0.074 = 13.2
This value is the Bayes factor! So we could have computed the Bayes
factor without first computing the posterior probabilities or odds.
• If 𝑃 (𝐻) is the prior probability of 𝐻, the prior odds (in favor) of 𝐻 are
𝑃 (𝐻)/𝑃 (𝐻 𝑐 )
• If 𝑃 (𝐻|𝐸) is the posterior probability of 𝐻 given 𝐸, the posterior odds
(in favor) of 𝐻 given 𝐸 are 𝑃 (𝐻|𝐸)/𝑃 (𝐻 𝑐 |𝐸)
• The Bayes factor (BF) is defined to be the ratio of the posterior odds to the prior odds:

BF = posterior odds / prior odds = P(E|H) / P(E|H^c)
• That is, the Bayes factor can be computed without first computing pos-
terior probabilities or odds.
• Odds form of Bayes rule: posterior odds = prior odds × Bayes factor
1. Compute and interpret the prior odds that a person carries HIV.
2. Use the odds form of Bayes rule to compute the posterior odds that the
person carries HIV given a positive test, and interpret the posterior odds.
3. Use the posterior odds to compute the posterior probability that the per-
son carries HIV given a positive test.
P(E|H) / P(E|H^c) = 0.977 / 0.074 = 13.2
Therefore

posterior odds = prior odds × Bayes factor = (1/19) × 13.2 ≈ 1/1.44 ≈ 0.695
Given a positive test, a person in this group is 1.44 times more likely to
not carry HIV than to carry HIV.
3. The odds are the ratio of the posterior probabilities, and we basically just rescale so they add to 1. The posterior probability is

P(H|E) = 0.695 / (1 + 0.695) = 1 / (1 + 1.44) ≈ 0.410
The Bayes table is below; we have added a row for the ratios to illustrate
the odds calculations.
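The whole odds-form calculation can be scripted; a Python sketch using the numbers from this part (prior probability 0.05, as implied by the prior odds of 1/19):

```python
prior_prob = 0.05
prior_odds = prior_prob / (1 - prior_prob)  # 1/19

bayes_factor = 0.977 / 0.074                # likelihood ratio, about 13.2

# odds form of Bayes rule
posterior_odds = prior_odds * bayes_factor
posterior_prob = posterior_odds / (1 + posterior_odds)

print(round(posterior_odds, 3))  # about 0.695
print(round(posterior_prob, 3))  # about 0.41
```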
Chapter 12

Introduction to Bayesian Model Comparison
A Bayesian model is composed of both a model for the data (likelihood) and
a prior distribution on model parameters. Model selection usually refers to
choosing between different models for the data (likelihoods). But it can also
concern choosing between models with the same likelihood but different priors.
In Bayesian model comparison, prior probabilities are assigned to each of the
models, and these probabilities are updated given the data according to Bayes
rule. Bayesian model comparison can be viewed as Bayesian estimation in a
hierarchical model with an extra level for “model”. (We’ll cover hierarchical models in more detail later.)
Example 12.1. Suppose I have some trick coins, some of which are biased in
favor of landing on heads, and some of which are biased in favor of landing on
tails.1 I will select a trick coin at random; let 𝜃 be the probability that the
selected coin lands on heads in any single flip. I will flip the coin 𝑛 times and
use the data to decide about the direction of its bias. This can be viewed as a
choice between two models
Nrep = 1000000
theta = rbeta(Nrep, 7.5, 2.5)
y = rbinom(Nrep, 10, theta)
sum(y == 6) / Nrep
## [1] 0.1243
Nrep = 1000000
theta = rbeta(Nrep, 2.5, 7.5)
y = rbinom(Nrep, 10, theta)
sum(y == 6) / Nrep
## [1] 0.04191
3. The Bayes factor is the ratio of the likelihoods. The likelihood of 6 heads
in 10 flips under model 1 is 0.124, and under model 2 is 0.042. The Bayes
factor in favor of model 1 is 0.124/0.042 = 2.95. Observing 6 heads in
10 flips is 2.95 more likely under model 1 than under model 2. Also, the
posterior odds in favor of model 1 given 6 heads in 10 flips are 2.95 times
greater than the prior odds in favor of model 1.
4. In this case, the prior odds are 1, so the posterior odds in favor of model 1
are 2.95. The posterior probability of model 1 is 0.747, and the posterior
probability of model 2 is 0.253.
5. Now the prior odds in favor of model 1 are 1/9. So the posterior odds
in favor of model 1 given 6 heads in 10 flips are (1/9)(2.95)=0.328. The
posterior probability of model 1 is 0.247, and the posterior probability of
model 2 is 0.753.
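The simulated likelihoods can also be computed exactly: under a Beta(α, β) prior, the marginal likelihood of y heads in n flips is the beta-binomial probability C(n, y) B(α + y, β + n − y) / B(α, β). A Python check of the numbers above (illustrative; the book's simulations use R):

```python
from math import comb, exp, lgamma

def log_beta(a, b):
    # log of the beta function B(a, b)
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def marginal_lik(y, n, a, b):
    # P(Y = y) when theta ~ Beta(a, b) and Y | theta ~ Binomial(n, theta)
    return comb(n, y) * exp(log_beta(a + y, b + n - y) - log_beta(a, b))

p1 = marginal_lik(6, 10, 7.5, 2.5)  # model 1: biased toward heads
p2 = marginal_lik(6, 10, 2.5, 7.5)  # model 2: biased toward tails

bf = p1 / p2                # Bayes factor in favor of model 1
post_prob1 = bf / (1 + bf)  # posterior probability of model 1 when prior odds are 1

print(round(p1, 3), round(p2, 3))  # 0.124 0.042
print(round(bf, 2))                # 2.95
print(round(post_prob1, 3))        # 0.747
```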
Now suppose I want to predict the number of heads in the next 10 flips of
the selected coin.
6. If model 1 is correct the prior is Beta(7.5, 2.5), so the posterior after observing 6 heads in 10 flips is Beta(13.5, 6.5). Simulate a value of 𝜃 from a Beta(13.5, 6.5) distribution and given 𝜃 simulate a value of 𝑦 from a Binomial(10, 𝜃) distribution. Repeat many times. Approximate the posterior predictive probability of 7 heads in the next 10 flips, given model 1 is correct and 6 heads in the first 10 flips, with the proportion of simulated repetitions that yield a 𝑦 value of 7; the probability is 0.216.
Nrep = 1000000
theta = rbeta(Nrep, 7.5 + 6, 2.5 + 4)
y = rbinom(Nrep, 10, theta)
plot(table(y) / Nrep,
ylab = "Posterior predictive probability",
main = "Given Model 1")
[Figure: posterior predictive distribution of the number of heads in the next 10 flips, given Model 1.]
7. The simulation is similar, just use the prior in model 2. The posterior predictive probability of 7 heads in the next 10 flips, given model 2 is correct and 6 heads in the first 10 flips, is 0.076.
Nrep = 1000000
theta = rbeta(Nrep, 2.5 + 6, 7.5 + 4)
y = rbinom(Nrep, 10, theta)
plot(table(y) / Nrep,
ylab = "Posterior predictive probability",
main = "Given model 2")
[Figure: posterior predictive distribution of the number of heads in the next 10 flips, given Model 2.]
8. We saw in a previous part that with a 0.5/0.5 prior on model and 6 heads
in 10 flips, the posterior probability of model 1 is 0.747 and of model 2 is
0.253. We now add another stage to our simulation
The simulation results are below. We can also find the posterior predictive
probability of 7 heads in the next 10 flips using the law of total probabil-
ity to combine the results from the two previous parts: (0.747)(0.216) +
(0.253)(0.076) = 0.18
Nrep = 1000000

alpha = c(7.5, 2.5) + 6
beta = c(2.5, 7.5) + 4

# simulate the model (1 or 2) according to its posterior probability
model = sample(1:2, Nrep, replace = TRUE, prob = c(0.747, 0.253))

# simulate theta from the posterior distribution under the selected model,
# then the number of heads in the next 10 flips
theta = rbeta(Nrep, alpha[model], beta[model])
y = rbinom(Nrep, 10, theta)

plot(table(y) / Nrep,
     ylab = "Posterior predictive probability")
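The law-of-total-probability arithmetic above is worth a quick check (a trivial Python sketch):

```python
# posterior model probabilities, and the conditional posterior predictive
# probabilities of 7 heads in the next 10 flips, from the parts above
p_model = [0.747, 0.253]
p_seven_given_model = [0.216, 0.076]

# model averaging: weight each model's prediction by its posterior probability
p_seven = sum(pm * pc for pm, pc in zip(p_model, p_seven_given_model))
print(round(p_seven, 2))  # 0.18
```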
When several models are under consideration, the Bayesian model is the full
hierarchical structure which spans all models being compared. Thus, the most
complete posterior prediction takes into account all models, weighted by their
posterior probabilities. That is, prediction is accomplished by taking a weighted
average across the models, with weights equal to the posterior probabilities of
the models. This is called model averaging.
Example 12.2. Suppose again I select a coin, but now the decision is whether
the coin is fair. Suppose we consider the two models
1. A sample proportion of 15/20 = 0.75 does not seem consistent with the “must be fair” model, so we expect the Bayes Factor to favor the “anything is possible” model.

The Bayes Factor is the ratio of the likelihoods (of 15 heads in 20 flips). To approximate the likelihood of 15 heads in 20 flips for the “must be fair” model
Nrep = 1000000

# "must be fair" model: Beta(500, 500) prior, concentrated near 0.5
theta1 = rbeta(Nrep, 500, 500)
y1 = rbinom(Nrep, 20, theta1)

# "anything is possible" model: Uniform(0, 1), i.e., Beta(1, 1), prior
theta2 = rbeta(Nrep, 1, 1)
y2 = rbinom(Nrep, 20, theta2)

# Bayes factor in favor of the "must be fair" model
sum(y1 == 15) / sum(y2 == 15)

## [1] 0.3197
2. Similar to the previous part but with different data; now we compute the
likelihood of 11 heads in 20 flips. The Bayes factor is about 3.34. Thus,
the Bayes factor favors the “must be fair” model.
## [1] 3.328
3. A central 99% prior credible interval for 𝜃 based on the “must be fair”
model is (0.459, 0.541), which does not include the sample proportion
of 0.55. So you might think that the data would favor the “anything is
possible” model. However, the numerator and denominator in the Bayes
factor are average likelihoods: the likelihood of the data averaged over each
possible value of 𝜃. The “must be fair” model only gives initial plausibility
to 𝜃 values that are close to 0.5, and for such 𝜃 values the likelihood of
11 heads in 20 flips is not so small. Values of 𝜃 that are far from 0.5 are
effectively not included in the average, due to their low prior probability,
so the average likelihood is not so small.
In contrast, the “anything is possible” model stretches the prior probability
over all values in (0, 1). For many 𝜃 values in (0, 1) the likelihood of
observing 11 heads in 20 flips is close to 0, and with the Uniform(0, 1)
prior, each of these 𝜃 values contributes equally to the average likelihood.
Thus, the average likelihood is smaller for the “anything is possible” model
than for the “must be fair” model.
Complex models generally have an inherent advantage over simpler models be-
cause complex models have many more options available, and one of those op-
tions is likely to fit the data better than any of the fewer options in the simpler
model. However, we don’t always want to just choose the more complex model.
Always choosing the more complex model overfits the data.
Bayesian model comparison naturally compensates for discrepancies in model
complexity. In more complex models, prior probabilities are diluted over the
many options available. Even if a complex model has some particular combina-
tion of parameters that fit the data well, the prior probability of that particular
combination is likely to be small because the prior is spread more thinly than
for a simpler model. Thus, in Bayesian model comparison, a simpler model can
“win” if the data are consistent with it, even if the complex model fits well.
Example 12.3. Continuing Example 12.2 where we considered the two models
100 flips. Which model does the Bayes factor favor? Is the choice of
model sensitive to the change of prior distribution within the “anything is
possible” model?
3. For each of the two “anything is possible” priors, find the posterior dis-
tribution of 𝜃 and a 98% posterior credible interval for 𝜃 given 65 heads
in 100 flips. Is estimation of 𝜃 within the “anything is possible” model
sensitive to the change in the prior distribution for 𝜃?
1. The simulation is similar to the ones in the previous example, just with
different data. The Bayes Factor is about 0.126 in favor of the “must be
fair” model. So the Bayes Factor favors the “anything is possible” model.
Nrep = 1000000

theta1 = rbeta(Nrep, 500, 500)
y1 = rbinom(Nrep, 100, theta1)

theta2 = rbeta(Nrep, 1, 1)
y2 = rbinom(Nrep, 100, theta2)

# Bayes factor in favor of the "must be fair" model
sum(y1 == 65) / sum(y2 == 65)

## [1] 0.1325
2. The simulation is similar to the one in the previous part, just with a different prior. The Bayes Factor is about 5.73 in favor of the “must be fair” model. So the Bayes Factor favors the “must be fair” model. Even though they are both non-informative priors, the Beta(1, 1) and Beta(0.01, 0.01) priors lead to very different Bayes factors and decisions. The choice of model does appear to be sensitive to the choice of prior distribution.
Nrep = 1000000

theta1 = rbeta(Nrep, 500, 500)
y1 = rbinom(Nrep, 100, theta1)

theta2 = rbeta(Nrep, 0.01, 0.01)
y2 = rbinom(Nrep, 100, theta2)

sum(y1 == 65) / sum(y2 == 65)

## [1] 5.272
3. For a Beta(1, 1) prior, the posterior of 𝜃 given 65 heads in 100 flips is the
Beta(66, 36) distribution, and a central 98% posterior credible interval
for 𝜃 is (0.534, 0.752). For a Beta(0.01, 0.01) prior, the posterior of 𝜃
given 65 heads in 100 flips is the Beta(65.01, 35.01) distribution, and a
central 98% posterior credible interval for 𝜃 is (0.536, 0.755). The Beta(66,
36) and Beta(65.01, 35.01) distributions are virtually identical, and the
98% credible intervals are practically the same. At least in this case, the
estimation of 𝜃 within the “anything is possible” model does not appear
to be sensitive to the choice of prior.
1. With the Beta(1, 1) prior in the “anything is possible” model, the posterior
distribution of 𝜃 after 6 heads in the first 10 flips is the Beta(7, 5) distri-
bution. With the Beta(500, 500) prior in the “must be fair” model, the
posterior distribution of 𝜃 after 6 heads in the first 10 flips is the Beta(506,
504) distribution. The simulation to approximate the likelihood in each
model is similar to before, but now we simulate 𝜃 from its posterior distri-
bution after the first 10 flips, and evaluate the likelihood of observing 59
heads in the remaining 90 flips. The Bayes factor is about 0.056 in favor
of the “must be fair” model. So the Bayes Factor favors the “anything is
possible” model.
Nrep = 1000000
theta1 = rbeta(Nrep, 500 + 6, 500 + 4)
y1 = rbinom(Nrep, 90, theta1)
theta2 = rbeta(Nrep, 1 + 6, 1 + 4)
y2 = rbinom(Nrep, 90, theta2)
mean(y1 == 59) / mean(y2 == 59)
## [1] 0.05509
2. With the Beta(0.01, 0.01) prior in the “anything is possible” model, the
posterior distribution of 𝜃 after 6 heads in the first 10 flips is the Beta(6.01,
4.01) distribution. The simulation is similar to the previous part, just with
the different distribution for 𝜃 in the “anything is possible” model. The
Bayes factor is about 0.057 in favor of the “must be fair” model, about the
same as in the previous part. So the Bayes Factor favors the “anything is
possible” model. Notice that after “training” the models on the first 10
observations, the model comparison is no longer so sensitive to the choice
of prior within the “anything is possible” model.
Nrep = 1000000
theta2 = rbeta(Nrep, 0.01 + 6, 0.01 + 4)
y2 = rbinom(Nrep, 90, theta2)
## [1] 0.05941
How do you choose the ROPE? That depends on the practical application.
In general, traditional testing of point null hypotheses (that is, “no ef-
fect/difference”) is not a primary concern in Bayesian statistics. Rather, the
posterior distribution provides all relevant information to make decisions about
practically meaningful issues. Ask research questions that are important in the
context of the problem and use the posterior distribution to answer them.
Chapter 13
Bayesian Analysis of
Poisson Count Data
1. In what ways is this like the Binomial situation? (What is a trial? What
is “success”?)
2. In what ways is this NOT like the Binomial situation?
Solution to Example 13.1.
1. Each pitch is a trial, and on each trial either a home run is hit (“success”)
or not. The random variable 𝑌 counts the number of home runs (successes)
over all the trials.
2. Even though 𝑌 is counting successes, this is not the Binomial situation.
• The number of trials is not fixed. The total number of pitches varies
from game to game. (The average is around 300 pitches per game).
• The probability of success is not the same on each trial. Different
batters have different probabilities of hitting home runs. Also, dif-
ferent pitch counts or game situations lead to different probabilities
of home runs.
• The trials might not be independent, though this is a little more ques-
tionable. Make sure you distinguish independence from the previous
assumption of unequal probabilities of success; you need to consider
conditional probabilities to assess independence. Maybe if a pitcher
gives up a home run on one pitch, then the pitcher is “rattled” so
the probability that he also gives up a home run on the next pitch
increases, or the pitcher gets pulled for a new pitcher which changes
the probability of a home run on the next pitch.
1. In what ways is this like the Binomial situation? (What is a trial? What
is “success”?)
2. In what ways is this NOT like the Binomial situation?
1. Each automobile on the road during the day is a trial, and each automobile
either gets in an accident (“success”) or not. The random variable
𝑌 counts the number of automobiles that get into accidents (successes).
(Remember “success” is just a generic label for the event you’re interested
in; “success” is not necessarily good.)
2. Even though 𝑌 is counting successes, this is not the Binomial situation.
• The number of trials is not fixed. The total number of automobiles
on the road varies from day to day.
• The probability of success is not the same on each trial. Different
drivers have different probabilities of getting into accidents; some
drivers are safer than others. Also, different conditions increase the
probability of an accident, like driving at night.
• The trials are plausibly not independent. Make sure you distinguish
independence from the previous assumption of unequal probabilities
of success; you need to consider conditional probabilities to assess
independence. If an automobile gets into an accident, then the prob-
ability of getting into an accident increases for the automobiles that
are driving near it.
Poisson models are models for counts that have more flexibility than Binomial
models. Poisson models are parameterized by a single parameter (the mean) and
do not require all the assumptions of a Binomial model. Poisson distributions
are often used to model the distribution of variables that count the number
of “relatively rare” events that occur over a certain interval of time or in a
certain location (e.g., number of accidents on a highway in a day, number of
car insurance policies that have claims in a week, number of bank loans that go
into default, number of mutations in a DNA sequence, number of earthquakes
that occur in SoCal in an hour, etc.)
If 𝑌 has a Poisson(𝜃) distribution, its probability mass function is

𝑓(𝑦|𝜃) = 𝑒^(−𝜃) 𝜃^𝑦 / 𝑦!,    𝑦 = 0, 1, 2, …

with

𝐸(𝑌) = 𝜃
𝑉𝑎𝑟(𝑌) = 𝜃
For a Poisson distribution, both the mean and variance are equal to 𝜃, but
remember that the mean is measured in the count units (e.g., home runs) but
the variance is measured in squared units (e.g., (home runs)2 ).
¹ If 𝑌₁ has mean 𝜃₁ and 𝑌₂ has mean 𝜃₂, then linearity of expected value implies that 𝑌₁ + 𝑌₂
has mean 𝜃₁ + 𝜃₂. If 𝑌₁ has variance 𝜃₁ and 𝑌₂ has variance 𝜃₂, then independence of 𝑌₁ and 𝑌₂
implies that 𝑌₁ + 𝑌₂ has variance 𝜃₁ + 𝜃₂. What Poisson aggregation says is that if component
counts are independent and each has a Poisson shape, then the total count also has a Poisson
shape.
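The aggregation property in this footnote can be checked numerically by convolving two Poisson pmfs; a sketch (in Python, mirroring the book's R computations, with arbitrary illustrative means):

```python
import math

def pois_pmf(k, mean):
    # P(Y = k) for a Poisson variable with the given mean
    return math.exp(-mean) * mean ** k / math.factorial(k)

theta1, theta2 = 1.2, 2.3

# P(Y1 + Y2 = 4): sum over the ways the total of 4 can split between Y1 and Y2
conv = sum(pois_pmf(j, theta1) * pois_pmf(4 - j, theta2) for j in range(5))

# Poisson aggregation: the total is Poisson(theta1 + theta2)
direct = pois_pmf(4, theta1 + theta2)
print(abs(conv - direct) < 1e-12)  # True
```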
[Plot: Poisson(0.5), Poisson(1), Poisson(1.5), and Poisson(2) probability mass functions for 𝑦 = 0, 1, …, 8.]
Example 13.3. Suppose the number of home runs hit per game (by both
teams in total) at a particular Major League Baseball park follows a Poisson
distribution with parameter 𝜃.
1. Sketch your prior distribution for 𝜃 and describe its features. What are
the possible values of 𝜃? Does 𝜃 take values on a discrete or continuous
scale?
2. Suppose 𝑌 represents a home run count for a single game. What are the
possible values of 𝑌 ? Does 𝑌 take values on a discrete or continuous scale?
3. Suppose a single game with 1 home run is observed. Find the posterior
distribution of 𝜃. In particular, how do you determine the likelihood
column?
5. Now consider the original prior again. Find the posterior distribution of
𝜃 after observing 1 home run in the first game and 3 home runs in the
second, without the intermediate updating of the posterior after the first
game. How does the likelihood column relate to the likelihood columns
from the previous parts? How does the posterior distribution compare
with the posterior distribution from the previous part?
6. Now consider the original prior again. Suppose that instead of observing
the two individual values, we only observe that there is a total of 4 home
runs in 2 games. Find the posterior distribution of 𝜃. In particular, how
do you determine the likelihood column? How does the likelihood column
compare to the one from the previous part? How does the posterior compare
to the one from the previous part?
7. Suppose we’ll observe a third game tomorrow. How could you find — both
analytically and via simulation — the posterior predictive probability that
this game has 0 home runs?
9. Now let’s consider some real data. Assume home runs per game at Citizens
Bank Park (Phillies!) follow a Poisson distribution with parameter 𝜃.
Assume that the prior distribution for 𝜃 satisfies
The following summarizes data for the 2020 season2 . There were 97 home
runs in 32 games. Use grid approximation to compute the posterior dis-
tribution of 𝜃 given the data. Be sure to specify the likelihood. Plot the
prior, (scaled) likelihood, and posterior.
2 Source: https://www.baseball-reference.com/teams/PHI/2020.shtml
1. Your prior is whatever it is. We’ll discuss how we chose a prior in a later
part. Even though each data value is an integer, the mean number of home
runs per game 𝜃 can be any value greater than 0. That is, the parameter
𝜃 takes values on a continuous scale.
2. 𝑌 can be 0, 1, 2, and so on, taking values on a discrete scale. Technically,
there is no fixed upper bound on what 𝑌 can be.
3. The likelihood is the Poisson probability of 1 home run in a game computed
for each value of 𝜃.
𝑓(𝑦 = 1|𝜃) = 𝑒^(−𝜃) 𝜃¹ / 1!
For example, the likelihood of 1 home run in a game given 𝜃 = 0.5 is
𝑓(𝑦 = 1|𝜃 = 0.5) = 𝑒^(−0.5)(0.5)¹/1! = 0.3033. If on average there are 0.5 home
runs per game, then about 30% of games would have exactly 1 home run.
As always, the posterior is proportional to the product of the prior and
the likelihood. We see that the posterior distribution puts even greater
probability on 𝜃 = 1.5 than the prior.
theta prior likelihood product posterior
0.5 0.13 0.3033 0.0394 0.1513
1.5 0.45 0.3347 0.1506 0.5779
2.5 0.28 0.2052 0.0575 0.2205
3.5 0.11 0.1057 0.0116 0.0446
4.5 0.03 0.0500 0.0015 0.0058
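The grid update in the table can be reproduced with a few lines of code; a sketch (a Python rendering of the same computation, using the discrete prior from the table):

```python
import math

theta = [0.5, 1.5, 2.5, 3.5, 4.5]
prior = [0.13, 0.45, 0.28, 0.11, 0.03]

# Poisson likelihood of observing y = 1 home run, for each value of theta
likelihood = [math.exp(-t) * t / math.factorial(1) for t in theta]

# posterior is proportional to prior times likelihood
product = [p * l for p, l in zip(prior, likelihood)]
posterior = [pr / sum(product) for pr in product]

print([round(p, 4) for p in posterior])
# matches the table: [0.1513, 0.5779, 0.2205, 0.0446, 0.0058]
```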
4. The likelihood is the Poisson probability of 3 home runs in a game com-
puted for each value of 𝜃.
𝑓(𝑦 = 3|𝜃) = 𝑒^(−𝜃) 𝜃³ / 3!
The posterior places about 90% of the probability on 𝜃 being either 1.5 or
2.5.
theta prior likelihood product posterior
0.5 0.1513 0.0126 0.0019 0.0145
1.5 0.5779 0.1255 0.0725 0.5488
2.5 0.2205 0.2138 0.0471 0.3566
3.5 0.0446 0.2158 0.0096 0.0728
4.5 0.0058 0.1687 0.0010 0.0073
5. Since the games are independent³, the likelihood is the product of the
likelihoods from the two previous parts:

𝑓(𝑦 = (1, 3)|𝜃) = (𝑒^(−𝜃) 𝜃¹ / 1!)(𝑒^(−𝜃) 𝜃³ / 3!)
³ More precisely, they are conditionally independent given 𝜃. This is a somewhat subtle distinction, so I’ve
glossed over the details.
7. Simulate a value of 𝜃 from its posterior distribution and then given 𝜃 sim-
ulate a value of 𝑌 from a Poisson(𝜃) distribution, and repeat many times.
Approximate the probability of 0 home runs by finding the proportion of
repetitions that yield a 𝑌 value of 0. (We’ll see some code a little later.)
We can compute the probability using the law of total probability. Find
the probability of 0 home runs for each value of 𝜃, that is 𝑒−𝜃 𝜃0 /0! = 𝑒−𝜃 ,
and then weight these values by their posterior probabilities to find the
predictive probability of 0 home runs, which is 0.163.
𝑒^(−0.5)(0.0145) + 𝑒^(−1.5)(0.5488) + 𝑒^(−2.5)(0.3566) + 𝑒^(−3.5)(0.0728) + 𝑒^(−4.5)(0.0073)
= (0.6065)(0.0145) + (0.2231)(0.5488) + (0.0821)(0.3566) + (0.0302)(0.0728) + (0.0111)(0.0073) = 0.163
According to this model, we predict that about 16% of games would have
0 home runs.
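The law of total probability computation is one line of code; a sketch (in Python, using the posterior probabilities from the table):

```python
import math

theta = [0.5, 1.5, 2.5, 3.5, 4.5]
posterior = [0.0145, 0.5488, 0.3566, 0.0728, 0.0073]

# P(0 home runs | theta) = e^(-theta); weight by the posterior probabilities
prob_zero = sum(math.exp(-t) * p for t, p in zip(theta, posterior))
print(round(prob_zero, 3))  # 0.163
```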
(Although 𝜃 can take any value greater than 0, the interval [0, 8] accounts for 99.99%
of the prior probability.)
# prior
theta = seq(0, 8, 0.001)
prior = dgamma(theta, shape = 4, rate = 2) # Gamma(4, 2) prior (see Example 13.5)
prior = prior / sum(prior)
# data
n = 1 # sample size
y = 1 # sample mean
# likelihood
likelihood = dpois(y, theta)
# posterior
product = likelihood * prior
posterior = product / sum(product)
[Plot: prior, scaled likelihood, and posterior of theta.]
# prior
theta = seq(0, 8, 0.001)
prior = dgamma(theta, shape = 4, rate = 2)
prior = prior / sum(prior)
# data
n = 32 # sample size
y = 97 / 32 # sample mean
# likelihood
likelihood = dpois(n * y, n * theta)
# posterior
product = likelihood * prior
posterior = product / sum(product)
[Plot: prior, scaled likelihood, and posterior of theta.]
[Plots: Gamma densities with rate parameter 𝜆 = 1 (left) and Gamma densities with shape parameter 𝛼 = 3 (right).]
Example 13.4. The plots above show a few examples of Gamma distributions.
1. The plot on the left above contains a few different Gamma densities, all
with rate parameter 𝜆 = 1. Match each density to its shape parameter 𝛼;
the choices are 1, 2, 5, 10.
2. The plot on the right above contains a few different Gamma densities, all
with shape parameter 𝛼 = 3. Match each density to its rate parameter 𝜆;
the choices are 1, 2, 3, 4.
constant which ensures that the total area under the density is 1. The actual Gamma density
formula, including the normalizing constant, is
𝑓(𝑢) = (𝜆^𝛼 / Γ(𝛼)) 𝑢^(𝛼−1) 𝑒^(−𝜆𝑢),    𝑢 > 0,

where Γ(𝛼) = ∫₀^∞ 𝑒^(−𝑢) 𝑢^(𝛼−1) 𝑑𝑢 is the Gamma function. For a positive integer 𝑘, Γ(𝑘) = (𝑘 − 1)!.
Also, Γ(1/2) = √𝜋.
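The Gamma function facts quoted here can be checked with `math.gamma`; a short sketch (Python, with an illustrative evaluation of this chapter's Gamma(4, 2) prior density):

```python
import math

def gamma_pdf(u, alpha, lam):
    # Gamma(alpha, lam) density, with math.gamma supplying the normalizing constant
    return lam ** alpha / math.gamma(alpha) * u ** (alpha - 1) * math.exp(-lam * u)

print(math.gamma(5) == math.factorial(4))                 # True: Gamma(k) = (k-1)!
print(abs(math.gamma(0.5) - math.sqrt(math.pi)) < 1e-12)  # True: Gamma(1/2) = sqrt(pi)

# e.g., the Gamma(4, 2) prior density used later in this chapter, evaluated at u = 2
print(round(gamma_pdf(2, 4, 2), 4))
```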
1. For a fixed 𝜆, as the shape parameter 𝛼 increases, both the mean and the
standard deviation increase.
2. For a fixed 𝛼, as the rate parameter 𝜆 increases, both the mean and the
standard deviation decrease.
Observe that changing 𝜆 doesn’t change the overall shape of the curve,
just the scale of values that it covers. However, changing 𝛼 does change
the shape of the curve; notice the changes in concavity in the plot on the
left.
[Plots: left, Gamma densities with rate 𝜆 = 1 and shape 𝛼 = 1, 2, 5, 10; right, Gamma densities with shape 𝛼 = 3 and rate 𝜆 = 1, 2, 3, 4.]
Example 13.5. Assume home runs per game at Citizens Bank Park follow a
Poisson distribution with parameter 𝜃. Assume for 𝜃 a Gamma prior distribution
with shape parameter 𝛼 = 4 and rate parameter 𝜆 = 2.
1. Write an expression for the prior density 𝜋(𝜃). Plot the prior distribution.
Find the prior mean, prior SD, and prior 50%, 80%, and 98% credible
intervals for 𝜃.
2. Suppose a single game with 1 home run is observed. Write the likelihood
function.
3. Write an expression for the posterior distribution of 𝜃 given a single game
with 1 home run. Identify by name the posterior distribution and
the values of relevant parameters. Plot the prior distribution, (scaled)
likelihood, and posterior distribution. Find the posterior mean, posterior
SD, and posterior 50%, 80%, and 98% credible intervals for 𝜃.
4. Now consider the original prior again. Determine the likelihood of observing
1 home run in game 1 and 3 home runs in game 2 in a sample of 2
games, and the posterior distribution of 𝜃 given this sample. Identify by
name the posterior distribution and the values of relevant parameters.
Plot the prior distribution, (scaled) likelihood, and posterior distribution.
Find the posterior mean, posterior SD, and posterior 50%, 80%, and 98%
credible intervals for 𝜃.
5. Consider the original prior again. Determine the likelihood of observing a
total of 4 home runs in a sample of 2 games, and the posterior distribution
of 𝜃 given this sample. Identify by name the posterior distribution and
the values of relevant parameters. How does this compare to the previous
part?
6. Consider the 2020 data in which there were 97 home runs in 32 games.
Determine the likelihood function and the posterior distribution of 𝜃 given
this sample. Identify by name the posterior distribution and the values
of relevant parameters. Plot the prior distribution, (scaled) likelihood,
and posterior distribution. Find the posterior mean, posterior SD, and
posterior 50%, 80%, and 98% credible intervals for 𝜃.
7. Interpret the credible interval from the previous part in context.
8. Express the posterior mean of 𝜃 based on the 2020 data as a weighted
average of the prior mean and the sample mean.
9. While the main parameter is 𝜃, there are other parameters of interest.
For example, 𝜂 = 𝑒−𝜃 is the population proportion of games in which
there are 0 home runs. Assuming that you already have the posterior
distribution of 𝜃 (or a simulation-based approximation), explain how you
could use simulation to approximate the posterior distribution of 𝜂. Run
the simulation and plot the posterior distribution, and find and interpret
50%, 80%, and 98% posterior credible intervals for 𝜂.
10. Use JAGS to approximate the posterior distribution of 𝜃 given this sample.
Compare with the results from the previous example.
This is the same prior we used in the grid approximation in Example 13.3.
See below for a plot.
𝜋(𝜃) ∝ 𝜃^(4−1) 𝑒^(−2𝜃),    𝜃 > 0

Prior mean = 𝛼/𝜆 = 4/2 = 2
Prior SD = √(𝛼/𝜆²) = √(4/2²) = 1
Posterior mean = 𝛼/𝜆 = 5/3 = 1.667
Posterior SD = √(𝛼/𝜆²) = √(5/3²) = 0.745
# prior
theta = seq(0, 8, 0.001)
alpha = 4
lambda = 2
prior = dgamma(theta, shape = alpha, rate = lambda)
# likelihood
n = 1 # sample size
y = 1 # sample mean
likelihood = dpois(n * y, n * theta)
# posterior
posterior = dgamma(theta, alpha + n * y, lambda + n)
# plot (plot_continuous_posterior is a helper function defined in the book's setup code)
plot_continuous_posterior(theta, prior, likelihood, posterior)
[Plot: prior, scaled likelihood, and posterior of theta.]
𝑓(𝑦 = (1, 3)|𝜃) = (𝑒^(−𝜃) 𝜃¹ / 1!)(𝑒^(−𝜃) 𝜃³ / 3!) ∝ 𝑒^(−2𝜃) 𝜃⁴,    𝜃 > 0.
Posterior mean = 𝛼/𝜆 = 8/4 = 2
Posterior SD = √(𝛼/𝜆²) = √(8/4²) = 0.707
n = 2 # sample size
y = 2 # sample mean
# likelihood
likelihood = dpois(1, theta) * dpois(3, theta)
# posterior
posterior = dgamma(theta, alpha + n * y, lambda + n)
# plot
plot_continuous_posterior(theta, prior, likelihood, posterior)
[Plot: prior, scaled likelihood, and posterior of theta.]
𝑓(𝑦̄ = 2|𝜃) = 𝑒^(−2𝜃) (2𝜃)⁴ / 4! ∝ 𝑒^(−2𝜃) 𝜃⁴,    𝜃 > 0
# likelihood
n = 2 # sample size
y = 2 # sample mean
likelihood = dpois(n * y, n * theta)
# posterior
posterior = dgamma(theta, alpha + n * y, lambda + n)
# plot
plot_continuous_posterior(theta, prior, likelihood, posterior)
[Plot: prior, scaled likelihood, and posterior of theta.]
Posterior mean = 𝛼/𝜆 = 101/34 = 2.97
Posterior SD = √(𝛼/𝜆²) = √(101/34²) = 0.296
# likelihood
n = 32 # sample size
y = 97 / 32 # sample mean
likelihood = dpois(n * y, n * theta)
# posterior
posterior = dgamma(theta, alpha + n * y, lambda + n)
# plot
plot_continuous_posterior(theta, prior, likelihood, posterior)
[Plot: prior, scaled likelihood, and posterior of theta.]
8. The prior mean is 4/2=2, based on a “prior sample size” of 2. The sample
mean is 97/32 = 3.03, based on a sample size of 32. The posterior mean
is (4 + 97)/(2 + 32) = 2.97. The posterior mean is a weighted average
of the prior mean and the sample mean with the weights based on the
“sample sizes”
2.97 = (4 + 97)/(2 + 32) = (2/(2 + 32))(4/2) + (32/(2 + 32))(97/32) = (0.0589)(2) + (0.941)(3.03)
theta_sim = rgamma(10000, shape = 101, rate = 34)
eta_sim = exp(-theta_sim)
[Plot: posterior density of eta = exp(−theta).]
## 25% 75%
## 0.04220 0.06315
## 10% 90%
## 0.03481 0.07455
## 1% 99%
## 0.02488 0.09849
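Because 𝜂 = 𝑒^(−𝜃) is a decreasing function of 𝜃, credible interval endpoints for 𝜂 map directly to endpoints for 𝜃, with the order reversed. A quick check against the 98% interval above (a Python sketch):

```python
import math

# endpoints of the 98% posterior credible interval for eta from the output above
eta_lo, eta_hi = 0.02488, 0.09849

# theta = -log(eta); the smaller eta endpoint gives the larger theta endpoint
theta_lo, theta_hi = -math.log(eta_hi), -math.log(eta_lo)
print(round(theta_lo, 2), round(theta_hi, 2))  # 2.32 3.69
```

These endpoints are consistent with a 98% posterior credible interval for 𝜃 from its Gamma(101, 34) posterior.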
• The data have been loaded as individual values: the number of home runs in
each of the 32 games.
# data
df = read.csv("_data/citizens-bank-hr-2020.csv")
y = df$hr
n = length(y)
# model
model_string <- "model{
# Likelihood
for (i in 1:n){
y[i] ~ dpois(theta)
}
# Prior
theta ~ dgamma(alpha, lambda)
alpha <- 4
lambda <- 2
}"
Nrep = 10000
Nchains = 3
##
## Iterations = 1001:11000
## Thinning interval = 1
## Number of chains = 3
## Sample size per chain = 10000
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## 2.97290 0.29624 0.00171 0.00173
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## 2.42 2.77 2.96 3.17 3.58
plot(posterior_sample)
[Trace plots and posterior densities of theta.]
In the previous example we saw that if the values of the measured variable follow
a Poisson distribution with parameter 𝜃 and the prior for 𝜃 follows a Gamma
distribution, then the posterior distribution for 𝜃 given the data also follows a
Gamma distribution.
Gamma-Poisson model.6 Consider a measured variable 𝑌 which, given 𝜃,
follows a Poisson(𝜃) distribution. Let 𝑦 ̄ be the sample mean for a random sample
of size 𝑛. Suppose 𝜃 has a Gamma(𝛼, 𝜆) prior distribution. Then the posterior
distribution of 𝜃 given 𝑦 ̄ is the Gamma(𝛼 + 𝑛𝑦,̄ 𝜆 + 𝑛) distribution.
That is, Gamma distributions form a conjugate prior family for a Poisson like-
lihood.
The posterior distribution is a compromise between prior and likelihood. For the
Gamma-Poisson model, there is an intuitive interpretation of this compromise.
In a sense, you can interpret 𝛼 as “prior total count” and 𝜆 as “prior sample
size”, but these are only “pseudo-observations”. Also, 𝛼 and 𝜆 are not necessarily
integers.
Note that if 𝑦̄ is the sample mean count, then 𝑛𝑦̄ = ∑ᵢ₌₁ⁿ 𝑦ᵢ is the sample total
count.
⁶ I’ve been naming these models in the form “Prior-Likelihood”, e.g., Gamma prior and Poisson likelihood.
• The posterior total count is the sum of the “prior total count” 𝛼 and the
sample total count 𝑛𝑦.̄
• The posterior sample size is the sum of the “prior sample size” 𝜆 and the
observed sample size 𝑛.
• The posterior mean is a weighted average of the prior mean and the sample
mean, with weights proportional to the “sample sizes”.
(𝛼 + 𝑛𝑦̄)/(𝜆 + 𝑛) = (𝜆/(𝜆 + 𝑛))(𝛼/𝜆) + (𝑛/(𝜆 + 𝑛)) 𝑦̄
• As more data are collected, more weight is given to the sample mean (and
less weight to the prior mean)
• Larger values of 𝜆 indicate stronger prior beliefs, due to smaller prior
variance (and larger “prior sample size”), and give more weight to the
prior mean
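The conjugate update and the weighted-average interpretation can be verified numerically; a sketch (in Python, using the chapter's Gamma(4, 2) prior and the 2020 data of 97 home runs in 32 games):

```python
# Gamma(alpha, lam) prior; n games with the given total home run count
alpha, lam = 4, 2
n, total = 32, 97  # total = n * ybar

# conjugate update: posterior is Gamma(alpha + n * ybar, lam + n)
post_mean = (alpha + total) / (lam + n)

# the same mean, written as a weighted average of prior mean and sample mean
weighted = (lam / (lam + n)) * (alpha / lam) + (n / (lam + n)) * (total / n)

print(round(post_mean, 2))                # 2.97
print(abs(post_mean - weighted) < 1e-12)  # True
```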
Example 13.6. Continuing the previous example, assume home runs per game
at Citizens Bank Park follow a Poisson distribution with parameter 𝜃. Assume
for 𝜃 a Gamma prior distribution with shape parameter 𝛼 = 4 and rate param-
eter 𝜆 = 2. Consider the 2020 data in which there were 97 home runs in 32
games.
1. How could you use simulation (not JAGS) to approximate the posterior
predictive distribution of home runs in a game?
2. Use the simulation from the previous part to find and interpret a 95%
posterior prediction interval with a lower bound of 0.
3. Is a Poisson model a reasonable model for the data? How could you use
posterior predictive simulation to simulate what a sample of 32 games
might look like under this model. Simulate many such samples. Does the
observed sample seem consistent with the model?
Nrep = 10000
theta_sim = rgamma(Nrep, 101, 34)
y_sim = rpois(Nrep, theta_sim)
[Plot: posterior predictive distribution of home runs per game.]
quantile(y_sim, 0.95)
## 95%
## 6
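Python's standard library has no Poisson sampler, but one is easy to write, so the same JAGS-free simulation can be sketched as follows (assuming the Gamma(101, 34) posterior; note `random.gammavariate` takes a scale parameter, the reciprocal of the rate):

```python
import math
import random

def rpois(mean):
    # Knuth's method: count uniform draws until their product drops below e^(-mean)
    limit, prod, k = math.exp(-mean), 1.0, 0
    while True:
        prod *= random.random()
        if prod < limit:
            return k
        k += 1

random.seed(1)
# draw theta from its Gamma(101, rate = 34) posterior, then y from Poisson(theta)
y_sim = sorted(rpois(random.gammavariate(101, 1 / 34)) for _ in range(10000))
print(y_sim[9499])  # 95th percentile; the chapter's simulation gives 6
```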
df = read.csv("_data/citizens-bank-hr-2020.csv")
y = df$hr
n = length(y)
n_samples = 100
# simulate samples (plotting code not shown)
for (r in 1:n_samples){
  y_sim = rpois(n, rgamma(1, 101, 34))
}
[Plot: home run counts for 100 simulated samples of 32 games.]
4. Continuing with the simulation from the previous part, now for each sim-
ulated sample we record the number of games with 0 home runs. Each
“dot” in the plot below represents a sample of size 32 for which we mea-
sure the number of games in the sample with 0 home runs. While we see
that it’s less likely to have 0 home runs in 32 games than not, it would not
be too surprising to see 0 home runs in a sample of 32 games. Therefore,
the fact that there are 0 home runs in the observed sample alone does not
invalidate the model.
n_samples = 10000
n_zero = rep(NA, n_samples)
# simulate samples
for (r in 1:n_samples){
  y_sim = rpois(n, rgamma(1, 101, 34))
  n_zero[r] = sum(y_sim == 0)
}
[Plot: proportion of simulated samples of size 32 by number of games with 0 home runs.]
Chapter 14

Introduction to Multi-Parameter Models
1. See the table below. There are 3 possible values for 𝜇 and 2 possible
values for 𝜎, so there are (3)(2) = 6 possible (𝜇, 𝜎) pairs. Each row in the
Bayes table represents a (𝜇, 𝜎) pair. Since the prior assumes independence,
the prior probability of any pair is the product of the marginal prior
probabilities of 𝜇 and 𝜎. For example, the probability that
𝜇 = 97.6 and 𝜎 = 0.5 is (0.2)(0.25) = 0.05.
2. The likelihood is similar to what we have seen in other examples concern-
ing body temperature, but it is now a function of both 𝜇 and 𝜎. That is,
the likelihood is a function of two variables. The likelihood is determined
by evaluating, for each (𝜇, 𝜎) pair, the Normal(𝜇, 𝜎) density at each of
𝑦 = 97.9 and 𝑦 = 97.5 and then finding the product:
𝑓(𝑦 = (97.9, 97.5)|𝜇, 𝜎) ∝ [𝜎⁻¹ exp(−½((97.9 − 𝜇)/𝜎)²)] [𝜎⁻¹ exp(−½((97.5 − 𝜇)/𝜎)²)]
# prior
# data
y = c(97.9, 97.5) # observed sample of two measurements
# likelihood
likelihood = dnorm(97.9, mean = theta$mu, sd = theta$sigma) *
dnorm(97.5, mean = theta$mu, sd = theta$sigma)
# posterior
product = likelihood * prior
posterior = product / sum(product)
# bayes table
bayes_table = data.frame(theta,
prior,
likelihood,
product,
posterior)
4. Intuitively, knowing only the posterior mean would not be sufficient, since
it would not give us enough information to estimate the standard deviation
𝜎. In order to evaluate the likelihood we need to compute (𝑦 − 𝜇)/𝜎 for each
individual 𝑦 value, so if we only had the sample mean we would not be
able to fill in the likelihood column.
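The point about needing individual values can be sketched numerically: two samples with the same mean but different spreads have different likelihoods under the same (𝜇, 𝜎). A Python sketch (the second sample's values are illustrative, not from the book):

```python
import math

def norm_pdf(y, mu, sigma):
    # Normal(mu, sigma) density evaluated at y
    return math.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def sample_likelihood(ys, mu, sigma):
    # likelihood of a sample is the product of the individual densities
    out = 1.0
    for y in ys:
        out *= norm_pdf(y, mu, sigma)
    return out

mu, sigma = 97.6, 0.5
lik_a = sample_likelihood([97.9, 97.5], mu, sigma)  # the observed sample
lik_b = sample_likelihood([98.7, 96.7], mu, sigma)  # same mean 97.7, wider spread
print(lik_a > lik_b)  # True: the sample mean alone does not determine the likelihood
```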
The plots below compare the prior and posterior distributions from the previous
problem.
ggplot(bayes_table %>%
mutate(mu = factor(mu),
sigma = factor(sigma)),
aes(mu, sigma)) +
geom_tile(aes(fill = prior)) +
scale_fill_viridis(limits = c(0, max(c(prior, posterior))))
ggplot(bayes_table %>%
mutate(mu = factor(mu),
sigma = factor(sigma)),
aes(mu, sigma)) +
geom_tile(aes(fill = posterior)) +
scale_fill_viridis(limits = c(0, max(c(prior, posterior))))
[Tile plots: prior (left) and posterior (right) probabilities over the (mu, sigma) grid.]
• 𝜇 has a Normal distribution with mean 98.6 and standard deviation 0.3.
• 𝜏 has a Gamma distribution with shape parameter 5 and rate parameter
2.
• 𝜇 and 𝜏 are independent.
(This problem just concerns the prior distribution. We’ll look at this posterior
distribution in the next example.)
1. Simulate (𝜇, 𝜏 ) pairs from the prior distribution and plot them.
2. Simulate (𝜇, 𝜎) pairs from the prior distribution and plot them. Describe
the prior distribution of 𝜎.
3. Find and interpret a central 98% prior credible interval for 𝜇.
4. Find a central 98% prior credible interval for the precision 𝜏 = 1/𝜎².
5. Find and interpret a central 98% prior credible interval for 𝜎.
6. What is the prior credibility that both 𝜇 and 𝜎 lie within their credible
intervals?
Nrep = 100000
[Plots: simulated (mu, tau) and (mu, sigma) pairs from the prior, and the marginal prior densities of sigma and mu.]
## 1% 99%
## 97.9 99.3
4. We can compute a credible interval like usual. Precision just doesn’t have
as practical an interpretation as standard deviation.
## 1% 99%
## 0.6374 5.8224
## 1% 99%
## 0.4144 1.2526
1. There are (100 − 96)/0.01 = 400 values of 𝜇 in the grid (actually 401 including
both endpoints) and (25 − 0.1)/0.01 = 2490 values of 𝜏 in the grid
(actually 2491). There are almost 1 million possible values of the pair
(𝜇, 𝜏) in the grid.
2. See below. Even though 𝜇 and 𝜏 are independent according to the prior
distribution, there is a negative posterior correlation. (Below the posterior
is computed via grid approximation. After the posterior distribution was
computed, values were simulated from it for plotting.)
# parameters
mu = seq(96.0, 100.0, 0.01)
tau = seq(0.1, 25, 0.01)
# prior
prior_mu_mean = 98.6
prior_mu_sd = 0.3
prior_precision_shape = 5
prior_precision_rate = 2
# data
y = c(97.9, 97.5)
# likelihood
likelihood = dnorm(97.9, mean = theta$mu, sd = theta$sigma) *
dnorm(97.5, mean = theta$mu, sd = theta$sigma)
# posterior
product = likelihood * prior
posterior = product / sum(product)
# posterior simulation
sim_posterior = theta[sample(1:nrow(theta), 100000, replace = TRUE, prob = posterior), ]
cor(sim_posterior$mu, sim_posterior$tau)
## [1] -0.2888
#plots
ggplot(sim_posterior, aes(mu)) +
geom_histogram(aes(y=..density..), color = "black", fill = "white") +
geom_density(size = 1, color = "seagreen")
ggplot(sim_posterior, aes(tau)) +
geom_histogram(aes(y=..density..), color = "black", fill = "white") +
geom_density(size = 1, color = "seagreen")
[Plots: posterior densities of mu and tau, and simulated (mu, tau) posterior pairs.]
3. See below. We see that the posterior shifts the density towards smaller
values of 𝜇 and 𝜎. There is also a slight positive posterior correlation
between 𝜇 and 𝜎.
cor(sim_posterior$mu, sim_posterior$sigma)
## [1] 0.2804
#plots
ggplot(sim_posterior, aes(mu)) +
geom_histogram(aes(y=..density..), color = "black", fill = "white") +
geom_density(size = 1, color = "seagreen")
ggplot(sim_posterior, aes(sigma)) +
geom_histogram(aes(y=..density..), color = "black", fill = "white") +
geom_density(size = 1, color = "seagreen")
[Plots: posterior densities of mu and sigma, and simulated (mu, sigma) posterior pairs.]
The previous problem illustrates that grid approximation can quickly become
computationally infeasible when there are multiple parameters (to obtain suf-
ficient precision). Naively conditioning a simulation on the observed sample
is also computationally infeasible, since except in the simplest situations the
probability of recreating the observed sample in a simulation is essentially 0.
Example 14.4. The temperature data file contains 208 measurements of hu-
man body temperature (degrees F). The sample mean is 97.71 degrees F and
the sample SD is 0.75 degrees F. Assuming the same prior distribution as in the
previous problem, use JAGS to approximate the joint posterior distribution of
𝜇 and 𝜎. Summarize the posterior distribution in context.
Nrep = 10000
Nchains = 3
# data
data = read.csv("_data/temperature.csv")
y = data$temperature
n = length(y)
# model
model_string <- "model{
# Likelihood
for (i in 1:n){
y[i] ~ dnorm(mu, 1 / sigma ^ 2)
}
# Prior
mu ~ dnorm(98.6, 1 / 0.3 ^ 2)
tau ~ dgamma(5, 2)
sigma <- 1 / sqrt(tau)
}"
##
## Iterations = 2001:12000
## Thinning interval = 1
## Number of chains = 3
## Sample size per chain = 10000
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## mu 97.736 0.0514 0.000297 0.000297
## sigma 0.749 0.0365 0.000211 0.000269
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## mu 97.635 97.701 97.736 97.771 97.836
## sigma 0.682 0.724 0.748 0.773 0.825
plot(posterior_sample)
[Trace plots and posterior densities of mu and sigma.]
# posterior summary
posterior_sim = data.frame(as.matrix(posterior_sample))
head(posterior_sim)
## mu sigma
## 1 97.74 0.7210
## 2 97.68 0.7885
## 3 97.77 0.7331
## 4 97.72 0.7154
## 5 97.71 0.7449
## 6 97.74 0.7251
apply(posterior_sim, 2, mean)
## mu sigma
## 97.7362 0.7493
apply(posterior_sim, 2, sd)
## mu sigma
## 0.05138 0.03647
## 1% 99%
## 97.62 97.86
## 1% 99%
## 0.6717 0.8423
cor(posterior_sim)
## mu sigma
## mu 1.00000 0.04211
## sigma 0.04211 1.00000
ggplot(posterior_sim, aes(mu)) +
geom_histogram(aes(y=..density..), color = "black", fill = "white") +
geom_density(size = 1, color = "seagreen")
[Plot: posterior density of mu.]
ggplot(posterior_sim, aes(sigma)) +
geom_histogram(aes(y=..density..), color = "black", fill = "white") +
geom_density(size = 1, color = "seagreen")
[Plots: posterior density of sigma, and the joint posterior of (mu, sigma).]
Zero correlation does not imply independence in general. However, if 𝑋 and 𝑌 have a Bivariate Normal distribution and their
correlation is 0, then 𝑋 and 𝑌 are independent.
perature? Conduct the simulation and compute and interpret a 95% prediction
interval.
See the code below. JAGS has already returned a simulation from the joint
posterior distribution of (𝜇, 𝜎). For each of these simulated values, simulate a
corresponding 𝑦 value like usual.
theta_sim = as.matrix(posterior_sample)
[Plot: posterior predictive distribution of body temperature (degrees F).]
## 2.5% 97.5%
## 96.27 99.23
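The same posterior predictive idea can be sketched in Python. Note the posterior draws of (𝜇, 𝜎) below are faked as independent Normals centered at the JAGS posterior means and SDs above; this is a rough stand-in for illustration, as real draws would come from the JAGS output:

```python
import random

random.seed(1)
# stand-in posterior draws of (mu, sigma), using the JAGS summary above
draws = [(random.gauss(97.736, 0.051), random.gauss(0.749, 0.036))
         for _ in range(10000)]

# one predictive body temperature per posterior draw
y_pred = sorted(random.gauss(mu, sigma) for mu, sigma in draws)

# central 95% posterior prediction interval
interval = (y_pred[249], y_pred[9749])
print(round(interval[0], 1), round(interval[1], 1))
```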
Chapter 15

Bayesian Analysis of a Numerical Variable
In this chapter we’ll continue our study of a single numerical variable. When the
distribution of the measured variable is symmetric and unimodal, the population
mean is often the main parameter of interest. However, the population SD also
plays an important role.
In the previous section we assumed that the measured numerical variable fol-
lowed a Normal distribution. That is, we assumed a Normal likelihood function.
However, the assumption of Normality is not always justified, even when the
distribution of the measured variable is symmetric and unimodal. In this chap-
ter we’ll investigate an alternative to a Normal likelihood that is more flexible
and robust to outliers and extreme values.
Example 15.1. In a previous assignment, we assumed that birthweights
(grams) of human babies follow a Normal distribution with unknown mean 𝜇
and known SD 𝜎 = 600. (1 pound ≈ 454 grams.) We assumed a Normal(3400,
100) (in grams, or Normal(7.5, 0.22) in pounds) prior distribution for 𝜇; this
prior distribution places most of its probability on mean birthweight being
between 7 and 8 pounds.
Now we’ll assume, more realistically, that 𝜎 is unknown.
1. What does the parameter 𝜎 represent? What is a reasonable prior mean for
𝜎? What range of values of 𝜎 will account for most of the prior probability?
2. Assume a Gamma prior distribution for 𝜎 with mean 600 and SD 200; this
is a Gamma distribution with shape parameter 𝛼 = 600^2/200^2 and rate
parameter 600/200^2. Also assume that 𝜇 and 𝜎 are independent according
to the prior distribution. Explain how you could use simulation to approx-
imate the prior predictive distribution of birthweights. Run the simulation
and summarize the results. Does the choice of prior seem reasonable?
data = read.csv("_data/birthweight.csv")
y = data$birthweight
[Figure: histogram of y (observed birthweights), on the density scale]
summary(y)
sd(y)
## [1] 631.3
n = length(y)
ybar = mean(y)
The full data set is available on the CDC website. We’re only using a random sample to cut down on computation time.
n_rep = 10000
# Simulate (mu, sigma) pairs from the prior, then y from Normal(mu, sigma)
mu_sim = rnorm(n_rep, 3400, 100)
sigma_sim = rgamma(n_rep, shape = 600 ^ 2 / 200 ^ 2, rate = 600 / 200 ^ 2)
y_sim = rnorm(n_rep, mu_sim, sigma_sim)
hist(y_sim,
     breaks = 50,
     freq = FALSE,
     main = "Prior Predictive Distribution",
     xlab = "birthweight (grams)")
[Figure: histogram of the prior predictive distribution of birthweights (grams)]
## 2.5% 97.5%
## 2086 4685
• The JAGS code takes the full sample of size 1000 as an input.
• The data consist of 1000 values, each assumed to be from a Normal(𝜇,
𝜎) distribution.
• Remember that in JAGS, it’s dnorm(mean, precision).
• To find the likelihood of observing the entire sample, JAGS finds the
likelihood of each of the individual values and then multiplies the
values together for us to find the likelihood of the sample.
• This is accomplished in JAGS by specifying the likelihood via a for
loop which evaluates the likelihood y[i] ~ dnorm(mu, 1 / sigma ^ 2)
for each y[i] in the sample.
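As a sketch of the idea (illustrative Python, not the JAGS internals): with independent observations, the sample likelihood is the product of the individual Normal densities, which is usually accumulated on the log scale to avoid numerical underflow. The birthweights and parameter values below are hypothetical.

```python
import math

def normal_pdf(y, mu, sigma):
    # Density of a Normal(mu, sigma) distribution evaluated at y
    return math.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def sample_log_likelihood(ys, mu, sigma):
    # Independence: the sample likelihood is the product of the individual
    # densities, so the log-likelihood is the sum of the individual log-densities
    return sum(math.log(normal_pdf(y, mu, sigma)) for y in ys)

# Hypothetical birthweights (grams); mu and sigma are candidate parameter values
ys = [3300.0, 3450.0, 3180.0]
print(sample_log_likelihood(ys, mu=3400, sigma=600))  # about -22.03
```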
Nrep = 10000
Nchains = 3
# data
# data has already been loaded in previous code
# y is the full sample
# n is the sample size
# model
model_string <- "model{
# Likelihood
for (i in 1:n){
y[i] ~ dnorm(mu, 1 / sigma ^ 2)
}
# Prior
mu ~ dnorm(3400, 1 / 100 ^ 2)
sigma ~ dgamma(600 ^ 2 / 200 ^ 2, 600 / 200 ^ 2) # Gamma with mean 600, SD 200
}"
##
## Initializing model
##
## Iterations = 2001:12000
## Thinning interval = 1
## Number of chains = 3
## Sample size per chain = 10000
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## mu 3318 19.5 0.1125 0.115
## sigma 632 14.1 0.0815 0.105
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## mu 3280 3305 3318 3331 3356
## sigma 605 622 631 641 661
plot(posterior_sample)
[Figure: trace plots and posterior density plots of mu and sigma]
5. Simulate a (𝜇, 𝜎) pair from the posterior distribution; JAGS has already
done this for you. Then, given 𝜇 and 𝜎, simulate a value of 𝑦 from a
Normal(𝜇, 𝜎) distribution. Repeat many times and summarize the sim-
ulated values of 𝑦 to approximate the posterior predictive distribution of
birthweights.
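The mechanics can be sketched in Python (illustrative, not the book's R/JAGS workflow). The posterior means and SDs used below (𝜇: 3318, 19.5; 𝜎: 632, 14.1) are taken from the JAGS summary above; for simplicity this sketch draws (𝜇, 𝜎) from independent Normal approximations instead of using the actual JAGS draws, which is an extra simplification.

```python
import random

rng = random.Random(42)

# Stand-in for the JAGS posterior draws: independent Normal approximations using
# the posterior means/SDs from the summary above (mu: 3318, 19.5; sigma: 632, 14.1).
# In practice you would use the actual simulated (mu, sigma) pairs.
posterior_draws = [(rng.gauss(3318, 19.5), rng.gauss(632, 14.1))
                   for _ in range(10000)]

# Posterior predictive: for each simulated (mu, sigma), simulate one y
y_pred = [rng.gauss(mu, sigma) for mu, sigma in posterior_draws]

mean_pred = sum(y_pred) / len(y_pred)
```

Note that the spread of `y_pred` reflects both the natural variability in birthweights (via 𝜎) and the remaining uncertainty about the parameters.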
theta_sim = data.frame(as.matrix(posterior_sample))
[Figure: histogram of the posterior predictive distribution of birthweights (grams)]
## 0.5% 99.5%
## 1714 4892
6. Find the proportion of values in the observed sample that lie outside of
the prediction interval.
## [1] 0.024
About 2.4 percent of birthweights in the sample fall outside of the 99%
prediction interval, when we would only expect 1%. While not a large dif-
ference in magnitude, we are observing a higher percentage of birthweights
in the tails than we would expect if birthweights followed a Normal dis-
tribution. So we have some evidence that a Normal model — that is, a
Normal likelihood — might not be the best model for birthweights of all
live births as it doesn’t properly account for extreme birthweights.
[Code and figure: histograms of simulated posterior predictive samples compared with the observed histogram of y]
Normal distributions don’t allow much room for extreme values. An alternative
is to assume a distribution with heavier tails. For example, t-distributions have
heavier tails than Normal. For t-distributions, the degrees of freedom parameter
𝑑 ≥ 1 controls how heavy the tails are. When 𝑑 is small, the tails are much
heavier than for a Normal distribution, leading to a higher frequency of extreme
values. As 𝑑 increases, the tails get lighter and a 𝑡-distribution gets closer to a
Normal distribution. For 𝑑 greater than 30 or so, there is very little difference
between a 𝑡-distribution and a Normal distribution except in the extreme tails.
The degrees of freedom parameter 𝑑 is sometimes referred to as the “Normality
parameter”, with larger values of 𝑑 indicating a population distribution that is
closer to Normal.
2 If the observed data has multiple modes or is skewed, then other parameters like median
or mode might be more appropriate measures of center than the population mean.
[Figure: density curves of the Normal(0, 1), t(1), and t(3) distributions]
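To see how much heavier the tails can be, compare the probability of an observation more than 3 scale units from center under a 𝑡 distribution with 𝑑 = 1 (the Cauchy distribution, which has a closed-form CDF) with the Normal probability. This Python sketch uses only the standard library; the cutoff of 3 is an arbitrary illustration.

```python
import math

def normal_two_sided_tail(c):
    # P(|Z| > c) for a standard Normal distribution, via the complementary error function
    return math.erfc(c / math.sqrt(2))

def t1_two_sided_tail(c):
    # P(|X| > c) for a t distribution with d = 1 (the Cauchy distribution),
    # whose CDF has the closed form 1/2 + arctan(x)/pi
    return 1 - (2 / math.pi) * math.atan(c)

print(normal_two_sided_tail(3))  # about 0.0027
print(t1_two_sided_tail(3))      # about 0.205
```

A value more than 3 scale units from center is roughly 75 times more likely under the t(1) distribution than under the Normal.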
Example 15.2. Continuing the birthweight example, we’ll now model the dis-
tribution of birthweights with a 𝑡(𝜇, 𝜎, 𝑑) distribution.
1. How many parameters is the likelihood function based on? What are
they?
2. What does assigning a prior distribution to the Normality parameter 𝑑
represent?
3. The Normality parameter must satisfy 𝑑 ≥ 1, so we want a distribution
for which only values above 1 are possible. One way to accomplish this is
to let 𝑑0 = 𝑑 − 1, assign a Gamma distribution prior to 𝑑0 ≥ 0, and then
let 𝑑 = 1 + 𝑑0. One common approach is to let the shape parameter of the
Gamma distribution be 1 and the rate parameter be 1/29, so
that the prior mean of 𝑑 is 30. Assume the same priors for 𝜇 and 𝜎 as in
the previous example, and a Gamma(1, 1/29) prior for (𝑑 − 1). Use JAGS
to fit the model to the birthweight data and approximate and summarize
the posterior distribution.
4. Consider the posterior distribution for 𝑑. Based on this posterior distri-
bution, is it plausible that birthweights follow a Normal distribution?
5. Consider the posterior distribution for 𝜎. What seems strange about this
distribution? (Hint: consider the sample SD.)
6. The standard deviation of a Normal(𝜇, 𝜎) distribution is 𝜎. However,
the standard deviation of a 𝑡(𝜇, 𝜎, 𝑑) distribution is not 𝜎; rather, it is
𝜎√(𝑑/(𝑑 − 2)) > 𝜎. When 𝑑 is large, √(𝑑/(𝑑 − 2)) ≈ 1, and so the standard deviation
is approximately 𝜎. However, it can make a difference when 𝑑 is small.
Using the JAGS output, create a plot of the posterior distribution of
𝜎√(𝑑/(𝑑 − 2)). Does this posterior distribution of the population standard devi-
ation seem more reasonable in light of the sample data?
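The inflation factor √(𝑑/(𝑑 − 2)) is easy to tabulate; a small illustrative Python sketch:

```python
import math

def t_sd_factor(d):
    # Inflation factor sqrt(d / (d - 2)): the SD of a t(mu, sigma, d)
    # distribution is sigma times this factor (finite only for d > 2)
    return math.sqrt(d / (d - 2))

for d in [3, 5, 10, 30]:
    print(d, round(t_sd_factor(d), 3))  # 1.732, 1.291, 1.118, 1.035
```

So for 𝑑 = 3 the population SD is about 73% larger than 𝜎, while for 𝑑 = 30 the difference is only about 3.5%.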
3. See the JAGS code below and output below. Note that the posterior
distribution for 𝜇 is similar to the posterior distribution for 𝜇 from the
model with the Normal likelihood.
Nrep = 10000
Nchains = 3
# data
# data has already been loaded in previous code
# y is the full sample
# n is the sample size
# model
model_string <- "model{
# Likelihood
for (i in 1:n){
y[i] ~ dt(mu, 1 / sigma ^ 2, tdf)
}
# Prior
mu ~ dnorm(3400, 1 / 100 ^ 2)
sigma ~ dgamma(600 ^ 2 / 200 ^ 2, 600 / 200 ^ 2)
tdf0 ~ dgamma(1, 1 / 29) # prior on d - 1
tdf <- 1 + tdf0
}"
##
## Iterations = 2001:12000
## Thinning interval = 1
## Number of chains = 3
## Sample size per chain = 10000
##
## 1. Empirical mean and standard deviation for each variable,
plot(posterior_sample)
[Figure: trace plots and posterior density plots of mu, sigma, and tdf]
6. See below. This posterior distribution seems more reasonable given the
sample SD.
[Figure: posterior distribution of 𝜎√(𝑑/(𝑑 − 2))]
7. Simulate a triple (𝜇, 𝜎, 𝑑) from the joint posterior distribution; JAGS has
already done this for you. Given (𝜇, 𝜎, 𝑑), simulate a value 𝑦 from a
𝑡(𝜇, 𝜎, 𝑑) distribution. Repeat many times and summarize the simulated
𝑦 values to approximate the posterior predictive distribution.
See the code and output below. The posterior predictive distribution now
spans a more dispersed range of birthweights than with the Normal model.
[Figure: posterior predictive distribution of birthweights (grams) under the 𝑡 likelihood]
## 0.5% 99.5%
## 1104 5605
[Code and figure: histograms of simulated posterior predictive samples from the 𝑡 model compared with the observed sample]
We see that the actual sample resembles simulated samples more closely for
the model based on the t-distribution likelihood than for the one based on the
Normal likelihood. While we do want a model that fits the data well, we also
do not want to risk overfitting the data. In this case, we do not want a few
extreme outliers to unduly influence the model. However, it does appear that a
model that allows for heavier tails than a Normal distribution could be useful
here. Moreover, accommodating the tails improves the fit in the center of the
distribution too.
When the degrees of freedom are very small, 𝑡-distributions can give rise to
super extreme values. We see this in the posterior predictive distribution for
birthweights, where there are some negative birthweights and birthweights over
10000 grams. The model based on the 𝑡-likelihood seems to fit well over
the range of observed values of birthweight. As usual, we should refrain from
extrapolating beyond the range of the observed data.
Chapter 16
Comparing Two Samples
Example 16.1. Do newborns born to mothers who smoke tend to weigh less
at birth than newborns from mothers who don’t smoke? We’ll investigate this
question using birthweight (pounds) data on a sample of births in North Car-
olina over a one year period.
Assume birthweights follow a Normal distribution with mean 𝜇1 for nonsmokers
and mean 𝜇2 for smokers, and standard deviation 𝜎.
Note: our primary goal will be to compare the means 𝜇1 and 𝜇2 . We’re assuming
a common standard deviation 𝜎 to simplify a little, but we could (and probably
should) also let standard deviation vary by smoking status.
278 CHAPTER 16. COMPARING TWO SAMPLES
• 𝜎 is independent of (𝜇1 , 𝜇2 )
• 𝜎 has a Gamma(1, 1) distribution
• (𝜇1 , 𝜇2 ) follow a Bivariate Normal distribution with prior means (7.5,
7.5) pounds, prior standard deviations (0.5, 0.5) pounds, and prior
correlation 0.9.
Simulate values of (𝜇1 , 𝜇2 ) from the prior distribution1 and plot them.
Briefly2 describe the prior distribution.
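Simulating from this Bivariate Normal prior can be sketched with only the Python standard library (illustrative; this is essentially what `mvrnorm` does internally via a matrix square root). The prior means (7.5, 7.5), SDs (0.5, 0.5), and correlation 0.9 are those stated above.

```python
import math
import random

rng = random.Random(0)
rho, sd, mean = 0.9, 0.5, 7.5  # prior correlation, SD, and mean from above

def sim_prior_pair():
    # Build a correlated Bivariate Normal pair from two independent standard Normals
    z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
    mu1 = mean + sd * z1
    mu2 = mean + sd * (rho * z1 + math.sqrt(1 - rho ** 2) * z2)
    return mu1, mu2

pairs = [sim_prior_pair() for _ in range(100000)]
```

The strong prior correlation means the simulated pairs cluster near the line 𝜇1 = 𝜇2, reflecting prior credibility that the two means are close.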
4. The following code loads and summarizes the sample data. Briefly describe
the data.
data = read.csv("_data/baby_smoke.csv")
1 Values from a Multivariate Normal distribution can be simulated using mvrnorm from the
MASS package. For a Bivariate Normal, the inputs are the mean vector [𝐸(𝜇1), 𝐸(𝜇2)] and the
covariance matrix
[ Var(𝜇1)        Cov(𝜇1, 𝜇2) ]
[ Cov(𝜇1, 𝜇2)   Var(𝜇2)     ]
[Figure: density plot of birthweight by smoking habit (nonsmoker vs. smoker)]
data %>%
group_by(habit) %>%
summarize(n(), mean(weight), sd(weight)) %>%
kable(digits = 2)
6. Describe how you would compute the likelihood. For concreteness, how
would you compute the likelihood if there were only 4 babies in the
sample: 2 non-smokers with birthweights of 8 pounds and 7 pounds, and
2 smokers with birthweights of 8.3 pounds and 7.1 pounds?
10. Plot the posterior distribution of 𝜇1 /𝜇2 , describe it, and find and interpret
posterior central 50%, 80%, and 98% credible intervals.
11. Is there some evidence that babies whose mothers smoke tend to weigh
less than those whose mothers don’t smoke?
12. Can we say that smoking is the cause of the difference in mean weights?
13. Is there some evidence that babies whose mothers smoke tend to weigh
much less than those whose mothers don’t smoke? Explain.
14. One quantity of interest is the effect size, which is a way of measuring
the magnitude of the difference between groups. When comparing two
means, a simple measure of effect size (Cohen’s 𝑑) is
(𝜇1 − 𝜇2) / 𝜎
Plot the posterior distribution of this effect size and describe it. Compute
and interpret posterior central 50%, 80%, and 98% credible intervals.
2. Our main focus is on (𝜇1 , 𝜇2 ). We see that the prior places high density
on (𝜇1 , 𝜇2 ) pairs with 𝜇1 close to 𝜇2 .
library(MASS)
# Simulate (mu1, mu2) pairs from the Bivariate Normal prior described above
sim_prior = data.frame(mvrnorm(10000, c(7.5, 7.5),
                               matrix(c(0.25, 0.225, 0.225, 0.25), nrow = 2)))
names(sim_prior) = c("mu1", "mu2")
sim_prior$mu_diff = sim_prior$mu1 - sim_prior$mu2
[Figure: scatterplot and joint density contours of simulated prior (𝜇1, 𝜇2) pairs]
ggplot(sim_prior,
       aes(mu_diff)) +
  geom_histogram(aes(y=..density..), color = "black", fill = "white") +
  geom_density(size = 1, color = "skyblue") +
  labs(x = "Difference in prior means (mu1 - mu2, pounds)",
       title = "Prior Distribution")
[Figure: prior distribution of 𝜇1 − 𝜇2]
## [1] 0.5059
4. The distributions of birthweights are fairly similar for smokers and non-
smokers. The sample mean birthweight for smokers is about 0.3 pounds
less than the sample mean birthweight for non-smokers. The sample SDs of
birthweights are similar for both groups, around 1.4-1.5 pounds.
5. Yes, it is reasonable to assume that the two samples are independent. The
data for smokers was collected separately from the data for non-smokers.
That is, it is reasonable to assume independence in the data.
6. For each observed value of birthweight for non-smokers, evaluate the like-
lihood based on a 𝑁 (𝜇1 , 𝜎) distribution. For example, if birthweight of
a non-smoker is 8 pounds, the likelihood is dnorm(8, mu1, sigma); if
birthweight of a non-smoker is 7 pounds, the likelihood is dnorm(7, mu1,
sigma). The likelihood for the sample of non-smokers would be the product
— assuming independence within the sample — of the likelihoods of the
individual values, as a function of 𝜇1 and 𝜎: dnorm(8, mu1, sigma) *
dnorm(7, mu1, sigma) * ...
The likelihood for the sample of smokers would be the product of the
likelihoods of the individual values, as a function of 𝜇2 and 𝜎: dnorm(8.3,
mu2, sigma) * dnorm(7.1, mu2, sigma) * ...
The likelihood function for the full sample would be the product — assuming
independence between samples — of the likelihoods for the two
samples, a function of 𝜇1, 𝜇2, and 𝜎.
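The four-baby likelihood described above can be sketched numerically (illustrative Python on the log scale, mirroring R's `dnorm`; the parameter values 𝜇1 = 7.5, 𝜇2 = 7.7, 𝜎 = 1 are arbitrary candidates, not estimates from the data):

```python
import math

def normal_logpdf(y, mu, sigma):
    # log density of a Normal(mu, sigma) distribution at y (log of R's dnorm)
    return -0.5 * ((y - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))

def log_likelihood(mu1, mu2, sigma):
    nonsmoker = [8.0, 7.0]  # birthweights (pounds) for the two non-smokers
    smoker = [8.3, 7.1]     # birthweights (pounds) for the two smokers
    # Independence within and between samples: log-likelihoods add
    return (sum(normal_logpdf(y, mu1, sigma) for y in nonsmoker)
            + sum(normal_logpdf(y, mu2, sigma) for y in smoker))

print(log_likelihood(7.5, 7.7, 1.0))  # about -4.29
```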
7. Here is the code; there are some comments about syntax at the end of this
chapter.
# data
y = data$weight
x = (data$habit == "smoker") + 1
n = length(y)
n_groups = 2
# Prior parameters
mu_prior_mean <- c(7.5, 7.5)
mu_prior_sd <- c(0.5, 0.5)
mu_prior_corr <- 0.9
mu_prior_cov <- matrix(c(mu_prior_sd[1] ^ 2,
                         mu_prior_corr * mu_prior_sd[1] * mu_prior_sd[2],
                         mu_prior_corr * mu_prior_sd[1] * mu_prior_sd[2],
                         mu_prior_sd[2] ^ 2),
                       nrow = 2)
# Model
model_string <- "model{
# Likelihood
for (i in 1:n){
y[i] ~ dnorm(mu[x[i]], 1 / sigma ^ 2)
}
# Prior
mu[1:n_groups] ~ dmnorm.vcov(mu_prior_mean[1:n_groups],
mu_prior_cov[1:n_groups, 1:n_groups])
sigma ~ dgamma(1, 1)
}"
# Compile
Nrep = 10000
n.chains = 5
# Simulate
update(model, 1000, progress.bar = "none")
sim_posterior = as.data.frame(as.matrix(posterior_sample))
names(sim_posterior) = c("mu1", "mu2", "sigma")
head(sim_posterior)
[Figure: scatterplot and joint density contours of simulated posterior (𝜇1, 𝜇2) pairs]
cor(sim_posterior$mu1, sim_posterior$mu2)
## [1] 0.127
ggplot(sim_posterior,
       aes(mu_diff)) +
geom_histogram(aes(y=..density..), color = "black", fill = "white") +
geom_density(size = 1, color = "seagreen") +
labs(x = "Difference in population mean birthweight (pounds, non-smokers - smokers)",
title = "Posterior Distribution")
[Figure: posterior distribution of 𝜇1 − 𝜇2]
## [1] 0.9604
10. See code and output below. JAGS has already simulated (𝜇1 , 𝜇2 ) pairs
from the posterior distribution, so we just need to compute 𝜇1 /𝜇2 for each
pair.
ggplot(sim_posterior,
aes(mu_ratio)) +
geom_histogram(aes(y=..density..), color = "black", fill = "white") +
geom_density(size = 1, color = "seagreen") +
labs(x = "Ratio of population mean birthweight (non-smokers / smokers)",
title = "Posterior Distribution")
[Figure: posterior distribution of 𝜇1/𝜇2]
11. Yes, there is some evidence. Even though we started with fairly strong
prior credibility of no difference, with the relatively large sample sizes, the
difference in sample means observed in the data was enough to overturn
the prior beliefs. Now, the 98% credible interval for 𝜇1 − 𝜇2 does contain
0, indicating some plausibility of no difference. But there’s nothing special
about 98% credibility, and we should look at the whole posterior distribution.
According to our posterior distribution, we place a high degree of
plausibility on the mean birthweight for smokers being less than the mean
birthweight of non-smokers.
12. The question of causation has nothing to do with whether we are doing
a Bayesian or frequentist analysis. Rather, the question of causation con-
cerns: how were the data collected? In particular, was this an experiment
with random assignment of the explanatory variable? It wasn’t; it was an
observational study (you can’t randomly assign some mothers to smoke).
Therefore, there is potential for confounding variables. Maybe mothers
who smoke tend to be less healthy in general than mothers who don’t
smoke, and maybe some other aspect of health is more closely associated
with lower birthweight than smoking is.
14. The observed effect size is about 0.3/1.5 = 0.2. Birthweights vary naturally
from baby to baby by about 1.5 pounds, so a difference of 0.3 pounds
seems relatively small. The sample mean birthweight for non-smokers
is 0.2 standard deviations greater than the sample mean birthweight for
smokers.
ggplot(sim_posterior,
aes(effect_size)) +
geom_histogram(aes(y=..density..), color = "black", fill = "white") +
geom_density(size = 1, color = "seagreen") +
labs(x = "Effect size (non-smokers - smokers)",
title = "Posterior Distribution")
[Figure: posterior distribution of the effect size]
The values of any numerical variable vary naturally from unit to unit. The SD
of the numerical variable measures the degree to which individual values of the
variable vary naturally, so the SD provides a natural “scale” for the variable.
Cohen’s 𝑑 compares the magnitude of the difference in means relative to the
natural scale (SD) for the variable.
Some rough guidelines for interpreting |𝑑|, following Cohen’s widely used conventions:
• around 0.2: small effect
• around 0.5: medium effect
• around 0.8: large effect
For example, assume the two population distributions are Normal and the two
population standard deviations are equal. Then when the effect size is 1.0 the
median of the distribution with the higher mean is the 84th percentile of the
distribution with the lower mean, which is a very large difference.
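The percentile claim above follows from the standard Normal CDF: with equal-SD Normal populations, an effect size of 𝑑 places the median of the higher-mean distribution at the Φ(𝑑) quantile of the lower-mean distribution. An illustrative Python sketch:

```python
import math

def normal_cdf(x):
    # Standard Normal CDF, via the error function
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

# For effect size d, the median of the higher-mean distribution sits at
# the Phi(d) quantile of the lower-mean distribution
for d in [0.2, 0.5, 0.8, 1.0]:
    print(d, round(normal_cdf(d), 3))  # 0.579, 0.691, 0.788, 0.841
```

For 𝑑 = 1.0 this gives Φ(1) ≈ 0.84, the 84th percentile quoted above.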
• You should be able to define the prior parameters for the Multivariate
Normal distribution within JAGS, but I keep getting an error. So I’m
defining the prior parameters outside of JAGS and then passing them in with
the data. (I can never remember what you can do in JAGS and what you
can’t.)
Non-Normal Likelihood
The sample data exhibits a long left tail, similar to what we observed in the pre-
vious chapter. Therefore, a non-Normal model might be more appropriate for
the distribution of birthweights. The code and output below uses a 𝑡-distribution
for the likelihood, similar to what was done in the previous section. The poste-
rior distribution of (𝜇1 , 𝜇2 ) is fairly similar to what was computed above for the
model with the Normal likelihood. The model below based on the 𝑡-distribution
likelihood shifts posterior credibility a little more towards mean birthweights for
non-smokers being greater than mean birthweights for smokers. But the difference is
still small in absolute terms; at most 0.5 pounds or so. In terms of comparing
population means, the choice of likelihood (Normal versus 𝑡) does not make
much of a difference in this example. That is, the inference regarding 𝜇1 − 𝜇2
appears not to be too sensitive to the choice of likelihood. However, if we were
using the model to predict birthweights, then the 𝑡-distribution based model
would make more of a difference.
# Model
model_string <- "model{
# Likelihood
for (i in 1:n){
y[i] ~ dt(mu[x[i]], 1 / sigma ^ 2, tdf)
}
# Prior
mu[1:n_groups] ~ dmnorm.vcov(mu_prior_mean[1:n_groups],
mu_prior_cov[1:n_groups, 1:n_groups])
sigma ~ dgamma(1, 1)
tdf0 ~ dgamma(1, 1 / 29) # prior on d - 1, as in the previous chapter
tdf <- 1 + tdf0
}"
# Compile
Nrep = 10000
n.chains = 5
# Simulate
update(model, 1000, progress.bar = "none")
sim_posterior = as.data.frame(as.matrix(posterior_sample))
names(sim_posterior) = c("mu1", "mu2", "sigma", "tdf")
head(sim_posterior)
[Figure: scatterplot and joint density contours of simulated posterior (𝜇1, 𝜇2) pairs under the 𝑡 likelihood]
cor(sim_posterior$mu1, sim_posterior$mu2)
## [1] 0.06929
ggplot(sim_posterior,
aes(mu_diff)) +
geom_histogram(aes(y=..density..), color = "black", fill = "white") +
geom_density(size = 1, color = "seagreen") +
labs(x = "Difference in population mean birthweight (pounds, non-smokers - smokers)",
title = "Posterior Distribution")
[Figure: posterior distribution of 𝜇1 − 𝜇2 under the 𝑡 likelihood]
## [1] 0.992
Chapter 17
Introduction to Markov
Chain Monte Carlo
(MCMC) Simulation
1 Monte Carlo methods consist of a broad class of algorithms for obtaining numerical results
based on random numbers, even in problems that don’t explicitly involve probability (e.g.,
Monte Carlo integration).
2 This island hopping example is inspired by Kruschke, Doing Bayesian Data Analysis.
Suppose that every day, the politician makes her travel plans according to the
following algorithm.
• She flips a fair coin to propose travel to the island to the east (heads) or
west (tails). (If there is no island to the east/west, treat it as an island
with population zero in the steps below.)
• If the proposed island has a population greater than that of the current
island, then she travels to the proposed island.
• If the proposed island has a population less than that of the current island,
then:
– She computes 𝑎, the ratio of the population of the proposed island
to the current island.
– She travels to the proposed island with probability 𝑎,
– And with probability 1 − 𝑎 she spends another day on the current
island.
1. Suppose there are 5 islands, labeled 1, … , 5 from west to east, and that
island 𝜃 has population 𝜃 (thousand), and that she starts at island 3 on
day 1. How could you use a coin and a spinner to simulate by hand the
politician’s movements over a number of days? Conduct the simulation
and plot the path of her movements.
[Figure: trace plot of the politician's island visits (islands 1-5) over 30 days]
𝑎(1 → 0) = 0 𝑎(1 → 2) = 1
𝑎(2 → 1) = 1/2 𝑎(2 → 3) = 1
𝑎(3 → 2) = 2/3 𝑎(3 → 4) = 1
𝑎(4 → 3) = 3/4 𝑎(4 → 5) = 1
𝑎(5 → 4) = 4/5 𝑎(5 → 6) = 0
[Figure: simulated path of island visits over 30 days]
2. The plot below corresponds to the path from the previous plot.
[Figure: proportion of days spent on each island]
3. Some code is below. There are different ways to implement this algorithm,
but note the proposal and acceptance steps below.
n_states = 5
theta = 1:n_states
pi_theta = theta
n_steps = 10000
theta_sim = rep(NA, n_steps)
theta_sim[1] = 3 # initialize
for (i in 2:n_steps){
current = theta_sim[i - 1]
proposed = sample(c(current + 1, current - 1), size = 1, prob = c(0.5, 0.5))
if (!(proposed %in% theta)){ # to correct for proposing moves outside of boundaries
proposed = current
}
a = min(1, pi_theta[proposed] / pi_theta[current])
theta_sim[i] = sample(c(proposed, current), size = 1, prob = c(a, 1-a))
}
# trace plot
plot(1:n_steps, theta_sim, type = "l", ylim = range(theta), xlab = "Day", ylab = "Island")
[Figure: trace plot of simulated island visits over 10000 days]
4. The plot below corresponds to the path from the previous plot.
[Figure: proportion of days spent on each island over the 10000-day simulation]
6. She can still make east-west proposals with a coin flip. And she can still
decide whether to accept the proposal based on relative population.
If she is currently on island 𝜃current and she proposes a move to island
𝜃proposed, she will accept the proposal with probability
𝑎(𝜃current → 𝜃proposed) = min(1, (𝜃proposed^2 𝑒^(−0.5 𝜃proposed)) / (𝜃current^2 𝑒^(−0.5 𝜃current)))
n_states = 30
theta = 1:n_states
pi_theta = theta ^ 2 * exp(-0.5 * theta) # notice: not probabilities
n_steps = 10000
theta_sim = rep(NA, n_steps)
theta_sim[1] = 1 # initialize
for (i in 2:n_steps){
current = theta_sim[i - 1]
proposed = sample(c(current + 1, current - 1), size = 1, prob = c(0.5, 0.5))
if (!(proposed %in% theta)){ # to correct for proposing moves outside of boundaries
proposed = current
}
a = min(1, pi_theta[proposed] / pi_theta[current])
theta_sim[i] = sample(c(proposed, current), size = 1, prob = c(a, 1-a))
}
# trace plot
plot(1:n_steps, theta_sim, type = "l", ylim = range(theta_sim), xlab = "Day", ylab = "Island")
[Figure: trace plot of island visits for the 30-island simulation]
[Figure: proportion of days on each island, approximating the target distribution 𝜋(𝜃) ∝ 𝜃^2 𝑒^(−0.5𝜃)]
But if she always moved towards islands with larger populations, she would
not visit the smaller ones at all.
9. Yes, the next island visited is dependent on the current island. For example,
if she is on island 3 today, tomorrow she can only be on island 2, 3, or 4.
10. No, the next island visited is not dependent on how she got to the current
island. The proposals and acceptance probability only depend on the
current state, and not past states (given the current state).
11. With only east-west proposals, since she would never visit an island with
population 0 (because the acceptance probability would be 0), she could
never get to islands on the other side. She could modify her algorithm
to cast a wider net in her proposals, instead of just proposing moves to
adjacent islands.
The goal of a Markov chain Monte Carlo method is to simulate from a proba-
bility distribution of interest. In Bayesian contexts, the distribution of interest
will usually be the posterior distribution of parameters given data.
A Markov chain is a random process that exhibits a special “one-step” depen-
dence structure. Namely, conditional on the most recent value, any future value
is conditionally independent of any past values. In a Markov chain: “Given the
present, the future is conditionally independent of the past.” Roughly, in terms
of simulating the next value of a Markov chain, all that matters is where you
are now, not how you got there.
The idea of MCMC is to build a Markov chain whose long run distribution
— that is, the distribution of state visits after a large number of “steps” —
is the probability distribution of interest. Then we can indirectly simulate a
representative sample from the probability distribution of interest, and use the
simulated values to approximate the distribution and its characteristics, by run-
ning an appropriate Markov chain for a sufficiently large number of steps. The
Markov chain does not need to be fully specified in advance, and is often constructed
“as you go” via an algorithm like a “modified random walk”. Each step
of the Markov chain typically involves proposing a new state and then accepting
or rejecting the proposal.
In principle, proposals can be fairly naive and not related to the target distribution
(though in practice the choice of proposal is very important since it affects
the efficiency of the algorithm).
3. There are different approaches, but here’s a common one. We want to pro-
pose a new state in the neighborhood of the current state. Given 𝜃current ,
propose 𝜃proposed from a 𝑁 (𝜃current , 𝛿) distribution where the standard
deviation 𝛿 represents the size of the “neighborhood”. For example, if
𝜃current = 0.5 and 𝛿 = 0.05 then we would draw the proposal from the
𝑁 (0.5, 0.05) distribution, so there’s a 68% chance the proposal is between
0.45 and 0.55 and a 95% chance that it’s between 0.40 and 0.60.
5. The posterior density is larger for the proposed state, so the proposed
move is accepted with probability 1.
𝑎(0.20 → 0.15) = min(1, 𝜋(0.15|𝑦=4) / 𝜋(0.20|𝑦=4))
= min(1, [0.15^4 (1−0.15)^(25−4) × 0.15^(1−1) (1−0.15)^(3−1)] / [0.20^4 (1−0.20)^(25−4) × 0.20^(1−1) (1−0.20)^(3−1)]) = 1
## [1] 1.276
6. The posterior density is smaller for the proposed state, so based on the
ratio of the posterior densities, the proposed move is accepted with prob-
ability 0.553.
𝑎(0.20 → 0.25) = min(1, 𝜋(0.25|𝑦=4) / 𝜋(0.20|𝑦=4))
= min(1, [0.25^4 (1−0.25)^(25−4) × 0.25^(1−1) (1−0.25)^(3−1)] / [0.20^4 (1−0.20)^(25−4) × 0.20^(1−1) (1−0.20)^(3−1)]) = 0.553
## [1] 0.5533
7. See below. The Normal distribution proposal can propose values outside
of (0, 1), so we set 𝜋(𝜃|𝑦 = 4) equal to 0 for 𝜃 ∉ (0, 1). This way, proposals
to states outside (0, 1) will never be accepted.
n_steps = 10000
delta = 0.05
# Target is proportional to likelihood times prior: Binomial(25, theta) likelihood
# with y = 4, and a Beta(1, 3) prior; 0 outside (0, 1) so such proposals are never accepted
pi_theta <- function(theta) {
  ifelse(theta > 0 & theta < 1, dbinom(4, 25, theta) * dbeta(theta, 1, 3), 0)
}
theta = rep(NA, n_steps)
theta[1] = 0.5 # initial value (arbitrary)
for (n in 2:n_steps){
current = theta[n - 1]
proposed = current + rnorm(1, mean = 0, sd = delta)
accept = min(1, pi_theta(proposed) / pi_theta(current))
theta[n] = sample(c(current, proposed), 1, prob = c(1 - accept, accept))
}
[Figures: the posterior distribution 𝜋(𝜃|𝑦 = 4), a trace plot of the first 100 steps, and a trace plot of all steps]
1. Given the current value 𝜃current, propose a new value 𝜃proposed according
to the proposal (or “jumping”) distribution 𝑗. Here
𝑗(𝜃current → 𝜃proposed)
is the conditional density that 𝜃proposed is proposed as the next state given
that 𝜃current is the current state.
2. Compute the acceptance probability based on the ratio of the target density
at the proposed and current states:
𝑎(𝜃current → 𝜃proposed) = min(1, 𝜋(𝜃proposed) / 𝜋(𝜃current))
3. Accept the proposal with probability 𝑎(𝜃current → 𝜃proposed ) and set 𝜃new =
𝜃proposed . With probability 1 − 𝑎(𝜃current → 𝜃proposed ) reject the proposal
and set 𝜃new = 𝜃current .
• If 𝜋(𝜃proposed ) ≥ 𝜋(𝜃current ) then the proposal will be accepted with
probability 1.
• Otherwise, there is a positive probability of rejecting the proposal
and remaining in the current state. But this still counts as a “step”
of the Markov chain.
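The steps above can be sketched as a generic random-walk Metropolis sampler (an illustrative Python sketch; the toy target below is an unnormalized standard Normal density, chosen only so the result is easy to check):

```python
import math
import random

def metropolis(target, initial, n_steps, delta, seed=0):
    # Random-walk Metropolis: `target` need only be proportional to the density of interest
    rng = random.Random(seed)
    theta = initial
    samples = []
    for _ in range(n_steps):
        proposed = theta + rng.gauss(0, delta)          # symmetric Normal proposal
        a = min(1.0, target(proposed) / target(theta))  # acceptance probability
        if rng.random() < a:
            theta = proposed        # accept the proposal
        samples.append(theta)       # a rejection still counts as a step
    return samples

# Toy target: proportional to a standard Normal density (deliberately unnormalized)
draws = metropolis(lambda t: math.exp(-0.5 * t * t), initial=0.0,
                   n_steps=20000, delta=1.0)
mean = sum(draws) / len(draws)
```

Because only the ratio `target(proposed) / target(theta)` is used, any normalizing constant cancels, which is exactly why the algorithm works with an unnormalized posterior.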
The Metropolis algorithm only uses the target distribution 𝜋 through ratios of
the form 𝜋(𝜃proposed)/𝜋(𝜃current). Therefore, 𝜋 only needs to be specified up to a constant
of proportionality, since even if the normalizing constant were known it would
3 The algorithm is named after Nicholas Metropolis, a physicist who led the research group
which first proposed the method in the early 1950s, consisting of Arianna Rosenbluth, Marshall
Rosenbluth, Augusta Teller, and Edward Teller. It is disputed whether Metropolis himself
had anything to do with the actual invention of the algorithm.
cancel out anyway. This is especially useful in Bayesian contexts where the
target posterior distribution is only specified up to a constant of proportionality
via
posterior ∝ likelihood × prior
We will most often use MCMC methods to simulate values from a posterior
distribution 𝜋(𝜃|𝑦) of parameters 𝜃 given data 𝑦. The Metropolis (or Metropolis-
Hastings algorithm) allows us to simulate from a posterior distribution without
computing the posterior distribution. Recall that the inputs of a Bayesian model
are (1) the data 𝑦, (2) the likelihood 𝑓(𝑦|𝜃), and (3) the prior distribution 𝜋(𝜃).
The target posterior distribution satisfies
𝜋(𝜃|𝑦) ∝ 𝑓(𝑦|𝜃)𝜋(𝜃)
To reiterate
In his two most recent games, Steph Curry made 4 out of 10 and 6 out of 11
attempts.
𝑓(((10, 4), (11, 6)) | 𝜇, 𝑝) ∝ (𝑒^(−𝜇) 𝜇^10 𝑝^4 (1−𝑝)^(10−4)) (𝑒^(−𝜇) 𝜇^11 𝑝^6 (1−𝑝)^(11−6)) = 𝑒^(−2𝜇) 𝜇^21 𝑝^10 (1−𝑝)^(21−10)
Notice that the likelihood can be evaluated based on (1) the total number
of games, 2, (2) the total number of attempts, 21, and (3) the total number
of successful attempts, 10.
2. Posterior is proportional to prior times likelihood. The priors for 𝜇 and 𝑝
are independent.
3. Each (𝜇, 𝑝) pair with 𝜇 > 0 and 0 < 𝑝 < 1 is a possible state.
4. Given 𝜃current = (𝜇current , 𝑝current ) we can propose a state using a Bivariate
Normal distribution centered at the current state. The proposed values of
𝜇 and 𝑝 could be chosen independently, but they could also reflect some
dependence.
5. Suppose the current state is 𝜃current = (8, 0.5) and the proposed state is
𝜃proposed = (7.5, 0.55). Compute the probability of accepting this proposal.
## [1] 0.8334
n_steps = 11000
delta = c(0.4, 0.05) # proposal SDs for mu, p
# Unnormalized posterior: the likelihood above times independent Gamma(10, 2)
# and Beta(4, 6) priors (the priors used in the JAGS model below)
pi_theta <- function(mu, p) {
  if (mu <= 0 || p <= 0 || p >= 1) return(0)
  exp(-2 * mu) * mu ^ 21 * p ^ 10 * (1 - p) ^ (21 - 10) * dgamma(mu, 10, 2) * dbeta(p, 4, 6)
}
theta = data.frame(mu = rep(NA, n_steps), p = rep(NA, n_steps))
theta[1, ] = c(8, 0.5) # initial state (arbitrary)
for (n in 2:n_steps){
current = theta[n - 1, ]
proposed = current + rnorm(2, mean = 0, sd = delta)
accept = min(1, pi_theta(proposed$mu, proposed$p) / pi_theta(current$mu, current$p))
accept_ind = sample(0:1, 1, prob = c(1 - accept, accept))
theta[n, ] = proposed * accept_ind + current * (1 - accept_ind)
}
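For readers who prefer to see the same algorithm outside of R, here is a hypothetical Python translation of the from-scratch update above (the name `log_pi` is ours). It uses the Gamma(10, 2) and Beta(4, 6) priors from the JAGS code in part 7, and works on the log scale to avoid underflow in the density ratio.

```python
import numpy as np

def log_pi(mu, p):
    """Unnormalized log posterior for (mu, p): Gamma(10, 2) and Beta(4, 6)
    priors times the likelihood e^(-2 mu) mu^21 p^10 (1 - p)^11."""
    if mu <= 0 or p <= 0 or p >= 1:
        return -np.inf
    log_prior = 9 * np.log(mu) - 2 * mu + 3 * np.log(p) + 5 * np.log(1 - p)
    log_lik = -2 * mu + 21 * np.log(mu) + 10 * np.log(p) + 11 * np.log(1 - p)
    return log_prior + log_lik

rng = np.random.default_rng(0)
delta = np.array([0.4, 0.05])   # proposal SDs for mu, p
theta = np.empty((11000, 2))
theta[0] = [8, 0.5]             # initial state
for n in range(1, len(theta)):
    proposed = theta[n - 1] + rng.normal(0, delta)
    if np.log(rng.uniform()) < log_pi(*proposed) - log_pi(*theta[n - 1]):
        theta[n] = proposed
    else:
        theta[n] = theta[n - 1]
theta = theta[1000:]            # discard burn in

# Conjugacy check: mu's posterior is Gamma(31, 4) (mean 7.75),
# and p's posterior is Beta(14, 17) (mean 14/31, about 0.45)
print(theta.mean(axis=0))
```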
ggplot(theta %>%
         head(100) %>%
         mutate(label = 1:100),
       aes(mu, p)) +
  geom_path() +
  geom_point(size = 2) +
  geom_text(aes(label = label, x = mu + 0.1, y = p + 0.01)) +
  labs(title = "Trace plot of first 100 steps")
[Figure: trace plot of the first 100 steps of the chain, plotting p versus mu, with each step labeled 1 through 100.]
# Delete the first 1000 steps - we'll see why in the next chapter
theta = theta[-(1:1000), ]
[Figure: density plots of the simulated values of mu and of p, and a contour plot of the joint posterior density of (mu, p).]
7. The JAGS code is below. The results are similar, but not quite the same as our code from scratch in the previous part. JAGS has a lot of built-in features that improve efficiency. In particular, JAGS makes smarter proposals and does not reject as many proposals as our from-scratch algorithm. The scatterplots of simulated (𝜇, 𝑝) pairs illustrate this; the plot based on the from-scratch algorithm is “thinner” than the one based on JAGS because the from-scratch algorithm rejects proposals and sits in place more often.
# data
n = c(10, 11)
y = c(4, 6)
n_sample = 2

# Model
model_string <- "model{

  # Likelihood
  for (i in 1:n_sample){
    n[i] ~ dpois(mu)
    y[i] ~ dbinom(p, n[i])
  }

  # Prior
  mu ~ dgamma(10, 2)
  p ~ dbeta(4, 6)

}"
# Compile
Nrep = 10000
n.chains = 5

model <- jags.model(textConnection(model_string),
                    data = list(n = n, y = y, n_sample = n_sample),
                    n.chains = n.chains)

# Simulate
update(model, 1000, progress.bar = "none")

posterior_sample <- coda.samples(model,
                                 variable.names = c("mu", "p"),
                                 n.iter = Nrep,
                                 progress.bar = "none")

sim_posterior = as.data.frame(as.matrix(posterior_sample))
head(sim_posterior)
## mu p
## 1 9.034 0.3177
## 2 8.791 0.5250
## 3 8.290 0.3150
## 4 7.839 0.3179
## 5 8.326 0.5670
## 6 8.983 0.5996
[Figure: density plots of the JAGS-simulated values of mu and of p, a scatterplot of simulated (mu, p) pairs, and a contour plot of the joint posterior density.]
Chapter 18

Some Diagnostics for MCMC Simulation

When running an MCMC simulation, there are several questions we should ask:

• Does the algorithm produce samples that are representative of the target distribution of interest?
• Are estimates of characteristics of the distribution (e.g., posterior mean, posterior standard deviation, central 98% credible region) based on the simulated Markov chain accurate and stable?
• Is the algorithm efficient, in terms of time or computing power required to run?
The following plots display the values of the first 200 steps and their density,
and the values of 10,000 steps and their density, for 5 different runs of the
Metropolis chain each starting from a different initial value: 0.1, 0.3, 0.5, 0.7,
0.9. The value of 𝛿 is 0.005. (We’re setting this value to be small to illustrate
a point.) What do you notice in the plots? How does the initial value influence
the results?
[Figure: trace plots of the first 200 steps and of the first 10,000 steps for the 5 runs of the Metropolis chain, and the corresponding density plots of the simulated theta values for each run.]
Solution.
With such a small 𝛿 value the chain tends to take a long time to move away
from its current value. For example, the chain that starts at a value of 0.9 tends
to stay near 0.9 for the first few hundred steps. Values near 0.9 are rare in a
Beta(5, 24) distribution, so this chain generates a lot of unrepresentative values
before it warms up to the target distribution. After a thousand or so iterations
all the chains start to overlap and become indistinguishable regardless of the
initial condition. However, the density plots for each of the chains illustrate
that the initial steps of the chain still carry some influence.
The goal of an MCMC simulation is to simulate a representative sample of
values from the target distribution. While an MCMC algorithm should converge
eventually to the target distribution, it might take some time to get there. In
particular, it might take a while for the influence of the initial state to diminish.
Burn in refers to the process of discarding the first several hundred or thousand
steps of the chain to allow for a “warm up” period. Only values simulated after
the burn in period are used to approximate the target distribution.
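The effect of burn in is easy to see in a quick experiment: start a random-walk Metropolis chain targeting the Beta(5, 24) distribution far out in the tail, with a deliberately small step size. (A hypothetical Python sketch of our own, not the book's R code.)

```python
import numpy as np

def log_target(theta):
    """Unnormalized Beta(5, 24) log density, the chapter's target."""
    if theta <= 0 or theta >= 1:
        return -np.inf
    return 4 * np.log(theta) + 23 * np.log(1 - theta)

rng = np.random.default_rng(1)
delta, n_steps = 0.005, 5000   # deliberately tiny step size
theta = np.empty(n_steps)
theta[0] = 0.9                 # start far out in the tail
for i in range(1, n_steps):
    prop = theta[i - 1] + rng.normal(0, delta)
    if np.log(rng.uniform()) < log_target(prop) - log_target(theta[i - 1]):
        theta[i] = prop
    else:
        theta[i] = theta[i - 1]

# Early steps are dominated by the unrepresentative starting value,
# so their average is far above the posterior mean of 5/29
print(theta[:200].mean())
print(theta[-1000:].mean())
```

Discarding the early steps removes most of the influence of the starting value from the resulting estimates.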
The update step in rjags runs the MCMC simulation for a burn in period,
consisting of n.iter steps. (The n.iter in update is not the same as the n.iter
in coda.samples.) The update function merely “warms up” the simulation, and
the values sampled during the update phase are not recorded.
The JAGS code below simulates 5 different chains, from 5 different initial con-
ditions, each with a burn in period of 1000 steps, after which 10,000 steps of
each chain are simulated. The output consists of 50,000 simulated values of 𝜃.
# Data
n = 25
y = 4
# Model
model_string <- "model{
# Likelihood
y ~ dbinom(theta, n)
# Prior
theta ~ dbeta(1, 3)
}"
data_list = list(y = y, n = n)
# Compile
model <- jags.model(textConnection(model_string),
data = data_list,
n.chains = 5)
# Simulate
update(model, n.iter = 1000, progress.bar = "none")
Nrep = 10000

posterior_sample <- coda.samples(model,
                                 variable.names = c("theta"),
                                 n.iter = Nrep,
                                 progress.bar = "none")

summary(posterior_sample)
##
## Iterations = 2001:12000
## Thinning interval = 1
## Number of chains = 5
## Sample size per chain = 10000
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## 0.172895 0.069324 0.000310 0.000435
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## 0.0609 0.1223 0.1652 0.2153 0.3286
plotPost(posterior_sample)
[Figure: plotPost output showing the posterior density of theta, with mode 0.153 and 95% HDI from 0.0489 to 0.308.]
nrow(as.matrix(posterior_sample))
## [1] 50000
Recall that there are several packages available for summarizing MCMC output,
and these packages contain various diagnostics. For example, the output of the
diagMCMC function in the DBDA2E-utilities.R file includes a plot of the shrink factor.
diagMCMC(posterior_sample)
Example 18.2. Continuing with Metropolis sampling from a Beta(5, 24) dis-
tribution, the following plots display the results of three different runs, one each
for 𝛿 = 0.01, 𝛿 = 0.1, 𝛿 = 1, all with an initial value of 0.5. Describe the
differences in the behavior of the chains. Which chain seems “best”? Why?
[Figure: trace plots of the first 200 steps and of the first 10,000 steps of the chain for delta = 0.01, delta = 0.1, and delta = 1.]
Solution.
When 𝛿 = 0.01 only values that are close to the current value are proposed. A proposed value close to the current value will have a density that is close to, if not greater than, that of the current value. Therefore, most of the proposals will be accepted, but these proposals don't really go anywhere. With 𝛿 = 0.01 the chain moves often, but it does not move far.

When 𝛿 = 1 a wide range of values will be proposed, including values outside of (0, 1). Many proposed values will have density that is much less than that of the current value, if not 0. Therefore many proposals will be rejected. With 𝛿 = 1 the chain tends to get stuck at a value for a large number of steps before moving (though when it does move, it can move far).

Both of the above cases tend to get stuck in place and require a large number of steps to explore the target distribution. The case 𝛿 = 0.1 is more efficient. The proposals are neither so narrow that it takes a long time to move nor so wide that many proposals are rejected. The fast up-and-down pattern of the trace plot shows that the chain with 𝛿 = 0.1 explores the target distribution much more efficiently than the other two cases.
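One simple numerical summary of this behavior is the acceptance rate, the fraction of proposals the chain accepts. The sketch below (ours, in Python rather than the book's R; the name `acceptance_rate` is our own) estimates the acceptance rate for each 𝛿 when targeting the Beta(5, 24) distribution; very high and very low acceptance rates both signal inefficiency.

```python
import numpy as np

def log_target(theta):
    """Unnormalized Beta(5, 24) log density."""
    if theta <= 0 or theta >= 1:
        return -np.inf
    return 4 * np.log(theta) + 23 * np.log(1 - theta)

def acceptance_rate(delta, n_steps=10000, theta0=0.17, seed=0):
    """Fraction of Metropolis proposals accepted for a given step size."""
    rng = np.random.default_rng(seed)
    current, accepted = theta0, 0
    for _ in range(n_steps):
        prop = current + rng.normal(0, delta)
        if np.log(rng.uniform()) < log_target(prop) - log_target(current):
            current, accepted = prop, accepted + 1
    return accepted / n_steps

# Tiny steps are almost always accepted; huge steps are mostly rejected
for delta in [0.01, 0.1, 1]:
    print(delta, acceptance_rate(delta))
```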
The values of a Markov chain at different steps are dependent. If the degree
of dependence is too high, the chain will tend to get “stuck”, requiring a large
number of steps to fully explore the target distribution of interest. Not only
will the algorithm be inefficient, but it can also produce inaccurate and unstable
estimates of characteristics of the target distribution.
If the MCMC algorithm is working, trace plots should look like a “fat, hairy
caterpillar.”1 Plots of the autocorrelation function (ACF) can also help determine
how “clumpy” the chain is. An autocorrelation measures the correlation between
values at different lags. For example, the lag 1 autocorrelation measures the
correlation between the values and the values from the next step; the lag 2
autocorrelation measures the correlation between the values and the values from
2 steps later.
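A lag autocorrelation is just an ordinary correlation computed between the series and a shifted copy of itself. A small Python sketch (ours, not the book's code; the name `lag_autocorr` is our own), checked on an AR(1) series whose lag-1 autocorrelation is known to be 0.8:

```python
import numpy as np

def lag_autocorr(x, lag):
    """Sample autocorrelation between x_t and x_{t+lag}."""
    x = np.asarray(x, dtype=float)
    return np.corrcoef(x[:-lag], x[lag:])[0, 1]

# Check on an AR(1) series: lag-1 autocorrelation 0.8, lag-k is 0.8^k
rng = np.random.default_rng(0)
x = np.empty(50000)
x[0] = 0.0
for t in range(1, len(x)):
    x[t] = 0.8 * x[t - 1] + rng.normal()

print(lag_autocorr(x, 1))   # close to 0.8
print(lag_autocorr(x, 5))   # close to 0.8**5
```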
Example 18.3. Continuing with Metropolis sampling from a Beta(5, 24) dis-
tribution, the following plots display, for the case 𝛿 = 0.1, the actual values
of the chain (after burn in) and the values lagged by 1, 5, and 10 time steps.
Are the values at different steps dependent? In what way are they not too
dependent?
[Figure: scatterplots of the values of the chain against the values lagged by 1, 5, and 10 steps.]
Solution.
1 I’ve seen this description in many references, but I don’t know who first used this terminology.
Yes, the values are dependent. In particular, the lag 1 autocorrelation is about
0.8, and the lag 5 autocorrelation is about 0.4. However, the autocorrelation
decays rather quickly as a function of lag. The lag 10 autocorrelation is already
close to 0. In this way, the chain is “not too dependent”; each value is only
correlated with the values in the next few steps.
An autocorrelation plot displays the autocorrelation within a chain as a function of lag. If the ACF takes too long to decay to 0, the chain exhibits a high
degree of dependence and will tend to get stuck in place.
The plot below displays the ACFs corresponding to each of the 𝛿 values in
Example 18.2. Notice that with 𝛿 = 0.1 the ACF decays fairly quickly to 0,
while in the other cases there is still fairly high autocorrelation even after long
lags.
[Figure: autocorrelation plots (ACF versus lag, up to lag 40) for the chains with delta = 0.01, delta = 0.1, and delta = 1.]
Example 18.4. Continuing with Metropolis sampling from a Beta(5, 24) posterior distribution, we know that the posterior mean is 5/29 ≈ 0.172. But what if we want to approximate this via simulation?
1. Suppose you simulated 10000 independent values from a Beta(5, 24) dis-
tribution, e.g. using rbeta. How would you use the simulated values to
estimate the posterior mean?
2. What is the standard error of your estimate from the previous part? What
does the standard error measure? How could you use simulation to ap-
proximate the standard error?
3. Now suppose you simulated 10000 values from a Metropolis chain (after burn in).
How would you use the simulated values to estimate the posterior mean?
What does the standard error measure in this case? Could you use the
formula from the previous part to compute the standard error? Why?
4. Consider the three chains in Example 18.2 corresponding to the three 𝛿
values 0.01, 0.1, and 1. Which chain provides the most reliable estimate
of the posterior mean? Which chain yields the smallest standard error of
this estimate?
Solution.
1. Simulate 10000 values and compute the sample mean of the simulated
values.
2. For the Beta(5, 24) distribution, the population SD is √((5/29)(1 − 5/29)/(29 + 1)) ≈ 0.07. The standard error of the sample mean of 10000 values is 0.07/√10000 = 0.0007. The standard error measures the sample-to-sample variability of sample means over many samples of size 10000. To approximate the standard error via simulation: sample 10000 values from a Beta(5, 24) distribution and compute the sample mean, then repeat many times and find the standard deviation of the simulated sample means.
3. You would still use the sample mean of the 10000 values to approximate
the posterior mean. The standard error measures how much the sample
mean varies from run-to-run of the Markov chain. To approximate the
standard error via simulation: simulate 10000 steps of the Metropolis
chain and compute the sample mean, then repeat many times and find
the standard deviation of the simulated sample means. The standard
error formula from the previous part assumes that the 10000 values are
independent, but the values on the Markov chain are not, so we can’t use
the same formula.
4. Among these three, the chain with 𝛿 = 0.1 provides the most reliable
estimate of the posterior mean since it does the best job of sampling from
the posterior distribution. While there is dependence in all three chains,
the chain with 𝛿 = 0.1 has the least dependence and so comes closest to
independent sampling, so it would have the smallest standard error.
A Markov chain that exhibits a high degree of dependence will tend to get
stuck in place. Even if you simulate 10000 steps of the chain, you don’t really
get 10000 “new” values. The effective sample size (ESS) is a measure of how
much independent information there is in an autocorrelated chain. Roughly, the
effective sample size answers the question: what is the equivalent sample size of
a completely independent chain?
The effective sample size2 of a chain with 𝑁 steps (after burn in) is

ESS = 𝑁 / (1 + 2 ∑ℓ≥1 ACF(ℓ))
where the infinite sum is typically cut off at some upper lag (say ℓ = 20). For
a completely independent chain, the autocorrelation would be 0 for all lags and
the ESS would just be the number of steps 𝑁 . The more quickly the ACF decays
to 0, the larger the ESS. The more slowly the ACF decays to 0, the smaller the
ESS.
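The formula above is easy to implement directly. A hypothetical Python sketch (the `ess` function and the lag cutoff of 20 are our choices; in practice you would use a packaged version such as coda's effectiveSize):

```python
import numpy as np

def ess(x, max_lag=20):
    """Effective sample size: N / (1 + 2 * sum of ACF up to max_lag)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    acf_sum = sum(np.corrcoef(x[:-k], x[k:])[0, 1] for k in range(1, max_lag + 1))
    return n / (1 + 2 * acf_sum)

rng = np.random.default_rng(0)
iid = rng.normal(size=10000)
print(ess(iid))   # near 10000 for an independent sample

# A sticky AR(1) chain carries far fewer "effective" draws
x = np.empty(10000)
x[0] = 0.0
for t in range(1, len(x)):
    x[t] = 0.9 * x[t - 1] + rng.normal()
print(ess(x))
```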
The larger the ESS of a Markov chain, the more accurate and stable are MCMC-
based estimates of characteristics of the posterior distribution (e.g., posterior
mean, posterior standard deviation, 98% credible region). That is, if the ESS
is large and we run the chain multiple times, then estimates do not vary much
from run to run.
The standard error of a statistic is a measure of its accuracy. The standard error
of a statistic measures the sample-to-sample variability of values of the statistic
over many samples of the same size. A standard error can be approximated via
simulation.
For many statistics (means, proportions) the standard error based on a sample of 𝑛 independent values is on the order of 1/√𝑛.
For example, the standard error of a sample mean measures the sample-to-
sample variability of sample means over many samples of the same size. The
standard error of a sample mean based on an independent sample of size 𝑛 is

(population SD) / √𝑛
2 The effective sample size of a chain can be computed with the coda package function effectiveSize.
To approximate the standard error of an MCMC-based statistic (its Monte Carlo standard error, or MCSE) via simulation:
• Simulate many steps of the Markov chain and compute the value of the
statistic for the simulated chain
• Repeat many times and find the standard deviation of simulated values of
the statistic
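The two bullet points above can be carried out directly. A hedged Python sketch (ours, not the book's R code; the names `log_target` and `run_chain` are our own): it runs 200 Metropolis chains of 1000 steps each targeting the Beta(5, 24) distribution and takes the standard deviation of the run-to-run sample means as an approximation of the MCSE of the posterior-mean estimate.

```python
import numpy as np

def log_target(theta):
    """Unnormalized Beta(5, 24) log density."""
    if theta <= 0 or theta >= 1:
        return -np.inf
    return 4 * np.log(theta) + 23 * np.log(1 - theta)

def run_chain(n_steps, delta, rng):
    """One Metropolis run, started near the posterior mode."""
    theta = 0.17
    out = np.empty(n_steps)
    for i in range(n_steps):
        prop = theta + rng.normal(0, delta)
        if np.log(rng.uniform()) < log_target(prop) - log_target(theta):
            theta = prop
        out[i] = theta
    return out

rng = np.random.default_rng(0)
# SD of the run-to-run sample means approximates the MCSE
means = [run_chain(1000, delta=0.1, rng=rng).mean() for _ in range(200)]
print(np.std(means))
```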
For many statistics (means, proportions) the MCSE based on a chain with effective sample size ESS is on the order of 1/√ESS.
The MCSE affects the accuracy of parameter estimates based on the MCMC
method. If the chain is too dependent, the ESS will be small, the MCSE will be
large, and resulting estimates will not be accurate. That is, two different runs
of the chain could produce very different estimates of a particular characteristic
of the target distribution.
The plots below correspond to each of the 𝛿 values in Example 18.2. Each
plot represents 500 runs of the chain, each run with 1000 steps (after burn
in). For each run we computed both the sample mean (our estimate of the
posterior mean) and the 0.5th percentile (our estimate of the lower endpoint of
a central 99% credible interval.) Therefore, each plot in the top row displays 500
simulated sample means, and each plot in the bottom row displays 500 simulated
0.5th percentiles. The MCSE is represented by the degree of variability in each
plot. We see that for both statistics the MCSE is smallest when 𝛿 = 0.1,
corresponding to the smallest degree of autocorrelation and the largest ESS.
[Figure: histograms of the 500 simulated sample means (top row) and of the 500 simulated 0.5th percentiles (bottom row) for each of the three delta values.]
For most of the situations we’ll see in this course, standard MCMC algorithms
will run fairly efficiently, and checking diagnostics is simply a matter of due
diligence. However, especially in more complex models, diagnostic checking is
an important step in Bayesian data analysis. Poor diagnostics can indicate the
need for better MCMC algorithms to obtain a more accurate picture of the
posterior distribution. Algorithms that use “smarter” proposals will usually
lead to better results.
Bibliography
Dogucu, M., Johnson, A., and Ott, M. (2022). Bayes Rules! An Introduction to
Applied Bayesian Modeling. Chapman and Hall/CRC, Boca Raton, Florida,
1st edition. ISBN 978-0367255398.
Kruschke, J. (2015). Doing Bayesian data analysis: A tutorial with R, JAGS,
and Stan. Academic Press, 2nd edition.
McElreath, R. (2020). Statistical Rethinking: A Bayesian Course with Examples in R and Stan. CRC Press, 2nd edition.