15. Using the data (mass and age) provided in Listing 3.7, fit the fol-
lowing non-linear regression model:
CONTENTS
4.1 Analysis of normal means   120
    4.1.1 One-sample/paired analysis   120
    4.1.2 Comparison of two normal means   121
4.2 Linear regression   124
    4.2.1 Jeffreys prior   125
    4.2.2 Gaussian prior   126
    4.2.3 Continuous shrinkage priors   128
    4.2.4 Predictions   129
    4.2.5 Example: Factors that affect a home’s microbiome   130
4.3 Generalized linear models   133
    4.3.1 Binary data   135
    4.3.2 Count data   137
    4.3.3 Example: Logistic regression for NBA clutch free throws   138
    4.3.4 Example: Beta regression for microbiome data   140
4.4 Random effects   141
4.5 Flexible linear models   149
    4.5.1 Nonparametric regression   149
    4.5.2 Heteroskedastic models   152
    4.5.3 Non-Gaussian error models   153
    4.5.4 Linear models with correlated data   153
4.6 Exercises   158
Linear models form the foundation for much of statistical modeling and in-
tuition. In this chapter, we introduce many common statistical models and
implement them in the Bayesian framework. We focus primarily on Bayesian
aspects of these analyses, including prior selection, computation, and comparisons with classical methods. The chapter begins with analyses of the mean of
a normal population (Section 4.1.1) and comparison of the means of two nor-
mal populations (Section 4.1.2), which are analogous to the classic one-sample
and two-sample t-tests. Section 4.2 introduces the more general Bayesian mul-
tiple linear regression model including priors that are appropriate for high-
dimensional problems. Multiple regression is extended to non-Gaussian data
in Section 4.3 via generalized linear models and correlated data in Section 4.4
via linear mixed models.
where Φ is the standard normal cumulative distribution function and Z = √n Ȳ/σ, and thus matches exactly with a frequentist p-value. By definition,
Φ(zτ ) = τ , and so the decision rule to reject H0 in favor of H1 if the posterior
probability of H0 is less than α is equivalent to rejecting H0 if −Z < zα , or
equivalently if Z > z1−α (−zα = z1−α due to the symmetry of the standard
normal PDF). Therefore, the decision rule to reject H0 in favor of H1 at
significance level α if Z > z1−α is identical to the classic one-sided z-test.
However, unlike the classical test, we can quantify our uncertainty using the
posterior probability that the hypothesis H0 (or H1 ) is true since we have
computed Prob(H0 |Y).
Unknown variance: As shown in Section 2.3, the Jeffreys’ prior for (µ, σ²) is

π(µ, σ²) ∝ (1/σ²)^{3/2}.   (4.3)
Appendix A.3 shows that the marginal posterior of µ integrating over σ² is

µ|Y ∼ t_n(Ȳ, σ̂²/n),   (4.4)

where σ̂² = ∑_{i=1}^n (Yi − Ȳ)²/n, i.e., the posterior is Student’s t distribution with location Ȳ, variance parameter σ̂²/n and n degrees of freedom. Posterior
inference such as credible sets or the posterior probability that µ is posi-
tive follow from the quantiles of Student’s t distribution. The credible set is
slightly different than the frequentist confidence interval because the degrees
of freedom in the classic t-test is n − 1, whereas the degrees of freedom in the
posterior is n; this is the effect of the prior on σ 2 .
In classical statistics, when σ 2 is unknown the Z-test based on the nor-
mal distribution is replaced with the t-test based on Student’s t distribution.
Similarly, the posterior distribution of the mean changes from a normal dis-
tribution given σ 2 to Student’s t distribution when uncertainty about the
variance is considered. Figure 4.1 compares the Gaussian and t density func-
tions. The density functions are virtually identical for n = 25, but for n = 5
the t distribution has heavier tails than the Gaussian distribution; this is the
effect of accounting for uncertainty in σ 2 .
FIGURE 4.1
Comparison of the Gaussian and Student’s t distributions. Below are the Gaussian PDF with mean Ȳ and standard deviation σ/√n compared to the PDF of Student’s t distribution with location Ȳ, scale σ̂/√n, and n
degrees of freedom. The plots assume Ȳ = 10, σ = σ̂ = 2, and n ∈ {5, 25}.
δ is the difference in means and the parameter of interest. Denote the sample
mean of the nj observations in group j = 1, 2 as Ȳj and the group-specific variance estimators as s1² = ∑_{i=1}^{n1} (Yi − Ȳ1)²/n1 and s2² = ∑_{i=n1+1}^{n1+n2} (Yi − Ȳ2)²/n2.
Conditional distribution with the variance fixed: Conditional on
the variance and flat prior π(µ, δ) ∝ 1 it can be shown that the posterior of
the difference in means is
δ|Y ∼ Normal[Ȳ2 − Ȳ1, σ²(1/n1 + 1/n2)].   (4.5)
Appendix A.3 shows (as a special case of multiple linear regression, see Section
4.2) the marginal posterior distribution of δ integrating over both µ and σ 2 is
δ|Y ∼ t_n[Ȳ2 − Ȳ1, σ̂²(1/n1 + 1/n2)],   (4.7)
Listing 4.1
R code for comparing two normal means with the Jeffreys’ prior.
# Y1 is the n1-vector of data for group 1
# Y2 is the n2-vector of data for group 2
where σ̂² = (n1 s1² + n2 s2²)/n is a pooled variance estimator. As with the one-
sample model, the difference between the posterior for the known-variance
versus unknown-variance cases is that an estimate of σ 2 is inserted in the pos-
terior and the Gaussian distribution is replaced with Student’s t distribution
with n degrees of freedom. In the Bayesian analysis we did not “plug in” an estimate of σ²; rather, by accounting for its uncertainty and marginalizing over µ and σ², the posterior for δ happens to have a natural estimator of σ²
in δ’s scale. Listing 4.1 implements this method.
Unequal variance: If the assumption that the variance is the same for both groups is violated, then the two-sample model can be extended as Yi ∼ Normal(µ1, σ1²) for i = 1, ..., n1 and Yi ∼ Normal(µ2, σ2²) for i = n1 + 1, ..., n1 + n2 (all independent). Since no parameters are shared across the two groups, we can apply the one-sample model separately for each group to obtain

µj|Y ∼ t_{nj}(Ȳj, sj²/nj), independently for j = 1, 2.   (4.8)
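Posterior draws of the difference δ = µ2 − µ1 can then be obtained by Monte Carlo sampling each µj from (4.8). The following is a minimal sketch in R; the object names Y1 and Y2 are assumptions.

  # Monte Carlo posterior of delta = mu2 - mu1 under the unequal-variance model
  S    <- 50000
  n1   <- length(Y1); n2 <- length(Y2)
  s2_1 <- mean((Y1 - mean(Y1))^2)                  # group-specific variance estimates
  s2_2 <- mean((Y2 - mean(Y2))^2)
  mu1  <- mean(Y1) + sqrt(s2_1/n1)*rt(S, df = n1)  # draws from (4.8) for group 1
  mu2  <- mean(Y2) + sqrt(s2_2/n2)*rt(S, df = n2)  # draws from (4.8) for group 2
  delta <- mu2 - mu1
  quantile(delta, c(0.025, 0.975))                 # 95% credible interval for mu2 - mu1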
where β = (β1, ..., βp)ᵀ are the regression coefficients and the errors (also called the residuals) are iid εi ∼ Normal(0, σ²). We assume throughout that
Xi1 = 1 for all i so that β1 is the intercept (i.e., the mean if all other covariates
are zero). This model includes as special cases the one-sample mean model
in Section 4.1.1 (with p = 1 and β1 = µ) and the two-sample mean model in
Section 4.1.2 (with p = 2, Xi2 equal one if observation i is from the second
group and zero otherwise, β1 = µ, and β2 = δ).
The coefficient βj for j > 1 is the slope associated with the jth covariate.
For the remainder of this subsection we will assume that all p − 1 covariates
(excluding the intercept term) have been standardized to have mean zero and
variance one so the prior can be specified without considering the scales of the
covariates. That is, if the original covariate j, X̃ij , had sample mean X̄j and
standard deviation ŝj then we set Xij = (X̃ij − X̄j )/ŝj . After standardization,
the slope βj is interpreted as the change in the mean response corresponding
to an increase of one standard deviation unit (ŝj ) in the original covariate.
Similarly, βj /ŝj is the expected increase in the mean response associated with
an increase of one in the original covariate. The model actually has p + 1 pa-
rameters (p regression coefficients and variance σ 2 ) so we temporarily use p as
the number of regression parameters and not the total number of parameters
in the model.
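In R, this standardization can be done with the scale function; a minimal sketch is below, where X_raw is an assumed name for the matrix of unstandardized covariates (excluding the intercept column).

  # Center and scale each covariate to have mean zero and variance one
  X_std <- scale(X_raw)                  # column means and SDs play the roles of X-bar_j and s-hat_j
  X     <- cbind(1, X_std)               # prepend the intercept column so that X[,1] = 1
  s_hat <- attr(X_std, "scaled:scale")   # the s-hat_j needed to interpret beta_j/s-hat_j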
The likelihood function for the linear model with n observations is the product of n Gaussian PDFs with means ∑_{j=1}^p Xij βj and variance σ²,

f(Y|β, σ²) = ∏_{i=1}^n (2πσ²)^{−1/2} exp[−(Yi − ∑_{j=1}^p Xij βj)²/(2σ²)].   (4.10)
Note that the least squares estimator is unique only if XᵀX is full rank (i.e., p < n and none of X’s columns are redundant) and the estimator is poorly defined if XᵀX is not full rank. Assuming X is full rank, the sampling distribution used to construct frequentist confidence intervals and p-values is β̂_LS ∼ Normal[β0, σ²(XᵀX)⁻¹], where β0 is the true value.
The posterior mean is the least squares solution and the posterior covari-
ance matrix is the covariance matrix of the sampling distribution of the least
squares estimator. Therefore, the posterior credible intervals from this model
will numerically match the confidence intervals from a least squares analysis
with known error variance.
Unknown variance: With σ² unknown, the Jeffreys’ prior is π(β, σ²) ∝ (σ²)^{−p/2−1} (Section 2.3). Assuming XᵀX has full rank, Appendix A.3 shows that

β|Y ∼ t_n[β̂_LS, σ̂²(XᵀX)⁻¹],   (4.15)
Listing 4.2
R code for Bayesian linear regression under the Jeffreys’ prior.
# This code assumes:
# Y is the n-vector of observations
# X is the n x p matrix of covariates
# The first column of X is all ones for the intercept
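Because (4.15) is a multivariate Student’s t distribution, posterior draws of β can be generated directly without MCMC. The following is a minimal sketch of one way to do so; the use of the mvtnorm package is an assumption.

  # Direct Monte Carlo sampling from the marginal posterior of beta in (4.15)
  library(mvtnorm)
  n        <- length(Y); p <- ncol(X)
  XtX_inv  <- solve(t(X)%*%X)
  beta_hat <- XtX_inv %*% t(X) %*% Y                 # least squares estimate
  sig2_hat <- sum((Y - X%*%beta_hat)^2)/n            # sigma-hat^2
  beta_samps <- rmvt(10000, sigma = sig2_hat*XtX_inv,
                     df = n, delta = as.vector(beta_hat))  # 10,000 draws of beta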
Listing 4.3
JAGS code for multiple linear regression with Gaussian priors.
# Likelihood
for(i in 1:n){
   Y[i] ~ dnorm(inprod(X[i,],beta[]),taue)
}
# Priors
beta[1] ~ dnorm(0,0.001) # X[i,1]=1 for the intercept
for(j in 2:p){
   beta[j] ~ dnorm(0,taub*taue)
}
taue ~ dgamma(0.1, 0.1)
taub ~ dgamma(0.1, 0.1)
and so from here on we set µ = 0. This prior combined with the likelihood
Y ∼ Normal(Xβ, σ 2 In ) gives posterior
β|Y, σ² ∼ Normal[(XᵀX + Ω⁻¹)⁻¹XᵀY, σ²(XᵀX + Ω⁻¹)⁻¹],   (4.17)
where λ = 1/τ 2 . Ridge regression is often used to stabilize least squares prob-
lems when the number of predictors is large and/or the covariates are collinear.
In ridge regression, the tuning parameter λ can be selected based on cross-
validation. In a fully Bayesian analysis, τ 2 can either be fixed to a large value
to give an uninformative prior, or given a conjugate inverse gamma prior as in
Listing 4.3 to allow the data to determine how much to shrink the coefficients
towards zero. If τ 2 is given a prior, then the intercept term β1 should be given
a different variance because it plays a different role than the other regression
coefficients.
The form of the posterior simplifies under Zellner’s g-prior [83], β|σ² ∼ Normal[0, (σ²/g)(XᵀX)⁻¹] for g > 0. The conditional posterior is then

β|Y, σ² ∼ Normal[c β̂_LS, c σ²(XᵀX)⁻¹],   (4.19)
where c = 1/(1 + g). This form is convenient because β̂_LS and (XᵀX)⁻¹ can be computed once outside the MCMC sampler. The shrinkage factor c determines how strongly the posterior mean and covariance are shrunk towards zero. A common choice is g = 1/n and thus
c = n/(n + 1). Since Fisher’s information matrix for the Gaussian distribution
is the inverse covariance matrix, the prior contributes 1/nth the information
as the likelihood, and so this prior is called the unit information prior [49].
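For fixed σ² and g, a draw from the conditional posterior (4.19) takes only a few lines of R. The sketch below is illustrative and assumes Y, X, sigma2, and g are already defined; it uses the shrinkage factor c = 1/(1 + g) described above.

  # One draw from the conditional posterior of beta under Zellner's g-prior (4.19)
  XtX_inv   <- solve(t(X)%*%X)
  beta_ls   <- XtX_inv %*% t(X) %*% Y          # least squares estimate
  c_shrink  <- 1/(1 + g)                       # shrinkage factor; g = 1/n gives n/(n+1)
  post_mean <- c_shrink*beta_ls
  post_cov  <- c_shrink*sigma2*XtX_inv
  beta_draw <- post_mean + t(chol(post_cov)) %*% rnorm(ncol(X))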
This is the famous LASSO [81] penalized regression estimator and thus the
double exponential prior is often called the Bayesian LASSO prior. An attrac-
tive feature of this estimator is that some of the estimates may have β̂j set
exactly to zero, and this then performs variable selection simultaneously with
estimation. In other words, the LASSO encodes the prior belief that some of
the covariates are unimportant.
The double exponential prior is just one example of a shrinkage prior with a peak at zero and heavy tails. The horseshoe prior [16] is βj|λj ∼ Normal(0, λ0²λj²), where λ0 is a global scale common to all regression coefficients and λj is a local prior standard deviation specific to βj. The local scales are given half-Cauchy priors (i.e., a Student’s t prior with one degree of freedom restricted to be positive). This global-local prior is designed to shrink null coefficients towards zero by having small variance while the true signals have uninformative priors with large variance. The Dirichlet–Laplace prior [11] gives even more shrinkage towards zero by supplementing the Bayesian LASSO with local shrinkage parameters and a Dirichlet prior on the shrinkage parameters. The R2D2 prior [84] is another global-local shrinkage prior for the regression coefficients.
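In JAGS, the horseshoe prior can be written by assigning truncated t (half-Cauchy) priors to the local and global scales. The following is a minimal sketch in the style of the earlier listings; the variable names and the priors on the intercept and error precision are assumptions.

  # Sketch of linear regression with a horseshoe prior on the slopes
  for(i in 1:n){
     Y[i] ~ dnorm(inprod(X[i,],beta[]),taue)
  }
  beta[1] ~ dnorm(0,0.001)                 # intercept
  for(j in 2:p){
     beta[j] ~ dnorm(0, prec[j])
     prec[j] <- pow(lam0*lam[j], -2)       # variance lam0^2 * lam[j]^2
     lam[j]  ~ dt(0,1,1)T(0,)              # half-Cauchy local scale
  }
  lam0 ~ dt(0,1,1)T(0,)                    # half-Cauchy global scale
  taue ~ dgamma(0.1,0.1)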
FIGURE 4.2
Comparison of the Gaussian and double exponential prior distribu-
tions. Below are the standard normal PDF and the double exponential PDF
with parameters set to give mean zero and variance one.
4.2.4 Predictions
One use of linear regression is to make a prediction for a new set of covariates,
Xpred = (X1pred , ..., Xppred ). Given the model parameters, the distribution of
Listing 4.4
JAGS code for the Bayesian LASSO.
# Likelihood
for(i in 1:n){
   Y[i] ~ dnorm(inprod(X[i,],beta[]),taue)
}
# Priors
beta[1] ~ dnorm(0,0.001)
for(j in 2:p){
   beta[j] ~ ddexp(0,taub*taue)
}
taue ~ dgamma(0.1, 0.1)
taub ~ dgamma(0.1, 0.1)
Listing 4.5
JAGS code for linear regression predictions.
# Likelihood
for(i in 1:n){
   Y[i] ~ dnorm(inprod(X[i,],beta[]),taue)
}
# Priors
beta[1] ~ dnorm(0,0.001)
for(j in 2:p){
   beta[j] ~ dnorm(0,taub*taue)
}
taue ~ dgamma(0.1, 0.1)
taub ~ dgamma(0.1, 0.1)

# Predictions
for(i in 1:n_pred){
   Y_pred[i] ~ dnorm(inprod(X_pred[i,],beta[]),taue)
}
# User must pass JAGS the covariates X_pred and integer n_pred
# JAGS returns PPD samples of Y_pred
the new response is Ypred|β, σ² ∼ Normal(∑_{j=1}^p Xjpred βj, σ²). To properly account for parametric uncertainty, we should use the posterior predictive distribution (Section 1.5) that averages over the uncertainty in β and σ². MCMC provides a means to sample from the PPD by making a sample from the predictive distribution for each of the s = 1, ..., S MCMC samples of the parameters, Y^(s)|β^(s), σ^{2(s)} ∼ Normal(∑_{j=1}^p Xjpred βj^(s), σ^{2(s)}), and using the S draws Y^(1), ..., Y^(S) to approximate the PPD. The PPD is then summarized the same way as other posterior distributions, such as by the posterior mean and 95% interval. Similar approaches can be used to analyze missing data as in Section 6.4.
Listing 4.5 gives JAGS code to make linear regression predictions. The
matrix of predictors Xpred must be passed to JAGS and JAGS will return the
predictions Ypred . Making predictions with JAGS can slow the sampler and
consume memory, and so it is often better to first perform MCMC sampling
for the parameters using JAGS and then make predictions in R as in Listing
4.6.
Listing 4.6
R code to use JAGS MCMC samples for linear regression predictions.
# INPUTS
# beta_samples := S x p matrix of MCMC samples (from JAGS)
# taue_samples := S x 1 matrix of MCMC samples (from JAGS)
# X_pred       := n_pred x p matrix of prediction covariates

S      <- nrow(beta_samples)
n_pred <- nrow(X_pred)
Y_pred <- matrix(NA,S,n_pred)
sigma  <- 1/sqrt(taue_samples)

for(s in 1:S){
   Y_pred[s,] <- X_pred%*%beta_samples[s,] + rnorm(n_pred,0,sigma[s])
}

# OUTPUT
# Y_pred := S x n_pred matrix of PPD samples
operational taxonomic units) of fungi. The response is the log of the number
of fungi species present in the sample, which is a measure of species richness.
The objective is to determine which factors influence a home’s species rich-
ness. For each home, eight covariates are included in this example: longitude,
latitude, annual mean temperature, annual mean precipitation, net primary
productivity (NPP), elevation, the binary indicator that the house is a single-
family home, and the number of bedrooms in the home. These covariates are
all centered and scaled to have mean zero and variance one.
We apply the Gaussian model in Listing 4.3 first with iid βj ∼ Normal(0, 100²) (“Flat prior”) and then with iid βj|σ², τ² ∼ Normal(0, σ²τ²) with τ² ∼ InvGamma(0.1, 0.1) (“Gaussian shrinkage prior”), and the Bayesian
LASSO prior in Listing 4.4. For each of the three models we ran two MCMC
chains with 10,000 samples in the burn-in and 20,000 samples after burn-in.
Trace plots (not shown) showed excellent convergence and the effective sample
size exceeded 1,000 for all parameters and all models.
The results are fairly similar for the three priors (Figure 4.3). In all three
models, temperature, NPP, elevation, and single-family home are the most
important predictors, with the most richness estimated to occur in single-
family homes with low temperature, NPP, and elevation. In all three models,
the sample with largest fitted value (i.e., the posterior mean of Xβ) is a single-
family home with three bedrooms in Montpelier, VT, and the sample with the
smallest fitted value is a multiple-family home with two bedrooms in Tempe,
AZ.
Although the results are not that sensitive to the prior in this analysis,
there are some notable differences. For example, compared to the posterior under the flat prior, the posterior of the slope for latitude (top right panel in
FIGURE 4.3
Regression analysis of the richness of a home’s microbiome. The first panel shows the sample locations and the remaining panels plot the posterior distributions of the regression coefficients, βj. The three models are distinguished by their priors for βj: the flat prior is βj ∼ Normal(0, 100²) (solid line), the Gaussian shrinkage prior is βj ∼ Normal(0, σ²τ²) with τ² ∼ InvGamma(0.1, 0.1) (dashed line), and the Bayesian LASSO is βj ∼ DE(0, σ²τ²) with τ² ∼ InvGamma(0.1, 0.1) (dotted line).
Figure 4.3) is shrunk towards zero by the Gaussian shrinkage model, and the
posterior density concentrates even more around the origin for the Bayesian
LASSO prior. However, it is not clear from this plot which of these three fits
is preferred; model comparison is discussed in Chapter 5.
The linear predictor ηi can take any value in (−∞, ∞) depending on Xij .
Therefore, to link the covariates with the mean we can simply set E(Yi ) = ηi
as in standard linear regression. To complete the standard model we elect not
to link the covariates with the variance, and simply set V(Yi ) = σ 2 for all i.
The function that links the linear predictor with a parameter is called the
link function. Say that the parameter in the likelihood for the response is θi
(e.g., E(Yi ) = θi or V(Yi ) = θi ), then the link function g is
g(θi ) = ηi . (4.22)
The link function must be an invertible function that is well-defined for all
permissible values of the parameter. For example, in the Gaussian case the
Listing 4.7
Model statements for several GLMs in JAGS.
link function for the mean is the identity function g(x) = x so that the mean
can be any real number. To link the covariates to the variance we must ensure
that the variance is positive, and so natural-log function g(x) = log(x) is more
appropriate. Link functions are not unique, and must be selected by the user.
For example, the link function for the mean could be replaced by g(x) = x3
and the link function for the variance could be replaced with g(x) = log10 (x).
Bayesian fitting of a GLM requires selecting the prior and computing the
posterior distribution. The priors for the regression coefficients discussed for
Gaussian data in Section 4.2 can be applied for GLMs. The posterior distribu-
tions for GLMs are usually too complicated to derive the posterior in closed-
form and prove that, say, a particular prior distribution leads to a student-t
posterior distribution with interpretable mean and covariance. However, much
of the intuition developed with Gaussian linear models carries over to GLMs.
With non-Gaussian responses the full conditional distributions for the βj
are usually not conjugate and Metropolis sampling must be used. Maximum
likelihood estimates can be used as initial values and the corresponding stan-
dard errors can suggest appropriate candidate distributions. In the examples
in this chapter we use JAGS to carry out the MCMC sampling. R also has
dedicated packages for Bayesian GLMs, such as the MCMCpack package (whose MCMClogit function fits Bayesian logistic regression, Section 4.3.1), that are likely faster than JAGS. With flat priors and a large sample it is even more efficient to invoke the Bayesian central
limit theorem (Section 3.1.3) and approximate the posterior as Gaussian (e.g.,
using the glm function in R) and avoid MCMC altogether.
This link function converts the event probability q first to the odds of the
event, q/(1 − q) > 0, and then to the log odds, which can be any real number.
The logistic regression model is written
Yi ∼ Bernoulli(qi) independently, with logit(qi) = ηi = ∑_{j=1}^p Xij βj.   (4.24)
The inverse logistic function is g⁻¹(x) = exp(x)/[1 + exp(x)] and so the model can also be expressed as Yi ∼ Bernoulli(exp(ηi)/[1 + exp(ηi)]), independently. JAGS code for this model is in Listing 4.7a.
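For reference, a JAGS model statement for logistic regression in this form can be written in just a few lines; the sketch below is illustrative (the Gaussian priors are an assumption) and is not reproduced from Listing 4.7.

  # Sketch of logistic regression in JAGS
  for(i in 1:n){
     Y[i] ~ dbern(q[i])
     logit(q[i]) <- inprod(X[i,],beta[])
  }
  for(j in 1:p){beta[j] ~ dnorm(0,0.01)}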
Because the log odds of the event that Yi = 1 are linear in the covariates,
βj is interpreted as the increase in the log odds corresponding to an increase of
one in Xj with all other covariates held fixed. Similarly, with all other covariates
held fixed, increasing Xj by one multiplies the odds by exp(βj ). Therefore if
βj = 2.3, increasing Xj by one multiplies the odds by ten and if βj = −2.3,
increasing Xj by one divides the odds by ten. This interpretation is convenient
for communicating the results and specifying priors. For example, if a change
of one in the covariate is deemed a large change, then a standard normal prior
may have sufficient spread to represent an uninformative prior.
Probit regression: There are many possible link functions from (0, 1) to (−∞, ∞). In fact, the quantile function (inverse CDF) of any continuous random variable with support (−∞, ∞) would suffice. The link function in logistic regression is the quantile function of the logistic distribution. In probit regression, the link function is the Gaussian quantile function, so that

Yi ∼ Bernoulli(qi) independently, with qi = Φ(ηi),   (4.25)
where Zi is the latent indicator that visitor i fished and thus the mean of
Yi is zero for non-fishers with Zi = 0. In this scenario, we do not observe
the Zi, but this two-stage model is equivalent to the model in (4.31) and uses
only standard distributions. As a result, the model can be coded in JAGS with
covariates included in the mass at zero and the Poisson rate as in Listing 4.7e.
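A sketch of one way to write such a model statement in JAGS is below. The variable names, priors, and the small constant used to keep the Poisson mean strictly positive are assumptions and are not taken from Listing 4.7e.

  # Sketch of a zero-inflated Poisson model in JAGS
  for(i in 1:n){
     Z[i] ~ dbern(q[i])                       # latent indicator that visitor i fished
     logit(q[i]) <- inprod(X[i,],gamma[])     # covariates in the mass at zero
     Y[i] ~ dpois(Z[i]*lambda[i] + 0.0001)    # mean is (essentially) zero when Z[i] = 0
     log(lambda[i]) <- inprod(X[i,],beta[])   # covariates in the Poisson rate
  }
  for(j in 1:p){
     gamma[j] ~ dnorm(0,0.01)
     beta[j]  ~ dnorm(0,0.01)
  }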
FIGURE 4.4
Logistic regression analysis of NBA free throws. The first panel shows
the overall percentage of made free throws versus the percentage for clutch
shots only for each player (denoted by the player’s initials). The solid line
is the x = y line and the dashed line is the fitted value from Model 2. The second plot is the posterior density for the slope and intercept from Model 1 and the intercept from Model 2.
The results are plotted in Figure 4.4. For Model 1, the slope is centered
squarely on one and the intercept is centered slightly below zero, but both
TABLE 4.1
Beta regression of microbiome data. Posterior median and 95% intervals
for the regression coefficients βj and concentration parameter r.
Under the beta regression model, Yi|β, r has mean qi and variance qi(1 − qi)/(r + 1), and so r > 0 determines the concentration of the beta distribution around the mean qi.
Using priors βj ∼ Normal(0, 100) and r ∼ Gamma(0.1, 0.1), we fit this
model in JAGS using the code in Listing 4.7e (although sampling would likely
be faster using the betareg package in R). Before fitting the model, the co-
variates are standardized to have mean zero and variance one. Convergence
was excellent for all parameters (the bottom left panel of Figure 4.5 shows the
trace plot for r). The posterior distributions in Table 4.1 indicate that there
is more diversity (smaller Yi ) on average in homes in cool regions in the east,
and multiple-family homes with many bedrooms.
The random effect αi is the true mean for patient i, and the observations for
patient i vary around αi with variance σ 2 . The αi are called random effects
because if we repeated the experiment with a new sample of 20 children the
αi would change. In this model, the population of patient-specific means is
assumed to follow a normal distribution with mean µ and variance τ 2 . The
overall mean µ is a fixed effect because if we repeated the experiment with
a new sample of 20 children from the same population it would not change.
A linear model with both fixed and random effects is called a linear mixed
model.
A Bayesian analysis of a random effects model requires priors for the pop-
ulation parameters. For example, Listing 4.8 provides JAGS code with con-
jugate priors µ ∼ Normal(0, 1002 ) and τ 2 ∼ InvGamma(0.1, 0.1). The same
algorithms (e.g., Gibbs sampling) can be used for random effects models as
for the other models we have considered. In fact, computationally there is
no need to distinguish between fixed and random effects (which can lead to
FIGURE 4.5
Beta regression for microbiome data. The top left panel shows the histogram of the observed proportions of abundance allocated to the most abundant OTU, and the top right panel plots this variable against the sample’s longitude. The second row gives the trace plots (the two chains are different shades of gray) of the concentration parameter, r, and the fitted Beta[r̂q̂, r̂(1 − q̂)] density for three samples (Summerville, SC; Greensburg, PA; and Junction City, CA) evaluated at the posterior mean for all parameters.
FIGURE 4.6
One-way random effect analysis of the jaw data. The dots in the top left panel show bone density at the four visits for each patient (connected by lines), the top right panel compares the observations (dots) and the posterior distribution of the subject random effect αi (boxplot), the bottom left panel plots the posterior of the variance ratio τ²/(τ² + σ²), and the final panel compares the posterior density of the mean µ from the random effects model versus the independence model with iid Yij ∼ Normal(µ, σ²).
Listing 4.8
The one-way random effects model in JAGS.
# Likelihood
for(i in 1:n){for(j in 1:m){
   Y[i,j] ~ dnorm(alpha[i],sig2_inv)
}}

# Random effects
for(i in 1:n){alpha[i] ~ dnorm(mu,tau2_inv)}

# Priors
mu ~ dnorm(0,0.0001)
sig2_inv ~ dgamma(0.1,0.1)
tau2_inv ~ dgamma(0.1,0.1)
confusion, [42]). Conceptually, however, fixed and random effects are distinct
because fixed effects describe the population and a random effect describes an
individual from the population. A common source of confusion is that fixed
effects (such as µ in Listing 4.8) are treated as random variables in a Bayesian
analysis. However, as with all parameters, the prior and posterior distribu-
tions for fixed effects reflect subjective uncertainty about the true values of
these fixed but unknown parameters.
Unlike analyses such as linear regression where the main focus is on the
mean and prediction of new observations, in random effects models the vari-
ance components (e.g., σ 2 and τ 2 ) are often the main focus of the analysis.
Therefore it is important to scrutinize the priors used for these parameters.
The inverse gamma prior is conjugate for the variance parameters which often
leads to simple Gibbs updates. However, as shown in Figure 4.7, the inverse
gamma prior with small shape parameter for a variance induces a prior for
the standard deviation with prior PDF equal to zero at the origin. This ex-
cludes the possibility that there is no random effect variance, which may not
be suitable in many problems.
As an alternative, [29] endorses a half-Cauchy (HC) prior for the standard
deviation. The HC distribution is the student-t distribution with one degree
of freedom restricted to be positive and has a flat PDF at the origin (Figure
4.7), which is usually a more accurate expression of prior belief. Listing 4.9
gives JAGS code for this prior. In this code the HC distribution is assigned
directly to the standard deviations which breaks the conjugacy relationships
for the variance components (this is easily handled by JAGS); [29] shows that
conjugacy can be restored using a two-stage model. This code assumes the HC
scale parameter is fixed at one. Because the Cauchy prior has a very heavy
tail this gives 0.99 prior quantile equal to 63. Despite this wide prior range,
the scale of the HC prior should be adjusted to the scale of the data.
Random effects induce correlation between observations from the same group. In the one-way random effects model, the covariance marginally over the random effects is Cov(Yij, Yik) = τ² for j ≠ k, so that observations in the same group have correlation τ²/(τ² + σ²).
Listing 4.9
The one-way random effects model with half-Cauchy priors.
# Likelihood
for(i in 1:n){for(j in 1:m){
   Y[i,j] ~ dnorm(alpha[i],sig2_inv)
}}

# Random effects
for(i in 1:n){alpha[i] ~ dnorm(mu,tau2_inv)}

# Priors
mu ~ dnorm(0,0.0001)
sig2_inv <- pow(sigma1,-2)
tau2_inv <- pow(sigma2,-2)
sigma1 ~ dt(0, 1, 1)T(0,) # Half-Cauchy priors with
sigma2 ~ dt(0, 1, 1)T(0,) # location 0 and scale 1
FIGURE 4.7
Priors for a standard deviation. The half-Cauchy prior for σ and the prior
induced for σ by inverse gamma priors on σ 2 with different shape parameters.
All priors are scaled to have median equal 1. The two panels differ only by
the range of σ being plotted.
and the random intercept and slope for patient i is αi = (αi1 , αi2 )T . The mean
vector β includes the population mean intercept and slope, and is thus a fixed
effect. The 2 × 2 population covariance matrix Ω determines the variation
of the random effects over the population. To complete the Bayesian model
we specify prior σ 2 ∼ InvGamma(0.1, 0.1), β ∼ Normal(0, 1002 I2 ), and Ω ∼
Listing 4.10
Random slopes model in JAGS.
# Likelihood
for(i in 1:n){for(j in 1:m){
   Y[i,j] ~ dnorm(alpha[i,1]+alpha[i,2]*age[j],tau)
}}

# Random effects
for(i in 1:n){
   alpha[i,1:2] ~ dmnorm(beta[1:2],Omega_inv[1:2,1:2])
}

# Priors
tau ~ dgamma(0.1,0.1)
for(j in 1:2){beta[j] ~ dnorm(0,0.0001)}
Omega_inv[1:2,1:2] ~ dwish(R[,],2.1)

R[1,1]<-1/2.1
R[1,2]<-0
R[2,1]<-0
R[2,2]<-1/2.1
InvWishart(2.1, I2 /2.1). The inverse Wishart prior for the covariance matrix
has prior mean I2, the 2 × 2 identity matrix (see Appendix A.1). JAGS code
for this random-slopes model is in Listing 4.10.
The posterior mean of the population covariance matrix is
E(Ω|Y) = [91.78  −10.14; −10.14  1.23],   (4.38)
and the posterior 95% interval for the correlation Cor(αi1, αi2) = Ω12/√(Ω11 Ω22) is (−0.98, −0.89). Therefore there is a strong negative dependence between the intercept and slope, indicating that bone density increases rapidly for children with low bone density at age 8, and vice versa.
Figure 4.8 plots the posterior distribution of the fitted values αi1 + αi2 X
for X between 8 and 10 years for three patients. For each patient and each age,
we compute the 95% interval using the quantiles of the S posterior samples αi1^(s) + αi2^(s) X. We also plot the posterior predictive distribution (PPD) for the measured bone density at age 10. The PPD is approximated by sampling Yi*^(s) ∼ Normal(αi1^(s) + αi2^(s)·10, σ^{2(s)}) at each iteration and then computing
the quantiles of the S predictions. The PPD accounts for both uncertainty in
the patient’s random effect αi and measurement error with variance σ 2 . The
intervals in Figure 4.8 suggest that uncertainty in the random effects is the
dominant source of variation.
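Given the MCMC output, the PPD computation for one patient requires only a few lines of R; the object names in the sketch below are assumptions.

  # PPD of the measured bone density at age 10 for one patient
  # alpha_i: S x 2 matrix of draws of (alpha_i1, alpha_i2); sigma: S draws of the error SD
  S     <- nrow(alpha_i)
  Y_ppd <- alpha_i[,1] + alpha_i[,2]*10 + rnorm(S, 0, sigma)
  quantile(Y_ppd, c(0.025, 0.975))   # 95% posterior predictive interval at age 10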
Marginal models: Inducing correlation by conditioning on random effects
is equivalent to a marginal model (Section 4.5.4) that does not include random
FIGURE 4.8
Mixed effects analysis of the jaw data. The observed bone density
(points) for three subjects versus the posterior median (solid lines) and 95%
intervals (dashed lines) of the fitted value αi1 + αi2 X for X ranging from 8–10
years, and 95% credible intervals (vertical lines at Age=10) of the posterior
predictive distributions for the measured response at age X = 10.
effects but directly specifies correlation between observations in the same group. For example, the one-way random effects model

Yij|αi ∼ Normal(αi, σ²), independent over i and j, with iid αi ∼ Normal(µ, τ²),   (4.39)

is equivalent to the marginal model Yi ∼ Normal(µ, Σ),
where Yi = (Yi1 , ..., Yim )T is the data vector for group i, µ = (µ, ..., µ)T is the
mean vector, and Σ is the covariance matrix with τ 2 + σ 2 on the diagonals
and τ 2 elsewhere. An advantage of the marginal approach is that we no longer
have to estimate the random effects αi ; a disadvantage is that for large data
sets and complex correlation structure the mean vector and especially the
covariance matrix can be large which slows computation.
In the hierarchical representation, the elements of Yi are independent and
identically distributed given αi ; in the marginal model the m observations
from group i are no longer independent, but they remain exchangeable, i.e.,
their distribution is invariant to permuting their order. The concept of ex-
changeability plays a fundamental role in constructing hierarchical models.
The representation theorem by Bruno de Finetti states that any infinite se-
quence of exchangeable variables can be written as independent and identically
distributed conditioned on some latent distribution. Therefore, this important
type of dependent data can be modeled using a simpler hierarchical model.
where β = (β1, ..., βp)ᵀ are the regression coefficients and the errors are iid εi ∼ Normal(0, σ²). This model makes four key assumptions:
(1) Linearity: The mean of Yi |Xi is linear in Xi
(2) Equal variance: The residual variance (σ 2 ) is the same for all i
(3) Normality: The errors εi are Gaussian
(4) Independence: The errors εi are independent
In real analyses, most if not all of these assumptions will be violated to some
extent. Minor violations will not invalidate statistical inference, but glaring
model misspecifications should be addressed. This chapter provides Bayesian
remedies to model misspecification (each subsection addresses one of the four
assumptions above). A strength of the Bayesian paradigm is that these models
can be fit by simply adding a few lines of JAGS code and do not require
fundamentally new theory or algorithms.
FIGURE 4.9
Nonparametric regression for the motorcycle data. Panel (a) plots the
time since impact (scaled to be between 0 and 1) and the acceleration (g) along
with the posterior median and 95% interval for the mean function from the
homoskedastic fit; Panel (b) shows the J = 10 spline basis functions Bj (X);
Panels (c) and (d) show the posterior median and 95% intervals for the mean
and variance functions from the heteroskedastic model.
Note that each basis function in Figure 4.9b has Bj (0) = 0, and so an in-
tercept (β0 ) is required. By increasing J, any smooth mean function can be
approximated as a linear combination of the B-spline basis functions.
Motorcycle example: To fit the mean curve to the data plotted in Figure
4.9a, we use J = 10 B-spline basis functions and priors βj ∼ Normal(0, τ 2 σ 2 )
and σ 2 , τ 2 ∼ InvGamma(0.1, 0.1). The model is fit using MCMC with the
code in Listing 4.3. In this code, the basis functions have been computed using the bs function in the splines package in R and passed to JAGS as Xij = Bj(Xi). For each
iteration, we compute g(Xi ) for all i = 1, ..., n as a function of that iteration’s
posterior sample of β. This produces the entire posterior distribution of the
mean function g for all n sample points (and any other X we desire), and
Figure 4.9a plots the posterior median and 95% interval of g at each sample
point. The fitted model accurately captures the main trend including the valley
around X = 0.4.
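A sketch of this pre-processing step is below; it assumes the scaled times are stored in X and uses the bs function from the splines package.

  # Construct J = 10 B-spline basis functions and the design matrix passed to JAGS
  library(splines)
  B      <- bs(X, df = 10)     # n x 10 matrix with entries B_j(X_i)
  X_jags <- cbind(1, B)        # first column of ones for the intercept beta_0
  p      <- ncol(X_jags)       # X_jags, Y, n, and p are then passed to the code in Listing 4.3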
We selected J = 10 basis functions because this degree of model complexity
visually seemed to fit the data well. Choosing smaller J would give a smoother
estimate of g and choosing larger J would give a rougher estimate. Clearly a
more rigorous approach to selecting the number of basis functions is needed,
and this is discussed in Chapter 5.
Listing 4.11
Model statement for heteroskedastic Gaussian regression.
for(i in 1:n){
   Y[i] ~ dnorm(mu[i],prec[i])
   mu[i] <- inprod(x[i,],beta[])
   prec[i] <- 1/sig2[i]
   sig2[i] <- exp(inprod(x[i,],alpha[]))
}
for(j in 1:p){beta[j] ~ dnorm(0,taub)}
for(j in 1:p){alpha[j] ~ dnorm(0,taua)}
taub ~ dgamma(0.1,0.1)
taua ~ dgamma(0.1,0.1)
appears that properly quantifying uncertainty about the mean trend requires
a realistic model for the error variance.
and gi ∈ {1, ..., K} is the cluster label for observation i with Prob(gi = k) = πk. By letting the number of mixture components increase to infinity, any distribution can be approximated, and by selecting priors θk ∼ Normal(0, τ2²) (iid), τ1², τ2² ∼ InvGamma, and (π1, ..., πK) ∼ Dirichlet, all full conditional distributions for all parameters are conjugate, permitting Gibbs sampling (Listing 4.12b).
For a fixed number of mixture components (K) the mixture-of-normals
model is a semiparametric estimator of the density f (ε). There is a rich liter-
ature on nonparametric Bayesian density estimation [37]. The most common
model is the Dirichlet process mixture model that has infinitely many mixture
components and a particular model for the mixture probabilities.
Listing 4.12
Model statement for Gaussian regression with non-normal errors.
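A minimal JAGS sketch of a mixture-of-normals error model along these lines is given below; the variable names and priors are assumptions, and the sketch is not reproduced from Listing 4.12.

  # Sketch of Gaussian regression with K-component mixture-of-normals errors
  for(i in 1:n){
     Y[i] ~ dnorm(inprod(X[i,],beta[]) + theta[g[i]], prec1)
     g[i] ~ dcat(pi[1:K])                  # cluster label with Prob(g[i]=k) = pi[k]
  }
  for(k in 1:K){theta[k] ~ dnorm(0, prec2)}
  pi[1:K] ~ ddirch(a[1:K])                 # a = rep(1,K) passed as data
  for(j in 1:p){beta[j] ~ dnorm(0,0.001)}
  prec1 ~ dgamma(0.1,0.1)                  # 1/tau1^2
  prec2 ~ dgamma(0.1,0.1)                  # 1/tau2^2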
correlated errors is
Y ∼ Normal(Xβ, Σ). (4.48)
A Bayesian analysis of correlated data hinges on correctly specifying the corre-
lation structure to capture say spatial or temporal correlation. Given the cor-
relation structure and priors for the correlation parameters, standard Bayesian
computational tools can be used to summarize the posterior. Correlation pa-
rameters usually will not have conjugate priors and so Metropolis–Hastings
sampling is used. An advantage of the Bayesian approach for correlated data
is that using MCMC sampling we can account for uncertainty in the corre-
lation parameters for prediction or inference on other parameters, whereas
maximum likelihood analysis often uses plug-in estimates of the correlation
parameters and thus underestimates uncertainty.
Gun control example: The data for this analysis come from Kalesan
et al. (2016) [47]. The response variable, Yi, is the log firearm-related death
rate per 10,000 people in 2010 in state i (excluding Alaska and Hawaii). This
is regressed onto five potential confounders: log 2009 firearm death rate per
10,000 people; firearm ownership rate quartile; unemployment rate quartile;
non-firearm homicide rate quartile; and firearm export rate quartile. The co-
variate of interest is the number of gun control laws in effect in the state. This
gives p = 6 covariates.
We first fit the usual Bayesian linear regression model
Yi = β0 + ∑_{j=1}^p Xij βj + εi   (4.49)

with independent errors εi ∼ Normal(0, σ²) and uninformative priors. The
posterior density of the regression coefficient corresponding to the number of
gun laws is plotted in Figure 4.10. The posterior probability that the coefficient
is negative is 0.96, suggesting a negative relationship between the number of
gun laws and the firearm-related death rate.
The assumption of independent residuals is questionable because neighbor-
ing states may be correlated. Spatial correlation may stem from guns being
brought across state borders or from missing covariates (e.g., attitudes about
and use of guns) that vary spatially. Research has shown that accounting
for residual dependence can have a dramatic effect on regression coefficient
estimates [43].
We decompose the residual covariance Cov[(ε1 , ..., εn )T ] = Σ as
Σ = τ 2 S + σ 2 In , (4.50)
FIGURE 4.10
Effect of gun-control legislation on firearm-related death rate. Pos-
terior distribution of the coefficient associated with the number of gun-control
laws in a state from the spatial and non-spatial model of the states’ firearm-
related death rate.
exponentially with the distance between them. However, quantifying the dis-
tance between irregularly shaped states is challenging, and so we model spatial
dependence using adjacencies. Let Aij = 1 if states i and j share a border
and Aij = 0 if i = j or the states are not neighbors. The spatial covariance
follows the conditionally autoregressive model S = (M − ρA)−1 , where A is
the adjacency matrix with (i, j) element Aij and M is the diagonal matrix
with the ith diagonal element equal to the number of states that neighbor
state i. The parameter ρ ∈ (0, 1) is not the correlation between adjacent sites,
but determines the strength of spatial dependence with ρ = 0 corresponding
to independence.
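In R, the spatial covariance in (4.50) can be assembled directly from the 0/1 adjacency matrix; a minimal sketch (object names assumed) is:

  # Build the conditionally autoregressive covariance from the adjacency matrix A
  M     <- diag(rowSums(A))              # diagonal matrix of the number of neighbors
  S     <- solve(M - rho*A)              # CAR spatial covariance S = (M - rho*A)^{-1}
  Sigma <- tau2*S + sig2*diag(nrow(A))   # residual covariance in (4.50)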
The posterior mean (standard deviation) of the spatial dependence pa-
rameter ρ is 0.38 (0.25), and so the residual spatial dependence in these data
is not strong. However, the posterior of the regression coefficient of interest
in Figure 4.10 is noticeably wider for the spatial model than the non-spatial
model. The posterior probability that the coefficient is negative lowers from
0.96 from the non-spatial model to 0.93 for the spatial model. Therefore, while
accounting for residual dependence did not qualitatively change the results,
this example illustrates that the chosen model for the residuals can affect the
posterior of the regression coefficients.
Jaw bone density example: A possible correlation structure for the
longitudinal data in Figure 4.6 (top left) is to assume that correlation decays
with the time between visits. A first-order autoregression correlation structure
is Cor(Yij , Yik ) = ρ|j−k| . Denoting the vector of m observations for patient
Listing 4.13
Random slopes model with autoregressive dependence in JAGS.
# Likelihood
for(i in 1:n){
   Y[i,1:m] ~ dmnorm(mn[i,1:m],SigmaInv)
   for(j in 1:m){mn[i,j] <- alpha[i,1]+alpha[i,2]*age[j]}
}
SigmaInv[1:m,1:m] <- inverse(Sigma[1:m,1:m])
for(j in 1:m){for(k in 1:m){
   Sigma[j,k] <- pow(rho,abs(k-j))/tau
}}

# Random effects
for(i in 1:n){alpha[i,1:2] ~ dmnorm(beta[1:2],Omega[1:2,1:2])}

# Priors
tau ~ dgamma(0.1,0.1)
for(j in 1:2){beta[j] ~ dnorm(0,0.0001)}
rho ~ dunif(0,1)
Omega[1:2,1:2] ~ dwish(R[,],2.1)

R[1,1]<-1/2.1
R[1,2]<-0
R[2,1]<-0
R[2,2]<-1/2.1
4.6 Exercises
1. A clinical trial gave six subjects a placebo and six subjects a new
weight loss medication. The response variable is the change in
weight (pounds) from baseline (so -2.0 means the subject lost 2
pounds). The data for the 12 subjects are:
Placebo Treatment
2.0 -3.5
-3.1 -1.6
-1.0 -4.6
0.2 -0.9
0.3 -5.1
0.4 0.1
> library(MASS)
> data(Boston)
> ?Boston
(c) Refit the Bayesian model with double exponential priors for the
regression coefficients, and discuss how the results differ from
the analysis with uninformative priors.
(d) Fit a Bayesian linear regression model in (a) using only the first
500 observations and compute the posterior predictive distribu-
tion for the final 6 observations. Plot the posterior predictive
distribution versus the actual value for these 6 observations and
comment on whether the predictions are reasonable.
3. Download the 2016 Presidential Election data from the book’s web-
site. Perform Bayesian linear regression with the response variable
for county i being the difference between the percentage of the vote
for the Republican candidate in 2016 minus 2012 and all variables
in the object X as covariates.
(a) Fit a Bayesian linear regression model with uninformative
Gaussian priors for the regression coefficients and summarize
the posterior distribution of all regression coefficients.
(b) Compute the residuals Ri = Yi − Xi β̂ where β̂ is the posterior
mean of the regression coefficients. Are the residuals Gaussian?
Which counties have the largest and smallest residuals, and
what might this say about these counties?
(c) Include a random effect for the state, that is, for a county in
state l = 1, ..., 50,
Yi |αl ∼ Normal(Xi β + αl , σ 2 )
where αl ∼ Normal(0, τ²) are iid and τ² has an uninformative prior.
Why might adding random effects be necessary? How does
adding random effects affect the posterior of the regression co-
efficients? Which states have the highest and lowest posterior
mean random effect, and what might this imply about these
states?
4. Download the US gun control data from the book’s website. These
data are taken from the cross-sectional study in [47]. For state i, let
Yi be the number of homicides and Ni be the population.
(a) Fit the model Yi |β ∼ Poisson(Ni λi ) where log(λi ) = Xi β. Use
uninformative priors and p = 7 covariates in Xi : the intercept,
the five confounders Zi , and the total number of gun laws in
state i. Provide justification that the MCMC sampler has con-
verged and sufficiently explored the posterior distribution and
summarize the posterior of β.
(b) Fit a negative binomial regression model and compare with the
results from Poisson regression.
(c) For the Poisson model in (a), compute the posterior predictive
distribution for each state with the number of gun laws set
to zero. Repeat this with the number of gun laws set to 25
(the maximum number). According to these calculations, how
would the number of deaths nationwide be affected by these
policy changes? Do you trust these projections?
5. Download the titanic dataset from R,
library("titanic")
dat <- titanic_train
?titanic_train
> library(geoR)
> data(gambia)
> ?gambia
library(babynames)
dat <- babynames
dat <- dat[dat$name=="Sophia" &
dat$sex=="F" &
dat$year>1950,]
yr <- dat$year
p <- dat$prop
t <- dat$year - 1950
Y <- log(p/(1-p))
Let Yt denote the sample log-odds in year t+1950. Fit the following
time series (auto-regressive order 1) model to these data:
Yt = µt + ρ(Yt−1 − µt−1) + εt,

where µt = α + βt and the εt ∼ Normal(0, σ²) are iid. The priors are α, β ∼ Normal(0, 100²), ρ ∼ Uniform(−1, 1), and σ² ∼ InvGamma(0.1, 0.1).
(a) Give an interpretation of each of the four model parameters: α,
β, ρ, and σ 2 .
(b) Fit the model using JAGS for t > 1, verify convergence, and
report the posterior mean and 95% interval for each parameter.
(c) Plot the posterior predictive distribution for Yt in the year 2020.
10. Open and plot the galaxies data in R using the code below,
> library(MASS)
> data(galaxies)
> ?galaxies
> Y <- galaxies
> hist(Y,breaks=25)
CONTENTS
5.1 Cross validation   164
5.2 Hypothesis testing and Bayes factors   166
5.3 Stochastic search variable selection   170
5.4 Bayesian model averaging   175
5.5 Model selection criteria   176
5.6 Goodness-of-fit checks   186
5.7 Exercises   192
where θ̂ i is the parameter estimate based on the fold that excludes observation
i. The log score is the average log likelihood of the test set observations given
the parameter estimates from the training data. Based on these measures, we
might discard models with COV far below the nominal 1 − α level and from the remaining models choose the one with small MSE, MAD, and WIDTH and large LS. It is essential that the evaluation is based on out-of-sample
predictions rather than within-sample fit. Overly complicated models (e.g., a
linear model with too many predictors) may replicate the data used to fit the
model but be too unstable to predict well under new conditions.
Cross validation can be motivated by information theory. Suppose the “true” data generating model has PDF f0 so that in reality the Yi are iid draws from f0. Of course, we cannot know the true model and so we choose between M models with PDFs f1, ..., fM. Our objective is to select the model that is in some sense the closest to the true model. A reasonable measure of the difference between the true and postulated model is the Kullback–Leibler divergence

KL(f0, fj) = E{log[fj(Y)/f0(Y)]} = E{log[fj(Y)]} − E{log[f0(Y)]},
where the expectation is with respect to the true model Y ∼ f0 . The term
E {log[f0 (Y )]} is the same for all j = 1, ..., M and therefore ranking models
based on KL(f0 , fj ) is equivalent to ranking models based their log score
LSj = E{log[fj(Y)]}. Since the data are generated from f0, the cross
validation log score in (5.3) is a Monte Carlo estimate of the true log score
LSj , and therefore ranking models based on their cross validation log score
is an attempt to rank them based on similarity to the true data-generating
model.
If the prior probability of the two models are equal, then the BF is simply the
posterior odds Prob(M2 |Y)/Prob(M1 |Y); on the other hand, if the data are
not at all informative about the models and the prior and posterior odds are
the same, then BF = 1 regardless of the prior.
Selecting between two competing models is often referred to as hypothe-
sis testing. In hypothesis testing one of the models is referred to as the null
model or null hypothesis and the other is the alternative model/hypothesis.
Hypothesis tests are usually designed to be conservative so that the null model
is rejected in favor of the alternative only if the data strongly support this
model. If we define Model 1 as the null hypothesis and Model 2 as an al-
ternative hypothesis, then a rule of thumb [48] is that BF > 10 provides
The first model has no unknown parameters, and the second model’s parame-
ter θ can be integrated out giving the beta-binomial model for Y (see Appendix
A.1). Therefore, these hypotheses about θ correspond to two different models
for the data
FIGURE 5.1
Bayes factor for the beta-binomial model. (left) Beta(a, b) prior PDF
for several combinations of a and b and (right) observed data Y versus the
Bayes factor comparing the beta-binomial model Y |θ ∼ Binomial(n, θ) and
θ ∼ Beta(a, b) versus the null model Y ∼ Binomial(n, 0.5) for n = 20 and
several combinations of a and b.
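For the setting in Figure 5.1 the Bayes factor has a closed form because the marginal likelihood under the beta-binomial model is available analytically. The short R sketch below is our own illustration (not one of this chapter's listings) of that computation for n = 20; the binomial coefficient cancels from the ratio.

BF21 <- function(Y, n, a, b){
  # BF comparing M2: Y|theta ~ Binomial(n, theta), theta ~ Beta(a, b)
  # against M1: Y ~ Binomial(n, 0.5)
  exp(lbeta(Y + a, n - Y + b) - lbeta(a, b) - n*log(0.5))
}
BF21(Y = 10, n = 20, a = 1, b = 1)   # Y = n/2: BF < 1, favoring the null
BF21(Y = 18, n = 20, a = 1, b = 1)   # Y far from n/2: large BF, favoring M2

With a = b = 1, values of Y near n/2 give Bayes factors below one, while extreme values of Y give large Bayes factors, consistent with the right panel of Figure 5.1.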
M1 : µ ≤ 0 versus M2 : µ > 0.
To compute the BF for these hypotheses we simply fit the Bayesian model
and compare the posterior odds of M2 versus M1 to the corresponding prior odds.
FIGURE 5.2
Spike and slab prior. PDF (left) and CDF (right) of β under the spike-and-slab
prior β = γδ, where γ ∼ Bernoulli(0.5) and δ ∼ Normal(0, 1).
5.3 Stochastic search variable selection

In Bayesian variable selection, each of the M = 2^p subsets of the p candidate
covariates defines a model. The models can be indexed by the vector of binary
inclusion indicators γ = (γ1, ..., γp), where γj = 1 if covariate j is included
in the model and γj = 0 otherwise, and q = Prob(γj = 1) is the prior inclusion
probability. Given the model γ, the regression coefficients that are included in the
model can have independent normal priors (other priors are possible, [70]),
βj | γj = 1 ∼ Normal(0, σ²τ²). The inclusion probability and regression-
coefficient variance can be fixed or given priors such as q ∼ Beta(a, b) and
τ² ∼ InvGamma(ε, ε). As with Bayes factors (Section 5.2), the posterior for
this model can be sensitive to the prior, and multiple priors should be com-
pared to understand sensitivity.
The prior for βj induced by this model is plotted in Figure 5.2. The prior
is a mixture of two components: a point mass (spike) at βj = 0 corresponding to
samples that exclude covariate j (γj = βj = 0) and a Gaussian curve (slab)
corresponding to samples that include covariate j (γj = 1 so βj = δj). Because
of this distinctive shape, this prior is often called the spike-and-slab prior.
All M models can simultaneously be written as the supermodel
    Yi | β, σ² ∼ Normal( ∑_{j=1}^p Xij βj , σ² )   independently for i = 1, ..., n,
    βj = γj δj,
    γj ∼ Bernoulli(q),
    δj ∼ Normal(0, τ²σ²).
MCMC samples from this model include different subsets of the covariates.
Posterior samples with γj = 0 have βj = 0 and thus covariate j excluded from
the model. This supermodel can be fit a single time and give approximations
for all M posterior model probabilities. Since the search over models is done
within MCMC, this is a stochastic search as opposed to systematic searches
such as forward or backward regression. An advantage of stochastic search
is that models with low probability are rarely sampled and so more of the
computing time is spent on high-probability models.
With many possible models, no single model is likely to emerge as having
high probability. Therefore, with large M extremely long chains can be needed
to give accurate estimates of model probabilities. A more stable summary of
the model space is the set of marginal inclusion probabilities MIPj = Prob(γj =
1|Y) = Prob(βj ≠ 0|Y). MIPj is the probability that covariate j is included
in the model averaging over uncertainty in the subset of the other variables
that are included in the model, and can be used to compute the Bayes factor
for the models that do and do not include covariate j, BFj = [MIPj/(1 −
MIPj)]/[q/(1 − q)]. In fact, [6] show that if a single model is required for
prediction, the model that includes the covariates with MIPj greater than 0.5 is
preferred to the highest probability model.
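As a sketch of how these summaries might be computed from MCMC output (our own illustration; gamma_samps is an assumed S × p matrix of posterior draws of γ extracted from the sampler, and q the prior inclusion probability):

# gamma_samps: S x p matrix of 0/1 draws of the inclusion indicators gamma_j
MIP <- colMeans(gamma_samps)                  # marginal inclusion probabilities
BF  <- (MIP/(1 - MIP)) / (q/(1 - q))          # BF for covariate j in versus out

# Posterior model probabilities: each sampled gamma vector labels one model
model_id    <- apply(gamma_samps, 1, paste, collapse = "")
model_probs <- sort(table(model_id)/nrow(gamma_samps), decreasing = TRUE)
head(model_probs)                             # most frequently visited models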
Childhood malaria example: Diggle et al. [23] analyze data from n =
1,332 children from the Gambia. The binary response Yi is the indicator that
child i tested positive for malaria. We use five covariates in Xij :
• Age: Age of the child, in days
• Net use: Indicator variable denoting whether (1) or not (0) the child regularly
sleeps under a bed-net
• Treated: Indicator variable denoting whether (1) or not (0) the bed-net is
treated (coded 0 if netuse=0)
• Green: Satellite-derived measure of the greenness of vegetation in the im-
mediate vicinity of the village (arbitrary units)
• PCH: Indicator variable denoting the presence (1) or absence (0) of a health
center in the village
All five covariates are standardized to have mean zero and variance one. We
use the logit regression model
    logit[Prob(Yi = 1)] = α + ∑_{j=1}^p Xij βj.                        (5.15)
Listing 5.1
JAGS code for SSVS.
for(i in 1:n){
   Y[i] ~ dbern(pi[i])
   logit(pi[i]) <- alpha + X[i,1]*beta[1] +
                   X[i,2]*beta[2] + X[i,3]*beta[3] +
                   X[i,4]*beta[4] + X[i,5]*beta[5]
}
for(j in 1:5){
   beta[j] <- gamma[j]*delta[j]
   gamma[j] ~ dbern(0.5)
   delta[j] ~ dnorm(0,tau)
}
alpha ~ dnorm(0,0.01)
tau ~ dgamma(0.1,0.1)
TABLE 5.1
Posterior model probabilities for the Gambia analysis. All other mod-
els have posterior probability less than 0.01.
Covariates Probability
Age, Net use, Greenness, Treated 0.42
Age, Net use, Greenness, Treated, Health center 0.37
Age, Net use, Greenness, Health center 0.20
TABLE 5.2
Marginal posteriors for the Gambia analysis. Posterior inclusion proba-
bilities (i.e., Prob(βj ≠ 0|Y)) and posterior median and 90% intervals for the
βj .
FIGURE 5.3
Posterior distribution for the SSVS analysis. Posterior distribution for
the regression coefficients βj for age and proximity to a health center.
The posterior for the age coefficient is concentrated away from zero, i.e.,
it escapes the prior spike at zero; however, the posterior for proximity to a
health center retains considerable mass at zero and thus retains the spike-and-slab
shape.
High-dimensional regression example: The data for this example are
from [50] and can be downloaded from http://www.ncbi.nlm.nih.gov/geo (ac-
cession number GSE3330). In this study, n = 60 mice (31 female) were sam-
pled and the physiological phenotype stearoyl-CoA desaturase 1 (SCD1) is
taken as the response to be regressed onto the expression levels of 22,575
genes. Following [12] and [84], we use only the p = 1, 000 genes with highest
pairwise correlation with the response as predictors in the model. Even after
this simplification we are left with a high-dimensional problem with p > n.
We use the linear regression model with SSVS prior
    Yi ∼ Normal( α + ∑_{j=1}^p Xij βj , σe² )                          (5.16)
    βj = γj δj
    γj ∼ Bernoulli(q)
    δj ∼ Normal(0, σe² σb²).
FIGURE 5.4
High-dimensional regression example. Marginal inclusion probabilities
(MIPj = Prob(βj ≠ 0|Y)) under three priors: (1) q ∼ Beta(1, 1) and σb² ∼
InvGamma(0.1, 0.1), (2) q ∼ Beta(1, 2) and σb² ∼ InvGamma(0.1, 0.1), and
(3) q ∼ Beta(1, 1) and σb² ∼ InvGamma(0.5, 0.5).
To assess prior sensitivity we compare three priors: q ∼ Beta(1, 1) and σb² ∼
InvGamma(0.1, 0.1) (Prior 1), q ∼ Beta(1, 2) and σb² ∼ InvGamma(0.1, 0.1)
(Prior 2), and q ∼ Beta(1, 1) and σb² ∼ InvGamma(0.5, 0.5) (Prior 3). The
models are fit using Gibbs sampling with 200,000 iterations, a burn-in of
20,000 iterations discarded and the remaining samples thinned by 20. This
MCMC code requires 1–2 hours on an ordinary PC.
Only two genes have inclusion probability greater than 0.5 (Figure 5.4).
The ordering of the marginal inclusion probabilities is fairly robust to the
prior, but the absolute value of the marginal inclusion probabilities varies
with the prior. The posterior median (95% interval) of the number of variables
included in the model, pin = ∑_{j=1}^p γj, is 33 (12, 535) for Prior 1, 28 (12, 392)
for Prior 2 and 20 (11, 64) for Prior 3. In this analysis, the results are more
sensitive to the prior for the variance than to the prior for the inclusion
probability.
5.4 Bayesian model averaging

Rather than conditioning on a single model, predictions can average over the
candidate models according to their posterior probabilities wm(Y) = Prob(Mm|Y),
giving the model-averaged posterior predictive distribution (PPD)

    p(Y∗|Y) = ∑_{m=1}^M wm(Y) p(Y∗|Y, Mm),

where p(Y∗|Y, Mm) is the PPD from model Mm. Making inference that
accounts for model uncertainty using posterior model probabilities is called
Bayesian model averaging (BMA; [45]).
SSVS (Section 5.3) via MCMC provides a convenient way to perform
model-averaged predictions. Draws from the SSVS model include dummy indi-
cators for different models as unknown parameters, and thus naturally average
over models according to their posterior probability. Therefore, if a sample of
Y∗ is drawn at each MCMC iteration, then these samples follow the Bayesian
model-averaged PPD, as desired.
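A minimal sketch of this idea for the linear SSVS supermodel, using assumed names for the MCMC output (gamma_samps, delta_samps, sigma_samps; an intercept, if present, would be added to the linear predictor) and a new covariate vector xnew:

# One model-averaged predictive draw per MCMC iteration for covariates xnew
beta_samps <- gamma_samps * delta_samps              # beta_j = gamma_j * delta_j
mu_pred    <- as.vector(beta_samps %*% xnew)         # linear predictor, one per draw
y_pred     <- rnorm(length(mu_pred), mu_pred, sigma_samps)  # draws from the BMA PPD
quantile(y_pred, c(0.025, 0.500, 0.975))             # model-averaged prediction interval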
In addition to prediction, BMA can be used for inference on param-
eters common to all models. For example, if parameter βj has posterior
distribution p(βj|Y, Mm) under model Mm, then the BMA posterior is
p(βj|Y) = ∑_{m=1}^M wm(Y) p(βj|Y, Mm). However, BMA results for parame-
ters must be carefully scrutinized because the interpretation of parameters
can change considerably across models and so it is not always clear how to
interpret an average over models. As an extreme case, in a regression analysis
with collinearity the sign of βj might change depending on the other covariates
that are included, and so it is not obvious that results should be combined
across models.
5.5 Model selection criteria

Another approach to model comparison is based on information criteria that
balance model fit, measured through the deviance D(θ) = −2 log f(Y|θ),
against model complexity. For models with random effects the likelihood can be
defined either conditional on or marginal over the random effects, so the
definition of the likelihood and thus the deviance is not unique, but for most
models for independent data the definition of the deviance is clear.

The Bayesian information criterion (BIC) is one such criterion, defined as
BIC = D̂ + p log(n), where D̂ = D(θ̄) is the deviance evaluated at the posterior
mean θ̄, p is the number of parameters and n is the sample size. The deviance
information criterion (DIC) replaces the actual number of parameters with the
effective number of parameters

    pD = D̄ − D̂,                                                       (5.21)

where D̄ = E[D(θ)|Y] is the posterior mean deviance, and is defined as
DIC = D̄ + pD, so that models with small DIC are simple (small pD) and fit
well (small D̄).
The effective number of parameters pD is generally less than the number of
parameters p if the priors are strong, as desired. Unfortunately, pD can fall out-
side [0, p] in pathological cases, typically when the posterior mean of θ is
not a good summary of the posterior, as is the case in mixture models with
multimodal priors and posteriors.
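To make the computation of D̄, D̂ and pD concrete, here is a self-contained sketch (our own toy example, not from the text) for the model Yi ∼ Normal(θ, 1) with a Normal(0, 100) prior, where the posterior of θ can be sampled directly:

set.seed(1)
Y <- rnorm(25, mean = 2, sd = 1)                # simulated data
n <- length(Y)

# Conjugate posterior of theta for Y_i ~ N(theta, 1) and theta ~ N(0, 100)
post_var  <- 1/(n + 1/100)
post_mean <- post_var * sum(Y)
theta     <- rnorm(10000, post_mean, sqrt(post_var))

# Deviance D(theta) = -2 log f(Y | theta) for each posterior draw
D    <- sapply(theta, function(t) -2*sum(dnorm(Y, t, 1, log = TRUE)))
Dbar <- mean(D)                                 # posterior mean deviance
Dhat <- -2*sum(dnorm(Y, mean(theta), 1, log = TRUE))  # deviance at posterior mean
pD   <- Dbar - Dhat                             # effective number of parameters
DIC  <- Dbar + pD
c(pD = pD, DIC = DIC)

Here pD is close to one, matching the single unknown parameter under a weak prior.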
The motivation of pD as a measure of model size is complex, but intu-
ition can be built using a few examples. In multiple linear regression with
Y|β ∼ Normal(Xβ, σ²In) and Zellner’s prior β ∼ Normal(0, cσ²(XᵀX)⁻¹),
the effective number of parameters is pD = [c/(c + 1)]p. In this case, the effective
number of parameters increases from zero with a tight prior (c = 0) to p with
an uninformative prior (c → ∞). Next, consider the one-way random effects model
(Section 4.4)

    Yij|µj ∼ Normal(µj, σ²)  and  µj ∼ Normal(0, cσ²),                 (5.23)

for i = 1, ..., n replications within each of the j = 1, ..., p groups and fixed
variance components σ² and c. The effective number of parameters is
pD = [c/(c + 1/n)]p, which also increases from 0 to p with the prior variance c.
The Watanabe–Akaike (also known as the widely applicable) information
criterion (WAIC; [30]) is an alternative to DIC. WAIC is proposed as an
approximation to n-fold (i.e., leave-one-out) cross validation. Rather than the
posterior mean of the deviance, WAIC uses the posterior mean and variance
of the likelihood and log likelihood for each observation. The criterion is

    WAIC = −2 ∑_{i=1}^n log(f̄i) + 2pW,                                 (5.24)
where fit for observation i is measured by the posterior mean of the likelihood,

    f̄i = E[f(Yi|θ)|Y],                                                  (5.25)

and model complexity is measured by

    pW = ∑_{i=1}^n Var[log f(Yi|θ)|Y],                                   (5.26)
defined as the sum of the posterior variances of the log likelihood contributions.
Therefore, as with BIC and DIC, models with small WAIC are preferred because
they are simple and fit well.
Selecting a random effect model for the Gambia data: As in Section
5.3, the binary response Yi indicates whether child i tested positive for malaria and
we consider covariates for age, bed-net use, bed-net treatment, greenness and
health center. In this analysis we also consider child i’s village vi ∈ {1, ..., 65}
to account for dependence in the malaria status of children from the same
village (Figure 5.5 plots the location and number of children sampled from
each village). We use the random effects logistic regression model
    logit[Prob(Yi = 1)] = α + ∑_{j=1}^p Xij βj + θvi,                   (5.27)
FIGURE 5.5
Gambia data. The location and number of children sampled from each vil-
lage.
where θv is the random effect for village v. We compare three models for the
village random effects via DIC and WAIC:
1. No random effects: θv = 0
2. Gaussian random effects: θv ∼ Normal(0, τ 2 )
3. Double-exponential random effects: θv ∼ DE(0, τ 2 )
In all models, the priors are α, βj ∼ Normal(0, 100) and τ 2 ∼
InvGamma(0.1, 0.1).
Code for Model 3 is given in Listing 5.2. DIC is computed using the
dic.samples function in JAGS, although this unfortunately requires extra
MCMC sampling. There is no analogous function for W AIC in JAGS and
so it must be computed outside of JAGS. In Listing 5.2 the extra line in the
likelihood like[i] instructs JAGS to return posterior samples of the likelihood
function f (Yi |θ) which in this model is the binomial PMF (dbin in JAGS).
After MCMC sampling, the posterior mean of like[i] is computed as the
approximation to f¯i and the posterior variance of log(like[i]) is computed
to approximate pW .
The WAIC and DIC results are in Table 5.3. Both measures show strong
support for including village random effects, but cannot distinguish between
Gaussian and double-exponential random-effect distributions. Since the Gaus-
sian model is more familiar, this is probably the preferred model for these
data.
2016 Presidential Election example: The data for this analysis come
from Tony McGovern’s very useful data repository 1 . The response variable,
Yi , is the percent increase in Republican (GOP) support from 2012 to 2016,
i.e.,
    100 [ (% in 2016)/(% in 2012) − 1 ],                                 (5.28)
1 https://github.com/tonmcg/County_Level_Election_Results_12-16
Listing 5.2
JAGS code to compute WAIC and DIC for the random effects model.
mod <- textConnection("model{
   for(i in 1:n){
      Y[i] ~ dbern(pi[i])
      logit(pi[i]) <- beta[1] + X[i,1]*beta[2]
                    + X[i,2]*beta[3] + X[i,3]*beta[4]
                    + X[i,4]*beta[5] + X[i,5]*beta[6]
                    + theta[village[i]]
      like[i] <- dbin(Y[i],pi[i],1) # For WAIC computation
   }
   for(j in 1:6){beta[j] ~ dnorm(0,0.01)}
   for(j in 1:65){theta[j] ~ ddexp(0,tau)}
   tau ~ dgamma(0.1,0.1)
}")

# Compile the model and draw posterior samples of like[i]
# (the data list is assumed to contain Y, X, village and n)
model <- jags.model(mod, data=list(Y=Y,X=X,village=village,n=n), n.chains=2)
update(model, 10000, progress.bar="none")
samps <- coda.samples(model, variable.names=c("like"), n.iter=50000)

# Compute DIC
DIC <- dic.samples(model,n.iter=50000,progress.bar="none")

# Compute WAIC
like <- rbind(samps[[1]],samps[[2]]) # Combine the two chains
fbar <- colMeans(like)
Pw   <- sum(apply(log(like),2,var))
WAIC <- -2*sum(log(fbar))+2*Pw
TABLE 5.3
Model selection criteria for the Gambia data. DIC (pD) and WAIC
(pW) for the three random effects models.
in county i = 1, ..., n (Figure 5.6a). The election data are matched with p = 10
county-level census variables (Xij ) obtained from Kaggle via Ben Hamner2 :
1. Population, percent change – April 1, 2010 to July 1, 2014
2. Persons 65 years and over, percent, 2014
3. Black or African American alone, percent, 2014
4. Hispanic or Latino, percent, 2014
5. High school graduate or higher, percent of persons age 25+, 2009–
2013
6. Bachelor’s degree or higher, percent of persons age 25+, 2009–2013
7. Homeownership rate, 2009–2013
8. Median value of owner-occupied housing units, 2009–2013
9. Median household income, 2009–2013
10. Persons below poverty level, percent, 2009–2013.
All covariates are centered and scaled (e.g., Figure 5.6b). The objective is to
determine the factors that are associated with an increase in GOP support.
Also, following the adage that “all politics is local,” we explore the possibility
that the factors related to GOP support vary by state.
For a county in state s, we assume the linear model
    Yi = β0s + ∑_{j=1}^p Xij βjs + εi,                                  (5.29)

where βjs is the effect of covariate j in state s and the εi ∼ Normal(0, σ²)
are iid errors. We compare three models for the βjs:

1. Constant slopes: βjs ≡ βj for all counties
2. Varying slopes, uninformative prior: βjs ∼ Normal(0, 10²), iid across states
3. Varying slopes, informative prior: βjs ∼ Normal(µj, σj²), independent across states
In all models, the prior for the error variance is σ 2 ∼ InvGamma(0.1, 0.1). In
the first model the slopes have uninformative priors βj ∼ Normal(0, 102 ).
In the final model, the mean µj and variances σj2 are given priors µj ∼
Normal(0, 102 ) and σj2 ∼ InvGamma(0.1, 0.1) and estimated from the data
so that information is pooled across states via the prior. The three methods
are compared using DIC and WAIC as in Listing 5.3 for the constant slopes
model.
We first study the results from Model 1 with the same slopes in all states.
Table 5.4 shows that all covariates other than home-ownership rate are asso-
ciated with the election results. GOP support tended to increase in counties
2 https://www.kaggle.com/benhamner/2016-us-election
FIGURE 5.6
2016 Presidential Election data. Panel (a) plots the percentage change in
Republican (GOP) support from 2012 to 2016 (Yi) and Panel (b) plots the
percent of each county’s residents (age 25+) with a bachelor’s degree or higher
(standardized to have mean zero and variance one; Xi7).
Listing 5.3
JAGS code to compute DIC for the constant slopes model.
model_string <- "model{
   for(i in 1:n){
      Y[i]    ~ dnorm(Xb[i],taue)
      Xb[i]   <- inprod(X[i,],beta[])
      like[i] <- dnorm(Y[i],Xb[i],taue) # For WAIC
   }
   for(j in 1:p){beta[j] ~ dnorm(0,0.01)}
   taue ~ dgamma(0.1,0.1)
}"

# Compile the model (the data list is assumed to contain Y, X, n and p)
model <- jags.model(textConnection(model_string),
                    data=list(Y=Y,X=X,n=n,p=p), n.chains=2)

# Burn-in samples
update(model, 10000, progress.bar="none")

# Compute DIC
dic <- dic.samples(model,n.iter=50000)

# Compute WAIC
samp <- coda.samples(model, variable.names=c("like"),
                     n.iter=50000)
like <- rbind(samp[[1]],samp[[2]]) # Combine the two chains
fbar <- colMeans(like)
Pw   <- sum(apply(log(like),2,var))
WAIC <- -2*sum(log(fbar)) + 2*Pw
TABLE 5.4
2016 Presidential Election multiple regression analysis. Posterior
mean (95% interval) for the slopes βj for the model with the same slopes
in each state.
with decreasing population, high proportion of seniors and high school grad-
uates, low proportions of African Americans and Hispanics, high income but
low home value and high poverty rate.
The DIC (D̄, pD ) for the three models are 21312 (21300, 12) for Model
1 with constant slopes, 18939 (18483, 455) for Model 2 with varying slope
and uninformative priors, and 18842 (18604, 238) for Model 3 with varying
slopes and informative priors. The first model is not rich enough to capture the
important trends in the data and thus has high D̄ and DIC. The second model
has the best fit to the observed data (smallest D̄), but is too complicated
(large pD ). The final model balances model complexity and fit and has the
smallest DIC. WAIC gives similar results, with W AIC (pW ) equal to 21335
(20), 18971 (406) and 18909 (259) for Models 1–3, respectively.
Inspection of the posteriors of the variances σj² from Model 3 shows
that the covariate effect that varies the most across states is the proportion
of the county with a Bachelor’s degree. Figure 5.7 maps the posterior mean
and standard deviation of the state-level slopes, βs7 . The association between
the proportion of the population with a Bachelor’s degree and change in GOP
support is the strongest (most negative) in New England and the Midwest. The
posterior standard deviation is the smallest in Colorado and Texas, possibly
because these states have high variation in the covariate across counties.
Simulation study to evaluate the performance of DIC and W AIC:
In these examples DIC and W AIC gave similar results. However, in practice
there will obviously be cases where they differ and the user will have to choose
between them and defend their choice. Also, when applied to real data as above
the “correct” model is unknown and so we cannot say for certain that either
method selected the right model. One way to build trust in these criteria (or
any other statistical method) is to evaluate their performance for simulated
FIGURE 5.7
Results of the 2016 Presidential Election analysis. Posterior mean and
standard deviation of the effect of the bachelor-degree rate on GOP support
(βs7 ) for the model with a different slope in each state and informative prior.
data where the correct model is known (see Section 7.3 for more discussion of
simulation studies).
The data for this simulation experiment are generated to mimic the Gambia
data. The response Yi is binary and generated from the random effects
logistic regression model (5.27), with the village random effect standard deviation
set to σ ∈ {0, 0.25, 0.5, 0.75, 1}.
FIGURE 5.8
Simulation to evaluate selection criteria. Boxplots of the difference
in DIC (left) and W AIC (right) comparing random effects logistic re-
gression with simple logistic regression for N = 100 simulated datasets
generated with random effect standard deviation σ. Each boxplot repre-
sents the distribution of the difference over N datasets simulated with σ ∈
{0.00, 0.25, 0.50, 0.75, 1.00} and the numbers above the boxplots are the per-
centage of the N datasets for which the difference was negative and thus the
criteria favored the random effects model.
5.6 Goodness-of-fit checks

Model selection criteria only compare the models under consideration; if, say,
all of the candidate models assume the data are Gaussian but the data are not,
then even the best-fitting model is inappropriate. Therefore, in addition to
comparing models, diagnostics should be performed to determine whether the
models capture the important features of the data.
Standard diagnostic tools are equally important to a Bayesian and non-
Bayesian analysis. For example, in a linear regression, normality (e.g., residual
qq-plot), linearity (e.g., added variable plots), influential points (e.g., Cook’s
D), etc., should be scrutinized. Many of these classic tools are based on least-
squares residuals and therefore are not purely Bayesian, but they remain valu-
able informal goodness-of-fit measures.
Another way to critique a model is out-of-sample prediction performance.
Say the data are split into a training set Y and a test set Y∗ = (Y1∗ , ..., Ym∗ ).
Section 1.5 discusses the posterior predictive distribution (PPD) of test-set
observation i given the training data (and averaging over uncertainty in the model
parameters), fi∗(y|Y). Comparing the test-set data to the PPD is a way to
verify that the model fits well. The PPD evaluated at the observed test observation,
    CPOi = fi∗(Yi∗|Y),                                                   (5.30)
is called the conditional predictive ordinate (CPO) [27, 65]. Test set observa-
tions with small CPO do not fit the model well and are potentially outliers.
A more interpretable diagnostic is to check that roughly 95% of the test-
set observations fall in the 95% posterior prediction intervals. For continuous
data, the probability integral transform (PIT) statistic [20] provides a measure
of fit for the entire predictive distribution rather than just the 95% intervals.
The PIT is the posterior predictive probability below the test set value, Yi∗ ,
    PITi = ∫_{−∞}^{Yi∗} fi∗(y|Y) dy.                                     (5.31)
This integral looks daunting, but can be easily approximated as the propor-
tion of the MCMC samples from the PPD that are below the test-set value.
Typically P ITi is computed for each test-set observation and these statistics
are plotted in a histogram. If the model fits well, then the PIT statistics should
follow a Uniform(0,1) distribution and the PIT histogram should be flat.
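The sketch below (our own illustration) carries out this approximation in R; ppd_samps is assumed to be an S × m matrix of posterior predictive draws for the m test-set observations and Ytest the vector of held-out values.

# Proportion of PPD draws below each test-set value: Prob(Y_i^pred < Y_i* | Y)
PIT <- colMeans(sweep(ppd_samps, 2, Ytest, FUN = "<"))
hist(PIT, breaks = 20, main = "PIT histogram", xlab = "PIT")
# A roughly flat histogram indicates a well-calibrated predictive distribution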
An important consideration when interpreting diagnostic measures is that
even a model that appears to be perfectly calibrated (say with uniform
PIT statistics) is not necessarily the true model (if there is such a thing).
For example, say the data are generated as Y |X ∼ Normal(X, 1) with
X ∼ Normal(0, 1), then the model Y ∼ Normal(0, 2) will fit the data per-
fectly well, but is clearly inferior to a model that includes X as a predictor.
Therefore, both model selection and goodness-of-fit testing are important.
Posterior predictive checks: Rather than focusing on predicting in-
dividual test set observations, posterior predictive checks (e.g., [33]) evalu-
ate fit using summaries of the dataset. Let θ̃ be a posterior sample of the
model parameters and Ỹ be a replicate dataset drawn from the model given
θ̃. To facilitate comparisons, the replicate dataset should have the same di-
mensions as the observed data. For example, in the linear regression model
Y ∼ Normal(Xβ, σ²I), the parameters are θ̃ = (β̃, σ̃²) and we would sample
the replicate dataset as Ỹ ∼ Normal(Xβ̃, σ̃²I). Let D(·) denote a summary
statistic of a dataset, such as its sample mean, variance or skewness; if the model
fits well, the observed summary D(Y) should be a plausible draw from the
predictive distribution of D(Ỹ). For the linear regression model, the observed
sample mean and variance should fall in the predictive distribution of the mean
and variance because there are parameters in the model for these summaries.
Therefore, selecting D to be the sample mean or variance is not informative.
However, the normal
model assumes the distribution is symmetric and there are no parameters in
the model to capture asymmetry. Therefore, taking D to be the skewness
provides a useful verification that this modeling assumption holds.
The Bayesian p-value is a more formal summary of the posterior predictive
check. The Bayesian p-value is the probability under repeated sampling from
the fitted model of observing a summary statistic at least as large as the
observed statistic. This probability can be approximated using MCMC output
as the proportion of the S draws D(Ỹ1 ), ..., D(ỸS ) that are greater than
D(Y). A Bayesian p-value near zero or one indicates that in at least one
aspect the model does not fit well.
The Bayesian p-value resembles the p-value from classical hypothesis test-
ing in that both quantify the probability under repeated sampling of observing
a statistic at least as large as the observed statistic. An important difference
is that the classical p-value assumes repeated sampling under the null hypoth-
esis, whereas the Bayesian p-value assumes repeated sampling from the fitted
Bayesian model. Another important difference is that for the Bayesian p-value,
probabilities near either zero or one provide evidence against the fitted model.
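As a sketch of the mechanics for the linear regression example above (our own illustration with assumed object names beta_samps, sigma_samps, X and Y for the posterior draws, design matrix and response), the following computes a Bayesian p-value for the sample skewness:

skew <- function(y) mean((y - mean(y))^3)/sd(y)^3   # test statistic D(Y)

D_obs <- skew(Y)
D_rep <- sapply(1:nrow(beta_samps), function(s){
   Ytilde <- rnorm(length(Y), X %*% beta_samps[s,], sigma_samps[s]) # replicate data
   skew(Ytilde)
})
p_bayes <- mean(D_rep > D_obs)   # values near 0 or 1 signal lack of fit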
Gun control example: Kalesan et al. (2016) [47] study the relationship
between state gun laws and firearm-related death rates. The response variable,
Yi , is the number of firearm-related deaths in 2010 in state i. The analysis
includes five potential confounders (Zij ): 2009 firearm death rate per 10,000
people; firearm ownership rate quartile; unemployment rate quartile; non-
firearm homicide rate quartile and firearm export rate quartile. The covariates
of interest in the study are the status of gun laws in the state. Let Xil indicate
that state i has law l. In this example, we simply use the number of laws
Xi = ∑_l Xil as the covariate. Setting aside correlation versus causation issues,
the objective of the analysis is to determine if there is a relationship between
the number of gun laws in the state and its firearm-related death rate. Our
objective in this section is to illustrate the use of posterior predictive checks
to verify that the model fits well.
We check the fit of two models. The first is the usual Poisson regression
model
    Yi ∼ Poisson(λi)  where  λi = Ni exp( α + ∑_{j=1}^5 Zij βj + Xi β6 )        (5.33)
and Ni is the state’s population. A concern with the Poisson model is that
because the mean equals the variance it may not be flexible enough to capture
large counts. Therefore, we also consider the negative binomial model with
mean λi and over-dispersion parameter m
    Yi ∼ NB( m/(λi + m), m ).                                            (5.34)
Listing 5.4
JAGS code for the over-dispersed Poisson regression model.
# Likelihood
for(i in 1:n){
   Y[i] ~ dnegbin(q[i],m)
   q[i] <- m/(m+N[i]*lambda[i])
   log(lambda[i]) <- alpha + inprod(Z[i,],beta[1:5]) + X[i]*beta[6]
}

# Priors
for(j in 1:6){
   beta[j] ~ dnorm(0,0.1)
}
alpha ~ dnorm(0,0.1)
m ~ dgamma(0.1,0.1)
The priors for the fixed effects are α, βj ∼ Normal(0, 10) and for the negative-
binomial model m ∼ Gamma(0.1, 0.1). Code to fit the negative-binomial
model is given in Listing 5.4.
We interrogate the models using posterior predictive checks with the fol-
lowing six test statistics:
1. Minimum count: D1 (Y) = min{Y1 , ..., Yn }
2. Maximum count: D2 (Y) = max{Y1 , ..., Yn }
3. Range of counts: D3 (Y) = max{Y1 , ..., Yn } − min{Y1 , ..., Yn }
4. Minimum rate: D4 (Y) = min{Y1 /N1 , ..., Yn /Nn }
5. Maximum rate: D5 (Y) = max{Y1 /N1 , ..., Yn /Nn }
6. Range of rates: D6 (Y) = max{Y1 /N1 , ..., Yn /Nn } −
min{Y1 /N1 , ..., Yn /Nn }
We use test statistics related to the range of the counts and rates because
FIGURE 5.9
Bayesian p-values for the gun control example. The density curves
show the posterior predictive distribution of the six test statistics for the
Poisson and negative-binomial (NB) models, and the vertical lines are the
test statistics for the observed data. The Bayesian p-value is given in the
legend’s parentheses.
the main concern here is properly accounting for large counts. Figure 5.9
plots the PPD for both models and all six test statistics. The PPD from the
Poisson model does not capture the largest counts or the range of counts ob-
served in the dataset. The Bayesian p-value is close to zero or one for all four
test statistics involving either the maximum or the range. As expected, the
negative-binomial model gives much wider prediction intervals and thus less
extreme Bayesian p-values. Therefore, while the Poisson model is not appro-
priate for this analysis, the negative-binomial model appears to be adequate
based on these (limited) tests. However, in both analyses the coefficient as-
sociated with the number of gun laws (β6 ) is negative with high probability
(95% interval (-0.017,-0.011) for the Poisson model and (-0.026, -0.008) for
the negative-binomial model), and so the conclusion that there is a negative
association between the number of gun laws in a state and its firearm-related
mortality is robust to this modeling assumption.
5.7 Exercises
1. Download the airquality dataset in R. Compare the following two
models using 5-fold cross validation:
M1 : ozonei ∼ Normal(β1 + β2 solar.Ri , σ 2 )
M2 : ozonei ∼ Normal(β1 + β2 solar.Ri + β3 tempi + β4 windi , σ²).
Specify and justify the priors you select for both models.
2. Fit model M2 to the airquality data from the previous problem,
and use posterior predictive checks to verify that the model fits
well. If you find model misspecification, suggest (but do not fit)
alternatives.
3. Assume that Y |λ ∼ Poisson(N λ) and λ ∼ Gamma(0.1, b). The
null hypothesis is that λ ≤ 1 and the alternative is that λ > 1.
Select b so that the prior probability of the null hypothesis is 0.5.
Using this prior, compute the posterior probability of the alternative
hypothesis and the Bayes factor for the alternative relative to the
null hypothesis for (a) N = 10 and Y = 12; (b) N = 20 and Y = 24;
(c) N = 50 and Y = 60; (d) N = 100 and Y = 120. For which N
and Y is there definitive evidence in favor of the alternative?
4. Use the “Mr. October” data (Section 2.4) Y1 = 563, N1 = 2820,
Y2 = 10, and N2 = 27. Compare the two models:
M1 : Y1 |λ1 ∼ Poisson(N1 λ1 ) and Y2 |λ2 ∼ Poisson(N2 λ2 )
M2 : Y1 |λ0 ∼ Poisson(N1 λ0 ) and Y2 |λ0 ∼ Poisson(N2 λ0 ),
using Bayes factors, DIC and WAIC. Assume the Uniform(0, c)
prior for all λj and compare the results for c = 1 and c = 10.
5. Use DIC and WAIC to compare logistic and probit links for the
gambia data in the R package geoR using the five covariates in List-
ing 5.2 and no random effects.
6. Fit the logistic regression model to the gambia data in the previous
question and use posterior predictive checks to verify the model fits
well. If you find model misspecification, suggest (but do not fit)
alternatives.
7. For the NBA free throw data in Section 1.6, assume that for player
i, Yi |pi ∼ Binomial(ni , pi ) where Yi is the number of clutch makes,
ni is the number of clutch attempts, and pi is the clutch make
probability. Compute the posterior probabilities of the models:
M1 : logit(pi ) = β1 + logit(qi )
M2 : logit(pi ) = β1 + β2 logit(qi ),
where qi is the overall free throw percentage. Specify the priors you
use for the models’ parameters and discuss whether the results are
sensitive to the prior.
8. Fit model M2 to the NBA data in the previous question and use
posterior predictive checks to verify the model fits well. If you find
model misspecification, suggest (but do not fit) alternatives.
9. Download the Boston Housing Data in R from the Boston dataset.
The response is medv, the median value of owner-occupied homes,
and the other 13 variables are covariates that describe the neighbor-
hood. Use stochastic search variable selection (SSVS) to compute
the most likely subset of the 13 covariates to include in the model
and the marginal probability that each variable is included in the
model. Clearly describe the model you fit including all prior distri-
butions.
10. Download the WWWusage dataset in R. Using data from times t =
5, ..., 100 as outcomes (earlier times may be used as covariates), fit
the autoregressive model
Yt |Yt−1 , ..., Y1 ∼ Normal(β0 + β1 Yt−1 + ... + βL Yt−L , σ 2 )
where Yt is the WWW usage at time t. Compare the models with
L = 1, 2, 3, 4 and select the best time lag L.
11. Using the WWWusage dataset in the previous problem, fit the model
with L = 2 and use posterior predictive checks to verify that the
model fits well. If you find model misspecification, suggest (but do
not fit) alternatives.
12. Open and plot the galaxies data in R using the code below,
> library(MASS)
> data(galaxies)
> ?galaxies
> Y <- galaxies
> n <- length(Y)
> hist(Y,breaks=25)
6
Case studies using hierarchical modeling

CONTENTS
6.1 Overview of hierarchical modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
6.2 Case study 1: Species distribution mapping via data fusion . . . . 200
6.3 Case study 2: Tyrannosaurid growth curves . . . . . . . . . . . . . . . . . . . . . 203
6.4 Case study 3: Marathon analysis with missing data . . . . . . . . . . . . . 211
6.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
By ordering the three variables and specifying the univariate marginal distri-
bution of X and then the univariate conditional distributions for Y |X and
Z|X, Y , the multivariate problem is reduced to three univariate problems.
Because the variables are ordered and each conditional distribution depends
only on the previous variables in the ordering, the resulting joint distribution
is guaranteed to be valid. Also, since any multivariate distribution can be
decomposed this way, there is no loss of flexibility by taking this approach.
The trivariate model is represented as a directed acyclic graph (DAG) in
FIGURE 6.1
Directed acyclic graphs (DAGs). Panel (a) shows the DAG for the model
f (X, Y, Z) = f (X)f (Y |X)f (Z|X, Y ) and Panel (b) shows the DAG for the
model f (X, Y, Z) = f (X)f (Y |X)f (Z|Y ).
Figure 6.1a. A DAG (also called a Bayesian network) represents the model
as a graph with each observation and parameter as a node (i.e., the points
that define the graph) and edges (i.e., connections between the nodes) to
denote conditional dependence. To define a valid stochastic model the graph
must be directed and acyclic. A directed graph associates each edge with
a direction; an arrow from X to Y indicates that the hierarchical model is
defined by modeling the conditional distribution of Y as a function of X.
The absence of an arrow from X to Z in Figure 6.1b conveys the choice
that conditioned on Y , Z does not depend on X, i.e., f (z|x, y) = f (z|y); in
contrast, Figure 6.1a is the DAG for the model with conditional dependence
between X and Z given Y . The graph must also be acyclic, meaning that it is
impossible to follow directed edges from a node through the graph and return
to the original node. These two conditions rule out building models such as
p(x, y, z) = p(x|y, z)p(y|z)p(z|y), which may not be a valid joint distribution.
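As a small illustration (our own sketch with arbitrary, hypothetical choices of the conditional distributions), a draw from a joint distribution specified this way is obtained by simulating the variables in order:

# Simulate from f(x,y,z) = f(x) f(y|x) f(z|x,y); the specific conditionals
# below are illustrative choices, not a model used elsewhere in this chapter
set.seed(1)
S <- 10000
X <- rnorm(S, 0, 1)            # f(x)
Y <- rnorm(S, X, 1)            # f(y|x)
Z <- rnorm(S, (X + Y)/2, 1)    # f(z|x,y)
cor(cbind(X, Y, Z))            # dependence induced by the ordered conditionals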
Hierarchical models can take many forms, but a general way to build a
model is through a data layer, a process layer and a prior layer. Model build-
ing should begin with the process layer that contains the underlying scientific
processes of interest and the unknown parameters. Building this layer is ide-
ally done in consultation with domain experts. Once this layer is defined the
statistical objectives can be articulated, for example, to estimate a particular
parameter or test a specific hypothesis. Ideally these objectives dictate the
data to be collected for the analysis. The data layer relates (via the likelihood
function) the data to the process and encodes bias and error in the data col-
lection procedure, which requires knowledge of how the data were collected.
Finally the prior layer quantifies uncertainty about the model parameters at
the onset of the analysis.
Building a model hierarchically is convenient, but not fundamentally dif-
ferent from the models we have considered previously. In fact, we have already
encountered many hierarchical models such as the random effects models in
Section 4.4. This means that the computational methods used to sample from the
posterior of those models carry over directly to hierarchical models.
where it is assumed that all infected individuals are removed from the pop-
ulation before the next time step and q is the probability of a non-infected
person coming into contact with and contracting the disease from an infected
individual. The epidemiological process-layer model expresses the disease dy-
namics up to a few unknown parameters. To estimate these parameters, the
number of cases at time t, denoted Yt , is collected. The data layer models
our ability to measure the process It . For example, after discussing the data
collection procedure with domain experts, we might assume there are no false
positives (uninfected people counted as infected) but potentially false nega-
tives (uncounted infected individuals), and thus model the observed count Yt as
a binomial draw from the true number of infections It.
FIGURE 6.2
Directed acyclic graph for the Reed–Frost infectious disease model.
This model is visualized as a DAG in Figure 6.3. The DAG shows how
information moves through the hierarchical model. For example, how do the
data from patient 1 help us predict the bone density for patient 2’s next visit?
To traverse the DAG from Y1 to Y2 requires going through the population
parameters µ and Σ. That is, Y1 informs the model about β 1 which shapes
the random effects distribution that enters the model for β 2 and thus Y2 . If we
only had data from patient 2, then we would likely resort to an uninformative
prior for β 2 and with only a few observations for patient 2 the posterior would
be unstable. However, the hierarchical model allows us to borrow strength
across patients to stabilize the results.
MCMC is a natural choice for fitting hierarchical models; just as hierar-
chical models build complexity by layering simple conditional distributions,
MCMC samples from the complex posterior distribution by sequentially up-
dating parameters from simple full conditional distributions. In fact, display-
ing the hierarchical model as a DAG not only helps understand the model
but it also aids in coding the MCMC sampler because the full conditional
distribution for a parameter depends only on terms with an arrow to or from
the parameter’s node in the DAG. For example, from Figure 6.3 it is clear
that the full conditional distribution for β 2 depends only on the data-layer
terms for Y21 , ..., Y2m |β 2 and the process-layer term for β 2 |µ, Σ. If we view
the model only through these terms then it is immediately clear that the full
conditional distribution of β 2 is exactly the full conditional distribution of the
regression coefficients in standard Bayesian linear regression (Section 4.2).
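As a sketch of that update (our own illustration using the standard conjugate-normal algebra, not code from the case studies), one Gibbs step for β2 given the patient-2 data (y2, X2), the error variance σ², and the current values of µ and Σ is:

# Full conditional draw: y2 | beta2 ~ N(X2 beta2, sigma2*I), beta2 ~ N(mu, Sigma)
update_beta2 <- function(y2, X2, sigma2, mu, Sigma){
   Sigma_inv <- solve(Sigma)
   V <- solve(crossprod(X2)/sigma2 + Sigma_inv)              # posterior covariance
   m <- V %*% (crossprod(X2, y2)/sigma2 + Sigma_inv %*% mu)  # posterior mean
   as.vector(m + t(chol(V)) %*% rnorm(length(mu)))           # draw from N(m, V)
}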
When to stop adding layers? The hierarchical models in this section
have three levels: data, process and prior. However, it is likely that the values that define
FIGURE 6.3
Directed acyclic graph. Visual representation of the random slopes model
Yij |β i ∼ Normal(Xj β i , σ 2 ) for i = 1, ..., n and j = 1, ..., 4 with random effect
distribution β i |µ, Σ ∼ Normal(µ, Σ) and priors µ ∼ Normal(0, c2 I2 ) and
Σ ∼ InvWishart(ν, Ω).
the prior layer will not be known exactly, and it is tempting to add a fourth
(and a fifth, etc.) layer to explain this uncertainty. A general rule of thumb
is to stop adding layers when there is no replication to estimate parameters.
For example, referring to the DAG in Figure 6.3, it is reasonable to add a
layer to estimate the random effect mean µ and covariance Σ because there
are repeated random effects β 1 , ..., β n that can be leveraged to estimate these
parameters. However, adding an additional layer to estimate prior mean Ω of
the random effect covariance Σ would not be reasonable because there is only
one covariance Σ in the model, and even if we knew Σ exactly we would not
be able to estimate its distribution from a single sample.
The remainder of this chapter is formatted as a sequence of case studies in
hierarchical modeling. The three case studies each pose different challenges:
1. Species distribution mapping via data fusion: Combining in-
formation from multiple data streams while accounting for their
bias and uncertainty
2. Tyrannosaurid growth curves: Pooling information across sub-
populations (species) and quantifying uncertainty in non-linear
models with a small number of observations
3. Marathon analysis with missing data: Accounting for missing
data when performing statistical inference and prediction
In these analyses we demonstrate the flexibility of hierarchical modeling, and
200 Bayesian Statistical Methods
also illustrate complete Bayesian analyses including model and prior specifi-
cation, model comparisons, and presentation of the results.
FIGURE 6.4
2012 Brown-headed nuthatch data. Panels (a) and (b) plot the number of
BBS sampling occasions (N1i) and the number of BBS sightings (Y1i); Panels (c)
and (d) plot the square root of eBird effort (√N2i) and the square root of the
number of eBird sightings (√Y2i). Panel (e) plots the posterior mean abun-
dance, λi , and Panel (f) plots the posterior probability that the occupancy
probability exceeds 0.01, i.e., Prob[1 − exp(−λi ) > 0.01|Y].
mean is N2i λ̃i with rate λ̃i = θ1 λi + θ2 , where θ1 > 0 controls the difference
between observation rates by BBS observers and eBird observers and θ2 > 0 is
the eBird false positive rate so that if the cell is truly uninhabited and λi = 0,
then E(Y2i ) = N2i θ2 . To allow for over-dispersion we fit the model
Listing 6.1
Spatial data fusion model for BHNU abundance.
# Data layer
for(i in 1:n){
   Y1[i]  ~ dbin(phi[i],N1[i])    # BBS sightings
   phi[i] <- 1-exp(-lam[i])

   # eBird sightings: negative binomial with mean N2[i]*lamtilde[i]
   # (assumed parameterization, following the form used in Listing 5.4)
   Y2[i]       ~ dnegbin(q[i],m)
   q[i]        <- m/(m+N2[i]*lamtilde[i])
   lamtilde[i] <- theta1*lam[i] + theta2
}

# Process layer
for(j in 1:p){beta[j] ~ dnorm(beta0,tau)}
for(i in 1:n){
   log(lam[i]) <- inprod(X[i,],beta[])
}

# Prior layer
theta1 ~ dgamma(0.1,0.1)
theta2 ~ dgamma(0.1,0.1)
m      ~ dgamma(0.1,0.1)
tau    ~ dgamma(0.1,0.1)
beta0  ~ dnorm(0,1)
effective sample size is greater than 1,000 for all βjk , indicating the sampler
has mixed well and sufficiently explored the posterior.
Table 6.1 presents the posterior distributions of the hyperparameters. Of
note, the eBird false positive rate, θ2 , is estimated to be near zero, and thus
the eBird data appears to be a reliable source of information. The posterior
mean of λi and the posterior probability that cells are occupied (i.e., at least
one individual is present) are mapped in Figures 6.4e and 6.4f, respectively.
As expected, the estimated abundance is the largest in Georgia and the Car-
olinas, but the occupancy probability is also high farther west in Louisiana
and Arkansas. The occupancy probabilities would be lower in these western
states if the eBird data were excluded.
TABLE 6.1
Posteriors for the BHNU analysis. Posterior medians and 95% intervals
for the final fit with L = 10.
age, for each species. The data exhibit non-linear relationships between age
and mass and there are commonalities between species. We therefore pursue
a non-linear hierarchical model.
The original analysis of these data used non-linear least squares (fitted
curves shown in the left panel of Figure 6.5). Quantifying uncertainty in this
fit is challenging. The sampling distribution of the estimator does not have a
closed form due to the non-linear mean structure, and with only a handful of
observations to estimate roughly the same number of parameters, large-sample
normal approximations are not valid and resampling techniques such as the
bootstrap may have insufficient data to approximate the sampling distribu-
tion. As shown below, a Bayesian analysis powered by MCMC fully quantifies
posterior uncertainty.
Let Yij and Xij be the body mass and age, respectively, of sample i from
species j = 1, ..., 4. We model the data as

    Yij = fj(Xij) εij,

where fj is the true growth curve for species j and εij > 0 is a multiplicative
error with mean one. We use multiplicative error rather than additive error
because variation in the population likely increases with mass/age. Assuming
the errors are log-normal with log(εij) ∼ Normal(−σj²/2, σj²), then E(εij) = 1
as required and the model becomes

    log(Yij) ∼ Normal( log[fj(Xij)] − σj²/2, σj² ),

so that E(Yij) = fj(Xij), where σj² controls the error variance for species j.
The data in Figure 6.5 (left) clearly exhibit nonlinearity. However, after
taking a log transformation of both mass and age their relationship is fairly
linear (Figure 6.5, right). Therefore, one model we consider is the log-linear
model
log[fj (X)] = aj + bj log(X) (6.8)
where aj and bj are the intercept and slope, respectively, for species j. On
the original scale, the corresponding growth curve is fj(X) = exp(aj) X^bj.
FIGURE 6.5
Tyrannosaurid growth curve data. Panel (a) gives scatter plots of the
estimated age and body mass (kg) of 20 samples of four tyrannosaurid species;
Panel (b) plots the same data after a log transformation of both variables. The
curves plotted in Panel (a) are the fitted logistic curves from [25] and the lines
in Panel (b) are least squares fits.
The second model we consider is the logistic growth curve

    fj(X) = aj + bj exp[dj(X − cj)] / ( 1 + exp[dj(X − cj)] ).            (6.9)
Listing 6.2
JAGS code for hierarchical growth curve modeling.
# n is the total number of observations for all species
# x[i] is the log age of individual i
# y[i] is the log mass of individual i
# sp[i] is the species number (1, 2, 3, or 4) of individual i

# Data layer
for(i in 1:n){
   y[i]    ~ dnorm(muY[i],taue)
   muY[i]  <- log(a[sp[i]] + b[sp[i]]/(1+exp(-part[i]))) - 0.5/taue
   part[i] <- (x[i]-c[sp[i]])/d[sp[i]]
}

# Process layer
for(j in 1:N){
   a[j] <- exp(alpha[j,1])
   b[j] <- exp(alpha[j,2])
   c[j] <- alpha[j,3]
   d[j] <- exp(alpha[j,4])
   # Species-level parameters pooled across species (assumed form)
   for(k in 1:4){alpha[j,k] ~ dnorm(mu[k],tau[k])}
}

# Prior layer
for(k in 1:4){
   mu[k]  ~ dnorm(0,0.1)
   tau[k] ~ dgamma(0.1,0.1)
}
taue ~ dgamma(0.1,0.1)
We consider two priors. In the first (“unpooled”) analysis each species is analyzed
separately with uninformative priors, e.g., σj² ∼ InvGamma(0.1, 0.1). The normal
prior is replaced with the log-normal prior
for aj , bj and dj to ensure these parameters are positive and thus fj (X) is
positive and increasing for all X. The second prior (“pooled”) is a Bayesian
hierarchical model that borrows information across the four species. In the
pooled analysis we assume the variance is the same for all species, σj2 = σ 2 and
has uninformative prior σ 2 ∼ InvGamma(0.1, 0.1). For the log-linear model
priors for the intercepts are aj ∼ Normal(µa , σa2 ), where µa ∼ Normal(0, 10)
and σa2 ∼ InvGamma(0.1, 0.1). The same hierarchical model is applied to the
log(aj ), log(bj ), cj and log(dj ) in the logistic model. The JAGS code for this
model is given in Listing 6.2.
This hierarchical model treats the parameters across the four species as
random effects, and learning about the random effects distribution (i.e., µa and
σa2 ) stabilizes the posterior by providing additional information via the priors.
It is debatable whether these parameters are truly random effects, i.e., whether
there is an infinite distribution of exchangeable species from which these four
FIGURE 6.6
Fitted log-linear growth curves – unpooled. Observations (points) ver-
sus the posterior mean (solid lines) and 95% intervals (dashed lines) of the
tyrannosaurid growth curves for the unpooled log-linear model.
were randomly selected for the study. However, analyzing the data from these
four species using a random effects model clearly improves the results (as
shown below) by pooling information across species to reduce uncertainty.
We fit the model with log-linear and logistic growth curves, each fit both separately
by species (unpooled) and using a hierarchical model (pooled). DIC (pD) for
the four fits are: 29 (25) for log-linear unpooled, -3 (9) for log-linear pooled,
64 (41) for logistic unpooled and -2 (12) for logistic pooled. The pooled mod-
els reduce model complexity (as measured by pD ) and this leads to smaller
(better) DIC. DIC for the log-linear and logistic growth curves are similar.
Figures 6.6–6.9 plot the posterior mean and pointwise 95% credible interval
for fj for each model and each species (the interval estimates are for fj and not
Yij , so they should not include 95% of the observations). The posterior means
of the four methods are fairly similar and all fit the data well. The main
difference between the fits is that by borrowing information across species,
the pooled analyses have narrower credible sets. Visually, the log-linear fits
in Figure 6.9 appear to sufficiently model the growth curves. However, given
that the logistic curve fits nearly as well and possesses the intuitive property
of plateauing at an advanced age, this model is arguably preferable when
considering the entire life course.
FIGURE 6.7
Fitted log-linear growth curves – pooled. Observations (points) versus
the posterior mean (solid lines) and 95% intervals (dashed lines) of the tyran-
nosaurid growth curves for the pooled (hierarchical) log-linear model.
FIGURE 6.8
Fitted logistic growth curves – unpooled. Observations
(points) versus the posterior mean (solid lines) and 95% intervals (dashed
lines) of the tyrannosaurid growth curves for the unpooled logistic model.
FIGURE 6.9
Fitted logistic growth curves – pooled. Observations (points)
versus the posterior mean (solid lines) and 95% intervals (dashed lines) of the
tyrannosaurid growth curves for the pooled (hierarchical) logistic model.
FIGURE 6.10
Missing data analysis of the 2016 Boston Marathon data. Panel (a)
shows the missing (black) and non-missing (white) Xij by runner (i) and mile
(j). Panel (b) plots the observed (points) standardized covariates (Xij ) and
the posterior distributions (boxplots) of the missing covariates for two runners.
Panels (c) and (d) plot the posterior distribution of each regression coefficient,
βj , for the missing-data model and complete-case analysis, respectively.
6.5 Exercises
1. Since full conditional distributions are used in many MCMC al-
gorithms, it is tempting to specify the model via its conditional
Listing 6.3
JAGS model statement for the missing data analysis of the marathon data.
# Likelihood
for(i in 1:n){
   Y[i] ~ dnorm(alpha + inprod(X[i,],beta[]),taue)
}

# Missing-data model
for(i in 1:n){
   X[i,1] ~ dnorm(0,tau1)
   for(j in 2:p){
      X[i,j] ~ dnorm(rho*X[i,j-1],tau2)
   }
}

# Priors
alpha ~ dnorm(0,0.01)
for(j in 1:p){
   beta[j] ~ dnorm(0,0.01)
}
taue ~ dgamma(0.1, 0.1)
tau1 ~ dgamma(0.1, 0.1)
tau2 ~ dgamma(0.1, 0.1)
rho  ~ dnorm(0, 0.01)
> library(rmeta)
> data(cochrane)
> cochrane
name ev.trt n.trt ev.ctrl n.ctrl
1 Auckland 36 532 60 538
2 Block 1 69 5 61
3 Doran 4 81 11 63
4 Gamsu 14 131 20 137
5 Morrison 3 67 7 59
6 Papageorgiou 1 71 7 75
7 Tauesch 8 56 10 71
The data are from seven randomized trials that evaluate the effect
of corticosteroid therapy on neonatal death. For trial i ∈ {1, ..., 7}
denote Yi0 as the number of events in the Ni0 control-group patients
and Yi1 as the number of events in the Ni1 treatment-group patients.
(a) Fit the model Yij |θj ∼ Binomial(Nij , θj ), independent over i and j,
    with θ0 , θ1 ∼ Uniform(0, 1). Can we conclude that the treatment
    reduces the event rate?
(b) Fit the model Yij |θij ∼ Binomial(Nij , θij ), independent over i and j,
    with logit(θij ) = αij , αi = (αi0 , αi1 )ᵀ ∼ Normal(µ, Σ) iid over trials,
    µ ∼ Normal(0, 10²I2 ), and Σ ∼ InvWishart(3, I2 ). Summarize the
    evidence that the treatment reduces the death rate.
(c) Draw a DAG for these two models.
(d) Discuss the advantages and disadvantages of both models.
> library(geoR)
> data(gambia)
> ?gambia
The data consist of 2,035 children that live in 65 villages. For village
v ∈ {1, ..., 65}, denote nv as the number of children in the sample,
Yv as the number of children that tested positive for malaria, and
pv as the true probability of testing positive for malaria. We use
the spatial model αv = logit(pv ), where α = (α1 , ..., α65 )T follows
a multivariate normal distribution with mean E(αv ) = µ, variance
V(αv ) = σ 2 , and correlation Cor(αu , αv ) = exp(−duv /ρ), where
duv is the distance between villages u and v. For priors assume µ ∼
Normal(0, 102 ), σ 2 ∼ InvGamma(0.1, 0.1), and ρ ∼ Uniform(0, d∗ )
where d∗ is the maximum distance between villages.
(a) Specify the data layer, process layer and prior layer for this
hierarchical model.
(b) Fit the model using JAGS and assess convergence.
(c) Summarize the data and results using five maps: the sample
size nv , the sample proportion Yv /nv , the posterior means of
the pv , the posterior standard deviations of the pv , and the
posterior probabilities that pv exceeds 0.5.
7
Statistical properties of Bayesian methods
CONTENTS
7.1 Decision theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
7.2 Frequentist properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
7.2.1 Bias-variance tradeoff . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
7.2.2 Asymptotics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
7.3 Simulation studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
7.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
Different loss functions give different Bayes rule estimators. Absolute loss
gives the posterior median as the Bayes rule estimator. More generally, if
over-estimation and under-estimation are weighted differently, a useful loss
function is the asymmetric check-loss function
$$l(\theta, \hat\theta(\mathbf{Y})) = \begin{cases} (1-\tau)\left[\hat\theta(\mathbf{Y}) - \theta\right] & \text{if } \hat\theta(\mathbf{Y}) > \theta \\ \tau\left[\theta - \hat\theta(\mathbf{Y})\right] & \text{if } \hat\theta(\mathbf{Y}) < \theta \end{cases} \qquad (7.1)$$
for τ ∈ (0, 1). In this loss function, if τ is close to zero, then over-estimation
(top row) is penalized more than under-estimation (bottom row). The Bayes
rule for this loss function is the τ th posterior quantile. If θ is a discrete random
variable, then the MAP estimator θ̂(Y) = arg maxθ f (θ|Y) is the Bayes rule
under zero-one loss l(θ, θ̂(Y)) = I[θ ≠ θ̂(Y)].
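For MCMC output, each of these Bayes rules is a one-line summary of the posterior draws. The following R sketch is purely illustrative (the draws and the value τ = 0.1 below are assumptions, not an example from the text):

# Hypothetical posterior draws of a continuous parameter theta
set.seed(1)
theta <- rgamma(10000, shape = 2, rate = 1)

post_mean   <- mean(theta)                  # Bayes rule under squared-error loss
post_median <- median(theta)                # Bayes rule under absolute loss
tau         <- 0.1                          # weight in the check-loss function (7.1)
post_check  <- quantile(theta, probs = tau) # Bayes rule under check loss: tau-th quantile

# For a discrete parameter, the MAP estimate is the Bayes rule under zero-one loss
theta_disc <- rpois(10000, lambda = 3)      # hypothetical draws of a discrete parameter
post_map   <- as.numeric(names(which.max(table(theta_disc))))

c(post_mean, post_median, post_check, post_map)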
Decision theory can also formalize Bayesian hypothesis testing. Let H ∈
{0, 1} denote the true state, with H = 0 if the null hypothesis is true and
H = 1 if the alternative hypothesis is true. A Bayesian analysis produces
posterior probabilities Prob(H = h|Y) for h ∈ {0, 1} (see Section 5.2). The
decision to be made is which hypothesis to select. Let d(Y) = 0 if we select
the null hypothesis and d(Y) = 1 if we select the alternative hypothesis.
To determine the Bayes rule for d(Y) we must specify the loss incurred if
we select the wrong hypothesis. If we assign a Type I error the loss λ1 and a
Type II error the loss λ2 , the loss function can be written
$$l(H, d(\mathbf{Y})) = \begin{cases} 0 & \text{if } H = d(\mathbf{Y}) \\ \lambda_1 & \text{if } H = 0 \text{ and } d(\mathbf{Y}) = 1 \\ \lambda_2 & \text{if } H = 1 \text{ and } d(\mathbf{Y}) = 0. \end{cases} \qquad (7.2)$$
Therefore, the Bayes rule is to reject the null hypothesis and conclude that
the alternative is true, i.e., select d(Y) = 1, if the posterior probability of the
alternative hypothesis exceeds
$$\text{Prob}(H = 1|\mathbf{Y}) > \frac{\lambda_1}{\lambda_1 + \lambda_2}. \qquad (7.4)$$
Note that, unlike classical hypothesis testing, if we swap the roles of the hypotheses the decision rule remains the same. Often the loss of a Type I error is assumed to be larger than that of a Type II error, e.g., λ1 = 10λ2, so that the null hypothesis is rejected only if the posterior probability of the alternative hypothesis is near one.
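As a small numerical illustration (the posterior probability and loss values below are assumptions chosen for demonstration), the decision rule can be applied in R as:

prob_H1 <- 0.93                               # hypothetical Prob(H = 1 | Y) from MCMC
lambda1 <- 10                                 # loss for a Type I error (assumed)
lambda2 <- 1                                  # loss for a Type II error (assumed)

threshold <- lambda1 / (lambda1 + lambda2)    # reject H0 only above this probability
if(prob_H1 > threshold){ "select H1" } else { "select H0" }
# Here 0.93 > 10/11 = 0.909, so the alternative hypothesis is selected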
These decision-theory results hold for prediction as well. For example, if
predictions are evaluated using squared-error prediction loss, then the mean
of the posterior predictive distribution is the Bayes rule. Also, if the response
is binary then the Bayes rule for classification is to predict Y pred = 1 if the
posterior predictive probability Prob(Y pred = 1|Y) > λ1 /(λ1 + λ2 ), where
λ1 is the loss for a false positive and λ2 is the loss of a false negative. In
this case, it might be reasonable to set λ1 = λ2 and predict Y pred = 1 if
Prob(Y pred = 1|Y) > 0.5.
If the bias is positive, then on average the estimator over-estimates the parameter, and vice versa. All else being equal, an estimator with bias equal to zero (an unbiased estimator) is preferred. However, the bias only evaluates the center of the sampling distribution and ignores its spread. Therefore we also study the variance of the sampling distribution, V[θ̂(Y)]. For two estimators that are unbiased we prefer low variance because this means the sampling distribution is concentrated around the true parameter; on the other hand, small variance can be undesirable for biased estimators (Figure 7.1). The most common method of comparison is the mean squared error,
$$\text{MSE}[\hat\theta(\mathbf{Y})] = E\left\{[\hat\theta(\mathbf{Y}) - \theta_0]^2\right\} = \text{Bias}[\hat\theta(\mathbf{Y})]^2 + V[\hat\theta(\mathbf{Y})],$$
which combines the bias and the variance.
FIGURE 7.1
Hypothetical sampling distributions. Of the four hypothetical sampling distributions, the first two are unbiased and the last two are biased, and the first and third have small variance while the second and fourth have large variance. The true value θ0 is denoted with the vertical line at θ = 0.5. The mean squared errors of the four estimators are 0.01, 0.25, 0.26 and 0.50, respectively.
An estimator is consistent if it converges in probability to the true value, i.e., if the probability of the estimator being within ε (for any ε > 0) of the true value increases to one as the sample size increases to infinity. An asymptotically unbiased estimator is consistent if its variance decreases to zero with the sample size.
Frequentist evaluation of statistical methods extends to interval estimation
and testing. For Bayesian credible intervals we evaluate their frequentist cov-
erage probability, i.e., the probability that they include the true value when
applied repeatedly to many datasets. Similarly, for a Bayesian testing proce-
dure we compute the probability of making Type I and Type II errors when
the test is applied to many random datasets.
TABLE 7.1
Bias-variance tradeoff for a normal-mean analysis. Assuming Y1, ..., Yn ∼ iid Normal(θ, σ²), this table gives the bias, variance and mean squared error (MSE) of the estimators θ̂1(Y) = Ȳ and θ̂2(Y) = cȲ, where c = n/(n + m) ∈ [0, 1] and θ0 is the true value.

Estimator        Bias          Variance     MSE
θ̂1(Y) = Ȳ       0             σ²/n         σ²/n
θ̂2(Y) = cȲ      (c − 1)θ0     c²σ²/n       c²σ²/n + (1 − c)²θ0²

The first estimator is the usual sample mean (i.e., the MLE or the posterior mean under Jeffreys prior) and the second is the posterior mean assuming the prior θ ∼ Normal(0, σ²/m).
Table 7.1 gives the bias, variance and MSE of the two estimators. The sample mean is unbiased but always has larger variance than the Bayesian estimator. The relative MSE is
$$\frac{\text{MSE}[\hat\theta_2(\mathbf{Y})]}{\text{MSE}[\hat\theta_1(\mathbf{Y})]} = c^2 + \frac{n(1-c)^2\theta_0^2}{\sigma^2}.$$
The Bayesian procedure is preferred when this ratio is less than one. When θ0 = 0, the prior mean, the ratio is less than one for all n and m. That is, when the prior mean is the true value, the Bayesian estimator is preferred over the sample mean. The Bayesian estimator can be preferred even if the prior mean is not exactly the true value. As long as the true value is close enough to zero, namely
$$|\theta_0| < \sigma\sqrt{\frac{1}{n} + \frac{2}{m}} = B(n, m), \qquad (7.9)$$
the Bayesian estimator is preferred. The bound B(n, m) shrinks to zero as the sample size increases, and so in this case the advantage of the Bayesian approach is for small datasets. As n → ∞ or m → 0 the relative MSE converges to
one, and thus for large sample sizes or uninformative priors, the two estimators
perform similarly.
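The quantities in Table 7.1 and the bound in (7.9) are simple to evaluate numerically; the R sketch below uses illustrative values of n, m, σ and θ0 (assumptions chosen for demonstration):

n      <- 10
m      <- 5
sigma  <- 1
theta0 <- 0.3                          # hypothetical true mean
cc     <- n / (n + m)                  # shrinkage factor c

mse1 <- sigma^2 / n                                  # MSE of the sample mean
mse2 <- cc^2 * sigma^2 / n + (1 - cc)^2 * theta0^2   # MSE of the Bayesian estimator
mse2 / mse1                                          # relative MSE; < 1 favors shrinkage

B <- sigma * sqrt(1/n + 2/m)           # the bound B(n, m) in (7.9)
abs(theta0) < B                        # TRUE here, so the Bayesian estimator is preferred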
7.2.2 Asymptotics
In Section 3.1.3 we discussed the Bayesian central limit theorem, which states that under general conditions the posterior converges to a normal distribution as the sample size increases. It can also be shown that under these conditions, and assuming the prior includes the true value, the sampling distribution of the posterior mean θ̂B = E(θ|Y) converges (as n → ∞) to the sampling distribution of the maximum likelihood estimator,
$$\hat{\boldsymbol\theta}_B \sim \text{Normal}\left(\boldsymbol\theta_0, \hat\Sigma_{MLE}\right), \qquad (7.10)$$
where θ0 is the true value and Σ̂MLE is defined in Section 3.1.3. Therefore, the posterior mean is asymptotically unbiased. Further, it follows from classic results for maximum likelihood estimators that its variance decreases to zero as the sample size increases, and thus the posterior mean is a consistent estimator of θ under essentially the same conditions as the maximum likelihood estimator.
An asymptotic property that is uniquely Bayesian is posterior consistency.
For a given dataset Y, the posterior probability that θ is within ε of the true value is
$$\text{Prob}(||\boldsymbol\theta - \boldsymbol\theta_0|| < \varepsilon \mid \mathbf{Y}). \qquad (7.11)$$
A Bayesian procedure is said to possess posterior consistency if this proba-
bility is assured to converge to one as the sample size increases (Figure 7.2).
Appendix A.3 provides a proof that the posterior distribution is consistent
for parameters with discrete support under very general conditions on the
likelihood and prior, and posterior consistency has been established for most
finite-dimensional problems and many Bayesian nonparametric methods. For
a more thorough discussion of posterior consistency see [37].
Both the Bayesian CLT and the posterior consistency result in Appendix A.3 hold for any prior, as long as the prior does not change with the sample size and has positive mass/density around the true value. This supports the argument that, for large datasets, any reasonable prior distribution should lead to approximately the same statistical inference, and that the posterior will concentrate around the true value.
FIGURE 7.2
Illustration of posterior consistency. The data are generated as Y1, ..., YN ∼ iid Normal(θ0, 1) with N = 2,500 and θ0 = 1, and we fit the Bayesian model with a flat (improper) prior for the mean and the variance fixed at one, using the first n observations Yn = (Y1, ..., Yn). The left panel plots the posterior f(θ|Yn) for n = 100, 250, 500, 1000 and 2500, and the right panel plots the corresponding posterior probability Pn = Prob(|θ − θ0| < ε|Yn) for ε = 0.01, 0.05 and 0.1.
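The quantities in Figure 7.2 can be computed directly because, with a flat prior and the variance fixed at one, the posterior based on the first n observations is Normal(Ȳn, 1/n). A short R sketch (the seed and the value of ε below are assumptions):

set.seed(919)
theta0 <- 1
N      <- 2500
Y      <- rnorm(N, theta0, 1)
eps    <- 0.1

n  <- seq(10, N, by = 10)
Pn <- rep(NA, length(n))
for(i in seq_along(n)){
  Ybar  <- mean(Y[1:n[i]])                         # posterior mean for sample size n[i]
  Pn[i] <- pnorm(theta0 + eps, Ybar, 1/sqrt(n[i])) -
           pnorm(theta0 - eps, Ybar, 1/sqrt(n[i])) # Prob(|theta - theta0| < eps | Yn)
}
plot(n, Pn, type = "l", ylab = "Pn")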
Other summaries of the sampling distribution such as bias and coverage are computed similarly. Of course, this is only a Monte Carlo estimate of the true MSE and a different simulation experiment will give a different estimate. Therefore, the approximation should be accompanied by a standard error, s_MSE/√S, where s_MSE is the sample standard deviation of the S squared errors [θ̂(Y1) − θ0]², ..., [θ̂(YS) − θ0]².
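For instance, the following R sketch approximates the MSE of the sample mean and attaches its Monte Carlo standard error (the true value, sample size and number of datasets are illustrative assumptions):

set.seed(1)
S      <- 1000       # number of simulated datasets
n      <- 25         # sample size per dataset
theta0 <- 2          # true value

sq_err <- rep(NA, S)
for(s in 1:S){
  Y         <- rnorm(n, theta0, 1)          # simulate a dataset
  theta_hat <- mean(Y)                      # estimator under study (sample mean)
  sq_err[s] <- (theta_hat - theta0)^2       # squared error for this dataset
}

MSE    <- mean(sq_err)                      # Monte Carlo estimate of the MSE
SE_MSE <- sd(sq_err) / sqrt(S)              # its Monte Carlo standard error
c(MSE, SE_MSE)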
A typical simulation study will generate data from a few values of θ0 and
n to understand when the method performs well and when it does not. Note
that if θ̂(Y) is a Bayesian estimator, say the posterior mean, then computing
each θ̂(Ys ) may require MCMC and thus Bayesian simulation experiments
may need many applications of MCMC and can be time consuming (running
the S chains in parallel is obviously helpful). For a thorough description of
simulation studies, see [13] (Chapter 9).
As an example, we conduct a simulation study to compare ordinary least
squares (OLS) regression with Bayesian LASSO regression (BLR; Section
4.2). The sampling distribution of the OLS estimator is known (multivari-
ate student-t) but the sampling distribution of the posterior mean under the
BLR model is quite complicated and difficult to study without simulation.
We generate Xij ∼ iid Normal(0, 1) and Yi|Xi ∼ Normal(Σ_{j=1}^p Xij βj0, σ0²). The data are generated with true values σ0 = 1, the first p0 elements of β0 equal to zero, and the final p1 elements equal to one, so that p = p0 + p1 = 20.
We generate S = 100 datasets each from six combinations of n, p0 and p1 us-
ing the R code in Listing 7.1. Each dataset is analyzed using least squares (the
lm function in R) and Bayesian LASSO (the BLR function in R with default
values). For dataset s = 1, ..., S, let β̂js be the estimate of βj (either the least
squares solution for OLS or the posterior mean for BLR) and vjs its estimated
variance (either the squared standard error for OLS or the posterior variance
for BLR). Methods are compared using
$$\text{Bias} = \frac{1}{Sp}\sum_{s=1}^{S}\sum_{j=1}^{p}\left(\hat\beta_{js} - \beta_{j0}\right), \qquad
\text{Variance} = \frac{1}{Sp}\sum_{s=1}^{S}\sum_{j=1}^{p} v_{js},$$
$$\text{MSE} = \frac{1}{Sp}\sum_{s=1}^{S}\sum_{j=1}^{p}\left(\hat\beta_{js} - \beta_{j0}\right)^2, \qquad
\text{Coverage} = \frac{1}{Sp}\sum_{s=1}^{S}\sum_{j=1}^{p} I\left(|\hat\beta_{js} - \beta_{j0}| < 2\sqrt{v_{js}}\right).$$
Listing 7.1
R simulation study code.
1 # Set up the simulation
2 library(BLR)
3 n <- 25 # Sample size
4 p_null <- 15 # Number of null covariates
5 p_act <- 5 # Number of active covariates
6 nsims <- 100 # Number of simulated datasets
7 sigma <- 1 # True value of sigma
8 beta <- c(rep(0,p_null),rep(1,p_act)) # True beta
9
40 E <- sweep(EST2,2,beta,"-")
41 MSE <- c(MSE,mean(E^2))
42 BIAS <- c(BIAS,mean(E))
43 VAR <- c(VAR,mean(VAR2))
44 COV <- c(COV,mean(abs(E/sqrt(VAR2))<2))
45
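Lines 9–39 of Listing 7.1 (the data generation and model fitting) are omitted above. The sketch below shows one way those lines could be written; it assumes EST1/VAR1 and EST2/VAR2 store the OLS and BLR estimates and variances, and that BLR returns the posterior means and standard deviations of the coefficients in elements named bL and SD.bL. This is illustrative, not necessarily the code used in the text.

p    <- p_null + p_act                   # total number of covariates
MSE  <- BIAS <- VAR <- COV <- NULL       # metric vectors appended to in lines 40-44
EST1 <- VAR1 <- matrix(NA, nsims, p)     # OLS estimates and their variances
EST2 <- VAR2 <- matrix(NA, nsims, p)     # Bayesian LASSO estimates and variances

for(sim in 1:nsims){
  X <- matrix(rnorm(n * p), n, p)                  # covariates
  Y <- as.vector(X %*% beta + rnorm(n, 0, sigma))  # response

  ols        <- lm(Y ~ X - 1)                      # ordinary least squares
  EST1[sim,] <- coef(ols)
  VAR1[sim,] <- diag(vcov(ols))

  # Bayesian LASSO; bL and SD.bL are assumed names for the posterior summaries
  fit        <- BLR(y = Y, XL = X, nIter = 5000, burnIn = 1000)
  EST2[sim,] <- fit$bL
  VAR2[sim,] <- fit$SD.bL^2
}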
TABLE 7.2
Simulation study results. The simulation study compares ordinary least
squares (“OLS”) with Bayesian LASSO regression (“BLR”) for estimating
regression coefficients in terms of bias, variance, mean squared error (“MSE”)
and coverage of 95% intervals (all metrics are averaged over covariates and
datasets). The simulations vary based on the sample size (n), the number of
null covariates (p0 ) and the number of active covariates (p1 ). All values are
multiplied by 100.
method has larger bias and thus larger MSE than OLS, and the empirical coverage of the Bayesian credible sets dips below the nominal level. For the large sample size cases (n = 100), these same trends are apparent but the differences between the methods are smaller, as expected.
7.4 Exercises
1. Assume Y |µ ∼ Normal(µ, 2) and µ ∼ Normal(0, 2) (i.e., n = 1).
The objective is to test the null hypothesis H0 : µ ≤ 0 versus
the alternative hypothesis that H1 : µ > 0. We will reject H0 if
Prob(H1 |Y ) > c.
(a) Compute the posterior of µ.
(b) What is the optimal value of c if Type I and Type II errors
have the same costs?
(c) What is the optimal value of c if a Type I error costs ten times
more than a Type II error?
(d) Compute the Type I error rate of the test as a function of c.
(e) How would you pick c to control Type I error at 0.05?
2. Given data from a small pilot study, your current posterior probabil-
ity that the new drug your company has developed is more effective
Discrete uniform
Notation: X ∼ DiscreteUniform(a, b)
Support: X ∈ {a, a + 1, ..., b}
PMF: 1/(b − a + 1)
Mean: (a + b)/2
Variance: [(b − a + 1)² − 1]/12
Binomial
Notation: X ∼ Binomial(n, θ)
Support: X ∈ {0, 1, ..., n}
Parameters: n ∈ {1, 2, ...}, θ ∈ [0, 1]
PMF: (n choose x) θ^x (1 − θ)^(n−x)
Mean: nθ
Variance: nθ(1 − θ)
Notes: If X is the number of successes in n independent trials, each with success probability θ, then X ∼ Binomial(n, θ).
Beta-binomial
Notation: X ∼ BetaBinomial(n, a, b)
Support: X ∈ {0, 1, ..., n}
Parameters: n ∈ {1, 2, ...}, a, b > 0
PMF: [Γ(n+1)Γ(x+a)Γ(n−x+b)Γ(a+b)] / [Γ(x+1)Γ(n−x+1)Γ(n+a+b)Γ(a)Γ(b)]
Mean: na/(a + b)
Variance: nab(a + b + n)/[(a + b)²(a + b + 1)]
Negative Binomial
Notation: X ∼ NegBinomial(θ, m)
Support: X ∈ {0, 1, 2, ...}
Parameters: m > 0, θ ∈ [0, 1]
PMF: (x + m − 1 choose x) θ^m (1 − θ)^x
Mean: m(1 − θ)/θ
Variance: m(1 − θ)/θ²
Poisson
Notation: X ∼ Poisson(θ)
Support: X ∈ {0, 1, 2, ...}
Parameters: θ > 0
PMF: θ^x exp(−θ)/x!
Mean: θ
Variance: θ
Notes: If events occur uniformly over time (space) with the expected number of events in a given time interval (region) equal to θ, then the number of events that occur in the interval (region) follows a Poisson(θ) distribution.
Multivariate discrete

Multinomial
Notation: X = (X1, ..., Xp) ∼ Multinomial(n, θ)
Support: Xj ∈ {0, 1, ..., n} with X1 + ... + Xp = n
Parameters: θ = (θ1, ..., θp) with θj ∈ [0, 1] and θ1 + ... + θp = 1
PMF: [n!/(x1! · · · xp!)] θ1^{x1} · · · θp^{xp}
Mean: E(Xj) = nθj
Variance: V(Xj) = nθj(1 − θj)
Covariance: Cov(Xj, Xk) = −nθjθk
Marginal distributions: Xj ∼ Binomial(n, θj)
Notes: If n independent trials each have p possible outcomes with the probability of outcome j being θj and Xj is the number of the trials that result in outcome j, then X = (X1, ..., Xp) ∼ Multinomial(n, θ).
Univariate continuous

Uniform
Notation: X ∼ Uniform(a, b)
Support: X ∈ [a, b]
Parameters: a < b
PDF: 1/(b − a)
Mean: (a + b)/2
Variance: (b − a)²/12
Notes: If X1, X2 ∼ iid Uniform(0, 1) then √(−2 log(X1)) cos(2πX2) ∼ Normal(0, 1); if X ∼ Uniform(0, 1) and F is a continuous CDF, then F⁻¹(X) has CDF F.
Beta
Notation: X ∼ Beta(a, b)
Support: X ∈ [0, 1]
Parameters: a > 0, b > 0
PDF: [Γ(a + b)/(Γ(a)Γ(b))] x^{a−1}(1 − x)^{b−1}
Mean: a/(a + b)
Variance: ab/[(a + b)²(a + b + 1)]
Notes: If a = b = 1 then X ∼ Uniform(0, 1); 1 − X ∼ Beta(b, a).
Gamma
Notation: X ∼ Gamma(a, b)
Support: X ∈ [0, ∞)
Parameters: shape a > 0, scale b > 0
PDF: [b^a/Γ(a)] x^{a−1} exp(−bx)
Mean: a/b
Variance: a/b²
Inverse gamma
Notation: X ∼ InvGamma(a, b)
Support: X ∈ [0, ∞)
Parameters: a > 0, b > 0
PDF: [b^a/Γ(a)] x^{−(a+1)} exp(−b/x)
Mean: b/(a − 1) (if a > 1)
Variance: b²/[(a − 1)²(a − 2)] (if a > 2)
Notes: If Y ∼ Gamma(a, b) then 1/Y ∼ InvGamma(a, b).
Normal/Gaussian
Notation: X ∼ Normal(µ, σ²)
Support: X ∈ (−∞, ∞)
Parameters: location µ ∈ (−∞, ∞), scale σ > 0
PDF: [1/(√(2π)σ)] exp[−(x − µ)²/(2σ²)]
Mean: µ
Variance: σ²
Student's t
Notation: X ∼ t_ν(µ, σ²)
Support: X ∈ (−∞, ∞)
Parameters: location µ ∈ (−∞, ∞), scale σ > 0, degrees of freedom ν > 0
PDF: {Γ[(ν + 1)/2]/[Γ(ν/2)√(νπ)σ]} [1 + (x − µ)²/(νσ²)]^{−(ν+1)/2}
Mean: µ (if ν > 1)
Variance: σ²ν/(ν − 2) (if ν > 2)
Notes: c + dX ∼ t_ν(c + dµ, d²σ²); if µ = 0 and σ² = 1 then X follows the standard t distribution; if Z ∼ Normal(0, 1) independent of W ∼ Gamma(ν/2, 1/2) then µ + σZ/√(W/ν) ∼ t_ν(µ, σ²); if ν = 1 then X follows the Cauchy distribution; X is approximately Normal(µ, σ²) for large ν.
Laplace/Double exponential
Notation: X ∼ DE(µ, σ)
Support: X ∈ (−∞, ∞)
Parameters: location µ ∈ (−∞, ∞), scale σ > 0
PDF: [1/(2σ)] exp(−|x − µ|/σ)
Mean: µ
Variance: 2σ²
Logistic
Notation: X ∼ Logistic(µ, σ)
Support: X ∈ (−∞, ∞)
Parameters: location µ ∈ (−∞, ∞), scale σ > 0
PDF: (1/σ) exp[−(x − µ)/σ] / {1 + exp[−(x − µ)/σ]}²
Mean: µ
Variance: π²σ²/3
Notes: c + dX ∼ Logistic(c + dµ, dσ).
Multivariate continuous
Multivariate normal
Notation: X = (X1 , ..., Xp )T ∼ Normal(µ, Σ)
Support: Xj ∈ (−∞, ∞)
Parameters: mean vector µ = (µ1 , ..., µp ) with µj ∈ (−∞, ∞) and p × p
positive definite covariance matrix Σ
PDF: (2π)^{−p/2} |Σ|^{−1/2} exp[−½(X − µ)^T Σ^{−1}(X − µ)]
Mean: E(Xj ) = µj
Variance: V(Xj ) = σj2 where σj2 is the (j, j) element of Σ
Covariance: Cov(Xj , Xk ) = σjk where σjk is the (j, k) element of Σ
Marginal distributions: Xj ∼ Normal(µj , σj2 )
Notes: For q-vector a and q × p matrix b, a + bX ∼ Normal(a + bµ, bΣbT ).
Multivariate t
Notation: X = (X1, ..., Xp)^T ∼ t_ν(µ, Σ)
Support: Xj ∈ (−∞, ∞)
Parameters: location µ = (µ1, ..., µp) with µj ∈ (−∞, ∞), p × p positive definite matrix Σ and degrees of freedom ν > 0
PDF: {Γ(ν/2 + p/2)/[Γ(ν/2)(νπ)^{p/2}]} |Σ|^{−1/2} [1 + (1/ν)(X − µ)^T Σ^{−1}(X − µ)]^{−(ν+p)/2}
Dirichlet
Notation: X = (X1, ..., Xp) ∼ Dirichlet(θ)
Support: Xj ∈ [0, 1] with X1 + ... + Xp = 1
Parameters: θ = (θ1, ..., θp) with θj > 0
PDF: [Γ(θ1 + ... + θp)/∏_j Γ(θj)] ∏_j xj^{θj−1}
Mean: E(Xj) = θj/(Σ_k θk)
Variance: V(Xj) = θj(Σ_{k≠j} θk)/[(Σ_k θk)²(1 + Σ_k θk)]
Covariance: Cov(Xj, Xk) = −θjθk/[(Σ_k θk)²(1 + Σ_k θk)]
Marginal distributions: Xj ∼ Beta(θj, Σ_{k≠j} θk)
Notes: If Wj ∼ Gamma(θj, b) independently and Xj = Wj/(W1 + ... + Wp), then X = (X1, ..., Xp) ∼ Dirichlet(θ).
Wishart
Notation: X ∼ Wishart(ν, Ω)
Support: X = {Xjk} is a p × p symmetric positive definite matrix
Parameters: degrees of freedom ν > p − 1 and p × p symmetric positive definite matrix Ω = {Ωjk}
PDF: |X|^{(ν−p−1)/2} exp[−trace(Ω^{−1}X)/2] / [2^{νp/2} |Ω|^{ν/2} Γ_p(ν/2)]
Mean: E(Xjk) = νΩjk
Variance: V(Xjk) = ν(Ωjk² + ΩjjΩkk)
Marginal distributions: Xjj/Ωjj ∼ Gamma(ν/2, 1/2)
Notes: If ν is an integer and Z1, ..., Zν ∼ iid Normal(0, Ω), then Σ_{i=1}^ν Zi Zi^T ∼ Wishart(ν, Ω).
Inverse Wishart
Notation: X ∼ InvWishart(ν, Ω)
Support: X = {Xjk} is a p × p symmetric positive definite matrix
Parameters: degrees of freedom ν > p − 1 and p × p symmetric positive definite matrix Ω = {Ωjk}
PDF: |Ω|^{ν/2} |X|^{−(ν+p+1)/2} exp[−trace(ΩX^{−1})/2] / [2^{νp/2} Γ_p(ν/2)]
Mean: E(Xjk) = Ωjk/(ν − p − 1) (for ν > p + 1)
Variance: V(Xjk) = [(ν − p + 1)Ωjk² + (ν − p − 1)ΩjjΩkk] / [(ν − p)(ν − p − 1)²(ν − p − 3)] (for ν > p + 3)
Marginal distributions: Xjj ∼ InvGamma((ν − p + 1)/2, Ωjj/2)
Notes: If Y ∼ Wishart(ν, Ω^{−1}) then Y^{−1} ∼ InvWishart(ν, Ω); if ν = p + 1 and Ω is a diagonal matrix then the correlation Xjk/√(XjjXkk) ∼ Uniform(−1, 1).
A.3: Derivations

Normal-normal model for a mean

Say Yi|µ ∼ iid Normal(µ, σ²) for i = 1, ..., n with σ² known and prior µ ∼ Normal(θ, σ²/m). Since the Y1, ..., Yn are independent, the likelihood factors as
$$f(\mathbf{Y}|\mu) = \prod_{i=1}^n f(Y_i|\mu) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left[-\frac{(Y_i-\mu)^2}{2\sigma^2}\right].$$
Discarding constants that do not depend on µ and expressing the product of exponentials as the exponential of the sum, the likelihood is
$$f(\mathbf{Y}|\mu) \propto \exp\left[-\sum_{i=1}^n \frac{(Y_i-\mu)^2}{2\sigma^2}\right] \propto \exp\left[-\frac{1}{2}\left(-2\frac{n\bar{Y}}{\sigma^2}\mu + \frac{n}{\sigma^2}\mu^2\right)\right],$$
where Ȳ = Σ_{i=1}^n Yi/n. The last equality comes from multiplying the quadratic terms, collecting them as a function of their power of µ, and discarding terms without a µ. Similarly, the prior can be written
$$\pi(\mu) \propto \exp\left[-\frac{m(\mu-\theta)^2}{2\sigma^2}\right] \propto \exp\left[-\frac{1}{2}\left(-2\frac{m\theta}{\sigma^2}\mu + \frac{m}{\sigma^2}\mu^2\right)\right].$$
Because both the likelihood and prior are quadratic in µ they can be combined as
$$p(\mu|\mathbf{Y}) \propto f(\mathbf{Y}|\mu)\pi(\mu) \propto \exp\left[-\frac{1}{2}\left(-2\frac{n\bar{Y}+m\theta}{\sigma^2}\mu + \frac{n+m}{\sigma^2}\mu^2\right)\right] \propto \exp\left[-\frac{1}{2}\left(-2M\mu + \frac{1}{V}\mu^2\right)\right],$$
where M = (nȲ + mθ)/σ² and V = σ²/(n + m). The exponent of the posterior is quadratic in µ, and we have seen that a Gaussian PDF is quadratic in the exponent. Therefore, we rearrange the terms in the posterior to reveal its Gaussian PDF form. Completing the square in the exponent (and discarding and/or adding terms that do not depend on µ) gives
$$p(\mu|\mathbf{Y}) \propto \exp\left[-\frac{1}{2}\left(-2M\mu + \frac{1}{V}\mu^2\right)\right] \propto \exp\left[-\frac{(\mu - VM)^2}{2V}\right].$$
Therefore, the posterior is µ|Y ∼ Normal(VM, V). Plugging in the above expressions for M and V gives
$$\mu|\mathbf{Y} \sim \text{Normal}\left(w\bar{Y} + (1-w)\theta,\ \frac{\sigma^2}{n+m}\right),$$
where w = n/(n + m).
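As a quick numerical check of this result (an illustrative R sketch; the data and prior settings below are assumptions), the closed-form posterior can be compared with the unnormalized product of likelihood and prior evaluated on a grid:

set.seed(1)
n     <- 20
m     <- 5
sigma <- 2
theta <- 1                       # prior mean
Y     <- rnorm(n, 3, sigma)
Ybar  <- mean(Y)
w     <- n / (n + m)

mu_grid  <- seq(0, 6, length = 500)
log_post <- sapply(mu_grid, function(mu)
              sum(dnorm(Y, mu, sigma, log = TRUE)) +          # likelihood
              dnorm(mu, theta, sigma / sqrt(m), log = TRUE))  # prior
post_grid <- exp(log_post - max(log_post))
post_grid <- post_grid / (sum(post_grid) * diff(mu_grid)[1])  # normalize numerically

closed_form <- dnorm(mu_grid, w * Ybar + (1 - w) * theta, sigma / sqrt(n + m))
max(abs(post_grid - closed_form))    # near zero (numerical error only)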
$$p(\boldsymbol\beta|\mathbf{Y}) \propto f(\mathbf{Y}|\boldsymbol\beta)\pi(\boldsymbol\beta)
\propto \exp\left[-\frac{1}{2}(\mathbf{Y}-\mathbf{X}\boldsymbol\beta)^T\mathbf{U}(\mathbf{Y}-\mathbf{X}\boldsymbol\beta) - \frac{1}{2}(\boldsymbol\beta-\boldsymbol\mu)^T\mathbf{V}(\boldsymbol\beta-\boldsymbol\mu)\right]$$
$$\propto \exp\left\{-\frac{1}{2}\left[-2(\mathbf{Y}^T\mathbf{U}\mathbf{X} + \boldsymbol\mu^T\mathbf{V})\boldsymbol\beta + \boldsymbol\beta^T(\mathbf{X}^T\mathbf{U}\mathbf{X} + \mathbf{V})\boldsymbol\beta\right]\right\}
\propto \exp\left[-\frac{1}{2}\left(-2\mathbf{W}^T\boldsymbol\beta + \boldsymbol\beta^T\mathbf{P}\boldsymbol\beta\right)\right]$$
The likelihood, as a function of Σ, can be written
$$f(\mathbf{Y}|\Sigma) \propto \prod_{i=1}^n |\Sigma|^{-1/2}\exp\left[-\frac{1}{2}(\mathbf{Y}_i-\boldsymbol\mu_i)^T\Sigma^{-1}(\mathbf{Y}_i-\boldsymbol\mu_i)\right]$$
$$\propto |\Sigma|^{-n/2}\exp\left[-\frac{1}{2}\sum_{i=1}^n(\mathbf{Y}_i-\boldsymbol\mu_i)^T\Sigma^{-1}(\mathbf{Y}_i-\boldsymbol\mu_i)\right]$$
$$\propto |\Sigma|^{-n/2}\exp\left\{-\frac{1}{2}\sum_{i=1}^n \text{Trace}\left[(\mathbf{Y}_i-\boldsymbol\mu_i)^T\Sigma^{-1}(\mathbf{Y}_i-\boldsymbol\mu_i)\right]\right\}$$
$$\propto |\Sigma|^{-n/2}\exp\left\{-\frac{1}{2}\sum_{i=1}^n \text{Trace}\left[\Sigma^{-1}(\mathbf{Y}_i-\boldsymbol\mu_i)(\mathbf{Y}_i-\boldsymbol\mu_i)^T\right]\right\}$$
$$\propto |\Sigma|^{-n/2}\exp\left\{-\frac{1}{2}\text{Trace}\left[\Sigma^{-1}\sum_{i=1}^n(\mathbf{Y}_i-\boldsymbol\mu_i)(\mathbf{Y}_i-\boldsymbol\mu_i)^T\right]\right\}$$
$$\propto |\Sigma|^{-n/2}\exp\left[-\frac{1}{2}\text{Trace}(\Sigma^{-1}\mathbf{W})\right],$$
where W = Σ_{i=1}^n (Yi − µi)(Yi − µi)^T. The inverse Wishart prior is
$$\pi(\Sigma) \propto |\Sigma|^{-(\nu+p+1)/2}\exp\left[-\frac{1}{2}\text{Trace}(\Sigma^{-1}\mathbf{R})\right].$$
Combining the likelihood and prior gives
$$p(\Sigma|\mathbf{Y}) \propto f(\mathbf{Y}|\Sigma)\pi(\Sigma) \propto |\Sigma|^{-(n+\nu+p+1)/2}\exp\left\{-\frac{1}{2}\text{Trace}\left[\Sigma^{-1}(\mathbf{W}+\mathbf{R})\right]\right\}.$$
Therefore, Σ|Y ∼ InvWishart_p(n + ν, Σ_{i=1}^n(Yi − µi)(Yi − µi)^T + R).
and
$$\frac{\partial^2 \log f(\mathbf{Y}|\mu,\tau)}{\partial \tau^2} = \frac{\partial}{\partial \tau}\left[-\frac{n}{2\tau} + \frac{1}{2\tau^2}\sum_{i=1}^n (Y_i - \mu)^2\right] = \frac{n}{2\tau^2} - \frac{1}{\tau^3}\sum_{i=1}^n (Y_i - \mu)^2.$$
Since E(Yi) = µ and E[(Yi − µ)²] = τ, the elements of the information matrix are
$$-E\left[\frac{\partial^2 \log f(\mathbf{Y}|\mu,\tau)}{\partial \mu^2}\right] = \frac{n}{\tau}, \qquad
-E\left[\frac{\partial^2 \log f(\mathbf{Y}|\mu,\tau)}{\partial \tau^2}\right] = -\frac{n}{2\tau^2} + \frac{n\tau}{\tau^3} = \frac{n}{2\tau^2}, \qquad
-E\left[\frac{\partial^2 \log f(\mathbf{Y}|\mu,\tau)}{\partial \mu\,\partial \tau}\right] = 0.$$
The determinant of the 2 × 2 information matrix is thus
$$|I(\mu,\tau)| = \frac{n}{\tau}\cdot\frac{n}{2\tau^2} - 0^2 = \frac{n^2}{2\tau^3},$$
and the JP is
$$\pi(\mu,\sigma) \propto \sqrt{\frac{n^2}{2\tau^3}} \propto \frac{1}{(\sigma^2)^{3/2}}.$$
where the two elements of the product represent the updates of the two parameters from their full conditional distributions given the current values of the parameters in the chain. We want to show that the marginal distribution of (θ1′, θ2′), integrating over (θ1∗, θ2∗), follows the posterior. The marginal distribution is
$$g(\theta_1', \theta_2') = \int\!\!\int q(\theta_1', \theta_2' \mid \theta_1^*, \theta_2^*)\, f(\theta_1^*, \theta_2^*)\, d\theta_1^*\, d\theta_2^*.$$
Substituting q(θ1′, θ2′|θ1∗, θ2∗) = f(θ1′|θ2∗)f(θ2′|θ1′), integrating over θ1∗ and then over θ2∗ gives g(θ1′, θ2′) = f(θ2′|θ1′) ∫ f(θ1′|θ2∗)f(θ2∗) dθ2∗ = f(θ2′|θ1′)f(θ1′) = f(θ1′, θ2′), as desired. The proof for p > 2 is similar but involves higher-order integration.
Part (2): Part (1) shows (for a special case) that the stationary distri-
bution of the Gibbs sampler is the posterior distribution. The proof that the
Gibbs sampler converges to its stationary (posterior) distribution draws heav-
ily from Markov chain theory. Given that the posterior distribution is the
stationary distribution, [82] proves that a Gibbs sampler converges to the
posterior distribution if the chain is aperiodic and p-irreducible. A chain is
aperiodic if for any partition of the posterior domain of θ, say {A1 , ..., Am },
so that each subset has positive posterior probability, then the probability of
the chain transitioning from Ai to Aj is positive for any i and j. A chain is
p-irreducible if for any initial value θ (0) in the support of the posterior distri-
bution and set A with positive posterior probability, i.e., Prob(θ ∈ A) > 0,
there exists an s so that there is a positive probability that the chain will visit
A at iteration s. Proving convergence then requires showing that the Gibbs
sampler is aperiodic and p-irreducible which is discussed, e.g., in [82] and [69].
A sufficient condition is that for any set A with positive posterior probabil-
ity and any initial value θ (0) in the support of the posterior distribution, the
probability under the Gibbs sampler that θ(1) ∈ A is positive. This condition
is met in all but exotic cases where support of full conditional distributions
depend on the values of other parameters, in which case convergence should
be studied carefully.
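As a concrete illustration that the Gibbs sampler has the target as its stationary distribution (an illustrative R sketch, not an example from the text), consider a bivariate normal target with unit variances and correlation ρ, whose full conditional distributions are available in closed form:

# Gibbs sampler for (theta1, theta2) ~ bivariate normal, Var = 1, Cor = rho.
# Full conditionals: theta1 | theta2 ~ Normal(rho*theta2, 1 - rho^2), and vice versa.
set.seed(1)
rho   <- 0.8
S     <- 20000
theta <- matrix(0, S, 2)

for(s in 2:S){
  theta[s, 1] <- rnorm(1, rho * theta[s - 1, 2], sqrt(1 - rho^2))
  theta[s, 2] <- rnorm(1, rho * theta[s, 1],     sqrt(1 - rho^2))
}

cor(theta[-(1:1000), ])[1, 2]   # close to rho = 0.8, as the stationary distribution requires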
$$\propto \frac{\Gamma(A)}{B^A}\int \frac{B^A}{\Gamma(A)}\,\tau^{-A-1}\exp(-B/\tau)\,d\tau \propto \frac{\Gamma(A)}{B^A} \propto \left[\sum_{i=1}^n (Y_i - \mu)^2\right]^{-(n+1)/2}.$$
$$\sum_{i=1}^n (Y_i - \mu)^2 = n\left[\sum_{i=1}^n Y_i^2/n - \bar{Y}^2 + \bar{Y}^2 - 2\bar{Y}\mu + \mu^2\right] = n\left[\sum_{i=1}^n Y_i^2/n - \bar{Y}^2 + (\mu - \bar{Y})^2\right] = n\left[\hat\sigma^2 + (\mu - \bar{Y})^2\right],$$
since σ̂² = Σ_{i=1}^n (Yi − Ȳ)²/n = Σ_{i=1}^n Yi²/n − Ȳ². Inserting this expression into the previous result gives
$$p(\mu|\mathbf{Y}) \propto \left[1 + \frac{1}{n}\,\frac{(\mu - \bar{Y})^2}{\hat\sigma^2/n}\right]^{-(n+1)/2}.$$
This is Student's t distribution with location parameter Ȳ, scale parameter σ̂/√n, and n degrees of freedom.
$$\propto \frac{\Gamma(A)}{B^A}\int \frac{B^A}{\Gamma(A)}\,\tau^{-A-1}\exp\left(-\frac{B}{\tau}\right)d\tau \propto B^{-A} \propto \left[(\mathbf{Y}-\mathbf{X}\boldsymbol\beta)^T(\mathbf{Y}-\mathbf{X}\boldsymbol\beta)\right]^{-(n+p)/2}.$$
By the law of large numbers,
$$\frac{1}{n}\sum_{i=1}^n \log\left[\frac{f(Y_i|\boldsymbol\theta)}{f(Y_i|\boldsymbol\theta_0)}\right] \rightarrow -KL(\boldsymbol\theta),$$
and thus
$$\log\left[\frac{p(\boldsymbol\theta|\mathbf{Y})}{p(\boldsymbol\theta_0|\mathbf{Y})}\right] \rightarrow \log\left[\frac{\pi(\boldsymbol\theta)}{\pi(\boldsymbol\theta_0)}\right] - nKL(\boldsymbol\theta).$$
Therefore, as n → ∞, p(θ|Y)/p(θ0|Y) → 0 for any θ ≠ θ0, and Prob(θ = θ0|Y) converges to one.
and denote the corresponding density function as φ(β; µ(α), Σ(α)). We first use this approximation for the marginal distribution of the low-dimensional parameter α. Since p(α, β|Y) = p(β|α, Y)p(α|Y), the marginal posterior of α can be written
$$p(\boldsymbol\alpha|\mathbf{Y}) = \frac{p(\boldsymbol\alpha, \boldsymbol\beta|\mathbf{Y})}{p(\boldsymbol\beta|\boldsymbol\alpha, \mathbf{Y})}.$$
Expanding around the MAP estimate β = µ(α) and using the Laplace approximation for the denominator gives the approximation
$$p(\boldsymbol\alpha|\mathbf{Y}) \approx \left.\frac{f(\mathbf{Y}|\boldsymbol\alpha, \boldsymbol\beta)\,\pi(\boldsymbol\alpha, \boldsymbol\beta)}{\phi(\boldsymbol\beta;\, \boldsymbol\mu(\boldsymbol\alpha), \Sigma(\boldsymbol\alpha))}\right|_{\boldsymbol\beta = \boldsymbol\mu(\boldsymbol\alpha)}.$$
where V̂^(s) is the sample covariance of the previous samples θ^(1), ..., θ^(s−1), δ > 0 is a small constant to avoid singularities and c = 2.4²/p [31].
Delayed rejection Metropolis replaces the standard single proposal in Metropolis–Hastings sampling with multiple proposals considered sequentially. The first stage is a usual Metropolis–Hastings step with candidate θ∗|θ^(s−1) ∼ q(θ∗|θ^(s−1)) and acceptance probability
$$R\left(\boldsymbol\theta^*, \boldsymbol\theta^{(s-1)}\right) = \min\left\{1,\ \frac{p(\boldsymbol\theta^*|\mathbf{Y})\, q\!\left(\boldsymbol\theta^{(s-1)}|\boldsymbol\theta^*\right)}{p\!\left(\boldsymbol\theta^{(s-1)}|\mathbf{Y}\right) q\!\left(\boldsymbol\theta^*|\boldsymbol\theta^{(s-1)}\right)}\right\}.$$
The notation becomes cumbersome, but this can be iterated beyond two stages.
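A minimal R sketch of the adaptive random-walk Metropolis scheme described above, assuming a user-supplied log-posterior function log_post (the two-dimensional target and the tuning constants here are illustrative assumptions):

# Adaptive random-walk Metropolis: the proposal covariance is c*V_hat + delta*I,
# where V_hat is the sample covariance of the previous draws.
set.seed(1)
p        <- 2
log_post <- function(theta) sum(dnorm(theta, mean = c(1, -1), sd = c(1, 2), log = TRUE))

S     <- 5000
delta <- 1e-6
cc    <- 2.4^2 / p
theta <- matrix(0, S, p)

for(s in 2:S){
  # Fixed proposal early on, then adapt the covariance to the past samples
  V_hat <- if(s > 100) cc * cov(theta[1:(s-1), ]) + delta * diag(p) else 0.1 * diag(p)
  cand  <- as.vector(theta[s-1, ] + t(chol(V_hat)) %*% rnorm(p))   # candidate draw
  logR  <- log_post(cand) - log_post(theta[s-1, ])                 # symmetric proposal
  theta[s, ] <- if(log(runif(1)) < logR) cand else theta[s-1, ]
}
colMeans(theta[-(1:1000), ])   # should be near the target means (1, -1)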
Slice sampling
Slice sampling [59] is a clever way to apply Gibbs sampling when the full
conditional distributions do not belong to known parametric families of dis-
tributions. Slice sampling introduces an auxiliary variable (i.e., a variable that
is not an actual parameter) U > 0 and draws samples from the joint distribu-
tion
p∗ (θ, U ) = I[0 < U < p(θ|Y)].
By construction, under p∗ the marginal distribution of θ is
$$\int I[0 < U < p(\theta|\mathbf{Y})]\, dU = p(\theta|\mathbf{Y}),$$
and therefore if samples of (θ, U ) are drawn from p∗ , then the samples of θ
follow the posterior distribution. Also, Gibbs sampling can be used to draw
samples from p∗ since the full conditional distributions are both uniform
1. U |θ, Y ∼ Uniform on [0, p(θ|Y)]
2. θ|U, Y ∼ Uniform on P (U ) = {θ; p(θ|Y) > U }
Therefore, slice sampling works by drawing from the joint distribution of
(θ, U ), discarding the samples of U and retaining the samples from θ. The
most challenging step is to make a draw from P (U ) (see the figure below). For
some posteriors P (U ) has a simple form and samples can be drawn directly.
In other cases, θ can be drawn from a uniform distribution with a domain
that includes P (U ) until a sample falls in P (U ).
(Figure: the unnormalized posterior density p(θ|Y), a horizontal level U, and the corresponding slice P(U) = {θ : p(θ|Y) > U} on the horizontal axis.)
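The rejection version of step 2 described above can be coded in a few lines; in this R sketch the unnormalized target p_theta and the bounding interval [0, 20] are illustrative assumptions:

# Slice sampler for a univariate target known up to a constant
set.seed(1)
p_theta <- function(theta) dgamma(theta, shape = 2, rate = 1)  # stand-in posterior
lower   <- 0; upper <- 20                                      # interval containing P(U)

S        <- 10000
theta    <- rep(NA, S)
theta[1] <- 1
for(s in 2:S){
  U <- runif(1, 0, p_theta(theta[s-1]))    # step 1: U | theta ~ Uniform(0, p(theta|Y))
  repeat{                                  # step 2: draw theta uniformly until it falls in P(U)
    cand <- runif(1, lower, upper)
    if(p_theta(cand) > U){ theta[s] <- cand; break }
  }
}
mean(theta); var(theta)    # near 2 and 2 for this Gamma(2, 1) example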
Listing 7.10
JAGS code for simple linear regression for the paleo data.
1 mass <- c(29.9, 1761, 1807, 2984, 3230, 5040, 5654)
2 age <- c(2, 15, 14, 16, 18, 22, 28)
3 n <- length(age)
4
5 # Fit in JAGS
6 #install.packages("rjags")
7 library(rjags)
8
22 update(model, 10000)
23 samples <- coda.samples(model, n.iter=20000,
24 variable.names=c("beta1","beta2"))
25 summary(samples)
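Lines 9–21 of Listing 7.10, which define the JAGS model statement and build the model object used by the update and coda.samples calls above, are not shown. One possible version (the priors below are assumptions, not necessarily the authors' choices) is:

model_string <- textConnection("model{
   # Likelihood
   for(i in 1:n){
      mass[i] ~ dnorm(beta1 + beta2*age[i], tau)
   }
   # Priors
   beta1 ~ dnorm(0, 0.000001)
   beta2 ~ dnorm(0, 0.000001)
   tau   ~ dgamma(0.1, 0.1)
}")

data  <- list(mass = mass, age = age, n = n)
model <- jags.model(model_string, data = data, n.chains = 2, quiet = TRUE)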
Listing 7.11
OpenBUGS code for simple linear regression for the paleo data.
1 mass <- c(29.9, 1761, 1807, 2984, 3230, 5040, 5654)
2 age <- c(2, 15, 14, 16, 18, 22, 28)
3 n <- length(age)
4
5 #install.packages("R2OpenBUGS")
6 library(R2OpenBUGS)
7
Listing 7.12
STAN code for simple linear regression for the paleo data.
1 mass <- c(29.9, 1761, 1807, 2984, 3230, 5040, 5654)
2 age <- c(2, 15, 14, 16, 18, 22, 28)
3 n <- length(age)
4
5 #install.packages("rstan")
6 library(rstan)
7
10 data {
11 int<lower=0> n;
12 vector [n] mass;
13 vector [n] age;
14 }
15
16 parameters {
17 real beta1;
18 real beta2;
19 real<lower=0> sigma;
20 }
21
22 model {
23 vector [n] mu;
24 beta1 ~ normal(0,1000000);
25 beta2 ~ normal(0,1000000);
26 sigma ~ cauchy(0.0,1000);
27 mu = beta1 + beta2*age;
28 mass ~ normal(mu,sigma);
29 }
30 "
31
Listing 7.13
NIMBLE code for simple linear regression for the paleo data.
1 mass <- c(29.9, 1761, 1807, 2984, 3230, 5040, 5654)
2 age <- c(2, 15, 14, 16, 18, 22, 28)
3 n <- length(age)
4
5 #install.packages("nimble")
6 library(nimble)
7
18 consts <-
list(n=n,age=age)
19 data <-
list(mass=mass)
20 inits <-
function(){list(beta1=rnorm(1),beta2=rnorm(1),tau=10)}
21 samples <-
nimbleMCMC(model_string, data = data, inits = inits,
22 constants=consts,
23 monitors = c("beta1", "beta2"),
24 samplesAsCodaMCMC=TRUE,WAIC=FALSE,
25 niter = 30000, nburnin = 10000, nchains = 2)
26 plot(samples)
27 effectiveSize(samples)
Listing 7.14
JAGS code for the random slopes model for the jaw data.
1 model_string <- textConnection("model{
2 # Likelihood
3 for(i in 1:n){for(j in 1:m){
4 Y[i,j] ~ dnorm(alpha1[i]+alpha2[i]*age[j],tau3)
5 }}
6
7 # Random effects
8 for(i in 1:n){
9 alpha1[i] ~ dnorm(mu1,tau1)
10 alpha2[i] ~ dnorm(mu2,tau2)
11 }
12
13 # Priors
14 mu1 ~ dnorm(0,0.0001)
15 mu2 ~ dnorm(0,0.0001)
16 tau1 ~ dgamma(0.1,0.1)
17 tau2 ~ dgamma(0.1,0.1)
18 tau3 ~ dgamma(0.1,0.1)
19 }")
20
MCMC software for the random slopes model. Effective sample size (ESS) and run time (seconds) for two chains, each with 100,000 MCMC iterations and a burn-in of 10,000.

Package      ESS for µ1   ESS for µ2   Run time
JAGS            293          335          1.83
OpenBUGS       1300         1200         10.2
STAN         180000       180000        424
NIMBLE          283          311         26.5
Listing 7.15
OpenBUGS code for the random slopes model for the jaw data.
1 model_string <- function(){
2 # Likelihood
3 for(i in 1:n){for(j in 1:m){
4 Y[i,j] ~ dnorm(mn[i,j],tau3)
5 mn[i,j] <- alpha1[i]+alpha2[i]*age[j]
6 }}
7
8 # Random effects
9 for(i in 1:n){
10 alpha1[i] ~ dnorm(mu1,tau1)
11 alpha2[i] ~ dnorm(mu2,tau2)
12 }
13
14 # Priors
15 mu1 ~ dnorm(0,0.0001)
16 mu2 ~ dnorm(0,0.0001)
17 tau1 ~ dgamma(0.1,0.1)
18 tau2 ~ dgamma(0.1,0.1)
19 tau3 ~ dgamma(0.1,0.1)
20 }
21
Listing 7.16
STAN code for the random slopes model for the jaw data.
1 stan_model <- "
2
3 data {
4 int<lower=0> n;
5 int<lower=0> m;
6 vector [m] age;
7 matrix [n,m] Y;
8 }
9
10 parameters {
11 vector [n] alpha1;
12 vector [n] alpha2;
13 real mu1;
14 real mu2;
15 real<lower=0> sigma3;
16 real<lower=0> sigma2;
17 real<lower=0> sigma1;
18 }
19
20 model {
21 real mu;
22 alpha1 ~ normal(0,sigma1);
23 alpha2 ~ normal(0,sigma2);
24 sigma1 ~ cauchy(0.0,1000);
25 sigma2 ~ cauchy(0.0,1000);
26 sigma3 ~ cauchy(0.0,1000);
27 mu1 ~ normal(0,1000);
28 mu2 ~ normal(0,1000);
29
Listing 7.17
NIMBLE code for the random slopes model for the jaw data.
1 library(nimble)
2
10 # Random effects
11 for(i in 1:n){
12 alpha1[i] ~ dnorm(mu1,tau1)
13 alpha2[i] ~ dnorm(mu2,tau2)
14 }
15
16 # Priors
17 mu1 ~ dnorm(0,0.0001)
18 mu2 ~ dnorm(0,0.0001)
19 tau1 ~ dgamma(0.1,0.1)
20 tau2 ~ dgamma(0.1,0.1)
21 tau3 ~ dgamma(0.1,0.1)
22 })
23
[11] Anirban Bhattacharya, Debdeep Pati, Natesh S Pillai, and David B Dun-
son. Dirichlet–Laplace priors for optimal shrinkage. Journal of the Amer-
ican Statistical Association, 110(512):1479–1490, 2015.
[12] Howard D Bondell and Brian J Reich. Consistent high-dimensional
Bayesian variable selection via penalized credible regions. Journal of
the American Statistical Association, 107(500):1610–1624, 2012.
[13] Dennis D Boos and Leonard A Stefanski. Essential Statistical Inference:
Theory and Methods, volume 120. Springer Science & Business Media,
2013.
[14] Carlos A Botero, Beth Gardner, Kathryn R Kirby, Joseph Bulbulia,
Michael C Gavin, and Russell D Gray. The ecology of religious beliefs.
Proceedings of the National Academy of Sciences, 111(47):16784–16789,
2014.
[15] Bob Carpenter, Andrew Gelman, Matt Hoffman, Daniel Lee, Ben
Goodrich, Michael Betancourt, Michael A Brubaker, Jiqiang Guo, Pe-
ter Li, and Allen Riddell. STAN: A probabilistic programming language.
Journal of Statistical Software, 20(2):1–37, 2016.
[16] Carlos M Carvalho, Nicholas G Polson, and James G Scott. The horseshoe
estimator for sparse signals. Biometrika, 97(2):465–480, 2010.
[17] Fang Chen. Bayesian modeling using the MCMC procedure. In Proceed-
ings of the SAS Global Forum 2008 Conference, Cary NC: SAS Institute
Inc, 2009.
[18] Hugh A Chipman, Edward I George, and Robert E McCulloch. BART:
Bayesian additive regression trees. The Annals of Applied Statistics,
4(1):266–298, 2010.
[19] Ciprian M Crainiceanu, David Ruppert, and Matthew P Wand. Bayesian
analysis for penalized spline regression using WinBUGS. Journal of Sta-
tistical Software, 14(1):1–24, 2005.
[20] A Philip Dawid. Present position and potential developments: Some
personal views: Statistical theory: The prequential approach. Journal of
the Royal Statistical Society: Series A (General), pages 278–292, 1984.
[21] Gustavo de los Campos and Paulino Perez Rodriguez. BLR: Bayesian
Linear Regression, 2014. R package version 1.4.
[22] Perry de Valpine, Daniel Turek, Christopher J Paciorek, Clifford
Anderson-Bergman, Duncan Temple Lang, and Rastislav Bodik. Pro-
gramming with models: Writing statistical algorithms for general model
structures with NIMBLE. Journal of Computational and Graphical
Statistics, 26(2):403–413, 2017.
[23] Peter Diggle, Rana Moyeed, Barry Rowlingson, and Madeleine Thom-
son. Childhood malaria in the Gambia: A case-study in model-based
geostatistics. Journal of the Royal Statistical Society: Series C (Applied
Statistics), 51(4):493–506, 2002.
[24] Stewart M Edie, Peter D Smits, and David Jablonski. Probabilistic mod-
els of species discovery and biodiversity comparisons. Proceedings of the
National Academy of Sciences, 114(14):3666–3671, 2017.
[25] Gregory M Erickson, Peter J Makovicky, Philip J Currie, Mark A
Norell, Scott A Yerby, and Christopher A Brochu. Gigantism and com-
parative life-history parameters of tyrannosaurid dinosaurs. Nature,
430(7001):772–775, 2004.
[26] Kevin R. Forward, David Haldane, Duncan Webster, Carolyn Mills,
Cherly Brine, and Diane Aylward. A comparison between the Strep A
Rapid Test Device and conventional culture for the diagnosis of strepto-
coccal pharyngitis. Canadian Journal of Infectious Diseases and Medical
Microbiology, 17:221–223, 2004.
[27] Seymour Geisser. Discussion on sampling and Bayes inference in scientific
modeling and robustness (by GEP Box). Journal of the Royal Statistical
Society: Series A (General), 143:416–417, 1980.
[28] Alan E Gelfand, Peter Diggle, Peter Guttorp, and Montserrat Fuentes.
Handbook of Spatial Statistics. CRC press, 2010.
[29] Andrew Gelman et al. Prior distributions for variance parameters in hi-
erarchical models (comment on article by Browne and Draper). Bayesian
Analysis, 1(3):515–534, 2006.
[30] Andrew Gelman, Jessica Hwang, and Aki Vehtari. Understanding predic-
tive information criteria for Bayesian models. Statistics and Computing,
24(6):997–1016, 2014.
[31] Andrew Gelman, Gareth O Roberts, and Walter R Gilks. Efficient
Metropolis jumping rules. Bayesian Statistics, 5(599-608):42, 1996.
[32] Andrew Gelman and Donald B Rubin. Inference from iterative simulation
using multiple sequences. Statistical Science, pages 457–472, 1992.
[33] Andrew Gelman and Cosma Rohilla Shalizi. Philosophy and the practice
of Bayesian statistics. British Journal of Mathematical and Statistical
Psychology, 66(1):8–38, 2013.
[34] Stuart Geman and Donald Geman. Stochastic relaxation, Gibbs distribu-
tions, and the Bayesian restoration of images. In Readings in Computer
Vision, pages 564–584. Elsevier, 1987.
[35] Edward I George and Robert E McCulloch. Variable selection via Gibbs
sampling. Journal of the American Statistical Association, 88(423):881–
889, 1993.
[36] John Geweke. Evaluating the accuracy of sampling-based approaches to
the calculation of posterior moments, volume 196. Federal Reserve Bank
of Minneapolis, Research Department, Minneapolis, MN, USA, 1991.
[37] Subhashis Ghosal and Aad van der Vaart. Fundamentals of Nonparamet-
ric Bayesian Inference, volume 44. Cambridge University Press, 2017.
[38] Tilmann Gneiting and Adrian E Raftery. Strictly proper scoring rules,
prediction, and estimation. Journal of the American Statistical Associa-
tion, 102(477):359–378, 2007.
[39] Heikki Haario, Marko Laine, Antonietta Mira, and Eero Saksman.
DRAM: Efficient adaptive MCMC. Statistics and Computing, 16(4):339–
354, 2006.
[40] Heikki Haario, Eero Saksman, and Johanna Tamminen. An adaptive
Metropolis algorithm. Bernoulli, 7(2):223–242, 2001.
[41] Wilfred K Hastings. Monte Carlo sampling methods using Markov chains
and their applications. Biometrika, 57(1):97–109, 1970.
[42] James S Hodges. Richly Parameterized Linear Models: Additive, Time
Series, and Spatial Models using Random Effects. Chapman & Hall/CRC,
2016.
[43] James S Hodges and Brian J Reich. Adding spatially-correlated errors can
mess up the fixed effect you love. The American Statistician, 64(4):325–
334, 2010.
[44] Arthur E Hoerl and Robert W Kennard. Ridge regression: Biased esti-
mation for nonorthogonal problems. Technometrics, 12(1):55–67, 1970.
[45] Jennifer A Hoeting, David Madigan, Adrian E Raftery, and Chris T Volin-
sky. Bayesian model averaging: A tutorial. Statistical Science, pages
382–401, 1999.
[46] Michael I Jordan, Zoubin Ghahramani, Tommi S Jaakkola, and
Lawrence K Saul. An introduction to variational methods for graphi-
cal models. Machine Learning, 37(2):183–233, 1999.
[47] Bindu Kalesan, Matthew E Mobily, Olivia Keiser, Jeffrey A Fagan, and
Sandro Galea. Firearm legislation and firearm mortality in the USA:
A cross-sectional, state-level study. The Lancet, 387(10030):1847–1855,
2016.
[48] Robert E Kass and Adrian E Raftery. Bayes factors. Journal of the
American Statistical Association, 90(430):773–795, 1995.
[49] Robert E Kass and Larry Wasserman. A reference Bayesian test for
nested hypotheses and its relationship to the Schwarz criterion. Journal
of the American Statistical Association, 90(431):928–934, 1995.
[50] Hong Lan, Meng Chen, Jessica B Flowers, Brian S Yandell, Donnie S
Stapleton, Christine M Mata, Eric Ton-Keen Mui, Matthew T Flowers,
Kathryn L Schueler, Kenneth F Manly, et al. Combined expression trait
correlations and expression quantitative trait locus mapping. PLoS Ge-
netics, 2(1):e6, 2006.
[51] Dennis V Lindley. A statistical paradox. Biometrika, 44(1/2):187–192,
1957.
[52] Roderick Little. Calibrated Bayes, for statistics in general, and missing
data in particular. Statistical Science, 26(2):162–174, 2011.
[53] Jean-Michel Marin, Pierre Pudlo, Christian P Robert, and Robin J Ry-
der. Approximate Bayesian computational methods. Statistics and Com-
puting, 22(6):1167–1180, 2012.
[54] Peter McCullagh and John Nelder. Generalized Linear Models, Second
Edition. Boca Raton: Chapman & Hall/CRC, 1989.
[55] Nicholas Metropolis, Arianna W Rosenbluth, Marshall N Rosenbluth,
Augusta H Teller, and Edward Teller. Equation of state calculations by
fast computing machines. The Journal of Chemical Physics, 21(6):1087–
1092, 1953.
[56] Greg Miller. ESP paper rekindles discussion about statistics. Science,
331(6015):272–273, 2011.
[57] Antonietta Mira. On Metropolis-Hastings algorithms with delayed rejec-
tion. Metron, 59(3-4):231–241, 2001.
[58] Frederick Mosteller and David L Wallace. Inference in an authorship
problem: A comparative study of discrimination methods applied to the
authorship of the disputed Federalist Papers. Journal of the American
Statistical Association, 58(302):275–309, 1963.
[59] Radford M Neal. Slice sampling. Annals of Statistics, pages 705–741,
2003.
[60] Radford M Neal. MCMC using Hamiltonian dynamics. Handbook of
Markov Chain Monte Carlo, 2(11):2, 2011.
[61] Radford M Neal. Bayesian Learning for Neural Networks, volume 118.
Springer Science & Business Media, 2012.
Index
Sample, 3
Sampling distribution, 3, 220, 223, 224
Semiparametric density estimator, 153
Semiparametric regression, 151
Sensitivity analysis, 22