Bayesian Hierarchical Models - With Applications Using R - Congdon P.D. (CRC 2020) (2nd Ed.)
By
Peter D. Congdon
University of London, England
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been
made to publish reliable data and information, but the author and publisher cannot assume responsibility for the
validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copy-
right holders of all material reproduced in this publication and apologize to copyright holders if permission to publish
in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know
so we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or
utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including pho-
tocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission
from the publishers.
For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://
www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA
01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users.
For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been
arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for
identification and explanation without intent to infringe.
Contents
Preface
2. Bayesian Analysis Options in R, and Coding for BUGS, JAGS, and Stan
2.1 Introduction
2.2 Coding in BUGS and for R Libraries Calling on BUGS
2.3 Coding in JAGS and for R Libraries Calling on JAGS
2.4 Coding for rstan
2.4.1 Hamiltonian Monte Carlo
2.4.2 Stan Program Syntax
2.4.3 The Target + Representation
2.4.4 Custom Distributions through a Functions Block
2.5 Miscellaneous Differences between Generic Packages (BUGS, JAGS, and Stan)
References
Index
Preface
My gratitude is due to Taylor & Francis for proposing a revision of Applied Bayesian
Hierarchical Methods, first published in 2010. The revision maintains the goals of present-
ing an overview of modelling techniques from a Bayesian perspective, with a view to
practical data analysis. The new book is distinctive in its computational environment,
which is entirely R focused. Worked examples are based particularly on rjags and jagsUI,
R2OpenBUGS, and rstan. Many thanks are due to the following for comments on chap-
ters or computing advice: Sid Chib, Andrew Finley, Ken Kellner, Casey Youngflesh,
Kaushik Chowdhury, Mahmoud Torabi, Matt Denwood, Nikolaus Umlauf, Marco Geraci,
Howard Seltman, Longhai Li, Paul Buerkner, Guanpeng Dong, Bob Carpenter, Mitzi
Morris, and Benjamin Cowling. Programs for the book can be obtained from my website
at https://www.qmul.ac.uk/geog/staff/congdonp.html or from
https://www.crcpress.com/Bayesian-Hierarchical-Models-With-Applications-Using-R-Second-Edition/Congdon/p/book/9781498785754.
Please send comments or questions to me at [email protected].
QMUL, London
1
Bayesian Methods for Complex Data:
Estimation and Inference
1.1 Introduction
The Bayesian approach to inference focuses on updating knowledge about unknown
parameters θ in a statistical model on the basis of observations y, with revised knowledge
expressed in the posterior density p(θ|y). The sample of observations y being analysed
provides new information about the unknowns, while the prior density p(θ) represents
accumulated knowledge about them before observing or analysing the data. There is
considerable flexibility with which prior evidence about parameters can be incorporated
into an analysis, and use of informative priors can reduce the possibility of confounding
and provides a natural basis for evidence synthesis (Shoemaker et al., 1999; Dunson, 2001;
Vanpaemel, 2011; Klement et al., 2018). The Bayes approach provides uncertainty intervals
on parameters that are consonant with everyday interpretations (Willink and Lira, 2005;
Wetzels et al., 2014; Krypotos et al., 2017), and has no problem comparing the fit of non-
nested models, such as a nonlinear model and its linearised version.
Furthermore, Bayesian estimation and inference have a number of advantages in terms
of their relevance to the types of data and problems tackled by modern scientific research,
which are a primary focus later in the book. Bayesian estimation via repeated sampling
from posterior densities facilitates modelling of complex data, with random effects treated
as unknowns and not integrated out as is sometimes done in frequentist approaches
(Davidian and Giltinan, 2003). For example, much of the data in social and health research
has a complex structure, involving hierarchical nesting of subjects (e.g. pupils within
schools), crossed classifications (e.g. patients classified by clinic and by homeplace),
spatially configured data, or repeated measures on subjects (MacNab et al., 2004). The
Bayesian approach naturally adapts to such hierarchically or spatio-temporally correlated
effects via conditionally specified hierarchical priors under a three-stage scheme (Lindley
and Smith, 1972; Clark and Gelfand, 2006; Gustafson et al., 2006; Cressie et al., 2009), with
the first stage specifying the likelihood of the data, given unknown random individual or
cluster effects; the second stage specifying the density of the random effects; and the third
stage providing priors on parameters underlying the random effects density or densities.
The increased application of Bayesian methods has owed much to the development of
Markov chain Monte Carlo (MCMC) algorithms for estimation (Gelfand and Smith, 1990;
Gilks et al., 1996; Neal, 2011), which draw repeated parameter samples from the posterior
distributions of statistical models, including complex models (e.g. models with multiple
or nested random effects). Sampling based parameter estimation via MCMC provides
a full posterior density of a parameter so that any clear non-normality is apparent, and
hypotheses about parameters or interval estimates can be assessed from the MCMC sam-
ples without the assumptions of asymptotic normality underlying many frequentist tests.
However, MCMC methods may in practice show slow convergence, and implementation of
some MCMC methods (such as Hamiltonian Monte Carlo) with advantageous estimation
features, including faster convergence, has been improved through package development
(rstan) in R.
As mentioned in the Preface, a substantial emphasis in the book is placed on implemen-
tation and data analysis for tutorial purposes, via illustrative data analysis and attention
to statistical computing. Accordingly, worked examples in R code in the rest of the chap-
ter illustrate MCMC sampling and Bayesian posterior inference from first principles. In
subsequent chapters R based packages, such as jagsUI, rjags, R2OpenBUGS, and rstan are
used for computation.
As just mentioned, Bayesian modelling of hierarchical and random effect models via
MCMC techniques has extended the scope for modern data analysis. Despite this, applica-
tion of Bayesian techniques also raises particular issues, although these have been allevi-
ated by developments such as integrated nested Laplace approximation (Rue et al., 2009)
and practical implementation of Hamiltonian Monte Carlo (Carpenter et al., 2017). These
include:
a) Propriety and identifiability issues when diffuse priors are applied to variance or
dispersion parameters for random effects (Hobert and Casella, 1996; Palmer and
Pettit, 1996; Hadjicostas and Berry, 1999; Yue et al., 2012);
b) Selecting the most suitable form of prior for variance parameters (Gelman, 2006)
or the most suitable prior for covariance modelling (Lewandowski et al., 2009);
c) Appropriate priors for models with random effects, to avoid potential overfitting
(Simpson et al., 2017; Fuglstad et al., 2018) or oversmoothing in the presence of
genuine outliers in spatial applications (Conlon and Louis, 1999);
d) The scope for specification bias in hierarchical models for complex data structures
where a range of plausible model structures are possible (Chiang et al., 1999).
$$p(\theta|y) = \frac{p(y|\theta)\,p(\theta)}{p(y)}. \tag{1.2}$$
The marginal likelihood p(y) may be obtained by integrating the numerator on the right
side of (1.2) over the support for θ, namely
$$p(y) = \int p(y|\theta)\,p(\theta)\,d\theta.$$
From (1.2), the term p(y) therefore acts as a normalising constant necessary to ensure p(θ|y)
integrates to 1, and so one may write
$$\log[p(\theta|y)] = \log(k) + \log[p(y|\theta)] + \log[p(\theta)],$$
where k = 1/p(y), and log[p(y|θ)] + log[p(θ)] is generally referred to as the log posterior, which some R
programs (e.g. rstan) allow to be directly specified as the estimation target.
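As a minimal sketch of working with the log posterior, the R code below evaluates log[p(y|θ)] + log[p(θ)] on a grid for a binomial likelihood with a beta prior, and then normalises numerically, which plays the role of dividing by p(y). The data (7 successes in 20 trials), the Beta(2, 2) prior, and all object names are illustrative assumptions and are not taken from the book's programs.

```r
# Illustrative data and prior (assumptions for this sketch)
y <- 7; n <- 20          # successes and trials
a <- 2; b <- 2           # Beta(a, b) prior on theta

# Unnormalised log posterior: log likelihood plus log prior
log_post <- function(theta) {
  dbinom(y, n, theta, log = TRUE) + dbeta(theta, a, b, log = TRUE)
}

# Evaluate on a grid of midpoints and normalise numerically
step  <- 0.001
theta <- seq(step / 2, 1 - step / 2, by = step)
lp    <- log_post(theta)
post  <- exp(lp - max(lp))            # subtract the maximum to avoid underflow
post  <- post / sum(post * step)      # density now integrates to 1 on the grid

post_mean_grid  <- sum(theta * post * step)
# Conjugate check: the posterior is Beta(a + y, b + n - y)
post_mean_exact <- (a + y) / (a + b + n)
```

Because the beta prior is conjugate here, the grid result can be checked against the analytic posterior mean; in non-conjugate problems only the numerical route would be available.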
In some cases, when the prior on θ is conjugate with the posterior on θ (i.e. has the same
density form), the posterior density and marginal likelihood can be obtained analytically.
When θ is low-dimensional, numerical integration is an alternative, and approximations to
the required integrals can be used, such as the Laplace approximation (Raftery, 1996; Chen
and Wang, 2011). In more complex applications, such approximations are not feasible, and
integration to obtain p(y) is intractable, so that direct sampling from p(θ|y) is not feasible.
In such situations, MCMC methods provide a way to sample from p(θ|y) without it having
a specific analytic form. They create a Markov chain of sampled values $\theta^{(1)}, \ldots, \theta^{(T)}$ with
transition kernel $K(\theta_{\text{cand}}|\theta_{\text{curr}})$ (investigating transitions from current to candidate values
for parameters) that has p(θ|y) as its limiting distribution. Using large samples from
the posterior distribution obtained by MCMC, one can estimate posterior quantities of
interest such as posterior means, medians, and highest density regions (Hyndman, 1996;
Chen and Shao, 1998).
$$E_\pi[g(u)] = \int g(u)\,\pi(u)\,du,$$
is estimated as
$$\bar{g} = \frac{1}{T}\sum_{t=1}^{T} g(u^{(t)})$$
and, under independent sampling from π(u), $\bar{g}$ tends to $E_\pi[g(u)]$ as T → ∞. However, such
independent sampling from the posterior density p(θ|y) is not usually feasible.
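The Monte Carlo average above can be sketched in R as follows. The choice of π(u) as a standard normal and of g(u) = u², for which the exact expectation is 1, is an assumption made purely for illustration.

```r
set.seed(1)
T_draws <- 100000
u <- rnorm(T_draws)                  # independent draws from pi(u) = N(0, 1)
g <- function(u) u^2                 # function whose expectation is required

g_bar <- mean(g(u))                  # Monte Carlo estimate of E[g(u)]
mc_se <- sd(g(u)) / sqrt(T_draws)    # Monte Carlo standard error of the estimate
```

The standard error shrinks at rate 1/sqrt(T) under independent sampling, which is the benchmark against which dependent MCMC samples are later compared.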
When suitably implemented, MCMC methods offer an effective alternative way to gen-
erate samples from the joint posterior distribution, p(θ|y), but differ from conventional
Monte Carlo methods in that successive sampled parameters are dependent or autocorre-
lated. The target density for MCMC samples is therefore the posterior density π(θ) = p(θ|y)
and MCMC sampling is especially relevant when the posterior cannot be stated exactly
in analytic form e.g. when the prior density assumed for θ is not conjugate with the like-
lihood p(y|θ). The fact that successive sampled values are dependent means that larger
samples are needed for equivalent precision, and the effective number of samples is less
than the nominal number.
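A rough sketch of why dependence lowers the effective number of samples: the R code below simulates an autocorrelated sequence and estimates an effective sample size as the nominal size divided by one plus twice the sum of positive sample autocorrelations. The AR(1) series is only a stand-in for MCMC output, and the truncation rule (stopping at the first non-positive autocorrelation) is one simple convention assumed here; packages such as coda use other estimators.

```r
set.seed(2)
n_iter <- 5000
rho    <- 0.9                        # assumed lag-1 autocorrelation
x <- numeric(n_iter)
x[1] <- rnorm(1)
for (t in 2:n_iter) {
  x[t] <- rho * x[t - 1] + rnorm(1, sd = sqrt(1 - rho^2))
}

# Sample autocorrelations at lags 1, 2, ... (drop lag 0)
r <- acf(x, lag.max = 200, plot = FALSE)$acf[-1]

# Sum autocorrelations up to the first non-positive value
first_nonpos <- which(r <= 0)[1]
if (is.na(first_nonpos)) first_nonpos <- length(r) + 1
ess <- n_iter / (1 + 2 * sum(r[seq_len(first_nonpos - 1)]))
```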
For the parameter sampling case, assume a preset initial parameter value θ(0). Then
MCMC methods involve repeated iterations to generate a correlated sequence of sampled
values θ(t) (t = 1, 2, 3, …), where updated values θ(t) are drawn from a transition distribution
that is Markovian in the sense of depending only on θ(t−1). The transition distribution
$K(\theta^{(t)}|\theta^{(t-1)})$ is chosen to satisfy additional conditions ensuring that the sequence has
the joint posterior density p(θ|y) as its stationary distribution. These conditions typically
reduce to requirements on the proposal and acceptance procedure used to generate can-
didate parameter samples. The proposal density and acceptance rule must be specified in
a way that guarantees irreducibility and positive recurrence; see, for example, Andrieu
and Moulines (2006). Under such conditions, the sampled parameters θ(t) {t = B, B + 1, … , T },
beyond a certain burn-in or warm-up phase in the sampling (of B iterations), can be viewed
as a random sample from p(θ|y) (Roberts and Rosenthal, 2004).
In practice, MCMC methods are applied separately to individual parameters or blocks of
more than one parameter (Roberts and Sahu, 1997). So, assuming θ contains more than one
parameter and consists of C components or blocks {θ1, …, θC}, different updating methods
may be used for each component, including block updates.
There is no limit to the number of samples T of θ which may be taken from the poste-
rior density p(θ|y). Estimates of the marginal posterior densities for each parameter can
be made from the MCMC samples, including estimates of location (e.g. posterior means,
modes, or medians), together with the estimated certainty or precision of these parameters
in terms of posterior standard deviations, credible intervals, or highest posterior density
intervals. For example, the 95% credible interval for θh may be estimated using the 0.025
and 0.975 quantiles of the sampled output $\{\theta_h^{(t)}, t = B+1, \ldots, T\}$. To reduce irregularities in
the histogram of sampled values for a particular parameter, a smooth form of the posterior
density can be approximated by applying kernel density methods to the sampled values.
Monte Carlo posterior summaries typically include estimated posterior means and vari-
ances of the parameters, obtainable as moment estimates from the MCMC output, namely
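$$\hat{E}(\theta_h) = \bar{\theta}_h = \sum_{t=B+1}^{T} \theta_h^{(t)}/(T-B),$$
$$\hat{V}(\theta_h) = \sum_{t=B+1}^{T} (\theta_h^{(t)} - \bar{\theta}_h)^2/(T-B).$$
These are estimates of the integrals
$$E(\theta_h|y) = \int \theta_h\, p(\theta|y)\,d\theta,$$
$$V(\theta_h|y) = \int \theta_h^2\, p(\theta|y)\,d\theta - [E(\theta_h|y)]^2 = E(\theta_h^2|y) - [E(\theta_h|y)]^2.$$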
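The posterior summaries described above can be sketched in R as below. The vector theta_h is a stand-in for sampled values of one parameter (here simply gamma draws, an assumption made so that the code is self-contained); with real MCMC output the same lines would apply after discarding the first B iterations.

```r
set.seed(3)
n_iter <- 6000; B <- 1000
# Stand-in for MCMC output on a single parameter (assumption for this sketch)
theta_h <- rgamma(n_iter, shape = 4, rate = 2)
kept <- theta_h[(B + 1):n_iter]                 # discard burn-in iterations

post_mean <- sum(kept) / (n_iter - B)           # moment estimate of the mean
post_var  <- sum((kept - post_mean)^2) / (n_iter - B)   # divisor T - B
cred_int  <- quantile(kept, probs = c(0.025, 0.975))    # 95% credible interval
dens      <- density(kept)                      # kernel-smoothed marginal density
```

Note that post_var uses the divisor T − B as in the moment estimate, so it differs slightly from R's var(), which divides by T − B − 1.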
One may also use the MCMC output to obtain posterior means, variances, and
credible intervals for functions Δ = Δ(θ) of the parameters (van Dyk, 2003). These are
estimates of the integrals
$$E[\Delta(\theta)|y] = \int \Delta(\theta)\, p(\theta|y)\,d\theta,$$
$$V[\Delta(\theta)|y] = \int \Delta^2\, p(\theta|y)\,d\theta - [E(\Delta|y)]^2 = E(\Delta^2|y) - [E(\Delta|y)]^2.$$
For Δ(θ), its posterior mean is obtained by calculating Δ(t) at every MCMC iteration from
the sampled values θ(t). The theoretical justification for such estimates is provided by the
MCMC version of the law of large numbers (Tierney, 1994), namely that
$$\sum_{t=B+1}^{T} \frac{\Delta[\theta^{(t)}]}{T-B} \rightarrow E_\pi[\Delta(\theta)],$$
provided that the expectation of Δ(θ) under π(θ) = p(θ|y), denoted Eπ[Δ(θ)], exists. MCMC
methods also allow inferences on parameter comparisons (e.g. ranks of parameters or con-
trasts between them) (Marshall and Spiegelhalter, 1998).
in more complex data sets or with more complex forms of model or response, a more gen-
eral perspective than that implied by (1.1)–(1.3) is available, and also implementable, using
MCMC methods.
Thus, a class of hierarchical Bayesian models are defined by latent data (Paap, 2002;
Clark and Gelfand, 2006) intermediate between the observed data and the underlying
parameters (hyperparameters) driving the process. A terminology useful for relating hier-
archical models to substantive issues is proposed by Wikle (2003) in which y defines the
data stage, latent effects b define the process stage, and ξ defines the hyperparameter stage.
For example, the observations i = 1,…,n may be arranged in clusters j = 1, …, J, so that the
observations can no longer be regarded as independent. Rather, subjects from the same
cluster will tend to be more alike than individuals from different clusters, reflecting latent
variables that induce dependence within clusters.
Let the parameters θ = [θL,θb] consist of parameter subsets relevant to the likelihood and
to the latent data density respectively. The data are generally taken as independent of θb
given b, so modelling intermediate latent effects involves a three-stage hierarchical Bayes
(HB) prior set-up
$$p(y, b, \theta) = p(y|b,\theta_L)\,p(b|\theta_b)\,p(\theta), \tag{1.4}$$
with a first stage likelihood p(y|b,θL) and a second stage density p(b|θb) for the latent data,
with conditioning on higher stage parameters θ. The first stage density p(y|b,θL) in (1.4) is
a conditional likelihood, conditioning on b, and sometimes called the complete data or
augmented data likelihood. The application of Bayes’ theorem now specifies
$$p(\theta|y) = \frac{p(\theta)\,p(y|\theta)}{p(y)} = \frac{p(\theta)\int p(y|b,\theta_L)\,p(b|\theta_b)\,db}{p(y)},$$
where
$$p(y|\theta) = \int p(y,b|\theta)\,db = \int p(y|b,\theta_L)\,p(b|\theta_b)\,db,$$
is the observed data likelihood, namely the complete data likelihood with b integrated out,
sometimes also known as the integrated likelihood.
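To make the integrated likelihood concrete, the sketch below integrates a latent effect out numerically in a case where the answer is known in closed form: a Poisson count y with latent mean b, and b following a gamma density, which gives a negative binomial integrated likelihood. The single count and the hyperparameter values are illustrative assumptions.

```r
y <- 3                        # a single observed count (illustrative)
alpha <- 2; beta <- 0.5       # hyperparameters of the latent data density

# Integrand: complete data likelihood p(y | b) times latent density p(b | theta_b)
integrand <- function(b) dpois(y, b) * dgamma(b, shape = alpha, rate = beta)

# Observed data (integrated) likelihood by numerical integration over b
lik_numeric <- integrate(integrand, lower = 0, upper = Inf)$value

# Closed form available in this conjugate case: negative binomial
lik_exact <- dnbinom(y, size = alpha, prob = beta / (beta + 1))
```

In most hierarchical models no such closed form exists, which is the motivation for sampling the latent data alongside the parameters rather than integrating them out.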
Often the latent data exist for every observation, or they may exist for each cluster in
which the observations are structured (e.g. a school specific effect bj for multilevel data yij
on pupils i nested in schools j). The latent variables b can be seen as a population of values
from an underlying density (e.g. varying log odds of disease) and the θb are then popula-
tion hyperparameters (e.g. mean and variance of the log odds) (Dunson, 2001). As exam-
ples, Paap (2002) mentions unobserved states describing the business cycle and Johannes
and Polson (2006) mention unobserved volatilities in stochastic volatility models, while
Albert and Chib (1993) consider the missing or latent continuous data {b1, …, bn} which
underlie binary observations {y1, …, yn}. The subject specific latent traits in psychometric or
educational item analysis can also be considered this way (Fox, 2010), as can the variance
scaling factors in the robust Student t errors version of linear regression (Geweke, 1993) or
subject specific slopes in a growth curve analysis of panel data on a collection of subjects
(Oravecz and Muth, 2018).
Typically, the integrated likelihood p(y|θ) cannot be stated in closed form and classical
likelihood estimation relies on numerical integration or simulation (Paap, 2002, p.15). By
contrast, MCMC methods can be used to generate random samples indirectly from the
posterior distribution p(θ,b|y) of parameters and latent data given the observations. This
requires only that the augmented data likelihood be known in closed form, without need-
ing to obtain the integrated likelihood p(y|θ). To see why, note that the marginal posterior
of the parameter set θ may alternatively be derived as
$$p(\theta|y) = \int p(\theta,b|y)\,db = \int p(\theta|y,b)\,p(b|y)\,db,$$
with marginal densities for component parameters θh of the form (Paap, 2002, p.5)
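$$p(\theta_h|y) = \int_{\theta_{[h]}}\int_b p(\theta,b|y)\,db\,d\theta_{[h]},$$
$$\propto \int_{\theta_{[h]}} p(y|\theta)\,p(\theta)\,d\theta_{[h]} = \int_{\theta_{[h]}}\int_b p(\theta)\,p(y|b,\theta)\,p(b|\theta)\,db\,d\theta_{[h]},$$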
where θ[h] consists of all parameters in θ with the exception of θh. The derivation of suitable
MCMC algorithms to sample from p(θ,b|y) is based on the Clifford–Hammersley theorem,
namely that any joint distribution can be fully characterised by its complete conditional
distributions. In the hierarchical Bayes context, this implies that the conditionals p(b|θ,y)
and p(θ|b,y) characterise the joint distribution p(θ,b|y) from which samples are sought, and
so MCMC sampling can alternate between updates $p(b^{(t)}|\theta^{(t-1)}, y)$ and $p(\theta^{(t)}|b^{(t)}, y)$ on
conditional densities, which are usually of simpler form than p(θ,b|y). The imputation of latent
data in this way is sometimes known as data augmentation (van Dyk, 2003).
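The alternation between the two conditionals can be sketched with a deliberately simple model, assumed here only for illustration: y_j ~ N(b_j, 1), b_j ~ N(μ, 1), and a flat prior on the single hyperparameter μ. Both full conditionals are then normal, and the marginal posterior of μ is N(mean(y), 2/J), which provides a check on the sampler.

```r
set.seed(4)
J <- 20
y <- rnorm(J, mean = 1, sd = sqrt(2))     # simulated cluster-level data (assumption)
n_iter <- 5000; B <- 500

mu <- numeric(n_iter)
b  <- matrix(NA_real_, n_iter, J)
mu_curr <- 0                              # preset initial value

for (t in 1:n_iter) {
  # Update latent data from p(b | mu, y): precision 1 + 1, mean (y_j + mu) / 2
  b_curr <- rnorm(J, mean = (y + mu_curr) / 2, sd = sqrt(1 / 2))
  # Update hyperparameter from p(mu | b, y) under a flat prior on mu
  mu_curr <- rnorm(1, mean = mean(b_curr), sd = sqrt(1 / J))
  b[t, ] <- b_curr
  mu[t]  <- mu_curr
}

mu_post_mean <- mean(mu[(B + 1):n_iter])  # analytically, E(mu | y) = mean(y)
```

Only the complete data likelihood p(y|b) and the latent density p(b|μ) are used; the integrated likelihood p(y|μ) is never evaluated.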
To illustrate the application of MCMC methods to parameter comparisons and hypoth-
esis tests in an HB setting, Shen and Louis (1998) consider hierarchical models with unit
or cluster specific parameters bj, and show that if such parameters are the focus of interest,
their posterior means are the optimal estimates. Suppose instead that the ranks of the unit
or cluster parameters, namely
$$R_j = \text{rank}(b_j) = \sum_{k \neq j} I(b_j \geq b_k),$$
(where I(A) is an indicator function which equals 1 when A is true, 0 otherwise) are
required for deriving “league tables”. Then the conditional expected ranks are optimal,
and obtained by ranking the bj at each MCMC iteration, and taking the means of these
ranks over all samples. By contrast, ranking posterior means of the bj themselves can
perform poorly (Laird and Louis, 1989; Goldstein and Spiegelhalter, 1996). Similarly,
when the empirical distribution function of the unit parameters (e.g. to be used to obtain
the fraction of parameters above a threshold) is required, the conditional expected EDF
is optimal.
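A minimal R sketch of the two ranking approaches follows. The matrix b holds stand-in posterior draws for J cluster effects (independent normal draws, an assumption made so the code is self-contained); rows are iterations. R's rank() runs from 1 to J, so in the absence of ties it equals the sum defined above plus one.

```r
set.seed(5)
J <- 8; n_draws <- 4000
# Stand-in posterior draws for cluster effects b_j (assumption for this sketch)
post_centre <- seq(-1, 1, length.out = J)
b <- sapply(post_centre, function(m) rnorm(n_draws, mean = m, sd = 0.6))

# Rank the b_j within each iteration, then average the ranks over iterations
ranks_by_iter <- t(apply(b, 1, rank))
expected_rank <- colMeans(ranks_by_iter)

# Alternative that can perform poorly: rank the posterior means themselves
rank_of_means <- rank(colMeans(b))
```

The expected ranks are typically pulled towards the middle rank (J + 1)/2 relative to rank_of_means, reflecting posterior uncertainty about the ordering.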
exceeds τ, namely
$$\widehat{\Pr}(b_j > \tau|y) = \sum_{t=B+1}^{T} I(b_j^{(t)} > \tau)/(T-B).$$
Thus, one might, in an epidemiological application, wish to obtain the posterior probabil-
ity that an area’s smoothed relative mortality risk bj exceeds unity, and so count iterations
where this condition holds. If this probability exceeds a threshold such as 0.9, then a sig-
nificant excess risk is indicated, whereas a low exceedance probability (the sampled rela-
tive risk rarely exceeded 1) would indicate a significantly low mortality level in the area.
In fact, the significance of individual random effects is one aspect of assessing the gain of
a random effects model over a model involving only fixed effects, or of assessing whether
a more complex random effects model offers a benefit over a simpler one (Knorr-Held and
Rainer, 2001, p.116). Since the variance can be defined in terms of differences between ele-
ments of the vector (b1 ,..., bJ ), as opposed to deviations from a central value, one may also
consider which contrasts between pairs of b values are significant. Thus, Deely and Smith
(1998) suggest evaluating probabilities $\Pr(b_j \leq \tau b_k | k \neq j, y)$ where 0 < τ ≤ 1, namely, the posterior
probability that any one hierarchical effect is smaller by a factor τ than all the others.
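The exceedance probability for relative risks discussed above amounts to counting iterations in which the condition holds. In the sketch below the matrix b contains stand-in sampled relative risks for five areas (lognormal draws, an illustrative assumption), and the 0.9 threshold follows the text.

```r
set.seed(6)
J <- 5; n_iter <- 5000; B <- 1000
# Stand-in sampled relative risks b_j, rows = iterations (assumption)
log_centre <- c(-0.3, -0.1, 0, 0.3, 0.5)
b <- sapply(log_centre, function(m) exp(rnorm(n_iter, mean = m, sd = 0.15)))
kept <- b[(B + 1):n_iter, ]

# Pr(b_j > 1 | y): proportion of retained iterations in which b_j exceeds 1
pr_exceed <- colMeans(kept > 1)
high_risk <- which(pr_exceed > 0.9)       # areas with a clear excess risk
```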
1.5 Metropolis Sampling
A range of MCMC techniques is available. The Metropolis sampling algorithm is still a
widely applied MCMC algorithm and is a special case of Metropolis–Hastings consid-
ered in Section 1.8. Let p(y|θ) denote a likelihood, and p(θ) denote the prior density for
θ, or more specifically the prior densities p(θ1), …, p(θC) of the components of θ. Then the
Metropolis algorithm involves a symmetric proposal density (e.g. a Normal, Student t, or
uniform density) $q(\theta_{\text{cand}}|\theta^{(t)})$ for generating candidate parameter values θcand, with acceptance
probability for potential candidate values obtained as
$$\alpha = \min\left(1, \frac{p(y|\theta_{\text{cand}})\,p(\theta_{\text{cand}})}{p(y|\theta^{(t)})\,p(\theta^{(t)})}\right).$$
In this ratio of posterior densities, the marginal likelihood p(y)
cancels out, as it is a constant. Stated more completely, to sample parameters under the
Metropolis algorithm, it is not necessary to know the normalised target distribution,
namely, the posterior density, π(θ|y); it is enough to know it up to a constant factor.
So, for updating parameter subsets, the Metropolis algorithm can be implemented by
using the full posterior distribution
where θ[h] denotes the parameter set excluding θh. So, the probability for updating θh can be
obtained either by comparing the full posterior (known up to a constant k), namely
$$\alpha = \min\left(1, \frac{\pi_h(\theta_{h,\text{cand}}|\theta_{[h]}^{(t)})}{\pi_h(\theta_h^{(t)}|\theta_{[h]}^{(t)})}\right).$$
Then one sets $\theta_h^{(t+1)} = \theta_{h,\text{cand}}$ with probability α, and $\theta_h^{(t+1)} = \theta_h^{(t)}$ otherwise.
often justified, as many posterior densities do approximate normality. For example, Albert
(2007) applies a Laplace approximation technique to estimate the posterior mode, and uses
the mean and variance parameters to define the proposal densities used in a subsequent
stage of Metropolis–Hastings sampling.
The rate at which a proposal generated by q is accepted (the acceptance rate) depends on
how close θcand is to θ(t), and this in turn depends on the variance $\sigma_q^2$ of the proposal density.
A higher acceptance rate would typically follow from reducing $\sigma_q^2$, but with the risk that
the posterior density will take longer to explore. If the acceptance rate is too high, then
autocorrelation in sampled values will be excessive (since the chain tends to move in a
restricted space), while a too low acceptance rate leads to the same problem, since the chain
then gets locked at particular values.
One possibility is to use a variance or dispersion estimate, $\sigma_m^2$ or Σm, from a maximum
likelihood or other mode-finding analysis (which approximates the posterior variance)
and then scale this by a constant c > 1, so that the proposal density variance is $\sigma_q^2 = c\sigma_m^2$.
Values of c in the range 2–10 are typical. For θh of dimension dh with covariance Σm, a proposal
density dispersion $2.38^2\Sigma_m/d_h$ is shown as optimal in random walk schemes (Roberts
et al., 1997). Working rules are for an acceptance rate of 0.4 when a parameter is updated
singly (e.g. by separate univariate normal proposals), and 0.2 when a group of parameters
are updated simultaneously as a block (e.g. by a multivariate normal proposal). Geyer and
Thompson (1995) suggest acceptance rates should be between 0.2 and 0.4, and optimal
acceptance rates have been proposed (Roberts et al., 1997; Bedard, 2008).
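The dependence of the acceptance rate on the proposal scale can be illustrated with a short random walk Metropolis sampler. The function name metropolis, the standard normal target, and the three proposal standard deviations are assumptions chosen for illustration; the target is specified only up to a constant.

```r
set.seed(7)
# Random walk Metropolis for a one-dimensional target known up to a constant
metropolis <- function(log_target, init, sigma_q, n_iter) {
  theta <- numeric(n_iter)
  curr  <- init
  n_acc <- 0
  for (t in 1:n_iter) {
    cand <- curr + sigma_q * rnorm(1)              # symmetric normal proposal
    log_ratio <- log_target(cand) - log_target(curr)
    if (log(runif(1)) < log_ratio) {               # accept with probability min(1, ratio)
      curr  <- cand
      n_acc <- n_acc + 1
    }
    theta[t] <- curr                               # retain current value if rejected
  }
  list(theta = theta, acc_rate = n_acc / n_iter)
}

log_target <- function(theta) -0.5 * theta^2       # standard normal, unnormalised
sigma_q <- c(0.1, 2.4, 25)                         # small, moderate, large jumps
acc <- sapply(sigma_q, function(s) metropolis(log_target, 0, s, 5000)$acc_rate)
```

The smallest scale accepts nearly every proposal but moves slowly, while the largest rarely accepts; both extremes produce highly autocorrelated output, as described above.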
Typical Metropolis updating schemes use variables Wt with known scale, for example,
uniform, standard Normal, or standard Student t. A Normal proposal density $q(\theta_{\text{cand}}|\theta^{(t)})$
then involves samples Wt ~ N(0,1), with candidate values
$$\theta_{\text{cand}} = \theta^{(t)} + \sigma_q W_t,$$
where σq determines the size of the jump from the current value (and the acceptance
rate). A uniform random walk samples Wt ~ Unif(−1,1) and scales this to form a proposal
$\theta_{\text{cand}} = \theta^{(t)} + \kappa W_t$, with the value of κ determining the acceptance rate. As noted above, it is
desirable that the proposal density approximately matches the shape of the target density
p(θ|y). The Langevin random walk scheme is an example of a scheme including information
about the shape of p(θ|y) in the proposal, namely $\theta_{\text{cand}} = \theta^{(t)} + \sigma_q[W_t + 0.5\nabla \log p(\theta^{(t)}|y)]$,
where ∇ denotes the gradient function (Roberts and Tweedie, 1996).
Sometimes candidate parameter values are sampled using a transformed version of a
parameter, for example, normal sampling of a log variance rather than sampling of a
variance (which has to be restricted to positive values). In this case, an appropriate Jacobian
adjustment must be included in the likelihood. Example 1.2 below illustrates this.
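A minimal sketch of the Jacobian adjustment, under assumptions chosen for checkability: the target for a variance V is taken to be an inverse gamma IG(5, 4) density, sampling is carried out on θ = log(V) with normal proposals, and the log Jacobian log|dV/dθ| = θ is added to the log target. The proposal standard deviation 0.8 is arbitrary.

```r
set.seed(8)
a <- 5; b <- 4                          # IG(a, b) target for V (assumed for the sketch)

# Log target for theta = log(V): log IG density at V = exp(theta), up to a
# constant, plus the log Jacobian log|dV/dtheta| = theta
log_target <- function(theta) {
  V <- exp(theta)
  -(a + 1) * log(V) - b / V + theta
}

n_iter <- 20000; B <- 2000
theta <- numeric(n_iter)
curr  <- 0
for (t in 1:n_iter) {
  cand <- curr + 0.8 * rnorm(1)         # normal proposal on the log scale
  if (log(runif(1)) < log_target(cand) - log_target(curr)) curr <- cand
  theta[t] <- curr
}
V_draws <- exp(theta[(B + 1):n_iter])   # back-transform; always positive
V_mean  <- mean(V_draws)                # IG(a, b) mean is b / (a - 1) = 1
```

Omitting the `+ theta` term would make the sampler target a different density for V, so the sample mean would no longer agree with the inverse gamma mean.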
exponential, gamma, etc.) from which direct sampling is straightforward. Full conditional
densities are derived by abstracting out from the joint model density p(y|θ)p(θ) (likelihood
times prior) only those elements including θh and treating other components as constants
(George et al., 1993; Gilks, 1996).
Consider a conjugate model for Poisson count data yi with means μi that are themselves
gamma-distributed; this is a model appropriate for overdispersed count data with actual
variability var(y) exceeding that under the Poisson model (Molenberghs et al., 2007).
Suppose the second stage prior is μi ~ Ga(α,β), namely,
$$p(\mu_i|\alpha,\beta) = \frac{\beta^\alpha}{\Gamma(\alpha)}\mu_i^{\alpha-1}e^{-\beta\mu_i},$$
and further that α ~ E(A) (namely, α is exponential with parameter A), and β ~ Ga(B,C)
where A, B, and C are preset constants. So the posterior density p(θ|y) of θ = (μ1, …, μn, α, β),
given y, is proportional to
$$e^{-A\alpha}\,\beta^{B-1}e^{-C\beta}\prod_i e^{-\mu_i}\mu_i^{y_i}\left[\beta^\alpha/\Gamma(\alpha)\right]^n\prod_i \mu_i^{\alpha-1}e^{-\beta\mu_i} \tag{1.6}$$
where all constants (such as the denominator yi! in the Poisson likelihood, as well as the
inverse marginal likelihood k) are combined in a proportionality constant.
It is apparent from inspecting (1.6) that the full conditional densities of μi and β are also
gamma, namely,
$$\mu_i \sim \text{Ga}(y_i + \alpha, \beta + 1),$$
and
$$\beta \sim \text{Ga}\left(B + n\alpha,\; C + \sum_i \mu_i\right),$$
respectively. The full conditional density of α, also obtained from inspecting (1.6), is
$$p(\alpha|y,\beta,\mu) \propto e^{-A\alpha}\left[\beta^\alpha/\Gamma(\alpha)\right]^n\prod_i \mu_i^{\alpha-1}.$$
This density is non-standard and cannot be sampled directly (as can the gamma densities
for μi and β). Hence, a Metropolis or Metropolis–Hastings step can be used for updating it.
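The resulting scheme, with Gibbs updates for μi and β and a Metropolis step for α, might be sketched as follows. The simulated counts, the prior constants A = B = C = 1, the proposal standard deviation 0.5, and the object names are all assumptions for illustration; candidate values of α at or below zero are simply rejected, which is valid because the proposal is symmetric and the target is zero there.

```r
set.seed(9)
n <- 50
y <- rnbinom(n, size = 2, mu = 5)         # simulated overdispersed counts (assumption)
A <- 1; B <- 1; C <- 1                    # preset prior constants (assumed values)
n_iter <- 5000; burn <- 1000

# Log full conditional of alpha, up to a constant
log_cond_alpha <- function(alpha, beta, mu) {
  if (alpha <= 0) return(-Inf)
  -A * alpha + n * (alpha * log(beta) - lgamma(alpha)) + (alpha - 1) * sum(log(mu))
}

alpha <- 1; beta <- 1                     # initial values
alpha_s <- beta_s <- numeric(n_iter)
mu_s  <- matrix(NA_real_, n_iter, n)
n_acc <- 0
for (t in 1:n_iter) {
  mu   <- rgamma(n, shape = y + alpha, rate = beta + 1)          # Gibbs update
  beta <- rgamma(1, shape = B + n * alpha, rate = C + sum(mu))   # Gibbs update
  cand <- alpha + 0.5 * rnorm(1)                                 # Metropolis step
  if (log(runif(1)) < log_cond_alpha(cand, beta, mu) - log_cond_alpha(alpha, beta, mu)) {
    alpha <- cand
    n_acc <- n_acc + 1
  }
  alpha_s[t] <- alpha; beta_s[t] <- beta; mu_s[t, ] <- mu
}
keep <- (burn + 1):n_iter
mu_post_mean <- colMeans(mu_s[keep, ])    # posterior means of the Poisson means
acc_rate <- n_acc / n_iter                # acceptance rate for alpha
```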
$$p(y|\theta) = \prod_{i=1}^{n}\frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(y_i-\mu)^2}{2\sigma^2}\right).$$
Assume a flat prior for μ, and a prior p(σ) ∝ 1/σ on σ; this is a form of noninformative
prior (see Albert, 2007, p.109). Then one has posterior density
$$p(\theta|y) \propto \frac{1}{\sigma^{n+1}}\prod_{i=1}^{n}\exp\left(-\frac{(y_i-\mu)^2}{2\sigma^2}\right),$$
with the marginal likelihood and other constants incorporated in the proportionality
sign.
Parameter sampling via the Metropolis algorithm involves σ rather than σ2, and uni-
form proposals. Thus, assume uniform U(−κ,κ) proposal densities around the current
parameter values μ(t) and σ(t), with κ = 0.5 for both parameters. The absolute value of
$\sigma^{(t)} + U(-\kappa, \kappa)$ is used to generate σcand. Note that varying the lower and upper limits of
the uniform sampling (e.g. taking κ = 1 or κ = 0.25) may considerably affect the acceptance
rates.
An R code for κ = 0.5 is in the Computational Notes [1] in Section 1.14, and uses the
full posterior density (rather than the full conditional for each parameter) as the target
density for assessing candidate values. In the acceptance step, the log of the ratio
$$\frac{p(y|\theta_{\text{cand}})\,p(\theta_{\text{cand}})}{p(y|\theta^{(t)})\,p(\theta^{(t)})}$$
is compared to the log of a random uniform value to avoid computer
over/underflow. With T = 10000 and B = 1000 warmup iterations, acceptance rates for
the proposals of μ and σ are 48% and 35% respectively, with posterior means 2.87 and
4.99. Other posterior summary tools (e.g. univariate and bivariate kernel density plots,
effective sample sizes) are included in the R code (see Figure 1.1 for a plot of the pos-
terior bivariate density). Also included is a posterior probability calculation to assess
Pr(μ < 3|y), with result 0.80, and a command for a plot of the changing posterior expec-
tation for μ over the iterations. The code uses the full normal likelihood, via the dnorm
function in R.
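The book's own program is in the Computational Notes; the following is only a hedged sketch of the scheme just described, written from the description rather than copied from that program. The data are simulated here (an assumption, so the numerical results quoted above will not be reproduced), and object names such as log_post and acc are illustrative.

```r
set.seed(10)
y <- rnorm(1000, mean = 3, sd = 5)        # simulated data (assumption; not the book's data)
n_iter <- 10000; B <- 1000; kappa <- 0.5

# Log likelihood plus log prior: flat prior on mu, p(sigma) proportional to 1/sigma
log_post <- function(mu, sigma) {
  sum(dnorm(y, mean = mu, sd = sigma, log = TRUE)) - log(sigma)
}

mu <- sigma <- numeric(n_iter)
mu_curr <- 0; sigma_curr <- 1             # initial values
acc <- c(mu = 0, sigma = 0)
for (t in 1:n_iter) {
  # Update mu with a uniform proposal around the current value
  mu_cand <- mu_curr + runif(1, -kappa, kappa)
  if (log(runif(1)) < log_post(mu_cand, sigma_curr) - log_post(mu_curr, sigma_curr)) {
    mu_curr <- mu_cand; acc["mu"] <- acc["mu"] + 1
  }
  # Update sigma; the absolute value keeps the candidate positive
  sigma_cand <- abs(sigma_curr + runif(1, -kappa, kappa))
  if (log(runif(1)) < log_post(mu_curr, sigma_cand) - log_post(mu_curr, sigma_curr)) {
    sigma_curr <- sigma_cand; acc["sigma"] <- acc["sigma"] + 1
  }
  mu[t] <- mu_curr; sigma[t] <- sigma_curr
}
keep <- (B + 1):n_iter
acc_rate  <- acc / n_iter                 # acceptance rates for mu and sigma
post_mean <- c(mu = mean(mu[keep]), sigma = mean(sigma[keep]))
pr_mu_lt3 <- mean(mu[keep] < 3)           # estimate of Pr(mu < 3 | y)
```

Comparing differences of log posteriors with log(runif(1)) is the log-scale form of the acceptance test, which avoids forming the product of many small density values directly.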
FIGURE 1.1
Bivariate density plot, normal density parameters (horizontal axis: mu; vertical axis: sigma).
$$z_i = (w_i - \mu)/\sigma,$$
where m1 and σ are both positive. To simplify notation, one may write V = σ².
Consider Metropolis sampling involving log transforms of m1 and V, and separate
univariate normal proposals in a Metropolis scheme. Jacobian adjustments are needed
in the posterior density to account for the two transformed parameters. The full posterior
p(μ, m1, V|y) is proportional to
$$p(\mu)\,p(m_1)\,p(V)\prod_i [p(w_i)]^{y_i}[1 - p(w_i)]^{n_i - y_i},$$
where p(μ), p(m1) and p(V) are priors for μ, m1 and V. Suppose the priors p(m1) and p(μ)
are as follows:
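$$m_1 \sim \text{Ga}(a_0, b_0),$$
$$\mu \sim N(c_0, d_0^2),$$
where
$$\text{Ga}(x|\alpha,\beta) = \frac{\beta^\alpha}{\Gamma(\alpha)}x^{\alpha-1}e^{-\beta x}.$$
Also, for p(V) assume
$$V \sim \text{IG}(e_0, f_0),$$
where
$$\text{IG}(x|\alpha,\beta) = \frac{\beta^\alpha}{\Gamma(\alpha)}x^{-(\alpha+1)}e^{-\beta/x}.$$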
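Substituting these densities, the full posterior is proportional to
$$m_1^{a_0-1}e^{-b_0 m_1}\exp\left[-\frac{1}{2}\left(\frac{\mu - c_0}{d_0}\right)^2\right]V^{-(e_0+1)}e^{-f_0/V}\prod_i [p(w_i)]^{y_i}[1 - p(w_i)]^{n_i - y_i}.$$
With the transformed parameters θ1 = μ, θ2 = log(m1), and θ3 = log(V), the Jacobian adjustment gives a posterior density proportional to
$$\left(\frac{\partial m_1}{\partial \theta_2}\right)\left(\frac{\partial V}{\partial \theta_3}\right)p(\mu)\,p(m_1)\,p(V)\prod_i [p(w_i)]^{y_i}[1 - p(w_i)]^{n_i - y_i}.$$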
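One has $\partial m_1/\partial\theta_2 = e^{\theta_2} = m_1$ and $\partial V/\partial\theta_3 = e^{\theta_3} = V$. So, taking account of the
parameterisation (θ1,θ2,θ3), the posterior density is proportional to
$$m_1^{a_0}e^{-b_0 m_1}\exp\left[-\frac{1}{2}\left(\frac{\mu - c_0}{d_0}\right)^2\right]V^{-e_0}e^{-f_0/V}\prod_i [p(w_i)]^{y_i}[1 - p(w_i)]^{n_i - y_i}.$$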
The R code (see Section 1.14 Computational Notes [2]) assumes initial values for μ = θ1
of 1.8, for θ2 = log(m1) of 0, and for θ3 = log(V) of 0. Preset parameters in the prior den-
sities are (a0 = 0.25, b0 = 0.25, c0 = 2, d0 = 10, e0 = 2.000004, f0 = 0.001). Two chains are run
with T = 100000, with inferences based on the last 50000 iterations. Standard deviations in the respective normal proposal densities are set at 0.01, 0.2, and 0.4. Metropolis updates involve comparisons of the log posterior and logs of uniform random variables $\{U_h^{(t)}, h = 1, \ldots, 3\}$.
Posterior medians (and 95% intervals) for {μ,m1,V} are obtained as 1.81 (1.78, 1.83), 0.36
(0.20,0.75), 0.00035 (0.00017, 0.00074) with acceptance rates of 0.41, 0.43, and 0.43. The pos-
terior estimates are similar to those of Carlin and Gelfand (1991). Despite satisfactory
convergence according to Gelman–Rubin scale reduction factors, estimation is beset
by high posterior correlations between parameters and low effective sample sizes. The
cross-correlations between the three hyperparameters exceed 0.75 in absolute terms,
effective sample sizes are under 1000, and first lag sampling autocorrelations all exceed
0.90.
It is of interest to apply rstan (and hence HMC) to this dataset (Section 1.10) (see Section
1.14 Computational Notes [3]). Inferences from rstan differ from those from Metropolis
sampling estimation, though are sensitive to priors adopted. In a particular rstan esti-
mation, normal priors are set on the hyperparameters as follows:
$$\mu \sim N(2, 10),$$
Two chains are applied with 2500 iterations and 250 warm-up. While estimates for μ
are similar to the preceding analysis, the posterior median (95% intervals) for m1 is now
1.21 (0.21, 6.58), with the 95% interval straddling the default unity value. The estimate
for the variance V is lower. As to MCMC diagnostics, effective sample sizes for μ and m1
are larger than from the Metropolis analysis, absolute cross-correlations between the
three hyperparameters in the MCMC sampling are all under 0.40 (see Figure 1.2), and
first lag sampling autocorrelations are all under 0.60.
FIGURE 1.2
Posterior densities and MCMC cross-correlations, rstan estimation of beetle mortality data.
1.8 Metropolis–Hastings Sampling
The Metropolis–Hastings (M–H) algorithm is the overarching algorithm for MCMC schemes that simulate a Markov chain θ(t) with p(θ|y) as its stationary distribution. Following Hastings (1970), the chain is updated from θ(t) to θcand with probability
$$\alpha(\theta_{cand}|\theta^{(t)}) = \min\left(1, \frac{p(\theta_{cand}|y)\,q(\theta^{(t)}|\theta_{cand})}{p(\theta^{(t)}|y)\,q(\theta_{cand}|\theta^{(t)})}\right),$$
where the proposal density q (Chib and Greenberg, 1995) may be non-symmetric, so that $q(\theta_{cand}|\theta^{(t)})$ does not necessarily equal $q(\theta^{(t)}|\theta_{cand})$. Here $q(\theta_{cand}|\theta^{(t)})$ is the probability (or density ordinate) of θcand for a density centred at θ(t), while $q(\theta^{(t)}|\theta_{cand})$ is the probability of moving back from θcand to the current value. If the proposal density is symmetric, with $q(\theta_{cand}|\theta^{(t)}) = q(\theta^{(t)}|\theta_{cand})$, then the Metropolis–Hastings algorithm reduces to the Metropolis algorithm discussed above. The M–H transition kernel is
$$K(\theta_{cand}|\theta^{(t)}) = \alpha(\theta_{cand}|\theta^{(t)})\,q(\theta_{cand}|\theta^{(t)}),$$
for $\theta_{cand} \neq \theta^{(t)}$, with a nonzero probability of staying in the current state, namely
$$K(\theta^{(t)}|\theta^{(t)}) = 1 - \int \alpha(\theta_{cand}|\theta^{(t)})\,q(\theta_{cand}|\theta^{(t)})\,d\theta_{cand}.$$
Conformity of M–H sampling to the requirement that the Markov chain eventually sam-
ples from π(θ) is considered by Mengersen and Tweedie (1996) and Roberts and Rosenthal
(2004).
If the proposed new value θcand is accepted, then θ(t+1) = θcand, while if it is rejected the next
state is the same as the current state, i.e. θ(t+1) = θ(t). As mentioned above, since the target
density p(θ|y) appears in ratio form, it is not necessary to know the normalising constant
k = 1/p(y). If the proposal density has the form
$$q(\theta_{cand}|\theta^{(t)}) = q(|\theta^{(t)} - \theta_{cand}|),$$
then a random walk Metropolis scheme is obtained (Albert, 2007, p.105; Sherlock et al.,
2010). Another option is independence sampling, when the density q(θcand) for sampling
candidate values is independent of the current value θ(t).
While it is possible for the target density to relate to the entire parameter set, it is typi-
cally computationally simpler in multi-parameter problems to divide θ into C blocks or
components, and use the full conditional densities in componentwise updating. Consider
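the update for the hth parameter or parameter block. At step h of iteration t + 1 the preceding h − 1 parameter blocks are already updated via the M–H algorithm, while $\theta_{h+1}, \ldots, \theta_C$ are still at their iteration t values (Chib and Greenberg, 1995). Let the vector of partially updated parameters apart from θh be denoted
$$\theta_{[h]}^{(t)} = (\theta_1^{(t+1)}, \ldots, \theta_{h-1}^{(t+1)}, \theta_{h+1}^{(t)}, \ldots, \theta_C^{(t)}).$$
The candidate value for θh is generated from the hth proposal density, denoted $q_h(\theta_{h,cand}|\theta_h^{(t)})$. Also governing the acceptance of a proposal are full conditional densities $\pi_h(\theta_h^{(t)}|\theta_{[h]}^{(t)}) \propto p(y|\theta_h^{(t)})\,p(\theta_h^{(t)})$ specifying the density of θh conditional on known values of other parameters θ[h]. The candidate value θh,cand is then accepted with probability
$$\alpha = \min\left(1, \frac{\pi_h(\theta_{h,cand}|\theta_{[h]}^{(t)})\,q_h(\theta_h^{(t)}|\theta_{h,cand})}{\pi_h(\theta_h^{(t)}|\theta_{[h]}^{(t)})\,q_h(\theta_{h,cand}|\theta_h^{(t)})}\right).$$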
$$p_i|b_j = \Phi(\beta_1 + \beta_2 x_i + b_j),$$
where $\{b_j \sim N(0, 1/\tau_b), j = 1, \ldots, J\}$. It is assumed that βk ∼ N(0, 10) and τb ∼ Ga(1, 0.001).
A Metropolis–Hastings step involving a gamma proposal is used for the random effects precision τb, and Metropolis updates for other parameters; see Section 1.14 Computational Notes [3]. Trial runs suggest τb is approximately between 5 and 10, and a gamma proposal $\mathrm{Ga}(\kappa, \kappa/\tau_{b,curr})$ with κ = 100 is adopted (reducing κ will reduce the M–H acceptance rate for τb).
A run of T = 5000 iterations with warm-up B = 500 provides posterior medians (95% intervals) for $\{\beta_1, \beta_2, \sigma_b = 1/\sqrt{\tau_b}\}$ of −2.91 (−3.79, −2.11), 0.40 (0.28, 0.54), and 0.27 (0.20, 0.43), and acceptance rates for {β1,β2,τb} of 0.30, 0.21, and 0.24. Acceptance rates for the
clutch random effects (using normal proposals with standard deviation 1) are between
0.25 and 0.33. However, none of the clutch effects appears to be strongly significant, in
the sense of entirely positive or negative 95% credible intervals. The effect b9 (for the
clutch with lowest average birthweight) has posterior median and 95% interval, 0.36
(−0.07, 0.87), and is the closest to being significant, while for b15 the median (95%CRI) is
−0.30 (−0.77,0.10).
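The Hastings correction needed for a non-symmetric gamma proposal can be sketched as follows. This is a minimal illustration, not the code of the Computational Notes: the random effects `b` are hypothetical simulated values held fixed (rather than the clutch effects of the example), and only the precision is updated. Because the full conditional is then a known gamma density, the M–H average can be compared with the exact conditional mean.

```r
set.seed(1)
J <- 30
b <- rnorm(J, 0, 0.3)   # hypothetical random effects, held fixed for illustration

# log full conditional of tau_b up to a constant:
# b_j ~ N(0, 1/tau_b), tau_b ~ Ga(1, 0.001)
log_target <- function(tau) {
  sum(dnorm(b, 0, 1 / sqrt(tau), log = TRUE)) +
    dgamma(tau, shape = 1, rate = 0.001, log = TRUE)
}

kappa <- 100            # proposal Ga(kappa, kappa / tau_curr) has mean tau_curr
n_iter <- 20000
tau <- numeric(n_iter)
tau[1] <- 5
n_acc <- 0
for (t in 2:n_iter) {
  curr <- tau[t - 1]
  cand <- rgamma(1, shape = kappa, rate = kappa / curr)
  # log M-H ratio: target ratio plus the proposal (Hastings) correction
  log_ratio <- log_target(cand) - log_target(curr) +
    dgamma(curr, shape = kappa, rate = kappa / cand, log = TRUE) -
    dgamma(cand, shape = kappa, rate = kappa / curr, log = TRUE)
  if (log(runif(1)) <= log_ratio) {
    tau[t] <- cand
    n_acc <- n_acc + 1
  } else {
    tau[t] <- curr
  }
}
acc_rate <- n_acc / (n_iter - 1)
# exact conditional is Ga(1 + J/2, 0.001 + 0.5 * sum(b^2))
exact_mean <- (1 + J / 2) / (0.001 + 0.5 * sum(b^2))
mh_mean <- mean(tau[-(1:2000)])
c(acceptance = acc_rate, mh_mean = mh_mean, exact_mean = exact_mean)
```

Lowering `kappa` widens the proposal around the current value, which is one way to see the effect on the acceptance rate noted above.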
1.9 Gibbs Sampling
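The Gibbs sampler (Gelfand and Smith, 1990; Gilks et al., 1993; Chib, 2001) is a special componentwise M–H algorithm whereby the proposal density q for updating θh equals the full conditional $\pi_h(\theta_h|\theta_{[h]}) \propto p(y|\theta_h)\,p(\theta_h)$. It follows from (1.7) that proposals are accepted with probability 1. If it is possible to update all blocks this way, then the Gibbs sampler involves parameter block by parameter block updating which, when completed, forms the transition from $\theta^{(t)} = (\theta_1^{(t)}, \ldots, \theta_C^{(t)})$ to $\theta^{(t+1)} = (\theta_1^{(t+1)}, \ldots, \theta_C^{(t+1)})$. The most common sequence used is
1. $\theta_1^{(t+1)} \sim \pi_1(\theta_1|\theta_2^{(t)}, \theta_3^{(t)}, \ldots, \theta_C^{(t)})$;
2. $\theta_2^{(t+1)} \sim \pi_2(\theta_2|\theta_1^{(t+1)}, \theta_3^{(t)}, \ldots, \theta_C^{(t)})$;
...
C. $\theta_C^{(t+1)} \sim \pi_C(\theta_C|\theta_1^{(t+1)}, \theta_2^{(t+1)}, \ldots, \theta_{C-1}^{(t+1)})$.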
While this scanning scheme is the usual one for Gibbs sampling, there are other options,
such as the random permutation scan (Roberts and Sahu, 1997) and the reversible Gibbs
sampler which updates blocks 1 to C, and then updates in reverse order.
$$y_j \sim N(\theta_j, \sigma_j^2),$$
and the second stage specifies a normal model for the latent θj,
$$\theta_j \sim N(\mu, \tau^2).$$
The full conditionals for the latent effects θj, namely $p(\theta_j|y, \mu, \tau^2)$, are as specified by Gelman et al. (2014, p.116). Assuming a flat prior on μ, and that the precision 1/τ² has a Ga(a,b) gamma prior, then the full conditional for μ is $N(\bar{\theta}, \tau^2/J)$, and that for 1/τ² is gamma with parameters $(J/2 + a,\; 0.5\sum_j (\theta_j - \mu)^2 + b)$.
TABLE 1.1
Schools Normal Meta-Analysis Posterior Summary
            μ     τ     θ1    θ2    θ3    θ4    θ5    θ6    θ7    θ8
Mean        8.0   2.5   9.0   8.0   7.6   8.0   7.1   7.5   8.8   8.1
St devn     4.4   2.8   5.6   4.9   5.4   5.1   5.0   5.2   5.2   5.4
For the R application, the setting a = b = 0.1 is used in the prior for 1/τ². Starting values for μ and τ² in the MCMC analysis are provided by the mean of the yj and the median of the $\sigma_j^2$. A single run of T = 20000 samples (see Section 1.14 Computational Notes [4]) provides the posterior means and standard deviations shown in Table 1.1.
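The Gibbs updates for this two-stage normal model can be sketched as below. This is an illustrative sketch rather than the code of the Computational Notes. It assumes the eight schools effect estimates and standard errors as tabulated by Gelman et al. (2014), and it uses the standard precision-weighted normal form for the full conditional of each θj. The names `y`, `s`, `theta`, `mu` and `tau2` are chosen for the illustration.

```r
set.seed(123)
# Eight schools data (Gelman et al., 2014): effect estimates and standard errors
y <- c(28, 8, -3, 7, -1, 1, 18, 12)
s <- c(15, 10, 16, 11, 9, 11, 10, 18)
J <- length(y)
a <- 0.1; b <- 0.1               # Ga(a, b) prior on the precision 1/tau^2
n_iter <- 20000
mu <- tau2 <- numeric(n_iter)
theta <- matrix(NA_real_, n_iter, J)
mu_curr <- mean(y)               # starting values as described in the text
tau2_curr <- median(s^2)
for (t in 1:n_iter) {
  # theta_j | y, mu, tau2: precision-weighted normal
  prec <- 1 / s^2 + 1 / tau2_curr
  th <- rnorm(J, (y / s^2 + mu_curr / tau2_curr) / prec, sqrt(1 / prec))
  # mu | theta, tau2 ~ N(mean(theta), tau2 / J) under a flat prior
  mu_curr <- rnorm(1, mean(th), sqrt(tau2_curr / J))
  # 1/tau2 | theta, mu ~ Ga(J/2 + a, 0.5 * sum((theta - mu)^2) + b)
  tau2_curr <- 1 / rgamma(1, shape = J / 2 + a,
                          rate = 0.5 * sum((th - mu_curr)^2) + b)
  theta[t, ] <- th
  mu[t] <- mu_curr
  tau2[t] <- tau2_curr
}
# posterior means of mu, tau and the school effects
round(c(mu = mean(mu), tau = mean(sqrt(tau2)), colMeans(theta)), 1)
```

Every update is a direct draw from a full conditional, so no accept/reject step appears in the loop.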
$$H(\theta, \phi) = U(\theta) + K(\phi),$$
where $U(\theta) = -\log[p(y|\theta)p(\theta)]$ (the negative log posterior) defines potential energy, and $K(\phi) = \sum_{d=1}^{D} \phi_d^2/(2m_d)$ defines kinetic energy (Neal, 2011, section 5.2). Updates of the momentum variable include updates based on the gradients of U(θ),
$$g_d(\theta) = \frac{\partial U(\theta)}{\partial \theta_d},$$
with g(θ) denoting the vector of gradients.
For iterations t = 1, …, T, the updating sequence is as follows:
$$\log(r) = U(\theta^{(t)}) + K(\phi^{(t)}) - U(\theta^{*}) - K(\phi^{*}).$$
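One iteration of such a scheme might be sketched as follows, under stated assumptions: a leapfrog integrator, unit masses so that K(φ) = Σφd²/2, and a bivariate standard normal target chosen purely for illustration. The step size `eps` and the number of leapfrog steps `L` are hypothetical tuning choices, not values from the source.

```r
set.seed(42)
U <- function(theta) 0.5 * sum(theta^2)   # negative log density of N(0, I), up to a constant
grad_U <- function(theta) theta           # gradient vector g(theta)
K <- function(phi) 0.5 * sum(phi^2)       # kinetic energy with unit masses

hmc_step <- function(theta, eps = 0.3, L = 5) {
  phi <- rnorm(length(theta))             # refresh the momentum
  theta_star <- theta
  # leapfrog: half step for momentum, alternating full steps, final half step
  phi_star <- phi - 0.5 * eps * grad_U(theta_star)
  for (l in seq_len(L)) {
    theta_star <- theta_star + eps * phi_star
    if (l < L) phi_star <- phi_star - eps * grad_U(theta_star)
  }
  phi_star <- phi_star - 0.5 * eps * grad_U(theta_star)
  # accept with probability min(1, r), using log(r) as defined above
  log_r <- U(theta) + K(phi) - U(theta_star) - K(phi_star)
  if (log(runif(1)) <= log_r) theta_star else theta
}

n_iter <- 5000
draws <- matrix(NA_real_, n_iter, 2)
theta <- c(3, -3)
for (t in seq_len(n_iter)) {
  theta <- hmc_step(theta)
  draws[t, ] <- theta
}
colMeans(draws)        # both should be near 0
apply(draws, 2, var)   # both should be near 1
```

In practice packages such as rstan tune the step size and trajectory length automatically, so this hand-coded version only makes the energy bookkeeping visible.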
$$p(y|x, \phi),$$
with a response y (of length n) conditional on a latent field x (usually also of length n),
depending on hyperparameters θ, with sparse precision matrix Qθ, and with ϕ denoting
other parameters relevant to the observation model. The hierarchical model is then
$$\theta, \phi \sim p(\theta)p(\phi),$$
with joint posterior
$$p(x, \theta, \phi|y) \propto p(\theta)\,p(\phi)\,p(x|\theta)\prod_i p(y_i|x_i, \phi).$$
$$\log(\eta_i) = \mu + u_i + s_i,$$
where the $u_i \sim N(0, \sigma_u^2)$ are iid (independent and identically distributed) random errors, and the $s_i$ follow an intrinsic autoregressive prior (expressing spatial dependence) with variance $\sigma_s^2$, that is $s \sim \mathrm{ICAR}(\sigma_s^2)$. Then x = (η,u,s) is jointly Gaussian with hyperparameters $(\mu, \sigma_s^2, \sigma_u^2)$.
$$p(x_i|y) = \int p(\theta|y)\,p(x_i|\theta, y)\,d\theta,$$
$$p(\theta_j|y) = \int p(\theta|y)\,d\theta_{[j]},$$
where θ[j] denotes θ excluding θj, and integrations are carried out numerically.
$$T_{eff,h} = T\Big/\left[1 + 2\sum_{k=1}^{\infty}\rho_{hk}\right],$$
where
$$\rho_{hk} = \gamma_{hk}/\gamma_{h0},$$
is the kth lag autocorrelation, γh0 is the posterior variance V(θh|y), and γhk is the kth lag autocovariance $\mathrm{cov}[\theta_h^{(t)}, \theta_h^{(t+k)}|y]$. In practice, one may estimate Teff,h by dividing T by $1 + 2\sum_{k=1}^{K^*}\rho_{hk}$, where K* is the first lag value for which ρhk < 0.1 or ρhk < 0.05 (Browne et al., 2009).
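A truncated estimate of this kind can be sketched as below. The function `ess_truncated` is a hypothetical helper written for this illustration, not part of coda or mcmcse. It assumes the sum runs from lag 1 up to and including K*, and it is checked on a simulated AR(1) series rather than on MCMC output.

```r
# Effective sample size with the autocorrelation sum truncated at the
# first lag K* whose sample autocorrelation falls below `cutoff`
ess_truncated <- function(x, cutoff = 0.05, max_lag = 200) {
  rho <- acf(x, lag.max = max_lag, plot = FALSE)$acf[-1]   # drop lag 0
  below <- which(rho < cutoff)
  k_star <- if (length(below) > 0) below[1] else length(rho)
  length(x) / (1 + 2 * sum(rho[1:k_star]))
}

set.seed(7)
# AR(1) series with coefficient 0.5, standing in for a correlated chain
x <- as.numeric(arima.sim(list(ar = 0.5), n = 20000))
ess_x <- ess_truncated(x)
ess_x / length(x)   # fraction of the nominal sample size retained
```

For an AR(1) process with coefficient 0.5 the untruncated ratio is (1 − 0.5)/(1 + 0.5) = 1/3, so the truncated estimate should land close to a third of the nominal sample size.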
Also useful for assessing efficiency is the Monte Carlo standard error, which is an estimate of the standard deviation of the difference between the true posterior mean $E(\theta_h|y) = \int \theta_h\,p(\theta|y)\,d\theta$, and the simulation-based estimate
$$\bar{\theta}_h = \frac{1}{T}\sum_{t=B+1}^{T+B}\theta_h^{(t)}.$$
A simple estimator of the Monte Carlo variance is
$$\frac{1}{T}\left[\frac{1}{T-1}\sum_{t=1}^{T}(\theta_h^{(t)} - \bar{\theta}_h)^2\right]$$
though this may be distorted by extreme sampled values; an alternative batch means
method is described by Roberts (1996). The ratio of the posterior variance in a parameter
to its Monte Carlo variance is a measure of the efficiency of the Markov chain sampling
(Roberts, 1996), and it is sometimes suggested that the MC standard error should be less
than 5% of the posterior standard deviation of a parameter (Toft et al., 2007).
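The difference between the simple estimator and a batch means estimator can be sketched as follows. Both functions are hypothetical helpers written for this illustration; the batch means version assumes non-overlapping batches with roughly √T batches, one common choice among several. The series is a simulated AR(1) process standing in for a correlated chain.

```r
# Simple estimator: treats the draws as independent
mcse_naive <- function(x) sqrt(var(x) / length(x))

# Batch means estimator with non-overlapping batches
mcse_batch <- function(x, n_batch = floor(sqrt(length(x)))) {
  size <- floor(length(x) / n_batch)
  x <- x[seq_len(size * n_batch)]               # drop any incomplete batch
  means <- colMeans(matrix(x, nrow = size))     # one column per batch
  sqrt(var(means) / n_batch)
}

set.seed(11)
x <- as.numeric(arima.sim(list(ar = 0.5), n = 20000))
se_naive <- mcse_naive(x)
se_batch <- mcse_batch(x)
c(naive = se_naive, batch = se_batch, ratio_to_sd = se_batch / sd(x))
```

With positive autocorrelation the simple estimator understates the Monte Carlo error, which is why the batch means value is the one to compare against the 5% guideline.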
The effective sample size is mentioned above, while Raftery and Lewis (1992, 1996) esti-
mate the iterations required to estimate posterior summary statistics to a given accuracy.
Suppose the following posterior probability
$$\Pr[\Delta(\theta|y) < b] = p_\Delta,$$
is required. Raftery and Lewis seek estimates of the burn-in iterations B to be discarded,
and the required further iterations T to estimate pΔ to within r with probability s; typical
quantities might be pΔ = 0.025, r = 0.005, and s = 0.95. The selected values of {pΔ,r,s} can also
be used to derive an estimate of the required minimum iterations Tmin if autocorrelation
were absent, with the ratio
$$I = T/T_{min},$$
$$y_j \sim N(\mu + \theta_j, \sigma_y^2); \quad \theta_j \sim N(0, \sigma_\theta^2), \quad j = 1, \ldots, J$$
conventional sampling approaches may become trapped near σθ = 0, whereas improved convergence and effective sample sizes are achieved by introducing a redundant scale parameter λ ∼ N(0, Vλ)
$$y_j \sim N(\mu + \lambda\xi_j, \sigma_y^2),$$
$$\xi_j \sim N(0, \sigma_\xi^2).$$
The expanded model priors induce priors on the original model parameters, namely
$$\theta_j = \lambda\xi_j,$$
$$\sigma_\theta = |\lambda|\,\sigma_\xi.$$
The setting for Vλ is important; too much diffuseness may lead to effective impropriety.
Another source of poor convergence is suboptimal parameterisation or data form.
For example, convergence is improved by centring independent variables in regres-
sion applications (Roberts and Sahu, 2001; Zuur et al., 2002). Similarly, delayed conver-
gence in random effects models may be lessened by sum to zero or corner constraints
(Clayton, 1996; Vines et al., 1996), or by a centred hierarchical prior (Gelfand et al., 1995;
Gelfand et al., 1996), in which the prior on each stochastic variable is a higher level sto-
chastic mean – see the next section. However, the most effective parameterisation may
also depend on the balance in the data between different sources of variation. In fact,
non-centred parameterisations, with latent data independent from hyperparameters,
may be preferable in terms of MCMC convergence in some settings (Papaspiliopoulos
et al., 2003).
empirical sum to zero constraint may be achieved by centring the sampled random effects at each iteration (sometimes known as “centring on the fly”), so that
$$u_i^* = u_i - \bar{u}$$
and inserting $u_i^*$ rather than ui in the model defining the likelihood. Another option (Vines et al., 1996; Scollnik, 2002) is to define an auxiliary effect $u_i^a \sim N(0, \sigma_u^2)$ and obtain the ui, following the same prior $N(0, \sigma_u^2)$, but now with a guaranteed mean of zero, by the transformation
$$u_i = \sqrt{\frac{n}{n-1}}\,(u_i^a - \bar{u}^a).$$
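A short simulation shows what the transformation does. It assumes the scaling factor is √(n/(n − 1)), the value needed for each ui to keep marginal variance σu² after the mean is subtracted. The values of `n` and `sigma_u` are hypothetical.

```r
set.seed(3)
n <- 10
sigma_u <- 2

draw_u <- function() {
  ua <- rnorm(n, 0, sigma_u)              # auxiliary effects u^a
  sqrt(n / (n - 1)) * (ua - mean(ua))     # rescaled, centred effects
}

u <- draw_u()
sum(u)                                    # zero up to rounding error

# marginal variance of each u_i over repeated draws
U_rep <- t(replicate(20000, draw_u()))
marg_var <- apply(U_rep, 2, var)
round(marg_var, 2)                        # each close to sigma_u^2 = 4
```

Without the rescaling, each centred effect would have variance σu²(n − 1)/n, slightly smaller than the intended prior variance.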
To illustrate a centred hierarchical prior (Gelfand et al., 1995; Browne et al., 2009), consider two way nested data, with j = 1, … , J repetitions over subjects i = 1, … , n
$$y_{ij} = \mu + \alpha_i + u_{ij},$$
with $\alpha_i \sim N(0, \sigma_\alpha^2)$ and $u_{ij} \sim N(0, \sigma_u^2)$. The centred version defines
$$\kappa_i = \mu + \alpha_i,$$
$$y_{ij} = \kappa_i + u_{ij},$$
so that
$$\kappa_i \sim N(\mu, \sigma_\alpha^2).$$
with $\alpha_i \sim N(0, \sigma_\alpha^2)$, and $\beta_{ij} \sim N(0, \sigma_\beta^2)$. The hierarchically centred version defines
$$z_{ij} = \mu + \alpha_i + \beta_{ij},$$
$$\kappa_i = \mu + \alpha_i,$$
so that
$$z_{ij} \sim N(\kappa_i, \sigma_\beta^2),$$
and
$$\kappa_i \sim N(\mu, \sigma_\alpha^2).$$
Roberts and Sahu (1997) set out the contrasting sets of full conditional densities under the
standard and centred representations and compare Gibbs sampling scanning schemes.
Papaspiliopoulos et al. (2003) compare MCMC convergence for centred, noncentred, and
partially non-centred hierarchical model parameterisations according to the amount of
information the data contain about the latent effects κi = μ + αi. Thus for two-way nested data the (fully) non-centred parameterisation, or NCP for short, involves new random effects $\tilde{\kappa}_i$ with
$$y_{ij} = \tilde{\kappa}_i + \mu + \sigma_u e_{ij},$$
$$\tilde{\kappa}_i = \sigma_\alpha z_i,$$
where eij and zi are standard normal variables. In this form, the latent data $\tilde{\kappa}_i$ and hyperparameter μ are independent a priori, and so the NCP may give better convergence when the latent effects κi are not well identified by the observed data y. A partially non-centred form is obtained using a number w ∈ [0,1], and
$$y_{ij} = \tilde{\kappa}_i^{w} + w\mu + u_{ij},$$
$$\tilde{\kappa}_i^{w} = (1 - w)\mu + \sigma_\alpha z_i,$$
or equivalently,
$$\tilde{\kappa}_i^{w} = (1 - w)\kappa_i + w\tilde{\kappa}_i.$$
Thus w = 0 gives the centred representation, and w = 1 gives the non-centred parameterisation. The optimal w for convergence depends on the ratio σu/σα. The centred representation performs best when σu/σα tends to zero, while the non-centred representation is optimal when σu/σα is large.
to the variance over all chains k = 1, …, K. These factors converge to 1 if all chains are sampling identical distributions, whereas for poorly identified models, variability of sampled parameter values between chains will considerably exceed the variability within any one chain. To apply these criteria, one typically allows a burn-in of B samples while the sampling moves away from the initial values to the region of the posterior. For iterations t = B + 1, … , T + B, a pooled estimate of the posterior variance $\sigma^2_{\theta_h|y}$ of θh is formed from the within-chain variability
$$W_h = \frac{1}{(T-1)K}\sum_{k=1}^{K}\sum_{t=B+1}^{B+T}(\theta_{hk}^{(t)} - \bar{\theta}_{hk})^2,$$
with $\bar{\theta}_{hk}$ being the posterior mean of θh in samples from the kth chain, and where
$$V_h = \frac{T}{K-1}\sum_{k=1}^{K}(\bar{\theta}_{hk} - \bar{\theta}_{h.})^2,$$
denotes between chain variability in θh, with $\bar{\theta}_{h.}$ denoting the pooled average of the $\bar{\theta}_{hk}$. The potential scale reduction factor compares $\sigma^2_{\theta_h|y}$ with the within sample estimate Wh. Specifically, the scale factor is $\hat{R}_h = (\sigma^2_{\theta_h|y}/W_h)^{0.5}$ with values under 1.2 indicating convergence. A multivariate version of the PSRF for vector θ is mentioned by Brooks and Gelman (1998) and Brooks and Roberts (1998) and involves between and within chain covariances Vθ and Wθ, and pooled posterior covariance $\Sigma_{\theta|y}$. The scale factor is defined by
$$R_\theta = \max_b \frac{b'\Sigma_{\theta|y}\,b}{b'W_\theta\,b} = \frac{T-1}{T} + \left(1 + \frac{1}{K}\right)\lambda_1$$
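The univariate scale reduction factor can be sketched as below. The function `psrf` is a hypothetical helper, and the pooling rule used, (T − 1)Wh/T + Vh/T, is the standard Gelman–Rubin form, adopted here as an assumption; it may differ in detail from the pooled estimate intended in the text. The chains are simulated normal draws rather than MCMC output.

```r
# draws: T x K matrix of post burn-in samples, one column per chain
psrf <- function(draws) {
  T_len <- nrow(draws)
  chain_means <- colMeans(draws)
  W <- mean(apply(draws, 2, var))                 # within-chain variability W_h
  V <- T_len * var(chain_means)                   # between-chain variability V_h
  pooled <- (T_len - 1) / T_len * W + V / T_len   # assumed standard pooling
  sqrt(pooled / W)
}

set.seed(5)
same <- matrix(rnorm(4000), 1000, 4)                # four chains, same distribution
shifted <- same + rep(c(0, 0, 3, 3), each = 1000)   # two chains displaced by 3
r_same <- psrf(same)
r_shifted <- psrf(shifted)
c(same = r_same, shifted = r_shifted)
```

The first value should sit near 1, while the displaced chains give a value well above the 1.2 threshold. For real output, `gelman.diag` in coda provides a tested implementation.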
the advent of MCMC methods, conjugate priors were often used in order to reduce the
burden of numeric integration. Now non-conjugate priors (e.g. finite range uniform priors
on standard deviation parameters) are widely used. There may be questions of sensitivity
of posterior inference to the choice of prior, especially for smaller datasets, or for certain
forms of model; examples are the priors used for variance components in random effects
models, the priors used for collections of correlated effects, for example, in hierarchical
spatial models (Bernardinelli et al., 1995), priors in nonlinear models (Millar, 2004), and
priors in discrete mixture models (Green and Richardson, 1997).
In many situations, existing knowledge may be difficult to summarise or elicit in the
form of an “informative prior”. It may be possible to develop suitable priors by simulation
(e.g. Chib and Ergashev, 2009), but it may be convenient to express prior ignorance using
“default” or “non-informative” priors. This is typically less problematic, in terms of posterior sensitivity, for fixed effects, such as regression coefficients (when taken to be homogeneous over cases) than for variance parameters. Since the classical maximum likelihood
estimate is obtained without considering priors on the parameters, a possible heuristic is
that a non-informative prior leads to a Bayesian posterior estimate close to the maximum
likelihood estimate. It might appear that a maximum likelihood analysis would therefore
necessarily be approximated by flat or improper priors, but such priors may actually be
unexpectedly informative about different parameter values (Zhu and Lu, 2004).
A flat or uniform prior distribution on θ, expressible as p(θ) = 1 is often adopted on fixed regression effects, but is not invariant under reparameterisation. For example, it is not true for ϕ = 1/θ that p(ϕ) = 1 as the prior for a function ϕ = g(θ), namely
$$p(\phi) = \left|\frac{d}{d\phi}\,g^{-1}(\phi)\right|,$$
is not constant in ϕ. The Jeffreys prior is instead based on the information matrix I(θ), namely
$$p(\theta) \propto |I(\theta)|^{0.5},$$
where
$$I(\theta) = -E\left(\frac{\partial^2 l(\theta)}{\partial\theta_g\,\partial\theta_h}\right),$$
and l(θ) = log(L(θ|y)) is the log-likelihood. Unlike uniform priors, a Jeffreys prior is invariant under transformation of scale since $I(\theta) = I(g(\theta))(g'(\theta))^2$ and $p(\theta) \propto I(g(\theta))^{0.5}g'(\theta) = p(g(\theta))\,g'(\theta)$ (Kass and Wasserman, 1996, p.1345).
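Two short worked cases may help fix ideas. The first carries the flat prior through the transformation ϕ = 1/θ; the second derives the Jeffreys prior for a binomial probability π, a standard textbook case used here only as an illustration (the symbols n, y and π are not taken from the surrounding text).

```latex
\begin{align*}
\phi = g(\theta) = 1/\theta \;&\Rightarrow\; g^{-1}(\phi) = 1/\phi, \\
p(\phi) &= \left|\frac{d}{d\phi}\,\frac{1}{\phi}\right| = \frac{1}{\phi^{2}} \neq 1, \\[6pt]
y \sim \mathrm{Bin}(n,\pi):\quad l(\pi) &= y\log\pi + (n-y)\log(1-\pi) + \text{const}, \\
I(\pi) &= -E\!\left(\frac{\partial^{2} l(\pi)}{\partial\pi^{2}}\right) = \frac{n}{\pi(1-\pi)}, \\
p(\pi) &\propto I(\pi)^{0.5} \propto \pi^{-1/2}(1-\pi)^{-1/2}.
\end{align*}
```

So a prior that is flat in θ is far from flat in 1/θ, whereas the binomial Jeffreys prior is the Be(0.5, 0.5) density whichever scale is used to derive it.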
1.13.1 Including Evidence
Especially for establishing the intercept (e.g. the average level of a disease), or regression
effects (e.g. the impact of risk factors on disease) or variability in such impacts, it may be pos-
sible to base the prior density on cumulative evidence via meta-analysis of existing studies,
or via elicitation techniques aimed at developing informative priors. This is well established
in engineering risk and reliability assessment, where systematic elicitation approaches such
as maximum-entropy priors are used (Siu and Kelly, 1998; Hodge et al., 2001). Thus, known
constraints for a variable identify a class of possible distributions, and the distribution with
the greatest Shannon–Weaver entropy is selected as the prior. Examples are θ ~ N(m,V), if estimates m and V of the mean and variance are available, or an exponential with mean −q/log(1 − p) if a positive variable has an estimated pth quantile of q.
Simple approximate elicitation methods include the histogram technique, which divides
the domain of an unknown θ into a set of bins, and elicits prior probabilities that θ is
located in each bin. Then p(θ) may be represented as a discrete prior or converted to a
smooth density. Prior elicitation may be aided if a prior is reparameterised in the form
of a mean and prior sample size. For example, beta priors Be(a,b) for probabilities can be expressed as Be(mτ,(1 − m)τ), where m = a/(a + b) and τ = a + b are elicited estimates of the
mean probability and prior sample size. This principle is extended in data augmentation
priors (Greenland and Christensen, 2001), while Greenland (2007) uses the device of a
prior data stratum (equivalent to data augmentation) to represent the effect of binary risk
factors in logistic regressions in epidemiology.
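Both reparameterisations are easy to check numerically. The sketch below uses hypothetical elicited values (a mean probability of 0.2 with prior sample size 10, and a 90th percentile of 5 for a positive variable), and the helper names `beta_from_mean` and `exp_mean` are invented for the illustration.

```r
# Beta prior from an elicited mean m and prior sample size tau
beta_from_mean <- function(m, tau) c(a = m * tau, b = (1 - m) * tau)
ab <- beta_from_mean(m = 0.2, tau = 10)      # hypothetical elicited values
ab
qbeta(c(0.025, 0.975), ab[["a"]], ab[["b"]]) # implied 95% prior interval

# Exponential prior whose p-th quantile is q has mean -q / log(1 - p)
exp_mean <- function(q, p) -q / log(1 - p)
m_exp <- exp_mean(q = 5, p = 0.9)
q_check <- qexp(0.9, rate = 1 / m_exp)       # recovers the elicited quantile
q_check
```

Looking at the implied prior interval is a quick way to confirm with an expert that the elicited mean and prior sample size express the intended degree of certainty.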
If a set of existing studies is available providing evidence on the likely density of a
parameter, these may be used in a form of preliminary meta-analysis to set up an infor-
mative prior for the current study. However, there may be limits to the applicability of
existing studies to the current data, and so pooled information from previous studies may
be downweighted. For example, the precision of the pooled estimate from previous stud-
ies may be scaled downwards, with the scaling factor possibly an extra unknown. When a
maximum likelihood (ML) analysis is simple to apply, one option is to adopt the ML mean
as a prior mean, but with the ML precision matrix downweighted (Birkes and Dodge, 1993).
More comprehensive ways of downweighting historical/prior evidence have been proposed, such as power prior models (Chen et al., 2000; Ibrahim and Chen, 2000). Let 0 ≤ δ ≤ 1 be a scale parameter with beta prior that weights the likelihood of historical data yh relative to the likelihood of the current study data y. Following Chen et al. (2000, p.124), a power prior has the form
$$p(\theta, \delta|y_h) \propto [p(y_h|\theta)]^{\delta}\,p(\theta)\,\delta^{a_\delta - 1}(1 - \delta)^{b_\delta - 1},$$
where p(yh|θ) is the likelihood for the historical data, and (aδ,bδ) are pre-specified beta density hyperparameters. The joint posterior density for (θ,δ) is then
$$p(\theta, \delta|y, y_h) \propto p(y|\theta)\,[p(y_h|\theta)]^{\delta}\,p(\theta)\,\delta^{a_\delta - 1}(1 - \delta)^{b_\delta - 1}.$$
Chen and Ibrahim (2006) demonstrate connections between the power prior and conven-
tional priors for hierarchical models.
Another relevant principle in multiple effect models is that of uniform shrinkage gov-
erning the proportion of total random variation to be assigned to each source of variation
(Daniels, 1999; Natarajan and Kass, 2000). So, for a two-level normal linear model with
with eij ∼ N(0, σ²) and ηj ∼ N(0, τ²), one prior (e.g. inverse gamma) might relate to the residual variance σ², and a second conditional U(0,1) prior relates to the ratio τ²/(τ² + σ²)
of cluster to total variance. A similar effect is achieved in structural time series models
(Harvey, 1989) by considering different forms of signal to noise ratios in state space models
including several forms of random effect (e.g. changing levels and slopes, as well as season
effects). Gustafson et al. (2006) propose a conservative prior for the one-level linear mixed
model
$$y_i \sim N(\eta_i, \sigma^2),$$
$$\eta_i \sim N(\mu, \tau^2),$$
namely a conditional prior p(τ²|σ²) aiming to prevent over-estimation of τ². Thus, in full,
$$p(\tau^2|\sigma^2) = \frac{a}{\sigma^2}\left[1 + \tau^2/\sigma^2\right]^{-(a+1)}.$$
The case a = 1 corresponds to the uniform shrinkage prior of Daniels (1999), where
$$p(\tau^2|\sigma^2) = \frac{\sigma^2}{[\sigma^2 + \tau^2]^2},$$
$$\Sigma = \mathrm{diag}(S)\,R\,\mathrm{diag}(S),$$
where S is a k × 1 vector of standard deviations, and R is a k × k correlation matrix. With the prior sequence, p(R,S) = p(R|S)p(S), Barnard et al. suggest log(S) ~ Nk(ξ,Λ), where Λ is usually diagonal. For the elements rij of R, constrained beta sampling on [−1,1] can be used subject to positive definiteness constraints on Σ. Daniels and Kass (1999) consider the transformation $\eta_{ij} = 0.5\log[(1 - r_{ij})/(1 + r_{ij})]$ and suggest an exchangeable hierarchical shrinkage prior, ηij ~ N(0,τ²), where
$$p(\tau^2) \propto (c + \tau^2)^{-2};$$
$$c = 1/(k - 3).$$
A separation strategy is also facilitated by the LKJ prior of Lewandowski et al. (2009) and
included in the rstan package (McElreath, 2016). While a full covariance prior (e.g. assum-
ing random slopes on all k predictors in a multilevel model) can be applied from the out-
set, MacNab et al. (2004) propose an incremental model strategy, starting with random
intercepts and slopes but without covariation between them, in order to assess for which
predictors there is significant slope variation. The next step applies a full covariance model
only for the predictors showing significant slope variation.
Formal approaches to prior robustness may be based on “contamination” priors. For
instance, one might assume a two group mixture with larger probability 1 − r on the
“main” prior p1(θ), and a smaller probability such as r = 0.1 on a contaminating density p2(θ),
which may be any density (Gustafson, 1996). More generally, a sensitivity analysis may
involve some form of mixture of priors, for example, a discrete mixture over a few alterna-
tives, a fully non-parametric approach (see Chapter 4), or a Dirichlet weight mixture over
a small range of alternatives (e.g. Jullion and Lambert, 2007). A mixture prior can include
the option that the parameter is not present (e.g. that a variance or regression effect is zero).
A mixture prior methodology of this kind for regression effects is presented by George
and McCulloch (1993). Increasingly also, random effects models are selective, including
a default allowing for random effects to be unnecessary (Albert and Chib, 1997; Cai and
Dunson, 2006; Fruhwirth-Schnatter and Tuchler, 2008).
In hierarchical models, the prior specifies the form of the random effects (fully
exchangeable over units or spatially/temporally structured), the density of the random
effects (normal, mixture of normals, etc.), and the third stage hyperparameters. The form
of the second stage prior p(b|θb) amounts to a hypothesis about the nature and form of
the random effects. Thus, a hierarchical model for small area mortality may include spa-
tially structured random effects, exchangeable random effects with no spatial pattern, or
both, as under the convolution prior of Besag et al. (1991). It also may assume normality
in the different random effects, as against heavier tailed alternatives. A prior specifying
the errors as spatially correlated and normal is likely to be a working model assumption,
rather than a true cumulation of knowledge, and one may have several models for p(b|θb)
being compared (Disease Mapping Collaborative Group, 2000), with sensitivity not just
being assessed on the hyperparameters.
Random effect models often start with a normal hyperdensity, and so posterior infer-
ences may be sensitive to outliers or multiple modes, as well as to the prior used on the
hyperparameters. Indications of lack of fit (e.g. low conditional predictive ordinates for par-
ticular cases) may suggest robustification of the random effects prior. Robust hierarchical
models are adapted to pooling inferences and/or smoothing in data, subject to outliers or
other irregularities; for example, Jonsen et al. (2006) consider robust space-time state-space
models with Student t rather than normal errors in an analysis of travel rates of migrating
leatherback turtles. Other forms of robust analysis involve discrete mixtures of random
effects (e.g. Lenk and Desarbo, 2000), possibly under Dirichlet or Polya process models (e.g.
Kleinman and Ibrahim, 1998). Robustification of hierarchical models reduces the chance of
incorrect inferences on individual effects, important when random effects approaches are
used to identify excess risk or poor outcomes (Conlon and Louis, 1999; Marshall et al., 2004).
(e.g. positive recurrence) may be violated (Berger et al., 2005). This may apply even if con-
ditional densities are proper, and Gibbs or other MCMC sampling proceeds apparently
straightforwardly. A simple example is provided by the normal two-level model with sub-
jects i = 1, …, n nested in clusters j = 1, …, J,
$$y_{ij} = \mu + \theta_j + u_{ij},$$
where θj ∼ N(0, τ²) and uij ∼ N(0, σ²). Hobert and Casella (1996) show that the posterior distribution is improper under the prior p(μ, τ, σ) = 1/(σ²τ²), even though the full conditionals have standard forms, namely
$$p(\theta_j|y, \mu, \sigma^2, \tau^2) = N\left(\frac{n(\bar{y}_j - \mu)}{n + \sigma^2/\tau^2},\; \frac{1}{n/\sigma^2 + 1/\tau^2}\right),$$
$$p(\mu|y, \sigma^2, \tau^2, \theta) = N\left(\bar{y} - \bar{\theta},\; \frac{\sigma^2}{nJ}\right),$$
$$p(1/\tau^2|y, \mu, \sigma^2, \theta) = \mathrm{Ga}\left(\frac{J}{2},\; 0.5\sum_j \theta_j^2\right),$$
$$p(1/\sigma^2|y, \mu, \tau^2, \theta) = \mathrm{Ga}\left(\frac{nJ}{2},\; 0.5\sum_{ij}(y_{ij} - \mu - \theta_j)^2\right),$$
Priors that are just proper mathematically (e.g. gamma priors on 1/τ2 with small scale
and shape parameters) are often used on the grounds of expediency, and justified as letting
the data speak for themselves. However, such priors may cause identifiability problems as
the posteriors are close to being empirically improper. This impedes MCMC convergence
(Kass and Wasserman, 1996; Gelfand and Sahu, 1999). Furthermore, using just proper pri-
ors on variance parameters may in fact favour particular values, despite being suppos-
edly only weakly informative. Gelman (2006) suggests possible (less problematic) options
including a finite range uniform prior on the standard deviation (rather than variance),
and a positive truncated t density.
1.14 Computational Notes
[1] In Example 1.1, the data are generated (n = 1000 values) and underlying parameters
are estimated as follows:
library(mcmcse)   # mcse.mat, ess, multiESS
library(MASS)     # kde2d
library(coda)     # effectiveSize
# generate data
set.seed(1234)
y = rnorm(1000,3,5)
# initial vector setting and parameter values
T = 10000; B = T/10; B1=B+1
mu = sig = numeric(T)
# initial parameter values
mu[1] = 0
sig[1] = 1
# separate uniform draws for the mu and sigma accept/reject steps
u.mu = runif(T); u.sig = runif(T)
# rejection counter
REJmu = 0; REJsig = 0
# log posterior density (up to a constant)
logpost = function(mu,sig){
loglike = sum(dnorm(y,mu,sig,log=TRUE))
return(loglike - log(sig))}
# sampling loop
for (t in 2:T) {
mut = mu[t-1]; sigt = sig[t-1]
# uniform proposals with kappa = 0.5
mucand = mut + runif(1,-0.5,0.5)
sigcand = abs(sigt + runif(1,-0.5,0.5))
alph.mu = logpost(mucand,sigt)-logpost(mut,sigt)
if (log(u.mu[t]) <= alph.mu) mu[t] = mucand
else {mu[t] = mut; REJmu = REJmu+1}
alph.sig = logpost(mu[t],sigcand)-logpost(mu[t],sigt)
if (log(u.sig[t]) <= alph.sig) sig[t] = sigcand
else {sig[t] <- sigt; REJsig <- REJsig+1}}
# sequence of sampled values and ACF plots
plot(mu)
plot(sig)
acf(mu,main="acf plot, mu")
acf(sig,main="acf plot, sig")
# posterior summaries
summary(mu[B1:T])
summary(sig[B1:T])
# Monte Carlo standard errors
D=data.frame(mu[B1:T],sig[B1:T])
mcse.mat(D)
# acceptance rates
ACCmu=1-REJmu/T
ACCsig=1-REJsig/T
cat("Acceptance Rate mu =",ACCmu,"\n")
cat("Acceptance Rate sigma = ",ACCsig, "\n")
# kernel density plots
plot(density(mu[B1:T]),main= "Density plot for mu posterior")
plot(density(sig[B1:T]),main= "Density plot for sigma posterior ")
f1=kde2d(mu[B1:T], sig[B1:T], n=50, lims=c(2.5,3.4,4.7,5.3))
# colour version
filled.contour(f1,main="Figure 1.1 Bivariate Density", xlab="mu", ylab="sigma",
color.palette=colorRampPalette(c('white','blue','yellow','red','darkred')))
# greyscale version
filled.contour(f1,main="Figure 1.1 Bivariate Density",xlab="mu", ylab="sigma",
color.palette=colorRampPalette(c('white','lightgray','gray','darkgray','black')))
# estimates of effective sample sizes
effectiveSize(mu[B1:T])
effectiveSize(sig[B1:T])
ess(D)
multiESS(D)
# posterior probability on hypothesis μ < 3
sum(mu[B1:T] < 3)/(T-B)
[2] The R code for Metropolis sampling of the extended logistic model is
library(coda)
# data
w = c(1.6907, 1.7242, 1.7552, 1.7842, 1.8113, 1.8369, 1.8610, 1.8839)
n = c(59, 60, 62, 56, 63, 59, 62, 60)
y = c(6, 13, 18, 28, 52, 53, 61, 60)
# posterior density
f = function(mu,th2,th3) {
# settings for priors
a0=0.25; b0=0.25; c0=2; d0=10; e0=2.004; f0=0.001
V = exp(th3)
m1 = exp(th2)
sig = sqrt(V)
x = (w-mu)/sig
xt = exp(x)/(1+exp(x))
h = xt^m1
loglike = y*log(h)+(n-y)*log(1-h)
# prior ordinates
logpriorm1 = a0*th2-m1*b0
logpriorV = -e0*th3-f0/V
logpriormu = -0.5*((mu-c0)/d0)^2-log(d0)
logprior = logpriormu+logpriorV+logpriorm1
# log posterior
f = sum(loglike)+logprior}
# main MCMC loop
runMCMC = function(samp,mu,th2,th3,T,sd) {
for (i in 2:(T+1)) {
# candidates for mu
mucand = mu[i-1]+sd[1]*rnorm(1,0,1)
f.cand = f(mucand,th2[i-1],th3[i-1])
f.curr = f(mu[i-1], th2[i-1],th3[i-1])
if (log(runif(1)) <= f.cand-f.curr) mu[i] = mucand else
{mu[i] = mu[i-1]}
# candidates for log(m1)
th2cand = th2[i-1]+sd[2]*rnorm(1,0,1)
f.cand = f(mu[i],th2cand,th3[i-1])
f.curr = f(mu[i],th2[i-1], th3[i-1])
if (log(runif(1)) <= f.cand-f.curr) th2[i] = th2cand else
{th2[i] = th2[i-1]}
# candidates for log(V)
th3cand = th3[i-1]+sd[3]*rnorm(1,0,1)
f.cand = f(mu[i],th2[i],th3cand)
f.curr = f(mu[i],th2[i],th3[i-1])
if (log(runif(1)) <= f.cand-f.curr) th3[i] = th3cand else
{th3[i] = th3[i-1]}
samp[i-1.1] = mu[i]; samp[i-1.2] = exp(th2[i]); samp[i-1.3] =
exp(th3[i])}
return(samp)}
# number of iterations
T=100000
# warm-up samples
B=50000
B1=B+1
R=T-B
mu=th3=th2=numeric(T)
sd=acc=numeric(3)
# metropolis proposal standard devns
sd[1] = 0.01; sd[2] = 0.2; sd[3] = 0.4
# accumulate samples
samp = matrix(,T,3)
# initial parameter values
mu[1] = 0; th2[1]= 0; th3[1] =0
samp[1,1] = mu[1]; samp[1,2] = exp(th2[1]); samp[1,3] = exp(th3[1])
# first chain
chain1=runMCMC(samp,mu,th2,th3,T,sd)
chain1=chain1[B1:T,]
# posterior summary
quantile(chain1[1:R,1], probs=c(.025,0.5,0.975))
quantile(chain1[1:R,2], probs=c(.025,0.5,0.975))
quantile(chain1[1:R,3], probs=c(.025,0.5,0.975))
# second chain
chain2=runMCMC(samp,mu,th2,th3,T,sd)
chain2=chain2[B1:T,]
# posterior summary
quantile(chain2[1:R,1], probs=c(.025,0.5,0.975))
quantile(chain2[1:R,2], probs=c(.025,0.5,0.975))
quantile(chain2[1:R,3], probs=c(.025,0.5,0.975))
# combine chains
chain1=as.mcmc(chain1)
chain2=as.mcmc(chain2)
combchains = mcmc.list(chain1, chain2)
gelman.diag(combchains)
crosscorr(combchains)
accsum = "Acceptance rates: mu, m1, and sigma^2"
print(accsum)
1 - rejectionRate(combchains)
effectiveSize(combchains)
autocorr.diag(combchains)
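The scale reduction factor reported by gelman.diag compares between-chain and within-chain variability. A simplified version can be sketched in base R as follows. The helper name psrf_basic and the simulated chains are assumptions made for illustration; coda applies an additional degrees-of-freedom correction, so its values will differ slightly from this basic ratio.

```r
# Basic potential scale reduction factor for one parameter.
# chains: matrix with one column per chain and one row per retained draw.
# psrf_basic is a hypothetical helper, not part of coda.
psrf_basic <- function(chains) {
  n <- nrow(chains)
  W <- mean(apply(chains, 2, var))        # mean within-chain variance
  B_over_n <- var(colMeans(chains))       # variance of the chain means
  var_plus <- (n - 1) / n * W + B_over_n  # pooled estimate of posterior variance
  sqrt(var_plus / W)
}

set.seed(2)
# two chains sampling the same distribution: ratio close to 1
mixed <- cbind(rnorm(5000), rnorm(5000))
rhat_mixed <- psrf_basic(mixed)
# two chains stuck in different regions: ratio well above 1
stuck <- cbind(rnorm(5000, mean = 0), rnorm(5000, mean = 2))
rhat_stuck <- psrf_basic(stuck)
```

Values close to 1 are consistent with the chains sampling a common distribution, while the second case shows how a difference in chain means inflates the ratio.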
library(rstan)
library(bayesplot)
library(coda)
# data
w = c(1.6907, 1.7242, 1.7552, 1.7842, 1.8113, 1.8369, 1.8610, 1.8839)
n = c(59, 60, 62, 56, 63, 59, 62, 60)
y = c(6, 13, 18, 28, 52, 53, 61, 60)
D=list(y=y,n=n,w=w,N=8)
# rstan code
model ="
data {
int<lower=0> N;
int n[N];
int y[N];
real w[N];
}
parameters {
real <lower=0> mu;
real log_sigma;
real log_m1;
}
transformed parameters {
real<lower=0> sigma;
real<lower=0> sigma2;
real<lower=0> m1;
real x[N];
real pi[N];
sigma=exp(log_sigma);
sigma2=sigma^2;
m1=exp(log_m1);
for (i in 1:N) {x[i]=(w[i]-mu)/sigma;}
for (i in 1:N) {pi[i]=pow(exp(x[i])/(1+exp(x[i])),m1);}
}
model {
log_sigma ~normal(0,5);
mu ~normal(2,3.16);
log_m1 ~normal(0,1);
[5] There are J+2 unknowns in the R code for implementing these Gibbs updates (N.B. the
$\sigma_j^2$ are not unknowns). There are T=20000 MCMC samples to be accumulated
in the matrix samps. With a = b = 0.1 in the prior for $1/\tau^2$, and calling on coda
routines for posterior summaries, one has
library(coda)
# data
y=c(28,8,-3,7,-1,1,18,12)
sigma=c(15,10,16,11,9,11,10,18)
sigma2 = sigma^2
J = 8
# total MCMC iterations
T = 20000
# ten unknowns (eight effects, plus their mean and variance)
samps = matrix(, T, 10)
colnames(samps) <- c("mu","tau","Sch1","Sch2","Sch3","Sch4","Sch5",
"Sch6","Sch7","Sch8")
# starting values
mu=mean(y)
tau2=median(sigma2)
# sampling loop
for (t in 1:T) {th.mean=(y/sigma2+mu/tau2)/(1/sigma2+1/tau2)
th.sd=sqrt(1/(1/sigma2+1/tau2))
theta=rnorm(J,th.mean,th.sd)
mu=rnorm(1,mean(theta),sqrt(tau2/J))
# prior on random effects precision
invtau2=rgamma(1,J/2+0.1,sum((theta-mu)^2)/2+0.1)
tau2 = 1/invtau2
tau = sqrt(tau2)
# accumulate samples
samps[t,3:10] = theta
samps[t,1] =mu
samps[t,2] =tau}
# posterior summary
summary(as.mcmc(samps))
post.mn = apply(samps,2,mean)
post.sd = apply(samps,2,sd)
post.median = apply(samps,2,median)
post.95=apply(samps, 2, quantile, probs = c(0.95))
post.05=apply(samps, 2, quantile, probs = c(0.05))
# trace and density plots
plot(as.mcmc(samps))
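The update for the school effects in the sampling loop above is a precision-weighted average of each observed effect and the population mean. A minimal sketch of that single step is given below, with mu and tau2 held at illustrative values (8 and 100, chosen here purely as assumptions and not posterior estimates).

```r
# One conditional update for the school effects, holding mu and tau2 fixed.
y <- c(28, 8, -3, 7, -1, 1, 18, 12)
sigma2 <- c(15, 10, 16, 11, 9, 11, 10, 18)^2
mu <- 8       # illustrative value only
tau2 <- 100   # illustrative value only
# weight given to the data: a larger sampling variance means more shrinkage
w <- (1 / sigma2) / (1 / sigma2 + 1 / tau2)
# conditional mean and standard deviation, as in the sampling loop
th.mean <- w * y + (1 - w) * mu
th.sd <- sqrt(1 / (1 / sigma2 + 1 / tau2))
round(cbind(y, weight = w, th.mean, th.sd), 2)
```

Each conditional mean lies between the observed effect and mu, and each conditional standard deviation is smaller than the corresponding sampling standard deviation, which is the pooling of strength that the hierarchical prior provides.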
References
Albert J (2007) Bayesian Computation with R. Springer.
Albert J, Chib S (1993) Bayesian analysis of binary and polychotomous response data. Journal of the
American Statistical Association, 88, 669–679.
Albert J, Chib S (1997) Bayesian tests and model diagnostics in conditionally independent hierarchical
models. Journal of the American Statistical Association, 92, 916–925.
Altaleb A, Chauveau D (2002) Bayesian analysis of the logit model and comparison of two Metropolis–
Hastings strategies. Computational Statistics & Data Analysis, 39, 137–152.
Andrieu C, Moulines E (2006) On the ergodicity properties of some adaptive MCMC algorithms.
Annals of Applied Probability, 16(3), 1462–1505.
Barnard J, McCulloch R, Meng X (2000) Modeling covariance matrices in terms of standard deviations
and correlations, with applications to shrinkage. Statistica Sinica, 10, 1281–1311.
Bedard M (2008) Optimal acceptance rates for Metropolis algorithms: Moving beyond 0.234. Stochastic
Processes and their Applications, 118(12), 2198–2222.
Berger J, Bernardo J (1992) On the development of reference priors, in Bayesian Statistics 4, pp 35–60,
eds J Bernardo, J Berger, A Dawid, A Smith. Clarendon Press, Oxford.
Berger J, Strawderman W, Tang D (2005) Posterior propriety and admissibility of hyperpriors in normal
hierarchical models. Annals of Statistics, 33, 606–646.
Bernardinelli L, Clayton D, Montomoli C (1995) Bayesian estimates of disease maps: How important
are priors? Statistics in Medicine, 14, 2411–2431.
Besag J, Green P, Higdon D, Mengersen K (1995) Bayesian computation and stochastic systems.
Statistical Science, 10(1), 103–166.
Besag J, York J, Mollie A (1991) Bayesian image restoration, with two applications in spatial statistics.
Annals of the Institute of Statistical Mathematics, 43, 1–21.
Birkes D, Dodge Y (1993) Alternative Methods of Regression. John Wiley.
Brooks S, Gelman A (1998) Alternative methods for monitoring convergence of iterative simulations.
Journal of Computational and Graphical Statistics, 7, 434–456.
Brooks S, Roberts G (1998) Convergence assessment techniques for Markov chain Monte Carlo.
Statistics and Computing, 8, 319–335.
Browne W, Steele F, Golalizadeh M (2009) The use of simple reparameterizations to improve the
efficiency of Markov chain Monte Carlo estimation for multilevel models with applications to
discrete time survival models. Journal of the Royal Statistical Society: Series A, 172, 579–598.
Cai B, Dunson D (2006) Bayesian covariance selection in generalized linear mixed models. Biometrics,
62, 446–457.
Carlin B, Gelfand A (1991) An iterative Monte Carlo method for nonconjugate Bayesian analysis.
Statistics and Computing, 1(2), 119–128.
Carpenter B, Gelman A, Hoffman M, Lee D, Goodrich B, Betancourt M, Brubaker M, Guo J, Li P,
Riddell A (2017) Stan: A probabilistic programming language. Journal of Statistical Software,
76(1), 1–32.
Chen M, Wang X (2011) Approximate predictive densities and their applications in generalized linear
models. Computational Statistics & Data Analysis, 55(4), 1570–1580.
Chen M-H, Ibrahim J (2006) The relationship between the power prior and hierarchical models.
Bayesian Analysis, 1, 551–574.
Chen M-H, Ibrahim J, Shao Q-M (2000) Power prior distributions for generalized linear models.
Journal of Statistical Planning and Inference, 84, 121–137.
Chen M-H, Shao Q-M (1998) Monte Carlo estimation of Bayesian credible and HPD intervals. Journal
of Computational & Graphical Statistics, 8(1), 69–92.
Chiang J, Chib S, Narasimhan C (1999) Markov chain Monte Carlo and models of consideration set
and parameter heterogeneity. Journal of Econometrics, 89, 223–248.
Chib S (2001) Monte Carlo methods and Bayesian computation: Overview, in International Encyclopedia
of the Social & Behavioral Sciences. https://doi.org/10.1016/B0-08-043076-7/00467-8
Chib S, Ergashev B (2009) Analysis of multifactor affine yield curve models. Journal of the American
Statistical Association, 104(488), 1324–1337.
Chib S, Greenberg E (1995) Understanding the Metropolis-Hastings algorithm. The American
Statistician, 49, 327–335.
Chib S, Jeliazkov I (2006) Inference in semiparametric dynamic models for binary longitudinal data.
Journal of the American Statistical Association, 101, 685–700.
Clark J, Gelfand A (eds) (2006) Hierarchical Modelling for the Environmental Sciences: Statistical Methods
and Applications. Oxford University Press.
Clayton D (1996) Generalized linear mixed models, in: Markov Chain Monte Carlo in Practice, eds W
Gilks, S Richardson, D Spiegelhalter. Chapman & Hall, London, UK.
Congdon P (2003) Applied Bayesian Modelling. Wiley, Chichester, UK.
Conlon E, Louis T (1999) Addressing multiple goals in evaluating region-specific risk using Bayesian
methods, pp 31–47, in Disease Mapping and Risk Assessment for Public Health, eds A Lawson, A
Biggeri, D Bohning, E Lesaffre, J Viel, R Bertollini. Wiley.
Cressie N, Calder C A, Clark J S, Hoef J M V, Wikle C K (2009) Accounting for uncertainty in ecological
analysis: The strengths and limitations of hierarchical statistical modeling. Ecological
Applications, 19(3), 553–570.
Daniels M (1999) A prior for the variance in hierarchical models. Canadian Journal of Statistics, 27,
569–580.
Daniels M, Kass R (1999) Nonconjugate Bayesian Estimation of Covariance matrices and its use in
hierarchical models. Journal of the American Statistical Association, 94, 1254–1263.
Davidian M, Giltinan D M (2003) Nonlinear models for repeated measures data: An overview and
update. Journal of Agricultural, Biological, and Environmental Statistics, 8, 387–419.
Deely J, Smith A (1998) Quantitative refinements for comparisons of institutional performance.
Journal of the Royal Statistical Society, Series A, 161, 5–12.
Disease Mapping Collaborative Group (2000) Disease mapping models: An empirical evaluation.
Statistics in Medicine, 19, 2217–2241.
Dunson D (2001) Commentary: Practical advantages of Bayesian analysis of epidemiologic data.
American Journal of Epidemiology, 153, 1222–1226.
Fahrmeir L, Knorr-Held L (1997) Dynamic discrete-time duration models. Sociological Methodology,
27, 417–452.
Fox J-P (2010) Bayesian Item Response Modeling: Theory and Applications. Springer.
Fruhwirth-Schnatter S, Tuchler R (2008) Bayesian parsimonious covariance estimation for hierarchical
linear mixed models. Statistics & Computing, 18, 1–13.
Fuglstad G, Simpson D, Lindgren F, Rue H (2018) Constructing priors that penalize the complexity of
Gaussian random fields. Journal of the American Statistical Association, 114(525), 445–452.
Gelfand A, Sahu S (1999) Identifiability, improper priors, and Gibbs sampling for generalized linear
models. Journal of the American Statistical Association, 94, 247–253.
Gelfand A, Sahu S, Carlin B (1995) Efficient parameterization for normal linear mixed models.
Biometrika, 82, 479–488.
Gelfand A, Sahu S, Carlin B (1996) Efficient parameterizations for generalised linear models, in
Bayesian Statistics 5, pp 165–180, eds J Bernardo, J Berger, A Dawid, A Smith. Clarendon Press,
Oxford, UK.
Gelfand A, Smith A (1990) Sampling-based approaches to calculating marginal densities. Journal of
the American Statistical Association, 85, 398–409.
Gelman A (2006) Prior distributions for variance parameters in hierarchical models. Bayesian Analysis,
1, 515–533.
Gelman A, Rubin D (1996) Markov chain Monte Carlo methods in biostatistics. Statistical Methods in
Medical Research, 5, 339–355.
Gelman A, Stern H, Carlin J, Dunson D, Vehtari A, Rubin D (2014) Bayesian Data Analysis, 3rd Edition.
Chapman and Hall/CRC.
Gelman A, van Dyk D, Huang Z, Boscardin J (2008) Using redundant parameterizations to fit hierarchical
models. Journal of Computational and Graphical Statistics, 17, 95–122.
George E, Makov U, Smith A (1993) Conjugate likelihood distributions. Scandinavian Journal of
Statistics, 20, 147–156.
George E, McCulloch R (1993) Variable selection via Gibbs sampling. Journal of the American Statistical
Association, 88(423), 881–889.
Geweke J (1992) Evaluating the accuracy of sampling-based approaches to calculating posterior
moments, in Bayesian Statistics, Volume 4. eds J Bernardo, J Berger, A Dawid, A Smith. Oxford
University Press, New York.
Geweke J (1993) Bayesian treatment of the Student’s-t linear model. Journal of Applied Econometrics, 8, S19–S40.
Geyer C, Thompson E (1995) Annealing Markov chain Monte Carlo with applications to ancestral
inference. Journal of the American Statistical Association, 90, 909–920.
Ghosh J (2008) Efficient Bayesian Computation and Model Search in Linear Hierarchical Models.
PhD Thesis ISDS, Duke University.
Gilks W (1996) Full conditional distributions, in Markov Chain Monte Carlo in Practice, pp 75–88, eds
W Gilks, S Richardson, D Spiegelhalter. Chapman and Hall, London, UK.
Gilks W, Richardson S, Spiegelhalter D (1996) Introducing Markov chain Monte Carlo, in Markov
Chain Monte Carlo in Practice, pp 1–19, eds W Gilks, S Richardson, D Spiegelhalter. Chapman
and Hall, London, UK.
Gilks W, Wang C, Yvonnet B, Coursaget P (1993) Random-effects models for longitudinal data using
Gibbs sampling. Biometrics, 38, 963–974.
Goldstein H, Spiegelhalter D (1996) League tables and their limitations: Statistical issues in comparisons
of institutional performance. Journal of the Royal Statistical Society: Series A (Statistics in
Society), 159(3), 385–409.
Green P, Richardson S (1997) On Bayesian analysis of mixtures with an unknown number of components.
Journal of the Royal Statistical Society: Series B, 59, 731–792.
Greenland S (2007) Bayesian perspectives for epidemiological research. II. Regression analysis.
International Journal of Epidemiology, 36, 195–202.
Greenland S, Christensen R (2001) Data augmentation priors for Bayesian and semi-Bayes analyses of
conditional-logistic and proportional-hazards regression. Statistics in Medicine, 20, 2421–2428.
Gustafson P (1996) Local sensitivity of inferences to prior marginals. Journal of the American Statistical
Association, 91, 774–781.
Gustafson P, Hossain S, MacNab Y (2006) Conservative priors for hierarchical models. Canadian
Journal of Statistics, 34, 377–390.
Hadjicostas P, Berry S (1999) Improper and proper posteriors with improper priors in a Poisson-
gamma hierarchical model. Test, 8, 147–166.
Harvey A (1989) Structural Time Series Models and the Kalman Filter. Cambridge University Press.
Hastings W (1970) Monte-Carlo sampling methods using Markov Chains and their applications.
Biometrika, 57, 97–109.
Hobert J, Casella G (1996) The effect of improper priors on Gibbs sampling in hierarchical linear
mixed models. Journal of the American Statistical Association, 91, 1461–1473.
Hodge R, Evans M, Marshall J, Quigley J, Walls L (2001) Eliciting engineering knowledge about reliability
during design-lessons learnt from implementation. Quality and Reliability Engineering
International, 17, 169–179.
Hoffman M, Gelman A (2014) The No-U-turn sampler: Adaptively setting path lengths in Hamiltonian
Monte Carlo. Journal of Machine Learning Research, 15(1), 1593–1623.
Hyndman R (1996) Computing and graphing highest density regions. American Statistician, 50,
361–365.
Ibrahim J, Chen M-H (2000) Power prior distributions for regression models. Statistical Science, 15,
46–60.
Jeffreys H (1961) Theory of Probability, 3rd Edition. Oxford University Press, Clarendon Press, Oxford,
UK.
Johannes M, Polson N (2006) MCMC methods for continuous-time financial econometrics, in
Handbook of Financial Econometrics, eds Y Ait-Sahalia, L Hansen. North Holland, Amsterdam.
Jonsen I, Myers R, James M (2006) Robust hierarchical state–space models reveal diel variation in
travel rates of migrating leatherback turtles. Journal of Animal Ecology, 75, 1046–1057.
Jullion A, Lambert P (2007) Robust specification of the roughness penalty prior distribution in
spatially adaptive Bayesian P-splines models. Computational Statistics and Data Analysis, 51,
2542–2558.
Kass R, Carlin B, Gelman A, Neal R (1998) Markov chain Monte Carlo in practice: A round table discussion.
The American Statistician, 52, 93–100.
Kass R, Wasserman L (1996) The selection of prior distributions by formal rules. Journal of the American
Statistical Association, 91, 1343–1370.
Kleinman K, Ibrahim J (1998) A semiparametric Bayesian approach to the random effects model.
Biometrics, 54, 921–938.
Klement R, Bandyopadhyay P, Champ C, Walach H (2018) Application of Bayesian evidence synthesis
to modelling the effect of ketogenic therapy on survival of high grade glioma patients.
Theoretical Biology and Medical Modelling, 15(1), 12.
Knorr-Held L, Rainer E (2001) Projections of lung cancer mortality in West Germany: A case study in
Bayesian prediction. Biostatistics, 2, 109–129.
Koop G (2003) Bayesian Econometrics. John Wiley.
Krypotos A, Blanken T, Arnaudova I, Matzke D, Beckers T (2017) A primer on Bayesian analysis for
experimental psychopathologists. Journal of Experimental Psychopathology, 8(2), jep-057316.
Laird N, Louis T (1989) Empirical Bayes confidence intervals for a series of related experiments.
Biometrics, 45(2), 481–495.
Lenk P, DeSarbo W (2000) Bayesian inference for finite mixture models of generalized linear models
with random effects. Psychometrika, 65, 475–496.
Lewandowski D, Kurowicka D, Joe H (2009) Generating random correlation matrices based on vines
and extended onion method. Journal of Multivariate Analysis, 100(9), 1989–2001.
Liechty J, Liechty M, Muller P (2004) Bayesian correlation estimation. Biometrika, 91, 1–14.
Lindley D, Smith A (1972) Bayes estimates for the linear model. Journal of the Royal Statistical
Society, B34, 1–41.
MacNab Y, Qiu Z, Gustafson P, Dean C, Ohlsson A, Lee S (2004) Hierarchical Bayes analysis of multilevel
health services data: A Canadian neonatal mortality study. Health Services and Outcomes
Research Methodology, 5, 5–26.
Marshall C, Best N, Bottle A, Aylin P (2004) Statistical issues in the prospective monitoring of health
outcomes across multiple units. Journal of the Royal Statistical Society: Series A, 167, 541–559.
Marshall E, Spiegelhalter D (1998) Comparing institutional performance using Markov chain Monte
Carlo methods, pp 229–249, in Statistical Analysis of Medical Data: New Developments, eds B
Everitt, G Dunn. Arnold, London, UK.
McElreath R (2016) Statistical Rethinking: A Bayesian Course with Examples in R and Stan. CRC Press.
Mengersen K, Tweedie R (1996) Rates of convergence of the Hastings and Metropolis algorithms. The
Annals of Statistics, 24, 101–121.
Millar R (2004) Sensitivity of Bayes estimators to hyper-parameters, with an application to maximum
yield from fisheries. Biometrics, 60, 536–542.
Molenberghs G, Verbeke G, Demetrio C (2007) An extended random-effects approach to modelling
repeated, overdispersed count data. Lifetime Data Analysis, 13, 513–531.
Monnahan C C, Thorson J T, Branch T A (2017) Faster estimation of Bayesian models in ecology using
Hamiltonian Monte Carlo. Methods in Ecology and Evolution, 8(3), 339–348.
Natarajan R, Kass R (2000) Reference Bayesian methods for generalized linear mixed models. Journal
of the American Statistical Association, 95, 227–237.
Neal R (2011) MCMC using Hamiltonian dynamics, Chapter 5, in Handbook of Markov Chain Monte
Carlo, eds S Brooks, A Gelman, G Jones, X-L Meng. CRC Press.
Oravecz Z, Muth C (2018) Fitting growth curve models in the Bayesian framework. Psychonomic
Bulletin and Review, 25(1), 235–255.
Paap R (2002) What are the advantages of MCMC based inference in latent variable models? Statistica
Neerlandica, 56, 2–22.
Palmer J, Pettit L (1996) Risks of using improper priors with Gibbs sampling and autocorrelated
errors. Journal of Computational and Graphical Statistics, 5, 245–249.
Papaspiliopoulos O, Roberts G, Skold M (2003) Non-centered parameterisations for hierarchical
models and data augmentation, pp 307–326, in Bayesian Statistics 7, eds J Bernardo, S Bayarri, J
Berger, A Dawid, D Heckerman, A Smith, M West. Oxford University Press.
Raftery A (1996) Approximate Bayes factors and accounting for model uncertainty in generalized
linear models. Biometrika, 83, 251–266.
Raftery A, Lewis S (1992) One long run with diagnostics: Implementation strategies for Markov
chain Monte Carlo. Statistical Science, 7, 493–497.
Raftery A, Lewis S (1996) The number of iterations, convergence diagnostics and generic Metropolis
algorithms, in Practical Markov Chain Monte Carlo, eds W Gilks, D Spiegelhalter, S Richardson.
Chapman & Hall, London, UK.
Robert C (2015) The Metropolis–Hastings Algorithm. Wiley StatsRef: Statistics Reference Online, pp
1–15. https://onlinelibrary.wiley.com/doi/full/10.1002/9781118445112.stat07834
Robert C, Elvira V, Tawn N, Wu C (2018) Accelerating MCMC algorithms. WIRES Computational
Statistics, 10, e1435.
Roberts G, Gelman A, Gilks W (1997) Weak convergence and optimal scaling of random walk
metropolis algorithms. The Annals of Applied Probability, 7, 110–120.
Roberts G, Rosenthal J (2004) General state space Markov chains and MCMC algorithms. Probability
Surveys, 1, 20–71.
Roberts G, Sahu S (1997) Updating schemes, correlation structures, blocking and parameterization of
the Gibbs sampler. Journal of the Royal Statistical Society B, 59, 291–317.
Roberts G, Sahu S (2001) Approximate predetermined convergence properties of the Gibbs sampler.
Journal of Computational and Graphical Statistics, 10, 216–229.
Roberts G, Tweedie R (1996) Geometric convergence and central limit theorems for multidimensional
Hastings and Metropolis algorithms. Biometrika, 83, 95–110.
Rodrigues A, Assuncao R (2008) Propriety of posterior in Bayesian space varying parameter models
with normal data. Statistics & Probability Letters, 78, 2408–2411.
Rue H, Martino S, Chopin N (2009) Approximate Bayesian inference for latent Gaussian models
using integrated nested Laplace approximations. Journal of the Royal Statistical Society, Series B,
71(2), 319–392.
Sargent D (1998) A general framework for random effects survival analysis in the Cox proportional
hazards setting. Biometrics, 54(4), 1486–1497.
Scollnik D (2002) Implementation of four models for outstanding liabilities in WinBUGS: A discussion
of a paper by Ntzoufras and Dellaportas (2002). North American Actuarial Journal, 6, 128–136.
Shen W, Louis T (1998) Triple-goal estimates in two-stage hierarchical models. Journal of the Royal
Statistical Society: Series B, 60, 455–471.
Sherlock C, Fearnhead P, Roberts G (2010) The random walk Metropolis: Linking theory and practice
through a case study. Statistical Science, 25(2), 172–190.
Shoemaker J, Painter I, Weir B (1999) Bayesian statistics in genetics: A guide for the uninitiated. Trends
in Genetics, 15, 354–358.
Simpson D, Rue H, Riebler A, Martins T, Sørbye S (2017) Penalising model component complexity: A
principled, practical approach to constructing priors. Statistical Science, 32(1), 1–28.
Sinharay S, Stern H (2005) An empirical comparison of methods for computing Bayes factors in
generalized linear mixed models. Journal of Computational and Graphical Statistics, 14, 415–435.
Siu N, Kelly D (1998) Bayesian parameter estimation in probabilistic risk assessment. Reliability
Engineering and System Safety, 62, 89–116.
Spiegelhalter D (2004) Incorporating Bayesian Ideas into Health-Care evaluation. Statistical Science,
19, 156–174.
Sun D, Speckman P, Tsutakawa R (2000) Random effects in generalized linear mixed models
(GLMMs), pp 23–39, in Generalized Linear Models: A Bayesian Perspective, eds D Dey, S Ghosh, B
Mallick. Dekker, New York.
Sun D, Tsutakawa R, Speckman P (1999) Posterior distribution of hierarchical models using CAR(1)
distributions. Biometrika, 86, 341–350.
Tierney L (1994) Markov Chains for exploring posterior distributions. Annals of Statistics, 21,
1701–1762.
Toft N, Innocent G, Gettinby G, Reid S (2007) Assessing the convergence of Markov Chain Monte
Carlo methods: An example from evaluation of diagnostic tests in absence of a gold standard.
Preventive Veterinary Medicine, 79, 244–256.
van Dyk D (2003) Hierarchical models, data augmentation, and Markov chain Monte Carlo, pp 41–56,
in Statistical Challenges in Modern Astronomy III, eds G Babu, E Feigelson. Springer, New York.
Vanpaemel W (2011) Constructing informative model priors using hierarchical methods. Journal of
Mathematical Psychology, 55(1), 106–117.
Vines S, Gilks W, Wild P (1996) Fitting Bayesian multiple random effects models. Statistics and
Computing, 6, 337–346.
Wetzels R, van Ravenzwaaij D, Wagenmakers E (2014) Bayesian analysis, in The Encyclopedia of Clinical
Psychology, eds R Cautin, S Lilienfeld. Wiley-Blackwell, Hoboken, NJ.
Wikle C (2003) Hierarchical models in environmental science. International Statistical Review, 71,
181–199.
Willink R, Lira I (2005) A united interpretation of different uncertainty intervals. Measurement, 38,
61–66.
Yu B, Mykland P (1998) Looking at Markov samplers through cusum path plots: A simple diagnostic
idea. Statistics and Computing, 8(3), 275–286.
Yue Y, Speckman P, Sun D (2012) Priors for Bayesian adaptive spline smoothing. Annals of the Institute
of Statistical Mathematics, 64(3), 577–613.
Zhu M, Lu A (2004) The counter-intuitive non-informative prior for the Bernoulli family. Journal of
Statistics Education [Online], 12(2).
Zuur G, Garthwaite P, Fryer R (2002) Practical use of MCMC methods: Lessons from a case study.
Biometrical Journal, 44, 433–455.
2
Bayesian Analysis Options in R, and
Coding for BUGS, JAGS, and Stan
2.1 Introduction
R, available at https://cran.r-project.org/, is an integrated suite of software facilities for
data manipulation, statistical analysis, and graphical display (R Core Team, 2016). The
advantages of the R environment for Bayesian analysis are considerable, including access
to extensive graphical capabilities (e.g. ggplot2) and data manipulation facilities; a range
of posterior diagnostic and summarisation tools; and the ability to obtain classical estimates
in tandem with a full Bayesian analysis. A full list of packages in R is available at
https://cran.r-project.org/web/packages/available_packages_by_name.html and
www.onlinetoolz.com/tools/r-packages.php, while Bayesian analysis packages are listed at
https://cran.r-project.org/web/views/Bayesian.html.
Worked examples in subsequent chapters focus primarily on three options for generic
Bayesian analysis in R, based on user-defined program code. Implementation in R uses
interfaces for BUGS such as R2OpenBUGS and R2MultiBUGS, for JAGS (e.g. rjags, runjags,
jagsUI), and for Stan (rstan). The LaplacesDemon package (CRAN, 2018) also offers
Bayesian estimation options, with entirely R-based user code. A number of packages use
one or more of BUGS, JAGS, or Stan as a basis for coding and computation, but provide
extra compilation checks, posterior summarisation, or data analysis options. Thus, the
rube package (Seltman, 2016) interfaces with BUGS and JAGS to provide additional compilation
details to assist with code debugging, while MCMCvis (Youngflesh, 2017) provides
tools for posterior summarisation and visualisation which can be applied across all three
generic options. The nimble package aims to update BUGS and retain its functionality in
the R environment (de Valpine et al., 2017), while R2MultiBUGS is a recently developed
alternative to R2OpenBUGS and links to MultiBUGS (Goudie et al., 2019). Comparative
analyses of some of these packages include Li et al. (2018) and Monnahan et al. (2017).
A range of application packages that do not require user-defined code adapted to the
application is also available. These have a different design philosophy to the generic coding options,
using MCMC algorithms that are model-specific and hence likely to be more efficient
(Martin and Quinn, 2006). As one example, bamlss (Bayesian Additive Models for Location,
Scale, and Shape) enables Bayesian estimation of generalised linear models, additive
regression, and spatial models (Umlauf et al., 2018). MCMCpack (Martin et al., 2011) allows
estimation of generalised linear models, change-point models, quantile linear regression,
and certain latent variable models. The rstanarm package (Gabry and Goodrich, 2018) uses
Stan as a basis for estimation, but with simplified functions: for example, the stan_glm
function to represent generalised linear models. The R-INLA package uses the Integrated
options(scipen=999)
library(R2OpenBUGS)
library(heavy)
library(loo)
data(ereturns)
x=as.vector(ereturns[[4]])
y=as.vector(ereturns[[3]])
# Data
D=list(y=y,x=x,n=60,x.new=0.13)
# Model Code
model <- function() { for (i in 1:n) {y[i] ~dnorm(mu[i],tau)
mu[i] <- beta[1] + beta[2]*(x[i]-mean(x[]))
# log-likelihood
LL[i] <- -0.92+0.5*log(tau)-0.5*tau*pow(y[i]-mu[i],2)
# replicates (predictions) at observed x[i]
yrep[i] ~dnorm(mu[i],tau)
# check replicate against actual observation
check[i] <- step(yrep[i]-y[i])}
# priors
for (j in 1:2) {beta[j] ~dnorm(0,0.001)}
# calculate precision
tau <- 1/(sigma*sigma)
sigma ~dunif(0,100)
# prediction at new x value
mu.new <- beta[1]+beta[2]*(x.new-mean(x[]))
y.new ~dnorm(mu.new,tau)}
inits1 = list(beta=rep(0,2), sigma=1)
inits2 = list(beta=rep(0,2), sigma=2)
inits = list(inits1,inits2)
pars = c("beta","sigma","check","y.new","LL")
n.iters=10000; n.burnin=500; n.chains=2
R=bugs(D,inits,pars,n.iters,model,n.chains,n.burnin,debug=T,
codaPkg = F,bugs.seed=10)
R$summary
LOO=loo(R$sims.list$LL)
# select the pointwise LOO-IC column by name, since its position depends on the loo version
LOO.PW=LOO$pointwise[,"looic"]
As expected, a number of cases, particularly 8, 15, 34, and 58, have extreme posterior
predictive checks, and these cases also have the most extreme pointwise LOO-IC values.
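The posterior mean of check[i] estimates the probability that a replicate is at least as large as the observation, so values near 0 or 1 indicate poorly fitted cases. The sketch below mimics that calculation in base R. The data and the posterior draws are simulated stand-ins, introduced here as assumptions for illustration, and are not output from the BUGS run.

```r
set.seed(3)
n_obs <- 20
n_draws <- 2000
# hypothetical observations, with case 5 made deliberately extreme
y_obs <- rnorm(n_obs, mean = 0, sd = 1)
y_obs[5] <- 6
# stand-in posterior draws for the mean and residual standard deviation
mu_draws <- rnorm(n_draws, mean = 0, sd = 0.2)
sigma_draws <- runif(n_draws, 0.9, 1.1)
# one replicate per posterior draw (row) and observation (column);
# the draw vectors are recycled down each column, so row r uses draw r
yrep <- matrix(rnorm(n_draws * n_obs, mean = mu_draws, sd = sigma_draws),
               nrow = n_draws, ncol = n_obs)
# proportion of replicates at or above each observation, as with step()
check <- colMeans(sweep(yrep, 2, y_obs, ">="))
# cases whose check probability is extreme in either tail
extreme <- which(check < 0.025 | check > 0.975)
```

The thresholds 0.025 and 0.975 are a conventional choice made for this sketch; other cut-offs could equally be used to flag cases for closer inspection.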
This example could also be run using R2MultiBUGS, with the second line now
library(R2MultiBUGS), and the bugs command being:
y[i] ~ dnorm(mu[i],1/(sigma^2)).
Drawbacks of JAGS code relative to BUGS are that loop limits cannot involve any calculation,
and that sub-samples cannot be taken at each MCMC iteration (see Example 3.5).
The JAGS code for the above regression example emphasises its essential similarity with
the BUGS code, but also its coding flexibility, in that equality rather than assignment signs are
allowed, and extra facilities are available, such as the logdensity.norm function to obtain log-likelihoods.
The JAGS code also includes a function to generate suitable initial parameter values and
calls on the jagsUI package. The jagsUI package has the benefit of repeatedly checking convergence
and thus avoiding unnecessary computing. The calling sequence is as follows:
library(jagsUI)
library(heavy)
library(loo)
data(ereturns)
x=as.vector(ereturns[[4]])
y=as.vector(ereturns[[3]])
# Data
D=list(y=y,x=x,n=60,x.new=0.13)
cat("
model {for (i in 1:n) {y[i] ~dnorm(mu[i], 1/sigma^2)
mu[i] = beta[1] + beta[2]*(x[i]-mean(x[]))
# log-likelihood
LL[i] = logdensity.norm(y[i],mu[i],1/sigma^2)
# replicates at observed x[i]
yrep[i] ~dnorm(mu[i],1/sigma^2)
# check replicate against actual observation
check[i] = step(yrep[i]-y[i])}
# priors
for (j in 1:2) {beta[j] ~dnorm(0,0.001)}
sigma ~dunif(0,100)
# prediction at new x value
mu.new = beta[1]+beta[2]*(x.new-mean(x[]))
y.new ~dnorm(mu.new, 1/sigma^2)}
", file="model.jag")
# Estimation
inits <- function(){list(sigma=runif(1,0,5), beta=rnorm(2,0,0.1))}
pars = c("beta","sigma","check","y.new","LL")
R=autojags(D,inits,pars,model.file="model.jag",n.chains=2,iter.increment=1000,
n.burnin=100,Rhat.limit=1.025, max.iter=5000, seed=1234, codaOnly=c('LL'))
# Posterior Summary
R$summary
# Fit
LOO=loo(as.matrix(R$sims.list$LL))
# select the pointwise LOO-IC column by name, since its position depends on the loo version
LOO.PW.JAGS=LOO$pointwise[,"looic"]
order(LOO.PW.JAGS)
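The constant -0.92 in the BUGS expression for LL[i] is a rounded value of -0.5*log(2*pi), so the BUGS line and the JAGS logdensity.norm call evaluate the same normal log density, parameterised by the precision. This can be checked in base R; the numerical inputs below are arbitrary values chosen for illustration and are not taken from the data.

```r
# arbitrary illustrative inputs
y_i <- 0.05; mu_i <- 0.01; sigma <- 0.03
tau <- 1 / sigma^2
# log density as written in the BUGS code, with the rounded constant
ll_bugs <- -0.92 + 0.5 * log(tau) - 0.5 * tau * (y_i - mu_i)^2
# exact normal log density, parameterised by the standard deviation
ll_exact <- dnorm(y_i, mean = mu_i, sd = sigma, log = TRUE)
# the two differ only through the rounding of 0.5*log(2*pi)
ll_diff <- ll_exact - ll_bugs
```

The discrepancy is the same for every observation and every posterior draw, so it shifts all pointwise log-likelihoods by a constant and does not alter which cases appear most extreme.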
# priors
for (j in 1:2) {beta[j] ~dnorm(0,0.001)}
sigma ~dunif(0,100)
# prediction at new x value
mu.new = beta[1]+beta[2]*(x.new-mean(x[]))
y.new ~dnorm(mu.new, 1/sigma^2 )} "
inits <- function(){list(sigma=runif(1,0,5), beta=rnorm(2,0,0.1))}
pars = c("beta","sigma","check","y.new","LL")
R = autorun.jags(model,data=D,startburnin=500,startsample=4000,
inits=inits,
monitor=pars ,n.chains=2)
add.summary(R)
# MCMC output for log-likelihoods
LLsamps=as.matrix(as.mcmc.list(R, vars = "LL"))
LOO=loo(LLsamps)
LOO.PW.JAGS=LOO$pointwise[,3]
order(LOO.PW.JAGS)
<lower=0> sigma,
then one may be interested in estimating the precision as τ = 1/σ². The transformed
parameters block may also specify limits, and this may facilitate particular types of model.
For example, one may specify a log-link in binomial regression by stipulating that prob-
abilities πi are between 0 and 1, with a log-link obtained as
The generated quantities block specifies names and derivation of any quantities, such as log-
likelihoods, resulting from the calculations during estimation. All distinct statements in all
blocks must be terminated by a semicolon, which in for-loops precedes the closing } of the loop.
Flexibility in rstan coding is provided by opportunities to vectorise prior and likelihood
statements in the model block (and hence avoid for-loops); see Section 9 on Regression Models
in the Example Models section of the Stan User Guide (Stan Development Team, 2017).
We continue the regression example above, now specifying the predictors (intercept and
CRSP) as a regression matrix. The generated quantities block includes log-likelihoods, gen-
eration of replicate data, and posterior predictive checks comparing replicate and actual
data. Vectorisation is illustrated in the code by the statement
y ~normal(eta,sigma);
options(scipen=999)
library(loo)
library(rstan)
# Regression Data
library(heavy)
data(ereturns)
x=as.vector(ereturns[[4]])
y=as.vector(ereturns[[3]])
x.new=0.13
K=2
X=matrix(,60,K)
X[,1]=1
X[,2]=x-mean(x)
x_new=x.new-mean(x)
D=list(y=y,X=X,n=60,K=2,x_new=x_new)
model="
data {
int n; // number of observations
real y[n]; // response
real x_new; // new predictor value
int K; // number of predictors
matrix[n,K] X; // predictor matrix
}
parameters {
vector[K] beta; // regression coefficients
real <lower=0> sigma; // residual standard deviation
}
transformed parameters {
vector[n] eta; // linear regression term
eta = X*beta;
}
model {
sigma ~uniform(0,100);
beta ~normal(0,31.6);
y ~normal(eta,sigma);
}
generated quantities { real LL[n];
real y_rep[n];
real y_new;
real check[n];
for (i in 1:n) {LL[i]= normal_lpdf(y[i] | eta[i],sigma); }
for (i in 1:n) {y_rep[i] =normal_rng(eta[i],sigma);}
for (i in 1:n) {check[i] =step(y_rep[i]-y[i]);}
y_new = normal_rng(beta[1]+beta[2]*x_new,sigma); // prediction at new x value
}
"
# Estimation
fit=stan(model_code = model,data=D, iter = 1500,warmup = 250,chains=2)
# Posterior Summary
print(fit,digits=3)
# plot of posterior densities
# stan_dens(fit)
# Fit
LLsamps <- as.matrix(fit,pars="LL")
LOO=loo(LLsamps)
LOO.PW.STAN= LOO$pointwise[,3]
order(LOO.PW.STAN)
model {
target += uniform_lpdf(sigma | 0,100);
target += normal_lpdf(beta | 0, 31.6);
target += normal_lpdf(y | eta, sigma);
}
This is relevant in, say, marginal likelihood estimation, if one seeks to scale the contribution
of the log-likelihood to the log-posterior (see Example 3.1); in regression using weighted
log-likelihoods or regression using frequency tabulations; or in fitting distributions with
custom likelihoods (not available among the standard densities included in rstan).
Using the target += format, rstan can accommodate improper priors as long as the posteriors
are proper. Whereas BUGS and JAGS code specify a formal graphical model, for rstan,
the code simply specifies a joint density function needed for HMC. Thus, Jeffreys prior on
a variance σ², expressed in terms of the standard deviation σ, namely
$$p(\sigma) \propto 1/\sigma,$$
can be coded
target += -log(sigma);
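The coding can be checked by a change of variables. The sketch below assumes that sigma in the Stan program is the standard deviation, so that a Jeffreys prior stated for the variance, p(σ²) ∝ 1/σ², has to be re-expressed as a density for σ before its log is added to target:

```latex
p(\sigma) = p(\sigma^2)\left|\frac{d\sigma^2}{d\sigma}\right|
          \propto \frac{1}{\sigma^2}\, 2\sigma
          \propto \frac{1}{\sigma},
\qquad
\log p(\sigma) = -\log\sigma + \text{const}.
```

The additive constant does not affect sampling, which is why only the term -log(sigma) needs to be added.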
The rstan estimation gives respective posterior means (sd) for the coefficients on meals
and mobility of 0.037 (0.09) and 0.064 (0.020). Again, the effect of mobility is amplified as
compared to unweighted logistic regression (though less so than under the Zelig approach),
while the effect of meals is attenuated.
The target += option can also be used with frequency data. Suppose housing tenants
are grouped into 72 groups (with frequencies FREQ) according to an ordinal satisfaction
response (three categories) and three categorical predictors, one with four categories, one
with three, and one binary. Then an ordinal logistic regression can be applied via the
This scenario is in fact applicable to the housing dataset in the R MASS library.
To demonstrate the target += option applied to a non-standard density, consider the
Kumaraswamy distribution, obtained by sampling $y \sim \mathrm{Beta}(1,b)$ and then setting $x = y^{1/a}$.
The density is $p(x|a,b) = a b x^{a-1}(1 - x^a)^{b-1}$. We generate 1000 observations with a = 3 and b = 2.
The code sequence below provides posterior means (sd) for a and b of 3.00 (0.11) and 1.92
(0.09).
N =1000; a = 3; b = 2
# Kumaraswamy density
x = rbeta(N, 1, b)^(1/a)
library(rstan)
model ="
data {
int<lower=1> N;
real<lower=0,upper=1> x[N];
}
transformed data {
real sum_log_x;
sum_log_x = 0.0;
for (i in 1:N) {sum_log_x = sum_log_x + log(x[i]);}
}
parameters {
real<lower=0> a;
real<lower=0> b;
}
model {
target += N * (log(a) + log(b)) + (a - 1) * sum_log_x;
for (i in 1:N) { target += (b - 1) * log1m(pow(x[i],a)); }
}
"
D = list(N = N, x = x)
fit=stan(model_code = model,data=D, iter = 2500,warmup =
250,chains=2,seed=10)
# Posterior Summary
print(fit,digits=3)
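As a rough cross-check on the Kumaraswamy fit, the same log-likelihood can be maximised directly in R. This is a sketch under stated assumptions, not part of the rstan analysis: the seed, the function name negLL and the use of optim with its default Nelder-Mead search are illustrative choices, and maximum likelihood estimates are only expected to be close to the posterior means when priors are flat and the sample is large.

```r
set.seed(10)
N <- 1000; a <- 3; b <- 2
x <- rbeta(N, 1, b)^(1/a)          # Kumaraswamy(a, b) draws, as in the text
# Negative log-likelihood, parameterised on the log scale so a, b stay positive
negLL <- function(par) {
  a <- exp(par[1]); b <- exp(par[2])
  -sum(log(a) + log(b) + (a - 1) * log(x) + (b - 1) * log1p(-x^a))
}
mle <- exp(optim(c(0, 0), negLL)$par)
round(mle, 2)                      # estimates of (a, b)
```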
$$p(x|\omega,\mu) = (1-\omega)\mu\,\frac{[(1-\omega)\mu + \omega x]^{x-1}}{x!}\,\exp\left(-[(1-\omega)\mu + \omega x]\right)$$
and the application by Joe and Zhu (2005). Joe and Zhu (2005, Table 3) consider data for
n = 158 tumour count observations and provide estimates (mean, se) for ω and ϑ = μ(1 − ω),
namely 0.79 (0.04) and 0.91 (0.10).
An rstan implementation involves the sequence:
library(rstan)
# Tumour count data from Joe and Zhu (2005)
x=c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2
,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,
4,4,4,4,4,4,4,5,5,5,5,5,6,6,6,6,6,6,6,6,6,7,7,7,7,7,7,7,7,7,8,9,9,10,10,10,
10,10,11,13,14,15,16,20,20,20,
21,24,24,24,26,30,50,50)
D=list(x=x,N=158)
model ="
functions {
real generalized_poisson_lpmf(int x, real theta, real omega) {
return log(theta) + (x - 1) * log(theta + x*omega) - lgamma(x + 1)
- x * omega - theta ; }
}
data {
int<lower=0> N;
int x[N];
}
parameters {
real<lower=0> mu;
real<lower=-1, upper=1> omega;
}
transformed parameters {
real<lower=0> theta;
theta=mu*(1-omega);
}
model {
for (i in 1:N) {x[i] ~generalized_poisson(theta, omega);}
}
"
fit=stan(model_code = model,data=D, iter = 2500,warmup =
250,chains=2,seed=10)
# Posterior Summary
print(fit,digits=3)
We obtain estimates (posterior mean (sd)) for ω and ϑ of 0.797 (0.037) and 0.919 (0.095).
Note that this model can be extended to better account for the zero inflation present in the
data.
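The log mass function coded in the functions block can be checked numerically in R before it is used in sampling. The sketch below assumes 0 ≤ ω < 1 and uses illustrative parameter values (not the estimates above). The function name dgenpois.log is hypothetical. It verifies that the probabilities sum to one and that the mean equals μ = ϑ/(1 − ω).

```r
# Log pmf in the (mu, omega) parameterisation, with theta = mu * (1 - omega)
dgenpois.log <- function(x, mu, omega) {
  theta <- mu * (1 - omega)
  log(theta) + (x - 1) * log(theta + x * omega) - lgamma(x + 1) -
    x * omega - theta
}
mu <- 4; omega <- 0.5              # illustrative values only
xs <- 0:2000                       # truncation point chosen for this example
p <- exp(dgenpois.log(xs, mu, omega))
total <- sum(p)                    # should be very close to 1
mean.x <- sum(xs * p)              # should be very close to mu
```

For values of ω closer to 1 the tail is heavier, and a larger truncation point would be needed for the same accuracy.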
References
Albert J, Chib S (1993) Bayesian analysis of binary and polychotomous response data. Journal of the
American Statistical Association, 88, 669–679.
Annis J, Miller B, Palmeri T (2017) Bayesian inference with Stan: A tutorial on adding custom distri-
butions. Behavior Research Methods, 49(3), 863–886.
Betancourt M, Girolami M. (2015) Hamiltonian Monte Carlo for hierarchical models. Chapter 4,
pp 79–102, in U. Singh, S. Upadhyay, D. Dey (eds) Current Trends in Bayesian Methodology with
Applications. CRC, Boca Raton, FL.
Bürkner P (2017) brms: An R package for Bayesian multilevel models using Stan. Journal of Statistical
Software, 80(1), 1–28.
Carnes N (2017) Logistic Regression for Survey Weighted Data. http://docs.zeligproject.org/articles/zelig_logitsurvey.html
Consul P (1989) Generalized Poisson Distribution: Properties and Applications. Marcel Dekker, New York.
CRAN (2018) Laplaces Demon: Complete Environment for Bayesian Inference. https://cran.r-project.org/web/packages/LaplacesDemon/LaplacesDemon.pdf
Denwood M (2016) runjags: An R package providing interface utilities, model templates, parallel
computing methods and additional distributions for MCMC models in JAGS. Journal of
Statistical Software, 71(9), 1–25.
de Valpine P, Turek D, Paciorek C, Anderson-Bergman C, Lang D, Bodik R (2017) Programming with
models: Writing statistical algorithms for general model structures with NIMBLE. Journal of
Computational and Graphical Statistics, 26(2), 403–413.
Duane S, Kennedy AD, Pendleton BJ, Roweth D (1987) Hybrid Monte Carlo. Physics Letters B, 195,
216–222.
Gabry J, Goodrich B (2018) How to Use the rstanarm Package. https://cran.r-project.org/web/packages/rstanarm/vignettes/rstanarm.html
Goudie R, Turner R, De Angelis D, Thomas A (2019) MultiBUGS: A parallel implementation of
the BUGS modelling framework for faster Bayesian inference. Journal of Statistical Software.
arXiv:1704.03216
Hoffman M, Gelman A (2014) The No-U-turn sampler: adaptively setting path lengths in Hamiltonian
Monte Carlo. Journal of Machine Learning Research, 15(1), 1593–1623.
Joe H, Zhu R (2005) Generalized Poisson distribution: The property of mixture of Poisson and com-
parison with negative binomial distribution. Biometrical Journal, 47(2), 219–229.
Joseph M (2016) Exact sparse CAR models in Stan. http://mc-stan.org/documentation/case-studies/mbjoseph-CARStan.html
Li M, Dushoff J, Bolker B (2018) Fitting mechanistic epidemic models to data: A comparison of simple
Markov chain Monte Carlo approaches. Statistical Methods in Medical Research, 27(7), 1956–1967.
Lykou A, Ntzoufras I (2011) WinBUGS: A tutorial. WIRES: Wiley Interdisciplinary Reviews, 3(5),
385–396.
Martin A, Quinn K (2006) Applied Bayesian inference in R using MCMCpack. R News, 6(1), 2–7.
Martin A, Quinn K., Park J (2011) MCMCpack: Markov chain Monte Carlo in R. Journal of Statistical
Software, 42(9), 1–21. www.jstatsoft.org/v42/i09/
McElreath R (2018) Algebra and the Missing Oxen. http://elevanth.org/blog/2018/01/29/algebra-and-missingness/
Monnahan C, Thorson J, Branch T (2017) Faster estimation of Bayesian models in ecology using
Hamiltonian Monte Carlo. Methods in Ecology and Evolution, 8(3), 339–348.
Morris M (2018) Spatial Models in Stan: Intrinsic Auto-Regressive Models for Areal Data.
http://mc-stan.org/users/documentation/case-studies/icar_stan.html
Neal R (2011) MCMC Using Hamiltonian Dynamics, Chapter 5, in S Brooks, A Gelman, G Jones, X–L
Meng (eds) Handbook of Markov Chain Monte Carlo. CRC Press, Boca Raton, FL, pp 113–162.
R Core Team (2016) R: A language and environment for statistical computing. R Foundation for
Statistical Computing, Vienna. www.r-project.org/
Seltman H (2016) R Package rube (Really Useful WinBUGS (or JAGS) Enhancer). Version 0.3-8.
http://www.stat.cmu.edu/~hseltman/rube/
Stan Development Team (2014) Stan Modeling Language: User’s Guide and Reference Manual.
https://github.com/stan-dev/stan/releases/download/v2.4.0/stan-reference-2.4.0.pdf
Stan Development Team (2017) Modeling Language User’s Guide and Reference Manual, Version
2.17.0. https://mc-stan.org/users/documentation/
Umlauf N, Klein N, Zeileis A (2018) BAMLSS: Bayesian additive models for location, scale, and shape
(and beyond). Journal of Computational and Graphical Statistics, 27(3), 612–627.
Vehtari A, Gelman A, Gabry J (2017) Practical Bayesian model evaluation using leave-one-out cross-
validation and WAIC. Statistics and Computing, 27(5), 1413–1432.
Youngflesh C (2017) MCMCvis: Tools to Visualize, Manipulate, and Summarize MCMC Output.
https://mran.microsoft.com/snapshot/2017-04-22/web/packages/MCMCvis/index.html
3
Model Fit, Comparison, and Checking
3.1 Introduction
Model assessment involves choices between competing models in terms of best fit, and
checks to ensure model adequacy. For example, even if one model has a superior fit, it still
needs to be established whether predictions from that model satisfactorily reproduce
the observed data. Checking may also seek to establish whether model
assumptions (e.g. normality of random effects) are justified, whether the model reproduces
particular aspects of the data, and whether particular observations are poorly fit (Sinharay
and Stern, 2003; Berkhof et al., 2000; Kelly and Smith, 2011; Lucy, 2018; Conn et al., 2018;
Park et al., 2015).
Once adequacy is established for a set of candidate models, one may seek to choose a
particular best fitting model to base inferences on, or average over two or more adequate
models with closely competing fit. This chapter focuses on three main strategies to assess
model fit and carry out model checks: the formal approach; approaches based on posterior
analysis of the likelihood; and predictive methods based on samples of replicate data.
Particular emphasis is placed on their application in hierarchical models. Hierarchical
indicator priors for selecting predictors are considered here (Section 3.4), and more exten-
sively in Chapter 7.
R packages focusing particularly on Bayesian model selection or other aspects of model
comparison include loo (Vehtari et al., 2017); mombf for regression and mixture analyses
(Rossell, 2018; Johnson and Rossell, 2012); AICcmodavg (for deviance information criterion
(DIC) calculation) (https://rdrr.io/cran/AICcmodavg/); BayesFactor (https://rdrr.io/cran/BayesFactor/);
and the bridgesampling package (Gronau et al., 2017a). Packages focusing
on predictor selection include BayesVarSel (https://rdrr.io/cran/BayesVarSel/) and BMA
(https://rdrr.io/cran/BMA/).
Let prior model probabilities be denoted p(m = k), where $m \in (1, \ldots, K)$ is a model indicator.
Then posterior model probabilities are obtained as
$$p(m=k|y) = \frac{p(y|m=k)\,p(m=k)}{p(y)}$$
where
$$p(y|m=k) = \int p(y|\theta_k)\,p(\theta_k)\,d\theta_k$$
is the marginal likelihood for model k, with parameter $\theta_k$ of dimension $d_k$. This section
considers approximations to marginal likelihoods and to Bayes factors
that compare such likelihoods. In simple models, such as normal linear regressions with
regression coefficients and residual variance as the only unknowns, the formal approach is
relatively simple to implement, and marginal likelihoods are available analytically under
certain priors (Bos, 2002).
Approximate methods (Tierney and Kadane, 1986) for obtaining summary fit measures
(e.g. marginal likelihoods) or posterior densities of parameters are also reliable in simple
models. A large sample approximation for the log marginal likelihood is provided by the
Bayesian Information Criterion (BIC) (Schwarz, 1978; Myung and Pitt, 2004) defined as
where $\hat{\theta}_k$ is the maximum likelihood estimator, $d_k$ is a known model dimension, and n is
the sample size. The BIC is consistent for a wide set of problems, meaning that the prob-
ability of selecting the most parsimonious true model tends to 1 as the sample size tends to
infinity. However, for singular model selection problems (discrete mixtures, factor models
where the true number of factors is unknown, etc.), the asymptotic justification for the
BIC no longer applies: considering the case of discrete parametric mixtures (Chapter 4),
the Fisher information matrix with K components is singular at a distribution based on
K-1 components. An alternative for such problems, the singular BIC, or sBIC, has been
proposed (Drton and Plummer, 2017) and implemented in the R package sBIC (Weihs and
Plummer, 2016). The widely applicable Bayesian information criterion (WBIC) can also be
applied for nonsingular models (Watanabe, 2013; Friel et al., 2017).
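As a hedged illustration of the BIC as a large sample approximation, the sketch below assumes the form log p(y|θ̂) − (d/2) log(n) for the approximate log marginal likelihood. That scaling is an assumption here and should be checked against the definition in use. It is compared with R's BIC() function, which works on the −2 log-likelihood scale. The data are simulated and the helper name bic.logml is hypothetical.

```r
set.seed(1)
n <- 100
x <- rnorm(n); y <- 1 + 0.5 * x + rnorm(n)
fit1 <- lm(y ~ 1)                  # intercept only
fit2 <- lm(y ~ x)                  # adds the predictor
# BIC-type approximation to the log marginal likelihood (assumed scaling)
bic.logml <- function(fit) {
  d <- attr(logLik(fit), "df")     # number of estimated parameters, incl. sigma
  as.numeric(logLik(fit)) - 0.5 * d * log(nobs(fit))
}
approx.logml <- c(m1 = bic.logml(fit1), m2 = bic.logml(fit2))
# R's BIC() equals -2 times this quantity
bic.R <- c(BIC(fit1), BIC(fit2))
```

The difference between the two elements of approx.logml is then a rough approximation to the log Bayes factor, subject to the usual caveats about regular (nonsingular) models.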
Posterior model probabilities on nested models may also be obtainable by adding model
selection indicators, as illustrated by Bayesian variable selection algorithms (Mitchell and
Beauchamp, 1988; Fernandez et al., 2001) for choosing predictors in regression. Such selec-
tion has been extended to variance hyperparameters in hierarchical models (e.g. Cai and
Dunson, 2006; Chen and Dunson, 2003; Fruhwirth-Schnatter and Tuchler, 2008; Kinney
and Dunson, 2008), enabling selection which avoids the complex issues involved in mar-
ginal likelihood estimation for random effects models. Section 3.4 considers variance
selection in hierarchical models.
However, in more complex random effect applications with discrete responses or
hierarchically structured data, there remain issues which impede the straightforward
application of the formal approach (Han and Carlin, 2001). For example, in approxi-
mating marginal likelihoods, there is a choice whether or not to integrate over random
effects (Sinharay and Stern, 2005). The more commonly advocated approach of integrat-
ing out random effects becomes impractical when there are multiple possibly correlated
random effects. The formal approach is also sensitive to priors adopted on parameters,
which in the case of random effect models include the form of prior on variance compo-
nents (e.g. inverse gamma or uniform), as well as the degree of prior informativeness.
As priors become more diffuse, the formal approach tends to select the simplest least
parameterised models, in line with the so-called Lindley or Bartlett paradox (Bartlett,
1957). Finally, the formal approach to model averaging requires both posterior densities
$p(\theta_k|y, m=k)$, and posterior model probabilities p(m = k|y). Estimates of posterior densities
$p(\theta_k|y, m=k)$ may be difficult to obtain in complex random effects models with large
numbers of parameters.
Straightforward and pragmatic approaches to model comparison, which are also appli-
cable to complex hierarchical models, are available as alternatives to formal methods. The
two main approaches are based on posterior densities of fit measures (log-likelihood, devi-
ance) and on predictive assessment using samples of replicate data. Section 3.3 considers
the posterior deviance as a fit measure and the related measure of model complexity (effec-
tive dimension) that are of utility in comparing hierarchical models. Bayesian fit measures
such as the DIC or LOO-IC (Vehtari et al., 2017) are analogous to information theoretic
approaches in frequentist statistics (Burnham and Anderson, 2002), but more widely appli-
cable (e.g. to non-nested models). The components of the overall fit deriving from each
observation (e.g. the deviance contributions from particular observations) may be used in
model checking (Plummer, 2008).
The predictive approach to model choice and diagnosis (Section 3.5) has also been simpli-
fied by MCMC (Gelfand, 1996). Predictive methods shift the focus onto observables away
from parameters (Geisser and Eddy, 1979) and seek to alleviate the impact on model com-
parison of factors such as specification of priors. The predictive approach is particularly
advantageous in model checking, namely ensuring that a model actually reproduces the
data satisfactorily (e.g. Kacker et al., 2008), but is also applied to model choice, for example,
under posterior predictive loss criteria (Gelfand and Ghosh, 1998).
Predictive model checking typically involves repeated sampling of replicate data $y_{new}$
from a model’s parameters at each MCMC iteration (Gelfand et al., 1992). For a satisfactory
model this process generates data like the observed data such that $(y, y_{new})$ are exchangeable
draws from the joint density (Stern and Sinharay, 2005, pp.176–177).
When all the data is used in model estimation, such sampling provides estimates of
the posterior predictive density of model k, $p(y_{new}|y, m=k)$. However, predictive comparisons
based on models using all the data in estimation may be overly favourable
to the model being fitted (i.e. be conservative in terms of detecting model discrepan-
cies) (Bayarri and Berger, 1999). An alternative involves cross-validation (Alqallaf and
Gustafson, 2001) where the model predicts values for certain observations (the test
sample) on the basis of a model estimated using the remaining observations (the learn-
ing sample). Key et al. (1999) argue that cross-validation is approximately optimal in
an M-open scenario, where none of the models being considered is believed to be the
true model.
$$p(y) = \int p(y|\theta)\,p(\theta)\,d\theta.$$
The marginal likelihood is also a component in Bayes formula, such that at any parameter
value θ
$$p(\theta|y) = \frac{p(y|\theta)\,p(\theta)}{p(y)}.$$
Consider models 1 and 2 with equal prior model probabilities p(m = 1) = p(m = 2) = 0.5. Then
the ratio of posterior model probabilities is obtained as
$$\frac{p(m=2|y)}{p(m=1|y)} = \frac{p(y|m=2)}{p(y|m=1)} = B_{21}$$
where $B_{21}$ is the Bayes factor. Kass and Raftery (1995) provide guidelines for interpreting
$B_{21}$. If $2\log_e B_{21}$ is larger than 10, the evidence for model 2 is very strong, while values of
$2\log_e B_{21} < 2$ are inconclusive as evidence in favour of one model or another. Note that such
criteria are influenced by the prior adopted. In general, diffuse priors (whether on fixed
effect parameters or variances) are to be avoided, as they tend to favour the selection of the
simpler model.
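The conversion from log marginal likelihoods to posterior model probabilities and Bayes factors can be sketched in a few lines of R. The log marginal likelihoods below are hypothetical values chosen purely for illustration, and the subtraction of the maximum before exponentiating is a standard device to avoid numeric underflow.

```r
# Hypothetical log marginal likelihoods for three models (illustrative only)
log.ml <- c(m1 = -112.4, m2 = -110.1, m3 = -115.0)
prior <- rep(1/3, 3)               # equal prior model probabilities
# Work on the log scale and subtract the maximum before exponentiating
lw <- log.ml + log(prior)
post <- exp(lw - max(lw)) / sum(exp(lw - max(lw)))
# Bayes factor of model 2 against model 1, on the 2*log scale
twologB21 <- 2 * (log.ml[["m2"]] - log.ml[["m1"]])
```

With these illustrative inputs, twologB21 is 4.6, which lies between the inconclusive range (below 2) and the very strong range (above 10) mentioned above.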
Estimating the marginal likelihood by direct integration is generally infeasible in
multi-parameter applications. Hence, a range of approximations have been proposed for
estimating marginal likelihoods or associated model choice criteria, such as the Bayes factor.
For example, on suitable rearrangement (Chib, 1995), the Bayes formula implies that
the marginal likelihood may be approximated by estimating the posterior ordinate p(θ|y)
in the relation
$$p(y) = \frac{p(y|\theta_h)\,p(\theta_h)}{p(\theta_h|y)}$$
where θh is a point with high posterior density (e.g. posterior mean or median). One may
estimate p(θ|y) by kernel density methods or by moment approximations based on MCMC
output – see Lenk and DeSarbo (2000) for a discussion of such estimates. Let g(θ) denote an
estimated density that approximates p(θ|y). One may then evaluate g(θ) at $\theta_h$ (Sinharay and
Stern, 2005; Bos, 2002), so providing an estimate of the log marginal likelihood as
$$\log[p(y)] \approx \log[p(y|\theta_h)] + \log[p(\theta_h)] - \log[g(\theta_h)].$$
The relation $\log[p(y)] = \log[p(y|\theta_h)] + \log[p(\theta_h)] - \log[p(\theta_h|y)]$ also implies a sampling-based
estimator of the log marginal likelihood. Since this relation applies for all samples
$\theta^{(r)}$, one may average over values
$$\log[p(y|\theta^{(r)})] + \log[p(\theta^{(r)})] - \log[g(\theta^{(r)})]$$
to estimate the log of the marginal likelihood, log[p(y)]. Using log transforms is likely to
be the most suitable approach for larger samples, to avoid numeric overflow. For small
samples, one may set $L^{(r)} = p(y|\theta^{(r)})$, $\pi^{(r)} = p(\theta^{(r)})$, and $g^{(r)} = g(\theta^{(r)})$. Then an estimator of the
marginal likelihood is provided by the simple average of the ratios $L^{(r)}\pi^{(r)}/g^{(r)}$.
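The posterior ordinate approach can be illustrated on a toy conjugate model where the exact marginal likelihood is known. The sketch below assumes y[i] ~ N(θ, 1) with prior θ ~ N(0, 1), uses draws from the known posterior as a stand-in for MCMC output, and takes g to be a normal moment approximation. All object names are illustrative.

```r
set.seed(1)
n <- 20
y <- rnorm(n, 0.5, 1)              # y[i] ~ N(theta, 1), prior theta ~ N(0, 1)
# Stand-in for MCMC output: the posterior is N(n*ybar/(n+1), 1/(n+1))
theta <- rnorm(5000, n * mean(y) / (n + 1), sqrt(1 / (n + 1)))
theta.h <- mean(theta)             # point of high posterior density
# Moment approximation g(): normal with the sampled mean and sd
log.g <- dnorm(theta.h, mean(theta), sd(theta), log = TRUE)
log.lik <- sum(dnorm(y, theta.h, 1, log = TRUE))
log.prior <- dnorm(theta.h, 0, 1, log = TRUE)
logml.est <- log.lik + log.prior - log.g
# Exact value: marginally y is multivariate normal with covariance I + 11'
logml.exact <- -0.5 * n * log(2 * pi) - 0.5 * log(n + 1) -
  0.5 * (sum(y^2) - sum(y)^2 / (n + 1))
```

In this setting the posterior is exactly normal, so the moment approximation is very accurate. In non-normal posteriors the quality of the estimate depends on how well g matches p(θ|y) at the chosen point.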
Alternatively, suppose θ contains B parameter sub-blocks. When the full conditionals of
each sub-block are available in closed form, Chib (1995) considers a marginal/conditional
decomposition of $p(\theta_h|y)$ as follows
$$p(\theta_h|y) = p(\theta_{1h}|y)\,p(\theta_{2h}|\theta_{1h},y)\,p(\theta_{3h}|\theta_{1h},\theta_{2h},y)\cdots p(\theta_{Bh}|\theta_{1h},\ldots,\theta_{B-1,h},y)$$
with $p(\theta_h|y)$, and thus p(y), estimated by using B − 1 sampling sequences subsidiary
to the main scheme. If B = 2, namely $\theta_h = (\theta_{1h},\theta_{2h})$, the posterior ordinate $p(\theta_h|y)$ is
then $p(\theta_{1h}|y)\,p(\theta_{2h}|y,\theta_{1h})$, where $p(\theta_{1h}|y)$ is estimated from the output of the main sample,
e.g. as the average
$$\hat{p}(\theta_{1h}|y) = \frac{1}{R}\sum_{r=1}^{R} p(\theta_{1h}|y,\theta_2^{(r)})$$
$$p^*(\theta_k|y,m=k) = p(y|\theta_k,m=k)\,p(\theta_k|m=k)$$
$$p^*(\theta_k|y,m=k)/c_k = p(\theta_k|y).$$
Then by definition
$$p(y|m=k) = \int p^*(\theta_k|y,m=k)\,d\theta_k.$$
Consider a function g(θ) with known normalising constants, often termed an importance
function, and one that should ideally approximate the posterior p(θ|y). Then one has
$$p(y|m=k) = \int p^*(\theta_k|y,m=k)\,d\theta_k = \int \frac{p^*(\theta_k|y,m=k)}{g(\theta_k)}\,g(\theta_k)\,d\theta_k.$$
This suggests that an estimator for the marginal likelihood may be obtained using samples
$\theta_k^{(r)}$ (r = 1, …, R) from $g(\theta_k)$, namely
$$M_k = \frac{1}{R}\sum_{r=1}^{R}\frac{p^*(\theta_k^{(r)}|y,m=k)}{g(\theta_k^{(r)})}.$$
Let $L_k^{(r)} = p(y|\theta_k^{(r)})$, $\pi_k^{(r)} = p(\theta_k^{(r)})$ and $g_k^{(r)} = g(\theta_k^{(r)})$. Then, the importance sample estimator
may be written in terms of weights $w_k^{(r)} = \pi_k^{(r)}/g_k^{(r)}$ comparing the prior and importance
function, namely
$$M_k = \frac{1}{R}\sum_{r=1}^{R} L_k^{(r)} w_k^{(r)}.$$
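A minimal sketch of the importance sampling estimator, again for a toy model with y[i] ~ N(θ, 1) and prior θ ~ N(0, 1), where the exact answer is available for comparison. The importance function (a normal centred at the posterior mean with standard deviation inflated by 1.5) is an assumption made for illustration, and the estimator is taken as the average of the products of likelihood and weight, computed on the log scale for stability.

```r
set.seed(2)
n <- 20
y <- rnorm(n, 0.5, 1)              # y[i] ~ N(theta, 1), prior theta ~ N(0, 1)
m <- n * mean(y) / (n + 1); s <- sqrt(1 / (n + 1))  # exact posterior mean, sd
R <- 5000
th <- rnorm(R, m, 1.5 * s)         # draws from the importance function g
log.L <- sapply(th, function(th.i) sum(dnorm(y, th.i, 1, log = TRUE)))
# log weights: prior density over importance density
log.w <- dnorm(th, 0, 1, log = TRUE) - dnorm(th, m, 1.5 * s, log = TRUE)
# Average of L * w, evaluated via a log-sum-exp device
a <- log.L + log.w
logM.is <- max(a) + log(mean(exp(a - max(a))))
logM.exact <- -0.5 * n * log(2 * pi) - 0.5 * log(n + 1) -
  0.5 * (sum(y^2) - sum(y)^2 / (n + 1))
```

Choosing g with somewhat heavier tails than the posterior keeps the weights bounded, which is the usual motivation for inflating its spread.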
Bridge sampling estimators of marginal likelihoods use the fact that the marginal likelihood
of model k is the normalising constant $c_k = p(y|m=k)$ in the relation
$$1 = \frac{\int \alpha(\theta_k)\,p(\theta_k|y)\,g(\theta_k)\,d\theta_k}{\int \alpha(\theta_k)\,g(\theta_k)\,p(\theta_k|y)\,d\theta_k} = \frac{E_g[\alpha(\theta_k)\,p(\theta_k|y)]}{E_p[\alpha(\theta_k)\,g(\theta_k)]}$$
where α(θ) is a bridge function linking the densities g(θ) and p(θ|y) (Meng and Wong, 1996;
Gronau et al., 2017b), $E_g[\cdot]$ denotes expectation with regard to the density g(θ), and $E_p[\cdot]$
denotes expectation with regard to the density p(θ|y). Substituting $p^*(\theta_k|y,m=k)/c_k$ for
p(θ|y) in $1 = E_g[\alpha(\theta_k)\,p(\theta_k|y)]/E_p[\alpha(\theta_k)\,g(\theta_k)]$ gives the result
$$M \approx \frac{\frac{1}{R}\sum_{r=1}^{R}\alpha(\theta^{(r)})\,p^*(\theta^{(r)}|y)}{\frac{1}{S}\sum_{s=1}^{S}\alpha(\theta^{(s)})\,g(\theta^{(s)})}$$
where the R draws $\theta^{(r)}$ in the numerator are from g(θ) and the S draws $\theta^{(s)}$ in the denominator are from p(θ|y).
Setting $\alpha(\theta) = 1/g(\theta)$ then gives a marginal likelihood estimator
$$M = \frac{1}{R}\sum_{r=1}^{R}\frac{p^*(\theta^{(r)}|y)}{g(\theta^{(r)})}$$
that uses only samples from the approximate posterior (or importance) density g(θ).
Setting $\alpha(\theta) = 1/p^*(\theta|y)$ gives an estimator based on the harmonic mean of the ratios
$p^*(\theta^{(r)}|y)/g(\theta^{(r)})$, and using parameters sampled from p(θ|y) rather than g(θ) (Gelfand and
Dey, 1994). So
$$\frac{1}{M} = \frac{1}{S}\sum_{r=1}^{S}\frac{g(\theta^{(r)})}{p^*(\theta^{(r)}|y)}.$$
The choice $\alpha(\theta) = 1/[g(\theta)\,p^*(\theta|y)]^{0.5}$ leads to the geometric estimator of Lopes and West
(2004), namely
$$M = \frac{\frac{1}{R}\sum_{r=1}^{R}\left[p^*(\theta^{(r)}|y)/g(\theta^{(r)})\right]^{0.5}}{\frac{1}{S}\sum_{r=1}^{S}\left[g(\theta^{(r)})/p^*(\theta^{(r)}|y)\right]^{0.5}}.$$
A recursive scheme for obtaining an optimal estimate of α(θ) is also available, and men-
tioned by Lopes and West (2004, p.54) and Frühwirth-Schnatter (2004, equation 8). This sim-
plifies if R = S, as in the first illustrative worked application below. With R ≠ S, s1 = S/(S + R)
and s2 = 1 − s1, one has an updated estimate for M at recursion j
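$$M_j = A(M_{j-1})/B(M_{j-1})$$
where
$$A(u) = \frac{1}{R}\sum_{r=1}^{R}\frac{W_{2r}}{s_1 W_{2r} + s_2 u}, \qquad B(u) = \frac{1}{S}\sum_{s=1}^{S}\frac{1}{s_1 W_{1s} + s_2 u},$$
and $W_{2r} = p^*(\theta^{(r)}|y)/g(\theta^{(r)})$ is evaluated at the R draws from g(θ), while
$W_{1s} = p^*(\theta^{(s)}|y)/g(\theta^{(s)})$ is evaluated at the S draws from p(θ|y).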
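The recursion can be sketched in R for the same kind of toy model (y[i] ~ N(θ, 1), prior θ ~ N(0, 1)), where the exact marginal likelihood is known. Several elements are assumptions made for this illustration: the ratios W are taken as unnormalised posterior over importance density, evaluated at draws from g and from the posterior respectively; the sums are taken as averages over R and S draws; exact posterior draws stand in for MCMC output; and the ratios are rescaled by a constant exp(k) to avoid underflow, which leaves the recursion unchanged apart from the same rescaling of M.

```r
set.seed(3)
n <- 20
y <- rnorm(n, 0.5, 1)              # y[i] ~ N(theta, 1), prior theta ~ N(0, 1)
m <- n * mean(y) / (n + 1); s <- sqrt(1 / (n + 1))
R <- 4000; S <- 2000
th.g <- rnorm(R, m, 1.5 * s)       # draws from g (normal, inflated sd)
th.p <- rnorm(S, m, s)             # stand-in for posterior draws
# log of W = p*(theta | y) / g(theta)
logW <- function(th) sapply(th, function(th.i)
  sum(dnorm(y, th.i, 1, log = TRUE)) + dnorm(th.i, 0, 1, log = TRUE) -
    dnorm(th.i, m, 1.5 * s, log = TRUE))
lw2 <- logW(th.g); lw1 <- logW(th.p)
k <- median(c(lw1, lw2))           # rescaling constant on the log scale
W2 <- exp(lw2 - k); W1 <- exp(lw1 - k)
s1 <- S / (S + R); s2 <- 1 - s1
M <- 1                             # starting value on the rescaled scale
for (j in 1:50) {
  M <- mean(W2 / (s1 * W2 + s2 * M)) / mean(1 / (s1 * W1 + s2 * M))
}
logM.bridge <- log(M) + k          # undo the rescaling
logM.exact <- -0.5 * n * log(2 * pi) - 0.5 * log(n + 1) -
  0.5 * (sum(y^2) - sum(y)^2 / (n + 1))
```

In practice the bridgesampling package automates these steps. The fixed number of 50 iterations here is arbitrary, and a convergence tolerance on successive values of M would normally be used instead.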
3.2.3 Path Sampling
Another approximation may be obtained by a technique known as path sampling (Gelman
and Meng, 1998; Xie et al., 2011; Friel and Pettitt, 2008). Consider a path variable t ranging
from 0 to 1, and define the normalising constant of the power posterior, in which the
likelihood is weighted by the power t, namely
$$z(y|t) = \int [p(y|\theta)]^t\,p(\theta)\,d\theta$$
so that z(y|t = 0) is the integral of the prior, namely 1 for proper priors, while z(y|t = 1) is the
marginal likelihood, $p(y) = \int p(y|\theta)\,p(\theta)\,d\theta$.
To derive an estimate of z(y|t = 1), one may use the identity
$$\log(p(y)) = \log\left(\frac{z(y|t=1)}{z(y|t=0)}\right) = \int_0^1 E_{\theta|y,t}\log[p(y|\theta)]\,dt$$
which states that the log marginal likelihood is the integral, over temperatures t ranging from
0 to 1, of the expected log-likelihood with respect to the power posterior at temperature t.
This follows (Friel and Pettitt, 2008) because
$$\begin{aligned}
\frac{d}{dt}\log[z(y|t)] &= \frac{1}{z(y|t)}\frac{d}{dt}z(y|t) = \frac{1}{z(y|t)}\frac{d}{dt}\left[\int\{p(y|\theta)\}^t\,p(\theta)\,d\theta\right] \\
&= \frac{1}{z(y|t)}\int\{p(y|\theta)\}^t\log[p(y|\theta)]\,p(\theta)\,d\theta \\
&= \int\frac{\{p(y|\theta)\}^t\,p(\theta)}{z(y|t)}\log[p(y|\theta)]\,d\theta
\end{aligned}$$
$$q_s = a_s^c \qquad (3.1)$$
defined at cutpoints $\{a_0, \ldots, a_T\}$ in [0,1], where c is a specified positive power. So, the estimate
$\log(M_c)$ of the log marginal likelihood at that power is obtained by summing over T grid
points that combine information from successive expected log likelihoods,
$$\log(M_c) = \sum_{s=0}^{T-1}(q_{s+1} - q_s)\,\frac{1}{2}\left[E_{\theta|y,q_{s+1}}\log[p(y|\theta)] + E_{\theta|y,q_s}\log[p(y|\theta)]\right].$$
Friel and Pettitt (2008) take c = 4 in (3.1), while Xie et al. (2011) recommend values of c between 2.5
and 5. So with T = 40 intervals, equally spaced cutpoints $\{a_0 = 0, a_1 = 0.025, a_2 = 0.05, \ldots, a_{40} = 1\}$,
and setting c = 4, one has $q_0 = 0$, $q_1 = 0.025^4$, …, $q_{39} = 0.975^4$, $q_{40} = 1$. The Monte Carlo standard
error of $\log(M_c)$ is obtained as the square root of the summed variances of the contributions
to $\log(M_c)$ at each of T grid points. Thus let $d_s = (1/2)(q_{s+1} - q_s)$ and let $\nu_s$ be the
Monte Carlo variance of $E_{\theta|y,q_{s+1}}\log[p(y|\theta)]$. Then the variance at each grid point is $d_s^2\nu_s$
and the Monte Carlo variance of $\log(M_c)$ is $\sum_{s=0}^{T-1} d_s^2\nu_s$.
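The temperature schedule, trapezoidal summation and Monte Carlo standard error can be sketched in R for a toy model where the power posterior is available in closed form. The sketch assumes y[i] ~ N(θ, 1) with prior θ ~ N(0, 1), so that at temperature q the power posterior is normal with precision qn + 1 and mean q n ȳ/(qn + 1). In a real application each temperature would require its own MCMC run. Object names such as n.grid and c.pow are illustrative.

```r
set.seed(4)
n <- 20
y <- rnorm(n, 0.5, 1)              # y[i] ~ N(theta, 1), prior theta ~ N(0, 1)
n.grid <- 40; c.pow <- 4
q <- (0:n.grid / n.grid)^c.pow     # temperatures q_s = a_s^c
R <- 2000                          # draws per temperature
loglik <- function(th.i) sum(dnorm(y, th.i, 1, log = TRUE))
E <- V <- numeric(n.grid + 1)
for (s in 0:n.grid) {
  prec <- q[s + 1] * n + 1         # power posterior precision at q_s
  th <- rnorm(R, q[s + 1] * n * mean(y) / prec, sqrt(1 / prec))
  ll <- sapply(th, loglik)
  E[s + 1] <- mean(ll)             # expected log-likelihood at q_s
  V[s + 1] <- var(ll) / R          # Monte Carlo variance of that mean
}
dq <- diff(q)
# Trapezoidal rule over successive temperatures
logM.path <- sum(dq * (E[-1] + E[-(n.grid + 1)]) / 2)
# Standard error using d_s = dq/2 and the variance at the upper grid point
se.path <- sqrt(sum((dq / 2)^2 * V[-1]))
logM.exact <- -0.5 * n * log(2 * pi) - 0.5 * log(n + 1) -
  0.5 * (sum(y^2) - sum(y)^2 / (n + 1))
```

The power c = 4 concentrates grid points near zero, where the expected log-likelihood changes most rapidly as the power posterior moves away from the prior.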
To illustrate estimation in path sampling consider the online vignette demonstrating the
use of the R package bridgesampling (Gronau et al., 2017a;
https://cran.r-project.org/web/packages/bridgesampling/vignettes/bridgesampling_example_jags.html).
This example assumes a normal-normal two stage hierarchy (see Chapter 4), as often used in
meta-analysis, with known first level variance σ²
$$y_i \sim N(\theta_i, \sigma^2),$$
$$\theta_i \sim N(\mu, \tau^2).$$
The comparison is between a model with μ = 0 and a model with μ unknown. The data (n = 20
cases) are generated under the first option, namely with μ = 0, and also with σ² = 1 and τ² = 0.5.
Path sampling as in Friel and Pettitt (2008) is applied, with $q_s = a_s^4$ where
$\{a_s\} = \{a_0, 1/T, 2/T, \ldots, (T-1)/T, 1\}$ and T = 30. For numeric stability, $a_0$ is taken as 0.00001
rather than 0, so that $q_0$ = 1E−20. Estimates are made using jagsUI. The parameters and
likelihoods at each of the T + 1 points are estimated using the device from Barry (2006). The
likelihood is specified as in (4.4.3), with the $\theta_i$ integrated out, namely:
$$y_i \sim N(\mu, \sigma^2 + \tau^2),$$
with σ² = 1 known. Using the code listed in the Computational Notes [1] in Section 3.7, the
estimated log marginal likelihoods are closely similar to those reported in the bridgesampling
vignette, namely −37.53 for the zero-mean model, and −37.81 for the model with μ
unknown.
$$\eta_i = X_i\beta + W_i b_i,$$
where $X_i$ and $W_i$ are predictors, and $b_i$ are latent data. For such non-conjugate schemes, the
marginal likelihood is not obtainable analytically, and one possible approach to evaluating
marginal likelihoods is to work with the integrated likelihood
$$p(y|\theta) = \int p(y,b|\theta)\,db = \int p(y|b,\theta)\,p(b|\theta)\,db,$$
where the random effects or latent data have been integrated out, and where θ includes
hyperparameters ψ (e.g. covariances) governing the b, as well as parameters φ (e.g. fixed
regression effects) not relevant to the random effect hyperdensity (Sinharay and Stern,
2005; Fruhwirth-Schnatter, 1999). This can be done in practice in MCMC sampling by
applying importance sampling, the Laplace approximation, or numeric integration meth-
ods to the complete data likelihood p(y,b|θ).
However, it may be argued that under a Bayesian approach, the distinction between
fixed and random regression coefficients is less relevant, and so use of the integrated likeli-
hood approach and implied numerical complexity may be avoided. For example, one may
(e.g. Clayton, 1996) adopt a unified perspective on the parameters in the joint precision
matrix for the fixed effects (and other parameters not in the hyperdensity) φ, and the ran-
dom effects hyperparameters ψ. Chib (2008) proposes marginal likelihood estimation for
different classes of panel model by marginalisation over the random effects. Sinharay and
Stern (2005) also mention obtaining the marginal likelihood by considering the expanded
parameter set ω = (b,φ,ψ), so that
$$p(y) = \iint p(y|\varphi,\psi)\,p(\varphi,\psi)\,d\varphi\,d\psi = \iiint p(y|\varphi,b)\,p(b|\psi)\,p(\psi,\varphi)\,db\,d\psi\,d\varphi.$$
The advantage of working with the expanded likelihood p(y|b,φ) is the avoidance of
repeated integration, but this comes at the expense of an often considerably increased
dimension of the parameter space (namely by the number of components in b). Marginal
likelihood approximation retaining the expanded likelihood is considered in real exam-
ples by Nandram and Kim (2002) and Gelfand and Vlachos (2003).
∫
Let g(θ|b) be a density subject to g(q|b)dq = 1 , where θ = (ψ,φ), and let θ* be an appro-
priate fixed point (e.g. a posterior mean). Chen (2005) mentions an estimator for the log
marginal likelihood M = p(y) in a hierarchical modelling situation based on the identity
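$$p(y|\theta^*) = \int\!\!\int p(y|\theta^*,b)\,p(b|\theta^*)\,g(\theta|b)\,d\theta\,db,$$
leading to the estimator
$$\log[M] = \log[p(y|\theta^*)] - \log\left[\frac{1}{R}\sum_{r=1}^{R}\frac{g(\theta^{(r)}|b^{(r)})}{p(\theta^{(r)})}\,\frac{p(y|\theta^*,b^{(r)})\,p(b^{(r)}|\theta^*)}{p(y|\theta^{(r)},b^{(r)})\,p(b^{(r)}|\theta^{(r)})}\right].$$
When g(θ|b) is set to the prior p(θ), the first ratio cancels, so that
$$\log[M] = \log[p(y|\theta^*)] - \log\left[\frac{1}{R}\sum_{r=1}^{R}\frac{p(y|\theta^*,b^{(r)})\,p(b^{(r)}|\theta^*)}{p(y|\theta^{(r)},b^{(r)})\,p(b^{(r)}|\theta^{(r)})}\right].$$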
The component $\log[p(y|\theta^*)] = \log(L^*)$ may be estimated from the Monte Carlo average
$$L^* = \frac{1}{R}\sum_{r=1}^{R} p(y|\theta^*, b^{(r)}).$$
Chen (2005) shows that a variance minimising estimator is, however, obtained by setting
$g(\theta|b) = p(\theta|b,y)$, namely the conditional posterior density of θ given b.
Example 3.1 Marginal Likelihood and Bayes Factors, Turtle Mortality Data
This example applies approximations to the marginal likelihood to data from Sinharay
and Stern (2005). These are nested binary data yij on n = 244 newborn turtles i = 1, …, mj
clustered into clutches j = 1, …, J, with responses yij = 1 or 0 according to survival or death.
The known predictor is turtle birthweight xij so there are p = 2 regression parameters,
including an intercept. Graphical analysis suggests that heavier turtles have better sur-
vival chances, but also suggests extraneous variability in survival rates across clutches.
Sinharay and Stern (2005) compare several methods of deriving formal model fit measures,
namely marginal likelihoods or Bayes factors. Here two alternative models are
evaluated for the probability $\pi_{ij} = \Pr(y_{ij} = 1)$ using the temperature path approach of
Friel and Pettitt (2008) and jagsUI. One involves a fixed effects only regression on birthweight
with a probit link. The other assumes additional random effects based on clutch
membership. So model 1 specifies $\pi_{ij1} = \Phi(\beta_1 + \beta_2 x_{ij})$, while model 2 has
$$\pi_{ij2} = \Phi(\beta_1 + \beta_2 x_{ij} + b_j),$$
where $b_j \sim N(0, \sigma_b^2)$. The predictor $x_{ij}$ is standardised, and unlike Sinharay and Stern
(2005), N(0,1) priors are assumed on the fixed effects {β1,β2}. A gamma prior on τb, namely
$\tau_b \sim Ga(0.1, 0.1)$, is assumed, as the shrinkage prior
$$p(\sigma_b^2) \propto \frac{1}{(1 + \sigma_b^2)^2}, \qquad (3.2)$$
used by Sinharay and Stern cannot be implemented in jagsUI.
There are some possible sources of sensitivity: formal model measures may depend
on informativeness in the priors and on the form of prior, for example, the prior density
adopted on the random effects variance $\sigma_b^2$ or the precision $\tau_b = 1/\sigma_b^2$. For example, the
value $\sigma_b^2 = 1$ has a quarter of the prior weight for $\sigma_b^2 = 0$ under the prior used by Sinharay
and Stern (2005). With this prior, they obtain an inconclusive Bayes factor of 1.27 in
favour of the simpler fixed effects only model. Under particular methods, additional
sensitivity issues occur. Using the temperature path approach, estimates of the marginal
likelihood may be affected by the number of sequence points T (Drummond and
Bouckaert, 2015) and by the location of the points, especially points near zero.
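The remark about relative prior weight can be checked numerically. The following is a minimal sketch in base R, assuming the prior (3.2) is taken with unit constant of proportionality; the function names are illustrative only and do not come from any package.

```r
# Shrinkage prior (3.2): p(s2) = 1 / (1 + s2)^2 for s2 >= 0
shrink_dens <- function(s2) 1 / (1 + s2)^2

# Log density, the form in which a prior is usually supplied to a sampler
shrink_logdens <- function(s2) -2 * log1p(s2)

# The density integrates to 1 over (0, Inf), so no extra constant is needed
total_mass <- integrate(shrink_dens, lower = 0, upper = Inf)$value

# Relative prior weight of s2 = 1 compared with s2 = 0
weight_ratio <- shrink_dens(1) / shrink_dens(0)

print(c(total_mass = total_mass, weight_ratio = weight_ratio))
```

The ratio reproduces the quarter weight quoted in the text, and the log density is the quantity one would add to a log posterior when coding the prior directly.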
The temperature path $a_s$ has T = 10, and $q_s = a_s^4$ where a = (0.00025, 0.0005, 0.001, 0.005,
0.05, 0.075, 0.1, 0.25, 0.5, 0.75, 0.99). The parameters and likelihoods at each of the T + 1 points
are estimated using the device from Barry (2006). For numeric stability, the initial point
in the path is taken as 0.00025 rather than 0. Although formally, the estimate of log[p(y)]
is obtained by piecing together the separate posterior estimates $E_{\theta_k|y,t_s}\,p(y|\theta_k)$, an essentially
identical estimate is obtained by applying the trapezoid rule at each iteration and
monitoring the composite log marginal likelihood node.
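The trapezoid step can be sketched as a small R function. This is an illustration under stated assumptions: `path_logml` is a hypothetical helper, and the expected log-likelihood values passed to it are made up for the example rather than taken from the turtle analysis.

```r
# Trapezoid-rule estimate of log p(y) from a temperature path.
# temps : ascending temperatures in [0, 1]
# ell   : estimated expected log-likelihood at each temperature
path_logml <- function(temps, ell) {
  stopifnot(length(temps) == length(ell), !is.unsorted(temps))
  widths  <- diff(temps)
  heights <- (head(ell, -1) + tail(ell, -1)) / 2
  sum(widths * heights)
}

# Hypothetical inputs: an 11-point ladder and invented expected log-likelihoods
a   <- c(0.00025, 0.0005, 0.001, 0.005, 0.05, 0.075, 0.1, 0.25, 0.5, 0.75, 0.99)
ell <- -160 + 12 * sqrt(a)   # illustrative values only
est <- path_logml(a, ell)
print(est)
```

Because the expected log-likelihood typically changes fastest near zero, most of the ladder points are placed there, which is why the location of the low temperatures matters for the estimate.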
The log marginal likelihood estimate for model 1 is thus obtained as −150.4 as compared
to −152.56 for the random effects alternative, model 2, giving B12 = 8.7. The variance $\sigma_b^2$ is
estimated at 0.152. Relatively large clutch effects (mean, posterior sd) are obtained under
model 2 for clutches 9 and 15, namely 0.46 (0.27) and −0.37 (0.27).
As an alternative option for the random effects approach, defining model 3, a shrinkage
prior is implemented by taking a uniform prior on $U = 1/(1 + \sigma_b^2)^2$, namely
$$U \sim \text{Unif}(0, 1).$$
This option produces a clutch variance estimate $\sigma_b^2 = 0.149$. The log marginal likelihood
is −150.4, so that BF13 is 1. Thus varying Bayes factors illustrate the impacts of different
priors on the variance or precision. This approach can be extended to allow uncertainty
in the shrinkage prior power, allowing potentially more pronounced shrinkage
(Gustafson et al., 2006). Thus, one takes a uniform prior on
$$U = 1/(1 + \sigma_b^2)^P,$$
where P is unknown with a minimum of 1, so that P = 1 + P1, where $P_1 \sim Ga(0.1, 0.1)$.
This leads to an estimate for P of the default value 1, but with an estimated log marginal
likelihood of −151.4 (and so BF13 = 2.7), while the posterior mean for $\sigma_b^2$ is now 0.155.
An advantage with rstan is that the prior (3.2) can be represented using the expression
(within the model segment)
where sigma2 is the unknown variance. Using rstan in combination with the bridgesampling
package provides respective log marginal likelihoods for models 1 and 2 of −156.48
and −156.71, and a Bayes factor BF12 = 1.26 (Gronau et al., 2017a). This is close to that
reported by Sinharay and Stern (2005). The clutch variance for model 2 (with prior as in
equation 3.2) is $\sigma_b^2 = 0.153$.
This option is also used to compare the fixed effects regression model with a variable
slopes model (model 4), namely
$$\pi_{ij4} = \Phi(\beta_1 + (\beta_2 + b_j)x_{ij}),$$
where $b_j \sim N(0, \sigma_b^2)$. A gamma prior on $1/\sigma_b^2$ is taken. The resulting log marginal likelihood
is −160.13, indicating a more decisive advantage for the simpler model, with BF14 = 38.66.
This counts as strong evidence in favour of the simpler model according to the schedules
of Jeffreys (1961), and of Kass and Raftery (1995).
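The arithmetic linking log marginal likelihoods to Bayes factors can be sketched in R as follows. The helper names are hypothetical, and the evidence labels follow the 2 log(BF) scale of Kass and Raftery (1995); small differences from the quoted Bayes factors reflect rounding in the reported log marginal likelihoods.

```r
# Bayes factor for model A over model B from log marginal likelihoods
bayes_factor <- function(logml_a, logml_b) exp(logml_a - logml_b)

# Evidence category on the 2*log(BF) scale of Kass and Raftery (1995)
kr_category <- function(bf) {
  cut(2 * log(bf), breaks = c(-Inf, 0, 2, 6, 10, Inf),
      labels = c("favours other model", "bare mention", "positive",
                 "strong", "very strong"))
}

bf12 <- bayes_factor(-156.48, -156.71)   # bridge sampling, models 1 and 2
bf14 <- bayes_factor(-156.48, -160.13)   # model 1 against variable slopes
print(round(c(BF12 = bf12, BF14 = bf14), 2))
print(as.character(kr_category(c(bf12, bf14))))
```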
One may also apply rstan to direct path sampling, namely to estimating a sequence of
models with varying temperatures t, with t ranging from 0 to 1. If $U(t_i)$ denotes the actual
log likelihood at an ascending temperature sequence $t_i \in [0, 1]$, for i = 1, …, T, then the
log marginal likelihood estimate is $\sum_{i=1}^{T} U(t_i)/T$. Alternatively, one may generate T
temperatures randomly from the uniform U(0,1). For a selected temperature t, the code for the
fixed effects model is
model="data {
  int<lower = 1> N;
  int<lower = 0, upper = 1> y[N];
  real<lower = 0> x[N];
  real<lower = 0, upper = 1> t;  // temperature for path sampling
}
parameters {
  real alpha;
  real beta;
}
transformed parameters {
  // casewise log likelihood on the untempered scale
  real U_case[N];
  for (i in 1:N) {
    U_case[i] = bernoulli_lpmf(y[i] | Phi(alpha + beta*x[i]));
  }
}
model {
  target += normal_lpdf(alpha | 0, 3.16);
  target += normal_lpdf(beta | 0, 3.16);
  // likelihood raised to the power t
  for (i in 1:N) {
    target += t*bernoulli_lpmf(y[i] | Phi(alpha + beta*x[i]));
  }
}
generated quantities {
  real U;  // total log likelihood
  U = sum(U_case);
}
"
For example, a calling sequence with T = 1000 randomly generated
temperatures is
library(rstan)
T=1000
temps=runif(T,0,1)
U=numeric(T)
sink("sink.txt")   # divert sampler output to a file
for (i in 1:length(temps)) {
  D=list(y = turtles$y, x = turtles$x, N = 244, t = temps[i])
  fit=stan(model_code=model, data=D, iter=1250, warmup=250, chains=1,
           refresh=-1, seed=100)
  # posterior mean of the log likelihood at temperature temps[i]
  U[i]=summary(fit, pars = c("U"))$summary[1]
}
sink()
# log marginal likelihood estimate
mean(U)
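The logic of the calling sequence can be checked on a case where everything is available in closed form. The sketch below is not the turtle model: it assumes a conjugate normal model, $y_i \sim N(\theta, 1)$ with $\theta \sim N(0, 1)$, and simulated data, so that the expected log-likelihood at each temperature and the exact log marginal likelihood are both known.

```r
set.seed(1)
n <- 20
y <- rnorm(n, mean = 0.5, sd = 1)   # simulated data, not the turtle data

# Expected log-likelihood under the power posterior at temperature t.
# The power posterior is normal with precision 1 + n*t and
# mean t*sum(y)/(1 + n*t).
expected_loglik <- function(t) {
  v <- 1 / (1 + n * t)
  m <- t * sum(y) * v
  -0.5 * n * log(2 * pi) - 0.5 * (sum((y - m)^2) + n * v)
}

# Exact log marginal likelihood: y ~ N(0, I + 11')
exact <- -0.5 * n * log(2 * pi) - 0.5 * log(1 + n) -
  0.5 * (sum(y^2) - sum(y)^2 / (1 + n))

# Path sampling estimates: numerical integral over t, and the average over
# randomly drawn temperatures as in the calling sequence above
by_integral <- integrate(Vectorize(expected_loglik), 0, 1)$value
temps       <- runif(5000)
by_average  <- mean(sapply(temps, expected_loglik))

print(c(exact = exact, integral = by_integral, average = by_average))
```

In this toy setting the integral over temperature recovers the exact value, while the average over random temperatures carries Monte Carlo error on top of it; with MCMC output each $U(t_i)$ is itself an estimate.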
C = -2(log L1 - log L2 ) = D1 - D2
where p(y|θ) is the likelihood of the data y given parameters θ, and h(y) is a standardising
function of the data only (and so does not affect model choice).
Suppose the deviance is monitored during an MCMC run, providing samples
$\{D^{(1)}, \dots, D^{(R)}\}$. The overall fit of a model is measured by the posterior expected deviance
obtained by averaging over the posterior density of the parameters,
$$\bar{D} = E_{\theta|y}[D].$$
The effective number of parameters $d_e$ is obtained as $\bar{D} - D(\bar{\theta}|y)$,
namely the expected deviance minus the deviance at the posterior means of the parameters;
the latter is also known as the plug-in deviance (Plummer, 2008). In hierarchical
random effects models, the effective number of parameters in total is typically lower than
the nominal number of parameters, due to borrowing of strength under the hyperdensity
(e.g. Zhu et al., 2006; Buenconsejo et al., 2008).
The DIC is then obtainable as the expected deviance plus the effective model dimension,
$$\text{DIC} = \bar{D} + d_e.$$
So the DIC will prefer models with lower values of $\bar{D}$, combined with smaller values of $d_e$
(which indicate a relatively parsimonious model). A possible disadvantage with the DIC
is that it can be affected by reparameterisation of θ or by the form of link in general linear
models, with this applying in particular to the “plug-in” deviance $D(\bar{\theta}|y)$; hence the value
of $d_e$ may be sensitive to parameterisation.
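These quantities are simple functions of sampled deviances. The sketch below assumes a one-parameter normal model with known unit variance and a flat prior, with exact posterior draws standing in for MCMC output; in that setting the effective number of parameters should be close to 1.

```r
set.seed(2)
n <- 50
y <- rnorm(n, mean = 1, sd = 1)           # simulated data (assumption)

# Deviance for y_i ~ N(theta, 1): -2 * log-likelihood
deviance_fn <- function(theta) {
  -2 * sum(dnorm(y, mean = theta, sd = 1, log = TRUE))
}

# Stand-in for MCMC output: exact posterior draws under a flat prior on theta
theta_draws <- rnorm(20000, mean = mean(y), sd = 1 / sqrt(n))
dev_draws   <- sapply(theta_draws, deviance_fn)

D_bar  <- mean(dev_draws)                  # posterior expected deviance
D_plug <- deviance_fn(mean(theta_draws))   # plug-in deviance at posterior mean
d_e    <- D_bar - D_plug                   # effective number of parameters
DIC    <- D_bar + d_e

print(c(D_bar = D_bar, d_e = d_e, DIC = DIC))
```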
The deviance $D(\bar{\theta}|y)$ at the posterior mean $\bar{\theta}$ of the parameters may also be estimated
by using posterior means of quantities involved in defining the deviance, such as case
means (Poisson likelihood), means and overdispersion parameter (negative binomial
likelihood), means and variance (normal likelihood), and so on (Spiegelhalter et al.,
2002, p.596). Thus, let μi denote case specific means and ξ denote any other parameters
needed to derive the deviance. Then an estimate $D(\bar{\mu}, \bar{\xi}|y)$ may be more easily obtainable
than $D(\bar{\theta}|y)$ in complex (e.g. discrete mixture) models, or in models with many
random effects, where the number of nominal parameters may considerably exceed the
number of cases. This type of procedure is also mentioned by Spiegelhalter (2006) in
terms of monitoring the “direct parameters” that appear in the distributional syntax
and plugging these into the deviance; it was adopted in the paper by Ohlssen et al.
(2006, section 2).
The DIC and $d_e$ can be disaggregated to individual observations, and provide a measure
of local complexity, namely of observations that are more problematic under the model
relative to others. Spiegelhalter et al. (2002, p.602) mention that the local complexity measures
$$d_{ei} = \bar{D}_i - D_i(\bar{\theta})$$
measure the leverage of observation i, defined as the relative influence that each observation
has on its own fitted value. Unusually large observation specific DIC measures,
namely
$$\text{DIC}_i = \bar{D}_i + d_{ei}$$
are used by Spiegelhalter et al. (2002) as indicators of outlier status – observations incon-
sistent with the model. The DIC can be seen as a Bayesian version of AIC and may under-
penalise model complexity, as pointed out by discussants to Spiegelhalter et al. (2002). By
contrast, it is well established (Burnham and Anderson, 2002) that the BIC tends to select
overly parsimonious models. A fit criterion analogous to the BIC may be defined as
and was used by Pourahmadi and Daniels (2002, p.228) for panel data with repeated obser-
vations over n subjects.
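The casewise measures described above can be illustrated with a small simulation. The sketch assumes a normal mean model with unit variance, a flat prior, and one planted outlier; exact posterior draws stand in for MCMC output, and all object names are illustrative.

```r
set.seed(3)
y <- c(rnorm(29), 6)                       # simulated data with one planted outlier
n <- length(y)

# Stand-in for MCMC output: posterior draws of theta for y_i ~ N(theta, 1)
theta_draws <- rnorm(20000, mean = mean(y), sd = 1 / sqrt(n))

# Casewise deviance matrix: rows are draws, columns are observations
dev_case <- sapply(y, function(yi) -2 * dnorm(yi, theta_draws, 1, log = TRUE))

D_bar_i  <- colMeans(dev_case)                               # mean casewise deviance
D_plug_i <- -2 * dnorm(y, mean(theta_draws), 1, log = TRUE)  # at posterior mean
d_ei     <- D_bar_i - D_plug_i                               # local complexity
DIC_i    <- D_bar_i + d_ei

print(which.max(DIC_i))   # observation with the largest casewise DIC
print(sum(d_ei))          # total complexity
```

In this simple model every observation has the same leverage, so the casewise DIC singles out the planted value through its deviance rather than through its complexity term.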
Note that the model with the lowest DIC or DIC* will not necessarily be a suitable model
if it does not reproduce the data adequately. Hence, model checks are required to assess
consistency of predictions from the model with the actual observations.
Just as there are alternative approaches to marginal likelihood derivation in hierarchical
models, Spiegelhalter et al. (2002) point out that for such models, one cannot uniquely define
the likelihood or model complexity without specifying the level of the hierarchy that is the
model focus. Thus one might analyse count data using a complete data likelihood (with
unknown latent data b as well as hyperparameters θ) using a Poisson-gamma or Poisson-
lognormal model, or alternatively apply a negative binomial likelihood with the random
effects integrated out (Fahrmeir and Osuna, 2003), and the complexity measures will obvi-
ously differ.
Model choice may be affected by the focus, as shown by Plummer (2008, p.530) in an
analysis of a discrete mixture model, with one approach considering a complete data likelihood
$p_C(y|b,\theta)$ (with the parameters including missing component indicators), and the
other considering the integrated likelihood $p_I(y|\theta)$. Ando (2007) considered DICs based on
both conditional and integrated likelihoods, namely $\text{DIC}_C$ and $\text{DIC}_I$, and showed that both
tend to select overfitted (i.e. non-parsimonious) models.
Y is then the log-likelihood $\log\{P(Y|\theta)\}$ of θ. The corresponding loss function is the deviance
$D(\theta) = -2\log\{p(Y|\theta)\}$.
As estimates of the loss function, one may consider either the plug-in deviance or the
expected deviance
$$L_e(Y|Z) = -2\int \log\{P(Y|\theta)\}\,P(\theta|Z)\,d\theta,$$
with the test data considered fixed. Whereas the plug-in deviance is sensitive to reparameterisation
and does not take account of the precision of θ(Z), the expected deviance is
coordinate-free and takes account of precision.
When there are no training data, Y must be used to estimate θ and assess model fit.
However, L(Y,Y) is optimistic as a measure of model adequacy, as it uses the data twice.
Consider the corresponding function for observation i, namely L(Yi,Y). This can be com-
pared with the cross-validation loss L(Yi,Y[i]), where Y[i] is Y with observation i excluded.
The excess of L(Yi,Y) over L(Yi,Y[i]) provides a measure of optimism from using the data
twice. The expected decrease in loss due to using L(Yi,Y) instead of L(Yi,Y[i]) is obtained as
Issues of focus, as well as the derivation of the complexity measure $d_e$, are also considered
by Celeux et al. (2006). In general terms, a complexity measure or effective parameter
count is obtained by comparing the mean deviance with the deviance at the pseudo-true
parameter values θt (Spiegelhalter et al., 2002, section 3.2). There are various estimators
$\tilde{\theta}$ of the pseudo-true parameter values θt, apart from the element wise posterior means.
Another possibility is to consider the posterior mode, $\hat{\theta}$, that generates
the maximum posterior density $p(\theta|y) \propto p(y|\theta)p(\theta)$ (Celeux et al., 2006, p.654), namely
$\hat{\theta} = \arg\max_{\theta} p(\theta|y)$. In applications (e.g. discrete mixture models and random effect
models) with missing data b, this extends to considering the pair $(\hat{\theta}, \hat{b})$ that generates the
maximum posterior density (Celeux et al., 2006, p.656). Celeux et al. (2006) mention other
possibilities for $\tilde{\theta}$, such as the EM maximum likelihood estimate.
They state different DIC definitions under three alternative foci (observed data likelihood,
complete data likelihood, and conditional likelihood) and under different options
for $\tilde{\theta}$. For the observed data focus with likelihood p(y|θ), obtained possibly after integrating
out random effects, one has
$$\text{DIC} = \bar{D} + d_e = D(\bar{\theta}) + 2d_e = 2\bar{D} - D(\bar{\theta}) = -4E_{\theta}[\log\{p(y|\theta)\}|y] + 2\log[p(y|\bar{\theta})],$$
whereas taking $\tilde{\theta}$ as the posterior mode $\hat{\theta}$ amounts to an alternative DIC definition, denoted
DIC2 by Celeux et al., namely
$$\text{DIC}_2 = -4E_{\theta}[\log\{p(y|\theta)\}|y] + 2\log[p(y|\hat{\theta})].$$
Taking b as additional parameters, one may define $\tilde{\theta}$ on the basis of joint modal or maximum
a posteriori parameters, $(\hat{\theta}, \hat{b})$, with $d_e$ obtained by comparing the average deviance
$\bar{D}$ with
The joint mode $(\hat{\theta}, \hat{b})$ may be estimated by monitoring the posterior density over an MCMC
sequence, and finding that set of values $\{\theta^{(r)}, b^{(r)}\}$ associated with the maximum value,
$p_{max}(\theta|y)$, of the posterior density. The DIC may then be defined as
where
$$\text{LPPD}(y|\theta) = \sum_{i=1}^{n} \log \int p(y_i|\theta)\,p(\theta|y)\,d\theta$$
is the log posterior predictive density (LPPD) for y (Gelman et al., 2014), and $d_e$ is the estimated
effective model dimension (complexity). The LPPD is an estimate, albeit a biased
overestimate, of the expected log posterior predictive density (ELPD) for (unobserved)
new data $\tilde{y}$ generated from the same density as the observed data y, and the complexity
measure is a measure of the bias.
To estimate the LPPD for a particular observation, one obtains the likelihood for that
observation at each MCMC iteration (i.e. conditioning on θ(r) at iteration r). The resulting vector
of likelihoods, for observation i and samples r = 1, …, R, can be denoted $L_i = (L_{i1}, \dots, L_{iR})$.
The log of the mean of $L_{ir}$ over iterations r provides the LPPD for observation i, namely
$\text{LPPD}(y_i|\theta) = \log(\bar{L}_i)$. The total of these over observations is the estimate of the LPPD.
The estimated complexity for the WAIC is obtained by monitoring log-likelihoods during
MCMC sampling, namely $LL_{ir} = \log(L_{ir})$. Then the variance of $LL_i = (LL_{i1}, \dots, LL_{iR})$ provides
an estimate of complexity $d_{ei}$ for that observation, $d_{ei} = \text{var}(LL_i)$. The total of the $d_{ei}$ is
the total complexity $d_e$. The estimated piecewise WAIC can be obtained as $-2(\log(\bar{L}_i) - d_{ei})$,
and the total WAIC as the sum of the piecewise WAIC.
If the R package loo is used to obtain the LOO-IC, then it is more convenient to monitor
log-likelihoods (which are the input to loo), and then obtain sampled likelihoods by
exponentiation. For example, using rjags or jagsUI, and with R an object
containing model results (including sampled log-likelihoods, LL), WAIC calculations are
as follows:
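# rows are MCMC samples, columns are observations
LL = as.matrix(R$sims.list$LL)
L=exp(LL)
# casewise LPPD: log of the mean likelihood
waic1=log(apply(L,2,mean))
# casewise complexity: variance of the log-likelihoods
waic2=apply(LL,2,var)
# casewise waic
waic.pw=-2*(waic1-waic2)
elpd_waic=sum(waic1)-sum(waic2)
# total waic
waic=-2*elpd_waic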
The LOO-IC uses an estimate of the leave-one-out predictive fit (or ELPD)
$$\sum_{i=1}^{n} \log[p(y_i|y_{[i]})],$$
where
$$p(y_i|y_{[i]}) = \int p(y_i|\theta)\,p(\theta|y_{[i]})\,d\theta.$$
The latter may be estimated using samples θr from the full data posterior p(θ|y) using
importance ratios
$$IR_{ir} = \frac{1}{p(y_i|\theta^r)} \propto \frac{p(\theta^r|y_{[i]})}{p(\theta^r|y)},$$
giving
$$p(y_i|y_{[i]}) \approx \frac{\sum_{r=1}^{R} IR_{ir}\,p(y_i|\theta^r)}{\sum_{r=1}^{R} IR_{ir}} = \frac{1}{\frac{1}{R}\sum_{r=1}^{R}\frac{1}{p(y_i|\theta^r)}}.$$
This estimator may be unstable due to high variances of the importance ratios for certain
observations.
Vehtari et al. (2017) use a smoothed version of the importance ratios based on fitting
a generalised Pareto density to the upper tail of the importance ratios, leading to
Pareto smoothed importance sampling (PSIS) estimates of the LOO-IC. Let $w_{ir}$ denote the
smoothed importance weights. Then the estimate of the ELPD is
$$\text{ELPD}_{\text{PSIS-LOO}} = \sum_{i=1}^{n} \log\left(\frac{\sum_{r=1}^{R} w_{ir}\,p(y_i|\theta^r)}{\sum_{r=1}^{R} w_{ir}}\right),$$
with the LOO-IC estimated as $-2 \times \text{ELPD}_{\text{PSIS-LOO}}$. The estimate of the effective parameter
total is then the difference between the LPPD and $\text{ELPD}_{\text{PSIS-LOO}}$.
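# rows are MCMC samples, columns are observations
LL = as.matrix(R$sims.list$LL)
L=exp(LL)
S = nrow(LL)
n = ncol(LL)
# casewise LPPD
lpd_pw = log(colMeans(L))
# raw importance ratios (inverse likelihoods), rescaled for numerical stability
w = 1/exp(LL-max(LL))
# normalise within each observation, then truncate large weights at sqrt(S)
w_n = w/matrix(colMeans(w),S,n,byrow=TRUE)
w_r = pmin(w_n, sqrt(S))
elpd_loo_pw = log(colMeans(L*w_r)/colMeans(w_r))
p_loo_pw = lpd_pw - elpd_loo_pw
# Complexity
sum(p_loo_pw)
# LOO-IC
-2*sum(elpd_loo_pw)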
Though the WAIC and LOO-IC provide an estimate of predictive ability, both are sub-
ject to stochastic variability which can be considerable for smaller datasets (Piironen and
Vehtari, 2017). There may also be cautions regarding the estimates of WAIC and LOO-IC,
provided in the loo package and discussed by Vehtari et al. (2017, p.1416). For the LOO-IC,
these are based on the estimated shape parameter of the generalized Pareto, values of
which indicate whether the variance of the importance ratios is effectively infinite.
3.3.4 The WBIC
The BIC is a penalised fit measure, and the widely applicable Bayesian information criterion
or WBIC (Watanabe, 2013) is therefore included here, though it is essentially based on
an estimator of the marginal likelihood. Thus following Friel et al. (2017), and referring to
path sampling ideas, there exists a unique temperature t* such that
$$\log[p(y)] = E_{\theta|y,t^*}[\log p(y|\theta)].$$
Watanabe (2013) shows that asymptotically, as the sample size n tends to ∞, $t^* \approx 1/\log(n)$.
Friel et al. (2017) show for a number of worked examples that the optimal t* is smaller than
1/log(n), but that the latter approximation may be a useful practical option, except when
weakly informative priors are used.
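The role of the temperature t* can be illustrated on a conjugate example where the expected log-likelihood at any temperature is available exactly. The sketch below assumes $y_i \sim N(\theta, 1)$ with $\theta \sim N(0, 1)$ and simulated data; it is an illustration of the idea, not a general WBIC implementation.

```r
set.seed(4)
n <- 100
y <- rnorm(n, mean = 0.3, sd = 1)   # simulated data (assumption)

# Expected log-likelihood under the power posterior at temperature t, where
# the power posterior is N(t*sum(y)/(1 + n*t), 1/(1 + n*t))
expected_loglik <- function(t) {
  v <- 1 / (1 + n * t)
  m <- t * sum(y) * v
  -0.5 * n * log(2 * pi) - 0.5 * (sum((y - m)^2) + n * v)
}

# Exact log marginal likelihood for this conjugate model
exact <- -0.5 * n * log(2 * pi) - 0.5 * log(1 + n) -
  0.5 * (sum(y^2) - sum(y)^2 / (1 + n))

# WBIC-style estimate at the asymptotic temperature 1/log(n)
t_wbic     <- 1 / log(n)
wbic_logml <- expected_loglik(t_wbic)

# Temperature at which the expected log-likelihood equals log p(y) exactly
t_star <- uniroot(function(t) expected_loglik(t) - exact,
                  interval = c(0, 1), tol = 1e-12)$root

print(c(exact = exact, wbic = wbic_logml, t_wbic = t_wbic, t_star = t_star))
```

Comparing `t_star` with `t_wbic` for different sample sizes or prior variances gives a feel for when the 1/log(n) approximation is adequate.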
FIGURE 3.1
Density of clutch variance.
FIGURE 3.2
Random slopes on birth weight.
$$p(\beta_j|J_j) = J_j\,p(\beta_j|J_j = 1) + (1 - J_j)\,p(\beta_j|J_j = 0),$$
$$p(\beta_j|J_j = 1) \sim N(0, V_j),$$
$$p(\beta_j|J_j = 0) \sim N(0, V_j/K_j), \quad K_j \gg 1,$$
with Kj chosen so that the sampling from the prior is constrained to values around zero,
that is, to substantively insignificant values. If all p predictors apart from the intercept are
open to inclusion or exclusion, then MCMC sampling over parameters βj and indicators Jj
is averaging over $2^p$ possible models (Fernandez et al., 2001).
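Drawing from such a mixture prior makes the contrast between the two components concrete. In the sketch below the values of p, $V_j$, $K_j$ and the Bernoulli probability 0.5 for the indicators are all assumptions chosen for illustration.

```r
set.seed(5)
p       <- 3       # number of predictors open to selection (assumption)
V       <- 10      # slab variance V_j, taken equal across j (assumption)
K       <- 1000    # variance deflator K_j >> 1 for the spike (assumption)
n_draws <- 10000

# Indicators J_j ~ Bern(0.5), then beta_j from the slab (J = 1) or spike (J = 0)
J    <- matrix(rbinom(n_draws * p, 1, 0.5), n_draws, p)
sds  <- ifelse(J == 1, sqrt(V), sqrt(V / K))
beta <- matrix(rnorm(n_draws * p, 0, sds), n_draws, p)

# Each row of J indexes one of the 2^p candidate models
model_id <- apply(J, 1, paste, collapse = "")
print(length(unique(model_id)))
print(c(sd_spike = sd(beta[J == 0]), sd_slab = sd(beta[J == 1])))
```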
By contrast, Kuo and Mallick (1998) and Smith and Kohn (1996) take the selection indicators
Jj and coefficients βj to be independent rather than being governed by mixture priors.
Assuming normal priors, one has βj = 0 if Jj = 0, but $\beta_j \sim N(0, V_j)$ if Jj = 1. Following Zellner
(1986), the prior on $(\beta_0, \beta_1, \dots, \beta_p)$ may be specified as a g-prior, namely
one can consider retaining covariances $\Sigma_{bgh}$ subject to variances in both effects $b_{gj}$ and $b_{hj}$
being retained. Thus, Smith and Kohn (2002) identify zero off-diagonal elements in the
inverse $\Pi_b = \Sigma_b^{-1}$ of the variance-covariance matrix. Alternatively, one may also allow the
exclusion of variance components (diagonal terms in Σb), which necessarily leads to exclusion
of associated covariances.
Selection schemes applicable to both diagonal and off-diagonal elements in covariance
matrices for random effects have been developed by Fruhwirth-Schnatter and Tuchler
(2008), Chen and Dunson (2003), Kinney and Dunson (2008), and Cai and Dunson (2006);
for applications, see Yang (2012), Saville et al. (2011), and Harun and Cai (2014). Note that
these methods may be relatively difficult to implement, with Saville and Herring (2009)
finding “these methods are generally time consuming to implement, require special soft-
ware, and rely on subjective choice of hyperparameters.”
Consider a general linear mixed model for nested responses yij (as in longitudinal data
with repetitions i over subjects j) with means μij. These means are linked to a P × 1 vector
of regressors Xij and Q × 1 vector of regressors Zij via the model
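$$\Sigma_b = \Lambda\Gamma\Gamma'\Lambda,$$
$$\Gamma = \begin{pmatrix} 1 & 0 & \dots & 0 \\ \gamma_{21} & 1 & \dots & 0 \\ \dots & \dots & \dots & 0 \\ \gamma_{Q1} & \gamma_{Q2} & \dots & 1 \end{pmatrix},$$
implying
$$\sigma_{bkl} = \lambda_k\lambda_l\left(\gamma_{r_2 r_1} + \sum_{s=1}^{r_1 - 1}\gamma_{ks}\gamma_{ls}\right),$$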
The selection indicators for retaining variances and covariances are $J_q \sim \text{Bern}(\pi_\Lambda)$, governing
the diagonal terms in Λ, and $H_{kl} \sim \text{Bern}(\pi_\Gamma)$ governing the terms in Γ. Note that
retaining γkl requires not only Hkl = 1, but $J_k = J_l = 1$. If either Jk or Jl is zero, then γkl is necessarily
excluded. Cai and Dunson (2006) suggest positive truncated normal priors with
variance 10 for the diagonal terms λq, namely
$$\lambda_q \sim N^{+}(0, 10) \text{ if } J_q = 1,$$
$$\lambda_q = 0 \text{ if } J_q = 0.$$
Diffuse priors are not recommended (Cai and Dunson, 2008, p.72), as they may favour the
null model. There may also be a case for interlinked priors for λq and the variances of the
uij effects (if present).
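The decomposition and its elementwise form can be verified numerically. The sketch below assumes Λ is diagonal with entries $\lambda_q$, takes $r_1 = \min(k, l)$ and $r_2 = \max(k, l)$ for off-diagonal terms, and uses invented values for the λ and γ parameters.

```r
Q      <- 3
lambda <- c(1.5, 0.8, 0.4)                     # diagonal of Lambda (illustrative)
Gamma  <- diag(Q)
Gamma[lower.tri(Gamma)] <- c(0.5, -0.3, 0.2)   # gamma_21, gamma_31, gamma_32

Lambda  <- diag(lambda)
Sigma_b <- Lambda %*% Gamma %*% t(Gamma) %*% Lambda

# Elementwise form for k != l, taking r1 = min(k, l) and r2 = max(k, l)
sigma_kl <- function(k, l) {
  r1 <- min(k, l); r2 <- max(k, l)
  s  <- seq_len(r1 - 1)
  lambda[k] * lambda[l] * (Gamma[r2, r1] + sum(Gamma[k, s] * Gamma[l, s]))
}

# Excluding effect 2 (J_2 = 0, so lambda_2 = 0) removes its variance
# and all covariances involving it
lambda_excl <- lambda * c(1, 0, 1)
Sigma_excl  <- diag(lambda_excl) %*% Gamma %*% t(Gamma) %*% diag(lambda_excl)

print(round(Sigma_b, 3))
print(round(Sigma_excl, 3))
```

Setting a single $\lambda_q$ to zero zeroes the corresponding row and column, which is how exclusion of a variance component carries its covariances with it.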
Fruhwirth-Schnatter and Tuchler (2008) consider the covariance matrix decomposition
$$\Sigma_b = CC',$$
with C a lower triangular matrix of dimension Q including unknown diagonal terms Cqq.
To illustrate the covariance selection procedure, a hierarchical linear normal model with
varying cluster regression effects $\beta_j = \beta + b_j$ of dimension Q would be reframed as
$$C_{kl} = 0 \text{ if } J_{kl} = 0,$$
and bjk is 0 at a particular iteration if all Ckl in the kth row of C are zero. A possible prior for
the Jkl indicators is Bernoulli with probability πJ, where πJ follows a beta density,
based on the total free covariance parameters, and the number TJ of Jkl taking the value 1
(i.e. the number of non-zero elements in C). For Q = 1 in a model where a cluster level ran-
dom intercept is to be tested for inclusion, one would have
where $z_j \sim N(0, 1)$, and c ≠ 0 if J = 1 and c = 0 if J = 0. The (model averaged) estimate of the
covariance matrix Σb of the bj over r = 1, …, R iterations of a chain is obtained as
$$\hat{\Sigma}_b = \frac{1}{R}\sum_{r=1}^{R} C^{(r)}(C')^{(r)}.$$
Methods for selecting the entire random effect term extend to selection of individual ran-
dom effects. For selecting the entire term, consider a spike and slab prior with the spike
component having considerably lower variance:
where r ≪ 1. This extends to selection of individual random effects, for example using
Lasso random effect models (Fruhwirth-Schnatter and Wagner, 2010) involving component-specific
indicators δi and a hierarchical prior on the variances. For example, a mixture
of Laplace densities is obtained under
$$z_{1i} \sim E(1/(2rQ)),$$
$$z_{2i} \sim E(1/(2Q)),$$
with r set small, so that $z_{1i} \approx 0$. The δi are binary indicators with unknown probability πδ,
the prior proportion of subjects with non-zero random effects. If Q is also unknown, there
may be identification issues under independent priors, as different combinations of πδ and
Q can give similar bi.
$$y_i \sim N(n_i, p_i),$$
$$b_i \sim N(0, \sigma_b^2).$$
Fitting this baseline model, without any random effects selection, suggests not all the
plate effects are needed. Posterior mean probabilities for $\Pr(b_i > 0|y)$ are inconclusive,
ranging from 0.34 to 0.63.
As one approach to selection, the method of Fruhwirth-Schnatter and Wagner (2010)
seeks to classify units as either close to average (with δi ≈ 0, with bi close to zero, and
effectively unnecessary), above average with δi ≈ 1, and high $\Pr(b_i > 0|y)$, or below average,
also with δi ≈ 1 but high $\Pr(b_i < 0|y) = 1 - \Pr(b_i > 0|y)$. A Laplace mixture density for
the plate effects is used, namely
$$\delta_i \sim \text{Bern}(w),$$
$$w \sim \text{Beta}(1, 1),$$
with r = 0.00001 and 1/Q ~ Ga(0.5, 0.2275), the latter as suggested by Fruhwirth-Schnatter
and Wagner (2010).
Estimated retention probabilities $\Pr(\delta_i = 1|y)$ range from 0.48 to 0.70, while the
probabilities of high effects $\Pr(b_i > 0|y)$ range from 0.18 to 0.82. The most distinctive
$\Pr(b_i > 0|y)$ are for plates 10 and 17, with probabilities $\Pr(b_i < 0|y)$ around 0.80, and
plates 4 and 15 with probabilities $\Pr(b_i > 0|y)$ exceeding 0.80 (cf. Fruhwirth-Schnatter
and Wagner, 2010, Table 7). Figure 3.3 plots out the probabilities $\Pr(b_i > 0|y)$. The probabilities
of high effects $\Pr(b_i > 0|y)$ are relatively stable as less informative Ga(1, 0.05)
and Ga(1, 0.01) priors are assumed for 1/Q.
We also consider a horseshoe prior for the plate effects, namely
$$b_i \sim N(0, \lambda_i^2\sigma_b^2),$$
with half Cauchy C(0, 1)+ priors on both the λi and $\sigma_b^2$. As mentioned by Carvalho et al.
(2009), $\phi_i = 1/(1 + \lambda_i^2)$ is interpretable as the amount of weight that the posterior mean
for bi places on zero. We consider instead $\kappa_i = \lambda_i^2/(1 + \lambda_i^2)$ as an indicator for non-zero
posterior mean bi, analogous to a probability that bi ≠ 0. The estimated κi range from
0.35 to 0.61, with κi greater than 0.5 for plates 4, 10, 15, 16 and 17 (see Figure 3.4). Despite
the extra parameters in this extended model as compared to the baseline model, a
formal comparison shows similar marginal likelihoods for the extended and baseline
models.
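The prior behaviour of these weights can be examined by simulation. The sketch below draws local scales from a half Cauchy C(0, 1)+ density with the global scale held at 1 (an assumption made to keep the example short); under that assumption both weights follow a Beta(1/2, 1/2) density a priori.

```r
set.seed(6)
n_draws <- 100000

# Half-Cauchy C+(0, 1) draws for the local scales lambda_i
lambda <- abs(rcauchy(n_draws, location = 0, scale = 1))

shrink <- 1 / (1 + lambda^2)          # weight placed on zero
kappa  <- lambda^2 / (1 + lambda^2)   # complementary weight for a non-zero effect

# A priori both weights are Beta(1/2, 1/2): U-shaped with mean 0.5
print(c(mean_kappa      = mean(kappa),
        share_above_0.9 = mean(kappa > 0.9),
        share_below_0.1 = mean(kappa < 0.1)))
```

The U shape of the prior means that posterior values of κi near 0.5, as reported above, indicate that the data do not strongly separate zero from non-zero effects.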
FIGURE 3.3
Probabilities of high random effects.
FIGURE 3.4
Histogram of weights for non-zero effects, horseshoe prior.
with errors taken to be uncorrelated through time. In line with a commonly adopted
methodology, the $b_{qi}$ are taken to be bivariate normal with mean zero and covariance
Σb. The precision matrix $\Sigma_b^{-1}$ is assumed to be Wishart with 2 degrees of freedom and
identity scale matrix, S. The observation level precision is taken to have a gamma prior,
$\tau_u \sim Ga(1, 0.001)$.
A two-chain run of 5000 iterations in jagsUI gives posterior means (sd) for $\beta = (\beta_1, \beta_2)$ of
92.6 (0.7) and 0.41 (0.10). Posterior means (sd) for the random effect standard deviations
$\sigma_{bj} = \sqrt{\Sigma_{bjj}}$ of $\{b_{1i}, b_{2i}\}$ are 5.55 (0.42), and 0.64 (0.13). The ratios $b_{ji}/sd(b_{ji})$ of posterior
means to standard deviations of the varying intercepts and slopes both show variation,
though less so for the slopes. While 42 of 288 ratios $b_{1i}/sd(b_{1i})$ exceed 2, only 2 of the
corresponding ratios for slopes do. Correlation between the effects does not seem to be
apparent, with $\sigma_{b12}$ having a 95% interval straddling zero.
In a second analysis, covariance selection is considered via the approach of
Fruhwirth-Schnatter and Tuchler (2008). Context-based informative priors for the
diagonal elements of C are assumed. Initially C11 ~ Ga(1, 0.2) and C22 ~ Ga(1, 1.5), based
on the posterior means 5.55 and 0.64 for the random effects standard deviations from
the preceding analysis. For the lower diagonal term, a normal prior C21 ~ N(0, 1) is
assumed. These options are preferred to, say, adopting diffuse priors on the Cjk terms,
in order to stabilise the covariance selection analysis. Note that the covariance term
Σ21 is non-zero only when both C11 and C21 are retained. This option gives a posterior
probability of 1 for retaining slope variation, while the posterior probability for inter-
cept variation is 0.98.
However, priors on Cjk that downweight the baseline analysis more lead to lower
retention probabilities. Taking C11 ~ Ga(0.5, 0.1), C22 ~ Ga(0.5, 0.75) and C21 ~ N(0, 2)
gives retention probabilities for varying intercepts and slopes of 1 and 0.93 respectively.
Similarly, taking C11 ~ Ga(0.1, 0.02), C22 ~ Ga(0.1, 0.15) and C21 ~ N(0, 10) gives retention
probabilities for varying intercepts and slopes of 0.65 and 0.98 respectively. This is in
line with a general principle that model selection tends to choose the null model if
diffuse priors are taken on the parameter(s) subject to inclusion or rejection (Cai and
Dunson, 2008).
We also consider an adaptation of the method of Saville and Herring (2009) for continuous
nested outcomes, which involves scaling factors exp(ϕj) premultiplying random
effects (e.g. cluster intercepts and slopes) taken to have the same variance as the main
residual term. This allows Bayes factor calculation using Laplace methods. In the current
application, and allowing for correlated slopes and intercepts, one has
$$b_{1i} \sim N(0, 1/\tau_u),$$
For the ϕj, discrete mixture priors are adopted, with one option corresponding to
$\lambda_j = \exp(\phi_j)$ being close to zero, while in the other, the prior on ϕj allows unrestricted
sampling. Here
$$J_j \sim \text{Bern}(0.5).$$
This provides posterior probabilities $\Pr(J_j = 1)$ of 1 and 0.41 respectively for random
intercepts and slopes. Posterior means (sd) for the random effect standard deviations
$\sigma_{bj} = \sqrt{\Sigma_{bjj}}$ of $\{b_{1i}, b_{2i}\}$ are 6.18 (0.39), and 0.16 (0.17).
$$p(y_i|y_{[i]}) = \int p(y_i|\theta, y_{[i]})\,p(\theta|y_{[i]})\,d\theta,$$
are called conditional predictive ordinates or CPOs (e.g. Chaloner and Brant, 1988; Geisser
and Eddy, 1979), and sampling from them shows what values of yi are likely when a model
is applied to all the data points except the ith, namely to the data y[i]. The predictive dis-
tribution p( yi | y[i] ) can be compared to the actual observation in various ways (Gelfand et
al., 1992).
For example, to assess whether the observation is extreme (not well fitted) in terms of
the model being applied, replicate data $y_{i,rep}$ may be sampled from $p(y_i|y_{[i]})$ and their concordance
with the data may be represented by probabilities (Marshall and Spiegelhalter,
2003),
$$\Pr(y_{i,rep} \le y_i \,|\, y_{[i]}).$$
These are estimated in practice by counting iterations r where the constraint $y_{i,rep}^{(r)} \le y_i$
holds. For discrete data, this assessment is based on the probability
Gelfand (1996) recommends assessing concordance between predictions and actual data
by a tally of how many actual observations yi are located within the 95% interval of the
corresponding model prediction yi,rep. For example, if 95% or more of all the observations
are within 95% posterior intervals of the predictions yi,rep, then the model is judged to be
reproducing the observations satisfactorily.
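Both checks reduce to simple summaries of a matrix of sampled replicates. The sketch below uses simulated observations and draws replicates from the data-generating density as a stand-in for model output; in an application `y_rep` would hold the monitored replicates, with rows as iterations and columns as cases.

```r
set.seed(7)
n <- 200; R <- 2000
y <- rnorm(n, mean = 2, sd = 1)                  # simulated observations (assumption)

# Stand-in for sampled replicates: R draws of y_rep for each observation
y_rep <- matrix(rnorm(R * n, mean = 2, sd = 1), R, n)

# Tail probabilities: share of iterations with y_rep <= y_i
p_low <- colMeans(sweep(y_rep, 2, y, "<="))

# Tally of observations inside the 95% interval of their replicates
lims     <- apply(y_rep, 2, quantile, probs = c(0.025, 0.975))
inside   <- y >= lims[1, ] & y <= lims[2, ]
coverage <- mean(inside)

print(coverage)
print(which(p_low < 0.025 | p_low > 0.975))      # cases flagged as extreme
```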
The collection of predictive ordinates $\{p(y_i|y_{[i]}), i = 1, \dots, n\}$ is equivalent to the marginal
likelihood p(y) when p(y) is proper, in that each uniquely determines the other. A pseudo
Bayes factor is obtained as a ratio of products of leave one out cross-validation predictive
densities (Vehtari and Lampinen, 2002) under models M1 and M2, namely
$$\text{PsBF}(M_1, M_2) = \prod_{i=1}^{n}\{p(y_i|y_{[i]}, M_1)/p(y_i|y_{[i]}, M_2)\}.$$
In practical data analysis, one typically uses logs of CPO estimates, and totals the log(CPO) to
derive log pseudo marginal likelihoods and log pseudo Bayes factors (Sinha et al., 1999, p.588).
Monte Carlo estimates of conditional predictive ordinates p( yi | y[i] ) may be obtained
without actually omitting cases, so formal cross-validation based on n separate estima-
tions (the 1st omitting case 1, the 2nd omitting case 2, etc) may be approximated by using
a single estimation run. For parameter samples $\{\theta^{(1)}, \ldots, \theta^{(R)}\}$ from an MCMC chain, an
estimator for the CPO, p( yi | y[i] ), is
$$\frac{1}{p(y_i | y_{[i]})} = \frac{1}{R} \sum_{r=1}^{R} \frac{1}{p(y_i | \theta^{(r)})},$$
namely the harmonic mean of the likelihoods for each observation (Aslanidou et al.,
1998; Silva et al., 2006; Sinha, 1993). In computing terms, an inverse likelihood needs to
be calculated for each case at each iteration, the posterior means of these inverse likelihoods
obtained, and the CPOs are the inverse of those posterior mean inverse likelihoods.
Denoting the inverse likelihoods as $H_i^{(r)} = 1/p(y_i | \theta^{(r)})$, one would in practice take minus
the logarithms of the posterior means of $H_i$ as an estimate of log(CPO)i. The sum over all
cases of these estimates provides a simple estimate of the log pseudo marginal likelihood.
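The harmonic mean calculation can be sketched in R as follows. This is a minimal illustration rather than code from the examples: the matrix lik (iterations by cases), the helper log_cpo, and the stand-in posterior draws for a normal mean with known unit variance are all assumptions made for the sketch.

```r
# Sketch: log(CPO) and log pseudo marginal likelihood from MCMC output.
# 'lik' is a hypothetical R x n matrix with lik[r, i] = p(y_i | theta^(r)).
log_cpo <- function(lik) {
  H <- 1 / lik                 # inverse likelihoods H_i^(r)
  -log(colMeans(H))            # minus log of the posterior mean inverse likelihood
}

# Stand-in posterior draws for a normal mean (illustration only)
set.seed(1)
y     <- c(-0.3, 0.4, 1.1, 0.2, 3.5)                    # last case is an outlier
theta <- rnorm(2000, mean(y), 1 / sqrt(length(y)))      # draws of theta given y
lik   <- sapply(y, function(yi) dnorm(yi, theta, 1))    # 2000 x 5 likelihood matrix

lcpo <- log_cpo(lik)   # casewise log(CPO)
lpml <- sum(lcpo)      # log pseudo marginal likelihood
```

The least well fitted case has the smallest log(CPO). Because inverse likelihoods can be very large for outlying cases, the estimate may be unstable in practice, which is one motivation for the alternatives discussed later in the section.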
In the turtle data example (Example 3.1), the fixed effects only model 1 has a log pseudo marginal
likelihood of −151.8, while the random intercepts model 2 has a log pseudo marginal likelihood of
−149.6 under a Ga(0.1,0.1) prior for τb. So the log pseudo Bayes factor, 2.2 in favour of model 2,
tends to weakly support the random effects option.
Model fit (and hence choice) may also be assessed by comparing samples yrep from the
posterior predictive density based on all observations, though such procedures may be con-
servative since the presence of yi influences the sampled yi,rep (Marshall and Spiegelhalter,
2003). Laud and Ibrahim (1995) and Meyer and Laud (2002) propose model choice based on
minimisation of the criterion
$$C = \sum_{i=1}^{n} \left\{ \mathrm{var}(y_{i,rep}) + [y_i - E(y_{i,rep})]^2 \right\}.$$
The C measure can be obtained from the posterior means and variances of sampled $y_{i,rep}^{(r)}$ or
from the posterior average of $\sum_{i=1}^{n} (y_{i,rep}^{(r)} - y_i)^2$. Carlin and Louis (2000) and Buck and Sahu
(2000) propose related model fit criteria appropriate to both metric and discrete outcomes.
Posterior predictive loss (PPL) model choice criteria allow varying trade-offs in the bal-
ance between bias in predictions and their precision (Gelfand and Ghosh, 1998; Ibrahim et
al., 2001). Thus for k positive and y continuous, one possible criterion has the form
$$\mathrm{PPL}(k) = \sum_{i=1}^{n} \left\{ \mathrm{var}(y_{i,rep}) + \left( \frac{k}{k+1} \right) [y_i - E(y_{i,rep})]^2 \right\}.$$
This criterion would be compared between models at selected values of k, typical values
being k = 0, k = 1, and k = 10,000, where higher k values put greater stress on accuracy in
predictions, and less on precision. One may consider calibration of such measures, namely
expressing the uncertainty of C or PPL in a variance measure (Laud and Ibrahim, 1995;
Ibrahim et al., 2001). De la Horra and Rodríguez-Bernal (2005) suggest predictive model
choice based on measures of distance between the two densities that can potentially be
used for predicting future observations, namely sampling densities and posterior predic-
tive densities.
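A short R sketch may help fix ideas on how PPL(k) is evaluated from replicate samples. The function ppl, the replicate matrix yrep, and the deliberately biased stand-in replicates are hypothetical. The sketch also assumes that the C measure is the sum of predictive variances and squared prediction errors, namely the limit of PPL(k) for large k.

```r
# Sketch: posterior predictive loss from replicate draws.
# 'yrep' is a hypothetical R x n matrix of replicates y_{i,rep}^(r); 'y' has length n.
ppl <- function(y, yrep, k) {
  pred_mean <- colMeans(yrep)             # E(y_{i,rep})
  pred_var  <- apply(yrep, 2, var)        # var(y_{i,rep})
  w <- if (is.infinite(k)) 1 else k / (k + 1)
  sum(pred_var + w * (y - pred_mean)^2)
}

set.seed(2)
y    <- c(1.2, 0.7, 2.9, 1.8)
# Stand-in replicates with a built-in bias of 0.3 (illustration only)
yrep <- sapply(y, function(yi) rnorm(4000, mean = yi + 0.3, sd = 1))

ppl_k <- sapply(c(0, 1, 10000), function(k) ppl(y, yrep, k))

# Posterior average of the summed squared deviations, for comparison with k = Inf
C_direct <- mean(rowSums(sweep(yrep, 2, y)^2))
```

With k = 0 only the predictive variances contribute, while large k adds the full squared bias, so ppl_k increases with k whenever predictions are biased.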
To assess poorly fitted cases, the CPO values may be scaled (dividing by their maximum)
and low values for particular observations (e.g. under 0.001) will then show observations
which the model does not reproduce effectively (Weiss, 1994). If there are no very small
scaled CPOs, then a relatively good fit of the model to all data points is suggested, and is
likely to be confirmed by other forms of predictive check. The ratio of extreme percentiles
of the CPOs is useful as an indicator of a good fitting model e.g. the ratio of the 99th to the
1st percentile.
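As an illustration of this screening, the following R sketch scales a set of CPOs and flags small values. The log(CPO) values are invented for the purpose and are not taken from any example in the text.

```r
# Sketch: flag poorly fitted cases from (hypothetical) log(CPO) values.
log_cpo <- c(-2.1, -1.7, -9.4, -2.5, -1.9, -3.0)   # illustrative values only

scaled  <- exp(log_cpo - max(log_cpo))   # CPO_i / max(CPO), formed on the log scale
flagged <- which(scaled < 0.001)         # threshold mentioned in the text

# Ratio of the 99th to the 1st percentile of the CPOs
cpo <- exp(log_cpo)
cpo_ratio <- unname(quantile(cpo, 0.99) / quantile(cpo, 0.01))
```

Here only the third case falls below the 0.001 threshold. Working with differences of log(CPO) avoids underflow when individual CPOs are very small.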
An improved estimate of the CPO may be obtained by weighted resampling from p(θ|y)
(Smith and Gelfand, 1992; Marshall and Spiegelhalter, 2003). Samples θ(r) from p(θ|y) can
be converted (approximately) to samples from $p(\theta | y_{[i]})$ by resampling the θ(r) with weights
$$w_i^{(r)} = G(y_i | \theta^{(r)}) \Big/ \sum_{r=1}^{R} G(y_i | \theta^{(r)}),$$
where
$$G(y_i | \theta^{(r)}) = 1/p(y_i | \theta^{(r)})$$
is the inverse likelihood of case i at iteration r. Using the resulting re-sampled values $\theta^{(r)}$,
corresponding predictions $y_{i,rep}$ can be obtained which are a sample from p( yi | y[i] ).
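The resampling step can be sketched in R for a single case. The setting is an assumption chosen so that posterior draws are available in closed form (a normal mean with unit variance and a flat prior); variable names such as theta_loo and yrep_i are illustrative only.

```r
# Sketch: approximate leave-one-out replicates by weighted resampling.
set.seed(3)
y     <- c(0.1, -0.6, 0.9, 0.3, 4.0)
R     <- 5000
theta <- rnorm(R, mean(y), 1 / sqrt(length(y)))   # draws from p(theta | y)

i <- 5                                   # case treated as omitted
G <- 1 / dnorm(y[i], theta, 1)           # inverse likelihoods G(y_i | theta^(r))
w <- G / sum(G)                          # resampling weights w_i^(r)

# Resample theta with these weights: approximate draws from p(theta | y_[i])
theta_loo <- sample(theta, size = R, replace = TRUE, prob = w)

yrep_i <- rnorm(R, theta_loo, 1)         # approximate draws from p(y_i | y_[i])
p_i    <- mean(yrep_i <= y[i])           # estimate of Pr(y_{i,rep} <= y_i | y_[i])
```

Since case 5 is much larger than the rest, the resampled values of theta are shifted downwards relative to the full-data posterior, and p_i is close to 1.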
Samples of replicate data $y_{rep}$ from the posterior predictive density
$$p(y_{rep} | y) = \int p(y_{rep} | \theta) p(\theta | y) d\theta,$$
may be taken, and checks made against the data, for example, whether the actual observations
y are within 95% credible intervals of yrep. Formally, such samples are obtained
by the method of composition (Chib, 2008), whereby if θ(r) is a draw from p(θ|y), then $y_{rep}^{(r)}$
drawn from $p(y_{rep} | \theta^{(r)})$ is a draw from $p(y_{rep} | y)$. In a satisfactory model, namely one that
adequately reproduces the data being modelled, predictive concordance (accurate reproduction
of the actual data by replicate data) is at least 95% (Gelfand, 1996, p.158).
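The method of composition and the concordance tally can be sketched in R. The normal model with known unit variance, the simulated data, and the closed-form posterior draws are assumptions made so that the sketch is self-contained.

```r
# Sketch: method of composition and predictive concordance.
set.seed(4)
n <- 40
y <- rnorm(n, 1, 1)
R <- 4000
theta <- rnorm(R, mean(y), 1 / sqrt(n))      # stand-in for draws from p(theta | y)

# One replicate data set per posterior draw: row r is y_rep^(r) from p(y_rep | theta^(r))
yrep <- t(sapply(theta, function(th) rnorm(n, th, 1)))   # R x n

# 95% predictive interval for each case, and the share of observations inside it
lims   <- apply(yrep, 2, quantile, probs = c(0.025, 0.975))
inside <- y >= lims[1, ] & y <= lims[2, ]
concordance <- mean(inside)
```

A concordance well below 0.95 would suggest that the model is not reproducing the data adequately.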
Other comparisons of actual and predicted data can be made, for example by a chi-square
comparison (Gosoniu et al., 2006). Johnson (2004) proposes a Bayesian chi-square approach
based on partitioning the cumulative distribution into K bins, usually of equal probability.
Thus, one chooses quantiles $0 = a_0 < a_1 < \cdots < a_K = 1$, with bin probabilities
$$p_k = a_k - a_{k-1}, \quad k = 1, \ldots, K.$$
Then using model means μi for subject $i \in (1, \ldots, n)$ one obtains the implied cumulative
density qi, say $a_{k^*-1} < q_i < a_{k^*}$, and allocates the fitted point to a bin randomly chosen from
bins $1, \ldots, k^*$.
For example, suppose there are K = 5 equally probable intervals, with pk = 0.2. If μi = 1.4,
the probability assigned to an observation yi = 1 by the cumulative density function falls
in the interval (0.247,0.592), which straddles bins 2 and 3. To allocate a bin, a U(0.247,0.592)
variable is sampled, and the predicted bin is 2 or 3, according to whether the sampled
uniform variable falls within (0.247,0.4), or (0.4, 0.592). The totals so obtained accumulating
over all subjects define predicted counts $m_k(\theta)$ which are compared (at each MCMC iteration)
to actual counts npk, as in formula (3) in Johnson (2004). This provides the Bayesian
chi-square criterion
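$$R^B(\theta) = \sum_{k=1}^{K} \frac{[m_k(\theta) - n p_k]^2}{n p_k},$$
which has an asymptotic $\chi^2_{K-1}$ distribution whatever the parameter dimension of the model
being fitted [2]. One can assess the posterior probability that $R^B(\theta)$ exceeds the 95th percentile
of the $\chi^2_{K-1}$ density. Poor fit will show in probabilities considerably exceeding 0.05.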
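The bin allocation and the chi-square criterion can be sketched in R for count data. The sketch assumes a Poisson sampling density, which reproduces the interval (0.247, 0.592) quoted for a mean of 1.4 and an observation of 1. The function rb_poisson and the simulated counts are illustrative, and the mean is held fixed here, whereas in an application it would change at each MCMC iteration.

```r
# Sketch: one evaluation of the Bayesian chi-square statistic for Poisson counts.
rb_poisson <- function(y, mu, K = 5) {
  a   <- seq(0, 1, length.out = K + 1)       # quantiles a_0, ..., a_K
  lo  <- ppois(y - 1, mu)                    # cumulative probability just below y_i
  hi  <- ppois(y, mu)                        # cumulative probability at y_i
  u   <- runif(length(y), lo, hi)            # randomised value within (lo, hi)
  bin <- findInterval(u, a, rightmost.closed = TRUE)
  m   <- tabulate(bin, nbins = K)            # bin counts m_k(theta)
  np  <- length(y) * diff(a)                 # n * p_k
  sum((m - np)^2 / np)
}

# Interval implied by y = 1 and mean 1.4
interval <- round(c(ppois(0, 1.4), ppois(1, 1.4)), 3)

set.seed(5)
y  <- rpois(200, 1.4)
rb <- replicate(1000, rb_poisson(y, mu = 1.4))
p_exceed <- mean(rb > qchisq(0.95, df = 4))   # share exceeding the 95th percentile
```

In a full analysis, rb_poisson would be called once per iteration with the sampled means, and p_exceed would then estimate the posterior probability described above.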
Analogues of classical significance tests are obtained using the posterior predictive
p-value (Kato and Hoijtink, 2004). This was originally defined (Rubin, 1984; Meng, 1994) as
the probability that a test statistic T ( yrep ) of future observations yrep is larger than or equal
to the observed value of T(y), given the adopted model M, the response data y, and any
ancillary data x,
$$p_{post} = \Pr[T(y_{rep}) \ge T(y) | y, M, x],$$
where x would typically be predictors measured without error. The probability is calcu-
lated over the posterior predictive distribution of yrep conditional on M and x. By contrast,
the classical p-test integrates over y, as in
The formulation of Meng (1994) is extended by Gelman et al. (1996) to apply to discrepancy
criteria D(y,θ) based on data and parameters, as well as to observation-based functions
T(y). So the posterior predictive check is
$$p_{post} = \Pr[D(y_{rep}, x, \theta) \ge D(y, x, \theta) | y, M, x],$$
where the probability is taken over the joint posterior distribution of yrep and θ given M
and x. In estimating the corresponding ppost, the discrepancy is calculated at each MCMC
iteration. This is done both for the observations, giving a value $D(y, x, \theta^{(r)})$, and for the
replicate data $y_{rep}^{(r)}$, sampled from $p(y_{rep} | \theta^{(r)}, x)$, resulting in a value $D(y_{rep}^{(r)}, x, \theta^{(r)})$ for each
sampled parameter $\theta^{(r)}$. The proportion of samples where $D(y_{rep}^{(r)}, x, \theta^{(r)})$ exceeds $D(y, x, \theta^{(r)})$
is then the Monte Carlo estimate of ppost. For example, Kato and Hoijtink (2004) show good
performance of ppost using both statistics T and discrepancies D in a normal multilevel
model context with subjects $i = 1, \ldots, m_j$ in clusters $j = 1, \ldots, J$
where $u_{ij} \sim N(0, \sigma_j^2)$. The hypotheses considered (i.e. in the form of reduced models) are
$b_{kj} = b_k$ and $\sigma_j^2 = \sigma^2$.
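The Monte Carlo estimate of ppost can be sketched in R. The example is an assumption made for illustration, not the multilevel setting of Kato and Hoijtink: heavy-tailed data are fitted by a single normal model under a reference prior, for which exact posterior draws are available, and the discrepancy D is taken to be the largest absolute standardised residual.

```r
# Sketch: posterior predictive p-value for a parameter-dependent discrepancy.
set.seed(6)
n <- 50
y <- rt(n, df = 3)                         # heavy-tailed data, fitted as normal
R <- 4000

# Posterior draws for N(mu, sigma^2) under the prior p(mu, sigma^2) proportional to 1/sigma^2
sigma2 <- (n - 1) * var(y) / rchisq(R, n - 1)
mu     <- rnorm(R, mean(y), sqrt(sigma2 / n))

# Discrepancy D(y, theta): largest absolute standardised residual
D <- function(y, mu, sigma) max(abs(y - mu) / sigma)

exceed <- logical(R)
for (r in seq_len(R)) {
  s    <- sqrt(sigma2[r])
  yrep <- rnorm(n, mu[r], s)               # replicate data from p(y_rep | theta^(r))
  exceed[r] <- D(yrep, mu[r], s) >= D(y, mu[r], s)
}
p_post <- mean(exceed)                     # Monte Carlo estimate of p_post
```

Values of p_post near 0 or 1 indicate that the observed discrepancy is unusual relative to what the model generates.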
Posterior predictive checks may be used to assess model assumptions. For instance, in
multilevel and general linear mixed models, assumptions of normality regarding ran-
dom effects are often made by default, and a posterior check against such assumptions
is sensible. A number of classical tests have been proposed such as the Shapiro–Wilk
W statistic (Royston, 1993) and the Jarque–Bera test (Bera and Jarque, 1980). These sta-
tistics can be derived at each iteration for actual and replicate data, and the comparison
$D(y_{rep}, x, \theta) \ge D(y, x, \theta)$ applied over MCMC iterations to provide a posterior predictive
p-value.
Under a mixed predictive check, the observed statistic Tobs is compared to the reference distribution
$$p(T | y) = \int p(T | b) p_M(b | y) db,$$
where $p_M(b | y) = \int p(b | \theta) p(\theta | y) d\theta$ may be termed the ‘predictive prior’ for b (Marshall
and Spiegelhalter, 2007, p.413). This contrasts with more conservative posterior predictive
checks based on replicate sampling from $p(y_{i,rep} | b_i, \theta)$, under which Tobs is compared to the
reference distribution
$$p(T | y) = \int p(T | b) p(b | y) db.$$
Marshall and Spiegelhalter (2003) confirm that a mixed predictive procedure reduces
the conservatism of posterior predictive checks in relatively simple random effects mod-
els, and is more effective in reproducing p( yi | y[i] ) than weighted importance sampling.
However, this procedure may be influenced by the informativeness of the priors on the
hyperparameters θ, and also by the presence of multiple random effects.
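The difference between the two kinds of replicate can be sketched in R for a normal random effects model. All quantities below (mu, tau, sigma, b, y_obs) are hypothetical stand-ins for MCMC output rather than results of an actual fit. The point is only to show where the replicate random effects enter.

```r
# Sketch: posterior predictive versus mixed predictive replicates.
set.seed(7)
J <- 8; R <- 3000
# Hypothetical stand-ins for MCMC output of y_j ~ N(b_j, sigma^2), b_j ~ N(mu, tau^2)
mu    <- rnorm(R, 0, 0.2)
tau   <- abs(rnorm(R, 1, 0.1))
sigma <- 0.5
b     <- matrix(rnorm(R * J, rep(seq(-1.5, 2, length.out = J), each = R), 0.3), R, J)
y_obs <- c(-1.4, -1.1, -0.4, 0.1, 0.4, 1.1, 1.4, 3.2)

# Posterior predictive replicates: condition on the sampled b_j
yrep_post <- matrix(rnorm(R * J, b, sigma), R, J)

# Mixed predictive replicates: first draw new b_j from the predictive prior
b_rep    <- matrix(rnorm(R * J, rep(mu, J), rep(tau, J)), R, J)
yrep_mix <- matrix(rnorm(R * J, b_rep, sigma), R, J)

# Upper tail probabilities Pr(y_rep >= y_obs) for each cluster
p_post <- colMeans(sweep(yrep_post, 2, y_obs) >= 0)
p_mix  <- colMeans(sweep(yrep_mix,  2, y_obs) >= 0)
```

Because the mixed replicates do not condition on the sampled b_j, which are themselves pulled towards the observed data, they are less influenced by the observation being checked.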
Marshall and Spiegelhalter (2007) also consider full cross-validatory mixed predictive
checks to assess conflict in evidence regarding random effects b between the likelihood
and the second stage prior; see also Bayarri and Castellanos (2007). Consider nested data
$\{y_{ij}, i = 1, \ldots, n_j; j = 1, \ldots, J\}$ with likelihood
$$y_{ij} \sim N(b_j, \sigma^2),$$
$$b_j \sim N(\mu, \tau^2).$$
Under a cross-validatory approach, the discrepancy measure $T_j^{obs}$ for cluster j would be
based on the remaining data y[j] with cluster j excluded, and its reference distribution is
then
$$p(T_j^{rep} | y_{[j]}) = \int p(T_j^{rep} | b_j, \sigma^2) p(b_j | \mu, \tau^2) p(\sigma^2, \tau^2, \mu | y_{[j]}) db_j d\sigma^2 d\tau^2 d\mu.$$
Marshall and Spiegelhalter (2007) also propose a conflict p-test based on comparing a predictive
prior replicate $b_{j,rep} | y_{[j]}$ with a fixed effect estimate or “likelihood replicate” bj,fix for
bj based only on the data. The latter is obtained using a highly diffuse fixed effects prior
on the bj, rather than a borrowing strength hierarchical prior, for example, $b_j \sim Be(1, 1)$ or
$b_j \sim Be(0.5, 0.5)$. Defining
This can be compared to a mixed predictive p-value, based on sampling yj,rep from a
cross-validatory model using only the remaining cases y[j] to estimate parameters, and
then comparing yj,rep, or some function $T_j^{rep} = T(y_{j,rep})$, with yj,obs or with $T_j^{obs} = T(y_{j,obs})$.
Thus, depending on the substantive application, one may define lower or upper tail mixed
p-values
$$p_{j,mix} = \Pr(T_j^{rep} \le T_j^{obs} | y_{[j]}),$$
or
$$p_{j,mix} = \Pr(T_j^{rep} \ge T_j^{obs} | y_{[j]}),$$
with the latter being relevant in (say) assessing outliers in hospital mortality comparisons.
If T(y) = y, and y is a count, then a mid p-value is relevant instead, with the upper tail test
being
$$p_{j,mix} = \Pr(y_{j,rep} > y_{j,obs} | y_{[j]}) + 0.5 \Pr(y_{j,rep} = y_{j,obs} | y_{[j]}).$$
Li et al. (2016, 2017) combine the principle of mixed predictive tests with that of importance
sampling, with the intention of further correcting for optimism present in standard
posterior predictive tests. Consider a particular MCMC iteration t. Sub-samples of random
effects are obtained conditional on hyperparameters θ(t) and the random effects b(t). One set
of sub-samples $b^A_{j,s,rep}$ (for observations j and sub-samples $s = 1, \ldots, S$) is obtained, along
with the corresponding $y_{j,s,rep}$ conditional on $b^A_{j,s,rep}$. One then obtains the corresponding
$p_{j,s,mix}$ as per (3.8), assuming the data are binomial or Poisson. Integrated importance
weights for the pj,mix are based on an independent set of replicate random effects, say $b^B_{j,s,rep}$.
Equations (38) to (40) in Li et al. (2017) set out the procedure more completely. Integrated
importance weights are obtained as
$$W_j^{(t)} = 1 \Big/ \left( \sum_{s=1}^{S} p(y_j | \theta^{(t)}, b^B_{j,s,rep}) / S \right),$$
and can be used to provide WAIC estimates (denoted iWAIC); see equations (26)–(27) in Li
et al. (2016).
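The integrated importance weight for one observation at one iteration can be sketched in R. The Poisson model with a log-normal random effect, the expected count E_j, and the hyperparameter values are assumptions introduced for illustration. The code in [3] and Li et al. (2017) should be consulted for the full procedure.

```r
# Sketch: integrated importance weight at a single MCMC iteration t.
set.seed(9)
S     <- 10        # number of sub-samples
y_j   <- 4         # observed count for unit j (hypothetical)
E_j   <- 2.5       # expected count for unit j (hypothetical)
mu_t  <- 0.1       # stand-in value for the random effect mean at iteration t
tau_t <- 0.4       # stand-in value for the random effect standard deviation

b_B <- rnorm(S, mu_t, tau_t)               # replicate random effects b^B_{j,s,rep}
lik <- dpois(y_j, E_j * exp(b_B))          # p(y_j | theta^(t), b^B_{j,s,rep})
W_j <- 1 / mean(lik)                       # integrated importance weight
```

The weight is the inverse of the likelihood averaged over replicate random effects, so it plays the role of the inverse likelihood in the simpler importance weights, but with the random effect integrated out.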
Note that replicating this calculation in jagsUI or R2OpenBUGS needs to account for the
fact that the step function is a greater than or equals calculation, that is, step(x) returns 1 when x ≥ 0.
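The effect of the greater than or equals convention on a mid p-value for counts can be sketched in R. The helper names step_fn and mid_p_upper and the Poisson replicates are hypothetical. step_fn is written to mimic the convention described above rather than to call any BUGS or JAGS routine.

```r
# Sketch: upper-tail mid p-value for a count, from replicate draws.
step_fn <- function(x) as.numeric(x >= 0)   # mimics step(): 1 when x >= 0

mid_p_upper <- function(yrep, yobs) {
  mean(yrep > yobs) + 0.5 * mean(yrep == yobs)
}

# Equivalent form built from a step-type indicator: subtract half the ties
mid_p_step <- function(yrep, yobs) {
  mean(step_fn(yrep - yobs)) - 0.5 * mean(yrep == yobs)
}

set.seed(8)
yrep  <- rpois(10000, lambda = 3)           # hypothetical replicates
p_mid <- mid_p_upper(yrep, yobs = 7)
```

Since the step indicator already counts ties in full, half of the tied iterations must be subtracted to recover the mid p-value. Using the step indicator alone would overstate the tail probability for discrete data.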
We also include posterior predictive checks (Section 3.5.2) based on comparing devi-
ances for actual data and replicate data. Replicate data can be drawn from the model in
an unmodified form (which may provide conservative posterior checks), or with repli-
cates obtained using the mixed sampling approach.
With estimation using jagsUI, the mixed p-tests pi,mix and log(CPO) statistics are found
to imply similar inferences regarding less well-fitted cases. The lowest pi,mix is for slide
4, which also has the second lowest log(CPO), while the second highest pi,mix is for slide
10, which has the lowest log(CPO). Regarding the posterior predictive checks, as in (3.7),
taking replicates from the original model leads to a relatively low probability of 0.10,
while using mixed replicates provides a probability of 0.08. Both these indicate possible
model failure.
For this small sample, it is relatively straightforward to carry out a full (leave one out)
cross-validation based on omitting each observation in turn. This shows slides 4, 15,
and 20 as underpredicted (low probabilities that replicates exceed actual), and slides 10
and 17 as overpredicted.
Integrated importance cross-validation probabilities based on S = 10 subsamples are
also obtained; see the code used in [3]. These are very similar to the full cross-validation
probabilities (see Table 3.1, which highlights slides with full cross-validation probabili-
ties over 0.95 or under 0.05). We also use subsampling to obtain estimated iWAIC, fol-
lowing the notation of Li et al. (2016). Thus, the total iWAIC is 121.9, as compared to a
LOO-IC of 121.6 and a WAIC of 119.8. Casewise iWAIC confirm the poor fit to slides
4 and 10. For this example, log(CPO) and casewise iWAIC statistics correlate closely,
with a correlation of 0.9976.
TABLE 3.1
Seeds Data. Comparing Cross-Validation Probabilities, log(CPO), and Casewise iWAIC

Plate  Mixed             IIS               Full              log(CPO)  Casewise
       Cross-Validation  Cross-Validation  Cross-Validation            iWAIC
 1     0.883             0.925             0.922             −3.15     −3.19
 2     0.473             0.460             0.454             −2.58     −2.51
 3     0.894             0.949             0.946             −3.97     −3.98
 4     0.050             0.015             0.013             −4.86     −4.88
 5     0.226             0.178             0.176             −2.66     −2.65
 6     0.250             0.240             0.242             −1.28     −1.29
 7     0.312             0.266             0.258             −2.70     −2.76
 8     0.117             0.058             0.053             −3.62     −3.75
 9     0.735             0.808             0.811             −2.72     −2.75
10     0.925             0.980             0.978             −5.03     −4.79
11     0.271             0.267             0.267             −1.64     −1.66
12     0.263             0.188             0.194             −2.11     −2.15
13     0.726             0.769             0.772             −2.32     −2.37
14     0.856             0.899             0.902             −2.87     −2.84
15     0.128             0.028             0.034             −4.08     −4.12
16     0.934             0.936             0.937             −2.08     −2.06
17     0.959             0.973             0.974             −3.52     −3.48
18     0.442             0.451             0.456             −2.28     −2.35
19     0.606             0.625             0.630             −2.13     −2.18
20     0.130             0.047             0.049             −3.70     −3.79
21     0.681             0.692             0.692             −1.42     −1.41

In an attempt to improve fit, we replace the single intercept by a three-group discrete
mixture intercept. Thus
$$y_i \sim \mathrm{Bin}(n_i, p_i),$$
$$G_i \sim \mathrm{Categorical}(\phi[1:3]),$$
$$\phi \sim \mathrm{Dirichlet}(5, 5, 5),$$
$$b_i \sim N(0, \sigma_b^2).$$
The posterior predictive checks, whether or not based on mixed replicates, are now
satisfactory, both around 0.48. There are now no casewise predictive exceedance prob-
abilities exceeding 0.95 or under 0.05. The LOO-IC and WAIC now stand at 116.6 and
110.4 respectively.
3.6 Computational Notes
require(jagsUI)
# generate data
set.seed(12345)
mu = 0
tau2 = 0.5
sigma2 = 1
# number of observations
n = 20
theta = rnorm(n, mu, sqrt(tau2))
y = rnorm(n, theta, sqrt(sigma2))
# define w according to length, T=30, of bridge-sampling schedule
T=30
T1=T+1
D= list(T=T,T1=T1, w=matrix(1,n,T1),n=n,y=y, path.pow=4)
# Model 1, mu=0
cat("model {for (h in 1:n) {for (s in 1:T1) {
L.tem[h,s] <- pow(L[h,s],q[s])
w[h,s] ~dunif(a1[h,s],b1[h,s])
a1[h,s] <- -1/L.tem[h,s]
b1[h,s] <- 1/L.tem[h,s]
LL[h,s] <- log(L[h,s])
# log-likelihood
log(L[h,s]) <- 0.5*log(phi[s]/(1+phi[s]))-0.919-0.5*phi[s]/(1+phi[s])*y[h]*y[h]}}
# precision parameters
for (s in 1:T1) {phi[s] ~dgamma(1,1)}
phi.est <- phi[T1]
# path sampling calculations
for (s in 1:T1) {q[s] <- pow(a[s],path.pow)
expLL[s] <- sum(LL[1:n,s])}
a[1] <- 0.00001
for (s in 1:T) {a[s+1] <- s/T
mc[s] <- (q[s+1]-q[s])*(expLL[s+1]+expLL[s])*0.5}
logML <- sum(mc[])}
", file="model1.jag")
inits1 = list(phi=rep(1,T1))
inits2 = list(phi=rep(2,T1))
inits=list(inits1,inits2)
pars = c("logML","phi.est")
R1 = autojags(D, inits, pars, model.file="model1.jag", n.chains=2, iter.increment=1000,
n.burnin=100, Rhat.limit=1.025, max.iter=5000, seed=1234)
R1$summary
# Model 2, mu unknown
cat("model {for (h in 1:n) {for (s in 1:T1) {
L.tem[h,s] <- pow(L[h,s],q[s])
w[h,s] ~dunif(a1[h,s],b1[h,s])
a1[h,s] <- -1/L.tem[h,s]
b1[h,s] <- 1/L.tem[h,s]
LL[h,s] <- log(L[h,s])
# log-likelihood
log(L[h,s]) <- 0.5*log(phi[s]/(1+phi[s]))-0.919-0.5*phi[s]/(1+phi[s])*(y[h]-mu[s])*(y[h]-mu[s])}}
# mean and precision parameters
for (s in 1:T1) {phi[s] ~dgamma(1,1)
mu[s] ~dnorm(0,1)}
phi.est <- phi[T1]
mu.est <- mu[T1]
# path sampling calculations
for (s in 1:T1) {q[s] <- pow(a[s],path.pow)
expLL[s] <- sum(LL[1:n,s])}
a[1] <- 0.00001
for (s in 1:T) {a[s+1] <- s/T
mc[s] <- (q[s+1]-q[s])*(expLL[s+1]+expLL[s])*0.5}
logML <- sum(mc[])}
", file="model2.jag")
inits1 = list(phi=rep(1,T1),mu=rep(0,T1))
inits2 = list(phi=rep(2,T1),mu=rep(0,T1))
inits=list(inits1,inits2)
pars = c("logML","phi.est","mu.est")
R2 = autojags(D, inits, pars, model.file="model2.jag", n.chains=2, iter.increment=1000,
n.burnin=100, Rhat.limit=1.025, max.iter=5000, seed=1234)
R2$summary
# Marginal Likelihoods and Bayes Factor
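# posterior mean of logML under each model (logML is the first monitored parameter)
ML = c(R1$summary["logML","mean"], R2$summary["logML","mean"])
# Bayes factor in favour of model 1 over model 2
BF12 = exp(ML[1]-ML[2])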
[2] The Bayesian chi-square method is illustrated using model 5 for the Scottish lip
cancer incidence, as considered in Johnson (2004, pp.2374–2376). Thus, with Ei
denoting expected incidence counts,
$$y_i \sim Po(E_i \exp(\rho_i)),$$
where the ρi are modelled as diffuse fixed effects. The BUGS code is as follows:
From iterations 5,000–100,000 of a single chain run, the probability that RB exceeds the
95% point of a $\chi^2_4$ density is 0.157, and the posterior means of the number (mhat[] in the
code) of the n = 56 counts assigned to the five bins are (8.6,9.9,10.9,12.1,14.5).
[3] The code used for the IIS cross-validation probability estimates (seeds data) is
log(L.new[i,s])<- logfact(n[i])-logfact(y[i])-logfact(n[i]-y[i])
+y[i]*log(p.new.2[i,s])+(n[i]-y[i])*log(1-p.new.2[i,s])}}
# priors
for (j in 1:P) {beta[j] ~dnorm(0.0,1.0E-6)}
tau ~dgamma(1,0.001)}
References
Albert J (1999) Criticism of a hierarchical model using Bayes factors. Statistics in Medicine, 18, 287–305.
Alqallaf F, Gustafson P (2001) On cross-validation of Bayesian models. Canadian Journal of Statistics,
29, 333–340.
Akaike H (1973) Information theory and an extension of the maximum likelihood principle, in The
Second International Symposium on Information Theory, eds B Petrov, F Csaki. Akademiai Kiado,
Budapest.
Ando T (2007) Bayesian predictive information criterion for the evaluation of hierarchical Bayesian
and empirical Bayes models. Biometrika, 94, 443–458.
Aslanidou H, Dey D, Sinha D (1998) Bayesian analysis of multivariate survival data using Monte
Carlo methods. Canadian Journal of Statistics, 26, 33–48.
Barry R (2006) An alternative to the ‘ones’ trick? BUGS Archive, 09/11/2006.
https://www.jiscmail.ac.uk/cgi-bin/webadmin?A1=ind06&L=BUGS#13
Bartlett M (1957) A comment on D.V. Lindley’s statistical paradox. Biometrika, 44, 533–534.
Bayarri M, Berger J (1999) Quantifying surprise in the data and model verification, pp 53–82, in
Bayesian Statistics 6, eds J Bernardo, J Berger, A Dawid, A Smith. Oxford University Press,
London, UK.
Bayarri M, Berger J (2000) P-values for composite null models. Journal of the American Statistical
Association, 95, 1127–1142.
Bayarri M, Castellanos M (2007) Bayesian checking of the second levels of hierarchical models.
Statistical Science, 22, 363–367.
Bera A, Jarque C (1980) Efficient tests for normality, homoscedasticity and serial independence of
regression residuals. Economics Letters, 6, 255–259.
Berkhof J, van Mechelen I, Hoijtink H (2000) Posterior predictive checks: Principles and discussion.
Computational Statistics, 3, 337–354.
Bernardo J, Smith A (1994) Bayesian Theory. Wiley.
Besag J, York J, Mollié A (1991) Bayesian image restoration, with two applications in spatial statistics.
Annals of the Institute of Statistical Mathematics, 43(1), 1–20.
Bos C (2002) A comparison of marginal likelihood computation methods, pp 111–117, in COMPSTAT
2002: Proceedings in Computational Statistics, eds W Härdle, B Ronz. Springer, Berlin.
Brown H, Prescott R (1999) Applied Mixed Models in Medicine. John Wiley & Sons.
Buck C, Sahu S (2000) Bayesian models for relative archaeological chronology building. Applied
Statistics, 49, 423–444.
Buenconsejo J, Fish D, Childs J, Holford T (2008) A Bayesian hierarchical model for the estimation of
two incomplete surveillance data sets. Statistics in Medicine, 27, 3269–3285.
Burnham K, Anderson D (2002) Model Selection and Multimodel Inference: A Practical Information-
Theoretic Approach, 2nd Edition. Springer-Verlag, New York.
Cai B, Dunson D (2006) Bayesian covariance selection in generalized linear mixed models. Biometrics,
62, 446–457.
Cai B, Dunson D (2008) Bayesian variable selection in generalized linear mixed models, in Random
Effect and Latent Variable Model Selection, ed D Dunson. Springer.
Carlin B, Louis T (2000) Bayes and Empirical Bayes Methods for Data Analysis, 2nd Edition. Chapman
and Hall, London, UK.
Carvalho C, Polson N, Scott J (2009) Handling sparsity via the horseshoe. Proceedings of Machine
Learning Research, 5, 73–80.
Celeux G, Forbes F, Robert C, Titterington M (2006) Deviance information criteria for missing data
models. Bayesian Analysis, 1, 651–674.
Chaloner K, Brant R (1988) A Bayesian approach to outlier detection and residual analysis.
Biometrika, 75, 651–660.
Chen M-H (2005) Computing marginal likelihoods from a single MCMC output. Statistica Neerlandica,
59, 16–29.
Chen Z, Dunson D (2003) Random effects selection in linear mixed models. Biometrics, 59, 762–769.
Chib S (1995) Marginal likelihood from the Gibbs output. Journal of the American Statistical Association,
90(432), 1313–1321.
Chib S (2008) Panel data modeling and inference: A Bayesian primer, pp 479–515, in The Econometrics
of Panel Data, 3rd Edition, eds L Matyas, P Sevestre. Springer-Verlag, Berlin, Germany.
Chib S, Jeliazkov I (2001) Marginal likelihood from the Metropolis–Hastings output. Journal of the
American Statistical Association, 96(453), 270–281.
Clayton D (1996) Generalized linear mixed models, in Markov Chain Monte Carlo in Practice, eds W
Gilks, S Richardson, D Spiegelhalter. Chapman & Hall, London, UK.
Conn P, Johnson D, Williams P, Melin S, Hooten M (2018) A guide to Bayesian model checking for
ecologists. Ecological Monographs, 88(4), 526–542.
Crowder MJ (1978) Beta-binomial ANOVA for proportions. Applied Statistics, 27, 34–37.
de la Horra J, Rodríguez-Bernal M (2005) Bayesian model selection: A predictive approach with
losses based on distances. Statistics & Probability Letters, 71, 257–265.
Drton M, Plummer M (2017) A Bayesian information criterion for singular models. Journal of the Royal
Statistical Society: Series B (Statistical Methodology), 79(2), 323–380.
Drummond A, Bouckaert R (2015) Bayesian Evolutionary Analysis with BEAST. Cambridge University
Press.
Fahrmeir L, Osuna L (2003) Structured count data regression. Sonderforschungsbereich, 386, Discussion
Paper 334, University of Munich.
Fernandez C, Ley E, Steel M (2001) Benchmark priors for Bayesian model averaging. Journal of
Econometrics, 100, 381–427.
Friel N, McKeone J, Oates C, Pettitt A (2017) Investigation of the widely applicable Bayesian informa-
tion criterion. Statistics and Computing, 27(3), 833–844.
Friel N, Pettitt A (2008) Marginal likelihood estimation via power posteriors. Journal of the Royal
Statistical Society: Series B, 70, 589–607.
Frühwirth-Schnatter S (1999) Bayes Factors and Model Selection for Random Effect Models. Working
Paper, Department of Statistics, University of Business Administration and Economics, Vienna.
Frühwirth-Schnatter S (2004) Estimating marginal likelihoods for mixture and Markov switching
models using bridge-sampling techniques. The Econometrics Journal, 7, 143–167.
Frühwirth-Schnatter S, Tüchler R (2008) Bayesian parsimonious covariance estimation for hierarchical
linear mixed models. Statistics & Computing, 18, 1–13.
Frühwirth-Schnatter S, Wagner H (2010) Stochastic model specification search for Gaussian and par-
tial non-Gaussian state space models. Journal of Econometrics, 154(1), 85–100.
Geisser S, Eddy W (1979) A predictive approach to model selection. Journal of the American Statistical
Association, 74, 153–160.
Gelfand A (1996) Model determination using sampling based methods, Chapter 9, in Markov Chain Monte
Carlo in Practice, eds W Gilks, S Richardson, D Spiegelhalter. Chapman & Hall/CRC, Boca Raton.
Gelfand A, Dey D (1994) Bayesian model choice: Asymptotics and exact calculations. Journal of the
Royal Statistical Society, Series B, 56, 501–514.
Gelfand A, Dey D, Chang H (1992) Model determination using predictive distributions with imple-
mentations via sampling-based methods, pp 147–168, in Bayesian Statistics 4, eds J Bernardo
et al. Oxford University Press.
Gelfand A, Ghosh S (1998) Model choice: A minimum posterior predictive loss approach. Biometrika,
85, 1–11.
Gelfand A, Vlachos P (2003) On the calibration of Bayesian model choice criteria. Journal of Statistical
Planning and Inference, 111, 223–234.
Gelman A, Carlin J, Stern H, Dunson D, Vehtari A, Rubin D (2014) Bayesian Data Analysis. CRC, Boca
Raton, FL.
Gelman A, Meng XL (1998) Simulating normalizing constants: From importance sampling to bridge
sampling to path sampling. Statistical Science, 13(2), 163–185.
Gelman A, Meng XL, Stern H (1996) Posterior predictive assessment of model fitness via realized
discrepancies. Statistica Sinica, 6, 733–807.
George E, McCulloch R (1993) Variable selection via Gibbs sampling. Journal of the American Statistical
Association, 88(423), 881–889.
George E, McCulloch R (1997) Approaches for Bayesian variable selection. Statistica Sinica, 7, 339–373.
Gneiting T, Raftery AE (2007) Strictly proper scoring rules, prediction, and estimation. Journal of the
American Statistical Association, 102(477), 359–378.
Gosoniu L, Vounatsou P, Sogoba N, Smith T (2006) Bayesian modelling of geostatistical malaria risk
data. Geospatial Health, 1, 127–139.
Green M, Medley G, Browne W (2009) A comparison of methods of posterior predictive assessment
in multilevel logistic regression using an example from veterinary medicine. Veterinary
Research, 40(4), 1–10.
Gronau Q, Sarafoglou A, Matzke D, Ly A, Boehm U, Marsman M (2017b) A tutorial on bridge sam-
pling. Journal of Mathematical Psychology, 81, 80–89.
Gronau Q, Singmann H, Wagenmakers E (2017a) Bridgesampling: An R package for estimating normalizing
constants. arXiv preprint arXiv:1710.08162
Gustafson P, Hossain S, Macnab Y (2006) Conservative prior distributions for variance parameters in
hierarchical models. Canadian Journal of Statistics, 34(3), 377–390.
Han C, Carlin B (2001) Markov chain Monte Carlo methods for computing Bayes factors: A compara-
tive review. Journal of the American Statistical Association, 96, 1122–1132.
Harun N, Cai B (2014) Bayesian random effects selection in mixed accelerated failure time model for
interval-censored data. Statistics in Medicine, 33(6), 971–984.
Ibrahim J, Chen M, Sinha D (2001) Criterion-based methods for Bayesian model assessment. Statistica
Sinica, 11, 419–443.
Jeffreys H (1961) The Theory of Probability, 3rd edn. Oxford, UK, Clarendon Press.
Johnson V (2004) A Bayesian χ2 test for goodness-of-fit. Annals of Statistics, 32, 2361–2384.
Johnson V, Rossell D (2012) Bayesian model selection in high-dimensional settings. Journal of the
American Statistical Association, 107(498), 649–660.
Kacker R, Forbes A, Kessel R, Sommer K-D (2008) Bayesian posterior predictive p-value of statistical
consistency in interlaboratory evaluations. Metrologia, 45, 512–523.
Kass R, Raftery A (1995) Bayes factors. Journal of the American Statistical Association, 90, 773–795.
Kato B, Hoijtink H (2004) Testing homogeneity in a random intercept model using asymptotic, pos-
terior predictive and plug-in p-values. Statistica Neerlandica, 58, 179–196.
Kelly D, Smith C (2011) Bayesian model checking, pp 39–50, in Bayesian Inference for Probabilistic Risk
Assessment. eds D Kelly, C Smith. Springer, London, UK.
Key J, Pericchi L, Smith A (1999) Bayesian model choice: What and why?, pp 343–370, in Bayesian
Statistics 6, eds J Bernardo, J Berger, A Dawid, A Smith. Oxford Science Publications, Oxford,
UK.
Kinney S, Dunson D (2008) Bayesian model uncertainty in mixed effects models, in Random Effect and
Latent Variable Model Selection, ed D Dunson. Springer.
Kuhn E, Lavielle M (2005) Maximum likelihood estimation in nonlinear mixed effects models.
Computational Statistics & Data Analysis, 49, 1020–1038.
Kuo L, Mallick B (1998) Variable selection for regression models. Sankhyā: The Indian Journal of
Statistics, Series B, 60(1), 65–81.
Laud P, Ibrahim J (1995) Predictive model selection. Journal of The Royal Statistical Society: Series B, 57,
247–262.
Lenk P, DeSarbo W (2000) Bayesian inference for finite mixture models of generalized linear models
with random effects. Psychometrika, 65, 475–496.
Li L, Qiu S, Zhang B, Feng C (2016) Approximating cross-validatory predictive evaluation in Bayesian
latent variable models with integrated IS and WAIC. Statistics and Computing, 26(4), 881–897.
Li L, Feng C, Qiu S (2017) Estimating cross-validatory predictive p-values with integrated impor-
tance sampling for disease mapping models. Statistics in Medicine, 36(14), 2220–2236.
Lopes HF, West M (2004) Bayesian model assessment in factor analysis. Statistica Sinica, 14(1), 41–68.
Lucy L (2018) Bayesian model checking: A comparison of tests. Astronomy & Astrophysics, 614, A25.
MacNab Y, Qiu Z, Gustafson P, Dean C, Ohlsson A, Lee S (2004) Hierarchical Bayes analysis of mul-
tilevel health services data: A Canadian neonatal mortality study. Health Services and Outcomes
Research Methodology, 5, 5–26.
Marshall C, Spiegelhalter D (2003) Approximate cross-validatory predictive checks in disease map-
ping models. Statistics in Medicine, 22, 1649–1660.
Marshall C, Spiegelhalter D (2007) Identifying outliers in Bayesian hierarchical models: A simula-
tion-based approach. Bayesian Analysis, 2, 1–33.
Meng X (1994) Posterior predictive p-values. The Annals of Statistics, 22, 1142–1160.
Meng XL, Wong WH (1996) Simulating ratios of normalizing constants via a simple identity: A theo-
retical exploration. Statistica Sinica, 6(4), 831–860.
Meyer M, Laud P (2002) Predictive variable selection in generalized linear models. Journal of the
American Statistical Association, 97, 859–871.
Mitchell TJ, Beauchamp JJ (1988) Bayesian variable selection in linear regression. Journal of the
American Statistical Association, 83(404), 1023–1032.
Müller S, Scealy J, Welsh A (2013) Model selection in linear mixed models. Statistical Science, 28(2),
135–167.
Myung J, Pitt M (2004) Model comparison methods, pp 351–366, in Methods in Enzymology, Vol. 383,
eds L Brand, M Johnson. Elsevier, Amsterdam.
Nandram B, Kim H (2002) Marginal likelihoods for a class of Bayesian generalized linear models.
Journal of Statistical Computation and Simulation, 73, 319–340.
Ohlssen D, Sharples L, Spiegelhalter D (2006) Flexible random-effects models using Bayesian
semi-parametric models: Applications to institutional comparisons. Statistics in Medicine, 26,
2088–2112.
Park JY, Johnson M, Lee Y-S (2015) Posterior predictive model checks for cognitive diagnostic
models. International Journal of Quantitative Research in Education, 2(3–4), 244–264.
Pettit L, Young K (1990) Measuring the effect of observations on Bayes factors. Biometrika, 77, 455–466.
Piironen J, Vehtari A (2017) Comparison of Bayesian predictive methods for model selection. Statistics
and Computing, 27(3), 711–735.
Plummer M (2008) Penalized loss functions for Bayesian model comparison. Biostatistics, 9, 523–539.
Pourahmadi M, Daniels M (2002) Dynamic conditionally linear mixed models for longitudinal data.
Biometrics, 58, 225–231.
Rockova V, Lesaffre E, Luime J, Löwenberg B (2012) Hierarchical Bayesian formulations for selecting
variables in regression models. Statistics in Medicine, 31(11–12), 1221–1237.
Rossell P (2018) Bayesian Model Selection and Averaging with mombf. https://cran.r-project.org/
web/packages/mombf/vignettes/mombf.pdf
Royston P (1993) A toolkit for testing for non-normality in complete and censored samples. The
Statistician, 42, 37–43.
Rubin DB (1984) Bayesianly justifiable and relevant frequency calculations for the applied statisti-
cian. The Annals of Statistics, 12(4), 1151–1172.
Sala-i-Martin X, Doppelhofer G, Miller RI (2004) Determinants of long-term growth: A Bayesian
averaging of classical estimates (BACE) approach. American Economic Review, 94(4), 813–835.
Saville B, Herring A (2009) Testing random effects in the linear mixed model using approximate
Bayes factors. Biometrics, 65, 369–376.
Saville B, Herring A, Kaufman J (2011) Assessing variance components in multilevel linear models
using approximate Bayes factors: A case-study of ethnic disparities in birth weight. Journal of
the Royal Statistical Society: Series A, 174(3), 785–804.
Schwarz G (1978) Estimating the dimension of a model. Annals of Statistics, 6, 461–464.
Silva R, Lopes H, Migon H (2006) The extended generalized inverse Gaussian distribution for log-
linear and stochastic volatility models. Brazilian Journal of Probability and Statistics, 20, 67–91.
Sinha D (1993) Semiparametric Bayesian analysis of multiple event time data. Journal of the American
Statistical Association, 88(423), 979–983.
Sinha D, Chen M-H, Ghosh S (1999) Bayesian analysis and model selection for interval-censored
survival data. Biometrics, 55, 585–590.
Sinharay S, Stern H (2003) Posterior predictive model checking in hierarchical models. Journal of
Statistical Planning and Inference, 111, 209–221.
Sinharay S, Stern H (2005) An empirical comparison of methods for computing Bayes factors in generalized
linear mixed models. Journal of Computational and Graphical Statistics, 14, 415–435.
Smith AF, Gelfand AE (1992) Bayesian statistics without tears: A sampling–resampling perspective.
The American Statistician, 46(2), 84–88.
Smith M, Kohn R (1996) Nonparametric regression using Bayesian variable selection. Journal of
Econometrics, 75(2), 317–343.
Smith M, Kohn R (2002) Parsimonious covariance matrix estimation for longitudinal data. Journal of
the American Statistical Association, 97(460), 1141–1153.
Spiegelhalter D (2006) Two brief topics on modelling with WinBUGS. Presented at ICEBUGS
Conference, Helsinki 2006.
Spiegelhalter D, Best N, Carlin B, van der Linde A (2002) Bayesian measures of model complexity
and fit. Journal of the Royal Statistical Society, Series B, 64, 583–639.
Stern H, Sinharay S (2005) Bayesian model checking and model diagnostics, pp 171–192, in Bayesian
Thinking: Modeling and Computation, Handbook of Statistics, Vol. 25, eds D Dey, C Rao. Elsevier,
Amsterdam, Netherlands.
Tierney L, Kadane J (1986) Accurate approximations for posterior moments and marginal densities.
Journal of the American Statistical Association, 81, 82–86.
Vannucci M (2000) Matlab code for Bayesian variable selection. ISBA Bulletin, 7(3), 1–3.
Vehtari A, Gelman A, Gabry J (2017) Practical Bayesian model evaluation using leave-one-out cross-
validation and WAIC. Statistics and Computing, 27(5), 1413–1432.
Vehtari A, Lampinen J (2002) Expected utility estimation via cross-validation, in Bayesian Statistics 7,
eds J Bernardo, M Bayarri, J Berger, A Dawid, D Heckerman, A Smith, M West. Clarendon Press.
Watanabe S (2010) Asymptotic equivalence of Bayes cross validation and widely applicable information
criterion in singular learning theory. Journal of Machine Learning Research, 11, 3571–3594.
Watanabe S (2013) A widely applicable Bayesian information criterion. Journal of Machine Learning
Research, 14, 867–897.
Weihs C, Plummer M (2016) Package sBIC. Computing the singular BIC for multiple models. https://
cran.r-project.org/web/packages/sBIC/sBIC.pdf
Weiss R (1994) Pediatric pain, predictive inference and sensitivity analysis. Evaluation Review, 18,
651–678.
Xie W, Lewis P, Fan Y, Kuo L, Chen M-H (2011) Improving marginal likelihood estimation for
Bayesian phylogenetic model selection. Systematic Biology, 60(2), 150–160.
Yang M (2012) Bayesian variable selection for logistic mixed model with nonparametric random
effects. Computational Statistics & Data Analysis, 56(9), 2663–2674.
Zellner A (1986) On assessing prior distributions and Bayesian regression analysis with g-prior dis-
tributions, pp 233–243, in Bayesian Inference and Decision Techniques: Essays in Honor of Bruno de
Finetti. North-Holland/Elsevier.
Zhu L, Gorman D, Horel S (2006) Hierarchical Bayesian spatial models for alcohol availability, drug
“hot spots” and violent crime. International Journal of Health Geographics, 5, 54.
4
Borrowing Strength via Hierarchical Estimation
4.1 Introduction
What is sometimes termed ensemble estimation, or borrowing strength, refers to infer-
ences for collections of similar (exchangeable) units i = 1, … , n (schools, health agencies,
etc.) using Bayesian hierarchical methods (Burr and Doss, 2005; Clark and Gelfand, 2006;
Rounder et al., 2013; Rhodes et al., 2016). Among possible examples are surgical outcome
rates (Kuhan et al., 2002; Bayman et al., 2013), drug development (Gupta, 2012), baseball
batting averages (Kruschke and Vanpaemel, 2015), health quality measures (Staggs and
Gajewski, 2017), or oviposition preference data (Fordyce et al., 2011). Fixed effects models for
such collections are problematic (Marshall and Spiegelhalter, 1998), whereas hierarchical
random effects approaches pool information across units to obtain more reliable estimates
for each unit, identify units with unusually high or low values, and enable comparisons
between units. Borrowing strength may need to be modified to account for, or accom-
modate, unusual observations (Baker and Jackson, 2016; Farrell et al., 2010). Rankings of
the units may often be required, or probabilities of significant difference between units or
against a threshold (Deely and Smith, 1998; Staggs and Gajewski, 2017).
Implementations for hierarchical methods in R include Bayesian applications, as in
bayesPref (Gompert and Fordyce, 2015), LearnBayes (Albert, 2015), bmeta (Ding and Baio,
2016), bamdit (Verde, 2018), meta4diag (Guo and Riebler, 2016), and frequentist applications,
such as metaplus (Beath, 2016) and metafor (Viechtbauer, 2010; Viechtbauer, 2017); see also
https://cran.r-project.org/web/views/MetaAnalysis.html. For semiparametric and discrete
mixture models, packages include DPpackage (Jara et al., 2011), bspmma (Burr, 2012),
bayesmix (Gruen and Plummer, 2015), and label.switching (Papastamoulis, 2016).
A prototypical Bayesian hierarchical model for interrelated units specifies an outcome
model (first stage likelihood) p( yi |bi , Φ ) , and a process model involving unobserved effects
bi, with density p(bi |Ψ) , conditional on hyperparameters Ψ. In a longitudinal linear regres-
sion, the Φ might be regression coefficients and the residual regression variance, while Ψ
could include the variance of unit random intercepts bi. Similarly, in a Poisson-gamma
mixture, the likelihood p( yi |bi ) conditions on latent gamma effects bi. At the second stage,
the gamma density p(bi |Ψ) for the bi conditions on gamma shape and scale parameters Ψ,
while prior densities for the gamma parameters form the third stage.
The procedures considered in this chapter are typically based on an exchangeability
principle: that units are similar enough to justify being modelled by a common density
and that the units are not configured in ways (e.g. over time or space) that imply
higher correlations between some units than others (Spiegelhalter et al., 2004; Lindley and
Smith, 1972, p.4). Structuring of units in space, time, or other forms of non-exchangeability
does not preclude borrowing strength, but a prior reflecting that structuring is required
(see Chapters 5, 6). Exchangeability means that there is no prior basis for supposing some
units have higher true effects than others, or that certain subgroups of units are more
similar between themselves than other subgroups (e.g. that mortality in hospitals i and
j is more similar than between hospitals i and k). For units of the same type and obser-
vations generated under similar conditions, exchangeability means all possible permuta-
tions of the sequence of units have the same probability: random variables { y1 , … , y n } are
exchangeable if their joint distribution P( y1 , … , y n ) is invariant under permutation of its
arguments, so that
$$P(y_1^*, \ldots, y_n^*) = P(y_1, \ldots, y_n)$$
$$p(b_i \mid y) = \int p(b_i \mid y, \Psi)\, p(\Psi \mid y)\, d\Psi,$$
leading to the estimate $\hat{p}(b_i \mid y) = \frac{1}{T}\sum_{t=1}^{T} p(b_i \mid y, \Psi^{(t)})$.
An alternative to direct simulation is to simulate the full posterior $p(\Psi, b \mid y)$ using MCMC
methods, by obtaining samples $\{b_i^{(t)}, \Psi^{(t)}\}$ from the full conditional posteriors $p(b_i \mid b_{[i]}, \Psi, y)$
and $p(\psi_q \mid \Psi_{[q]}, b, y)$. For example, often the first stage density $p(y \mid b)$ is in the full exponential
family, so that
$$p(y_i \mid b_i) = \exp\left(\frac{y_i b_i - B(b_i)}{A(\phi_i)} + C(y_i, \phi_i)\right) \qquad (4.1)$$
where ϕi is a scale parameter. Assuming a conjugate second stage prior, the conditional
posterior of each bi follows the same density. For example, assume (Frees, 2004; Das and
Dey, 2006; Das and Dey, 2007; Ferreira and Gamerman, 2000) that
$$p(b_i, \Psi \mid y) = k_2 \exp\left(\left[g_1(\Psi) + \frac{y_i}{A(\phi_i)}\right] b_i - B(b_i)\left[g_2(\Psi) + \frac{1}{A(\phi_i)}\right]\right). \qquad (4.3)$$
With proper log-concave priors $p(\Psi)$, the full conditionals $p(\psi_q \mid \Psi_{[q]}, b, y)$ are log-concave,
and can be sampled using methods such as those of Gilks and Wild (1992). By contrast,
if improper priors are assumed on hyperparameters $\{\psi_1, \ldots, \psi_Q\}$, then the full posterior
p(b , Ψ| y ) is not necessarily proper (George et al., 1993; George and Zhang, 2001; Browne
and Draper, 2006), and empirical convergence of the MCMC sequence {b(t ) , Ψ(t ) } may be
problematic even if the posterior is proper analytically. George and Zhang (2001) consider
posterior propriety results for the Poisson-gamma, the binomial-beta, and multinomial-
Dirichlet models in terms of conditions on the hyperparameter prior tail behaviour. For
the latter two hierarchical schemes, no improper prior can guarantee a proper posterior.
Similar convergence and identification issues apply to the general linear mixed model
formulation.
and
in the treated and control arms. Then the log odds ratios form the unit level response,
$$y_i = w_{iT} - w_{iC},$$
with variance
$$s_i^2 = \frac{1}{r_{iT}} + \frac{1}{N_{iT} - r_{iT}} + \frac{1}{r_{iC}} + \frac{1}{N_{iC} - r_{iC}},$$
(see Example 4.1). It is also possible to take yi as a log relative risk between treatment and
control groups, namely
$$y_i = \log\frac{r_{iT}}{N_{iT}} - \log\frac{r_{iC}}{N_{iC}},$$
with variance
$$\frac{1}{r_{iT}} + \frac{1}{r_{iC}} - \frac{1}{N_{iT}} - \frac{1}{N_{iC}}.$$
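As a minimal sketch of these summary measures, the following Python functions compute the log odds ratio and the log relative risk, with their approximate variances, from the counts in one trial. The function names and the trial counts are hypothetical, chosen only for illustration, and the variances are the usual large-sample approximations.

```python
import math

def log_odds_ratio(r_t, n_t, r_c, n_c):
    """Return (y, s2): log odds ratio and its approximate variance."""
    y = math.log(r_t / (n_t - r_t)) - math.log(r_c / (n_c - r_c))
    s2 = 1 / r_t + 1 / (n_t - r_t) + 1 / r_c + 1 / (n_c - r_c)
    return y, s2

def log_relative_risk(r_t, n_t, r_c, n_c):
    """Return (y, s2): log relative risk and its approximate variance."""
    y = math.log(r_t / n_t) - math.log(r_c / n_c)
    s2 = 1 / r_t + 1 / r_c - 1 / n_t - 1 / n_c
    return y, s2

# Hypothetical trial: 15 of 100 treated and 10 of 100 controls respond.
y_or, s2_or = log_odds_ratio(15, 100, 10, 100)
y_rr, s2_rr = log_relative_risk(15, 100, 10, 100)
```

Zero cells would need a continuity correction before these functions could be applied; that step is not shown here.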
Another option is to take the risk difference
$$y_i = \frac{r_{iT}}{N_{iT}} - \frac{r_{iC}}{N_{iC}},$$
as approximately normal with variance
and
$$b_i \sim N(\mu, \tau^2). \qquad (4.4.2)$$
Integrating out the $b_i$, the marginal likelihood for $y_i$ (Guolo and Varin, 2017) is then
$$y_i \mid \mu, \tau^2 \sim N(\mu, s_i^2 + \tau^2). \qquad (4.4.3)$$
Often the summary measures are unit or trial means and different observational variances
are associated with differing sample sizes $N_i$, so that $s_i^2 = \sigma^2/N_i$, where $\sigma^2$ is an additional
unknown. While clinical meta-analysis applications are common, a similar scenario
occurs in small area estimation from multiple surveys where the $s_i^2$ are sampling variances
obtained according to the survey design.
More complex situations can be fitted into this framework. For example, Abrams et al.
(2000) consider the effect of testing positive or negative in a screening test on subsequent
levels of anxiety; see also Abrams et al. (2005). Let xik be baseline anxiety in study i, with
k = 1 (tested positive) and k = 2 (tested negative), and with Ni1 and Ni2 subjects in different
arms. Let zik be follow-up anxiety according to screening result, and let dik = zik − xik denote
change in anxiety. Then the measure of interest is the contrast between anxiety growth
according to screening result, namely yi = di1 − di 2 , with variance
where
and ρ is a within-subject correlation taken constant across studies and arms. Studies may
not report all the relevant statistics: they may report the dik and their variances, or the sepa-
rate baseline and follow-up measures in each arm {xik , zik } and their variances. In either
case, meta-analysis requires a prior on ρ.
In (4.4), assume independent priors on the hyperparameters
$$\tau^2 \sim IG\left(\frac{\nu}{2}, \frac{\nu\lambda}{2}\right),$$
$$\mu \sim N(m_\mu, V_\mu),$$
where $\nu$, $\lambda$, $m_\mu$, $V_\mu$ are assumed known. The full posterior conditional for $b_i$ is then (Browne
and Draper, 2006; George et al., 1993; Silliman, 1997, p.927)
where
$$D_i = \left(\frac{1}{s_i^2} + \frac{1}{\tau^2}\right)^{-1} = \frac{\tau^2 s_i^2}{\tau^2 + s_i^2},$$
$$w_i = \frac{s_i^2}{s_i^2 + \tau^2},$$
and the first equality is by virtue of conditional independence of the bi. The full condi-
tional for τ2 is
$$\tau^2 \sim IG\left(0.5[n + \nu],\; 0.5\left[\nu\lambda + \sum_{i=1}^{n} (b_i - \mu)^2\right]\right),$$
while that for μ involves a precision weighted average of mμ, and the average of the bi,
namely
$$\mu \sim N\left(\bar{b}\left(\frac{nV_\mu}{nV_\mu + \tau^2}\right) + m_\mu\left(\frac{\tau^2}{nV_\mu + \tau^2}\right),\; \frac{\tau^2 V_\mu}{nV_\mu + \tau^2}\right).$$
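The three full conditionals can be cycled through in a Gibbs sampler. The sketch below is a plain Python illustration under stated assumptions, not the book's JAGS code: the function name, the data, and the hyperparameter values are hypothetical, the conditional mean of each $b_i$ is taken as $w_i\mu + (1 - w_i)y_i$ with variance $D_i$, and the inverse gamma draw for $\tau^2$ uses the sum of squares of the latent effects about μ.

```python
import random

def gibbs_normal_normal(y, s2, nu, lam, m_mu, v_mu, n_iter=2000, seed=1):
    """Gibbs sampler for y_i ~ N(b_i, s2_i), b_i ~ N(mu, tau2)."""
    rng = random.Random(seed)
    n = len(y)
    mu, tau2 = sum(y) / n, 1.0  # crude starting values
    draws = []
    for _ in range(n_iter):
        # b_i | rest ~ N(w_i * mu + (1 - w_i) * y_i, D_i)
        b = []
        for yi, s2i in zip(y, s2):
            w = s2i / (s2i + tau2)
            d = tau2 * s2i / (tau2 + s2i)
            b.append(rng.gauss(w * mu + (1 - w) * yi, d ** 0.5))
        # tau2 | rest ~ IG(shape, rate); draw a gamma precision and invert
        shape = 0.5 * (n + nu)
        rate = 0.5 * (nu * lam + sum((bi - mu) ** 2 for bi in b))
        tau2 = 1.0 / rng.gammavariate(shape, 1.0 / rate)  # 2nd arg is a scale
        # mu | rest: precision weighted average of m_mu and the mean of b
        b_bar = sum(b) / n
        denom = n * v_mu + tau2
        mean = b_bar * n * v_mu / denom + m_mu * tau2 / denom
        mu = rng.gauss(mean, (tau2 * v_mu / denom) ** 0.5)
        draws.append((mu, tau2))
    return draws

# Hypothetical effect estimates and known sampling variances for five studies.
y = [0.2, 0.5, 0.8, 0.4, 0.6]
s2 = [0.04, 0.09, 0.05, 0.02, 0.08]
draws = gibbs_normal_normal(y, s2, nu=2.0, lam=0.1, m_mu=0.0, v_mu=100.0,
                            n_iter=3000)
kept = draws[500:]  # discard a burn-in of 500 iterations
mu_hat = sum(m for m, _ in kept) / len(kept)
```

In practice one would run several chains and check convergence before summarising the draws.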
Allowing interrelatedness between units leads to inferences about underlying unit means
that are different from those obtained under alternative scenarios sometimes used, namely,
(a) the “independent units” case, with bi taken as unknown and mutually unrelated fixed
effects, with $\tau^2 \to \infty$, and (b) the complete pooling model of classical meta-analysis where
the studies are regarded as effectively interchangeable and $\tau^2 = 0$.
By contrast, the intermediate “exchangeable units” Bayes model leads to a posterior
mean for bi,
$$E[b_i \mid y] = w_i \mu + [1 - w_i] y_i,$$
that averages over the prior mean $\mu$ and the data mean $y_i$ with weights $w_i = s_i^2/(s_i^2 + \tau^2)$ and
$1 - w_i = \tau^2/(s_i^2 + \tau^2)$ respectively, as is apparent from the Gibbs sampling full conditionals.
The bi under an exchangeability scenario have narrower posterior intervals than under an
independent units assumption, with precision related to the confidence about the prior
mean and the prior assumed for $\tau^2$ (see also Section 4.4). Assume the intra-study variances
can be expressed as $s_i^2 = \sigma^2/N_i$ and then set $\tau^2 = \sigma^2/N_\mu$, where $N_\mu$ is the sample size
assigned to the prior mean. Then the weights $w_i$ become $N_\mu/(N_i + N_\mu)$, demonstrating that
shrinkage to the prior mean increases as the confidence about the prior mean increases.
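The equivalence between the variance form and the sample size form of the shrinkage weight can be checked numerically. This is a small Python sketch with hypothetical values for $\sigma^2$, $N_\mu$ and the study sizes; the function names are illustrative only.

```python
def shrinkage_weight(s2_i, tau2):
    """Weight w_i on the prior mean: s_i^2 / (s_i^2 + tau^2)."""
    return s2_i / (s2_i + tau2)

def posterior_mean(y_i, s2_i, mu, tau2):
    """E[b_i | y] = w_i * mu + (1 - w_i) * y_i for known mu and tau2."""
    w = shrinkage_weight(s2_i, tau2)
    return w * mu + (1 - w) * y_i

sigma2 = 4.0   # hypothetical common observation variance
n_mu = 10      # hypothetical sample size assigned to the prior mean
pairs = []
for n_i in (5, 10, 40):
    w_variance_form = shrinkage_weight(sigma2 / n_i, sigma2 / n_mu)
    w_sample_size_form = n_mu / (n_i + n_mu)
    pairs.append((w_variance_form, w_sample_size_form))
```

The two entries in each pair agree, and the weight on the prior mean falls as the study sample size grows relative to $N_\mu$.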
The normal-normal model may be robustified against skewness, heavy tails, and outlier
studies in either the sampling density or the latent effects density. If non-normality is sus-
pected at the second stage, a heavy-tailed prior can be used to accommodate possibly out-
lying studies. A normal-t approach involves study-specific scale adjustments at the second
stage (West, 1984), downplaying the influence of atypical studies on posterior estimates of
the overall effect μ, and avoiding over-shrinkage of individual study effects bi. The scaling
factors are gamma with shape and rate ν/2:
$$b_i \sim N(\mu, \tau^2/\lambda_i),$$
$$\lambda_i \sim Ga\left(\frac{\nu}{2}, \frac{\nu}{2}\right).$$
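The scale mixture representation is easy to simulate. The Python sketch below draws latent effects from the normal-t second stage; the function name and the values of μ, $\tau^2$ and ν are hypothetical. Note that `random.gammavariate` takes a shape and a scale, so a rate of ν/2 is passed as the scale 2/ν.

```python
import random

def draw_scale_mixture(mu, tau2, nu, rng):
    """Draw (b_i, lambda_i) with lambda_i ~ Ga(nu/2, rate nu/2)."""
    lam = rng.gammavariate(nu / 2, 2 / nu)   # shape nu/2, scale 2/nu
    b = rng.gauss(mu, (tau2 / lam) ** 0.5)   # b_i | lambda_i ~ N(mu, tau2/lambda_i)
    return b, lam

rng = random.Random(42)
draws = [draw_scale_mixture(0.0, 1.0, 4.0, rng) for _ in range(20000)]
mean_lam = sum(lam for _, lam in draws) / len(draws)
```

The scaling factors average about 1, and small values of $\lambda_i$ inflate the variance for the corresponding unit, which is how atypical studies are downweighted.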
4.3.1 Meta-Regression
Sometimes it is necessary to control explicitly for trial design, study location, and
other design features in order to justify an exchangeability assumption (Marshall and
Spiegelhalter, 1998; Pauler and Wakefield, 2000; Prevost et al., 2000). Similarly, in survey-
based small area estimation, the estimate of bi may incorporate information from admin-
istrative area data Xi (Rao, 2003; Jiang and Lahiri, 2006). So, with centred predictors Xi of
dimension p (excluding a constant term), the normal-normal model becomes
$$y_i \sim N(b_i, s_i^2),$$
$$b_i \sim N(\mu + X_i \beta, \tau^2),$$
$$y_i \mid \beta, \tau^2 \sim N(\mu + X_i \beta, s_i^2 + \tau^2).$$
$$y_i = \mu + X_i \beta + d_i + e_i,$$
$$d_i \sim N(0, \tau^2),$$
$$e_i \sim N(0, s_i^2),$$
$$p(\tau^2) \propto \frac{1}{\tau^2},$$
equivalent to taking $\tau^2 \sim IG(0, 0)$ and to a flat prior on log(τ) over (0,∞), can lead to improper
posteriors in random-effects models (DuMouchel and Waternaux, 1992). A just proper
risk-averse alternative, such as $\tau^2 \sim IG(c, c)$ with c small, is often used (Simpson et al., 2016).
However, this prior has a spike near zero (Browne and Draper, 2006), and different values of c
can influence posterior inferences despite the supposedly diffuse nature of the prior (Gelman,
2006). The prior $1/\tau^2 \sim Ga(1, c)$ similarly may lead to overfitting (Simpson et al., 2016).
One might carry out a sensitivity analysis over a range of proper but diffuse Ga(c,d) pri-
ors for the precision 1/τ2, such as {c = 0.1, d = 0.001} or c = d = 0.0001 (Fahrmeir and Lang,
2001; van Dongen, 2006). An alternative scheme is to compare alternative values of c in
Ga(1,c) priors for 1/τ2 (Besag et al., 1995), possibly using a mixture prior over M possible
values for $c = (c_1, \ldots, c_M)$ in the prior $1/\tau^2 \sim Ga(1, c)$, such as $c_m$ = 1, 0.1, 0.01 and 0.001 (Jullion
and Lambert, 2007). Then
$$c \mid \pi \sim \sum_{m=1}^{M} \pi_m\, Ga(1, c_m),$$
$$\pi \sim \text{Dirichlet}(\omega)$$
$$\tau^2 \sim \chi^{-2}(\nu, \lambda),$$
is equivalent to assuming $\tau^2 \sim IG(\nu/2, \nu\lambda/2)$, where λ is a prior guess at the mean variance,
and ν is a prior sample size (or level of confidence) parameter. Conlon et al. (2007)
consider informative inverse gamma priors on τ2 for inter-study variability in log-expression
ratios in a microarray data application; for example, they use relatively large prior
sample sizes ν.
Smith et al. (1995) discuss elicitation of informative inverse gamma priors for τ2 based on
anticipated variation in the underlying rates bi, and the fact that assuming normality, 95%
of the $b_i$ will lie between $\mu - 1.96\tau$ and $\mu + 1.96\tau$. Assume the $b_i$ are measured on a log scale
(e.g. log relative risks or log odds ratios), and suppose the expected ratio of the 97.5th and
2.5th percentiles of risks (or odds) between centres or studies is 5. Then the gap between
the 97.5th and 2.5th percentiles for $b_i$ is log(5) = 1.61. For normal $b_i$, the prior mean for $\tau^2$ is
then $(0.5 \times 1.61/1.96)^2 = 0.17$, and the prior mean for $1/\tau^2$ is 5.93. If the upper limit for the
ratio of the 97.5th and 2.5th percentiles of rates or odds is set at 10, this defines the 97.5th percentile
of $\tau^2$, namely $(0.5 \times 2.3/1.96)^2 = 0.34$. The expectation and variability are then used to
define an inverse gamma prior on $\tau^2$ or a gamma prior on $1/\tau^2$. Another procedure based
on expected contrasts in relative risk (RR) or relative odds (RO) is mentioned by Marshall
and Spiegelhalter (2007, p.422): 95% of units will have RRs or ROs in the range $\exp(\pm 1.96\tau)$,
and an expectation of reasonable homogeneity might correspond to values of τ less than
$\tau_h = 0.2$. Setting $\psi = 0.5\tau_h = 0.1$, these expectations are expressed via a half normal prior on
τ, with τ = |T| where
$$T \sim N(0, \psi^2),$$
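The elicitation arithmetic above can be reproduced in a few lines. This Python sketch uses hypothetical function and variable names; it works with unrounded logarithms, so the results differ slightly in the third decimal place from values obtained with log(5) rounded to 1.61 and log(10) rounded to 2.3. It also evaluates the probability that the half normal prior places on values of τ below $\tau_h$.

```python
import math

def tau2_from_ratio(ratio, z=1.96):
    """tau^2 implied when the 97.5th/2.5th percentile ratio of risks is `ratio`.

    On the log scale the gap between the two percentiles is 2 * z * tau.
    """
    return (0.5 * math.log(ratio) / z) ** 2

prior_mean_tau2 = tau2_from_ratio(5)          # expected ratio of 5
prior_mean_precision = 1 / prior_mean_tau2    # implied prior mean of 1/tau^2
upper_tau2 = tau2_from_ratio(10)              # upper limit for the ratio of 10

# Half normal prior: tau = |T|, T ~ N(0, psi^2), with psi = 0.5 * tau_h.
tau_h = 0.2
psi = 0.5 * tau_h
prob_below_tau_h = math.erf(tau_h / (psi * math.sqrt(2)))  # Pr(|T| < tau_h)
```

Since $\tau_h$ is two prior standard deviations from zero, about 95% of the half normal prior mass lies below it.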
$$\phi \sim Ga(c, d),$$
where d is a small multiple of $1/R^2$ and R is the range of the observed centre effects, and with
γ and c constrained according to $\gamma > 1 > c$. When the first stage sampling density involves
an unknown variance, Gustafson et al. (2006) suggest a conditional prior sequence adapted
to avoiding undersmoothing, namely
$$p(\tau^2 \mid \sigma^2) \propto \left(\frac{1}{\tau^2 + \sigma^2}\right)^{a+1} \exp\left(-\frac{b}{\tau^2 + \sigma^2}\right).$$
4.4.1 Non-Conjugate Priors
Among non-conjugate strategies (for normal-normal meta-analysis) an effective choice in
terms of being genuinely non-informative (Gelman, 2006) is a bounded uniform prior on
the random effects standard deviation, $\tau \sim U(0, H)$ with H large. However, this prior may
be biased towards relatively large variances when the number of units (trials, studies, etc.)
is small (van Dongen, 2006, p.92).
A prior selection strategy based on the principles of penalising complexity, and of pre-
ferring simpler models when more complex models are not strongly supported (Occam’s
razor), may be adopted. Thus Simpson et al. (2016) propose that the prior π(ξ) on a flexibil-
ity parameter (hyperparameter), such as the level 2 standard deviation in a normal-normal
model, be set so as to prefer the simpler base model in which ξ = 0. A penalising complexity
(PC) prior has density decreasing at high values and maximum at ξ = 0 in order to prevent
overfitting; that is, the mode of the PC prior is always at the base model. A suitable value
for the prior on ξ can be obtained via a user-defined condition $\Pr(Q(\xi) > U) = \alpha$. This specifies
an upper value U for a function Q(ξ) of ξ, and the associated probability α. For the τ
standard deviation parameter in a normal-normal hierarchy, the PC prior is an exponential
with rate λ, $\tau \sim \text{Exp}(\lambda)$, and if one specifies $\Pr(\tau > \tau_U) = \alpha$, the resulting exponential rate
is $\lambda = -\ln(\alpha)/\tau_U$. The PC prior for the precision $1/\tau^2$ is a Gumbel type 2 density.
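As a quick check of the rate calculation, the Python sketch below computes the exponential rate from a tail condition and confirms that the implied tail probability matches. The upper value $\tau_U = 1$ and probability α = 0.01 are hypothetical choices for illustration, as are the function names.

```python
import math

def pc_prior_rate(tau_u, alpha):
    """Exponential rate such that Pr(tau > tau_u) = alpha."""
    return -math.log(alpha) / tau_u

def exponential_tail(t, rate):
    """Pr(tau > t) under tau ~ Exp(rate)."""
    return math.exp(-rate * t)

rate = pc_prior_rate(tau_u=1.0, alpha=0.01)
tail_check = exponential_tail(1.0, rate)   # recovers alpha
```

A smaller $\tau_U$ or a smaller α gives a larger rate, and so a prior more concentrated at the base model τ = 0.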
$$w = \frac{s_0^2}{s_0^2 + \tau^2},$$
where
$$\frac{1}{s_0^2} = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{s_i^2},$$
is the harmonic mean of the study sampling variances. DuMouchel (1996) proposes a uniform
prior on $s_0/(s_0 + \tau)$ which is equivalent to a Pareto prior, namely
$$p(\tau) = \frac{s_0}{(s_0 + \tau)^2}.$$
This prior is proper but with $E(\tau) = \infty$, and with (0.01, 0.25, 0.5, 0.75, 0.99) percentile points at
$(s_0/99, s_0/3, s_0, 3s_0, 99s_0)$. Note that the Pareto can also be parameterised as
$$p(u) = b s_0^b u^{-b-1},$$
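The percentile points quoted for the DuMouchel prior follow from its distribution function, $\Pr(\tau \le t) = t/(s_0 + t)$, obtained by integrating $p(\tau) = s_0/(s_0 + \tau)^2$. A short Python sketch, with hypothetical sampling variances and function names, inverts this and also computes the harmonic mean $s_0^2$:

```python
def harmonic_mean_variance(s2):
    """s_0^2 defined by 1/s_0^2 = (1/n) * sum(1/s_i^2)."""
    return len(s2) / sum(1 / v for v in s2)

def tau_quantile(p, s0):
    """Quantile of tau when s0 / (s0 + tau) is uniform on (0, 1).

    Pr(tau <= t) = t / (s0 + t), so t = s0 * p / (1 - p).
    """
    return s0 * p / (1 - p)

s2 = [0.04, 0.09, 0.05]          # hypothetical study sampling variances
s0_squared = harmonic_mean_variance(s2)
s0 = s0_squared ** 0.5
# Quantiles expressed as multiples of s0.
multiples = [tau_quantile(p, s0) / s0 for p in (0.01, 0.25, 0.5, 0.75, 0.99)]
```

The multiples come out as 1/99, 1/3, 1, 3 and 99, so the prior median of τ equals $s_0$.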
and variance m2 + V . Gelman (2006) and Zhao et al. (2006) adopt folded non-central t-den-
sities for τ, obtained by dividing the absolute value of a normal variable by the square root
of a gamma variable.
If the normal variable has mean zero, then the folded non-central t becomes a half t
variable. With degrees of freedom in the t density set to 1, this leads to a half-Cauchy for
τ, exemplified by
$$\Delta \sim N(0, s_\Delta^2),$$
$$s_\Delta \sim U(0, K),$$
$$\lambda \sim Ga(0.5, 0.5),$$
$$\tau = |\Delta|/\lambda^{0.5}.$$
Setting $s_\Delta^2 = 1$ leads to a $C^+(0, 1)$ prior, as in the horseshoe prior (Chapter 7). The half Cauchy
prior on τ is included in rstan and runjags libraries in R.
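The ratio construction can be checked by simulation. The following Python sketch, with a hypothetical function name and the normal scale fixed at 1 rather than given a uniform prior, draws τ as the absolute value of a normal variable divided by the square root of a Ga(0.5, 0.5) variable. A standard half-Cauchy has median 1, which the sample median should approximate.

```python
import random

def draw_half_cauchy(rng, s_delta=1.0):
    """Draw tau = |Delta| / sqrt(lambda), a half-Cauchy with scale s_delta."""
    delta = rng.gauss(0.0, s_delta)
    lam = rng.gammavariate(0.5, 2.0)   # shape 0.5, rate 0.5 (scale 2)
    return abs(delta) / lam ** 0.5

rng = random.Random(2020)
taus = sorted(draw_half_cauchy(rng) for _ in range(20000))
median_tau = taus[len(taus) // 2]
```

The sample mean is not a useful summary here, since the half-Cauchy has no finite mean.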
Half t and half Cauchy priors for the second stage parameter τ may also be achieved
by a reparameterisation of the second-stage prior on the latent trial means which strictly
involves parameter redundancy. Such over-parameterisation may improve MCMC conver-
gence (Gelman, 2006, section 3.2). With preset parameters ν and A (degrees of freedom and
prior scale respectively) one has, for $y_i \sim N(b_i, s_i^2)$,
$$b_i = \mu + \xi \eta_i,$$
$$\xi \sim N(0, A),$$
$$\eta_i \sim N(0, s_\eta^2),$$
$$1/s_\eta^2 \sim \chi_\nu^2,$$
$$\tau = |\xi| s_\eta.$$
Applications are provided by van Dongen (2006) and Chelgren et al. (2011). Setting ν = 1
leads to a half Cauchy prior
$$p(s_b) \propto (\tau^2 + A)^{-1},$$
where Gelman (2006, p.524) uses a value A = 25 in a meta-analysis with small n, based on a
prior belief that τ is well below 100.
$$y_i = \log\frac{r_{iT}}{N_{iT} - r_{iT}} - \log\frac{r_{iC}}{N_{iC} - r_{iC}},$$
are taken as approximately normal with known variances
$$s_i^2 = \frac{1}{r_{iT}} + \frac{1}{N_{iT} - r_{iT}} + \frac{1}{r_{iC}} + \frac{1}{N_{iC} - r_{iC}}.$$
A normal higher stage is assumed with $y_i \sim N(b_i, s_i^2)$ and $b_i \sim N(\mu, \tau^2)$. A uniform shrinkage
prior on
$$w = \frac{s_0^2}{s_0^2 + \tau^2},$$
as considered above, is assumed for the second-stage variance, with the half-Cauchy
also considered. Additionally, a N(0, 100) prior on μ is adopted. Various kinds of predic-
tion may be considered. Here, the predicted treatment effect in a new trial is sampled
according to
$$b_{new} \sim N(\mu, \tau^2).$$
Early convergence in a two-chain run of 5000 iterations with jagsUI is obtained. τ2 is esti-
mated as 0.085 (mean) and 0.080 (median). A clear benefit of NRT is seemingly appar-
ent, with the odds ratio exp(μ) having a posterior mean (and 95% CrI) of 1.93 (1.73,2.17).
On the other hand, the predicted odds ratio for a new trial (OR.new in the rjags code)
includes null values for the benefit from NRT, having mean (95% CrI) of 2.2 (0.7,5.1).
Some deficiencies against model assumptions are evident: although the Shapiro–Wilk
normality test of the posterior mean bj is inconclusive (a p-value of 0.07), the Jarque–Bera
test (Jarque and Bera, 1980) shows a significant departure from normality.
Similarly, evaluating individual components of the total WAIC (widely applicable
information criterion) shows studies 4, 36, and 59 as having distinctively high values.
Trial 4 has an exceptionally high empirical log odds ratio in support of NRT, while trial
36 shows unusually low NRT benefit. Mixed predictive exceedance checks (Marshall
and Spiegelhalter, 2007) show aberrant values (0.001 and 0.992) for these two trials, with
study 59 also having an extreme value. A reanalysis using the normal-normal scheme
uses a half-Cauchy prior, with the setting on the Cauchy scale parameter as in Gelman
et al. (2008). τ2 is now estimated as 0.081 (mean) and 0.078 (median). The LOO-IC (leave-
one-out information criterion) is reduced slightly, but model checks show similar fea-
tures to the analysis using the uniform shrinkage prior.
To allow for potential outlier trials and downweight their effect, an alternative analy-
sis adopts a second-stage Student density with
$$b_i \sim N(\mu, \tau^2/\lambda_i),$$
$$\lambda_i \sim Ga\left(\frac{\nu}{2}, \frac{\nu}{2}\right).$$
Less typical trial results will have values of λi considerably under 1 and a test for the
posterior probability that λi is less than 1 can be included. The prior on ν is specified in
two steps as $\nu \sim E(k)$ and $k \sim U(0.01, 0.5)$. Evidence in support of a heavy tailed second
stage is equivocal. From a two-chain run of 10000 iterations with jagsUI, ν has a poste-
rior mean of 9.3, suggesting departure from normality. The posterior mean and median
for τ2 are reduced to 0.051 and 0.045 respectively. Two trials (4 and 36) have posterior
probability that λi < 1 in excess of 0.8. These trials also have
extreme mixed predictive exceedance p-values. On the other hand, gain in goodness
of fit is not obtained: the marginal density likelihood, uncorrected for complexity, is
unchanged, and the WAIC increases.
The skew t model (Lee and Thompson, 2008; Fernandez and Steel, 1998) is also estimated
using rube. This involves asymmetric scaling of the second-stage variance according
to whether the residual $e_j = y_j - b_j$ is negative or positive. For positive residuals, $\tau^2$
is scaled by a factor $\gamma^2 > 0$, while for negative residual terms, the scaling is by $1/\gamma^2$. The
value γ = 1 corresponds to a symmetric t density, while γ > 1 (γ < 1) corresponds to positive
(negative) skew. Applied to the NRT data, a two-chain run of 5,000 iterations shows no
gain in fit, nor any evidence that the 95% CrI for γ excludes 1. Mixed predictive exceedance
checks for studies 4, 36, and 59 are still extreme, with values 0.016, 0.97 and 0.014.
Finally, a two-category discrete mixture is assumed on the second-stage variance
(Beath, 2014), with an outlier group (Gj = 2) posited to have higher variance. To improve
identifiability, prior probabilities for the outlier and main groups, Pr(Gj = 2) and Pr(Gj = 1)
are set at 0.05 and 0.95 respectively. The default (main group) variance is assigned a
uniform shrinkage prior as above, while the increment in the outlier group variance is
assigned an informative E(10) prior. The posterior probability Pr(Gj = 2|y) is then 0.25
for trial 4 (a marginal Bayes factor of 6.3), while the corresponding marginal Bayes factor
for trial 36 is 3.6. Again, fit is not improved against the standard normal-normal model,
and mixed predictive exceedance checks for studies 4, 36, and 59 remain extreme.
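The marginal Bayes factors quoted here are ratios of posterior to prior odds of outlier group membership. A small Python sketch, using the prior probability of 0.05 and the reported figures for trials 4 and 36 (the function names are hypothetical), shows the conversion in both directions:

```python
def marginal_bayes_factor(post_prob, prior_prob):
    """Posterior odds divided by prior odds of membership in the outlier group."""
    post_odds = post_prob / (1 - post_prob)
    prior_odds = prior_prob / (1 - prior_prob)
    return post_odds / prior_odds

def posterior_prob_from_bf(bayes_factor, prior_prob):
    """Posterior probability implied by a Bayes factor and a prior probability."""
    post_odds = bayes_factor * prior_prob / (1 - prior_prob)
    return post_odds / (1 + post_odds)

bf_trial4 = marginal_bayes_factor(post_prob=0.25, prior_prob=0.05)
prob_trial36 = posterior_prob_from_bf(bayes_factor=3.6, prior_prob=0.05)
```

A posterior probability of 0.25 against a prior of 0.05 gives a Bayes factor of 19/3, or about 6.3, while a Bayes factor of 3.6 corresponds to a posterior probability of roughly 0.16.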
4.5 Multivariate Meta-Analysis
Multivariate meta-analysis may adopt a normal-normal strategy, albeit often with orig-
inally binary, count, or time to event data (Mavridis and Salanti, 2013). A multivariate
analysis for metric outcomes may arise in different ways. These include clinical applica-
tions involving treatment and control arms; studies where multiple outcomes are reported;
in meta-analysis of diagnostic test studies, where sensitivity and specificity are reported
(Guo and Riebler, 2016; Guo et al., 2017); in multiple treatments meta-analysis; and in net-
work meta-analysis (Greco et al., 2016).
In the first scenario, the event rate in the control arm may be taken as indicating baseline
risk, and there is interest in whether the treatment effect is related in any way to base-
line risk (Arends, 2006). Suppose riT of NiT treated subjects in trial i exhibit a particular
response (e.g. disease or death), as compared to riC of NiC control subjects, and define log
odds yiT = log(riT /( N iT − riT )) and yiC = log(riC /( N iC − riC )) . Often the outcome may be taken
as the log of the odds ratio, yiT − yiC , assumed normal (see Example 4.2). However, to sepa-
rate out baseline risk, one may model { yiT , yiC } as (approximately) bivariate normal. If the
trial is randomised, it is legitimate to assume that { yiT , yiC } are independent at the first
stage (van Houwelingen et al., 2002). So
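$$\begin{pmatrix} b_{iT} \\ b_{iC} \end{pmatrix} \sim N\left(\begin{pmatrix} \mu_T \\ \mu_C \end{pmatrix}, \Sigma_b\right),$$
where $\Sigma_b = \begin{pmatrix} \tau_T^2 & \tau_{TC} \\ \tau_{TC} & \tau_C^2 \end{pmatrix}$, with $\tau_{TC} = \rho\tau_T\tau_C$, and diagonal terms $\tau_T^2$ and $\tau_C^2$ represent variability
in the true treatment and control event rates. Then $\gamma = \mu_T - \mu_C$ defines the underlying
treatment effect with variance $\tau_T^2 + \tau_C^2 - 2\tau_{TC}$. The conditional variance of the treatment
effect, given the true control group rate, is $\tau_T^2 - (\tau_{TC}^2/\tau_C^2)$. So, baseline risk explains a portion
$$\frac{\tau_C^2 - 2\tau_{TC} + \tau_{TC}^2/\tau_C^2}{\tau_T^2 + \tau_C^2 - 2\tau_{TC}}$$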
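$$\begin{pmatrix} y_{i1} \\ y_{i2} \\ \vdots \\ y_{iK} \end{pmatrix} \sim N\left(\begin{pmatrix} b_{i1} \\ b_{i2} \\ \vdots \\ b_{iK} \end{pmatrix}, S_i\right),$$
where $S_i$ is the known covariance matrix between outcomes for trial i, with $s_{ijk} = \rho_{ijk}(s_{ij}^2 s_{ik}^2)^{0.5}$. A multivariate
normal second-level prior for $(b_{i1}, \ldots, b_{iK})$ involves means $\{\mu_1, \ldots, \mu_K\}$, and K × K covariance
matrix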
There may well be sensitivity to the priors adopted for Σb, especially when there are a small
number of trials, or some missingness in outcomes, with results from the inverse Wishart
potentially sensitive to the prior scale matrix (Wei and Higgins, 2013). Alternatives involve
decomposition approaches to the covariance matrix, so that separate priors are specified
on variances and correlations (Barnard et al., 2000; Lu and Ades, 2009; Burke et al., 2016;
Guo et al., 2017; Hurtado Rua et al., 2015). Incorporating evidence into multivariate priors
on variances and correlations leads to stabilised inferences (Burke et al., 2016; Guo et al.,
2017). Alternatives to a U(−1, 1) prior on correlations include a normal prior on the Fisher
z-transformed correlation logit((ρ + 1)/2), a uniform prior U(0,1), constrained to positive
correlations (Burke et al., 2016), or penalised complexity priors, as included in the R program
meta4diag (Guo et al., 2017). Alternatives to gamma priors on precisions $1/\tau_k^2$, which
may lead to relatively high estimated τk, are half-normal priors (Burke et al., 2016; Lambert
et al., 2005). Alternative methods are available if some, or all, within study correlations are
not observed (i.e. only standard errors of treatment effects are available), for example, spec-
ifying an informative prior (Mavridis and Salanti, 2013). For the bivariate case, an alterna-
tive model may be specified (Riley et al., 2008), entirely avoiding the need for observed
intra-study correlations.
Multivariate normality is often a simplification, and one may wish to allow for
heavier tails, skewness, or multi-modality; see Genton (2004) and Lee and Thompson
(2008) regarding use of skew-elliptical densities as a route to greater robustness. These
models build on the principle (Azzalini, 1985) that if f and g are symmetric densities with
parameters μ and σ, with G the cumulative density corresponding to g, then the new den-
sity defined by
\[
h(x \mid \mu, \sigma, \delta) = \frac{2}{\sigma}\, f\!\left(\frac{x - \mu}{\sigma}\right) G\!\left(\delta\,\frac{x - \mu}{\sigma}\right)
\]
is skew for non-zero δ.
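As a quick numerical check of this construction, the following R sketch takes both f and g to be standard normal (so that G is `pnorm`), which gives the skew-normal case. The function name `dskew` and the parameter values are illustrative assumptions, not part of the book's code.

```r
# Azzalini-type skew density with f = standard normal pdf and G = standard normal cdf
dskew <- function(x, mu = 0, sigma = 1, delta = 0) {
  z <- (x - mu) / sigma
  (2 / sigma) * dnorm(z) * pnorm(delta * z)
}

# The density should integrate to one for any delta
total <- integrate(dskew, -Inf, Inf, mu = 1, sigma = 2, delta = 3)$value

# With mu = 0, sigma = 1 and delta > 0 the mean moves above zero (positive skew)
mean_skew <- integrate(function(x) x * dskew(x, delta = 3), -Inf, Inf)$value
```

Setting `delta = 0` recovers the symmetric normal density, while a negative `delta` mirrors the skew.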
Following Sahu et al. (2003), a multivariate skew-normal model is a particular type of
skew-elliptical model (of dimension K) obtained by considering errors $\varepsilon_{K \times 1} \sim N_K(0, \Sigma)$,
positive variables $Z_{K \times 1} \sim N_K(0, I)$ and taking $y = DZ + \varepsilon$, where D is a diagonal matrix,
$\mathrm{diag}(\delta_1, \ldots, \delta_K)$. In a regression setting with a K dimensional mean μ, one has
\[
y \mid Z = z \sim N_K(\mu + Dz, \Sigma).
\]
Values δk > 0 correspond to positive skew in the kth outcome while a negative δk arises from
negative skew. A multivariate skew-t model (allowing both for heavier tails than the normal
and for skewness) is obtained by sampling $Z_{K \times 1} \sim t_{K,\nu}(0, I)$, where ν is a degrees
of freedom parameter, and then
\[
y \mid Z = z \sim t_{K,\nu+K}\left(\mu + Dz,\; \frac{\nu + z^T z}{\nu + K}\,\Sigma\right).
\]
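\[
\begin{pmatrix} y_{iT} \\ y_{iC} \end{pmatrix} \sim N\left( \begin{pmatrix} b_{iT} \\ b_{iC} \end{pmatrix}, \begin{pmatrix} s_{iT}^2 & 0 \\ 0 & s_{iC}^2 \end{pmatrix} \right),
\]
\[
\begin{pmatrix} b_{iT} \\ b_{iC} \end{pmatrix} \sim N\left( \begin{pmatrix} \mu_T \\ \mu_C \end{pmatrix}, \Sigma_b \right),
\]
where
\[
s_{iT}^2 = \frac{1}{r_{iT}} + \frac{1}{N_{iT} - r_{iT}}, \qquad s_{iC}^2 = \frac{1}{r_{iC}} + \frac{1}{N_{iC} - r_{iC}}.
\]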
It is also assumed that the precision matrix $\Sigma_b^{-1}$ of the latent effects is Wishart with
identity scale matrix and 2 degrees of freedom, while the $\{\mu_T, \mu_C\}$ parameters have N(0,
1000) priors.
Using jagsUI, posterior means for $(\mu_T, \mu_C)$ are estimated as (−4.87, −4.07), with mean
vaccination effect $\gamma = \mu_T - \mu_C$ of −0.79 (−1.27, −0.32), slightly more negative than the
estimate of −0.74 found by van Houwelingen et al. (2002) using classical methods (in the
SAS package). The posterior mean for the second-stage covariance matrix is
\[
\Sigma_b = \begin{pmatrix} 1.83 & 2.21 \\ 2.21 & 3.29 \end{pmatrix},
\]
with correlation between treatment and control effects (where effects are log-odds),
obtained from monitoring the components of Σb, as 0.90.
Similarly, the slope of the regression to predict the vaccination group log-odds from
the control group log-odds, obtained by averaging $\Sigma_{b,12}^{(t)}/\Sigma_{b,22}^{(t)}$ over iterations t, is 0.67
(slope.TC in the code). The variance of the true treatment effects $b_{iT} - b_{iC}$ is obtained by
monitoring $V_t = \Sigma_{b,11} + \Sigma_{b,22} - 2\Sigma_{b,12}$, while the conditional variance of the vaccination
log-odds effects $b_{iT}$ given $b_{iC}$ (and hence the variance of $b_{iT} - b_{iC}$ given $b_{iC}$) is obtained
by monitoring $V_c = \Sigma_{b,11} - \Sigma_{b,12}^2/\Sigma_{b,22}$. Finally, the proportion of treatment effect
variation explained by baseline risk (i.e. the true log-odds in the control group), obtained by
monitoring $1 - V_c/V_t$, has a posterior mean of 0.51 (r2.base in the code).
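The derived summaries can be illustrated with a plug-in calculation in R from the reported posterior mean of the covariance matrix. This is only a sketch: the text averages each quantity over MCMC iterations, whereas the lines below evaluate them once at the posterior mean matrix, and the object names are assumptions.

```r
# Reported posterior mean of the second-stage covariance matrix (treatment, control)
Sigma_b <- matrix(c(1.83, 2.21,
                    2.21, 3.29), nrow = 2, byrow = TRUE)

corr_TC  <- Sigma_b[1, 2] / sqrt(Sigma_b[1, 1] * Sigma_b[2, 2])  # correlation of latent effects
slope_TC <- Sigma_b[1, 2] / Sigma_b[2, 2]                         # slope of b_iT on b_iC
V_t <- Sigma_b[1, 1] + Sigma_b[2, 2] - 2 * Sigma_b[1, 2]          # variance of b_iT - b_iC
V_c <- Sigma_b[1, 1] - Sigma_b[1, 2]^2 / Sigma_b[2, 2]            # variance of b_iT given b_iC
r2_base <- 1 - V_c / V_t                                          # share explained by baseline risk
```

The plug-in values agree to two decimals with the monitored summaries quoted above, although in general the posterior mean of a ratio need not equal the ratio of posterior means.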
To assess possible outliers, a mixed predictive exceedance check (Marshall and
Spiegelhalter, 2007) is carried out by sampling replicate random effects $(b_{iT,new}, b_{iC,new})$
and then sampling replicate data $y_{ij,new}$. There are four observations out of the 26 (13
pairs) with predictive exceedance checks
under 0.10 or over 0.90, with the most extreme being an exceedance probability of 0.024
in the vaccination arm of trial 6 (pred.exc[1,6] in the code). As a summary fit measure, a
predictive criterion (Laud and Ibrahim, 1995) based on comparing $y_{new}$ and y is derived.
Accordingly, a second analysis (using WINBUGS via rube) adopts a bivariate Student
t model at stage 2. The degrees of freedom parameter is set at 4, providing a robust
analysis (Gelman et al., 2014, section 17.2). This leads to only two observations, (6,T) and
(6,C), having predictive exceedance checks under 0.10. Extension of the model to a skew
bivariate t (Sahu et al., 2003), namely model 3 in the code, provides no further gain in fit.
Model fit using the predictive criterion (PFC.mix in the code), in fact, is worse for these
two extensions, illustrating that improved model fit does not always follow measures to
counteract adverse model checks. The mean vaccination effect $\gamma = \mu_T - \mu_C$ is less precise
under these models, namely a mean (95% CRI) of −0.78 (−1.34, −0.23) under the second
model, and −0.80 (−1.54, −0.04) under the third.
A final analysis uses a Cholesky decomposition (Wei and Higgins, 2013) for the
second-stage covariance matrix in an MVN-MVN analysis,
\[
\Sigma_b = V_b R_b V_b, \qquad R_b = L'L.
\]
\[
\begin{pmatrix} b_{1i} \\ b_{2i} \end{pmatrix} \sim N\left( \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \Sigma_b \right),
\]
where $\Sigma_b = \begin{pmatrix} \tau_1^2 & \tau_1\tau_2\rho_b \\ \tau_1\tau_2\rho_b & \tau_2^2 \end{pmatrix}$. In fact, three studies contain subjects with isolated
systolic hypertension (namely high SBP, but normal DBP). So the treatment effect may be
smaller in these trials. To represent this effect, we introduce a second-stage regression
(i.e. multivariate meta-regression):
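\[
\begin{pmatrix} b_{1i} \\ b_{2i} \end{pmatrix} \sim N\left( \begin{pmatrix} \nu_{1i} \\ \nu_{2i} \end{pmatrix}, \Sigma_b \right),
\]
\[
\nu_{1i} = \mu_1 + \beta_1 \mathrm{ISH}_i, \qquad \nu_{2i} = \mu_2 + \beta_2 \mathrm{ISH}_i.
\]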
\[
B(b_i) = e^{b_i}
\]
in (4.1), where $\mu_i = e^{b_i}$, $a(\phi_i) = 1$ and $c(y_i, \phi_i) = \log y_i!$. Then equation (4.2) has the form
namely, a gamma density for $\mu_i = e^{b_i}$ with parameters $\alpha = g_1(\psi)$ and $\beta = g_2(\psi)$. The
conditional posterior is
namely a gamma for μi with parameters α + yi and β + 1. Denoting the mean of the μi as
ξ = α/β, one obtains $V(\mu_i) = \alpha/\beta^2 = \xi^2/\alpha$. Then
\[
V(y_i) = E[V(y_i \mid \mu_i)] + V[E(y_i \mid \mu_i)] = \xi + \xi^2/\alpha
\]
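\[
p(y_i \mid \xi, \alpha) = \int p(y_i \mid w_i, \xi)\, p(w_i \mid \alpha)\, dw_i,
\]
\[
p(y_i \mid \xi, \alpha) = \frac{\Gamma(\alpha + y_i)}{\Gamma(\alpha)\Gamma(y_i + 1)} \left(\frac{\alpha}{\alpha + \xi}\right)^{\alpha} \left(\frac{\xi}{\alpha + \xi}\right)^{y_i}.
\]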
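The negative binomial marginal and its overdispersed variance can be checked numerically. The R sketch below evaluates the closed form with mean ξ and shape α, compares it with base R's `dnbinom` under its `size`/`mu` parameterisation, and recovers the mean ξ and variance ξ + ξ²/α by summation; the values of `alpha` and `beta` are arbitrary illustrative choices.

```r
alpha <- 2.5
beta  <- 0.5
xi    <- alpha / beta          # mean of the gamma mixing density
y     <- 0:500                 # support truncated where the tail mass is negligible

# Closed-form Poisson-gamma (negative binomial) marginal, on the log scale for stability
log_p <- lgamma(alpha + y) - lgamma(alpha) - lgamma(y + 1) +
  alpha * log(alpha / (alpha + xi)) + y * log(xi / (alpha + xi))
p_closed <- exp(log_p)

# Same density via base R: size = alpha, mu = xi
p_nb <- dnbinom(y, size = alpha, mu = xi)

marg_mean <- sum(y * p_closed)
marg_var  <- sum((y - marg_mean)^2 * p_closed)
```

Here the marginal variance is 15 against a mean of 5, three times the variance of a Poisson with the same mean.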
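\[
p(y_i \mid \xi, \alpha, y_i > 0) = \frac{\dfrac{\Gamma(\alpha + y_i)}{\Gamma(\alpha)\Gamma(y_i + 1)} \left(\dfrac{\alpha}{\alpha + \xi}\right)^{\alpha} \left(\dfrac{\xi}{\alpha + \xi}\right)^{y_i}}{1 - \left(\dfrac{\alpha}{\alpha + \xi}\right)^{\alpha}}.
\]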
\[
E(\mu_i) = m_\mu = \alpha/\beta, \qquad \mathrm{var}(\mu_i) = V_\mu = \alpha/\beta^2
\]
(e.g. Clayton and Kaldor, 1987). When this parameterisation includes offsets $o_i$, the
posterior $p(\mu_i, \alpha, \beta \mid y)$ has the form
\[
L(\alpha, \beta, \mu \mid y)\, p(\alpha, \beta) = \prod_{i=1}^{n} \frac{\exp(-\mu_i o_i)(\mu_i o_i)^{y_i}}{y_i!} \left[\frac{\beta^{\alpha}}{\Gamma(\alpha)}\right]^{n} \left(\prod_{i=1}^{n} \mu_i\right)^{\alpha - 1} \exp\left(-\beta \sum_{i=1}^{n} \mu_i\right) p(\alpha, \beta),
\]
with conditional posterior for μi now $Ga(\alpha + y_i, \beta + o_i)$. Hence the posterior mean is
\[
E(\mu_i \mid y_i, \alpha, \beta) = \frac{y_i + \alpha}{o_i + \beta}.
\]
One may define reliabilities (Staggs and Gajewski, 2017) using Vμ and the conditional
variances $\mathrm{var}((y_i/o_i) \mid \mu_i, o_i) = \mu_i/o_i$. Reliabilities in unit rates are estimated as:
\[
\frac{V_\mu}{V_\mu + \mu_i/o_i}.
\]
So higher reliabilities attach to units with larger offsets.
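A minimal R sketch of the posterior mean and the reliability, using made-up counts `y`, offsets `o`, and hyperparameter values (all illustrative assumptions, not estimates from any example in the text), and plugging the posterior mean in for μi in the reliability formula:

```r
alpha <- 4
beta  <- 4                      # prior mean alpha/beta = 1, prior variance alpha/beta^2 = 0.25
y <- c(2, 15, 60)               # hypothetical event counts
o <- c(5, 12, 40)               # hypothetical offsets (e.g. expected events)

crude      <- y / o
prior_mean <- alpha / beta
post_mean  <- (y + alpha) / (o + beta)    # mean of Ga(alpha + y_i, beta + o_i)

V_mu        <- alpha / beta^2
reliability <- V_mu / (V_mu + post_mean / o)
```

Each posterior mean lies between the prior mean and the crude rate, and the unit with the largest offset has the highest reliability.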
The conditional likelihoods (George et al., 1993, p.191) for α and β under this structure are
obtained from $L(\alpha, \beta, \mu \mid y)$, namely
\[
L(\alpha \mid \beta, \mu) = k_\alpha \left(\frac{\beta^{\alpha}}{\Gamma(\alpha)}\right)^{n} \left(\prod_{i=1}^{n} \mu_i\right)^{\alpha - 1},
\]
and
\[
L(\beta \mid \alpha, \mu) = k_\beta\, \beta^{n\alpha} \exp\left(-\beta \sum_{i=1}^{n} \mu_i\right),
\]
where kα and kβ are normalising constants. Hence L(β|α,μ) is gamma with parameters
nα + 1 and $\sum_{i=1}^{n} \mu_i$. The conditional posteriors $p(\alpha \mid \beta, \mu) \propto L(\alpha \mid \beta, \mu)p(\alpha)$ and
$p(\beta \mid \alpha, \mu) \propto L(\beta \mid \alpha, \mu)p(\beta)$ are log-concave when the priors p(α) and p(β) are log-concave.
Assuming a gamma prior $p(\beta) = Ga(c, d)$, with density proportional to $\beta^{c-1}e^{-d\beta}$, the full conditional for β is $Ga(n\alpha + c, \sum_{i=1}^{n} \mu_i + d)$.
However, the full conditional for α is non-standard, whatever form for p(α) is adopted.
Another Poisson-gamma mixture formulation (e.g. Albert, 1999; Christiansen and
Morris, 1996) assumes
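\[
y_i \mid \lambda_i \sim Po(o_i \lambda_i), \qquad \lambda_i \sim Ga\left(\zeta, \frac{\zeta}{\mu_i}\right),
\]
where $V(\lambda_i) = \mu_i^2/\zeta$ and the Poisson corresponds to ζ → ∞. If μi = μ and a gamma prior is
assumed for μ, then the posterior mean for λi conditional on μ and ζ is
\[
E(\lambda_i \mid y, \zeta, \mu) = \frac{y_i + \zeta}{o_i + \zeta/\mu} = B_i \mu + (1 - B_i)\frac{y_i}{o_i},
\]
where
\[
B_i = \frac{\zeta}{\zeta + o_i \mu},
\]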
measures the level of shrinkage towards the overall mean μ. Thus, shrinkage will be
greater when oi (e.g. the population at risk in a mortality application) is small, or when
ζ is large. As for the second-stage variance in the normal-normal model, the prior on ζ
influences the degree of shrinkage that is obtained. Let $r_i = y_i/o_i$, with average $\bar{r}$. Then Christiansen and
Morris (1996) suggest a uniform prior based on the average shrinkage factor
\[
B_0 = \frac{\zeta}{\zeta + \min(o_i)\,\bar{r}} \sim U(0, 1),
\]
with the prior value of ζ then obtained as $B_0 \min(o_i)\,\bar{r}/(1 - B_0)$.
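The link between B0, ζ and the unit-level shrinkage factors can be traced in a few lines of R. The counts, exposures, and the value of `B0` below are hypothetical, and setting the common mean μ equal to the average crude rate is an assumption made only for illustration.

```r
y <- c(3, 10, 45, 120)          # hypothetical event counts
o <- c(20, 50, 300, 1000)       # hypothetical exposures (populations at risk)

r_bar <- mean(y / o)            # average crude rate
B0    <- 0.8                    # one value of the U(0, 1) shrinkage prior
zeta  <- B0 * min(o) * r_bar / (1 - B0)

mu <- r_bar                     # assumed common mean
B  <- zeta / (zeta + o * mu)    # unit-specific shrinkage factors

post_direct <- (y + zeta) / (o + zeta / mu)
post_shrunk <- B * mu + (1 - B) * y / o
```

The two expressions for the posterior mean coincide, and the shrinkage factor falls as the exposure grows, so the smallest unit is pulled most strongly towards μ.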
Extended parameterisations of the negative binomial have been suggested (Liu and Dey,
2007). Winkelmann and Zimmermann (1991) suggest a variance function
\[
V(y_i) = E[V(y_i \mid \mu_i)] + V[E(y_i \mid \mu_i)] = \xi + \phi \xi^{k+1}
\]
with k ≥ −1, and obtained by taking $\mu_i \sim Ga(\xi^{1-k}/\phi, \xi^{-k}/\phi)$. Setting k = 0 and k = 1 leads to
what are called NB1 and NB2 forms of the negative binomial, under which the variances
are linear and quadratic in ξ, namely $V(y_i) = \xi + \phi\xi$ and $V(y_i) = \xi + \phi\xi^2$ respectively.
\[
p(y_i \mid M, V) = \frac{(2\pi V)^{-0.5}}{y_i!} \int_0^{\infty} \mu_i^{y_i - 1} e^{-\mu_i} \exp\left[-\frac{(\log \mu_i - M)^2}{2V}\right] d\mu_i
\]
with marginal mean and variance respectively $e^{M + V/2}$ and $e^{M + V/2} + e^{2M + V}[e^{V} - 1]$. As V → 0, this
reduces to a Poisson density. An alternative parameterisation (Weems and Smith, 2004)
has $y_i \sim Po(\mu_i U_i)$ with $\log(\mu_i) = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_p x_{pi}$, and $\log(U_i) \sim N(1, V)$.
The Poisson lognormal generalises readily to multivariate count data (Chib and
Winkelmann, 2001) or to mixing with heavier tails than available under the lognormal;
for example, the log Student t with a low degrees of freedom parameter for a heavy tailed,
albeit symmetric, mixing density. Skew normal and skew Student t mixing can also be
used, since in some applications, extremes of frailty tend to be above rather than below the
centre of the density (Sahu et al., 2003).
The exchangeable Poisson lognormal model is quite widely applied to pooling infer-
ences over sets of units (e.g. hospitals) when health event totals yi such as surgical deaths
are obtained and there are oi expected events; the Poisson lognormal is also widely applied
in modelling for spatially structured disease count data (Chapter 6). The oi might be based
on multiplying the patient total for hospital i by an average event rate and are usually
assumed known (i.e. not to be subject to measurement error). If the average rate is based on
the total set of n hospitals then one has $\sum_i y_i = \sum_i o_i$, and with $\mu_i = o_i \rho_i$ one has
\[
y_i \sim Po(o_i \rho_i),
\]
with the ρi interpretable as relative risks averaging 1 over all units. However, this feature is
not always present, and allowing for mean risk other than 1 (e.g. if a national surgical mor-
tality rate is applied to a particular set of hospitals), the Poisson lognormal then assumes
\[
\log(\rho_i) = \beta_0 + w_i
\]
where the $w_i \sim N(0, V_w)$ are exchangeable normal random effects, with relative risks ρi
pooled towards a global average rate exp(β0) according to the size of Vw. Equivalently
$v_i = \exp(w_i)$ are lognormal with mean $m = \exp(0.5 V_w)$ and variance $m^2(\exp(V_w) - 1)$.
Generalised Poisson and Poisson process models are also often useful in particular set-
tings, including underdispersion (Consul, 1989; Scollnik, 1995; Podlich et al., 2004). The
generalised Poisson density (Consul, 1989) specifies
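\[
p(y \mid \lambda, \rho) = \frac{\lambda(\lambda + y\rho)^{y-1}}{y!}\, e^{-\lambda - \rho y}
\]
with mean λ/(1 − ρ), variance $\lambda/(1 - \rho)^3$ and hence variance-to-mean ratio $1/(1 - \rho)^2$, which is at least 1 for 0 ≤ ρ < 1. This
reduces to a Poisson density as ρ → 0.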
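The stated moments of the generalised Poisson density can be confirmed by direct summation. In this R sketch the function name `dgenpois` and the values of `lambda` and `rho` are illustrative assumptions, and only the overdispersed range 0 ≤ ρ < 1 is covered.

```r
# Generalised Poisson pmf, evaluated on the log scale for numerical stability
dgenpois <- function(y, lambda, rho) {
  exp(log(lambda) + (y - 1) * log(lambda + y * rho) - lambda - rho * y - lgamma(y + 1))
}

lambda <- 2
rho    <- 0.3
y      <- 0:400                       # truncated support; tail mass beyond this is negligible
p      <- dgenpois(y, lambda, rho)

gp_mean <- sum(y * p)                 # compare with lambda / (1 - rho)
gp_var  <- sum((y - gp_mean)^2 * p)   # compare with lambda / (1 - rho)^3
```

With `rho = 0` the function returns the ordinary Poisson probabilities, matching the limiting case noted above.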
\[
y_i \sim Po(o_i \mu_i), \qquad \mu_i \sim Ga(\alpha, \beta),
\]
where the μi are relative risks, since actual and expected deaths are equal. The hyper-
parameters α and β are assigned diffuse gamma priors. The DIC for this model is 475.
To assess variations in the extent of shrinkage, one may plot the lengths of 90% credible
intervals for percentile ranks against posterior mean reliabilities $V_\mu/(V_\mu + (\mu_i/o_i))$
(Staggs and Gajewski, 2017). As expected, more precise estimates of percentile ranks are
associated with higher reliability. A mixed predictive exceedance check (Marshall and
Spiegelhalter, 2007) shows 13 observations with exceedance probabilities under 0.05 or
over 0.95.
A second model adopts the scheme of Christiansen and Morris (1996), which includes
data-based priors. Thus
\[
y_i \mid \mu_i \sim Po(o_i \mu_i), \qquad \mu_i \sim Ga\left(\zeta, \frac{\zeta}{M_\mu}\right),
\]
with shrinkage factors
\[
B_i = \frac{\zeta}{\zeta + o_i M_\mu}.
\]
The prior on ζ is indirect, via a uniform prior on $B_0 = \zeta/(\zeta + \min(o_i)\,\bar{r})$. A two-chain run
of 5,000 iterations provides a DIC of 530, and high values for both B0 and ζ, namely 0.986
and 7.81. The mixed predictive exceedance check now shows seven observations with
exceedance probabilities under 0.05 or over 0.95.
Christiansen and Morris (1996) argue that exchangeability between all 131 units might
not be applicable, since hospitals with larger patient totals have lower crude death rates.
As one remedy for such a pattern, one might take
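\[
y_i \sim Po(\nu_i), \qquad \nu_i \sim Ga\left(\zeta, \frac{\zeta}{\rho_i}\right),
\]
where
\[
\log(\rho_i) = \beta_1 + \beta_2 \log(o_i),
\]
\[
y_i \mid \mu_i \sim Po(o_i \mu_i), \qquad \mu_i \sim Ga\left(\zeta_{G_i}, \frac{\zeta_{G_i}}{\mu_{G_i}}\right)
\]
\[
B_{0k} = \frac{\zeta_k}{\zeta_k + \min(o_i; G_i = k)\,\bar{r}_k}.
\]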
events, the beta-binomial may also be used if populations are relatively small, and has dif-
ferent implications for shrinkage: shrinkage is greater under the Poisson-gamma (Howley
and Gibberd, 2003). Binomial and multinomial mixture methods have recently become
popular in the analysis of ecologic problems where marginals of a contingency table are
available, often from different sources such as census and voting data, but the internal
cells are unobserved (King, 1997; King et al., 2004). They may also be applied in meta-anal-
ysis, avoiding normal approximations (Bakbergenuly and Kulinskaya, 2017; Kulinskaya
and Olkin, 2014).
For binomial data $y_i \sim Bin(N_i, \pi_i)$, i = 1, …, n, the exponential family parameterisation
sets
\[
B(b_i) = N_i \log(1 + e^{b_i}),
\]
in (4.1), where $\pi_i = e^{b_i}/(1 + e^{b_i})$, $a(\phi_i) = 1$, and $c(y_i, \phi_i) = \log \binom{N_i}{y_i}$. Then equation (4.2) has the
form
namely a beta density for πi with parameters $g_1(\psi)$ and $N_i g_2(\psi) - g_1(\psi)$. The conditional
posterior of πi is then also a beta with parameters $g_1(\psi) + y_i$ and $N_i[g_2(\psi) + 1] - g_1(\psi) - y_i$.
The marginal density is the beta-binomial with
\[
p(y_i \mid g_1, g_2) = \binom{N_i}{y_i} \frac{Be(g_1 + y_i, N_i(g_2 + 1) - (g_1 + y_i))}{Be(g_1, N_i g_2 - g_1)}.
\]
with mean $\rho \in (0, 1)$, and where γ > 0, termed the spread parameter by Howley and Gibberd
(2003), is inversely related to the prior variance of the proportions $\rho(1 - \rho)/(1 + \gamma)$. The
conditional posterior for πi is
\[
\pi_i \sim Be(\gamma\rho + y_i, \gamma(1 - \rho) + N_i - y_i),
\]
\[
E(\pi_i \mid y, \gamma, \rho) = \frac{\gamma}{\gamma + N_i}\,\rho + \frac{N_i}{\gamma + N_i}\,\frac{y_i}{N_i},
\]
namely a weighted average of the observed rate and the prior mean rate. Shrinkage to the
prior mean is greater when γ is large and for small populations Ni. The marginal density is
\[
p(y_i \mid \gamma, \rho) = \binom{N_i}{y_i} \frac{Be(\gamma\rho + y_i, \gamma(1 - \rho) + N_i - y_i)}{Be(\gamma\rho, \gamma(1 - \rho))}
= \binom{N_i}{y_i} \frac{\Gamma(\gamma\rho + y_i)\Gamma(\gamma(1 - \rho) + N_i - y_i)\Gamma(\gamma)}{\Gamma(\gamma\rho)\Gamma(\gamma(1 - \rho))\Gamma(\gamma + N_i)},
\]
\[
V(y_i) = V[E(y_i \mid \pi_i)] + E[V(y_i \mid \pi_i)] = N_i\,\rho(1 - \rho)\,\frac{\gamma + N_i}{\gamma + 1}.
\]
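The beta-binomial marginal in the (ρ, γ) parameterisation is straightforward to evaluate with base R's `lchoose` and `lbeta`. The sketch below uses hypothetical values of `N`, `rho`, `gamma` and an observed count `y_obs`; it sums the density to recover the mean and the variance of the count yi, and computes the shrunken posterior mean of πi.

```r
# Beta-binomial pmf with prior mean rho and spread parameter gamma
dbetabinom <- function(y, N, rho, gamma) {
  exp(lchoose(N, y) +
        lbeta(gamma * rho + y, gamma * (1 - rho) + N - y) -
        lbeta(gamma * rho, gamma * (1 - rho)))
}

N <- 20; rho <- 0.3; gamma <- 5
y <- 0:N
p <- dbetabinom(y, N, rho, gamma)

bb_mean <- sum(y * p)                    # N * rho
bb_var  <- sum((y - bb_mean)^2 * p)      # N * rho * (1 - rho) * (gamma + N) / (gamma + 1)

# Posterior mean as a weighted average of prior mean and observed rate
y_obs     <- 12
post_mean <- (gamma / (gamma + N)) * rho + (N / (gamma + N)) * (y_obs / N)
```

Raising `gamma` with `N` fixed moves the posterior mean towards `rho`, and lowers the marginal variance towards the binomial value N ρ(1 − ρ).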
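\[
L(a, b, y) \propto \left[\frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)}\right]^{n} \prod_{i} \frac{\Gamma(a + y_i)\Gamma(b + N_i - y_i)}{\Gamma(a + b + N_i)}\; p(a, b),
\]
and mixed Gibbs–Hastings sampling to the joint conditional likelihood
\[
L(a, b, \pi, y) \propto \left[\frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)}\right]^{n} \prod_{i} \pi_i^{a + y_i - 1} (1 - \pi_i)^{b + N_i - y_i - 1}\; p(a, b).
\]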
They also consider implications for posterior parameter correlation of the reparameterisa-
tion (Lee and Sabavala, 1987)
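\[
\pi_i \sim Be(\mu, \eta), \qquad \mu = a/(a + b), \qquad \eta = 1/(1 + a + b),
\]
\[
y_i \mid \pi_i \sim Bin(N_i, \pi_i), \qquad \mathrm{logit}(\pi_i) = b_i, \qquad b_i \mid \mu, \tau \sim N(\mu, \tau^2).
\]
\[
p(\pi_i \mid \mu, \tau^2) = \frac{1}{\tau\sqrt{2\pi}} \exp\left\{-\frac{1}{2\tau^2}\left[\log\frac{\pi_i}{1 - \pi_i} - \mu\right]^2\right\} \frac{1}{\pi_i(1 - \pi_i)}.
\]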
The logistic-normal prior with τ = 2.67 and μ = 0 matches a Jeffreys prior on πi in the first
two moments, and setting τ = 1.69 matches the uniform prior in the first two moments
(Agresti and Hitchcock, 2005). As for the Poisson lognormal, one may generalise to heavier
tailed or skewed mixing densities. Teather (1984) proposes a family of symmetric prior
densities for logit(πi) that includes the normal and double exponential as special cases.
Alternative links (e.g. probit) or mixing over links are possible.
In many applications (e.g. studies with patients allocated to multiple treatments), the
random effect variation is representing differential frailty in the patient population of the
study, so that for studies i = 1, …, n with k = 1, …, K treatment categories
\[
\mathrm{logit}(\pi_{ik}) = b_i + \beta_k, \qquad b_i \sim N(0, \tau^2),
\]
where the βk are fixed treatment effects, while the bi can be interpreted as between study
variation in treatment effects. For example, Gao (2004) considers this structure for data
from Winship (1978) on a meta-analysis of eight randomised clinical trials comparing
healing rates in duodenal ulcer patients. For trials with treatment and control arms only,
with patient totals {N iT , N iC }, the logistic-normal model is often applied in meta-analysis
when trial totals are small, rather than adopting a normal approximation (Warn et al.,
2002; Parmigiani, 2002). In fact, other links (combined with binomial sampling) may be
more useful in clinical interpretability.
The prior structure often focuses on the control arm probabilities πiC, and on differences
between trial and control group probabilities. Thus assume
Then analysis of treatment-control differences δi on the log odds ratio scale would involve
transforms $w_{iT} = \mathrm{logit}(\pi_{iT})$, and $w_{iC} = \mathrm{logit}(\pi_{iC})$, and taking
\[
\delta_i = w_{iT} - w_{iC}, \qquad \delta_i \sim N(\Delta, \sigma_\delta^2).
\]
For the πiC, random effect options might be to take $w_{iC} \sim N(\mu_C, \tau_C^2)$, with $\{\mu_C, \tau_C^2\}$ as
additional unknowns, or $\pi_{iC} \sim Be(a_C, b_C)$ with $\{a_C, b_C\}$ additional unknowns.
Consider instead a log link, so that $w_{iT} = \log(\pi_{iT})$, and $w_{iC} = \log(\pi_{iC})$, again with
$\delta_i \sim N(\Delta, \sigma_\delta^2)$. The δi now measure log relative risks, which are often more clinically useful
than log odds ratios, and exp(Δ) will measure the relative risk of (say) recurrence or
mortality under the treatment. In practice, sampling has to be constrained to ensure δi is less
than −log(πiC), so that
\[
\delta_i \sim N(\Delta, \sigma_\delta^2).
\]
\[
w_{iC} = \pi_{iC}, \qquad \delta_i \sim N(\Delta, \sigma_\delta^2),
\]
sampling has to be constrained to ensure that $\pi_{iT} \in [0, 1]$. This involves confining δi to the
interval $[-\pi_{iC}, 1 - \pi_{iC}]$ with the actually sampled model specifying
If the control group probabilities are regarded as proxies for the underlying risk of subjects
in a study, then the model involves a regression on centred control group effects, namely
\[
\delta_i \sim N(\Delta, \sigma_\delta^2),
\]
where $\bar{w}_C$ is the average of the control arm effects (calculated at each iteration), and β is an
extra unknown.
4.7.2 Multinomial Mixtures
For representing overdispersion in multinomial data with M categories
\[
N_i = \sum_{m} y_{im},
\]
\[
p(\pi_i \mid \alpha) = \frac{\Gamma(A)}{\prod_{m=1}^{M} \Gamma(\alpha_m)} \prod_{m=1}^{M} \pi_{im}^{\alpha_m - 1},
\]
where $A = \sum_{m} \alpha_m$, so that prior means for πim are αm/A, with variances $\alpha_m(A - \alpha_m)/[A^2(A + 1)]$. The posterior
density for $[\pi_{i1}, \ldots, \pi_{iM}]$ is Dirichlet with parameters $(y_{i1} + \alpha_1, \ldots, y_{iM} + \alpha_M)$. Assuming equal
prior mass is assigned to all categories, namely $\alpha_1 = \alpha_2 = \ldots = \alpha_M$, there is greater shrinkage
or flattening towards an equal prior cell probability across the M categories as A increases.
Greater flexibility may be provided by a multivariate generalisation of the logistic-normal
prior (Aitchison and Shen, 1980; Hoff, 2003). Thus with $(y_{i1}, \ldots, y_{iM}) \sim Mult(N_i, [\pi_{i1}, \ldots, \pi_{iM}])$,
\[
\pi_{ij} = \frac{\exp(b_{ij})}{\sum_{m=1}^{M} \exp(b_{im})},
\]
where the vector $(b_{i1}, \ldots, b_{i,M-1})$ of the first M − 1 effects is multivariate normal with mean
$\mu_i = (\mu_{i1}, \ldots, \mu_{i,M-1})$ and covariance matrix Σ of dimension M − 1. For the reference category,
one sets biM = 0. If the categories are ordered and similarity of probabilities in adjacent
categories is expected on substantive grounds, the covariance matrix or its inverse may
be stipulated in line with a low order autoregressive form; this is known as “histogram
smoothing” (Leonard, 1973).
Another generalisation is to add a higher stage prior on the Dirichlet parameters, for
example, on the total mass A. Thus, Albert and Gupta (1982) consider a two-stage prior in
multinomial-Dirichlet analysis of contingency tables. With the reparameterisation $\alpha_m = A\rho_m$
where $\sum_{m} \rho_m = 1$, one possible hierarchical prior generalises the binomial-beta with
\[
A \sim Ga(a_A, b_A), \qquad (\rho_1, \ldots, \rho_M) \sim Dir(w_1, \ldots, w_M),
\]
\[
p_i = r_{i1} x_i + r_{i2}(1 - x_i).
\]
Among possible priors for the unknown ri1 and ri2 in a 2 × 2 ecological problem are:
Imai et al. (2008) typify ecological missing data as data “coarsening,” and the first two
priors above are consistent with coarsening at random. By contrast, the final option amounts
to modelling the joint density p(x, r) of racial composition x and turnout behaviour $r = (r_1, r_2)$
via the sequence p(x|r)p(r). This is similar to joint modelling of missingness and observed
data in non-random models for missing data (Pastor, 2003) and hence may be termed
“coarsening not at random.” If predictors of turnout rates are available, then the means μi1
and μi2 include regression terms.
\[
y_{iT} \sim Bin(N_{iT}, \pi_{iT}), \qquad y_{iC} \sim Bin(N_{iC}, \pi_{iC}).
\]
\[
\pi_{iC} \sim Be(a_C, b_C),
\]
with uniform priors on the unknowns, $a_C \sim U(1, 100)$ and $b_C \sim U(1, 100)$. Different
comparison scales can be defined. For example, on the log odds ratio scale
\[
w_{iT} = \mathrm{logit}(\pi_{iT}), \qquad w_{iC} = \mathrm{logit}(\pi_{iC}), \qquad \delta_i = w_{iT} - w_{iC}, \qquad \delta_i \sim N(\Delta, \sigma_\delta^2),
\]
and diffuse normal and inverse gamma priors on Δ and $\sigma_\delta^2$ respectively. Under an
absolute risk difference scale, one has instead
\[
w_{iT} = \pi_{iT}, \qquad w_{iC} = \pi_{iC}, \qquad \delta_i = w_{iT} - w_{iC},
\]
\[
y_{iT} \sim Bin(N_{iT}, \pi_{iT}), \qquad y_{iC} \sim Bin(N_{iC}, \pi_{iC}),
\]
with S assigned a gamma prior, and the θ parameters themselves assumed beta distrib-
uted. Under these assumptions, RR new has a 95% interval (0.3,2.2), with a 38% chance of
exceeding 1.
There are 41 studies in the terbinafine analysis, and each study has Ni patients and yi
patients with adverse reactions. The binomial logit-normal (BLN) representation
\[
y_i \mid \pi_i \sim Bin(N_i, \pi_i), \qquad \mathrm{logit}(\pi_i) = \mu + b_i, \qquad b_i \mid \tau \sim N(0, \tau^2),
\]
\[
y_i \mid \pi_i \sim Bin(N_i, \pi_i), \qquad \pi_i \sim Be(a, b).
\]
The latter can also be represented directly in rstan using the beta_binomial density.
Pooling across all studies, 111 of 3002 patients have adverse effects (around 3.7%).
The mixed replicate checking scheme is used to identify poorly fitted cases. In
the BLN representation, this is implemented by sampling replicate normal random
effects $b_{rep,i}$, and then the corresponding predicted totals of adverse reactions.
Poorly fitted cases are identified by extreme exceedance probabilities, namely
$p.exc_i = \Pr(y_{rep,i} > y_i \mid Y) + 0.5\Pr(y_{rep,i} = y_i \mid Y)$ under 0.05 or over 0.95. Two studies (19, 38)
are identified as poorly fitted (potential outliers), with study 19, containing 186 patients
but 0 adverse effects, having $p.exc_i = 0.96$.
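Given a vector of replicate draws for one study, the exceedance probability with its half-weight on ties is a one-line calculation. In the R sketch below the function name `mid_p_exceedance` and the replicate values are hypothetical; in practice `y_rep` would hold the sampled replicates for study i across MCMC iterations.

```r
# Mixed predictive exceedance probability: Pr(y_rep > y) + 0.5 * Pr(y_rep = y)
mid_p_exceedance <- function(y_rep, y_obs) {
  mean(y_rep > y_obs) + 0.5 * mean(y_rep == y_obs)
}

y_rep <- c(0, 0, 1, 2, 3, 0, 5, 1, 0, 2)    # hypothetical replicate counts for one study
p_exc <- mid_p_exceedance(y_rep, y_obs = 0)

# Flag the study if the probability falls in either tail
flagged <- p_exc < 0.05 || p_exc > 0.95
```

The half-weight on ties matters most for small counts, such as a study with zero observed events, where many replicates equal the observed value.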
In the hierarchical beta-binomial representation, we sample replicate $\pi_{rep,i}$ and then
the corresponding predicted adverse reactions. Now three studies are identified as
problematic: study 19 with $p.exc_i = 0.96$, and two studies (33 and 38) with relatively
high adverse reaction totals. Inferences regarding the mean adverse rate are similar
between the two approaches: the beta-binomial adverse mean rate is 3.44%, compared
to 3.45% under the binomial logitnormal (based on averaging over all samples of all πi).
However, sampled πi under the binomial logit-normal show greater positive skew than
the beta-binomial (1.61 vs 1.22), reflecting accommodation for the higher rates for some
studies. There is a similar contrast in outlier accommodation between the conjugate
Poisson-gamma mixture and the Poisson lognormal, as the tails of the lognormal are
heavier than for the gamma distribution (Connolly et al., 2009; Wang and Blei, 2017).
and poor predictions for a new unit (Hoff, 2003). For example, a normal random-effects analy-
sis of hospital mortality rates may shrink extreme rates considerably, and this might mask
potentially unusual results for units with smaller totals of patients at risk (Ohlssen et al., 2007).
Among the principles that govern robust smoothing and regression methods for non-
standard densities are discrete mixing of densities over K > 1 subpopulations (Bohning,
1999) and various types of local regression based on kernel or smoothness priors (Muller
et al., 1996). In this chapter, the focus is on discrete mixture modelling, where the Bayesian
approach has been coupled with many recent advances. These include the Bayesian ana-
logue to non-parametric maximum likelihood estimation, with MCMC implementation
as set out by Diebolt and Robert (1994), and Richardson and Green (1997), and numer-
ous developments of the Dirichlet process methodology, as reviewed by Hanson et al.
(2005). The Bayesian approach is flexible in terms of prior structures that can be imposed
in estimation, either grounded in substantive theory, or to improve definition of the
subgroups (e.g. Robert and Mengersen, 1999). On the other hand, repeated sampling without
appropriate parameter constraints is subject to “label switching,” since labelling of the
subgroups is arbitrary (Frühwirth-Schnatter, 2001; Chung et al., 2004).
where $p(y \mid S) = p(y \mid \psi_S)$ is the density for yi conditional on Si, and $\{\pi_1, \ldots, \pi_K\}$ are the prior
subgroup probabilities, with $\sum_{k=1}^{K} \pi_k = 1$. The unconditional or marginal density for a
single yi is $p(y_i \mid \pi, \psi) = \sum_{k=1}^{K} \pi_k\, p(y_i \mid \psi_k)$, so that for all n observations
\[
p(y \mid \pi, \psi) = \prod_{i=1}^{n} \sum_{k=1}^{K} \pi_k\, p(y_i \mid \psi_k).
\]
\[
S_i \sim P(S_i \mid \varphi)
\]
and at the lowest (first) stage the distribution of the observations $p(y \mid \varphi, S)$ depends on both
φ and $S = (S_1, \ldots, S_n)$. The joint distribution is therefore
\[
p(y, S, \varphi) = p(y \mid S, \varphi)\, p(S \mid \varphi)\, p(\varphi).
\]
\[
p(y \mid \pi, \mu, \sigma) = \sum_{k=1}^{K} \pi_k\, \phi(y \mid \mu_k, \sigma_k),
\]
and
\[
p(\mu_k \mid \sigma_k^2) = N\left(\xi_k, \frac{\sigma_k^2}{\kappa_k}\right).
\]
Also assume a Dirichlet prior for the unknown mixture probabilities
with α preset or possibly an extra unknown. Gibbs sampling then samples the missing
data (the allocation indicators) according to a multinomial density with probabilities at
iteration t,
Let $d_{ik}^{(t)} = 1$ if $S_i^{(t)} = k$ and $d_{ik}^{(t)} = 0$ otherwise. Suppose $N_k^{(t)} = \#\{S_i^{(t)} = k\}$ is the total number of
cases with $S_i^{(t)} = k$, that $m_k^{(t)} = \sum_i d_{ik}^{(t)} y_i / N_k^{(t)}$ is the average response for these cases, and that
$E_k^{(t)} = \sum_i d_{ik}^{(t)} (y_i - m_k^{(t)})^2$ is the sum of squared errors for this subgroup. Then, with
conditioning on remaining parameters understood, the πk are updated according to a Dirichlet
with parameters $\alpha + N_k^{(t)}$, the variances are updated according to
\[
\sigma_k^{2(t)} \sim IG\left(0.5[\nu_k + N_k^{(t)}],\; 0.5\left[V_k + E_k^{(t)} + \frac{N_k^{(t)} \kappa_k}{\kappa_k + N_k^{(t)}} (\xi_k - m_k^{(t)})^2\right]\right)
\]
and the subgroup means are updated according to
Diebolt and Robert (1994) suggest stabilising adjustments to these updates to improve
convergence. A refinement is to take the mixture proportions as subject specific as in
$(\pi_{i1}, \ldots, \pi_{iK}) \sim Dir(\alpha, \ldots, \alpha)$, and in the updates for $\pi_{ik}^{(t)}$, the $N_k^{(t)}$ are replaced by binary
indicators according to which class subject i is allocated to at a particular iteration.
issues are also not generally considered in the Dirichlet process approach (Section 4.9),
where the emphasis is on the smoothed unit means.
Identifying (usually ordering) constraints may be imposed on parameters to avoid label
switching (Roeder and Wasserman, 1997; Richardson and Green, 1997), providing what
may be termed “non-exchangeable priors” (Betancourt, 2017). Label switching or labelling
degeneracy refers to permuting the mixture component subscripts without altering the
likelihood (Redner and Walker, 1984). However, Celeux et al. (2000), Marin et al. (2005),
and Geweke (2007) consider drawbacks to such identifiability constraints (e.g. distortions
of the posterior distribution of the parameters). For example, in a normal mixture,
constraints may be imposed on prior masses πk (e.g. $\pi_1 > \pi_2 > \ldots > \pi_K$), on the subpopulation
means μk, or on the scale parameters σk. A preliminary MCMC sampling analysis
without parameter constraints may be used to assess the most suitable form of constraint
(Frühwirth-Schnatter, 2001). Another possibility is to use maximum likelihood solutions
(e.g. using the R package flexmix) to set constraints and/or relatively informative priors
that are sensible for the dataset. Re-analysis of the posterior output to impose a consistent
labelling is another possibility (Frühwirth-Schnatter, 2001), as are data-based priors, albeit
not fully Bayesian (Wasserman, 2000). For example, in a two-group model without regres-
sion on predictors, the unit with the maximum y value could be pre-labelled as belonging
to one or other subpopulation.
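The invariance that underlies label switching is easy to verify numerically. The R sketch below evaluates a two-component normal mixture log-likelihood under one labelling and under the permuted labelling; the data vector and the helper `mix_loglik` are illustrative assumptions.

```r
# Log-likelihood of a finite normal mixture with weights w, means mu, sds sigma
mix_loglik <- function(y, w, mu, sigma) {
  # n x K matrix of weighted component densities
  dens <- sapply(seq_along(w), function(k) w[k] * dnorm(y, mu[k], sigma[k]))
  sum(log(rowSums(dens)))
}

y <- c(-3.1, -2.4, -2.9, 2.2, 3.0, 2.6, -2.7, 3.3)    # hypothetical observations

ll_a <- mix_loglik(y, w = c(0.6, 0.4), mu = c(-2.75, 2.75), sigma = c(1, 1))
ll_b <- mix_loglik(y, w = c(0.4, 0.6), mu = c(2.75, -2.75), sigma = c(1, 1))  # labels swapped
```

Both calls return the same value, so without a constraint or relabelling step the sampler has no basis for preferring one labelling over the other.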
Particular types of parameterisation may be used to improve identification, such as
introducing dependence between the parameters ψk in different components so that they
are perturbations of one another (Robert and Mengersen, 1999). For example, a normal
mixture model with $\psi_k = (\mu_k, \sigma_k^2)$ would be based on taking $\{\theta_1, \sigma_1^2\}$ as reference parameters
and adopting the parameterisation
\[
\sigma_2 = \sigma_1 w_1, \qquad \sigma_3 = \sigma_2 w_2, \qquad \sigma_4 = \sigma_3 w_3,
\]
where $w_k \sim U(0, 1)$. With $\theta_1 = \mu_1$, the prior on the series of normal means takes a
perturbation form
\[
\mu_2 = \theta_1 + \sigma_1\theta_2, \qquad \mu_3 = \theta_1 + \sigma_1\theta_2 + \sigma_1\sigma_2\theta_3,
\]
\[
\pi_1 = p_1, \qquad \pi_2 = (1 - p_1)p_2, \qquad \pi_3 = (1 - p_1)(1 - p_2)p_3, \qquad \pi_K = (1 - p_1)(1 - p_2)\ldots(1 - p_{K-1})
\]
with $p_k \sim U(0, 1)$. This prior is still invariant under permutation of the cluster indices
and an identifying constraint is placed on the variances by taking $1 \ge w_1 \ge \ldots \ge w_{K-1}$. An
advantage of this representation is that an improper prior on $\{\mu_1, \sigma_1^2\}$ can be used (Robert
and Titterington, 1998). For the two group case, Basu (1996) presents the parameterisation
$\nu = \sigma_1^2/\sigma_2^2$ and $\Delta = (\mu_2 - \mu_1)/\sigma_1$ to test for normal or Student t unimodality as against
bimodality; posterior probabilities of unimodality are obtained using the results of Robertson
and Fryer (1968).
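The construction of the mixture weights πk from the auxiliary probabilities pk can be written compactly in R. The helper name `stick_weights` and the input values are illustrative assumptions.

```r
# Map p_1, ..., p_{K-1} in (0, 1) to K mixture weights that sum to one
stick_weights <- function(p) {
  K <- length(p) + 1
  remaining <- cumprod(c(1, 1 - p))     # mass left before each split: 1, (1-p1), (1-p1)(1-p2), ...
  c(p * remaining[-K], remaining[K])    # pi_k = p_k * mass left; the last weight takes the remainder
}

w <- stick_weights(c(0.5, 0.4, 0.25))   # K = 4 components
```

Drawing each `p` from U(0, 1) therefore always yields a valid weight vector, with no need for an explicit sum-to-one constraint.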
Celeux et al. (2000) and others apply post-processing to the MCMC output resulting
from a discrete mixture analysis without parameter constraints; the goal is to reconfigure
the output with a consistent labelling. Suppose there are p parameters in any subpopula-
tion. If MCMC convergence is assumed, one may select a short run of iterations (say S = 100
iterations) where there is no label switching to provide a reference labelling. The initial
run of parameter samples provides a base reference label sequence 1, 2, … , K (one among
the K! possible), and K means of dimension p, $\theta_k = \{\theta_{1k}, \theta_{2k}, \ldots, \theta_{pk}\}$, that can be permuted
to include all other remaining K! − 1 possible labelling schemes. In a subsequent run of R
iterations where label switching might occur, iteration r is assigned to that scheme (among
the K!) closest to it in distance terms and a relabelling applied if there has been a switch
away from the base reference label. Additionally, the means under the schemes are recal-
culated at each iteration S + r (Celeux et al., 2000, p.965).
Schemes for gaining identifiability can be applied within the MCMC sampling, as
illustrated in the rjags online code for the BUGS example concerning peak sensitivity
wavelengths (the Eyes example) (https://sourceforge.net/p/mcmc-jags/examples/ci/3765ddfd606e96c5de12818b50ef1b807f77af53/tree/classic-bugs/vol2/eyes/eyes.bug). Assume an
unconstrained analysis, with no constraints on the mixture parameters. Then, assuming
relabelling based on sampled means, processing resorts these sampled means, named say
m0[1:K] in the code, with identifiable means mu[K], mu[K−1],...,mu[1] defined according
to which of the m0[1:K] has the maximum value, the second highest, etc. Other mixture
parameters (weights and variances for each group in a normal mixture) are reassigned
using the same relabelling rule. This procedure corresponds to adopting a standard set of
labels or standard ordering to obtain an identified solution (Betancourt, 2017).
The rjags online code is for the case K = 2. For K = 3, assume a normal univariate mixture,
with reassignment based on the means, but applied also to resorting weights, from the
sampled P0[1:3] to the identified P[1:3]. Then one possible rjags code fragment is
Assume precisions tau0[1:K] are to be reassigned as well. A general code for larger K can
be written more compactly as follows:
This procedure is illustrated in Example 4.8. Which parameter is selected as the basis for
resorting (e.g. means or weights) may partly be decided using measures of fit.
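The same relabelling rule can also be applied after sampling, to stored draws in R. The sketch below is not the rjags fragment referred to above; it is a post-processing analogue under the assumption that unconstrained draws are held in matrices `m0`, `P0` and `tau0` (one row per iteration, one column per component), with made-up values and a hypothetical helper `relabel_by_means`.

```r
# Reorder each iteration's parameters so that the means are increasing
relabel_by_means <- function(m0, P0, tau0) {
  S <- nrow(m0); K <- ncol(m0)
  ord <- t(apply(m0, 1, order))                       # S x K: column indices sorting each row of m0
  idx <- cbind(rep(seq_len(S), times = K), as.vector(ord))
  list(mu  = matrix(m0[idx],   nrow = S),
       P   = matrix(P0[idx],   nrow = S),
       tau = matrix(tau0[idx], nrow = S))
}

# Three hypothetical iterations; the labels are switched in the second one
m0   <- rbind(c(-2.7, 2.8), c(2.9, -2.6), c(-2.8, 2.7))
P0   <- rbind(c(0.60, 0.40), c(0.38, 0.62), c(0.61, 0.39))
tau0 <- rbind(c(1.1, 0.9), c(0.8, 1.2), c(1.0, 1.0))

out <- relabel_by_means(m0, P0, tau0)
```

After relabelling, the weights and precisions in the second iteration travel with their means, so column 1 consistently refers to the component with the smaller mean.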
We illustrate this procedure with jagsUI applied to the randomly generated dataset used
in Betancourt (2017), consisting of a two-group Gaussian mixture with means (−2.75, 2.75),
prior weights P = (0.6,0.4), and variances 1 in both groups. Prior Dirichlet sample sizes of 2
are assumed. The code assumes a conditional likelihood (conditional on allocation indica-
tors) and is:
We obtain a solution with μ2 as the larger mean, with mean (sd) of 2.87 (0.05), and with
corresponding estimated weight p2 = 0.38. In this solution μ1 is the smaller mean, with pos-
terior mean (sd) of −2.73 (0.04), and with corresponding estimated weight p1 = 0.62. The esti-
mated weights reflect the actually sampled assignment indicator totals at line 7 of the code,
respectively sum(z==1) = 622 and sum(z==2) = 378. Convergence was attained in under 2000
iterations.
A less satisfactory result is obtained under the alternative scenario investigated by
Betancourt (2017) where the means are (−0.75,0.75), separated by only 1.5 standard
deviations. As before prior weights are P = (0.6,0.4) and variances are 1 in both groups. This time
prior Dirichlet sample sizes of 5 are assumed. Convergence is obtained in under 5000
iterations with this more informative prior, but the estimated means do not fully reproduce
the simulated values, being −0.50 (0.21) and 0.44 (0.38), with estimated weights of p =
(0.57,0.43). This demonstrates the identifiability issues present when components are not
widely separated.
$$y_i \sim \sum_{k=1}^{K} \pi_k\, \mathrm{Po}(\mu_k),$$
where πk is the prior probability that a unit belongs to sub-population k, with $\sum_{k=1}^{K} \pi_k = 1$.
Alternatively accounting for heterogeneity within subpopulations would involve K
Poisson-gamma subgroups
$$y_i \sim \mathrm{Po}(\mu_i),$$
$$\mu_i \sim \sum_{k=1}^{K} \pi_k\, \mathrm{Ga}(\alpha_k, \beta_k),$$
$$y_i \sim \mathrm{Po}(\mu_i),$$
$$\mu_i \sim \sum_{k=1}^{K} \pi_k\, \mathrm{LN}(\mu_k, \sigma_k^2),$$
$$\Pr(S_i = 1 \mid y = 0) = \frac{\pi}{\pi + (1 - \pi)\, p(y = 0 \mid \psi)}.$$
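As a numerical illustration of this probability, the following Python sketch evaluates it for a Poisson choice of p(y|ψ). It assumes that Si = 1 denotes the degenerate zero-generating component with prior probability π; the function name prob_structural_zero and the input values are hypothetical.

```python
import math

def prob_structural_zero(pi, mu):
    """Pr(S_i = 1 | y = 0) when p(y | psi) is Poisson with mean mu."""
    p_zero = math.exp(-mu)                   # Poisson probability of a zero count
    return pi / (pi + (1.0 - pi) * p_zero)

# Hypothetical values: prior probability 0.3 for the zero component, Poisson mean 2
print(round(prob_structural_zero(0.3, 2.0), 3))
```

The larger the Poisson mean, the less plausible a sampling zero becomes, so the probability that an observed zero comes from the degenerate component moves towards 1.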
The process generating the Si needs only to be considered for zero observations yi = 0, and
the complete data likelihood (assuming Si to be given) is
For example, if p(y|ψ) is taken to be Poisson with mean ψ = μ then E(y|π, μ) = (1 − π)μ and
$$y_i \sim \sum_{k=1}^{K} \pi_{ik}\, p_k(y_i \mid \psi_k),$$
$$\pi_{ik} = \frac{e^{z_{ik}}}{1 + \sum_{j=1}^{K-1} e^{z_{ij}}}, \quad k = 1, \ldots, K-1,$$
$$\pi_{iK} = \frac{1}{1 + \sum_{j=1}^{K-1} e^{z_{ij}}},$$
where the $\{z_{ik}, k = 1, \ldots, K-1\}$ are multivariate normal with mean ν and variance $\Sigma_z$. For
example, Hoff (2003) argues for the use of normal mixtures in density smoothing and,
in this case, the $p_k(y \mid \psi_k)$ would be univariate or multivariate normal themselves. This
approach generalises to multivariate skew-normal or multivariate Student t densities, and
can be adapted to allow non-exchangeable mixture priors, as in histogram smoothing
(Leonard, 1973).
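The logistic transformation from K − 1 unconstrained values to K mixture weights, with group K as reference category, can be sketched as follows. This is a minimal Python illustration; the function name logit_weights and the example inputs are hypothetical.

```python
import math

def logit_weights(z):
    """Map K-1 unconstrained values z_i1, ..., z_i,K-1 to K weights summing to 1."""
    denom = 1.0 + sum(math.exp(v) for v in z)
    weights = [math.exp(v) / denom for v in z]   # pi_ik for k = 1, ..., K-1
    weights.append(1.0 / denom)                  # pi_iK, the reference group
    return weights

# With all z equal to zero, every group receives the same weight 1/K
print(logit_weights([0.0, 0.0]))
```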
Instead of subject-specific zik, one may also assume a single vector $\{z_1, \ldots, z_{K-1}\}$ to be
multivariate normal. For unique identification of the subgroups one may impose order
constraints on the parameters in ψk or on those underlying $\{z_1, \ldots, z_{K-1}\}$. In the univariate
normal case with $\psi_k = \{\mu_k, \sigma_k^2\}$, one might assume an ordering either on the means μk, or on
the means νk of the zik.
$$y_i \sim \sum_{k=1}^{K} \rho_k\, N(\mu_k, \sigma_k^2),$$
with y being measured in thousands of kilometres per second. Classical analysis using
the flexmix package (Leisch, 2004) in R shows a better AIC and BIC for 5 clusters. The
mclust program selects K = 4 as optimal, and for the K = 6 solution selects an equal vari-
ance solution. The K = 4 and K = 5 solutions have the drawback of a large variance in the
group with the largest mean.
Bayesian studies such as Ishwaran and James (2002) find at least 5–6 clusters with
a Dirichlet process approach, and under an inverse gamma prior for the $\sigma_k^2$. They do,
however, find only four clusters when a uniform prior U(0,20.83) is used for $\sigma_k^2$, with
20.83 being the observed variance, V(y). Ando (2007) reports six clusters (assuming a
monotonic constraint on the μk) via several model fit criteria, and K = 6 is also the best
fitting using the sBIC criterion of Drton and Plummer (2017, p.350).
Here we compare solutions with K = 4, K = 5 and K = 6. First of all, the rstan ordered
vector parameterisation will be used, following Betancourt (2017) and Savage (2016). A
half-Cauchy(0,2) is assumed for the group standard deviations. Prior Dirichlet sample
sizes α of 2 and 4 are also compared. For K = 5 and K = 6, estimation is with 2 chains
and 10,000 iterations. For K = 4, a higher number of iterations (50,000) is needed for
convergence.
With α = 2, respective posterior mean total log-likelihoods, namely
$\log\left[\sum_{k=1}^{K} \rho_k\, \phi(y \mid \mu_k, \sigma_k)\right]$, are −206.0, −205.8 and −206.5, with respective
LOO-IC 421.7, 422.4 and 424.2. So there is little to separate these solutions in terms of fit.
With α = 4, the posterior mean log-likelihoods are −206.4, −205.7 and −206.2, with respective
LOO-IC being 421.2, 421.4 and 422.8. The rstan solutions generally show the lowest mean group
with a mean lower than the minimum, namely 9.17, of the observed data points. This
can be taken as generalising beyond the observed data.
We also implement jagsUI with the latent means constrained to lie between the minimum
and maximum of the observations. MCMC convergence is focused on relabelled
parameters, using the standard labelling approach set out above [1]. Convergence is
problematic with independent priors on the precisions $\psi_k = 1/\sigma_k^2$ when K > 4. Improved
convergence is obtained if a hierarchical prior is adopted instead, namely $\psi_k \sim \mathrm{Ga}(a_\psi, b_\psi)$,
where aψ and bψ are assigned E(1) priors. This is an intermediate option between independent
priors and assuming the same variance across all groups. As noted by Baudry
et al. (2010), the most appropriate number of mixture components may not guarantee
well-separated groups. To assess cluster overlap, we use the entropy measure
$-2\sum_i \sum_k d_{ik} \log(\rho_{ik})$ (Scrucca et al., 2016, p.297); another form, with effective numerical
equivalence, is $-2\sum_i \sum_k \rho_{ik} \log(\rho_{ik})$.
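The two entropy forms can be computed from one iteration's allocation probabilities and sampled labels, as in the following Python sketch. It assumes that the first form weights log-probabilities by a 0/1 indicator of the sampled allocation, as in the equals(S[i],j) term of the code in the Computational Notes; the function name entropy_measures and the toy inputs are hypothetical.

```python
import math

def entropy_measures(rho, S):
    """rho[i][k]: allocation probability of unit i to group k; S[i]: sampled group (0-based)."""
    # Form 1: log-probability of the group actually sampled for each unit
    ent1 = -2.0 * sum(math.log(rho[i][S[i]]) for i in range(len(S)))
    # Form 2: probability-weighted version, with 0 * log(0) taken as 0
    ent2 = -2.0 * sum(r * math.log(r) for row in rho for r in row if r > 0.0)
    return ent1, ent2

# Toy case: unit 1 is split evenly between two groups, unit 2 is allocated with certainty
rho = [[0.5, 0.5], [1.0, 0.0]]
S = [0, 0]
print(entropy_measures(rho, S))
```

Units allocated with certainty contribute nothing to either measure, so larger values indicate more overlap between groups.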
For K = 4, 5 and 6 respective posterior mean log-likelihoods are −205.6, −204.4, and
−204.4, so BIC-type penalised fit measures (with respective penalties 48.5, 61.7, and 74.9)
would favour K = 4. Respective posterior mean entropies are 59, 87, and 109, so penalisa-
tion by entropy (Biernacki et al., 2000) would also decisively favour K = 4. The LOO-IC
measures also favour K = 4, with the values for K = 4, K = 5, and K = 6 being respectively
357, 370, and 376. A solution with K = 3 was also run, which gave a mean log-likelihood
of −216.4 and an entropy of 68.3. The estimated group means under K = 4 are 9.7, 19.9,
22.4, and 28.0, with group probabilities 0.10, 0.33, 0.42, and 0.15 via jagsUI.
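The BIC-type penalties quoted above can be reproduced with a short calculation. The Python sketch below assumes a penalty of the form p log(n) with p = 3K − 1 free parameters (K means, K variances and K − 1 free weights) and n = 82 observations; this parameter count is an assumption made here for illustration rather than a statement taken from the text.

```python
import math

n = 82  # number of galaxy velocities (N = 82 in the data list of the Computational Notes)

def bic_penalty(K, n):
    """Penalty p * log(n), assuming p = 3K - 1 free parameters."""
    p = 3 * K - 1          # K means + K variances + (K - 1) free weights
    return p * math.log(n)

penalties = [round(bic_penalty(K, n), 1) for K in (4, 5, 6)]
print(penalties)  # [48.5, 61.7, 74.9]
```

Adding each penalty to minus twice the corresponding posterior mean log-likelihood gives the penalised measure; since the log-likelihoods differ by at most 1.2 across K, the penalty dominates the comparison.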
By comparison, mclust provides estimated means of 9.7, 19.8, 22.9, and 24.5 with
respective probabilities 0.08, 0.39, 0.37, and 0.16, and bayesmix (Gruen and Plummer,
2015) provides estimated means of 10.3, 20.4, 22.5, and 30.5 with respective probabilities
0.09, 0.45, 0.39, and 0.06. The bayesmix run used the code:
Predictive checks for K = 4 under a hierarchical prior for ψk show only one exceedance
probability under 0.1 or over 0.9.
prior G0, the expectation of G, and a precision or mass parameter α governing the
concentration of the prior for G about its mean G0. For any partition $A_1, \ldots, A_M$ on
the support of G0, the vector $\{G(A_1), \ldots, G(A_M)\}$ of probabilities G(Am) contained in the
set $\{A_m, m = 1, \ldots, M\}$ follows a Dirichlet distribution $D(\alpha G_0(A_1), \ldots, \alpha G_0(A_M))$. Such an
approach may be termed semiparametric as it involves a parametric model at the first
stage for the observations, but a non-parametric model at the second stage (Basu and
Chib, 2003).
Original forms of the DP prior assumed G0 to be known (fixed). One problem with a
Dirichlet process when G0 is known is that it assigns a probability of 1 to the space of dis-
crete probability measures (Hanson et al., 2005, p.249). An alternative is to take the param-
eters in G0 to be unknown, and to follow a set of parametric distributions, with possibly
unknown hyperparameters, resulting in a mixture of Dirichlet process or MDP model
(Walker et al., 1999, p.489). Computational procedures for such models are discussed by
Jara (2007), Ohlssen et al. (2007), Jara et al. (2011), Burr (2012), Karabatsos (2016), Karabatsos
(2017), with associated R packages including DPpackage (Jara et al., 2011), and bspmma
(Burr, 2012).
Following West et al. (1994), assume conventional first-stage sampling densities
$y_i \sim p(y_i \mid b_i, \psi)$, with distributions $P(y_i \mid b_i, \psi)$. The uncertainty about the appropriate
form of prior arises about the distribution G for the latent effects bi. Under a DP prior,
any set of unit-specific parameters $\{b_1, \ldots, b_n\}$ generated from G lies in a set of K ≤ n
distinct values $\{\zeta_1, \ldots, \zeta_K\}$ which are sampled from G0. The concentration parameter α
governing the closeness of G to G0 can be taken as an unknown, or assigned a preset value
(e.g. α = 1) (Da Silva, 2009). The number of distinct values or clusters K is stochastic,
with an implicit prior determined by α, with limiting mean $\alpha \log(1 + n/\alpha)$. Note that the
posterior mean of K is not necessarily a reliable guide to the number of components in
the data or effects (e.g. components with substantive meaning), though it can be interpreted
as an upper bound on the number of components (Ishwaran and Zarepour, 2000,
pp.381–382).
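The implied prior mean for the number of clusters is easy to evaluate for given α and n. The Python sketch below compares the limiting expression α log(1 + n/α) with the exact prior expectation under the Polya urn, in which each successive unit opens a new cluster with probability α/(α + i − 1). Function names are hypothetical and the value n = 82 is chosen only for illustration.

```python
import math

def limiting_mean_clusters(alpha, n):
    """Limiting prior mean of K: alpha * log(1 + n / alpha)."""
    return alpha * math.log(1.0 + n / alpha)

def exact_prior_mean_clusters(alpha, n):
    """Exact prior mean of K: sum over units of the probability of opening a new cluster."""
    return sum(alpha / (alpha + i) for i in range(n))   # i = units already assigned

for alpha in (0.5, 1.0, 5.0):
    print(alpha,
          round(limiting_mean_clusters(alpha, 82), 2),
          round(exact_prior_mean_clusters(alpha, 82), 2))
```

Both quantities increase with α, which is why the prior on α matters for the number of clusters.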
Given the realised number of clusters K (at any particular MCMC iteration), the bi are
sampled from the set $\{\zeta_1, \ldots, \zeta_K\}$ according to a multinomial distribution. Define cluster
indicators $S = \{S_1, \ldots, S_n\}$, where Si = k if bi = ζk, and denote $N_k = \#\{S_i = k\}$ as the total number
of units with Si = k (i.e. units in the same cluster with a common value ζk for the second
stage latent effect). If α is taken as unknown, its prior is important in determining the
number of clusters. Taking $\alpha \sim \mathrm{Ga}(\eta_1, \eta_2)$ where η1 and η2 are relatively large will tend to
discourage unduly small or large values for α. Typical values are η1 = η2 = 1 or η1 = η2 = 2,
though taking η2 > η1 as in {η1 = 2, η2 = 4} tends to encourage repetitions in the ζk, and can
be used to assess the number of components present in the data (Ishwaran and Zarepour,
2000, p.377). It is clear that the parameters used in the prior for α may affect the number of
components, but typically there is less concern with this aspect in non-parametric mixture
modelling (Leslie et al., 2007).
Consider the assignment of a latent effect bi to a particular unit, given that the remaining
n − 1 latent effects $b_{[i]} = \{b_1, \ldots, b_{i-1}, b_{i+1}, \ldots, b_n\}$ are already assigned. Also let S[i] be
a particular configuration of the remaining n − 1 effects b[i] into K[i] distinct values, with
$N_{[i]k} = \#\{S_j = k, j \neq i\}$ denoting the total of those n − 1 units having a common value $\zeta_{[i]k}$.
Then the conditional prior for bi follows a Polya urn scheme (West et al., 1994; Hanson
et al., 2005, p.252; Dunson et al., 2007, p.165)
$$(b_i \mid b_{[i]}, S_{[i]}, K_{[i]}, \alpha) \sim \frac{\alpha}{\alpha + n - 1}\, G_0 + \frac{1}{\alpha + n - 1} \sum_{k \neq i} \delta(b_k),$$
$$\sim \frac{\alpha}{\alpha + n - 1}\, G_0 + \frac{1}{\alpha + n - 1} \sum_{k=1}^{K_{[i]}} N_{[i]k}\, \delta(\zeta_{[i]k}), \qquad (4.5)$$
$$(b_{n+1} \mid b, S, K, \alpha) \sim \frac{\alpha}{\alpha + n}\, G_0 + \frac{1}{\alpha + n} \sum_{k=1}^{K} N_k\, \delta(\zeta_k).$$
Predictions of the first stage response for unit n + 1 are obtained as
$$(y_{n+1} \mid b, S, K, \alpha) \sim \frac{\alpha}{\alpha + n}\, P_{n+1}(\cdot \mid \zeta_{n+1}) + \frac{1}{\alpha + n} \sum_{k=1}^{K} N_k\, P_{n+1}(\cdot \mid \zeta_k),$$
where ζn+1 is an extra draw from G0. Predictions beyond n + 1 may be relevant in panel or
time series applications (Hirano, 1998).
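The sequential form of the Polya urn scheme can be simulated directly: each new effect is a fresh draw from G0 with probability α/(α + n), and otherwise repeats an existing value. The Python sketch below is a prior simulation only (no likelihood is involved), with G0 taken as N(0,1) purely for illustration; the function name polya_urn_sample is hypothetical.

```python
import random

def polya_urn_sample(n, alpha, draw_g0, rng):
    """Draw b_1, ..., b_n sequentially from the Polya urn implied by a DP(alpha G0) prior."""
    b = []
    for i in range(n):                       # i effects are already assigned
        if rng.random() < alpha / (alpha + i):
            b.append(draw_g0(rng))           # new value from G0
        else:
            b.append(rng.choice(b))          # existing value zeta_k, chosen with probability N_k / i
    return b

rng = random.Random(1)
b = polya_urn_sample(50, 1.0, lambda r: r.gauss(0.0, 1.0), rng)
print(len(set(b)), "distinct values among", len(b), "effects")
```

Choosing uniformly among the effects already drawn is equivalent to choosing cluster k with probability proportional to its size N_k, which is the clustering behaviour expressed by the second term of the urn.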
In terms of Gibbs sampling, (4.5) implies conditional posteriors (West et al., 1994, p.367;
Ishwaran and James, 2001, p.166)
$$q_{i0} = \int p(y_i \mid b_i)\, g_0(b_i)\, db_i \qquad (4.6.1)$$
Normalising the values αqi0 and qik to probabilities $\{\rho_{i0}, \rho_{i1}, \ldots, \rho_{iK_{[i]}}\}$ summing to 1, the
conditional posteriors for the subgroup indicators are then
where Si = 0 corresponds to drawing a new sample from G0 under the Polya urn scheme.
$$y_i \mid b_i \sim p(y_i \mid b_i),$$
$$b_1, \ldots, b_n \mid G,$$
$$G \mid \alpha, G_0 \sim DP(\alpha G_0),$$
where $\{\psi_1, \ldots, \psi_p\}$ are unknown, and also possibly some of the defining ξ parameters.
Consider a normal mixture with both means and variances possibly differing for each
unit (Cao and West, 1996; Hirano, 2002), namely
$$y_i \sim N(\mu_i, \sigma_i^2),$$
$$(\mu_i, \sigma_i^2) \sim G,$$
$$G \sim DP(\alpha G_0),$$
$$\mu_i \sim p_{01}(\mu_i \mid \xi_1),$$
with ξ1 and ξ2 possibly including further unknowns. For example, Hirano (2002) takes
$$1/\sigma_i^2 \sim \chi^2(s)/(sQ),$$
and
$$\mu_i \sim N(m, c\sigma_i^2),$$
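$$q_{i0} = \int \frac{1}{\sigma_i \sqrt{2\pi}}\, e^{-(y_i - \mu_i)^2 / 2\sigma_i^2}\, g_0(\mu_i, \sigma_i^2)\, d\mu_i\, d\sigma_i^2,$$
$$q_{ik} = N_{[i]k}\, \frac{1}{\sigma_k \sqrt{2\pi}}\, e^{-(y_i - \mu_k)^2 / 2\sigma_k^2}, \quad k > 0.$$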
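For the normal case, the allocation probabilities for a single unit follow by evaluating each qik and normalising together with αqi0. The Python sketch below takes qi0 as a supplied number, since the integral depends on the choice of G0; the function name allocation_probs and the inputs are hypothetical.

```python
from statistics import NormalDist

def allocation_probs(y_i, alpha, q_i0, clusters):
    """Normalised weights (rho_i0, rho_i1, ...) for unit i.

    clusters: list of (N_k, mu_k, sigma_k) describing the other n - 1 units.
    q_i0: marginal density of y_i under G0, supplied by the caller.
    """
    weights = [alpha * q_i0]                  # weight for a new draw from G0 (S_i = 0)
    for N_k, mu_k, sigma_k in clusters:
        weights.append(N_k * NormalDist(mu_k, sigma_k).pdf(y_i))   # q_ik
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical state: two existing clusters of sizes 5 and 3
rho = allocation_probs(0.4, alpha=1.0, q_i0=0.1,
                       clusters=[(5, -1.0, 1.0), (3, 1.0, 1.0)])
print([round(r, 3) for r in rho])
```

A cluster indicator can then be sampled from these probabilities, for example with random.choices, where index 0 corresponds to a new value from G0.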
As other examples, Chib and Hamilton (2002) consider a potential outcomes model for
panel data with DP errors, while Kleinman and Ibrahim (1998) consider Gibbs updates
in an MDP framework for parameters in general linear mixed models for nested data.
For example, let Xi and Zi be predictors of dimension q and r (possibly overlapping), and
consider repeated data yit over subjects i, with observation vectors $y_i = (y_{i1}, \ldots, y_{iT})$, and first
stage model
$$y_i \sim N(X_i \beta + Z_i b_i, \sigma^2),$$
where one may assume conventional normal and inverse gamma priors for β and σ².
However, for $b_i = (b_{i1}, \ldots, b_{ir})$, greater flexibility is obtained by taking
$$b_i \sim G,$$
$$G \sim DP(\alpha, G_0),$$
$$b_i \sim \sum_{k=1}^{\infty} \pi_k\, h(b_i \mid \psi_k).$$
This approach is called a Dirichlet process mixture by Hanson et al. (2005, p.250), and
a dependent Dirichlet process by Dunson et al. (2007, p.164). For practical application,
Ishwaran and Zarepour (2000) and Ishwaran and James (2002) suggest the infinite repre-
sentation be approximated by one truncated at M ≤ n components with
$$g(b) = \sum_{m=1}^{M} \pi_m\, h(b \mid \psi_m),$$
where the πm are sampled by introducing M − 1 beta distributed random variables,
$$V_m \sim \mathrm{Be}(c_m, d_m),$$
with VM = 1 to ensure the random weights πm sum to 1 (Ishwaran and James, 2001;
Sethuraman, 1994). Then
$$\pi_1 = V_1,$$
$$\pi_m = (1 - V_1)(1 - V_2) \cdots (1 - V_{m-1}) V_m, \quad m > 1.$$
This method of generation is known as stick-breaking, since at each stage, the procedure
randomly breaks what is left of a stick of unit length and assigns the length of the break to
the current πm. Griffin (2016) proposes an adaptive technique for selecting the truncation
point in truncated DP priors. Recent applications include Prabhakaran et al. (2016) and Hu
et al. (2018). It may be noted that rstan can use the TDP principle to estimate mixtures, but
taking M as a known rather than maximum number of components [2].
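The stick-breaking construction can be sketched in a few lines. The Python fragment below generates one set of truncated weights with Vm ∼ Be(1, α) and VM = 1; the function name stick_breaking_weights, the seed and the values of α and M are hypothetical choices for illustration.

```python
import random

def stick_breaking_weights(alpha, M, rng):
    """Truncated stick-breaking: V_m ~ Be(1, alpha) for m < M, and V_M = 1."""
    weights = []
    remaining = 1.0                          # length of stick not yet assigned
    for m in range(1, M + 1):
        v = 1.0 if m == M else rng.betavariate(1.0, alpha)
        weights.append(remaining * v)        # pi_m = V_m * prod_{l<m} (1 - V_l)
        remaining *= 1.0 - v
    return weights

pi = stick_breaking_weights(alpha=1.0, M=10, rng=random.Random(42))
print([round(p, 3) for p in pi])
```

Setting the final break to 1 assigns whatever is left of the stick to the last component, which is what makes the truncated weights sum to 1.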
Following Pitman and Yor (1997), the beta parameters $\{c_m, d_m\}$ in the prior for Vm can be
written $c_m = 1 - C$, $d_m = D + mC$, where $C \in [0, 1)$ and $D > -C$. For an infinite dimensional
mixture, the Dirichlet process is obtained by taking C = 0 and D = α, so that $V_m \sim \mathrm{Be}(1, \alpha)$.
When a finite (truncated) mixture is used, setting
$$c_m = 1 + \frac{\alpha}{M},$$
$$d_m = \alpha - \frac{m\alpha}{M} = \alpha\left(1 - \frac{m}{M}\right),$$
or
$$V_m \sim \mathrm{Be}(1, \alpha),$$
and M large is equivalent to the infinite DP for practical purposes (Ishwaran and
James, 2002; Ishwaran and Zarepour, 2000, p.383). If a Ga(η1,η2) prior is used for α, its full
conditional is $\alpha \sim \mathrm{Ga}(M + \eta_1 - 1, \eta_2 - \log(\pi_M))$ (Ishwaran and Zarepour, 2000, p.387). The
realised number of clusters is K ≤ M as above, and Ishwaran and James (2002) suggest
AIC and BIC penalties based on K that can be used for model selection.
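The full conditional for α can be sampled directly. The Python sketch below assumes that Ga(a, b) denotes a gamma density with shape a and rate b, and notes that Python's gammavariate is parameterised by shape and scale, so the rate is inverted. The function name draw_alpha and the input values are hypothetical.

```python
import math
import random

def draw_alpha(M, eta1, eta2, pi_M, rng):
    """One draw of alpha from Ga(M + eta1 - 1, eta2 - log(pi_M)), shape-rate form assumed."""
    shape = M + eta1 - 1.0
    rate = eta2 - math.log(pi_M)             # log(pi_M) <= 0, so rate >= eta2
    return rng.gammavariate(shape, 1.0 / rate)   # second argument is a scale

rng = random.Random(0)
draws = [draw_alpha(M=20, eta1=2.0, eta2=2.0, pi_M=0.01, rng=rng) for _ in range(20000)]
print(round(sum(draws) / len(draws), 2))
```

A small final weight πM (little stick left over) gives a larger rate and hence smaller draws of α, while a large πM pushes α upwards.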
Taking $V_m \sim \mathrm{Be}(\alpha, 1)$ rather than $V_m \sim \mathrm{Be}(1, \alpha)$ in the truncated stick-breaking scheme
means that larger values of α now imply greater clustering into a few sub-populations.
This is an example of the beta process priors considered by Ishwaran and Zarepour (2000).
Other truncated mixture sampling schemes that start with a prior on α to give an implicit
prior on a stochastic K are available. For example, Ishwaran and Zarepour (2000, p.376)
consider taking α as an unknown in
$$(\pi_1, \ldots, \pi_M) \sim D\left(\frac{\alpha}{M}, \frac{\alpha}{M}, \ldots, \frac{\alpha}{M}\right).$$
Alternatively, Green and Richardson (2001, p.357) start off with a prior on K and then select
the cluster indicators from a multinomial vector with probabilities $p(S_i = k) = \pi_k$, where
$(\pi_1, \ldots, \pi_K)$ follow a Dirichlet density $D(\delta, \ldots, \delta)$. They refer to this as an explicit
allocation prior and show how the DP prior is obtained as K → ∞ and δ → 0 in such a way that
$K\delta \to \alpha > 0$.
The partition probabilities at second and subsequent stages are unknown. Let ε denote
a sequence of 0s and 1s. For example, suppose B1 is selected at step 1, and B11 is selected at
step 2, then ε = [1,1]. The choice at the next stage between sets Bε0 and Bε1 (i.e. between B110
and B111) is governed by probabilities $(C_{\varepsilon 0}, C_{\varepsilon 1})$, with a beta prior for Cε0, and $C_{\varepsilon 1} = 1 - C_{\varepsilon 0}$.
The canonical form for the prior on the partition probabilities at partition m is
$$C_{\varepsilon 0} \sim \mathrm{Be}(c_m, c_m),$$
$$c_m = d m^2,$$
where d may be taken as an extra unknown. The Dirichlet process occurs when $c_m = d/2^m$,
so that cm → 0 as m → ∞, whereas cm → ∞ as m → ∞ is appropriate if the underlying
distribution G is expected to be continuous.
While theoretically the completely continuous case corresponds to m → ∞, in practice
the partitioning is truncated at a finite value M. Hanson and Johnson (2002) recommend
$M = \log_2(n)$ where n is the sample size. The partitions can be taken to coincide with
percentiles of G0, so for example
and so on.
Let dki at partition k, and option i, be a re-expression of the Bε (e.g. for k = 3, d31 = B000,
d32 = B001, d33 = B010, d34 = B011, d35 = B100, d36 = B101, d37 = B110, d38 = B111). Then at partition k, for
$i = 1, \ldots, 2^k$, the interval boundaries are
$$d_{ki} = \left(G_0^{-1}\left(\frac{i-1}{2^k}\right),\; G_0^{-1}\left(\frac{i}{2^k}\right)\right),$$
$$y_i \sim \mathrm{Po}(\mu_i),$$
$$\log(\mu_i) = \beta + \sigma b_i.$$
Then G0 for $v_i = \sigma b_i$ is a N(0, σ²) density, with G0 for bi being a N(0,1) density. So with M = 3
levels, the relevant ordinates from G0 for defining the 8 intervals are
(−1.15, −0.67, −0.32, 0, 0.32, 0.67, 1.15).
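The ordinates follow from evaluating the quantile function of G0 at i/2^M. A short Python check for G0 = N(0,1) and M = 3 is given below; the function name partition_cutpoints is hypothetical.

```python
from statistics import NormalDist

def partition_cutpoints(level, g0=NormalDist(0.0, 1.0)):
    """Interior boundaries G0^{-1}(i / 2^level), i = 1, ..., 2^level - 1, of the partition."""
    n_sets = 2 ** level
    return [g0.inv_cdf(i / n_sets) for i in range(1, n_sets)]

cuts = [round(c, 2) for c in partition_cutpoints(3)]
print(cuts)
```

Rounded to two decimals, the seven cut points agree with the ordinates quoted above; the two outermost of the 8 intervals extend to −∞ and +∞.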
indicator Si = k, and with $\{\zeta_1, \ldots, \zeta_M\}$ sampled from G0. The realised number of clusters is
K ≤ M, where a maximum of M = 50 possible normal clusters $N(\mu_m, \tau_m^2)$ are assumed as
potential second stage priors. The M potential parameter pairs $\{\mu_m, \tau_m^2\}$ defining G0 are
respectively sampled from normal densities with means $\mu_m \sim N(m_\mu, 1)$, where mμ is itself
unknown, and from exponential densities, with $1/\tau_m^2 \sim E(1)$.
Then $V_m \sim \mathrm{Be}(1, \alpha)$, with an exponential E(1) prior assumed on the concentration
parameter α, and with a lower sampling limit of 0.25 for numeric stability. A mixed predictive
check is based on sampling replicate $\{\zeta_{rep,1}, \ldots, \zeta_{rep,M}\}$ from G0, and taking $b_{rep,i} = \zeta_{rep,k}$.
A two-chain run of 5,000 iterations using rube shows convergence in α, K, and the
realised latent effects b. The posterior mean and median of K are respectively 3.9 and 4,
supporting a relatively small number of components in the second-stage prior of NRT
effects; α has a posterior mean of 0.85. Mixed predictive checks are satisfactory, with
none exceeding 0.9 or being under 0.1.
A plot of the posterior means of the bi does not show sharply distinct subgroups
(Figure 4.1), though outlier random effects can be seen, such as trials 4, 36, and 59.
However, the effects show more peakedness than under a normal density (superim-
posed plot).
The analysis is also run using a Pitman–Yor prior, with $V_m \sim \mathrm{Be}(1 - C, D + mC)$, where
$C \in [0, 1)$ and $D > -C$, and with a maximum of M = 20 clusters. This is implemented
using R2OpenBUGS with a two-chain run of 20,000 iterations. A uniform U(0,1) prior is
adopted on C, with D obtained as D = D1 − C, where D1 is assigned a Ga(1, 0.01)
prior. This analysis provides posterior means (sd) for C and D of 0.55 (0.25) and 132 (101),
with the mean number of clusters being 4.7. Posterior means for bi are similar to those of
the first analysis, the correlation between them exceeding 0.95, while exceedance prob-
abilities again show no model failure.
FIGURE 4.1
Nicotine replacement. Estimated random effects.
Data on log odds ratios yi and their variances $s_i^2$ in n = 14 trials are considered by Burr
and Doss (2005), and relate to mortality after treatment vs control comparison for
decontamination of the digestive tract. Assumptions under the normal-normal model
(equations 4.4.1 to 4.4.3) are cast into doubt by quantile plots of the yi. We consider a truncated
DP prior (with M = n = 14), with second-stage effects bi = ζk when allocation indicators
Si = k, and
$$\zeta_k \sim N(\mu_k, \tau^2),$$
$$\mu_k \sim N(m_\mu, \tau_\mu^2).$$
Both mμ and $\tau_\mu^2$ are unknowns, assigned N(0, 100) and Ga(1, 0.01) priors. The second-stage
variance parameter τ² is also assigned a Ga(1, 0.01) prior.
Analysis compares the DPMmeta option in the R library DPpackage, and a BUGS
code estimated using rube in R, with $V_m \sim \mathrm{Be}\left(1 + \frac{\alpha}{M},\, \alpha\left(1 - \frac{m}{M}\right)\right)$ in the stick-breaking
prior. Either computing option suggests α is not strongly identified by the data:
alternative settings for a0 and b0 in $\alpha \sim \mathrm{Ga}(a_0, b_0)$ tend to carry over to the estimated
α. So alternative preset values such as α = 1, α = 10, etc. may be adopted instead
(Burr, 2012).
With the setting α = 10, DPMmeta shows a mean of around 9 realised clusters, as
against 6.8 under the TDP prior. Treatment benefit can be measured by the probability
that mμ is negative, or the probability that the mean of the realised bi is negative. The
probability $\Pr(m_\mu < 0 \mid y) = 0.92$ is inconclusive, though the probability $\Pr(\bar{b} < 0 \mid y)$ is
0.97 (these two quantities are pben[1:2] in the code).
$$y_i \sim \mathrm{Po}(b_i),$$
$$b_i \sim G,$$
$$G \sim DP(\alpha G_0),$$
$$G_0 = \mathrm{Ga}(c_g, d_g).$$
A related check is whether the 95% intervals for yrep,i include yi (Gelfand, 1996).
A two-chain run of 20,000 iterations in R2OpenBUGS provides an estimated mean
of K = 12 clusters, with posterior means (95% CRI) for α, cg, and dg of 2.83 (0.74,6.45),
0.82 (0.51,1.61), and 0.13 (0.04,0.29). Figure 4.2 shows the prediction $y_{new}$ for a new
case, and demonstrates that the main source of overdispersion is skewness in the
latent frailties bi rather than multiple modes. The predictive checks based on replicate
samples are satisfactory. Note that the same does not apply if the gamma mixing density
parameters are set, e.g. $c_g = d_g = 1$. In this case, bimodal posteriors are obtained
on some bi (e.g. b92), and predictive checks for y101 = 34 suggest it to be an extreme
observation.
A second analysis involves a Polya tree prior, and a Poisson-lognormal model, namely
$$y_i \sim \mathrm{Po}(\mu_i),$$
$$\log(\mu_i) = \beta + \sigma b_i,$$
where G0 for $v_i = \sigma b_i$ is a N(0,σ²) density. The number of stages is set at M = 4, and an E(1)
prior is assumed on 1/σ2. Once an interval Bεm is selected, uniform sampling to generate
bi takes place within the interval defined by G0, except in the tails where the sampling
is from a N(0,1).
As for the Polya urn model, both types of predictive check indicate no major discrep-
ancies. σ has posterior mean (and 95% interval) 2.06 (1.65, 2.51). If σ is taken to equal 1
so that G0 is assumed known, then predictive discrepancies do occur. Taking σ = 1 also
leads to bimodal posteriors for individual bi indicating a clash between prior and data,
such that the prior cannot accommodate certain values. A plot of the estimated bi shows
the distinct zero inflation combined with positive skew (Figure 4.3).
FIGURE 4.2
Predictive samples, new outcome, eye tracking data. (Histogram: frequency against new outcome.)
FIGURE 4.3
Estimated random effects, eye tracking data. (Histogram; vertical axis: frequency.)
4.10 Computational Notes
# entropy
ent1[i,j] <- equals(S[i],j)*log(rho[i,j])
ent2[i,j] <- rho[i,j]*log(rho[i,j])}}
Ent[1] <- -2*sum(ent1[1:N,1:K])
Ent[2] <- -2*sum(ent2[1:N,1:K])
tLL <- sum(log_lik[])
# hyperparameters, hierarchical prior on precisions
a.psi ~dexp(1)
b.psi ~dexp(1)
# Processing to obtain identifiable groups, using ranks of unconstrained means
rank <- rank(m0)
# relabelled weights, means, precisions, variances
for (j in 1:K) {P[j] <- sum(P0prod[j,])
mu[j] <- sum(m0prod[j,])
psi[j] <- sum(psi0prod[j,])
s2[j] <- 1/psi[j]
for (k in 1:K) {P0prod[j,k] <- P0[k]*equals(rank[k],j)
m0prod[j,k] <- m0[k]* equals(rank[k],j)
psi0prod[j,k] <- psi0[k]* equals(rank[k],j)}}
# relabelled allocation indicators
for (i in 1:N) { S[i] <- sum(dcat[i,])
for (j in 1:K) {d[i,j] <- sum(d0prod[i,j,])
dcat[i,j] <- j*d[i,j]
for (k in 1:K) {d0prod[i,j,k] <- d0[i,k]*equals(rank[k],j)}}}}
", file="mixnorm.jag")
2. Consider the galaxy data and suppose M = 4 is the number of mixture compo-
nents. Unknown means are centred at the observed mean of the data. Then a trun-
cated DP prior can be implemented as
v ~ beta(1,alpha);
for(i in 1:N){ for(c in 1:M){
comp[c]=log(pi[c])+normal_lpdf(y[i] | mu[c],sigma[c]); }
target += log_sum_exp(comp); }}
"
D=list(y=y,N=82,M=4)
fit = stan(model_code = stan_model, data =D, iter = 2000, chains = 2)
summary(fit,pars = c("mu","pi","alpha"),probs=
c(0.025,0.975))$summary
The estimated parameters are in Table 4.1 and are similar to those estimated in Example 4.8.
TABLE 4.1
Discrete Mixture, Galaxy Data, TDP Prior
Parameter   Mean    St devn   2.5%    97.5%
μ1          9.76    0.24      9.33    10.33
μ2          20.02   0.54      19.42   21.48
μ3          22.45   1.08      21.20   26.01
μ4          28.65   4.18      22.59   33.96
π1          0.09    0.03      0.04    0.16
π2          0.36    0.20      0.08    0.86
π3          0.44    0.23      0.02    0.79
π4          0.12    0.11      0.01    0.39
α           0.78    0.36      0.23    1.61
References
Abrams K, Gillies C, Lambert P (2005) Meta-analysis of heterogeneously reported trials assessing
change from baseline. Statistics in Medicine, 24, 3823–3844.
Abrams K, Lambert P, Sanso B, Shaw S, Marteau T (2000) Meta-analysis of heterogeneously reported
study results: A Bayesian approach, pp 29–64, in Meta-Analysis in Medicine and Health Policy, eds
D Berry, D Stangl. Marcel Dekker.
Agresti A, Hitchcock D (2005) Bayesian inference for categorical data analysis. Statistical Methods and
Applications, 14, 297–330.
Aitchison J, Ho C (1989) The multivariate Poisson-log normal distribution. Biometrika, 76, 643–653.
Aitchison J, Shen S (1980) Logistic-normal distributions: Some properties and uses. Biometrika, 67,
261–272.
Alanko T, Duffy J (1996) Compound Binomial distributions for modeling consumption data. The
Statistician, 45, 269–286.
Albert J (1999) Criticism of a hierarchical model using Bayes factors. Statistics in Medicine, 18, 287–305.
Albert J (2015) Package ‘LearnBayes’: Functions for Learning Bayesian Inference.
https://cran.r-project.org/web/packages/LearnBayes/LearnBayes.pdf
Albert JH, Gupta AK (1982) Mixtures of Dirichlet distributions and estimation in contingency tables.
The Annals of Statistics, 10(4), 1261–1268.
Ando T (2007) Bayesian predictive information criterion for the evaluation of hierarchical Bayesian
and empirical Bayes models. Biometrika, 94, 443–458.
Arends L (2006) Multivariate meta-analysis: Modelling the heterogeneity. Repub/EUR Repository.
http://repub.eur.nl/publications/med_hea
Azzalini A (1985) A class of distributions which includes the normal ones. Scandinavian Journal of
Statistics, 12, 171–178.
Bakbergenuly I, Kulinskaya E (2017) Beta-binomial model for meta-analysis of odds ratios. Statistics
in Medicine, 36, 1715–1734.
Baker R, Jackson D (2008) A new approach to outliers in meta-analysis. Health Care Management
Science, 11(2), 121–131.
Baker R, Jackson D (2016) New models for describing outliers in meta-analysis. Research Synthesis
Methods, 7, 314–328.
Barnard J, McCulloch R, Meng XL (2000) Modeling covariance matrices in terms of standard
deviations and correlations, with applications to shrinkage. Statistica Sinica, 10, 1281–1311.
Basu S (1996) Bayesian tests for unimodality, pp 77–82, in Proceedings of the Section on Bayesian
Statistical Science. American Statistical Association.
Basu S, Chib S (2003) Marginal likelihood and Bayes factors for Dirichlet process mixture models.
Journal of the American Statistical Association, 98(461), 224–235.
Baudry J, Raftery A, Celeux G, Lo K, Gottardo R (2010) Combining mixture components for cluster-
ing. Journal of Computational and Graphical Statistics, 19(2), 332–353.
Bayman E, Chaloner K, Hindman B, Todd M (2013) Bayesian methods to determine performance dif-
ferences and to quantify variability among centers in multi-center trials: The IHAST trial. BMC
Medical Research Methodology, 13, 5.
Beath K (2014) A finite mixture method for outlier detection and robustness in meta-analysis. Research
Synthesis Methods, 5(4), 285–293.
Beath K (2016) metaplus: An R package for the analysis of robust meta-analysis and meta-regression.
The R Journal, 8(1), 5–16.
Besag J, Green P, Higdon D, Mengersen K (1995) Bayesian computation and stochastic systems.
Statistical Science, 10(1), 103–166.
Betancourt M (2017) Identifying Bayesian Mixture Models.
http://mc-stan.org/users/documentation/case-studies/identifying_mixture_models.html
Biernacki C, Celeux G, Govaert G (2000) Assessing a mixture model for clustering with the integrated
completed likelihood. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(7), 719–725.
Bohning D (1999) Computer-Assisted Analysis of Mixtures and Applications: Meta-Analysis, Disease
Mapping and Others. Chapman & Hall, New York.
Browne W, Draper D (2006) A comparison of Bayesian and likelihood-based methods for fitting mul-
tilevel models. Bayesian Analysis, 1, 473–550.
Bulmer M (1974) On fitting the Poisson log-normal distribution to species abundance data. Biometrics,
30, 101–110.
Burke D, Bujkiewicz S, Riley R (2016) Bayesian bivariate meta-analysis of correlated effects: Impact
of the prior distributions on the between-study correlation, borrowing of strength, and joint
inferences. Statistical Methods in Medical Research, 27(2), 428–450.
Burr D (2012) bspmma: An R package for Bayesian semiparametric models for meta analysis. Journal
of Statistical Software, 50, 1–23.
Burr D, Doss H (2005) A Bayesian semi-parametric model for random effects meta analysis. Journal of
the American Statistical Association, 100, 242–251.
Cao G, West M (1996) Practical Bayesian inference using mixtures of mixtures. Biometrics, 52,
1334–1341.
Carvalho V, Branscum A (2017) Bayesian nonparametric inference for the three-class Youden index
and its associated optimal cutoff points. Statistical Methods in Medical Research, 27, 689–700.
Celeux G, Hurn M, Robert C (2000) Computational and inferential difficulties with mixture posterior
distributions. Journal of the American Statistical Association, 95, 957–970.
Cepeda-Benito A, Reynoso N, Erath S (2004) Meta-analysis of the efficacy of nicotine replacement
therapy for smoking cessation: Differences between men and women. Journal of Consulting and
Clinical Psychology, 72, 712–722.
Chelgren N, Adams M, Bailey L, Bury B (2011) Using multilevel spatial models to understand
salamander site occupancy patterns after wildfire. Ecology, 92, 408–421.
Chib S, Hamilton B (2002) Semiparametric Bayes analysis of longitudinal data treatment models.
Journal of Econometrics, 110(1), 67–89.
Chib S, Winkelmann R (2001) Markov chain Monte Carlo analysis of correlated count data. Journal of
Business & Economic Statistics, 19(4), 428–435.
Christiansen C, Morris C (1996) Fitting and checking a two-level Poisson model: modeling patient
mortality rates in heart transplant patients, pp 467–501, in Bayesian Biostatistics, eds D Berry, D
Stangl. Marcel Dekker, New York.
Christiansen C, Morris C (1997) Hierarchical Poisson regression modeling. Journal of the American
Statistical Association, 92, 618–632.
Chung H, Loken E, Schafer J (2004) Difficulties in drawing inferences with finite-mixture models: A
simple example. The American Statistician, 58, 152–158.
Clark J, Gelfand A (2006) Hierarchical Modelling for the Environmental Sciences: Statistical Methods and
Applications. Oxford University Press.
Clayton D, Kaldor J (1987) Empirical Bayes estimates of age-standardized relative risks for use in
disease mapping. Biometrics, 43(3), 671–681.
Conlon E, Song J, Liu A (2007) Bayesian meta-analysis models for microarray data: A comparative
study. BMC Bioinformatics, 8, 80.
Connolly S, Dornelas M, Bellwood D, Hughes T (2009) Testing species abundance models: A new
bootstrap approach applied to Indo-Pacific coral reefs. Ecology, 90(11), 3138–3149.
Consul P (1989) Generalized Poisson Distributions. Marcel Dekker, New York.
Daniels M (1999) A prior for the variance in hierarchical models. Canadian Journal of Statistics, 27,
569–580.
Das S, Dey D (2006) On Bayesian analysis of generalized linear models using the Jacobian technique.
The American Statistician, 60, 264–268.
Das S, Dey D (2007) On Bayesian analysis of generalized linear models: A new perspective. Technical
Report 2007-8, Statistical and Applied Mathematical Sciences Institute, UNC. www.samsi.info
Da Silva, A (2009) Bayesian mixture models of variable dimension for image segmentation. Computer
Methods and Programs in Biomedicine, 94(1), 1–14.
Deely N, Smith A (1998) Quantitative refinements for comparisons of institutional performance.
Journal of the Royal Statistical Society: Series A, 161, 5–12.
Delucchi K, Bostrom A (2004) Methods for analysis of skewed data distributions in psychiatric clini-
cal studies: Working with many zero values. The American Journal of Psychiatry, 161, 1159–1168.
DerSimonian R, Laird N (1986) Meta-analysis in clinical trials. Controlled Clinical Trials, 7, 177–188.
Diebolt N, Robert C (1994) Estimation of finite mixture distributions through Bayesian sampling.
Journal of the Royal Statistical Society: Series B, 56, 363–375.
Ding T, Baio G (2016) bmeta: Bayesian Meta-analysis and Metaregression. http://www.statistica.it/
gianluca/software/bmeta/
Diserud O, Engen S (2000) A general and dynamic species abundance model, embracing the lognor-
mal and the gamma models. The American Naturalist, 155, 497–511.
Drton M, Plummer M (2017) A Bayesian information criterion for singular models. Journal of the Royal
Statistical Society: Series B, 79(2), 323–380.
Druyts E, Palmer J, Balijepalli C, Chan K, Fazeli M, Herrera V (2017) Treatment modifying factors of
biologics for psoriatic arthritis: A systematic review and Bayesian meta-regression. Clinical and
Experimental Rheumatology, 35(4), 681–688.
DuMouchel W (1996) Predictive cross-validation of Bayesian meta-analyses, pp 107–127, in eds J
Bernardo, J Berger, A Dawid, A Smith, Bayesian Statistics 5. Oxford University Press.
DuMouchel W, Waternaux C (1992) Discussion of “Hierarchical models for combining information
and for meta-analysis,” by C Morris and S Normand, pp 338–341, in Bayesian Statistics, Vol. 4,
eds J Bernardo, J Berger, A Dawid, A Smith. Clarendon Press, Oxford, UK.
Dunson D, Pillai N, Park J (2007) Bayesian density regression. Journal of the Royal Statistical Society:
Series B, 69, 163–183.
Efron B (1986) Double exponential families and their use in generalized linear regression. Journal of
the American Statistical Association, 81(395), 709–721.
Borrowing Strength via Hierarchical Estimation 159
Jarque C, Bera A (1980) Efficient tests for normality, homoscedasticity and serial independence of
regression residuals. Econometric Letters, 6, 255–259.
Jiang J, Lahiri P (2006) Mixed model prediction and small area estimation. Test, 15(1), 1.
Jullion A, Lambert P (2007) Robust specification of the roughness penalty prior distribution in
spatially adaptive Bayesian P-splines models. Computational Statistics & Data Analysis, 51(5),
2542–2558.
Karabatsos G (2016) A menu-driven software package for Bayesian regression analysis. The ISBA
Bulletin, 22(4), 13–16.
Karabatsos G (2017) A menu-driven software package of Bayesian nonparametric (and parametric)
mixed models for regression analysis and density estimation. Behavior Research Methods, 49(1),
335–362.
King G (1997) A Solution to the Ecological Inference Problem: Reconstructing Individual Behavior from
Aggregate Data. Princeton University Press, Princeton, NJ.
King G, Rosen O, Tanner M (eds) (2004) Ecological Inference: New Methodological Strategies. Cambridge
University Press, New York.
Kleinman KP, Ibrahim JG (1998) A semiparametric Bayesian approach to the random effects model.
Biometrics, 54(3), 921–938.
Kruschke J, Vanpaemel W (2015) Bayesian estimation in hierarchical models, pp 279–299, in The
Oxford Handbook of Computational and Mathematical Psychology, eds J R Busemeyer, Z Wang, J T
Townsend, A Eidels. Oxford University Press, Oxford, UK.
Kuhan G, Marshall E C, Abidia A F, Chetter I C, McCollum P (2002) A Bayesian hierarchical approach
to comparative audit for carotid surgery. European Journal of Vascular and Endovascular Surgery,
24(6), 505–510.
Kulinskaya E, Olkin I (2014) An overdispersion model in meta-analysis. Statistical Modelling, 14(1),
49–76.
Lambert P, Sutton A, Burton P, Abrams K, Jones D (2005) How vague is vague? A simulation study
of the impact of the use of vague prior distributions in MCMC using WinBUGS. Statistics in
Medicine, 24(15), 2401–2428.
Larson J, Soule S (2009) Sector-level dynamics and collective action in the United States, 1965–1975.
Mobilization: An International Quarterly, 14(3), 293–314.
Laud PW, Ibrahim JG (1995) Predictive model selection. Journal of the Royal Statistical Society: Series B
(Methodological), 57(1), 247–262.
Lee J, Sabavala D (1987) Bayesian estimation and prediction for the beta binomial model. Journal of
Business and Economic Statistics, 5, 357–367.
Lee K, Thompson S (2008) Flexible parametric models for random-effects distributions Statistics in
Medicine, 27, 418–434.
Leisch F (2004) FlexMix: A general framework for finite mixture models and latent class regression in
R. Journal of Statistical Software, 11(8), 1–18.
Lenk P (1988) The logistic normal distribution for Bayesian nonparametric predictive densities.
Journal of the American Statistical Association, 83, 509–516.
Lenk P, Desarbo W (2000) Bayesian inference for finite mixtures of generalized linear models with
random effects. Psychometrika, 65, 93–119.
Leonard T (1973) A Bayesian method for histograms. Biometrika, 60, 297–308.
Leslie D, Kohn R, Nott D (2007) A general approach to heteroscedastic linear regression. Statistics and
Computing, 17, 131–146.
Lin T, Lee J, Hsieh W (2007b) Robust mixture modeling using the skew t distribution. Statistics and
Computing, 17, 81–92.
Lin T, Lee J, Ni H (2004) Bayesian analysis of mixture modelling using the multivariate t distribution.
Statistics and Computing, 14, 119–130.
Lin T, Lee J, Yen S (2007a) Finite mixture modelling using the skew normal distribution. Statistica
Sinica, 17, 909–927.
Lindley D, Smith A (1972) Bayes estimates for the linear model. Journal of the Royal Statistical Society:
Series B, 34, 1–41.
162 Bayesian Hierarchical Models
Liu J, Dey D (2007) Hierarchical overdispersed Poisson model with macrolevel autocorrelation.
Statistical Methodology, 4(3), 354–370.
Lu G, Ades AE (2009) Modeling between-trial variance structure in mixed treatment comparisons.
Biostatistics, 10(4), 792–805.
Makuch R, Stephens M, Escobar M (1989) Generalized binomial models to examine the historical
control assumption in active control equivalence studies. The Statistician, 38, 61–70.
Marin J, Mengersen K, Robert C (2005) Bayesian modelling and inference on mixtures of distribu-
tions, in Handbook of Statistics, Vol. 25, eds D Dey, C Rao. Elsevier.
Markham F, Young M, Doran B, Sugden M (2017) A meta-regression analysis of 41 Australian prob-
lem gambling prevalence estimates and their relationship to total spending on electronic gam-
ing machines. BMC Public Health, 17(1), 495.
Marshall E, Spiegelhalter D (1998) Comparing institutional performance using Markov chain Monte
Carlo methods, in Statistical Analysis of Medical Data: New Developments, eds B Everitt, G Dunn.
Arnold.
Marshall E, Spiegelhalter D (2007) Simulation-based tests for divergent behaviour in hierarchical
models. Bayesian Analysis, 2, 409–444.
Mavridis D, Salanti G (2013) A practical introduction to multivariate meta-analysis. Statistical Methods
in Medical Research, 22(2), 133–158.
McLachlan G, Rathnayake S (2014) On the number of components in a Gaussian mixture model.
Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 4(5), 341–355.
Militino A, Ugarte M, Dean C (2001) The use of mixture models for identifying high risks in disease
mapping. Statistics in Medicine, 20, 2035–2049.
Mohr D (2006) Bayesian identification of clustered outliers in multiple regression. Computational
Statistics & Data Analysis, 51, 3955–3967.
Moreno E, Vázquez-Polo F, Negrn M (2018) Bayesian meta-analysis: The role of the between-sample
heterogeneity. Statistical Methods in Medical Research, 27(12), 3643–3657.
Muller P, Erkanli A, West M (1996) Bayesian curve fitting using multivariate normal mixtures.
Biometrika, 83, 67–79.
Ohlssen D, Sharples L, Spiegelhalter D (2007) Flexible random-effects models using Bayesian
semi-parametric models: Applications to institutional comparisons. Statistics in Medicine, 26,
2088–2112.
Papastamoulis P (2016) label.switching: An R package for dealing with the label switching problem
in MCMC outputs. Journal of Statistical Software, 69. https://www.jstatsoft.org/article/view/
v069c01.
Parmigiani G (2002) Modeling in Medical Decision Making: A Bayesian Approach. Wiley, New York.
Pastor N (2003) Methods for the analysis of explanatory linear regression models with missing data
not at random. Quality and Quantity, 37, 363–376.
Pauler D, Wakefield J (2000) Modeling and implementation issues in Bayesian meta-analysis,
pp 205–230, in Bayesian Meta-Analysis, eds Stangl D, Berry D. Marcel Dekker.
Pérez M, Pericchi L, Ramrez I (2017) The Scaled Beta2 distribution as a robust prior for scales. Bayesian
Analysis, 12(3), 615–637.
Pitman J, Yor M (1997) The two-parameter Poisson-Dirichlet distribution derived from a stable sub-
ordinator. Annals of Probability, 25, 855–900.
Podlich H, Faddy M, Smyth G (2004) Semi-parametric extended Poisson process models for count
data. Statistics and Computing, 14, 311–321.
Prabhakaran S, Azizi E, Carr A, Pe’er D (2016) Dirichlet process mixture model for correcting techni-
cal variation in single-cell gene expression data, pp 1070–1079, in Proceedings of the International
Conference on Machine Learning, New York.
Prevost T, Abrams K, Jones D (2000) Hierarchical models in generalized synthesis of evidence: An
example based on studies of breast cancer screening. Statistics in Medicine, 19, 3359–3376.
Quintana F, Tam W (1996) Bayesian estimation of beta-binomial models by simulating posterior den-
sities. Revista de la Sociedad Chilena de Estadstica, 13, 43–56.
Rao J (2003) Small Area Estimation. Wiley, New York.
Borrowing Strength via Hierarchical Estimation 163
Taylor-Rodrguez D, Kaufeld K, Schliep E, Clark J, Gelfand A (2017) Joint species distribution model-
ing: Dimension reduction using Dirichlet processes. Bayesian Analysis, 12(4), 939–967.
Teather D (1984) The estimation of exchangeable binomial parameters. Communications in Statistics,
Part A, 13, 671–680.
van Dongen S (2006) Prior specification in Bayesian statistics: Three cautionary tales. Journal of
Theoretical Biology, 242: 90–100.
van Houwelingen H, Arends L, Stiinen T (2002) Advanced methods in meta-analysis: Multivariate
approach and meta-regression. Statistics in Medicine, 21, 589–624.
Verde PE (2018) bamdit: An R package for Bayesian meta-analysis of diagnostic test data. Journal of
Statistical Software, Articles, 86, 1–32.
Viechtbauer W (2010) Conducting meta-analyses in R with the metafor package. Journal of Statistical
Software, 36(3), 1–48.
Viechtbauer W (2017) Package ‘metafor’. https://cran.r-project.org/web/packages/metafor/meta-
for.pdf
Walfish S (2006) A review of statistical outlier methods. Pharmaceutical Technology, 30(11), 82–86.
Walker S, Damien P, Laud P, Smith A (1999) Bayesian nonparametric inference for random distribu-
tions and related functions. Journal of the Royal Statistical Society: Series B, 61, 485–527.
Wang C, Blei D (2017) A general method for robust Bayesian modeling. Bayesian Analysis, 13(4),
1163–1191.
Warn D, Thompson S, Spiegelhalter D (2002) Bayesian random effects meta-analysis of trials with
binary outcomes: Methods for the absolute risk difference and relative risk scales. Statistics in
Medicine, 21, 1601–1623.
Wasserman L (2000) Asymptotic inference for mixture models using data-dependent priors. Journal
of the Royal Statistical Society: Series B, 62, 159–180.
Weems K, Smith P (2004) On robustness of maximum likelihood estimates for Poisson-lognormal
models. Statistics & Probability Letters, 66, 189–196.
Wei Y, Higgins JP (2013) Bayesian multivariate meta-analysis with multiple outcomes. Statistics in
Medicine, 32(17), 2911–2934.
West M (1984) Outlier models and prior distributions in Bayesian linear regression. Journal of the
Royal Statistical Society: Series B, 46, 431–439.
West M, Muller P, Escobar M (1994) Hierarchical priors and mixture models, with application in
regression and density estimation, pp 363–386, in Aspects of Uncertainty: A Tribute to D. V.
Lindley, eds P Freeman, A Smith. Wiley, New York.
Williams D, Rast P, Bürkner P (2018) Bayesian Meta-Analysis with Weakly Informative Prior
Distributions. PsyArXiv. https://andrewgelman.com/wp-content/uploads/2018/01/bayes_
donny.pdf
Winkelmann R, Zimmermann KF (1991) A new approach for modeling economic count data.
Economics Letters, 37(2), 139–143.
Winship DA (1978) Cimetidine in the treatment of duodenal ulcer. Gastroenterology, 74, 402–406.
Young-Xu Y, Chan K (2008) Pooling overdispersed binomial data to estimate event rate. BMC Medical
Research Methodology, 8, 58.
Yu K, Moyeed R (2001) Bayesian quantile regression. Statistics and Probability Letters, 54(4), 437–447.
Zhang J, Fu H, Carlin B (2015) Detecting outlying trials in network meta-analysis. Statistics in
Medicine, 34(19), 2695–2707.
Zhao Y, Staudenmayer J, Coull B, Wand M (2006) General design Bayesian generalized linear mixed
models. Statistical Science, 21, 35–51.
Zollinger A, Davison A, Goldstein D (2015) Meta-analysis of incomplete microarray studies.
Biostatistics, 16(4), 686–700.
5
Time Structured Priors
5.1 Introduction
A time series is a sequence of stochastic observations which are ordered in time, most
often at equally spaced discrete times t = 1, …, T, though extensions to unequally spaced
intervals are relatively straightforward (Lee and Nelder, 2001). Major goals of time series
analysis include modelling the interrelationship of variables evolving jointly through time,
as in econometric growth models (Paap and van Dijk, 2003), forecasting future values of
time series variables (Beck, 2004), and identifying the structural components of a sequence
of observations (Huerta and West, 1999). In the analysis of temporal data, one generally
expects positive covariation between observations that are close to each other in time,
so that exchangeable priors are not appropriate. While time series are sometimes anal-
ysed exchangeably, at least within subgroups of the data, as in change-point models (Mira
and Petrone, 1996), in most applications, there is a gain from modelling temporal covaria-
tion. Hence, hierarchical priors for time series modelling are typically structured in the
sense of explicitly recognising adjacency in time as the basis for smoothing or prediction.
Hierarchical methods also assist in identifying underlying relatively smooth or recurring
features of the data, for example, underlying trends or seasonal effects.
Bayesian methods are widely applied to autoregressive moving average models, without
necessarily imposing the stationarity restrictions and preliminary detrending that fea-
ture in classical estimation. However, a general scheme for specifying priors for modelling
time series data is provided by the state-space approach, considered in Sections 5.3 and
5.4 (Harvey et al., 2006; West, 2013; Giordani et al., 2011; Petris et al., 2009), which includes
ARMA (autoregressive moving average) models as special cases. State-space models rec-
ognise multiple underlying components in time series, with the priors governing the
evolution of the components under an expectation of smoothness. The linear state-space
(or dynamic linear model) specification for the changing level of a univariate continuous
response yt has the form
$$
\begin{aligned}
y_t &= X_t \beta_t + u_t, \\
\beta_t &= G_t \beta_{t-1} + w_t, \\
u_t &\sim N(0, V_t), \\
w_t &\sim N(0, W_t),
\end{aligned}
$$
where the errors ut and wt are unstructured white noise, Xt is a predictor or design matrix, and Gt is a known matrix
governing the evolution of the state vector βt (Durbin, 2000; West and Harrison, 1997).
The time structured latent effects βt may include level, trend, seasonal, or cyclical effects.
Taking ut and wt to be normal leads to the normal dynamic linear model (West, 1998),
with extension to generalised linear model forms for discrete data leading to dynamic
generalised linear models (West et al., 1985). State-space principles can also be applied to
model stochastically evolving variances, as in stochastic volatility models (Kim et al., 1998;
Jacquier et al., 2002); see Section 5.5.
While there may be benefits from borrowing strength methods that take account of cor-
relations between units, the use of multiple random effects to represent unobserved com-
ponents in time raises potential identification issues (Auger-Méthé et al., 2016; Knape, 2008).
For example, priors for correlated effects in time may specify differences in effects between
adjacent units without specifying the mean level of the effects. MCMC methods may then
require centring of the effects during sampling to ensure identification of other param-
eters. Methods for smoothing or interpolation in time may also need to retain robustness
to take account of regime shifts, or to accommodate temporal outliers. Structured priors
assume relatively smooth variation over adjacent units, and their parameters may be dis-
torted if mechanisms are not incorporated for accommodating extreme points.
There is a wide range of time series analysis options in R using frequentist estimation
packages (https://cran.r-project.org/web/views/TimeSeries.html) which may be useful
for comparative purposes. Bayesian computing options in R for time series include bsts,
particularly for state-space modelling (Scott, 2017); BMR, Bayesian Macroeconometrics in
R (https://github.com/kthohr/BMR/tree/master/man); stochvol for stochastic volatility
analysis (Kastner and Hosszejni, 2016); and tsPI (Helske, 2017). Generic packages such as
rstan and R-INLA may facilitate estimation and identification in complex random effects
time series models (Monnahan et al., 2017; Betancourt and Girolami, 2015).
The chapter below considers schemes for modelling correlated observations and latent
effects in time series. Sections 5.2 and 5.3 consider autoregressive and state-space priors
for time series analysis, while Section 5.4 considers state-space methods for discrete time
series. Section 5.5 considers Bayesian approaches to stochastic volatility and Section 5.6
considers models adaptive to temporal discontinuities.
$$y_t = \phi_0 + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \dots + \phi_p y_{t-p} + u_t, \quad t = 1, \dots, T,$$
where the innovation errors $u_t \sim N(0, \sigma^2)$ are homoscedastic white noise, independent of
each other and of the lagged y values $\{y_{t-1}, \dots, y_{t-p}\}$. So $E(u_t u_{t-s}) = E(u_{t-j} u_{t-j-s}) = 0$ for all j and all s ≠ 0.
Note that a full likelihood analysis will refer to p latent preseries values (Marriott et al.,
1996), with Naylor and Marriott (1996) suggesting preseries values follow a heavy tailed
version of the density assumed for the observed series, for instance $(y_0, y_{-1}, \dots, y_{1-p})$ as
Student t with variance $\sigma^2$ and low degrees of freedom $\nu$. Autoregressive dependence may
also be present in error terms, such that
$$y_t = \phi_0 + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \dots + \phi_{p_1} y_{t-p_1} + \varepsilon_t$$
with assumptions as in Chib and Greenberg (1994). Assuming the y-series is centred around its
mean, and defining the backshift operator B by $B y_t = y_{t-1}$, one has
$$y_t - \phi_1 y_{t-1} - \dots - \phi_p y_{t-p} = (1 - \phi_1 B - \dots - \phi_p B^p)\, y_t = \Phi(B)\, y_t,$$
and the ARMA(p, q) model can be written
$$\Phi(B)\, y_t = \Gamma(B)\, u_t.$$
As for other regressions, collinearity may occur, and parameter selection for the ARMA(p,q)
may include shrinkage priors (Schmidt and Makalic, 2013) and RJMCMC (Ehlers and
Brooks, 2004).
Classical estimation methods typically require stationarity and constant variances in
estimating such models. Stationarity is equivalent to the roots of $\Phi(B) = 1 - \phi_1 B - \phi_2 B^2 - \dots - \phi_p B^p$
being outside the unit circle, and invertibility refers to the same condition on the roots of
$\Gamma(B)$. Achieving stationarity typically involves preliminary data differencing or transformation,
or regression to remove trend (e.g. Abraham and Ledolter, 1983, p.225), with the
actual model then applied to differenced data or to regression residuals. To assess whether
stationarity has been achieved, one can consider the autocorrelation sequence of model
residuals: a stationary process should show a sequence fading to zero at high lags, whereas
significant values at high lags indicate nonstationarity. In Bayesian analyses, it is com-
mon to estimate parameters without presuming stationarity (or invertibility), but instead
obtain the posterior probabilities of stationarity via monitoring the sampled parameters
(McCulloch and Tsay, 1994; Marriott et al., 1996).
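The idea of monitoring sampled parameters for stationarity can be sketched for the AR(2) case, where the root condition on Φ(B) reduces to the well-known triangle conditions φ1 + φ2 < 1, φ2 − φ1 < 1 and |φ2| < 1. The sketch below is illustrative only: the function names are hypothetical, and the `draws` list is a simulated stand-in for MCMC output rather than samples from any model in the text.

```python
import random


def is_stationary_ar2(phi1, phi2):
    """AR(2) stationarity triangle, equivalent to both roots of
    1 - phi1*B - phi2*B^2 lying outside the unit circle."""
    return (phi1 + phi2 < 1.0) and (phi2 - phi1 < 1.0) and (abs(phi2) < 1.0)


def prob_stationary(draws):
    """Proportion of sampled (phi1, phi2) pairs inside the stationary region."""
    return sum(is_stationary_ar2(p1, p2) for p1, p2 in draws) / len(draws)


# Hypothetical stand-in for posterior draws of (phi1, phi2).
rng = random.Random(1)
draws = [(rng.gauss(0.6, 0.3), rng.gauss(0.2, 0.2)) for _ in range(5000)]
print(prob_stationary(draws))
```

For higher-order models the same monitoring would require computing the roots of Φ(B) at each iteration rather than using closed-form inequalities.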
$$y_t = \phi_0 + \phi_1 y_{t-1} + u_t + \gamma_1 u_{t-1},$$
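$$y_t = \sum_{j=1}^{p} \phi_{tj}\, y_{t-j} + u_t,$$
$$\phi_t = \mu_\phi + \Sigma_\phi^{0.5} e_t,$$
where $u_t \sim N(0, \sigma^2)$, $e_t \sim N_p(0, I)$, $\phi_t = (\phi_{t1}, \dots, \phi_{tp})$, $\Sigma_\phi = \mathrm{diag}(\sigma_{\phi 1}^2, \dots, \sigma_{\phi p}^2)$, and $\mu_\phi = (\phi_1, \dots, \phi_p)$.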
Instead of a multivariate normal prior for the ϕt, sequential updating of the ϕt may be
applied, for example, via a multivariate random walk (Section 5.3),
$$\phi_t = \phi_{t-1} + w_t, \quad w_t \sim N_p(0, W_t).$$
Another possibility (Godsill et al., 2004) is to take both the AR coefficient vector and the
innovation variance σ2 to be time-varying, for example, by setting a random walk prior on
$h_t = \log(\sigma_t)$, or by a second-stage autoregression, such as
$$h_t \sim N(\rho_h h_{t-1}, \sigma_h^2).$$
As in many Bayesian applications, stationarity constraints are not necessarily placed on
the ϕtj at each t (Prado et al., 2000). However, if the AR parameters lie in the stationary
region, then the series can be considered locally stationary. For example, for an RCAR(1)
model including a latent preseries value y0, a hierarchical scheme such as
$$y_t \sim N(\phi_{t1}\, y_{t-1}, \sigma^2), \quad t > 1,$$
$$y_0 \sim t_2(m_0, \sigma^2),$$
$$\phi_{t1} \sim N(\phi_1, \sigma_\phi^2), \quad t > 1,$$
may be applied. For this model, stationarity holds if $\phi_1^2 + \sigma_\phi^2 < 1$.
$$y_t - \mu = \phi\,(y_{t-1} - \mu) + u_t$$
or
$$y_t = \phi\, y_{t-1} + u_t$$
for centred data, where under stationarity $-1 < \phi < 1$, and $y_r$ and $y_s$ for $1 \le r \le s \le T$ are
conditionally independent, given $\{y_{r+1}, \dots, y_{s-1}\}$, if $|r - s| > 1$ (Rue and Held, 2005). The AR(2)
model has
$$y_t = \phi_1 y_{t-1} + \phi_2 y_{t-2} + u_t$$
$$p(y_1, \dots, y_T) = p(y_1)\, p(y_2 | y_1)\, p(y_3 | y_2) \cdots p(y_T | y_{T-1}) \propto (1 - \phi^2)^{0.5}\, \sigma^{-n} \exp\left[-0.5\, H / \sigma^2\right],$$
where $H = (1 - \phi^2)\, y_1^2 + \sum_{t=2}^{T} (y_t - \phi\, y_{t-1})^2$ and n = T is the series length. The same sequence of marginal and conditional
densities applies for AR(1) autoregressive errors.
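As a check on this factorisation, the following sketch evaluates the stationary AR(1) log-likelihood for centred data in two ways: from the closed form based on H, and by summing the marginal density of the first observation and the one-step conditional densities. Function names are illustrative, and the short data vector is made up for the example.

```python
import math


def ar1_H(y, phi):
    """Quadratic form H = (1 - phi^2) y_1^2 + sum_{t>=2} (y_t - phi y_{t-1})^2."""
    return (1 - phi ** 2) * y[0] ** 2 + sum(
        (y[t] - phi * y[t - 1]) ** 2 for t in range(1, len(y))
    )


def ar1_loglik(y, phi, sigma):
    """Closed-form log-likelihood, including the normalising constant."""
    T = len(y)
    return (0.5 * math.log(1 - phi ** 2) - T * math.log(sigma)
            - 0.5 * T * math.log(2 * math.pi) - 0.5 * ar1_H(y, phi) / sigma ** 2)


def normal_logpdf(x, mean, var):
    return -0.5 * math.log(2 * math.pi * var) - 0.5 * (x - mean) ** 2 / var


def ar1_loglik_sequential(y, phi, sigma):
    """Same quantity from p(y_1) p(y_2 | y_1) ... p(y_T | y_{T-1})."""
    lp = normal_logpdf(y[0], 0.0, sigma ** 2 / (1 - phi ** 2))  # stationary marginal
    for t in range(1, len(y)):
        lp += normal_logpdf(y[t], phi * y[t - 1], sigma ** 2)
    return lp


y = [0.3, -0.1, 0.4, 0.9, 0.2, -0.5]  # made-up centred series
print(ar1_loglik(y, 0.6, 0.8), ar1_loglik_sequential(y, 0.6, 0.8))
```

The two values agree up to floating point error, since the term (1 − φ²)y1² in H comes from the stationary variance σ²/(1 − φ²) of the first observation.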
The precision (inverse covariance) matrix of autoregressive models has interesting theo-
retical properties demonstrating how conditional independence structures determine the
precision matrix and vice versa (Speed and Kiiveri, 1986; Rue and Held, 2005). Specifically,
zeros in the precision matrix define, and are defined by, conditional independencies in the
joint density. Thus, for an AR(1) prior on errors ε with lag coefficient ρ, the precision matrix
Π is tridiagonal, with (r, s)th cell equalling zero only if the complete conditional distribution
of εr does not depend on εs, namely
$$
\Pi = \sigma^{-2} C^{-1} = \sigma^{-2}
\begin{pmatrix}
1 & -\rho & 0 & & & \\
-\rho & 1+\rho^2 & -\rho & & & \\
0 & -\rho & 1+\rho^2 & \ddots & & \\
 & & \ddots & \ddots & \ddots & \\
 & & & -\rho & 1+\rho^2 & -\rho \\
 & & & 0 & -\rho & 1
\end{pmatrix}.
$$
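The tridiagonal form for the AR(1) case can be verified numerically. The sketch below builds the precision matrix from the pattern just shown and multiplies it by the stationary AR(1) covariance matrix, whose (r, s) entry is σ²ρ^|r−s|/(1 − ρ²); the product should be the identity. Function names are illustrative, and plain nested lists are used to keep the example free of external libraries.

```python
def ar1_precision(T, rho, sigma2=1.0):
    """Tridiagonal AR(1) precision matrix (requires T >= 2)."""
    P = [[0.0] * T for _ in range(T)]
    for t in range(T):
        # Corner diagonal cells are 1; interior diagonal cells are 1 + rho^2.
        P[t][t] = (1.0 if t in (0, T - 1) else 1.0 + rho ** 2) / sigma2
        if t + 1 < T:
            P[t][t + 1] = P[t + 1][t] = -rho / sigma2
    return P


def ar1_covariance(T, rho, sigma2=1.0):
    """Stationary AR(1) covariance: sigma2 * rho^|r-s| / (1 - rho^2)."""
    return [[sigma2 * rho ** abs(r - s) / (1 - rho ** 2) for s in range(T)]
            for r in range(T)]


def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


product = matmul(ar1_precision(5, 0.7, 2.0), ar1_covariance(5, 0.7, 2.0))
for row in product:
    print([round(v, 10) for v in row])
```

The covariance matrix is dense, whereas the precision matrix has only 3T − 2 non-zero cells, which is what makes sparse-matrix sampling schemes attractive for such priors.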
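For an AR(2) error sequence with lag parameters $\{\rho_1, \rho_2\}$, the precision matrix is
$$
\Pi = \sigma^{-2}
\begin{pmatrix}
1 & -\rho_1 & -\rho_2 & 0 & & & \\
-\rho_1 & 1+\rho_1^2 & -\rho_1(1-\rho_2) & -\rho_2 & & & \\
-\rho_2 & -\rho_1(1-\rho_2) & 1+\rho_1^2+\rho_2^2 & -\rho_1(1-\rho_2) & \ddots & & \\
0 & -\rho_2 & -\rho_1(1-\rho_2) & 1+\rho_1^2+\rho_2^2 & \ddots & \ddots & \\
 & & \ddots & \ddots & \ddots & \ddots & \\
 & & & -\rho_2 & -\rho_1(1-\rho_2) & 1+\rho_1^2 & -\rho_1 \\
 & & & & -\rho_2 & -\rho_1 & 1
\end{pmatrix}.
$$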
$$y_t = \delta_0 + \delta_1 t + \varepsilon_t,$$
$$\varepsilon_t = \phi\, \varepsilon_{t-1} + u_t,$$
while Chatuverdi and Kumar (2005) consider the unit root hypothesis under a more general
polynomial trend $y_t = \delta_0 + \sum_j \delta_j t^j + \varepsilon_t$.
5.2.3 Antedependence Models
Structured antedependence models may offer flexibility in time series specification; they
resemble autoregressions in entailing a regression over preceding observations or latent
effects, but are specified in a way that avoids stationarity constraints (Nunez-Anton and
Zimmerman, 2000; Pourahmadi, 2002). Observations $\{y_1, \dots, y_T\}$ are antedependent of
order s if $y_t$ depends only on $\{y_{t-1}, \dots, y_{t-s}\}$ for all $t \ge s$ (Gabriel, 1962). For example, Jaffrezic
et al. (2003) consider a second-order antedependence model for normal longitudinal data
of the form $y_{it} = \eta_{it} + g_{it} + u_{it}$, where $\eta_{it}$ models fixed effects, e.g. $\eta_{it} = x_{it}\beta$, $u_{it}$ are unstructured
white noise errors with fixed variance, and the genetic component $g_{it}$ follows a second-order
structured antedependence or AD(2) scheme. Omitting the subject subscript i, this scheme specifies
$$g_1 = e_1,$$
$$g_2 = \phi_{12}\, g_1 + e_2,$$
$$g_t = \phi_{1t}\, g_{t-1} + \phi_{2t}\, g_{t-2} + e_t, \quad t \ge 3,$$
with $e_t \sim N(0, \omega_t)$. Because of the initial condition $g_1 = e_1$, the antedependence parameters,
such as $\{\phi_{1t}, \phi_{2t}\}$ in an AD(2) model, are unconstrained, in contrast to the stationarity
constraints needed for autoregressive models.
To reduce the number of parameters being estimated, changing variances ωt may be
modelled via a parametric function of time, for example
while the antedependence parameters can also be modelled using time functions. For
example, a Box–Cox power law can be used to parameterise time-varying AD coefficients
ϕkt, namely
$$\phi_{kt} = \phi_k^{\,r_t - r_{t-k}},$$
where $\{r_t = (t^{\lambda_k} - 1)/\lambda_k,\; r_{t-k} = ((t-k)^{\lambda_k} - 1)/\lambda_k\}$ if $\lambda_k \ne 0$, and $\{r_t = \log(t),\; r_{t-k} = \log(t-k)\}$ if $\lambda_k = 0$
(Nunez-Anton and Zimmerman, 2000). The ϕ and ω parameters may be adjusted to account
for unevenly spaced times located at points $\{a_1, \dots, a_T\}$ (Jaffrezic et al., 2004).
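A small sketch of the Box–Cox power law for the antedependence coefficients may help fix ideas. It assumes the usual Box–Cox transform of time, r_t = (t^λ − 1)/λ with the log limit at λ = 0, and a positive base coefficient φ_k so that the power is real-valued; the function names are hypothetical.

```python
import math


def box_cox_time(t, lam):
    """Box-Cox transform of a positive time point t."""
    return math.log(t) if lam == 0 else (t ** lam - 1.0) / lam


def ad_coefficient(phi_k, t, k, lam_k):
    """Time-varying lag-k coefficient phi_k ** (r_t - r_{t-k}).

    Requires t - k >= 1 and phi_k > 0.
    """
    return phi_k ** (box_cox_time(t, lam_k) - box_cox_time(t - k, lam_k))


# With lambda = 1 the transformed times are equally spaced, so the lag-1
# coefficient is constant in t; other values of lambda let it drift with t.
for lam in (1.0, 0.5, 0.0):
    print(lam, [round(ad_coefficient(0.8, t, 1, lam), 4) for t in range(2, 7)])
```

With λ < 1 the transformed gap r_t − r_{t−k} shrinks as t grows, so for 0 < φ_k < 1 the dependence on earlier values strengthens over time.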
$$y_t = \phi_0 + \phi_1 y_{t-1} + u_t,$$
$$u_t \sim N(0, \sigma^2),$$
with priors
$$\phi_1 \sim N(0, 1),$$
$$\phi_0 \sim N(0, 20),$$
$$\sigma \sim N^{+}(0, 1).$$
The estimates (posterior mean and standard deviation) for ϕ0, ϕ1, and σ are respectively 12.6 (1.25),
0.41 (0.06), and 0.143 (0.006). The LOO-IC is −255, with Figure 5.1A showing the extreme
pointwise LOO-IC associated with certain observations.
The random-coefficient AR1 specifies
$$y_t = \phi_0 + \phi_{1t}\, y_{t-1} + u_t,$$
$$u_t \sim N(0, \sigma^2),$$
$$\phi_{1t} = \mu_{\phi 1} + \sigma_{\phi 1} e_t,$$
$$e_t \sim N(0, 1),$$
where the parameterisation of ϕt follows Gelman et al. (2014), and provides improved
MCMC sampling via rstan [2]. For extreme outliers, such as at t = 120 and t = 227, the
mean likelihoods are higher under this model. However, the overall LOO-IC rises to
−250, with the improved fit per se (lower ELPD-LOO) offset by a higher complexity
measure (113 vs 9). The WAIC (widely applicable information criterion) also favours
the simpler model (−256 as against −245). The parameter σϕ1 has a mean of 0.0034, with
posterior mean ϕ1t varying from 0.401 to 0.423.
A GARCH(1,1) model (section 5.5) specifies
$$y_t = \phi_0 + u_t,$$
$$u_t \sim N(0, \sigma_t^2),$$
This provides evidence of volatility, as in Figure 5.1B, but the LOO-IC deteriorates to
−205. The α1 and β1 coefficients have skew posterior densities with respective means
(medians) of 0.17 (0.07) and 0.67 (0.87).
Finally, an AR1 lag in y is added to the GARCH(1,1), namely
$$y_t = \phi_0 + \phi_1 y_{t-1} + u_t,$$
$$u_t \sim N(0, \sigma_t^2),$$
This provides a LOO-IC of −260, a slight improvement on the basic AR(1) model. The
lagged effect of $u_{t-1}^2$ is now virtually eliminated, with posterior means (medians) for β1
and ϕ of 0.58 (0.65) and 0.38 (0.39).
FIGURE 5.1
(A) Pointwise LOO-IC. Fixed Coefficient AR1 Model. (B) Posterior Mean σ. GARCH(1,1) Model.
idea that a time series is composed of several unobserved components contrasts with
Box–Jenkins or ARMA methods that require differencing to eliminate trend or periodic
effects and achieve stationary means and variances (Durbin, 2000, p.2). ARMA models are
selected using autocorrelation and partial autocorrelation functions that are subject
to sampling variability, and quite different models can provide similar fits for the same
series. In fact, ARMA sequences can be represented as particular instances of state-space
models with implicit components. Among informative discussions on state-space vs Box–
Jenkins methods, see Durbin and Koopman (2001, p.51) and Harvey and Todd (1983).
The normal linear state-space specification, or dynamic linear model, has the form
$$y_t = X_t' \beta_t + u_t,$$
$$\beta_t = G_t \beta_{t-1} + w_t,$$
with Xt being a p × 1 design matrix (typically including an intercept), and Gt defining a p × p
state evolution matrix. The normal errors ut and wt are independent of each other, with
mean zero and variances Vt and Wt (or covariances for multivariate y). The initial state vector
or initial condition has a separate (e.g. normal) prior such as $\beta_1 \sim N(m_1, W_1)$ (Strickland
et al., 2008), where m1 and W1 are typically preset (e.g. W1 is set large, in line with diffuse
expectations). Often Gt has a simple form, such as an identity matrix. For the case Gt = G,
Gamerman (1998) mentions an inverse parameterisation consequent on taking
$$\delta_1 = \beta_1, \quad \delta_t = \beta_t - G \beta_{t-1},$$
so that
$$\beta_t = \sum_{j=1}^{t} G^{t-j} \delta_j.$$
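The inverse relationship between the states and the increments can be checked with a short sketch for a scalar state and scalar G (an assumption made only to keep the example free of matrix code; the function names are illustrative).

```python
def states_to_increments(beta, G):
    """delta_1 = beta_1 and delta_t = beta_t - G * beta_{t-1} for t >= 2."""
    return [beta[0]] + [beta[t] - G * beta[t - 1] for t in range(1, len(beta))]


def increments_to_states(delta, G):
    """beta_t = sum_{j=1}^{t} G^(t-j) delta_j (0-based indexing in the code)."""
    return [sum(G ** (t - j) * delta[j] for j in range(t + 1))
            for t in range(len(delta))]


beta = [1.0, 1.4, 0.9, 1.7]  # made-up state sequence
delta = states_to_increments(beta, 0.9)
print(delta)
print(increments_to_states(delta, 0.9))  # recovers beta up to rounding
```

Successive states are typically highly correlated a posteriori, whereas the increments are much less so, which is the motivation for sampling on the increment scale.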
$$p(\beta_t | D_{t-1}) = \int p(\beta_t | \beta_{t-1})\, p(\beta_{t-1} | D_{t-1})\, d\beta_{t-1},$$
$$p(y_t | D_{t-1}) = \int p(y_t | \beta_t)\, p(\beta_t | D_{t-1})\, d\beta_t,$$
For the linear normal model with Vt = V , Wt = W , sequential updating provides posteriors
$$\beta_t | D_t \sim N(m_t, C_t),$$
where
$$
\begin{aligned}
a_t &= G_t m_{t-1}, \\
e_t &= y_t - X_t' a_t, \\
m_t &= a_t + A_t e_t, \\
C_t &= R_t - A_t A_t' q_t, \\
R_t &= G_t C_{t-1} G_t' + W, \\
q_t &= X_t' R_t X_t + V, \\
A_t &= R_t X_t / q_t.
\end{aligned}
$$
The one step ahead state and observation predictive densities are normal densities, namely,
$$(\beta_t | D_{t-1}) \sim N(a_t, R_t),$$
$$(y_t | D_{t-1}) \sim N(X_t' a_t, q_t).$$
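The updating recursions can be sketched for the simplest case of a scalar state, scalar predictor X and constant G, V and W. The code below is a minimal illustration rather than a general implementation: it assumes the standard adaptive coefficient A_t = R_t X_t / q_t, takes the starting values m0 and C0 (the mean and variance of the state before the first observation) as user-supplied assumptions, and uses made-up data.

```python
def kalman_filter(y, G, X, V, W, m0, C0):
    """Scalar forward filter returning (m_t, C_t, a_t, q_t) for each t."""
    m, C = m0, C0
    out = []
    for yt in y:
        a = G * m              # prior mean of the state at t
        R = G * C * G + W      # prior variance of the state at t
        q = X * R * X + V      # one step ahead forecast variance
        A = R * X / q          # adaptive coefficient (assumed standard form)
        e = yt - X * a         # one step ahead forecast error
        m = a + A * e          # posterior mean
        C = R - A * A * q      # posterior variance
        out.append((m, C, a, q))
    return out


y = [1.0, 1.3, 0.8, 1.9, 2.2]  # made-up observations
for m_t, C_t, a_t, q_t in kalman_filter(y, G=1.0, X=1.0, V=1.0, W=1.0, m0=0.0, C0=1.0):
    print(round(m_t, 4), round(C_t, 4))
```

With G = X = 1 this is the filter for the local level model of Section 5.3.1, and the posterior variance C_t settles quickly to a constant determined by the ratio W/V.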
5.3.1 Simple Signal Models
As an illustration of a normal state-space or dynamic linear model, assume that observa-
tions yt are obtained with measurement error and in fact generated by a relatively smooth
underlying signal βt. This is a hierarchical model – analogous to the normal-normal model
of Chapter 4 – with the first level being the observation equation, the second level being
the state equation, and the priors on the variances and initial conditions defining hyper-
parameters at the third stage (Berliner, 1996). Assuming iid measurement errors ut, one has
an observation or measurement equation
$$y_t = \beta_t + u_t, \tag{5.1.1}$$
and a state equation
$$\beta_t = \beta_{t-1} + w_t, \tag{5.1.2}$$
for t = 2, …, T. This is also known as a local level model (Durbin and Koopman, 2001), or
random walk plus noise model (Durbin, 2000), and the second stage is a nonstationary first
order random walk or RW(1) prior, corresponding to the unit root case of an AR(1) prior.
As for the AR(1) prior, future values of the signal depend on (βt, βt−1,…, β1) only through
the current value βt. Denoting β[t] = (β1,…, βt−1, βt+1,…, βT), the conditional form of the RW(1) prior is
$$
p(\beta_t | \beta_{[t]}, y) \propto
\begin{cases}
p(\beta_2 | \beta_1)\, p(\beta_1)\, p(y_1 | \beta_1) & t = 1 \\
p(\beta_{t+1} | \beta_t)\, p(\beta_t | \beta_{t-1})\, p(y_t | \beta_t) & t = 2, \dots, T-1 \\
p(\beta_T | \beta_{T-1})\, p(y_T | \beta_T) & t = T
\end{cases}
$$
so that for times t = 2, …, T − 1 there is averaging over preceding and following states. The
first period signal (initial condition) β1 is typically taken as an unknown fixed effect with
large variance, while the observation error ut and state error wt are taken as respectively
N(0, V) and N(0, W), and assumed uncorrelated in time, independent of one another, and also
independent of the signal βt.
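Assume β1 ~ N(b1, S1), 1/V ~ Ga(au, bu), and 1/W ~ Ga(aw, bw). Then the full conditionals are
$$\beta_1 \sim N\left( \left[\frac{\beta_2}{W} + \frac{b_1}{S_1} + \frac{y_1}{V}\right] \left[\frac{1}{W} + \frac{1}{S_1} + \frac{1}{V}\right]^{-1}, \left[\frac{1}{W} + \frac{1}{S_1} + \frac{1}{V}\right]^{-1} \right),$$
$$\beta_t \sim N\left( \left[\frac{\beta_{t+1} + \beta_{t-1}}{W} + \frac{y_t}{V}\right] \left[\frac{2}{W} + \frac{1}{V}\right]^{-1}, \left[\frac{2}{W} + \frac{1}{V}\right]^{-1} \right), \quad t = 2, \ldots, T-1,$$
$$\beta_T \sim N\left( \left[\frac{\beta_{T-1}}{W} + \frac{y_T}{V}\right] \left[\frac{1}{W} + \frac{1}{V}\right]^{-1}, \left[\frac{1}{W} + \frac{1}{V}\right]^{-1} \right),$$
$$1/V \sim \mathrm{Ga}\left( a_u + \frac{T}{2},\; b_u + 0.5 \sum_{t=1}^{T} (y_t - \beta_t)^2 \right),$$
$$1/W \sim \mathrm{Ga}\left( a_w + \frac{T-1}{2},\; b_w + 0.5 \sum_{t=2}^{T} (\beta_t - \beta_{t-1})^2 \right).$$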
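The full conditionals above translate directly into a single-site Gibbs sampler. The following is a minimal sketch in Python using only the standard library, not the book's own code: the function name `gibbs_local_level`, the default hyperparameters, and the starting values are illustrative assumptions. It assumes a series of length at least two.

```python
import random

def gibbs_local_level(y, n_iter=2000, a_u=1.0, b_u=1.0, a_w=1.0, b_w=1.0,
                      prior_mean=0.0, prior_var=1000.0, seed=1):
    """Single-site Gibbs sampler for y_t = beta_t + u_t, beta_t = beta_{t-1} + w_t.

    prior_mean and prior_var play the roles of b_1 and S_1 (assumed defaults).
    """
    rng = random.Random(seed)
    T = len(y)
    beta = list(y)            # initialise the states at the observations
    V, W = 1.0, 1.0           # arbitrary starting variances
    draws = {"beta": [], "V": [], "W": []}
    for _ in range(n_iter):
        for t in range(T):
            # Accumulate the precision and the precision-weighted mean.
            prec = 1.0 / V
            num = y[t] / V
            if t == 0:                       # prior on the initial state
                prec += 1.0 / prior_var
                num += prior_mean / prior_var
            else:                            # link to the preceding state
                prec += 1.0 / W
                num += beta[t - 1] / W
            if t < T - 1:                    # link to the following state
                prec += 1.0 / W
                num += beta[t + 1] / W
            beta[t] = rng.gauss(num / prec, (1.0 / prec) ** 0.5)
        # Gamma updates for the precisions; gammavariate takes (shape, scale).
        ss_u = sum((y[t] - beta[t]) ** 2 for t in range(T))
        ss_w = sum((beta[t] - beta[t - 1]) ** 2 for t in range(1, T))
        V = 1.0 / rng.gammavariate(a_u + T / 2.0, 1.0 / (b_u + 0.5 * ss_u))
        W = 1.0 / rng.gammavariate(a_w + (T - 1) / 2.0, 1.0 / (b_w + 0.5 * ss_w))
        draws["beta"].append(list(beta))
        draws["V"].append(V)
        draws["W"].append(W)
    return draws
```

Single-site updating of this kind is simple to code but mixes slowly when successive states are highly correlated.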
Higher order random walks in the signal are another possibility, with a kth order random
walk having prior
$$\Delta^k \beta_t \sim N(0, W)$$
(Berliner, 1996; Kitagawa and Gersch, 1996; Fahrmeir and Lang, 2001). For example, a
second difference random walk or RW(2) prior specifies yt = βt + ut and state equation
Δ²βt = wt. Hence
$$\beta_t \sim N(2\beta_{t-1} - \beta_{t-2}, W).$$
Whereas first order random walks penalise abrupt jumps between successive values, the
RW(2) prior penalises deviations from a linear trend. The RW(2) and higher order RW
priors therefore lead to a smoother evolution of βt through time. This is relevant not just
to time series, but to processes operating on other time scales (e.g. age, cohort), for exam-
ple, in survival analysis or in graduating (smoothing) demographic schedules (Carlin and
Klugman, 1993).
5.3.2 Sampling Schemes
Different MCMC sampling schemes have been proposed for state-space models according
to the form of outcome (e.g. metric or discrete) and the form of the observation-state equa-
tions (e.g. linear or nonlinear). Multi-state or joint sampling of the state vectors βt is gener-
ally more efficient than single-state sampling that updates one state parameter vector at a
time (Knorr-Held, 1999). Joint sampling for β when y is metric is discussed by Carter and
Kohn (1994) and Fruhwirth-Schnatter (1994), while de Jong and Shephard (1995) focus on
sampling the ut and wt error series, as opposed to the state effects βt; recent overviews are
provided by Reis et al. (2006) and Simpson et al. (2017). Gamerman (1998) proposes updat-
ing via the δt rather than the usually highly correlated βt using the re-parameterisation
mentioned above.
Knorr-Held (1999) uses properties of the penalty (inverse covariance) matrix of the joint
density for the state vectors as a basis for sampling sub-blocks of the elements (β1,…, βT).
Thus Gaussian state-space priors can be written in joint form as
$$p(\beta_1, \ldots, \beta_T \mid W) \propto \exp\left( -\frac{\beta' K \beta}{2W} \right),$$
where the penalty matrix K is determined by the form of autoregressive prior. For a first
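order random walk with βt ~ N(βt−1, W), the penalty matrix is
$$K = \begin{pmatrix} 1 & -1 & & & & \\ -1 & 2 & -1 & & & \\ & -1 & 2 & -1 & & \\ & & \ddots & \ddots & \ddots & \\ & & & -1 & 2 & -1 \\ & & & & -1 & 1 \end{pmatrix},$$
while for a second order random walk, with βt ~ N(2βt−1 − βt−2, W), it is
$$K = \begin{pmatrix} 1 & -2 & 1 & & & & \\ -2 & 5 & -4 & 1 & & & \\ 1 & -4 & 6 & -4 & 1 & & \\ & \ddots & \ddots & \ddots & \ddots & \ddots & \\ & & 1 & -4 & 6 & -4 & 1 \\ & & & 1 & -4 & 5 & -2 \\ & & & & 1 & -2 & 1 \end{pmatrix}.$$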
For an RW(p) prior at equally spaced time points, the elements of the matrix K (apart from
edge effects) are expressible as
$$k_{ij} = (-1)^{i-j} \binom{2p}{p - |i-j|} \quad \text{if } |i-j| \leq p,$$
and kij = 0 otherwise.
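One way to check both the displayed matrices and the binomial expression is to build K as D'D, where D is the matrix of p-th order differences, so that β'Kβ is the sum of squared p-th differences. The sketch below (standard-library Python; the function names are illustrative, not from the book) does this and compares interior rows with the formula.

```python
from math import comb

def difference_matrix(T, p):
    """(T - p) x T matrix D whose rows take p-th differences of beta."""
    coefs = [(-1) ** (p - k) * comb(p, k) for k in range(p + 1)]
    D = [[0] * T for _ in range(T - p)]
    for i in range(T - p):
        for k, c in enumerate(coefs):
            D[i][i + k] = c
    return D

def penalty_matrix(T, p):
    """K = D'D for an RW(p) prior at T equally spaced points."""
    D = difference_matrix(T, p)
    return [[sum(D[r][i] * D[r][j] for r in range(T - p)) for j in range(T)]
            for i in range(T)]

def interior_element(i, j, p):
    """Binomial expression for k_ij, valid away from the edges."""
    d = abs(i - j)
    return (-1) ** d * comb(2 * p, p - d) if d <= p else 0

K1 = penalty_matrix(6, 1)   # first rows: [1, -1, 0, ...], [-1, 2, -1, ...]
K2 = penalty_matrix(8, 2)   # first rows: [1, -2, 1, ...], [-2, 5, -4, 1, ...]
```

Rows p+1 to T−p (in one-based indexing) agree with the binomial expression; the first and last p rows carry the edge effects.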
Let βab denote the subvector (βa, βa+1,…, βb) of state effects, and Kab denote the correspond-
ing submatrix of K. Let K1,a−1 and Kb+1,T denote the submatrices to the left and right of Kab,
namely
$$K = \begin{pmatrix} & K'_{1,a-1} & \\ K_{1,a-1} & K_{ab} & K_{b+1,T} \\ & K'_{b+1,T} & \end{pmatrix}.$$
Then the conditional density for βab given β1,a−1, βb+1,T and W is normal, $\beta_{ab} \sim N(\nu_{ab}, W K_{ab}^{-1})$,
where
$$\nu_{ab} = \begin{cases} -K_{ab}^{-1} K_{b+1,T}\, \beta_{b+1,T} & a = 1 \\ -K_{ab}^{-1} \left[ K_{1,a-1}\, \beta_{1,a-1} + K_{b+1,T}\, \beta_{b+1,T} \right] & a > 1,\ b < T \\ -K_{ab}^{-1} K_{1,a-1}\, \beta_{1,a-1} & b = T. \end{cases}$$
Using this density, a Metropolis-Hastings block sample may be used to update the full
conditional
$$p(\beta_{ab} \mid \cdot) \propto \prod_{t=a}^{b} p(y_t \mid \beta_t)\; p(\beta_{ab} \mid \beta_{b+1,T}, \beta_{1,a-1}, W).$$
This involves drawing a proposal $\beta^*_{ab}$ from $N(\nu_{ab}, W K_{ab}^{-1})$ with $\{\nu_{ab}, K_{ab}\}$ evaluated at the current
sampled values β and W in a chain, with the proposal accepted or rejected according
to a probability
$$\min\left( 1, \frac{\prod_{t=a}^{b} p(y_t \mid \beta_t^*)}{\prod_{t=a}^{b} p(y_t \mid \beta_t)} \right),$$
$$y_t = \beta_t + u_t,$$
$$\beta_t = \beta_{t-1} + \Delta_t + w_{1t},$$
$$\Delta_t = \Delta_{t-1} + w_{2t},$$
where Δt represents the changing slope of the trend. This provides the local linear trend
model or dynamic trend model (Fruhwirth-Schnatter, 1994).
A constant parameter Δ provides a linear trend, as in the Lee–Carter mortality forecasting
model considered by Pedroza (2006); this is sometimes known as a random walk with
drift. Other variations on the local linear model in (5.1) include autoregressive rather than
random walk state equations, such as
$$\beta_t = \phi \beta_{t-1} + w_t.$$
$$y_t = \mu_t + s_t + u_t,$$
$$\mu_t = \mu_{t-1} + \Delta_t + w_{1t},$$
$$\Delta_t = \Delta_{t-1} + w_{2t},$$
$$s_t + s_{t-1} + \cdots + s_{t-S+1} = w_{3t},$$
where S is the number of seasons, and wjt ~ N(0, Wj). Relevant R packages for estimating
the BSM include stsm (via maximum likelihood), and bsts and dlm (via Bayesian
estimation).
Fruhwirth-Schnatter (1994) sets out the full conditionals for this model under gamma
priors for the precisions 1/Wj. The last equation provides the time domain prior for sea-
sonal effects, whereas a frequency domain prior specifies
$$s_t = \sum_{j=1}^{[S/2]} s_{jt},$$
$$s_{jt} = s_{j,t-1} \cos(\lambda_j) + v_{j,t-1} \sin(\lambda_j) + w_{3t},$$
$$v_{jt} = -s_{j,t-1} \sin(\lambda_j) + v_{j,t-1} \cos(\lambda_j) + w_{4t},$$

$$y_t = \mu_t + c_t + u_t,$$
$$\mu_t = \mu_{t-1} + \Delta_t + w_{1t},$$
$$\Delta_t = \Delta_{t-1} + w_{2t},$$
5.3.4 Identification Questions
Identification issues in state-space random effect models occur for two main reasons. One
is that the mean or level of the state effects is not specified (rather the mean of pairwise
or higher order differences is specified). The other is the presence of multiple confounded
sources of random variation, as in the basic structural model with level and seasonal
effects, whereas the data can only identify the sum of the random effects ut + st. These
questions raise issues in MCMC sampling, for example, whether effects need to be centred
at each iteration, because an intercept (if included) will otherwise be confounded with the
means of the random effects.
To exemplify issues occurring due to the mean of the latent series consider the measure-
ment error with RW(1) signal model in 5.3.1. The state equation can be stated as
$$\Delta\beta_t = \beta_t - \beta_{t-1} \sim N(0, W),$$
so the prior only defines a level for differences in βt, but the level of the (undifferenced) βt
is not defined by the prior. If the model for yt does not have a separate intercept parameter,
the level of the βt will be identified by the level of the yt. Suppose though that the observa-
tion equation includes a separate constant γ0 with
$$y_t = \gamma_0 + \beta_t + u_t.$$
Then γ0 and the mean of the βt are confounded and for identification one may apply a
centring or corner constraint to the βt. An identifying corner constraint involves setting a
single βt to a known value; taking the initial condition β1 to have a known value, e.g. β1 = 0,
is one option (Clayton, 1996). By contrast, if the initial conditions (β1 in an RW(1) prior, β1
and β2 in an RW(2) prior, etc.) are taken as unknowns, then a centring constraint may be
applied at each MCMC iteration, so that the centred βt satisfy $\sum_{t=1}^{T} \beta_t = 0$.
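The confounding between γ0 and the level of the βt, and the effect of centring, can be illustrated numerically. The sketch below uses made-up values and hypothetical helper names; it shows that moving the mean of the βt into the intercept leaves every residual, and hence the likelihood and the RW(1) increments, unchanged.

```python
def residuals(y, gamma0, beta):
    """Residuals y_t - gamma0 - beta_t, which are all the likelihood depends on."""
    return [yt - gamma0 - bt for yt, bt in zip(y, beta)]

def centre(gamma0, beta):
    """Shift the mean of beta into the intercept so that sum(beta) = 0."""
    m = sum(beta) / len(beta)
    return gamma0 + m, [b - m for b in beta]

# Illustrative (made-up) values for one MCMC iteration.
y = [5.1, 5.4, 5.2, 5.9, 6.3]
gamma0, beta = 2.0, [3.0, 3.3, 3.2, 3.8, 4.1]

gamma0_c, beta_c = centre(gamma0, beta)
increments = [beta[t] - beta[t - 1] for t in range(1, len(beta))]
increments_c = [beta_c[t] - beta_c[t - 1] for t in range(1, len(beta_c))]
```

Because both the residuals and the increments are invariant to the shift, the data and the RW(1) prior alone cannot separate γ0 from the mean of the βt; the constraint supplies the missing information.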
As in other models with multiple sources of random variation, priors on the variance
components in state-space models may affect inferences. This is not simply a matter of
selecting prior densities for scale parameters, but also a question of how such priors
influence the partitioning of total random variation. One may recognise the interdependence
between variance components using devices such as uniform priors on shrinkage
ratios B = V/(V + W) combined with a prior on V or V + W (Daniels, 1999). Alternatively (V, W)
may be reparameterised as (V, qV), where q is a signal to noise ratio. The prior on q might then
be centred on 1 in line with a prior belief that signal and observation variances are equal.
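The two reparameterisations can be sketched as simple mappings back to (V, W). In the snippet below the function names are hypothetical, and the Ga(2, 1) prior on 1/V and the trimmed uniform bounds are purely illustrative assumptions rather than recommendations from the text.

```python
import random

def variances_from_shrinkage(B, V):
    """Invert B = V / (V + W) to recover the state variance W."""
    return V, V * (1.0 - B) / B

def variances_from_snr(q, V):
    """Signal to noise form: W = q * V."""
    return V, q * V

rng = random.Random(42)
B = rng.uniform(0.01, 0.99)                # uniform prior on B (bounds trimmed)
V = 1.0 / rng.gammavariate(2.0, 1.0)       # assumed Ga(2, 1) prior on 1/V
V, W = variances_from_shrinkage(B, V)
```

Under either device, a single draw of V together with B or q determines W, which is what induces prior dependence between the two variances.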
These approaches extend to models with competing sources of variation in the state
equation. Consider the three errors wjt (for levels, slopes, and seasonals) in the basic structural
model. Denoting Wj = Var(wjt) and V = Var(ut), one may set Wj = qjV where qj are
signal to noise ratios (Koopman, 1993; Harvey, 1989, p.33). One may then set priors on the
qj separately (e.g. separate gammas), or jointly; for example, via a multivariate normal on
the log(qj). Another option is a prior on V and uniform priors on the ratios V/(V + Wj). Such
devices amount to assuming prior correlation between the respective variances.
An alternative approach to ensure stable identification is to set informative priors on the
variance of each random walk, possibly based on expected stochastic variation around
a deterministic trend. For example, following Berzuini and Clayton (1994), for counts
yt ~ Po(λt), consider a second order random walk for βt = log(λt),
$$\beta_t = 2\beta_{t-1} - \beta_{t-2} + w_t;$$
then the value W = 0 for Var(wt) corresponds to a log-linear deterministic relationship
between the λt and time. To allow for stochastic variation, one may assume $\nu W^*/W \sim \chi^2_\nu$,
or equivalently
$$1/W \sim \mathrm{Ga}\left( \frac{\nu}{2}, \frac{W^* \nu}{2} \right),$$
where W* is a prior setting for W, and higher values of ν represent stronger degrees of
belief in that setting. For example, taking W* = 0.01 corresponds to assuming a 95% prob-
ability that λt will be within −18 and +22% of a log-linear extrapolation from βt−1.
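The quoted interval can be verified with a short calculation: with W* = 0.01 the innovation wt has standard deviation 0.1 on the log scale, so a central 95% interval is exp(±1.96 × 0.1) in multiplicative terms. A sketch of the arithmetic follows (variable names are illustrative).

```python
from math import exp, sqrt

W_star = 0.01
sd = sqrt(W_star)                       # standard deviation of w_t on the log scale
lower = exp(-1.96 * sd) - 1.0           # proportional shortfall below the extrapolation
upper = exp(1.96 * sd) - 1.0            # proportional excess above the extrapolation
print(round(100 * lower), round(100 * upper))   # → -18 22
```

The asymmetry of the interval arises because a symmetric interval on the log scale is being transformed back to the scale of λt.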
The single source of error approach (Ord et al., 2005) may also assist in achieving parsi-
mony, and in resolving the partitioning of variance between multiple sources of variation
in unobserved component models. Thus, the local linear trend model in multiple source of
error (MSOE) form is
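$$y_t = \mu_t + u_t,$$
$$\mu_t = \mu_{t-1} + \Delta_t + w_{1t},$$
$$\Delta_t = \Delta_{t-1} + w_{2t},$$
while the single source of error (SSOE) form is
$$y_t = \mu_t + u_t,$$
$$\mu_t = \mu_{t-1} + \Delta_t + \lambda_1 u_t,$$
$$\Delta_t = \Delta_{t-1} + \lambda_2 u_t,$$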
where λ1 and λ2 are loadings. By contrast to the MSOE scheme, the state and observation
errors are now correlated.
$$y_t = \mu_t + s_t + u_t,$$
$$\mu_t = \mu_{t-1} + \Delta_t + w_{1t},$$
$$\Delta_t = \Delta_{t-1} + w_{2t},$$
$$s_t + s_{t-1} + \cdots + s_{t-S+1} = w_{3t},$$
with $w_{jt} \sim N(0, \sigma_{j+1}^2)$, and normal observation errors $u_t \sim N(0, \sigma_1^2)$. Half t(0,1) priors with
4 degrees of freedom are assumed on the σj. For t = 1, the μt and Δt series refer to pre-
series values which are assigned N(0,10) priors.
Convergence is obtained readily in a two-chain run of 5000 iterations using rstan,
with a LOO-IC of 44.1. The posterior means (medians) of the σj are 0.202 (0.202), 0.174
(0.173), 0.0036 (0.0025) and 0.0136 (0.0118).
Figure 5.2A–C show respectively the clear seasonal variations, the generally
upward trend in the slope parameters Δt (though most evident in the early part of
the period), and the combined level and trend. These series all include forecasts for
nine extra months through to the end of 2014. A similar slope trajectory is estimated
[Figure 5.2: four panels (a) to (d), each plotted against Month (January 2000 to January 2015); vertical axes labelled Numbers (mill).]
FIGURE 5.2
(A) Passenger numbers, seasonal effects. (B) Passenger numbers, trend effects. (C) Passenger numbers, level
effects. (D) Passenger numbers model, pointwise LOO-IC.
using the R program rucm. Reversals to the broad upward trend in modelled pas-
senger numbers, as in Figure 5.2C, reflect especially the recession of 2008–09, as
well as more distinct outliers for individual months. An examination of the point-
wise LOO-IC, as in Figure 5.2D, shows the most discrepant month (t = 136) to be
April 2010, reflecting the impact on flights of the Eyjafjallajökull volcanic eruption
in Iceland.
To alleviate the impact of outlier values, a Student t observation model is also estimated,
namely $u_t \sim t(0, \nu, \sigma_1^2)$. The unknown degrees of freedom ν is assigned an E(0.1)
prior. This provides an improved LOO-IC of 16 with a posterior mean (median) of ν of
2.48 (2.33), with the posterior mean (median) for σ1 reduced to 0.093 (0.092).
Estimation of the basic structural model is also straightforward with R-INLA,
with the simplest code involving a random effect that combines level and trend. The
pointwise WAIC from a normal errors-based model reproduces the extreme outlier
at t = 136.
$$y_{jt} = M_t + bt + u_{jt},$$
$$u_{jt} \sim N(0, \sigma_u^2),$$
$$M_t \sim N(M_{t-1}, \sigma_M^2) \quad (t > 1),$$
with the initial condition M1 assigned a diffuse N(6900,10000) prior. A gamma prior is
assumed for $\xi = \tau_u + \tau_M = 1/\sigma_u^2 + 1/\sigma_M^2$, so with $\kappa = \tau_u/\xi \sim U(0, 1)$, one obtains τu = κξ and
τM = (1 − κ)ξ. Patwardhan and Small mention that compilations of trends in relative sea
level data suggest an upward trend of 0.5–3.0 mm/year, so a N(0,1) prior on b seems
reasonable.
For improved identification and convergence, the Mt series are differenced with
respect to M1, namely Δt = Mt − M1, and a level parameter β0 is introduced. So the Mt
are effectively represented as Δt + β0, and the observation model is $y_{jt} = \beta_0 + \Delta_t + bt + u_{jt}$.
Convergence is much delayed without using this re-expression. An alternative device is
centring, whereby $\Delta_t = M_t - \bar{M}$.
An alternative model (model 2) allowing site-specific linear trends is considered,
namely
$$y_{jt} = M_t + b_j t + u_{jt},$$
$$M_t \sim N(M_{t-1}, \sigma_M^2),$$
$$u_{jt} \sim N(0, \sigma_u^2),$$
$$b_j \sim N(\mu_b, \sigma_b^2),$$
$$\mu_b \sim N(0, 1).$$
A relatively informative exponential E(1) prior for $1/\sigma_b^2$ is adopted, as diffuse options
lead to delayed convergence. The same identification strategy as under model 1 is
adopted for the Mt series.
[Figure 5.3: Mean Sea Level by year, 1900 to 2000, vertical axis from 6900 to 7120; series shown are the posterior mean and the 2.5% and 97.5% limits.]
FIGURE 5.3
Modelled global sea level.
For model 1, a two-chain run using jagsUI converges after 20,000 iterations. There is a
mean (95% CRI) linear growth rate b of 0.85 (−0.08, 1.54). Variation around the random
level component Mt is comparatively small, with the posterior median of $\sigma_M^2$ standing
at 7.6, compared to a median of 2005 for $\sigma_u^2$. A posterior predictive p-test based on squared
deviations is satisfactory. As to fit, let yrep,jt be replicate data from the model. Then a
posterior predictive loss criterion is calculated (within the observed data period to 1980,
and with k = 1000) as
$$\mathrm{PPL} = \frac{1}{R} \frac{k}{k+1} \sum_{t=1}^{80} \sum_{j=1}^{5} \sum_{r=1}^{R} \left( y^{(r)}_{\mathrm{rep},jt} - y_{jt} \right)^2 + \sum_{t=1}^{80} \sum_{j=1}^{5} V(y_{\mathrm{rep},jt}).$$
The respective components are obtained as 1629 and 1054 (in units of 1000), with the
second a measure of complexity.
For model 2, a two-chain analysis with jagsUI converges after 10,000 iterations. This
analysis gives a mean (95% interval) for μb of 0.78 (−0.15,1.68), while site level mean
growth rates range from 0.30 to 1.89. Variation in the random components (Mt and bj) is
such as to reduce the median σ2 to 1378. The respective PPL components are also much
reduced, namely to 1116 and 731. Figure 5.3 shows the evolution of modelled global sea
level to 1980 and forecasts thereafter.
totals, and an index of abundance, At, such as the catch per unit effort, though such indices
are often imperfect measures (Maunder et al., 2006).
A widely applied population dynamics model is the logistic function of Schaefer (1954)
whereby biomass at period t + 1, Bt+1, is represented as
where $e_t^p$ are multiplicative errors, and g(Bt) represents “surplus production” as a function
of biomass. Thus, one of the observation series, namely Ct, appears in the process model.
The Schaefer model involves three parameters, r, K, and q, which can be interpreted respec-
tively as the maximum intrinsic growth rate, the arithmetic mean biomass at unexploited
equilibrium (or carrying capacity), and a catchability parameter (or proportionality con-
stant). These parameters define the surplus production function, namely
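$$g(B_t) = rB_t(1 - B_t/K),$$
while the abundance index is related to biomass through the observation model
$$A_t = qB_t e_t^o.$$
Additional parameters are the variances $\sigma_o^2 = \mathrm{var}(e_t^o)$ and $\sigma_p^2 = \mathrm{var}(e_t^p)$ of the observation
and process models. Identification may be improved by using several abundance indices,
so that
$$A_{jt} = q_j B_t e_{jt}^o.$$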
Typically, lognormal likelihoods are adopted for both the process and observation models.
Derived parameters of interest include the maximum sustainable yield, MSY = rK/4.
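The expression MSY = rK/4 follows because g(B) = rB(1 − B/K) is maximised at B = K/2. The sketch below checks this numerically and projects a deterministic biomass path. The process equation used, B(t+1) = B(t) + g(B(t)) − C(t) with the multiplicative error omitted, is an assumed standard form of the Schaefer model, and the parameter values and function names are illustrative only.

```python
def surplus_production(B, r, K):
    """Schaefer surplus production g(B) = r * B * (1 - B / K)."""
    return r * B * (1.0 - B / K)

def project_biomass(B0, catches, r, K):
    """Deterministic skeleton (assumed form): B_{t+1} = B_t + g(B_t) - C_t."""
    path = [B0]
    for C in catches:
        nxt = path[-1] + surplus_production(path[-1], r, K) - C
        path.append(max(nxt, 1e-9))      # keep biomass positive
    return path

r, K = 0.3, 1000.0                       # illustrative values, not estimates
msy = r * K / 4.0
# Largest surplus production over a grid of biomass values 0, 1, ..., K.
grid_max = max(surplus_production(float(b), r, K) for b in range(0, 1001))
# Harvesting exactly MSY from B = K/2 keeps biomass at K/2.
path = project_biomass(K / 2.0, [msy] * 20, r, K)
```

In this deterministic sketch a constant catch above MSY would drive the path downwards from K/2, which is why the derived quantity is of management interest.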
Estimation of this model may well require informative priors for at least some of the
parameters. The data may contain relatively little information on the parameters, so the
prior may considerably influence the posterior. The literature has discussion about appro-
priate forms of prior, such as a uniform or lognormal prior for K, or uniform on log(K) (Punt
and Hilborn, 1997; McAllister, 2014). For q, McAllister (2014) suggests a uniform density for
log(q) over (−20,2), namely a diffuse prior concentrated on values under one, but including
values above one. By contrast, Parent and Rivot (2013) suggest a U(−20,20) prior on log(q),
while Rankin and Lemos (2015) assume log(q) ~ U(−20, −3). For the rate of natural increase,
r, there may be more substantive prior evidence, though Parent and Rivot (2013) adopt a
U(0.01,3) prior. Regarding the process and observation series variances, Parent and Rivot
(2013) propose a parameter λ governing the ratio σp/σo, but in an application, assume λ = 1
and adopt a diffuse U(−20,20) prior on the log of the common variance σ2. McAllister (2014)
discusses the basis for more informative priors on $\sigma_p^2$; for example, a value σp = 0.05 results
in an interannual change in total recruited stock biomass of about 5%. Rankin and Lemos
(2015) follow Parent and Rivot (2013) in adopting relatively diffuse priors except on K.
from 1976 to 2010. Relatively diffuse priors are adopted, following recommendations in
the literature (Parent and Rivot, 2013; Rankin and Lemos, 2015), except for K, where a
uniform prior, log(K) ~ U(3.92, 7.6) follows the ISC Shark Working Group (2013). Thus
for r, q, and σ², the priors are r ~ U(0.01, 3), log(q) ~ U(−20, 2), and log(σ²) ~ U(−10, 10).
The biomass series is expressed as
$$P_t = B_t/K$$
as in Meyer and Millar (1999). The lognormal prior on the initial condition P1 is as in
ISCSWG (2013).
In the first model, the process and observation priors are related according to σp = λσo,
with the prior for λ exponential, λ ~ E(1), centred at 1. Convergence using rstan is rapid,
with the LOO-IC pooling over process and observation likelihoods, as the process
likelihood involves observing the catch data. The LOO-IC is −99, with posterior mean
(median) estimates of the maximum sustainable yield (MSY) of 77.6 (68.1), higher than
the estimate of 52 in ISCSWG (2013). Estimates may be affected by the use in ISCSWG
(2013) of the extended Fletcher–Schaefer model, and by the inclusion in ISCSWG (2013)
of five earlier years when the abundance index was missing. The carrying capacity K
has posterior mean (median) of 1130 (1105). The mean for the At series can be modified
numerically by changing K or q, and samples of these parameters are negatively corre-
lated, a feature that might be used in setting a prior.
In a second model, it is assumed that σp = σo, and this model has a higher LOO-IC,
namely −89. Under this model, the posterior mean (median) estimates of the MSY are
73.5 (66.7). Both analyses show biomass at low levels in the late 1980s, as in Figure 8 of
ISCSWG (2013). Under the first model, the posterior median biomass for 1989 is 588,
compared to 1213 in 1976, and 994 in 2010.
with $\mu_t = E(y_t \mid \zeta_t) = b'(\zeta_t)$ and μt linked to a linear predictor ηt via a link function g, g(μt) = ηt.
Also a known scale parameter ϕt defines the conditional variance $\mathrm{Var}(y_t \mid \zeta_t) = b''(\zeta_t)/\phi_t$.
Then an observation equation for design matrix Xt of dimension p would typically be of
the form
$$g(\mu_t) = \eta_t = \beta_t X_t + u_t,$$
with state equation
$$\beta_t = \beta_{t-1} G_t + w_t,$$
where $w_t \sim N_p(0, W)$. The error ut ~ N(0, V) is not necessarily included for discrete
responses, but may be necessary to represent unstructured extra-variation.
An alternative state-space approach, sometimes termed a linear Bayes approach, involves
conjugate priors for the natural parameters and a guide relationship
$$h(\zeta_t) = \beta_t X_t,$$
linking the natural parameters to the state vector (West et al., 1985, p.74; Ferreira and
Gamerman, 2000, p.60). So with time specific parameters (gt,ht), the prior for the natural
parameter at time t is
As for normal linear state-space models, the state vector may include level, trend and sea-
sonal effects. For an underlying signal model (with Xt containing only an intercept), the
regression and state equations become (Kitagawa and Gersch, 1996, Ch 13),
$$g(\mu_t) = \beta_t + u_t,$$
$$\Delta^k \beta_t = w_t,$$
with ut ~ N(0, V), wt ~ N(0, W). Thus Kashiwagi and Yanagimoto (1992) consider Poisson
data on disease counts yt ~ Po(μt), and take k = 1 in the signal equation.
For binary data with πt = Pr(yt = 1), a signal may be combined with randomly time-varying
dependence on lagged responses (Cox, 1970), providing a parameter driven representation,
whereas an observation-driven model would only involve fixed effect coefficients on
lagged observed yt values (Wu and Cui, 2014). For example, a time-varying level and lag 1
effect could specify
Time series of categorical data vectors, namely yt = (yt1, yt2, …, ytJ) with only a single ytj = 1
if (say) diagnosis j applies, or mutually exclusive choices j made at time t, are multinomial
according to
Typically, a multiple logit link is assumed for the unknown probabilities πtj (Fahrmeir and
Tutz, 2001; Cargnoni et al., 1996). A signal model would then involve a (J − 1) dimensional
state vector, though by analogy to binary Markov dependence, the regression term ηtj for
the jth choice may also involve lags on both the same response yt−k,j, and lagged
cross-responses yt−k,m (m ≠ j). For a general predictor, possibly varying by category, Xtj, one has
where βtJ = 0 for identifiability. Cross-series borrowing of strength via random walk priors
may be applied for the J − 1 category specific state vectors βtj. Thus, for the coefficient on
predictor k, Xtjk, one might have
$$y_t^* = \beta_t X_t + u_t,$$
where yt* is positive or negative according as yt = 1 or yt = 0, and the variance of ut is assumed
known for identifiability, usually with var(ut) = 1. A simple signal model with Xt = 1 may
then be expressed as
$$y_t^* \mid W, y_t, \beta_t \propto N(\beta_t, 1)\, I(0, \infty) \quad \text{if } y_t = 1,$$
$$y_t^* \mid W, y_t, \beta_t \propto N(\beta_t, 1)\, I(-\infty, 0) \quad \text{if } y_t = 0,$$
$$\beta_t \sim N(\beta_{t-1}, W).$$
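Sampling the latent yt* under this augmented scheme amounts to drawing from a unit-variance normal truncated to the positive or negative half-line. A minimal sketch using the inverse-CDF method and the standard library's `statistics.NormalDist` follows; the function name `draw_latent` is hypothetical, and the clamping step is a numerical safeguard that is only adequate for moderate values of βt.

```python
import random
from statistics import NormalDist

def draw_latent(y_t, beta_t, rng):
    """Draw y*_t from N(beta_t, 1) truncated to (0, inf) if y_t = 1, else (-inf, 0)."""
    std = NormalDist()                   # standard normal
    p0 = std.cdf(-beta_t)                # Pr(y*_t < 0) under N(beta_t, 1)
    if y_t == 1:
        u = rng.uniform(p0, 1.0)         # upper part of the distribution
    else:
        u = rng.uniform(0.0, p0)         # lower part of the distribution
    # Keep inv_cdf away from 0 and 1 (adequate for moderate beta_t only).
    u = min(max(u, 1e-12), 1.0 - 1e-12)
    return beta_t + std.inv_cdf(u)

rng = random.Random(3)
positive_draws = [draw_latent(1, 0.3, rng) for _ in range(500)]
negative_draws = [draw_latent(0, 0.3, rng) for _ in range(500)]
```

Given the sampled yt*, the states βt can then be updated as in a normal linear state-space model with observation variance 1.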
5.4.1 Other Approaches
Other general schemes for modelling time series of exponential family data include the
generalised autoregressive moving average (GARMA) representation (Benjamin et al.,
2003; Li, 1994; Silveira de Andrade et al., 2015). The GARMA representation involves con-
ditional means μt, link function g(μt) = ηt, and regression term in the form
$$\eta_t = \gamma_t X_t + \sum_{j=1}^{p} \phi_j \left[ g(y_{t-j}) - \gamma_{t-j} X_{t-j} \right] + \sum_{k=1}^{q} \gamma_k \left[ g(y_{t-k}) - \eta_{t-k} \right].$$
For example, for Poisson data yt ~ Po(μt) and yt* = max(yt, m) for a small positive constant
m, one has
$$\log(\mu_t) = \gamma_t X_t + \sum_{j=1}^{p} \phi_j \left[ \log(y^*_{t-j}) - \gamma_{t-j} X_{t-j} \right] + \sum_{k=1}^{q} \gamma_k \left[ \log(y^*_{t-k} / \mu_{t-k}) \right].$$
More general autoregression in the state vector (not limited to random walks) may be
adopted. Thus Oh and Lim (2001) and Chan and Ledolter (1995) adopt an autocorrelated
error θt for count data with
$$y_t \sim \mathrm{Po}(e^{\eta_t}),$$
$$\eta_t = \theta_t + X_t \beta,$$
$$\theta_t = \rho \theta_{t-1} + w_t,$$
where ρ is constrained to stationarity, and the wt are normally distributed. Utazi (2017)
considers a variant of this model allowing a changepoint in the autoregressive param-
eter ρ.
Dependence on lagged counts can also be achieved by binomial thinning (Silva et al.,
2009), whereby
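$$\eta_t = \beta_t X_t + \rho y_{t-1},$$
$$y_t \sim \mathrm{Po}(e^{\eta_t} \kappa_t),$$
$$\kappa_t \sim \mathrm{Ga}\left( \frac{1}{c}, \frac{1}{c} \right),$$

$$\mu_t = \omega + \sum_{j=1}^{p} \phi_j y_{t-j} + \sum_{k=1}^{q} \gamma_k \mu_{t-k}$$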
with all parameters positive. Under an ACP(1,1) model, one therefore has
$$\mu_t = \omega + \phi y_{t-1} + \gamma \mu_{t-1}.$$
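The ACP(1,1) recursion is straightforward to evaluate for given parameter values. The sketch below is illustrative only: the function name and the choice to start the recursion at the stationary mean ω/(1 − ϕ − γ), which follows from taking expectations of both sides when ϕ + γ < 1, are assumptions rather than the book's implementation.

```python
def acp_means(y, omega, phi, gamma, mu0=None):
    """Conditional means mu_t = omega + phi * y_{t-1} + gamma * mu_{t-1}."""
    if mu0 is None:
        # Stationary mean; requires phi + gamma < 1 (assumed starting value).
        mu0 = omega / (1.0 - phi - gamma)
    mu = [mu0]
    for t in range(1, len(y)):
        mu.append(omega + phi * y[t - 1] + gamma * mu[t - 1])
    return mu

counts = [4, 6, 3, 5]                    # made-up counts
mu = acp_means(counts, omega=1.0, phi=0.3, gamma=0.5)
```

A value of γ close to 1 means that the conditional mean is dominated by its own past rather than by the most recent count.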
$$\mathrm{Var}(y_t) = \mu_t (D + \phi^2)/D \geq \mu_t,$$
$$\mu_t = \omega + \phi y_{t-1} + \gamma \mu_{t-1}.$$
The seatbelt intervention (St) is represented by a binary variable with values 1 from 1976
onwards, 0 before. The estimates from a two-chain run of 25,000 iterations using jagsUI
show most of the lag effect on μt operating through the conditional means, with γ hav-
ing posterior mean (sd) of 0.98 (0.002). Predictive checks (comparing replicates from the
posterior predictive distribution with actual observations), are satisfactory. The seat-
belt effect β is estimated as −0.62 (0.08), but the pointwise LOO values still show 1976
(and surrounding years) as poorly fit. In that year, the accident rate per million fell to
350 compared to 433 in the previous year. It may be noted that estimates of the LOO-IC
and WAIC (respectively 797 and 744) are unstable.
A second analysis adopts an antedependence approach, whereby yt ~ Po(μt),
log(μt) = log(Et) + βSt + γt, with γt following a first-order antedependence scheme,
whereby
$$\gamma_1 = \varepsilon_1,$$
$$\gamma_t = \phi_t \gamma_{t-1} + \varepsilon_t, \quad t \geq 2,$$

$$\mu_t = \rho_1 y_{t-1} + \exp(\log(E_t) + \beta S_t + \eta_t),$$
[Figure 5.4: vertical axis Relative Risks, from 0.5 to 3.0.]
FIGURE 5.4
Annual relative risks, Ontario accidents, posterior means.
$$\eta_t \sim N(\rho_2 \eta_{t-1}, \sigma_\eta^2),$$
with ρ1 ~ U(0, 1), and ρ2 ~ U(−1, 1). This model has satisfactory predictive checks, and
there is no significant correlation between successive errors $(y_t - \mu_t)/\mu_t^{0.5}$. However, the
LOO-IC and WAIC, at 811 and 754 are higher than for other models, with performance
being vitiated by discontinuities in the series, such as in 1937. The β coefficient has a
mean (95%CRI) of −0.33 (−1.23, −0.01).
Finally, we use R-INLA to estimate a model with random walk level, with yt ~ Po(μt),
$$w_t \sim N(w_{t-1}, \sigma_w^2).$$
This model provides a posterior mean and CRI for β of −0.22 (−0.41, −0.04), and a
WAIC of 751. R-INLA code for a model including trend as well as level can be written
using an augmented data representation (Ruiz-Cárdenas et al., 2012).
$$\mathrm{logit}(\pi_t) = \beta_1 + \beta_{2t} y_{t-1},$$
with β2t = β2 + γt, where the γt follow an RW1 prior, implemented using the car.normal
function in R2OpenBUGS. Under this function, the γt are centred at each iteration, leading to
improved identifiability. This provides a LOO-IC of 293. Figure 5.5 plots out the varying
AR coefficients β2t.
BARMA models may also be implemented via R2OpenBUGS, but rstan provides
considerably faster computation and convergence. We compare an autoregressive lag
5 BARMA(5,0) model, with AR coefficients following a horseshoe prior for parsimony,
with a BARMA(5,1) model including a moving average term. Generically
$$\mathrm{logit}(\pi_t) = \beta_1 + \sum_{j=1}^{p} \rho_j y_{t-j} + \sum_{k=1}^{q} \theta_k (y_{t-k} - \pi_{t-k}),$$
$$\lambda_j = (1/\kappa_j - 1),$$
$$\rho_j \sim N(0, \lambda_j \tau_\lambda),$$
[Figure 5.5: vertical axis Posterior mean β2, from −3.2 to −2.7.]
FIGURE 5.5
Old Faithful data. Varying AR1 regression coefficient
included in the likelihood. For example, the model at t = 1 refers to unobserved preseries
data points represented in the parameter $\varepsilon_1 = \sum_{j=1}^{p} \rho_j y_{1-j}$. The alternative is to condition
on the first five observations. In fact, only the AR1 coefficient plays a significant role,
with ρ1 having posterior mean (sd) of −2.56 (0.44), and with φ1 = 0.98. The LOO-IC for
this model deteriorates to 332.
A BARMA(5,1) model finds ρ2 and θ1 (beta[3] and beta[7] in the code) to be significant
with respective posterior means (sd) 1.39 (0.37) and −2.3 (0.67). The LOO-IC for this
model is 331.
5.5 Stochastic Variances
Many state-space applications assume constant variances in the observation and state
equations, but there is often nonstationarity in such variances (Omori and Watanabe, 2015;
Broto and Ruiz, 2004). Certain types of data such as exchange rate and share price series
rt are particularly likely to demonstrate volatility clustering (Granger and Machina, 2006),
with fluctuating variances Vt = var(rt). Typically, there are periods where volatility is rel-
atively high and periods where volatility is relatively low, often with relatively smooth
transition between high and low volatility regimes. In many applications, the series is
transformed to have an effectively zero mean (Meyer and Yu, 2000, p.200). For example,
the ratio of successive exchange rates rt/rt−1 has approximate average 1, so that a response
obtained as yt = log(rt/rt−1) can be taken to average zero. Hence, one may write a model
without intercept (or predictor effects) as
$$y_t = V_t^{0.5} u_t,$$

$$y_t = \sqrt{V_t}\, u_t = \exp\left( \frac{h_t}{2} \right) u_t, \tag{5.2}$$
$$h_t = \mu + \phi (h_{t-1} - \mu) + \sigma_w w_t, \quad t > 1,$$
$$h_1 \sim N\left( \mu, \frac{\sigma_w^2}{1 - \phi^2} \right),$$
$$\begin{pmatrix} u_t \\ w_t \end{pmatrix} \sim N\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \right),$$
where |ϕ| < 1 measures persistence in the volatility, but the ut and wt series are uncorre-
lated. This scheme can be generalised to multivariate responses subject to volatility, such
as a set of exchange rates – see Chapter 7, and Yu and Meyer (2006).
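To see the kind of series that model (5.2) generates, it can be simulated forwards. The following sketch (standard-library Python; the function name and the parameter values in the usage lines are illustrative assumptions) draws h1 from its stationary distribution and then applies the state and observation equations in turn.

```python
import math
import random

def simulate_sv(T, mu, phi, sigma_w, seed=0):
    """Simulate y_t = exp(h_t / 2) u_t with a stationary AR(1) log-variance h_t."""
    rng = random.Random(seed)
    # Initial condition from the stationary distribution N(mu, sigma_w^2 / (1 - phi^2)).
    h = [rng.gauss(mu, sigma_w / math.sqrt(1.0 - phi ** 2))]
    for _ in range(1, T):
        h.append(mu + phi * (h[-1] - mu) + sigma_w * rng.gauss(0.0, 1.0))
    # Independent N(0, 1) observation errors u_t, scaled by exp(h_t / 2).
    y = [math.exp(ht / 2.0) * rng.gauss(0.0, 1.0) for ht in h]
    return y, h

y_sim, h_sim = simulate_sv(20000, mu=-6.0, phi=0.9, sigma_w=0.3, seed=1)
```

With ϕ close to 1 the simulated ht moves slowly, so that large and small |yt| occur in clusters, which is the volatility clustering described above.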
As a heavy tailed alternative, one may consider a Student t likelihood for the log scale
series, implemented as a scale mixture of normals (Jacquier et al., 2004). With ν degrees of
freedom, one has
$$y_t = \sqrt{\lambda_t V_t}\, u_t = \sqrt{\lambda_t} \exp\left( \frac{h_t}{2} \right) u_t,$$
$$1/\lambda_t \sim \mathrm{Ga}\left( \frac{\nu}{2}, \frac{\nu}{2} \right),$$
and other aspects as above. A diffuse prior on ν is not suitable, and one option is an expo-
nential prior with prior mean 10 or 20 (Fernández and Steel, 1998). For a recent alterna-
tive prior (applicable to other types of Student t regression), see Fonseca et al. (2008). This
model deals with isolated y-outliers by introducing a large λt, and it requires a sequence of
large |yt| before Vt is increased (Jacquier et al., 2004, p.190).
By contrast, generalised autoregressive conditional heteroscedastic (GARCH) models
involve autoregression in $y_t^2$ and/or Vt. A GARCH(p,q) model specifies
$$V_t = \gamma + \sum_{j=1}^{p} \alpha_j y_{t-j}^2 + \sum_{j=1}^{q} \beta_j V_{t-j},$$
where coefficients {γ, αj, βj} are constrained to be positive, and setting q = 0 leads to the
ARCH(p) model (Engle, 1982). Stationarity requires
$$\sum_{j=1}^{p} \alpha_j + \sum_{j=1}^{q} \beta_j < 1,$$
though this is not necessarily imposed a priori. Whichever approach is used, departures from
normality are frequently relevant, such that $y_t/\sqrt{V_t}$ is non-Gaussian. Among heavy tailed
alternatives, one may consider a Student t, either ut ~ t(0, 1, ν), or a scale mixture of normals
(Bauwens and Lubrano, 1998; Chib et al., 2002).
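For the GARCH(1,1) case, the variance recursion and the stationarity check can be sketched as below. The function names are illustrative, and starting the recursion at the unconditional variance γ/(1 − α − β) is an assumed (though common) choice of initial condition, available only when α + β < 1.

```python
def garch11_variances(y, gamma, alpha, beta, V0=None):
    """V_t = gamma + alpha * y_{t-1}^2 + beta * V_{t-1} for a GARCH(1,1) model."""
    if V0 is None:
        # Unconditional variance, valid only when alpha + beta < 1 (assumed start).
        V0 = gamma / (1.0 - alpha - beta)
    V = [V0]
    for t in range(1, len(y)):
        V.append(gamma + alpha * y[t - 1] ** 2 + beta * V[t - 1])
    return V

def is_stationary(alphas, betas):
    """Check the condition sum(alpha_j) + sum(beta_j) < 1."""
    return sum(alphas) + sum(betas) < 1.0

returns = [0.0, 2.0, -1.0]               # made-up zero-mean returns
V = garch11_variances(returns, gamma=0.1, alpha=0.2, beta=0.7)
```

The jump in Vt after the large return at the second time point illustrates how squared past values feed directly into the conditional variance, in contrast to the latent ht process of the stochastic volatility model.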
In case y has a non-zero mean, or there are predictors, one may widen the model for y.
For example, a model with a zero mean y and lag 1 effect in y would be
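$$y_t = \rho y_{t-1} + \sqrt{V_t}\, u_t,$$
which under an ARCH(1) variance, $V_t = \gamma + \alpha y_{t-1}^2$, becomes
$$y_t = \rho y_{t-1} + u_t \sqrt{\gamma + \alpha y_{t-1}^2}.$$
For ut normal, this can be shown equivalent to the random coefficient AR model
$$y_t = (\rho + a_t) y_{t-1} + c_t,$$
where (at,ct) are bivariate normal with mean 0 and covariance matrix Diag(α,γ).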
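The equivalence can be checked by comparing conditional moments. The following short derivation is a sketch that conditions on the previous value and uses only the stated assumptions on (at, ct), namely zero means, variances α and γ, and zero covariance.

```latex
\begin{aligned}
E(y_t \mid y_{t-1}) &= \rho\, y_{t-1} + E(a_t)\, y_{t-1} + E(c_t) = \rho\, y_{t-1}, \\
\mathrm{Var}(y_t \mid y_{t-1}) &= y_{t-1}^2\, \mathrm{Var}(a_t) + \mathrm{Var}(c_t)
   + 2\, y_{t-1}\, \mathrm{Cov}(a_t, c_t) = \alpha\, y_{t-1}^2 + \gamma .
\end{aligned}
```

Since yt is linear in the jointly normal pair (at, ct) given the previous value, its conditional distribution is normal with mean ρyt−1 and variance γ + αy²t−1, which is the conditional distribution implied by the ARCH(1) form with normal ut.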
A generalisation of the state-space approach is to introduce correlation between the
ut and wt terms, and so reflect leverage effects. Positive and negative shocks then have
different impacts on future volatility (Wang et al., 2011; Asai et al., 2006; Jacquier et al.,
2004; Meyer and Yu, 2000; Chen and So, 2006). So one possible scheme has
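$$y_t = \sqrt{V_t}\, u_t = \exp\left( \frac{h_t}{2} \right) u_t,$$
$$h_t = \mu + \phi (h_{t-1} - \mu) + \sigma_w w_t,$$
$$\begin{pmatrix} u_t \\ w_t \end{pmatrix} \sim N\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & \varphi \\ \varphi & 1 \end{pmatrix} \right),$$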
where φ is a correlation. A heavy tailed version of the leverage model (Jacquier et al., 2004;
Omori et al., 2007) may be obtained with
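$$y_t = \sqrt{\lambda_t} \exp\left( \frac{h_t}{2} \right) u_t,$$
$$1/\lambda_t \sim \mathrm{Ga}\left( \frac{\nu}{2}, \frac{\nu}{2} \right).$$
Under the model (5.2), assume priors $\mu \sim N(0, \sigma_\mu^2)$, $(\phi + 1)/2 \sim \mathrm{Be}(r_\phi, s_\phi)$, and $\sigma_w^2 \sim \mathrm{IG}(k_w, l_w)$,
where $\{\sigma_\mu^2, r_\phi, s_\phi, k_w, l_w\}$ are known. Then with $\psi = (\mu, \phi, \sigma_w^2)$, the posterior is
$$p(\psi \mid y) \propto \left[ \prod_{t=1}^{T} \exp\{-h_t/2\} \exp\left\{ -\frac{y_t^2}{2e^{h_t}} \right\} \right] \left[ \left( \frac{1 - \phi^2}{\sigma_w^2} \right)^{0.5} \exp\left\{ -\frac{1 - \phi^2}{2\sigma_w^2} (h_1 - \mu)^2 \right\} \right]$$
$$\times \left[ \prod_{t=2}^{T} \left( \frac{1}{\sigma_w^2} \right)^{0.5} \exp\left\{ -\frac{1}{2\sigma_w^2} \left( h_t - \mu - \phi (h_{t-1} - \mu) \right)^2 \right\} \right] p(\mu)\, p(\phi)\, p(\sigma_w^2)$$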
and Gibbs sampling from full conditionals is obtained (Kim et al., 1998). The Griddy–Gibbs
technique may also be used to enable Gibbs sampling of all parameters in a GARCH(1,1)
model, with normal or Student distributed ut (Bauwens and Lubrano, 1998). Chib et al.
(2002) consider more general Metropolis–Hastings techniques including particle filtering,
to sample from models with discontinuities in the observations.
[Figure 5.6: vertical axis Bitcoin Return, from −0.1 to 0.2.]
FIGURE 5.6
Fluctuations in returns yt = (rt − rt−1)/rt (rt is Bitcoin price).
$$y_t = \rho y_{t-1} + u_t \sqrt{\gamma + \alpha y_{t-1}^2},$$
is applied, with the constraint $\rho^2 + \alpha < 1$ sufficient to ensure $E(y_t^2) < \infty$. The analysis
conditions on the first observation. A two-chain run of 5,000 iterations using jagsUI
gives posterior means (sd) for ρ and α of 0.14 (0.05) and 0.16 (0.06), with γ estimated as
0.0021 (0.0002). The LOO-IC is obtained as −1155.
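As a rough check on these estimates, the stationarity condition and the implied unconditional variance can be evaluated at the quoted posterior means. Taking expectations of the squared model equation gives E(y²) = γ/(1 − ρ² − α) when the constraint holds. Plugging in posterior means is only an approximation to the posterior of this quantity, and is offered as a sketch.

```r
# Sketch: stationarity measure and implied variance at the quoted posterior means.
rho <- 0.14; alpha <- 0.16; gamma <- 0.0021

stat_measure <- rho^2 + alpha                # must be below 1 for finite E(y^2)
uncond_var <- gamma / (1 - stat_measure)     # E(y^2) = gamma / (1 - rho^2 - alpha)
uncond_sd <- sqrt(uncond_var)

c(stat_measure = stat_measure, uncond_var = uncond_var, uncond_sd = uncond_sd)
```

The constraint is comfortably satisfied at these values, and the implied standard deviation of returns is about 5%.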
A second approach is based on a stationary autoregressive stochastic volatility model
(in stan), as in (5.2), with
$$y_t = \sqrt{V_t}\, u_t = \exp\left(\frac{h_t}{2}\right) u_t,$$
$$u_t \sim N(0, 1),$$
$$h_t = \mu + \phi (h_{t-1} - \mu) + \sigma_w w_t, \quad t > 1,$$
with a uniform U(−1,1) prior on ϕ, and a half Cauchy prior on σw. With a two-chain run
of 2,000 iterations, we obtain posterior estimates (mean, sd) for μ and ϕ of −6.34 (0.35)
and 0.91 (0.05), with the LOO-IC estimated as −1249. Figure 5.7 plots the evolving
variance $V_t = \exp(h_t)$ under this model.
To better represent the extreme volatility in the series, a Student t (scale mixture) ver-
sion of the preceding stochastic volatility model is applied. Thus
$$y_t = \sqrt{\lambda_t V_t}\, u_t = \sqrt{\lambda_t} \exp\left(\frac{h_t}{2}\right) u_t,$$
$$1/\lambda_t \sim Ga\left(\frac{\nu}{2}, \frac{\nu}{2}\right),$$
FIGURE 5.7
Stochastic volatility. Bitcoin data. Axes: Index (horizontal), Variance (vertical).
with a prior $\nu \sim E(0.1)$. This provides an improved LOO-IC of −1255, with a posterior
mean (sd) for ν of 15.2 (9.6). Low values of the precision scaling parameters $\zeta_t = 1/\lambda_t$ are
indicators of outlier status, and we find that cases 200, 284, 10, 216, and 340 have the
lowest posterior mean ζt. Two of these cases have return values exceeding 20%.
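The link between the gamma scale mixture and the Student t density can be illustrated by simulation. The sketch below draws precision scaling variables from Ga(ν/2, ν/2) at the quoted posterior mean of ν, forms the scaled normal errors, and compares their variance and quantiles with those of a t density. The sample size and seed are arbitrary assumptions.

```r
# Sketch: normal errors divided by sqrt(zeta), zeta ~ Ga(nu/2, nu/2), are Student t.
set.seed(5)
nu <- 15.2            # posterior mean of nu quoted in the text
n <- 200000           # number of Monte Carlo draws (assumption)

zeta <- rgamma(n, shape = nu / 2, rate = nu / 2)   # precision scaling, zeta = 1/lambda
x <- rnorm(n) / sqrt(zeta)                         # equals sqrt(lambda) * u

# Variance of a t density with nu degrees of freedom is nu/(nu - 2)
c(simulated = var(x), theory = nu / (nu - 2))
rbind(simulated = quantile(x, c(0.01, 0.99)), theory = qt(c(0.01, 0.99), df = nu))

# Draws with the lowest zeta are the ones with inflated scale
low <- zeta < quantile(zeta, 0.01)
c(mean_abs_low_zeta = mean(abs(x[low])), mean_abs_all = mean(abs(x)))
```

The final comparison shows why low posterior means of ζt flag outlying cases: a small precision scaling factor is what allows an unusually large error.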
Outliers can also be represented by a binary shift mechanism (Wang, 2011). Thus
$$y_t = J_t N_t + \exp\left(\frac{h_t}{2}\right) u_t,$$
$$h_t = \mu + \phi (h_{t-1} - \mu) + \sigma_w w_t, \quad t > 1,$$
where
$$J_t \sim \text{Bern}(\pi_J),$$
$$N_t \sim N(0, \sigma_N^2),$$
represent the shift mechanism and its potential size respectively. The probability πJ can
be preset, or assigned a prior favouring a low outlier rate. Taking $\pi_J \sim \text{Beta}(2, 48)$, this
model (fitted using jagsUI) provides a LOO-IC of −1256. The highest posterior
probabilities $\Pr(J_t = 1 \mid y)$ are for cases 284, 39, 10, 216, and 200.
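Conditional on the other unknowns, the probability that a shift is active at time t follows from Bayes' rule, since the return is normal with variance exp(ht) without a shift and variance exp(ht) plus the shift variance with one. The sketch below codes this conditional probability; the function name `jump_prob` and the input values (log-volatility, shift standard deviation, returns) are illustrative assumptions, with the shift probability set to the Beta(2, 48) prior mean.

```r
# Sketch: conditional probability that J[t] = 1 given y[t], h[t] and the parameters.
jump_prob <- function(y, h, pi_j, sigma_n) {
  # With a shift: y ~ N(0, sigma_n^2 + exp(h)); without: y ~ N(0, exp(h))
  with_jump <- pi_j * dnorm(y, 0, sqrt(sigma_n^2 + exp(h)))
  no_jump <- (1 - pi_j) * dnorm(y, 0, exp(h / 2))
  with_jump / (with_jump + no_jump)
}

pi_j <- 2 / (2 + 48)          # prior mean of Beta(2, 48)
y_vals <- c(0.01, 0.05, 0.20) # illustrative returns
p_jump <- jump_prob(y_vals, h = -6.3, pi_j = pi_j, sigma_n = 0.15)
round(p_jump, 3)
```

Small returns leave the shift probability below its prior value, while a return of 20% makes a shift almost certain under these assumed settings. In an MCMC run the posterior probabilities average this quantity over the sampled parameters and states.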
or error series. These extend to shifts in variance parameters also (as considered in
Example 5.10).
Robust versions of the priors for the component errors ut and/or wjt in dynamic models
may be applied to allow flexibility in response to disparate observations. For example, a
heavy tailed alternative to Gaussian errors (Martin and Raftery, 1987) may be invoked by
scale mixing at both levels in the local level model
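$$y_t = \beta_t + u_t,$$
$$\beta_t = \beta_{t-1} + w_t,$$
with
$$u_t \sim N(0, V/\lambda_{1t}),$$
$$\lambda_{1t} \sim Ga\left(\frac{\nu_u}{2}, \frac{\nu_u}{2}\right),$$
$$w_t \sim N(0, W/\lambda_{2t}),$$
$$\lambda_{2t} \sim Ga\left(\frac{\nu_w}{2}, \frac{\nu_w}{2}\right).$$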
This generalisation is adapted to detecting or accommodating additive outliers (outliers in
the observation errors) and innovation outliers in the state equation errors. Geweke (1993)
points out problems with adopting diffuse priors for ν, and possibilities include an expo-
nential density such as ν ~ E(0.1) (Fernandez and Steel, 1998).
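A simulated series shows how scale mixing at both levels produces the two kinds of outlier. In the sketch below the state errors are taken as normal with variance W divided by the second mixing variable, which is an assumed form mirroring the observation equation, and the variances, degrees of freedom and initial state are illustrative assumptions.

```r
# Sketch: local level model with gamma scale mixing at both levels.
# Assumed form for the state errors: w[t] ~ N(0, W/lambda2[t]).
set.seed(7)
n_obs <- 200
V <- 1; W <- 0.1          # observation and state variances (assumed)
nu_u <- 4; nu_w <- 4      # degrees of freedom (assumed)

lambda1 <- rgamma(n_obs, shape = nu_u / 2, rate = nu_u / 2)
lambda2 <- rgamma(n_obs, shape = nu_w / 2, rate = nu_w / 2)
u <- rnorm(n_obs, 0, sqrt(V / lambda1))   # heavy tailed observation errors
w <- rnorm(n_obs, 0, sqrt(W / lambda2))   # heavy tailed state innovations

beta <- cumsum(w)          # random walk level, starting from zero (assumed)
y <- beta + u

# Smallest mixing variables mark the candidate outliers of each type
additive_candidates <- order(lambda1)[1:5]
innovation_candidates <- order(lambda2)[1:5]
additive_candidates
innovation_candidates
```

Low values of the first mixing variable correspond to isolated aberrant observations, whereas low values of the second correspond to abrupt moves in the level that persist in later periods.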
Many outlier mechanisms involve discrete mixing around default normal error assump-
tions, as in a contaminated normal density (Verdinelli and Wasserman, 1991). Thus, let π be a
given prior probability of an outlier (e.g. π = 0.05). Then the observation error in a state-space
model can be modified to allow innovation outliers
where W2 = KW1 with K large. A comprehensive generalisation of the normal errors dynamic
linear model is provided by taking yt and βt to follow the univariate or multivariate
exponential power distribution (Gomez et al., 2002).
More specialised binary switching in observation error or state error processes may be
applied (Diggle and Zeger, 1989), for example, adapted to positive pulses (e.g. periods with
abnormally heavy rainfall). To illustrate switching in observation errors to accommodate
positive pulses, consider the AR(1) observation model
$$y_t = \phi y_{t-1} + u_t,$$
such that usually $u_t = u_{1t}$, but exceptionally $u_t = u_{2t}$, where the latter error is necessarily
positive, namely
$$u_{1t} \sim N(0, \sigma^2),$$
$$u_{2t} \sim Ga(g_1, g_2),$$
where {g1,g2} are preset. Define latent allocation indicators $S_t \in \{1, 2\}$, as in Chapter 3. Then
$u_t = u_{2t}$ with probabilities $\pi_t = \Pr(S_t = 2)$, that might be defined by a separate model, such as
$$\text{logit}(\pi_t) = \eta_0 + \eta_1 y_{t-1}.$$
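The positive pulse mechanism can be simulated directly. In the following sketch the gamma density is parameterised by shape and rate, the initial lagged value is zero, and all parameter values are illustrative assumptions.

```r
# Sketch: AR(1) series with occasional positive pulses in the observation error.
# Parameter values and the shape/rate reading of Ga(g1, g2) are assumptions.
set.seed(8)
n_obs <- 300
phi <- 0.6; sigma <- 1
g1 <- 4; g2 <- 0.5          # preset gamma shape and rate
eta0 <- -3; eta1 <- 0.4     # logit model for the pulse probability

y <- numeric(n_obs)
S <- integer(n_obs)
y_prev <- 0
for (t in 1:n_obs) {
  p_t <- plogis(eta0 + eta1 * y_prev)       # Pr(S[t] = 2)
  S[t] <- 1L + rbinom(1, 1, p_t)            # allocation indicator in {1, 2}
  u_t <- if (S[t] == 2L) rgamma(1, shape = g1, rate = g2) else rnorm(1, 0, sigma)
  y[t] <- phi * y_prev + u_t
  y_prev <- y[t]
}

table(S)
# Errors recovered from the series: those in regime 2 are all positive
u_rec <- y - phi * c(0, y[-n_obs])
range(u_rec[S == 2])
```

With a positive coefficient on the lagged response, a pulse raises the chance of a further pulse in the next period, so exceptional periods tend to cluster.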
One may also distinguish innovation outliers from additive outliers corresponding to iso-
lated shifts or “gross errors” in the observation series (Tsay, 1986; Fox, 1972). This involves
separate binary indicators {SAt, SIt}, or a single multinomial indicator St. For example, let
πA and πI be prior probabilities of additive and innovative outliers, and consider an AR(1)
observation model with AR(1) errors
$$y_t = \phi_0 + \phi_1 y_{t-1} + a_t S_{At} + \varepsilon_t,$$
$$\varepsilon_t = \rho \varepsilon_{t-1} + u_t,$$
where $S_{At} \sim \text{Bern}(\pi_A)$, and $a_t \sim N(0, \sigma_a^2)$ represents the sizes of the additive outliers
(McCulloch and Tsay, 1994). Innovation outliers are encompassed by a variance inflation
mechanism with
$$u_t \sim (1 - \pi_I) N(0, V) + \pi_I N(0, KV),$$
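$$S_t \sim \text{Mult}(1, [\pi_1, \pi_2, \pi_3]),$$
$$y_t = \phi_0 + \phi_1 y_{t-1} + a_t S_t + \varepsilon_t,$$
$$\varepsilon_t = \rho \varepsilon_{t-1} + u_t S_t,$$
$$y_t = \mu_t + \varepsilon_t,$$
$$\mu_t = \mu_{t-1} + S_{1t} \Delta_t,$$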
occurs when S1t = 1, with Pr(S1t = 1) = π1, and the Δt are random effects representing the
shifts. The errors are AR(p)
$$\varepsilon_t = \rho_1 \varepsilon_{t-1} + \rho_2 \varepsilon_{t-2} + \dots + \rho_p \varepsilon_{t-p} + u_t,$$
where shifts in the variance of $u_t \sim N(0, V_t)$ occur when S2t = 1 with Pr(S2t = 1) = π2. If there
is conditioning on $(y_1, \ldots, y_p)$, then the variance sequence commences with $V_{p+1} = \sigma^2$, and
subsequently,
$$V_t = V_{t-1} \quad \text{when } S_{2t} = 0,$$
where the κt are positive variables (e.g. gamma distributed) that model proportional shifts
in the error variance.
Shocks in different components of the basic structural model can also be considered (De
Jong and Penzer, 1998; Penzer, 2006). For example, in a three-component local linear trend
model, binary shock indicators (S1t, S2t, S3t) are invoked, such that
$$y_t = \mu_t + S_{1t} \Delta_{1t} + u_t,$$
$$\mu_t = \mu_{t-1} + S_{2t} \Delta_{2t} + \Delta_t + w_{1t},$$
$$\Delta_t = \Delta_{t-1} + S_{3t} \Delta_{3t} + w_{2t},$$
where the Δ1t represent temporary additive shocks that occur when S1t = 1, the Δ2t represent
shifts in mean, and the Δ3t represent shifts in the slope.
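The three shock types can be seen by simulating from the local linear trend model with binary shock indicators. The shock probabilities, shock size distributions, error variances and zero starting values in this sketch are all illustrative assumptions.

```r
# Sketch: local linear trend with additive, level and slope shocks.
# All numerical settings are illustrative assumptions.
set.seed(9)
n_obs <- 150
shock_prob <- c(0.03, 0.02, 0.02)   # Pr(S1t = 1), Pr(S2t = 1), Pr(S3t = 1)

# Binary indicators (n_obs x 3) and potential shock sizes (n_obs x 3)
S <- sapply(shock_prob, function(pr) rbinom(n_obs, 1, pr))
D <- cbind(rnorm(n_obs, 0, 5), rnorm(n_obs, 0, 3), rnorm(n_obs, 0, 0.5))

u <- rnorm(n_obs, 0, 1)       # observation errors
w1 <- rnorm(n_obs, 0, 0.2)    # level errors
w2 <- rnorm(n_obs, 0, 0.05)   # slope errors

mu <- numeric(n_obs)
slope <- numeric(n_obs)
mu_prev <- 0
slope_prev <- 0
for (t in 1:n_obs) {
  slope[t] <- slope_prev + S[t, 3] * D[t, 3] + w2[t]           # slope equation
  mu[t] <- mu_prev + S[t, 2] * D[t, 2] + slope[t] + w1[t]      # level equation
  mu_prev <- mu[t]
  slope_prev <- slope[t]
}
y <- mu + S[, 1] * D[, 1] + u                                   # observation equation

colSums(S)   # number of shocks of each type
```

Additive shocks affect a single observation only, level shocks displace all later values by a constant, and slope shocks change the direction of the trend from that point on.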
Regime switching models (Geweke and Terui, 1993; Lubrano, 1995) typically involve
discrete switching between two or more levels, regression regimes, or variances, though
smooth transition mechanisms can also be used. The choice between regimes is governed
by a binary switching function St, or a continuous transition function ϕt with values
between 0 and 1, such as the logistic function (Bauwens et al., 2000). A binary function St might be
defined as one if time t exceeds a threshold κ and zero otherwise, as in change-point models
for the mean level of a series. In self-exciting threshold autoregressive (SETAR) models,
the mechanism involves a lag in y; for example, St = 1 if $y_{t-1} > \kappa$. The continuous version in
these two cases would be
$$\phi_t = \frac{\exp(\omega[t - \kappa])}{1 + \exp(\omega[t - \kappa])},$$
$$\phi_t = \frac{\exp(\omega[y_{t-1} - \kappa])}{1 + \exp(\omega[y_{t-1} - \kappa])},$$
where ω is an extra unknown. Additionally, the lag r in the comparison $y_{t-r} > \kappa$ may be
unknown (Geweke and Terui, 1993).
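The relation between the smooth transition function and the binary switch can be seen by varying ω. The sketch below codes the logistic transition (the helper name `smooth_transition` and the values of κ and ω are illustrative assumptions) and compares it with the step function for a time-based threshold.

```r
# Sketch: logistic transition function versus a binary switch at threshold kappa.
smooth_transition <- function(x, kappa, omega) {
  # exp(omega*(x - kappa)) / (1 + exp(omega*(x - kappa)))
  plogis(omega * (x - kappa))
}

t_index <- 1:100
kappa <- 40                                           # assumed threshold
phi_soft <- smooth_transition(t_index, kappa, omega = 0.2)
phi_sharp <- smooth_transition(t_index, kappa, omega = 50)
step_fn <- as.numeric(t_index > kappa)                # binary switch S[t]

# Away from the threshold itself, a large omega reproduces the binary switch
max_gap <- max(abs(phi_sharp - step_fn)[t_index != kappa])
max_gap
round(phi_soft[c(30, 40, 50)], 3)
```

The same function applies to the SETAR case by passing the lagged response in place of the time index. Small ω gives a gradual transition between regimes, while large ω approaches an abrupt change.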
outlier and shift points. The initial analysis compares an AR(2) model for these data to
one allowing for an intercept shift (cf Balke, 1993). Following that analysis, a Bayesian
estimation of the AR(2) is applied. To facilitate prior specification for latent preseries
values y0 and y−1, we centre the original data Yt by subtracting Y1 from all points. So
yt = Yt − Y1.
First, an AR(2) model with no shift mechanism (and a heavy tailed Student t prior for
the preseries points) is applied. Thus,
$$y_t = \phi_0 + \phi_1 y_{t-1} + \phi_2 y_{t-2} + u_t, \quad t = 1, \ldots, T,$$
$$u_t \sim N(0, \sigma^2),$$
$$y_0 \sim t_2(\phi_0 + \varepsilon_1, \sigma^2),$$
$$y_{-1} \sim t_2(\phi_0 + \varepsilon_2, \sigma^2),$$
where εj are fixed effects, and N(0,1) priors are adopted for {ϕ1,ϕ2} so that nonstationar-
ity is allowed. A two-chain run using jagsUI provides a LOO-IC of 1284. The posterior
means (and 95% credible intervals) on the AR parameters {ϕ1,ϕ2} are obtained as 0.45
(0.27,0.64), and 0.25 (0.06,0.44).
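Since the priors allow nonstationarity, it is useful to check where the fitted coefficients fall. An AR(2) process is stationary when the roots of 1 − ϕ1 z − ϕ2 z² lie outside the unit circle, and the sketch below evaluates this at the quoted posterior means. This is a plug-in check only; a fuller assessment would compute the proportion of posterior draws satisfying the condition.

```r
# Sketch: stationarity check for the AR(2) coefficients at their posterior means.
phi <- c(0.45, 0.25)

# Roots of the characteristic polynomial 1 - phi1*z - phi2*z^2
roots <- polyroot(c(1, -phi))
root_moduli <- Mod(roots)
is_stationary <- all(root_moduli > 1)

root_moduli
is_stationary
```

At the posterior means the smallest root modulus is about 1.29, so the fitted process is stationary, although the upper limits of the two credible intervals sum to more than one.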
Suppose, however, a shift in the series level is allowed: a series plot suggests such
a shift around 1895. One may also allow for coefficient selection via binary variables,
namely dj = 1 if ϕj (j > 0) is to be retained, with prior probabilities $\Pr(d_j = 1) = p_d$, with
$p_d \sim \text{Beta}(1, 1)$. So
where κ is taken to be uniform between 3 and T − 3. Fitting this model provides an
improved LOO-IC of 1275, with posterior mean for κ of 29.8. The selection process
indicates that the lag in yt−2 is now in doubt, with $\Pr(d_2 = 1 \mid y) = 0.5$, whereas
$\Pr(d_1 = 1 \mid y) = 0.998$.
So an AR(1) model with shift mechanism is applied, namely
The LOO-IC is reduced to 1272, with κ now having mean 28.6 (i.e. the year 1899).
This is similar to the classical estimate of 28 obtained from the changepoint package
(Killick and Eckley, 2014). The lag 1 coefficient estimate is now 0.43 with 95% interval
(0.25,0.62).
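A classical single changepoint estimate of this kind can be reproduced in base R by a least squares scan over candidate shift points. The sketch below assumes that the series is the annual Nile discharge series supplied with R as `Nile`; this is an assumption, though the centred range of that series is compatible with the values quoted in this example. It ignores the autoregressive structure and simply fits separate means before and after each candidate point.

```r
# Sketch: least squares scan for a single shift in mean level.
# Assumes the built-in Nile series corresponds to the data in this example.
y <- as.numeric(Nile) - as.numeric(Nile)[1]   # centre on the first observation
n_obs <- length(y)
candidates <- 3:(n_obs - 3)                   # same support as the prior on kappa

# Residual sum of squares with separate means before and after each candidate
sse <- sapply(candidates, function(k) {
  sum((y[1:k] - mean(y[1:k]))^2) +
    sum((y[(k + 1):n_obs] - mean(y[(k + 1):n_obs]))^2)
})
k_hat <- candidates[which.min(sse)]

k_hat                            # index of the last observation before the shift
start(Nile)[1] + k_hat - 1       # corresponding calendar year
range(y)                         # range of the centred series
```

The scan provides a quick point of comparison for the posterior mean of κ, without any measure of uncertainty about the shift point.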
Finally, we consider an AR(2) SETAR model (e.g. Korenok, 2009), which bases the shift
threshold on the discharge value. Specifically,
with κy assigned a uniform prior, $\kappa_y \sim U(-700, 300)$, based on actual (differenced) y values,
which have minimum (maximum) of −664 and 250. This model provides a LOO-IC
of 1282.6, with κy estimated as −336. The latter parameter is only weakly identified, as
can be verified by a prior-posterior overlap plot using MCMCvis. This explains the
small reduction in LOO-IC as against an AR(2) model with no shift. Figure 5.8 shows
the extent of updating in κy.
FIGURE 5.8
Density of κy. Vertical axis: Density.
$$y_t \sim N(\beta_0 + \theta_t, V_{J_t}),$$
$$\theta_t = \phi \theta_{t-1} + w_t,$$
$$\pi_1 = 1/(1 + r),$$
$$\pi_2 = \pi_3 = 0.5 r/(1 + r),$$
where r ~ E(9). Additionally, the variances of the observation and state equations are
linked by taking W = qV1 with an E(1) prior on q.
A two-chain run using jagsUI shows early convergence with estimated probability
π1 = 0.94 (and 95% interval from 0.82 to 0.99). The observation error variance V1 has a
posterior mean of 0.031, while the state variance W has mean 0.037.
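The prior that this parameterisation implies for the category probabilities can be examined by simulation. The sketch below reads E(9) as an exponential density with rate 9 (mean 1/9), which is an assumption about the notation, and summarises the induced prior on π1.

```r
# Sketch: prior on the category probabilities induced by r ~ E(9).
# E(9) is read as an exponential density with rate 9 (assumption).
set.seed(13)
r <- rexp(100000, rate = 9)

pi1 <- 1 / (1 + r)
pi2 <- 0.5 * r / (1 + r)
pi3 <- pi2

prior_summary <- c(mean = mean(pi1), quantile(pi1, c(0.025, 0.5, 0.975)))
round(prior_summary, 3)
```

Under this reading the prior concentrates π1 near 0.9, so that the default category is strongly favoured a priori, and the posterior estimate of 0.94 represents a modest update.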
5.7 Computational Notes
[1] The code for the ARMA(4,0,1) model in Example 5.1 is
kappa=c(-1,1,-1,1.5)),
list(phi0=7,gamma1=0.9,phi=c(0.4,0.8,-0.3,-0.1),sigma=0.7,kappa=c(-2,1.5,-1.5,2)))
fit4 <- sampling(sm,data=D,pars=c("phi0","phi","gamma1","y_fit","kappa","log_lik"),
iter=10000,warmup=500,chains=2,seed=12345,init=INI)
print(fit4)
# Fit
LLsamps <- extract(fit4,"log_lik",permuted=FALSE)
# 9500 post-warmup draws per chain, 2 chains, 598 observations
LLsamps <- matrix(LLsamps, 2*9500, 598)
loo(LLsamps)
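The reshaping step in this listing relies on R's column-major storage: an array of dimension iterations by chains by observations collapses into a matrix whose rows stack the chains and whose columns index observations. The toy dimensions below are assumptions chosen to make the layout visible.

```r
# Sketch: collapsing an iterations x chains x observations array into a matrix.
n_iter <- 4; n_chain <- 2; n_obs <- 3     # toy dimensions (assumption)
arr <- array(seq_len(n_iter * n_chain * n_obs), dim = c(n_iter, n_chain, n_obs))

# Rows: chain 1 iterations followed by chain 2 iterations; columns: observations
mat <- matrix(arr, n_iter * n_chain, n_obs)

# Column 1 holds all draws for observation 1, chain by chain
stacked_ok <- identical(mat[, 1], c(arr[, 1, 1], arr[, 2, 1]))
stacked_ok
dim(mat)
```

The resulting matrix has one row per retained draw and one column per observation, which is the layout expected for a matrix of pointwise log-likelihood values.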
[2] The code for the random coefficient AR1 model in Example 5.2 is
RCAR.stan <- "
data {
int<lower=0> T;
vector[T] y;
}
parameters {
real mu;
real eta[T];
real y0;
real mu_phi;
real<lower=0> sigma;
real<lower=0> sigma_phi;
}
transformed parameters {
vector[T] muy;
vector[T] phi;
// time-varying AR coefficient: phi[t] = mu_phi + sigma_phi*eta[t]
phi[1]=mu_phi+eta[1]*sigma_phi;
for (t in 2:T) {phi[t]=mu_phi+eta[t]*sigma_phi;}
// conditional mean, with latent preseries value y0 at t = 1
muy[1]=mu+(mu_phi+eta[1]*sigma_phi)*y0;
for (t in 2:T) {muy[t]=mu+(mu_phi+eta[t]*sigma_phi)*y[t-1];}
}
model {
sigma ~ normal(0, 1);
eta ~ normal(0,1);
mu ~ normal(0, 20);
mu_phi ~ normal(0, 1);
y0 ~ normal(0,20);
for (t in 1:T) {y[t] ~ normal(muy[t], sigma);}
}
generated quantities {
vector[T] log_lik;
for (t in 1:T) {log_lik[t] = normal_lpdf(y[t] | muy[t], sigma);}
}
"
[3] The code for the intervention decay effect antedependence model is as follows:
library(jagsUI)
library(loo)
cat("model {for (t in 1:71) {
y[t] ~ dpois(mu[t])
# Scaled deviance and likelihood terms
yts[t] <- equals(y[t],0)+(1-equals(y[t],0))*y[t]
mus[t] <- equals(y[t],0)+(1-equals(y[t],0))*mu[t]
dv[t] <- 2*(y[t]*log(yts[t]/mus[t])-(y[t]-mu[t]))
LL[t] <- -mu[t]+y[t]*log(mu[t])-logfact(y[t])
# Predictive checks
ynew[t] ~ dpois(mu[t])
ch[t] <- step(ynew[t]-y[t])-0.5*equals(ynew[t],y[t])
# Regression
log(mu[t]) <- log(E[t])+beta[t]*SB[t]+g[t]
# Relative risk after control for intervention
RR[t] <- exp(g[t])}
Dv <- sum(dv[1:71])
g.m <- mean(g[])
# priors
phi ~ dnorm(0,1)
g[1] ~ dnorm(0,1/omega[1])
for (t in 2:71) {g[t] ~ dnorm(phi*g[t-1],1/omega[t])}
# Variance model
for (t in 1:71) {log(omega[t]) <- gam[1]+gam[2]*t/100+gam[3]*t*t/10000}
for (j in 1:3) {gam[j] ~ dnorm(0,1)}
# Intervention effect
for (r in 1:26) {b[r] ~ dnorm(0,tau.b)}
tau.b ~ dexp(1)
# sort ascending order
bsort <- sort(b)
for (j in 1:45) {beta[j] <- 0}
# Decay in effect from year of introduction
for (j in 46:71) {betas[j] <- bsort[j-45]
# Retain negative coefficients
beta[j] <- betas[j]*step(-betas[j])
# Probability that SB effect still relevant
decay.prob[j-45] <- step(-betas[j])}}
", file="model3.jag")
# Initial values and estimation
init1 <- list(gam=c(-3,0,0),phi=0.8)
init2 <- list(gam=c(-3,0,0),phi=0.7)
inits <- list(init1,init2)
pars <- c("beta","gam","LL","Dv","phi","RR","ch","decay.prob")
R <- autojags(D, inits, pars, model.file="model3.jag", n.chains=2,
iter.increment=5000, n.burnin=500, Rhat.limit=1.1, max.iter=50000, seed=1234)
R$summary
samps <- as.matrix(R$samples)
# Select log-likelihood samples in samps
LL <- samps[,75:145]
LOO <- loo(LL,pointwise=TRUE)
waic(LL)
# Calendar years for the 71 observations (defined before use in the plots)
year <- seq(1931,2001,1)
# Relative risks after controlling for intervention
RR <- samps[,148:218]
plot(apply(RR,2,mean),x=year,xlab="Year",ylab="Relative Risks")
# plots and listing, pointwise LOO
loocase <- as.vector(LOO$pointwise[,3])
plot(loocase,x=year,xlab="Year",ylab="Pointwise LOO-IC")
list.loocase <- data.frame(year,loocase)
list.loocase <- list.loocase[order(-list.loocase$loocase),]
head(list.loocase,10)
References
Abraham B, Ledolter J (1983) Statistical Methods for Forecasting. Wiley, New York.
Albert J, Chib S (1993) Bayes inference via Gibbs sampling of autoregressive time series subject to
Markov mean and variance shifts. Journal of Business & Economic Statistics, 11(1), 1–15.
Angers J, Biswas A, Maiti R (2017) Bayesian forecasting for time series of categorical data. Journal of
Forecasting, 36(3), 217–229.
Araveeporn A (2017) Comparing random coefficient autoregressive model with and without auto-
correlated errors by Bayesian analysis. Statistical Journal of the IAOS, 33(2), 537–545.
Asai M, McAleer M, Yu J (2006) Multivariate stochastic volatility: A review. Econometric Reviews, 25,
145–175.
Auger-Méthé M, Field C, Albertsen C M, Derocher A, Lewis M, Jonsen I, Flemming J (2016) State-
space models’ dirty little secrets: Even simple linear Gaussian models can have estimation
problems. Scientific Reports, 6, 26677.
Balke N (1993) Detecting level shifts in time series. The Journal of Business and Economic Statistics, 11,
81–92.
Barnett G, Kohn R, Sheather S (1996) Bayesian estimation of an autoregressive model using Markov
chain Monte Carlo. Journal of Econometrics, 74, 237–254.
Bauwens L, Lubrano M (1998) Bayesian inference on GARCH models using the Gibbs sampler.
Econometrics Journal, 1, C23–C46.
Bauwens L, Lubrano M, Richard J (2000) Bayesian Inference in Dynamic Econometric Models. OUP.
Beck N (2004) Time series, in Encyclopedia of Social Science Research Methods, eds M Lewis-Beck, A
Bryman, T Futing Liao. Sage.
Benjamin M, Rigby R, Stasinopoulos D (2003) Generalized autoregressive moving average models.
Journal of the American Statistical Association, 98, 214–223.
Berkes I, Horvath L, Ling S (2009) Estimation in nonstationary random coefficient autoregressive
models. Journal of Time Series Analysis, 30, 395–416.
Berliner L (1996) Hierarchical Bayesian time series models, pp 15–22, in Maximum Entropy and
Bayesian Methods, eds K Hanson, R Silver. Kluwer Academic Publishers.
Berzuini C, Clayton D (1994) Bayesian analysis of survival on multiple time scales. Statistics in
Medicine, 13, 823–838.
Betancourt M, Girolami M (2015) Hamiltonian Monte Carlo for hierarchical models, in Current Trends
in Bayesian Methodology with Applications, eds S Upadhyay, U Singh, D Dey, A Loganathan. CRC.
Bijma F, De Munck J, Huizenga H, Heethaar R, Nehorai A (2005) Simultaneous estimation and test-
ing of sources in multiple MEG data sets. IEEE Transactions on Signal Processing, 53, 3449–3460.
Bockenholt U (1999) An INAR(1) negative multinomial regression model for longitudinal count data.
Psychometrika, 64, 53–68.
Broto C, Ruiz E (2004) Estimation methods for stochastic volatility models: A survey. Journal of
Economic Surveys, 18, 613–649.
Cargnoni C, Muller P, West M (1996) Bayesian forecasting of multinomial time series through con-
ditionally Gaussian dynamic models. Journal of the American Statistical Association, 92, 587–606.
Carlin BP, Klugman SA (1993) Hierarchical Bayesian Whittaker graduation. Scandinavian Actuarial
Journal, 1993(2), 183–196.
Carlin B, Polson N, Stoffer D (1992) A Monte Carlo approach to nonnormal and nonlinear state space
modelling. Journal of the American Statistical Association, 87, 493–500.
Carter C, Kohn R (1994) On Gibbs sampling for state space models. Biometrika, 81, 541–553.
Chan K, Ledolter J (1995) Monte Carlo EM estimation for time series models involving counts. Journal
of the American Statistical Association, 90, 242–252.
Chaturvedi A, Kumar J (2005) Bayesian unit root test for model with maintained trend. Statistics &
Probability Letters, 74, 109–115.
Chen C, Liu L (1993) Joint estimation of model parameters and outlier effects in time series. Journal of
the American Statistical Association, 88, 284–297.
Geweke J, Terui N (1993) Bayesian threshold auto-regressive models for nonlinear time series. Journal
of Time Series Analysis, 14, 441–454.
Ghosh K, Tiwari R (2007) Prediction of U.S. cancer mortality counts using semiparametric Bayesian
techniques. Journal of the American Statistical Association, 102, 7–15.
Giordani P, Pitt M, Kohn R (2011) Bayesian inference for time series state space models, in The Oxford
Handbook of Bayesian Econometrics, eds J Geweke, G Koop, H Van Dijk. OUP.
Glosten L, Jagannathan R, Runkle D (1994) On the relation between the expected value and the vari-
ance of the nominal excess return on stocks. Journal of Finance, 48(5), 1779–1801.
Godsill S, Doucet A, West M (2004) Monte Carlo smoothing for nonlinear time series. Journal of the
American Statistical Association, 99, 156–168.
Gómez E, Gómez-Villegas M, Marín J (2002) Continuous elliptical and exponential power linear
dynamic models. Journal of Multivariate Analysis, 83, 22–36.
Granger C, Machina M (2006) Structural attribution of observed volatility clustering. Journal of
Econometrics, 135, 15–29.
Grunwald G, Hyndman R, Tedesco L, Tweedie R (2000) Non-Gaussian conditional linear AR(1) mod-
els. Australian & New Zealand Journal of Statistics, 42, 479–495.
Grunwald S (2005) Environmental Soil-Landscape Modeling: Geographic Information Technologies and
Pedometrics. CRC Press.
Hamilton J (2007) Regime-switching models, in Palgrave Dictionary of Economics, 2nd Edition, eds S
Durlauf, L Blume. Palgrave MacMillan, London.
Harvey A (1989) Structural Time Series Models and the Kalman Filter. Cambridge University Press.
Harvey A, Ruiz E, Shephard N (1994) Multivariate stochastic variance models. Review of Economic
Studies, 61, 247–264.
Harvey A, Todd P (1983) Forecasting economic time series with structural and Box-Jenkins models:
A case study. Journal of Business & Economic Statistics, 1, 299–307.
Harvey A, Trimbur T, Van Dijk H (2006) Trends and cycles in economic time series: A Bayesian
approach. Journal of Econometrics, 140(2), 618–649.
Heinen A (2003) Modelling time series count data: An autoregressive conditional Poisson model.
SSRN Electronic Journal. DOI:10.2139/ssrn.1117187
Helske J (2017) tsPI: Improved Prediction Intervals for ARIMA Processes and Structural Time Series.
https://cran.r-project.org/web/packages/tsPI/index.html
Huerta G, West M (1999) Priors and component structures in autoregressive time series. Journal of the
Royal Statistical Society, Series B, 61, 881–899.
ISC Shark Working Group (2013) Stock assessment and future projections of blue shark in the North
Pacific ocean. WCPFC-SC9-2013/SA-WP-11. WCPFC-SC. https://www.wcpfc.int/node/19204
Jacquier E, Polson N, Rossi P (2004) Bayesian analysis of stochastic volatility models with fat-tails
and correlated errors. Journal of Econometrics, 122, 185–212.
Jacquier E, Polson NG, Rossi PE (2002) Bayesian analysis of stochastic volatility models. Journal of
Business & Economic Statistics, 20(1), 69–87.
Jaffrézic F, Thompson R, Hill G (2003) Structured antedependence models for genetic analysis of
repeated measures on multiple quantitative traits. Genetics Research, 82, 55–65.
Jaffrézic F, Venot E, Laloë D, Vinet A, Renand G (2004) Use of structured antedependence models for
the genetic analysis of growth curves. Journal of Animal Science, 82, 3465–3473.
Jowaheer V, Sutradhar B (2002) Analysing longitudinal count data with overdispersion. Biometrika,
89, 389–399.
Jung R, Kukuk M, Liesenfeld R (2006) Time series of count data: modeling, estimation and diagnos-
tics. Computational Statistics & Data Analysis, 51(4), 2350–2364.
Kashiwagi N, Yanagimoto T (1992) Smoothing serial count data through a state-space model.
Biometrics, 48, 1187–1194.
Kastner G, Hosszejni D (2016) Package ‘stochvol’. Efficient Bayesian Inference for Stochastic Volatility
(SV) Models. https://cran.r-project.org/web/packages/stochvol/stochvol.pdf
Khoo W, Ong S (2014) A new model for time series of counts. AIP Conference Proceedings, 1605(1),
938–942.
Killick R, Eckley I (2014) Changepoint: An R package for changepoint analysis. Journal of Statistical
Software, 58(3), 1–19.
Kim S, Shephard N, Chib S (1998) Stochastic volatility: Likelihood inference and comparison with
ARCH models. The Review of Economic Studies, 65, 361–393.
Kitagawa G, Gersch W (1996) Smoothness Priors Analysis of Time Series. Springer, New York.
Knape J (2008) Estimability of density dependence in models of time series data. Ecology, 89,
2994–3000.
Knorr-Held L (1999) Conditional prior proposals in dynamic models. Scandinavian Journal of Statistics,
26, 129–144.
Koopman S (1993) Disturbance smoother for state space models. Biometrika, 80, 117–126.
Koopman S, Shephard N, Doornik J (1999) Statistical algorithms for models in state space form using
SsfPack 2.2. Econometrics Journal, 2, 113–166.
Korenok O (2009) Bayesian methods in non-linear time series, pp 441–455, in Encyclopedia of Complexity
and Systems Science. Springer, New York.
Lee S (1998) Coefficient constancy test in a random coefficient autoregressive model. Journal of
Statistical Planning and Inference, 74, 93–101.
Lee Y, Nelder J (2001) Modelling and analysing correlated non-normal data. Statistical Modelling, 1,
3–16.
Leonte D, Nott D, Dunsmuir W (2003) Smoothing and change point detection for gamma ray count
data. Mathematical Geology, 35, 175–194.
Li W (1994) Time series models based on generalized linear models: Some further results. Biometrics,
50, 506–511.
Liboschik T, Fokianos K, Fried R (2017) tscount: An R package for analysis of count time series fol-
lowing generalized linear models. Journal of Statistical Software, 82(5), 1–50.
Ling S (2004) Estimation and testing stationarity for double-autoregressive models. Journal of the
Royal Statistical Society: Series B, 66, 63–78.
Lubrano M (1995) Testing for unit root in a Bayesian framework. Journal of Econometrics, 69, 81–109.
Marriott J, Ravishanker N, Gelfand A, Pai J (1996) Bayesian analysis of ARMA processes: Complete
sampling based inference under full likelihoods, pp 243–256, in Bayesian Analysis in Statistics
and Econometrics, eds D Barry, K Chaloner, J Geweke. Wiley, New York.
Martin D, Raftery A (1987) Non-Gaussian state-space modeling of nonstationary time series:
Robustness, computation, and non-Euclidean models. Journal of the American Statistical
Association, 82, 1044–1050.
Maunder M, Sibert J, Fonteneau A, Hampton J, Kleiber P, Harley S (2006) Interpreting catch per unit
effort data to assess the status of individual stocks and communities. ICES Journal of Marine
Science, 63(8), 1373–1385.
Maunder MN, Deriso RB, Hanson CH (2015) Use of state-space population dynamics models in
hypothesis testing: Advantages over simple log-linear regressions for modeling survival,
illustrated with application to longfin smelt (Spirinchus thaleichthys). Fisheries Research, 164,
102–111.
McAllister M K (2014) A generalized Bayesian surplus production stock assessment software (BSP2).
Collective Volumes of Scientific Papers ICCAT, 70(4), 1725–1757.
McCulloch R, Tsay R (1993) Bayesian inference and prediction for mean and variance shifts in autore-
gressive time series. Journal of the American Statistical Association, 88, 968–978.
McCulloch R, Tsay R (1994) Bayesian analysis of autoregressive time series via the Gibbs sampler.
Journal of Time Series Analysis, 15, 235–250.
Meyer R, Millar RB (1999) BUGS in Bayesian stock assessments. Canadian Journal of Fisheries and
Aquatic Sciences, 56(6), 1078–1087.
Meyer R, Yu J (2000) BUGS for a Bayesian analysis of stochastic volatility models. Econometrics
Journal, 3, 198–215.
Mira A, Petrone S (1996) Bayesian hierarchical nonparametric inference for change point prob-
lems, pp 693–703, in Bayesian Statistics 5, eds J Bernardo, J Berger, A Dawid, A Smith. OUP,
Oxford.
Monnahan C, Thorson J, Branch T (2017) Faster estimation of Bayesian models in ecology using
Hamiltonian Monte Carlo. Methods in Ecology and Evolution, 8(3), 339–348.
Naylor J, Marriott J (1996) A Bayesian analysis of non-stationary autoregressive series, pp 705–712, in
Bayesian Statistics 5, eds J Bernardo, J Berger, A Dawid, A Smith. Clarendon Press.
Nunez-Anton V, Zimmerman D (2000) Modeling non-stationary longitudinal data. Biometrics, 56,
699–705.
Oh M-S, Lim Y (2001) Bayesian analysis of time series Poisson data. Journal of Applied Statistics, 28,
259–271.
Omori Y, Chib S, Shephard N, Nakajima J (2007) Stochastic volatility with leverage: Fast and efficient
likelihood inference. Journal of Econometrics, 140(2), 425–449.
Omori Y, Watanabe T (2015) Stochastic volatility and realized stochastic volatility models, pp 435–
456, Chapter 21, in Current Trends in Bayesian Methodology with Applications, eds S Upadhyay, U
Singh, D Dey, A Loganathan. Chapman and Hall/CRC.
Ord J, Snyder R, Koehler A, Hyndman R, Leeds M (2005) Time series forecasting: The case for the
single source of error state space approach. Working Paper 7/05, Department of Econometrics
and Business Statistics, Monash University.
Paap R, van Dijk H (2003) Bayes estimation of Markov trends in possibly cointegrated series: an appli-
cation to U.S. consumption and income. Journal of Business & Economic Statistics, 21, 547–563.
Parent E, Rivot E (2013) Introduction to Hierarchical Bayesian Modeling for Ecological Data. Chapman
and Hall/CRC.
Patwardhan A, Small M (1992) Bayesian methods for model uncertainty analysis with application to
future sea level rise. Risk Analysis, 12, 513–523.
Pedroza C (2006) A Bayesian forecasting model: Predicting U.S. male mortality. Biostatistics, 7,
530–550.
Penzer J (2006) Diagnosing seasonal shifts in time series using state space models. Statistical
Methodology, 3, 193–210.
Perreault L, Bernier J, Bobée B, Parent E (2000) Bayesian change-point analysis in hydrometeoro-
logical time series: Comparison of change-point models and forecasting. Journal of Hydrology,
235, 242–263.
Petris G, Petrone S, Campagnoli P (2009) Dynamic Linear Models with R. Springer, New York.
Piegorsch W, Bailer J (2005) Analyzing Environmental Data. Wiley.
Pourahmadi M (2002) Graphical diagnostics for modeling unstructured covariance matrices.
International Statistical Review, 70, 395–417.
Prado R, Huerta G, West M (2000) Bayesian time-varying autoregressions: Theory, methods and
applications. Journal of the Institute of Mathematics and Statistics of the University of Sao Paolo, 4,
405–422.
Punt A, Hilborn R (1997) Fisheries stock assessment and decision analysis: The Bayesian approach.
Reviews in Fish Biology and Fisheries, 7, 35–63.
Rankin P S, Lemos R T (2015) An alternative surplus production model. Ecological Modelling, 313,
109–126.
Reis E, Salazar E, Gamerman D (2006) Comparison of sampling schemes for dynamic linear models.
International Statistical Review, 74, 203–214.
Rue H, Held L (2005) Gaussian Markov Random Fields: Theory and Applications. Chapman and Hall/
CRC.
Ruiz-Cárdenas R, Krainski E T, Rue H (2012) Direct fitting of dynamic models using integrated nested
laplace approximations—INLA. Computational Statistics & Data Analysis, 56(6), 1808–1828.
Santos T, Franco G, Gamerman D (2010) Comparison of classical and Bayesian approaches for inter-
vention analysis. International Statistical Review, 78(2), 218–239.
Schaefer MB (1954) Some aspects of the dynamics of populations important to the management of
the commercial marine fisheries. Inter-American Tropical Tuna Commission Bulletin, 1(2), 23–56.
Schmidt D, Makalic E (2013) Estimation of stationary autoregressive models with the Bayesian
LASSO. Journal of Time Series Analysis, 34(5), 517–531.
Schotman P, Van Dijk H (1991) On Bayesian routes to unit roots. Journal of Applied Econometrics, 6,
387–401.
Scott S (2017) Package ‘bsts’. Bayesian Structural Time Series.
https://cran.r-project.org/web/packages/bsts/bsts.pdf
Silva N, Pereira I, Silva M E (2009) Forecasting in INAR (1) model. REVSTAT, 7(1), 119–134.
Silveira de Andrade B, Andrade M, Ehlers R (2015) Bayesian GARMA models for count data.
Communications in Statistics: Case Studies, Data Analysis and Applications, 1(4), 192–205.
Simpson M, Niemi J, Roy V (2017) Interweaving Markov chain Monte Carlo strategies for efficient
estimation of dynamic linear models. Journal of Computational and Graphical Statistics, 26(1),
152–159.
Soyer R, Aktekin T, Kim B (2015) Bayesian modeling of time series of counts with business applica-
tions, in Handbook of Discrete-Valued Time Series, eds R Davis, S Holan, R Lund, N Ravishanker.
CRC.
Speed T, Kiiveri H (1986) Gaussian distributions over finite graphs. Annals of Statistics, 14, 138–150.
Startz R (2008) Binomial autoregressive moving average models with an application to US reces-
sions. Journal of Business & Economic Statistics, 26(1), 1–8.
Strickland C, Turner I, Denham R, Mengersen K (2008) Efficient Bayesian Estimation of Multivariate
State Space Models. http://eprints.qut.edu.au
Tsay R (1986) Time series model specification in the presence of outliers. Journal of the American
Statistical Association, 81, 132–141.
Utazi C (2017) Bayesian single changepoint estimation in a parameter-driven model. Scandinavian
Journal of Statistics, 44(3), 765–779.
Verdinelli I, Wasserman L (1991) Bayesian analysis of outlier problems using the Gibbs sampler.
Statistics and Computing, 1, 105–117.
Wang D, Ghosh S (2002) Bayesian analysis of random coefficient autoregressive models. Model
Assisted Statistics and Applications, 3(2), 281–295.
Wang J, Chan J, Choy S (2011) Stochastic volatility models with leverage and heavy-tailed distribu-
tions: A Bayesian approach using scale mixtures. Computational Statistics & Data Analysis, 55(1),
852–862.
Wang P (2011) Pricing currency options with support vector regression and stochastic volatility
model with jumps. Expert Systems with Applications, 38(1), 1–7.
West M (1998) Bayesian forecasting, in Encyclopedia of Statistical Sciences, eds S Kotz, C Read, D Banks.
Wiley.
West M (2013) Bayesian dynamic modelling, pp 145–166, in Bayesian Inference and Markov Chain
Monte Carlo: In Honour of Adrian FM Smith, eds M West, P Damien, P Dellaportas, N Polson, D
Stephens. Oxford University Press.
West M, Harrison P (1997) Bayesian Forecasting and Dynamic Models, 2nd Edition. Springer-Verlag,
New York.
West M, Harrison P, Migon H (1985) Dynamic generalised linear models and Bayesian forecasting.
Journal of the American Statistical Association, 80, 73–97.
Wu R, Cui Y (2014) A parameter-driven logit regression model for binary time series. Journal of Time
Series Analysis, 35(5), 462–477.
Yu J, Meyer R (2006) Multivariate stochastic volatility models: Bayesian estimation and model com-
parison. Econometric Reviews, 25, 361–384.
6
Representing Spatial Dependence
6.1 Introduction
In the analysis of spatially configured data, positive covariation is typically expected
between observations (areas, points) that are close to each other, so that residual spatial
dependence may remain under a simple iid residual assumption (Anselin and Bera, 1998).
Spatial heterogeneity in regression relationships is also common (Anselin, 2010). Spatial
regression aims to represent the residual structure appropriately, or represent hetero-
geneity, and may also be used to obtain improved estimates, especially when applying
Bayesian spatial smoothing. Consider disease counts for areas, when small event totals or
small populations lead to unstable point estimates of rates or relative risks. One is then led
to hierarchical regression models for borrowing strength to achieve more stable estimates
(Riggan et al., 1991; Waller, 2002). If there is spatial covariation (e.g. when contiguous areas
have similar disease levels), an appropriate borrowing strength mechanism would incor-
porate local smoothing towards the mean of adjacent areas (Clayton and Kaldor, 1987). By
contrast, assuming exchangeable random effects implies global smoothing, with rates or
risks smoothed towards the overall mean, and does not account for spatial dependence.
Priors for spatial covariance modelling are therefore structured in the sense of explic-
itly recognising the role of adjacency or proximity, and use this structure as the basis for
smoothing or prediction. Often smoothing of rates is an end in itself; for example, spatial
smoothing of area health data to reflect similarity of disease risks in nearby areas is a
more reliable guide for health interventions (e.g. Zhu et al., 2006). However, structured
priors may also be suitable when the goals of analysis include out-of-sample prediction. In
geostatistical applications, a frequent goal is interpolation of a modelled surface to unsam-
pled locations based on proximity to observed locations (Gotway and Wolfinger, 2003;
Webster et al., 1994; Jiruse et al., 2004).
The R environment now offers considerable potential for analysing spatial data, as dis-
cussed, for example, in Bivand et al. (2013), Allard et al. (2017), and Brunsdon and Comber
(2015). On-line R-based resources for spatial data analysis include www.rspatial.org/spatial/
and https://data.cdrc.ac.uk/tutorial/an-introduction-to-spatial-data-analysis-and-visualisation-in-r.
Bayesian spatial estimation in R is facilitated by packages such as
CARBayes (Lee, 2013), R-INLA (Blangiardo and Cameletti, 2015; Schrödle and Held, 2011),
INLABMA (Gómez-Rubio and Bivand, 2018), geostatsp (Brown, 2015), spBayes (Finley
et al., 2015), geoR (Ribeiro and Diggle, 2018), and spNNGP.
While there may be benefits from borrowing strength methods based on spatial proxim-
ity, using random effects to represent unobserved components may raise potential iden-
tification issues. For example, priors for random effects may specify differences between
adjacent observations without specifying their mean, so that MCMC methods then require
$$p(\theta_i \mid \theta_{[i]}) \propto \tau \exp\left[-\sum_{j \neq i} w_{ij}\,\Phi\big(\tau[\theta_i - \theta_j]\big)\right],$$
where θ[i] denotes values for cases other than i, wij are weights specifying spatial dependence
between observations i and j, Φ(u) is an increasing function in u, subject to Φ(u) = Φ( −u),
and τ a precision parameter. Under a neighbourhood prior, where wij = 1 when observations
(usually areas) i and j are neighbours and wij = 0 otherwise, an equivalent representation is
$$p(\theta_i \mid \theta_{[i]}) \propto \tau \exp\left[-\sum_{j \in \partial_i} \Phi\big(\tau[\theta_i - \theta_j]\big)\right],$$
where ∂i is the set of areas adjacent to area i . The case wij = 1 if |i − j| = 1 and wij = 0 otherwise
leads to first order random walk priors relevant to modelling time-ordered data. The MRF
prior generalises to variables θij in two-dimensional lattices (e.g. areas i and times j), and a
neighbourhood might then be defined as ∂ ij = [(i + 1, j),(i − 1, j),(i , j + 1),(i , j − 1)] (Lavine,
1999). Taking Φ(u) = u²/2 leads to a Gaussian or L2 norm conditional prior for θi (Waller, 2002)
$$\theta_i \mid \theta_{[i]} \sim N\left(\sum_{j \neq i} \frac{w_{ij}\theta_j}{w_{i+}},\ \frac{1}{\tau w_{i+}}\right), \qquad (6.1)$$
whereas if Φ(u) = |u| then
$$p(\theta_i \mid \theta_{[i]}) \propto \tau \exp\left(-\tau \sum_{j \neq i} w_{ij}\,|\theta_i - \theta_j|\right),$$
known as the L1 norm prior (Richardson et al., 2004). To achieve robust smoothing, the
latter form may be better suited to spatial discontinuities, since its mode is at the median
rather than the mean.
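The contrast between the two norms can be seen in a small numerical sketch. The neighbouring values and the equal weights below are hypothetical, chosen only to include one discrepant neighbour; the code locates the conditional mode under each prior.

```r
# Hypothetical neighbouring effects for one area; the last value is an outlier
theta_nb <- c(0.10, 0.20, 0.15, 0.25, 3.00)
w <- rep(1, length(theta_nb))        # binary neighbourhood weights (assumed)

# L2 norm prior: the conditional mode is the weighted mean of the neighbours
l2_mode <- sum(w * theta_nb) / sum(w)

# L1 norm prior: the conditional mode minimises sum_j w_ij * |theta_i - theta_j|
l1_obj  <- function(th) sum(w * abs(th - theta_nb))
l1_mode <- optimize(l1_obj, interval = range(theta_nb))$minimum

round(c(L2 = l2_mode, L1 = l1_mode, median = median(theta_nb)), 3)
```

The L2 mode is pulled towards the outlying neighbour, whereas the L1 mode stays close to the median of the neighbouring values.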
that neighbouring areas or points tend to be similar, and that similarity typically diminishes
as distance increases. Even if known predictors are available, it is likely that other relevant
influences on the underlying process cannot be identified or measured, and this residual
heterogeneity is likely (at least in part) to be spatially structured (Lawson, 2008, p.94). For
example, Gelfand et al. (2005a) consider spatial modelling of residuals in the analysis of spe-
cies distributions, both for areas and points as the units, where unobserved influences might
include habitat and inter-species competition. Bayesian techniques have played a central role
in recent developments for analysing spatial data, whether space is viewed from a discrete or
continuous perspective, e.g. Banerjee et al. (2014) and Waller and Carlin (2010).
In studies with a discrete framework, the data are typically aggregated, with observa-
tions consisting of counts (e.g. of diseased subjects in spatial epidemiology) or of regional
indicators (e.g. average income per head or house prices in spatial econometrics). By con-
trast, in geostatistical models for geochemical readings, species distribution, or disease
events in relation to a pollution source, a continuous spatial framework is more relevant
(Section 6.5), allowing interpolation between observed point readings.
Consider metric responses yi for areas i, or at sites specified by grid references gi = ( g1i , g 2i ).
To allow greater flexibility, one may assume a “convolution” prior that compromises
between structured and unstructured variation; so the model includes both a spatially
structured random effect si and a fully exchangeable effect ui, with
$$y_i = \alpha + u_i + s_i,$$
where ui ∼ N(0, σu²), but the si are spatially correlated. Alternatively, suppose yi are counts,
and that Pi are populations at risk with yi ∼ Bin(Pi, πi). Then one may specify
$$\text{logit}(\pi_i) = \alpha + u_i + s_i,$$
where πi are latent probabilities of the event. Alternatively, for rare events in relation to the
risk population, a Poisson assumption is relevant with yi ∼ Po(Pi λi), and
$$\log(\lambda_i) = \alpha + u_i + s_i,$$
where λi are latent event rates per unit of Pi. If the offsets to the Poisson mean are expected
health events Ei, such that Σi yi = Σi Ei with yi ∼ Po(Ei λi), then the λi are interpretable as
latent relative risks (Wakefield, 2007, p.160).
One way to model the correlation in the elements of the vector s = (s1 , … , sn ) is to directly
specify a joint multivariate prior with covariance matrix that expresses spatial correlation
between areas i and j or sites gi and gj (Richardson et al., 1992, p.541; Wakefield, 2007).
Typical assumptions in such models (also considered in Section 6.5) are of stationarity and
isotropy, with the latter meaning the correlation is the same in all directions. For example,
a multivariate normal prior would take
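$$(s_1, \ldots, s_n) \sim N_n(0, \Sigma_s),$$
$$\Sigma_s = \sigma_s^2 W = \sigma_s^2 \begin{pmatrix} 1 & w_{12} & \cdots & w_{1n} \\ w_{21} & 1 & \cdots & w_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ w_{n1} & w_{n2} & \cdots & 1 \end{pmatrix},$$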
where wij = f(dij) are correlation functions that decline as the spatial separation dij between
areas i and j (or sites gi and gj) increases, and defined to ensure that W is always non-nega-
tive definite (Mardia and Watkins, 1989).
For example, one may specify exponential spatial decay,
$$w_{ij} = \exp(-\delta d_{ij}),$$
where δ > 0, or for area units, allow for both inter-area distance dij and the length bij of the
common border between area i and j, namely
where γ1 is negative, and γ2 is positive. Another choice is the disc model with
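$$w_{ij} = \frac{2}{\pi}\left[\cos^{-1}\left(\frac{d_{ij}}{\kappa}\right) - \frac{d_{ij}}{\kappa}\left(1 - \frac{d_{ij}^2}{\kappa^2}\right)^{0.5}\right], \qquad d_{ij} \le \kappa,$$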
with wij = 0 for dij > κ, so that κ controls the decline in correlation with distance. Such choices
are to some degree arbitrary, and inferences may be sensitive to the choice of spatial
weights (e.g. Bhattacharjee and Jensen-Butler, 2006).
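As a rough illustration of how such distance-based correlation matrices can be built and checked, the following sketch uses a hypothetical 3 x 3 grid of sites. The decay parameter `delta`, the range `kappa`, and the function names `exp_corr` and `disc_corr` are assumptions made for the example, not values or code from the text.

```r
# Hypothetical site coordinates on a 3 x 3 grid (illustration only)
g <- as.matrix(expand.grid(g1 = 1:3, g2 = 1:3))
d <- as.matrix(dist(g))                 # inter-site distances d_ij

# Exponential decay: correlation exp(-delta * d_ij), with delta > 0
exp_corr <- function(d, delta) exp(-delta * d)

# Disc model: correlation falls to zero once d_ij exceeds kappa
disc_corr <- function(d, kappa) {
  r <- pmin(d / kappa, 1)               # r = 1 gives acos(1) - 0 = 0
  (2 / pi) * (acos(r) - r * sqrt(1 - r^2))
}

W_exp  <- exp_corr(d, delta = 0.5)
W_disc <- disc_corr(d, kappa = 2.5)

# A valid correlation matrix has unit diagonal and no negative eigenvalues
min(eigen(W_exp,  symmetric = TRUE)$values)
min(eigen(W_disc, symmetric = TRUE)$values)
```

Repeating the check over a range of `delta` or `kappa` values gives a quick view of how sensitive the implied correlation structure is to the chosen weights.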
6.2.1 SAR Schemes
A widely used scheme, especially in spatial econometrics, specifies the joint density via
simultaneous autoregressive or SAR effects (Richardson et al., 1992). By analogy with
ARMA time series models, the autoregression may operate both for (metric) responses
y = ( y1 , … y n )′ , and for the error vector e = (e1 , … en )′ . Let W = [wij ] be a spatial dependence
matrix as above, but with wii = 0 rather than wii = 1. One possible SAR scheme has the form
$$y_i = \rho_1 \sum_{h \neq i} w_{ih} y_h + X_i\beta + e_i,$$
$$e_i = \rho_2 \sum_{h \neq i} w_{ih} e_h + u_i,$$
where ρ1 and ρ2 are measures of spatial dependence, and the u = (u1, …, un)′ are
independently distributed, with diagonal covariance matrix Σu. The covariance matrix for
e = (e1, …, en)′ is (I − ρ2W)⁻¹ Σu (I − ρ2W′)⁻¹. In matrix form
$$y = \rho_1 W y + X\beta + e,$$
$$e = \rho_2 W e + u.$$
The ρ coefficients are constrained to lie between 1/ηmin and 1/ηmax, where {η1, …, ηn} are
the eigenvalues of W, in order to ensure that (I − ρW) is invertible. If the weights matrix is
standardised to have row sums of unity, so that wij* = wij/Σh wih, then the maximum
eigenvalue of W* is 1 and since negative spatial correlation is unlikely, one may specify uniform
or beta priors on ρ coefficients in the interval [0,1]. Wall (2004) points out that SAR priors
(and also CAR priors, as considered below) may generate implausible covariance patterns
when considered in terms of the joint priors.
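The eigenvalue bounds on ρ, and the effect of row standardisation, can be checked numerically. The sketch below uses a hypothetical 3 x 3 lattice with rook adjacency; the lattice and the trial value ρ = 0.3 are assumptions for illustration.

```r
# Rook adjacency on a hypothetical 3 x 3 lattice (illustration only)
coords <- expand.grid(row = 1:3, col = 1:3)
W <- 1 * (as.matrix(dist(coords)) == 1)    # w_ij = 1 for neighbours, w_ii = 0

# Admissible range for rho under the unstandardised weights
eta <- eigen(W, symmetric = TRUE)$values
rho_range <- c(lower = 1 / min(eta), upper = 1 / max(eta))

# Row-standardised weights: the largest eigenvalue is 1
W_star   <- W / rowSums(W)
eta_star <- Re(eigen(W_star, only.values = TRUE)$values)

rho_range
max(eta_star)

# Inside the admissible range, I - rho W can be inverted
A_inv <- solve(diag(nrow(W)) - 0.3 * W)
```

For a real adjacency matrix the same calculation shows directly which interval a uniform or beta prior on ρ should cover.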
Variants of the above scheme include the spatial errors model (SEM), with ρ1 = 0 (Cressie
and Wikle, 2011),
$$y = X\beta + e, \qquad (6.2)$$
$$e = \rho W e + u,$$
and the spatial lag model, with ρ2 = 0,
$$y = \rho W y + X\beta + u, \qquad (6.3)$$
where in both models the ui ∼ N(0, σ²) are iid. The spatial errors model may be expressed as
$$y = X\beta + (I - \rho W)^{-1} u,$$
or, equivalently,
The SEM model may also be considered as a prior for spatially correlated effects. For exam-
ple, in (6.2) one may assume spatially varying βi over units i, with
$$\beta = \beta_\mu + e_\beta,$$
$$e_\beta = \rho_\beta W e_\beta + u_\beta,$$
where βμ is the average coefficient. Another option is a spatial moving average errors rep-
resentation (Hepple, 2003) with
$$y = X\beta + e,$$
$$e = \rho W u + u,$$
$$u \sim N(0, \sigma^2).$$

$$y = X\beta + (I - \rho W)^{-1} u,$$
$$u = X\gamma + v,$$
$$y = X\beta + (I - \rho W)^{-1} X\gamma + (I - \rho W)^{-1} v.$$
Expressed with iid errors, this leads to the spatial Durbin model or SDM (Seya et al., 2012;
Lacombe and LeSage, 2015),
$$y = \rho W y + X(\beta + \gamma) - \rho W X\beta + v. \qquad (6.4)$$
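The step from the spatial errors form to the Durbin form is pure algebra (premultiplying by I − ρW), and can be verified numerically. The sketch below simulates one data set under assumed values of ρ, β and γ on a hypothetical 3 x 3 lattice, and checks that both expressions give the same y.

```r
set.seed(1)
coords <- expand.grid(row = 1:3, col = 1:3)
W <- 1 * (as.matrix(dist(coords)) == 1)
W <- W / rowSums(W)                      # row-standardised weights
n <- nrow(W)
I <- diag(n)
rho <- 0.5; beta <- 2; gamma <- -1       # hypothetical parameter values

X <- matrix(rnorm(n), n, 1)              # single predictor
v <- rnorm(n)                            # iid errors
A_inv <- solve(I - rho * W)

# Spatial errors form with u = X gamma + v
y <- X * beta + A_inv %*% (X * gamma + v)

# Durbin form: y = rho W y + X (beta + gamma) - rho W X beta + v
y_sdm <- rho * W %*% y + X * (beta + gamma) - rho * W %*% X * beta + v

max(abs(y - y_sdm))    # agreement up to rounding error
```

The coefficient on the spatially lagged predictor WX is −ρβ under this derivation, which is why a negative estimate for the lagged predictor is to be expected when ρ and β are both positive.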
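library(easypackages)
libraries("INLA", "spdep", "INLABMA", "maptools")
setwd("C:/R Files BHMRA")
# Shapefile of East London electoral wards
ELmap <- readShapePoly("Example_6_1")
# Neighbour list using rook contiguity (queen = FALSE)
ELnb <- poly2nb(ELmap, queen = FALSE)
# Spatial weights list; nb2listw() defaults to row-standardised weights (style = "W")
lw <- nb2listw(ELnb, glist = NULL, zero.policy = NULL)
# Sparse weights matrix, built from the same row-standardised weights
W <- as(as_dgRMatrix_listw(nb2listw(ELnb)), "CsparseMatrix")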
A grid for the spatial autocorrelation parameter ρ in the SDM model (6.4) is specified
with limits 0.2 and 0.9, namely grid.rho = seq(0.2, 0.9, length.out=20). This is based on an
estimate of 0.48 from maximum likelihood estimation. The estimates from INLABMA
are shown in Table 6.1, with mean (sd) for ρ of 0.53 (0.09). The DIC is estimated as 1242.
With rstan, we may estimate the spatial errors (SEM) model, using the multi_normal_prec
option (Brunsdon, 2018). Thus with
the precision is [(I − ρW)′(I − ρW)]/σ². The full code, with flat priors on hyperparameters,
is
model="data {
int N;
vector[N] x;
vector[N] y;
matrix<lower=0>[N,N] W;
matrix<lower=0,upper=1>[N,N] I;
}
parameters {
real beta;
real alpha;
real<lower=0> sigma;
real<lower=-1,upper=1> rho;
}
model {
// precision of y is (I - rho*W)'(I - rho*W)/sigma^2
y ~ multi_normal_prec(alpha + x * beta, crossprod(I - rho * W)/(sigma*sigma));
}
generated quantities {
real LL;
LL = multi_normal_prec_lpdf(y | alpha + x * beta, crossprod(I - rho * W)/(sigma*sigma));
}"
This leads to very similar estimates to those obtained using maximum likelihood,
with posterior mean (sd) for ρ of 0.50 (0.10). The log-likelihood is estimated at −627, and
the DIC (estimated as the mean deviance plus the number of parameters) is obtained
as 1258.
TABLE 6.1
Spatial Autoregressive Models Compared
Model                                 Parameter          Mean     St devn   2.5%    Median   97.5%
Spatial Error Model                   Intercept          109.8    10.4      90.1    109.9    129.9
                                      IMD                6.2      0.3       5.7     6.2      6.8
                                      ρ                  0.50     0.10      0.30    0.51     0.69
                                      DIC                1258.0
Spatial Lag Model                     Intercept          84.6     12.2      60.8    84.1     108.2
                                      IMD                5.42     0.39      4.66    5.43     6.18
                                      ρ                  0.16     0.07      0.03    0.17     0.30
                                      DIC                1271.4
Spatial Moving Average Errors Model   Intercept          109.4    8.8       92.4    109.3    126.8
                                      IMD                6.3      0.3       5.7     6.3      6.7
                                      ρ                  0.47     0.12      0.23    0.46     0.71
                                      DIC                1262.3
Spatial Durbin Model                  Intercept          48.0     6.7       34.7    48.0     61.2
                                      IMD                6.2      0.4       5.4     6.2      7.0
                                      IMD-spatial lag    −3.2     0.5       −4.1    −3.2     −2.3
                                      ρ                  0.53     0.09      0.36    0.53     0.71
                                      DIC                1241.6
A similar approach may be applied to estimate the spatial moving average errors
model, except that the likelihood is now
The DIC for this model is slightly higher than for the spatial autocorrelated errors
model, with posterior mean (sd) for ρ of 0.47 (0.12).
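The spatial lag model may be estimated using the target += representation to accommodate
the likelihood. The log-likelihood is
$$\log L = \log|I - \rho W| - \frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}(y - \rho W y - X\beta)'(y - \rho W y - X\beta),$$
where the determinant may be obtained as
$$|I - \rho W| = \prod_{i=1}^{n}(1 - \rho\lambda_i),$$
with λ = (λ1, …, λn) being the eigenvalues of W. So the log determinant term may be
written
$$\log|I - \rho W| = \sum_{i=1}^{n}\log(1 - \rho\lambda_i).$$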
For simplicity, the target += calculations include the squared regression error terms and
the log determinant contributions in the same summand terms, albeit with the total of
these summands still being the overall log-likelihood. Discrepancies at case level might
be assessed by standardised residuals. Estimates for the four hyperparameters are very
similar to those from maximum likelihood, with posterior mean (sd) for ρ of 0.16 (0.07),
and 5.42 (0.39) for the regression coefficient on IMD.
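The eigenvalue form of the log determinant is convenient because the eigenvalues of W need only be computed once, outside the sampler. A minimal check of the identity, assuming a hypothetical 4 x 4 lattice with row-standardised rook weights and a trial value of ρ, is:

```r
coords <- expand.grid(row = 1:4, col = 1:4)
W <- 1 * (as.matrix(dist(coords)) == 1)
W <- W / rowSums(W)                           # row-standardised weights

# Eigenvalues computed once; they are real for row-standardised symmetric adjacency
lambda <- Re(eigen(W, only.values = TRUE)$values)

# Log determinant from the eigenvalues, cheap to re-evaluate for each rho
logdet_eig <- function(rho) sum(log(1 - rho * lambda))

# Direct evaluation for comparison
logdet_direct <- function(rho) {
  as.numeric(determinant(diag(nrow(W)) - rho * W, logarithm = TRUE)$modulus)
}

rho <- 0.4                                    # hypothetical value
c(eigen_form = logdet_eig(rho), direct = logdet_direct(rho))
```

In a Stan program the vector of eigenvalues would typically be passed in as data, so that only the sum of logs is evaluated at each iteration.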
To illustrate the SEM as a prior for random spatial effects, we extend the above rstan
code to allow the random coefficients scheme
$$\beta = \beta_\mu + e_\beta,$$
$$e_\beta = \rho_\beta W e_\beta + u_\beta,$$
where βμ is the average coefficient. This involves an extra input vector, e = rep(1,N), in
the data block:
vector<upper=1>[N] e;   // data block: vector of ones
vector[N] beta;         // parameters block: area-specific slopes
There are extra parameters beta_mu, sigma_b, rho_b, and a model block as follows:
model {
beta ~ multi_normal_prec(e * beta_mu, tcrossprod(I - rho_b * W)/(sigma_b*sigma_b));
y ~ multi_normal_prec(alpha + x .* beta, tcrossprod(I - rho * W)/(sigma*sigma));
}
FIGURE 6.1
Histogram of spatially varying predictor effect.
This option shows an increase in the log-likelihood from −627.0 to −589.0, with
Figure 6.1 showing the variation in the impacts of deprivation, and slopes varying from
5.9 to 7.3.
$$p(s) = \frac{1}{(2\pi)^{n/2}}\,|\Sigma_s|^{-0.5}\exp(-0.5\,s'\Sigma_s^{-1}s).$$
Denote Q = [qij] = Σs⁻¹ as the precision matrix, and s[i] = (s1, …, si−1, si+1, …, sn). Then the
conditional distributions for each si take a univariate normal form, corresponding to the
pairwise interaction function Φ(u) = u²/2 (Rue and Held, 2005, p.22), namely
$$s_i \mid s_{[i]} \sim N\left(-\sum_{j \neq i}\frac{q_{ij}}{q_{ii}}s_j,\ \frac{1}{q_{ii}}\right),$$
with corr(si, sj | s[i,j]) = −qij/√(qii qjj). Following Besag and Kooperberg (1995, p.734) define
hii = 0, and set
The above conditional density is then in the conditional autoregressive form specified by
Besag (1974),
$$s_i \mid s_{[i]} \sim N\left(\sum_{j \neq i} h_{ij}s_j,\ \delta/a_i\right). \qquad (6.6)$$
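The link between the precision-based conditional and the form (6.6) can be sketched by matching means and variances. This is offered as a reading of the two expressions, on the assumption that (6.6) is simply a reparameterisation of the same conditional density:

```latex
-\sum_{j\neq i}\frac{q_{ij}}{q_{ii}}\,s_j=\sum_{j\neq i}h_{ij}s_j
\;\Rightarrow\; h_{ij}=-\frac{q_{ij}}{q_{ii}},
\qquad
\frac{1}{q_{ii}}=\frac{\delta}{a_i}
\;\Rightarrow\; a_i=\delta\,q_{ii}.
```

Under this matching the off-diagonal precisions are qij = −ai hij/δ, so that symmetry of Q is equivalent to hij ai = hji aj.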
To obtain the joint density from the conditional one, symmetry of Q means qij = qji, so
that from (6.5), the constraint
$$h_{ij}a_i = h_{ji}a_j$$
applies. Note that expressing δ/ai = τi² or ai = δ/τi², this constraint can also be stated
(Cressie and Kapat, 2008) as
$$h_{ij}\tau_j^2 = h_{ji}\tau_i^2.$$
Letting R = A(I − H), where A = diag(a1, …, an), one has that R is symmetric with diagonal
elements ai and off-diagonal elements −aihij. So the joint density (Besag and Green, 1993;
Banerjee et al., 2014) implied by the conditional priors is
$$(s_1, \ldots, s_n) \sim N_n(0, \delta R^{-1})$$
where Q = δ⁻¹R. If R is positive definite as well as symmetric, the joint density of the
spatial effects is proper. Positive definiteness of R holds under diagonal dominance (Rue and
Held, 2005, p.20; Besag and Kooperberg, 1995, p.734), namely, that in at least one row (or
column) of R, the diagonal element rii exceeds the absolute sum of the off-diagonal
elements |Σj≠i rij|.
$$h_{ij} = \rho\,\frac{w_{ij}}{\sum_{k \neq i} w_{ik}}, \qquad (6.7)$$
$$a_i = \sum_{k \neq i} w_{ik},$$
where 0 ≤ ρ ≤ 1, and taking wij = wji, with wii = 0, ensures the symmetry constraint is met,
with hij ai = ρwij = hji aj. This is sometimes called the proper CAR, as the covariance matrix
of the corresponding multivariate density is invertible. The most commonly applied
approach is to set wij = 1 for adjacent areas and wij = 0 otherwise, and let ai = di = Σk≠i wik,
where di is then the number of areas adjacent to area i. For example, when a region is par-
titioned into grid cells, then each grid cell has eight (first order) neighbours (Gelfand et al.,
2005a). However, distance or common boundary length based forms for wij can be used.
In this case, R = A(I − H) has diagonal elements di and off-diagonal elements −ρwij. This
provides the intrinsic conditional autoregression or ICAR(ρ) prior, with
$$s_i \mid s_{[i]} \sim N\left(\rho A_i,\ \frac{\delta}{d_i}\right),$$
where Ai is the average of the sj in locality Li of area i, i.e.
$$A_i = \frac{\sum_{j \in L_i} s_j}{d_i}.$$
Note that R = A(I − H ) = D − rW is positive definite, and the joint prior on (s1 , … sn ) is
proper, only when |ρ| < 1. Lower values of ρ imply lesser degrees of spatial dependence
between the si, though the limiting case when ρ = 0 has the disadvantage that the variance
is not constant, but depends on the number of neighbours di.
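The role of ρ in ensuring propriety can be examined by computing the smallest eigenvalue of D − ρW over a grid of values. The sketch below assumes binary rook adjacency on a hypothetical 3 x 3 lattice and δ = 1; both are illustrative choices.

```r
coords <- expand.grid(row = 1:3, col = 1:3)
W <- 1 * (as.matrix(dist(coords)) == 1)   # binary adjacency, w_ii = 0
D <- diag(rowSums(W))                     # d_i = number of neighbours

# Smallest eigenvalue of R = D - rho W; R is positive definite if this is > 0
min_eig <- function(rho) min(eigen(D - rho * W, symmetric = TRUE)$values)

rho_grid <- c(0, 0.5, 0.9, 0.99, 1)
data.frame(rho = rho_grid, min_eigenvalue = sapply(rho_grid, min_eig))

# At rho = 0 the effects are independent, but the variances delta / d_i still differ
delta <- 1                                # hypothetical value
cond_var <- delta / diag(D)
sort(unique(round(cond_var, 3)))
```

The smallest eigenvalue shrinks towards zero as ρ approaches 1, where R becomes singular and the joint prior is no longer proper.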
Alternatively, in a CAR(ρ) spatial prior, as distinct from the ICAR(ρ) prior, one may set
$$h_{ij} = \rho w_{ij}, \qquad a_i = 1,$$
so that
$$s_i \mid s_{[i]} \sim N\left(\rho\sum_{j \neq i} w_{ij}s_j,\ \delta\right),$$
with a homogeneous conditional variance (Cressie and Kapat, 2008, p.729). In this case,
R = I − ρW is positive definite, and so invertible (and the joint density is proper), when the
correlation parameter is between 1/ηmin and 1/ηmax where η1, …, ηn are the eigenvalues of
W (Bell and Broemeling, 2000).
A compromise scheme for the variance deflators ai – see MacNab et al. (2006) and Leroux
et al. (1999) – sets
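$$a_i = (1 - \lambda) + \lambda\sum_{j \neq i} w_{ij},$$
$$h_{ij} = \frac{\lambda w_{ij}}{1 - \lambda + \lambda\sum_{j \neq i} w_{ij}},$$
which satisfies the symmetry constraint, since hij ai = λwij = λwji = hji aj. So the joint density
for (s1, …, sn) has covariance δR⁻¹ where
$$R = \lambda F + (1 - \lambda)I,$$
$$f_{ii} = \sum_{j \neq i} w_{ij},$$
$$f_{ij} = -w_{ij}, \qquad i \neq j.$$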
The case λ = 0 corresponds to a lack of spatial interdependence, with R then reducing to an
identity matrix, and borrowing strength confined to “global smoothing.” By contrast, λ = 1
leads to the ICAR(1) model (see 6.3.3). So
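$$s_i \mid s_{[i]} \sim N\left(\frac{\lambda}{1 - \lambda + \lambda\sum_{j \neq i} w_{ij}}\sum_{j \neq i} w_{ij}s_j,\ \frac{\delta}{1 - \lambda + \lambda\sum_{j \neq i} w_{ij}}\right),$$
which for binary adjacency weights becomes
$$s_i \mid s_{[i]} \sim N\left(\frac{\lambda}{1 - \lambda + \lambda d_i}\sum_{j \in L_i} s_j,\ \frac{\delta}{1 - \lambda + \lambda d_i}\right).$$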
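A short sketch may help connect the joint and conditional forms of this prior. It builds R = λF + (1 − λ)I for a hypothetical 3 x 3 lattice and compares the conditional moments given above with those implied by the precision matrix R/δ. The helper names `leroux_R` and `leroux_cond`, and the values of λ and δ, are assumptions for illustration.

```r
coords <- expand.grid(row = 1:3, col = 1:3)
W <- 1 * (as.matrix(dist(coords)) == 1)   # binary adjacency
n <- nrow(W)
Fmat <- diag(rowSums(W)) - W              # f_ii = sum_j w_ij, f_ij = -w_ij

leroux_R <- function(lambda) lambda * Fmat + (1 - lambda) * diag(n)

# Conditional mean and variance for area i from the formulas in the text
leroux_cond <- function(i, s, lambda, delta) {
  a_i <- 1 - lambda + lambda * sum(W[i, ])
  c(mean = lambda * sum(W[i, ] * s) / a_i, var = delta / a_i)
}

set.seed(5)
s <- rnorm(n); lambda <- 0.6; delta <- 2; i <- 5   # hypothetical values

# The same moments from the joint precision Q = R / delta
Q <- leroux_R(lambda) / delta
from_Q <- c(mean = -sum(Q[i, -i] * s[-i]) / Q[i, i], var = 1 / Q[i, i])

rbind(formula = leroux_cond(i, s, lambda, delta), precision = from_Q)
```

Setting `lambda` to 0 returns the identity matrix, while `lambda` equal to 1 gives a singular R, matching the two limiting cases described above.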
The scheme of Leroux et al. (1999) can be generalised to allow greater spatial adaptivity
with varying λ (Congdon, 2008). The symmetry condition hij ai = hji aj is maintained by setting
ai = (1 − λi) + λi Σj≠i wij, and taking
$$h_{ij} = \frac{\lambda_i\lambda_j w_{ij}}{1 - \lambda_i + \lambda_i\sum_{j \neq i} w_{ij}},$$
so that
$$h_{ij}a_i = h_{ji}a_j = \lambda_i\lambda_j w_{ij}.$$
where the average λμ and precision τλ are extra unknowns. Setting Λ = diag(λ1, …, λn), the
covariance in the joint prior is then
$$\delta\,[\Lambda F^* + (I - \Lambda)]^{-1},$$
where
$$f_{ii}^* = \sum_{j \neq i} w_{ij},$$
$$f_{ij}^* = -w_{ij}\lambda_j, \qquad i \neq j.$$
$$h_{ij} = \frac{\phi w_{ij}}{1 + |\phi|\sum_{j \neq i} w_{ij}},$$
and
$$a_i = 1 + |\phi|\sum_{j \neq i} w_{ij},$$
where ϕ measures the strength of spatial dependency, and the case ϕ = 0 corresponds to
an absence of spatial interdependence, such that R = I (see also Gschlößl and Czado, 2006).
Gibbs updating for ϕ can be applied. So
$$s_i \mid s_{[i]} \sim N\left(\frac{\phi}{1 + |\phi|\sum_{j \neq i} w_{ij}}\sum_{j \neq i} w_{ij}s_j,\ \frac{\delta}{1 + |\phi|\sum_{j \neq i} w_{ij}}\right).$$
Under both the MacNab et al. (2006) and Pettitt et al. (2002) schemes, the joint distribution
of s is proper, ensuring a proper posterior when either is taken as the prior distribution.
Retaining hij = ϕwij/(1 + |ϕ|Σj≠i wij), but setting ai = (1 + |ϕ|Σj≠i wij)/(1 + |ϕ|), means that ϕ → ∞
corresponds to the ICAR(1) prior, with the conditional variance δ(1 + |ϕ|)/(1 + |ϕ|Σj≠i wij)
tending to δ/Σj≠i wij.
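$$h_{ij} = \frac{w_{ij}}{\sum_{j \neq i} w_{ij}}, \qquad a_i = \sum_{j \neq i} w_{ij},$$
$$\log(\lambda_i) = \alpha + s_i,$$
$$s_i \mid s_{[i]} \sim N\left(A_i,\ \frac{\delta}{\sum_{j \neq i} w_{ij}}\right),$$
where Ai = Σj≠i wij sj / Σj≠i wij. The precision matrix of the joint prior is δ⁻¹R, where
$$r_{ii} = \sum_{j \neq i} w_{ij},$$
$$r_{ij} = -w_{ij}, \qquad i \neq j.$$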
When the wij are binary indicators of adjacency (wij = 1 for areas i and j contiguous, wij = 0
otherwise), then rii = di and the off-diagonal elements rij are −1 if i and j are neighbours,
but zero otherwise. This case demonstrates most directly that conditional independence
properties relating to spatial effects are stipulated by the matrix R and vice versa (Rue and
Held, 2005, p.4). Despite the relative simplicity of this form and the wide use of the ICAR(1)
conditional prior, R is not invertible under this model, and the joint prior is improper
(Haran et al., 2003).
To see this in another way, for the case where the wij are binary, the joint prior can be
specified in terms of pairwise comparisons between the si (Knorr-Held and Becker, 2000).
Let i ~ j denote that areas i and j are neighbours, then for a normal ICAR(1) model, the joint
prior in terms of differences si − sj is (Hodges et al., 2003)
$$p(s_1, \ldots, s_n) \propto \delta^{-0.5(n-1)}\exp\left(-\frac{1}{2\delta}\sum_{i \sim j}(s_i - s_j)^2\right).$$
Thus the prior only specifies differences between spatial effects and not their overall
level. However, all linear contrasts c′s with c′1 = 0 have proper distributions (Besag and
Kooperberg, 1995, p.740).
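Both views of the impropriety can be checked numerically: R has rows summing to zero (hence rank n − 1), and the quadratic form s′Rs equals the sum of squared differences over neighbour pairs, which is unchanged by adding a constant to every effect. The sketch below assumes binary rook adjacency on a hypothetical 3 x 3 lattice.

```r
coords <- expand.grid(row = 1:3, col = 1:3)
W <- 1 * (as.matrix(dist(coords)) == 1)   # binary adjacency
R <- diag(rowSums(W)) - W                 # r_ii = d_i, r_ij = -1 for neighbours
n <- nrow(R)

# Rows sum to zero, so R is singular with rank n - 1 for a connected map
range(R %*% rep(1, n))
qr(R)$rank

# s' R s equals the sum of squared differences over neighbour pairs i ~ j
set.seed(2)
s <- rnorm(n)
edges <- which(upper.tri(W) & W == 1, arr.ind = TRUE)   # each pair counted once
quad_form <- as.numeric(t(s) %*% R %*% s)
pair_sum  <- sum((s[edges[, 1]] - s[edges[, 2]])^2)

# Shifting every effect by a constant leaves the quadratic form unchanged
shifted <- as.numeric(t(s + 10) %*% R %*% (s + 10))

c(quad_form = quad_form, pair_sum = pair_sum, shifted = shifted)
```

The invariance to a common shift is the locational invariance that a sum-to-zero or corner constraint is designed to remove.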
To tie down the effects and remove their locational invariance, one method involves
centring the sampled values at every iteration to have mean zero. This is one form of lin-
ear constraint, and so the joint distribution becomes integrable and propriety is obtained
(Rodrigues and Assuncao, 2008). Another possibility is a corner constraint, i.e. setting a
particular effect to a known value, such as s1 = 0 (Besag et al., 1995). Finally, one may omit
the intercept so that the si model the level of the data. In this case, yi ∼ Po(Pi exp(si)) with
the si not constrained, rather than yi ∼ Po(Pi exp(α + si)).
As mentioned above a spatial effects-only assumption is relatively informative, and the
ICAR(1) spatial prior is often combined with an exchangeable prior to form a convolution
prior (Richardson et al., 2004). It may be argued that an exchangeable iid effect should
only be introduced in combination with an ICAR(1) spatial prior, since conditional priors
including a correlation parameter, such as the ICAR(ρ) can adjust to varying mixtures of
spatial and unstructured variation by varying the ρ parameter (Wakefield, 2007). Thus, for
a Poisson response, yi ∼ Po(λi Pi), the convolution prior of Besag et al. (1991), also called the
Besag-York-Mollie (BYM) prior, specifies
$$\log(\lambda_i) = \alpha + s_i + u_i$$
with si | s[i] ∼ N(Ai, δs/di), and ui ∼ N(0, δu) usually homoscedastic. Note that heteroscedasticity
or heavier tails than under the normal might be represented by taking ui ∼ N(0, ψi)
where
$$\psi_i = \delta_u/\kappa_i$$
where the κi are positive variables with mean 1 (LeSage, 1999). While only the sum zi = si + ui
is identifiable in this model, Norton and Niu (2009) show that the precisions δs and δu are
identifiable from the distribution of zi.
deviation sd(si) of the spatial effects is approximately equal to a multiple 1.43 (=1/0.7) of
the conditional scale term, (δs/d̄)^0.5, where d̄ is the average number of neighbours. Hence a
“fair” prior on sd(ui) = δu^0.5 (Banerjee et al., 2014, section 6.4.3.3) is one that ensures
Riebler et al. (2016) propose a modified BYM scheme retaining the two random effects, but
with a single scale parameter δ for the composite effects
$$t_i = u_i + s_i = \delta\left[\sqrt{1 - \rho}\,\theta_i + \sqrt{\rho}\,\phi_i^*\right].$$
Here θi ∼ N(0,1) are iid effects, the ϕi* are scaled versions of spatial effects ϕi following
an ICAR(1) prior, and ρ ∈ [0, 1] governs the proportion of residual variance due to spatial
dependence. To ensure δ is legitimate as the standard deviation of the composite
effect, one requires var(ϕi) ≈ var(θi) ≈ 1. To achieve this, Riebler et al. (2016) propose a scaling
whereby the geometric mean of variances of ϕi is 1. To obtain a scaling factor F, with
ϕi* = ϕi/F, one may apply the R-INLA function inla.scale.model to the adjacency matrix.
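The idea behind the scaling can be sketched in base R, without claiming to reproduce the internals of inla.scale.model. Under the assumption of a connected map, the marginal variances of the ICAR(1) effects (for unit δ, under a sum-to-zero constraint) may be taken from the Moore-Penrose inverse of R, and their geometric mean gives the variance scale to divide out. The 3 x 3 lattice is hypothetical.

```r
coords <- expand.grid(row = 1:3, col = 1:3)
W <- 1 * (as.matrix(dist(coords)) == 1)   # binary adjacency
R <- diag(rowSums(W)) - W                 # ICAR(1) structure matrix

# Moore-Penrose inverse of R: drop the zero eigenvalue tied to the constant vector
eg   <- eigen(R, symmetric = TRUE)
keep <- eg$values > 1e-8
R_ginv <- eg$vectors[, keep] %*% diag(1 / eg$values[keep]) %*% t(eg$vectors[, keep])

# Geometric mean of the implied marginal variances
v <- diag(R_ginv)
gm_var <- exp(mean(log(v)))

# Dividing the covariance by gm_var (the effects by its square root)
# gives marginal variances with geometric mean 1
R_ginv_scaled <- R_ginv / gm_var
exp(mean(log(diag(R_ginv_scaled))))
```

Because the scale depends only on the adjacency structure, it can be computed once and passed to the model as data.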
$$\text{logit}(\pi_i) = \alpha + s_i + u_i,$$
with conditional variance δs for ICAR(1) spatial effect si, and variance δu for the unstructured
effects. Using rstan for estimation, positive N⁺(0, 25) priors are assumed on the
standard deviations σs = δs^0.5 and σu = δu^0.5. In rstan, the ICAR(1) spatial prior is implemented
using the pairwise difference form of the joint multivariate density (e.g. Gerber
and Furrer, 2015; Morris, 2018), and in particular the target += formulation,
target += 0.5*(N-1)*log(tau_s) - 0.5*tau_s*dot_self(s[node1] - s[node2]);
A second analysis applies the Riebler et al. (2016), or BYM2, prior, with a single set of
effects, and with the proportion of spatial variance now a parameter. This model pro-
vides an unchanged LOO-IC of 608. The proportion of spatial variance ρ is estimated at
0.82, though with a wide 95% interval from 0.32 to 1. Forty-two of the composite effects
ti are now significant.
An area spatial model may also be assessed by whether residual spatial dependence is
removed, and this can be established using the moran.mc function in R. The moran.mc
function uses a Monte Carlo permutation test for Moran’s I statistic. Significant residual
correlation shows in extreme tail p-values, either values close to zero (positive residual
correlation), or p-values near 1 (negative residual correlation).
Here 100,000 permutations are taken, with the calculations using a binary adjacency
spatial interaction matrix for the 133 areas, converted to listw format. We find a non-
significant p-value of around 0.25 for the first model, and 0.27 for the second.
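For readers who want to see what such a permutation test involves, the following base R sketch computes Moran's I and a one-sided Monte Carlo p-value. It is not the spdep implementation: the function names `moran_I` and `moran_perm`, the 5 x 5 lattice, and the artificial residuals are all assumptions made for illustration.

```r
# Moran's I for values x under spatial interaction matrix W
moran_I <- function(x, W) {
  z <- x - mean(x)
  (length(x) / sum(W)) * sum(W * outer(z, z)) / sum(z^2)
}

# Permutation test: shuffle x over areas and compare with the observed statistic
moran_perm <- function(x, W, nsim = 999) {
  obs  <- moran_I(x, W)
  sims <- replicate(nsim, moran_I(sample(x), W))
  # one-sided p-value: small values suggest positive spatial correlation
  list(statistic = obs, p.value = (sum(sims >= obs) + 1) / (nsim + 1))
}

set.seed(4)
coords <- expand.grid(row = 1:5, col = 1:5)
W <- 1 * (as.matrix(dist(coords)) == 1)    # binary adjacency
trend_resid <- coords$row + coords$col     # strongly patterned "residuals"
noise_resid <- rnorm(nrow(coords))         # unstructured "residuals"

res_trend <- moran_perm(trend_resid, W)
res_noise <- moran_perm(noise_resid, W)
c(trend = res_trend$p.value, noise = res_noise$p.value)
```

With this one-sided definition, p-values near 1 arise when the observed statistic lies in the lower tail of the permutation distribution, which corresponds to negative residual correlation.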
These models are also estimated by R-INLA, with the default log-gamma priors on
random effect precisions. The total random effects ti = ui + si under the BYM model are
very similar to those from the rstan application, with a correlation of 0.99 between the
two sets of posterior means. However, possibly reflecting sensitivity to priors on scale
parameters, spatial effects are smaller under R-INLA, and unstructured effects larger.
The BYM2 model estimated using R-INLA produces a lower DIC than the BYM model.
The proportion ρ of total residual variation due to spatial effects is estimated with mean
(95% CRI) of 0.69 (0.30,0.95), as against 0.82 (0.32,1.00) under rstan. The spatial effects
under the two estimations are highly correlated.
$$p(s_1, \ldots, s_n) \propto \frac{1}{\delta^{\,n-1}}\exp\left(-0.5\,\frac{1}{\delta}\sum_{j \neq i}|s_i - s_j|\right),$$
and has its posterior mode at the median rather than mean of the neighbouring sj. One
might also apply Student t versions of the ICAR(ρ) which, if applied using scale mixtures,
give a natural measure of outlier status. Thus, for a Student t with ν degrees of freedom,
$$s_i \mid s_{[i]} \sim N\left(\rho A_i,\ \frac{\delta}{\gamma_i d_i}\right)$$
where ρi ∼ Be(c, c), with c known, s1i is an ICAR error, but s2i follows a spatial Laplace prior.
Following Congdon (2007), analogous mixture forms can be applied to the errors in the
convolution model itself, giving more emphasis to the unstructured term ui in outlier areas:
$$\log(\lambda_i) = \alpha + \rho_i s_i + (1 - \rho_i)u_i.$$
This type of representation may be useful for modelling edge effects, with the u effects
taking a greater role on the peripheral areas where neighbours are fewer. Another pos-
sibility is a discrete mixture in a “spatial switching” model (Congdon, 2007), allowing an
unstructured term only for areas where the pure spatial effects model is inappropriate.
Thus, for a count response,
$$y_i \sim \text{Po}(E_i\lambda_{J_i,i})$$
$$J_i \sim \text{Categoric}(\pi_1, \pi_2)$$
$$(\pi_1, \pi_2) \sim \text{Dirichlet}(\xi_1, \xi_2)$$
$$\log(\lambda_{1i}) = \alpha + s_i$$
$$\log(\lambda_{2i}) = \alpha + s_i + u_i$$
where the ξj are extra unknowns, and the si ∼ ICAR(1). The posterior estimates for the ξj
provide overall weights of evidence in favour of a pure spatial model as compared to a
convolution model, while high posterior probabilities Pr(Ji = 2|y) for particular areas indicate
that pure spatial smoothing is inappropriate for them.
Fernandez and Green (2002) use a discrete mixture model generated via mixing over
several spatial priors. Thus, for count data, assume K possible components with area-spe-
cific probabilities πik on each component
$$y_i \sim \sum_{k=1}^{K}\pi_{ik}\,\text{Po}(E_i\lambda_{ik})$$
where log(λik) = αk for a model without predictors. Then K sets of underlying spatial effects
{sik} are generated from separate conditional spatial priors, and used to estimate area-spe-
cific mixture weights
where χ > 0. As χ tends to 0, the πik tend to 1/K without spatial patterning, whereas large χ
reduce over-shrinkage.
Another discrete mixture model for robust spatial dependence modelling uses the Potts
prior (Green and Richardson, 2002). Thus let Ji ∈ {1, …, K} be unknown allocation indicators
with yi ∼ Po(Ei μJi) where {μ1, …, μK} are distinct cluster means. Also let δik = 1 if Ji = k. Then
the joint prior for the allocation indicators incorporates spatial dependence with
$$\Pr(J_i = k) = \frac{\exp\left[\omega\sum_{j \sim i} I(\delta_{ik} = \delta_{jk})\right]}{\sum_{h=1}^{K}\exp\left[\omega\sum_{j \sim i} I(\delta_{ih} = \delta_{jh})\right]}$$
where ω > 0 multiplies the number of same label neighbour pairs, so that lower values of
ω indicate lesser spatial dependence. So pooling towards the local neighbourhood average
will tend not to occur if an area’s latent risk is discrepant with those of its neighbours.
Richardson et al. (2004) compare this model with the convolution model under various
simulated scenarios for differentiated spatial risks. Additional effects can be included by
multiplying the μJi. For example, a spatially unstructured multiplicative effect could be
modelled as νi ∼ Ga(bν, bν), or a log-normal prior assumed with νi = exp(ui), and ui ∼ N(0, δu).
Then yi ∼ Po(Ei μJi νi).
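The allocation probabilities under a Potts prior can be sketched as follows. The code takes the usual full-conditional reading, in which the exponent for label k counts the neighbours of area i currently allocated to k; that reading, the function name `potts_cond`, and the example labels are assumptions for illustration.

```r
# Conditional allocation probabilities for area i given its neighbours' labels
potts_cond <- function(i, J, W, K, omega) {
  nb   <- which(W[i, ] == 1)                              # neighbours of area i
  same <- sapply(seq_len(K), function(k) sum(J[nb] == k)) # neighbours with label k
  w    <- exp(omega * same)
  w / sum(w)
}

coords <- expand.grid(row = 1:3, col = 1:3)
W <- 1 * (as.matrix(dist(coords)) == 1)   # binary adjacency on a 3 x 3 lattice
J <- c(1, 1, 1, 1, 2, 1, 2, 2, 2)         # hypothetical current labels, K = 2

# Centre cell (area 5): three neighbours carry label 1, one carries label 2
p_centre <- potts_cond(5, J, W, K = 2, omega = 1)
p_flat   <- potts_cond(5, J, W, K = 2, omega = 0)
rbind(omega_1 = p_centre, omega_0 = p_flat)
```

With ω = 0 the labels are equally likely regardless of the neighbours, while larger ω concentrates probability on the label that dominates the neighbourhood.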
Assumptions such as normality in the spatial effects can be avoided by adapting
the Dirichlet process stick-breaking prior of Sethuraman (1994) to spatial settings. The
stick-breaking prior specifies an unknown distribution G by a mixture
$$G = \sum_{m=1}^{M} p_m\,\delta(\rho_m)$$
where M may in principle be infinite, but in practical computing is taken as finite, the mixing
probabilities satisfy Σm=1..M pm = 1, and δ(ρm) has a point mass at ρm which may be scalar or
vector values for areas (e.g. relative risks) or at grid locations. For example, the ρm may be
drawn from a baseline borrowing-strength prior G0 such as a stationary Gaussian process
in the case of continuous point-referenced spatial data y(gi) at sites gi. One may incorpo-
rate spatial information into either the ρm, as in Gelfand et al. (2005b), or into the mixture
probabilities pm, as in Griffin and Steel (2006). Such formulations are typically for point-
referenced data, and allow for nonstationarity and non-Gaussian features in the response
when the stationary Gaussian process is not appropriate (Duan et al., 2007).
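A truncated stick-breaking construction can be sketched in a few lines. The truncation level M, the concentration parameter `alpha`, and the standard normal baseline G0 used below are assumptions for illustration; in a spatial application the atoms would instead be draws from a spatial process.

```r
set.seed(3)
M     <- 20       # truncation level (assumed)
alpha <- 1        # concentration parameter (assumed)

# Stick-breaking fractions; the last is set to 1 so the weights sum to one
V <- rbeta(M, 1, alpha)
V[M] <- 1

# p_m = V_m * prod_{l < m} (1 - V_l)
p <- V * cumprod(c(1, 1 - V[-M]))

# Atoms drawn from a hypothetical baseline G0 = N(0, 1)
rho_m <- rnorm(M)

# Draws from G take the value rho_m with probability p_m
draws <- sample(rho_m, size = 1000, replace = TRUE, prob = p)
sum(p)
```

Smaller values of `alpha` place most of the mass on the first few atoms, so that the realised G is far from the baseline, whereas larger values spread the mass more evenly.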
where γ = (γ1, γ2) are regression parameters. One would expect lower λi for areas dissimilar
from their neighbours on the risk factor; that is, γ2 is anticipated to be negative.
Here the discrepancy measure is based on the index zi of socioeconomic deprivation,
whereby dissimilarity may be represented as
$$D_i = z_i - Z_i$$
with Zi being the average deprivation level in the locality Li around area i, namely
$$Z_i = \sum_{j \in L_i} z_j/d_i.$$
Estimation of the original Leroux et al. (1999) model using R2OpenBUGS provides a
posterior mean for the global λ of 0.86, with a LOO-IC of 1278 and WAIC (widely appli-
cable information criterion) of 1185. Estimation using CARBayes provides a slightly
higher estimate of λ, namely 0.93, but a higher WAIC of 1190.
Improved fit is provided by the adaptive Leroux model, with the LOO-IC and
WAIC respectively at 1267 and 1178. The coefficient γ2 has mean (95% CRI) of −0.54
(−0.83,−0.30). In contrast to the estimated global λ of 0.88, there are eight local λi under
0.5, with the minimum being for area 133 (the City of London) with posterior mean
λ133 = 0.002. This area has an illness rate (illness total divided by population, as percent-
age) of 14.5%, as compared to the rate in its locality (surrounding adjacent wards) of
38.4%. Its deprivation index is 16.4, compared to the locality average of 43.9. Figure 6.2
maps out the local λi.
FIGURE 6.2
Local Leroux dependence parameters.
The first analysis uses the nimble package in R, and estimates the BYM model. This
provides a LOO-IC of 3905, with maximum and minimum posterior mean relative risks
of 1.40 and 0.82. The largest casewise LOO-IC values are for areas (such as 263, 573, and 512)
which have high yi counts in relation to expected suicides. Some 10% of the total LOO-IC is
due to the 5% worst fitting cases. Incidentally, the estimated proportion of variation due
to spatial dependence is relatively low, namely 0.23 (95% CRI from 0.07 to 0.44).
This feature is reproduced in an estimation of the model incorporating a proper CAR
spatial effect. This is implemented via a sparse precision matrix method in rstan, and
draws on Joseph (2016). The resulting estimate for ρ in (6.7) is 0.28 (with 95% CRI from
0.19 to 0.69). The LOO-IC is 3902, with maximum and minimum posterior mean relative
risks of 1.41 and 0.76. Mixed predictive exceedance checks are included, based on repli-
cate samples of the random spatial effects, and obtained as
These show over-prediction (high pi,mix) in a relatively high proportion of cases, with
high predicted yi deaths in relation to actual deaths.
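A hedged sketch of how such an exceedance probability might be computed from MCMC output follows. It assumes a matrix `yrep` of replicate counts (one row per retained iteration, one column per area) has already been generated using freshly sampled spatial effects, and it uses a mid-p form for discrete data; both the function name `mixed_pvalue` and the mid-p convention are assumptions rather than the formula used in the text.

```r
# Mixed predictive exceedance probabilities from replicate counts.
# yrep: S x n matrix of replicates; y: length-n vector of observed counts.
mixed_pvalue <- function(yrep, y) {
  yobs <- matrix(y, nrow(yrep), length(y), byrow = TRUE)
  # Pr(yrep > y) + 0.5 * Pr(yrep = y), estimated over the S iterations
  colMeans(yrep > yobs) + 0.5 * colMeans(yrep == yobs)
}

# Toy check with four replicates for two areas
yrep_toy <- matrix(c(1, 3, 5, 5,
                     2, 2, 2, 2), nrow = 4, ncol = 2)
p_mix <- mixed_pvalue(yrep_toy, y = c(3, 2))
```

Values near one indicate that replicates tend to exceed the observed count (over-prediction), while values near zero indicate under-prediction.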
An alternative to the BYM and proper CAR priors is the Potts prior. This is applied
with an exponential E(1) prior on ω, and with an ordering constraint on the latent cluster
means, so $\mu_1 \le \mu_2 \le \dots \le \mu_K$, where K is set at 10. Since there is evidence of unstructured
heterogeneity, the scheme is modified to include unstructured area effects, namely
$$y_i \sim Po(E_i \mu_{S_i} \nu_i),$$
$$\nu_i = \exp(u_i),$$
$$u_i \sim N(0, \delta_u).$$
For the ordered μk, relatively informative gamma $Ga(a_k, 5)$ priors are assumed, with
$a = (1, 2, 3, \dots, 10)$, so reflecting the typical range of area relative risks for such health
outcomes. A two-chain run of 10,000 iterations provides a mean scaled deviance
$2\sum_i \{y_i \log(y_i/(E_i \lambda_i)) - (y_i - E_i \lambda_i)\}$ of 1034, close to the number of observed areas. The
posterior mean (95% CI) of ω is 0.30 (0.01,0.85), with the K = 10 latent cluster means ranging
from μ1 = 0.43 to μK = 1.41. Maximum and minimum relative risks are estimated as 1.32
and 0.65 respectively. The LOO-IC is 3907, with the maximum casewise LOO-IC again
being for areas with high yi counts in relation to expected events.
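The scaled Poisson deviance quoted above is straightforward to evaluate at any set of fitted means. The sketch below uses invented counts, expected values and relative risks, and the helper name `scaled_deviance` is hypothetical; in an MCMC setting the function would be applied at each iteration and the results averaged.

```r
# Scaled Poisson deviance: 2 * sum{ y log(y / mu) - (y - mu) }, with mu = E * lambda.
scaled_deviance <- function(y, mu) {
  # y * log(y / mu) is set to its limiting value 0 when y = 0
  term <- ifelse(y > 0, y * log(y / mu), 0)
  2 * sum(term - (y - mu))
}

# Invented example values
y      <- c(0, 3, 7, 2)
E      <- c(1.5, 2.0, 5.0, 2.5)
lambda <- c(0.8, 1.2, 1.1, 1.0)
dev    <- scaled_deviance(y, E * lambda)
```

For a well-fitting Poisson model the deviance is expected to be of the same order as the number of observations, which is the comparison made in the text.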
Finally, the spatial median model is an adaptation of the approach in Congdon (2017),
implementing the asymmetric Laplace prior version of quantile regression at the second
stage of a hierarchical Poisson log-normal representation. Thus for quantiles α = 1, …, A,
define $\xi_\alpha = (1 - 2\alpha)/[\alpha(1 - \alpha)]$, and define scale factors $W_{\alpha i} \sim \text{Exp}(\delta_\alpha)$ which inflate the
variances of discrepant observations, and downweight their influence on the likelihood. In
the absence of predictors, one has
$$Y_i \sim Poi(\mu_{\alpha i}),$$
$$\mu_{\alpha i} = E_i \exp(\nu_{\alpha i}),$$
$$\nu_{\alpha i} \sim N\left(\beta_{0\alpha} + s_{\alpha i} + \xi_\alpha W_{\alpha i}, \; \frac{2 W_{\alpha i} \delta_\alpha}{\alpha(1 - \alpha)}\right),$$
$$W_{\alpha i} \sim \text{Exp}(\delta_\alpha).$$
Here median regression (α = 0.5) only is considered, with a gamma Ga(1,0.001) prior on
δ0.5. This model has a LOO-IC of 3895, improving on the Potts, BYM, and proper CAR
priors. Poorly fitted areas are similar, whether identified by casewise LOO-IC, or
by the residual type measure $(\nu_i - \beta_0 - s_i)/(8 W_i \delta)^{0.5}$.
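The form of the residual measure follows from setting α = 0.5 in the definitions of the skewness term and the variance. A short worked check, using only the quantities defined in the model statement:

```latex
\xi_{0.5} = \frac{1 - 2(0.5)}{0.5\,(1 - 0.5)} = 0,
\qquad
\frac{2 W_i \delta}{0.5\,(1 - 0.5)} = 8 W_i \delta,
\qquad\text{so}\qquad
\nu_i \sim N\!\left(\beta_0 + s_i,\; 8 W_i \delta\right).
```

The residual is therefore the deviation of the log relative risk from its prior mean, divided by its prior standard deviation.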
Compared to the Potts prior, more extreme elevated relative risks are identified under the
spatial median model, the highest posterior mean relative risk $\rho_i = \exp(\beta_0 + s_i)$ being
1.50 (though the second and third ranking posterior mean ρi are 1.41 and 1.31). The Potts
prior is distinctive in its broader spread of estimated relative risk, including a longer tail
of low estimated relative risk, with 211 of the 983 areas having posterior mean ρi under
0.9. Figure 6.3 compares posterior mean relative risks under the Potts and BYM priors,
and Figure 6.4 compares the Potts and spatial median relative risks.
FIGURE 6.3
Posterior mean relative risks, Potts vs BYM. [Figure: frequency distributions of posterior mean relative risks under the Potts and BYM priors.]
FIGURE 6.4
Posterior mean relative risks, Potts vs spatial median. [Figure: frequency distributions of posterior mean relative risks under the Potts and spatial median priors.]
isotropic if Σ(d) depends only on the distance between gi and gj, and not on other features
such as the direction from gi to gj or the coordinates of the gi. The process is intrinsically
stationary if $E[y(g + d) - y(g)] = 0$, namely has a constant mean, and if the variance
depends only on the lag, not on the point locations, namely
$$E[y(g + d) - y(g)]^2 = V[y(g + d) - y(g)] = 2\gamma(d),$$
where γ(d) is the semivariogram (Waller and Gotway, 2004, p.274). The covariance Σ(d)
and the semivariogram are related via $\gamma(d) = \Sigma(0) - \Sigma(d)$ since
$$2\gamma(d) = V[y(g + d) - y(g)]$$
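The link between the semivariogram and the covariance can be written out by expanding the variance of the difference. The steps below assume second-order stationarity, so that V[y(g)] = Σ(0) at every location and the covariance between y(g + d) and y(g) is Σ(d):

```latex
\begin{aligned}
2\gamma(d) &= V[y(g+d) - y(g)] \\
           &= V[y(g+d)] + V[y(g)] - 2\,\mathrm{Cov}[y(g+d),\, y(g)] \\
           &= 2\Sigma(0) - 2\Sigma(d),
\end{aligned}
\qquad\text{hence}\qquad
\gamma(d) = \Sigma(0) - \Sigma(d).
```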
$$y(g) = \beta_0 + T(g)\beta + s(g) + u(g), \tag{6.8}$$
$$u(g) \sim N_n(0, \sigma_u^2),$$
where θ are parameters defining the spatial correlation function $C(g, \theta) = [c_{ij}(g_i, g_j; \theta)]$, such
as spatial decay and smoothness parameters.
Splines can also be used to model point pattern data, typically with geographic coor-
dinates as predictors. The trend-surface is represented as a two-dimensional spline in
the geographic coordinates. Trend-surface models do not explicitly represent local spatial
dependence, but rather account for trends in the data across longer geographical distances
(Dormann et al., 2007). However, smooth spatial variation does not characterise all appli-
cations, requiring specialised techniques (Sangalli et al., 2013; Wood et al., 2008). Widely
applied spline trend regression options include cubic splines and thin plate splines (Mitas
and Mitasova, 1999; Bowman and Woods, 2016; Yang et al., 2016). Lang and Brezger (2004)
propose tensor products of equally spaced B-spline basis functions combined with sym-
metric priors on the B-spline coefficients, while Wood (2006) develops low rank smooths
from tensor products of any set of bases with quadratic penalties. Such smooths are invari-
ant to rescaling of the predictors. In the R mgcv package, the jagam function develops JAGS
code with multivariate normal priors on the smooth coefficients (Wood, 2016). The prior
precision matrix incorporates the smoothing parameters and smoothing penalty matrices.
To avoid a smoothing penalty not corresponding to a full rank precision matrix (and hence
an improper prior), null space penalties, as in Marra and Wood (2012), are added to the
usual penalties. The smooths are centred to improve identifiability.
6.6.1 Covariance Functions
Defining dij as a distance measure between points gi and gj, there are several common
isotropic schemes with C(dij), and hence γ(dij), parameterised to reflect anticipated distance
decay in the correlation between points (e.g. Grunwald, 2005). For example, the exponential
distance model has
$$C(d_{ij}) = \exp(-\phi d_{ij}),$$
with range parameter ϕ > 0, and larger values of ϕ leading to more pronounced distance
decay. Note that different parameterisations of the exponential are used in different
packages (e.g. in spBayes and spNNGP as opposed to gstat). The covariance function for
$(s_1, \dots, s_n)$ is then
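As dij tends to infinity, the semivariogram tends to an upper limit of $\sigma_u^2 + \sigma_s^2$, known as the
sill. The powered exponential variant (Diggle and Ribeiro, 2007) has
$$C(d_{ij}) = \exp[-(\phi d_{ij})^{\kappa}],$$
while the spherical model has
$$C(d_{ij}) = 1 - \frac{3 d_{ij}}{2\delta} + \frac{d_{ij}^3}{2\delta^3},$$
for $d_{ij} < \delta$, whereas $C(d_{ij}) = 0$ for $d_{ij} \ge \delta$. Hence the spherical function has covariance
$$\Sigma(d_{ij}) = \sigma_u^2 I(i = j) + \sigma_s^2 \left[1 - \frac{3 d_{ij}}{2\delta} + \frac{d_{ij}^3}{2\delta^3}\right] I(d_{ij} < \delta),$$
and semivariogram
$$\gamma(d_{ij}) = \sigma_u^2 + \sigma_s^2 \left[\frac{3 d_{ij}}{2\delta} - \frac{d_{ij}^3}{2\delta^3}\right] \quad \text{for } d_{ij} < \delta.$$
The Matérn covariance function is
$$C(d_{ij}) = \frac{\sigma_s^2}{\Gamma(\nu) 2^{\nu - 1}} (\kappa d_{ij})^{\nu} K_{\nu}(\kappa d_{ij}),$$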
where Kν(u) is a modified Bessel function of order ν. The parameter ν controls the smoothness
of the process, while κ is a scaling parameter. Together they define the range $r = (8\nu)^{0.5}/\kappa$
at which the covariance is diminished to low levels (close to 0.1). INLA parameterises the
Matérn in terms of a parameter α = ν + 1, with α = 2 as the default setting (Lindgren and Rue,
2015). Paciorek and Schervish (2006) use kernel convolution (Section 6.6) to develop
nonstationary covariance functions, including a nonstationary version of the Matérn covariance.
Implementation of this method in R is described in Risser and Calder (2017).
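The correlation functions in this section can be coded directly in base R, which is a convenient way to compare their decay and to check parameterisations before passing values to a package. The sketch below is illustrative only: the function names are hypothetical, the exponential form is taken as the κ = 1 case of the powered exponential, and the parameter values are arbitrary.

```r
# Isotropic correlation functions of distance d (all return 1 at d = 0)
exp_cor  <- function(d, phi) exp(-phi * d)
pexp_cor <- function(d, phi, kappa) exp(-(phi * d)^kappa)   # 0 < kappa <= 2
sph_cor  <- function(d, delta)
  ifelse(d < delta, 1 - 1.5 * d / delta + 0.5 * (d / delta)^3, 0)
matern_cor <- function(d, kappa, nu) {
  out <- (kappa * d)^nu * besselK(kappa * d, nu) / (gamma(nu) * 2^(nu - 1))
  out[d == 0] <- 1        # limiting value at zero distance
  out
}

# Covariance and semivariogram under the spherical model with a nugget
d        <- seq(0, 5, by = 0.5)
sigma2_s <- 1      # spatial (partial sill) variance
sigma2_u <- 0.2    # unstructured (nugget) variance
Sigma    <- sigma2_u * (d == 0) + sigma2_s * sph_cor(d, delta = 3)
gamma_d  <- (sigma2_u + sigma2_s) - Sigma   # gamma(d) = Sigma(0) - Sigma(d)
```

With ν = 0.5 the Matérn function reduces to the exponential with decay κ, which gives a simple numerical check on the implementation, and the semivariogram reaches the sill of 1.2 once d exceeds δ = 3.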
Prediction at new locations is a major aspect of geostatistical modelling. Suppose continuous
observations $y = (y_1, \dots, y_n) = (y_1(g_1), \dots, y_n(g_n))$ are made at locations $g = (g_1, \dots, g_n)$,
and that predictions $y_0 = (y_{01}, \dots, y_{0k})$ are required at k new locations $g_0 = (g_{01}, \dots, g_{0k})$.
These are based on the posterior predictive density
$$p(y_0 \mid y) = \int p(y_0, \xi \mid y)\, d\xi = \int p(y_0 \mid y, \xi)\, p(\xi \mid y)\, d\xi,$$
where ξ is the vector of parameters involved in the model for y, namely those defining
its mean, and the covariance parameters for spatial and unstructured errors (Banerjee
et al., 2014). For example, Diggle et al. (2003) consider a model $y(g) = \mu + s(g) + u(g)$ with
$u(g) \sim N_n(0, \sigma_u^2)$, and spatial error process
$$s(g) \sim N_n(0, \sigma_s^2 C),$$
where prediction is required at a single new location g0. With d0 denoting an n × 1 vector of
distances between g0 and g = (g1, …, gn), and with $Q = \sigma_u^2 I + \sigma_s^2 C$, one has
For k > 1, univariate predictions may be obtained separately at each new site g01 , g02 , … , g0 k ,
though multivariate predictions may be more precise.
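For fixed parameter values, prediction at a single new site follows from the usual conditional distribution of a multivariate normal. The base R sketch below applies that standard result under an assumed exponential correlation function; it is not a reconstruction of the expression in the text, and the function name `krige_one` and all numerical values are hypothetical. In a full Bayesian analysis the calculation would be repeated over posterior draws of the parameters.

```r
# Conditional mean and variance of y(g0) given y, for known parameters.
# g: n x 2 matrix of observed locations; g0: length-2 new location.
krige_one <- function(y, g, g0, mu, sigma2_s, sigma2_u, phi) {
  C  <- exp(-phi * as.matrix(dist(g)))             # n x n spatial correlation matrix
  d0 <- sqrt(rowSums(sweep(g, 2, g0)^2))           # distances between g0 and each g_i
  c0 <- sigma2_s * exp(-phi * d0)                  # covariance of s(g0) with s(g)
  Q  <- sigma2_u * diag(length(y)) + sigma2_s * C  # marginal covariance of y
  w  <- solve(Q, c0)                               # kriging weights Q^{-1} c0
  list(mean = mu + sum(w * (y - mu)),
       var  = sigma2_u + sigma2_s - sum(w * c0))
}

set.seed(2)
g    <- cbind(runif(30), runif(30))   # simulated locations in the unit square
y    <- rnorm(30, mean = 5, sd = 1)   # placeholder observations
pred <- krige_one(y, g, g0 = c(0.5, 0.5), mu = 5,
                  sigma2_s = 1, sigma2_u = 0.1, phi = 3)
```

At a site far from all observations the weights vanish, so the prediction reverts to the mean μ with variance equal to the sill.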
$$(\kappa^2 - \Delta)^{\alpha/2} (\tau y(g)) = W(g)$$
where Δ is the Laplacian, W(g) is a white noise process, α and κ are as above, and τ controls
the marginal variance $\sigma_s^2$.
The GMRF approximation involves a triangulation (with m nodes) of the spatial domain,
and the density of the triangulation mesh determines how close the approximation is.
However, increasing the mesh density also increases the computations involved. A projec-
tor matrix A of dimension n × m, containing 0 or 1 entries, is used to link the original points
to the mesh (Lindgren, 2012; Bakka et al., 2018). Unlike stationary covariance models, it is
straightforward to allow nonstationarity in SPDE models.
Computational burden is also reduced by using a low-rank representation of the spa-
tial field (e.g. Finley et al., 2009; Finley et al., 2015). This involves defining a set of knots
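$g^* = \{g_1^*, g_2^*, \dots, g_r^*\}$ where $r \ll n$ is considerably less than the dimension of the actual data.
Then denoting $s^* = \{s(g_1^*), s(g_2^*), \dots, s(g_r^*)\}$ and distances between the knots as d* one has
$$\tilde{s}(g) = c(g; \theta)' C^{*-1}(\theta)\, s^*,$$
where $C^*(\theta)$ is the r × r correlation matrix between the knots, and c(g;θ) is an r × 1 vector with ith element $[c(g, g_i^*; \theta)]$.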
For a Gaussian outcome, and spatially referenced predictors X(g), a predictive process
model is then defined as
$$y(g) = X(g)'\beta + \tilde{s}(g) + u(g).$$
For a non-Gaussian response, the predictive process is included in the link regression,
such as for y binary with probability π(gi),
$$\text{logit}[\pi(g_i)] = X(g_i)'\beta + \tilde{s}(g_i).$$
Estimation of predictive process models for large n is further facilitated (Eidsvik et al.,
2012) by using the latent approximation approach of Rue et al. (2009).
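The low-rank idea can be illustrated by interpolating a process defined at a small set of knots to an arbitrary location, using the standard predictive process projection of the knot values. In the sketch below the exponential correlation, the knot layout and the names `pred_process`, `knots`, `Cstar` and `sstar` are all assumptions made for the example.

```r
set.seed(3)
phi   <- 2
knots <- cbind(runif(8), runif(8))               # r = 8 knot locations g*
Cstar <- exp(-phi * as.matrix(dist(knots)))      # r x r correlation between knots
sstar <- as.vector(t(chol(Cstar)) %*% rnorm(8))  # one draw of s* ~ N(0, C*)

# Predictive process value at location g: c(g)' C*^{-1} s*
pred_process <- function(g, knots, Cstar, sstar, phi) {
  d  <- sqrt(rowSums(sweep(knots, 2, g)^2))      # distances from g to each knot
  cg <- exp(-phi * d)                            # c(g; theta), an r x 1 vector
  sum(cg * solve(Cstar, sstar))
}

s_tilde <- pred_process(c(0.4, 0.6), knots, Cstar, sstar, phi)
```

Only an r × r matrix is ever inverted, which is the source of the computational saving when r is much smaller than n. At a knot location the projection reproduces the knot value exactly.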
Under the nearest neighbour Gaussian process (NNGP) approach (Datta et al., 2016;
Zhang et al., 2018), a sparse precision matrix of the joint density p[s(g)] of the spatial process
s(g) is achieved by using neighbour sets N(gi). Following Vecchia (1988), the sets N(gi) can
be specified as the m nearest neighbours of the point gi. These sets are used to provide an
approximate conditional specification of the joint density of the spatial process p[s(R)] for a
set of k reference locations R (that can be taken as the n observed locations). This approach
is incorporated in the R package spNNGP. The approximation to the joint density is
provided by the conditional density representation
$$\tilde{p}[s(R)] = \prod_{i=1}^{k} p[s(g_i) \mid s(N(g_i))].$$
Different model formulations can be specified according to whether estimated spatial ran-
dom effects are of interest, or simply regression and other hyperparameters, with the spa-
tial effects then integrated out. These are denoted as the sequential and response options
in the R package spNNGP. Thus, under the sequential model, and for hyperparameters
$\xi = (\beta, \sigma_s^2, \sigma_u^2, \theta)$, the posterior density is
where $C^{-1} = (I - A)^T D^{-1} (I - A)$ is the precision matrix for s(g), A is a sparse lower triangular
matrix with at most m non-zero elements in each row, and D is diagonal. The construction
of these matrices is set out in Finley et al. (2017).
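The structure of A and D can be seen in a small dense example. The sketch below builds both from a given covariance matrix by conditioning each point on at most m nearest previously ordered points; it is a didactic version with dense matrices and a hypothetical function name `nngp_factors`, not the sparse implementation used in spNNGP.

```r
# Build the NNGP factors A (strictly lower triangular) and D (diagonal, stored
# as a vector) from a covariance matrix C, with neighbour sets taken as the
# m nearest points among those earlier in the ordering.
nngp_factors <- function(C, coords, m) {
  n <- nrow(C)
  A <- matrix(0, n, n)
  D <- numeric(n)
  D[1] <- C[1, 1]
  for (i in 2:n) {
    prev <- seq_len(i - 1)
    dist_prev <- sqrt(rowSums(sweep(coords[prev, , drop = FALSE], 2, coords[i, ])^2))
    N <- prev[order(dist_prev)][seq_len(min(m, i - 1))]   # neighbour set N(g_i)
    b <- solve(C[N, N, drop = FALSE], C[N, i])            # conditional regression weights
    A[i, N] <- b
    D[i] <- C[i, i] - sum(C[i, N] * b)                    # conditional variance
  }
  list(A = A, D = D)
}

set.seed(4)
coords <- cbind(runif(6), runif(6))
C      <- exp(-2 * as.matrix(dist(coords)))   # assumed exponential covariance
f      <- nngp_factors(C, coords, m = 2)
I6     <- diag(6)
prec   <- t(I6 - f$A) %*% diag(1 / f$D) %*% (I6 - f$A)   # approximate precision matrix
```

When m is set to n − 1 every point conditions on all earlier points, and the construction recovers the exact inverse of C; smaller m trades accuracy for sparsity.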
with
$$\text{logit}(\pi_i) = \beta_0 + \beta_1 x_i + \eta_i,$$
and with the multivariate normal prior on the spatial effects encompassing all 92
units in the analysis. A Cholesky decomposition is used to represent the multivariate
normal covariance. The prevalence predictions for the 11 practices with missing preva-
lence data are obtained as generated quantities under an inverse logit transform.
A second analysis uses R2OpenBUGS and a powered exponential distance model for
the spatial effects si, namely
with ϕ > 0, κ ∈ (0, 2], and with univariate predictions $(s_{01}, \dots, s_{0k})$ for the 11 new points.
The coding in R2OpenBUGS is hierarchically centred (Thomas et al., 2014). A Ga(1,0.001)
prior is adopted on $1/\sigma_s^2$.
The two models provide similar LOO-IC, respectively 691.4 and 691.7. The posterior
mean (95% CRI) for ϕ under the simple exponential decay option is obtained as 0.92
(0.06, 1.25), with the posterior 95% interval for β1 mostly positive, so that the deprivation
score improves the prediction of missing prevalence rates. The latter range from 1.9%
to 2.3%, with a 0.95 correlation between the estimated missing prevalence rates from
the two models.
changing max.edge and cutoff in the inla.mesh.2d command. Here we initially select a
relatively coarse grid, setting k = 0.1 and define the mesh using
mesh=inla.mesh.2d(coordinates,max.edge=c(1/k,2/k),cutoff=0.1/k).
There are no explanatory variables, so predictions are based only on the estimated
spatial effects at the grid nodes.
With this relatively coarse grid, a correlation of 0.46 is obtained between actual and
predicted magnitudes. Setting k = 1 as opposed to k = 0.1 produces a denser grid with
around 24 times as many nodes (20,603 as against 866), and so is more computation-
ally intensive. However, the correlation between actual and predicted magnitudes is
increased to 0.54.
A second analysis is based on the spBayes package, and uses a 10% sample of the full
data. The data involve repeated observations at the same locations which may cause
numerical problems. Therefore, the actual locations are randomly jittered to avoid
repeat locations. A further 10% subsample of the coordinates (i.e. of 294 coordinates) is
used to provide a set of knots. As an illustration of a particular covariance option, con-
sider an exponential decay function, which using the notation in the package, assumes
a covariance model
$$\sigma^2 \exp(-\delta d_{ij}) + \tau^2,$$
where σ2 is the partial sill. To provide initial values for σ2, τ2 and the decay parameter δ,
the variogram and fit.variogram options in gstat are used. This provides an estimated
range of ϕ = 3.2, and hence an initial value for the decay parameter in spBayes of 0.31 (the
spBayes parameterisation of the exponential uses a decay parameter δ = 1/ϕ). Tuning
values for the Metropolis sampler are chosen to produce an acceptance rate of around
30%. With an MCMC sample of 2,000 iterations and burn-in of 1,000, a correlation of
0.485 is obtained between actual and predicted magnitudes. Posterior means (and sd)
for σ2, τ2 and δ are 0.13 (0.02), 0.31 (0.01) and 0.38 (0.07).
The gstat commands
v = variogram(Y~1, D)
fit.variogram(v, vgm(c("Exp", "Mat", "Sph","Gau")))
suggest a spherical model as better fitting, but a lower correlation (of 0.474) between
actual and fitted values is obtained under this option. The GPD criterion of Gelfand and
Ghosh (1998) also prefers the exponential model.
Finally, again using the full dataset, but with jittering to avoid repeat locations, the nearest
neighbour Gaussian Process approach is applied using spNNGP. An exponential covariance
and m = 10 neighbours are assumed. This provides a correlation between actual and
predicted magnitudes of 0.51. Increasing the number of neighbours m from 10 to 15 makes
no difference to the fit. Figure 6.5 shows the predicted magnitude surface. For m = 15, the
estimated posterior means (and sd) for σ2, τ2 and δ are 0.08 (0.05), 0.32 (0.03), and 0.33 (0.13).
FIGURE 6.5
Magnitude predictions from NNGP. [Figure: map of predicted magnitudes by longitude and latitude.]
representation, based on the Gaussian process, but one that adapts to spatial nonstation-
arity and anisotropy, is the process convolution approach (Higdon, 1998; Lee et al., 2005;
Higdon, 2007; Liang and Lee, 2014). This involves convolving a continuous white noise
process w(g) with a symmetric smoothing kernel K(g), with the spatial effect obtained as
$$s(g) = \int_G K(g - u)\, w(u)\, du,$$
where G is the region of interest. The spatial process might be combined with fixed effect
regression impacts and with appropriate regression links for non-normal observations.
For example, if y(g) were binary, such as species presence or absence at site g (Gelfand et al.,
2005a), then $y(g) \sim \text{Bern}(\pi(g))$ and
$$\text{logit}[\pi(g)] = \beta_0 + s(g).$$
$$s(g_i) = \sum_{j=1}^{m} K(g_i - t_j) w_j,$$
where for large m, the wj can be taken as a collection of random effects (Higdon, 2007,
p.245). Lee et al. (2005) consider options for representing the kernel, possibly by a form
with known variance (e.g. a standard normal), and consequent ways for modelling the
wj. Note that if both the K function and w series have unknown variances, then there is
potential non-identifiability. Options for the wj include exchangeable effects or low order
random walks, with unknown precision τw. Assuming K is a normal kernel, by varying τw
one can mimic the effect of the range parameter in a conventional Gaussian process model
with a Gaussian variogram.
For example, Lee et al. (2005) consider n = 12 observations yi in G1 at equally spaced loca-
tions gi between 0 and 10. These are generated according to a Gaussian process s(g) with
mean 0 and covariance matrix
where dij relates to distances between points gi and gj on the line. A white noise error
ui with standard deviation 0.2 is also used to define yi, so that yi = s( gi ) + ui . They then
fit a discrete convolution model to the yi so generated, using a grid with m = 20 points tj
equally spaced between −2 and 12. They assume the wj follow a 1st order random walk,
and assume the kernel is a normal density with standard deviation 0.6.
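The discrete convolution set-up in this example is easy to reproduce in base R. The sketch below uses the grid, locations, kernel standard deviation and noise level stated above, but simulates the spatial effect forward from the convolution model itself rather than fitting it to Gaussian process data; the random walk innovation standard deviation of 1 is an assumption.

```r
set.seed(5)
g  <- seq(0, 10, length.out = 12)    # n = 12 equally spaced observation locations
tj <- seq(-2, 12, length.out = 20)   # m = 20 equally spaced grid points t_j

# n x m kernel matrix with entries K(g_i - t_j), normal density with sd 0.6
K <- outer(g, tj, function(a, b) dnorm(a - b, sd = 0.6))

w <- cumsum(rnorm(20, sd = 1))       # first-order random walk for the w_j (sd assumed)
s <- as.vector(K %*% w)              # s(g_i) = sum_j K(g_i - t_j) w_j
y <- s + rnorm(12, sd = 0.2)         # observations with white noise error u_i
```

Extending the grid beyond the range of the observations, as here, limits edge effects in the smoothed surface.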
Best et al. (2000) consider a convolution model for health counts $y_i \sim Po(P_i \lambda_i)$ observed
for areas rather than points, where Pi are populations and λi are latent rates. In this case,
a rectangular grid is defined over m points in the region, and an additive (rather than log
link) regression is used for modelling the latent rates. So, with a single predictor xi taking
positive values only, one has
$$\lambda_i = \beta_0 + \beta_1 x_i + \beta_2 \sum_{j=1}^{m} K(g_i - t_j) w_j,$$
where the wj (and the β parameters) are gamma distributed and the kernel function K has
a known variance. One can decompose the total risk parameter into three sources: one due
to the background rate β0, one reflecting the known predictor, and one the latent spatially
configured risk over the region.
Semiparametric approaches to spatial modelling based on the stick-breaking prior can
also be related to this theme (Reich and Fuentes, 2007). Thus there are kernel functions
for each of m potential clusters, with the kernel centres $t_j = (t_{1j}, t_{2j})$ being unknowns, and
the cluster allocation probabilities for sites or areas i at location $g_i = (g_{1i}, g_{2i})$ incorporating
spatial information. While the cluster effects $w_j \sim N(0, 1/\tau_w)$ are unstructured, the cluster
for area or point i is chosen using indicators
with the $p_{ij}$ determined both via beta distributed $V_j \sim Be(c, d)$, and by cluster specific
kernels $K_{ij}$ constrained to lie in [0,1]. The realised spatial effect for area or point i is then $w_{J_i}$.
Defining $R_{ij} = K_{ij} V_j$, one has
$$p_{i1} = R_{i1}$$
$$p_{im} = (1 - R_{i1}) \dots (1 - R_{i,m-1})$$
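A minimal base R sketch of these allocation probabilities follows. It takes the first and last probabilities as given above and assumes the usual stick-breaking form for the intermediate clusters, namely R_ij multiplied by the product of (1 − R_ik) over k < j; the function name `ksb_probs`, the kernel bandwidth and the Be(1, 1) choice are all illustrative.

```r
# Kernel stick-breaking probabilities for n sites and m clusters.
# K: n x m matrix of kernel values in [0, 1]; V: length-m vector of beta draws.
ksb_probs <- function(K, V) {
  R <- sweep(K, 2, V, "*")              # R_ij = K_ij * V_j
  n <- nrow(R); m <- ncol(R)
  P <- matrix(0, n, m)
  left <- rep(1, n)                     # running product of (1 - R_ik) for k < j
  for (j in seq_len(m - 1)) {
    P[, j] <- R[, j] * left
    left   <- left * (1 - R[, j])
  }
  P[, m] <- left                        # remaining mass goes to the last cluster
  P
}

set.seed(6)
g   <- cbind(runif(5), runif(5))        # site locations
tc  <- cbind(runif(4), runif(4))        # kernel centres t_j
dgt <- sqrt(outer(g[, 1], tc[, 1], "-")^2 + outer(g[, 2], tc[, 2], "-")^2)
K   <- exp(-dgt / (2 * 0.5))            # kernel values in (0, 1], common bandwidth 0.5
V   <- rbeta(4, 1, 1)
P   <- ksb_probs(K, V)
J   <- apply(P, 1, function(p) sample.int(4, 1, prob = p))   # cluster indicators J_i
```

Because the kernels depend on distance to the cluster centres, nearby sites have similar allocation probabilities, which is how spatial information enters the mixture.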
$$K_{ij} = \exp[-|g_i - t_j| / (2\gamma_j)]$$
defines a normal kernel with bandwidth γj. Bandwidths can be taken equal across kernel
functions or vary across kernel functions according to a positive prior (e.g. inverse gamma).
$$y_i \sim N(\mu_i, \sigma^2),$$
$$\mu_i = \beta_0 + \sum_j K_{ij} w_j,$$
$$K_{ij}(d_{ij}) = \frac{1}{2\pi\eta} \exp(-d_{ij}/\eta),$$
with distances $d_{ij} = [(g_{i1} - t_{j1})^2 + (g_{i2} - t_{j2})^2]^{0.5}$. The grid effects wj are assumed iid
random normal with zero mean, and with standard deviation σw.
Because of confounding between the grid effects and the kernel, for identifiability,
it is assumed that η = 1, but that the wj have an unknown variance, with σw assigned a
U(0,100) prior. Using jagsUI for estimation, this model provides a correlation between
actual and predicted magnitudes of 0.32. Computation is slower if η is taken as an
unknown, and σw is set to 1. Also, the fit is not improved.
However, a much-improved fit is obtained by a two-group mixture intercept, with
preset probabilities on the two groups of 0.95 and 0.05 to facilitate identifiability. Thus
$$\mu_i = \beta_{0 J_i} + \sum_j K_{ij} w_j,$$
$$J_i \sim \text{Categoric}(0.95, 0.05).$$
This increases the correlation between actual and predicted magnitudes to 0.76. The
estimates (posterior means and sd) for the intercepts β01 and β02 are 4.40 (0.02) and 6.07
(0.05). Further improvements in fit might be obtained by taking additional groups in the
discrete mixture intercept.
For this model, site-specific effects are obtained by comparing the μi to their over-
all average. Then 359 of the 2945 sites have a posterior probability over 95% that the
effect is positive. Figure 6.6 maps out three significance categories, and in particular
shows spatial clustering of sites with over 0.95 probability of elevated earthquake
magnitudes.
FIGURE 6.6
Significance of site effects. [Figure: map of sites by longitude and latitude, with significance groups under 0.05, 0.05-0.95, and over 0.95.]
6.8 Computational Notes
[1] With d[i] denoting the vector of neighbour numbers (the number of areas adjacent
to area i), and W the interaction matrix, the Leroux et al. (1999) prior has the form
library(rstan)
library(loo)
# Leroux structure: Dmat holds the neighbour counts, W is the interaction matrix
Dmat <- diag(d)
R <- Dmat - W
I <- diag(N)
# data inputs
dat <- list(n = N, # number of observations
y = y, # observed number of cases
T = T, # binomial denominators
x = x, # area-level predictor
R = R,
I = I)
# Array declarations below use the older Stan syntax (e.g. int y[n]);
# Stan 2.33 and later require the form array[n] int y.
model <- "
data {
int<lower = 1> n;
int<lower = 0> y[n];
real x[n];
int T[n];
matrix[n, n] R;
matrix[n, n] I;
}
transformed data{
vector[n] zeros;
zeros = rep_vector(0, n);
}
parameters {
real beta[2];
vector[n] phi;
real<lower = 0> tau;
real<lower = 0, upper = 1> alpha;
}
transformed parameters {
real theta[n];
real eta[n];
for (i in 1:n) {
eta[i] = beta[1] + beta[2] * x[i] + phi[i];
theta[i] = exp(eta[i]) / (1 + exp(eta[i]));
}
}
model {
phi ~ multi_normal_prec(zeros, tau * ((1 - alpha) * I + alpha * R));
beta ~ normal(0, 5);
tau ~ gamma(2, 2);
y ~ binomial(T, theta);
}
generated quantities {
real log_lik[n];
for (i in 1:n) {log_lik[i] = binomial_lpmf(y[i] | T[i], theta[i]);}
}
"
sm <- stan_model(model_code = model)
fit <- sampling(sm, data = dat, iter = 2500, warmup = 250, chains = 2, seed = 12345)
summary(fit, pars = c("beta", "alpha"), probs = c(0.025, 0.975))$summary
# Fit
loo(as.matrix(fit, pars = "log_lik"))
References
Adin A, Lee D, Goicoa T, Ugarte M (2018) A two-stage approach to estimate spatial and spatio-
temporal disease risks in the presence of local discontinuities and clusters. Statistical Methods in
Medical Research, In press
Allard D, Beauchamp M, Bel L, Desassis N, Gabriel É, Geniaux G, Malherbe L, Martinetti D, Opitz
T, Parent É, Romary T, Saby N (2017) Analyzing spatio-temporal data with R: Everything you
always wanted to know – but were afraid to ask. Journal de la Société Française de Statistique,
158(3), 124–158.
Anselin L (2010) Thirty years of spatial econometrics. Papers in Regional Science, 89(1), 3–25.
Anselin L, Bera A (1998) Spatial dependence in linear regression models, with an introduction to spa-
tial econometrics, pp 237–290, in Handbook of Applied Economic Statistics, eds A Ullah, D Giles.
Marcel Dekker, New York.
Bakka H, Rue H, Fuglstad G, Riebler A, Bolin D, Krainski E, Simpson D, Lindgren F (2018) Spatial
modelling with R-INLA: A review. arXiv preprint arXiv:1802.06350.
Banerjee S, Carlin B, Gelfand A (2014) Hierarchical Modeling and Analysis for Spatial Data. Chapman
and Hall/CRC.
Bell B, Broemeling L (2000) A Bayesian analysis for spatial processes with application to disease map-
ping. Statistics in Medicine, 19, 957–974.
Berke O (2004) Exploratory disease mapping: kriging the spatial risk function from regional count
data. International Journal of Health Geographics, 3, 18.
Bernardinelli L, Clayton D, Pascutto C, Montomoli C, Ghislandi M, Songini M (1995) Bayesian analy-
sis of space–time variation in disease risk. Statistics in Medicine, 14(21–22), 2433–2443.
Besag J (1974) Spatial interaction and the statistical analysis of lattice systems. Journal of the Royal
Statistical Society: Series B, 36, 192–236.
Besag J (1989) Towards Bayesian image analysis. Journal of Applied Statistics, 16, 395–407.
Besag J, Green P (1993) Spatial statistics and Bayesian computation. Journal of the Royal Statistical
Society: Series B, 55, 25–37.
Besag J, Green P, Higdon D, Mengersen K (1995) Bayesian computation and stochastic systems.
Statistical Science, 10, 3–66.
Besag J, Kooperberg C (1995) On conditional and intrinsic autoregressions. Biometrika, 82, 733–746.
Besag J, York J, Mollie A (1991) Bayesian image restoration, with two applications in spatial statistics.
Annals of the Institute of Statistical Mathematics, 43, 1–21.
Best N, Arnold R, Thomas A, Waller L, Conlon E (2000) Bayesian models for spatially correlated dis-
ease and exposure data, p 131, in Bayesian Statistics 6: Proceedings of the Sixth Valencia International
Meeting, Vol. 6. Oxford University Press.
Bhattacharjee A, Jensen-Butler C (2006) Estimation of the spatial weights matrix under structural
constraints. Regional Science and Urban Economics, 43(4), 617–634.
Bivand R, Gomez-Rubio V, Rue H (2014) Approximate Bayesian inference for spatial econometrics
models. Spatial Statistics, 9, 146–165.
Bivand R, Gomez-Rubio V, Rue H (2015). Spatial data analysis with R-INLA with some extensions.
Journal of Statistical Software, 63(20), 1–31.
Bivand R, Pebesma E, Gómez-Rubio V (2013) Applied Spatial Data Analysis with R, 2nd
Edition. Springer, New York.
Blangiardo M, Cameletti M (2015) Spatial and Spatio-temporal Bayesian models with R-INLA. John Wiley
& Sons.
Bowman V, Woods D (2016) Emulation of multivariate simulators using thin-plate splines with
application to atmospheric dispersion. SIAM/ASA Journal on Uncertainty Quantification, 4(1),
1323–1344.
Brown P (2015) Model-based geostatistics the easy way. Journal of Statistical Software, 63(12), 1–24.
Brunsdon C (2018) Using rstan and spdep for Spatial Modelling. https://rstudio-pubs-static.s3.amazonaws.com/
Brunsdon C, Comber L (2015) An Introduction to R for Spatial Analysis and Mapping. Sage.
Calder K (2003) Exploring latent structure in spatial temporal processes using process convolutions.
PhD Thesis, Duke University, Durham, NC. https://www2.stat.duke.edu/people/theses/CalderK.html
Calder K (2007) Dynamic factor process convolution models for multivariate space–time data with
application to air quality assessment. Environmental and Ecological Statistics, 14, 229–247.
Clark J, Silman M, Kern R, Macklin E, HilleRisLambers J (1999) Seed dispersal near and far: Patterns
across temperate and tropical forests. Ecology, 80, 1475–1494.
Clayton D, Kaldor J (1987) Empirical Bayes estimates of age-standardised relative risks for use in
disease mapping. Biometrics, 43, 671–682.
Congdon P (2007) Mixtures of spatial and unstructured effects for spatially discontinuous health
outcomes. Computational Statistics and Data Analysis, 51, 3197–3212.
Congdon P (2008) A spatially adaptive conditional autoregressive prior for area health data. Statistical
Methodology, 5, 552–563.
Congdon P (2017) Quantile regression for overdispersed count data: a hierarchical method. Journal of
Statistical Distributions and Applications, 4, 18.
Conlon E, Louis T (1999) Addressing multiple goals in evaluating region-specific risk using Bayesian
methods, pp 31–47, in Disease Mapping and Risk Assessment for Public Health, eds A Lawson, A
Biggeri, D Bohning, E Lesaffre, J Viel, R Bertollini. John Wiley, Chichester, UK.
Cressie N, Kapat P (2008) Some diagnostics for Markov random fields. Journal of Computational and
Graphical Statistics, 17, 726–749.
Cressie N, Wikle CK (2011) Statistics for Spatio-Temporal Data. John Wiley & Sons, Inc., New York.
Datta A, Banerjee S, Finley A, Gelfand A (2016) Hierarchical nearest-neighbor Gaussian process mod-
els for large geostatistical datasets. Journal of the American Statistical Association, 111, 800–812.
De Oliveira V (2012) Bayesian analysis of conditional autoregressive models. Annals of the Institute of
Statistical Mathematics, 64(1), 107–133.
Diggle P, Ribeiro P (2002) Bayesian inference in Gaussian model-based geostatistics. Geographical and
Environmental Modelling, 6, 129–146.
Diggle P, Ribeiro P, Christensen O (2003) An introduction to model based geostatistics, pp 43–86,
in Spatial Statistics and Computational Methods, ed Möller J. Lecture Notes in Statistics, Vol. 173.
Springer.
Diggle P, Ribeiro P (2007) Model-based Geostatistics. Springer-Verlag, New York.
Dormann C, McPherson J, Araújo M, Bivand R, Bolliger J (2007) Methods to account for spatial auto-
correlation in the analysis of species distributional data: A review. Ecography, 30(5), 609–628.
Duan J, Guindani M, Gelfand A (2007) Generalized spatial Dirichlet process models. Biometrika, 94,
809–825.
Eidsvik J, Finley A, Banerjee S, Rue H (2012) Approximate Bayesian inference for large spa-
tial datasets using predictive process models. Computational Statistics & Data Analysis, 56(6),
1362–1380.
Fernandez C, Green P (2002) Modelling spatially correlated data via mixtures: A Bayesian approach.
Journal of the Royal Statistical Society: Series B, 64, 805–826.
Finley A, Sang H, Banerjee S, Gelfand A (2009) Improving the performance of predictive process
modeling for large datasets. Computational Statistics & Data Analysis, 53(8), 2873–2884.
Finley A, Banerjee S, Gelfand A (2015) spBayes for large univariate and multivariate point-referenced
spatio-temporal data models. Journal of Statistical Software, 63(13), 1–28.
Finley A, Datta A, Cook B, Morton D, Andersen H, Banerjee S (2017) Applying Nearest Neighbor
Gaussian Processes to Massive Spatial Data Sets: Forest Canopy Height Prediction Across
Tanana Valley Alaska. https://arxiv.org/abs/1702.00434
Furrer R, Sain S (2010) spam: A sparse matrix R package with emphasis on MCMC methods for
Gaussian Markov random fields. Journal of Statistical Software, 36(10), 1–25.
Gelfand A, Kottas A, MacEachern S (2005b) Bayesian nonparametric spatial modeling with Dirichlet
process mixing. Journal of the American Statistical Association, 100(471), 1021–1035.
Gelfand A, Latimer A, Wu S, Silander J (2005a) Building statistical models to analyse species distribu-
tions, in Hierarchical Modelling for the Environmental Sciences, Statistical Methods and Applications,
eds J Clark, A Gelfand. OUP.
Gelfand AE, Ghosh SK (1998) Model choice: A minimum posterior predictive loss approach.
Biometrika, 85(1), 1–11.
Gerber F, Furrer R (2015) Pitfalls in the implementation of Bayesian hierarchical modeling of areal
count data: An illustration using BYM and Leroux models. Journal of Statistical Software, Code
Snippets, 63(1), 1–32. http://www.jstatsoft.org/v63/c01/
Giardini D, Woessner J, Danciu L (2014) Mapping Europe’s seismic hazard. EOS, 95(29): 261–262.
Goméz-Rubio V, Bivand R (2018) R Package ‘INLABMA’, Bayesian Model Averaging with INLA.
https://rdrr.io/rforge/INLABMA/
Goméz-Rubio V, Bivand R, Rue H (2018) Estimating spatial econometrics models with integrated
nested laplace approximation. arXiv preprint arXiv:1703.01273.
Gotway C, Wolfinger R (2003) Spatial prediction of counts and rates. Statistics in Medicine, 22,
1415–1432.
Gramacy R, Lee H (2008) Gaussian processes and limiting linear models. Computational Statistics &
Data Analysis, 53, 123–136.
Representing Spatial Dependence 249
Green P, Richardson S (2002) Hidden Markov models and disease mapping. Journal of the American
Statistical Association, 97, 1055–1070.
Griffin J, Steel M (2006) Order-based dependent Dirichlet processes. Journal of the American Statistical
Association, 101, 179–194.
Grunwald S (2005) Environmental Soil-Landscape Modeling: Geographic Information Technologies and
Pedometrics. CRC Press.
Gschlößl S, Czado C (2006) Modelling count data with overdispersion and spatial effects. Technische
Universität München, Statistical Papers. DOI: 10.1007/s00362-006-0031-6
Haran M, Hodges J, Carlin B (2003) Accelerating computation in Markov random field models for
spatial data via structured MCMC. Journal of Computational & Graphical Statistics, 12, 249–264.
Hepple L (2003) Bayesian and maximum likelihood estimation of the linear model with spatial mov-
ing average disturbances. Working Papers Series, School of Geographical Sciences, University
of Bristol.
Higdon D (1998) A process-convolution approach to modelling temperatures in the North Atlantic
Ocean. Environmental and Ecological Statistics, 5, 173–190.
Higdon D (2007) A primer on space-time modelling from a Bayesian perspective, Chapter 6, in
Statistical Methods for Spatio-Temporal Systems, eds B Finkelstadt, L Held, V Isham. CRC Press.
Hodges J, Carlin B, Fan Q (2003) On the precision of the conditionally autoregressive prior in spatial
models. Biometrics, 59, 317–322.
Jiruše M, Machek J, Beneš V, Zeman P (2004) A Bayesian estimate of the risk of tick-borne diseases.
Applications of Mathematics, 49, 389–404.
Joseph M (2016) Exact Sparse CAR Models in Stan. http://mc-stan.org/users/documentation/case-
studies/mbjoseph-CARStan.html
Kelsall J, Wakefield J (2002) Modelling spatial variation in disease risk: A geostatistical approach.
Journal of the American Statistical Association, 97, 692–770.
Knorr-Held L, Becker N (2000) Bayesian modelling of spatial heterogeneity in disease maps with
application to German cancer mortality data. Journal of the German Statistical Society, 84, 121–140.
Knorr-Held L, Rasser G (2000) Bayesian detection of clusters and discontinuities in disease maps.
Biometrics, 56, 13–21.
Lacombe D, LeSage J (2015) Using Bayesian posterior model probabilities to identify omitted vari-
ables in spatial regression models. Papers in Regional Science, 94(2), 365–383.
Lang S, Brezger A (2004) Bayesian P-splines. Journal of Computational and Graphical Statistics, 13(1),
183–212.
Lavine M (1999) Another look at conditionally Gaussian Markov random fields, in Bayesian Statistics
6, eds J Bernardo, J Berger, P Dawid, A Smith. Oxford University Press, Oxford, UK.
Lawson A (2008) Bayesian Disease Mapping: Hierarchical Modeling in Spatial Epidemiology. CRC Press.
Lawson A, Clark A (2002) Spatial mixture relative risk models applied to disease mapping. Statistics
in Medicine, 21, 359–370.
Lee D (2011) A comparison of conditional autoregressive models used in Bayesian disease mapping.
Spatial and Spatio-Temporal Epidemiology, 2(2), 79–89.
Lee D (2013) CARBayes: An R package for Bayesian spatial modeling with conditional autoregres-
sive priors. Journal of Statistical Software, 55(13), 1–24.
Lee H, Higdon D, Calder C, Holloman C (2005) Efficient models for correlated data via convolutions
of intrinsic processes. Statistical Modelling, 5, 53–74.
Leroux B, Lei X, Breslow N (1999) Estimation of disease rates in small areas: a new mixed model
for spatial dependence, pp 135–178, in Statistical Models in Epidemiology, the Environment and
Clinical Trials, eds M Halloran, D Berry. Springer-Verlag, New York.
LeSage J (1999) Spatial econometrics, in The Web Book of Regional Science (www.rri.wvu.edu/regsc-
web.htm), ed R W Jackson. Regional Research Institute, West Virginia University, Morgantown,
WV.
Liang W, Lee H (2014) Sequential process convolution gaussian process models via particle learning.
Statistics and Its Interface, 7(4), 465–475.
Lindgren F (2012) Continuous domain spatial models in R-INLA. The ISBA Bulletin, 19(4), 14–20.
250 Bayesian Hierarchical Models
Lindgren F, Rue H, Lindstrom J (2011) An explicit link between Gaussian fields and Gaussian
Markov random fields: The stochastic partial differential equation approach. Journal of the Royal
Statistical Society: Series B, 73(4), 423–498.
Lindgren F, Rue H (2015) Bayesian spatial modelling with R-INLA. Journal of Statistical Software,
63(19), 1–25
MacNab Y (2014) On identification in Bayesian disease mapping and ecological–spatial regression
models. Statistical Methods in Medical Research, 23(2), 134–155.
MacNab Y, Kmetic A, Gustafson P, Shaps S (2006) An innovative application of Bayesian disease
mapping methods to patient safety research. Statistics in Medicine, 25, 3960–3980.
Mardia K, Watkins A (1989) On multimodality of the likelihood in the spatial linear model. Biometrika,
76, 289–295.
Marra G, Wood S (2012) Coverage properties of confidence intervals for generalized additive model
components. Scandinavian Journal of Statistics, 39(1), 53–74.
Mitas L, Mitasova H (1999) Spatial interpolation, pp 481–492, in Geographical Information Systems:
Principles, Techniques, Management and Applications, eds P Longley, M Goodchild, D Maguire, D
Rhind, 1st Edition. Wiley.
Moores M, Hargrave C, Deegan T, Poulsen M, Harden F, Mengersen K (2015) An external field prior
for the hidden Potts model with application to cone-beam computed tomography. Computational
Statistics & Data Analysis, 86, 27–41.
Morris M (2018) Spatial Models in Stan: Intrinsic Auto-Regressive Models for Areal Data. http://
mc-stan.org/users/documentation/case-studies/icar_stan.html
Norton J, Niu X (2009) Intrinsically autoregressive spatiotemporal models with application to aggre-
gated birth outcomes. Journal of the American Statistical Association, 104, 638–649.
Paciorek CJ, Schervish MJ (2006) Spatial modelling using a new class of nonstationary covariance
functions. Environmetrics, 17(5), 483–506.
Pettitt A, Weir I, Hart A (2002) A conditional autoregressive Gaussian process for irregularly
spaced multivariate data with application to modelling large sets of binary data. Statistics and
Computing, 12, 353–367.
Reich BJ, Fuentes M (2007) A multivariate semiparametric Bayesian spatial modeling framework for
hurricane surface wind fields. The Annals of Applied Statistics, 1(1), 249–264.
Ribeiro P, Diggle P (2018) Package ‘geoR’. https://cran.r-project.org/web/packages/geoR/geoR.
pdf
Richardson S, Guihenneuc C, Lasserre V (1992) Spatial linear models with autocorrelated error struc-
ture. The Statistician, 41, 539–557.
Richardson S, Monfort C (2000) Ecological correlation studies, in Spatial Epidemiology Methods and
Applications, eds P Elliott, J Wakefield, N Best, D Briggs. Oxford University Press.
Richardson S, Thomson A, Best N, Elliott P (2004) Interpreting posterior relative risk estimates in
disease-mapping studies. Environmental Health Perspectives, 112, 1016–1025.
Riebler A, Sørbye S, Simpson D, Rue H (2016) An intuitive Bayesian spatial model for disease map-
ping that accounts for scaling. Statistical Methods in Medical Research, 25, 1145–1165.
Riggan W, Manton K, Creason J, Woodbury M, Stallard E (1991) Assessment of spatial variation of
risks in small populations. Environmental Health Perspectives, 96, 223–238.
Risser M, Calder C (2017) Local likelihood estimation for covariance functions with spatially-varying
parameters: The convoSPAT package for R. Journal of Statistical Software, 81(14), 1–32.
Rodrigues A, Assuncao R (2008) Propriety of posterior in Bayesian space varying parameter models
with normal data. Statistics & Probability Letters, 78, 2408–2411.
Rue H, Held L (2005) Gaussian Markov Random Fields: Theory and Applications. Chapman & Hall,
London, UK.
Rue H, Martino S, Chopin, N (2009) Approximate Bayesian inference for latent Gaussian models by
using integrated nested Laplace approximations. Journal of the Royal Statistical Society: Series B,
71(2), 319–392.
Rue H, Tjelmeland H (2002) Fitting Gaussian Markov random fields to Gaussian fields. Scandinavian
Journal of Statistics, 29, 31–49.
Representing Spatial Dependence 251
Sangalli L, Ramsay J, Ramsay T (2013) Spatial spline regression models. Journal of the Royal Statistical
Society: Series B, 75(4), 681–703.
Schabenberger O, Gotway C (2004) Statistical Methods for Spatial Data Analysis. Chapman & Hall/
CRC.
Schrödle B, Held L (2011) A primer on disease mapping and ecological regression using INLA.
Computational Statistics, 26(2), 241–258.
Sethuraman J (1994) A constructive definition of dirichlet priors. Statistica Sinica, 4, 639–665.
Seya H, Tsutsumi M, Yamagata Y (2012) Income convergence in Japan: A Bayesian spatial Durbin
model approach. Economic Modelling, 29(1), 60–71.
Thomas A, Best N, Lunn D, Arnold R, Spiegelhalter D (2014) GeoBUGS User Manual. https://www.
mrc-bsu.cam.ac.uk
Vecchia AV (1988) Estimation and model identification for continuous spatial processes. Journal of the
Royal Statistical Society: Series B (Methodological), 50(2), 297–312.
Wakefield J (2007) Disease mapping and spatial regression with count data. Biostatistics, 8, 158–183.
Wall M (2004) A close look at the spatial structure implied by the CAR and SAR models. Journal of
Statistical Planning and Inference, 121, 311–324.
Waller L (2002) Hierarchical models for disease mapping, in Encyclopedia of Environmetrics, eds A
El-Shaarawi, W Piegorsch. Wiley, Chichester, UK.
Waller L, Carlin B (2010) Disease mapping, Chapter 14, pp 217–243, in Handbooks of Modern Statistical
Methods. ed G Fitzmaurice, Chapman & Hall/CRC.
Waller L, Gotway C (2004) Applied Spatial Statistics for Public Health Data. Wiley.
Webster R, Oliver M, Muir K, Mann J (1994) Kriging the local risk of a rare disease from a register of
diagnoses. Geographical Analysis, 26, 168–185.
Wood S (2006) Low-rank scale-invariant tensor product smooths for generalized additive mixed
models. Biometrics, 62(4), 1025–1036.
Wood S, Bravington M, Hedley S (2008) Soap film smoothing. Journal of the Royal Statistical Society:
Series B, 70(5), 931–955.
Wood S N (2016) Just another Gibbs additive modeller: Interfacing JAGS and mgcv. arXiv preprint
arXiv:1602.02539.
Yang C, Xu J, Li Y (2016) Bayesian geoadditive modelling of climate extremes with nonparametric
spatially varying temporal effects. International Journal of Climatology, 36(12), 3975–3987.
Yanli Z, Wall M (2004) Investigating the use of the variogram for lattice data. Journal of Computational
and Graphical Statistics, 13, 719–738.
Zhang H (2002) On estimation and prediction for spatial generalised linear mixed models. Biometrics,
58, 129–136.
Zhang L, Datta A, Banerjee S (2018) Practical Bayesian Modeling and Inference for Massive Spatial
Datasets On Modest Computing Environments. arXiv preprint arXiv:1802.00495.
Zhu L, Gorman D, Horel S (2006) Hierarchical Bayesian spatial models for alcohol availability, drug
hot spots and violent crime. International Journal of Health Geographics, 5, 54.
7
Regression Techniques Using Hierarchical Priors
7.1 Introduction
This chapter is concerned with the application of hierarchical prior schemes to regressions
involving univariate responses, where the observations are non-nested, but may be spatially
or temporally configured. Nested data applications are considered in Chapters 8 and 10.
A range of Bayesian packages in R for regression, often in specialised applications, are
detailed at https://cran.r-project.org/web/views/Bayesian.html, and include BayesLogit
(Windle, 2016), BMA (Raftery et al., 2005), and BMS (Zeugner and Feldkircher, 2015). The
treatment here is intended to provide a generic overview of regression applications
involving hierarchical principles, and of the development of flexible data analysis through
application-specific coding.
As a first illustration of regression modelling invoking hierarchical principles, much
attention has focused on Bayesian methods for predictor selection, commonly using selection
indicators or shrinkage priors, and these are discussed in Section 7.2. Many regression
selection applications involve categorical predictors, including analysis of variance, and
the particular issues raised are considered in Section 7.3. Hierarchical specifications also
apply when latent responses or random effects are used to (a) improve data representation
in overdispersed generalised linear models (Section 7.4), or (b) generate latent continuous
responses (augmented data) underlying discrete observations (Section 7.5). Section 7.6
considers heterogeneity in regression relationships or variance parameters over exchangeable
sample units. Heterogeneous regression effects and predictor selection are then considered
for responses structured in time or space (Sections 7.7 and 7.8).
7.2 Predictor Selection
Regression model uncertainty most commonly focuses on which predictors to retain,
though other aspects of regression specification may be considered also. Predictor selec-
tion methods generally also aim for improved predictive performance, through develop-
ing an encompassing model, or model simplification without adversely affecting predictive
accuracy (Piironen and Vehtari, 2017a). Formal model choice is simplified for normal linear
regression, as marginal likelihoods may be obtained analytically, but for a large number of
predictors, comparison of the many possible models becomes infeasible.
One option for a more feasible analysis involves a form of discrete mixture: predictor
selection indicators, possibly combined with particular priors on regression coefficients or
variances, are introduced to enable additional inferences (e.g. marginal retention probabil-
ities on predictors) (Rockova et al., 2012; Malsiner-Walli and Wagner, 2019). Alternatively,
regularisation or shrinkage priors include a penalty (e.g. an L1 norm) or some other mecha-
nism to shrink unnecessary regression effects towards zero (e.g. Polson and Scott, 2010;
Carvalho et al., 2009).
A leading motivation for predictor selection or effect regularisation is to obtain coefficient
estimates that alleviate the effects of predictor collinearity. If not controlled for,
collinearity may lead to low precision for regression coefficients, and to coefficients with
effect sizes or signs contrary to subject matter expectations (Winship and Western, 2016).
7.2.1 Predictor Selection
Predictor selection recognises model uncertainty, and a predictive target which acknowl-
edges such uncertainty may be of interest. An example might be a mean treatment suc-
cess rate conditional on predictors (Garcia-Donato and Martinez-Beneito, 2013). Using
predictor selection, one is implicitly averaging over a set of plausible regression models,
so providing an encompassing model with potential predictive advantages (Piironen and
Vehtari, 2017a).
The discrete mixture approach involves binary selection indicators $\gamma_j$ for predictors
j = 1,…,p. These indicators may directly determine inclusion (e.g. Kuo and Mallick, 1998),
or define prior regression coefficient variances consistent with inclusion or effective
exclusion, as under stochastic search variable selection (SSVS) (George and McCulloch, 1993).
Selection usually applies to all predictors except the intercept.
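With normal priors on regression coefficients under univariate linear regression, and
with response $y_i$ and predictors $X_i$, one has under SSVS
$$
\begin{aligned}
y_i &\sim N(\beta_0 + X_i\beta, \sigma^2),\\
\beta_j \mid \gamma_j = 1 &\sim N(0, \tau_j^2),\\
\beta_j \mid \gamma_j = 0 &\sim N(0, c_j\tau_j^2),\\
\gamma_j \mid \omega_j &\sim \text{Bern}(\omega_j),\\
\omega_j &\sim p(\omega_j).
\end{aligned}
$$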
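The way the two prior variances separate included from excluded coefficients can be illustrated numerically. The following Python sketch evaluates the conditional probability that $\gamma_j = 1$ given a current value of $\beta_j$, which is the usual Bernoulli update for the indicators in a Gibbs sampler under this kind of prior. It assumes a single known inclusion probability `omega`; the function names and numerical settings are hypothetical and are not taken from the book's code.

```python
import math


def normal_pdf(x, var):
    """Density of N(0, var) evaluated at x."""
    return math.exp(-0.5 * x * x / var) / math.sqrt(2.0 * math.pi * var)


def inclusion_prob(beta_j, tau2, c, omega):
    """Pr(gamma_j = 1 | beta_j) under the SSVS prior.

    Slab:  beta_j ~ N(0, tau2)      when gamma_j = 1
    Spike: beta_j ~ N(0, c * tau2)  when gamma_j = 0
    Note: for extreme beta_j both densities can underflow to zero.
    """
    slab = omega * normal_pdf(beta_j, tau2)
    spike = (1.0 - omega) * normal_pdf(beta_j, c * tau2)
    return slab / (slab + spike)


# Hypothetical settings: tau2 = 1, c = 0.01, prior inclusion probability 0.5
p_near_zero = inclusion_prob(0.0, tau2=1.0, c=0.01, omega=0.5)
p_far_from_zero = inclusion_prob(3.0, tau2=1.0, c=0.01, omega=0.5)
```

A coefficient at zero is attributed mostly to the narrow spike, while a coefficient three slab standard deviations from zero is attributed almost entirely to the slab.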
The setting for $\tau_j^2$ (at a known value) allows unrestricted search over the potential
parameter space, while $c_j$ is set suitably small, so that $\gamma_j = 0$ corresponds to
effective exclusion from the regression. The spike and slab prior (Kuo and Mallick, 1998),
denoted SSP for short, specifies the inclusion and exclusion options as
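$$
\begin{aligned}
\beta_j \mid \gamma_j = 1 &\sim N(0, \tau_j^2),\\
\beta_j \mid \gamma_j = 0 &\sim \delta_0(),\\
\gamma_{jk} &\sim \text{Bern}(\omega_j),
\end{aligned}
$$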
where c is a fixed small constant, and a hierarchical prior is set on $\tau_k^2$. Thus,
Richardson et al. (2010) suggest a model with selection parameters $\omega_{jk}$ determined
both by the outcome and predictor,
$$\omega_{jk} = \omega_k\rho_j,$$
where $\rho_j$ captures the propensity for predictor j to influence several outcomes, and
$\omega_k$ controls the complexity of the regression for outcome k.
A similar hierarchical indicators procedure is proposed by Chen et al. (2016) for sparse
group selection, where covariates can be formed into substantively defined groups. The
aim is to select the most important groups of predictors, and within those selected groups,
select the more important predictors. Thus, retention for group j is determined by binary
indicators ρj ~ Bern(ωρ), so that for predictors k within groups, the selection rule is
$$\gamma_{jk} \sim (1 - \rho_j)\,\delta_0 + \rho_j\,\text{Bern}(\omega_\gamma).$$
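The two-level selection rule can be sketched as a prior simulation. The Python fragment below draws a group indicator for each group and, only for retained groups, draws within-group indicators; excluded groups have all of their within-group indicators fixed at zero. The group sizes, probabilities, and function name are hypothetical choices made for illustration.

```python
import random


def draw_group_indicators(group_sizes, omega_rho, omega_gamma, rng):
    """Draw (rho, gamma) from the hierarchical selection prior.

    rho[j]      ~ Bern(omega_rho)                       (group retained?)
    gamma[j][k] ~ Bern(omega_gamma) if rho[j] == 1, else fixed at 0
    """
    rho, gamma = [], []
    for size in group_sizes:
        r = 1 if rng.random() < omega_rho else 0
        if r == 1:
            g = [1 if rng.random() < omega_gamma else 0 for _ in range(size)]
        else:
            g = [0] * size  # point mass at zero for an excluded group
        rho.append(r)
        gamma.append(g)
    return rho, gamma


rng = random.Random(2020)
rho, gamma = draw_group_indicators([3, 2, 4], omega_rho=0.5, omega_gamma=0.5, rng=rng)
```

A predictor can therefore be retained only if its group is retained, which is what produces sparsity at both levels.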
7.2.2 Shrinkage Priors
Shrinkage priors seek a sparse representation of the regression coefficients without neces-
sarily including a mechanism to actually formally exclude unnecessary predictors, with
potential advantages in MCMC sampling (Bhattacharya et al., 2015; Makalic and Schmidt,
2016). For example, the Lasso prior specifies a heavy tailed double exponential or Laplace
prior density for regression coefficients, where this density is defined as
$$\text{DE}(x \mid \mu, \lambda) = \frac{\lambda}{2}\exp\left(-\lambda|x - \mu|\right).$$
The prior $\beta_j \sim \text{DE}(0, \lambda)$ assigns higher weight to values near zero than
the normal prior and favours shrinkage, with the scale parameter λ controlling the amount of
shrinkage. Larger values of λ imply greater shrinkage with lower variance around the zero
prior mean. This prior can be expressed in hierarchical terms (Kotz et al., 2001) as
$$
\begin{aligned}
\beta_j &\sim N(0, \eta_j^2),\\
\eta_j^2 &\sim E(\lambda^2/2).
\end{aligned}
$$
In a normal linear regression with residual variance $\sigma^2$, the first stage of the prior
should be expressed as $\beta_j \sim N(0, \sigma^2\eta_j^2)$ (Park and Casella, 2008). One may
also allow the second stage parameters $\lambda_j^2$ to vary between coefficients, e.g.
following a gamma prior (Yi and Ma, 2012).
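The hierarchical form of the Lasso prior can be checked by simulation. The sketch below assumes the standard scale-mixture representation, in which $\eta_j^2$ is exponential with rate $\lambda^2/2$ and $\beta_j$ is normal with variance $\eta_j^2$, and compares the simulated variance of $\beta_j$ with the double exponential variance $2/\lambda^2$. The value of λ, the seed, and the function names are hypothetical.

```python
import math
import random


def laplace_pdf(x, mu, lam):
    """Double exponential density DE(x | mu, lam)."""
    return 0.5 * lam * math.exp(-lam * abs(x - mu))


def draw_lasso_mixture(lam, rng):
    """Draw beta via eta2 ~ E(lam^2 / 2), then beta ~ N(0, eta2)."""
    eta2 = rng.expovariate(lam * lam / 2.0)  # expovariate takes a rate
    return rng.gauss(0.0, math.sqrt(eta2))


rng = random.Random(42)
lam = 2.0
draws = [draw_lasso_mixture(lam, rng) for _ in range(100000)]
sample_var = sum(b * b for b in draws) / len(draws)  # prior mean is zero
laplace_var = 2.0 / lam ** 2                         # variance of DE(0, lam)
```

The two variances should agree up to Monte Carlo error, which is consistent with the mixture reproducing the double exponential prior.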
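Shrinkage priors can be represented generically (Polson and Scott, 2010; Bhadra et al.,
2016) as
$$
\begin{aligned}
\beta_j \mid \eta_j^2, \tau^2 &\sim N(0, \tau^2\eta_j^2),\\
\eta_j^2 &\sim p(\eta_j^2),\\
\tau^2 &\sim p(\tau^2),
\end{aligned}
$$
with different possible choices of prior density for the $\eta_j^2$ (local shrinkage
parameters) and $\tau^2$ (the global shrinkage parameter).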
The horseshoe prior specifies a half-Cauchy prior for the ηj, allowing considerable
shrinkage for unnecessary coefficients (Carvalho et al., 2009; Polson and Scott, 2012). There
is some debate about a suitable prior for τ2 (Piironen and Vehtari, 2017b; Piironen and
Vehtari, 2017c), with Carvalho et al. (2009) recommending a half-Cauchy prior also, namely
$$\tau \sim C^+(0, 1).$$
This can be expressed in terms of a Beta(0.5,0.5) density for the shrinkage parameters
The estimated κj can be interpreted as “the amount of weight that the posterior mean for βj
places on 0” (Carvalho et al., 2009; Piironen and Vehtari, 2017c); so higher κj correspond to
irrelevant predictors. Accordingly, Piironen and Vehtari (2017c) propose an effective num-
ber of coefficients measure
$$\sum_{j=1}^{p} (1 - \kappa_j).$$
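The behaviour of the shrinkage weights under a half-Cauchy local scale can be explored with a short simulation. The sketch below assumes the simplest form of the weight, $\kappa_j = 1/(1 + \lambda_j^2)$ with $\lambda_j \sim C^+(0, 1)$, which is the form obtained with unit global scale and unit error variance; the definition used in any particular application may include further scaling terms. All names, the seed, and the sample size are hypothetical.

```python
import math
import random


def draw_half_cauchy(rng):
    """Draw from C+(0, 1) by inverting F(x) = (2 / pi) * atan(x)."""
    return math.tan(0.5 * math.pi * rng.random())


def shrinkage_weight(lam):
    """kappa = 1 / (1 + lam^2), assuming unit global scale and error variance."""
    return 1.0 / (1.0 + lam * lam)


def effective_number(kappas):
    """Effective number of coefficients: sum over j of (1 - kappa_j)."""
    return sum(1.0 - k for k in kappas)


rng = random.Random(7)
kappas = [shrinkage_weight(draw_half_cauchy(rng)) for _ in range(50000)]
mean_kappa = sum(kappas) / len(kappas)
# Share of prior mass close to 0 (no shrinkage) or close to 1 (full shrinkage)
tail_share = sum(1 for k in kappas if k < 0.1 or k > 0.9) / len(kappas)
```

Under these assumptions the simulated weights follow the U-shaped Beta(0.5, 0.5) density, so most prior mass sits near "keep" or "remove" rather than at intermediate shrinkage.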
namely a half-Student t-prior with ν degrees of freedom (Piironen and Vehtari, 2016).
Piironen and Vehtari (2017c) mention using this prior (with small ν) to alleviate divergent
transitions produced by the No U-Turn Sampler (NUTS) algorithm, but this implies a loss
of sparsity.
$$\beta_j \sim N(0, \gamma_j\tau_j^2),$$
with gamma priors on $1/\tau_j^2$ and a beta or uniform prior on ω. A shrinkage mechanism
operates whereby when $\gamma_j = 0$ the variance of $\beta_j$ is very small. The parameter ω,
if taken unknown, acts as a complexity parameter, controlling model size. Here $v_0 = 0.005$
and ω ~ U(0,1). It is assumed that $1/\tau_j^2 \sim \text{Exp}(1)$, centred around 1, since
predictors are standardised.
The second half of a two-chain run of 5,000 iterations provides a posterior mean for ω
of 0.84. The control for collinearity implicit in predictor selection reveals x9 (LTG serum)
as a significant predictor (Table 7.1), but with 95% credible intervals for β2 and β20 strad-
dling zero. The predictor x7 (high-density lipoprotein or HDL cholesterol) has a 95%
interval concentrated on negative values, albeit with an inclusion probability of 0.95.
Retention probabilities are close to 1 for BMI, MAP, and LTG.
A second analysis uses a hierarchical version of the Lasso prior, namely
$$
\begin{aligned}
\beta_j &\sim N(0, \sigma^2\eta_j^2),\\
\eta_j^2 &\sim E(\lambda^2/2).
\end{aligned}
$$
TABLE 7.1
Predictor Selection, Diabetes Progression, Spike-Slab Prior
Predictor   Notation   βj Mean   βj 2.5%   βj 97.5%   γj Mean
Sex         X2         −4.09     −12.15    0.45       0.93
BMI         X3         26.64     19.55     33.74      1.00
MAP         X4         11.91     3.56      19.06      1.00
HDL         X7         −5.84     −15.73    0.43       0.95
LTG         X9         25.34     18.01     32.55      1.00
Age-Sex     X20        4.05      −0.38     10.94      0.96
TABLE 7.2
Predictor Selection, Lasso Shrinkage Prior
Predictor   Notation   βj Mean   βj 2.5%   βj 97.5%
Sex         X2         −7.88     −13.75    −2.09
BMI         X3         23.14     15.86     30.22
MAP         X4         13.52     7.37      19.80
LTG         X9         23.39     15.74     31.18
Age-Sex     X20        5.86      0.44      11.88
The residual precision $1/\sigma^2$ is assigned a Ga(1,0.001) prior, and λ is assigned a
uniform U(0.001,100) prior. Results are as in Table 7.2, with five predictors x2, x3, x4, x9,
and x20 judged significant in terms of 95% credible intervals that are either entirely
negative or entirely positive. In the sense that regression including predictor selection is
still a model for the data, fit statistics (penalised DIC, the deviance information
criterion; WAIC, the widely applicable information criterion; and LOO-IC, the leave-one-out
information criterion) are very similar between the spike-slab and Lasso models. For
example, their respective LOO-IC are 4797 and 4794.
A third analysis uses a horseshoe prior, implemented in rstan using the scheme
$$
\begin{aligned}
\beta_j &\sim N(0, \tau^2\lambda_j^2),\\
\lambda_j &\sim C^+(0, 1),
\end{aligned}
$$
where C+(0,1) is a half-Cauchy density. A two-chain run of 2000 iterations provides pos-
terior mean estimates for κj below 0.05 only for x3.
One may also include a predictor selection mechanism when shrinkage priors are used
for the coefficients (Yuan and Lin, 2005). Thus in a selection version of the Lasso prior
$$
\begin{aligned}
\beta_j \mid \gamma_j = 1 &\sim N(0, \sigma^2\eta_j^2),\\
\beta_j \mid \gamma_j = 0 &\sim \delta_0(),\\
\eta_j^2 &\sim E(\lambda^2/2),
\end{aligned}
$$
with ω and λ assigned Be(1,1) and U(0.001,100) priors. This provides posterior inclusion
probabilities above 0.95 for x2, x3, x4, and x9, though the median probability model also
includes x7, x20, and x37.
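The distinction between a strict inclusion threshold and the median probability model can be made concrete. The sketch below takes the median probability model to be the set of predictors whose posterior inclusion probability is at least 0.5, and contrasts it with a 0.95 cut-off. The predictor labels and probabilities are invented for illustration and are not the fitted values from this example.

```python
def select_predictors(inclusion_probs, threshold):
    """Return predictor names with inclusion probability >= threshold."""
    return sorted(name for name, prob in inclusion_probs.items() if prob >= threshold)


# Hypothetical posterior inclusion probabilities (e.g. means of sampled gamma_j)
probs = {"z1": 0.98, "z2": 0.61, "z3": 0.12, "z4": 0.50, "z5": 0.33}

strict_model = select_predictors(probs, threshold=0.95)
median_model = select_predictors(probs, threshold=0.5)
```

In practice the inclusion probabilities would be posterior means of the sampled selection indicators.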
$$\delta_{jq} = \beta_{j,q+1} - \beta_{jq}, \quad q = 1, \ldots, Q_j - 1.$$
For p ordinal predictors, the Lasso prior may be taken on p sets of differences:
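$$
\begin{aligned}
\delta_{jq} &\sim N(0, \tau^2\eta_j^2), \quad j = 1, \ldots, p;\ q = 1, \ldots, Q_j - 1,\\
\tau^2 &\sim \text{IG}(a, b),
\end{aligned}
$$
$$
\begin{aligned}
\delta_{jq} &\sim N(0, \tau_j^2), \quad j = 1, \ldots, p;\ q = 1, \ldots, Q_j - 1,\\
\tau_j^2 &\sim \text{IG}(a, b).
\end{aligned}
$$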
These types of penalty apply shrinkage to groups of parameters representing the same
categorical predictor, and also smooth over successive ordered categories. They can be
combined with spike-slab discrete mixture priors where the spike either sets coefficients
to zero, or scales the coefficient variances to very small positive values (i.e. effective
exclusion).
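The link between the successive differences and the category coefficients can be sketched in a few lines. The Python fragment below assumes the first category is the reference, with coefficient zero, and recovers the remaining coefficients by cumulating the differences; a difference of exactly zero then fuses two adjacent categories. The function name and the numerical values are hypothetical.

```python
from itertools import accumulate


def coefficients_from_differences(deltas):
    """Rebuild ordinal coefficients from successive differences.

    Assumes beta_1 = 0 (reference category) and beta_{q+1} = beta_q + delta_q.
    """
    return [0.0] + list(accumulate(deltas))


# Hypothetical differences for a predictor with four ordered categories;
# the zero in second position fuses categories 2 and 3.
beta = coefficients_from_differences([0.5, 0.0, 0.25])
```

Shrinking a difference towards zero therefore smooths the coefficient profile over adjacent categories rather than shrinking each coefficient towards the reference level.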
For nominal predictors, the regularising prior can be applied to all possible contrasts
between categories (Wagner and Pauger, 2016). Thus, with βj1 = 0, consider contrasts
$$
\begin{aligned}
\delta_{jqr} &= \beta_{jq} - \beta_{jr}, \quad q = 1, \ldots, Q_j - 1;\ q > r,\\
\beta_{jq} &\sim N(0, \tau_j^2), \quad q = 1, \ldots, Q_j,
\end{aligned}
$$
with identifiability possibly enforced by centring (Tingley, 2012). Xu and Ghosh (2015) con-
sider the hierarchical group lasso representation
$$
\begin{aligned}
\beta_{jq} &\sim N(0, \tau_j^2),\\
\tau_j^2 &\sim \text{Ga}\left(\frac{Q_j + 1}{2}, \frac{\lambda^2}{2}\right).
\end{aligned}
$$
This scheme can be combined with a spike-slab discrete mixture with indicators $\gamma_j$,
where the spike option $\gamma_j = 0$ sets the entire group j coefficient set
$(\beta_{j1}, \beta_{j2}, \ldots, \beta_{jQ_j})$ to zero. This is consistent with a principle
of group level sparsity. To allow for coefficient selection within groups one can use
products of selection indicators $\gamma_j\gamma_{jq}$, with $\gamma_j$ and $\gamma_{jq}$
following separate Bernoulli densities (Xu and Ghosh, 2015, p.924).
$$y_{ij} = \alpha + \beta_i + e_{ij}, \quad i = 1, \ldots, n_I;\ j = 1, \ldots, n_J,$$
where $e_{ij} \sim N(0, \sigma_e^2)$, and factor effects $\beta_i$ are estimated either as
fixed or random effects. Classical assessment of the hypothesis
$\beta_1 = \cdots = \beta_{n_I} = 0$ uses F tests comparing mean squares due to the factor and
the errors.
One possible perspective, broadening into multilevel applications (Geinitz et al., 2015),
considers the parameters of this model as random effects (Gelman, 2005). For example, as
a baseline representation for the one-way ANOVA,
$$
\begin{aligned}
y_{ij} &\sim N(\beta_i, \sigma_e^2),\\
\beta_i &\sim N(\alpha, \sigma_\beta^2),
\end{aligned}
$$
with significant variation in the factor assessed by the comparison
$\Pr(\sigma_\beta > \sigma_e \mid y)$, which can be estimated from MCMC sampling. An analogous
comparison may be made using the marginal variance $s_\beta$, estimated from the sampled
$\beta_i$ during MCMC runs. The latter can be regarded as estimating the variance over the
observed units, rather than some broader population of units.
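Both comparisons reduce to simple summaries of MCMC output. The sketch below estimates the posterior probability that one scale exceeds another as the proportion of iterations in which it does so, and computes a per-iteration spread of the sampled factor effects. It assumes, as one common convention, that the spread over observed units is summarised by a standard deviation with an n − 1 divisor; the draws shown are invented placeholders rather than output from a fitted model.

```python
import math


def prob_exceeds(sigma_beta_draws, sigma_e_draws):
    """Estimate Pr(sigma_beta > sigma_e | y) from paired MCMC draws."""
    pairs = list(zip(sigma_beta_draws, sigma_e_draws))
    return sum(1 for sb, se in pairs if sb > se) / len(pairs)


def finite_population_sd(beta_draw):
    """Spread of the sampled beta_i at one iteration (n - 1 divisor assumed)."""
    n = len(beta_draw)
    mean = sum(beta_draw) / n
    return math.sqrt(sum((b - mean) ** 2 for b in beta_draw) / (n - 1))


# Hypothetical draws: four iterations of (sigma_beta, sigma_e)
prob = prob_exceeds([2.0, 1.0, 3.0, 0.5], [1.0, 1.5, 1.0, 1.0])
# Hypothetical factor effects sampled at a single iteration
s_beta = finite_population_sd([1.0, 2.0, 3.0])
```

Applying `finite_population_sd` at every iteration gives a posterior sample for the spread over observed units, which can be compared with the residual scale in the same way.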
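$$
\begin{aligned}
\beta_j &\sim N(0, \eta_j^2), \quad j = 1, 2,\\
\beta_{j1} &= 0,\\
\beta_{j,q+1} &= \beta_{jq} + \delta_{jq}, \quad j = 3, 4;\ q = 1, \ldots, Q_j - 1,\\
\delta_{jq} &\sim N(0, \eta_j^2),\\
\lambda &\sim U(0.001, 100).
\end{aligned}
$$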
This model produces an improved penalised deviance (Plummer, 2008), at 197 com-
pared to 202 under the first model (though not a better Brier score). There is an enhanced
role for the width predictor, with a shrinkage in the coefficient for weight (Table 7.3).
This effect may reflect better control for impact of multicollinearity (the correlation
between weight and width is 0.89). There is also an attenuation in the impacts of the
colour categories, though there is still a 95% probability that the impact of medium
colour is positive.
Posterior inference may be sensitive to the prior for λ, with large values producing
overshrinkage (Xu and Ghosh, 2015). A sensitivity analysis assumes a prior λ ~ E(1). This
TABLE 7.3
Horseshoe Crabs. Logistic Regression for Satellite Presence
Without Selection
Predictor (Category)    Parameter   Mean    2.5%    97.5%   Pr(βj > 0)
Width                   β1          0.58    −0.27   1.44    0.91
Weight                  β2          0.51    −0.32   1.38    0.89
Spine (category 2)      β32         −0.11   −1.55   1.30    0.44
Spine (category 3)      β33         0.44    −0.57   1.45    0.81
Color (category 2)      β42         1.27    0.06    2.51    0.98
Color (category 3)      β43         1.66    0.54    2.79    1.00
Color (category 4)      β44         1.86    −0.05   3.90    0.97
Lasso Prior, λ ~ U(0.001,100)
Predictor (Category)    Parameter   Mean    2.5%    97.5%   Pr(βj > 0)
Width                   β1          0.53    −0.06   1.18    0.95
Weight                  β2          0.43    −0.12   1.10    0.92
Spine (category 2)      β32         0.01    −0.73   0.73    0.52
Spine (category 3)      β33         0.15    −0.48   0.93    0.66
Color (category 2)      β42         0.49    −0.16   1.54    0.90
Color (category 3)      β43         0.79    −0.06   1.96    0.95
Color (category 4)      β44         0.82    −0.23   2.40    0.91
Laplace scale           λ           3.4     0.9     8.4
Lasso Prior, λ ~ E(1)
Predictor (Category)    Parameter   Mean    2.5%    97.5%   Pr(βj > 0)
Width                   β1          0.54    −0.08   1.25    0.95
Weight                  β2          0.47    −0.17   1.16    0.92
Spine (category 2)      β32         −0.01   −0.83   0.77    0.51
Spine (category 3)      β33         0.18    −0.53   1.02    0.67
Color (category 2)      β42         0.58    −0.14   1.60    0.93
Color (category 3)      β43         0.93    0.01    2.06    0.98
Color (category 4)      β44         0.97    −0.20   2.53    0.94
Laplace scale           λ           2.1     0.7     4.2
Lasso Prior and Selection
Predictor (Category)    Parameter   Mean    2.5%    97.5%   Pr(βj > 0)
Width                   β1          0.59    −0.03   1.34    0.90
Weight                  β2          0.42    −0.11   1.26    0.82
Spine (category 2)      β32         0.04    −0.56   0.82    0.54
Spine (category 3)      β33         0.17    −0.44   1.01    0.65
Color (category 2)      β42         0.72    −0.09   2.00    0.88
Color (category 3)      β43         0.93    −0.02   2.24    0.94
Color (category 4)      β44         0.98    −0.20   2.63    0.92
Laplace scale           λ           1.6     0.4     3.6
Selection probability   ω           0.64    0.15    0.98
produces both a lower deviance and improved Brier score as compared to a model without
selection. There is still considerable shrinkage in the colour coefficients.
A third model introduces selection indicators $\gamma_j$, such that when $\gamma_j = 0$ the
variances $\eta_j^2$ are scaled by a small constant ρ. This may be preset or assigned a prior
centred on an informative value. Here ρ = 0.0001, and the prior λ ~ E(1) is retained. This
model enables one to assess fusion probabilities, namely, that successive ordinal category
coefficients are equated. Focusing on the colour coefficients, we find a probability of 0.33
(fusecol in the code) that $\beta_{42} = \beta_{43} = \beta_{44}$, amalgamating over
iterations where $\beta_{42}$ is retained or excluded.
$$
\begin{aligned}
\beta_i &\sim N(\alpha, \sigma_\beta^2/\delta_i),\\
\delta_i &\sim \text{Ga}(2, 2).
\end{aligned}
$$
This analysis provides δ2 = 0.66 as the lowest estimated scale parameter. Downweighting
the influence of rail 2 leads to a higher posterior mean (sd) for α of 67.3 (11.1), while σβ
is reduced to 19.7 (7.5).
A third analysis assumes a double exponential prior for βi,
$$\beta_i \sim \text{DE}(\alpha, \lambda/\sigma),$$
combined with an exponential E(1) prior on λ. This provides a posterior mean (sd) for α
of 68.0 (13.8) and for λ of 0.22 (0.10).
[Figure: mean, 2.5% and 97.5% profiles plotted against Number of Rail (1 to 6), vertical axis from 0 to 120.]
FIGURE 7.1
Profiles of β coefficients.
$$p(y_i \mid \theta_i, \phi) = \exp\left[\frac{y_i\theta_i - b(\theta_i)}{a_i(\phi)} + c(y_i, \phi)\right]$$
with canonical parameter $\theta_i$, dispersion function $a_i(\phi) = \phi/w_i$, and $w_i$
known. Under the generalised linear model (GLM) framework, the mean of y is
$\mu_i = E(y_i \mid \theta_i) = b'(\theta_i)$, predicted via a monotone link function
$g(\mu_i) = X_i\beta$, and the variance is
$$\text{Var}(y_i \mid \theta_i) = a_i(\phi)\,b''(\theta_i).$$
The exponential family includes as special cases the normal, binomial, Poisson, multino-
mial, negative binomial, exponential, and gamma densities.
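The exponential family relations can be verified numerically for one member of the family. The sketch below takes the Poisson case, for which $b(\theta) = \exp(\theta)$ and the dispersion function equals 1, and uses finite differences to confirm that the mean is $b'(\theta)$ and that the standard GLM variance $a(\phi)\,b''(\theta)$ equals the mean. The step sizes and the chosen mean of 3 are arbitrary illustrative settings.

```python
import math


def b_poisson(theta):
    """Cumulant function b(theta) = exp(theta) for the Poisson density."""
    return math.exp(theta)


def first_derivative(f, x, h=1e-5):
    """Central finite-difference approximation to f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


def second_derivative(f, x, h=1e-4):
    """Central finite-difference approximation to f''(x)."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / (h * h)


mu = 3.0                    # hypothetical Poisson mean
theta = math.log(mu)        # canonical parameter under the log link
a_phi = 1.0                 # Poisson dispersion function
mean = first_derivative(b_poisson, theta)
variance = a_phi * second_derivative(b_poisson, theta)
```

Equality of the two quantities is the equidispersion property that the overdispersed models in this section relax.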
However, GLM regressions often show a residual variance larger than expected under
the exponential family models, due to unknown omitted covariates, clustering in the original
units, or inter-subject variations in propensity (Zhou et al., 2012). Particular types of
response pattern (e.g. an excess proportion of zero counts as compared to the expected
frequency) may also cause overdispersion (Garay et al., 2015; Musio et al., 2010) (Section 7.6).
Without correction for such extra-variability, regression parameter estimates may be
biased, and their credible intervals will be too narrow, so that incorrect inferences about
significance may be obtained. The solution involves regression with additional random
effects to account for excess residual variation, and the focus in Markov chain Monte Carlo
(MCMC) is usually on the complete data likelihood, rather than the marginal model
obtained by integrating over the random effects.
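$$
\begin{aligned}
y_i &\sim \text{Po}(\mu_i),\\
\mu_i &\sim \text{Ga}(\alpha_i, \eta_i).
\end{aligned}
$$
Denoting the mean of the $\mu_i$ as $\xi_i = \alpha_i/\eta_i$, one obtains
$\text{Var}(\mu_i) = \alpha_i/\eta_i^2 = \xi_i^2/\alpha_i$ and
$$\text{Var}(y_i) = E(\mu_i) + \text{Var}(\mu_i) = \xi_i + \xi_i^2/\alpha_i.$$
Another possibility is to set $\mu_i = \xi_i\omega_i$ where $\omega_i \sim \text{Ga}(\alpha, \alpha)$
so that the frailties average 1, with variance ϕ = 1/α. Integrating out the $\omega_i$ leads
to a marginal negative binomial (NB2) density for the $y_i$, namely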
$$p(y_i \mid \beta, \alpha) = \frac{\Gamma(\alpha + y_i)}{\Gamma(\alpha)\Gamma(y_i + 1)}\left(\frac{\alpha}{\alpha + \xi_i}\right)^{\alpha}\left(\frac{\xi_i}{\alpha + \xi_i}\right)^{y_i}.$$
Regarding the dispersion parameter, one may adopt a Ga(a,b) prior for α, with a = 1, and
with b ~ Ga(1,0.005) as an extra unknown (Fahrmeir and Osuna, 2006).
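The NB2 density obtained by integrating out the gamma frailties can be checked directly. The sketch below codes the log of the NB2 mass function in terms of the mean and the dispersion parameter α, then confirms numerically that the probabilities sum to one and that the mean and variance equal ξ and ξ + ξ²/α. The values ξ = 3 and α = 2, the truncation point of the sum, and the function name are hypothetical choices.

```python
import math


def nb2_logpmf(y, xi, alpha):
    """Log NB2 mass function with mean xi and dispersion parameter alpha."""
    return (math.lgamma(alpha + y) - math.lgamma(alpha) - math.lgamma(y + 1)
            + alpha * math.log(alpha / (alpha + xi))
            + y * math.log(xi / (alpha + xi)))


xi, alpha = 3.0, 2.0
# Truncate the support at 400, where the remaining mass is negligible here
probs = [math.exp(nb2_logpmf(y, xi, alpha)) for y in range(400)]
total = sum(probs)
mean = sum(y * p for y, p in enumerate(probs))
variance = sum((y - mean) ** 2 * p for y, p in enumerate(probs))
```

With these settings the variance is 7.5, against 3 under a Poisson model with the same mean, which shows the quadratic overdispersion of the NB2 form.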
Setting
$$p_i = \frac{\alpha}{\alpha + \xi_i},$$
the negative binomial (NB2) can also be denoted as NB(pi, α). For example, Zhou et al. (2012)
propose predictor impacts be represented via a logit regression for pi, with the regression
including an additional error term to partly represent heterogeneity.
More general negative binomial forms have been suggested, such as the NBk
(Winkelmann and Zimmermann, 1995), with variance function
$$\text{Var}(y_i) = \xi_i + \phi\,\xi_i^{k+1},$$
obtained by taking
$$\mu_i \sim \text{Ga}\left(\frac{\xi_i^{1-k}}{\phi}, \frac{\xi_i^{-k}}{\phi}\right).$$
The values k = 0 and k = 1 lead to the NB1 and NB2 variance forms. The NB-P model (Greene,
2008) replaces α in the NB2 formulation by $\alpha\xi_i^{2-P}$, with P = 2 corresponding to
the NB2 and P = 1 to the NB1.
Nonconjugate random mixture models are often adopted for count data (Kim et al.,
2002), as in
\[ \log(\mu_i) = X_i\beta + \log(\varepsilon_i), \]
where taking
\[ \varepsilon_i \sim \mathrm{LN}\left(-\frac{\tau^2}{2}, \tau^2\right) \]
ensures $E(\varepsilon_i) = 1$, while variance matching priors can be adopted for the Poisson log-normal
and Poisson-gamma models (Millar, 2009). The nonconjugate approach is convenient when
multivariate, multiple, or multilevel random effects are to be considered. An example of
multiple effects is the convolution prior (Neyens et al., 2012) for area disease events yi with
expected totals $P_i$. Thus $y_i \sim \mathrm{Po}(P_i\mu_i)$, where
\[ \log(\mu_i) = X_i\beta + \varepsilon_i + s_i \]
and both random effects {εi,si} may account for overdispersion, but the εi are unstructured
(i.e. exchangeable with regard to area identifiers), while the si are spatially structured.
For count regressions only involving an unstructured error, one may specify
showing that the variance has a quadratic form, as for the NB2 form of the negative
binomial.
\[ y_i \sim \mathrm{Po}(\xi_i\omega_i), \]
\[ \omega_i \sim \mathrm{Ga}(\alpha, \alpha), \]
2016). Thus, one has NB(pi, αi), where log(αi) = Xiδ. This model shows dispersion to be
greater among general program students, though shows no benefit in terms of fit, with
higher LOO-IC than the Poisson-gamma model.
The suitability of NB1 or NB2 forms of negative binomial regression may be assessed
using the NBP (negative binomial P) model. To assess this, an exponential prior centred
at 2 is used for P following the Greene (2008) approach. We obtain posterior mean (95%
interval) for P of 1.23 (0.73,1.65). This excludes 2, and so does not favour the NB2 param-
eterisation. The LOO-IC is improved as compared to the NB2 model (Poisson-gamma
version), namely 1540 vs 1559. Substantive inferences are affected in that the gender
effect β3 now has an entirely positive 95% credible interval, (0.02,0.45).
\[ p_i(1 - p_i)/(\gamma + 1). \]
\[ g(p_i) = X_i\beta. \]
Nonconjugate random mixture models are often adopted for binomial data, with normal
or Student t errors in the regression link. The presence of an error term permits predic-
tor selection using a g-prior approach (Gerlach et al., 2002; Kinney and Dunson, 2007) in
mixed logistic models. For y binomial or binary with probabilities pi, and n observations,
one might specify
\[ \mathrm{logit}(p_i) = X_i\beta + \varepsilon_i, \]
\[ \varepsilon_i \sim N(0, \sigma^2) \]
with
and g an unknown, with prior such as g ~ IG(0.5, 0.5n) (Zellner and Siow, 1980; Perrakis
et al., 2015). This prior can be combined with spike-slab binary selection indicators. For y
binary, and data augmentation (Section 7.5), the g-prior can also be used.
For multinomial data (e.g. on voting patterns yij for parties j by constituency i), over-
dispersion may occur when choice probabilities vary between the Ni individuals in each
observation unit, but clusters of individuals within each unit have similar probabilities.
The individual-level factors associated with such clustering are not observed, so a ran-
dom effect will proxy such unobserved factors; for example, voters with different educa-
tion levels may differ in their voting preferences, but only the average education in each
constituency is observed. The raw percentages yij/Ni are also likely to show erratic fea-
tures, whereas hierarchical models for pooling strength provide frequency smoothing and
model interdependencies between categories.
This form of data may be modelled as a product multinomial likelihood conditioning on
known Ni = yi+. With probabilities πij of choices j = 1,… J , the sampling model is
for $j = 1, \ldots, J$, where the $\alpha_i$ would typically be fixed effects assigned vague priors e.g.
$\alpha_i \sim N(0, 1000)$.
Regression Techniques Using Hierarchical Priors 269
a) x1, the proportion of each county’s votes for different presidential candidates in
1996;
b) x2, changes between 1996 and 2000 in party registration;
c) x3, percentage of census population Cuban in district i.
Specification of x1 and x2 (predictors specific for area and candidate) follows Mebane
and Sekhon (2004), but x3 differs from their variable. Mebane and Sekhon (2004) find
substantial overdispersion in these data.
The sampling model for the random effects multinomial is
\[ \pi_{ij} = \phi_{ij} \Big/ \sum_{j} \phi_{ij}, \]
and to account for overdispersion, normal effects $\alpha_i = \{\alpha_{i1}, \ldots, \alpha_{i(J-1)}\}$ are included in
multiple logit links. These have non-zero means $A_j$, namely the intercepts for the first
four candidate choices. In a hierarchical parameterisation
\[ \log(\phi_{ij}) = \alpha_{ij}, \]
\[ H_{ij} = A_j + X_i\beta_j, \]
\[ \log(\phi_{iJ}) = 0, \]
with $\{\beta_{jk};\ j = 1, \ldots, 4,\ k = 1, \ldots, 3\}$ and $A_j$ assigned diffuse priors, and the precision matrix $D^{-1}$
assigned a Wishart prior with identity scale matrix and J − 1 = 4 degrees of freedom2.
MCMC convergence is considerably assisted by the hierarchical parameterisation
above and by centring predictors. Inferences are based on the second half of a two-chain
run of 20,000 iterations. The β coefficients show 1996 voting to influence later voting,
except for Nader, while change in party registration is important for all candidates,
except Gore. The proportion of Cuban-Americans has a positive effect on voting for
Gore and Bush. A posterior predictive check comparing chi-square values for replicate
and observed data is satisfactory at around 0.49.
Posterior predictive checks are not satisfactory when a fixed effects only model is
applied with
\[ \log(\phi_{ij}) = A_j + X_i\beta_j, \quad j = 1, \ldots, J - 1. \]
There is then zero posterior probability that $\chi^2(y_{\mathrm{rep}}, \theta) > \chi^2(y, \theta)$. Standard deviations
of predictor effects are also considerably understated if allowance is not made for excess
variation, and, in fact, all coefficient effects are significant (95% CRI either entirely posi-
tive or negative) under this model.
\[ U_{ji} = V_{ji} + \varepsilon_{ji} = X_i\beta_j^* + \varepsilon_{ji}, \]
\[ y_i^* = U_{1i} - U_{0i}. \]
\[ \Pr(y_i = 1) = \Pr(y_i^* > 0) = \Pr(\varepsilon_{0i} - \varepsilon_{1i} < V_{1i} - V_{0i}) = \Pr(\varepsilon_{0i} - \varepsilon_{1i} < X_i\beta) \]
where $\beta = \beta_1^* - \beta_0^*$. Alternative forms for $\varepsilon$ lead to different links: taking $\varepsilon_{ji}$ to be normal with
mean zero and variance $\sigma^2$ leads to a probit link with $\Pr(y_i = 1) = \Phi(X_i\beta/\sigma)$. It is apparent
that $\beta$ and $\sigma$ cannot be separately identified, and the commonest identifying device takes
$\sigma^2 = 1$.
A probit regression with binary responses $y_i$ may therefore be obtained by truncated
normal sampling for $y_i^*$, with the form of constraint determined by the observed y. Thus, if
$y_i = 1$, $y_i^*$ is constrained to be positive, and sampled from a normal with mean $X_i\beta$ (including
an intercept in p-dimensional $X_i$) and variance 1. If $y_i = 0$, $y_i^*$ is sampled from the same
density, but constrained to be negative. With a normal prior on the coefficients $\beta \sim N_p(B_0, V_0)$,
the full conditional distribution of $\beta$ is also normal, namely
\[ \beta | y^* \sim N(B, V), \]
\[ B = V(V_0^{-1}B_0 + X'y^*), \]
\[ V = (V_0^{-1} + X'X)^{-1}. \]
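The two-step scheme just described (truncated normal draws for $y_i^*$, then a normal draw for $\beta$) can be sketched for the simplest case of a single predictor without intercept, so that $V$ and $B$ are scalars. This is a minimal sketch under stated assumptions: the function names, the toy data, the prior $N(0, 100)$ and the inverse-CDF method for truncated sampling are all illustrative choices, and the inverse-CDF step is only adequate when $|X_i\beta|$ is moderate.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used for cdf and inverse cdf


def sample_truncated_normal(mean, positive, rng):
    """Draw y* from N(mean, 1) truncated to (0, inf) if positive, else (-inf, 0)."""
    p_neg = STD_NORMAL.cdf(-mean)  # Pr(y* < 0) under N(mean, 1)
    u = rng.uniform(p_neg, 1.0) if positive else rng.uniform(0.0, p_neg)
    u = min(max(u, 1e-12), 1.0 - 1e-12)  # keep inv_cdf away from 0 and 1
    return mean + STD_NORMAL.inv_cdf(u)


def probit_gibbs(x, y, n_iter=1500, b0=0.0, v0=100.0, seed=1):
    """Gibbs sampler for a scalar probit coefficient via latent y*."""
    rng = random.Random(seed)
    v = 1.0 / (1.0 / v0 + sum(xi * xi for xi in x))  # V = (V0^-1 + X'X)^-1
    beta, draws = 0.0, []
    for _ in range(n_iter):
        # Step 1: latent responses, constrained by the observed y
        ystar = [sample_truncated_normal(beta * xi, yi == 1, rng)
                 for xi, yi in zip(x, y)]
        # Step 2: B = V (V0^-1 B0 + X'y*), then beta ~ N(B, V)
        b = v * (b0 / v0 + sum(xi * zi for xi, zi in zip(x, ystar)))
        beta = rng.gauss(b, v ** 0.5)
        draws.append(beta)
    return draws


# Toy data (hypothetical): larger x mostly goes with y = 1
x = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0, -0.3, 0.3]
y = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
draws = probit_gibbs(x, y)
posterior_mean = sum(draws[500:]) / len(draws[500:])
print(round(posterior_mean, 2))
```

Because $V$ does not depend on $y^*$, it is computed once outside the loop; only $B$ changes between iterations.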
Improved MCMC mixing is obtained by updating y* and β jointly (Holmes and Held,
2006), and justified by the factorisation,
\[ p(\beta, y^* | y) = p(y^* | y)\, p(\beta | y^*) \]
where updating of β is as above, but y* is updated from its marginal distribution integrated
over β.
Heavier tailed links are obtained by sampling yi* directly from a Student t with ν degrees
of freedom, or by using the scale mixture version of the Student t density (Chang et al., 2006).
This again involves constrained normal sampling but with gamma subject-specific precisions
$\lambda_i \sim \mathrm{Ga}(\nu/2, \nu/2)$, so that
Skew densities for ε have also been proposed, such as a skew-probit link (Bazan et al., 2010)
with augmentation scheme
\[ y_i^* = X_i\beta + \varepsilon_i, \]
\[ \varepsilon_i = \sigma\left[-\varphi V_i - (1 - \varphi^2) W_i\right], \]
where $V_i$ is half normal, $V_i \sim N^+(0, 1)$, $W_i \sim N(0, 1)$, $\varphi \sim U(-1, 1)$, and $\sigma = 1$ for identifiability.
In hierarchical form, one has
Taking ε to be logistic, a logit regression is obtainable (e.g. Holmes and Held, 2006), by the
augmentation scheme
with variance $\kappa^2/\tau^2$, where $\kappa^2 = \pi^2/3$. A logit link can be approximated by Student t
sampling when $\nu = 8$, or equivalently by scale mixture normal sampling with $\lambda_i \sim \mathrm{Ga}(\nu/2, \nu/2)$,
combined with constrained sampling according to the observed y values. Specifically, a $t_8$
variable is approximately 0.634 times a logistic variable, so that
\[ y_i^* \sim t_8\left(X_i\beta, \frac{1}{0.634^2}\right) I(0, \infty), \quad \text{when } y_i = 1, \]
\[ y_i^* \sim t_8\left(X_i\beta, \frac{1}{0.634^2}\right) I(-\infty, 0), \quad \text{when } y_i = 0, \]
or, in scale mixture form,
\[ y_i^* \sim N\left(X_i\beta, \frac{1}{\lambda_i (0.634)^2}\right) I(0, \infty), \quad \text{when } y_i = 1, \]
\[ y_i^* \sim N\left(X_i\beta, \frac{1}{\lambda_i (0.634)^2}\right) I(-\infty, 0), \quad \text{when } y_i = 0. \]
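The quality of the $t_8$ approximation to the logistic can be examined by simulation. The sketch below draws from the scale mixture form without the truncation (normal draws with precision $\lambda_i (0.634)^2$, $\lambda_i \sim \mathrm{Ga}(4, 4)$) and compares the empirical distribution function with the logistic one. The function names, seed, sample size and evaluation points are illustrative assumptions.

```python
import math
import random


def scaled_t8_draw(rng, nu=8.0, scale=1.0 / 0.634):
    """One draw from t_nu rescaled by 1/0.634, via the normal scale mixture."""
    # random.gammavariate takes (shape, scale), so rate nu/2 means scale 2/nu
    lam = rng.gammavariate(nu / 2.0, 2.0 / nu)
    return scale * rng.gauss(0.0, 1.0) / math.sqrt(lam)


def logistic_cdf(x):
    """Standard logistic distribution function."""
    return 1.0 / (1.0 + math.exp(-x))


rng = random.Random(42)
draws = [scaled_t8_draw(rng) for _ in range(50000)]
points = (-2.0, -1.0, 0.0, 1.0, 2.0)
empirical = [sum(d <= p for d in draws) / len(draws) for p in points]
for p, e in zip(points, empirical):
    print(p, round(e, 3), round(logistic_cdf(p), 3))
```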
\[ \Pr(y_i = k) \propto \exp\{X_i\beta\} \]
and a latent exponential variable version (Scott, 2011) of the logit link involves sampling
$\{z_{0i}, z_{1i}\}$ from exponential densities $E(\lambda_{ji})$, with parameters $\lambda_{0i} = 1$ and $\lambda_{1i} = \exp(X_i\beta)$.
If $y_i = \arg\min(z_{0i}, z_{1i})$, then $\Pr(y_i = k|X_i) \propto \lambda_{ki}$ as under a logit regression. This principle
extends to multiple logit regression by sampling $\{z_{0i}, z_{1i}, \ldots, z_{J-1,i}\}$.
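The exponential race representation is easy to verify numerically: with rates 1 and $\exp(X_i\beta)$, the category whose exponential variable is smaller should be category 1 with the logit probability. A minimal sketch, in which `race_choice`, the value $X_i\beta = 0.8$, the seed and the number of replicates are hypothetical choices:

```python
import math
import random


def race_choice(x_beta, rng):
    """Return the index (0 or 1) of the smaller of two exponential variables."""
    z0 = rng.expovariate(1.0)               # rate lambda_0 = 1
    z1 = rng.expovariate(math.exp(x_beta))  # rate lambda_1 = exp(X beta)
    return 1 if z1 < z0 else 0


rng = random.Random(7)
x_beta = 0.8
n = 40000
freq_one = sum(race_choice(x_beta, rng) for _ in range(n)) / n
logit_prob = 1.0 / (1.0 + math.exp(-x_beta))
print(round(freq_one, 3), round(logit_prob, 3))
```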
Augmented data sampling for the logit model can also be achieved using a discrete mix-
ture approximation of the type 1 extreme value error (Fruhwirth-Schnatter and Fruhwirth,
2010). With U0i and U1i as utilities of categories 0 and 1, and
\[ U_{1i} = X_i\beta + \varepsilon_i, \]
the binary logit is obtained when $U_{0i}$ and $\varepsilon_i$ follow type 1 extreme value distributions.
Using the relation between the exponential and type 1 extreme value distributions, and with
$\nu_i = \exp(X_i\beta)$, one has
When $y_i = 1$, one has $U_{1i} > U_{0i}$, or equivalently $\exp(-U_{1i}) < \exp(-U_{0i})$, so that,
When $y_i = 0$, one has $U_{0i} > U_{1i}$, or equivalently $\exp(-U_{0i}) < \exp(-U_{1i})$, so that
\[ \exp(-U_{0i}) \sim E(1 + \nu_i), \]
where $\delta_i \sim E(\nu_i)$.
A useful diagnostic feature resulting from the augmented data approach is that the residuals
$y_i^* - X_i\beta$ are nominally a random sample from the assumed cumulative distribution
for $\varepsilon$ (Johnson and Albert, 1999). So for the latent data probit, $\varepsilon_i = y_i^* - X_i\beta$ is approximately
N(0,1) if the model is appropriate for case i, whereas if the posterior distribution of $\varepsilon_i$ is
significantly different from N(0,1) then the model conflicts with the observed y. So one might
obtain the probability $\Pr(|\varepsilon_i| > 2 | y)$ and compare to its prior value of 0.045. For the latent
data logit, one may obtain $\Pr(|\varepsilon_i|/\kappa > 2 | y)$.
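The prior reference values for these tail probabilities follow directly from the normal and logistic distribution functions. The sketch below computes them, taking $\kappa = \pi/\sqrt{3}$ as the logistic standard deviation; the function names are hypothetical.

```python
import math
from statistics import NormalDist


def probit_tail(threshold=2.0):
    """Prior Pr(|eps| > threshold) for eps ~ N(0, 1)."""
    return 2.0 * (1.0 - NormalDist().cdf(threshold))


def logit_tail(threshold=2.0):
    """Prior Pr(|eps| / kappa > threshold) for standard logistic eps."""
    kappa = math.pi / math.sqrt(3.0)  # standard deviation of the logistic
    # Pr(eps > c) = 1 / (1 + exp(c)) for the standard logistic
    return 2.0 / (1.0 + math.exp(threshold * kappa))


print(round(probit_tail(), 4))  # close to the 0.045 quoted in the text
print(round(logit_tail(), 4))
```

Posterior estimates of the same probabilities for case i can then be compared against these reference values.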
\[ y_i = j \quad \text{if} \quad \alpha_{j-1} \le y_i^* < \alpha_j. \]
The $\alpha_j$ are cutpoints dividing the values of $y^*$ according to the observed y values (Bürkner
and Vuorre, 2018). The regression in the latent data is then
\[ y_i^* = X_i\beta_j + \varepsilon_{ji} \]
TABLE 7.4
Birthweight Data. Probit Regressions
                     Probit                        Probit Horseshoe                        Probit g-prior
Predictor  Name      β Mean    β St devn Pr(βj>0)  β Mean    β St devn κj Mean   Pr(βj>0)  β Mean    β St devn γ Mean
x1         Age       0.26      0.16      0.948     0.14      0.14      0.77      0.843     0.06      0.12      0.38
x2         Lwt       −0.36     0.14      0.003     −0.26     0.14      0.72      0.023     −0.22     0.16      0.78
x3         Black     0.59      0.33      0.960     0.35      0.32      0.66      0.878     0.11      0.23      0.28
x4         Other     0.51      0.28      0.972     0.32      0.27      0.68      0.889     0.11      0.21      0.33
x5         Smoke     0.81      0.27      1.000     0.51      0.27      0.60      0.974     0.31      0.28      0.67
x6         Ptl       0.29      0.20      0.924     0.27      0.20      0.70      0.920     0.23      0.24      0.59
x7         Ht        1.20      0.44      0.997     0.84      0.44      0.51      0.980     0.83      0.55      0.84
x8         Ui        1.11      0.40      0.996     0.62      0.36      0.57      0.965     0.30      0.38      0.48
x9         Ftv       −0.09     0.13      0.260     −0.04     0.10      0.82      0.328     −0.01     0.05      0.20
x10        ui*smoke  −1.00     0.57      0.042     −0.30     0.47      0.67      0.271     −0.03     0.31      0.26
x11        ftv*age   −0.49     0.15      0.000     −0.38     0.14      0.64      0.002     −0.31     0.13      0.96
where $\varepsilon_{ji}$ is usually either normally or logistically distributed (Albert and Chib, 2001). So
$P(\varepsilon) = \Phi(\varepsilon)$, where $\Phi$ is the cumulative normal function, or $P(\varepsilon) = 1/(1 + \exp(-\varepsilon))$, the
cumulative logistic.
The corresponding model for cumulative probabilities is
or
according to the assumed form for $\varepsilon_{ji}$. Let $\gamma_{ji} = \Pr(y_i^* \le \alpha_j)$, then
\[ \alpha_1 = \Delta_1, \]
one may, however, specify unconstrained normal priors, such as $\Delta_j \sim N(0, V_\Delta)$ where $V_\Delta$ is
preset or possibly itself unknown.
An equivalent specification of this model involves sets of J − 1 binary variables for each
subject, namely zji = 1 if yi ≤ j, and zji = 0 otherwise. So if J = 3, and if yi = 1, then z1i = 1, z2i = 1; if
yi = 2, then z1i = 0, z2i = 1. So, for ε normal,
\[ y_i = X_i\beta + \varepsilon_i, \]
where $\varepsilon_i \sim N(0, \sigma^2)$. Potential limitations of this specification reflect both the assumptions
regarding errors and the same form of regression effect for all subjects. Assumptions of
discrete regression (e.g. Poisson, binomial) are also vitiated by excess observations at par-
ticular outcomes (e.g. clumping at zero).
\[ \varepsilon_i = z_i/\lambda_i^{0.5} \]
where the $\lambda_i$ are independent positive random variables. The $t_\nu$ distribution results as a
scale mixture of normal distributions by taking $\lambda_i$ gamma with shape and rate both equal to $\nu/2$, with
the Cauchy when $\nu = 1$ (Boris Choy and Chan, 2008). An alternative scale mixture of uniforms
method can lead to both heavier and lighter tails than the normal (Qin et al., 2000).
Other approaches to heteroscedasticity include variance transformation and variance
regression modelling (Cepeda and Gamerman, 2000; Chib and Greenberg, 2013). As an
alternative to the canonical linear regression, one may consider a heteroscedastic model
(Wang and Zhou, 2007)
\[ y_i = \eta_i + \sigma_i z_i, \]
\[ \sigma_i = \exp(\eta_i), \]
\[ \sigma_i = \sigma\eta_i^{\lambda}, \]
or
\[ y_i = \alpha + \sum_{j=1}^{p} x_{ji}(\beta_j + v_{ji}) + \varepsilon_i, \]
where {v1i , … , vpi } are zero mean random effects. Random regression effects are often
applied to structured data (e.g. time or spatially configured data).
Variation in regression effects is also approached using discrete regression mixtures,
with form
\[ p(y_i|X_i) = \sum_{k=1}^{K} \pi_k f_k(X_i, \beta_k, \phi_k), \]
where βk are component specific regression effects, and ϕk are other parameters defining
densities f k. Such mixtures are useful for detecting subpopulations with different behav-
iours, while accounting for excess heterogeneity (e.g. overdispersion) related to varying
regression relationships. Examples include normal regression mixtures
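\[ p(y_i|X_i) = \sum_{k=1}^{K} \pi_k N\left(\alpha_k + \sum_{j=1}^{p} x_{ji}\beta_{jk},\ \sigma_k^2\right), \]
Poisson regression mixtures
\[ p(y_i|X_i) = \sum_{k=1}^{K} \pi_k \mathrm{Po}\left(\exp\left(\alpha_k + \sum_{j=1}^{p} x_{ji}\beta_{jk}\right)\right), \]
and logistic regression mixtures
\[ p(y_i|X_i) = \sum_{k=1}^{K} \pi_k \mathrm{Bern}\left(\frac{\exp\left(\alpha_k + \sum_{j=1}^{p} x_{ji}\beta_{jk}\right)}{1 + \exp\left(\alpha_k + \sum_{j=1}^{p} x_{ji}\beta_{jk}\right)}\right). \]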
The probabilities πk for the components may be predicted for each individual via regres-
sion (e.g. multinomial logit).
In Bayesian applications, MCMC sampling is facilitated by the introduction of latent allocation
indicators $G_i \in (1, \ldots, K)$, with full conditionals based on multinomial probabilities
\[ \pi_k f_k(X_i, \beta_k, \phi_k) \Big/ \sum_{k=1}^{K} \pi_k f_k(X_i, \beta_k, \phi_k). \]
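For a Poisson regression mixture with one predictor, the allocation step can be sketched as follows: the component probabilities for case i are the weighted component likelihoods, normalised to sum to one, and $G_i$ is drawn from them. The function names and the parameter values in the usage lines are hypothetical, and the subtraction of the largest log term is a numerical safeguard added here rather than part of the model.

```python
import math
import random


def poisson_logpmf(y, mu):
    """Log Poisson probability of count y with mean mu."""
    return y * math.log(mu) - mu - math.lgamma(y + 1)


def allocation_probs(y, x, weights, alphas, betas):
    """Full conditional probabilities Pr(G_i = k) for one observation."""
    log_terms = [math.log(w) + poisson_logpmf(y, math.exp(a + b * x))
                 for w, a, b in zip(weights, alphas, betas)]
    top = max(log_terms)                   # guard against underflow
    terms = [math.exp(t - top) for t in log_terms]
    total = sum(terms)
    return [t / total for t in terms]


def sample_allocation(probs, rng):
    """Draw G_i in 1..K from the multinomial full conditional."""
    return rng.choices(range(1, len(probs) + 1), weights=probs, k=1)[0]


probs = allocation_probs(y=12, x=1.0, weights=[0.7, 0.3],
                         alphas=[0.0, 2.0], betas=[0.2, 0.3])
g_i = sample_allocation(probs, random.Random(11))
print([round(p, 4) for p in probs], g_i)
```

If no case is allocated to component k at some iteration, the regression parameters of that component have no data to update from, which is the small-component problem noted in the text below.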
Certain identification and estimation issues apply to discrete regression mixtures, and a
variety of sampling and post-processing methods, and priors to gain or improve identifi-
ability, have been proposed. Different component labels cannot be distinguished during
MCMC sampling unless some identifiability constraint is imposed. Another issue involves
small components (with low probabilities πk), especially when combined with small sam-
ples, since at particular MCMC iterations, no cases may be allocated to a particular group,
so that the associated parameters are not updated.
Sampling and estimation methods for discrete regression mixtures differ in whether
they impose identifying constraints or allow switching between different numbers of
components. For example, Viele and Tong (2002) apply identifying restrictions in linear
regression mixtures. Ordering of variances may work better when the variances are well
separated, whereas ordering of particular regression parameters works well when sub-
populations are distinct in substantive terms. Such features might be established by pre-
liminary classical estimation.
In a zero-inflated Poisson (ZIP) model, or more generally a zero-modified Poisson (ZMP) model
(Conceição et al., 2013), zero counts may be either true zeroes, or result from a stochastic
mechanism, when the process is “active,” but sometimes produces zero events. A distinc-
tion is similarly made between structural and random zeroes (Martin et al., 2005).
Denote the active stochastic mechanism as f(y), and let di = 1 for true zeroes as against
stochastic zeroes, obtained when di = 0. Setting Pr(di = 1) = w , one has for discrete f(y),
Regressors Xi may be relevant both to the binary inflation mechanism, and to the param-
eters defining the density (Czado et al., 2007). A useful representation for programming
the zero-inflated Poisson involves the data augmentation scheme (Ghosh et al., 2006):
\[ d_i \sim \mathrm{Bern}(\omega_i), \]
\[ y_i \sim \mathrm{Po}(\mu_i(1 - d_i)), \]
\[ P(y_i = 0|X_i) = \omega_i + (1 - \omega_i)e^{-\mu_i}, \]
\[ P(y_i = j|X_i) = (1 - \omega_i)e^{-\mu_i}\mu_i^{j}/j!, \quad j = 1, 2, \ldots \]
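The augmentation scheme and the marginal ZIP probabilities can be checked against each other by simulation. In this sketch the names `zip_pmf`, `poisson_draw` and `zip_draw`, and the values $\mu = 2$, $\omega = 0.3$, are illustrative assumptions; the Poisson sampler uses Knuth's multiplication method, which is adequate only for small means.

```python
import math
import random


def zip_pmf(y, mu, omega):
    """Marginal zero-inflated Poisson probability of count y."""
    poisson = math.exp(y * math.log(mu) - mu - math.lgamma(y + 1))
    return omega * (y == 0) + (1.0 - omega) * poisson


def poisson_draw(mu, rng):
    """Poisson draw by multiplying uniforms until the product drops below exp(-mu)."""
    limit, k, prod = math.exp(-mu), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def zip_draw(mu, omega, rng):
    """Draw y via the augmentation d ~ Bern(omega), y ~ Po(mu * (1 - d))."""
    d = 1 if rng.random() < omega else 0
    return poisson_draw(mu * (1 - d), rng)


rng = random.Random(2024)
mu, omega, n = 2.0, 0.3, 20000
sample = [zip_draw(mu, omega, rng) for _ in range(n)]
zero_freq = sum(v == 0 for v in sample) / n
print(round(zero_freq, 3), round(zip_pmf(0, mu, omega), 3))
```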
\[ p(y_i|X_i) = \sum_{k=1}^{2} \pi_k N\left(\alpha + \sum_{j=1}^{p} x_{ji}\beta_j,\ \sigma_k^2\right) \]
where $\sigma_2^2 \gg \sigma_1^2$, and $\pi_2$ is taken small (e.g. $\pi_2 = 0.05$). This provides variance-inflation for
outliers. Mohr (2007) advocates a two-group model allowing for both clustered outliers
(defined by similar predictor values), and for scattered outliers, generated by a variance
inflation mechanism.
\[ y_i = \eta_i = \beta_0 + \beta_1 x_i + \sigma_i z_i, \]
This yields a significant γ1, with mean (95% interval) of 0.068 (0.044,0.094), so indicating
heteroscedasticity. The penalty criterion CP (obtained by summing the posterior vari-
ances of replicates) is 1.41E+06, while the predictive fit criterion, the posterior mean of
$\sum_i (y_i - y_{\mathrm{rep},i})^2$, is CF = 2.635E+06.
A variance power model in absolute predictor values, namely
\[ \sigma_i = \sigma(1 + |\eta_i|)^{\lambda}, \]
is then applied (Bonate, 2011). A U(−2,2) prior is taken on λ, and a U(0,250) prior on σ,
which includes the observed standard deviation of 213. The final 5000 iterations of a
two-chain run of 15,000 iterations give an estimate for λ of 0.61 (0.44, 0.82), and provide
improved fit criteria (CP, CF) = (1.21E+06, 2.42E+06).
A Student t with ν degrees of freedom via normal scale mixing (centred on a single
variance parameter σ2) is then applied. A U(0.01,1) prior is applied on the inverse of the
degrees of freedom 1/ν. This shows 21 datapoints with posterior mean precision adjustment
factors κi below 0.5, and ν estimated at 2.74. Although such estimates clearly show
non-normality, the fit criteria deteriorate to (CP, CF) = (1.66E+06, 2.87E+06).
Finally, a discrete mixture regression is applied, with group varying intercept, slope,
and scale, namely
\[ G_i \sim \mathrm{Mult}(1, \pi) \]
4 components shows the lowest BIC for K = 3, and also shows a considerable differentia-
tion in the residual standard deviations between the three components.
For a Bayesian analysis in rjags, initially without predictor selection and with K = 3, an
identifying constraint based on ordering of the residual variances is adopted, with nor-
mal N(0,1000) priors on the regression coefficients βjk. A Dirichlet prior assigns weights
of 5 on each component πk. The final 5,000 iterations of a two-chain run of 15,000 itera-
tions leads to components distinguished firstly by salary level: the respective means
on the response within components are 0.4, 1.3, and 10.8 (this is the node avg.sal in the
code). The first component shows a significant effect of x3, with posterior mean (95% CrI)
of 0.60 (0.45, 0.74). The second component shows a relatively strong impact for x1, namely
1.9 (−0.1,3.9), while the third component shows a pronounced impact for x3, namely 9.0
(7.1, 10.9).
The second analysis uses Laplace priors on the regression coefficients, with Laplace
parameters and selection rate parameters component specific
\[ \beta_{jk} | \gamma_{jk} = 0 \sim \delta_0(), \]
\[ \lambda_k \sim U(0.01, 100), \]
\[ \gamma_{jk} \sim \mathrm{Bern}(\omega_k). \]
The final 5,000 of a two-chain run of 30,000 iterations shows a high retention rate only
for x3 in the first and third component, with mean (95% CrI) for the realised coefficient
$\xi_{jk} = \gamma_{jk}\beta_{jk}$ in the third component of 8.5 (6.7,10.1). The estimated λk for the second
component is relatively high, reflecting the lack of significant predictor effects.
A final analysis uses Laplace priors again, but without a binary selection mechanism.
The impact of x3 in the third component is unaffected, whereas that in the first compo-
nent is eliminated. Again, the estimated λk for the first two components are relatively
high, in line with shrinkage in predictor effects.
\[ P(y_i = 0|X_i) = \omega_i + (1 - \omega_i)e^{-\mu_i}, \]
\[ \log(\mu_i) = \beta_0 + X_i\beta, \]
\[ \mathrm{logit}(\omega_i) = \gamma_0 + X_i\gamma. \]
Anxious attachment has a significant effect in both regressions, with β2 and γ2 having
respective mean (95% CrI) coefficients of 0.13 (0.06,0.20) and −0.49 (−0.71,−0.27). The pos-
terior mean marginalised likelihood (Millar, 2009) is −805.5.
As a mixed predictive check (Marshall and Spiegelhalter, 2007), replicate zero inflation
indicators $d_i^* \sim \mathrm{Bern}(\omega_i)$ are sampled, and replicate responses sampled from the
corresponding shifted mean $y_i^* \sim \mathrm{Po}(\mu_i(1 - d_i^*))$. There is found to be only 1 case with
probabilities of overprediction, $\Pr(y_i^* > y_i) + 0.5\Pr(y_i^* = y_i)$, exceeding 0.95, but 28 cases
with probabilities under 0.05, indicating underprediction of some larger counts. Hence,
the ZIP model may not be representing the full extent of overdispersion.
A more general representation is obtained by using a zero-inflated negative binomial.
This increases the posterior mean marginalised likelihood to −570 and the number of
underpredicted cases is reduced to 15.
\[ y_t = X_t\beta + \varepsilon_t, \]
\[ \varepsilon_t = \rho\varepsilon_{t-1} + u_t, \]
where $u_t \sim N(0, \sigma^2)$ are iid and independent of $\varepsilon_{t-1}$, is an effective scheme for controlling for
temporal error dependence if (as often) most correlation from previous errors is transmitted
through the impact of $\varepsilon_{t-1}$. This assumption is widely used in longitudinal models (e.g.
Chi and Reinsel, 1989). With $\sigma_\varepsilon^2 = \mathrm{var}(\varepsilon_t)$, and assuming stationarity with |ρ| < 1, AR(1)
error dependence implies
\[ \sigma_\varepsilon^2 = \rho^2\sigma_\varepsilon^2 + \sigma^2, \]
so that $\sigma_\varepsilon^2 = \sigma^2/(1 - \rho^2)$, and the initial condition for the stationary case is
\[ \varepsilon_1 \sim N\left(0, \frac{\sigma^2}{1 - \rho^2}\right). \]
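The stationary variance can be confirmed by simulating a long AR(1) error series started from the stationary initial condition. This is a minimal sketch; `simulate_ar1`, the seed, the series length and the values $\rho = 0.6$, $\sigma = 1$ are illustrative assumptions.

```python
import math
import random


def simulate_ar1(n, rho, sigma, rng):
    """Simulate n AR(1) errors, drawing the first from the stationary density."""
    eps = [rng.gauss(0.0, sigma / math.sqrt(1.0 - rho ** 2))]
    for _ in range(n - 1):
        eps.append(rho * eps[-1] + rng.gauss(0.0, sigma))
    return eps


rng = random.Random(3)
rho, sigma = 0.6, 1.0
eps = simulate_ar1(100000, rho, sigma, rng)
eps_mean = sum(eps) / len(eps)
sample_var = sum((e - eps_mean) ** 2 for e in eps) / len(eps)
theory_var = sigma ** 2 / (1.0 - rho ** 2)
print(round(sample_var, 3), round(theory_var, 3))
```

Starting from $\varepsilon_1 = 0$ instead would give a series whose early values have smaller variance than the stationary value.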
AR(1) error dependence for non-metric responses is illustrated by the Poisson count outcomes
case $y_t \sim \mathrm{Po}(\mu_t)$ (Chan and Ledolter, 1995; Nelson and Leroux, 2006), with
\[ \log(\mu_t) = X_t\beta + \varepsilon_t, \]
\[ \varepsilon_t = \rho\varepsilon_{t-1} + u_t. \]
Bayesian analysis of AR(1) errors for count data is exemplified by Oh and Lim (2001) and
Jung et al. (2006), who also consider augmented data sampling for count responses, while
Ibrahim and Chen (2000) set out sampling algorithms under a power prior approach (that
assumes historic data with the same form of design are available).
The Durbin–Watson statistic for AR(1) error dependence, namely
\[ DW = \frac{\sum_t (e_t - e_{t-1})^2}{\sum_t e_t^2} \approx 2 - 2\frac{\sum_t e_t e_{t-1}}{\sqrt{\sum_t e_t^2 \sum_t e_{t-1}^2}} = 2 - 2\rho, \]
is often used to test temporal autocorrelation (when predictors exclude lagged responses),
and in a Bayesian context can be applied in a posterior predictive check. For example,
Spiegelhalter (1998, p.126) considers a Poisson time series for cancer cases yijt in ages
$i = 1, \ldots, I$, districts $j = 1, \ldots, J$, and years $t = 1, \ldots, T$, with $\mu_{ijt}$ being Poisson means. At each
iteration, deviance residuals $d_{ijt} = -2\log\{p(y_{ijt}|\mu_{ijt})\}$ are obtained, and an average DW statistic
derived for each age and district, namely
\[ DW_{ij} = \frac{\sum_{t=2}^{T} (d_{ijt} - d_{ij,t-1})^2}{\sum_{t=1}^{T} (d_{ijt} - d_{ij\cdot})^2}. \]
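The relation between the Durbin–Watson statistic and the lag-1 autocorrelation can be made concrete for a single residual series. In the sketch below the lag-1 coefficient is computed with $\sum_t e_t^2$ in the denominator (an assumption made for simplicity), in which case DW differs from $2 - 2r$ only by the end-point term $(e_1^2 + e_T^2)/\sum_t e_t^2$. The function names and the residual values are hypothetical.

```python
def durbin_watson(resid):
    """Sum of squared successive differences over the sum of squared residuals."""
    num = sum((resid[t] - resid[t - 1]) ** 2 for t in range(1, len(resid)))
    den = sum(e * e for e in resid)
    return num / den


def lag1_coefficient(resid):
    """Lag-1 autocorrelation estimate with sum of e_t^2 as denominator."""
    num = sum(resid[t] * resid[t - 1] for t in range(1, len(resid)))
    den = sum(e * e for e in resid)
    return num / den


resid = [0.5, 1.0, 0.8, -0.2, -0.9, -0.4, 0.3, 0.7]  # illustrative residuals
dw = durbin_watson(resid)
r1 = lag1_coefficient(resid)
end_term = (resid[0] ** 2 + resid[-1] ** 2) / sum(e * e for e in resid)
print(round(dw, 3), round(2 - 2 * r1, 3), round(end_term, 3))
```

In a posterior predictive check the same function would be applied at each MCMC iteration to the current residuals and to residuals from replicate data.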
\[ \Pr(S_t = k|S_{t-1} = j) = \pi_{jk}, \quad j, k = 1, \ldots, K. \]
\[ y_t = X_t\beta_t + \varepsilon_t, \]
\[ \beta_t = \beta_\mu + u_t \]
with ut taken as iid random effects. However, in time series contexts, it is likely that devia-
tions from the central coefficient effect βμ will be correlated with nearby deviations in time.
A flexible framework for time-varying parameter effects is provided by the linear
Gaussian state space model (Shumway, 2016), involving first order random walks in scalar
or vector coefficients βt
\[ y_t = X_t\beta_t + \varepsilon_t, \qquad \varepsilon_t \sim N_T(0, \Sigma_t), \]
\[ \beta_t = G_t\beta_{t-1} + \omega_t, \qquad \omega_t \sim N_R(0, V_t). \]
Often Gt = I, Σt = Σ, and Vt = V, but if there is stochastic volatility, the variances or log vari-
ances can also be brought into a random walk scheme.
Subject matter considerations are likely to govern the anticipated level of smoothness in
the regression effects. For example, the RW(2) scheme
\[ \beta_t = 2\beta_{t-1} - \beta_{t-2} + \omega_t, \qquad \omega_t \sim N(0, V) \]
provides a more plausible smoothly changing evolution for changing regression effects
(Beck, 1983). Dangl and Halling (2012) consider dynamic linear models for asset returns,
and formal model choice between constant regression effects with V = 0, and differing
levels of variation in βt, via a discrete prior over a set of covariance matrix discount factors.
Varying regression effects are important in particular applications of dynamic gener-
alised linear models for discrete responses (Gamerman, 1998; Fruhwirth-Schnatter and
Fruhwirth, 2007; Ferreira and Gamerman, 2000). Consider y from an exponential family
density
where the predictors $X_t$ may include past responses $\{y_{t-k}, y_{t-k}^*\}$, both observed and latent
(Fahrmeir and Tutz, 2001, p.345). For example, $y_t^*$, the latent response (e.g. utility in economic
applications) when $y_t$ is binary, may depend on previous values of both $y_t^*$ and $y_t$.
The link for $\mu_t$ involves random regression parameters
\[ g(\mu_t) = X_t\beta_t, \]
where the parameter vector evolves according to a linear Gaussian transition model,
\[ \beta_t = G_t\beta_{t-1} + \omega_t, \]
with multivariate normal errors $\omega_t \sim N_R(0, V_t)$ independent of lagged responses, and of the
initial condition $\beta_0 \sim N_R(B_0, V_0)$.
Models for binary time series with state-space priors on the coefficients have been men-
tioned in several studies. Thus, Fahrmeir and Tutz (2001) consider a binary dynamic logit
model involving trend and varying effects of a predictor and lagged response,
\[ \beta_t \sim N_3(\beta_{t-1}, V), \]
while Gamerman (1998) considers nonstationary random walk priors in a marketing application
with binomial data, where $\mathrm{logit}(p_t) = \beta_{1t} + \beta_{2t}x_t$, and $x_t$ is a measure of cumulative
advertising expenditure.
[Figure 7.2 plot omitted: vertical axis CPO, with ticks from –2 to –14.]
FIGURE 7.2
CPO estimates, seizures data.
TABLE 7.5
Seizures Data. Parameters of Markov Chain Model
        Mean     St Devn  2.5%     97.5%
β01     −6.96    0.50     −7.94    −6.10
β11     7.41     0.55     6.43     8.49
β21     −0.26    0.06     −0.36    −0.12
β31     −2.28    0.15     −2.58    −2.02
β02     −0.23    0.44     −0.99    0.64
β12     1.24     0.53     0.20     2.16
β22     −0.38    0.13     −0.61    −0.11
β32     −0.43    0.16     −0.72    −0.11
π11     0.75     0.05     0.65     0.83
π21     0.62     0.08     0.45     0.78
π12     0.25     0.05     0.17     0.35
π22     0.38     0.08     0.22     0.55
\[ \beta_{0t} \sim N(2\beta_{0,t-1} - \beta_{0,t-2},\ \sigma_{\beta_0}^2), \]
\[ \beta_{jt} \sim N(\beta_{j,t-1},\ \sigma_{\beta_j}^2), \]
with $1/\sigma_\alpha^2$ and $1/\sigma_{\beta_j}^2$ assigned Ga(1,1) priors. As a predictive check, one step ahead
predictions $y_t^* \sim \mathrm{Po}(\mu_{t-1})$ (t > 1) are used to estimate posterior exceedance probabilities
$Q_t = \Pr(y_t^* > y_t) + 0.5\Pr(y_t^* = y_t)$. Low or high values for $Q_t$ indicate failures of fit and/or
forward prediction.
For the Poisson regression with constant predictor effects (fit using jagsUI), the aver-
age (scaled) Poisson deviance is 1406, so there appears to be relatively little overdisper-
sion in relation to the 1247 observations. Of the three regression coefficients, only β1
has a 95% posterior interval that excludes zero, namely −0.12 to −0.03. One step ahead
predictive checks show 143 of 1246 values of Qt (11.5%) exceeding 0.95 or under 0.05. The
LOO-IC and LPML are respectively 7046 and −3524.
For the second model, with convergence readily obtained using rstan, a slightly
improved fit is obtained. (Convergence is delayed if an RW2 prior is adopted in the
intercept). Note that standardisation of covariates is important in this example to avoid
numeric errors. One step ahead predictive checks now show 117 of 1246 values of Qt (9.4%)
exceeding 0.95 or under 0.05. The LOO-IC and LPML are respectively 7028 and −3514.
Figures 7.3 and 7.4 plot the time-varying coefficients β2t and β3t. Significant effects of
PM10 (Figure 7.4) are limited to a central period, similar to the findings of Chiogna and
Gaetan (2002).
[Figure 7.3 plot omitted: posterior mean and 80% CRI for β2; vertical axis from about –0.10 to 0.15.]
FIGURE 7.3
Varying beta coefficients, beta2.
[Figure 7.4 plot omitted: posterior mean and 80% CRI for β3; vertical axis from about –0.05 to 0.05.]
FIGURE 7.4
Varying beta coefficients, beta3.
7.8 Spatial Regression
In spatial data modelling, just as for time series regression, there may be correlated resid-
uals, spatially varying predictor effects, and/or predictor collinearity. Correlated errors
can bias regression parameter estimates and cause standard errors to be mis-stated (Boyd
et al., 2005). Nonlinear predictor effects may contribute to residual spatial correlation
(Dormann et al., 2007), as may incorrectly assuming homogeneous regression coefficients
when a nonstationary process (a varying regression coefficient approach) is appropriate
(Fotheringham et al., 2002). Omitted spatially dependent predictors are another source
(LeSage and Dominguez, 2012) of residual spatial dependence. Regarding collinearity,
recent studies (e.g. Reich et al., 2010; Choi and Lawson, 2016) propose predictor selection
combined with spatially varying coefficients (Section 7.8.2).
\[ I = \frac{e'We/S_0}{e'e/n}, \]
where $S_0 = \sum_i\sum_j w_{ij}$, and $W = [w_{ij}]$ represents spatial interactions. Alternatively, Congdon et al.
(2007) apply a measure suggested by Fotheringham et al. (2002, p.106) obtained via linear
regression of appropriately defined residuals $e_i$ on the spatial lag $e_i^* = \sum_j w_{ij}e_j / \sum_j w_{ij}$,
\[ e_i = \rho_0 + \rho_1 e_i^* + u_i, \]
where the ui are taken as unstructured. This is done at each MCMC iteration to provide a
posterior mean and 95% intervals on the spatial correlation index ρ1, the spatial lag regres-
sion coefficient (SLRC). If the 95% interval excludes zero, then spatial correlation is present.
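Both diagnostics can be sketched for a single vector of residuals; in an MCMC setting the same calculation would be repeated at each iteration. The example assumes six areas arranged in a line with binary adjacency weights, and the function names and residual values are hypothetical. The SLRC is taken here as the least squares slope from regressing $e_i$ on $e_i^*$.

```python
def morans_i(resid, w):
    """Moran's I = (e'We / S0) / (e'e / n) for a weight matrix w (list of rows)."""
    n = len(resid)
    s0 = sum(sum(row) for row in w)
    num = sum(w[i][j] * resid[i] * resid[j] for i in range(n) for j in range(n))
    den = sum(e * e for e in resid)
    return (num / s0) / (den / n)


def spatial_lag(resid, w):
    """Weighted average of neighbouring residuals for each area."""
    return [sum(wij * ej for wij, ej in zip(row, resid)) / sum(row) for row in w]


def slrc(resid, w):
    """Slope from least squares regression of e_i on its spatial lag e_i*."""
    lag = spatial_lag(resid, w)
    n = len(resid)
    mean_e, mean_l = sum(resid) / n, sum(lag) / n
    sxy = sum((l - mean_l) * (e - mean_e) for l, e in zip(lag, resid))
    sxx = sum((l - mean_l) ** 2 for l in lag)
    return sxy / sxx


# Six areas on a line: area i neighbours areas i-1 and i+1
n_areas = 6
w = [[1 if abs(i - j) == 1 else 0 for j in range(n_areas)] for i in range(n_areas)]
smooth = [-1.5, -1.0, -0.5, 0.5, 1.0, 1.5]      # neighbours alike
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]  # neighbours opposed
print(morans_i(smooth, w), slrc(smooth, w))
print(morans_i(alternating, w), slrc(alternating, w))
```

Positive values of either index indicate that neighbouring residuals tend to be alike, negative values that they tend to be opposed.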
Standard ways to deal with spatially correlated errors are to include a spatially lagged
response as a predictor, or to incorporate spatial effects in the residual specification.
Correcting for spatial correlation in this way may affect the significance and direction of
predictor effects, as compared to a model with non-spatial error structure (Kuhn, 2007).
\[ y_i = \rho\sum_j c_{ij}y_j + X_i\beta + u_i, \]
\[ z_i = \rho\sum_j c_{ij}z_j + X_i\beta + u_i, \]
\[ u_i \sim N(0, \sigma_u^2), \]
with su2 = 1 for identifiability. In econometric or voting applications, this might amount to
expecting individuals located at similar points in space to exhibit similar choice behaviour
(Smith and Lesage, 2004). In matrix terms
\[ z = \rho Cz + X\beta + u, \]
\[ z = (I - \rho C)^{-1}X\beta + u^*, \]
\[ y_i = X_i\beta + \varepsilon_i, \]
\[ \varepsilon_i = \rho\sum_j c_{ij}\varepsilon_j + u_i, \]
with $u_i \sim N(0, \Sigma_u)$, and a maximum possible value of 1 for ρ, since the spatial weights are
standardised. A lower prior limit for ρ of 0 may be assumed, since negative values are
implausible.
Writing the equation for $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)$ as $\varepsilon = (I - \rho C)^{-1} u$, the covariance matrix for ε is
\[ (I - \rho C)^{-1} \Sigma_u (I - \rho C')^{-1}, \]
\[ L(\alpha, \rho, \sigma^2 \mid y) = \frac{1}{(2\pi\sigma^2)^{n/2}} |D'D|^{0.5} \exp\left[ -\frac{1}{2\sigma^2} (y - X\beta)' D'D (y - X\beta) \right] \]
7.8.3 Conditional Autoregression
By contrast to SAR spatial models, conditional autoregressive error schemes (Besag, 1974)
specify εi conditional on remaining effects ε[i]. One option takes unstandardised spatial
interactions with
\[ E(\varepsilon_i \mid \varepsilon_{[i]}) = \lambda \sum_{j \neq i} w_{ij} \varepsilon_j, \]
\[ \mathrm{Var}(\varepsilon_i \mid \varepsilon_{[i]}) = \sigma^2, \]
with joint covariance $\sigma^2 (I - \lambda W)^{-1}$. In this case (Bell and Broemeling, 2000), λ is constrained
by the eigenvalues $E_k$ of W, namely $\lambda \in [1/E_{\min}, 1/E_{\max}]$. Conditional variances may differ
between subjects with $M = \mathrm{diag}(\sigma_i^2)$ and the covariance is then $(I - \lambda W)^{-1} M$ (Lichstein et
al., 2002).
If predictor effects are written $\eta_i = X_i \beta$, this formulation may be restated in terms of an
own area regression effect, and a filtered effect of neighbouring regression residuals. Thus
for y metric
\[ y_i \sim N\left( \eta_i + \lambda \sum_{j \neq i} w_{ij} (y_j - \eta_j), \sigma^2 \right). \]
In many spatial health applications $y_i$ are Poisson counts, with means $\nu_i = E_i \rho_i$ where $E_i$ are
expected events, and $\rho_i$ are unknown relative risks. One may then (Bell and Broemeling,
2000; Assunção and Krainski, 2009) assume $r_i = \log(\rho_i)$ are Normal with
\[ r_i \sim N\left( \eta_i + \lambda \sum_{j \neq i} w_{ij} (r_j - \eta_j), \sigma^2 \right). \]
The other conditional autoregressive option takes standardised spatial interactions, with
conditional means and variances
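\[ E(\varepsilon_i \mid \varepsilon_{[i]}) = \kappa \sum_{j \neq i} c_{ij} \varepsilon_j, \]
\[ \mathrm{Var}(\varepsilon_i \mid \varepsilon_{[i]}) = \sigma^2 \Big/ \sum_{j \neq i} w_{ij}. \]
The joint covariance for the $\varepsilon_i$ is then $\sigma^2 (D - \kappa W)^{-1}$, where D is diagonal with $d_i = \sum_{j \neq i} w_{ij}$
(Sun et al., 1999, p.342). Equivalently, for binary $w_{ij}$, the diagonal terms of the precision
matrix are $\tau d_i$ where $\tau = 1/\sigma^2$ (Kruijer et al., 2007), while off-diagonal terms equal −τκ when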
i and j are neighbours and 0 otherwise. For κ = 1, one obtains the CAR(1) prior of Besag et
al. (1991) with joint covariance matrix no longer positive definite.
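The structure of this precision matrix can be checked numerically. The Python sketch below builds τ(D − κW) from a binary adjacency matrix and shows that its rows sum to zero when κ = 1, so that the matrix is singular, whereas row sums are positive for κ < 1. The function name car_precision, the adjacency matrix and the parameter values are illustrative assumptions rather than part of the original analysis.

```python
def car_precision(W, kappa, tau):
    """Precision matrix tau * (D - kappa * W), with D = diag(sum_j w_ij)."""
    n = len(W)
    d = [sum(row) for row in W]          # d_i = number of neighbours for binary W
    return [[tau * d[i] if i == j else -tau * kappa * W[i][j]
             for j in range(n)]
            for i in range(n)]


# Hypothetical adjacency: four areas in a line (1-2-3-4).
W = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]

Q_intrinsic = car_precision(W, kappa=1.0, tau=2.0)   # CAR(1) case
Q_proper = car_precision(W, kappa=0.5, tau=2.0)      # kappa < 1

# Row sums vanish when kappa = 1, so Q is singular and the prior improper.
print([sum(row) for row in Q_intrinsic])
# Row sums are positive when kappa < 1 (strict diagonal dominance).
print([sum(row) for row in Q_proper])
```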
\[ y_k \sim N\left( \mu_{ik}, \frac{1}{\tau_i w_{ik}} \right), \quad k = 1, \ldots, n \]
\[ W_i y = W_i X \beta_i + \varepsilon_i, \]
\[ \beta_i = (w_{i1} \otimes I_R, \ldots, w_{in} \otimes I_R) \begin{pmatrix} \beta_1 \\ \vdots \\ \beta_n \end{pmatrix} + u_i. \]
With $V_i = \mathrm{diag}(v_1, \ldots, v_n)$, the error terms have priors
\[ \varepsilon_i \sim N(0, \sigma^2 V_i), \]
with the specification on $u_i$ being a form of Zellner g-prior, in which $\delta^2$ governs adherence
to the smoothing specification.
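\[ \beta(s) \sim N(1_{n \times 1} \otimes \mu_\beta, V_\beta) \]
where $\mu_\beta = (\mu_{\beta 1}, \ldots, \mu_{\beta R})'$ contains mean regression effects, and $V_\beta$ is the nR × nR covariance
matrix defined as
\[ V_\beta = C(h) \otimes \Lambda \]
\[ \eta_i = \sum_{r=1}^{R} \beta_{ri} x_{ri}, \]
\[ p(\beta \mid \Phi) \propto |\Phi|^{n/2} \exp\left\{ -0.5 \sum_i \sum_j w_{ij} (\beta_i - \beta_j)' \Phi (\beta_i - \beta_j) \right\}, \]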
with R × R precision matrix Φ, and spatial interactions wij usually binary (wij = 1 when areas
i and j are adjacent, zero otherwise). When $y_i$ is metric with $\mu_i = \eta_i$, with residual precision
$\tau = 1/\sigma^2$, one may, following Gamerman et al. (2003), scale the covariance by τ, namely
\[ k_{ii} = w_{i+} = \sum_{j \neq i} w_{ij}, \]
\[ k_{ij} = -w_{ij}, \quad i \neq j. \]
Hence these priors are improper because the elements in each row of K add to zero.
Assunção (2003, p.460) notes that propriety can be obtained by a constraint such as
$\sum_i \beta_i = A$, where A is any preset R-vector. This consideration leads to a practical strategy
representing $\beta_i$ as $\beta_i = \mu_\beta + b_i$ where the $b_i$ follow the pairwise difference prior, but are zero
centred at each MCMC iteration, and the mean regression effect is $\mu_\beta = (\mu_{\beta 1}, \ldots, \mu_{\beta R})$. This
can be implemented using the car.normal or mvcar options in BUGS.
\[ \beta_{ij}^{(r)} = \beta_{ij} \gamma_{ij}, \]
\[ \beta_{ij} \sim N(\mu_{\beta j}, 1/\tau_{\beta j}) \]
\[ \gamma_{ij} \sim \mathrm{Bern}(\rho_{ij}), \]
where the $\rho_{ij}$ are entirely spatially structured, as under a CAR(1) prior, or admit spatial
structure, as under the Leroux et al. (1999) scheme.
with a Ga(0.5,0.0005) prior on the precision of $s_i$. This gives a considerably improved DIC
of 464, with significant effects remaining for both predictors, including an enhanced
effect of x2. Thus, β1 and β2 have respective means (95%CrI) of −0.08 (−0.10,−0.02) and 0.06
(0.01,0.12). Such a change in the strength of predictor effects demonstrates the impor-
tance of correct error specification for inferences regarding risk factors in spatial data.
Spatial correlation in residuals is removed as judged by an SLRC with 95% interval
(−1.03,0.74).
An alternative possible solution to spatially correlated residuals is to consider spatial
nonstationarity in predictor effects. Here
where the $\beta_{ki}$ follow independent CAR(1) priors. This model gives a DIC of 486, while
posterior means (95% intervals) for $\mu_{\beta 1}$ and $\mu_{\beta 2}$ are obtained as −0.03 (−0.09,0.04) and
0.17 (0.09,0.25).
Adding a spatial residual to this model leads to
where priors on spatial effects are specified from first principles. This improves the DIC
to 472. This is a slight loss of fit compared to the spatial residual model, but acknowledging
heterogeneity in regression effects may often be important on substantive
grounds, and may impact on average regression effects over all areas. Posterior means
(95% intervals) for $\mu_{\beta 1}$ and $\mu_{\beta 2}$ are obtained as −0.05 (−0.13,0.03) and 0.13 (0.05,0.22), so
that recognising heterogeneity has much enhanced the inequality effect, as compared
to the spatial residual model.
Wheeler and Tiefelsdorf (2005, p.169) mention implausibly signed effects when using a
classical GWR approach. The Bayesian SVC approach reveals four counties with posterior
probabilities $\Pr(\beta_{1i} > 0 \mid y)$ over 0.25 (the maximum being 0.27), and no county with
a posterior probability $\Pr(\beta_{2i} < 0 \mid y)$ under 0.75 (the minimum being 0.76).
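\[ y_i \sim \mathrm{Bin}(V_i, \pi_i), \]
\[ \operatorname{logit}(\pi_i) = \beta_0 + \sum_j X_{ij} \beta_{ij}^{(r)} + s_i, \]
\[ \beta_{ij}^{(r)} = \beta_{ij} \gamma_{ij}, \]
\[ \beta_{ij} \sim N(\mu_j, 1/\tau_j) \]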
where $s_i$ are Leroux et al. (1999) effects, whereas the $\rho_{ij}$ are CAR(1). A two-chain run
of 100,000 iterations does not attain convergence according to Brooks-Gelman-Rubin
(BGR) statistics. Inferences at this stage show the highest retention probabilities (found
by averaging $\gamma_{ij}$ over areas) for the higher education and population density variables.
These predictors both have negative effects, with posterior means (sd) of −0.33 (0.02)
and −0.10 (0.04), as assessed from posterior mean $\beta_{ij}^{(r)}$ averaged over areas. The age 65+
and non-UK-born predictors have retention probabilities below 0.50.
\[ Y(0), Y(1) \perp X \mid C, \]
Suppose a logit regression is used to predict $\Pr(X_i = 1 \mid C_i, \gamma)$, so that the propensity score
is $S_i = 1/[1 + \exp(-\gamma C_i)]$. It is potentially important to exclude insignificant confounder
variables (Weitzen et al., 2004), so one may include Bayesian variable selection in the
estimation of the propensity score, for example, using binary retention indicators $\delta_k$,
\[ \operatorname{logit}(S_i) = \gamma_0 + \sum_k \delta_k \gamma_k C_{ik}. \qquad (7.3) \]
Options for subsequent analysis are then to apply the score in a regression to predict Y,
stratify the sample according to propensity score (e.g. into quintiles or deciles),
match on the propensity score, or use inverse weighting by the propensity score. Suppose
the subsequent analysis involves regression. Regression to assess effects of exposure or
treatment can then be (a) on X and S; (b) on X and groupings of S (e.g. a categorical variable
based on a decile grouping of the S scores); or (c) on X, S and C.
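To make the first two steps concrete, the following Python sketch evaluates a propensity score of the form (7.3) for given coefficient and retention indicator values, and then assigns subjects to rank-based quintiles of the score. The function names (propensity, stratum), the coefficient values and the scores are hypothetical; in a Bayesian analysis the coefficients and indicators would be sampled values at a particular MCMC iteration, or posterior summaries.

```python
import math


def propensity(gamma0, gamma, delta, c_row):
    """S_i = 1 / (1 + exp(-(gamma0 + sum_k delta_k * gamma_k * C_ik)))."""
    eta = gamma0 + sum(d * g * c for d, g, c in zip(delta, gamma, c_row))
    return 1.0 / (1.0 + math.exp(-eta))


def stratum(scores, n_groups=5):
    """Rank-based grouping of scores into n_groups strata (0 = lowest scores)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    groups = [0] * len(scores)
    for rank, i in enumerate(order):
        groups[i] = rank * n_groups // len(scores)
    return groups


# Hypothetical values: two confounders, the second excluded (delta_2 = 0).
gamma0, gamma, delta = -0.5, [0.8, 1.5], [1, 0]
C = [[0.2, 1.0], [1.5, -0.3], [-0.7, 0.4], [0.9, 2.0], [0.0, 0.0]]
S = [propensity(gamma0, gamma, delta, row) for row in C]

# With delta_2 = 0 only the first confounder contributes to the score.
print([round(s, 3) for s in S])
print(stratum(S))   # with five subjects, each falls in its own quintile
```

A categorical variable built from such strata could then be used as a predictor alongside X in the outcome regression, as in option (b).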
A Bayesian approach may be specified using a joint likelihood (Zigler et al., 2013;
McCandless et al., 2009). Suppose Y is binary with $\pi_i = \Pr(Y_i = 1 \mid X_i, C_i)$, then the joint
likelihood consists of (7.3) and an outcome model such as (Zigler and Dominici, 2014, section 2.3)
where the selection indicators δk are common to both regressions. An average treatment or
exposure effect
may be calculated by comparing estimated responses for each subject at X = 1 and X = 0 in
(7.4) (Davis et al., 2017).
The estimation of the propensity score S via a joint likelihood contrasts with a separate
stage perspective whereby the propensity score is intended to approximate the design stage
of a randomised study, without access to the outcome. In accord with a two-stage perspec-
tive, one may instead apply a quasi-Bayesian approach whereby feedback is cut between
the PS model and the outcome model (McCandless et al., 2010; Zigler and Dominici, 2014,
section 3.2).
For hierarchical data (e.g. subjects nested within institutions, or within areas) contri-
butions of covariates to treatment assignment may vary across institutions. In terms of
multilevel coefficients, this implies that a random slope analysis is needed to represent
institutionally varying or area-varying effects of covariates. Such a cross-level interaction
effect on the probability of receiving treatment means that each institution then has a dif-
ferent propensity equation. If Cij denotes individual confounders and Wj denotes institu-
tional confounders, then the ignorability assumption is now stated as
\[ Y(0), Y(1) \perp X \mid C, W. \]
The aim of the propensity score method is to ensure that within groups homogeneous on
the propensity score, the distributions of the covariates are essentially the same for treated
and untreated subjects (Austin, 2009). The achievement of covariate balance may be tested
(Baser, 2006) e.g. by testing for significant differences in covariate distributions within
propensity score strata.
\[ \operatorname{logit}(S_i) = \gamma_0 + \sum_{k=1}^{3} \gamma_k C_{ik}. \]
This model has a satisfactory performance in reproducing the data based on posterior
predictive tests using the Brier score.
The three confounders have significant positive effects in the propensity score regres-
sion, so that patients receiving the new drug have a distinctly adverse risk profile.
Within quintiles of the propensity score, differences between confounder profiles are
not significant (these are represented by match.C1, match.C2, and match.C3 in the code).
In the outcome model, β1 and β2 have respective posterior means (sd) of −0.47 (0.28) and
2.83 (1.15). An estimate of the causal effect (of the new drug in reducing mortality) is
based on evaluating outcomes $p_{1i} = \operatorname{logit}^{-1}(\beta_0 + \beta_1 X_i + \beta_2 S_i)$ (setting $X_i = 1$) and $p_{0i} = \operatorname{logit}^{-1}(\beta_0 + \beta_2 S_i)$ (setting $X_i = 0$) for
all subjects. Then the causal effect (mort.X in the code) is estimated as the average of the
differences $p_{1i} - p_{0i}$, which has a mean (95% CRI) of −0.065 (−0.142, 0.006). LOO-IC criteria
are obtained separately for the propensity score and outcome models as 529 and 369.
A second model extends the propensity score model to include quadratic and interaction
terms ($C_1^2, C_2^2, C_3^2, C_1C_2, C_1C_3, C_2C_3$), and also allows for confounder selection feedback,
with a residual confounding effect in the outcome model. Additionally, rather
than selection via SSVS or other spike-slab priors (Zigler and Dominici, 2014), horseshoe
shrinkage priors are used, with a sharing of the shrinkage parameters between propensity
and outcome models. Thus
\[ \operatorname{logit}(S_i) = \gamma_0 + \sum_{k=1}^{9} \gamma_k C_{ik}, \]
\[ \gamma_k \sim N(0, \tau_\gamma^2 \rho_k), \]
\[ \kappa_k \sim \mathrm{Be}(0.5, 0.5), \]
\[ \rho_k = 1/\kappa_k - 1, \]
\[ \eta_k \sim N(0, \tau_\eta^2 \rho_k). \]
LOO-IC criteria for the propensity score and outcome models are, in fact, now slightly
higher at 533 and 372. Values of κk are above 0.5 (indicating redundant regressors) except
for the main linear terms in C1, C2 and C3. The average causal effect $p_{1i} - p_{0i}$ is little
changed, with mean (95% CRI) of −0.063 (−0.137, 0.009).
M ~ f1(X , C )
Y ~ f 2 (X , M , X.M , C ).
Let M(X*) denote the prediction of the mediator M under the setting X = X*. For example,
suppose M is continuous, and a normal linear regression is specified with
\[ M_i \sim N(\eta_i, \sigma_M^2), \]
where
\[ \eta_i = \beta_1 + \beta_2 X_i + \beta_3 C_i. \]
Then expected values of M(X) and M(X*) can be obtained as equal to the corresponding
regression terms, namely as $\eta_i$ and
\[ \eta_i^* = \beta_1 + \beta_2 X_i^* + \beta_3 C_i \]
respectively. Under a Bayesian perspective, they can also potentially be obtained as the
respective predictions
\[ M_{new,i} \sim N(\eta_i, \sigma_M^2), \]
\[ M_{new,i}^* \sim N(\eta_i^*, \sigma_M^2). \]
while the natural direct effect (Lange et al., 2012) compares $E[Y\{X, M(X^*)\}]$ and
$E[Y\{X^*, M(X^*)\}]$, namely
The natural indirect, or mediated, effect is the difference between the total effect and the
natural direct effect, namely
An additional effect sometimes of interest (Naimi et al., 2014b; VanderWeele, 2013) is
the controlled direct effect (CDE), namely the effect of exposure on outcome if the mediator is
controlled uniformly at a particular value of M, say M.c. Then
Consider a structural model, as in Figure 7.5, with dependencies specified via normal lin-
ear regressions, with regression terms
\[ E[M \mid X, C] = \beta_1 + \beta_2 X + \beta_3 C, \]
FIGURE 7.5
Causal path example.
Then the Natural Direct Effect (NDE) and the Natural Indirect Effect (NIE) can be obtained
by effect substitution (substitution of regression means). So defining expected M(X) and
M(X*) from the corresponding regressions,
\[ E[M(X)] = \beta_1 + \beta_2 X + \beta_3 C, \]
\[ E[M(X^*)] = \beta_1 + \beta_2 X^* + \beta_3 C, \]
one has
\[ \mathrm{NIE} = (X - X^*)(\theta_3 \beta_2 + \theta_4 \beta_2 X). \]
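Effect substitution of this kind is easily verified numerically. The Python sketch below assumes, for illustration only, an outcome regression with mean θ1 + θ2X + θ3M + θ4XM + θ5C, which is consistent with the NIE expression above but is an assumption as to the exact parameterisation. It obtains the NDE and NIE by substituting regression means, and compares the NIE with the closed form. All function names and parameter values are hypothetical.

```python
def mediator_mean(beta, x, c):
    """E[M | X, C] = beta1 + beta2 * X + beta3 * C."""
    b1, b2, b3 = beta
    return b1 + b2 * x + b3 * c


def outcome_mean(theta, x, m, c):
    """Assumed E[Y | X, M, C] = th1 + th2*X + th3*M + th4*X*M + th5*C."""
    t1, t2, t3, t4, t5 = theta
    return t1 + t2 * x + t3 * m + t4 * x * m + t5 * c


def natural_effects(beta, theta, x, x_star, c):
    """NDE and NIE by substitution of regression means."""
    m = mediator_mean(beta, x, c)            # E[M(X)]
    m_star = mediator_mean(beta, x_star, c)  # E[M(X*)]
    nde = outcome_mean(theta, x, m_star, c) - outcome_mean(theta, x_star, m_star, c)
    nie = outcome_mean(theta, x, m, c) - outcome_mean(theta, x, m_star, c)
    return nde, nie


# Hypothetical parameter values and settings.
beta = (1.0, 0.5, 0.2)
theta = (2.0, 1.0, 0.8, 0.1, 0.3)
x, x_star, c = 1.0, 0.0, 2.0

nde, nie = natural_effects(beta, theta, x, x_star, c)
# Closed form: NIE = (X - X*)(theta3 * beta2 + theta4 * beta2 * X)
nie_closed = (x - x_star) * (theta[2] * beta[1] + theta[3] * beta[1] * x)
print(nde, nie, nie_closed, nde + nie)
```

In an MCMC analysis, the same substitution would be applied to sampled values of β and θ, giving posterior densities for the NDE, NIE and their sum.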
Underlying the effect decomposition in the above model are assumptions of conditional
ignorability
\[ Y\{X, M(X)\} \perp X \mid C \]
These specify independence of exposure and outcome, given the confounders (VanderWeele,
2015), and of mediator and outcome, given confounders and exposure. Additional assumptions
(VanderWeele and Vansteelandt, 2014) are $M(X) \perp X \mid C$ and $Y\{X, M(X)\} \perp M(X^*) \mid C$.
An alternative method for estimating causal effects is set out by Imai et al. (2010, p.312)
based on assumptions of sequential ignorability. The initial assumptions relate to ignorability
of the treatment (or exposure) given confounders, namely $Y\{X, M(X)\} \perp X \mid C$ and
Assuming a binary treatment, the two causal mediation effects are defined for treatment
settings x = 0,1 as
\[ \delta_i(x) = Y_i(x, M_i(1)) - Y_i(x, M_i(0)). \]
Assuming that causal mediation and direct effects do not vary according to treatment
status, so that $\delta_i(1) = \delta_i(0) = \delta_i$ and $\zeta_i(1) = \zeta_i(0) = \zeta_i$, one has that the total treatment effect
$\tau_i = Y_i(1, M_i(1)) - Y_i(0, M_i(0))$ is the sum of the causal mediation and direct effects, $\tau_i = \delta_i + \zeta_i$.
The Imai et al. method may be characterised as a quasi-Bayesian Monte Carlo algorithm,
and one may adapt the method using fully Bayesian principles, with substitution of appro-
priate replicates. Assuming the treatment X is binary, and based on sampled parameters
at each MCMC iteration, one samples replicate mediator values $M_1 = M_{rep}(C, X = 1)$ and
$M_0 = M_{rep}(C, X = 0)$ at different treatment levels. One then substitutes these (as mediator
values) in the regression term for predicting Y, along with settings X = 1 and X = 0 on
the treatment values. This provides predictions $Y_{rep}(1, M_1)$, $Y_{rep}(0, M_1)$, $Y_{rep}(1, M_0)$, and
$Y_{rep}(0, M_0)$, with the total treatment effect $Y_{rep}(1, M_1) - Y_{rep}(0, M_0)$.
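The replicate-based calculation for a single MCMC draw may be sketched in Python as follows. This is an illustration under stated assumptions and not the code used in the examples: the mediator is taken as normal with a linear mean, the outcome prediction is taken as the linear regression mean (without added outcome noise), the direct effect at mediator setting M(x) is taken as Y(1, M(x)) − Y(0, M(x)) following Imai et al. (2010), and all names and parameter values are hypothetical.

```python
import random


def mediation_effects(C, beta, sigma_m, theta, rng):
    """Subject-level effects for one draw of (beta, sigma_m, theta)."""
    b1, b2, b3 = beta
    t1, t2, t3, t4, t5 = theta   # intercept, X, M, X*M interaction, C

    def y_hat(x, m, c):
        return t1 + t2 * x + t3 * m + t4 * x * m + t5 * c

    effects = []
    for c in C:
        m1 = rng.gauss(b1 + b2 + b3 * c, sigma_m)   # M_rep(C, X = 1)
        m0 = rng.gauss(b1 + b3 * c, sigma_m)        # M_rep(C, X = 0)
        y11, y10 = y_hat(1, m1, c), y_hat(1, m0, c)
        y01, y00 = y_hat(0, m1, c), y_hat(0, m0, c)
        effects.append({
            "mediation_1": y11 - y10,   # delta(1)
            "mediation_0": y01 - y00,   # delta(0)
            "direct_1": y11 - y01,      # direct effect at M(1)
            "direct_0": y10 - y00,      # direct effect at M(0)
            "total": y11 - y00,
        })
    return effects


rng = random.Random(0)                       # fixed seed for reproducibility
C = [0.0, 1.0, -0.5, 2.0]                    # hypothetical confounder values
draw = mediation_effects(C, beta=(1.0, 0.5, 0.2), sigma_m=1.0,
                         theta=(2.0, 1.0, 0.8, 0.1, 0.3), rng=rng)

# Average effects over subjects for this draw.
n = len(draw)
print({k: sum(d[k] for d in draw) / n for k in draw[0]})
```

For every subject the total effect equals the mediation effect at X = 1 plus the direct effect at M(0), and equally the mediation effect at X = 0 plus the direct effect at M(1). Repeating the calculation over MCMC iterations provides posterior densities for the average effects.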
In real situations, the exposure X may influence one or more confounders C. Schematically,
C ~ f1(X ),
M ~ f 2 (X , C ),
Y ~ f 3 (X , M , X.M , C ).
Then more detailed calculations are obtained, since C(X) and C(X*) will differ. To illustrate
linear effect substitution, suppose C denotes a confounder influenced by X, and D denotes
confounders independent of X. Then one will have an additional linear regression with
expectation such as
whereby $E[C(X)] = \alpha_1 + \alpha_2 X + \alpha_3 D$, and $E[C(X^*)] = \alpha_1 + \alpha_2 X^* + \alpha_3 D$. Then one has
and
The binary outcome has predictors X and C, but also involves the mediator. So assum-
ing a probit regression one has
congmesg[i] ~ dbern(p[i])
p[i] <- phi(b[1] + b[2]*treat[i] + b[3]*anx[i] + b[4]*age.c[i] + b[5]*equals(edu[i],2)
            + b[6]*equals(edu[i],3) + b[7]*equals(edu[i],4) + b[8]*gend[i] + b[9]*income[i])
Four alternative predictions of the outcome are obtained at settings X = 1 and
X = 0, crossed with mediator values set at the predictions $M_1 = M_{rep}(C, X = 1)$ and
$M_0 = M_{rep}(C, X = 0)$ (anx.1[i] and anx.0[i] in the code). The direct causal effect is defined
as
Two assumptions regarding the density for the positive anxiety score are made. Under a
truncated normal assumption, a two-chain run using jagsUI provides means (95% CRI)
for the average mediation, direct and total effects as 0.083 (0.011, 0.160), 0.012 (−0.113,
0.140) and 0.096 (−0.042, 0.238). These estimates are similar in location to, but less precise
than, those contained in Tingley et al. (2014). In inference terms, the direct impact of the
treatment is insignificant, and the impact of the treatment on the response is mainly due
to its effect on the anxiety mediator. The LOO-IC for the mediator and outcome models
are obtained as 2,099 and 298 respectively.
Assuming instead a lognormal density for anxiety, the respective means (95% CRI)
become 0.090 (0.011, 0.181), 0.016 (−0.106, 0.142), and 0.106 (−0.034, 0.253). The LOO-IC for
the mediator model is reduced to 2,092.
ALC ~ (ALC|SES)
A different imputation strategy is adopted by Daniel et al. (2011) which may affect find-
ings. Normal linear regressions are adopted for each outcome. In full, the regression
term assumed for predicting the outcome SBP is
One aim is to estimate the natural direct effect NDE, defined (in generic symbols) as
the expected value of the difference $Y(X, M(X^*)) - Y(X^*, M(X^*))$. Accordingly, a counterfactual
alcohol consumption level ALC* = 0 is defined, with corresponding predictions
(obtained as Bayesian replicates)
and
\[ \mathrm{NDE} = E[\mathrm{SBP}\{\mathrm{ALC}, \mathrm{GGT}^*, \mathrm{BMI}^*, \mathrm{SES}\}] - E[\mathrm{SBP}\{\mathrm{ALC}^*, \mathrm{GGT}^*, \mathrm{BMI}^*, \mathrm{SES}\}] \]
with the first and second components defining NDE at subject level denoted NDE.a[i]
and NDE.star[i] in the code.
The natural indirect effect NIE is defined generically as the expected value of the difference
$Y(X, M(X)) - Y(X, M(X^*))$. In terms of the application, we have
In the first component, GGT and BMI are replicates (ggt.new[i] and bmi.new[i] in the
code). Including the prediction $\mathrm{GGT}^* = \mathrm{GGT}(\mathrm{ALC}^*, \mathrm{BMI}^*)$ in the second component
allows for the fact that BMI (a confounder) depends on the exposure ALC, so that
$\mathrm{BMI}^* = \mathrm{BMI}(\mathrm{ALC}^*, \mathrm{SES})$ and BMI(ALC,SES) differ.
The total causal effect TCE is the sum of NDE and NIE. Additionally, the controlled
direct effect CDE may be obtained at the setting GGT.c = 3, namely
Table 7.6 shows the posterior summary for these quantities and the regression param-
eters θ from a two-chain run using jagsUI. The estimated total causal effect (TCE)
implies that the reduction of alcohol consumption to zero would reduce average SBP
by 8.04 units (95% CRI from 7.71 to 8.40). A relatively small part of the reduction (with
posterior mean 1.30 units) is mediated through GGT. It may be noted that the impact
of alcohol on SBP is possibly nonlinear, with evidence of a U-shaped effect, and an
extended model might allow nonlinearity (Jackson et al., 1985).
TABLE 7.6
Prediction of SBP, Posterior Parameter Summary
Parameter   Predictor   Mean     St devn   2.5%     97.5%
TCE                       8.04   0.18        7.71     8.40
NDE                       6.74   0.16        6.42     7.06
NIE                       1.30   0.10        1.10     1.51
CDE                       6.63   0.15        6.32     6.93
θ1          Intercept    89.86   1.25       87.63    92.58
θ2          ALC           5.94   0.19        5.58     6.33
θ3          GGT           7.03   0.16        6.74     7.35
θ4          GGT.ALC      −0.99   0.05       −1.10    −0.89
θ5          BMI           0.51   0.05        0.41     0.59
θ6          SES2         −5.34   0.24       −5.80    −4.87
θ7          SES3        −10.32   0.31      −10.92    −9.71
\[ w_i = 1/\Pr(X_i = x \mid C_i) \]
that X = x given confounders C, following binary regression of X on C. The weights are
\[ w_i = 1/P(X_i = 1 \mid C_i) = 1/S_i \]
Estimating the marginal structural model then involves a weighted likelihood (normal,
logistic, etc.) of Y on X. Applying weights in this way creates an artificial population
which tends to balance on the covariates C used in deriving the weights (Naimi et al.,
2014a). Doubly robust weights may also be defined that estimate causal effects if either
the propensity score model or the outcome model is correctly specified. Davis et al. (2017)
use Bayesian methods to estimate the parameters needed to define a propensity score in
spatial applications, and substitute relevant posterior means to estimate IPTW weights,
with the latter considered as frequentist.
Marginal structural models may also be estimated using a regression of Y on X (expo-
sure) and C (confounders) to predict counterfactual outcomes for all subjects. This is in
line with g-computation principles (Wang and Arah, 2015). Then E(Y[X , C]) denotes the
prediction, possibly counterfactual, at the value X. So for X binary, and X = 1 as exposed,
the total causal effect is estimated as
Snowden et al. (2011) use an additional regression step, involving 2n outcomes (half being
actual responses at observed X, half being counterfactual responses at the counterfactual
X*) and estimate the treatment effect by regression of the expanded outcome vector on cor-
responding X (or X*) values. However, the TCE may also be estimated by averaging over
predictions at appropriate settings of X and C (Example 7.18).
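The averaging of predictions under g-computation can be sketched in Python as below. For illustration a linear outcome regression with an exposure by confounder interaction is assumed, E(Y|X, C) = b0 + bx X + bc C + bxc X C; this model, the function name g_computation_effect, and the coefficient and confounder values are all hypothetical. In a Bayesian analysis the calculation would be repeated for each set of sampled coefficients.

```python
def g_computation_effect(C, coef):
    """Average over subjects of E(Y[1, C_i]) - E(Y[0, C_i])."""
    b0, bx, bc, bxc = coef

    def predict(x, c):
        # Assumed outcome regression with an X by C interaction.
        return b0 + bx * x + bc * c + bxc * x * c

    diffs = [predict(1, c) - predict(0, c) for c in C]
    return sum(diffs) / len(diffs)


# Hypothetical confounder values and regression coefficients.
C = [0.0, 1.0, 2.0, 3.0]
coef = (1.0, -0.5, 0.3, 0.1)

tce = g_computation_effect(C, coef)
print(tce)
```

With an interaction present, the averaged effect equals bx + bxc times the mean of C, and so differs from the conditional coefficient bx alone.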
where X.C1 denotes an interaction between X and C1, etc. We estimate this model using
jagsUI, and find the regression coefficient β2 to have posterior mean (sd) of −0.486 (0.08).
By contrast, the marginal causal effect has posterior mean (sd) of −0.337 (0.054). This is
estimated by averaging the difference between the predictions
\[ \beta_j = \gamma_j J_j, \]
\[ \gamma_j \sim N(0, 10), \]
\[ J_j \sim \mathrm{Bern}(0.5). \]
This gives posterior probabilities of 1 for retaining β2, and β3, and 0.95 for retaining β6, as
expected in line with the data generation mechanism. Other coefficients have retention
probabilities below 0.05. Including predictor selection affects estimates slightly: the
regression coefficient β2 now has a posterior mean (sd) of −0.481 (0.059), while the mar-
ginal causal effect has a posterior mean (sd) of −0.340 (0.054).
We also consider the propensity score approach of Section 7.9.1, regressing the prob-
ability Si that Xi = 1 on C1i, C2i and the interaction C1iC2i. One may then either simply
regress the response Yi on Si and Xi, or also allow for residual confounding (Zigler and
Dominici, 2014), namely,
This is carried out using a joint likelihood, though feedback between the Y-model and
the X-model can be avoided using the BUGS “cut” function. Results for the coefficient β2,
and hence the ACE, are very similar whether or not residual confounding is allowed for,
and also whether or not feedback is avoided. Allowing feedback, and without allowing
residual confounding, the mean (sd) of the ACE is estimated as −0.352 (0.053).
To illustrate the IPTW approach, we again regress the probability that X = 1 on C1,
C2, and the interaction C1.C2. This provides probabilities $S_i = \Pr(X_i = 1 \mid C_{1i}, C_{2i})$, and the
weights
\[ w_i = X_i/S_i + (1 - X_i)/(1 - S_i) \]
are then used in a weighted linear regression of Y on X with weights $\sigma^2/w_i$. To avoid
feedback between the logit regression for X on {C1,C2}, and the marginal structural
regression of Y on X, the “cut” function in BUGS is applied to the predicted probabili-
ties Si before they are inserted in the weights. From the second half of a 10,000 iteration
sequence, the posterior mean (sd) for the marginal causal effect (MCE) is −0.36 (0.10). If
feedback between the two regressions is allowed, convergence in the coefficients of the
X-model is impeded, and the MCE has a value closer to null, around −0.27.
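The weighting step can be illustrated with a short Python sketch. It computes the weights $w_i = X_i/S_i + (1 - X_i)/(1 - S_i)$ from given propensity scores, and then the difference in weighted means of Y between exposed and unexposed subjects, which is the point estimate of the slope in a weighted least squares regression of Y on an intercept and a binary X. The data values and function names are hypothetical, and the propensity scores are treated as fixed, as when feedback from the outcome model is cut.

```python
def iptw_weights(X, S):
    """w_i = X_i / S_i + (1 - X_i) / (1 - S_i) for binary X and scores S."""
    return [x / s + (1 - x) / (1 - s) for x, s in zip(X, S)]


def weighted_mean_difference(Y, X, w):
    """Weighted mean of Y among X = 1 minus weighted mean among X = 0."""
    num1 = sum(wi * y for y, x, wi in zip(Y, X, w) if x == 1)
    den1 = sum(wi for x, wi in zip(X, w) if x == 1)
    num0 = sum(wi * y for y, x, wi in zip(Y, X, w) if x == 0)
    den0 = sum(wi for x, wi in zip(X, w) if x == 0)
    return num1 / den1 - num0 / den0


# Hypothetical data: exposure, propensity scores and responses.
X = [1, 1, 0, 0]
S = [0.8, 0.5, 0.5, 0.2]
Y = [1.0, 2.0, 3.0, 4.0]

w = iptw_weights(X, S)
print(w)
print(weighted_mean_difference(Y, X, w))
```

Subjects whose observed exposure was unlikely given their confounders (a low score for the exposed, a high score for the unexposed) receive larger weights.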
References
Albert J (1996) Bayesian selection of log-linear models. Canadian Journal of Statistics, 24, 327–347.
Albert JH, Chib S (1993) Bayesian analysis of binary and polychotomous response data. Journal of the
American Statistical Association, 88(422), 669–679.
Albert J, Chib S (2001) Sequential ordinal modeling with applications to survival data. Biometrics,
57(3), 829–836.
Arbia G (2014) A Primer for Spatial Econometrics: With Applications in R. Palgrave.
Assunção RM (2003) Space varying coefficient models for small area data. Environmetrics, 14(5),
453–473.
Assunção R, Krainski E (2009) Neighborhood dependence in Bayesian spatial models. Biometrical
Journal, 51(5), 851–869.
Austin P (2009) Balance diagnostics for comparing the distribution of baseline covariates between
treatment groups in propensity-score matched samples. Statistics in Medicine, 28, 3083–3107.
Baragatti M, Pommeret D (2012) A study of variable selection using g-prior distribution with ridge
parameter. Computational Statistics and Data Analysis, 56(6), 1920–1934.
Barbieri M, Berger J (2004) Optimal predictive model selection. Annals of Statistics, 32, 870–897.
Barreto-Souza W, Simas A (2016) General mixed Poisson regression models with varying dispersion.
Statistics and Computing, 26, 1263–1280.
Baser O (2006) Too much ado about propensity score models? Comparing methods of propensity
score matching. Value in Health, 9(6), 377–385.
Bazán J, Bolfarine H, Branco M (2010) A framework for skew-probit links in binary regression.
Communications in Statistics, Theory and Methods, 39(4), 678–697.
Beck N (1983) Time-varying parameter regression models. American Journal of Political Science, 27,
557–600.
Bell S, Broemeling LD (2000) A Bayesian analysis for spatial processes with application to disease
mapping. Statistics in Medicine, 19(7), 957–974.
Besag J (1974) Spatial interaction and the statistical analysis of lattice systems. Journal of the Royal
Statistical Society B, 36, 192–225.
Besag J, York J, Mollié A (1991) Bayesian image restoration, with two applications in spatial statistics.
Annals of the Institute of Statistical Mathematics, 43(1), 1–20.
Bhadra A, Datta J, Polson N, Willard B (2016) Default Bayesian analysis with global-local shrinkage
priors. Biometrika, 103, 955–969.
Bhattacharya A, Pati D, Pillai N, Dunson D (2015) Dirichlet–Laplace priors for optimal shrinkage.
Journal of the American Statistical Association, 110(512), 1479–1490.
Bonate P (2011) Pharmacokinetic-Pharmacodynamic Modeling and Simulation, 2nd Edition. Springer,
New York.
Boris Choy S, Chan J (2008) Scale mixtures distributions in statistical modelling. Australian & New
Zealand Journal of Statistics, 50(2), 135–146.
Boyd H, Flanders W, Addiss D, Waller L (2005) Residual spatial correlation between geographically
referenced observations: A Bayesian hierarchical modeling approach. Epidemiology, 16, 532–541.
Brockmann H (1996) Satellite male groups in horseshoe crabs, Limulus polyphemus. Ethology, 102(1),
1–21.
Bürkner P, Vuorre M (2018, February 28) Ordinal Regression Models in Psychology: A Tutorial.
https://doi.org/10.31234/osf.io/x8swp
Calcagno V, de Mazancourt C (2010) glmulti: An R package for easy automated model selection with
(generalized) linear models. Journal of Statistical Software, 34(12), 1–29.
Carvalho C, Polson N, Scott J (2009) Handling Sparsity via the Horseshoe. Proceedings of Machine
Learning Research, 5, 73–80.
Cepeda E, Gamerman D (2000) Bayesian modeling of variance heterogeneity in normal regression
models. Brazilian Journal of Probability and Statistics, 14(2), 207–221.
Chan K, Ledolter J (1995) Monte Carlo EM estimation for time series models involving counts. Journal
of the American Statistical Association, 90, 242–252.
Chang Y, Gianola D, Heringstad B, Klemetsdal G (2006) A comparison between multivariate Slash,
Student’s t and probit threshold models for analysis of clinical mastitis in first lactation cows.
Journal of Animal Breeding and Genetics, 123, 290–300.
Chen R, Chu C, Yuan S, Wu Y (2016) Bayesian sparse group selection. Journal of Computational and
Graphical Statistics, 25(3), 665–683.
Chi EM, Reinsel GC (1989) Models for longitudinal data with random effects and AR (1) errors.
Journal of the American Statistical Association, 84(406), 452–459.
Chib S, Greenberg E (2013) On conditional variance estimation in nonparametric regression. Statistics
and Computing, 23(2), 261–270.
Chiogna M, Gaetan C (2002) Dynamic generalized linear models with application to environmental
epidemiology. Journal of the Royal Statistical Society: Series C (Applied Statistics), 51(4), 453–468.
Choi J, Lawson A (2016, June 16) Bayesian spatially dependent variable selection for small area health
modeling. Statistical Methods in Medical Research. pii: 0962280215627184.
Choi J, Lawson AB (2018) Bayesian spatially dependent variable selection for small area health mod-
eling. Statistical Methods in Medical Research, 27(1), 234–249.
Conceição K, Andrade M, Louzada F (2013) Zero-modified Poisson model: Bayesian approach, influ-
ence diagnostics, and an application to a Brazilian leptospirosis notification data. Biometrical
Journal, 55(5), 661–678.
Congdon P, Almog M, Curtis S, Ellerman R (2007) A spatial structural equation modelling frame-
work for health count responses. Statistics in Medicine, 26(29), 5267–5284.
Cox D (1981) Statistical analysis of time series: Some recent developments. Scandinavian Journal of
Statistics, 8, 93–115.
Czado C, Erhardt V, Min A, Wagner S (2007) Zero-inflated generalized Poisson models with regres-
sion effects on the mean, dispersion and zero-inflation level applied to patent outsourcing
rates. Statistical Modelling, 7(2), 125–153.
Dangl T, Halling M (2012) Predictive regressions with time-varying coefficients. Journal of Financial
Economics, 106(1), 157–181.
Daniel R, De Stavola B, Cousens S (2011) gformula: Estimating causal effects in the presence of
time-varying confounding or mediation using the g-computation formula. Stata Journal, 11(4),
479–517.
Darmofal D (2015) Spatial Analysis for the Social Sciences. Cambridge University Press.
Davis M, Neelon B, Nietert P, Hunt K, Burgette L, Lawson A, Egede L (2017) Addressing geographic
confounding through spatial propensity scores: A study of racial disparities in diabetes.
Statistical Methods in Medical Research, 28(3), 734–748.
Dormann C, McPherson N, Araújo M et al. (2007) Methods to account for spatial autocorrelation in
the analysis of species distributional data: A review. Ecography, 30(5), 609–628.
Epstein D, O’Halloran S (1996) The partisan paradox and the US tariff, 1877–1934. International
Organization, 50(2), 301–324.
Fahrmeir L, Osuna E (2006) Structured additive regression for overdispersed and zero-inflated count
data. Applied Stochastic Models in Business and Industry, 22(4), 351–369.
Fahrmeir L, Tutz G (2001) Multivariate Statistical Modelling Based on Generalized Linear Models, pp
69–137. Springer, New York.
Fernández C, Steel MF (1998) On Bayesian modeling of fat tails and skewness. Journal of the American
Statistical Association, 93(441), 359–371.
Ferreira M, Gamerman D (2000) Dynamic generalized linear models, pp 57–72, in Generalized Linear
Models: A Bayesian Perspective, eds D Dey, S Ghosh, B Mallick. Marcel Dekker, New York.
Fokianos K, Kedem B (2003) Regression theory for categorical time series. Statistical Science, 18(3),
357–376.
Fonseca TC, Ferreira MA, Migon HS (2008) Objective Bayesian analysis for the Student-t regression
model. Biometrika, 95(2), 325–333.
Fotheringham A, Brunsdon C, Charlton M (2002) Geographically Weighted Regression: The Analysis of
Spatially Varying Relationships. Wiley, Chichester, UK.
Franzese RJ, Hays JC (2007) Spatial econometric models of cross-sectional interdependence in politi-
cal science panel and time-series-cross-section data. Political Analysis, 15(2), 140–164.
Fruhwirth-Schnatter S, Fruhwirth R (2007) Auxiliary mixture sampling with applications to logistic
models. Computational Statistics and Data Analysis, 51, 3509–3528.
Frühwirth-Schnatter S, Frühwirth R (2010) Data augmentation and MCMC for binary and multino-
mial logit models, pp 111–132, in Statistical Modelling and Regression Structures, eds T Kneib, G
Tutz. Physica-Verlag HD.
Gamerman D (1998) Markov chain Monte Carlo for dynamic generalised linear models. Biometrika,
85(1), 215–227.
Gamerman D, Moreira A, Rue H (2003) Space-varying regression models: Specifications and simula-
tion. Computational Statistics and Data Analysis, 42, 513–533.
Garay A, Lachos V, Bolfarine H, Ortega E (2015) Bayesian estimation and case influence diagnos-
tics for the zero-inflated negative binomial regression model. Journal of Applied Statistics, 42(6),
1148–1165.
Garcia-Donato G, Martinez-Beneito M (2013) On sampling strategies in Bayesian variable selec-
tion problems with large model spaces. Journal of the American Statistical Association, 108(501),
340–352.
Geinitz S, Furrer R (2016) Conjugate distributions in hierarchical Bayesian ANOVA for computa-
tional efficiency and assessments of both practical and statistical significance. arXiv:1303.3390.
Geinitz S, Furrer R, Sain S (2015) Bayesian multilevel analysis of variance for relative comparison
across sources of global climate model variability. International Journal of Climatology, 35(3),
433–443.
Regression Techniques Using Hierarchical Priors 311
Gelfand A, Kim H, Sirmans C, Banerjee S (2003) Spatial modelling with spatially varying coefficient
models. Journal of the American Statistical Association, 98, 387–396.
Gelfand AE, Ghosh SK (1998) Model choice: A minimum posterior predictive loss approach.
Biometrika, 85(1), 1–11.
Gelman A (2005) Analysis of variance—Why it is more important than ever. The Annals of Statistics,
33(1), 1–53.
George E, McCullogh R (1993) Variable selection via Gibbs sampling. Journal of the American Statistical
Association, 85, 398–409.
Gerlach R, Bird R, Hall A (2002) Bayesian variable selection in logistic regression: Predicting com-
pany earnings direction. Australian & New Zealand Journal of Statistics, 44, 155–168.
Ghosh J, Ghattas A (2015) Bayesian variable selection under collinearity. The American Statistician,
69(3), 165–173.
Ghosh S, Mukhopadhyay P, Lu J-C (2006) Bayesian analysis of zero-inflated regression models.
Journal of Statistical Planning and Inference, 136, 1360–1375.
Greene W (2008) Functional forms for the negative binomial model for count data. Economics Letters,
99(3), 585–590.
Greenland S (2000) Causal analysis in the health sciences. Journal of the American Statistical Association,
95, 286–289.
Hensher DA, Greene WH (2003) The mixed logit model: the state of practice. Transportation, 30(2),
133–176.
Holloway G, Shankar B, Rahmanb S (2002) Bayesian spatial probit estimation: A primer and an appli-
cation to HYV rice adoption. Agricultural Economics, 27(3), 383–402.
Holmes C, Held L (2006) Bayesian auxiliary variable models for binary and multinomial regression.
Bayesian Analysis, 1, 145–168.
Hooten M, Hobbs N (2015) A guide to Bayesian model selection for ecologists. Ecological Monographs,
85(1), 3–28.
Ibrahim JG, Chen MH (2000) Power prior distributions for regression models. Statistical Science,
15(1), 46–60.
Imai K, Keele L, Tingley D (2010) A general approach to causal mediation analysis. Psychological
Methods, 15(4), 309.
Ishwaran H, Kogalur U, Rao J (2010) spikeslab: Prediction and variable selection using spike and slab
regression. R Journal, 2(2), 68–73.
Ishwaran H, Rao J (2005) Spike and slab variable selection: Frequentist and Bayesian strategies.
Annals of Statistics, 33, 730–773.
Jackson R, Stewart A, Beaglehole R, Scragg R (1985) Alcohol consumption and blood pressure.
American Journal of Epidemiology, 122(6), 1037–1044.
Jia Z, Xu S (2007) Mapping quantitative trait loci for expression abundance. Genetics, 176, 611–623.
Joffe MM, Ten Have TR, Feldman HI, Kimmel SE (2004) Model selection, confounder control, and
marginal structural models: Review and new applications. The American Statistician, 58(4),
272–279.
Johnson VE, Albert JH (1999) Ordinal Data Modeling. Springer-Verlag.
Jung R C, Kukuk M, Liesenfeld R (2006) Time series of count data: Modeling, estimation and diag-
nostics. Computational Statistics & Data Analysis, 51, 2350–2364.
Kahn M, Raftery A (1996) Discharge rates of Medicare stroke patients to skilled nursing facilities:
Bayesian logistic regression with unobserved heterogeneity. Journal of the American Statistical
Association, 91, 29–41.
Khalili A, Chen J (2007) Variables selection in finite mixture of regression models. Journal of the
American Statistical Association, 102, 1025–1038.
Kim H, Sun D,Tsutakawa R K (2002) Lognormal vs. gamma: Extra variations. Biometrical Journal,
44(3), 305–323.
Kinney S, Dunson D (2007) Fixed and random effects selection in linear and logistic models.
Biometrics, 63, 690–698.
312 Bayesian Hierarchical Models
Raftery A, Painter I, Volinsky C (2005) BMA: An R package for Bayesian model averaging. R News,
5(2), 2–8.
Reich B, Fuentes M, Herring A, et al. (2010) Bayesian variable selection for multivariate spatially-
varying coefficient regression. Biometrics, 66, 772–782.
Richardson S, Bottolo L, Rosenthal J (2010) Bayesian models for sparse regression analysis of high
dimensional data. Bayesian Statistics, 9, 539–569.
Robins J, Hernan M, Brumback B (2000) Marginal structural models and causal inference in epidemi-
ology. Epidemiology 11(5), 550–560.
Rockova V, Lesaffre E, Luime J, Löwenberg B (2012) Hierarchical Bayesian formulations for selecting
variables in regression models. Statistics in Medicine, 31(11–12), 1221–1237.
Scott S (2011) Data augmentation, frequentist estimation, and the Bayesian analysis of multinomial
logit models. Statistical Papers, 52(1), 87–109.
Shumway R (2016) State space models, Chapter 6, in Time Series Analysis and Its Applications, eds R
Shumway, D Stoffer. Springer, New York.
Smith RL, Davis JM, Sacks J, Speckman P, Styer P (2000) Regression models for air pollution and
daily mortality: Analysis of data from Birmingham, Alabama. Environmetrics: The official journal
of the International Environmetrics Society, 11(6), 719–743.
Smith T, LeSage J (2004) A Bayesian probit model with spatial dependencies, pp 127–160, in Pace
Advances in Econometrics: Vol 18: Spatial and Spatiotemporal Econometrics, eds J LeSage, R Kelley.
Elsevier Science.
Snowden JM, Rose S, Mortimer KM (2011) Implementation of G-computation on a simulated data
set: Demonstration of a causal inference technique. American Journal of Epidemiology, 173(7),
731–738.
Spiegelhalter D (1998) Bayesian graphical modelling: A case-study in monitoring health outcomes.
Applied Statistics, 47, 115–133.
Sun D, Tsutakawa RK, Speckman PL (1999) Posterior distribution of hierarchical models using CAR
(1) distributions. Biometrika, 86(2), 341–350.
Tchetgen E, Vanderweele T (2014) Identification of natural direct effects when a confounder of the
mediator is directly affected by exposure. Epidemiology, 25(2), 282–291.
Tingley D, Yamamoto T, Hirose K, Keele L, Imai K (2014) Mediation: R package for causal mediation
analysis. Journal of Statistical Software, 59. https://www.jstatsoft.org/article/view/v059i05
Tingley M (2012) A Bayesian ANOVA scheme for calculating climate anomalies, with applications to
the instrumental temperature record. Journal of Climate, 25(2), 777–791.
Tutz G, Gertheiss J (2016) Regularized regression for categorical data. Statistical Modelling, 16(3),
161–200.
Utazi C, Sahu S, Atkinson P, Tejedorc N, Tatem A J (2016) A probabilistic predictive Bayesian approach
for determining the representativeness of health and demographic surveillance networks.
Spatial Statistics, 17, 161–178.
VanderWeele T (2013) Policy-relevant proportions for direct effects. Epidemiology, 24(1), 175–176.
VanderWeele T (2015) Explanation in Causal Inference: Methods for Mediation and Interaction. OUP.
VanderWeele T, Vansteelandt S (2014) Mediation analysis with multiple mediators. Epidemiologic
Methods, 2(1), 95–115.
Vansteelandt S, Daniel R (2014) On regression adjustment for the propensity score. Statistics in
Medicine, 33, 4053–4072.
Vaughn M, Beaver K, Wexler J, DeLisi M, Roberts G (2011) The effect of school dropout on verbal
ability in adulthood: A propensity score matching approach. Journal of Youth and Adolescence,
40(2), 197–206.
Verdinelli I, Wasserman L (1991) Bayesian analysis of outlier problems using the Gibbs sampler.
Statistics and Computing, 1(2), 105–117.
Viele K, Tong B (2002) Modeling with mixtures of linear regressions. Statistics and Computing, 12(4),
315–330.
Wagner H, Pauger D (2016) Discussion: Bayesian regularization and effect smoothing for categorical
predictors. Statistical Modelling, 16(3), 220–227.
Regression Techniques Using Hierarchical Priors 315
Wang A, Arah O (2015) G-computation demonstration in causal mediation analysis. European Journal
of Epidemiology, 30(10), 1119–1127.
Wang L, Zhou XH (2007) Assessing the adequacy of variance function in heteroscedastic regression
models. Biometrics, 63(4), 1218–1225.
Wang P, Puterman M (1999) Markov Poisson regression models for discrete time series, part 1:
Methodology. Journal of Applied Statistics, 26, 855–869.
Wang P, Puterman M, Cockburn I, Le N (1996) Mixed poisson regression models with covariate
dependent rates. Biometrics, 52, 381–400.
Weitzen S, Lapane K, Toledano A Y, Hume A L, Mor V (2004) Principles for modeling propensity
scores in medical research: A systematic literature review. Pharmacoepidemiology and Drug Safety,
13(12), 841–853.
Wheeler D, Calder C (2006) Bayesian spatially varying coefficient models in the presence of collinear-
ity. ASA Section on Bayesian Statistical Science, Proceedings of the Joint Statistical Meetings,
Seattle, WA, August 6–10, 2006.
Wheeler D, Calder C (2007) An assessment of coefficient accuracy in linear regression models with
spatially varying coefficients. Journal of Geographical Systems, 9, 145–166.
Wheeler D, Tiefelsdorf M (2005) Multicollinearity and correlation among local regression coefficients
in geographically weighted regression. Journal of Geographical Systems, 7, 161–187.
Wilhelm S, de Matos M (2013) Estimating spatial probit models in R. The R Journal, 5(1), 130–143.
Windle J (2016) BayesLogit. https://www.rdocumentation.org/packages/BayesLogit/versions/0.6
Winkelmann R, Zimmermann K F (1995) Recent developments in count data modelling: Theory and
application. Journal of Economic Surveys, 9(1), 1–24.
Winship C, Western B (2016) Multicollinearity and model misspecification. Sociological Science, 3,
627–649.
Xu X, Ghosh M (2015) Bayesian variable selection and estimation for group lasso. Bayesian Analysis,
10(4), 909–936.
Yi N, Ma S (2012) Hierarchical shrinkage priors and model fitting for high-dimensional generalized
linear models. Statistical Applications in Genetics and Molecular Biology, 11(6). DOI: https://doi.
org/10.1515/1544-6115.1803.
Yuan M, Lin Y (2005) Efficient empirical Bayes variable selection and estimation in linear models.
Journal of the American Statistical Association, 100, 1215–1224.
Zellner A, Siow A (1980) Posterior odds ratios for selected regression hypotheses, pp 585–603, in
Bayesian Statistics: Proceedings of the First International Meeting Held in Valencia, eds Bernardo J,
DeGroot M, Lindley D, Smith A. University of Valencia Press.
Zeugner S, Feldkircher M (2015) Bayesian model averaging employing fixed and flexible priors: The
BMS package for R. Journal of Statistical Software, 68(4), 1–37.
Zhou M, Li L, Dunson D, Carin L (2012) Lognormal and gamma mixed negative binomial regression.
Proceedings of the 29th International Conference on Machine Learning, 2012, 1343–1350.
Zigler C, Dominici F (2014) Uncertainty in propensity score estimation: Bayesian methods for vari-
able selection and model-averaged causal effects. Journal of the American Statistical Association,
109(505), 95–107.
Zigler C, Watts K, Yeh R, Wang Y, Coull B, Dominici F (2013) Model feedback in Bayesian propensity
score estimation. Biometrics, 69(1), 263–273.
8
Bayesian Multilevel Models
8.1 Introduction
The rationale for applying multilevel models to hierarchical data is well-established
(Snijders and Bosker, 1999; Skrondal and Rabe-Hesketh, 2004). When lower level units are
nested within one or more higher level strata, conventional single-level regression analy-
sis is not appropriate, since observations are no longer independent: pupils in the same
schools, or households in the same communities, tend to be more similar to one another
than pupils in different schools or households in different communities. Such dependency
means standard errors are downwardly biased if the nesting is ignored, and spurious
inferences regarding predictor or treatment effects may be made (Hox, 2002; Aarts et al.,
2015; Bliese and Hanges, 2004).
In multilevel analysis, predictors may be defined at any level and the interest focuses
on adjusting predictor effects for the simultaneous operation of contextual and individual
variability in the outcome. This may be important in health applications, for example, if
impacts of individual-level risk factors vary by geographic context (Congdon and Lloyd,
2010). Another major goal is variance partitioning (Goldstein et al., 2002; Gelman and
Pardoe, 2006); for example, what proportion of area variations in crime rates is due to
characteristics of those areas (what is sometimes termed “contextual variation”), and how
much is due to the characteristics of the individuals who live in these areas (termed “com-
positional variation”) (Subramanian et al., 2003).
One may also be interested in estimates for geographic areas or institutions that include
both individual and area information; for example, the multilevel model for county radon
estimates discussed by Gelman (2006). Gelman (2006) notes that compared to estimates
involving no pooling or complete pooling, inferences from multilevel models are more
reasonable. Complete pooling leads to identical estimates for all units, while a no-pooling
model (with no borrowing of strength) overfits the data, giving implausibly high or low
estimates for particular units and low precision for such estimates.
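The contrast between the three pooling schemes can be sketched in base R. This is only an illustration under stated assumptions: the data are simulated, the variance components are treated as known, the grand mean is plugged in as the prior mean, and all object names are hypothetical rather than taken from the book's code.

```r
# Illustrative sketch (simulated data; names and values are assumptions).
set.seed(1)
m     <- 8                                  # number of clusters
n_i   <- c(3, 5, 8, 10, 15, 20, 40, 60)     # cluster sizes
sigma <- 2                                  # level 1 sd (treated as known)
tau   <- 1                                  # between-cluster sd (treated as known)
mu    <- 10                                 # overall mean
theta <- rnorm(m, mu, tau)                  # true cluster means
y     <- lapply(seq_len(m), function(i) rnorm(n_i[i], theta[i], sigma))

no_pool  <- sapply(y, mean)                 # separate cluster means (no pooling)
complete <- rep(mean(unlist(y)), m)         # one common mean (complete pooling)

# Partial pooling: precision-weighted compromise between the two extremes
w       <- (n_i / sigma^2) / (n_i / sigma^2 + 1 / tau^2)
partial <- w * no_pool + (1 - w) * complete

round(cbind(n_i, no_pool, partial, complete), 2)
```

Small clusters receive low weights w and are pulled most strongly towards the common mean, whereas large clusters stay close to their own averages.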
As well as predictor effects at any level, a multilevel model is likely to involve ran-
dom effects defined over the clusters at higher level(s), and possibly correlation between
different cluster effects. As in Chapter 4, one seeks to pool strength in inferences
about clusters when the number of observations for each cluster might be quite small.
While exchangeable cluster effects dominate the multilevel literature, there may well
be instances where random cluster effects are better regarded as non-exchangeable, as
recognised in the general design generalised linear mixed model of Zhao et al. (2006). For
example, it is possible that the significance level of cluster effects is overstated in area
multilevel applications that disregard spatial dependence between clusters (Chaix et al.,
2005; Dong et al., 2016).
with $b_i$ and $u_{ij}$ denoting random cluster effects and observation level random effects
respectively. The intercept $x_{1ij} = 1$ with parameter $\beta_1$ is included in $x_{ij}$. With
$N = \sum_{i=1}^{m} n_i$ total observations, the nested form of the model is
$$y = X\beta + Zb + u,$$
where $y$ is $N \times 1$, $X \equiv (X_1', \ldots, X_m')'$ is $N \times p$, with
$X_i = (x_{i1}, \ldots, x_{in_i})'$ of dimension $n_i \times p$, and where the $N \times mq$ matrix
$Z$ is block diagonal with $m$ diagonal blocks $Z_i = (z_{i1}, \ldots, z_{in_i})'$ of dimension
$n_i \times q$ (Gamerman, 1997, p.61; Zhao et al., 2006, p.3). Here $\beta$ is a $(p \times 1)$
vector of population parameters and $b_i = (b_{1i}, \ldots, b_{qi})'$ is a $q \times 1$ vector of
zero mean cluster specific deviations around those population parameters, with $b_i$ assumed random.
While random effects models offer a way to borrow strength (e.g. when level 2 cluster
sizes ni are relatively small), fixed effects models, particularly for varying intercepts, are
nevertheless often advocated in longitudinal applications, especially in econometrics. Fixed effects
for parameter collections are sometimes used in cross-sectional multilevel applications
(Snijders and Berkhof, 2002). The choice between the two depends on the purpose of the
statistical inference and how far the level 2 units can be regarded as a sample from a
policy-relevant population (Draper, 1995). If the sampled clusters are representative of
(exchangeable with) a wider population, then a random coefficient model is, in principle,
appropriate (Hsiao, 1996). If statistical inference is confined to the particular unique set of
level 2 units included in a data set, then a fixed effects model may be more appropriate.
The conjugate linear normal model with random cluster effects assumes multivariate
normality for these effects and for the observation level errors. Assuming the zij are a sub-
vector of xij, the cluster effects have zero mean, so that
The total impact of $x_{rij}$ is then obtained by cumulating over fixed and random components
as $\beta_r + b_{ri}$.
Assume the unstructured level 1 errors $u_i = (u_{i1}, \ldots, u_{in_i})'$ have prior
$u_i \sim N_{n_i}(0, H_i)$ where $H_i$ represents the within-cluster dispersion matrix. The stacked
form of the linear mixed model at cluster level, namely $y_i = X_i\beta + Z_i b_i + u_i$, may then
be expressed in joint likelihood form as
$$\begin{pmatrix} y_i \\ b_i \end{pmatrix} \sim N_{n_i+q}\left( \begin{pmatrix} X_i\beta \\ 0 \end{pmatrix}, \begin{pmatrix} Z_i\Sigma_b Z_i' + H_i & Z_i\Sigma_b \\ \Sigma_b Z_i' & \Sigma_b \end{pmatrix} \right),$$
or in marginal form as
$$y_i \sim N_{n_i}(X_i\beta, Z_i\Sigma_b Z_i' + H_i).$$
The level 1 errors are typically assumed to be independent, given cluster effects and regression
terms, often with $H_i = \sigma^2 I$ for all clusters.
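As a small numerical illustration of the marginal form, the following base R sketch builds the marginal covariance of one cluster's responses under a random intercept and slope model with independent level 1 errors. The design, the cluster covariance matrix and the level 1 variance are all assumed values chosen for illustration.

```r
# Sketch: marginal covariance of y_i in a random intercept and slope model.
# All numerical values are assumptions chosen for illustration.
n_i     <- 4
z       <- c(0, 1, 2, 3)                    # level 1 predictor with a random slope
Z_i     <- cbind(1, z)                      # n_i x q random effects design (q = 2)
Sigma_b <- matrix(c(1.0, 0.3,
                    0.3, 0.5), 2, 2)        # covariance of the cluster effects b_i
sigma2  <- 2                                # level 1 variance
H_i     <- sigma2 * diag(n_i)               # independent level 1 errors

# Cov(y_i) after integrating out b_i
V_marg <- Z_i %*% Sigma_b %*% t(Z_i) + H_i
V_marg
cov2cor(V_marg)                             # implied within-cluster correlations
```

Even though the level 1 errors are independent, the shared cluster effects induce correlation between observations in the same cluster, and the random slope makes both variances and correlations depend on z.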
The conjugate model then takes inverse gamma and inverse Wishart priors for $\sigma^2$ and
$\Sigma_b$ respectively (or gamma and Wishart priors on $\sigma^{-2}$ and $\Sigma_b^{-1}$), and
common practice is to adopt just proper priors, e.g. $\sigma^2 \sim IG(\varepsilon, \varepsilon)$
where $\varepsilon$ is small. Research has shown that such priors can lead to effectively improper
posteriors and also that inferences are sensitive to the choice of hyperparameters (Natarajan and
McCulloch, 1998). Alternatives for the level 1 variance include uniform or half t priors on $\sigma$
(Gelman, 2006b), while hierarchical models for $\Sigma_b$ are considered by Daniels and Kass (1999)
and Daniels and Zhao (2003). A separation strategy using the LKJ (Lewandowski, Kurowicka and Joe)
prior is another option (McElreath, 2016).
Following Gamerman (1997), one may sometimes also include random predictor effects
at observation level
which is one way of specifying what is known as complex level 1 variation or heterosce-
dasticity related to level 1 attributes (Browne et al., 2002). This means that variances
depend on subject level predictors (when subjects j are nested in clusters i) or in panel data
applications that variances are changing over time (when times t are nested in subjects i).
For categorical wij, one may equivalently specify complex variation in terms of category-
specific variances. Thus Goldstein (2005) considers school exam data yij (pupils j nested in
schools i), with a single predictor gender xij (=1 for boy, 0 for girl). Then level 1 heterosce-
dasticity can be represented as
where $u_{0ij} \sim N(0, \sigma_0^2)$ is the prior for girl observation level errors, and
$u_{1ij} \sim N(0, \sigma_1^2)$ is the prior for boy observation level errors. Equivalently,
setting $w_{ij} = x_{ij}$,
$$y_{ij} \sim N(\beta_1 + \beta_2 x_{ij}, \sigma^2_{w_{ij}}).$$
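Category-specific level 1 variances of this kind can be illustrated with a simulated example in base R. The sketch ignores the school clustering for brevity, and the coefficient values, sample size and object names are assumptions made only for illustration.

```r
# Sketch: level 1 variance differing by a binary predictor (simulated data).
set.seed(2)
n    <- 4000
x    <- rbinom(n, 1, 0.5)                  # 1 = boy, 0 = girl
beta <- c(50, -2)                          # assumed intercept and gender effect
s    <- c(6, 8)                            # assumed sd for girls (x = 0), boys (x = 1)
y    <- beta[1] + beta[2] * x + rnorm(n, 0, s[x + 1])

fit         <- lm(y ~ x)
sd_by_group <- tapply(resid(fit), x, sd)   # category-specific residual sd
sd_by_group
```

The two residual standard deviations recover the values used in the simulation, whereas a single pooled residual variance would average over them.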
It can be seen that random variation over clusters or at level 1 in specification (8.1) raises
questions of empirical identification (see Chapter 1), as the fixed regression effects are
confounded with the mean of the associated cluster random effect. Suppose
$x_{ij} = (x_{fij}, x_{hij})$ and $\beta = (\beta_f, \beta_h)$, where $x_{fij}$ of dimension $p - q$
contains predictors where no variation in clusters is posited, while $x_{hij}$ contains predictors
(usually including the constant term) which have a randomly varying effect over clusters.
Under hierarchical centring of the cluster effects, which has been argued to improve
MCMC convergence (Gelfand et al., 1995), varying cluster effects $\gamma_{ri}$ are centred on
$\beta_{hr}$ so that the rth varying predictor effect is $\gamma_{ri} = \beta_{hr} + b_{ri}$ in
cluster i. The parameterisation $(\beta, b_i) = ([\beta_f, \beta_h], b_i)$, with zero mean $b_i$,
is replaced by the parameterisation $(\beta_f, \gamma_i)$ where $\gamma_i = \beta_h + b_i$. Then
$$(\gamma_{1i}, \ldots, \gamma_{qi}) \sim N_q(\beta_h, \Sigma_\gamma),$$
where the vectors $z_{ij}$ and $x_{ij}$ are now distinct, with $x_{ij}$ now containing only
$x_{fij}$, while $z_{ij} = x_{hij}$.
$$y_i = Z_i\beta_i + u_i, \quad i = 1, \ldots, m \qquad (8.3)$$
$$\beta_i = \kappa W_i + b_i$$
where $y_i = (y_{i1}, \ldots, y_{in_i})'$ is $n_i \times 1$, $\kappa$ is $q \times r$, $Z_i$ is
$n_i \times q$, $\beta_i$ is a $q \times 1$ vector of random cluster regression parameters, and the
errors $u_i = (u_{i1}, \ldots, u_{in_i})'$ have prior $u_{ij} \sim N(0, \sigma^2)$. The level 2
regression for $\beta_i$ involves a fixed effect parameter matrix $\kappa$, and errors
$b_i = (b_{1i}, \ldots, b_{qi})'$ with mean zero and precision matrix $T_b$. Substituting the second
equation in (8.3) into the first yields the model
$$y_i = Z_i\kappa W_i + Z_i b_i + u_i.$$
To constrain the effect of one or more level 1 predictors to have an identical effect across all
clusters, the model may be reformulated as the mixed model (8.2) above.
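In (8.3), one may assume flat (uniform) priors for $\kappa$, and gamma and Wishart priors for
$\sigma^{-2}$ and $T_b$, namely $1/\sigma^2 \sim Ga(a_u, b_u)$, $T_b \sim W(S_e, \nu_e)$. Also define
$r_{ij} = y_{ij} - Z_{ij}\beta_i$, $\hat{\beta}_i = (Z_i'Z_i)^{-1}Z_i'y_i$,
$V_i = (\sigma^{-2}Z_i'Z_i + T_b)^{-1}$, $\tilde{V}_i = \sigma^2(Z_i'Z_i)^{-1}$,
$\Lambda_i = (\tilde{V}_i^{-1} + T_b)^{-1}\tilde{V}_i^{-1}$, $U_i = (\beta_i - \kappa W_i)$ and
$G = [\sum_{i=1}^{m} W_i'T_bW_i]^{-1}$. Then
$$1/\sigma^2 \sim Ga\left(0.5(a_u + N),\; 0.5\left[b_u + \sum_{i=1}^{m}\sum_{j=1}^{n_i} r_{ij}^2\right]\right)$$
$$\beta_i \sim N_q\left(\Lambda_i\hat{\beta}_i + (I - \Lambda_i)\kappa W_i,\; V_i\right)$$
$$T_b \sim W\left(S_e + \sum_{i=1}^{m} U_iU_i',\; m + \nu_e\right)$$
$$\kappa \sim N_r\left(G\sum_{i=1}^{m} W_i'T_b\beta_i,\; G\right).$$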
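The update for the cluster regression parameters can be sketched in base R. The sketch below forms the conditional mean as a matrix-weighted average of the within-cluster least squares estimate and the level 2 regression mean, then takes one draw. The data, the current values of the variance parameters and the level 2 mean (standing in for the current value of the level 2 regression) are all assumed for illustration, and the weight matrix is written in the algebraically equivalent form V_i (Z_i'Z_i / sigma^2).

```r
# Sketch: one full-conditional draw for beta_i, with assumed inputs.
set.seed(3)
n_i        <- 12; q <- 2
Z_i        <- cbind(1, rnorm(n_i))               # n_i x q design
y_i        <- as.vector(Z_i %*% c(1, 0.5) + rnorm(n_i))
sigma2     <- 1                                  # current value of sigma^2
T_b        <- diag(c(4, 2))                      # current precision matrix of b_i
prior_mean <- c(0.8, 0.3)                        # current level 2 mean (assumed)

ZtZ       <- crossprod(Z_i)
beta_hat  <- solve(ZtZ, crossprod(Z_i, y_i))     # within-cluster least squares
V_i       <- solve(ZtZ / sigma2 + T_b)           # conditional covariance
Lambda_i  <- V_i %*% (ZtZ / sigma2)              # weight on the least squares estimate
cond_mean <- Lambda_i %*% beta_hat + (diag(q) - Lambda_i) %*% prior_mean

# Draw beta_i ~ N_q(cond_mean, V_i) using the lower Cholesky factor of V_i
beta_i <- as.vector(cond_mean + t(chol(V_i)) %*% rnorm(q))
beta_i
```

When the cluster is small or the level 2 precision is high, the weight matrix shrinks the draw towards the level 2 mean; with many observations it approaches the least squares estimate.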
$$u_{ij} \sim N(0, \sigma^2)$$
$$(\gamma_{1i}, \gamma_{2i}) \sim N_2(\beta_h, \Sigma_\gamma),$$
where $x_{ij} = (\text{gend})$ excludes an intercept, and $z_{ij} = (1, \text{homework})$, with
$\beta_{h1}$ providing the regression intercept.
The brms package is applied to assess gain in fit, using WAIC (widely applicable
information criterion) and LOO-IC (leave-one-out information criterion), through add-
ing the extra source of cluster variability. The command form
BRMS2 = brm(y ~ 1 + homework + gend + (1 + homework | sch), data = D,
            family = "gaussian", chains = 2)
ensures that mean random effects in model 2 are zero. The default setting for the
LKJ prior for the random effects correlation matrix is adopted, with shape parameter
1 (Buerkner, 2017, p.4). Sensitivity may be assessed, for example, by specifying
set_prior("lkj(2)", class = "cor") or set_prior("lkj(0.5)", class = "cor").
There is a substantial gain in fit in adding homework random slopes according to both
the WAIC and LOO-IC criteria, which in this example have very similar values.
The WAIC falls from 3712.7 to 3578, and the LOO-IC from 3712.9 to 3579.6.
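For readers who want to see what lies behind a WAIC value, the criterion can be computed from a matrix of pointwise log-likelihoods using the usual definition, WAIC = −2(lppd − p_waic). The base R function below is a hypothetical helper written for illustration (it is not part of brms, which reports the criterion through its own waic() and loo() methods), and the toy normal mean model and its approximate posterior draws are assumptions.

```r
# Sketch: WAIC from an S x N matrix of pointwise log-likelihoods
# (S posterior draws, N observations). The function name is hypothetical.
waic_from_loglik <- function(loglik) {
  # log pointwise predictive density, using a stable log-mean-exp per column
  lppd_i <- apply(loglik, 2, function(l) {
    mx <- max(l)
    mx + log(mean(exp(l - mx)))
  })
  p_i <- apply(loglik, 2, var)           # pointwise effective parameter count
  c(lppd   = sum(lppd_i),
    p_waic = sum(p_i),
    waic   = -2 * (sum(lppd_i) - sum(p_i)))
}

# Toy illustration: normal mean model with known unit variance
set.seed(4)
y    <- rnorm(50, 1, 1)
mu_s <- rnorm(1000, mean(y), 1 / sqrt(length(y)))   # approximate posterior draws
ll   <- sapply(y, function(yi) dnorm(yi, mu_s, 1, log = TRUE))   # 1000 x 50
out  <- waic_from_loglik(ll)
out
```

With a single unknown mean, the effective parameter count p_waic should be close to one; comparing two models then amounts to comparing their waic entries, lower values indicating better expected predictive fit.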
R2OpenBUGS codes for these models include exceedance checks Pr(yij,rep > yij|y) based
on the mixed predictive method (Marshall and Spiegelhalter, 2007; Green et al., 2009).
Exceedance checks are also included at cluster level, obtained by checking school aver-
aged replicates of yij,rep against school averages on the response. In the R2OpenBUGS
code for the second model, a Wishart prior with identity scale matrix and 2 degrees of
freedom is assumed for the cluster precision matrix $\Sigma_\gamma^{-1}$, and a Ga(1,0.001) prior
for the observation level precision $\sigma^{-2}$. Predictors are centred, but not standardised (as in BRMS).
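The mechanics of an exceedance check can be sketched generically in base R, given a matrix of replicate responses sampled from the model. The sketch does not reproduce the mixed predictive method itself (which requires replicate cluster effects to be sampled within the model); the replicate matrix here is simulated as a stand-in, and the school labels and object names are hypothetical.

```r
# Sketch: exceedance probabilities from replicate draws (simulated stand-ins).
set.seed(5)
N    <- 30; S <- 2000
sch  <- rep(1:5, each = 6)                       # hypothetical school labels
y    <- rnorm(N, 50, 8)                          # stand-in for observed responses
yrep <- matrix(rnorm(S * N, 50, 8), S, N)        # stand-in for S x N model replicates

# Pr(y_rep > y | y) per observation: share of draws exceeding the observed value
p_exc <- colMeans(sweep(yrep, 2, y, ">"))

# Cluster level: compare school averages of the replicates with observed averages
ybar      <- as.vector(tapply(y, sch, mean))
yrep_bar  <- t(apply(yrep, 1, function(r) tapply(r, sch, mean)))   # S x 5
p_exc_sch <- colMeans(sweep(yrep_bar, 2, ybar, ">"))

round(p_exc_sch, 2)
mean(p_exc < 0.05 | p_exc > 0.95)                # share of extreme observations
```

Probabilities near 0 or 1 flag observations or clusters that the model under-predicts or over-predicts.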
WAIC measures are very similar between the LKJ and Wishart approaches to model
2, at just under 3580. However, there is sensitivity to priors in covariance estimates:
the LKJ prior identifies a negative correlation of −0.78 between school intercepts and
slopes, whereas the Wishart prior method estimates a positive correlation of around
0.35. In fact, the 23 observed school-level averages on achievement and homework also
show a positive correlation of 0.40. Random intercepts under the Wishart model 2 have
a correlation of 0.90 with observed average school achievement levels, as against a cor-
responding correlation of 0.48 under the LKJ prior. Sensitivity may be partly related to
small cluster sizes (e.g. schools 2 and 3 have under 10 pupils).
Cross-validatory checks at school level (testmx.sch in the R2OpenBUGS code) under
model 2 show a 96% probability of overprediction for school 17, and a 6% probability of
overprediction for school 2. This may indicate the need to adjust for school-level predic-
tors, or to adopt a cluster effects scheme that is more robust to outlier schools. However,
this is an improved performance over model 1, which shows three schools with mixed
exceedance probabilities under 0.05 or over 0.95 (schools 8, 17 and 18).
Individual pupils with extreme cross-validatory checks differ according to cluster ran-
dom effects approach. Both models have under 10% of cases with mixed cross-validatory
probabilities either exceeding 0.95 or under 0.05 (cvtail[1] and cvtail[2] in the code). For
the random intercepts model, the lowest (highest) exceedance probabilities are for sub-
jects 51 and 88 respectively, subject 51 having zero homework hours but a relatively high
achievement of 67, while subject 88 has 5 homework hours but achievement of 33.
The random intercepts and slopes model shows widely discrepant homework effects
between schools (under both LKJ and Wishart priors). Hence, outlier pupils may be
identified if they are discrepant with the cluster sub-model defined by school-specific
intercepts and slopes.
A third analysis illustrates the economy of coding possible with rstan and assumes
σ2 differing by gender (complex level 1 variation). An LKJ prior is assumed on the inter-
cepts-slopes correlation. The posterior mean residual standard deviation is found to
be slightly lower for females as compared to males (7.05 vs. 7.64), but the LOO-IC is
unchanged (in fact slightly increased) at 3581.
outcomes. Thus, consider univariate observations yij, with repetitions j nested in clusters i,
that, conditional on cluster effects bi, follow an exponential family density
$$f(y_{ij}|b_i) \propto \exp\left\{\frac{y_{ij}\theta_{ij} - d(\theta_{ij})}{\phi_{ij}} + c(y_{ij}, \phi_{ij})\right\},$$
where $\theta_{ij}$ is the canonical parameter and $\phi_{ij}$ is usually a known scale parameter.
Additionally, $E(y_{ij}|\theta_{ij}) = d'(\theta_{ij})$ and
$Var(y_{ij}|\theta_{ij}, \phi_{ij}) = d''(\theta_{ij})\phi_{ij}$. For example, under the Poisson,
$d(u) = \exp(u)$, and for binomial data, $d(u) = \log(1 + e^u)$. Taking the regression terms as
$\eta_{ij} = g(\theta_{ij})$ where g
is a link function, the observation level model (including a level 2 regression on cluster
attributes) is
bi = kWi + e ,
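The stated link between the function d and the first two moments can be checked numerically. The base R sketch below differentiates d by finite differences for the Poisson case and for the single-trial binomial case; the helper names and the chosen value of the canonical parameter are assumptions made for illustration.

```r
# Sketch: numerical check that d'(theta) and d''(theta) give the mean and variance.
num_deriv  <- function(f, x, h = 1e-4) (f(x + h) - f(x - h)) / (2 * h)
num_deriv2 <- function(f, x, h = 1e-4) (f(x + h) - 2 * f(x) + f(x - h)) / h^2

d_pois  <- function(u) exp(u)             # Poisson
d_binom <- function(u) log(1 + exp(u))    # binomial with a single trial

theta <- 0.7
# Poisson: mean and variance both equal exp(theta)
c(num_deriv(d_pois, theta), num_deriv2(d_pois, theta), exp(theta))

# Single-trial binomial: mean p = plogis(theta), variance p * (1 - p)
p <- plogis(theta)
c(num_deriv(d_binom, theta), num_deriv2(d_binom, theta), p, p * (1 - p))
```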
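Assume priors $\beta \sim N_p(a, R)$, $b_i \sim N_q(0, \Sigma_b)$ and $u_{ij} \sim N_r(0, \Sigma_u)$,
with inverse Wishart priors $\Sigma_b \sim IW(\nu_b, S_b)$ and $\Sigma_u \sim IW(\nu_u, S_u)$. Then
the full posterior conditional for each $b_i$ vector is
$$p(b_i|b_{[i]}, \beta, u, \Sigma_b, \Sigma_u) \propto \exp\left\{-0.5\,b_i'\Sigma_b^{-1}b_i + \sum_{j=1}^{n_i}\frac{y_{ij}\theta_{ij} - d(\theta_{ij})}{\phi_{ij}}\right\},$$
and that for each $u_{ij}$ is
$$p(u_{ij}|u_{[ij]}, \beta, b, \Sigma_b, \Sigma_u) \propto \exp\left\{-0.5\,u_{ij}'\Sigma_u^{-1}u_{ij} + \frac{y_{ij}\theta_{ij} - d(\theta_{ij})}{\phi_{ij}}\right\}.$$
Additionally, the covariance matrices have inverse Wishart full conditionals, namely
$$\Sigma_b \sim IW\left(\nu_b + m,\; S_b + \sum_{i=1}^{m} b_ib_i'\right),$$
$$\Sigma_u \sim IW\left(\nu_u + \sum_{i=1}^{m} n_i,\; S_u + \sum_{i,j} u_{ij}u_{ij}'\right).$$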
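A draw from an inverse Wishart full conditional of this form can be obtained in base R by inverting a Wishart draw. In the sketch below the current cluster effects are simulated stand-ins, and the prior degrees of freedom and scale matrix are assumed values; the parameterisation used is the common one in which the inverse of a Wishart variate with scale matrix S^{-1} follows an inverse Wishart with scale S.

```r
# Sketch: one draw of Sigma_b from its inverse Wishart full conditional.
set.seed(6)
m    <- 40; q <- 2
b    <- matrix(rnorm(m * q), m, q)         # stand-in for current effects (rows b_i')
nu_b <- q + 1                              # assumed prior degrees of freedom
S_b  <- diag(q)                            # assumed prior scale matrix

df_post <- nu_b + m
S_post  <- S_b + crossprod(b)              # S_b + sum_i b_i b_i'

# If W ~ Wishart(df, S^{-1}) then W^{-1} ~ inverse Wishart(df, S)
W       <- rWishart(1, df = df_post, Sigma = solve(S_post))[, , 1]
Sigma_b <- solve(W)
Sigma_b
```

Within an MCMC scheme this step would be repeated at every iteration using the current values of the cluster effects.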
with probability $\pi_{ijk}$ that option k is chosen by subject j in cluster i, namely that
$y_{ij} = k$ (or $d_{ijk} = 1$) where options are unordered. A particular choice
($k \in 1, \ldots, K$) made by subject j in cluster i results from comparing the latent utilities
of all options $(\eta_{ij1}, \ldots, \eta_{ijK})$, with
where the $\eta_{ijk}$ include systematic effects and random errors $\varepsilon_{ijk}$. Suppose the
errors follow a Gumbel (extreme value type I) density, namely
$P(\varepsilon) = \exp(-\varepsilon - \exp(-\varepsilon))$, then since differences between Gumbel
errors follow a standard logistic distribution, the choice probabilities reduce to the multinomial
logit (Hedeker, 2003, p.1439).
Predictors in the systematic term may be defined at option-subject, or at option level, but
consider subject level predictors $x_{ij}$ and $z_{ij}$ (e.g. voter age) of respective dimensions
p and q, that may vary according to cluster i. Then with the final category as a reference, fixed
effect parameters and random effects are specific to choices k, with K − 1 sets of random effects
$b_{ih}$ each of dimension q,
$$\Pr(y_{ij} = K) = \frac{1}{1 + \sum_{h=1}^{K-1}\exp(\alpha_h + x_{ij}\beta_h + z_{ij}b_{ih})}.$$
The $b_i = (b_{i1}, \ldots, b_{i,K-1})$ are multivariate zero mean effects, typically assumed
multivariate normal.
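A minimal base R sketch of these choice probabilities follows, with the final category as reference (linear predictor fixed at zero). The function name and all parameter values are illustrative assumptions; a single predictor and cluster random intercepts only are used for brevity.

```r
# Sketch: multinomial logit probabilities with category K as reference.
mnl_probs <- function(eta) {
  # eta: vector of K - 1 linear predictors, one per non-reference option
  denom <- 1 + sum(exp(eta))
  c(exp(eta), 1) / denom                  # last element is Pr(y = K)
}

alpha <- c(0.2, -0.5)                     # assumed option intercepts (K = 3)
beta  <- c(0.8, 0.1)                      # assumed option-specific slopes
b_i   <- c(0.3, -0.2)                     # assumed cluster random intercepts
x     <- 1.5                              # subject level predictor value

eta   <- alpha + x * beta + b_i
probs <- mnl_probs(eta)
probs
```

The probabilities sum to one by construction, and setting all linear predictors to zero gives equal probabilities for the K options.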
$$\eta_{ij} = x_{ij}\beta + b_i,$$
where $b_i \sim N(0, \sigma_b^2)$. Since the variance of the standard logistic is $\pi^2/3$, the
intraclass correlation at level 2 may be obtained as $\sigma_b^2/(\sigma_b^2 + \pi^2/3)$, and
monitored over MCMC iterations. Moreover, if the composite fixed effect term $x_{ij}\beta$ is
monitored and its posterior variance $\sigma_F^2$ obtained, one may obtain a proportion of variance
explained by covariates as $\sigma_F^2/[\sigma_F^2 + \sigma_b^2 + \pi^2/3]$, where the
$\sigma_b^2$ in the denominator is taken at its posterior mean.
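Monitoring the intraclass correlation over MCMC iterations amounts to transforming each sampled value of the cluster standard deviation. The base R sketch below uses simulated values as a stand-in for posterior draws; the function name and the stand-in draws are assumptions.

```r
# Sketch: latent-scale intraclass correlation from draws of sigma_b.
icc_logit <- function(sigma_b) sigma_b^2 / (sigma_b^2 + pi^2 / 3)

set.seed(7)
sigma_b_draws <- abs(rnorm(2000, 0.8, 0.1))   # stand-in for posterior draws
icc_draws     <- icc_logit(sigma_b_draws)

c(mean = mean(icc_draws), quantile(icc_draws, c(0.025, 0.975)))
```

Summarising the transformed draws gives a posterior mean and credible interval for the intraclass correlation directly, rather than plugging a point estimate of the variance into the formula.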
provide information about an underlying metric variable $y_{ij}^*$ defined by cutpoints such that
$y_{ij} = k$ if
$$\kappa_{k-1} < y_{ij}^* \le \kappa_k,$$
where $\varepsilon$ is normal or logistic. If $x_{ij}$ excludes an intercept, there are K − 1
unknown cutpoints $(\kappa_1, \ldots, \kappa_{K-1})$, with $y_{ij} = 1$ if
$y_{ij}^* \le \kappa_1$, $y_{ij} = 2$ if $\kappa_1 < y_{ij}^* \le \kappa_2$, etc., and
$y_{ij} = K$ if $y_{ij}^* > \kappa_{K-1}$.
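The mapping from cutpoints to category probabilities can be sketched in base R for the logistic case, where the probability that the latent variable falls at or below a cutpoint is the logistic distribution function evaluated at the cutpoint minus the linear predictor. The cutpoints, the linear predictor value and the function name are assumptions for illustration, and the simulation simply checks the latent variable interpretation.

```r
# Sketch: category probabilities from cutpoints under a logistic latent error.
ord_probs <- function(eta, cuts) {
  # cumulative probabilities Pr(y <= k) = F(kappa_k - eta), k = 1, ..., K - 1
  cum <- plogis(cuts - eta)
  diff(c(0, cum, 1))                     # Pr(y = k), k = 1, ..., K
}

cuts  <- c(-1.4, -0.1, 1.1)              # assumed cutpoints (K = 4)
probs <- ord_probs(eta = 0.5, cuts = cuts)
probs

# Equivalent latent variable view: y* = eta + eps, eps ~ standard logistic
set.seed(8)
ystar <- 0.5 + rlogis(1e5)
y     <- cut(ystar, c(-Inf, cuts, Inf), labels = FALSE)   # intervals (a, b]
round(rbind(analytic = probs, simulated = tabulate(y, 4) / 1e5), 3)
```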
A standard logistic density for $\varepsilon_{ij}$ with mean 0, variance $\pi^2/3$, and
distribution function $F(\varepsilon) = \exp(\varepsilon)/(1 + \exp(\varepsilon))$ leads to a logit
link for the cumulative probabilities
$$\gamma_{ijk} = \sum_{m=1}^{k}\pi_{ijm} = \Pr(y_{ij}^* \le \kappa_k) = \Pr(y_{ij} \le k), \quad k = 1, \ldots, K-1$$
with $\pi_{ijK} = 1 - \sum_{m=1}^{K-1}\pi_{ijm}$. Taking $\varepsilon_{ij} \sim N(0, 1)$ corresponds
to a probit link for $\gamma_{ijk}$. For $\varepsilon$ logistic,
the hierarchical regression is expressed as follows
that is,
With a logit link to predictors, and level 2 regression involving cluster level predictors
$W_i$, one has
To provide robustness (e.g. to outlier clusters), the $e_i$ may be taken as Student t distributed
(see Section 8.5). The prior on $\delta_i$ has the form
$$p(\delta_i) \propto \frac{\delta_i}{(h_i + \delta_i)^2},$$
where $h_i = \min_{j \in 1, \ldots, n_i}(T_{ij})$. For Poisson data, one has
$y_{ij} \sim Po(o_{ij}\theta_{ij})$, where $o_{ij}$ is an offset for the expected response, and
$\theta_{ij} \sim Ga(\mu_{ij}\delta_i, \delta_i)$. The regression model then involves a log link for
the $\mu_{ij}$,
More specialised models apply for particular data structures. For example, Van Duijn and
Jansen (1995) suggest a model for repeated counts (e.g. tests j = 1, … , ni within students
$i = 1, \ldots, m$) with Poisson means
$$\mu_{ij} = \nu_i\delta_{ij},$$
and gamma distributed student ability effects $\nu_i \sim Ga(a_1, a_2)$, where $a_1$ and $a_2$ are
additional parameters, and the $\delta_{ij}$ represent subject specific difficulty parameters for
tests j, with identifiability constraint $\sum_j \delta_{ij} = 1$, and prior
where the $\xi_j$ are also unknowns. If the subjects fall into known (or possibly unknown)
groups $k = 1, \ldots, K$ with allocation indicators $S_i \in (1, \ldots, K)$, then a more general
model specifies $(\nu_i|S_i = k) \sim Ga(a_{1k}, a_{2k})$.
A conjugate structure for stratified area health counts is considered by Dean and
MacNab (2001). Thus for micro areas $j = 1, \ldots, n_i$ nested within larger areas
$i = 1, \ldots, m$, let $\mu$ be an average event rate across all m areas, and $T_{ij}$ be
populations at risk. Assume first cluster level overdispersion represented by effects $\rho_i$, so
that $y_{ij} \sim Po(\mu T_{ij}\rho_i)$, where $\rho_i$ have mean 1, and let the mean and variance
of $y_{i+}$ be $T_{i+}\mu$ and $T_{i+}\mu(1 + \sigma_\rho^2)$. Under gamma mixing
$$\rho_i \sim Ga\left(\frac{T_{i+}\mu}{\sigma_\rho^2}, \frac{T_{i+}\mu}{\sigma_\rho^2}\right),$$
with variance $\sigma_\rho^2/(T_{i+}\mu)$. The interpretation is that $\rho_i$ represents the
average relative risk over the $T_{i+}$ individuals in area i.
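The implied moments under this gamma mixing can be checked by simulation. The base R sketch below treats a single area total, with an assumed event rate, population at risk and overdispersion parameter; all values and object names are illustrative.

```r
# Sketch: simulation check of the mean and variance implied by gamma mixing.
set.seed(9)
mu     <- 0.01                   # assumed average event rate
T_plus <- 5000                   # assumed population at risk in area i
s2_rho <- 0.5                    # assumed overdispersion parameter
a      <- T_plus * mu / s2_rho   # common shape and rate, so that E(rho) = 1

n_sim <- 2e5
rho   <- rgamma(n_sim, shape = a, rate = a)
y     <- rpois(n_sim, mu * T_plus * rho)

c(mean_sim = mean(y), mean_theory = T_plus * mu)
c(var_sim  = var(y),  var_theory  = T_plus * mu * (1 + s2_rho))
c(var_rho_sim = var(rho), var_rho_theory = s2_rho / (T_plus * mu))
```

The simulated variance of the counts exceeds the Poisson variance by the factor (1 + overdispersion parameter), which is the sense in which the parameter measures overdispersion at area level.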
where wb[h] denotes the observed binary outcome. The R2OpenBUGS analysis centres
the group intercepts around the impact of group average hours worked.
Both methods of estimation report a stronger impact of group average hours worked
than individual hours worked on individual well-being, with respective posterior means
(sd) of −0.27 (0.05) and −0.10 (0.02) (respectively beta[2] and beta[1] in the R2OpenBUGS
code).
Subject level mixed predictive checks (Marshall and Spiegelhalter, 2007) are based
on sampling replicate cluster intercepts, and these predictive checks are aggregated
to group level (testmx.grp in the R2OpenBUGS code). These show well-being in some
groups to be much better explained than in others, with average predictive success
varying from 0.50 (group 68) to 0.84 (group 57). Both the brms logistic regression and
the augmented data logistic regression show around 22% of subjects with predictive
concordance below 0.50 (the model does not improve on guesswork for such subjects).
with two sets of higher level errors, namely level 3 random errors $u_{3i}$ with variance
$\sigma_3^2$, and level 2 class errors $u_{2ij}$ with variance $\sigma_2^2$ (pertaining to effects
of classrooms within schools). Then
with N(0,1000) priors on fixed effects and U(0,1000) priors on the random effect standard
deviations. Predictors are centred.
A two-chain run of 2,500 iterations with 500 burn-in gives posterior means for the
cutpoints (κ1, κ2, κ3) of −1.39, −0.11, and 1.10, with a significant coefficient of 0.83 on the
curriculum intervention and significant influence also of pre-intervention score, but no
significant effects for TV or the interaction term. The posterior means for σ2 and σ3 are
0.41 and 0.30 (for classrooms and schools respectively) with densities bounded away
from zero. Similar estimates are obtained using brms and rstan.
By contrast, a maximum likelihood analysis using numerical quadrature reported by
Rabe-Hesketh et al. (2004) finds an insignificant school variance, and Vermunt (2013)
also finds a model with class effects only to be the best fitting.
The analysis was also carried out using the latent data approach, which may be use-
ful for obtaining intraclass correlations or for model checking. This produces larger
estimates of σ2 and σ3, namely 0.59 and 0.36, but similar fixed predictor and cutpoint
estimates. The worst fit (using pointwise WAIC) is for subjects 952 and 190, who have
respectively high (low) THK scores, despite low (high) pre-intervention THK scores and
absence (presence) of the curriculum intervention.
where
Alternatively, variation over the extra crossed factor may be applied to a different predic-
tor than those subject to random variation over the main level 2 classification. Often the
additional random effects would be confined to intercept variation over the extra crossed
factor, so that with q = 1 and zi1 = 1 also, one has
In these situations, the random effects are confounded and empirical identification may
be impeded. Selection between random effects may well be needed (Browne et al., 2001).
Another possible source of variation in crossed models is defined in cells formed by
cross-classification of two or more higher level factors. For example, N patients living in
a particular administrative health district may be classified into subpopulations s based
on intersections of their primary care general practitioner i1 = 1, … , m1 , and small area of
residence i2 = 1, … , m2 (Congdon and Best, 2000). Often there may be no subjects in certain
combinations of higher level factors. So define total non-empty cells as Sn, equal to or less
than the total S = m1m2 of all possible combinations, with different values s = 1,… Sn defined
by cross-hatched factor identifiers [i1,i2]. Let r = 1, … , N denote a single string subject level
identifier. Subjects will be classified by subpopulation sr ∈{1, … Sn } , by higher level factor 1
classification indicator h1r (general practitioner), higher level factor 2 classification indicator
h2r (small area of residence), and so on. Random intercept variation in a metric response
over the two factors and the cells then takes the form
where $(b_{1i}, \ldots, b_{qi}) \sim N_q(0, \Sigma_b)$, $i = 1, \ldots, m$. If the pupil predictors vary over affiliations, then
$$y_{ij} = x_{ij}\beta + \sum_{k=1}^{K_j} w_{jk} z_{jk} b_k + u_{ij}.$$
Multiple member schemes extend to data frames which are structured spatially or tem-
porally rather than nested. A particular kind of multiple member prior can be applied
to spatially configured count responses $y_i$ subject to random intercept variation. Thus let
$y_i \sim \mathrm{Po}(o_i \mu_i)$ where $o_i$ are expected events, and where the $\mu_i$ measure the Poisson intensity
relative to expected levels (in spatial health applications the μi are termed relative risks).
Then the impact of $K_i$ neighbouring areas can be represented by random effects $b_k$ while
own area effects are represented by effects $u_i$ in a model
$$\log(\mu_i) = x_i\beta + \sum_{k=1}^{K_i} w_{ik} b_k + u_i,$$
where the $w_{ik}$ are row standardised with $\sum_{k=1}^{K_i} w_{ik} = 1$, obtained from spatial interactions
$C = [c_{ik}]$. These might be based on binary spatial interactions $c_{ik}$ ($c_{ik} = 1$ if areas i and k are
contiguous, $c_{ik} = 0$ otherwise), or based on distances $d_{ik}$ between area centres, such as
$c_{ik} = \exp(-\eta d_{ik})$ where η is positive; then $w_{ik} = c_{ik} / \sum_{k=1}^{K_i} c_{ik}$.
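Row standardisation of the spatial interactions can be sketched as follows. This is an illustrative Python fragment under stated assumptions: the four-area map, the distance matrix, and the value η = 0.5 are hypothetical, and the diagonal is set to zero on the assumption that an area is not counted among its own neighbours.

```python
import math


def row_standardise(C):
    """Return W with w_ik = c_ik / sum_k c_ik; rows without neighbours stay at zero."""
    W = []
    for row in C:
        total = sum(row)
        W.append([c / total if total > 0 else 0.0 for c in row])
    return W


def distance_decay(D, eta):
    """c_ik = exp(-eta * d_ik) for i != k; c_ii = 0 (assumption: no self-neighbour)."""
    n = len(D)
    return [[math.exp(-eta * D[i][k]) if i != k else 0.0 for k in range(n)]
            for i in range(n)]


# Hypothetical map of four areas in a line: 0-1, 1-2 and 2-3 are contiguous
C_binary = [[0, 1, 0, 0],
            [1, 0, 1, 0],
            [0, 1, 0, 1],
            [0, 0, 1, 0]]
W_binary = row_standardise(C_binary)

# Hypothetical distances between area centres, with eta = 0.5
D = [[0, 1, 2, 3],
     [1, 0, 1, 2],
     [2, 1, 0, 1],
     [3, 2, 1, 0]]
W_dist = row_standardise(distance_decay(D, eta=0.5))
```

Under contiguity weighting each neighbour of an area receives equal weight, whereas distance decay gives nearer areas more weight; in both cases every row of W sums to one.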
$$y_r = x_r\beta + \alpha_{1,h_{1r}} + \alpha_{2,h_{2r}} + u_r,$$
$$u_r \sim N(0, \sigma_3^2),$$
with xr excluding a constant term, and all predictors centred. Centring the random
effects around the intercept γ1, and neighbourhood deprivation effect γ2 improves
convergence.
In the R2OpenBUGS analysis, gamma priors are adopted on neighbourhood, school,
and pupil random effect precisions. The model is also estimated using the brms library
in R, but with half-t priors on standard deviation parameters.
Model 1 results from R2OpenBUGS show residual pupil-level variation (σ3 has
posterior mean 0.67) as more substantial than either school or neighbourhood variation
(σ1 and σ2 have means 0.09 and 0.08). A negative deprivation impact γ2 on attainment
(with mean −0.156 and 95% interval from −0.202 to −0.106) operates via (is mediated by)
the neighbourhood effects.
A second model allows the deprivation effect to vary by school – expressing poten-
tially varying effectiveness on schools h2r in countering catchment area effects (also
known as contextual value-added effects). There are now four random variances, with
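$$y_r = x_r\beta + \alpha_{1,h_{1r}} + \alpha_{2,h_{2r}} + \delta_{h_{2r}}\,\mathrm{Dep}_{h_{1r}} + u_r,$$
$$\alpha_{1 i_1} \sim N(0, \sigma_1^2),$$
$$\alpha_{2 i_2} \sim N(\gamma_1, \sigma_2^2),$$
$$\delta_{i_2} \sim N(\gamma_2, \sigma_3^2),$$
$$u_r \sim N(0, \sigma_4^2).$$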
the cumulative probability Pr(ν = 2.1) + Pr(ν = 2.2) + … is calculated for each point, and the
U(0, 1) draw determines which is sampled.
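The inverse-CDF step just described can be sketched in Python as follows. The grid of degrees of freedom values and the equal probabilities are placeholders chosen for illustration; in an actual sampler the probabilities would be the normalised conditional posterior probabilities at each grid point. The function name is hypothetical.

```python
import random


def sample_from_grid(values, probs, u):
    """Inverse-CDF draw: return the first grid value whose cumulative probability exceeds u."""
    cum = 0.0
    for v, p in zip(values, probs):
        cum += p
        if u < cum:
            return v
    return values[-1]   # guard against rounding when u is very close to 1


# Hypothetical grid 2.1, 2.2, ..., 3.0 with placeholder equal probabilities
grid = [round(2.1 + 0.1 * j, 1) for j in range(10)]
probs = [1.0 / len(grid)] * len(grid)

rng = random.Random(1)
draw = sample_from_grid(grid, probs, rng.random())   # u ~ U(0, 1)
```

Passing the uniform draw as an argument keeps the function deterministic, which makes the mapping from the U(0, 1) value to the sampled grid point easy to inspect.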
Following the Pinheiro et al. (2001) scheme, assume a gamma-normal hierarchical representation
with scale mixture parameters $s_i \sim \mathrm{Ga}(0.5\nu, 0.5\nu)$, and also that $e_i \sim N_q(0, I)$. Then
for continuous responses $y_i = (y_{i1}, \ldots, y_{in_i})'$, a level 2 assumption of t distributed random
effects $b_i = (b_{1i}, \ldots, b_{qi})'$ with dispersion $\Sigma_b$ leads to
$$y_i = X_i\beta + Z_i b_i + u_i, \quad i = 1, \ldots, m,$$
$$b_i = \kappa W_i + \Sigma_b^{0.5} e_i / s_i.$$
For outlier clusters with low $s_i$ the overall dispersion $\Sigma_b / s_i^2$ is inflated, but the fixed effect
κ will be less distorted than under normal level 2 errors.
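The degrees of freedom parameters $\nu_i$ of the level 2 multivariate t prior may be taken to
vary between clusters, namely
$$\begin{pmatrix} y_i \\ b_i \end{pmatrix} \sim t_{n_i+q}\left( \begin{pmatrix} X_i\beta \\ 0 \end{pmatrix}, \begin{pmatrix} Z_i\Sigma_b Z_i' + \Lambda_i & Z_i\Sigma_b \\ \Sigma_b Z_i' & \Sigma_b \end{pmatrix}, \nu_i \right),$$
or equivalently, conditional on scale mixture parameters $s_i$,
$$\begin{pmatrix} y_i \\ b_i \end{pmatrix} \sim N_{n_i+q}\left( \begin{pmatrix} X_i\beta \\ 0 \end{pmatrix}, \frac{1}{s_i}\begin{pmatrix} Z_i\Sigma_b Z_i' + \Lambda_i & Z_i\Sigma_b \\ \Sigma_b Z_i' & \Sigma_b \end{pmatrix} \right),$$
where $s_i \sim \mathrm{Ga}(0.5\nu_i, 0.5\nu_i)$. The $s_i$ can then be used for identifying cluster outliers. An
alternative to assuming cluster specific degrees of freedom is to take $\nu_i = \nu_{g_i}$, according to a
known or possibly unknown grouping variable $g_i \in \{1, \ldots, G\}$ applicable to clusters, for
example, type of school in an educational application.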
Discrete mixtures of random effects are also possible for outlier accommodation, model-
ling non-normality or other asymmetry in random effects. Latent mixtures of regression
effects may also be present: Muthén and Asparouhov (2009) show how latent regression
classes may be misrepresented as random cluster variation. To detect outlier random
effects, Daniels and Gatsonis (1999, p.36) adapt the approach of Albert and Chib (1997) in
their models for hierarchical conjugate priors for discrete data.
For nested binomial data $y_{ij} \sim \mathrm{Bin}(n_{ij}, p_{ij})$, a mechanism to detect level 1 outliers may be
specified with $p_{ij}$ drawn from a two-group mixture of beta densities, both with means $\pi_{ij}$.
For the main group, the dispersion parameters are $\delta_i$, while for the outlier group they are
deflated as $\delta_i/K$ where $K > 1$. Then
$$p_{ij} \sim (1 - \lambda)\,\mathrm{Beta}\big(\pi_{ij}\delta_i, (1 - \pi_{ij})\delta_i\big) + \lambda\,\mathrm{Beta}\left(\pi_{ij}\frac{\delta_i}{K}, (1 - \pi_{ij})\frac{\delta_i}{K}\right).$$
If the outlier probability λ is preset to a low value (e.g. λ = 0.05), then K might be taken as
an extra parameter. Weiss et al. (1999) suggest a similarly motivated prior for mixtures of
normal random effects at levels 1 and 2 in (8.1) and (8.3), namely
$$b_i \sim (1 - \lambda_b) N_q(0, \Sigma_b) + \lambda_b N_q(0, K_b \Sigma_b),$$
$$u_{ij} \sim (1 - \lambda_u) N(0, \sigma_u^2) + \lambda_u N(0, K_u \sigma_u^2),$$
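To see how deflating the dispersion parameter accommodates outliers in the two-group beta mixture above, the following Python sketch draws probabilities from the mixture and compares the component variances. It is an illustration only: the values of the mean, the dispersion, λ = 0.05 and K = 10 are hypothetical, as are the function names.

```python
import random


def draw_p(pi_ij, delta_i, lam, K, rng):
    """Draw p_ij from the two-group beta mixture; return (p, is_outlier)."""
    is_outlier = rng.random() < lam
    delta = delta_i / K if is_outlier else delta_i    # deflated dispersion for outliers
    p = rng.betavariate(pi_ij * delta, (1.0 - pi_ij) * delta)
    return p, is_outlier


def beta_variance(pi_ij, delta):
    """Variance of Beta(pi * delta, (1 - pi) * delta), namely pi (1 - pi) / (delta + 1)."""
    return pi_ij * (1.0 - pi_ij) / (delta + 1.0)


rng = random.Random(42)
draws = [draw_p(0.3, 50.0, 0.05, 10.0, rng) for _ in range(2000)]
share_outlier = sum(flag for _, flag in draws) / len(draws)

var_main = beta_variance(0.3, 50.0)          # main group
var_outlier = beta_variance(0.3, 50.0 / 10)  # outlier group, same mean, larger variance
```

Both components share the mean but the outlier component has the larger variance, so observations far from the mean are more plausibly assigned to it.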
An alternative mixture prior to reduce the impact of parametric assumptions is the mixture
of Dirichlet process approach (Kleinman and Ibrahim, 1998; Guha, 2008). Thus, a conventional
first stage likelihood
$$y_i \sim N(X_i\beta + Z_i b_i, \sigma^2),$$
may be combined with a semiparametric approach for $b_i = (b_{1i}, \ldots, b_{qi})'$, typically with a
multivariate normal base $G_0$ as in
$$b_i \sim G,$$
$$G \sim \mathrm{DP}(\alpha, G_0),$$
$$G_0 = N_q(0, D),$$
$$D^{-1} \sim \mathrm{Wishart}(d_0, R_0).$$
Gibbs sampling for $D^{-1}$ is modified for clustering among the sampled $b_i$ (Kleinman and
Ibrahim, 1998, p.94).
The ethnicity fixed effect (β1, β2, β3) has black ethnicity as reference. In practice, the $u_{ij}$
are centred around the regression term $\beta_0 + \beta_{eth_{ij}}$ to improve convergence. U(0,100) priors
are adopted for the random standard deviations.
Using replicate random effects $u_{ij,rep}$ and $b_{i,rep}$, and the resulting replicate data $y_{ij,rep}$
sampled from the model, predictive checks involve the mixed predictive exceedance
criterion $\Pr(y_{ij,rep} > y_{ij} \mid y)$ (Green et al., 2009). The observation level log posterior predictive
densities (LPPDs) associated with the WAIC are also obtained. The significance of
individual precinct effects $b_i$ is assessed using the probabilities $\Pr(b_i > 0 \mid y)$.
The scaled deviance (DV in the code) is estimated as 925, so overdispersion is
accounted for. The Hispanic and white ethnic coefficients (β2 and β3) have 95% intervals
(−0.12, 0.22) and (−0.59, −0.23), so whites have lower chances of being subject to “stop and
frisk.” Specifically, they have a 33% lower relative risk, namely 100(1 − exp(−0.4)) where
−0.4 is the posterior mean of β3. Despite the presence of precinct-cell error terms (which
might reduce the need for separate precinct effects), a relatively high number (25 out of 75)
of the precinct effects $b_i$ are significant in the sense that the probabilities $\Pr(b_i > 0 \mid y)$
exceed 0.95 or are under 0.05.

FIGURE 8.1
Precinct random effects, truncated Dirichlet prior. (Histogram; vertical axis: frequency.)
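The conversion from a log relative risk to a percentage reduction quoted above is simple arithmetic, sketched here in Python. The function name is hypothetical; the inputs are the posterior mean and 95% interval limits for the white ethnicity coefficient reported in the text. Transforming the interval limits is legitimate because the transformation is monotone.

```python
import math


def pct_lower_risk(beta):
    """Percentage reduction in relative risk implied by a log relative risk beta."""
    return 100.0 * (1.0 - math.exp(beta))


reduction_mean = pct_lower_risk(-0.4)        # rounds to 33
# Interval limits (-0.59, -0.23) map to the larger and smaller reductions respectively
reduction_interval = (pct_lower_risk(-0.23), pct_lower_risk(-0.59))
print(round(reduction_mean), reduction_interval)
```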
Around 8.7% of the mixed predictive exceedance checks are in the extreme tails
(under 0.05 or over 0.95), so the model is reproducing the data effectively. The lowest
LPPD values and extreme exceedance probabilities are for subjects with very high stop
counts, and for subjects with zero stop counts, despite relatively large offsets.
Of interest in terms of the robustness of the model assumptions are the character-
istics of the posterior estimates of bi and uij. The proportion of extreme values for the
precinct effects may cast doubt on a normality assumption, and as an alternative, a
truncated Dirichlet process prior is adopted for these effects (model 2). A fixed Dirichlet
concentration parameter is assumed, namely α = 1, to aid convergence. The base density
involves normal random effects over a maximum of 20 clusters.
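One common way to implement a truncated Dirichlet process prior is the stick-breaking construction, sketched below in Python. This is an assumption about the general technique rather than a transcription of the book's code: the standard normal base density, the seed, and the function name are illustrative, while α = 1, the maximum of 20 clusters, and the 75 precincts follow the text.

```python
import random


def stick_breaking_weights(alpha, M, rng):
    """Truncated stick-breaking: V_m ~ Beta(1, alpha) for m < M, with V_M = 1."""
    weights, remaining = [], 1.0
    for m in range(M):
        v = 1.0 if m == M - 1 else rng.betavariate(1.0, alpha)
        weights.append(v * remaining)     # share of the stick assigned to cluster m
        remaining *= (1.0 - v)            # length of stick still unassigned
    return weights


rng = random.Random(2020)
M, alpha = 20, 1.0
w = stick_breaking_weights(alpha, M, rng)

# Cluster-specific values drawn from a N(0, 1) base density (an assumption)
cluster_values = [rng.gauss(0.0, 1.0) for _ in range(M)]
# Allocate each of 75 precincts to a cluster, then read off its effect
labels = rng.choices(range(M), weights=w, k=75)
b = [cluster_values[z] for z in labels]
```

Because several precincts can share a cluster, the resulting effects take at most 20 distinct values, which is what allows multimodal or skewed patterns to emerge.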
A slight reduction in WAIC (from 6897 to 6891) is obtained. Around 9% of the mixed
predictive exceedance checks are in the extreme tails (under 0.05 or over 0.95), similar
to model 1. Similar results regarding the fixed effects and ethnic differences in risk of
stop and frisk are also estimated for this model. However, a histogram of the posterior
mean bi suggests non-normality (Figure 8.1), shown, for example, by a bimodal pattern,
and five precincts (2,26,28,51,70) with unusually low bi.
education, husband’s occupation, and presence of a modern toilet; and at level 1: (exist-
ing) child’s age, mother’s age, and birth order.
Then with yijk ~ Bern(pijk), a binary multilevel model specifies normal random intercept
variation at levels 2 and 3, namely according to both mother and community. Using a
non-centred parameterisation and logit link, one has
where $b_{i3} \sim N(0, \sigma_3^2)$ are community effects, and $b_{ij2} \sim N(0, \sigma_2^2)$ are mother level effects.
Alternatively, community effects and mother effects could be centred at Xiβ3 and Xijβ2
respectively. Instability across estimation methods in this dataset is noted by Rodriguez
and Goldman (2001) and Guo and Zhao (2000).
This instability may be related partly to small cluster sizes at both levels as well as
the binary form of outcome. Here we illustrate the potential impacts of (fixed effect)
predictor collinearity, comparing a diffuse normal prior on predictors with a horseshoe
prior. Under the horseshoe prior, the student-t prior of the local shrinkage parameters
(see Equation 7.1) has 1 degree of freedom (Piironen and Vehtari, 2016), while the global
parameter has a Cauchy prior with scale 1. Using rstan, convergence is achieved in two
chain runs of 2000 iterations.

TABLE 8.1
Modern Pregnancy Advice
                                    Diffuse Normal Prior on Fixed   Horseshoe Prior on Fixed
                                    Regression Effects              Regression Effects
Fixed effects                        Mean     2.5%    97.5%          Mean     2.5%    97.5%
Intercept                             5.3      0.3     11.1           3.0      0.5      5.9
Pregnancy Level
  Child aged 3–4 years              −1.38    −2.16    −0.66         −0.89    −1.51    −0.29
  Mother aged > 25 years             1.28     0.00     2.65          0.60    −0.16     1.57
  Birth order 2–3                   −1.05    −2.18     0.03         −0.35    −1.16     0.18
  Birth order 4–6                   −0.54    −2.06     0.98         −0.01    −0.88     0.76
  Birth order > 7                   −1.29    −3.45     0.73         −0.32    −1.75     0.65
Mother Level
  Indigenous, no Spanish            −7.85   −13.20    −3.59         −5.30    −8.95    −1.70
  Indigenous Spanish                −4.21    −7.68    −1.18         −2.60    −5.18    −0.10
  Mother’s education primary         2.66     0.85     4.79          1.59     0.18     3.03
  Mother’s education secondary       5.73     1.40    10.95          3.51     0.00     7.22
  Husband’s education primary        1.14    −0.84     3.22          0.57    −0.45     2.18
  Husband’s education secondary      4.89     1.04     9.21          3.35     0.19     6.64
  Husband’s education missing        0.07    −3.03     3.11         −0.05    −1.38     1.25
  Husband professional etc.         −0.54    −5.53     4.36          0.65    −0.87     2.70
  Husband agric. self-employed      −2.73    −7.09     1.46         −0.57    −2.40     0.60
  Husband agric. employee           −3.82    −8.49     0.44         −1.33    −3.52     0.16
  Husband skilled service           −1.17    −5.46     3.17          0.25    −1.03     1.83
  Modern toilet in households        2.80     0.08     5.87          1.82    −0.02     4.00
  Television not watched daily       2.18    −1.81     6.37          0.59    −0.74     3.03
  Television watched daily           2.14    −0.38     4.93          1.00    −0.32     3.08
Community Level
  Proportion indigenous, 1981       −6.60   −11.84    −1.98         −4.95    −8.96    −1.08
  Distance to nearest clinic        −0.07    −0.14    −0.02         −0.06    −0.10    −0.02
Random effect variances
  Family                             10.5      7.7     14.2           7.3      5.7      9.3
  Community                           5.6      3.8      7.8           4.0      2.8      5.3
Table 8.1 shows that posterior mean random intercept variances at both family and
community level are reduced by about 30% under the horseshoe prior. Fixed regression
effects show considerable shrinkage, but significant predictor effects (on 8 of the 21 pre-
dictors) are maintained (as assessed by 95% credible intervals either entirely negative
or positive). The indicators κj (see Equation 7.2) show that the “indigenous, no Spanish”
(mother) and “proportion indigenous” (community) predictors have the highest rele-
vance, with posterior mean κj around 0.12 for both (kappa[6] and kappa[20] in the code),
and posterior median κj around 0.06. This strategy improves fit: the LOO-IC falls from
2653 to 1765 on adopting shrinkage priors.
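The structure of the horseshoe prior used in this example can be illustrated by drawing from it. The Python sketch below makes several assumptions that should be read as such: half-Cauchy local scales (a student-t with 1 degree of freedom, restricted to positive values), a half-Cauchy global scale with scale 1, and a simplified shrinkage indicator κ = 1/(1 + τ²λ²), which may differ from the form given in Equation 7.2 (not reproduced here). Function names are hypothetical.

```python
import math
import random


def half_cauchy(scale, rng):
    """|X| for X ~ Cauchy(0, scale), generated through the inverse CDF."""
    return abs(scale * math.tan(math.pi * (rng.random() - 0.5)))


def horseshoe_draw(n_coef, rng, global_scale=1.0):
    """One prior draw of (tau, lambda, beta, kappa) under a horseshoe prior."""
    tau = half_cauchy(global_scale, rng)                    # global shrinkage
    lam = [half_cauchy(1.0, rng) for _ in range(n_coef)]    # local shrinkage
    beta = [rng.gauss(0.0, tau * l) for l in lam]           # regression effects
    # Simplified shrinkage indicator (assumed form): near 0 = little shrinkage
    kappa = [1.0 / (1.0 + (tau * l) ** 2) for l in lam]
    return tau, lam, beta, kappa


rng = random.Random(8)
tau, lam, beta, kappa = horseshoe_draw(21, rng)   # 21 predictors, as in Table 8.1
```

Under this reading, predictors with small κ are those least shrunk towards zero, which matches the interpretation of low posterior κ values as indicating high relevance.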
References
Aarts E, Dolan C, Verhage M, van der Sluis S (2015) Multilevel analysis quantifies variation in the
experimental effect while optimizing power and preventing false positives. BMC Neuroscience,
16, 94.
Albert J, Chib S (1997) Bayesian tests and model diagnostics in conditionally independent hierarchi-
cal models. Journal of the American Statistical Association, 92(439), 916–925.
Bliese P (2016a) Package ‘Multilevel’ Manual. https://cran.r-project.org/web/packages/multilevel/
Bliese P (2016b) Multilevel Modeling in R: A Brief Introduction to R, the multilevel Package and the
nlme Package. Darla Moore School of Business, University of South Carolina.
Bliese P, Halverson R (1996) Individual and nomothetic models of job stress: An examination of work
hours, cohesion, and well-being. Journal of Applied Social Psychology, 26(13), 1171–1189.
Bliese P, Hanges P (2004) Being both too liberal and too conservative: The perils of treating grouped
data as though they were independent. Organizational Research Methods, 7(4), 400–417.
Browne W (2004) An illustration of the use of reparameterisation methods for improving MCMC
efficiency in crossed random effect models. Multilevel Modelling Newsletter, 16, 13–25.
Browne W, Draper D, Goldstein H, Rasbash J (2002) Bayesian and likelihood methods for fitting
multilevel models with complex level-1 variation. Computational Statistics and Data Analysis,
39, 203–225.
Browne W, Goldstein H, Rasbash J (2001) Multiple membership multiple classification (MMMC)
models. Statistical Modelling, 1, 103–124.
Buerkner P (2017) brms: An R package for Bayesian multilevel models using Stan. Journal of Statistical
Software, 80(1), 1–28.
Candel J, Winkens B (2003) Performance of empirical Bayes estimators of level-2 random parameters
in multilevel analysis: A Monte Carlo study for longitudinal designs. Journal of Educational and
Behavioral Statistics, 28, 169–194.
Chaix B, Merlo J, Chauvin P (2005) Comparison of a spatial approach with the multilevel approach
for investigating place effects on health: The example of healthcare utilisation in France. Journal
of Epidemiology and Community Health, 59, 517–526.
Chen Z, Dunson D (2003) Random effects selection in linear mixed models. Biometrics, 59, 762–769.
Congdon P, Best N (2000) Small area variation in hospital admission rates: Adjusting for referral and
provider variation. Journal of the Royal Statistical Society: Series C, 49(2), 207–226.
Congdon P, Lloyd P (2010) Estimating small area diabetes prevalence in the US using the behavioral
risk factor surveillance system. Journal of Data Science, 8(2), 235–252.
Croon M, van Veldhoven M (2007) Predicting group-level outcome variables from variables mea-
sured at the individual level: A latent variable multilevel model. Psychological Methods, 12(1),
45–57.
Daniels M, Gatsonis C (1999) Hierarchical generalized linear models in the analysis of variations
in health care utilization. Journal of the American Statistical Association, 94, 29–42.
Daniels M, Kass R (1999) Nonconjugate Bayesian estimation of covariance matrices and its use in
hierarchical models. Journal of the American Statistical Association, 94, 1254–1263.
Daniels M, Zhao Y (2003) Modelling the random effects covariance matrix in longitudinal data.
Statistics in Medicine, 22, 1631–1647.
Dean C, MacNab Y (2001) Modeling of rates over a hierarchical health administrative structure.
Canadian Journal of Statistics, 29, 405–419.
Dong G, Ma J, Harris R, Pryce G (2016) Spatial random slope multilevel modeling using multivari-
ate conditional autoregressive models: A case study of subjective travel satisfaction in Beijing.
Annals of the American Association of Geographers, 106(1), 19–35.
Draper D (1995) Inference and hierarchical modeling in the social sciences. Journal of Educational and
Behavioral Statistics, 20, 115–147.
Draper D (2006) Bayesian multilevel analysis and MCMC, Chapter 2, in Handbook of Quantitative
Multilevel Analysis, eds J de Leeuw, E Meijer. Springer, New York.
Flay B, Hansen W, Johnson C, Collins L, Dent C, Dwyer K, Grossman L, Hockstein G, Rauch J,
Sobol J, Sobel D, Sussman S, Ulene A (1987) Implementation effectiveness trial of a social influ-
ences smoking prevention program using schools and television. Health Education Research, 2,
385–400.
Gamerman D (1997) Sampling from the posterior distribution in generalized linear mixed models.
Statistics and Computing, 7, 57–68.
Gelfand A, Sahu S, Carlin BP (1995) Efficient parameterisations for normal linear mixed models.
Biometrika, 82, 479–488.
Gelman A (2006) Multilevel (hierarchical) modeling: What it can and can’t do. Technometrics, 48,
432–435.
Gelman A, Hill J (2006) Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge
University Press.
Gelman A, Pardoe I (2006) Bayesian measures of explained variance and pooling in multilevel (hier-
archical) models. Technometrics, 48(2), 241–251.
Givens G, Hoeting J (2012) Computational Statistics, 2nd Edition. John Wiley.
Goldstein H (2005) Heteroscedasticity and complex variation, pp 790–795, in Encyclopedia of Statistics
in Behavioral Science, Vol. 2, eds B Everitt, D Howell. Wiley, New York.
Goldstein H, Browne W, Rasbash J (2002) Partitioning variation in multilevel models. Understanding
Statistics, 1, 223–232.
Green MJ, Medley GF, Browne WJ (2009) Use of posterior predictive assessments to evaluate model
fit in multilevel logistic regression. Veterinary Research, 40(4), 1–10.
Guha S (2008) Posterior simulation in the generalized linear mixed model with semiparametric ran-
dom effects. Journal of Computational and Graphical Statistics, 17, 410–425.
Guo G, Zhao H (2000) Multilevel modeling for binary data. Annual Review of Sociology, 26, 441–462.
Hedeker D (2003) A mixed-effects multinomial logistic regression model. Statistics in Medicine, 22,
1433–1446.
Hox J (2002) Multilevel Analysis: Techniques and Applications. Lawrence Erlbaum Associates, Mahwah, NJ.
Hsiao C (1996) Random coefficient models, pp 77–99, in The Econometrics of Panel Data, eds L Matyas,
P Sevestre. Kluwer, Dordrecht, Netherlands.
Kreft I, de Leeuw J (1998) Introducing Multilevel Modeling. Sage, Thousand Oaks, CA.
Kleinman K, Ibrahim J (1998) A semi-parametric Bayesian approach to generalized linear mixed
models. Statistics in Medicine, 17, 2579–2596.
Langford I, Lewis T (1998) Outliers in multilevel data. Journal of the Royal Statistical Society: Series A,
161, 121–160.
Lindley DV, Smith AF (1972) Bayes estimates for the linear model. Journal of the Royal Statistical
Society: Series B (Methodological), 34(1), 1–18.
Mai Y, Zhang Z (2018) Software Packages for Bayesian Multilevel Modeling. Structural Equation
Modeling, 25(4), 650–658.
9.1 Introduction
A range of multivariate techniques are available both for modelling multivariate collec-
tions of metric, binary, or count data, and for modelling multivariate random effects or
regression residuals. These include data reduction (reduced dimension) methods such as
factor and principal component analysis (e.g. Hayashi and Arav, 2006; Lopes and West,
2004), structural equation modelling (Schumacker and Lomax, 2016), discriminant analy-
sis (e.g. Brown et al., 1999; Rigby, 1997), and data mining, as well as direct (full dimen-
sion) modelling of the joint density of the observations or regression residuals (e.g. Chib
and Winkelmann, 2001; Martinez-Beneito, 2013). Structured multivariate effects in the
analysis of spatial or time configured data raise additional issues, such as representing
inter-variable correlation within units as well as non-exchangeability between units (Song
et al., 2005). Bayesian applications of factor analysis and structural equation modelling
have grown considerably in recent years; for overviews, see Palomo et al. (2007), Merkle
and Wang (2016), Kaplan and Depaoli (2012), Lee (2007), Stromeyer et al. (2015), and Levy
and Mislevy (2016).
The rationale for introducing latent variables lies in parsimonious representation of the
covariance structure of multivariate data, while also revealing underlying clustering of,
or associations between, the variables, ideally with substantive interpretability. The latent
variables are typically unobservable constructs (e.g. authoritarianism, population morbid-
ity, or a common trend over time) that can only be imperfectly measured by observed indi-
cators. The latent variables may be continuous, as in factor analysis (Fokoue, 2004; Lopes
and West, 2004), or categorical, as in latent class analysis (Berkhof et al., 2003). The original
variables might themselves also be discrete or continuous. For example, item response
models typically involve multiple binary observed items and a single latent continuous
ability score (Bazan et al., 2006; Luo and Jiao, 2017; Albert and Ghosh, 2000). Bayesian latent
variable packages in R include blavaan (Merkle and Rosseel, 2017), brms (Byrnes, 2017),
BayesFM (Piatek, 2017), bfa (Murray, 2016), and BayesLCA (White, 2017). Preliminary anal-
ysis using classical estimation is often useful in problem definition, for example, using the
lavaan or openMX packages (Boker et al., 2011).
The extraction of information from multivariate observed indicators to derive a smaller
set of latent variables defines a measurement model, as in confirmatory and exploratory
factor analysis (Bartholomew, 1987; Skrondal and Rabe-Hesketh, 2007). The subsequent
use of the latent constructs in describing causal relationships or associations leads into
structural equation modelling (Lee, 2007). Both types of model have been developed, espe-
cially in areas such as psychology, marketing, educational testing, and sociology, where
it is not possible to measure underlying constructs directly. Newer areas of development
include environmental modelling (Malaeb et al., 2000; Nikolov et al., 2007), biomass mod-
els (Arhonditsis et al., 2006a), and time series and spatial data analysis using common
factor approaches.
The observed variables in a measurement model are variously known as “items” (e.g. in
psychometric tests), as “indicators,” or as “manifest variables.” Canonical assumptions are
that (a) conditional on the constructs, the observed indicators are independent, in which
case the constructs explain the observed correlations between the indicators, and (b) that
the construct scores are independent over subjects. As Bollen (2002) points out, the local
independence property in (a) is not an intrinsic feature of structural equation models,
while spatial and time series factor and structural equation models (Hogan and Tchernis,
2004; Congdon et al., 2007) exemplify how construct scores may be dependent over space
or time.
This chapter presents a selective review of multivariate techniques, namely
a) factor modelling via continuous latent constructs, as applied in normal linear and
general linear model contexts (Sections 9.2, 9.3, and 9.4);
b) models for multivariate discrete area (lattice) data, including spatial factor models
(Sections 9.6 and 9.7); and
c) models for multivariate time series (Section 9.8), with a focus on dynamic linear
and general linear models.
$$y_i = \alpha_y + \Lambda_y F_i + u_i,$$
$$x_i = \alpha_x + \Lambda_x H_i + e_i,$$
$$F_i = BF_i + CH_i + w_i,$$
$$w_i \sim N(0, \Phi),$$
where an intercept is typically not identified, and B is a Qy × Qy matrix with zero diago-
nal elements and off-diagonal parameters describing relations between endogenous con-
structs. The matrix C is Qy × Qx with parameters describing the impact of exogenous on
endogenous constructs. The structural model may also contain further observed variables
as responses or predictors.
Many multivariate reduction applications involve just a measurement model (i.e. a sim-
ple factor analysis), and so distinction between different types of observed indicator and
factor is not needed. Then a normal linear factor model is
$$y_i = \alpha + \Lambda F_i + u_i, \tag{9.1}$$
$$F_i = BF_i + w_i,$$
or independent factors
$$F_i \sim N_Q(0, \Phi)$$
residuals $(u_{1i}, \ldots, u_{Pi})'$ are typically taken to be independent over cases i and variables, so
that $\Sigma = \mathrm{diag}(\sigma_1^2, \sigma_2^2, \ldots, \sigma_P^2)$. This assumption can equivalently be stated as that the
outcome variables are conditionally independent, given the latent variables (Skrondal and
Rabe-Hesketh, 2007).
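A consequence of the conditional independence assumption is that, for factors with covariance Φ that are independent of the residuals, the marginal covariance of the indicators is ΛΦΛ′ + Σ, so that all off-diagonal covariance is attributed to the factors. The Python sketch below computes this implied covariance for a small hypothetical example (P = 4 indicators, Q = 2 independent standardised factors, simple structure); the loadings and residual variances are invented for illustration.

```python
def matmul(A, B):
    """Product of two matrices held as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def implied_covariance(Lambda, Phi, sigma2):
    """Cov(y_i) = Lambda Phi Lambda' + diag(sigma2) when F_i ~ N_Q(0, Phi)."""
    LPL = matmul(matmul(Lambda, Phi), transpose(Lambda))
    P = len(sigma2)
    return [[LPL[p][r] + (sigma2[p] if p == r else 0.0) for r in range(P)]
            for p in range(P)]


# Hypothetical loadings: indicators 1-2 load on factor 1, indicators 3-4 on factor 2
Lambda = [[0.9, 0.0], [0.7, 0.0], [0.0, 0.8], [0.0, 0.6]]
Phi = [[1.0, 0.0], [0.0, 1.0]]
sigma2 = [0.19, 0.51, 0.36, 0.64]

V = implied_covariance(Lambda, Phi, sigma2)
```

In this example indicators loading on different factors have zero covariance, while the covariance between indicators 1 and 2 is the product of their loadings.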
It may be noted that path analysis models, a special case of SEM, may be estimated
straightforwardly using brms (Byrnes, 2017) [1]. Whereas SEM models in general may
include latent variables, path analysis models assume observed variables measured
without error. Only postulated structural relationships between observed variables are
included in the model. This approach is often used when particular variables are thought
to mediate relationships between others.
9.2.1 Forms of Model
If all loadings λpq in the P × Q matrix Λ are free parameters (apart from those subject to
identification constraints, as discussed below), this structure is known as an exploratory
factor analysis (EFA), and typically assumes independent factors, with Φ = I (Merkle and
Wang, 2016). By contrast, in a confirmatory factor analysis (CFA) or measurement model,
many of the loadings take preset values (usually zero) on the basis of substantive theory,
and correlations between factors may be assumed. A particular form of confirmatory
model is known as simple structure, such that each observed variable ypi loads on only one
of the constructs Fqi. For example, Fleishman and Lawrence (2003) apply a simple struc-
ture model to ordinal items from the SF12 questionnaire, assuming that each item reflects
either a physical or mental health construct.
A multiple indicator-multiple cause (MIMIC) model extends confirmatory models by
incorporating the effects of exogenous observed variables on latent factors (Joreskog and
Goldberger, 1975; Tekwe et al., 2014). MIMIC models for normal outcomes consist of (a)
measurement equations
$$y_i = \alpha + \Lambda F_i + \varphi X_i + u_i,$$
relating multiple indicator variables yi to latent constructs Fi, and possibly also to known
influences Xi, and (b) structural equations. In the latter, the latent variables Fi are related,
both to one another and to observed exogenous variables Zi, which are viewed as causal
influences on the factors, namely
$$F_i = BF_i + CZ_i + w_i,$$
where Zi excludes a constant term, and the coefficient matrix B allows reciprocal effects
between latent factors. A MIMIC model with a single latent construct, as applied, for
instance, in analyses of the size of underground economies (Wang et al., 2006), would typi-
cally take the form
$$y_{pi} = \alpha_p + \lambda_p F_i + \varphi_p X_i + u_{pi},$$
$$F_i = \gamma Z_i + w_i.$$
As noted by Breusch (2005), the correlation structure in a MIMIC model may need substan-
tive support, as it typically assumes that (i) the indicators y are conditionally independent
of the causes Z, given the latent construct(s) F, and (ii) that the indicators y1 , … , y P are
mutually independent given F. This amounts to saying that all connections that indica-
tor variables y have with the causal variables Z, and with one another, are transmitted
through the latent variable(s).
9.2.2 Model Definition
Bayesian analysis in the normal linear factor model has recently focused on model defini-
tion questions. These include selection of important factor-indicator loadings (analogous
to predictor selection), covariance specification, and uncertainty in the number of factors.
Predictor selection methods such as SSVS (George and McCulloch, 1993) can be adapted
to selection of important loadings using binary indicators γjk for observed item j and latent
factor k. These indicators provide information about which items are associated with par-
ticular factors, and which items are relevant or irrelevant to the overall latent structure. For
a preset number of factors Q, this leads to confirmatory analysis, but subject to uncertainty
(Lu et al., 2016). Thus, analogous to SSVS, one has for γjk = 0,
$$\lambda_{jk} \sim N(0, \varphi_j^2),$$
where $\varphi_j^2$ is set very small so as to shrink λjk towards zero, whereas for γjk = 1,
$$\lambda_{jk} \sim N(0, c_j^2),$$
where cj is chosen large (e.g. cj = 10 or cj = 100) to enable effective search for non-zero λjk
values. Alternatively a spike and slab prior may be used, with λjk = 0 when γjk = 0.
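The contrast between the two components of the SSVS-type prior can be seen by simulating from it. In the Python sketch below, the prior inclusion probability of 0.5 and the spike standard deviation of 0.01 are hypothetical choices; the slab standard deviation of 10 follows one of the values mentioned in the text. The function name is illustrative.

```python
import random


def draw_loading(pi_gamma, spike_sd, slab_sd, rng):
    """Draw (gamma_jk, lambda_jk) from the two-component normal mixture prior."""
    gamma = 1 if rng.random() < pi_gamma else 0
    sd = slab_sd if gamma == 1 else spike_sd      # slab when retained, spike otherwise
    return gamma, rng.gauss(0.0, sd)


rng = random.Random(7)
draws = [draw_loading(0.5, 0.01, 10.0, rng) for _ in range(4000)]

spike = [abs(lam) for g, lam in draws if g == 0]
slab = [abs(lam) for g, lam in draws if g == 1]
mean_abs_spike = sum(spike) / len(spike)
mean_abs_slab = sum(slab) / len(slab)
```

Loadings drawn under γjk = 0 are concentrated tightly around zero, whereas those drawn under γjk = 1 range widely, which is what allows the posterior for γjk to discriminate between negligible and substantive loadings.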
Such procedures can be extended with binary indicators δk that allow retention or exclu-
sion of factors (Mavridis and Ntzoufras, 2014). This leads to item and factor selection in an
exploratory factor analysis in which
$$\delta_k \sim \mathrm{Bern}(\pi_\delta),$$
$$\gamma_{jk} \mid \delta_k \sim \mathrm{Bern}(\pi_\gamma \delta_k).$$
$$\phi_{jk} \sim \mathrm{Ga}\left(\frac{\nu}{2}, \frac{\nu}{2}\right),$$
$$\tau_k = \prod_{l=1}^{k} \delta_l,$$
$$\delta_1 \sim \mathrm{Ga}(a_1, 1),$$
$$\{\delta_2, \ldots, \delta_k\} \sim \mathrm{Ga}(a_2, 1),$$
where a2 > 1, so that the precisions τk are stochastically increasing. Fokoue (2004) proposes
to seek relatively simple structure (a Bayesian version of varimax rotation) by taking the
precisions for each loading as unknown gamma variables, namely
In related work, Muthén and Asparouhov (2012) propose a modified constraint form of
confirmatory analysis, labelled as a Bayesian SEM, and included in the Mplus package.
Under this approach, the main loadings (those consistent with simple structure) have a
prior variance large enough to represent non-zero effects. However, instead of constrain-
ing other (cross) loadings to zero, they are assigned informative priors with very low vari-
ance (e.g. 0.01), so are approximate rather than exact zeros. If certain cross-loadings are
found to be significant (95% credibility interval excluding zero) despite these priors, such
that an item loads on more than one construct, then this suggests simple structure no
longer holds. The model may be re-estimated with those cross-loadings assigned a less
informative prior (Smith et al., 2017).
Default covariance specifications, such as diagonal Σ and Φ in (9.1), may be restrictive
in certain applications. The package blavaan (Merkle and Rosseel, 2017) uses a form of
parameter expansion, involving phantom latent variables, to facilitate the estimation of
non-diagonal covariance matrices.
Choice between models involving different numbers of factors may be tackled using
parameter expansion, combined with a Bayes factor approximation (Ghosh and Dunson,
2008), by RJMCMC (reversible jump Markov chain Monte Carlo) methods (Lopes and
West, 2004), or by marginal likelihood approximation using path sampling (Lee, 2007).
The latter approach may be extended to full structural equation models (SEMs) (Lee and
Song, 2008). The parameter expansion method may also improve MCMC performance
(Ghosh and Dunson, 2009; Merkle and Wang, 2016), and involves a reference model with
standardised factors, and a lower triangular structure for Λ (see Section 9.3), including the
diagonals constraint λqq > 0.
Thus, the reference model is
where R allows correlations between factors, but has diagonal 1. The expanded model is
where Ψ is unconstrained, the loadings Λ* are not subject to the diagonals constraint λ*qq > 0,
but Λ* is still lower triangular. Q(Q −1) parameters in Λ* are set to zero when R is non-diagonal
(Merkle and Wang, 2016). Priors on parameters in the expanded model induce priors on
(Λ,F,R) in the reference model, via
where a sign function, S(x) = −1 if x < 0 and S(x) = 1 if x ≥ 0, is used to ensure a positive
diagonals constraint in Λ.
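The following sketch illustrates the kind of sign-and-scale mapping involved. It assumes a diagonal Ψ holding factor variances ψq, and takes λpq = S(λ*qq) λ*pq √ψq and Fq = S(λ*qq) F*q / √ψq; this particular form, and the function names, are assumptions made for illustration rather than the exact transformation used in the cited papers. The point is that the product ΛF is unchanged while the diagonal loadings become positive.

```python
import math

def sign(x):
    """S(x) = -1 if x < 0, and 1 otherwise."""
    return -1.0 if x < 0 else 1.0

def to_reference(lam_star, f_star, psi):
    """Map expanded-model draws to reference-model values (diagonal Psi assumed).

    lam_star: P x Q lower triangular loadings, as a list of rows
    f_star:   length-Q factor scores for one subject
    psi:      length-Q factor variances in the expanded model
    """
    Q = len(psi)
    s = [sign(lam_star[q][q]) for q in range(Q)]
    root = [math.sqrt(v) for v in psi]
    lam = [[s[q] * row[q] * root[q] for q in range(Q)] for row in lam_star]
    f = [s[q] * f_star[q] / root[q] for q in range(Q)]
    return lam, f

def linear_predictor(lam, f):
    """Rows of Lambda multiplied by the factor score vector."""
    return [sum(l * x for l, x in zip(row, f)) for row in lam]

lam_star = [[-2.0, 0.0], [1.0, 0.5], [0.4, -1.2]]
f_star = [0.6, -1.5]
psi = [4.0, 0.25]
lam, f = to_reference(lam_star, f_star, psi)
print(lam)
print(linear_predictor(lam_star, f_star), linear_predictor(lam, f))
```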
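$$\begin{bmatrix} y_i \\ F_i \end{bmatrix} \sim N_{P+Q}\left( \begin{bmatrix} \alpha \\ 0 \end{bmatrix}, \begin{bmatrix} \Lambda\Phi\Lambda' + \Sigma & \Lambda\Phi \\ \Phi\Lambda' & \Phi \end{bmatrix} \right).$$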
When the factors are standardised (Bartholomew et al., 2002, p.150; Lopes and West, 2004,
p.44), the marginal variance of yp is accordingly $\lambda_{p1}^2 + \ldots + \lambda_{pQ}^2 + \sigma_p^2$, and the marginal
covariance of yp and ym is $\lambda_{p1}\lambda_{m1} + \lambda_{p2}\lambda_{m2} + \ldots + \lambda_{pQ}\lambda_{mQ}$. The contribution
$\lambda_{p1}^2 + \ldots + \lambda_{pQ}^2$ of the common
factors to explaining the marginal variability in the yp is known as the “communality,”
while that part due to the residual error $\sigma_p^2$ is called the “unique variance” or “uniqueness.”
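A small numerical sketch of these quantities, assuming standardised and uncorrelated factors (Φ = I) and using made-up loadings and unique variances chosen so that each indicator has unit marginal variance:

```python
def communalities(lam):
    """Sum of squared loadings for each indicator (each row of Lambda)."""
    return [sum(l * l for l in row) for row in lam]

def implied_covariance(lam, sigma2):
    """Marginal covariance Lambda Lambda' + Sigma, with Sigma diagonal."""
    P = len(lam)
    return [[sum(a * b for a, b in zip(lam[p], lam[m]))
             + (sigma2[p] if p == m else 0.0)
             for m in range(P)] for p in range(P)]

# Hypothetical loadings (P = 3, Q = 2) and unique variances.
lam = [[0.8, 0.0], [0.6, 0.3], [0.2, 0.7]]
sigma2 = [0.36, 0.55, 0.47]

print(communalities(lam))
for row in implied_covariance(lam, sigma2):
    print([round(v, 2) for v in row])
```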
The marginal likelihood structure for cov(y) as ΛΦΛ′ + Σ does not lead to any simple
form for the posterior distributions of the unknowns, though it can be used in RJMCMC
approaches to estimation and factor model selection (Lopes and West, 2004). In Gibbs sampling
estimation of linear Bayesian factor and SEM models, it is simplest to approach estimation
of the parameters (F, α, Λ, Φ, Σ) indirectly through the conditional likelihood or
complete data model (Aitkin and Aitkin, 2005; Fokoue, 2004), with the F scores regarded
as missing data rather than integrated out (Lee and Shi, 2000). Setting θ = (α, Λ, Φ, Σ), the
posterior density is then
$$p(\theta, F \mid y) \propto p(y \mid F, \theta)\, p(F \mid \theta)\, p(\theta).$$
While MCMC sampling is typically used with the conditional likelihood, the marginal
covariance ΛΦΛ′ + Σ may be useful in posterior checking of model assumptions (e.g. conditional
independence between the y variables given the factor scores). For example, Lee and
Shi (2000) suggest a posterior check using $D(y, \theta) = \sum_i y_i' (\Lambda\Phi\Lambda' + \Sigma)^{-1} y_i$. Following Gelman
et al. (1996), replicate data yrep,i are sampled from the predictive distribution p(yrep | y, θ)
and D(y,θ) compared to D(yrep,θ).
From a set of MCMC samples, one seeks the marginal posterior density p(θ|y) of the
hyperparameters, and the predictive distribution p(F|y) of the factor scores. Estimation at
iteration t + 1 proceeds by switching between (a) sampling θ(t+1) from the posterior conditional
p(θ | y, F(t)) for θ conditional on y and sampled F scores, and (b) updating F(t+1) from
the conditional density p(F | y, θ(t+1)). The latter corresponds to the imputation step in data
augmentation (Tanner, 1996).
A range of inference issues may occur, subject to identifiability being fully considered
(Section 9.3). The patterns of significant loadings and subject factor scores raise questions
of substantive theory, depending on the application area. As noted by Aitkin and Aitkin
(2005), one can assess the significance of parameter or factor score contrasts on the basis
of the MCMC sample, such as pairwise difference or ratio comparisons of scores on the
kth factor for subjects i1 and i2, $F_{i_1 k} - F_{i_2 k}$ and $F_{i_1 k}/F_{i_2 k}$. Compared to classical analysis, the
posterior means and variances of the factor scores (and of factor contrasts) are routinely
obtained.
To illustrate MCMC complete data-sampling, assume Σ is diagonal in the conjugate normal
model (9.1) with priors $\sigma_{pp}^{-1} \sim \mathrm{Ga}(\alpha_{0p}, \beta_{0p})$, that the precision matrix for F has a Wishart
prior $\Phi^{-1} \sim W(R_0, \rho_0)$, and that the prior for Λ follows the form proposed by Press and
Shigemasu (1989). Specifically, with Λp as the pth row of Λ,
$$\Lambda_p \sim N_Q(\Lambda_{0p}, \sigma_{pp} H_{0p}),$$
where the Q × Q matrix H0p is positive definite. Often, simple assumptions such as
H0p = IQ are made (Lee and Shi, 2000, p.729). Letting y′p be the pth row of y, taking F as the
Q × n matrix of factor scores, and denoting
$\Omega_p = (H_{0p}^{-1} + FF')^{-1}$ and $\eta_p = \Omega_p (H_{0p}^{-1}\Lambda_{0p} + F y_p)$, the posterior conditional for the unique
variances is (Lee and Shi, 2000, p.725)
$$\sigma_{pp}^{-1} \sim \mathrm{Ga}\left(\alpha_{0p} + n/2,\; \beta_{0p} + 0.5\,[y_p' y_p - \eta_p' \Omega_p^{-1} \eta_p + \Lambda_{0p}' H_{0p}^{-1} \Lambda_{0p}]\right).$$
The conditional for Λp is a Q-variate normal with mean ηp and covariance σppΩp, and the
conditional for Φ−1 is Wishart with scale matrix FF′ + R0 and degrees of freedom n + ρ0.
Finally, the conditional p(Fi | y, θ) for the factor scores for subject i is a Q-variate normal
with mean $[\Phi^{-1} + \Lambda'\Sigma^{-1}\Lambda]^{-1} \Lambda'\Sigma^{-1} y_i$ and covariance $[\Phi^{-1} + \Lambda'\Sigma^{-1}\Lambda]^{-1}$.
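For a single factor (Q = 1) these conditionals reduce to scalar formulas, which makes the alternation between factor scores and loadings easy to sketch. The code below assumes centred data (α = 0), a standardised factor (Φ = 1 by default), prior mean Λ0p = 0 and H0p = 1, and holds the unique variances fixed; the function names and toy data are illustrative assumptions, and the sweep updates F before the loadings purely for convenience.

```python
import math
import random

def factor_score_conditional(y_i, lam, sigma, phi=1.0):
    """Mean and variance of p(F_i | y_i, theta) when Q = 1 and Sigma is diagonal."""
    precision = 1.0 / phi + sum(l * l / s for l, s in zip(lam, sigma))
    mean = sum(l * y / s for l, y, s in zip(lam, y_i, sigma)) / precision
    return mean, 1.0 / precision

def loading_conditional(y_p, F, sigma_pp, lam0=0.0, h0=1.0):
    """Mean and variance of p(lambda_p | y, F, sigma_pp) when Q = 1."""
    omega = 1.0 / (1.0 / h0 + sum(f * f for f in F))
    eta = omega * (lam0 / h0 + sum(f * y for f, y in zip(F, y_p)))
    return eta, sigma_pp * omega

def gibbs_sweep(y, lam, sigma, rng):
    """Update every factor score, then every loading; sigma is held fixed."""
    F = []
    for y_i in y:
        m, v = factor_score_conditional(y_i, lam, sigma)
        F.append(rng.gauss(m, math.sqrt(v)))
    new_lam = []
    for p in range(len(lam)):
        y_p = [y_i[p] for y_i in y]
        m, v = loading_conditional(y_p, F, sigma[p])
        new_lam.append(rng.gauss(m, math.sqrt(v)))
    return F, new_lam

rng = random.Random(1)
y = [[0.9, 1.1, 0.4], [-1.2, -0.7, -0.5], [0.1, 0.3, -0.2], [1.5, 0.8, 0.9]]
F, lam = gibbs_sweep(y, lam=[1.0, 1.0, 1.0], sigma=[0.5, 0.5, 0.5], rng=rng)
print(F, lam)
```

A full sampler would add the gamma update for each unique variance and repeat the sweep many times, discarding an initial burn-in.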
$$V = \Lambda\Lambda' + \Sigma,$$
with PQ + P parameters on the right-hand side under a local independence assumption (Σ
taken as diagonal). For P(P + 1)/2 ≥ PQ + P to apply requires that P ≥ 2Q + 1 (Geweke and
Zhou, 1996).
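The counting condition can be verified in one line, dividing both sides by P > 0; the numerical case Q = 2 is added only as an illustration.

```latex
\frac{P(P+1)}{2} \ge PQ + P
\;\Longleftrightarrow\;
\frac{P+1}{2} \ge Q + 1
\;\Longleftrightarrow\;
P \ge 2Q + 1 .
% Illustration: with Q = 2 the bound gives P >= 5, and at P = 5 there are
% P(P+1)/2 = 15 distinct covariance elements against PQ + P = 15 parameters.
```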
In confirmatory models, certain elements of Λ are generally preset to zero, alleviating
requirements such that Σ be diagonal or that Φ exclude covariances/correlations. However,
in exploratory factor analysis (EFA) with multiple factors (Q > 1), additional identifying
constraints must be set to avoid rotation invariance. Otherwise, there is no unique solu-
tion because any orthogonal transformation of Λ leaves the likelihood unchanged (Everitt,
1984, p.16). Thus for F* = H′F and Λ* = ΛH, where HH′ = I,
$$y = 1\alpha + \Lambda F + u = 1\alpha + (\Lambda H)(H'F) + u = 1\alpha + \Lambda^* F^* + u,$$
where cov(F*) = H′cov(F)H = cov(F). The exception is the simple structure case (each
observed variable loading on only one factor) when rotational identifiability is not an issue
(Wedel et al., 2003, p.358; Liu et al., 2005, p.550).
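Rotation invariance is easy to confirm numerically. The sketch below uses made-up loadings for P = 3 and Q = 2 and an arbitrary rotation angle; it shows that ΛH differs from Λ while the common-factor part of the covariance, ΛΛ′, is unchanged (standardised, uncorrelated factors are assumed).

```python
import math

def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(col) for col in zip(*A)]

def rotation(angle):
    """2 x 2 orthogonal rotation matrix H, so that H H' = I."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s], [s, c]]

lam = [[0.8, 0.1], [0.7, 0.3], [0.2, 0.9]]   # hypothetical loadings
H = rotation(0.7)
lam_rot = matmul(lam, H)

common = matmul(lam, transpose(lam))
common_rot = matmul(lam_rot, transpose(lam_rot))
max_diff = max(abs(a - b)
               for ra, rb in zip(common, common_rot)
               for a, b in zip(ra, rb))
print(lam_rot)
print(max_diff)   # numerically zero: the likelihood cannot distinguish the two
```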
In other cases, EFA identification may be achieved by fixing enough λpq to ensure a
unique solution; thus in the case Q = 2, setting any λp2 = 0 would be sufficient. Provided the
variables are ordered in such a way as to ensure substantive justification, a widely adopted
option is to assume Λ to be lower triangular, as in Geweke and Zhou (1996), Ghosh and
Dunson (2009), Zhou et al. (2014), and Mavridis and Ntzoufras (2014), namely
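$$\Lambda = \begin{bmatrix}
\lambda_{11} & 0 & 0 & \cdots & 0 & 0 \\
\lambda_{21} & \lambda_{22} & 0 & \cdots & 0 & 0 \\
\lambda_{31} & \lambda_{32} & \lambda_{33} & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & & \vdots & \vdots \\
\lambda_{Q-1,1} & \lambda_{Q-1,2} & \lambda_{Q-1,3} & \cdots & \lambda_{Q-1,Q-1} & 0 \\
\lambda_{Q1} & \lambda_{Q2} & \lambda_{Q3} & \cdots & \lambda_{Q,Q-1} & \lambda_{QQ} \\
\vdots & \vdots & \vdots & & \vdots & \vdots \\
\lambda_{P1} & \lambda_{P2} & \lambda_{P3} & \cdots & \lambda_{P,Q-1} & \lambda_{PQ}
\end{bmatrix}.$$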
The required structural zeros can be chosen according to prior knowledge, perhaps requir-
ing rearrangement of the indicators. A possible drawback with this constraint is order
dependence (Bhattacharya and Dunson, 2011), whereby the choice of the first Q responses
becomes an important model feature. Conti et al. (2014) avoid assuming a lower triangular
Λ by including identifying criteria into prior densities for model parameters. This leads
to an EFA in which indicators are uniquely allocated to only one factor, but where neither
the number of factors nor the structure of the loading matrix are specified a priori. This
approach is applied in the R package BayesFM (Piatek, 2017).
To avoid potential labelling issues, a lower triangular Λ can be combined with the diago-
nals constraint
$$\lambda_{qq} > 0.$$
If the λqq are unknowns under a standardised factor scale with Φ = I, one might take
or some other positive prior (e.g. lognormal). Otherwise, without such a constraint, and
since ΛF = (−Λ)(−F), loadings on (and hence scores for) a particular factor may flip over
during MCMC iterations (Geweke and Zhou, 1996, p.566). In fact, this may happen even
if a necessarily positive prior, as in (9.3), is adopted. The effectiveness of the qth indicator,
in acting as a “factor founder” (Aßmann et al., 2016) or “anchor item,” and hence guiding
the remaining loadings on the qth factor, may be influenced in substantive applications by
the ordering of indicators (see Example 9.2). This may be so in applications with a large
number of indicators and/or relatively modest correlations.
To completely avoid possible label-switching, a positivity constraint may be applied to
all loadings (Ghosh and Dunson, 2009; Sahu, 2002). A positivity constraint on all discrimination
loadings is in fact standard in item response theory (IRT) (Section 9.4.2) (Natesan et al.,
2016; Luo and Jiao, 2017). Fixing one loading for each construct (usually at 1.0)
under an anchoring constraint also usually ensures that the remaining loadings conform to a
consistent interpretation and direction of the factor (Levy and Mislevy, 2016).
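$$y_{2i} = \alpha_2 + \lambda_{21} F_{1i} + u_{2i}$$
$$y_{3i} = \alpha_3 + \lambda_{31} F_{1i} + u_{3i}$$
$$y_{4i} = \alpha_4 + \lambda_{42} F_{2i} + u_{4i}$$
$$y_{6i} = \alpha_6 + \lambda_{62} F_{2i} + u_{6i}$$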
where the upi are mutually uncorrelated, with $u_{pi} \sim N(0, \tau_p/D_{pi})$ (Hogan and Tchernis, 2004,
p.316).
Since F1 and F2 have arbitrary location and scale, one way of providing identifiability
(the variance scaling or standardisation constraint) is to define them to be in standard
form with zero means and variances of 1 (while still possibly allowing a non-zero corre-
lation between the two factors, which is possible under this confirmatory model). Under
the alternative anchoring constraint (Skrondal and Rabe-Hesketh, 2007), one loading on
each construct is preset for identification, for example, λ11 = λ42 = 1. The Fqi may be assumed
independent of one another, although correlation over areas i may still be incorporated
via two separate univariate CAR (conditional autoregressive) priors (Besag et al., 1991).
Alternatively, correlation both between factors and over areas may be assumed, so that
{F1i,F2i} follow a bivariate CAR prior (see Section 9.6). Under an anchoring constraint, the
within area factor covariance matrix would then contain three unknowns {ϕ11, ϕ22, ρ}
$$\Phi = \begin{pmatrix} \phi_{11} & \rho\sqrt{\phi_{11}\phi_{22}} \\ \rho\sqrt{\phi_{11}\phi_{22}} & \phi_{22} \end{pmatrix},$$
whereas under a standardisation constraint, the diagonal elements in Φ are set to 1, and
only ρ would be unknown.
Adopting an anchoring constraint has utility in helping to prevent “relabelling” of the
construct scores Fqi during MCMC sampling. Since the indicators {y1, …, y3} in this example
are positive measures of material deprivation, setting λ11 = 1 is consistent with the construct
F1i being a positive deprivation measure. If, however, one adopted the standardised
factor assumption with ϕpp = 1 and all the λpq free, it would be necessary, in order to prevent
label switching, to set a prior on one or possibly more loadings constraining positivity, for
example,
with factor 1 corresponding to the verbal tests, and factor 2 to the performance items. So
exact zeros are used to define loadings (λ71, λ81, λ91, λ10,1, λ11,1) of the performance items
on factor 1, and of the verbal item loadings (λ12, λ22, λ32, λ42, λ52, λ62) on factor 2. Thus, with
exact zero loadings not shown, one has
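$$y_{2i} = \alpha_2 + \lambda_{21} F_{1i} + u_{2i}$$
$$y_{3i} = \alpha_3 + \lambda_{31} F_{1i} + u_{3i}$$
$$y_{4i} = \alpha_4 + \lambda_{41} F_{1i} + u_{4i}$$
$$y_{5i} = \alpha_5 + \lambda_{51} F_{1i} + u_{5i}$$
$$y_{6i} = \alpha_6 + \lambda_{61} F_{1i} + u_{6i}$$
$$y_{7i} = \alpha_7 + \lambda_{72} F_{2i} + u_{7i}$$
$$y_{8i} = \alpha_8 + \lambda_{82} F_{2i} + u_{8i}$$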
with uncorrelated normally distributed upi. In the second model, exact zeros are
replaced by approximate zeros specified using informative normal priors with a small
variance of 0.01, so that 95% of the prior variation is between −0.2 and 0.2 (Muthén and
Asparouhov, 2012).
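The interval implied by a prior variance of 0.01 can be checked directly, since the prior standard deviation is 0.1. The short sketch below uses Python's standard library and is purely an arithmetic check.

```python
from statistics import NormalDist

# Approximate-zero prior on a cross-loading: N(0, 0.01), i.e. sd = 0.1.
prior = NormalDist(mu=0.0, sigma=0.01 ** 0.5)

lower, upper = prior.inv_cdf(0.025), prior.inv_cdf(0.975)
mass = prior.cdf(0.2) - prior.cdf(-0.2)

print(round(lower, 3), round(upper, 3))   # central 95% prior interval
print(round(mass, 3))                     # prior mass between -0.2 and 0.2
```

The central 95% interval is roughly ±0.196, so quoting ±0.2 is a rounding of two prior standard deviations.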
This model has a lower WAIC (widely applicable information criterion), namely 4828
compared to 4841, than the exact zero CFA. Table 9.1 compares the two sets of estimated
loadings. Estimated main loadings under the second model are similar to those under
the exact zero CFA, and both show that indicator 11 (Coding) is essentially unrelated
to the second factor (the loading λ11,2). The second model also suggests a significant
cross-loading (λ22) of indicator 2 (Comprehension) on the performance factor, with 95%
posterior interval (0.07,0.34).
A third analysis via rjags uses binary factor-indicator selection indicators (Mavridis
and Ntzoufras, 2014), while also retaining the approximate zero prior formulation.
Spike-slab priors are adopted on the selection indicators. Thus cross-loadings
(λ71, λ81, λ91, λ10,1, λ11,1) and (λ12, λ22, λ32, λ42, λ52, λ62) are assigned informative N(0,0.01) priors,
while for identifiability the main loadings are assigned N(0,1) priors constrained to
positive values.
One objective of this analysis is to detect indicators not relevant to the postulated
confirmatory analysis scheme. The analysis with rstan indicated that indicator 11 may
not be relevant. Selection indicators γpq therefore have Bernoulli probabilities πp that are
indicator specific. For indicator p, one has
$$\gamma_{pq} \sim \mathrm{Bernoulli}(\pi_p),$$
TABLE 9.1
Posterior Summary. Exact Zero vs Approximate Zero
Confirmatory Factor Analysis
Analysis 1 (Cross Loadings Preset to Zero)
Mean St Devn 2.5% 97.5%
λ11 0.77 0.07 0.63 0.91
λ21 0.70 0.07 0.56 0.85
λ31 0.57 0.08 0.42 0.73
λ41 0.71 0.07 0.57 0.86
λ51 0.78 0.07 0.64 0.93
λ61 0.39 0.08 0.24 0.55
λ72 0.59 0.09 0.42 0.76
λ82 0.47 0.09 0.3 0.65
λ92 0.69 0.08 0.53 0.85
λ10,2 0.57 0.08 0.41 0.74
λ11,2 0.11 0.07 0.01 0.26
Factor Correlation 0.58 0.08 0.41 0.72
Analysis 2 (Informative Prior on Cross Loadings)
Main loadings Mean St Devn 2.5% 97.5%
λ11 0.78 0.07 0.65 0.92
λ21 0.59 0.08 0.44 0.74
λ31 0.55 0.08 0.39 0.71
λ41 0.64 0.08 0.49 0.79
λ51 0.76 0.08 0.61 0.91
λ61 0.39 0.08 0.23 0.55
λ72 0.55 0.10 0.36 0.73
λ82 0.43 0.09 0.25 0.61
λ92 0.65 0.09 0.47 0.83
λ10,2 0.56 0.09 0.39 0.75
λ11,2 0.09 0.06 0 0.24
Cross Loadings Mean St Devn 2.5% 97.5%
Loadings of Performance Items on Verbal Factor
λ71 0.10 0.06 0.01 0.24
λ81 0.08 0.05 0.01 0.19
λ91 0.08 0.05 0 0.20
λ10,1 0.05 0.04 0 0.15
λ11,1 0.06 0.04 0 0.17
Loadings of Verbal Items on Performance Factor
λ12 0.04 0.03 0 0.13
λ22 0.20 0.07 0.07 0.34
λ32 0.06 0.05 0 0.17
λ42 0.14 0.06 0.02 0.27
λ52 0.06 0.05 0 0.17
λ62 0.05 0.04 0 0.14
Factor Correlation 0.35 0.11 0.12 0.57
TABLE 9.2
Posterior Summary. Predictor Selection Combined with Approximate Zero CFA
Main loadings Mean St Devn 2.5% 97.5% Mean Selection Probability (γjk)
λ11 1.15 0.29 0.64 1.80 1
λ21 0.84 0.23 0.45 1.37 1
λ31 0.81 0.22 0.44 1.29 1
λ41 0.92 0.24 0.49 1.47 1
λ51 1.11 0.28 0.62 1.72 1
λ61 0.59 0.19 0.28 1.01 1
λ72 0.65 0.22 0.29 1.14 1
λ82 0.50 0.19 0.19 0.93 1
λ92 0.74 0.27 0.31 1.35 1
λ10,2 0.65 0.23 0.28 1.17 1
λ11,2 0.02 0.06 0.00 0.21 0.14
Cross Loadings Mean St Devn 2.5% 97.5%
Loadings of performance items on verbal factor
λ71 0.03 0.07 −0.11 0.20 0.67
λ81 0.01 0.07 −0.13 0.17 0.63
λ91 0.01 0.07 −0.14 0.16 0.64
λ10,1 −0.03 0.07 −0.19 0.11 0.65
λ11,1 0.01 0.05 −0.09 0.15 0.36
Loadings of verbal items on performance factor
λ12 −0.06 0.07 −0.22 0.06 0.71
λ22 0.15 0.08 0.00 0.30 0.93
λ32 −0.01 0.06 −0.15 0.12 0.60
λ42 0.08 0.08 −0.04 0.24 0.80
λ52 −0.01 0.06 −0.16 0.10 0.61
λ62 −0.04 0.07 −0.20 0.08 0.67
Factor Correlation 0.54 0.10 0.33 0.71
$$\pi_p \sim \mathrm{Beta}(1, 1),$$
TABLE 9.3
Estimated Factor Loadings, Two Factors. Applicant Data
                      Maximum Likelihood     Bayesian EFA, Posterior Summary
                                             Factor 1               Factor 2
Variable              Factor 1   Factor 2    Mean   2.5%   97.5%    Mean   2.5%   97.5%
y1 Form of Letter 0.37 0.56 0.62 0.33 0.93 0.00 −0.17 0.19
y2 Appearance 0.53 0.03 0.16 −0.21 0.54 0.44 0.09 0.79
y3 Academic ability 0.12 0.19 0.21 −0.14 0.58 −0.01 −0.38 0.35
y4 Likeability 0.51 0.08 0.20 −0.17 0.60 0.38 −0.01 0.74
y5 Self confidence 0.84 −0.39 −0.22 −0.67 0.36 0.98 0.67 1.36
y6 Lucidity 0.88 −0.18 0.02 −0.40 0.57 0.88 0.52 1.23
y7 Honesty 0.36 −0.29 −0.18 −0.57 0.22 0.46 0.11 0.83
y8 Salesmanship 0.91 −0.05 0.16 −0.26 0.68 0.83 0.42 1.17
y9 Experience 0.32 0.73 0.81 0.48 1.21 −0.16 −0.68 0.33
y10 Drive 0.86 0.10 0.30 −0.09 0.77 0.68 0.25 1.03
y11 Ambition 0.89 −0.14 0.05 −0.37 0.59 0.87 0.51 1.22
y12 Grasp 0.90 0.00 0.21 −0.20 0.72 0.79 0.37 1.13
y13 Potential 0.89 0.09 0.31 −0.09 0.79 0.71 0.27 1.06
y14 Keenness to join 0.63 0.06 0.21 −0.16 0.61 0.50 0.12 0.85
y15 Suitability 0.61 0.67 0.83 0.49 1.22 0.11 −0.43 0.58
variance in the original indicators. It can be seen that the second factor emphasises form
of letter (y1), experience (y9), and suitability (y15), while the first factor is more generic,
with loadings over 0.8 on seven of the 15 indicators. Similar results are obtained using
the R package lavaan.
With intercorrelation between the two factors allowed under an EFA approach, there
are Q² = 4 restrictions needed when loadings are treated as fixed effects parameters
(Merkle and Wang, 2016). Under fixed effects priors, and counting of parameter restrictions
under a degrees of freedom approach, this can be achieved by (a) assuming standardised
factors in the reference model; (b) setting λ*12 = λ12 = 0 (the loading of indicator 1
on factor 2) for a lower triangular structure; and (c) setting an additional loading on the
first factor to zero (e.g. λ*p1 = λp1 = 0 for some p). However, setting one of the loadings λ*p1
on the first factor to be zero is potentially arbitrary, and may affect substantive findings.
We avoid this, and formal parameter counting, by (a) adopting an approximate zero
prior on λ*12, namely λ*12 ~ N(0, 0.01), instead of an exact zero, and by (b) adopting hierarchical
priors, rather than fixed effects priors, on the remaining λ*p2, and on the λ*p1. Thus
$$\lambda^*_{p1} \sim N(0, \omega_1),$$
$$\lambda^*_{p2} \sim N(0, \omega_2), \quad p > 1,$$
where ω1 and ω2 are unknown variances. Because of the approximate zero restriction on
λ*12, one might anticipate that factor 1 in the EFA would be the most relevant to explaining
indicator y1 (form of application letter).
A two-chain run of 20,000 iterations in jagsUI shows estimated loadings under
Bayesian and MLE estimation as in Table 9.3. The estimated loadings λp1 on factor 1 are
highest for y1, y9, and y15, with other loadings having credible intervals straddling zero.
By contrast, the second factor is a more generic factor, similar to factor 1 in the MLE
estimation, with highest loadings on y5, y6, y8, y10, y11, y12, and y13. The 90% highest pos-
terior density interval for the factor correlation (rho in the jags code) is (0.04,0.75). Both
types of analysis (Bayesian and maximum likelihood) show low loadings of academic
ability (y3) on the first two factors, and one might adopt a variable selection approach to
confirm its relevance, or add an additional factor.
A maximum likelihood analysis with three factors identifies a third factor, loading
highly on likeability (y4), honesty (y7), and keenness to join (y14). A Bayesian analysis
with the implicit constraint λqq > 0 identifies a similar factor (as factor 3) if the originally
labelled indicators y3 and y4 are reordered, so that likeability becomes y3. The Bayesian
analysis then identifies a factor with high positive loadings on likeability and honesty
(y3 and y7 in the revised sorting), with respective 90% hpd intervals (0.15,1.34) and
(−0.04,1.21). This factor is not detected if the original ordering of indicators is retained.
$$p(y_{pi} \mid F_i) \propto \exp\left\{ \frac{y_{pi}\theta_{pi} - b(\theta_{pi})}{\phi_{pi}} + c(y_{pi}, \phi_{pi}) \right\}$$
where θpi is the canonical parameter, with the ϕpi typically taken as known scale parameters.
Denoting regression terms as ηpi = g(θpi) where g is a link function, and ηi = (η1i, …, ηPi)′,
intercept α = (α1, …, αP)′ and P × Q loading matrix Λ, the regression term without
extra-variation is
$$\eta_i = \alpha + \Lambda F_i,$$
while allowing extra-variation
$$\eta_i = \alpha + \Lambda F_i + u_i,$$
with ui = (u1i, …, uPi)′, where the upi are independent of each other under conditional
independence. The errors u (if present) and factor scores F are also independent.
Normality of errors and factors is often assumed with (u1i, …, uPi)′ ~ NP(0, Σ), where Σ is
diagonal, and Fi ~ NQ(0, Φ), where Φ may be non-diagonal according to the form of model
(e.g. exploratory or confirmatory) assumed. Compared to the normal data-normal factor
model, the marginal densities of y are no longer simply derived, but involve integration
over F, namely
$$p(y_i \mid \theta, \psi) = \int \prod_{p=1}^{P} p(y_{pi} \mid F_i, \theta)\, p(F_i \mid \psi)\, dF_i,$$
where ψ are hyperparameters defining the density of Fi. The usual conditional indepen-
dence assumptions are made. For example, for a P-variate categorical response (Kp cat-
egories for the pth response), the conditional probability that subject i with factor scores
Fi = (F1i, …, FQi)′ exhibits a particular set of responses is the product of separate categorical
likelihoods
$$\Pr(y_{1i} = k_1, y_{2i} = k_2, \ldots, y_{Pi} = k_P \mid F_i) = \Pr(y_{1i} = k_1 \mid F_i)\Pr(y_{2i} = k_2 \mid F_i) \cdots \Pr(y_{Pi} = k_P \mid F_i).$$
For factor reduction of binary, multinomial, or ordinal data, there may be benefit (e.g. in
simplified MCMC sampling algorithms) in considering latent variables posited to underlie
the observed discrete responses. The missing data then consists not only of factor scores
but of the latent scale data y*pi that underlie the observed data ypi. Thus for ypi binary, and
ypi = 1 if y*pi > 0 and ypi = 0 otherwise, one might take y*i = (y*1i, …, y*Pi) to be normal or logistic,
with the diagonal terms in the unique covariance matrix Σ set (usually to 1) for identifi-
ability. For instance, a normal model taking the underlying responses to be conditionally
independent given the factors, would be
where Λp is the pth row of Λ, and the truncation ranges are determined by the observed ypi.
$$\mu_{pi} = \exp(\beta_p x_{pi} + u_{pi}),$$
where D is an unrestricted covariance matrix; see also Inouye et al. (2017) on Poisson mix-
ture formulations, and Rodrigues-Motta et al. (2013) for the case where ypi may follow
different count densities. The ypi are conditionally independent given the correlated errors
ui = (u1i, …, uPi)′. Defining vpi = exp(upi), one has equivalently $y_{pi} \sim \mathrm{Po}(\lambda_{pi} v_{pi})$ with
$$\lambda_{pi} = \exp(\beta_p x_{pi}),$$
$$(v_{1i}, \ldots, v_{Pi}) \sim LN_P(\mu_v, \Sigma_v).$$
That is, the vpi are multivariate lognormal with mean vector $\mu_v = \exp(0.5\,\mathrm{diag}(D))$, and
covariance $\Sigma_v = \mathrm{diag}(\mu_v)[\exp(D) - 11']\,\mathrm{diag}(\mu_v)$.
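The lognormal moment formulas can be evaluated element by element, reading exp(D) as the elementwise exponential of D. The sketch below assumes zero-mean normal errors with covariance D and uses a hypothetical 2 × 2 matrix.

```python
import math

def lognormal_moments(D):
    """Mean vector and covariance of v = exp(u), with u ~ N(0, D).

    Each covariance entry is m_p * (exp(D_pq) - 1) * m_q, which is the
    elementwise form of diag(m)[exp(D) - 11']diag(m).
    """
    P = len(D)
    m = [math.exp(0.5 * D[p][p]) for p in range(P)]
    S = [[m[p] * (math.exp(D[p][q]) - 1.0) * m[q] for q in range(P)]
         for p in range(P)]
    return m, S

D = [[0.5, 0.2], [0.2, 0.3]]   # hypothetical error covariance
m, S = lognormal_moments(D)
print(m)
print(S)
```

A positive off-diagonal element in D therefore translates into a positive covariance between the multiplicative effects vpi, and hence between the counts.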
Other ways to generate correlated count data include the overlapping sums technique
(Madsen and Dalthorp, 2007). Thus consider independent Poisson variables Z12, Z1 and Z2
with means θ12, θ1 and θ2; then y1 = Z1 + Z12 and y2 = Z2 + Z12 are correlated with marginal
means θ1 + θ12 and θ2 + θ12 and covariance θ12. The mean and covariance of the corresponding
joint Poisson density for three variables is provided by Karlis and Meligkotsidou (2005,
p.257).
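The overlapping sums construction is simple to simulate. The sketch below uses a basic multiplication-method Poisson sampler (adequate only for small means), with illustrative values θ1 = 1, θ2 = 2 and θ12 = 2; the function names are hypothetical.

```python
import math
import random

def poisson_draw(mean, rng):
    """Multiplication-method Poisson sampler; suitable for small means only."""
    limit = math.exp(-mean)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def overlapping_sums(theta1, theta2, theta12, n, rng):
    """Draw n pairs (y1, y2) with y1 = Z1 + Z12 and y2 = Z2 + Z12."""
    pairs = []
    for _ in range(n):
        z12 = poisson_draw(theta12, rng)
        pairs.append((poisson_draw(theta1, rng) + z12,
                      poisson_draw(theta2, rng) + z12))
    return pairs

def implied_correlation(theta1, theta2, theta12):
    """Correlation implied by covariance theta12 and Poisson marginal variances."""
    return theta12 / math.sqrt((theta1 + theta12) * (theta2 + theta12))

rng = random.Random(42)
draws = overlapping_sums(1.0, 2.0, 2.0, 20000, rng)
mean1 = sum(a for a, _ in draws) / len(draws)
mean2 = sum(b for _, b in draws) / len(draws)
cov = sum((a - mean1) * (b - mean2) for a, b in draws) / (len(draws) - 1)
print(round(mean1, 2), round(mean2, 2), round(cov, 2))
print(round(implied_correlation(1.0, 2.0, 2.0), 3))
```

Because the shared component Z12 is non-negative, this construction can only induce positive correlation between the two counts.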
Factor models for count data may include both normal factor scores and residuals
upi, taken as uncorrelated if the usual conditional independence assumption is made.
Thus
$$y_{pi} \sim \mathrm{Po}(\mu_{pi}),$$
$$\mu_{pi} = \exp(\beta_p x_{pi} + \Lambda_p F_i + u_{pi}),$$
where Λp is 1 × Q, Fi = (F1i, …, FQi)′ and under a standardised factor constraint Fi ~ NQ(0, RF)
where R F is a correlation matrix, with possibly unknown off-diagonal terms subject to
identifiability. Alternatively, Wedel et al. (2003) consider gamma distributed factors in an
identity link model, as well as normal F scores combined with a log link. Gamma factors
would have mean 1 to avoid location invariance, and taking their cumulative impact to be
multiplicative, one could have
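$$\mu_{pi} = \exp(\beta_p x_{pi} + u_{pi})\,(F_{1i}^{\omega_{p1}} F_{2i}^{\omega_{p2}} \cdots F_{Qi}^{\omega_{pQ}}),$$
$$F_{qi} \sim \mathrm{Ga}(\varphi_q, \varphi_q),$$
$$F_{qi} = \beta_q x_i + w_{qi},$$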
while the errors in the Poisson likelihood measurement equations for ypi ~ Po(μpi) are correlated
over outcomes under a common factor model
$$\log(\mu_{pi}) = \alpha_p + \Lambda_p F_i + \kappa_p u_i.$$
Assuming ui is univariate, one of the loadings κp is preset if the variance σu² of the common
residual scores ui is unknown. Spatial applications of common factors are exemplified by
Wang and Wall (2003) and Nethery et al. (2015).
$$\mathrm{logit}(\pi_{pi}) = \beta_p x_{pi} + u_{pi},$$
$$(y^*_{1i}, y^*_{2i}, \ldots, y^*_{Pi}) \sim N_P(\eta_i, \Sigma)\, I(A_i, B_i), \qquad (9.4)$$
The lower and upper sampling limits in the vectors Ai = (A1i, …, APi) and Bi = (B1i, …, BPi)
depend on the observations: sampling of the constituent y*pi is confined to values above
zero when ypi = 1, and to zero or negative values when ypi = 0. Scale mixtures of multivariate
normal densities for the y*pi are also possible, and equivalent to a multivariate Student t for
(y*1i, y*2i, …, y*Pi), which for particular degrees of freedom approximates a multivariate logit
link (Chen and Dey, 1998). A multivariate logit regression may be achieved directly with
suitable mixing strategies (Chen and Dey, 2000; O’Brien and Dunson, 2004).
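For a single response with unit variance, the truncated sampling of the latent y* can be sketched with an inverse-CDF draw. This is a minimal univariate illustration, assuming a moderate value of the linear predictor η (the inverse-CDF approach loses accuracy far in the tails); the function name and the clamping constant are assumptions.

```python
import random
from statistics import NormalDist

def draw_latent(y, eta, rng, sd=1.0):
    """Draw y* from N(eta, sd^2) restricted to (0, inf) if y == 1, else to (-inf, 0]."""
    dist = NormalDist(eta, sd)
    p0 = dist.cdf(0.0)                     # Pr(y* <= 0) without truncation
    if y == 1:
        u = p0 + rng.random() * (1.0 - p0)  # uniform on the upper region
    else:
        u = rng.random() * p0               # uniform on the lower region
    u = min(max(u, 1e-12), 1.0 - 1e-12)     # keep inv_cdf away from 0 and 1
    return dist.inv_cdf(u)

rng = random.Random(7)
positive = [draw_latent(1, -0.5, rng) for _ in range(1000)]
negative = [draw_latent(0, 0.5, rng) for _ in range(1000)]
print(min(positive), max(negative))
```

In the multivariate model each y*pi would be drawn in turn from its conditional normal given the other latent responses, using the same truncation rule.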
The covariance matrix Σ in (9.4) is not identified, and when the predictor effects vary by
response, only the correlation matrix can be identified (Rossi et al., 2005). The identification
criteria for the multivariate probit differ from those of the multinomial probit where iden-
tification is obtained by setting one of the diagonal variance elements σpp (e.g. the first) to 1
(McCulloch et al., 2000). It is possible to sample the correlation matrix R directly (Barnard
et al., 2000; Chib and Greenberg, 1998). Talhouk et al. (2012) use a parameter expansion
method to sample R, and the LKJ (Lewandowski, Kurowicka and Joe) prior may be used
(see Example 9.6). One may also (Edwards and Allenby, 2003; McCulloch and Rossi, 1994)
sample the Σ matrix or its inverse from an unrestricted prior, and then scale both the fixed
effects and the covariance matrix to their identified forms, namely
$$\beta_p^* = \beta_p / \sqrt{\sigma_{pp}},$$
$$R = D\Sigma D,$$
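The rescaling step can be sketched as follows, assuming D is the diagonal matrix with elements 1/√σpp (an assumption, since D is not defined in the surrounding text) and treating each βp as a scalar for simplicity.

```python
import math

def to_identified(beta, Sigma):
    """Scale coefficients and covariance to their identified forms.

    Assumes D = diag(1 / sqrt(sigma_pp)), so that R = D Sigma D is a
    correlation matrix and beta*_p = beta_p / sqrt(sigma_pp).
    """
    P = len(Sigma)
    d = [1.0 / math.sqrt(Sigma[p][p]) for p in range(P)]
    beta_star = [beta[p] * d[p] for p in range(P)]
    R = [[d[p] * Sigma[p][q] * d[q] for q in range(P)] for p in range(P)]
    return beta_star, R

# Hypothetical unrestricted draw of (beta, Sigma) for P = 2 responses.
beta_star, R = to_identified([2.0, 3.0], [[4.0, 1.0], [1.0, 1.0]])
print(beta_star)
print(R)
```

Applied at each MCMC iteration, this post-processing yields posterior samples of the identified quantities even though the sampler itself works with the unrestricted Σ.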
Factor models for multiple binary data most typically have the general linear mixed
form
$$y_{pi} \sim \mathrm{Bern}(\pi_{pi}),$$
$$g(\pi_{pi}) = \beta_p x_{pi} + \Lambda_p F_i,$$
where g is the link, and Fi = (F1i, …, FQi)′. As for the normal linear factor model, a common
assumption for the density of F is normal with a known scale. If, additionally, factors
are independent, then Fqi ~ N(0, 1), q = 1, …, Q. If instead the assumption Fqi ~ Logist(0, 1) is
made, with loadings κpq in
$$g(\pi_{pi}) = \beta_p x_{pi} + \sum_{q} \kappa_{pq} F_{qi},$$
then κpq ≈ (√3/π)λpq, since the variance of a standard logistic is π²/3 (Bartholomew, 1987).
A widely applied method in educational and psychometric evaluation (Albert, 1992;
Rupp et al., 2004; Fox and Glas, 2005) is based on item response theory (IRT for short).
Typically, the observation vector yi = (y1i, …, yPi) consists of binary items measuring ability,
with 1 denoting a correct answer and 0 an incorrect answer, and a model seeks a single
latent ability factor score Fi. Factor score identifiability is generally obtained by assuming
Fi ~ N(0,1). Under conditional independence, the joint success probability given Fi is
$$\Pr(y_{1i} = 1, y_{2i} = 1, \ldots, y_{Pi} = 1 \mid F_i) = \Pr(y_{1i} = 1 \mid F_i)\Pr(y_{2i} = 1 \mid F_i) \cdots \Pr(y_{Pi} = 1 \mid F_i).$$
With a Bernoulli likelihood, ypi ~ Bern(πpi), and link g, one has a factor model
$$g(\pi_{pi}) = \alpha_p + \lambda_p F_i. \qquad (9.5)$$
IRT rests on relatively strong assumptions, namely that a unidimensional factor is appro-
priate, conditional independence of the items given the latent factor, and a monotonic rela-
tionship between latent ability and performance on the items (Arima, 2015).
The intercepts αp can be interpreted as measures of difficulty of item p, with more nega-
tive αp implying greater difficulty under the parameterisation in (9.5), while λp measures
an item’s power to discriminate ability between subjects. A now frequent practice assigns
positive (e.g. lognormal, gamma, truncated normal) priors to the discrimination param-
eters λp, and draws these parameters from a hierarchical density with common variance
(Curtis, 2010; Luo and Jiao, 2017). A hierarchical prior may also be assumed for the diffi-
culty parameters. Using hierarchical priors may improve convergence. An alternative is to
adopt fixed effects priors (Sahu, 2002), as illustrated in the Stan Case Studies. This model
may also be parameterised as
$$g(\pi_{pi}) = \lambda_p (F_i - \alpha_p),$$
so that αp increases with difficulty. These are called two-parameter logistic or probit IRT
models. The three-parameter model includes a guessing (or threshold) parameter cp,
whereby
An IRT information function Ip(F) measures how precisely an item measures the latent
ability scale. For example, easy items may provide little information about higher ability
subjects, while difficult items will provide little information regarding lower ability
subjects. Assuming a two-parameter logistic, the item information function can be
obtained as
$$I_p(F) = \lambda_p^2\, \pi_p(F)\,[1 - \pi_p(F)],$$
where πp(F) = logit−1[λp(F − αp)], and can be displayed graphically with F taken over
the range of ability scores. The total information function is the sum of the item-specific
functions.
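A short sketch of the two-parameter logistic information function, using the standard form Ip(F) = λp² πp(F)[1 − πp(F)] with πp(F) = logit−1[λp(F − αp)]. The item parameters below are made up for illustration.

```python
import math

def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))

def item_information(F, lam, alpha):
    """Two-parameter logistic item information at ability F."""
    p = inv_logit(lam * (F - alpha))
    return lam ** 2 * p * (1.0 - p)

def total_information(F, items):
    """Sum of item information over (lam, alpha) pairs."""
    return sum(item_information(F, lam, alpha) for lam, alpha in items)

# Hypothetical items: an easy one, a medium one, and a hard, highly discriminating one.
items = [(1.0, -1.5), (1.2, 0.0), (2.0, 1.5)]
for F in (-2.0, 0.0, 2.0):
    print(F, round(total_information(F, items), 3))
```

Each item is most informative at F = αp, where πp(F) = 0.5 and the information equals λp²/4.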
Soares et al. (2009) and Fox and Glas (2005) describe Bayesian IRT models allowing dif-
ferential item functioning (DIF), for example, when one or more items are not appropriate
for measuring ability because the knowledge needed for a correct answer is culturally
specific. Thus, let xi = 0 for a reference population and xi = 1 for a focal group (e.g. disad-
vantaged or minority group) (Magis et al., 2015). Then DIF is indicated if the extended
model
$$\eta_{pi} = \alpha_p + \lambda_p F_i + x_i(\gamma_p + \delta_p F_i),$$
has better fit than the standard model without group differentiation (Choi et al., 2011).
and ypi = 0 corresponds to y*pi ~ N(ηpi, 1)I(−∞, 0). For a logit link, g−1(u) = L(u) = e^u/(1 + e^u), and
sampling of y* is from a standard logistic. Sahu (2002) considers an extra data imputation
to provide three-parameter IRT models. The three-parameter probit IRT model
specifies
$$\pi_{pi} = c_p + (1 - c_p)\Phi(\alpha_p + \lambda_p F_i),$$
and the three-parameter logit model specifies
$$\pi_{pi} = c_p + (1 - c_p)L(\alpha_p + \lambda_p F_i).$$
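The three-parameter response probability can be written as a single function covering both links. The sketch below is illustrative only, with made-up parameter values; the guessing parameter cp sets the lower asymptote of the response curve.

```python
import math
from statistics import NormalDist

def three_pl(F, alpha, lam, c, link="logit"):
    """Success probability c + (1 - c) * G(alpha + lam * F), G probit or logit."""
    eta = alpha + lam * F
    if link == "probit":
        base = NormalDist().cdf(eta)
    else:
        base = 1.0 / (1.0 + math.exp(-eta))
    return c + (1.0 - c) * base

# With guessing parameter 0.2, a very low ability subject still succeeds
# with probability close to 0.2.
for F in (-4.0, 0.0, 4.0):
    print(F, round(three_pl(F, 0.0, 1.0, 0.2), 3),
          round(three_pl(F, 0.0, 1.0, 0.2, link="probit"), 3))
```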
Lee and Song (2003) adopt a latent scale approach to a structural equation model for mul-
tiple binary observations. Their model specifies
$$y_i^* = \alpha + \Lambda F_i + u_i,$$
where the latent constructs Fi are partitioned into endogenous and exogenous vector components
Fi = (F1i, F2i) of dimension Q1 and Q2 respectively, with structural model
$$F_{1i} = B F_{1i} + \Gamma F_{2i} + w_i.$$
For identification ui ~ NP(0, I), while F1i ~ NQ1(0, Φ1), F2i ~ NQ2(0, Φ2), wi ~ NQ1(0, Σw) and
each row of Λ follows a separate normal prior. The observed binary data y is augmented
with latent data {y*,F} to provide complete data {y,y*,F}. Setting θ = (α, Λ, Φ1, Φ2, Σw, B, Γ), the
updating sequence involves sampling from conditionals p(θ(t+1) | F(t), y*(t)), p(F(t+1) | θ(t+1), y*(t)),
and p(y*(t+1) | F(t+1), θ(t+1)).
Dunson and Herring (2005) consider instead the case where the underlying y*pi (e.g.
tumour counts) are Poisson, or overdispersed Poisson, and the observations ypi (e.g.
whether tumours are present) are binary. Thus
where ξi = (ξ1i, …, ξQi) are gamma distributed latent constructs, and the loadings Λp are
also gamma distributed. The sampling limits are (Api = 0, Bpi = 0) when ypi = 0, and (Api = 1,
Bpi = ∞) when ypi = 1.
9.4.4 Categorical Data
For unordered polytomous indicators $y_{pi}$ with $M_p$ categories ($p = 1, \ldots, P$), intercept and loading parameters are typically specific to the category of each item, with one category (e.g. the final one) as reference. Assume a multiple logit link, with multinomial parameter $\pi_{pi} = (\pi_{pi1}, \ldots, \pi_{piM_p})$ for subject $i$ and indicator $p$. Then while factors are common across categories, loadings are specific to indicator $p$ and category $h$ of that indicator,
$$\pi_{pih} = \phi_{pih} \Big/ \sum_{m=1}^{M_p} \phi_{pim}, \qquad h = 1, \ldots, M_p,$$
$$\phi_{piM_p} = 1,$$
with the usual constraints on Λ and/or F to avoid scale and rotational invariance.
Factor models for multiple ordinal items $y_{pi} \in (1, \ldots, K_p)$ refer to locations on an underlying continuous scale $z_{pi}$. Thus $y_{pi} = j$ when $\alpha_{p,j-1} \le z_{pi} < \alpha_{pj}$, where $\alpha_{pj}$ are cutpoints on the underlying scale. Define binary indicators $d_{pij} = 1$ if $y_{pi} = j$, and $d_{pij} = 0$ otherwise, and denote $d_{pi} = (d_{pi1}, \ldots, d_{piK_p})$. With $\pi_{pi} = (\pi_{pi1}, \ldots, \pi_{piK_p})$, $\eta_{pi}$ denoting a regression term potentially including latent factors, and $z_{pi} = \eta_{pi} + \varepsilon_{pi}$, where errors $\varepsilon_{pi}$ have cdf $P(\varepsilon)$, one has
$$\pi_{pij} = \gamma_{pij} - \gamma_{pi,j-1},$$
where the $\gamma_{pij} = P(\alpha_{pj} - \eta_{pi})$ are cumulative probabilities, and where taking $\eta_{pi}$ as uniform across response categories $j$ defines the proportional odds assumption. For example, assuming a univariate latent factor $F_i$, and other predictors $X_i$, one has
$$\mathrm{logit}(\gamma_{pij}) = \alpha_{pj} - \lambda_p F_i - \beta_p X_i.$$
One application is in the graded response model for ordinal outcome IRT (Luo and Jiao, 2017). Assuming $X_i$ excludes an intercept, the $K_p - 1$ thresholds $\{\alpha_{p1}, \alpha_{p2}, \ldots, \alpha_{p,K_p-1}\}$ are unknowns subject to the order constraint $\alpha_{p1} \le \alpha_{p2} \le \ldots \le \alpha_{p,K_p-1}$. An augmented data approach may also be used for latent variable analysis with ordinal responses (Lee and Tang, 2006; Poon and Wang, 2012).
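To make the cumulative logit construction concrete, the sketch below converts ordered cutpoints and a regression term into category probabilities by differencing cumulative probabilities. The function name `ordinal_probs` and all numerical values are illustrative assumptions.

```python
import math

def logistic(u):
    return 1.0 / (1.0 + math.exp(-u))

def ordinal_probs(cutpoints, eta):
    """Category probabilities as differences of cumulative probabilities.

    Cumulative probabilities satisfy logit(gamma_j) = cutpoint_j - eta, with
    gamma_0 = 0 and gamma_K = 1 (proportional odds: eta is common to all j).
    """
    if any(b < a for a, b in zip(cutpoints, cutpoints[1:])):
        raise ValueError("cutpoints must be non-decreasing")
    gamma = [0.0] + [logistic(a - eta) for a in cutpoints] + [1.0]
    return [gamma[j] - gamma[j - 1] for j in range(1, len(gamma))]

# Hypothetical item: four cutpoints (K = 5 categories), loading 1.5, factor score 0.4.
cuts = [-2.5, -0.2, 1.2, 3.4]
probs = ordinal_probs(cuts, eta=1.5 * 0.4)
```

Because the regression term enters with a negative sign on the cumulative logit scale, a larger factor score shifts probability mass toward the higher categories.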
$$D^{-1} \sim W(PI, P).$$
With jagsUI for estimation, the mean scaled deviance is obtained as 209, comparing
closely to the number of observations, namely 196, so that Poisson extra-variation is
accounted for. Most predictor effects are insignificant: the only significant effects are
of unemployment on the rape crime rate (with β11 having mean 0.064), and of GDP on
manslaughter rates. Most correlations $r_{jm} = \mathrm{corr}(u_{ji}, u_{mi})$ in the regression residuals have
credible intervals straddling zero, though $r_{13}$ has a posterior mean of 0.46 with 95% CRI
(−0.03,0.81). Adequate model performance is shown by the fact that only 9 of the 196
observations have mixed predictive p values under 0.05 or over 0.95 (Marshall and
Spiegelhalter, 2007).
To illustrate a factor analytic approach to these data, the four predictors are taken to
be causes of a single underlying crime construct, Fi, in a MIMIC analysis. The Poisson
regressions form a measurement model in which crime levels are indicators of Fi. A
further common factor ui is included in the model for the crime types to account for
residual variation in the crime data. So
$$\rho_{pi} = \exp(\alpha_p + \lambda_p F_i + \kappa_p u_i),$$
$$F_i = \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{3i} + \beta_4 x_{4i} + w_i,$$
$$w_i \sim N(0, 1/\tau_w),$$
$$u_i \sim N(0, 1/\tau_u).$$
Anchoring constraints are used to define the scale of the factor scores Fi and ui. So λ1 = 1
and κ2 = 1, with the latter setting corresponding to a belief that arson is relatively distinct
from the other variables in its pattern.
Inferences are based on a 15,000-iteration run with two chains using jagsUI. The posterior
means (sd’s) of the unknown λp coefficients (p = 2,3,4) are respectively −0.94 (0.95),
0.69 (0.38), 0.82 (0.54). These loadings tend to confirm F as a positive crime construct with
positive loadings on all crime variables except for arson. The posterior mean F scores
range from −0.96 to 1.22, with high F scores in prefectures with above average violent
crime (such as prefecture 13), or where high violent crime is combined with smuggling.
By contrast, low F scores occur in prefectures with little crime (prefecture 40), or in areas
where arson is unduly elevated (e.g. prefecture 48). The xi are relatively weak predictors
of Fi though the GDP coefficient (β3) has a mainly positive 95% interval (−0.03, 0.28).
The average scaled deviance of this model (278) indicates some residual over-disper-
sion. The estimated parameter total is lower at 129 (compared to 170 for model 1), though
the DIC is higher at 756 (compared to 730 for model 1). Model checks are, however, ade-
quate: 14 of the 196 observations have mixed predictive p values under 0.05 or over 0.95.
The fact that this particular data reduction method did not yield a better fit may be
taken to illustrate caveats to discrete data factor reduction, as also illustrated by Chib
and Winkelmann (2001). They undertake a Poisson regression analysis of six health use
outcomes, and conclude that “a flexible model with a full set of correlated latent effects
is needed to adequately describe the correlation structure [in the regression residuals].”
(Greenacre and Blasius, 2006). The indicators are Likert scales with five levels, with
wording as follows: y1, we believe too often in science, and not enough in feelings and
faith; y2, overall, modern science does more harm than good; y3, any change humans
cause in nature, no matter how scientific, is likely to make things worse; and y4, mod-
ern science will solve our environmental problems. Responses range from 1 = strongly
agree to 5 = strongly disagree. Except for the fourth question, agreement suggests a
negative attitude toward science, while disagreement (higher ordinal ranks) suggests
a positive attitude.
A logit regression for ordinal responses $y_{pi} \in (1, \ldots, K)$ ($p = 1, \ldots, P$; $k = 1, \ldots, K$), where $P = 4$ and $K = 5$, assumes underlying continuous variables $z_{pi}$ such that $y_{pi} = k$ when $\alpha_{p,k-1} \le z_{pi} < \alpha_{pk}$. Define binary indicators $d_{pik} = 1$ if $y_{pi} = k$, and $d_{pik} = 0$ otherwise, and denote $d_{pi} = (d_{pi1}, \ldots, d_{piK_p})$. So with $\pi_{pi} = (\pi_{p1i}, \ldots, \pi_{pKi})$, and assuming Q latent factors, the sampling model is $d_{pi} \sim \mathrm{Mult}(1, \pi_{pi})$, where
$$\pi_{pki} = \gamma_{pki} - \gamma_{p,k-1,i},$$
$$F_i \sim N(\beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{3i}, \sigma_F^2),$$
TABLE 9.4
Parameter Estimates. Scientific Attitudes
Parameter Mean St Devn 2.5% 97.5%
β1 −0.346 0.112 −0.566 −0.133
β2 −0.086 0.034 −0.153 −0.018
β3 0.219 0.043 0.128 0.305
α11 −2.561 0.146 −2.869 −2.295
α12 −0.183 0.108 −0.416 0.018
α13 1.157 0.119 0.924 1.387
α14 3.363 0.198 3 3.751
α21 −3.839 0.282 −4.457 −3.354
α22 −1.703 0.18 −2.087 −1.386
α23 −0.141 0.139 −0.432 0.117
α24 2.306 0.184 1.957 2.672
α31 −2.432 0.176 −2.779 −2.087
α32 −0.011 0.124 −0.259 0.223
α33 1.455 0.137 1.187 1.734
α34 3.549 0.224 3.138 4.024
α41 −2.617 0.129 −2.875 −2.375
α42 −0.689 0.072 −0.834 −0.547
α43 0.273 0.069 0.142 0.403
α44 1.568 0.09 1.406 1.757
λ2 1.45 0.199 1.141 1.882
λ3 1.24 0.141 0.984 1.536
λ4 −0.019 0.061 −0.135 0.103
τΦ = 1/σ²Φ 0.642 0.103 0.474 0.881
where $\theta_i \sim N(0, 1)$ are ability scores, and the items are all positive measures of ability. The $\lambda_p$ are assigned a hierarchical $LN(0, \sigma_\lambda^2)$ prior. The model is checked by assessing whether mixed predictive replicates $y_{\mathrm{new},pi}$ sampled from the model (Marshall and Spiegelhalter, 2007) are concordant with actual values $y_{pi}$, though one may also compare actual and predicted totals falling into particular item response patterns (Sahu, 2002).
The three-parameter logit model includes guessing parameters $\gamma_p$ (also called threshold parameters), whereby
The $\gamma_p = \mathrm{logit}^{-1}(\xi_p)$ are obtained as inverse logits of $\xi_p$, which are assigned a hierarchical normal prior.
As an example of IRT outputs, we obtain the item-specific information functions
and the test information function (or total information function), using the formulas
in Baker (2001). The test information function indicates where the set of items provides
most information about students with varying ability.
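A minimal sketch of the item and test information calculation for a two-parameter logistic model is given here. It plugs in the 2PL posterior means reported in Table 9.5 rather than averaging over posterior draws, and it assumes the parameterisation $\mathrm{logit}(\pi_{pi}) = \lambda_p(\theta_i - \alpha_p)$, with $\alpha_p$ as difficulty; that parameterisation, and the function names, are assumptions made for illustration.

```python
import math

# Posterior means for the 2PL model as reported in Table 9.5.
LAMBDA = [0.90, 0.76, 0.86, 0.76, 0.78]        # discrimination
ALPHA = [-3.26, -1.36, -0.30, -1.79, -2.83]    # difficulty

def item_information(theta, lam, alpha):
    """2PL item information lam^2 * p * (1 - p), ASSUMING logit(p) = lam * (theta - alpha)."""
    p = 1.0 / (1.0 + math.exp(-lam * (theta - alpha)))
    return lam ** 2 * p * (1.0 - p)

def total_information(theta):
    """Test (total) information: the sum of the item information functions."""
    return sum(item_information(theta, l, a) for l, a in zip(LAMBDA, ALPHA))

grid = [-4.0 + 0.1 * k for k in range(61)]     # ability values from -4 to 2
peak = max(grid, key=total_information)        # ability at which information is greatest
```

Each item contributes most information near its own difficulty, so a set of easy items concentrates the test information at below-average ability.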
The analysis is implemented using rstan, with the rstan analysis having a substantial
advantage in early convergence. Parameters from the 2PL and 3PL models are shown
in Table 9.5. Measures of global fit show little gain in adopting a 3PL model instead of
the 2PL, with the latter in fact having a lower LOO-IC (4910 vs 4914), but a higher WAIC
(4903 vs 4897).
In terms of difficulty, item 3 shows the maximum value (highest αp) under both
models. Posterior mean rates of predictive concordance for the five items are (0.87, 0.69,
TABLE 9.5
LSAT Data, Parameter Summary, 2PL vs 3PL IRT
Two Parameter Logistic
Mean St Devn 2.5% 50% 97.5%
λ1 0.90 0.20 0.54 0.89 1.33
λ2 0.76 0.16 0.46 0.76 1.07
λ3 0.86 0.18 0.54 0.85 1.24
λ4 0.76 0.17 0.46 0.76 1.10
λ5 0.78 0.17 0.44 0.78 1.11
α1 −3.26 0.64 −4.81 −3.14 −2.31
α2 −1.36 0.28 −2.03 −1.31 −0.95
α3 −0.30 0.11 −0.52 −0.29 −0.11
α4 −1.79 0.37 −2.71 −1.72 −1.26
α5 −2.83 0.64 −4.43 −2.69 −2.03
Three Parameter Logistic (Hierarchical Normal on Inv_Logit(gamma))
Mean St Devn 2.5% 50% 97.5%
λ1 1.15 0.56 0.65 1.02 2.33
λ2 1.15 0.99 0.67 1.01 2.15
λ3 1.09 0.37 0.68 1.01 1.93
λ4 1.06 0.34 0.65 1.00 1.76
λ5 1.09 0.48 0.65 1.01 1.93
α1 −0.68 0.84 −2.56 −0.49 0.46
α2 −0.12 0.41 −1.01 −0.10 0.69
α3 0.18 0.32 −0.28 0.10 0.94
α4 −0.26 0.47 −1.33 −0.20 0.58
α5 −0.51 0.69 −2.09 −0.34 0.54
γ1 0.74 0.15 0.29 0.80 0.88
γ2 0.37 0.12 0.08 0.39 0.56
γ3 0.16 0.10 0.01 0.14 0.38
γ4 0.45 0.13 0.11 0.48 0.63
γ5 0.64 0.15 0.21 0.69 0.8
Three Parameter Logistic (Hierarchical Beta prior on gamma)
Mean St Devn 2.5% 50% 97.5%
λ1 1.11 0.40 0.68 1.02 2.02
λ2 1.05 0.33 0.66 1.00 1.67
λ3 1.05 0.29 0.69 1.00 1.71
λ4 1.03 0.27 0.65 1.00 1.69
λ5 1.07 0.39 0.63 1.00 1.91
α1 −0.88 1.01 −3.02 −0.58 0.45
α2 −0.23 0.46 −1.19 −0.16 0.60
α3 0.11 0.29 −0.34 0.05 0.80
α4 −0.36 0.54 −1.52 −0.26 0.60
α5 −0.60 0.79 −2.39 −0.38 0.52
γ1 0.69 0.23 0.04 0.79 0.88
γ2 0.34 0.14 0.01 0.37 0.54
γ3 0.14 0.09 0.00 0.13 0.34
γ4 0.42 0.15 0.03 0.46 0.63
γ5 0.61 0.19 0.05 0.69 0.80
FIGURE 9.1
Test information plot (mean and 60% CRI). [Plot of test information (vertical axis, about 0.2 to 0.8) against ability (horizontal axis, −4 to 2).]
0.57, 0.74, 0.82) under the 2PL model, suggesting that the third item is the least well
explained by, and possibly less relevant to, the latent structure model. The test informa-
tion plot (Figure 9.1) for the 2PL model peaks at around −2, indicating that this set of
items best identifies learners with an ability less than average.
Estimates (and fit) for the 3PL may be sensitive to the prior adopted for γp. For example,
a hierarchical Beta(a1,b1) prior for the γp, with a1 and b1 assigned Exponential(1) priors,
provides differing estimates of the difficulty parameters, and a lower LOO-IC of 4910.
$$y_{pi} \sim \mathrm{Bern}(\pi_{pi}),$$
$$\theta_i \sim N(\delta X_i, 1),$$
where Xi consists of centred covariates (an intercept not being identifiable). The λp and αp
parameters are assigned hierarchical normal priors, with λp constrained to positive val-
ues. Table 9.6 shows the 2PL parameter estimates for this model (obtained using jagsUI),
with a significant effect δ of male gender in improving spelling ability. One feature to
note is the more informative nature (compared to Example 9.5) of the total information
function. Figure 9.2 shows that this provides a higher information level centred at aver-
age ability. The LOO-IC is 3033.
TABLE 9.6
Spelling Data, Parameter Estimates Compared
2PL Latent Regression
Mean Sd 2.50% 97.50%
α1 −1.66 0.26 −2.26 −1.24
α2 −0.54 0.11 −0.77 −0.35
α3 0.90 0.14 0.66 1.21
α4 −0.12 0.08 −0.28 0.03
λ1 0.97 0.18 0.64 1.36
λ2 1.26 0.24 0.84 1.78
λ3 1.22 0.23 0.83 1.73
λ4 1.47 0.32 0.98 2.26
Predictive Concordance Item 1 0.67 0.02 0.63 0.70
Predictive Concordance Item 2 0.53 0.02 0.49 0.57
Predictive Concordance Item 3 0.58 0.02 0.54 0.62
Predictive Concordance Item 4 0.51 0.02 0.47 0.54
δ 0.23 0.12 0.01 0.46
Differential Item Functioning
Mean Sd 2.50% 97.50%
α1,1 −1.50 0.29 −2.22 −1.08
α2,1 −1.39 0.32 −2.14 −0.91
α1,2 −0.53 0.15 −0.87 −0.27
α2,2 −0.55 0.15 −0.87 −0.29
α1,3 1.15 0.26 0.73 1.75
α2,3 0.68 0.14 0.44 1.00
α1,4 0.18 0.12 −0.03 0.42
α2,4 −0.52 0.15 −0.86 −0.26
λ1,1 1.35 0.35 0.75 2.16
λ2,1 0.99 0.25 0.57 1.54
λ1,2 1.18 0.31 0.67 1.87
λ2,2 1.45 0.37 0.85 2.30
λ1,3 0.94 0.24 0.56 1.47
λ2,3 1.88 0.58 1.05 3.29
λ1,4 1.31 0.39 0.74 2.31
λ2,4 1.39 0.39 0.80 2.30
Predictive Concordance Item 1 0.67 0.02 0.63 0.70
Predictive Concordance Item 2 0.53 0.02 0.49 0.57
Predictive Concordance Item 3 0.58 0.02 0.54 0.62
Predictive Concordance Item 4 0.52 0.02 0.48 0.55
Differential Item Functioning (Reduced Model)
Mean Sd 2.50% 97.50%
α1,1 −1.49 0.15 −1.80 −1.21
α2,1 −1.15 0.15 −1.45 −0.86
α1,2 −0.47 0.11 −0.69 −0.25
α2,2 −0.55 0.13 −0.81 −0.31
α1,3 0.90 0.12 0.66 1.15
α2,3 0.79 0.14 0.53 1.07
FIGURE 9.2
Test information function, mean, and 80% CRI. [Plot of information (vertical axis, about 0.5 to 1.5) against ability (horizontal axis, −3 to 3).]
Fit is improved (with LOO-IC reduced to 3015) using a DIF model with difficulty and
discrimination parameters varying by group. Thus
$$y_{pi} \sim \mathrm{Bern}(\pi_{pi}),$$
$$\theta_i \sim N(0, 1)$$
where gender gi is coded as (F = 1, M = 2), and where the λgp and αgp parameters are again
assigned hierarchical normal priors. There is some evidence favouring differential
functioning. For example, Table 9.6 shows higher difficulty for females on items 3 and 4,
and higher discrimination for males on items 2 and 3.
A simplified DIF approach is discussed by Magis et al. (2015) involving a single discrimination parameter (homogeneous across groups and items), and αgp parameters subject to (classical) Lasso penalisation. One possible Bayesian option is a Lasso shrinkage prior
$$\eta_{gp}^2 \sim \mathrm{Exponential}\left(\frac{r^2}{2}\right),$$
$$r \sim U(0.001, 1000),$$
$$1/\sigma_\alpha^2 \sim \mathrm{Exponential}(1),$$
TABLE 9.7
Multivariate Probit, Spelling Data, Posterior Summary
Mean St devn 2.5% 5% 95% 97.5%
β11 0.89 0.07 0.75 0.77 1.01 1.03
β12 0.29 0.06 0.17 0.19 0.40 0.42
β13 −0.54 0.07 −0.67 −0.65 −0.42 −0.41
β14 −0.10 0.07 −0.24 −0.21 0.00 0.02
β21 −0.20 0.11 −0.41 −0.38 −0.02 0.01
β22 0.07 0.10 −0.13 −0.10 0.22 0.26
β23 0.06 0.10 −0.15 −0.11 0.23 0.26
β24 0.42 0.10 0.24 0.27 0.59 0.62
R12 0.30 0.06 0.17 0.19 0.40 0.42
R13 0.28 0.07 0.14 0.16 0.39 0.41
R14 0.33 0.06 0.20 0.23 0.43 0.45
R23 0.37 0.06 0.25 0.27 0.47 0.48
R24 0.38 0.06 0.26 0.28 0.47 0.49
R34 0.36 0.06 0.24 0.26 0.46 0.48
To identify possible observation outliers, one may monitor the lowest weights ζpi. Assume
also standardised and uncorrelated factor scores, but following a Student t rather than
normal density. Then the corresponding heavy tailed construct score model for identify-
ing construct outliers is
Skewness in outcomes or factor scores may also be present. Following Azzalini (1985), let $f$ and $g$ be symmetric probability density functions, with $G$ being the cumulative distribution function associated with $g$. Then for location parameter $\mu$ and scale parameter $\sigma$, the density
$$\frac{2}{\sigma} f\left(\frac{x - \mu}{\sigma}\right) G\left(\kappa \frac{x - \mu}{\sigma}\right)$$
is a skew pdf for any $\kappa$. If $f = \phi$ and $G = \Phi$ (respectively the normal pdf and cdf), one obtains the skew-normal distribution. Positive (negative) values of κ indicate positive (negative) skewness, while κ = 0 provides the normal density. Bazan et al. (2006) consider the application of the skew-normal density in item analysis. For binary items $p = 1, \ldots, P$, and with
$$\delta_p = \frac{\kappa_p}{\sqrt{1 + \kappa_p^2}},$$
they define a skew probit IRT model involving a common factor $F_i$ and item-specific effects $V_{pi}$ to allow for skew errors. So
$$F_i \sim N(0, 1),$$
with sampling limits $\{A_{pi}, B_{pi}\}$ defined according to the observed binary responses. This parameterisation necessitates priors for $\delta_p$ in the interval [−1,1].
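The skew-normal density and the mapping from κ to δ can be sketched as follows, assuming the usual skew-normal relation $\delta = \kappa/\sqrt{1 + \kappa^2}$. The function names are illustrative, not taken from the original analysis.

```python
import math
from statistics import NormalDist

_STD = NormalDist()

def skew_normal_pdf(x, mu=0.0, sigma=1.0, kappa=0.0):
    """Azzalini form (2 / sigma) * phi(z) * Phi(kappa * z), with z = (x - mu) / sigma."""
    z = (x - mu) / sigma
    return (2.0 / sigma) * _STD.pdf(z) * _STD.cdf(kappa * z)

def delta_from_kappa(kappa):
    """Map a real-valued skewness parameter kappa to delta in (-1, 1)."""
    return kappa / math.sqrt(1.0 + kappa ** 2)

# Crude Riemann sum as a check that the density integrates to about 1 when kappa = 3.
step = 0.001
area = sum(skew_normal_pdf(-10.0 + step * k, kappa=3.0) * step for k in range(20001))
```

Since δ is a monotone transformation of κ confined to (−1, 1), a bounded prior on δ (such as a uniform) induces a proper prior on the unbounded κ.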
Example 9.7 Greek Crimes by Prefecture: Non-Parametric Prior for Random Effects
The analysis of the Greek crime data in Example 9.3 assumed normally distributed
errors upi in the log-link model for the crime rates ρpi. As noted by Knorr-Held and
Rasser (2000) a fully parametric specification of the random effects distribution may
result in oversmoothing, and mask local discontinuities, especially when the true dis-
tribution is characterised by a finite number of locations. Here a truncated Dirichlet
process prior (DPP) is adopted to model the density of the residuals upi, with potential
values $\{u^*_{pk}, p = 1, \ldots, P\}$ from $K$ clusters centred on the multivariate normal $G_0 = N_P(0, D)$, where $P = 4$. $D^{-1}$ has a Wishart prior with identity scale matrix and $P$ degrees of freedom. Thus the infinite DPP representation is approximated by one truncated at $K \le n$ components, with appropriate values $u_{pi}$ for prefecture $i$ chosen according to an allocation indicator $S_i \in (1, \ldots, K)$. The probabilities $\pi_k$ of allocation to clusters $\{1, \ldots, K\}$ are determined by $K - 1$ beta distributed random variables $V_k \sim \mathrm{Beta}(1, \kappa)$, with unknown concentration parameter $\kappa$, and $V_K = 1$ to ensure the random weights $\pi_k$ sum to 1 (Ishwaran and James, 2001; Sethuraman, 1994). Then $\pi_1 = V_1$ and
$$\pi_k = V_k \prod_{j=1}^{k-1} (1 - V_j), \qquad k = 2, \ldots, K.$$
Following Ishwaran and Zarepour (2000, p.377), the gamma prior for κ, namely $\kappa \sim \mathrm{Ga}(\nu_1, \nu_2)$, has relatively large ν1 and ν2, with ν2 set larger than ν1. Such a setting discourages small and large values for κ. Here ν2 = 4 and ν1 = 2. The maximum possible number of clusters is set at K = 20.
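The truncated stick-breaking construction of the cluster weights can be sketched in a few lines. This is an illustration under stated assumptions: the function name is hypothetical, and the second gamma parameter is treated as a rate (so the scale passed to `gammavariate` is its reciprocal), which is the JAGS convention but is not stated explicitly in the text.

```python
import random

def stick_breaking_weights(K, kappa, rng):
    """Truncated stick-breaking: V_k ~ Beta(1, kappa) for k < K, and V_K = 1."""
    v = [rng.betavariate(1.0, kappa) for _ in range(K - 1)] + [1.0]
    weights, remaining = [], 1.0
    for vk in v:
        weights.append(vk * remaining)   # pi_k = V_k * prod_{j<k} (1 - V_j)
        remaining *= 1.0 - vk            # length of stick left after k breaks
    return weights

rng = random.Random(2020)
kappa = rng.gammavariate(2.0, 1.0 / 4.0)   # Ga(nu1 = 2, nu2 = 4), nu2 ASSUMED to be a rate
weights = stick_breaking_weights(K=20, kappa=kappa, rng=rng)
```

Small values of κ put most of the mass on the first few clusters, while larger values spread the weights more evenly across the K components.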
Estimation using jagsUI shows early convergence and replicates Example 9.3 in show-
ing mostly non-significant predictor effects. There are, however, significant positive
effects of unemployment on rape, and of urban centre on manslaughter, and a signifi-
cant negative effect of GDP levels on manslaughter. The posterior mean for κ is 1.26,
with the average number of non-empty clusters K* being 8.26.
Extreme residuals, and departures from normality, are associated with poorly fit-
ted cases (with extreme response values and high pointwise LOO-IC). For example,
Figure 9.3 plots out positively skewed mean residuals for u3i (manslaughter), with the
most extreme positive residual for the elevated observation y3,14.
with Fi ~ N(0, 1) and the λp all being unknowns. A U(−1,1) prior is adopted on the δp
parameters, with the prior on discrimination parameters
FIGURE 9.3
Histogram of residuals, manslaughter. [Vertical axis: frequency.]
A two-chain run of 10,000 iterations (with convergence at under 1,000) shows none
of the δp (and hence κp) parameters to be significantly positive or negative. Despite the
apparent absence of skew, this model has a lower LOO-IC than the symmetric probit,
297 as against 322. Posterior mean percentages of predictive concordance for the six
items are also higher under the skew probit, namely (57.5,59.1,65.9,66.5,70.5,66.3).
Table 9.8 shows the estimated loadings obtained under maximum likelihood (via
the lavaan package), using Bayesian analysis with MVN factors (via jagsUI), and using
Bayesian estimation with MVT factors, obtained using scale mixing (via the rube pack-
age). The prior for the unknown degrees of freedom ν follows Juárez and Steel (2010).
The first factor from the maximum likelihood analysis is reverse signed, but otherwise
the loadings are similar between the alternative estimation methods. The respective
LOO-IC values for the MVN and MVT factor models are 1577 and 1557. All estimated
loadings show shrinkage from the generating loadings.
The MVT analysis provides a posterior mean (95% CRI) of 12.4 (4.0,34.7) for ν, as com-
pared to the generating value ν = 5. The posterior median for ν of 10.1 is a better estima-
tor of the generating value. Around 10% of the observations have scale adjustments
under 0.8, and two observations (103, 152) have scale adjustments with 95% credible
intervals entirely below 1.
$$p(y_{pi} | s_{pi}, x_i) \propto \exp\left\{\frac{y_{pi}\theta_{pi} - b(\theta_{pi})}{\phi_{pi}} + c(y_{pi}, \phi_{pi})\right\}$$
where $\theta_{pi}$ is the canonical parameter, and $\phi_{pi}$ a known scale. Denoting regression terms as $\eta_{pi} = g(\theta_{pi})$ with link $g$, the $s_{pi}$ are included to measure spatially configured but unmeasured predictors. So, one has at a minimum the representation
$$\eta_{pi} = \alpha_p + \beta_p x_i + s_{pi},$$
where the spatial effects for area $i$, $s_i = (s_{1i}, \ldots, s_{Pi})'$, follow a multivariate spatial prior. For certain definitions of spatial effects, it may be appropriate to also include unstructured (i.e. exchangeable over areas) multivariate effects, in line with a multivariate form of the Besag et al. (1991) convolution prior. Thus, the full dimension analogue to the convolution prior is
$$\eta_{pi} = \alpha_p + \beta_p x_i + s_{pi} + u_{pi}, \qquad (9.6)$$
where the upi also follow a multivariate prior. Other possibilities, following Chapter 6,
include regression effects βpi that vary spatially as well as over response variables.
TABLE 9.8
EFA of Simulated Data, Estimated Loadings
Columns: maximum likelihood (ML) loadings on Factor 1 and Factor 2; Bayesian estimation with MVN factors (mean and 95% CRI for Factor 1, then Factor 2); Bayesian estimation with MVT factors (mean and 95% CRI for Factor 1, then Factor 2).
Indicator ML-F1 ML-F2 MVN-F1 2.5% 97.5% MVN-F2 2.5% 97.5% MVT-F1 2.5% 97.5% MVT-F2 2.5% 97.5%
y1 −0.71 0.00 0.69 0.50 0.88 0.01 −0.19 0.25 0.61 0.44 0.79 0.00 −0.18 0.17
y2 0.30 0.78 −0.11 −0.96 0.90 0.84 0.48 1.41 −0.13 −0.79 0.66 0.71 0.43 1.23
y3 −0.75 0.00 0.75 0.46 1.06 0.08 −0.25 0.49 0.65 0.44 0.91 0.05 −0.21 0.35
y4 0.68 0.19 −0.63 −0.88 −0.28 0.14 −0.20 0.48 −0.56 −0.78 −0.28 0.14 −0.10 0.41
y5 0.45 0.00 0.15 −0.42 0.92 0.52 0.28 1.06 0.12 −0.37 0.72 0.46 0.26 0.85
y6 0.59 0.43 −0.47 −0.94 0.16 0.45 0.10 0.90 −0.44 −0.84 0.09 0.40 0.15 0.75
$$D = \mathrm{Diag}(d_1, \ldots, d_n),$$
where $d_i = \sum_{j \ne i} w_{ij}$. If $w_{ij} = 1$ when areas i and j are contiguous, and zero otherwise (binary adjacency), then $d_i$ is the number of neighbours for area i. The neighbourhood for area i is often denoted $\partial i$, and if area j is a neighbour of area i, then the neighbour relation (under binary interaction) is denoted $j \sim i$.
The joint density for a normal MGMRF for P spatial effects and with nP × nP precision matrix Q may be expressed
$$p(s|Q) = \left(\frac{1}{2\pi}\right)^{nP/2} |Q|^{0.5} \exp\left[-\frac{1}{2}(s - \mu)' Q (s - \mu)\right]$$
$$= \left(\frac{1}{2\pi}\right)^{nP/2} |Q|^{0.5} \exp\left[-\frac{1}{2}\sum_{ij} (s_i - \mu_i)' Q_{ij} (s_j - \mu_j)\right].$$
Q is block diagonal with P × P sub-matrix elements $Q_{ij}$ that are non-zero (zero) if area j is (is not) a neighbour of area i. Retaining the possibility of a regression model in the means $\mu_i$ (Rue and Held, 2005), the corresponding full conditional density is
$$s_i | s_{[i]} \sim N\left(\mu_i - Q_{ii}^{-1} \sum_{j \ne i} Q_{ij}(s_j - \mu_j),\; Q_{ii}^{-1}\right),$$
with conditional precision matrices
Equivalently define P × P matrices $B_{ij} = -Q_{ii}^{-1} Q_{ij}$, with $B_{ii} = 0$, and $\Delta_i = Q_{ii}$. Then
$$\mathrm{Prec}(s_i | s_{[i]}) = \Delta_i.$$
$$\Delta_i B_{ij} = \Delta_j B_{ji}.$$
For example, setting $B_{ij} = [w_{ij}/d_i] I_{P \times P}$, and $\Delta_i = d_i \zeta$ (where ζ is a P × P precision matrix) will ensure a valid joint density.
A number of multivariate priors which incorporate spatial dependence between areas
have been proposed. The generalisation of the intrinsic univariate CAR to a multivari-
ate setting is denoted as the multivariate CAR or MCAR prior (Mardia, 1988; Jin et al.,
2005, equation 6; Song et al., 2006, p.254). This takes the vector of multivariate area effects $s = (s_{11}, s_{12}, \ldots, s_{1n}; \ldots; s_{P1}, s_{P2}, \ldots, s_{Pn})$ as multivariate normal with mean consisting of a vector of zeros of length nP, and with nP × nP precision matrix, $Q = (D - \alpha W) \otimes \zeta$, namely
$$p(s | \zeta, \alpha) = \left(\frac{1}{2\pi}\right)^{nP/2} |D - \alpha W|^{P/2} |\zeta|^{n/2} \exp\left[-\frac{1}{2} s' Q s\right], \qquad (9.8)$$
where $\alpha \in (0, 1)$ is a propriety parameter. The P × P positive definite symmetric matrix $\zeta^{-1}$ describes covariation between the outcomes, and $D - \alpha W$ is the precision matrix for the spatial effects. The latter matrix can also be written as $D(I - \alpha B)$ where $B = D^{-1}W$. Let the effects be arranged by variable rather than subject, so that $S_1 = (s_{11}, s_{12}, \ldots, s_{1n})'$, $S_2 = (s_{21}, s_{22}, \ldots, s_{2n})'$, etc., then for P = 2, the joint prior is
$$\begin{pmatrix} S_1 \\ S_2 \end{pmatrix} \sim N\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{bmatrix} \zeta_{11}(D - \alpha W) & \zeta_{12}(D - \alpha W) \\ \zeta_{12}(D - \alpha W) & \zeta_{22}(D - \alpha W) \end{bmatrix}^{-1} \right),$$
$$E(s_{pi} | s_{[i]}) = M_{pi} = \alpha \sum_{j \ne i} w_{ij} s_{pj} \Big/ \sum_{j \ne i} w_{ij}$$
If the $w_{ij}$ are set to 1 for neighbouring areas and to 0 otherwise, then the $M_{pi} = \alpha \sum_{j \sim i} s_{pj}/d_i$ are locality averages (times α) of the spatial effect for the pth response. Setting α = 1 provides the multivariate version of the intrinsic CAR prior of Besag et al. (1991); such intrinsic GMRFs (for spatial and non-spatial priors) are considered by Rue and Held (2005, Chapter 3).
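As a small numerical check on the MCAR structure, the sketch below builds the Kronecker-form precision matrix for a three-area map with two outcomes and confirms that the full conditional mean implied by the precision matrix equals α times the neighbourhood average. Everything here is illustrative: the adjacency matrix, α, ζ and the effect values are hypothetical, and effects are stacked area by area, which is the ordering that matches $(D - \alpha W) \otimes \zeta$.

```python
def kron(A, B):
    """Kronecker product of two matrices stored as lists of rows."""
    return [[a * b for a in row_a for b in row_b] for row_a in A for row_b in B]

def solve2(A, b):
    """Solve a 2 x 2 linear system A x = b by Cramer's rule."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(A[1][1] * b[0] - A[0][1] * b[1]) / det,
            (A[0][0] * b[1] - A[1][0] * b[0]) / det]

W = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]   # hypothetical map: areas 1-2-3 in a line
alpha = 0.9                             # propriety parameter
zeta = [[2.0, 0.5], [0.5, 1.0]]         # P x P within-area precision matrix
n, P = len(W), len(zeta)
d = [sum(row) for row in W]             # numbers of neighbours
spatial = [[(d[i] if i == j else 0) - alpha * W[i][j] for j in range(n)] for i in range(n)]
Q = kron(spatial, zeta)                 # nP x nP precision matrix

def conditional_mean(Q, s, i):
    """-Q_ii^{-1} sum_{j != i} Q_ij s_j for the P = 2 effects of area i (zero prior mean)."""
    block = range(i * P, (i + 1) * P)
    Qii = [[Q[r][c] for c in block] for r in block]
    rhs = [-sum(Q[r][c] * s[c] for c in range(n * P) if c not in block) for r in block]
    return solve2(Qii, rhs)

s = [1.0, -2.0, 0.5, 0.3, 3.0, 4.0]     # effects for areas 1, 2, 3 (two outcomes each)
m = conditional_mean(Q, s, i=1)         # middle area
direct = [alpha * sum(W[1][j] * s[j * P + p] for j in range(n)) / d[1] for p in range(P)]
```

The within-area matrix ζ cancels from the conditional mean, so each outcome is smoothed toward the average of the same outcome in neighbouring areas, while ζ governs only the conditional precision.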
MacNab (2007) discusses a multivariate extension of the Leroux et al. (1999) prior, which allows the data to determine the appropriate mix between spatial or exchangeable dependence. This may be achieved with a single set of random effects $r_{pi}$ rather than the two sets $\{s_{pi}, u_{pi}\}$ present in the multivariate extension (9.6) of the convolution prior. Thus with $r_i = (r_{1i}, \ldots, r_{Pi})'$, parameter $\kappa \in (0, 1)$, and spatial interactions $W = [w_{ij}]$,
$$E(r_i | r_{[i]}) = [M_{1i}, \ldots, M_{Pi}] = \kappa \sum_{j \ne i} w_{ij} I_P r_j \Big/ \Big[1 - \kappa + \kappa \sum_{j \ne i} w_{ij}\Big],$$
$$\mathrm{Prec}(r_i | r_{[i]}) = \Big[1 - \kappa + \kappa \sum_{j \ne i} w_{ij}\Big] \zeta,$$
where, as above, ζ is the within-area precision matrix of dimension P × P. Thus
$$B_{ij} = \frac{\kappa w_{ij}}{1 - \kappa + \kappa \sum_{k \ne i} w_{ik}} I_{P \times P},$$
$$\Delta_i = \Big[1 - \kappa + \kappa \sum_{j \ne i} w_{ij}\Big] \zeta,$$
and $\Delta_i B_{ij} = \Delta_j B_{ji}$ holds. When the $w_{ij}$ are binary adjacency indicators, with $d_i$ the number of neighbours of area i, the conditional expectations become
$$E(r_{pi} | r_{[i]}) = M_{pi} = \frac{\kappa \sum_{j \sim i} r_{pj}}{1 - \kappa + \kappa d_i}.$$
Define
$$H = \mathrm{diag}\Big(1 - \kappa + \kappa \sum_{j \ne 1} w_{1j}, \ldots, 1 - \kappa + \kappa \sum_{j \ne n} w_{nj}\Big) = (1 - \kappa)I_n + \kappa D.$$
Then the joint density is multivariate normal with mean vector 0 and nP × nP precision matrix $(H - \kappa W) \otimes \zeta$.
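The way the Leroux-type prior interpolates between exchangeable and intrinsic spatial dependence can be checked with a short sketch of the spatial part of the precision matrix, $H - \kappa W$. The function names and the three-area adjacency matrix are illustrative assumptions; W is taken to have a zero diagonal.

```python
def leroux_spatial_precision(W, kappa):
    """H - kappa * W, with H = (1 - kappa) * I + kappa * D and D = diag(row sums of W)."""
    n = len(W)
    d = [sum(row) for row in W]
    return [[(1.0 - kappa + kappa * d[i]) if i == j else -kappa * W[i][j]
             for j in range(n)] for i in range(n)]

def leroux_conditional_mean(W, r, i, kappa):
    """kappa * (sum of neighbouring effects) / (1 - kappa + kappa * d_i) for one outcome."""
    d_i = sum(W[i])
    total = sum(W[i][j] * r[j] for j in range(len(r)) if j != i)
    return kappa * total / (1.0 - kappa + kappa * d_i)

W = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]          # hypothetical binary adjacency
exchangeable = leroux_spatial_precision(W, 0.0) # identity matrix: no spatial structure
intrinsic = leroux_spatial_precision(W, 1.0)    # D - W: intrinsic CAR structure
mean_mid = leroux_conditional_mean(W, [1.0, 5.0, 3.0], i=1, kappa=0.5)
```

At κ = 0 the conditional mean is zero regardless of the neighbours, and as κ approaches 1 it approaches the plain neighbourhood average.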
Jin et al. (2005) propose a generalised MCAR (GMCAR) model whereby the joint distribution for a multivariate spatial effect is obtained by specifying a sequence of conditional and marginal models. Let effects be arranged by variable rather than subject. Then for a bivariate spatial effect with $P(S_1, S_2) = P(S_1 | S_2)P(S_2)$, where $S_1 = (s_{11}, s_{12}, \ldots, s_{1n})'$ and $S_2 = (s_{21}, s_{22}, \ldots, s_{2n})'$, one has
$$\begin{pmatrix} S_1 \\ S_2 \end{pmatrix} \sim N\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{12}' & \Sigma_{22} \end{bmatrix} \right),$$
where $E(S_1 | S_2) = \Sigma_{12}\Sigma_{22}^{-1}S_2$, and $\mathrm{var}(S_1 | S_2) = \Sigma_{11.2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{12}'$. Hence with $G = \Sigma_{12}\Sigma_{22}^{-1}$, one has equivalently
$$\begin{pmatrix} S_1 \\ S_2 \end{pmatrix} \sim N\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{bmatrix} \Sigma_{11.2} + G\Sigma_{22}G' & G\Sigma_{22} \\ (G\Sigma_{22})' & \Sigma_{22} \end{bmatrix} \right).$$
To specify the joint distribution of $S_1$ and $S_2$, it is therefore necessary to specify the matrices $\Sigma_{11.2}$, $\Sigma_{22}$, and $G$.
Taking $\Sigma_{11.2}^{-1} = \tau_1[D - \alpha_1 W]$, $\Sigma_{22}^{-1} = \tau_2[D - \alpha_2 W]$ and $G = \gamma_0 I + \gamma_1 W$, the marginal joint prior for the second set of effects is then
As above the $0 < \alpha_p < 1$ are propriety parameters, and the $\gamma_0$ parameter links different variable-same area effects, namely regresses $s_{1i}$ on $s_{2i}$, while $\gamma_1$ links $s_{1i}$ with other variable-other area effects $\{s_{2j}, j \ne i\}$. This approach is possibly more suitable for small P, as P! conditional density sequences are possible, and may give different inferences or fits, though Jin et al. (2005, p.957) demonstrate how initial regression analysis may lead one to prefer one sequence to another.
The linear co-regionalisation model of Jin et al. (2007) avoids dependence on any particular ordering. Assuming binary adjacency, the most general option in Jin et al. (2007), namely Case 3 (dependent and non-identical latent processes), specifies a conditional mean
$$E(s_{pi} | s_{p,k \ne i}, s_{q \ne p,i}, s_{q \ne p,k \ne i}) = \alpha_{pp} \sum_{k \sim i} s_{pk}/d_i + \sum_{q \ne p} \left( \alpha_{pq} \sum_{k \sim i} s_{qk}/d_i \right),$$
where $\alpha_{pp}$ is the spatial autocorrelation measure for the pth outcome, and $\alpha_{pq}$ is a cross-spatial correlation between $S_p$ and $S_q$. The joint distribution (Martinez-Beneito, 2013, p.4) may be represented
$$s \sim N_{nP}\left\{ 0, \begin{pmatrix} (D - \alpha_{11}W)\zeta_{11} & \cdots & (D - \alpha_{1P}W)\zeta_{1P} \\ \vdots & \ddots & \vdots \\ (D - \alpha_{1P}W)\zeta_{1P} & \cdots & (D - \alpha_{PP}W)\zeta_{PP} \end{pmatrix} \right\}$$
with $\vartheta = \zeta_{pq}$ denoting the within area between disease precision matrix.
Martinez-Beneito (2013) represents the joint prior for the spatial error s of length nP in the generic form $s \sim N_{nP}(0, \Sigma_b \otimes \Sigma_w)$ where $\Sigma_b$ and $\Sigma_w$ represent between and within disease covariance matrices. Denoting $\tilde{\Sigma}_b$ and $\tilde{\Sigma}_w$ as lower triangular matrices such that $\Sigma_b = \tilde{\Sigma}_b \tilde{\Sigma}_b^T$ and $\Sigma_w = \tilde{\Sigma}_w \tilde{\Sigma}_w^T$, one has that $s = \tilde{\Sigma}_w \varepsilon \tilde{\Sigma}_b^T$ with ε of dimension n × P, consisting of independent N(0,1) variates. Representing $\phi = \tilde{\Sigma}_w \varepsilon$ as a matrix with P columns containing a set of particular spatial distributions (e.g. P independent ICAR densities), then interdependence is induced via the product form $s = \mathrm{vec}(\phi \tilde{\Sigma}_b^T)$, which has covariance $\Sigma_b \otimes (D - W)^{-1}$. If ϕ consists of independent ICAR($\alpha_p$) densities, then Case 2 of Jin et al. (2007) is obtained. More flexibility is obtained by representing $\Sigma_b$ as $\Sigma_b = \tilde{\Sigma}_b C C^T \tilde{\Sigma}_b^T$ where C is any square orthogonal matrix, which enables reproduction of Case 3 of Jin et al. (2007).
$$\eta_{pi} = \alpha_p + \beta_p x_i + \Lambda_p F_i,$$
where the vector $\Lambda_p$ is of dimension Q, and the factor score variables $F_i = (F_{1i}, \ldots, F_{Qi})'$ are spatially dependent over areas i, as well as mutually intercorrelated. For example, a MCAR prior would specify the joint pairwise difference density for the factor scores
$$p(F | \Sigma_F) \propto |\Sigma_F|^{-n/2} \exp\left[-0.5 \sum_{i,j} w_{ij}(F_i - F_j)' \Sigma_F^{-1} (F_i - F_j)\right].$$
As in other factor models, constraints are required to deal with label switching and loca-
tion, scale, and rotational indeterminacy. Constraining one or more loadings to be positive
is one strategy for avoiding label switching (Mar-Dell’Olmo et al., 2011). In the multivariate
CAR model for $F_i = (F_{1i}, \ldots, F_{Qi})'$, the location is fixed in practice by centring each of the Q
sets of spatial factor scores at each MCMC iteration. Scale may be determined by fixing the
Q variances of the $F_{qi}$ scores at 1, or by fixing one of the loadings $(\lambda_{1q}, \ldots, \lambda_{Pq})$ linking the P
manifest indicators to the qth factor. Additional loadings would need to be fixed to avoid
rotational indeterminacy, typically λpq = 0 for q > p. For example, if Q = 2, and the variances
of the F scores are free parameters, then the two loadings λqq may be set to 1 to define the
scale, while rotational invariance is avoided by setting λ12 = 0.
$$y_{pi} \sim \mathrm{Po}(E_{pi} e^{\eta_{pi}}),$$
$$\eta_{pi} = \alpha_p + s_{pi}.$$
$$\eta_{pi} = \alpha_p + \lambda_p F_i + u_{pi}$$
where $F_i$ follows a univariate Leroux et al. (1999) prior. Thus for $\kappa \in (0, 1)$, precision parameter $\tau_F$, and with $F_{[i]} = (F_1, \ldots, F_{i-1}, F_{i+1}, \ldots, F_n)$, and binary spatial interactions, the conditional mean and precision for ward i are
$$E(F_i | F_{[i]}) = \frac{\kappa \sum_{j \sim i} F_j}{1 - \kappa + \kappa d_i},$$
and
$$\mathrm{Prec}(F_i | F_{[i]}) = [1 - \kappa + \kappa d_i]\tau_F.$$
With $\tau_F = \sigma_F^{-2}$ taken as an unknown, with prior $\sigma_F \sim U(0, 1000)$, one of the loadings $\lambda_j$ must be fixed for identification, and accordingly $\lambda_1 = 1$. This model has a LOO-IC of 1823, with κ estimated as 0.89.
$$y_t = \mu + \Phi_1 y_{t-1} + \ldots + \Phi_R y_{t-R} + u_t - \Theta_1 u_{t-1} - \ldots - \Theta_S u_{t-S},$$
where the coefficient matrices are all of order P × P, and $u_t$ denotes P-variate white noise, with $E(u_t) = 0$, and
$$E(u_t u_{t-k}') = 0, \qquad k \ne 0;$$
$$E(u_t u_{t-k}') = \Sigma, \qquad k = 0.$$
For the vector autoregressive or VAR model obtained on omitting moving average terms, stationarity requires that the roots of the characteristic equation
$$\det(I - \Phi_1 z - \ldots - \Phi_R z^R) = 0$$
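For a first-order VAR with two series, the stationarity condition can be checked directly, as in the sketch below. It relies on the standard equivalence that the roots of $\det(I - \Phi_1 z) = 0$ lie outside the unit circle exactly when all eigenvalues of $\Phi_1$ lie inside it. The function name and coefficient values are illustrative, and the closed-form eigenvalues apply to the 2 × 2 case only.

```python
import cmath

def var1_is_stationary(Phi):
    """VAR(1) check for a 2 x 2 coefficient matrix: all eigenvalues inside the unit circle."""
    tr = Phi[0][0] + Phi[1][1]
    det = Phi[0][0] * Phi[1][1] - Phi[0][1] * Phi[1][0]
    disc = cmath.sqrt(tr * tr - 4.0 * det)     # may be complex
    eigenvalues = [(tr + disc) / 2.0, (tr - disc) / 2.0]
    return all(abs(ev) < 1.0 for ev in eigenvalues)

stable = var1_is_stationary([[0.5, 0.1], [0.2, 0.3]])      # hypothetical stationary system
unit_root = var1_is_stationary([[1.0, 0.0], [0.0, 0.5]])   # one series is a random walk
```

A unit eigenvalue, as in the second case, corresponds to an integrated series of the kind that classical approaches would difference before modelling.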
* Classical approaches using autoregressive moving average models may rest on assumptions of stationarity,
following transformation or differencing: a time series is integrated of order d, or I(d), when differencing to
order d is needed for stationarity. Such series are cointegrated if some linear combination of the series has a
lower order of integration than the individual series (Phillips and Durlauf, 1986), for example, when two series
yt and xt are both I(1), but there is a parameter α such that ut = yt − αxt is stationary (integrated of order zero).
Koopman, 2012) and focuses on underlying components of multiple series without requir-
ing initial differencing. The multivariate normal dynamic linear model specifies
yt = Ftqt + et , et ~ N (0, Vt ), t = 1, … , T
where yt is a P × 1 observation vector, and θt is a Q × 1 latent state vector following a Markov
process. The disturbance vectors et and ut are assumed normally distributed, and uncor-
related with each other and over time. The initialising prior for the state vector is typically
assumed to be a normal fixed effect with mean m1 and covariance matrix C1, θ1 ~ N(m1, C1).
The system matrices Ft , Gt , Vt , Wt and Ht may be assumed to be known, in which case sim-
ple updating, forecasting and filtering densities can be derived – see West and Harrison
(1997, p.582). In more realistic settings where the covariances Vt and Wt are unknown, time-
invariant assumptions such as Vt = Σe and Wt = Σu are one possible parameterisation. A sim-
ple case occurs (Koopman and Durbin, 2000) when Vt is diagonal, the assumption being
that the observations are independent conditional on the latent states.
Common model forms include the local level (LL) model with measurement and transi-
tion equations
$$y_t = \theta_t + e_t, \quad e_t \sim N(0, \Sigma_e), \quad t = 1, \dots, T$$
$$\theta_{t+1} = \theta_t + u_t, \quad u_t \sim N(0, \Sigma_u),$$
where yt is a P × 1 metric observation, θt also has dimension P, and Σe and Σu are of dimen-
sion P × P. A local linear trend (LLT) includes a trend in the underlying level, as in
$$y_t = \theta_t + e_t, \quad e_t \sim N(0, \Sigma_e), \quad t = 1, \dots, T$$
$$\theta_{t+1} = \theta_t + \delta_t + u_t, \quad u_t \sim N(0, \Sigma_u),$$
$$\delta_{t+1} = \delta_t + w_t, \quad w_t \sim N(0, \Sigma_w).$$
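To make the local linear trend recursions concrete, the following base R sketch simulates a bivariate series from them. All numerical settings (P = 2, the covariance matrices, the zero starting values) and the helper rmvn are assumptions made for illustration, not values from the text.

```r
# Simulate a bivariate local linear trend model (illustrative settings only).
set.seed(1)
P <- 2
T_len <- 100

# Draw one multivariate normal vector with covariance Sigma via its Cholesky factor
rmvn <- function(Sigma) drop(t(chol(Sigma)) %*% rnorm(nrow(Sigma)))

Sigma_e <- matrix(c(1, 0.5, 0.5, 1), 2, 2)    # measurement error covariance
Sigma_u <- 0.1 * Sigma_e                      # level disturbance covariance
Sigma_w <- 0.01 * Sigma_e                     # slope disturbance covariance

theta <- matrix(0, T_len, P)                  # underlying level
delta <- matrix(0, T_len, P)                  # underlying slope
y <- matrix(0, T_len, P)                      # observations

for (t in 1:T_len) {
  y[t, ] <- theta[t, ] + rmvn(Sigma_e)        # measurement equation
  if (t < T_len) {
    theta[t + 1, ] <- theta[t, ] + delta[t, ] + rmvn(Sigma_u)   # level transition
    delta[t + 1, ] <- delta[t, ] + rmvn(Sigma_w)                # slope transition
  }
}
```

Dropping the delta terms from the loop recovers the local level model.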
For example, Proietti (2007) applies a multivariate local level model to measuring core
inflation, while Moauro and Savio (2005) apply a LLT approach to temporal disaggrega-
tion of multiple economic series. Multivariate signal models may be applied to measure
latent risk, as in the accident rate and credit card use examples of Bijleveld et al. (2005).
This approach involves time series or panel data on exposure totals (xt or xit), outcomes (yt
or yit), and what may be generically termed “losses” (zt or zit). A simple bivariate case with
xt = vehicle registrations and yt = motor accidents would lead to a model
where the components of θt = (θt^(E), θt^(R)) represent underlying log exposure and log risk,
which evolve according to a bivariate local linear trend
$$\theta_{t+1} = \theta_t + \delta_t + u_t, \quad u_t \sim N(0, \Sigma_u),$$
$$\delta_{t+1} = \delta_t + w_t, \quad w_t \sim N(0, \Sigma_w).$$
A simplifying “homogenous” model (Harvey, 1989, Chapter 8) for the covariance matrices
is obtained for the LL model by setting
$$\Sigma_u = q \Sigma_e$$
where q is an unknown signal-to-noise ratio, and for the LLT model by setting
$$\Sigma_u = q_1 \Sigma_e,$$
$$\Sigma_w = q_2 \Sigma_e.$$
Generalisations to include trend, seasonal, and cyclical effects can be made in which each
sort of effect is independent of the other and each follows its own multivariate evolu-
tion prior (Durbin and Koopman, 2001, p.44). These assumptions lead to what is termed a
seemingly unrelated time series equations or SUTSE model (Harvey and Shephard, 1993;
Harvey and Koopman, 1997), since the individual series are connected only via the cor-
related disturbances in the measurement and transition equations. More complex matrix
normal priors (West and Harrison, 1997, p.597) result from assuming interdependence
between different types of parameter.
A model with level, seasonal, and cyclical effects for multivariate yt = (y1t, …, yPt) would
specify
$$y_t = \theta_t + \gamma_t + \psi_t + e_t, \quad t = 1, \dots, T$$
$$e_t \sim N(0, \Sigma_e),$$
$$\theta_{t+1} = \theta_t + u_t,$$
$$u_t \sim N(0, \Sigma_u),$$
where the seasonal components for the pth variable (with s seasons) evolve according to
$$\gamma_{pt} = -\gamma_{p,t-1} - \gamma_{p,t-2} - \dots - \gamma_{p,t-s+1} + \omega_{pt},$$
with
Following Harvey and Koopman (1997), the cyclical effects ψt may be assumed “similar,”
namely to have the same damping factor ρ and frequency 0 ≤ λ ≤ π across variables. The
period is then 2π/λ with the full prior being
where means mψp and mψp* for the pth variable are obtained according to
$$\begin{bmatrix} \psi_{pt} \\ \psi_{pt}^* \end{bmatrix} = \rho \begin{bmatrix} \cos(\lambda) & \sin(\lambda) \\ -\sin(\lambda) & \cos(\lambda) \end{bmatrix} \begin{bmatrix} \psi_{p,t-1} \\ \psi_{p,t-1}^* \end{bmatrix} + \begin{bmatrix} \eta_{pt} \\ \eta_{pt}^* \end{bmatrix}.$$
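The damped rotation that drives the cyclical component can be illustrated with a short base R sketch for a single variable. The damping factor, frequency, disturbance standard deviation, and starting values below are hypothetical choices for illustration.

```r
# Simulate one damped stochastic cycle (psi, psi*) for a single variable.
set.seed(2)
rho <- 0.9                      # damping factor (assumed value)
lambda <- 2 * pi / 10           # frequency giving a period of 10 time units
T_len <- 60

# rho times the rotation matrix [cos, sin; -sin, cos] (matrix() fills by column)
rot <- rho * matrix(c(cos(lambda), -sin(lambda),
                      sin(lambda),  cos(lambda)), 2, 2)

psi <- matrix(0, T_len, 2)      # column 1: psi_t, column 2: psi*_t
psi[1, ] <- c(1, 0)             # illustrative starting values
for (t in 2:T_len) {
  psi[t, ] <- rot %*% psi[t - 1, ] + rnorm(2, 0, 0.05)
}

period <- 2 * pi / lambda       # implied period of the cycle
```

The eigenvalues of rot have modulus rho, so the cycle dies out when rho < 1 unless it is renewed by the disturbances.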
It may be noted that multivariate DLMs occur in the analysis of univariate data, for
example for categorical and ordinal outcomes. Thus Cargnoni et al. (1997) propose a
model for time series of a multinomial outcome with M categories, and denominators
nt. One has
$$\eta_{mt} = \alpha_{mt} + \beta_m x_t, \quad m = 1, \dots, M-1$$
$$\eta_{Mt} = 0$$
where the time-varying category intercepts αt = (α1t, …, αM−1,t) follow a multivariate normal
random walk prior
$$\alpha_t \sim N_{M-1}(\alpha_{t-1}, \Sigma_\alpha).$$
$$y_t = \theta_t + \psi_t + e_t, \quad t = 1, \dots, T$$
$$e_t \sim N(0, \Sigma_e),$$
$$\theta_{t+1} = \theta_t + \beta_t + u_t,$$
$$u_t \sim N(0, \Sigma_u),$$
$$\beta_{t+1} = \beta_t + w_t,$$
$$w_t \sim N(0, \Sigma_w),$$
where the cyclical effects for the two species have the same damping factor ρ ~ U(0, 1)
and frequency λ, and the non-diagonal covariance matrices are of order P × P.
Since the series contains 64 points, an informative assumption is made that the period
is between 4.2 and 21, namely that λ ~ U(0.3, 1.5). Taking a simple uniform prior on λ
between 0 and π is associated with implausibly low λ. Covariances are linked using the
homogeneity assumption, namely Σu = quΣe, Σw = qwΣe, Ση = qηΣe, and Ση* = qη*Σe, with
the signal to noise ratios {qu, qw, qη, qη*} all assumed to follow Exponential(1) priors. For
Σe⁻¹, a Wishart prior assumes 5 degrees of freedom and a prior covariance matrix based
on the observed covariance.
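The correspondence between the prior range for the frequency and the stated range for the period follows from period = 2π/λ, and can be checked with a small base R calculation. The variable names and the number of prior draws are illustrative.

```r
# Map the U(0.3, 1.5) prior range for the frequency lambda to a range for the period.
lambda_bounds <- c(0.3, 1.5)
period_bounds <- rev(2 * pi / lambda_bounds)    # shortest period first

# Implied prior on the period, obtained by transforming draws of lambda
set.seed(3)
lambda_draws <- runif(10000, 0.3, 1.5)
period_draws <- 2 * pi / lambda_draws
quantile(period_draws, c(0.025, 0.5, 0.975))
```

Because the period is a reciprocal of λ, a uniform prior on λ is not uniform on the period: it places more prior mass on shorter cycles.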
Inferences are from the final 75,000 of a two-chain run of 100,000 iterations, using
R2OpenBUGS. One finds the cycles to have a mean period of 9.9 years, with 95% CRI
(9.3,10.9). Figures 9.4 and 9.5 show modelled trends in the mink and muskrat series
(theta.var[1,] and theta.var[2,] in the code) together with the original data. The posterior
means for qu, qw, qη, and qη* are (0.093, 0.004, 0.058, 0.16). The LOO-IC is 11.1, with
pointwise LOO-IC identifying the discordant observation in 1908, when muskrat sales were
unduly low.
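The interlinking of the two series (and its predator-prey nature) also shows in a
VAR(1) model with
$$y_{1t} = \gamma_1 + \alpha_{11} y_{1,t-1} + \alpha_{12} y_{2,t-1} + u_{1t},$$
$$y_{2t} = \gamma_2 + \alpha_{21} y_{1,t-1} + \alpha_{22} y_{2,t-1} + u_{2t},$$
$$u_t \sim N(0, \Sigma_u),$$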
FIGURE 9.4
Annual mink sales, 1848–1911 (logarithm). [Plot of annual sales (log) by year, showing the
data, the mean, and the 2.5% and 97.5% limits.]
FIGURE 9.5
Annual muskrat sales, 1848–1911 (logarithm). [Plot of annual sales (log) by year, showing the
data, the mean, and the 2.5% and 97.5% limits.]
with y11 and y21 taken as known, and where Σu is non-diagonal. The estimated α coefficient
matrix from a two-chain run of 25,000 iterations is
$$\begin{pmatrix} 0.61 & 0.21 \\ -0.49 & 0.91 \end{pmatrix}$$
where the negative α21 coefficient, with posterior mean (95% interval) of −0.49 (−0.69, −0.12),
shows muskrat numbers are lower when mink numbers are higher. Maximum likelihood
estimates from the vars package are similar, as are estimates using rstan code,
which uses the Cholesky parameterisation of the bivariate normal covariance matrix.
Residuals between the two series (u.corr in the code) are positively correlated, with
posterior mean 0.26, after accounting for the lag 1 effect of one series on the other. The
LOO-IC for this model is 41.
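As a rough check on the dynamics implied by the reported posterior means, the coefficient matrix can be inspected directly in base R. The sketch assumes the rows of the matrix correspond to the first and second series in that order; the object names, the omission of intercepts, and the independent innovations with standard deviation 0.3 are illustrative assumptions, not part of the book's code. A VAR(1) is stationary when all eigenvalues of its coefficient matrix have modulus below one.

```r
# Posterior mean coefficient matrix of the VAR(1), entered row by row
A <- matrix(c( 0.61, 0.21,
              -0.49, 0.91), nrow = 2, byrow = TRUE)

ev <- eigen(A)$values           # eigenvalues (a complex conjugate pair here)
mod <- Mod(ev)                  # their moduli
is_stationary <- all(mod < 1)

# Simulate a path from the fitted dynamics (intercepts omitted, illustrative noise)
set.seed(4)
T_len <- 64
y <- matrix(0, T_len, 2)
for (t in 2:T_len) {
  y[t, ] <- A %*% y[t - 1, ] + rnorm(2, 0, 0.3)
}
```

Complex eigenvalues indicate damped oscillation in the two series, which is in line with the cyclical predator-prey pattern described in the text.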
AR1 coefficients which follow random walk priors. Thus, first order random walk priors in
r = 1, …, R autoregressive parameters ϕrt lead to
$$y_{pt} = \alpha_p + \lambda_p F_t + e_{pt},$$
$$e_{pt} \sim N(0, \sigma_p^2),$$
$$F_t \sim N\left( \sum_{r=1}^{R} \phi_{rt} F_{t-r}, \sigma_F^2 \right),$$
$$\phi_{rt} \sim N(\phi_{r,t-1}, \sigma_\phi^2),$$
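$$y_t = \alpha + \Lambda F_t + e_t, \quad e_t \sim N(0, \Sigma_e), \quad t = 1, \dots, T$$
with the transition equation specifying a random walk in the factors, namely
$$F_{t+1} = F_t + u_t, \quad u_t \sim N(0, \Sigma_u),$$
$$\Lambda = \begin{pmatrix} I_Q \\ \Lambda^* \end{pmatrix}$$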
with Λ* of dimension (P − Q) × Q containing unknown loadings. If Σe is diagonal (only
residual variances assumed unknown), and Σu is also diagonal, but contains unknown fac-
tor variances, then one may set λpp = 1 and λpq = 0 for q > p. This is the anchoring constraint of
Skrondal and Rabe-Hesketh (2004), with the latter constraint used to avoid rotation invari-
ance (Geweke and Zhou, 1996, pp.565–566).
If Σe contains just residual variances, and Σu is diagonal with known factor variances
(typically of 1), then constraints on the λpq to ensure scale identification are not needed, but
the rotational constraint λpq = 0 for q > p still applies. However, Geweke and Zhou (1996)
suggest λpp > 0 as an identification device in this case, to ensure a unique labelling of factors.
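$$y_t = \alpha_t + \Lambda_t F_t + e_t,$$
$$e_t \sim N(0, \Sigma_t),$$
$$F_t = \Gamma F_{t-1} + w_t,$$
$$w_t \sim N(0, \Phi_t),$$

$$y_{pt} = \alpha_{pt} + \lambda_{pt} F_t + e_{pt},$$
$$e_{pt} \sim N(0, \sigma_p^2),$$
$$F_t = \gamma F_{t-1} + w_t,$$
$$w_t \sim N(0, \sigma_w^2),$$
$$\eta_{pt} \sim N(0, \xi_p).$$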
For multivariate factors, sparsity-inducing priors on the coefficients λpqt may be indicated,
with Zhou et al. (2014) proposing a threshold mechanism.
Simplifications to such a scheme are often the focus, involving decompositions of the
residual variance. Applications typically involve metric series yt = (y1t, y2t, …, yPt)′ either
mean centred, or in transformed form (e.g. logs of share prices compared between successive
time points), with effectively zero means. A latent factor will not necessarily be
involved. Thus, for centred or appropriately transformed prices or returns ypt, one possible
model (Asai et al., 2006) for a response yt = (y1t, y2t, …, yPt)′ is
$$y_t = H_t e_t,$$
where et is a vector of independent standard normal variates, and ht = (h1t, …, hPt)′ is a vector
of unobserved log variances (or volatilities), evolving according to stationary autoregressive
schemes,
$$u_{pt} \sim N(0, \tau_p),$$
$$\begin{pmatrix} e_t \\ u_t \end{pmatrix} \sim N\left[ \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} R_e & 0 \\ 0 & \Sigma_u \end{pmatrix} \right].$$
where Re is a positive definite correlation matrix with a diagonal of ones, and Σu is a P × P
covariance matrix for volatility shocks. Taking Re to be non-diagonal means shocks in
prices may be correlated, while taking Σu to be non-diagonal allows volatility shocks to be
correlated (Yu and Meyer, 2006, 365–366). Thus, in a bivariate example, taking
$$\phi = \begin{pmatrix} \phi_{11} & \phi_{12} \\ \phi_{21} & \phi_{22} \end{pmatrix}$$
$$R_{et} = \begin{pmatrix} 1 & \rho_t \\ \rho_t & 1 \end{pmatrix}$$
means that not only log volatilities ht, but also correlations between the observed series are
time-varying. Specifically, with
$$\rho_t = \frac{\exp(g_t) - 1}{\exp(g_t) + 1},$$
$$g_{t+1} = \mu_g + \phi_g (g_t - \mu_g) + v_t.$$
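The transformation from gt to ρt keeps the time-varying correlation inside (−1, 1) whatever value the autoregressive process takes. A minimal base R sketch follows; the function name g_to_rho and the parameter values (mean 1, autoregressive coefficient 0.95, innovation standard deviation 0.1) are assumed for illustration.

```r
# Map an unconstrained AR(1) process g_t to a correlation rho_t in (-1, 1).
g_to_rho <- function(g) (exp(g) - 1) / (exp(g) + 1)

set.seed(5)
T_len <- 200
mu_g <- 1          # long-run mean of g_t (assumed)
phi_g <- 0.95      # autoregressive coefficient (assumed)
sd_v <- 0.1        # innovation standard deviation (assumed)

g <- numeric(T_len)
g[1] <- mu_g
for (t in 1:(T_len - 1)) {
  g[t + 1] <- mu_g + phi_g * (g[t] - mu_g) + rnorm(1, 0, sd_v)
}
rho <- g_to_rho(g)   # time-varying correlation path
```

The mapping is algebraically the same as tanh(g/2), so gt = 0 corresponds to zero correlation between the series.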
Factor analytic models may also include correlated volatility (Pitt and Shephard, 1999;
Chan et al., 2006; Zhou et al., 2014). As an example, for two series {ypt, p = 1, 2} and a
univariate factor Ft, one might have
$$y_{1t} = \lambda_1 F_t + e_{1t}$$
$$y_{2t} = \lambda_2 F_t + e_{2t}$$
with evolving variances for Ft and the ept. The stochastic variance prior for the residuals
ept may include autoregressive dependence, since a factor structure may be sufficient to
account for the non-diagonal elements of the residual variance matrix of the outcomes, but
not sufficient to explain all the marginal persistence in volatility (Pitt and Shephard, 1999,
p.551). Thus, one might have
$$F_t \sim N(0, e^{h_{1t}}),$$
$$e_{1t} \sim N(0, e^{h_{2t}}),$$
$$e_{2t} \sim N(0, e^{h_{3t}}),$$
$$h_{pt} = \phi_p h_{p,t-1} + u_{pt}, \quad t = 2, \dots, T$$
with possibly unknown initial conditions hp1. For identification, one may set one or other
of the λp parameters to 1 (an anchoring constraint). Alternatively, a standardised factor
constraint might be implemented by setting the scale of the factors at one time point, for
instance by taking F1 ~ N (0, 1), that is h11 = 0.
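A factor model with stochastic variances of this kind can be simulated in base R as below. The sketch assumes three log variance series, with h1t for the factor and h2t, h3t for the two residuals (the indexing of the first residual is an assumption), and the loadings, autoregressive coefficients, and innovation scales are hypothetical values chosen for illustration.

```r
# Simulate two series driven by one factor, with AR(1) log variances
# for the factor and for each residual.
set.seed(6)
T_len <- 300
lambda <- c(1, 0.8)               # lambda_1 = 1 acts as an anchoring constraint
phi <- c(0.95, 0.9, 0.9)          # AR(1) coefficients for h_1t, h_2t, h_3t
sd_u <- c(0.2, 0.2, 0.2)          # innovation sds of the log variances

h <- matrix(0, T_len, 3)          # h[1, ] = 0, so the factor at t = 1 has variance 1
for (t in 2:T_len) {
  h[t, ] <- phi * h[t - 1, ] + rnorm(3, 0, sd_u)
}

Fac <- rnorm(T_len, 0, exp(h[, 1] / 2))   # factor scores, variance exp(h_1t)
e1 <- rnorm(T_len, 0, exp(h[, 2] / 2))    # residual for series 1
e2 <- rnorm(T_len, 0, exp(h[, 3] / 2))    # residual for series 2

y <- cbind(lambda[1] * Fac + e1,
           lambda[2] * Fac + e2)
```

The simulation fixes both a loading and the initial factor variance purely for convenience; in estimation only one of the two identification devices described above would be needed.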
Adaptivity in the modelling of stochastic variances can be combined with factor reduc-
tion. Chib et al. (2006) propose a multivariate stochastic volatility factor model that permits
both series-specific jumps at each time, and Student-t innovations with unknown degrees
of freedom. For bivariate data and a univariate factor, this model has the form
where qpt = 1 with probability πp, and the εpt follow independent Student t densities with
unknown degrees of freedom νp. In hierarchical form
$$\varepsilon_{pt} = e_{pt} / g_{pt}^{0.5},$$
$$g_{pt} \sim \mathrm{Ga}\left( \frac{\nu_p}{2}, \frac{\nu_p}{2} \right),$$
$$[e_{1t}, e_{2t}] \sim N_2(0, V_t).$$
$$h_{p,t+1} - \mu_p = \phi_p (h_{pt} - \mu_p) + \sigma_p u_{pt},$$
The variables zpt = log(1 + δpt) are assumed to be N(−0.5ξp², ξp²) where ξp are additional
unknowns. The more general form for yt = (y1t, …, yPt)′ and Ft = (F1t, …, FQt)′, Q ≤ P is
$$y_t = B F_t + \Delta_t q_t + \varepsilon_t$$
with identification constraints λpp = 1 and λpq = 0 for q > p. These constraints set a scale and
prevent rotation invariance. The covariance matrix for Ft is diagonal with evolution scheme
as for the log diagonal elements of Vt.
FIGURE 9.6
Flour prices in three cities. [Plot of log price by month, Aug/1972 to Nov/1980, for Buffalo,
Minneapolis, and Kansas City.]
are assumed to follow a random walk with F1 = 0 to identify the level of the scores. The
observation residuals after accounting for the common factor are assumed multivariate
normal, with a Wishart prior on precision matrix with 3 degrees of freedom and a diag-
onal scale matrix. The elements of the Wishart scale matrix are based on the observed
variances Vp of the three series, leading to a data-based prior.
Thus, with P = 3, and an anchoring constraint on the loadings,
$$y_{pt} = \alpha_p + \lambda_p F_t + u_{pt}$$
$$S = \mathrm{diag}(V_1, V_2, \dots, V_P),$$
$$\lambda_1 = 1;$$
$$F_t \sim N(F_{t-1}, \sigma_F^2), \quad t = 2, \dots, T,$$
$$F_1 = 0,$$
$$\sigma_F \sim U(0, 10).$$
An alternative model for the factor scores adopts a locally adaptive prior, allowing for
changing variance through time (Lang et al., 2002). Thus
$$F_t \sim N(F_{t-1}, \exp(h_t)), \quad t = 2, \dots, T,$$
$$\tau_h \sim \mathrm{Ga}(1, 0.001),$$
$$h_1 \sim N(0, 1),$$
$$F_1 = 0.$$
Following Migon and Moreira (2004), fit may be assessed using the predictive approach
of Gelfand and Ghosh (1998), based on a goodness of fit term
$$G = \sum_p \sum_t (y_{\mathrm{rep},pt} - y_{pt})^2$$
and a penalty term
$$H = \sum_p \sum_t \mathrm{var}(y_{\mathrm{rep},pt}).$$
The LOO-IC is also used.
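Given replicate data sampled at each MCMC iteration, the two predictive criteria can be computed as in the following base R sketch. The observed data and the replicates are simulated here purely for illustration, and the array layout y_rep[s, p, t] (iteration, series, time) is an assumption rather than the structure used in the book's code. G is evaluated at each iteration and then averaged, matching the posterior mean of G reported in the text.

```r
# Gelfand-Ghosh style fit (G) and penalty (H) terms from replicate draws.
set.seed(7)
P <- 3          # number of series
T_len <- 20     # number of time points
S <- 500        # number of posterior draws (hypothetical)

y <- matrix(rnorm(P * T_len), P, T_len)            # stand-in for observed data

# Stand-in replicates: observed value plus noise with standard deviation 0.5
y_rep <- array(rep(y, each = S) + rnorm(S * P * T_len, 0, 0.5),
               dim = c(S, P, T_len))

# G at each iteration: squared discrepancy summed over series and time
G_draws <- apply(y_rep, 1, function(r) sum((r - y)^2))
G <- mean(G_draws)                                 # posterior mean of G

# H: predictive variance of each y_rep[ , p, t], summed over series and time
H <- sum(apply(y_rep, c(2, 3), var))
```

With the simulated replicates both quantities are close to P × T × 0.25 = 15, since the only source of discrepancy is the added noise.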
Figure 9.7 shows the estimated factor scores through time under the constant variance
model. The posterior mean for σF² is 0.0018, with posterior mean for G of 1.417 and with
H = 1.108. The non-constant variance model has similar fit criteria, namely a posterior
mean for G of 1.397 with H = 1.098. The respective LOO-IC values are −1027 and −1028.
There seems little to choose between the models, though the plot of the evolving log
variances (Figure 9.8) suggests a reduction in volatility in the second half of the obser-
vation period.
FIGURE 9.7
[Plot by month, Aug/1972 to Aug/1980, showing the mean with 2.5% and 97.5% limits.]
FIGURE 9.8
[Plot by month, Aug/1972 to Aug/1980, showing the mean with 2.5% and 97.5% limits.]
$$y_t = H_t e_t,$$
with unobserved log volatilities ht = (h1t, h2t)′ evolving via a VAR(1) model
$$h_{t+1} = \mu + \mathrm{diag}(\phi_{11}, \phi_{22})(h_t - \mu) + u_t,$$
$$h_1 = \mu + u_1.$$
The errors in the price series and volatilities equations are multivariate normal
$$\begin{pmatrix} e_t \\ u_t \end{pmatrix} \sim N\left[ \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} R_e & 0 \\ 0 & \Sigma_u \end{pmatrix} \right].$$
FIGURE 9.9
US and GB share indices, October 30, 2006 to October 19, 2007. [Plot of the FTSE and S&P
indices relative to the index start point (30-10-2006), by date.]
$$h_{1t} = \sigma_{u1} h_{1t}^*,$$
This raises an identification issue for the correlation ρu, since ρu h*1t = (−ρu)(−h*1t), which
is resolved by assuming ρu to be U(0,1) rather than U(−1,1). The correlation ρe in Re is
taken as U(−1,1). The stationary autocorrelation parameters are obtained as ϕpp = 2ϕ*pp − 1,
where ϕ*pp ~ Beta(19, 1). The diagonal terms in Σu⁻¹ are taken to be Exponential(1).
Estimation using jagsUI provides early convergence, with posterior means for ρe and
ρu of 0.54 and 0.87, and with the autoregressive coefficients in the AR1 log volatility
equations having means 0.88 and 0.74. Figure 9.10 plots the resulting log volatility series
(posterior means of h1t and h2t) with the two periods of market turbulence apparent.
The pointwise LOO-ICs detect aberrant observations for t = 85 (comparing 27/02/2007
and 26/02/2007) when there was a sharp fall in the S&P index, and t = 206 (comparing
16/08/2007 and 15/08/2007) when there was a sharp fall in the FTSE100.
FIGURE 9.10
Log volatility plot. [Plot of the log volatility series h1 and h2.]
9.9 Computational Notes
1. The application of brms to path models can be illustrated using data from an analysis
for job satisfaction (Bryman and Cramer, 2005) and considered in Congdon (2001, p.98).
Thus, job survey data on age, income, job satisfaction, and job autonomy is available for 68
workers. A path model is proposed with y1 (age) influencing the three remaining variables,
namely y2 = autonomy, y3 = income, and y4 = satisfaction. Autonomy is postulated to influ-
ence both income and satisfaction, while all three variables – age, income, and autonomy
– affect satisfaction. All variables are standardised. Hence, the following regressions are
involved
$$y_2 = b_1 y_1 + e_2,$$
$$y_3 = b_2 y_1 + b_3 y_2 + e_3,$$
$$y_4 = b_4 y_1 + b_5 y_2 + b_6 y_3 + e_4.$$
We may be interested in calculating the total effect of age on satisfaction, which involves
the direct effect b4, a path from age to income to satisfaction, calculated as b2b6, a path
from age to autonomy to satisfaction, obtained as b1b5, and a path from age to autonomy to
income to satisfaction, obtained as b1b3b6. The following code encapsulates the anticipated
relationships and obtains the total effect:
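library(brms)
# job survey data: age, autonomy, income and satisfaction (all standardised)
D = read.table("DS_BRMS_SATIS_CH9.txt", header = TRUE)
# one regression for each endogenous variable in the path model
auton_mod = bf(auton ~ age)
income_mod = bf(income ~ age + auton)
satis_mod = bf(satis ~ age + auton + income)
# fit the three regressions jointly, without residual correlations
fit = brm(auton_mod + income_mod + satis_mod + set_rescor(FALSE),
          data = D, chains = 2)
# display the Stan code generated for the same model
make_stancode(auton_mod + income_mod + satis_mod + set_rescor(FALSE), data = D)
# posterior draws of the path coefficients, extracted as numeric vectors
# (posterior_samples() is deprecated in recent brms versions in favour of as_draws_df())
b1 = posterior_samples(fit, "auton_age")[[1]]
b2 = posterior_samples(fit, "income_age")[[1]]
b3 = posterior_samples(fit, "income_auton")[[1]]
b4 = posterior_samples(fit, "satis_age")[[1]]
b5 = posterior_samples(fit, "satis_auton")[[1]]
b6 = posterior_samples(fit, "satis_income")[[1]]
# total effect of age, posterior mean and sd
tot.age = b4 + b2*b6 + b1*b5 + b1*b3*b6
mean(tot.age)
sd(tot.age)
# indirect effect
ind.age = tot.age - b4
mean(ind.age)
sd(ind.age)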
The output from brms shows a small direct effect of age on satisfaction, with posterior
mean (sd) of −0.06 (0.10). However, the total effect is obtained as 0.36 (0.12), and the indirect
effect as 0.43 (0.11).
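The decomposition of the total effect into direct and indirect paths can be verified without brms using ordinary least squares on simulated data. In the sketch below the data are generated with arbitrary coefficients (they are not the job survey data), and the variable names simply mirror those in the brms code. For a fully recursive path model fitted by least squares, the sum of the path products equals the coefficient from regressing satisfaction on age alone.

```r
# Path-product decomposition of a total effect, checked with OLS on simulated data.
set.seed(8)
n <- 68
age <- rnorm(n)
auton <- 0.3 * age + rnorm(n)
income <- 0.4 * age + 0.3 * auton + rnorm(n)
satis <- -0.1 * age + 0.5 * auton + 0.3 * income + rnorm(n)

b1 <- coef(lm(auton ~ age))["age"]
m3 <- coef(lm(income ~ age + auton))
b2 <- m3["age"]
b3 <- m3["auton"]
m4 <- coef(lm(satis ~ age + auton + income))
b4 <- m4["age"]
b5 <- m4["auton"]
b6 <- m4["income"]

# Direct effect plus the three indirect paths
total <- unname(b4 + b2 * b6 + b1 * b5 + b1 * b3 * b6)
indirect <- total - unname(b4)

# Coefficient from the simple regression of satisfaction on age
total_check <- unname(coef(lm(satis ~ age))["age"])
```

In the Bayesian analysis the same products are formed draw by draw, which gives a full posterior distribution for the total and indirect effects rather than a single point estimate.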
References
Aßmann C, Boysen-Hogrefe J, Pape M (2016) Bayesian analysis of static and dynamic factor models:
An ex-post approach towards the rotation problem. Journal of Econometrics, 192(1), 190–206.
Aitkin M, Aitkin I (2005) Bayesian inference for factor scores, in Contemporary Psychometrics, eds A
Maydeu-Olivares, J McArdle. Lawrence Erlbaum Associates.
Aktekin T, Polson N, Soyer R (2017) Sequential Bayesian analysis of multivariate count data. Bayesian
Analysis, 13(2), 385–409.
Albert J (1992) Bayesian estimation of normal ogive response curves using Gibbs sampling. Journal of
Educational Statistics, 17, 251–269.
Albert J (2015) Introduction to Bayesian item response modelling. International Journal of Quantitative
Research in Education, 2(3–4), 178–193.
Albert J, Ghosh M (2000) Item response modeling, pp 173–193, in Generalized Linear Models: A Bayesian
Perspective, eds D Dey, S Ghosh, B Mallick. Addison–Wesley, New York.
Andersson M, Karlsson S (2007) Bayesian forecast combination for VAR models. Working paper,
Örebro University.
Anselin L, Hudak S (1992) Spatial econometrics in practice: A review of software options. Regional
Science & Urban Economics, 22, 509–536.
Arhonditsis G, Paerl H, Valdes-Weaver L, Stow C, Steinberg J, Reckhow K (2006a) Application of
Bayesian structural equation modeling for examining phytoplankton dynamics in the Neuse
River Estuary. Estuarine, Coastal & Shelf Science, 72, 63–80.
Arhonditsis G, Stow C, Steinberg L, Kenney M, Lathrop R, McBride S, Reckhow K (2006b) Exploring
ecological patterns with structural equation modeling and Bayesian analysis. Ecological
Modelling, 192(3–4), 385–409.
Arima S (2015) Item selection via Bayesian IRT models. Statistics in Medicine, 34(3), 487–503.
Asai M, McAleer M, Yu J (2006) Multivariate stochastic volatility: A review. Econometric Reviews, 25,
145–175.
Azzalini A (1985) A class of distributions which includes the normal ones. Scandinavian Journal of
Statistics, 12, 171–178.
Bai J, Wang P (2015) Identification and Bayesian estimation of dynamic factor models. Journal of
Business & Economic Statistics, 33(2), 221–240.
Baker F (2001) The Basics of Item Response Theory, 2nd Edition. ERIC Clearinghouse on Assessment
and Evaluation.
Banerjee S, Carlin BP, Gelfand AE (2004) Hierarchical Modeling and Analysis for Spatial Data. Chapman
and Hall/CRC, Boca Raton, FL.
Barnard J, McCulloch R, Meng X-L (2000) Modeling covariance matrices in terms of standard devia-
tions and correlations, with applications to shrinkage. Statistica Sinica, 10, 1281–1312.
Bartholomew D (1987) Latent Variable Models and Factor Analysis. Charles Griffin, London, UK.
Bartholomew D, Steele F, Moustaki I, Galbraith J (2002) The Analysis and Interpretation of Multivariate
Data for Social Scientists. CRC Press.
Bazan J, Branco M, Bolfarine H (2006) A skew item response model. Bayesian Analysis, 1, 861–892.
Berkhof J, van Mechelen I, Gelman A (2003) A Bayesian approach to the selection and testing of mix-
ture models. Statistica Sinica, 13, 423–442.
Besag J, York J, Mollie A (1991) Bayesian image restoration, with two applications in spatial statistics.
Annals of the Institute of Statistical Mathematics, 43, 1–59.
Bhattacharya A, Dunson D (2011) Sparse Bayesian infinite factor models. Biometrika, 98(2), 291–306.
Bijleveld F, Commandeur J, Gould P, Koopman S (2005) Model-based measurement of latent risk in
time series with applications. Tinbergen Institute Discussion Paper No. 05-118/4. Available at
SSRN: http://ssrn.com/abstract=873466
Boker S, Neale M, Maes H, Wilde M, Spiegel M, Brick T, Spies J, Estabrook R, Kenny S, Bates T,
Mehta P (2011) OpenMx: An open source extended structural equation modeling framework.
Psychometrika, 76(2), 306–317.
Bollen K (1989) Structural Equations with Latent Variables. Wiley, New York.
Bollen KA (2002) Latent variables in psychology and the social sciences. Annual Review of Psychology,
53(1), 605–634.
Brandt P, Freeman J (2005) Advances in Bayesian time series modeling and the study of politics:
Theory testing, forecasting, and policy analysis. Political Analysis, 14(1), 1–36.
Breusch T (2005) Estimating the Underground Economy using MIMIC Models. Working Paper,
National University of Australia, Canberra.
Brown P, Fearn T, Haque M (1999) Discrimination with many variables. Journal of the American
Statistical Association, 94, 1320–1329.
Bryman A, Cramer D (2005) Quantitative Data Analysis with SPSS 12 and 13: A Guide for Social Scientists.
Routledge.
Byrnes J (2017) Bayesian SEM with BRMS. http://rpubs.com/jebyrnes/343408
Cargnoni C, Müller P, West M (1997) Bayesian forecasting of multinomial time series through con-
ditionally Gaussian dynamic models. Journal of the American Statistical Association, 92(438),
640–647.
Chan D, Kohn R, Kirby C (2006) Multivariate stochastic volatility models with correlated errors.
Econometric Reviews, 25, 245–274.
Chapados N (2014) Effective Bayesian modeling of groups of related count time series. Proceedings
of the 31st International Conference on Machine Learning, Beijing, China. JMLR: W&CP vol-
ume 32.
Chen M-H, Dey D (1998) Bayesian modeling of correlated binary responses via scale mixture of mul-
tivariate normal link functions. Sankhya, 60A, 322–343.
Chen M-H, Dey D (2000) Bayesian analysis for correlated ordinal data models, in Generalized Linear
Models: A Bayesian Perspective, eds D Dey, S Ghosh, B Mallick. Marcel Dekker, New York.
Chib S, Greenberg E (1998) Analysis of multivariate probit models. Biometrika, 85, 347–361.
Chib S, Nardari F, Shephard N (2006) Analysis of high dimensional multivariate stochastic volatility
models. Journal of Econometrics, 134, 341–371.
Chib S, Winkelmann R (2001) Markov chain Monte Carlo analysis of correlated count data. Journal of
Business & Economic Statistics, 19, 428–435.
Choi S, Gibbons L, Crane P (2011) Lordif: An R package for detecting differential item functioning
using iterative hybrid ordinal logistic regression/item response theory and Monte Carlo simu-
lations. Journal of Statistical Software, 39(8), 1.
Commandeur J, Koopman S (2007) An Introduction to State Space Time Series Analysis. Oxford
University Press.
Congdon P (2001) Bayesian Statistical Modelling. Wiley.
Congdon P, Almog M, Curtis S, Ellerman R (2007) A spatial structural equation modelling frame-
work for health count responses. Statistics in Medicine, 26(29), 5267–5284.
Conti G, Fruhwirth-Schnatter S, Heckman J, Piatek R (2014) Bayesian exploratory factor analysis.
Journal of Econometrics, 183(1), 31–57.
Curtis S (2010) BUGS code for item response theory. Journal of Statistical Software, 36(1), 1–34.
Dunson D, Herring A (2005) Bayesian latent variable models for mixed discrete outcomes. Biostatistics,
6, 11–25.
Durbin J, Koopman S (2001) Time Series Analysis by State Space Methods, 1st Edition. OUP.
Durbin J, Koopman S (2012) Time Series Analysis by State Space Methods. Oxford University Press.
Edwards Y, Allenby G (2003) Multivariate analysis of multiple response data. Journal of Marketing
Research, 40, 321–334.
Evans J, Middleton N, Gunnell D (2004) Social fragmentation, severe mental illness and suicide.
Social Psychiatry and Psychiatric Epidemiology, 39, 165–170.
Everitt BS (1984) Introduction to Latent Variable Models. Chapman and Hall, London, UK.
Feng X, Wu H, Song X (2017) Bayesian adaptive Lasso for ordinal regression with latent variables.
Sociological Methods & Research, 46(4), 926–953.
Fleishman J, Lawrence W (2003) Demographic variation in SF−12 scores: True differences or differen-
tial item functioning? Medical Care, 41, 75–86.
Fokoue E (2004) Stochastic determination of the intrinsic structure in Bayesian factor analysis. SAMSI
Technical Report #2004-17. http://www.samsi.info/reports/index.shtml
Fox J, Glas C (2005) Bayesian modification indices for IRT models. Statistica Neerlandica, 59, 95–106.
Gelfand A, Ghosh S (1998) Model choice: A minimum posterior predictive loss approach. Biometrika,
85, 1–11.
Gelman A, Meng X-L, Stern H (1996) Posterior predictive assessment of model fitness. Statistica
Sinica, 6, 733–807.
George E, McCulloch R (1993) Variable selection via Gibbs sampling. Journal of the American Statistical
Association, 88, 881–889.
Geweke J, Zhou G (1996) Measuring the pricing error of the arbitrage pricing theory. Review of
Financial Studies, 9, 557–587.
Ghosh J, Dunson D (2008) Bayesian model selection in factor analytic models, pp 151–163, in Random
Effect and Latent Variable Model Selection, ed D Dunson. Springer.
Ghosh J, Dunson D (2009) Default prior distributions and efficient posterior computation in Bayesian
factor analysis. Journal of Computational and Graphical Statistics, 18(2), 306–320.
Gielen E, Riutort-Mayol G, Palencia-Jiménez J, Cantarino I (2017) An urban sprawl index based on
multivariate and Bayesian factor analysis with application at the municipality level in Valencia.
Environment and Planning B, 45(5), 888–914.
Greenacre M, Blasius J (eds) (2006) Multiple Correspondence Analysis and Related Methods. CRC Press.
Gunnell D, Peters T, Kammerling R, Brooks J (1995) Relation between parasuicide, suicide, psychiat-
ric admissions and socio-economic deprivation. British Medical Journal, 311, 226–230.
Harvey A (1989) Forecasting, Structural Time Series Models and the Kalman Filter. Cambridge University
Press, Cambridge, UK.
Harvey A, Koopman S (1997) Multivariate structural time series models, pp 269–298, in Systematic
Dynamics in Economic and Financial Models, eds C Heij, H Schumacher, B Hanzon, C Praagman.
Wiley, Chichester, UK.
Harvey A, Shephard N (1993) Structural time series models, in Handbook of Statistics, Vol. 11, eds G S
Maddala et al. Elsevier Science Publishers, Barking, UK.
Hayashi K, Arav M (2006) Bayesian factor analysis when only a sample covariance matrix is avail-
able. Educational and Psychological Measurement, 66, 272–284.
Hogan J, Tchernis R (2004) Bayesian factor analysis for spatially correlated data, with application to
summarizing area-level material deprivation from census data. Journal of the American Statistical
Association, 99, 314–324.
Hoyle R (ed) (1995) Structural Equation Modeling: Concepts, Issues, and Applications. Sage.
Inouye D, Yang E, Allen G, Ravikumar P (2017) A review of multivariate distributions for count data
derived from the Poisson distribution. Wiley Interdisciplinary Reviews: Computational Statistics,
9(3), e1398.
Ishwaran H, James L (2001) Gibbs sampling methods for stick-breaking priors. Journal of the American
Statistical Association, 96, 161–173.
Ishwaran H, Zarepour M (2000) Markov chain Monte Carlo in approximate Dirichlet and beta two-
parameter process hierarchical models. Biometrika, 87, 371–390.
Jackson L, Kose M, Otrok C, Owyang M (2016) Specification and estimation of Bayesian dynamic
factor models: A Monte Carlo analysis with an application to global house price comovement,
in Advances in Econometrics, Vol. 35, eds S Koopman, E Hillebrand. Emerald Publishing.
Jin X, Banerjee S, Carlin BP (2007) Order-free co-regionalized areal data models with application to
multiple-disease mapping. Journal of the Royal Statistical Society: Series B (Statistical Methodology),
69(5), 817–838.
Jin X, Carlin B, Banerjee S (2005) Generalized hierarchical multivariate CAR models for areal data.
Biometrics, 61, 950–961.
Jöreskog KG (1973) Analysis of covariance structures, pp 263–285, in Multivariate Analysis–III. ed P
Krishnaiah, Academic Press.
Jöreskog K, Goldberger A (1975) Estimation of a model with multiple indicators and multiple causes
of a single latent variable. Journal of the American Statistical Association, 70, 631–639.
Juárez M, Steel M (2010) Model-based clustering of non-Gaussian panel data based on skew-t distri-
butions. Journal of Business & Economic Statistics, 28, 52–66.
Jungbacker B, Koopman S, van der Wel M (2009) Dynamic factor models with smooth loadings
for analyzing the term structure of interest rates. Tinbergen Institute Discussion Paper, TI
2009-041/4.
Kaplan D, Depaoli S (2012) Bayesian structural equation modeling, pp 650–673, in Handbook of
Structural Equation Modeling, ed R Hoyle. Guilford Publications Inc.
Karlis D, Meligkotsidou L (2005) Multivariate Poisson regression with covariance structure. Statistics
and Computing, 15, 255–265.
Karlsson S (2015) Forecasting with Bayesian vector autoregression. Handbook of Economic Forecasting,
2B, 791–897.
Kastner G (2016) Dealing with stochastic volatility in time series using the R package stochvol. Journal
of Statistical Software, 69(5), 1–30.
Kastner G, Frühwirth-Schnatter S, Lopes H (2017) Efficient Bayesian inference for multivariate
factor stochastic volatility models. Journal of Computational and Graphical Statistics, 26(4), 905–917.
Kaufmann S, Schumacher C (2013) Bayesian estimation of sparse dynamic factor models with order-
independent identification (No. 13.04). Working Paper, Study Center Gerzensee.
Kavanagh L, Lee D, Pryce G (2016) Is poverty decentralising? Quantifying uncertainty in the
decentralisation of urban poverty. Annals of the American Association of Geographers, 106(6), 1286–1298.
Kendall M (1975) Multivariate Analysis. Charles Griffin & Co., London, UK.
Kleibergen F, Paap R (2002) Priors, posteriors and Bayes factors for a Bayesian analysis of
cointegration. Journal of Econometrics, 111(2), 223–249.
Knorr-Held L, Rasser G (2000) Bayesian detection of clusters and discontinuities in disease maps.
Biometrics, 56, 13–21.
Koop G, Korobilis D (2010) Bayesian multivariate time series methods for empirical
macroeconomics. Foundations and Trends in Econometrics, 3(4), 267–358.
Koop G, Strachan R, van Dijk H, Villani M (2006) Bayesian approaches to cointegration, in The
Palgrave Handbook of Theoretical Econometrics, eds K Patterson, T Mill. MacMillan.
Koopman S, Durbin J (2000) Fast filtering and smoothing for multivariate state space models. The
Journal of Time Series Analysis, 21, 281–296.
Krueger F (2016) bvarsv: Bayesian Analysis of a Vector Autoregressive Model with Stochastic
Volatility and Time-Varying Parameters. https://sites.google.com/site/fk83research/code
Lai M, Zhang J (2017) Evaluating fit indices for multivariate t-based structural equation modeling
with data contamination. Frontiers in Psychology, 8, 1286.
Lang S, Fronk E, Fahrmeir L (2002) Function estimation with locally adaptive dynamic models.
Computational Statistics, 17, 479–500.
Lee S-Y (2007) Structural Equation Modelling: A Bayesian Approach. Wiley.
Lee S-Y, Shi J (2000) Joint Bayesian analysis of factor score and structural parameters in the factor
analysis models. Annals of the Institute of Statistical Mathematics, 52, 722–736.
Lee S-Y, Song X-Y (2003) Bayesian model selection for mixtures of structural equation models with
an unknown number of components. British Journal of Mathematical and Statistical Psychology,
56, 145–165.
Lee S-Y, Song X-Y (2008) Bayesian model comparison of structural equation models, pp 121–149, in
Random Effect and Latent Variable Model Selection, ed D Dunson. Springer.
Lee S-Y, Tang N (2006) Bayesian analysis of structural equation models with mixed exponential
family and ordered categorical data. British Journal of Mathematical and Statistical Psychology, 59,
151–172.
Leroux B, Lei X, Breslow N (1999) Estimation of disease rates in small areas: A new mixed model
for spatial dependence, pp 135–178, in Statistical Models in Epidemiology, the Environment and
Clinical Trials, eds M Halloran, D Berry. Springer-Verlag, New York.
Levy R, Mislevy R (2016) Bayesian Psychometric Modeling. CRC Press.
Litterman R (1986) Forecasting with Bayesian vector autoregressions – Five years of experience.
Journal of Business & Economic Statistics, 4, 25–38.
Liu X, Wall M, Hodges J (2005) Generalized spatial structural equation modeling. Biostatistics, 6,
539–557.
Lopes H, West M (2004) Bayesian model assessment in factor analysis. Statistica Sinica, 14, 41–67.
Lu Z-H, Chow S, Loken E (2016) Bayesian factor analysis as a variable-selection problem: Alternative
priors and consequences. Multivariate Behavioral Research, 51(4), 519–539.
Luo Y, Jiao H (2017) Using the Stan program for Bayesian item response theory. Educational and
Psychological Measurement, 77, 1–25.
MacNab Y (2007) Mapping disability-adjusted life years: A Bayesian hierarchical model framework
for burden of disease and injury assessment. Statistics in Medicine, 26(26), 4746–4769.
MacNab Y (2018) Some recent work on multivariate Gaussian Markov random fields. Test, 27(3),
1–45.
Madsen L, Dalthorp D (2007) Simulating correlated count data. Environmental and Ecological Statistics,
14, 129–148.
Magis D, Tuerlinckx F, De Boeck P (2015) Detection of differential item functioning using the lasso
approach. Journal of Educational and Behavioral Statistics, 40(2), 111–135.
Malaeb Z, Summers K, Pugesek B (2000) Using structural equation modeling to investigate
relationships among ecological variables. Environmental and Ecological Statistics, 7, 93–111.
Mardia K (1988) Multi-dimensional multivariate Gaussian Markov random fields with application to
image processing. Journal of Multivariate Analysis, 24, 265–284.
Marí-Dell’Olmo M, Martínez-Beneito M, Borrell C, Zurriaga O, Nolasco A, Domínguez-Berjón M (2011)
Bayesian factor analysis to calculate a deprivation index and its uncertainty. Epidemiology, 22(3),
356–364.
Marshall EC, Spiegelhalter DJ (2007) Identifying outliers in Bayesian hierarchical models: A
simulation-based approach. Bayesian Analysis, 2(2), 409–444.
Martinez-Beneito MA (2013) A general modelling framework for multivariate disease mapping.
Biometrika, 100(3), 539–553.
Mavridis D, Ntzoufras I (2014) Stochastic search item selection for factor analytic models. British
Journal of Mathematical and Statistical Psychology, 67(2), 284–303.
McCulloch R, Rossi P (1994) An exact likelihood analysis of the multinomial probit model. Journal of
Econometrics, 64, 207–240.
McCulloch R, Polson N, Rossi P (2000) A Bayesian analysis of the multinomial probit model with
fully identified parameters. Journal of Econometrics, 99, 173–193.
Merkle E, Rosseel Y (2017) blavaan: Bayesian structural equation models via parameter expansion.
Journal of Statistical Software, 85(4), 1–30.
Merkle E, Wang T (2016) Bayesian latent variable models for the analysis of experimental psychology
data. Psychonomic Bulletin & Review, 25(1), 256–270.
Mezzetti M, Billari F (2005) Bayesian correlated factor analysis of socio-demographic indicators.
Statistical Methods and Applications, 14(2), 223–241.
Migon H, Moreira A (2004) Core inflation: Robust common trend model forecasting. Brazilian Review
of Econometrics, 24, 1–19.
Moauro P, Savio G (2005) Temporal disaggregation using multivariate structural time series models.
Journal of Econometrics, 8, 214–234.
Murray J (2016) R Package ‘bfa’, Bayesian Factor Analysis.
https://cran.r-project.org/web/packages/bfa/bfa.pdf
Muthén B, Asparouhov T (2012) Bayesian structural equation modeling: A more flexible
representation of substantive theory. Psychological Methods, 17(3), 313–335.
Natesan P, Nandakumar R, Minka T, Rubright J (2016) Bayesian prior choice in IRT estimation using
MCMC and variational Bayes. Frontiers in Psychology, 7, 1422.
Nethery R, Warren J, Herring A, Moore K, Evenson K, Diez-Roux A (2015) A common spatial
factor analysis model for measured neighborhood-level characteristics: The multi-ethnic study of
atherosclerosis. Health & Place, 36, 35–46.
Nikolov M, Coull B, Catalano P (2007) An informative Bayesian structural equation model to assess
source-specific health effects of air pollution. Biostatistics, 8, 609–624.
O’Brien S, Dunson D (2004) Bayesian multivariate logistic regression. Biometrics, 60, 739–746.
Palomo J, Dunson D, Bollen K (2007) Bayesian structural equation modeling, in Handbook of Latent
Variable and Related Models, ed S-Y Lee. Elsevier.
Petris G, Petrone S, Campagnoli P (2009) Dynamic Linear Models with R. Springer.
Piatek R (2017) R Package ‘BayesFM’, Bayesian Inference for Factor Modeling.
https://cran.r-project.org/web/packages/BayesFM/BayesFM.pdf
Pitt M, Shephard N (1999) Time varying covariances: A factor stochastic volatility approach, pp 547–
570, in Bayesian Statistics 6, eds J Bernardo, J Berger, A Dawid, A Smith. Oxford University Press.
Poon W, Wang H (2012) Latent variable models with ordinal categorical covariates. Statistics and
Computing, 22(5), 1135–1154.
Prado R, West M (1997) Exploratory modelling of multiple non-stationary time series: Latent
process structure and decompositions, in Modelling Longitudinal and Spatially Correlated Data, ed T
Gregoire. Springer-Verlag.
Press S, Shigemasu K (1989) Bayesian inference in factor analysis, pp 271–287, in Contributions to
Probability and Statistics. eds L Gleser, M Perlman, S Press, A Sampson. Springer, New York.
Primiceri G E (2005) Time varying structural vector autoregressions and monetary policy. The Review
of Economic Studies, 72(3), 821–852.
Proietti T (2007) Measuring core inflation by multivariate structural time series models, in Optimisation,
Econometric and Financial Analysis, eds E J Kontoghiorghes, C Gatu. Advances in Computational
Management Science, Vol. 9. Springer, Berlin/Heidelberg, Germany.
Reinsel G (2003) Elements of Multivariate Time Series Analysis. Springer Science & Business Media.
Rigby R (1997) Bayesian discrimination between two multivariate normal populations with equal
covariance matrices. Journal of the American Statistical Association, 92, 1151–1154.
Rodrigues-Motta M, Pinheiro H, Martins E, Araujo M, dos Reis S (2013) Multivariate models for
correlated count data. Journal of Applied Statistics, 40(7), 1586–1596.
Rossi P, Allenby G, McCulloch R (2005) Bayesian Statistics and Marketing. Wiley.
Rue H, Held L (2005) Gaussian Markov Random Fields: Theory and Applications. Chapman and Hall/
CRC.
Rupp A, Dey D, Zumbo B (2004) To Bayes or not to Bayes, from whether to when: Applications of
Bayesian methodology to modeling. Structural Equation Modeling, 11, 424–451.
Sahu S (2002) Bayesian estimation and model choice in item response models. Journal of Statistical
Computation and Simulation, 72, 217–232.
Sain S, Cressie N (2007) A spatial model for multivariate lattice data. Journal of Econometrics, 140,
226–259.
Sánchez B, Budtz-Jørgensen E, Ryan L, Hu H (2005) Structural equation models: A review with
applications to environmental epidemiology. Journal of the American Statistical Association, 100,
1443–1455.
Schumacker R, Lomax R (2016) A Beginner’s Guide to Structural Equation Modeling, 4th Edition.
Routledge.
Sethuraman J (1994) A constructive definition of Dirichlet priors. Statistica Sinica, 4, 639–665.
Sims C, Zha T (1998) Bayesian methods for dynamic multivariate models. International Economic
Review, 39, 949–968.
Skrondal A, Rabe-Hesketh S (2004) Generalized latent variable modeling: Multilevel, longitudinal
and structural equation models. Chapman & Hall/CRC, Boca Raton, FL.
Skrondal A, Rabe-Hesketh S (2007) Latent variable modelling: A survey. Scandinavian Journal of
Statistics, 34, 712–745.
Smith D, Harvey P, Lawn S, Harris M, Battersby M (2017) Measuring chronic condition
self-management in an Australian community: Factor structure of the revised Partners in Health (PIH)
scale. Quality of Life Research, 26(1), 149–159.
Soares T, Gonçalves F, Gamerman D (2009) An integrated Bayesian model for DIF analysis. Journal of
Educational and Behavioral Statistics, 34(3), 348–377.
Song J, Ghosh M, Miaou S, Mallick B (2005) Bayesian multivariate spatial models for roadway traffic
crash mapping. Journal of Multivariate Analysis, 97, 246–273.
Song X-Y, Lee S-Y, Ng M, So W-Y, Chan J (2006) Bayesian analysis of structural equation models with
multinomial variables and an application to type 2 diabetic nephropathy. Statistics in Medicine,
26, 2348–2369.
Stromeyer W, Miller J, Sriramachandramurthy R, DeMartino R (2015) The prowess and pitfalls of
Bayesian structural equation modeling: Important considerations for management research.
Journal of Management, 41(2), 491–520.
Tabachnick B, Fidell L (2006) Using Multivariate Statistics, 5th Edition. Allyn & Bacon, Inc., Needham
Heights, MA.
Talhouk A, Doucet A, Murphy K (2012) Efficient Bayesian inference for multivariate probit models
with sparse inverse correlation matrices. Journal of Computational and Graphical Statistics, 21(3),
739–757.
Tanner M (1996) Tools for Statistical Inference: Methods for the Exploration of Posterior Distributions and
Likelihood Functions, 3rd Edition. Springer-Verlag, New York.
Tekwe C, Carter R, Cullings H, Carroll R (2014) Multiple indicators, multiple causes measurement
error models. Statistics in Medicine, 33(25), 4469–4481.
Thissen D, Steinberg L, Wainer H (1993) Detection of differential item functioning using the
parameters of item response models, pp 67–113, in Differential Item Functioning: Theory and Practice, eds
P W Holland, H Wainer. Lawrence Erlbaum Associates, Hillsdale, NJ.
Tiao G, Tsay R (1989) Model specification in multivariate time series. Journal of the Royal Statistical
Society: Series B, 51, 157–213.
Tsay R (2014) Multivariate Time Series Analysis with R and Financial Applications. Wiley, Hoboken, NJ
Tzala E, Best N (2007) Bayesian latent variable modelling of multivariate spatio-temporal variation
in cancer mortality. Statistical Methods in Medical Research, 2007 Sep 13 (epub).
Wang D, Lin J-Y, Yu T (2006) A MIMIC approach to modeling the underground economy in Taiwan.
Physica A, 371, 536–542.
Wang F, Wall M (2003) Generalized common spatial factor model. Biostatistics, 4(4), 569–582.
Wang Y, Naumann U, Wright S, Warton D (2012) mvabund: an R package for model-based analysis of
multivariate abundance data. Methods in Ecology and Evolution, 3, 471–473.
Wedel M, Bockenholt U, Kamakura W (2003) Factor models for multivariate count data. Journal of
Multivariate Analysis, 87, 356–369.
West M, Harrison J (1997) Bayesian Forecasting and Dynamic Models, 2nd Edition. Springer Verlag.
White A, Murphy T (2014) BayesLCA: An R Package for Bayesian Latent Class Analysis. Journal of
Statistical Software, 61(13). https://www.jstatsoft.org/article/view/v061i13
Yu J, Meyer A (2006) Multivariate stochastic volatility models: Bayesian estimation and model
comparison. Econometric Reviews, 25, 361–384.
Yuan K-H, Bentler P, Chan W (2004) Structural equation modeling with heavy tailed distributions.
Psychometrika, 69, 421–436.
Zheng X, Rabe-Hesketh S (2007) Estimating parameters of dichotomous and ordinal item response
models with gllamm. Stata Journal, 7(3), 313–333.
Zhou X, Nakajima J, West M (2014) Bayesian forecasting and portfolio decisions using dynamic
dependent sparse factor models. International Journal of Forecasting, 30(4), 963–980.
Zivot E, Wang J (2006) Modeling Financial Time Series with S-PLUS. Springer, Berlin, Germany.
10
Hierarchical Models for Longitudinal Data
10.1 Introduction
Longitudinal data sets arise when continuous or discrete observations $y_{it}$ on a set of
subjects, or units $i = 1, \ldots, n$, are repeated over a number of measuring occasions
$t = 1, \ldots, T_i$, possibly differing between subjects (with $N = \sum_i T_i$ total observations).
Such data occur in many contexts, with variation in type of unit, study design, and data form. For
instance, in economic and marketing applications (Keane, 2015; Rossi et al., 2005), the unit
is typically an individual consumer, household, or firm, whereas in actuarial applications
(Antonio and Beirlant, 2007), the units may consist of groups of policyholders (risk classes)
with responses being insurance claim counts. Longitudinal studies often feature an
intervention or treatment comparison, with intervention studies (Thiese, 2014) including both
observational studies and controlled clinical trials with random treatment assignment.
In balanced studies, repeat measurements on all subjects are contemporaneous, whereas
measurement at different times for different units leads to unbalanced longitudinal data
(Daniels and Hogan, 2008), with unit specific times $\{a_{it}, t = 1, \ldots, T_i\}$ at which events are
recorded. Furthermore, measuring occasions may be over more than one time scale. An
example is disease incidence by calendar time and age at onset or death, leading to a
further implicit cohort scale defined by the difference between age and time (Schmid and
Held, 2007; Lagazio et al., 2003); see Section 10.7.
There are also a variety of approaches to analysing longitudinal data, such as random
effects (conditional) models on the one hand, and marginal or population-averaged
approaches on the other (Heagerty and Zeger, 2000; Lee and Nelder, 2004), with conditional
model and marginal model parameters not necessarily having the same interpretation
(Verbeke et al., 2010). The focus of this chapter is on conditionally specified hierarchical
and random effect models, and on MCMC estimation via conditional likelihood with random
effects as part of the parameter set (e.g. Daniels and Hogan, 2008; Chib and Carlin,
1999); see Section 10.2.
Longitudinal data offer major advantages over cross-sectional designs in the analysis
of causal interrelationships between variables, including developmental and growth
processes and clinical studies, and before-after studies (Menard, 2002; Chen et al., 2016). The
accumulation of information over both times and subjects increases the power of statistical
methods to identify treatment effects or values-added (Lockwood et al., 2003), and permits
the estimation of parameters (e.g. permanent random effects or “frailties” for subjects $i$)
that are not identifiable from cross-sectional analysis or from repeated cross-sections on
different subjects.
On the other hand, analysis of longitudinal data may be problematic if the longitudinal
sequences are subject to missing observations; see Section 10.8. Missingness may involve
$$p(y_{it} \mid \theta_{it}, \phi) = \exp\left( \frac{y_{it}\theta_{it} - a(\theta_{it})}{\phi} + C(y_{it}, \phi) \right)$$
where $\theta_{it}$ denotes the natural parameter (Tsai and Hsiao, 2008; Fong et al., 2010; Natarajan
and Kass, 2000). The structural assumption governs the forms assumed for the conditional
means $E(y_{it}) = \mu_{it} = a'(\theta_{it})$, with regression link $g(\mu_{it}) = \eta_{it}$, and for the variances
$V_{it} = \phi\,\mathrm{Var}(\mu_{it})$. This involves questions such as whether the conditional mean is linear or
nonlinear in predictors and random effects, and at what levels random effects are present.
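As a worked instance of this exponential family form (a standard textbook result, not something stated in the chapter), a Poisson response with mean $\mu_{it}$ can be written with natural parameter $\theta_{it} = \log\mu_{it}$, $a(\theta) = e^{\theta}$ and $\phi = 1$:

```latex
\begin{aligned}
p(y_{it} \mid \mu_{it}) &= \frac{e^{-\mu_{it}}\mu_{it}^{y_{it}}}{y_{it}!}
  = \exp\left( y_{it}\log\mu_{it} - \mu_{it} - \log y_{it}! \right), \\
\theta_{it} &= \log\mu_{it}, \quad a(\theta_{it}) = e^{\theta_{it}}, \quad \phi = 1, \quad
  C(y_{it}, \phi) = -\log y_{it}!, \\
E(y_{it}) &= a'(\theta_{it}) = e^{\theta_{it}} = \mu_{it}, \qquad
  V_{it} = \phi\, a''(\theta_{it}) = \mu_{it}.
\end{aligned}
```

Under the canonical log link, $g(\mu_{it}) = \log\mu_{it} = \eta_{it}$, so the linear predictor coincides with the natural parameter.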
As in Chapter 9, random effects may be used at different levels: enduring differences
between subjects may be represented by time-invariant random effects, while representing
excess dispersion may require observation level random effects. Fixed effects are also
used to represent subject level heterogeneity, especially in econometrics (Frees, 2004). This
is equivalent to generating dummy variables for each subject, and works best for relatively
few subjects and more time periods, as there is no pooling of strength in fixed effects models
and the parameter count increases with the number of subjects $n$. In this chapter, the focus
is on iid random subject effects (exchangeable over units), though if the units are spatially
configured (say), then a structured prior for the permanent effects can be used.
Conjugate priors may be suitable for handling random variation at subject or observation
level for exponential family responses, especially if random variation does not involve
predictors (Lee and Nelder, 2000). However, more flexible models, possibly involving
subject specific regression effects as well as varying intercepts, involve a vector of
subject-level effects $b_i = (b_{1i}, \ldots, b_{Qi})'$ in a general linear mixed model format. These are typically
taken to be normal with dispersion matrix $D$, with elements $\{\sigma_{bij};\ i, j \in 1, \ldots, Q\}$, and with
mean $B = (B_1, \ldots, B_Q)$, with elements zero or non-zero depending on how the predictors are
defined (see Section 10.2.1). With link $g$, the structural assumption specifies
$$g(\mu_{it}) = \eta_{it} = X_{it}\beta + Z_{it}b_i + u_{it}, \qquad (10.1)$$
where $Z_{it} = (z_{1it}, \ldots, z_{Qit})$ is $1 \times Q$. In a typical analysis, $b_i$ and $u_i = (u_{i1}, \ldots, u_{iT_i})'$ are assumed
independent, and both also taken to be normal, at least initially. Residual autocorrelation
may necessitate a correlation structure in the observation errors (e.g. Franco and Bell,
2015). In some applications (e.g. Poisson data without overdispersion), the observation level
errors $u_{it}$ may not be present.
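A widely applied GLMM is the normal linear mixed model, with
$$y_{it} = X_{it}\beta + Z_{it}b_i + u_{it},$$
with $u_i = (u_{i1}, \ldots, u_{iT_i})'$ iid normal with mean zero, $T_i \times T_i$ covariance matrix $\Sigma_i = \sigma^2 I$, and
conditional expectations
$$E(y_{it} \mid b_i) = X_{it}\beta + Z_{it}b_i.$$
The normal linear mixed model may also be applied to latent data $y_{it}^*$ underlying observed
data, either binary or categorical (e.g. Chib, 2008; Chib and Jeliazkov, 2006). Thus for binary $y_{it}$,
$$y_{it} = 1 \ \text{if } y_{it}^* > 0, \quad y_{it} = 0 \ \text{otherwise}, \qquad y_{it}^* = X_{it}\beta + Z_{it}b_i + u_{it},$$
with iid residuals $u_{it}$ under conditional independence having known variance $\sigma^2 = 1$, but with
$D$ still unknown.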
Stacked over times, the conditional mean in (10.1) is expressed as
$$\eta_i = X_i\beta + Z_ib_i + u_i, \qquad (10.2)$$
where $\eta_i$ is $T_i \times 1$, $X_i$ is $T_i \times P$, and $Z_i$ is $T_i \times Q$, while the normal linear mixed model is
$$y_i = X_i\beta + Z_ib_i + u_i.$$
For the normal linear mixed model, the marginal model (with $b_i$ integrated out) is
obtainable analytically as
$$y_i \sim N(X_i\beta, Z_iDZ_i' + \sigma^2 I),$$
which is a feature not present for the broader class of general linear mixed models
(Molenberghs and Verbeke, 2006).
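The marginal covariance $Z_iDZ_i' + \sigma^2 I$ can be illustrated numerically. The following is a minimal Python sketch, assuming zero-mean random effects; the function name `marginal_cov` and all numerical values are illustrative and not taken from the text.

```python
def marginal_cov(Z, D, sigma2):
    """Return Z D Z' + sigma2 * I for one subject (plain lists of lists)."""
    T, Q = len(Z), len(D)
    cov = [[0.0] * T for _ in range(T)]
    for s in range(T):
        for t in range(T):
            # (Z D Z')_{st} = sum_q sum_r z_sq d_qr z_tr
            cov[s][t] = sum(Z[s][q] * D[q][r] * Z[t][r]
                            for q in range(Q) for r in range(Q))
            if s == t:
                cov[s][t] += sigma2  # iid observation error on the diagonal
    return cov

# Random intercept and slope (Q = 2), three occasions t = 0, 1, 2 (illustrative values).
Z = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
D = [[1.0, 0.2], [0.2, 0.5]]
V = marginal_cov(Z, D, sigma2=0.25)
```

The off-diagonal entries of `V` show how shared random effects induce correlation between repeated observations on the same subject, even though the errors $u_{it}$ are independent.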
Given the fixed regression effects, and subject permanent effects $b = (b_1, \ldots, b_n)$, repeated
observations on the same subject are conditionally independent (Kleinman and Ibrahim,
1998; Tutz and Kauermann, 2003), and the conditional likelihood factors as
$$p(y \mid b, \beta, \sigma^2) = \prod_{i=1}^{n} p(y_i \mid b_i, \beta, \sigma^2),$$
where $p(y_i \mid b_i, \beta, \sigma^2) = \prod_{t=1}^{T_i} p(y_{it} \mid b_i, \beta, \sigma^2)$. Similarly, the joint density
$p(y, b \mid \beta, D, B, \sigma^2) = p(y \mid b, \beta, \sigma^2)\, p(b \mid B, D)$ factors into subject-specific elements
$$\left[ \prod_{t=1}^{T_i} p(y_{it} \mid b_i, \beta, \sigma^2) \right] p(b_i \mid B, D).$$
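Because of conditional independence, the conditional log-likelihood is a double sum over subjects and occasions. A minimal Python sketch for the normal linear mixed model follows; the function names and the toy data (chosen so that all residuals are zero) are hypothetical.

```python
import math

def normal_logpdf(y, mean, var):
    """Log density of N(mean, var) evaluated at y."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (y - mean) ** 2 / var)

def conditional_loglik(y, X, Z, beta, b, sigma2):
    """Sum over subjects i and occasions t of log p(y_it | b_i, beta, sigma2)."""
    total = 0.0
    for i in range(len(y)):
        for t in range(len(y[i])):  # T_i may differ between subjects
            mean = (sum(x * c for x, c in zip(X[i][t], beta))
                    + sum(z * c for z, c in zip(Z[i][t], b[i])))
            total += normal_logpdf(y[i][t], mean, sigma2)
    return total

# Two subjects with T_1 = 2 and T_2 = 1; X_it = (1, t), Z_it = 1 (random intercept).
y = [[1.0, 2.0], [0.5]]
X = [[[1.0, 0.0], [1.0, 1.0]], [[1.0, 0.0]]]
Z = [[[1.0], [1.0]], [[1.0]]]
loglik = conditional_loglik(y, X, Z, beta=[0.0, 1.0], b=[[1.0], [0.5]], sigma2=1.0)
```

In an MCMC scheme this quantity is evaluated with the current sampled values of $b_i$, so the random effects are treated as part of the parameter set rather than integrated out.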
If model checking in fact reveals the conditional independence assumption does not
provide an adequate fit, then the model requires elaboration. For example, if checking shows
regression errors correlated through time, then the iid assumption for $u_{it}$ may have to be
reconsidered, or lagged effects in the response included in predictor sets $X_{it}$ or $Z_{it}$ (Frees,
2004, p.279); see Sections 10.3 and 10.5.
Such hierarchical centring may assist precise identification and MCMC convergence. If no
predictors have fixed effect coefficients, one has what is sometimes termed a random
coefficient regression, namely
where $b_{qi}$ have non-zero means $B_q$ (e.g. Daniels and Hogan, 2008).
Papaspiliopoulos et al. (2003) compare MCMC convergence for centred, non-centred,
and partially non-centred hierarchical model parameterisations, and mention that
hierarchical centring may be less effective when the latent effects $b_i$ are relatively weakly
identified. Consider the normal linear mixed model in the form
where $e_i$ and $v_i$, of dimension $T_i$ and $Q$ respectively, are standard normal variables.
Then the non-centred parameterisation (NCP) and partially non-centred parameterisations
(PNCP) are respectively
$$\tilde{b}_i = b_i - B, \qquad \tilde{b}_i \sim N(0, D),$$
and
$$\tilde{b}_i^{w} = b_i - W_iB, \qquad \tilde{b}_i^{w} \sim N(B - W_iB, D).$$
The proportion of $B$ subtracted from $b_i$ under the PNCP form (which has favourable MCMC
convergence properties) is observation specific. The longitudinal model under the NCP
and PNCP parameterisations becomes
The NCP form has potential use in random effects selection (see Section 10.2.3).
where the means $B = (B_1, \ldots, B_Q)$ are either zeroes or unknown fixed effects, and
$D = [d_{rs}] = [\sigma_{brs}]$ represents covariation within subjects between the $r$th and $s$th random
effects $b_{ri}$ and $b_{si}$. If the $Z_{it}$ are a subset of the $X_{it}$, then the means $B_q$ will be zero. For
robustness against non-normality or outliers, other forms of mixture, including scale mixtures of
normals, or discrete mixtures of random effects, may be assumed for subject effects
(Section 10.6). For spatially configured units, a prior for $(b_{1i}, \ldots, b_{Qi})$ including correlation over
areas is likely to be relevant. For doubly nested data (e.g. observations $y_{ijt}$ within subjects $i$
within clusters $j$), the second stage parameters are likely to be cluster specific and possibly
also randomly varying, as in
In many applications, the $Z_{it}$ will be of relatively small dimension, confined to the intercept
or simple time functions. For example, if $Q = 1$ and $Z_{it} = 1$, one has the normal linear form
$$y_{it} = X_{it}\beta + b_i + u_{it},$$
where $b_i$ represent permanent subject effects, namely enduring differences between
subjects due to unmeasured attributes. If $X_{it}$ excludes (or includes) an intercept, then the $b_i$ will
be normal with mean $B$ (or zero) and variance $D$.
In growth curve applications, the Zit typically include transforms of time or age, and the
mean level for an individual changes with time or age (e.g. linearly or quadratically) with
growth rates specific to each subject. For example, under a linear growth model with Q = 2,
each subject has their own linear growth rate (Weiss, 2005)
where D12 measures the correlation between intercepts and slopes. Assuming Xit omits an
intercept and linear time term, one may take
so that an individual’s response will differ from his/her mean level at a particular time
or age by a random term uit. Another option is to replace known time functions by an
unknown time-varying function, δt, as in
with δt subject to identifying constraints (e.g. δ1 = 0, and δT = 1), or with the variance of b2i
preset (Zhang et al., 2007; Zhang, 2016).
To illustrate MCMC sampling, consider the random coefficient normal linear model, namely
$$y_i = Z_ib_i + u_i, \qquad b_i \sim N(B, D), \qquad u_i \sim N(0, \sigma^2 I).$$
Let $\tau = 1/\sigma^2$ and assume, following Wakefield et al. (1994), that $\tau \sim Ga(\nu_0/2, \tau_0\nu_0/2)$. Also
assume a multivariate normal prior for the second-stage population means $B$, and a
Wishart prior for $D^{-1}$, namely
$$B \sim N(B_0, C), \qquad D^{-1} \sim W([\rho R]^{-1}, \rho).$$
Setting
$$N = \sum_{i=1}^{n} T_i, \qquad E_i^{-1} = \tau Z_i'Z_i + D^{-1}, \qquad V^{-1} = nD^{-1} + C^{-1}, \qquad \bar{b} = \sum_{i=1}^{n} b_i/n,$$
the full conditional distributions are
$$b_i \sim N(E_i[\tau Z_i'y_i + D^{-1}B], E_i),$$
$$B \sim N(V[nD^{-1}\bar{b} + C^{-1}B_0], V),$$
$$D^{-1} \sim W\left( \left[ \sum_{i=1}^{n} (b_i - B)(b_i - B)' + \rho R \right]^{-1}, n + \rho \right),$$
$$\tau \sim Ga\left( \frac{\nu_0 + N}{2}, \frac{1}{2}\left[ \sum_{i=1}^{n} (y_i - Z_ib_i)'(y_i - Z_ib_i) + \nu_0\tau_0 \right] \right).$$
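A minimal Gibbs sampling sketch in Python for the simplest case, $Q = 1$ with $Z_{it} = 1$ (random intercepts only), may help fix ideas. It cycles through normal updates for $b_i$ and $B$ and gamma updates for $D^{-1}$ and $\tau$. For $Q = 1$ the Wishart $W([\rho R]^{-1}, \rho)$ reduces to a gamma density with shape $\rho/2$ and rate $\rho R/2$, and the update $b_i \sim N(E_i[\tau Z_i'y_i + D^{-1}B], E_i)$ is the standard normal-normal result. The function name, default hyperparameters, and simulated data are illustrative assumptions, not the book's implementation.

```python
import random

def gibbs_random_intercept(y, n_iter=1500, burn=500, seed=1,
                           nu0=1.0, tau0=1.0, B0=0.0, C=100.0, rho=1.0, R=1.0):
    """Gibbs sampler for y_it = b_i + u_it; returns retained (B, D, sigma2) draws."""
    rng = random.Random(seed)
    n = len(y)
    N = sum(len(yi) for yi in y)
    b = [sum(yi) / len(yi) for yi in y]          # initialise at subject means
    B, D, tau = sum(b) / n, 1.0, 1.0
    draws = []
    for it in range(n_iter):
        # b_i ~ N(E_i [tau * sum_t y_it + B / D], E_i), with E_i^{-1} = tau * T_i + 1 / D
        for i, yi in enumerate(y):
            E = 1.0 / (tau * len(yi) + 1.0 / D)
            b[i] = rng.gauss(E * (tau * sum(yi) + B / D), E ** 0.5)
        # B ~ N(V [n * bbar / D + B0 / C], V), with V^{-1} = n / D + 1 / C
        V = 1.0 / (n / D + 1.0 / C)
        bbar = sum(b) / n
        B = rng.gauss(V * (n * bbar / D + B0 / C), V ** 0.5)
        # D^{-1} ~ Ga((n + rho) / 2, rate = (sum_i (b_i - B)^2 + rho * R) / 2)
        ss_b = sum((bi - B) ** 2 for bi in b)
        D = 1.0 / rng.gammavariate((n + rho) / 2.0, 2.0 / (ss_b + rho * R))
        # tau ~ Ga((nu0 + N) / 2, rate = (sum_it (y_it - b_i)^2 + nu0 * tau0) / 2)
        sse = sum((yit - b[i]) ** 2 for i, yi in enumerate(y) for yit in yi)
        tau = rng.gammavariate((nu0 + N) / 2.0, 2.0 / (sse + nu0 * tau0))
        if it >= burn:
            draws.append((B, D, 1.0 / tau))
    return draws

# Simulated data (hypothetical values): n = 100, T_i = 5, B = 2, D = 1, sigma2 = 0.25.
sim = random.Random(0)
y = []
for _ in range(100):
    b_true = sim.gauss(2.0, 1.0)
    y.append([b_true + sim.gauss(0.0, 0.5) for _ in range(5)])

draws = gibbs_random_intercept(y)
post_mean = [sum(d[k] for d in draws) / len(draws) for k in range(3)]
```

Note that `random.gammavariate` takes a scale rather than a rate as its second argument, hence the reciprocal. With $Q > 1$ the same cycle applies, with matrix operations and a Wishart draw replacing the scalar gamma update for $D^{-1}$.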
When predictors are available that might explain heterogeneity between subjects (e.g.
treatment allocations), regression priors may be used as means for unit random effects $b_{qi}$ (Chib,
2008). Thus, consider a model with varying intercepts $b_{1i}$, varying linear and quadratic
growth effects $\{b_{2i}, b_{3i}\}$, and observations at differentially spaced time points $\{a_{i1}, a_{i2}, \ldots, a_{iT_i}\}$
(Muthén et al., 2002). So
where random growth coefficients are related to an intervention variable $Tr_i$ according to
with $(\varepsilon_{1i}, \ldots, \varepsilon_{Qi}) \sim N_Q(0, D)$. Treatment is randomised so the baseline effects $b_{1i}$ are taken to
be independent of the intervention $Tr_i$.
may lead to an improper joint posterior for $D$ and $\beta$ under certain conditions (Natarajan
and Kass, 2000). The conjugate model for $Q > 1$ random effects involves a Wishart prior for
$D^{-1}$, $D^{-1} \sim Wish(A, \nu)$, or
$$p(D^{-1} \mid A, \nu) \propto |A|^{\nu/2}\, |D^{-1}|^{0.5(\nu - Q - 1)} \exp(-0.5\,\mathrm{tr}(AD^{-1})).$$
Alternatively, the covariance matrix may be separated into standard deviations and correlations,
$$D = \Delta R \Delta,$$
where $\Delta = \mathrm{diag}(\sigma_{b1}, \ldots, \sigma_{bQ})$ contains the standard deviations and $R$ is a correlation matrix.
Barnard et al. (2000) construct the correlation matrix $R$ from an inverse Wishart distribution,
but a more versatile approach is provided by the LKJ($\nu$) prior (Lewandowski et al.,
2009), available in rstan. Thus
$$R \sim LKJ(\nu),$$
where, as $\nu$ increases, large correlations are less plausible and the prior concentrates around
the unit (identity) correlation matrix. At $\nu = 1$, the LKJ($\nu$) correlation distribution reduces to the
uniform distribution over correlation matrices, so that all correlation matrices are equally plausible. A
setting such as LKJ(1.5) might be taken as applicable in many situations, where extreme
correlations of −1 or +1 are downweighted slightly, but relatively high correlations are not
to be ruled out. Any suitable prior (inverse gamma, lognormal, uniform, half Cauchy) may
be used for the standard deviations $\sigma_{bj}$.
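The separation of a covariance matrix into standard deviations and a correlation matrix, and the way the LKJ shape parameter downweights extreme correlations, can be sketched in Python as follows. The LKJ density is proportional to $\det(R)^{\nu - 1}$; the helper names and the $2 \times 2$ restriction of the second function are illustrative assumptions.

```python
import math

def cov_from_sd_corr(sd, R):
    """D = Delta R Delta, with Delta = diag(sd) and R a correlation matrix."""
    Q = len(sd)
    return [[sd[q] * R[q][r] * sd[r] for r in range(Q)] for q in range(Q)]

def lkj_log_kernel_2x2(r, nu):
    """Unnormalised LKJ(nu) log density for a 2 x 2 correlation matrix,
    (nu - 1) * log det R, where det R = 1 - r^2."""
    return (nu - 1.0) * math.log(1.0 - r * r)

# Standard deviations 1 and 0.1 with correlation 0.5 (illustrative values).
D = cov_from_sd_corr([1.0, 0.1], [[1.0, 0.5], [0.5, 1.0]])

flat = lkj_log_kernel_2x2(0.9, nu=1.0)      # nu = 1: kernel is constant in r
shrunk = lkj_log_kernel_2x2(0.9, nu=1.5)    # nu > 1: large |r| is penalised
```

With this separation, priors on the scale of each random effect can be set independently of the prior on their correlations, which is not possible under a Wishart prior on $D^{-1}$.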
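Cholesky decomposition methods also provide flexibility. Consider the Cholesky
decomposition $D = CC'$ where $C$ is a lower triangular matrix, with $D_{pq} = \sum_{r=1}^{Q} c_{pr}c_{qr}$ and variances
obtained as
$$D_{qq} = \sum_{r=1}^{Q} c_{qr}^2.$$
For $Q = 2$, with random effects expressed as $C\zeta_i$ for standard normal $\zeta_i = (\zeta_{1i}, \zeta_{2i})'$, the
random effects term is
$$(z_{1it}, z_{2it}) \begin{pmatrix} c_{11} & 0 \\ c_{21} & c_{22} \end{pmatrix} \begin{pmatrix} \zeta_{1i} \\ \zeta_{2i} \end{pmatrix} = \zeta_{1i}(z_{1it}c_{11} + z_{2it}c_{21}) + \zeta_{2i}z_{2it}c_{22}.$$
Instead of a Wishart prior on $D^{-1}$, priors are then adopted for each element of $C$. To ensure
$D$ is positive definite, the diagonal terms $c_{11}$ and $c_{22}$ need to be assigned positive priors,
while the prior on $c_{21}$ is unconstrained. For $Q = 3$, one has
$$(z_{1it}, z_{2it}, z_{3it}) \begin{pmatrix} c_{11} & 0 & 0 \\ c_{21} & c_{22} & 0 \\ c_{31} & c_{32} & c_{33} \end{pmatrix} \begin{pmatrix} \zeta_{1i} \\ \zeta_{2i} \\ \zeta_{3i} \end{pmatrix} = \zeta_{1i}(z_{1it}c_{11} + z_{2it}c_{21} + z_{3it}c_{31}) + \zeta_{2i}(z_{2it}c_{22} + z_{3it}c_{32}) + \zeta_{3i}z_{3it}c_{33},$$
with three positive unknowns $c_{qq}$ and three unconstrained lower diagonal unknowns.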
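The mapping from Cholesky elements to the covariance matrix, $D_{pq} = \sum_r c_{pr}c_{qr}$, is easy to check numerically. A minimal Python sketch, with a hypothetical function name and arbitrary values for $C$:

```python
def cov_from_cholesky(C):
    """Return D = C C' for a lower triangular C, using D_pq = sum_r c_pr * c_qr."""
    Q = len(C)
    return [[sum(C[p][r] * C[q][r] for r in range(Q)) for q in range(Q)]
            for p in range(Q)]

# Q = 2: c11 and c22 positive, c21 unconstrained (here negative).
C = [[2.0, 0.0],
     [-0.5, 1.0]]
D = cov_from_cholesky(C)
det_D = D[0][0] * D[1][1] - D[0][1] * D[1][0]
```

Here a negative $c_{21}$ yields a negative covariance $D_{12}$, while positive diagonal terms keep the determinant of $D$ positive.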
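An alternative Cholesky decomposition (Cai and Dunson, 2006; Chen and Dunson, 2003)
has
$$D = \Lambda\Omega\Omega'\Lambda,$$
where $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_Q)$ and $\Omega$ is lower triangular with unit diagonal,
$$\Omega = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ \omega_{21} & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & 0 \\ \omega_{Q1} & \omega_{Q2} & \cdots & 1 \end{pmatrix},$$
so that
$$C = \Lambda\Omega = \begin{pmatrix} \lambda_1 & 0 & \cdots & 0 \\ \omega_{21}\lambda_2 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & 0 \\ \omega_{Q1}\lambda_Q & \omega_{Q2}\lambda_Q & \cdots & \lambda_Q \end{pmatrix}.$$
Positive priors (e.g. lognormal, gamma) are taken for $\lambda_q$, while normal $N(0, V_\Omega)$ priors may
be assumed for unconstrained elements of $\Omega$, with $V_\Omega = 0.5$ providing relatively diffuse
priors on correlations between the $b_{qi}$. Retention of terms in $\Lambda$ is determined by binary
indicators $\gamma_{qq} \sim Bern(\pi_{qq})$, where $\pi_{qq}$ may be preset or unknown. Retention of the unknown
terms in $\Omega$ is determined both by binary indicators $\gamma_{qr} \sim Bern(\pi_{qr})$, and also by whether $\lambda_q$
and $\lambda_r$ are retained; if either of $\{\lambda_q, \lambda_r\}$ is omitted, then $\omega_{qr}$ necessarily is.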
If $Z_{it}$ is not a subset of $X_{it}$, one may consider the non-centred parameterisation
(Frühwirth-Schnatter and Tüchler, 2008)
where $(\zeta_{1i}, \ldots, \zeta_{Qi}) \sim N_Q(0, I)$. As above, diagonal terms $c_{qq}$ need to be assigned positive
priors, while priors for $c_{qr}$ ($q > r$) are unconstrained. Selection of which $c_{qq}$ and $c_{qr}$
terms to retain may be based on binary indicators $\{\gamma_{qq} \sim Bern(\pi_{qq}), \pi_{qq} \sim Be(a_{qq}, b_{qq})\}$,
$\{\gamma_{qr} \sim Bern(\pi_{qr}), \pi_{qr} \sim Be(a_{qr}, b_{qr}), q > r\}$, where $a_{qq} = b_{qq} = a_{qr} = b_{qr} = 1$ is a default option. In
effect, the model involves composite terms,
$$\Gamma_{qr} = c_{qr}\gamma_{qr}.$$
The posterior estimate for $D$ would be based on MCMC monitoring of $D_{qr} = \sum_{s=1}^{Q} \Gamma_{qs}\Gamma_{rs}$.
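The effect of the binary indicators on the implied covariance can be sketched in Python using composite terms $\Gamma_{qr} = c_{qr}\gamma_{qr}$ and $D_{qr} = \sum_s \Gamma_{qs}\Gamma_{rs}$. The function name and the values of $C$ and $\gamma$ below are hypothetical; in practice both would be sampled at each MCMC iteration.

```python
def selected_cov(C, gamma):
    """D_qr = sum_s Gamma_qs * Gamma_rs, with composite terms Gamma_qs = c_qs * gamma_qs."""
    Q = len(C)
    G = [[C[q][s] * gamma[q][s] for s in range(Q)] for q in range(Q)]
    return [[sum(G[q][s] * G[r][s] for s in range(Q)) for r in range(Q)]
            for q in range(Q)]

# Q = 3 Cholesky factor; indicators drop every term involving the third effect.
C = [[1.0, 0.0, 0.0],
     [0.5, 0.8, 0.0],
     [0.2, 0.1, 0.3]]
gamma = [[1, 0, 0],
         [1, 1, 0],
         [0, 0, 0]]
D = selected_cov(C, gamma)
```

Setting all indicators in row 3 to zero gives the third random effect zero variance and zero covariances, so it is effectively removed from the model at that iteration.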
$$y_{it} - \mu_{it} = \sum_{j=1}^{t-1} \phi_{itj}(y_{ij} - \mu_{ij}) + u_{it},$$
$$\Phi_i = \begin{pmatrix} 1 & & & & \\ -\phi_{i21} & 1 & & & \\ -\phi_{i31} & -\phi_{i32} & 1 & & \\ \vdots & \vdots & \vdots & \ddots & \\ -\phi_{iT1} & -\phi_{iT2} & \cdots & -\phi_{iT,T-1} & 1 \end{pmatrix}.$$
$$\mathrm{Var}(u_i) = H_i = \Phi_i\Sigma_i\Phi_i'.$$
The parameters $\phi_{itj}$ and $h_{it}$ may be referred to respectively as the generalised
autoregressive parameters and the innovation variances of $\Sigma_i$ (Pourahmadi and Daniels, 2002).
A parsimonious covariance model, especially for large $T$, may then be achieved by using
predictors $z_{it}$ and $w_{itj}$ in the regressions
$$\phi_{itj} = w_{itj}\lambda.$$
$$\Phi = \begin{pmatrix} 1 & & & & \\ -\phi_{21} & 1 & & & \\ -\phi_{31} & -\phi_{32} & 1 & & \\ \vdots & \vdots & \vdots & \ddots & \\ -\phi_{T1} & -\phi_{T2} & \cdots & -\phi_{T,T-1} & 1 \end{pmatrix},$$
with $\mathrm{Var}(u_i) = H = \Phi\Sigma\Phi'$. The covariates used for the covariance model become $\{z_t, w_{tj}\}$, where
the $w_{tj}$ might simply be powers in $(t - j)$ as illustrated by Cepeda and Gamerman (2004),
and the $z_t$ are simply powers of $t$. A possible drawback to using polynomial functions of
time is the multicollinearity that may be encountered, and Bayesian regression selection
may then be applied. One may also consider autoregressive or random walk priors in $h_t$
and modelling $\phi_{tj}$ as a collection of iid random effects under a shrinkage prior strategy
(Daniels and Pourahmadi, 2002, p.558).
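The link between the generalised autoregressive parameters, the innovation variances, and the covariance matrix can be illustrated with a small Python sketch. Given $\phi_{tj}$ and $h_t$, it builds $\Sigma = \Phi^{-1}H\Phi^{-1\prime}$ by recursion and then confirms that $\Phi\Sigma\Phi'$ is diagonal. The function names and the AR(1) example values are assumptions for illustration only.

```python
def build_phi_matrix(phi):
    """Unit lower triangular Phi with (t, j) entry -phi[t][j] for j < t."""
    T = len(phi)
    return [[1.0 if j == t else (-phi[t][j] if j < t else 0.0)
             for j in range(T)] for t in range(T)]

def covariance_from_ar(phi, h):
    """Sigma implied by y_t - mu_t = sum_{j<t} phi[t][j] (y_j - mu_j) + u_t, Var(u_t) = h[t]."""
    T = len(h)
    # Row t of L writes y_t - mu_t as a combination of u_1..u_t, so L = Phi^{-1}.
    L = [[0.0] * T for _ in range(T)]
    for t in range(T):
        L[t][t] = 1.0
        for j in range(t):
            for k in range(j + 1):
                L[t][k] += phi[t][j] * L[j][k]
    return [[sum(L[s][k] * h[k] * L[t][k] for k in range(T)) for t in range(T)]
            for s in range(T)]

def sandwich(A, S):
    """Return A S A'."""
    T = len(A)
    AS = [[sum(A[i][k] * S[k][j] for k in range(T)) for j in range(T)] for i in range(T)]
    return [[sum(AS[i][k] * A[j][k] for k in range(T)) for j in range(T)] for i in range(T)]

# Stationary AR(1) with coefficient 0.5 and unit innovation variance, T = 3.
phi = [[], [0.5], [0.0, 0.5]]
h = [1.0 / (1.0 - 0.5 ** 2), 1.0, 1.0]
Sigma = covariance_from_ar(phi, h)
H = sandwich(build_phi_matrix(phi), Sigma)
```

Because the $\phi_{tj}$ are unconstrained and the $h_t$ need only be positive, any values chosen in this way yield a valid positive definite $\Sigma$, which is what makes regression modelling of these parameters convenient.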
with unknown variances $D = \mathrm{var}(b_i)$ and $\sigma^2 = \mathrm{var}(u_{it})$. The conjugate approach, with the
advantage of simple posterior conditionals, involves separate gamma priors on $D^{-1}$ and
$\tau = \sigma^{-2}$. These could be informative (e.g. downweighted results from a maximum likelihood
fit), but are often taken to be diffuse with small scale and shape parameters, leading to
potentially delayed convergence of Gibbs sampling methods since sampling is from an
almost improper posterior (Natarajan and McCulloch, 1998). These problems may increase
if an autocorrelated error term is added to the white noise error as in
$$p(D) \propto \det\left[ I_Q + \left\{ \frac{1}{n} \sum_{i=1}^{n} Z_i' W_i Z_i \right\} D \right]$$
where $W_i$ is diagonal of dimension $T_i$ with elements $1/(V_{it}[\partial \eta_{it}/\partial \mu_{it}]^2)$, where $V_{it} = \phi \mathrm{Var}(\mu_{it})$
and $g(\mu_{it}) = \eta_{it}$.
$$D^{-1} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 100 & 0 \\ 0 & 0 & 10000 \end{pmatrix},$$
So, the known covariance structure involves uncorrelated random effects bqi with
respective standard deviations {1,0.1,0.01}.
The parameters are first re-estimated (in R2OpenBUGS) under a conjugate
Wishart prior for the precision matrix, $D^{-1} \sim W(I, 3)$, and with $\tau \sim \mathrm{Ga}(1, 0.001)$ and
$B_q \sim N(0, 100)$, $q = 1, \ldots, 3$. The last 4,000 of a 5,000 iteration two-chain run provides estimated means (sd) for $\sigma_{bq} = D_{qq}^{0.5}$
of 1.13 (0.047), 0.22 (0.022), and 0.35 (0.06). The standard
deviations σb2 and σb3 are therefore overestimated as compared to the simulation parameters 0.1 and 0.01. The correlation r(b1,b2) is estimated as 0.23 with 95% interval (0.02,0.45).
A more diffuse prior on D−1 (e.g. a scale matrix with diagonal terms 0.1) provides lower
posterior means for (σb2,σb3), namely 0.14 and 0.16, but an increased posterior mean of
0.38 for r(b1,b2). Possible limitations of the Wishart prior are mentioned in the literature
(Alvarez et al., 2016).
The posterior means and standard deviations of σbq under the W(I,3) prior are used
to set gamma priors on the Cholesky terms cqq in a selection model (10.6) (Fruhwirth-Schnatter and Tuchler, 2008), but with precision downweighted 100 times, namely
Ga(5.7,5.1), Ga(1.1,4.7), and Ga(0.35,0.95). Heavier downweighting (e.g. a thousandfold) is
avoided, as it may lead to over-diffuse priors on cqq. Inferences from such selection may be
sensitive to priors on the Cholesky elements, and a full analysis would consider several
choices of prior. As in Section 10.2.3, with composite terms $\Gamma_{qr} = c_{qr}\gamma_{qr}$ and $Z_{it} = (1, t, x_{it})$,
the linear predictor is
with selection indicators, γqr ~ Bern(πqr), and with πqr unknown and assigned uniform
priors. The off-diagonal Cholesky terms cqr (q > r) are assigned N(0,1) priors.
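The way the selection indicators act on the covariance matrix can be sketched as follows. The function forms the composite terms Γqr = cqr γqr and then D with elements Dqr = Σs Γqs Γrs, so that setting a row of indicators to zero removes the corresponding random effect. The numerical values of c and the function name are hypothetical and chosen only for illustration.

```python
def covariance_from_cholesky(c, gamma):
    """Return D = Gamma Gamma' with Gamma[q][r] = c[q][r] * gamma[q][r] for r <= q."""
    Q = len(c)
    G = [[c[q][r] * gamma[q][r] if r <= q else 0.0 for r in range(Q)]
         for q in range(Q)]
    return [[sum(G[q][s] * G[r][s] for s in range(Q)) for r in range(Q)]
            for q in range(Q)]

# Hypothetical lower-triangular Cholesky terms for (intercept, time slope, x slope).
c = [[1.10, 0.00, 0.00],
     [0.20, 0.30, 0.00],
     [0.10, 0.05, 0.20]]

retain_all = [[1, 0, 0], [1, 1, 0], [1, 1, 1]]   # every term retained: D = CC'
drop_third = [[1, 0, 0], [1, 1, 0], [0, 0, 0]]   # third random effect excluded

D_full = covariance_from_cholesky(c, retain_all)
D_sel = covariance_from_cholesky(c, drop_third)
```

With the third row of indicators at zero, the third row and column of D_sel are exactly zero, which corresponds to the spike at zero that selection induces in the posterior density of a random effect standard deviation.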
The last 4,000 of a 5,000 iteration two-chain run (in R2OpenBUGS) with D = CC′ provides estimated means (medians) for $\sigma_{bq} = D_{qq}^{0.5}$
of 1.12 (1.12), 0.033 (0), and 0.049 (0), with
the densities for σb2 and σb3 both having spikes at zero (consistent with zero random vari-
ation). The retention probabilities γ22 and γ33 are respectively 0.30 and 0.38. It is not pos-
sible to monitor the correlation matrix, but the covariance D12 has posterior mean 0.02,
with γ21 estimated at 0.27. The selection approach (with the priors adopted) provides
a more accurate estimate of the original σb3, and of the correlation structure, but also
essentially eliminates the random variability in the slopes on time.
Random effect selection is also undertaken with $D = \Lambda\Omega\Omega'\Lambda$, namely C = ΛΩ. Priors
on λq are the same as for cqq under the decomposition D = CC′, while N(0,0.5) priors are
used for the elements of Ω, and Bernoulli priors with preset probability 0.5 for γqr. The
last 4,000 of a 5,000 iteration two-chain run provides posterior means (medians) for
$\sigma_{bq} = D_{qq}^{0.5}$
of 1.12 (1.12), 0.040 (0), and 0.021 (0). The densities for σb2 and σb3 again both have
spikes at zero. So, a more accurate estimate of the original σb3 is obtained than under
a Wishart prior, but random variability in the slopes on time (as summarised in σb2) is
understated.
Either Cholesky decomposition approach can also be applied without selection (in
effect setting γqr = 1). For example, with Ga(1,1) priors on λq and N(0,0.5) priors on the ωqr,
the second method gives posterior means for σb2 and σb3 of 0.09 and 0.11, with σb3 less
inflated (as compared to the true value) than under a Wishart prior.
Also considered is the rstan option whereby an LKJ(ν) prior is applied to the lower
Cholesky factor of the correlation matrix between (b1i,b2i,b3i) (Vaidyanathan, 2016;
Baldwin, 2014). Half Cauchy priors with scale 5 are assumed for σbq. With a shape param-
eter ν = 1 for the LKJ(ν) prior, a two-chain run of 5,000 iterations provides posterior
means for σb2 and σb3 of 0.09 and 0.15. So again the estimated σb3 is less inflated (as com-
pared to the true value) than under a Wishart prior. However, the estimated correlation
matrix shows r(b1,b2) to be positive with mean 0.54 and 95% limits (0.04,0.91). Setting
ν = 1.5 provides posterior means for σb2 and σb3 of 0.08 and 0.10, with r(b1,b2) having mean
0.48 and 95% limits (−0.09,0.87).
The Cholesky factor correlation matrix approach is also applied using a Dirichlet partition of a total random variance parameter $\sigma_T^2$. This recognises the interdependence of the
sources of random variation. Thus, one has $\sigma_{bq}^2 = \phi_q \sigma_T^2$, where $(\phi_1, \phi_2, \phi_3) \sim \mathrm{Dir}(w_1, w_2, w_3)$,
with the w vector itself Dirichlet distributed with prior weights 1. The total variance $\sigma_T^2$
is taken as half Cauchy. This provides posterior means for σb2 and σb3 of 0.08 and 0.10,
with r(b1,b2) estimated at 0.49.
Finally, a direct separation strategy is applied (McElreath, 2015, p.393). Thus with D =
ΔRΔ, the correlation matrix R is assigned an LKJ(1.5) prior, and σbj (the diagonal ele-
ments in Δ) taken as lognormal with variance 1. This gives posterior means for σb2 and
σb3 of 0.09 and 0.13, with r(b1,b2) estimated at 0.55.
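The separation strategy D = ΔRΔ, and the partition of a total variance into component variances, can be illustrated with a short sketch. The total variance, the fractions, and the correlation matrix below are hypothetical fixed values, whereas in the analyses described they are unknowns with Dirichlet, half Cauchy, lognormal, or LKJ priors.

```python
import math

def separation_covariance(sd, R):
    """Return D = Delta R Delta, where Delta = diag(sd) and R is a correlation matrix."""
    Q = len(sd)
    return [[sd[q] * R[q][r] * sd[r] for r in range(Q)] for q in range(Q)]

# Partition of a total random variance into three components (hypothetical values).
sigma2_T = 1.2
frac = [0.90, 0.06, 0.04]                      # fractions summing to 1
sd = [math.sqrt(f * sigma2_T) for f in frac]   # component standard deviations

# Hypothetical correlation matrix between (b1, b2, b3).
R = [[1.0, 0.5, 0.0],
     [0.5, 1.0, 0.0],
     [0.0, 0.0, 1.0]]

D = separation_covariance(sd, R)
corr_12 = D[0][1] / math.sqrt(D[0][0] * D[1][1])
```

Separating scales from correlations in this way lets the priors on the standard deviations and on R be chosen independently, unlike a Wishart prior on the precision matrix, where the two are tied together.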
One can say in conclusion that some of the options considered provide better perfor-
mance in certain regards, but that no approach satisfactorily reproduces all aspects of
the known covariance structure. It may be that a more extended longitudinal simula-
tion (e.g. with T = 10), would be less sensitive, as there is more information on each unit.
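$$y_{it} - \mu_t = \sum_{j=1}^{t-1} \phi_{tj}(y_{ij} - \mu_j) + u_{it},$$
$$F = \begin{bmatrix} 1 & & & & \\ -\phi_{21} & 1 & & & \\ -\phi_{31} & -\phi_{32} & 1 & & \\ \vdots & \vdots & \vdots & \ddots & \\ -\phi_{T1} & -\phi_{T2} & \cdots & -\phi_{T,T-1} & 1 \end{bmatrix},$$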
provides the covariance decomposition $\mathrm{Var}(u_i) = H = F \Sigma F'$. The covariates {wtj,zt} used
for the covariance regression model are powers of t − j for wtj, and powers of t for zt. Then
the model takes
$$\phi_{tj} = \lambda_1 + \lambda_2 (t - j) + \lambda_3 (t - j)^2,$$
with the model for ϕtj here extending only to a quadratic term in (t − j) rather than a
quartic as in Cepeda and Gamerman (2004).
The last 9,000 iterations from a two-chain run of 10,000 iterations in R2OpenBUGS
provide posterior mean (sd) estimates for the parameters as follows: β1 = 94.2 (0.39),
β2 = 0.82 (0.11), β3 = −0.021 (0.011), β4 = 4.0E−4 (3.4E−4), γ1 = 0.36 (0.37), γ2 = −0.189 (0.071),
γ3 = 0.0085 (0.0029), λ1 = 0.34 (0.39), λ2 = −0.053 (0.011), and λ3 = −0.00181 (5.3E−4). Predictions
from the model reproduce the observations satisfactorily, with 11 of the 144 data points
having predictive exceedance probabilities $\Pr(y_{new,it} > y_{it} \mid y)$ under 0.05 or over 0.95.
Despite providing an insight into the temporal aspects of covariation, this model has
worse fit measures than a standard approach, with a Wishart prior on trivariate normal
random intercepts and slopes in a quadratic growth model. The LOO-IC for the latter is
532, compared to over 2,380 for the joint regression model.
Finally, a quadratic growth curve model is combined with AR1 dependence in the
errors, using the appropriate form of error covariance matrix represented by a function
(see section 10.3.1). This is implemented in rstan. We find a significant AR1 parameter
with posterior mean (sd) 0.93 (0.03). This model can also be coded from first principles.
The LOO-IC is reduced to 359.
where Xit includes an intercept, $b_i \sim N(0, D)$, and $u_{it} \sim N(0, \sigma^2)$. Assuming uit and bi are independent, the correlation between $w_{it} = u_{it} + b_i$ and $w_{is} = u_{is} + b_i$ at periods t and s is
sometimes called the intraclass correlation. The random intercept model leads to the “compound symmetry” form for the intra-subject covariance matrix Σi (Weiss, 2005, pp.246–250), with diagonal terms $\Sigma_{itt} = \sigma^2$, and off-diagonal terms $\Sigma_{ist} = \sigma^2\kappa$, $s \neq t$. Equivalently
$$\Sigma_i = \sigma^2 [(1 - \kappa)I + \kappa J],$$
$$\kappa_t = D_t/(D_t + \sigma^2),$$
and the correlation between the $w_{it} = D_t^{0.5} b_i + u_{it}$ at times t and s is $(\kappa_t \kappa_s)^{0.5}$. The corresponding
RIAS model has loadings $D_{1t}^{0.5}$ and $D_{2t}^{0.5}$ and standard normal random effects b1i and b2i,
so that
For discrete data, the temporal correlation under random intercept and RIAS
models may be confined to positive values only. Thus, for Poisson counts yit, with
$\log[E(y_{it}|b_i)] = \log(\mu_{it}) = X_{it}\beta + b_i$, and $g_{it} = \exp(X_{it}\beta)$, one has under conditional independence that
$$\begin{aligned} \mathrm{cov}(y_{it}, y_{is}) &= E[\mathrm{cov}(y_{it}, y_{is}|b_i)] + \mathrm{cov}[E(y_{it}|b_i), E(y_{is}|b_i)] \\ &= \mathrm{cov}(e^{b_i}g_{it}, e^{b_i}g_{is}) = g_{it}g_{is}\mathrm{var}(e^{b_i}), \end{aligned}$$
while
where Σi is a unit level covariance matrix of dimension Ti × Ti. Commonly adopted schemes
for such residuals include low order random walks (e.g. first order or RW1 priors), or low
order stationary schemes (typically AR1 or MA1). For example, Xu et al. (2007), Oh and Lim
(2001), and Ibrahim et al. (2000) adopt stationary AR1 errors in models for longitudinal
count data, with $y_{it} \sim \mathrm{Po}(\mu_{it})$,
where $u_{it} \sim N(0, \sigma_u^2)$ are iid, and |ρ| < 1. For metric data, a stationary AR1 error scheme with
where |ρ| < 1, and $u_{it} \sim N(0, \sigma_u^2)$, leads to error covariance matrix Σi with elements
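$$\Sigma_{ist} = \mathrm{var}(\varepsilon_{it})\rho^{|s-t|} = \frac{\sigma_u^2}{1-\rho^2}\rho^{|s-t|},$$
with
$$\Sigma_i = \frac{\sigma_u^2}{1-\rho^2}\begin{pmatrix} 1 & \rho & \rho^2 & \cdots & \rho^{T_i-1} \\ \rho & 1 & \rho & \cdots & \rho^{T_i-2} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \rho^{T_i-2} & \cdots & \rho & 1 & \rho \\ \rho^{T_i-1} & \cdots & \rho^2 & \rho & 1 \end{pmatrix}.$$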
Assuming homogeneous parameters across subjects so that Σi = Σ, and that subjects are
independent, the full population covariance matrix is
$$\Phi = \frac{\sigma_u^2}{1-\rho^2} I_n \otimes \Sigma,$$
where In is an identity matrix of order n. With $\varepsilon_{it} = y_{it} - X_{it}\beta$, the marginal log-likelihood for
parameters $\chi = (\rho, \sigma_u^2, \beta)$ is then of the form $\log L(\chi \mid y) = \text{const} - 0.5[\log|\Phi| + \varepsilon'\Phi^{-1}\varepsilon]$.
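A short sketch of the stationary AR1 covariance matrix may help fix ideas. The function below fills in the elements σu²ρ^|s−t|/(1 − ρ²) for a single subject, and the parameter values are hypothetical.

```python
def ar1_covariance(T, rho, sigma2_u):
    """Covariance matrix of a stationary AR1 error sequence of length T."""
    if abs(rho) >= 1:
        raise ValueError("stationarity requires |rho| < 1")
    marginal_var = sigma2_u / (1.0 - rho ** 2)
    return [[marginal_var * rho ** abs(s - t) for t in range(T)] for s in range(T)]

rho, sigma2_u = 0.6, 1.0          # hypothetical values
Sigma = ar1_covariance(4, rho, sigma2_u)

# Stationarity: the marginal variance reproduces itself under the AR1 recursion.
marginal_var = Sigma[0][0]
recursed_var = rho ** 2 * marginal_var + sigma2_u
```

The matrix depends on only two parameters whatever the value of T, which is what makes such error schemes parsimonious compared to an unstructured Σ.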
A stationary first-order moving average or MA1 scheme, namely
and with |θ| < 1, leads to a particular form of a Toeplitz covariance matrix (Weiss, 2005,
p.267). Thus set
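$$\varphi^2 = \mathrm{var}(u_{it} + \theta u_{i,t-1}) = \sigma^2(1+\theta^2),$$
$$\gamma = \theta/(1+\theta^2),$$
then
$$\Sigma_i = \varphi^2 \begin{pmatrix} 1 & \gamma & 0 & 0 & \cdots \\ \gamma & 1 & \gamma & 0 & \cdots \\ \cdots & \cdots & \cdots & \cdots & \cdots \\ \cdots & 0 & \gamma & 1 & \gamma \\ \cdots & \cdots & 0 & \gamma & 1 \end{pmatrix} = \sigma^2 \begin{pmatrix} 1+\theta^2 & \theta & 0 & 0 & \cdots \\ \theta & 1+\theta^2 & \theta & 0 & \cdots \\ \cdots & \cdots & \cdots & \cdots & \cdots \\ \cdots & 0 & \theta & 1+\theta^2 & \theta \\ \cdots & \cdots & 0 & \theta & 1+\theta^2 \end{pmatrix}.$$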
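The banded MA1 structure can be checked numerically with a small sketch, using hypothetical values of θ and σ². The matrix is built from the first parameterisation (φ² and γ) and compared with the second (σ² and θ).

```python
def ma1_covariance(T, theta, sigma2):
    """Toeplitz covariance matrix of an MA1 error sequence of length T."""
    phi2 = sigma2 * (1 + theta ** 2)     # marginal variance
    g = theta / (1 + theta ** 2)         # lag-one correlation
    Sigma = [[0.0] * T for _ in range(T)]
    for s in range(T):
        for t in range(T):
            if s == t:
                Sigma[s][t] = phi2
            elif abs(s - t) == 1:
                Sigma[s][t] = phi2 * g   # equals sigma2 * theta
    return Sigma

theta, sigma2 = 0.5, 2.0                 # hypothetical values
Sigma = ma1_covariance(5, theta, sigma2)
```

Covariances vanish beyond lag one, in contrast to the geometric decay under AR1 errors.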
Stationary or random walk models for errors can be extended in various ways. Thus, for
unequally spaced data at points $\{a_{i1}, a_{i2}, \ldots, a_{iT}\}$, the AR1 model becomes
$$\varepsilon_{it} = \rho^{|a_{it} - a_{i,t-1}|}\varepsilon_{i,t-1} + u_{it},$$
Another option when Ti is relatively large is to adopt subject varying autocorrelation parameters,
possibly independently distributed, $\rho_i \sim U(-1, 1)$ (Ryu et al., 2007), or hierarchically specified; see Example 10.3.
The use of autocorrelated or random walk effects raises issues about how to specify the
initial conditions (initial random effects) such as εi1 under an AR1 or RW1 prior on εit, and
{εi1,εi2} under an AR2 or RW2 prior. For stationary autoregressive errors, such as the AR1
prior
the variances of εit and υit are analytically linked, so that the initial conditions are neces-
sarily specified as part of the prior. So, for stationary AR1 dependence in εit and equally
spaced data, one has
and
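$$\mathrm{var}(\varepsilon_{i1}) = \sigma_u^2/(1-\rho^2),$$
and the joint distribution of the εit is obtained (Xu et al., 2007) as
$$p(\varepsilon_{i1}) \prod_{t=2}^{T_i} p(\varepsilon_{it}|\varepsilon_{i,t-1})$$
where
$$p(\varepsilon_{it}|\varepsilon_{i,t-1}) = \frac{1}{\sigma_u(2\pi)^{0.5}}\exp(-0.5[\varepsilon_{it} - \rho\varepsilon_{i,t-1}]^2/\sigma_u^2).$$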
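The factorisation of the joint density into a stationary initial condition and one-step conditionals can be sketched as a log-density function. The function names and the numerical inputs are hypothetical. For a sequence of length two, the result can be compared with the bivariate normal density having common variance σu²/(1 − ρ²) and correlation ρ.

```python
import math

def normal_logpdf(x, mean, var):
    """Log density of a univariate normal with the given mean and variance."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def ar1_logdensity(eps, rho, sigma2_u):
    """Log of p(eps_1) * prod_{t>=2} p(eps_t | eps_{t-1}) under stationary AR1 errors."""
    logp = normal_logpdf(eps[0], 0.0, sigma2_u / (1 - rho ** 2))   # initial condition
    for t in range(1, len(eps)):
        logp += normal_logpdf(eps[t], rho * eps[t - 1], sigma2_u)  # one-step conditional
    return logp

rho, sigma2_u = 0.6, 1.0
x, y = 0.3, -0.5
factorised = ar1_logdensity([x, y], rho, sigma2_u)

# Direct bivariate normal log density with variance v and correlation rho.
v = sigma2_u / (1 - rho ** 2)
direct = (-math.log(2 * math.pi) - 0.5 * math.log(v ** 2 * (1 - rho ** 2))
          - (x ** 2 - 2 * rho * x * y + y ** 2) / (2 * v * (1 - rho ** 2)))
```

Evaluating the density through the conditionals avoids forming or inverting the Ti × Ti covariance matrix.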
In non-stationary and random walk models with |ρ| ≥ 1, initial conditions are usually
specified by diffuse fixed effect priors, though Chib and Jeliazkov (2006) interlink the vari-
ance of the initial conditions with that of the main sequence of effects to provide a proper
joint prior on $\{\varepsilon_{i1}, \ldots, \varepsilon_{iT_i}\}$. One may also link initial conditions εi1 and subject heterogeneity,
as in
$$b_i \sim N(\psi\varepsilon_{i1}, \sigma_b^2),$$
where ψ can be positive or negative (Chamberlain and Hirano, 1999). This amounts to
assuming a bivariate density for bi and εi1.
The coefficients b2i measure how far the return of security i is attributable to market factors. A bivariate normal prior is assumed for $\{b_{1i}, b_{2i}\}$, with mean (B1,B2), and covariance
D. A Wishart W(I,2) prior for D−1 is assumed, with the prior mean for the covariance
matrix D then being the identity matrix.
The last 4,000 of a 5,000 iteration two-chain run in R2OpenBUGS provide a significant effect for xt with B2 having posterior mean (95% credible interval) of 0.72 (0.63,0.81).
The mixed predictive procedure (Marshall and Spiegelhalter) shows a satisfactory
fit, with around 8% of the 5,400 observations having predictive exceedance probabilities
$\Pr(y_{rep.mix,it} > y_{it} \mid y)$ over 0.95 or under 0.05. However, to assess whether first-order autoregressive dependence might be present, define realised residuals $e_{it} = y_{it} - b_{1i} - b_{2i}x_t$.
Then a firm-specific measure of AR1 error dependence is
$$r_{1i} = \sum_{t=2}^{T} e_{it}e_{i,t-1} \Big/ \sum_{t=1}^{T} e_{it}^2.$$
Thus 58 of the 90 firms have probabilities below 0.05 that $r_{1i} > 0$, with the sample-wide
AR1 dependence parameter (the mean of the $r_{1i}$) estimated at −0.097 with 95% CRI
(−0.103,−0.091).
Another evaluation involves a posterior predictive check based on an average of
Durbin-Watson (DW) statistics taken over all 90 firms. Thus at each iteration r, a DW
statistic is derived for each firm, namely
$$DW_i^{(r)} = \sum_{t=2}^{T}(e_{it}^{(r)} - e_{i,t-1}^{(r)})^2 \Big/ \sum_{t=1}^{T}(e_{it}^{(r)})^2.$$
A summary statistic for autocorrelation is then the average over firms, $\overline{DW}^{(r)} = \sum_i DW_i^{(r)}/n$,
which is obtained for actual data, $\overline{DW}_{obs}^{(r)}$, and for replicate data, $\overline{DW}_{new}^{(r)}$ (Gelman et al.,
1996). The resulting posterior probability $\Pr(\overline{DW}_{obs} \ge \overline{DW}_{new} \mid y)$ is 1, indicating inadequate fit.
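The two residual diagnostics used in this example, the lag-one measure r1i and the Durbin-Watson statistic, can be sketched for a single residual series as follows. The residual values are hypothetical, and in the analysis described both statistics would be recomputed at every MCMC iteration and for every firm.

```python
def lag1_autocorrelation(e):
    """r1 = sum_{t>=2} e_t e_{t-1} / sum_t e_t^2 for one residual series."""
    numerator = sum(e[t] * e[t - 1] for t in range(1, len(e)))
    return numerator / sum(x * x for x in e)

def durbin_watson(e):
    """DW = sum_{t>=2} (e_t - e_{t-1})^2 / sum_t e_t^2 for one residual series."""
    numerator = sum((e[t] - e[t - 1]) ** 2 for t in range(1, len(e)))
    return numerator / sum(x * x for x in e)

# Hypothetical residuals that alternate in sign (negative dependence).
e = [0.5, -0.4, 0.3, -0.6, 0.2, -0.1]
r1 = lag1_autocorrelation(e)
dw = durbin_watson(e)

# Exact link between the two: DW = 2(1 - r1) - (e_1^2 + e_T^2) / sum_t e_t^2.
end_correction = (e[0] ** 2 + e[-1] ** 2) / sum(x * x for x in e)
dw_from_r1 = 2 * (1 - r1) - end_correction
```

Negative autocorrelation, as found for most firms here, pushes DW above 2, which is why the observed average DW exceeds the average for replicate data generated under a model with independent errors.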
Accordingly, a revised model (model 2) includes a stationary AR1 error, so that
and a stationary prior, $\rho \sim U(-1, 1)$. A 5,000 iteration two-chain run (with the last 4,000
for inference) gives a significant ρ estimate, with posterior mean (sd) −0.088 (0.014).
The LOO-IC for this model is 39,696, compared to 39,733 for model 1. However, checking
based on firm-specific r1i shows 39 firms with probability under 0.05 that r1i > 0 , and 30
firms with probability over 0.95 that r1i > 0 .
An extension to unit-specific AR1 parameters is therefore adopted. Thus
$$\delta_i \sim \mathrm{Beta}(a_\delta, b_\delta),$$
$$\rho_i = 2\delta_i - 1,$$
$$a_\delta \sim E(1),$$
$$b_\delta \sim E(1).$$
Priors are as above on (b1i,b2i), and D. Estimation of this model shows that the probabilities that
$r_{1i} > 0$ are now considerably less concentrated in the tails, with only one firm now having a probability under 0.05 that $r_{1i} > 0$, and no firms with probability over 0.95 that
$r_{1i} > 0$. However, possibly illustrating that improved model checks are not necessarily
associated with improved overall fit, the LOO-IC rises to 39,737, as the complexity index
(p_loo) rises to 174.
chooses brand k in period t might be modelled using a multinomial logit (MNL) regres-
sion, with choice K as a reference,
where β0k are intercept terms, Pkt and Akt are brand-time specific characteristics (e.g. price
and advertising spend) varying in whether associated regression parameters are choice
specific, and bik are random consumer-brand taste effects. These are typically taken as
multivariate normal of dimension K − 1, with biK = 0 for identifiability (Malchow-Moller and
Svarer, 2003).
Consumer variation in response to prices or attributes would involve making the βk and γ
coefficients specific to each consumer, and defining hyperparameters for the densities of βki
and γi. For Pkt of dimension R, Rossi et al. (2005, p.136) propose a conjugate normal hierarchical prior structure for $\beta_i = (\beta_{1i}, \ldots, \beta_{Ri})$, with mean ZiΔ, where Zi are consumer attributes, and
with variance Vβ of dimension $\dim(\beta_i) = (R - 1)R$. Vβ is assigned an inverse Wishart prior with expectation I and dim(βi) + 3 degrees of freedom. They demonstrate the improved
MCMC convergence for βi obtained by using a random walk Metropolis with increments that
have covariance $s^2 (H_i + (V_\beta^{(r)})^{-1})^{-1}$, where Hi is the Hessian of a composite likelihood based on
multiplying the MNL subject specific likelihood by the pooled (all subject) likelihood raised
to power $r_i = T_i/(cN)$, and c > 1 and $s = 2.93/\sqrt{\dim(\beta_i)}$ are tuning constants.
Consider categorical longitudinal data with subject level predictors only, namely Xit and
Zit of dimension P and Q, and category-specific fixed regression effects, namely
where $Z_{it}b_{ik} = z_{1it}b_{ik1} + z_{2it}b_{ik2} + \ldots + z_{Qit}b_{ikQ}$. Assuming the Xit and Zit are non-overlapping,
one may adopt Q independent sets of subject-category effects each of dimension K − 1, one
for each predictor zqit,
Alternatively, the covariance matrix of the random effects may be of dimension (K − 1)Q with
the bik correlated over both categories and predictors. In the case where Zit is a subset of Xit,
the bikq are zero mean random effects, and covariance matrices may be choice specific Dk of
dimension Q, so that
where $C_k C_k' = D_k$.
with predictor effects also possibly varying over (at least one of) categories, subjects or
times. For example, Spiess (2006) considers predictor effects varying over times, as in
where P(εikt) is usually a normal or logistic distribution. These distributions are very simi-
lar though the logistic places more probability in the tails (Hedeker, 2003). So
or
An equivalent specification of this model involves sets of K − 1 binary variables for each
subject-time pairing, namely $d_{ikt} = 1$ if $y_{it} \le k$, and $d_{ikt} = 0$ otherwise. Then for ε logistic,
vations is $N = \sum_{i=1}^{n} T_i = 2412$. Known influences on brand choice are brand and time
specific, namely features Akt (=1 if the brand k was subject to an advertising feature at
the time t of purchase, =0 otherwise), and shelf price Pkt.
$$\pi_{ikt} = \Pr(d_{ikt} = 1) = \phi_{ikt} \Big/ \sum_{k=1}^{K} \phi_{ikt},$$
As mentioned by Chen and Kuo (2001), observations from the same household are usu-
ally correlated in brand choice applications, and not accounting for such dependence
may produce biased estimates. A random intercepts model (model 2) accordingly allows
for heterogeneity at household-choice level, though retaining homogeneous impacts for
brand attributes. This has the form
where the vector B denotes the average category intercepts $(\beta_{01}, \ldots, \beta_{0,K-1})$. A Wishart
prior for the precision matrix, $D^{-1} \sim W(I, 3)$, is assumed.
Inferences (from jagsUI) show significant fixed effects, (γ1,γ2), under model 1, for both
feature and price, with posterior means (sd) of 0.49 (0.12) and −36.7 (2.3). This model
has a LOO-IC of 5,324, whereas the trivariate normal random intercept model has a
LOO-IC of 2,181. Estimates of the correlation matrix under model 2 show brand 1 and
3 choices to be positively correlated, with r(bi1,bi3) = 0.44. The impacts for feature, and
to a lesser degree, price, are enhanced, though with reduced precision. Thus
posterior means (sd) of (γ1,γ2) are now 0.86 (0.18) and −44.8 (3.8). While model 2 yields a
pronounced gain in fit, it has not controlled for consumer variation in price or advertis-
ing responsiveness, which would involve making the γ1 and γ2 coefficients household
specific.
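The multinomial logit choice probabilities in this example can be sketched for one household and purchase occasion. The form assumed for ϕikt, namely the exponential of a brand intercept plus common feature and price effects plus a household-brand effect, is inferred from the description of model 2 and is an assumption, as are all the numerical values (four brands, with the last as reference).

```python
import math

def mnl_probabilities(beta0, gamma1, gamma2, feature, price, b):
    """Choice probabilities pi_k = phi_k / sum_k phi_k for one household and period.

    beta0 and b hold K entries, with the reference brand (last entry) fixed at 0.
    """
    phi = [math.exp(beta0[k] + gamma1 * feature[k] + gamma2 * price[k] + b[k])
           for k in range(len(beta0))]
    total = sum(phi)
    return [f / total for f in phi]

beta0 = [0.4, -0.2, 0.1, 0.0]      # hypothetical intercepts, reference brand last
b_i = [0.5, 0.0, -0.3, 0.0]        # hypothetical household taste effects
feature = [0, 1, 0, 0]             # brand 2 featured at this purchase
price = [0.06, 0.05, 0.07, 0.05]   # hypothetical shelf prices

probs = mnl_probabilities(beta0, 0.5, -37.0, feature, price, b_i)
# Same setting, but with brand 1 priced at 0.05 rather than 0.06.
cheaper = mnl_probabilities(beta0, 0.5, -37.0, feature,
                            [0.05, 0.05, 0.07, 0.05], b_i)
```

With a negative price coefficient, lowering the price of brand 1 raises its choice probability at the expense of the other brands, while the household effects shift probabilities persistently across all purchases by the same household.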
$$= g_{itj} - g_{it,j-1},$$
where
and B = (B1,B2) contains an overall intercept and time slope. Since the overall intercept
is an unknown, identification of the K − 1 = 6 thresholds requires setting κ1 to zero.
The remaining five threshold parameters are subject to monotonicity constraints:
$\kappa_k = \kappa_{k-1} + \delta_k$, where $\delta_k \sim \mathrm{Ga}(1, 1)$.
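The monotone threshold construction and the resulting category probabilities can be sketched as below. The cumulative logit form Pr(y ≤ k) = expit(κk − η) is an assumed sign convention, and the increments and linear predictor are fixed hypothetical values rather than draws from the Ga(1,1) prior.

```python
import math

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def build_thresholds(increments):
    """kappa_1 = 0 and kappa_k = kappa_{k-1} + delta_k, with positive increments."""
    kappa = [0.0]
    for delta in increments:
        kappa.append(kappa[-1] + delta)
    return kappa

def category_probabilities(eta, kappa):
    """Differences of successive cumulative probabilities, assuming a cumulative logit."""
    cumulative = [expit(k - eta) for k in kappa] + [1.0]
    return [cumulative[0]] + [cumulative[j] - cumulative[j - 1]
                              for j in range(1, len(cumulative))]

increments = [0.8, 1.1, 0.6, 1.3, 0.9]   # hypothetical positive increments
kappa = build_thresholds(increments)     # six thresholds for seven categories
probs = category_probabilities(0.5, kappa)
```

Because each increment is positive, the thresholds are ordered by construction, so every category probability is positive without any explicit ordering constraint in the sampler.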
The model is first fitted in rjags, with inferences from a two-chain run of 10,000
iterations. The coefficient β1 (a measure of differences in symptom level between treat-
ment options at baseline) is not significant, but there is a steeper decline in ill-health
for treated subjects. Thus, the coefficient β2 has a posterior mean (95%CRI) of −0.69
(−1.05,−0.32). Posterior means for σb1 and σb2 are 1.71 and 0.86 respectively, with r(b1,b2)
estimated at −0.47, showing steeper decline effects for higher initial symptom levels. A
posterior predictive check based on comparing total likelihoods for actual and replicate
data gave probability 0.54, indicating a satisfactory model. Diagnostic tests such as Q-Q
plots and Jarque–Bera tests support normality of the permanent effects (b1i,b2i).
To assess sensitivity to alternative priors regarding random effect covariance, the
above model is also fitted in rstan using an LKJ(1.5) prior applied to the lower Cholesky
factor of the correlation matrix between b1i and b2i. The code for this model involves six
threshold parameters, with B1 set at 0. From a run of 2,000 iterations, estimates for β1 and
β2 are little changed, while posterior means for (σb1, σb2) are 1.80 and 0.92 respectively,
with r(b1,b2) estimated at −0.49.
where the uit ~ N(0,σ2) are independent of each other, and under standard assumptions are
also uncorrelated with the initial observations yi1 and with permanent subject effects bi.
If Xit contains a constant term, then the bi have mean zero, and bi ~ N(0,D). Allowing for
subject level variation in a Q length vector of predictors Zit, as well as for first-order lagged
response, leads to
Assuming a stationary process with |ϕ| < 1, one possible model for yi1 is
$$y_{i1} = \frac{Z_{it}b_i}{1-\phi} + \frac{X_{it}\beta}{1-\phi} + u_{i1},$$
with $u_{i1} \sim N(0, \sigma^2/(1-\phi^2))$. A simplifying approach, more feasible for large T, is to condition
on the first observation in a model involving a first-order lag in y, so that y1 is non-stochas-
tic (Bauwens et al. 1999, p.135). Geweke and Keane (2000) and Lancaster (2002) consider
Bayesian approaches to the dynamic linear longitudinal model, in which the model for
period 1 is not necessarily linked to those for subsequent periods in a way consistent with
stationarity.
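The stationary form of the model for the first observation can be motivated by iterating the first-order recursion for the mean and variance of a single subject. The sketch below assumes a time-constant regression term c (standing in for the sum of the fixed and random effect terms) and hypothetical parameter values, and compares the limits with c/(1 − ϕ) and σ²/(1 − ϕ²).

```python
def iterate_ar1_moments(c, phi, sigma2, steps=200):
    """Iterate mean_t = c + phi * mean_{t-1} and var_t = sigma2 + phi^2 * var_{t-1}."""
    mean, var = 0.0, 0.0
    for _ in range(steps):
        mean = c + phi * mean
        var = sigma2 + phi ** 2 * var
    return mean, var

c, phi, sigma2 = 1.5, 0.6, 0.8           # hypothetical values with |phi| < 1
limit_mean, limit_var = iterate_ar1_moments(c, phi, sigma2)

stationary_mean = c / (1 - phi)          # matches the mean in the model for y_i1
stationary_var = sigma2 / (1 - phi ** 2) # matches the variance of u_i1
```

When |ϕ| ≥ 1 the recursions do not converge, which is why a separate, unlinked model for the first period is needed in non-stationary specifications.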
Maximum likelihood analysis of dynamic longitudinal models is subject to an initial
conditions problem if in fact there is correlation between the permanent subject effects bi
and the initial observations (Hsiao, 1986). In case of such correlation, possible options are a
joint random prior (e.g. bivariate normal) involving bi and ui1 (Dorsett, 1999), or a prior for
bi that is conditional on yi1, such as (Wooldridge, 2005; Hirano, 2002)
Dynamic linear models may be extended in several ways, to include ARMA(p,q) error
schemes, effects of time functions, or random variation over subjects or times in the
impacts of lagged predictors. For example, a dynamic model for earnings (e.g. Galler, 2001)
might include AR1 autocorrelated errors as in
where the random effects b1i and b2i allow subject specific variation in wage level and wage
growth. Taking the time function to be an unknown function of t, δt, leads to autoregressive
latent trait models (Bollen and Curran, 2004). Allowing for time-varying coefficients on
lagged responses $y_{i,t-1}$, as well as random subject intercepts and growth rates, one might
then have
$$\mu_{it} = \exp(X_{it}\beta + \phi y_{i,t-1} + b_i), \quad t = 2, \ldots, T$$
are mentioned by Fahrmeir and Tutz (2001). This option for modelling lag response impacts
defines the Markov property scheme studied by Fotouhi (2007), under which the initial
observation is modelled as
$$\mu_{i1} = \exp(X_{i1}\beta + c_i),$$
where Xi1 includes any relevant predictors for the first period, and the subject effects bi and
ci follow a bivariate normal with correlation ρ.
Alternatively, the impact of a lagged count response may be modelled by a log or other
transform g(y), with extra preset or unknown parameters in case the lagged y is zero. Thus
if $g(y) = \log(y + c)$, where c = 1 (say), one has
$$\mu_{it} = \exp(X_{it}\beta + \phi g(y_{i,t-1}) + b_i), \quad t = 2, \ldots, T$$
$$\mu_{it} = \phi y_{i,t-1} + \exp(X_{it}\beta + b_i).$$
while the full autoregressive conditional Poisson specification (Jung et al., 2006) specifies
In contrast to count regression, regression for binary responses $y_{it} \sim \mathrm{Bern}(\pi_{it})$ may straightforwardly include lags in observed outcomes $y_{i,t-s}$, leading to Markov chain models
(Kedem and Fokianos, 2005). First order Markov dependence, as in
$$\mathrm{logit}(\pi_{it}) = \alpha_0 + \alpha_1 y_{i,t-1} + X_{it}\beta + b_i,$$
$$\mathrm{logit}(\pi_{it}) = \alpha_0 + \sum_{s=1}^{L} \alpha_s y_{i,t-s} + X_{it}\beta + b_i,$$
with L preset or determined by selection (Erkanli et al., 2001). Alternatively, fixed predictor
effects β, and parameters for random effects bi, may vary according to the previous value
s of the binary response; so $\{\beta_s, D_s\}$ are specific to previous response $y_{i,t-1} = s$ (Islam and
Chowdhury, 2006).
Such alternatives extend in principle to multinomial outcomes $y_{it} \in (1, \ldots, K)$, or equivalently $d_{ikt} = 1$ if category k applies (or is chosen), and $d_{ikt} = 0$ otherwise. So
where nit = 1. Use of lags is complicated by the possible influence of cross-category lags as
well as own-category lags. Pettitt et al. (2006) consider a Bayesian hierarchical multinomial
model for changes in employment status (a trichotomy), with one period lags in status as
predictors. Thus, with employment status 1 as the reference (and so ϕi1t = 1), one has for t > 1
where bik are category specific random effects. For the initial period, a static multinomial
logit model can be adopted, without lag effects or bik, and with distinct regression effects,
namely
$$\log(\phi_{ik1}) = \delta_k X_{i1}, \quad k = 2, \ldots, K.$$
This follows from a linear approximation to the reduced form obtained when lagged
response variables are replaced by their specifications under the dynamic model for peri-
ods preceding t = 1.
Dynamic modelling approaches may also be applied using latent metric responses, associated with binary or ordinal observations. Suppose observations yit are binary such that
the latent continuous response $y_{it}^* > 0$ if and only if yit = 1, and $y_{it}^* \le 0$ if yit = 0. Then one
might specify
with $u_{it} \sim N(0, 1)$, and lag one dependence on both previous events and latent utilities. If
there is serial correlation (e.g. AR1 dependence) in the errors, then $\varepsilon_{it} = \rho_1\varepsilon_{i,t-1} + u_{it}$, with
$u_{it} \sim N(0, 1)$. In this way, one may avoid spurious state dependence in which previous
responses proxy unobserved variation.
observation. The analysis here is based on a 10% sample of the n = 4164 subjects who
have at least two measurements on yearly log earnings, where earnings figures for each
subject are divided by calendar year averages to correct for inflation. In this way, the
earnings profile of a subject observed over 1968–1975 (say) can be compared with that
for a subject observed over 1978–1985. An alternative might be to have fixed or random
effects for each calendar year to model population trends in average income.
Although not all years were subject to survey updates, the analysis here takes a sub-
ject’s entire observation span (obtained by comparing initial and last observation year)
to define that subject’s total times Ti. Any intervening years without observations are
treated as missing data, whether this is due to intermittent missingness or the absence
of an NLS update in particular years. Thus the first subject is observed on twelve occa-
sions (in the studies in 1970, 1971, 1972, 1973, 1975, 1977, 1978, 1980, 1983, 1985, 1987, and
1988), but that subject’s total times Ti is set at 19, with the intervening years without
observations (e.g. 1974, 1976, etc.) treated as missing data. Missingness is taken to be at
random, not depending on the possibly missing response value.
With yit denoting (inflation corrected) log earnings, the initial regression model
includes subject effects bi, and fixed binary attributes $\{W_{1i}, W_{2i}, W_{3i}\}$, with W1i for college
graduate (=1, 0 otherwise), W2i for white ethnicity (=1, 0 for other ethnicities), and an
interaction W3i = W1iW2i. So, for $i = 1, \ldots, n$,
where $b_i \sim N(\beta_1, D)$, and $u_{it} \sim N(0, \sigma^2)$. Uniform U(0,10) priors are assumed for σ and
$\sigma_b = D^{0.5}$, and N(0,1000) priors for fixed effects $\{\beta_1, \gamma_1, \gamma_2, \gamma_3\}$.
Estimation using jagsUI gives posterior means (sd) for γ1 and γ2 of 0.32 (0.05) and 0.036
(0.022), showing significantly higher earnings for college graduates, and a positive but not
significant white ethnicity effect. The effect of the interaction term is significantly negative, with mean (sd) of −0.12 (0.06), suggesting a greater positive impact of college education on earnings for non-white subjects. The posterior mean for the standard deviation
of the bi is 0.18, so that a subject for whom bi is one standard deviation above the average
would have earnings about 20%, namely 100[exp(0.18) − 1], above average, given observed
personal characteristics Wi. Taking $\hat{u}_{it} = y_{it} - \beta_1 - b_i - W_i\gamma$, there is evidence of autocorrelated errors, with the 95% interval for the statistic
$$r_u = \sum_{i=1}^{n}\sum_{t=2}^{T_i}\hat{u}_{it}\hat{u}_{i,t-1} \Big/ \sum_{i=1}^{n}\sum_{t=1}^{T_i}\hat{u}_{it}^2$$
being (0.06,0.11).
To improve fit, a second, dynamic model is non-stationary, in that there is a distinct
model for the first period for each subject (Geweke and Keane, 2000), and a one period
lag effect ϕ of earnings, with this effect not constrained to stationarity. Random subject
effects are also included in the model for periods $t = 2, \ldots, T_i$ so that
$$y_{i1} = W_i\gamma_1 + u_{i1},$$
with an N(0,1) prior on ϕ, and with $u_{it} \sim N(0, \sigma^2)$ and $u_{i1} \sim N(0, \sigma_1^2)$ taken independently.
The 95% interval for ϕ is obtained as (0.57, 0.64), along with considerably reduced auto-
correlation, with 95% interval for ru now from −0.048 to −0.003. Fit is improved, with
WAIC now lower at −1530, compared to −387 under the non-dynamic model. The γ coef-
ficients are reduced in absolute size, but the college effect γ1 remains significant, with
95% interval (0.08,0.17). The posterior mean for σb is also reduced, to 0.072. There is scope
for further model development, as the probability that ru is positive is low (under 0.02),
and this might involve subject specific lag parameters, random slopes on time, or auto-
correlated errors.
$$\mu_{i1} = \exp(X_{i1}\beta + c_i),$$
$$(b_{i1}, b_{i2}, c_i) \sim N_3(0, D),$$
where $D^{-1} \sim W(I, 3)$, and fixed effects are assigned N(0,10) priors. Estimation using jagsUI
shows neither the main treatment effect nor the treatment by time effect to be significant,
while the 95% interval for the coefficient ϕ on lagged seizure counts is (−0.009,−0.002).
The correlation between bi1 and ci is 0.77. This model leaves excess dispersion: the mean
scaled deviance of 440 (Fit[1] in the code) exceeds the number, 5 × 59 = 295, of observa-
tions, an issue returned to in Example 10.10. Predictive discrepancy shows in a posterior
predictive check involving the deviance, with zero probability that the deviance involv-
ing replicate data exceeds the deviance for the actual data. On the other hand, mixed
predictive checks (Marshall and Spiegelhalter, 2007), denoted exc.mx[i,t] in the code, do
not show an excess of tail value probabilities: 10 under 0.05 and 10 over 0.95.
A second analysis replicates model 1 except in taking $(b_{i1}, b_{i2}, c_i)$ as multivariate skew
Student t, to account for possibly heavy tailed or skew random effects. Thus
$$\xi_i \sim \mathrm{Ga}\left(\frac{\nu}{2}, \frac{\nu}{2}\right),$$
where the Wji are independently half normal N+(0,1), and the skew parameters have
$\delta_k \sim N(0, 10)$ priors. The degrees of freedom ν has a set value, ν = 4, providing a robust
setting (Gelman et al., 2014), as estimation of ν may be sensitive to priors adopted.
The skew parameters have 95% credible intervals (−0.04,0.82), (−1.01,0.28), and (−0.16,
0.13). The lowest scale factors (xi[i] in the code) are for subjects 49, 18, 15, and 8, namely
$\xi_{49} = 0.28$, $\xi_{18} = 0.37$, $\xi_{15} = 0.44$, and $\xi_8 = 0.45$ (cf. Fotouhi, 2007). This extension reduces
the LOO-IC slightly, from 1,764 to 1,762.
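The role of the gamma scale factors can be illustrated by simulation. The sketch below covers only the symmetric heavy-tailed part of the model (the skewness terms are omitted) and assumes a univariate effect with unit scale: drawing ξ from a gamma density with shape and rate both ν/2, and then a normal effect with variance 1/ξ, gives a Student t effect with ν degrees of freedom. The seed and sample size are arbitrary.

```python
import math
import random

random.seed(1)
nu, sigma, n = 4, 1.0, 20000

# random.gammavariate takes (shape, scale), so rate nu/2 corresponds to scale 2/nu.
xi = [random.gammavariate(nu / 2, 2 / nu) for _ in range(n)]

# Conditional on xi, the effect is normal with standard deviation sigma / sqrt(xi).
b = [random.gauss(0.0, sigma / math.sqrt(x)) for x in xi]

mean_xi = sum(xi) / n                                     # prior mean of xi is 1
tail_fraction = sum(abs(v) > 3 * sigma for v in b) / n    # mass beyond 3 sigma
```

Subjects whose posterior scale factor is well below 1, such as those listed above, are the ones whose effects are treated as coming from a higher-variance component, which is why low ξi values flag outlying subjects.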
The general linear mixed model for y, possibly a discrete response, may not include
observation level residuals, and for overlapping Xit and Zit, then takes the form
with $(b_{1i}, b_{2i}, \ldots, b_{Qi}) \sim N_Q(0, D)$, where, again, normality of errors and constant dispersion
D are default assumptions.
Violations of standard assumptions regarding the forms of error density, or of homoscedasticity, are likely to affect inferences. Among principles that may provide a robust
approach to departures from such standard assumptions is that of embedding the model
in a more general framework (Zhang et al., 2014; Ma et al., 2004; Rice, 2005), with conven-
tional assumptions (e.g. normality and homoscedasticity of errors) as special cases of a
broader model.
Following Chapter 8, assumptions of homoscedasticity at level 1 (repeated observations
within subjects) or at level 2 (heterogeneity between subjects) may be modified to allow
more general variance functions varying over subjects, times, or both, including depen-
dence of the variance on subject or observation level attributes. For example, heterosce-
dasticity may exist in the permanent random effects component of longitudinal models,
which may be modelled by variance regression in a positive function. For varying intercepts
bi as in (10.4), one might relate subject specific variances Di to predictor values averaged
over time, X̄i, as in
$$D_i = a^2 (1 + \varphi \bar{X}_i)^2,$$
where terms in the scalar or vector φ are positive. Heteroscedasticity may be considered at
observation level, so that for yit = Xit β + Zit bi + uit one might take
Thus yit ~ N(ηit, ηit^ω/τ), where ω is an unknown power and τ is an overall precision parameter,
and ω = 0 corresponds to homoscedasticity.
Similarly, more general error densities allowing for skewness, heavy tails, or other non-
normal features may be adopted, with the standard assumptions embedded within them.
Alternatives to assuming multivariate normal subject effects may include heavy tailed
Student t heterogeneity (Zhang et al., 2014; Chib, 2008; Lin and Lee, 2006), skew normal
and skew-t densities, and skew-elliptical densities (Ma et al., 2004). Thus, the normal linear
mixed model can be embedded within a wider class of scale mixture normal densities,
with the subject or observation level scale parameters measuring outlier status (Wakefield
et al., 1994; Chib, 2008). Thus, the model of (10.2), with normal cluster effects bi and normal
residuals uit, is a special case of a scale mixture model with
A widely applied option takes the densities {Gλ,Gξ} to be gamma with equal scale and
shape, νλ/2 and νξ/2 respectively, leading to multivariate t densities with {νλ,νξ} degrees of
freedom. This provides resistance to atypical data at both observation and cluster levels.
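The link between gamma scale mixing and the Student t can be checked by simulation. The following is a minimal sketch rather than code from the book: it assumes a univariate effect, a hypothetical value ν = 10 (chosen so that the fourth moment exists and the sample variance is stable), and the rate parameterisation Ga(ν/2, ν/2) for the scale factors.

```python
import random

def scale_mixture_draws(nu, n_draws, seed=1):
    """Draw y = z / sqrt(lam), with lam ~ Ga(nu/2, rate nu/2) and z ~ N(0, 1)."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        # random.gammavariate takes (shape, scale); a rate of nu/2 is a scale of 2/nu
        lam = rng.gammavariate(nu / 2.0, 2.0 / nu)
        draws.append(rng.gauss(0.0, 1.0) / lam ** 0.5)
    return draws

nu = 10.0  # hypothetical degrees of freedom, for illustration only
y = scale_mixture_draws(nu, 200_000)
sample_var = sum(v * v for v in y) / len(y)
theory_var = nu / (nu - 2.0)  # variance of a Student t with nu degrees of freedom
print(round(sample_var, 3), round(theory_var, 3))
```

Low sampled values of the scale factor inflate the variance of the corresponding draw, which is why small posterior scale factors flag atypical subjects or observations.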
For possibly skew residual or subject effects, skew-normal or skew-t densities may be
adopted. Ghosh et al. (2007) consider bivariate skew-normal errors at both subject and
observation level in a linear longitudinal model for metric responses, while Jara et al.
(2008) allow both subject random effects and observation level errors to follow a multivari-
ate skew-t distribution. Thus, for a linear mixed model for y of dimension Ti
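$$y_i = X_i \beta + Z_i b_i + u_i, \qquad (10.7)$$
suppose yi follows the multivariate skew-t density (Sahu et al., 2003). Then
$$p(y \mid \beta, b, \sigma^2, R, \Delta) \propto \prod_{i=1}^{n} 2^{T_i}\, t_{T_i,\nu}(y_i \mid X_i \beta + Z_i b_i, \sigma^2 R_i + \Delta_i^2) \times \int_0^{\infty} t_{T_i,\nu}(w_i \mid \mu_w, \Sigma_w)\, dw_i,$$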
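$$E(y_i \mid \beta, b_i, \sigma^2, \delta) = X_i \beta + Z_i b_i + (\nu/\pi)^{0.5}\, \frac{\Gamma[(\nu - 1)/2]}{\Gamma(\nu/2)}\, \delta 1_{T_i},$$
$$\mathrm{Var}(y_i \mid \beta, b_i, \sigma^2, \delta) = \frac{\nu}{\nu - 2}\,(\sigma^2 + \delta^2) I_{T_i} - (\nu/\pi) \left[ \frac{\Gamma[(\nu - 1)/2]}{\Gamma(\nu/2)} \right]^2 \delta^2 I_{T_i}.$$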
Under the reductions Ri = I_{Ti}, δit = δ, the conditional density may be described by a mixture
of normal distributions by conditioning on positive variables wi = (w1i, …, wTi,i) obtained by
truncated sampling from a multivariate normal with identity covariance matrix of dimension
Ti and subject-specific scalings λi ~ Ga(ν/2, ν/2), so that
$$y_i \mid \beta, b_i, \sigma^2, w_i, \lambda_i, \delta \sim N_{T_i}\left(X_i \beta + Z_i b_i + \delta w_i, \frac{\sigma^2}{\lambda_i} I\right),$$
$$w_i \sim N_{T_i}\left(0, \frac{1}{\lambda_i} I\right) I(0, \infty).$$
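The mean shift induced by the positive variables can be verified numerically. The sketch below is illustrative only: it assumes a single observation (Ti = 1), hypothetical values ν = 4 and δ = 0.5, and compares the simulated mean of δw with the factor (ν/π)^0.5 Γ[(ν − 1)/2]/Γ(ν/2) multiplying δ.

```python
import math
import random

def half_t_mean(nu):
    """Mean of the positive variable w after mixing over lam ~ Ga(nu/2, nu/2)."""
    return math.sqrt(nu / math.pi) * math.gamma((nu - 1.0) / 2.0) / math.gamma(nu / 2.0)

def simulated_shift(nu, delta, n_draws, seed=2):
    """Monte Carlo mean of delta * w, with w ~ N(0, 1/lam) truncated to (0, inf)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        lam = rng.gammavariate(nu / 2.0, 2.0 / nu)      # (shape, scale) form
        w = abs(rng.gauss(0.0, 1.0)) / math.sqrt(lam)   # half normal with variance 1/lam
        total += delta * w
    return total / n_draws

nu, delta = 4.0, 0.5  # hypothetical values, for illustration only
theory = delta * half_t_mean(nu)
simulated = simulated_shift(nu, delta, 200_000)
print(round(theory, 4), round(simulated, 4))
```

Taking the absolute value of a zero-mean normal draw is equivalent to sampling from the normal truncated to positive values, which keeps the sketch free of rejection steps.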
In the (usual) case when Xi β + Zi bi contains an intercept, then for identifiability, the elements
in the vector wi may be centred (subsequent to truncated sampling) (Jara et al., 2008).
Thus, at each iteration, the average w̄i of the wit can be obtained, and then the centred variables
Wit = wit − w̄i, so that
$$y_{it} \sim N\left(X_{it} \beta + Z_{it} b_i + \delta W_{it}, \frac{\sigma^2}{\lambda_i}\right).$$
Additionally, in the model (10.7), the permanent random effects bi may also be taken as skew
multivariate t. Assuming the Z predictors are a subset of the X predictors, one then has
where D is Q × Q, νb is the degrees of freedom, and Γi = diag(γ1i, …, γQi) contains skewness
parameters relevant to the permanent effects. Assuming common skew parameters
Γi = Γ = diag(γ1, …, γQ), and conditional on a Q vector of positive variables, hi = (h1i, …, hQi),
with
$$h_i \sim N_Q\left(0, \frac{1}{\xi_i} I\right) I(h_i > 0), \qquad (10.8)$$
$$\xi_i \sim \mathrm{Ga}\left(\frac{\nu_b}{2}, \frac{\nu_b}{2}\right),$$
the random effects are mixtures of normals, namely
$$b_i \sim N_Q\left(\Gamma h_i, \frac{1}{\xi_i} D\right).$$
For improved identification, the hi can be centred around their means (at each MCMC iteration),
namely Hqi = hqi − h̄q., so that
$$b_i \sim N_Q\left(\Gamma H_i, \frac{1}{\xi_i} D\right).$$
where the factors {φb > 1, φu > 1} are used for variance inflation for the outlier group. The
prior probabilities of being in the main population are set high (e.g. πb = πu = 0.95), and
variance inflation factors are typically large, e.g. φb = φu = 5 or 10. Provided one or other of
the parameter sets {πb, πu} or {φb, φu} is assumed known (i.e. is assigned preset values), the
other set may be taken as unknowns.
Another option is “switching” or shift priors whereby one group has zero effects, but a
minority group has non-zero effects. These may be used for iid errors introduced to reflect
overdispersion in count or binomial data. For example, for yit ~ Po(μit), one may have
where σ is a scale factor, kit ~ Bern(πu), uit ~ N(0, 1), such that observation level effects are
zero when kit = 0. One may preset πu low, say πu = 0.05. For a longitudinal series with level
ct subject to possible shifts, and Xit not containing an intercept, one may similarly propose
that
rationale for assuming subject level effects bqi follow a discrete prior at subject level. The
hyperparameters governing the subject effects {b1i , b2i , … , bQi } then become specific for the
latent category. Thus, in a growth curve model for modelling changes in aggression rat-
ings, Muthen et al. (2002) assume that a small number of latent trajectories characterise
growth in aggression. For subject i, let the latent category be denoted ki ∈ (1, …, K). Then
conditional on ki = k, (10.5) would become
where (e1i, e2i, e3i) ~ N3(0, Dk). Observation level dispersion parameters may also differ
according to latent group.
Flexible discrete mixture models are also obtained under Dirichlet process and related
semiparametric priors (Dunson, 2009), as considered for repeated binary data by Quintana
et al. (2008), for longitudinal count data by Kleinman and Ibrahim (1998), and for mul-
tiple membership longitudinal models by Savitsky and Paddock (2014) and Paddock and
Savitsky (2013). Averaging over different numbers of mixture components K is possible
under discrete parametric mixture models using the RJMCMC (reversible jump MCMC)
algorithm – see Ho and Hu (2008) for an application to the linear mixed model. In the non-
parametric mixture approach, the number of clusters is an outcome of other parameters
such as the Dirichlet process mass parameter κ. Under the truncated Dirichlet process
(Ohlssen et al., 2007), one may set a maximum Km possible clusters, with the realised number
at each iteration being K ≤ Km. The posterior density of K will indicate whether the
assumed maximum Km is sufficient.
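The way a truncated Dirichlet process yields a realised number of clusters K ≤ Km can be illustrated with a stick-breaking draw. The sketch below is an assumption-laden illustration, not the book's code: it draws from the prior only (no likelihood), uses hypothetical values κ = 1.5, Km = 20, and 59 subjects, and the function name is invented for the example.

```python
import random

def truncated_dp_allocation(kappa, k_max, n_subjects, seed=3):
    """One prior draw of cluster weights and subject labels under a truncated DP."""
    rng = random.Random(seed)
    weights, remaining = [], 1.0
    for k in range(k_max):
        # Stick-breaking: V_k ~ Beta(1, kappa); the last V is set to 1 so weights sum to 1
        v = 1.0 if k == k_max - 1 else rng.betavariate(1.0, kappa)
        weights.append(v * remaining)
        remaining *= 1.0 - v
    labels = rng.choices(range(k_max), weights=weights, k=n_subjects)
    return weights, labels

weights, labels = truncated_dp_allocation(kappa=1.5, k_max=20, n_subjects=59)
realised_k = len(set(labels))  # number of clusters actually occupied in this draw
print(realised_k)
```

If the realised K repeatedly sits at or close to Km over MCMC iterations, the truncation level is probably too low; larger κ spreads mass over more clusters.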
Hirano (2000, 2002) discusses non-parametric alternatives regarding white noise obser-
vation errors uit in longitudinal data, while Kleinman and Ibrahim (1998) and Muller and
Rosner (1997) consider mixed Dirichlet process (MDP) modelling of Q dimensional unit
level effects bi. Under the MDP option, one has bi following a density G which is itself
unknown, centred on a specified base density G0 with precision κ. For example, with a base
density G0 = NQ (B, D), one has
where priors on β, B, D, σ² are typically as considered above. This is the conjugate MDP
prior for the normal linear mixed model which tends to the conventional hierarchical prior
as κ → ∞.
The model considered by Hirano (2002) is also conjugate, and based on a dynamic model
$$y_{it} = b_i + \rho y_{i,t-1} + u_{it},$$
where the bi are zero mean effects that are modelled parametrically, and uit = yit − bi − ρ yi,t−1
may have non-zero means. One has for θit = {μit, σit²}
where G0 specifies
$$G_0(\mu, \sigma^2): \quad \frac{1}{\sigma^2} \sim \frac{\chi^2(s)}{s\Lambda}; \quad \mu \sim N(m, b\sigma^2),$$
where s, Λ, m and b are preset. As discussed in Chapter 3, κ may be preset or taken as an
unknown. Thus, Kleinman and Ibrahim (1998, p.2592) consider defaults such as κ = 1.5 and
κ = 100, while Hirano (2002) takes κ ~ Ga(2, 0.5).
where αi > 0 and βi > 0 are respectively the volume of distribution and clearance param-
eters for each subject. A hierarchical model is proposed with the second stage consisting
of a multivariate normal or multivariate Student t for the transformed subject effects
(b1i, b2i) = {log(αi), log(βi)}.
For the first stage density, one option is a log-normal since y is positive, or a truncated
normal, with yit constrained to be positive. Under the latter, a heteroscedastic power model,
with a single precision parameter τ, leads to a variance ηit^ω/τ, and the first stage model is
Note that zero y values are replaced by 0.001 to avoid conflict with this density assump-
tion. Another option for the first stage model involves a normal scale mixture, namely
$$\lambda_i \sim \mathrm{Ga}(0.5\nu, 0.5\nu).$$
Here these options are compared under the priors τ ~ Ga(1, 0.001) and ν ~ Ga(2, 0.1). A
uniform U(0,5) prior is assumed for ω, as in Wakefield et al. (1994).
At the second stage, a bivariate normal for (b1i, b2i) is assumed with
$$D^{-1} \sim W([rR]^{-1}, r),$$
FIGURE 10.1
Predictive distribution of y2 (vertical axis: frequency).
Inferences are based on runs of 10,000 iterations using the rube package. The scale
mixture model with variances ηit^ω/[λi τ] has a lower DIC, namely −183.5, as compared to
−180 for the model with variances ηit^ω/τ. The posterior mean ν under the scale mixture
is 16, with the lowest scale parameter for subject 2, λ2 = 0.66. The power ω is estimated
at 0.86 under the better performing model, whereas a log-normal model would imply
ω = 2.
Out-of-sample predictions of concentrations are made for a duration of 32 hours.
For subject 2, whose plasma concentrations remain relatively high compared to other
subjects, the mean prediction is 0.11. Figure 10.1 shows the predictive distribution.
Inferences on the population distribution of concentration parameters are important,
for example, the half-life (period of time required for a drug concentration to be reduced
by one-half), which for patient i is αi log(2)/βi. The median half-life and clearance are
obtained, and Figure 10.2 shows the corresponding bivariate posterior plot.
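The half-life calculation can be applied draw by draw to posterior samples of the transformed effects. The following is a minimal sketch under stated assumptions: the function names are illustrative, and the three sampled pairs (b1i, b2i) are made-up numbers rather than output from the fitted model.

```python
import math
import statistics

def half_life(volume, clearance):
    """Half-life alpha * log(2) / beta for volume alpha > 0 and clearance beta > 0."""
    if volume <= 0 or clearance <= 0:
        raise ValueError("volume and clearance must be positive")
    return volume * math.log(2.0) / clearance

def half_life_from_effects(b1, b2):
    """Same quantity from the log scale effects b1 = log(alpha), b2 = log(beta)."""
    return math.exp(b1 - b2) * math.log(2.0)

# Hypothetical posterior draws of (b1i, b2i) for one subject
draws = [(0.60, -0.10), (0.55, -0.05), (0.70, -0.20)]
median_half_life = statistics.median(half_life_from_effects(b1, b2) for b1, b2 in draws)
print(round(median_half_life, 3))
```

Summarising the transformed draws, rather than transforming posterior means, gives the posterior median of the half-life itself.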
FIGURE 10.2
Bivariate posterior.
skew in the random effects (b1i, b2i). For skew bivariate normal random effects, one has
(as model 1)
$$b_i \mid D, \Gamma \sim SN(0, D, \Gamma),$$
$$h_i \sim N_Q(0, I)\, I(0, \infty),$$
(with Q = 2), the random intercepts and slopes in (10.8) are obtained as
$$b_i \sim N_Q(\Gamma h_i, D).$$
While centred positive variables hi and wit may be preferred for identification, this slows
MCMC analysis considerably and uncentred effects are used for illustration.
Estimates for model 1 using jagsUI show significant skewness in subject intercepts
b1i, but not in the time slopes, with the respective γ parameters having 95% intervals
(0.38,0.60) and (−0.26,0.23). The LOO-IC is obtained as −20.5 under both models. Under
model 2, the δt parameters all have credible intervals straddling zero, but earlier ones
are biased to positive values.
where α is the regression intercept, and a U(0,100) prior is assumed for D^0.5. Additionally,
a uniform shrinkage prior (Natarajan and Kass, 2000) is adopted in relation to the other
variance component var(uit) = σu², with
$$\phi = \frac{D}{D + \sigma_u^2} \sim U(0, 1).$$
Estimation using jagsUI gives posterior means for σu² and D of 0.13 and 0.27. The posterior
mean of the scaled deviance is now 271, and a posterior predictive check is satisfactory,
providing a probability of 0.26 that the deviance involving replicate data exceeds
the deviance for the actual data. The LOO-IC is 1200.
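A uniform shrinkage prior on the ratio D/(D + σu²) implies a prior on D conditional on σu², obtained by inverting the ratio. The sketch below is illustrative only: it fixes σu² at a hypothetical value of 0.13 and uses invented function names, so it shows the implied prior rather than the posterior analysis described above.

```python
import random
import statistics

def implied_variance_draws(sigma2_u, n_draws, seed=5):
    """Draw the shrinkage ratio from U(0, 1) and convert to D = sigma2_u * ratio / (1 - ratio)."""
    rng = random.Random(seed)
    ratios = [rng.random() for _ in range(n_draws)]      # values in [0, 1)
    variances = [sigma2_u * r / (1.0 - r) for r in ratios]
    return ratios, variances

sigma2_u = 0.13  # hypothetical fixed value of the observation level variance
ratios, variances = implied_variance_draws(sigma2_u, 20_001)
prior_median_D = statistics.median(variances)
print(round(prior_median_D, 3))
```

The implied prior median of D equals σu² (a ratio of 0.5), so the prior treats the two variance components symmetrically while still allowing very large or very small D.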
One may also model subject intercept heterogeneity using a discrete mixture of inter-
cepts, so avoiding parametric assumptions about such heterogeneity. Thus
with a U(0,100) prior for D^0.5, and ϕ = D/(D + σu²) ~ U(0, 1). πδ may be an unknown or
preset. Here a value πδ = 0.10 is adopted, so that the posterior values Pr(δi = 1|y) can
provide clear contrasts to the prior values Pr(δi = 1) = πδ. From a two-chain run of
10,000 iterations, it emerges that seven patients have sufficiently high posterior odds
Pr(δi = 1|y)/(1 − Pr(δi = 1|y)) to provide marginal Bayes factors exceeding 3, namely 10,
11, 16, 25, 39, 53, and 56. This model gives a LOO-IC of 1,215.
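The marginal Bayes factor used here is the ratio of posterior odds to prior odds. A minimal sketch of the arithmetic follows; the function names are illustrative, and the posterior probability of 0.40 is a made-up value, not a result from the analysis.

```python
def marginal_bayes_factor(posterior_prob, prior_prob):
    """Posterior odds of delta_i = 1 divided by the prior odds."""
    posterior_odds = posterior_prob / (1.0 - posterior_prob)
    prior_odds = prior_prob / (1.0 - prior_prob)
    return posterior_odds / prior_odds

def posterior_prob_threshold(bayes_factor, prior_prob):
    """Posterior probability at which the marginal Bayes factor equals a given value."""
    odds = bayes_factor * prior_prob / (1.0 - prior_prob)
    return odds / (1.0 + odds)

prior = 0.10                                        # preset prior probability from the text
threshold = posterior_prob_threshold(3.0, prior)    # posterior probability giving a factor of 3
example_bf = marginal_bayes_factor(0.40, prior)     # hypothetical posterior probability of 0.40
print(round(threshold, 3), round(example_bf, 2))
```

With a prior probability of 0.10, a Bayes factor above 3 corresponds to a posterior probability above 0.25, which is why a low preset prior gives clear contrasts.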
A fourth analysis reverts to random subject effects only, but allows for possible
non-normality via a mixed Dirichlet process. The random effects are bivariate, with
non-zero means, one for the intercept B1 (α in above models) and one for a linear slope
B2 on visit. Thus
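$$\begin{aligned}
(b_{1i}, b_{2i}) &\sim G, \\
G &\sim DP(\kappa G_0), \\
G_0 &= N_Q(B, D), \\
\kappa &\sim \mathrm{Ga}(2, 4), \\
(B_1, B_2) &\sim N_2(0, 1000I), \\
D^{-1} &\sim W(R, r), \\
R &= \mathrm{diag}(20), \\
r &= 10,
\end{aligned}$$
so that E(D^{-1}) = diag(0.5), as in Kleinman and Ibrahim (1998). The maximum number of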
possible clusters is set at Km = 20.
Conditional on the particular choice made for the prior on κ, one obtains a mean
scaled deviance of 417, still leaving excess variability. The posterior density for the
realised number of clusters has 0.025 and 0.975 percentiles at 6 and 13, with mean 8.8,
while κ has posterior mean 1.32. Histograms of the mean {b1i,b2i} with superimposed
normal curves show excess kurtosis (i.e. peaked densities) (Figures 10.3 and 10.4).
FIGURE 10.3
Histogram with normal curve, varying intercepts.
FIGURE 10.4
Histogram with normal curve, varying slopes.
Autocorrelated errors may also be required to model temporal dependencies, so that the
unexplained variance may be due to a number of sources. For example, Lee and Hwang
(2000) consider a normal mixed effects model, applicable in growth curve applications
with multiple groups of subjects j = 1, …, J, with
where
and the variances of bij, uijt, and vijt are subject to uniform shrinkage priors (see Section
10.2.3).
For nested longitudinal data, inferences (e.g. on growth patterns) may be improved by
borrowing strength over clusters. Similarly, with longitudinal data on multiple outcomes,
ymit for subjects i = 1, … , n , outcomes m = 1, … , M , and repetitions t = 1, … , T , inferences
on particular outcomes may be strengthened by incorporating correlations between out-
comes. An example might be for longitudinal data on correlated, but relatively rare, spa-
tially configured health events, such as cancer types. Multiple outcome longitudinal data
are common in clinical and educational applications, and the effectiveness of interventions
may be judged in terms of multiple (usually) correlated outcomes rather than by a single
criterion (Dunson, 2007). In environmental applications, multiple outcomes with related
aetiology are likely to be correlated (e.g. Liu and Hedeker, 2006; Jorgensen et al., 1999).
With metric or discrete data ymit for multiple outcomes, the general linear mixed model
with time homogenous, time-varying, and subject varying predictor effects becomes
where Xit, Zit, and Hit are of length P, Q, and R. For example, consider multivariate repeated
binary responses,
$$y_{mit} \sim \mathrm{Bern}(\pi_{mit}),$$
Given multivariate random outcome-subject effects bmi, and fixed effects cmt subject to an
identifying corner constraint such as cm1 = 0 or cmT = 0, the ymit are assumed conditionally
independent.
The corresponding normal linear mixed model for multivariate metric responses is
where the residuals umit are typically iid normal with variances σm² specific to outcome m.
The M sets of permanent effect priors bmi, each of dimension Q, may be correlated between
predictors q within outcomes m, or between outcomes m within predictors q, or most
generally over both predictors q and outcomes m. The same applies to the outcome-time
effects cmt, which may be random, and incorporate short range temporal dependence. For
example, time-varying intercepts ct = (c1t , … , cMt ) in the case R = 1 (and Hit = 1) could follow
autoregressive or random walk priors correlated over outcomes, as in
$$Y_i = X_i \beta + Z_i b_i + H_i c_t + u_i,$$
$$\theta_{im} \sim \mathrm{Ga}(a_m, a_m),$$
represent subject-outcome permanent random effects. The ξmit represent observation level
effects that are iid, or autoregressive, as in
$$\xi_{mit} \sim \mathrm{Ga}(b_m \xi_{mi,t-1}, b_m)$$
Structural equation models for longitudinal data typically involve both response indica-
tors ymit of dimension Py which measure latent outcomes ηqit of dimension Qy < Py, and
exogenous predictors xkit of dimension Px which measure latent causal influences ξqit of
dimension Qx < Px (Dunson, 2007). For example, for Qy = Qx = 1, ξit might be a time-varying
stress severity scale related to short-term stressors {xkit , k = 1, … Px } , and ηit might be a
time-varying latent depression scale related to mood scale measures { y mit , m = 1, … Py } .
Then the measurement model is
while the structural model might include a linear effect, possibly time-varying, of ξit on ηit.
A simple common factor model may be applied when there are alternative measuring
scales, typically a gold standard measure, and one or more measures of the same quan-
tity, but less expensive to obtain. Consider a situation where bivariate data { y1ijt , y 2ijt } are
obtained for subjects i within clusters j, where y1ijt denotes repetitions on the standard mea-
sure, and y2ijt denotes repetitions on the proxy measure. The goal is to assess the reliability
of the proxy measure. One may postulate a shared permanent effect bij between the two
outcomes, as well as a unique permanent effect cij for the proxy measure. In the absence of
intercepts for the y1 model, one has
where the bij have non-zero cluster means Bj, but the cij are zero mean effects, namely:
$$b_{ij} \sim N(B_j, D_{1j}),$$
$$c_{ij} \sim N(0, D_{2j}).$$
The residuals are distributed as umijt ~ N(0, 1/τmj). The hypothesis that {aj = 0, λj = 1} corresponds
to y1 and y2 being identically calibrated in group j (Oman et al., 1999, p.43), that is,
they both measure the same quantity on the same scale (see Example 10.11).
possibly also by area or actuarial risk group i = 1, … , n (Chernyavskiy et al., 2019). A fur-
ther cohort dimension c = 1, … , C is implicit in biological age-time data via the relation
c = t − x + X, and there have been extensive developments in Bayesian age-period-cohort
(APC) and area APC (AAPC) models (Lagazio et al., 2003; Schmid and Held,
2004; Bray, 2002; Baker and Bray, 2005), which draw on developments in space-time mod-
els involving spatial and temporal autocorrelation (Quick et al., 2018; Rushworth et al.,
2014; Donald et al., 2015). For rare event totals yixt in relation to large populations Nixt, and
assuming yixt ~ Po(Nixt μixt), a baseline age-period (AP) model might assume independence
of age and period dimensions, with
or equivalently
where structured (e.g. random walk or autoregressive) priors might be adopted for age-area
effects ηix and area-time effects θit, and the intercept κ is identified according to possible
constraints on the random effects.
Thus Clayton and Schifflers (1987) consider data of the form yxt (i.e. without further strat-
ification), with means μxt where
$$\log(\mu_{xt}) = \eta_x + \theta_t,$$
with both sets of effects assumed to be random, though fixed effects may be used when
X or T is small. In the absence of an overall intercept in this model, one or other series
(say ηx) sets the level, and identifiability may be gained by centring the remaining series
θt at zero (possibly repeatedly at each MCMC iteration), or by setting one parameter in the
remaining series to a fixed value e.g. θ1 = 0. If the model includes an overall intercept κ, then
centring both sets of effects, namely Σx ηx = Σt θt = 0, provides a way of ensuring identifiability.
An APC model including a mean and structured age, period and cohort effects is
$$\log(\mu_{xt}) = \kappa + \eta_x + \theta_t + \gamma_c,$$
and identifiability requires either that the three sets of effects be centred, or that edge
constraints such as η1 = θ1 = γ1 = 0 are used to avoid confounding of the three series.
Additionally, the relation c = X − x + t means an extra constraint is needed for full identification,
for example, by taking γ1 = γ2 = 0 (Clayton and Schifflers, 1987).
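The need for an extra constraint follows from the linear dependence between the three time scales: a linear drift can be moved between age, period, and cohort effects without changing any fitted value. The sketch below demonstrates this with made-up effects for X = 3 ages and T = 4 periods; function names and numbers are illustrative assumptions.

```python
def apc_predictor(kappa, eta, theta, gamma, n_ages):
    """log(mu_xt) = kappa + eta_x + theta_t + gamma_c with cohort c = X - x + t."""
    n_periods = len(theta)
    return [[kappa + eta[x - 1] + theta[t - 1] + gamma[n_ages - x + t - 1]
             for t in range(1, n_periods + 1)]
            for x in range(1, n_ages + 1)]

def shift_linear_drift(kappa, eta, theta, gamma, n_ages, a):
    """Add slope a to age and cohort effects, subtract it from period effects."""
    eta_new = [e + a * x for x, e in enumerate(eta, start=1)]
    theta_new = [th - a * t for t, th in enumerate(theta, start=1)]
    gamma_new = [g + a * c for c, g in enumerate(gamma, start=1)]
    return kappa - a * n_ages, eta_new, theta_new, gamma_new

X = 3
kappa, eta = -4.0, [0.0, 0.5, 1.2]
theta = [0.0, -0.1, -0.3, -0.2]
gamma = [0.0, 0.1, 0.1, 0.3, 0.2, 0.4]   # X + T - 1 = 6 cohorts
original = apc_predictor(kappa, eta, theta, gamma, X)
k2, e2, t2, g2 = shift_linear_drift(kappa, eta, theta, gamma, X, a=0.7)
shifted = apc_predictor(k2, e2, t2, g2, X)
max_gap = max(abs(o - s) for ro, rs in zip(original, shifted) for o, s in zip(ro, rs))
print(max_gap)
```

Because the two parameter sets give the same means, the data cannot distinguish them, and a constraint such as fixing two cohort effects removes the free drift.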
The convolution prior of Besag et al. (1991) may be generalised by adopting structured
and iid effects for each time scale, as well as for areas (Knorr-Held, 2000). Hence an APC
model would then become
where u1x, u2t and u3c are iid zero mean random effects, while {ηx, θt, γc} follow structured
(i.e. random walk or other autoregressive) form. For area-age-period data, yixt ~ Po(Nixt μixt),
this approach leads to
where si follows a structured spatial autoregressive prior, but the u4i are iid zero mean
random effects.
In the preceding models, the dimensions are independent and multiplicative in the risk
scale (additive in the log risk scale). In practice, interactions between one or more of the
different time scales, or between the time scales and the units (e.g. areas or actuarial risk
groups), are likely. Interactions ψxc between age and cohort are relevant if the age slope is
changing between cohorts (e.g. cancer deaths at younger ages are less common in recent
cohorts), while in mortality forecasting, age-time interactions ψxt are of interest, since dif-
ferent age groups may be subject to different mortality improvements (Pedroza, 2006; Lee
and Carter, 1992). In area APC models, area-cohort and area-time interactions might be
relevant (Lagazio et al., 2003), while in area life table models (Congdon, 2006), age-area
interactions may be investigated, since deprived areas may have relatively high “prema-
ture” mortality (sometimes defined by death before age 75).
In area-time (spatio-temporal) models, one may extend the RIAS principle, and assume
area-specific random variation for both the level and a time covariate. This amounts to
taking the interaction ψit as a linear trend model, with neighbouring areas having similar
trend parameters, as in Bernardinelli et al. (1995). Thus with yit ~ Po(Nit μit),
where ω1i and ω2i are spatially correlated over areas. One may further adopt a bivariate
spatial (e.g. bivariate CAR) prior for {ω1i, ω2i}, allowing level and trend parameters to be cor-
related. Additionally, a convolution form may be adopted both for level and trend, so that
where u1i and u2i are iid random effects. Equivalently, letting cji = ωji + uji one has
A variation is to introduce an overall nonlinear trend via parameters δt, along with time
specific spatial and iid effects {ωit,uit}, and stationary AR1 dependence in the total lagged
spatial effect cit = ωit + uit (Martinez-Beneito et al., 2008). Thus for t > 2,
$$\log(\mu_{i1}) = \kappa + \delta_1 + \frac{c_{i1}}{(1 - \rho^2)^{0.5}}.$$
This is equivalent to assuming
$$\log(\mu_{it}) = \kappa + \delta_t + \rho^{t-1} (1 - \rho^2)^{-0.5} c_{i1} + \sum_{k=2}^{t} \rho^{t-k} c_{ik},$$
where the last term is zero when t = 1.
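The closed form for the accumulated spatial effect can be checked against a first order autoregressive recursion. In the sketch below, which uses made-up values for one area and ρ = 0.6, the function computes the area-specific part of log(μit) (excluding κ and δt); the test of the recursion s_t = ρ s_{t−1} + c_t is a property of the closed form itself.

```python
def accumulated_effect(c, rho):
    """Area part of log(mu_it): rho^(t-1) (1 - rho^2)^(-0.5) c_1 + sum_{k=2}^t rho^(t-k) c_k."""
    out = []
    for t in range(1, len(c) + 1):
        first = rho ** (t - 1) * c[0] / (1.0 - rho ** 2) ** 0.5
        rest = sum(rho ** (t - k) * c[k - 1] for k in range(2, t + 1))  # empty when t = 1
        out.append(first + rest)
    return out

rho = 0.6
c = [0.4, -0.2, 0.1, 0.3]   # hypothetical effects c_i1, ..., c_i4 for one area
s = accumulated_effect(c, rho)
recursion_gaps = [abs(s[t] - (rho * s[t - 1] + c[t])) for t in range(1, len(c))]
print([round(v, 4) for v in s])
```

Scaling the first period effect by (1 − ρ²)^(−0.5) gives the series the stationary variance from the outset, when the cit share a common variance.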
In area-age-time models for mortality counts yixt ~ Po(μixt), area-age-time interactions
ψixt may be parsimoniously modelled by separate linear time trends for each age and area,
namely
as in Sun et al. (2000), where the random coefficients ω1x and ω2i may be structured over
ages and areas respectively. Sun et al. (2000) actually assume a spatial CAR(ρ) prior with
mean zero for the ω2i (section 6.3.3), but take the ω1x to be unrelated fixed effects. The full
model of Sun et al. (2000) also includes iid age-area-time effects, uixt, so that
where positive loadings ω1x and ω2i specify which ages are most sensitive to trend effects δt.
For identification, the δt are centred at zero or have a corner constraint such as δ1 = 0, and the
loadings ω1x and ω2i may be centred at 1, constrained to sum to 1, or have a minimum of 1.
So, for declining mortality, represented by δt following (say) a 1st order random walk, larger
ω1x and ω2i indicate which age groups and areas contribute most to the mortality decline.
Lee and Carter (1992) apply the age-time product model ψxt = ωx δt in mortality forecasting,
with identification obtained by ensuring δt sum to zero, and that the ωx sum to 1.
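The identifying constraints on a product term can be imposed after sampling by rescaling. The sketch below assumes, beyond what is stated above, that the predictor also contains an age-specific level term (named `level` here) which absorbs the shift produced by centring δt; all names and numbers are illustrative.

```python
def normalise_product(level, omega, delta):
    """Rescale so that omega sums to 1 and delta sums to 0, leaving level_x + omega_x * delta_t unchanged."""
    s = sum(omega)
    d_bar = sum(delta) / len(delta)
    omega_n = [w / s for w in omega]
    delta_n = [s * (d - d_bar) for d in delta]
    level_n = [a + w * d_bar for a, w in zip(level, omega)]  # absorbs omega_x * mean(delta)
    return level_n, omega_n, delta_n

level = [-3.0, -4.5, -2.0]            # hypothetical age-specific levels
omega = [0.8, 1.5, 0.7]               # hypothetical unnormalised loadings
delta = [0.2, 0.0, -0.3, -0.5]        # hypothetical period effects
level_n, omega_n, delta_n = normalise_product(level, omega, delta)
max_gap = max(abs((level[x] + omega[x] * delta[t]) - (level_n[x] + omega_n[x] * delta_n[t]))
              for x in range(len(omega)) for t in range(len(delta)))
print(round(sum(omega_n), 6), round(sum(delta_n), 6), max_gap)
```

Since the fitted values are unchanged, the constraints only select one representative from the set of equivalent parameterisations.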
Interaction priors may also be based on a Kronecker product of the structure matrices
for the relevant dimensions (Knorr-Held, 2000; Clayton, 1996), where a structure matrix is
a constituent part of the precision (inverse covariance) matrix. For example, if the structure
matrices of separate area and age effects are denoted Ks and Kx, then Ksx = Ks ⊗ Kx defines
the structure matrix for the joint prior for ψix, and conditional priors on ψix can be obtained
from Ksx. Thus an RW1 prior in age has a structure matrix with off-diagonal elements
Kx[ab] = −1 if ages a and b are adjacent, and Kx[ab] = 0 otherwise. Diagonal elements are 1 if
a = b = 1 or a = b = X, and equal 2 for other diagonal terms. An RW2 prior for age has structure
matrix
$$K_x = \begin{bmatrix}
1 & -2 & 1 & & & & \\
-2 & 5 & -4 & 1 & & & \\
1 & -4 & 6 & -4 & 1 & & \\
& \ddots & \ddots & \ddots & \ddots & \ddots & \\
& & 1 & -4 & 6 & -4 & 1 \\
& & & 1 & -4 & 5 & -2 \\
& & & & 1 & -2 & 1
\end{bmatrix}.$$
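Both structure matrices can be generated from difference operators, since a random walk of order d penalises d-th differences. The following minimal sketch (function names are illustrative) forms the structure matrix as the cross-product of the difference matrix with itself and reproduces the RW1 and RW2 patterns for a small number of ages.

```python
def difference_matrix(n, order):
    """Rows are first (order=1) or second (order=2) differences of an n-vector."""
    coeffs = {1: [-1, 1], 2: [1, -2, 1]}[order]
    rows = []
    for r in range(n - order):
        row = [0] * n
        for k, cf in enumerate(coeffs):
            row[r + k] = cf
        rows.append(row)
    return rows

def structure_matrix(n, order):
    """Structure matrix K = D'D of a random walk prior of the given order."""
    d = difference_matrix(n, order)
    return [[sum(row[a] * row[b] for row in d) for b in range(n)] for a in range(n)]

rw1 = structure_matrix(5, 1)
rw2 = structure_matrix(7, 2)
for row in rw2:
    print(row)
```

Each row of either matrix sums to zero, reflecting the fact that these priors are improper with respect to the overall level (and, for RW2, a linear trend).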
Similarly, the CAR(1) prior for spatially structured errors s = (s1, …, sn) based on adjacency
of areas is multivariate normal with precision matrix τsKs, where τs is an overall precision
parameter, and off-diagonal terms Ks[ij] = −1 if areas i and j are neighbours, and Ks[ij] = 0
for non-adjacent areas. The diagonal terms in Ks are Li where Li is the cardinality of area i
(its total number of neighbours). Then an area-age interaction effect ψix formed by crossing
an RW1 age prior with a CAR(1) spatial effect has joint precision
$$\frac{1}{\sigma_\psi^2} K_s \otimes K_x,$$
and full prior conditionals with variances σψ²/Li when x = 1 or x = X, and σψ²/(2Li) otherwise.
With ∂i denoting the neighbourhood of area i, the prior conditional means Ψix for
ψix are
$$\Psi_{i1} = \psi_{i2} + \sum_{j \in \partial_i} \psi_{j1}/L_i - \sum_{j \in \partial_i} \psi_{j2}/L_i,$$
$$\Psi_{iX} = \psi_{i,X-1} + \sum_{j \in \partial_i} \psi_{jX}/L_i - \sum_{j \in \partial_i} \psi_{j,X-1}/L_i.$$
For identification, the ψix should be doubly centred at each iteration (over areas for a given
age x, and over ages for a given area i).
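The conditional variances and means quoted above can be read off the Kronecker product directly. The sketch below is illustrative: it assumes a hypothetical map of three areas in a row and four age bands, made-up interaction values, and σψ² = 1, and it also shows one way of double centring the interactions.

```python
def rw1_structure(n):
    """RW1 structure matrix: diagonal 1 at the ends and 2 elsewhere, -1 for adjacent ages."""
    k = [[0] * n for _ in range(n)]
    for a in range(n):
        k[a][a] = 1 if a in (0, n - 1) else 2
        if a + 1 < n:
            k[a][a + 1] = k[a + 1][a] = -1
    return k

def car_structure(neighbours):
    """CAR(1) structure matrix: number of neighbours on the diagonal, -1 for adjacent areas."""
    n = len(neighbours)
    k = [[0] * n for _ in range(n)]
    for i, nbrs in neighbours.items():
        k[i][i] = len(nbrs)
        for j in nbrs:
            k[i][j] = -1
    return k

def kron(a, b):
    """Kronecker product; element (i, x) of the interaction maps to row i * len(b) + x."""
    return [[a[i][j] * b[k][l] for j in range(len(a)) for l in range(len(b))]
            for i in range(len(a)) for k in range(len(b))]

def conditional_mean(k_joint, psi_flat, r):
    """Full conditional mean of element r implied by the joint structure matrix."""
    off = sum(k_joint[r][c] * psi_flat[c] for c in range(len(psi_flat)) if c != r)
    return -off / k_joint[r][r]

def double_centre(psi):
    """Centre over areas for each age and over ages for each area."""
    n, m = len(psi), len(psi[0])
    row_means = [sum(r) / m for r in psi]
    col_means = [sum(psi[i][x] for i in range(n)) / n for x in range(m)]
    grand = sum(row_means) / n
    return [[psi[i][x] - row_means[i] - col_means[x] + grand for x in range(m)] for i in range(n)]

neighbours = {0: [1], 1: [0, 2], 2: [1]}   # hypothetical map: three areas in a row
n_ages = 4
k_joint = kron(car_structure(neighbours), rw1_structure(n_ages))
psi = [[0.3, -0.1, 0.4, 0.0], [0.2, 0.5, -0.3, 0.1], [-0.4, 0.6, 0.2, -0.2]]
psi_flat = [v for row in psi for v in row]
i, L_i = 1, 2                               # middle area, which has two neighbours
mean_from_precision = conditional_mean(k_joint, psi_flat, i * n_ages)
mean_from_formula = psi[1][1] + (psi[0][0] + psi[2][0]) / L_i - (psi[0][1] + psi[2][1]) / L_i
centred = double_centre(psi)
print(round(mean_from_precision, 6), round(mean_from_formula, 6))
```

The diagonal of the joint structure matrix is Li at the first and last ages and 2Li elsewhere, which gives the stated conditional variances on inversion.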
with a further scaling by 0.85 for women only. There are four patient groups formed
by crossing gender with whether third-space body fluids were present on at least one
visit. The J = 4 groups are then (1 = female, no fluids; 2 = female, fluids; 3 = male, no flu-
ids; 4 = male, fluids), with group sizes n = (51, 12, 41, 9) and total visits within groups
N = (211, 42, 148, 36) .
The repeated responses for patients i = 1, …, nj within groups j = 1, …, J are
y1ijt = log(MCCijt) and y2ijt = log(ECCijt). The model involves a patient-group common
factor bij, and a unique factor cij, for each outcome-cluster pair, namely
The intercepts for y2 are represented by the product of loadings and B-coefficients.
Identification issues are lessened by the fact that period 1 defines the direction of the bij
effects. Gamma priors with index and shape parameters of unity are assumed for the
precisions {1/Dmj, τmj}, and N(0,1000) priors for the fixed effects {λj,Bj}.
Estimation via jagsUI provides posterior means (95% CRI) for λj of 1.03 (1.01,1.06), 1.06
(0.98,1.14), 1.02 (0.99,1.04), and 1.05 (0.96,1.15). The representation adopted avoids including
weakly identified separate intercepts for y2, and the results support identical calibration
except for the first cluster, where there is a high probability that the λ coefficient
exceeds 1. This probability is obtained from monitoring the node step.lambda[1:4] in the
rjags code. These conclusions are unaffected by adopting a robust Student t (with preset
d.f. = 4) option for the bij effects.
with a logit link for the mortality rates μixt. The application here involves annual
deaths to male white non-Hispanics over the period 1999–2014 (T = 16 years), in n = 51
US states (including District of Columbia). Deaths dixt and SEER population data Pixt
are for X = 13 age bands (<1, 1–4, 5–9, 10–14, 15–19, 20–24, 25–34, 35–44, …, 75–84, 85+).
Recent research (Squires and Blumenthal, 2016; Case and Deaton, 2015) reports an
unexpected rise in death rates among middle-aged, white Americans between 1999
and 2014; see also www.commonwealthfund.org/publications/issue-briefs/2016/jan/mortality-trends-among-middle-aged-whites.
Thus, one might adopt a linear trend model (model 1) with independent age and area
impacts (ηx and ri) on the mortality level, and parallel effects (ρ1x and ρ2i) on the trend
also. This leads to
where the intercept κ is assigned a normal N(0,1000) prior, the area effects ri follow the
Leroux et al. (1999) prior allowing for spatial dependence, and age effects ηx follow a nor-
mal first order random walk. The associated conditional precisions (τr,τη) are assigned
gamma Ga(1,0.01) priors. The ρ1x and ρ2i linear trend coefficients are taken to be iid nor-
mal random effects with zero means, and precisions τρ1 and τρ2 that are assigned gamma
Ga(1,0.01) priors. Model 1 allows for miscellaneous departures (e.g. age-area interactions
in level and trend) from a linear trend by adding iid Normal errors uixt ~ N(0, 1/τu) for
each observation, where τu ~ Ga(1, 0.01).
Hierarchical centring is applied with uixt ~ N(ηx + ri + (ρ1x + ρ2i)(t − t̄), 1/τu), with the
spatial effects additionally centred around the overall intercept κ. For the continental
states (i = 1,…,49), the prior is then
$$r_i \sim N\left(\kappa + \lambda \sum_{j \in L_i} \frac{(r_j - \kappa)}{1 - \lambda + \lambda D_i}, \frac{1}{\tau_r [1 - \lambda + \lambda D_i]}\right),$$
where λ ∈ [0, 1] measures spatial dependence, Li denotes the locality of area i (i.e. the set
of states adjacent to state i), and there are Di states in that locality. For the remaining two
states (Alaska, Hawaii) without neighbours, Di = λ = 0, one specifies
$$r_i \sim N(\kappa, 1/\tau_r).$$
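The conditional prior for an area effect can be written as a small function of the neighbouring effects. This is a minimal sketch with an invented function name and made-up inputs; areas without neighbours are handled as in the text, by reverting to N(κ, 1/τr).

```python
def leroux_conditional(neighbour_effects, kappa, lam, tau_r):
    """Conditional prior mean and variance of r_i given the effects of adjacent areas."""
    d_i = len(neighbour_effects)
    if d_i == 0:
        # Areas without neighbours: unstructured N(kappa, 1/tau_r) prior
        return kappa, 1.0 / tau_r
    denom = 1.0 - lam + lam * d_i
    mean = kappa + lam * sum(r - kappa for r in neighbour_effects) / denom
    variance = 1.0 / (tau_r * denom)
    return mean, variance

# Hypothetical neighbouring effects and hyperparameters
print(leroux_conditional([1.0, 3.0], kappa=0.0, lam=1.0, tau_r=2.0))   # pure spatial dependence
print(leroux_conditional([1.0, 3.0], kappa=0.5, lam=0.0, tau_r=2.0))   # no spatial dependence
```

At λ = 1 the conditional mean is the average of the neighbouring effects, as under an intrinsic CAR prior, while λ = 0 returns an exchangeable prior centred at κ.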
FIGURE 10.5
Linear trend slopes for mortality by age band, white non-Hispanic males, 1999–2014 (vertical axis: linear slope; horizontal axis: age band, <1 to 85+; series: mean, 2.5%, 97.5%).
Y-likelihood (Rubin, 1976; Fichman and Cummings, 2003). However, for non-ignorable
missingness, both the R-likelihood and Y-likelihood must be modelled.
As an illustration, drop-out at time t is classed as being at random if
Pr(Rt = 1|Y) = Pr(Rt = 1|Y1, …, Yt−1), namely when the missingness probability is related to
lagged observed responses. However, if the probability of missingness at time t is related
also to the current outcome Yt, possibly missing, so that Pr(Rt = 1|Y) = Pr(Rt = 1|Y1, …, Yt),
then missingness is non-random or informative (Diggle and Kenward, 1994). In practice,
informative missingness is assessed empirically, and would require a significant effect of
(possibly missing) Yit on πit = Pr(Rit = 1) in a binary regression also involving other influences
on missingness, with the regression taken over subjects i and repetitions t = 1, …, Ti.
For dropouts, one takes Ti = Ti,obs + 1 where Ti,obs is the last interval where data on subject i
was obtained (Roy and Lin, 2002). Since MNAR missingness can never be excluded as a
generating mechanism, a sensitivity analysis under different mechanisms may be consid-
ered (Kenward, 1998). This means estimating the model under a “range of assumptions
about the non-ignorability parameters and assessing the impact of these parameters on
key inferences” (Ma et al., 2005).
A common set of predictors Xit may be relevant to modelling both the data Yit, and
missingness indicators Rit, or different predictors Wit may be used in the R model. King
(2001) accordingly presents a statement of the MCAR-MAR-MNAR alternatives as above,
but replacing Y by D = (Y,X), namely predictor and outcome data combined, and where
D = (Dobs,Dmis) denotes the subdivision of the data according to observation status. For
example, the MCAR assumption then requires P(R|D) = P(R), while missingness at random
requires P(R|D) = P(R|Dobs).
An alternative less stringent definition of MCAR missingness is used by Little (1995),
in which missingness is independent of Y, whether observed or not, but may depend on
fully observed covariates X (Curran et al., 2002, p.12; Daniels and Hogan, 2008, p.92). Such
covariates might for instance include time, as missingness rates often increase at later
stages of longitudinal studies (Hedeker and Gibbons, 2006, p.281). So, given Xobs, R is independent
of both Yobs and Ymis, leading to what is sometimes termed “covariate dependent MCAR
missingness.”
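For reference, the mechanisms just described can be set side by side in the D = (Y, X) notation. This is only a restatement of the definitions in the text; the MNAR line is written as the complementary case and is an informal summary rather than a formal definition.

```latex
\begin{aligned}
\text{MCAR:} \quad & P(R \mid D_{obs}, D_{mis}) = P(R), \\
\text{MAR:} \quad & P(R \mid D_{obs}, D_{mis}) = P(R \mid D_{obs}), \\
\text{MNAR:} \quad & P(R \mid D_{obs}, D_{mis}) \ \text{depends on } D_{mis}, \\
\text{covariate-dependent MCAR:} \quad & P(R \mid Y_{obs}, Y_{mis}, X_{obs}) = P(R \mid X_{obs}).
\end{aligned}
```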
\[ \operatorname{logit}(\pi_{it}) = \gamma_1 + \gamma_2 y_{it} + \gamma_3 y_{i,t-1}, \]
\[ \operatorname{logit}(\pi_{it}) = \eta_{it}, \]
into quantile groups (e.g. quartiles) (Rosenbaum and Rubin, 1983). Among subjects located
within particular quantiles of ηit, some subjects will exit but some remain. Sampling of
the missing yit for exiting subjects may be based on sampling with replacement from the
known yit values of stayers in the same quantile – this is sometimes called the “approxi-
mate Bayesian bootstrap method” (Rubin and Schenker, 1986; Lavori et al., 1995). In mul-
tiple imputation, this imputation process would be repeated several times to provide
multiple filled-in datasets.
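A minimal sketch of this quantile-based imputation is given below. It assumes the propensity scores ηit have already been estimated (the `scores` list is hypothetical), codes missing outcomes as `None`, and follows the two-stage resampling usually associated with the approximate Bayesian bootstrap: a donor pool is first resampled with replacement from the stayers in a quantile group, and missing values are then drawn from that pool. Groups with no stayers are left unimputed.

```python
import random


def quantile_group(score, cutpoints):
    """Index of the quantile group into which a score falls."""
    return sum(score > c for c in cutpoints)


def abb_impute(scores, y, n_groups=4, seed=0):
    """Impute missing y (coded None) within quantile groups of the scores."""
    rng = random.Random(seed)
    n = len(scores)
    ordered = sorted(scores)
    cutpoints = [ordered[(k * n) // n_groups] for k in range(1, n_groups)]
    groups = [quantile_group(s, cutpoints) for s in scores]
    completed = list(y)
    for g in range(n_groups):
        donors = [y[i] for i in range(n) if groups[i] == g and y[i] is not None]
        recipients = [i for i in range(n) if groups[i] == g and y[i] is None]
        if not donors or not recipients:
            continue
        # Stage 1: resample the stayers' values to form a donor pool.
        pool = [rng.choice(donors) for _ in donors]
        # Stage 2: draw each missing value from the resampled pool.
        for i in recipients:
            completed[i] = rng.choice(pool)
    return completed


# Hypothetical data: 20 subjects, every third outcome missing.
scores = [float(i) for i in range(20)]
y = [None if i % 3 == 0 else float(i) for i in range(20)]
# Multiple imputation: repeat with different seeds to obtain several datasets.
imputations = [abb_impute(scores, y, n_groups=4, seed=m) for m in range(5)]
print(imputations[0])
```

Each element of `imputations` is one filled-in dataset; analyses would be run on each and the results combined.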
Then if ρ ≠ 0, the missing data are informative or non-ignorable, whereas ρ = 0 corresponds
to missingness at random.
A similar principle involves low dimension random effects F, also known as common
factors, that are shared between outcome and missingness models; similar shared frailty
models are used for models with outcome-dependent follow-up (Ryu et al., 2007). As often
in factor models, the outcome data and missingness patterns may be viewed as condi-
tionally independent, given the common factors (Song and Belin, 2004; Albert et al., 2002;
Roy and Lin, 2002; Ten Have et al., 1998). Equivalently, it is assumed that “all information
about the missing data in the observed response is accounted for through the shared ran-
dom effects” (Albert and Follmann, 2007). In fact, Li et al. (2007) and Yang and Shoptaw
(2005) distinguish such models as an alternative to selection and pattern mixture methods,
since under conditional independence one may represent the (R,Y,F) joint density as
\[ P(R_i, Y_i \mid \theta_Y, \theta_R) = \int P(R_i \mid \theta_R, F_i)\, P(Y_{obs,i}, Y_{mis,i} \mid \theta_Y, F_i)\, P(F_i \mid \theta_F)\, dF_i . \]
Other assumptions are possible, as under the “conditional linear model” (Daniels and
Hogan, 2008, p.112), with the conditioning sequence
One form of common effect that may be used to model informative missingness is based
on shared heterogeneity (e.g. Li et al., 2007; Chib, 2008, p.507). An example is a general lin-
ear mixed outcome model with permanent subject random effects bi = (b1i , … , bQi )
where the missingness model for Pr(Rit = 1) also conditions on the bi, and possibly on separate
predictors Wit, and on the history of responses Hit = {yi1, …, yit}. Consider the case
Q = 1 with zit = 1, and suppose predictors Wit are relevant to dropout (e.g. baseline health
status in a clinical trial). Then a common factor model adapted to predicting the missingness
probability πit = Pr(Rit = 1|Wit, Hit) might take the form
where bi are zero mean random effects, and the predictors {Xit,Wit} both include an intercept.
For example, Li et al. (2007) consider Poisson data with yit ~ Po(λit),
\[ \log(\lambda_{it}) = X_{it}\beta + b_i, \]
and with binary indicators for missingness, and a lagged outcome scheme adapted to
counts, one would obtain
In fact, the model of Li et al. (2007) distinguishes between intermittently missing data and
permanent attrition via a multinomial rather than binary regression, and uses a transition
probability missingness model.
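The way a shared random effect induces informative missingness can be seen in a short simulation. The sketch below is an illustration under stated assumptions rather than the model of Li et al. (2007): counts follow the Poisson model above with an intercept only, and a binary missingness indicator follows a logit model whose linear predictor includes the same bi with a hypothetical loading `kappa`. All numerical values are invented for illustration.

```python
import math
import random


def poisson_draw(rng, lam):
    """Draw a Poisson variate by Knuth's multiplication method."""
    limit = math.exp(-lam)
    k = 0
    prod = rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def simulate_shared_effect(n_subjects, n_times, kappa, seed=2):
    """Return (mean of all counts, mean of observed counts).

    Assumed model (hypothetical values):
        b_i ~ N(0, 0.7^2),  y_it ~ Po(exp(1 + b_i)),
        logit Pr(R_it = 1) = -1.5 + kappa * b_i   (R_it = 1 means missing).
    """
    rng = random.Random(seed)
    all_counts, observed_counts = [], []
    for _ in range(n_subjects):
        b = rng.gauss(0.0, 0.7)
        lam = math.exp(1.0 + b)
        p_missing = 1.0 / (1.0 + math.exp(-(-1.5 + kappa * b)))
        for _t in range(n_times):
            y = poisson_draw(rng, lam)
            all_counts.append(y)
            if rng.random() >= p_missing:
                observed_counts.append(y)
    full_mean = sum(all_counts) / len(all_counts)
    observed_mean = sum(observed_counts) / len(observed_counts)
    return full_mean, observed_mean


full_0, obs_0 = simulate_shared_effect(3000, 4, kappa=0.0)
full_k, obs_k = simulate_shared_effect(3000, 4, kappa=1.5)
print(f"kappa = 0.0: full {full_0:.2f}, observed {obs_0:.2f}")
print(f"kappa = 1.5: full {full_k:.2f}, observed {obs_k:.2f}")
```

With `kappa = 0` the outcome and missingness processes are independent, whereas a positive `kappa` removes subjects with high counts preferentially, so the observed mean understates the full-data mean.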
A model with shared latent effects exemplified by (10.9) imposes possibly restrictive
assumptions on the correlations among repeated responses for a given subject. Conditional
on the time-invariant shared effects bi, observations on a subject are uncorrelated (Albert
and Follmann, 2007). An alternative is a shared autoregressive process, as in
for t = 1, … , Ti , where for dropouts Ti = Ti ,obs + 1 and Ti,obs is the last interval where data was
observed. Furthermore, the factor scores may depend on known predictors {Uit,Zit} and
zero mean random permanent effects bi, as in
with uit ~ N (0, 1) if all loadings κ and λm are unknowns, and with Uit omitting an intercept
for identifiability (Roy and Lin, 2002, p.42). The missingness model is non-ignorable by
virtue of dependence of πit on Fit, which represents possibly missing ymit (Roy and Lin, 2002,
p.43).
where pX now models the likelihood of the predictors. If RY is conditional on all the com-
ponents of RX, one has
Alternatively, RY may be modelled jointly with the RX, though complexity increases as the
number of predictors subject to missingness rises, giving rise to different possible condi-
tional sequences for RY and the components of RX.
Suppose a subset of q predictors have missing values, with Rji = 1 if Xji is missing, and
Rji = 0 otherwise. If Y is fully observed, a selection approach specifies
where p(RX) is a multinomial with 2^q cells. To define pX, one needs to specify the joint distribution
of Xi,mis = {X1i, …, Xqi}. Suppose the incompletely observed covariates Xmis = (X1, …, Xq)
include both categorical covariates Xmis,D = {X1, …, Xr} and continuous covariates Xmis,C = {Xr+1, …, Xq}, with fully
observed covariates denoted Xobs = {Xq+1, …, Xp}. Ibrahim et al. (1999) proposed the joint
density of Xmis be specified as a series of conditional distributions, namely
\[ y_{it} = \beta_{1,G_i} + \beta_{2,G_i} t + \beta_{3,G_i} Tr_i + \beta_{4,G_i} (t \cdot Tr_i) + b_{1i} + b_{2i} t + e_{it}, \]
\[ b_i \sim N(0, D_{G_i}), \]
\[ e_{it} \sim N(0, \sigma^2_{G_i}). \]
Curran et al. (2002, p.13) allow for an additional autocorrelated error εit with pattern spe-
cific covariance matrix RGi .
The conditional linear model (Paddock, 2007; Hogan et al., 2004) is a version of the pat-
tern mixture model that may be applied to continuously recorded longitudinal data (rather
than fixed interval longitudinal data). The impact of missingness on Y involves functions
βj(Ui) of possibly continuous dropout times Ui though this reduces to a grouping approach
for fixed intervals; that is, the βj(Ui) become step functions (Hogan et al., 2004, p.856). At
their most simple, such functions are linear in U, but polynomial functions or non-para-
metric models (e.g. splines) can be used. In the preceding example, one might have
and a test for missingness at random is whether the αj1 are zero. Paddock (2007) applies
a Bayesian regression selection approach to coefficients in models involving quadratic
effects of Ui.
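As a sketch only, and assuming the simplest linear case described above, the dropout-dependent coefficients may be written as follows. The symbols αj0 and αj1 are taken to be the intercept and slope of the jth coefficient function; this form is an assumption consistent with the statement that missingness at random corresponds to the αj1 being zero.

```latex
\beta_j(U_i) = \alpha_{j0} + \alpha_{j1} U_i ,
\qquad
\text{MAR} \iff \alpha_{j1} = 0 \ \text{for all } j .
```

A quadratic version, as in the selection approach of Paddock (2007), would add a term in U_i^2 to each coefficient function.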
Estimates from iterations 1,001–10,000 of a two-chain run with rjags show no effect on
missingness probabilities πit of the possibly unobserved current outcome yit (gamma[2]
in the code) with 95% interval (−0.008, 0.012). The treatment effect β2 in the outcome
model is not significant, but the treatment-time interaction parameter β4 has a predominantly
negative density, albeit with an inconclusive 95% credible interval (−16.2, 0.7). The
WAIC is 17048 for the y-model, and 544 for the R-model.
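Separate WAIC values for the outcome and missingness sub-models can be obtained from the pointwise log-likelihoods monitored for each. The following is a minimal sketch of the usual calculation (log pointwise predictive density minus the sum of posterior variances of the log-likelihood, on the deviance scale); it is not the code used for the example, and the small `example` array is invented. The input `log_lik[s][i]` is assumed to hold the log-likelihood of observation i at MCMC draw s.

```python
import math


def log_mean_exp(values):
    """Numerically stable log of the mean of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))


def waic(log_lik):
    """Return (WAIC, lppd, p_waic) from draws-by-observations log-likelihoods."""
    n_draws = len(log_lik)
    n_obs = len(log_lik[0])
    lppd = 0.0
    p_waic = 0.0
    for i in range(n_obs):
        column = [log_lik[s][i] for s in range(n_draws)]
        lppd += log_mean_exp(column)
        mean_ll = sum(column) / n_draws
        # Posterior (sample) variance of the log-likelihood for observation i.
        p_waic += sum((v - mean_ll) ** 2 for v in column) / (n_draws - 1)
    return -2.0 * (lppd - p_waic), lppd, p_waic


# Invented example: 2 draws, 2 observations.
example = [[-1.0, -2.0], [-1.5, -2.5]]
waic_value, lppd, p_waic = waic(example)
print(waic_value, lppd, p_waic)
```

Applying the function once to the log-likelihoods of the y observations and once to those of the R indicators gives sub-model values of the kind quoted above.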
An alternative model involves a common factor Fi (multiple indicator, multiple cause)
that depends on Basei = baseline spend (standardised). The prior mean for Fi specifies
a regression with intercept omitted for identifiability; the prior variance of the factor
scores is set at 1. The missingness model now involves a lagged response and the com-
mon factor, while the Y likelihood no longer involves baseline spending. Thus
where a LN(0,1) prior is adopted for κ, and the prior on λ is N(1,1) and constrained to
positive values. Estimates show the common factor is a positive function of baseline
spend with η having 95% CRI (6.1,15.9). Its impact on πit is positive, with κ having 95%
CRI (0.01,0.05). So, the chance of a missing value (Rit = 0) tends to diminish with the score
on the common factor. The WAIC for the y-data under this model is broadly similar
(17055) to that for the earlier one.
Finally, a pattern mixture analysis is applied, distinguishing simply between non-
completer (Gi = 1) and completer groups (Gi = 2). The assumed model is
\[ b_i \sim N(0, D_{G_i}), \]
\[ e_{it} \sim N(0, \sigma^2). \]
\[ w_{ij} = 0, \quad j = 1, \ldots, T_i - 1; \]
\[ w_{iT_i} = 1, \]
with priors
\[ D^{-1} \sim \mathrm{Wish}(I, 2), \]
\[ u_{it} \sim N(0, 1/\tau_u), \]
\[ \tau_u \sim \mathrm{Ga}(1, 0.001). \]
The missing data model is a complementary log-log regression, sharing the random
intercept b1i and with an interaction between treatment and the shared effect. Thus
with {γ1j ~ N(0, 1000), j = 1, …, max(Ti)}, γ2 ~ N(0, 1000), and {αk ~ N(1, 1), k = 1, 2}.
Non-ignorable missingness corresponds to any of the αk coefficients being distinct from zero
(Hedeker and Gibbons, 2006, p.298).
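To make the role of the α coefficients concrete, the sketch below evaluates a complementary log-log dropout hazard under one assumed reading of the model described above: a linear predictor with a time-specific intercept, a treatment effect γ2, the shared random intercept b1i with loading α1, and a treatment by b1i interaction with loading α2. The function name and the numerical values are illustrative choices, not estimates.

```python
import math


def dropout_hazard(gamma_t, gamma_trt, alpha1, alpha2, treated, b):
    """Dropout probability under an assumed complementary log-log model:

        cloglog(h) = gamma_t + gamma_trt * treated
                     + alpha1 * b + alpha2 * treated * b,

    so that h = 1 - exp(-exp(linear predictor)).
    """
    eta = gamma_t + gamma_trt * treated + alpha1 * b + alpha2 * treated * b
    return 1.0 - math.exp(-math.exp(eta))


# Illustrative values only: alpha1 > 0 and alpha1 + alpha2 < 0.
params = dict(gamma_t=-2.0, gamma_trt=-0.5, alpha1=0.8, alpha2=-1.3)
for treated in (0, 1):
    low = dropout_hazard(treated=treated, b=-1.0, **params)
    high = dropout_hazard(treated=treated, b=1.0, **params)
    print(f"treated={treated}: hazard at b=-1 is {low:.3f}, at b=+1 is {high:.3f}")
```

With α1 positive, the hazard rises with b1i among the untreated; when α1 + α2 is negative, it falls with b1i among the treated.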
A two-chain run of 5,000 iterations using rjags shows early convergence, and pos-
terior means on β coefficients (from the last 4000 iterations) similar to those reported
by Hedeker and Gibbons (2006). In particular, β4 has posterior mean (sd) of −0.65 (0.08)
consistent with a greater reduction in morbidity for the treatment group. Dropout is
lower for treated patients, with γ2 having mean (sd) of −0.65 (0.23). Both the α coefficients
have 95% credible intervals excluding zero: α1 has mean and 95% interval 0.84 (0.17,1.57)
indicating that among untreated patients, those more ill are likely to drop out, while the
sum (α1 + α2) has mean (95% CrI) −0.50 (−1.08, 0.03), showing that for those being treated,
the more ill are in fact less likely to drop out.
References
Agresti A (1997) A model for repeated measurements of a multivariate binary response. Journal of the
American Statistical Association, 92, 315–321.
Agresti A, Natarajan R (2001) Modeling clustered ordered categorical data: A survey. International
Statistical Review, 69, 345–371.
Albert P, Follmann D (2007) Random effects and latent processes approaches for analyzing binary
longitudinal data with missingness: A comparison of approaches using opiate clinical trial
data. Statistical Methods in Medical Research, 16, 417–439.
Albert PS, Follmann DA, Wang SA, Suh EB (2002) A latent autoregressive model for longitudinal binary
data subject to informative missingness. Biometrics, 58, 631–642.
Allison P (2000) Multiple imputation for missing data: A cautionary tale. Sociological Methods and
Research, 28, 301–309.
Alvarez I, Niemi J, Simpson M (2016) Bayesian inference for a covariance matrix. arXiv:1408.4050v2
Antonio K, Beirlant J (2007) Actuarial statistics with generalized linear mixed models. Insurance:
Mathematics and Economics, 40, 58–76.
Austin P, Escobar M (2005) Bayesian modeling of missing data in clinical research. Computational
Statistics and Data Analysis, 49, 821–836.
Baker A, Bray I (2005) Bayesian projections: What are the effects of excluding data from younger age
groups? American Journal of Epidemiology, 162, 798–805.
Baldwin S (2014) Visualizing the LKJ Correlation Distribution.
https://www.psychstatistics.com/2014/12/27/d-lkj-priors/
Barnard J, McCulloch R, Meng XL (2000) Modeling covariance matrices in terms of standard deviations
and correlations, with applications to shrinkage. Statistica Sinica, 10, 1281–1311.
Bauer R, Guzy S, Ng C (2007) A survey of population analysis methods and software for complex phar-
macokinetic and pharmacodynamic models with examples. The AAPS Journal, 9(1), E60–E83.
Bauwens L, Lubrano M, Richard J-F (1999) Bayesian Inference in Dynamic Econometric Models. Oxford
University Press, Oxford, UK.
Beckett L, Tancredi D, Wilson R (2004) Multivariate longitudinal models for complex change pro-
cesses. Statistics in Medicine, 23, 231–239.
Bernardinelli L, Clayton D, Pascutto C, Montomoli C, Ghislandi M, Songini M (1995) Bayesian analy-
sis of space-time variation in disease risk. Statistics in Medicine, 14, 2433–2443.
Berrington A, Hu Y, Ramirez-Ducoing K, Smith P (2005) Multilevel modelling of repeated ordi-
nal measures: An application to attitude towards divorce. Southampton Statistical Sciences
Research Institute Applications and Policy Working Paper M05/10 and ESRC Research Method
Programme Working Paper No. 26.
Besag J, York J, Mollié A (1991) Bayesian image restoration, with two applications in spatial statistics.
Annals of the Institute of Statistical Mathematics, 43(1), 1–20.
Bollen K, Curran P (2004) Autoregressive latent trajectory (ALT) models: A synthesis of two tradi-
tions. Sociological Methods and Research, 32, 336–383.
Bonate P (2008) Pharmacokinetic-Pharmacodynamic Modeling and Simulation, 2nd Edition. Springer,
New York.
Bond S (2002) Dynamic panel data models: A guide to microdata methods and practice. Portuguese
Economic Journal, 1, 141–162.
Bray I (2002) Application of Markov chain Monte Carlo methods to projecting cancer incidence and
mortality. Journal of the Royal Statistical Society: Series C (Applied Statistics), 51(2), 151–164.
Case A, Deaton A (2015) Rising morbidity and mortality in midlife among white non-Hispanic
Americans in the 21st century. Proceedings of the National Academy of Sciences of the United States
of America, 112(49), 15078–15083.
Cepeda E, Gamerman D (2004) Bayesian modeling of joint regressions for the mean and covariance
matrix. Biometrical Journal, 46, 430–440.
Chamberlain G, Hirano K (1999) Predictive distributions based on longitudinal earnings data.
Annales d’Economie et de Statistique, 55–56, 211–242.
Chen Z, Kuo L (2001) A note on the estimation of the multinomial logit model with random effects.
The American Statistician, 55, 89–95.
Chen Z, Rus H, Sen A (2016) Border effects before and after 9/11: Panel data evidence across
industries. The World Economy. DOI:10.1111/twec.12413.
Chernyavskiy P, Little M, Rosenberg P (2019) A unified approach for assessing heterogeneity in age–
period–cohort model parameters using random effects. Statistical Methods in Medical Research,
28(1). https://journals.sagepub.com/doi/abs/10.1177/0962280217713033
Chib S (2008) Panel data modeling and inference: A Bayesian primer, pp 479–515, in The Econometrics
of Panel Data, 3rd Edition, eds L Matyas, P Sevestre. Springer-Verlag, Berlin, Germany.
Chib S, Carlin B (1999) On MCMC sampling in hierarchical longitudinal models. Statistics and
Computing, 9, 17–26.
Chib S, Jeliazkov I (2006) Inference in semiparametric dynamic models for binary longitudinal data.
Journal of the American Statistical Association, 101, 685–700.
Chintagunta P, Kyriazidou E, Perktold J (2001) Panel data analysis of household brand choices.
Journal of Economics, 103, 111–153.
Clayton D (1996) Generalized linear mixed models, pp 275–301, in Markov Chain Monte Carlo in
Practice, eds WR Gilks, S Richardson, DJ Spiegelhalter. Chapman & Hall, London, UK.
Clayton D, Schifflers E (1987) Models for temporal variation in cancer rates. II: Age-period-cohort
models. Statistics in Medicine, 6, 469–481.
Congdon P (2006) A model for geographical variation in health and total life expectancy. Demographic
Research, 14, 157–178.
Copas JB, Li HG (1997) Inference for non-random samples. Journal of the Royal Statistical Society: Series
B (Statistical Methodology), 59(1), 55–95.
Curran D, Molenberghs G, Aaronson N, Fossa S, Sylvester R (2002) Analyzing longitudinal continu-
ous quality of life data with dropout. Statistical Methods in Medical Research, 11, 5–23.
Curran P, Obeidat K, Losardo D (2010) Twelve frequently asked questions about growth curve mod-
eling. Journal of Cognition and Development, 11(2), 121–136.
Daniels M, Normand S (2006) Longitudinal profiling of health care units based on continuous and
discrete patient outcomes. Biostatistics, 7, 1–15.
Daniels M J, Jackson D, Feng W, White I (2015) Pattern mixture models for the analysis of repeated
attempt designs. Biometrics, 71(4), 1160–1167.
Davidian M, Giltinan D (2003) Nonlinear models for repeated measures data: An overview and
update. Journal of Agricultural, Biological, and Environmental Statistics, 8, 387–419.
Depaoli S, Boyajian J (2014) Linear and nonlinear growth models: Describing a Bayesian perspective.
Journal of Consulting and Clinical Psychology, 82(5), 784–802.
Diggle P, Kenward M (1994) Informative dropout in longitudinal data analysis. Journal of the Royal
Statistical Society: Series C, 43, 49–94.
Donald M, Mengersen K, Young R (2015) A four dimensional spatio-temporal analysis of an agricul-
tural dataset. PLOS ONE, 10(10), e0141120.
Dorsett R (1999) An econometric analysis of smoking prevalence among lone mothers. Journal of
Health Economics, 18, 429–441.
Dunson D (2003) Dynamic latent trait models for multidimensional longitudinal data. Journal of the
American Statistical Association, 98, 555–563.
Dunson D (2006) Bayesian dynamic modeling of latent trait distributions. Biostatistics, 7, 551–568.
Dunson D (2007) Bayesian methods for latent trait modeling of longitudinal data. Statistical Methods
in Medical Research, 16, 399–415.
Dunson D (2009) Bayesian nonparametric hierarchical modeling. Biometrical Journal, 51(2), 273–284.
Erkanli A, Soyer R, Angold A (2001) Bayesian analyses of longitudinal binary data using Markov
regression models of unknown order. Statistics in Medicine, 20, 755–770.
Fahrmeir L, Tutz G (2001) Multivariate Statistical Modelling Based on Generalized Linear Models, pp
69–137. Springer, New York.
Fichman M, Cummings J (2003) Multiple imputation for missing data: Making the most of what you
know. Organizational Research Methods, 6, 282–308.
Fitzmaurice G, Laird N, Ware J (2004) Applied Longitudinal Analysis. Wiley.
Fokianos K, Kedem B (2003) Regression theory for categorical time series. Statistical Science, 18,
357–376.
Fong Y, Rue H, Wakefield J (2010) Bayesian inference for generalized linear mixed models. Biostatistics,
11(3), 397–412.
Fotouhi A (2007) The initial conditions problem in longitudinal count process: A simulation study.
Simulation Modelling Practice and Theory, 15, 589–604.
Franco C, Bell W (2015) Borrowing information over time in binomial/logit normal models for small
area estimation. Statistics in Transition, 16(4), 563–584.
Frees E (2004) Longitudinal and Panel Data. Cambridge University Press, Cambridge, UK.
Frühwirth-Schnatter S, Tüchler R (2008) Bayesian parsimonious covariance estimation for hierarchi-
cal linear mixed models. Statistics and Computing, 18, 1–13.
Galatzer-Levy I (2015) Applications of Latent Growth Mixture Modeling and allied methods to post-
traumatic stress response data. European Journal of Psychotraumatology, 6, 27515.
Galatzer-Levy I, Bonanno G (2012) Beyond normality in the study of bereavement: Heterogeneity in
depression outcomes following loss in older adults. Social Science & Medicine, 74(12), 1987–1994.
Galler H (2001) On the dynamics of individual wage rates – Heterogeneity and stationarity of wage
rates of West German Men, pp 269–293, in Econometric Studies. A Festschrift in Honour of Joachim
Frohn, eds R Friedmann, L Knüppel, H Lütkepohl. LIT, Münster, Germany.
Gelman A, Carlin J, Stern H, Dunson D, Vehtari A, Rubin D (2014) Bayesian Data Analysis. CRC, Boca
Raton, FL.
Gelman A, Meng XL, Stern H (1996) Posterior predictive assessment of model fitness via realized
discrepancies. Statistica Sinica, 733–760.
Geweke J, Keane M (2000) An empirical analysis of earnings dynamics among men in the PSID:
1968–1989. Journal of Econometrics, 96, 293–356.
Ghosh P, Branco M, Chakraborty H (2007) Bivariate random effect model using skew-normal distri-
bution with application to HIV-RNA. Statistics in Medicine, 26, 1255–1267.
Grunwald GK, Hyndman RJ, Tedesco LM, Tweedie RL (2000) Non-Gaussian conditional linear AR(1)
models. Australian and New Zealand Journal of Statistics, 42, 479–495.
Heagerty P, Zeger S (2000) Marginalized multilevel models and likelihood inference. Statistical
Science, 15, 1–26.
Hedeker D (2003) A mixed-effects multinomial logistic regression model. Statistics in Medicine, 22,
1433–1446.
Hedeker D, Gibbons R (2006) Longitudinal Data Analysis. Wiley, New York.
Hedeker D, Gibbons R (1997) Application of random-effects pattern-mixture models for missing data
in longitudinal studies. Psychological Methods, 2, 64–78.
Hirano K (2000) A semiparametric model for labor earnings dynamics, in Practical Nonparametric and
Semiparametric Bayesian Statistics, eds D Dey, P Mueller, D Sinha. Springer-Verlag, New York.
Hirano K (2002) Semiparametric Bayesian inference in autoregressive panel data models. Econometrica,
70, 781–799.
Ho R, Hu I (2008) Flexible modelling of random effects in linear mixed models – A Bayesian approach.
Computational Statistics and Data Analysis, 52, 1347–1361.
Hogan J, Lin X, Herman B (2004) Mixtures of varying coefficient models for longitudinal data with
discrete or continuous non-ignorable dropout. Biometrics, 60, 854–864.
Hsiao C (2014) Analysis of Panel Data (No. 54). Cambridge University Press.
Ibrahim J, Chen M-H, Ryan L (2000) Bayesian variable selection for time series count data. Statistica
Sinica, 10, 971–987.
Ibrahim J, Lipsitz S, Chen M-H (1999) Missing covariates in generalized linear models when the
missing data mechanism is non-ignorable. Journal of the Royal Statistical Society, Series B, 61,
173–190.
Ibrahim J, Molenberghs G (2009) Missing data methods in longitudinal studies: A review. Test, 18(1),
1–43.
Islam M, Chowdhury R (2006) A higher order Markov model for analyzing covariate dependence.
Applied Mathematical Modelling, 30, 477–488.
Jara A, Quintana F, San Martin E (2008) Linear effects mixed models with skew-elliptical distribu-
tions: A Bayesian approach. Computational Statistics & Data Analysis, 52, 5033–5045.
Jorgensen B, Lundbye-Christensen S, Song P, Sun L (1999) A state space model for multivariate lon-
gitudinal count data. Biometrika, 86, 169–181.
Jung R, Kukuk M, Liesenfeld R (2006) Time series of count data: Modeling, estimation and diagnos-
tics. Computational Statistics and Data Analysis, 51, 2350–2364.
Keane M (2015) Panel data discrete choice models of consumer demand, in The Oxford
Handbook of Panel Data, ed B Baltagi. OUP.
Kedem B, Fokianos K (2005) Regression models for binary time series, pp 185–199, in Modeling
Uncertainty, eds M Dror, P L’Ecuyer, F Szidarovszky. Springer.
Kenward M (1998) Selection models for repeated measurements with non-random dropout: An illus-
tration of sensitivity. Statistics in Medicine, 17, 2723–2732.
King G (2001) Analyzing incomplete political science data: An alternative algorithm for multiple
imputation. American Political Science Review, 95, 49–69.
Kinney S, Dunson D (2007) Fixed and random effects selection in linear and logistic models.
Biometrics, 63, 690–698.
Kleinman K, Ibrahim J (1998) A semi-parametric Bayesian approach to generalized linear mixed
models. Statistics in Medicine, 17, 2579–2596.
Knorr-Held L (2000) Bayesian modelling of inseparable space-time variation in disease risk. Statistics
in Medicine, 19, 2555–2567.
Lagazio C, Biggeri A, Dreassi E (2003) Age-period-cohort models and disease mapping. Environmetrics,
14, 475–490.
Lancaster T (2002) Orthogonal parameters and panel data. The Review of Economic Studies, 69, 647–666.
Lavori P, Dawson R, Shera D (1995) A multiple imputation strategy for clinical trials with truncation
of patient data. Statistics in Medicine, 14, 1913–1925.
Lee J, Hwang R (2000) On estimation and prediction for temporally correlated longitudinal data.
Journal of Statistical Planning and Inference, 87, 87–104.
Lee RD, Carter LR (1992) Modeling and forecasting U.S. mortality. Journal of the American Statistical
Association, 87(419), 659–671.
Lee Y, Nelder J (2000) Two ways of modelling overdispersion in non-normal data. Applied Statistics,
49, 591–598.
Lee Y, Nelder J (2004) Conditional and marginal models: another view. Statistical Science, 19, 219–238.
Leroux B, Lei X, Breslow N (1999) Estimation of disease rates in small areas: A new mixed model
for spatial dependence, pp 135–178, in Statistical Models in Epidemiology, the Environment and
Clinical Trials, eds M Halloran, D Berry. Springer-Verlag, New York.
Lewandowski D, Kurowicka D, Joe H (2009) Generating random correlation matrices based on vines
and extended onion method. Journal of Multivariate Analysis, 100, 1989–2001.
Li J, Yang X, Wu Y, Shoptaw S (2007) A random-effects Markov transition model for Poisson-
distributed repeated measures with non-ignorable missing values. Statistics in Medicine, 26,
2519–2532.
Lin H, McCulloch C, Rosenheck R (2004) Latent pattern mixture models for informative intermittent
missing data in longitudinal studies. Biometrics, 60, 295–305.
Lin T, Lee J (2006) A robust approach to t linear mixed models applied to multiple sclerosis data.
Statistics in Medicine, 25, 1397–1412.
Lindsey J (1993) Models for Repeated Measurements. Oxford University Press, New York.
Little RJA (1993) Pattern-mixture models for multi-variate incomplete data. Journal of the American
Statistical Association, 88, 125–133.
Little R (1995) Modeling the drop-out mechanism in repeated-measures studies. Journal of the
American Statistical Association, 90, 1112–1121.
Little R, Rubin D (2002) Statistical Analysis with Missing Data, 2nd Edition. Wiley-Interscience,
Hoboken, NJ.
Liu F, Zhang P, Erkan I, Small D S (2017) Bayesian inference for random coefficient dynamic panel
data models. Journal of Applied Statistics, 44(9), 1543–1559.
Liu L, Hedeker D (2006) A mixed-effects regression model for longitudinal multivariate ordinal data.
Biometrics, 62, 261–268.
Lockwood J, Doran H, McCaffrey D (December 2003) Using R for estimating longitudinal student
achievement models. R Newsletter, 3(3), 17–23.
Ma Y, Genton M, Davidian M (2004) Linear mixed effects models with semiparametric generalized
skew elliptical random effects, pp 339–358, in Skew-Elliptical Distributions and their Applications:
A Journey Beyond Normality, ed M Genton. Chapman and Hall/CRC, Boca Raton, FL.
Pourahmadi M (2000) Maximum likelihood estimation of generalized linear models for multivariate
normal covariance matrix. Biometrika, 87, 425–435.
Pourahmadi M, Daniels M (2002) Dynamic conditionally linear mixed models. Biometrics, 58, 225–231.
Qiu Z, Song P, Tan M (2002) Bayesian hierarchical models for multi-level repeated ordinal data using
WinBUGS. Journal of Biopharmaceutical Statistics, 12, 121–135.
Quick H, Waller L A, Casper M (2018) A multivariate space–time model for analysing county level
heart disease death rates by race and sex. Journal of the Royal Statistical Society: Series C (Applied
Statistics), 67(1), 291–304.
Quintana F, Müller P, Rosner G (2008) A semiparametric Bayesian model for repeated binary measurements.
Journal of the Royal Statistical Society: Series C (Applied Statistics), 57(4), 419–431.
Rice K (2005) Bayesian measures of goodness of fit, in Encyclopedia of Biostatistics, eds P Armitage, T
Colton. John Wiley, Chichester, UK.
Roberts GO, Sahu SK (2001) Approximate predetermined convergence properties of the Gibbs sam-
pler. Journal of Computational and Graphical Statistics, 10(2), 216–229.
Rosenbaum P, Rubin D (1983) The central role of the propensity score in observational studies for
causal effects. Biometrika, 70, 41–55.
Rossi P, Allenby G, McCulloch R (2005) Bayesian Statistics and Marketing. Wiley.
Roy J, Lin X (2000) Latent variable models for longitudinal data with multiple continuous outcomes.
Biometrics, 56, 1047–1054.
Roy J, Lin X (2002) Analysis of multivariate longitudinal outcomes with non-ignorable dropouts and
missing covariates: Changes in methadone treatment practices. Journal of the American Statistical
Association, 97, 40–52.
Rubin DB (1976) Inference and missing data. Biometrika, 63, 581–592.
Rubin D, Schenker N (1986) Multiple imputation for interval estimation from simple random sam-
ples with ignorable nonresponse. Journal of the American Statistical Association, 81, 366–374.
Rushworth A, Lee D, Mitchell R (2014) A spatio-temporal model for estimating the long-term effects
of air pollution on respiratory hospital admissions in Greater London. Spatial and Spatio-
Temporal Epidemiology, 10, 29–38.
Ryu D, Sinha D, Mallick B, Lipsitz S, Lipshultz S (2007) Longitudinal studies with outcome-dependent
follow-up: Models and Bayesian regression. Journal of the American Statistical Association,
102, 952–961.
Sahu S, Dey D, Branco M (2003) A new class of multivariate skew distributions with applications to
Bayesian regression models. The Canadian Journal of Statistics, 31, 129–150.
Savitsky T, Paddock S (2014) Bayesian semi-and non-parametric models for longitudinal data with
multiple membership effects in R. Journal of Statistical Software, 57(3), 1–35.
Schafer J (1997) Imputation of missing covariates under a multivariate linear mixed model. Technical
report, Dept. of Statistics, The Pennsylvania State University.
Schafer J, Graham J (2002) Missing data: Our view of the state of the art. Psychological Methods, 7,
147–177.
Schmid V, Held L (2004) Bayesian extrapolation of space–time trends in cancer registry data.
Biometrics, 60(4), 1034–1042.
Schmid V, Held L (2007) Bayesian age-period-cohort modeling and prediction – BAMP. Journal of
Statistical Software, 21(8). http://www.jstatsoft.org/
Song J, Belin TR (2004) Imputation for incomplete high-dimensional multivariate normal data using
a common factor model. Statistics in Medicine, 23(18), 2827–2843.
Spiess M (2006) Estimation of a two-equation panel model with mixed continuous and ordered categorical
outcomes and missing data. Journal of the Royal Statistical Society: Series C (Applied Statistics), 55, 525–538.
Squires D, Blumenthal D (2016) Mortality trends among working-age whites: The untold story. Issue
Brief (Commonwealth Fund), 3, 1–11.
Steele F (2008) Multilevel models for longitudinal data. Journal of the Royal Statistical Society: Series A,
171(1), 5–19.
Sun D, Tsutakawa R, Kim H, He Z (2000) Spatio-temporal interaction with disease mapping. Statistics
in Medicine, 19, 2015–2035.
Ten Have T, Kunselman A, Pulkstenis E, Landis R (1998) Mixed effects logistic regression models for
longitudinal binary response data with informative drop-out. Biometrics, 54, 367–383.
Terzi E, Cengiz M (2013) Bayesian hierarchical modeling for categorical longitudinal data from seda-
tion measurements. Computational and Mathematical Methods in Medicine, 2013, 579214.
Thall P, Vail S (1990) Some covariance models for longitudinal count data with overdispersion.
Biometrics, 46, 657–671.
Thiese M S (2014) Observational and interventional study design types; an overview. Biochemia
Medica, 24(2), 199–210.
Troxel A, Harrington D, Lipsitz S (1998) Analysis of longitudinal data with non-ignorable non-mono-
tone missing values. Applied Statistics, 47, 425–438.
Troxel A, Ma G, Heitjan D (2004) An index of local sensitivity to nonignorability. Statistica Sinica, 14,
1221–1237.
Tsai M-Y, Hsiao C (2008) Computation of reference Bayesian inference for variance components in
longitudinal studies. Computational Statistics, 23(4), 587–604.
Tutz G, Kauermann G (2003) Generalized linear random effects models with varying coefficients.
Computational Statistics and Data Analysis, 43, 13–28.
Tzala E, Best N (2008) Bayesian latent variable modelling of multivariate spatio-temporal variation
in cancer mortality. Statistical Methods in Medical Research, 17, 97–118.
Vaidyanathan R (2016) Using a LKJ Prior in Stan. http://stla.github.io/stlapblog/posts/
StanLKJprior.html
Verbeke G, Fieuws S, Molenberghs G, Davidian M (2014) The analysis of multivariate longitudinal
data: A review. Statistical Methods in Medical Research, 23(1), 42–59.
Verbeke G, Molenberghs G, Rizopoulos D (2010) Random effects models for longitudinal data,
Chapter 2, pp 37–96, in Longitudinal Research with Latent Variables, eds K van Montfort, J Oud,
A Satorra. Springer.
Wakefield J, Smith A, Racine-Poon A, Gelfand A (1994) Bayesian analysis of linear and non-linear
population models using the Gibbs sampler. Journal of the Royal Statistical Society: Series C
(Applied Statistics), 43, 201–221.
Weiss R (2005) Modelling Longitudinal Data. Springer, New York.
Weiss R, Cho M, Yanuzzi M (1999) On Bayesian calculations for mixture priors and likelihoods.
Statistics in Medicine, 18, 1555–1570.
Wooldridge J (2005) Simple solutions to the initial conditions problem in dynamic, nonlinear panel
data models with unobserved heterogeneity. Journal of Applied Econometrics, 20, 39–54.
Xu S, Jones R, Grunwald G (2007) Analysis of longitudinal count data with serial correlation.
Biometrical Journal, 49, 416–428.
Yang X, Shoptaw S (2005) Assessing missing data assumptions in longitudinal studies: An example
using a smoking cessation trial. Drug and Alcohol Dependence, 77, 213–225.
Yau KK, Kuk AY (2002) Robust estimation in generalized linear mixed models. Journal of the Royal
Statistical Society: Series B (Statistical Methodology), 64(1), 101–117.
Zayeri F, Kazemnejad A, Khanafshar N, Nayeri F (2005) Modeling repeated ordinal responses using
a family of power transformations: Application to neonatal hypothermia data. BMC Medical
Research Methodology, 5, 29.
Zhang D, Davidian M (2001) Linear mixed models with flexible distributions of random effects for
longitudinal data. Biometrics, 57(3), 795–802.
Zhang Z (2016) Modeling error distributions of growth curve models through Bayesian methods.
Behavioral Research, 48, 427–444.
Zhang Z, Hamagami F, Wang L, Grimm K, Nesselroade J (2007) Bayesian analysis of longitudinal
data using growth curve models. International Journal of Behavioral Development, 31(4), 374–383.
Zhang Z, Keke L, Zhenqiu L, Xin T (2014) Bayesian inference and application of robust growth curve
models using student’s t distribution. Structural Equation Modeling, 20, 47–78.
11
Survival and Event History Models
11.1 Introduction
In many applications in the health and social sciences, the response of interest is duration
to a certain event, such as age at first maternity, survival time after diagnosis, or times
spent in different jobs or places of residence. In clinical applications, the interest is typically
in representing and comparing the distribution of times to an event among different
patient groups (e.g. treatment vs control groups) (Brard et al., 2017), whereas in social science
applications, the interest may focus on the impacts of demographic or socioeconomic
attributes on human behaviours.
Typically, durations or event times are not observed for all subjects, either because not
all subjects are followed up, or because for some subjects the event may never occur (e.g. age
at first marriage). So some times are missing or censored, and the times are generally
assumed to be missing at random. The most common form is right-censoring, when
the event has not occurred by the end of the observation period; the unknown failure time
exceeds the subject’s survival time c when observation ceased. A failure time is left cen-
sored at c if its unobserved actual value is less than c (e.g. a population census may record
limiting illness status by current age, but not the age when it commenced). A failure time
is interval censored if it is known only that it lies in the interval (c1,c2).
Distributions of durations or survival times are equivalently described by hazard rates,
also known as failure rates, exit rates, or forces of mortality according to the application. The
modelling of the hazard rate through time may be undertaken parametrically. Alternatively,
one may adopt semiparametric methods, such as assuming piecewise constancy in the
rates within sub-intervals of the observation span (Ibrahim et al., 2001). Pooling strength
through correlated priors is then relevant, as rates in successive intervals tend to be similar.
Imposing smoothness conditions on the baseline hazard also provides stable estimators
when observations are sparse at particular durations (Omori, 2003).
Variations in failure rates between subjects or other units may be explained to a large
degree by observed covariates, the impact of which may also vary over intervals or time.
Selection of covariates may be relevant in particular applications (Lee et al., 2015). However,
unobserved random variations between subjects are present in many applications and may
be modelled by introducing subject level frailty (see Section 11.4). Additionally, duration
times may be hierarchically stratified (e.g. patient survival by hospital or by area of residence;
Austin, 2017). Durations or survival times may also be differentiated by types of possible
exit, as in competing risk analysis (see Section 11.7). One may also consider multivariate
survival outcomes, as in multiple component failure (Damien and Muller, 1998) or in familial
survival studies (Viswanathan and Manatunga, 2001). In such situations, shared frailty mod-
els may account for correlated unobserved variation over different strata or causes of exit.
while the probability of surviving beyond t is S(t) = 1 − F(t) = Pr(T > t). Note that one has
S(∞) = 0, except for applications with a cure fraction (Lambert, 2007). So, the density of T
can be expressed as
$$f(t) = \frac{dF(t)}{dt} = -\frac{dS(t)}{dt}.$$
The chance of an event occurring in a short interval (t, t + dt), given survival to t, is h(t)dt,
where the hazard rate is
$$h(t) = \frac{f(t)}{S(t)} = -\frac{d\log S(t)}{dt}. \tag{11.1}$$
On integrating both sides in (11.1), one obtains the cumulative hazard rate
$$H(t) = \int_0^t h(u)\,du = -\int_0^t \frac{d\log S(u)}{du}\,du = -\log S(t),$$
and so
$$S(t) = \exp[-H(t)] = \exp\left(-\int_0^t h(u)\,du\right).$$
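As a quick numerical check of these identities, the following R sketch integrates an illustrative hazard to obtain H(t) and confirms that exp[−H(t)] reproduces the survivor function. The hazard form h(t) = λκt^(κ−1) and the parameter values are assumptions chosen only for illustration.

```r
# Illustrative hazard h(t) = lambda * kappa * t^(kappa - 1); parameter values are assumed
lambda <- 0.5
kappa  <- 1.5
haz <- function(u) lambda * kappa * u^(kappa - 1)

# Cumulative hazard H(t) by numerical integration of h(u) over (0, t)
cumhaz <- function(t) integrate(haz, lower = 0, upper = t)$value

tgrid  <- c(0.5, 1, 2, 4)
H_num  <- sapply(tgrid, cumhaz)
S_num  <- exp(-H_num)                 # S(t) = exp[-H(t)]
S_true <- exp(-lambda * tgrid^kappa)  # closed-form survivor function for this hazard

print(cbind(t = tgrid, H = H_num, S_numeric = S_num, S_closed = S_true))
```

The two survivor columns should agree up to the tolerance of the numerical integration.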
The hazard function is the central focus for modelling variations in survival. Assume pre-
dictors Zi are available (excluding a constant). Their impact is most simply modelled using
a proportional hazards form (e.g. Kiefer, 1988; Li, 2007)
$$h(t|Z_i) = h_0(t)\exp(Z_i\beta),$$
where h0(t) is known as the baseline hazard, and the regression impact is constant across
time. Letting ηi = Ziβ, the associated survivor function is
$$\begin{aligned}
S(t|Z_i) &= \exp\left(-\int_0^t h(u|Z_i)\,du\right) \\
&= \exp\left(-H_0(t)e^{\eta_i}\right) \\
&= [S_0(t)]^{\exp(\eta_i)} \\
&= \exp\left\{-\exp[\eta_i + \log H_0(t)]\right\},
\end{aligned}$$
where H0(t) is the integrated baseline hazard. The proportional hazard assumption is often
restrictive, though Yin and Ibrahim (2006) show the proportional hazard model (PHM) may
be nested in a broader class of transformation hazard models, with parameter 0 ≤ γ ≤ 1 and
$$h(t|Z_i) = \left[h_0(t)^{\gamma} + \exp(Z_i\beta)\right]^{1/\gamma},$$
which reduces to the proportional model when γ = 0 and to an additive model when γ = 1.
Consider an absorbing (non-repeatable) type of exit, and let di = 1 for an observed exit
and di = 0 for a censored time. Assuming censoring is non-informative, the likelihood
contribution for subject i is f(ti|Zi) if di = 1, and S(ti|Zi) if di = 0. The likelihood
contribution may therefore be expressed in equivalent form as
$$h(t_i|Z_i)^{d_i} S(t_i|Z_i) = f(t_i|Z_i)^{d_i} S(t_i|Z_i)^{1-d_i}.$$
For a PHM, the likelihood contribution also may be written (Aitkin and Clayton, 1980;
Orbe and Nunez-Anton, 2006) as
$$\left[h_0(t_i)\exp\{Z_i\beta - H_0(t_i)\exp(Z_i\beta)\}\right]^{d_i}\left[\exp\{-H_0(t_i)\exp(Z_i\beta)\}\right]^{1-d_i} = \left(\mu_i^{d_i}e^{-\mu_i}\right)\left(\frac{h_0(t_i)}{H_0(t_i)}\right)^{d_i} \tag{11.2}$$
where
$$\mu_i = H_0(t_i)\exp(Z_i\beta), \tag{11.3}$$
and the second bracketed term in (11.2) depends only on the baseline hazard and is inde-
pendent of β. The first term in (11.2) is the kernel of a Poisson likelihood for the event status
indicators di ∼ Po(μi). From (11.3), the corresponding log-linear model is
$$\log(\mu_i) = \log H_0(t_i) + Z_i\beta, \tag{11.4}$$
where log(H0(ti)) is an offset using the observed time, whether censored or uncensored.
Equivalently
where $\Lambda(t) = \int_0^t \lambda(u)\,du$ is the integrated intensity, with Λ(t) = E(N(t)).
The intensity is equal to the hazard while the subject or system is still under obser-
vation, that is, still at risk, but is zero when the event has happened (when the event is
non-repeatable), or when a sequence of (repeatable) events has finished. An example of
the latter might be when a repairable system subject to repeated breakdowns is finally
decommissioned – see Watson et al. (2002) for a counting process analysis of failure times
of water pipes. Let Y(t) = I (T ≥ t) denote the at-risk indicator, then
$$\lambda(t) = Y(t)h(t).$$
This representation of the intensity function generalises to include predictors and random
effects (or frailties). So for proportional hazards, effects of predictor Zi would be included via
$$\lambda_i(t|Z_i) = Y_i(t)h_0(t)\exp(Z_i\beta).$$
One may then compare observed and predicted counts via the Martingale residual at t,
defined as
$$M_i(t) = N(t_i) - \Lambda_0(t_i|Z_i) = N(t_i) - \int_0^{t_i} Y_i(u)\exp(Z_i\beta)\,dH_0(u).$$
The total residual Mi = Mi(∞) for a subject with observation time ti is obtainable for a
non-repeatable event, and event indicator di, as
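$$M_i = d_i - \Lambda_0(t_i|Z_i).$$
The corresponding deviance residual is
$$r_i = \mathrm{sgn}(M_i)\sqrt{2\left[-M_i - N_i(\infty)\log\left(\frac{N_i(\infty) - M_i}{N_i(\infty)}\right)\right]}.$$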
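To make the two residuals concrete, the sketch below computes Martingale and deviance residuals for a non-repeatable event under the simplest possible setting: an exponential model without covariates, fitted by maximum likelihood to simulated data. The data-generating values and all variable names are assumptions for illustration, not part of any example in the text.

```r
set.seed(1)
n     <- 200
t_ev  <- rexp(n, rate = 0.2)          # latent event times (assumed rate)
t_cen <- runif(n, 0, 10)              # censoring times (assumed)
t_obs <- pmin(t_ev, t_cen)
d     <- as.numeric(t_ev <= t_cen)    # 1 = observed exit, 0 = censored

# Exponential model without covariates: rate estimate and cumulative hazard at t_i
lambda_hat <- sum(d) / sum(t_obs)
Lambda     <- lambda_hat * t_obs

# Martingale residual M_i = d_i - Lambda_i (at most 1, unbounded below)
M <- d - Lambda

# Deviance residual with N_i(inf) = d_i; the log term is d*log((d - M)/d),
# which equals log(Lambda_i) when d_i = 1 and is taken as 0 when d_i = 0
log_term <- ifelse(d == 1, log(Lambda), 0)
r <- sign(M) * sqrt(pmax(0, -2 * (M + log_term)))

summary(M)
summary(r)
```

The deviance transformation reduces the skewness of the Martingale residuals, which is why it is often preferred for spotting poorly fitted subjects.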
11.2.2 Parametric Hazards
The hazard rate h(t) is called “duration dependent” if its value changes over t. Under
negative duration dependence (often observed in occupational or residential careers),
h(t) decreases with time. In practice, plots of survivor proportions are often jagged with
respect to time, and semiparametric or non-parametric methods for representing the haz-
ard function reflect this. However, parametric lifetime models are also often applied to
test whether certain basic features of duration dependence are supported by the data; see
http://rstudio-pubs-static.s3.amazonaws.com/5560_c24449c468224fd4af9f3e512a24e07d.html
for a discussion of exploratory graphical comparison of parametric and non-parametric
approaches in R.
The simplest parametric model is the exponential model, under which the leaving rate is
constant, defining a stationary process with hazard
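$$h(t) = \lambda,$$
and density
$$f(t|\lambda) = \lambda\exp(-\lambda t).$$
Allowing for censoring, the likelihood contribution of subject i is
$$f(t_i|\lambda, d_i) = \lambda^{d_i}\exp(-\lambda t_i).$$
Predictors may be included via
$$h(t|Z_i) = \lambda e^{Z_i\beta},$$
or equivalently, with β0 = log(λ),
$$h(t|Z_i) = e^{\beta_0 + Z_i\beta}.$$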
Equivalently, under the Poisson likelihood approach of Aitkin and Clayton (1980), one has,
for event indicators di,
$$d_i \sim \mathrm{Po}(\mu_i),$$
$$\log(\mu_i) = \log(\lambda t_i) + Z_i\beta,$$
or equivalently
$$\log(\mu_i) = \log(t_i) + \beta_0 + Z_i\beta.$$
This Poisson likelihood device can be used in piecewise exponential models as considered
below.
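The Poisson device can be checked directly in R. The sketch below simulates right-censored exponential times with one binary predictor, fits a Poisson regression to the event indicators with offset log(ti), and compares the result with the closed-form exponential maximum likelihood estimates (events divided by exposure in each group). The simulated data and the true values of β0 and β are assumptions for illustration.

```r
set.seed(42)
n <- 500
z <- rbinom(n, 1, 0.5)                       # single binary predictor (assumed)
t_ev  <- rexp(n, rate = exp(-1 + 0.7 * z))   # true b0 = -1, b = 0.7 (assumed)
t_cen <- rexp(n, rate = 0.2)                 # independent censoring times (assumed)
t_obs <- pmin(t_ev, t_cen)
d     <- as.numeric(t_ev <= t_cen)           # event indicators

# Poisson regression of the event indicators, with log(t_i) as offset
fit <- glm(d ~ z + offset(log(t_obs)), family = poisson)

# Exponential MLE in closed form: events / total exposure within each group
rate0 <- sum(d[z == 0]) / sum(t_obs[z == 0])
rate1 <- sum(d[z == 1]) / sum(t_obs[z == 1])
direct <- c(b0 = log(rate0), b = log(rate1 / rate0))

print(rbind(poisson_glm = unname(coef(fit)), exponential_mle = direct))
```

The two rows coincide because the Poisson kernel for di with mean λti exp(Ziβ) is the exponential survival likelihood.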
Another commonly used parametric form is the Weibull (e.g. Thamrin et al., 2013), with
scale parameter λ and shape κ, namely
$$h(t|\lambda,\kappa) = \lambda\kappa t^{\kappa-1},$$
so that
$$S(t|\lambda,\kappa) = \exp[-\lambda t^{\kappa}],$$
$$f(t|\lambda,\kappa) = \lambda\kappa t^{\kappa-1}\exp[-\lambda t^{\kappa}].$$
The Weibull hazard rate is monotonic, with positive duration dependence if κ > 1 (and if
the 95% credible interval excludes 1), and negative dependence if κ < 1. Impacts of covari-
ates, including an intercept, on the hazard, are represented via
$$h(t_i|\lambda,\kappa,Z_i,\beta) = e^{\beta_0 + Z_i\beta}\kappa t_i^{\kappa-1}.$$
The generalised gamma density (Stacy, 1962; Cox and Matheson, 2014) is of interest in
including the Weibull, gamma, and lognormal as special cases. This has various param-
eterisations, with BUGS using the representation
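$$f(t|\alpha,\lambda,\gamma) = \frac{\gamma}{\Gamma(\alpha)}\lambda^{\alpha\gamma}t^{\alpha\gamma-1}\exp[-(\lambda t)^{\gamma}],$$
as in Morris et al. (1994) (see Example 11.1 below). Instead, one may take B = λ^γ, leading to
$$f(t|\alpha,\lambda,\gamma) = \frac{\gamma}{\Gamma(\alpha)}B^{\alpha}t^{\alpha\gamma-1}\exp[-t^{\gamma}B],$$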
where setting γ = 1 gives the gamma, α = 1 gives the Weibull, and α → ∞ provides the
lognormal. The survival function is 1 − I(α, t^γ B), where
$I(a, x) = (1/\Gamma(a))\int_0^x u^{a-1}e^{-u}\,du$ is the incomplete gamma ratio. Cox and
Matheson (2014) consider a parameterisation involving δ = γ^0.5, allowed to take both positive
and negative values, with the sign of δ leading to different survivor functions.
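The special cases and the survivor function of the generalised gamma can be verified numerically. The sketch below codes the density in the B = λ^γ form and uses the fact that B t^γ follows a Gamma(α, 1) distribution, so that the survivor function is an upper incomplete gamma ratio available through pgamma. The function names dgengamma and sgengamma and the parameter values are hypothetical, introduced only for this check.

```r
# Generalised gamma density in the B = lambda^gamma form; names are illustrative
dgengamma <- function(t, alpha, B, gam) {
  gam / gamma(alpha) * B^alpha * t^(alpha * gam - 1) * exp(-B * t^gam)
}
# Survivor function: B * t^gam follows a Gamma(alpha, 1) distribution
sgengamma <- function(t, alpha, B, gam) {
  pgamma(B * t^gam, shape = alpha, lower.tail = FALSE)
}

alpha <- 2; B <- 0.8; gam <- 1.3; t0 <- 1.7   # assumed values

# Check 1: survivor function equals the integral of the density from t0 to infinity
S_int <- integrate(dgengamma, lower = t0, upper = Inf,
                   alpha = alpha, B = B, gam = gam)$value
S_cf  <- sgengamma(t0, alpha, B, gam)

# Check 2: alpha = 1 gives the Weibull (R's dweibull uses scale = B^(-1/gam))
w_gg <- dgengamma(t0, alpha = 1, B = B, gam = gam)
w_r  <- dweibull(t0, shape = gam, scale = B^(-1 / gam))

# Check 3: gam = 1 gives the gamma density with shape alpha and rate B
g_gg <- dgengamma(t0, alpha = alpha, B = B, gam = 1)
g_r  <- dgamma(t0, shape = alpha, rate = B)

print(c(S_int = S_int, S_cf = S_cf, w_gg = w_gg, w_r = w_r, g_gg = g_gg, g_r = g_r))
```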
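library(survival)
# Kaplan-Meier estimate; survfit() expects a formula, so use an intercept-only model
KMinputs <- Surv(survdat$time, survdat$status)
KM <- survfit(KMinputs ~ 1)
# log(-log S(t)) against log(t); approximately linear under a Weibull model
# (drop times where the estimate is 0 or 1, for which the transform is infinite)
ok <- KM$surv > 0 & KM$surv < 1
plot(log(KM$time[ok]), log(-log(KM$surv[ok])), type="S")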
However, many processes exhibit peaks in exit rates; for example, the rate may at first
increase, but after reaching a peak, tail off again (Gore et al., 1984; Shao and Zhou, 2004).
Parametric models accommodating such a pattern include the log-logistic model and the
sickle model (Bennett, 1983; Brüderl and Diekmann, 1995; Diekmann and Mitter, 1983). The
log-logistic density has hazard
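$$h(t) = \lambda\kappa t^{\kappa-1}\left[1 + \lambda t^{\kappa}\right]^{-1},$$
and survivor function
$$S(t) = \left[1 + \lambda t^{\kappa}\right]^{-1},$$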
where all parameters are positive, and the scale parameter λ can be adapted to model the
impact of predictors; see Li (1999) for a Bayesian application to Chapter 11 bankruptcies.
An alternative common parameterisation (Florens et al., 1995) sets λ = ν^κ, so that
$$h(t) = \frac{\nu^{\kappa}\kappa t^{\kappa-1}}{1 + (\nu t)^{\kappa}}. \tag{11.5}$$
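The non-monotonic shape of the log-logistic hazard can be illustrated numerically. Setting the derivative of log h(t) to zero gives a peak at t* = ((κ − 1)/λ)^(1/κ) when κ > 1; the sketch below confirms this with a one-dimensional search and also checks that the (λ, κ) and (ν, κ) forms agree. The parameter values are assumptions for illustration.

```r
# Log-logistic hazard in the (lambda, kappa) form; values are assumed for illustration
lambda <- 0.3
kappa  <- 2.5
haz <- function(t) lambda * kappa * t^(kappa - 1) / (1 + lambda * t^kappa)

# For kappa > 1 the hazard peaks where lambda * t^kappa = kappa - 1
t_peak_formula <- ((kappa - 1) / lambda)^(1 / kappa)

# Numerical confirmation by one-dimensional maximisation
t_peak_numeric <- optimize(haz, interval = c(0.01, 20),
                           maximum = TRUE, tol = 1e-8)$maximum

# Equivalent nu parameterisation with lambda = nu^kappa
nu <- lambda^(1 / kappa)
haz_nu <- function(t) nu^kappa * kappa * t^(kappa - 1) / (1 + (nu * t)^kappa)

print(c(formula = t_peak_formula, numeric = t_peak_numeric))
```

For κ ≤ 1 there is no interior peak and the hazard declines monotonically.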
The sickle model has corresponding functions
$$r = \lim_{t\to\infty} S(t) = \exp\left(-\int_0^{\infty} h(u)\,du\right)$$
11.2.3 Accelerated Hazards
In contrast to the proportional hazard model with h(ti |Zi ) = h0 (ti )exp(Zi b ), in an acceler-
ated failure time (AFT) model the explanatory variates are assumed to act multiplicatively
on time (Wei, 1992; Swindell, 2009; Rivas-López et al., 2014). AFT models focus on the effect
of the explanatory variates on the survival function, rather than on the hazard function as
under proportional hazards models. With Bi = exp(Ziβ), one has
$$h(t_i|Z_i) = B_i h_0(B_i t_i), \qquad S(t_i|Z_i) = S_0(B_i t_i),$$
and the effect of the predictors Zi on survival time is more direct, acting to accelerate
or decelerate the time to failure. To illustrate this in the case of a treatment comparison,
assume Zi excludes an intercept, and that the baseline hazard includes a scale parameter
to model the mean hazard (e.g. the parameter λ in exponential and Weibull models). Also
assume a single predictor such as zi = 1 for a new treatment and zi = 0 for control. Then,
with Bi = exp(βzi) = exp(β) (= ϕ) for a treated subject, one has a hazard ϕh0(ϕti) and a survivor
function S(ϕti) for a treated subject, but a hazard h0(ti) and a survivor function S(ti) for a control
subject. So, the lifetime under the new treatment is 1/ϕ times the lifetime under the control
regime, with ϕ > 1 accelerating failure.
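The time-scaling interpretation can be checked with a small deterministic example. With a treated survivor function S(ϕt), the probability of surviving beyond t under treatment equals the control probability of surviving beyond ϕt, so every quantile of the treated lifetime is the control quantile divided by ϕ. The sketch below verifies this for the median, taking a Weibull-type baseline; the baseline form and all parameter values are assumptions for illustration.

```r
# Baseline (control) survivor function S0(t) = exp(-lambda * t^kappa); values assumed
lambda <- 0.1
kappa  <- 1.4
S0 <- function(t) exp(-lambda * t^kappa)

beta <- log(2)                          # assumed treatment coefficient
phi  <- exp(beta)                       # phi = 2
S_treated <- function(t) S0(phi * t)    # AFT: time is rescaled by phi

# Median lifetimes, solving S(m) = 0.5
med_control <- (log(2) / lambda)^(1 / kappa)
med_treated <- uniroot(function(t) S_treated(t) - 0.5,
                       interval = c(1e-6, 100), tol = 1e-10)$root

ratio <- med_treated / med_control      # should equal 1 / phi
print(c(med_control = med_control, med_treated = med_treated, ratio = ratio))
```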
More inclusive schemes are possible. For example, defining Gi = exp(Ziγ), one has
which includes the AFT and PHM forms as special cases (Chen and Jewell, 2001). For
example, for the log-logistic density, this would imply
$$h(t_i|Z_i) = B_i\lambda\kappa(t_i G_i)^{\kappa-1}\left[1 + B_i\lambda(t_i G_i)^{\kappa}\right]^{-1}.$$
Apart from avoiding the assumption of proportional hazards, the AFT approach has the
advantage of a direct regression form which may be useful in modelling nonlinear effects
of predictors (Orbe and Nunez-Anton, 2006). Let Zi be of dimension p, and Ti denote the
completed failure time which for censored subjects is unobserved. Then Ti = ti when di = 1
but Ti > ti when di = 0, so truncated sampling with the censored time as the lower limit is
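necessary. The regression formulation is then
$$\log(T_i) = Z_i\gamma + \sigma u_i,$$
where σ is a scale parameter, and the errors are defined by the survivor function. For example,
taking u to be standard normal gives a lognormal failure time density with
$$S(t_i) = 1 - \Phi\left(\frac{\log(t_i) - Z_i\gamma}{\sigma}\right).$$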
Taking u to be standard logistic, with density $p(u) = e^u/(1 + e^u)^2$, corresponds to a
log-logistic failure time density with
$$S(t_i) = \left\{1 + \exp\left(\frac{\log(t_i) - Z_i\gamma}{\sigma}\right)\right\}^{-1}$$
with σ corresponding to the inverse of the shape parameter κ. Finally, consider a Weibull
density for failure times with hazard $h(t_i|Z_i) = \lambda\kappa t_i^{\kappa-1}\exp(\beta Z_i)$ where Zi excludes a
constant term. Taking u to follow a standard extreme value density, namely $p(u) = \exp(u - e^u)$,
the AFT regression takes the form (Keiding et al., 1997)
$$\log(T_i) = -\frac{\log\lambda}{\kappa} - z_{1i}\frac{\beta_1}{\kappa} - \ldots - z_{pi}\frac{\beta_p}{\kappa} + \frac{u_i}{\kappa},$$
so that γj = −βj/κ.
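The link between the proportional hazards and AFT coefficients in the Weibull case can be illustrated by simulation. The sketch below draws uncensored Weibull times by inverting S(t|z) = exp(−λ exp(βz) t^κ), so that log(T) = [u − log λ − βz]/κ with u = log(−log U) following the standard extreme value density, and then regresses log(T) on z by least squares. The true values, the sample size, and the use of lm in place of a full likelihood fit are assumptions made for a simple check.

```r
set.seed(123)
n      <- 20000
lambda <- 0.5; kappa <- 1.5; beta <- 0.8   # assumed true values
z <- rbinom(n, 1, 0.5)                     # single binary predictor (assumed)

# Inverse-CDF draw from S(t|z) = exp(-lambda * exp(beta * z) * t^kappa)
U  <- runif(n)
Tt <- (-log(U) / (lambda * exp(beta * z)))^(1 / kappa)

# Least squares regression of log(T) on z recovers the AFT coefficient -beta/kappa
fit <- lm(log(Tt) ~ z)
slope_expected <- -beta / kappa

# E(u) = digamma(1) = -0.5772 for the standard extreme value error
intercept_expected <- (-log(lambda) + digamma(1)) / kappa

print(rbind(estimated = unname(coef(fit)),
            expected  = c(intercept_expected, slope_expected)))
```

A positive β (higher hazard) therefore corresponds to a negative γ (shorter lifetime), scaled by the shape parameter.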
$$\lambda_i = \exp(Z_i\beta),$$
with density
This is equivalent to accelerated hazards regression for logged length of stay with
error ui
$$\log(t_i) = Z_i\gamma + \sigma u_i,$$
where γ = −β/κ. The γ coefficients express influences on length of stay (i.e. survival)
while the β coefficients express influences on mortality.
We consider rstan estimation for the PH Weibull model, with both priors and likeli-
hoods represented using the target += option, and an implicit flat prior on the regres-
sion coefficients. For the Weibull analysis, the log likelihood for censored cases is
provided by the weibull_lccdf function in rstan, namely the log of the Weibull comple-
mentary cumulative distribution function for the response ti. It may be noted that the
rstan parameterisation of the Weibull is
$$h(t|\lambda_i,\kappa) = \frac{\kappa}{\lambda_i}\left(\frac{t_i}{\lambda_i}\right)^{\kappa-1},$$
differing from that in BUGS and JAGS, and a re-expression of the regression term ηi = Ziβ
is needed to achieve results in line with the parameterisation (11.6) (Buros, 2016). Thus,
one obtains the code elements
The Weibull shape parameter κ has a posterior 95% credible interval entirely under
1, so mortality is associated with shorter stays (sometimes denoted negative duration
dependence). This feature, combined with a negative age effect (albeit not significant),
may reflect varying frailty (selection effects). Other regression coefficient estimates
(Table 11.1) for the treatment and attribute variables replicate those of Morris et al.
(1994). Health status is measured in terms of dependency in activities of daily living;
with health=2 if there are four or fewer activities with dependence (reference category),
health=3 for five dependencies, health=4 for six dependencies, and health=5 if there
were special medical conditions requiring extra care. It can be seen that higher ADL
(activities of daily living) dependency is associated with earlier mortality and shorter
stays. The LOO-IC for the Weibull model is 16463, with κ estimated as 0.61. Estimation
using the AFT form gives the same result.
TABLE 11.1
Nursing Home Stays. Parameter Posterior Summary

                 Weibull PH Model             Generalised Gamma AFT        Weibull AFT
                 (Influences on Leaving NH)   (Influences on Stay Length)  (Influences on Stay Length)
                 Mean     2.5%     97.5%      Mean     2.5%     97.5%      Mean     2.5%     97.5%
Age (a)          −0.45    −1.18    0.30       1.27     0.06     2.57       0.71     −0.49    1.93
Treatment        −0.13    −0.24    −0.02      0.11     −0.09    0.28       0.21     0.02     0.39
Male             0.35     0.22     0.48       −0.58    −0.80    −0.34      −0.57    −0.79    −0.35
Married          0.16     0.01     0.31       −0.22    −0.50    0.07       −0.26    −0.51    −0.01
ADL Status 3     −0.03    −0.19    0.13       0.07     −0.10    0.27       0.05     −0.21    0.28
ADL Status 4     0.23     0.08     0.39       −0.44    −0.61    −0.30      −0.38    −0.64    −0.14
ADL Status 5     0.53     0.33     0.74       −0.85    −1.20    −0.44      −0.88    −1.20    −0.54
κ                0.61     0.58     0.64
α                                             8.60     7.53     9.25
γ                                             0.18     0.17     0.19
σ                                                                          1.64     1.57     1.72
LOO-IC           16463                        16364                        16463
(a) Actual age divided by 100.
To assess poorly fitted observations, one may implement different forms of residual.
Here the Martingale residual and the normal deviate residual are obtained, with simu-
lation as in Nardi and Schemper (2003) to obtain estimated normal deviate residuals for
censored observations. These two residuals have a correlation of 0.94, and high negative
values on both highlight subjects (e.g. 1,589 and 1,596) with long lengths of stay despite
high ADL dependency.
There is evidence of redundancy among the predictors used above, and covariate
selection or shrinkage priors could be applied (e.g. Zhang et al., 2018). The impact of
the latter can be demonstrated simply by an application of the BayesMixSurv package,
which estimates a two-component discrete mixture of Weibull regressions, but allows
a single component option. Lasso shrinkage priors are assumed for regression coef-
ficients in this package. Thus, defining an event indicator (endstay=1-censored) as the
complement of the censoring indicator, and defining agec=age/100, one has
C1=bayesmixsurv(Surv(time,endstay)~trt+agec+marstat+gender+hltst3+hltst4+hltst5,
D, control=bayesmixsurv.control(iter=1000,single=T))
This shows an age coefficient much closer to zero (with posterior mean −0.03) than
obtained using the flat prior in the rstan code.
A Weibull accelerated failure time regression can be obtained by simply replacing
nu[i] = exp(−eta[i]/kappa) by nu[i] = exp(eta[i]) in the preceding code. As in Morris et al.
(1994), we compare Weibull and generalised gamma AFT regressions. The latter requires
specific functions in the rstan code to define the density and survivor likelihoods.
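The re-expression nu[i] = exp(−eta[i]/kappa) used for the proportional hazards form can be checked outside Stan. R's dweibull and pweibull use the same shape-scale form as the rstan hazard quoted earlier, so setting the scale to exp(−η/κ) should reproduce the hazard exp(η)κt^(κ−1). In the sketch below, the values of eta and t0 are hypothetical, and kappa is simply set near the posterior mean quoted in the text.

```r
kappa <- 0.61                 # shape, set near the posterior mean quoted in the text
eta   <- c(-0.5, 0, 0.8)      # hypothetical values of the linear predictor
t0    <- 3                    # an arbitrary time point

# PH form: h(t) = exp(eta) * kappa * t^(kappa - 1)
haz_ph <- exp(eta) * kappa * t0^(kappa - 1)

# Shape-scale form with scale nu = exp(-eta / kappa); hazard = density / survivor
nu <- exp(-eta / kappa)
haz_scale <- dweibull(t0, shape = kappa, scale = nu) /
  pweibull(t0, shape = kappa, scale = nu, lower.tail = FALSE)

# Log survivor function, the R analogue of a log complementary CDF for censored cases
logS <- pweibull(t0, shape = kappa, scale = nu, lower.tail = FALSE, log.p = TRUE)

print(cbind(eta, haz_ph, haz_scale, logS))
```

Under the AFT choice nu[i] = exp(eta[i]), the linear predictor instead acts directly on the log of the scale, and hence on log survival time.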
Convergence issues have been noted for the generalised gamma, even under maxi-
mum likelihood, though convergence may be improved by fixing one of the extra gen-
eralised gamma parameters (Lawless, 1980). Estimates here are based on a single chain
run of 5,000 iterations in rstan, at which point SRFs for the hyperparameters are 1.3 or
less. Table 11.1 shows a more pronounced effect of age, and a diminished treatment
effect, under the generalised gamma, which produces a lower LOO-IC (16364) than the
Weibull AFT. The estimate of α, with 95% CRI (7.5,9.3), suggests the lognormal may be
preferred to the Weibull.
11.3 Semiparametric Hazards
In the proportional hazards model
$$h(t|Z_i) = h_0(t)\exp(Z_i\beta),$$
it may be difficult to choose a parametric form for the baseline hazard h0(t), and semipara-
metric or non-parametric approaches are often preferable. These have benefits in avoiding
possible mis-specification of parametric hazard forms, and in facilitating other aspects
of hazard regression, such as time-varying predictor effects (Gamerman, 1991). Such
approaches have been applied to the cumulative hazard, and implemented in counting
process models (Clayton, 1991). However, they may also be specified for the baseline haz-
ard h0 itself (e.g. Gamerman, 1991; Sinha and Dey, 1997) and typically use only information
about the time intervals in which exit occurs.
Consider a partition of the response time scale into J intervals (a0, a1], …, (aJ−1, aJ], where
aJ equals or exceeds the largest observed time, censored or uncensored (Ibrahim et al.,
2001, p.106). The partition scheme can be based on distinct values in the profile of observed
times {t1, …, tn}, whether censored or not, or by siting knots aj at selected points in the
range (tmin,tmax). Yin and Ibrahim (2006, p.173) propose that the partitioning should ensure
an approximately equal number of failures in each of the J intervals, with each interval
containing at least one failure. Among alternatives are knots sited at the ((j − 1)/J)th quantiles
of observed times (Gustafson et al., 2000), or evenly spaced along the range of the observed
t values. As the number of intervals J tends to infinity, a truly non-parametric model is
obtained, but is not likely to be empirically well identified (Lopes et al., 2007).
Different approaches may be based on the assumption that the baseline hazard is con-
stant within each interval. Thus Ibrahim et al. (1999) and Ibrahim et al. (2001, p.55) consider
discrete approximation to the gamma process of Dykstra and Laud (1981). This involves a
prior on the increments
$$\Delta_j = h_0(a_j) - h_0(a_{j-1}), \quad j = 1, \ldots, J,$$
with survivor function
$$S(t|Z_i) = \exp\left(-B_i\int_0^t h_0(u)\,du\right) \approx \exp\left(-B_i\sum_{j=1}^{J}\Delta_j(t - a_{j-1})_+\right),$$
where Bi = e Zi b , and (u)+ = u if u > 0 and is zero otherwise. The probability of exit in interval
j is then
$$q_j = S(a_{j-1}) - S(a_j) = \left[\exp\left\{-B_i\sum_{m=1}^{j-1}\Delta_m(a_{j-1} - a_{m-1})\right\}\right]\left[1 - \exp\left\{-B_i(a_j - a_{j-1})\sum_{m=1}^{j}\Delta_m\right\}\right].$$
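The product form for the exit probabilities follows algebraically from the survivor function in terms of the increments, and this can be confirmed numerically. The sketch below evaluates qj both as a direct difference S(aj−1) − S(aj) and from the product form; the cut points, increments, and subject-level multiplier are assumed values for illustration.

```r
a     <- c(0, 1, 2.5, 4, 6)          # cut points a_0, ..., a_J (assumed), so J = 4
Delta <- c(0.05, 0.02, 0.04, 0.01)   # hazard increments Delta_1, ..., Delta_J (assumed)
B_i   <- exp(0.3)                    # exp(Z_i * beta) for one subject (assumed)
J     <- length(Delta)

# S(t) = exp(-B_i * sum_j Delta_j * (t - a_{j-1})_+); note a[j] holds a_{j-1}
surv <- function(t) exp(-B_i * sum(Delta * pmax(t - a[1:J], 0)))

# Exit probabilities directly as S(a_{j-1}) - S(a_j)
S_cut    <- sapply(a, surv)
q_direct <- S_cut[1:J] - S_cut[2:(J + 1)]

# Exit probabilities from the product form
q_prod <- sapply(1:J, function(j) {
  m_prev <- seq_len(j - 1)           # empty for j = 1, giving a first factor of 1
  first  <- exp(-B_i * sum(Delta[m_prev] * (a[j] - a[m_prev])))
  second <- 1 - exp(-B_i * (a[j + 1] - a[j]) * sum(Delta[1:j]))
  first * second
})

print(rbind(q_direct, q_prod))
```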
11.3.1 Piecewise Exponential Priors
Piecewise exponential (PE) priors (Ibrahim et al., 2001; Bender et al., 2018; Sinha et al., 1999;
Demarqui et al., 2008; Brezger et al., 2005) are one approach to estimating the hazard func-
tion without specifying the hazard parametrically, though semiparametric approaches
avoiding the simplifying PE assumptions have been proposed (Murray et al., 2016; Marano
et al., 2016). The PE prior specifies a baseline parameter λj for each interval, possibly
combined with interval-specific regression parameters βj, so that
$$h(t|Z_i) = \lambda_j\exp(Z_i\beta_j), \quad a_{j-1} < t \le a_j,$$
where Zi excludes an intercept. Let Bij = exp(Ziβj). For a subject surviving beyond the jth
interval, namely with ti > aj, the likelihood contribution during interval j is
$$\exp(-\lambda_j(a_j - a_{j-1})B_{ij}).$$
For a subject with aj−1 < ti ≤ aj, either failing (dij = 1) in interval j, or censored but nevertheless
exiting (dij = 0) in the jth interval, the likelihood contribution is
$$[\lambda_j B_{ij}]^{d_{ij}}\exp(-\lambda_j(t_i - a_{j-1})B_{ij}).$$
So, a Poisson likelihood approach may be applied as in (11.2)–(11.4), with responses yij
defined by the event type in each interval, and with offsets Δij, defined according to
whether the subject survives the interval (see Example 11.2).
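One way to set up the interval-level responses and exposures needed for this Poisson device is sketched below. The helper pe_expand is hypothetical (it is not a function from any package), and the simulated data and cut points are assumptions. Without covariates, the Poisson fit with one parameter per interval should reproduce the familiar piecewise exponential estimate, events divided by exposure in each interval.

```r
# Expand survival data into one record per subject and interval entered.
# pe_expand is a hypothetical helper written for this illustration.
pe_expand <- function(time, status, cuts) {
  J <- length(cuts) - 1
  lower <- cuts[1:J]
  upper <- cuts[2:(J + 1)]
  out <- lapply(seq_along(time), function(i) {
    at_risk  <- time[i] > lower                          # intervals the subject enters
    exposure <- pmin(time[i], upper) - lower             # time spent in each interval
    y <- as.numeric(status[i] == 1 & time[i] <= upper)   # event occurs in this interval
    data.frame(id = i, interval = 1:J, y = y, exposure = exposure)[at_risk, ]
  })
  do.call(rbind, out)
}

set.seed(7)
n      <- 300
time   <- pmin(rexp(n, 0.15), 12)     # administrative censoring at 12 (assumed)
status <- as.numeric(time < 12)
cuts   <- c(0, 2, 4, 8, 12)           # a_0, ..., a_J (assumed)
long   <- pe_expand(time, status, cuts)

# Poisson regression with log exposure as offset and a separate rate per interval
fit <- glm(y ~ 0 + factor(interval) + offset(log(exposure)),
           family = poisson, data = long)
lambda_glm    <- unname(exp(coef(fit)))
lambda_direct <- as.numeric(tapply(long$y, long$interval, sum) /
                            tapply(long$exposure, long$interval, sum))

print(rbind(lambda_glm, lambda_direct))
```

Predictors and smoothing priors on the log rates can then be added to the same expanded data structure.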
The successive baseline parameters λj are likely to be correlated, but also possibly to
show erratic fluctuations or be imprecisely estimated if treated as fixed effects. Hence, a
smoothing prior is indicated. One might assume a parametric model (e.g. polynomial in
j) but allowing for additional random variation. Thus, Albert and Chib (2001) and Omori
(2003) assume a polynomial for αj = log(λj), whereby
$$\alpha_j = \psi_0 + \psi_1(j-1) + \psi_2(j-1)^2 + u_j,$$
$$\alpha_j \sim N(\alpha_{j-1}, \sigma_\alpha^2\delta_j),$$
with α1 a separate fixed effect, and with τα = 1/σα² following a gamma prior. Alternatively,
as in Gustafson et al. (2003), one may take
$$w_j = 0.5(a_j + a_{j+1}),$$
$$z_j = w_j - w_{j-1},$$
$$\alpha_j \sim N\left(\alpha_{j-1} + (\alpha_{j-1} - \alpha_{j-2})\frac{z_j}{z_{j-1}},\ \sigma_\alpha^2(z_j/\bar{z})^2\right).$$
Since setting particular partitions of the time scale involves an element of arbitrariness,
Sahu and Dey (2004) apply RJMCMC (reversible jump MCMC) techniques in which J is an
additional unknown; they specify a sparse precision matrix formulation for the joint prior
for the (α1, …, αJ) under an RW1 prior for particular J values.
Because random walk priors of degree r set a mean level not on the αj themselves, but on
differences of order r (e.g. an RW1 prior specifies a zero mean for αj − αj−1), identifiability
may require that a separate regression intercept is omitted or that the αj are centred to
sum to zero at each MCMC iteration, by the operation α′j = αj − ᾱ. Alternatives are to set
any value, say the hth, to zero (by the operation α′j = αj − αh at each iteration), or set the first
effect α1 to zero (Sahu and Dey, 2004).
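The identifiability issue can be seen in a few lines of R. Because an RW1 prior depends only on successive differences, adding any constant to all the αj leaves the prior density unchanged, so the overall level is confounded with a regression intercept unless a constraint is imposed. The sketch below simulates one RW1 sequence and applies the two devices just described; the values of J and the random walk standard deviation are assumptions.

```r
set.seed(11)
J <- 20
sigma_alpha <- 0.2                       # assumed random walk standard deviation

# RW1: differences alpha_j - alpha_{j-1} ~ N(0, sigma_alpha^2); the level is arbitrary
alpha <- cumsum(c(0, rnorm(J - 1, 0, sigma_alpha)))

# Log prior density based on first differences only
rw1_logdens <- function(a, s) sum(dnorm(diff(a), 0, s, log = TRUE))

# Shifting every alpha_j by a constant leaves the RW1 density unchanged
same_density <- isTRUE(all.equal(rw1_logdens(alpha, sigma_alpha),
                                 rw1_logdens(alpha + 5, sigma_alpha)))

# Device 1: centre the effects to sum to zero
alpha_centred <- alpha - mean(alpha)

# Device 2: set a chosen effect (here the first, h = 1) to zero
h <- 1
alpha_corner <- alpha - alpha[h]

print(c(same_density = same_density, sum_centred = sum(alpha_centred)))
```

Either device fixes the level while leaving the differences, and hence the smoothness of the sequence, untouched.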
A gamma prior on the baseline hazard rates λj is also possible (Arjas and Gasbarra, 1994),
namely
$$\lambda_j \sim \mathrm{Ga}(b, b/\lambda_{j-1}),$$
where λ1 is a separate positive effect, and larger b values lead to smoother sequences of λj.
The same identifiability issues obtain as for αj = log(λj) and devices such as normalisation
of the λj (to value 1) at each iteration may be applied.
Piecewise priors may also be used to model non-constant predictor effects, though typi-
cally values of time-varying regression coefficients βj in successive intervals are expected
to be close (Sinha et al., 1999). Sargent (1998) considers alternative gamma priors for the
precision 1/σβ² in a random walk prior
$$\beta_j \sim N(\beta_{j-1}, \sigma_\beta^2\delta_j).$$
Prior knowledge in this application (from the Veterans Administration lung cancer trial)
suggests that values of time-varying coefficients on successive days would differ by at
most 0.001. Taking this as the standard deviation of the normal distribution, the prior mean
precision for the gamma is 10^6. This corresponds to quartiles (0.0027, 0.0038, 0.0059) for
σβ. An alternative prior adopted by Sargent has mean precision 10^5. Posterior inferences
for the mean precision were different under the alternative priors, but not those for the
estimated βj. Fahrmeir and Knorr-Held (1997) suggest gamma Ga(1,b) priors on precision
parameters τα on varying log baseline rates, or precisions τβ on varying predictor effects.
Sensitivity is gauged by taking alternative values for b (e.g. b = 0.05 and b = 0.0005), since b
determines how close to zero the variances are allowed to be a priori.
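The role of b in a Ga(1, b) prior on a precision can be made explicit by looking at the implied prior quantiles of the variance. The sketch below assumes the shape-rate form of the gamma (as in BUGS), under which a Ga(1, b) precision is exponential with rate b and the variance is its reciprocal; the choice of quantiles is illustrative.

```r
# tau ~ Ga(1, b) with shape 1 and rate b (assumed parameterisation); sigma^2 = 1 / tau
b_values <- c(0.05, 0.0005)
probs    <- c(0.05, 0.5, 0.95)

# The p-th quantile of 1/tau is the reciprocal of the (1 - p)-th quantile of tau
var_quantiles <- sapply(b_values, function(b) {
  1 / qgamma(1 - probs, shape = 1, rate = b)
})
dimnames(var_quantiles) <- list(paste0("q", probs), paste0("b=", b_values))

print(var_quantiles)
```

The implied variance quantiles scale linearly with b (the median is b/log 2), so reducing b by a factor of 100 moves the whole prior for the variance 100 times closer to zero.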
where dH*(t) is a prior estimate of the hazard rate per unit time. Other possibilities include
normal priors on log(dH0).
Let J + 1 intervals (s0, s1], …, (sJ, sJ+1] be defined by the J distinct failure times in a dataset,
with s1 equal to the minimum observed failure time, and sJ+1 exceeding the largest failure
time sJ (Sargent, 1997, p.16). The likelihood for individual i exiting or censored before sj, so
that Yij = 0 for t > sj, reduces to a discretised form of Poisson likelihood over all possible
intervals j with binary responses dNij and means Yij exp(Zijβj)dH0j. This model may be
adapted to allow for unobserved covariates or frailty, as considered in Section 11.4. It also
allows for autoregressive dependencies between intervals.
Sargent (1997) considers a counting process version of the Cox model for these data with
time-varying effects on the KS predictor.
Here we first consider the piecewise exponential model
and a partitioning of the time scale involving J = 20 intervals, and 21 cut-points aj at the
{0th,5th,10th,…,95th,100th} percentiles of the survival times ti. With di denoting event
indicators, the responses yij and offsets Δij are defined in R using the commands
The first model assumes a constant Karnofsky score effect, but a time-varying (log)
baseline hazard, namely
$$\alpha_j \sim N(\alpha_{j-1}, \sigma_\alpha^2),$$
with a gamma prior on 1/σα². To assist identifiability, estimation with the R package rube
uses the BUGS car.normal prior, and the re-expression
$$\alpha_j = \beta_0 + \alpha_{0j},$$
where α0j are RW1 centred random effects. With a two-chain run, the posterior density
of σα is found to be bounded away from zero, with 95% interval (0.05,0.22). The posterior
mean αj show an irregular rise in mortality over the intervals (Figure 11.1). The LOO-IC
is 984. This LOO-IC is in fact lower than if more intervals are used (e.g. with J = 50,
and using either quantile cutpoints or equally spaced cutpoints). The coefficient on the
Karnofsky score has 95% interval (−0.38,−0.15), while the interaction term has coefficient
with 95% interval (−0.57,−0.11). Coefficients for celltypes 1 and 2 are also significantly
positive.
A second model additionally takes the Karnofsky score to have a time-varying coef-
ficient, using a random walk prior adjusted for intervals of unequal length. Thus
$$\beta_j \sim N(\beta_{j-1}, \sigma_\beta^2\delta_j)$$
where the Ga(1,0.0001) prior for 1/σβ² has mean and variance 10^5, so supporting large values
(Sargent, 1997). The LOO-IC is lower at 971, and there is an upward trend (towards a
null value) in the coefficient, with an insignificant effect at higher intervals (Figure 11.2).
Topics for further investigation might be the sensitivity of the form of time variation in
the KS effect to the partitioning scheme of the durations, or to unobserved heterogene-
ity between subjects.
We also demonstrate the changing effect of the KS score using an rstan coding of the
counting process model [1]. The settings for the gamma increments prior are as for the
WinBUGS Volume 1 “Leuk: Cox regression” example. Predictors are the KS score, prior
therapy, and their interaction (the latter two having time-constant effects). A random
walk prior is assumed for the varying KS score effect.
FIGURE 11.1
Trend in αj by interval. [Plot of posterior mean and 80% CRI of αj against interval.]
FIGURE 11.2
Trend in beta coefficient. [Plot of posterior mean βKS against interval.]
Figure 11.3 shows the diminution of the KS effect at higher intervals (based on dis-
tinct event times). Figure 11.4 shows differing survival chances according to whether
prior therapy was received (upper curve) or not (lower curve), with the KS score set at
its upper quartile.
FIGURE 11.3
Trend in KS effect. [Plot of posterior mean and 80% CRI of βKS against interval.]
FIGURE 11.4
Survival chances according to therapy.
11.4 Including Frailty
Subjects with a given profile of attributes are still likely to show variations in survival
times due to unobserved factors. Such factors mean that subjects have different frailties
(i.e. liabilities to experience the event) and the most frail will typically exit before others,
so that survivors are subject to a selection effect (Aalen, 1988; Wienke, 2010). Inferences
from survival analysis may be incorrect if unobserved heterogeneity is ignored (Lancaster,
1990), with a possibility of negative duration bias (Boring, 2009). Another consideration is
possible sensitivity of inferences to the assumed form of unobserved heterogeneity.
The canonical form for introducing unobserved differences between observations is via
a multiplicative frailty, γi, distributed independently of Zi and ti, with
$$h(t_i \mid Z_i, \gamma_i) = \gamma_i h_0(t_i)\exp(Z_i\beta),$$
leading to mixed proportional hazard or MPH models (Mosler, 2003; Van den Berg, 2001;
Abbring and van den Berg, 2007). Except for the case of positive stable frailty distribu-
tions, the MPH model is inconsistent with the usual Cox proportional hazard formulation
(Henderson and Oman, 1999).
A typical assumption for the distribution p(γi) of multiplicative frailties is that they are
gamma distributed (Perperoglou et al., 2006), typically $\gamma_i \sim \mathrm{Ga}(\kappa, \kappa)$ where $\kappa$ is unknown.
So, the frailties have mean 1, and variance $1/\kappa$, with normalisation to ensure identification
when Zi includes an intercept. Another possibility is to include the regression effect
exp(Ziβ) in the specification of the frailty density. So, for example,
$$\gamma_i \sim \mathrm{Ga}(\kappa \exp[Z_i\beta], \kappa).$$
As one option, Sohn et al. (2007) assume Weibull distributed survival times, with density
form
$$f(t_i) = \frac{\alpha}{\gamma_i}\, t_i^{\alpha-1} \exp(-t_i^{\alpha}/\gamma_i),$$
and then take γi as inverse gamma. With the form x ~ IG(a,b) corresponding to
$f(x) = (b^a/\Gamma(a))\,x^{-(a+1)}\exp[-b/x]$, the frailty density is then
Other positive parametric densities can be used to represent frailty, such as the log-nor-
mal (Gustafson, 1997). An advantage of gamma frailty combined with Weibull hazard is
that joint and marginal survival functions can be obtained analytically. An alternative
is to assume the γi have a positive stable distribution (Hougaard, 2000), in which case
the proportional hazards property is preserved after the γi are integrated out (Aalen and
Hjort, 2002).
Estimates resulting from the mixed proportional hazard model are often sensitive to the
functional form of the heterogeneity distribution, and may be biased if the functional form
of the distribution is mis-specified (Baker and Melino, 2000; Keiding et al., 1997). Heckman
and Singer (1984) report sensitivity of regression estimates according to different paramet-
ric distributions of frailty. They propose discrete mixture models with finite support at a
small number K of points, so that
where Gi is a multinomial indicator with K categories. Sahu and Dey (2004) compare
gamma, stable, and skewed log-t frailty models, and show how the gamma assumption
may attenuate covariate effects as compared to the other forms.
Despite such sensitivity, it is important to consider possible heterogeneity. One can show
(Lancaster, 1990) that a model neglecting frailty will show spurious duration dependence,
and specifically overestimate the extent of negative duration dependence in the true base-
line hazard, and underestimate the extent of positive duration dependence. This is a con-
sequence of selection, since in the presence of negative duration dependence, subjects with
high values of γi exit faster, so survivors at a given survival time are increasingly biased
towards relatively low γi values, and lower hazard rates. These features can be illustrated
with the MPH assumption and particular parametric hazards. Conditional on a particular
value of γi, the survivor function is
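$$S(t_i \mid Z_i, \gamma_i) = \exp\left[-\gamma_i \exp(Z_i\beta)\int_0^{t_i} h_0(u)\,du\right],$$
or in terms of the cumulative hazard H0(ti),
$$S(t_i \mid Z_i, \gamma_i) = \exp\left[-\gamma_i \exp(Z_i\beta) H_0(t_i)\right].$$
The unconditional survivor function is obtained by integrating over the frailty density,
$$S(t_i \mid Z_i) = \int_0^{\infty} S(t_i \mid Z_i, \gamma_i)\,p(\gamma_i)\,d\gamma_i.$$
For γi following a gamma density, $\gamma_i \sim \mathrm{Ga}(a, b)$, the unconditional survivor function is
$$S(t_i \mid Z_i) = b^a\left[b + H_0(t_i)e^{Z_i\beta}\right]^{-a},$$
and for $a = b = \kappa$,
$$S(t_i \mid Z_i) = \left[1 + \kappa^{-1}H_0(t_i)e^{Z_i\beta}\right]^{-\kappa}.$$
With a constant baseline hazard $h_0(t) = 1$, so that $H_0(t_i) = t_i$, the survivor function, density, and hazard are
$$S(t_i \mid Z_i) = \left[1 + \kappa^{-1}e^{Z_i\beta}t_i\right]^{-\kappa},$$
$$f(t_i \mid Z_i) = e^{Z_i\beta}\left[1 + \kappa^{-1}e^{Z_i\beta}t_i\right]^{-\kappa-1},$$
$$h(t_i \mid Z_i) = e^{Z_i\beta}\left[1 + \kappa^{-1}e^{Z_i\beta}t_i\right]^{-1}.$$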
For a frailty variance $1/\kappa > 0$, the hazard rate is a decreasing function of t, an example of
spurious duration dependence. If frailty is present, but ignored, not only will duration
effects be mis-stated, but covariate effects will be underestimated (Hougaard et al., 1994;
Pickles and Crouchley, 1995). Lancaster (1990) confirmed this analytically for uncensored
Weibull survival data.
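The spurious duration dependence under gamma frailty can be checked numerically. The Python sketch below assumes Ga(κ, κ) frailty and a unit constant baseline hazard, evaluates the closed-form unconditional survivor and hazard functions, and compares the survivor function with a direct midpoint-rule integration over the frailty density. The function names, the integration grid, and the value κ = 2 are illustrative assumptions.

```python
import math

def marginal_survival(t, kappa, lin_pred=0.0):
    """Closed form [1 + exp(Zb) * t / kappa]^(-kappa) for Ga(kappa, kappa) frailty."""
    return (1.0 + math.exp(lin_pred) * t / kappa) ** (-kappa)

def marginal_hazard(t, kappa, lin_pred=0.0):
    """Unconditional hazard exp(Zb) / [1 + exp(Zb) * t / kappa]."""
    rate = math.exp(lin_pred)
    return rate / (1.0 + rate * t / kappa)

def marginal_survival_numeric(t, kappa, lin_pred=0.0, n=20000, upper=40.0):
    """Midpoint-rule integral of exp(-g * exp(Zb) * t) against the Ga(kappa, kappa) density."""
    rate = math.exp(lin_pred)
    log_norm = kappa * math.log(kappa) - math.lgamma(kappa)
    step = upper / n
    total = 0.0
    for i in range(n):
        g = (i + 0.5) * step
        log_dens = log_norm + (kappa - 1.0) * math.log(g) - kappa * g
        total += math.exp(log_dens - g * rate * t) * step
    return total

# The conditional hazard is constant at 1, but the unconditional hazard declines with t.
hazards = [marginal_hazard(t, kappa=2.0) for t in (0.0, 1.0, 2.0, 4.0)]
```

Although every subject has a constant hazard given their frailty, the values in `hazards` fall steadily with t, which is the selection effect described above.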
More general forms of subject-level random variation can be achieved by a general linear
mixed model form, where the impact of selected predictors wi = (w1i , … , wri ) is assumed to
vary over subjects, or clusters of subjects. Thus
When r = 1, and wi = 1, the random effect bi ~ N(B,1/τb) is used to represent variability in
frailties between subjects. If Zi contains an intercept, the bi are constrained to have zero
mean, namely bi ~ N(0,1/τb).
A general linear mixed model form for frailty may take account of spatial locations of
subjects, as in geoadditive hazard regression (Kneib, 2006; Henderson et al., 2002). Suppose
subjects ij are nested within J locations, then $\{b_j, j = 1, \ldots, J\}$ would be spatially correlated
with local pooling of strength (Zhou et al., 2017). For example, if individuals in neigh-
bouring locations are subject to similar (unobserved) environmental risks, this will affect
survival.
In accelerated failure time models (section 11.2.3), frailty is conveniently obtained by
discrete mixture modelling of the error term. Following Roeder and Wasserman (1997), a
mixture of normals provides a flexible model for estimation of densities. Suppose mem-
bership of latent sub-groups is denoted by a categorical variable Gi with K options, and
prior $G_i \sim \mathrm{Mult}(1,[\pi_1, \ldots, \pi_K])$. Assuming a log-normal density for exit times, one may transform
observed failure or censoring times as ri = log(ti), and to account for right censoring,
define lower sampling limits Li = log(ti) if di = 0, and Li = 0 if di = 1. Then the discrete mixture
adopts varying group intercepts and variances in the survivor function
$$S(t_i) = 1 - \sum_{k=1}^{K} \pi_k \Phi\left(\frac{\log t_i - \gamma_{0k} - \gamma_1 z_{1i} - \gamma_2 z_{2i} - \ldots}{\sigma_k}\right).$$
Mixed Dirichlet process and Polya Tree priors for the errors u in an AFT regression are
used by Kuo and Mallick (1997) and Walker and Mallick (1999).
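The discrete mixture survivor function for the log-normal AFT model can be evaluated directly. The Python sketch below is one minimal way of doing so, assuming common predictor effects across groups with group-specific intercepts and scales; the function names and all numerical values are hypothetical.

```python
import math

def std_normal_cdf(x):
    """Standard normal distribution function via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def mixture_survivor(t, z, weights, intercepts, slopes, sigmas):
    """S(t) = 1 - sum_k pi_k * Phi((log t - gamma_0k - z'gamma) / sigma_k)."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("mixture weights must sum to 1")
    shift = sum(zj * gj for zj, gj in zip(z, slopes))
    cdf = 0.0
    for pi_k, g0k, sd_k in zip(weights, intercepts, sigmas):
        cdf += pi_k * std_normal_cdf((math.log(t) - g0k - shift) / sd_k)
    return 1.0 - cdf

# Hypothetical two-group example with one predictor.
s_early = mixture_survivor(2.0, [1.0], [0.6, 0.4], [1.0, 2.5], [0.3], [0.5, 0.8])
s_late = mixture_survivor(10.0, [1.0], [0.6, 0.4], [1.0, 2.5], [0.3], [0.5, 0.8])
```

With a single group (K = 1) the function reduces to the usual log-normal survivor function, which equals 0.5 at the median exp(γ0 + zγ).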
susceptible population. Herring and Ibrahim (2002) point out – in the context of cancer
survival – that improved treatment means that a substantial proportion of patients may
now be cured, whereas traditional survival analysis, including the Cox (1972) regression
model, assumes that no patients are cured, but that all remain at risk of death or relapse.
Similarly, in the context of component reliability, Sinha et al. (2003) consider the case where
if a unit is free of manufacturing faults, it will never fail in its technological lifetime under
usual stress levels.
The most common approach to modelling events with a permanent survival fraction or
cure rate assumes the total survival rate is a binary mixture (Ibrahim et al., 2001). The non-
susceptible subpopulation has Sc(t) = 1 with probability (1 − π), and the other (the non-cured
or susceptible subpopulation) follows a conventional survival pattern in which Sn(t) → 0 as
t → ∞. So the overall survivor function is
$$S^*(t) = (1 - \pi) + \pi S_n(t),$$
with corresponding distribution function
$$F^*(t) = \pi F_n(t).$$
Ibrahim et al. (2001, p.157) point out that if covariate effects are modelled via binary regres-
sion for πi then the proportional hazard property no longer obtains.
Let Ri be a partially unobserved binary indicator with Ri = 1 if a subject is susceptible.
Schmidt and Witte (1989) and Banerjee and Carlin (2004) follow the standard cure rate
model and take Ri to be Bernoulli with Pr(Ri = 1) = πi being a propensity to experience the
event (e.g. propensity to relapse). For simplicity, omit the subscript n in the survivor func-
tion for susceptibles. Then for subjects observed to fail, namely with di = 1, it necessarily
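follows that Ri = 1, and so the likelihood contribution from such cases is $\pi_i f(t_i)$. For
censored subjects (di = 0), Ri is unobserved, and the contribution is $(1 - \pi_i) + \pi_i S(t_i)$.
The likelihood contribution for subject i is therefore
$$[\pi_i f(t_i)]^{d_i}\,[(1 - \pi_i) + \pi_i S(t_i)]^{1-d_i},$$
which reduces to the usual form $f(t_i)^{d_i} S(t_i)^{1-d_i}$ when Ri = 1 for all subjects, and so πi = 1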
for all i (i.e. there is no permanent survivor fraction). Any form of binary regression (e.g.
logit) may be used for predicting πi (Schmidt and Witte, 1989). Banerjee and Carlin (2004)
carry out a Bayesian analysis with individual level regression in the scale parameter of the
failure distribution f(t), but without a regression for the susceptible probability. However,
their observations are hierarchical (spatially configured) response times tij (subjects i
within areas j), and they allow spatial variability in the propensities so that πij = πj; see
also Cooner et al. (2006).
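A minimal Python sketch of the standard (binary mixture) cure rate log-likelihood is given below. It assumes a Weibull failure distribution for susceptibles purely for illustration; the function names and parameter values are hypothetical, and any other failure density could be substituted.

```python
import math

def weibull_surv(t, lam, kappa):
    """Weibull survivor function exp(-lam * t^kappa) for susceptible subjects."""
    return math.exp(-lam * t ** kappa)

def weibull_dens(t, lam, kappa):
    """Weibull density lam * kappa * t^(kappa - 1) * exp(-lam * t^kappa)."""
    return lam * kappa * t ** (kappa - 1.0) * math.exp(-lam * t ** kappa)

def mixture_cure_loglik(t, d, pi, lam, kappa):
    """Log-likelihood contribution under the binary mixture cure model.

    d = 1 (observed failure): the subject must be susceptible, giving log(pi * f(t)).
    d = 0 (censored): the subject is either cured or a susceptible survivor,
                      giving log((1 - pi) + pi * S(t)).
    """
    if d == 1:
        return math.log(pi * weibull_dens(t, lam, kappa))
    return math.log((1.0 - pi) + pi * weibull_surv(t, lam, kappa))

# Hypothetical values: a very long censored time is explained almost entirely by cure.
ll_long_censored = mixture_cure_loglik(1e3, 0, pi=0.7, lam=0.1, kappa=1.5)
```

Setting `pi` to 1 recovers the usual censored-data likelihood, while for long censored times the contribution approaches log(1 − π).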
Chen et al. (1999) describe an alternative structure in which there is a latent count of risks
Ci, taken to be Poisson with mean θ (for example, tumour cells remaining after treatment
that have varying potentials to cause relapse), and unobserved times $U_{i1}, \ldots, U_{iC_i}$ associated
with each of these risks. The Uic are assumed to follow the same failure distribution
F(t) = 1 − S(t). An observed failure time ti is the minimum of these times. If Ci = 0 then a
subject survives permanently from the event being modelled (e.g. a form of cancer). In this
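case the composite survival function is
$$S^*(t_i) = \exp[-\theta F(t_i)],$$
with cure fraction $S^*(\infty) = \exp(-\theta)$, and the corresponding hazard is
$$h^*(t_i) = \theta f(t_i).$$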
An alternative derivation of this model, not tied to the notion of multiple latent risks, is that
the cumulative hazard $H(t) = \int_0^t h(u)\,du$ tends to a finite positive limit θ as t → ∞ (Tsodikov
et al., 2003). Chen et al. (1999) and Ibrahim et al. (2000, p.158) mention that the survivor
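function of the non-cured subpopulation can be written
$$S_n(t) = \frac{\exp[-\theta F(t)] - \exp(-\theta)}{1 - \exp(-\theta)},$$
so that the composite survival function is in fact also representable as a binary mixture,
namely
$$S^*(t) = \exp(-\theta) + [1 - \exp(-\theta)]\,S_n(t).$$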
Chen et al. (1999) introduce covariates into a Poisson regression model for subject-specific
θi. Consider Weibull distributed times with $F(t_i \mid Z_i) = 1 - \exp[-\lambda_i t_i^{\kappa}]$, $\lambda_i = \exp(Z_i\beta)$, and
$f(t_i \mid \kappa, Z_i) = \lambda_i \kappa t_i^{\kappa-1}\exp(-\lambda_i t_i^{\kappa})$. The likelihood when predictors are used to explain both θi
and λi, and with di being event status indicators, is then
$$[h^*(t_i)]^{d_i} S^*(t_i) = \left[\theta_i \lambda_i \kappa t_i^{\kappa-1}\exp(-\lambda_i t_i^{\kappa})\right]^{d_i} \exp\left(-\theta_i\{1 - \exp(-\lambda_i t_i^{\kappa})\}\right).$$
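The latent count likelihood with Weibull times can be written down in a few lines. The Python sketch below computes the log of the likelihood contribution for a single subject and the composite survivor function; the function names and parameter values are hypothetical.

```python
import math

def weibull_cdf(t, lam, kappa):
    """F(t) = 1 - exp(-lam * t^kappa)."""
    return 1.0 - math.exp(-lam * t ** kappa)

def weibull_pdf(t, lam, kappa):
    """f(t) = lam * kappa * t^(kappa - 1) * exp(-lam * t^kappa)."""
    return lam * kappa * t ** (kappa - 1.0) * math.exp(-lam * t ** kappa)

def composite_survival(t, theta, lam, kappa):
    """S*(t) = exp(-theta * F(t)); tends to the cure fraction exp(-theta) as t grows."""
    return math.exp(-theta * weibull_cdf(t, lam, kappa))

def cure_loglik(t, d, theta, lam, kappa):
    """Log of [theta * f(t)]^d * exp(-theta * F(t)) for one subject."""
    ll = -theta * weibull_cdf(t, lam, kappa)
    if d == 1:
        ll += math.log(theta * weibull_pdf(t, lam, kappa))
    return ll

# Hypothetical parameter values; the survivor function plateaus at exp(-theta).
plateau = composite_survival(1e6, theta=1.5, lam=0.1, kappa=2.0)
```

In a regression setting, `theta` and `lam` would be replaced by subject-specific values obtained from log-linear predictors.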
Multiplicative frailty, as in the MPH setup above, can be introduced in cure rate models,
but identifiability may be weak because susceptibility responses are partially unobserved
themselves. Models for frailty in multivariate cure fraction models are considered by Yin
(2005). Thus, for times tij observed on subjects i and events j, Yin proposes multiplica-
tive frailty at subject level combined with Poisson regression for θij in the cure fractions
exp(−θij). One option takes
$$h(t) = \frac{\lambda\kappa t^{\kappa-1}}{1 + \lambda t^{\kappa}},$$
is appropriate to the non-monotonic form of hazard for first maternity, typically peak-
ing between ages 25 to 35. This model is implemented in rstan using the custom likeli-
hood approach. A standard log-logistic is compared with a log-logistic model with a
permanent survivor fraction (PSF), modelled according to the latent count approach
(Chen et al., 1999). The PSF log-logistic model is then generalised to allow for unmea-
sured heterogeneity in the age at first maternity. Permanent survivorship in this case is
equivalent to a woman never undergoing a maternity, and at population level is essen-
tially equivalent to the rate of childlessness.
Regression effects are included in the scale parameter of the log-logistic hazard via
$\lambda_i = \exp(Z_i\beta)$, with θ assumed constant. However, a Poisson regression for θi could be
included. Predictors Zi and regression effects β under the standard log-logistic are as in
Table 11.2. Predictors are binary apart from number of siblings and education years. The
modal age $c = [(\kappa - 1)\exp(-Z_T\beta)]^{1/\kappa}$ reported in Table 11.2 is based on a predictor vector
ZT for a white subject with 13 years of education, and 3 siblings.
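The modal age formula follows from setting the derivative of the log-logistic hazard to zero, which gives λt^κ = κ − 1. A small Python sketch, with hypothetical parameter values chosen so that the mode falls at age 30 (these are not the estimates in Table 11.2), confirms this:

```python
import math

def loglogistic_hazard(t, lam, kappa):
    """h(t) = lam * kappa * t^(kappa - 1) / (1 + lam * t^kappa)."""
    return lam * kappa * t ** (kappa - 1.0) / (1.0 + lam * t ** kappa)

def modal_time(lam, kappa):
    """Time at which the hazard peaks: [(kappa - 1) / lam]^(1 / kappa).

    With lam = exp(Z * beta), this equals [(kappa - 1) * exp(-Z * beta)]^(1 / kappa).
    The hazard has an interior maximum only when kappa > 1.
    """
    if kappa <= 1.0:
        raise ValueError("the hazard is monotone decreasing when kappa <= 1")
    return ((kappa - 1.0) / lam) ** (1.0 / kappa)

# Hypothetical values: shape 5, with the scale chosen so that the mode is age 30.
kappa = 5.0
lam = (kappa - 1.0) / 30.0 ** kappa
mode = modal_time(lam, kappa)
```

Evaluating `loglogistic_hazard` on either side of `mode` shows the hazard rising before the modal age and falling after it.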
The standard log-logistic model gives a LOO-IC of 8,700. Significant coefficients in
Table 11.2 show that delayed AFM is associated with longer education, being white, and
TABLE 11.2
Age at First Maternity. Parameter Posterior Summaries
                              Standard log-logistic     Log-logistic with Cure Fraction   Cure Fraction and Frailty
Predictor                     Mean     2.5%     97.5%    Mean     2.5%     97.5%           Mean     2.5%     97.5%
Years of education            −0.213   −0.249   −0.184   −0.294   −0.334   −0.26           −0.504   −0.686   −0.377
Number of siblings             0.018   −0.012    0.045    0.017   −0.017    0.046           0.032   −0.03     0.086
White                         −0.594   −0.807   −0.407   −0.909   −1.14    −0.706          −1.649   −2.364   −1.133
Immigrant                     −0.312   −0.589   −0.083   −0.585   −0.912   −0.317          −0.921   −1.566   −0.45
Low income (age 16)           −0.009   −0.219    0.167    0.122   −0.126    0.325           0.217   −0.244    0.615
Living in city (age 16)       −0.074   −0.256    0.074   −0.007   −0.214    0.157           0.019   −0.343    0.334
Shape parameter                5.04     4.79     5.26     8.79     8.36     9.17           15.9     11.81    20.44
Modal age (typical subject)   33       32.1     33.7     31.7     31.1     32.4            31.5     22.8     38.7
Proportion childless                                      0.172    0.151    0.19            0.167    0.146    0.185
Frailty SD                                                                                  2.45     1.44     3.47
$$\lambda_i = \exp(Z_i\beta + u_i),$$
$$u_i \sim N(0, \sigma_u^2),$$
reduces the LOO-IC to 7,855. This analysis uses a corner constraint on the ui for identifiability,
and this option provides better convergence than (a) excluding the intercept
from Ziβ and centring the ui at β0, namely $u_i \sim N(\beta_0, 1/\tau_u)$, or (b) expressing the random
effect as a product of σu and N(0,1) terms. The option of centring the ui is more computationally
intensive. The lowest (most negative) frailty values are for subjects with delayed
age at first maternity, combined with low education and non-white ethnicity.
Allowing for a childless subpopulation (as a cure rate) is a form of frailty in itself,
and enhances (absolutely) the coefficients on significant predictor effects [3]. Formally
including frailty in the modelling of the event density further enhances predictor
effects. The childless fraction (i.e. the permanent survival fraction), exp(−θ), is estimated
at around 0.17, regardless of the presence or not of frailty. A standard log-logistic model
leads to a significantly later modal age than the extended models. In fact, a better repre-
sentation of the age at first maternity process may be provided by the generalised log-
logistic of Brüderl and Diekmann (1995), as discussed in Congdon (2008).
and the hazard function (the conditional probability of failing in interval j given survival
till the start of the interval) is
$$q_j = \Pr(t \in (a_{j-1}, a_j) \mid t \ge a_{j-1}) = \Pr(t = j \mid t \ge j) = f_j/S_{j-1} = \frac{S_{j-1} - S_j}{S_{j-1}}.$$
Alternatively stated, qj is the proportion of subjects at risk at the beginning of interval j
who experience the event sometime during the interval. The survivor function (the probability
of surviving beyond interval j) is obtained as
$$S_j = \Pr(t > a_j) = \prod_{k=1}^{j}(1 - q_k) = f_{j+1} + f_{j+2} + \ldots + f_J = S_{j-1}(1 - q_j),$$
though an alternative survivor function S j = Pr(t > a j −1 ) may be defined as the probability of
surviving to the start of interval j (Fahrmeir and Tutz, 2001, p.396; Aitkin et al., 2004, p.350).
Let wij = 1 if individual i undergoes the event during interval j and wij = 0 otherwise. The
likelihood up to interval k for that individual is then (Aitkin et al., 2004, p.351),
$$f_{ik}^{w_{ik}} S_{ik}^{1-w_{ik}} = (q_{ik} S_{i,k-1})^{w_{ik}}\,[S_{i,k-1}(1 - q_{ik})]^{1-w_{ik}}$$
$$= q_{ik}^{w_{ik}}(1 - q_{ik})^{1-w_{ik}} \prod_{j=1}^{k-1}(1 - q_{ij})^{(1-w_{ij})}$$
$$= \prod_{j=1}^{k} q_{ij}^{w_{ij}}(1 - q_{ij})^{(1-w_{ij})}.$$
This shows that the likelihood involves binary responses wij ~ Bern(qij), where the qij may
vary between time intervals, but are assumed constant within them. So, the hazard probability
can be represented as
$$q_{ij} = F(\alpha_j + Z_{ij}\beta),$$
where F is a suitable distribution function, and αj models the baseline hazard (Singer and
Willetts, 1993). If the predictors include lagged event status indicators $\{w_{i,j-1}, w_{i,j-2}, \ldots\}$,
one is led to discrete Markov event histories (e.g. Barmby, 2002). Lagged predictor effects
may also be used (Fahrmeir and Tutz, 2001, p.410).
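The equivalence between the discrete time survival likelihood and a set of Bernoulli responses can be verified on a single subject. The Python sketch below, using hypothetical hazard probabilities, builds the person-period indicators wij and compares the Bernoulli log-likelihood with the direct form log(fk) for an event or log(Sk) for censoring; the function names are illustrative.

```python
import math

def survivor(q):
    """Return [S_0, S_1, ..., S_J] with S_0 = 1 and S_j = S_{j-1} * (1 - q_j)."""
    s = [1.0]
    for qj in q:
        s.append(s[-1] * (1.0 - qj))
    return s

def direct_loglik(q, k, event):
    """log f_k = log(q_k * S_{k-1}) for an event in interval k, else log S_k."""
    s = survivor(q)
    if event:
        return math.log(q[k - 1] * s[k - 1])
    return math.log(s[k])

def person_period(k, event):
    """Indicators w_i1, ..., w_ik: zero until interval k, then 1 only if the event occurs."""
    return [0] * (k - 1) + [1 if event else 0]

def bernoulli_loglik(q, k, event):
    """Sum of Bernoulli log-likelihood terms over the k intervals at risk."""
    w = person_period(k, event)
    return sum(wj * math.log(qj) + (1 - wj) * math.log(1.0 - qj)
               for wj, qj in zip(w, q))

# Hypothetical interval hazards for one subject observed through interval 3.
q = [0.10, 0.20, 0.30, 0.25]
```

Because the two forms agree, discrete time hazard models can be fitted with standard binary regression applied to person-period data.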
A benefit of the discrete framework is that the baseline hazard can be modelled via poly-
nomial functions of j (Efron, 1988), for example:
$$\alpha_j = \psi_0 + \psi_1(j - 1) + \psi_2(j - 1)^2 + u_j,$$
where $u_j \sim N(0, \sigma_u^2)$. Parametric time models can also be modelled straightforwardly: a
Weibull model is represented in a complementary log-log link for F by taking the log of the
time interval as a covariate (Allison, 1997). Non-parametric models for time (e.g. via splines)
can also be applied, or a correlated random effect prior assumed, as in Section 11.3.1. Time-
varying predictor effects are straightforward to use (Muthen and Masyn, 2005), and non-
proportional effects are modelled by including interactions between subject attributes Zij
and j.
Commonly used links for the probabilities qij are the logit, probit, and complementary
log-log. For example, a logit link with time-varying intercepts and predictor effects (where
the vector Zij excludes a constant term) would mean
$$q(j \mid Z_{ij}) = \frac{\exp(\alpha_j + Z_{ij}\beta_j)}{1 + \exp(\alpha_j + Z_{ij}\beta_j)}.$$
Adopting a logit link means the log-odds of the event occurring are modelled as functions
of predictors and time (i.e. interval). The complementary log-log link model with
$$S(t_i \mid Z_i) = \exp\left\{-\int_0^{t_i} h(u \mid Z_i)\,du\right\} = \exp\left\{-\exp\left[Z_i\beta + \log H_0(t_i)\right]\right\}.$$
aj
predictor effects as under a PH model (Kalbfleisch and Prentice, 1980; Fahrmeir and Tutz,
2001, p.401).
If correlated priors (e.g. random walks) on the αj and βj are adopted, the setting of priors on
the hyperparameters (e.g. precisions) follows the same considerations as discussed above
in connection with semiparametric models for continuous time hazards (Section 11.3.1).
Fahrmeir and Knorr-Held (1997) discuss alternative Hastings sampling schemes for collections
of time-varying coefficients $\{\alpha_j, \beta_{j1}, \ldots, \beta_{jp}\}$ in discrete hazard regression.
As for continuous time survival modelling, neglecting unobserved heterogeneity may
mean that the estimated baseline hazard parameters are biased downwards, the impact
of constant covariates is underestimated, or that spurious time-dependent effects for
observed predictors are obtained. For improved identification, frailties may be included
at subject level, rather than at subject-interval level, though bilinear schemes are possible.
Thus, a log-normal frailty might specify
where one of the δj is set to a fixed value for identification if the variance of bi is unknown.
Muthen and Masyn (2005) use a discrete mixture approach in which Gi ∈(1, … , K ) are
latent groups (e.g. developmental trajectories in educational applications). Then
where the probability that Gi = k is defined by predictors Ui in a separate multiple logit
regression. The factor scores bi may be defined by $b_i \sim N(0, \sigma_b^2)$, or by a hierarchical linear
regression on the predictors Ui.
11.5.1 Life Tables
Life tables are a particular way of analysing discrete time survival data. They may be
applied to situations where permanent survival or withdrawal is possible, such as marital
status life tables (Schoen, 2016), or to population mortality. The intervals in such applica-
tions refer to age or duration bands, and discretisation may extend beyond that present in
the data, as in abridged life tables (Kostaki and Panousis, 2001). The intervals are not nec-
essarily of equal length (Wong, 1977). For example, in one common scheme for human life
tables, ages under 1 form the first interval, ages one to four comprise the second interval,
ages five to nine, the third interval, and so on for successive five year bands, with the final
interval typically open ended, such as ages over 90. Often human life tables are estimated
from population deaths data over a specified calendar period, to provide “period” life
tables, based on current mortality in individuals born in different periods, as distinct from
cohort life tables, based on follow-up studies of mortality in a group of individuals born in
the same time period (Richards and Barry, 1998).
Following life table conventions, ages are denoted x and age intervals are denoted
[x , x + n) , e.g. n = 5 if intervals are five years in length. Let T denote a random variable for
the total lifetime (age of death) of an individual. Also, in line with life table conventions,
the probability Pr(T > x) that the age of death T is x or higher (the survivor function) is
denoted l(x). The hazard rate – also called the force of mortality in life table applications
– is then
$$h(x) = \lim_{\Delta x \to 0} \frac{l(x) - l(x + \Delta x)}{l(x)\,\Delta x} = \frac{-l'(x)}{l(x)},$$
with solution
$$l(x) = l(0)\exp\left(-\int_0^x h(u)\,du\right).$$
With l(0) = 1, the density of the age at death is f(x) = h(x)l(x). The probability of surviving
from age x to age x + n, given survival to x, namely Pr(T > x + n|T > x), is denoted npx with
$${}_np_x = l(x+n)/l(x) = \frac{\exp\left(-\int_0^{x+n} h(u)\,du\right)}{\exp\left(-\int_0^{x} h(u)\,du\right)} = \exp\left(-\int_0^{n} h(x+u)\,du\right),$$
while the probability of dying before age x + n conditional on reaching age x is
$${}_nq_x = 1 - {}_np_x = 1 - l(x+n)/l(x) = \frac{l(x) - l(x+n)}{l(x)}.$$
Important in linking these functions to estimable quantities is the central rate of mortal-
ity, which represents a weighted average of the force of mortality applying over the inter-
val [x , x + n) . Let P(x) denote the population of age x. Then the death rate for age interval
[x , x + n) is
$${}_nM_x = \frac{\int_x^{x+n} h(a)P(a)\,da}{\int_x^{x+n} P(a)\,da}.$$
Assuming linearity of l(a) in the interval from x to x + n, this can be simplified (Namboodiri
and Suchindran, 1987, p.36) to
$${}_nM_x = \frac{l(x) - l(x+n)}{0.5n[l(x) + l(x+n)]}.$$
Rearranging,
$$\frac{l(x+n)}{l(x)} = {}_np_x = \frac{1 - 0.5n({}_nM_x)}{1 + 0.5n({}_nM_x)},$$
giving
$${}_nq_x = \frac{n({}_nM_x)}{1 + 0.5n({}_nM_x)}.$$
To clarify the operations involved, life tables involve hypothetical populations of initial size
l0 = 100,000 (the radix) with lx denoting numbers still alive at age x from the initial population.
The number dying between age x and x + n is denoted ${}_nd_x = l_x - l_{x+n}$, and from above
$${}_nq_x = 1 - l_{x+n}/l_x = \frac{l_x - l_{x+n}}{l_x} = \frac{{}_nd_x}{l_x}.$$
To develop the life table from observed deaths and populations requires an estimator
for the probability nqx. Let Dx denote observed deaths for age band [x , x + n) over a cer-
tain period, Px denote observed mid-period populations at risk (or person-years), and Mx
denote age-specific death rates. One estimator of probability of dying in interval [x , x + n)
conditional on being alive at the start of the interval is then (Chiang, 1984)
$${}_nq_x = \frac{n\,{}_nM_x}{1 + n(1 - {}_na_x)\,{}_nM_x},$$
where ${}_na_x$ is the fraction of the interval lived by those dying during it. For most age groups,
${}_na_x$ is taken as a half, but for infants (ages under one), it can be taken as 0.1, and for the one
to four age group as 0.4.
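The steps from age-specific death rates to life table columns can be sketched in Python as follows. The death rates used are hypothetical, the function names are illustrative, and the final open-ended interval (for which the probability of dying is 1) is not handled in this minimal version.

```python
def chiang_q(n, m, a=0.5):
    """Probability of dying in an interval of length n, given death rate m
    and fraction a of the interval lived by those who die in it."""
    return n * m / (1.0 + n * (1.0 - a) * m)

def life_table(widths, rates, fractions, radix=100000.0):
    """Return rows with l_x (alive at start), q_x and d_x for successive closed intervals."""
    rows = []
    lx = radix
    for n, m, a in zip(widths, rates, fractions):
        qx = chiang_q(n, m, a)
        dx = lx * qx
        rows.append({"n": n, "lx": lx, "qx": qx, "dx": dx})
        lx -= dx  # survivors entering the next interval
    return rows

# Hypothetical death rates for ages under 1, 1 to 4, and 5 to 9.
rows = life_table(widths=[1, 4, 5],
                  rates=[0.006, 0.0003, 0.0001],
                  fractions=[0.1, 0.4, 0.5])
```

With the fraction set to a half, `chiang_q` reduces to the expression n(nMx)/[1 + 0.5n(nMx)] obtained under the linearity assumption.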
Under conventional life table methods that are usually applied to large populations, the
Mx are treated as unrelated fixed effects and estimated by assuming binomial sampling
Dx ∼ Bin( Px , Mx ) or Poisson sampling Dx ∼ Po( Mx Px ). In a Bayesian version of the fixed
effect approach, the Mx would be assigned diffuse beta or gamma priors with known
hyperparameters, e.g. Mx ∼ Beta(1, 1). Overdispersed versions of binomial or Poisson den-
sities may also be used, involving hierarchical schemes for “borrowing strength” over
correlated mortality rates, with a higher stage density for the Mx involving unknown
hyperparameters. An example might be when age-specific deaths Dix for a set of areas or
hospitals (i = 1, … , I ) are to be analysed, and populations at risk are relatively small. Then
the conjugate binomial-beta approach would mean taking death rates Mix to be distributed
according to a hierarchical model, namely
where {a,b} are unknown parameters. Congdon (2009) adopts a general linear mixed model
approach for data involving an additional stratifying group g in which
where i and x denote areas and ages, and a logistic regression with group-specific autore-
gressive area and age effects has the form
Other options might be to model the impact of age by a parametric function; for example,
Neves and Migon (2007) use Makeham’s Law, by which
$$D_x \sim \mathrm{Po}(M_x P_x),$$
$$M_x = \alpha + \beta\delta^x,$$
and extend this to a time series model for age-specific death rates and times t, namely
$$M_{xt} = \alpha_t + \beta_t\delta_t^x.$$
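As an illustration of the Poisson likelihood under Makeham's Law, the Python sketch below evaluates the log-likelihood of age-specific deaths for given parameter values. The function names and the data values are hypothetical and serve only to show the calculation.

```python
import math

def makeham_rate(x, alpha, beta, delta):
    """Makeham death rate alpha + beta * delta^x at age x."""
    return alpha + beta * delta ** x

def poisson_loglik(deaths, pops, ages, alpha, beta, delta):
    """Log-likelihood of deaths D_x ~ Po(M_x * P_x) with M_x from Makeham's Law."""
    ll = 0.0
    for d, p, x in zip(deaths, pops, ages):
        mu = makeham_rate(x, alpha, beta, delta) * p
        ll += d * math.log(mu) - mu - math.lgamma(d + 1)
    return ll

# Hypothetical data: deaths and populations at risk at three ages.
ages = [40, 60, 80]
pops = [1000, 800, 300]
deaths = [3, 9, 30]
ll = poisson_loglik(deaths, pops, ages, alpha=0.001, beta=0.0001, delta=1.09)
```

In a Bayesian analysis this log-likelihood would be combined with priors on the three positive parameters; the time series extension replaces them with time-specific values.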
with κ a positive parameter, and the regression term Zi including an intercept. This
representation is compared with a semiparametric baseline hazard modelled via a
first-order random walk (model 2), namely
FIGURE 11.5
Posterior mean frailties. (Histogram; vertical axis: frequency.)
where bi ∼ N(0, 1), and the precision $1/\sigma_b^2$ of bi is assigned a gamma prior. A two-chain
run of 10,000 iterations provides a mean (95% CRI) of 0.33 (0.01, 0.81) for the κ coefficient.
The treatment and age effects, namely −2.1 (−3,−1.3) and 10.3 (4.0,16.6), are changed only
slightly. The WAIC falls slightly to 293, with 31 effective parameters. Figure 11.5 plots
out the bi and shows a negative skew, with negative frailty effects for older subjects still
surviving at higher intervals (e.g. subject 34).
$$\mu_x = \alpha + \beta\delta^x,$$
with the three parameters all positive. Allowing for variability in these parameters over
age groups, one may propose
$$\mu_x = \alpha_x + \beta_x\delta_x^x,$$
$$\log(\alpha_x) = \log(\alpha_{x-1}) + w_{1x},$$
$$\log(\beta_x) = \log(\beta_{x-1}) + w_{2x},$$
$$\log(\delta_x) = \log(\delta_{x-1}) + w_{3x},$$
where the wjx are initially taken as iid normal. Initial conditions (log(α1), etc) are taken
as N(0,25).
Implementing this model in rstan shows problematic convergence, unless informative
priors are assumed for the standard deviations σj of the errors wjx. Informative priors can
be motivated by an expectation of small changes in death rates between successive ages,
and we assume $\sigma_j \sim N^+(0, 0.25)$. This model shows impaired convergence in log(βx) and σ2.
Improved convergence is shown by a model taking w2x and w3x to be Student t with known
d.f. = 4, and also $\sigma_j \sim t_4^+(0, 0.25)$ (a half-t with 4 df). This option has LOO-IC of 605. As one
example of the parametric outputs, Figure 11.6 plots out the posterior mean αx. Predictive
checks from this model (comparing replicates and actual dx) are satisfactory, with no exceed-
ance probability under 0.05 or over 0.95. The minimum exceedance probability is 0.056 for
age x = 84, with a relatively large observed death total as compared to the modelled total.
Neves and Migon (2007) consider the implications of the fitted curve in deriving the
monthly whole life annuity-due to a life aged x (Bowers et al., 1986). Figure 11.7 plots out
the posterior density of this quantity for a male subject aged 60, at an assumed annual
real interest of 6%. This compares closely with Figure 3 in Neves and Migon (2007).
Convergence problems are completely alleviated if a constancy assumption δx = δ is
made, with the specification now
$$\mu_x = \alpha_x + \beta_x\delta^x,$$
$$\log(\alpha_x) = \log(\alpha_{x-1}) + w_{1x},$$
$$\log(\beta_x) = \log(\beta_{x-1}) + w_{2x},$$
with w1x and w2x normal and $\sigma_j \sim N^+(0, 0.25)$. However, this raises the LOO-IC to 606.
The predictive exceedance probability for age x=84 is now under 0.05.
FIGURE 11.6
Alpha by age. (Posterior mean αx plotted against age.)
FIGURE 11.7
Posterior density, monthly whole life annuity.
Survival data on twins or other types of matched pair (Anderson et al., 1992);
Reliability data when the lifetime of one component is related to the lifetimes of other
components;
Failure times of paired human organs (Sahu and Dey, 2000; Tosch and Holmes, 1980).
Examples of grouped or clustered data are provided by Gustafson (1997) as when several
response times are measured for a single patient in a clinical trial, or when responses are
for patients categorised according to clinic of treatment. Multivariate perspectives on more
specialised survival models are exemplified by Bayesian multivariate cure rate models
(Chen et al., 2002; Yin, 2005), and multivariate counting processes (Sinha and Ghosh, 2005).
The statistical model applied to such data needs to account for the intra-cluster or
inter-event correlation. It may be possible to model the dependence structure directly, for
example, via multivariate versions of widely adopted parametric survival models (Yashin
et al., 2001). Thus Sahu and Dey (2000) consider bivariate exponential and Weibull survival
models for data on times to visual impairment for paired eyes, while Damien and Muller
(1998) provide a Bayesian treatment of a bivariate Gumbel model. The multivariate lognor-
mal is another possibility, which adapts to the situation of conditional multivariate data,
when durations on a second event are obtained conditional on the duration in a first event
(Henderson and Prince, 2000).
Another approach is to introduce random frailty terms at the cluster level or common
frailties across events. The frailty term represents common influences across clusters or
events that are neglected or not observed. Responses on members of a cluster (or on cor-
related events) are typically assumed independent given the value of the cluster effect (or
shared frailty factor) (Castro et al., 2014). Sahu and Dey (2004, p.325) describe how different
frailty assumptions lead to different correlations between log survival times in a bivariate
situation (under the assumption of a Weibull baseline hazard).
Let tij be the failure time for the jth component or outcome (j = 1, …, mi) of the ith subject
(i = 1, …, n). Then the hazard function assuming a common multiplicative frailty takes the
form (Sahu et al., 1997; Yin and Ibrahim, 2005)
with the unit frailty effect γi distributed independently of Zij and tij. If γi is high, then all
hazards are raised, and so times tij tend to be low; if γi is low then all hazards are lowered
and the tij tend to be relatively extended. In this way, the common frailty induces a positive
association between observed times.
In the case of repeated occurrences r = 1, … , Ri of the same outcome to the same subject
(e.g. multiple occupation shifts or repeat cardiac events), the hazard function conditional
on γi is independent of the number r of previous occurrences (Sinha, 1993). Unconditionally,
however, the hazard for the (r + 1)th occurrence is
The same scenario applies when subjects i are nested within clusters j, with cluster effects
γj shared between the nj individuals in the same cluster
If the γj are assumed gamma distributed $\gamma_j \sim \mathrm{Ga}(\eta, \eta)$ with variance $1/\eta$, then smaller
values of η signify a closer relationship between subjects in the same group and greater
heterogeneity between the groups. For models including cure rates, Yin (2005) proposes
multiplicative frailty at cluster level combined with Poisson regression for θij in the cure
fractions exp(−θij). One option takes
$$h(t_{ij} \mid \lambda_{ij}, \kappa_j) = \lambda_{ij}\kappa_j t_{ij}^{\kappa_j - 1}.$$
$$\log(\lambda_{ij}) = Z_i\beta_j + b_i + \delta u_i,$$
where $b_i \sim N(0, \sigma_b^2)$, δ is positive, and $u_i \sim N^+(0, 1)$ with ui independent of bi. Under the skew
log-t model,
$$b_i \sim t(0, \nu, \sigma_b^2),$$
where b0(t) is the baseline intensity function, and the γs represent subject level frailty. The
t
where B0∗ (t) is an assumed mean intensity. The likelihood kernel for each spell within
each subject is Poisson in form [3] with response variables dNij = 1 or 0, and means
$dB_0(t_{ij})\exp(Z_{ij}\beta)\gamma_s$.
with a gamma process prior on the λj, and normally distributed patient and hospital
effects $e_{lm} \sim N(0, \sigma_e^2)$, and $u_m \sim N(0, \sigma_u^2)$. Because the two sources of variation are confounded,
a uniform prior V ~ U(0,100) is adopted on the total variance $V = \sigma_u^2 + \sigma_e^2$, and a
U(0,1) prior on the ratio $\sigma_u^2/(\sigma_u^2 + \sigma_e^2)$.
A two-chain run using jagsUI converges in 10,000 iterations and provides a LOO-IC
of 755 and WAIC of 754. Although Yau (2001) reported no significant hospital variation,
here posterior means (95% CRI) for σe and σu of 1.09 (0.59, 1.77) and 0.51 (0.11, 1.12) are
obtained, with the within-hospital correlation estimated as 0.21. The treatment effect is
estimated as −1.20 (−1.97, −0.49). Centred hospital effects have posterior means varying from
0.42 (hospital 2) to −0.27 (hospital 10), with the site 2 effect having an 87% probability of
being positive.
Q-Q plots of the posterior means of both elm and um suggest departures from normality,
associated with positive skew in both sets of effects. Assuming Student t4 priors for both
sets of effects does not improve fit, and the corresponding Q-Q plots (using the R function
TQQPlot) are also unsatisfactory.
We then consider a Dirichlet process mixture prior to model the patient random
effects (model 3). A gamma(3.5,0.5) prior is adopted on the Dirichlet precision param-
eter (Dorazio, 2009), with the base density for the random effects taken as normal. This
produces a LOO-IC of 749. A skew-normal model also slightly improves fit in terms of
the LOO-IC, reducing it to 754. The estimated effective parameters total is unchanged
despite the addition of positive normal effects, as in the conditional representation of
the skew-normal (Huang and Dagne, 2012; Ghosh et al., 2007).
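The conditional representation of the skew-normal adds a positive (half-normal) effect δu to a normal effect b. The sketch below simulates this construction and confirms the positive skew; the values of sigma_b and delta are hypothetical, and the moment-based skewness measure is one choice among several.

```r
# Conditional representation of a skew-normal random effect:
# e = b + delta * u, with b ~ N(0, sigma_b^2) and u ~ N+(0, 1) independent
set.seed(3)
n       <- 20000
sigma_b <- 1     # hypothetical normal scale
delta   <- 2     # hypothetical positive skewness parameter
b <- rnorm(n, 0, sigma_b)
u <- abs(rnorm(n))           # half-normal draw, i.e. N+(0, 1)
e <- b + delta * u

# Moment-based sample skewness: positive when delta > 0
skew <- mean((e - mean(e))^3) / sd(e)^3
skew
```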
As a final option, the random patient and hospital effects are represented as multipli-
cative effects with gamma densities (Glidden and Vittinghoff, 2004). Thus
$$\alpha_{1lm} \sim \mathrm{Ga}(\delta_1, \delta_1),$$
$$\alpha_{2m} \sim \mathrm{Ga}(\delta_2, \delta_2),$$
$$\delta_j \sim \mathrm{Ga}(a_\delta, b_\delta).$$
The prior mean of the gamma effects α1lm and α2m is set at 1 for identification. With the
setting aδ = 1, bδ = 0.1, a two-chain run using jagsUI converges in 10,000 iterations. This
provides similar fit statistics to the normal errors model, namely a LOO-IC of 755 and
WAIC of 753. The estimated gamma parameters (posterior mean and standard deviation)
are 1.62 (1.23) and 12.46 (10.29). A plot of actual patient effects against the implied
gamma density, Ga(1.62,1.62), shows a better representation of the skew in the patient
effects, albeit still with some discrepancies.
$$b_i \sim N(0, \sigma_b^2),$$
where bi are random patient effects, Age is age at diagnosis and Type relates to diabetes
type. A model (implemented using jagsUI) without frailty produces significant Weibull
shape effects: both shape parameters κj have 95% intervals below 1, suggesting a lesser
chance of impairment at longer follow-ups. The LOO-IC is 2,756, with the worst fitted
observation being eye 2 for patient 95. The treatment coefficient β2 has mean (95% CRI)
−0.79 (−1.12, −0.48), and the corresponding hazard ratio θ = exp(−β2) for untreated eyes
averages 2.24, with a 95% credible interval (1.62, 3.06).
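The hazard ratio summary is obtained by transforming each posterior draw of β2 and then summarising, rather than by exponentiating the posterior mean. The sketch below illustrates this with simulated stand-in draws: a normal approximation with mean −0.79 and standard deviation 0.16 (an assumption, chosen to roughly match the reported interval), not the actual MCMC output.

```r
# Stand-in for MCMC output: normal approximation to the posterior of beta2
set.seed(4)
beta2 <- rnorm(20000, mean = -0.79, sd = 0.16)

# Hazard ratio for untreated versus treated eyes, draw by draw
theta <- exp(-beta2)

# Posterior mean and 95% interval of the hazard ratio
c(mean = mean(theta), quantile(theta, c(0.025, 0.975)))

# The mean of the transformed draws exceeds the transform of the mean
c(mean_of_exp = mean(theta), exp_of_mean = exp(-mean(beta2)))
```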
In the frailty model, a U(0,10) prior on the standard deviation σb of the random effects
is adopted, with a posterior mean for σb of 0.45 (0.12, 0.82). Under this model, the time
effects are attenuated, with the 95% interval for κ1 now straddling 1. Despite the signifi-
cant patient heterogeneity, the LOO-IC increases slightly to 2,762. The treatment effect
increases, with θ now averaging 2.31 with 95% interval (1.65, 3.22). A histogram and
kernel density plot of the posterior mean bi show a subgroup with high negative values
(see Figure 11.8), suggesting that a discrete mixture approach (e.g. a two-group discrete
mixture normal) to frailty might be appropriate.
FIGURE 11.8
Posterior mean frailties. [Plot omitted; vertical axis: Density, 0.0 to 3.0.]
11.7 Competing Risks
Competing risks (CR) models involve the tracking of multiple durations corresponding
to different types of exit or transition (Haller et al., 2013). A number of packages in R can
estimate competing risks survival regression (Scrucca et al., 2010; Putter, 2018; Scheike and
Zhang, 2011). With non-repeatable events, subjects are observed until the first exit and
completion of one of the multiple durations, but for repeatable events (e.g. occupational
or migration histories), event histories might include repeated transitions between differ-
ent job or residential destinations. Sometimes the cause of exit may be masked (Sen et al.,
2010): exact information on the cause of exit is missing, but information is available that
can determine a set of potential causes of exit.
Assume that there are K possible mutually exclusive causes of exit, and let Ci be a subject
level categorical random variable with K possible outcomes representing observed cause of
exit. Under the latent failure time approach (Crowder, 2001; Box-Steffensmeier and Jones,
2004; Kozumi, 2004; Gelfand et al., 2000) with independent risks, there is a latent failure
time Tik corresponding to each outcome, but only the minimum time is observed when
individual i exits for cause ki, so that ti = min(Ti1 , … , TiK ) with ki = arg min(Ti1 , … , TiK ). The
remaining times are censored. All times are censored if an individual does not exit for any
of the K possible reasons.
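The latent failure time mechanism can be sketched by simulation. The example below assumes independent Weibull latent times with hypothetical shapes and scales, plus a hypothetical administrative censoring time; none of these values come from the book.

```r
# Latent failure times under independent competing risks
set.seed(5)
n <- 1000
K <- 3
shape <- c(1.2, 0.8, 2.0)   # hypothetical Weibull shapes
scale <- c(10, 15, 8)       # hypothetical Weibull scales
cens  <- 12                 # hypothetical administrative censoring time

# n x K matrix of latent times T_ik, one column per cause
Tlat <- sapply(1:K, function(k) rweibull(n, shape[k], scale[k]))

t_min <- apply(Tlat, 1, min)         # t_i = min(T_i1, ..., T_iK)
k_min <- apply(Tlat, 1, which.min)   # k_i = arg min(T_i1, ..., T_iK)

# Observed data: exit time and cause, with 0 marking censoring on all risks
t_obs <- pmin(t_min, cens)
cause <- ifelse(t_min <= cens, k_min, 0)
table(cause)
```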
With these assumptions, and conditioning on possibly cause-specific predictors Zk, the
total hazard rate may be expressed as a sum of cause-specific hazards,
$$h(t \mid Z) = \sum_{k=1}^{K} h_k(t \mid Z_k),$$
while the overall survivor function is the product of the cause-specific survivor functions,
$$S(t \mid Z) = \prod_{k=1}^{K} S_k(t \mid Z_k).$$
Assuming a failure to risk ki is observed, the contribution of the ith subject to the
likelihood has the form
$$f_{k_i}(t_i \mid Z_{ik_i}) \prod_{l \neq k_i} S_l(t_i \mid Z_{il}) = h_{k_i}(t_i \mid Z_{ik_i}) \prod_{l=1}^{K} S_l(t_i \mid Z_{il}),$$
while for a subject censored on all risks, the contribution is $\prod_{l=1}^{K} S_l(t_i \mid Z_{il})$. With event
indicators dik = 1 if Ci = k, and dir = 0 for r ≠ k, the likelihood contribution is equivalently
$$\prod_{k=1}^{K} \left[h_k(t_i \mid Z_{ik})\right]^{d_{ik}} S_k(t_i \mid Z_{ik}).$$
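In log form, the contribution just described is the sum over causes of dik log hk(ti) plus the sum over causes of log Sk(ti). A minimal R sketch follows, assuming Weibull cause-specific hazards hk(t) = λk κk t^(κk − 1) with Sk(t) = exp(−λk t^κk); the function name cr_loglik and its arguments are hypothetical.

```r
# Log-likelihood contribution of one subject under independent competing risks
# t:      observed exit or censoring time (scalar)
# d:      length-K vector of event indicators (all zero if censored)
# lambda: length-K vector of Weibull rate parameters
# kappa:  length-K vector of Weibull shape parameters
cr_loglik <- function(t, d, lambda, kappa) {
  log_h <- log(lambda) + log(kappa) + (kappa - 1) * log(t)  # log hazards
  log_S <- -lambda * t^kappa                                # log survivors
  sum(d * log_h) + sum(log_S)
}

# Exit at t = 2 for cause 1 of K = 2, with exponential hazards (kappa = 1)
cr_loglik(2, d = c(1, 0), lambda = c(0.1, 0.2), kappa = c(1, 1))
# Censored on both risks at t = 2
cr_loglik(2, d = c(0, 0), lambda = c(0.1, 0.2), kappa = c(1, 1))
```

Summing cr_loglik over subjects gives the full log-likelihood, which separates into one term per cause when the causes share no parameters.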
For continuous survival times, one may assume parametric forms for the time effect, e.g.
a Weibull hazard
$$h_{kl}(t \mid Z_{ik}) = \lim_{\Delta t \to 0} \Pr(t < T_{im} \le t + \Delta t, C_{i,m+1} = l \mid T_{im} > t, C_{im} = k, Z_{ik}) / \Delta t$$
is the instantaneous risk of moving from state k to state l (with l ≠ k), given survival in the
mth state until t. Under independent risks, the overall hazard for leaving state k is then
$$h_k(t \mid Z_{ik}) = \sum_{l \neq k} h_{kl}(t \mid Z_{ik}).$$
For discrete time data, the functions described in Section 11.5 similarly generalise to the
competing risk case. For non-repeated events, intervals [a_{j−1}, a_j) for j = 1, …, J + 1, and
Ci ∈ (1, …, K), event probabilities are
$$f_{jk} = \Pr(t \in [a_{j-1}, a_j), C = k),$$
$$S_j = \prod_{m=1}^{j} \prod_{h=1}^{K} (1 - q_{mh}).$$
Define event indicators dimh = 1 when a non-repeatable event h occurs in interval m, and 0
otherwise. Then for subject i undergoing the kth event in the jth interval, the event
indicators are dijk = 1, {dijh = 0, h ≠ k} and di1h = di2h = … = di,j−1,h = 0 for all h, with likelihood
$$q_{ijk} \prod_{m=1}^{j-1} \prod_{h=1}^{K} (1 - q_{imh}) = q_{ijk} S_{i,j-1}.$$
The response at each interval for discrete competing risks is multinomial, and to model
the impact of predictors different links may be used such as the multiple logit, or multiple
probit. Consider a multiple logit link with K + 1 categories (K alternative risks plus an extra
category for survival, denoted by Ci = 0). Let the reference category be for survival, and
define regression coefficients βk for the kth risk. Assuming the βr (r = 1, …, K) do not
contain an intercept would lead to
$$q(t \in [a_{j-1}, a_j), C_i = 0) = \frac{1}{1 + \sum_{r=1}^{K} \exp(\alpha_{jr} + Z_{ir}\beta_r)},$$
$$q(t \in [a_{j-1}, a_j), C_i = h \mid Z_{ih}) = \frac{\exp(\alpha_{jh} + Z_{ih}\beta_h)}{1 + \sum_{r=1}^{K} \exp(\alpha_{jr} + Z_{ir}\beta_r)}, \quad h = 1, \ldots, K,$$
where the parameters αjh describe the baseline hazard for risk h. K-dimensional versions of
the correlated prior processes discussed in Section 11.3 may be used for the αjh, for exam-
ple, multivariate normal first- or second-order random walks.
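The multiple logit probabilities above can be computed directly. The sketch below is illustrative only: the function names, the J x K matrix layout for the αjh, and the use of the survival-category probability as the per-interval survival term in the likelihood are assumptions of this example, and the covariate terms Zir βr are taken as constant over intervals.

```r
# Multiple logit interval probabilities for K competing risks plus survival
# alpha_j: length-K baseline terms for interval j
# eta:     length-K regression terms Z_ir %*% beta_r
cr_logit_hazard <- function(alpha_j, eta) {
  num <- exp(alpha_j + eta)
  c(survive = 1, num) / (1 + sum(num))   # first element: survival category
}

# Likelihood for a subject with event k in interval j (k = 0: survives interval j)
# alpha: J x K matrix of baseline terms
cr_discrete_lik <- function(alpha, eta, j, k) {
  q <- t(sapply(seq_len(j), function(m) cr_logit_hazard(alpha[m, ], eta)))
  surv_before <- if (j > 1) prod(q[seq_len(j - 1), 1]) else 1
  surv_before * q[j, k + 1]
}

# Two risks, all terms zero: each category has probability 1/3
cr_logit_hazard(c(0, 0), c(0, 0))
# Survive interval 1, then exit for cause 1 in interval 2
cr_discrete_lik(matrix(0, 3, 2), eta = c(0, 0), j = 2, k = 1)
```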
11.7.1 Modelling Frailty
Assuming independent risks, one may introduce unobserved frailties γik that impact on
each risk, but are uncorrelated across risks, such as independent gamma densities with
mean 1 for each possible cause. Under proportionality, the risk specific hazard in a con-
tinuous time CR hazard is then
The assumption of independent risks may not hold in practice because particular groups
of subjects may be more likely to experience subsets of the events. Just as it may be unre-
alistic in multinomial discrete choice situations to assume independence of irrelevant
alternatives (i.e. that ratios of choice probabilities of any two alternatives are unaffected
by changes in utilities of any other alternatives, or by their removal), so it may be unreal-
istic in survival analysis that the relative risks of two outcomes will be unaffected by the
removal of a third (Gordon, 2002).
To allow for dependent competing risks, especially for multiple spell data, one may assume
correlated or dependent frailties. In a generalisation of the MPH scheme, Abbring and van
den Berg (2003) mention that the joint distribution of (Ti1, …, TiK), given predictors Zik and
correlated frailties (γi1, …, γiK), factorises into independent densities f(Tk | Zik, {γi1, …, γiK})
which are fully characterised by cause-specific hazard rates
Correlated frailties are also obtained by expanding the regression term to a general mixed
form, as in Section 11.4, so that in a continuous time analysis,
where bik are zero mean effects that might be multivariate normal, discrete mixtures of
multivariate normal, etc.
Assuming a multivariate normal bik with covariance matrix Σb, dependent risks will be
apparent in significant off-diagonal terms. Whether there are significant correlations in
the frailty effects over different risks will depend in part on whether observed predictors
successfully explain variations in event proneness. Another possibility is a common frailty
model with risk specific loadings, so that
$$b_{ik} = \lambda_k b_i,$$
FIGURE 11.9
Survival curve without predictors. [Plot omitted: Kaplan-Meier, Weibull and log-logistic
survival curves against days in neutropenia, 0 to 80.]
for risks k = 1, 2, where λik = exp(βk xi). The cause-specific cumulative hazards are
Estimation uses rstan [4], with early and unproblematic convergence. Table 11.3 shows that
allogeneic transplants are associated with a lower risk of bloodstream infection, and
Figure 11.10 plots out the contrasting cumulative hazard curves for this cause of exit
up to 25 days. The effect of allogeneic transplantation is also negative for the other risk
(end of neutropenia), meaning that events of either type are delayed for the allogeneic
treatment group.
TABLE 11.3
Posterior Summary. Transplant Data
Weibull Regression                  Mean    St Devn   2.5%    97.5%   Standardised Variability
Risk 1   Intercept                  −4.20   0.20      −4.58   −3.83
         Allogeneic Transplant      −0.54   0.14      −0.82   −0.28   0.26
         Female                     −0.18   0.14      −0.46    0.09   0.79
         Weibull Shape               1.22   0.07       1.09    1.36
Risk 2   Intercept                  −4.98   0.13      −5.25   −4.72
         Allogeneic Transplant      −1.19   0.07      −1.33   −1.04   0.06
         Female                      0.09   0.07      −0.06    0.23   0.84
         Weibull Shape               2.04   0.04       1.96    2.13
Cox Regression                      Mean    St Devn   2.5%    97.5%   Standardised Variability
Risk 1   Allogeneic Transplant      −0.25   0.15      −0.54    0.03   0.58
         Female                     −0.17   0.14      −0.46    0.11   0.83
Risk 2   Allogeneic Transplant      −1.26   0.08      −1.41   −1.10   0.06
         Female                      0.06   0.08      −0.08    0.21   1.17
FIGURE 11.10
Cumulative hazard by transplant source. [Plot omitted: cumulative cause-specific hazards
(0.0 to 0.8) against days (0 to 25); allogeneic (o) versus autologous (*).]
FIGURE 11.11
Cox regression. Cumulative hazard, end of neutropenia, by transplant source. [Plot omitted:
cumulative cause-specific hazards (0 to 15) against days (0 to 40); allogeneic (o) versus
autologous (*).]
We also apply Cox regression to these data, based on the distinct event times for each
risk. Figure 11.11 shows the resulting cumulative hazard plots for the end of neutropenia.
An issue in comparing Cox and parametric regression is the possibility of differing preci-
sion of estimated covariate and treatment effects (Nardi and Schemper, 2003). It can be
seen from Table 11.3 that the estimated coefficients for allo and sex from Cox regression
have higher standardised variability (i.e. are less efficient). This is especially so for the
impact of transplant type on the risk of bloodstream infection. Standardised variability
is measured as sd(β)/|β|.
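As a small illustration, standardised variability can be computed from posterior draws, or approximately from the rounded summaries in Table 11.3. The function name std_var is hypothetical; the numerical inputs are the reported means and standard deviations for the allogeneic transplant effect on risk 1.

```r
# Standardised variability of a coefficient from its posterior draws
std_var <- function(draws) sd(draws) / abs(mean(draws))

# Approximate values from the rounded summaries in Table 11.3
sv_weibull <- 0.14 / abs(-0.54)   # Weibull regression, risk 1
sv_cox     <- 0.15 / abs(-0.25)   # Cox regression, risk 1
round(c(weibull = sv_weibull, cox = sv_cox), 2)
```

Small differences from the tabulated values are to be expected, since the table entries would be based on unrounded posterior summaries.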
d1=ifelse(cause==1,1,0)
t.d1=subset(obs_t,cause==1)
# unique event times
t.d1.unique=unique(t.d1)
NT1=length(t.d1.unique)
t1_unique=c(sort(t.d1.unique),max(obs_t)+1)
# define at risk and counting process increments
Y1=dN1=matrix(,N,NT1)
for (i in 1:N) {for (j in 1:NT1) {
  Y1[i,j]=ifelse(obs_t[i]>=t1_unique[j],1,0)}}
for (i in 1:N) {for (j in 1:NT1) {
  dN1[i,j]=Y1[i,j]*(t1_unique[j+1]>obs_t[i])*d1[i]}}
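The double loops above can also be written in vectorised form with outer(). The sketch below wraps the same calculation in a hypothetical helper, make_counting, and applies it to a small made-up data set; it is intended to reproduce the loop results rather than replace the book's code.

```r
# Vectorised at-risk indicators and counting process increments
# obs_t: observed times; d: event indicators for the cause of interest
make_counting <- function(obs_t, d) {
  tu <- c(sort(unique(obs_t[d == 1])), max(obs_t) + 1)  # event times + end point
  NT <- length(tu) - 1
  Y  <- outer(obs_t, tu[1:NT], ">=") * 1                # at risk at time j
  dN <- Y * outer(obs_t, tu[2:(NT + 1)], "<") * d       # event in [tu[j], tu[j+1])
  list(Y = Y, dN = dN, t_unique = tu, NT = NT)
}

# Toy data (made up): five subjects, three events
obs_t <- c(5, 3, 8, 3, 10)
d     <- c(1, 1, 0, 0, 1)
cp <- make_counting(obs_t, d)
cp$Y
cp$dN
```

Each subject with an event contributes exactly one unit increment, in the column of its own event time, and censored subjects contribute none.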
This is the usual assignment of at risk and increment indicators in cause-specific haz-
ard regression. The cause-specific hazard is the instantaneous risk of the event (i.e. a
specific cause of exit) in subjects currently event-free, namely for cause k,
TABLE 11.4
Cell Lymphoma. Alternative Hazard Regression Coefficients, Posterior Summaries
                                      Cause-Specific Hazard      Subdistribution Hazard
Competing Events     Predictor        Mean    2.5%    97.5%      Mean    2.5%    97.5%
Relapse              Stage            0.35    0.09    0.61       0.40    0.16    0.65
                     Chemotherapy     0.09    −0.26   0.41       −0.03   −0.39   0.30
                     Age/100          4.29    3.35    5.25       1.85    0.97    2.71
                     HGB/100          0.79    0.12    1.47       0.60    −0.07   1.26
Death in Remission   Stage            0.11    −0.44   0.63       −0.09   −0.62   0.42
                     Chemotherapy     −0.08   −0.85   0.63       −0.38   −1.09   0.23
                     Age/100          8.32    6.12    10.52      4.36    2.66    5.95
                     HGB/100          −0.01   −1.61   1.56       −0.54   −1.89   0.71
The risk set now includes subjects who have previously experienced a competing cause
of exit, as well as subjects currently event-free. For the cause-specific hazard, the risk set
shrinks every time there is an exit from another cause, with such exits treated as censored.
With the subdistribution hazard, subjects that exit for a cause j ≠ k remain in the risk set
for cause k and are given a censoring time larger than all event times. The coefficients from
a subdistribution hazard model may be interpreted as the impacts of covariates on the
incidence of the event (Austin and Fine, 2017).
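The difference between the two risk sets can be seen on a small made-up data set. The sketch below follows the description above, keeping competing exits at risk at all event times for cause k; it ignores the censoring weights that a full Fine-Gray analysis with random censoring would use, and all variable names are illustrative.

```r
# Risk sets under cause-specific and subdistribution hazards (toy data)
obs_t <- c(2, 4, 5, 7, 9)
cause <- c(2, 1, 0, 1, 2)   # 0 = censored, 1 and 2 = competing causes
k <- 1                      # cause of interest

u <- sort(unique(obs_t[cause == k]))   # distinct event times for cause k

# Cause-specific: at risk only while still event-free
Y_cs <- outer(obs_t, u, ">=") * 1

# Subdistribution: subjects exiting for another cause stay in the risk set
competing <- cause != k & cause != 0
Y_sd <- Y_cs
Y_sd[competing, ] <- 1

rbind(cause_specific = colSums(Y_cs), subdistribution = colSums(Y_sd))
```

Here the subject exiting for cause 2 at time 2 is absent from the cause-specific risk sets at times 4 and 7 but present in the subdistribution risk sets.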
Table 11.4 summarises the posterior distributions of the covariate effects under these
alternative approaches. The effects are not that dissimilar, mainly differing in a lower
impact of age on incidence, while the impact of stage on the incidence of relapse is
enhanced. The estimated coefficients under classical methods (using coxph from the survival
package, and crr from cmprsk) are similar to the Bayesian estimates.
11.8 Computational Notes
[1] The rstan code for the counting process model in Example 11.2 includes R calcula-
tions to convert time and event indicators (ti,di) into suitable form. Thus
obs_t=t
t.d=subset(obs_t,d==1)
# unique event times
t.d.unique=unique(t.d)
NT=length(t.d.unique)
t_unique=c(sort(t.d.unique),max(obs_t)+1)
# define at risk and counting process increments
Y=dN=matrix(,N,NT)
for (i in 1:N) {for (j in 1:NT) {
  Y[i,j]=ifelse(obs_t[i]>=t_unique[j],1,0)}}
for (i in 1:N) {for (j in 1:NT) {
  dN[i,j]=Y[i,j]*(t_unique[j+1]>obs_t[i])*d[i]}}
# centred and scaled Karnofsky score
KS.c=(KS-mean(KS))/10
# dataset (list names must match the Stan data block, which declares PT)
Dstan=list(N=N,NT=NT,t_unique=t_unique,Y=Y,dN=dN,PT=PT,KS=KS.c)
CP.stan ="
data {
  int<lower=0> N;
  real KS[N];
  int<lower=0> NT;
  int<lower=0> Y[N,NT];
  int<lower=0> dN[N,NT];
  int<lower=0> t_unique[NT + 1];
  real PT[N];
}
transformed data {
  real c;
  real r;
  c = 0.001;
  r = 0.1;
}
parameters {
  real beta[2];
  real betaKS[NT];
  // bounds match the support of the uniform prior below
  real<lower=0.001,upper=1> sigmaKS;
  real<lower=0> dL0[NT];
}
model {
  real dt[NT];
  beta ~ normal(0, 10);
  sigmaKS ~ uniform(0,1);
  betaKS[1] ~ normal(0,1);
  //RW prior on KS coefficients
  for (j in 2:NT){betaKS[j] ~ normal(betaKS[j-1],sigmaKS);}
  //gamma increments prior
  for (j in 1:NT) {
    dt[j] = t_unique[j+1] - t_unique[j];
    dL0[j] ~ gamma(r * dt[j] * c, c);
    //Poisson kernel for counting process increments, subjects at risk only
    for (i in 1:N) {if (Y[i, j] != 0)
      target += poisson_lpmf(dN[i, j] |
        Y[i, j]*exp(betaKS[j]*KS[i]+beta[1]*PT[i]+beta[2]*PT[i]*KS[i]) * dL0[j]);}}}
generated quantities {
  real S_noPT[NT];
  real S_PT[NT];
  //Survivor functions by prior therapy, Karnofsky score set at upper quartile
  for (j in 1:NT) {
    real s;
    s = 0;
    for (i in 1:j)
      s = s + dL0[i];
    S_PT[j] = pow(exp(-s), exp(betaKS[j]*1.64+beta[1]+beta[2]*1.64));
    S_noPT[j] = pow(exp(-s), exp(betaKS[j]*1.64));}}
"
# Compilation and Estimation
library(rstan)
sm = stan_model(model_code=CP.stan)
fit = sampling(sm,data=Dstan,iter=1500,warmup=250,chains=2,seed=12345)
print(fit)
betaKS <- extract(fit,"betaKS",permuted=FALSE)
[2] The code for the cure fraction age of maternity model is
loglogistCF.stan ="
functions{
  // log density for an observed event under the log-logistic cure fraction model
  real loglogistCF_lpdf(real t, real kappa, real lambda, real theta) {
    return(log(theta)+log(kappa)+log(lambda)+(kappa-1)*log(t)
      -2*log(1+lambda*t^kappa)-theta*(1-1/(1+lambda*t^kappa)));}
  // log survivor function, used for censored observations
  real loglogistCF_S_lpdf(real t, real kappa, real lambda, real theta) {
    return(-theta*(1-1/(1+lambda*t^kappa)));}
}
data {int<lower=1> n;//number of cases
  vector[n] t;//response
  int<lower=0,upper=1> d[n];//event indicator(1=occurred, 0=censored)
  int<lower=0> p;//total regression parameters, incl. intercept
  int<lower=0> educ[n];
  int<lower=0> sibs[n];
  int<lower=0> white[n];
  int<lower=0> immig[n];
  int<lower=0> lowinc[n];
  int<lower=0> city[n];
}
parameters {vector[p] beta;
  real<lower=1> kappa;//shape parameter
  real<lower=0> theta;//cure fraction parameter
}
transformed parameters {
  real eta[n];
  real lambda[n];
  real lambdaT;
  real p_nochild;
  real modeT;//modal age first maternity (13 years education, 3 siblings, white)
  p_nochild = exp(-theta);//rate of childlessness
  lambdaT = exp(beta[1]+beta[2]*13+beta[3]*3+beta[4]);
  modeT = ((kappa-1)/lambdaT)^(1/kappa);
  for (i in 1:n) {eta[i]= beta[1]+beta[2]*educ[i]+beta[3]*sibs[i]
      +beta[4]*white[i]+beta[5]*immig[i]
      +beta[6]*lowinc[i]+beta[7]*city[i];
    lambda[i] =exp(eta[i]);}}
model {target += gamma_lpdf(kappa | 0.01, 0.01);
  target += gamma_lpdf(theta | 0.01, 0.01);
  for (i in 1:n) {
    if (d[i] == 1) {target += loglogistCF_lpdf(t[i] | kappa, lambda[i],theta);}
    else if (d[i] == 0) {target += loglogistCF_S_lpdf(t[i] | kappa, lambda[i],theta);}}}
generated quantities{real log_lik[n];
  for (i in 1:n) {
    if (d[i] == 1) {log_lik[i]= loglogistCF_lpdf(t[i] | kappa, lambda[i],theta);}
    else if (d[i] == 0) {log_lik[i]= loglogistCF_S_lpdf(t[i] | kappa, lambda[i],theta);}}}
"
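The two user-defined functions in the Stan code can be mirrored in R to check that they are mutually consistent: the density should integrate to one minus the survivor function, and the survivor function should level off at the cure fraction exp(−θ). The sketch below uses hypothetical parameter values and the R function names log_f_pop and log_S_pop, which are not part of the book's code.

```r
# R versions of the log density and log survivor function used in the Stan code
log_f_pop <- function(t, kappa, lambda, theta) {
  log(theta) + log(kappa) + log(lambda) + (kappa - 1) * log(t) -
    2 * log1p(lambda * t^kappa) - theta * (1 - 1 / (1 + lambda * t^kappa))
}
log_S_pop <- function(t, kappa, lambda, theta) {
  -theta * (1 - 1 / (1 + lambda * t^kappa))
}

# Hypothetical parameter values
kappa <- 3; lambda <- 1; theta <- 2

# Probability of an event before t = 10, by numerical integration of the density
mass <- integrate(function(t) exp(log_f_pop(t, kappa, lambda, theta)), 0, 10)$value

# The same probability from the survivor function
from_S <- 1 - exp(log_S_pop(10, kappa, lambda, theta))

# Cure fraction: limit of the survivor function at long durations
cure <- exp(-theta)
c(mass = mass, from_S = from_S, cure = cure)
```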
[3] An example of the computation involves repeated event times for mammary tumours
in rats randomly assigned to treatment and control groups (Sinha, 1993). The total
number of tumours diagnosed in each rat varies between 0 and 13, so spell totals
for each rat (including possibly censored final spells) range from 1 to 14.
There are n=253 spells in all, for K=48 rats, and J+1=35 distinct times relevant to
defining the J=34 intervals, with a[J+1] = tmax = 182. A BUGS/JAGS code for such an
analysis, including gamma frailty for each rat, a treatment covariate, and indicators d[i]
of tumour occurrence or censoring, is
model {for (j in 1:J) {for(i in 1:n) {
# Y indicates whether case still at risk
# (Y and dN are BUGS data transformations; in JAGS such assignments go in a data block)
Y[i,j] <- step(t[i] - a[j] + eps)
dN[i, j] <- Y[i, j] * step(a[j + 1] - t[i] - eps) * d[i]
dN[i, j] ~dpois(lam[i, j])
lam[i, j] <- Y[i, j] * exp(beta * trt[i]) * dB0[j] * gam[rat[i]]}
# independent increment gamma process
dB0[j] ~dgamma(mu[j], c); mu[j] <- dB0.star[j] * c
dB0.star[j] <- M * (a[j + 1] - a[j])
# Survivorship in two groups
S.tr[j] <- pow(exp(-sum(dB0[1: j])), exp(beta));
S.cntr[j] <- exp(-sum(dB0[1: j]))}
# priors on hyperparameters
c <- 1; M ~dexp(1); beta ~dnorm(0,0.001)
# frailty prior
for (k in 1:K) {gam[k] ~dgamma(h,h)}
h ~dgamma(1,0.001)
var.gam <- 1/h}
where eps is a small positive value to ensure at risk and counting indices are cor-
rectly defined. The gamma process includes an unknown parameter M defining
the mean intensity. The first few records for the spell level data take the form
rat[] trt[] t[] d[]
1 1 182 1
2 1 182 0
3 1 63 1
3 1 68 1
3 1 182 0
4 1 152 1
4 1 182 0
while the other data inputs are list(n=253,J=34,a=c(63,66,68,71,74,77,81,84,85,88,
91,95,98,102,105,108,112,116,119,123,
126,130,134,137,140,145,150,152,157,
161,167,172,174,179,182),eps=0.001,K=48).
weibCR.stan ="
data {
int<lower=1> N;//number of cases
int<lower=1> N2;//number of cases
int<lower=1> T;//number of time points for CH profiles
int<lower=1> K;//number of competing causes of exit
vector[N] time;//observed or censored times
// [...]
else if (cens1[i] == 1) {log_lik[i]= weibull_lccdf(time[i] | shape[1], nu1[i]);}
if (cens2[i] == 0) {log_lik[i+N]= weibull_lpdf(time[i] | shape[2], nu2[i]);}
else if (cens2[i] == 1) {log_lik[i+N]= weibull_lccdf(time[i] | shape[2], nu2[i]);}
}}
References
Aalen O (1988) Heterogeneity in survival analysis. Statistics in Medicine, 7, 1121–1137.
Aalen O, Hjort N (2002) Frailty models that yield proportional hazards. Statistics & Probability Letters,
58, 335–342.
Abbring J, van den Berg G (2003) The identifiability of the mixed proportional hazards competing
risks model. Journal Royal Statistical Society: Series B, 65, 701–710.
Abbring J, van den Berg G (2007) The unobserved heterogeneity distribution in duration analysis.
Biometrika, 94, 87–99.
Aitkin M, Clayton D (1980) The fitting of exponential, Weibull and extreme value distributions to
complex censored survival data using GLIM. Journal of Applied Statistics, 29, 156–163.
Aitkin M, Francis B, Hinde J (2005) Statistical Modelling in GLIM 4. OUP, Oxford, UK.
Albert JH, Chib S (2001) Sequential ordinal modeling with applications to survival data. Biometrics,
57(3), 829–836.
Allison P (1997) Survival Analysis Using the SAS System: A Practical Guide. SAS Institute Inc., Cary, NC.
Anderson JE, Louis TA, Holm NV, Harvald B (1992) Time-dependent association measures for bivari-
ate survival distributions. Journal of the American Statistical Association, 87(419), 641–650.
Arjas E, Gasbarra D (1994) Nonparametric Bayesian inference from right censored survival data,
using the Gibbs sampler. Statistica Sinica, 4, 505–524.
Austin P (2017) A tutorial on multilevel survival analysis: Methods, models and applications.
International Statistical Review, 85(2), 185–203.
Austin P, Fine J (2017) Practical recommendations for reporting Fine-Gray model analyses for com-
peting risk data. Statistics in Medicine, 36(27), 4391–4400.
Baker M, Melino A (2000) Duration dependence and nonparametric heterogeneity: A Monte Carlo
study. Journal of Econometrics, 96, 357–393.
Banerjee S, Carlin BP (2004) Parametric spatial cure rate models for interval-censored time-to-relapse
data. Biometrics, 60(1), 268–275.
Barmby T (2002) Worker absenteeism: A discrete hazard model with bivariate heterogeneity. Labour
Economics, 9, 469–447.
Bender A, Groll A, Scheipl F (2018) A generalized additive model approach to time-to-event analysis.
Statistical Modelling, 18, 1–23.
Bennett S (1983) Log-logistic regression models for survival data. Applied Statistics, 32, 165–171.
Beyersmann J, Scheike T (2013) Classical regression models for competing risks, Chapter 8, pp 157–177,
in Handbook of Survival Analysis, eds J Klein, H C van Houwelingen, J G Ibrahim, T Scheike. CRC.
Bogaerts K, Komarek A, Lesaffre E (2017) Survival Analysis with Interval-Censored Data: A Practical
Approach with Examples in R, SAS, and BUGS. CRC Press.
Børing P (2009) Gamma unobserved heterogeneity and duration bias. Econometric Reviews, 29(1),
1–19.
Bowers N, Gerber H, Hickman J (1986) Actuarial Mathematics, 1st Edition. The Society of Actuaries,
Itasca, IL.
Box-Steffensmeier JM, Jones BS (2004) Event History Modeling: A Guide for
Social Scientists. Cambridge University Press.
Brard C, Le Teuff G, Le Deley M, Hampson L (2017) Bayesian survival analysis in clinical trials: What
methods are used in practice? Clinical Trials, 14(1), 78–87.
Brezger A, Kneib T, Lang S (2005) BayesX: Analyzing Bayesian structured additive regression
models. Journal of Statistical Software, 14(11), 1–22.
Brouhns N, Denuit M, Vermunt J (2002) A Poisson log-bilinear regression approach to the construc-
tion of projected lifetables. Insurance: Mathematics and Economics, 31(3), 373–393.
Brüderl J, Diekmann A (1995) The log-logistic rate model: two generalizations with an application to
demographic data. Sociological Methods & Research, 24, 158–186.
Buros J (2016) Model Checking with Simulated Data (Survival Model Example).
https://www.bioconductor.org/help/course-materials/2016/BioC2016/ConcurrentWorkshops4/Buros/weibull-survival-model.html
Castro M, Chen M-H, Ibrahim J, Klein J (2014) Bayesian transformation models for multivariate sur-
vival data. Scandinavian Journal of Statistics, 41(1), 187–199.
Chen Q, Wu H, Ware L B, Koyama T (2014) A Bayesian approach for the Cox proportional hazards
model with covariates subject to detection limit. International Journal of Statistics in Medical
Research, 3(1), 32–43.
Chen M-H, Ibrahim J, Sinha D (1999) A new Bayesian model for survival data with a surviving frac-
tion. Journal of the American Statistical Association, 94, 909–919.
Chen M-H, Ibrahim J, Sinha D (2002) Bayesian inference for multivariate survival data with a cure
fraction. Journal of Multivariate Analysis, 80, 101–126.
Chen Y, Jewell N (2001) On a general class of semiparametric hazards regression models. Biometrika,
88, 687–702.
Chiang C (1984) The Life Table and its Applications. R.E. Krieger, Malabar, FL.
Clayton D (1991) A Monte Carlo method for Bayesian inference in frailty models. Biometrics, 47,
467–485.
Congdon P (2008) A bivariate frailty model for events with a permanent survivor fraction and non-
monotonic hazards; with an application to age at first maternity. Computational Statistics & Data
Analysis, 52, 4346–4356.
Congdon P (2009) Life expectancies for small areas: A Bayesian random effects methodology.
International Statistical Review, 77(2), 222–240.
Cooner F, Banerjee S, McBean A (2006) Modelling geographically referenced survival data with a
cure fraction. Statistical Methods in Medical Research, 15, 307–324.
Cox C, Matheson M (2014) A comparison of the generalized gamma and exponentiated Weibull dis-
tributions. Statistics in Medicine, 33(21), 3772–3780.
Cox D (1972) Regression models and life-tables. Journal of the Royal Statistical Society: Series B, 34,
187–220.
Crippa A (2018) A Not So Short Review on Survival Analysis in R. https://rpubs.com/alecri/258589
Crowder M (2001) Classical Competing Risks. CRC Press.
Damien P, Muller P (1998) A Bayesian bivariate failure time regression model. Computational Statistics
& Data Analysis, 28, 77–85.
Demarqui F, Loschi R, Colosimo E (2008) Estimating the grid of time-points for the piecewise expo-
nential model. Lifetime Data Analysis, 14(3), 333–356.
Diekmann A, Mitter P (1983) The “Sickle Hypothesis”: A time-dependent Poisson model with appli-
cations to deviant behavior and occupational mobility. The Journal of Mathematical Sociology, 9,
85–101.
Dorazio R (2009) On selecting a prior for the precision parameter of the Dirichlet process mixture
models. Journal of Statistical Planning and Inference, 139, 3384–3390.
Dykstra RL, Laud P (1981) A Bayesian nonparametric approach to reliability. The Annals of Statistics,
9(2), 356–367.
Efron B (1988) Logistic regression, survival analysis, and the Kaplan-Meier curve. Journal of the
American Statistical Association, 83(402), 414–425.
Fahrmeir L, Knorr Held L (1997) Dynamic discrete time duration models. Sociological Methodology,
27, 417–452.
Fahrmeir L, Tutz G (2001) Multivariate Statistical Modelling Based on Generalized Linear Models, 2nd
Edition. Springer Series in Statistics. Springer Verlag, New-York, Berlin, Heidelberg.
Fleming TR, Harrington DP (1991) Counting Processes and Survival Analysis, Vol. 169. John Wiley &
Sons.
Florens J, Fougere D, Mouchart M (1995) Duration models, pp 491–534, in The Econometrics of Panel
Data, eds L Matyas, P Sevestre. Kluwer.
Gamerman D (1991) Dynamic Bayesian models for survival data. Applied Statistics, 40, 63–79.
Gelfand A, Ghosh S, Christiansen C, Soumerai S, McLaughlin T (2000) Proportional hazard models:
a latent competing risk approach. Journal of Applied Statistics, 49, 385–397.
Ghosh P, Branco MD, Chakraborty H (2007) Bivariate random effect model using skew-normal dis-
tribution with application to HIV-RNA. Statistics in Medicine, 26(6), 1255–1267.
Glidden D, Vittinghoff E (2004) Modelling clustered survival data from multicentre clinical trials.
Statistics in Medicine, 23(3), 369–388.
Gordon S (2002) Stochastic dependence in competing risks. American Journal of Political Science, 46,
200–217.
Gore S, Pocock S, Kerr G (1984) Regression models and non-proportional hazards in the analysis of
breast cancer survival. Journal of Applied Statistics, 33, 176–195.
Gustafson P (1997) Large hierarchical Bayesian analysis of multivariate survival data. Biometrics, 53,
230–242.
Gustafson P (2000) Bayesian regression modeling with interactions and smooth effects. Journal of the
American Statistical Association, 95(451), 795–806.
Gustafson P, Aeschliman D, Levy A (2003) A simple approach to fitting Bayesian survival models.
Lifetime Data Analysis, 9, 5–19.
Hagar Y, Dignam J, Dukic V (2017) Flexible modeling of the hazard rate and treatment effects in long-
term survival studies. Statistical Methods in Medical Research, 26(5), 2455–2480.
Haller B, Schmidt G, Ulm K (2013) Applying competing risks regression models: An overview.
Lifetime Data Analysis, 19(1), 33–58.
Heckman J, Singer B (1984) A method for minimizing the impact of distributional assumptions in
econometric models for duration data. Econometrica, 52, 271–320.
Henderson R, Oman P (1999) Effect of frailty on marginal regression estimates in survival analysis.
Journal of the Royal Statistical Society: Series B, 61, 367–379.
Henderson R, Shimakura S, Gorst D (2002) Modeling spatial variation in leukemia survival data.
Journal of the American Statistical Association, 97, 965–972.
Henderson R, Prince H (2000) Choice of conditional models in bivariate survival. Statistics in Medicine,
19, 563–574.
Herring A, Ibrahim J (2002) Maximum likelihood estimation in random effects cure rate models with
nonignorable missing covariates. Biostatistics, 3, 387–405.
Hougaard P (1987) Modelling multivariate survival. Scandinavian Journal of Statistics, 14(4), 291–304.
Hougaard P (2000) Analysis of Multivariate Survival Data. Springer, New York.
Hougaard P, Myglegaard P, Borch-Johnsen K (1994) Heterogeneity models of disease susceptibility,
with application to diabetic nephropathy. Biometrics, 50, 1178–1188.
Huang Y, Dagne G (2012) Bayesian semiparametric nonlinear mixed-effects joint models for data
with skewness, missing responses, and measurement errors in covariates. Biometrics, 68(3),
943–953.
Huster W, Brookmeyer R, Self S (1989) Modeling paired survival data with covariates. Biometrics, 45,
145–156.
Ibrahim J, Chen M-H, MacEachern S (1999) Bayesian variable selection for proportional hazards
models. The Canadian Journal of Statistics, 27, 701–717.
Ibrahim J, Chen M-H, Sinha D (2001) Bayesian Survival Analysis. Springer-Verlag.
Kalbfleisch JD (1978) Non-parametric Bayesian analysis of survival time data. Journal of the Royal
Statistical Society: Series B (Methodological), 40(2), 214–221.
Kalbfleisch JD, Prentice R (1980) The Statistical Analysis of Failure Time Data. Wiley, New York.
Keiding N, Andersen P, Klein J (1997) The role of frailty models and accelerated failure time models
in describing heterogeneity due to omitted covariates. Statistics in Medicine, 16, 215–224.
Kiefer N (1988) Economic duration data and hazard functions. Journal of Economic Literature, 26,
646–679.
Kneib T (2006) Mixed model-based inference in geoadditive hazard regression for interval-censored
survival times. Computational Statistics & Data Analysis, 51, 777–792.
Kostaki A, Panousis V (2001) Expanding an abridged life table. Demographic Research, 5, 1.
Kozumi H (2004) Posterior analysis of latent competing risk models by parallel tempering.
Computational Statistics & Data Analysis, 46, 441–458.
Kuo L, Mallick B (1997) Bayesian semiparametric inference for the accelerated failure-time model.
Canadian Journal of Statistics, 25, 457–472.
Lambert P (2007) Modeling of the cure fraction in survival studies. The Stata Journal, 7(3), 351–375.
Lancaster T (1990) The Econometric Analysis of Transition Data. Cambridge University Press.
Latouche A, Allignol A, Beyersmann J, Labopin M, Fine J (2013) A competing risks analysis should
report results on all cause-specific hazards and cumulative incidence functions. Journal of
Clinical Epidemiology, 66(6), 648–653.
Lawless J (1980) Inference in the generalized gamma and log gamma distributions. Technometrics,
22(3), 409–419.
Lee K, Chakraborty S, Sun J (2015) Survival prediction and variable selection with simultaneous
shrinkage and grouping priors. Statistical Analysis and Data Mining, 8(2), 114–127.
Li K (1999) Bayesian analysis of duration models: An application to Chapter 11 bankruptcy. Economics
Letters, 63(3), 305–312.
Li M (2007) Bayesian proportional hazard analysis of the timing of high school dropout decisions.
Econometric Reviews, 26, 529–556.
Lopes H, Muller P, Ravishanker N (2007) Bayesian computational methods in biomedical research, in
Computational Methods in Biomedical Research, eds R Khattree, D Naik.
Manda S, Gilthorpe M, Tu Y, Blance A, Mayhew M (2005) A Bayesian analysis of amalgam restora-
tions in the Royal Air Force using the counting process approach with nested frailty effects.
Statistical Methods in Medical Research, 14, 567–578.
Marano G, Boracchi P, Biganzoli E (2016) Estimation of the piecewise exponential model by Bayesian
P-splines via Gibbs sampling: Robustness and reliability of posterior estimates. Open Journal of
Statistics, 6, 451–468.
Morris CN, Norton EC, Zhou XH (1994) Parametric duration analysis of nursing home usage,
pp 231–248, in Case Studies in Biometry, eds N Lange, L Ryan, L Billard, D Brillinger, L Conquest,
J Greenhouse. Wiley.
Mosler K (2003) Mixture models in econometric duration analysis. Applied Stochastic Models in
Business and Industry, 19, 91–104.
Murray T, Hobbs B, Sargent D, Carlin B (2016) Flexible Bayesian survival modeling with semipara-
metric time-dependent and shape-restricted covariate effects. Bayesian Analysis, 11(2), 381–402.
Muthen B, Masyn K (2005) Discrete-time survival mixture analysis. Journal of Educational and
Behavioral Statistics, 30, 27–58.
Namboodiri K, Suchindran C (1987) Life Table Techniques and Their Applications. Academic Press, New
York.
Nardi A, Schemper M (2003) Comparing Cox and parametric models in clinical studies. Statistics in
Medicine, 22(23), 3597–3610.
Neves C, Migon H (2007) Bayesian graduation of mortality rates: An application to reserve evalua-
tion. Insurance: Mathematics and Economics, 40, 424–434.
Omori Y (2003) Discrete duration model having autoregressive random effects with application to
Japanese diffusion index. Journal of the Japan Statistical Society, 33, 1–22.
Orbe J, Núñez-Antón V (2006) Alternative approaches to study lifetime data under different sce-
narios: From the PH to the modified semiparametric AFT model. Computational Statistics & Data
Analysis, 50, 1565–1582.
Perperoglou A, van Houwelingen H, Henderson R (2006) A relaxation of the Gamma frailty (Burr)
model. Statistics in Medicine, 25, 4253–4266.
Phadia E (2015) Prior Processes and Their Applications; Nonparametric Bayesian Estimation. Springer.
Pickles A, Crouchley R (1995) A comparison of frailty models for multivariate survival data. Statistics
in Medicine, 14, 1447–1461.
Pintilie M (2006) Competing Risks: A Practical Perspective. John Wiley, West Sussex, UK.
Putter H. (2018) Tutorial in biostatistics: Competing risks and multi-state models Analyses using
the mstate package. Leiden University Medical Center, Department of Medical Statistics and
Bioinformatics. https://cran.r-project.org/web/packages/mstate/vignettes/Tutorial.pdf
Richards H, Barry R (1998) U.S. Life tables for 1990 by sex, race, and education. Journal of Forensic
Economics, 11, 9–26.
Rivas-López M, López-Fidalgo J, Campo R (2014) Optimal experimental designs for accelerated fail-
ure time with Type I and random censoring. Biometrical Journal, 56(5), 819–837.
Roeder K, Wasserman L (1997) Practical Bayesian density estimation using mixtures of normals.
Journal of the American Statistical Association, 92(439), 894–902.
Sahu S, Dey D (2000) A comparison of frailty and other models for bivariate survival data. Lifetime
Data Analysis, 6, 207–228.
Sahu S, Dey D (2004) On a Bayesian multivariate survival model with skewed frailty, pp 321–338, in
Skew-Elliptical Distributions and Their Applications: A Journey Beyond Normality, eds M Genton.
CRC/Chapman & Hall, Boca Raton, FL.
Sahu S, Dey D, Aslanidou H, Sinha D (1997) A Weibull regression model with gamma frailties for
multivariate survival data. Lifetime Data Analysis, 3, 123–137.
Sargent DJ (1998) A general framework for random effects survival analysis in the Cox proportional
hazards setting. Biometrics, 54(4), 1486–1497.
Scheike T, Zhang M (2011) Analyzing competing risk data using the R timereg package. Journal of
Statistical Software, 38(2), i02.
Schmidt P, Witte A (1989) Predicting criminal recidivism using ‘split population’ survival time mod-
els. Journal of Econometrics, 40, 141–159.
Schoen R (2016) The continuing retreat of marriage: Figures from marital status life tables for United
States females, 2000–2005 and 2005–2010, pp 203–215, in Dynamic Demographic Analysis, ed R
Schoen. Springer.
Scrucca L, Santucci A, Aversa F (2010) Regression modeling of competing risk using R: An in depth
guide for clinicians. Bone Marrow Transplantation, 45(9), 1388–1395.
Sen A, Banerjee M, Li Y, Noone A (2010) A Bayesian approach to competing risks analysis with
masked cause of death. Statistics in Medicine, 29(16), 1681–1695.
Shao Q, Zhou X (2004) A new parametric model for survival data with long-term survivors. Statistics
in Medicine, 23, 3525–3543.
Singer JD, Willett JB (1993) It’s about time: Using discrete-time survival analysis to study duration
and the timing of events. Journal of Educational Statistics, 18(2), 155–195.
Sinha D (1993) Semiparametric Bayesian analysis of multiple event time data. Journal of the American
Statistical Association, 88, 979–983.
Sinha D, Chen M-H, Ghosh S (1999) Bayesian analysis and model selection for interval-censored
survival data. Biometrics, 55, 585–590.
Sinha D, Dey DK (1997) Semiparametric Bayesian analysis of survival data. Journal of the American
Statistical Association, 92(439), 1195–1212.
Sinha D, Patra K, Dey DK (2003) Modelling accelerated life test data by using a Bayesian approach.
Journal of the Royal Statistical Society: Series C (Applied Statistics), 52(2), 249–259.
Sohn Y, Chang I, Moon T (2007) Random effects Weibull regression model for occupational lifetime.
European Journal of Operational Research, 179, 124–131.
Stacy EW (1962) A generalization of the gamma distribution. The Annals of Mathematical Statistics,
33(3), 1187–1192.
Swindell W (2009) Accelerated failure time models provide a useful statistical framework for aging
research. Experimental Gerontology, 44(3), 190–200.
Thamrin S, McGree J, Mengersen K (2013) Bayesian Weibull survival model for gene expression
data, Chapter 10, pp 171–185, in Case Studies in Bayesian Statistical Modelling and Analysis, eds C
Alston, K Mengersen, A Pettitt. Wiley.
Tosch TJ, Holmes PT (1980) A bivariate failure model. Journal of the American Statistical Association,
75(370), 415–417.
Tsodikov A, Ibrahim J, Yakovlev A (2003) Estimating cure rates from survival data: An alternative
to two-component mixture models. Journal of the American Statistical Association, 98, 1063–1078.
Umlauf N, Klein N, Zeileis A (2018) BAMLSS: Bayesian additive models for location, scale, and shape
(and beyond). Journal of Computational and Graphical Statistics, 27(3), 612–627.
Van den Berg G (2001) Duration models: Specification, identification, and multiple durations, in
Handbook of Econometrics 5, eds J Heckman, E Leamer. North Holland, Amsterdam, Netherlands.
Viswanathan B, Manatunga A (2001) Diagnostic plots for assessing the frailty distribution in multi-
variate survival data. Lifetime Data Analysis, 7, 143–155.
Walker S, Mallick B (1999) A Bayesian semiparametric accelerated failure time model. Biometrics, 55,
477–483.
Watson T, Christian C, Mason A, Smith M, Meyer R (2002) Bayesian-based decision support sys-
tem for water distribution systems. 5th International Conference on Hydroinformatics, Cardiff
University, UK.
Wei L (1992) The accelerated failure time model: A useful alternative to the Cox regression model in
survival analysis. Statistics in Medicine, 11, 1871–1879.
Wienke A (2010) Frailty Models in Survival Analysis. Chapman and Hall/CRC.
Winkelmann R, Boes S (2005) Analysis of Microdata. Springer-Verlag.
Wong O (1977) A competing-risk model based on the life table procedure in epidemiologic studies.
International Journal of Epidemiology, 6, 153–159.
Yashin A, Iachine I, Begun A, Vaupel J (2001) Hidden frailty: Myths and reality. Research Report 34,
Department of Statistics and Demography, SDU - Odense University.
Yau KK (2001) Multilevel models for survival analysis with random effects. Biometrics, 57(1), 96–102.
Yin G (2005) Bayesian cure rate frailty models with application to a root canal therapy study.
Biometrics, 61, 552–558.
Yin G, Ibrahim J (2005) A class of Bayesian shared gamma frailty models with multivariate failure
time data. Biometrics, 61, 208–216.
Yin G, Ibrahim J (2006) Bayesian transformation hazard models, pp 170–182, in IMS Monograph Series,
Vol. 49. Institute of Mathematical Statistics.
Zhang Z, Sinha S, Maiti T, Shipp E (2018) Bayesian variable selection in the accelerated failure time
model with an application to the surveillance, epidemiology, and end results breast cancer
data. Statistical Methods in Medical Research, 27(4), 971–990.
Zhou H, Hanson T, Zhang J (2017) Generalized accelerated failure time spatial frailty model for arbi-
trarily censored data. Lifetime Data Analysis, 23(3), 495–515.
Zhou H, Hanson T, Zhang J (2018) spBayesSurv: Fitting Bayesian spatial survival models using R.
https://arxiv.org/abs/1705.04584
12
Hierarchical Methods for Nonlinear
and Quantile Regression
12.1 Introduction
Standard versions of the normal linear model and general linear models assume additive
and linear predictor effects in the regression mean, and a constant variance. While lin-
ear regression effects are often suitable, nonlinear predictor effects are common in areas
as diverse as economics, hydrology (Qian et al., 2005), and epidemiology (Natario and
Knorr-Held, 2003). In some applications, there may be a theoretical basis for a particular
form of nonlinearity, though some elements of specification will be uncertain – see Borsuk
and Stow (2000) on biochemical oxygen demand, and Meyer and Millar (1998) on mod-
els of fishery stock. In other situations, the form of nonlinearity is unknown and to be
assessed from the data – hence the term “non-parametric”, since a particular form for the
mean function is not assumed. Bayesian application of non-parametric smooth regression
is facilitated by R libraries such as jagam (Wood, 2016) (www.rdocumentation.org/packages/
mgcv/versions/1.8-17/topics/jagam), bamlss (Umlauf et al., 2016; https://rdrr.io/rforge/
bamlss/), gammSlice (Pham and Wand, 2015), stan_gamm4 within rstanarm (https://cran.
rstudio.com/web/packages/rstanarm/index.html), and spikeSlabGAM (Scheipl, 2011).
In many applications, a nonlinear effect is present, or suspected, in only a subset of pre-
dictors, leading to partially linear models or semiparametric regression models. Consider
outcomes $\{y_i, i = 1, \ldots, n\}$ from an exponential family density
$$p(y_i \mid \theta_i, \phi) = \exp\left( \frac{y_i \theta_i - a(\theta_i)}{\phi} + c(y_i, \phi) \right),$$
with $E(y_i) = \mu_i = a'(\theta_i)$, and link $g(\mu_i) = \eta_i$ to a regression term $\eta_i$. Suppose it is intended that
R metric predictors $W_i = (w_{1i}, w_{2i}, \ldots, w_{Ri})$ be modelled non-parametrically via unknown
smooth functions $S_r(w_{ri})$, then
$$g(\mu_i) = \eta_i = \alpha + X_i \beta + S_1(w_{1i}) + \ldots + S_R(w_{Ri}) + u_i,$$
$$u_i \sim N(0, \sigma^2).$$
For instance, Engle et al. (1986) analyse the relationship between temperature and monthly
electricity sales (y metric and u normal, and with g an identity link) for four US cities.
The impact of electricity price, month (11 dummy variables), and income is modelled
$$g(\mu_i) = \alpha + S(w_i) + u_i = \alpha + \sum_{k=1}^{K} b_k (w_i - \kappa_k)_+^q + u_i, \tag{12.1}$$
$$u_i \sim N(0, \sigma^2),$$
where q is a known positive integer, and the κk are knots placed within the range [wmin,wmax]
of w. In (12.1), the piecewise polynomials are fitted in each interval [κk,κk+1) and preferably
join smoothly at each knot (e.g. this applies for a cubic spline, as it has continuous 1st and
2nd derivatives at each knot).
An alternative spline specification (e.g. Meyer, 2005; Tutz and Reithinger, 2007,
p.2877) matches the degree q of the truncated function $T(w_i) = \sum_{k=1}^{K} b_k (w_i - \kappa_k)_+^q$ by a
standard polynomial of order q, namely $Q(w_i) = \beta_1 w_i + \ldots + \beta_q w_i^q$. So, the total smooth is
$S(w_i) = Q(w_i) + T(w_i)$, and one has
$$g(\mu_i) = \alpha + \beta_1 w_i + \ldots + \beta_q w_i^q + \sum_{k=1}^{K} b_k (w_i - \kappa_k)_+^q + u_i. \tag{12.2}$$
Values q = 1, 2, or 3 are most typical, with q = 1 often being suitable for reproducing a smooth
function given a large enough set of knots (Ruppert et al., 2003, p.68), but also capable of
reproducing abrupt changes in the underlying function (Denison et al., 2002, p.52).
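The truncated power terms in (12.1) and (12.2) are simple to construct directly. The following is a minimal sketch in R, not the book's own code: the helper name `trunc_power_basis`, the simulated predictor, and the choice of decile knots are all illustrative assumptions.

```r
# Truncated power basis: column k holds (w_i - kappa_k)_+^q
trunc_power_basis <- function(w, knots, q = 1) {
  # outer() gives the n x K matrix of differences w_i - kappa_k;
  # pmax(., 0) keeps the positive part before raising to the power q
  pmax(outer(w, knots, "-"), 0)^q
}

set.seed(1)
w <- sort(runif(50))                                   # simulated predictor (illustration only)
knots <- quantile(w, probs = seq(0.1, 0.9, by = 0.1))  # K = 9 knots at the deciles
Z <- trunc_power_basis(w, knots, q = 1)                # T(w) columns
Q <- cbind(w)                                          # Q(w) columns for q = 1
dim(Z)
```

Each column of `Z` is zero to the left of its knot and grows as a degree q polynomial to the right, which is what lets a linear combination of columns change slope (q = 1) or curvature (q > 1) at each knot.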
The knots in (12.1) and (12.2) may be known or unknown. If known, then they are typically
far fewer in number than the sample size. They could be sited at percentile points
(e.g. deciles) of w, or possibly placed more densely at points where the function is known
to be rapidly changing and less densely elsewhere. Choosing too few knots can result in
oversmoothing, and choosing too many in overfitting – see the LIDAR data examples dis-
cussed by Ruppert et al. (2003, p.63). Coull et al. (2001, p.540) suggest the allocation of one
knot for every four to five observations, up to a maximum of about 40 knots. Yau and Kohn
(2003) suggest fitting a model with a small number of knots first and gradually increasing
their number until estimates and fit stabilise. An alternative procedure known as smooth-
ing splines places a knot at every observed distinct predictor value (Berry et al., 2002; Dias
and Gamerman, 2002). The most general model averaging approach takes both the number
of knots and their sitings as unknowns, while both Denison et al. (1998) and Biller (2000)
assume a large number of potential, but prespecified, candidate knot locations. If knots are
taken to have unknown locations within [wmin,wmax], identification may rely on order constraints
such as $\kappa_k > \kappa_{k-1}$, and analysis resembles time series with multiple change points.
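For preset knots, the rule of thumb quoted above (one knot per four to five observations, capped at about 40) can be sketched as follows. This is only an illustration: the function name `choose_knots` is hypothetical, and counting distinct predictor values rather than raw observations is an added assumption made here to avoid duplicated knots.

```r
# Number and siting of preset knots: one knot per `per_knot` distinct values,
# capped at `max_knots`, sited at equally spaced interior quantiles.
choose_knots <- function(w, per_knot = 4, max_knots = 40) {
  w_unique <- unique(w)   # assumption: count distinct values, not all observations
  K <- max(1, min(floor(length(w_unique) / per_knot), max_knots))
  probs <- seq_len(K) / (K + 1)   # interior quantiles, so no knot sits on the boundary
  unname(quantile(w_unique, probs = probs))
}

w_small <- seq(0, 1, length.out = 100)
w_large <- seq(0, 1, length.out = 1000)
length(choose_knots(w_small))   # 25 knots
length(choose_knots(w_large))   # capped at 40 knots
```

Both the divisor and the cap are tuning choices; the stepwise strategy of Yau and Kohn (2003) would instead refit with an increasing number of knots until the estimates stabilise.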
If the bk in (12.1)–(12.3) are modelled as fixed effects, predictor coefficient selection is open
as a way of achieving model parsimony, and is especially indicated under the smoothing
spline method (Smith and Kohn, 1996). With a large number of preset potential knot sitings,
predictor selection involves obtaining posterior probabilities $\Pr(\delta_{jk} = 1 \mid y)$ on binary
indicator variables δ1k (k = 1, …, q) for retaining coefficients in the Q(w) component, and
δ2k (k = 1, …, K) in the T(w) component. One then has
$$g(\mu_i) = \alpha + \delta_{11} \beta_1 w_i + \ldots + \delta_{1q} \beta_q w_i^q + \sum_{k=1}^{K} \delta_{2k} b_k (w_i - \kappa_k)_+^q + u_i, \tag{12.3}$$
$$g(\mu_i) = \alpha + S(w_i) = \alpha + \sum_{k=1}^{K} b_k (w_i - \kappa_k)_+^q + u_i, \tag{12.4}$$
$$g(\mu_i) = \alpha + \beta_1 w_i + \ldots + \beta_q w_i^q + \sum_{k=1}^{K} b_k (w_i - \kappa_k)_+^q + u_i, \tag{12.5}$$
where the $b_k$ are a collection of random parameters from a common density with unknown
hyperparameters.
Possible priors for the random $b_k$ include an unstructured normal, $b_k \sim N(0, \phi)$ (Ruppert et al., 2003),
which, by comparison with a fixed effects prior, imposes a restriction on the $b_k$ when $\phi < \infty$,
and tends to shrink the $b_k$, leading to a smooth fit (Wand, 2003). A standard approach (e.g.
Lang and Brezger, 2004) assumes $\phi \sim IG(g, h)$ with g = 1 and h small (e.g. 0.001, 0.0001, or
0.00001), though there may be sensitivity to the value of h.
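To illustrate equivalence to the broader class of mixed models, define design matrices
$$W = [1, w_i, \ldots, w_i^q]_{1 \le i \le n}, \qquad Z = [(w_i - \kappa_k)_+^q]_{1 \le k \le K,\ 1 \le i \le n},$$
so that, with W and Z as the fixed and random effects design matrices,
$$g(\mu) = W\beta + Zb + u,$$
$$\begin{bmatrix} b \\ u \end{bmatrix} \sim N\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} \phi I & 0 \\ 0 & \sigma^2 I \end{bmatrix} \right).$$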
Alternative schemes for $b_k$ include a random walk penalty (Eilers and Marx, 1996), such as
$\Delta^d b_k \sim N(0, \phi)$. For instance, taking d = 1 gives $b_k - b_{k-1} \sim N(0, \phi)$, that is, $b_k \sim N(b_{k-1}, \phi)$.
Another option provides monotonic smooths, in applications where such smooths have
a substantive rationale (Brezger and Steiner, 2008), and stipulates $b_k \sim N(0, \phi)$, but
subject to
$$b_k \ge b_{k-1}, \quad k = 2, \ldots, K,$$
for an increasing function $T(w_i) = \sum_{k=1}^{K} b_k (w_i - \kappa_k)_+^q$, or $b_k \le b_{k-1}$ for a decreasing function.
The function S(wi) resulting from a fixed effects prior on {bk} in (12.1)–(12.3) may be quite
rough, due to the large number of truncated polynomials being fitted, whereas the shrink-
age prior under the mixed model approach tends to penalise large coefficients and lead to
a smoother fit (Yau et al., 2003; Ngo and Wand, 2004; Meyer, 2005). Under an unstructured
prior $b_k \sim N(0, \phi)$, and smoothing or penalty parameter λ, the mode of the posterior density
of {β,b,ϕ} is the same as that obtained by maximising a penalised likelihood
$$PL = \log[P(y \mid \beta, b, \phi)] - \lambda \sum_{k=1}^{K} b_k^2,$$
where the form of λ (in terms of variance parameters) depends on whether or not there
is an unstructured residual term $u_i$ in the regression model. For a metric outcome and
$u_i \sim N(0, \sigma^2)$, one has λ = σ2/ϕ (Fahrmeir and Knorr-Held, 2000). This penalised likelihood
is analogous to “ridge” penalties sometimes used with correlated predictors (Eilers and
Marx, 2004). For random walk priors of order d, one has (Lang and Brezger, 2004)
$$PL = \log[P(y \mid \beta, b, \phi)] - \lambda \sum_{k=d+1}^{K} (\Delta^d b_k)^2.$$
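For a metric outcome with an identity link, maximising a penalised likelihood of this kind amounts to penalised least squares: the residual sum of squares plus λ times the penalty on the spline coefficients, with the polynomial coefficients left unpenalised. The sketch below illustrates both the ridge form (d = 0) and the dth order difference form. The function names, simulated data, knot count and λ values are all hypothetical choices, not taken from the text.

```r
# Truncated linear basis (q = 1)
trunc_basis <- function(w, knots, q = 1) pmax(outer(w, knots, "-"), 0)^q

# Penalised least squares: minimise ||y - W beta - Z b||^2 + lambda * b' D'D b
pen_fit <- function(y, W, Z, lambda, d = 0) {
  C <- cbind(W, Z)
  K <- ncol(Z)
  # d = 0: ridge penalty on the b_k; d >= 1: penalty on d-th differences of the b_k
  D <- if (d == 0) diag(K) else diff(diag(K), differences = d)
  P <- matrix(0, ncol(C), ncol(C))
  idx <- ncol(W) + seq_len(K)
  P[idx, idx] <- crossprod(D)          # only the spline coefficients are penalised
  coef <- solve(crossprod(C) + lambda * P, crossprod(C, y))
  list(coef = drop(coef), fitted = drop(C %*% coef))
}

set.seed(10)
n <- 100
w <- sort(runif(n))
y <- sin(2 * pi * w) + rnorm(n, sd = 0.3)   # simulated data, illustration only
knots <- quantile(w, probs = (1:20) / 21)
W <- cbind(1, w)
Z <- trunc_basis(w, knots)

fit_rough  <- pen_fit(y, W, Z, lambda = 1e-4)       # close to a fixed effects fit
fit_smooth <- pen_fit(y, W, Z, lambda = 1)          # e.g. sigma^2 / phi = 1
fit_rw1    <- pen_fit(y, W, Z, lambda = 1, d = 1)   # first order difference penalty
b_rough  <- fit_rough$coef[-(1:2)]
b_smooth <- fit_smooth$coef[-(1:2)]
c(sum(b_rough^2), sum(b_smooth^2))
```

Increasing λ (a smaller ϕ relative to σ2) shrinks the sum of squared spline coefficients, which is the mechanism behind the smoother fit under the random effects prior.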
of which the thin-plate spline (Kohn et al., 2001; Koop and Tole, 2004)
is a special case. These are examples of functions which are radially symmetric around
knots κk, such that the value of the function at wi depends only on the distance between wi
and the knot location. They have the form $H(u) = H(\lVert w - \kappa_k \rVert)$, where $\lVert v \rVert = \sqrt{v'v}$ is the length
of the vector v. Other types of radial basis include Gaussian functions (Konishi et al., 2004)
with
$$H_k(w_i) = \exp\left( -\frac{\lVert w_i - \kappa_k \rVert}{2 \nu h_k} \right),$$
where ν is the same over different knots. As for truncated splines, smoothing based on
radial basis functions may include a parametric polynomial term to degree q to match the
degree of the radial function, for example, with q = 1
$$g(\mu_i) = \alpha + \beta_1 w_i + \sum_{k=1}^{K} b_k \lvert w_i - \kappa_k \rvert + u_i.$$
Both radial and truncated power splines may be ill-conditioned in terms of broader regres-
sion considerations (Eilers and Marx, 2004). An alternative basis less prone to ill-condition-
ing is provided by B-splines, with health mapping applications exemplified by Silva et al.
(2008) and MacNab and Gustafson (2007). B-splines are defined to be non-zero for at most
q + 2 interior knots for a qth degree B-spline (also called a B-spline of order q + 1), which
means the condition number of the design matrix product is relatively low (Eilers and
Marx, 1996, p.90; Biller, 2000; Denison et al., 2002, p.75). A B-spline of degree q consists
of q + 1 polynomial pieces of degree q and overlaps with 2q of its neighbours. For K knots,
and so K + 1 intervals, in the domain [wmin, wmax] of a predictor, there will be K* = K + 1 + q
B-spline schedules, because extra knots are placed outside the domain of w to get q over-
lapping B-splines in each interval.
Let $B_k(w_i, q)$ be the value at $w_i$ of the kth B-spline of degree q, with $k = 1, \ldots, K^*$. Successive
B-spline values are defined by the recursion
$$B_k(w_i, q) = \frac{w_i - \kappa_k}{\kappa_{k+q} - \kappa_k} B_k(w_i, q-1) + \frac{\kappa_{k+q+1} - w_i}{\kappa_{k+q+1} - \kappa_{k+1}} B_{k+1}(w_i, q-1).$$
The initial terms in the recursion are simply binary indicators defining a partition of the
w values. For equally spaced knots, a simplified B-spline recursion applies involving dif-
ferences in truncated power splines (Eilers and Marx, 2004). B-spline bases for T(w) can
be combined with random or fixed effects priors for the spline coefficients. For example,
random bk in an analysis with a single predictor wi leads to
$$g(\mu_i) = \alpha + \beta_1 w_i + \ldots + \beta_q w_i^q + \sum_{k=1}^{K^*} b_k B_k(w_i, q) + u_i.$$
In particular, Eilers and Marx (1996) combine a B-spline basis with a penalty on dth order
differences in adjacent bk coefficients. As mentioned above, difference penalties can be
achieved by random walk priors under a Bayesian approach (e.g. a second order random
walk prior if d = 2).
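The B-spline recursion can be coded directly and checked against the `splineDesign` function in the splines package. The sketch below is illustrative only. It assumes equally spaced knots on [0, 1] with q extra knots beyond each boundary, so that K = 9 interior knots give K* = K + 1 + q = 13 cubic B-splines. The function name `bspline_basis` is hypothetical.

```r
library(splines)

# B-spline basis by recursion on the degree, for an extended knot vector kn
bspline_basis <- function(w, kn, q) {
  m <- length(kn)
  # degree 0: indicators of the half-open intervals [kn[k], kn[k + 1])
  B <- sapply(seq_len(m - 1), function(k) as.numeric(w >= kn[k] & w < kn[k + 1]))
  B <- matrix(B, nrow = length(w))
  for (deg in seq_len(q)) {
    Bnew <- matrix(0, length(w), m - deg - 1)
    for (k in seq_len(m - deg - 1)) {
      left  <- (w - kn[k]) / (kn[k + deg] - kn[k])
      right <- (kn[k + deg + 1] - w) / (kn[k + deg + 1] - kn[k + 1])
      Bnew[, k] <- left * B[, k] + right * B[, k + 1]
    }
    B <- Bnew
  }
  B
}

q <- 3
K <- 9
h <- 1 / (K + 1)
kn <- (-q:(K + 1 + q)) * h              # q extra knots outside each end of [0, 1]
w <- seq(0.005, 0.995, length.out = 50)
B <- bspline_basis(w, kn, q)
B_ref <- splineDesign(knots = kn, x = w, ord = q + 1)
ncol(B)                                  # K + 1 + q = 13
max(abs(B - B_ref))                      # agreement with splines::splineDesign

# Second order difference matrix, as used for a P-spline penalty with d = 2
D2 <- diff(diag(ncol(B)), differences = 2)
```

Within the domain the basis functions sum to one at every w, and each row has at most q + 1 non-zero entries, which is the source of the good conditioning noted above.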
Relatively small numbers of knots may be needed to provide an effective smooth, as may
be illustrated by drawing on the Stan case study of Kharratzadeh (2017), with B-spline schedules
defined either by a function, or by using the package splines. Consider the Boston data
set (in the R MASS package), and predicting median house values based on the percentage
of lower status of the population (lstat) (Figure 12.1). A B-spline of degree 3 is used, and a
random walk prior on the coefficients $\{b_k\}$ of the B-spline basis, with
$$b_1 \sim N(0, 5),$$
$$b_k \sim N(b_{k-1}, \tau_b),$$
$$\tau_b \sim C^{+}(0, 5).$$
There are 506 observations. For K = 10 knots located at corresponding quantiles of lstat, we
find a LOO-IC (leave-one-out information criterion) of 3117 (Figure 12.2), while K = 5 knots
also gives a LOO-IC of 3117. By contrast, a larger number of knots, K = 20, shows evidence
of undersmoothing (overfitting) with a LOO-IC of 3122.
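The basis for this example can be set up with the splines package. The sketch below is not the Stan model described above: it builds the cubic B-spline basis for K = 10 quantile-based knots and, as a stand-in for the random walk prior, uses penalised least squares with a first order difference penalty. The value of `lambda` is a hypothetical tuning constant, whereas the Bayesian analysis would estimate the corresponding variance.

```r
library(splines)
data(Boston, package = "MASS")

K <- 10
# interior knots at equally spaced quantiles of the predictor
knots <- quantile(Boston$lstat, probs = seq_len(K) / (K + 1))
B <- bs(Boston$lstat, knots = knots, degree = 3, intercept = TRUE,
        Boundary.knots = range(Boston$lstat))
dim(B)   # n rows and K + degree + 1 columns

# Stand-in for the random walk prior: first difference penalty on adjacent coefficients
D1 <- diff(diag(ncol(B)), differences = 1)
lambda <- 1   # hypothetical smoothing parameter
b_hat <- solve(crossprod(B) + lambda * crossprod(D1), crossprod(B, Boston$medv))
fit <- drop(B %*% b_hat)
```

Plotting `fit` against `Boston$lstat` gives a curve of the kind shown in Figure 12.2. Repeating the exercise with K = 5 and K = 20 shows how the number of knots interacts with the penalty.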
FIGURE 12.1
Median house values and status (House Value plotted against Lower Status).
FIGURE 12.2
Smooth for K = 10 knots (Median House Value plotted against Lower Status).
Bayesian application of spectral basis functions is discussed by Lenk (1999), Fahrmeir
and Tutz (2001, Chapter 5), and Kitagawa and Gersch (1996). Here the smooth may be represented
by the series
$$T(w_i) = \sum_{k=1}^{\infty} b_k H_k(w_i),$$
with $H_k$ including sine and/or cosine terms. Setting $z_i = (w_i - w_{\min})/(w_{\max} - w_{\min})$ and including only
cosine terms in $H_k$, as in Lenk (1999), gives
$$H_k(w_i) = \left( \frac{2}{w_{\max} - w_{\min}} \right)^{0.5} \cos(\pi k z_i).$$
Since a smooth T will not have high frequency components, a natural prior on the bk
expresses decay as k increases (penalises terms at higher k values) as in
where ck can be taken as a known increasing function of k, and δ determines the rate of
decay of the Fourier coefficients. Possibilities are ck = log(k) with δ > 1, and ck = k with δ > 0.
An alternative is a power function such as
$$b_k \sim N(0, \phi d^k),$$
where $d \in (0, 1)$. For practical application, the Fourier series is truncated above at K, namely
$$T(w_i) = \sum_{k=1}^{K} b_k H_k(w_i),$$
where K can be regarded as another parameter (cf. Ruppert et al., 2003, p.86).
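As an illustration of the truncated cosine series, the sketch below constructs the cosine basis on a grid and draws one set of coefficients from the power decay prior with variances ϕd^k. The function name `cosine_basis` and the values of K, ϕ and d are hypothetical choices made only for the example.

```r
# Cosine basis H_k(w) = sqrt(2 / (w_max - w_min)) * cos(pi * k * z),
# with z = (w - w_min) / (w_max - w_min)
cosine_basis <- function(w, K, w_min = min(w), w_max = max(w)) {
  z <- (w - w_min) / (w_max - w_min)
  sqrt(2 / (w_max - w_min)) * cos(pi * outer(z, seq_len(K)))
}

set.seed(3)
K <- 15      # truncation point (hypothetical)
phi <- 1     # prior variance scale (hypothetical)
d <- 0.7     # decay rate in (0, 1) (hypothetical)
w <- seq(0, 10, length.out = 200)
H <- cosine_basis(w, K)

prior_var <- phi * d^seq_len(K)             # variances shrink as k increases
b <- rnorm(K, mean = 0, sd = sqrt(prior_var))
T_w <- drop(H %*% b)                        # one prior draw of the smooth T(w)
```

Because the prior variances decay geometrically, high frequency terms contribute little, so draws of `T_w` are smooth. The scaling constant makes the basis functions orthonormal over [wmin, wmax].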
12.2.3 Model Selection
Non-parametric regressions are often heavily parameterised and parameter redundancies
are likely, indicating that selection among predictor effects, including smooths, is neces-
sary (Yau et al., 2003; Belitz and Lang, 2008; Panagiotelis and Smith, 2008; Wood, 2008;
Marra and Wood, 2011; Banerjee and Ghosal, 2014; Gelman et al., 2014). Smooth selection
may be approached using binary indicators Jr (Yau et al., 2003), combined with conventional
selection for fixed effect predictor terms. Assume the framework in (12.1). Then take
r = 1, …, R predictor effects as smooths $S_r(w_{ri}) = T_r(w_{ri})$, where $T_r(w_{ri}) = \sum_{j=1}^{K_r} b_{rj} Z_j(w_{ri})$, and
$Z_j(w_{ri})$ generically denotes a polynomial or B-spline. With binary selectors γj for fixed effect
predictors Xi of dimension p, the regression term would be
$$g(\mu_i) = \alpha + \gamma_1 \beta_1 x_{1i} + \ldots + \gamma_p \beta_p x_{pi} + J_1 T_1(w_{1i}) + J_2 T_2(w_{2i}) + \ldots + J_R T_R(w_{Ri}) + u_i.$$
Alternatively (e.g. Cottet et al., 2008), one may have both linear and smooth terms for each
wri. Numerical performance may be improved by scaling both the X and W predictors,
e.g. standardisation or transformation to the [0,1] interval (Cottet et al., 2008; Scheipl et al.,
2012).
Selection for retention (Jr = 1) is influenced by the degree of informativeness of the prior
adopted for the variances (ϕ1, …, ϕR) of $(b_{1k}, \ldots, b_{Rk})$. Flat priors will tend to lead to low posterior
probabilities $\Pr(J_r = 1 \mid y)$ for retaining random components. One option is to undertake initial
runs with diffuse priors to develop an informative data-based prior (Shively et al., 1999).
Related approaches include hierarchical priors (e.g. log-normal) on $\phi_r$ (Cottet et al., 2008;
Panagiotelis and Smith, 2008) as in
$$\log(\phi_r) \sim N(g_r, h_r),$$
$$g_r \sim N(0, 100),$$
$$h_r \sim IG(101, 10100),$$
where the γr are scaling factors, ε is a predefined small constant, and the mean m is set to −1
or 1 with equal probability. So for hr = 0 the random effects are effectively excluded, since
their variance is near zero. Assuming $\phi_r \sim IG(a_\phi, b_\phi)$, then $\{a_\phi, b_\phi\}$ may be set by default, or
set in line with subject matter considerations. Default settings of ε = 0.00025, aϕ = 5, and
bϕ = 25 are proposed by Scheipl et al. (2012, p.1525).
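# Read the data and make its columns (e.g. chol) visible by name
D <- read.table("betacarotene.txt", header = TRUE)
attach(D)
library(splines)
# Interior knots at the 5%, 10%, ..., 95% quantiles of cholesterol
knots <- quantile(chol, probs = seq(0.05, 0.95, 0.05))
kap.chol <- as.vector(knots)
# Cubic B-spline basis with an intercept column and boundary knots at the data range
bs.chol <- bs(chol, knots = kap.chol, degree = 3, intercept = TRUE,
              Boundary.knots = range(chol))
# Drop the "bs" class and attributes, leaving a plain numeric matrix
bs.chol <- matrix(as.numeric(bs.chol), nrow = nrow(bs.chol))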
where
$$u_i \sim N(0, \sigma^2),$$
$$b_{rk} \sim N(0, \phi_r).$$
$$y_i = \beta_0 + \beta_1 x_{1i} + \ldots + \beta_p x_{pi} + \sum_{k=1}^{K^*} b_{1k} B_{1k} + \sum_{k=1}^{K^*} b_{2k} B_{2k} + u_i,$$
and assumes a 1st order random walk for $b_{rk}$, penalising first differences in the $b_{rk}$. For
identification, a corner rather than centring constraint is used,
namely $b_{r1} = 0$. For precisions $1/\sigma^2$ and $\theta_r = 1/\phi_r$, gamma Ga(1,0.001) priors are assumed.
Applying jagsUI to the linear spline and B-spline models (models 1 and 2) shows similar
smooths in AGE and CHOL. Fit values favour the linear spline: a LOO-IC of 673,
as against 695 for the B-spline, though both fit values have large SE values. However,
convergence is achieved much earlier using the B-spline method, and effective sample
sizes are larger.
A third analysis involves predictor and smooth selection in the B-spline model, as in
$$y_i = \beta_0 + \gamma_1 \beta_1 x_{1i} + \ldots + \gamma_p \beta_p x_{pi} + J_1 \sum_{k=1}^{K^*} b_{1k} B_{1k} + J_2 \sum_{k=1}^{K^*} b_{2k} B_{2k} + u_i.$$
In this application, all predictors are standardised and the B-spline coefficients accord-
ingly revised. Diffuse priors for βj and brk are likely to lead to an overly parsimonious
model, with few predictors retained. Instead, for the βj, normal N(0,1) priors are adopted
(McElreath, 2016). A data-based prior, based on posterior inferences from model 2, is
adopted for the precisions $\theta_r = 1/\phi_r$ in the prior on the $b_{rk}$ terms. Specifically, gamma
priors Ga(0.265,0.0007) and Ga(0.6815,0.0014) correspond to the posterior mean and variance
of θr in model 2. In model 3, a four-point discrete prior for the θr for CHOL and
AGE is accordingly adopted, using values set by the four quintiles of Ga(0.265,0.0007)
and Ga(0.6815,0.0014) densities. The 13 retention indicators, {γj, Jr}, are assigned Bernoulli
priors, with the prior probability ω being an overarching complexity hyperparameter,
with prior ω ~ Be(1,1).
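The construction of the data-based prior can be sketched in R. The code assumes that Ga(a, b) denotes a gamma density with shape a and rate b (mean a/b, variance a/b²), as in JAGS, and that the four support points are the 20%, 40%, 60% and 80% points of each gamma density. The function names are hypothetical, and the posterior moments from model 2 are not reproduced here, only the resulting gamma parameters quoted above.

```r
# Moment matching: a Ga(a, b) density (shape a, rate b) has mean a/b and variance a/b^2,
# so a = m^2 / v and b = m / v for a target mean m and variance v
gamma_from_moments <- function(m, v) c(shape = m^2 / v, rate = m / v)

# Support of a four-point discrete prior at the quintile points of the fitted gamma
discrete_support <- function(shape, rate, probs = c(0.2, 0.4, 0.6, 0.8)) {
  qgamma(probs, shape = shape, rate = rate)
}

theta_1 <- discrete_support(0.265, 0.0007)
theta_2 <- discrete_support(0.6815, 0.0014)

# One prior draw of the 13 retention indicators under omega ~ Be(1, 1)
set.seed(4)
omega <- rbeta(1, 1, 1)
indicators <- rbinom(13, size = 1, prob = omega)
```

Restricting each precision to a few plausible values keeps the selection step from being dominated by a diffuse prior, which is the concern raised above about flat priors.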
From a two-chain run of 10,000 iterations, vitamin use (vituse) and BMI among the X
predictors have posterior retention probabilities of 0.88 and 1.00 respectively (the indi-
cators J[1] and J[3] in the code), but otherwise retention probabilities are below 0.7. CHOL
and AGE have respective retention probabilities of 1 and 0.99 (J[12] and J[13] in the code).
Figure 12.3 shows the corresponding smooths. Retention of both smooth terms is also
reported by Liu et al. (2011).
A fourth analysis also involves selection, but using a partial adaptation of Scheipl
et al. (2012). Thus, it is assumed that
FIGURE 12.3
(a) Beta-carotene smooth in CHOL. (b) Beta-carotene smooth in AGE. (Each panel plots Predicted Y against the predictor, showing the posterior mean with 5% and 95% credible interval limits.)
with ε = 0.00025, aϕ = 5, and bϕ = 25. Alternative settings $(a_\phi, b_\phi) = (5, 50)$ and $(a_\phi, b_\phi) = (10, 30)$
were also investigated. All three settings gave retention probabilities of 1 for the smooths
in both AGE and CHOL (J[12] and J[13] in the code). The first setting gives a LOO-IC of
678. Vitamin use (vituse) has posterior retention probabilities between 0.90 and 0.95,
while BMI has a retention rate of 1.00 for all three settings. Retention probabilities for
other predictors are below 0.75.
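$$\begin{aligned} g(\mu_i) = \alpha &+ \sum_{r=1}^{R} \sum_{k=1}^{K_r} b_{rk} (w_{ri} - \kappa_{rk})_+^q + \sum_{r \ne s}^{R} \sum_{k=1}^{K_r} \sum_{l=1}^{K_s} c_{rs,kl} (w_{ri} - \kappa_{rk})_+^q (w_{si} - \kappa_{sl})_+^q \\ &+ \sum_{r \ne s \ne t}^{R} \sum_{k=1}^{K_r} \sum_{l=1}^{K_s} \sum_{m=1}^{K_t} d_{rst,klm} (w_{ri} - \kappa_{rk})_+^q (w_{si} - \kappa_{sl})_+^q (w_{ti} - \kappa_{tm})_+^q + \ldots + u_i, \end{aligned}$$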
where $K_r$ is the number of knots for predictor $w_r$. There may be R main effects, $\binom{R}{2}$ second-order
interactions, $\binom{R}{3}$ third-order interactions and so on, with the associated parameters
{b,c,d,…} having dimension determined by the number of knots in Kr, {Kr,Ks}, {Kr,Ks,Kt}, etc.
Higher order interactions may be excluded, even if definable in principle, as an acceptable
smooth may often be obtained by restricting attention to main effects and low order
interactions. So, a model with main and second order effects only would have $R + \binom{R}{2}$
parameter sets. Gustafson (2000) considers a BWISE approximation to smooth functions
involving main effects $S_1(w_1), \ldots, S_R(w_R)$, and second-order interactions only, namely,
As an example, consider a tensor product of truncated polynomials with q = 1, and R = 3,
so that $w_i = (w_{1i}, w_{2i}, w_{3i})$ and with $K_1 = K_2 = K_3 = 5$ knots. Also just consider linear step
functions $(w - \kappa)_+$. Then there may be R = 3 main effects, $\binom{R}{2} = 3$ second-order interactions,
and $\binom{R}{3} = 1$ third-order interaction. In a model confined to main effects and second-order
interactions, the main effects would be terms $\sum_{k=1}^{K_1} b_{1k} (w_{1i} - \kappa_{1k})_+$, $\sum_{k=1}^{K_2} b_{2k} (w_{2i} - \kappa_{2k})_+$,
and $\sum_{k=1}^{K_3} b_{3k} (w_{3i} - \kappa_{3k})_+$, involving 15 parameters. The second-order interactions
would be terms $\sum_{k=1}^{K_1} \sum_{l=1}^{K_2} c_{12,kl} (w_{1i} - \kappa_{1k})_+ (w_{2i} - \kappa_{2l})_+$, $\sum_{k=1}^{K_1} \sum_{l=1}^{K_3} c_{13,kl} (w_{1i} - \kappa_{1k})_+ (w_{3i} - \kappa_{3l})_+$, and
$\sum_{k=1}^{K_2} \sum_{l=1}^{K_3} c_{23,kl} (w_{2i} - \kappa_{2k})_+ (w_{3i} - \kappa_{3l})_+$, involving 75 parameters. If the coefficients $\{b_{rk}, c_{rs,kl}\}$
are assumed to be fixed effects, then predictor selection methods are relevant, as in
Smith and Kohn (1996) or the RJMCMC (reversible jump MCMC) methods discussed
by Denison et al. (2002, p.105). If the $\{b_{rk}, c_{rs,kl}\}$ are assumed to be random effects,
smoothness may be achieved by penalising large coefficients, and parsimony achieved
by selection between zero and positive variance components $\{\phi_{b1}, \phi_{b2}, \phi_{b3}, \phi_{c12}, \phi_{c13}, \phi_{c23}\}$.
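The parameter counts in this example can be checked by building the design columns. The sketch below uses simulated predictors and knots at equally spaced quantiles, both illustrative assumptions. The helper names `tp` and `row_tensor` are hypothetical.

```r
# Linear truncated lines (q = 1): column k is (w - kappa_k)_+
tp <- function(w, knots) pmax(outer(w, knots, "-"), 0)

# Row-wise products of every column pair: n x (K1 * K2) interaction basis
row_tensor <- function(A, B) {
  do.call(cbind, lapply(seq_len(ncol(A)), function(k) A[, k] * B))
}

set.seed(2)
n <- 200
W <- matrix(runif(n * 3), n, 3)                       # R = 3 simulated predictors
knots <- lapply(1:3, function(r) quantile(W[, r], probs = (1:5) / 6))  # 5 knots each
Zmain <- lapply(1:3, function(r) tp(W[, r], knots[[r]]))

pairs <- combn(3, 2)                                  # columns: (1,2), (1,3), (2,3)
Zint <- lapply(seq_len(ncol(pairs)), function(j) {
  row_tensor(Zmain[[pairs[1, j]]], Zmain[[pairs[2, j]]])
})

sum(sapply(Zmain, ncol))   # 15 main effect coefficients
sum(sapply(Zint, ncol))    # 75 second-order interaction coefficients
```

With random effects on the columns, each of the three main effect blocks and three interaction blocks would receive its own variance component, matching the six components listed above.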
The tensor product generalisation of (12.2) or (12.5) includes interactions between the
terms in $T(w)$ and $Q(w)$ (Smith and Kohn, 1997; Ruppert et al., 2003, p.240). Consider a
situation with $R = 2$, with $K_1$ knots in $w_{1i}$ and $K_2$ knots in $w_{2i}$. For a linear spline ($q = 1$), and
random effect spline coefficients $\{b_{rk}, d_{rsk}, c_{rskm}\}$ one would have
$$\cdots + \sum_{k=1}^{K_1} b_{1k}(w_{1i}-\kappa_{1k})_+ + \sum_{k=1}^{K_2} b_{2k}(w_{2i}-\kappa_{2k})_+ + \sum_{k=1}^{K_2} d_{12,k}\, w_{1i}(w_{2i}-\kappa_{2k})_+$$
$$+ \sum_{k=1}^{K_1} d_{21,k}\, w_{2i}(w_{1i}-\kappa_{1k})_+ + \sum_{k=1}^{K_1}\sum_{m=1}^{K_2} c_{12,km}(w_{1i}-\kappa_{1k})_+(w_{2i}-\kappa_{2m})_+ + u_i,$$
where there are six variance components $(\phi_{b1}, \phi_{b2}, \phi_{d12}, \phi_{d21}, \phi_{c12}, \sigma^2)$. In the bivariate example
of Smith and Kohn (1997, p.1530), $K_1 = K_2 = 9$ and $q = 3$, leading to a (fixed effects) analysis
involving 169 coefficients.
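The count of 169 can be checked with a short calculation. The sketch below assumes (this is not stated in the text) that each univariate basis comprises an intercept, $q$ polynomial terms and $K_r$ truncated terms, and that the bivariate basis is the full tensor product of the two:

```latex
(1 + q + K_1)(1 + q + K_2) = (1 + 3 + 9)(1 + 3 + 9) = 13 \times 13 = 169 .
```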
A similar scheme applies when interactions between metric and categorical predictors
are considered. Thus let $C_i \in (1, \ldots, L)$ be a categorical predictor, and $w_{1i}$ and $w_{2i}$ be metric
predictors. Suppose that only the smooth in $w_2$ is postulated to vary according to the level
of $C$, and define
$$z_{il} = \begin{cases} 1 & \text{if } C_i = l \\ 0 & \text{otherwise.} \end{cases}$$
Also consider a metric response $y_i$, and assume that interactions between $w_1$ and $w_2$ are not
present. Then, with a $q$th degree truncated polynomial basis in both predictors, one possible
representation is
$$\cdots + \sum_{k=1}^{K_2} b_{2k}(w_{2i}-\kappa_{2k})_+^q + \sum_{l=2}^{L} z_{il}\left\{\sum_{k=1}^{K_2} c_{kl}(w_{2i}-\kappa_{2k})_+^q\right\},$$
where $b_{1k} \sim N(0, \phi_{b1})$, $b_{2k} \sim N(0, \phi_{b2})$, $c_{kl} \sim N(0, \phi_{cl})$ and $u_i \sim N(0, \sigma^2)$ (Coull et al., 2001). The
amount of smoothing under $S_1 = Q_1 + T_1$ and $S_{2,C_i} = Q_2 + T_{2,C_i}$ then depends on the ratios
$\sigma^2/\phi_{b1}$ and $\sigma^2/[\phi_{b2} + \phi_{cl}]$.
In a multivariate mixed model generalisation of the radial basis, one may consider thin-plate
functions with exponents $(2q - d)$ specified by integer combinations $(q, d)$, where $d$ is the
dimension of the covariate vectors in the relevant interaction (Yau et al., 2003). So
where $z$ are univariate or multivariate vector predictor values, and $t_k$ are univariate or
multivariate knots. In applying such functions, heavily parameterised multivariate spline
models are often not likely to be well identified, and simpler options involving univariate
smooths in each predictor (with $d = 1$), and all possible bivariate interactions (with $d = 2$),
may be considered (Yau and Kohn, 2003). Consider the setting $q = 2$, with predictors $w_1$ and
$w_2$, and let $z_i = (w_{1i}, w_{2i})$ denote bivariate covariate combinations, with $K_{12}$ bivariate centres
$t_k = (t_{1k}, t_{2k})$ that might be provided by an initial cluster analysis. Also denoting distances
$h_{ik} = |z_i - t_k|$, the bivariate basis for $2q - d = 2$ is of the form $h^2 \log(h)$. With linear terms in the
parametric component $Q(w)$, this leads to the representation
$$\cdots + \sum_{k=1}^{K_2} b_{2k}|w_{2i}-\kappa_{2k}|^3 + \sum_{k=1}^{K_{12}} c_k h_{ik}^2 \log(h_{ik}).$$
With $K_r$ knots $\{\kappa_{r1}, \ldots, \kappa_{rK_r}\}$ for predictor $w_{ri}$, the $R$ main effects are
$$S_r(w_{ri}) = \beta_r w_{ri} + \sum_{k=1}^{K_r} b_{rk}|w_{ri}-\kappa_{rk}|^3,$$
where the $R$ sets of coefficients $\{[b_{r1}, b_{r2}, \ldots, b_{rK_r}], r = 1, \ldots, R\}$ are assumed random with
variances $\phi_{b1}, \ldots, \phi_{bR}$. Let the $K_{rs}$ bivariate knots for first order $(w_r, w_s)$ interaction effects be
denoted $t_{rs,k} = (t_{rk}, t_{sk})$. Then the interaction bases have the form
$$T_{rs}(w_{ri}, w_{si}) = \sum_{k=1}^{K_{rs}} c_{rs,k}\, \|(w_{ri}, w_{si}) - (t_{rk}, t_{sk})\|^2 \log\left(\|(w_{ri}, w_{si}) - (t_{rk}, t_{sk})\|\right).$$
The $\binom{R}{2}$ sets of coefficients $c_{rs,k}$ are also assumed to be random.
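The univariate and bivariate radial bases just described can be constructed in a few lines. This is a minimal R sketch, assuming simulated predictors, quantile-based univariate knots, and bivariate centres from `kmeans` (one possible form of the initial cluster analysis); the helper names `cubic_radial` and `tps_bivariate` are hypothetical.

```r
set.seed(2)
n  <- 150
w1 <- runif(n); w2 <- runif(n)               # hypothetical predictors

# Univariate radial basis |w - kappa|^3 (2q - d = 3 for q = 2, d = 1)
cubic_radial <- function(w, knots) abs(outer(w, knots, "-"))^3

# Bivariate basis h^2 log(h) (2q - d = 2 for q = 2, d = 2), with 0 * log(0) taken as 0
tps_bivariate <- function(Z, centres) {
  # squared Euclidean distances between rows of Z and rows of centres
  d2 <- outer(rowSums(Z^2), rowSums(centres^2), "+") - 2 * Z %*% t(centres)
  h  <- sqrt(pmax(d2, 0))
  ifelse(h > 0, h^2 * log(h), 0)
}

k1 <- quantile(w1, (1:8) / 9, names = FALSE)  # 8 knots per predictor (illustrative)
k2 <- quantile(w2, (1:8) / 9, names = FALSE)
centres <- kmeans(cbind(w1, w2), centers = 10)$centers   # K12 = 10 bivariate centres

X1  <- cubic_radial(w1, k1)                    # n x 8
X2  <- cubic_radial(w2, k2)                    # n x 8
X12 <- tps_bivariate(cbind(w1, w2), centres)   # n x 10
dim(X12)
```

The columns of `X1`, `X2` and `X12` would then carry the random coefficients $b_{1k}$, $b_{2k}$ and $c_k$ respectively.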
Low-rank thin-plate spline smooths as an approximation to the full thin-plate regression
spline (TPRS) smoother are considered by Wood (2003, 2006, 2016). Thus, for an
$R$-dimensional predictor vector $w_i = (w_{1i}, \ldots, w_{Ri})$, with linear model
$$y_i = f(w_i) + u_i$$
with $u_i$ random, the full TPS smooth of degree $m$ involves a function $g$ minimising
$$\|y - g\|^2 + \lambda J_{mR}(g),$$
where, for $m = 2$ and $R = 2$, the penalty is
$$J_{mR}(g) = \int\!\!\int \left[\left(\frac{\partial^2 g}{\partial w_1^2}\right)^2 + 2\left(\frac{\partial^2 g}{\partial w_1 \partial w_2}\right)^2 + \left(\frac{\partial^2 g}{\partial w_2^2}\right)^2\right] dw_1\, dw_2 .$$
The function $g$ has the form
$$g(w) = \sum_{i=1}^{n} \delta_i\, \eta_{mR}(\|w - w_i\|) + \sum_{j=1}^{M} \alpha_j \phi_j(w),$$
where $\delta_i$ and $\alpha_j$ are unknowns. To reduce the number of unknowns, especially for larger
samples, a rank $k$ orthonormal basis for the $\delta$ parameters is used instead. This approach
avoids the knot placement problems of conventional regression spline modelling. Thin-plate
regression splines with truncated basis are implemented in the R package mgcv, with the
jagam function (Wood, 2016) producing modifiable rjags code incorporating the TPRS commands
(see Example 12.3).
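As a hedged illustration of how jagam is typically called, the sketch below generates JAGS code and data for a rank 20 thin-plate regression spline. The data are simulated and the file name is arbitrary; only the model file is written, so JAGS itself is not needed to run this step.

```r
library(mgcv)

set.seed(3)
n   <- 100
dat <- data.frame(x = runif(n))
dat$y <- sin(2 * pi * dat$x) + rnorm(n, sd = 0.3)   # simulated data, illustration only

jags_file <- file.path(tempdir(), "tprs_model.jags")
jd <- jagam(y ~ s(x, bs = "tp", k = 20, m = 2),
            family = gaussian, data = dat, file = jags_file)

# jagam writes editable JAGS model code and returns data and initial values
cat(readLines(jags_file), sep = "\n")
names(jd$jags.data)
```

The written file can then be edited (for example, to add likelihood calculations or selection indicators) before being passed to rjags together with `jd$jags.data` and `jd$jags.ini`.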
FIGURE 12.4
(a) Smooth for TFR as function of GDP per capita. (b) Smooth for TFR as function of GDP per capita
and female education.
of education for females. TPRS models are applied and can be fitted using jagam/mgcv,
or the stan_gamm4 option in rstanarm.
The first model involves a smooth in GDP only, and a truncated TPRS representation
with rank k = 20 and m = 2. This provides a penalised DIC of 415, with Figure 12.4a show-
ing the resulting centred smooth. Including separate univariate smooths in both GDP
and female education improves the pDIC to 380. Both analyses show rapid convergence.
The second model is illustrated both by jagam/mgcv and stan_gamm4 codes.
A third model involves a joint smooth s(gdp,fschool) in the predictors, and provides a
pDIC of 379. Figure 12.4b shows the resulting three-dimensional scatter plot. Combining
both univariate smooths and a joint smooth provides a slightly improved pDIC of 374.
A final analysis modifies the rjags code for this model to include likelihood calculations
from which WAIC (widely applicable information criterion) and LOO-IC may
be derived, and also includes binary selection indicators, $J_k \sim \text{Bern}(0.5)$, for the three
smooths. A penalising complexity prior is adopted for the residual standard deviation,
based on an assumed 0.01 probability that this exceeds 2, and exponential, E(1), priors
are adopted on the smoothing parameters. This analysis shows a 0.06 probability
for retaining the univariate smooth in gdp, and if that smooth is excluded (so that the
model consists only of a univariate smooth in fschool and a bivariate smooth), the pDIC
falls to 372.
$$g(\mu_i) = \alpha + S(w_i) = \alpha + \beta_1 w_i + \ldots + \beta_q w_i^q + \sum_{k=1}^{K} b_k (w_i - \kappa_k)_+^q + u_i,$$
where $u_i \sim N(0, \sigma^2)$, the $\kappa_k$ are knots, and the spline coefficients $b_k$ may be taken as normal,
for example $b_k \sim N(0, \phi)$. This approach is spatially homogeneous (in terms of the predictor
space), whereas a spatially adaptive regression may be used to represent heteroscedasticity,
which is also related to $w$ values, or possibly to the values of other predictors (Currie
and Durban, 2002). Spatially adaptive regression may also be used to allow non-constant
variance in the $b_k$, namely $b_k \sim N(0, \phi_k)$, with $\log(\phi_k)$ determined by a spline regression on
the knots (Yue et al., 2012).
For modelling heteroscedasticity, with $u_i \sim N(0, \sigma_i^2)$, a subsidiary spline regression may
be applied to the variances $\sigma_i^2 = \exp(\eta_i)$, with $M$ knots in the same predictor
$$\eta_i = \gamma_0 + \gamma_1 w_i + \ldots + \gamma_q w_i^q + \sum_{m=1}^{M} c_m (w_i - \psi_m)_+^q,$$
with $c_m \sim N(0, \phi_c)$ (e.g. Chib and Greenberg, 2013). Other options (Jerak and Lang, 2005)
are random walk priors in $\eta_i$, such as an RW1
$$\eta_i \sim N(\eta_{i-1}, 1/\tau_\eta).$$
or discrete mixture over smoothing functions, with mixture probabilities based on multinomial
logit regression involving additional covariates $x_i$. For $y$ metric and $M$ mixture
components, one might have
$$p(y_i | x_i, w_i) \sim \sum_{m=1}^{M} \pi_m(x_i)\, N(S_m(w_i, \theta_m), V_m),$$
$$\sum_{m=1}^{M} \pi_m(x_i) = 1,$$
where each smooth function $S_m(w, \theta_m)$ has its own parameter set $\theta_m$.
FIGURE 12.5
(a) Residuals against fitted, homoscedastic model. (b) Residuals against free school meals (FSM). (c) Plot of log variance against FSM.
on the percentage of pupils receiving free meals (FSM), percentage of English language
pupils (ELP), and percentage of teachers with emergency credentials (EMCRED).
Let $\sigma^2 = \text{Var}(u_i)$, and assume $1/\sigma^2 \sim \text{Ga}(1, 0.001)$ in a homoscedastic linear regression
With computation via jagsUI, this provides a LOO-IC of 4387, with $p_e = 5.1$. However, a
plot of the residuals shows residual variation to decrease as fitted attainment increases
(Figure 12.5a). All three predictors have significant (negative) effects on attainment, but
the highest ratio of posterior mean to standard deviation is for FSM, and a plot of the
residuals against FSM (Figure 12.5b) suggests residual variation increases with FSM.
A second model therefore specifies $y_i \sim N(\mu_i, \sigma_i^2)$, where $u_i \sim N(0, \sigma_i^2)$, with $\log(\sigma_i^2)$
modelled by a cubic spline regression
$$\log(\sigma_i^2) = \gamma_0 + \sum_{m=1}^{M} c_m (\text{FSM}_i - \psi_m)_+^3 .$$
The spline coefficients are random, $c_m \sim N(0, \phi_c)$, with $1/\phi_c \sim \text{Ga}(1, 1)$. There are $M = 9$
knots, sited at the 10th, 20th, …, 90th percentiles of FSM. A corner constraint $c_1 = 0$
is used for identifiability. A two-chain run of 20,000 iterations gives an estimate for
$\phi_c^{0.5}$ of 0.52 with 95% interval (0.33, 0.84), whereas homoscedasticity would imply $\phi_c^{0.5} = 0$.
Figure 12.5c accordingly demonstrates non-constancy in $\log(\sigma_i^2)$ as FSM varies, though
there is no consistent monotonic upward or downward trend in variability as FSM
increases. The LOO-IC under the second model falls to 4375 ($p_e = 11.3$).
A third model employs a different identification device, namely centring (at each iteration)
the observation level smooth $S_i(\text{FSM}) = \sum_{m=1}^{M} c_m (\text{FSM}_i - \psi_m)_+^3$ around the overall
mean of such smooths. The centred smooth is then included in the spline regression for
$\log(\sigma_i^2)$. This produces a similar fit (LOO-IC = 4376), and a similar non-monotonic relation
between $\log(\sigma_i^2)$ and FSM. The centred $c_m$ (c.cent in the R code) for this implementation
have a correlation of 0.99 with those from the corner constraint option.
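The two identification devices for the log-variance smooth (corner constraint versus centring) can be sketched outside of MCMC. The R code below is illustrative only: the FSM values, intercept and coefficient draws are simulated, and rescaling FSM to the unit interval before cubing is an added assumption to keep the basis values moderate.

```r
set.seed(4)
n   <- 300
fsm <- runif(n, 0, 100)                          # hypothetical FSM percentages

# M = 9 knots at the 10th, 20th, ..., 90th percentiles
psi <- quantile(fsm, probs = seq(0.1, 0.9, by = 0.1), names = FALSE)

# Cubic truncated power basis (FSM_i - psi_m)_+^3 on the rescaled predictor
Zc <- outer(fsm / 100, psi / 100, function(a, k) pmax(a - k, 0)^3)

phi_c  <- 0.52^2                                 # illustrative variance for the c_m
c_m    <- c(0, rnorm(8, 0, sqrt(phi_c)))         # corner constraint c_1 = 0
gamma0 <- 8                                      # illustrative intercept

S           <- as.vector(Zc %*% c_m)             # observation-level smooth
log_s2      <- gamma0 + S                        # corner-constrained version
log_s2_cent <- gamma0 + (S - mean(S))            # alternative: centred smooth

u <- rnorm(n, 0, sqrt(exp(log_s2)))              # heteroscedastic errors
```

The two versions differ only by a constant shift, which is absorbed by the intercept when the model is estimated.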
and let $S_t = S(w_t)$ be a smooth function representing the locally changing impact of $w_t$ on
$g(\mu_t)$ as it varies over its range. Thus
$$g(\mu_t) = \alpha + S(w_t) + u_t,$$
$$u_t \sim N(0, \sigma^2),$$
where, depending on the identification procedures used, the intercept $\alpha$ may not be present
(Koop and Poirier, 2004). Appropriate priors for $S_t$ reflect the ordering and spacing of the $w$
values, and typically follow dynamic linear priors or other time series schemes. Normal or
Student t random walks in the first, second, or higher differences of $S_t$ are one possibility
(Knorr-Held, 1999; Fahrmeir and Lang, 2001; Chib and Jeliazkov, 2006). For identifiability,
especially when there are smooths $S_{rt} = S(w_{rt})$ in several predictors, one may adopt devices
such as centring of the $S_{rt}$, or corner constraints (e.g. $S_{r1} = 0$). Alternatively, to expedite
computing speed, one may monitor identified quantities such as the centred series $S_{rt} - \bar{S}_r$
without actually imposing centring constraints within the estimation. Because there is
only local smoothing, inferences may also be sensitive to priors assumed for the evolution
variance $\tau^2$ for the $S_t$ and other aspects of the model.
If the $w$ values are equally spaced and distinct, then first and second order random walk
priors are just
$$S_t \sim N(S_{t-1}, \tau^2),$$
$$S_t \sim N(2S_{t-1} - S_{t-2}, \tau^2),$$
where smaller values of $\tau^2$ result in a smoother curve. For metric or overdispersed discrete
responses, the parameterisation $\tau^2\lambda = \sigma^2$ may be used, allowing for trade-off between the
residual variance and the variance of the smooth (Koop and Poirier, 2004).
In ordinary regression applications, values of the wt are typically unequally spaced, and
there may be tied values. To take account of unequal spacing between successive wt, the
prior is modified such that for second and higher order walks, the weighting on lagged
values is varied according to how distant they are from the current value (Fahrmeir and
Lang, 2001). In all orders of random walk, the precision of $S_t$ is reduced the wider the
gap between $w_t$ and its preceding ordered values. Let gaps between points be denoted
$\delta_2 = w_2 - w_1, \delta_3 = w_3 - w_2, \ldots, \delta_n = w_n - w_{n-1}$ (with $\delta_1 = 0$). Then a first-order Normal random
walk becomes
$$S_t \sim N(S_{t-1}, \delta_t\tau^2).$$
Separate, usually fixed effect, priors are assumed for the initial values (e.g. $S_1$ in a first
order random walk). A scheme allowing choice between RW1 and RW2 dependence for
unequally spaced $w$ is proposed by Berzuini and Larizza (1996), namely
$$S_t \sim N(M_t, \delta_t\tau^2)$$
where
Larger values of $\eta > 0$, such that $\exp(-\eta\delta_t)$ tends to zero, imply an approximate RW1 prior
and less smoothness.
If there are ties in the $w$ values, with only $m < n$ distinct values, denoted $\{w_j^*, j = 1, \ldots, m\}$,
then the above priors would be on the differences $\delta_j = w_j^* - w_{j-1}^*$ in the ranked distinct values,
and it is necessary to specify a grouping index $G_t$ (ranging between 1 and $m$) for each
observation. Thus
$$g(\mu_t) = \alpha + S(G_t) + u_t, \quad t = 1, \ldots, n$$
$$S_j \sim N(S_{j-1}, \delta_j\tau^2), \quad j = 1, \ldots, m$$
with $G_t \in (1, \ldots, m)$.
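The grouping index and the gap-weighted random walk can be illustrated with a short simulation. This is a minimal R sketch under stated assumptions: the predictor values, evolution variance and the diffuse prior for the initial value are all illustrative choices, not taken from the text.

```r
set.seed(5)
w <- round(runif(40, 0, 10), 1)          # hypothetical predictor, ties possible

w_star <- sort(unique(w))                # m distinct ranked values
m      <- length(w_star)
G      <- match(w, w_star)               # grouping index G_t in 1..m
delta  <- c(0, diff(w_star))             # gaps between distinct values, delta_1 = 0

tau2 <- 0.5                              # illustrative evolution variance
S    <- numeric(m)
S[1] <- rnorm(1, 0, 10)                  # separate diffuse prior for the initial value
for (j in 2:m) {
  # RW1 with variance proportional to the gap between successive distinct values
  S[j] <- rnorm(1, S[j - 1], sqrt(delta[j] * tau2))
}

alpha <- 1
mu <- alpha + S[G]                       # regression term for each of the n observations
```

Observations sharing a tied value of `w` map to the same element of `S`, so they share the same smooth effect.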
If there is more than one predictor then a semiparametric model might be adopted with
smooth functions Sr(wr) on a subset r = 1, …, q of R predictors, with the remainder modelled
by assuming global linearity. So
If non-parametric functions are estimated for several regressors $w_{1t}, w_{2t}, \ldots, w_{qt}$, then
a unique ordering across all predictors is usually infeasible and grouping indices
$G_{1t}, G_{2t}, \ldots, G_{qt}$ for each of $q$ regressors are necessary, even if the regressors have no tied
values. In the case of tied values, the indices range between 1 and $m_1$, 1 and $m_2$, …, 1 and $m_q$
(rather than between 1 and $n$).
Another approach (Wahba, 1983; Biller and Fahrmeir, 1997; Wood and Kohn, 1998) to
Bayesian general additive modelling involves the state space version of the polynomial
smoothing spline. For a spline of general order $2h - 1$, $S_t = S(w_t)$ is generated by a differential
equation
$$\frac{d^h S_t}{dt^h} = \tau \frac{dW_t}{dt},$$
with $W_t$ a Wiener process, and $\tau^2$ the evolution variance. The state vector
$$Z_t = \left(S_t, \frac{dS_t}{dt}, \frac{d^2 S_t}{dt^2}, \ldots, \frac{d^{(h-1)} S_t}{dt^{(h-1)}}\right)$$
then follows the transition equation
$$Z_t = F_t Z_{t-1} + e_t, \qquad (12.8)$$
where $F_t$ is an $h \times h$ transition matrix and $e_t$ is a multivariate error. For the cubic spline case
with $h = 2$, $Z_t = (S_t, dS_t/dt)$ is bivariate and the transition matrix is
$$F_t = \begin{pmatrix} 1 & \delta_t \\ 0 & 1 \end{pmatrix},$$
where $\delta_t = w_{t+1} - w_t$. The $e_t$ are also bivariate, for example, MVN with zero mean and covariance
$\tau^2 E_t$, where
$$E_t = \begin{pmatrix} \delta_t^3/3 & \delta_t^2/2 \\ \delta_t^2/2 & \delta_t \end{pmatrix}.$$
As usual there may be ties in the $w$ values, and the prior (12.8) would be on $j = 1, \ldots, m$ distinct
ranked values. Each observation for $t = 1, \ldots, n$ would have a grouping index $G_t$ with
values between 1 and $m$.
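The cubic spline case ($h = 2$) of the state space prior can be simulated directly from the transition matrix and error covariance given above. The R sketch below assumes simulated, distinct predictor values and an arbitrary initial state and evolution variance; the helper names `F_mat` and `E_mat` are hypothetical, and the gap used at each step is the one between the current and preceding ordered values.

```r
# Transition matrix and error covariance for the cubic-spline (h = 2) state space prior
F_mat <- function(delta) matrix(c(1, 0, delta, 1), 2, 2)       # rows: (1, delta), (0, 1)
E_mat <- function(delta) matrix(c(delta^3 / 3, delta^2 / 2,
                                  delta^2 / 2, delta), 2, 2)

set.seed(6)
w     <- sort(runif(30, 0, 5))           # distinct ordered predictor values (hypothetical)
delta <- diff(w)                         # gaps between successive values
tau2  <- 0.2                             # illustrative evolution variance

Z <- matrix(0, length(w), 2)             # columns: S and its first derivative
Z[1, ] <- c(0, 1)                        # arbitrary initial state
for (i in 2:length(w)) {
  d <- delta[i - 1]                      # gap leading into position i
  L <- t(chol(tau2 * E_mat(d)))          # lower Cholesky factor of the error covariance
  Z[i, ] <- F_mat(d) %*% Z[i - 1, ] + L %*% rnorm(2)
}
S <- Z[, 1]                              # simulated smooth at the ordered w values
```

With tied values, the same recursion would run over the $m$ distinct ranked values, with observations linked to states through a grouping index.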
$$y_i \sim \text{Bin}(n_i, \mu_i),$$
where $\tau_r^2$ is the variance for the randomly varying $s_{rj}$. There is excess dispersion which
may be removed by a model also including an unstructured effect
FIGURE 12.6
(a) Smooth in GCSE (80% CRI). (b) Smooth in IMD (80% CRI).
increasing (Kohler et al., 2016). Time-varying regression effects are a special case of the
general varying coefficient model of Hastie and Tibshirani (1993), namely
where the effect modifiers $u = (u_1, \ldots, u_R)$ govern the effect of predictors $w = (w_1, \ldots, w_R)$. If
the modifiers are all the same (e.g. time) with $u_1 = u_2 = \ldots = u_R = t$ then
and the time-varying coefficient model, or dynamic general linear model (West and
Harrison, 1997), is obtained. This extends to time-varying predictors $w_{rit}$, with
Time-varying intercept or regression effects $\beta_r(t)$ of unknown form can be fitted by any
non-parametric method, such as regression, penalised splines, or random walks. For example,
a B-spline approach would take
$$\beta_r(t) = \sum_{k=1}^{K^*} b_{rk} B_k(w_{rit}, q)$$
where $b_{rk}$ are modelled as fixed or random effects. The fixed effects approach would typically
be combined with selection of significant coefficients.
Allowing for intercepts or regression effects to vary by subject makes random effects
a more sensible option. A comprehensive review of frequentist approaches to such non-
parametric mixed models is provided by Wu and Zhang (2006) – see also Chapter 9 in
Ruppert et al. (2003). A typical application is in growth curve analysis and involves subject
specific non-parametric growth curves in time or age. For example, a growth curve model
where observations at each wave included age could be modelled using a truncated spline
$$u_{it} \sim N(0, \sigma^2),$$
where $\eta(a)$ is the population mean function, estimated non-parametrically, and $S_i(a)$ are
subject-specific deviation functions. Silva et al. (2008) consider cubic B-spline bases to
model region-wide and area-specific trends for health outcomes $y_{it} \sim \text{Bin}(n_{it}, \pi_{it})$, namely
$$\text{logit}(\pi_{it}) = \alpha + \eta(t) + S_i(t) + \delta_i = \alpha + \sum_{k=1}^{K^*} b_k B_k(t, 3) + \sum_{k=1}^{K^*} c_{ik} B_k(t, 3) + \delta_i,$$
(e.g. growth curve) application with a single predictor wit, the impact of which is mod-
elled at population level by a smooth function S(wit). Then one may wish to allow both for
intercept (baseline) variation and for subject level variation around the average function
S(w). Thus
the mean effect of predictor $w_{it}$, but this effect is stronger for subjects with $b_{2i} > 0$, and
weaker for subjects with $b_{2i} < 0$. So $b_{2i}$ acts to amplify or attenuate the non-parametric
impact of the variable $w_{it}$. For some subjects, one may even obtain large negative estimates,
$b_{2i} < -1$, so that the effect of $w_{it}$ is inverted. This model adapts to cross-sectional data where
$$g(\mu_i) = \alpha + S(w_i) + b_i S(w_i) + u_i,$$
particularly in cases where the units are non-exchangeable, for example, if the units were
areas, and $b_i$ followed a spatial prior.
The impact of $(1 + b_{2i})$ on the unknown function $S(w_{it})$ is analogous to (subject specific)
factor loadings operating on factor scores, and is subject to identifiability (label switching)
issues, since $[-(1 + b_{2i})][-S(w_{it})] = S(w_{it})(1 + b_{2i})$. However, labelling issues should be
avoided in practice if the impact of $w_{it}$ represented by $S(w)$ is well-identified by the data.
An alternative product scheme is applied by Congdon (2006), based on the Lee and Carter
(1992) mortality forecasting model. In this scheme, subject-specific weights $q_i$ that sum to
1 over all subjects operate on $S(w_{it})$, so that for $\sum_i q_i = 1$ the product scheme is $q_i S(w_{it})$. The
effect of $w$ is stronger for subjects with higher $q_i$, and weaker for subjects with lower $q_i$, with
the average $q_i$ being $1/n$.
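require(splines)
# cycle day values, -8 to 15
cycval <- seq(-8, 15)
# cubic B-spline basis with four interior knots and an intercept column
bs.cycval <- bs(cycval, df = NULL, knots = c(-5, 0, 5, 10), degree = 3,
                intercept = TRUE, Boundary.knots = range(cycval))
# strip the basis attributes to leave a plain numeric matrix
bs.cycval <- matrix(as.numeric(bs.cycval), nrow = nrow(bs.cycval))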
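The structure of this basis can be inspected before it is passed to the model. The following R sketch repeats the basis construction and then forms an illustrative curve $\sum_k b_k B_k(t, 3)$; the coefficients are random draws used only for demonstration, not estimates.

```r
library(splines)

cycval <- seq(-8, 15)
B <- bs(cycval, knots = c(-5, 0, 5, 10), degree = 3, intercept = TRUE,
        Boundary.knots = range(cycval))

dim(B)                 # 24 time points by K* = 4 + 3 + 1 = 8 basis functions
range(rowSums(B))      # with an intercept, the basis functions sum to one at each point

# Hypothetical coefficients b_k give a curve sum_k b_k B_k(t, 3)
set.seed(11)
b <- rnorm(ncol(B))
curve_t <- as.vector(B %*% b)
```

Because the basis functions sum to one, the spline curve is confounded with any separate intercept, which is why the text uses centring or corner constraints for identification.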
$$\mu_{it} = \alpha_{G_i} + \sum_{k=1}^{K^*} b_{G_i k} B_k(t, 3),$$
$$b_{jk} \sim N(0, 1/\phi_j),$$
$$\phi_j \sim \text{Ga}(1, 0.001),$$
$$\tau \sim \text{Ga}(1, 0.001).$$
A two-chain run of 20,000 iterations is undertaken, with centring of $c_{jt} = \sum_{k=1}^{K^*} b_{jk} B_k(t, 3)$
within groups for identification. There is a similar path between the two groups, in
terms of posterior means of $\{\alpha_j + c_{jt}\}$, up to the week after ovulation, but distinct trends
thereafter (Figure 12.7a). The LOO-IC is 6252.
FIGURE 12.7
(a) Growth curve smooths (Model 1). (b) Growth curve smooths (Model 2).
$$\mu_{it} = \alpha_{G_i} + b_{i0} + \sum_{k=1}^{K^*} b_{ik} B_k(t, 3),$$
with
The corner constraint $b_{i1} = 0$ aids in identification. Average growth curves are shown
in Figure 12.7b. The LOO-IC for this model is 3319.
and $D^{-1}$ follows a Wishart prior with identity scale matrix and 2 degrees of freedom.
A second-order random walk smooth is estimated over all (i,t) pairs using a normal
prior with a single variance parameter, rather than on the basis of successive ages
within each fertility sequence, which would permit distinct variance parameters for
each subject. The smooth involves 31 random parameters, namely for maternal ages
12 to 42. Identification is achieved by centring S(w) at each iteration.
A two-chain run of 10,000 iterations using the rube library shows significant heterogeneity
around the overall smooth in age, with a posterior mean for $\text{var}(b_2)$ of 2.0,
and 95% interval (1.2, 3.3). Figure 12.8a shows the varying non-parametric impact of
maternal age $w_{it}$ on birthweight according to $b_{2i}$, namely for subjects with $b_{2i} = \text{sd}(b_2)$,
$b_{2i} = 0$, and $b_{2i} = -\text{sd}(b_2)$, where the standard deviations are those at particular MCMC
iterations. A histogram plot of the posterior mean $b_{2i}$ (Figure 12.8b) indicates normality,
though an extreme negative outlier of −4.9 occurs for subject 470, whose fourth and fifth
infants weighed under 1 kg, whereas the first two exceeded 3 kg in weight. To assess outlier
status at observation level, one may derive WAIC component scores for individual
(mother, infant) pairs: the largest such score (44 out of a total WAIC of 5788) is for the
fifth infant to mother 838.
FIGURE 12.8
(a) Smooth impacts of maternal age on birthweight, according to variability in b2. (b) Histogram
of b2.
12.7 Quantile Regression
Normal linear regression and generalised linear models focus on estimating the condi-
tional mean of the response yi. Quantile regression (Koenker, 2005) provides a more com-
plete perspective on the conditional density of yi, and focuses on estimating conditional
quantiles (such as the conditional median) of the response. Sometimes, conditional mean
regression will show a predictor as having no impact, whereas quantile regression will
show a significant impact over at least part of the quantile range (Cade and Noon, 2003),
though collinearity between predictors (and hence, predictor selection) may still be an
issue (Xi et al., 2016; El Adlouni et al., 2018). With quantiles denoted $q \in [0,1]$, the conditional
quantile is given by the quantile (inverse cumulative distribution) function
$Q(q|X_i)$, defined by $\Pr[y_i < Q(q|X_i)] = q$.
For linear regression involving a continuous response, the frequentist quantile regression
estimator at quantile $q$ minimises the function
$$Q(q|X_i) = q \sum_{y_i \ge X_i\beta_q} |y_i - X_i\beta_q| + (1 - q) \sum_{y_i < X_i\beta_q} |y_i - X_i\beta_q| .$$
This loss function downweights or emphasises absolute errors according to the quantile
$q$. For example, setting $q = 0.9$ results in a loss nine times larger for positive residuals with
$y_i \ge X_i\beta_q$ than for negative residuals with $y_i < X_i\beta_q$. So, the upper tail of the conditional
distribution is emphasised.
A special case is provided by median regression, via minimisation of the absolute
deviations:
$$Q(0.5|X_i) = \sum_i |y_i - X_i\beta| .$$
This reduces the impact of outliers (influential observations) in the response space on esti-
mation, so as to provide a better fit for the majority of observations. Credible intervals (e.g.
for observation level predictions) estimated using conditional mean regression by averag-
ing over MCMC samples may also be affected by outliers. By contrast, median regression
is more robust to skewness and other departures from normality (Geraci and Bottai, 2006).
Thus, Min and Kim (2004) consider different forms of non-Gaussian errors, with asym-
metric and long-tailed distributions, and show that median regression outperforms con-
ditional mean regression, since the median is a more suitable centrality measure for data
with a skewed response.
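The asymmetric weighting in the quantile loss, and its link to sample quantiles, can be verified numerically. This is a minimal R sketch for an intercept-only model with a simulated skewed response; the function name `check_loss` is hypothetical.

```r
# Check loss: q * |r| for r >= 0 and (1 - q) * |r| for r < 0
check_loss <- function(r, q) ifelse(r >= 0, q * abs(r), (1 - q) * abs(r))

# At q = 0.9 a positive residual costs nine times as much as an equal negative one
ratio <- check_loss(1, 0.9) / check_loss(-1, 0.9)

# Minimising the summed check loss over a constant recovers the sample quantile
set.seed(7)
y   <- rexp(501)                                    # skewed illustrative response
fit <- optimize(function(b) sum(check_loss(y - b, 0.9)), interval = range(y))
c(minimiser = fit$minimum, sample_quantile = quantile(y, 0.9, names = FALSE))
```

Setting `q = 0.5` in the same function gives half the sum of absolute deviations, so its minimiser is the sample median.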
Methods for Bayesian quantile regression include asymmetric Laplace likelihood (Yu
and Moyeed, 2001), exponentially tilted empirical likelihood (Schennach, 2005), and
Dirichlet process mixture median regression (Kottas and Gelfand, 2001). Yu and Moyeed
(2001) demonstrate that loss function minimisation is equivalent to estimation using an
asymmetric Laplace distribution (ALD), with density function
$$\text{ALD}(y|\eta_q, \sigma, q) = \frac{q(1 - q)}{\sigma} \exp\left[-\rho_q\left(\frac{y - \eta_q}{\sigma}\right)\right],$$
where $\rho_q(u) = u[q - I(u < 0)]$ is the check function underlying the quantile loss.
This density can be represented as a scale mixture of normals, thus facilitating Gibbs
sampling (Kozumi and Kobayashi, 2011).
Thus, for $y \sim \text{ALD}(\eta_q, \sigma, q)$, one has for quantiles $q = 1, \ldots, Q$ the quantile-specific
representation
$$y_i = \eta_{iq} + \xi_q W_{iq} + \left[\frac{2\sigma_q W_{iq}}{q(1 - q)}\right]^{0.5} Z_{iq},$$
where $\eta_{iq}$ is the regression term, $\xi_q = (1 - 2q)/[q(1 - q)]$, $W_{iq} \sim \text{Exp}(\sigma_q)$, and $Z_{iq} \sim N(0, 1)$. The
practical role of the $\xi_q W_{iq}$ terms is to maintain the model as a satisfactory representation
of $y$, compensating for shifts in $\eta_{iq}$ between quantiles. The $W_{iq}$ are measures of outlier status.
Observations with higher $W_{iq}$ have higher variances and lessened influence on the
likelihood. R packages to implement Bayesian quantile linear regression include brq
(Alhamzawi, 2012), bayesQR (Benoit and Van den Poel, 2014), and ALDqr (Sanchez et al.,
2017).
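The scale mixture representation can be checked by simulation: draws generated from it should place a proportion $q$ of the mass below the regression term. The R sketch below uses a constant regression term and illustrative values for $q$ and the scale. It assumes the exponential mixing variable has mean equal to the scale (rate $1/\sigma$), as in Kozumi and Kobayashi (2011); software that parameterises the exponential by its rate would need the reciprocal.

```r
set.seed(8)
q     <- 0.25                       # quantile of interest
eta   <- 2                          # regression term, held constant here
sigma <- 1.5                        # ALD scale (illustrative)
n     <- 200000

xi <- (1 - 2 * q) / (q * (1 - q))
W  <- rexp(n, rate = 1 / sigma)     # assumption: exponential with mean sigma
Z  <- rnorm(n)
y  <- eta + xi * W + sqrt(2 * sigma * W / (q * (1 - q))) * Z

mean(y < eta)                       # should be close to q
```

Large draws of `W` inflate the normal variance for that observation, which is the sense in which the mixing variables act as measures of outlier status.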
In practice, it is not necessarily guaranteed that estimated quantile curves will be non-
crossing, especially for quantiles not widely separated (e.g. q = 0.05 compared to q = 0.10)
(Bondell et al., 2010). Methods to circumvent this, not necessarily fully Bayesian, have been
proposed (Cai and Jiang, 2015). An ad hoc approach involves simultaneous estimation of
all quantiles of interest, and omitting MCMC samples where the expected ordering of the
quantile regression terms ηiq is not satisfied.
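The ad hoc filtering of MCMC samples can be sketched as follows. The draws here are simulated independently for illustration (real posterior draws from a simultaneous fit would be correlated across quantiles), and the object names are hypothetical.

```r
set.seed(9)
n_draws <- 1000; n_obs <- 5
# Hypothetical posterior draws of the regression terms at q = 0.05 and q = 0.10
# (rows are MCMC draws, columns are observations)
eta_05 <- matrix(rnorm(n_draws * n_obs, 0.0, 0.2), n_draws, n_obs)
eta_10 <- matrix(rnorm(n_draws * n_obs, 0.3, 0.2), n_draws, n_obs)

# Keep only draws where the lower quantile lies below the upper one for every observation
keep <- rowSums(eta_05 > eta_10) == 0
mean(keep)                                   # share of draws retained

eta_05_kept <- eta_05[keep, , drop = FALSE]
eta_10_kept <- eta_10[keep, , drop = FALSE]
```

A low retention rate would indicate that crossing is common for the chosen pair of quantiles, and that a constrained method may be preferable to filtering.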
For longitudinal data (with units i, and times t) (e.g. Geraci and Bottai, 2006; Alhamzawi
et al., 2011), the regression term might include quantile-specific unit level random effects
biq. Assuming normal subject effects, the representation would then be
$$y_{it} = X_{it}\beta_q + b_{iq} + \xi_q W_{itq} + \left[\frac{2\sigma_q W_{itq}}{q(1 - q)}\right]^{0.5} Z_{itq},$$
with $b_{iq} \sim N(0, \sigma_b^2)$.
12.7.1 Non-Metric Responses
For binary responses, the augmented data method can be applied, combined with the scale
mixture version of the ALD (Benoit and Van den Poel, 2012; Benoit and Van den Poel, 2017).
Thus, binary responses $y_i$ can be regarded as determined by a continuous latent variable $y_i^*$.
To implement quantile regression for these latent variables, one specifies
with set scale parameter for identifiability and truncated sampling according to the
observed value of yi. Thus
Yue and Hong (2012) apply quantile tobit regression to highly skewed medical expenditure
data, focusing on the latent outcome in combination with the scale mixture ALD, while
Rahman (2016) uses the augmented data approach for quantile regression of ordinal data.
To extend quantile regression to count data, Machado and Santos Silva (2005) propose
adding uniform noise $u$ to count responses, giving $z_i = y_i + u_i$, where $u_i \sim U(0, 1)$, and apply
quantile regression of the form
for quantities
Another approach to quantile regression for overdispersed count data involves a scale
mixture version of the ALD (Yu and Moyeed, 2001), within a hierarchical Poisson lognor-
mal representation to account for overdispersion (e.g. Connolly and Thibaut, 2012). The
quantile regression is for latent outcomes at the second stage of the hierarchical model,
focused on estimating latent incidence rates or relative risks (Congdon, 2017). The Poisson
lognormal representation is in itself beneficial, since the tails of the lognormal are heavier
than for the gamma distribution, and for data with outliers, the Poisson lognormal model
may give a better fit than the negative-binomial model. Thus, for observed counts yi, one
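specifies for quantiles $q = 1, \ldots, Q$,
$$y_i \sim \text{Poi}(\mu_{iq}),$$
$$\mu_{iq} = \exp(\nu_{iq}),$$
$$\nu_{iq} \sim N\left(X_i\beta_q + \xi_q W_{iq}, \frac{2W_{iq}\delta_q}{q(1 - q)}\right),$$
$$W_{iq} \sim \text{Exp}(\delta_q).$$
This approach is less computationally intensive than the uniform noise (jittering) method.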
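The hierarchical Poisson lognormal quantile model can be simulated forwards to see how the quantile regression operates on the latent scale. This is a minimal R sketch with hypothetical coefficients, a single binary predictor and an illustrative scale parameter; as an assumption, the exponential mixing variable is given mean equal to the scale (rate $1/\delta_q$).

```r
set.seed(10)
n     <- 100000
q     <- 0.75
delta <- 0.5                                  # illustrative scale parameter
beta  <- c(0.5, 0.3)                          # hypothetical coefficients
X     <- cbind(1, rbinom(n, 1, 0.5))          # intercept plus one binary predictor

xi  <- (1 - 2 * q) / (q * (1 - q))
W   <- rexp(n, rate = 1 / delta)              # assumption: exponential with mean delta
lin <- as.vector(X %*% beta)                  # X_i beta_q

# Second stage: latent log rates follow the scale mixture form of the ALD
nu <- rnorm(n, lin + xi * W, sqrt(2 * W * delta / (q * (1 - q))))
y  <- rpois(n, exp(nu))                       # first stage: observed counts

mean(nu < lin)                                # close to q on the latent scale
```

The regression term is therefore the $q$th conditional quantile of the latent log rate, not of the observed count.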
FIGURE 12.9
(a) Trout per meter by width/depth ratio, according to regression quantiles. (b) Relationship
between density and W-D ratios.
Posterior predictive p-tests at each quantile are made using total absolute deviations
between actual responses (or replicates) and model predictions. These remain between
0.1 and 0.9 under the linear model, though are under 0.1 for middle quantiles (between
0.4 and 0.7). Predictive tests are satisfactory across all quantiles for the exponential
model.
An rstan implementation of the linear option confirms the lack of impact of w-d ratio
at q = 0.5, with β1 having mean (95% interval) of −0.0029 (−0.0072, 0.0022). However, at q =
0.9, the estimate is −0.0114 (−0.0146, −0.0081) (Figures 12.10 and 12.11).
FIGURE 12.10
Quantile Regression Coefficient Plots, Linear (left), Exponential (right), Slope of Density on Width-Depth Ratio.
FIGURE 12.11
Conditional predictive profile, trout density against width-depth ratio, 10th, 50th and 90th quantiles, exponential
transform model.
The estimation concerns only median regression (q = 0.5 in the above coding).
Table 12.1 shows the estimated coefficients. The WAIC, on the basis of a normal like-
lihood calculation, is 4,344, albeit with the least well-fitted cases having subject level
WAIC scores of 10 or more.
TABLE 12.1
Car Work Trip, Parameter Estimates
                  Mean     2.5%     50%      97.5%
β0  Intercept     4.63     3.98     4.63     5.30
β1  DCOST         0.97     0.53     0.97     1.41
β2  CAR           3.20     2.64     3.20     3.79
β3  DOVTT         1.00     0.38     0.99     1.75
β4  DIVTT         0.24     −0.32    0.23     0.82
TABLE 12.2
Physician Visits, Comparison of Estimates, Median Regression
                                      |                        Median Regression
            Negative Binomial         | Machado-Santos Silva   Machado-Santos Silva   Hierarchical HQRPLN
            (Conditional Mean)        | (via lqm.counts)       (Bayesian Estimates)   (Bayesian Estimates)
            Estimate   SE             | Estimate   SE          Mean     Std           Mean     Std
Intercept   1.00       0.05           | 0.72       0.08        0.43     0.05          0.48     0.06
hosp        0.23       0.02           | 0.26       0.03        0.28     0.02          0.25     0.02
health      −0.36      0.06           | −0.40      0.10        −0.39    0.07          −0.37    0.06
numchron    0.19       0.01           | 0.22       0.01        0.24     0.01          0.23     0.01
gender      −0.13      0.03           | −0.20      0.05        −0.20    0.04          −0.18    0.03
school      0.023      0.004          | 0.017      0.006       0.029    0.005         0.028    0.005
privins     0.19       0.04           | 0.22       0.05        0.33     0.05          0.30     0.05
TABLE 12.3
Physician Visits, Comparison of Estimates, Higher Quantiles
q = 0.75
            Machado-Santos Silva   Machado-Santos Silva   Hierarchical HQRPLN
            (via lqm.counts)       (Bayesian Estimates)   (Bayesian Estimates)
            Estimate   SE          Mean     Std           Mean     Std
Intercept   1.16       0.07        1.26     0.05          1.34     0.05
Hosp        0.26       0.04        0.26     0.02          0.25     0.02
Health      −0.37      0.06        −0.36    0.06          −0.38    0.06
Numchron    0.21       0.02        0.20     0.01          0.19     0.01
Gender      −0.13      0.04        −0.15    0.03          −0.15    0.03
School      0.026      0.005       0.020    0.003         0.019    0.004
Privins     0.21       0.05        0.21     0.04          0.18     0.04
q = 0.95
            Machado-Santos Silva   Machado-Santos Silva   Hierarchical HQRPLN
            (via lqm.counts)       (Bayesian Estimates)   (Bayesian Estimates)
            Estimate   SE          Mean     Std           Mean     Std
Intercept   1.90       0.07        2.32     0.05          2.12     0.05
Hosp        0.21       0.04        0.20     0.02          0.23     0.02
Health      −0.37      0.09        −0.35    0.05          −0.42    0.05
Numchron    0.18       0.02        0.12     0.01          0.15     0.01
Gender      −0.01      0.05        −0.07    0.03          −0.08    0.03
School      0.036      0.006       0.018    0.003         0.020    0.003
Privins     0.20       0.06        0.04     0.05          0.08     0.03
though less precisely estimated, than negative binomial regression. Posterior mean Wiq
from the HQRPLN estimation show subject 3735 as the most extreme outlier. This subject
has no physician visits, despite a high number of hospital stays and chronic conditions.
Estimated regression coefficients for higher quantiles show a diminished influence of
gender and insurance status. The Bayesian estimates for q = 0.95 also show a lessened
influence of total chronic conditions.
12.8 Computational Notes
[1] The JAGS code for the HQRPLN model is as follows:
References
Alhamzawi R (2012) R Package ‘Brq’, Bayesian Analysis of Quantile Regression Models.
https://cran.r-project.org/web/packages/Brq/Brq.pdf
Alhamzawi R, Yu K, Pan J (2011) Prior elicitation in Bayesian quantile regression for longitudinal
data. Journal of Biometrics and Biostatistics, 2, 115.
Baladandayuthapani V, Mallick B, Carroll R (2005) Spatially adaptive Bayesian penalized regression
splines (P-splines). Journal of Computational and Graphical Statistics, 14, 378–394.
Banerjee S, Ghosal S (2014) Bayesian variable selection in generalized additive partial linear models.
Stat, 3(1), 363–378.
Belitz C, Lang S (2008) Simultaneous selection of variables and smoothing parameters in structured
additive regression models. Computational Statistics & Data Analysis, 53, 61–81.
Benoit D, Van den Poel D (2012) Binary quantile regression: A Bayesian approach based on the
asymmetric Laplace distribution. Journal of Applied Econometrics, 27(7), 1174–1188.
Benoit D, Van den Poel D (2014) bayesQR: A Bayesian approach to quantile regression. Journal of
Statistical Software, 76(7). https://www.jstatsoft.org/article/view/v076i07
Benoit D, Van den Poel D (2017) bayesQR: A Bayesian approach to quantile regression. Journal of
Statistical Software, 76(7). https://www.jstatsoft.org/article/view/v076i07
Berry S, Carroll R, Ruppert D (2002) Bayesian smoothing and regression splines for measurement
error problems. Journal of the American Statistical Association, 97, 160–169.
Berzuini C, Larizza C (1996) A unified approach for modeling longitudinal and failure time data,
with application in medical monitoring. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 18(2), 109–123.
Biller C (2000) Adaptive Bayesian regression splines in semiparametric generalized linear models.
Journal of Computational and Graphical Statistics, 9, 122–140.
Biller C, Fahrmeir L (1997) Bayesian spline-type smoothing in generalized regression models.
Computational Statistics, 12, 135–151.
Bondell H, Reich B, Wang H (2010) Noncrossing quantile regression curve estimation. Biometrika,
97(4), 825–838.
Borsuk M, Stow C (2000) Bayesian parameter estimation in a mixed-order model of BOD decay.
Water Research, 34, 1830–1836.
Brezger A, Lang S (2006) Generalized structured additive regression based on Bayesian P-splines.
Computational Statistics and Data Analysis, 50, 967–991.
Brezger A, Steiner W (2008) Monotonic regression based on Bayesian P-splines: An application to
estimating price response functions from store-level scanner data. Journal of Business & Economic
Statistics, 26, 90–104.
Brumback B, Ruppert D, Wand M (1999) Variable selection and function estimation in additive
nonparametric regression using a data-based prior: Comment. Journal of the American Statistical
Association, 94, 794–797.
Brumback BA, Rice JA (1998) Smoothing spline models for the analysis of nested and crossed
samples of curves. Journal of the American Statistical Association, 93(443), 961–976.
Cade B, Noon B (2003) A gentle introduction to quantile regression for ecologists. Frontiers in Ecology
and the Environment, 1(8), 412–420.
Cai Y, Jiang T (2015) Estimation of non-crossing quantile regression curves. Australian & New Zealand
Journal of Statistics, 57, 139–162.
Chen X, Ender P, Mitchell M, Wells C (2003) Regression with Stata, from
http://www.ats.ucla.edu/stat/stata/webbooks/reg/default.htm
Chen Z (1993) Fitting multivariate regression functions by interaction spline models. Journal of the
Royal Statistical Society, Series B, 55, 473–491.
Chib S, Greenberg E (2013) On conditional variance estimation in nonparametric regression. Statistics
and Computing, 23(2), 261–270.
Chib S, Jeliazkov I (2006) Inference in semiparametric dynamic models for binary longitudinal data.
Journal of the American Statistical Association, 101(474), 685–700.
Congdon P (2006) A model framework for mortality and health data classified by age, area, and time.
Biometrics, 62(1), 269–278.
Congdon P (2017) Quantile regression for overdispersed count data: A hierarchical method. Journal
of Statistical Distributions and Applications, 4, 18.
Connolly SR, Thibaut LM (2012) A comparative analysis of alternative approaches to fitting species-
abundance models. Journal of Plant Ecology, 5(1), 32–45.
Cottet R, Kohn R, Nott D (2008) Variable selection and model averaging in semiparametric
overdispersed generalized linear models. Journal of the American Statistical Association, 103, 661–671.
Coull B, Ruppert D, Wand M (2001) Simple incorporation of interactions into additive models.
Biometrics, 57, 539–545.
Currie I, Durban M (2002) Flexible smoothing with P-splines: A unified approach. Statistical Modelling,
2, 333–349.
Deb P, Trivedi PK (1997) Demand for medical care by the elderly: A finite mixture approach. Journal
of Applied Econometrics, 12(3), 313–336.
Denison DG, Mallick BK, Smith AF (1998) Bayesian MARS. Statistics and Computing, 8(4), 337–346.
Denison D, Holmes C, Mallick B, Smith A (2002) Bayesian Methods for Non-linear Classification and
Regression. John Wiley, Chichester, UK.
Dias R, Gamerman D (2002) A Bayesian approach to hybrid splines non-parametric regression.
Journal of Statistical Computation and Simulation, 72, 285–298.
Dunham JB, Cade BS, Terrell JW (2002) Influences of spatial and temporal variation on
fish-habitat relationships defined by regression quantiles. Transactions of the American Fisheries Society,
131(1), 86–98.
Durban M, Currie I, Eilers P (2006) Multidimensional P-spline mixed models: A unified approach to
smoothing on large grids. Working Paper, Department of Statistics, Universidad Carlos III de
Madrid, Spain. http://www.unavarra.es/metma3/Papers/Invited/Durban.pdf
Eilers P, Marx B (1996) Flexible smoothing with B-splines and penalties. Statistical Science, 11, 89–121.
Eilers P, Marx B (2004) Splines, knots, and penalties. Working Paper. www.stat.lsu.edu/faculty/marx/
El Adlouni S, Salaou G, St-Hilaire A (2018) Regularized Bayesian quantile regression. Communications
in Statistics – Simulation and Computation, 47(1), 277–293.
Engle R, Granger C, Rice J, Weiss A (1986) Semiparametric estimates of the relation between weather
and electricity sales. Journal of the American Statistical Association, 81, 310–320.
Fahrmeir L, Knorr-Held L (2000) Dynamic and semiparametric models, pp 513–543, in Smoothing and
Regression: Approaches, Computation and Application, ed M Schimek. John Wiley.
Fahrmeir L, Lang S (2001) Bayesian inference for generalized additive mixed models based on
Markov random field priors. Journal of the Royal Statistical Society C, 50, 201–220.
Fahrmeir L, Tutz G (2001) Multivariate Statistical Modeling Based on Generalized Linear Models. Springer,
Berlin.
Friedman J (1991) Multivariate adaptive regression splines. Annals of Statistics, 19, 1–67.
Fuzi M, Jemain A, Ismail N (2016) Bayesian quantile regression model for claim count data. Insurance:
Mathematics and Economics, 66, 124–137.
Gelman A, Stern H, Carlin J, Dunson D, Vehtari A, Rubin D (2014) Bayesian Data Analysis, 3rd Edition.
Chapman and Hall/CRC.
Geraci M, Bottai M (2006) Quantile regression for longitudinal data using the asymmetric Laplace
distribution. Biostatistics, 8(1), 140–154.
Gustafson P (2000) Bayesian regression modelling with interactions and smooth effects. Journal of the
American Statistical Association, 95, 795–806.
Hastie T, Tibshirani R (1993) Varying coefficient models. Journal of the Royal Statistical Society B, 55,
757–796.
Hooper P (2001) Flexible regression modeling with adaptive logistic basis functions. Canadian Journal
of Statistics, 29, 343–378.
James G, Hastie T, Sugar C (2000) Principal component models for sparse functional data. Biometrika,
87, 587–602.
Jerak A, Lang S (2005) Locally adaptive function estimation for binary regression models. Biometrical
Journal, 47, 151–166.
Kharratzadeh M (2017) Splines in Stan.
https://mc-stan.org/users/documentation/case-studies/splines_in_stan.html
Kitagawa G, Gersch W (1996) Smoothness Priors Analysis of Time Series. Springer Verlag, New York.
Klein N, Kneib T, Lang S (2015) Bayesian generalized additive models for location, scale, and shape
for zero-inflated and overdispersed count data. Journal of the American Statistical Association,
110(509), 405–419.
Knorr-Held L (1999) Conditional prior proposals in dynamic models. Scandinavian Journal of Statistics,
26, 129–144.
Koenker R (2005) Quantile Regression. Cambridge University Press, Cambridge, UK.
Kohler M, Umlauf N, Beyerlein A, Winkler C, Ziegler A-G, Greven S (2016) Flexible Bayesian additive
joint models with an application to type 1 diabetes research. arXiv preprint arXiv:1611.01485
Kohn R, Schimek M, Smith M (2000) Spline and kernel regression for dependent data, Chapter 6,
pp 135–158, in Smoothing and Regression Approaches, Computation and Estimation, ed M Schimek.
John Wiley.
Kohn R, Smith M, Chan D (2001) Nonparametric regression using linear combinations of basis
functions. Statistics and Computing, 11, 313–322.
Konishi S, Ando T, Imoto S (2004) Bayesian information criteria and smoothing parameter selection
in radial basis function networks. Biometrika, 91, 27–43.
Koop G, Poirier D (2004) Bayesian variants of some classical semiparametric regression techniques.
Journal of Econometrics, 123, 259–282.
Koop G, Tole L (2004) Measuring the health effects of air pollution: To what extent can we really say
that people are dying from bad air? Journal of Environmental Economics and Management, 47,
30–54.
Koop GM (2003) Bayesian Econometrics. John Wiley & Sons Inc.
Kottas A, Gelfand AE (2001) Bayesian semiparametric median regression modeling. Journal of the
American Statistical Association, 96(456), 1458–1468.
Kozumi H, Kobayashi G (2011) Gibbs sampling methods for Bayesian quantile regression. Journal of
Statistical Computation and Simulation, 81(11), 1565–1578.