Bayesian Hierarchical Models

With Applications Using R


Second Edition

By
Peter D. Congdon
University of London, England
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742

© 2020 by Taylor & Francis Group, LLC


CRC Press is an imprint of Taylor & Francis Group, an Informa business

No claim to original U.S. Government works

Printed on acid-free paper

International Standard Book Number-13: 978-1-4987-8575-4 (Hardback)

This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been
made to publish reliable data and information, but the author and publisher cannot assume responsibility for the
validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the
copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to
publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let
us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or
utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including
photocopying, microfilming, and recording, or in any information storage or retrieval system, without written
permission from the publishers.

For permission to photocopy or use material electronically from this work, please access www.copyright.com
(http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA
01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users.
For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been
arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for
identification and explanation without intent to infringe.

Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com

and the CRC Press Web site at
http://www.crcpress.com
Contents

Preface

1. Bayesian Methods for Complex Data: Estimation and Inference
1.1 Introduction
1.2 Posterior Inference from Bayes Formula
1.3 MCMC Sampling in Relation to Monte Carlo Methods; Obtaining Posterior Inferences
1.4 Hierarchical Bayes Applications
1.5 Metropolis Sampling
1.6 Choice of Proposal Density
1.7 Obtaining Full Conditional Densities
1.8 Metropolis–Hastings Sampling
1.9 Gibbs Sampling
1.10 Hamiltonian Monte Carlo
1.11 Latent Gaussian Models
1.12 Assessing Efficiency and Convergence; Ways of Improving Convergence
1.12.1 Hierarchical Model Parameterisation to Improve Convergence
1.12.2 Multiple Chain Methods
1.13 Choice of Prior Density
1.13.1 Including Evidence
1.13.2 Assessing Posterior Sensitivity; Robust Priors
1.13.3 Problems in Prior Selection in Hierarchical Bayes Models
1.14 Computational Notes
References

2. Bayesian Analysis Options in R, and Coding for BUGS, JAGS, and Stan
2.1 Introduction
2.2 Coding in BUGS and for R Libraries Calling on BUGS
2.3 Coding in JAGS and for R Libraries Calling on JAGS
2.4 Coding for rstan
2.4.1 Hamiltonian Monte Carlo
2.4.2 Stan Program Syntax
2.4.3 The Target + Representation
2.4.4 Custom Distributions through a Functions Block
2.5 Miscellaneous Differences between Generic Packages (BUGS, JAGS, and Stan)
References

3. Model Fit, Comparison, and Checking
3.1 Introduction
3.2 Formal Model Selection
3.2.1 Formal Methods: Approximating Marginal Likelihoods
3.2.2 Importance and Bridge Sampling Estimates
3.2.3 Path Sampling
3.2.4 Marginal Likelihood for Hierarchical Models
3.3 Effective Model Dimension and Penalised Fit Measures
3.3.1 Deviance Information Criterion (DIC)
3.3.2 Alternative Complexity Measures
3.3.3 WAIC and LOO-IC
3.3.4 The WBIC
3.4 Variance Component Choice and Model Averaging
3.4.1 Random Effects Selection
3.5 Predictive Methods for Model Choice and Checking
3.5.1 Predictive Model Checking and Choice
3.5.2 Posterior Predictive Model Checks
3.5.3 Mixed Predictive Checks
3.6 Computational Notes
References

4. Borrowing Strength via Hierarchical Estimation
4.1 Introduction
4.2 Hierarchical Priors for Borrowing Strength Using Continuous Mixtures
4.3 The Normal-Normal Hierarchical Model and Its Applications
4.3.1 Meta-Regression
4.4 Prior for Second Stage Variance
4.4.1 Non-Conjugate Priors
4.5 Multivariate Meta-Analysis
4.6 Heterogeneity in Count Data: Hierarchical Poisson Models
4.6.1 Non-Conjugate Poisson Mixing
4.7 Binomial and Multinomial Heterogeneity
4.7.1 Non-Conjugate Priors for Binomial Mixing
4.7.2 Multinomial Mixtures
4.7.3 Ecological Inference Using Mixture Models
4.8 Discrete Mixtures and Semiparametric Smoothing Methods
4.8.1 Finite Mixtures of Parametric Densities
4.8.2 Finite Mixtures of Standard Densities
4.8.3 Inference in Mixture Models
4.8.4 Particular Types of Discrete Mixture Model
4.8.5 The Logistic-Normal Alternative to the Dirichlet Prior
4.9 Semiparametric Modelling via Dirichlet Process and Polya Tree Priors
4.9.1 Specifying the Baseline Density
4.9.2 Truncated Dirichlet Processes and Stick-Breaking Priors
4.9.3 Polya Tree Priors
4.10 Computational Notes
References

5. Time Structured Priors
5.1 Introduction
5.2 Modelling Temporal Structure: Autoregressive Models
5.2.1 Random Coefficient Autoregressive Models
5.2.2 Low Order Autoregressive Models
5.2.3 Antedependence Models
5.3 State-Space Priors for Metric Data
5.3.1 Simple Signal Models
5.3.2 Sampling Schemes
5.3.3 Basic Structural Model
5.3.4 Identification Questions
5.3.5 Nonlinear State-Space Models for Continuous Data
5.4 Time Series for Discrete Responses; State-Space Priors and Alternatives
5.4.1 Other Approaches
5.5 Stochastic Variances
5.6 Modelling Discontinuities in Time
5.7 Computational Notes
References

6. Representing Spatial Dependence
6.1 Introduction
6.2 Spatial Smoothing and Prediction for Area Data
6.2.1 SAR Schemes
6.3 Conditional Autoregressive Priors
6.3.1 Linking Conditional and Joint Specifications
6.3.2 Alternative Conditional Priors
6.3.3 ICAR(1) and Convolution Priors
6.4 Priors on Variances in Conditional Spatial Models
6.5 Spatial Discontinuity and Robust Smoothing
6.6 Models for Point Processes
6.6.1 Covariance Functions
6.6.2 Sparse and Low Rank Approaches
6.7 Discrete Convolution Models
6.8 Computational Notes
References

7. Regression Techniques Using Hierarchical Priors
7.1 Introduction
7.2 Predictor Selection
7.2.1 Predictor Selection
7.2.2 Shrinkage Priors
7.3 Categorical Predictors and the Analysis of Variance
7.3.1 Testing Variance Components
7.4 Regression for Overdispersed Data
7.4.1 Overdispersed Poisson Regression
7.4.2 Overdispersed Binomial and Multinomial Regression
7.5 Latent Scales for Binary and Categorical Data
7.5.1 Augmentation for Ordinal Responses
7.6 Heteroscedasticity and Regression Heterogeneity
7.6.1 Nonconstant Error Variances
7.6.2 Varying Regression Effects via Discrete Mixtures
7.6.3 Other Applications of Discrete Mixtures
7.7 Time Series Regression: Correlated Errors and Time-Varying Regression Effects
7.7.1 Time-Varying Regression Effects
7.8 Spatial Regression
7.8.1 Spatial Lag and Spatial Error Models
7.8.2 Simultaneous Autoregressive Models
7.8.3 Conditional Autoregression
7.8.4 Spatially Varying Regression Effects: GWR and Bayesian SVC Models
7.8.5 Bayesian Spatially Varying Coefficients
7.8.6 Bayesian Spatial Predictor Selection Models
7.9 Adjusting for Selection Bias and Estimating Causal Effects
7.9.1 Propensity Score Adjustment
7.9.2 Establishing Causal Effects: Mediation and Marginal Models
7.9.3 Causal Path Sequences
7.9.4 Marginal Structural Models
References

8. Bayesian Multilevel Models
8.1 Introduction
8.2 The Normal Linear Mixed Model for Hierarchical Data
8.2.1 The Lindley–Smith Model Format
8.3 Discrete Responses: GLMM, Conjugate, and Augmented Data Models
8.3.1 Augmented Data Multilevel Models
8.3.2 Conjugate Cluster Effects
8.4 Crossed and Multiple Membership Random Effects
8.5 Robust Multilevel Models
References

9. Factor Analysis, Structural Equation Models, and Multivariate Priors
9.1 Introduction
9.2 Normal Linear Structural Equation and Factor Models
9.2.1 Forms of Model
9.2.2 Model Definition
9.2.3 Marginal and Complete Data Likelihoods, and MCMC Sampling
9.3 Identifiability and Priors on Loadings
9.3.1 An Illustration of Identifiability Issues
9.4 Multivariate Exponential Family Outcomes and Generalised Linear Factor Models
9.4.1 Multivariate Count Data
9.4.2 Multivariate Binary Data and Item Response Models
9.4.3 Latent Scale IRT Models
9.4.4 Categorical Data
9.5 Robust Density Assumptions in Factor Models
9.6 Multivariate Spatial Priors for Discrete Area Frameworks
9.7 Spatial Factor Models
9.8 Multivariate Time Series
9.8.1 Multivariate Dynamic Linear Models
9.8.2 Dynamic Factor Analysis
9.8.3 Multivariate Stochastic Volatility
9.9 Computational Notes
References

10. Hierarchical Models for Longitudinal Data
10.1 Introduction
10.2 General Linear Mixed Models for Longitudinal Data
10.2.1 Centred or Non-Centred Priors
10.2.2 Priors on Unit Level Random Effects
10.2.3 Priors for Random Covariance Matrix and Random Effect Selection
10.2.4 Priors for Multiple Sources of Error Variation
10.3 Temporal Correlation and Autocorrelated Residuals
10.3.1 Explicit Temporal Schemes for Errors
10.4 Longitudinal Categorical Choice Data
10.5 Observation Driven Autocorrelation: Dynamic Longitudinal Models
10.5.1 Dynamic Models for Discrete Data
10.6 Robust Longitudinal Models: Heteroscedasticity, Generalised Error Densities, and Discrete Mixtures
10.6.1 Robust Longitudinal Data Models: Discrete Mixture Models
10.7 Multilevel, Multivariate, and Multiple Time Scale Longitudinal Data
10.7.1 Latent Trait Longitudinal Models
10.7.2 Multiple Scale Longitudinal Data
10.8 Missing Data in Longitudinal Models
10.8.1 Forms of Missingness Regression (Selection Approach)
10.8.2 Common Factor Models
10.8.3 Missing Predictor Data
10.8.4 Pattern Mixture Models
References

11. Survival and Event History Models
11.1 Introduction
11.2 Survival Analysis in Continuous Time
11.2.1 Counting Process Functions
11.2.2 Parametric Hazards
11.2.3 Accelerated Hazards
11.3 Semiparametric Hazards
11.3.1 Piecewise Exponential Priors
11.3.2 Cumulative Hazard Specifications
11.4 Including Frailty
11.4.1 Cure Rate Models
11.5 Discrete Time Hazard Models
11.5.1 Life Tables
11.6 Dependent Survival Times: Multivariate and Nested Survival Times
11.7 Competing Risks
11.7.1 Modelling Frailty
11.8 Computational Notes
References

12. Hierarchical Methods for Nonlinear and Quantile Regression
12.1 Introduction
12.2 Non-Parametric Basis Function Models for the Regression Mean
12.2.1 Mixed Model Splines
12.2.2 Basis Functions Other Than Truncated Polynomials
12.2.3 Model Selection
12.3 Multivariate Basis Function Regression
12.4 Heteroscedasticity via Adaptive Non-Parametric Regression
12.5 General Additive Methods
12.6 Non-Parametric Regression Methods for Longitudinal Analysis
12.7 Quantile Regression
12.7.1 Non-Metric Responses
12.8 Computational Notes
References

Index
Preface

My gratitude is due to Taylor & Francis for proposing a revision of Applied Bayesian
Hierarchical Methods, first published in 2010. The revision maintains the goals of presenting
an overview of modelling techniques from a Bayesian perspective, with a view to
practical data analysis. The new book is distinctive in its computational environment,
which is entirely R focused. Worked examples are based particularly on rjags and jagsUI,
R2OpenBUGS, and rstan. Many thanks are due to the following for comments on chapters
or computing advice: Sid Chib, Andrew Finley, Ken Kellner, Casey Youngflesh,
Kaushik Chowdhury, Mahmoud Torabi, Matt Denwood, Nikolaus Umlauf, Marco Geraci,
Howard Seltman, Longhai Li, Paul Buerkner, Guanpeng Dong, Bob Carpenter, Mitzi
Morris, and Benjamin Cowling. Programs for the book can be obtained from my website
at https://www.qmul.ac.uk/geog/staff/congdonp.html or from
https://www.crcpress.com/Bayesian-Hierarchical-Models-With-Applications-Using-R-Second-Edition/Congdon/p/book/9781498785754.
Please send comments or questions to me at [email protected].

QMUL, London

1. Bayesian Methods for Complex Data: Estimation and Inference

1.1 Introduction
The Bayesian approach to inference focuses on updating knowledge about unknown
parameters θ in a statistical model on the basis of observations y, with revised knowledge
expressed in the posterior density p(θ|y). The sample of observations y being analysed
provides new information about the unknowns, while the prior density p(θ) represents
accumulated knowledge about them before observing or analysing the data. There is
considerable flexibility with which prior evidence about parameters can be incorporated
into an analysis, and use of informative priors can reduce the possibility of confounding
and provides a natural basis for evidence synthesis (Shoemaker et al., 1999; Dunson, 2001;
Vanpaemel, 2011; Klement et al., 2018). The Bayes approach provides uncertainty intervals
on parameters that are consonant with everyday interpretations (Willink and Lira, 2005;
Wetzels et al., 2014; Krypotos et al., 2017), and has no problem comparing the fit of
non-nested models, such as a nonlinear model and its linearised version.
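
As a compact reminder of the updating just described, the relation between prior and posterior can be written as below. The likelihood p(y|θ) and the marginal likelihood p(y) are notation assumed here for illustration; the text has not introduced them at this point.

```latex
p(\theta \mid y) = \frac{p(y \mid \theta)\, p(\theta)}{p(y)}
\;\propto\; p(y \mid \theta)\, p(\theta),
\qquad
p(y) = \int p(y \mid \theta)\, p(\theta)\, d\theta .
```

The posterior is thus proportional to the likelihood times the prior, with p(y) acting as a normalising constant that does not depend on θ.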
Furthermore, Bayesian estimation and inference have a number of advantages in terms
of their relevance to the types of data and problems tackled by modern scientific research,
which are a primary focus later in the book. Bayesian estimation via repeated sampling
from posterior densities facilitates modelling of complex data, with random effects treated
as unknowns and not integrated out as is sometimes done in frequentist approaches
(Davidian and Giltinan, 2003). For example, much of the data in social and health research
has a complex structure, involving hierarchical nesting of subjects (e.g. pupils within
schools), crossed classifications (e.g. patients classified by clinic and by homeplace),
spatially configured data, or repeated measures on subjects (MacNab et al., 2004). The
Bayesian approach naturally adapts to such hierarchically or spatio-temporally correlated
effects via conditionally specified hierarchical priors under a three-stage scheme (Lindley
and Smith, 1972; Clark and Gelfand, 2006; Gustafson et al., 2006; Cressie et al., 2009), with
the first stage specifying the likelihood of the data, given unknown random individual or
cluster effects; the second stage specifying the density of the random effects; and the third
stage providing priors on parameters underlying the random effects density or densities.
The increased application of Bayesian methods has owed much to the development of
Markov chain Monte Carlo (MCMC) algorithms for estimation (Gelfand and Smith, 1990;
Gilks et al., 1996; Neal, 2011), which draw repeated parameter samples from the posterior
distributions of statistical models, including complex models (e.g. models with multiple
or nested random effects). Sampling based parameter estimation via MCMC provides
a full posterior density of a parameter so that any clear non-normality is apparent, and
hypotheses about parameters or interval estimates can be assessed from the MCMC samples
without the assumptions of asymptotic normality underlying many frequentist tests.
However, MCMC methods may in practice show slow convergence, and implementation of
some MCMC methods (such as Hamiltonian Monte Carlo) with advantageous estimation
features, including faster convergence, has been improved through package development
(rstan) in R.
As mentioned in the Preface, a substantial emphasis in the book is placed on implemen-
tation and data analysis for tutorial purposes, via illustrative data analysis and attention
to statistical computing. Accordingly, worked examples in R code in the rest of the chap-
ter illustrate MCMC sampling and Bayesian posterior inference from first principles. In
subsequent chapters R based packages, such as jagsUI, rjags, R2OpenBUGS, and rstan are
used for computation.
As just mentioned, Bayesian modelling of hierarchical and random effect models via
MCMC techniques has extended the scope for modern data analysis. Despite this, applica-
tion of Bayesian techniques also raises particular issues, although these have been allevi-
ated by developments such as integrated nested Laplace approximation (Rue et al., 2009)
and practical implementation of Hamiltonian Monte Carlo (Carpenter et al., 2017). These
include:

a) Propriety and identifiability issues when diffuse priors are applied to variance or
dispersion parameters for random effects (Hobert and Casella, 1996; Palmer and
Pettit, 1996; Hadjicostas and Berry, 1999; Yue et al., 2012);
b) Selecting the most suitable form of prior for variance parameters (Gelman, 2006)
or the most suitable prior for covariance modelling (Lewandowski et al., 2009);
c) Appropriate priors for models with random effects, to avoid potential overfitting
(Simpson et al., 2017; Fuglstad et al., 2018) or oversmoothing in the presence of
genuine outliers in spatial applications (Conlon and Louis, 1999);
d) The scope for specification bias in hierarchical models for complex data structures
where a range of plausible model structures are possible (Chiang et al., 1999).

1.2 Posterior Inference from Bayes Formula


Statistical analysis uses probability models to summarise univariate or multivariate
observations y = (y1, …, yn) by a collection of unknown parameters of dimension (say
d), θ = (θ1, …, θd). Consider the joint density p(y,θ) = p(y|θ)p(θ), where p(y|θ) is the
sampling model or likelihood, and p(θ) defines existing knowledge, or expresses assumptions
regarding the unknowns that can be justified by the nature of the application (e.g. that
random effects are spatially distributed in an area application). A Bayesian analysis seeks
to update knowledge about the unknowns θ using the data y, and so interest focuses on
the posterior density p(θ|y) of the unknowns. Since p(y,θ) also equals p(y)p(θ|y) where p(y)
is the unconditional density of the data (also known as the marginal likelihood), one may
obtain

$$p(y,\theta) = p(y|\theta)\,p(\theta) = p(y)\,p(\theta|y). \tag{1.1}$$

This can be rearranged to provide the required posterior density as

$$p(\theta|y) = \frac{p(y|\theta)\,p(\theta)}{p(y)}. \tag{1.2}$$
The marginal likelihood p(y) may be obtained by integrating the numerator on the right
side of (1.2) over the support for θ, namely


$$p(y) = \int p(y|\theta)\,p(\theta)\,d\theta.$$

From (1.2), the term p(y) therefore acts as a normalising constant necessary to ensure p(θ|y)
integrates to 1, and so one may write

$$p(\theta|y) = k\,p(y|\theta)\,p(\theta), \tag{1.3}$$

where k = 1/p(y) is an unknown constant. Alternatively stated, the posterior density
(updated evidence) is proportional to the likelihood (data evidence) times the prior (historic
evidence or elicited model assumptions). Taking logs in (1.3), one has

$$\log[p(\theta|y)] = \log(k) + \log[p(y|\theta)] + \log[p(\theta)]$$

and log[p(y|θ)] + log[p(θ)] is generally referred to as the log posterior, which some R
programs (e.g. rstan) allow to be directly specified as the estimation target.
In some cases, when the prior on θ is conjugate to the likelihood (i.e. the posterior on θ has
the same density form as the prior), the posterior density and marginal likelihood can be obtained analytically.
When θ is low-dimensional, numerical integration is an alternative, and approximations to
the required integrals can be used, such as the Laplace approximation (Raftery, 1996; Chen
and Wang, 2011). In more complex applications, such approximations are not feasible, and
integration to obtain p(y) is intractable, so that direct sampling from p(θ|y) is not feasible.
In such situations, MCMC methods provide a way to sample from p(θ|y) without it having
a specific analytic form. They create a Markov chain of sampled values θ(1), …, θ(T) with
transition kernel K(θcand|θcurr) (investigating transitions from current to candidate values
for parameters) that has p(θ|y) as its limiting distribution. Using large samples from
the posterior distribution obtained by MCMC, one can estimate posterior quantities of
interest such as posterior means, medians, and highest density regions (Hyndman, 1996;
Chen and Shao, 1998).
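As a minimal sketch of formulas (1.2) and (1.3) when θ is one-dimensional, the marginal likelihood p(y) can be approximated by numerical integration over a grid and checked against a conjugate result. The binomial data (y = 7 successes in n = 20 trials), the Beta(2, 2) prior, and all object names below are illustrative assumptions, not taken from the text.

```r
# Toy binomial data and Beta prior (illustrative assumptions, not from the text)
y <- 7; n <- 20
a0 <- 2; b0 <- 2

# Midpoint grid over the support of theta
width <- 0.001
theta <- seq(width / 2, 1 - width / 2, by = width)

prior <- dbeta(theta, a0, b0)      # p(theta)
lik   <- dbinom(y, n, theta)       # p(y | theta)

# Marginal likelihood p(y) = integral of p(y | theta) p(theta) d theta
p_y <- sum(lik * prior) * width

# Posterior density from Bayes formula (1.2)
post_grid <- lik * prior / p_y

# Conjugate benchmark: the posterior is Beta(a0 + y, b0 + n - y)
post_exact <- dbeta(theta, a0 + y, b0 + n - y)
p_y_exact  <- choose(n, y) * beta(a0 + y, b0 + n - y) / beta(a0, b0)

max_abs_diff   <- max(abs(post_grid - post_exact))
post_mean_grid <- sum(theta * post_grid) * width
```

Grid integration of this kind is only practical for low-dimensional θ, which is the motivation for the sampling methods that follow.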

1.3 MCMC Sampling in Relation to Monte Carlo Methods; Obtaining Posterior Inferences
Markov chain Monte Carlo (MCMC) methods are iterative sampling methods that can be
encompassed within the broad class of Monte Carlo methods. However, MCMC methods
must be distinguished from conventional Monte Carlo methods that generate independent
simulations {u(1), u(2), …, u(T)} from a target density π(u). From such simulations, the expectation
of a function g(u) under π(u), namely

$$E_\pi[g(u)] = \int g(u)\,\pi(u)\,du,$$

is estimated as

$$\bar{g} = \frac{1}{T}\sum_{t=1}^{T} g(u^{(t)})$$

and, under independent sampling from π(u), ḡ tends to Eπ[g(u)] as T → ∞. However, such
independent sampling from the posterior density p(θ|y) is not usually feasible.
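A minimal sketch of conventional Monte Carlo estimation with independent draws may help fix ideas. The target density N(1, 4) and the function g(u) = u² are illustrative assumptions, chosen because the exact expectation (mean² + variance = 5) is known.

```r
set.seed(1)
n_draws <- 100000                       # number of independent simulations (T in the text)
u <- rnorm(n_draws, mean = 1, sd = 2)   # independent draws from pi(u) = N(1, 4)

g <- function(u) u^2                    # function whose expectation is required

g_bar <- mean(g(u))                     # Monte Carlo estimate of E[g(u)]
mc_se <- sd(g(u)) / sqrt(n_draws)       # standard error, valid under independent sampling
exact <- 1^2 + 2^2                      # E[u^2] = mean^2 + variance
```

The simple standard error formula relies on independence; it understates uncertainty when applied unchanged to autocorrelated MCMC output.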
When suitably implemented, MCMC methods offer an effective alternative way to gen-
erate samples from the joint posterior distribution, p(θ|y), but differ from conventional
Monte Carlo methods in that successive sampled parameters are dependent or autocorre-
lated. The target density for MCMC samples is therefore the posterior density π(θ) = p(θ|y)
and MCMC sampling is especially relevant when the posterior cannot be stated exactly
in analytic form e.g. when the prior density assumed for θ is not conjugate with the like-
lihood p(y|θ). The fact that successive sampled values are dependent means that larger
samples are needed for equivalent precision, and the effective number of samples is less
than the nominal number.
For the parameter sampling case, assume a preset initial parameter value θ(0). Then
MCMC methods involve repeated iterations to generate a correlated sequence of sampled
values θ(t) (t = 1, 2, 3, …), where updated values θ(t) are drawn from a transition distribution

$$K(\theta^{(t)}|\theta^{(0)},\ldots,\theta^{(t-1)}) = K(\theta^{(t)}|\theta^{(t-1)})$$

that is Markovian in the sense of depending only on θ(t−1). The transition distribution
K(θ(t)|θ(t−1)) is chosen to satisfy additional conditions ensuring that the sequence has
the joint posterior density p(θ|y) as its stationary distribution. These conditions typically
reduce to requirements on the proposal and acceptance procedure used to generate can-
didate parameter samples. The proposal density and acceptance rule must be specified in
a way that guarantees irreducibility and positive recurrence; see, for example, Andrieu
and Moulines (2006). Under such conditions, the sampled parameters θ(t) {t = B + 1, …, T},
beyond a certain burn-in or warm-up phase in the sampling (of B iterations), can be viewed
as a random sample from p(θ|y) (Roberts and Rosenthal, 2004).
In practice, MCMC methods are applied separately to individual parameters or blocks of
more than one parameter (Roberts and Sahu, 1997). So, assuming θ contains more than one
parameter and consists of C components or blocks {θ1, …, θC}, different updating methods
may be used for each component, including block updates.
There is no limit to the number of samples T of θ which may be taken from the poste-
rior density p(θ|y). Estimates of the marginal posterior densities for each parameter can
be made from the MCMC samples, including estimates of location (e.g. posterior means,
modes, or medians), together with the estimated certainty or precision of these parameters
in terms of posterior standard deviations, credible intervals, or highest posterior density
intervals. For example, the 95% credible interval for θh may be estimated using the 0.025
and 0.975 quantiles of the sampled output {θh(t), t = B + 1, …, T}. To reduce irregularities in
the histogram of sampled values for a particular parameter, a smooth form of the posterior
density can be approximated by applying kernel density methods to the sampled values.
Monte Carlo posterior summaries typically include estimated posterior means and vari-
ances of the parameters, obtainable as moment estimates from the MCMC output, namely
$$\hat{E}(\theta_h) = \bar{\theta}_h = \sum_{t=B+1}^{T} \theta_h^{(t)}/(T-B),$$

$$\hat{V}(\theta_h) = \sum_{t=B+1}^{T} (\theta_h^{(t)} - \bar{\theta}_h)^2/(T-B).$$

This is equivalent to estimating the integrals


$$E(\theta_h|y) = \int \theta_h\,p(\theta|y)\,d\theta,$$

$$V(\theta_h|y) = \int \theta_h^2\,p(\theta|y)\,d\theta - [E(\theta_h|y)]^2 = E(\theta_h^2|y) - [E(\theta_h|y)]^2.$$
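The moment estimates and interval summaries described above can be sketched in a few lines of R. Because no sampler has yet been introduced, the chain below is a stand-in: an autocorrelated sequence with known stationary distribution N(2, 1), started from a deliberately poor initial value. The chain, its parameters, and the object names are assumptions for illustration only.

```r
set.seed(2)
n_iter <- 20000   # total iterations (T in the text)
B      <- 2000    # burn-in iterations to discard
rho    <- 0.8     # autocorrelation of the stand-in chain

# Stand-in for MCMC output: dependent draws with stationary distribution N(2, 1)
theta <- numeric(n_iter)
theta[1] <- 10    # poor starting value, removed by the burn-in
for (t in 2:n_iter) {
  theta[t] <- 2 + rho * (theta[t - 1] - 2) + sqrt(1 - rho^2) * rnorm(1)
}

kept <- theta[(B + 1):n_iter]

# Moment estimates of the posterior mean and variance
post_mean <- sum(kept) / (n_iter - B)
post_var  <- sum((kept - post_mean)^2) / (n_iter - B)

# 95% credible interval from the 0.025 and 0.975 quantiles
ci95 <- quantile(kept, c(0.025, 0.975))

# Smoothed estimate of the marginal posterior density
dens <- density(kept)
```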

One may also use the MCMC output to obtain posterior means, variances, and
credible intervals for functions Δ = Δ(θ) of the parameters (van Dyk, 2003). These are esti-
mates of the integrals


$$E[\Delta(\theta)|y] = \int \Delta(\theta)\,p(\theta|y)\,d\theta,$$

$$V[\Delta(\theta)|y] = \int \Delta^2\,p(\theta|y)\,d\theta - [E(\Delta|y)]^2 = E(\Delta^2|y) - [E(\Delta|y)]^2.$$

For Δ(θ), its posterior mean is obtained by calculating Δ(t) at every MCMC iteration from
the sampled values θ(t). The theoretical justification for such estimates is provided by the
MCMC version of the law of large numbers (Tierney, 1994), namely that

$$\sum_{t=B+1}^{T} \frac{\Delta[\theta^{(t)}]}{T-B} \to E_\pi[\Delta(\theta)],$$

provided that the expectation of Δ(θ) under π(θ) = p(θ|y), denoted Eπ[Δ(θ)], exists. MCMC
methods also allow inferences on parameter comparisons (e.g. ranks of parameters or con-
trasts between them) (Marshall and Spiegelhalter, 1998).
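To illustrate inference on a function Δ(θ), the sketch below evaluates Δ at every sampled value and then summarises the resulting draws. The draws of θ1 and θ2 are independent normal stand-ins for post burn-in MCMC output, and Δ = exp(θ1 − θ2) is a hypothetical contrast chosen because its exact mean is available for comparison.

```r
set.seed(10)
n_draws <- 20000

# Stand-ins for retained MCMC draws of two parameters (illustrative assumption)
theta1 <- rnorm(n_draws, 0.5, 0.1)
theta2 <- rnorm(n_draws, 0.2, 0.1)

# Evaluate Delta(theta) at every draw, then summarise
delta      <- exp(theta1 - theta2)
delta_mean <- mean(delta)
delta_ci   <- quantile(delta, c(0.025, 0.975))

# Plugging posterior means into Delta is not the same as averaging Delta
plug_in <- exp(mean(theta1) - mean(theta2))

# Exact mean of the lognormal contrast under the assumed distributions
exact_mean <- exp(0.3 + 0.02 / 2)
```

Because Δ is nonlinear, the plug-in value differs from the posterior mean of Δ, which is why Δ(t) is computed at each iteration.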

1.4 Hierarchical Bayes Applications


The paradigm in Section 1.2 is appropriate to many problems, where uncertainty is limited
to a few fundamental parameters, the number of which is independent of the sample size
n – this is the case, for example, in a normal linear regression when the independent vari-
ables are known without error and the units are not hierarchically structured. However,
in more complex data sets or with more complex forms of model or response, a more gen-
eral perspective than that implied by (1.1)–(1.3) is available, and also implementable, using
MCMC methods.
Thus, a class of hierarchical Bayesian models is defined by latent data (Paap, 2002;
Clark and Gelfand, 2006) intermediate between the observed data and the underlying
parameters (hyperparameters) driving the process. A terminology useful for relating hier-
archical models to substantive issues is proposed by Wikle (2003) in which y defines the
data stage, latent effects b define the process stage, and ξ defines the hyperparameter stage.
For example, the observations i = 1,…,n may be arranged in clusters j = 1, …, J, so that the
observations can no longer be regarded as independent. Rather, subjects from the same
cluster will tend to be more alike than individuals from different clusters, reflecting latent
variables that induce dependence within clusters.
Let the parameters θ = [θL,θb] consist of parameter subsets relevant to the likelihood and
to the latent data density respectively. The data are generally taken as independent of θb
given b, so modelling intermediate latent effects involves a three-stage hierarchical Bayes
(HB) prior set-up

$$p(y,b,\theta) = p(y|b,\theta_L)\,p(b|\theta_b)\,p(\theta_L,\theta_b), \tag{1.4}$$

with a first stage likelihood p(y|b,θL) and a second stage density p(b|θb) for the latent data,
with conditioning on higher stage parameters θ. The first stage density p(y|b,θL) in (1.4) is
a conditional likelihood, conditioning on b, and sometimes called the complete data or
augmented data likelihood. The application of Bayes’ theorem now specifies

$$p(\theta,b|y) = \frac{p(y|b,\theta_L)\,p(b|\theta_b)\,p(\theta)}{p(y)},$$

and the marginal posterior for θ may now be represented as

$$p(\theta|y) = \frac{p(\theta)\,p(y|\theta)}{p(y)} = \frac{p(\theta)\int p(y|b,\theta_L)\,p(b|\theta_b)\,db}{p(y)},$$

where

$$p(y|\theta) = \int p(y,b|\theta)\,db = \int p(y|b,\theta_L)\,p(b|\theta_b)\,db,$$

is the observed data likelihood, namely the complete data likelihood with b integrated out,
sometimes also known as the integrated likelihood.
Often the latent data exist for every observation, or they may exist for each cluster in
which the observations are structured (e.g. a school specific effect bj for multilevel data yij
on pupils i nested in schools j). The latent variables b can be seen as a population of values
from an underlying density (e.g. varying log odds of disease) and the θb are then popula-
tion hyperparameters (e.g. mean and variance of the log odds) (Dunson, 2001). As exam-
ples, Paap (2002) mentions unobserved states describing the business cycle and Johannes
and Polson (2006) mention unobserved volatilities in stochastic volatility models, while
Albert and Chib (1993) consider the missing or latent continuous data {b1, …, bn} which
underlie binary observations {y1, …, yn}. The subject specific latent traits in psychometric or
educational item analysis can also be considered this way (Fox, 2010), as can the variance
scaling factors in the robust Student t errors version of linear regression (Geweke, 1993) or
subject specific slopes in a growth curve analysis of panel data on a collection of subjects
(Oravecz and Muth, 2018).
Typically, the integrated likelihood p(y|θ) cannot be stated in closed form and classical
likelihood estimation relies on numerical integration or simulation (Paap, 2002, p.15). By
contrast, MCMC methods can be used to generate random samples indirectly from the
posterior distribution p(θ,b|y) of parameters and latent data given the observations. This
requires only that the augmented data likelihood be known in closed form, without need-
ing to obtain the integrated likelihood p(y|θ). To see why, note that the marginal posterior
of the parameter set θ may alternatively be derived as


$$p(\theta|y) = \int p(\theta,b|y)\,db = \int p(\theta|y,b)\,p(b|y)\,db,$$

with marginal densities for component parameters θh of the form (Paap, 2002, p.5)

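$$p(\theta_h|y) = \int_{\theta_{[h]}} \int_b p(\theta,b|y)\,db\,d\theta_{[h]}$$

$$\propto \int_{\theta_{[h]}} p(y|\theta)\,p(\theta)\,d\theta_{[h]} = \int_{\theta_{[h]}} \int_b p(\theta)\,p(y|b,\theta)\,p(b|\theta)\,db\,d\theta_{[h]},$$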

where θ[h] consists of all parameters in θ with the exception of θh. The derivation of suitable
MCMC algorithms to sample from p(θ,b|y) is based on the Clifford–Hammersley theorem,
namely that any joint distribution can be fully characterised by its complete conditional
distributions. In the hierarchical Bayes context, this implies that the conditionals p(b|θ,y)
and p(θ|b,y) characterise the joint distribution p(θ,b|y) from which samples are sought, and
so MCMC sampling can alternate between updates from the conditional densities
p(b(t)|θ(t−1),y) and p(θ(t)|b(t),y), which are usually of simpler form than p(θ,b|y). The imputation of latent
data in this way is sometimes known as data augmentation (van Dyk, 2003).
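The alternating scheme can be sketched for a deliberately simple case: a normal model y_j ~ N(b_j, s2) with latent effects b_j ~ N(μ, tau2), known variances s2 and tau2, and a flat prior on the single hyperparameter μ. These modelling choices, the simulated data, and the object names are assumptions made so that both conditionals are normal and the marginal posterior of μ, namely N(ȳ, (s2 + tau2)/J), is available as a check.

```r
set.seed(3)
J    <- 30
s2   <- 1      # known first stage variance (assumption)
tau2 <- 4      # known second stage variance (assumption)

# Simulated data for illustration
b_true <- rnorm(J, 5, sqrt(tau2))
y      <- rnorm(J, b_true, sqrt(s2))

n_iter <- 6000
B      <- 1000
mu     <- numeric(n_iter)
b      <- matrix(NA_real_, n_iter, J)

mu_curr <- 0                 # initial value for the hyperparameter
prec    <- 1 / s2 + 1 / tau2 # conditional precision of each b_j

for (t in 1:n_iter) {
  # Update latent effects from p(b | mu, y)
  b_curr <- rnorm(J, (y / s2 + mu_curr / tau2) / prec, sqrt(1 / prec))
  # Update hyperparameter from p(mu | b, y), which depends on y only through b
  mu_curr <- rnorm(1, mean(b_curr), sqrt(tau2 / J))
  b[t, ] <- b_curr
  mu[t]  <- mu_curr
}

keep    <- (B + 1):n_iter
mu_mean <- mean(mu[keep])
mu_var  <- var(mu[keep])
```

Neither update requires the integrated likelihood p(y|μ); only the complete data likelihood and the second stage density are used.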
To illustrate the application of MCMC methods to parameter comparisons and hypoth-
esis tests in an HB setting, Shen and Louis (1998) consider hierarchical models with unit
or cluster specific parameters bj, and show that if such parameters are the focus of interest,
their posterior means are the optimal estimates. Suppose instead that the ranks of the unit
or cluster parameters, namely

$$R_j = \mathrm{rank}(b_j) = \sum_{k \neq j} I(b_j \geq b_k),$$

(where I(A) is an indicator function which equals 1 when A is true, 0 otherwise) are
required for deriving “league tables”. Then the conditional expected ranks are optimal,
and obtained by ranking the bj at each MCMC iteration, and taking the means of these
ranks over all samples. By contrast, ranking posterior means of the bj themselves can
perform poorly (Laird and Louis, 1989; Goldstein and Spiegelhalter, 1996). Similarly,
when the empirical distribution function of the unit parameters (e.g. to be used to obtain
the fraction of parameters above a threshold) is required, the conditional expected EDF
is optimal.
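A minimal sketch of the expected rank calculation follows. The matrix b holds stand-in draws (iterations in rows, units in columns) generated from independent normals with assumed means and standard deviations; in an application it would be the retained MCMC output. The helper rank_fun is a hypothetical name implementing the indicator sum in the rank definition.

```r
set.seed(4)
n_draws <- 5000
J <- 5
m <- c(0, 0.1, 0.2, 1, 2)     # assumed posterior means of the unit effects
s <- c(1, 0.1, 0.5, 0.3, 1)   # assumed posterior standard deviations

# Stand-in for MCMC output: n_draws x J matrix of sampled b_j
b <- sapply(1:J, function(j) rnorm(n_draws, m[j], s[j]))

# Rank of each unit within one iteration: sum over k != j of I(b_j >= b_k)
rank_fun <- function(bt) rowSums(outer(bt, bt, ">=")) - 1

# Rank the b_j at each iteration, then average the ranks over iterations
ranks         <- t(apply(b, 1, rank_fun))
expected_rank <- colMeans(ranks)

# For comparison: ranks of the posterior means themselves
rank_of_means <- rank_fun(colMeans(b))
```

Under this definition ranks run from 0 to J − 1; the expected ranks are generally non-integer and are pulled towards the middle for units with uncertain effects.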
A posterior probability estimate that a particular bj exceeds a threshold τ, namely of the
integral $\Pr(b_j > \tau|y) = \int_\tau^\infty p(b_j|y)\,db_j$, is provided by the proportion of iterations where bj(t)
exceeds τ, namely

$$\widehat{\Pr}(b_j > \tau|y) = \sum_{t=B+1}^{T} I(b_j^{(t)} > \tau)/(T-B).$$

Thus, one might, in an epidemiological application, wish to obtain the posterior probabil-
ity that an area’s smoothed relative mortality risk bj exceeds unity, and so count iterations
where this condition holds. If this probability exceeds a threshold such as 0.9, then a sig-
nificant excess risk is indicated, whereas a low exceedance probability (the sampled rela-
tive risk rarely exceeded 1) would indicate a significantly low mortality level in the area.
In fact, the significance of individual random effects is one aspect of assessing the gain of
a random effects model over a model involving only fixed effects, or of assessing whether
a more complex random effects model offers a benefit over a simpler one (Knorr-Held and
Rainer, 2001, p.116). Since the variance can be defined in terms of differences between ele-
ments of the vector (b1 ,..., bJ ), as opposed to deviations from a central value, one may also
consider which contrasts between pairs of b values are significant. Thus, Deely and Smith
(1998) suggest evaluating probabilities Pr(bj ≤ τbk | k ≠ j, y) where 0 < τ ≤ 1, namely, the posterior
probability that any one hierarchical effect is smaller by a factor τ than all the others.
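The exceedance probability calculation amounts to averaging an indicator over retained iterations. In the sketch below the draws of the relative risk b_j are a stand-in, generated as the exponential of normal draws with assumed mean 0.15 and standard deviation 0.1 on the log scale, so that the exact probability is available for comparison.

```r
set.seed(11)
n_draws <- 10000

# Stand-in for post burn-in draws of an area's log relative risk (assumption)
log_rr <- rnorm(n_draws, 0.15, 0.1)
b_j    <- exp(log_rr)

tau <- 1                          # threshold: relative risk of unity
pr_exceed <- mean(b_j > tau)      # proportion of iterations with b_j > tau

exact <- pnorm(0.15 / 0.1)        # Pr(log relative risk > 0) under the assumed normal
```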

1.5 Metropolis Sampling
A range of MCMC techniques is available. The Metropolis sampling algorithm is still a
widely applied MCMC algorithm and is a special case of Metropolis–Hastings consid-
ered in Section 1.8. Let p(y|θ) denote a likelihood, and p(θ) denote the prior density for
θ, or more specifically the prior densities p(θ1), …, p(θC) of the components of θ. Then the
Metropolis algorithm involves a symmetric proposal density (e.g. a Normal, Student t, or
uniform density) q(θcand|θ(t)) for generating candidate parameter values θcand, with acceptance
probability for potential candidate values obtained as

$$\alpha^{(t)} = \min\left(1, \frac{\pi(\theta_{cand})}{\pi(\theta^{(t)})}\right) = \min\left(1, \frac{p(\theta_{cand}|y)}{p(\theta^{(t)}|y)}\right) = \min\left(1, \frac{p(y|\theta_{cand})\,p(\theta_{cand})}{p(y|\theta^{(t)})\,p(\theta^{(t)})}\right). \tag{1.5}$$
So one compares the (likelihood * prior), namely, p(y|θ)p(θ), for the candidate and existing
parameter values. If the (likelihood * prior) is higher for the candidate value, it is automatically
accepted, and θ(t+1) = θcand. However, even if the (likelihood * prior) is lower for
the candidate value, such that α(t) is less than 1, the candidate value may still be accepted.
This is decided by sampling a value U(t) from a uniform density on (0,1), and the candidate value
is accepted if α(t) ≥ U(t). In practice, comparisons involve the log posteriors for existing and
candidate parameter values.
The third equality in (1.5) follows because the marginal likelihood p(y) = 1/k in the
Bayesian formula

$$p(\theta|y) = p(y|\theta)\,p(\theta)/p(y) = k\,p(y|\theta)\,p(\theta),$$

cancels out, as it is a constant. Stated more completely, to sample parameters under the
Metropolis algorithm, it is not necessary to know the normalised target distribution,
namely, the posterior density, π(θ|y); it is enough to know it up to a constant factor.
So, for updating parameter subsets, the Metropolis algorithm can be implemented by
using the full posterior distribution

$$\pi(\theta) = p(\theta|y) = k\,p(y|\theta)\,p(\theta),$$

as the target distribution – which in practice involves comparisons of the unnormalised
posterior p(y|θ)p(θ). However, for updating values on a particular parameter θh, it is not just
p(y) that cancels out in the ratio

$$\pi(\theta_{cand})/\pi(\theta^{(t)}) = \frac{p(y|\theta_{cand})\,p(\theta_{cand})}{p(y|\theta^{(t)})\,p(\theta^{(t)})},$$
but any parts of the likelihood or prior not involving θh (these parts are constants when θh
is being updated).
When those parts of the likelihood or prior not relevant to θh are abstracted out, the
remaining part of p(θ|y) = kp(y|θ)p(θ), the part relevant to updating θh, is known as the
full conditional density for θh (Gilks, 1996). One may denote the full conditional density
for θh as

$$\pi_h(\theta_h|\theta_{[h]}) \propto p(y|\theta_h)\,p(\theta_h),$$

where θ[h] denotes the parameter set excluding θh. So, the probability for updating θh can be
obtained either by comparing the full posterior (known up to a constant k), namely

$$\alpha = \min\left(1, \frac{\pi(\theta_{h,cand},\theta_{[h]}^{(t)})}{\pi(\theta^{(t)})}\right) = \min\left(1, \frac{p(y|\theta_{h,cand},\theta_{[h]}^{(t)})\,p(\theta_{h,cand},\theta_{[h]}^{(t)})}{p(y|\theta^{(t)})\,p(\theta^{(t)})}\right),$$

or by using the full conditional for the hth parameter, namely

$$\alpha = \min\left(1, \frac{\pi_h(\theta_{h,cand}|\theta_{[h]}^{(t)})}{\pi_h(\theta_h^{(t)}|\theta_{[h]}^{(t)})}\right).$$

Then one sets θh(t+1) = θh,cand with probability α, and θh(t+1) = θh(t) otherwise.
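The Metropolis update can be sketched for a single parameter as follows. The example assumes toy Poisson counts with rate θ and a Ga(1, 0.5) prior, chosen because the exact posterior, Ga(1 + Σy, 0.5 + n), is available as a check; the data, prior values, proposal standard deviation, and object names are all illustrative assumptions. Comparisons are made on the log scale, and only the unnormalised posterior is needed.

```r
set.seed(5)
y <- c(3, 5, 2, 4, 6, 3, 4, 5)   # toy counts (assumption)
a <- 1; b <- 0.5                 # gamma prior shape and rate (assumption)

# Log of likelihood * prior; the constant log(k) is not needed
log_post <- function(theta) {
  if (theta <= 0) return(-Inf)   # zero posterior density outside the support
  sum(dpois(y, theta, log = TRUE)) + dgamma(theta, shape = a, rate = b, log = TRUE)
}

n_iter  <- 20000
B       <- 2000
sd_prop <- 1                     # standard deviation of the symmetric normal proposal
theta   <- numeric(n_iter)
theta[1] <- 1
n_acc   <- 0

for (t in 1:(n_iter - 1)) {
  cand <- theta[t] + sd_prop * rnorm(1)
  # log acceptance probability, as in (1.5) on the log scale
  log_alpha <- min(0, log_post(cand) - log_post(theta[t]))
  if (log(runif(1)) <= log_alpha) {
    theta[t + 1] <- cand
    n_acc <- n_acc + 1
  } else {
    theta[t + 1] <- theta[t]
  }
}

keep        <- (B + 1):n_iter
post_mean   <- mean(theta[keep])
accept_rate <- n_acc / (n_iter - 1)
exact_mean  <- (a + sum(y)) / (b + length(y))   # mean of the conjugate gamma posterior
```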

1.6 Choice of Proposal Density


There is some flexibility in the choice of proposal density q for generating candidate values
in the Metropolis and other MCMC algorithms, but the chosen density and the parameters
incorporated in it are relevant to successful MCMC updating and convergence (Altaleb
and Chauveau, 2002; Robert, 2015). A standard recommendation is that the proposal den-
sity for a particular parameter θh should approximate the posterior density p(θh|y) of that
parameter. In some cases, one may have an idea (e.g. from a classical analysis) of what
the posterior density is, or what its main defining parameters are. A normal proposal is
often justified, as many posterior densities do approximate normality. For example, Albert
(2007) applies a Laplace approximation technique to estimate the posterior mode, and uses
the mean and variance parameters to define the proposal densities used in a subsequent
stage of Metropolis–Hastings sampling.
The rate at which a proposal generated by q is accepted (the acceptance rate) depends on
how close θcand is to θ(t), and this in turn depends on the variance $\sigma_q^2$ of the proposal density.
A higher acceptance rate would typically follow from reducing $\sigma_q^2$, but with the risk that
the posterior density will take longer to explore. If the acceptance rate is too high, then
autocorrelation in sampled values will be excessive (since the chain tends to move in a
restricted space), while a too low acceptance rate leads to the same problem, since the chain
then gets locked at particular values.
One possibility is to use a variance or dispersion estimate, $\sigma_m^2$ or Σm, from a maximum
likelihood or other mode-finding analysis (which approximates the posterior variance)
and then scale this by a constant c > 1, so that the proposal density variance is $\sigma_q^2 = c\sigma_m^2$.
Values of c in the range 2–10 are typical. For θh of dimension dh with covariance Σm, a proposal
density dispersion 2.38²Σm/dh is shown as optimal in random walk schemes (Roberts
et al., 1997). Working rules are for an acceptance rate of 0.4 when a parameter is updated
singly (e.g. by separate univariate normal proposals), and 0.2 when a group of parameters
are updated simultaneously as a block (e.g. by a multivariate normal proposal). Geyer and
Thompson (1995) suggest acceptance rates should be between 0.2 and 0.4, and optimal
acceptance rates have been proposed (Roberts et al., 1997; Bedard, 2008).
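The dependence of the acceptance rate and autocorrelation on the proposal scale can be examined with a small experiment. The sketch below assumes a standard normal target (so that the posterior standard deviation is 1) and compares three proposal standard deviations; the function name run_metropolis and the chosen scales are illustrative assumptions.

```r
# Random walk Metropolis on a N(0, 1) target with normal proposals
run_metropolis <- function(sd_prop, n_iter = 10000) {
  x <- numeric(n_iter)     # chain starts at 0
  n_acc <- 0
  for (t in 1:(n_iter - 1)) {
    cand <- x[t] + sd_prop * rnorm(1)
    log_ratio <- dnorm(cand, log = TRUE) - dnorm(x[t], log = TRUE)
    if (log(runif(1)) <= log_ratio) {
      x[t + 1] <- cand
      n_acc <- n_acc + 1
    } else {
      x[t + 1] <- x[t]
    }
  }
  c(accept = n_acc / (n_iter - 1),
    lag1   = acf(x, plot = FALSE)$acf[2])   # lag 1 autocorrelation of the chain
}

set.seed(6)
sds <- c(0.1, 2.4, 25)     # too small, near the 2.38 guideline for d = 1, too large
res <- sapply(sds, run_metropolis)
colnames(res) <- paste0("sd=", sds)
```

Very small proposal scales give high acceptance but slow exploration, and very large scales give frequent rejection; both show up as high lag 1 autocorrelation relative to the intermediate scale.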
Typical Metropolis updating schemes use variables Wt with known scale, for example,
uniform, standard Normal, or standard Student t. A Normal proposal density q(θcand|θ(t))
then involves samples Wt ~ N(0,1), with candidate values

$$\theta_{cand} = \theta^{(t)} + \sigma_q W_t,$$

where σq determines the size of the jump from the current value (and the acceptance
rate). A uniform random walk samples Wt ~ Unif(−1,1) and scales this to form a proposal
θcand = θ(t) + κWt, with the value of κ determining the acceptance rate. As noted above, it is
desirable that the proposal density approximately matches the shape of the target density
p(θ|y). The Langevin random walk scheme is an example of a scheme including information
about the shape of p(θ|y) in the proposal, namely θcand = θ(t) + σq[Wt + 0.5∇log(p(θ(t)|y))],
where ∇ denotes the gradient function (Roberts and Tweedie, 1996).
Sometimes candidate parameter values are sampled using a transformed version of a
parameter, for example, normal sampling of a log variance rather than sampling of a variance
(which has to be restricted to positive values). In this case, an appropriate Jacobian
adjustment must be included in the posterior density. Example 1.2 below illustrates this.
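As a minimal sketch of the Jacobian adjustment, suppose the target density for a variance V is inverse gamma, IG(5, 8), and sampling is carried out on φ = log(V) using normal random walk proposals. Since ∂V/∂φ = exp(φ) = V, the log target for φ adds φ to the log density of V. The target, proposal scale, and object names are assumptions for illustration; the known mean and median of the inverse gamma provide a check.

```r
set.seed(7)
a <- 5; b <- 8   # inverse gamma target for V (assumption)

# Unnormalised log density of V ~ IG(a, b)
log_p_V <- function(V) -(a + 1) * log(V) - b / V

# Log target for phi = log(V): add the log Jacobian, log(dV/dphi) = phi
log_target_phi <- function(phi) log_p_V(exp(phi)) + phi

n_iter <- 30000
B      <- 3000
phi    <- numeric(n_iter)   # starts at phi = 0, i.e. V = 1

for (t in 1:(n_iter - 1)) {
  cand <- phi[t] + 0.8 * rnorm(1)
  if (log(runif(1)) <= log_target_phi(cand) - log_target_phi(phi[t])) {
    phi[t + 1] <- cand
  } else {
    phi[t + 1] <- phi[t]
  }
}

V <- exp(phi[(B + 1):n_iter])   # back-transform retained draws

V_mean       <- mean(V)
V_median     <- median(V)
exact_mean   <- b / (a - 1)                       # mean of IG(a, b)
exact_median <- 1 / qgamma(0.5, shape = a, rate = b)
```

Omitting the `+ phi` term would instead target IG(a + 1, b) for V, with a noticeably smaller mean.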

1.7 Obtaining Full Conditional Densities


As noted above, Metropolis sampling may be based on the full conditional density when
a particular parameter θh is being updated. These full conditionals are particularly central
in Gibbs sampling (see below). The full conditional densities may be obtained from the
joint density p(q , y ) = p( y|q )p(q ) and in many cases reduce to standard densities (Normal,
exponential, gamma, etc.) from which direct sampling is straightforward. Full conditional
densities are derived by abstracting out from the joint model density p(y|θ)p(θ) (likelihood
times prior) only those elements including θh and treating other components as constants
(George et al., 1993; Gilks, 1996).
Consider a conjugate model for Poisson count data yi with means μi that are themselves
gamma-distributed; this is a model appropriate for overdispersed count data with actual
variability var(y) exceeding that under the Poisson model (Molenberghs et al., 2007).
Suppose the second stage prior is μi ~ Ga(α,β), namely,

$$p(\mu_i|\alpha,\beta) = \mu_i^{\alpha-1} e^{-\beta\mu_i} \beta^{\alpha}/\Gamma(\alpha),$$

and further that α ~ E(A) (namely, α is exponential with parameter A), and β ~ Ga(B,C)
where A, B, and C are preset constants. So the posterior density p(θ|y) of θ = (μ1, …, μn, α, β),
given y, is proportional to

$$e^{-A\alpha}\,\beta^{B-1} e^{-C\beta} \left[\prod_i e^{-\mu_i}\mu_i^{y_i}\right] \left[\beta^{\alpha}/\Gamma(\alpha)\right]^n \left[\prod_i \mu_i^{\alpha-1} e^{-\beta\mu_i}\right], \tag{1.6}$$

where all constants (such as the denominator yi! in the Poisson likelihood, as well as the
inverse marginal likelihood k) are combined in a proportionality constant.
It is apparent from inspecting (1.6) that the full conditional densities of μi and β are also
gamma, namely,

$$\mu_i \sim Ga(y_i + \alpha, \beta + 1),$$

and

$$\beta \sim Ga\left(B + n\alpha,\; C + \sum_i \mu_i\right),$$

respectively. The full conditional density of α, also obtained from inspecting (1.6), is

$$p(\alpha|y,\beta,\mu) \propto e^{-A\alpha} \left[\beta^{\alpha}/\Gamma(\alpha)\right]^n \left[\prod_i \mu_i\right]^{\alpha-1}.$$
This density is non-standard and cannot be sampled directly (as can the gamma densities
for μi and β). Hence, a Metropolis or Metropolis–Hastings step can be used for updating it.
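The resulting scheme, with direct gamma sampling for μi and β and a Metropolis step for α, can be sketched as below. The simulated overdispersed counts, the hyperparameter values A = 1 and B = C = 0.1, the proposal standard deviation, and the object names are illustrative assumptions rather than settings from the text.

```r
set.seed(8)
n <- 50
y <- rnbinom(n, size = 2, mu = 6)   # simulated overdispersed counts (assumption)
A <- 1; B <- 0.1; C <- 0.1          # preset hyperparameters (illustrative values)

# Log full conditional of alpha, up to an additive constant
log_fc_alpha <- function(alpha, beta, mu) {
  if (alpha <= 0) return(-Inf)
  -A * alpha + n * (alpha * log(beta) - lgamma(alpha)) + (alpha - 1) * sum(log(mu))
}

n_iter <- 5000
burn   <- 1000
alpha  <- numeric(n_iter)
beta   <- numeric(n_iter)
alpha_c <- 1; beta_c <- 1           # initial values
n_acc   <- 0

for (t in 1:n_iter) {
  # Direct sampling from the gamma full conditionals
  mu_c   <- rgamma(n, shape = y + alpha_c, rate = beta_c + 1)
  beta_c <- rgamma(1, shape = B + n * alpha_c, rate = C + sum(mu_c))
  # Metropolis step for alpha with a symmetric normal proposal
  cand <- alpha_c + 0.3 * rnorm(1)
  log_ratio <- log_fc_alpha(cand, beta_c, mu_c) - log_fc_alpha(alpha_c, beta_c, mu_c)
  if (is.finite(log_ratio) && log(runif(1)) <= log_ratio) {
    alpha_c <- cand
    n_acc <- n_acc + 1
  }
  alpha[t] <- alpha_c
  beta[t]  <- beta_c
}

keep <- (burn + 1):n_iter
mean_rate   <- mean(alpha[keep] / beta[keep])   # posterior mean of the gamma mean alpha/beta
accept_rate <- n_acc / n_iter
```

The quantity α/β is the mean of the second stage gamma density, so its posterior mean should lie close to the average observed count.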

Example 1.1 Estimating Normal Density Parameters via Metropolis


To illustrate Metropolis sampling in practice using symmetric proposal densities,
consider n = 1000 values yi generated randomly from a N(3,25) distribution, namely a
Normal with mean μ = 3 and variance σ2 = 25. Note that, for the particular set.seed used,
the average sampled yi is 2.87 with variance 24.87. Using the generated y, we seek to
estimate the mean and variance, now treating them as unknowns. Setting θ = (μ,σ2), the
likelihood is

$$p(y|\theta) = \prod_{i=1}^{n} \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(y_i-\mu)^2}{2\sigma^2}\right).$$

Assume a flat prior for μ, and a prior p(σ) ∝ 1/σ on σ; this is a form of noninformative
prior (see Albert, 2007, p.109). Then one has posterior density

$$p(\theta|y) \propto \frac{1}{\sigma^{n+1}} \prod_{i=1}^{n} \exp\left(-\frac{(y_i-\mu)^2}{2\sigma^2}\right),$$

with the marginal likelihood and other constants incorporated in the proportionality
sign.
Parameter sampling via the Metropolis algorithm involves σ rather than σ2, and uni-
form proposals. Thus, assume uniform U(−κ,κ) proposal densities around the current
parameter values μ(t) and σ(t), with κ = 0.5 for both parameters. The absolute value of
σ(t) + U(−κ,κ) is used to generate σcand. Note that varying the lower and upper limit of
the uniform sampling (e.g. taking κ = 1 or κ = 0.25) may considerably affect the accep-
tance rates.
An R code for κ = 0.5 is in the Computational Notes [1] in Section 1.14, and uses the
full posterior density (rather than the full conditional for each parameter) as the target
density for assessing candidate values. In the acceptance step, the log of the ratio
p(y|θcand)p(θcand)/[p(y|θ(t))p(θ(t))] is compared to the log of a random uniform value to avoid computer
over/underflow. With T = 10000 and B = 1000 warmup iterations, acceptance rates for
the proposals of μ and σ are 48% and 35% respectively, with posterior means 2.87 and
4.99. Other posterior summary tools (e.g. univariate and bivariate kernel density plots,
effective sample sizes) are included in the R code (see Figure 1.1 for a plot of the pos-
terior bivariate density). Also included is a posterior probability calculation to assess
Pr(μ < 3|y), with result 0.80, and a command for a plot of the changing posterior expec-
tation for μ over the iterations. The code uses the full normal likelihood, via the dnorm
function in R.
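The steps described in this example can be sketched as follows. This is not the code from the Computational Notes: the seed, initial values, and object names are assumptions, so the numerical results will differ somewhat from those quoted above. The sketch updates μ and σ in turn using uniform U(−κ,κ) proposals with κ = 0.5, takes the absolute value for the σ candidate, and uses the full log posterior (normal log-likelihood via dnorm plus the log of the 1/σ prior) as the target.

```r
set.seed(1234)              # arbitrary seed; not the seed used in the book
n <- 1000
y <- rnorm(n, 3, 5)         # N(3, 25) data: mean 3, standard deviation 5

# Full log posterior up to a constant: log-likelihood + log prior (flat on mu, 1/sigma on sigma)
log_post <- function(mu, sigma) sum(dnorm(y, mu, sigma, log = TRUE)) - log(sigma)

n_iter <- 10000
B      <- 1000
kappa  <- 0.5
mu     <- numeric(n_iter)
sigma  <- numeric(n_iter)
mu[1] <- 0; sigma[1] <- 10  # assumed initial values
acc <- c(mu = 0, sigma = 0)

for (t in 1:(n_iter - 1)) {
  # Update mu with a uniform random walk proposal
  mu_cand <- mu[t] + runif(1, -kappa, kappa)
  if (log(runif(1)) <= log_post(mu_cand, sigma[t]) - log_post(mu[t], sigma[t])) {
    mu[t + 1] <- mu_cand
    acc["mu"] <- acc["mu"] + 1
  } else {
    mu[t + 1] <- mu[t]
  }
  # Update sigma; the absolute value keeps the candidate positive
  sigma_cand <- abs(sigma[t] + runif(1, -kappa, kappa))
  if (log(runif(1)) <= log_post(mu[t + 1], sigma_cand) - log_post(mu[t + 1], sigma[t])) {
    sigma[t + 1] <- sigma_cand
    acc["sigma"] <- acc["sigma"] + 1
  } else {
    sigma[t + 1] <- sigma[t]
  }
}

keep         <- (B + 1):n_iter
accept_rates <- acc / (n_iter - 1)
post_means   <- c(mu = mean(mu[keep]), sigma = mean(sigma[keep]))
pr_mu_lt_3   <- mean(mu[keep] < 3)   # estimate of Pr(mu < 3 | y)
```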

FIGURE 1.1
Bivariate density plot, normal density parameters (horizontal axis: mu; vertical axis: sigma).

Example 1.2 Extended Logistic with Metropolis Sampling


Following Carlin and Gelfand (1991), consider an extended logistic model for beetle
mortality data, involving death rates πi at exposure dose wi. Thus, for deaths yi at six
dose points, one has

$$y_i \sim \mathrm{Bin}(n_i, \pi(w_i)),$$

$$\pi(w_i) = [\exp(z_i)/(1+\exp(z_i))]^{m_1},$$

$$z_i = (w_i - \mu)/\sigma,$$

where m1 and σ are both positive. To simplify notation, one may write V = σ².
Consider Metropolis sampling involving log transforms of m1 and V, with separate
univariate normal proposals for each parameter. Jacobian adjustments are needed
in the posterior density to account for the two transformed parameters. The full
posterior p(μ, m1, V|y) is proportional to

$$p(m_1)\,p(\mu)\,p(V) \prod_i [\pi(w_i)]^{y_i}\,[1 - \pi(w_i)]^{n_i - y_i}$$

where p(μ), p(m1) and p(V) are priors for μ, m1 and V. Suppose the priors p(m1) and p(μ)
are as follows:

$$m_1 \sim \mathrm{Ga}(a_0, b_0),$$

$$\mu \sim N(c_0, d_0^2),$$

where the gamma has the form

$$\mathrm{Ga}(x|\alpha, \beta) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\, x^{\alpha - 1} e^{-\beta x}.$$

Also, for p(V) assume

$$V \sim \mathrm{IG}(e_0, f_0),$$

where the inverse gamma has the form

$$\mathrm{IG}(x|\alpha, \beta) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\, x^{-(\alpha + 1)} e^{-\beta/x}.$$

The parameters $(a_0, b_0, c_0, d_0, e_0, f_0)$ are preset. The posterior is then proportional to

$$\left(m_1^{a_0 - 1} e^{-b_0 m_1}\right) \exp\left(-0.5\left[\frac{\mu - c_0}{d_0}\right]^2\right) V^{-(e_0 + 1)} e^{-f_0/V} \prod_i [\pi(w_i)]^{y_i}\,[1 - \pi(w_i)]^{n_i - y_i}.$$

Suppose the likelihood is re-specified in terms of parameters $\theta_1 = \mu$, $\theta_2 = \log(m_1)$ and
$\theta_3 = \log(V)$. Then the full posterior in terms of the transformed parameters is proportional to

$$\left(\frac{\partial m_1}{\partial \theta_2}\right)\left(\frac{\partial V}{\partial \theta_3}\right) p(\mu)\,p(m_1)\,p(V) \prod_i [\pi(w_i)]^{y_i}\,[1 - \pi(w_i)]^{n_i - y_i}.$$

One has $(\partial m_1/\partial \theta_2) = e^{\theta_2} = m_1$ and $(\partial V/\partial \theta_3) = e^{\theta_3} = V$. So, taking account of the
parameterisation (θ1,θ2,θ3), the posterior density is proportional to

$$\left(m_1^{a_0} e^{-b_0 m_1}\right) \exp\left(-0.5\left[\frac{\mu - c_0}{d_0}\right]^2\right) V^{-e_0} e^{-f_0/V} \prod_i [\pi(w_i)]^{y_i}\,[1 - \pi(w_i)]^{n_i - y_i}.$$
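
The log of this Jacobian-adjusted posterior might be coded as below. This is a sketch under stated assumptions, not the code in the Computational Notes: the dose, sample size and death counts are made-up placeholders (not the beetle mortality data), and the function name `log_post_theta` is hypothetical. The default prior settings are those quoted for this example.

```r
set.seed(11)
# Hypothetical dose-response data for six dose points (NOT the beetle data)
w <- c(1.70, 1.75, 1.80, 1.85, 1.90, 1.95)
n <- rep(60, 6)
y <- c(4, 12, 28, 45, 55, 59)

# Log posterior in terms of theta = (mu, log(m1), log(V)), constants dropped
log_post_theta <- function(theta, a0 = 0.25, b0 = 0.25, c0 = 2, d0 = 10,
                           e0 = 2.000004, f0 = 0.001) {
  mu <- theta[1]; m1 <- exp(theta[2]); V <- exp(theta[3])
  z <- (w - mu) / sqrt(V)
  log_pi    <- m1 * plogis(z, log.p = TRUE)   # log of [exp(z)/(1+exp(z))]^m1
  log_1m_pi <- log1p(-exp(log_pi))            # log of 1 - pi(w)
  loglik <- sum(y * log_pi + (n - y) * log_1m_pi)
  # Priors multiplied by the Jacobian m1 * V: exponents a0 and -e0
  log_prior_jac <- a0 * log(m1) - b0 * m1 - 0.5 * ((mu - c0) / d0)^2 -
    e0 * log(V) - f0 / V
  loglik + log_prior_jac
}

# One Metropolis update of theta2 = log(m1) with a normal proposal (sd 0.2)
theta_curr <- c(1.8, 0, log(0.05^2))
theta_cand <- theta_curr
theta_cand[2] <- rnorm(1, theta_curr[2], 0.2)
log_ratio <- log_post_theta(theta_cand) - log_post_theta(theta_curr)
accept <- log(runif(1)) < log_ratio
```

Working with θ2 and θ3 on the log scale means the normal proposals never leave the parameter space, at the cost of the extra m1 and V factors in the target.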

The R code (see Section 1.14 Computational Notes [2]) assumes initial values for μ = θ1
of 1.8, for θ2 = log(m1) of 0, and for θ3 = log(V) of 0. Preset parameters in the prior den-
sities are (a0 = 0.25, b0 = 0.25, c0 = 2, d0 = 10, e0 = 2.000004, f0 = 0.001). Two chains are run
with T = 100000, with inferences based on the last 50,000 iterations. Standard devia-
tions in the respective normal proposal densities are set at 0.01, 0.2, and 0.4. Metropolis
updates involve comparisons of the log posterior and logs of uniform random variables
$\{U_h^{(t)}, h = 1, \ldots, 3\}$.
Posterior medians (and 95% intervals) for {μ,m1,V} are obtained as 1.81 (1.78, 1.83), 0.36
(0.20,0.75), 0.00035 (0.00017, 0.00074) with acceptance rates of 0.41, 0.43, and 0.43. The pos-
terior estimates are similar to those of Carlin and Gelfand (1991). Despite satisfactory
convergence according to Gelman–Rubin scale reduction factors, estimation is beset
by high posterior correlations between parameters and low effective sample sizes. The
cross-correlations between the three hyperparameters exceed 0.75 in absolute terms,
effective sample sizes are under 1000, and first lag sampling autocorrelations all exceed
0.90.
It is of interest to apply rstan (and hence HMC) to this dataset (Section 1.10) (see Section
1.14 Computational Notes [3]). Inferences from rstan differ from those from Metropolis
sampling estimation, though they are sensitive to the priors adopted. In a particular rstan
estimation, normal priors are set on the hyperparameters as follows:

$$\mu \sim N(2, 10),$$

$$\log(m_1) \sim N(0, 1),$$

$$\log(\sigma) \sim N(0, 5).$$

Two chains are applied with 2500 iterations and 250 warm-up. While estimates for μ
are similar to the preceding analysis, the posterior median (95% intervals) for m1 is now
1.21 (0.21, 6.58), with the 95% interval straddling the default unity value. The estimate
for the variance V is lower. As to MCMC diagnostics, effective sample sizes for μ and m1
are larger than from the Metropolis analysis, absolute cross-correlations between the
three hyperparameters in the MCMC sampling are all under 0.40 (see Figure 1.2), and
first lag sampling autocorrelations are all under 0.60.

FIGURE 1.2
Posterior densities and MCMC cross-correlations, rstan estimation of beetle mortality data.

1.8 Metropolis–Hastings Sampling
The Metropolis–Hastings (M–H) algorithm is the overarching algorithm for MCMC
schemes that simulate a Markov chain θ(t) with p(θ|y) as its stationary distribution.
Following Hastings (1970), the chain is updated from θ(t) to θcand with probability

$$\alpha(\theta_{cand}|\theta^{(t)}) = \min\left(1, \frac{p(\theta_{cand}|y)\,q(\theta^{(t)}|\theta_{cand})}{p(\theta^{(t)}|y)\,q(\theta_{cand}|\theta^{(t)})}\right),$$

where the proposal density q (Chib and Greenberg, 1995) may be non-symmetric, so
that $q(\theta_{cand}|\theta^{(t)})$ does not necessarily equal $q(\theta^{(t)}|\theta_{cand})$. Here $q(\theta_{cand}|\theta^{(t)})$ is the probability (or
density ordinate) of θcand for a density centred at θ(t), while $q(\theta^{(t)}|\theta_{cand})$ is the probability
of moving back from θcand to the current value. If the proposal density is symmetric,
with $q(\theta_{cand}|\theta^{(t)}) = q(\theta^{(t)}|\theta_{cand})$, then the Metropolis–Hastings algorithm reduces to the
Metropolis algorithm discussed above. The M–H transition kernel is

$$K(\theta_{cand}|\theta^{(t)}) = \alpha(\theta_{cand}|\theta^{(t)})\,q(\theta_{cand}|\theta^{(t)}),$$

for $\theta_{cand} \neq \theta^{(t)}$, with a nonzero probability of staying in the current state, namely

$$K(\theta^{(t)}|\theta^{(t)}) = 1 - \int \alpha(\theta_{cand}|\theta^{(t)})\,q(\theta_{cand}|\theta^{(t)})\,d\theta_{cand}.$$
Conformity of M–H sampling to the requirement that the Markov chain eventually sam-
ples from π(θ) is considered by Mengersen and Tweedie (1996) and Roberts and Rosenthal
(2004).
If the proposed new value θcand is accepted, then θ(t+1) = θcand, while if it is rejected the next
state is the same as the current state, i.e. θ(t+1) = θ(t). As mentioned above, since the target
density p(θ|y) appears in ratio form, it is not necessary to know the normalising constant
k = 1/p(y). If the proposal density has the form

$$q(\theta_{cand}|\theta^{(t)}) = q(\theta^{(t)} - \theta_{cand}),$$

then a random walk Metropolis scheme is obtained (Albert, 2007, p.105; Sherlock et al.,
2010). Another option is independence sampling, when the density q(θcand) for sampling
candidate values is independent of the current value θ(t).
While it is possible for the target density to relate to the entire parameter set, it is typi-
cally computationally simpler in multi-parameter problems to divide θ into C blocks or
components, and use the full conditional densities in componentwise updating. Consider
the update for the hth parameter or parameter block. At step h of iteration t + 1 the preceding
h − 1 parameter blocks are already updated via the M–H algorithm, while $\theta_{h+1}, \ldots, \theta_C$
are still at their iteration t values (Chib and Greenberg, 1995). Let the vector of partially
updated parameters apart from θh be denoted

$$\theta_{[h]}^{(t)} = (\theta_1^{(t+1)}, \theta_2^{(t+1)}, \ldots, \theta_{h-1}^{(t+1)}, \theta_{h+1}^{(t)}, \ldots, \theta_C^{(t)}).$$

The candidate value for θh is generated from the hth proposal density, denoted
$q_h(\theta_{h,cand}|\theta_h^{(t)})$. Also governing the acceptance of a proposal are full conditional densities
$\pi_h(\theta_h^{(t)}|\theta_{[h]}^{(t)}) \propto p(y|\theta_h^{(t)})\,p(\theta_h^{(t)})$ specifying the density of θh conditional on known values of
other parameters θ[h]. The candidate value θh,cand is then accepted with probability

$$\alpha = \min\left(1, \frac{p(y|\theta_{h,cand})\,p(\theta_{h,cand})\,q(\theta_h^{(t)}|\theta_{h,cand})}{p(y|\theta_h^{(t)})\,p(\theta_h^{(t)})\,q(\theta_{h,cand}|\theta_h^{(t)})}\right). \qquad (1.7)$$

Example 1.3 Normal Random Effects in a Hierarchical Binary Regression


To exemplify a hierarchical Bayes model involving a three-stage prior, consider binary
data yi ~ Bern(pi) from Sinharay and Stern (2005) on survival or otherwise of n = 244
newborn turtles arranged in J = 31 clutches, numbered in increasing order of the average
birthweight of the turtles. A known predictor is turtle birthweight xi. Let Ci denote the
clutch that turtle i belongs to. Then to allow for varying clutch effects, one may specify,
for cluster j = Ci, a probit regression with

$$p_i|b_j = \Phi(\beta_1 + \beta_2 x_i + b_j),$$

where $\{b_j \sim N(0, 1/\tau_b), j = 1, \ldots, J\}$. It is assumed that $\beta_k \sim N(0, 10)$ and $\tau_b \sim \mathrm{Ga}(1, 0.001)$.
A Metropolis–Hastings step involving a gamma proposal is used for the random
effects precision τb, and Metropolis updates for other parameters; see Section 1.14
Computational Notes [3]. Trial runs suggest τb is approximately between 5 and 10, and a
gamma proposal $\mathrm{Ga}(\kappa, \kappa/\tau_{b,curr})$ with κ = 100 is adopted (reducing κ will reduce the M–H
acceptance rate for τb).
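
A gamma proposal is not symmetric, so the acceptance probability needs the Hastings correction q(τcurr|τcand)/q(τcand|τcurr). The following sketch isolates that step. It is not the code in the Computational Notes: the clutch effects `b` are simulated and held fixed, and the function names are hypothetical. With the effects held fixed the full conditional of τb is itself a gamma density, so the step is shown purely to illustrate the mechanics.

```r
set.seed(2)
b <- rnorm(31, 0, 0.3)   # hypothetical current values of the J = 31 clutch effects

# Log full conditional of tau_b (up to a constant): N(0, 1/tau_b) density for
# the effects times the Ga(1, 0.001) prior
log_target <- function(tau) {
  sum(dnorm(b, 0, 1 / sqrt(tau), log = TRUE)) +
    dgamma(tau, shape = 1, rate = 0.001, log = TRUE)
}

# One M-H update with proposal Ga(kappa, kappa / tau_curr), mean tau_curr
mh_update_tau <- function(tau_curr, kappa = 100) {
  tau_cand <- rgamma(1, shape = kappa, rate = kappa / tau_curr)
  log_alpha <- log_target(tau_cand) - log_target(tau_curr) +
    dgamma(tau_curr, shape = kappa, rate = kappa / tau_cand, log = TRUE) -  # q(curr | cand)
    dgamma(tau_cand, shape = kappa, rate = kappa / tau_curr, log = TRUE)    # q(cand | curr)
  if (log(runif(1)) < log_alpha) tau_cand else tau_curr
}

tau <- numeric(2000); tau[1] <- 5
for (t in 2:2000) tau[t] <- mh_update_tau(tau[t - 1])
accept_rate <- mean(diff(tau) != 0)
```

The proposal standard deviation is τcurr/√κ, so κ controls the step size relative to the current value.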
A run of T = 5000 iterations with warm-up B = 500 provides posterior medians (95%
intervals) for $\{\beta_1, \beta_2, \sigma_b = 1/\sqrt{\tau_b}\}$ of −2.91 (−3.79, −2.11), 0.40 (0.28, 0.54), and 0.27 (0.20,
0.43), and acceptance rates for {β1,β2,τb} of 0.30, 0.21, and 0.24. Acceptance rates for the
clutch random effects (using normal proposals with standard deviation 1) are between
0.25 and 0.33. However, none of the clutch effects appears to be strongly significant, in
the sense of entirely positive or negative 95% credible intervals. The effect b9 (for the
clutch with lowest average birthweight) has posterior median and 95% interval, 0.36
(−0.07, 0.87), and is the closest to being significant, while for b15 the median (95%CRI) is
−0.30 (−0.77,0.10).

1.9 Gibbs Sampling
The Gibbs sampler (Gelfand and Smith, 1990; Gilks et al., 1993; Chib, 2001) is a special
componentwise M–H algorithm whereby the proposal density q for updating θh equals the
full conditional $\pi_h(\theta_h|\theta_{[h]}) \propto p(y|\theta_h)\,p(\theta_h)$. It follows from (1.7) that proposals are accepted with
probability 1. If it is possible to update all blocks this way, then the Gibbs sampler involves
parameter block by parameter block updating which, when completed, forms the transition
from $\theta^{(t)} = (\theta_1^{(t)}, \ldots, \theta_C^{(t)})$ to $\theta^{(t+1)} = (\theta_1^{(t+1)}, \ldots, \theta_C^{(t+1)})$. The most common sequence used is

1. $\theta_1^{(t+1)} \sim f_1(\theta_1|\theta_2^{(t)}, \theta_3^{(t)}, \ldots, \theta_C^{(t)})$;
2. $\theta_2^{(t+1)} \sim f_2(\theta_2|\theta_1^{(t+1)}, \theta_3^{(t)}, \ldots, \theta_C^{(t)})$;
3. $\theta_C^{(t+1)} \sim f_C(\theta_C|\theta_1^{(t+1)}, \theta_2^{(t+1)}, \ldots, \theta_{C-1}^{(t+1)})$.

While this scanning scheme is the usual one for Gibbs sampling, there are other options,
such as the random permutation scan (Roberts and Sahu, 1997) and the reversible Gibbs
sampler which updates blocks 1 to C, and then updates in reverse order.

Example 1.4 Gibbs Sampling Example Schools Data Meta Analysis


Consider the schools data from Gelman et al. (2014), consisting of point estimates yj (j = 1,
…, J) of unknown effects θj, where each yj has a known design variance $\sigma_j^2$ (though the
listed data provides σj, not $\sigma_j^2$). The first stage of a hierarchical normal model assumes

$$y_j \sim N(\theta_j, \sigma_j^2),$$

and the second stage specifies a normal model for the latent θj,

$$\theta_j \sim N(\mu, \tau^2).$$

The full conditionals for the latent effects θj, namely $p(\theta_j|y, \mu, \tau^2)$, are as specified by
Gelman et al. (2014, p.116). Assuming a flat prior on μ, and that the precision 1/τ² has
a Ga(a,b) gamma prior, then the full conditional for μ is $N(\bar{\theta}, \tau^2/J)$, and that for 1/τ² is
gamma with parameters $\left(J/2 + a,\; 0.5\sum_j(\theta_j - \mu)^2 + b\right)$.

TABLE 1.1
Schools Normal Meta-Analysis Posterior Summary
           μ     τ     θ1    θ2    θ3    θ4    θ5    θ6    θ7    θ8
Mean       8.0   2.5   9.0   8.0   7.6   8.0   7.1   7.5   8.8   8.1
St devn    4.4   2.8   5.6   4.9   5.4   5.1   5.0   5.2   5.2   5.4

For the R application, the setting a = b = 0.1 is used in the prior for 1/τ². Starting values
for μ and τ² in the MCMC analysis are provided by the mean of the yj and the median
of the $\sigma_j^2$. A single run of T = 20000 samples (see Section 1.14 Computational Notes [4])
provides the posterior means and standard deviations shown in Table 1.1.
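
A compact Gibbs sampler for this model could look as follows. It is a sketch rather than the code in the Computational Notes: the data are the eight schools estimates and standard errors as tabulated in Gelman et al. (2014), the normal full conditional for θj is the usual precision-weighted form, and all variable names are hypothetical.

```r
set.seed(3)
y     <- c(28, 8, -3, 7, -1, 1, 18, 12)     # school effect estimates y_j
sigma <- c(15, 10, 16, 11, 9, 11, 10, 18)   # known standard errors sigma_j
J <- length(y)
a <- 0.1; b <- 0.1                          # Ga(a, b) prior on the precision 1/tau^2
n_iter <- 20000

theta <- matrix(NA_real_, n_iter, J)
mu <- numeric(n_iter); tau2 <- numeric(n_iter)
mu_curr <- mean(y)                # starting value for mu
tau2_curr <- median(sigma^2)      # starting value for tau^2

for (t in 1:n_iter) {
  # theta_j | y, mu, tau2: precision-weighted combination of y_j and mu
  prec <- 1 / sigma^2 + 1 / tau2_curr
  th <- rnorm(J, (y / sigma^2 + mu_curr / tau2_curr) / prec, sqrt(1 / prec))
  # mu | theta, tau2 ~ N(mean(theta), tau2 / J) under a flat prior
  mu_curr <- rnorm(1, mean(th), sqrt(tau2_curr / J))
  # 1/tau2 | theta, mu ~ Ga(J/2 + a, 0.5 * sum((theta - mu)^2) + b)
  tau2_curr <- 1 / rgamma(1, shape = J / 2 + a,
                          rate = 0.5 * sum((th - mu_curr)^2) + b)
  theta[t, ] <- th; mu[t] <- mu_curr; tau2[t] <- tau2_curr
}

post_summary <- rbind(Mean = c(mean(mu), mean(sqrt(tau2)), colMeans(theta)),
                      SD   = c(sd(mu), sd(sqrt(tau2)), apply(theta, 2, sd)))
```

Every draw is taken directly from a full conditional, so no acceptance step is needed.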

1.10 Hamiltonian Monte Carlo


The Hamiltonian Monte Carlo (HMC) algorithm is implemented in the rstan library in R
(see Chapter 2), and has been demonstrated to improve effective search of the posterior
parameter space. Inefficient random walk behaviour and delayed convergence that may
characterise other MCMC algorithms is avoided by a greater flexibility in proposing new
parameter values; see Neal (2011, section 5.3.3.3), Gelman et al. (2014), Monnahan et al.
(2017), and Robert et al. (2018). In HMC, an auxiliary momentum vector ϕ is introduced
with the same dimension D = dim(θ) as the parameter vector θ. HMC then involves an
alternation between two forms of updating. One updates the momentum vector leaving θ
unchanged. The other updates both θ and ϕ using Hamiltonian dynamics as determined
by the Hamiltonian

$$H(\theta, \phi) = U(\theta) + K(\phi),$$

where $U(\theta) = -\log[p(y|\theta)\,p(\theta)]$ (the negative log posterior) defines potential energy, and
$K(\phi) = \sum_{d=1}^{D} \phi_d^2/(2m_d)$ defines kinetic energy (Neal, 2011, section 5.2). Updates of the
momentum variable include updates based on the gradients of $U(\theta)$,

$$g_d(\theta) = \frac{\partial U(\theta)}{\partial \theta_d},$$

with g(θ) denoting the vector of gradients.
For iterations t = 1, …, T, the updating sequence is as follows:

1. sample ϕ(t) from N(0,I), where I is diagonal with dimension D;
2. relabel ϕ(t) as ϕ0 and θ(t) as θ0, and with stepsize ε carry out L “leapfrog” steps, starting
   from i = 0:
   a) $\phi_{i+0.5} = \phi_i - 0.5\,\varepsilon\, g(\theta_i)$
   b) $\theta_{d,i+1} = \theta_{d,i} + \varepsilon\,\phi_{d,i+0.5}/m_d$
   c) $\phi_{i+1} = \phi_{i+0.5} - 0.5\,\varepsilon\, g(\theta_{i+1})$;
3. set candidate parameter and momentum variables as θ* = θL and ϕ* = ϕL;
4. obtain the potential and kinetic energies U(θ*) and K(ϕ*);
5. accept the candidate values with probability min(1,r) where

$$\log(r) = U(\theta^{(t)}) + K(\phi^{(t)}) - U(\theta^{*}) - K(\phi^{*}).$$
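
The updating sequence can be sketched in base R for a toy target. The assumptions here are all illustrative: the target is a bivariate normal with correlation 0.8 (not a model from the text), masses are set to m_d = 1 so that K(ϕ) = 0.5 Σ ϕ_d², the second momentum half-step uses the gradient at the updated position θ_{i+1}, and the stepsize and number of leapfrog steps are fixed by hand rather than tuned.

```r
set.seed(4)
# Hypothetical target: zero-mean bivariate normal, unit variances, correlation 0.8
Sigma_inv <- solve(matrix(c(1, 0.8, 0.8, 1), 2, 2))
grad_U <- function(theta) as.vector(Sigma_inv %*% theta)       # gradient of U
U      <- function(theta) 0.5 * sum(theta * grad_U(theta))     # negative log density
K      <- function(phi) 0.5 * sum(phi^2)                       # unit masses

hmc_step <- function(theta, eps = 0.15, L = 20) {
  phi <- rnorm(length(theta))              # step 1: fresh momentum
  theta_new <- theta; phi_new <- phi       # step 2: leapfrog trajectory
  for (i in seq_len(L)) {
    phi_new   <- phi_new - 0.5 * eps * grad_U(theta_new)   # half step, momentum
    theta_new <- theta_new + eps * phi_new                 # full step, position
    phi_new   <- phi_new - 0.5 * eps * grad_U(theta_new)   # half step at new position
  }
  # steps 3-5: accept the end point with probability min(1, r)
  log_r <- U(theta) + K(phi) - U(theta_new) - K(phi_new)
  if (log(runif(1)) < log_r) theta_new else theta
}

draws <- matrix(NA_real_, 2000, 2)
theta <- c(0, 0)
for (t in 1:2000) {
  theta <- hmc_step(theta)
  draws[t, ] <- theta
}
sample_cor <- cor(draws)[1, 2]
```

In practice `eps` and `L` would be adapted, which is what the No U-Turn Sampler used by rstan automates.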

Practical application of HMC is facilitated by the No U-Turn Sampler (NUTS) (Hoffman
and Gelman, 2014), which provides an adaptive way to adjust the stepsize ε and the
number of leapfrog steps L. The No U-Turn Sampler seeks to avoid HMC making
backwards sampling trajectories that get closer to (and hence more correlated with)
the last sample position. Calculation of the gradient of the log posterior is part of the
NUTS implementation, and is facilitated by reverse-mode algorithmic differentiation
(Carpenter et al., 2017).

1.11 Latent Gaussian Models


Latent Gaussian models are a particular variant of the models considered in Section 1.4,
and can be represented as a hierarchical structure containing three stages. At the first
stage is a conditionally independent likelihood function

$$p(y|x, \phi),$$

with a response y (of length n) conditional on a latent field x (usually also of length n),
depending on hyperparameters θ, with sparse precision matrix Qθ, and with ϕ denoting
other parameters relevant to the observation model. The hierarchical model is then

$$y_i|x_i \sim p(y_i|x_i, \phi),$$

$$x_i|\theta \sim p(x|\theta) = N(\cdot, Q_\theta^{-1}),$$

$$\theta, \phi \sim p(\theta)\,p(\phi),$$

with posterior density

$$p(x, \theta, \phi|y) \propto p(\theta)\,p(\phi)\,p(x|\theta) \prod_i p(y_i|x_i, \phi).$$

For example, consider area disease counts, yi ~ Poisson(Eiηi), with

$$\log(\eta_i) = \mu + u_i + s_i,$$

where the $u_i \sim N(0, \sigma_u^2)$ are iid (independent and identically distributed) random errors, and
the si follow an intrinsic autoregressive prior (expressing spatial dependence) with variance
$\sigma_s^2$, that is, $s \sim \mathrm{ICAR}(\sigma_s^2)$. Then x = (η,u,s) is jointly Gaussian with hyperparameters
$(\mu, \sigma_s^2, \sigma_u^2)$.

Integrated nested Laplace approximation (or INLA) is a deterministic algorithm, unlike
stochastic algorithms such as MCMC, designed for estimating latent Gaussian models.
The algorithm is implemented in the R-INLA package, which uses R syntax throughout.
For large samples (over 5,000, say), it provides an effective alternative to MCMC estimation,
but with similar posterior outputs available.
The INLA algorithm focuses on the posterior density of the hyperparameters, π(θ|y),
and on the conditional posterior of the latent field π(xi|θ,y). A Laplace approximation for
the posterior density of the hyperparameters, denoted $\tilde{\pi}(\theta|y)$, and a Taylor approximation
for the conditional posterior of the latent field, denoted $\tilde{\pi}(x_i|\theta, y)$, are used. From these
approximations, marginal posteriors are obtained as

$$\tilde{\pi}(x_i|y) = \int \tilde{\pi}(\theta|y)\,\tilde{\pi}(x_i|\theta, y)\,d\theta,$$

$$\tilde{\pi}(\theta_j|y) = \int \tilde{\pi}(\theta|y)\,d\theta_{[j]},$$

where θ[j] denotes θ excluding θj, and integrations are carried out numerically.

1.12 Assessing Efficiency and Convergence; Ways of Improving Convergence
It is necessary in applying MCMC sampling to decide how many iterations to use to accu-
rately represent the posterior density, and also necessary to ensure that the sampling pro-
cess has converged. Nonvanishing autocorrelations at high lags mean that less information
about the posterior distribution is provided by each iterate, and a higher sample size is
necessary to cover the parameter space. Autocorrelation will be reduced by “thinning”,
namely, retaining only samples that are S > 1 steps apart $\{\theta_h^{(t)}, \theta_h^{(t+S)}, \theta_h^{(t+2S)}, \ldots\}$ that more
closely approximate independent samples; however, this results in a loss of precision. The
autocorrelation present in MCMC samples may depend on the form of parameterisation,
the complexity of the model, and the form of sampling (e.g. block or univariate sampling
for collections of random effects). Autocorrelation will reduce the effective sample size Teff,h
for parameter samples $\{\theta_h^{(t)}, t = B + 1, \ldots, B + T\}$ below T. The effective number of samples
(Kass et al., 1998) may be estimated as

$$T_{eff,h} = T \Big/ \left(1 + 2\sum_{k=1}^{\infty} \rho_{hk}\right),$$

where

$$\rho_{hk} = \gamma_{hk}/\gamma_{h0},$$

is the kth lag autocorrelation, γh0 is the posterior variance V(θh|y), and γhk is the kth lag
autocovariance $\mathrm{cov}[\theta_h^{(t)}, \theta_h^{(t+k)}|y]$. In practice, one may estimate Teff,h by dividing T by
$1 + 2\sum_{k=1}^{K^*} \rho_{hk}$,
where K* is the first lag value for which ρhk < 0.1 or ρhk < 0.05 (Browne et al., 2009).
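
The truncated estimator can be written directly with the base R `acf` function. The sketch below sums the autocorrelations from lag 1 up to K*; the function name `ess_truncated`, the `max_lag` cap, and the AR(1) test series are illustrative assumptions rather than anything taken from the text.

```r
# Effective sample size with the autocorrelation sum truncated at K*, the
# first lag at which the estimated autocorrelation drops below `cutoff`
ess_truncated <- function(x, cutoff = 0.05, max_lag = 200) {
  rho <- acf(x, lag.max = max_lag, plot = FALSE)$acf[-1]   # rho_1, rho_2, ...
  below <- which(rho < cutoff)
  K_star <- if (length(below) > 0) below[1] else length(rho)
  length(x) / (1 + 2 * sum(rho[seq_len(K_star)]))
}

set.seed(5)
# Hypothetical strongly autocorrelated "chain": AR(1) with coefficient 0.9
x <- as.numeric(arima.sim(list(ar = 0.9), n = 10000))
ess_ar <- ess_truncated(x)
```

For an AR(1) series with coefficient 0.9 the untruncated formula gives T(1 − 0.9)/(1 + 0.9), roughly T/19, so the estimate should be in the hundreds rather than near 10,000.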

Also useful for assessing efficiency is the Monte Carlo standard error, which is an
estimate of the standard deviation of the difference between the true posterior mean
$E(\theta_h|y) = \int \theta_h\, p(\theta|y)\,d\theta$, and the simulation-based estimate

$$\bar{\theta}_h = \frac{1}{T}\sum_{t=B+1}^{T+B} \theta_h^{(t)}.$$

A simple estimator of the Monte Carlo variance is

$$\frac{1}{T}\left[\frac{1}{T-1}\sum_{t=B+1}^{T+B} (\theta_h^{(t)} - \bar{\theta}_h)^2\right]$$

though this may be distorted by extreme sampled values; an alternative batch means
method is described by Roberts (1996). The ratio of the posterior variance in a parameter
to its Monte Carlo variance is a measure of the efficiency of the Markov chain sampling
(Roberts, 1996), and it is sometimes suggested that the MC standard error should be less
than 5% of the posterior standard deviation of a parameter (Toft et al., 2007).
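
Both estimators are easy to compute. The sketch below contrasts the simple estimator with a basic batch means version; the function names, the choice of 50 batches, and the AR(1) test series are illustrative assumptions. The simple estimator also treats the draws as independent, so it tends to understate the error for an autocorrelated chain.

```r
# Simple Monte Carlo standard error: sqrt of (sample variance / T)
mcse_naive <- function(x) sqrt(var(x) / length(x))

# Batch means: split the chain into consecutive batches and use the
# variability of the batch means
mcse_batch <- function(x, n_batch = 50) {
  size <- floor(length(x) / n_batch)
  x <- x[seq_len(size * n_batch)]                    # drop any remainder
  batch_means <- colMeans(matrix(x, nrow = size))    # one column per batch
  sqrt(var(batch_means) / n_batch)
}

set.seed(7)
x <- as.numeric(arima.sim(list(ar = 0.9), n = 10000))   # hypothetical chain
se_naive <- mcse_naive(x)
se_batch <- mcse_batch(x)
ratio_to_sd <- se_batch / sd(x)   # compare with the 5% rule of thumb
```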
The effective sample size is mentioned above, while Raftery and Lewis (1992, 1996) esti-
mate the iterations required to estimate posterior summary statistics to a given accuracy.
Suppose the following posterior probability

$$\Pr[\Delta(\theta|y) < b] = p_\Delta,$$

is required. Raftery and Lewis seek estimates of the burn-in iterations B to be discarded,
and the required further iterations T to estimate pΔ to within r with probability s; typical
quantities might be pΔ = 0.025, r = 0.005, and s = 0.95. The selected values of {pΔ,r,s} can also
be used to derive an estimate of the required minimum iterations Tmin if autocorrelation
were absent, with the ratio

I = T/Tmin ,

providing a measure of additional sampling required due to autocorrelation.
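
For the typical settings quoted above, Tmin can be computed in one line, assuming the usual binomial approximation for independent draws (this formula is supplied here as an assumption, not taken from the text):

```r
p_delta <- 0.025; r <- 0.005; s <- 0.95
# Independent draws needed so that the estimate of p_delta is within r
# with probability s (normal approximation to the binomial)
T_min <- ceiling(qnorm((1 + s) / 2)^2 * p_delta * (1 - p_delta) / r^2)
```

A chain that needs, say, 15,000 iterations to reach the same accuracy would then have I = 15000/T_min, a ratio of about 4.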


As to the second issue mentioned above, there is no guarantee that sampling from an
MCMC algorithm will converge to the posterior distribution, despite obtaining a high
number of iterations. Convergence can be informally assessed by examining the time
series or trace plots of parameters. Ideally, the MCMC sampling is exploring the posterior
distribution quickly enough to produce good estimates (this property is often called “good
mixing”). Some techniques for assessing convergence (as against estimates of required
sample sizes) consider samples θ(t) from only a single long chain, possibly after excluding
an initial t = 1, …, B burn-in iterations. These include the spectral density diagnostic of
Geweke (1992), the CUSUM method of Yu and Mykland (1998), and a quantitative measure
of the “hairiness” of the CUSUM plot (Brooks and Roberts, 1998).
Slow convergence (usually combined with poor mixing and high autocorrelation in sam-
pled values) will show in trace plots that wander, and that exhibit short-term trends, rather
than fluctuating rapidly around a stable mean. Failure to converge is typically a feature
of only some model parameters; for example, fixed regression effects in a general linear
mixed model may show convergence, but not the parameters relating to the random com-
ponents. Often measures of overall fit (e.g. model deviance) converge, while component
parameters do not.

Problems of convergence in MCMC sampling may reflect problems in model identifiability,
either formal nonidentification as in multiple random effects models, or poor empirical
identifiability when an overly complex model is applied to a small sample (“over-fitting”).
Choice of diffuse priors tends to increase the chance that models are poorly identified,
especially in complex hierarchical models for small data samples (Gelfand and Sahu, 1999).
Elicitation of more informative priors and/or application of parameter constraints may
assist identification and convergence.
Alternatively, a parameter expansion strategy may also improve MCMC performance
(Gelman et al., 2008; Ghosh, 2008; Browne et al., 2009). For example, in a normal-normal
meta-analysis model (Chapter 4) with

$$y_j \sim N(\mu + \theta_j, \sigma_y^2); \quad \theta_j \sim N(0, \sigma_\theta^2), \quad j = 1, \ldots, J$$

conventional sampling approaches may become trapped near σθ = 0, whereas improved
convergence and effective sample sizes are achieved by introducing a redundant scale
parameter $\lambda \sim N(0, V_\lambda)$

$$y_j \sim N(\mu + \lambda\xi_j, \sigma_y^2),$$

$$\xi_j \sim N(0, \sigma_\xi^2).$$

The expanded model priors induce priors on the original model parameters, namely

$$\theta_j = \lambda\xi_j,$$

$$\sigma_\theta = |\lambda|\,\sigma_\xi.$$

The setting for Vλ is important; too much diffuseness may lead to effective impropriety.
Another source of poor convergence is suboptimal parameterisation or data form.
For example, convergence is improved by centring independent variables in regres-
sion applications (Roberts and Sahu, 2001; Zuur et al., 2002). Similarly, delayed conver-
gence in random effects models may be lessened by sum to zero or corner constraints
(Clayton, 1996; Vines et al., 1996), or by a centred hierarchical prior (Gelfand et al., 1995;
Gelfand et al., 1996), in which the prior on each stochastic variable is a higher level sto-
chastic mean – see the next section. However, the most effective parameterisation may
also depend on the balance in the data between different sources of variation. In fact,
non-centred parameterisations, with latent data independent from hyperparameters,
may be preferable in terms of MCMC convergence in some settings (Papaspiliopoulos
et al., 2003).

1.12.1 Hierarchical Model Parameterisation to Improve Convergence


While priors for unstructured random effects may include a nominal mean of zero, in
practice, a posterior mean of zero for such a set of effects may not be achieved during
MCMC sampling. For example, the mean of the random effects can be confounded with
the intercept, especially when the prior for the random effects does not specify the level
(global mean) of the effects. One may apply a corner constraint by setting a particular ran-
dom effect (say, the first) to a known value, usually zero (Scollnik, 2002). Alternatively, an
empirical sum to zero constraint may be achieved by centring the sampled random effects
at each iteration (sometimes known as “centring on the fly”), so that

$$u_i^{*} = u_i - \bar{u}$$

and inserting $u_i^{*}$ rather than $u_i$ in the model defining the likelihood. Another option
(Vines et al., 1996; Scollnik, 2002) is to define an auxiliary effect $u_i^a \sim N(0, \sigma_u^2)$ and obtain
the $u_i$, following the same prior $N(0, \sigma_u^2)$, but now with a guaranteed mean of zero, by the
transformation

$$u_i = \sqrt{\frac{n}{n-1}}\,(u_i^a - \bar{u}^a).$$
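
A quick simulation can confirm that this transformation gives effects that sum to zero while keeping the marginal variance. The sketch assumes the scaling factor is the square root of n/(n − 1), which is what is needed to restore variance σu² after subtracting the mean; the values of `n` and `sigma_u` are arbitrary.

```r
set.seed(6)
n <- 10; sigma_u <- 2
# Each row is one replicate draw of the n auxiliary effects u^a_1, ..., u^a_n
u_aux <- matrix(rnorm(20000 * n, 0, sigma_u), ncol = n)
# Centre within each replicate, then rescale (assumed factor sqrt(n / (n - 1)))
u <- sqrt(n / (n - 1)) * (u_aux - rowMeans(u_aux))

max_abs_sum <- max(abs(rowSums(u)))   # zero up to rounding error
sd_first    <- sd(u[, 1])             # close to sigma_u
```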
To illustrate a centred hierarchical prior (Gelfand et al., 1995; Browne et al., 2009), consider
two way nested data, with j = 1, … , J repetitions over subjects i = 1, … , n

$$y_{ij} = \mu + \alpha_i + u_{ij},$$

with $\alpha_i \sim N(0, \sigma_\alpha^2)$ and $u_{ij} \sim N(0, \sigma_u^2)$. The centred version defines

$$\kappa_i = \mu + \alpha_i$$

$$y_{ij} = \kappa_i + u_{ij},$$

so that

$$y_{ij} \sim N(\kappa_i, \sigma_u^2),$$

$$\kappa_i \sim N(\mu, \sigma_\alpha^2).$$

For three-way nested data, the standard model form is

$$y_{ijk} = \mu + \alpha_i + \beta_{ij} + u_{ijk},$$

with $\alpha_i \sim N(0, \sigma_\alpha^2)$, and $\beta_{ij} \sim N(0, \sigma_\beta^2)$. The hierarchically centred version defines

$$\zeta_{ij} = \mu + \alpha_i + \beta_{ij},$$

$$\kappa_i = \mu + \alpha_i,$$

so that

$$y_{ijk} \sim N(\zeta_{ij}, \sigma_u^2),$$

$$\zeta_{ij} \sim N(\kappa_i, \sigma_\beta^2),$$

and

$$\kappa_i \sim N(\mu, \sigma_\alpha^2).$$

Roberts and Sahu (1997) set out the contrasting sets of full conditional densities under the
standard and centred representations and compare Gibbs sampling scanning schemes.
Papaspiliopoulos et al. (2003) compare MCMC convergence for centred, noncentred, and
partially non-centred hierarchical model parameterisations according to the amount of
information the data contain about the latent effects $\kappa_i = \mu + \alpha_i$. Thus for two-way nested
data the (fully) non-centred parameterisation, or NCP for short, involves new random
effects $\tilde{\kappa}_i$ with

$$y_{ij} = \tilde{\kappa}_i + \mu + \sigma_u e_{ij},$$

$$\tilde{\kappa}_i = \sigma_\alpha z_i,$$

where eij and zi are standard normal variables. In this form, the latent data $\tilde{\kappa}_i$ and
hyperparameter μ are independent a priori, and so the NCP may give better convergence when the
latent effects κi are not well identified by the observed data y. A partially non-centred form
is obtained using a number w ∈ [0,1], and

$$y_{ij} = \tilde{\kappa}_i^{w} + w\mu + u_{ij},$$

$$\tilde{\kappa}_i^{w} = (1 - w)\mu + \sigma_\alpha z_i,$$

or equivalently,

$$\tilde{\kappa}_i^{w} = (1 - w)\kappa_i + w\tilde{\kappa}_i.$$

Thus w = 0 gives the centred representation, and w = 1 gives the non-centred parameterisa-
tion. The optimal w for convergence depends on the ratio σu/σα. The centred representation
performs best when σu/σα tends to zero, while the non-centred representation is optimal
when σu/σα is large.

1.12.2 Multiple Chain Methods


Many practitioners prefer to use two or more parallel chains with diverse starting values
to ensure full coverage of the sample space of the parameters (Gelman and Rubin, 1996;
Toft et al., 2007). Diverse starting values may be based on default values for parameters (e.g.
precisions set at different default values such as 1, 5, 10 and regression coefficients set at
zero) or on the extreme quantiles of posterior densities from exploratory model runs. Online
monitoring of sampled parameter values $\{\theta_k^{(t)}, t = 1, \ldots, T\}$ from multiple chains k = 1, …, K
assists in diagnosing lack of model identifiability. Examples might be models with multiple
random effects, or when the mean of the random effects is not specified within the prior, as
under difference priors over time or space that are considered in Chapters 5 and 6 (Besag et
al., 1995). Another example is factor and structural equation models where the loadings are
not specified, so as to anchor the factor scores in a consistent direction, since otherwise the
“name” of the common factor may switch during MCMC updating (Congdon, 2003, Chapter
8). Single runs may still be adequate for straightforward problems, and single chain conver-
gence diagnostics (Geweke, 1992) may be applied in this case. Single runs are often useful
for exploring the posterior density, and as a preliminary to obtain inputs to multiple chains.
Convergence for multiple chains may be assessed using Gelman–Rubin scale reduction
factors that measure the convergence of the between chain variance in $\theta_k^{(t)} = (\theta_{1k}^{(t)}, \ldots, \theta_{dk}^{(t)})$
to the variance over all chains k = 1, …, K. These factors converge to 1 if all chains are
sampling identical distributions, whereas for poorly identified models, variability of sam-
pled parameter values between chains will considerably exceed the variability within any
one chain. To apply these criteria, one typically allows a burn-in of B samples while the
sampling moves away from the initial values to the region of the posterior. For iterations
t = B + 1, … , T + B, a pooled estimate of the posterior variance $\sigma^2_{\theta_h|y}$ of θh is

$$\sigma^2_{\theta_h|y} = V_h/T + (T - 1)W_h/T,$$

where variability within chains Wh is defined as

$$W_h = \frac{1}{(T - 1)K}\sum_{k=1}^{K}\sum_{t=B+1}^{B+T} (\theta_{hk}^{(t)} - \bar{\theta}_{hk})^2,$$

with $\bar{\theta}_{hk}$ being the posterior mean of θh in samples from the kth chain, and where

$$V_h = \frac{T}{K - 1}\sum_{k=1}^{K} (\bar{\theta}_{hk} - \bar{\theta}_{h\cdot})^2,$$

denotes between chain variability in θh, with $\bar{\theta}_{h\cdot}$ denoting the pooled average of the $\bar{\theta}_{hk}$.
The potential scale reduction factor compares $\sigma^2_{\theta_h|y}$ with the within sample estimate Wh.
Specifically, the scale factor is $\hat{R}_h = (\sigma^2_{\theta_h|y}/W_h)^{0.5}$ with values under 1.2 indicating
convergence. A multivariate version of the PSRF for vector θ is mentioned by Brooks and Gelman
(1998) and Brooks and Roberts (1998) and involves between and within chain covariances
Vθ and Wθ, and pooled posterior covariance $\Sigma_{\theta|y}$. The scale factor is defined by

$$R_\theta = \max_b \frac{b'\Sigma_{\theta|y}\,b}{b'W_\theta\, b} = \frac{T - 1}{T} + \left(1 + \frac{1}{K}\right)\lambda_1$$

where λ1 is the maximum eigenvalue of $W_\theta^{-1}V_\theta/T$.


An alternative multiple chain convergence criterion, also proposed by Brooks and Gelman
(1998), avoids reliance on the implicit normality assumptions in the Gelman–Rubin
scale reduction factors based on analysis of variance over chains. Normality approximation
may be improved by parameter transformation (e.g. log or logit), but problems may still be
encountered when posterior densities are skewed or possibly multimodal (Toft et al., 2007).
The alternative criterion uses a ratio of parameter interval lengths: for each chain, the length
of the 100(1 − α)% interval for a parameter is obtained, namely the gap between 0.5α and
(1 − 0.5α) points from T simulated values. This provides K within-chain interval lengths, with
mean LU. From the pooled output of TK samples, an analogous interval LP is also obtained.
The ratio LP/LU should converge to 1 if there is convergent mixing over the K chains.
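
Both diagnostics can be computed from a T × K matrix of post burn-in draws for a single parameter. The sketch below is illustrative only: the function names are hypothetical, the pooled variance uses the weight (T − 1)/T on the within-chain term, and the test chains are simulated. In applied work the `gelman.diag` function in the coda package offers a fuller implementation.

```r
# Univariate potential scale reduction factor from a T x K matrix of draws
psrf <- function(chains) {
  T_len <- nrow(chains)
  W <- mean(apply(chains, 2, var))          # within-chain variance W_h
  V <- T_len * var(colMeans(chains))        # between-chain variability V_h
  pooled <- V / T_len + (T_len - 1) * W / T_len
  sqrt(pooled / W)
}

# Interval-length ratio L_P / L_U for 100(1 - alpha)% intervals
interval_ratio <- function(chains, alpha = 0.2) {
  len <- function(x) diff(quantile(x, c(alpha / 2, 1 - alpha / 2)))
  L_U <- mean(apply(chains, 2, len))        # mean within-chain interval length
  L_P <- len(as.vector(chains))             # interval length from pooled draws
  unname(L_P / L_U)
}

set.seed(8)
good <- matrix(rnorm(3000), ncol = 3)        # three chains, same distribution
bad  <- sweep(good, 2, c(0, 3, 6), "+")      # chains stuck in different regions
diagnostics <- rbind(good = c(psrf(good), interval_ratio(good)),
                     bad  = c(psrf(bad),  interval_ratio(bad)))
```

Both measures sit near 1 for the well-mixed chains and rise well above 1.2 when the chains occupy different regions.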

1.13 Choice of Prior Density


Choice of an appropriate prior density, and preferably a sensitivity analysis over alter-
native priors, is fundamental in the Bayesian approach; for example, see Gelman (2006),
Daniels (1999) and Gustafson et al. (2006) on priors for random effect variances. Before
the advent of MCMC methods, conjugate priors were often used in order to reduce the
burden of numeric integration. Now non-conjugate priors (e.g. finite range uniform priors
on standard deviation parameters) are widely used. There may be questions of sensitivity
of posterior inference to the choice of prior, especially for smaller datasets, or for certain
forms of model; examples are the priors used for variance components in random effects
models, the priors used for collections of correlated effects, for example, in hierarchical
spatial models (Bernardinelli et al., 1995), priors in nonlinear models (Millar, 2004), and
priors in discrete mixture models (Green and Richardson, 1997).
In many situations, existing knowledge may be difficult to summarise or elicit in the
form of an “informative prior”. It may be possible to develop suitable priors by simulation
(e.g. Chib and Ergashev, 2009), but it may be convenient to express prior ignorance using
“default” or “non-informative” priors. This is typically less problematic – in terms of
posterior sensitivity – for fixed effects, such as regression coefficients (when taken to be
homogeneous over cases) than for variance parameters. Since the classical maximum likelihood
estimate is obtained without considering priors on the parameters, a possible heuristic is
that a non-informative prior leads to a Bayesian posterior estimate close to the maximum
likelihood estimate. It might appear that a maximum likelihood analysis would therefore
necessarily be approximated by flat or improper priors, but such priors may actually be
unexpectedly informative about different parameter values (Zhu and Lu, 2004).
A flat or uniform prior distribution on θ, expressible as p(θ) = 1, is often adopted on fixed
regression effects, but is not invariant under reparameterisation. For example, it is not true
for ϕ = 1/θ that p(ϕ) = 1, as the prior for a function ϕ = g(θ), namely

$$p(\phi) = \left|\frac{d}{d\phi}\, g^{-1}(\phi)\right|,$$

demonstrates. By contrast, on invariance grounds, Jeffreys (1961) recommended the prior
p(σ) = 1/σ for a standard deviation, as for ϕ = g(σ) = σ² one obtains p(ϕ) = 1/ϕ. More general
analytic rules for deriving noninformative priors include reference prior schemes (Berger
and Bernardo, 1992), and Jeffreys prior

$$p(\theta) \propto |I(\theta)|^{0.5},$$

where I(θ) is the information matrix, namely

$$I(\theta) = -E\left( \frac{\partial^2 l(\theta)}{\partial \theta_g \, \partial \theta_h} \right),$$

and l(θ) = log(L(θ|y)) is the log-likelihood. Unlike uniform priors, a Jeffreys
prior is invariant under transformation of scale since $I(\theta) = I(g(\theta))(g'(\theta))^2$ and
$p(\theta) \propto I(g(\theta))^{0.5} g'(\theta) = p(g(\theta)) g'(\theta)$ (Kass and Wasserman, 1996, p.1345).
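
The non-invariance of a flat prior can be checked numerically. The sketch below is illustrative only: to keep the prior proper, θ is assumed uniform on the arbitrary interval [0.5, 2] (an assumption, not taken from the text), and the implied density of ϕ = 1/θ follows from the change-of-variable formula.

```r
# Flat prior on theta over an assumed interval [lo, hi]; phi = 1/theta
set.seed(1)
lo <- 0.5; hi <- 2
theta <- runif(100000, lo, hi)
phi <- 1 / theta
# Implied density: p(phi) = p(theta) * |d theta / d phi| = (1 / (hi - lo)) / phi^2
p_phi <- function(phi) (1 / (hi - lo)) / phi^2
# Implied CDF of phi on [1/hi, 1/lo]
F_phi <- function(phi) (hi - 1 / phi) / (hi - lo)
# Compare simulated and analytic probabilities that phi < 1
sim_prob <- mean(phi < 1)
ana_prob <- F_phi(1)
c(simulated = sim_prob, analytic = ana_prob)
# The implied prior is not flat: ratio of the density at phi = 0.5 to that at phi = 2
p_phi(0.5) / p_phi(2)
```

Under these assumptions the prior on ϕ is proportional to 1/ϕ², so a prior that is flat for θ favours small values of ϕ.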

1.13.1 Including Evidence
Especially for establishing the intercept (e.g. the average level of a disease), or regression
effects (e.g. the impact of risk factors on disease) or variability in such impacts, it may be pos-
sible to base the prior density on cumulative evidence via meta-analysis of existing studies,
or via elicitation techniques aimed at developing informative priors. This is well established
in engineering risk and reliability assessment, where systematic elicitation approaches such
as maximum-entropy priors are used (Siu and Kelly, 1998; Hodge et al., 2001). Thus, known
constraints for a variable identify a class of possible distributions, and the distribution with
the greatest Shannon–Weaver entropy is selected as the prior. Examples are θ ~ N(m,V), if
estimates m and V of the mean and variance are available, or an exponential with parameter
–q/log(1 − p) if a positive variable has an estimated pth quantile of q.
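
As a small check of the exponential case, the sketch below reads the "parameter" −q/log(1 − p) as the mean of the exponential (an interpretation, since R parameterises by the rate). The values of q and p are hypothetical.

```r
# Maximum-entropy prior for a positive quantity given one elicited quantile:
# an exponential density with mean -q / log(1 - p)
q <- 10      # elicited quantile (hypothetical value)
p <- 0.9     # probability level (hypothetical value)
prior_mean <- -q / log(1 - p)
# R's exponential functions take rate = 1 / mean
prob_below_q <- pexp(q, rate = 1 / prior_mean)
c(prior_mean = prior_mean, prob_below_q = prob_below_q)
```

The probability below q recovers p, which confirms that the implied exponential matches the elicited quantile.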
Simple approximate elicitation methods include the histogram technique, which divides
the domain of an unknown θ into a set of bins, and elicits prior probabilities that θ is
located in each bin. Then p(θ) may be represented as a discrete prior or converted to a
smooth density. Prior elicitation may be aided if a prior is reparameterised in the form
of a mean and prior sample size. For example, beta priors Be(a,b) for probabilities can be
expressed as Be(mτ,(1 − m)τ), where m = a/(a + b) and τ = a + b are elicited estimates of the
mean probability and prior sample size. This principle is extended in data augmentation
priors (Greenland and Christensen, 2001), while Greenland (2007) uses the device of a
prior data stratum (equivalent to data augmentation) to represent the effect of binary risk
factors in logistic regressions in epidemiology.
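
The mean and prior sample size form of a beta prior can be sketched as follows. The helper name `beta_from_mean_size` and all numerical values are hypothetical, chosen only to show how the prior sample size acts as a weight in the conjugate binomial update.

```r
# Convert an elicited mean m and prior sample size tau into Be(a, b)
beta_from_mean_size <- function(m, tau) {
  stopifnot(m > 0, m < 1, tau > 0)
  c(a = m * tau, b = (1 - m) * tau)
}
m <- 0.2; tau <- 10                 # hypothetical elicited values
ab <- beta_from_mean_size(m, tau)
# Conjugate update with hypothetical binomial data: y successes in n trials
y <- 7; n <- 20
post <- c(a = ab[["a"]] + y, b = ab[["b"]] + n - y)
post_mean <- post[["a"]] / sum(post)
# The posterior mean is a weighted average of the prior mean and the sample
# proportion, with weights tau and n
weighted <- (tau * m + n * (y / n)) / (tau + n)
c(posterior_mean = post_mean, weighted_average = weighted)
```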
If a set of existing studies is available providing evidence on the likely density of a
parameter, these may be used in a form of preliminary meta-analysis to set up an infor-
mative prior for the current study. However, there may be limits to the applicability of
existing studies to the current data, and so pooled information from previous studies may
be downweighted. For example, the precision of the pooled estimate from previous stud-
ies may be scaled downwards, with the scaling factor possibly an extra unknown. When a
maximum likelihood (ML) analysis is simple to apply, one option is to adopt the ML mean
as a prior mean, but with the ML precision matrix downweighted (Birkes and Dodge, 1993).
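
One simple way to downweight pooled evidence is sketched below, under stated assumptions: the three study estimates and standard errors are invented, the pooling is fixed-effect (inverse-variance), and the scaling factor k is fixed rather than treated as an extra unknown.

```r
# Hypothetical estimates and standard errors from three earlier studies
est <- c(0.42, 0.30, 0.55)
se  <- c(0.10, 0.15, 0.20)
# Fixed-effect (inverse-variance) pooling
w <- 1 / se^2
pooled_mean <- sum(w * est) / sum(w)
pooled_prec <- sum(w)
# Downweight the pooled precision by an assumed factor k in (0, 1]
k <- 0.25
prior_mean <- pooled_mean
prior_sd   <- sqrt(1 / (k * pooled_prec))
c(prior_mean = prior_mean, pooled_sd = sqrt(1 / pooled_prec), prior_sd = prior_sd)
```

With k = 0.25 the prior standard deviation is twice the pooled standard error, so the historical evidence counts for a quarter of its nominal precision.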
More comprehensive ways of downweighting historical/prior evidence have been proposed,
such as power prior models (Chen et al., 2000; Ibrahim and Chen, 2000). Let 0 ≤ δ ≤ 1
be a scale parameter with beta prior that weights the likelihood of historical data y_h relative
to the likelihood of the current study data y. Following Chen et al. (2000, p.124), a power
prior has the form

$$p(\theta, \delta | y_h) \propto [p(y_h|\theta)]^{\delta} \, [\delta^{a_\delta - 1}(1 - \delta)^{b_\delta - 1}] \, p(\theta),$$

where p(y_h|θ) is the likelihood for the historical data, and (a_δ, b_δ) are pre-specified beta
density hyperparameters. The joint posterior density for (θ,δ) is then

$$p(\theta, \delta | y, y_h) \propto p(y|\theta) \, [p(y_h|\theta)]^{\delta} \, [\delta^{a_\delta - 1}(1 - \delta)^{b_\delta - 1}] \, p(\theta).$$

Chen and Ibrahim (2006) demonstrate connections between the power prior and conven-
tional priors for hierarchical models.
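
The effect of the weight δ is easiest to see in a conjugate case. The sketch below is a simplification rather than the full model above: δ is held fixed instead of being given a beta prior, the data are binomial, and the initial prior p(θ) is assumed to be Be(a0, b0). The function name and data values are hypothetical.

```r
# Power prior with FIXED delta for binomial data and a Be(a0, b0) initial prior.
# Historical data: yh successes in nh trials; current data: y successes in n trials.
power_prior_beta <- function(delta, yh, nh, y, n, a0 = 1, b0 = 1) {
  stopifnot(delta >= 0, delta <= 1)
  a <- a0 + delta * yh + y
  b <- b0 + delta * (nh - yh) + (n - y)
  c(a = a, b = b, mean = a / (a + b))
}
# Hypothetical data: historical 30/100, current 12/25
res <- sapply(c(0, 0.5, 1), power_prior_beta, yh = 30, nh = 100, y = 12, n = 25)
colnames(res) <- c("delta=0", "delta=0.5", "delta=1")
round(res, 3)
```

Setting δ = 0 discards the historical data, while δ = 1 pools them fully with the current data; intermediate values pull the posterior mean part of the way towards the historical proportion.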

1.13.2 Assessing Posterior Sensitivity; Robust Priors


To assess sensitivity to prior assumptions, the analysis may be repeated over a limited
range of alternative priors. Thus Sargent (1998) and Fahrmeir and Knorr-Held (1997, section
3.2) suggest a gamma prior on precisions 1/τ² governing random walk effects (e.g.
baseline hazard rates in survival analysis), namely 1/τ² ~ Ga(a,b), where a is set at 1, but b is
varied over choices such as 0.05 or 0.0005. One possible strategy involves a consideration of
both optimistic and conservative priors, with regard, say, to a treatment effect, or the pres-
ence of significant random effect variation (Spiegelhalter, 2004; Gustafson et al., 2006).
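
Before rerunning an analysis over alternative priors, it can help to see what each choice implies on the standard deviation scale. The sketch below assumes Ga(a,b) denotes shape a and rate b, and the helper name `sd_quantiles` is hypothetical.

```r
# Implied prior quantiles of the standard deviation tau when 1/tau^2 ~ Ga(a, b)
sd_quantiles <- function(b, a = 1, probs = c(0.025, 0.5, 0.975)) {
  # upper quantiles of the precision correspond to lower quantiles of the sd
  prec_q <- qgamma(1 - probs, shape = a, rate = b)
  1 / sqrt(prec_q)
}
tab <- t(sapply(c(0.05, 0.0005), sd_quantiles))
dimnames(tab) <- list(c("b=0.05", "b=0.0005"), c("2.5%", "50%", "97.5%"))
tab
```

Reducing b by a factor of 100 shrinks every prior quantile of τ by a factor of 10, so the two settings express quite different prior beliefs about the smoothness of the random walk.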
Another relevant principle in multiple effect models is that of uniform shrinkage gov-
erning the proportion of total random variation to be assigned to each source of variation
(Daniels, 1999; Natarajan and Kass, 2000). So, for a two-level normal linear model with

$$y_{ij} = x_{ij}\beta + \eta_j + e_{ij},$$

with $e_{ij} \sim N(0, \sigma^2)$ and $\eta_j \sim N(0, \tau^2)$, one prior (e.g. inverse gamma) might relate to the
residual variance σ², and a second conditional U(0,1) prior relates to the ratio $\tau^2/(\tau^2 + \sigma^2)$
of cluster to total variance. A similar effect is achieved in structural time series models
(Harvey, 1989) by considering different forms of signal to noise ratios in state space models
including several forms of random effect (e.g. changing levels and slopes, as well as season
effects). Gustafson et al. (2006) propose a conservative prior for the one-level linear mixed
model

$$y_i \sim N(\eta_i, \sigma^2),$$

$$\eta_i \sim N(\mu, \tau^2),$$

namely a conditional prior p(τ²|σ²) aiming to prevent over-estimation of τ². Thus, in full,

$$p(\sigma^2, \tau^2) = p(\sigma^2) \, p(\tau^2|\sigma^2),$$

where σ² ~ IG(e,e) for some small e > 0, and

$$p(\tau^2|\sigma^2) = \frac{a}{\sigma^2} \left[ 1 + \tau^2/\sigma^2 \right]^{-(a+1)}.$$

The case a = 1 corresponds to the uniform shrinkage prior of Daniels (1999), where

$$p(\tau^2|\sigma^2) = \frac{\sigma^2}{[\sigma^2 + \tau^2]^2},$$

while larger values of a (e.g. a = 5) are found to be relatively conservative.
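
The conditional prior can be examined numerically. In the sketch below the residual variance σ² = 2 is a hypothetical value, and the closed-form CDF 1 − (1 + τ²/σ²)^(−a) is obtained by integrating the density given above.

```r
# Conditional prior p(tau2 | sigma2) = (a / sigma2) * (1 + tau2 / sigma2)^(-(a + 1))
p_tau2 <- function(tau2, sigma2, a = 1) (a / sigma2) * (1 + tau2 / sigma2)^(-(a + 1))
sigma2 <- 2   # hypothetical residual variance
# The density integrates to one for a = 1 and a = 5
int1 <- integrate(p_tau2, 0, Inf, sigma2 = sigma2, a = 1)$value
int5 <- integrate(p_tau2, 0, Inf, sigma2 = sigma2, a = 5)$value
# For a = 1 the ratio tau2 / (tau2 + sigma2) is U(0, 1): draw the ratio,
# back-transform to tau2, and compare with the analytic CDF
set.seed(1)
ratio <- runif(100000)
tau2 <- sigma2 * ratio / (1 - ratio)
cdf <- function(t, a = 1) 1 - (1 + t / sigma2)^(-a)
c(integral_a1 = int1, integral_a5 = int5,
  simulated = mean(tau2 <= sigma2), analytic = cdf(sigma2))
# Prior probability that tau2 exceeds sigma2 under a = 1 and a = 5
c(a1 = 1 - cdf(sigma2, 1), a5 = 1 - cdf(sigma2, 5))
```

Under a = 5 the prior places far less mass on cluster variances larger than the residual variance, which is the sense in which it is conservative.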


For covariance matrices Σ between random effects of dimension k, the emphasis in recent
research has been on more flexible priors than afforded by the inverse Wishart (or Wishart
priors for precision matrices). Barnard et al. (2000) and Liechty et al. (2004) consider a sepa-
ration strategy whereby

Σ = diag(S).R.diag(S),

where S is a k × 1 vector of standard deviations, and R is a k × k correlation matrix. With
the prior sequence, p(R,S) = p(R|S)p(S), Barnard et al. suggest log(S) ~ N_k(ξ,Λ), where Λ is
usually diagonal. For the elements r_ij of R, constrained beta sampling on [−1,1] can be
used subject to positive definiteness constraints on Σ. Daniels and Kass (1999) consider
the transformation η_ij = 0.5 log[(1 − r_ij)/(1 + r_ij)] and suggest an exchangeable hierarchical
shrinkage prior, η_ij ~ N(0,τ²), where

$$p(\tau^2) \propto (c + \tau^2)^{-2}; \quad c = 1/(k - 3).$$
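
The separation Σ = diag(S).R.diag(S) is straightforward to construct and invert. In the sketch below the dimension, the lognormal settings for S, and the correlation matrix R are all hypothetical choices for illustration.

```r
set.seed(1)
k <- 3
# Standard deviations from a lognormal prior with assumed log-mean 0 and log-sd 0.5
S <- exp(rnorm(k, 0, 0.5))
# A hypothetical correlation matrix (chosen to be positive definite)
R <- matrix(c( 1.0, 0.3, -0.2,
               0.3, 1.0,  0.5,
              -0.2, 0.5,  1.0), k, k)
Sigma <- diag(S) %*% R %*% diag(S)
# Recover the two components from Sigma
sd_ok  <- isTRUE(all.equal(sqrt(diag(Sigma)), S))
cor_ok <- isTRUE(all.equal(cov2cor(Sigma), R))
# Positive definiteness check via the smallest eigenvalue
pd_ok <- min(eigen(Sigma, symmetric = TRUE)$values) > 0
c(sd_recovered = sd_ok, cor_recovered = cor_ok, positive_definite = pd_ok)
```

Because S and R are recovered exactly from Σ, priors can be placed on the standard deviations and the correlations separately, as the separation strategy intends.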
A separation strategy is also facilitated by the LKJ prior of Lewandowski et al. (2009) and
included in the rstan package (McElreath, 2016). While a full covariance prior (e.g. assum-
ing random slopes on all k predictors in a multilevel model) can be applied from the out-
set, MacNab et al. (2004) propose an incremental model strategy, starting with random
intercepts and slopes but without covariation between them, in order to assess for which
predictors there is significant slope variation. The next step applies a full covariance model
only for the predictors showing significant slope variation.
Formal approaches to prior robustness may be based on “contamination” priors. For
instance, one might assume a two group mixture with larger probability 1 − r on the
“main” prior p1(θ), and a smaller probability such as r = 0.1 on a contaminating density p2(θ),
which may be any density (Gustafson, 1996). More generally, a sensitivity analysis may
involve some form of mixture of priors, for example, a discrete mixture over a few alterna-
tives, a fully non-parametric approach (see Chapter 4), or a Dirichlet weight mixture over
a small range of alternatives (e.g. Jullion and Lambert, 2007). A mixture prior can include
the option that the parameter is not present (e.g. that a variance or regression effect is zero).
A mixture prior methodology of this kind for regression effects is presented by George
and McCulloch (1993). Increasingly also, random effects models are selective, including
a default allowing for random effects to be unnecessary (Albert and Chib, 1997; Cai and
Dunson, 2006; Fruhwirth-Schnatter and Tuchler, 2008).
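
A contamination prior is simple to write down and sample from. In the sketch below both components are assumed normal, with a N(0,1) main prior and a much wider N(0,10²) contaminating density; these choices are illustrative and not taken from the cited papers.

```r
# Contamination prior: (1 - r) * p1(theta) + r * p2(theta)
r <- 0.1
d_contam <- function(theta) (1 - r) * dnorm(theta, 0, 1) + r * dnorm(theta, 0, 10)
# Sampling: choose the component first, then draw from it
set.seed(1)
nsim <- 100000
from_p2 <- rbinom(nsim, 1, r) == 1
theta <- ifelse(from_p2, rnorm(nsim, 0, 10), rnorm(nsim, 0, 1))
# Tail probability P(|theta| > 3) under the main prior alone and under the mixture
tail_main <- 2 * pnorm(-3, 0, 1)
tail_mix  <- (1 - r) * 2 * pnorm(-3, 0, 1) + r * 2 * pnorm(-3, 0, 10)
c(main = tail_main, mixture = tail_mix, simulated = mean(abs(theta) > 3))
```

Even a 10% contaminating weight raises the prior tail probability substantially, so inferences are less easily dominated by the main prior when the data conflict with it.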
In hierarchical models, the prior specifies the form of the random effects (fully
exchangeable over units or spatially/temporally structured), the density of the random
effects (normal, mixture of normals, etc.), and the third stage hyperparameters. The form
of the second stage prior p(b|θb) amounts to a hypothesis about the nature and form of
the random effects. Thus, a hierarchical model for small area mortality may include spa-
tially structured random effects, exchangeable random effects with no spatial pattern, or
both, as under the convolution prior of Besag et al. (1991). It also may assume normality
in the different random effects, as against heavier tailed alternatives. A prior specifying
the errors as spatially correlated and normal is likely to be a working model assumption,
rather than a true cumulation of knowledge, and one may have several models for p(b|θb)
being compared (Disease Mapping Collaborative Group, 2000), with sensitivity not just
being assessed on the hyperparameters.
Random effect models often start with a normal hyperdensity, and so posterior infer-
ences may be sensitive to outliers or multiple modes, as well as to the prior used on the
hyperparameters. Indications of lack of fit (e.g. low conditional predictive ordinates for par-
ticular cases) may suggest robustification of the random effects prior. Robust hierarchical
models are adapted to pooling inferences and/or smoothing in data, subject to outliers or
other irregularities; for example, Jonsen et al. (2006) consider robust space-time state-space
models with Student t rather than normal errors in an analysis of travel rates of migrating
leatherback turtles. Other forms of robust analysis involve discrete mixtures of random
effects (e.g. Lenk and Desarbo, 2000), possibly under Dirichlet or Polya process models (e.g.
Kleinman and Ibrahim, 1998). Robustification of hierarchical models reduces the chance of
incorrect inferences on individual effects, important when random effects approaches are
used to identify excess risk or poor outcomes (Conlon and Louis, 1999; Marshall et al., 2004).

1.13.3 Problems in Prior Selection in Hierarchical Bayes Models


For the third stage parameters (the hyperparameters) in hierarchical models, choice of a
diffuse noninformative prior may be problematic, as improper priors may induce improper
posteriors that prevent MCMC convergence, since conditions necessary for convergence
(e.g. positive recurrence) may be violated (Berger et al., 2005). This may apply even if con-
ditional densities are proper, and Gibbs or other MCMC sampling proceeds apparently
straightforwardly. A simple example is provided by the normal two-level model with
subjects i = 1, …, n nested in clusters j = 1, …, J,

$$y_{ij} = \mu + \theta_j + u_{ij},$$

where $\theta_j \sim N(0, \tau^2)$ and $u_{ij} \sim N(0, \sigma^2)$. Hobert and Casella (1996) show that the posterior
distribution is improper under the prior $p(\mu, \tau, \sigma) = 1/(\sigma^2\tau^2)$, even though the full conditionals
have standard forms, namely

$$p(\theta_j | y, \mu, \sigma^2, \tau^2) = N\left( \frac{n(\bar{y}_j - \mu)}{n + \sigma^2/\tau^2}, \; \frac{1}{n/\sigma^2 + 1/\tau^2} \right),$$

$$p(\mu | y, \sigma^2, \tau^2, \theta) = N\left( \bar{y} - \bar{\theta}, \; \frac{\sigma^2}{nJ} \right),$$

$$p(1/\tau^2 | y, \mu, \sigma^2, \theta) = Ga\left( \frac{J}{2}, \; 0.5 \sum_j \theta_j^2 \right),$$

$$p(1/\sigma^2 | y, \mu, \tau^2, \theta) = Ga\left( \frac{nJ}{2}, \; 0.5 \sum_{ij} (y_{ij} - \mu - \theta_j)^2 \right),$$

so that Gibbs sampling could in principle proceed.
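
The four full conditionals translate directly into code, which illustrates why the impropriety is easy to miss. The sketch below uses simulated data with hypothetical true values (μ = 1, τ = 2, σ = 1) and a short run; the fact that it produces finite output says nothing about propriety of the posterior.

```r
set.seed(1)
J <- 10; n <- 5
# Simulated data; column j of y holds the n observations in cluster j
theta_true <- rnorm(J, 0, 2)
y <- matrix(1 + rep(theta_true, each = n) + rnorm(n * J, 0, 1), n, J)
ybar_j <- colMeans(y); ybar <- mean(y)
iters <- 500
out <- matrix(NA, iters, 3, dimnames = list(NULL, c("mu", "tau2", "sigma2")))
mu <- ybar; tau2 <- 1; sigma2 <- 1
for (t in 1:iters) {
  # theta_j | . ~ N(n(ybar_j - mu)/(n + sigma2/tau2), 1/(n/sigma2 + 1/tau2))
  theta <- rnorm(J, n * (ybar_j - mu) / (n + sigma2 / tau2),
                 sqrt(1 / (n / sigma2 + 1 / tau2)))
  # mu | . ~ N(ybar - mean(theta), sigma2/(nJ))
  mu <- rnorm(1, ybar - mean(theta), sqrt(sigma2 / (n * J)))
  # 1/tau2 | . ~ Ga(J/2, 0.5 * sum(theta_j^2))
  tau2 <- 1 / rgamma(1, J / 2, rate = 0.5 * sum(theta^2))
  # 1/sigma2 | . ~ Ga(nJ/2, 0.5 * sum((y_ij - mu - theta_j)^2))
  resid <- sweep(y, 2, theta) - mu
  sigma2 <- 1 / rgamma(1, n * J / 2, rate = 0.5 * sum(resid^2))
  out[t, ] <- c(mu, tau2, sigma2)
}
apply(out, 2, quantile, probs = c(0.025, 0.5, 0.975))
```

Summaries from such a run can look entirely plausible, which is the practical danger: nothing in the sampler output signals that the target distribution does not exist.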


Whether posterior propriety holds depends on the level of information in the data,
whether additional constraints are applied to parameters in MCMC updating, and the
nature of the improper prior used. For example, Rodrigues and Assunção (2008) demonstrate
propriety in the posterior of spatially varying regression parameter models under
a class of improper priors. More generally, Markov random field (MRF) priors such as
random walks in time, or spatial conditional autoregressive priors (Chapters 5 and 6), may
have joint forms that are improper, with a singular covariance matrix – see, for example,
the discussion by Sun et al. (2000, pp.28–30). The joint prior only identifies differences
between pairs of effects, and unless additional constraints are applied to the random
effects, this may cause issues with posterior propriety.
It is possible to define proper priors in these cases by introducing autoregression param-
eters (Sun et al., 1999), but Besag et al. (1995, p.11) mention that “the sole impropriety in
such [MRF] priors is that of an arbitrary level and is removed from the corresponding
posterior distribution by the presence of any informative data”. The indeterminacy in the
level is usually resolved by applying “centring on the fly” (at each MCMC iteration) within
each set of random effects, and under such a linear constraint, MRF priors become proper
(Rodrigues and Assunção, 2008, p.2409). Alternatively, “corner” constraints on particular
effects, namely, setting them to fixed values (usually zero), may be applied (Clayton, 1996;
Koop, 2003, p.248), while Chib and Jeliazkov (2006) suggest an approach to obtaining pro-
priety in random walk priors.
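
The level indeterminacy of a first-order random walk prior, and the effect of centring, can be seen from its structure matrix. The sketch below is illustrative: the number of effects m = 6 is arbitrary, and the "draw" of effects is just a simulated random walk.

```r
m <- 6
D <- diff(diag(m))          # (m - 1) x m first-difference matrix
K <- t(D) %*% D             # structure (precision) matrix of the RW1 prior
ev <- eigen(K, symmetric = TRUE)$values
# Exactly one zero eigenvalue: the prior says nothing about the overall level
round(ev, 6)
# K maps constant vectors to zero, so shifting all effects by a constant
# leaves the prior density unchanged
level_shift <- K %*% rep(1, m)
# "Centring on the fly": subtract the mean of the effects at each iteration
set.seed(1)
b <- cumsum(rnorm(m))       # stand-in for one sampled set of effects
b_centred <- b - mean(b)
c(zero_eigenvalues = sum(abs(ev) < 1e-8), sum_after_centring = sum(b_centred))
```

The single zero eigenvalue corresponds to the arbitrary level; the sum-to-zero constraint removes exactly that direction.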
Priors that are just proper mathematically (e.g. gamma priors on 1/τ² with small scale
and shape parameters) are often used on the grounds of expediency, and justified as letting
the data speak for themselves. However, such priors may cause identifiability problems as
the posteriors are close to being empirically improper. This impedes MCMC convergence
(Kass and Wasserman, 1996; Gelfand and Sahu, 1999). Furthermore, using just proper pri-
ors on variance parameters may in fact favour particular values, despite being suppos-
edly only weakly informative. Gelman (2006) suggests possible (less problematic) options
including a finite range uniform prior on the standard deviation (rather than variance),
and a positive truncated t density.
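
A simulation makes the point about just proper priors concrete. The sketch below compares three priors on the scale of the standard deviation τ; the uniform range (0, 100), the half-t settings (3 degrees of freedom, scale 5), the Ga(0.001, 0.001) precision prior, and the "moderate" interval (0.5, 20) are all assumptions chosen for illustration.

```r
set.seed(1)
nsim <- 100000
# (a) Uniform prior on the standard deviation over an assumed range (0, 100)
tau_unif <- runif(nsim, 0, 100)
# (b) Positive truncated (half) t prior with assumed 3 df and scale 5
tau_halft <- abs(5 * rt(nsim, df = 3))
# (c) Just proper Ga(0.001, 0.001) prior on the precision 1/tau^2; implied sd
#     (draws that underflow to zero precision give tau = Inf)
prec <- rgamma(nsim, shape = 0.001, rate = 0.001)
tau_gamma <- 1 / sqrt(prec)
# Prior probability that tau lies in a moderate range, here (0.5, 20)
in_range <- function(x) mean(x > 0.5 & x < 20)
c(uniform = in_range(tau_unif), half_t = in_range(tau_halft),
  gamma = in_range(tau_gamma))
```

Under these settings the gamma prior on the precision puts almost no mass on moderate values of τ, concentrating instead on extremely large values, whereas the other two priors spread their mass over a plausible range.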

1.14 Computational Notes

[1] In Example 1.1, the data are generated (n = 1000 values) and underlying parameters
are estimated as follows:

    library(mcmcse)
    library(MASS)
    # coda supplies effectiveSize(); loaded directly rather than via R2WinBUGS
    library(coda)
    # generate data
    set.seed(1234)
    y = rnorm(1000,3,5)
    # initial vector setting and parameter values
    T = 10000; B = T/10; B1=B+1
    mu = sig = numeric(T)
    # initial parameter values
    mu[1] = 0
    sig[1] = 1
    # separate uniform draws for the mu and sigma acceptance steps
    u.mu = runif(T); u.sig = runif(T)
    # rejection counter
    REJmu = 0; REJsig = 0
    # log posterior density (up to a constant)
    logpost = function(mu,sig){
    loglike = sum(dnorm(y,mu,sig,log=TRUE))
    return(loglike - log(sig))}
    # sampling loop
    for (t in 2:T) {print(t)
    mut = mu[t-1]; sigt = sig[t-1]
    # uniform proposals with kappa = 0.5
    mucand = mut + runif(1,-0.5,0.5)
    sigcand = abs(sigt + runif(1,-0.5,0.5))
    alph.mu = logpost(mucand,sigt)-logpost(mut,sigt)
    if (log(u.mu[t]) <= alph.mu) mu[t] = mucand
    else {mu[t] = mut; REJmu = REJmu+1}
    alph.sig = logpost(mu[t],sigcand)-logpost(mu[t],sigt)
    if (log(u.sig[t]) <= alph.sig) sig[t] = sigcand
    else {sig[t] <- sigt; REJsig <- REJsig+1}}
    # sequence of sampled values and ACF plots
    plot(mu)
    plot(sig)
    acf(mu,main="acf plot, mu")
    acf(sig,main="acf plot, sig")
    # posterior summaries
    summary(mu[B1:T])
    summary(sig[B1:T])
    # Monte Carlo standard errors
    D=data.frame(mu[B1:T],sig[B1:T])
    mcse.mat(D)
    # acceptance rates
    ACCmu=1-REJmu/T
    ACCsig=1-REJsig/T
    cat("Acceptance Rate mu =",ACCmu,"\n")
    cat("Acceptance Rate sigma = ",ACCsig,"\n")
    # kernel density plots
    plot(density(mu[B1:T]),main= "Density plot for mu posterior")
    plot(density(sig[B1:T]),main= "Density plot for sigma posterior ")
    f1=kde2d(mu[B1:T], sig[B1:T], n=50, lims=c(2.5,3.4,4.7,5.3))
    # colour version
    filled.contour(f1,main="Figure 1.1 Bivariate Density", xlab="mu", ylab="sigma",
      color.palette=colorRampPalette(c('white','blue','yellow','red','darkred')))
    # greyscale version
    filled.contour(f1,main="Figure 1.1 Bivariate Density",xlab="mu", ylab="sigma",
      color.palette=colorRampPalette(c('white','lightgray','gray','darkgray','black')))
    # estimates of effective sample sizes
    effectiveSize(mu[B1:T])
    effectiveSize(sig[B1:T])
    ess(D)
    multiESS(D)
    # posterior probability on hypothesis μ < 3
    sum(mu[B1:T] < 3)/(T-B)

[2] The R code for Metropolis sampling of the extended logistic model is

    library(coda)

    # data
    w = c(1.6907, 1.7242, 1.7552, 1.7842, 1.8113, 1.8369, 1.8610, 1.8839)
    n = c(59, 60, 62, 56, 63, 59, 62, 60)
    y = c(6, 13, 18, 28, 52, 53, 61, 60)
    # posterior density
    f = function(mu,th2,th3) {
    # settings for priors
    a0=0.25; b0=0.25; c0=2; d0=10; e0=2.004; f0=0.001
    V = exp(th3)
    m1 = exp(th2)
    sig = sqrt(V)
    x = (w-mu)/sig
    xt = exp(x)/(1+exp(x))
    h = xt^m1
    loglike = y*log(h)+(n-y)*log(1-h)
    # prior ordinates
    logpriorm1 = a0*th2-m1*b0
    logpriorV = -e0*th3-f0/V
    logpriormu = -0.5*((mu-c0)/d0)^2-log(d0)
    logprior = logpriormu+logpriorV+logpriorm1
    # log posterior
    f = sum(loglike)+logprior}
    # main MCMC loop
    runMCMC = function(samp,mu,th2,th3,T,sd) {
    # row 1 holds the initial values; note that 2:T+1 would parse as (2:T)+1
    for (i in 2:T) {
    # candidates for mu
    mucand = mu[i-1]+sd[1]*rnorm(1,0,1)
    f.cand = f(mucand,th2[i-1],th3[i-1])
    f.curr = f(mu[i-1], th2[i-1],th3[i-1])
    if (log(runif(1)) <= f.cand-f.curr) mu[i] = mucand else
    {mu[i] = mu[i-1]}
    # candidates for log(m1)
    th2cand = th2[i-1]+sd[2]*rnorm(1,0,1)
    f.cand = f(mu[i],th2cand,th3[i-1])
    f.curr = f(mu[i],th2[i-1], th3[i-1])
    if (log(runif(1)) <= f.cand-f.curr) th2[i] = th2cand else
    {th2[i] = th2[i-1]}
    # candidates for log(V)
    th3cand = th3[i-1]+sd[3]*rnorm(1,0,1)
    f.cand = f(mu[i],th2[i],th3cand)
    f.curr = f(mu[i],th2[i],th3[i-1])
    if (log(runif(1)) <= f.cand-f.curr) th3[i] = th3cand else
    {th3[i] = th3[i-1]}
    # store mu, m1 = exp(th2) and V = exp(th3)
    samp[i,1] = mu[i]; samp[i,2] = exp(th2[i]); samp[i,3] = exp(th3[i])}
    return(samp)}
    # number of iterations
    T=100000
    # warm-up samples
    B=50000
    B1=B+1
    R=T-B
    mu=th3=th2=numeric(T)
    sd=acc=numeric(3)
    # metropolis proposal standard devns
    sd[1] = 0.01; sd[2] = 0.2; sd[3] = 0.4
    # accumulate samples
    samp = matrix(,T,3)
    # initial parameter values
    mu[1] = 0; th2[1]= 0; th3[1] =0
    samp[1,1] = mu[1]; samp[1,2] = exp(th2[1]); samp[1,3] = exp(th3[1])
    # first chain
    chain1=runMCMC(samp,mu,th2,th3,T,sd)
    chain1=chain1[B1:T,]
    # posterior summary
    quantile(chain1[1:R,1], probs=c(.025,0.5,0.975))
    quantile(chain1[1:R,2], probs=c(.025,0.5,0.975))
    quantile(chain1[1:R,3], probs=c(.025,0.5,0.975))
    # second chain
    chain2=runMCMC(samp,mu,th2,th3,T,sd)
    chain2=chain2[B1:T,]
    # posterior summary
    quantile(chain2[1:R,1], probs=c(.025,0.5,0.975))
    quantile(chain2[1:R,2], probs=c(.025,0.5,0.975))
    quantile(chain2[1:R,3], probs=c(.025,0.5,0.975))
    # combine chains
    chain1=as.mcmc(chain1)
    chain2=as.mcmc(chain2)
    combchains = mcmc.list(chain1, chain2)
    gelman.diag(combchains)
    crosscorr(combchains)
    accsum = "Acceptance rates: mu, m1, and sigma^2"
    print(accsum)
    1 - rejectionRate(combchains)
    effectiveSize(combchains)
    autocorr.diag(combchains)

[3] The rstan code for the beetle mortality example is

    library(rstan)
    library(bayesplot)
    library(coda)
    # data
    w = c(1.6907, 1.7242, 1.7552, 1.7842, 1.8113, 1.8369, 1.8610, 1.8839)
    n = c(59, 60, 62, 56, 63, 59, 62, 60)
    y = c(6, 13, 18, 28, 52, 53, 61, 60)
    D=list(y=y,n=n,w=w,N=8)
    # rstan code
    model ="
    data {
    int<lower=0> N;
    int n[N];
    int y[N];
    real w[N];
    }
    parameters {
    real <lower=0> mu;
    real log_sigma;
    real log_m1;
    }
    transformed parameters {
    real<lower=0> sigma;
    real<lower=0> sigma2;
    real<lower=0> m1;
    real x[N];
    real pi[N];
    sigma=exp(log_sigma);
    sigma2=sigma^2;
    m1=exp(log_m1);
    for (i in 1:N) {x[i]=(w[i]-mu)/sigma;}
    for (i in 1:N) {pi[i]=pow(exp(x[i])/(1+exp(x[i])),m1);}
    }
    model {
    log_sigma ~normal(0,5);
    mu ~normal(2,3.16);
    log_m1 ~normal(0,1);
    // pi[i] is already a probability, so binomial is used rather than binomial_logit
    for (i in 1:N) {y[i] ~binomial(n[i], pi[i]);}


    }
    "
    
    fit=stan(model_code = model,data=D, iter = 2500,warmup = 250,chains=2,seed=10)
    # posterior summary
    print(fit,digits=6)
    # bivariate density plots
    color_scheme_set("gray")
    afit= as.array(fit)
    
    mcmc_pairs(afit, pars = c("mu", "m1", "sigma"), off_diag_args = list(size = 1.5))
    # MCMC diagnostics
    # as.array() keeps the iterations x chains x parameters layout indexed below
    samps <- as.array(fit,pars= c("mu", "m1", "sigma"))
    samps <- mcmc.list(lapply(1:ncol(samps), function(x) mcmc(as.array(samps)[,x,])))
    crosscorr(samps)
    effectiveSize(samps)
    autocorr.diag(samps)

[4] The R code for analysis of the turtle survival data is


    library(bridgesampling)
    options(scipen=999)
    data("turtles")
    y=turtles$y
    x=turtles$x
    C=turtles$clutch
    N = length(y)
    J = length(unique(C))
    # posterior density function
    f = function(beta,alpha,tau,e) {sig = 1/sqrt(tau)
    # survival model
    for (i in 1:N){p[i] = pnorm(alpha+beta*x[i]+e[C[i]])
    LL[i] = y[i]*log(p[i])+(1-y[i])*log(1-p[i])}
    # prior ordinates
    logpr[1] = -0.5*alpha^2/10
    logpr[2] = -0.5*beta^2/10
    logpr[3] = -0.001*tau
    for (j in 1:J){LLr[j] = -0.5*e[j]^2/sig^2-log(sig)}
    # log-posterior
    f = sum(LL[1:N])+sum(LLr[1:J])+sum(logpr[1:3])}
    # MCMC settings
    T = 5000
    # warm up
    B = T/10
    # accumulate M-H rejections for hyperparameters
    k1 = 0; k2 = 0; k3 = 0
    # gamma parameter for precision updates
    kappa=100
    # uniform samples for use in hyperparameter updates
    # separate draws for each of the three hyperparameter updates
    U1 = log(runif(T)); U2 = log(runif(T)); U3 = log(runif(T))
    # define arrays
    
    alpha = numeric(T); beta = numeric(T); tau = numeric(T); logpr = numeric(3)
    s = numeric(T); p = numeric(N); e = numeric(J); LL = numeric(N)
    # ec stores sampled cluster effects; en is the candidate vector at one iteration
    LLr = numeric(J); ec = matrix(0,T,J); en = numeric(J)
    kran = numeric(J)
    # initial parameter values (cluster effects start at zero)
    beta[1]= 0.35; alpha[1]= -2.6; tau[1]= 5; s[1] = 1/sqrt(tau[1])
    # main loop
    for (t in 2:T) {
    # update beta
    bstar = beta[t-1]+0.05*rnorm(1,0,1)
    tn = f(bstar,alpha[t-1],tau[t-1],ec[t-1,])
    tf = f(beta[t-1],alpha[t-1],tau[t-1],ec[t-1,])
    if (U1[t] <= tn-tf) beta[t] = bstar
    else {beta[t] = beta[t-1]; k1 = k1+1}
    # update intercept
    astar = alpha[t-1]+0.5*rnorm(1,0,1)
    tn = f(beta[t],astar,tau[t-1],ec[t-1,])
    tf = f(beta[t],alpha[t-1],tau[t-1],ec[t-1,])
    if (U2[t] <= tn-tf) alpha[t] = astar
    else {alpha[t] = alpha[t-1]; k2 = k2+1}
    # update precision: gamma proposal centred on current value, with Hastings correction
    taustar = rgamma(1,kappa,kappa/tau[t-1])
    tn = f(beta[t],alpha[t],taustar,ec[t-1,])+log(dgamma(tau[t-1],kappa,kappa/taustar))
    tc = f(beta[t],alpha[t],tau[t-1],ec[t-1,])+log(dgamma(taustar,kappa,kappa/tau[t-1]))
    # compare candidate (tn) with current (tc), not the stale tf from the intercept step
    if (U3[t] <= tn-tc) tau[t] = taustar
    else {tau[t] = tau[t-1]; k3 = k3+1}
    s[t] = 1/sqrt(tau[t])
    # update cluster effects one at a time
    ec[t,] = ec[t-1,]
    for (j in 1:J) {en = ec[t,]
    en[j] = ec[t,j]+rnorm(1,0,1)
    tn = f(beta[t],alpha[t],tau[t],en); tf = f(beta[t],alpha[t],tau[t],ec[t,])
    if (log(runif(1)) <= tn-tf) ec[t,j] = en[j]
    else kran[j] = kran[j]+1}}
    # hyperparameter summaries
    quantile(alpha[B:T], probs=c(.025,0.5,0.975))
    quantile(beta[B:T], probs=c(.025,0.5,0.975))
    quantile(tau[B:T], probs=c(.025,0.5,0.975))
    quantile(s[B:T], probs=c(.025,0.5,0.975))
    # random effects posterior medians and quantiles
    eff.mdn = apply(ec[B:T,], 2, quantile, probs = c(0.50))
    eff.q975=apply(ec[B:T,], 2, quantile, probs = c(0.975))
    eff.q025=apply(ec[B:T,], 2, quantile, probs = c(0.025))
    eff.q90=apply(ec[B:T,], 2, quantile, probs = c(0.90))
    eff.q10=apply(ec[B:T,], 2, quantile, probs = c(0.10))
    # number of significant 80% credible intervals for random effects
    sum(eff.q90>0 & eff.q10 >0)+ sum(eff.q90<0 & eff.q10 <0)
    # acceptance rates for hyperparameters (beta, alpha, tau.b)
    1-k1/T; 1-k2/T; 1-k3/T
    # acceptance rates for cluster effects
    1-kran/T
[5] There are J+2 unknowns in the R code (N.B. the σ_j² are not unknowns) for implementing
these Gibbs updates. There are T=20000 MCMC samples to be accumulated
in the matrix samps. With a = b = 0.1 in the prior for 1/τ², and calling on coda
routines for posterior summaries, one has

    library(coda)
    # data
    y=c(28,8,-3,7,-1,1,18,12)
    sigma=c(15,10,16,11,9,11,10,18)
    sigma2 = sigma^2
    J = 8
    # total MCMC iterations
    T = 20000
    # ten unknowns (eight effects, plus their mean and variance)
    samps = matrix(, T, 10)
    colnames(samps) <- c("mu","tau","Sch1","Sch2","Sch3","Sch4","Sch5",
      "Sch6","Sch7","Sch8")
    # starting values
    mu=mean(y)
    tau2=median(sigma2)
    # sampling loop
    for (t in 1:T) {th.mean=(y/sigma2+mu/tau2)/(1/sigma2+1/tau2)
    th.sd=sqrt(1/(1/sigma2+1/tau2))
    theta=rnorm(J,th.mean,th.sd)
    mu=rnorm(1,mean(theta),sqrt(tau2/J))
    # prior on random effects precision
    invtau2=rgamma(1,J/2+0.1,sum((theta-mu)^2)/2+0.1)
    tau2 = 1/invtau2
    tau = sqrt(tau2)
    # accumulate samples
    samps[t,3:10] = theta
    samps[t,1] =mu
    samps[t,2] =tau}
    # posterior summary
    summary(as.mcmc(samps))
    post.mn = apply(samps,2,mean)
    post.sd = apply(samps,2,sd)
    post.median = apply(samps,2,median)
    post.95=apply(samps, 2, quantile, probs = c(0.95))
    post.05=apply(samps, 2, quantile, probs = c(0.05))
    # trace and density plots
    plot(as.mcmc(samps))

References
Albert J (2007) Bayesian Computation with R. Springer.
Albert J, Chib S (1993) Bayesian analysis of binary and polychotomous response data. Journal of the
American Statistical Association, 88, 669–679.
Albert J, Chib S (1997) Bayesian tests and model diagnostics in conditionally independent hierarchi-
cal models. Journal of the American Statistical Association, 92, 916–925.
Altaleb A, Chauveau D (2002) Bayesian analysis of the logit model and comparison of two Metropolis–
Hastings strategies. Computational Statistics & Data Analysis, 39, 137–152.
Andrieu C, Moulines E (2006) On the ergodicity properties of some adaptive MCMC algorithms.
Annals of Applied Probability, 16(3), 1462–1505.
Barnard J, McCulloch R, Meng X (2000) Modeling covariance matrices in terms of standard devia-
tions and correlations, with applications to shrinkage. Statistica Sinica, 10, 1281–1311.
Bedard M (2008) Optimal acceptance rates for Metropolis algorithms: Moving beyond 0.234. Stochastic
Processes and their Applications, 118(12), 2198–2222.
Berger J, Bernardo J (1992) On the development of reference priors, in Bayesian Statistics 4, pp 35–60,
eds J Bernardo, J Berger, A Dawid, A Smith. Clarendon Press, Oxford.
Berger J, Strawderman W, Tang D (2005) Posterior propriety and admissibility of hyperpriors in nor-
mal hierarchical models. Annals of Statistics 33, 606–646.
Bernardinelli L, Clayton D, Montomoli C (1995) Bayesian estimates of disease maps: How important
are priors? Statistics in Medicine 14, 2411–2431.
Besag J, Green P, Higdon D, Mengersen K (1995) Bayesian computation and stochastic systems.
Statistical Science, 10(1), 103–166.
Besag J, York J, Mollie A (1991) Bayesian image restoration, with two applications in spatial statistics.
Annals of the Institute of Statistical Mathematics, 43, 1–21.
Birkes D, Dodge Y (1993) Alternative Methods of Regression. John Wiley.
Brooks S, Gelman A (1998) Alternative methods for monitoring convergence of iterative simulations.
Journal of Computational and Graphical Statistics, 7, 434–456.
Brooks S, Roberts G (1998) Convergence assessment techniques for Markov chain Monte Carlo.
Statistics and Computing, 8, 319–335.
Browne W, Steele F, Golalizadeh M (2009) The use of simple reparameterizations to improve the
efficiency of Markov chain Monte Carlo estimation for multilevel models with applications to
discrete time survival models. Journal of the Royal Statistical Society: Series A, 172, 579–598.
Cai B, Dunson D (2006) Bayesian covariance selection in generalized linear mixed models. Biometrics,
62, 446–457.
Carlin B, Gelfand A (1991) An iterative Monte Carlo method for nonconjugate Bayesian analysis.
Statistics and Computing, 1(2), 119–128.
Carpenter B, Gelman A, Hoffman M, Lee D, Goodrich B, Betancourt M, Brubaker M, Guo J, Li P,
Riddell A (2017) Stan: A probabilistic programming language. Journal of Statistical Software,
76(1), 1–32.
Chen M, Wang X (2011) Approximate predictive densities and their applications in generalized linear
models. Computational Statistics & Data Analysis, 55(4), 1570–1580.
Chen M-H, Ibrahim J (2006) The relationship between the power prior and hierarchical models.
Bayesian Analysis, 1, 551–574.
Chen M-H, Ibrahim J, Shao Q-M (2000) Power prior distributions for generalized linear models.
Journal of Statistical Planning and Inference, 84, 121–137.
Chen M-H, Shao Q-M (1998) Monte Carlo estimation of Bayesian credible and HPD intervals. Journal
of Computational & Graphical Statistics, 8(1), 69–92.
Chiang J, Chib S, Narasimhan C (1999) Markov chain Monte Carlo and models of consideration set
and parameter heterogeneity. Journal of Econometrics, 89, 223–248.
Chib S (2001) Monte Carlo methods and Bayesian computation: Overview, in International Encyclopedia
of the Social & Behavioral Sciences. https://doi.org/10.1016/B0-08-043076-7/00467-8
Chib S, Ergashev B (2009) Analysis of multifactor affine yield curve models. Journal of the American
Statistical Association, 104(488), 1324–1337.
Chib S, Greenberg E (1995) Understanding the Metropolis-Hastings algorithm. The American
Statistician, 49, 327–335.
Chib S, Jeliazkov I (2006) Inference in semiparametric dynamic models for binary longitudinal data.
Journal of the American Statistical Association, 101, 685–700.
Clark J, Gelfand A (eds) (2006) Hierarchical Modelling for the Environmental Sciences: Statistical Methods
and Applications. Oxford University Press.
Clayton D (1996) Generalized linear mixed models, in: Markov Chain Monte Carlo in Practice, eds W
Gilks, S Richardson, D Spiegelhalter. Chapman & Hall, London, UK.
Congdon P (2003) Applied Bayesian Modelling. Wiley, Chichester, UK.
Conlon E, Louis T (1999) Addressing multiple goals in evaluating region-specific risk using Bayesian
methods, pp 31–47, in Disease Mapping and Risk Assessment for Public Health, eds A Lawson, A
Biggeri, D Bohning, E Lesaffre, J Viel, R Bertollini. Wiley.
Cressie N, Calder C A, Clark J S, Hoef J M V, Wikle C K (2009) Accounting for uncertainty in eco-
logical analysis: The strengths and limitations of hierarchical statistical modeling. Ecological
Applications, 19(3), 553–570.
Daniels M (1999) A prior for the variance in hierarchical models. Canadian Journal of Statistics, 27,
569–580.
Daniels M, Kass R (1999) Nonconjugate Bayesian Estimation of Covariance matrices and its use in
hierarchical models. Journal of the American Statistical Association, 94, 1254–1263.
Davidian M, Giltinan D M (2003) Nonlinear models for repeated measures data: An overview and
update. Journal of Agricultural, Biological, and Environmental Statistics, 8, 387–419.
Deely J, Smith A (1998) Quantitative refinements for comparisons of institutional performance.
Journal of the Royal Statistical Society, Series A, 161, 5–12.
Disease Mapping Collaborative Group (2000) Disease mapping models: An empirical evaluation.
Statistics in Medicine, 19, 2217–2241.
Dunson D (2001) Commentary: Practical advantages of Bayesian analysis of epidemiologic data.
American Journal of Epidemiology, 153, 1222–1226.
Fahrmeir L, Knorr-Held L (1997) Dynamic discrete-time duration models. Sociological Methodology,
27, 417–452.
Fox J-P (2010) Bayesian Item Response Modeling: Theory and Applications. Springer.
Fruhwirth-Schnatter S, Tuchler R (2008) Bayesian parsimonious covariance estimation for hierarchi-
cal linear mixed models. Statistics & Computing, 18, 1–13.
Fuglstad G, Simpson D, Lindgren F, Rue H (2018) Constructing priors that penalize the complexity of
Gaussian random fields. Journal of the American Statistical Association, 114(525), 445–452.
Gelfand A, Sahu S (1999) Identifiability, improper priors, and Gibbs sampling for generalized linear
models. Journal of the American Statistical Association, 94, 247–253.
Gelfand A, Sahu S, Carlin B (1995) Efficient parameterization for normal linear mixed models.
Biometrika, 82, 479–488.
Gelfand A, Sahu S, Carlin B (1996) Efficient parameterizations for generalised linear models, in
Bayesian Statistics 5, pp 165–180, eds J Bernardo, J Berger, A Dawid, A Smith. Clarendon Press,
Oxford, UK.
Gelfand A, Smith A (1990) Sampling-based approaches to calculating marginal densities. Journal of
the American Statistical Association, 85, 398–409.
Gelman A (2006) Prior distributions for variance parameters in hierarchical models. Bayesian Analysis,
1, 515–533.
Gelman A, Rubin D (1996) Markov chain Monte Carlo methods in biostatistics. Statistical Methods in
Medical Research, 5, 339–355.
Gelman A, Stern H, Carlin J, Dunson D, Vehtari A, Rubin D (2014) Bayesian Data Analysis, 3rd Edition.
Chapman and Hall/CRC.
Gelman A, van Dyk D, Huang Z, Boscardin J (2008) Using redundant parameterizations to fit hierar-
chical models. Journal of Computational and Graphical Statistics, 17, 95–122.
George E, Makov U, Smith A (1993) Conjugate likelihood distributions. Scandinavian Journal of
Statistics, 20, 147–156.
George E, McCulloch R (1993) Variable selection via Gibbs sampling. Journal of the American Statistical
Association, 88(423), 881–889.
Geweke J (1992) Evaluating the accuracy of sampling-based approaches to calculating posterior
moments, in Bayesian Statistics, Volume 4. eds J Bernardo, J Berger, A Dawid, A Smith. Oxford
University Press, New York.
Geweke J (1993) Bayesian treatment of the Student’s-t linear model. Journal of Applied Econometrics, 8, S19–S40.
Geyer C, Thompson E (1995) Annealing Markov chain Monte Carlo with applications to ancestral
inference. Journal of the American Statistical Association, 90, 909–920.
Ghosh J (2008) Efficient Bayesian Computation and Model Search in Linear Hierarchical Models.
PhD Thesis ISDS, Duke University.
Gilks W (1996) Full conditional distributions, in Markov Chain Monte Carlo in Practice, pp 75–88, eds
W Gilks, S Richardson, D Spiegelhalter. Chapman and Hall, London, UK.
Gilks W, Richardson S, Spiegelhalter D (1996) Introducing Markov chain Monte Carlo, in Markov
Chain Monte Carlo in Practice, pp 1–19, eds W Gilks, S Richardson, D Spiegelhalter. Chapman
and Hall, London, UK.
Gilks W, Wang C, Yvonnet B, Coursaget P (1993) Random-effects models for longitudinal data using
Gibbs sampling. Biometrics, 38, 963–974.
Goldstein H, Spiegelhalter D (1996) League tables and their limitations: Statistical issues in com-
parisons of institutional performance. Journal of the Royal Statistical Society: Series A (Statistics in
Society), 159(3), 385–409.
Green P, Richardson S (1997) On Bayesian analysis of mixtures with an unknown number of compo-
nents. Journal of the Royal Statistical Society: Series B, 59, 731–792.
Greenland S (2007) Bayesian perspectives for epidemiological research. II. Regression analysis.
International Journal of Epidemiology, 36, 195–202.
Greenland S, Christensen R (2001) Data augmentation priors for Bayesian and semi-Bayes analyses of
conditional-logistic and proportional-hazards regression. Statistics in Medicine, 20, 2421–2428.
Gustafson P (1996) Local sensitivity of inferences to prior marginals. Journal of the American Statistical
Association, 91, 774–781.
Gustafson P, Hossain S, MacNab Y (2006) Conservative priors for hierarchical models. Canadian
Journal of Statistics, 34, 377–390.
Hadjicostas P, Berry S (1999) Improper and proper posteriors with improper priors in a Poisson-
gamma hierarchical model. Test, 8, 147–166.
Harvey A (1989) Structural Time Series Models and the Kalman Filter. Cambridge University Press.
Hastings W (1970) Monte-Carlo sampling methods using Markov Chains and their applications.
Biometrika, 57, 97–109.
Hobert J, Casella G (1996) The effect of improper priors on Gibbs sampling in hierarchical linear
mixed models. Journal of the American Statistical Association, 91, 1461–1473.
Hodge R, Evans M, Marshall J, Quigley J, Walls L (2001) Eliciting engineering knowledge about reli-
ability during design-lessons learnt from implementation. Quality and Reliability Engineering
International, 17, 169–179.
Hoffman M, Gelman A (2014) The No-U-turn sampler: Adaptively setting path lengths in Hamiltonian
Monte Carlo. Journal of Machine Learning Research, 15(1), 1593–1623.
Hyndman R (1996) Computing and graphing highest density regions. American Statistician, 50,
361–365.
Ibrahim J, Chen M-H (2000) Power prior distributions for regression models. Statistical Science, 15,
46–60.
Jeffreys H (1961) Theory of Probability, 3rd Edition. Oxford University Press, Clarendon Press, Oxford,
UK.
Johannes M, Polson N (2006) MCMC methods for continuous-time financial econometrics, in
Handbook of Financial Econometrics, eds Y Ait-Sahalia, L Hansen. North Holland, Amsterdam.
Jonsen I, Myers R, James M (2006) Robust hierarchical state–space models reveal diel variation in
travel rates of migrating leatherback turtles. Journal of Animal Ecology, 75, 1046–1057.
Jullion A, Lambert P (2007) Robust specification of the roughness penalty prior distribution in
spatially adaptive Bayesian P-splines models. Computational Statistics and Data Analysis, 51,
2542–2558.
Kass R, Carlin B, Gelman A, Neal R (1998) Markov chain Monte Carlo in practice: A round table dis-
cussion. The American Statistician, 52, 93–100.
Kass R, Wasserman L (1996) The selection of prior distributions by formal rules. Journal of the American
Statistical Association, 91, 1343–1370.
Kleinman K, Ibrahim J (1998) A semiparametric Bayesian approach to the random effects model.
Biometrics, 54, 921–938.
Klement R, Bandyopadhyay P, Champ C, Walach H (2018) Application of Bayesian evidence syn-
thesis to modelling the effect of ketogenic therapy on survival of high grade glioma patients.
Theoretical Biology and Medical Modelling, 15(1), 12.
Knorr-Held L, Rainer E (2001) Projections of lung cancer mortality in West Germany: A case study in
Bayesian prediction. Biostatistics, 2, 109–129.
Koop G (2003) Bayesian Econometrics. John Wiley.
Krypotos A, Blanken T, Arnaudova I, Matzke D, Beckers T (2017) A primer on Bayesian analysis for
experimental psychopathologists. Journal of Experimental Psychopathology, 8(2), jep-057316.
Laird N, Louis T (1989) Empirical Bayes confidence intervals for a series of related experiments.
Biometrics, 45(2), 481–495.
Lenk P, DeSarbo W (2000) Bayesian inference for finite mixture models of generalized linear models
with random effects. Psychometrika, 65, 475–496.
Lewandowski D, Kurowicka D, Joe H (2009) Generating random correlation matrices based on vines
and extended onion method. Journal of Multivariate Analysis, 100(9), 1989–2001.
Liechty J, Liechty M, Muller P (2004) Bayesian correlation estimation. Biometrika, 91, 1–14.
Lindley D, Smith A (1972) Bayes estimates for the linear model. Journal of the Royal Statistical
Society, B34, 1–41.
MacNab Y, Qiu Z, Gustafson P, Dean C, Ohlsson A, Lee S (2004) Hierarchical Bayes analysis of mul-
tilevel health services data: A Canadian neonatal mortality study. Health Services and Outcomes
Research Methodology, 5, 5–26.
Marshall C, Best N, Bottle A, Aylin P (2004) Statistical issues in the prospective monitoring of health
outcomes across multiple units. Journal of the Royal Statistical Society: Series A, 167, 541–559.
Marshall E, Spiegelhalter D (1998) Comparing institutional performance using Markov chain Monte
Carlo methods, pp 229–249, in Statistical Analysis of Medical Data: New Developments, eds B
Everitt, G Dunn. Arnold, London, UK.
McElreath R (2016) Statistical Rethinking: A Bayesian Course with Examples in R and Stan. CRC Press.
Mengersen K, Tweedie R (1996) Rates of convergence of the Hastings and Metropolis algorithms. The
Annals of Statistics, 24, 101–121.
Millar R (2004) Sensitivity of Bayes estimators to hyper-parameters, with an application to maximum
yield from fisheries. Biometrics, 60, 536–542.
Molenberghs G, Verbeke G, Demetrio C (2007) An extended random-effects approach to modelling
repeated, overdispersed count data. Lifetime Data Analysis, 13, 513–531.
Monnahan C C, Thorson J T, Branch T A (2017) Faster estimation of Bayesian models in ecology using
Hamiltonian Monte Carlo. Methods in Ecology and Evolution, 8(3), 339–348.
Natarajan R, Kass R (2000) Reference Bayesian methods for generalized linear mixed models. Journal
of the American Statistical Association, 95, 227–237.
Neal R (2011) MCMC using Hamiltonian dynamics, Chapter 5, in Handbook of Markov Chain Monte
Carlo, eds S Brooks, A Gelman, G Jones, X-L Meng. CRC Press.
Oravecz Z, Muth C (2018) Fitting growth curve models in the Bayesian framework. Psychonomic
Bulletin and Review, 25(1), 235–255.
Paap R (2002) What are the advantages of MCMC based inference in latent variable models? Statistica
Neerlandica, 56, 2–22.
Palmer J, Pettit L (1996) Risks of using improper priors with Gibbs sampling and autocorrelated
errors. Journal of Computational and Graphical Statistics, 5, 245–249.
Papaspiliopoulos O, Roberts G, Skold M (2003) Non-centered parameterisations for hierarchical
models and data augmentation, pp 307–326, in Bayesian Statistics 7, eds J Bernardo, S Bayarri, J
Berger, A Dawid, D Heckerman, A Smith, M West. Oxford University Press.
Raftery A (1996) Approximate Bayes factors and accounting for model uncertainty in generalized
linear models. Biometrika, 83, 251–266.
Raftery A, Lewis S (1992) One long run with diagnostics: Implementation strategies for Markov
chain Monte Carlo. Statistical Science, 7, 493–497.
Raftery A, Lewis S (1996) The number of iterations, convergence diagnostics and generic Metropolis
algorithms, in Practical Markov Chain Monte Carlo, eds W Gilks, D Spiegelhalter, S Richardson.
Chapman & Hall, London, UK.
Robert C (2015) The Metropolis–Hastings Algorithm. Wiley StatsRef: Statistics Reference Online, pp
1–15. https://onlinelibrary.wiley.com/doi/full/10.1002/9781118445112.stat07834
Robert C, Elvira V, Tawn N, Wu C (2018) Accelerating MCMC algorithms. WIRES Computational
Statistics, 10, e1435.
Roberts G, Gelman A, Gilks W (1997) Weak convergence and optimal scaling of random walk
Metropolis algorithms. The Annals of Applied Probability, 7, 110–120.
Roberts G, Rosenthal J (2004) General state space Markov chains and MCMC algorithms. Probability
Surveys, 1, 20–71.
Roberts G, Sahu S (1997) Updating schemes, correlation structures, blocking and parameterization of
the Gibbs sampler. Journal of the Royal Statistical Society B, 59, 291–317.
Roberts G, Sahu S (2001) Approximate predetermined convergence properties of the Gibbs sampler.
Journal of Computational and Graphical Statistics, 10, 216–229.
Roberts G, Tweedie R (1996) Geometric convergence and central limit theorems for multidimen-
sional Hastings and Metropolis algorithms. Biometrika, 83, 95–110.
Rodrigues A, Assuncao R (2008) Propriety of posterior in Bayesian space varying parameter models
with normal data. Statistics & Probability Letters, 78, 2408–2411.
Rue H, Martino S, Chopin N (2009) Approximate Bayesian inference for latent Gaussian models
using integrated nested Laplace approximations. Journal of the Royal Statistical Society, Series B,
71(2), 319–392.
Sargent D (1998) A general framework for random effects survival analysis in the Cox proportional
hazards setting. Biometrics, 54(4), 1486–1497.
Scollnik D (2002) Implementation of four models for outstanding liabilities in WinBUGS: A discussion
of a paper by Ntzoufras and Dellaportas (2002). North American Actuarial Journal, 6, 128–136.
Shen W, Louis T (1998) Triple-goal estimates in two-stage hierarchical models. Journal of the Royal
Statistical Society: Series B, 60, 455–471.
Sherlock C, Fearnhead P, Roberts G (2010) The random walk Metropolis: Linking theory and practice
through a case study. Statistical Science, 25(2), 172–190.
Shoemaker J, Painter I, Weir B (1999) Bayesian statistics in genetics: A guide for the uninitiated. Trends
in Genetics, 15, 354–358.
Simpson D, Rue H, Riebler A, Martins T, Sørbye S (2017) Penalising model component complexity: A
principled, practical approach to constructing priors. Statistical Science, 32(1), 1–28.
Sinharay S, Stern H (2005) An empirical comparison of methods for computing Bayes factors in
generalized linear mixed models. Journal of Computational and Graphical Statistics, 14, 415–435.
Siu N, Kelly D (1998) Bayesian parameter estimation in probabilistic risk assessment. Reliability
Engineering and System Safety, 62, 89–116.
Spiegelhalter D (2004) Incorporating Bayesian Ideas into Health-Care evaluation. Statistical Science,
19, 156–174.
Sun D, Speckman P, Tsutakawa R (2000) Random effects in generalized linear mixed models
(GLMMs), pp 23–39, in Generalized Linear Models: A Bayesian Perspective, eds D Dey, S Ghosh, B
Mallick. Dekker, New York.
Sun D, Tsutakawa R, Speckman P (1999) Posterior distribution of hierarchical models using CAR(1)
distributions. Biometrika, 86, 341–350.
Tierney L (1994) Markov Chains for exploring posterior distributions. Annals of Statistics, 21,
1701–1762.
Toft N, Innocent G, Gettinby G, Reid S (2007) Assessing the convergence of Markov Chain Monte
Carlo methods: An example from evaluation of diagnostic tests in absence of a gold standard.
Preventive Veterinary Medicine, 79, 244–256.
van Dyk D (2003) Hierarchical models, data augmentation, and Markov chain Monte Carlo, pp 41–56,
in Statistical Challenges in Modern Astronomy III, eds G Babu, E Feigelson. Springer, New York.
Vanpaemel W (2011) Constructing informative model priors using hierarchical methods. Journal of
Mathematical Psychology, 55(1), 106–117.
Vines S, Gilks W, Wild P (1996) Fitting Bayesian multiple random effects models. Statistics and
Computing, 6, 337–346.
Wetzels R, van Ravenzwaaij D, Wagenmakers E (2014) Bayesian analysis, in The Encyclopedia of Clinical
Psychology, eds R Cautin, S Lilienfeld. Wiley-Blackwell, Hoboken, NJ.
Wikle C (2003) Hierarchical models in environmental science. International Statistical Review, 71,
181–199.
Willink R, Lira I (2005) A united interpretation of different uncertainty intervals. Measurement, 38,
61–66.
Yu B, Mykland P (1998) Looking at Markov samplers through cusum path plots: A simple diagnostic
idea. Statistics and Computing, 8(3), 275–286.
Yue Y, Speckman P, Sun D (2012) Priors for Bayesian adaptive spline smoothing. Annals of the Institute
of Statistical Mathematics, 64(3), 577–613.
Zhu M, Lu A (2004) The counter-intuitive non-informative prior for the Bernoulli family. Journal of
Statistics Education [Online], 12(2).
Zuur G, Garthwaite P, Fryer R (2002) Practical use of MCMC methods: Lessons from a case study.
Biometrical Journal, 44, 433–455.
2
Bayesian Analysis Options in R, and
Coding for BUGS, JAGS, and Stan

2.1 Introduction
R, available at https://cran.r-project.org/, is an integrated suite of software facilities for
data manipulation, statistical analysis, and graphical display (R Core Team, 2016). The
advantages of the R environment for Bayesian analysis are considerable, including access
to extensive graphical capabilities (e.g. ggplot) and data manipulation facilities; a range
of posterior diagnostic and summarisation tools; and the ability to obtain classical estimates
in tandem with a full Bayesian analysis. A full list of packages in R is available at
https://cran.r-project.org/web/packages/available_packages_by_name.html and
www.onlinetoolz.com/tools/r-packages.php, while Bayesian analysis packages are listed at
https://cran.r-project.org/web/views/Bayesian.html.
Worked examples in subsequent chapters focus primarily on three options for generic
Bayesian analysis in R, based on user-defined program code. Implementation in R uses
interfaces for BUGS such as R2OpenBUGS and R2MultiBUGS, for JAGS (e.g. rjags, runjags,
jagsUI), and for Stan (rstan). The LaplacesDemon package (CRAN, 2018) also offers
Bayesian estimation options, with entirely R based user code. A number of packages use
one or more of BUGS, JAGS, or Stan as a basis for coding and computation, but provide
extra compilation checks, posterior summarisation, or data analysis options. Thus, the
rube package (Seltman, 2016) interfaces with BUGS and JAGS to provide additional compi-
lation details to assist with code debugging, while MCMCvis (Youngflesh, 2017) provides
tools for posterior summarisation and visualisation which can be applied across all three
generic options. The Nimble package aims to update BUGS and retain its functionality in
the R environment (de Valpine et al., 2017), while R2MultiBUGS is a recently developed
alternative to R2OpenBUGS and links to MultiBUGS (Goudie et al., 2019). Comparative
analyses of some of these packages include Li et al. (2018) and Monnahan et al. (2017).
A range of application packages not requiring user-defined code adapted to the applica-
tion is available. These have a different design philosophy to the generic coding options,
using MCMC algorithms that are model-specific and hence likely to be more efficient
(Martin and Quinn, 2006). As one example, bamlss (Bayesian Additive Models for Location,
Scale, and Shape) enables Bayesian estimation of generalised linear models, additive
regression, and spatial models (Umlauf et al., 2018). MCMCpack (Martin et al., 2011) allows
estimation of generalised linear models, change-point models, quantile linear regression,
and certain latent variable models. The rstanarm package (Gabry and Goodrich, 2018) uses
Stan as a basis for estimation, but using simplified functions: for example, the stan_glm
function to represent generalised linear models. The R-INLA package uses the Integrated
Nested Laplace Approximation as a computationally effective alternative to MCMC estimation
and is illustrated in later examples. It is applicable to a range of generalised mixed
models which can be represented in hierarchical form with latent Gaussian effects.

2.2 Coding in BUGS and for R Libraries Calling on BUGS


R interfaces to BUGS include R2OpenBUGS, rube, R2WinBUGS, and R2MultiBUGS.
These use unmodified BUGS coding principles, as used in the standalone WinBUGS
and OpenBUGS packages, and in the more recent MultiBUGS. For a review of BUGS, see
Lykou and Ntzoufras (2011).
As also applies to JAGS, a BUGS program is declarative, namely, a description of the
model and of the parameters or other stochastic nodes that may be monitored at each
MCMC step. Like JAGS (but unlike Stan), there is no prescribed order for particular code
elements. Thus, the code specifying prior densities may precede or follow the code specifying
the model and likelihood. A wide variety of worked examples, including program
code and suggested initial value settings (for unknown parameters), are available at
www.openbugs.net/Examples/Volumei.html and www.openbugs.net/Examples/Volumeii.html.
Many coding elements in BUGS are R-like, such as for-loops. However, unlike R, the
specification of the univariate normal density (in both BUGS and JAGS) is in terms of
mean and inverse variance (precision), with the multivariate normal being parameterised
in terms of the precision matrix.
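
Because the normal density in BUGS and JAGS takes a precision while R's own functions take a standard deviation, it can help to convert between the two before writing priors. The following R sketch is illustrative only; the helper names prec_to_sd and sd_to_prec are hypothetical and not part of any package, and the covariance matrix is made up.

```r
# Hypothetical helpers (not from any package): convert between the
# precision used by dnorm() in BUGS/JAGS and the standard deviation used in R.
prec_to_sd <- function(tau) 1 / sqrt(tau)
sd_to_prec <- function(sigma) 1 / sigma^2

# A BUGS prior such as beta ~ dnorm(0, 0.001) has precision 0.001,
# i.e. variance 1000 and a standard deviation of about 31.6.
sd_beta <- prec_to_sd(0.001)
round(sd_beta, 1)

# Multivariate case: the precision matrix is the inverse of the covariance matrix.
Sigma <- matrix(c(2, 0.5, 0.5, 1), 2, 2)  # illustrative covariance matrix
Omega <- solve(Sigma)                     # corresponding precision matrix
```

A diffuse-looking precision of 0.001 therefore corresponds to a prior standard deviation of roughly 31.6, which is worth checking against the scale of the data.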
As an example, consider a BUGS code for a normal linear regression model with responses
$y = (y_1, \ldots, y_n)$ and a single predictor $x = (x_1, \ldots, x_n)$. We consider in particular the ereturns
data from the R package heavy (for heavy-tailed regression and related techniques), for
which a normal linear regression may fit some cases poorly. The response (m.marietta
in the dataset ereturns) is the excess return for the Martin Marietta company, and the
predictor (CRSP in ereturns) is an index of excess rates of return for the New York Stock
Exchange.
The BUGS code includes predictions (replicates) at observed predictor values, and com-
parisons between these and the actual response data. It also includes a calculation of
the residual precision from the standard deviation, which is assigned a uniform prior.
Log-likelihood calculations are included to enable a subsequent call to the loo library
(Vehtari et al., 2017). The code is included within a call to R2OpenBUGS as follows:

options(scipen=999)
library(R2OpenBUGS)
library(heavy)
library(loo)
data(ereturns)
x=as.vector(ereturns[[4]])
y=as.vector(ereturns[[3]])
# Data
D=list(y=y,x=x,n=60,x.new=0.13)
# Model Code
model <- function() { for (i in 1:n) {y[i] ~dnorm(mu[i],tau)
mu[i] <- beta[1] + beta[2]*(x[i]-mean(x[]))
# log-likelihood
LL[i] <- -0.92+0.5*log(tau)-0.5*tau*pow(y[i]-mu[i],2)
# replicates (predictions) at observed x[i]
yrep[i] ~dnorm(mu[i],tau)
# check replicate against actual observation
check[i] <- step(yrep[i]-y[i])}
# priors
for (j in 1:2) {beta[j] ~dnorm(0,0.001)}
# calculate precision
tau <- 1/(sigma*sigma)
sigma ~dunif(0,100)
# prediction at new x value
mu.new <- beta[1]+beta[2]*(x.new-mean(x[]))
y.new ~dnorm(mu.new,tau)}
inits1 = list(beta=rep(0,2), sigma=1)
inits2 = list(beta=rep(0,2), sigma=2)
inits = list(inits1,inits2)
pars = c("beta","sigma","check","y.new","LL")
n.iters=10000; n.burnin=500; n.chains=2
R=bugs(D,inits,pars,n.iters,model,n.chains,n.burnin,debug=TRUE,
codaPkg=FALSE,bugs.seed=10)
R$summary
LOO=loo(R$sims.list$LL)
LOO.PW=LOO$pointwise[,"looic"] # pointwise LOO-IC, selected by column name

As expected, a number of cases, particularly 8, 15, 34, and 58, have extreme posterior
predictive checks, and these cases also have the most extreme pointwise LOO-IC values.
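
Two details of the BUGS code can be checked in plain R. The sketch below uses made-up values (not the ereturns data) to show that the constant -0.92 approximates -0.5*log(2*pi), and that averaging a step indicator over replicates estimates a predictive tail probability. All object names are illustrative, and mu and sigma are held fixed here, whereas in the MCMC run they vary across iterations.

```r
# Illustrative values only (not taken from the ereturns data)
y <- 0.05; mu <- 0.01; sigma <- 0.09
tau <- 1 / sigma^2

# Log density as written in the BUGS code, with -0.92 approximating -0.5*log(2*pi)
LL_bugs <- -0.92 + 0.5 * log(tau) - 0.5 * tau * (y - mu)^2
# Exact normal log density from R
LL_exact <- dnorm(y, mean = mu, sd = sigma, log = TRUE)
LL_bugs - LL_exact   # small rounding difference, about -0.001

# Predictive check: the mean of step(yrep - y) over draws estimates Pr(yrep >= y)
set.seed(1)
yrep <- rnorm(10000, mean = mu, sd = sigma)
check <- as.numeric(yrep >= y)   # BUGS step(e) equals 1 when e >= 0
mean(check)                      # close to 1 - pnorm(y, mu, sigma)
```

Values of the averaged check near 0 or 1 flag observations that sit in the tails of their predictive distribution.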
This example could also be run using R2MultiBUGS, with the second line now
library(R2MultiBUGS), and the bugs command being:

R=bugs(D, inits, pars, model, n.chains=2, n.workers=2, n.iter=10000,
MultiBUGS.pgm = "C:/Program Files/MultiBUGS/MultiBUGS.exe")

MultiBUGS (www.multibugs.org/) parallelises the MCMC algorithm, resulting in
shorter computing times.

2.3 Coding in JAGS and for R Libraries Calling on JAGS


While it is an adaptation of BUGS, JAGS has the advantage of more parsimonious coding.
For example, if the prior on a linear regression residual variance is specified as a uniform
on the standard deviation named as sigma (as in the example above), then one can directly
specify:

y[i] ~ dnorm(mu[i],1/(sigma^2)).

Drawbacks of JAGS code relative to BUGS are that loop limits cannot involve any calculation,
and that sub-samples cannot be taken at each MCMC iteration (see Example 3.5).
The JAGS code for the above regression example emphasises its essential similarity with
the BUGS code, but also its coding flexibility, in that equality rather than assignment signs are
allowed, and extra facilities are available, such as the logdensity.norm function to obtain
log-likelihoods. The JAGS code also includes a function to generate suitable initial parameter
values and calls on the jagsUI package. The jagsUI package has the benefit of repeatedly
checking convergence and thus avoiding unnecessary computing. The calling sequence is as follows:

library(jagsUI)
library(heavy)
library(loo)
data(ereturns)
x=as.vector(ereturns[[4]])
y=as.vector(ereturns[[3]])
# Data
D=list(y=y,x=x,n=60,x.new=0.13)
cat("
model {for (i in 1:n) {y[i] ~dnorm(mu[i], 1/sigma^2)
mu[i] = beta[1] + beta[2]*(x[i]-mean(x[]))
# log-likelihood
LL[i] = logdensity.norm(y[i],mu[i],1/sigma^2)
# replicates at observed x[i]
yrep[i] ~dnorm(mu[i],1/sigma^2)
# check replicate against actual observation
check[i] = step(yrep[i]-y[i])}
# priors
for (j in 1:2) {beta[j] ~dnorm(0,0.001)}
sigma ~dunif(0,100)
# prediction at new x value
mu.new = beta[1]+beta[2]*(x.new-mean(x[]))
y.new ~dnorm(mu.new, 1/sigma^2)}
", file="model.jag")
# Estimation
inits <- function(){list(sigma=runif(1,0,5), beta=rnorm(2,0,0.1))}
pars = c("beta","sigma","check","y.new","LL")
R=autojags(D,inits,pars,model.file="model.jag",n.chains=2,iter.increment=1000,
n.burnin=100,Rhat.limit=1.025, max.iter=5000, seed=1234, codaOnly=c("LL"))
# Posterior Summary
R$summary
# Fit
LOO=loo(as.matrix(R$sims.list$LL))
LOO.PW.JAGS=LOO$pointwise[,"looic"] # pointwise LOO-IC, selected by column name
order(LOO.PW.JAGS)

Automatic convergence checking is also included in the package runjags (Denwood,
2016) and a calling sequence for this is as follows:

model ="model {for (i in 1:n) {y[i] ~dnorm(mu[i], 1/sigma^2)
mu[i] = beta[1] + beta[2]*(x[i]-mean(x[]))
# log-likelihood
LL[i] = logdensity.norm(y[i],mu[i],1/sigma^2)
# replicates at observed x[i]
yrep[i] ~dnorm(mu[i],1/sigma^2)
# check replicate against actual observation
check[i] = step(yrep[i]-y[i])}
# priors
for (j in 1:2) {beta[j] ~dnorm(0,0.001)}
sigma ~dunif(0,100)
# prediction at new x value
mu.new = beta[1]+beta[2]*(x.new-mean(x[]))
y.new ~dnorm(mu.new, 1/sigma^2 )} "
inits <- function(){list(sigma=runif(1,0,5), beta=rnorm(2,0,0.1))}
pars = c("beta","sigma","check","y.new","LL")
R = autorun.jags(model,data=D,startburnin=500,startsample=4000,
inits=inits,
monitor=pars ,n.chains=2)
add.summary(R)
# MCMC output for log-likelihoods
LLsamps=as.matrix(as.mcmc.list(R, vars = "LL"))
LOO=loo(LLsamps)
LOO.PW.JAGS=LOO$pointwise[,"looic"] # pointwise LOO-IC, selected by column name
order(LOO.PW.JAGS)
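
The automatic convergence checks used in these calling sequences rest on the potential scale reduction factor (Rhat), compared against a limit such as the Rhat.limit=1.025 used with jagsUI. The base R sketch below computes the classical Gelman-Rubin form of this statistic on simulated chains. The function name rhat is hypothetical, and the computation inside jagsUI or runjags may differ in detail.

```r
# Classical potential scale reduction factor for a single parameter.
# 'draws' is an iterations x chains matrix; rhat() is a hypothetical helper.
rhat <- function(draws) {
  n <- nrow(draws)                       # iterations per chain
  chain_means <- colMeans(draws)
  B <- n * var(chain_means)              # between-chain variance
  W <- mean(apply(draws, 2, var))        # average within-chain variance
  var_plus <- (n - 1) / n * W + B / n    # pooled variance estimate
  sqrt(var_plus / W)
}

set.seed(123)
# Two chains sampling the same distribution: Rhat should be near 1
mixed <- cbind(rnorm(1000), rnorm(1000))
# Two chains centred on different values: Rhat well above 1
stuck <- cbind(rnorm(1000, mean = 0), rnorm(1000, mean = 3))

rhat(mixed)
rhat(stuck)
rhat(mixed) < 1.025   # the kind of rule applied through a limit on Rhat
```

An automatic scheme simply keeps adding iterations until every monitored parameter falls below the chosen limit, or a maximum number of iterations is reached.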

2.4 Coding for rstan


2.4.1 Hamiltonian Monte Carlo
Stan differs from BUGS and JAGS in using the no-U-turn sampler (Hoffman and Gelman,
2014) based on the Hamiltonian Monte Carlo (HMC) scheme (Neal, 2011; Duane et al.,
1987; Betancourt and Girolami, 2015). HMC typically explores the posterior parameter
space faster and more efficiently than Metropolis–Hastings or related algorithms, and
so reaches convergence earlier, especially for high-dimensional models (Hoffman and
Gelman, 2014), avoiding delays associated with random walk samplers. As mentioned in
Stan Development Team (2017, p.593), “Stan might work fine with 1000 iterations with an
example where BUGS would require 100,000 for good mixing.”
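
To make the HMC idea concrete, the following base R sketch samples a standard normal target using leapfrog steps and a Metropolis correction. It is a minimal illustration under simplifying assumptions (fixed step size eps and fixed path length L, both chosen arbitrarily here) and is not the no-U-turn sampler used by Stan, which tunes these quantities adaptively. All function names are hypothetical.

```r
# Minimal HMC sketch for a standard normal target (illustration only)
log_p      <- function(theta) -0.5 * theta^2   # log target density, up to a constant
grad_log_p <- function(theta) -theta           # gradient of the log target

hmc_step <- function(theta, eps = 0.2, L = 10) {
  r0 <- rnorm(1)                                # draw auxiliary momentum
  theta_new <- theta
  r <- r0 + 0.5 * eps * grad_log_p(theta_new)   # initial half step for momentum
  for (l in seq_len(L)) {
    theta_new <- theta_new + eps * r            # full step for position
    if (l < L) r <- r + eps * grad_log_p(theta_new)  # full step for momentum
  }
  r <- r + 0.5 * eps * grad_log_p(theta_new)    # final half step for momentum
  # Metropolis accept/reject based on the change in the Hamiltonian
  log_accept <- (log_p(theta_new) - 0.5 * r^2) - (log_p(theta) - 0.5 * r0^2)
  if (log(runif(1)) < log_accept) theta_new else theta
}

set.seed(42)
n_iter <- 2000
draws <- numeric(n_iter)
theta <- 3                       # deliberately poor starting value
for (t in seq_len(n_iter)) {
  theta <- hmc_step(theta)
  draws[t] <- theta
}
kept <- draws[-(1:200)]          # discard warm-up draws
c(mean = mean(kept), sd = sd(kept))
```

Each proposal follows the gradient of the log density for several steps, so successive draws can be far apart while still being accepted with high probability, in contrast to the small moves of a random walk sampler.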

2.4.2 Stan Program Syntax


Stan codes break the model specification into blocks. First is the data block, specifying all
the data that is supplied in the input dataset, and referred to in later blocks in the code.
This includes the number of observations. Integer and real data items are distinguished.
The parameters block names the parameters, on which priors are usually explicit, and
whose estimation is sought. Priors on these parameters are usually specified in the sub-
sequent model block, which also specifies the data likelihood. Note that priors may be
omitted, so that one has a ‘flat’ prior. In that case, parameter bounds (lower, upper, or both)
should be stipulated, and this is good practice anyway. A transformed parameters block
is for obtaining functions of parameters. For example, if the parameters block names the
residual standard deviation σ, declared as

real <lower=0> sigma;

then one may be interested in estimating the precision as $\tau = 1/\sigma^2$. The transformed
parameters block may also specify limits, and this may facilitate particular types of model.
For example, one may specify a log-link in binomial regression by stipulating that probabilities
$\pi_i$ are between 0 and 1, with a log-link obtained as

real <lower=0,upper=1> pi[n];
for (i in 1:n) {pi[i]=exp(x[i]*beta);}

The generated quantities block specifies names and derivation of any quantities, such as log-
likelihoods, resulting from the calculations during estimation. All distinct statements in all
blocks must be terminated by a semicolon, which in for-loops precedes the closing } of the loop.
Flexibility in rstan coding is provided by opportunities to vectorise prior and likelihood
statements in the model block (and hence avoid for-loops); see Section 9 on Regression Models
in the Example Models section of the Stan User Guide (Stan Development Team, 2017).
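
The effect of vectorisation can be seen in R itself, where a loop over observations and a single vectorised call give the same total log-likelihood. The sketch below uses simulated data and arbitrary parameter values (assumptions for illustration, not the ereturns data); the vectorised line plays the role of a single sampling statement over the whole response vector in a Stan model block.

```r
# Simulated data for illustration (not the ereturns data)
set.seed(2)
n <- 60
x <- rnorm(n)
y <- 1 + 0.5 * x + rnorm(n, sd = 0.3)
beta <- c(1, 0.5); sigma <- 0.3
eta <- beta[1] + beta[2] * x          # linear predictor

# Loop form: one log-density contribution per observation
ll_loop <- 0
for (i in seq_len(n)) {
  ll_loop <- ll_loop + dnorm(y[i], mean = eta[i], sd = sigma, log = TRUE)
}

# Vectorised form: a single call over the whole response vector
ll_vec <- sum(dnorm(y, mean = eta, sd = sigma, log = TRUE))

all.equal(ll_loop, ll_vec)
```

The two forms agree numerically; the vectorised version is shorter to write and typically faster to evaluate.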
We continue the regression example above, now specifying the predictors (intercept and
CRSP) as a regression matrix. The generated quantities block includes log-likelihoods, gen-
eration of replicate data, and posterior predictive checks comparing replicate and actual
data. Vectorisation is illustrated in the code by the statement

y ~normal(eta,sigma);

in the model block. The calling sequence is:

options(scipen=999)
library(loo)
library(rstan)
# Regression Data
library(heavy)
data(ereturns)
x=as.vector(ereturns[[4]])
y=as.vector(ereturns[[3]])
x.new=0.13
K=2
X=matrix(,60,K)
X[,1]=1
X[,2]=x-mean(x)
x_new=x.new-mean(x)
D=list(y=y,X=X,n=60,K=2,x_new=x_new)
model="
data {
int n;        //  number of observations
real y[n]; // response
real x_new; // new predictor value
int K; // number of predictors
matrix[n,K] X; // predictor matrix
}
parameters {
vector[K] beta; // regression coefficients
real <lower=0> sigma; // residual standard deviation
}
transformed parameters {
vector[n] eta; // linear regression term
eta = X*beta;
}
model {
sigma ~uniform(0,100);
beta ~normal(0,31.6);
y ~normal(eta,sigma);
}
generated quantities { real LL[n];
real y_rep[n];
real y_new;
real check[n];
for (i in 1:n) {LL[i]= normal_lpdf(y[i] | eta[i],sigma); }
for (i in 1:n) {y_rep[i] =normal_rng(eta[i],sigma);}
for (i in 1:n) {check[i] =step(y_rep[i]-y[i]);}
y_new = normal_rng(beta[1]+beta[2]*x_new,sigma); // prediction at new x value
}
"
# Estimation
fit=stan(model_code = model,data=D, iter = 1500,warmup = 250,chains=2)
# Posterior Summary
print(fit,digits=3)
# plot of posterior densities
# stan_dens(fit)
# Fit
LLsamps <- as.matrix(fit,pars="LL")
LOO=loo(LLsamps)
LOO.PW.STAN= LOO$pointwise[,"looic"] # pointwise LOO-IC, selected by column name
order(LOO.PW.STAN)

2.4.3 The Target + Representation


An advantage of rstan is that the log posterior can be explicitly specified using the target +
representation. Thus, in the regression example, the prior statements sigma ~uniform(0,100)
and beta ~normal(0,31.6) can be expressed as target += uniform_lpdf(sigma | 0,100) and
target += normal_lpdf(beta | 0, 31.6) respectively, and the likelihood y ~normal(eta,sigma)
can be expressed as target += normal_lpdf(y | eta, sigma). So the model block becomes:

model {
target += uniform_lpdf(sigma | 0,100);
target += normal_lpdf(beta | 0, 31.6);
target += normal_lpdf(y | eta, sigma);
}

This is relevant in, say, marginal likelihood estimation, if one seeks to scale the contribu-
tion of the log-likelihood to the log-posterior (see Example 3.1); in regression using weighted
log-likelihoods or regression using frequency tabulations; or in fitting distributions with
custom likelihoods (not available among the standard densities included in rstan).
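
The accumulation of log-prior and log-likelihood terms can be mimicked in base R, which also shows what scaling the likelihood contribution means. The sketch below is an illustration under stated assumptions: the function log_post and its scaling argument a are hypothetical names, the data are simulated, and the priors simply copy those of the regression example.

```r
# Sketch of how 'target' accumulates log-prior and log-likelihood terms.
# log_post() and the scaling argument 'a' are hypothetical names.
log_post <- function(beta, sigma, y, X, a = 1) {
  if (sigma <= 0 || sigma >= 100) return(-Inf)   # outside the uniform(0,100) support
  eta <- as.vector(X %*% beta)
  target <- 0
  target <- target + dunif(sigma, 0, 100, log = TRUE)           # prior on sigma
  target <- target + sum(dnorm(beta, 0, 31.6, log = TRUE))      # prior on beta
  target <- target + a * sum(dnorm(y, eta, sigma, log = TRUE))  # scaled log-likelihood
  target
}

set.seed(3)
n <- 60
X <- cbind(1, rnorm(n))          # simulated design matrix, for illustration
y <- as.vector(X %*% c(0.5, 1)) + rnorm(n, sd = 0.2)

full  <- log_post(c(0.5, 1), 0.2, y, X, a = 1)    # usual unnormalised log posterior
prior <- log_post(c(0.5, 1), 0.2, y, X, a = 0)    # likelihood switched off
half  <- log_post(c(0.5, 1), 0.2, y, X, a = 0.5)  # likelihood given half weight
c(prior = prior, half = half, full = full)
```

Setting a between 0 and 1 moves the target from the prior towards the posterior, which is the kind of scaling referred to above for marginal likelihood estimation.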
Using the target + format, rstan can accommodate improper priors as long as the posteri-
ors are proper. Whereas BUGS and JAGS code specify a formal graphical model, for rstan,
the code simply specifies a joint density function needed for HMC. Thus, the Jeffreys prior on
a variance $\sigma^2$, which on the standard deviation scale is

$p(\sigma) \propto 1/\sigma$,

can be coded

target += -log(sigma);
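
The link between the variance and standard deviation scales follows from a change of variables. The short derivation below is added for illustration and assumes the improper prior $p(\sigma^2) \propto 1/\sigma^2$ on the variance; constant factors are dropped because they do not affect the log posterior used for sampling.

```latex
p(\sigma) = p(\sigma^2)\left|\frac{d\sigma^2}{d\sigma}\right|
          \propto \frac{1}{\sigma^2}\, 2\sigma
          \propto \frac{1}{\sigma},
\qquad
\log p(\sigma) = -\log(\sigma) + \text{constant}.
```

The increment -log(sigma) is therefore the log prior density on the scale of the declared parameter sigma.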

As an illustration of a weighted log-likelihood estimated using the target + option, consider
the api dataset from the R package survey (Carnes, 2017). This dataset has 200 observations
with sample weights $w_i$. Unweighted logistic regression of the binary response
yr.rnd.numeric on predictors meals and mobility gives respective β coefficients (and s.e.)
as 0.041 (0.010) and 0.041 (0.015). By contrast, weighted logistic regression using the Zelig
package gives respective β coefficients (and s.e.) as 0.034 (0.012) and 0.086 (0.027), with
a much-amplified effect of the mobility predictor and a diminished effect of the meals
predictor.
For rstan estimation, the target + option includes survey weights w[i] to scale the
log-likelihood contributions, as shown by the code in the following calling sequence:

data(api, package = "survey")


library(Zelig)
library(rstan)
apistrat$yr.rnd.numeric <- as.numeric(apistrat$yr.rnd == "Yes")
w.logit = zelig(yr.rnd.numeric ~meals + mobility, model = "logit.survey",
weights = apistrat$pw, data = apistrat)
unw.logit = glm(yr.rnd.numeric ~meals + mobility, data = apistrat, family
= "binomial")
attach(apistrat)
# Data for STAN, weights w scaled to average 1.
D=list(N=200,y=yr.rnd.numeric,x1=meals,x2=mobility,w=pw/mean(pw))
model ="
data { int N;
int<lower=0, upper=1> y[N]; // outcomes
real x1[N];
real x2[N];
real w[N];  // weights
}
parameters { real beta[3];
}
model { beta ~normal(0,5);
for (i in 1:N) {target += w[i]*bernoulli_lpmf(y[i] | 1/
(1+exp(-beta[1]-beta[2]*x1[i]-beta[3]*x2[i])));}
}
"
fit=stan(model_code = model,data=D, iter = 1500,warmup =
250,chains=2,seed=10)
# Posterior Summary
print(fit,digits=3)

The rstan estimation gives respective posterior means (sd) for the coefficients on meals
and mobility of 0.037 (0.09) and 0.064 (0.020). Again, the effect of mobility is amplified as
compared to unweighted logistic regression (though less so than under the Zelig approach),
while the effect of meals is attenuated.
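The effect of scaling each log-likelihood contribution by a weight can be checked without rstan. The sketch below is an assumption-laden illustration, not the api analysis: the data and weights are simulated, and the weighted log-likelihood is maximised directly with optim rather than sampled. The maximiser should agree with glm fitted with the same prior weights.

```r
set.seed(10)
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rbinom(n, 1, plogis(-0.5 + 0.8 * x1 - 0.4 * x2))   # simulated binary response
w  <- runif(n, 0.5, 3)
w  <- w / mean(w)                        # hypothetical weights, scaled to average 1
X  <- cbind(1, x1, x2)

# Negative weighted log-likelihood: each Bernoulli term is multiplied by w[i],
# as in target += w[i]*bernoulli_lpmf(...)
neg_wll <- function(beta) {
  eta <- drop(X %*% beta)
  -sum(w * dbinom(y, 1, plogis(eta), log = TRUE))
}

fit_opt <- optim(c(0, 0, 0), neg_wll, method = "BFGS")
# glm warns about non-integer successes with fractional weights; the fit is unaffected
fit_glm <- suppressWarnings(glm(y ~ x1 + x2, family = binomial, weights = w))
round(cbind(optim = fit_opt$par, glm = unname(coef(fit_glm))), 3)
```

In a Bayesian fit the same weighted term is simply combined with the log prior, so with a weak prior the posterior mode should lie close to these values.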
The target + option can also be used with frequency data. Suppose housing tenants
are grouped into 72 groups (with frequencies FREQ) according to an ordinal satisfaction
response (three categories) and three categorical predictors, one with four categories, one
with three, and one binary. Then an ordinal logistic regression can be applied via the

model {for (i in 1:72) { target += FREQ[i]*ordered_logistic_lpmf(y[i] |
x[i] * beta, tau);}}

This scenario is in fact applicable to the housing dataset in the R MASS library.
To demonstrate the target + option applied to a non-standard density, consider the
Kumaraswamy distribution, obtained by sampling y ~ Beta(1,b) and then setting $x = y^{1/a}$. The density
is $p(x|a, b) = a b x^{a-1}(1 - x^{a})^{b-1}$. We generate 1000 observations with a = 3 and b = 2.
The code sequence below provides posterior means (sd) for a and b of 3.00 (0.11) and 1.92
(0.09).

N =1000; a = 3; b = 2
# Kumaraswamy density
x = rbeta(N, 1, b)^(1/a)
library(rstan)
model ="
data {
int<lower=1> N;
real<lower=0,upper=1> x[N];
}
transformed data {
real sum_log_x;
sum_log_x = 0.0;
for (i in 1:N) {sum_log_x = sum_log_x + log(x[i]);}
}
parameters {
real<lower=0> a;
real<lower=0> b;
}
model {
target += N * (log(a) + log(b)) + (a - 1) * sum_log_x;
for (i in 1:N) { target += (b - 1) * log1m(pow(x[i],a)); }
}
"
D = list(N = N, x = x)
fit=stan(model_code = model,data=D, iter = 2500,warmup =
250,chains=2,seed=10)
# Posterior Summary
print(fit,digits=3)
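As a check on the density being accumulated, the Kumaraswamy log-density can be coded directly in R. This is a minimal sketch: the function name dkumar_log is an assumption, and maximum likelihood via optim stands in for the posterior sampling. It confirms that the density integrates to 1 and that a = 3, b = 2 are approximately recovered from simulated data.

```r
# Log density of the Kumaraswamy(a, b) distribution on (0, 1)
dkumar_log <- function(x, a, b) {
  log(a) + log(b) + (a - 1) * log(x) + (b - 1) * log1p(-x^a)
}

a <- 3; b <- 2
# The density should integrate to 1 over the unit interval
total <- integrate(function(x) exp(dkumar_log(x, a, b)), 0, 1)$value

set.seed(1)
N <- 1000
x <- rbeta(N, 1, b)^(1 / a)              # y ~ Beta(1, b), x = y^(1/a)

# Maximise the log-likelihood on the log scale so that a and b stay positive
negll <- function(lp) -sum(dkumar_log(x, exp(lp[1]), exp(lp[2])))
mle <- exp(optim(c(0, 0), negll)$par)
round(c(integral = total, a_hat = mle[1], b_hat = mle[2]), 3)
```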

2.4.4 Custom Distributions through a Functions Block


Although the target + specification may be used for non-standard densities, these can also
be implemented using a functions block; see Chapter 24 in Stan Development Team (2017).
The functions block (if used) should precede other blocks in an rstan code. The functions
command will provide a log-likelihood term and the function name will include a _log
suffix. However, the function call in the subsequent likelihood block will omit the _log
suffix (Annis et al., 2017, p.872). Function names should not duplicate existing functions in
rstan.

We consider the generalised Poisson (Consul, 1989):

\[ p(x|\omega, \mu) = (1 - \omega)\mu \, \frac{[(1 - \omega)\mu + \omega x]^{x-1}}{x!} \exp\left(-[(1 - \omega)\mu + \omega x]\right) \]

and the application by Joe and Zhu (2005). Joe and Zhu (2005, Table 3) consider data for
n = 158 tumour count observations and provide estimates (mean, se) for ω and ϑ = μ(1 − ω),
namely 0.79 (0.04) and 0.91 (0.10).
An rstan implementation involves the sequence:

library(rstan)
# Tumour count data from Joe and Zhu (2005)
x=c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2
,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,
4,4,4,4,4,4,4,5,5,5,5,5,6,6,6,6,6,6,6,6,6,7,7,7,7,7,7,7,7,7,8,9,9,10,10,
10,10,10,11,13,14,15,16,20,20,20,
21,24,24,24,26,30,50,50)
D=list(x=x,N=158)
model ="
functions {
real generalized_poisson_log(int x, real theta, real omega) {
return log(theta) + (x - 1) * log(theta + x*omega) - lgamma(x + 1)
- x * omega - theta ; }
}
data {
int<lower=0> N;
int x[N];
}
parameters {
real<lower=0> mu;
real<lower=-1, upper=1> omega;
}
transformed parameters {
real<lower=0> theta;
theta=mu*(1-omega);
}
model {
for (i in 1:N) {x[i] ~generalized_poisson(theta, omega);}
}
"
fit=stan(model_code = model,data=D, iter = 2500,warmup =
250,chains=2,seed=10)
# Posterior Summary
print(fit,digits=3)

We obtain estimates (posterior mean (sd)) for ω and ϑ of 0.797 (0.037) and 0.919 (0.095).
Note that this model can be extended to better account for the zero inflation present in the
data.
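The log mass function returned by the functions block can be verified numerically in base R. The sketch below uses an assumed function name dgenpois_log and illustrative parameter values (mu = 2, omega = 0.5), not the fitted values. It evaluates the same expression and checks that the probabilities sum to 1, with mean μ and variance μ/(1 − ω)², for 0 ≤ ω < 1.

```r
# Log pmf of the generalised Poisson, parameterised by mu and omega,
# with theta = mu * (1 - omega) as in the transformed parameters block
dgenpois_log <- function(x, mu, omega) {
  theta <- mu * (1 - omega)
  log(theta) + (x - 1) * log(theta + omega * x) - lgamma(x + 1) -
    (theta + omega * x)
}

mu <- 2; omega <- 0.5                    # illustrative values only
x  <- 0:2000                             # wide enough for the tail to be negligible
p  <- exp(dgenpois_log(x, mu, omega))

total  <- sum(p)
mean_x <- sum(x * p)
var_x  <- sum((x - mean_x)^2 * p)
round(c(total = total, mean = mean_x, variance = var_x), 4)
```

The variance exceeding the mean when ω > 0 is what makes the distribution useful for overdispersed counts such as the tumour data.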

2.5 Miscellaneous Differences between Generic Packages (BUGS, JAGS, and Stan)
We consider the detailed availability of particular analysis options in the R-based implemen-
tations of the generic coding packages, namely BUGS, JAGS, and Stan; see also Appendix B
in the Stan User’s Guide and Reference Manual (Stan Development Team, 2017).
Missing Data: An advantage of BUGS and JAGS (and their R versions) is simple handling
of missing data, either as missing at random, or informatively missing, if a selection mech-
anism is modelled as well. If response values for certain cases are specified as NA, and no
specification regarding the origin of missingness is included, then values are imputed on
a missing at random assumption. If predictor values are missing (again coded as NA), a
specific generating density for that predictor needs to be included in the code. Note that
Stan may provide estimates of the missing data values in the generated quantities block,
possibly after data rearrangement (see Example 6.5). Missing data may also be included in
the model likelihood, after appropriate subdividing of the response data: see Section 11.1
in Stan Development Team (2014).
Predictor Retention and Discrete Mixture Models: BUGS and JAGS allow predictor selec-
tion in regression using binary retention or exclusion indicators, whereas this is not pos-
sible, at least directly, in Stan, because HMC requires continuous parameter spaces. This is
an illustration of a broader issue of “discrete assignment indicators,” which also arises in
discrete mixture analysis. Stan handles such analysis by focusing on the marginal likeli-
hood, for example, using the likelihood assignment target += log_mix; see Chapters 13
and 15 in the Stan manual (Stan Development Team, 2017) and McElreath (2018).
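Marginalising over a discrete assignment indicator amounts to a log-sum-exp over the component log densities. The base R sketch below reproduces, under that reading, what Stan's log_mix computes for a two-component mixture. The R function and the simulated data are illustrative assumptions.

```r
# log(lambda * exp(lp1) + (1 - lambda) * exp(lp2)), computed stably
log_mix <- function(lambda, lp1, lp2) {
  a <- log(lambda) + lp1
  b <- log1p(-lambda) + lp2
  m <- pmax(a, b)                        # subtract the larger term before exponentiating
  m + log(exp(a - m) + exp(b - m))
}

set.seed(1)
z <- rbinom(300, 1, 0.3)                 # latent indicators (never used in the likelihood)
y <- rnorm(300, mean = ifelse(z == 1, 4, 0), sd = 1)

# Marginal log-likelihood of a two-component normal mixture
mix_loglik <- function(lambda, mu1, mu2, sigma) {
  sum(log_mix(lambda,
              dnorm(y, mu1, sigma, log = TRUE),
              dnorm(y, mu2, sigma, log = TRUE)))
}

ll_right   <- mix_loglik(0.3, 4, 0, 1)   # weights matched to the generating components
ll_swapped <- mix_loglik(0.3, 0, 4, 1)   # weights attached to the wrong components
c(ll_right = ll_right, ll_swapped = ll_swapped)
```

Because the indicators are summed out, only continuous parameters remain, which is what HMC requires.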
Spatial Modelling: Compared to the other main languages, BUGS has the advantage of
including priors to represent conditional spatial dependence for both univariate and mul-
tivariate outcomes, via the carnormal and mvcar priors. These priors can also be used in
time series modelling. JAGS does not have these options. Stan can estimate spatially corre-
lated effects (for SAR as well as CAR models) using the full multivariate priors (e.g. normal,
Student t) of dimension N, where N is the number of units or points (Joseph, 2016; Morris,
2018), but this may become computationally intensive for large N. Multivariate normal prior
options for CAR and SAR models are included in the brms package (Bürkner, 2017).
Augmented Data Representations: As compared to JAGS and Stan, BUGS can directly
estimate augmented data versions of binary and multinomial regression, with latent util-
ity interpretations (e.g. Albert and Chib, 1993). Stan can, however, estimate latent utilities
in the generated quantities block.
Priors on Precision and Covariance Matrices: A limitation common to BUGS and JAGS
(and their R implementations) is the limited choice in specifying the prior on the precision
matrix (inverse covariance matrix) for the multivariate normal or multivariate t densities.
For a univariate normal, there is more flexibility in these languages regarding the prior on
the precision (e.g. gamma, lognormal). One may also obtain a univariate precision param-
eter from the corresponding standard deviation or variance, with a prior (e.g. uniform,
lognormal) on the latter. For the multivariate normal, rstan (and associated R libraries such
as rethinking and brms) offers extra flexibility in allowing (a) LKJ prior distributions for a
correlation matrix, combined with priors on the standard deviations; (b) Cholesky decomposition
expressions of the covariance matrix (see Sections 61 and 63 of Stan Development
Team, 2017); and (c) multivariate normal sampling using either the covariance or
precision matrix (e.g. y ~multi_normal_prec).

References
Albert J, Chib S (1993) Bayesian analysis of binary and polychotomous response data. Journal of the
American Statistical Association, 88, 669–679.
Annis J, Miller B, Palmeri T (2017) Bayesian inference with Stan: A tutorial on adding custom distri-
butions. Behavior Research Methods, 49(3), 863–886.
Betancourt M, Girolami M. (2015) Hamiltonian Monte Carlo for hierarchical models. Chapter 4,
pp 79–102, in U. Singh, S. Upadhyay, D. Dey (eds) Current Trends in Bayesian Methodology with
Applications. CRC, Boca Raton, FL.
Bürkner P (2017) brms: An R package for Bayesian multilevel models using Stan. Journal of Statistical
Software, 80(1), 1–28.
Carnes N (2017) Logistic Regression for Survey Weighted Data. http://docs.zeligproject.org/articles/zelig_logitsurvey.html
Consul P (1989) Generalized Poisson Distribution: Properties and Applications. Marcel Dekker, New York.
CRAN (2018) Laplaces Demon: Complete Environment for Bayesian Inference. https://cran.r-project.org/web/packages/LaplacesDemon/LaplacesDemon.pdf
Denwood M (2016) runjags: An R package providing interface utilities, model templates, parallel
computing methods and additional distributions for MCMC models in JAGS. Journal of
Statistical Software, 71(9), 1–25.
de Valpine P, Turek D, Paciorek C, Anderson-Bergman C, Lang D, Bodik R (2017) Programming with
models: Writing statistical algorithms for general model structures with NIMBLE. Journal of
Computational and Graphical Statistics, 26(2), 403–413.
Duane S, Kennedy AD, Pendleton BJ, Roweth D (1987) Hybrid Monte Carlo. Physics Letters B, 195,
216–222.
Gabry J, Goodrich B (2018) How to Use the rstanarm Package. https://cran.r-project.org/web/packages/rstanarm/vignettes/rstanarm.html
Goudie R, Turner R, De Angelis D, Thomas A (2019) MultiBUGS: A parallel implementation of
the BUGS modelling framework for faster Bayesian inference. Journal of Statistical Software.
arXiv:1704.03216
Hoffman M, Gelman A (2014) The No-U-turn sampler: adaptively setting path lengths in Hamiltonian
Monte Carlo. Journal of Machine Learning Research, 15(1), 1593–1623.
Joe H, Zhu R (2005) Generalized Poisson distribution: The property of mixture of Poisson and com-
parison with negative binomial distribution. Biometrical Journal, 47(2), 219–229.
Joseph M (2016) Exact sparse CAR models in Stan. http://mc-stan.org/documentation/case-studies/mbjoseph-CARStan.html
Li M, Dushoff J, Bolker B (2018) Fitting mechanistic epidemic models to data: A comparison of simple
Markov chain Monte Carlo approaches. Statistical Methods in Medical Research, 27(7), 1956–1967.
Lykou A, Ntzoufras I (2011) WinBUGS: A tutorial. WIRES: Wiley Interdisciplinary Reviews, 3(5),
385–396.
Martin A, Quinn K (2006) Applied Bayesian inference in R using MCMCpack. R News, 6(1), 2–7.
Martin A, Quinn K., Park J (2011) MCMCpack: Markov chain Monte Carlo in R. Journal of Statistical
Software, 42(9), 1–21. www.jstatsoft.org/v42/i09/
McElreath R (2018) Algebra and the Missing Oxen. http://elevanth.org/blog/2018/01/29/algebra-and-missingness/
Monnahan C, Thorson J, Branch T (2017) Faster estimation of Bayesian models in ecology using
Hamiltonian Monte Carlo. Methods in Ecology and Evolution, 8(3), 339–348.
Morris M (2018) Spatial Models in Stan: Intrinsic Auto-Regressive Models for Areal Data.
http://mc-stan.org/users/documentation/case-studies/icar_stan.html
Neal R (2011) MCMC Using Hamiltonian Dynamics, Chapter 5, in S Brooks, A Gelman, G Jones, X–L
Meng (eds) Handbook of Markov Chain Monte Carlo. CRC Press, Boca Raton, FL, pp 113–162.
R Core Team (2016) R: A language and environment for statistical computing. R Foundation for
Statistical Computing, Vienna. www.r-project.org/
Seltman H (2016) R Package rube (Really Useful WinBUGS (or JAGS) Enhancer). Version 0.3-8.
http://www.stat.cmu.edu/~hseltman/rube/
Stan Development Team (2014) Stan Modeling Language: User’s Guide and Reference Manual.
https://github.com/stan-dev/stan/releases/download/v2.4.0/stan-reference-2.4.0.pdf
Stan Development Team (2017) Modeling Language User’s Guide and Reference Manual, Version
2.17.0. https://mc-stan.org/users/documentation/
Umlauf N, Klein N, Zeileis A (2018) BAMLSS: Bayesian additive models for location, scale, and shape
(and beyond). Journal of Computational and Graphical Statistics, 27(3), 612–627.
Vehtari A, Gelman A, Gabry J (2017) Practical Bayesian model evaluation using leave-one-out cross-
validation and WAIC. Statistics and Computing, 27(5), 1413–1432.
Youngflesh C (2017) MCMCvis: Tools to Visualize, Manipulate, and Summarize MCMC Output.
https://mran.microsoft.com/snapshot/2017-04-22/web/packages/MCMCvis/index.html
3
Model Fit, Comparison, and Checking

3.1 Introduction
Model assessment involves choices between competing models in terms of best fit, and
checks to ensure model adequacy. For example, even if one model has a superior fit, it still
needs to be established whether predictions from that model are consistent with (that is,
satisfactorily reproduce) the observed data. Checking may also seek to establish whether model
assumptions (e.g. normality of random effects) are justified, whether the model reproduces
particular aspects of the data, and whether particular observations are poorly fit (Sinharay
and Stern, 2003; Berkhof et al., 2000; Kelly and Smith, 2011; Lucy, 2018; Conn et al., 2018;
Park et al., 2015).
Once adequacy is established for a set of candidate models, one may seek to choose a
particular best fitting model to base inferences on, or average over two or more adequate
models with closely competing fit. This chapter focuses on three main strategies to assess
model fit and carry out model checks: the formal approach; approaches based on posterior
analysis of the likelihood; and predictive methods based on samples of replicate data.
Particular emphasis is placed on their application in hierarchical models. Hierarchical
indicator priors for selecting predictors are considered here (Section 3.4), and more exten-
sively in Chapter 7.
R packages focusing particularly on Bayesian model selection or other aspects of model
comparison include loo (Vehtari et al., 2017); mombf for regression and mixture analyses
(Rossell, 2018; Johnson and Rossell, 2012); AICcmodavg (for deviance information criterion
(DIC) calculation) (https://rdrr.io/cran/AICcmodavg/); BayesFactor (https://rdrr.io/cran/BayesFactor/),
and the bridgesampling package (Gronau et al., 2017a). Packages focusing
on predictor selection include BayesVarSel (https://rdrr.io/cran/BayesVarSel/), and BMA
(https://rdrr.io/cran/BMA/).

3.2 Formal Model Selection


What is termed the formal approach to Bayes model selection is based on integration over the
model parameter space to estimate marginal likelihoods and posterior model probabilities,
leading on to possible model averaging. The canonical situation is provided by a “model-closed”
or M-closed scenario (Key et al., 1999; Bernardo and Smith, 1994), where the set of
models under consideration is judged to include the correct model; formal model
choice strategies are then directed towards finding which model is most likely given the data.


Let prior model probabilities be denoted p(m = k), where $m \in (1, \ldots, K)$ is a model indicator.
Then posterior model probabilities are obtained as

\[ p(m = k|y) = \frac{p(y|m = k)\,p(m = k)}{p(y)} \]

where

\[ p(y|m = k) = \int p(y|\theta_k)\,p(\theta_k)\,d\theta_k \]

is the marginal likelihood for model k, with parameter θk of dimension dk. This section
considers approximations to marginal likelihoods and to Bayes factors

\[ BF_{jk} = p(y|m = j)/p(y|m = k) \]

that compare such likelihoods. In simple models, such as normal linear regressions with
regression coefficients and residual variance as the only unknowns, the formal approach is
relatively simple to implement, and marginal likelihoods are available analytically under
certain priors (Bos, 2002).
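Given log marginal likelihoods for K models, posterior model probabilities follow by normalising on the log scale. The sketch below assumes three hypothetical log marginal likelihood values, chosen purely for illustration, and equal prior model probabilities. The helper name post_model_prob is also an assumption.

```r
# Posterior model probabilities from log marginal likelihoods and prior probabilities
post_model_prob <- function(log_ml, prior = rep(1 / length(log_ml), length(log_ml))) {
  lw <- log_ml + log(prior)
  w  <- exp(lw - max(lw))                # subtract the maximum to avoid underflow
  w / sum(w)
}

log_ml <- c(-1250.4, -1247.1, -1249.0)   # hypothetical values, for illustration only
pmp    <- post_model_prob(log_ml)

# Log Bayes factor comparing model 2 with model 1
log_bf_21 <- log_ml[2] - log_ml[1]
round(c(pmp = pmp, log_bf_21 = log_bf_21), 3)
```

Working with differences of log marginal likelihoods avoids exponentiating values that would underflow to zero on the natural scale.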
Approximate methods (Tierney and Kadane, 1986) for obtaining summary fit measures
(e.g. marginal likelihoods) or posterior densities of parameters are also reliable in simple
models. A large sample approximation for the log marginal likelihood is provided by the
Bayesian Information Criterion (BIC) (Schwarz, 1978; Myung and Pitt, 2004) defined as

\[ \mathrm{BIC} = \log[p(y|\hat{\theta}_k)] - 0.5\, d_k \log(n) \]

where $\hat{\theta}_k$ is the maximum likelihood estimator, dk is a known model dimension, and n is
the sample size. The BIC is consistent for a wide set of problems, meaning that the prob-
ability of selecting the most parsimonious true model tends to 1 as the sample size tends to
infinity. However, for singular model selection problems (discrete mixtures, factor models
where the true number of factors is unknown, etc.), the asymptotic justification for the
BIC no longer applies: considering the case of discrete parametric mixtures (Chapter 4),
the Fisher information matrix with K components is singular at a distribution based on
K-1 components. An alternative for such problems, the singular BIC, or sBIC, has been
proposed (Drton and Plummer, 2017) and implemented in the R package sBIC (Weihs and
Plummer, 2016). The widely applicable Bayesian information criterion (WBIC) can also be
applied for nonsingular models (Watanabe, 2013; Friel et al., 2017).
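The BIC as defined here is on the log marginal likelihood scale, whereas R's BIC() function reports −2 times that quantity. The following sketch uses simulated regression data and an assumed helper name bic_book to make the correspondence explicit. Differences in bic_book between models then approximate log Bayes factors.

```r
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 0.5 * x1 + rnorm(n)            # x2 has no effect in the simulated data

# BIC on the log marginal likelihood scale: log-likelihood at the MLE
# minus 0.5 * (number of parameters) * log(sample size)
bic_book <- function(fit) {
  ll <- logLik(fit)
  as.numeric(ll) - 0.5 * attr(ll, "df") * log(nobs(fit))
}

m1 <- lm(y ~ x1)
m2 <- lm(y ~ x1 + x2)
approx_log_bf_12 <- bic_book(m1) - bic_book(m2)   # approximate log Bayes factor, m1 vs m2
c(m1 = bic_book(m1), m2 = bic_book(m2), approx_log_bf_12 = approx_log_bf_12)
```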
Posterior model probabilities on nested models may also be obtainable by adding model
selection indicators, as illustrated by Bayesian variable selection algorithms (Mitchell and
Beauchamp, 1988; Fernandez et al., 2001) for choosing predictors in regression. Such selec-
tion has been extended to variance hyperparameters in hierarchical models (e.g. Cai and
Dunson, 2006; Chen and Dunson, 2003; Fruhwirth-Schnatter and Tuchler, 2008; Kinney
and Dunson, 2008), enabling selection which avoids the complex issues involved in mar-
ginal likelihood estimation for random effects models. Section 3.4 considers variance
selection in hierarchical models.
However, in more complex random effect applications with discrete responses or
hierarchically structured data, there remain issues which impede the straightforward
application of the formal approach (Han and Carlin, 2001). For example, in approxi-
mating marginal likelihoods, there is a choice whether or not to integrate over random
effects (Sinharay and Stern, 2005). The more commonly advocated approach of integrat-
ing out random effects becomes impractical when there are multiple possibly correlated
random effects. The formal approach is also sensitive to priors adopted on parameters,
which in the case of random effect models include the form of prior on variance compo-
nents (e.g. inverse gamma or uniform), as well as the degree of prior informativeness.
As priors become more diffuse, the formal approach tends to select the simplest least
parameterised models, in line with the so-called Lindley or Bartlett paradox (Bartlett,
1957). Finally, the formal approach to model averaging requires both posterior densities
$p(\theta_k|y, m = k)$, and posterior model probabilities p(m = k|y). Estimates of posterior densities
$p(\theta_k|y, m = k)$ may be difficult to obtain in complex random effects models with large
numbers of parameters.
Straightforward and pragmatic approaches to model comparison, which are also appli-
cable to complex hierarchical models, are available as alternatives to formal methods. The
two main approaches are based on posterior densities of fit measures (log-likelihood, devi-
ance) and on predictive assessment using samples of replicate data. Section 3.3 considers
the posterior deviance as a fit measure and the related measure of model complexity (effec-
tive dimension) that are of utility in comparing hierarchical models. Bayesian fit measures
such as the DIC or LOO-IC (Vehtari et al., 2017) are analogous to information theoretic
approaches in frequentist statistics (Burnham and Anderson, 2002), but more widely appli-
cable (e.g. to non-nested models). The components of the overall fit deriving from each
observation (e.g. the deviance contributions from particular observations) may be used in
model checking (Plummer, 2008).
The predictive approach to model choice and diagnosis (Section 3.5) has also been simpli-
fied by MCMC (Gelfand, 1996). Predictive methods shift the focus onto observables away
from parameters (Geisser and Eddy, 1979) and seek to alleviate the impact on model com-
parison of factors such as specification of priors. The predictive approach is particularly
advantageous in model checking, namely ensuring that a model actually reproduces the
data satisfactorily (e.g. Kacker et al., 2008), but is also applied to model choice, for example,
under posterior predictive loss criteria (Gelfand and Ghosh, 1998).
Predictive model checking typically involves repeated sampling of replicate data $y_{new}$
from a model’s parameters at each MCMC iteration (Gelfand et al., 1992). For a satisfactory
model this process generates data like the observed data such that $(y, y_{new})$ are exchangeable
draws from the joint density (Stern and Sinharay, 2005, pp.176–177)

\[ p(y_{new}, y, \theta) = p(y_{new}|\theta, y)\,p(y|\theta)\,p(\theta) = p(y_{new}|\theta)\,p(y|\theta)\,p(\theta). \]

When all the data is used in model estimation, such sampling provides estimates of
the posterior predictive density of model k, $p(y_{new}|y, m = k)$. However, predictive
comparisons based on models using all the data in estimation may be overly favourable
to the model being fitted (i.e. be conservative in terms of detecting model discrepan-
cies) (Bayarri and Berger, 1999). An alternative involves cross-validation (Alqallaf and
Gustafson, 2001) where the model predicts values for certain observations (the test
sample) on the basis of a model estimated using the remaining observations (the learn-
ing sample). Key et al. (1999) argue that cross-validation is approximately optimal in
an M-open scenario, where none of the models being considered is believed to be the
true model.
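The replicate-data sampling described above can be sketched for a conjugate normal model, where posterior draws are available directly. Everything in this example is an assumption made for illustration: the simulated data, the known sampling standard deviation, the prior settings, and the choice of the sample maximum as the check statistic. In practice the draws of θ would come from MCMC output.

```r
set.seed(5)
n <- 30; sigma <- 1                      # known sampling sd (assumed)
m0 <- 0; s0 <- 10                        # prior mean and sd for theta (assumed)
y <- rnorm(n, 2, sigma)                  # simulated observations

# Conjugate posterior for theta: N(m1, v1)
v1 <- 1 / (1 / s0^2 + n / sigma^2)
m1 <- v1 * (m0 / s0^2 + sum(y) / sigma^2)

R <- 2000
theta <- rnorm(R, m1, sqrt(v1))          # stand-in for MCMC draws

# One replicate dataset per draw; record the check statistic (sample maximum)
T_rep <- sapply(theta, function(t) max(rnorm(n, t, sigma)))
ppp   <- mean(T_rep >= max(y))           # posterior predictive p-value
ppp
```

Values of the p-value close to 0 or 1 would suggest that the model fails to reproduce that feature of the data. Because the same data are used for fitting and checking, such checks tend to be conservative, as noted in the text.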

3.2.1 Formal Methods: Approximating Marginal Likelihoods


As mentioned above, the global fit of a model with parameter vector θ under the formal
Bayes paradigm is provided by the marginal likelihood p(y|m = k), obtained by integrating
the likelihood


\[ p(y) = \int p(y|\theta)\,p(\theta)\,d\theta. \]

The marginal likelihood is also a component in Bayes formula, such that at any parameter
value θ

\[ p(\theta|y) = \frac{p(y|\theta)\,p(\theta)}{p(y)}. \]

Consider models 1 and 2 with equal prior model probabilities p(m = 1) = p(m = 2) = 0.5. Then
the ratio of posterior model probabilities is obtained as

\[ \frac{p(m = 2|y)}{p(m = 1|y)} = \frac{p(y|m = 2)}{p(y|m = 1)} = B_{21} \]

where $B_{21}$ is the Bayes factor. Kass and Raftery (1995) provide guidelines for interpreting
$B_{21}$. If $2\log_e B_{21}$ is larger than 10, the evidence for model 2 is very strong, while values of
$2\log_e B_{21} < 2$ are inconclusive as evidence in favour of one model or another. Note that such
criteria are influenced by the prior adopted. In general, diffuse priors (whether on fixed
effect parameters or variances) are to be avoided, as they tend to favour the selection of the
simpler model.
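A case where both marginal likelihoods are available in closed form makes the Bayes factor concrete. In the sketch below the data (16 successes in 20 trials) are hypothetical. Model 1 fixes a binomial probability at 0.5, while model 2 places a uniform prior on it, so that its marginal likelihood is 1/(n + 1).

```r
n <- 20; y <- 16                         # hypothetical data: 16 successes in 20 trials

log_m1 <- dbinom(y, n, 0.5, log = TRUE)  # model 1: probability fixed at 0.5
log_m2 <- -log(n + 1)                    # model 2: uniform prior, p(y) = 1/(n + 1)

# Numerical check of the closed form for model 2
m2_numeric <- integrate(function(p) dbinom(y, n, p), 0, 1)$value

B21          <- exp(log_m2 - log_m1)     # Bayes factor for model 2 against model 1
two_log_B21  <- 2 * (log_m2 - log_m1)
post_prob_m2 <- B21 / (1 + B21)          # under equal prior model probabilities
round(c(B21 = B21, two_log_B21 = two_log_B21, post_prob_m2 = post_prob_m2), 3)
```

Replacing the uniform prior by a much more diffuse prior on a transformed scale would lower the marginal likelihood of model 2, which is the prior sensitivity referred to in the text.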
Estimating the marginal likelihood by direct integration is generally infeasible in
multi-parameter applications. Hence, a range of approximations have been proposed for
estimating marginal likelihoods or associated model choice criteria, such as the Bayes fac-
tor. For example, on suitable rearrangement (Chib, 1995), the Bayes formula implies that
the marginal likelihood may be approximated by estimating the posterior ordinate p(θ|y)
in the relation

\[ \log[p(y)] = \log[p(y|\theta_h)] + \log[p(\theta_h)] - \log[p(\theta_h|y)] \]

where θh is a point with high posterior density (e.g. posterior mean or median). One may
estimate p(θ|y) by kernel density methods or by moment approximations based on MCMC
output – see Lenk and DeSarbo (2000) for a discussion of such estimates. Let g(θ) denote an
estimated density that approximates p(θ|y). One may then evaluate g(θ) at θh (Sinharay and
Stern, 2005; Bos, 2002), so providing an estimate of the log marginal likelihood as

\[ \log[p(y|\theta_h)] + \log[p(\theta_h)] - \log[g(\theta_h)]. \]

The relation $\log[p(y)] = \log[p(y|\theta_h)] + \log[p(\theta_h)] - \log[p(\theta_h|y)]$ also implies a
sampling-based estimator of the log marginal likelihood. Since this relation applies for all samples
θ(r), one may average over values

\[ H^{(r)} = \log[p(y|\theta^{(r)})] + \log[p(\theta^{(r)})] - \log[g(\theta^{(r)})] \]



to estimate the log of the marginal likelihood, log[p(y)]. Using log transforms is likely to
be the most suitable approach for larger samples, to avoid numeric overflow. For small
samples, one may set $L^{(r)} = p(y|\theta^{(r)})$, $\pi^{(r)} = p(\theta^{(r)})$, and $g^{(r)} = g(\theta^{(r)})$. Then an estimator of the
marginal likelihood is provided by the simple average of the ratios $L^{(r)}\pi^{(r)}/g^{(r)}$.
Alternatively, suppose θ contains B parameter sub-blocks. When the full conditionals of
each sub-block are available in closed form, Chib (1995) considers a marginal/conditional
decomposition of p(θh|y) as follows

\[ p(\theta_h|y) = p(\theta_{1h}|y)\,p(\theta_{2h}|\theta_{1h}, y)\,p(\theta_{3h}|\theta_{1h}, \theta_{2h}, y) \cdots p(\theta_{Bh}|\theta_{1h}, \ldots, \theta_{B-1,h}, y) \]

with p(θh|y), and thus p(y), estimated by using B − 1 sampling sequences subsidiary
to the main scheme. If B = 2, namely $\theta_h = (\theta_{1h}, \theta_{2h})$, the posterior ordinate p(θh|y) is
then $p(\theta_{1h}|y)\,p(\theta_{2h}|y, \theta_{1h})$, where $p(\theta_{1h}|y)$ is estimated from the output of the main sample,
e.g. as the average over the R sampled values

\[ p(\theta_{1h}|y) = \frac{1}{R}\sum_{r=1}^{R} p(\theta_{1h}|y, \theta_2^{(r)}) \]

or by an approximation technique (e.g. assuming univariate/multivariate posterior normality
of θ1, or a kernel method). The second ordinate is available by inserting θ1h and θ2h
in the relevant full conditional density. Chib and Jeliazkov (2001) extend this method to
cases where full conditionals do not have a known normalising constant and have to be
updated by Metropolis–Hastings steps.
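The posterior ordinate identity can be verified in a conjugate normal model, where the posterior density is known exactly. The sketch below makes several assumptions for illustration: simulated data, a known sampling standard deviation, and a normal prior on the mean. It compares three quantities: the identity evaluated with the exact ordinate, the same identity with a kernel density estimate of the ordinate, and the analytic marginal likelihood from the multivariate normal form.

```r
set.seed(1)
n <- 20; sigma <- 1                      # known sampling sd (assumed)
m0 <- 0; s0 <- 2                         # prior mean and sd for theta (assumed)
y <- rnorm(n, 1, sigma)

# Conjugate posterior for theta: N(m1, v1)
v1 <- 1 / (1 / s0^2 + n / sigma^2)
m1 <- v1 * (m0 / s0^2 + sum(y) / sigma^2)

theta_h     <- m1                        # high posterior density point
log_lik_h   <- sum(dnorm(y, theta_h, sigma, log = TRUE))
log_prior_h <- dnorm(theta_h, m0, s0, log = TRUE)

# (a) identity with the exact posterior ordinate
log_ml_chib <- log_lik_h + log_prior_h - dnorm(theta_h, m1, sqrt(v1), log = TRUE)

# (b) ordinate estimated by a kernel density fitted to posterior draws
draws <- rnorm(5000, m1, sqrt(v1))       # stand-in for MCMC output
dens  <- density(draws)
g_h   <- approx(dens$x, dens$y, xout = theta_h)$y
log_ml_kde <- log_lik_h + log_prior_h - log(g_h)

# (c) analytic value: y is multivariate normal with covariance sigma^2 I + s0^2 J
d <- y - m0
log_ml_exact <- -0.5 * n * log(2 * pi) -
  0.5 * ((n - 1) * log(sigma^2) + log(sigma^2 + n * s0^2)) -
  0.5 * (sum(d^2) - s0^2 * sum(d)^2 / (sigma^2 + n * s0^2)) / sigma^2

round(c(chib = log_ml_chib, kde = log_ml_kde, exact = log_ml_exact), 4)
```

The kernel estimate slightly oversmooths the peak of the posterior, so the second value typically differs a little from the other two. This is one reason the choice of ordinate estimator matters in higher dimensions.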

3.2.2 Importance and Bridge Sampling Estimates


Let θk be the parameter vector for model k, denote the marginal likelihoods p(y|m = k) as
Mk, and let

\[ p^*(\theta_k|y, m = k) = p(y|\theta_k, m = k)\,p(\theta_k|m = k) \]

denote the un-normalised posterior density of θk with

\[ p^*(\theta_k|y, m = k)/c_k = p(\theta_k|y). \]

Then by definition


\[ p(y|m = k) = \int p^*(\theta_k|y, m = k)\,d\theta_k. \]

Consider a function g(θ) with known normalising constants, often termed an importance
function, and one that should ideally approximate the posterior p(θ|y). Then one has

\[ p(y|m = k) = \int p^*(\theta_k|y, m = k)\,d\theta_k = \int \frac{p^*(\theta_k|y, m = k)}{g(\theta_k)}\,g(\theta_k)\,d\theta_k. \]

This suggests that an estimator for the marginal likelihood may be obtained using samples
$\theta_k^{(r)}$ $(r = 1, \ldots, R)$ from g(θk), namely

\[ M_k = \frac{1}{R}\sum_{r=1}^{R} \frac{p^*(\theta_k^{(r)}|y, m = k)}{g(\theta_k^{(r)})}. \]

Let $L_k^{(r)} = p(y|\theta_k^{(r)})$, $\pi_k^{(r)} = p(\theta_k^{(r)})$ and $g_k^{(r)} = g(\theta_k^{(r)})$. Then, the importance sample estimator
may be written in terms of weights $w_k^{(r)} = \pi_k^{(r)}/g_k^{(r)}$ comparing the prior and importance
function, namely

\[ M_k = \frac{1}{R}\sum_{r=1}^{R} L_k^{(r)} w_k^{(r)}. \]
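The importance sampling estimator can be illustrated in a conjugate normal model where the exact marginal likelihood is available for comparison. In this sketch the data, the prior, and the choice of importance function (a normal centred on the posterior mean with an inflated standard deviation) are all assumptions made for illustration. The average of the ratios is formed on the log scale for numerical stability.

```r
set.seed(2)
n <- 20; sigma <- 1                      # known sampling sd (assumed)
m0 <- 0; s0 <- 2                         # prior mean and sd for theta (assumed)
y <- rnorm(n, 1, sigma)

v1 <- 1 / (1 / s0^2 + n / sigma^2)       # conjugate posterior variance
m1 <- v1 * (m0 / s0^2 + sum(y) / sigma^2)

log_lik <- function(theta) {
  sapply(theta, function(t) sum(dnorm(y, t, sigma, log = TRUE)))
}

# Exact log marginal likelihood via log-likelihood + log prior - log posterior
log_ml_exact <- log_lik(m1) + dnorm(m1, m0, s0, log = TRUE) -
  dnorm(m1, m1, sqrt(v1), log = TRUE)

# Importance function g: normal with heavier spread than the posterior
R     <- 5000
sd_g  <- 1.5 * sqrt(v1)
theta <- rnorm(R, m1, sd_g)

log_L <- log_lik(theta)                                  # log L^(r)
log_w <- dnorm(theta, m0, s0, log = TRUE) -
  dnorm(theta, m1, sd_g, log = TRUE)                     # log of prior / g
lr    <- log_L + log_w

# log of the average of L^(r) w^(r), using the log-sum-exp device
log_ml_is <- max(lr) + log(mean(exp(lr - max(lr))))
round(c(importance = log_ml_is, exact = log_ml_exact), 4)
```

An importance function with lighter tails than the posterior would give ratios with very large variance, which is why g is made more dispersed here.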

Bridge sampling estimators of marginal likelihoods use the fact that the marginal likelihood
of model k is the normalising constant $c_k = p(y|m = k)$ in the relation

\[ p(\theta_k|y, m = k) = \frac{p(y|\theta_k, m = k)\,p(\theta_k|m = k)}{p(y|m = k)} = \frac{p^*(\theta_k|y, m = k)}{c_k}. \]
The Bayes factor $B_{jk} = p(y|m = j)/p(y|m = k)$ is then a ratio cj/ck of normalising constants.
Let g(θ) be an approximation to p(θ|y) with known normalising constant (e.g. suppose g
consists of a multivariate normal density and a gamma density). Then one has

\[ 1 = \frac{\int \alpha(\theta_k)\,p(\theta_k|y)\,g(\theta_k)\,d\theta_k}{\int \alpha(\theta_k)\,g(\theta_k)\,p(\theta_k|y)\,d\theta_k} = \frac{E_g[\alpha(\theta_k)\,p(\theta_k|y)]}{E_p[\alpha(\theta_k)\,g(\theta_k)]} \]

where α(θ) is a bridge function linking the densities g(θ) and p(θ|y) (Meng and Wong, 1996;
Gronau et al., 2017b), Eg[] denotes expectation with regard to the density g(θ), and Ep[]
denotes expectation with regard to the density p(θ|y). Substituting $p^*(\theta_k|y, m = k)/c_k$ for
p(θ|y) in $1 = E_g[\alpha(\theta_k)\,p(\theta_k|y)]/E_p[\alpha(\theta_k)\,g(\theta_k)]$ gives the result

\[ c_k = \frac{E_g[\alpha(\theta_k)\,p^*(\theta_k|y, m = k)]}{E_p[\alpha(\theta_k)\,g(\theta_k)]}. \]
For simplicity, omit conditioning on model k. Then with samples $\theta^{(r)}$ $(r = 1, \ldots, S)$ and
$\theta^{(r)}$ $(r = 1, \ldots, R)$ from p(θ|y) and g(θ) respectively, one may estimate the marginal likelihood
p(y) of a particular model as

\[ \left\{ \frac{1}{R}\sum_{r=1}^{R} \left[\alpha(\theta^{(r)})\,p^*(\theta^{(r)}|y)\right] \right\} \Big/ \left\{ \frac{1}{S}\sum_{r=1}^{S} \left[\alpha(\theta^{(r)})\,g(\theta^{(r)})\right] \right\}. \]
Setting $\alpha(\theta) = 1/g(\theta)$ then gives a marginal likelihood estimator

\[ M = \frac{1}{R}\sum_{r=1}^{R} \frac{p^*(\theta^{(r)}|y)}{g(\theta^{(r)})} \]

that uses only samples from the approximate posterior (or importance) density g(θ).
Setting $\alpha(\theta) = 1/p^*(\theta|y)$ gives an estimator based on the harmonic mean of the ratios
$p^*(\theta^{(r)}|y)/g(\theta^{(r)})$, and using parameters sampled from p(θ|y) rather than g(θ) (Gelfand and
Dey, 1994). So

\[ \frac{1}{M} = \frac{1}{S}\sum_{r=1}^{S} \frac{g(\theta^{(r)})}{p^*(\theta^{(r)}|y)}. \]
The choice $\alpha(\theta) = 1/[g(\theta)\,p^*(\theta|y)]^{0.5}$ leads to the geometric estimator of Lopes and West
(2004), namely

\[ M = \frac{\frac{1}{R}\sum_{r=1}^{R} \left[p^*(\theta^{(r)}|y)/g(\theta^{(r)})\right]^{0.5}}{\frac{1}{S}\sum_{r=1}^{S} \left[g(\theta^{(r)})/p^*(\theta^{(r)}|y)\right]^{0.5}}. \]

A recursive scheme for obtaining an optimal estimate of α(θ) is also available, and mentioned
by Lopes and West (2004, p.54) and Frühwirth-Schnatter (2004, equation 8). This simplifies
if R = S, as in the first illustrative worked application below. With R ≠ S, s1 = S/(S + R)
and s2 = 1 − s1, one has an updated estimate for M at recursion j

\[ M_j = A(M_{j-1})/B(M_{j-1}) \]

where

\[ A(u) = \frac{1}{R}\sum_{r=1}^{R} \frac{W_{2r}}{s_1 W_{2r} + s_2 u}, \qquad B(u) = \frac{1}{S}\sum_{s=1}^{S} \frac{1}{s_1 W_{1s} + s_2 u}, \]

\[ W_{2r} = p(y|\theta^{(r)})\,p(\theta^{(r)})/g(\theta^{(r)}), \]

and

\[ W_{1s} = p(y|\theta^{(s)})\,p(\theta^{(s)})/g(\theta^{(s)}), \]

with θ(r) sampled from g(θ) and θ(s) sampled from p(θ|y).
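The recursion can be sketched in base R for a conjugate normal model with a known marginal likelihood. The data, prior, and normal approximating density g are assumptions made for illustration, and exact posterior draws stand in for MCMC output. The ratios W are rescaled by a constant before iterating, which leaves the fixed point unchanged apart from that constant and avoids underflow.

```r
set.seed(3)
n <- 20; sigma <- 1                      # known sampling sd (assumed)
m0 <- 0; s0 <- 2                         # prior mean and sd for theta (assumed)
y <- rnorm(n, 1, sigma)

v1 <- 1 / (1 / s0^2 + n / sigma^2)
m1 <- v1 * (m0 / s0^2 + sum(y) / sigma^2)

# Log of the un-normalised posterior p*(theta | y)
log_pstar <- function(theta) {
  sapply(theta, function(t) sum(dnorm(y, t, sigma, log = TRUE))) +
    dnorm(theta, m0, s0, log = TRUE)
}
log_ml_exact <- log_pstar(m1) - dnorm(m1, m1, sqrt(v1), log = TRUE)

S <- 4000; R <- 4000
sd_g    <- 1.5 * sqrt(v1)
theta_p <- rnorm(S, m1, sqrt(v1))        # posterior draws (stand-in for MCMC output)
theta_g <- rnorm(R, m1, sd_g)            # draws from the approximating density g

lW1 <- log_pstar(theta_p) - dnorm(theta_p, m1, sd_g, log = TRUE)   # log W_1s
lW2 <- log_pstar(theta_g) - dnorm(theta_g, m1, sd_g, log = TRUE)   # log W_2r
lstar <- median(lW1)                     # rescaling constant for numerical stability
W1 <- exp(lW1 - lstar)
W2 <- exp(lW2 - lstar)

s1 <- S / (S + R); s2 <- 1 - s1
M <- 1                                   # starting value on the rescaled scale
for (j in 1:1000) {
  M_new <- mean(W2 / (s1 * W2 + s2 * M)) / mean(1 / (s1 * W1 + s2 * M))
  converged <- abs(M_new - M) < 1e-10 * M
  M <- M_new
  if (converged) break
}
log_ml_bridge <- log(M) + lstar          # undo the rescaling
round(c(bridge = log_ml_bridge, exact = log_ml_exact), 4)
```

For applied work the bridgesampling package cited in the text automates these steps for fitted Stan and JAGS models.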

3.2.3 Path Sampling
Another approximation may be obtained by a technique known as path sampling (Gelman
and Meng, 1998; Xie et al., 2011; Friel and Pettitt, 2008). Consider a path variable t ranging
from 0 to 1, and define the power posterior based on various levels of weighted likelihood,
namely

\[ p_t(\theta|y) \propto [p(y|\theta)]^t\,p(\theta). \]

Define the normalising constant of the power posterior

\[ z(y|t) = \int [p(y|\theta)]^t\,p(\theta)\,d\theta \]

so that z(y|t = 0) is the integral of the prior, namely 1 for proper priors, while z(y|t = 1) is the
marginal likelihood, $p(y) = \int p(y|\theta)\,p(\theta)\,d\theta$.
To derive an estimate of z(y|t = 1), one may use the identity

$$\log(p(y)) = \log\left(\frac{z(y|t=1)}{z(y|t=0)}\right) = \int_0^1 E_{\theta|y,t}\log[p(y|\theta)]\,dt$$

which states that the log marginal likelihood is the integral, over temperatures t ranging from 0 to 1, of the expected log-likelihood with respect to the power posterior at temperature t. This follows (Friel and Pettitt, 2008) because

$$\frac{d}{dt}\log[z(y|t)] = \frac{1}{z(y|t)}\frac{d}{dt}z(y|t) = \frac{1}{z(y|t)}\frac{d}{dt}\left[\int\{p(y|\theta)\}^t\,p(\theta)\,d\theta\right]$$

$$= \frac{1}{z(y|t)}\int\{p(y|\theta)\}^t\log[p(y|\theta)]\,p(\theta)\,d\theta$$

$$= \int\frac{\{p(y|\theta)\}^t\,p(\theta)}{z(y|t)}\log[p(y|\theta)]\,d\theta$$

$$= E_{\theta|y,t}\log[p(y|\theta)].$$


One may numerically evaluate the integral over t using the trapezoid rule over T intervals defined using T + 1 temperatures

$$q_s = a_s^c \qquad (3.1)$$

defined at cutpoints $\{a_0, \ldots, a_T\}$ in [0,1], where c is a specified positive power. So, the estimate log(Mc) of the log marginal likelihood at that power is obtained by summing over the T intervals, each combining the expected log likelihoods at its two endpoints,

$$\log(M_c) = \sum_{s=0}^{T-1}(q_{s+1}-q_s)\,\frac{1}{2}\left[E_{\theta|y,q_{s+1}}\log[p(y|\theta)] + E_{\theta|y,q_s}\log[p(y|\theta)]\right].$$
Friel and Pettitt (2008) take c = 4 in (3.1), while Xie et al. (2011) recommend values of c between 2.5 and 5. So with T = 40 intervals, equally spaced cutpoints $\{a_0 = 0, a_1 = 0.025, a_2 = 0.05, \ldots, a_{40} = 1\}$, and setting c = 4, one has $q_0 = 0, q_1 = 0.025^4, \ldots, q_{39} = 0.975^4, q_{40} = 1$. The Monte Carlo standard error of log(Mc) is obtained as the square root of the summed variances of the contributions to log(Mc) at each of T grid points. Thus let $d_s = (1/2)(q_{s+1} - q_s)$ and let $\nu_s$ be the Monte Carlo variance of $E_{\theta|y,q_{s+1}}\log[p(y|\theta)]$. Then the variance at each grid point is $d_s^2\nu_s$ and the Monte Carlo variance of log(Mc) is $\sum_{s=0}^{T-1} d_s^2\nu_s$.
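
The trapezoid scheme can be checked on a case where the power posterior is available in closed form. The sketch below is illustrative only and is not the code used for the examples in the text: it assumes the toy model y_i ~ N(θ, 1) with θ ~ N(0, 1), for which the power posterior at temperature t is N(tΣy/(tn + 1), 1/(tn + 1)), and it replaces MCMC at each temperature by direct sampling. The names n_int, c_pow, E_ll and v_ll are hypothetical.

```r
# Toy conjugate model (an assumption): y_i ~ N(theta, 1), theta ~ N(0, 1)
set.seed(2)
n <- 20
y <- rnorm(n, 0.3, 1)

# Log-likelihood as a vectorised function of theta
loglik <- function(theta) {
  -n / 2 * log(2 * pi) - 0.5 * (sum(y^2) - 2 * theta * sum(y) + n * theta^2)
}

# Temperature ladder q_s = a_s^c on equally spaced cutpoints, as in (3.1)
n_int <- 40
c_pow <- 4
a <- seq(0, 1, length.out = n_int + 1)
q <- a^c_pow

# Expected log-likelihood and its Monte Carlo variance at each temperature
R <- 2000
E_ll <- numeric(n_int + 1)
v_ll <- numeric(n_int + 1)
for (s in seq_along(q)) {
  prec <- q[s] * n + 1                      # power posterior precision
  theta <- rnorm(R, q[s] * sum(y) / prec, sqrt(1 / prec))
  ll <- loglik(theta)
  E_ll[s] <- mean(ll)
  v_ll[s] <- var(ll) / R
}

# Trapezoid rule over the T intervals
d <- 0.5 * diff(q)
log_M <- sum(d * (E_ll[-1] + E_ll[-(n_int + 1)]))

# Monte Carlo standard error, following the summed-variance formula in the text
mc_se <- sqrt(sum(d^2 * v_ll[-1]))

# Exact value for comparison
log_M_exact <- -n / 2 * log(2 * pi) - 0.5 * log(1 + n) -
  0.5 * (sum(y^2) - sum(y)^2 / (1 + n))
c(path_sampling = log_M, exact = log_M_exact, mc_se = mc_se)
```

Because the prior is proper and the expected log-likelihood under the prior is finite here, the ladder can start at exactly zero; with real MCMC output a small positive starting value may be needed for stability.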
To illustrate estimation in path sampling consider the online vignette demonstrating the use of the R package bridgesampling (Gronau et al., 2017a; https://cran.r-project.org/web/packages/bridgesampling/vignettes/bridgesampling_example_jags.html). This example assumes a normal-normal two stage hierarchy (see Chapter 4), as often used in meta-analysis, with known first level variance σ2

$$y_i \sim N(\theta_i, \sigma^2),$$

$$\theta_i \sim N(\mu, \tau^2).$$

The comparison is between a model with μ = 0 and a model with μ unknown. The data (n = 20
cases) are generated under the first option, namely with μ = 0, and also with σ2 = 1 and τ2 = 0.5.

Path sampling as in Friel and Pettitt (2008) is applied, with $q_s = a_s^4$ where $\{a_s\} = \{a_0, 1/T, 2/T, \ldots, (T-1)/T, 1\}$ and T = 30. For numeric stability, a0 is taken as 0.00001 rather than 0, so that q0 = 1E−20. Estimates are made using jagsUI. The parameters and likelihoods at each of the T + 1 points are estimated using the device from Barry (2006). The likelihood is specified as in (4.4.3), with the random effects integrated out, namely:

$$y_i \sim N(\mu, \sigma^2 + \tau^2),$$

with σ2 = 1 known. Using the code listed in the Computational Notes [1] in Section 3.7, the estimated log marginal likelihoods are closely similar to those reported in the bridgesampling vignette, namely −37.53 for the zero-mean model, and −37.81 for the model with μ unknown.

3.2.4 Marginal Likelihood for Hierarchical Models


For conjugate hierarchical models (e.g. Poisson-gamma mixtures) the marginal likelihood
can be obtained analytically (Albert, 1999). However, general linear mixed models (Clayton,
1996) are widely used for handling multiple random effects, with regression terms

$$\eta_i = X_i\beta + W_i b_i,$$

where Xi and Wi are predictors, and bi are latent data. For such non-conjugate schemes, the
marginal likelihood is not obtainable analytically, and one possible approach to evaluat-
ing marginal likelihoods is to work with the integrated likelihood


$$p(y|\theta) = \int p(y,b|\theta)\,db = \int p(y|b,\theta)\,p(b|\theta)\,db,$$

where the random effects or latent data have been integrated out, and where θ includes
hyperparameters ψ (e.g. covariances) governing the b, as well as parameters φ (e.g. fixed
regression effects) not relevant to the random effect hyperdensity (Sinharay and Stern,
2005; Fruhwirth-Schnatter, 1999). This can be done in practice in MCMC sampling by
applying importance sampling, the Laplace approximation, or numeric integration meth-
ods to the complete data likelihood p(y,b|θ).
However, it may be argued that under a Bayesian approach, the distinction between
fixed and random regression coefficients is less relevant, and so use of the integrated likeli-
hood approach and implied numerical complexity may be avoided. For example, one may
(e.g. Clayton, 1996) adopt a unified perspective on the parameters in the joint precision
matrix for the fixed effects (and other parameters not in the hyperdensity) φ, and the ran-
dom effects hyperparameters ψ. Chib (2008) proposes marginal likelihood estimation for
different classes of panel model by marginalisation over the random effects. Sinharay and
Stern (2005) also mention obtaining the marginal likelihood by considering the expanded
parameter set ω = (b,φ,ψ), so that


$$p(y) = \iint p(y|\phi,\psi)\,p(\phi,\psi)\,d\phi\,d\psi$$

$$= \iiint p(y|\phi,b)\,p(b|\psi)\,p(\psi,\phi)\,db\,d\psi\,d\phi.$$

The advantage of working with the expanded likelihood p(y|b,φ) is the avoidance of
repeated integration, but this comes at the expense of an often considerably increased
dimension of the parameter space (namely by the number of components in b). Marginal
likelihood approximation retaining the expanded likelihood is considered in real exam-
ples by Nandram and Kim (2002) and Gelfand and Vlachos (2003).


Let g(θ|b) be a density satisfying $\int g(\theta|b)\,d\theta = 1$, where θ = (ψ,φ), and let θ* be an appropriate fixed point (e.g. a posterior mean). Chen (2005) mentions an estimator for the marginal likelihood M = p(y) in a hierarchical modelling situation based on the identity

$$p(y|\theta^*) = \int p(y|\theta^*,b)\,p(b|\theta^*)\,g(\theta|b)\,d\theta\,db,$$

$$= \int \frac{g(\theta|b)}{p(\theta)}\,\frac{p(y|\theta^*,b)\,p(b|\theta^*)}{p(y|\theta,b)\,p(b|\theta)}\,p(y|\theta,b)\,p(b|\theta)\,p(\theta)\,d\theta\,db,$$

$$= p(y)\,E\left[\frac{g(\theta|b)}{p(\theta)}\,\frac{p(y|\theta^*,b)\,p(b|\theta^*)}{p(y|\theta,b)\,p(b|\theta)}\,\Big|\,y\right],$$

where the expectation is over the joint posterior p(θ,b|y). Taking logarithms provides

$$\log[M] = \log[p(y|\theta^*)] - \log E\left[\frac{g(\theta|b)}{p(\theta)}\,\frac{p(y|\theta^*,b)\,p(b|\theta^*)}{p(y|\theta,b)\,p(b|\theta)}\,\Big|\,y\right].$$

So, with samples $\{\theta^{(r)}, b^{(r)}\}$ from $p(\theta,b|y)$, an estimator for log(M) is

$$\log[M] = \log[p(y|\theta^*)] - \log\left[\frac{1}{R}\sum_{r=1}^{R}\frac{g(\theta^{(r)}|b^{(r)})}{p(\theta^{(r)})}\,\frac{p(y|\theta^*,b^{(r)})\,p(b^{(r)}|\theta^*)}{p(y|\theta^{(r)},b^{(r)})\,p(b^{(r)}|\theta^{(r)})}\right].$$

One option is then to set $g(\theta|b) = p(\theta)$, leading to

$$\log[M] = \log[p(y|\theta^*)] - \log\left[\frac{1}{R}\sum_{r=1}^{R}\frac{p(y|\theta^*,b^{(r)})\,p(b^{(r)}|\theta^*)}{p(y|\theta^{(r)},b^{(r)})\,p(b^{(r)}|\theta^{(r)})}\right].$$

The component $\log[p(y|\theta^*)] = \log(L^*)$ may be estimated from the Monte Carlo average

$$L^* = \frac{1}{R}\sum_{r=1}^{R} p(y|\theta^*,b^{(r)}).$$

Chen (2005) shows that a variance minimising estimator is, however, obtained by setting $g(\theta|b) = p(\theta|b,y)$, namely the conditional posterior density of θ given b.

Example 3.1 Marginal Likelihood and Bayes Factors, Turtle Mortality Data
This example applies approximations to the marginal likelihood to data from Sinharay
and Stern (2005). These are nested binary data yij on n = 244 newborn turtles i = 1, …, mj
clustered into clutches j = 1, …, J, with responses yij = 1 or 0 according to survival or death.
The known predictor is turtle birthweight xij so there are p = 2 regression parameters,
including an intercept. Graphical analysis suggests that heavier turtles have better sur-
vival chances, but also suggests extraneous variability in survival rates across clutches.

Sinharay and Stern (2005) compare several methods of deriving formal model fit measures, namely marginal likelihoods or Bayes factors. Here two alternative models are evaluated for the probability $\pi_{ij} = \Pr(y_{ij} = 1)$ using the temperature path approach of Friel and Pettitt (2008) and jagsUI. One involves a fixed effects only regression on birthweight with a probit link. The other assumes additional random effects based on clutch membership. So model 1 specifies $\pi_{ij1} = \Phi(\beta_1 + \beta_2 x_{ij})$, while model 2 has

$$\pi_{ij2} = \Phi(\beta_1 + \beta_2 x_{ij} + b_j),$$

where $b_j \sim N(0, \sigma_b^2)$. The predictor xij is standardised, and unlike Sinharay and Stern (2005), N(0,1) priors are assumed on the fixed effects {β1,β2}. A gamma prior on the precision $\tau_b = 1/\sigma_b^2$, namely $\tau_b \sim Ga(0.1, 0.1)$, is assumed, as the shrinkage prior

$$p(\sigma_b^2) \propto \frac{1}{(1+\sigma_b^2)^2}, \qquad (3.2)$$

used by Sinharay and Stern cannot be implemented in jagsUI.
There are some possible sources of sensitivity: formal model measures may depend on informativeness in the priors and on the form of prior, for example, the prior density adopted on the random effects variance $\sigma_b^2$ or the precision $\tau_b = 1/\sigma_b^2$. For example, the value $\sigma_b^2 = 1$ has a quarter of the prior weight for $\sigma_b^2 = 0$ under the prior used by Sinharay and Stern (2005). With this prior, they obtain an inconclusive Bayes factor of 1.27 in favour of the simpler fixed effects only model. Under particular methods, additional sensitivity issues occur. Using the temperature path approach, estimates of the marginal likelihood may be affected by the number of sequence points T (Drummond and Bouckaert, 2015) and by the location of the points, especially points near zero.
The temperature path has T = 10 intervals, with $q_s = a_s^4$ where a = (0.00025, 0.0005, 0.001, 0.005, 0.05, 0.075, 0.1, 0.25, 0.5, 0.75, 0.99). The parameters and likelihoods at each of the T + 1 points are estimated using the device from Barry (2006). For numeric stability, the initial point in the path is taken as 0.00025 rather than 0. Although formally, the estimate of log[p(y)] is obtained by piecing together the separate posterior estimates $E_{\theta_k|y,t_s}\log p(y|\theta_k)$, an essentially identical estimate is obtained by applying the trapezoid rule at each iteration and monitoring the composite log marginal likelihood node.
The log marginal likelihood estimate for model 1 is thus obtained as −150.4 as compared to −152.56 for the random effects alternative, model 2, giving B12 = 8.7. The variance $\sigma_b^2$ is estimated at 0.152. Relatively large clutch effects (mean, posterior sd) are obtained under model 2 for clutches 9 and 15, namely 0.46 (0.27) and −0.37 (0.27).
As an alternative option for the random effects approach, defining model 3, a shrinkage prior is implemented by taking a uniform prior on $U = 1/(1+\sigma_b^2)^2$, namely

$$U \sim \text{Unif}(0, 1),$$

$$\sigma_b^2 = (1 - U^{0.5})/U^{0.5}.$$

This option produces a clutch variance estimate $\sigma_b^2 = 0.149$. The log marginal likelihood is −150.4, so that BF13 is 1. Thus varying Bayes factors illustrate the impacts of different priors on the variance or precision. This approach can be extended to allow uncertainty in the shrinkage prior power, allowing potentially more pronounced shrinkage (Gustafson et al., 2006). Thus, one takes a uniform prior on

$$U = 1/(1+\sigma_b^2)^P,$$

where P is unknown with a minimum of 1. So with P = 1 + P1, where $P_1 \sim Ga(0.1, 0.1)$, one has

$$\sigma_b^2 = (1 - U^{1/P})/U^{1/P}.$$

This leads to an estimate for P of the default value 1, but with an estimated log marginal likelihood of −151.4 (and so BF13 = 2.7), while the posterior mean for $\sigma_b^2$ is now 0.155.
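
The prior on the variance implied by a uniform prior on U can be checked by simulation. By a change of variables, U = 1/(1 + σ²)^P with U uniform on (0, 1) gives Pr(σ² ≤ s) = 1 − (1 + s)^(−P), that is, a density P/(1 + σ²)^(P+1); so P = 1 corresponds to the form in (3.2), while a power of 2 gives heavier shrinkage towards zero. The following sketch compares empirical and implied distribution functions; the function name draw_sigma2 and the grid s_grid are illustrative choices, not from the text.

```r
# Draw sigma^2 by transforming a uniform U = 1 / (1 + sigma^2)^P
draw_sigma2 <- function(n_draws, P = 1) {
  U <- runif(n_draws)
  (1 - U^(1 / P)) / U^(1 / P)
}

set.seed(3)
s_grid <- c(0.5, 1, 2, 5)

# Empirical distribution function of the draws on a small grid
emp_cdf <- function(P) {
  s2 <- draw_sigma2(1e5, P)
  sapply(s_grid, function(s) mean(s2 <= s))
}

# Distribution function implied by the change of variables
theo_cdf <- function(P) 1 - (1 + s_grid)^(-P)

cdf_P1 <- rbind(empirical = emp_cdf(1), implied = theo_cdf(1))
cdf_P2 <- rbind(empirical = emp_cdf(2), implied = theo_cdf(2))
colnames(cdf_P1) <- colnames(cdf_P2) <- paste0("s=", s_grid)
round(cdf_P1, 3)
round(cdf_P2, 3)
```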
An advantage with rstan is that the prior (3.2) can be represented using the expression (within the model segment)

target += -2 * log(1 + sigma2);

where sigma2 is the unknown variance. Using rstan in combination with the bridgesampling package provides respective log marginal likelihoods for models 1 and 2 of −156.48 and −156.71, and a Bayes factor BF12 = 1.26 (Gronau et al., 2017a). This is close to that reported by Sinharay and Stern (2005). The clutch variance for model 2 (with prior as in equation 3.2) is estimated as $\sigma_b^2 = 0.153$.
This option is also used to compare the fixed effects regression model with a variable slopes model (model 4), namely

$$\pi_{ij4} = \Phi(\beta_1 + (\beta_2 + b_j)x_{ij}),$$

where $b_j \sim N(0, \sigma_b^2)$. A gamma prior on $1/\sigma_b^2$ is taken. The resulting log marginal likelihood is −160.13, and so a more decisive advantage for the simpler model, with BF14 = 38.66. This counts as strong evidence in favour of the simpler model according to the schedules of Jeffreys (1961), and of Kass and Raftery (1995).
One may also apply rstan to direct path sampling, namely to estimating a sequence of models with varying temperatures t,

$$p_t(\theta|y) \propto [p(y|\theta)]^t\,p(\theta)$$

with t ranging from 0 to 1. If U(ti) denotes the posterior mean of the actual (untempered) log likelihood at an ascending temperature sequence $t_i \in [0, 1]$, for i = 1, …, T, then the log marginal likelihood estimate is $\sum_{i=1}^{T} U(t_i)/T$. Alternatively, one may generate T temperatures randomly from the uniform U(0,1). For a selected temperature t, the code for the fixed effects model is

model="data {
  int<lower = 1> N;
  int<lower = 0, upper = 1> y[N];
  real<lower = 0> x[N];
  real<lower = 0, upper = 1> t; // temperature for path sampling
}
parameters {
  real alpha;
  real beta;
}
transformed parameters {
  // untempered log likelihood contribution of each case
  real U_case[N];
  for (i in 1:N) {
    U_case[i] = bernoulli_lpmf(y[i] | Phi(alpha + beta * x[i]));
  }
}
model {
  target += normal_lpdf(alpha | 0, 3.16);
  target += normal_lpdf(beta | 0, 3.16);
  // likelihood raised to the power t
  for (i in 1:N) {
    target += t * bernoulli_lpmf(y[i] | Phi(alpha + beta * x[i]));
  }
}
generated quantities {
  // total untempered log likelihood
  real U;
  U = sum(U_case);
}
"
For example, a calling sequence with T = 1000 randomly generated temperatures is

library(rstan)
T=1000
temps=runif(T,0,1)
U=numeric(T)
# divert console output from the repeated fits to a file
sink("sink.txt")
for (i in 1:length(temps)) {
  D=list(y=turtles$y, x=turtles$x, N=244, t=temps[i])
  fit=stan(model_code=model, data=D, iter=1250, warmup=250, chains=1,
           refresh=-1, seed=100)
  # posterior mean of U at temperature temps[i]
  U[i]=summary(fit, pars=c("U"))$summary[1]
}
sink()
# log marginal likelihood estimate
mean(U)

where the posterior mean log likelihood at temperature ti is U[i]. In practice, this approach is computationally intensive, requiring high T, and producing different results each time. The above call led to an estimated log marginal likelihood of −157.15.

3.3 Effective Model Dimension and Penalised Fit Measures


Classical model choice is frequently based on penalised likelihood criteria, such as
the Akaike Information Criterion or AIC (Akaike, 1973), and the Bayesian Information
Criterion or BIC (Schwarz, 1978). Such criteria are applicable in comparing fixed effects
models with known dimension d, and with models assumed nested within one another.
With L denoting a maximised log likelihood, and D = −2L denoting the deviance, likelihood ratio tests comparing models 1 and 2 are obtained with

$$C = -2(L_1 - L_2) = D_1 - D_2$$

where C is approximately chi-square, with degrees of freedom $d_2 - d_1$ equal to the number of additional parameters in the more complex model 2. The AIC is defined as $2d - 2L = D + 2d$ and the difference in AICs between models 1 and 2 is $\Delta AIC = C + 2(d_1 - d_2)$. However, classical likelihood ratio testing is not possible in random effects models or models with parameter constraints (e.g. order or size constraints) that make the effective number of estimated parameters itself a random variable so that the asymptotic distribution of the log likelihood ratio is unknown.
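
As a small illustration of these classical quantities, the sketch below fits two nested probit regressions to simulated binary data with base R's glm and computes C, its chi-square tail probability, and the AIC difference. The data are simulated purely for illustration (they are not the turtle data), and the object names are hypothetical.

```r
# Simulated binary data (illustrative only)
set.seed(4)
n <- 244
x <- rnorm(n)
y <- rbinom(n, 1, pnorm(0.3 + 0.4 * x))

# Model 1: intercept only; model 2: adds the predictor x
m1 <- glm(y ~ 1, family = binomial(link = "probit"))
m2 <- glm(y ~ x, family = binomial(link = "probit"))

# Maximised log likelihoods, deviances and parameter counts
L1 <- as.numeric(logLik(m1))
L2 <- as.numeric(logLik(m2))
D1 <- -2 * L1
D2 <- -2 * L2
d1 <- length(coef(m1))
d2 <- length(coef(m2))

# Likelihood ratio statistic and its approximate chi-square tail probability
C <- D1 - D2
p_value <- pchisq(C, df = d2 - d1, lower.tail = FALSE)

# Difference in AICs between models 1 and 2
dAIC <- C + 2 * (d1 - d2)
c(C = C, p_value = p_value, dAIC = dAIC)
```

A positive dAIC favours the larger model 2, since it means that model 1 has the higher AIC.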

3.3.1 Deviance Information Criterion (DIC)


Spiegelhalter et al. (2002) provide a penalised fit criterion analogous to the AIC and BIC,
called the deviance information criterion or DIC. This is applicable to comparing non-
nested models and also to models including random effects where the true model dimen-
sion is another unknown. The DIC is based on the posterior distribution of the deviance
statistic

$$D(\theta|y) = -2\log[p(y|\theta)] + 2\log[h(y)]$$

where p(y|θ) is the likelihood of the data y given parameters θ, and h(y) is a standardising
function of the data only (and so does not affect model choice).
Suppose the deviance is monitored during an MCMC run, providing samples $\{D^{(1)}, \ldots, D^{(R)}\}$. The overall fit of a model is measured by the posterior expected deviance obtained by averaging over the posterior density of the parameters,

$$\bar{D} = E_{\theta|y}[D],$$

while the effective model dimension, $d_e$, is estimated as

$$d_e = E_{\theta|y}[D] - D(E_{\theta|y}[\theta]) = \bar{D} - D(\bar{\theta}|y), \qquad (3.3)$$

namely the expected deviance minus the deviance at the posterior means of the param-
eters; the latter is also known as the plug-in deviance (Plummer, 2008). In hierarchical
random effects models, the effective number of parameters in total is typically lower than
the nominal number of parameters, due to borrowing of strength under the hyperdensity
(e.g. Zhu et al., 2006; Buenconsejo et al., 2008).
The DIC is then obtainable as the expected deviance plus the effective model dimension,

$$DIC = \bar{D} + d_e = D(\bar{\theta}) + 2d_e. \qquad (3.4)$$

So the DIC will prefer models with lower values of $\bar{D}$, combined with smaller values of $d_e$ (which indicate a relatively parsimonious model). A possible disadvantage with the DIC is that it can be affected by reparameterisation of θ or by the form of link in general linear models, with this applying in particular to the “plug-in” deviance $D(\bar{\theta}|y)$; hence the value of $d_e$ may be sensitive to parameterisation.
The deviance $D(\bar{\theta}|y)$ at the posterior mean $\bar{\theta}$ of the parameters may also be estimated by using posterior means of quantities involved in defining the deviance, such as case means (Poisson likelihood), means and overdispersion parameter (negative binomial likelihood), means and variance (normal likelihood), and so on (Spiegelhalter et al., 2002, p.596). Thus, let μi denote case specific means and ξ denote any other parameters needed to derive the deviance. Then an estimate $D(\bar{\mu}, \bar{\xi}|y)$ may be more easily obtainable than $D(\bar{\theta}|y)$ in complex (e.g. discrete mixture) models, or in models with many random effects, where the number of nominal parameters may considerably exceed the number of cases. This type of procedure is also mentioned by Spiegelhalter (2006) in terms of monitoring the “direct parameters” that appear in the distributional syntax and plugging these into the deviance; it was adopted in the paper by Ohlssen et al. (2006, section 2).
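
The calculation in (3.3) and (3.4) can be sketched from sampled parameter values. The example below is illustrative and assumes a one-parameter model, y_i ~ N(θ, 1) with a weak N(0, 100) prior, whose exact posterior is sampled directly as a stand-in for MCMC output; with a single weakly constrained parameter the effective dimension should be close to 1. The names deviance_fn, D_bar and D_hat are hypothetical.

```r
# Toy model (an assumption): y_i ~ N(theta, 1), theta ~ N(0, 100)
set.seed(5)
n <- 20
y <- rnorm(n, 1, 1)
prior_var <- 100
post_var <- 1 / (n + 1 / prior_var)
post_mean <- post_var * sum(y)

# Direct posterior draws, standing in for MCMC samples
R <- 5000
theta <- rnorm(R, post_mean, sqrt(post_var))

# Deviance with the standardising term h(y) omitted
deviance_fn <- function(th) -2 * sum(dnorm(y, th, 1, log = TRUE))
D <- sapply(theta, deviance_fn)

D_bar <- mean(D)                   # posterior expected deviance
D_hat <- deviance_fn(mean(theta))  # plug-in deviance at the posterior mean
d_e <- D_bar - D_hat               # effective model dimension, as in (3.3)
DIC <- D_bar + d_e                 # as in (3.4)
c(D_bar = D_bar, D_hat = D_hat, d_e = d_e, DIC = DIC)
```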

The DIC and de can be disaggregated to individual observations, and provide a measure
of local complexity, namely of observations that are more problematic under the model rel-
ative to others. Spiegelhalter et al. (2002, p.602) mention that the local complexity measures

$$d_{ei} = \bar{D}_i - D_i(\bar{\theta})$$

measure the leverage of observation i, defined as the relative influence that each obser-
vation has on its own fitted value. Unusually large observation specific DIC measures,
namely

$$DIC_i = \bar{D}_i + d_{ei}$$

are used by Spiegelhalter et al. (2002) as indicators of outlier status – observations incon-
sistent with the model. The DIC can be seen as a Bayesian version of AIC and may under-
penalise model complexity, as pointed out by discussants to Spiegelhalter et al. (2002). By
contrast, it is well established (Burnham and Anderson, 2002) that the BIC tends to select
overly parsimonious models. A fit criterion analogous to the BIC may be defined as

$$DIC^* = D(\bar{\theta}) + d_e\log(n)$$

and was used by Pourahmadi and Daniels (2002, p.228) for panel data with repeated obser-
vations over n subjects.
Note that the model with the lowest DIC or DIC* will not necessarily be a suitable model
if it does not reproduce the data adequately. Hence, model checks are required to assess
consistency of predictions from the model with the actual observations.
Just as there are alternative approaches to marginal likelihood derivation in hierarchical
models, Spiegelhalter et al. (2002) point out that for such models, one cannot uniquely define
the likelihood or model complexity without specifying the level of the hierarchy that is the
model focus. Thus one might analyse count data using a complete data likelihood (with
unknown latent data b as well as hyperparameters θ) using a Poisson-gamma or Poisson-
lognormal model, or alternatively apply a negative binomial likelihood with the random
effects integrated out (Fahrmeir and Osuna, 2003), and the complexity measures will obvi-
ously differ.
Model choice may be affected by the focus, as shown by Plummer (2008, p.530) in an analysis of a discrete mixture model, with one approach considering a complete data likelihood $p_C(y|b,\theta)$ (with the parameters including missing component indicators), and the other considering the integrated likelihood $p_I(y|\theta)$. Ando (2007) considered DICs based on both conditional and integrated likelihoods, namely $DIC_C$ and $DIC_I$, and showed that both tend to select overfitted (i.e. non-parsimonious) models.

3.3.2 Alternative Complexity Measures


Plummer (2008) confirms that the DIC tends to under-penalise complex models, particu-
larly when the ratio of the sample size to the effective number of parameters is relatively
low. Plummer (2008) proposes an alternative effective dimension penalty based on cross-
validation considerations. Thus, suppose Z constitutes training data and Y are test data,
and consider a loss function L(Y,Z). Assume that Y and Z are conditionally independent
given θ. Then $p(Y|\theta, Z) = p(Y|\theta)$, and the log-scoring rule (Gneiting and Raftery, 2007) for Y is then the log-likelihood $\log\{p(Y|\theta)\}$ of θ. The corresponding loss function is the deviance $D(\theta) = -2\log\{p(Y|\theta)\}$.
As estimates of the loss function, one may consider either the plug-in deviance

$$L_p(Y|Z) = -2\log[p\{Y|\bar{\theta}(Z)\}]$$

where $\bar{\theta}(Z) = E(\theta|Z)$, or the expected deviance

$$L_e(Y|Z) = -2\int\log\{p(Y|\theta)\}\,p(\theta|Z)\,d\theta,$$

with the test data considered fixed. Whereas the plug-in deviance is sensitive to reparam-
eterisation and does not take account of the precision of θ(Z), the expected deviance is
coordinate-free and takes account of precision.
When there are no training data, Y must be used to estimate θ and assess model fit.
However, L(Y,Y) is optimistic as a measure of model adequacy, as it uses the data twice.
Consider the corresponding function for observation i, namely L(Yi,Y). This can be com-
pared with the cross-validation loss L(Yi,Y[i]), where Y[i] is Y with observation i excluded.
The excess of L(Yi,Y[i]) over L(Yi,Y) provides a measure of optimism from using the data
twice. The expected decrease in loss due to using L(Yi,Y) instead of L(Yi,Y[i]) is obtained as

dopt,i = E{L(Yi , Y[i] ) −L(Yi , Y )|Y[i]}. (3.5)

Summing over observations provides a complexity measure $d_{opt}$, with a corresponding penalised fit measure

$$DIC_{opt} = \bar{D} + d_{opt}. \qquad (3.6)$$

Issues of focus, as well as the derivation of the complexity measure $d_e$, are also considered by Celeux et al. (2006). In general terms, a complexity measure or effective parameter count is obtained by comparing the mean deviance with the deviance at the pseudo-true parameter values θt (Spiegelhalter et al., 2002, section 3.2). There are various estimators $\tilde{\theta}$ of the pseudo-true parameter values θt, apart from the element wise posterior means. Another possibility is to consider the posterior mode, $\hat{\theta}$, namely the value that generates the maximum posterior density $p(\theta|y) \propto p(y|\theta)p(\theta)$ (Celeux et al., 2006, p.654), so that $\hat{\theta} = \arg\max_{\theta} p(\theta|y)$. In applications (e.g. discrete mixture models and random effect models) with missing data b, this extends to considering the pair $(\hat{\theta}, \hat{b})$ that generates the maximum posterior density (Celeux et al., 2006, p.656). Celeux et al. (2006) mention other possibilities for $\tilde{\theta}$, such as the EM maximum likelihood estimate.
They state different DIC definitions under three alternative foci (observed data likelihood, complete data likelihood, and conditional likelihood) and under different options for $\tilde{\theta}$. For the observed data focus with likelihood $p(y|\theta)$, obtained possibly after integrating out random effects, one has

$$DIC = \bar{D} + d_e = D(\tilde{\theta}) + 2d_e = 2\bar{D} - D(\tilde{\theta}) = -4E_{\theta}[\log\{p(y|\theta)\}|y] + 2\log[p(y|\tilde{\theta})].$$

It can be seen that taking $\tilde{\theta}$ as the posterior mean amounts to assuming

$$DIC = -4E_{\theta}\left[\log\{p(y|\theta)\}|y\right] + 2\log\left[p\{y|E_{\theta}(\theta|y)\}\right],$$


whereas taking $\tilde{\theta}$ as the posterior mode $\hat{\theta}$ amounts to an alternative DIC definition, denoted DIC2 by Celeux et al., namely

$$DIC_2 = -4E_{\theta}\left[\log\{p(y|\theta)\}|y\right] + 2\log\left[p(y|\hat{\theta})\right].$$

For a complete data focus, with likelihood $p(y,b|\theta) = p(y|b,\theta)\,p(b|\theta)$ including the second stage likelihood model p(b|θ) for the missing data (e.g. Kuhn and Lavielle, 2005), one obtains

$$\bar{D} = -2E_{\theta,b}\left[\log\{p(y,b|\theta)\}|y\right].$$

Taking b as additional parameters, one may define $\tilde{\theta}$ on the basis of joint modal or maximum a posteriori parameters, $(\hat{\theta}, \hat{b})$, with $d_e$ obtained by comparing the average deviance $\bar{D}$ with

$$D(\tilde{\theta}) = -2\log\left[p(y,\hat{b}|\hat{\theta})\right].$$

The joint mode $(\hat{\theta}, \hat{b})$ may be estimated by monitoring the posterior density over an MCMC sequence, and finding that set of values $\{\theta^{(r)}, b^{(r)}\}$ associated with the maximum value, $p_{\max}(\theta|y)$, of the posterior density. The DIC may then be defined as

$$DIC = -4E_{\theta,b}\left[\log\{p(y,b|\theta)\}|y\right] + 2\log\left[p(y,\hat{b}|\hat{\theta})\right],$$

with complexity estimated as

$$d_e = -2E_{\theta,b}\left[\log\{p(y,b|\theta)\}|y\right] + 2\log\left[p(y,\hat{b}|\hat{\theta})\right].$$

3.3.3 WAIC and LOO-IC

3.3.3 WAIC and LOO-IC


The WAIC (widely applicable information criterion) and the LOO-IC (leave one out information criterion) are more recently developed measures of complexity penalised fit, and are based on averaging over the posterior distribution, rather than using posterior means $\bar{\theta}$ or other point estimates of θ (Watanabe, 2010; Vehtari et al., 2017). The WAIC is obtained as

$$WAIC = -2(LPPD(y|\theta) - d_e)$$

where

$$LPPD(y|\theta) = \sum_{i=1}^{n}\log\int p(y_i|\theta)\,p(\theta|y)\,d\theta$$

is the log posterior predictive density (LPPD) for y (Gelman et al., 2014), and $d_e$ is the estimated effective model dimension (complexity). The LPPD is an estimate, albeit a biased overestimate, of the expected log posterior predictive density (ELPD) for (unobserved) new data $\tilde{y}$ generated from the same density as the observed data y, and the complexity measure is a measure of the bias.
To estimate the LPPD for a particular observation, one obtains the likelihood for that observation at each MCMC iteration (i.e. conditioning on θ(r) at iteration r). The resulting vector of likelihoods, for observation i and samples r = 1, …, R, can be denoted $L_i = (L_{i1}, \ldots, L_{iR})$. The log of the mean of $L_{ir}$ over iterations r provides the LPPD for observation i, namely $LPPD(y_i|\theta) = \log(\bar{L}_i)$. The total of these over observations is the estimate of the LPPD.
The estimated complexity for the WAIC is obtained by monitoring log-likelihoods during MCMC sampling, namely $LL_{ir} = \log(L_{ir})$. Then the variance of $LL_i = (LL_{i1}, \ldots, LL_{iR})$ provides an estimate of complexity $d_{ei}$ for that observation, $d_{ei} = \mathrm{var}(LL_i)$. The total of the $d_{ei}$ is the total complexity $d_e$. The estimated piecewise WAIC can be obtained as $-2(\log(\bar{L}_i) - d_{ei})$, and the total WAIC as the sum of the piecewise WAIC.
If the R package loo is used to obtain the LOO-IC, then it is more convenient to monitor log-likelihoods (which are the input to loo), and then obtain sampled likelihoods by exponentiation. For example, using rjags or jagsUI, and with R an object containing model results (including sampled log-likelihoods, LL), WAIC calculations are as follows:

# sampled log-likelihoods (rows: iterations, columns: observations)
LL = as.matrix(R$sims.list$LL)
L=exp(LL)
# casewise LPPD: log of the mean likelihood
waic1=log(apply(L,2,mean))
# casewise complexity: variance of the log-likelihood
waic2=apply(LL,2,var)
# casewise waic
waic.pw=-2*(waic1-waic2)
elpd_waic=sum(waic1)-sum(waic2)
# total waic
waic=-2*elpd_waic

The LOO-IC uses an estimate of the leave-one-out predictive fit (or ELPD)

$$\sum_{i=1}^{n}\log[p(y_i|y_{[i]})],$$

where y[i] is the set of observations omitting yi, and

$$p(y_i|y_{[i]}) = \int p(y_i|\theta)\,p(\theta|y_{[i]})\,d\theta.$$

The latter may be estimated using samples θr from the full data posterior p(θ|y) using importance ratios

$$IR_{ir} = \frac{1}{p(y_i|\theta^r)} \propto \frac{p(\theta^r|y_{[i]})}{p(\theta^r|y)},$$

with the estimator for $p(y_i|y_{[i]})$ then being

$$\frac{\sum_{r=1}^{R} IR_{ir}\,p(y_i|\theta^r)}{\sum_{r=1}^{R} IR_{ir}} = \frac{1}{\frac{1}{R}\sum_{r=1}^{R}\frac{1}{p(y_i|\theta^r)}}.$$

This estimator may be unstable due to high variances of the importance ratios for certain observations.
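
The raw importance ratio estimator can be compared with exact leave-one-out densities when the latter are available. The sketch below is illustrative: it assumes the conjugate toy model y_i ~ N(θ, 1) with θ ~ N(0, 1), for which y_i given y[i] is normal with mean Σ_{j≠i} y_j / n and variance 1 + 1/n, and it samples the full data posterior directly rather than by MCMC. Object names such as p_loo and elpd_exact are hypothetical.

```r
# Toy conjugate model (an assumption): y_i ~ N(theta, 1), theta ~ N(0, 1)
set.seed(6)
n <- 20
y <- rnorm(n, 0.3, 1)

# Draws from the full data posterior
R <- 20000
theta <- rnorm(R, sum(y) / (n + 1), sqrt(1 / (n + 1)))

# R x n matrix of sampled likelihoods p(y_i | theta^r)
L <- sapply(y, function(yi) dnorm(yi, mean = theta, sd = 1))

# Raw importance ratios and the resulting leave-one-out density estimates
IR <- 1 / L
p_loo <- colSums(IR * L) / colSums(IR)  # harmonic mean of the likelihoods
elpd_is <- sum(log(p_loo))

# Exact leave-one-out predictive densities for this model
elpd_exact <- sum(dnorm(y, (sum(y) - y) / n, sqrt(1 + 1 / n), log = TRUE))
c(importance = elpd_is, exact = elpd_exact, looic = -2 * elpd_is)
```

In this well-behaved example the raw ratios have finite variance, so no smoothing or truncation is needed; that will not hold in general, which is the motivation for the smoothed weights.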

Vehtari et al. (2017) use a smoothed version of the importance ratios based on fitting a generalised Pareto density to the upper tail of the importance ratios, leading to Pareto smoothed importance sampling (PSIS) estimates of the LOO-IC. Let $w_{ir}$ denote the smoothed importance weights. Then the estimate of the ELPD is

$$ELPD_{PSIS\text{-}LOO} = \sum_{i=1}^{n}\log\left(\frac{\sum_{r=1}^{R} w_{ir}\,p(y_i|\theta^r)}{\sum_{r=1}^{R} w_{ir}}\right),$$

with the LOO-IC estimated as $-2 \times ELPD_{PSIS\text{-}LOO}$. The estimate of the effective parameter total is then

$$d_{PSIS\text{-}LOO} = LPPD - ELPD_{PSIS\text{-}LOO}.$$

The LOO-IC may be obtained directly as follows:

LL = as.matrix(R$sims.list$LL)
L=exp(LL)
S = nrow(LL)
n = ncol(LL)
# casewise LPPD
lpd_pw = log(colMeans(L))
# raw importance ratios, rescaled for numerical stability
w = 1/exp(LL-max(LL))
# normalise each column to mean 1, then truncate the weights at sqrt(S)
w_n = w/matrix(colMeans(w),S,n,byrow=TRUE)
w_r = pmin(w_n, sqrt(S))
elpd_loo_pw = log(colMeans(L*w_r)/colMeans(w_r))
p_loo_pw = lpd_pw - elpd_loo_pw
# Complexity
sum(p_loo_pw)
# LOO-IC
-2*sum(elpd_loo_pw)

Though the WAIC and LOO-IC provide an estimate of predictive ability, both are sub-
ject to stochastic variability which can be considerable for smaller datasets (Piironen and
Vehtari, 2017). There may also be cautions regarding the estimates of WAIC and LOO-IC,
provided in the loo package and discussed by Vehtari et al. (2017, p.1416). For the LOO-IC,
these are based on the estimated shape parameter of the generalized Pareto, values of
which indicate whether the variance of the importance ratios is effectively infinite.

3.3.4 The WBIC
The BIC is a penalised fit measure, and the widely applicable Bayesian information criterion or WBIC (Watanabe, 2013) is therefore included here, though it is essentially based on an estimator of the marginal likelihood. Thus following Friel et al. (2017), and referring to path sampling ideas, there exists a unique temperature t* such that

$$\log p(y) = E_{\theta|y,t^*}\log p(y|\theta).$$

Watanabe (2013) shows that asymptotically, as the sample size n tends to ∞, t* ≈ 1/log(n). Friel et al. (2017) show for a number of worked examples that the optimal t* is smaller than 1/log(n), but that the latter approximation may be a useful practical option, except when weakly informative priors are used.
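
The single temperature idea can be sketched on a model where both the power posterior and the marginal likelihood are known. The example below is illustrative: it assumes y_i ~ N(θ, 1) with θ ~ N(0, 1), samples the power posterior at t* = 1/log(n) directly, and compares the resulting expected log-likelihood with the exact log marginal likelihood. Since the result is asymptotic, only rough agreement should be expected at moderate n. The object names are hypothetical.

```r
# Toy conjugate model (an assumption): y_i ~ N(theta, 1), theta ~ N(0, 1)
set.seed(7)
n <- 50
y <- rnorm(n, 0.3, 1)

# Log-likelihood as a vectorised function of theta
loglik <- function(theta) {
  -n / 2 * log(2 * pi) - 0.5 * (sum(y^2) - 2 * theta * sum(y) + n * theta^2)
}

# Power posterior at temperature t* = 1/log(n) is N(t*sum(y)/prec, 1/prec)
t_star <- 1 / log(n)
prec <- t_star * n + 1
R <- 10000
theta <- rnorm(R, t_star * sum(y) / prec, sqrt(1 / prec))

# Estimate of log p(y): expected log-likelihood under the power posterior
log_M_wbic <- mean(loglik(theta))

# Exact log marginal likelihood for comparison
log_M_exact <- -n / 2 * log(2 * pi) - 0.5 * log(1 + n) -
  0.5 * (sum(y^2) - sum(y)^2 / (1 + n))
c(wbic_estimate = log_M_wbic, exact = log_M_exact)
```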

Example 3.2 Turtle Survival and Penalised Fit Measures


This example compares penalised fit measures for the data of Example 3.1. As before, a fixed effects model is compared with random clutch effects alternatives. Models are compared first using rjags and jagsUI. Normal N(0,10) priors are adopted on the fixed effects, and in the random effects models, a shrinkage prior is implemented by taking a uniform prior on $U = 1/(1+\sigma_b^2)^2$, so that $\sigma_b^2 = (1 - U^{0.5})/U^{0.5}$. The WBIC for these models (Watanabe, 2013) is also estimated using rstan, and using the prior on $\sigma_b^2$ as in Sinharay and Stern (2005).
The DICs of models 1 and 2 (fixed effects and random clutch intercepts) are estimated
with JAGS using different penalties on the mean deviance, as proposed by Plummer
(2008). The usual DIC estimates, following Spiegelhalter et al. (2002), as in (3.4), can then
be compared with optimism adjusted DIC estimates, as in (3.6). For model 1, these are
denoted DIC1.pD and DIC1.popt in the code.
On this basis, the effective parameter total de2 for model 2 is 12.3 compared to 2.2
under model 1, a difference of approximately 10, whereas there are 32 extra nominal
parameters (the random effects variance and the 31 cluster effects). The DIC is then 3.4
lower under model 2, as the mean deviance is lessened from 299.8 to 286.3.
Comparing the DICs, as defined by (3.4), suggests a small advantage for the random
effects model, though the small DIC difference is unlikely to be significant according to
the rule of thumb in Spiegelhalter et al. (2002, section 9.2.4). The WAIC and LOO-IC also
both show a slight advantage for model 2 as against model 1, with the respective WAIC
being 299.4 and 301.7, and the respective LOO-IC being 299.6 and 301.7.
The optimism adjusted DICs (DIC.popt), by contrast, show a considerably better fit for model 1. There are now more effective parameters under model 2, namely $d_{opt,2} = 25.4$, as compared to $d_{opt,1} = 4.0$, and this offsets the reduction in mean deviance. The WBIC also prefers model 1, with respective values for models 1 and 2 of 308.4 and 312.9.
Other criteria may be considered. For example, the 5% worst fitting cases under
model 1 (without clutch effects) account for 14% of the WAIC, suggesting some issues
in fit for individual cases, especially in clutches 9 and 10. Additionally, estimates from
model 2 show the density of σb to be relatively symmetric and to have its mass away
from zero (Figure 3.1), supporting the presence of at least some random variability. In
this connection, MacNab et al. (2004) illustrate how – when a form of random variation
is not supported by the data – the density of the random effects standard deviation can
be heavily skewed to the left or “spiked,” with the posterior mass piled up against zero.
In this regard, it is relevant to consider the significance of individual clutch effects
under model 2. In fact, there are three clutch effects (clutches 9, 10, and 26) with a 0.85
chance of exceeding zero, while the clutch 15 effect has posterior probability of 0.07 of
exceeding 0.
It is of interest also to compare the fixed effects model to a random slopes model (now
denoted model 3), as the Bayes factor showed strong positive support for the simpler
model. In this regard, the optimism adjusted DIC has the same preference for the simpler
model, with $DIC_{opt,3} = 314.3$ compared to $DIC_{opt,1} = 303.3$. Similarly, the WBIC for
model 3 is estimated as 314.7, higher than 308.4 for the simplest model 1. By contrast,
the DIC of (3.4), the WAIC and the LOO-IC all show a slight preference for the random
slopes model over model 1. Such findings suggest that in certain applications, penalised
fit measures and formal model comparisons may prefer different models. Figure 3.2
plots out the varying slopes under model 3.

FIGURE 3.1
Density of clutch variance (horizontal axis: variance, 0.0 to 1.0; vertical axis: density).

FIGURE 3.2
Random slopes on birth weight (horizontal axis: slope, 0.30 to 0.45; vertical axis: frequency).

3.4 Variance Component Choice and Model Averaging


A considerable amount of research has been devoted to MCMC selection of significant
predictors in regression – sometimes called variable selection or predictor selection. Such
techniques are consistent with a formal Bayes approach, but less constrained by the com-
plex integration issues that may be involved in obtaining marginal likelihoods. Different
possible approaches to predictor selection are considered by Rockova et al. (2012), Sala-i-
Martin et al. (2004), and Fernandez et al. (2001). With Jj as a binary indicator for retaining
or excluding the jth regression coefficient βj, George and McCulloch (1993, 1997) develop
stochastic search variable selection (SSVS) using a mixture prior

$$p(\beta_j \mid J_j) = J_j\, p(\beta_j \mid J_j = 1) + (1 - J_j)\, p(\beta_j \mid J_j = 0)$$

in which the “inclusion prior” $p(\beta_j \mid J_j = 1)$ is a diffuse or possibly informative prior, but
one that allows realistic search for the parameter value. By contrast, the “exclusion prior”
$p(\beta_j \mid J_j = 0)$ is centred at zero with high precision. For example, one might have

$$p(\beta_j \mid J_j = 1) \sim N(0, V_j)$$

with $V_j$ large, but

$$p(\beta_j \mid J_j = 0) \sim N(0, V_j / K_j), \quad K_j \gg 1$$

with Kj chosen so that the sampling from the prior is constrained to values around zero,
that is, to substantively insignificant values. If all p predictors apart from the intercept are
open to inclusion or exclusion, then MCMC sampling over parameters βj and indicators Jj
is averaging over $2^p$ possible models (Fernandez et al., 2001).
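To make the contrast between the inclusion and exclusion components concrete, the following Python sketch draws coefficients from a two-component normal mixture prior of this kind. It illustrates the prior only, not the full SSVS sampler; the settings (slab variance 100, variance ratio 10,000, prior inclusion probability 0.5, three predictors) are hypothetical values chosen for illustration.

```python
import random
from statistics import pstdev

def draw_from_mixture_prior(v_j, k_j, incl_prob, rng):
    """Draw (J_j, beta_j): slab N(0, V_j) if J_j = 1, spike N(0, V_j / K_j) if J_j = 0."""
    j = 1 if rng.random() < incl_prob else 0
    variance = v_j if j == 1 else v_j / k_j
    return j, rng.gauss(0.0, variance ** 0.5)

rng = random.Random(42)
# Hypothetical settings: V_j = 100, K_j = 10000, prior inclusion probability 0.5.
draws = [draw_from_mixture_prior(100.0, 10000.0, 0.5, rng) for _ in range(20000)]
spike = [b for j, b in draws if j == 0]
slab = [b for j, b in draws if j == 1]

n_models = 2 ** 3   # p = 3 candidate predictors give 2^p = 8 possible models
print(round(pstdev(spike), 3), round(pstdev(slab), 2), n_models)
```

Draws under the exclusion component stay within a narrow band around zero, while draws under the inclusion component range widely, which is the behaviour the choice of a large variance ratio is intended to produce.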
By contrast, Kuo and Mallick (1998) and Smith and Kohn (1996) take the selection indica-
tors Jj and coefficients βj to be independent rather than being governed by mixture priors.
Assuming normal priors, one has βj = 0 if Jj = 0, but $p(\beta_j) \sim N(0, V_j)$ if Jj = 1. Following Zellner
(1986), the prior on $(\beta_0, \beta_1, \ldots, \beta_p)$ may be specified as a g-prior, namely

$$(\beta_0, \beta_1, \ldots, \beta_p \mid \sigma^2) \sim N_{p+1}\left(B,\; g\sigma^2 (X'X)^{-1}\right)$$

where g is a known constant, and B is typically a vector of zeroes (Vannucci, 2000).

3.4.1 Random Effects Selection


Model indicator selection ideas have also been applied to the parameters governing ran-
dom effects, so that only genuine sources of heterogeneity are retained (Müller et al., 2013).
Such covariance selection helps ensure sparse structure in the covariance matrix of the
selected (retained) random effects (Frühwirth-Schnatter and Tüchler, 2008). Selection may
relate to the retention or otherwise of univariate random effects – for example, a multilevel
model with a random intercept (as in the turtle survival analysis) or the convolution model
of Besag et al. (1991) for area count data. For multivariate random effects, such as random
cluster intercepts $b_{0j}$ and slopes $\{b_{1j}, \ldots, b_{pj}\}$ in a multilevel analysis

$$y_{ij} = b_{0j} + b_{1j} x_{1ij} + \cdots + b_{pj} x_{pij} + u_{ij},$$



one can consider retaining covariances $\Sigma_{bgh}$ subject to variances in both effects $b_{gj}$ and $b_{hj}$
being retained. Thus, Smith and Kohn (2002) identify zero off-diagonal elements in the
inverse $\Pi_b = \Sigma_b^{-1}$ of the variance-covariance matrix. Alternatively, one may also allow the
exclusion of variance components (diagonal terms in $\Sigma_b$), which necessarily leads to exclusion
of associated covariances.
Selection schemes applicable to both diagonal and off-diagonal elements in covariance
matrices for random effects have been developed by Frühwirth-Schnatter and Tüchler
(2008), Chen and Dunson (2003), Kinney and Dunson (2008), and Cai and Dunson (2006);
for applications, see Yang (2012), Saville et al. (2011), and Harun and Cai (2014). Note that
these methods may be relatively difficult to implement, with Saville and Herring (2009)
finding “these methods are generally time consuming to implement, require special soft-
ware, and rely on subjective choice of hyperparameters.”
Consider a general linear mixed model for nested responses $y_{ij}$ (as in longitudinal data
with repetitions i over subjects j) with means $\mu_{ij}$. These means are linked to a P × 1 vector
of regressors $X_{ij}$ and Q × 1 vector of regressors $Z_{ij}$ via the model

$$g(\mu_{ij}) = X_{ij}'\beta + Z_{ij}' b_j,$$

where g is an appropriate link, $\beta = (\beta_1, \ldots, \beta_P)'$ denotes the central fixed effects, and
$b_j = (b_{j1}, \ldots, b_{jQ})'$ are zero mean random effects with covariance $\Sigma_b = \{\sigma_{bkl}\}$. For continuous
data – and discrete outcomes subject to overdispersion – an observation level residual is
also present, so that

$$g(\mu_{ij}) = X_{ij}'\beta + Z_{ij}' b_j + u_{ij}$$

with $u_{ij}$ usually taken as iid.


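Following Cai and Dunson (2006), one possible Cholesky decomposition of the covariance
matrix for $b_j = (b_{j1}, \ldots, b_{jQ})'$ has the form

$$\Sigma_b = \Lambda \Gamma \Gamma' \Lambda,$$

where $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_Q)$ and Γ is a lower triangular matrix

$$\Gamma = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ \gamma_{21} & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ \gamma_{Q1} & \gamma_{Q2} & \cdots & 1 \end{pmatrix},$$

implying

$$\sigma_{bkl} = \lambda_k \lambda_l \left( \gamma_{r_2 r_1} + \sum_{s=1}^{r_1 - 1} \gamma_{ks}\gamma_{ls} \right),$$

where $r_2 = \max(k, l)$, $r_1 = \min(k, l)$. Then one has

$$g(\mu_{ij}) = X_{ij}'\beta + Z_{ij}' \Lambda \Gamma c_j + u_{ij},$$

where $\{c_{jq} \sim N(0, 1),\; q = 1, \ldots, Q\}$ are uncorrelated standard normal variables.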
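The relation between the factorised form of the covariance matrix and its individual elements can be verified numerically. The Python sketch below uses plain nested lists and hypothetical values for the diagonal scale terms and the unit lower triangular matrix; setting one scale term to zero shows how dropping a variance also removes the associated covariances.

```python
def matmul(a, b):
    """Product of two matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def covariance_from_factors(lam, gamma):
    """Sigma_b = Lambda Gamma Gamma' Lambda with Lambda = diag(lam)."""
    q = len(lam)
    big_lambda = [[lam[i] if i == j else 0.0 for j in range(q)] for i in range(q)]
    return matmul(matmul(big_lambda, matmul(gamma, transpose(gamma))), big_lambda)

def covariance_element(lam, gamma, k, l):
    """Elementwise formula; k and l are 1-based, as in the text."""
    r1, r2 = min(k, l), max(k, l)
    cross = sum(gamma[k - 1][s] * gamma[l - 1][s] for s in range(r1 - 1))
    return lam[k - 1] * lam[l - 1] * (gamma[r2 - 1][r1 - 1] + cross)

# Hypothetical values with Q = 3; the second random effect is excluded (lambda_2 = 0).
lam = [2.0, 0.0, 1.5]
gamma = [[1.0, 0.0, 0.0],
         [0.5, 1.0, 0.0],
         [-0.3, 0.8, 1.0]]
sigma = covariance_from_factors(lam, gamma)
for row in sigma:
    print([round(v, 4) for v in row])
```

The second row and column of the resulting matrix are zero, so the excluded effect contributes neither a variance nor any covariance.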



The selection indicators for retaining variances and covariances are $J_q \sim \mathrm{Bern}(\pi_\Lambda)$, governing
the diagonal terms in Λ, and $H_{kl} \sim \mathrm{Bern}(\pi_\Gamma)$ governing the terms in Γ. Note that
retaining γkl requires not only Hkl = 1, but $J_k = J_l = 1$. If either Jk or Jl is zero, then γkl is necessarily
excluded. Cai and Dunson (2006) suggest positive truncated normal priors with
variance 10 for the diagonal terms λq, namely

$$\lambda_q \sim N(0, 10)\, I(0, \infty) \quad \text{if } J_q = 1,$$

$$\lambda_q = 0 \quad \text{if } J_q = 0.$$

Diffuse priors are not recommended (Cai and Dunson, 2008, p.72), as they may favour the
null model. There may also be a case for interlinked priors for λq and the variances of the
uij effects (if present).
Frühwirth-Schnatter and Tüchler (2008) consider the covariance matrix decomposition

$$\Sigma_b = CC',$$

with C a lower triangular matrix of dimension Q including unknown diagonal terms $C_{qq}$.
To illustrate the covariance selection procedure, a hierarchical linear normal model with
varying cluster regression effects $\beta_j = \beta + b_j$ of dimension Q would be reframed as

$$y_{ij} = X_{ij}(\beta + b_j) + u_{ij} = X_{ij}\beta + X_{ij} C z_j + u_{ij},$$

where $u_{ij} \sim N(0, 1/\tau_u)$, and $z_j = (z_{j1}, z_{j2}, \ldots, z_{jQ})'$ is a Q × 1 vector distributed as $N_Q(0, I)$.


Consider binary indicators Jkl for retention or otherwise of each of the Q(Q + 1)/2 elements
of C. Then

$$C_{kl} \neq 0 \quad \text{if } J_{kl} = 1 \quad (\text{for } k \geq l),$$

$$C_{kl} = 0 \quad \text{if } J_{kl} = 0,$$

and bjk is 0 at a particular iteration if all Ckl in the kth row of C are zero. A possible prior for
the Jkl indicators is Bernoulli with probability πJ, where πJ follows a beta density,

$$\pi_J \sim \mathrm{Be}\left(T_J + 1,\; Q(Q + 1)/2 - T_J + 1\right),$$

based on the total free covariance parameters, and the number TJ of Jkl taking the value 1
(i.e. the number of non-zero elements in C). For Q = 1 in a model where a cluster level ran-
dom intercept is to be tested for inclusion, one would have

$$\mu_{ij} = \beta_0 + X_{ij}\beta + b_j + u_{ij} = \beta_0 + X_{ij}\beta + c z_j + u_{ij},$$

where $z_j \sim N(0, 1)$, and c ≠ 0 if J = 1 and c = 0 if J = 0. The (model averaged) estimate of the
covariance matrix Σb of the bj over r = 1, … , R iterations of a chain is obtained as

$$\hat{\Sigma}_b = \frac{1}{R} \sum_{r=1}^{R} C^{(r)} (C')^{(r)}.$$
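The model averaged covariance estimate is simply the average, over iterations, of the product of each sampled lower triangular factor with its transpose, where excluded elements enter as zeros. A minimal Python sketch follows, using four hypothetical sampled factors for Q = 2 rather than output from an actual chain.

```python
def factor_product(c):
    """Return C C' for a lower triangular matrix C stored as nested lists."""
    q = len(c)
    return [[sum(c[i][k] * c[j][k] for k in range(q)) for j in range(q)] for i in range(q)]

def model_averaged_covariance(c_draws):
    """Average of C C' over the sampled factors (zeros mark excluded elements)."""
    r = len(c_draws)
    q = len(c_draws[0])
    total = [[0.0] * q for _ in range(q)]
    for c in c_draws:
        cc = factor_product(c)
        for i in range(q):
            for j in range(q):
                total[i][j] += cc[i][j] / r
    return total

# Hypothetical draws of C for Q = 2 under different indicator settings.
c_draws = [
    [[2.0, 0.0], [0.5, 1.0]],   # all three elements retained
    [[2.0, 0.0], [0.0, 1.0]],   # off-diagonal element excluded
    [[2.0, 0.0], [0.0, 0.0]],   # second row zero: second effect excluded
    [[0.0, 0.0], [0.0, 1.0]],   # first effect excluded
]
sigma_hat = model_averaged_covariance(c_draws)
print(sigma_hat)
```

Iterations in which an element is excluded pull the corresponding averaged variance or covariance towards zero, in proportion to how often exclusion occurs.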

Methods for selecting the entire random effect term extend to selection of individual ran-
dom effects. For selecting the entire term, consider a spike and slab prior with the spike
component having considerably lower variance:

$$b_i \sim (1 - \delta)\, N(0, r\sigma_b^2) + \delta\, N(0, \sigma_b^2),$$

where r << 1. This extends to selection of individual random effects, for example using
Lasso random effect models (Frühwirth-Schnatter and Wagner, 2010) involving component-specific
indicators δi and a hierarchical prior on the variances. For example, a mixture
of Laplace densities is obtained under

$$b_i \sim (1 - \delta_i)\, N(0, \zeta_{1i}) + \delta_i\, N(0, \zeta_{2i}),$$

$$\zeta_{1i} \sim E(1/(2rQ)),$$

$$\zeta_{2i} \sim E(1/(2Q)),$$

with r set small, so that $\zeta_{1i} \approx 0$. The δi are binary indicators with unknown probability πδ,
the prior proportion of subjects with non-zero random effects. If Q is also unknown, there
may be identification issues under independent priors, as different combinations of πδ and
Q can give similar bi.

Example 3.3 Seeds Data


The widely analysed seeds data from Crowder (1978) may be considered an example
where not all random effects may be necessary. The binomial data { y i , ni } over N = 21
plates refer to germinations yi among ni seeds, and are subject to a binomial logit analy-
sis with random normal plate effects to account for binomial overdispersion. Predictors
are x1i and x2i (respectively seed type and root extract), and their interaction x1ix2i. Thus

$$y_i \sim \mathrm{Bin}(n_i, p_i),$$

$$\mathrm{logit}(p_i) = \beta_1 + b_i + \beta_2 x_{1i} + \beta_3 x_{2i} + \beta_4 x_{1i} x_{2i},$$

$$b_i \sim N(0, \sigma_b^2).$$

Fitting this baseline model, without any random effects selection, suggests not all the
plate effects are needed. Posterior mean probabilities for Pr(bi > 0|y ) are inconclusive,
ranging from 0.34 to 0.63.
As one approach to selection, the method of Frühwirth-Schnatter and Wagner (2010)
seeks to classify units as either close to average (with δi ≈ 0, with bi close to zero, and
effectively unnecessary), above average with δi ≈ 1, and high Pr(bi > 0|y), or below average,
also with δi ≈ 1 but high Pr(bi < 0|y) = 1 − Pr(bi > 0|y). A Laplace mixture density for
the plate effects is used, namely

$$b_i \sim (1 - \delta_i)\, N(0, \zeta_{1i}) + \delta_i\, N(0, \zeta_{2i}),$$

$$\zeta_{1i} \sim E(1/(2rQ)),$$

$$\zeta_{2i} \sim E(1/(2Q)),$$

$$\delta_i \sim \mathrm{Bern}(\omega),$$

$$\omega \sim \mathrm{Beta}(1, 1),$$

with r = 0.00001 and 1/Q ~ Ga(0.5, 0.2275), the latter as suggested by Frühwirth-Schnatter
and Wagner (2010).
Estimated retention probabilities Pr( di = 1|y ) range from 0.48 to 0.70, while the
probabilities of high effects Pr(bi > 0|y ) range from 0.18 to 0.82. The most distinctive
Pr(bi > 0|y ) are for plates 10 and 17, with probabilities Pr(bi < 0|y ) around 0.80, and
plates 4 and 15 with probabilities Pr(bi > 0|y) exceeding 0.80 (cf. Frühwirth-Schnatter
and Wagner, 2010, Table 7). Figure 3.3 plots out the probabilities Pr(bi > 0|y). The probabilities
of high effects Pr(bi > 0|y) are relatively stable when less informative Ga(1, 0.05)
and Ga(1, 0.01) priors are assumed for 1/Q.
We also consider a horseshoe prior for the plate effects, namely

$$b_i \sim N(0, \lambda_i^2 \sigma_b^2),$$

with half Cauchy $C(0, 1)^{+}$ priors on both the λi and $\sigma_b^2$. As mentioned by Carvalho et al.
(2009), $\varphi_i = 1/(1 + \lambda_i^2)$ is interpretable as the amount of weight that the posterior mean
for $b_i$ places on zero. We consider instead $\kappa_i = \lambda_i^2/(1 + \lambda_i^2)$ as an indicator for non-zero
posterior mean $b_i$, analogous to a probability that bi ≠ 0. The estimated κi range from
0.35 to 0.61, with κi greater than 0.5 for plates 4, 10, 15, 16 and 17 (see Figure 3.4). Despite
the extra parameters in this extended model as compared to the baseline model, a
formal comparison shows similar marginal likelihoods for the extended and baseline
models.

FIGURE 3.3
Probabilities of high random effects (horizontal axis: probability, 0.2 to 0.8; vertical axis: frequency).

FIGURE 3.4
Histogram of weights for non-zero effects, horseshoe prior (horizontal axis: weight, 0.35 to 0.60; vertical axis: frequency).

Example 3.4 Hypertension Trial


To illustrate covariance selection for potentially correlated multiple random effects,
this example considers clinical trial data from Brown and Prescott (1999). In this trial,
288 patients are randomly assigned to one of three drug treatments for hypertension,
1 = Carvedilol, 2 = Nifedipine, and 3 = Atenolol. The data consist of a baseline reading BPi
of diastolic blood pressure (these are centred), and four post-treatment blood pressure
readings $y_{it}$ at two-weekly intervals (weeks 3, 5, 7 and 9 after treatment). Some patients
are lost to follow up (there are 1092 observations rather than 4 × 288 = 1152), but for sim-
plicity, their means are modelled for all T = 4 periods.
A baseline analysis includes random patient intercepts, and random slopes on the
blood pressure readings. Additionally, the new treatment Carvedilol is the reference in
the fixed effects comparison vector $\eta = (\eta_1, \eta_2, \eta_3)$, leading to the corner constraint η1 = 0.
Then for patients $i = 1, \ldots, 288$, with treatments $Tr_i$ and waves $t = 1, \ldots, 4$,

$$y_{it} = \beta_1 + b_{1i} + (\beta_2 + b_{2i}) BP_i + \eta_{Tr_i} + u_{it},$$

$$u_{it} \sim N(0, 1/\tau_u),$$

with errors taken to be uncorrelated through time. In line with a commonly adopted
methodology, the $b_{qi}$ are taken to be bivariate normal with mean zero and covariance
Σb. The precision matrix $\Sigma_b^{-1}$ is assumed to be Wishart with 2 degrees of freedom and
identity scale matrix, S. The observation level precision is taken to have a gamma prior,
$\tau_u \sim \mathrm{Ga}(1, 0.001)$.
A two-chain run of 5000 iterations in jagsUI gives posterior means (sd) for $\beta = (\beta_1, \beta_2)$ of
92.6 (0.7) and 0.41 (0.10). Posterior means (sd) for the random effect standard deviations
$\sigma_{bj} = \sqrt{\Sigma_{bjj}}$ of $\{b_{1i}, b_{2i}\}$ are 5.55 (0.42), and 0.64 (0.13). The ratios $b_{ji}/\mathrm{sd}(b_{ji})$ of posterior
means to standard deviations of the varying intercepts and slopes both show variation,
though less so for the slopes. While 42 of 288 ratios $b_{1i}/\mathrm{sd}(b_{1i})$ exceed 2, only 2 of the
corresponding ratios for slopes do. Correlation between the effects does not seem to be
apparent, with $\sigma_{b12}$ having a 95% interval straddling zero.
In a second analysis, covariance selection is considered via the approach of
Frühwirth-Schnatter and Tüchler (2008). Context-based informative priors for the
diagonal elements of C are assumed. Initially C11 ~ Ga(1, 0.2) and C22 ~ Ga(1, 1.5), based
on the posterior means 5.55 and 0.64 for the random effects standard deviations from
the preceding analysis. For the lower diagonal term, a normal prior C21 ~ N(0, 1) is
assumed. These options are preferred to, say, adopting diffuse priors on the Cjk terms,
in order to stabilise the covariance selection analysis. Note that the covariance term
Σ21 is non-zero only when both C11 and C21 are retained. This option gives a posterior
probability of 1 for retaining slope variation, while the posterior probability for inter-
cept variation is 0.98.
However, priors on Cjk that downweight the baseline analysis more lead to lower
retention probabilities. Taking C11 ~ Ga(0.5, 0.1), C22 ~ Ga(0.5, 0.75) and C21 ~ N(0, 2)
gives retention probabilities for varying intercepts and slopes of 1 and 0.93 respectively.
Similarly, taking C11 ~ Ga(0.1, 0.02), C22 ~ Ga(0.1, 0.15) and C21 ~ N(0, 10) gives retention
probabilities for varying intercepts and slopes of 0.65 and 0.98 respectively. This is in
line with a general principle that model selection tends to choose the null model if
diffuse priors are taken on the parameter(s) subject to inclusion or rejection (Cai and
Dunson, 2008).
We also consider an adaptation of the method of Saville and Herring (2009) for continuous
nested outcomes, which involves scaling factors $\exp(\phi_j)$ premultiplying random
effects (e.g. cluster intercepts and slopes) taken to have the same variance as the main
residual term. This allows Bayes factor calculation using Laplace methods. In the current
application, and allowing for correlated slopes and intercepts, one has

$$y_{it} = \beta_1 + \beta_2 BP_i + \exp(\phi_1) b_{1i} + \exp(\phi_2)(\gamma_{12} b_{1i} + b_{2i}) BP_i + \eta_{Tr_i} + u_{it},$$

$$u_{it} \sim N(0, 1/\tau_u),$$

$$b_{1i} \sim N(0, 1/\tau_u),$$

$$b_{2i} \sim N(0, 1/\tau_u),$$

$$\gamma_{12} \sim N(0, 1).$$

For the ϕj, discrete mixture priors are adopted, with one option corresponding to
$\lambda_j = \exp(\phi_j)$ being close to zero, while in the other, the prior on ϕj allows unrestricted
sampling. Here

$$\phi_j \sim (1 - J_j)\, N(-5, 0.1) + J_j\, N(0, 10),$$

$$J_j \sim \mathrm{Bern}(0.5).$$

This provides posterior probabilities $\Pr(J_j = 1)$ of 1 and 0.41 respectively for random
intercepts and slopes. Posterior means (sd) for the random effect standard deviations
$\sigma_{bj} = \sqrt{\Sigma_{bjj}}$ of $\{b_{1i}, b_{2i}\}$ are 6.18 (0.39), and 0.16 (0.17).

3.5 Predictive Methods for Model Choice and Checking


A number of studies have pointed to drawbacks in focusing solely on the marginal likeli-
hood or Bayes factor, as a single global assessment measure of the performance of com-
plex models, and point out computational and inferential difficulties with the Bayes factor
when priors are diffuse, as well as the need to examine fit for individual observations to
make sense of global criteria (e.g. Gelfand, 1996; Johnson, 2004). While formal Bayes meth-
ods can be extended to assessing the fit of single observations (Pettit and Young, 1990),
it may be argued that predictive likelihood methods offer a more flexible approach to
assessing the role of individual observations. In fact, predictive methods have a role both
in model choice and model checking.

3.5.1 Predictive Model Checking and Choice


The formal predictive likelihood approach assumes only part of the observations are used
in estimating a model. On this basis, one may obtain cross-validation predictive densities
(Vehtari and Lampinen, 2002), $p(y_s \mid y_{[s]})$, where $y_s$ denotes a subset of y (the “validation
data”), and $y_{[s]}$ is the complementary “training data” formed by excluding $y_s$ from y. If [i] is
defined to contain all the data $\{y_1, \ldots, y_{i-1}, y_{i+1}, \ldots, y_n\}$ except for a single observation i, then
the densities

$$p(y_i \mid y_{[i]}) = \int p(y_i \mid \theta, y_{[i]})\, p(\theta \mid y_{[i]})\, d\theta,$$

are called conditional predictive ordinates or CPOs (e.g. Chaloner and Brant, 1988; Geisser
and Eddy, 1979), and sampling from them shows what values of yi are likely when a model
is applied to all the data points except the ith, namely to the data y[i]. The predictive dis-
tribution p( yi | y[i] ) can be compared to the actual observation in various ways (Gelfand et
al., 1992).
For example, to assess whether the observation is extreme (not well fitted) in terms of
the model being applied, replicate data $y_{i,rep}$ may be sampled from $p(y_i \mid y_{[i]})$ and their concordance
with the data may be represented by probabilities (Marshall and Spiegelhalter,
2003),

$$\Pr(y_{i,rep} \leq y_i \mid y_{[i]}).$$

These are estimated in practice by counting iterations r where the constraint $y_{i,rep}^{(r)} \leq y_i$
holds. For discrete data, this assessment is based on the probability

$$\Pr(y_{i,rep} < y_i \mid y_{[i]}) + 0.5 \Pr(y_{i,rep} = y_i \mid y_{[i]}).$$
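The counting estimate of these concordance probabilities can be sketched as follows in Python. The replicate values are hypothetical and are assumed to have been sampled already from the cross-validatory predictive density for the case in question; the function name is illustrative.

```python
def concordance_probability(y_obs, y_rep_draws, discrete=False):
    """Share of replicates at or below y_obs; ties count one half for discrete data."""
    below = sum(1 for y in y_rep_draws if y < y_obs)
    equal = sum(1 for y in y_rep_draws if y == y_obs)
    n = len(y_rep_draws)
    if discrete:
        return (below + 0.5 * equal) / n
    return (below + equal) / n

# Hypothetical replicate counts for one case with observed value 1.
y_rep_draws = [0, 1, 1, 2, 3]
p_discrete = concordance_probability(1, y_rep_draws, discrete=True)
p_plain = concordance_probability(1, y_rep_draws)
print(p_discrete, p_plain)
```

Probabilities close to 0 or 1 indicate that the observation lies in a tail of its predictive distribution and so is poorly reproduced by the model.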

Gelfand (1996) recommends assessing concordance between predictions and actual data
by a tally of how many actual observations yi are located within the 95% interval of the
corresponding model prediction yi,rep. For example, if 95% or more of all the observations
are within 95% posterior intervals of the predictions yi,rep, then the model is judged to be
reproducing the observations satisfactorily.
The collection of predictive ordinates $\{p(y_i \mid y_{[i]}),\; i = 1, \ldots, n\}$ is equivalent to the marginal
likelihood p(y) when p(y) is proper, in that each uniquely determines the other. A pseudo
Bayes factor is obtained as a ratio of products of leave one out cross-validation predictive
densities (Vehtari and Lampinen, 2002) under models M1 and M2, namely

$$\mathrm{PsBF}(M_1, M_2) = \prod_{i=1}^{n} \left\{ p(y_i \mid y_{[i]}, M_1) / p(y_i \mid y_{[i]}, M_2) \right\}.$$

In practical data analysis, one typically uses logs of CPO estimates, and totals the log(CPO) to
derive log pseudo marginal likelihoods and log pseudo Bayes factors (Sinha et al., 1999, p.588).
Monte Carlo estimates of conditional predictive ordinates p( yi | y[i] ) may be obtained
without actually omitting cases, so formal cross-validation based on n separate estima-
tions (the 1st omitting case 1, the 2nd omitting case 2, etc) may be approximated by using
a single estimation run. For parameter samples $\{\theta^{(1)}, \ldots, \theta^{(R)}\}$ from an MCMC chain, an
estimator for the CPO, $p(y_i \mid y_{[i]})$, is

$$\frac{1}{p(y_i \mid y_{[i]})} = \frac{1}{R} \sum_{r=1}^{R} \frac{1}{p(y_i \mid \theta^{(r)})},$$

namely the harmonic mean of the likelihoods for each observation (Aslanidou et al.,
1998; Silva et al., 2006; Sinha, 1993). In computing terms, an inverse likelihood needs to
be calculated for each case at each iteration, the posterior means of these inverse likelihoods
obtained, and the CPOs are the inverse of those posterior mean inverse likelihoods.
Denoting the inverse likelihoods as $H_i^{(r)} = 1/p(y_i \mid \theta^{(r)})$, one would in practice take minus
the logarithms of the posterior means of Hi as an estimate of log(CPO)i. The sum over all
cases of these estimates provides a simple estimate of the log pseudo marginal likelihood.
In the turtle data example (Example 3.1), the fixed effects only model 1 has a log pseudo
marginal likelihood of −151.8, while the random intercepts model 2 has a value of −149.6
under a Ga(0.1,0.1) prior for τb. So the log pseudo Bayes factor of 2.2 weakly supports the
random effects option.
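The computing steps just described (inverse likelihoods at each iteration, their posterior means, and minus the logarithm of those means) can be sketched in Python as follows. The likelihood values are hypothetical, and the function names are illustrative assumptions rather than part of any package.

```python
import math

def log_cpo(likelihood_draws):
    """Minus the log of the posterior mean inverse likelihood for one case."""
    mean_inverse = sum(1.0 / p for p in likelihood_draws) / len(likelihood_draws)
    return -math.log(mean_inverse)

def log_pseudo_marginal_likelihood(likelihood_by_case):
    """Sum of log(CPO) estimates over cases."""
    return sum(log_cpo(draws) for draws in likelihood_by_case)

# Hypothetical likelihoods p(y_i | theta^(r)) for two cases over two iterations.
likelihood_by_case = [[0.2, 0.5], [0.1, 0.1]]
lpml = log_pseudo_marginal_likelihood(likelihood_by_case)
print([round(math.exp(log_cpo(d)), 4) for d in likelihood_by_case], round(lpml, 4))
```

Because the estimate averages inverse likelihoods, a few iterations with very small likelihoods can dominate it, so in long runs it may be preferable to accumulate the mean on the log scale.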
Model fit (and hence choice) may also be assessed by comparing samples yrep from the
posterior predictive density based on all observations, though such procedures may be con-
servative since the presence of yi influences the sampled yi,rep (Marshall and Spiegelhalter,
2003). Laud and Ibrahim (1995) and Meyer and Laud (2002) propose model choice based on
minimisation of the criterion

$$C = E\left[c(y_{rep}, y) \mid y\right] = \sum_{i=1}^{n} \left\{ \mathrm{var}(y_{i,rep}) + [y_i - E(y_{i,rep})]^2 \right\},$$

where for y continuous, $c(y_{rep}, y)$ is the predictive error sum of squares

$$c(y_{rep}, y) = (y_{rep} - y)'(y_{rep} - y).$$

The C measure can be obtained from the posterior means and variances of sampled $y_{i,rep}^{(r)}$ or
from the posterior average of $\sum_{i=1}^{n} (y_{i,rep}^{(r)} - y_i)^2$. Carlin and Louis (2000) and Buck and Sahu
(2000) propose related model fit criteria appropriate to both metric and discrete outcomes.
Posterior predictive loss (PPL) model choice criteria allow varying trade-offs in the bal-
ance between bias in predictions and their precision (Gelfand and Ghosh, 1998; Ibrahim et
al., 2001). Thus for k positive and y continuous, one possible criterion has the form

$$\mathrm{PPL}(k) = \sum_{i=1}^{n} \left\{ \mathrm{var}(y_{i,rep}) + \left(\frac{k}{k+1}\right) \left[y_i - E(y_{i,rep})\right]^2 \right\}.$$

This criterion would be compared between models at selected values of k, typical values
being k = 0, k = 1, and k = 10,000, where higher k values put greater stress on accuracy in
predictions, and less on precision. One may consider calibration of such measures, namely
expressing the uncertainty of C or PPL in a variance measure (Laud and Ibrahim, 1995;
Ibrahim et al., 2001). De la Horra and Rodríguez-Bernal (2005) suggest predictive model
choice based on measures of distance between the two densities that can potentially be
used for predicting future observations, namely sampling densities and posterior predic-
tive densities.
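A posterior predictive loss criterion of this kind can be computed directly from sampled replicates. The Python sketch below uses a small set of hypothetical replicate draws; the use of the population variance over draws as the estimate of the predictive variance is an assumption of the sketch.

```python
from statistics import fmean, pvariance

def posterior_predictive_loss(y, y_rep_draws, k):
    """PPL(k), where y_rep_draws[r][i] is the replicate for case i at iteration r."""
    weight = k / (k + 1.0)
    total = 0.0
    for i, y_i in enumerate(y):
        reps = [draw[i] for draw in y_rep_draws]
        total += pvariance(reps) + weight * (y_i - fmean(reps)) ** 2
    return total

# Hypothetical data: two cases, two iterations of replicate sampling.
y = [1.0, 2.0]
y_rep_draws = [[0.0, 2.0], [2.0, 4.0]]
for k in (0, 1, 10000):
    print(k, round(posterior_predictive_loss(y, y_rep_draws, k), 4))

# Posterior average of the squared prediction errors, for comparison with large k.
avg_sq_error = fmean(sum((r - v) ** 2 for r, v in zip(draw, y)) for draw in y_rep_draws)
print(avg_sq_error)
```

At k = 0 only the predictive variances contribute, while for very large k the criterion approaches the C measure, which with this variance estimate equals the posterior average of the summed squared prediction errors.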
To assess poorly fitted cases, the CPO values may be scaled (dividing by their maximum)
and low values for particular observations (e.g. under 0.001) will then show observations
which the model does not reproduce effectively (Weiss, 1994). If there are no very small
scaled CPOs, then a relatively good fit of the model to all data points is suggested, and is
likely to be confirmed by other forms of predictive check. The ratio of extreme percentiles
of the CPOs is useful as an indicator of a good fitting model e.g. the ratio of the 99th to the
1st percentile.
An improved estimate of the CPO may be obtained by weighted resampling from p(θ|y)
(Smith and Gelfand, 1992; Marshall and Spiegelhalter, 2003). Samples θ(r) from p(θ|y) can
be converted (approximately) to samples from $p(\theta \mid y_{[i]})$ by resampling the θ(r) with weights

$$w_i^{(r)} = G(y_i \mid \theta^{(r)}) \Big/ \sum_{r=1}^{R} G(y_i \mid \theta^{(r)}),$$

where

$$G(y_i \mid \theta^{(r)}) = 1/p(y_i \mid \theta^{(r)}),$$

is the inverse likelihood of case i at iteration r. Using the resulting re-sampled values $\theta^{(r)}$,
corresponding predictions $y_{rep}$ can be obtained which are a sample from $p(y_i \mid y_{[i]})$.
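The weighted resampling step can be sketched in Python as follows. The posterior draws and case likelihoods are hypothetical, and the function name is an illustrative assumption; `random.Random.choices` performs the resampling with replacement.

```python
import random

def resample_leave_one_out(theta_draws, case_likelihoods, n_out, rng):
    """Resample posterior draws with weights proportional to 1 / p(y_i | theta)."""
    inverse = [1.0 / p for p in case_likelihoods]
    total = sum(inverse)
    weights = [g / total for g in inverse]
    resampled = rng.choices(theta_draws, weights=weights, k=n_out)
    return resampled, weights

rng = random.Random(5)
# Hypothetical posterior draws and the likelihood of case i under each draw.
theta_draws = [0.1, 0.5, 0.9]
case_likelihoods = [0.5, 0.25, 0.05]
resampled, weights = resample_leave_one_out(theta_draws, case_likelihoods, 1000, rng)
print([round(w, 4) for w in weights])
```

Draws under which the case has low likelihood receive the largest weights, which is how the influence of that case on the posterior is approximately removed.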

3.5.2 Posterior Predictive Model Checks


A range of model checks can also be applied using samples from the posterior predictive
density without actual case omission. To assess predictive performance, samples of repli-
cate data yrep from



$$p(y_{rep} \mid y) = \int p(y_{rep} \mid \theta)\, p(\theta \mid y)\, d\theta,$$

may be taken, and checks made against the data, for example, whether the actual observations
y are within 95% credible intervals of $y_{rep}$. Formally, such samples are obtained
by the method of composition (Chib, 2008), whereby if θ(r) is a draw from p(θ|y), then $y_{rep}^{(r)}$
drawn from $p(y_{rep} \mid \theta^{(r)})$ is a draw from $p(y_{rep} \mid y)$. In a satisfactory model, namely one that
adequately reproduces the data being modelled, predictive concordance (accurate reproduction
of the actual data by replicate data) is at least 95% (Gelfand, 1996, p.158).

Other comparisons of actual and predicted data can be made, for example by a chi-square
comparison (Gosoniu et al., 2006). Johnson (2004) proposes a Bayesian chi-square approach
based on partitioning the cumulative distribution into K bins, usually of equal probability.
Thus, one chooses quantiles

$$0 \equiv a_0 < a_1 < \cdots < a_{K-1} < a_K \equiv 1,$$

with corresponding bin probabilities

$$p_k = a_k - a_{k-1}, \quad k = 1, \ldots, K.$$

Then using model means μi for subject $i \in (1, \ldots, n)$ one obtains the implied cumulative
density $q_i$, say $a_{k^*-1} < q_i < a_{k^*}$, and allocates the fitted point to a bin randomly chosen from
bins $1, \ldots, k^*$.
For example, suppose there are K = 5 equally probable intervals, with $p_k = 0.2$. If $\mu_i = 1.4$
under a Poisson model, the probability assigned to an observation $y_i = 1$ by the cumulative
density function falls in the interval (0.247,0.592), which straddles bins 2 and 3. To allocate
a bin, a U(0.247,0.592) variable is sampled, and the predicted bin is 2 or 3, according to whether the sampled
uniform variable falls within (0.247,0.4), or (0.4, 0.592). The totals so obtained accumulating
over all subjects define predicted counts $m_k(\theta)$ which are compared (at each MCMC iteration)
to actual counts $np_k$, as in formula (3) in Johnson (2004). This provides the Bayesian
chi-square criterion

$$R^B(\theta) = \sum_{k=1}^{K} \frac{[m_k(\theta) - n p_k]^2}{n p_k},$$

where $R^B(\theta)$ is asymptotically $\chi^2_{K-1}$, regardless of the parameter dimension of the model
being fitted [2]. One can assess the posterior probability that $R^B(\theta)$ exceeds the 95th percentile
of the $\chi^2_{K-1}$ density. Poor fit will show in probabilities considerably exceeding 0.05.
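The bin allocation in the worked example, and the chi-square criterion built from the resulting counts, can be sketched in Python. The sketch assumes a Poisson sampling model, since the interval quoted for an observed count of 1 with mean 1.4 matches the Poisson cumulative probabilities; the bin counts used at the end are hypothetical.

```python
import math
import random

def poisson_cdf(y, mu):
    """Pr(Y <= y) for a Poisson variable with mean mu (zero when y < 0)."""
    return sum(math.exp(-mu) * mu ** j / math.factorial(j) for j in range(y + 1))

def allocate_bin(y, mu, cutpoints, rng):
    """Sample u ~ U(F(y - 1), F(y)) and return the 1-based bin containing u."""
    u = rng.uniform(poisson_cdf(y - 1, mu), poisson_cdf(y, mu))
    for k in range(1, len(cutpoints)):
        if u <= cutpoints[k]:
            return k
    return len(cutpoints) - 1

def bayes_chi_square(bin_counts, bin_probs):
    """Sum over bins of (m_k - n p_k)^2 / (n p_k)."""
    n = sum(bin_counts)
    return sum((m - n * p) ** 2 / (n * p) for m, p in zip(bin_counts, bin_probs))

cutpoints = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]      # K = 5 equally probable bins
rng = random.Random(7)
lower, upper = poisson_cdf(0, 1.4), poisson_cdf(1, 1.4)
bins = [allocate_bin(1, 1.4, cutpoints, rng) for _ in range(1000)]
print(round(lower, 3), round(upper, 3), sorted(set(bins)))

# Hypothetical bin counts for n = 100 cases at a single MCMC iteration.
print(round(bayes_chi_square([30, 20, 20, 20, 10], [0.2] * 5), 4))
```

In an application the criterion would be evaluated at every iteration, and the share of iterations exceeding the 95th percentile of the chi-square density with K − 1 degrees of freedom (about 9.49 for K = 5) would be reported.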
Analogues of classical significance tests are obtained using the posterior predictive
p-value (Kato and Hoijtink, 2004). This was originally defined (Rubin, 1984; Meng, 1994) as
the probability that a test statistic T ( yrep ) of future observations yrep is larger than or equal
to the observed value of T(y), given the adopted model M, the response data y, and any
ancillary data x,

$$p_{post} = \Pr[T(y_{rep}, x) \geq T(y, x) \mid y, x, M], \qquad (3.7)$$

where x would typically be predictors measured without error. The probability is calculated
over the posterior predictive distribution of $y_{rep}$ conditional on M and x. By contrast,
the classical p-test integrates over y, as in

$$p_c = \Pr[T(y_{rep}, x) \geq T(y, x) \mid x, M].$$

The formulation of Meng (1994) is extended by Gelman et al. (1996) to apply to discrepancy
criteria D(y,θ) based on data and parameters, as well as to observation-based functions
T(y). So the posterior predictive check is

$$p_{post} = \Pr[D(y_{rep}, x, \theta) \geq D(y, x, \theta) \mid y, x, M],$$



where the probability is taken over the joint posterior distribution of $y_{rep}$ and θ given M
and x. In estimating the corresponding $p_{post}$, the discrepancy is calculated at each MCMC
iteration. This is done both for the observations, giving a value $D(y, x, \theta^{(r)})$, and for the
replicate data $y_{rep}^{(r)}$, sampled from $p(y_{rep} \mid \theta^{(r)}, x)$, resulting in a value $D(y_{rep}^{(r)}, x, \theta^{(r)})$ for each
sampled parameter $\theta^{(r)}$. The proportion of samples where $D(y_{rep}^{(r)}, x, \theta^{(r)})$ exceeds $D(y, x, \theta^{(r)})$
is then the Monte Carlo estimate of $p_{post}$. For example, Kato and Hoijtink (2004) show good
performance of $p_{post}$ using both statistics T and discrepancies D in a normal multilevel
model context with subjects $i = 1, \ldots, m_j$ in clusters $j = 1, \ldots, J$

$$y_{ij} = \beta_{1j} + \beta_{2j} x_{ij} + u_{ij},$$

where $u_{ij} \sim N(0, \sigma_j^2)$. The hypotheses considered (i.e. in the form of reduced models) are
$\beta_{kj} = \beta_k$ and $\sigma_j^2 = \sigma^2$.
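The Monte Carlo estimate of a posterior predictive p-value can be sketched in Python. The setting is a deliberately simple assumption, not the multilevel model of Kato and Hoijtink: a normal mean with unit variance and a flat prior, a sum of squares discrepancy, and toy data containing one gross outlier.

```python
import random

def posterior_predictive_pvalue(y, theta_draws, simulate, discrepancy, rng):
    """Share of iterations where the replicate discrepancy is at least the observed one."""
    exceed = 0
    for theta in theta_draws:
        y_rep = simulate(theta, len(y), rng)
        if discrepancy(y_rep, theta) >= discrepancy(y, theta):
            exceed += 1
    return exceed / len(theta_draws)

def simulate_normal(theta, n, rng):
    """Replicate data from N(theta, 1)."""
    return [rng.gauss(theta, 1.0) for _ in range(n)]

def sum_of_squares(y, theta):
    """Discrepancy D(y, theta): squared deviations from the mean parameter."""
    return sum((v - theta) ** 2 for v in y)

rng = random.Random(11)
y = [0.0, 0.0, 0.0, 0.0, 10.0]          # toy data with one gross outlier
n = len(y)
ybar = sum(y) / n
# Posterior of the mean under a flat prior and unit variance: N(ybar, 1/n).
theta_draws = [rng.gauss(ybar, (1.0 / n) ** 0.5) for _ in range(2000)]
p_post = posterior_predictive_pvalue(y, theta_draws, simulate_normal, sum_of_squares, rng)
print(p_post)
```

Here the observed discrepancy is far larger than anything the unit variance model can generate, so the estimated p-value is close to zero and signals misfit.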
Posterior predictive checks may be used to assess model assumptions. For instance, in
multilevel and general linear mixed models, assumptions of normality regarding ran-
dom effects are often made by default, and a posterior check against such assumptions
is sensible. A number of classical tests have been proposed such as the Shapiro–Wilk
W statistic (Royston, 1993) and the Jarque–Bera test (Bera and Jarque, 1980). These sta-
tistics can be derived at each iteration for actual and replicate data, and the comparison
$D(y_{rep}, x, \theta) \geq D(y, x, \theta)$ applied over MCMC iterations to provide a posterior predictive
p-value.

3.5.3 Mixed Predictive Checks


The posterior predictive check makes double use of the data and so may be conservative
as a test (Bayarri and Berger, 1999; Bayarri and Berger, 2000), since the observation yi has
a strong influence on the replicate yi,rep. For example, Sinharay and Stern (2003) show pos-
terior predictive checks may fail to detect departures from normality in random effects
models. However, also in the context of random effect and hierarchical models, Marshall
and Spiegelhalter (2003) and Marshall and Spiegelhalter (2007) mention a mixed predic-
tive scheme, which uses a predictive prior distribution p( yi ,rep |bi ,rep ,q ) for a new set of
random effects. The associated model check is called a mixed predictive p-test, whereas
the (conservative) option of sampling from p( yi ,rep |bi ,q ) results in what Marshall and
Spiegelhalter (2007, p.424) term full-data posterior predictive p-values. Mixed predictive
replicates for each case seek to reduce dependence on the observation for that case, as the
replicate data is sampled conditional only on global hyperparameters. Therefore, mixed
predictive p-values are expected to be less conservative than posterior predictive p-values.
Let b denote random effects for cases $i = 1, \ldots, n$, or for clusters j in which individual cases
are nested. Generating a replicate $y_{i,rep}$ for the ith case under the mixed scheme involves
sampling $(\theta, b)$ from the usual posterior $p(\theta, b|y)$ conditional on all observations, but the
sampled b are ignored and replicate $b_{rep}$ values are drawn instead. A fully cross-validatory
method would require that $b_{i,rep}$ be obtained by sampling from $p(b_{i,rep}|y_{[i]})$, or $b_{j,rep}$
sampled from $p(b_{j,rep}|y_{[j]})$. In fact, Green et al. (2009) compare mixed predictive assessment
schemes with full cross-validation based on omitting single observations.
As full cross-validation is computationally demanding when using MCMC methods,
approximate cross-validatory procedures are proposed by Marshall and Spiegelhalter
(2007), in which the replicate random effect is sampled from $p(b_{i,rep}|y)$, followed by
sampling $y_{i,rep}$ from $p(y_{i,rep}|b_{i,rep},\theta)$. This is called “ghosting” by Li et al. (2017). A discrepancy
measure $T_{obs}$ based on the observed data is then compared to its reference distribution

$$p(T|y) = \int p(T|b)\, p_M(b|y)\, db,$$

where $p_M(b|y) = \int p(b|\theta)\, p(\theta|y)\, d\theta$ may be termed the ‘predictive prior’ for b (Marshall
and Spiegelhalter, 2007, p.413). This contrasts with more conservative posterior predictive
checks based on replicate sampling from $p(y_{i,rep}|b_i,\theta)$, under which $T_{obs}$ is compared to the
reference distribution

$$p(T|y) = \int p(T|b)\, p(b|y)\, db.$$

Marshall and Spiegelhalter (2003) confirm that a mixed predictive procedure reduces
the conservatism of posterior predictive checks in relatively simple random effects models,
and is more effective in reproducing $p(y_i|y_{[i]})$ than weighted importance sampling.
However, this procedure may be influenced by the informativeness of the priors on the
hyperparameters θ, and also by the presence of multiple random effects.
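
The difference between the two replicate schemes can be sketched with a small Gibbs sampler. The setting below is an assumption made for illustration only: one observation per cluster, $y_j \sim N(b_j, \sigma^2)$ with $\sigma^2 = 1$ taken as known, $b_j \sim N(\mu, \tau^2)$, a flat prior on $\mu$ and an inverse gamma IG(1, 1) prior on $\tau^2$. The data vector, in which cluster 20 is a deliberate outlier, is hypothetical.

```r
set.seed(2)

# Illustrative data: one observation per cluster; cluster 20 is an outlier
y <- c(seq(-2, 2, length.out = 19), 8)
J <- length(y)
sigma2 <- 1                      # within-cluster variance, assumed known

n_iter <- 6000
burn   <- 1000
mu   <- mean(y)                  # starting values
tau2 <- var(y)
ge_post <- ge_mix <- numeric(J)  # counts of replicates >= observed

for (t in seq_len(n_iter)) {
  # Gibbs updates for y_j ~ N(b_j, sigma2), b_j ~ N(mu, tau2)
  prec <- 1 / sigma2 + 1 / tau2
  b    <- rnorm(J, (y / sigma2 + mu / tau2) / prec, sqrt(1 / prec))
  mu   <- rnorm(1, mean(b), sqrt(tau2 / J))              # flat prior on mu
  tau2 <- 1 / rgamma(1, 1 + J / 2,
                     rate = 1 + sum((b - mu)^2) / 2)     # IG(1, 1) prior on tau2

  if (t > burn) {
    # posterior predictive: replicate conditions on the sampled b_j
    y_rep_post <- rnorm(J, b, sqrt(sigma2))
    # mixed predictive: sampled b_j discarded, new b_rep drawn from N(mu, tau2)
    b_rep     <- rnorm(J, mu, sqrt(tau2))
    y_rep_mix <- rnorm(J, b_rep, sqrt(sigma2))
    ge_post <- ge_post + (y_rep_post >= y)
    ge_mix  <- ge_mix  + (y_rep_mix  >= y)
  }
}

p_post <- ge_post / (n_iter - burn)   # upper-tail posterior predictive p-values
p_mix  <- ge_mix  / (n_iter - burn)   # upper-tail mixed predictive p-values
round(cbind(p_post, p_mix)[c(10, 20), ], 3)
```

For the outlying cluster, the posterior predictive replicate is centred on a random effect that has already adapted to the observation, so its p-value is less extreme than the mixed predictive one.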
Marshall and Spiegelhalter (2007) also consider full cross-validatory mixed predictive
checks to assess conflict in evidence regarding random effects b between the likelihood
and the second stage prior; see also Bayarri and Castellanos (2007). Consider nested data
$\{y_{ij},\ i = 1, \ldots, n_j;\ j = 1, \ldots, J\}$ with likelihood

$$y_{ij} \sim N(b_j, \sigma^2),$$

and second stage prior on random cluster effects

$$b_j \sim N(\mu, \tau^2).$$

Under a cross-validatory approach, the discrepancy measure $T_j^{obs}$ for cluster j would be
based on the remaining data $y_{[j]}$ with cluster j excluded, and its reference distribution is
then

$$p(T_j^{rep}|y_{[j]}) = \int p(T_j^{rep}|b_j,\sigma^2)\, p(b_j|\mu,\tau^2)\, p(\sigma^2,\tau^2,\mu|y_{[j]})\, db_j\, d\sigma^2\, d\tau^2\, d\mu.$$

Marshall and Spiegelhalter (2007) also propose a conflict p-test based on comparing a
predictive prior replicate $b_{j,rep}|y_{[j]}$ with a fixed effect estimate or “likelihood replicate” $b_{j,fix}$ for
$b_j$ based only on the data. The latter is obtained using a highly diffuse fixed effects prior
on the $b_j$, rather than a borrowing strength hierarchical prior, for example, $b_j \sim Be(1, 1)$ or
$b_j \sim Be(0.5, 0.5)$. Defining

$$b_{j,diff} = b_{j,rep} - b_{j,fix},$$

the conflict p-value for cluster j is obtained as

$$p_{j,conf} = \Pr(b_{j,diff} \leq 0|y).$$
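
Given sampled values of the two replicates, the conflict p-value is simply a tail proportion. The sketch below assumes a binomial outcome for cluster j (for which a Be(1, 1) fixed effects prior gives an exact beta posterior). The cluster data and the Be(2, 8) distribution standing in for draws from $p(b_{j,rep}|y_{[j]})$ are hypothetical; in practice the predictive prior replicates would come from the hierarchical model fitted without cluster j.

```r
set.seed(3)

# Conflict p-value: proportion of draws with b_rep - b_fix <= 0
conflict_p <- function(b_rep, b_fix) mean(b_rep - b_fix <= 0)

n_draws <- 10000
y_j <- 18                         # hypothetical successes in cluster j
n_j <- 20                         # hypothetical number of trials

# "likelihood replicate": exact posterior of b_j under a Be(1, 1) prior
b_fix <- rbeta(n_draws, 1 + y_j, 1 + n_j - y_j)

# stand-in for predictive prior replicates b_{j,rep} | y_[j]
# (hypothetical Be(2, 8); not estimated from any data)
b_rep <- rbeta(n_draws, 2, 8)

p_conf <- conflict_p(b_rep, b_fix)
p_conf
```

A value of `p_conf` near 0 or 1 signals conflict between what the remaining clusters predict for cluster j and what its own data indicate.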

This can be compared to a mixed predictive p-value, based on sampling $y_{j,rep}$ from a
cross-validatory model using only the remaining cases $y_{[j]}$ to estimate parameters, and
then comparing $y_{j,rep}$, or some function $T_j^{rep} = T(y_{j,rep})$, with $y_{j,obs}$ or with $T_j^{obs} = T(y_{j,obs})$.
Thus, depending on the substantive application, one may define lower or upper tail mixed
p-values

$$p_{j,mix} = \Pr(T_j^{rep} \leq T_j^{obs}|y),$$

or

$$p_{j,mix} = \Pr(T_j^{rep} \geq T_j^{obs}|y),$$

with the latter being relevant in (say) assessing outliers in hospital mortality comparisons.
If T(y) = y, and y is a count, then a mid p-value is relevant instead, with the upper tail test
being

$$p_{j,mix} = \Pr(y_{j,rep} > y_{j,obs}|y_{[j]}) + 0.5\,\Pr(y_{j,rep} = y_{j,obs}|y_{[j]}). \qquad (3.8)$$
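
As a small illustration of (3.8), the mid p-value can be estimated from a vector of replicate counts. The function name `mid_p_upper` and the Poisson replicates with mean 5 are assumptions made for the sketch; in an application the replicates would be the sampled $y_{j,rep}$.

```r
set.seed(4)

# Upper-tail mid p-value: Pr(y_rep > y_obs) + 0.5 * Pr(y_rep = y_obs)
mid_p_upper <- function(y_rep, y_obs) {
  mean(y_rep > y_obs) + 0.5 * mean(y_rep == y_obs)
}

y_rep <- rpois(10000, lambda = 5)   # hypothetical replicate counts
y_obs <- 9                          # hypothetical observed count

p_mid   <- mid_p_upper(y_rep, y_obs)
p_upper <- mean(y_rep >= y_obs)     # conventional upper tail, ties counted fully
p_strict <- mean(y_rep > y_obs)     # ties excluded
c(strict = p_strict, mid = p_mid, upper = p_upper)
```

The mid p-value always lies between the strict and the conventional upper-tail probabilities, which matters most when the count distribution places substantial mass on the observed value.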

Li et al. (2016, 2017) combine the principle of mixed predictive tests with that of importance
sampling, with the intention of further correcting for optimism present in standard
posterior predictive tests. Consider a particular MCMC iteration t. Sub-samples of random
effects are obtained conditional on hyperparameters $\theta^{(t)}$ and the random effects $b^{(t)}$. One set
of sub-samples $b^A_{j,s,rep}$ (for observations j and sub-samples $s = 1, \ldots, S$) is obtained, along
with the corresponding $y_{j,s,rep}$ conditional on $b^A_{j,s,rep}$. One then obtains the corresponding
$p_{j,s,mix}$ as per (3.8), assuming the data are binomial or Poisson. Integrated importance
weights for the $p_{j,mix}$ are based on an independent set of replicate random effects, say $b^B_{j,s,rep}$.
Equations (38) to (40) in Li et al. (2017) set out the procedure more completely. Integrated
importance weights are obtained as

$$W_j^{(t)} = 1 \Big/ \left( \sum_{s} p(y_j|\theta^{(t)}, b^B_{j,s,rep}) / S \right),$$

and can be used to provide WAIC estimates (denoted iWAIC); see equations (26)–(27) in Li
et al. (2016).
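
The weighting step can be sketched in R for a single observation, following the structure of the code in Computational Note [3]: the cross-validatory probability is estimated as the weighted average over iterations of the sub-sample mean of the mid p-value indicators, with weights equal to the inverse of the sub-sample mean likelihood. The function name `iis_cv_pvalue` and the toy input matrices are hypothetical.

```r
# a_mat[t, s]  : mid p-value indicator from the set-A replicate at iteration t,
#                sub-sample s (values 0, 0.5 or 1)
# lik_mat[t, s]: likelihood of the observed y under the set-B replicate random
#                effect at iteration t, sub-sample s
iis_cv_pvalue <- function(a_mat, lik_mat) {
  w <- 1 / rowMeans(lik_mat)            # integrated importance weights W^(t)
  sum(w * rowMeans(a_mat)) / sum(w)     # weighted average over iterations
}

# Toy inputs: 3 iterations, 2 sub-samples
a_mat   <- matrix(c(1, 0.5,
                    0, 0,
                    1, 1), nrow = 3, byrow = TRUE)
lik_mat <- matrix(c(0.1, 0.1,
                    0.5, 0.5,
                    0.5, 0.5), nrow = 3, byrow = TRUE)

p_iis <- iis_cv_pvalue(a_mat, lik_mat)
p_iis
```

Iterations at which the observed value has low likelihood under the replicate random effects receive larger weights, which is how the estimate moves away from the full-data posterior towards the leave-one-out posterior.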

Example 3.5 Seeds Data, Predictive Assessment of Logit-Binomial Model


This example considers case specific predictive assessment of the standard binomial
logit model for the seeds data (the baseline model in Example 3.3). A first
analysis obtains estimated log(CPO) values for each plate, and mixed predictive
checks, as in Marshall and Spiegelhalter (2007). Mixed predictions are assessed by
sampling replicate random effects $\{b_{i,rep}\}$, the corresponding $\{y_{i,rep}|b_{i,rep}\}$, and then
deriving p-values

$$p_{i,mix} = \Pr(y_{i,rep} > y_i|y) + 0.5\,\Pr(y_{i,rep} = y_i|y).$$

Note that replicating this calculation in jagsUI or R2OpenBUGS needs to account for the
fact that the step function is a greater than or equals calculation.
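
The point about the step function can be checked outside BUGS. In the sketch below, `step_bugs` and `equals_bugs` are hypothetical R stand-ins for the BUGS/JAGS functions step() (equal to 1 when its argument is greater than or equal to zero) and equals(); the replicate vector is a toy example.

```r
# R stand-ins for the BUGS/JAGS functions
step_bugs   <- function(x) as.numeric(x >= 0)     # 1 if x >= 0, else 0
equals_bugs <- function(a, b) as.numeric(a == b)  # 1 if a == b, else 0

y_obs <- 3
y_rep <- c(1, 2, 3, 3, 4)                         # toy replicate counts

# step() counts ties in full, so half the tie indicator is subtracted
a <- step_bugs(y_rep - y_obs) - 0.5 * equals_bugs(y_rep, y_obs)

p_from_step <- mean(a)
p_direct    <- mean(y_rep > y_obs) + 0.5 * mean(y_rep == y_obs)
c(p_from_step, p_direct)
```

Averaging the monitored node `a` over iterations therefore reproduces the mid p-value, whereas averaging step() alone would give the conventional upper-tail probability with ties counted in full.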
We also include posterior predictive checks (Section 3.5.2) based on comparing deviances
for actual data and replicate data. Replicate data can be drawn from the model in
an unmodified form (which may provide conservative posterior checks), or with replicates
obtained using the mixed sampling approach.
With estimation using jagsUI, the mixed p-tests $p_{i,mix}$ and log(CPO) statistics are found
to imply similar inferences regarding less well-fitted cases. The lowest $p_{i,mix}$ is for plate
4, which also has the second lowest log(CPO), while the second highest $p_{i,mix}$ is for plate
10, which has the lowest log(CPO). Regarding the posterior predictive checks, as in (3.7),
taking replicates from the original model leads to a relatively low probability of 0.10,
while using mixed replicates provides a probability of 0.08. Both these indicate possible
model failure.
For this small sample, it is relatively straightforward to carry out a full (leave one out)
cross-validation based on omitting each observation in turn. This shows plates 4, 15,
and 20 as underpredicted (low probabilities that replicates exceed actual), and plates 10
and 17 as overpredicted.
Integrated importance cross-validation probabilities based on S = 10 subsamples are
also obtained; see the code used in [3]. These are very similar to the full cross-validation
probabilities (see Table 3.1, which highlights plates with full cross-validation probabilities
over 0.95 or under 0.05). We also use subsampling to obtain estimated iWAIC, following
the notation of Li et al. (2016). Thus, the total iWAIC is 121.9, as compared to a
LOO-IC of 121.6 and a WAIC of 119.8. Casewise iWAIC confirm the poor fit to plates
4 and 10. For this example, log(CPO) and casewise iWAIC statistics correlate closely,
with a correlation of 0.9976.
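
The summaries quoted above can be recomputed from the rounded values reported in Table 3.1. The sketch below simply re-enters three columns of that table (full cross-validation probability, log(CPO) and casewise iWAIC); the data frame and column names are hypothetical, and results from rounded values may differ slightly from those based on the unrounded MCMC output.

```r
seeds_cv <- data.frame(
  plate   = 1:21,
  full_cv = c(0.922, 0.454, 0.946, 0.013, 0.176, 0.242, 0.258, 0.053, 0.811,
              0.978, 0.267, 0.194, 0.772, 0.902, 0.034, 0.937, 0.974, 0.456,
              0.630, 0.049, 0.692),
  log_cpo = c(-3.15, -2.58, -3.97, -4.86, -2.66, -1.28, -2.70, -3.62, -2.72,
              -5.03, -1.64, -2.11, -2.32, -2.87, -4.08, -2.08, -3.52, -2.28,
              -2.13, -3.70, -1.42),
  iwaic   = c(-3.19, -2.51, -3.98, -4.88, -2.65, -1.29, -2.76, -3.75, -2.75,
              -4.79, -1.66, -2.15, -2.37, -2.84, -4.12, -2.06, -3.48, -2.35,
              -2.18, -3.79, -1.41)
)

# plates with full cross-validation probabilities over 0.95 or under 0.05
under <- seeds_cv$plate[seeds_cv$full_cv < 0.05]   # underpredicted
over  <- seeds_cv$plate[seeds_cv$full_cv > 0.95]   # overpredicted

# total iWAIC on the deviance scale, and correlation with log(CPO)
total_iwaic <- -2 * sum(seeds_cv$iwaic)
r_cpo_iwaic <- cor(seeds_cv$log_cpo, seeds_cv$iwaic)

list(under = under, over = over,
     total_iwaic = round(total_iwaic, 1), correlation = r_cpo_iwaic)
```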
TABLE 3.1
Seeds Data. Comparing Cross-Validation Probabilities, log(CPO), and Casewise iWAIC

        Mixed             IIS               Full                        Casewise
Plate   Cross-Validation  Cross-Validation  Cross-Validation  log(CPO)  iWAIC
1       0.883             0.925             0.922             −3.15     −3.19
2       0.473             0.460             0.454             −2.58     −2.51
3       0.894             0.949             0.946             −3.97     −3.98
4       0.050             0.015             0.013             −4.86     −4.88
5       0.226             0.178             0.176             −2.66     −2.65
6       0.250             0.240             0.242             −1.28     −1.29
7       0.312             0.266             0.258             −2.70     −2.76
8       0.117             0.058             0.053             −3.62     −3.75
9       0.735             0.808             0.811             −2.72     −2.75
10      0.925             0.980             0.978             −5.03     −4.79
11      0.271             0.267             0.267             −1.64     −1.66
12      0.263             0.188             0.194             −2.11     −2.15
13      0.726             0.769             0.772             −2.32     −2.37
14      0.856             0.899             0.902             −2.87     −2.84
15      0.128             0.028             0.034             −4.08     −4.12
16      0.934             0.936             0.937             −2.08     −2.06
17      0.959             0.973             0.974             −3.52     −3.48
18      0.442             0.451             0.456             −2.28     −2.35
19      0.606             0.625             0.630             −2.13     −2.18
20      0.130             0.047             0.049             −3.70     −3.79
21      0.681             0.692             0.692             −1.42     −1.41

In an attempt to improve fit, we replace the single intercept by a three-group discrete
mixture intercept. Thus

$$y_i \sim Bin(n_i, \pi_i),$$

$$\text{logit}(\pi_i) = \beta_{1,G_i} + b_i + \beta_2 x_{1i} + \beta_3 x_{2i} + \beta_4 x_{1i} x_{2i},$$

$$G_i \sim \text{Categorical}(\phi_{1:3}),$$

$$\phi \sim \text{Dirichlet}(5, 5, 5),$$

$$b_i \sim N(0, \sigma_b^2).$$

The posterior predictive checks, whether or not based on mixed replicates, are now
satisfactory, both around 0.48. There are now no casewise predictive exceedance
probabilities above 0.95 or under 0.05. The LOO-IC and WAIC now stand at 116.6 and
110.4 respectively.

3.6 Computational Notes

[1] The code for the bridgesampling vignette is as follows:

   require(jagsUI)
  # generate data
   set.seed(12345)
   mu = 0
   tau2 = 0.5
   sigma2 = 1
  # number of observations
   n = 20
   theta = rnorm(n, mu, sqrt(tau2))
   y = rnorm(n, theta, sqrt(sigma2))
   # define w according to length, T=30, of bridge-sampling schedule
   T=30
   T1=T+1
   D= list(T=T,T1=T1, w=matrix(1,n,T1),n=n,y=y, path.pow=4)
   # Model 1, mu=0
   cat("model {for (h in 1:n) {for (s in 1:T1) {
   L.tem[h,s] <- pow(L[h,s],q[s])
   w[h,s] ~dunif(a1[h,s],b1[h,s])
   a1[h,s] <- -1/L.tem[h,s]
   b1[h,s] <- 1/L.tem[h,s]
   LL[h,s] <- log(L[h,s])
   # log-likelihood
   log(L[h,s]) <- 0.5*log(phi[s]/(1+phi[s]))-0.919-0.5*phi[s]/(1+phi[s])*y[h]*y[h]}}
   # precision parameters
   for (s in 1:T1) {phi[s] ~dgamma(1,1)}
   phi.est <- phi[T1]
   # path sampling calculations
   for (s in 1:T1) {q[s] <- pow(a[s],path.pow)
   expLL[s] <- sum(LL[1:n,s])}
   a[1] <- 0.00001
   for (s in 1:T) {a[s+1] <- s/T
   mc[s] <- (q[s+1]-q[s])*(expLL[s+1]+expLL[s])*0.5}
   logML <- sum(mc[])}
   ", file="model1.jag")
   inits1 = list(phi=rep(1,T1))
   inits2 = list(phi=rep(2,T1))
   inits=list(inits1,inits2)
   pars = c("logML","phi.est")
   R1 = autojags(D, inits, pars, model.file="model1.jag", n.chains=2,
   iter.increment=1000, n.burnin=100, Rhat.limit=1.025, max.iter=5000, seed=1234)
   R1$summary
   # Model 2, mu unknown
   cat("model {for (h in 1:n) {for (s in 1:T1) {
   L.tem[h,s] <- pow(L[h,s],q[s])
   w[h,s] ~dunif(a1[h,s],b1[h,s])
   a1[h,s] <- -1/L.tem[h,s]
   b1[h,s] <- 1/L.tem[h,s]
   LL[h,s] <- log(L[h,s])
   # log-likelihood
   log(L[h,s]) <- 0.5*log(phi[s]/(1+phi[s]))-0.919-0.5*phi[s]/(1+phi[s])*(y[h]-mu[s])*(y[h]-mu[s])}}
   # mean and precision parameters
   for (s in 1:T1) {phi[s] ~dgamma(1,1)
   mu[s] ~dnorm(0,1)}
   phi.est <- phi[T1]
   mu.est <- mu[T1]
   # path sampling calculations
   for (s in 1:T1) {q[s] <- pow(a[s],path.pow)
   expLL[s] <- sum(LL[1:n,s])}
   a[1] <- 0.00001
   for (s in 1:T) {a[s+1] <- s/T
   mc[s] <- (q[s+1]-q[s])*(expLL[s+1]+expLL[s])*0.5}
   logML <- sum(mc[])}
   ", file="model2.jag")
   inits1 = list(phi=rep(1,T1),mu=rep(0,T1))
   inits2 = list(phi=rep(2,T1),mu=rep(0,T1))
   inits=list(inits1,inits2)
   pars = c("logML","phi.est","mu.est")
   R2 = autojags(D, inits, pars, model.file="model2.jag", n.chains=2,
   iter.increment=1000, n.burnin=100, Rhat.limit=1.025, max.iter=5000, seed=1234)
   R2$summary
   # Marginal Likelihoods and Bayes Factor
   ML=c()
   ML[1]=R1$summary[1]
   ML[2]=R2$summary[1]
   BF12=exp(ML[1]-ML[2])
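
The trapezoidal path-sampling calculation used in the code above can be checked in plain R on a model with a known marginal likelihood. The sketch below is an illustration under assumptions that differ from the two models above: $y_i \sim N(\mu, 1)$ with prior $\mu \sim N(0, 1)$, so that the power posterior at each temperature is normal and can be sampled directly, and the exact log marginal likelihood is available in closed form. It reuses the schedule $q_s = a_s^4$ with T = 30; the simulated data and all variable names are hypothetical.

```r
set.seed(5)
n <- 20
y <- rnorm(n, 0.3, 1)              # illustrative data
S_y <- sum(y)

T_grid <- 30
temp <- ((0:T_grid) / T_grid)^4    # temperatures q_s = a_s^4 on [0, 1]

# Monte Carlo estimate of E[log L(mu)] under the power posterior at temperature q,
# which is N(q * S_y / (1 + q * n), 1 / (1 + q * n)) for this conjugate model
expected_loglik <- function(q, n_draws = 5000) {
  prec <- 1 + q * n
  mu <- rnorm(n_draws, q * S_y / prec, sqrt(1 / prec))
  mean(sapply(mu, function(m) sum(dnorm(y, m, 1, log = TRUE))))
}
ell <- sapply(temp, expected_loglik)

# trapezoidal rule over the temperature schedule
logML_path <- sum(diff(temp) * (head(ell, -1) + tail(ell, -1)) / 2)

# exact log marginal likelihood: y ~ N_n(0, I + 11')
logML_exact <- -n / 2 * log(2 * pi) - 0.5 * log(1 + n) -
  0.5 * (sum(y^2) - S_y^2 / (1 + n))

c(path = logML_path, exact = logML_exact)
```

The two values should agree closely. With two such estimates for competing models, the Bayes factor follows as the exponential of their difference, as in the final line of the code above.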

[2] The Bayesian chi-square method is illustrated using model 5 for the Scottish lip
cancer incidence, as considered in Johnson (2004, pp.2374–2376). Thus, with $E_i$
denoting expected incidence counts,

$$y_i \sim Po(E_i \exp(\rho_i)),$$

where the ρi are modelled as diffuse fixed effects. The BUGS code is as follows:
   model {for (i in 1:n) {y[i] ~dpois(mu[i])
   mu[i] <- E[i]*exp(rho[i])
   rho[i] ~dnorm(0,0.001)
   # Poisson probs (up to maximum count 50), ym[i]=y[i]-1 unless y=0
   for (j in 1:51) {cdf[i,j] <- exp(-mu[i])*pow(mu[i],j-1)/exp(logfact(j-1))*step(y[i]-j+1)}
   for (j in 1:51) {cdfm[i,j] <- exp(-mu[i])*pow(mu[i],j-1)/exp(logfact(j-1))*step(ym[i]-j+1)}
   # cdf probs for y[i] and (y[i]-1)
   t[i] <- sum(cdf[i,1:51])
   tm[i] <- sum(cdfm[i,1:51])
   # lower limit of interval from which bin randomly chosen
   s[i] <- (1-equals(y[i],0))*tm[i]
   u[i] ~dunif(0,1)
   a[i] <- s[i]+u[i]*(t[i]-s[i])
   ybin[i,1] <- step(0.2-a[i])
   ybin[i,5] <- step(a[i]-0.8)
   ybin[i,2] <- step(a[i]-0.2)*step(0.4-a[i])
   ybin[i,3] <- step(a[i]-0.4)*step(0.6-a[i])
   ybin[i,4] <- step(a[i]-0.6)*step(0.8-a[i])}
   for (k in 1:K) {mhat[k] <- sum(ybin[,k])
   m[k] <- n*p[k]
   r.B[k] <- pow(mhat[k]-m[k],2)/m[k]}
   # compare R.B with 95th quantile of the chi2 distribution for K-1 df
   R.B <- sum(r.B[])
   P <- step(R.B-9.49)}

From iterations 5,000–100,000 of a single chain run, the probability that $R^B$ exceeds the
95% point of a $\chi^2_4$ density is 0.157, and the posterior means of the number (mhat[] in the
code) of the n = 56 counts assigned to the five bins are (8.6, 9.9, 10.9, 12.1, 14.5).
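
The binning logic of the BUGS code can be mirrored in R for a single parameter draw, using `ppois` in place of the explicit sum of Poisson probabilities. The sketch below is an illustration only: the function name `bayes_chisq` is hypothetical, equal bin probabilities 1/K are assumed (matching the cut points 0.2, 0.4, 0.6, 0.8 in the code), and the check uses simulated counts with known means rather than the lip cancer data or posterior draws.

```r
set.seed(6)

# R^B for one draw of the Poisson means mu: randomized cdf values are binned
# into K equal-probability cells and compared with the expected counts n / K
bayes_chisq <- function(y, mu, K = 5) {
  n <- length(y)
  lower <- ppois(y - 1, mu)        # F(y - 1); equals 0 when y = 0
  upper <- ppois(y, mu)            # F(y)
  a <- lower + runif(n) * (upper - lower)
  bins <- findInterval(a, seq(0, 1, length.out = K + 1), rightmost.closed = TRUE)
  m_hat <- tabulate(bins, nbins = K)
  m <- n / K
  sum((m_hat - m)^2 / m)
}

# Check with data generated from known (hypothetical) means: R^B should then
# behave approximately as chi-square with K - 1 = 4 degrees of freedom
mu_true <- rep(c(2, 5, 10, 20), times = 14)      # n = 56 counts
R_B <- replicate(2000, bayes_chisq(rpois(56, mu_true), mu_true))
exceed_rate <- mean(R_B > qchisq(0.95, df = 4))
exceed_rate
```

When the model is correct, the exceedance rate should be close to 0.05, which gives a reference point for interpreting the value of 0.157 reported above.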

[3] The code used for the IIS cross-validation probability estimates (seeds data) is

model3 <- function() {
   for (i in 1:N) {
      y[i] ~ dbin(p[i], n[i])
      b[i] ~ dnorm(0, tau)
      logit(p[i]) <- beta[1] + beta[2]*x1[i] + beta[3]*x2[i] + beta[4]*x1[i]*x2[i] + b[i]
      # IPD and log(IPD) based on averages over sub-samples
      IPD[i] <- mean(L.new[i,])
      # IIS-CV predictive estimates are posterior mean of A divided by posterior mean of Awt
      A[i] <- mean(a[i,])*Awt[i]
      Awt[i] <- 1/IPD[i]
      # subsamples
      for (s in 1:S) {
         # set A: replicate random effects and replicate data for the mid p-value
         b.new.1[i,s] ~ dnorm(0, tau)
         logit(p.new.1[i,s]) <- beta[1] + beta[2]*x1[i] + beta[3]*x2[i] + beta[4]*x1[i]*x2[i] + b.new.1[i,s]
         y.new[i,s] ~ dbin(p.new.1[i,s], n[i])
         a[i,s] <- step(y.new[i,s]-y[i]) - 0.5*equals(y.new[i,s], y[i])
         # set B: independent replicate random effects for the importance weights
         b.new.2[i,s] ~ dnorm(0, tau)
         logit(p.new.2[i,s]) <- beta[1] + beta[2]*x1[i] + beta[3]*x2[i] + beta[4]*x1[i]*x2[i] + b.new.2[i,s]
         log(L.new[i,s]) <- logfact(n[i]) - logfact(y[i]) - logfact(n[i]-y[i]) +
            y[i]*log(p.new.2[i,s]) + (n[i]-y[i])*log(1-p.new.2[i,s])
      }
   }
   # priors
   for (j in 1:P) {beta[j] ~ dnorm(0.0, 1.0E-6)}
   tau ~ dgamma(1, 0.001)
}

References
Albert J (1999) Criticism of a hierarchical model using Bayes factors. Statistics in Medicine, 18, 287–305.
Alqallaf F, Gustafson P (2001) On cross-validation of Bayesian models. Canadian Journal of Statistics,
29, 333–340.
Akaike H (1973) Information theory and an extension of the maximum likelihood principle, in The
Second International Symposium on Information Theory, eds B Petrov, F Csaki. Akademiai Kiado,
Budapest.
Ando T (2007) Bayesian predictive information criterion for the evaluation of hierarchical Bayesian
and empirical Bayes models. Biometrika, 94, 443–458.
Aslanidou H, Dey D, Sinha D (1998) Bayesian analysis of multivariate survival data using Monte
Carlo methods. Canadian Journal of Statistics, 26, 33–48.
Barry R (2006) An alternative to the ‘ones’ trick? BUGS Archive, 09/11/2006.
https://www.jiscmail.ac.uk/cgi-bin/webadmin?A1=ind06&L=BUGS#13
Bartlett M (1957) A comment on D.V. Lindley’s statistical paradox. Biometrika, 44, 533–534.
Bayarri M, Berger J (1999) Quantifying surprise in the data and model verification, pp 53–82, in
Bayesian Statistics 6, eds J Bernardo, J Berger, A Dawid, A Smith. Oxford University Press,
London, UK.
Bayarri M, Berger J (2000) P-values for composite null models. Journal of the American Statistical
Association, 95, 1127–1142.
Bayarri M, Castellanos M (2007) Bayesian checking of the second levels of hierarchical models.
Statistical Science, 22, 363–367.
Bera A, Jarque C (1980) Efficient tests for normality, homoscedasticity and serial independence of
regression residuals. Economics Letters, 6, 255–259.
Berkhof J, van Mechelen I, Hoijtink H (2000) Posterior predictive checks: Principles and discussion.
Computational Statistics, 3, 337–354.
Bernardo J, Smith A (1994) Bayesian Theory. Wiley.
Besag J, York J, Mollié A (1991) Bayesian image restoration, with two applications in spatial statistics.
Annals of the Institute of Statistical Mathematics, 43(1), 1–20.
Bos C (2002) A comparison of marginal likelihood computation methods, pp 111–117, in COMPSTAT
2002: Proceedings in Computational Statistics, eds W Härdle, B Ronz. Springer, Berlin.
Brown H, Prescott R (1999) Applied Mixed Models in Medicine. John Wiley & Sons.
Buck C, Sahu S (2000) Bayesian models for relative archaeological chronology building. Applied
Statistics, 49, 423–444.
Buenconsejo J, Fish D, Childs J, Holford T (2008) A Bayesian hierarchical model for the estimation of
two incomplete surveillance data sets. Statistics in Medicine, 27, 3269–3285.
Burnham K, Anderson D (2002) Model Selection and Multimodel Inference: A Practical Information-
Theoretic Approach, 2nd Edition. Springer-Verlag, New York.
Cai B, Dunson D (2006) Bayesian covariance selection in generalized linear mixed models. Biometrics,
62, 446–457.
Cai B, Dunson D (2008) Bayesian variable selection in generalized linear mixed models, in Random
Effect and Latent Variable Model Selection, ed D Dunson. Springer.
Carlin B, Louis T (2000) Bayes and Empirical Bayes Methods for Data Analysis, 2nd Edition. Chapman
and Hall, London, UK.
Carvalho C, Polson N, Scott J (2009) Handling sparsity via the horseshoe. Proceedings of Machine
Learning Research, 5, 73–80.
Celeux G, Forbes F, Robert C, Titterington M (2006) Deviance information criteria for missing data
models. Bayesian Analysis, 1, 651–674.
Chaloner K, Brant R (1988) A Bayesian approach to outlier detection and residual analysis.
Biometrika, 75, 651–660.
Chen M-H (2005) Computing marginal likelihoods from a single MCMC output. Statistica Neerlandica,
59, 16–29.
Chen Z, Dunson D (2003) Random effects selection in linear mixed models. Biometrics, 59, 762–769.
Chib S (1995) Marginal likelihood from the Gibbs output. Journal of the American Statistical Association,
90(432), 1313–1321.
Chib S (2008) Panel data modeling and inference: A Bayesian primer, pp 479–515, in The Econometrics
of Panel Data, 3rd Edition, eds L Matyas, P Sevestre. Springer-Verlag, Berlin, Germany.
Chib S, Jeliazkov I (2001) Marginal likelihood from the Metropolis–Hastings output. Journal of the
American Statistical Association, 96(453), 270–281.
Clayton D (1996) Generalized linear mixed models, in Markov Chain Monte Carlo in Practice, eds W
Gilks, S Richardson, D Spiegelhalter. Chapman & Hall, London, UK.
Conn P, Johnson D, Williams P, Melin S, Hooten M (2018) A guide to Bayesian model checking for
ecologists. Ecological Monographs, 88(4), 526–542.
Crowder MJ (1978) Beta-binomial ANOVA for proportions. Applied Statistics, 27, 34–37.
de la Horra J, Rodríguez-Bernal M (2005) Bayesian model selection: A predictive approach with
losses based on distances. Statistics & Probability Letters, 71, 257–265.
Drton M, Plummer M (2017) A Bayesian information criterion for singular models. Journal of the Royal
Statistical Society: Series B (Statistical Methodology), 79(2), 323–380.
Drummond A, Bouckaert R (2015) Bayesian Evolutionary Analysis with BEAST. Cambridge University
Press.
Fahrmeir L, Osuna L (2003) Structured count data regression. Sonderforschungsbereich, 386, Discussion
Paper 334, University of Munich.
Fernandez C, Ley E, Steel M (2001) Benchmark priors for Bayesian model averaging. Journal of
Econometrics, 100, 381–427.
Friel N, McKeone J, Oates C, Pettitt A (2017) Investigation of the widely applicable Bayesian information
criterion. Statistics and Computing, 27(3), 833–844.
Friel N, Pettitt A (2008) Marginal likelihood estimation via power posteriors. Journal of the Royal
Statistical Society: Series B, 70, 589–607.
Frühwirth-Schnatter S (1999) Bayes Factors and Model Selection for Random Effect Models. Working
Paper, Department of Statistics, University of Business Administration and Economics, Vienna.
Frühwirth-Schnatter S (2004) Estimating marginal likelihoods for mixture and Markov switching
models using bridge-sampling techniques. The Econometrics Journal, 7, 143–167.
Frühwirth-Schnatter S, Tüchler R (2008) Bayesian parsimonious covariance estimation for hierarchical
linear mixed models. Statistics & Computing, 18, 1–13.
Frühwirth-Schnatter S, Wagner H (2010) Stochastic model specification search for Gaussian and partial
non-Gaussian state space models. Journal of Econometrics, 154(1), 85–100.
Geisser S, Eddy W (1979) A predictive approach to model selection. Journal of the American Statistical
Association, 74, 153–160.
Gelfand A (1996) Model determination using sampling based methods, Chapter 9, in Markov Chain Monte
Carlo in Practice, eds W Gilks, S Richardson, D Spiegelhalter. Chapman & Hall/CRC, Boca Raton.
Gelfand A, Dey D (1994) Bayesian model choice: Asymptotics and exact calculations. Journal of the
Royal Statistical Society, Series B, 56, 501–514.
Gelfand A, Dey D, Chang H (1992) Model determination using predictive distributions with
implementations via sampling-based methods, pp 147–168, in Bayesian Statistics 4, eds J Bernardo
et al. Oxford University Press.
Gelfand A, Ghosh S (1998) Model choice: A minimum posterior predictive loss approach. Biometrika,
85, 1–11.
Gelfand A, Vlachos P (2003) On the calibration of Bayesian model choice criteria. Journal of Statistical
Planning and Inference, 111, 223–234.
Gelman A, Carlin J, Stern H, Dunson D, Vehtari A, Rubin D (2014) Bayesian Data Analysis. CRC, Boca
Raton, FL.
Gelman A, Meng XL (1998) Simulating normalizing constants: From importance sampling to bridge
sampling to path sampling. Statistical Science, 13(2), 163–185.
Gelman A, Meng XL, Stern H (1996) Posterior predictive assessment of model fitness via realized
discrepancies. Statistica Sinica, 6, 733–807.
George E, McCulloch R (1993) Variable selection via Gibbs sampling. Journal of the American Statistical
Association, 88(423), 881–889.
George E, McCulloch R (1997) Approaches for Bayesian variable selection. Statistica Sinica, 7, 339–373.
Gneiting T, Raftery AE (2007) Strictly proper scoring rules, prediction, and estimation. Journal of the
American Statistical Association, 102(477), 359–378.
Gosoniu L, Vounatsou P, Sogoba N, Smith T (2006) Bayesian modelling of geostatistical malaria risk
data. Geospatial Health, 1, 127–139.
Green M, Medley G, Browne W (2009) A comparison of methods of posterior predictive assessment
in multilevel logistic regression using an example from veterinary medicine. Veterinary
Research, 40(4), 1–10.
Gronau Q, Sarafoglou A, Matzke D, Ly A, Boehm U, Marsman M (2017b) A tutorial on bridge sampling.
Journal of Mathematical Psychology, 81, 80–89.
Gronau Q, Singmann H, Wagenmakers E (2017a) Bridgesampling: An R package for estimating
normalizing constants. arXiv preprint arXiv:1710.08162
Gustafson P, Hossain S, Macnab Y (2006) Conservative prior distributions for variance parameters in
hierarchical models. Canadian Journal of Statistics, 34(3), 377–390.
Han C, Carlin B (2001) Markov chain Monte Carlo methods for computing Bayes factors: A comparative
review. Journal of the American Statistical Association, 96, 1122–1132.
Harun N, Cai B (2014) Bayesian random effects selection in mixed accelerated failure time model for
interval-censored data. Statistics in Medicine, 33(6), 971–984.
Ibrahim J, Chen M, Sinha D (2001) Criterion-based methods for Bayesian model assessment. Statistica
Sinica, 11, 419–443.
Jeffreys H (1961) The Theory of Probability, 3rd Edition. Clarendon Press, Oxford, UK.
Johnson V (2004) A Bayesian χ2 test for goodness-of-fit. Annals of Statistics, 32, 2361–2384.
Johnson V, Rossell D (2012) Bayesian model selection in high-dimensional settings. Journal of the
American Statistical Association, 107(498), 649–660.
Kacker R, Forbes A, Kessel R, Sommer K-D (2008) Bayesian posterior predictive p-value of statistical
consistency in interlaboratory evaluations. Metrologia, 45, 512–523.
Kass R, Raftery A (1995) Bayes factors. Journal of the American Statistical Association, 90, 773–795.
Kato B, Hoijtink H (2004) Testing homogeneity in a random intercept model using asymptotic,
posterior predictive and plug-in p-values. Statistica Neerlandica, 58, 179–196.
Kelly D, Smith C (2011) Bayesian model checking, pp 39–50, in Bayesian Inference for Probabilistic Risk
Assessment, eds D Kelly, C Smith. Springer, London, UK.
Key J, Pericchi L, Smith A (1999) Bayesian model choice: What and why?, pp 343–370, in Bayesian
Statistics 6, eds J Bernardo, J Berger, A Dawid, A Smith. Oxford Science Publications, Oxford,
UK.
Kinney S, Dunson D (2008) Bayesian model uncertainty in mixed effects models, in Random Effect and
Latent Variable Model Selection, ed D Dunson. Springer.
Kuhn E, Lavielle M (2005) Maximum likelihood estimation in nonlinear mixed effects models.
Computational Statistics & Data Analysis, 49, 1020–1038.
Kuo L, Mallick B (1998) Variable selection for regression models. Sankhyā: The Indian Journal of
Statistics, Series B, 60(1), 65–81.
Laud P, Ibrahim J (1995) Predictive model selection. Journal of The Royal Statistical Society: Series B, 57,
247–262.
Lenk P, DeSarbo W (2000) Bayesian inference for finite mixture models of generalized linear models
with random effects. Psychometrika, 65, 475–496.
Li L, Qiu S, Zhang B, Feng C (2016) Approximating cross-validatory predictive evaluation in Bayesian
latent variable models with integrated IS and WAIC. Statistics and Computing, 26(4), 881–897.
Li L, Feng C, Qiu S (2017) Estimating cross-validatory predictive p-values with integrated importance
sampling for disease mapping models. Statistics in Medicine, 36(14), 2220–2236.
Lopes HF, West M (2004) Bayesian model assessment in factor analysis. Statistica Sinica, 14(1), 41–68.
Lucy L (2018) Bayesian model checking: A comparison of tests. Astronomy & Astrophysics, 614, A25.
MacNab Y, Qiu Z, Gustafson P, Dean C, Ohlsson A, Lee S (2004) Hierarchical Bayes analysis of multilevel
health services data: A Canadian neonatal mortality study. Health Services and Outcomes
Research Methodology, 5, 5–26.
Marshall C, Spiegelhalter D (2003) Approximate cross-validatory predictive checks in disease mapping
models. Statistics in Medicine, 22, 1649–1660.
Marshall C, Spiegelhalter D (2007) Identifying outliers in Bayesian hierarchical models: A
simulation-based approach. Bayesian Analysis, 2, 1–33.
Meng X (1994) Posterior predictive p-values. The Annals of Statistics, 22, 1142–1160.
Meng XL, Wong WH (1996) Simulating ratios of normalizing constants via a simple identity: A theoretical
exploration. Statistica Sinica, 6(4), 831–860.
Meyer M, Laud P (2002) Predictive variable selection in generalized linear models. Journal of the
American Statistical Association, 97, 859–871.
Mitchell TJ, Beauchamp JJ (1988) Bayesian variable selection in linear regression. Journal of the
American Statistical Association, 83(404), 1023–1032.
Müller S, Scealy J, Welsh A (2013) Model selection in linear mixed models. Statistical Science, 28(2),
135–167.
Myung J, Pitt M (2004) Model comparison methods, pp 351–366, in Methods in Enzymology, Vol. 383,
eds L Brand, M Johnson. Elsevier, Amsterdam.
Nandram B, Kim H (2002) Marginal likelihoods for a class of Bayesian generalized linear models.
Journal of Statistical Computation and Simulation, 73, 319–340.
Ohlssen D, Sharples L, Spiegelhalter D (2006) Flexible random-effects models using Bayesian
semi-parametric models: Applications to institutional comparisons. Statistics in Medicine, 26,
2088–2112.
Park JY, Johnson M, Lee Y-S (2015) Posterior predictive model checks for cognitive diagnostic models.
International Journal of Quantitative Research in Education, 2(3–4), 244–264.
Pettit L, Young K (1990) Measuring the effect of observations on Bayes factors. Biometrika, 77, 455–466.
Piironen J, Vehtari A (2017) Comparison of Bayesian predictive methods for model selection. Statistics
and Computing, 27(3), 711–735.
Plummer M (2008) Penalized loss functions for Bayesian model comparison. Biostatistics, 9, 523–539.
Pourahmadi M, Daniels M (2002) Dynamic conditionally linear mixed models for longitudinal data.
Biometrics, 58, 225–231.
Rockova V, Lesaffre E, Luime J, Löwenberg B (2012) Hierarchical Bayesian formulations for selecting
variables in regression models. Statistics in Medicine, 31(11–12), 1221–1237.
Rossell P (2018) Bayesian Model Selection and Averaging with mombf.
https://cran.r-project.org/web/packages/mombf/vignettes/mombf.pdf
Royston P (1993) A toolkit for testing for non-normality in complete and censored samples. The
Statistician, 42, 37–43.
Rubin DB (1984) Bayesianly justifiable and relevant frequency calculations for the applied statistician.
The Annals of Statistics, 12(4), 1151–1172.
Sala-i-Martin X, Doppelhofer G, Miller RI (2004) Determinants of long-term growth: A Bayesian
averaging of classical estimates (BACE) approach. American Economic Review, 94(4), 813–835.
Saville B, Herring A (2009) Testing random effects in the linear mixed model using approximate
Bayes factors. Biometrics, 65, 369–376.
Saville B, Herring A, Kaufman J (2011) Assessing variance components in multilevel linear models
using approximate Bayes factors: A case-study of ethnic disparities in birth weight. Journal of
the Royal Statistical Society: Series A, 174(3), 785–804.
Schwarz G (1978) Estimating the dimension of a model. Annals of Statistics, 6, 461–464.
Silva R, Lopes H, Migon H (2006) The extended generalized inverse Gaussian distribution for
log-linear and stochastic volatility models. Brazilian Journal of Probability and Statistics, 20, 67–91.
Sinha D (1993) Semiparametric Bayesian analysis of multiple event time data. Journal of the American
Statistical Association, 88(423), 979–983.
Sinha D, Chen M-H, Ghosh S (1999) Bayesian analysis and model selection for interval-censored
survival data. Biometrics, 55, 585–590.
Sinharay S, Stern H (2003) Posterior predictive model checking in hierarchical models. Journal of
Statistical Planning and Inference, 111, 209–221.
Sinharay S, Stern H (2005) An empirical comparison of methods for computing Bayes factors in
generalized linear mixed models. Journal of Computational and Graphical Statistics, 14, 415–435.
Smith AF, Gelfand AE (1992) Bayesian statistics without tears: A sampling–resampling perspective.
The American Statistician, 46(2), 84–88.
Smith M, Kohn R (1996) Nonparametric regression using Bayesian variable selection. Journal of
Econometrics, 75(2), 317–343.
Smith M, Kohn R (2002) Parsimonious covariance matrix estimation for longitudinal data. Journal of
the American Statistical Association, 97(460), 1141–1153.
Spiegelhalter D (2006) Two brief topics on modelling with WinBUGS. Presented at ICEBUGS
Conference, Helsinki 2006.
Spiegelhalter D, Best N, Carlin B, van der Linde A (2002) Bayesian measures of model complexity
and fit. Journal of the Royal Statistical Society, Series B, 64, 583–639.
Stern H, Sinharay S (2005) Bayesian model checking and model diagnostics, pp 171–192, in Bayesian
Thinking: Modeling and Computation, Handbook of Statistics, Vol. 25, eds D Dey, C Rao. Elsevier,
Amsterdam, Netherlands.
Tierney L, Kadane J (1986) Accurate approximations for posterior moments and marginal densities.
Journal of the American Statistical Association, 81, 82–86.
Vannucci M (2000) Matlab code for Bayesian variable selection. ISBA Bulletin, 7(3), 1–3.
Vehtari A, Gelman A, Gabry J (2017) Practical Bayesian model evaluation using leave-one-out
cross-validation and WAIC. Statistics and Computing, 27(5), 1413–1432.
Vehtari A, Lampinen J (2002) Expected utility estimation via cross-validation, in Bayesian Statistics 7,
eds J Bernardo, M Bayarri, J Berger, A Dawid, D Heckerman, A Smith, M West. Clarendon Press.
Watanabe S (2010) Asymptotic equivalence of Bayes cross validation and widely applicable informa-
tion criterion in singular learning theory. Journal of Machine Learning Research 11, 3571–3594.
Watanabe S (2013) A widely applicable Bayesian information criterion. Journal of Machine Learning
Research, 14, 867–897.
Weihs C, Plummer M (2016) Package sBIC. Computing the singular BIC for multiple models. https://
cran.r-project.org/web/packages/sBIC/sBIC.pdf
Weiss R (1994) Pediatric pain, predictive inference and sensitivity analysis. Evaluation Review, 18,
651–678.
Xie W, Lewis P, Fan Y, Kuo L, Chen M-H (2011) Improving marginal likelihood estimation for
Bayesian phylogenetic model selection. Systematic Biology, 60(2), 150–160.
Yang M (2012) Bayesian variable selection for logistic mixed model with nonparametric random
effects. Computational Statistics & Data Analysis, 56(9), 2663–2674.
Zellner A (1986) On assessing prior distributions and Bayesian regression analysis with g-prior dis-
tributions, pp 233–243, in Bayesian Inference and Decision Techniques: Essays in Honor of Bruno de
Finetti. North-Holland/Elsevier.
Zhu L, Gorman D, Horel S (2006) Hierarchical Bayesian spatial models for alcohol availability, drug
“hot spots” and violent crime. International Journal of Health Geographics, 5, 54.
4
Borrowing Strength via Hierarchical Estimation

4.1 Introduction
What is sometimes termed ensemble estimation, or borrowing strength, refers to infer-
ences for collections of similar (exchangeable) units i = 1, … , n (schools, health agencies,
etc.) using Bayesian hierarchical methods (Burr and Doss, 2005; Clark and Gelfand, 2006;
Rounder et al., 2013; Rhodes et al., 2016). Among possible examples are surgical outcome
rates (Kuhan et al., 2002; Bayman et al., 2013), drug development (Gupta, 2012), baseball
batting averages (Kruschke and Vanpaemel, 2015), health quality measures (Staggs and
Gajewski, 2017), or oviposition preference data (Fordyce et al., 2011). Fixed effects models for
such collections are problematic (Marshall and Spiegelhalter, 1998), whereas hierarchical
random effects approaches pool information across units to obtain more reliable estimates
for each unit, identify units with unusually high or low values, and enable comparisons
between units. Borrowing strength may need to be modified to account for, or accom-
modate, unusual observations (Baker and Jackson, 2016; Farrell et al., 2010). Rankings of
the units may often be required, or probabilities of significant difference between units or
against a threshold (Deely and Smith, 1998; Staggs and Gajewski, 2017).
Implementations for hierarchical methods in R include Bayesian applications, as in
bayesPref (Gompert and Fordyce, 2015), LearnBayes (Albert, 2015), bmeta (Ding and Baio,
2016), bamdit (Verde, 2018), meta4diag (Guo and Riebler, 2016), and frequentist applications,
such as metaplus (Beath, 2016) and metafor (Viechtbauer, 2010; Viechtbauer, 2017); see also
https://cran.r-project.org/web/views/MetaAnalysis.html. For semiparametric and discrete
mixture models, packages include DPpackage (Jara et al., 2011), bspmma (Burr, 2012),
bayesmix (Gruen and Plummer, 2015), and label.switching (Papastamoulis, 2016).
A prototypical Bayesian hierarchical model for interrelated units specifies an outcome
model (first stage likelihood) p( yi |bi , Φ ) , and a process model involving unobserved effects
bi, with density p(bi |Ψ) , conditional on hyperparameters Ψ. In a longitudinal linear regres-
sion, the Φ might be regression coefficients and the residual regression variance, while Ψ
could include the variance of unit random intercepts bi. Similarly, in a Poisson-gamma
mixture, the likelihood p( yi |bi ) conditions on latent gamma effects bi. At the second stage,
the gamma density p(bi |Ψ) for the bi conditions on gamma shape and scale parameters Ψ,
while prior densities for the gamma parameters form the third stage.
The procedures considered in this chapter are typically based on an exchangeability
principle: that units are similar enough to justify being modelled by a common density,
and that the units are not configured in ways (e.g. over time or space) that imply
higher correlations between some units than others (Spiegelhalter et al., 2004; Lindley and
Smith, 1972, p.4). Structuring of units in space, time, or other forms of non-exchangeability
does not preclude borrowing strength, but a prior reflecting that structuring is required
(see Chapters 5, 6). Exchangeability means that there is no prior basis for supposing some
units have higher true effects than others, or that certain subgroups of units are more
similar between themselves than other subgroups (e.g. that mortality in hospitals i and
j is more similar than between hospitals i and k). For units of the same type and observations
generated under similar conditions, exchangeability means all possible permutations
of the sequence of units have the same probability: random variables $\{y_1, \ldots, y_n\}$ are
exchangeable if their joint distribution $P(y_1, \ldots, y_n)$ is invariant under permutation of its
arguments, so that

$$P(y_1^*, \ldots, y_n^*) = P(y_1, \ldots, y_n)$$

where $\{y_1^*, \ldots, y_n^*\}$ is any permutation of $\{y_1, \ldots, y_n\}$ (Greenland and Draper, 1998). Sometimes
units are better considered exchangeable within subgroups of the data; a UK example
relates to mortality in cardiac surgery units, with exchangeability within “closed” proce-
dures involving no use of heart bypass during anaesthesia, and “open” procedures where
the heart is stopped and heart bypass needed (Spiegelhalter, 1999). Sometimes exchange-
ability can only be supported for residual effects, bi, obtained after controlling for known
differences between studies, for example, as represented by covariates in meta-regression.
Hierarchical smoothing methods result in shrinkage of estimates for each unit towards
the average outcome rate in the population within which exchangeability is assumed;
shrinkage will be greater for units with observations based on small samples (Staggs and
Gajewski, 2017). When the single population hierarchical model is appropriate, pooling
of strength results in more precise estimates, and may provide better out of sample pre-
dictive performance – see Deely and Smith (1998) for an application of such predictions
to performance indicators. However, borrowing strength may increase the risk of bias,
as compared to unadjusted fixed effect estimates. The increase in precision but possible
bias inherent in hierarchical estimation provides a dilemma known as the “bias-variance
trade-off.” In some applications, inferences are over more than one variable as well as
over a collection of similar units (Everson and Morris, 2000; van Houwelingen et al., 2002).
Inferences will typically be improved for related outcomes over similar units (e.g. surgical
and non-surgical mortality in different hospitals).
While smoothing is the leading motivation for hierarchical models, a related theme is
to achieve smoothed estimates that allow appropriately for heterogeneity between sample
units – that is, they do not oversmooth, and show some robustness or flexibility to indi-
vidual units, or to clusters of units, that are somewhat discrepant or outlying from the
rest of the population (Baker and Jackson, 2008, 2016; Zhang et al., 2015; Beath, 2014). Such
heterogeneity will typically be associated with overdispersion in Poisson or binomial data,
or with heavy tails or skew in the case of departures from normality in continuous out-
comes. One way to modify the standard densities to take account of heterogeneity greater
than postulated under that density is to allow adaptive continuous mixing at unit level.
An example of such mixing is the scale mixture approach to the t-density discussed in
Section 4.3. Another option is discrete mixing (see Sections 4.8 et seq.), in which a single
population assumption is replaced by an assumption of two or more subpopulations.
Shrinkage is then towards the characteristics of the subgroup to which each unit has the
highest posterior probability of belonging.
Undershrinkage (undersmoothing) also raises issues: this will lead to over-estimation of
random effect variability and is to be avoided when a type I error has worse consequences
than a type II error (Gustafson et al., 2006). Similarly, Spiegelhalter (2005) points out that
there is a danger in performance indicator analysis that the units (e.g. institutions) that
one is trying to detect could be accommodated by a random effects approach, and it is
therefore important that robust methods are used to estimate the standard deviation of
the random effects distribution.

4.2 Hierarchical Priors for Borrowing Strength Using Continuous Mixtures
Observations for related units are typically considered in aggregate form, such as means
$y_i$ for a metric variable, or numbers of successes for a binomial variable, even though originally
collected in disaggregated form for repetitions j within each unit of observation i. For
example, consider a normal first stage model $y_i \sim N(b_i, s_i^2)$, with known variances $s_i^2$. Latent
means $b_i$ vary according to a stage 2 density $p(b_i|\Psi)$, such as $b_i \sim N(\mu, \tau^2)$. Stage 3 specifies
priors on the population parameters $\Psi = (\mu, \tau^2)$.
Inferences of interest include the posterior densities, such as $p(b_i|y)$ and $p(\mu|y)$, and
posterior probabilities that $b_i$ and $\mu$ are in specified intervals, such as the probability of a
positive effect $\Pr(\mu > 0|y)$, when the $y_i$ measure clinical treatment benefit. Interest may
also be in predictions for hypothetical future units (e.g. for a new clinical trial, or for the
next year in a performance ranking application), $p(y_{new}|y)$ (Friede et al., 2017). If $p(\Psi|y)$
can be obtained analytically, or samples $\Psi^{(1)}, \Psi^{(2)}, \ldots, \Psi^{(T)}$ obtained directly, then $p(b_i|y)$,
$p(\mu|y)$ and $p(y_{new}|y)$ can be obtained by Monte Carlo simulation, as in



$$p(b_i|y) = \int p(b_i|y, \Psi)\, p(\Psi|y)\, d\Psi,$$

leading to the estimate

$$\hat{p}(b_i|y) = \frac{1}{T} \sum_{t=1}^{T} p(b_i|y, \Psi^{(t)}).$$
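
The averaging over draws of Ψ can be sketched in Python as follows. This is a minimal illustration rather than code from the text: the function names are hypothetical, the hyperparameter draws would in practice come from p(Ψ|y), and the conditional density p(b_i|y, Ψ) is assumed to take the normal form that applies under the normal-normal model of Section 4.3, with Ψ = (μ, τ²).

```python
import math


def normal_pdf(x, mean, var):
    """Density of N(mean, var) evaluated at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def conditional_density(b, y_i, s2_i, mu, tau2):
    """p(b_i | y, Psi) for a normal-normal model (assumed form, see Section 4.3)."""
    w = s2_i / (s2_i + tau2)                     # weight on the prior mean mu
    cond_mean = (1.0 - w) * y_i + w * mu
    cond_var = 1.0 / (1.0 / s2_i + 1.0 / tau2)
    return normal_pdf(b, cond_mean, cond_var)


def mixture_estimate(b, y_i, s2_i, psi_draws):
    """Average the conditional density over T draws Psi^(t) = (mu, tau2)."""
    total = sum(conditional_density(b, y_i, s2_i, mu, tau2)
                for mu, tau2 in psi_draws)
    return total / len(psi_draws)
```

Calling `mixture_estimate` over a grid of values of `b` traces out the estimated marginal posterior density of a single unit effect.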
An alternative to direct simulation is to simulate the full posterior $p(\Psi, b|y)$ using MCMC
methods, by obtaining samples $\{b_i^{(t)}, \Psi^{(t)}\}$ from the full conditional posteriors $p(b_i|b_{[i]}, \Psi, y)$
and $p(\psi_q|\Psi_{[q]}, b, y)$. For example, often the first stage density $p(y|b)$ is in the full exponential
family, so that

$$p(y_i|b_i) = \exp\left\{ \frac{y_i b_i - B(b_i)}{A(\phi_i)} + C(y_i, \phi_i) \right\} \qquad (4.1)$$

where $\phi_i$ is a scale parameter. Assuming a conjugate second stage prior, the conditional
posterior of each $b_i$ follows the same density. For example, assume (Frees, 2004; Das and
Dey, 2006; Das and Dey, 2007; Ferreira and Gamerman, 2000) that

$$p(b_i|\Psi) = k_1 \exp\left( b_i g_1(\Psi) - B(b_i) g_2(\Psi) \right) \qquad (4.2)$$

where $k_1$ is a normalising constant. Then the posterior density of $b_i$ and $\Psi$ given $y$ is of
exponential form

$$p(b_i, \Psi|y) = k_2 \exp\left( \left[ g_1(\Psi) + \frac{y_i}{A(\phi_i)} \right] b_i - B(b_i) \left[ g_2(\Psi) + \frac{1}{A(\phi_i)} \right] \right). \qquad (4.3)$$

With proper log-concave priors $p(\Psi)$, the full conditionals $p(\psi_q|\Psi_{[q]}, b, y)$ are log-concave,
and can be sampled using methods such as those of Gilks and Wild (1992). By contrast,
if improper priors are assumed on hyperparameters $\{\psi_1, \ldots, \psi_Q\}$, then the full posterior
$p(b, \Psi|y)$ is not necessarily proper (George et al., 1993; George and Zhang, 2001; Browne
and Draper, 2006), and empirical convergence of the MCMC sequence $\{b^{(t)}, \Psi^{(t)}\}$ may be
problematic even if the posterior is proper analytically. George and Zhang (2001) consider
posterior propriety results for the Poisson-gamma, the binomial-beta, and multinomial-
Dirichlet models in terms of conditions on the hyperparameter prior tail behaviour. For
the latter two hierarchical schemes, no improper prior can guarantee a proper posterior.
Similar convergence and identification issues apply to the general linear mixed model
formulation.

4.3 The Normal-Normal Hierarchical Model and Its Applications


A widely applied conjugate hierarchical scheme assumes normal sampling of observations
and normally distributed latent effects. A typical borrowing strength or
meta-analysis template is for continuous observations for unit level effects $y_i$, and intra-unit
variation $A(\phi_i) = s_i^2$, even though the underlying data might have involved two-way
nesting with $j = 1, \ldots, J_i$ replications for units $i = 1, \ldots, n$. This is often the case in clinical
meta-analysis where patient level results are summarised as treatment or risk factor
effect measures (e.g. change in a clinical measure between treatment and control
groups, or the slope of a dose-response curve) along with moment estimates of sampling
variances. Assuming the observed summary measures are exchangeable, obtained
from similar study designs and relating to similar types of unit (Spiegelhalter et al.,
2004, p.92), they may be regarded as draws from an underlying common density for the
unknown true means $b_i$.
Often the normal-normal model is applied to originally discrete data using normal
approximations for the effect measures (e.g. Bakbergenuly and Kulinskaya, 2017; Friede
et al., 2017). Suppose $r_{iT}$ of $N_{iT}$ treated subjects in study i exhibit a particular response (e.g.
disease or death), as compared to $r_{iC}$ of $N_{iC}$ control subjects. Define log odds

$$w_{iT} = \log\left( r_{iT}/(N_{iT} - r_{iT}) \right),$$

and

$$w_{iC} = \log\left( r_{iC}/(N_{iC} - r_{iC}) \right),$$

in the treated and control arms. Then the log odds ratios form the unit level response,

$$y_i = w_{iT} - w_{iC},$$

which may be assumed approximately normal with variance

$$s_i^2 = \frac{1}{r_{iT}} + \frac{1}{N_{iT} - r_{iT}} + \frac{1}{r_{iC}} + \frac{1}{N_{iC} - r_{iC}},$$

(see Example 4.1). It is also possible to take $y_i$ as a log relative risk between treatment and
control groups, namely

$$y_i = \log\left( \frac{r_{iT}}{N_{iT}} \right) - \log\left( \frac{r_{iC}}{N_{iC}} \right),$$

with variance

$$\frac{1}{r_{iT}} + \frac{1}{r_{iC}} - \frac{1}{N_{iT}} - \frac{1}{N_{iC}}.$$

Another option is to take the risk difference

$$y_i = \frac{r_{iT}}{N_{iT}} - \frac{r_{iC}}{N_{iC}},$$

as approximately normal with variance

$$\frac{r_{iT}(N_{iT} - r_{iT})}{N_{iT}^3} + \frac{r_{iC}(N_{iC} - r_{iC})}{N_{iC}^3}.$$
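
The three effect measures and their approximate variances can be computed from the arm-level counts as in the sketch below. The function names are hypothetical and the sketch assumes all cells are non-zero; studies with zero counts would need a continuity correction or an exact likelihood, which is not shown.

```python
import math


def log_odds_ratio(r_t, n_t, r_c, n_c):
    """Log odds ratio y_i and its approximate variance s_i^2."""
    y = math.log(r_t / (n_t - r_t)) - math.log(r_c / (n_c - r_c))
    s2 = 1.0 / r_t + 1.0 / (n_t - r_t) + 1.0 / r_c + 1.0 / (n_c - r_c)
    return y, s2


def log_relative_risk(r_t, n_t, r_c, n_c):
    """Log relative risk y_i and its approximate variance."""
    y = math.log(r_t / n_t) - math.log(r_c / n_c)
    s2 = 1.0 / r_t + 1.0 / r_c - 1.0 / n_t - 1.0 / n_c
    return y, s2


def risk_difference(r_t, n_t, r_c, n_c):
    """Risk difference y_i and its approximate variance."""
    y = r_t / n_t - r_c / n_c
    s2 = r_t * (n_t - r_t) / n_t ** 3 + r_c * (n_c - r_c) / n_c ** 3
    return y, s2
```

For example, `log_odds_ratio(10, 50, 20, 50)` uses made-up counts of 10 responders among 50 treated subjects and 20 among 50 controls; the resulting pairs (y_i, s_i²) are the inputs to the normal-normal model.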
Unless heavy tails, skewness, or multiple modes are suspected, an appropriate hierarchical
model then has a normal first stage, and a second stage normal density for the $b_i$ with
variance constant over units. So

$$y_i \sim N(b_i, s_i^2), \qquad (4.4.1)$$

and

$$b_i \sim N(\mu, \tau^2). \qquad (4.4.2)$$

Integrating out the $b_i$, the marginal likelihood for $y_i$ (Guolo and Varin, 2017) is then

$$y_i|\mu, \tau^2 \sim N(\mu, s_i^2 + \tau^2). \qquad (4.4.3)$$

Often the summary measures are unit or trial means and different observational variances
are associated with differing sample sizes $N_i$, so that $s_i^2 = \sigma^2/N_i$, where $\sigma^2$ is an additional
unknown. While clinical meta-analysis applications are common, a similar scenario
occurs in small area estimation from multiple surveys where $s_i^2$ are sampling variances
obtained according to the survey design.
More complex situations can be fitted into this framework. For example, Abrams et al.
(2000) consider the effect of testing positive or negative in a screening test on subsequent
levels of anxiety; see also Abrams et al. (2005). Let $x_{ik}$ be baseline anxiety in study i, with
k = 1 (tested positive) and k = 2 (tested negative), and with $N_{i1}$ and $N_{i2}$ subjects in different
arms. Let $z_{ik}$ be follow-up anxiety according to screening result, and let $d_{ik} = z_{ik} - x_{ik}$ denote
change in anxiety. Then the measure of interest is the contrast between anxiety growth
according to screening result, namely $y_i = d_{i1} - d_{i2}$, with variance

$$s_i^2 = \frac{(N_{i1} - 1)V(d_{i1}) + (N_{i2} - 1)V(d_{i2})}{N_{i1} + N_{i2} - 2},$$

where

$$V(d_{ik}) = V(x_{ik}) + V(z_{ik}) - 2\rho \sqrt{V(x_{ik}) V(z_{ik})},$$

and ρ is a within-subject correlation taken constant across studies and arms. Studies may
not report all the relevant statistics: they may report the $d_{ik}$ and their variances, or the separate
baseline and follow-up measures in each arm $\{x_{ik}, z_{ik}\}$ and their variances. In either
case, meta-analysis requires a prior on ρ.
In (4.4), assume independent priors on the hyperparameters

$$p(\tau^2, \mu) = p(\tau^2)\, p(\mu),$$

with a commonly adopted option being

$$\tau^2 \sim IG\left( \frac{\nu}{2}, \frac{\nu\lambda}{2} \right),$$

$$\mu \sim N(m_\mu, V_\mu),$$

where $\nu$, $\lambda$, $m_\mu$, $V_\mu$ are assumed known. The full posterior conditional for $b_i$ is then (Browne
and Draper, 2006; George et al., 1993; Silliman, 1997, p.927)

$$p(b_i|b_{[i]}, \mu, \tau, y) = p(b_i|\mu, \tau, y) = N([1 - w_i] y_i + w_i \mu, D_i),$$

where

$$D_i = \left( \frac{1}{s_i^2} + \frac{1}{\tau^2} \right)^{-1} = \frac{\tau^2 s_i^2}{\tau^2 + s_i^2},$$

$$w_i = \frac{s_i^2}{s_i^2 + \tau^2},$$
and the first equality is by virtue of conditional independence of the $b_i$. The full conditional
for $\tau^2$ is

$$\tau^2 \sim IG\left( 0.5[n + \nu],\; 0.5\left[ \nu\lambda + \sum_{i=1}^{n} (b_i - \mu)^2 \right] \right),$$

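while that for μ involves a precision weighted average of $m_\mu$ and the average $\bar{b}$ of the $b_i$,
namely

$$\mu \sim N\left( \bar{b} \left( \frac{n V_\mu}{n V_\mu + \tau^2} \right) + m_\mu \left( \frac{\tau^2}{n V_\mu + \tau^2} \right),\; \frac{\tau^2 V_\mu}{n V_\mu + \tau^2} \right).$$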
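
The three full conditionals can be cycled through in a Gibbs sampler, sketched below in plain Python. This is an illustrative sketch under stated assumptions, not the implementation used for the examples in this chapter: the function name, starting values, and default hyperparameters (ν, λ, m_μ, V_μ) are arbitrary choices, and the update for τ² is taken to depend on the squared deviations of the latent effects b_i about μ, as follows from the second stage density.

```python
import math
import random


def gibbs_normal_normal(y, s2, nu=1.0, lam=1.0, m_mu=0.0, v_mu=100.0,
                        n_iter=2000, burn_in=500, seed=1):
    """Gibbs sampler for y_i ~ N(b_i, s2_i), b_i ~ N(mu, tau2).

    Priors (assumed for this sketch): tau2 ~ IG(nu/2, nu*lam/2), mu ~ N(m_mu, v_mu).
    Returns a list of (mu, tau2, b) draws retained after burn-in.
    """
    rng = random.Random(seed)
    n = len(y)
    mu, tau2 = sum(y) / n, 1.0           # arbitrary starting values
    draws = []
    for it in range(n_iter):
        # b_i | mu, tau2, y ~ N([1 - w_i] y_i + w_i mu, D_i)
        b = []
        for y_i, s2_i in zip(y, s2):
            w = s2_i / (s2_i + tau2)
            d = tau2 * s2_i / (tau2 + s2_i)
            b.append(rng.gauss((1.0 - w) * y_i + w * mu, math.sqrt(d)))
        # tau2 | b, mu ~ IG(0.5[n + nu], 0.5[nu*lam + sum (b_i - mu)^2]);
        # an inverse gamma draw is rate / Gamma(shape, scale=1)
        shape = 0.5 * (n + nu)
        rate = 0.5 * (nu * lam + sum((b_i - mu) ** 2 for b_i in b))
        tau2 = rate / rng.gammavariate(shape, 1.0)
        # mu | b, tau2: precision weighted average of m_mu and mean(b)
        b_bar = sum(b) / n
        denom = n * v_mu + tau2
        mean_mu = b_bar * n * v_mu / denom + m_mu * tau2 / denom
        mu = rng.gauss(mean_mu, math.sqrt(tau2 * v_mu / denom))
        if it >= burn_in:
            draws.append((mu, tau2, b))
    return draws
```

Posterior summaries follow by averaging over the retained draws, for example the proportion of draws with μ below zero as an estimate of Pr(μ < 0|y).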
Allowing interrelatedness between units leads to inferences about underlying unit means
that are different from those obtained under alternative scenarios sometimes used, namely,
(a) the “independent units” case, with $b_i$ taken as unknown and mutually unrelated fixed
effects, with $\tau^2 \to \infty$, and (b) the complete pooling model of classical meta-analysis where
the studies are regarded as effectively interchangeable and $\tau^2 = 0$.
By contrast, the intermediate “exchangeable units” Bayes model leads to a posterior
mean for $b_i$,

$$E[b_i|y] = w_i \mu + [1 - w_i] y_i,$$

that averages over the prior mean μ and the data mean $y_i$ with weights $w_i = s_i^2/(s_i^2 + \tau^2)$ and
$1 - w_i = \tau^2/(s_i^2 + \tau^2)$ respectively, as is apparent from the Gibbs sampling full conditionals.
The $b_i$ under an exchangeability scenario have narrower posterior intervals than under an
independent units assumption, with precision related to the confidence about the prior
mean and the prior assumed for $\tau^2$ (see also Section 4.4). Assume the intra-study variances
can be expressed as $s_i^2 = \sigma^2/N_i$ and then set $\tau^2 = \sigma^2/N_\mu$, where $N_\mu$ is the sample size
assigned to the prior mean. Then the weights $w_i$ become $N_\mu/(N_i + N_\mu)$, demonstrating that
shrinkage to the prior mean increases as the confidence about the prior mean increases.
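
The equivalence between the variance form and the sample size form of the shrinkage weight can be checked numerically. The short sketch below uses hypothetical function names and made-up values for σ², the study sample size, and the prior sample size.

```python
def weight_from_variances(sigma2, n_i, n_mu):
    """w_i = s_i^2 / (s_i^2 + tau^2) with s_i^2 = sigma2/n_i, tau^2 = sigma2/n_mu."""
    s2_i = sigma2 / n_i
    tau2 = sigma2 / n_mu
    return s2_i / (s2_i + tau2)


def weight_from_sample_sizes(n_i, n_mu):
    """Equivalent form w_i = N_mu / (N_i + N_mu)."""
    return n_mu / (n_i + n_mu)


def shrunken_mean(y_i, mu, n_i, n_mu):
    """Conditional posterior mean w_i * mu + (1 - w_i) * y_i."""
    w = weight_from_sample_sizes(n_i, n_mu)
    return w * mu + (1.0 - w) * y_i
```

With a study of 20 subjects and a prior sample size of 5, both forms give a weight of 0.2 on the prior mean; raising the prior sample size to 20 raises the weight to 0.5.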
The normal-normal model may be robustified against skewness, heavy tails, and outlier
studies in either the sampling density or the latent effects density. If non-normality is suspected
at the second stage, a heavy-tailed prior can be used to accommodate possibly outlying
studies. A normal-t approach involves study-specific scale adjustments at the second
stage (West, 1984), downplaying the influence of atypical studies on posterior estimates of
the overall effect μ, and avoiding over-shrinkage of individual study effects $b_i$. The scaling
factors are gamma with shape and rate ν/2:

$$b_i \sim N(\mu, \tau^2/\lambda_i),$$

$$\lambda_i \sim Ga\left( \frac{\nu}{2}, \frac{\nu}{2} \right).$$

Skewness in the observed data can often be reduced or eliminated by transformation.


However, continuous data (e.g. cost data or data resulting from psychometric tests) will
sometimes have more unusual departures from normality that render transformation
inapplicable, such as clumping of zero values as well as positive skewness in positive
responses (Delucchi and Bostrom, 2004). Skewness in the latent effects bi may be han-
dled by more specific parametric adaptations (e.g. Fernandez and Steele, 1998; Sahu
et al., 2003; Lee and Thompson, 2008), multivariate versions of which are considered in
Section 4.5.
Robustness may also be achieved by shrinkage priors (see Chapter 7) such as the horse-
shoe, Lasso, or scaled beta2 priors (Zollinger et al., 2015; Pérez et al., 2017), by median
regression at the second stage, or a discrete mixture over two or more normal densities
(Marshall and Spiegelhalter, 1998). For example, Beath (2014) develops a two-group mix-
ture for the second stage variance. Baker and Jackson (2016) point out possible identifi-
ability issues with extended models (e.g. discrete mixture models), and investigate instead
estimation using the marginal likelihood formulations. Moreno et al. (2018) consider problems
in estimating the second stage mean μ when there is clustering in the latent means,
with $b_i = b_j$ (for $i \neq j$), so that the distribution of the treatment effect is not fully heterogeneous
across units. This could potentially be approached using a discrete mixture over
cluster-specific second stage means, with the population-wide mean obtained by averaging
over cluster means.

Example 4.1 Local Anaesthesia


Guolo and Varin (2017) report inferential difficulties when there are a small number
of studies, comparing significance tests on μ for different classical procedures. In par-
ticular, they compare five studies of the effectiveness of local anaesthesia in controlling
pain during intrauterine pathological examinations. The procedure of DerSimonian
and Laird (1986) obtains a 95% confidence interval for μ of (−2.22, −0.35) with an associ-
ated p-value of 0.007. By contrast, the application of six alternative procedures provides
p-values between 0.056 and 0.17.
Here we compare the normal-normal (with Bayesian estimation) with three other
approaches: a normal-t hierarchy, a 2nd stage shrinkage prior, and a hierarchical
approach using (intercept only) median regression at the second stage. The latter option
is a particular case of quantile regression (see Chapter 12). A normal-t is applied with
preset degrees of freedom (df = 5), as the data would provide little information to iden-
tify an unknown parameter. As a posterior predictive check (PPC), the Q statistic for
detecting study heterogeneity is compared between the original and replicated data.
The mixed exceedance procedure of Marshall and Spiegelhalter (2007) is also used to
detect poorly fitted cases.
A normal-normal model with a Ga(1,0.001) prior on $1/\tau^2$ provides a 95% CRI for μ of
(−2.76, 0.15), with an estimated posterior probability $\Pr(\mu < 0|y) = 0.033$ that the treatment
reduces pain. The PPC check is unsatisfactory, and there is a 95% exceedance
probability $\Pr(y_{new,i} > y_i|y)$ for study 5, which has a much stronger observed treatment
effect than the other studies. A normal-t model provides a more precise estimate of μ,
with 95% CRI (−2.3, 0.2), and with the estimated scale adjustment $\lambda_5$ having mean 0.43,
so downweighting the impact of the fifth study on inferences. However, the estimate
$\Pr(\mu < 0|y) = 0.049$ remains under 0.05.
A horseshoe prior on the second stage effects is more conservative, leading to an
estimate $\Pr(\mu < 0|y) = 0.080$, with the estimated $\kappa_i$ values (see Equation 7.2) seemingly
downweighting study 2 as well as study 5. The overall treatment effect has a much narrower
95% CRI, namely (−0.86, −0.04). The scaled beta2 prior with settings p = q = 1, as in
Pérez et al. (2017), provides a 95% CRI for μ of (−2.38, 0.41), with a conservative estimate
also for $\Pr(\mu < 0|y)$ of 0.11. The exceedance probability for study 5 is raised to over 0.99.
Median regression using the asymmetric Laplace distribution as in Yu and Moyeed
(2001) leads to an estimate $\Pr(\mu < 0|y) = 0.068$. In view of the sensitivity evident in this
application, it may also be relevant to note sensitivity to the prior on $1/\tau^2$ or $\tau^2$, even if
the same stage 2 model is used (see Section 4.4). Lambert et al. (2005) provide a sensitivity
analysis of inferences under simulations with a small number of studies.

4.3.1 Meta-Regression
Sometimes it is necessary to control explicitly for trial design, study location, and
other design features in order to justify an exchangeability assumption (Marshall and
Spiegelhalter, 1998; Pauler and Wakefield, 2000; Prevost et al., 2000). Similarly, in survey-
based small area estimation, the estimate of bi may incorporate information from admin-
istrative area data Xi (Rao, 2003; Jiang and Lahiri, 2006). So, with centred predictors Xi of
dimension p (excluding a constant term), the normal-normal model becomes

$$y_i \sim N(b_i, s_i^2),$$

$$b_i \sim N(\mu + X_i \beta, \tau^2),$$

with marginal likelihood then (DuMouchel, 1996)

$$y_i|\beta, \tau^2 \sim N(\mu + X_i \beta, s_i^2 + \tau^2).$$

Recent Bayesian applications of meta-regression include Markham et al. (2017) and
Druyts et al. (2017). Writing the model as

$$y_i = \mu + X_i \beta + d_i + e_i,$$

$$d_i \sim N(0, \tau^2),$$

$$e_i \sim N(0, s_i^2),$$

the true effect for unit i is then $\mu + X_i \beta + d_i$.

4.4 Prior for Second Stage Variance


The prior assumed for the second stage variance $\tau^2$, or precision $1/\tau^2$, plays an important
role in governing the degree of shrinkage or pooling strength (Lambert et al., 2005), with
diffuse priors leading to lesser shrinkage (Conlon et al., 2007), and convenient choices
prone to overfitting. As discussed in Chapter 1, improper or highly diffuse priors may also
lead to identification or propriety problems. For example, the prior

$$p(\tau^2) \propto \frac{1}{\tau^2},$$

equivalent to taking $\tau^2 \sim IG(0, 0)$ and to a flat prior on log(τ) over (0,∞), can lead to improper
posteriors in random-effects models (DuMouchel and Waternaux, 1992). A just proper
risk-averse alternative, such as $\tau^2 \sim IG(c, c)$ with c small, is often used (Simpson et al., 2016).
However, this prior has a spike near zero (Browne and Draper, 2006), and different values of c
can influence posterior inferences despite the supposedly diffuse nature of the prior (Gelman,
2006). The prior $1/\tau^2 \sim Ga(1, c)$ similarly may lead to overfitting (Simpson et al., 2016).
One might carry out a sensitivity analysis over a range of proper but diffuse Ga(c,d) priors
for the precision $1/\tau^2$, such as {c = 0.1, d = 0.001} or c = d = 0.0001 (Fahrmeir and Lang,
2001; van Dongen, 2006). An alternative scheme is to compare alternative values of c in
Ga(1,c) priors for $1/\tau^2$ (Besag et al., 1995), possibly using a mixture prior over M possible
values $c = (c_1, \ldots, c_M)$ in the prior $1/\tau^2 \sim Ga(1, c)$, such as $c_m$ = 1, 0.1, 0.01 and 0.001 (Jullion
and Lambert, 2007). Then

$$1/\tau^2 \,|\, \pi \sim \sum_{m=1}^{M} \pi_m\, Ga(1, c_m),$$

$$\pi \sim \mathrm{Dirichlet}(w),$$

where $(w_1, \ldots, w_M)$ are prior weights.


Introducing some degree of prior information may be relevant, and is natural under
the inverse chi-squared density (sometimes called the scaled inverse chi-squared) with
parameters {ν,λ}. For $\tau^2$ a variance, taking

$$\tau^2 \sim \chi^{-2}(\nu, \lambda),$$

is equivalent to assuming $\tau^2 \sim IG(\nu/2, \nu\lambda/2)$, where λ is a prior guess at the mean variance,
and ν is a prior sample size (or level of confidence) parameter. Conlon et al. (2007)
consider informative inverse gamma priors on $\tau^2$ for inter-study variability in log expression
ratios in a microarray data application; for example, they use relatively large prior
sample sizes ν.
Smith et al. (1995) discuss elicitation of informative inverse gamma priors for $\tau^2$ based on
anticipated variation in the underlying rates $b_i$, and the fact that, assuming normality, 95%
of the $b_i$ will lie between $\mu - 1.96\tau$ and $\mu + 1.96\tau$. Assume the $b_i$ are measured on a log scale
(e.g. log relative risks or log odds ratios), and suppose the expected ratio of the 97.5th and
2.5th percentiles of risks (or odds) between centres or studies is 5; then the gap between
the 97.5th and 2.5th percentiles for $b_i$ is log(5) = 1.61. For normal $b_i$, the prior mean for $\tau^2$ is
then $(0.5 \times 1.61/1.96)^2 = 0.17$, and the prior mean for $1/\tau^2$ is 5.93. If the upper limit for the
ratio of the 97.5th and 2.5th percentile of rates or odds is set at 10, this defines the 97.5th percentile
of $\tau^2$, namely $(0.5 \times 2.3/1.96)^2 = 0.34$. The expectation and variability are then used to
define an inverse gamma prior on $\tau^2$ or a gamma prior on $1/\tau^2$. Another procedure based
on expected contrasts in relative risk (RR) or relative odds (RO) is mentioned by Marshall
and Spiegelhalter (2007, p.422): 95% of units will have RRs or ROs in the range $\exp(\pm 1.96\tau)$,
and an expectation of reasonable homogeneity might correspond to values of τ less than
$\tau_h = 0.2$. Setting $\psi = 0.5\tau_h = 0.1$, these expectations are expressed via a half normal prior on
τ, with τ = |T| where

$$T \sim N(0, \psi^2),$$

with prior 95% point at $1.96 \times \psi \approx 0.2$.
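
The elicitation arithmetic is easy to reproduce. The sketch below (hypothetical function names) converts an anticipated ratio between the 97.5th and 2.5th percentiles into the implied value of the variance, and gives the 95% point of the half normal prior; small differences from the figures quoted in the text arise because the text rounds log(5) and log(10) before squaring.

```python
import math


def tau2_from_ratio(ratio, z=1.96):
    """Variance tau^2 for which the central 95% range of b_i spans log(ratio)."""
    return (0.5 * math.log(ratio) / z) ** 2


def half_normal_95_point(psi, z=1.96):
    """95% point of tau = |T| when T ~ N(0, psi^2)."""
    return z * psi


prior_mean_tau2 = tau2_from_ratio(5)       # about 0.17
upper_tau2 = tau2_from_ratio(10)           # about 0.34 to 0.35
implied_precision = 1.0 / prior_mean_tau2  # about 5.9
```

The two values `prior_mean_tau2` and `upper_tau2` can then be matched to the mean and an upper percentile of an inverse gamma prior, which is a separate moment-matching step not shown here.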


As another way to use prior evidence on variability, Marshall and Spiegelhalter (1998)
mention a hyperprior for the scale parameter ϕ in a gamma prior for $1/\tau^2$, namely

$$1/\tau^2 \sim Ga(\gamma, \phi),$$

$$\phi \sim Ga(c, d),$$

where d is a small multiple of $1/R^2$ and R is the range of the observed centre effects, and with
γ and c constrained according to $\gamma > 1 > c$. When the first stage sampling density involves
an unknown variance, Gustafson et al. (2006) suggest a conditional prior sequence adapted
to avoiding undersmoothing, namely

$$p(\sigma^2, \tau^2) = p(\sigma^2)\, p(\tau^2|\sigma^2),$$

where $\sigma^2 \sim IG(\epsilon, \epsilon)$ for some small $\epsilon > 0$, and

$$p(\tau^2|\sigma^2) \propto \left( \frac{1}{\tau^2 + \sigma^2} \right)^{a+1} \exp\left( -\frac{b}{\tau^2 + \sigma^2} \right).$$

This corresponds to a truncated inverse gamma prior on $\tau^2$, with $Z \sim IG(a, b)$ or
$(1/Z) \sim Ga(a, b)$ where $Z = \tau^2 + \sigma^2$. The case {a = 1, b = 0} corresponds to the uniform shrinkage
prior while larger values of a (e.g. a = 5) are “conservative” in the sense of guarding
against over-estimation of $\tau^2$.

4.4.1 Non-Conjugate Priors
Among non-conjugate strategies (for normal-normal meta-analysis) an effective choice in
terms of being genuinely non-informative (Gelman, 2006) is a bounded uniform prior on
the random effects standard deviation, $\tau \sim U(0, H)$ with H large. However, this prior may
be biased towards relatively large variances when the number of units (trials, studies, etc.)
is small (van Dongen, 2006, p.92).
A prior selection strategy based on the principles of penalising complexity, and of preferring
simpler models when more complex models are not strongly supported (Occam’s
razor), may be adopted. Thus Simpson et al. (2016) propose that the prior π(ξ) on a flexibility
parameter (hyperparameter), such as the level 2 standard deviation in a normal-normal
model, be set so as to prefer the simpler base model in which ξ = 0. A penalising complexity
(PC) prior has density decreasing at high values and maximum at ξ = 0 in order to prevent
overfitting; that is, the mode of the PC prior is always at the base model. A suitable value
for the prior on ξ can be obtained via a user-defined condition $\Pr(Q(\xi) > U) = \alpha$. This specifies
an upper value U for a function Q(ξ) of ξ, and the associated probability α. For the τ
standard deviation parameter in a normal-normal hierarchy, the PC prior is an exponential
with rate λ, $\tau \sim Exp(\lambda)$, and if one specifies $\Pr(\tau > \tau_U) = \alpha$, the resulting exponential rate
is $\lambda = -\ln(\alpha)/\tau_U$. The PC prior for the precision $1/\tau^2$ is a Gumbel type 2 density.
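
The calibration of the exponential rate from a tail statement can be sketched as below. The function names are hypothetical, and the upper value and tail probability in the usage note are illustrative choices rather than recommendations.

```python
import math


def pc_prior_rate(tau_u, alpha):
    """Exponential rate such that Pr(tau > tau_u) = alpha."""
    return -math.log(alpha) / tau_u


def pc_prior_tail(tau_u, rate):
    """Tail probability Pr(tau > tau_u) under tau ~ Exp(rate)."""
    return math.exp(-rate * tau_u)


def pc_prior_density(tau, rate):
    """Exponential density for tau >= 0; its mode is at the base model tau = 0."""
    return rate * math.exp(-rate * tau)
```

For instance, `pc_prior_rate(1.0, 0.01)` gives the rate for which only 1% of prior mass lies above a standard deviation of 1, and `pc_prior_tail` can be used to confirm that the tail condition is recovered.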

Variations on the uniform shrinkage prior, suggested by Christiansen and Morris
(1997) and Daniels (1999), may also be used. One is a uniform prior on the shrinkage
weights $w_i = s_i^2/(s_i^2 + \tau^2)$, or on the shrinkage weight $w = \sigma^2/(\sigma^2 + \tau^2)$, when $\sigma^2$ is unknown.
Alternatively, one might represent different shades of opinion (sceptical, neutral, enthusiastic
with regard to meta-analytic shrinkage) via the shrinkage weight. One might set a
prior probability of 1/3 on the value w = 0.9, or on values w > 0.9, corresponding to nearly
complete shrinkage to μ as under classical meta-analysis. A prior probability of 1/3 would
also be set on w = 0.1, or values w < 0.1, corresponding to a sceptical view on exchangeability.
Finally, a prior probability of 1/3 could be set on neutral values $w \sim U(0.1, 0.9)$.
Another possibility when the $s_i^2$ are provided as part of study summaries, is a uniform
prior on the average shrinkage (Spiegelhalter et al., 2004, Chapter 5), namely

$$w = \frac{s_0^2}{s_0^2 + \tau^2},$$

where

$$\frac{1}{s_0^2} = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{s_i^2},$$
is the harmonic mean of the study sampling variances. DuMouchel (1996) proposes a uniform
prior on $s_0/(s_0 + \tau)$, which is equivalent to a Pareto prior, namely

$$ p(\tau) = \frac{s_0}{(s_0 + \tau)^2}. $$

This prior is proper but with $E(\tau) = \infty$, and with (0.01, 0.25, 0.5, 0.75, 0.99) percentile points at
$(s_0/99, s_0/3, s_0, 3s_0, 99s_0)$. Note that the Pareto can also be parameterised as

$$ p(u) = b s_0^{b} u^{-b-1}, $$

with $\tau = u - s_0$ when $b = 1$.
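The percentile points quoted for the DuMouchel prior follow from its distribution function, $F(\tau) = \tau/(s_0 + \tau)$, which inverts to $\tau_p = s_0 p/(1 - p)$. The sketch below checks this numerically; the three sampling variances are hypothetical values chosen only for illustration, and the function names are not from the book.

```python
def harmonic_mean(values):
    """Harmonic mean, as used for s_0^2 from the study sampling variances."""
    return len(values) / sum(1.0 / v for v in values)


def dumouchel_cdf(tau, s0):
    """Distribution function of p(tau) = s0 / (s0 + tau)^2 on tau > 0."""
    return tau / (s0 + tau)


def dumouchel_quantile(p, s0):
    """Inverse of the distribution function: tau_p = s0 * p / (1 - p)."""
    return s0 * p / (1.0 - p)


sampling_variances = [0.10, 0.25, 0.40]  # hypothetical s_i^2 values
s0_squared = harmonic_mean(sampling_variances)
s0 = s0_squared ** 0.5
levels = (0.01, 0.25, 0.5, 0.75, 0.99)
quantiles = [dumouchel_quantile(p, s0) for p in levels]
print([round(q / s0, 4) for q in quantiles])
```

Expressed as multiples of $s_0$, the printed quantiles should be close to 1/99, 1/3, 1, 3 and 99, so the prior median for τ equals $s_0$.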


Other robust options are half normal, half Student t or half Cauchy priors on the
second-stage standard deviation τ (Lambert et al., 2005; Burke et al., 2016; Williams et al.,
2018; Spittal et al., 2015). If $T \sim N(0, V)$ and τ = |T|, then τ is half-normal with variance parameter V
(Spiegelhalter et al., 2004). One then has $E(\tau|V) = \sqrt{2V/\pi}$ and $\mathrm{var}(\tau|V) = V(1 - 2/\pi)$. If $\tau_U$
represents a likely upper value for τ, then one may take $V = (\tau_U/1.96)^2$ as in Pauler and
Wakefield (2000). A value such as $\tau_U = 1$ is often suitable (Spiegelhalter et al., 2004).
Note that if T ~ N(m, V) (i.e. the normal has an unknown mean), then τ = |T| is folded-normal
with mean

$$ E(\tau|V) = \sqrt{2V/\pi}\,\exp(-m^2/2V) - m[1 - 2\Phi(m/V^{0.5})], $$

and second moment $m^2 + V$, so that the variance is $m^2 + V - [E(\tau|V)]^2$. Gelman (2006) and
Zhao et al. (2006) adopt folded non-central t-densities for τ, obtained by dividing the
absolute value of a normal variable by the square root
of a gamma variable.
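The half-normal and folded-normal moments can be checked numerically. The following sketch is illustrative only: the helper names, the seed, and the settings m = 0.5, V = 1 are assumptions, and the folded-normal variance is taken as the second moment $m^2 + V$ minus the squared mean.

```python
import math
import random


def normal_cdf(z):
    """Standard normal distribution function via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def half_normal_mean(V):
    return math.sqrt(2.0 * V / math.pi)


def half_normal_variance(V):
    return V * (1.0 - 2.0 / math.pi)


def folded_normal_mean(m, V):
    """Mean of |T| when T ~ N(m, V)."""
    return (math.sqrt(2.0 * V / math.pi) * math.exp(-m * m / (2.0 * V))
            - m * (1.0 - 2.0 * normal_cdf(m / math.sqrt(V))))


def folded_normal_variance(m, V):
    """Second moment m^2 + V minus the squared mean."""
    return m * m + V - folded_normal_mean(m, V) ** 2


# Calibrating V from a likely upper value tau_U = 1, as in the text.
tau_upper = 1.0
V = (tau_upper / 1.96) ** 2
prob_below_upper = 2.0 * normal_cdf(tau_upper / math.sqrt(V)) - 1.0

# Monte Carlo check of the folded-normal mean (hypothetical m and V).
rng = random.Random(2020)
m_example, v_example = 0.5, 1.0
draws = [abs(rng.gauss(m_example, math.sqrt(v_example))) for _ in range(200000)]
simulated_mean = sum(draws) / len(draws)
print(round(prob_below_upper, 3), round(simulated_mean, 3))
```

With $V = (\tau_U/1.96)^2$, about 95% of the half-normal prior mass lies below $\tau_U$, which is the sense in which $\tau_U$ acts as a likely upper value.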
If the normal variable has mean zero, then the folded non-central t becomes a half t
variable. With degrees of freedom in the t density set to 1, this leads to a half-Cauchy for
τ, exemplified by

$$ \Delta \sim N(0, \sigma_\Delta^2), $$

$$ \sigma_\Delta \sim U(0, K), $$

$$ \lambda \sim Ga(0.5, 0.5), $$

$$ \tau = |\Delta|/\lambda^{0.5}. $$

Setting $\sigma_\Delta^2 = 1$ leads to a $C^{+}(0, 1)$ prior, as in the horseshoe prior (Chapter 7). The half Cauchy
prior on τ is included in rstan and runjags libraries in R.
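The normal over root-gamma construction can be checked by simulation. The sketch below fixes $\sigma_\Delta = 1$ and compares empirical quartiles of the simulated τ with those of a standard half-Cauchy, whose quantile function is $\tan(\pi p/2)$. The seed, sample size and function names are assumptions; note that Python's `random.gammavariate` takes shape and scale, so rate 0.5 corresponds to scale 2.

```python
import math
import random


def half_cauchy_quantile(p, scale=1.0):
    """Quantile function of a half-Cauchy with the given scale."""
    return scale * math.tan(math.pi * p / 2.0)


def draw_tau(rng, sd_delta=1.0):
    """One draw of tau = |Delta| / sqrt(lambda), with lambda ~ Ga(0.5, rate 0.5)."""
    delta = rng.gauss(0.0, sd_delta)
    lam = rng.gammavariate(0.5, 2.0)  # shape 0.5, scale 2 (i.e. rate 0.5)
    while lam == 0.0:  # guard against numerical underflow
        lam = rng.gammavariate(0.5, 2.0)
    return abs(delta) / math.sqrt(lam)


def empirical_quantile(sorted_values, p):
    return sorted_values[int(p * (len(sorted_values) - 1))]


rng = random.Random(7)
draws = sorted(draw_tau(rng) for _ in range(200000))
quartiles = [empirical_quantile(draws, p) for p in (0.25, 0.5, 0.75)]
print([round(q, 2) for q in quartiles])
```

The simulated quartiles should lie near 0.41, 1 and 2.41, the half-Cauchy values; the heavy right tail means sample means are not informative here.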
Half t and half Cauchy priors for the second stage parameter τ may also be achieved
by a reparameterisation of the second-stage prior on the latent trial means which strictly
involves parameter redundancy. Such over-parameterisation may improve MCMC convergence
(Gelman, 2006, section 3.2). With preset parameters ν and A (degrees of freedom and
prior scale respectively) one has, for $y_i \sim N(b_i, s_i^2)$,

$$ b_i = \mu + \xi \eta_i, $$

$$ \xi \sim N(0, A), $$

$$ \eta_i \sim N(0, \sigma_\eta^2), $$

$$ 1/\sigma_\eta^2 \sim \chi_\nu^2, $$

with the standard deviation of the $b_i$ then obtained as

$$ \tau = |\xi| \sigma_\eta. $$
Applications are provided by van Dongen (2006) and Chelgren et al. (2011). Setting ν = 1
leads to a half Cauchy prior

$$ p(\tau) \propto (\tau^2 + A)^{-1}, $$

where Gelman (2006, p.524) uses a value A = 25 in a meta-analysis with small n, based on a
prior belief that τ is well below 100.

Example 4.2 Nicotine Replacement Therapies


To illustrate approximately normal responses based on discrete (binomial) data, this
example considers n = 90 studies of the benefits of nicotine replacement therapy (NRT)
(Cepeda-Benito et al., 2004). These data also raise issues of potential outliers.
The data are supplied as the numbers $r_{iT}$ quitting smoking among the $N_{iT}$ subjects under
therapy, and numbers of quitters $r_{iC}$ in control or placebo groups of size $N_{iC}$. Then the
empirical log odds ratios measuring treatment effects, namely

$$ y_i = \log\left(\frac{r_{iT}}{N_{iT} - r_{iT}}\right) - \log\left(\frac{r_{iC}}{N_{iC} - r_{iC}}\right), $$

are taken as approximately normal with known variances

$$ s_i^2 = \frac{1}{r_{iT}} + \frac{1}{N_{iT} - r_{iT}} + \frac{1}{r_{iC}} + \frac{1}{N_{iC} - r_{iC}}. $$

A normal higher stage is assumed with $y_i \sim N(b_i, s_i^2)$ and $b_i \sim N(\mu, \tau^2)$. A uniform shrinkage
prior on

$$ w = \frac{s_0^2}{s_0^2 + \tau^2}, $$

as considered above, is assumed for the second-stage variance, with the half-Cauchy
also considered. Additionally, a N(0, 100) prior on μ is adopted. Various kinds of prediction
may be considered. Here, the predicted treatment effect in a new trial is sampled
according to

$$ b_{new} \sim N(\mu, \tau^2), $$

$$ y_{new} \sim N(b_{new}, s_0^2). $$
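The data preparation step for this example, the empirical log odds ratio and its approximate sampling variance, can be sketched as follows. The counts used are hypothetical and are not taken from the NRT data; the function name is likewise an assumption.

```python
import math


def log_odds_ratio(r_t, n_t, r_c, n_c):
    """Empirical log odds ratio y_i and its approximate variance s_i^2.

    r_t of n_t treated subjects and r_c of n_c controls show the event.
    Zero cells would need a continuity correction, which is not applied here.
    """
    cells = (r_t, n_t - r_t, r_c, n_c - r_c)
    if min(cells) <= 0:
        raise ValueError("all four cells must be positive")
    y = math.log(r_t / (n_t - r_t)) - math.log(r_c / (n_c - r_c))
    s2 = sum(1.0 / cell for cell in cells)
    return y, s2


# Hypothetical trial: 30/100 quitters under therapy, 20/100 under control.
y, s2 = log_odds_ratio(30, 100, 20, 100)
interval = (y - 1.96 * math.sqrt(s2), y + 1.96 * math.sqrt(s2))
print(round(y, 3), round(s2, 3))
```

The pairs $(y_i, s_i^2)$ computed this way are the inputs to the normal-normal model, with the $s_i^2$ treated as known.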

Early convergence in a two-chain run of 5000 iterations with jagsUI is obtained. τ2 is esti-
mated as 0.085 (mean) and 0.080 (median). A clear benefit of NRT is seemingly appar-
ent, with the odds ratio exp(μ) having a posterior mean (and 95% CrI) of 1.93 (1.73,2.17).
On the other hand, the predicted odds ratio for a new trial (OR.new in the rjags code)
includes null values for the benefit from NRT, having mean (95% CrI) of 2.2 (0.7,5.1).
Some deficiencies against model assumptions are evident: although the Shapiro–Wilk
normality test of the posterior mean bj is inconclusive (a p-value of 0.07), the Jarque–Bera
test (Jarque and Bera,1980) shows a significant departure from normality.
Similarly, evaluating individual components of the total WAIC (widely applicable
information criterion) shows studies 4, 36, and 59 as having distinctively high values.
Trial 4 has an exceptionally high empirical log odds ratio in support of NRT, while trial
36 shows unusually low NRT benefit. Mixed predictive exceedance checks (Marshall
and Spiegelhalter, 2007) show aberrant values (0.001 and 0.992) for these two trials, with
study 59 also having an extreme value. A reanalysis using the normal-normal scheme
uses a half-Cauchy prior, with the setting on the Cauchy scale parameter as in Gelman
et al. (2008). τ2 is now estimated as 0.081 (mean) and 0.078 (median). The LOO-IC (leave-
one-out information criterion) is reduced slightly, but model checks show similar fea-
tures to the analysis using the uniform shrinkage prior.
To allow for potential outlier trials and downweight their effect, an alternative analysis
adopts a second-stage Student density with

$$ b_i \sim N(\mu, \tau^2/\lambda_i), $$

$$ \lambda_i \sim Ga\left(\frac{\nu}{2}, \frac{\nu}{2}\right). $$

Less typical trial results will have values of λi considerably under 1 and a test for the
posterior probability that λi is less than 1 can be included. The prior on ν is specified in
two steps as $\nu \sim E(\kappa)$ and $\kappa \sim U(0.01, 0.5)$. Evidence in support of a heavy tailed second
stage is equivocal. From a two-chain run of 10000 iterations with jagsUI, ν has a posterior
mean of 9.3, suggesting departure from normality. The posterior mean and median
for τ2 are reduced to 0.051 and 0.045 respectively. Two trials (4 and 36) have posterior
probability that λi < 1 in excess of 0.8. These trials also have
extreme mixed predictive exceedance p-values. On the other hand, gain in goodness
of fit is not obtained: the marginal density likelihood, uncorrected for complexity, is
unchanged, and the WAIC increases.
The skew t model (Lee and Thompson, 2008; Fernandez and Steel, 1998) is also estimated
using rube. This involves asymmetric scaling of the second-stage variance according
to whether the residual $e_j = y_j - b_j$ is negative or positive. For positive residuals, τ2
is scaled by a factor $\gamma^2 > 0$, while for negative residual terms, the scaling is by 1/γ2. The
value γ = 1 corresponds to a symmetric t density, while γ > 1 (γ < 1) corresponds to positive
(negative) skew. Applied to the NRT data, a two-chain run of 5,000 iterations shows no
gain in fit, or any evidence that the 95% CrI for γ excludes 1. Mixed predictive exceedance
checks for studies 4, 36, and 59 are still extreme, with values 0.016, 0.97 and 0.014.
Finally, a two-category discrete mixture is assumed on the second-stage variance
(Beath, 2014), with an outlier group (Gj = 2) posited to have higher variance. To improve
identifiability, prior probabilities for the outlier and main groups, Pr(Gj = 2) and Pr(Gj = 1),
are set at 0.05 and 0.95 respectively. The default (main group) variance is assigned a
uniform shrinkage prior as above, while the increment in the outlier group variance is
assigned an informative E(10) prior. The posterior probability Pr(Gj = 2|y) is then 0.25
for trial 4 (a marginal Bayes factor of 6.3), while the corresponding marginal Bayes factor
for trial 36 is 3.6. Again, fit is not improved against the standard normal-normal model,
and mixed predictive exceedance checks for studies 4, 36, and 59 remain extreme.
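The marginal Bayes factors quoted for the outlier indicators are ratios of posterior to prior odds. A small sketch, with function names that are assumptions, reproduces the value for trial 4 from the reported probabilities and converts the trial 36 Bayes factor back to a posterior probability.

```python
def odds(prob):
    return prob / (1.0 - prob)


def marginal_bayes_factor(posterior_prob, prior_prob):
    """Posterior odds of outlier membership divided by prior odds."""
    return odds(posterior_prob) / odds(prior_prob)


def posterior_from_bayes_factor(bayes_factor, prior_prob):
    """Invert the relation: posterior probability implied by a Bayes factor."""
    posterior_odds = bayes_factor * odds(prior_prob)
    return posterior_odds / (1.0 + posterior_odds)


prior_outlier = 0.05  # Pr(G_j = 2), as set in the example
bf_trial_4 = marginal_bayes_factor(0.25, prior_outlier)
post_trial_36 = posterior_from_bayes_factor(3.6, prior_outlier)
print(round(bf_trial_4, 2), round(post_trial_36, 2))
```

A posterior probability of 0.25 against a prior of 0.05 gives odds of (1/3)/(1/19), about 6.3, matching the figure reported for trial 4.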

4.5 Multivariate Meta-Analysis
Multivariate meta-analysis may adopt a normal-normal strategy, albeit often with orig-
inally binary, count, or time to event data (Mavridis and Salanti, 2013). A multivariate
analysis for metric outcomes may arise in different ways. These include clinical applica-
tions involving treatment and control arms; studies where multiple outcomes are reported;
in meta-analysis of diagnostic test studies, where sensitivity and specificity are reported
(Guo and Riebler, 2016; Guo et al., 2017); in multiple treatments meta-analysis; and in net-
work meta-analysis (Greco et al., 2016).
In the first scenario, the event rate in the control arm may be taken as indicating baseline
risk, and there is interest in whether the treatment effect is related in any way to baseline
risk (Arends, 2006). Suppose $r_{iT}$ of $N_{iT}$ treated subjects in trial i exhibit a particular
response (e.g. disease or death), as compared to $r_{iC}$ of $N_{iC}$ control subjects, and define log
odds $y_{iT} = \log(r_{iT}/(N_{iT} - r_{iT}))$ and $y_{iC} = \log(r_{iC}/(N_{iC} - r_{iC}))$. Often the outcome may be taken
as the log of the odds ratio, $y_{iT} - y_{iC}$, assumed normal (see Example 4.2). However, to separate
out baseline risk, one may model $\{y_{iT}, y_{iC}\}$ as (approximately) bivariate normal. If the
trial is randomised, it is legitimate to assume that $\{y_{iT}, y_{iC}\}$ are independent at the first
stage (van Houwelingen et al., 2002). So

$$ \begin{pmatrix} y_{iT} \\ y_{iC} \end{pmatrix} \sim N\left( \begin{pmatrix} b_{iT} \\ b_{iC} \end{pmatrix}, \begin{pmatrix} s_{iT}^2 & 0 \\ 0 & s_{iC}^2 \end{pmatrix} \right), $$

$$ \begin{pmatrix} b_{iT} \\ b_{iC} \end{pmatrix} \sim N\left( \begin{pmatrix} \mu_T \\ \mu_C \end{pmatrix}, \Sigma_b \right), $$

where $\Sigma_b = \begin{pmatrix} \tau_T^2 & \tau_{TC} \\ \tau_{TC} & \tau_C^2 \end{pmatrix}$, with $\tau_{TC} = \rho \tau_T \tau_C$, and diagonal terms $\tau_T^2$ and $\tau_C^2$ represent variability
in the true treatment and control event rates. Then $\gamma = \mu_T - \mu_C$ defines the underlying
treatment effect with variance $\tau_T^2 + \tau_C^2 - 2\tau_{TC}$. The conditional variance of the treatment
effect, given the true control group rate, is $\tau_T^2 - \tau_{TC}^2/\tau_C^2$. So, baseline risk explains a portion

$$ \frac{\tau_C^2 - 2\tau_{TC} + \tau_{TC}^2/\tau_C^2}{\tau_T^2 + \tau_C^2 - 2\tau_{TC}} $$

of the treatment effect variance.


Most commonly, a multivariate analysis is generated when more than one outcome is
associated with a specific unit (Everson and Morris, 2000; Wei and Higgins, 2013). In this
case, suppose there are K outcomes such that

$$ \begin{pmatrix} y_{i1} \\ y_{i2} \\ \vdots \\ y_{iK} \end{pmatrix} \sim N\left( \begin{pmatrix} b_{i1} \\ b_{i2} \\ \vdots \\ b_{iK} \end{pmatrix}, S_i \right), $$

where

$$ S_i = \begin{pmatrix} s_{i1}^2 & s_{i12} & \cdots & s_{i1K} \\ s_{i21} & s_{i2}^2 & \cdots & s_{i2K} \\ \vdots & \vdots & \ddots & \vdots \\ s_{iK1} & s_{iK2} & \cdots & s_{iK}^2 \end{pmatrix}, $$

is the known covariance matrix between outcomes for trial i, with $s_{ijk} = r_{ijk}(s_{ij}^2 s_{ik}^2)^{0.5}$. A multivariate
normal second-level prior for $(b_{i1}, \ldots, b_{iK})$ involves means $\{\mu_1, \ldots, \mu_K\}$, and K × K covariance
matrix

$$ \Sigma_b = \begin{pmatrix} \tau_1^2 & \tau_1\tau_2\rho_{12} & \cdots & \tau_1\tau_K\rho_{1K} \\ \tau_1\tau_2\rho_{12} & \tau_2^2 & \cdots & \tau_2\tau_K\rho_{2K} \\ \vdots & \vdots & \ddots & \vdots \\ \tau_1\tau_K\rho_{1K} & \tau_2\tau_K\rho_{2K} & \cdots & \tau_K^2 \end{pmatrix}. $$

There may well be sensitivity to the priors adopted for Σb, especially when there are a small
number of trials, or some missingness in outcomes, with results from the inverse Wishart
potentially sensitive to the prior scale matrix (Wei and Higgins, 2013). Alternatives involve
decomposition approaches to the covariance matrix, so that separate priors are specified
on variances and correlations (Barnard et al., 2000; Lu and Ades, 2009; Burke et al., 2016;
Guo et al., 2017; Hurtado Rua et al., 2015). Incorporating evidence into multivariate priors
on variances and correlations leads to stabilised inferences (Burke et al., 2016; Guo et al.,
2017). Alternatives to a U(−1, 1) prior on correlations include a normal prior on the Fisher
z-transformed correlation $\mathrm{logit}((\rho + 1)/2)$, a uniform prior U(0,1), constrained to positive
correlations (Burke et al., 2016), or penalised complexity priors, as included in the R
program meta4diag (Guo et al., 2017). Alternative to gamma priors on precisions $1/\tau_k^2$, which
may lead to relatively high estimated τk, are half-normal priors (Burke et al., 2016; Lambert
et al., 2005). Alternative methods are available if some, or all, within study correlations are
not observed (i.e. only standard errors of treatment effects are available), for example, spec-
ifying an informative prior (Mavridis and Salanti, 2013). For the bivariate case, an alterna-
tive model may be specified (Riley et al., 2008), entirely avoiding the need for observed
intra-study correlations.
Multivariate normality is often a simplification, and one may wish to allow for
heavier tails, skewness, or multi-modality; see Genton (2004) and Lee and Thompson
(2008) regarding use of skew-elliptical densities as a route to greater robustness. These
models build on the principle (Azzalini, 1985) that if f and g are symmetric densities with
parameters μ and σ, with G the cumulative distribution function corresponding to g, then
the new density defined by

$$ h(x|\mu, \sigma, \delta) = \frac{2}{\sigma} f\left(\frac{x - \mu}{\sigma}\right) G\left(\delta \frac{x - \mu}{\sigma}\right) $$

is skew for non-zero δ.
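As a concrete instance of the Azzalini construction, the sketch below takes both f and g to be standard normal densities, giving the skew-normal, and checks by numerical integration that h is a proper density with mean shifted in the direction of δ. The choice of f and g, the value δ = 3 and the helper names are assumptions for illustration.

```python
import math


def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def skew_density(x, mu, sigma, delta):
    """h(x) = (2 / sigma) f(z) G(delta z), with f, G standard normal here."""
    z = (x - mu) / sigma
    return 2.0 / sigma * normal_pdf(z) * normal_cdf(delta * z)


def trapezoid(func, lower, upper, steps=4000):
    """Composite trapezoid rule on an equally spaced grid."""
    width = (upper - lower) / steps
    total = 0.5 * (func(lower) + func(upper))
    total += sum(func(lower + k * width) for k in range(1, steps))
    return total * width


mu, sigma, delta = 0.0, 1.0, 3.0
total_mass = trapezoid(lambda x: skew_density(x, mu, sigma, delta), -12.0, 12.0)
mean_value = trapezoid(lambda x: x * skew_density(x, mu, sigma, delta), -12.0, 12.0)
print(round(total_mass, 6), round(mean_value, 4))
```

Setting δ = 0 gives G(0) = 1/2, so h reduces to the symmetric density f, while δ > 0 moves mass to the right of μ.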
Following Sahu et al. (2003), a multivariate skew-normal model is a particular type of
skew-elliptical model (of dimension K) obtained by considering errors $\varepsilon_{K \times 1} \sim N_K(0, \Sigma)$,
positive variables $Z_{K \times 1} \sim N_K(0, I)$ and taking $y = DZ + \varepsilon$ where D is a diagonal matrix,
$\mathrm{diag}(\delta_1, \ldots, \delta_K)$. In a regression setting with a K dimensional mean μ, one has

$$ y|Z = z \sim N_K(\mu + Dz, \Sigma). $$

Values δk > 0 correspond to positive skew in the kth outcome while a negative δk corresponds to
negative skew. A multivariate skew-t model (allowing for both heavier tails than the normal,
and also for skewness) is obtained by sampling $Z_{K \times 1} \sim t_{K,\nu}(0, I)$, where ν is a degrees
of freedom parameter, and then

$$ y|Z = z \sim t_{K,\nu+K}\left(\mu + Dz, \frac{\nu + z^T z}{\nu + K}\Sigma\right). $$
Example 4.3 BCG Vaccine Trials; Bivariate Normal Model


Following van Houwelingen et al. (2002), an example of a bivariate meta-analysis
involves data from 13 trials regarding the effectiveness of the BCG vaccine against
tuberculosis. We consider, in particular, sensitivity on priors for the second stage cova-
riance and relevance for assessing potential outliers.
Each trial compares vaccinated and non-vaccinated groups of size $\{N_T, N_C\}$, with the
outcome being counts of tuberculosis $\{r_T, r_C\}$, and with the infection rate in the control
arm taken as indicating the baseline risk. The response variables are the log odds
in each trial arm, $y_{iT} = \log(r_{iT}/(N_{iT} - r_{iT}))$ and $y_{iC} = \log(r_{iC}/(N_{iC} - r_{iC}))$. Predictive fit is
assessed by comparing replicate data (obtained using mixed predictions) (Marshall and
Spiegelhalter, 2007) with actual observations in a sum of squares criterion.
Here an initial analysis assumes normality at both levels, with

$$ \begin{pmatrix} y_{iT} \\ y_{iC} \end{pmatrix} \sim N\left( \begin{pmatrix} b_{iT} \\ b_{iC} \end{pmatrix}, \begin{pmatrix} s_{iT}^2 & 0 \\ 0 & s_{iC}^2 \end{pmatrix} \right), $$

$$ \begin{pmatrix} b_{iT} \\ b_{iC} \end{pmatrix} \sim N\left( \begin{pmatrix} \mu_T \\ \mu_C \end{pmatrix}, \Sigma_b \right), $$

where

$$ s_{iT}^2 = \frac{1}{r_{iT}} + \frac{1}{N_{iT} - r_{iT}}, \qquad s_{iC}^2 = \frac{1}{r_{iC}} + \frac{1}{N_{iC} - r_{iC}}. $$

It is also assumed that the precision matrix $\Sigma_b^{-1}$ of the latent effects is Wishart with
identity scale matrix and 2 degrees of freedom, while the $\{\mu_T, \mu_C\}$ parameters have N(0,
1000) priors.
Using jagsUI, posterior means for $(\mu_T, \mu_C)$ are estimated as (−4.87, −4.07), with mean
vaccination effect $\gamma = \mu_T - \mu_C$ of −0.79 (−1.27, −0.32), slightly more negative than the estimate
of −0.74 found by van Houwelingen et al. (2002) using classical methods (in the
SAS package). The posterior mean for the second-stage covariance matrix is

$$ \Sigma_b = \begin{pmatrix} 1.83 & 2.21 \\ 2.21 & 3.29 \end{pmatrix}, $$

with correlation between treatment and control effects (where effects are log-odds),
obtained from monitoring the components of Σb, as 0.90.
Similarly, the slope of the regression to predict the vaccination group log-odds from
the control group log-odds, obtained by averaging $\Sigma_{b,12}^{(t)}/\Sigma_{b,22}^{(t)}$
over iterations t, is 0.67
(slope.TC in the code). The variance of the true treatment effects $b_{iT} - b_{iC}$ is obtained by
monitoring $V_t = \Sigma_{b,11} + \Sigma_{b,22} - 2\Sigma_{b,12}$, while the conditional variance of the vaccination
log-odds effects $b_{iT}$ given $b_{iC}$ (and hence the variance of $b_{iT} - b_{iC}$ given $b_{iC}$) is obtained
by monitoring $V_c = \Sigma_{b,11} - \Sigma_{b,12}^2/\Sigma_{b,22}$. Finally, the proportion of treatment effect variation
explained by baseline risk (i.e. the true log-odds in the control group), obtained by
monitoring $1 - V_c/V_t$, has a posterior mean of 0.51 (r2.base in the code).
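The derived quantities in this example can be approximated by plugging the reported posterior mean of Σb into the defining formulas. This is only a rough check: the analysis in the text averages these functions over MCMC iterations, so plug-in values need not match exactly, and the function below is an illustrative sketch rather than the book's code.

```python
import math


def baseline_risk_summary(cov):
    """Correlation, slope, V_t, V_c and explained proportion from a 2x2 covariance.

    Index 0 is the treatment arm and index 1 the control arm.
    """
    var_t, var_c, cov_tc = cov[0][0], cov[1][1], cov[0][1]
    v_total = var_t + var_c - 2.0 * cov_tc   # variance of b_iT - b_iC
    v_cond = var_t - cov_tc ** 2 / var_c     # variance of b_iT given b_iC
    return {
        "correlation": cov_tc / math.sqrt(var_t * var_c),
        "slope": cov_tc / var_c,
        "v_total": v_total,
        "v_cond": v_cond,
        "explained": 1.0 - v_cond / v_total,
    }


# Posterior mean of Sigma_b as reported in the example.
sigma_b = [[1.83, 2.21], [2.21, 3.29]]
summary = baseline_risk_summary(sigma_b)
print({key: round(value, 3) for key, value in summary.items()})
```

The plug-in correlation (about 0.90), slope (about 0.67) and explained proportion (about 0.51) agree closely with the monitored posterior means quoted in the text.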
To assess possible outliers, a mixed predictive exceedance check (Marshall and
Spiegelhalter, 2007) is carried out by sampling replicate random effects $(b_{iT,new}, b_{iC,new})$
and then sampling replicate data $y_{ij,new}$. There are four observations out of the 26 (13
pairs) with predictive exceedance checks

$$ \Pr(y_{ij,new} > y_{ij}|y) \quad (j = T \text{ or } C) $$

under 0.10 or over 0.90, with the most extreme being an exceedance probability of 0.024
in the vaccination arm of trial 6 (pred.exc[1,6] in the code). As a summary fit measure, a
predictive criterion (Laud and Ibrahim, 1995) based on comparing $y_{new}$ and y is derived.
Accordingly, a second analysis (using WINBUGS via rube) adopts a bivariate Student
t model at stage 2. The degrees of freedom parameter is set at 4, providing a robust
analysis (Gelman et al., 2014, section 17.2). This leads to only two observations, (6,T) and
(6,C), having predictive exceedance checks under 0.10. Extension of the model to a skew
bivariate t (Sahu et al., 2003), namely model 3 in the code, provides no further gain in fit.
Model fit using the predictive criterion (PFC.mix in the code), in fact, is worse for these
two extensions, illustrating that improved model fit does not always follow measures to
counteract adverse model checks. The mean vaccination effect $\gamma = \mu_T - \mu_C$ is less precise
under these models, namely a mean (95% CRI) of −0.78 (−1.34, −0.23) under the second
model, and −0.80 (−1.54, −0.04) under the third.
A final analysis uses a Cholesky decomposition (Wei and Higgins, 2013) for the
second-stage covariance matrix in an MVN-MVN analysis,

$$ \Sigma_b = V_b R_b V_b, $$

$$ R_b = L'L, $$

where $V_b$ is a diagonal matrix of standard deviations, and $R_b$ is a correlation matrix.
For a bivariate analysis, $L_{11} = L_{22} = 1$, while a U(−1, 1) prior is assumed for $L_{12} = \rho_{b12}$, and
the standard deviations τT and τC have U(0,5) priors. This analysis produces slightly
larger variances in Σb and a slightly larger correlation, 0.9, between treatment and con-
trol effects.
A more pronounced effect on estimates is obtained if the Wishart scale matrix in the
initial MVN-MVN analysis is a diagonal with elements 0.1. This option increases the
correlation between treatment and control effects to 0.95.

Example 4.4 Hypertension Treatments


The data for this example are from ten studies into the effectiveness of hypertension
treatment (Jackson et al., 2013), included in the R library mvmeta. Data from each study
consists of two treatment effects: differences in systolic blood pressure (SBP) and dia-
stolic blood pressure (DBP) between treatment and control groups (adjusted for baseline
blood pressure). A larger reduction in blood pressure indicates greater effectiveness.
Within-study correlations are known for all studies. In the absence of covariates, one
would represent the two-stage model as

$$ \begin{pmatrix} y_{1i} \\ y_{2i} \end{pmatrix} \sim N\left( \begin{pmatrix} b_{1i} \\ b_{2i} \end{pmatrix}, \begin{pmatrix} s_{1i}^2 & r s_{1i} s_{2i} \\ r s_{1i} s_{2i} & s_{2i}^2 \end{pmatrix} \right), $$

$$ \begin{pmatrix} b_{1i} \\ b_{2i} \end{pmatrix} \sim N\left( \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \Sigma_b \right), $$

where $\Sigma_b = \begin{pmatrix} \tau_1^2 & \tau_1\tau_2\rho_b \\ \tau_1\tau_2\rho_b & \tau_2^2 \end{pmatrix}$. In fact, three studies contain subjects with isolated systolic
hypertension (namely high SBP, but normal DBP). So the treatment effect may be
smaller in these trials. To represent this effect, we introduce a second-stage regression
(i.e. multivariate meta-regression):

$$ \begin{pmatrix} b_{1i} \\ b_{2i} \end{pmatrix} \sim N\left( \begin{pmatrix} \nu_{1i} \\ \nu_{2i} \end{pmatrix}, \Sigma_b \right), $$

$$ \nu_{1i} = \mu_1 + \beta_1 \mathrm{ISH}_i, $$

$$ \nu_{2i} = \mu_2 + \beta_2 \mathrm{ISH}_i. $$

Following Mavridis and Salanti (2013), a Cholesky decomposition with spherical
parameterisation on the correlation is used as a prior for the second-stage covariance,
with Ga(1,0.001) priors on the between study precisions. The jags code uses conditional
expectations and variances for y2i and b2i. A two-chain run using jagsUI provides pos-
terior mean (sd) for (β1,β2) of 0.47 (1.61) and 1.43 (0.78), similar to Jackson et al. (2013,
Table 4). The second stage correlation ρ is estimated as 0.46 (0.26) (rho.tau in the code). As
to between study standard deviations, posterior mean (sd) for (τ1,τ2) are 1.85 (0.66) and
0.98 (0.32). Mixed exceedance checks throw some doubt on study 1, which has relatively
small treatment effects, especially for DBP.
A U(0,1) prior on the between study correlation may be sensible given the correlated
blood pressure outcomes (Burke et al., 2016, p.23) and this is combined with half normal
priors HN(0,2) on τ1 and τ2. This provides posterior mean (sd) for (β1,β2) of 0.63 (1.26) and
1.46 (0.76), with the estimated correlation similar to the first analysis, namely 0.45 (0.21).
Estimated study standard deviations are affected more, with posterior mean (sd) for
(τ1,τ2) of 1.51 (0.29) and 1.00 (0.25).

4.6 Heterogeneity in Count Data: Hierarchical Poisson Models


The adoption of higher stage densities for count data is often linked to apparent departures
from the Poisson mean-variance assumption. The most common departure is that
count data show more variability than expected under the Poisson, so that the variance to
mean ratio $V(y)/\bar{y}$ exceeds 1. Overdispersion may reflect unobserved subject frailties,
multiple modes, non-random sampling (Efron, 1986), or widely different exposures oi (e.g.
when a count outcome yi is surgical deaths for hospitals, with means μioi where oi are
patient totals). The conjugate continuous mixture model in the presence of excess heterogeneity
is the Poisson-gamma, though greater flexibility in more complex models (e.g.
multilevel or multivariate) is generally obtained by mixing with non-conjugate links.
The Poisson-gamma model allows for unit mean rates μi to vary according to a gamma
density $\mu_i \sim Ga(\alpha, \beta)$, which is unimodal, but flexibly shaped. Thus for count data yi
assumed Poisson with means μi, set

$$ B(b_i) = e^{b_i} $$

in (4.1), where $\mu_i = e^{b_i}$, $a(\phi_i) = 1$ and $c(y_i, \phi_i) = \log y_i!$. Then equation (4.2) has the form

$$ p(b_i|\psi) = k_1 \exp(b_i g_1(\psi) - e^{b_i} g_2(\psi)) = k_1 (e^{b_i})^{g_1(\psi)} \exp(-e^{b_i} g_2(\psi)), $$

namely, a gamma density for $\mu_i = e^{b_i}$ with parameters $\alpha = g_1(\psi)$ and $\beta = g_2(\psi)$. The conditional
posterior is

$$ p(b_i|y, \alpha, \beta) = k_2 (e^{b_i})^{\alpha + y_i} \exp(-e^{b_i}[\beta + 1]), $$

namely a gamma for μi with parameters α + yi and β + 1. Denoting the mean of the μi as
ξ = α/β, one obtains $V(\mu_i) = \alpha/\beta^2 = \xi^2/\alpha$. Then

$$ V(y_i) = E[V(y_i|\mu_i)] + V[E(y_i|\mu_i)] = \xi + \xi^2/\alpha, $$

so that overdispersion is present when ϕ > 0, where ϕ = 1/α.


Different parameterisations of the Poisson-gamma mixture can be used. For example,
one may set $\mu_i = \xi\omega_i$, with overall mean parameter ξ, and multiplicative random effects ωi
having mean 1 for identifiability, namely $\omega_i \sim Ga(\alpha, \alpha)$ with $V(\omega_i) = 1/\alpha$. Integrating the ωi
out, as in

$$ p(y_i|\xi, \alpha) = \int p(y_i|\omega_i, \xi)\, p(\omega_i|\alpha)\, d\omega_i, $$

leads to a marginal negative binomial density for the yi, namely

$$ p(y_i|\xi, \alpha) = \frac{\Gamma(\alpha + y_i)}{\Gamma(\alpha)\Gamma(y_i + 1)} \left(\frac{\alpha}{\alpha + \xi}\right)^{\alpha} \left(\frac{\xi}{\alpha + \xi}\right)^{y_i}. $$

If predictors Xi are present, negative binomial regression is obtained with $\xi_i = \exp(X_i\beta)$.
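The negative binomial marginal can be checked numerically against the Poisson-gamma moments, mean ξ and variance ξ + ξ²/α. The sketch below evaluates the density on the log scale with `math.lgamma` and sums over a truncated support; the values ξ = 4, α = 2 and the truncation point are assumptions for illustration.

```python
import math


def neg_binomial_logpmf(y, xi, alpha):
    """Log of the marginal density p(y | xi, alpha) for the Poisson-gamma mixture."""
    return (math.lgamma(alpha + y) - math.lgamma(alpha) - math.lgamma(y + 1)
            + alpha * math.log(alpha / (alpha + xi))
            + y * math.log(xi / (alpha + xi)))


def neg_binomial_pmf(y, xi, alpha):
    return math.exp(neg_binomial_logpmf(y, xi, alpha))


xi, alpha = 4.0, 2.0    # hypothetical mean and gamma shape
support = range(0, 400)  # truncation; the omitted tail mass is negligible here
probs = [neg_binomial_pmf(y, xi, alpha) for y in support]

total = sum(probs)
mean = sum(y * p for y, p in zip(support, probs))
variance = sum((y - mean) ** 2 * p for y, p in zip(support, probs))
print(round(total, 6), round(mean, 4), round(variance, 4))
```

With these settings the variance is 4 + 16/2 = 12, three times the Poisson variance; as α grows the excess term ξ²/α vanishes and the Poisson is recovered.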


Note that in many applications, there may be forms of truncation, as when zero counts do
not enter the analysis (Larson and Soule, 2009). So a zero truncated negative binomial has

$$ p(y_i|\xi, \alpha, y_i > 0) = \frac{\dfrac{\Gamma(\alpha + y_i)}{\Gamma(\alpha)\Gamma(y_i + 1)} \left(\dfrac{\alpha}{\alpha + \xi}\right)^{\alpha} \left(\dfrac{\xi}{\alpha + \xi}\right)^{y_i}}{1 - \left(\dfrac{\alpha}{\alpha + \xi}\right)^{\alpha}}. $$


Alternatively, one may assume $y_i \sim Po(\mu_i)$, $\mu_i \sim Ga(\alpha, \beta)$, with

$$ E(\mu_i) = m_\mu = \alpha/\beta, $$

$$ \mathrm{var}(\mu_i) = V_\mu = \alpha/\beta^2 $$

(e.g. Clayton and Kaldor, 1987). When this parameterisation includes offsets oi, the posterior
$p(\mu_i, \alpha, \beta|y)$ has the form

$$ L(\alpha, \beta, \mu|y)\, p(\alpha, \beta) = \left\{ \prod_{i=1}^{n} \frac{\exp(-\mu_i o_i)(\mu_i o_i)^{y_i}}{y_i!} \right\} \left[ \frac{\beta^{\alpha}}{\Gamma(\alpha)} \right]^{n} \left[ \prod_{i=1}^{n} \mu_i \right]^{\alpha - 1} \exp\left( -\beta \sum_{i=1}^{n} \mu_i \right) p(\alpha, \beta), $$

with conditional posterior for μi now $Ga(\alpha + y_i, \beta + o_i)$. Hence the posterior mean is

$$ E(\mu_i|y_i, \alpha, \beta) = \frac{y_i + \alpha}{o_i + \beta}. $$
One may define reliabilities (Staggs and Gajewski, 2017) using Vμ and the conditional variances
$\mathrm{var}((y_i/o_i)|\mu_i, o_i) = \mu_i/o_i$. Reliabilities in unit rates are estimated as:

$$ \frac{V_\mu}{V_\mu + \mu_i/o_i}. $$

So higher reliabilities attach to units with larger offsets.
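The conjugate update with offsets and the reliability measure can be sketched directly from the formulas above. The hyperparameter values and the counts below are hypothetical, and the function names are assumptions rather than identifiers from the book's code.

```python
def posterior_mean_rate(y, offset, alpha, beta):
    """Mean of the Ga(alpha + y, beta + offset) conditional posterior for mu_i."""
    return (y + alpha) / (offset + beta)


def reliability(mu, offset, alpha, beta):
    """V_mu / (V_mu + mu / offset), with V_mu = alpha / beta^2."""
    v_mu = alpha / beta ** 2
    return v_mu / (v_mu + mu / offset)


alpha, beta = 2.0, 2.0  # hypothetical hyperparameters (prior mean rate 1)
y, offset = 3, 1.5      # hypothetical observed and expected counts
crude_rate = y / offset
shrunk_rate = posterior_mean_rate(y, offset, alpha, beta)

small_unit = reliability(mu=1.0, offset=1.0, alpha=alpha, beta=beta)
large_unit = reliability(mu=1.0, offset=10.0, alpha=alpha, beta=beta)
print(round(shrunk_rate, 3), round(small_unit, 3), round(large_unit, 3))
```

The posterior mean lies between the prior mean α/β and the crude rate y/o, and the unit with the larger offset has the higher reliability, as stated in the text.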
The conditional likelihoods (George et al., 1993, p.191) for α and β under this structure are
obtained from $L(\alpha, \beta, \mu|y)$, namely

$$ L(\alpha|\beta, \mu) = k_\alpha \left( \frac{\beta^{\alpha}}{\Gamma(\alpha)} \right)^{n} \left( \prod_{i=1}^{n} \mu_i \right)^{\alpha - 1}, $$

and

$$ L(\beta|\alpha, \mu) = k_\beta \beta^{n\alpha} \exp\left( -\beta \sum_{i=1}^{n} \mu_i \right), $$

where kα and kβ are normalising constants. Hence L(β|α,μ) is gamma with parameters
$n\alpha + 1$ and $\sum_{i=1}^{n} \mu_i$. The conditional posteriors $p(\alpha|\beta, \mu) \propto L(\alpha|\beta, \mu)p(\alpha)$ and
$p(\beta|\alpha, \mu) \propto L(\beta|\alpha, \mu)p(\beta)$ are log-concave when the priors p(α) and p(β) are log-concave.
Assuming a gamma prior $p(\beta) = Ga(c, d)$, the full conditional for β is $Ga(n\alpha + c, \sum_{i=1}^{n} \mu_i + d)$,
since the prior contributes a factor $\beta^{c-1}\exp(-d\beta)$.
However, the full conditional for α is non-standard, whatever form for p(α) is adopted.
Another Poisson-gamma mixture formulation (e.g. Albert, 1999; Christiansen and
Morris, 1996) assumes

$$ y_i|\lambda_i \sim Po(o_i\lambda_i), $$

$$ \lambda_i \sim Ga\left(\zeta, \frac{\zeta}{\mu_i}\right), $$

where $V(\lambda_i) = \mu_i^2/\zeta$ and the Poisson corresponds to ζ → ∞. If μi = μ and a gamma prior is
assumed for μ, then the posterior mean for λi conditional on μ and ζ is

$$ E(\lambda_i|y, \zeta, \mu) = \frac{y_i + \zeta}{o_i + \zeta/\mu} = B_i\mu + (1 - B_i)\frac{y_i}{o_i}, $$

where

$$ B_i = \frac{\zeta}{\zeta + o_i\mu}, $$

measures the level of shrinkage towards the overall mean μ. Thus, shrinkage will be
greater when oi (e.g. the population at risk in a mortality application) is small, or when
ζ is large. As for the second-stage variance in the normal-normal model, the prior on ζ
influences the degree of shrinkage that is obtained. Let $r_i = y_i/o_i$. Then Christiansen and
Morris (1996) suggest a uniform prior based on the average shrinkage factor

$$ B_0 = \frac{\zeta}{\zeta + \min(o_i)\bar{r}} \sim U(0, 1), $$

with the prior value of ζ then obtained as $B_0 \min(o_i)\bar{r}/(1 - B_0)$.
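The shrinkage form of the posterior mean, and the mapping between the average shrinkage factor and ζ, can be verified with a short sketch. All numerical inputs are hypothetical, and the third argument of `zeta_from_b0` stands for the rate summary multiplying min(oi) in the prior, whose exact definition is taken here as an assumption.

```python
def shrinkage_factor(zeta, offset, mu):
    """B_i = zeta / (zeta + o_i * mu)."""
    return zeta / (zeta + offset * mu)


def posterior_mean_lambda(y, offset, zeta, mu):
    """E(lambda_i | y, zeta, mu) = (y_i + zeta) / (o_i + zeta / mu)."""
    return (y + zeta) / (offset + zeta / mu)


def zeta_from_b0(b0, min_offset, rate_summary):
    """Invert B_0 = zeta / (zeta + min(o_i) * rate) to recover zeta."""
    return b0 * min_offset * rate_summary / (1.0 - b0)


zeta, mu = 5.0, 1.2  # hypothetical precision-type parameter and overall mean
y, offset = 4, 2.5   # hypothetical observed count and exposure

b_i = shrinkage_factor(zeta, offset, mu)
direct = posterior_mean_lambda(y, offset, zeta, mu)
weighted = b_i * mu + (1.0 - b_i) * y / offset

b0 = 0.8
zeta_prior = zeta_from_b0(b0, min_offset=0.5, rate_summary=1.0)
print(round(b_i, 3), round(direct, 3), round(zeta_prior, 3))
```

The two expressions for the posterior mean agree exactly, and drawing B0 from U(0, 1) and transforming it in this way induces the prior on ζ.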
Extended parameterisations of the negative binomial have been suggested (Liu and Dey,
2007). Winkelmann and Zimmermann (1991) suggest a variance function

$$ V(y_i) = E[V(y_i|\mu_i)] + V[E(y_i|\mu_i)] = \xi + \phi\xi^{k+1}, $$

with k ≥ −1, and obtained by taking $\mu_i \sim Ga(\xi^{1-k}/\phi, \xi^{-k}/\phi)$. Setting k = 0 and k = 1 leads to
what are called NB1 and NB2 forms of the negative binomial, under which the variances
are linear and quadratic in ξ, namely $V(y_i) = \xi + \phi\xi$ and $V(y_i) = \xi + \phi\xi^2$ respectively.

4.6.1 Non-Conjugate Poisson Mixing


Alternatives to the conjugate model are the Poisson lognormal model, and models such as
the generalised Poisson density, zero-inflated Poisson, and the hurdle model adapted to
different types of departure from the typical Poisson frequency pattern. The Poisson log-
normal (PLN) model has been suggested as more appropriate than the conjugate mixture
in certain applications, such as species abundance – see Bulmer (1974) and Diserud and
Engen (2000). The Poisson lognormal representation may be beneficial in terms of robust-
ness to contamination or outliers, as the tails of the lognormal are heavier than for the
gamma distribution (Connolly et al., 2009; Wang and Blei, 2017).
The PLN model is obtained for $y_i \sim Po(\mu_i)$ when μi are lognormally distributed, or equivalently
when the logarithms $w_i = \log(\mu_i)$ of the Poisson means are assumed normal with mean
M and variance V (Aitchison and Ho, 1989). The marginal density under lognormal mixing is
obtained by integrating the sampling density over the domain of the Poisson mean, namely

$$ p(y_i|M, V) = \frac{(2\pi V)^{-0.5}}{y_i!} \int_0^{\infty} \mu_i^{y_i - 1} e^{-\mu_i} \exp\left[ \frac{-(\log\mu_i - M)^2}{2V} \right] d\mu_i, $$

with marginal mean $e^{M + V/2}$ and marginal variance $e^{M + V/2} + e^{2M + V}[e^{V} - 1]$, where the second
term is the variance of the μi. As V → 0, this
reduces to a Poisson density. An alternative parameterisation (Weems and Smith, 2004)
has $y_i \sim Po(\mu_i U_i)$ with $\log(\mu_i) = \beta_0 + \beta_1 x_{1i} + \ldots + \beta_p x_{pi}$, and $\log(U_i) \sim N(1, V)$.
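Since the Poisson lognormal density has no closed form, it is usually evaluated numerically. The sketch below integrates over w = log(μ) with a simple trapezoid rule (so that the Jacobian turns $\mu^{y-1}$ into $\mu^{y}$), then checks the total mass and the first two moments. The values M = 1, V = 0.5, the grid settings and the truncation of the support are assumptions made for illustration.

```python
import math


def pln_pmf(y, M, V, n_grid=801, width=8.0):
    """Poisson lognormal pmf by trapezoid integration over w = log(mu).

    Integrates Po(y | exp(w)) * N(w | M, V) over M +/- width standard deviations.
    """
    sd = math.sqrt(V)
    lower = M - width * sd
    step = 2.0 * width * sd / (n_grid - 1)
    total = 0.0
    for j in range(n_grid):
        w = lower + j * step
        log_poisson = y * w - math.exp(w) - math.lgamma(y + 1)
        log_normal = -0.5 * math.log(2.0 * math.pi * V) - (w - M) ** 2 / (2.0 * V)
        weight = 0.5 if j in (0, n_grid - 1) else 1.0
        total += weight * math.exp(log_poisson + log_normal)
    return total * step


M, V = 1.0, 0.5       # hypothetical log-scale mean and variance
support = range(0, 301)  # truncated support; the omitted tail mass is negligible
probs = [pln_pmf(y, M, V) for y in support]

total = sum(probs)
mean = sum(y * p for y, p in zip(support, probs))
variance = sum((y - mean) ** 2 * p for y, p in zip(support, probs))
print(round(total, 5), round(mean, 4), round(variance, 4))
```

The numerical mean should be close to $e^{M + V/2}$, and the numerical variance close to that mean plus the lognormal variance $e^{2M + V}(e^{V} - 1)$ of the Poisson means.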
The Poisson lognormal generalises readily to multivariate count data (Chib and
Winkelmann, 2001) or to mixing with heavier tails than available under the lognormal;
for example, the log Student t with a low degrees of freedom parameter for a heavy tailed,
albeit symmetric, mixing density. Skew normal and skew Student t mixing can also be
used, since in some applications, extremes of frailty tend to be above rather than below the
centre of the density (Sahu et al., 2003).
The exchangeable Poisson lognormal model is quite widely applied to pooling infer-
ences over sets of units (e.g. hospitals) when health event totals yi such as surgical deaths
are obtained and there are oi expected events; the Poisson lognormal is also widely applied
in modelling for spatially structured disease count data (Chapter 6). The oi might be based
on multiplying the patient total for hospital i by an average event rate and are usually
assumed known (i.e. not to be subject to measurement error). If the average rate is based on
the total set of n hospitals then one has $\sum_i y_i = \sum_i o_i$, and with $\mu_i = o_i\rho_i$ one has

$$ y_i \sim Po(o_i\rho_i), $$

with the ρi interpretable as relative risks averaging 1 over all units. However, this feature is
not always present, and allowing for mean risk other than 1 (e.g. if a national surgical mortality
rate is applied to a particular set of hospitals), the Poisson lognormal then assumes

$$ \log(\rho_i) = \beta_0 + w_i, $$

where the $w_i \sim N(0, V_w)$ are exchangeable normal random effects, with relative risks ρi
pooled towards a global average rate exp(β0) according to the size of Vw. Equivalently
$v_i = \exp(w_i)$ are lognormal with mean $m = \exp(0.5V_w)$ and variance $m^2(\exp(V_w) - 1)$.
Generalised Poisson and Poisson process models are also often useful in particular settings,
including underdispersion (Consul, 1989; Scollnik, 1995; Podlich et al., 2004). The
generalised Poisson density (Consul, 1989) specifies

$$ p(y|\lambda, \rho) = \frac{\lambda(\lambda + y\rho)^{y-1}}{y!} e^{-\lambda - \rho y}, $$

with mean λ/(1 − ρ), variance λ/(1 − ρ)3 and hence variance to mean ratio $1/(1 - \rho)^2$, which is at
least 1 when 0 ≤ ρ < 1. This
reduces to a Poisson density as ρ → 0.

Example 4.5 Hospital Mortality


To exemplify the Poisson-gamma methodology, consider counts of patient deaths fol-
lowing heart transplant surgery in 131 hospitals in the US between October 1987 and
December 1989. These were analysed by Christiansen and Morris (1996, 1997). Let oi be
expected deaths (calculated by a logit regression on patient characteristics).
The first model considered is the usual Poisson-gamma mixture,

$$ y_i \sim Po(o_i\mu_i), $$

$$ \mu_i \sim Ga(\alpha, \beta), $$

where the μi are relative risks, since actual and expected deaths are equal. The hyperparameters
α and β are assigned diffuse gamma priors. The DIC for this model is 475.
To assess variations in the extent of shrinkage, one may plot the lengths of 90% credible
intervals for percentile ranks against posterior mean reliabilities $V_\mu/(V_\mu + (\mu_i/o_i))$
(Staggs and Gajewski, 2017). As expected, more precise estimates of percentile ranks are
associated with higher reliability. A mixed predictive exceedance check (Marshall and
Spiegelhalter, 2007) shows 13 observations with exceedance probabilities under 0.05 or
over 0.95.
A second model adopts the scheme of Christiansen and Morris (1996), which includes
data-based priors. Thus

$$ y_i|\mu_i \sim Po(o_i\mu_i), $$

$$ \mu_i \sim Ga\left(\zeta, \frac{\zeta}{M_\mu}\right), $$

with shrinkage factors

$$ B_i = \frac{\zeta}{\zeta + o_i M_\mu}. $$

The prior on ζ is indirect, via a uniform prior on $B_0 = \zeta/(\zeta + \min(o_i)\bar{r})$. A two-chain run
of 5,000 iterations provides a DIC of 530, and high values for both B0 and ζ, namely 0.986
and 7.81. The mixed predictive exceedance check now shows seven observations with
exceedance probabilities under 0.05 or over 0.95.
Christiansen and Morris (1996) argue that exchangeability between all 131 units might
not be applicable, since hospitals with larger patient totals have lower crude death rates.
As one remedy for such a pattern, one might take

$$y_i \sim \text{Po}(\nu_i),$$

$$\nu_i \sim \text{Ga}\left(\zeta, \frac{\zeta}{\rho_i}\right),$$

where

$$\log(\rho_i) = \beta_1 + \beta_2 \log(o_i)$$

now includes a regression on $\log(o_i)$, so that expected deaths is no longer an offset with
implicit coefficient $\beta_2 = 1$. Here, we instead split the hospitals into two groups with
indicator $G_i$, one group (with $G_i = 1$) containing 37 hospitals with under 10 patients, the
other (with $G_i = 2$) containing the remaining 94 hospitals. Different means and variance
parameters are assumed in the two groups. So

$$y_i \mid \mu_i \sim \text{Po}(o_i \mu_i),$$

$$\mu_i \sim \text{Ga}\left(\zeta_{G_i}, \frac{\zeta_{G_i}}{\mu_{G_i}}\right)$$

with uniform priors on group-specific average shrinkage factors ($k = 1, 2$)

$$B_{0k} = \frac{\zeta_k}{\zeta_k + \min(o_i; G_i = k)\rho_k}.$$

This extension to partial exchangeability produces a deviance reduction to 506. The
mean mortality relative risk (with 95% interval) is found to be 2.05 (1.33, 2.95) in the low
workload hospitals, but lower, namely 0.96 (0.83, 1.10), in the higher workload hospitals
(m.mu[] in the code). The variance factor $\zeta_k$ is higher in the low workload hospitals, but
the average shrinkages $B_{0k}$ are similar, at 0.95 and 0.93 respectively. Six observations
now have exceedance probabilities under 0.05 or over 0.95.

4.7 Binomial and Multinomial Heterogeneity


Heterogeneity in binary and categoric outcomes is commonly found in consumer and
demographic data. Among possible approaches are the beta-binomial, the logistic-normal
and generalisations of the binomial (e.g. Alanko and Duffy, 1996). Analogous methods
apply for categoric data (M > 2 categories) with the conjugate model being the multinomial-
Dirichlet. Although the Poisson-gamma mixture is widely applied to health and disease
events, the beta-binomial may also be used if populations are relatively small, and has dif-
ferent implications for shrinkage: shrinkage is greater under the Poisson-gamma (Howley
and Gibberd, 2003). Binomial and multinomial mixture methods have recently become
popular in the analysis of ecologic problems where marginals of a contingency table are
available, often from different sources such as census and voting data, but the internal
cells are unobserved (King, 1997; King et al., 2004). They may also be applied in meta-anal-
ysis, avoiding normal approximations (Bakbergenuly and Kulinskaya, 2017; Kulinskaya
and Olkin, 2014).
For binomial data $y_i \sim \text{Bin}(N_i, \pi_i)$, $i = 1, \ldots, n$, the exponential family parameterisation
sets

$$B(b_i) = N_i \log(1 + e^{b_i}),$$

in (4.1), where $\pi_i = e^{b_i}/(1 + e^{b_i})$, $a(\phi_i) = 1$, and $c(y_i, \phi_i) = \log\binom{N_i}{y_i}$. Then equation (4.2) has the
form

$$p(b_i \mid \psi) = k_1 \exp\left(b_i g_1(\psi) - N_i \log(1 + e^{b_i})\, g_2(\psi)\right)
= k_1 \left(\frac{e^{b_i}}{1 + e^{b_i}}\right)^{g_1(\psi)} \left(1 + e^{b_i}\right)^{-N_i g_2(\psi) + g_1(\psi)},$$

namely a beta density for $\pi_i$ with parameters $g_1(\psi)$ and $N_i g_2(\psi) - g_1(\psi)$. The conditional
posterior of $\pi_i$ is then also a beta with parameters $g_1(\psi) + y_i$ and $N_i[g_2(\psi) + 1] - g_1(\psi) - y_i$.
The marginal density is the beta-binomial with

$$p(y_i \mid g_1, g_2) = \binom{N_i}{y_i} \frac{\text{Be}(g_1 + y_i, N_i(g_2 + 1) - (g_1 + y_i))}{\text{Be}(g_1, N_i g_2 - g_1)}.$$

Shrinkage effects are apparent under the beta mixing parameterisation

$$\pi_i \sim \text{Be}(\gamma\rho, \gamma(1 - \rho)),$$

with mean $\rho \in (0, 1)$, and where $\gamma > 0$, termed the spread parameter by Howley and Gibberd
(2003), is inversely related to the prior variance of the proportions $\rho(1 - \rho)/(1 + \gamma)$. The
conditional posterior for $\pi_i$ is

$$\pi_i \sim \text{Be}(\gamma\rho + y_i, \gamma(1 - \rho) + N_i - y_i),$$

and the posterior mean is

$$E(\pi_i \mid y, \gamma, \rho) = \frac{\gamma}{\gamma + N_i}\,\rho + \frac{N_i}{\gamma + N_i}\left(\frac{y_i}{N_i}\right),$$

namely a weighted average of the observed rate and the prior mean rate. Shrinkage to the
prior mean is greater when $\gamma$ is large and for small populations $N_i$. The marginal density is

$$p(y_i \mid \gamma, \rho) = \binom{N_i}{y_i} \frac{\text{Be}(\gamma\rho + y_i, \gamma(1 - \rho) + N_i - y_i)}{\text{Be}(\gamma\rho, \gamma(1 - \rho))}
= \binom{N_i}{y_i} \frac{\Gamma(\gamma\rho + y_i)\Gamma(\gamma(1 - \rho) + N_i - y_i)\Gamma(\gamma)}{\Gamma(\gamma\rho)\Gamma(\gamma(1 - \rho))\Gamma(\gamma + N_i)},$$

with expectation $E(y_i) = E[E(y_i \mid \pi_i)] = E(N_i\pi_i) = N_i\rho$, and variance

$$V(y_i) = V[E(y_i \mid \pi_i)] + E[V(y_i \mid \pi_i)] = N_i\rho(1 - \rho)\left[\frac{\gamma + N_i}{\gamma + 1}\right],$$

so that $\gamma \to \infty$ corresponds to the binomial density.
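The shrinkage weights and the marginal moments of the beta-binomial can be verified numerically. The sketch below, in plain Python, evaluates the marginal density through log-gamma functions and compares its mean and variance with $N_i\rho$ and $N_i\rho(1-\rho)(\gamma + N_i)/(\gamma + 1)$; the values of $N_i$, $\gamma$ and $\rho$ are illustrative only.

```python
import math

def log_beta(a, b):
    """Log of the beta function Be(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def beta_binomial_pmf(y, n, gamma, rho):
    """Marginal density of y under pi ~ Be(gamma*rho, gamma*(1-rho)), y | pi ~ Bin(n, pi)."""
    log_choose = math.lgamma(n + 1) - math.lgamma(y + 1) - math.lgamma(n - y + 1)
    return math.exp(log_choose
                    + log_beta(gamma * rho + y, gamma * (1 - rho) + n - y)
                    - log_beta(gamma * rho, gamma * (1 - rho)))

def posterior_mean(y, n, gamma, rho):
    """Weighted average of prior mean rho and observed rate y/n."""
    w = gamma / (gamma + n)
    return w * rho + (1 - w) * y / n

n, gamma, rho = 20, 5.0, 0.3   # illustrative values
probs = [beta_binomial_pmf(y, n, gamma, rho) for y in range(n + 1)]
mean = sum(y * p for y, p in enumerate(probs))
var = sum((y - mean) ** 2 * p for y, p in enumerate(probs))
print(mean, n * rho)
print(var, n * rho * (1 - rho) * (gamma + n) / (gamma + 1))
print(posterior_mean(2, n, gamma, rho), (gamma * rho + 2) / (gamma + n))
```

The last line confirms that the weighted-average form agrees with the mean of the conjugate beta posterior.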


Quintana and Tam (1996) consider both marginal and conditional likelihood MCMC
estimation approaches to the beta-binomial. With beta mixing according to $\pi_i \sim \text{Be}(a, b)$,
and prior $p(a, b)$, they apply Hastings sampling to the joint marginal likelihood (with $\pi_i$
integrated out)

$$L(a, b \mid y) \propto \left[\frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)}\right]^n \left[\prod_i \frac{\Gamma(a + y_i)\Gamma(b + N_i - y_i)}{\Gamma(a + b + N_i)}\right] p(a, b),$$

and mixed Gibbs–Hastings sampling to the joint conditional likelihood

$$L(a, b, \pi \mid y) \propto \left[\frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)}\right]^n \left[\prod_i \pi_i^{a + y_i - 1}(1 - \pi_i)^{b + N_i - y_i - 1}\right] p(a, b).$$

They also consider implications for posterior parameter correlation of the reparameterisation
(Lee and Sabavala, 1987)

$$\pi_i \sim \text{Be}(\mu, \eta),$$

$$\mu = a/(a + b),$$

$$\eta = 1/(1 + a + b),$$

where $\eta$ is a measure of heterogeneity.

4.7.1 Non-Conjugate Priors for Binomial Mixing


Alternatives to conjugate beta mixing are the binomial with normal errors in the link,
generalised binomial models (Makuch et al., 1989), generalised beta-binomial models
(Rodriguez-Avi et al., 2007), and models adapted to departures from the typical binomial
frequency pattern, such as zero-inflated binomial models. The logistic-normal model with
normal random effects in the logit link specifies

$$y_i \mid \pi_i \sim \text{Bin}(N_i, \pi_i),$$

$$\text{logit}(\pi_i) = b_i,$$

$$b_i \mid \mu, \tau \sim N(\mu, \tau^2).$$

Here $\pi_i$ then follows a logistic-normal density,

$$p(\pi_i \mid \mu, \tau^2) = \frac{1}{\tau\sqrt{2\pi}} \exp\left(-\frac{1}{2\tau^2}\left[\log\frac{\pi_i}{1 - \pi_i} - \mu\right]^2\right) \frac{1}{\pi_i(1 - \pi_i)}.$$

The logistic-normal prior with τ = 2.67 and μ = 0 matches a Jeffreys prior on πi in the first
two moments, and setting τ = 1.69 matches the uniform prior in the first two moments
(Agresti and Hitchcock, 2005). As for the Poisson lognormal, one may generalise to heavier
tailed or skewed mixing densities. Teather (1984) proposes a family of symmetric prior
densities for logit(πi) that includes the normal and double exponential as special cases.
Alternative links (e.g. probit) or mixing over links are possible.
In many applications (e.g. studies with patients allocated to multiple treatments), the random
effect variation represents differential frailty in the patient population of the
study, so that for studies $i = 1, \ldots, n$ with $k = 1, \ldots, K$ treatment categories

$$y_{ik} \sim \text{Bin}(N_{ik}, \pi_{ik}),$$

$$\text{logit}(\pi_{ik}) = b_i + \beta_k,$$

$$b_i \sim N(0, \tau^2),$$

where the $\beta_k$ are fixed treatment effects, while the $b_i$ can be interpreted as between-study
variation in treatment effects. For example, Gao (2004) considers this structure for data
from Winship (1978) on a meta-analysis of eight randomised clinical trials comparing
healing rates in duodenal ulcer patients. For trials with treatment and control arms only,
with patient totals $\{N_{iT}, N_{iC}\}$, the logistic-normal model is often applied in meta-analysis
when trial totals are small, rather than adopting a normal approximation (Warn et al.,
2002; Parmigiani, 2002). In fact, other links (combined with binomial sampling) may be
more useful for clinical interpretability.
The prior structure often focuses on the control arm probabilities $\pi_{iC}$, and on differences
between treatment and control group probabilities. Thus assume

$$y_{iT} \sim \text{Bin}(N_{iT}, \pi_{iT}),$$

$$y_{iC} \sim \text{Bin}(N_{iC}, \pi_{iC}).$$

Then analysis of treatment-control differences $\delta_i$ on the log odds ratio scale would involve
transforms $\omega_{iT} = \text{logit}(\pi_{iT})$ and $\omega_{iC} = \text{logit}(\pi_{iC})$, and taking

$$\delta_i = \omega_{iT} - \omega_{iC},$$

one might assume

$$\delta_i \sim N(\Delta, \sigma_\delta^2).$$

For the $\pi_{iC}$, random effect options might be to take $\omega_{iC} \sim N(\mu_C, \tau_C^2)$, with $\{\mu_C, \tau_C^2\}$ as
additional unknowns, or $\pi_{iC} \sim \text{Be}(a_C, b_C)$ with $\{a_C, b_C\}$ additional unknowns.
Consider instead a log link, so that $\omega_{iT} = \log(\pi_{iT})$ and $\omega_{iC} = \log(\pi_{iC})$, again with
$\delta_i \sim N(\Delta, \sigma_\delta^2)$. The $\delta_i$ now measure log relative risks, which are often more clinically useful
than log odds ratios, and $\exp(\Delta)$ will measure the relative risk of (say) recurrence or
mortality under the treatment. In practice, sampling has to be constrained to ensure $\delta_i$ is less
than $-\log(\pi_{iC})$, so that

$$\omega_{iT} = \omega_{iC} + \min(\delta_i, -\log(\pi_{iC})),$$

$$\delta_i \sim N(\Delta, \sigma_\delta^2).$$

Similarly, for a risk difference analysis

$$\omega_{iC} = \pi_{iC},$$

$$\pi_{iT} = \omega_{iT} = \delta_i + \omega_{iC},$$

$$\delta_i \sim N(\Delta, \sigma_\delta^2),$$

sampling has to be constrained to ensure that $\pi_{iT} \in [0, 1]$. This involves confining $\delta_i$ to the
interval $[-\pi_{iC}, 1 - \pi_{iC}]$ with the actually sampled model specifying

$$\omega_{iT} = \omega_{iC} + \min(\max(\delta_i, -\pi_{iC}), 1 - \pi_{iC}).$$

If the control group probabilities are regarded as proxies for the underlying risk of subjects
in a study, then the model involves a regression on centred control group effects, namely

$$\omega_{iT} = \omega_{iC} + \delta_i + \beta(\omega_{iC} - \bar{\omega}_C),$$

$$\delta_i \sim N(\Delta, \sigma_\delta^2),$$

where $\bar{\omega}_C$ is the average of the control arm effects (calculated at each iteration), and $\beta$ is an
extra unknown.
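The three effect scales and their constraints can be summarised in one helper. The following is a minimal sketch in plain Python that maps a control probability and a sampled effect $\delta_i$ to a treatment probability under the logit, log and identity links, applying the truncations given above; the function name and scale labels are illustrative.

```python
import math

def treatment_prob(pi_c, delta, scale):
    """Treatment-arm probability implied by control probability pi_c and effect delta."""
    if scale == "LOR":   # log odds ratio: no constraint needed
        omega_t = math.log(pi_c / (1 - pi_c)) + delta
        return 1.0 / (1.0 + math.exp(-omega_t))
    if scale == "LRR":   # log relative risk: delta capped at -log(pi_c)
        return math.exp(math.log(pi_c) + min(delta, -math.log(pi_c)))
    if scale == "ARD":   # absolute risk difference: delta confined to [-pi_c, 1 - pi_c]
        return pi_c + min(max(delta, -pi_c), 1 - pi_c)
    raise ValueError("scale must be 'LOR', 'LRR' or 'ARD'")

# Hypothetical control probabilities and sampled effects, including extreme draws.
checks = [(pi_c, delta, scale, treatment_prob(pi_c, delta, scale))
          for pi_c in (0.1, 0.5, 0.8)
          for delta in (-1.0, -0.2, 0.5, 1.0)
          for scale in ("LOR", "LRR", "ARD")]
for pi_c, delta, scale, pi_t in checks:
    print(f"{scale}: pi_C={pi_c:.1f} delta={delta:+.1f} -> pi_T={pi_t:.3f}")
```

Without the truncation, a draw such as $\delta_i = 1$ with $\pi_{iC} = 0.8$ would give a treatment probability above 1 on the log and identity scales.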

4.7.2 Multinomial Mixtures
For representing overdispersion in multinomial data with M categories

$$(y_{i1}, \ldots, y_{iM}) \sim \text{Mult}(N_i, [\pi_{i1}, \ldots, \pi_{iM}]),$$

$$N_i = \sum_m y_{im},$$

the beta prior generalises to a Dirichlet prior with parameters $(\alpha_1, \ldots, \alpha_M)$. With
$\pi_i = [\pi_{i1}, \ldots, \pi_{iM}]$ and $A = \sum_m \alpha_m$, one has

$$p(\pi_i \mid \alpha) = \frac{\Gamma(A)}{\prod_{m=1}^{M} \Gamma(\alpha_m)} \prod_{m=1}^{M} \pi_{im}^{\alpha_m - 1},$$

so that prior means for $\pi_{im}$ are $\alpha_m/A$, with variances $\alpha_m(A - \alpha_m)/[A^2(A + 1)]$. The posterior
density for $[\pi_{i1}, \ldots, \pi_{iM}]$ is Dirichlet with parameters $(y_{i1} + \alpha_1, \ldots, y_{iM} + \alpha_M)$. Assuming equal
prior mass is assigned to all categories, namely $\alpha_1 = \alpha_2 = \ldots = \alpha_M$, there is greater shrinkage
or flattening towards an equal prior cell probability across the M categories as A increases.
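The flattening effect of a larger total mass A can be illustrated with the posterior mean of the Dirichlet, $(y_{im} + \alpha_m)/(N_i + A)$. The sketch below uses plain Python with hypothetical counts and a common prior mass for all categories.

```python
def dirichlet_posterior_mean(counts, alpha):
    """Posterior mean cell probabilities under a Dirichlet(alpha) prior."""
    total = sum(counts) + sum(alpha)
    return [(y + a) / total for y, a in zip(counts, alpha)]

def dirichlet_prior_variance(alpha):
    """Prior variances alpha_m (A - alpha_m) / [A^2 (A + 1)]."""
    A = sum(alpha)
    return [a * (A - a) / (A ** 2 * (A + 1)) for a in alpha]

counts = [12, 3, 0, 5]            # hypothetical multinomial counts, M = 4
M = len(counts)
summary = []
for a in (0.5, 5.0, 50.0):        # common prior mass per category
    post = dirichlet_posterior_mean(counts, [a] * M)
    max_dev = max(abs(p - 1.0 / M) for p in post)   # distance from equal cell probability
    summary.append((a, post, max_dev))
    print(a, [round(p, 3) for p in post], round(max_dev, 3))
```

As the common mass grows, the posterior means move away from the observed proportions and towards $1/M$.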
Greater flexibility may be provided by a multivariate generalisation of the logistic-normal
prior (Aitchison and Shen, 1980; Hoff, 2003). Thus with $(y_{i1}, \ldots, y_{iM}) \sim \text{Mult}(N_i, [\pi_{i1}, \ldots, \pi_{iM}])$,

$$\pi_{ij} = \frac{\exp(b_{ij})}{\sum_{m=1}^{M} \exp(b_{im})},$$

where the vector $(b_{i1}, \ldots, b_{i,M-1})$ of the first $M - 1$ effects is multivariate normal with mean
$\mu_i = (\mu_{i1}, \ldots, \mu_{i,M-1})$ and covariance matrix $\Sigma$ of dimension $M - 1$. For the reference category,
one sets $b_{iM} = 0$. If the categories are ordered and similarity of probabilities in adjacent
categories is expected on substantive grounds, the covariance matrix or its inverse may
be stipulated in line with a low order autoregressive form; this is known as “histogram
smoothing” (Leonard, 1973).
Another generalisation is to add a higher stage prior on the Dirichlet parameters, for
example, on the total mass A. Thus, Albert and Gupta (1982) consider a two-stage prior in
multinomial-Dirichlet analysis of contingency tables. With the reparameterisation $\alpha_m = A\rho_m$
where $\sum_m \rho_m = 1$, one possible hierarchical prior generalises the binomial-beta with

$$\pi_i = [\pi_{i1}, \ldots, \pi_{iM}] \sim \text{Dir}(A\rho_1, \ldots, A\rho_M),$$

$$A \sim \text{Ga}(a_A, b_A),$$

$$(\rho_1, \ldots, \rho_M) \sim \text{Dir}(w_1, \ldots, w_M),$$

where the $w_m$ and $\{a_A, b_A\}$ are known.

4.7.3 Ecological Inference Using Mixture Models


Binomial-beta and multinomial-Dirichlet models (or non-conjugate alternatives) have
recently found wide application in ecological inference. Much of the impetus for this
research has come from political science, and may involve counts of a behaviour or event
for unit i (e.g. constituency) with $m = 1, \ldots, M$ outcomes (e.g. party voting affiliation) by
demographic attribute with $c = 1, \ldots, C$ levels (e.g. social class, ethnic group). The
underlying data are the totals $N_{imc}$. What is observed in practice are the marginals $N_{im+}$ (e.g.
constituency voting data by party voted for), and information from another source (e.g.
from the census) on the relative distribution of the voting age population across levels of
the demographic attribute. This is proxy information regarding the ratios $x_{ic} = N_{i+c}/N_{i++}$
(which might be census-based percentages of the voting population in different ethnic
groups).
Consider the simplest case, ecological inference in 2 × 2 tables. Suppose the observations
are the total electorate $N_i$, and the number who turn out $V_i$. So M = 2, for voting and not
voting. Also available from census data is the proportion $x_i$ of the voting age population who
are black. Given this information, the goal of ecological inference is to estimate parameters
governing the internal table cells, namely the proportions $r_{i1}$ and $r_{i2}$ of black and white
voters who turned out. Since M = 2, the data are binomial, and the overall turnout rate in
area i is modelled as $V_i \sim \text{Bin}(N_i, \pi_i)$. Modelling the turnout rates in terms of ethnic-specific
voting rates proceeds using the probabilistic statement

$$\Pr(\text{Turnout}) = \Pr(\text{Turnout} \mid \text{Black})\Pr(\text{Black}) + \Pr(\text{Turnout} \mid \text{White})\Pr(\text{White}),$$

with the corresponding relation in area i being

$$\pi_i = r_{i1}x_i + r_{i2}(1 - x_i).$$

Among possible priors for the unknown $r_{i1}$ and $r_{i2}$ in a 2 × 2 ecological problem are:

a) Independent beta densities $r_{i1} \sim \text{Be}(a_1, b_1)$, $r_{i2} \sim \text{Be}(a_2, b_2)$;
b) A bivariate normal for $\omega_{ij} = \text{logit}(r_{ij})$, with mean $\mu = (\mu_1, \mu_2)$ and covariance $\Sigma$,
allowing $\{r_{i1}, r_{i2}\}$ to be correlated; and
c) A trivariate normal for $\omega_{i1} = \text{logit}(r_{i1})$, $\omega_{i2} = \text{logit}(r_{i2})$, and $\omega_{i3} = \text{logit}(x_i)$.

Imai et al. (2008) typify ecological missing data as data “coarsening,” and the first two
priors above are consistent with coarsening at random. By contrast, the final option amounts
to modelling the joint density $p(x, r)$ of racial composition x and turnout behaviour $r = (r_1, r_2)$
via the sequence $p(x \mid r)p(r)$. This is similar to joint modelling of missingness and observed
data in non-random models for missing data (Pastor, 2003) and hence may be termed
“coarsening not at random.” If predictors of turnout rates are available, then the means $\mu_{i1}$
and $\mu_{i2}$ include regression terms.
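The data-generating structure under prior option (a) can be sketched by forward simulation. The code below, in plain Python, draws group-specific turnout rates from independent beta densities, forms the area turnout rate through the accounting identity, and samples the observed turnout; the beta parameters, electorate size and racial composition are hypothetical.

```python
import random

def simulate_area(n_i, x_i, a1, b1, a2, b2, rng):
    """Simulate one area: latent group turnout rates and observed total turnout."""
    r1 = rng.betavariate(a1, b1)          # turnout rate among black voters (unobserved)
    r2 = rng.betavariate(a2, b2)          # turnout rate among white voters (unobserved)
    pi = r1 * x_i + r2 * (1 - x_i)        # overall turnout rate in the area
    v = sum(rng.random() < pi for _ in range(n_i))   # V_i ~ Bin(N_i, pi_i)
    return r1, r2, pi, v

rng = random.Random(7)
areas = [simulate_area(1000, x, 2.0, 3.0, 4.0, 2.0, rng) for x in (0.1, 0.3, 0.6)]
for r1, r2, pi, v in areas:
    print(f"r1={r1:.2f} r2={r2:.2f} pi={pi:.2f} V={v}")
```

Only the total turnout and the composition $x_i$ would be available to the analyst, which is why the prior on $r_{i1}$ and $r_{i2}$ carries much of the identifying information.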

Example 4.6 Breast Cancer Recurrence; Binomial Meta-Analysis


Parmigiani (2002, p.127) considers 14 trials concerning the impact of tamoxifen on
breast cancer recurrence rates. The trials are mostly large and a normal approximation
might well be applied, though one trial involved only 20 patients. A binomial analysis
is adopted with

$$y_{iT} \sim \text{Bin}(N_{iT}, \pi_{iT}),$$

$$y_{iC} \sim \text{Bin}(N_{iC}, \pi_{iC}).$$

A beta density is assumed for the control group rates, namely

$$\pi_{iC} \sim \text{Be}(a_C, b_C),$$

with uniform priors on the unknowns, $a_C \sim U(1, 100)$ and $b_C \sim U(1, 100)$. Different
comparison scales can be defined. For example, on the log odds ratio scale

$$\omega_{iT} = \text{logit}(\pi_{iT}),$$

$$\omega_{iC} = \text{logit}(\pi_{iC}),$$

$$\delta_i = \omega_{iT} - \omega_{iC},$$

with treatment effects assumed normal

$$\delta_i \sim N(\Delta, \sigma_\delta^2),$$

and diffuse normal and inverse gamma priors on $\Delta$ and $\sigma_\delta^2$ respectively. Under an
absolute risk difference scale, one has instead

$$\omega_{iT} = \pi_{iT},$$

$$\omega_{iC} = \pi_{iC},$$

$$\delta_i = \omega_{iT} - \omega_{iC},$$

with appropriate constraints (Warn et al., 2002) to ensure $\pi_{iT} \in [0, 1]$.


To provide a summary index of treatment benefit, the treatment gain $\delta_{new}$ for a
hypothetical new trial is sampled and added to a predicted baseline recurrence rate $\pi_{new,C}$
(transformed on the appropriate scale) to give a predicted new trial treatment rate, $\pi_{new,T}$. Then
the probability that the predictive relative risk $RR_{new} = \pi_{new,T}/\pi_{new,C}$ exceeds 1 is obtained.
Treatment and placebo groups are compared on three different effect scales, namely
the log odds ratio (LOR), the log relative risk (LRR), and the absolute risk difference
(ARD). On the LRR scale, the predictive density for $RR_{new}$ has 95% interval (0.79, 1.01)
with a 3.5% chance that $RR_{new}$ exceeds 1. The ARD scale admits a larger element of
doubt, with a 95% interval (0.65, 1.06) and $\Pr(RR_{new} > 1 \mid y) = 0.092$. By contrast, under a
LOR scale, $RR_{new}$ has 95% interval (0.77, 0.97), with only a 1.1% chance of exceeding 1. The
lowest DIC is for the LOR scale, namely 218. Mixed predictive checks show all observations
with exceedance probabilities $\Pr(y_{iT,new} > y_{iT} \mid y) + 0.5\Pr(y_{iT,new} = y_{iT} \mid y)$ between
0.1 and 0.9 (exc.mx in the code).
A considerably different result (providing a DIC of 222) is obtained using beta-binomial
sampling for both treatment and control arms. Consider the reparameterised
beta density $\text{Be}(\theta S, (1 - \theta)S)$ where S is the prior sample size, and $\theta$ is the prior
probability. Then $1/(S + 1)$ is an estimator of the beta-binomial intra-class correlation. Assuming
a common prior sample size (Bakbergenuly and Kulinskaya, 2017), one has

$$y_{iT} \sim \text{Bin}(N_{iT}, \pi_{iT}),$$

$$y_{iC} \sim \text{Bin}(N_{iC}, \pi_{iC}),$$

$$\pi_{iT} \sim \text{Be}(\theta_T S, (1 - \theta_T)S),$$

$$\pi_{iC} \sim \text{Be}(\theta_C S, (1 - \theta_C)S),$$

with S assigned a gamma prior, and the $\theta$ parameters themselves assumed beta
distributed. Under these assumptions, $RR_{new}$ has a 95% interval (0.3, 2.2), with a 38% chance of
exceeding 1.

Example 4.7 Adverse Effects from Terbinafine


We consider overdispersed binomial data from Young-Xu and Chan (2008) on patients
with adverse effects in treatment with an oral anti-fungal agent (terbinafine) for onycho-
mycosis and dermatophytosis. These data raise issues regarding the accommodation
(i.e. satisfactory representation within the model) of extreme values, as against identify-
ing poorly fitted cases (Marshall and Spiegelhalter, 2007). Another distinction can be
made, namely, between extreme values and outliers, in areas such as pharmaceutical
testing. Thus Walfish (2006) suggests that

“an outlier is defined as an observation that appears to be inconsistent with other
observations in the data set. An outlier has a low probability that it originates from
the same statistical distribution as the other observations in the data set. On the
other hand, an extreme value is an observation that might have a low probability of
occurrence but cannot be statistically shown to originate from a different distribution
than the rest of the data.”

There are 41 studies in the terbinafine analysis, and each study has $N_i$ patients and $y_i$
patients with adverse reactions. The binomial logit-normal (BLN) representation

$$y_i \mid \pi_i \sim \text{Bin}(N_i, \pi_i),$$

$$\text{logit}(\pi_i) = \mu + b_i,$$

$$b_i \mid \tau \sim N(0, \tau^2),$$

is compared with the beta-binomial,

$$y_i \mid \pi_i \sim \text{Bin}(N_i, \pi_i),$$

$$\pi_i \sim \text{Be}(\alpha, \beta).$$

The latter can also be represented directly in rstan using the beta_binomial density.
Pooling across all studies, 111 of 3002 patients have adverse effects (around 3.7%).
The mixed replicate checking scheme is used to identify poorly fitted cases. In
the BLN representation, this is implemented by sampling replicate normal random
effects $b_{rep,i}$, and then the corresponding predicted totals of adverse reactions.
Poorly fitted cases are identified by extreme exceedance probabilities, namely
$p.exc_i = \Pr(y_{rep,i} > y_i \mid Y) + 0.5\Pr(y_{rep,i} = y_i \mid Y)$ under 0.05 or over 0.95. Two studies (19, 38)
are identified as poorly fitted (potential outliers), with study 19, containing 186 patients
but 0 adverse effects, having $p.exc_i = 0.96$.
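The mixed replicate check can be sketched as follows in plain Python. For each posterior draw a fresh random effect is sampled, a replicate count is generated, and the mid-p exceedance probability is computed from the replicates. The stand-in "posterior draws" of the intercept and random effect standard deviation below are hypothetical values chosen for illustration, not output from the fitted model.

```python
import math
import random

def exceedance_probability(replicates, y_obs):
    """Mid-p exceedance: Pr(y_rep > y_obs) + 0.5 * Pr(y_rep = y_obs)."""
    greater = sum(r > y_obs for r in replicates)
    equal = sum(r == y_obs for r in replicates)
    return (greater + 0.5 * equal) / len(replicates)

def mixed_replicates_bln(n_i, mu_draws, tau_draws, rng):
    """One replicate count per posterior draw, using a newly sampled random effect."""
    reps = []
    for mu, tau in zip(mu_draws, tau_draws):
        b_rep = rng.gauss(0.0, tau)                       # replicate random effect
        pi_rep = 1.0 / (1.0 + math.exp(-(mu + b_rep)))    # inverse logit
        reps.append(sum(rng.random() < pi_rep for _ in range(n_i)))
    return reps

rng = random.Random(2020)
# Hypothetical stand-ins for posterior draws of mu and tau.
mu_draws = [rng.gauss(-3.3, 0.15) for _ in range(1000)]
tau_draws = [abs(rng.gauss(0.8, 0.1)) for _ in range(1000)]
reps = mixed_replicates_bln(186, mu_draws, tau_draws, rng)   # study with 186 patients
p_exc = exceedance_probability(reps, y_obs=0)
print(round(p_exc, 3))
```

Because the replicate uses a new random effect rather than the fitted $b_i$, the check is less conservative than a posterior predictive check based on the study's own effect.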
In the hierarchical beta-binomial representation, we sample replicate $\pi_{rep,i}$ and then
the corresponding predicted adverse reactions. Now three studies are identified as
problematic: study 19 with $p.exc_i = 0.96$, and two studies (33 and 38) with relatively
high adverse reaction totals. Inferences regarding the mean adverse rate are similar
between the two approaches: the beta-binomial adverse mean rate is 3.44%, compared
to 3.45% under the binomial logit-normal (based on averaging over all samples of all $\pi_i$).
However, sampled $\pi_i$ under the binomial logit-normal show greater positive skew than
the beta-binomial (1.61 vs 1.22), reflecting accommodation for the higher rates for some
studies. There is a similar contrast in outlier accommodation between the conjugate
Poisson-gamma mixture and the Poisson lognormal, as the tails of the lognormal are
heavier than for the gamma distribution (Connolly et al., 2009; Wang and Blei, 2017).

4.8 Discrete Mixtures and Semiparametric Smoothing Methods


Hierarchical models for pooled inferences or density estimation based on a single underlying
population with a specific parametric form are often a simplification. Pooling strength
applications such as meta-analysis and density estimation are often seeking to identify the
main features of the data, or to predict further observations $y_{new}$ via the predictive distribution
$p(y_{new} \mid y)$, and a single population model may not be appropriate for data exhibiting asymmetry,
multiple modes, isolated outliers, or outlier clusters (Mohr, 2006). While the standard densities
can be extended (e.g. to reflect asymmetry), mixtures of standard densities (normal and t
densities) can be used to represent a wide variety of density shapes (Everitt and Hand, 1981).
Use of a single population density model in such circumstances will provide improper pooling
and poor predictions for a new unit (Hoff, 2003). For example, a normal random-effects analy-
sis of hospital mortality rates may shrink extreme rates considerably, and this might mask
potentially unusual results for units with smaller totals of patients at risk (Ohlssen et al., 2007).
Among the principles that govern robust smoothing and regression methods for non-
standard densities are discrete mixing of densities over K > 1 subpopulations (Bohning,
1999) and various types of local regression based on kernel or smoothness priors (Muller
et al., 1996). In this chapter, the focus is on discrete mixture modelling, where the Bayesian
approach has been coupled with many recent advances. These include the Bayesian ana-
logue to non-parametric maximum likelihood estimation, with MCMC implementation
as set out by Diebolt and Robert (1994), and Richardson and Green (1997), and numer-
ous developments of the Dirichlet process methodology, as reviewed by Hanson et al.
(2005). The Bayesian approach is flexible in terms of prior structures that can be imposed
in estimation, either grounded in substantive theory, or to improve definition of the
subgroups (e.g. Robert and Mengersen, 1999). On the other hand, repeated sampling without
appropriate parameter constraints is subject to “label switching,” since labelling of the
subgroups is arbitrary (Frühwirth-Schnatter, 2001; Chung et al., 2004).

4.8.1 Finite Mixtures of Parametric Densities


In a discrete parametric mixture model, a single parametric density is typically assumed
in each subpopulation $k \in (1, \ldots, K)$, but with a different hyperparameter $\psi_k$, so that within the
kth subpopulation $y \sim p(y \mid \psi_k)$. Unobserved subgroup or allocation indicators $S_i \in (1, \ldots, K)$
describe how the units are distributed over subpopulations. These are also known as
configuration indicators (Gopalan and Berry, 1998). The joint or complete data density $p(y, S)$
can be written

$$p(y_i, S_i) = p(y_i \mid S_i)p(S_i) = p(y_i \mid \psi_{S_i})\pi_{S_i}$$

where $p(y \mid S) = p(y \mid \psi_S)$ is the density for $y_i$ conditional on $S_i$, and $\{\pi_1, \ldots, \pi_K\}$ are the prior
subgroup probabilities, with $\sum_{k=1}^{K} \pi_k = 1$. The unconditional or marginal density for a
single $y_i$ is

$$p(y_i \mid \pi_1, \ldots, \pi_K, \psi_1, \ldots, \psi_K) = \sum_{k=1}^{K} \pi_k\, p(y_i \mid \psi_k),$$

with the total marginal likelihood being

$$p(y \mid \pi, \psi) = \prod_{i=1}^{n} \sum_{k=1}^{K} \pi_k\, p(y_i \mid \psi_k).$$

Classical analysis via non-parametric maximum likelihood estimation involves maximisation
of the log of this marginal density; for example, see Rattanasiri et al. (2004) for a
disease-mapping application where $y_i$ are malaria counts and $p(y \mid \psi)$ is a Poisson density.
In MCMC applications discrete mixture models can be represented hierarchically using
the latent subpopulation indicators (Marin et al., 2005, p.462). Thus, at the highest stage or
level are the parameters $\varphi = (\pi_1, \ldots, \pi_K, \psi_1, \ldots, \psi_K)$, then the missing configuration data, the
distribution of which depends on $\varphi$,

$$S_i \sim P(S_i \mid \varphi)$$

and at the lowest (first) stage the distribution of the observations $p(y \mid \varphi, S)$ depends on both
$\varphi$ and $S = (S_1, \ldots, S_n)$. The joint distribution is therefore

$$p(y, S, \varphi) = p(y \mid S, \varphi)p(S \mid \varphi)p(\varphi).$$
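The hierarchical representation can be made concrete by simulating allocation indicators first and observations second, and by comparing the marginal likelihood with the complete data likelihood. The sketch below uses plain Python and assumes normal component densities purely for illustration; the weights, means and standard deviations are hypothetical.

```python
import math
import random

def normal_pdf(y, mu, sigma):
    """Normal density used here as the component density p(y | psi_k)."""
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def simulate_mixture(n, weights, mus, sigmas, rng):
    """Draw S_i from the prior subgroup probabilities, then y_i given S_i."""
    labels = rng.choices(range(len(weights)), weights=weights, k=n)
    ys = [rng.gauss(mus[s], sigmas[s]) for s in labels]
    return labels, ys

def marginal_loglik(ys, weights, mus, sigmas):
    """Log of prod_i sum_k pi_k p(y_i | psi_k), with the indicators summed out."""
    return sum(math.log(sum(w * normal_pdf(y, m, s)
                            for w, m, s in zip(weights, mus, sigmas))) for y in ys)

def complete_loglik(ys, labels, weights, mus, sigmas):
    """Log of prod_i pi_{S_i} p(y_i | psi_{S_i}), conditioning on the indicators."""
    return sum(math.log(weights[s] * normal_pdf(y, mus[s], sigmas[s]))
               for y, s in zip(ys, labels))

rng = random.Random(11)
weights, mus, sigmas = [0.6, 0.4], [0.0, 4.0], [1.0, 1.5]   # hypothetical values
labels, ys = simulate_mixture(200, weights, mus, sigmas, rng)
ll_marginal = marginal_loglik(ys, weights, mus, sigmas)
ll_complete = complete_loglik(ys, labels, weights, mus, sigmas)
print(ll_marginal, ll_complete)
```

Each complete data term is one of the summands in the corresponding marginal term, so the complete data log-likelihood cannot exceed the marginal log-likelihood at the same parameter values.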

4.8.2 Finite Mixtures of Standard Densities


There is a considerable literature on univariate and multivariate normal mixtures for
continuous data, and on Poisson and binomial mixtures for discrete data, with Bayesian
references including Richardson and Green (1997), Roberts et al. (1998), Militino et al.
(2001), and Hurn et al. (2003). Overdispersed or skew alternatives to the major densities
can be used in discrete mixtures instead: for continuous data, the Student t distribu-
tion involves an additional tuning parameter useful for outlier accommodation, and
greater robustness to such points may be obtained by discrete mixtures over univari-
ate and multivariate Student t densities with varying degrees of freedom (Lin et al.,
2004). Discrete mixtures of skew normal and skew Student t densities are considered
by Lin et al. (2007a,b). Lin et al. (2007a) argue that a simple normal discrete mixture
model tends to overfit when additional components are added to capture skewness in
continuous data.
Parameter sampling via MCMC is facilitated by conjugate prior choices for the mixing
density. For example, consider a univariate normal mixture with $\psi_k = (\mu_k, \sigma_k^2)$, and

$$p(y \mid \pi, \mu, \sigma) = \sum_{k=1}^{K} \pi_k \phi(y \mid \mu_k, \sigma_k),$$

where $\phi(y \mid \mu, \sigma)$ is the normal density, $N(\mu, \sigma^2)$. The conjugate prior for $\psi_k = (\mu_k, \sigma_k^2)$ takes
$\sigma_k^2 \sim IG(\nu_k/2, V_k/2)$, namely

$$p(\sigma_k^2) \propto (\sigma_k^2)^{-\nu_k/2 - 1} \exp(-V_k/2\sigma_k^2),$$

and

$$p(\mu_k \mid \sigma_k^2) = N\left(\xi_k, \frac{\sigma_k^2}{\kappa_k}\right).$$

Also assume a Dirichlet prior for the unknown mixture probabilities

$$(\pi_1, \ldots, \pi_K) \sim \text{Dir}(\alpha, \ldots, \alpha),$$

with $\alpha$ preset or possibly an extra unknown. Gibbs sampling then samples the missing
data (the allocation indicators) according to a multinomial density with probabilities at
iteration t,

$$p(S_i^{(t)} = k \mid \pi^{(t)}, \mu^{(t)}, \sigma^{(t)}) = \rho_{ik}^{(t)} = \frac{\pi_k^{(t)} \phi(y_i \mid \mu_k^{(t)}, \sigma_k^{(t)})}{\sum_{k=1}^{K} \pi_k^{(t)} \phi(y_i \mid \mu_k^{(t)}, \sigma_k^{(t)})}.$$

Let $d_{ik}^{(t)} = 1$ if $S_i^{(t)} = k$ and $d_{ik}^{(t)} = 0$ otherwise. Suppose $N_k^{(t)} = \#\{S_i^{(t)} = k\}$ is the total number of
cases with $S_i^{(t)} = k$, that $m_k^{(t)} = \sum_i d_{ik}^{(t)} y_i / N_k^{(t)}$ is the average response for these cases, and that
$E_k^{(t)} = \sum_i d_{ik}^{(t)} (y_i - m_k^{(t)})^2$ is the sum of squared errors for this subgroup. Then, with
conditioning on remaining parameters understood, the $\pi_k$ are updated according to a Dirichlet
with

$$(\pi_1^{(t)}, \pi_2^{(t)}, \ldots, \pi_K^{(t)}) \sim \text{Dir}(\alpha + N_1^{(t)}, \alpha + N_2^{(t)}, \ldots, \alpha + N_K^{(t)}),$$

the subgroup variances are sampled from an updated inverse gamma,

$$\sigma_k^{2(t)} \sim IG\left(0.5[\nu_k + N_k^{(t)}],\; 0.5\left[V_k + E_k^{(t)} + \frac{N_k^{(t)}\kappa_k}{\kappa_k + N_k^{(t)}}(\xi_k - m_k^{(t)})^2\right]\right),$$

and the subgroup means are updated according to

$$\mu_k^{(t)} \sim N\left(\frac{\kappa_k\xi_k + N_k^{(t)} m_k^{(t)}}{\kappa_k + N_k^{(t)}},\; \frac{\sigma_k^{2(t)}}{\kappa_k + N_k^{(t)}}\right).$$

Diebolt and Robert (1994) suggest stabilising adjustments to these updates to improve
convergence. A refinement is to take the mixture proportions as subject specific as in
$(\pi_{i1}, \ldots, \pi_{iK}) \sim \text{Dir}(\alpha, \ldots, \alpha)$, and in the updates for $\pi_{ik}^{(t)}$, the $N_k^{(t)}$ are replaced by binary
indicators according to which class subject i is allocated to at a particular iteration.
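The Gibbs updates just described can be assembled into a short sampler. The following is a minimal sketch in plain Python, not the book's code: it assumes common hyperparameters across components (the default values for `alpha`, `nu`, `V`, `xi` and `kappa` are arbitrary illustrative choices), initialises the means at sample quantiles, applies no stabilising adjustments, and imposes no identifying constraint, so component labels may switch.

```python
import math
import random

def gibbs_normal_mixture(ys, K, n_iter, rng, alpha=1.0, nu=2.0, V=2.0, xi=0.0, kappa=0.01):
    n = len(ys)
    srt = sorted(ys)
    pis = [1.0 / K] * K
    mus = [srt[(2 * k + 1) * n // (2 * K)] for k in range(K)]   # quantile-based start
    sig2 = [1.0] * K
    draws = []
    for _ in range(n_iter):
        # 1. Allocation indicators: multinomial with probabilities rho_ik
        S = []
        for y in ys:
            logw = [math.log(pis[k]) - 0.5 * math.log(sig2[k])
                    - 0.5 * (y - mus[k]) ** 2 / sig2[k] for k in range(K)]
            top = max(logw)
            w = [math.exp(l - top) for l in logw]     # rescaled to avoid underflow
            S.append(rng.choices(range(K), weights=w)[0])
        # 2. Mixture weights: Dirichlet(alpha + N_k) via normalised gamma draws
        Nk = [S.count(k) for k in range(K)]
        g = [rng.gammavariate(alpha + Nk[k], 1.0) for k in range(K)]
        pis = [x / sum(g) for x in g]
        # 3. Variances (inverse gamma), then means (normal), component by component
        for k in range(K):
            yk = [y for y, s in zip(ys, S) if s == k]
            m = sum(yk) / Nk[k] if Nk[k] > 0 else 0.0
            E = sum((y - m) ** 2 for y in yk)
            shape = 0.5 * (nu + Nk[k])
            rate = 0.5 * (V + E + Nk[k] * kappa / (kappa + Nk[k]) * (xi - m) ** 2)
            sig2[k] = rate / rng.gammavariate(shape, 1.0)      # IG(shape, rate) draw
            post_mean = (kappa * xi + Nk[k] * m) / (kappa + Nk[k])
            mus[k] = rng.gauss(post_mean, math.sqrt(sig2[k] / (kappa + Nk[k])))
        draws.append((list(pis), list(mus), list(sig2)))
    return draws

rng = random.Random(42)
# Simulated data from two well-separated groups (illustrative only)
ys = [rng.gauss(-3.0, 1.0) for _ in range(100)] + [rng.gauss(3.0, 1.0) for _ in range(100)]
draws = gibbs_normal_mixture(ys, K=2, n_iter=300, rng=rng)
kept = draws[100:]                                   # discard the first 100 as burn-in
low = sum(min(d[1]) for d in kept) / len(kept)       # average of the smaller sampled mean
high = sum(max(d[1]) for d in kept) / len(kept)      # average of the larger sampled mean
print(round(low, 2), round(high, 2))
```

Summarising the smaller and larger sampled mean at each iteration sidesteps label switching in this two-group illustration; it is an ordering device rather than part of the sampler itself.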

4.8.3 Inference in Mixture Models


Parametric mixture models, such as the univariate normal just considered, are subject to
identification issues due to the arbitrariness of the sub-population labels. Other forms of
identifiability relate to potential overfitting (e.g. K taken too large), so that some groups
are overlapping and difficult to distinguish (Betancourt, 2017; Frühwirth-Schnatter, 2006,
p.107). An additional issue is whether all parameters in the separate densities need to be
taken to vary between groups. For example, the mclust package (Scrucca et al., 2016)
compares models with different K, and with group-specific variances as against common
variances, $\sigma_k^2 = \sigma^2$. Identification and relative fit may also be affected by the setting for the prior
Dirichlet weights (as shown in Example 4.8, using the galaxy data).
Assessing the fit of discrete mixture models also raises distinct problems. Compared
with random effects models, there is the benefit that the number of parameters is known,
so estimating information criteria such as the BIC or the AIC (Akaike information criterion)
is straightforward (McLachlan and Rathnayake, 2014). On the other hand, the asymptotic
justification for such criteria is affected by singularities at the boundaries of the param-
eter space in moving from a K − 1 group solution to a K group solution (Biernacki et al.,
2000). Recently developed information criteria, such as the sBIC may be considered instead
(Drton and Plummer, 2017).
To illustrate label identifiability, in the absence of parameter constraints or other prior
information to distinguish the components, the likelihood is invariant under permutation
of the components, and there are K! possible labelling schemes. It is essential to produce
MCMC draws with a unique labelling, if interest lies in the estimation of group-specific
parameters or classification probabilities πk (Frühwirth-Schnatter et al., 2004). Note though
that inferences on some aspects of the model are unaffected by group labelling – for exam-
ple, the unit means that pool over the subpopulation category means. Cluster labelling
issues are also not generally considered in the Dirichlet process approach (Section 4.9),
where the emphasis is on the smoothed unit means.
Identifying (usually ordering) constraints may be imposed on parameters to avoid label-
switching (Roeder and Wasserman, 1997, Richardson and Green 1997), providing what
may be termed “non-exchangeable priors” (Betancourt, 2017). Label switching or labelling
degeneracy refers to permuting the mixture component subscripts without altering the
likelihood (Redner and Walker, 1984). However, Celeux et al. (2000), Marin et al. (2005),
and Geweke (2007) consider drawbacks to such identifiability constraints (e.g. distortions
of the posterior distribution of the parameters). For example, in a normal mixture,
constraints may be imposed on prior masses $\pi_k$ (e.g. $\pi_1 > \pi_2 > \ldots > \pi_K$), or on the subpopulation
means $\mu_k$, or on the scale parameters $\sigma_k$. A preliminary MCMC sampling analysis
without parameter constraints may be used to assess the most suitable form of constraint
(Frühwirth-Schnatter, 2001). Another possibility is to use maximum likelihood solutions
(e.g. using the R package flexmix) to set constraints and/or relatively informative priors
that are sensible for the dataset. Re-analysis of the posterior output to impose a consistent
labelling is another possibility (Frühwirth-Schnatter, 2001), as are data-based priors, albeit
not fully Bayesian (Wasserman, 2000). For example, in a two-group model without regres-
sion on predictors, the unit with the maximum y value could be pre-labelled as belonging
to one or other subpopulation.
Particular types of parameterisation may be used to improve identification, such as
introducing dependence between the parameters ψk in different components so that they
are perturbations of one another (Robert and Mengersen, 1999). For example, a normal
mixture model with yk = ( mk , sk2 ) would be based on taking {q1 , s12 } as reference parameters
and adopting the parameterisation

s2 = s1w1 ,

s3 = s2w2 ,

s4 = s3w3 ,

sK = sK −1wK −1 = s1w1w2 …wK −1 ,

where wk ∼ U (0, 1). With q1 = m1 , the prior on the series of normal means takes a perturba-
tion form

m2 = q1 + s1q2 ,

m3 = q1 + s1q2 + s1s2q3 ,

mK = q1 + s1q2 + s1s2q3 + … + (s1s2 … sK −1 )qK .

The mixture weights have the form

$$\pi_1 = p_1,$$

$$\pi_2 = (1 - p_1)p_2,$$

$$\pi_3 = (1 - p_1)(1 - p_2)p_3,$$

$$\vdots$$

$$\pi_{K-1} = (1 - p_1)(1 - p_2) \cdots (1 - p_{K-2})p_{K-1},$$

$$\pi_K = (1 - p_1)(1 - p_2) \cdots (1 - p_{K-1}),$$

with $p_k \sim U(0, 1)$. This prior is still invariant under permutation of the cluster indices,
and an identifying constraint is placed on the variances by taking $1 \geq \omega_1 \geq \ldots \geq \omega_{K-1}$. An
advantage of this representation is that an improper prior on $\{\mu_1, \sigma_1^2\}$ can be used (Robert
and Titterington, 1998). For the two group case, Basu (1996) presents the parameterisation
$\nu = \sigma_1^2/\sigma_2^2$ and $\Delta = (\mu_2 - \mu_1)/\sigma_1$ to test for normal or Student t unimodality as against
bimodality; posterior probabilities of unimodality are obtained using the results of Robertson
and Fryer (1968).
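As an illustration of the perturbation parameterisation, the following R sketch maps a set of reference and perturbation parameters to the conventional mixture parameters (means, scales, weights). The function name perturb_to_mixture and its argument names are hypothetical, chosen for this sketch only, and the input values are made up.

```r
# Sketch only: map perturbation parameters to (mu, sigma, pi) for K >= 2.
# theta  : length K, theta[1] is the reference location (theta_1 = mu_1)
# sigma1 : reference scale sigma_1
# omega  : length K-1 scale ratios, each in (0, 1)
# p      : length K-1 proportions, each in (0, 1)
perturb_to_mixture <- function(theta, sigma1, omega, p) {
  K <- length(theta)
  stopifnot(K >= 2, length(omega) == K - 1, length(p) == K - 1)
  # sigma_k = sigma_1 * omega_1 * ... * omega_{k-1}
  sigma <- sigma1 * cumprod(c(1, omega))
  # mu_k = theta_1 + sum_{j=2}^{k} (sigma_1 ... sigma_{j-1}) * theta_j
  scale_prod <- cumprod(sigma[-K])
  mu <- theta[1] + c(0, cumsum(scale_prod * theta[-1]))
  # pi_k = p_k * prod_{j<k} (1 - p_j), with the last weight taking the remainder
  pi_k <- c(p, 1) * c(1, cumprod(1 - p))
  list(mu = mu, sigma = sigma, pi = pi_k)
}

out <- perturb_to_mixture(theta = c(0, 1, 2), sigma1 = 2,
                          omega = c(0.5, 0.5), p = c(0.5, 0.5))
out$mu     # 0 2 6
out$sigma  # 2.0 1.0 0.5
out$pi     # 0.50 0.25 0.25
```

The weights sum to 1 by construction, and taking each omega below 1 yields scales that decrease with the component index.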
Celeux et al. (2000) and others apply post-processing to the MCMC output resulting
from a discrete mixture analysis without parameter constraints; the goal is to reconfigure
the output with a consistent labelling. Suppose there are p parameters in any subpopula-
tion. If MCMC convergence is assumed, one may select a short run of iterations (say S = 100
iterations) where there is no label switching to provide a reference labelling. The initial
run of parameter samples provides a base reference label sequence 1, 2, … , K (one among
the K! possible), and K means of dimension p, $\theta_k = \{\theta_{1k}, \theta_{2k}, \ldots, \theta_{pk}\}$, that can be permuted
to include all other remaining K! − 1 possible labelling schemes. In a subsequent run of R
iterations where label switching might occur, iteration r is assigned to that scheme (among
the K!) closest to it in distance terms and a relabelling applied if there has been a switch
away from the base reference label. Additionally, the means under the schemes are recal-
culated at each iteration S + r (Celeux et al., 2000, p.965).
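A simplified version of this post-processing can be sketched in base R. The sketch assumes a single parameter per component (p = 1), stored as a matrix of sampled means with one column per label, and it keeps the reference means fixed rather than recalculating them at each iteration as in Celeux et al. (2000). The function names all_perms and relabel_draws are hypothetical.

```r
# Sketch only: relabel MCMC draws by the permutation closest to reference means.
all_perms <- function(v) {
  # all permutations of the vector v, one per row
  if (length(v) <= 1) return(matrix(v, nrow = 1))
  out <- NULL
  for (i in seq_along(v)) {
    out <- rbind(out, cbind(v[i], all_perms(v[-i])))
  }
  out
}

relabel_draws <- function(draws, S = 100) {
  K <- ncol(draws)
  perms <- all_perms(seq_len(K))
  # reference means from the first S iterations (assumed free of switching)
  ref <- colMeans(draws[seq_len(S), , drop = FALSE])
  out <- draws
  for (r in seq_len(nrow(draws))) {
    # squared distance of each permuted draw from the reference means
    d <- apply(perms, 1, function(pm) sum((draws[r, pm] - ref)^2))
    out[r, ] <- draws[r, perms[which.min(d), ]]
  }
  out
}

# artificial draws for K = 2 with a label switch in iterations 150 to 200
set.seed(1)
draws <- cbind(rnorm(200, -2, 0.1), rnorm(200, 2, 0.1))
draws[150:200, ] <- draws[150:200, 2:1]
fixed <- relabel_draws(draws, S = 100)
range(fixed[, 1])  # all values near -2 after relabelling
```

Enumerating all K! permutations is only practical for small K.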
Schemes for gaining identifiability can be applied within the MCMC sampling, as illustrated
in the rjags online code for the BUGS example concerning peak sensitivity wavelengths
(the Eyes example) (https://sourceforge.net/p/mcmc-jags/examples/ci/3765ddfd606e96c5de12818b50ef1b807f77af53/tree/classic-bugs/vol2/eyes/eyes.bug).
Assume an analysis with no constraints on the mixture parameters. Then, assuming
relabelling based on sampled means, processing re-sorts these sampled means, named say
m0[1:K] in the code, with identifiable means mu[K], mu[K−1],...,mu[1] defined according
to which of the m0[1:K] has the maximum value, the second highest, etc. Other mixture
parameters (weights and variances for each group in a normal mixture) are reassigned
using the same relabelling rule. This procedure corresponds to adopting a standard set of
labels or standard ordering to obtain an identified solution (Betancourt, 2017).
The rjags online code is for the case K = 2. For K = 3, assume a normal univariate mixture,
with reassignment based on the means, but applied also to resorting weights, from the
sampled P0[1:3] to the identified P[1:3]. Then one possible rjags code fragment, with mu[3]
as the largest mean, is

# rank[j] is 1 for the smallest sampled mean and 3 for the largest
rank <- rank(m0)
for (j in 1:K) {
J1[j] <- equals(rank[j],1)
J2[j] <- equals(rank[j],2)
J3[j] <- equals(rank[j],3)}
# weights follow the ordering of the means
P[1] <- P0[1] * J1[1] + P0[2] * J1[2] + P0[3] * J1[3]
P[2] <- P0[1] * J2[1] + P0[2] * J2[2] + P0[3] * J2[3]
P[3] <- P0[1] * J3[1] + P0[2] * J3[2] + P0[3] * J3[3]
# identified means: mu[1] smallest, mu[3] largest
mu[1] <- m0[1] * J1[1] + m0[2] * J1[2] + m0[3] * J1[3]
mu[2] <- m0[1] * J2[1] + m0[2] * J2[2] + m0[3] * J2[3]
mu[3] <- m0[1] * J3[1] + m0[2] * J3[2] + m0[3] * J3[3]

Assume precisions tau0[1:K] are to be reassigned as well. A general code for larger K can
be written more compactly as follows:

rank <- rank(m0)
for (j in 1:K) {P[j] <- sum(P0prod[j,])
mu[j] <- sum(m0prod[j,])
tau[j] <- sum(tau0prod[j,])
for (k in 1:K) {P0prod[j,k] <- P0[k]*equals(rank[k],j)
m0prod[j,k] <- m0[k]*equals(rank[k],j)
tau0prod[j,k] <- tau0[k]*equals(rank[k],j)}}

This procedure is illustrated in Example 4.8. Which parameter is selected as the basis for
resorting (e.g. means or weights) may partly be decided using measures of fit.
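The effect of the equals(rank[k], j) device can be checked outside JAGS. The following R sketch applies the same sum-of-products rule to a single made-up draw of unsorted means m0 and weights P0, and compares it with direct sorting; the numerical values are illustrative only.

```r
# Sketch only: the rank-based reassignment reproduces sorting by the means.
m0 <- c(2.1, -0.4, 5.3)   # unsorted sampled means (made-up values)
P0 <- c(0.5, 0.3, 0.2)    # weights attached to the same unsorted labels
K  <- length(m0)
rk <- rank(m0)            # 1 = smallest mean, K = largest mean

# mu[j] = sum_k m0[k] * equals(rank[k], j), and similarly for the weights
mu <- sapply(seq_len(K), function(j) sum(m0 * (rk == j)))
P  <- sapply(seq_len(K), function(j) sum(P0 * (rk == j)))

mu  # -0.4  2.1  5.3, the same as sort(m0)
P   #  0.3  0.5  0.2, the same as P0[order(m0)]
```

In R the direct forms sort(m0) and P0[order(m0)] would normally be used; the indicator sums are needed in JAGS because every node must be defined by a deterministic expression.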
We illustrate this procedure with jagsUI applied to the randomly generated dataset used
in Betancourt (2017), consisting of a two-group Gaussian mixture with means (−2.75, 2.75),
prior weights P = (0.6,0.4), and variances 1 in both groups. Prior Dirichlet sample sizes of 2
are assumed. The code assumes a conditional likelihood (conditional on allocation indica-
tors) and is:

mu <- c(-2.75, 2.75)
sigma <- c(1, 1)
lambda <- 0.4
set.seed(689934)
N <- 1000
z <- rbinom(N, 1, lambda) + 1
y <- rnorm(N, mu[z], sigma[z])
D <- list(N = N, y = y, K = 2)
require(jagsUI)
K <- 2
cat("model { for (i in 1:N){ # conditional likelihood
y[i] ~dnorm(m0[S[i]], tau0[S[i]])
# latent allocation indicators (conditional likelihood)
S[i] ~dcat(P0[1:K])
# replicate data, using the unsorted labels that S[i] refers to
ynew[i] ~dnorm(m0[S[i]],tau0[S[i]])
exc[i] <- step(ynew[i]-y[i])
LL[i] <- log(sum(L[i,]))}
P0 ~ddirch(alpha[]); # prior for weights
for (j in 1:K) { m0[j] ~dnorm(0, 0.01)
alpha[j] <- 2              # prior Dirichlet sample sizes
tau0[j] ~dgamma(1,0.001)
for (i in 1:N) { L[i,j] <- exp(log(P[j])+0.5*log(tau[j])-0.919-0.5*tau[j]*pow(y[i]-mu[j],2)) }}
tLL <- sum(LL[])
# Processing to obtain identifiable groups
rank <- rank(m0)
for (j in 1:K) {P[j] <- sum(P0prod[j,])
mu[j] <- sum(m0prod[j,])
tau[j] <- sum(tau0prod[j,])
s2[j] <- 1/tau[j]
for (k in 1:K) {P0prod[j,k] <- P0[k]*equals(rank[k],j)
m0prod[j,k] <- m0[k]*equals(rank[k],j)
tau0prod[j,k] <- tau0[k]*equals(rank[k],j)}}}
", file="discmix.jag")
# initial values and estimation
inits <- function(){list(m0=rnorm(K,0,0.01),tau0=rexp(K,1))}
pars <- c("P","mu","tau","s2","tLL")
summary(autojags(D, inits, pars, model.file="discmix.jag", n.chains=2, n.adapt=100,
iter.increment=1000, n.burnin=500, Rhat.limit=1.1, max.iter=50000, seed=1234))

We obtain a solution with μ2 as the larger mean, with posterior mean (sd) of 2.87 (0.05), and with
corresponding estimated weight π2 = 0.38. In this solution μ1 is the smaller mean, with posterior
mean (sd) of −2.73 (0.04), and with corresponding estimated weight π1 = 0.62. The estimated
weights reflect the totals of the simulated assignment indicators z generated in the code,
respectively sum(z==1) = 622 and sum(z==2) = 378. Convergence was attained in under 2000
iterations.
A less satisfactory result is obtained under the alternative scenario investigated by
Betancourt (2017) where the means are (−0.75,0.75), separated by only 1.5 standard
deviations. As before, prior weights are P = (0.6,0.4) and variances are 1 in both groups. This time
prior Dirichlet sample sizes of 5 are assumed. Convergence is obtained in under 5000
iterations with this more informative prior, but the estimated means do not fully reproduce
the simulated values, namely −0.50 (0.21) and 0.44 (0.38) with estimated weights of π =
(0.57,0.43). This demonstrates the identifiability issues present when components are not
widely separated.

4.8.4 Particular Types of Discrete Mixture Model


Heterogeneity within classes can be accommodated using discrete mixtures for unit level
conjugate or non-conjugate random effects (Lenk and DeSarbo, 2000; Frühwirth-Schnatter
et al., 2004). For example, the standard discrete mixture to account for heterogeneity in
count data involves K < n homogeneous subpopulations with means $\mu_1, \ldots, \mu_K$,

$$y_i \sim \sum_{k=1}^{K} \pi_k \, \mathrm{Po}(\mu_k),$$

where πk is the prior probability that a unit belongs to sub-population k, with $\sum_{k=1}^{K} \pi_k = 1$.
Alternatively, accounting for heterogeneity within subpopulations would involve K
Poisson-gamma subgroups

$$y_i \sim \mathrm{Po}(\mu_i),$$

$$\mu_i \sim \sum_{k=1}^{K} \pi_k \, \mathrm{Ga}(\alpha_k, \beta_k),$$

or K Poisson lognormal subgroups

$$y_i \sim \mathrm{Po}(\mu_i),$$

$$\mu_i \sim \sum_{k=1}^{K} \pi_k \, \mathrm{LN}(\mu_k, \sigma_k^2),$$

where LN(m,V) denotes a lognormal density with mean m and variance V.


Discrete mixtures can also be used to modify the shape of standard densities such as
the Poisson or binomial. For example, a manufacturing process may move between dif-
ferent regimes, one where faults are essentially unknown and another where they occur
according to a Poisson process. This will generate excess zeroes as compared to the standard
Poisson, leading to a zero-inflated Poisson (ZIP). One may introduce a binary regime
indicator Si with marginal probability π = Pr(Si = 1) that the fault-free regime applies, and
(1 − π) that sampling is from a Poisson density with mean μ. In more generality, with p(y|ψ)
as a density for count data (e.g. Poisson, negative binomial, binomial), the corresponding
zero-inflated density is

$$p(y = 0 \mid \pi, \psi) = \pi + (1 - \pi)\,p(y = 0 \mid \psi), \qquad y = 0,$$

$$p(y \mid \pi, \psi) = (1 - \pi)\,p(y \mid \psi), \qquad y > 0.$$

Conditionally Pr(Si = 1| y > 0) = 0, while

$$\Pr(S_i = 1 \mid y = 0) = \frac{\pi}{\pi + (1 - \pi)\,p(y = 0 \mid \psi)}.$$

The process generating the Si needs only to be considered for zero observations yi = 0, and
the complete data likelihood (assuming Si to be given) is

$$L(\pi, \psi \mid y, S) = \prod_{y_i > 0} (1 - \pi_i)\,p(y_i \mid \psi_i) \prod_{y_i = 0} \pi_i^{S_i}\,[(1 - \pi_i)\,p(0 \mid \psi_i)]^{1 - S_i}.$$

For example, if p(y|ψ) is taken to be Poisson with mean ψ = μ then $E(y \mid \pi, \mu) = (1 - \pi)\mu$ and

$$V(y \mid \pi, \mu) = (1 - \pi)\mu(1 + \pi\mu) > E(y \mid \pi, \mu),$$

so that the ZIP model is necessarily overdispersed.
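The ZIP moments can be verified numerically. The R sketch below defines a zero-inflated Poisson mass function (the name dzip and the parameter values are assumptions of this sketch), and compares the mean and variance computed from the mass function with the closed forms (1 − π)μ and (1 − π)μ(1 + πμ).

```r
# Sketch only: zero-inflated Poisson mass function and its first two moments.
dzip <- function(y, prob0, mu) {
  # prob0 is the probability of the fault-free (structural zero) regime
  ifelse(y == 0,
         prob0 + (1 - prob0) * dpois(0, mu),
         (1 - prob0) * dpois(y, mu))
}

prob0 <- 0.3
mu    <- 2
y     <- 0:200                       # wide enough that the omitted tail is negligible
pmf   <- dzip(y, prob0, mu)

m <- sum(y * pmf)                    # numerical mean
v <- sum(y^2 * pmf) - m^2            # numerical variance
c(mean = m, formula = (1 - prob0) * mu)                          # both 1.4
c(var  = v, formula = (1 - prob0) * mu * (1 + prob0 * mu))       # both 2.24

# conditional probability of the fault-free regime given a zero count
prob0 / (prob0 + (1 - prob0) * dpois(0, mu))
```

The variance exceeds the mean whenever the zero-inflation probability and μ are both positive, which is the overdispersion noted in the text.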

4.8.5 The Logistic-Normal Alternative to the Dirichlet Prior


A generalisation of the logistic-normal to multivariate contexts has been applied to
nonparametric analysis by authors such as Aitchison and Shen (1980), Lenk (1988), and
Hoff (2003). The goal is to replace the restrictive Dirichlet prior for the unknown mixture
probabilities πk with a multinomial logistic framework. Consider the case where units are
exchangeable, and there are no covariates relevant to allocation between subpopulations.
Then for subjects or units i = 1, … , n, and assuming

$$y_i \sim \sum_{k=1}^{K} \pi_{ik}\, p_k(y_i \mid \psi_k),$$

the mixing probabilities are obtained as

$$\pi_{ik} = \frac{e^{z_{ik}}}{1 + \sum_{j=1}^{K-1} e^{z_{ij}}}, \qquad k = 1, \ldots, K - 1,$$

$$\pi_{iK} = \frac{1}{1 + \sum_{j=1}^{K-1} e^{z_{ij}}},$$

where the $\{z_{ik}, k = 1, \ldots, K - 1\}$ are multivariate normal with mean ν and variance Σz. For
example, Hoff (2003) argues for the use of normal mixtures in density smoothing and,
in this case, the $p_k(y \mid \psi_k)$ would be univariate or multivariate normal themselves. This
approach generalises to multivariate skew-normal or multivariate Student t densities, and
can be adapted to allow non-exchangeable mixture priors, as in histogram smoothing
(Leonard, 1973).
Instead of subject-specific zik, one may also assume a single vector $\{z_1, \ldots, z_{K-1}\}$ to be
multivariate normal. For unique identification of the subgroups one may impose order
constraints on the parameters in ψk or on those underlying $\{z_1, \ldots, z_{K-1}\}$. In the univariate
normal case with $\psi_k = \{\mu_k, \sigma_k^2\}$, one might assume an ordering either on the means μk, or on
the means νk of the zik.
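The mapping from the latent normal vector to mixture probabilities can be sketched in base R as follows. The choices K = 4, a zero mean vector, and an identity covariance are illustrative assumptions, and z_to_pi is a hypothetical helper name.

```r
# Sketch only: logistic-normal mixing probabilities with component K as reference.
z_to_pi <- function(z) {
  ez <- exp(z)                 # z has length K - 1
  c(ez, 1) / (1 + sum(ez))     # last element is the reference category K
}

set.seed(1)
K       <- 4
nu      <- rep(0, K - 1)       # assumed mean of z
Sigma_z <- diag(K - 1)         # assumed covariance of z
# one draw z ~ N(nu, Sigma_z), using the Cholesky factor of Sigma_z
z <- as.vector(nu + t(chol(Sigma_z)) %*% rnorm(K - 1))

pi_k <- z_to_pi(z)
sum(pi_k)                      # 1
z_to_pi(c(0, 0, 0))            # 0.25 0.25 0.25 0.25
```

Unlike the Dirichlet, the covariance Σz allows positive as well as negative dependence between the log-odds of different components.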

Example 4.8 Galaxy Data


The number of clusters detected in the much-analysed galaxy data has varied over different
studies, under the model

$$y_i \sim \sum_{k=1}^{K} \rho_k \, N(\mu_k, \sigma_k^2),$$

with y being measured in thousands of kilometres per second. Classical analysis using
the flexmix package (Leisch, 2004) in R shows a better AIC and BIC for 5 clusters. The
mclust program selects K = 4 as optimal, and for the K = 6 solution selects an equal variance
solution. The K = 4 and K = 5 solutions have the drawback of a large variance in the
group with the largest mean.
Bayesian studies such as Ishwaran and James (2002) find at least 5–6 clusters with
a Dirichlet process approach, and under an inverse gamma prior for the $\sigma_k^2$. They do,
however, find only four clusters when a uniform prior U(0,20.83) is used for $\sigma_k^2$, with
20.83 being the observed variance, V(y). Ando (2007) reports six clusters (assuming a
monotonic constraint on the μk) via several model fit criteria, and K = 6 is also the best
fitting using the sBIC criterion of Drton and Plummer (2017, p.350).
Here we compare solutions with K = 4, K = 5 and K = 6. First of all, the rstan ordered
vector parameterisation will be used, following Betancourt (2017) and Savage (2016). A
half-Cauchy(0,2) is assumed for the group standard deviations. Prior Dirichlet sample
sizes α of 2 and 4 are also compared. For K = 5 and K = 6, estimation is with 2 chains
and 10,000 iterations. For K = 4, a higher number of iterations (50,000) is needed for
convergence.
With α = 2, respective posterior mean total log-likelihoods, namely
$\sum_{i} \log\left[\sum_{k=1}^{K} \rho_k \phi(y_i \mid \mu_k, \sigma_k)\right]$, are −206.0, −205.8 and −206.5, with respective LOO-IC 421.7,
422.4 and 424.2. So there is little to separate these solutions in terms of fit. With α = 4, the
posterior mean log-likelihoods are −206.4, −205.7 and −206.2, with respective LOO-IC
being 421.2, 421.4 and 422.8. In the rstan solutions, the group with the lowest mean generally
has a mean below the minimum of the observed data points, namely 9.17. This
can be taken as generalising beyond the observed data.
We also implement jagsUI with the latent means constrained to lie between the minimum
and maximum of the observations. MCMC convergence is focused on relabelled
parameters, using the standard labelling approach set out above [1]. Convergence is
problematic with independent priors on the precisions $\psi_k = 1/\sigma_k^2$ when K > 4. Improved
convergence is obtained if a hierarchical prior is adopted instead, namely $\psi_k \sim \mathrm{Ga}(a_\psi, b_\psi)$,
where aψ and bψ are assigned E(1) priors. This is an intermediate option between independent
priors and assuming the same variance across all groups. As noted by Baudry
et al. (2010), the most appropriate number of mixture components may not guarantee
well-separated groups. To assess cluster overlap, we use the entropy measure
$-2\sum_i \sum_k d_{ik} \log(\rho_{ik})$ (Scrucca et al., 2016, p.297); another form, with effective numerical
equivalence, is $-2\sum_i \sum_k \rho_{ik} \log(\rho_{ik})$.
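The second form of the entropy measure is simple to compute from a matrix of allocation probabilities. The following R sketch assumes an n × K matrix rho whose rows sum to 1; the function name entropy_measure and the example matrices are hypothetical.

```r
# Sketch only: entropy -2 * sum_i sum_k rho_ik * log(rho_ik).
entropy_measure <- function(rho) {
  # rho: n x K matrix of allocation probabilities, rows summing to 1
  terms <- ifelse(rho > 0, rho * log(rho), 0)   # treat 0 * log(0) as 0
  -2 * sum(terms)
}

# perfectly separated groups: each unit allocated with certainty
rho_sep <- diag(3)[c(1, 1, 2, 3, 3), ]
entropy_measure(rho_sep)      # 0

# complete overlap of two groups for 10 units
rho_flat <- matrix(0.5, nrow = 10, ncol = 2)
entropy_measure(rho_flat)     # 20 * log(2)
```

Larger values indicate more overlap between clusters, so a penalty based on this quantity favours well-separated solutions.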
For K = 4, 5 and 6 respective posterior mean log-likelihoods are −205.6, −204.4, and
−204.4, so BIC-type penalised fit measures (with respective penalties 48.5, 61.7, and 74.9)
would favour K = 4. Respective posterior mean entropies are 59, 87, and 109, so penalisa-
tion by entropy (Biernacki et al., 2000) would also decisively favour K = 4. The LOO-IC
measures also favour K = 4, with the values for K = 4, K = 5, and K = 6 being respectively
357, 370, and 376. A solution with K = 3 was also run, which gave a mean log-likelihood
of −216.4 and an entropy of 68.3. The estimated group means under K = 4 are 9.7, 19.9,
22.4, and 28.0, with group probabilities 0.10, 0.33, 0.42, and 0.15 via jagsUI.
By comparison, mclust provides estimated means of 9.7, 19.8, 22.9, and 24.5 with
respective probabilities 0.08, 0.39, 0.37, and 0.16, and bayesmix (Gruen and Plummer,
2015) provides estimated means of 10.3, 20.4, 22.5, and 30.5 with respective probabilities
0.09, 0.45, 0.39, and 0.06. The bayesmix run used the code:

variables <- c("mu","tau","eta")
M4 <- BMMmodel(k=4, priors=list(kind="independence", parameter="priorsFish",
hierarchical="tau"))
C <- JAGScontrol(variables = c(variables, "S"), n.iter = 5000, burn.in = 500)
R4 <- JAGSrun(y, model = M4, initialValues = list(S0 = 2), control = C,
cleanup = TRUE, tmp = FALSE)
Sort4 <- Sort(R4, "mu")
Sort4

Predictive checks for K = 4 under a hierarchical prior for ψk show only one exceedance
probability under 0.1 or over 0.9.

4.9 Semiparametric Modelling via Dirichlet Process and Polya Tree Priors


In applications of hierarchical models, inferences may depend on the assumed forms
(e.g. normal, gamma) for higher stage priors, and will be distorted if there are unrecog-
nised features such as multiple modes in the underlying second stage effects. Instead
of assuming a known prior distribution G for second stage latent effects, such as bi
in the normal-normal model of Section 4.3, the Dirichlet process (DP) prior involves
a distribution on G itself, so acknowledging uncertainty about its form (Carvalho
and Branscum, 2017; Gill and Casella, 2009). The DP prior involves a baseline or base
prior G0, the expectation of G, and a precision or mass parameter α governing the
concentration of the prior for G about its mean G0. For any partition $A_1, \ldots, A_M$ on
the support of G0, the vector $\{G(A_1), \ldots, G(A_M)\}$ of probabilities G(Am) contained in the
set $\{A_m, m = 1, \ldots, M\}$ follows a Dirichlet distribution $D(\alpha G_0(A_1), \ldots, \alpha G_0(A_M))$. Such an
approach may be termed semiparametric as it involves a parametric model at the first
stage for the observations, but a non-parametric model at the second stage (Basu and
Chib, 2003).
Original forms of the DP prior assumed G0 to be known (fixed). One problem with a
Dirichlet process when G0 is known is that it assigns a probability of 1 to the space of dis-
crete probability measures (Hanson et al., 2005, p.249). An alternative is to take the param-
eters in G0 to be unknown, and to follow a set of parametric distributions, with possibly
unknown hyperparameters, resulting in a mixture of Dirichlet process or MDP model
(Walker et al., 1999, p.489). Computational procedures for such models are discussed by
Jara (2007), Ohlssen et al. (2007), Jara et al. (2011), Burr (2012), Karabatsos (2016), Karabatsos
(2017), with associated R packages including DPpackage (Jara et al., 2011), and bspmma
(Burr, 2012).
Following West et al. (1994), assume conventional first-stage sampling densities
$y_i \sim p(y_i \mid b_i, \psi)$, with distributions $P(y_i \mid b_i, \psi)$. The uncertainty about the appropriate
form of prior arises about the distribution G for the latent effects bi. Under a DP prior,
any set of unit-specific parameters $\{b_1, \ldots, b_n\}$ generated from G lies in a set of K ≤ n distinct
values $\{\zeta_1, \ldots, \zeta_K\}$ which are sampled from G0. The concentration parameter α governing
the closeness of G to G0 can be taken as an unknown, or assigned a preset value
(e.g. α = 1) (Da Silva, 2009). The number of distinct values or clusters K is stochastic,
with an implicit prior determined by α, with limiting mean $\alpha \log(1 + n/\alpha)$. Note that the
posterior mean of K is not necessarily a reliable guide to the number of components in
the data or effects (e.g. components with substantive meaning), though it can be interpreted
as an upper bound on the number of components (Ishwaran and Zarepour, 2000,
pp.381–382).
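To see how the implicit prior on K depends on α, the limiting mean α log(1 + n/α) can be compared with the exact prior expectation of the number of clusters under a Dirichlet process, which is the sum of α/(α + i − 1) over i = 1, … , n. The R sketch below uses hypothetical function names and arbitrary values of α and n.

```r
# Sketch only: prior expected number of DP clusters, exact and approximate.
expected_clusters <- function(alpha, n) {
  # exact prior mean: unit i starts a new cluster with probability alpha/(alpha + i - 1)
  sum(alpha / (alpha + seq_len(n) - 1))
}
approx_clusters <- function(alpha, n) {
  # limiting form quoted in the text
  alpha * log(1 + n / alpha)
}

n <- 100
for (alpha in c(0.5, 1, 5, 10)) {
  cat(sprintf("alpha = %4.1f  exact = %6.2f  approx = %6.2f\n",
              alpha, expected_clusters(alpha, n), approx_clusters(alpha, n)))
}
```

Both expressions increase with α, so larger preset values (or priors favouring larger α) imply more clusters a priori.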
Given the realised number of clusters K (at any particular MCMC iteration), the bi are
sampled from the set $\{\zeta_1, \ldots, \zeta_K\}$ according to a multinomial distribution. Define cluster
indicators $S = \{S_1, \ldots, S_n\}$, where Si = k if bi = ζk, and denote $N_k = \#\{S_i = k\}$ as the total number
of units with Si = k (i.e. units in the same cluster with a common value ζk for the second
stage latent effect). If α is taken as unknown, its prior is important in determining the
number of clusters. Taking $\alpha \sim \mathrm{Ga}(\eta_1, \eta_2)$ where η1 and η2 are relatively large will tend to
discourage unduly small or large values for α. Typical values are $\eta_1 = \eta_2 = 1$ or $\eta_1 = \eta_2 = 2$,
though taking η2 > η1 as in $\{\eta_1 = 2, \eta_2 = 4\}$ tends to encourage repetitions in the ζk, and can
be used to assess the number of components present in the data (Ishwaran and Zarepour,
2000, p.377). It is clear that the parameters used in the prior for α may affect the number of
components, but typically there is less concern with this aspect in non-parametric mixture
modelling (Leslie et al., 2007).
Consider the assignment of a latent effect bi to a particular unit, given that the remaining
n − 1 latent effects $b_{[i]} = \{b_1, \ldots, b_{i-1}, b_{i+1}, \ldots, b_n\}$ are already assigned. Also let S[i] be
a particular configuration of the remaining n − 1 effects b[i] into K[i] distinct values, with
$N_{[i]k} = \#\{S_j = k, j \neq i\}$ denoting the total of those n − 1 units having a common value $\zeta_{[i]k}$.
Then the conditional prior for bi follows a Polya urn scheme (West et al., 1994; Hanson
et al., 2005, p.252; Dunson et al., 2007, p.165)

$$(b_i \mid b_{[i]}, S_{[i]}, K_{[i]}, \alpha) \sim \frac{\alpha}{\alpha + n - 1}\, G_0 + \frac{1}{\alpha + n - 1} \sum_{k \neq i} \delta(b_k),$$

$$\sim \frac{\alpha}{\alpha + n - 1}\, G_0 + \frac{1}{\alpha + n - 1} \sum_{k=1}^{K_{[i]}} N_{[i]k}\, \delta(\zeta_{[i]k}), \tag{4.5}$$

where δ(u) denotes a degenerate distribution having a single value at u. So bi is distinct
from the remaining latent values with probability $\alpha/(\alpha + n - 1)$, in which case it is drawn
from the base prior G0. Alternatively, it is selected from the existing distinct effects ζ[i]k
according to a multinomial with probabilities proportional to $N_{[i]k}/(\alpha + n - 1)$. This selection
scheme extends to the predictive scenario, i.e. to the latent effect bn+1 for a hypothetical
new unit n + 1, with

$$(b_{n+1} \mid b, S, K, \alpha) \sim \frac{\alpha}{\alpha + n}\, G_0 + \frac{1}{\alpha + n} \sum_{k=1}^{K} N_k\, \delta(\zeta_k).$$

Predictions of the first stage response for unit n + 1 are obtained as

$$(y_{n+1} \mid b, S, K, \alpha) \sim \frac{\alpha}{\alpha + n}\, P_{n+1}(\cdot \mid \zeta_{n+1}) + \frac{1}{\alpha + n} \sum_{k=1}^{K} N_k\, P_{n+1}(\cdot \mid \zeta_k),$$

where ζn+1 is an extra draw from G0. Predictions beyond n + 1 may be relevant in panel or
time series applications (Hirano, 1998).
In terms of Gibbs sampling, (4.5) implies conditional posteriors (West et al., 1994, p.367;
Ishwaran and James, 2001, p.166)

$$(b_i \mid y, b_{[i]}, S_{[i]}, K_{[i]}, \alpha) \sim \alpha q_{i0}\, g_0(b_i \mid y)\, p(y_i \mid b_i) + \sum_{k=1}^{K_{[i]}} q_{ik}\, \delta(\zeta_{[i]k}),$$

where $g_0(b_i \mid y)$ is the density corresponding to G0 evaluated at bi, and where

$$q_{i0} = \int p(y_i \mid b_i)\, g_0(b_i)\, db_i, \tag{4.6.1}$$

$$q_{ik} = N_{[i]k}\, p(y_i \mid \zeta_{[i]k}), \qquad k > 0. \tag{4.6.2}$$

Normalising the values αqi0 and qik to probabilities $\{\rho_{i0}, \rho_{i1}, \ldots, \rho_{iK_{[i]}}\}$ summing to 1, the conditional
posteriors for the subgroup indicators are then

$$\Pr(S_i = k \mid y, b_{[i]}, S_{[i]}, K_{[i]}) = \rho_{ik},$$

where Si = 0 corresponds to drawing a new sample from G0 under the Polya urn scheme.
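The Polya urn scheme can be simulated directly from the prior, which helps to show how α controls the number of distinct values. The R sketch below generates the effects sequentially, so that effect i is new with probability α/(α + i − 1) and otherwise copies an earlier effect. The function name rpolya_urn and the standard normal choice for G0 are assumptions of this sketch.

```r
# Sketch only: sequential Polya urn draws of latent effects under a DP prior.
rpolya_urn <- function(n, alpha, rG0 = function() rnorm(1)) {
  b <- numeric(n)
  b[1] <- rG0()                            # first effect always comes from G0
  for (i in seq_len(n)[-1]) {
    if (runif(1) < alpha / (alpha + i - 1)) {
      b[i] <- rG0()                        # new distinct value from G0
    } else {
      # copying a uniformly chosen earlier effect selects cluster k
      # with probability proportional to its current size N_k
      b[i] <- b[sample.int(i - 1, 1)]
    }
  }
  b
}

set.seed(123)
b <- rpolya_urn(n = 50, alpha = 1)
length(unique(b))        # realised number of clusters K
table(match(b, unique(b)))   # cluster sizes N_k
```

Repeating the simulation for several values of α gives an empirical view of the implicit prior on K.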

4.9.1 Specifying the Baseline Density


An important aspect of the MDP framework is the specification of G0. Assume there are p
parameters $(\psi_1, \ldots, \psi_p)$ in G0, then one has

$$y_i \mid b_i \sim p(y_i \mid b_i),$$

$$b_1, \ldots, b_n \mid G \sim G,$$

$$G \mid \alpha, G_0 \sim DP(\alpha G_0),$$

$$G_0 = \{p_{01}(\psi_1 \mid \xi_1), \ldots, p_{0p}(\psi_p \mid \xi_p)\},$$

where $\{\psi_1, \ldots, \psi_p\}$ are unknown, and also possibly some of the defining ξ parameters.
Consider a normal mixture with both means and variances possibly differing for each
unit (Cao and West, 1996; Hirano, 2002), namely

$$y_i \sim N(\mu_i, \sigma_i^2).$$

The appropriate prior G for $b_i = (\mu_i, \sigma_i^2)$ is not certain, and so

$$(\mu_i, \sigma_i^2) \sim G,$$

$$G \sim DP(\alpha G_0),$$

where G0 involves the priors

$$\mu_i \sim p_{01}(\mu_i \mid \xi_1),$$

$$\sigma_i^2 \sim p_{02}(\sigma_i^2 \mid \xi_2),$$

with ξ1 and ξ2 possibly including further unknowns. For example, Hirano (2002) takes

$$1/\sigma_i^2 \sim \chi^2(s)/(sQ),$$

and

$$\mu_i \sim N(m, c\sigma_i^2),$$

where s, Q, m and c are specified, but may be varied in a sensitivity analysis.


The marginal distribution of the yi (averaged over all possible G) in this case is a mixture
of normal distributions, with the number of subgroups K randomly varying between 1
and n. The n unit specific parameter pairs $b_i = (\mu_i, \sigma_i^2)$ are selected under G from the set of
K[i] possible values $\zeta_k = (\mu_k, \sigma_k^2)$ already drawn from G0, or by fresh sampling from G0. The
$q_{ik}$ in (4.6) are then obtained as

$$q_{i0} = \int \frac{1}{\sigma_i \sqrt{2\pi}}\, e^{-(y_i - \mu_i)^2 / 2\sigma_i^2}\, g_0(\mu_i, \sigma_i^2)\, d\mu_i\, d\sigma_i^2,$$

$$q_{ik} = N_{[i]k}\, \frac{1}{\sigma_k \sqrt{2\pi}}\, e^{-(y_i - \mu_k)^2 / 2\sigma_k^2}, \qquad k > 0.$$
As other examples, Chib and Hamilton (2002) consider a potential outcomes model for
panel data with DP errors, while Kleinman and Ibrahim (1998) consider Gibbs updates
in an MDP framework for parameters in general linear mixed models for nested data.
For example, let Xi and Zi be predictors of dimension q and r (possibly overlapping), and
consider repeated data yit over subjects i, with observation vectors $y_i = (y_{i1}, \ldots, y_{iT})$, and first
stage model

$$y_i \sim N(X_i\beta + Z_i b_i, \sigma^2),$$

where one may assume conventional normal and inverse gamma priors for β and σ2.
However, for $b_i = (b_{i1}, \ldots, b_{ir})$, greater flexibility is obtained by taking

$$b_i \sim G,$$

$$G \sim DP(\alpha G_0),$$

where G0 is multivariate normal of dimension r, with mean 0, but unknown covariance D.
The Wishart distribution in the Gibbs update for $D^{-1}$ is modified for clustering of values
among the sampled bi (Kleinman and Ibrahim, 1998, p.94).

4.9.2 Truncated Dirichlet Processes and Stick-Breaking Priors


Implementation may be simplified if an alternative way to generate the DP prior is adopted.
The basis of this alternative scheme is to regard the density of the unit level effects bi as
an infinite mixture of point masses or continuous densities (Ohlssen et al., 2007; Hirano,
1998), with

$$b_i \sim \sum_{k=1}^{\infty} \pi_k\, h(b_i \mid \psi_k).$$

This approach is called a Dirichlet process mixture by Hanson et al. (2005, p.250), and
a dependent Dirichlet process by Dunson et al. (2007, p.164). For practical application,
Ishwaran and Zarepour (2000) and Ishwaran and James (2002) suggest the infinite representation
be approximated by one truncated at M ≤ n components with

$$g(b) = \sum_{m=1}^{M} \pi_m\, h(b \mid \psi_m),$$

where the πm are sampled by introducing M − 1 beta distributed random variables,

$$V_m \sim \mathrm{Be}(c_m, d_m),$$

with VM = 1 to ensure the random weights πm sum to 1 (Ishwaran and James, 2001;
Sethuraman, 1994). Then

$$\pi_1 = V_1,$$

$$\pi_m = (1 - V_1)(1 - V_2) \cdots (1 - V_{m-1})V_m, \qquad m > 1.$$

This method of generation is known as stick-breaking, since at each stage, the procedure
randomly breaks what is left of a stick of unit length and assigns the length of the break to
the current πm. Griffin (2016) proposes an adaptive technique for selecting the truncation
point in truncated DP priors. Recent applications include Prabhakaran et al. (2016) and Hu
et al. (2018). It may be noted that rstan can use the TDP principle to estimate mixtures, but
taking M as a known rather than maximum number of components [2].
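The truncated stick-breaking construction is short to code. The R sketch below draws one set of weights for the case Vm ∼ Be(1, α); the function name stick_breaking and the values M = 20 and α = 1 are illustrative assumptions.

```r
# Sketch only: truncated stick-breaking weights with V_m ~ Be(1, alpha), V_M = 1.
stick_breaking <- function(M, alpha) {
  V <- c(rbeta(M - 1, 1, alpha), 1)    # V_M = 1 assigns the remaining stick
  # pi_m = V_m * prod_{j < m} (1 - V_j)
  V * cumprod(c(1, 1 - V[-M]))
}

set.seed(2020)
pi_m <- stick_breaking(M = 20, alpha = 1)
sum(pi_m)                # 1
round(pi_m[1:5], 3)      # the first few breaks usually carry most of the mass
```

With small α the early breaks tend to be long, so only a few components receive appreciable weight; larger α spreads the mass over more components.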

Following Pitman and Yor (1997), the beta parameters $\{c_m, d_m\}$ in the prior for Vm can be
written $c_m = 1 - C$, $d_m = D + mC$, where $C \in [0, 1)$ and $D > -C$. For an infinite dimensional
mixture, the Dirichlet process is obtained by taking C = 0 and D = α, so that $V_m \sim \mathrm{Be}(1, \alpha)$.
When a finite (truncated) mixture is used, setting

$$c_m = 1 + \frac{\alpha}{M},$$

$$d_m = \alpha - \frac{m\alpha}{M} = \alpha\left(1 - \frac{m}{M}\right)$$

is asymptotically equivalent to the DP (Ishwaran and Zarepour, 2002; Taylor-Rodriguez
et al., 2017).
However, using an approximate DP scheme with

$$V_m \sim \mathrm{Be}(1, \alpha)$$

and M large is equivalent to the infinite DP for practical purposes (Ishwaran and
James, 2002; Ishwaran and Zarepour, 2000, p.383). If a Ga(η1,η2) prior is used for α, its full
conditional is $\alpha \sim \mathrm{Ga}(M + \eta_1 - 1, \eta_2 - \log(\pi_M))$ (Ishwaran and Zarepour, 2000, p.387). The
realised number of clusters is K ≤ M as above, and Ishwaran and James (2002) suggest
AIC and BIC penalties based on K that can be used for model selection.
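The gamma full conditional for α can be sampled in one line of R. The sketch below assumes that Ga(a, b) denotes a gamma density with shape a and rate b (the convention used in BUGS and JAGS), and that the final weight πM is available from the current stick-breaking draw; draw_alpha and its default hyperparameters are hypothetical.

```r
# Sketch only: Gibbs update for the DP concentration parameter alpha.
# Assumes Ga(shape, rate), and V_m ~ Be(1, alpha) for m = 1, ..., M - 1.
draw_alpha <- function(pi_M, M, eta1 = 2, eta2 = 2) {
  # full conditional Ga(M + eta1 - 1, eta2 - log(pi_M)); -log(pi_M) > 0
  rgamma(1, shape = M + eta1 - 1, rate = eta2 - log(pi_M))
}

set.seed(99)
M <- 20
V <- rbeta(M - 1, 1, 1)            # current breaks, drawn here with alpha = 1
pi_M <- prod(1 - V)                # final weight, since V_M = 1
alpha_new <- draw_alpha(pi_M, M)
alpha_new
```

Because πM can be extremely small when M is large, it may be safer in practice to accumulate log(πM) as the sum of log(1 − Vm) rather than forming the product first.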
Taking $V_m \sim \mathrm{Be}(\alpha, 1)$ rather than $V_m \sim \mathrm{Be}(1, \alpha)$ in the truncated stick-breaking scheme
means that larger values of α now imply greater clustering into a few sub-populations.
This is an example of the beta process priors considered by Ishwaran and Zarepour (2000).
Other truncated mixture sampling schemes that start with a prior on α to give an implicit
prior on a stochastic K are available. For example, Ishwaran and Zarepour (2000, p.376)
consider taking α as an unknown in

$$(\pi_1, \ldots, \pi_M) \sim D\left(\frac{\alpha}{M}, \frac{\alpha}{M}, \ldots, \frac{\alpha}{M}\right).$$

Alternatively, Green and Richardson (2001, p.357) start off with a prior on K and then select
the cluster indicators from a multinomial vector with probabilities $\Pr(S_i = k) = \pi_k$, where
$(\pi_1, \ldots, \pi_K)$ follow a Dirichlet density $D(\delta, \ldots, \delta)$. They refer to this as an explicit allocation
prior and show how the DP prior is obtained as K → ∞ and δ → 0 in such a way that
$K\delta \to \alpha > 0$.

4.9.3 Polya Tree Priors


The Polya tree is a more general class than the Dirichlet process, and has the benefit that it
can place probability 1 on the space of continuous densities (Hanson et al., 2005; Walker et
al., 1999). In essence, if the support of a parameter ω is denoted Γ, then the Polya Tree (PT)
prior chooses the most appropriate value for ω by successive binary partitioning of Γ. The
first partition splits Γ into 2 disjoint sets {B0,B1}; the probabilities of moving into B0 and B1
are C00 and C01 = 1 − C00, with C00 set to 0.5. At the second partition B0 is split into {B00, B01}
and B1 is split into {B10, B11} so there are $2^2$ sets. At the third partition, B00 is split into {B000,
B001}, B01 into {B010, B011}, B10 into {B100, B101}, and B11 into {B110, B111}, so there are $2^3$ sets. The
number of sets at the mth partition is generally $2^m$.

The partition probabilities at second and subsequent stages are unknown. Let ε denote
a sequence of 0s and 1s. For example, suppose B1 is selected at step 1, and B11 is selected at
step 2, then ε = [1,1]. The choice at the next stage between sets Bε0 and Bε1 (i.e. between B110
and B111) is governed by probabilities $(C_{\varepsilon 0}, C_{\varepsilon 1})$, with a beta prior for Cε0, and $C_{\varepsilon 1} = 1 - C_{\varepsilon 0}$.
The canonical form for the prior on the partition probabilities at partition m is

$$C_{\varepsilon 0} \sim \mathrm{Be}(c_m, c_m),$$

$$c_m = d\,m^2,$$

where d may be taken as an extra unknown. The Dirichlet process occurs when $c_m = d/2^m$,
so that cm → 0 as m → ∞, whereas cm → ∞ as m → ∞ is appropriate if the underlying distribution
G is expected to be continuous.
While theoretically the completely continuous case corresponds to m → ∞, in practice
the partitioning is truncated at a finite value M. Hanson and Johnson (2002) recommend
$M = \log_2(n)$ where n is the sample size. The partitions can be taken to coincide with percentiles
of G0, so for example

$$B_0 = (-\infty, G_0^{-1}(0.5)], \quad B_1 = [G_0^{-1}(0.5), \infty);$$

$$B_{00} = (-\infty, G_0^{-1}(0.25)], \quad B_{01} = [G_0^{-1}(0.25), G_0^{-1}(0.5)];$$

$$B_{10} = [G_0^{-1}(0.5), G_0^{-1}(0.75)], \quad B_{11} = [G_0^{-1}(0.75), \infty);$$

and so on.
Let dki at partition k, and option i, be a re-expression of the Bε (e.g. for k = 3, d31 = B000,
d32 = B001, d33 = B010, d34 = B011, d35 = B100, d36 = B101, d37 = B110, d38 = B111). Then at partition k, for
$i = 1, \ldots, 2^k$, the interval boundaries are

$$d_{ki} = \left[ G_0^{-1}\left(\frac{i - 1}{2^k}\right),\; G_0^{-1}\left(\frac{i}{2^k}\right) \right],$$

with appropriate modifications for the extreme tails.


For example, consider a PT prior on unstructured errors in a Poisson lognormal mixture,
with

$$y_i \sim \mathrm{Po}(\mu_i),$$

$$\log(\mu_i) = \beta + \sigma b_i.$$

Then G0 for $v_i = \sigma b_i$ is a $N(0, \sigma^2)$ density, with G0 for bi being a N(0,1) density. So with M = 3
levels, the relevant ordinates from G0 for defining the 8 intervals are
(−1.15, −0.67, −0.32, 0, 0.32, 0.67, 1.15).
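These ordinates are the quantiles of G0 at i/8 for i = 1, … , 7, and can be reproduced in R. The helper name pt_cutpoints is hypothetical, and the standard normal quantile function is used because G0 for bi is N(0,1) in this example.

```r
# Sketch only: Polya tree cut-points at level M as quantiles of the base density G0.
pt_cutpoints <- function(M, qG0 = qnorm) {
  qG0(seq_len(2^M - 1) / 2^M)      # interior boundaries of the 2^M sets
}

cuts <- pt_cutpoints(3)
round(cuts, 2)
# -1.15 -0.67 -0.32  0.00  0.32  0.67  1.15

# index (1 to 8) of the level-3 set containing each of some illustrative effects
b <- c(-2, -0.5, 0.1, 1.3)
findInterval(b, cuts) + 1
```

For another base density, the corresponding quantile function (for example qt or qlogis with fixed parameters) can be passed as qG0.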

Example 4.9 Nicotine Replacement Therapy


We re-analyse the NRT trials data using truncated DP priors (Section 4.9.2). Thus,
second stage random trial effects are obtained as $b_i = \zeta_k$ conditional on latent group
indicator Si = k, and with $\{\zeta_1, \ldots, \zeta_M\}$ sampled from G0. The realised number of clusters is
K ≤ M, where a maximum of M = 50 possible normal clusters $N(\mu_m, \tau_m^2)$ are assumed as
potential second stage priors. The M potential parameter pairs $\{\mu_m, \tau_m^2\}$ defining G0 are
respectively sampled from normal densities with means $\mu_m \sim N(m_\mu, 1)$, where mμ is itself
unknown, and from exponential densities, with $1/\tau_m^2 \sim E(1)$.
Then $V_m \sim \mathrm{Be}(1, \alpha)$, with an exponential E(1) prior assumed on the concentration parameter
α, and with a lower sampling limit of 0.25 for numeric stability. A mixed predictive
check is based on sampling replicate $\{\zeta_{rep,1}, \ldots, \zeta_{rep,M}\}$ from G0, and taking $b_{rep,i} = \zeta_{rep,k}$.
A two-chain run of 5,000 iterations using rube shows convergence in α, K, and the
realised latent effects b. The posterior mean and median of K are respectively 3.9 and 4,
supporting a relatively small number of components in the second-stage prior of NRT
effects; α has a posterior mean of 0.85. Mixed predictive checks are satisfactory, with
none exceeding 0.9 or being under 0.1.
A plot of the posterior means of the bi does not show sharply distinct subgroups
(Figure  4.1), though outlier random effects can be seen, such as trials 4, 36, and 59.
However, the effects show more peakedness than under a normal density (superim-
posed plot).
The analysis is also run using a Pitman–Yor prior, with $V_m \sim \mathrm{Be}(1 - C, D + mC)$, where
$C \in [0, 1)$ and $D > -C$, and with a maximum of M = 20 clusters. This is implemented
using R2OpenBUGS with a two-chain run of 20,000 iterations. A uniform U(0,1) prior is
adopted on C, with D obtained as $D = D_1 - C$, where $D_1$ is assigned a gamma prior,
$D_1 \sim \mathrm{Ga}(1, 0.01)$. This analysis provides posterior means (sd) for C and D of 0.55 (0.25)
and 132 (101), with the mean number of clusters being 4.7. Posterior means for $b_i$ are similar
to those of the first analysis, the correlation between them exceeding 0.95, while exceedance
probabilities again show no model failure.
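
A minimal Python sketch of one prior draw under the Pitman–Yor specification is given below. It is an illustration only: the function name and seed are hypothetical, the last break is fixed at 1 to close the truncation, and Ga(1, 0.01) is read as shape 1 and rate 0.01 (scale 100), which is the BUGS convention but is an assumption here.

```python
import random

def pitman_yor_weights(C, D, M, rng):
    """Truncated Pitman-Yor weights with V_m ~ Be(1 - C, D + m*C)."""
    if not (0.0 <= C < 1.0 and D > -C):
        raise ValueError("need 0 <= C < 1 and D > -C")
    weights, remaining = [], 1.0
    for m in range(1, M + 1):
        # last break fixed at 1 so the M weights sum to one
        v = 1.0 if m == M else rng.betavariate(1.0 - C, D + m * C)
        weights.append(v * remaining)
        remaining *= 1.0 - v
    return weights

rng = random.Random(2)                 # arbitrary seed
C = rng.random()                       # C ~ U(0, 1)
D1 = rng.gammavariate(1.0, 100.0)      # shape 1, scale 100 (rate 0.01, assumed)
D = D1 - C                             # guarantees D > -C because D1 > 0
weights = pitman_yor_weights(C, D, M=20, rng=rng)
print(f"C = {C:.2f}, D = {D:.1f}, largest weight = {max(weights):.3f}")
```

Setting C = 0 gives V_m ~ Be(1, D), so the Dirichlet process with concentration D is recovered as a special case.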

FIGURE 4.1
Nicotine replacement. Estimated random effects.

Example 4.10 Digestive Tract Decontamination

Data on log odds ratios $y_i$ and their variances $s_i^2$ in n = 14 trials are considered by Burr
and Doss (2005), and relate to mortality under treatment versus control in trials of
decontamination of the digestive tract. Assumptions under the normal-normal model
(equations 4.4.1 to 4.4.3) are cast into doubt by quantile plots of the $y_i$. We consider a truncated
DP prior (with M = n = 14), with second-stage effects $b_i = \zeta_k$ when allocation indicators
$S_i = k$, and

$$\zeta_k \sim N(\mu_k, \tau^2),$$

with $\{\mu_1, \ldots, \mu_M\}$ themselves sampled from a normal density

$$\mu_m \sim N(m_\mu, \tau_\mu^2).$$

Both $m_\mu$ and $\tau_\mu^2$ are unknowns, assigned N(0, 100) and Ga(1, 0.01) priors. The second-
stage variance parameter $\tau^2$ is also assigned a Ga(1, 0.01) prior.
Analysis compares the DPMmeta option in the R library DPpackage, and a BUGS
code estimated using rube in R, with

$$V_m \sim \mathrm{Be}\left(1 + \frac{\alpha}{M},\ \alpha\left(1 - \frac{m}{M}\right)\right)$$

in the stick-breaking prior. Either computing option suggests α is not strongly identified
by the data: alternative settings for $a_0$ and $b_0$ in $\alpha \sim \mathrm{Ga}(a_0, b_0)$ tend to carry over to the
estimated α. So alternative preset values such as α = 1, α = 10, etc. may be adopted instead
(Burr, 2012).
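
To see what a preset α implies before any data are seen, one can compute the prior expected number of distinct clusters among n units. The sketch below uses the standard result for an untruncated Dirichlet process, E(K) = Σ α/(α + i − 1) for i = 1, …, n; the function name is hypothetical, and the truncated priors used in the example will differ slightly from this approximation.

```python
def expected_clusters(alpha, n):
    """Prior mean number of distinct clusters among n units under DP(alpha)."""
    return sum(alpha / (alpha + i) for i in range(n))

for alpha in (1.0, 10.0):
    print(alpha, round(expected_clusters(alpha, 14), 2))
```

For n = 14 this gives about 3.25 expected clusters at α = 1 and about 9.05 at α = 10, so the choice of preset value carries a clear prior statement about the degree of clustering.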
With the setting α = 10, DPMmeta shows a mean of around 9 realised clusters, as
against 6.8 under the TDP prior. Treatment benefit can be measured by the probability
that $m_\mu$ is negative, or the probability that the mean of the realised $b_i$ is negative. The
probability $\Pr(m_\mu < 0 \mid y) = 0.92$ is inconclusive, though the probability $\Pr(\bar{b} < 0 \mid y)$ is
0.97 (these two quantities are pben[1:2] in the code).
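
Both benefit probabilities are simple proportions of MCMC draws. A minimal Python sketch follows; the function names and the toy draws are hypothetical and are not output from the analysis above, and only the name pben is taken from the text.

```python
def prob_negative(draws):
    """Estimate Pr(theta < 0 | y) as the share of negative draws."""
    draws = list(draws)
    return sum(d < 0 for d in draws) / len(draws)

def prob_mean_effect_negative(b_draws):
    """Pr(mean of the b_i < 0 | y); b_draws holds one list of trial effects per iteration."""
    return prob_negative(sum(b) / len(b) for b in b_draws)

# hypothetical MCMC output: 4 iterations, 3 trials (illustrative values only)
m_mu_draws = [-0.30, 0.05, -0.12, -0.41]
b_draws = [[-0.5, -0.1, 0.2], [0.1, -0.4, -0.3], [-0.2, 0.0, -0.1], [0.3, 0.2, -0.1]]
pben = [prob_negative(m_mu_draws), prob_mean_effect_negative(b_draws)]
print(pben)
```

The two probabilities can differ because the mean of the realised effects averages over the trials actually observed, whereas $m_\mu$ is a hyperparameter of the baseline density.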

Example 4.11 Eye Tracking Data


Escobar and West (1998) present count data on eye tracking anomalies in n = 101
schizophrenic patients. The data are zero-inflated and overdispersed. A first analysis assumes
a DP Poisson-gamma mixture, with $G_0$ being the second-stage gamma density with
unknown shape and scale parameters. So

$$y_i \sim \mathrm{Po}(b_i),$$

$$b_i \sim G,$$

$$G \sim \mathrm{DP}(\alpha G_0),$$

$$G_0 = \mathrm{Ga}(c_g, d_g).$$

Taking $c_g$ and $d_g$ to be unknowns results in an MDP prior, which is implemented using
the Polya urn prior. Exponential E(1) priors are assumed on α, and on the parameters
$(c_g, d_g)$, with a minimum of 0.5 on $c_g$ for numeric stability.
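
The Polya urn scheme generates the latent frailties sequentially: each new $b_i$ is either a fresh draw from $G_0$ or a copy of an earlier value. The Python sketch below illustrates this for fixed hyperparameters. It is not the BUGS implementation; the function name and seed are hypothetical, the second gamma parameter is treated as a rate (an assumption), and the values plugged in are the posterior means reported for this example.

```python
import random

def polya_urn_sample(n, alpha, c_g, d_g, rng):
    """Draw b_1..b_n from the Polya urn implied by G ~ DP(alpha * G0), G0 = Ga(c_g, d_g)."""
    b = []
    for i in range(n):
        # fresh value from G0 with probability alpha / (alpha + i), else copy an earlier b
        if rng.random() < alpha / (alpha + i):
            b.append(rng.gammavariate(c_g, 1.0 / d_g))   # d_g treated as a rate (assumption)
        else:
            b.append(rng.choice(b))
    return b

rng = random.Random(3)                                   # arbitrary seed
b = polya_urn_sample(n=101, alpha=2.83, c_g=0.82, d_g=0.13, rng=rng)
print("distinct frailty values:", len(set(b)))
```

Ties among the $b_i$ define the clusters, so the number of distinct values in the output plays the role of K.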
Observed $y_i$ are compared with replicates sampled from the predictive distribution
$p(y_{\mathrm{rep}} \mid y)$ to see if the $y_i$ are at odds with the model. Discrepancies could be due to genuine
outlier status, or to model failures. For discrete data, the relevant p-value is

$$\Pr(y_{\mathrm{rep},i} < y_i) + 0.5\,\Pr(y_{\mathrm{rep},i} = y_i).$$

A related check is whether the 95% intervals for $y_{\mathrm{rep},i}$ include $y_i$ (Gelfand, 1996).
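
Both checks are straightforward to compute from replicate draws. The Python sketch below is illustrative only: the function names are hypothetical, and the interval uses a simple order-statistic rule rather than any particular quantile definition.

```python
import math

def mid_p_value(y_obs, y_rep):
    """Pr(y_rep < y) + 0.5 * Pr(y_rep = y), estimated from replicate draws."""
    below = sum(r < y_obs for r in y_rep)
    equal = sum(r == y_obs for r in y_rep)
    return (below + 0.5 * equal) / len(y_rep)

def in_central_interval(y_obs, y_rep, level=0.95):
    """Check whether y_obs lies inside the central `level` interval of the replicates."""
    s = sorted(y_rep)
    tail = (1.0 - level) / 2.0
    lo = s[math.floor(tail * (len(s) - 1))]
    hi = s[math.ceil((1.0 - tail) * (len(s) - 1))]
    return lo <= y_obs <= hi

y_rep = [0, 1, 2, 2, 3]            # hypothetical replicates for one patient
print(mid_p_value(2, y_rep), in_central_interval(2, y_rep))
```

Values of the p-value close to 0 or 1 flag observations that the model under-predicts or over-predicts; the half weight on ties keeps the check centred at 0.5 for discrete outcomes.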
A two-chain run of 20,000 iterations in R2OpenBUGS provides an estimated mean
of K = 12 clusters, with posterior means (95% CRI) for α, $c_g$, and $d_g$ of 2.83 (0.74, 6.45),
0.82 (0.51, 1.61), and 0.13 (0.04, 0.29). Figure 4.2 shows the prediction $y_{\mathrm{new}}$ for a new
case, and demonstrates that the main source of overdispersion is skewness in the
latent frailties $b_i$ rather than multiple modes. The predictive checks based on replicate
samples are satisfactory. Note that the same does not apply if the gamma mixing density
parameters are preset, e.g. $c_g = d_g = 1$. In this case, bimodal posteriors are obtained
on some $b_i$ (e.g. $b_{92}$), and predictive checks for $y_{101} = 34$ suggest it to be an extreme
observation.
A second analysis involves a Polya tree prior, and a Poisson-lognormal model, namely

$$y_i \sim \mathrm{Po}(\mu_i),$$

$$\log(\mu_i) = \beta + \sigma b_i,$$

where $G_0$ for $v_i = \sigma b_i$ is a $N(0, \sigma^2)$ density. The number of stages is set at M = 4, and an E(1)
prior is assumed on $1/\sigma^2$. Once an interval $B_{\varepsilon_m}$ is selected, uniform sampling to generate
$b_i$ takes place within the interval defined by $G_0$, except in the tails where the sampling
is from a N(0,1).
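
The sampling step for $b_i$ can be sketched in Python as below. This is an illustration under assumptions that the text does not state: the sets at each stage are taken to be bounded by quantiles of $G_0$ = N(0,1), the branch probabilities are given the commonly used Be(cm², cm²) form at stage m with c = 1, and all function names are hypothetical.

```python
import random
from statistics import NormalDist

G0 = NormalDist(0.0, 1.0)

def draw_branch_probs(levels, c, rng):
    """Left-branch probability for every node; Be(c*m^2, c*m^2) at stage m (assumed form)."""
    return {(m, k): rng.betavariate(c * m * m, c * m * m)
            for m in range(1, levels + 1) for k in range(2 ** (m - 1))}

def draw_b(branch, levels, rng):
    """Select a set at the final stage, then sample b inside it."""
    k = 0
    for m in range(1, levels + 1):
        go_left = rng.random() < branch[(m, k)]
        k = 2 * k + (0 if go_left else 1)
    n_sets = 2 ** levels
    p_lo, p_hi = k / n_sets, (k + 1) / n_sets
    if k in (0, n_sets - 1):
        # tail sets are unbounded: sample N(0,1) restricted to the set via the inverse cdf
        u = min(max(rng.uniform(p_lo, p_hi), 1e-12), 1.0 - 1e-12)
        return k, G0.inv_cdf(u)
    # interior sets: uniform between the G0 quantiles that bound the set
    return k, rng.uniform(G0.inv_cdf(p_lo), G0.inv_cdf(p_hi))

rng = random.Random(4)                         # arbitrary seed
branch = draw_branch_probs(levels=4, c=1.0, rng=rng)
print([round(draw_b(branch, 4, rng)[1], 2) for _ in range(5)])
```

With M = 4 stages there are 16 final sets, and the branch probabilities are drawn once and then shared by all units, which is what allows the tree to depart from the baseline normal shape.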
As for the Polya urn model, both types of predictive check indicate no major discrepancies.
The posterior mean (and 95% interval) of σ is 2.06 (1.65, 2.51). If σ is taken to equal 1,
so that $G_0$ is assumed known, then predictive discrepancies do occur. Taking σ = 1 also
leads to bimodal posteriors for individual $b_i$, indicating a clash between prior and data,
such that the prior cannot accommodate certain values. A plot of the estimated $b_i$ shows
the distinct zero inflation combined with positive skew (Figure 4.3).

FIGURE 4.2
Predictive samples, new outcome, eye tracking data. [Histogram: x-axis, new outcome (0 to 50); y-axis, frequency.]

FIGURE 4.3
Estimated random effects, eye tracking data. [Histogram: x-axis, posterior mean b (–1.0 to 1.5); y-axis, frequency.]

4.10 Computational Notes

1. The jagsUI code, including relabelling for identifiability, is

    cat("model {
      for (i in 1:N) {
        # conditional likelihood
        y[i] ~ dnorm(m0[S0[i]], psi0[S0[i]])
        # individual latent membership indicators (conditional likelihood)
        S0[i] ~ dcat(P0[1:K])
        ynew[i] ~ dnorm(m0[S0[i]], psi0[S0[i]])
        for (j in 1:K) { d0[i,j] <- equals(S0[i],j) }
        # exceedance check
        exc[i] <- step(ynew[i]-y[i])
        log_lik[i] <- log(sum(L[i,]))
      }
      P0 ~ ddirch(alpha[])   # prior for mixing proportion
      # prior on unconstrained means
      for (j in 1:K) {
        m0[j] ~ dnorm(25, 0.01) T(9.2,34.3)
        # independent or hierarchical prior on group precisions
        # psi0[j] ~ dgamma(1,0.01)
        psi0[j] ~ dgamma(a.psi,b.psi)
        # prior Dirichlet weights
        alpha[j] <- 2.5
        for (i in 1:N) {
          L[i,j] <- exp(log(P[j])+0.5*log(psi[j])-0.919-0.5*psi[j]*pow(y[i]-mu[j],2))
          # conditional allocation probabilities
          rho[i,j] <- L[i,j]/sum(L[i,])
          # entropy
          ent1[i,j] <- equals(S[i],j)*log(rho[i,j])
          ent2[i,j] <- rho[i,j]*log(rho[i,j])
        }
      }
      Ent[1] <- -2*sum(ent1[1:N,1:K])
      Ent[2] <- -2*sum(ent2[1:N,1:K])
      tLL <- sum(log_lik[])
      # hyperparameters, hierarchical prior on precisions
      a.psi ~ dexp(1)
      b.psi ~ dexp(1)
      # Processing to obtain identifiable groups, using ranks of unconstrained means
      rank <- rank(m0)
      # relabelled weights, means, precisions, variances
      for (j in 1:K) {
        P[j] <- sum(P0prod[j,])
        mu[j] <- sum(m0prod[j,])
        psi[j] <- sum(psi0prod[j,])
        s2[j] <- 1/psi[j]
        for (k in 1:K) {
          P0prod[j,k] <- P0[k]*equals(rank[k],j)
          m0prod[j,k] <- m0[k]*equals(rank[k],j)
          psi0prod[j,k] <- psi0[k]*equals(rank[k],j)
        }
      }
      # relabelled allocation indicators
      for (i in 1:N) {
        S[i] <- sum(dcat[i,])
        for (j in 1:K) {
          d[i,j] <- sum(d0prod[i,j,])
          dcat[i,j] <- j*d[i,j]
          for (k in 1:K) { d0prod[i,j,k] <- d0[i,k]*equals(rank[k],j) }
        }
      }
    }
    ", file="mixnorm.jag")

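The relabelling in the code above amounts to sorting each draw by the unconstrained means: relabelled component j takes the weight, mean, and precision of the original component whose mean has rank j, and each allocation indicator is mapped to the rank of its original component. A minimal Python sketch of the same operation on a single draw is given below; the function name and the toy values are hypothetical.

```python
def relabel_by_mean(m0, P0, psi0, S0):
    """Reorder one MCMC draw so that component means are increasing.

    m0, P0, psi0: unconstrained means, weights, precisions (length K);
    S0: allocation indicators with values 1..K.
    """
    K = len(m0)
    order = sorted(range(K), key=lambda k: m0[k])   # order[j] = old index holding rank j+1
    rank = [0] * K
    for j, k in enumerate(order):
        rank[k] = j + 1                             # rank of old component k
    mu = [m0[k] for k in order]
    P = [P0[k] for k in order]
    psi = [psi0[k] for k in order]
    S = [rank[s - 1] for s in S0]                   # relabelled indicators
    return mu, P, psi, S

# toy draw with K = 3 components (illustrative values only)
mu, P, psi, S = relabel_by_mean([22.0, 9.8, 33.0], [0.5, 0.1, 0.4], [1.0, 2.0, 3.0], [1, 2, 3, 1])
print(mu, P, psi, S)
```

Because the ranking is recomputed at every iteration, the monitored mu, P, and psi refer to the same ordered component throughout the run even if the unconstrained labels switch.
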
2. Consider the galaxy data and suppose M = 4 is the number of mixture components.
Unknown means are centred at the observed mean of the data. Then a truncated
DP prior can be implemented as

    library(rstan)
    stan_model <- "
    data {
      int<lower=0> M;   // number of clusters
      int<lower=0> N;   // number of observations
      real y[N];
    }
    parameters {
      positive_ordered[M] mu;        // cluster means
      real<lower=0,upper=1> v[M];
      real<lower=0> sigma[M];        // cluster scales
      real<lower=0> alpha;           // concentration parameter
    }
    transformed parameters {
      simplex[M] pi;
      pi[1] = v[1];
      // stick-break process
      for (j in 2:(M-1)) { pi[j] = v[j]*(1-v[j-1])*pi[j-1]/v[j-1]; }
      pi[M] = 1-sum(pi[1:(M-1)]);
    }
    model {
      real comp[M];
      sigma ~ exponential(1);
      alpha ~ gamma(2,4);
      mu ~ normal(21,5);
      v ~ beta(1,alpha);
      for (i in 1:N) {
        for (c in 1:M) {
          comp[c] = log(pi[c]) + normal_lpdf(y[i] | mu[c], sigma[c]);
        }
        target += log_sum_exp(comp);
      }
    }
    "
    # y is assumed to already hold the 82 galaxy observations
    D <- list(y=y, N=82, M=4)
    fit <- stan(model_code = stan_model, data = D, iter = 2000, chains = 2)
    summary(fit, pars = c("mu","pi","alpha"), probs = c(0.025,0.975))$summary

The estimated parameters are in Table 4.1 and are similar to those estimated in Example 4.8.

TABLE 4.1
Discrete Mixture, Galaxy Data, TDP Prior
Parameter   Mean    St devn   2.5%    97.5%
μ1          9.76    0.24      9.33    10.33
μ2          20.02   0.54      19.42   21.48
μ3          22.45   1.08      21.20   26.01
μ4          28.65   4.18      22.59   33.96
π1          0.09    0.03      0.04    0.16
π2          0.36    0.20      0.08    0.86
π3          0.44    0.23      0.02    0.79
π4          0.12    0.11      0.01    0.39
α           0.78    0.36      0.23    1.61
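
The stick-breaking recursion in the Stan code, pi[j] = v[j]*(1−v[j−1])*pi[j−1]/v[j−1], is an incremental form of the usual product pi[j] = v[j] ∏(1 − v[l]) over l < j, with the last weight set to the remainder. The Python sketch below checks that the two forms agree on a toy vector; the function names and the values of v are hypothetical, and M ≥ 2 is assumed.

```python
import math

def weights_recursive(v):
    """Weights via the recursion pi[j] = v[j] * (1 - v[j-1]) * pi[j-1] / v[j-1]."""
    M = len(v)
    pi = [v[0]]
    for j in range(1, M - 1):
        pi.append(v[j] * (1.0 - v[j - 1]) * pi[j - 1] / v[j - 1])
    pi.append(1.0 - sum(pi))            # last weight is the remainder; v[M-1] is unused
    return pi

def weights_product(v):
    """Weights via the direct product pi[j] = v[j] * prod_{l<j} (1 - v[l])."""
    M = len(v)
    pi = [v[j] * math.prod(1.0 - v[l] for l in range(j)) for j in range(M - 1)]
    pi.append(1.0 - sum(pi))
    return pi

v = [0.2, 0.5, 0.5, 0.9]                # illustrative stick-breaking fractions, M = 4
print(weights_recursive(v))
print(weights_product(v))
```

The recursion avoids recomputing the running product at each step, at the cost of dividing by v[j−1], which is safe in the sampler because each v[j] lies strictly inside (0, 1) with probability one.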

References
Abrams K, Gillies C, Lambert P (2005) Meta-analysis of heterogeneously reported trials assessing
change from baseline. Statistics in Medicine, 24, 3823–3844.
Abrams K, Lambert P, Sanso B, Shaw S, Marteau T (2000) Meta-analysis of heterogeneously reported
study results: A Bayesian approach, pp 29–64, in Meta-Analysis in Medicine and Health Policy, eds
D Berry, D Stangl. Marcel Dekker.
Agresti A, Hitchcock D (2005) Bayesian inference for categorical data analysis. Statistical Methods and
Applications, 14, 297–330.
Aitchison J, Ho C (1989) The multivariate Poisson-log normal distribution. Biometrika, 76, 643–653.
Aitchison J, Shen S (1980) Logistic-normal distributions: Some properties and uses. Biometrika, 67,
261–272.
Alanko T, Duffy J (1996) Compound Binomial distributions for modeling consumption data. The
Statistician, 45, 269–286.
Albert J (1999) Criticism of a hierarchical model using Bayes factors. Statistics in Medicine, 18, 287–305.
Albert J (2015) Package ‘LearnBayes’: Functions for Learning Bayesian Inference.
https://cran.r-project.org/web/packages/LearnBayes/LearnBayes.pdf
Albert JH, Gupta AK (1982) Mixtures of Dirichlet distributions and estimation in contingency tables.
The Annals of Statistics, 10(4), 1261–1268.
Ando T (2007) Bayesian predictive information criterion for the evaluation of hierarchical Bayesian
and empirical Bayes models. Biometrika, 94, 443–458.
Arends L (2006) Multivariate meta-analysis: Modelling the heterogeneity. Repub/EUR Repository.
http://repub.eur.nl/publications/med_hea
Azzalini A (1985) A class of distributions which includes the normal ones. Scandinavian Journal of
Statistics, 12, 171–178.
Bakbergenuly I, Kulinskaya E (2017) Beta-binomial model for meta-analysis of odds ratios. Statistics
in Medicine, 36, 1715–1734.
Baker R, Jackson D (2008) A new approach to outliers in meta-analysis. Health Care Management
Science, 11(2), 121–131.
Baker R, Jackson D (2016) New models for describing outliers in meta-analysis. Research Synthesis
Methods, 7, 314–328.
Barnard J, McCulloch R, Meng XL (2000) Modeling covariance matrices in terms of standard
deviations and correlations, with applications to shrinkage. Statistica Sinica, 10, 1281–1311.
Basu S (1996) Bayesian tests for unimodality, pp 77–82, in Proceedings of the Section on Bayesian
Statistical Science. American Statistical Association.
Basu S, Chib S (2003) Marginal likelihood and Bayes factors for Dirichlet process mixture models.
Journal of the American Statistical Association, 98(461), 224–235.
Baudry J, Raftery A, Celeux G, Lo K, Gottardo R (2010) Combining mixture components for cluster-
ing. Journal of Computational and Graphical Statistics, 19(2), 332–353.
Bayman E, Chaloner K, Hindman B, Todd M (2013) Bayesian methods to determine performance dif-
ferences and to quantify variability among centers in multi-center trials: The IHAST trial. BMC
Medical Research Methodology, 13, 5.
Beath K (2014) A finite mixture method for outlier detection and robustness in meta-analysis. Research
Synthesis Methods, 5(4), 285–293.
Beath K (2016) metaplus: An R package for the analysis of robust meta-analysis and meta-regression.
The R Journal, 8(1), 5–16.
Besag J, Green P, Higdon D, Mengersen K (1995) Bayesian computation and stochastic systems.
Statistical Science, 10(1), 103–166.
Betancourt M (2017) Identifying Bayesian Mixture Models.
http://mc-stan.org/users/documentation/case-studies/identifying_mixture_models.html
Biernacki C, Celeux G, Govaert G (2000) Assessing a mixture model for clustering with the integrated
completed likelihood. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(7), 719–725.
Bohning D (1999) Computer-Assisted Analysis of Mixtures and Applications: Meta-Analysis, Disease
Mapping and Others. Chapman & Hall, New York.
Browne W, Draper D (2006) A comparison of Bayesian and likelihood-based methods for fitting mul-
tilevel models. Bayesian Analysis, 1, 473–550.
Bulmer M (1974) On fitting the Poisson log-normal distribution to species abundance data. Biometrics,
30, 101–110.
Burke D, Bujkiewicz S, Riley R (2016) Bayesian bivariate meta-analysis of correlated effects: Impact
of the prior distributions on the between-study correlation, borrowing of strength, and joint
inferences. Statistical Methods in Medical Research, 27(2), 428–450.
Burr D (2012) bspmma: An R package for Bayesian semiparametric models for meta analysis. Journal
of Statistical Software, 50, 1–23.
Burr D, Doss H (2005) A Bayesian semi-parametric model for random effects meta analysis. Journal of
the American Statistical Association, 100, 242–251.
Cao G, West M (1996) Practical Bayesian inference using mixtures of mixtures. Biometrics, 52,
1334–1341.
Carvalho V, Branscum A (2017) Bayesian nonparametric inference for the three-class Youden index
and its associated optimal cutoff points. Statistical Methods in Medical Research, 27, 689–700.
Celeux G, Hurn M, Robert C (2000) Computational and inferential difficulties with mixture posterior
distributions. Journal of the American Statistical Association, 95, 957–970.
Cepeda-Benito A, Reynoso N, Erath S (2004) Meta-analysis of the efficacy of nicotine replacement
therapy for smoking cessation: Differences between men and women. Journal of Consulting and
Clinical Psychology, 72, 712–722.
Chelgren N, Adams M, Bailey L, Bury B (2011) Using multilevel spatial models to understand
salamander site occupancy patterns after wildfire. Ecology, 92, 408–421.
Chib S, Hamilton B (2002) Semiparametric Bayes analysis of longitudinal data treatment models.
Journal of Econometrics, 110(1), 67–89.
Chib S, Winkelmann R (2001) Markov chain Monte Carlo analysis of correlated count data. Journal of
Business & Economic Statistics, 19(4), 428–435.
Christiansen C, Morris C (1996) Fitting and checking a two-level Poisson model: modeling patient
mortality rates in heart transplant patients, pp 467–501, in Bayesian Biostatistics, eds D Berry, D
Stangl. Marcel Dekker, New York.
Christiansen C, Morris C (1997) Hierarchical Poisson regression modeling. Journal of the American
Statistical Association, 92, 618–632.
Chung H, Loken E, Schafer J (2004) Difficulties in drawing inferences with finite-mixture models: A
simple example. The American Statistician, 58, 152–158.
Clark J, Gelfand A (2006) Hierarchical Modelling for the Environmental Sciences: Statistical Methods and
Applications. Oxford University Press.
Clayton D, Kaldor J (1987) Empirical Bayes estimates of age-standardized relative risks for use in
disease mapping. Biometrics, 43(3), 671–681.
Conlon E, Song J, Liu A (2007) Bayesian meta-analysis models for microarray data: A comparative
study. BMC Bioinformatics, 8, 80.
Connolly S, Dornelas M, Bellwood D, Hughes T (2009) Testing species abundance models: A new
bootstrap approach applied to Indo-Pacific coral reefs. Ecology, 90(11), 3138–3149.
Consul P (1989) Generalized Poisson Distributions. Marcel Dekker, New York.
Daniels M (1999) A prior for the variance in hierarchical models. Canadian Journal of Statistics, 27,
569–580.
Das S, Dey D (2006) On Bayesian analysis of generalized linear models using the Jacobian technique.
The American Statistician, 60, 264–268.
Das S, Dey D (2007) On Bayesian analysis of generalized linear models: A new perspective. Technical
Report 2007-8, Statistical and Applied Mathematical Sciences Institute, UNC. www.samsi.info
Da Silva A (2009) Bayesian mixture models of variable dimension for image segmentation. Computer
Methods and Programs in Biomedicine, 94(1), 1–14.
Deely N, Smith A (1998) Quantitative refinements for comparisons of institutional performance.
Journal of the Royal Statistical Society: Series A, 161, 5–12.
Delucchi K, Bostrom A (2004) Methods for analysis of skewed data distributions in psychiatric clini-
cal studies: Working with many zero values. The American Journal of Psychiatry, 161, 1159–1168.
DerSimonian R, Laird N (1986) Meta-analysis in clinical trials. Controlled Clinical Trials, 7, 177–188.
Diebolt N, Robert C (1994) Estimation of finite mixture distributions through Bayesian sampling.
Journal of the Royal Statistical Society: Series B, 56, 363–375.
Ding T, Baio G (2016) bmeta: Bayesian Meta-analysis and Metaregression.
http://www.statistica.it/gianluca/software/bmeta/
Diserud O, Engen S (2000) A general and dynamic species abundance model, embracing the lognor-
mal and the gamma models. The American Naturalist, 155, 497–511.
Drton M, Plummer M (2017) A Bayesian information criterion for singular models. Journal of the Royal
Statistical Society: Series B, 79(2), 323–380.
Druyts E, Palmer J, Balijepalli C, Chan K, Fazeli M, Herrera V (2017) Treatment modifying factors of
biologics for psoriatic arthritis: A systematic review and Bayesian meta-regression. Clinical and
Experimental Rheumatology, 35(4), 681–688.
DuMouchel W (1996) Predictive cross-validation of Bayesian meta-analyses, pp 107–127, in eds J
Bernardo, J Berger, A Dawid, A Smith, Bayesian Statistics 5. Oxford University Press.
DuMouchel W, Waternaux C (1992) Discussion of “Hierarchical models for combining information
and for meta-analysis,” by C Morris and S Normand, pp 338–341, in Bayesian Statistics, Vol. 4,
eds J Bernardo, J Berger, A Dawid, A Smith. Clarendon Press, Oxford, UK.
Dunson D, Pillai N, Park J (2007) Bayesian density regression. Journal of the Royal Statistical Society:
Series B, 69, 163–183.
Efron B (1986) Double exponential families and their use in generalized linear regression. Journal of
the American Statistical Association, 81(395), 709–721.
Escobar M, West M (1998) Computing nonparametric hierarchical models, in Practical Nonparametric
and Semiparametric Bayesian Statistics, eds D Dey, P Muller, D Sinha. Springer-Verlag.
Everitt B, Hand D (1981) Finite Mixture Distributions. Chapman & Hall, London, UK.
Everson P, Morris C (2000) Inference for multivariate normal hierarchical models. Journal of the Royal
Statistical Society: Series B, 62, 399–412.
Fahrmeir L, Lang S (2001) Bayesian inference for generalized additive mixed models based on Markov
random field priors. Journal of the Royal Statistical Society: Series C (Applied Statistics), 50(2), 201–220.
Farrell PJ, Groshen S, MacGibbon B, Tomberlin T (2010) Outlier detection for a hierarchical Bayes
model in a study of hospital variation in surgical procedures. Statistical Methods in Medical
Research, 19(6), 601–619.
Fernandez C, Steele M (1998) On Bayesian modeling of fat tails and skewness. Journal of the American
Statistical Association, 93, 359–371.
Ferreira M, Gamerman D (2000) Dynamic generalized linear models, pp 57–72, in Generalized Linear
Models: A Bayesian Perspective, eds D Dey, S Ghosh, B Mallick. Marcel Dekker, New York.
Fordyce J A, Gompert Z, Forister M L, Nice C C (2011) A hierarchical Bayesian approach to ecological
count data: A flexible tool for ecologists. PLOS ONE, 6(11), e26785.
Frees E (2004) Longitudinal and Panel Data. Cambridge University Press.
Friede T, Röver C, Wandel S, Neuenschwander B (2017) Meta-analysis of few small studies in orphan
diseases. Research Synthesis Methods, 8(1), 79–91.
Fruhwirth-Schnatter S (2001) Markov chain Monte Carlo estimation of classical and dynamic
switching and mixture models. Journal of the American Statistical Association, 96, 194–209.
Fruhwirth-Schnatter S (2006) Finite Mixture and Markov Switching Models. Springer.
Fruhwirth-Schnatter S, Otter T, Tuchler R (2004) Bayesian analysis of the heterogeneity model. Journal
of Business & Economic Statistics, 22, 2–15.
Gao S (2004) Combining binomial data using the logistic normal. Journal of Statistical Computation and
Simulation, 74, 293–306.
Gelfand A (1996) Model determination using sampling-based methods, pp 145–161, in Markov Chain
Monte Carlo in Practice, eds W Gilks, S Richardson, D Spiegelhalter. Chapman & Hall/CRC.
Gelman A (2006) Prior distributions for variance parameters in hierarchical models. Bayesian Analysis,
1, 515–534.
Gelman A, Carlin J, Stern H, Dunson D, Vehtari A, Rubin D (2014) Bayesian Data Analysis. CRC, Boca
Raton, FL.
Gelman A, Jakulin A, Pittau M, Su Y (2008) A weakly informative default prior distribution for logis-
tic and other regression models. The Annals of Applied Statistics, 2(4), 1360–1383.
Genton M (2004) Skew-Elliptical Distributions and Their Applications: A Journey Beyond Normality,
Edited Volume. Chapman & Hall/CRC, Boca Raton, FL.
George E, Makov U, Smith A (1993) Conjugate likelihood distributions. Scandinavian Journal of
Statistics, 20, 147–156.
George E, Zhang Z (2001) Posterior propriety in some hierarchical exponential family models, in
Data Analysis from Statistical Foundations: Festschrift in Honor of Donald A.S. Fraser, ed A Saleh.
Nova Science Publishers, New York.
Geweke J (2007) Interpretation and inference in mixture models: Simple MCMC works. Computational
Statistics & Data Analysis, 51, 3529–3550.
Gilks WR, Wild P (1992) Adaptive rejection sampling for Gibbs sampling. Journal of the Royal Statistical
Society: Series C (Applied Statistics), 41(2), 337–348.
Gill J, Casella G (2009) Nonparametric priors for ordinal Bayesian social science models: Specification
and estimation. Journal of the American Statistical Association, 104, 453–464.
Gompert Z, Fordyce J (2015) Package ‘bayespref’: Hierarchical Bayesian Analysis of Ecological Count
Data. https://cran.r-project.org/web/packages/bayespref/bayespref.pdf
Gopalan R, Berry D (1998) Bayesian multiple comparisons using Dirichlet process priors. Journal of
the American Statistical Association, 93, 1130–1139.
Green P, Richardson S (2001) Modelling heterogeneity with and without the Dirichlet process.
Scandinavian Journal of Statistics, 28, 355–375.
Greenland S, Draper D (1998) Exchangeability, in Encyclopedia of Biostatistics, eds P Armitage,
T Colton. Wiley, London, UK.
Greco T, Landoni G, Biondi-Zoccai G, D’Ascenzo F, Zangrillo A (2016) A Bayesian network meta-anal-
ysis for binary outcome: How to do it. Statistical Methods in Medical Research, 25(5), 1757–1773.
Griffin J (2016) An adaptive truncation method for inference in Bayesian nonparametric models.
Statistics and Computing, 26(1–2), 423–441.
Gruen B, Plummer M (2015) BayesMix: An R Package for Bayesian Mixture Modeling.
http://ifas.jku.at/gruen/BayesMix/
Guo J, Riebler A (2016) meta4diag: Bayesian Bivariate Meta-analysis of Diagnostic Test Studies for
Routine Practice. https://arxiv.org/pdf/1512.06220.pdf
Guo J, Riebler A, Rue H (2017) Bayesian bivariate meta-analysis of diagnostic test studies with inter-
pretable priors. Statistics in Medicine, 36(19), 3039–3058.
Guolo A, Varin C (2017) Random-effects meta-analysis: The number of studies matters. Statistical
Methods in Medical Research, 26(3), 1500–1518.
Gupta S K (2012) Use of Bayesian statistics in drug development: Advantages and challenges.
International Journal of Applied and Basic Medical Research, 2(1), 3–6.
Gustafson P, Hossain S, MacNab Y (2006) Conservative priors for hierarchical models. Canadian
Journal of Statistics, 34, 377–390.
Hanson T, Branscum A, Johnson W (2005) Nonparametric Bayesian data analysis: An introduction,
in Handbook of Statistics, Vol. 25, eds C Rao, D Dey. Elsevier.
Hanson T, Johnson W (2002) Modeling regression error with a mixture of polya trees. Journal of the
American Statistical Association, 97(460), 1020–1033.
Hirano K (1998) Nonparametric Bayes models for longitudinal earnings data, in Practical Nonparametric
and Semiparametric Bayesian Statistics, eds D Dey, P Muller, D Sinha. Springer-Verlag.
Hirano K (2002) Semiparametric Bayesian inference in autoregressive panel data models. Econometrica,
70, 781–799.
Hoff P (2003) Nonparametric modelling of hierarchically exchangeable data. Technical Report 421,
Department of Statistics, University of Washington.
Howley P, Gibberd R (2003) Using hierarchical models to analyse clinical indicators: A comparison
of the gamma-Poisson and beta-binomial models. International Journal for Quality in Health Care,
15, 319–329.
Hu J, Reiter J, Wang Q (2018) Dirichlet process mixture models for modeling and generating syn-
thetic versions of nested categorical data. Bayesian Analysis, 13(1), 183–200.
Hurn M, Justel A, Robert C (2003) Estimating mixtures of regressions. Journal of Computational and
Graphical Statistics, 12, 55–79.
Hurtado Rua S, Mazumdar M, Strawderman R (2015) The choice of prior distribution for a covariance
matrix in multivariate meta-analysis: A simulation study. Statistics in Medicine, 34(30), 4083–4104.
Imai K, Ying L, Strauss A (2008) Bayesian and likelihood inference for 2×2 ecological tables: An
incomplete data approach. Political Analysis, 16, 41–69.
Ishwaran H, James L (2001) Gibbs sampling methods for stick-breaking priors. Journal of the American
Statistical Association, 96, 161–173.
Ishwaran H, James L (2002) Approximate Dirichlet process computing in finite normal mixtures:
Smoothing and prior information. Journal of Computational and Graphical Statistics, 11, 508–532.
Ishwaran H, Zarepour M (2000) Markov chain Monte Carlo in approximate Dirichlet and beta two-
parameter process hierarchical models. Biometrika, 87, 371–390.
Ishwaran H, Zarepour M (2002) Exact and approximate sum-representations for the Dirichlet pro-
cess. Canadian Journal of Statistics, 30, 269–283.
Jackson D, White I, Riley R (2013) A matrix based method of moments for fitting the multivariate
random effects model for meta-analysis and meta-regression. Biometrical Journal, 55(2), 231–245.
Jara A (2007) Applied Bayesian non- and semi-parametric inference using DPpackage. R News, 7/3,
17–26.
Jara A, Hanson T, Quintana F, Müller P, Rosner G (2011) DPpackage: Bayesian semi-and nonparamet-
ric modeling in R. Journal of Statistical Software, 40(5), 1–30.
Jarque C, Bera A (1980) Efficient tests for normality, homoscedasticity and serial independence of
regression residuals. Econometric Letters, 6, 255–259.
Jiang J, Lahiri P (2006) Mixed model prediction and small area estimation. Test, 15(1), 1.
Jullion A, Lambert P (2007) Robust specification of the roughness penalty prior distribution in
spatially adaptive Bayesian P-splines models. Computational Statistics & Data Analysis, 51(5),
2542–2558.
Karabatsos G (2016) A menu-driven software package for Bayesian regression analysis. The ISBA
Bulletin, 22(4), 13–16.
Karabatsos G (2017) A menu-driven software package of Bayesian nonparametric (and parametric)
mixed models for regression analysis and density estimation. Behavior Research Methods, 49(1),
335–362.
King G (1997) A Solution to the Ecological Inference Problem: Reconstructing Individual Behavior from
Aggregate Data. Princeton University Press, Princeton, NJ.
King G, Rosen O, Tanner M (eds) (2004) Ecological Inference: New Methodological Strategies. Cambridge
University Press, New York.
Kleinman KP, Ibrahim JG (1998) A semiparametric Bayesian approach to the random effects model.
Biometrics, 54(3), 921–938.
Kruschke J, Vanpaemel W (2015) Bayesian estimation in hierarchical models, pp 279–299, in The
Oxford Handbook of Computational and Mathematical Psychology, eds J R Busemeyer, Z Wang, J T
Townsend, A Eidels. Oxford University Press, Oxford, UK.
Kuhan G, Marshall E C, Abidia A F, Chetter I C, McCollum P (2002) A Bayesian hierarchical approach
to comparative audit for carotid surgery. European Journal of Vascular and Endovascular Surgery,
24(6), 505–510.
Kulinskaya E, Olkin I (2014) An overdispersion model in meta-analysis. Statistical Modelling, 14(1),
49–76.
Lambert P, Sutton A, Burton P, Abrams K, Jones D (2005) How vague is vague? A simulation study
of the impact of the use of vague prior distributions in MCMC using WinBUGS. Statistics in
Medicine, 24(15), 2401–2428.
Larson J, Soule S (2009) Sector-level dynamics and collective action in the United States, 1965–1975.
Mobilization: An International Quarterly, 14(3), 293–314.
Laud PW, Ibrahim JG (1995) Predictive model selection. Journal of the Royal Statistical Society: Series B
(Methodological), 57(1), 247–262.
Lee J, Sabavala D (1987) Bayesian estimation and prediction for the beta binomial model. Journal of
Business and Economic Statistics, 5, 357–367.
Lee K, Thompson S (2008) Flexible parametric models for random-effects distributions. Statistics in
Medicine, 27, 418–434.
Leisch F (2004) FlexMix: A general framework for finite mixture models and latent class regression in
R. Journal of Statistical Software, 11(8), 1–18.
Lenk P (1988) The logistic normal distribution for Bayesian nonparametric predictive densities.
Journal of the American Statistical Association, 83, 509–516.
Lenk P, Desarbo W (2000) Bayesian inference for finite mixtures of generalized linear models with
random effects. Psychometrika, 65, 93–119.
Leonard T (1973) A Bayesian method for histograms. Biometrika, 60, 297–308.
Leslie D, Kohn R, Nott D (2007) A general approach to heteroscedastic linear regression. Statistics and
Computing, 17, 131–146.
Lin T, Lee J, Hsieh W (2007b) Robust mixture modeling using the skew t distribution. Statistics and
Computing, 17, 81–92.
Lin T, Lee J, Ni H (2004) Bayesian analysis of mixture modelling using the multivariate t distribution.
Statistics and Computing, 14, 119–130.
Lin T, Lee J, Yen S (2007a) Finite mixture modelling using the skew normal distribution. Statistica
Sinica, 17, 909–927.
Lindley D, Smith A (1972) Bayes estimates for the linear model. Journal of the Royal Statistical Society:
Series B, 34, 1–41.
Liu J, Dey D (2007) Hierarchical overdispersed Poisson model with macrolevel autocorrelation.
Statistical Methodology, 4(3), 354–370.
Lu G, Ades AE (2009) Modeling between-trial variance structure in mixed treatment comparisons.
Biostatistics, 10(4), 792–805.
Makuch R, Stephens M, Escobar M (1989) Generalized binomial models to examine the historical
control assumption in active control equivalence studies. The Statistician, 38, 61–70.
Marin J, Mengersen K, Robert C (2005) Bayesian modelling and inference on mixtures of distribu-
tions, in Handbook of Statistics, Vol. 25, eds D Dey, C Rao. Elsevier.
Markham F, Young M, Doran B, Sugden M (2017) A meta-regression analysis of 41 Australian prob-
lem gambling prevalence estimates and their relationship to total spending on electronic gam-
ing machines. BMC Public Health, 17(1), 495.
Marshall E, Spiegelhalter D (1998) Comparing institutional performance using Markov chain Monte
Carlo methods, in Statistical Analysis of Medical Data: New Developments, eds B Everitt, G Dunn.
Arnold.
Marshall E, Spiegelhalter D (2007) Simulation-based tests for divergent behaviour in hierarchical
models. Bayesian Analysis, 2, 409–444.
Mavridis D, Salanti G (2013) A practical introduction to multivariate meta-analysis. Statistical Methods
in Medical Research, 22(2), 133–158.
McLachlan G, Rathnayake S (2014) On the number of components in a Gaussian mixture model.
Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 4(5), 341–355.
Militino A, Ugarte M, Dean C (2001) The use of mixture models for identifying high risks in disease
mapping. Statistics in Medicine, 20, 2035–2049.
Mohr D (2006) Bayesian identification of clustered outliers in multiple regression. Computational
Statistics & Data Analysis, 51, 3955–3967.
Moreno E, Vázquez-Polo F, Negrn M (2018) Bayesian meta-analysis: The role of the between-sample
heterogeneity. Statistical Methods in Medical Research, 27(12), 3643–3657.
Muller P, Erkanli A, West M (1996) Bayesian curve fitting using multivariate normal mixtures.
Biometrika, 83, 67–79.
Ohlssen D, Sharples L, Spiegelhalter D (2007) Flexible random-effects models using Bayesian
semi-parametric models: Applications to institutional comparisons. Statistics in Medicine, 26,
2088–2112.
Papastamoulis P (2016) label.switching: An R package for dealing with the label switching problem
in MCMC outputs. Journal of Statistical Software, 69. https://www.jstatsoft.org/article/view/
v069c01.
Parmigiani G (2002) Modeling in Medical Decision Making: A Bayesian Approach. Wiley, New York.
Pastor N (2003) Methods for the analysis of explanatory linear regression models with missing data
not at random. Quality and Quantity, 37, 363–376.
Pauler D, Wakefield J (2000) Modeling and implementation issues in Bayesian meta-analysis,
pp 205–230, in Bayesian Meta-Analysis, eds Stangl D, Berry D. Marcel Dekker.
Pérez M, Pericchi L, Ramrez I (2017) The Scaled Beta2 distribution as a robust prior for scales. Bayesian
Analysis, 12(3), 615–637.
Pitman J, Yor M (1997) The two-parameter Poisson-Dirichlet distribution derived from a stable sub-
ordinator. Annals of Probability, 25, 855–900.
Podlich H, Faddy M, Smyth G (2004) Semi-parametric extended Poisson process models for count
data. Statistics and Computing, 14, 311–321.
Prabhakaran S, Azizi E, Carr A, Pe’er D (2016) Dirichlet process mixture model for correcting techni-
cal variation in single-cell gene expression data, pp 1070–1079, in Proceedings of the International
Conference on Machine Learning, New York.
Prevost T, Abrams K, Jones D (2000) Hierarchical models in generalized synthesis of evidence: An
example based on studies of breast cancer screening. Statistics in Medicine, 19, 3359–3376.
Quintana F, Tam W (1996) Bayesian estimation of beta-binomial models by simulating posterior den-
sities. Revista de la Sociedad Chilena de Estadstica, 13, 43–56.
Rao J (2003) Small Area Estimation. Wiley, New York.
Borrowing Strength via Hierarchical Estimation 163

Rattanasiri S, Bohning D, Roianavipart P, Athipanyakom S (2004) A mixture model application in


disease mapping of malaria. The Southeast Asian Journal of Tropical Medicine and Public Health,
35, 38–47.
Redner R, Walker H (1984) Mixture densities, maximum likelihood, and the EM algorithm. S1Ah-I
Review, 26, 195–239.
Rhodes K, Turner R, White I, Jackson D, Spiegelhalter D, Higgins J (2016) Implementing informative
priors for heterogeneity in meta-analysis using meta-regression and pseudo data. Statistics in
Medicine, 35(29), 5495–5511.
Richardson S, Green P (1997) On Bayesian analysis of mixtures with an unknown number of compo-
nents. Journal of the Royal Statistical Society: Series B, 59, 731–758.
Riley RD, Dodd SR, Craig JV, Thompson JR, Williamson PR (2008) Meta-analysis of diagnostic test
studies using individual patient data and aggregate data. Statistics in Medicine, 27(29), 6111–6136.
Robert C, Mengersen K (1999) Reparameterisation issues in mixture modelling and their bearing on
the Gibbs sampler. Computational Statistics and Data Analysis, 29, 325–343.
Robert C, Titterington D (1998) On perfect simulation for some mixtures of distributions. Statistics
and Computing, 8, 145–158.
Roberts S, Husmeier D, Rezek I, Penny W (1998) Bayesian approaches to Gaussian mixture model-
ing. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20, 1133–1142.
Robertson C, Fryer J (1968) Some descriptive properties of normal mixtures. Scandinavian Actuarial
Journal, 52, 137–146.
Rodrguez-Avi J, Conde-Sánchez A, Sáez-Castillo A, Olmo-Jiménez M (2007) A generalization of the
beta–binomial distribution. Journal of Applied Statistics, 56, 51–61.
Roeder K, Wasserman L (1997) Practical Bayesian density estimation using mixtures of normals.
Journal of the American Statistical Association, 92, 894–902.
Rouder J N, Morey R, Pratte M (2013) Hierarchical Bayesian models, in The New Handbook of
Mathematical Psychology, Volume 1: Measurement and Methodology, eds W H Batchelder, H
Colonius, E Dzhafarov, J I Myung. Cambridge University Press, London, UK.
Sahu S, Dey D, Branco M (2003) A new class of multivariate skew distributions with applications to
Bayesian regression models. The Canadian Journal of Statistics, 31: 129–150.
Savage J (2016) Finite Mixture Models in Stan. http://modernstatisticalworkflow.blogspot.
co.uk/2016/10/finite-mixture-models-in-stan.html
Scollnik DP (1995) Bayesian analysis of two overdispersed Poisson regression models. Communications
in Statistics-Theory and Methods, 24(11), 2901–2918.
Scrucca L, Fop M, Murphy T, Raftery A (2016) mclust 5: Clustering, classification and density estima-
tion using gaussian finite mixture models. The R Journal, 8(1), 289.
Sethuraman J (1994) A constructive definition of Dirichlet priors. Statistica Sinica, 4, 639–650.
Silliman N (1997) Hierarchical selection models with applications in meta-analysis. Journal of the
American Statistical Association, 92, 926–936.
Simpson D P, Rue H, Martins T G, Riebler A, Sørbye S H (2016) Penalising model component com-
plexity: A principled, practical approach to constructing priors. Statistical Science (Forthcoming).
arXiv preprint arXiv:1403.4630.
Smith TC, Spiegelhalter DJ, Thomas A (1995) Bayesian approaches to random-effects meta-analysis:
A comparative study. Statistics in Medicine, 14(24), 2685–2699.
Spiegelhalter D (1999) Surgical audit: Statistical lessons from Nightingale and Codman. Journal of the
Royal Statistical Society: Series A, 162, 45–58.
Spiegelhalter D (2005) Handling over-dispersion of performance indicators. Quality and Safety in
Health Care, 14, 347–351.
Spiegelhalter D, Abrams K, Myles J (2004) Bayesian Approaches to Clinical Trials and Health-Care
Evaluation. Wiley, New York.
Spittal M J, Pirkis J, Gurrin L (2015) Meta-analysis of incidence rate data in the presence of zero
events. BMC Medical Research Methodology, 15(1), 42.
Staggs V, Gajewski B (2017) Bayesian and frequentist approaches to assessing reliability and precision
of health-care provider quality measures. Statistical Methods in Medical Research, 26(3), 1341–1349.
164 Bayesian Hierarchical Models

Taylor-Rodrguez D, Kaufeld K, Schliep E, Clark J, Gelfand A (2017) Joint species distribution model-
ing: Dimension reduction using Dirichlet processes. Bayesian Analysis, 12(4), 939–967.
Teather D (1984) The estimation of exchangeable binomial parameters. Communications in Statistics,
Part A, 13, 671–680.
van Dongen S (2006) Prior specification in Bayesian statistics: Three cautionary tales. Journal of
Theoretical Biology, 242: 90–100.
van Houwelingen H, Arends L, Stiinen T (2002) Advanced methods in meta-analysis: Multivariate
approach and meta-regression. Statistics in Medicine, 21, 589–624.
Verde PE (2018) bamdit: An R package for Bayesian meta-analysis of diagnostic test data. Journal of
Statistical Software, Articles, 86, 1–32.
Viechtbauer W (2010) Conducting meta-analyses in R with the metafor package. Journal of Statistical
Software, 36(3), 1–48.
Viechtbauer W (2017) Package ‘metafor’. https://cran.r-project.org/web/packages/metafor/meta-
for.pdf
Walfish S (2006) A review of statistical outlier methods. Pharmaceutical Technology, 30(11), 82–86.
Walker S, Damien P, Laud P, Smith A (1999) Bayesian nonparametric inference for random distribu-
tions and related functions. Journal of the Royal Statistical Society: Series B, 61, 485–527.
Wang C, Blei D (2017) A general method for robust Bayesian modeling. Bayesian Analysis, 13(4),
1163–1191.
Warn D, Thompson S, Spiegelhalter D (2002) Bayesian random effects meta-analysis of trials with
binary outcomes: Methods for the absolute risk difference and relative risk scales. Statistics in
Medicine, 21, 1601–1623.
Wasserman L (2000) Asymptotic inference for mixture models using data-dependent priors. Journal
of the Royal Statistical Society: Series B, 62, 159–180.
Weems K, Smith P (2004) On robustness of maximum likelihood estimates for Poisson-lognormal
models. Statistics & Probability Letters, 66, 189–196.
Wei Y, Higgins JP (2013) Bayesian multivariate meta-analysis with multiple outcomes. Statistics in
Medicine, 32(17), 2911–2934.
West M (1984) Outlier models and prior distributions in Bayesian linear regression. Journal of the
Royal Statistical Society: Series B, 46, 431–439.
West M, Muller P, Escobar M (1994) Hierarchical priors and mixture models, with application in
regression and density estimation, pp 363–386, in Aspects of Uncertainty: A Tribute to D. V.
Lindley, eds P Freeman, A Smith. Wiley, New York.
Williams D, Rast P, Bürkner P (2018) Bayesian Meta-Analysis with Weakly Informative Prior
Distributions. PsyArXiv. https://andrewgelman.com/wp-content/uploads/2018/01/bayes_
donny.pdf
Winkelmann R, Zimmermann KF (1991) A new approach for modeling economic count data.
Economics Letters, 37(2), 139–143.
Winship DA (1978) Cimetidine in the treatment of duodenal ulcer. Gastroenterology, 74, 402–406.
Young-Xu Y, Chan K (2008) Pooling overdispersed binomial data to estimate event rate. BMC Medical
Research Methodology, 8, 58.
Yu K, Moyeed R (2001) Bayesian quantile regression. Statistics and Probability Letters, 54(4), 437–447.
Zhang J, Fu H, Carlin B (2015) Detecting outlying trials in network meta-analysis. Statistics in
Medicine, 34(19), 2695–2707.
Zhao Y, Staudenmayer J, Coull B, Wand M (2006) General design Bayesian generalized linear mixed
models. Statistical Science, 21, 35–51.
Zollinger A, Davison A, Goldstein D (2015) Meta-analysis of incomplete microarray studies.
Biostatistics, 16(4), 686–700.
5
Time Structured Priors

5.1 Introduction
A time series is a sequence of stochastic observations which are ordered in time, most
often at equally spaced discrete times $t = 1, \ldots, T$, though extensions to unequally spaced
intervals are relatively straightforward (Lee and Nelder, 2001). Major goals of time series
analysis include modelling the interrelationship of variables evolving jointly through time,
as in econometric growth models (Paap and van Dijk, 2003), forecasting future values of
time series variables (Beck, 2004), and identifying the structural components of a sequence
of observations (Huerta and West, 1999). In the analysis of temporal data, one generally
expects positive covariation between observations that are close to each other in time,
so that exchangeable priors are not appropriate. While time series are sometimes anal-
ysed exchangeably, at least within subgroups of the data, as in change-point models (Mira
and Petrone, 1996), in most applications, there is a gain from modelling temporal covaria-
tion. Hence, hierarchical priors for time series modelling are typically structured in the
sense of explicitly recognising adjacency in time as the basis for smoothing or prediction.
Hierarchical methods also assist in identifying underlying relatively smooth or recurring
features of the data, for example, underlying trends or seasonal effects.
Bayesian methods are widely applied to autoregressive moving average models, without
necessarily imposing the stationarity restrictions and preliminary detrending that fea-
ture in classical estimation. However, a general scheme for specifying priors for modelling
time series data is provided by the state-space approach, considered in Sections 5.3 and
5.4 (Harvey et al., 2006; West, 2013; Giordani et al., 2011; Petris et al., 2009), which includes
ARMA (autoregressive moving average) models as special cases. State-space models rec-
ognise multiple underlying components in time series, with the priors governing the
evolution of the components under an expectation of smoothness. The linear state-space
(or dynamic linear model) specification for the changing level of a univariate continuous
response yt has the form

$$y_t = \beta_t X_t + u_t,$$

$$\beta_t = \beta_{t-1} G_t + w_t,$$

where the errors

$$u_t \sim N(0, V_t),$$

$$w_t \sim N(0, W_t),$$

are unstructured white noise, Xt is a predictor or design matrix, and Gt is a known matrix
governing the evolution of the state vector βt (Durbin, 2000; West and Harrison, 1997).
The time structured latent effects βt may include level, trend, seasonal, or cyclical effects.
Taking ut and wt to be normal leads to the normal dynamic linear model (West, 1998),
with extension to generalised linear model forms for discrete data leading to dynamic
generalised linear models (West et al., 1985). State-space principles can also be applied to
model stochastically evolving variances, as in stochastic volatility models (Kim et al., 1998;
Jacquier et al., 2002); see Section 5.5.
While there may be benefits from borrowing strength methods that take account of cor-
relations between units, the use of multiple random effects to represent unobserved com-
ponents in time raises potential identification issues (Auger-Méthé et al., 2016; Knape, 2008).
For example, priors for correlated effects in time may specify differences in effects between
adjacent units without specifying the mean level of the effects. MCMC methods may then
require centring of the effects during sampling to ensure identification of other param-
eters. Methods for smoothing or interpolation in time may also need to retain robustness
to take account of regime shifts, or to accommodate temporal outliers. Structured priors
assume relatively smooth variation over adjacent units, and their parameters may be dis-
torted if mechanisms are not incorporated for accommodating extreme points.
There is a wide range of time series analysis options in R using frequentist estimation
packages (https://cran.r-project.org/web/views/TimeSeries.html) which may be useful
for comparative purposes. Bayesian computing options in R for time series include bsts,
particularly for state-space modelling (Scott, 2017); BMR, Bayesian Macroeconometrics in
R (https://github.com/kthohr/BMR/tree/master/man); stochvol for stochastic volatility
analysis (Kastner and Hosszejni, 2016); and tsPI (Helske, 2017). Generic packages such as
rstan and R-INLA may facilitate estimation and identification in complex random effects
time series models (Monnahan et al., 2017; Betancourt and Girolami, 2015).
The chapter below considers schemes for modelling correlated observations and latent
effects in time series. Sections 5.2 and 5.3 consider autoregressive and state-space priors
for time series analysis, while Section 5.4 considers state-space methods for discrete time
series. Section 5.5 considers Bayesian approaches to stochastic volatility, and Section 5.6
considers models adaptive to temporal discontinuities.

5.2 Modelling Temporal Structure: Autoregressive Models


Many time series show evidence of serial dependence in the observations or error terms,
leading to what are sometimes denoted as observation- and parameter-driven models,
respectively (Oh and Lim, 2001). A widely used model for expressing such serial depen-
dence is the lag p autoregressive or AR(p) model. An AR(p) scheme for dependent outcomes
yt in a normal linear framework is represented by

$$y_t = \phi_0 + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \ldots + \phi_p y_{t-p} + u_t, \quad t = 1, \ldots, T,$$

where the innovation errors $u_t \sim N(0, \sigma^2)$ are homoscedastic white noise, independent of
each other and of the lagged y values $\{y_{t-1}, \ldots, y_{t-p}\}$. So $E(u_t u_{t-s}) = E(u_{t-j} u_{t-j-s}) = 0$ for all $s \neq 0$ and j.
Note that a full likelihood analysis will refer to p latent preseries values (Marriott et al.,
1996), with Naylor and Marriott (1996) suggesting that preseries values follow a heavy tailed
version of the density assumed for the observed series, for instance $(y_0, y_{-1}, \ldots, y_{1-p})$ as
Student t with variance $\sigma^2$ and low degrees of freedom $\nu$. Autoregressive dependence may
also be present in error terms, such that

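$$y_t = \phi_0 + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \ldots + \phi_{p_1} y_{t-p_1} + \varepsilon_t,$$

$$\varepsilon_t = \rho_1 \varepsilon_{t-1} + \rho_2 \varepsilon_{t-2} + \ldots + \rho_{p_2} \varepsilon_{t-p_2} + u_t,$$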

where the ut are IID.


Furthermore, moving average effects may occur, whereby lagged values of the white noise
disturbances $u_t$ have an impact on $y_t$. A lag q moving average effect, combined with a
lag p effect in the $y_t$ series, provides the ARMA(p, q) model

$$y_t = \phi_0 + \phi_1 y_{t-1} + \ldots + \phi_p y_{t-p} + u_t + \gamma_1 u_{t-1} + \gamma_2 u_{t-2} + \ldots + \gamma_q u_{t-q},$$

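with assumptions as in Chib and Greenberg (1994). Assuming the y-series is centred around its
mean, and defining the backshift operator by $B y_t = y_{t-1}$, one has
$y_t - \phi_1 y_{t-1} - \ldots - \phi_p y_{t-p} = (1 - \phi_1 B - \ldots - \phi_p B^p) y_t = \Phi(B) y_t$,
and, with $\Gamma(B) = 1 + \gamma_1 B + \ldots + \gamma_q B^q$, the ARMA(p, q) model can be written

$$\Phi(B) y_t = \Gamma(B) u_t.$$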

As for other regressions, collinearity may occur, and parameter selection for the ARMA(p,q)
may include shrinkage priors (Schmidt and Makalic, 2013) and RJMCMC (Ehlers and
Brooks, 2004).
Classical estimation methods typically require stationarity and constant variances in
estimating such models. Stationarity is equivalent to the roots of $\Phi(B) = 1 - \phi_1 B - \phi_2 B^2 - \ldots - \phi_p B^p$
being outside the unit circle, and invertibility refers to the same condition on the roots of
$\Gamma(B)$. Classical analysis therefore typically involves preliminary data differencing or transformation
to gain stationarity, or regression to remove trend (e.g. Abraham and Ledolter, 1983, p.225), with the
actual model then applied to differenced data or to regression residuals. To assess whether
stationarity has been achieved, one can consider the autocorrelation sequence of model
residuals: a stationary process should show a sequence fading to zero at high lags, whereas
significant values at high lags indicate nonstationarity. In Bayesian analyses, it is com-
mon to estimate parameters without presuming stationarity (or invertibility), but instead
obtain the posterior probabilities of stationarity via monitoring the sampled parameters
(McCulloch and Tsay, 1994; Marriott et al., 1996).
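
The monitoring idea can be sketched in a few lines. This is a minimal illustration in Python rather than the rstan code used for the examples. The function name `is_stationary` and the synthetic draws (generated from independent normals, not from any fitted model) are assumptions made for illustration. Stationarity is checked with the step-down (reverse Levinson-Durbin) recursion, which converts the AR coefficients to partial autocorrelations; the process is stationary when all of these lie strictly inside (-1, 1), which is equivalent to the root condition.

```python
import random

def is_stationary(phi):
    """Return True if AR coefficients phi = [phi_1, ..., phi_p] define a
    stationary process, using the step-down (reverse Levinson-Durbin) recursion."""
    coef = list(phi)
    while coef:
        r = coef[-1]                      # partial autocorrelation at the current order
        if abs(r) >= 1.0:
            return False
        k = len(coef)
        # coefficients of the implied AR(k-1) model
        coef = [(coef[j] + r * coef[k - 2 - j]) / (1.0 - r * r) for j in range(k - 1)]
    return True

random.seed(1)
# Synthetic "posterior draws" of (phi_1, phi_2), for illustration only
draws = [(random.gauss(0.9, 0.1), random.gauss(0.05, 0.05)) for _ in range(5000)]

# Posterior probability of stationarity = share of sampled parameter vectors
# that fall in the stationary region
prob_stationary = sum(is_stationary(d) for d in draws) / len(draws)
print(round(prob_stationary, 3))
```

In an MCMC setting the same indicator would be evaluated on each retained sample of the AR coefficients, and its average reported.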

Example 5.1 Southern Oscillation Index


This example illustrates the estimation of ARMA models via the rstan package. The data are the
NINO3.4 index (as in the R package tseries, with T = 598), this being one of several El Niño/
Southern Oscillation (ENSO) indicators based on sea surface temperatures. The Nino 3.4
Region is bounded by 120W–170W and 5S–5N. Use of the aic.wge option from the R package
tswge suggests the best fit (under classical estimation) to be for an ARMA(4,0,1) model.
Estimates from ARMA models may be affected by the specification of the initial con-
ditions, and we consider first the estimation of the ARMA(1,1) model. Thus

$$y_t = \phi_0 + \phi_1 y_{t-1} + u_t + \gamma_1 u_{t-1},$$

where no stationarity constraints are imposed on ϕ1 or γ1. The first observation y1 is
included in the estimation, and a composite fixed effect parameter is assumed for
ϕ1y0 + γ1u0, referring to latent preseries data. Classical estimation via the tseries ARMA
option provides estimates (mean, s.e.) of 3.67 (0.58), 0.86 (0.02), and 0.26 (0.03) for ϕ0, ϕ1,
and γ1 respectively. Estimation using rstan including the first observation provides cor-
responding posterior mean (sd) estimates of 3.76 (0.59), 0.86 (0.02), and 0.27 (0.03). The
LOO-IC (leave-one-out information criterion) is 615.
Estimates may also be obtained by conditioning on the first observation (i.e. not
including that point in the likelihood). The rstan estimates of ϕ0, ϕ1, and γ1 on this basis
are 3.67 (0.58), 0.86 (0.02) and 0.26 (0.03).
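
The difference between including and conditioning on the first observation can be made concrete through the residual recursion $u_t = y_t - \phi_0 - \phi_1 y_{t-1} - \gamma_1 u_{t-1}$. The sketch below is written in Python and is not the rstan code behind the reported estimates. The function name, the argument `pre` (standing for the composite preseries term ϕ1y0 + γ1u0), the convention of setting the first residual to zero when conditioning on y1, and the data values are all assumptions for illustration.

```python
import math

def arma11_loglik(y, phi0, phi1, gamma1, sigma, pre=None):
    """Gaussian log-likelihood of an ARMA(1,1) model.

    pre=None : condition on y[0] (excluded from the likelihood, with u_1 set to 0).
    pre=value: include y[0], with `pre` standing in for phi1*y_0 + gamma1*u_0.
    """
    const = -0.5 * math.log(2.0 * math.pi) - math.log(sigma)
    loglik = 0.0
    if pre is None:
        u_prev = 0.0
    else:
        u_prev = y[0] - phi0 - pre
        loglik += const - 0.5 * (u_prev / sigma) ** 2
    for t in range(1, len(y)):
        # innovation implied by the ARMA(1,1) equation
        u = y[t] - phi0 - phi1 * y[t - 1] - gamma1 * u_prev
        loglik += const - 0.5 * (u / sigma) ** 2
        u_prev = u
    return loglik

y = [26.1, 26.4, 27.0, 27.3, 26.9, 26.5]   # made-up values, not the NINO3.4 series
print(arma11_loglik(y, 3.7, 0.86, 0.26, 0.3))            # conditioning on y[0]
print(arma11_loglik(y, 3.7, 0.86, 0.26, 0.3, pre=22.5))  # including y[0]
```

In a full Bayesian analysis `pre` would be an unknown with its own prior, sampled alongside the other parameters.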
Classical estimates of the ARMA(4,0,1) model vary slightly according to the package.
Note also that the parameterisation of the intercept, and the way MA terms are signed,
differ between R packages. The ARMA option in tseries is unable to fit this model, while
the aic.wge option provides estimated AR lag parameters (0.21, 0.99, −0.36, −0.14) and
MA parameter 0.98. The FitARMA package provides estimated AR parameters (0.31,
0.89, −0.38, −0.10), and MA parameter 0.89.
The rstan estimation, with code as in [1], of the ARMA(4,0,1) model gives estimates for
(ϕ1, ϕ2, ϕ3, ϕ4) of 0.29 (sd = 0.08), 0.91 (0.08), −0.37 (0.04), and −0.10 (0.04) respectively, and
0.92 (0.07) for γ1. The LOO-IC is reduced to 547.

5.2.1 Random Coefficient Autoregressive Models


A hierarchical generalisation of the AR(p) prior allows the lag coefficient to vary over time,
as in random coefficient AR or RCAR models – see, for example, Lee (1998), Berkes et al.
(2009), Wang and Ghosh (2002), and Araveeporn (2017). These are also called time-varying
autoregressive or TVAR models. Thus, for a centred and univariate y, an RCAR(p) model
in the observations specifies

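$$y_t = \sum_{j=1}^{p} \phi_{tj} y_{t-j} + u_t,$$

$$\phi_t = \mu_\phi + \Sigma_\phi^{0.5} e_t,$$

where $u_t \sim N(0, \sigma^2)$, $e_t \sim N_p(0, I)$, $\phi_t = (\phi_{t1}, \ldots, \phi_{tp})$, $\Sigma_\phi = \mathrm{diag}(\sigma_{\phi 1}^2, \ldots, \sigma_{\phi p}^2)$, and $\mu_\phi = (\phi_1, \ldots, \phi_p)$.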
Instead of a multivariate normal prior for the ϕt, sequential updating of the ϕt may be
applied, for example, via a multivariate random walk (Section 5.3),

$$\phi_t = \phi_{t-1} + w_t, \quad w_t \sim N_p(0, W_t).$$

Another possibility (Godsill et al., 2004) is to take both the AR coefficient vector and the
innovation variance $\sigma^2$ to be time-varying, for example, by setting a random walk prior on
$h_t = \log(\sigma_t)$, or by a second-stage autoregression, such as

$$h_t \sim N(\rho_h h_{t-1}, \sigma_h^2).$$
As in many Bayesian applications, stationarity constraints are not necessarily placed on
the ϕtj at each t (Prado et al., 2000). However, if the AR parameters lie in the stationary
region, then the series can be considered locally stationary. For example, for an RCAR(1)
model including a latent preseries value y0, a hierarchical scheme such as

$$y_t \sim N(\phi_{t1} y_{t-1}, \sigma^2), \quad t > 1,$$

$$y_0 \sim t_2(m_0, \sigma^2),$$

$$\phi_{t1} \sim N(\phi_1, \sigma_\phi^2), \quad t > 1,$$

may be applied. For this model, stationarity holds if $\phi_1^2 + \sigma_\phi^2 < 1$.
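
As a rough illustration of the RCAR(1) scheme and its stationarity condition, the following Python sketch draws a new lag coefficient at each time point and checks whether $\phi_1^2 + \sigma_\phi^2 < 1$. The function names, the fixed starting value `y0` (used here in place of a draw from the Student t prior) and all parameter values are assumptions chosen for illustration.

```python
import random

def rcar1_is_stationary(phi1, sigma_phi):
    """Second-order stationarity condition for the RCAR(1) model."""
    return phi1 ** 2 + sigma_phi ** 2 < 1.0

def simulate_rcar1(T, phi1, sigma_phi, sigma, y0=0.0, seed=0):
    """Simulate y_t = phi_t * y_{t-1} + u_t with phi_t ~ N(phi1, sigma_phi^2)."""
    rng = random.Random(seed)
    y, prev = [], y0
    for _ in range(T):
        phi_t = rng.gauss(phi1, sigma_phi)   # lag coefficient drawn afresh at each t
        prev = phi_t * prev + rng.gauss(0.0, sigma)
        y.append(prev)
    return y

print(rcar1_is_stationary(0.4, 0.3))    # 0.16 + 0.09 < 1
print(rcar1_is_stationary(0.9, 0.5))    # 0.81 + 0.25 > 1
series = simulate_rcar1(200, 0.4, 0.3, 1.0)
```

Note that the condition can fail even when the mean coefficient satisfies $|\phi_1| < 1$, because variability in the coefficients adds to the variance of the series.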

5.2.2 Low Order Autoregressive Models


Simple dependence models for observations, errors, or latent effects are obtained via first-
or second-order autoregression. In the AR(1) observation model, one has

$$y_t - \mu = \phi (y_{t-1} - \mu) + u_t$$

or

$$y_t = \phi y_{t-1} + u_t$$

for centred data, where under stationarity $-1 < \phi < 1$, and $y_r$ and $y_s$ for $1 \leq r \leq s \leq T$ are
conditionally independent, given $\{y_{r+1}, \ldots, y_{s-1}\}$, if $|r - s| > 1$ (Rue and Held, 2005). The AR(2)
model has

$$y_t = \phi_1 y_{t-1} + \phi_2 y_{t-2} + u_t$$

where stationarity requires $\phi_1 + \phi_2 < 1$, $\phi_2 - \phi_1 < 1$, and $|\phi_2| < 1$.


An AR(1) error sequence $\varepsilon_t = \rho \varepsilon_{t-1} + u_t$ with $u_t \sim N(0, \sigma^2)$ similarly requires $-1 < \rho < 1$ for
stationarity. The covariance for such a sequence has the form $\mathrm{Cov}(\varepsilon) = \sigma^2 C$, where the (s, t)th
element of C is $\rho^{|s-t|}/(1 - \rho^2)$. Hence $\mathrm{corr}(\varepsilon_s, \varepsilon_t) = \rho^{|s-t|}$, so correlations decline as the gap
between observations increases.
For the stationary AR(1) observation model $y_t = \phi y_{t-1} + u_t$, the marginal density of the
first observation is $y_1 \sim N(0, \sigma^2/(1 - \phi^2))$, and the joint density can also be obtained by density
decomposition as

$$p(y_1, \ldots, y_T) = p(y_1) p(y_2 | y_1) p(y_3 | y_2) \cdots p(y_T | y_{T-1})$$

$$\propto (1 - \phi^2)^{0.5} \sigma^{-T} \exp[-0.5 H / \sigma^2],$$

where $H = (1 - \phi^2) y_1^2 + \sum_{t=2}^{T} (y_t - \phi y_{t-1})^2$. The same sequence of marginal and conditional
densities applies for AR(1) autoregressive errors.
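
The agreement between the density decomposition and the closed form can be checked numerically. The Python sketch below takes the stationary marginal variance of $y_1$ as $\sigma^2/(1 - \phi^2)$ and writes the closed form with its normalising constant $(2\pi)^{-T/2}$ included. The function names and data values are hypothetical, chosen only for illustration.

```python
import math

def normal_logpdf(x, mean, var):
    """Log density of N(mean, var) at x."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def ar1_loglik_decomposition(y, phi, sigma):
    """Sum of log p(y_1) and log p(y_t | y_{t-1}) for t = 2, ..., T."""
    ll = normal_logpdf(y[0], 0.0, sigma ** 2 / (1.0 - phi ** 2))
    for t in range(1, len(y)):
        ll += normal_logpdf(y[t], phi * y[t - 1], sigma ** 2)
    return ll

def ar1_loglik_closed_form(y, phi, sigma):
    """Log of (2*pi)^(-T/2) (1 - phi^2)^0.5 sigma^(-T) exp(-0.5 H / sigma^2)."""
    T = len(y)
    H = (1.0 - phi ** 2) * y[0] ** 2 + sum(
        (y[t] - phi * y[t - 1]) ** 2 for t in range(1, T))
    return (-0.5 * T * math.log(2.0 * math.pi) + 0.5 * math.log(1.0 - phi ** 2)
            - T * math.log(sigma) - 0.5 * H / sigma ** 2)

y = [0.3, -0.1, 0.4, 0.9, 0.2]    # arbitrary centred values for illustration
a = ar1_loglik_decomposition(y, 0.6, 0.5)
b = ar1_loglik_closed_form(y, 0.6, 0.5)
print(abs(a - b) < 1e-10)
```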
The precision (inverse covariance) matrix of autoregressive models has interesting theo-
retical properties demonstrating how conditional independence structures determine the
precision matrix and vice versa (Speed and Kiiveri, 1986; Rue and Held, 2005). Specifically,
zeros in the precision matrix define, and are defined by, conditional independencies in the
joint density. Thus, for an AR(1) prior on errors ε with lag coefficient ρ, the precision matrix
П is tridiagonal with (r, s)th cell equalling zero only if the complete conditional distribu-
tion of εr does not depend on εs, namely
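
$$\Pi = \sigma^{-2} C^{-1} = \sigma^{-2}
\begin{bmatrix}
1 & -\rho & & & & 0 \\
-\rho & 1+\rho^2 & -\rho & & & \\
 & -\rho & 1+\rho^2 & \ddots & & \\
 & & \ddots & \ddots & \ddots & \\
 & & & -\rho & 1+\rho^2 & -\rho \\
0 & & & & -\rho & 1
\end{bmatrix}.$$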

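For an AR(2) error sequence with lag parameters $\{\rho_1, \rho_2\}$, the precision matrix is

$$\Pi = \sigma^{-2}
\begin{bmatrix}
1 & -\rho_1 & -\rho_2 & & & & 0 \\
-\rho_1 & 1+\rho_1^2 & -\rho_1(1-\rho_2) & -\rho_2 & & & \\
-\rho_2 & -\rho_1(1-\rho_2) & 1+\rho_1^2+\rho_2^2 & -\rho_1(1-\rho_2) & \cdots & & \\
0 & -\rho_2 & -\rho_1(1-\rho_2) & 1+\rho_1^2+\rho_2^2 & \cdots & & \\
 & & \cdots & \cdots & \ddots & \cdots & \\
 & & & & -\rho_1(1-\rho_2) & 1+\rho_1^2 & -\rho_1 \\
0 & & & & -\rho_2 & -\rho_1 & 1
\end{bmatrix}.$$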

Such simplifications in structure are useful in multidimensional applications involving
spatio-temporal or multiple-time scale errors. For example, if the covariance matrix of a
spatio-temporal error εst is represented as a Kronecker product $\Sigma_t \otimes \Sigma_s$ of a temporal covariance
$\Sigma_t$ and spatial covariance $\Sigma_s$, then the corresponding precision matrix is $\Pi_t \otimes \Pi_s$
(Bijma et al., 2005).
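
The tridiagonal form of the AR(1) precision matrix can be verified numerically by multiplying it with the corresponding covariance matrix, whose (s, t)th element is $\sigma^2 \rho^{|s-t|}/(1 - \rho^2)$. The following Python sketch uses plain nested lists; the function names and the chosen values of T, ρ and σ are assumptions for illustration.

```python
def ar1_covariance(T, rho, sigma):
    """Covariance matrix with (s, t) element sigma^2 * rho^|s-t| / (1 - rho^2)."""
    return [[sigma ** 2 * rho ** abs(s - t) / (1.0 - rho ** 2) for t in range(T)]
            for s in range(T)]

def ar1_precision(T, rho, sigma):
    """Tridiagonal precision matrix of a stationary AR(1) sequence (T >= 2)."""
    P = [[0.0] * T for _ in range(T)]
    for t in range(T):
        # corner diagonal cells are 1, interior diagonal cells are 1 + rho^2
        P[t][t] = 1.0 if t in (0, T - 1) else 1.0 + rho ** 2
        if t + 1 < T:
            P[t][t + 1] = P[t + 1][t] = -rho
    return [[v / sigma ** 2 for v in row] for row in P]

def matmul(A, B):
    """Product of two matrices stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

T, rho, sigma = 5, 0.7, 1.3
product = matmul(ar1_precision(T, rho, sigma), ar1_covariance(T, rho, sigma))
# largest absolute deviation of the product from the identity matrix
max_error = max(abs(product[i][j] - (1.0 if i == j else 0.0))
                for i in range(T) for j in range(T))
print(max_error < 1e-9)
```

The zeros outside the three central diagonals reflect the conditional independence of non-adjacent errors given the intervening ones.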
There is considerable literature around the unit root and explosive root solutions of the
AR(1) observation model $y_t = \phi y_{t-1} + u_t$. One may apply an autoregressive prior not constrained
to stationarity, and a substantial posterior probability of nonstationarity would
support using random walk priors, as a parsimonious autoregressive prior that allows for
potential nonstationarity. For example, Lubrano (1995) considers the alternative composite
hypotheses $H_0: \phi < 1$ and $H_1: \phi \geq 1$. Schotman and van Dijk (1991) consider the autoregression
plus trend observation model $y_t = \phi_0 + \phi_1 y_{t-1} + \delta t + u_t$ and reframe it in equivalent
AR(1) error form as

$$y_t = \delta_0 + \delta_1 t + \varepsilon_t,$$

$$\varepsilon_t = \phi \varepsilon_{t-1} + u_t,$$

while Chatuverdi and Kumar (2005) consider the unit root hypothesis under a more general
polynomial trend $y_t = \delta_0 + \sum_j \delta_j t^j + \varepsilon_t$.

5.2.3 Antedependence Models
Structured antedependence models may offer flexibility in time series specification; they
resemble autoregressions in entailing a regression over preceding observations or latent
effects, but are specified in a way that avoids stationarity constraints (Nunez-Anton and
Zimmerman, 2000; Pourahmadi, 2002). Observations $\{y_1, \ldots, y_T\}$ are antedependent of
order s if $y_t$ depends only on $\{y_{t-1}, \ldots, y_{t-s}\}$ for all $t \geq s$ (Gabriel, 1962). For example, Jaffrezic
et al. (2003) consider a second-order antedependence model for normal longitudinal data
of the form $y_{it} = \eta_{it} + g_{it} + u_{it}$, where $\eta_{it}$ models fixed effects, e.g. $\eta_{it} = x_{it} \beta$, $u_{it}$ are unstructured
white noise errors with fixed variance, and the genetic component $g_{it}$ follows a second-order
structured antedependence or AD(2) scheme. This scheme specifies

$$g_1 = \varepsilon_1,$$

$$g_2 = \phi_{12} g_1 + \varepsilon_2,$$

$$g_t = \phi_{1t} g_{t-1} + \phi_{2t} g_{t-2} + \varepsilon_t, \quad t > 2,$$

with $\varepsilon_t \sim N(0, \omega_t)$. Because of the initial condition $g_1 = \varepsilon_1$, the antedependence parameters,
such as $\{\phi_{1t}, \phi_{2t}\}$ in an AD(2) model, are unconstrained, in contrast to the stationarity constraints
needed for autoregressive models.
To reduce the number of parameters being estimated, changing variances ωt may be
modelled via a parametric function of time, for example

$$\log(\omega_t) = \alpha_1 + \alpha_2 t + \alpha_3 t^2,$$

while the antedependence parameters can also be modelled using time functions. For
example, a Box–Cox power law can be used to parameterise time-varying AD coefficients
ϕkt, namely

$$\phi_{kt} = \phi_k^{\, r_t - r_{t-k}},$$

where $\{r_t = (t^{\lambda_k} - 1)/\lambda_k, \; r_{t-k} = ((t-k)^{\lambda_k} - 1)/\lambda_k\}$ if $\lambda_k \neq 0$, and $\{r_t = \log(t), \; r_{t-k} = \log(t-k)\}$ if $\lambda_k = 0$
(Nunez-Anton and Zimmerman, 2000). The ϕ and ω parameters may be adjusted to account
for unevenly spaced times located at points $\{a_1, \ldots, a_T\}$ (Jaffrezic et al., 2004).
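
To see how the power law shapes the antedependence coefficients, the Python sketch below evaluates $\phi_{kt} = \phi_k^{r_t - r_{t-k}}$ with $r_t$ the Box–Cox transform of time. It assumes $\phi_k > 0$ and $t - k \geq 1$ so that the powers and logarithms are defined. The function names and numerical values are illustrative assumptions, not taken from the cited papers.

```python
import math

def box_cox_time(t, lam):
    """Box-Cox transform of time: (t^lam - 1)/lam, or log(t) when lam = 0."""
    return math.log(t) if lam == 0 else (t ** lam - 1.0) / lam

def ad_coefficient(phi_k, t, k, lam_k):
    """Time-varying antedependence coefficient phi_kt = phi_k ** (r_t - r_{t-k})."""
    return phi_k ** (box_cox_time(t, lam_k) - box_cox_time(t - k, lam_k))

# lam = 1 gives r_t - r_{t-1} = 1, so the lag-1 coefficient is constant in t
print([round(ad_coefficient(0.8, t, 1, 1.0), 3) for t in range(2, 7)])
# lam < 1 shrinks the transformed gap at later times, so coefficients rise towards 1
print([round(ad_coefficient(0.8, t, 1, 0.5), 3) for t in range(2, 7)])
```

A single pair $(\phi_k, \lambda_k)$ therefore replaces a separate coefficient at every time point.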

Example 5.2 NASDAQ Daily Volume, 2017


This example considers the log transforms of the NASDAQ daily trading volume statistics
during 2017 (from https://finance.yahoo.com/). There are 252 uncentred observations.
The analysis compares random coefficient AR models in terms of fit, and
robustness to outliers, against fixed coefficient models and stochastic volatility GARCH
models (cf. Wang and Ghosh, 2002).
Thus, the first two analyses compare fixed-coefficient and random-coefficient AR1
models. For the fixed-coefficient AR1 model, no prior assumption of stationarity is
made, whereby

$$y_t = \phi_0 + \phi_1 y_{t-1} + u_t,$$

$$u_t \sim N(0, \sigma^2),$$

with priors

$$\phi_1 \sim N(0, 1),$$

$$\phi_0 \sim N(0, 20),$$

$$\sigma \sim N^{+}(0, 1).$$

The estimates (posterior mean and standard deviation) for ϕ0, ϕ1, and σ are respectively 12.6 (1.25),
0.41 (0.06), and 0.143 (0.006). The LOO-IC is −255, with Figure 5.1A showing the extreme
pointwise LOO-IC associated with certain observations.
The random-coefficient AR1 specifies

$$y_t = \phi_0 + \phi_{1t} y_{t-1} + u_t,$$

$$u_t \sim N(0, \sigma^2),$$

$$\phi_{1t} = \mu_{\phi 1} + \sigma_{\phi 1} e_t,$$

$$e_t \sim N(0, 1),$$

where the parameterisation of ϕt follows Gelman et al. (2014), and provides improved
MCMC sampling via rstan [2]. For extreme outliers, such as at t = 120 and t = 227, the
mean likelihoods are higher under this model. However, the overall LOO-IC rises to
−250, with the improved fit per se (lower ELPD-LOO) offset by a higher complexity
measure (113 vs 9). The WAIC (widely applicable information criterion) also favours
the simpler model (−256 as against −245). The parameter σϕ1 has a mean of 0.0034, with
posterior mean ϕ1t varying from 0.401 to 0.423.
A GARCH(1,1) model (Section 5.5) specifies

$$y_t = \phi_0 + u_t,$$

$$u_t \sim N(0, \sigma_t^2),$$

with variance model

$$\sigma_t^2 = \alpha_0 + \alpha_1 u_{t-1}^2 + \beta_1 \sigma_{t-1}^2.$$

This provides evidence of volatility, as in Figure 5.1B, but the LOO-IC deteriorates to
−205. The α1 and β1 coefficients have skew posterior densities with respective means
(medians) of 0.17 (0.07) and 0.67 (0.87).
Finally, an AR1 lag in y is added to the GARCH(1,1), namely

$$y_t = \phi_0 + \phi_1 y_{t-1} + u_t,$$

$$u_t \sim N(0, \sigma_t^2),$$

$$\sigma_t^2 = \alpha_0 + \alpha_1 u_{t-1}^2 + \beta_1 \sigma_{t-1}^2.$$

This provides a LOO-IC of −260, a slight improvement on the basic AR(1) model. The
lagged effect of $u_{t-1}^2$ is now virtually eliminated, with posterior means (medians) for β1
and ϕ1 of 0.58 (0.65) and 0.38 (0.39).
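
The GARCH(1,1) variance recursion used in this example can be sketched as below, for the version without the AR1 lag. This is an illustrative Python fragment, not the rstan code used for the reported results. The function name, the starting variance `sigma2_init`, the parameter values and the data are all assumptions.

```python
def garch11_variances(y, phi0, alpha0, alpha1, beta1, sigma2_init):
    """Conditional variances sigma_t^2 = alpha0 + alpha1*u_{t-1}^2 + beta1*sigma_{t-1}^2,
    with u_t = y_t - phi0 and the first variance set to sigma2_init."""
    sigma2 = [sigma2_init]
    for t in range(1, len(y)):
        u_prev = y[t - 1] - phi0          # previous period's disturbance
        sigma2.append(alpha0 + alpha1 * u_prev ** 2 + beta1 * sigma2[-1])
    return sigma2

y = [21.3, 21.4, 21.2, 22.0, 21.5, 21.4]   # made-up log volumes, not the NASDAQ data
sig2 = garch11_variances(y, phi0=21.4, alpha0=0.005, alpha1=0.2,
                         beta1=0.6, sigma2_init=0.02)
print([round(v, 4) for v in sig2])
```

A large disturbance at one time point raises the conditional variance at the next through α1, and the increase then decays at a rate governed by β1.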

5.3 State-Space Priors for Metric Data


Nonstationary models based on state-space priors are widely used in applications where
time series parameters are evolving through time, especially in analysing separate unobserved
components representing trend, cyclical, or seasonal effects (West, 2013). The
idea that a time series is composed of several unobserved components contrasts with
Box–Jenkins or ARMA methods that require differencing to eliminate trend or periodic
effects and achieve stationary means and variances (Durbin, 2000, p.2). ARMA models are
selected using autocorrelation and partial autocorrelation functions that are subject
to sampling variability, and quite different models can provide similar fits for the same
series. In fact, ARMA sequences can be represented as particular instances of state-space
models with implicit components. Among informative discussions on state-space vs Box–Jenkins
methods, see Durbin and Koopman (2001, p.51) and Harvey and Todd (1983).

FIGURE 5.1
(A) Pointwise LOO-IC by day, fixed coefficient AR1 model. (B) Posterior mean σ by day, GARCH(1,1) model.

The normal linear state-space specification, or dynamic linear model, has the form

$$y_t = \beta_t X_t + u_t,$$

where evolution of the p dimensional signal βt is defined by a state equation

$$\beta_t = \beta_{t-1} G_t + w_t,$$

with Xt being a p × 1 design matrix (typically including an intercept), and Gt defining a p × p
state evolution matrix. The normal errors ut and wt are independent of each other, with
mean zero and variances Vt and Wt (or covariances for multivariate y). The initial state vector
or initial condition has a separate (e.g. normal) prior such as $\beta_1 \sim N(m_1, W_1)$ (Strickland
et al., 2008), where $m_1$ and $W_1$ are typically preset (e.g. $W_1$ is set large, in line with diffuse
expectations). Often Gt has a simple form, such as an identity matrix. For the case Gt = G,
Gamerman (1998) mentions an inverse parameterisation consequent on taking

$$\delta_1 = \beta_1, \quad \delta_t = \beta_t - G \beta_{t-1},$$

so that

$$\beta_t = \sum_{j=1}^{t} G^{t-j} \delta_j.$$

Algorithms using normal distribution properties can be applied to sequential updating
(filtering), forward prediction and retrospective smoothing of the state vector in the normal
dynamic linear model. Letting $D_t = (y_t, y_{t-1}, \ldots, y_1)$, the prior, predictive, and posterior
distributions of βt are (Reis et al., 2006)

$$ p(\beta_t | D_{t-1}) = \int p(\beta_t | \beta_{t-1}) \, p(\beta_{t-1} | D_{t-1}) \, d\beta_{t-1}, $$

$$ p(y_t | D_{t-1}) = \int p(y_t | \beta_t) \, p(\beta_t | D_{t-1}) \, d\beta_t, $$

$$ p(\beta_t | D_t) \propto p(\beta_t | D_{t-1}) \, p(y_t | \beta_t). $$

For the linear normal model with $V_t = V$, $W_t = W$, sequential updating provides posteriors

$$ \beta_t | D_t \sim N(m_t, C_t), $$

where

$$ a_t = G_t m_{t-1}, $$

$$ e_t = y_t - X_t' a_t, $$

$$ m_t = a_t + A_t e_t, $$

$$ C_t = R_t - A_t A_t' q_t, $$

$$ R_t = G_t C_{t-1} G_t' + W, $$

$$ q_t = X_t' R_t X_t + V, $$

with $A_t = R_t X_t / q_t$. The one step ahead state and observation predictive densities are normal
densities, namely,

$$ (\beta_t | D_{t-1}) \sim N(a_t, R_t), $$

$$ (y_t | D_{t-1}) \sim N(X_t' a_t, q_t). $$
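The updating recursions can be sketched for the scalar case (p = 1) with constant G and X. This is a
minimal illustration rather than code from the text: the function name, the starting values m0 and C0,
and the gain formula A_t = R_t X_t / q_t (the usual form in the dynamic linear model literature) are
assumptions made here.

```python
def kalman_filter(y, V, W, m0, C0, G=1.0, X=1.0):
    """Sequential updating for a scalar DLM with constant G, X, V and W.

    m0 and C0 are the mean and variance of the state before the first
    observation (hypothetical starting values, not taken from the text).
    """
    m, C = m0, C0
    steps = []
    for y_t in y:
        a = G * m              # a_t = G m_{t-1}
        R = G * C * G + W      # R_t = G C_{t-1} G' + W
        q = X * R * X + V      # q_t = X' R_t X + V
        e = y_t - X * a        # e_t = y_t - X' a_t
        A = R * X / q          # A_t = R_t X / q_t (assumed standard gain)
        m = a + A * e          # m_t = a_t + A_t e_t
        C = R - A * A * q      # C_t = R_t - A_t A_t' q_t
        steps.append({"a": a, "R": R, "q": q, "m": m, "C": C})
    return steps
```

Each returned dictionary holds the one step ahead moments (a, R, q) and the filtered moments (m, C)
for one time point; with G = X = 1 this is the local level case.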
5.3.1 Simple Signal Models
As an illustration of a normal state-space or dynamic linear model, assume that observations
yt are obtained with measurement error and in fact generated by a relatively smooth
underlying signal βt. This is a hierarchical model, analogous to the normal-normal model
of Chapter 4, with the first level being the observation equation, the second level being
the state equation, and the priors on the variances and initial conditions defining
hyperparameters at the third stage (Berliner, 1996). Assuming iid measurement errors ut, one has
an observation or measurement equation

$$ y_t = \beta_t + u_t, \qquad (5.1.1) $$

for t = 1, …, T and a state equation defining the evolution of the signal

$$ \beta_t = \beta_{t-1} + w_t, \qquad (5.1.2) $$

for t = 2, …, T. This is also known as a local level model (Durbin and Koopman, 2001), or
random walk plus noise model (Durbin, 2000), and the second stage is a nonstationary first
order random walk or RW(1) prior, corresponding to the unit root case of an AR(1) prior.
As for the AR(1) prior, future values of the signal depend on (βt, βt−1,…, β1) only through
the current value βt. Denoting β[t] = (β1,…, βt−1, βt+1,…, βT), namely all states other than βt,
the conditional form of the RW(1) prior is

$$ p(\beta_t | \beta_{[t]}, y) \propto \begin{cases}
p(\beta_2 | \beta_1) \, p(\beta_1) \, p(y_1 | \beta_1) & t = 1 \\
p(\beta_{t+1} | \beta_t) \, p(\beta_t | \beta_{t-1}) \, p(y_t | \beta_t) & t = 2, \ldots, T-1 \\
p(\beta_T | \beta_{T-1}) \, p(y_T | \beta_T) & t = T
\end{cases} $$

so that for times t = 2, …, T − 1 there is averaging over preceding and following states. The
first period signal (initial condition) β1 is typically taken as an unknown fixed effect with
large variance, while the observation error ut and state error wt are taken as respectively
N(0, V) and N(0, W), and assumed uncorrelated in time, independent of one another, and also
independent of the signal βt.
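Assume $\beta_1 \sim N(b_1, S_1)$, $1/V \sim Ga(a_u, b_u)$, $1/W \sim Ga(a_w, b_w)$; then the full conditionals are

$$ \beta_1 \sim N\left( \left[ \frac{\beta_2}{W} + \frac{b_1}{S_1} + \frac{y_1}{V} \right] \left[ \frac{1}{W} + \frac{1}{S_1} + \frac{1}{V} \right]^{-1}, \left[ \frac{1}{W} + \frac{1}{S_1} + \frac{1}{V} \right]^{-1} \right), $$

$$ \beta_t \sim N\left( \left[ \frac{\beta_{t+1} + \beta_{t-1}}{W} + \frac{y_t}{V} \right] \left[ \frac{2}{W} + \frac{1}{V} \right]^{-1}, \left[ \frac{2}{W} + \frac{1}{V} \right]^{-1} \right), \quad t = 2, \ldots, T-1, $$

$$ \beta_T \sim N\left( \left[ \frac{\beta_{T-1}}{W} + \frac{y_T}{V} \right] \left[ \frac{1}{W} + \frac{1}{V} \right]^{-1}, \left[ \frac{1}{W} + \frac{1}{V} \right]^{-1} \right), $$

$$ 1/V \sim Ga\left( a_u + \frac{T}{2}, \; b_u + 0.5 \sum_{t=1}^{T} (y_t - \beta_t)^2 \right), $$

$$ 1/W \sim Ga\left( a_w + \frac{T-1}{2}, \; b_w + 0.5 \sum_{t=2}^{T} (\beta_t - \beta_{t-1})^2 \right). $$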
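A single-state Gibbs sampler built from these full conditionals might look as follows. This is a sketch
only: the function name, the default hyperparameter values (b1, S1, a_u, b_u, a_w, b_w), the starting
values, and the seed are illustrative assumptions, and the series is assumed to have at least two points.

```python
import random

def gibbs_local_level(y, n_iter=500, b1=0.0, S1=100.0,
                      a_u=1.0, b_u=1.0, a_w=1.0, b_w=1.0, seed=1):
    """Single-state Gibbs sampling for y_t = beta_t + u_t, beta_t = beta_{t-1} + w_t."""
    rng = random.Random(seed)
    T = len(y)
    beta = list(y)           # start the signal at the data (arbitrary choice)
    V, W = 1.0, 1.0          # arbitrary starting variances
    draws = []
    for _ in range(n_iter):
        # beta_1: neighbour beta_2, prior N(b1, S1), observation y_1
        prec = 1.0 / W + 1.0 / S1 + 1.0 / V
        mean = (beta[1] / W + b1 / S1 + y[0] / V) / prec
        beta[0] = rng.gauss(mean, prec ** -0.5)
        # interior states: average over preceding and following states
        for t in range(1, T - 1):
            prec = 2.0 / W + 1.0 / V
            mean = ((beta[t + 1] + beta[t - 1]) / W + y[t] / V) / prec
            beta[t] = rng.gauss(mean, prec ** -0.5)
        # beta_T: only a preceding neighbour
        prec = 1.0 / W + 1.0 / V
        mean = (beta[T - 2] / W + y[T - 1] / V) / prec
        beta[T - 1] = rng.gauss(mean, prec ** -0.5)
        # precisions; gammavariate takes (shape, scale), so scale = 1 / rate
        ss_u = sum((y[t] - beta[t]) ** 2 for t in range(T))
        ss_w = sum((beta[t] - beta[t - 1]) ** 2 for t in range(1, T))
        V = 1.0 / rng.gammavariate(a_u + T / 2.0, 1.0 / (b_u + 0.5 * ss_u))
        W = 1.0 / rng.gammavariate(a_w + (T - 1) / 2.0, 1.0 / (b_w + 0.5 * ss_w))
        draws.append((list(beta), V, W))
    return draws
```

Because each βt is updated conditional on its neighbours, successive draws of the states tend to be
highly correlated, which is the motivation for the joint and block schemes discussed in Section 5.3.2.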

Higher order random walks in the signal are another possibility, with a kth order random
walk having prior

$$ \Delta^k \beta_t \sim N(0, W) $$

(Berliner, 1996; Kitagawa and Gersch, 1996; Fahrmeir and Lang, 2001). For example, a
second difference random walk or RW(2) prior specifies $y_t = \beta_t + u_t$ and state equation
$\Delta^2 \beta_t = w_t$. Hence

$$ \Delta(\Delta \beta_t) = \Delta(\beta_t - \beta_{t-1}) = \Delta\beta_t - \Delta\beta_{t-1} = (\beta_t - \beta_{t-1}) - (\beta_{t-1} - \beta_{t-2}) = w_t $$

and the RW(2) prior can be stated as

$$ \beta_t \sim N(2\beta_{t-1} - \beta_{t-2}, W). $$

Whereas first order random walks penalise abrupt jumps between successive values, the
RW(2) prior penalises deviations from a linear trend. The RW(2) and higher order RW
priors therefore lead to a smoother evolution of βt through time. This is relevant not just
to time series, but to processes operating on other time scales (e.g. age, cohort), for exam-
ple, in survival analysis or in graduating (smoothing) demographic schedules (Carlin and
Klugman, 1993).

5.3.2 Sampling Schemes
Different MCMC sampling schemes have been proposed for state-space models according
to the form of outcome (e.g. metric or discrete) and the form of the observation-state equa-
tions (e.g. linear or nonlinear). Multi-state or joint sampling of the state vectors βt is gener-
ally more efficient than single-state sampling that updates one state parameter vector at a
time (Knorr-Held, 1999). Joint sampling for β when y is metric is discussed by Carter and
Kohn (1994) and Fruhwirth-Schnatter (1994), while de Jong and Shephard (1995) focus on
sampling the ut and wt error series, as opposed to the state effects βt; recent overviews are
provided by Reis et al. (2006) and Simpson et al. (2017). Gamerman (1998) proposes updat-
ing via the δt rather than the usually highly correlated βt using the re-parameterisation
mentioned above.

Knorr-Held (1999) uses properties of the penalty (inverse covariance) matrix of the joint
density for the state vectors as a basis for sampling sub-blocks of the elements (β1,…, βT).
Thus Gaussian state-space priors can be written in joint form as

$$ p(\beta_1, \ldots, \beta_T | W) \propto \exp\left( -\frac{\beta' K \beta}{2W} \right), $$

where the penalty matrix K is determined by the form of autoregressive prior. For a first
order random walk with $\beta_t \sim N(\beta_{t-1}, W)$, the penalty matrix is

$$ K = \begin{pmatrix}
1 & -1 & & & & \\
-1 & 2 & -1 & & & \\
 & -1 & 2 & -1 & & \\
 & & \ddots & \ddots & \ddots & \\
 & & & -1 & 2 & -1 \\
 & & & & -1 & 1
\end{pmatrix}, $$

while for a second order random walk with $\beta_t \sim N(2\beta_{t-1} - \beta_{t-2}, W)$, one has

$$ K = \begin{pmatrix}
1 & -2 & 1 & & & & \\
-2 & 5 & -4 & 1 & & & \\
1 & -4 & 6 & -4 & 1 & & \\
 & \ddots & \ddots & \ddots & \ddots & \ddots & \\
 & & 1 & -4 & 6 & -4 & 1 \\
 & & & 1 & -4 & 5 & -2 \\
 & & & & 1 & -2 & 1
\end{pmatrix}. $$

For an RW(p) prior at equally spaced time points, the elements of the matrix K (apart from
edge effects) are expressible as

$$ k_{ij} = (-1)^{|i-j|} \binom{2p}{p - |i-j|} \quad \text{if } |i-j| \le p, $$

and $k_{ij} = 0$ otherwise.
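The penalty matrices above can be checked numerically by forming K = D′D, where D is the pth order
differencing matrix. The sketch below uses plain Python lists; the function names are illustrative
assumptions, and indices are zero-based.

```python
from math import comb

def difference_matrix(T, p):
    """Rows are the p-th order differences of (beta_1, ..., beta_T)."""
    D = [[1.0 if i == j else 0.0 for j in range(T)] for i in range(T)]
    for _ in range(p):
        # difference successive rows: row i becomes row i+1 minus row i
        D = [[D[i + 1][j] - D[i][j] for j in range(T)] for i in range(len(D) - 1)]
    return D

def rw_penalty(T, p):
    """Penalty matrix K = D'D for an RW(p) prior on T equally spaced points."""
    D = difference_matrix(T, p)
    return [[sum(row[i] * row[j] for row in D) for j in range(T)] for i in range(T)]

def interior_element(i, j, p):
    """Closed form for k_ij away from the edges of the matrix."""
    d = abs(i - j)
    return (-1) ** d * comb(2 * p, p - d) if d <= p else 0
```

For example, rw_penalty(8, 2) has first two rows beginning (1, −2, 1) and (−2, 5, −4, 1), matching the
edge rows shown for the RW(2) case, while rows away from the edges agree with interior_element.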
Let βab denote the subvector (βa, βa+1,…, βb) of state effects, and Kab denote the corresponding
submatrix of K. Let K1,a−1 and Kb+1,T denote the submatrices to the left and right of Kab,
namely

$$ K = \begin{pmatrix}
 & K_{1,a-1}' & \\
K_{1,a-1} & K_{ab} & K_{b+1,T} \\
 & K_{b+1,T}' &
\end{pmatrix}. $$

Then the conditional density for βab given β1,a−1, βb+1,T and W, is normal, $\beta_{ab} \sim N(\nu_{ab}, W K_{ab}^{-1})$,
where

$$ \nu_{ab} = \begin{cases}
-K_{ab}^{-1} K_{b+1,T} \, \beta_{b+1,T} & a = 1 \\
-K_{ab}^{-1} \left[ K_{1,a-1} \, \beta_{1,a-1} + K_{b+1,T} \, \beta_{b+1,T} \right] & a > 1, \; b < T \\
-K_{ab}^{-1} K_{1,a-1} \, \beta_{1,a-1} & b = T.
\end{cases} $$
Using this density, a Metropolis-Hastings block update may be applied to the full
conditional

$$ p(\beta_{ab} | \cdot) \propto \prod_{t=a}^{b} p(y_t | \beta_t) \; p(\beta_{ab} | \beta_{b+1,T}, \beta_{1,a-1}, W). $$

This involves drawing a proposal $\beta_{ab}^*$ from $N(\nu_{ab}, W K_{ab}^{-1})$ with $\{\nu_{ab}, K_{ab}\}$ evaluated at the
current sampled values β and W in a chain, with the proposal accepted or rejected according
to a probability

$$ \min\left( 1, \frac{\prod_{t=a}^{b} p(y_t | \beta_t^*)}{\prod_{t=a}^{b} p(y_t | \beta_t)} \right), $$

that may be calculated by comparing likelihoods only (Knorr-Held, 1999, p.134).
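As a minimal sketch of this conditional prior proposal, consider the special case of blocks of size one
(a = b) under an RW(1) prior, with Poisson counts and a log link. The Poisson likelihood, the function
name, and the use of single-element blocks are assumptions made for illustration; larger blocks require
the matrix expressions given above.

```python
import math
import random

def conditional_prior_sweep(beta, y, W, rng):
    """One sweep of size-one conditional prior proposals for y_t ~ Po(exp(beta_t)).

    Under RW(1), K_ab = 2 for interior t (mean of neighbours, variance W/2)
    and K_ab = 1 at either end (adjacent state, variance W).
    Returns the number of accepted proposals; beta is updated in place.
    """
    T = len(beta)
    accepted = 0
    for t in range(T):
        if t == 0:
            nu, var = beta[1], W
        elif t == T - 1:
            nu, var = beta[T - 2], W
        else:
            nu, var = 0.5 * (beta[t - 1] + beta[t + 1]), W / 2.0
        prop = rng.gauss(nu, math.sqrt(var))
        # log likelihood ratio; the log(y!) terms cancel
        log_ratio = y[t] * (prop - beta[t]) - (math.exp(prop) - math.exp(beta[t]))
        # accept with probability min(1, likelihood ratio)
        if math.log(1.0 - rng.random()) <= min(0.0, log_ratio):
            beta[t] = prop
            accepted += 1
    return accepted
```

Since the proposal is the conditional prior, the prior terms cancel in the Metropolis-Hastings ratio and
only the likelihood ratio is evaluated, as in the acceptance probability above.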

5.3.3 Basic Structural Model


To allow for a trend in the mean level or signal, one may extend the state equation in (5.1)
to include a stochastic increment, so that

$$ y_t = \beta_t + u_t, $$

$$ \beta_t = \beta_{t-1} + \Delta_t + w_{1t}, $$

$$ \Delta_t = \Delta_{t-1} + w_{2t}, $$

where Δt represents the changing slope of the trend. This provides the local linear trend
model or dynamic trend model (Fruhwirth-Schnatter, 1994).
A constant parameter Δ provides a linear trend, as in the Lee–Carter mortality forecasting
model considered by Pedroza (2006); this is sometimes known as a random walk with
drift. Other variations on the local linear model in (5.1) include autoregressive rather than
random walk state equations, such as

$$ \beta_t = \phi \beta_{t-1} + w_t $$

as in Carlin et al. (1992, p.496). An autoregression or random walk in y itself might be
added, as in Ghosh and Tiwari (2007), who assume a local linear model for common cancer
deaths of the form $y_{t+1} \sim N(y_t + \beta_t, V)$.
The basic structural model (BSM) or unobserved components model (Koopman, 1993;
Koopman et al., 1999) adds seasonal effects st to the above local linear trend model, so that
with $\mu_t$ representing the level of the series, one has

$$ y_t = \mu_t + s_t + u_t, $$

$$ \mu_t = \mu_{t-1} + \Delta_t + w_{1t}, $$

$$ \Delta_t = \Delta_{t-1} + w_{2t}, $$

$$ s_t + s_{t-1} + \cdots + s_{t-S+1} = w_{3t}, $$

where S is the number of seasons, and $w_{jt} \sim N(0, W_j)$. Relevant R packages for estimating
the BSM include stsm (via maximum likelihood), and bsts and dlm (via Bayesian
estimation).
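To see how the seasonal constraint operates, the following sketch simulates from the basic structural
model. The function name, argument names, and starting values (mu0, delta0, s_init) are illustrative
assumptions. The seasonal equation is used in the rearranged form
s_t = −(s_{t−1} + ⋯ + s_{t−S+1}) + w_{3t}.

```python
import random

def simulate_bsm(T, S, V, W1, W2, W3, s_init, mu0=0.0, delta0=0.0, seed=0):
    """Simulate T observations from the BSM with S seasons.

    s_init holds the S-1 seasonal initial conditions, oldest first.
    Returns the simulated series and the seasonal effects.
    """
    rng = random.Random(seed)
    mu, delta = mu0, delta0
    s_hist = list(s_init)
    y, season = [], []
    for _ in range(T):
        delta = delta + rng.gauss(0.0, W2 ** 0.5)         # slope equation
        mu = mu + delta + rng.gauss(0.0, W1 ** 0.5)       # level equation
        # seasonal effects sum to w_3t over any S successive periods
        s = -sum(s_hist[-(S - 1):]) + rng.gauss(0.0, W3 ** 0.5)
        s_hist.append(s)
        season.append(s)
        y.append(mu + s + rng.gauss(0.0, V ** 0.5))       # observation equation
    return y, season
```

Setting W3 = 0 gives a fixed seasonal pattern that repeats every S periods and sums to zero, while
W3 > 0 allows the pattern to evolve.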
Fruhwirth-Schnatter (1994) sets out the full conditionals for this model under gamma
priors for the precisions 1/Wj. The last equation provides the time domain prior for seasonal
effects, whereas a frequency domain prior specifies

$$ s_t = \sum_{j=1}^{[S/2]} s_{jt}, $$

$$ s_{jt} = s_{j,t-1} \cos(\lambda_j) + v_{j,t-1} \sin(\lambda_j) + w_{3t}, $$

$$ v_{jt} = -s_{j,t-1} \sin(\lambda_j) + v_{j,t-1} \cos(\lambda_j) + w_{4t}, $$

where $\lambda_j = 2\pi j / S$ and [S/2] denotes the integer part of S/2.


Certain series (e.g. natural phenomena) may show unknown periodicities, so cycli-
cal components are added as well as, or instead of, seasonal components. For example,
Piegorsch and Bailer (2005, p.229) consider unknown frequencies in carbon dioxide con-
centrations from Mauna Loa volcano in Hawaii. So for a local linear trend model with a
single unknown cycle

$$ y_t = \mu_t + c_t + u_t, $$

$$ \mu_t = \mu_{t-1} + \Delta_t + w_{1t}, $$

$$ \Delta_t = \Delta_{t-1} + w_{2t}, $$

$$ c_{t+1} = c_t \cos(\lambda) + d_t \sin(\lambda) + w_{3t}, $$

$$ d_{t+1} = -c_t \sin(\lambda) + d_t \cos(\lambda) + w_{4t}, $$

where λ is an unknown frequency.

5.3.4 Identification Questions
Identification issues in state-space random effect models occur for two main reasons. One
is that the mean or level of the state effects is not specified (rather the mean of pairwise
or higher order differences is specified). The other is the presence of multiple confounded
sources of random variation, as in the basic structural model with level and seasonal
effects, whereas the data can only identify the sum of the random effects ut + st. These

questions raise issues in MCMC sampling, for example, whether effects need to be centred
at each iteration, because an intercept (if included) will otherwise be confounded with the
means of the random effects.
To exemplify issues occurring due to the mean of the latent series, consider the measurement
error with RW(1) signal model in Section 5.3.1. The state equation can be stated as

$$ \Delta\beta_t = \beta_t - \beta_{t-1} \sim N(0, W), $$

so the prior only defines a level for differences in βt, but the level of the (undifferenced) βt
is not defined by the prior. If the model for yt does not have a separate intercept parameter,
the level of the βt will be identified by the level of the yt. Suppose though that the
observation equation includes a separate constant γ0 with

$$ y_t = \gamma_0 + \beta_t + u_t. $$

Then γ0 and the mean of the βt are confounded and for identification one may apply a
centring or corner constraint to the βt. An identifying corner constraint involves setting a
single βt to a known value; taking the initial condition β1 to have a known value, e.g. β1 = 0,
is one option (Clayton, 1996). By contrast, if the initial conditions (β1 in an RW(1) prior, β1
and β2 in an RW(2) prior, etc) are taken as unknowns, then a centring constraint may be
applied at each MCMC iteration, so that the centred βt satisfy $\sum_{t=1}^{T} \beta_t = 0$.
As in other models with multiple sources of random variation, priors on the variance
components in state-space models may affect inferences. This is not simply a matter of
selecting prior densities for scale parameters, but also a question of how such priors
influence the partitioning of total random variation. One may recognise the interdependence
between variance components using devices such as uniform priors on shrinkage
ratios B = V/(V + W) combined with a prior on V or V + W (Daniels, 1999). Alternatively (V, W)
may be reparameterised as (V, qV), where q is a signal to noise ratio. So the prior on q might
be centred on 1 in line with a prior belief that signal and observation variances are equal.
These approaches extend to models with competing sources of variation in the state
equation. Consider the three errors wjt (for levels, slopes, and seasonals) in the basic structural
model. Denoting $W_j = \mathrm{Var}(w_{jt})$ and $V = \mathrm{Var}(u_t)$, one may set $W_j = q_j V$ where qj are
signal to noise ratios (Koopman, 1993; Harvey, 1989, p.33). One may then set priors on the
qj separately (e.g. separate gammas), or jointly; for example, via a multivariate normal on
the log(qj). Another option is a prior on V and uniform priors on the ratios V/(V + Wj). Such
devices amount to assuming prior correlation between the respective variances.
An alternative approach to ensure stable identification is to set informative priors on the
variance of each random walk, possibly based on expected stochastic variation around
a deterministic trend. For example, following Berzuini and Clayton (1994), for counts
yt ~ Po(λt), consider a second order random walk for $\beta_t = \log(\lambda_t)$,

$$ \beta_t = 2\beta_{t-1} - \beta_{t-2} + w_t; $$

then the value W = 0 for Var(wt) corresponds to a log-linear deterministic relationship
between the λt and time. To allow for stochastic variation, one may assume $\nu W^*/W \sim \chi^2_\nu$,
or equivalently

$$ 1/W \sim Ga\left( \frac{\nu}{2}, \frac{W^* \nu}{2} \right), $$

where W* is a prior setting for W, and higher values of ν represent stronger degrees of
belief in that setting. For example, taking W* = 0.01 corresponds to assuming a 95% prob-
ability that λt will be within −18 and +22% of a log-linear extrapolation from βt−1.
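The quoted band can be reproduced with a short calculation: with W* = 0.01 the innovation wt has
standard deviation 0.1 on the log scale, so a central 95% interval of ±1.96 × 0.1 around the
extrapolation maps to multiplicative limits on λt. The function name and the normal quantile 1.96 are
the only ingredients assumed here.

```python
from math import exp, sqrt

def rw2_relative_band(W_star, z=1.96):
    """Relative limits for lambda_t around a log-linear extrapolation.

    Returns (lower, upper) as proportions, e.g. -0.18 means 18% below.
    """
    sd = sqrt(W_star)                 # standard deviation of w_t on the log scale
    return exp(-z * sd) - 1.0, exp(z * sd) - 1.0

lower, upper = rw2_relative_band(0.01)
print(round(100 * lower), round(100 * upper))   # prints: -18 22
```

The asymmetry of the limits arises because a symmetric interval for βt = log(λt) is exponentiated.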
The single source of error approach (Ord et al., 2005) may also assist in achieving parsimony,
and in resolving the partitioning of variance between multiple sources of variation
in unobserved component models. Thus, the local linear trend model in multiple source of
error (MSOE) form is

$$ y_t = \mu_t + u_t, $$

$$ \mu_t = \mu_{t-1} + \Delta_t + w_{1t}, $$

$$ \Delta_t = \Delta_{t-1} + w_{2t}, $$

but in single source of error (SSOE) form is

$$ y_t = \mu_t + u_t, $$

$$ \mu_t = \mu_{t-1} + \Delta_t + \lambda_1 u_t, $$

$$ \Delta_t = \Delta_{t-1} + \lambda_2 u_t, $$

where λ1 and λ2 are loadings. By contrast to the MSOE scheme, the state and observation
errors are now correlated.

Example 5.3 Air Passenger Data


As an application of the basic structural model, consider monthly air passenger totals
using London airports (Heathrow, Gatwick, etc.) from January 1999 through to March
2014 (so T = 183). Totals are in millions. A monthly seasonal effect is assumed so there
are S − 1 = 11 initial conditions for the st sequence. Then

$$ y_t = \mu_t + s_t + u_t, $$

$$ \mu_t = \mu_{t-1} + \Delta_t + w_{1t}, $$

$$ \Delta_t = \Delta_{t-1} + w_{2t}, $$

$$ s_t + s_{t-1} + \cdots + s_{t-S+1} = w_{3t}, $$

with $w_{jt} \sim N(0, \sigma_{j+1}^2)$, and normal observation errors $u_t \sim N(0, \sigma_1^2)$. Half t(0,1) priors with
4 degrees of freedom are assumed on the σj. For t = 1, the μt and Δt series refer to pre-series
values which are assigned N(0,10) priors.
Convergence is obtained readily in a two-chain run of 5000 iterations using rstan,
with a LOO-IC of 44.1. The posterior means (medians) of the σj are 0.202 (0.202), 0.174
(0.173), 0.0036 (0.0025) and 0.0136 (0.0118).
Figure 5.2A–C show respectively the clear seasonal variations, the generally
upward trend in the slope parameters Δt (though most evident in the early part of
the period), and the combined level and trend. These series all include forecasts for
nine extra months through to the end of 2014. A similar slope trajectory is estimated
using the R package rucm. Reversals to the broad upward trend in modelled passenger
numbers, as in Figure 5.2C, reflect especially the recession of 2008–09, as
well as more distinct outliers for individual months. An examination of the pointwise
LOO-IC, as in Figure 5.2D, shows the most discrepant month (t = 136) to be
April 2010, reflecting the impact on flights of the Eyjafjallajökull volcanic eruption
in Iceland.

FIGURE 5.2
(A) Passenger numbers, seasonal effects. (B) Passenger numbers, trend effects. (C) Passenger numbers, level
effects. (D) Passenger numbers model, pointwise LOO-IC.
To alleviate the impact of outlier values, a Student t observation model is also estimated,
namely $u_t \sim t(0, \nu, \sigma_1^2)$. The unknown degrees of freedom ν is assigned an E(0.1)
prior. This provides an improved LOO-IC of 16 with a posterior mean (median) of ν of
2.48 (2.33), with the posterior mean (median) for σ1 reduced to 0.093 (0.092).
Estimation of the basic structural model is also straightforward with R-INLA,
with the simplest code involving a random effect that combines level and trend. The
pointwise WAIC from a normal errors-based model reproduces the extreme outlier
at t = 136.

Example 5.4 Global Sea Level Change


This example compares in-sample predictions and out of sample forecasts from a local
linear model with linear trend, and a simple hierarchical model involving unit level
linear trends. Identifiability issues and their resolution are discussed.
A number of studies have analysed local relative sea-level records from tide gauge
observations, and considered broader inferences regarding global mean sea level
change. Local confounding factors may hinder quantifying a “global” signal from such
data. However, Patwardhan and Small (1992) consider a set of stations from around the
world with relatively long continuous records that were representative of other stations
in the same region, and seemed relatively free of local confounding factors.
The analysis here with jagsUI follows them in using records for 1900–1980 from five sta-
tions, namely San Francisco, Tonoura (Japan), Sydney, Bombay, and Cascais (Portugal),
with out-of-sample forecasts to 2000. Therefore, the data file has T = 101 points, with the
last 20 being recorded as NA. The data are in mm from the Permanent Service for Mean
Sea Level website (www.psmsl.org/).
For station j at time t, the model used by Patwardhan and Small (1992) involves a first
order random walk in the mean global sea level Mt plus a homogeneous linear trend
(common coefficient b across sites j). So model 1 has

$$ y_{jt} = M_t + bt + u_{jt}, $$

$$ u_{jt} \sim N(0, \sigma_u^2), $$

$$ M_t \sim N(M_{t-1}, \sigma_M^2) \quad (t > 1), $$

with the initial condition M1 assigned a diffuse N(6900,10000) prior. A gamma prior is
assumed for $\xi = \tau_u + \tau_M = 1/\sigma_u^2 + 1/\sigma_M^2$, so with $\kappa = \tau_u/\xi \sim U(0, 1)$, one obtains τu = κξ and
$\tau_M = (1 - \kappa)\xi$. Patwardhan and Small mention that compilations of trends in relative sea
level data suggest an upward trend of 0.5–3.0 mm/year, so a N(0,1) prior on b seems
reasonable.
For improved identification and convergence, the Mt series are differenced with
respect to M1, namely $\Delta_t = M_t - M_1$, and a level parameter β0 is introduced. So the Mt
are effectively represented as Δt + β0, and the observation model is $y_{jt} = \beta_0 + \Delta_t + bt + u_{jt}$.
Convergence is much delayed without using this re-expression. An alternative device is
centring, whereby $\Delta_t = M_t - \bar{M}$.
An alternative model (model 2) allowing site-specific linear trends is considered,
namely

$$ y_{jt} = M_t + b_j t + u_{jt}, $$

$$ M_t \sim N(M_{t-1}, \sigma_M^2), $$

$$ u_{jt} \sim N(0, \sigma_u^2), $$

$$ b_j \sim N(\mu_b, \sigma_b^2), $$

$$ \mu_b \sim N(0, 1). $$

A relatively informative exponential E(1) prior for $1/\sigma_b^2$ is adopted, as diffuse options
lead to delayed convergence. The same identification strategy as under model 1 is
adopted for the Mt series.

FIGURE 5.3
Modelled global sea level, 1900–2000 (posterior mean with 2.5% and 97.5% limits).

For model 1, a two-chain run using jagsUI converges after 20,000 iterations. There is a
mean (95% CRI) linear growth rate b of 0.85 (−0.08, 1.54). Variation around the random
level component Mt is comparatively small, with the posterior median of $\sigma_M^2$ standing
at 7.6, compared to a median of 2005 for $\sigma_u^2$. A posterior predictive p-test based on squared
deviations is satisfactory. As to fit, let yrep,jt be replicate data from the model. Then a
posterior predictive loss criterion is calculated (within the observed data period to 1980,
and with k = 1000) as

$$ \mathrm{PPL} = \frac{1}{R} \frac{k}{k+1} \sum_{t=1}^{80} \sum_{j=1}^{5} \sum_{r=1}^{R} \left( y_{\mathrm{rep},jt}^{(r)} - y_{jt} \right)^2 + \sum_{t=1}^{80} \sum_{j=1}^{5} V(y_{\mathrm{rep},jt}). $$

The respective components are obtained as 1629 and 1054 (in units of 1000), with the
second a measure of complexity.
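Given R sets of replicate data saved from an MCMC run, the two components of this criterion can be
computed as in the sketch below. The function name and data layout (observations flattened over
stations and years) are assumptions, as is the use of the sample variance with divisor R − 1 for
V(yrep,jt).

```python
def posterior_predictive_loss(y, y_rep, k=1000.0):
    """Return the (fit, complexity) components of the PPL criterion.

    y     : list of n observed values (flattened over j and t)
    y_rep : list of R replicate lists, each of length n
    """
    R, n = len(y_rep), len(y)
    # fit: averaged squared deviations of replicates from observations
    fit = sum((rep[i] - y[i]) ** 2 for rep in y_rep for i in range(n))
    fit *= k / ((k + 1.0) * R)
    # complexity: summed predictive variances of the replicates
    complexity = 0.0
    for i in range(n):
        col = [rep[i] for rep in y_rep]
        mean = sum(col) / R
        complexity += sum((v - mean) ** 2 for v in col) / (R - 1)
    return fit, complexity
```

Missing values (here the forecast period after 1980) would be excluded before calling the function.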
For model 2, a two-chain analysis with jagsUI converges after 10,000 iterations. This
analysis gives a mean (95% interval) for μb of 0.78 (−0.15,1.68), while site level mean
growth rates range from 0.30 to 1.89. Variation in the random components (Mt and bj) is
such as to reduce the median $\sigma_u^2$ to 1378. The respective PPL components are also much
reduced, namely to 1116 and 731. Figure 5.3 shows the evolution of modelled global sea
level to 1980 and forecasts thereafter.

5.3.5 Nonlinear State-Space Models for Continuous Data


Assumptions such as linearity of state transitions and additive normal errors are often not
realistic, for example, in population dynamics (Maunder et al., 2015). In fisheries popu-
lation dynamics, a state-space approach recognises both biological or process variation
(randomness in population dynamics), and measurement error in the observations (Auger-
Méthé et al., 2016). The process model may focus on the total biomass of a particular fish
species (in a particular fishing region) above the minimum legal catch size, or the biomass
of fish for particular age or length groups. In the former aggregate case, the state equation
typically specifies the current biomass as a function of the biomass in the previous period,
additions due to production (natural increase or other forms of recruitment), and removals
due to fishing or natural mortality. The observations for this case are Ct, observed catch

totals, and an index of abundance, At, such as the catch per unit effort, though such indices
are often imperfect measures (Maunder et al., 2006).
A widely applied population dynamics model is the logistic function of Schaefer (1954)
whereby biomass at period t + 1, Bt+1, is represented as

$$ B_{t+1} = [B_t + g(B_t) - C_t] \, \epsilon_t^p, $$

where $\epsilon_t^p$ are multiplicative errors, and g(Bt) represents “surplus production” as a function
of biomass. Thus, one of the observation series, namely Ct, appears in the process model.
The Schaefer model involves three parameters, r, K, and q, which can be interpreted respectively
as the maximum intrinsic growth rate, the arithmetic mean biomass at unexploited
equilibrium (or carrying capacity), and a catchability parameter (or proportionality
constant). These parameters define the surplus production function, namely

$$ g(B_t) = r B_t (1 - B_t / K), $$

and the abundance observation model

$$ A_t = q B_t \epsilon_t^o. $$

Additional parameters are the variances $\sigma_o^2 = \mathrm{var}(\epsilon_t^o)$ and $\sigma_p^2 = \mathrm{var}(\epsilon_t^p)$ of the observation
and process models. Identification may be improved by using several abundance indices,
so that

$$ A_{jt} = q_j B_t \epsilon_{jt}^o. $$

Typically, lognormal likelihoods are adopted for both the process and observation models.
Derived parameters of interest include the maximum sustainable yield, MSY = rK/4.
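The link between the surplus production function and MSY = rK/4 can be verified directly: g(B) is
maximised at B = K/2, where it equals rK/4, and a constant catch at that level holds the deterministic
biomass at K/2. The function names and the numerical values of r and K below are illustrative
assumptions, not estimates from the text.

```python
def surplus_production(B, r, K):
    """Schaefer surplus production g(B) = r B (1 - B / K)."""
    return r * B * (1.0 - B / K)

def schaefer_step(B, C, r, K, eps=1.0):
    """One step of the process model; eps is the multiplicative process error."""
    return (B + surplus_production(B, r, K) - C) * eps

def msy(r, K):
    """Maximum sustainable yield implied by the Schaefer model."""
    return r * K / 4.0

r, K = 0.3, 1000.0                      # hypothetical values
grid = [K * i / 1000.0 for i in range(1001)]
best_B = max(grid, key=lambda B: surplus_production(B, r, K))
print(best_B, surplus_production(best_B, r, K), msy(r, K))
```

In estimation, eps would be a lognormal error and the catch series Ct would be the observed data.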
Estimation of this model may well require informative priors for at least some of the
parameters. The data may contain relatively little information on the parameters, so the
prior may considerably influence the posterior. The literature has discussion about appro-
priate forms of prior, such as a uniform or lognormal prior for K, or uniform on log(K) (Punt
and Hilborn, 1997; McAllister, 2014). For q, McAllister (2014) suggests a uniform density for
log(q) over (−20,2), namely a diffuse prior concentrated on values under one, but including
values above one. By contrast, Parent and Rivot (2013) suggest a U(−20,20) prior on log(q),
while Rankin and Lemos (2015) assume log(q) ~ U ( -20, -3). For the rate of natural increase,
r, there may be more substantive prior evidence, though Parent and Rivot (2013) adopt a
U(0.01,3) prior. Regarding the process and observation series variances, Parent and Rivot
(2013) propose a parameter λ governing the ratio σp/σo, but in an application, assume λ = 1
and adopt a diffuse U(−20,20) prior on the log of the common variance $\sigma^2$. McAllister (2014)
discusses the basis for more informative priors on $\sigma_p^2$; for example, a value σp = 0.05 results
in an interannual change in total recruited stock biomass of about 5%. Rankin and Lemos
(2015) follow Parent and Rivot (2013) in adopting relatively diffuse priors except on K.

Example 5.5 North Pacific Blue Shark


This example illustrates fish population dynamics, focusing on the blue shark
population in the North Pacific, as considered by the ISC Shark Working Group (2013).
The observed data on catch (in units of 1000 metric tons), and an availability index, run
from 1976 to 2010. Relatively diffuse priors are adopted, following recommendations in
the literature (Parent and Rivot, 2013; Rankin and Lemos, 2015), except for K, where a
uniform prior, $\log(K) \sim U(3.92, 7.6)$, follows the ISC Shark Working Group (2013). Thus
for r, q, and $\sigma^2$, the priors are $r \sim U(0.01, 3)$, $\log(q) \sim U(-20, 2)$, and $\log(\sigma^2) \sim U(-10, 10)$.
The biomass series is expressed as

$$ P_t = B_t / K $$

as in Meyer and Millar (1999). The lognormal prior on the initial condition P1 is as in
ISCSWG (2013).
In the first model, the process and observation priors are related according to $\sigma_p = \lambda \sigma_o$,
with the prior for λ exponential, λ ~ E(1), centred at 1. Convergence using rstan is rapid,
with the LOO-IC pooling over process and observation likelihoods, as the process
likelihood involves observing the catch data. The LOO-IC is −99, with posterior mean
(median) estimates of the maximum sustainable yield (MSY) of 77.6 (68.1), higher than
the estimate of 52 in ISCSWG (2013). Estimates may be affected by the use in ISCSWG
(2013) of the extended Fletcher–Schaefer model, and by the inclusion in ISCSWG (2013)
of five earlier years when the abundance index was missing. The carrying capacity K
has posterior mean (median) of 1130 (1105). The mean for the At series can be modified
numerically by changing K or q, and samples of these parameters are negatively corre-
lated, a feature that might be used in setting a prior.
In a second model, it is assumed that σp = σo, and this model has a higher LOO-IC,
namely −89. Under this model, the posterior mean (median) estimates of the MSY are
73.5 (66.7). Both analyses show biomass at low levels in the late 1980s, as in Figure 8 of
ISCSWG (2013). Under the first model, the posterior median biomass for 1989 is 588,
compared to 1213 in 1976, and 994 in 2010.

5.4 Time Series for Discrete Responses; State-Space Priors and Alternatives
Dynamic generalised linear models extend the Gaussian state-space representation to out-
comes with density p(yt|ζt) belonging to the exponential family of distributions, where ζt
is the natural parameter (Helske, 2017; Soyer et al., 2015; Davis et al., 2016). One may also
condition on the history of previous observations plus previous and current predictors,
$D_{t-1} = (y_{t-1}, y_{t-2}, \ldots, y_1, X_t, \ldots, X_1)$ to allow for observation driven components in the model
(Fahrmeir and Tutz, 2001, p.242). So

$$ p(y_t | \zeta_t, D_{t-1}) = \exp\left[ \phi_t \{ y_t \zeta_t - b(\zeta_t) \} \right] c(y_t, \phi_t), $$

with $\mu_t = E(y_t | \zeta_t) = b'(\zeta_t)$ and μt linked to a linear predictor ηt via a link function g, $g(\mu_t) = \eta_t$.
Also a known scale parameter ϕt defines the conditional variance $\mathrm{Var}(y_t | \zeta_t) = b''(\zeta_t)/\phi_t$.
Then an observation equation for design matrix Xt of dimension p would typically be of
the form

$$ g(\mu_t) = \eta_t = X_t' \beta_t + u_t, $$

with state or system equation

$$ \beta_t = G_t \beta_{t-1} + w_t, $$

where $w_t \sim N_p(0, W)$. The error $u_t \sim N(0, V)$ is not necessarily included for discrete
responses, but may be necessary to represent unstructured extra-variation.
An alternative state-space approach, sometimes termed a linear Bayes approach, involves
conjugate priors for the natural parameters and a guide relationship

$$ h(\zeta_t) = X_t' \beta_t, $$

linking the natural parameters to the state vector (West et al., 1985, p.74; Ferreira and
Gamerman, 2000, p.60). So with time specific parameters (gt,ht), the prior for the natural
parameter at time t is

$$ p(\zeta_t | D_{t-1}, g_t, h_t) = k(g_t, h_t) \exp[ g_t \zeta_t - h_t b(\zeta_t) ], $$

while the updated natural parameters have density

$$ p(\zeta_t | D_t, g_t, h_t) = k(g_t + \phi_t y_t, h_t + \phi_t) \exp\left[ (g_t + \phi_t y_t) \zeta_t - (h_t + \phi_t) b(\zeta_t) \right]. $$

As for normal linear state-space models, the state vector may include level, trend and seasonal
effects. For an underlying signal model (with Xt containing only an intercept), the
regression and state equations become (Kitagawa and Gersch, 1996, Ch 13),

$$ g(\mu_t) = \beta_t + u_t, $$

$$ \Delta^k \beta_t = w_t, $$

with $u_t \sim N(0, V)$, $w_t \sim N(0, W)$. Thus Kashiwagi and Yanagimoto (1992) consider Poisson
data on disease counts $y_t \sim Po(\mu_t)$, and take k = 1 in the signal equation.
For binary data with $p_t = \Pr(y_t = 1)$, a signal may be combined with randomly time-varying
dependence on lagged responses (Cox, 1970), providing a parameter driven representation,
whereas an observation-driven model would only involve fixed effect coefficients on
lagged observed yt values (Wu and Cui, 2014). For example, a time-varying level and lag 1
effect could specify

$$ g(p_t) = \eta_t = \beta_{1t} + \beta_{2t} y_{t-1}, $$

$$ (\beta_{1t}, \beta_{2t}) \sim N_2([\beta_{1,t-1}, \beta_{2,t-1}], W). $$

Time series of categorical data vectors, namely $y_t = (y_{t1}, y_{t2}, \ldots, y_{tJ})$ with only a single ytj = 1
if (say) diagnosis j applies, or mutually exclusive choices j made at time t, are multinomial
according to

$$ y_t = (y_{t1}, y_{t2}, \ldots, y_{tJ}) \sim \mathrm{Mult}(1, [p_{t1}, p_{t2}, \ldots, p_{tJ}]). $$

Typically, a multiple logit link is assumed for the unknown probabilities ptj (Fahrmeir and
Tutz, 2001; Cargnoni et al., 1996). A signal model would then involve a (J − 1) dimensional
state vector, though by analogy to binary Markov dependence, the regression term ηtj for
the jth choice may also involve lags on both the same response yt−k,j, and lagged
cross-responses yt−k,m (m ≠ j). For a general predictor, possibly varying by category, Xtj, one has

$$ p_{tj} = \frac{\exp(X_{tj}' \beta_{tj})}{\sum_{h=1}^{J} \exp(X_{th}' \beta_{th})}, $$

where βtJ = 0 for identifiability. Cross-series borrowing of strength via random walk priors
may be applied for the J − 1 category specific state vectors βtj. Thus, for the coefficient on
predictor k, Xtjk, one might have

$$ \beta_{tk} \sim N_{J-1}(\beta_{t-1,k}, \Sigma_k), $$

where $\beta_{tk} = (\beta_{t1k}, \ldots, \beta_{t,J-1,k})$, and Σk is of dimension J − 1.


An alternative for binary and multinomial responses is to introduce the augmented metric
data $y_t^*$ that underlie the observed discrete responses. Thus, for binary data, consider
the scheme

$$ y_t^* = X_t' \beta_t + u_t, $$

where $y_t^*$ is positive or negative according as yt = 1 or yt = 0, and the variance of ut is assumed
known for identifiability, usually with var(ut) = 1. A simple signal model with Xt = 1 may
then be expressed as

$$ y_t^* | W, y_t, \beta_t \sim N(\beta_t, 1) \, I(0, \infty) \quad \text{if } y_t = 1, $$

$$ y_t^* | W, y_t, \beta_t \sim N(\beta_t, 1) \, I(-\infty, 0) \quad \text{if } y_t = 0, $$

$$ \beta_t \sim N(\beta_{t-1}, W). $$
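The truncated normal draws for the augmented data can be sketched with inverse-CDF sampling from the
Python standard library. The function name is an assumption, and the clamping of the uniform variate is
a numerical safeguard introduced here, which could matter only when |βt| is very large.

```python
import random
from statistics import NormalDist

_STD_NORMAL = NormalDist()   # N(0, 1)

def draw_latent(beta_t, y_t, rng):
    """Draw y*_t ~ N(beta_t, 1), truncated to (0, inf) if y_t = 1, else to (-inf, 0)."""
    p0 = _STD_NORMAL.cdf(-beta_t)          # Pr(y*_t <= 0) under N(beta_t, 1)
    r = rng.random()
    # uniform on (p0, 1) for y_t = 1, on (0, p0) for y_t = 0
    u = p0 + (1.0 - p0) * r if y_t == 1 else p0 * r
    u = min(max(u, 1e-12), 1.0 - 1e-12)    # keep inv_cdf away from 0 and 1
    return beta_t + _STD_NORMAL.inv_cdf(u)
```

Conditional on the sampled $y_t^*$, the states βt then follow the normal local level updates of Section
5.3.1 with V = 1.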

5.4.1 Other Approaches
Other general schemes for modelling time series of exponential family data include the
generalised autoregressive moving average (GARMA) representation (Benjamin et al.,
2003; Li, 1994; Silveira de Andrade et al., 2015). The GARMA representation involves conditional
means μt, link function g(μt) = ηt, and regression term in the form

$$\eta_t = \gamma_t X_t + \sum_{j=1}^{p} \phi_j \left[ g(y_{t-j}) - \gamma_{t-j} X_{t-j} \right] + \sum_{k=1}^{q} \gamma_k \left[ g(y_{t-k}) - \eta_{t-k} \right].$$

For example, for Poisson data $y_t \sim \text{Po}(\mu_t)$ and $y_t^* = \max(y_t, m)$ for a small positive constant
m, one has

$$\log(\mu_t) = \gamma_t X_t + \sum_{j=1}^{p} \phi_j \left[ \log(y_{t-j}^*) - \gamma_{t-j} X_{t-j} \right] + \sum_{k=1}^{q} \gamma_k \left[ \log(y_{t-k}^* / \mu_{t-k}) \right].$$
More general autoregression in the state vector (not limited to random walks) may be
adopted. Thus Oh and Lim (2001) and Chan and Ledolter (1995) adopt an autocorrelated
error θt for count data with

$$y_t \sim \text{Po}(e^{\eta_t}),$$

$$\eta_t = \theta_t + X_t \beta,$$

$$\theta_t = \rho \theta_{t-1} + w_t,$$

where ρ is constrained to stationarity, and the wt are normally distributed. Utazi (2017)
considers a variant of this model allowing a changepoint in the autoregressive parameter ρ.
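To make the structure of this parameter driven count model concrete, the following sketch simulates from it in base R. The function name simulate_pois_ar1 and all parameter values are illustrative assumptions; the latent series is started from its stationary distribution.

```r
# Simulate y_t ~ Po(exp(eta_t)), eta_t = theta_t + X_t * beta,
# theta_t = rho * theta_{t-1} + w_t, with |rho| < 1 (stationarity).
simulate_pois_ar1 <- function(n, beta, rho, sd_w, x) {
  stopifnot(abs(rho) < 1, length(x) == n)
  theta <- numeric(n)
  # Stationary starting value: N(0, sd_w^2 / (1 - rho^2))
  theta[1] <- rnorm(1, 0, sd_w / sqrt(1 - rho^2))
  for (t in 2:n) theta[t] <- rho * theta[t - 1] + rnorm(1, 0, sd_w)
  eta <- theta + x * beta
  list(y = rpois(n, exp(eta)), theta = theta, eta = eta)
}

set.seed(42)
n   <- 200
x   <- rep(1, n)   # intercept-only predictor (an assumption for illustration)
sim <- simulate_pois_ar1(n, beta = 1, rho = 0.7, sd_w = 0.3, x = x)
```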
Dependence on lagged counts can also be achieved by binomial thinning (Silva et al.,
2009), whereby

$$\eta_t = \beta_t X_t + \rho \circ y_{t-1},$$

is equivalent to $\eta_t = \beta_t X_t + h_t$, $h_t \sim \text{Bin}(y_{t-1}, \rho)$, and by thinning schemes applicable to both
count and categorical data (Angers et al., 2017; Khoo and Ong, 2014). Conjugate mixture
schemes for time series counts are exemplified by Jowaheer and Sutradhar (2002), and
Bockenholt (1999), with, for instance,

$$y_t \sim \text{Po}(e^{\eta_t} \kappa_t),$$

$$\kappa_t \sim \text{Ga}\left(\frac{1}{c}, \frac{1}{c}\right),$$

where marginally $\text{var}(y_t) = \exp(\eta_t) + c \exp(2\eta_t)$.
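The marginal variance formula for the Poisson-gamma mixture can be checked by simulation. The sketch below is illustrative only; the values chosen for the linear predictor and for c (named c_disp here) are assumptions.

```r
# Monte Carlo check of var(y) = exp(eta) + c * exp(2 * eta)
# for y ~ Po(exp(eta) * kappa), kappa ~ Ga(1/c, 1/c).
set.seed(123)
eta    <- log(4)    # illustrative linear predictor, so exp(eta) = 4
c_disp <- 0.5       # illustrative overdispersion parameter c
n      <- 200000
kappa  <- rgamma(n, shape = 1 / c_disp, rate = 1 / c_disp)  # mean 1, variance c
y      <- rpois(n, exp(eta) * kappa)
emp_var  <- var(y)
theo_var <- exp(eta) + c_disp * exp(2 * eta)
```

The gamma mixing variable has mean 1, so the marginal mean stays at exp(ηt) while the variance is inflated by the quadratic term.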


Autoregression on both past observations and means for count data is included in autoregressive
conditional Poisson (ACP) models (Heinen, 2003; Fokianos et al., 2009). Classical
estimation is implemented in the R package acp (https://cran.r-project.org/web/packages/acp/acp.pdf).
The ACP model is a particular case of the ARMA-GLM model for counts set
out by Liboschik et al. (2017). Poisson means in the ACP(p,q) model are specified as

$$\mu_t = \omega + \sum_{j=1}^{p} \phi_j y_{t-j} + \sum_{k=1}^{q} \gamma_k \mu_{t-k}$$

with all parameters positive. Under an ACP(1,1) model, one therefore has

$$\mu_t = \omega + \phi y_{t-1} + \gamma \mu_{t-1}.$$

In this model, stationarity is obtained subject to a constraint $\phi + \gamma < 1$. Defining $D = 1 - (\phi + \gamma)^2$,
the unconditional variance is given (Heinen, 2003, p.5) by

$$\text{Var}(y_t) = \mu_t (D + \phi^2)/D \geq \mu_t,$$
so that unconditionally the ACP is overdispersed, even though the conditional
distribution is equidispersed. Covariates may be introduced by defining a conditional
mean $\mu_t^* = \exp(X_t \beta) \mu_t$ (Jung et al., 2006), and further overdispersion can be achieved
by setting $\mu_t^* = \exp(X_t \beta) \mu_t \lambda_t$, where λt is lognormal.
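Unconditional overdispersion under the ACP(1,1) model can be illustrated by simulation. The sketch below uses base R only; the function name simulate_acp11, the burn-in length, and the parameter values are assumptions for illustration, and this is not the interface of the acp package.

```r
# Simulate an ACP(1,1) series: y_t ~ Po(mu_t), mu_t = omega + phi*y_{t-1} + gamma*mu_{t-1}
simulate_acp11 <- function(n, omega, phi, gamma, burn = 500) {
  stopifnot(omega > 0, phi > 0, gamma > 0, phi + gamma < 1)
  N  <- n + burn
  mu <- numeric(N); y <- numeric(N)
  mu[1] <- omega / (1 - phi - gamma)   # start at the unconditional mean
  y[1]  <- rpois(1, mu[1])
  for (t in 2:N) {
    mu[t] <- omega + phi * y[t - 1] + gamma * mu[t - 1]
    y[t]  <- rpois(1, mu[t])
  }
  y[(burn + 1):N]                      # discard burn-in
}

set.seed(2020)
phi <- 0.3; gamma <- 0.5; omega <- 1
y <- simulate_acp11(50000, omega, phi, gamma)
D <- 1 - (phi + gamma)^2
uncond_mean <- omega / (1 - phi - gamma)
theory_var  <- uncond_mean * (D + phi^2) / D
c(mean = mean(y), var = var(y), theory_var = theory_var)
```

The sample variance should exceed the sample mean, in line with the variance expression quoted from Heinen (2003).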

Example 5.6 Ontario Car Fatalities


To illustrate different models for count time series and intervention analysis (Santos et
al., 2010) we consider data on road fatalities in Ontario between 1931 and 2001, and the
impact of a seat belt law introduced on January 1, 1976. Expected accidents, Et, obtained
as average accident rate times the number of registered drivers in a year, are used as an
offset in the regression term.
The first analysis uses the ACP(1,1) model with a multiplicative lognormal error

$$\mu_t^* = E_t \exp(\beta S_t) \mu_t \lambda_t,$$

$$\mu_t = \omega + \phi y_{t-1} + \gamma \mu_{t-1}.$$

The seatbelt intervention (St) is represented by a binary variable with values 1 from 1976
onwards, 0 before. The estimates from a two-chain run of 25,000 iterations using jagsUI
show most of the lag effect on μt operating through the conditional means, with γ having
posterior mean (sd) of 0.98 (0.002). Predictive checks (comparing replicates from the
posterior predictive distribution with actual observations) are satisfactory. The seatbelt
effect β is estimated as −0.62 (0.08), but the pointwise LOO values still show 1976
(and surrounding years) as poorly fit. In that year, the accident rate per million fell to
350 compared to 433 in the previous year. It may be noted that estimates of the LOO-IC
and WAIC (respectively 797 and 744) are unstable.
A second analysis adopts an antedependence approach, whereby $y_t \sim \text{Po}(\mu_t)$,
$\log(\mu_t) = \log(E_t) + \beta S_t + g_t$, with $g_t$ following a first-order antedependence scheme,
whereby

$$g_1 = e_1,$$

$$g_t = \phi_t g_{t-1} + e_t, \quad t > 1,$$

with $e_t \sim N(0, \omega_t)$. The variances ωt are modelled as a quadratic function of time,
$\log(\omega_t) = a_1 + a_2 t + a_3 t^2$. This model provides rapid convergence using jagsUI. Estimates
of the LOO-IC and WAIC are lowered (to 781 and 737 respectively). The seatbelt parameter
has a mean (95% CRI) of −0.25 (−0.48, −0.08), while the estimated variances ωt decline
over time.
The pointwise LOO-IC values are no longer clustered in the 1970s under the antedependence
model, but in an attempt to better represent the discontinuity at 1976, a decay
effect in the impact of the seat belt law is implemented. The effect is constrained to be negative
and most pronounced in 1976, with subsequent effects monotonically less negative,
and with the effect set to zero unless it is negative. On this basis, we find that the year
1985 is the last year with a probability exceeding 0.5 that the seatbelt effect is negative.
The LOO-IC and WAIC are now respectively 778 and 736, with the most extreme pointwise
LOO-IC being for 1931 and 1982, followed by 1976. A plot of the relative risks (after
controlling for the intervention) shows a continuing downward trend (Figure 5.4).
FIGURE 5.4
Annual relative risks, Ontario accidents, posterior means (vertical axis: relative risks; horizontal axis: year, 1930 to 2000).

We also consider the CLAR(1) model (Grunwald et al., 2000), with form

$$\mu_t = \rho_1 y_{t-1} + \exp(\log(E_t) + \beta S_t + h_t),$$

$$h_t \sim N(\rho_2 h_{t-1}, \sigma_h^2),$$

with $\rho_1 \sim U(0, 1)$, and $\rho_2 \sim U(-1, 1)$. This model has satisfactory predictive checks, and
there is no significant correlation between successive errors $(y_t - \mu_t)/\mu_t^{0.5}$. However, the
LOO-IC and WAIC, at 811 and 754, are higher than for other models, with performance
being vitiated by discontinuities in the series, such as in 1937. The β coefficient has a
mean (95% CRI) of −0.33 (−1.23, −0.01).
Finally, we use R-INLA to estimate a model with random walk level, with $y_t \sim \text{Po}(\mu_t)$,

$$\log(\mu_t) = \log(E_t) + \beta S_t + w_t,$$

$$w_t \sim N(w_{t-1}, \sigma_w^2).$$

This model provides a posterior mean and CRI for β of −0.22 (−0.41, −0.04), and a
WAIC of 751. R-INLA code for a model including trend as well as level can be written
using an augmented data representation (Ruiz-Cárdenas et al., 2012).

Example 5.7 Old Faithful Data


This example uses binary data generated from the Old Faithful geyser data in R, setting
y = 1 if the eruption time exceeds 3 minutes, and y = 0 otherwise. There are T = 272 data
points. These data are characterised by several long runs of ones, but at most two zeroes
in a run. We compare a parameter driven model, including random AR1 dependence,
with two observation driven binary autoregressive moving average (BARMA) models
(Startz, 2008).
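The binary series can be constructed from the built-in faithful data frame as in the following sketch. The object names (y, runs, max_zero_run, max_one_run) are chosen here for illustration; the 3 minute threshold is the one stated in the text.

```r
# Binary indicator from the Old Faithful eruption durations (in minutes)
data(faithful)
y <- as.integer(faithful$eruptions > 3)   # 1 if eruption time exceeds 3 minutes

# Run lengths of consecutive zeroes and ones
runs <- rle(y)
max_zero_run <- max(runs$lengths[runs$values == 0])
max_one_run  <- max(runs$lengths[runs$values == 1])
table(y)
```

Inspecting max_zero_run and max_one_run gives a quick check of the run structure described above.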
In the first model, we have

$$\text{logit}(p_t) = \beta_1 + \beta_{2t} y_{t-1},$$

with $\beta_{2t} = \beta_2 + g_t$, where the $g_t$ follow an RW1 prior, implemented using the car.normal
function in R2OpenBUGS. Under this function, the $g_t$ are centred at each iteration, leading to
improved identifiability. This provides a LOO-IC of 293. Figure 5.5 plots the varying
AR coefficients β2t.
BARMA models may also be implemented via R2OpenBUGS, but rstan provides
considerably faster computation and convergence. We compare an autoregressive lag
5 BARMA(5,0) model, with AR coefficients following a horseshoe prior for parsimony,
with a BARMA(5,1) model including a moving average term. Generically

$$\text{logit}(p_t) = \beta_1 + \sum_{j=1}^{p} \rho_j y_{t-j} + \sum_{k=1}^{q} \theta_k (y_{t-k} - p_{t-k}),$$

where p = 5 and q = 1.
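The recursive nature of the BARMA predictor (current probabilities depend on earlier probabilities through the moving average term) can be sketched in base R as follows. The function name barma_probs, the use of the sample mean as a start value for the first max(p, q) probabilities, and the coefficient values are all assumptions for illustration; the book's rstan code instead includes initial condition parameters.

```r
# Recursive BARMA(p, q) probabilities, conditioning on the first max(p, q) points
barma_probs <- function(y, beta1, rho, theta) {
  p <- length(rho); q <- length(theta); m <- max(p, q)
  n <- length(y)
  prob <- rep(NA_real_, n)
  prob[seq_len(m)] <- mean(y)          # placeholder start values (assumption)
  for (t in (m + 1):n) {
    ar <- sum(rho * y[t - seq_len(p)])
    ma <- if (q > 0) sum(theta * (y[t - seq_len(q)] - prob[t - seq_len(q)])) else 0
    prob[t] <- plogis(beta1 + ar + ma) # inverse logit
  }
  prob
}

set.seed(7)
y    <- rbinom(100, 1, 0.6)            # illustrative binary series
prob <- barma_probs(y, beta1 = 0.5, rho = c(-1, 0.5, 0, 0, 0), theta = -0.3)
```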


Under a BARMA(5,0), the horseshoe prior for ρj specifies (for p = 5),

$$\kappa_j \sim \text{Beta}(0.5, 0.5), \quad j = 1, \ldots, p$$

$$\tau_\lambda^2 \sim \text{IG}(1, 0.001)$$

$$\lambda_j = (1/\kappa_j - 1)$$

$$\rho_j \sim N(0, \lambda_j \tau_\lambda),$$

where the posterior means of $\varphi_j = 1 - \kappa_j$ are effectively analogous to posterior selection
rates. The model includes initial condition parameters so that all data points can be
included in the likelihood. For example, the model at t = 1 refers to unobserved preseries
data points represented in the parameter $e_1 = \sum_{j=1}^{p} \rho_j y_{1-j}$. The alternative is to condition
on the first five observations. In fact, only the AR1 coefficient plays a significant role,
with ρ1 having posterior mean (sd) of −2.56 (0.44), and with $\varphi_1 = 0.98$. The LOO-IC for
this model deteriorates to 332.
A BARMA(5,1) model finds ρ2 and θ1 (beta[3] and beta[7] in the code) to be significant
with respective posterior means (sd) 1.39 (0.37) and −2.3 (0.67). The LOO-IC for this
model is 331.

FIGURE 5.5
Old Faithful data. Varying AR1 regression coefficient (vertical axis: posterior mean β2; horizontal axis: observation).

5.5 Stochastic Variances
Many state-space applications assume constant variances in the observation and state
equations, but there is often nonstationarity in such variances (Omori and Watanabe, 2015;
Broto and Ruiz, 2004). Certain types of data such as exchange rate and share price series
rt are particularly likely to demonstrate volatility clustering (Granger and Machina, 2006),
with fluctuating variances Vt = var(rt). Typically, there are periods where volatility is
relatively high and periods where volatility is relatively low, often with relatively smooth
transition between high and low volatility regimes. In many applications, the series is
transformed to have an effectively zero mean (Meyer and Yu, 2000, p.200). For example,
the ratio of successive exchange rates $r_t/r_{t-1}$ has approximate average 1, so that a response
obtained as $y_t = \log(r_t/r_{t-1})$ can be taken to average zero. Hence, one may write a model
without intercept (or predictor effects) as

$$y_t = V_t^{0.5} u_t,$$

where $u_t \sim N(0, 1)$, but the variances Vt are unknowns.


Stochastic volatility models apply state-space techniques to model changing variances.
A widely used scheme involves a state-space or autoregressive model in log scale parameters
(Meyer and Yu, 2000; Jacquier et al., 2004; Kim et al., 1998; Harvey et al., 1994). With
$h_t = \log(V_t)$, and stationary AR1 model for ht, one has

$$y_t = \sqrt{V_t}\, u_t = \exp\left(\frac{h_t}{2}\right) u_t, \tag{5.2}$$

$$h_t = \mu + \phi (h_{t-1} - \mu) + \sigma_w w_t, \quad t > 1$$

$$h_1 \sim N\left(\mu, \frac{\sigma_w^2}{1 - \phi^2}\right),$$

$$\begin{pmatrix} u_t \\ w_t \end{pmatrix} \sim N\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \right)$$

where |ϕ| < 1 measures persistence in the volatility, but the ut and wt series are uncorrelated.
This scheme can be generalised to multivariate responses subject to volatility, such
as a set of exchange rates (see Chapter 7, and Yu and Meyer, 2006).
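A short simulation shows how the stationary AR1 scheme for the log variances generates volatility clustering. This is a sketch under assumed parameter values; the function name simulate_sv is hypothetical.

```r
# Simulate the stochastic volatility model with uncorrelated u_t and w_t
simulate_sv <- function(n, mu, phi, sigma_w) {
  stopifnot(abs(phi) < 1, sigma_w > 0)
  h <- numeric(n)
  # h_1 from the stationary distribution N(mu, sigma_w^2 / (1 - phi^2))
  h[1] <- rnorm(1, mu, sigma_w / sqrt(1 - phi^2))
  for (t in 2:n) h[t] <- mu + phi * (h[t - 1] - mu) + sigma_w * rnorm(1)
  y <- exp(h / 2) * rnorm(n)   # y_t = exp(h_t / 2) * u_t
  list(y = y, h = h, V = exp(h))
}

set.seed(11)
sim <- simulate_sv(1000, mu = -6, phi = 0.9, sigma_w = 0.4)
```

Plotting sim$y against time would typically show spells of high and low variability, with sim$V tracing the underlying variance.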
As a heavy tailed alternative, one may consider a Student t likelihood for the log scale
series, implemented as a scale mixture of normals (Jacquier et al., 2004). With ν degrees of
freedom, one has

$$y_t = \sqrt{\lambda_t V_t}\, u_t = \sqrt{\lambda_t} \exp\left(\frac{h_t}{2}\right) u_t,$$

$$1/\lambda_t \sim \text{Ga}\left(\frac{\nu}{2}, \frac{\nu}{2}\right),$$

and other aspects as above. A diffuse prior on ν is not suitable, and one option is an exponential
prior with prior mean 10 or 20 (Fernández and Steel, 1998). For a recent alternative
prior (applicable to other types of Student t regression), see Fonseca et al. (2008). This
model deals with isolated y-outliers by introducing a large λt, and it requires a sequence of
large |yt| before Vt is increased (Jacquier et al., 2004, p.190).
By contrast, generalised autoregressive conditional heteroscedastic (GARCH) models
involve autoregression in $y_t^2$ and/or Vt. A GARCH(p,q) model specifies

$$V_t = \gamma + \sum_{j=1}^{p} \alpha_j y_{t-j}^2 + \sum_{j=1}^{q} \beta_j V_{t-j}$$

where coefficients $\{\gamma, \alpha_j, \beta_j\}$ are constrained to be positive, and setting q = 0 leads to the
ARCH(p) model (Engle, 1982). Stationarity requires

$$\sum_{j=1}^{p} \alpha_j + \sum_{j=1}^{q} \beta_j < 1$$

though is not necessarily imposed a priori. Whichever approach is used, departures from
normality are frequently relevant, such that $y_t/\sqrt{V_t}$ is non-Gaussian. Among heavy tailed
alternatives, one may consider a Student t, either $u_t \sim t(0, 1, \nu)$, or a scale mixture of normals
(Bauwens and Lubrano, 1998; Chib et al., 2002).
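The GARCH recursion can be illustrated for the (1,1) case with normal errors. The sketch below is a base R simulation under assumed parameter values; simulate_garch11 is a hypothetical name. Under stationarity the unconditional variance is γ/(1 − α − β), which the sample variance should approximate.

```r
# Simulate GARCH(1,1): V_t = gamma + alpha * y_{t-1}^2 + beta * V_{t-1}
simulate_garch11 <- function(n, gamma, alpha, beta, burn = 500) {
  stopifnot(gamma > 0, alpha > 0, beta > 0, alpha + beta < 1)
  N <- n + burn
  V <- numeric(N); y <- numeric(N)
  V[1] <- gamma / (1 - alpha - beta)   # unconditional variance as start value
  y[1] <- sqrt(V[1]) * rnorm(1)
  for (t in 2:N) {
    V[t] <- gamma + alpha * y[t - 1]^2 + beta * V[t - 1]
    y[t] <- sqrt(V[t]) * rnorm(1)
  }
  keep <- (burn + 1):N
  list(y = y[keep], V = V[keep])
}

set.seed(3)
sim <- simulate_garch11(20000, gamma = 0.1, alpha = 0.1, beta = 0.8)
uncond_var <- 0.1 / (1 - 0.1 - 0.8)
```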
In case y has a non-zero mean, or there are predictors, one may widen the model for y.
For example, a model with a zero mean y and lag 1 effect in y would be

$$y_t = \rho y_{t-1} + \sqrt{V_t}\, u_t.$$

One variant is the doubly autoregressive model (Ling, 2004)

$$y_t = \rho y_{t-1} + u_t \sqrt{\gamma + \alpha y_{t-1}^2}.$$

For ut normal, this can be shown equivalent to the random coefficient AR model

$$y_t = (\rho + a_t) y_{t-1} + c_t,$$

where (at,ct) are bivariate normal with mean 0 and covariance matrix Diag(α,γ).
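The equivalence rests on both forms having conditional mean ρ y_{t−1} and conditional variance γ + α y²_{t−1}. A Monte Carlo check for one fixed lagged value is sketched below; the parameter values and the lagged value y_prev are illustrative assumptions.

```r
# Compare the two representations conditional on a fixed y_{t-1}
set.seed(5)
rho <- 0.2; alpha <- 0.3; gamma <- 0.5
y_prev <- 1.5
n <- 200000

# Doubly autoregressive form
y_dar <- rho * y_prev + rnorm(n) * sqrt(gamma + alpha * y_prev^2)

# Random coefficient form, (a_t, c_t) independent normal with variances alpha, gamma
a_t <- rnorm(n, 0, sqrt(alpha))
c_t <- rnorm(n, 0, sqrt(gamma))
y_rca <- (rho + a_t) * y_prev + c_t

cond_var <- gamma + alpha * y_prev^2
c(var_dar = var(y_dar), var_rca = var(y_rca), theory = cond_var)
```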
A generalisation of the state-space approach is to introduce correlation between the
ut and wt terms, and so reflect leverage effects. Positive and negative shocks then have
different impacts on future volatility (Wang et al., 2011; Asai et al., 2006; Jacquier et al.,
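2004; Meyer and Yu, 2000; Chen and So, 2006). So one possible scheme has

$$y_t = \sqrt{V_t}\, u_t = \exp\left(\frac{h_t}{2}\right) u_t,$$

$$h_t = \mu + \phi (h_{t-1} - \mu) + \sigma_w w_t,$$

$$\begin{pmatrix} u_t \\ w_t \end{pmatrix} \sim N\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & \varphi \\ \varphi & 1 \end{pmatrix} \right),$$

where φ is a correlation. A heavy tailed version of the leverage model (Jacquier et al., 2004;
Omori et al., 2007) may be obtained with

$$y_t = \sqrt{\lambda_t} \exp\left(\frac{h_t}{2}\right) u_t,$$

$$1/\lambda_t \sim \text{Ga}\left(\frac{\nu}{2}, \frac{\nu}{2}\right).$$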

A GARCH model including leverage is obtained by setting $z_t = \sqrt{V_t}\, u_t$ in
$y_t = \mu_y + \rho (y_{t-1} - \mu_y) + \sqrt{V_t}\, u_t$. Leverage is then obtained under the following asymmetric
model (Glosten et al., 1994; Fonseca et al., 2016)

$$V_t = \gamma + \alpha_1 z_{t-1}^2 + \alpha_2 z_{t-1}^2 I(z_{t-1} > 0) + \beta V_{t-1}.$$
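The asymmetric variance recursion can be written as a small function taking a given series of shocks. This is a sketch only: the function name gjr_variance, the starting variance V1, and the toy inputs are assumptions, and the indicator follows the form stated above.

```r
# Asymmetric variance recursion:
# V_t = gamma + alpha1 * z_{t-1}^2 + alpha2 * z_{t-1}^2 * I(z_{t-1} > 0) + beta * V_{t-1}
gjr_variance <- function(z, gamma, alpha1, alpha2, beta, V1) {
  n <- length(z)
  V <- numeric(n)
  V[1] <- V1
  for (t in 2:n) {
    V[t] <- gamma + alpha1 * z[t - 1]^2 +
      alpha2 * z[t - 1]^2 * (z[t - 1] > 0) + beta * V[t - 1]
  }
  V
}

z <- c(1, -1, 2, -2)   # toy shocks
V <- gjr_variance(z, gamma = 0.1, alpha1 = 0.1, alpha2 = 0.2, beta = 0.5, V1 = 1)
```

Shocks of equal size but opposite sign feed differently into the next period's variance whenever α2 is nonzero.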

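Under the model (5.2), assume priors $\mu \sim N(0, \sigma_\mu^2)$, $(\phi + 1)/2 \sim \text{Be}(r_\phi, s_\phi)$, and $\sigma_w^2 \sim \text{IG}(k_w, l_w)$,
where $\{\sigma_\mu^2, r_\phi, s_\phi, k_w, l_w\}$ are known. Then with $\psi = (\mu, \phi, \sigma_w^2)$, the joint posterior of ψ
and the latent log variances $h = (h_1, \ldots, h_T)$ is

$$p(\psi, h \mid y) \propto \left[ \prod_{t=1}^{T} \exp\{-h_t/2\} \exp\left\{ -\frac{y_t^2}{2 e^{h_t}} \right\} \right] \left[ \left( \frac{1 - \phi^2}{\sigma_w^2} \right)^{0.5} \exp\left\{ -\frac{1 - \phi^2}{2 \sigma_w^2} (h_1 - \mu)^2 \right\} \right]$$

$$\times \left[ \prod_{t=2}^{T} \left( \frac{1}{\sigma_w^2} \right)^{0.5} \exp\left\{ -\frac{1}{2 \sigma_w^2} \left( h_t - \mu - \phi (h_{t-1} - \mu) \right)^2 \right\} \right] p(\mu)\, p(\phi)\, p(\sigma_w^2)$$

and Gibbs sampling proceeds from the full conditionals (Kim et al., 1998). The Griddy–Gibbs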
technique may also be used to enable Gibbs sampling of all parameters in a GARCH(1,1)
model, with normal or Student distributed ut (Bauwens and Lubrano, 1998). Chib et al.
(2002) consider more general Metropolis–Hastings techniques including particle filtering,
to sample from models with discontinuities in the observations.

Example 5.8 Bitcoin Price


Consider 365 observations rt of the Bitcoin price (in thousands of dollars) during 2017. The data are
obtained as daily returns, namely $y_t = (r_t - r_{t-1})/r_t$; see Figure 5.6 for a plot of $y_t$, which
shows several spells of high volatility. As one of several ways to represent the data, the
double autoregressive model (Ling, 2004), namely

$$y_t = \rho y_{t-1} + u_t \sqrt{\gamma + \alpha y_{t-1}^2},$$

is applied, with the constraint $\rho^2 + \alpha < 1$ sufficient to ensure $E(y_t^2) < \infty$. The analysis
conditions on the first observation. A two-chain run of 5,000 iterations using jagsUI
gives posterior means (sd) for ρ and α of 0.14 (0.05) and 0.16 (0.06), with γ estimated as
0.0021 (0.0002). The LOO-IC is obtained as −1155.

FIGURE 5.6
Fluctuations in returns $y_t = (r_t - r_{t-1})/r_t$ ($r_t$ is Bitcoin price); vertical axis: Bitcoin return; horizontal axis: index.
A second approach is based on a stationary autoregressive stochastic volatility model
(in stan), as in (5.2), with

$$y_t = \sqrt{V_t}\, u_t = \exp\left(\frac{h_t}{2}\right) u_t,$$

$$u_t \sim N(0, 1),$$

$$h_t = \mu + \phi (h_{t-1} - \mu) + \sigma_w w_t, \quad t > 1$$

with a uniform U(−1,1) prior on ϕ, and a half Cauchy prior on σw. With a two-chain run
of 2,000 iterations, we obtain posterior estimates (mean, sd) for μ and ϕ of −6.34 (0.35)
and 0.91 (0.05), with the LOO-IC estimated as −1249. Figure 5.7 plots the evolving variance
$V_t = \exp(h_t)$ under this model.
FIGURE 5.7
Stochastic volatility. Bitcoin data (vertical axis: variance; horizontal axis: index).

To better represent the extreme volatility in the series, a Student t (scale mixture) version
of the preceding stochastic volatility model is applied. Thus

$$y_t = \sqrt{\lambda_t V_t}\, u_t = \sqrt{\lambda_t} \exp\left(\frac{h_t}{2}\right) u_t,$$

$$1/\lambda_t \sim \text{Ga}\left(\frac{\nu}{2}, \frac{\nu}{2}\right),$$

with a prior $\nu \sim E(0.1)$. This provides an improved LOO-IC of −1255, with a posterior
mean (sd) for ν of 15.2 (9.6). Low values of the precision scaling parameters $\zeta_t = 1/\lambda_t$ are
indicators of outlier status, and we find that cases 200, 284, 10, 216, and 340 have the lowest
posterior mean ζt. Two of these cases have return values exceeding 20%.
Outliers can also be represented by a binary shift mechanism (Wang, 2011). Thus

$$y_t = J_t N_t + \exp\left(\frac{h_t}{2}\right) u_t,$$

$$h_t = \mu + \phi (h_{t-1} - \mu) + \sigma_w w_t, \quad t > 1$$

where

$$J_t \sim \text{Bern}(\pi_J),$$

$$N_t \sim N(0, \sigma_N^2),$$

represent the shift mechanism and its potential size respectively. The probability πJ can
be preset, or assigned a prior favouring a low outlier rate. Taking $\pi_J \sim \text{Beta}(2, 48)$, this
model (fitted using jagsUI) provides a LOO-IC of −1256. The highest posterior probabilities
$\Pr(J_t = 1 \mid y)$ are for cases 284, 39, 10, 216, and 200.

5.6 Modelling Discontinuities in Time


Aberrant observations or shifts in a series can bias parameter estimates and other
inferences in time series models (Chen and Liu, 1993; Tsay, 1986; Hamilton, 2007), and
a variety of methods exist for modelling shifts or outliers in the observation, state,
or error series. These extend to shifts in variance parameters also (as considered in
Example 5.10).
Robust versions of the priors for the component errors ut and/or wjt in dynamic models
may be applied to allow flexibility in response to disparate observations. For example, a
heavy tailed alternative to Gaussian errors (Martin and Raftery, 1987) may be invoked by
scale mixing at both levels in the local level model

$$y_t = \beta_t + u_t,$$

$$\beta_t = \beta_{t-1} + w_t,$$

with

$$u_t \sim N(0, V/\lambda_{1t}),$$

$$w_t \sim N(0, \sigma_w^2/\lambda_{2t}),$$

$$\lambda_{1t} \sim \text{Ga}\left(\frac{\nu_u}{2}, \frac{\nu_u}{2}\right),$$

$$\lambda_{2t} \sim \text{Ga}\left(\frac{\nu_w}{2}, \frac{\nu_w}{2}\right).$$

This generalisation is adapted to detecting or accommodating additive outliers (outliers in
the observation errors) and innovation outliers in the state equation errors. Geweke (1993)
points out problems with adopting diffuse priors for ν, and possibilities include an exponential
density such as ν ~ E(0.1) (Fernández and Steel, 1998).
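That gamma scale mixing of a normal yields Student t errors can be verified numerically. The sketch below compares quantiles of mixture draws with scaled Student t quantiles; the values of ν and V are illustrative assumptions.

```r
# u | lambda ~ N(0, V / lambda), lambda ~ Ga(nu/2, nu/2)  =>  u ~ sqrt(V) * t_nu
set.seed(99)
nu <- 5; V <- 2; n <- 200000
lambda <- rgamma(n, shape = nu / 2, rate = nu / 2)
u <- rnorm(n, 0, sqrt(V / lambda))

probs <- c(0.05, 0.5, 0.95)
q_mix <- quantile(u, probs, names = FALSE)   # empirical quantiles of the mixture
q_t   <- sqrt(V) * qt(probs, df = nu)        # quantiles of the scaled Student t
rbind(q_mix, q_t)
```

Small values of the mixing variable correspond to inflated variances, which is why low posterior values flag outlying points.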
Many outlier mechanisms involve discrete mixing around default normal error assumptions,
as in a contaminated normal density (Verdinelli and Wasserman, 1991). Thus, let π be a
given prior probability of an outlier (e.g. π = 0.05). Then the observation error in a state-space
model can be modified to allow innovation outliers

$$u_t \sim (1 - \pi) N(0, W_1) + \pi N(0, W_2),$$

where $W_2 = K W_1$ with K large. A comprehensive generalisation of the normal errors dynamic
linear model is provided by taking yt and βt to follow the univariate or multivariate
exponential power distribution (Gomez et al., 2002).
More specialised binary switching in observation error or state error processes may be
applied (Diggle and Zeger, 1989), for example, adapted to positive pulses (e.g. periods with
abnormally heavy rainfall). To illustrate switching in observation errors to accommodate
positive pulses, consider the AR(1) observation model

$$y_t = \phi y_{t-1} + u_t,$$

such that usually $u_t = u_{1t}$, but exceptionally $u_t = u_{2t}$, where the latter error is necessarily
positive, namely

$$u_{1t} \sim N(0, \sigma^2),$$

$$u_{2t} \sim \text{Ga}(g_1, g_2),$$

where {g1,g2} are preset. Define latent allocation indicators $S_t \in \{1, 2\}$, as in Chapter 3. Then
$u_t = u_{2t}$ with probabilities $p_t = \Pr(S_t = 2)$, that might be defined by a separate model, such as

$$\text{logit}(p_t) = h_0 + h_1 y_{t-1}.$$

One may also distinguish innovation outliers from additive outliers corresponding to isolated
shifts or “gross errors” in the observation series (Tsay, 1986; Fox, 1972). This involves
separate binary indicators {SAt, SIt}, or a single multinomial indicator St. For example, let
πA and πI be prior probabilities of additive and innovative outliers, and consider an AR(1)
observation model with AR(1) errors

$$y_t = \phi_0 + \phi_1 y_{t-1} + a_t S_{At} + e_t,$$

$$e_t = \rho e_{t-1} + u_t,$$

where $S_{At} \sim \text{Bern}(\pi_A)$, and $a_t \sim N(0, \sigma_a^2)$ represents the sizes of the additive outliers
(McCulloch and Tsay, 1994). Innovation outliers are encompassed by a variance inflation
mechanism with

$$u_t \sim (1 - \pi_I) N(0, V) + \pi_I N(0, KV),$$

with $K \gg 1$, as determined by latent indicators $S_{It} \sim \text{Bern}(\pi_I)$.
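The variance inflation mechanism for innovation outliers amounts to drawing a latent indicator and then an error with the corresponding variance. A minimal base R sketch follows; the function name r_inflated and the values of πI, V and K are assumptions for illustration.

```r
# Draw u_t from (1 - pi_I) N(0, V) + pi_I N(0, K * V) via latent indicators
r_inflated <- function(n, pi_I, V, K) {
  S_I  <- rbinom(n, 1, pi_I)                   # 1 flags an innovation outlier
  sd_t <- sqrt(ifelse(S_I == 1, K * V, V))
  list(u = rnorm(n, 0, sd_t), S_I = S_I)
}

set.seed(8)
draws   <- r_inflated(100000, pi_I = 0.05, V = 1, K = 10)
mix_var <- (1 - 0.05) * 1 + 0.05 * 10 * 1      # marginal variance of the mixture
c(sample_var = var(draws$u), mix_var = mix_var, outlier_rate = mean(draws$S_I))
```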


The possibility of additive and innovative outliers coinciding at a single point may be
discounted (Barnett et al., 1996; Gerlach et al., 1999). So with both additive and innovation
outliers generated by variance inflation factors (respectively $K_A$ and $K_I$), one may have a
single trinomial indicator St governing outlier occurrence, with St = 1 if neither type of
outlier is present ($K_A = 0, K_I = 1$), St = 2 if an additive outlier is present ($K_A = 10, K_I = 0$), and
St = 3 if an innovation outlier is present ($K_A = 0, K_I = 10$). Then

$$S_t \sim \text{Mult}(1, [\pi_1, \pi_2, \pi_3]),$$

where π2 and π3 may be assigned preset values (e.g. $\pi_2 = \pi_3 = 0.025$), and

$$y_t = \phi_0 + \phi_1 y_{t-1} + a_{t S_t} + e_t$$

$$e_t = \rho e_{t-1} + u_{t S_t}$$

where $a_{t S_t} \sim N(0, K_A \sigma^2)$, $u_{t S_t} \sim N(0, K_I \sigma^2)$.


Enduring, rather than temporary, shifts in the mean or variance of a series require
another approach. Models with a single or small number of enduring changes in the level
of the series may be handled by extending conventional discrete mixture methods (e.g.
Leonte et al., 2003; Mira and Petrone, 1996; Albert and Chib, 1993; Perreault et al., 2000). To
illustrate binary switching in both levels and variances in an autoregressive error model
(McCulloch and Tsay, 1993), determined by binary pairs (S1t,S2t), consider

$$y_t = \mu_t + e_t,$$

where a change in level, namely

$$\mu_t = \mu_{t-1} + S_{1t} \Delta_t,$$

occurs when S1t = 1, with Pr(S1t = 1) = π1, and the Δt are random effects representing the
shifts. The errors are AR(p)

$$e_t = \rho_1 e_{t-1} + \rho_2 e_{t-2} + \ldots + \rho_p e_{t-p} + u_t$$

where shifts in the variance of $u_t \sim N(0, V_t)$ occur when S2t = 1 with Pr(S2t = 1) = π2. If there
is conditioning on $(y_1, \ldots, y_p)$, then the variance sequence commences with $V_{p+1} = \sigma^2$, and
subsequently,

$$V_t = V_{t-1} \quad \text{when } S_{2t} = 0,$$

$$V_t = \kappa_t V_{t-1} \quad \text{when } S_{2t} = 1,$$

where the κt are positive variables (e.g. gamma distributed) that model proportional shifts
in the error variance.
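The enduring variance shift mechanism can be simulated directly. In the sketch below the shift probability, the gamma distribution for the proportional shifts, and the starting variance are illustrative assumptions.

```r
# Variance sequence with occasional enduring proportional shifts
set.seed(21)
n      <- 300
pi2    <- 0.02                            # Pr(S_2t = 1), assumed for illustration
sigma2 <- 1                               # starting variance
S2     <- rbinom(n, 1, pi2)               # shift indicators
kappa  <- rgamma(n, shape = 2, rate = 2)  # positive proportional shifts
V <- numeric(n)
V[1] <- sigma2
for (t in 2:n) {
  V[t] <- if (S2[t] == 1) kappa[t] * V[t - 1] else V[t - 1]
}
u <- rnorm(n, 0, sqrt(V))                 # errors with piecewise constant variance
```

Between shift points the variance is constant, so the simulated errors show distinct variance regimes rather than isolated outliers.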
Shocks in different components of the basic structural model can also be considered (De
Jong and Penzer, 1998; Penzer, 2006). For example, in a three-component local linear trend
model, binary shock indicators (S1t, S2t, S3t) are invoked, such that

$$y_t = \mu_t + S_{1t} \Delta_{1t} + u_t,$$

$$\mu_t = \mu_{t-1} + S_{2t} \Delta_{2t} + \Delta_t + w_{1t},$$

$$\Delta_t = \Delta_{t-1} + S_{3t} \Delta_{3t} + w_{2t},$$

where the Δ1t represent temporary additive shocks that occur when S1t = 1, the Δ2t represent
shifts in mean, and the Δ3t represent shifts in the slope.
Regime switching models (Geweke and Terui, 1993; Lubrano, 1995) typically involve
discrete switching between two or more levels, regression regimes, or variances, though
smooth transition mechanisms can also be used. The choice between regimes is governed
by a binary switching function St, or a continuous transition function ϕt with values
between 0 and 1, such as the logit (Bauwens et al., 2000). A binary function St might be
defined as one if time t exceeds a threshold κ and zero otherwise, as in change-point models
for the mean level of a series. In self-exciting threshold autoregressive (SETAR) models,
the mechanism involves a lag in y; for example, St = 1 if $y_{t-1} > \kappa$. The continuous version in
these two cases would be

$$\phi_t = \frac{\exp(\omega [t - \kappa])}{1 + \exp(\omega [t - \kappa])},$$

$$\phi_t = \frac{\exp(\omega [y_{t-1} - \kappa])}{1 + \exp(\omega [y_{t-1} - \kappa])},$$

where ω is an extra unknown. Additionally, the lag r in the comparison $y_{t-r} > \kappa$ may be
unknown (Geweke and Terui, 1993).
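The relation between the binary switching function and its continuous logistic version can be seen numerically: as ω grows, the smooth transition approaches a step at κ. The sketch below is illustrative; the function name smooth_transition and the values of κ and ω are assumptions.

```r
# Logistic transition function exp(omega * (x - kappa)) / (1 + exp(omega * (x - kappa)))
smooth_transition <- function(x, kappa, omega) plogis(omega * (x - kappa))

t_idx      <- 1:100
kappa      <- 28
phi_smooth <- smooth_transition(t_idx, kappa, omega = 0.5)  # gradual transition
phi_sharp  <- smooth_transition(t_idx, kappa, omega = 50)   # near step function
S_binary   <- as.numeric(t_idx > kappa)                     # binary switching function
```

Replacing t_idx by a lagged response gives the smooth transition analogue of the SETAR mechanism.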

Example 5.9 Nile Discharges


Data on Nile discharges for 1871–1970 (T = 100) have been analysed by a variety of
ARMA and other methods and illustrate possible identification issues associated with
outlier and shift points. The initial analysis compares an AR(2) model for these data to
one allowing for an intercept shift (cf. Balke, 1993). Following that analysis, a Bayesian
estimation of the AR(2) is applied. To facilitate prior specification for latent preseries
values y0 and y−1, we centre the original data Yt by subtracting Y1 from all points. So
yt = Yt − Y1.
First, an AR(2) model with no shift mechanism (and a heavy tailed Student t prior for
the preseries points) is applied. Thus,

$$y_t = \phi_0 + \phi_1 y_{t-1} + \phi_2 y_{t-2} + u_t, \quad t = 1, \ldots, T$$

$$u_t \sim N(0, \sigma^2),$$

$$y_0 \sim t_2(\phi_0 + \varepsilon_1, \sigma^2),$$

$$y_{-1} \sim t_2(\phi_0 + \varepsilon_2, \sigma^2),$$

where εj are fixed effects, and N(0,1) priors are adopted for {ϕ1,ϕ2} so that nonstationarity
is allowed. A two-chain run using jagsUI provides a LOO-IC of 1284. The posterior
means (and 95% credible intervals) on the AR parameters {ϕ1,ϕ2} are obtained as 0.45
(0.27,0.64), and 0.25 (0.06,0.44).
Suppose, however, a shift in the series level is allowed: a series plot suggests such
a shift around 1895. One may also allow for coefficient selection via binary variables,
namely dj = 1 if ϕj (j > 0) is to be retained, with prior probabilities $\Pr(d_j = 1) = p_d$, with
$p_d \sim \text{Beta}(1, 1)$. So

$$y_t = \phi_{01} + \phi_{02} I(t > \kappa) + d_1 \phi_1 y_{t-1} + d_2 \phi_2 y_{t-2} + u_t$$

where κ is taken to be uniform between 3 and T − 3. Fitting this model provides an
improved LOO-IC of 1275, with posterior mean for κ of 29.8. The selection process
indicates that the lag in yt−2 is now in doubt, with $\Pr(d_2 = 1 \mid y) = 0.5$, whereas
$\Pr(d_1 = 1 \mid y) = 0.998$.
So an AR(1) model with shift mechanism is applied, namely

$$y_t = \phi_{01} + \phi_{02} I(t > \kappa) + \phi_1 y_{t-1} + u_t, \quad t = 1, \ldots, T$$

The LOO-IC is reduced to 1272, with κ now having mean 28.6 (i.e. the year 1899).
This is similar to the classical estimate of 28 obtained from the changepoint package
(Killick and Eckley, 2014). The lag 1 coefficient estimate is now 0.43 with 95% interval
(0.25,0.62).
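A rough classical counterpart to the shift point analysis can be obtained in base R by profiling the residual sum of squares over candidate change points. This sketch is a simplification that is assumed here for illustration: it fits a pure mean shift with no autoregressive term, and it does not use the changepoint package.

```r
# Least squares profile for a single mean shift in the Nile series
data(Nile)
y <- as.numeric(Nile)
n <- length(y)
cand <- 3:(n - 3)   # same range as the uniform prior on kappa

# Residual sum of squares when the mean shifts after time k
rss <- sapply(cand, function(k) {
  sum((y[1:k] - mean(y[1:k]))^2) + sum((y[(k + 1):n] - mean(y[(k + 1):n]))^2)
})
kappa_hat <- cand[which.min(rss)]
year_hat  <- start(Nile)[1] + kappa_hat - 1   # calendar year of the last pre-shift point
c(kappa_hat = kappa_hat, year_hat = year_hat)
```

The resulting estimate can be compared with the posterior mean for κ and the changepoint package estimate quoted above.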
Finally, we consider an AR(2) SETAR model (e.g. Korenok, 2009), which bases the shift
threshold on the discharge value. Specifically,

$$y_t = \phi_{01} + \phi_{02} I(y_{t-1} > \kappa_y) + \phi_1 y_{t-1} + \phi_2 y_{t-2} + u_t, \quad t = 2, \ldots, T$$

with κy assigned a uniform prior, $\kappa_y \sim U(-700, 300)$, based on actual (differenced) y values,
which have minimum (maximum) of −664 and 250. This model provides a LOO-IC
of 1282.6, with κy estimated as −336. The latter parameter is only weakly identified, as
can be verified by a prior-posterior overlap plot using MCMCvis. This explains the
small reduction in LOO-IC as against an AR(2) model with no shift. Figure 5.8 shows
the extent of updating in κy.
FIGURE 5.8
Density of κy (84.7% overlap; vertical axis: density; horizontal axis: parameter estimate).

Example 5.10 Box–Jenkins Series A


This example involves the Box–Jenkins series A, and demonstrates outlier modelling
via variance inflation in the observation component of an autoregressive state-space
model (cf. Gerlach et al., 1999). The observation model is

$$y_t \sim N(\beta_0 + \theta_t, V_{J_t}),$$

where Jt is a trinomial indicator modelling the measurement error outlier mechanism.
The state equation is

$$\theta_t = \phi \theta_{t-1} + w_t,$$

where $w_t \sim N(0, W)$, and ϕ is constrained to stationarity.
As discussed above, outlier probabilities are often preset. However, if variance
inflation factors are preset instead, then it is possible to take the outlier probabilities
as unknowns. Thus assume π1 = Pr(Jt = 1) is the unknown probability of a default measurement
error with variance V1, while π2 = π3 are unknown probabilities of moderate
and extreme outliers with variances 10V1 and 32V1 respectively. It is assumed that
$V_1 \sim \text{Ga}(1, 0.001)$, together with the parameterisation

$$\pi_1 = 1/(1 + r),$$

$$\pi_2 = \pi_3 = 0.5 r/(1 + r),$$

where r ~ E(9). Additionally, the variances of the observation and state equations are
linked by taking $W = q V_1$ with an E(1) prior on q.
A two-chain run using jagsUI shows early convergence with estimated probability
π1 = 0.94 (and 95% interval from 0.82 to 0.99). The observation error variance V1 has a
posterior mean of 0.031, while the state variance W has mean 0.037.
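The prior on the outlier probabilities implied by r ~ E(9) can be examined by simulation, which helps in judging how strongly the parameterisation favours the default error state. The sketch below is illustrative and uses only the parameterisation stated in the text; the number of draws is an arbitrary choice.

```r
# Prior draws of (pi1, pi2, pi3) implied by r ~ E(9)
set.seed(10)
r   <- rexp(100000, rate = 9)
pi1 <- 1 / (1 + r)
pi2 <- 0.5 * r / (1 + r)
pi3 <- pi2
prior_summary <- c(mean_pi1 = mean(pi1),
                   q025 = unname(quantile(pi1, 0.025)),
                   q975 = unname(quantile(pi1, 0.975)))
prior_summary
```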

5.7 Computational Notes
[1] The code for the ARMA(4,0,1) model in Example 5.1 is
library(rstan)
library(loo)
ARMA41.stan <- "
data {
  int<lower=1> T;          // length of series
  real y[T];
}
parameters {
  real phi0;               // intercept
  real phi[4];             // autoregression coeffs
  real gamma1;             // moving avg coeff
  real kappa[4];           // composite initial condition effects
  real<lower=0> sigma;     // residual sd
}
transformed parameters {
  real y_fit[T];
  real error[T];
  // kappa[1] is composite parameter for effects of y[0],y[-1], etc., and
  // gamma1*(y[0]-y_fit[0])
  y_fit[1] = phi0+kappa[1];
  y_fit[2] = phi0+phi[1]*y[1]+gamma1*(y[1]-y_fit[1])+kappa[2];
  y_fit[3] = phi0+phi[1]*y[2]+phi[2]*y[1]+gamma1*(y[2]-y_fit[2])+kappa[3];
  y_fit[4] = phi0+phi[1]*y[3]+phi[2]*y[2]+phi[3]*y[1]+gamma1*(y[3]-y_fit[3])+kappa[4];
  for (t in 5:T) {
    y_fit[t] = phi0+phi[1]*y[t-1]+phi[2]*y[t-2]+phi[3]*y[t-3]+phi[4]*y[t-4]
               +gamma1*(y[t-1]-y_fit[t-1]);
  }
  // residuals, computed once all fitted values are available
  for (t in 1:T) {error[t] = y[t]-y_fit[t];}
}
model {
  real eps[T];
  phi0 ~ normal(0,10);
  kappa ~ normal(0,10);
  phi ~ normal(0,2);
  gamma1 ~ normal(0,2);
  sigma ~ cauchy(0,5);
  eps[1] = y[1]-phi0-kappa[1];
  eps[1] ~ normal(0,sigma);
  eps[2] = y[2]-(phi0+phi[1]*y[1]+gamma1*(y[1]-y_fit[1])+kappa[2]);
  eps[2] ~ normal(0,sigma);
  eps[3] = y[3]-(phi0+phi[1]*y[2]+phi[2]*y[1]+gamma1*(y[2]-y_fit[2])+kappa[3]);
  eps[3] ~ normal(0,sigma);
  eps[4] = y[4]-(phi0+phi[1]*y[3]+phi[2]*y[2]+phi[3]*y[1]
           +gamma1*(y[3]-y_fit[3])+kappa[4]);
  eps[4] ~ normal(0,sigma);
  for (t in 5:T) {
    eps[t] = y[t]-(phi0+phi[1]*y[t-1]+phi[2]*y[t-2]+phi[3]*y[t-3]+phi[4]*y[t-4]
             +gamma1*eps[t-1]);
    eps[t] ~ normal(0,sigma);
  }
}
generated quantities {
  vector[T] log_lik;
  for (t in 1:T) {log_lik[t] = normal_lpdf(y[t] | y_fit[t], sigma);}
}
"
# Initial Values and Estimation
# D is the data list (T and y) for Example 5.1, assumed to be defined beforehand
sm <- stan_model(model_code=ARMA41.stan)
INI <- list(list(phi0=8,gamma1=0.8,phi=c(0.3,0.9,-0.4,-0.2),sigma=0.5,
                 kappa=c(-1,1,-1,1.5)),
            list(phi0=7,gamma1=0.9,phi=c(0.4,0.8,-0.3,-0.1),sigma=0.7,
                 kappa=c(-2,1.5,-1.5,2)))
fit4 <- sampling(sm,data=D,pars=c("phi0","phi","gamma1","y_fit","kappa","log_lik"),
                 iter=10000,warmup=500,chains=2,seed=12345,init=INI)
print(fit4)
# Fit
LLsamps <- extract(fit4,"log_lik",permuted=FALSE)
# 2 chains of 9500 retained iterations each, T = 598 time points
LLsamps <- matrix(LLsamps, 2*9500, 598)
loo(LLsamps)

[2] The code for the random coefficient AR1 model in Example 5.2 is

   
RCAR.stan <- "
data {
  int<lower=0> T;
  vector[T] y;
}
parameters {
  real mu;
  real eta[T];
  real y0;                       // unobserved preseries value
  real mu_phi;
  real<lower=0> sigma;
  real<lower=0> sigma_phi;
}
transformed parameters {
  vector[T] muy;
  vector[T] phi;
  // random AR1 coefficients (non-centred form)
  phi[1] = mu_phi+eta[1]*sigma_phi;
  for (t in 2:T) {phi[t] = mu_phi+eta[t]*sigma_phi;}
  muy[1] = mu+(mu_phi+eta[1]*sigma_phi)*y0;
  for (t in 2:T) {muy[t] = mu+(mu_phi+eta[t]*sigma_phi)*y[t-1];}
}
model {
  sigma ~ normal(0, 1);
  eta ~ normal(0,1);
  mu ~ normal(0, 20);
  mu_phi ~ normal(0, 1);
  y0 ~ normal(0,20);
  for (t in 1:T) {y[t] ~ normal(muy[t], sigma);}
}
generated quantities {
  vector[T] log_lik;
  for (t in 1:T) {log_lik[t] = normal_lpdf(y[t] | muy[t], sigma);}
}
"

[3] The code for the intervention decay effect antedependence model is as follows:

   
library(jagsUI)
library(loo)
cat("model {
for (t in 1:71) {
  y[t] ~ dpois(mu[t])
  # Scaled deviance and likelihood terms
  yts[t] <- equals(y[t],0)+(1-equals(y[t],0))*y[t]
  mus[t] <- equals(y[t],0)+(1-equals(y[t],0))*mu[t]
  dv[t] <- 2*(y[t]*log(yts[t]/mus[t])-(y[t]-mu[t]))
  LL[t] <- -mu[t]+y[t]*log(mu[t])-logfact(y[t])
  # Predictive checks
  ynew[t] ~ dpois(mu[t])
  ch[t] <- step(ynew[t]-y[t])-0.5*equals(ynew[t],y[t])
  # Regression
  log(mu[t]) <- log(E[t])+beta[t]*SB[t]+g[t]
  # Relative risk after control for intervention
  RR[t] <- exp(g[t])
}
Dv <- sum(dv[1:71])
g.m <- mean(g[])
# priors
phi ~ dnorm(0,1)
g[1] ~ dnorm(0,1/omega[1])
for (t in 2:71) {g[t] ~ dnorm(phi*g[t-1],1/omega[t])}
# Variance model
for (t in 1:71) {log(omega[t]) <- gam[1]+gam[2]*t/100+gam[3]*t*t/10000}
for (j in 1:3) {gam[j] ~ dnorm(0,1)}
# Intervention effect
for (r in 1:26) {b[r] ~ dnorm(0,tau.b)}
tau.b ~ dexp(1)
# sort ascending order
bsort <- sort(b)
for (j in 1:45) {beta[j] <- 0}
# Decay in effect from year of introduction
for (j in 46:71) {
  betas[j] <- bsort[j-45]
  # Retain negative coefficients
  beta[j] <- betas[j]*step(-betas[j])
  # Probability that SB effect still relevant
  decay.prob[j-45] <- step(-betas[j])
}
}
", file="model3.jag")
# Initial values and estimation
# D is the data list (y, E, SB) for Example 5.6, assumed to be defined beforehand
init1 <- list(gam=c(-3,0,0),phi=0.8)
init2 <- list(gam=c(-3,0,0),phi=0.7)
inits <- list(init1,init2)
pars <- c("beta","gam","LL","Dv","phi","RR","ch","decay.prob")
R <- autojags(D, inits, pars, model.file="model3.jag", n.chains=2,
              iter.increment=5000, n.burnin=500, Rhat.limit=1.1,
              max.iter=50000, seed=1234)
R$summary
samps <- as.matrix(R$samples)
# Select log-likelihood samples in samps (columns 75:145)
LL <- samps[,grep("^LL\\[",colnames(samps))]
LOO <- loo(LL)
waic(LL)
# Relative risks after controlling for intervention (columns 148:218)
RR <- samps[,grep("^RR\\[",colnames(samps))]
year <- seq(1931,2001,1)
plot(apply(RR,2,mean),x=year,xlab="Year",ylab="Relative Risks")
# plots and listing, pointwise LOO
loocase <- as.vector(LOO$pointwise[,"looic"])
plot(loocase,x=year,xlab="Year",ylab="Pointwise LOO-IC")
list.loocase <- data.frame(year,loocase)
list.loocase <- list.loocase[order(-list.loocase$loocase),]
head(list.loocase,10)
206 Bayesian Hierarchical Models

References
Abraham B, Ledolter J (1983) Statistical Methods for Forecasting. Wiley, New York.
Albert J, Chib S (1993) Bayes inference via Gibbs sampling of autoregressive time series subject to
Markov mean and variance shifts. Journal of Business & Economic Statistics, 11(1), 1–15.
Angers J, Biswas A, Maiti R (2017) Bayesian forecasting for time series of categorical data. Journal of
Forecasting, 36(3), 217–229.
Araveeporn A (2017) Comparing random coefficient autoregressive model with and without auto-
correlated errors by Bayesian analysis. Statistical Journal of the IAOS, 33(2), 537–545.
Asai M, McAleer M, Yu J (2006) Multivariate stochastic volatility: A review. Econometric Reviews, 25,
145–175.
Auger-Méthé M, Field C, Albertsen C M, Derocher A, Lewis M, Jonsen I, Flemming J (2016) State-
space models’ dirty little secrets: Even simple linear Gaussian models can have estimation
problems. Scientific Reports, 6, 26677.
Balke N (1993) Detecting level shifts in time series. The Journal of Business and Economic Statistics, 11,
81–92.
Barnett G, Kohn R, Sheather S (1996) Bayesian estimation of an autoregressive model using Markov
chain Monte Carlo. Journal of Econometrics, 74, 237–254.
Bauwens L, Lubrano M (1998) Bayesian inference on GARCH models using the Gibbs sampler.
Econometrics Journal, 1, C23–C46.
Bauwens L, Lubrano M, Richard J (2000) Bayesian Inference in Dynamic Econometric Models. OUP.
Beck N (2004) Time series, in Encyclopedia of Social Science Research Methods, eds M Lewis-Beck, A
Bryman, T Futing Liao. Sage.
Benjamin M, Rigby R, Stasinopoulos D (2003) Generalized autoregressive moving average models.
Journal of the American Statistical Association, 98, 214–223.
Berkes I, Horvath L, Ling S (2009) Estimation in nonstationary random coefficient autoregressive
models. Journal of Time Series Analysis, 30, 395–416.
Berliner L (1996) Hierarchical Bayesian time series models, pp 15–22, in Maximum Entropy and
Bayesian Methods, eds K Hanson, R Silver. Kluwer Academic Publishers.
Berzuini C, Clayton D (1994) Bayesian analysis of survival on multiple time scales. Statistics in
Medicine, 13, 823–838.
Betancourt M, Girolami M (2015) Hamiltonian Monte Carlo for hierarchical models, in Current Trends
in Bayesian Methodology with Applications, eds S Upadhyay, U Singh, D Dey, A Loganathan. CRC.
Bijma F, De Munck J, Huizenga H, Heethaar R, Nehorai A (2005) Simultaneous estimation and test-
ing of sources in multiple MEG data sets. IEEE Transactions on Signal Processing, 53, 3449–3460.
Bockenholt U (1999) An INAR(1) negative multinomial regression model for longitudinal count data.
Psychometrika, 64, 53–68.
Broto C, Ruiz E (2004) Estimation methods for stochastic volatility models: A survey. Journal of
Economic Surveys, 18, 613–649.
Cargnoni C, Muller P, West M (1996) Bayesian forecasting of multinomial time series through con-
ditionally Gaussian dynamic models. Journal of the American Statistical Association, 92, 587–606.
Carlin BP, Klugman SA (1993) Hierarchical Bayesian Whittaker graduation. Scandinavian Actuarial
Journal, 1993(2), 183–196.
Carlin B, Polson D, Stoffer D (1992) A Monte Carlo approach to nonnormal and nonlinear state space
modelling. Journal of the American Statistical Association, 87, 493–500.
Carter C, Kohn R (1994) On Gibbs sampling for state space models. Biometrika, 81, 541–553.
Chan K, Ledolter J (1995) Monte Carlo EM estimation for time series models involving counts. Journal
of the American Statistical Association, 90, 242–252.
Chatuverdi A, Kumar J (2005) Bayesian unit root test for model with maintained trend. Statistics &
Probability Letters, 74, 109–115.
Chen C, Liu L (1993) Joint estimation of model parameters and outlier effects in time series. Journal of
the American Statistical Association, 88, 284–297.
Time Structured Priors 207

Chen C, So M (2006) On a threshold heteroscedastic model. International Journal of Forecasting, 22,


73–89.
Chib S, Greenberg E (1994) Bayes inference in regression models with ARMA(p, q) errors. Journal of
Econometrics, 64(1–2), 183–206.
Chib S, Nardari F, Shephard N (2002) Markov Chain Monte Carlo methods for stochastic volatility
models. Journal of Econometrics, 108, 281–316.
Clayton D (1996) Generalized linear mixed models, pp 275–301, in Markov Chain Monte Carlo in
Practice, eds W Gilks, S Richardson, D Spiegelhalter. Chapman and Hall, London, UK.
Cox D (1970) The Analysis of Binary Data. Methuen, London, UK.
Daniels M (1999) A prior for the variance in hierarchical models. Canadian Journal of Statistics, 27,
569–580.
Davis R A, Holan S H, Lund R, Ravishanker N (eds) (2016) Handbook of Discrete-Valued Time Series.
CRC Press.
De Jong P, Penzer J (1998) Diagnosing shocks in time series. Journal of the American Statistical
Association, 93, 796–806.
De Jong P, Shephard N (1995) The simulation smoother for time series models. Biometrika, 82, 339–350.
Diggle P, Zeger S (1989) A non-Gaussian model for time series with pulses. Journal of the American
Statistical Association, 84, 354–359.
Durbin J (2000) The Foreman lecture: The state space approach to time series analysis and its poten-
tial for official statistics. Australian & New Zealand Journal of Statistics, 42, 1–24.
Durbin J, Koopman S (2001) Time Series Analysis by State Space Methods. Oxford University Press,
Oxford, UK.
Ehlers R, Brooks S (2004) Bayesian analysis of order uncertainty in ARIMA models. Technical Report,
Federal University of Parana.
Engle RF (1982) Autoregressive conditional heteroscedasticity with estimates of the variance of
United Kingdom inflation. Econometrica: Journal of the Econometric Society, 50(4), 987–1007.
Fahrmeir L, Lang S (2001) Bayesian inference for generalized additive mixed models based on
Markov random field priors. Applied Statistics, 50, 201–220.
Fahrmeir L, Tutz G (2001) Multivariate Statistical Modelling Based on Generalized Linear Models, 2nd
Edition. Springer Series in Statistics. Springer Verlag, New York/Berlin/Heidelberg.
Fernandez C, Steel M (1998) On Bayesian modeling of fat tails and skewness. Journal of the American
Statistical Association, 93, 359–371.
Ferreira M, Gamerman D (2000) Dynamic generalized linear models, pp 57–72, in Generalized Linear
Models: A Bayesian Perspective, eds D Dey, S Ghosh, B Mallick. Marcel Dekker, New York.
Fokianos K, Rahbek A, Tjøstheim D (2009) Poisson autoregression. Journal of the American Statistical
Association, 104(488), 1430–1439.
Fonseca T, Cerqueira V, Migon H, Torres C (2016) Full Bayesian inference for asymmetric Garch mod-
els with Student-T innovations. IPEA Discussion Paper.
Fonseca T, Ferreira M, Migon H (2008) Objective Bayesian analysis for the Student-t regression
model. Biometrika, 95, 325–333.
Fox A (1972) Outliers in time series. Journal of the Royal Statistical Society, Series B, 34, 350–363.
Fruhwirth-Schnatter S (1994) Data augmentation and dynamic linear models. Journal of Time Series
Analysis, 15, 183–202.
Gabriel K (1962) Ante-dependence analysis of an ordered set of variables. Annals of Mathematical
Statistics, 33, 201–212.
Gamerman D (1998) Markov chain Monte Carlo for dynamic generalized linear models. Biometrika,
85, 215–227.
Gelman A, Carlin J, Stern H, Dunson D, Vehtari A, Rubin D (2014) Bayesian Data Analysis. CRC, Boca
Raton, FL.
Gerlach R, Carter C, Kohn R (1999) Diagnostics for time series analysis. Journal of Time Series Analysis,
20, 309–330.
Geweke J (1993) Bayesian treatment of the Students-t linear model. Journal of Applied Economics, 8,
S19–S40.
208 Bayesian Hierarchical Models

Geweke J, Terui N (1993) Bayesian threshold auto-regressive models for nonlinear time series. Journal
of Time Series Analysis, 14, 441–454.
Ghosh K, Tiwari R (2007) Prediction of U.S. cancer mortality counts using semiparametric Bayesian
techniques. Journal of the American Statistical Association, 102, 7–15.
Giordani P, Pitt M, Kohn R (2011) Bayesian inference for time series state space models, in The Oxford
Handbook of Bayesian Econometrics, eds J Geweke, G Koop, H Van Dijk. OUP.
Glosten L, Jagannathan R, Runkle D (1994) On the relation between the expected value and the vari-
ance of the nominal excess return on stocks. Journal of Finance, 48(5), 1779–1801.
Godsill S, Doucet A, West M (2004) Monte Carlo smoothing for nonlinear time series. Journal of the
American Statistical Association, 99, 156–168.
Gómez E, Gómez-Villegas M, Marn J (2002) Continuous elliptical and exponential power linear
dynamic models. Journal of Multivariate Analysis, 83, 22–36.
Granger C, Machina M (2006) Structural attribution of observed volatility clustering. Journal of
Econometrics, 135, 15–29.
Grunwald G, Hyndman R, Tedesco L, Tweedie R (2000) Non-Gaussian conditional linear AR(1) mod-
els. Australian & New Zealand Journal of Statistics, 42, 479–495.
Grunwald S (2005) Environmental Soil-Landscape Modeling: Geographic Information Technologies and
Pedometrics. CRC Press.
Hamilton J (2007) Regime-switching models, in Palgrave Dictionary of Economics, 2nd Edition, eds S
Durlauf, L Blume. Palgrave MacMillan, London.
Harvey A (1989) Structural Time Series Models and the Kalman Filter. Cambridge University Press.
Harvey A, Ruiz E, Shepherd N (1994) Multivariate stochastic variance models. Review of Economic
Studies, 61, 247–264.
Harvey A, Todd P (1983) Forecasting economic time series with structural and Box-Jenkins models:
A case study. Journal of Business & Economic Statistics, 1, 299–307.
Harvey A, Trimbur T, Van Dijk H (2006) Trends and cycles in economic time series: A Bayesian
approach. Journal of Econometrics, 140(2), 618–649.
Heinen A (2003) Modelling time series count data: An autoregressive conditional Poisson model.
SSRN Electronic Journal. DOI:10.2139/ssrn.1117187
Helske J (2017) tsPI: Improved Prediction Intervals for ARIMA Processes and Structural Time Series.
https://cran.r-project.org/web/packages/tsPI/index.html
Huerta G, West M (1999) Priors and component structurres in autoregressive time series. Journal of the
Royal Statistical Society, Series B, 61, 881–899.
ISC Shark Working Group (2013) Stock assessment and future projections of blue shark in the North
Pacific ocean. WCPFC-SC9-2013/SA-WP-11. WCPFC-SC. https://www.wcpfc.int/node/19204
Jacquier E, Polson N, Rossi P (2004) Bayesian analysis of stochastic volatility models with fat-tails
and correlated errors. Journal of Econometrics, 122, 185–212.
Jacquier E, Polson NG, Rossi PE (2002) Bayesian analysis of stochastic volatility models. Journal of
Business & Economic Statistics, 20(1), 69–87.
Jaffrézic F, Thompson R, Hill G (2003) Structured antedependence models for genetic analysis of
repeated measures on multiple quantitative traits. Genetics Research, 82, 55–65.
Jaffrézic F, Venot E, Laloë D, Vinet A, Renand G (2004) Use of structured antedependence models for
the genetic analysis of growth curves. Journal of Animal Science, 82, 3465–3473.
Jowaheer V, Sutradhar B (2002) Analysing longitudinal count data with overdispersion. Biometrika,
89, 389–399.
Jung R, Kukuk M, Liesenfeld R (2006) Time series of count data: modeling, estimation and diagnos-
tics. Computational Statistics & Data Analysis, 51(4), 2350–2364.
Kashiwagi N, Yanagimoto T (1992) Smoothing serial count data through a state-space model.
Biometrics, 48, 1187–1194.
Kastner G, Hosszejni D (2016) Package ‘stochvol’. Efficient Bayesian Inference for Stochastic Volatility
(SV) Models. https://cran.r-project.org/web/packages/stochvol/stochvol.pdf
Khoo W, Ong S (2014) A new model for time series of counts. AIP Conference Proceedings, 1605(1),
938–942.
Time Structured Priors 209

Killick R, Eckley I (2014) Changepoint: An R package for changepoint analysis. Journal of Statistical
Software, 58(3), 1–19.
Kim S, Shephard N, Chib S (1998) Stochastic volatility: Likelihood inference and comparison with
ARCH models. The Review of Economic Studies, 65, 361–393.
Kitagawa G, Gersch W (1996) Smoothness Priors Analysis of Time Series. Springer, New York.
Knape J (2008) Estimability of density dependence in models of time series data. Ecology, 89,
2994–3000.
Knorr-Held L (1999) Conditional prior proposals in dynamic models. Scandinavian Journal of Statistics,
26, 129–144.
Koopman S (1993) Disturbance smoother for state space models. Biometrika, 80, 117–126.
Koopman S, Shephard N, Doornik J (1999) Statistical algorithms for models in state space form using
SsfPack 2.2. Econometrics Journal, 2, 113–166.
Korenok O (2009) Bayesian methods in non-linear time series, pp 441–455, in Encyclopedia of Complexity
and Systems Science. Springer, New York.
Lee S (1998) Coefficient constancy test in a random coefficient autoregressive model. Journal of
Statistical Planning and Inference, 74, 93–101.
Lee Y, Nelder J (2001) Modelling and analysing correlated non-normal data. Statistical Modelling, 1,
3–16.
Leonte D, Nott D, Dunsmuir W (2003) Smoothing and change point detection for gamma ray count
data. Mathematical Geology, 35, 175–194.
Li W (1994) Time series models based on generalized linear models: Some further results. Biometrics,
50, 506–511.
Liboschik T, Fokianos K, Fried R (2017) tscount: An R package for analysis of count time series fol-
lowing generalized linear models. Journal of Statistical Software, 82(5), 1–50.
Ling S (2004) Estimation and testing stationarity for double-autoregressive models. Journal of the
Royal Statistical Society: Series B, 66, 63–78.
Lubrano M (1995) Testing for unit root in a Bayesian framework. Journal of Econometrics, 69, 81–109.
Marriott J, Ravishanker N, Gelfand A, Pai J (1996) Bayesian analysis of ARMA processes: Complete
sampling based inference under full likelihoods, pp 243–256, in Bayesian Analysis in Statistics
and Econometrics, eds D Barry, K Chaloner, J Geweke. Wiley, New York.
Martin D, Raftery A (1987) Non-Gaussian state-space modeling of nonstationary time series:
Robustness, computation, and non-Euclidean models. Journal of the American Statistical
Association, 82, 1044–1050.
Maunder M, Sibert J, Fonteneau A, Hampton J, Kleiber P, Harley S (2006) Interpreting catch per unit
effort data to assess the status of individual stocks and communities. ICES Journal of Marine
Science, 63(8), 1373–1385.
Maunder MN, Deriso RB, Hanson CH (2015) Use of state-space population dynamics models in
hypothesis testing: Advantages over simple log-linear regressions for modeling survival,
illustrated with application to longfin smelt (Spirinchus thaleichthys). Fisheries Research, 164,
102–111.
McAllister M K (2014) A generalized Bayesian surplus production stock assessment software (BSP2).
Collective Volumes of Scientific Papers ICCAT, 70(4), 1725–1757.
McCulloch R, Tsay R (1993) Bayesian inference and prediction for mean and variance shifts in autore-
gressive time series. Journal of the American Statistical Association, 88, 968–978.
McCulloch R, Tsay R (1994) Bayesian analysis of autoregressive time series via the Gibbs sampler.
Journal of Time Series Analysis, 15, 235–250.
Meyer R, Millar RB (1999) BUGS in Bayesian stock assessments. Canadian Journal of Fisheries and
Aquatic Sciences, 56(6), 1078–1087.
Meyer R, Yu J (2000) BUGS for a Bayesian analysis of stochastic volatility models. Econometrics
Journal, 3, 198–215.
Mira A, Petrone S (1996) Bayesian hierarchical nonparametric inference for change point prob-
lems,  pp 693–703, in Bayesian Statistics 5, eds J Bernardo, J Berger, A Dawid, A Smith. OUP,
Oxford.
210 Bayesian Hierarchical Models

Monnahan C, Thorson J, Branch T (2017) Faster estimation of Bayesian models in ecology using
Hamiltonian Monte Carlo. Methods in Ecology and Evolution, 8(3), 339–348.
Naylor J, Marriott J (1996) A Bayesian analysis of non-stationary autoregressive series, pp 705–712, in
Bayesian Statistics 5, eds J Bernardo, J Berger, A Dawid, A Smith. Clarendon Press.
Nunez-Anton V, Zimmerman D (2000) Modeling non-stationary longitudinal data. Biometrics, 56,
699–705.
Oh M-S, Lim Y (2001) Bayesian analysis of time series Poisson data. Journal of Applied Statistics, 28,
259–271.
Omori Y, Chib S, Shephard N, Nakajima J (2007) Stochastic volatility with leverage: Fast and efficient
likelihood inference. Journal of Econometrics, 140(2), 425–449.
Omori Y, Watanabe T (2015) Stochastic volatility and realized stochastic volatility models, pp 435–
456, Chapter 21, in Current Trends in Bayesian Methodology with Applications, eds S Upadhyay, U
Singh, D Dey, A Loganathan. Chapman and Hall/CRC.
Ord J, Snyder R, Koehler A, Hyndman R, Leeds M (2005) Time series forecasting: The case for the
single source of error state space approach. Working Paper 7/05, Department of Econometrics
and Business Statistics, Monash University.
Paap R, van Dijk H (2003) Bayes estimation of Markov trends in possibly cointegrated series: an appli-
cation to U.S. consumption and income. Journal of Business & Economic Statistics, 21, 547–563.
Parent E, Rivot E (2013) Introduction to Hierarchical Bayesian Modeling for Ecological Data. Chapman
and Hall/CRC.
Patwardhan A, Small M (1992) Bayesian methods for model uncertainty analysis with application to
future sea level rise. Risk Analysis, 12, 513–523.
Pedroza C (2006) A Bayesian forecasting model: Predicting U.S. male mortality. Biostatistics, 7,
530–550.
Penzer J (2006) Diagnosing seasonal shifts in time series using state space models. Statistical
Methodology, 3, 193–210.
Perreault L, Berniera J, Bobéeb B, Parent E (2000) Bayesian change-point analysis in hydrometeoro-
logical time series: Comparison of change-point models and forecasting. Journal of Hydrology,
235, 242–263.
Petris G, Petrone S, Campagnoli P (2009) Dynamic Linear Models with R. Springer, New York.
Piegorsch W, Bailer J (2005) Analyzing Environmental Data. Wiley.
Pourahmadi M (2002) Graphical diagnostics for modeling unstructured covariance matrices.
International Statistical Review, 70, 395–417.
Prado R, Huerta G, West M (2000) Bayesian time-varying autoregressions: Theory, methods and
applications. Journal of the Institute of Mathematics and Statistics of the University of Sao Paolo, 4,
405–422.
Punt A, Hilborn R (1997) Fisheries stock assessment and decision analysis: The Bayesian approach.
Reviews in Fish Biology and Fisheries, 7, 35–63.
Rankin P S, Lemos R T (2015) An alternative surplus production model. Ecological Modelling, 313,
109–126.
Reis E, Salazar E, Gamerman D (2006) Comparison of sampling schemes for dynamic linear models.
International Statistical Review, 74, 203–214.
Rue H, Held L (2005) Gaussian Markov Random Fields: Theory and Applications. Chapman and Hall/
CRC.
Ruiz-Cárdenas R, Krainski E T, Rue H (2012) Direct fitting of dynamic models using integrated nested
laplace approximations—INLA. Computational Statistics & Data Analysis, 56(6), 1808–1828.
Santos T, Franco G, Gamerman D (2010) Comparison of classical and Bayesian approaches for inter-
vention analysis. International Statistical Review, 78(2), 218–239.
Schaefer MB (1954) Some aspects of the dynamics of populations important to the management of
the commercial marine fisheries. Inter-American Tropical Tuna Commission Bulletin, 1(2), 23–56.
Schmidt D, Makalic E (2013) Estimation of stationary autoregressive models with the Bayesian
LASSO. Journal of Time Series Analysis, 34(5), 517–531.
Time Structured Priors 211

Schotman P, Van Dijk H (1991) On Bayesian routes to unit roots. Journal of Applied Econometrics, 6,
387–401.
Scott S (2017) Package ‘bsts’. Bayesian Structural Time Series. https://cran.r-project.org/web/pack-
ages/bsts/bsts.pdf
Silva N, Pereira I, Silva M E (2009) Forecasting in INAR (1) model. REVSTAT, 7(1), 119–134.
Silveira de Andrade B, Andrade M, Ehlers R (2015) Bayesian GARMA models for count data.
Communications in Statistics: Case Studies, Data Analysis and Applications, 1(4), 192–205.
Simpson M, Niemi J, Roy V (2017) Interweaving Markov chain Monte Carlo strategies for efficient
estimation of dynamic linear models. Journal of Computational and Graphical Statistics, 26(1),
152–159.
Soyer R, Aktekin T, Kim B (2015) Bayesian modeling of time series of counts with business applica-
tions, in Handbook of Discrete-Valued Time Series, eds R Davis, S Holan, R Lund, N Ravishanker.
CRC.
Speed T, Kiiveri H (1986) Gaussian distributions over finite graphs. Annals of Statistics, 14, 138–150.
Startz R (2008) Binomial autoregressive moving average models with an application to US reces-
sions. Journal of Business & Economic Statistics, 26(1), 1–8.
Strickland C, Turner I, Denham R, Mengersen K (2008) Efficient Bayesian Estimation of Multivariate
State Space Models. http://eprints.qut.edu.au
Tsay R (1986) Time series model specification in the presence of outliers. Journal of the American
Statistical Association, 81, 132–141.
Utazi C (2017) Bayesian single changepoint estimation in a parameter-driven model. Scandinavian
Journal of Statistics, 44(3), 765–779.
Verdinelli I, Wasserman L (1991) Bayesian analysis of outlier problems using the Gibbs sampler.
Statistics and Computing, 1, 105–117.
Wang D, Ghosh S (2002) Bayesian analysis of random coefficient autoregressive models. Model
Assisted Statistics and Applications, 3(2), 281–295.
Wang J, Chan J, Choy S (2011) Stochastic volatility models with leverage and heavy-tailed distribu-
tions: A Bayesian approach using scale mixtures. Computational Statistics & Data Analysis, 55(1),
852–862.
Wang P (2011) Pricing currency options with support vector regression and stochastic volatility
model with jumps. Expert Systems with Applications, 38(1), 1–7.
West M (1998) Bayesian forecasting, in Encyclopedia of Statistical Sciences, eds S Kotz, C Read, D Banks.
Wiley.
West M (2013) Bayesian dynamic modelling, pp 145–166, in Bayesian Inference and Markov Chain
Monte Carlo: In Honour of Adrian FM Smith, eds M West, P Damien, P Dellaportas, N Polson, D
Stephens. Oxford University Press.
West M, Harrison P (1997) Bayesian Forecasting and Dynamic Models, 2nd Edition. Springer-Verlag,
New York.
West M, Harrison P, Migon H (1985) Dynamic generalised linear models and Bayesian forecasting.
Journal of the American Statistical Association, 80, 73–97.
Wu R, Cui Y (2014) A parameter-driven logit regression model for binary time series. Journal of Time
Series Analysis, 35(5), 462–477.
Yu J, Meyer R (2006) Multivariate stochastic volatility models: Bayesian estimation and model com-
parison. Econometric Reviews, 25, 361–384.
6
Representing Spatial Dependence

6.1 Introduction
In the analysis of spatially configured data, positive covariation is typically expected
between observations (areas, points) that are close to each other, so that residual spatial
dependence may remain under a simple iid residual assumption (Anselin and Bera, 1998).
Spatial heterogeneity in regression relationships is also common (Anselin, 2010). Spatial
regression aims to represent the residual structure appropriately, or represent hetero-
geneity, and may also be used to obtain improved estimates, especially when applying
Bayesian spatial smoothing. Consider disease counts for areas, when small event totals or
small populations lead to unstable point estimates of rates or relative risks. One is then led
to hierarchical regression models for borrowing strength to achieve more stable estimates
(Riggan et al., 1991; Waller, 2002). If there is spatial covariation (e.g. when contiguous areas
have similar disease levels), an appropriate borrowing strength mechanism would incor-
porate local smoothing towards the mean of adjacent areas (Clayton and Kaldor, 1987). By
contrast, assuming exchangeable random effects implies global smoothing, with rates or
risks smoothed towards the overall mean, and does not account for spatial dependence.
Priors for spatial covariance modelling are therefore structured in the sense of explic-
itly recognising the role of adjacency or proximity, and use this structure as the basis for
smoothing or prediction. Often smoothing of rates is an end in itself; for example, spatial
smoothing of area health data to reflect similarity of disease risks in nearby areas is a
more reliable guide for health interventions (e.g. Zhu et al., 2006). However, structured
priors may also be suitable when the goals of analysis include out-of-sample prediction. In
geostatistical applications, a frequent goal is interpolation of a modelled surface to unsam-
pled locations based on proximity to observed locations (Gotway and Wolfinger, 2003;
Webster et al., 1994; Jiruse et al., 2004).
The R environment now offers considerable potential for analysing spatial data, as discussed,
for example, in Bivand et al. (2013), Allard et al. (2017), and Brunsdon and Comber (2015).
On-line R-based resources for spatial data analysis include www.rspatial.org/spatial/ and
https://data.cdrc.ac.uk/tutorial/an-introduction-to-spatial-data-analysis-and-visualisation-in-r.
Bayesian spatial estimation in R is facilitated by packages such as
CARBayes (Lee, 2013), R-INLA (Blangiardo and Cameletti, 2015; Schrödle and Held, 2011),
INLABMA (Gómez-Rubio and Bivand, 2018), geostatsp (Brown, 2015), spBayes (Finley
et al., 2015), geoR (Ribeiro and Diggle, 2018), and spNNGP.
While there may be benefits from borrowing strength methods based on spatial proxim-
ity, using random effects to represent unobserved components may raise potential iden-
tification issues. For example, priors for random effects may specify differences between
adjacent observations without specifying their mean, so that MCMC methods then require
centring of the effects to ensure identification of other parameters. Furthermore, methods
for smoothing or interpolation assuming relatively smooth variation over adjacent units
may need to be adaptive to spatial discontinuities (Knorr-Held and Rasser, 2000).
As an example of a spatially structured prior and its Bayesian implementation, the pairwise
difference or Markov random field (MRF) prior may be specified via conditional densities,
which are naturally suited for Gibbs sampling (Finley et al., 2015). For univariate
effects θ = (θ1, …, θn), the conditional MRF prior takes the form (Besag et al., 1995, p.11; Rue
and Tjelmeland, 2002; Furrer and Sain, 2010)

$$p(\theta_i \mid \theta_{[i]}) \propto \tau \exp\left[-\sum_{j \neq i} w_{ij}\,\Phi\big(\tau[\theta_i - \theta_j]\big)\right],$$

where θ[i] denotes values for cases other than i, wij are weights specifying spatial dependence
between observations i and j, Φ(u) is an increasing function in u, subject to Φ(u) = Φ( −u),
and τ a precision parameter. Under a neighbourhood prior, where wij = 1 when observations
(usually areas) i and j are neighbours and wij = 0 otherwise, an equivalent representation is

$$p(\theta_i \mid \theta_{[i]}) \propto \tau \exp\left[-\sum_{j \in \partial_i} \Phi\big(\tau[\theta_i - \theta_j]\big)\right],$$

where ∂i is the set of areas adjacent to area i . The case wij = 1 if |i − j| = 1 and wij = 0 otherwise
leads to first order random walk priors relevant to modelling time-ordered data. The MRF
prior generalises to variables θij in two-dimensional lattices (e.g. areas i and times j), and a
neighbourhood might then be defined as ∂ ij = [(i + 1, j),(i − 1, j),(i , j + 1),(i , j − 1)] (Lavine,
1999). Taking Φ(u) = u²/2 leads to a Gaussian or L2 norm conditional prior for θi (Waller, 2002)

$$\theta_i \mid \theta_{[i]} \sim N\left(\sum_{j \neq i} \frac{w_{ij}\theta_j}{w_{i+}},\ \frac{1}{\tau w_{i+}}\right), \qquad (6.1)$$

whereas if Φ(u) = |u| then

$$p(\theta_i \mid \theta_{[i]}) \propto \tau \exp\left(-\tau \sum_{j \neq i} w_{ij}\,|\theta_i - \theta_j|\right),$$

known as the L1 norm prior (Richardson et al., 2004). To achieve robust smoothing, the
latter form may be better suited to spatial discontinuities, since its mode is at the median
rather than the mean.
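
As a minimal sketch of how the Gaussian conditional prior (6.1) acts in practice, the following base-R fragment computes the conditional mean and variance of each effect from a binary adjacency matrix. The four-area layout, the values in theta, the precision tau and the helper name mrf_conditional are illustrative assumptions, and wi+ is taken to be the row sum of the weights for area i. The final line locates a conditional mode under the L1 form with binary weights, which is a median of the neighbouring values.

```r
# Hypothetical 4-area map on a line, 1-2-3-4, with binary adjacency weights
W <- matrix(0, 4, 4)
W[cbind(1:3, 2:4)] <- 1
W <- W + t(W)

theta <- c(0.2, -0.1, 0.4, 1.5)   # illustrative current values of the effects
tau   <- 2                        # illustrative precision parameter

# Conditional moments under (6.1): mean = sum_j w_ij * theta_j / w_i+,
# variance = 1 / (tau * w_i+), with w_i+ the i-th row sum of W
mrf_conditional <- function(W, theta, tau) {
  w_plus <- rowSums(W)
  data.frame(mean = as.vector(W %*% theta) / w_plus,
             var  = 1 / (tau * w_plus))
}
cond <- mrf_conditional(W, theta, tau)

# L1 form with binary weights: a conditional mode for area i is a median
# of the neighbouring values rather than their mean
l1_mode <- sapply(1:4, function(i) median(theta[W[i, ] == 1]))
```

Areas with more neighbours have smaller conditional variance, which is the sense in which smoothing is local: each effect is pulled towards its own neighbourhood rather than towards the overall mean.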

6.2 Spatial Smoothing and Prediction for Area Data


Whereas exchangeable hierarchical analysis is appropriate for independently generated area
or point data, such data often cannot be regarded as independent because of the presence of
similarities between neighbouring areas or points (Anselin and Bera, 1998). Modelling area
differences or point patterns with spatially structured effects reflects the empirical regularity
that neighbouring areas or points tend to be similar, and that similarity typically diminishes
as distance increases. Even if known predictors are available, it is likely that other relevant
influences on the underlying process cannot be identified or measured, and this residual
heterogeneity is likely (at least in part) to be spatially structured (Lawson, 2008, p.94). For
example, Gelfand et al. (2005a) consider spatial modelling of residuals in the analysis of spe-
cies distributions, both for areas and points as the units, where unobserved influences might
include habitat and inter-species competition. Bayesian techniques have played a central role
in recent developments for analysing spatial data, whether space is viewed from a discrete or
continuous perspective, e.g. Banerjee et al. (2014) and Waller and Carlin (2010).
In studies with a discrete framework, the data are typically aggregated, with observa-
tions consisting of counts (e.g. of diseased subjects in spatial epidemiology) or of regional
indicators (e.g. average income per head or house prices in spatial econometrics). By con-
trast, in geostatistical models for geochemical readings, species distribution, or disease
events in relation to a pollution source, a continuous spatial framework is more relevant
(Section 6.5), allowing interpolation between observed point readings.
Consider metric responses yi for areas i, or at sites specified by grid references gi = (g1i, g2i).
To allow greater flexibility, one may assume a “convolution” prior that compromises
between structured and unstructured variation; so the model includes both a spatially
structured random effect si and a fully exchangeable effect ui, with

$$y_i = \alpha + u_i + s_i,$$

where ui ∼ N(0, σu²), but the si are spatially correlated. Alternatively, suppose yi are counts,
and that Pi are populations at risk with yi ~ Bin(Pi, πi). Then one may specify

$$\mathrm{logit}(\pi_i) = \alpha + u_i + s_i,$$

where πi are latent probabilities of the event. Alternatively, for rare events in relation to the
risk population, a Poisson assumption is relevant with yi ~ Po(Pi λi), and

$$\log(\lambda_i) = \alpha + u_i + s_i,$$

where λi are latent event rates per unit of Pi. If the offsets to the Poisson mean are expected
health events Ei, such that Σi yi = Σi Ei with yi ~ Po(Ei λi), then the λi are interpretable as
latent relative risks (Wakefield, 2007, p.160).
One way to model the correlation in the elements of the vector $s = (s_1, \ldots, s_n)$ is to directly
specify a joint multivariate prior with covariance matrix that expresses spatial correlation
between areas $i$ and $j$ or sites $g_i$ and $g_j$ (Richardson et al., 1992, p.541; Wakefield, 2007).
Typical assumptions in such models (also considered in Section 6.5) are of stationarity and
isotropy, with the latter meaning the correlation is the same in all directions. For example,
a multivariate normal prior would take

$$(s_1, \ldots, s_n) \sim N_n(0, \Sigma_s),$$

$$\Sigma_s = \sigma_s^2 W = \sigma_s^2 \begin{pmatrix} 1 & w_{12} & \cdots & w_{1n} \\ w_{21} & 1 & \cdots & w_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ w_{n1} & w_{n2} & \cdots & 1 \end{pmatrix},$$

where $w_{ij} = f(d_{ij})$ are correlation functions that decline as the spatial separation $d_{ij}$ between
areas $i$ and $j$ (or sites $g_i$ and $g_j$) increases, and defined to ensure that $W$ is always non-negative
definite (Mardia and Watkins, 1989).
For example, one may specify exponential spatial decay,

$$w_{ij} = \exp(-\delta d_{ij}),$$

where $\delta > 0$, or for area units, allow for both inter-area distance $d_{ij}$ and the length $b_{ij}$ of the
common border between area $i$ and $j$, namely

$$w_{ij} = d_{ij}^{\gamma_1} [b_{ij} + c]^{\gamma_2},$$

where $\gamma_1$ is negative, and $\gamma_2$ is positive. Another choice is the disc model with

$$w_{ij} = \frac{2}{\pi}\left[\cos^{-1}\left(\frac{d_{ij}}{\kappa}\right) - \frac{d_{ij}}{\kappa}\left(1 - \frac{d_{ij}^2}{\kappa^2}\right)^{0.5}\right], \quad d_{ij} \le \kappa,$$

with wij = 0 for dij > κ, so that κ controls the decline in correlation with distance. Such choices
are to some degree arbitrary, and inferences may be sensitive to the choice of spatial
weights (e.g. Bhattacharjee and Jensen-Butler, 2006).
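
The exponential and disc correlation functions can be compared directly in base R. The following is a minimal sketch under stated assumptions: the site coordinates, the decay value `delta = 0.8` and the range `kappa = 2.5` are hypothetical, and the helper names `exp_corr` and `disc_corr` are introduced here for illustration only.

```r
# Exponential decay: w_ij = exp(-delta * d_ij), delta > 0
exp_corr <- function(d, delta) exp(-delta * d)

# Disc model: correlation falls to zero at distance kappa and beyond
disc_corr <- function(d, kappa) {
  r <- pmin(d / kappa, 1)                 # ratios above 1 are capped, giving w = 0
  (2 / pi) * (acos(r) - r * sqrt(1 - r^2))
}

# Hypothetical coordinates for five sites
coords <- cbind(x = c(0, 1, 2, 0, 3), y = c(0, 0, 1, 2, 3))
d <- as.matrix(dist(coords))              # Euclidean distances d_ij

W_exp  <- exp_corr(d, delta = 0.8)
W_disc <- disc_corr(d, kappa = 2.5)
```

Both matrices have unit diagonals, but only the disc model sets correlations exactly to zero for well separated sites, which is one way in which inferences can depend on the chosen weights.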

6.2.1 SAR Schemes
A widely used scheme, especially in spatial econometrics, specifies the joint density via
simultaneous autoregressive or SAR effects (Richardson et al., 1992). By analogy with
ARMA time series models, the autoregression may operate both for (metric) responses
$y = (y_1, \ldots, y_n)'$, and for the error vector $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)'$. Let $W = [w_{ij}]$ be a spatial dependence
matrix as above, but with $w_{ii} = 0$ rather than $w_{ii} = 1$. One possible SAR scheme has the form

$$y_i = \rho_1 \sum_{h \neq i} w_{ih} y_h + X_i \beta + \varepsilon_i,$$

$$\varepsilon_i = \rho_2 \sum_{h \neq i} w_{ih} \varepsilon_h + u_i,$$

where $\rho_1$ and $\rho_2$ are measures of spatial dependence, and the $u = (u_1, \ldots, u_n)'$ are independently
distributed, with diagonal covariance matrix $\Sigma_u$. The covariance matrix for
$\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)'$ is $(I - \rho_2 W)^{-1} \Sigma_u (I - \rho_2 W')^{-1}$. In matrix form

$$y = \rho_1 W y + X\beta + \varepsilon,$$

$$\varepsilon = \rho_2 W \varepsilon + u.$$

The $\rho$ coefficients are constrained to lie between $1/\eta_{\min}$ and $1/\eta_{\max}$, where $\{\eta_1, \ldots, \eta_n\}$ are
the eigenvalues of $W$, in order to ensure that $(I - \rho W)$ is invertible. If the weights matrix is
standardised to have row sums of unity, so that $w_{ij}^* = w_{ij}/\sum_h w_{ih}$, then the maximum eigenvalue
of $W^*$ is 1 and since negative spatial correlation is unlikely, one may specify uniform
or beta priors on $\rho$ coefficients in the interval [0,1]. Wall (2004) points out that SAR priors
(and also CAR priors, as considered below) may generate implausible covariance patterns
when considered in terms of the joint priors.
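
The eigenvalue bounds on $\rho$ and the effect of row standardisation can be checked numerically. The sketch below assumes a hypothetical 3 × 3 lattice with shared-edge adjacency; the object names are illustrative and not part of any package.

```r
# Hypothetical 3 x 3 lattice with rook (shared-edge) adjacency
n_side <- 3
xy <- expand.grid(row = 1:n_side, col = 1:n_side)
W <- (as.matrix(dist(xy)) == 1) * 1      # w_ij = 1 for adjacent cells, w_ii = 0

# Row-standardised weights: w*_ij = w_ij / sum_h w_ih
W_std <- W / rowSums(W)

# Eigenvalues of W* (real here, since W* is similar to a symmetric matrix)
eta <- Re(eigen(W_std, only.values = TRUE)$values)
rho_bounds <- c(lower = 1 / min(eta), upper = 1 / max(eta))

# (I - rho W*) is invertible for rho strictly inside these bounds
rho <- 0.5
A_inv <- solve(diag(nrow(W_std)) - rho * W_std)
```

With row-standardised weights the upper bound equals 1, which is what motivates priors for $\rho$ on [0,1].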
Variants of the above scheme include the spatial errors model (SEM), with ρ1 = 0 (Cressie
and Wikle, 2011),

$$y = X\beta + \varepsilon, \tag{6.2}$$

$$\varepsilon = \rho W \varepsilon + u,$$

and the spatial lag model (SLM) with $\rho_2 = 0$, namely

$$y = \rho W y + X\beta + u, \tag{6.3}$$

where in both models $u \sim N(0, \sigma^2)$ are iid. The spatial errors model may be expressed as

$$y = X\beta + (I - \rho W)^{-1} u,$$

or, equivalently,

$$y \sim \mathrm{MVN}(X\beta, \sigma^2 [(I - \rho W)'(I - \rho W)]^{-1}).$$
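
The equivalence of the two expressions for the spatial errors model can be verified numerically. This is a sketch only, assuming a hypothetical line of six areas and illustrative values for $\rho$, $\sigma$ and $\beta$.

```r
set.seed(1)
n <- 6
# Hypothetical adjacency for areas arranged on a line
W <- matrix(0, n, n)
W[cbind(1:(n - 1), 2:n)] <- 1
W <- W + t(W)
W <- W / rowSums(W)                      # row-standardised, so no longer symmetric

rho <- 0.4; sigma <- 1.5
A <- diag(n) - rho * W                   # A = I - rho W

# Covariance of e = A^{-1} u with u ~ N(0, sigma^2 I)
cov_direct <- sigma^2 * solve(A) %*% t(solve(A))
# The same matrix written as sigma^2 [A'A]^{-1}
cov_formula <- sigma^2 * solve(crossprod(A))

# One simulated draw of the response under the SEM
X <- cbind(1, rnorm(n)); beta <- c(2, 0.5)
u <- rnorm(n, 0, sigma)
y <- drop(X %*% beta + solve(A, u))
```

Working with `crossprod(A) / sigma^2` gives the precision matrix directly, which avoids a matrix inversion at each evaluation of the likelihood.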

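The SEM model may also be considered as a prior for spatially correlated effects. For example,
in (6.2) one may assume spatially varying $\beta_i$ over units $i$, with

$$\beta = \beta_\mu + \varepsilon_\beta,$$

$$\varepsilon_\beta = \rho_\beta W \varepsilon_\beta + u_\beta,$$

where $\beta_\mu$ is the average coefficient. Another option is a spatial moving average errors representation
(Hepple, 2003) with

$$y = X\beta + \varepsilon,$$

$$\varepsilon = \rho W u + u,$$

$$u \sim N(0, \sigma^2).$$

Equivalently, $y \sim \mathrm{MVN}(X\beta, \sigma^2 (I + \rho W)(I + \rho W)')$, which coincides with $\sigma^2 (I + \rho W)'(I + \rho W)$ when $W$ is symmetric.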


Regarding the spatial errors model expressed as

$$y = X\beta + (I - \rho W)^{-1} u,$$

an alternative to assuming uncorrelated $u$ and $X$, and allowing greater generality, specifies

$$u = X\gamma + v,$$

where the $v$ are iid. Then

$$y = X\beta + (I - \rho W)^{-1} X\gamma + (I - \rho W)^{-1} v.$$

Premultiplying by $(I - \rho W)$, so that the model is expressed with iid errors, leads to the spatial
Durbin model or SDM (Seya et al., 2012; Lacombe and LeSage, 2015),

$$y = \rho W y + X(\beta + \gamma) - \rho W X \beta + v,$$

which may be reparameterised as

$$y = \rho W y + X\theta_1 + W X \theta_2 + v. \tag{6.4}$$
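
The step from the spatial errors form to the spatial Durbin form can be checked numerically, with $\theta_1 = \beta + \gamma$ and $\theta_2 = -\rho\beta$. The sketch below assumes a single predictor without an intercept, a hypothetical line of eight areas, and illustrative parameter values.

```r
set.seed(2)
n <- 8
W <- matrix(0, n, n)                     # hypothetical line-graph adjacency
W[cbind(1:(n - 1), 2:n)] <- 1
W <- W + t(W)
W <- W / rowSums(W)

rho <- 0.5; beta <- 1.2; gamma <- -0.4   # illustrative values only
X <- rnorm(n); v <- rnorm(n)
A <- diag(n) - rho * W

# Spatial errors form with u = X gamma + v
y <- drop(X * beta + solve(A, X * gamma + v))

# Spatial Durbin form with theta1 = beta + gamma, theta2 = -rho * beta
theta1 <- beta + gamma
theta2 <- -rho * beta
y_sdm <- drop(rho * W %*% y + X * theta1 + (W %*% X) * theta2 + v)
```

The two vectors `y` and `y_sdm` agree to numerical precision, so (6.4) places no restriction on the data beyond the implied link between $\theta_2$, $\rho$ and $\beta$, which the unrestricted SDM relaxes.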

Example 6.1. SAR Models for Long-Term Limiting Illness Data


This example considers area data on limiting long-term illness (LLTI) for 133 electoral
wards (small areas) among people aged 50–59 in East London from Congdon (2008), and
also considered in Example 6.3. Here we express the illness totals Ti and the population
denominators Pi as long-term illness rates per 1,000, namely y i = 1000 × Ti /Pi . Since area
deprivation is typically a strong influence on area morbidity, an index of multiple depri-
vation (IMD) is used as a single predictor.
As discussed by Bivand et al. (2014, 2015) one may use the Integrated Nested Laplace
Approximation to estimate spatial autoregressive models conditioning on particular
values of the spatial autocorrelation parameter. Conditioned models can be estimated
over a suitable grid, and subsequently combined using Bayesian model averaging to
provide posterior marginals of parameters (Gómez-Rubio et al., 2018). In particular, the
spatial Durbin representation (SDM) of the LLTI rates (and the impact on them of area
deprivation) is estimated using the INLABMA package.
The spatial interaction matrix W is obtained using the spdep and maptools packages,
applied to a relevant shape file (note that dbf and shx files are needed in the same direc-
tory but not explicitly referenced). Thus the R sequence:

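    library(easypackages)
    libraries("INLA", "spdep", "INLABMA", "maptools")
    setwd("C:/R Files BHMRA")
    # Shapefile for East London electoral wards
    ELmap <- readShapePoly("Example_6_1")
    # Neighbour list from shared boundaries (queen = FALSE requires more than one shared point)
    ELnb <- poly2nb(ELmap, queen = FALSE)
    # Spatial weights in listw form, using the default row-standardised style
    lw <- nb2listw(ELnb, glist = NULL, zero.policy = NULL)
    # Sparse adjacency matrix
    W <- as(as_dgRMatrix_listw(nb2listw(ELnb)), "CsparseMatrix")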

A grid for the spatial autocorrelation parameter ρ in the SDM model (6.4) is specified
with limits 0.2 and 0.9, namely grid.rho = seq(0.2, 0.9, length.out=20). This is based on an
estimate of 0.48 from maximum likelihood estimation. The estimates from INLABMA
are shown in Table 6.1, with mean (sd) for ρ of 0.53 (0.09). The DIC is estimated as 1242.
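
The idea of conditioning on $\rho$ over a grid and then combining the conditional fits can be illustrated without INLA. The sketch below is not the INLABMA procedure: it substitutes a concentrated (profile) log-likelihood for the conditional marginal likelihood, assumes a flat prior over the grid, and uses simulated data on a hypothetical line of 40 areas. The function `profile_ll` and all parameter values are introduced here for illustration only.

```r
set.seed(3)
n <- 40
W <- matrix(0, n, n)                     # hypothetical line-graph neighbours
W[cbind(1:(n - 1), 2:n)] <- 1
W <- W + t(W)
W <- W / rowSums(W)                      # row-standardised weights

# Simulate from an SDM with known (illustrative) parameter values
x <- rnorm(n)
rho_true <- 0.5
mu <- 1 + 2 * x - 1 * drop(W %*% x)
y <- drop(solve(diag(n) - rho_true * W, mu + rnorm(n)))

Z <- cbind(1, x, drop(W %*% x))          # design matrix [1, X, WX]
eta <- Re(eigen(W, only.values = TRUE)$values)
grid.rho <- seq(0.2, 0.9, length.out = 20)

# Concentrated log-likelihood of the SDM at a fixed rho (constants dropped)
profile_ll <- function(rho) {
  r <- y - rho * drop(W %*% y)           # y - rho W y
  fit <- lm.fit(Z, r)                    # least squares given rho
  s2 <- sum(fit$residuals^2) / n
  sum(log(1 - rho * eta)) - 0.5 * n * log(s2)
}
ll <- sapply(grid.rho, profile_ll)

# Normalised weights over the grid (flat prior on rho assumed)
wts <- exp(ll - max(ll))
wts <- wts / sum(wts)
rho_mean <- sum(wts * grid.rho)
```

The weights `wts` play the role of the model-averaging weights, and `rho_mean` is the resulting grid-based summary for $\rho$; a fully Bayesian treatment would replace the profile likelihood with the marginal likelihood of each conditional model.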
With rstan, we may estimate the spatial errors (SEM) model, using the multi_normal_prec
option (Brunsdon, 2018). Thus with

$$y \sim \mathrm{MVN}(X\beta, \sigma^2 [(I - \rho W)'(I - \rho W)]^{-1}),$$

the precision is $[(I - \rho W)'(I - \rho W)]/\sigma^2$. The full code, with flat priors on hyperparameters,
is

    model="data {
      int N;
      vector[N] x;
      vector[N] y;
      matrix<lower=0>[N,N] W;            // spatial weights
      matrix<lower=0,upper=1>[N,N] I;    // identity matrix
    }
    parameters {
      real beta;
      real alpha;
      real<lower=0> sigma;
      real<lower=-1,upper=1> rho;
    }
    model {
      // precision is crossprod(I - rho * W) / sigma^2
      y ~ multi_normal_prec(alpha + x * beta, crossprod(I - rho * W)/(sigma*sigma));
    }
    generated quantities {
      real LL;
      LL = multi_normal_prec_lpdf(y | alpha + x * beta, crossprod(I - rho * W)/(sigma*sigma));
    }"

This leads to very similar estimates to those obtained using maximum likelihood,
with posterior mean (sd) for ρ of 0.50 (0.10). The log-likelihood is estimated at −627, and
the DIC (estimated as the mean deviance plus the number of parameters) is obtained
as 1258.

TABLE 6.1
Spatial Autoregressive Models Compared

Model                                 Parameter         Mean     St devn   2.5%    Median   97.5%
Spatial Error Model                   Intercept         109.8    10.4      90.1    109.9    129.9
                                      IMD               6.2      0.3       5.7     6.2      6.8
                                      ρ                 0.50     0.10      0.30    0.51     0.69
                                      DIC               1258.0
Spatial Lag Model                     Intercept         84.6     12.2      60.8    84.1     108.2
                                      IMD               5.42     0.39      4.66    5.43     6.18
                                      ρ                 0.16     0.07      0.03    0.17     0.30
                                      DIC               1271.4
Spatial Moving Average Errors Model   Intercept         109.4    8.8       92.4    109.3    126.8
                                      IMD               6.3      0.3       5.7     6.3      6.7
                                      ρ                 0.47     0.12      0.23    0.46     0.71
                                      DIC               1262.3
Spatial Durbin Model                  Intercept         48.0     6.7       34.7    48.0     61.2
                                      IMD               6.2      0.4       5.4     6.2      7.0
                                      IMD-spatial lag   −3.2     0.5       −4.1    −3.2     −2.3
                                      ρ                 0.53     0.09      0.36    0.53     0.71
                                      DIC               1241.6

A similar approach may be applied to estimate the spatial moving average errors
model, except that the likelihood is now

    y ~ multi_normal(alpha + x * beta, crossprod(I + rho * W) * sigma^2);

The DIC for this model is slightly higher than for the spatial autocorrelated errors
model, with posterior mean (sd) for ρ of 0.47 (0.12).
The spatial lag model may be estimated using the target += representation to accommodate
the likelihood. The log-likelihood is

$$\log(L) = -0.5 N \log(2\pi) - 0.5 N \log(\sigma^2) + \log|I - \rho W| - (y - \rho W y - X\beta)'(y - \rho W y - X\beta)/(2\sigma^2),$$

where

$$|I - \rho W| = \prod_{i=1}^{n} (1 - \rho \lambda_i),$$

with $\lambda = (\lambda_1, \ldots, \lambda_n)$ being the eigenvalues of $W$. So the log determinant term may be
written

$$\log|I - \rho W| = \sum_{i=1}^{n} \log(1 - \rho \lambda_i).$$

For simplicity, the target += calculations include the squared regression error terms and
the log determinant contributions in the same summand terms, albeit with the total of
these summands still being the overall log-likelihood. Discrepancies at case level might
be assessed by standardised residuals. Estimates for the four hyperparameters are very
similar to those from maximum likelihood, with posterior mean (sd) for ρ of 0.16 (0.07),
and 5.42 (0.39) for the regression coefficient on IMD.
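
The spatial lag log-likelihood, with the log determinant obtained from the eigenvalues of $W$, can be written as a short R function. This is a sketch under stated assumptions: the function name `slm_loglik`, the line-graph adjacency and the data are hypothetical, and a symmetric binary $W$ is used so that the eigenvalues are real.

```r
# Log-likelihood of the spatial lag model; eta holds the eigenvalues of W
slm_loglik <- function(rho, beta, sigma2, y, X, W, eta) {
  N <- length(y)
  r <- drop(y - rho * (W %*% y) - X %*% beta)
  -0.5 * N * log(2 * pi) - 0.5 * N * log(sigma2) +
    sum(log(1 - rho * eta)) - sum(r^2) / (2 * sigma2)
}

set.seed(4)
n <- 10
W <- matrix(0, n, n)
W[cbind(1:(n - 1), 2:n)] <- 1
W <- W + t(W)                            # symmetric binary adjacency
eta <- eigen(W, symmetric = TRUE, only.values = TRUE)$values

X <- cbind(1, rnorm(n)); y <- rnorm(n)   # illustrative data only
rho <- 0.3
logdet_eigen  <- sum(log(1 - rho * eta))
logdet_direct <- as.numeric(determinant(diag(n) - rho * W, logarithm = TRUE)$modulus)
ll <- slm_loglik(rho, beta = c(0, 1), sigma2 = 1, y, X, W, eta)
```

Because the eigenvalues are computed once, the log determinant costs only a sum of $n$ logarithms at each value of $\rho$, which is the practical reason for the eigenvalue form.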
To illustrate the SEM as a prior for random spatial effects, we extend the above rstan
code to allow the random coefficients scheme

$$\beta = \beta_\mu + \varepsilon_\beta,$$

$$\varepsilon_\beta = \rho_\beta W \varepsilon_\beta + u_\beta,$$

where $\beta_\mu$ is the average coefficient. This involves an extra input vector, e = rep(1,N), in
the data block:

    vector<upper=1>[N] e;

and beta as a vector

    vector[N] beta;

There are extra parameters beta_mu, sigma_b, rho_b, and a model block as follows:

    model {
      beta ~ multi_normal_prec(e * beta_mu, crossprod(I - rho_b * W)/(sigma_b*sigma_b));
      y ~ multi_normal_prec(alpha + x .* beta, crossprod(I - rho * W)/(sigma*sigma));
    }

[Figure 6.1: histogram; x-axis: Posterior mean β; y-axis: Frequency]

FIGURE 6.1
Histogram of spatially varying predictor effect.

This option shows an increase in the log-likelihood from −627.0 to −589.0, with
Figure 6.1 showing the variation in the impacts of deprivation, and slopes varying from
5.9 to 7.3.

6.3 Conditional Autoregressive Priors


In contrast to simultaneous autoregressive spatial priors, conditional autoregressive priors
for spatial errors s = (s1 , … , sn ) have the advantage of facilitating random effects analysis
under an MCMC sampling approach, especially for large numbers of areas. Such priors
are often applied to discrete outcomes, such as disease counts yi, taken to be Poisson or
binomial in relation to populations Pi or expected events Ei (e.g. Besag et al., 1991; Norton
and Niu, 2009; De Oliveira, 2012). If the event is relatively infrequent (or populations at risk
are small), one often seeks to estimate an underlying smooth pattern of disease risk by
borrowing strength over areas taking account of spatial dependence (MacNab et al., 2006).
Disease counts typically display extra-variation which can be modelled by including
random effects in the regression for disease rates. In particular, spatially correlated ran-
dom effects may account for much extra-variation, serve to borrow strength in estima-
tion, and proxy unobserved risk factors that are also spatially correlated (Richardson and
Monfort, 2000). These might be shared environmental or social capital factors in neigh-
bouring areas. While spatially correlated random effects acting alone may be assumed,
this may constitute an informative prior, since there may be areas (e.g. areas of deprived
social renting surrounded by affluent areas) discrepant from surrounding areas in terms of
risk factors. A more general and less informative approach allows adaptive downweight-
ing of a spatial prior.

6.3.1 Linking Conditional and Joint Specifications


As discussed by Besag and Kooperberg (1995), one may use properties of the multivariate
normal to obtain the conditional autoregressive prior from a joint spatial prior and vice
versa. Consider the joint multivariate normal density for the effects $s' = (s_1, \ldots, s_n)$, with
mean zero and covariance $\Sigma_s$,

$$p(s) = \frac{1}{(2\pi)^{n/2}} |\Sigma_s|^{-0.5} \exp(-0.5\, s' \Sigma_s^{-1} s).$$

Denote $Q = [q_{ij}] = \Sigma_s^{-1}$ as the precision matrix, and $s_{[i]} = (s_1, \ldots, s_{i-1}, s_{i+1}, \ldots, s_n)$. Then the conditional
distributions for each $s_i$ take a univariate normal form, corresponding to the pairwise
interaction function $\Phi(u) = u^2/2$ (Rue and Held, 2005, p.22), namely

$$s_i \mid s_{[i]} \sim N\left(\sum_{j \neq i} \left(-\frac{q_{ij}}{q_{ii}}\right) s_j, \; \frac{1}{q_{ii}}\right),$$

with $\mathrm{corr}(s_i, s_j \mid s_{[i,j]}) = -q_{ij}/\sqrt{q_{ii} q_{jj}}$. Following Besag and Kooperberg (1995, p.734) define
$h_{ii} = 0$, and set

$$h_{ij} = -q_{ij}/q_{ii} \quad (i \neq j).$$

Also set $q_{ii} = a_i/\delta$ with variance parameter $\delta$, so that

$$h_{ij} = -q_{ij}\,\delta/a_i. \tag{6.5}$$

The above conditional density is then in the conditional autoregressive form specified by
Besag (1974),

$$s_i \mid s_{[i]} \sim N\left(\sum_{j \neq i} h_{ij} s_j, \; \delta/a_i\right). \tag{6.6}$$

To obtain the joint density from the conditional one, symmetry of $Q$ means $-q_{ij} = -q_{ji}$, so
that from (6.5), the constraint

$$h_{ij} a_i = h_{ji} a_j$$

applies. Note that expressing $\delta/a_i = \tau_i^2$ or $a_i = \delta/\tau_i^2$, this constraint can also be stated
(Cressie and Kapat, 2008) as

$$h_{ij} \tau_j^2 = h_{ji} \tau_i^2.$$

Letting $R = A(I - H)$, where $A = \mathrm{diag}(a_1, \ldots, a_n)$, one has that $R$ is symmetric with diagonal
elements $a_i$ and off-diagonal elements $-a_i h_{ij}$. So the joint density (Besag and Green, 1993;
Banerjee et al., 2014) implied by the conditional priors is

$$(s_1, \ldots, s_n) \sim N_n(0, \delta R^{-1}),$$

where $Q = \delta^{-1} R$. If $R$ is positive definite as well as symmetric, the joint density of the spatial
effects is proper. Positive definiteness of $R$ holds under diagonal dominance (Rue and
Held, 2005, p.20; Besag and Kooperberg, 1995, p.734), namely, that each diagonal element $r_{ii}$
is at least as large as the sum of absolute off-diagonal elements $\sum_{j \neq i} |r_{ij}|$ in its row, with
strict inequality in at least one row (or column) when the areas form a connected set.
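
The link between the precision matrix and the full conditionals can be confirmed against the usual partitioned-covariance formulas for the multivariate normal. The covariance matrix below is an arbitrary positive definite matrix generated for illustration, not a spatial prior from the text.

```r
set.seed(5)
n <- 4
M <- matrix(rnorm(n * n), n, n)
Sigma <- crossprod(M) + diag(n)          # an arbitrary positive definite covariance
Q <- solve(Sigma)                        # precision matrix

i <- 1
# Conditional mean weights h_ij and conditional variance from the precision matrix
h_i <- -Q[i, -i] / Q[i, i]
v_i <- 1 / Q[i, i]

# The same quantities from the partitioned-covariance formulas
h_alt <- drop(Sigma[i, -i] %*% solve(Sigma[-i, -i]))
v_alt <- Sigma[i, i] - drop(Sigma[i, -i] %*% solve(Sigma[-i, -i]) %*% Sigma[-i, i])
```

The agreement of `h_i` with `h_alt` and of `v_i` with `v_alt` is what allows a CAR prior to be specified through $h_{ij}$ and $a_i$ alone, without ever forming $\Sigma_s$.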

6.3.2 Alternative Conditional Priors


Different schemes for defining the $h_{ij}$ and $a_i$ in (6.5) are possible, including options where
$R$ is not positive definite. Setting

$$h_{ij} = \rho \frac{w_{ij}}{\sum_{k \neq i} w_{ik}}, \tag{6.7}$$

$$a_i = \sum_{k \neq i} w_{ik},$$

where $0 \le \rho \le 1$, and taking $w_{ij} = w_{ji}$, with $w_{ii} = 0$, ensures the symmetry constraint is met,
with $h_{ij} a_i = \rho w_{ij} = h_{ji} a_j$. This is sometimes called the proper CAR, as the covariance matrix
of the corresponding multivariate density is invertible. The most commonly applied
approach is to set $w_{ij} = 1$ for adjacent areas and $w_{ij} = 0$ otherwise, and let $a_i = d_i = \sum_{k \neq i} w_{ik}$,
where $d_i$ is then the number of areas adjacent to area $i$. For example, when a region is partitioned
into grid cells, then each grid cell has eight (first order) neighbours (Gelfand et al.,
2005a). However, distance or common boundary length based forms for $w_{ij}$ can be used.
In this case, $R = A(I - H)$ has diagonal elements $d_i$ and off-diagonal elements $-\rho w_{ij}$. This
provides the intrinsic conditional autoregression or ICAR($\rho$) prior, with

$$s_i \mid s_{[i]} \sim N\left(\rho A_i, \; \frac{\delta}{d_i}\right),$$

where $A_i$ is the average of the $s_j$ in locality $L_i$ of area $i$, i.e.

$$A_i = \frac{\sum_{j \in L_i} s_j}{d_i}.$$

Note that $R = A(I - H) = D - \rho W$ is positive definite, and the joint prior on $(s_1, \ldots, s_n)$ is
proper, only when |ρ| < 1. Lower values of ρ imply lesser degrees of spatial dependence
between the si, though the limiting case when ρ = 0 has the disadvantage that the variance
is not constant, but depends on the number of neighbours di.
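
The propriety condition on $\rho$ can be seen from the smallest eigenvalue of $D - \rho W$. The sketch below assumes a hypothetical 3 × 3 lattice with shared-edge adjacency and illustrative values for the effects; `min_eig` is a helper defined here for convenience.

```r
# Hypothetical 3 x 3 lattice, rook adjacency
xy <- expand.grid(row = 1:3, col = 1:3)
W <- (as.matrix(dist(xy)) == 1) * 1
D <- diag(rowSums(W))                    # d_i = number of neighbours

min_eig <- function(rho) {
  min(eigen(D - rho * W, symmetric = TRUE, only.values = TRUE)$values)
}

eig_proper   <- min_eig(0.9)   # positive: proper joint prior
eig_singular <- min_eig(1)     # zero up to rounding: R is singular

# Conditional mean and variance for area 5 (the centre cell) given the others
s <- c(0.2, -0.1, 0.4, 0.0, NA, 0.3, -0.2, 0.1, 0.5)  # illustrative effects
rho <- 0.9; delta <- 1
nb <- which(W[5, ] == 1)
cond_mean <- rho * mean(s[nb])
cond_var  <- delta / D[5, 5]
```

The conditional variance depends on $d_i$ whatever the value of $\rho$, which is the feature noted above for the limiting case $\rho = 0$.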
Alternatively, in a CAR(ρ) spatial prior, as distinct from the ICAR(ρ) prior, one may set

$$h_{ij} = \rho w_{ij}, \quad a_i = 1,$$

so that

$$s_i \mid s_{[i]} \sim N\left(\rho \sum_{j \neq i} w_{ij} s_j, \; \delta\right),$$

with a homogenous conditional variance (Cressie and Kapat, 2008, p.729). In this case,
$R = I - \rho W$ is positive definite, and so invertible (and the joint density is proper), when the
correlation parameter is between $1/\eta_{\min}$ and $1/\eta_{\max}$ where $\eta_1, \ldots, \eta_n$ are the eigenvalues of
$W$ (Bell and Broemeling, 2000).
A compromise scheme for the variance deflators $a_i$ (see MacNab et al., 2006, and Leroux
et al., 1999) sets

$$a_i = (1 - \lambda) + \lambda \sum_{j \neq i} w_{ij},$$

with $0 \le \lambda \le 1$ subject to a prior such as $\lambda \sim U(0, 1)$. This representation has identifiability
advantages in involving a single set of random effects (Lee, 2011) and can be estimated
in R using CARBayes or the INLABMA packages, as well as R2OpenBUGS. It can also be
estimated in rstan using multi_normal_prec applied to the joint distribution [1].
The symmetry condition $h_{ij} a_i = h_{ji} a_j$ is ensured by setting

$$h_{ij} = \frac{\lambda w_{ij}}{1 - \lambda + \lambda \sum_{j \neq i} w_{ij}},$$

since $h_{ij} a_i = \lambda w_{ij} = \lambda w_{ji} = h_{ji} a_j$. So the joint density for $(s_1, \ldots, s_n)$ has covariance $\delta R^{-1}$ where

$$R = \lambda F + (1 - \lambda) I,$$

$$f_{ii} = \sum_{j \neq i} w_{ij},$$

$$f_{ij} = -w_{ij}, \quad i \neq j.$$

The case $\lambda = 0$ corresponds to a lack of spatial interdependence, with $R$ then reducing to an
identity matrix, and borrowing strength confined to “global smoothing.” By contrast, $\lambda = 1$
leads to the ICAR(1) model (see 6.3.3). So

$$s_i \mid s_{[i]} \sim N\left(\frac{\lambda}{1 - \lambda + \lambda \sum_{j \neq i} w_{ij}} \sum_{j \neq i} w_{ij} s_j, \; \frac{\delta}{1 - \lambda + \lambda \sum_{j \neq i} w_{ij}}\right),$$

and when the $w_{ij}$ are defined by contiguity, one obtains

$$s_i \mid s_{[i]} \sim N\left(\frac{\lambda}{1 - \lambda + \lambda d_i} \sum_{j \in L_i} s_j, \; \frac{\delta}{1 - \lambda + \lambda d_i}\right).$$
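
The behaviour of $R = \lambda F + (1 - \lambda) I$ at the two extremes of $\lambda$ can be checked numerically. The following sketch assumes a hypothetical 3 × 3 lattice with binary contiguity weights; `leroux_R` and `min_eig` are helper names introduced here.

```r
xy <- expand.grid(row = 1:3, col = 1:3)  # hypothetical 3 x 3 lattice
W <- (as.matrix(dist(xy)) == 1) * 1      # binary contiguity weights
Fmat <- diag(rowSums(W)) - W             # f_ii = sum_j w_ij, f_ij = -w_ij

leroux_R <- function(lambda) lambda * Fmat + (1 - lambda) * diag(nrow(W))
min_eig <- function(R) min(eigen(R, symmetric = TRUE, only.values = TRUE)$values)

eig0    <- min_eig(leroux_R(0))     # R is the identity, no spatial dependence
eig_mid <- min_eig(leroux_R(0.5))   # strictly positive: proper joint prior
eig1    <- min_eig(leroux_R(1))     # zero up to rounding: the ICAR(1) case

# Conditional mean weight and variance for the centre cell (d_i = 4)
lambda <- 0.5; delta <- 1; d_i <- sum(W[5, ])
cond <- c(weight = lambda / (1 - lambda + lambda * d_i),
          variance = delta / (1 - lambda + lambda * d_i))
```

For any $\lambda < 1$ the smallest eigenvalue is at least $1 - \lambda$, so the joint prior is proper without further constraints on the effects.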

The scheme of Leroux et al. (1999) can be generalised to allow greater spatial adaptivity
with varying $\lambda$ (Congdon, 2008). The symmetry condition $h_{ij} a_i = h_{ji} a_j$ is maintained by setting
$a_i = (1 - \lambda_i) + \lambda_i \sum_{j \neq i} w_{ij}$, and taking

$$h_{ij} = \frac{\lambda_i \lambda_j w_{ij}}{1 - \lambda_i + \lambda_i \sum_{j \neq i} w_{ij}},$$

since this ensures the constraint

$$h_{ij} a_i = h_{ji} a_j = \lambda_i \lambda_j w_{ij}.$$

A possible borrowing strength prior for these parameters is

$$\mathrm{logit}(\lambda_i) \sim N(\lambda_\mu, 1/\tau_\lambda),$$

where the average $\lambda_\mu$ and precision $\tau_\lambda$ are extra unknowns. Setting $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$, the
covariance in the joint prior is then

$$\delta [\Lambda F^* + (I - \Lambda)]^{-1},$$

where

$$f_{ii}^* = \sum_{j \neq i} w_{ij},$$

$$f_{ij}^* = -w_{ij} \lambda_j, \quad i \neq j.$$

Pettitt et al. (2002) propose a scheme with

$$h_{ij} = \frac{\phi w_{ij}}{1 + |\phi| \sum_{j \neq i} w_{ij}},$$

and

$$a_i = 1 + |\phi| \sum_{j \neq i} w_{ij},$$

where $\phi$ measures the strength of spatial dependency, and the case $\phi = 0$ corresponds to
an absence of spatial interdependence, such that $R = I$ (see also Gschlößl and Czado, 2006).
Gibbs updating for $\phi$ can be applied. So

$$s_i \mid s_{[i]} \sim N\left(\frac{\phi}{1 + |\phi| \sum_{j \neq i} w_{ij}} \sum_{j \neq i} w_{ij} s_j, \; \frac{\delta}{1 + |\phi| \sum_{j \neq i} w_{ij}}\right).$$

Under both the MacNab et al. (2006) and Pettitt et al. (2002) schemes, the joint distribution
of $s$ is proper, ensuring a proper posterior when either is taken as the prior distribution.
Retaining $h_{ij} = \phi w_{ij}/(1 + |\phi| \sum_{j \neq i} w_{ij})$, but setting $a_i = (1 + |\phi| \sum_{j \neq i} w_{ij})/(1 + |\phi|)$, means that $\phi \to \infty$
corresponds to the ICAR(1) prior, with the conditional variance $(1 + |\phi|)\delta/(1 + |\phi| \sum_{j \neq i} w_{ij})$
tending to $\delta/\sum_{j \neq i} w_{ij}$.

6.3.3 ICAR(1) and Convolution Priors


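The ICAR($\rho$) prior when $\rho = 1$ is sometimes known as the ICAR(1) model, when one has

$$h_{ij} = \frac{w_{ij}}{\sum_{j \neq i} w_{ij}}, \quad a_i = \sum_{j \neq i} w_{ij},$$

and for counts $y_i \sim \mathrm{Po}(\lambda_i P_i)$, if one assumes

$$\log(\lambda_i) = \alpha + s_i,$$

then borrowing of strength is purely spatial, with

$$s_i \mid s_{[i]} \sim N\left(A_i, \; \frac{\delta}{\sum_{j \neq i} w_{ij}}\right),$$

where $A_i = \sum_{j \neq i} w_{ij} s_j / \sum_{j \neq i} w_{ij}$. The precision matrix of the joint prior is $\delta^{-1} R$, where

$$r_{ii} = \sum_{j \neq i} w_{ij},$$

$$r_{ij} = -w_{ij}, \quad i \neq j.$$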

When the wij are binary indicators of adjacency (wij = 1 for areas i and j contiguous, wij = 0
otherwise), then rii = di and the off-diagonal elements rij are −1 if i and j are neighbours,
but zero otherwise. This case demonstrates most directly that conditional independence
properties relating to spatial effects are stipulated by the matrix R and vice versa (Rue and
Held, 2005, p.4). Despite the relative simplicity of this form and the wide use of the ICAR(1)
conditional prior, R is not invertible under this model, and the joint prior is improper
(Haran et al., 2003).
To see this in another way, for the case where the wij are binary, the joint prior can be
specified in terms of pairwise comparisons between the si (Knorr-Held and Becker, 2000).
Let i ~ j denote that areas i and j are neighbours, then for a normal ICAR(1) model, the joint
prior in terms of differences si − sj is (Hodges et al., 2003)

$$p(s_1, \ldots, s_n) \propto \delta^{-0.5(n-1)} \exp\left(-\frac{1}{2\delta} \sum_{i \sim j} (s_i - s_j)^2\right).$$

Thus the prior only specifies differences between spatial effects and not their overall
level. However, all linear contrasts c′s with c′1 = 0 have proper distributions (Besag and
Kooperberg, 1995, p.740).
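
The pairwise difference form, and the fact that it leaves the overall level undetermined, can be illustrated on a small map. The four-area edge list and the effect values below are hypothetical; the `node1` and `node2` vectors list each neighbouring pair once.

```r
# Hypothetical four-area map: edges listed once as (node1, node2) pairs
node1 <- c(1, 1, 2, 3)
node2 <- c(2, 3, 3, 4)
n <- 4

W <- matrix(0, n, n)
W[cbind(node1, node2)] <- 1
W <- W + t(W)                            # binary adjacency
R <- diag(rowSums(W)) - W                # r_ii = d_i, r_ij = -w_ij

s <- c(0.5, -0.2, 0.1, -0.4)             # illustrative spatial effects
quad_form <- drop(t(s) %*% R %*% s)      # s'Rs
pairwise  <- sum((s[node1] - s[node2])^2)

# Adding a constant to every s_i leaves the kernel unchanged,
# which is why the overall level is not identified
shifted <- sum(((s + 10)[node1] - (s + 10)[node2])^2)

# One remedy: centre the effects to sum to zero
s_centred <- s - mean(s)
```

The quadratic form $s'Rs$ equals the sum of squared neighbour differences, and the rows of $R$ sum to zero, so $R$ has no inverse.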
To tie down the effects and remove their locational invariance, one method involves
centring the sampled values at every iteration to have mean zero. This is one form of lin-
ear constraint, and so the joint distribution becomes integrable and propriety is obtained
(Rodrigues and Assuncao, 2008). Another possibility is a corner constraint, i.e. setting a
particular effect to a known value, such as s1 = 0 (Besag et al., 1995). Finally, one may omit
the intercept so that the $s_i$ model the level of the data. In this case, $y_i \sim \mathrm{Po}(P_i \exp(s_i))$ with
the $s_i$ not constrained, rather than $y_i \sim \mathrm{Po}(P_i \exp(\alpha + s_i))$.
As mentioned above, a spatial effects-only assumption is relatively informative, and the
ICAR(1) spatial prior is often combined with an exchangeable prior to form a convolution
prior (Richardson et al., 2004). It may be argued that an exchangeable iid effect should
only be introduced in combination with an ICAR(1) spatial prior, since conditional priors
including a correlation parameter, such as the ICAR($\rho$), can adjust to varying mixtures of
spatial and unstructured variation by varying the $\rho$ parameter (Wakefield, 2007). Thus, for
a Poisson response, $y_i \sim \mathrm{Po}(\lambda_i P_i)$, the convolution prior of Besag et al. (1991), also called the
Besag-York-Mollié (BYM) prior, specifies

$$\log(\lambda_i) = \alpha + s_i + u_i,$$

with $s_i \mid s_{[i]} \sim N(A_i, \delta_s/d_i)$, and $u_i \sim N(0, \delta_u)$ usually homoscedastic. Note that heteroscedasticity
or heavier tails than under the normal might be represented by taking $u_i \sim N(0, \psi_i)$
where

$$\psi_i = \delta_u/\kappa_i,$$

where the $\kappa_i$ are positive variables with mean 1 (LeSage, 1999). While only the sum $z_i = s_i + u_i$
is identifiable in this model, Norton and Niu (2009) show that the precisions $\delta_s$ and $\delta_u$ are
identifiable from the distribution of $z_i$.

6.4 Priors on Variances in Conditional Spatial Models


As in the exchangeable hierarchical models considered in Chapter 4, the prior on the conditional
spatial variance parameter $\delta_s$, or on the pair $\{\delta_s, \delta_u\}$ in a convolution model, is
important in governing the degree of smoothing towards the neighbourhood or global
mean. Prior specification is important as an aspect in the general identifiability of complex
random effects models for spatial variation, with potential weak identifiability of hyper-
parameters and sensitivity of posterior estimates to the form of prior; see Example 6.4. The
same applies to the spatial smoothing parameters in the proper CAR prior and the Leroux
et al. CAR (MacNab, 2014), and on hyperparameters for spatial priors based on group allo-
cation, such as the Potts prior (Moores et al., 2015).
Regarding variance priors, some applications of conditional autoregressive priors use
vague priors for $\delta_s$, such as $p(\delta_s) \propto 1/\delta_s$ or just proper priors, with $1/\delta_s \sim \mathrm{Ga}(\epsilon, \epsilon)$ with $\epsilon$
small. However, these may lead to effective impropriety in the posterior such that MCMC
convergence is impeded (Besag and Kooperberg, 1995, p.741). The prior $1/\delta_s \sim \mathrm{Ga}(\epsilon, \epsilon)$
with $\epsilon$ small may also put undue weight on low variances. Suppose the prior relates to a
variance for unstructured random effects in a log-linear model for relative risks $\lambda_i$, with
$y_i \sim \mathrm{Po}(E_i \lambda_i)$. Wakefield (2007) mentions that a Ga(0.001,0.001) prior on $1/\delta_u$ in the model
$\log(\lambda_i) = \alpha + u_i$ is equivalent to assuming residual relative risks $e^{u_i} = \lambda_i e^{-\alpha}$ follow a log-t distribution with
0.002 degrees of freedom.
Prior specification is most problematic for the convolution model, since the data identify
the total variation in log relative risks (under a Poisson model), but not the pair of variances
$\{\delta_s, \delta_u\}$ (MacNab, 2014). Following Bernardinelli et al. (1995), the marginal standard
deviation $\mathrm{sd}(s_i)$ of the spatial effects is approximately equal to a multiple 1.43 (=1/0.7) of
the conditional scale term, $(\delta_s/\bar{d})^{0.5}$, where $\bar{d}$ is the average number of neighbours. Hence a
“fair” prior on $\mathrm{sd}(u_i) = \delta_u^{0.5}$ (Banerjee et al., 2014, section 6.4.3.3) is one that ensures

$$\mathrm{sd}(u_i) \approx \mathrm{sd}(s_i) \approx 1.43 \times (\delta_s/\bar{d})^{0.5}.$$

Riebler et al. (2016) propose a modified BYM scheme retaining the two random effects, but
with a single scale parameter $\delta$ for the composite effects

$$t_i = u_i + s_i = \delta \left[\sqrt{1 - \rho}\, \theta_i + \sqrt{\rho}\, \phi_i^*\right].$$

Here $\theta_i \sim N(0,1)$ are iid effects, the $\phi_i^*$ are scaled versions of spatial effects $\phi_i$ following
an ICAR(1) prior, and $\rho \in [0, 1]$ governs the proportion of residual variance due to spatial
dependence. To ensure $\delta$ is legitimate as the standard deviation of the composite
effect, one requires $\mathrm{var}(\phi_i^*) \approx \mathrm{var}(\theta_i) \approx 1$. To achieve this, Riebler et al. (2016) propose a scaling
whereby the geometric mean of the variances of the $\phi_i^*$ is 1. To obtain a scaling factor $F$, with
$\phi_i^* = \phi_i/F$, one may apply the R-INLA function inla.scale.model to the ICAR precision matrix formed from the adjacency matrix.
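
The scaling step can be sketched in base R. This is not the inla.scale.model implementation: it assumes a connected map (here a hypothetical 3 × 3 lattice), takes the marginal variances from the generalised inverse of the ICAR(1) precision matrix (which corresponds to a sum-to-zero constraint), and sets $F$ to the square root of their geometric mean. All numerical values are illustrative.

```r
# Hypothetical connected map: 3 x 3 lattice with rook adjacency
xy <- expand.grid(row = 1:3, col = 1:3)
W <- (as.matrix(dist(xy)) == 1) * 1
n <- nrow(W)
Q <- diag(rowSums(W)) - W                # ICAR(1) precision with delta = 1

# Generalised inverse of Q: drop the single zero eigenvalue
eg <- eigen(Q, symmetric = TRUE)
keep <- eg$values > 1e-8
V <- eg$vectors[, keep]
Q_ginv <- V %*% diag(1 / eg$values[keep]) %*% t(V)

# Geometric mean of the marginal variances, and the scaling factor F
gm_var <- exp(mean(log(diag(Q_ginv))))
F_scale <- sqrt(gm_var)

# Composite BYM2 effect for illustrative draws
set.seed(6)
rho <- 0.7; delta <- 0.5
theta <- rnorm(n)                        # unstructured N(0, 1) effects
phi <- drop(V %*% (rnorm(sum(keep)) / sqrt(eg$values[keep])))  # ICAR draw, sums to zero
phi_star <- phi / F_scale
t_comp <- delta * (sqrt(1 - rho) * theta + sqrt(rho) * phi_star)
```

After scaling, the geometric mean of the marginal variances of the $\phi_i^*$ is 1, so $\rho$ can be read as a variance proportion regardless of the neighbourhood structure.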

Example 6.2 Blood Lead in Children, Virginia Counties


The data here, considered by Schabenberger and Gotway (2004), relate to elevated blood
lead readings $y_i$ among $n_i$ children (under 72 months) tested in the N = 133 counties of
Virginia (including Independent Cities) in 2000. Numbers sampled $n_i$ vary considerably
(from 1 to 3808). Spatial proximity is binary, with $w_{ij} = 1$ for intercounty distances under
50 km and $w_{ij} = 0$ otherwise.
Assuming binomial sampling with $y_i \sim \mathrm{Bin}(n_i, \pi_i)$, one option considered is the convolution,
or BYM, model of Besag et al. (1991), namely

$$\mathrm{logit}(\pi_i) = \alpha + s_i + u_i,$$

with conditional variance $\delta_s$ for ICAR(1) spatial effect $s_i$, and variance $\delta_u$ for the unstructured
effects. Using rstan for estimation, positive $N^+(0, 25)$ priors are assumed on the
standard deviations $\sigma_s = \delta_s^{0.5}$ and $\sigma_u = \delta_u^{0.5}$. In rstan, the ICAR(1) spatial prior is implemented
using the pairwise difference form of the joint multivariate density (e.g. Gerber
and Furrer, 2015; Morris, 2018), and in particular the target += formulation,

    target += 0.5*(N-1)*log(tau_s) - 0.5*tau_s*dot_self(s[node1] - s[node2]);

where tau_s is the precision of the spatial effects.


Convergence is non-problematic, despite the default strategy for the priors on the
standard deviations. Posterior means (medians) for σs and σu are obtained as 1.57 (1.62)
and 0.37 (0.38) respectively. Twenty (from 133) of the si parameters are judged signifi-
cant in terms of posterior probabilities over 0.95, or under 0.05, that si > 0. By contrast,
41 composite terms ti = ui + si are significant. The LOO-IC (leave-one-out information
criterion) is 608, with the highest individual LOO-IC being for Winchester (county 3),
which has an unusually high proportion of elevated readings, but a small population.
More relevant in establishing significantly elevated readings may be the second highest
LOO-IC value, namely for county 63, with a much larger population than county 3. The
proportion of variance due to spatial effects can be obtained as var(s)/(var(s)+var(u)) and
is estimated at 0.77.

A second analysis applies the Riebler et al. (2016), or BYM2, prior, with a single set of
effects, and with the proportion of spatial variance now a parameter. This model pro-
vides an unchanged LOO-IC of 608. The proportion of spatial variance ρ is estimated at
0.82, though with a wide 95% interval from 0.32 to 1. Forty-two of the composite effects
ti are now significant.
An area spatial model may also be assessed by whether residual spatial dependence is
removed, and this can be established using the moran.mc function in R. The moran.mc
function uses a Monte Carlo permutation test for Moran’s I statistic. Significant residual
correlation shows in extreme tail p-values, either values close to zero (positive residual
correlation), or p-values near 1 (negative residual correlation).
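
The logic of a permutation test for Moran's I can be sketched in base R. The functions `moran_I` and `moran_perm` below are illustrative stand-ins written for this example, not the spdep implementation, and the line of 30 areas and the simulated values are hypothetical.

```r
# Moran's I for values z under spatial weights W
moran_I <- function(z, W) {
  zc <- z - mean(z)
  (length(z) / sum(W)) * drop(t(zc) %*% W %*% zc) / sum(zc^2)
}

# Permutation p-value for positive autocorrelation
moran_perm <- function(z, W, nsim = 999) {
  obs <- moran_I(z, W)
  sims <- replicate(nsim, moran_I(sample(z), W))
  list(statistic = obs, p.value = (sum(sims >= obs) + 1) / (nsim + 1))
}

set.seed(7)
n <- 30
W <- matrix(0, n, n)                     # hypothetical areas arranged on a line
W[cbind(1:(n - 1), 2:n)] <- 1
W <- W + t(W)

trend <- moran_perm(seq_len(n) + rnorm(n, 0, 0.5), W)   # strongly structured values
noise <- moran_perm(rnorm(n), W)                         # unstructured by design
```

Applied to model residuals, a p-value in neither tail (as for `noise` in most simulated data sets) suggests that the spatial terms have absorbed the spatial dependence.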
Here 100,000 permutations are taken, with the calculations using a binary adjacency
spatial interaction matrix for the 133 areas, converted to listw format. We find a non-
significant p-value of around 0.25 for the first model, and 0.27 for the second.
These models are also estimated by R-INLA, with the default log-gamma priors on
random effect precisions. The total random effects ti = ui + si under the BYM model are
very similar to those from the rstan application, with a correlation of 0.99 between the
two sets of posterior means. However, possibly reflecting sensitivity to priors on scale
parameters, spatial effects are smaller under R-INLA, and unstructured effects larger.
The BYM2 model estimated using R-INLA produces a lower DIC than the BYM model.
The proportion ρ of total residual variation due to spatial effects is estimated with mean
(95% CRI) of 0.69 (0.30,0.95), as against 0.82 (0.32,1.00) under rstan. The spatial effects
under the two estimations are highly correlated.

6.5 Spatial Discontinuity and Robust Smoothing


Spatial pooling assuming a smoothly varying outcome over contiguous areas may not be
appropriate when there are clear discontinuities in the spatial pattern of events (Adin et al.,
2018). For instance, a low mortality area surrounded by high mortality areas will have a
distorted smoothed rate when heterogeneity is assumed to be entirely spatially structured.
More generally one may seek robustness against mis-specification of the distribution of latent
event rates or risks; for example, virtually all applications of spatial conditional autoregres-
sion models assume normality by default. Finally, one may seek some degree of spatial adap-
tiveness. For example, under conditional autoregressive models, the conditional variance δ
is constant across the region, whereas one might expect spatial correlation to be stronger in
some sub-regions. In the convolution model, the variances δs and δu are global parameters, so
that the relative amount of spatially structured and unstructured heterogeneity is constant
across the study region (Knorr-Held and Becker, 2000; Congdon, 2007).
Robustness against spatial discrepancies or non-normality may be important when
event totals are small, since then the prior structure of the latent risks has a greater effect;
this is the case with the much analysed Scottish lip cancer data, where certain areas have
elevated SMRs, but small counts y and expected cases E . A high relative risk apparent
from a crude or moment estimate not based on a large y or E may be shrunk considerably
under a spatial random effects approach, particularly if surrounded by lower morbidity
areas, so that important excess risks may not be flagged up (Conlon and Louis, 1999).
One strategy is to adopt heavier tailed alternatives to the CAR normal, such as the dou-
ble exponential (Laplace) or L1-norm version of the ICAR(1) prior, which Besag (1989, p.399)
mentions as preferable when the si have discontinuities. For a connected graph (i.e. with no
isolated areas in the region) this prior is

$$p(s_1, \ldots, s_n) \propto \frac{1}{\delta^{n-1}} \exp\left(-0.5\, \frac{1}{\delta} \sum_{j \neq i} |s_i - s_j|\right),$$

and has its posterior mode at the median rather than mean of the neighbouring $s_j$. One
might also apply Student t versions of the ICAR($\rho$) which, if applied using scale mixtures,
give a natural measure of outlier status. Thus, for a Student t with $\nu$ degrees of freedom,

$$s_i \mid s_{[i]} \sim N\left(\rho A_i, \; \frac{\delta}{\gamma_i d_i}\right),$$

where $\gamma_i \sim \mathrm{Ga}(\nu/2, \nu/2)$, and low values of $\gamma_i$ correspond to spatial outliers.
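
The scale mixture representation can be checked by simulation: mixing a normal over gamma weights reproduces Student t quantiles. The values of $\nu$, $\delta$ and $d_i$ below are illustrative only, and the draws concern the deviation of $s_i$ from $\rho A_i$ for a single area.

```r
set.seed(8)
nu <- 4; delta <- 1; d_i <- 4            # illustrative values only
n_draw <- 20000

gamma_i <- rgamma(n_draw, shape = nu / 2, rate = nu / 2)   # mixing weights, mean 1
# Deviation of s_i from rho * A_i, drawn conditionally on gamma_i
dev <- rnorm(n_draw, 0, sqrt(delta / (gamma_i * d_i)))

# Marginally, dev / sqrt(delta / d_i) follows a Student t with nu degrees of freedom
z <- dev / sqrt(delta / d_i)
qq <- cbind(simulated = quantile(z, c(0.05, 0.5, 0.95)),
            student_t = qt(c(0.05, 0.5, 0.95), df = nu))

# Draws with small gamma_i are those with inflated conditional variance
low_weight <- gamma_i < 0.2
```

In a fitted model the posterior means of the $\gamma_i$ serve the same purpose as `low_weight` here, flagging areas whose effects depart from their neighbourhood average.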


Forms of discrete mixture have been proposed. Green and Richardson (2002) distin-
guish between clustering models and allocation models, while Knorr-Held and Rasser
(2000) propose a scheme whereby at each MCMC iteration, areas are allocated to clusters
of mutually contiguous areas, with identical risks within each cluster. Lawson and Clark
(2002) propose a mixture of the ICAR(1) and Laplace priors for the case yi ~ Po(Eiλi), with
continuous (beta) weights ri rather than binary mixture weights, namely

$$\log(\lambda_i) = \alpha + r_i s_{1i} + (1 - r_i) s_{2i},$$

where ri ∼ Be(c, c), with c known, s1i is an ICAR error, but s2i follows a spatial Laplace prior.
Following Congdon (2007), analogous mixture forms can be applied to the errors in the
convolution model itself, giving more emphasis to the unstructured term ui in outlier areas:

$$\log(\lambda_i) = \alpha + r_i s_i + (1 - r_i) u_i.$$

This type of representation may be useful for modelling edge effects, with the u effects
taking a greater role on the peripheral areas where neighbours are fewer. Another pos-
sibility is a discrete mixture in a “spatial switching” model (Congdon, 2007), allowing an
unstructured term only for areas where the pure spatial effects model is inappropriate.
Thus, for a count response,

$$y_i \sim \text{Po}(E_i \lambda_{J_i, i})$$

$$J_i \sim \text{Categorical}(\pi_1, \pi_2)$$

$$(\pi_1, \pi_2) \sim \text{Dirichlet}(\xi_1, \xi_2)$$

$$\log(\lambda_{1i}) = \alpha + s_i$$

$$\log(\lambda_{2i}) = \alpha + s_i + u_i$$

where the ξj are extra unknowns, and the si ~ ICAR(1). The posterior estimates for the ξj
provide overall weights of evidence in favour of a pure spatial model as compared to a convolution
model, while high posterior probabilities Pr(Ji = 2|y) for particular areas indicate
that pure spatial smoothing is inappropriate for them.
Fernandez and Green (2002) use a discrete mixture model generated via mixing over
several spatial priors. Thus, for count data, assume K possible components with area-spe-
cific probabilities πik on each component

$$y_i \sim \sum_{k=1}^{K} \pi_{ik}\, \text{Po}(E_i \lambda_{ik})$$

where log(λik) = αk for a model without predictors. Then K sets of underlying spatial effects
{sik} are generated from separate conditional spatial priors, and used to estimate area-specific
mixture weights

$$\pi_{ik} = \frac{\exp(\chi s_{ik})}{\sum_{h=1}^{K} \exp(\chi s_{ih})}$$

where χ > 0. As χ tends to 0, the πik tend to 1/K without spatial patterning, whereas large χ
reduce over-shrinkage.
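
The role of χ can be seen in a small numerical sketch. The base R code below is illustrative only: the spatial effects are arbitrary numbers (not model output), and `mixture_weights` is a hypothetical helper name.

```r
# Area-specific mixture weights pi_ik = exp(chi * s_ik) / sum_h exp(chi * s_ih).
# The spatial effects s below are arbitrary illustrative numbers, not model output.
mixture_weights <- function(s, chi) {
  z <- chi * s
  z <- z - apply(z, 1, max)          # subtract row maxima for numerical stability
  w <- exp(z)
  w / rowSums(w)
}

s <- matrix(c( 0.8, -0.2, -0.6,
              -0.5,  0.1,  0.4), nrow = 2, byrow = TRUE)  # 2 areas, K = 3

w_small <- mixture_weights(s, chi = 1e-6)  # close to 1/K in every area
w_large <- mixture_weights(s, chi = 10)    # concentrates on the largest s_ik
print(round(w_small, 3))
print(round(w_large, 3))
```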
Another discrete mixture model for robust spatial dependence modelling uses the Potts
prior (Green and Richardson, 2002). Thus let Ji ∈ {1, …, K} be unknown allocation indicators
with yi ∼ Po(Ei μJi), where {μ1, …, μK} are distinct cluster means. Also let dik = 1 if Ji = k. Then
the joint prior for the allocation indicators incorporates spatial dependence with

$$\Pr(J_i = k) = \frac{\exp\left(\omega \sum_{j \sim i} I(d_{ik} = d_{jk})\right)}{\sum_{h=1}^{K} \exp\left(\omega \sum_{j \sim i} I(d_{ih} = d_{jh})\right)}$$

where ω > 0 multiplies the number of same-label neighbour pairs, so that lower values of
ω indicate lesser spatial dependence. So pooling towards the local neighbourhood average
will tend not to occur if an area's latent risk is discrepant with those of its neighbours.
Richardson et al. (2004) compare this model with the convolution model under various
simulated scenarios for differentiated spatial risks. Additional effects can be included by
multiplying the μJi. For example, a spatially unstructured multiplicative effect could be
modelled as νi ∼ Ga(bν, bν), or a log-normal prior assumed with νi = exp(ui), and ui ∼ N(0, δu).
Then yi ∼ Po(Ei μJi νi).
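
As a rough illustration of how ω governs the allocation probabilities, the following base R sketch evaluates them for a single area given the labels of its neighbours. The neighbour labels, K and the two ω values are assumptions for illustration, and `potts_probs` is a hypothetical helper, not part of any package.

```r
# Potts-type allocation probabilities for one area, given its neighbours' labels:
# Pr(J_i = k) proportional to exp(omega * number of neighbours with label k).
# Neighbour labels, K and omega values are illustrative assumptions.
potts_probs <- function(neigh_labels, K, omega) {
  same_label <- tabulate(neigh_labels, nbins = K)  # neighbours sharing each label
  w <- exp(omega * same_label)
  w / sum(w)
}

neigh_labels <- c(2, 2, 2, 3, 1)   # labels of the five neighbours of area i
p_weak <- potts_probs(neigh_labels, K = 4, omega = 0.01)
p_strong <- potts_probs(neigh_labels, K = 4, omega = 2)
print(round(rbind(p_weak, p_strong), 3))
```

With ω near zero the probabilities are close to uniform over the K labels, whereas a larger ω pulls the area strongly towards the majority label among its neighbours.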
Assumptions such as normality in the spatial effects can be avoided by adapting
the Dirichlet process stick-breaking prior of Sethuraman (1994) to spatial settings. The
stick-breaking prior specifies an unknown distribution G by a mixture

$$G = \sum_{m=1}^{M} p_m \delta(\rho_m)$$

where M may in principle be infinite, but in practical computing is taken as finite, the mixing
probabilities satisfy $\sum_{m=1}^{M} p_m = 1$, and δ(ρm) has a point mass at ρm which may be scalar or
vector values for areas (e.g. relative risks) or at grid locations. For example, the ρm may be
drawn from a baseline borrowing-strength prior G0 such as a stationary Gaussian process
in the case of continuous point-referenced spatial data y(gi) at sites gi. One may incorpo-
rate spatial information into either the ρm, as in Gelfand et al. (2005b), or into the mixture
probabilities pm, as in Griffin and Steel (2006). Such formulations are typically for point-
referenced data, and allow for nonstationarity and non-Gaussian features in the response
when the stationary Gaussian process is not appropriate (Duan et al., 2007).
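
A truncated version of the stick-breaking construction can be sketched in a few lines of base R. The sketch uses the standard non-spatial construction with Vm ∼ Beta(1, a); the truncation level M, the concentration parameter a and the N(0, 1) baseline G0 are all illustrative assumptions rather than choices made in the text.

```r
# Truncated stick-breaking weights: V_m ~ Beta(1, a), p_1 = V_1,
# p_m = V_m * prod_{l<m} (1 - V_l), with V_M = 1 so the weights sum to one.
# M, the concentration a and the N(0, 1) baseline G0 are illustrative assumptions.
set.seed(2)
M <- 20
a <- 1.5
V <- c(rbeta(M - 1, 1, a), 1)
p <- V * cumprod(c(1, 1 - V[-M]))
rho <- rnorm(M)                      # atoms rho_m drawn from the baseline G0

# Draws from G: pick atom m with probability p_m
draws <- sample(rho, size = 1000, replace = TRUE, prob = p)
print(round(head(p), 3))
```

Spatial versions replace either the atoms or the weights by quantities that depend on location, as in the references cited above.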

Example 6.3 Long-Term Illness, NE London


This example compares the original Leroux et al. (1999) model with the adaptive Leroux
scheme of Congdon (2008). The application is to 133 small areas in NE London, electoral
wards, defined for political and administrative purposes. As well as census counts of
limiting long-term illness (LLTI) among people aged 50–59, and corresponding bino-
mial denominators, a deprivation index is used, not to model varying LLTI propensities,
but to measure discrepancies between areas and their surrounding localities on this
potential risk factor.
The spatially adaptive approach retains the principle of spatial borrowing of strength,
but modifies it to better represent discontinuities in the outcome and/or observed risk
factors. The Leroux global index of spatial dependence λ is allowed to vary between areas,
with one possible prior for λi linking varying spatial dependence to spatial dissimilarity
(or similarity) in risk factors. For example, illness is commonly linked to socio-economic
deprivation, and spatial correlation in illness may be weaker when socio-economically
distinct areas are adjacent, with localised dissimilarity in risk factors.
Possible priors for the λi include beta priors, or probit-normal or logit-normal priors,
such as logit(λi) ∼ N(μλ, 1/τλ), where the average and precision {μλ, τλ} are extra
unknowns. However, if predictors Di measuring dissimilarity in observed risk factors
are available, and so relevant to whether there should be some attenuation of the prin-
ciple of local borrowing of strength, one can use the scheme

$$\text{logit}(\lambda_i) \sim N(\gamma_1 + \gamma_2 D_i, 1/\tau_\lambda),$$

where γ = (γ1, γ2) are regression parameters. One would expect lower λi for areas dissimilar
from their neighbours on the risk factor; that is, γ2 is anticipated to be negative.
Here the discrepancy measure is based on the index zi of socioeconomic deprivation,
whereby dissimilarity may be represented as

$$D_i = |z_i - Z_i|$$

with Zi being the average deprivation level in the locality Li around area i, namely
$Z_i = \sum_{j \in L_i} z_j / d_i$.
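
The locality average and dissimilarity measure are simple to compute from an adjacency matrix. The base R sketch below uses a made-up four-area map and made-up deprivation scores. It takes the discrepancy in absolute value, which is an assumption here, though one consistent with a negative γ2 implying weaker spatial dependence for discrepant areas. The γ values are also purely illustrative.

```r
# Dissimilarity D_i = |z_i - Z_i|, where Z_i averages z over the neighbours of area i.
# The 4-area adjacency matrix and deprivation scores are made-up illustrative values.
W <- matrix(c(0, 1, 1, 0,
              1, 0, 1, 1,
              1, 1, 0, 1,
              0, 1, 1, 0), nrow = 4, byrow = TRUE)  # W[i, j] = 1 if j is in L_i
z <- c(40, 45, 42, 15)                               # deprivation index z_i
d <- rowSums(W)                                      # number of neighbours d_i
Z <- as.vector(W %*% z) / d                          # locality averages Z_i
D <- abs(z - Z)

# Prior median of lambda_i implied by logit(lambda_i) ~ N(gamma1 + gamma2 * D_i, .)
gamma1 <- 2
gamma2 <- -0.05
prior_median_lambda <- plogis(gamma1 + gamma2 * D)
print(round(cbind(z, Z, D, prior_median_lambda), 3))
```

Area 4, which is much less deprived than its neighbours, receives the largest D and hence the lowest prior median for its local dependence parameter.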
Estimation of the original Leroux et al. (1999) model using R2OpenBUGS provides a
posterior mean for the global λ of 0.86, with a LOO-IC of 1278 and WAIC (widely appli-
cable information criterion) of 1185. Estimation using CARBayes provides a slightly
higher estimate of λ, namely 0.93, but a higher WAIC of 1190.
Improved fit is provided by the adaptive Leroux model, with the LOO-IC and
WAIC respectively at 1267 and 1178. The coefficient γ2 has mean (95% CRI) of −0.54
(−0.83,−0.30). In contrast to the estimated global λ of 0.88, there are eight local λi under
0.5, with the minimum being for area 133 (the City of London) with posterior mean
λ133 = 0.002. This area has an illness rate (illness total divided by population, as percent-
age) of 14.5%, as compared to the rate in its locality (surrounding adjacent wards) of
38.4%. Its deprivation index is 16.4, compared to the locality average of 43.9. Figure 6.2
maps out the local λi.

Example 6.4 Robust Priors for London Suicides


This analysis compares the Potts prior, the convolution prior and spatial median regres-
sion for modelling the distribution of suicides in 983 middle level super output areas
(MSOAs) in London over 2011–15. Expected suicides Ei are based on England wide rates,
with a subsequent scaling to ensure Σi yi = Σi Ei. As for Examples 6.1 and 6.3, the adjacency
matrix is obtained by inputting a shapefile.

FIGURE 6.2
Local Leroux dependence parameters.

The first analysis uses the nimble package in R, and estimates the BYM model. This
provides a LOO-IC of 3905, with maximum and minimum posterior mean relative risks
of 1.40 and 0.82 . The maximum casewise LOO-IC are for areas (such as 263, 573, and 512)
which have high yi counts in relation to expected suicides. 10% of the total LOO-IC is
due to the 5% worst fitting cases. Incidentally, the estimated proportion of variation due
to spatial dependence is relatively low, namely 0.23 (95% CRI from 0.07 to 0.44).
This feature is reproduced in an estimation of the model incorporating a proper CAR
spatial effect. This is implemented via a sparse precision matrix method in rstan, and
draws on Joseph (2016). The resulting estimate for ρ in (6.7) is 0.28 (with 95% CRI from
0.19 to 0.69). The LOO-IC is 3902, with maximum and minimum posterior mean relative
risks of 1.41 and 0.76. Mixed predictive exceedance checks are included, based on repli-
cate samples of the random spatial effects, and obtained as

$$p_{i,\text{mix}} = \Pr(y_{i,\text{rep}} > y_i) + 0.5 \Pr(y_{i,\text{rep}} = y_i).$$

These show over-prediction (high pi,mix) in a relatively high proportion of cases, with
high predicted yi deaths in relation to actual deaths.
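
Given a matrix of replicate counts, the exceedance probabilities are simple Monte Carlo averages. The base R sketch below is a simplified illustration: the observed counts and the Poisson means used to generate replicates are placeholders, whereas in the mixed predictive version described above the replicates would be generated from newly sampled spatial effects at each MCMC iteration.

```r
# Mid-p exceedance check: p_i = Pr(y_rep > y_i) + 0.5 * Pr(y_rep = y_i),
# estimated from posterior replicates (rows = MCMC draws, columns = areas).
# Observed counts and the Poisson replicate means are simulated placeholders.
set.seed(3)
y <- c(2, 10, 0)
mu_rep <- c(2, 4, 3)                                  # assumed predictive means
y_rep <- sapply(mu_rep, function(m) rpois(5000, m))   # 5000 x 3 replicate matrix

p_mix <- sapply(seq_along(y), function(i) {
  mean(y_rep[, i] > y[i]) + 0.5 * mean(y_rep[, i] == y[i])
})
print(round(p_mix, 3))
```

Values near 1 flag over-prediction and values near 0 flag under-prediction, while well-fitted cases give values away from both extremes.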
An alternative to the BYM and proper CAR priors is the Potts prior. This is applied
with an exponential E(1) prior on ω, and with an ordering constraint on the latent cluster
means, so μ1 ≤ μ2 ≤ … ≤ μK, where K is set at 10. Since there is evidence of unstructured
heterogeneity, the scheme is modified to include unstructured area effects, namely

$$y_i \sim \text{Po}(E_i \mu_{J_i} \nu_i),$$

$$\nu_i = \exp(u_i),$$

$$u_i \sim N(0, \delta_u).$$

For the ordered μk, relatively informative gamma Ga(ak, 5) priors are assumed, with
a = (1, 2, 3, …, 10), so reflecting the typical range of area relative risks for such health
outcomes. A two-chain run of 10,000 iterations provides a mean scaled deviance
$2\sum_i \{y_i \log(y_i/(E_i \lambda_i)) - (y_i - E_i \lambda_i)\}$ of 1034, close to the number of observed areas. The posterior
mean (95% CI) of ω is 0.30 (0.01, 0.85), with the K = 10 latent cluster means ranging
from μ1 = 0.43 to μK = 1.41. Maximum and minimum relative risks are estimated as 1.32
and 0.65 respectively. The LOO-IC is 3907, with the maximum casewise LOO-IC again
being for areas with high yi counts in relation to expected events.
Finally, the spatial median model is an adaptation of the approach in Congdon (2017),
implementing the asymmetric Laplace prior version of quantile regression at the second
stage of a hierarchical Poisson log-normal representation. Thus for quantiles α = 1, …, A,
define ξα = (1 − 2α)/[α(1 − α)], and define scale factors Wαi ∼ Exp(δα) which inflate the variances
of discrepant observations, and downweight their influence on the likelihood. In
the absence of predictors, one has

$$Y_i \sim \text{Poi}(\mu_{\alpha i}),$$

$$\mu_{\alpha i} = E_i \exp(\nu_{\alpha i}),$$

$$\nu_{\alpha i} \sim N\left(\beta_{0\alpha} + s_{\alpha i} + \xi_\alpha W_{\alpha i}, \frac{2 W_{\alpha i} \delta_\alpha}{\alpha(1 - \alpha)}\right),$$

$$W_{\alpha i} \sim \text{Exp}(\delta_\alpha).$$

Here median regression (α = 0.5) only is considered, with a gamma Ga(1,0.001) prior on
δ0.5. This model has a LOO-IC of 3895, improving on the Potts, BYM, and proper CAR
priors. Poorly fitted areas are similar, whether identified by casewise LOO-IC, or
by the residual type measure $(\nu_i - \beta_0 - s_i)/(8 W_i \delta)^{0.5}$.
Compared to the Potts prior, extreme elevated relative risks are identified under the
spatial median model, the highest posterior mean relative risk ρi = exp(β0 + si) being
1.50 (though the second and third ranking posterior mean ρi are 1.41 and 1.31). The Potts
prior is distinctive in its broader spread of estimated relative risk, including a longer tail
of low estimated relative risk, with 211 of the 983 areas having posterior mean ρi under
0.9. Figure 6.3 compares posterior mean relative risks under the Potts and BYM priors,
and Figure 6.4 compares the Potts and spatial median relative risks.

6.6 Models for Point Processes


A continuous spatial framework is appropriate when point observations are made.
Nevertheless, a continuous framework is often applied to discrete area or lattice data
(Berke, 2004; Kelsall and Wakefield, 2002; Yanli and Wall, 2004). Consider metric observations
y(g) = (y1(g1), …, yn(gn)) at points g = {g1, g2, …, gn} in two-dimensional space G2.
To represent the spatially driven component in the variation of y, define a Gaussian spatial
process, or Gaussian process prior, for (s1, …, sn) = s(g) = (s(g1), …, s(gn)) with covariance
matrix Σ(dij) = σs²C(dij), where the off-diagonal correlations depend on distances
dij = ‖gi − gj‖ between points gi and gj, and C(0) = 1.
FIGURE 6.3
Posterior mean relative risks, Potts vs BYM.
(Axes: Relative Risk, Frequency.)

FIGURE 6.4
Posterior mean relative risks, Potts vs spatial median.
(Axes: Relative Risk, Frequency.)

Such a process is ergodic if the off-diagonal elements in Σ(d) tend to zero as d → ∞ (so
that covariance between values at two points vanishes for large enough distances), and
isotropic if Σ(d) depends only on the distance between gi and gj, and not on other features
such as the direction from gi to gj or the coordinates of the gi. The process is intrinsi-
cally stationary if E[ y( g + d) − y( g )] = 0, namely has a constant mean, and if the variance
depends only on the lag, not on the point locations, namely

$$E[y(g + d) - y(g)]^2 = V[y(g + d) - y(g)] = 2\gamma(d),$$

where γ(d) is the semivariogram (Waller and Gotway, 2004, p.274). The covariance Σ(d)
and the semivariogram are related via γ(d) = Σ(0) − Σ(d) since

$$\begin{aligned}
2\gamma(d) &= V[y(g + d) - y(g)] \\
&= V[y(g + d)] + V[y(g)] - 2\,\text{Cov}[y(g + d), y(g)] \\
&= \Sigma(0) + \Sigma(0) - 2\Sigma(d) = 2[\Sigma(0) - \Sigma(d)]
\end{aligned}$$


so that γ(0) = 0.
A Gaussian process, possibly together with an unstructured random effect u(g) ∼ N(0, σu²),
and regressor effects may be used to define means μi = E(yi) in a normal linear model.
However, this scheme generalises to discrete responses using an appropriate link function
(Eidsvik et al., 2012). The nugget variance σu² defines measurement error or micro scale spatial
effects (spatial variation at lower scales than the smallest observed distance between
sampled points). In Bayesian modelling, it is possible to take account of interplay between
the nugget and the parameters of the spatial correlation function (Gramacy and Lee, 2008).
Regressor effects might include a trend surface T(g) defined by the coordinates of gi (Diggle
and Ribeiro, 2002, p.133), such as a quadratic polynomial with terms (g1i, g2i, g1i², g2i², g1i g2i).
So, for y continuous, one might have

$$y(g) = \beta_0 + T(g)\beta + s(g) + u(g), \qquad (6.8)$$

$$s(g) \sim N_n(0, \sigma_s^2 C(g, \theta)),$$

$$u(g) \sim N_n(0, \sigma_u^2),$$

where θ are parameters defining the spatial correlation function C(g, θ) = [cij(gi, gj; θ)], such
as spatial decay and smoothness parameters.
Splines can also be used to model point pattern data, typically with geographic coor-
dinates as predictors. The trend-surface is represented as a two-dimensional spline in
the geographic coordinates. Trend-surface models do not explicitly represent local spatial
dependence, but rather account for trends in the data across longer geographical distances
(Dormann et al., 2007). However, smooth spatial variation does not characterise all appli-
cations, requiring specialised techniques (Sangalli et al., 2013; Wood et al., 2008). Widely
applied spline trend regression options include cubic splines and thin plate splines (Mitas
and Mitasova, 1999; Bowman and Woods, 2016; Yang et al., 2016). Lang and Brezger (2004)
propose tensor products of equally spaced B-spline basis functions combined with sym-
metric priors on the B-spline coefficients, while Wood (2006) develops low rank smooths
from tensor products of any set of bases with quadratic penalties. Such smooths are invari-
ant to rescaling of the predictors. In the R mgcv package, the jagam function develops JAGS
code with multivariate normal priors on the smooth coefficients (Wood, 2016). The prior
precision matrix incorporates the smoothing parameters and smoothing penalty matrices.

To avoid a smoothing penalty not corresponding to a full rank precision matrix (and hence
an improper prior), null space penalties, as in Marra and Wood (2012), are added to the
usual penalties. The smooths are centred to improve identifiability.

6.6.1 Covariance Functions
Defining dij as a distance measure between points gi and gj, there are several common
isotropic schemes with C(dij), and hence γ(dij), parameterised to reflect anticipated distance
decay in the correlation between points (e.g. Grunwald, 2005). For example, the exponen-
tial distance model has

$$C(d_{ij}) = \exp(-\phi d_{ij}),$$

with range parameter ϕ > 0, and larger values of ϕ leading to more pronounced distance
decay. Note that different parameterisations of the exponential are used in different
packages (e.g. in spBayes and spNNGP as opposed to gstat). The covariance function for
(s1, …, sn) is then

$$\Sigma(d_{ij}) = \sigma_u^2 I(i = j) + \sigma_s^2 \exp(-\phi d_{ij}),$$

while the semivariogram is

$$\gamma(d_{ij}) = \sigma_u^2 + \sigma_s^2 [1 - \exp(-\phi d_{ij})].$$

As dij tends to infinity, the semivariogram tends to an upper limit of σu² + σs², known as the
sill. The powered exponential variant (Diggle and Ribeiro, 2007) has

$$C(d_{ij}) = \exp[-(\phi d_{ij})^{\kappa}],$$

for ϕ > 0 and 0 < κ ≤ 2.


The spherical model (Zhang, 2002) has non-zero covariance only within a certain range
δ, namely

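$$C(d_{ij}) = 1 - \frac{3 d_{ij}}{2\delta} + \frac{d_{ij}^3}{2\delta^3},$$

for dij < δ, whereas C(dij) = 0 for dij ≥ δ. Hence the spherical function has covariance

$$\Sigma(d_{ij}) = \sigma_u^2 I(i = j) + \sigma_s^2 \left[1 - \frac{3 d_{ij}}{2\delta} + \frac{d_{ij}^3}{2\delta^3}\right] I(d_{ij} < \delta),$$

and semivariogram

$$\gamma(d_{ij}) = \sigma_u^2 + \sigma_s^2 \left[\frac{3 d_{ij}}{2\delta} - \frac{d_{ij}^3}{2\delta^3}\right] \quad \text{for } d_{ij} < \delta,$$

$$\gamma(d_{ij}) = \sigma_u^2 + \sigma_s^2 \quad \text{for } d_{ij} \geq \delta.$$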
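
The exponential and spherical forms, and the way both semivariograms approach the sill, can be compared in a short base R sketch. The parameter values and the helper names (`exp_corr`, `sph_corr`, `semivariogram`) are illustrative assumptions, not taken from any package.

```r
# Exponential and spherical correlation functions with the matching
# semivariograms gamma(d) = sigma_u^2 + sigma_s^2 * (1 - C(d)) for d > 0.
# All parameter values are illustrative assumptions.
exp_corr <- function(d, phi) exp(-phi * d)
sph_corr <- function(d, delta) {
  ifelse(d < delta, 1 - 3 * d / (2 * delta) + d^3 / (2 * delta^3), 0)
}
semivariogram <- function(corr, sigma2_s, sigma2_u) sigma2_u + sigma2_s * (1 - corr)

sigma2_s <- 0.5; sigma2_u <- 0.1; phi <- 0.3; delta <- 8
d <- c(0.5, 2, 5, 8, 50)
sv_exp <- semivariogram(exp_corr(d, phi), sigma2_s, sigma2_u)
sv_sph <- semivariogram(sph_corr(d, delta), sigma2_s, sigma2_u)
sill <- sigma2_u + sigma2_s
print(round(rbind(d, sv_exp, sv_sph), 3))
```

The spherical semivariogram reaches the sill exactly at distance δ, whereas the exponential form only approaches it asymptotically.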



Finally, Matern covariances (Diggle et al., 2003) set

$$C(d_{ij}) = \frac{\sigma_s^2}{\Gamma(\nu) 2^{\nu - 1}} (\kappa d_{ij})^{\nu} K_{\nu}(\kappa d_{ij}),$$

where Kν(u) is a modified Bessel function of order ν. The parameter ν controls the smoothness
of the process, while κ is a scaling parameter. Together they define the range r = (8ν)^0.5/κ
at which the correlation is diminished to low levels (close to 0.1). INLA parameterises the
Matern in terms of a parameter α = ν + 1, with α = 2 as the default setting (Lindgren and Rue,
2015). Paciorek and Schervish (2006) use kernel convolution (Section 6.6) to develop nonsta-
tionary covariance functions, including a nonstationary version of the Matérn covariance.
Implementation of this method in R is described in Risser and Calder (2017).
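
The Matern form and its range can be evaluated with the base R function `besselK`. The sketch below sets σs² = 1 so that the function returns correlations. The values of ν and κ are illustrative, and `matern_corr` is a hypothetical helper name.

```r
# Matern correlation (sigma_s^2 = 1) using the base R Bessel function besselK.
# nu and kappa values are illustrative assumptions.
matern_corr <- function(d, nu, kappa) {
  out <- (kappa * d)^nu * besselK(kappa * d, nu) / (gamma(nu) * 2^(nu - 1))
  out[d == 0] <- 1                   # limiting value at zero distance
  out
}

kappa <- 0.7
nu_values <- c(0.5, 1, 2)
corr_at_range <- sapply(nu_values, function(nu) {
  r <- sqrt(8 * nu) / kappa          # range r = (8 nu)^0.5 / kappa
  matern_corr(r, nu, kappa)
})
print(round(cbind(nu = nu_values, corr_at_range), 3))

# With nu = 0.5 the Matern reduces to the exponential form exp(-kappa * d)
d <- c(0.1, 1, 3)
exp_check <- cbind(matern = matern_corr(d, 0.5, kappa), exponential = exp(-kappa * d))
```

For each of these smoothness values the correlation at the range r is a little above 0.1, in line with the description of r as the distance at which dependence has become small.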
Prediction at new locations is a major aspect of geostatistical modelling. Suppose continuous
observations y = (y1, …, yn) = (y1(g1), …, yn(gn)) are made at locations g = (g1, …, gn),
and that predictions y0 = (y01, …, y0k) are required at k new locations g0 = (g01, …, g0k).
These are based on the posterior predictive density

$$p(y_0 \mid y) = \int p(y_0, \xi \mid y)\, d\xi = \int p(y_0 \mid y, \xi)\, p(\xi \mid y)\, d\xi,$$

where ξ is the vector of parameters involved in the model for y, namely those defining
its mean, and the covariance parameters for spatial and unstructured errors (Banerjee
et al., 2014). For example, Diggle et al. (2003) consider a model y(g) = μ + s(g) + u(g) with
u(g) ∼ Nn(0, σu²), and spatial error process

$$s(g) \sim N_n(0, \sigma_s^2 C),$$

where prediction is required at a single new location g0. With d0 denoting an n × 1 vector of
correlations between s(g0) and s(g), as determined by the distances between g0 and g = (g1, …, gn), and with Q = σu²I + σs²C, one has

$$p(y_0 \mid \theta, y) = N\left(\mu + \sigma_s^2 d_0' Q^{-1}(y - \mu 1_n),\; \sigma_s^2 - \sigma_s^2 d_0' Q^{-1} \sigma_s^2 d_0\right).$$

For k > 1, univariate predictions may be obtained separately at each new site g01 , g02 , … , g0 k ,
though multivariate predictions may be more precise.
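
For fixed parameter values, the conditional normal prediction at a single new site takes a few lines of base R. The sketch below is a plug-in illustration under assumed values: the sites and data are simulated, an exponential correlation function is assumed, and the vector `c0` holds the correlations between the new site and the observed sites. A full Bayesian prediction would average this over posterior draws of the parameters.

```r
# Plug-in prediction at one new site for fixed parameter values, following
# the conditional normal formula above with an exponential correlation.
# Sites, data and parameter values are simulated/illustrative assumptions.
set.seed(4)
n <- 30
g <- cbind(runif(n, 0, 10), runif(n, 0, 10))     # observed sites
g0 <- c(5, 5)                                    # new site
mu <- 1; sigma2_s <- 1; sigma2_u <- 0.2; phi <- 0.5

C <- exp(-phi * as.matrix(dist(g)))              # correlations among sites
Q <- sigma2_u * diag(n) + sigma2_s * C           # covariance of y
y <- mu + as.vector(t(chol(Q)) %*% rnorm(n))     # one simulated data set

c0 <- exp(-phi * sqrt(colSums((t(g) - g0)^2)))   # correlations between g0 and g
Qinv_c0 <- solve(Q, c0)
pred_mean <- mu + sigma2_s * sum(Qinv_c0 * (y - mu))
pred_var <- sigma2_s - sigma2_s^2 * sum(c0 * Qinv_c0)
cat("mean:", round(pred_mean, 3), " variance:", round(pred_var, 3), "\n")
```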

6.6.2 Sparse and Low Rank Approaches


For large numbers n of points, the computational burden involved in operations using
dense covariance matrices becomes prohibitive, and alternative strategies have been proposed.
Under the stochastic partial differential equation (SPDE) approach (Lindgren et al., 2011),
included in R-INLA, the continuous spatial domain y(g) = {y1(g1), …, yn(gn)} is approximated
by a discrete Gaussian Markov random field process. In particular, the Gaussian
Markov random field (GMRF) of the stationary Matern family for y(g) is obtained as

$$(\kappa^2 - \Delta)^{\alpha/2} (\tau y(g)) = W(g)$$

where Δ is the Laplacian, W(g) is a white noise process, α and κ are as above, and τ controls
the marginal variance σs².

The GMRF approximation involves a triangulation (with m nodes) of the spatial domain,
and the density of the triangulation mesh determines how close the approximation is.
However, increasing the mesh density also increases the computations involved. A projec-
tor matrix A of dimension n × m, containing 0 or 1 entries, is used to link the original points
to the mesh (Lindgren, 2012; Bakka et al., 2018). Unlike stationary covariance models, it is
straightforward to allow nonstationarity in SPDE models.
Computational burden is also reduced by using a low-rank representation of the spa-
tial field (e.g. Finley et al., 2009; Finley et al., 2015). This involves defining a set of knots
g* = {g1*, g2*, …, gr*}, where r ≪ n is considerably less than the dimension of the actual data.
Then denoting s* = {s(g1*), s(g2*), …, s(gr*)} and distances between the knots as d*, one has

$$s^*(g^*) \sim N_r(0, \sigma_s^2 C^*(d^*, \theta)),$$

with predictions or interpolations s̃(g) at generic locations g obtained as

$$\tilde{s}(g) = c(g; \theta)' [C^*(d^*, \theta)]^{-1} s^*,$$

where c(g; θ) is an r × 1 vector with ith element c(g, gi*; θ).
For a Gaussian outcome, and spatially referenced predictors X(g), a predictive process
model is then defined as

$$y(g) = X(g)'\beta + \tilde{s}(g) + u(g).$$

For a non-Gaussian response, the predictive process is included in the link regression,
such as for y binary with probability π(gi),

$$\text{logit}[\pi(g_i)] = X(g_i)'\beta + \tilde{s}(g_i).$$

Estimation of predictive process models for large n is further facilitated (Eidsvik et al.,
2012) by using the latent approximation approach of Rue et al. (2009).
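
The knot-based interpolation step can be illustrated in base R. The sketch below is a toy version and not the spBayes implementation: locations are one-dimensional, the correlation function is exponential, and the knot values s* are simulated, all as illustrative assumptions.

```r
# Low-rank predictive process: interpolate from r knots to arbitrary locations,
# s_tilde(g) = c(g)' [C*]^{-1} s*. One-dimensional locations, the exponential
# correlation and all parameter values are illustrative assumptions.
set.seed(5)
phi <- 0.4; sigma2_s <- 1
knots <- seq(0, 10, length.out = 6)              # r = 6 knots g*
C_star <- exp(-phi * abs(outer(knots, knots, "-")))
s_star <- as.vector(t(chol(sigma2_s * C_star)) %*% rnorm(length(knots)))

g <- c(knots[2], 3.3, 7.9)                       # first location is itself a knot
c_mat <- exp(-phi * abs(outer(g, knots, "-")))   # rows are c(g; theta)'
s_tilde <- as.vector(c_mat %*% solve(C_star, s_star))
print(round(cbind(g, s_tilde), 3))
```

At a location coinciding with a knot the interpolation returns the knot value itself. Elsewhere it is a weighted combination of the r knot values, so only an r × r matrix has to be inverted.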
Under the nearest neighbour Gaussian process (NNGP) approach (Datta et al., 2016;
Zhang et al., 2018), a sparse precision matrix of the joint density p[s(g)] of the spatial process
s(g) is achieved by using neighbour sets N(gi). Following Vecchia (1988), the sets N(gi) can
be specified as the m nearest neighbours of the point gi. These sets are used to provide an
approximate conditional specification of the joint density of the spatial process p[s(R)] for a
set of k reference locations R (that can be taken as the n observed locations). This approach
is incorporated in the R package spNNGP. The approximation to the joint density is provided
by the conditional density representation

$$\tilde{p}(s[R]) = \prod_{i=1}^{k} p(s(g_i) \mid s(N(g_i))).$$

Different model formulations can be specified according to whether estimated spatial random
effects are of interest, or simply regression and other hyperparameters, with the spatial
effects then integrated out. These are denoted as the sequential and response options
in the R package spNNGP. Thus, under the sequential model, and for hyperparameters
ξ = (β, σs², σu², θ), the posterior density is proportional to

$$p(\xi) \times N(s(g) \mid 0, \tilde{C}) \times N(y(g) \mid X(g)'\beta + s(g), \sigma_u^2),$$



where $\tilde{C}^{-1} = (I - A)^T D^{-1} (I - A)$ is the precision matrix for s(g), A is a sparse lower triangular
matrix with at most m non-zero elements in each row, and D is diagonal. The construction
of these matrices is set out in Finley et al. (2017).
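
The structure of A and D can be illustrated with a small base R sketch for ordered locations on a line. This is a simplified construction under stated assumptions (exponential correlation, unit variance, neighbours taken from preceding locations only) and is not the spNNGP implementation. `nngp_precision` is a hypothetical helper name.

```r
# Vecchia / NNGP-style sparse precision for ordered locations on a line:
# each s(g_i) is conditioned on its m nearest preceding neighbours.
# Locations, phi and m are illustrative; this is not the spNNGP implementation.
nngp_precision <- function(g, phi, m) {
  n <- length(g)
  C <- exp(-phi * abs(outer(g, g, "-")))         # exponential correlation
  A <- matrix(0, n, n)
  D <- numeric(n)
  D[1] <- C[1, 1]
  for (i in 2:n) {
    prev <- seq_len(i - 1)
    nb <- prev[order(abs(g[prev] - g[i]))][seq_len(min(m, i - 1))]
    b <- solve(C[nb, nb, drop = FALSE], C[nb, i])  # kriging weights on neighbours
    A[i, nb] <- b
    D[i] <- C[i, i] - sum(b * C[nb, i])            # conditional variance
  }
  IA <- diag(n) - A
  list(precision = t(IA) %*% diag(1 / D) %*% IA, C = C, A = A)
}

g <- sort(c(0.3, 1.1, 2.0, 2.4, 4.2, 5.0, 6.7, 8.1))
fit <- nngp_precision(g, phi = 0.6, m = 1)
max_abs_diff <- max(abs(fit$precision - solve(fit$C)))
print(max_abs_diff)
```

In this special case (exponential correlation on a line, which is Markov) a single neighbour reproduces the dense precision matrix up to rounding error. In two dimensions the sparse form is only an approximation, and its accuracy depends on m.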

Example 6.5 COPD Prevalence


This example considers spatial covariance modelling for binomial disease prevalence
data yi ∼ Bin(Ni, πi), specifically cases y of chronic obstructive pulmonary disease
(COPD) in 2016–17 in outer NE London. Observed prevalence data yi are available
for 81 GP (general practitioner) practices, with predictions of prevalence required for
k = 11 GP practices, since their prevalence is not provided. GP practice locations (eastings,
northings) are available for all 92 GP practices, based on their postcode. Locations
are randomly jittered to avoid colocation, as some practices are close to each other.
As a predictor of prevalence, a deprivation score xi is available for all 92 GP practices.
Binomial population denominators Ni are age weighted and so adjust for higher COPD
prevalence at older ages.
The first analysis uses rstan, with an exponential decay covariance for spatial effects
ηi, namely

$$C_{ij}(d) = \exp[-\phi d],$$

with

$$\text{logit}(\pi_i) = \beta_0 + \beta_1 x_i + \eta_i,$$

and with the spatial effects covariance multivariate normal prior encompassing all 92
units in the analysis. A Cholesky decomposition is used to represent the multivariate
normal covariance. The prevalence predictions for the 11 practices with missing preva-
lence data are obtained as generated quantities under an inverse logit transform.
A second analysis uses R2OpenBUGS and a powered exponential distance model for
the spatial effects si, namely

$$C_{ij}(d) = \exp[-(\phi d)^{\kappa}],$$

with ϕ > 0, κ ∈ (0, 2], and with univariate predictions (s01, …, s0k) for the 11 new points.
The coding in R2OpenBUGS is hierarchically centred (Thomas et al., 2014). A Ga(1,0.001)
prior is adopted on 1/σs².
The two models provide similar LOO-IC, respectively 691.4 and 691.7. The posterior
mean (95% CRI) for ϕ under the simple exponential decay option are obtained as 0.92
(0.06, 1.25), with the posterior 95% interval for β1 mostly positive, so that the deprivation
score improves on the prediction of missing prevalence rates. The latter range from 1.9%
to 2.3%, with a 0.95 correlation between the estimated missing prevalence rates between
the two models.

Example 6.6 Recorded Earthquakes in Europe and Asia Minor


The data here are recorded earthquake locations across Europe (including Turkey and
the mid-Atlantic ridge) as catalogued by the Seismic Hazard Harmonization in Europe
(SHARE) project (Giardini et al., 2014). There are 29,542 records for earthquakes of mag-
nitude 3.5 and higher during the period 1000–2007.
A first analysis uses the INLA spde representation of the Matern correlation function
to model spatial dependence in the patterning of earthquake magnitudes. The spde
method uses a triangulation of the spatial domain, with the mesh extended outside the
region of interest to reduce boundary effects. The density of the mesh can be varied by

changing max.edge and cutoff in the inla.mesh.2d command. Here we initially select a
relatively coarse grid, setting k = 0.1 and define the mesh using

    
    mesh = inla.mesh.2d(coordinates, max.edge = c(1/k, 2/k), cutoff = 0.1/k)

There are no explanatory variables, so predictions are based only on the estimated
spatial effects at the grid nodes.
With this relatively coarse grid, a correlation of 0.46 is obtained between actual and
predicted magnitudes. Setting k = 1 as opposed to k = 0.1 produces a denser grid with
around 24 times as many nodes (20,603 as against 866), and so is more computation-
ally intensive. However, the correlation between actual and predicted magnitudes is
increased to 0.54.
A second analysis is based on the spBayes package, and uses a 10% sample of the full
data. The data involve repeated observations at the same locations which may cause
numerical problems. Therefore, the actual locations are randomly jittered to avoid
repeat locations. A further 10% subsample of the coordinates (i.e. of 294 coordinates) is
used to provide a set of knots. As an illustration of a particular covariance option, con-
sider an exponential decay function, which using the notation in the package, assumes
a covariance model

$$\sigma^2 \exp(-\delta d_{ij}) + \tau^2,$$

where σ2 is the partial sill. To provide initial values for σ2, τ2 and the decay parameter δ,
the variogram and fit.variogram options in gstat are used. This provides an estimated
range of ϕ = 3.2, and hence an initial value for the decay parameter in spBayes of 0.31 (the
spBayes parameterisation of the exponential uses a decay parameter δ = 1/ϕ). Tuning
values for the Metropolis sampler are chosen to produce an acceptance rate of around
30%. With an MCMC sample of 2,000 iterations and burn-in of 1,000, a correlation of
0.485 is obtained between actual and predicted magnitudes. Posterior means (and sd)
for σ2, τ2 and δ are 0.13 (0.02), 0.31 (0.01) and 0.38 (0.07).
The gstat commands

    v = variogram(Y~1, D)
    fit.variogram(v, vgm(c("Exp", "Mat", "Sph","Gau")))

suggest a spherical model as better fitting, but a lower correlation (of 0.474) between
actual and fitted values is obtained under this option. The GPD criterion of Gelfand and
Ghosh (1998) also prefers the exponential model.
Finally, again using the full dataset, but with jittering to avoid repeat locations, the nearest
neighbour Gaussian Process approach is applied using spNNGP. An exponential covariance
and m = 10 neighbours are assumed. This provides a correlation between actual and
predicted magnitudes of 0.51. Increasing the number of neighbours m from 10 to 15 makes
no difference to the fit. Figure 6.5 shows the predicted magnitude surface. For m = 15, the
estimated posterior means (and sd) for σ2, τ2 and δ are 0.08 (0.05), 0.32 (0.03), and 0.33 (0.13).

6.7 Discrete Convolution Models


FIGURE 6.5
Magnitude predictions from NNGP.
(Axes: Longitude, Latitude.)

Assuming a stationary Gaussian process described through its mean and covariance
structure may result in slow estimation when there are a large number of points and is relatively
inflexible when stationarity and isotropy assumptions are violated. An alternative
representation, based on the Gaussian process, but one that adapts to spatial nonstationarity
and anisotropy, is the process convolution approach (Higdon, 1998; Lee et al., 2005;
Higdon, 2007; Liang and Lee, 2014). This involves convolving a continuous white noise
process w(g) with a symmetric smoothing kernel K(g), with the spatial effect obtained as

$$s(g) = \int_G K(g - u)\, w(u)\, du,$$

where G is the region of interest. The spatial process might be combined with fixed effect
regression impacts and with appropriate regression links for non-normal observations.
For example, if y(g) were binary, such as species presence or absence at site g (Gelfand et al.,
2005a), then y(g) ∼ Bern(π(g)) and

$$\text{logit}[\pi(g)] = \beta_0 + s(g),$$

where β0 defines the average intensity.


In practice, the continuous underlying process can be approximated by a discretised
process (e.g. one defined on a regular lattice over G) provided the discretisation is not
too coarse relative to the smoothing kernel (Calder, 2003; Calder, 2007). So if there are
i = 1, …, n observations at points g1, …, gn and grid locations {tj, j = 1, …, m} with tj = (t1j, t2j),
over the region, one may define the discretised kernel smoother as

$$s(g_i) = \sum_{j=1}^{m} K(g_i - t_j) w_j,$$

where for large m, the wj can be taken as a collection of random effects (Higdon, 2007,
p.245). Lee et al. (2005) consider options for representing the kernel, possibly by a form

with known variance (e.g. a standard normal), and consequent ways for modelling the
wj. Note that if both the K function and w series have unknown variances, then there is
potential non-identifiability. Options for the wj include exchangeable effects or low order
random walks, with unknown precision τw. Assuming K is a normal kernel, by varying τw
one can mimic the effect of the range parameter in a conventional Gaussian process model
with a Gaussian variogram.
For example, Lee et al. (2005) consider n = 12 observations yi in G1 at equally spaced loca-
tions gi between 0 and 10. These are generated according to a Gaussian process s(g) with
mean 0 and covariance matrix

$$C(d_{ij}) = \exp(-d_{ij}^2/25),$$

where dij relates to distances between points gi and gj on the line. A white noise error
ui with standard deviation 0.2 is also used to define yi, so that yi = s( gi ) + ui . They then
fit a discrete convolution model to the yi so generated, using a grid with m = 20 points tj
equally spaced between −2 and 12. They assume the wj follow a 1st order random walk,
and assume the kernel is a normal density with standard deviation 0.6.
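
The set-up just described can be sketched in base R. This is only an illustration of the simulation and of the discrete convolution basis, not the fitting procedure of Lee et al. (2005): the seed, the variable names, the small diagonal jitter and the unit innovation variance of the random walk are all assumptions made here.

```r
# Illustrative sketch (base R only) of the one-dimensional example above.
set.seed(1)

n <- 12
g <- seq(0, 10, length.out = n)          # equally spaced observation locations
d <- as.matrix(dist(g))                  # pairwise distances d_ij
C <- exp(-d^2 / 25)                      # covariance C(d_ij) = exp(-d_ij^2 / 25)

# Draw s ~ N(0, C) using a Cholesky factor; the jitter is an assumption
# added purely for numerical stability.
U <- chol(C + diag(1e-6, n))             # upper triangular, t(U) %*% U = C + jitter
s <- as.vector(t(U) %*% rnorm(n))
y <- s + rnorm(n, sd = 0.2)              # y_i = s(g_i) + u_i

# Discrete convolution basis: m = 20 grid points between -2 and 12,
# normal kernel with standard deviation 0.6.
m <- 20
tgrid <- seq(-2, 12, length.out = m)
K <- outer(g, tgrid, function(a, b) dnorm(a - b, sd = 0.6))   # n x m matrix

# One prior draw of the grid effects from a first-order random walk
# (unit innovation variance assumed), and the implied spatial effect.
w <- cumsum(rnorm(m))
s_conv <- as.vector(K %*% w)             # s(g_i) = sum_j K(g_i - t_j) w_j
```

In a full analysis the wj would be treated as unknowns with a random walk prior and estimated from y, with K held fixed.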
Best et al. (2000) consider a convolution model for health counts yi ∼ Po(Pi λi) observed
for areas rather than points, where Pi are populations and λi are latent rates. In this case,
a rectangular grid is defined over m points in the region, and an additive (rather than log
link) regression is used for modelling the latent rates. So, with a single predictor xi taking
positive values only, one has

$$\lambda_i = \beta_0 + \beta_1 x_i + \beta_2 \sum_{j=1}^{m} K(g_i - t_j)\, w_j,$$

where the wj (and the β parameters) are gamma distributed and the kernel function K has
a known variance. One can decompose the total risk parameter into three sources: one due
to the background rate β0, one reflecting the known predictor, and one the latent spatially
configured risk over the region.
Semiparametric approaches to spatial modelling based on the stick-breaking prior can
also be related to this theme (Reich and Fuentes, 2007). Thus there are kernel functions
for each of m potential clusters, with the kernel centres tj = (t1j, t2j) being unknowns, and
the cluster allocation probabilities for sites or areas i at location gi = (g1i, g2i) incorporating
spatial information. While the cluster effects wj ∼ N(0, 1/τw) are unstructured, the cluster
for area or point i is chosen using indicators

$$J_i \sim \text{Categorical}(p_{i1}, \ldots, p_{im}),$$

with the pij determined both via beta distributed Vj ∼ Be(c, d), and by cluster-specific kernels
Kij constrained to lie in [0,1]. The realised spatial effect for area or point i is then $w_{J_i}$.
Defining Rij = Kij Vj, one has

$$p_{i1} = R_{i1},$$

$$p_{ij} = R_{ij}(1 - R_{i1}) \cdots (1 - R_{i,j-1}), \quad j = 2, \ldots, m - 1,$$

$$p_{im} = (1 - R_{i1}) \cdots (1 - R_{i,m-1}),$$

where (for example)

$$K_{ij} = \exp[-|g_i - t_j| / (2\gamma_j)]$$

defines a normal kernel with bandwidth γj. Bandwidths can be taken equal across kernel
functions or vary across kernel functions according to a positive prior (e.g. inverse gamma).
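
The allocation probabilities above can be computed directly, as in the following base R sketch. The coordinates, the values c = d = 1 for the beta prior, the common bandwidth of 0.5 and τw = 1 are placeholder assumptions; the kernel is coded exactly as written in the expression for Kij.

```r
# Illustrative sketch of kernel stick-breaking allocation probabilities.
set.seed(2)
n <- 5                                    # sites
m <- 4                                    # potential clusters
g   <- cbind(runif(n), runif(n))          # site coordinates g_i = (g_1i, g_2i)
tc  <- cbind(runif(m), runif(m))          # kernel centres t_j = (t_1j, t_2j)
gam <- rep(0.5, m)                        # bandwidths, taken equal across kernels
V   <- rbeta(m, 1, 1)                     # V_j ~ Be(c, d), here c = d = 1

# Euclidean distances |g_i - t_j| between sites and kernel centres
D <- sqrt(outer(g[, 1], tc[, 1], "-")^2 + outer(g[, 2], tc[, 2], "-")^2)
K <- exp(-sweep(D, 2, 2 * gam, "/"))      # K_ij = exp(-|g_i - t_j| / (2 gamma_j))
R <- sweep(K, 2, V, "*")                  # R_ij = K_ij V_j

p <- matrix(0, n, m)
for (i in 1:n) {
  remaining <- 1                          # running product of (1 - R_ik), k < j
  for (j in 1:(m - 1)) {
    p[i, j] <- R[i, j] * remaining
    remaining <- remaining * (1 - R[i, j])
  }
  p[i, m] <- remaining                    # final cluster takes the leftover mass
}

# Cluster indicators J_i and realised spatial effects w_{J_i}
J <- apply(p, 1, function(pr) sample.int(m, 1, prob = pr))
w <- rnorm(m, 0, 1)                       # w_j ~ N(0, 1/tau_w), tau_w = 1 assumed
spatial_effect <- w[J]
```

Because the last probability absorbs the remaining stick length, each row of p sums to one by construction.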

Example 6.7 Earthquake Magnitudes (Continued)


This example continues the analysis of the earthquake magnitude data. We consider a
10% sample of the original dataset, with n = 2945. A two way 10 × 10 grid cell subdivision
of the region of interest is obtained using rasterisation. There are then m = 100 interior
points tj = (tj1, tj2) defining the grid.
Initially, a discrete kernel approach is applied via a single intercept linear regression

$$y_i \sim N(\mu_i, \sigma^2),$$

$$\mu_i = \beta_0 + \sum_j K_{ij} w_j,$$

with a bivariate exponential kernel (Clark et al., 1999)

$$K_{ij}(d_{ij}) = \frac{1}{2\pi\eta} \exp(-d_{ij}/\eta),$$

with distances $d_{ij} = [(g_{i1} - t_{j1})^2 + (g_{i2} - t_{j2})^2]^{0.5}$. The grid effects wj are assumed iid random
normal with zero mean, and with standard deviation σw.
Because of confounding between the grid effects and the kernel, for identifiability,
it is assumed that η = 1, but that the wj have an unknown variance, with σw assigned a
U(0,100) prior. Using jagsUI for estimation, this model provides a correlation between
actual and predicted magnitudes of 0.32. Computation is slower if η is taken as an
unknown, and σw is set to 1. Also, the fit is not improved.
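
A minimal sketch of how the kernel matrix and a JAGS model of this form might be set up is given below. It is not the code used for the results reported here: the site coordinates are random placeholders rather than the earthquake data, the grid is built from the bounding box rather than by rasterisation, and the prior on β0 and the U(0,100) prior on the residual standard deviation are assumptions.

```r
# Illustrative sketch: kernel matrix for a 10 x 10 grid and a JAGS model string.
set.seed(3)
n <- 50
g <- cbind(runif(n, -20, 40), runif(n, 35, 70))   # placeholder (longitude, latitude)

# Centres of a 10 x 10 subdivision of the bounding box of the sites
cx <- seq(min(g[, 1]), max(g[, 1]), length.out = 21)[seq(2, 20, by = 2)]
cy <- seq(min(g[, 2]), max(g[, 2]), length.out = 21)[seq(2, 20, by = 2)]
tg <- as.matrix(expand.grid(cx, cy))              # m = 100 grid points t_j
m  <- nrow(tg)

# Bivariate exponential kernel with eta fixed at 1 for identifiability
eta <- 1
d <- sqrt(outer(g[, 1], tg[, 1], "-")^2 + outer(g[, 2], tg[, 2], "-")^2)
K <- exp(-d / eta) / (2 * pi * eta)               # n x m matrix of K_ij

# JAGS model text; K, y, n and m would be passed as data
model_string <- "
model {
  for (i in 1:n) {
    y[i] ~ dnorm(mu[i], tau)
    mu[i] <- beta0 + inprod(K[i, ], w[])
  }
  for (j in 1:m) { w[j] ~ dnorm(0, tau.w) }
  beta0 ~ dnorm(0, 0.001)      # assumed diffuse prior
  sigma ~ dunif(0, 100)        # assumed prior on residual sd
  tau <- pow(sigma, -2)
  sigma.w ~ dunif(0, 100)      # U(0,100) prior on sd of grid effects
  tau.w <- pow(sigma.w, -2)
}"
```

The model string could then be written to a file and passed, with a data list containing y, K, n and m, to the jags function in jagsUI.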
However, a much-improved fit is obtained by a two-group mixture intercept, with
preset probabilities on the two groups of 0.95 and 0.05 to facilitate identifiability. Thus

$$\mu_i = \beta_{0 J_i} + \sum_j K_{ij} w_j,$$

$$J_i \sim \text{Categorical}(0.95, 0.05).$$

This increases the correlation between actual and predicted magnitudes to 0.76. The
estimates (posterior means and sd) for the intercepts β01 and β02 are 4.40 (0.02) and 6.07
(0.05). Further improvements in fit might be obtained by taking additional groups in the
discrete mixture intercept.
For this model, site-specific effects are obtained by comparing the μi to their over-
all average. Then 359 of the 2945 sites have a posterior probability over 95% that the
effect is positive. Figure 6.6 maps out three significance categories, and in particular
shows spatial clustering of sites with over 0.95 probability of elevated earthquake
magnitudes.
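
The classification of sites into significance groups can be sketched as follows, assuming posterior draws of the μi are held in a matrix with one row per MCMC iteration. The draws below are simulated placeholders, not output from the fitted model, and the object names are hypothetical.

```r
# Illustrative sketch: posterior probabilities that site effects are positive.
set.seed(4)
S <- 1000                                  # number of retained MCMC draws
n <- 6                                     # number of sites
site_means <- c(4.0, 4.2, 4.4, 4.6, 5.5, 6.0)   # placeholder values
mu_draws <- matrix(rnorm(S * n, mean = rep(site_means, each = S), sd = 0.1), S, n)

# Site effect at each iteration: mu_i minus the overall average of the mu_i
effect <- mu_draws - rowMeans(mu_draws)

# Posterior probability of a positive effect, and the three map categories
pr_pos <- colMeans(effect > 0)
group <- cut(pr_pos, breaks = c(-Inf, 0.05, 0.95, Inf),
             labels = c("Under 0.05", "0.05-0.95", "Over 0.95"))
table(group)
```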

[Figure: map of sites by longitude (horizontal axis) and latitude (vertical axis), classed by significance group: under 0.05, 0.05-0.95, over 0.95.]

FIGURE 6.6
Significance of site effects.

6.8 Computational Notes
[1] With d denoting the vector of neighbour numbers (d[i] being the number of areas adjacent
to area i), W the interaction matrix, and N the number of areas, the Leroux et al. (1999) prior
can be implemented in rstan as follows

library(rstan)
library(loo)
# structure matrices for the Leroux prior
Dmat <- diag(d)      # diagonal matrix of neighbour numbers
R <- Dmat - W
I <- diag(N)
# data inputs
D <- list(n = N,     # number of observations
          y = y,     # observed number of cases
          T = T,     # binomial denominators
          x = x,     # predictor
          R = R,
          I = I)
model <- "
data {
  int<lower = 1> n;
  int<lower = 0> y[n];
  real x[n];
  int T[n];
  matrix[n, n] R;
  matrix[n, n] I;
}
transformed data {
  vector[n] zeros;
  zeros = rep_vector(0, n);
}
parameters {
  real beta[2];
  vector[n] phi;
  real<lower = 0> tau;
  real<lower = 0, upper = 1> alpha;
}
transformed parameters {
  real theta[n];
  real eta[n];
  for (i in 1:n) {
    eta[i] = beta[1] + beta[2] * x[i] + phi[i];
    theta[i] = inv_logit(eta[i]);
  }
}
model {
  // Leroux precision matrix: tau * ((1 - alpha) * I + alpha * R)
  phi ~ multi_normal_prec(zeros, tau * ((1 - alpha) * I + alpha * R));
  beta ~ normal(0, 5);
  tau ~ gamma(2, 2);
  y ~ binomial(T, theta);
}
generated quantities {
  real log_lik[n];
  for (i in 1:n) {
    log_lik[i] = binomial_lpmf(y[i] | T[i], theta[i]);
  }
}
"
sm <- stan_model(model_code = model)
fit <- sampling(sm, data = D, iter = 2500, warmup = 250, chains = 2, seed = 12345)
summary(fit, pars = c("beta", "alpha"), probs = c(0.025, 0.975))$summary
# Fit
loo(as.matrix(fit, pars = "log_lik"))

References
Adin A, Lee D, Goicoa T, Ugarte M (2018) A two-stage approach to estimate spatial and spatio-
temporal disease risks in the presence of local discontinuities and clusters. Statistical Methods in
Medical Research, in press.
Allard D, Beauchamp M, Bel L, Desassis N, Gabriel É, Geniaux G, Malherbe L, Martinetti D, Opitz
T, Parent É, Romary T, Saby N (2017) Analyzing spatio-temporal data with R: Everything you
always wanted to know – but were afraid to ask. Journal de la Société Française de Statistique,
158(3), 124–158.
Anselin L (2010) Thirty years of spatial econometrics. Papers in Regional Science, 89(1), 3–25.
Anselin L, Bera A (1998) Spatial dependence in linear regression models, with an introduction to spa-
tial econometrics, pp 237–290, in Handbook of Applied Economic Statistics, eds A Ullah, D Giles.
Marcel Dekker, New York.
Bakka H, Rue H, Fuglstad G, Riebler A, Bolin D, Krainski E, Simpson D, Lindgren F (2018) Spatial
modelling with R-INLA: A review. arXiv preprint arXiv:1802.06350.
Banerjee S, Carlin B, Gelfand A (2014) Hierarchical Modeling and Analysis for Spatial Data. Chapman
and Hall/CRC.
Bell B, Broemeling L (2000) A Bayesian analysis for spatial processes with application to disease map-
ping. Statistics in Medicine, 19, 957–974.
Berke O (2004) Exploratory disease mapping: kriging the spatial risk function from regional count
data. International Journal of Health Geographics, 3, 18.
Bernardinelli L, Clayton D, Pascutto C, Montomoli C, Ghislandi M, Songini M (1995) Bayesian analy-
sis of space–time variation in disease risk. Statistics in Medicine, 14(21–22), 2433–2443.
Besag J (1974) Spatial interaction and the statistical analysis of lattice systems. Journal of the Royal
Statistical Society: Series B, 36, 192–236.
Besag J (1989) Towards Bayesian image analysis. Journal of Applied Statistics, 16, 395–407.
Besag J, Green P (1993) Spatial statistics and Bayesian computation. Journal of the Royal Statistical
Society: Series B, 55, 25–37.
Besag J, Green P, Higdon D, Mengersen K (1995) Bayesian computation and stochastic systems.
Statistical Science, 10, 3–66.
Besag J, Kooperberg C (1995) On conditional and intrinsic autoregressions. Biometrika, 82, 733–746.
Besag J, York J, Mollie A (1991) Bayesian image restoration, with two applications in spatial statistics.
Annals of the Institute of Statistical Mathematics, 43, 1–21.
Best N, Arnold R, Thomas A, Waller L, Conlon E (2000) Bayesian models for spatially correlated dis-
ease and exposure data, p 131, in Bayesian Statistics 6: Proceedings of the Sixth Valencia International
Meeting, Vol. 6. Oxford University Press.
Bhattacharjee A, Jensen-Butler C (2006) Estimation of the spatial weights matrix under structural
constraints. Regional Science and Urban Economics, 43(4), 617–634.
Bivand R, Gomez-Rubio V, Rue H (2014) Approximate Bayesian inference for spatial econometrics
models. Spatial Statistics, 9, 146–165.
Bivand R, Gomez-Rubio V, Rue H (2015). Spatial data analysis with R-INLA with some extensions.
Journal of Statistical Software, 63(20), 1–31.
Bivand R, Pebesma E, Gómez-Rubio V (2013) Applied Spatial Data Analysis with R, 2nd
Edition. Springer, New York.
Blangiardo M, Cameletti M (2015) Spatial and Spatio-temporal Bayesian models with R-INLA. John Wiley
& Sons.
Bowman V, Woods D (2016) Emulation of multivariate simulators using thin-plate splines with
application to atmospheric dispersion. SIAM/ASA Journal on Uncertainty Quantification, 4(1),
1323–1344.
Brown P (2015) Model-based geostatistics the easy way. Journal of Statistical Software, 63(12), 1–24.
Brunsdon C (2018) Using rstan and spdep for Spatial Modelling. https://rstudio-pubs-static.s3.amazonaws.com/
Brunsdon C, Comber L (2015) An Introduction to R for Spatial Analysis and Mapping. Sage.
Calder K (2003) Exploring latent structure in spatial temporal processes using process convolutions.
PhD Thesis, Duke University, Durham, NC. https://www2.stat.duke.edu/people/theses/CalderK.html
Calder K (2007) Dynamic factor process convolution models for multivariate space–time data with
application to air quality assessment. Environmental and Ecological Statistics, 14, 229–247.
Clark J, Silman M, Kern R, Macklin E, HilleRisLambers J (1999) Seed dispersal near and far: Patterns
across temperate and tropical forests. Ecology, 80, 1475–1494.
Clayton D, Kaldor J (1987) Empirical Bayes estimates of age-standardised relative risks for use in
disease mapping. Biometrics, 43, 671–682.
Congdon P (2007) Mixtures of spatial and unstructured effects for spatially discontinuous health
outcomes. Computational Statistics and Data Analysis, 51, 3197–3212.
Congdon P (2008) A spatially adaptive conditional autoregressive prior for area health data. Statistical
Methodology, 5, 552–563.
Congdon P (2017) Quantile regression for overdispersed count data: a hierarchical method. Journal of
Statistical Distributions and Applications, 4, 18.
Conlon E, Louis T (1999) Addressing multiple goals in evaluating region-specific risk using Bayesian
methods, pp 31–47, in Disease Mapping and Risk Assessment for Public Health, eds A Lawson, A
Biggeri, D Bohning, E Lesaffre, J Viel, R Bertollini. John Wiley, Chichester, UK.
Cressie N, Kapat P (2008) Some diagnostics for Markov random fields. Journal of Computational and
Graphical Statistics, 17, 726–749.
Cressie N, Wikle CK (2011) Statistics for Spatio-Temporal Data. John Wiley & Sons, Inc., New York.
Datta A, Banerjee S, Finley A, Gelfand A (2016) Hierarchical nearest-neighbor Gaussian process mod-
els for large geostatistical datasets. Journal of the American Statistical Association, 111, 800–812.
De Oliveira V (2012) Bayesian analysis of conditional autoregressive models. Annals of the Institute of
Statistical Mathematics, 64(1), 107–133.
Diggle P, Ribeiro P (2002) Bayesian inference in Gaussian model-based geostatistics. Geographical and
Environmental Modelling, 6, 129–146.
Diggle P, Ribeiro P, Christensen O (2003) An introduction to model based geostatistics, pp 43–86,
in Spatial Statistics and Computational Methods, ed Möller J. Lecture Notes in Statistics, Vol. 173.
Springer.
Diggle P, Ribeiro P (2007) Model-based Geostatistics. Springer-Verlag, New York.
Dormann C, McPherson J, Araújo M, Bivand R, Bolliger J (2007) Methods to account for spatial
autocorrelation in the analysis of species distributional data: A review. Ecography, 30(5), 609–628.
Duan J, Guindani M, Gelfand A (2007) Generalized spatial Dirichlet process models. Biometrika, 94,
809–825.
Eidsvik J, Finley A, Banerjee S, Rue H (2012) Approximate Bayesian inference for large spatial
datasets using predictive process models. Computational Statistics & Data Analysis, 56(6),
1362–1380.
Fernandez C, Green P (2002) Modelling spatially correlated data via mixtures: A Bayesian approach.
Journal of the Royal Statistical Society: Series B, 64, 805–826.
Finley A, Sang H, Banerjee S, Gelfand A (2009) Improving the performance of predictive process
modeling for large datasets. Computational Statistics & Data Analysis, 53(8), 2873–2884.
Finley A, Banerjee S, Gelfand A (2015) spBayes for large univariate and multivariate point-referenced
spatio-temporal data models. Journal of Statistical Software, 63(13), 1–28.
Finley A, Datta A, Cook B, Morton D, Andersen H, Banerjee S (2017) Applying Nearest Neighbor
Gaussian Processes to Massive Spatial Data Sets: Forest Canopy Height Prediction Across
Tanana Valley Alaska. https://arxiv.org/abs/1702.00434
Furrer R, Sain S (2010) spam: A sparse matrix R package with emphasis on MCMC methods for
Gaussian Markov random fields. Journal of Statistical Software, 36(10), 1–25.
Gelfand A, Kottas A, MacEachern S (2005b) Bayesian nonparametric spatial modeling with Dirichlet
process mixing. Journal of the American Statistical Association, 100(471), 1021–1035.
Gelfand A, Latimer A, Wu S, Silander J (2005a) Building statistical models to analyse species distribu-
tions, in Hierarchical Modelling for the Environmental Sciences, Statistical Methods and Applications,
eds J Clark, A Gelfand. OUP.
Gelfand AE, Ghosh SK (1998) Model choice: A minimum posterior predictive loss approach.
Biometrika, 85(1), 1–11.
Gerber F, Furrer R (2015) Pitfalls in the implementation of Bayesian hierarchical modeling of areal
count data: An illustration using BYM and Leroux models. Journal of Statistical Software, Code
Snippets, 63(1), 1–32. http://www.jstatsoft.org/v63/c01/
Giardini D, Woessner J, Danciu L (2014) Mapping Europe’s seismic hazard. EOS, 95(29): 261–262.
Gómez-Rubio V, Bivand R (2018) R Package ‘INLABMA’, Bayesian Model Averaging with INLA.
https://rdrr.io/rforge/INLABMA/
Gómez-Rubio V, Bivand R, Rue H (2018) Estimating spatial econometrics models with integrated
nested Laplace approximation. arXiv preprint arXiv:1703.01273.
Gotway C, Wolfinger R (2003) Spatial prediction of counts and rates. Statistics in Medicine, 22,
1415–1432.
Gramacy R, Lee H (2008) Gaussian processes and limiting linear models. Computational Statistics &
Data Analysis, 53, 123–136.
Green P, Richardson S (2002) Hidden Markov models and disease mapping. Journal of the American
Statistical Association, 97, 1055–1070.
Griffin J, Steel M (2006) Order-based dependent Dirichlet processes. Journal of the American Statistical
Association, 101, 179–194.
Grunwald S (2005) Environmental Soil-Landscape Modeling: Geographic Information Technologies and
Pedometrics. CRC Press.
Gschlößl S, Czado C (2006) Modelling count data with overdispersion and spatial effects. Technische
Universität München, Statistical Papers. DOI: 10.1007/s00362-006-0031-6
Haran M, Hodges J, Carlin B (2003) Accelerating computation in Markov random field models for
spatial data via structured MCMC. Journal of Computational & Graphical Statistics, 12, 249–264.
Hepple L (2003) Bayesian and maximum likelihood estimation of the linear model with spatial mov-
ing average disturbances. Working Papers Series, School of Geographical Sciences, University
of Bristol.
Higdon D (1998) A process-convolution approach to modelling temperatures in the North Atlantic
Ocean. Environmental and Ecological Statistics, 5, 173–190.
Higdon D (2007) A primer on space-time modelling from a Bayesian perspective, Chapter 6, in
Statistical Methods for Spatio-Temporal Systems, eds B Finkelstadt, L Held, V Isham. CRC Press.
Hodges J, Carlin B, Fan Q (2003) On the precision of the conditionally autoregressive prior in spatial
models. Biometrics, 59, 317–322.
Jiruše M, Machek J, Beneš V, Zeman P (2004) A Bayesian estimate of the risk of tick-borne diseases.
Applications of Mathematics, 49, 389–404.
Joseph M (2016) Exact Sparse CAR Models in Stan.
http://mc-stan.org/users/documentation/case-studies/mbjoseph-CARStan.html
Kelsall J, Wakefield J (2002) Modelling spatial variation in disease risk: A geostatistical approach.
Journal of the American Statistical Association, 97, 692–770.
Knorr-Held L, Becker N (2000) Bayesian modelling of spatial heterogeneity in disease maps with
application to German cancer mortality data. Journal of the German Statistical Society, 84, 121–140.
Knorr-Held L, Rasser G (2000) Bayesian detection of clusters and discontinuities in disease maps.
Biometrics, 56, 13–21.
Lacombe D, LeSage J (2015) Using Bayesian posterior model probabilities to identify omitted vari-
ables in spatial regression models. Papers in Regional Science, 94(2), 365–383.
Lang S, Brezger A (2004) Bayesian P-splines. Journal of Computational and Graphical Statistics, 13(1),
183–212.
Lavine M (1999) Another look at conditionally Gaussian Markov random fields, in Bayesian Statistics
6, eds J Bernardo, J Berger, P Dawid, A Smith. Oxford University Press, Oxford, UK.
Lawson A (2008) Bayesian Disease Mapping: Hierarchical Modeling in Spatial Epidemiology. CRC Press.
Lawson A, Clark A (2002) Spatial mixture relative risk models applied to disease mapping. Statistics
in Medicine, 21, 359–370.
Lee D (2011) A comparison of conditional autoregressive models used in Bayesian disease mapping.
Spatial and Spatio-Temporal Epidemiology, 2(2), 79–89.
Lee D (2013) CARBayes: An R package for Bayesian spatial modeling with conditional autoregres-
sive priors. Journal of Statistical Software, 55(13), 1–24.
Lee H, Higdon D, Calder C, Holloman C (2005) Efficient models for correlated data via convolutions
of intrinsic processes. Statistical Modelling, 5, 53–74.
Leroux B, Lei X, Breslow N (1999) Estimation of disease rates in small areas: a new mixed model
for spatial dependence, pp 135–178, in Statistical Models in Epidemiology, the Environment and
Clinical Trials, eds M Halloran, D Berry. Springer-Verlag, New York.
LeSage J (1999) Spatial econometrics, in The Web Book of Regional Science (www.rri.wvu.edu/regscweb.htm),
ed R W Jackson. Regional Research Institute, West Virginia University, Morgantown,
WV.
Liang W, Lee H (2014) Sequential process convolution Gaussian process models via particle learning.
Statistics and Its Interface, 7(4), 465–475.
Lindgren F (2012) Continuous domain spatial models in R-INLA. The ISBA Bulletin, 19(4), 14–20.
Lindgren F, Rue H, Lindstrom J (2011) An explicit link between Gaussian fields and Gaussian
Markov random fields: The stochastic partial differential equation approach. Journal of the Royal
Statistical Society: Series B, 73(4), 423–498.
Lindgren F, Rue H (2015) Bayesian spatial modelling with R-INLA. Journal of Statistical Software,
63(19), 1–25.
MacNab Y (2014) On identification in Bayesian disease mapping and ecological–spatial regression
models. Statistical Methods in Medical Research, 23(2), 134–155.
MacNab Y, Kmetic A, Gustafson P, Shaps S (2006) An innovative application of Bayesian disease
mapping methods to patient safety research. Statistics in Medicine, 25, 3960–3980.
Mardia K, Watkins A (1989) On multimodality of the likelihood in the spatial linear model. Biometrika,
76, 289–295.
Marra G, Wood S (2012) Coverage properties of confidence intervals for generalized additive model
components. Scandinavian Journal of Statistics, 39(1), 53–74.
Mitas L, Mitasova H (1999) Spatial interpolation, pp 481–492, in Geographical Information Systems:
Principles, Techniques, Management and Applications, eds P Longley, M Goodchild, D Maguire, D
Rhind, 1st Edition. Wiley.
Moores M, Hargrave C, Deegan T, Poulsen M, Harden F, Mengersen K (2015) An external field prior
for the hidden Potts model with application to cone-beam computed tomography. Computational
Statistics & Data Analysis, 86, 27–41.
Morris M (2018) Spatial Models in Stan: Intrinsic Auto-Regressive Models for Areal Data.
http://mc-stan.org/users/documentation/case-studies/icar_stan.html
Norton J, Niu X (2009) Intrinsically autoregressive spatiotemporal models with application to aggre-
gated birth outcomes. Journal of the American Statistical Association, 104, 638–649.
Paciorek CJ, Schervish MJ (2006) Spatial modelling using a new class of nonstationary covariance
functions. Environmetrics, 17(5), 483–506.
Pettitt A, Weir I, Hart A (2002) A conditional autoregressive Gaussian process for irregularly
spaced multivariate data with application to modelling large sets of binary data. Statistics and
Computing, 12, 353–367.
Reich BJ, Fuentes M (2007) A multivariate semiparametric Bayesian spatial modeling framework for
hurricane surface wind fields. The Annals of Applied Statistics, 1(1), 249–264.
Ribeiro P, Diggle P (2018) Package ‘geoR’. https://cran.r-project.org/web/packages/geoR/geoR.pdf
Richardson S, Guihenneuc C, Lasserre V (1992) Spatial linear models with autocorrelated error struc-
ture. The Statistician, 41, 539–557.
Richardson S, Monfort C (2000) Ecological correlation studies, in Spatial Epidemiology Methods and
Applications, eds P Elliott, J Wakefield, N Best, D Briggs. Oxford University Press.
Richardson S, Thomson A, Best N, Elliott P (2004) Interpreting posterior relative risk estimates in
disease-mapping studies. Environmental Health Perspectives, 112, 1016–1025.
Riebler A, Sørbye S, Simpson D, Rue H (2016) An intuitive Bayesian spatial model for disease map-
ping that accounts for scaling. Statistical Methods in Medical Research, 25, 1145–1165.
Riggan W, Manton K, Creason J, Woodbury M, Stallard E (1991) Assessment of spatial variation of
risks in small populations. Environmental Health Perspectives, 96, 223–238.
Risser M, Calder C (2017) Local likelihood estimation for covariance functions with spatially-varying
parameters: The convoSPAT package for R. Journal of Statistical Software, 81(14), 1–32.
Rodrigues A, Assuncao R (2008) Propriety of posterior in Bayesian space varying parameter models
with normal data. Statistics & Probability Letters, 78, 2408–2411.
Rue H, Held L (2005) Gaussian Markov Random Fields: Theory and Applications. Chapman & Hall,
London, UK.
Rue H, Martino S, Chopin N (2009) Approximate Bayesian inference for latent Gaussian models by
using integrated nested Laplace approximations. Journal of the Royal Statistical Society: Series B,
71(2), 319–392.
Rue H, Tjelmeland H (2002) Fitting Gaussian Markov random fields to Gaussian fields. Scandinavian
Journal of Statistics, 29, 31–49.
Sangalli L, Ramsay J, Ramsay T (2013) Spatial spline regression models. Journal of the Royal Statistical
Society: Series B, 75(4), 681–703.
Schabenberger O, Gotway C (2004) Statistical Methods for Spatial Data Analysis. Chapman & Hall/
CRC.
Schrödle B, Held L (2011) A primer on disease mapping and ecological regression using INLA.
Computational Statistics, 26(2), 241–258.
Sethuraman J (1994) A constructive definition of Dirichlet priors. Statistica Sinica, 4, 639–665.
Seya H, Tsutsumi M, Yamagata Y (2012) Income convergence in Japan: A Bayesian spatial Durbin
model approach. Economic Modelling, 29(1), 60–71.
Thomas A, Best N, Lunn D, Arnold R, Spiegelhalter D (2014) GeoBUGS User Manual.
https://www.mrc-bsu.cam.ac.uk
Vecchia AV (1988) Estimation and model identification for continuous spatial processes. Journal of the
Royal Statistical Society: Series B (Methodological), 50(2), 297–312.
Wakefield J (2007) Disease mapping and spatial regression with count data. Biostatistics, 8, 158–183.
Wall M (2004) A close look at the spatial structure implied by the CAR and SAR models. Journal of
Statistical Planning and Inference, 121, 311–324.
Waller L (2002) Hierarchical models for disease mapping, in Encyclopedia of Environmetrics, eds A
El-Shaarawi, W Piegorsch. Wiley, Chichester, UK.
Waller L, Carlin B (2010) Disease mapping, Chapter 14, pp 217–243, in Handbooks of Modern Statistical
Methods. ed G Fitzmaurice, Chapman & Hall/CRC.
Waller L, Gotway C (2004) Applied Spatial Statistics for Public Health Data. Wiley.
Webster R, Oliver M, Muir K, Mann J (1994) Kriging the local risk of a rare disease from a register of
diagnoses. Geographical Analysis, 26, 168–185.
Wood S (2006) Low-rank scale-invariant tensor product smooths for generalized additive mixed
models. Biometrics, 62(4), 1025–1036.
Wood S, Bravington M, Hedley S (2008) Soap film smoothing. Journal of the Royal Statistical Society:
Series B, 70(5), 931–955.
Wood S N (2016) Just another Gibbs additive modeller: Interfacing JAGS and mgcv. arXiv preprint
arXiv:1602.02539.
Yang C, Xu J, Li Y (2016) Bayesian geoadditive modelling of climate extremes with nonparametric
spatially varying temporal effects. International Journal of Climatology, 36(12), 3975–3987.
Yanli Z, Wall M (2004) Investigating the use of the variogram for lattice data. Journal of Computational
and Graphical Statistics, 13, 719–738.
Zhang H (2002) On estimation and prediction for spatial generalised linear mixed models. Biometrics,
58, 129–136.
Zhang L, Datta A, Banerjee S (2018) Practical Bayesian Modeling and Inference for Massive Spatial
Datasets On Modest Computing Environments. arXiv preprint arXiv:1802.00495.
Zhu L, Gorman D, Horel S (2006) Hierarchical Bayesian spatial models for alcohol availability, drug
hot spots and violent crime. International Journal of Health Geographics, 5, 54.
7
Regression Techniques Using Hierarchical Priors

7.1 Introduction
This chapter is concerned with the application of hierarchical prior schemes to regressions
involving univariate responses, where the observations are non-nested, but may be spa-
tially or temporally configured. Nested data applications are considered in Chapters 8 and
10. A range of Bayesian packages in R for regression, often in specialised applications, are
detailed at https://cran.r-project.org/web/views/Bayesian.html, and include BayesLogit
(Windle, 2016), BMA (Raftery et al., 2005), and BMS (Zeugner and Feldkircher, 2015). The
treatment here is intended as providing a generic overview of regression applications
involving hierarchical principles, and development of flexible data analysis through appli-
cation-specific coding.
As a first illustration of regression modelling invoking hierarchical principles, much
attention has focused on Bayesian methods for predictor selection, commonly using selec-
tion indicators or shrinkage priors, and these are discussed in Section 7.2. Many regression
selection applications involve categorical predictors, including analysis of variance, and
the particular issues raised are considered in Section 7.3. Hierarchical specifications also
apply when latent responses or random effects are used to (a) improve data representa-
tion in overdispersed general linear models (Section 7.4), or (b) generate latent continuous
responses (augmented data) underlying discrete observations (Section 7.5). Section 7.6 con-
siders heterogeneity in regression relationships or variance parameters over exchangeable
sample units. Heterogeneous regression effects and predictor selection are then consid-
ered for responses structured in time or space (Sections 7.7 and 7.8).

7.2 Predictor Selection
Regression model uncertainty most commonly focuses on which predictors to retain,
though other aspects of regression specification may be considered also. Predictor selec-
tion methods generally also aim for improved predictive performance, through develop-
ing an encompassing model, or model simplification without adversely affecting predictive
accuracy (Piironen and Vehtari, 2017a). Formal model choice is simplified for normal linear
regression, as marginal likelihoods may be obtained analytically, but for a large number of
predictors, comparison of the many possible models becomes infeasible.
One option for a more feasible analysis involves a form of discrete mixture: predictor
selection indicators, possibly combined with particular priors on regression coefficients or
variances, are introduced to enable additional inferences (e.g. marginal retention probabil-
ities on predictors) (Rockova et al., 2012; Malsiner-Walli and Wagner, 2019). Alternatively,
regularisation or shrinkage priors include a penalty (e.g. an L1 norm) or some other mecha-
nism to shrink unnecessary regression effects towards zero (e.g. Polson and Scott, 2010;
Carvalho et al., 2009).
A leading motivation for predictor selection or effect regularisation is to estimate coefficients
in a way that alleviates predictor collinearity. If not controlled for, collinearity may
lead to low precision for regression coefficients, and coefficients with effect sizes or signs
contrary to subject matter expectations (Winship and Western, 2016).

7.2.1 Predictor Selection
Predictor selection recognises model uncertainty, and a predictive target which acknowl-
edges such uncertainty may be of interest. An example might be a mean treatment suc-
cess rate conditional on predictors (Garcia-Donato and Martinez-Beneito, 2013). Using
predictor selection, one is implicitly averaging over a set of plausible regression models,
so providing an encompassing model with potential predictive advantages (Piironen and
Vehtari, 2017a).
The discrete mixture approach involves binary selection indicators γj for predictors
j = 1,…,p. These indicators may directly determine inclusion (e.g. Kuo and Mallick, 1998),
or define prior regression coefficient variances consistent with inclusion or effective exclusion,
as under stochastic search variable selection (SSVS) (George and McCulloch, 1993).
Selection usually applies to all predictors except the intercept.
With normal priors on regression coefficients under univariate linear regression, and
with response yi and predictors Xi, one has under SSVS

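$$y_i \sim N(\beta_0 + X_i \beta, \sigma^2),$$

$$\beta_j \mid \gamma_j = 1 \sim N(0, \tau_j^2),$$

$$\beta_j \mid \gamma_j = 0 \sim N(0, c_j \tau_j^2),$$

$$\gamma_j \mid \omega_j \sim \text{Bern}(\omega_j),$$

$$\omega_j \sim p(\omega_j).$$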

The setting for τj² (at a known value) allows unrestricted search over the potential parameter
space, while cj is set suitably small, so that γj = 0 corresponds to effective exclusion from
the regression. The spike and slab prior (Kuo and Mallick, 1998), denoted SSP for short,
specifies the inclusion and exclusion options as

$$\beta_j \mid \gamma_j = 1 \sim N(0, \tau_j^2),$$

$$\beta_j \mid \gamma_j = 0 \sim \delta_0(),$$

where δB() is a discrete measure concentrated at value B.
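
To see how the two priors differ, one can draw coefficients from each for a common set of selection indicators. The sketch below is illustrative only: the values ω = 0.5, a slab variance of 1 and cj = 0.001 are assumptions chosen for the demonstration, not recommended settings.

```r
# Illustrative sketch: prior draws under SSVS and under the spike and slab prior.
set.seed(5)
p     <- 8                                 # number of predictors
omega <- 0.5                               # prior inclusion probability (assumed)
tau2  <- 1                                 # slab variance tau_j^2 (assumed)
cj    <- 0.001                             # small multiplier c_j for SSVS (assumed)

gamma <- rbinom(p, 1, omega)               # gamma_j ~ Bern(omega)

# SSVS: excluded coefficients come from a narrow normal centred at zero
beta_ssvs <- rnorm(p, 0, sqrt(ifelse(gamma == 1, tau2, cj * tau2)))

# SSP: excluded coefficients are exactly zero (point mass at 0)
beta_ssp <- ifelse(gamma == 1, rnorm(p, 0, sqrt(tau2)), 0)

cbind(gamma, beta_ssvs = round(beta_ssvs, 3), beta_ssp = round(beta_ssp, 3))
```

Under SSVS the excluded coefficients are small but non-zero, whereas under SSP they are exactly zero.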


Whether SSVS or SSP priors are adopted, the setting for τj² affects the level of parsimony
in the selection. Higher values of τj² favour more parsimonious models, with
Lee and Chen (2015) proposing a predictive mean square error criterion to select τj². Thus
for predictions (replicate observations) ỹi and n cases, the MSE is Σi (yi − ỹi)²/n.
The regression term becomes b0 + g1 b1Xi1 + ¼+ gp bp Xip , and to assess the effect of the jth
predictor, one would monitor the product xj = g j b j . Posterior marginal retention probabili-
ties Pr(γj = 1|y) are estimated by the proportion of MCMC iterations when γj = 1, while pos-
terior model probabilities are based on the sampled frequency of different combinations
of retained predictors.
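
The bookkeeping just described can be sketched in Python. This is only an illustration: the indicator draws below are made-up values rather than output from any model in this chapter, and the function names are hypothetical.

```python
from collections import Counter

def inclusion_probabilities(gamma_draws):
    """Proportion of iterations with gamma_j = 1, for each predictor j."""
    n_iter = len(gamma_draws)
    p = len(gamma_draws[0])
    return [sum(draw[j] for draw in gamma_draws) / n_iter for j in range(p)]

def model_probabilities(gamma_draws):
    """Relative frequency of each sampled combination of retained predictors."""
    counts = Counter(tuple(draw) for draw in gamma_draws)
    n_iter = len(gamma_draws)
    return {model: c / n_iter for model, c in counts.items()}

# Hypothetical MCMC output: 8 iterations, p = 3 predictors
gamma_draws = [
    (1, 0, 1), (1, 0, 1), (1, 1, 1), (1, 0, 0),
    (1, 0, 1), (0, 0, 1), (1, 0, 1), (1, 1, 1),
]
incl = inclusion_probabilities(gamma_draws)
models = model_probabilities(gamma_draws)
```

Here `incl[j]` estimates the marginal retention probability for predictor j + 1, and `models` maps each visited combination of indicators to its sampled frequency.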
It is often useful to have a measure of significance for individual predictors (analogous to
classical t statistics for each predictor), and marginal Bayes factors for retention (Ghosh and
Ghattas, 2015) can be obtained by comparing posterior retention odds Pr(γj = 1|y)/Pr(γj = 0|y)
with prior odds ωj/(1 − ωj) = Pr(γj = 1)/Pr(γj = 0). Interest generally lies with predictors having
posterior retention probabilities exceeding their prior probability. For example, assum-
ing Pr(γj = 1) = 0.5, Barbieri and Berger (2004) define the median probability model as that
defined by predictors with posterior inclusion probabilities exceeding 0.5.
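
As a minimal sketch of these two summaries, the following Python fragment computes marginal Bayes factors and the median probability model. The posterior inclusion probabilities are hypothetical, and the function names are illustrative only.

```python
def marginal_bayes_factor(post_incl, prior_incl):
    """Posterior retention odds divided by prior retention odds.

    Undefined when the estimated posterior inclusion probability is exactly 0 or 1.
    """
    post_odds = post_incl / (1.0 - post_incl)
    prior_odds = prior_incl / (1.0 - prior_incl)
    return post_odds / prior_odds

def median_probability_model(post_incl_probs):
    """Indices (1-based) of predictors with inclusion probability above 0.5."""
    return [j + 1 for j, pr in enumerate(post_incl_probs) if pr > 0.5]

# Hypothetical posterior inclusion probabilities, with prior Pr(gamma_j = 1) = 0.5
post = [0.9, 0.2, 0.6]
bf = [marginal_bayes_factor(pr, 0.5) for pr in post]
mpm = median_probability_model(post)
```

With prior odds of 1, each Bayes factor reduces to the posterior odds, so predictors with inclusion probability above 0.5 are exactly those with a Bayes factor above 1.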
Settings for τj in the SSVS and SSP priors are facilitated by standardising predictors.
Then reasonable priors are $\beta_j \mid \gamma_j = 1 \sim N(0, k^2)$, with $k \in (0.5, 4)$ (Kuo and Mallick, 1998), or
k = 1 (McElreath, 2016). Lee and Chen (2015) select k = 10 in a sparse group selection
linear regression with standardised predictors and response. In logistic regression, very
large k values are, in fact, informative (disproportionately weighting coefficient values
consistent with fitted 0 or 1 probabilities), and the setting $k^2 = 3$ is used by Nott and Leng (2010).
Settings for ωj correspond to prior beliefs about the potential importance of the predictor,
with simplifying options being ωj = ω, with ω either preset or an additional unknown, for
example, beta distributed a priori. Taking ω as an unknown measure of model complexity
is advantageous in applications with many predictors (Piironen and Vehtari, 2016a), with
$\omega p$ amounting to a prior guess at model size (Bhattacharya et al., 2015). The indifference
setting ωj = 0.5 implies $2^p$ equally probable models, and that about half the variables are to
be retained a priori (Ishwaran and Rao, 2005; O’Hara and Sillanpaa, 2009).
In multivariate regression (e.g. for gene expression measures yik, k = 1, … , q ), the discrete
selection approach may involve indicators γjk determining retention of the jth predictor in
the regression for the kth outcome. For example, Jia and Xu (2007) propose

$$\beta_{jk} \sim (1 - \gamma_{jk})N(0, c) + \gamma_{jk}N(0, \tau_k^2),$$

$$\gamma_{jk} \sim \mathrm{Bern}(\omega_j),$$

where c is a fixed small constant, and a hierarchical prior is set on $\tau_k^2$. Thus, Richardson
et al. (2010) suggest a model with selection parameters ωjk determined both by the outcome
and predictor,

$$\omega_{jk} = \omega_k\rho_j,$$

where ρj captures the propensity for predictor j to influence several outcomes, and ωk con-
trols the complexity of the regression for outcome k.
A similar hierarchical indicators procedure is proposed by Chen et al. (2016) for sparse
group selection, where covariates can be formed into substantively defined groups. The
aim is to select the most important groups of predictors, and within those selected groups,
select the more important predictors. Thus, retention for group j is determined by binary
indicators ρj ~ Bern(ωρ), so that for predictors k within groups, the selection rule is

$$\gamma_{jk} \sim (1 - \rho_j)\delta_0 + \rho_j\mathrm{Bern}(\omega_\gamma).$$

Hence under an SSP prior

$$\beta_{jk} \sim (1 - \gamma_{jk}\rho_j)\delta_0 + \gamma_{jk}\rho_j N(0, \tau_k^2).$$
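
To make the two-level selection rule concrete, the sketch below draws coefficients once from the prior. It assumes, for simplicity, a single slab standard deviation `tau` shared by all predictors (the text indexes the slab variance by k), and the function name and argument values are hypothetical.

```python
import random

def draw_sparse_group_coefficients(n_groups, n_per_group, omega_rho, omega_gamma, tau, rng):
    """One prior draw of (rho_j, [beta_j1, ..., beta_jK]) for each group j."""
    out = []
    for _ in range(n_groups):
        rho_j = 1 if rng.random() < omega_rho else 0      # is the group retained?
        group = []
        for _ in range(n_per_group):
            # gamma_jk is forced to 0 when the whole group is excluded
            gamma_jk = 1 if (rho_j == 1 and rng.random() < omega_gamma) else 0
            # slab N(0, tau^2) if retained, point mass at zero otherwise
            group.append(rng.gauss(0.0, tau) if gamma_jk == 1 else 0.0)
        out.append((rho_j, group))
    return out

rng = random.Random(11)
draw = draw_sparse_group_coefficients(5, 3, 0.5, 0.5, 1.0, rng)
```

Every group with `rho_j = 0` has all coefficients exactly zero, while retained groups may still have individual coefficients set to zero.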

7.2.2 Shrinkage Priors
Shrinkage priors seek a sparse representation of the regression coefficients without neces-
sarily including a mechanism to actually formally exclude unnecessary predictors, with
potential advantages in MCMC sampling (Bhattacharya et al., 2015; Makalic and Schmidt,
2016). For example, the Lasso prior specifies a heavy tailed double exponential or Laplace
prior density for regression coefficients, where this density is defined as

$$\mathrm{DE}(x \mid \mu, \lambda) = \frac{\lambda}{2}\exp(-\lambda|x - \mu|).$$

The prior $\beta_j \sim \mathrm{DE}(0, \lambda)$ assigns higher weight to values near zero than the normal prior and
favours shrinkage, with the scale parameter λ controlling the amount of shrinkage. Larger
values of λ imply greater shrinkage with lower variance around the zero prior mean. This
prior can be expressed in hierarchical terms (Kotz et al., 2001) as

$$\beta_j \sim N(0, \eta_j^2),$$

$$\eta_j^2 \sim E(\lambda^2/2).$$

In a normal linear regression with residual variance σ2, the first stage of the prior should be
expressed as $\beta_j \sim N(0, \sigma^2\eta_j^2)$ (Park and Casella, 2008). One may also allow the second stage
parameters $\lambda_j^2$ to vary between coefficients, e.g. following a gamma prior (Yi and Ma, 2012).
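
The hierarchical form can be checked by simulation. The sketch below assumes that $E(\lambda^2/2)$ denotes an exponential density with rate $\lambda^2/2$ (mean $2/\lambda^2$), and compares Monte Carlo moments of the mixture with the known moments of a Laplace density with rate λ, namely variance $2/\lambda^2$ and mean absolute value $1/\lambda$. The value of λ and the seed are arbitrary.

```python
import math
import random

def draw_beta_scale_mixture(lam, rng):
    """eta2 ~ Exponential(rate = lam^2 / 2), then beta ~ N(0, eta2)."""
    eta2 = rng.expovariate(lam ** 2 / 2.0)
    return rng.gauss(0.0, math.sqrt(eta2))

rng = random.Random(1)
lam = 2.0
draws = [draw_beta_scale_mixture(lam, rng) for _ in range(100_000)]
mc_var = sum(b * b for b in draws) / len(draws)     # estimates E(beta^2); the mean is 0
mc_abs = sum(abs(b) for b in draws) / len(draws)    # estimates E|beta|
```

With λ = 2 both `mc_var` and `mc_abs` should be close to 0.5, up to Monte Carlo error.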
Shrinkage priors can be represented generically (Polson and Scott, 2010; Bhadra et al.,
2016) as

$$\beta_j \sim N(0, \tau^2\eta_j^2), \tag{7.1}$$

with different possible choices of prior density for the $\eta_j^2$ (local shrinkage parameters) and
τ2 (the global shrinkage parameter).
The horseshoe prior specifies a half-Cauchy prior for the ηj, allowing considerable
shrinkage for unnecessary coefficients (Carvalho et al., 2009; Polson and Scott, 2012). There
is some debate about a suitable prior for τ2 (Piironen and Vehtari, 2017b; Piironen and
Vehtari, 2017c), with Carvalho et al. (2009) recommending a half-Cauchy prior also, namely
$\tau \sim C^+(0, 1)$.
This can be expressed in terms of a Beta(0.5,0.5) density for the shrinkage parameters

$$\kappa_j = 1/(1 + \eta_j^2). \tag{7.2}$$

The estimated κj can be interpreted as “the amount of weight that the posterior mean for βj
places on 0” (Carvalho et al., 2009; Piironen and Vehtari, 2017c); so higher κj correspond to
irrelevant predictors. Accordingly, Piironen and Vehtari (2017c) propose an effective number
of coefficients measure

$$\sum_{j=1}^{p} (1 - \kappa_j).$$
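
A short Python sketch of these quantities follows, using the simplified form (7.2) for $\kappa_j$. The local variances in `eta2` are made-up values, and the half-Cauchy draws (generated by inverting the Cauchy distribution function) are included only to illustrate the Beta(0.5,0.5) statement, whose mean is 0.5.

```python
import math
import random

def shrinkage_factors(eta2):
    """kappa_j = 1 / (1 + eta_j^2), as in equation (7.2)."""
    return [1.0 / (1.0 + e) for e in eta2]

def effective_number(eta2):
    """Sum over j of (1 - kappa_j)."""
    return sum(1.0 - k for k in shrinkage_factors(eta2))

# Illustrative local variances: two coefficients nearly unshrunk, two shrunk hard
eta2 = [99.0, 9.0, 0.01, 0.0]
m_eff = effective_number(eta2)

# Monte Carlo check with eta_j ~ half-Cauchy(0, 1)
rng = random.Random(7)
half_cauchy = [abs(math.tan(math.pi * (rng.random() - 0.5))) for _ in range(100_000)]
kappa = shrinkage_factors([h * h for h in half_cauchy])
kappa_mean = sum(kappa) / len(kappa)
```

In this toy case `m_eff` is close to 1.9, reflecting two effectively retained coefficients out of four.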

The horseshoe prior is a special case of the prior

$$\eta_j \sim t_\nu^+(0, 1),$$

namely a half-Student t-prior with ν degrees of freedom (Piironen and Vehtari, 2016).
Piironen and Vehtari (2017c) mention using this prior (with small ν) to alleviate divergent
transitions produced by the No U-Turn Sampler (NUTS) algorithm, but this implies a loss
of sparsity.

Example 7.1 Diabetes Progression


This example illustrates spike-slab and shrinkage priors using the diabetes data con-
sidered by Ishwaran and Rao (2005), and included in the R spikeslab library (Ishwaran
et al., 2010), and the R lars library. These data have a continuous measure of disease
progression as a response for n = 442 patients, with p = 64 predictors. The latter consist of
ten baseline measures (age, sex, body mass index (BMI), blood pressure, and six blood
serum measurements), 45 pairwise interactions formed between baseline variables, and
quadratic terms for the nine continuous baseline measurements.
Here predictors are arranged as in the lars library, and are standardised. Classical
estimation shows four predictors with t statistics above 2, namely x2 (sex), x3 (BMI), x4
(MAP, mean arterial pressure) and x20 (sex-age interaction). In the Bayesian analysis
here, 95% credible intervals are considered on these predictors, and also on any other
predictors shown as relevant.
The first analysis uses the spike-slab scheme of Ishwaran and Rao (2005, equation 4),
whereby v0 denotes a small constant,

$$\beta_j \sim N(0, \gamma_j\tau_j^2),$$

$$\gamma_j \sim (1 - \omega)\delta_{v_0}() + \omega\delta_1(),$$

with gamma priors on $1/\tau_j^2$ and a beta or uniform prior on ω. A shrinkage mechanism
operates whereby when $\gamma_j = v_0$ the variance of βj is very small. The parameter ω, if
taken unknown, acts as a complexity parameter, controlling model size. Here v0 = 0.005
and ω ~ U(0,1). It is assumed that $1/\tau_j^2 \sim \mathrm{Exp}(1)$, centred around 1, since predictors are
standardised.
The second half of a two-chain run of 5,000 iterations provides a posterior mean for ω
of 0.84. The control for collinearity implicit in predictor selection reveals x9 (LTG serum)
as a significant predictor (Table 7.1), but with 95% credible intervals for β2 and β20 strad-
dling zero. The predictor x7 (high-density lipoprotein or HDL cholesterol) has a 95%
interval concentrated on negative values, albeit with an inclusion probability of 0.95.
Retention probabilities are close to 1 for BMI, MAP, and LTG.
A second analysis uses a hierarchical version of the Lasso prior, namely

$$\beta_j \sim N(0, \sigma^2\eta_j^2),$$

$$1/\sigma^2 \sim \mathrm{Ga}(a, b),$$

$$\eta_j^2 \sim E(\lambda^2/2).$$

TABLE 7.1
Predictor Selection, Diabetes Progression, Spike-Slab Prior

                         βj                          γj
Predictor   Notation     Mean      2.5%     97.5%    Mean
Sex         X2          −4.09    −12.15      0.45    0.93
BMI         X3          26.64     19.55     33.74    1.00
MAP         X4          11.91      3.56     19.06    1.00
HDL         X7          −5.84    −15.73      0.43    0.95
LTG         X9          25.34     18.01     32.55    1.00
Age-Sex     X20          4.05     −0.38     10.94    0.96

TABLE 7.2
Predictor Selection, Lasso Shrinkage Prior

                         βj
Predictor   Notation     Mean      2.5%     97.5%
Sex         X2          −7.88    −13.75     −2.09
BMI         X3          23.14     15.86     30.22
MAP         X4          13.52      7.37     19.80
LTG         X9          23.39     15.74     31.18
Age-Sex     X20          5.86      0.44     11.88

The residual precision $1/\sigma^2$ is assigned a Ga(1,0.001) prior, and λ is assigned a uniform
U(0.001,100) prior. Results are as in Table 7.2, with five predictors x2, x3, x4, x9, and x20
judged significant in terms of 95% credible intervals either entirely negative or positive.
In the sense that regression including predictor selection is still a model for the
data, fit statistics, namely penalised DIC (deviance information criterion), WAIC (widely
applicable information criterion) and LOO-IC (leave-one-out information criterion), are very
similar between the spike-slab and Lasso models. For example, their respective
LOO-IC are 4797 and 4794.
A third analysis uses a horseshoe prior, implemented in rstan using the scheme

$$\beta_j \sim N(0, \tau^2\lambda_j^2),$$

$$1/\tau^2 \sim \mathrm{Ga}(1, 0.001),$$

$$\lambda_j \sim C^+(0, 1),$$

where C+(0,1) is a half-Cauchy density. A two-chain run of 2000 iterations provides pos-
terior mean estimates for κj below 0.05 only for x3.
One may also include a predictor selection mechanism when shrinkage priors are used
for the coefficients (Yuan and Lin, 2005). Thus in a selection version of the Lasso prior

$$\beta_j \mid \gamma_j = 1 \sim N(0, \sigma^2\eta_j^2),$$

$$\beta_j \mid \gamma_j = 0 \sim \delta_0(),$$

$$1/\sigma^2 \sim \mathrm{Ga}(a, b),$$

$$\eta_j^2 \sim E(\lambda^2/2),$$

with ω and λ assigned Be(1,1) and U(0.001,100) priors. This provides posterior inclusion
probabilities above 0.95 for x2, x3, x4, and x9, though the median probability model also
includes x7, x20, and x37.

7.3 Categorical Predictors and the Analysis of Variance


Categorical predictors (with ordered or unordered categories) commonly occur in gen-
eral regression settings, and in the more specific form of analysis of variance. They raise
particular issues in terms of predictor selection: whether a categorical predictor should
be retained or excluded in its entirety (group level selection), whether some categories
are retained but others excluded (within group selection), and whether categories need
to be distinguished within a particular categorical variable (Xu and Ghosh, 2015; Tutz
and Gertheiss, 2016). The latter issue can be expressed alternatively as whether categories
should be fused or merged.
For ordered predictors, a selection prior should enforce sparsity but also take account of
the ordering. Thus consider an ordinal predictor xj with Qj levels. Taking the first category
as reference with βj1 = 0 for identifiability, a regularising prior can be expressed in terms of
a prior on successive differences

$$\delta_{jq} = \beta_{j,q+1} - \beta_{jq}, \quad q = 1, \ldots, Q_j - 1.$$

For p ordinal predictors, the Lasso prior may be taken on p sets of differences:

$$\delta_{jq} \sim N(0, \tau^2\eta_j^2), \quad j = 1, \ldots, p;\; q = 1, \ldots, Q_j - 1,$$

$$\tau^2 \sim \mathrm{IG}(a, b),$$

$$\eta_j^2 \sim E(\lambda^2/2).$$

Wagner and Pauger (2016) propose Normal shrinkage with

$$\delta_{jq} \sim N(0, \tau_j^2), \quad j = 1, \ldots, p;\; q = 1, \ldots, Q_j - 1,$$

$$\tau_j^2 \sim \mathrm{IG}(a, b).$$

These types of penalty apply shrinkage to groups of parameters representing the same
categorical predictor, and also smooth over successive ordered categories. They can be
combined with spike-slab discrete mixture priors where the spike either sets coefficients
to zero, or scales the coefficient variances to very small positive values (i.e. effective
exclusion).
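
The link between successive differences and category coefficients can be sketched as follows. The difference values are hypothetical and the function name is illustrative; the point is that a zero difference fuses two adjacent ordered categories.

```python
def coefficients_from_differences(deltas):
    """Return [beta_1, ..., beta_Q] with beta_1 = 0 and beta_{q+1} = beta_q + delta_q."""
    betas = [0.0]
    for d in deltas:
        betas.append(betas[-1] + d)
    return betas

# Hypothetical differences for a predictor with Q = 4 ordered levels;
# the zero difference gives levels 3 and 4 the same coefficient
betas = coefficients_from_differences([0.5, 0.25, 0.0])
```

Shrinking a difference towards zero therefore smooths the coefficient profile over neighbouring levels, rather than shrinking each coefficient towards the reference category.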
For nominal predictors, the regularising prior can be applied to all possible contrasts
between categories (Wagner and Pauger, 2016). Thus, with βj1 = 0, consider contrasts

$$\delta_{jqr} = \beta_{jq} - \beta_{jr}, \quad q = 1, \ldots, Q_j - 1;\; q > r,$$

including the reference category. The prior is then

$$\delta_{jqr} \sim N(0, R_j\tau^2\eta_j^2),$$

with Rj = (Qj + 1)/2 reflecting the number of categories.


For a nominal factor with several categories, the number of contrasts increases consider-
ably, and one might instead assume a random prior over all categories (e.g. Albert, 1996;
Gelman, 2005), as in

$$\beta_{jq} \sim N(0, \tau_j^2), \quad q = 1, \ldots, Q_j,$$

with identifiability possibly enforced by centring (Tingley, 2012). Xu and Ghosh (2015)
consider the hierarchical group lasso representation

$$\beta_{jq} \sim N(0, \tau_j^2),$$

$$\tau_j^2 \sim \mathrm{Ga}\left(\frac{Q_j + 1}{2}, \frac{\lambda^2}{2}\right).$$

This scheme can be combined with a spike-slab discrete mixture with indicators γj, where
the spike option γj = 0 sets the entire group j coefficient set $(\beta_{j1}, \beta_{j2}, \ldots, \beta_{jQ_j})$ to zero. This is
consistent with a principle of group level sparsity. To allow for coefficient selection within
groups one can use products of selection indicators γjγjq, with γj and γjq following separate
Bernoulli densities (Xu and Ghosh, 2015, p.924).

7.3.1 Testing Variance Components


Bayesian methods may have benefit in analysis of variance applications beyond variable
selection considerations. Conventional analysis of variance (ANOVA) estimation does not
allow for factor combinations for which there are no observations (Tingley, 2012). Fixed
effects ANOVA estimation also does not allow predictions of a future observation (Geinitz
and Furrer, 2016). The analogue of F tests for effects of factor variables under classical
estimation is provided by comparisons of variance estimates.
Thus, in a balanced normal one-way analysis of variance, one has a factor (indexed by i)
with replicates (indexed by j)

$$y_{ij} = \alpha + \beta_i + e_{ij}, \quad i = 1, \ldots, n_I;\; j = 1, \ldots, n_J,$$

where $e_{ij} \sim N(0, \sigma_e^2)$, and factor effects βi are estimated either as fixed or random effects.
Classical assessment of the hypothesis $\beta_1 = \cdots = \beta_{n_I} = 0$ uses F tests comparing mean
squares due to the factor and the errors.
One possible perspective, broadening into multilevel applications (Geinitz et al., 2015),
considers the parameters of this model as random effects (Gelman, 2005). For example, as
a baseline representation for the one-way ANOVA,

$$y_{ij} \sim N(\beta_i, \sigma_e^2),$$

$$\beta_i \sim N(\alpha, \sigma_\beta^2),$$

with significant variation in the factor assessed by the comparison $\Pr(\sigma_\beta > \sigma_e \mid y)$, which
can be estimated from MCMC sampling. An analogous comparison may be made using
the marginal variance $s_\beta$, estimated from the sampled βi during MCMC runs. The latter
can be regarded as estimating the variance over the observed units, rather than some
broader population of units.
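
Both comparisons reduce to simple summaries of MCMC output, as the following sketch shows. The draws are made-up numbers for a factor with three levels, the function names are hypothetical, and the use of the n − 1 divisor for $s_\beta$ is an assumption.

```python
import statistics

def prob_factor_exceeds_error(sd_factor_draws, sigma_e_draws):
    """Proportion of iterations in which the factor sd exceeds sigma_e."""
    hits = sum(sb > se for sb, se in zip(sd_factor_draws, sigma_e_draws))
    return hits / len(sigma_e_draws)

def finite_population_sd(beta_draws):
    """s_beta at each iteration: sd of the sampled beta_i over the observed units."""
    return [statistics.stdev(betas) for betas in beta_draws]

# Hypothetical output from four MCMC iterations
sigma_beta = [2.0, 0.5, 3.0, 1.5]
sigma_e = [1.0, 1.0, 1.0, 1.0]
beta_draws = [[1.0, 3.0, 5.0], [2.0, 2.0, 2.0], [0.0, 4.0, 8.0], [1.0, 2.0, 3.0]]

p_sigma = prob_factor_exceeds_error(sigma_beta, sigma_e)   # uses the hyperparameter
s_beta = finite_population_sd(beta_draws)
p_s = prob_factor_exceeds_error(s_beta, sigma_e)           # uses the sampled effects
```

The two probabilities need not agree, since `sigma_beta` refers to a broader population of units while `s_beta` refers only to the units observed.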

Example 7.2 Horseshoe Crabs


As described in Brockmann (1996), horseshoe crabs arrive in pairs at particular beach
sites for springtime spawning, but unattached (“satellite”) males compete with attached
males for fertilisations. Some couples are ignored, while some attract many satellites,
and this is related to characteristics of females. The presence of satellites is described
by a binary variable with value 1 for at least one satellite. Potential predictors are the
female crab’s weight, carapace width, colour, and spine condition. The latter two pre-
dictors are ordinal, with colour having values 1 = dark, 2 = medium dark, 3 = medium,
and 4 = medium light, while spine condition has values 1 = both good, 2 = one worn or
broken, and 3 = both worn or broken. Older crabs tend to have darker shells.
An analysis (using rjags) without predictor selection shows the medium and medium
dark colour categories as significant in terms of a 95% credible interval entirely one side
of zero (Table 7.3). A second model uses Lasso shrinkage priors, with four $\eta_j^2$ parameters:
on weight and width, and on the collective differences applying to colour and
spine condition. Thus, for the two continuous (and standardised) predictors

$$\beta_j \sim N(0, \eta_j^2), \quad j = 1, 2,$$

while for the ordinal predictors

$$\beta_{j1} = 0,$$

$$\beta_{j,q+1} = \beta_{jq} + \delta_{jq}, \quad j = 3, 4;\; q = 1, \ldots, Q_j - 1,$$

$$\delta_{jq} \sim N(0, \eta_j^2),$$

with priors on hyperparameters

$$\eta_j^2 \sim E(\lambda^2/2), \quad j = 1, \ldots, 4,$$

$$\lambda \sim U(0.001, 100).$$

This model produces an improved penalised deviance (Plummer, 2008), at 197 com-
pared to 202 under the first model (though not a better Brier score). There is an enhanced
role for the width predictor, with a shrinkage in the coefficient for weight (Table 7.3).
This effect may reflect better control for impact of multicollinearity (the correlation
between weight and width is 0.89). There is also an attenuation in the impacts of the
colour categories, though there is still a 95% probability that the impact of medium
colour is positive.
Posterior inference may be sensitive to the prior for λ, with large values producing
overshrinkage (Xu and Ghosh, 2015). A sensitivity analysis assumes a prior λ ~ E(1). This
produces both a lower deviance and improved Brier score as compared to a model without
selection. There is still considerable shrinkage in the colour coefficients.

TABLE 7.3
Horseshoe Crabs. Logistic Regression for Satellite Presence

Without Selection
Predictor (Category)    Parameter    Mean     2.5%    97.5%    Pr(βj > 0)
Width                   β1           0.58    −0.27     1.44    0.91
Weight                  β2           0.51    −0.32     1.38    0.89
Spine (category 2)      β32         −0.11    −1.55     1.30    0.44
Spine (category 3)      β33          0.44    −0.57     1.45    0.81
Color (category 2)      β42          1.27     0.06     2.51    0.98
Color (category 3)      β43          1.66     0.54     2.79    1.00
Color (category 4)      β44          1.86    −0.05     3.90    0.97

Lasso Prior, λ ~ U(0.001,100)
Predictor (Category)    Parameter    Mean     2.5%    97.5%    Pr(βj > 0)
Width                   β1           0.53    −0.06     1.18    0.95
Weight                  β2           0.43    −0.12     1.10    0.92
Spine (category 2)      β32          0.01    −0.73     0.73    0.52
Spine (category 3)      β33          0.15    −0.48     0.93    0.66
Color (category 2)      β42          0.49    −0.16     1.54    0.90
Color (category 3)      β43          0.79    −0.06     1.96    0.95
Color (category 4)      β44          0.82    −0.23     2.40    0.91
Laplace scale           λ            3.4      0.9      8.4

Lasso Prior, λ ~ E(1)
Predictor (Category)    Parameter    Mean     2.5%    97.5%    Pr(βj > 0)
Width                   β1           0.54    −0.08     1.25    0.95
Weight                  β2           0.47    −0.17     1.16    0.92
Spine (category 2)      β32         −0.01    −0.83     0.77    0.51
Spine (category 3)      β33          0.18    −0.53     1.02    0.67
Color (category 2)      β42          0.58    −0.14     1.60    0.93
Color (category 3)      β43          0.93     0.01     2.06    0.98
Color (category 4)      β44          0.97    −0.20     2.53    0.94
Laplace scale           λ            2.1      0.7      4.2

Lasso Prior and Selection
Predictor (Category)    Parameter    Mean     2.5%    97.5%    Pr(βj > 0)
Width                   β1           0.59    −0.03     1.34    0.90
Weight                  β2           0.42    −0.11     1.26    0.82
Spine (category 2)      β32          0.04    −0.56     0.82    0.54
Spine (category 3)      β33          0.17    −0.44     1.01    0.65
Color (category 2)      β42          0.72    −0.09     2.00    0.88
Color (category 3)      β43          0.93    −0.02     2.24    0.94
Color (category 4)      β44          0.98    −0.20     2.63    0.92
Laplace scale           λ            1.6      0.4      3.6
Selection probability   ω            0.64     0.15     0.98

A third model introduces selection indicators γj, such that when γj = 0 the variances $\eta_j^2$
are scaled by a small constant ρ. This may be preset or assigned a prior centred on an
informative value. Here ρ = 0.0001, and the prior λ ~ E(1) is retained. This model enables
one to assess fusion probabilities, namely, that successive ordinal category coefficients
are equated. Focusing on the colour coefficients, we find a probability of 0.33 (fusecol
in the code) that $\beta_{42} = \beta_{43} = \beta_{44}$, amalgamating over iterations where β42 is retained or
excluded.

Example 7.3 Rails Data, Analysis of Variance


This dataset contains measurements of times taken by an ultrasonic wave to travel
the rail length; these data are included in the R package nlme
(https://stat.ethz.ch/R-manual/R-devel/library/nlme/html/Rail.html). There are three
replicates for six rails.
A first analysis assumes a normal random prior for the rails (the factor variable). This
provides posterior means (sd) for α, σβ, and sβ of 66.5 (10.3), 23.6 (7.5), and 24.7 (1.1). The
probabilities $\Pr(\sigma_\beta > \sigma_e \mid y)$ and $\Pr(s_\beta > \sigma_e \mid y)$ are both 1.
Estimates of hyperparameters (variability over the rails and of the grand mean) may
be affected by departures from the assumed normal hyperprior, and, in particular, by
outliers. Thus Figure 7.1 shows the discrepant profiles of the 95% intervals for βi. As one
alternative, a Student t scale mixture with known degrees of freedom (ν = 4) is considered,
whereby

$$\beta_i \sim N(\alpha, \sigma_\beta^2/\delta_i),$$

$$\delta_i \sim \mathrm{Ga}(2, 2).$$

This analysis provides δ2 = 0.66 as the lowest estimated scale parameter. Downweighting
the influence of rail 2 leads to a higher posterior mean (sd) for α of 67.3 (11.1), while σβ
is reduced to 19.7 (7.5).
A third analysis assumes a double exponential prior for βi,

$$\beta_i \sim \mathrm{DE}(\alpha, \lambda/\sigma),$$

combined with an exponential E(1) prior on λ. This provides a posterior mean (sd) for α
of 68.0 (13.8) and for λ of 0.22 (0.10).

[Figure 7.1: mean, 2.5% and 97.5% points of each βi (vertical axis, 0 to 120) plotted against Number of Rail (1 to 6).]

FIGURE 7.1
Profiles of β coefficients.

7.4 Regression for Overdispersed Data


Let $\{y_1, \ldots, y_n\}$ be observations from the exponential family density

$$p(y_i \mid \theta_i, \phi) = \exp\left[\frac{y_i\theta_i - b(\theta_i)}{a_i(\phi)} + c(y_i, \phi)\right]$$

with canonical parameter θi, dispersion function $a_i(\phi) = \phi/w_i$, and $w_i$ known. Under the
generalised linear model (GLM) framework, the mean of y is $\mu_i = E(y_i \mid \theta_i) = b'(\theta_i)$, predicted
via a monotone link function $g(\mu_i) = X_i\beta$, and the variance is

$$\mathrm{var}(y_i \mid \theta_i) = a_i(\phi)b''(\theta_i) = a_i(\phi)\,\mathrm{var}(\mu_i).$$

The exponential family includes as special cases the normal, binomial, Poisson, multino-
mial, negative binomial, exponential, and gamma densities.
However, GLM regressions often show a residual variance larger than expected under
the exponential family models, due to unknown omitted covariates, clustering in the orig-
inal units, or inter-subject variations in propensity (Zhou et al., 2012). Particular types of
response pattern (e.g. an excess proportion of zero counts as compared to the expected fre-
quency) may also cause overdispersion (Garay et al., 2015; Musio et al., 2010) (Section 7.6).
Without correction for such extra-variability, regression parameter estimates may be
biased, and their credible intervals will be too narrow, so that incorrect inferences about
significance may be obtained. The solution involves regression with additional random
effects to account for excess residual variation, and the focus in Markov chain Monte
Carlo (MCMC) is usually on the complete data likelihood, rather than the marginal model
obtained by integrating over the random effects.

7.4.1 Overdispersed Poisson Regression


Thus, the Poisson regression model for count data assumes that the mean and vari-
ance are equal, but overdispersion, as compared to the Poisson assumption, is routinely
encountered. As discussed in Chapter 4, the conjugate mixture model for count data is the
Poisson-gamma with

$$y_i \sim \mathrm{Po}(\mu_i),$$

$$\mu_i \sim \mathrm{Ga}(\alpha_i, \eta_i).$$

Denoting the mean of the μi as $\xi_i = \alpha_i/\eta_i$, one obtains $\mathrm{Var}(\mu_i) = \alpha_i/\eta_i^2 = \xi_i^2/\alpha_i$ and

$$\mathrm{Var}(y_i) = E[\mathrm{Var}(y_i \mid \mu_i)] + \mathrm{Var}[E(y_i \mid \mu_i)] = \xi_i + \xi_i^2/\alpha_i,$$

with increased overdispersion as αi becomes smaller. The mean is modelled by regression,
typically involving fixed effects only, with $\xi_i = \exp(\beta_0 + \beta_1x_{1i} + \cdots + \beta_px_{pi})$. Identification
requires constraints on the gamma mixture parameters, such as αi = α in the {ξi,αi}
parameterisation, namely $\mu_i \sim \mathrm{Ga}(\alpha, \alpha/\xi_i)$. Then with ϕ = 1/α, one has a quadratic variance
function, with

$$\mathrm{Var}(y_i) = E[\mathrm{Var}(y_i \mid \mu_i)] + \mathrm{Var}[E(y_i \mid \mu_i)] = \xi_i + \phi\xi_i^2.$$

Another possibility is to set $\mu_i = \xi_i\omega_i$ where $\omega_i \sim \mathrm{Ga}(\alpha, \alpha)$ so that the frailties average 1, with
variance ϕ = 1/α. Integrating out the ωi leads to a marginal negative binomial (NB2) density
for the yi, namely

$$p(y_i \mid \beta, \alpha) = \frac{\Gamma(\alpha + y_i)}{\Gamma(\alpha)\Gamma(y_i + 1)}\left(\frac{\alpha}{\alpha + \xi_i}\right)^{\alpha}\left(\frac{\xi_i}{\alpha + \xi_i}\right)^{y_i}.$$

Regarding the dispersion parameter, one may adopt a Ga(a,b) prior for α, with a = 1, and
with b ~ Ga(1,0.005) as an extra unknown (Fahrmeir and Osuna, 2006).
Setting

$$p_i = \frac{\alpha}{\alpha + \xi_i},$$

the negative binomial (NB2) can also be denoted as NB(pi, α). For example, Zhou et al. (2012)
propose predictor impacts be represented via a logit regression for pi, with the regression
including an additional error term to partly represent heterogeneity.
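
As a numerical check on the NB2 density, the sketch below evaluates it on a truncated support and confirms that it sums to 1, with mean $\xi_i$ and variance $\xi_i + \xi_i^2/\alpha$. The parameter values and the truncation point are arbitrary choices for illustration.

```python
import math

def nb2_logpmf(y, xi, alpha):
    """Log of the NB2 density with mean xi and dispersion phi = 1 / alpha."""
    return (math.lgamma(alpha + y) - math.lgamma(alpha) - math.lgamma(y + 1)
            + alpha * math.log(alpha / (alpha + xi))
            + y * math.log(xi / (alpha + xi)))

xi, alpha = 3.0, 2.0
support = range(0, 500)     # truncation point chosen so that the omitted tail is negligible
pmf = [math.exp(nb2_logpmf(y, xi, alpha)) for y in support]
total = sum(pmf)
mean = sum(y * p for y, p in zip(support, pmf))
var = sum((y - mean) ** 2 * p for y, p in zip(support, pmf))
```

Working on the log scale with `math.lgamma` avoids overflow in the gamma functions for large counts.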
More general negative binomial forms have been suggested, such as the NBk
(Winkelmann and Zimmermann, 1995), with variance function

$$\mathrm{Var}(y_i \mid X_i) = E[\mathrm{Var}(y_i \mid \mu_i)] + \mathrm{Var}[E(y_i \mid \mu_i)] = \xi_i + \phi\xi_i^{k+1}, \quad (k \ge -1)$$

obtained with the gamma prior,

$$\mu_i \sim \mathrm{Ga}\left(\frac{\xi_i^{1-k}}{\phi}, \frac{\xi_i^{-k}}{\phi}\right).$$

The values k = 0 and k = 1 lead to the NB1 and NB2 variance forms. The NB-P model (Greene,
2008) replaces α in the NB2 formulation by $\alpha\xi_i^{2-P}$, with P = 2 corresponding to the NB2 and
P = 1 to the NB1.
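
The correspondence between the NBk and NB-P forms can be verified directly: substituting $\alpha\xi_i^{2-P}$ for α in the NB2 variance $\xi_i + \xi_i^2/\alpha$ gives $\xi_i + \xi_i^P/\alpha$. A small sketch, with arbitrary illustrative values:

```python
def nbk_variance(xi, phi, k):
    """NBk variance function: xi + phi * xi^(k + 1)."""
    return xi + phi * xi ** (k + 1)

def nbp_variance(xi, alpha, P):
    """NB2 variance xi + xi^2 / alpha, with alpha replaced by alpha * xi^(2 - P)."""
    return xi + xi ** 2 / (alpha * xi ** (2 - P))

xi, alpha = 4.0, 2.0
phi = 1.0 / alpha
nb2_pair = (nbp_variance(xi, alpha, 2), nbk_variance(xi, phi, 1))   # quadratic form
nb1_pair = (nbp_variance(xi, alpha, 1), nbk_variance(xi, phi, 0))   # linear form
```

Each pair agrees, so P = k + 1 links the two parameterisations when ϕ = 1/α.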
Nonconjugate random mixture models are often adopted for count data (Kim et al.,
2002), as in

$$\log(\mu_i) = X_i\beta + \log(\varepsilon_i),$$

with lognormal or log-t distributed errors εi. The prior

$$\varepsilon_i \sim \mathrm{LN}\left(-\frac{\tau^2}{2}, \tau^2\right),$$

ensures E(εi) = 1, while variance matching priors can be adopted for the Poisson log-normal
and Poisson-gamma models (Millar, 2009). The nonconjugate approach is convenient when
multivariate, multiple, or multilevel random effects are to be considered. An example of
multiple effects is the convolution prior (Neyens et al., 2012) for area disease events yi with
expected totals Pi. Thus $y_i \sim \mathrm{Po}(P_i\mu_i)$, where

$$\log(\mu_i) = X_i\beta + \varepsilon_i + s_i$$

and both random effects {εi,si} may account for overdispersion, but the εi are unstructured
(i.e. exchangeable with regard to area identifiers), while the si are spatially structured.
For count regressions only involving an unstructured error, one may specify

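$$\mu_i = E(y_i \mid X_i, \varepsilon_i) = \exp(X_i\beta + \sigma\varepsilon_i),$$

with εi ~ N(0,1). Denoting $\nu_i = \exp(X_i\beta)$, the conditional mean (Greene, 2008) is

$$E(y_i \mid X_i) = E_\varepsilon[E(y_i \mid X_i, \varepsilon_i)] = \nu_i\exp(\sigma^2/2),$$

and the conditional variance is

$$\mathrm{Var}(y_i \mid X_i) = E_\varepsilon[\mathrm{Var}(y_i \mid X_i, \varepsilon_i)] + \mathrm{Var}_\varepsilon[E(y_i \mid X_i, \varepsilon_i)]$$

$$= \nu_i\exp(\sigma^2/2)\left\{1 + \nu_i\exp(\sigma^2/2)[\exp(\sigma^2) - 1]\right\}.$$

Taking $\phi = e^{\sigma^2} - 1$,

$$\mathrm{Var}(y_i \mid X_i) = E(y_i \mid X_i)\left[1 + \phi E(y_i \mid X_i)\right],$$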

showing that the variance has a quadratic form, as for the NB2 form of the negative
binomial.
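
These moment results can be checked by simulating only the random effects, since the Poisson conditional variance equals the conditional mean. The sketch below uses arbitrary illustrative values for $\nu_i$ and σ, and an arbitrary seed.

```python
import math
import random

nu, sigma = 2.0, 0.5          # illustrative nu_i = exp(X_i beta) and random effect sd
rng = random.Random(3)
n_draws = 100_000
mu = [nu * math.exp(sigma * rng.gauss(0.0, 1.0)) for _ in range(n_draws)]

# Law of total variance, using Var(y | mu) = E(y | mu) = mu for the Poisson
mc_mean = sum(mu) / n_draws
mc_var = mc_mean + sum((m - mc_mean) ** 2 for m in mu) / n_draws

# Closed forms from the text
mean = nu * math.exp(sigma ** 2 / 2)
phi = math.exp(sigma ** 2) - 1
var = mean * (1 + phi * mean)
```

The Monte Carlo values `mc_mean` and `mc_var` should agree with `mean` and `var` up to simulation error.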

Example 7.4 School Attendance Data


This example considers Poisson overdispersion in terms of different negative bino-
mial regression forms. The dataset concerns number of days absent for 314 high school
juniors at two schools, with predictors including a standardised maths test score, gen-
der (1 = Female, 0 = Male), and type of program enrolled on (general, academic, voca-
tional). Overdispersion in the data is apparent with variance of 49.5 exceeding the mean
of 6. A simple Poisson regression using glm in R has a scaled deviance of 1747.
By contrast, under a Poisson-gamma mixture (NB2) one has

$$y_i \sim \mathrm{Po}(\xi_i\omega_i),$$

$$\omega_i \sim \mathrm{Ga}(\alpha, \alpha),$$

$$\log(\xi_i) = \beta_1 + \beta_2\mathrm{Math}_i + \beta_3\mathrm{Female}_i + \beta_4 I(\mathrm{Academic})_i + \beta_5 I(\mathrm{Vocational})_i.$$

This can be estimated using a Poisson-gamma hierarchical likelihood or a negative
binomial NB2 likelihood, with the former approach having the benefit of providing
observation specific frailties. These models are estimated using rstan.
With a N(0,1) prior on βj (j = 2,…5) and $\alpha \sim \mathrm{Ga}(1, 0.01)$, the posterior mean scaled
deviance under the Poisson-gamma model is 330.1, with posterior mean (95% CrI) for ϕ
of 1.04 (0.84, 1.28), so the model accounts for Poisson overdispersion. The model also
achieves a satisfactory representation of the data, since the mixed predictive strategy of
Marshall and Spiegelhalter (2007) shows a relatively low number (18/314) of pupils with
probabilities of overprediction over 0.95 or under 0.05. This model suggests the effect of
gender, β3, is marginally significant with 95% credible interval straddling zero.
A variant of the NB2 model (and of other overdispersion models) takes dispersion
parameters as case-specific and determined by regression (Barreto-Souza and Simas,
2016). Thus, one has NB(pi, αi), where log(αi) = Xiδ. This model shows dispersion to be
greater among general program students, though shows no benefit in terms of fit, with
higher LOO-IC than the Poisson-gamma model.
The suitability of NB1 or NB2 forms of negative binomial regression may be assessed
using the NBP (negative binomial P) model. To assess this, an exponential prior centred
at 2 is used for P following the Greene (2008) approach. We obtain posterior mean (95%
interval) for P of 1.23 (0.73,1.65). This excludes 2, and so does not favour the NB2 param-
eterisation. The LOO-IC is improved as compared to the NB2 model (Poisson-gamma
version), namely 1540 vs 1559. Substantive inferences are affected in that the gender
effect β3 now has an entirely positive 95% credible interval, (0.02,0.45).

7.4.2 Overdispersed Binomial and Multinomial Regression


Binomial regression with excess variation may occur when responses are arranged in clus-
ters and responses from the same cluster are correlated: examples occur in teratological
studies (e.g. when the observation unit is a litter of animals, and litters differ in terms of
unknown genetic factors). A conjugate approach to such overdispersion involves a beta-
distributed success probability, leading to a beta-binomial regression model (Kahn and
Raftery, 1996).
Thus with $y_i \sim \mathrm{Bin}(n_i, p_i)$, one assumes

$$p_i \sim \mathrm{Be}(\gamma\pi_i, (1 - \pi_i)\gamma)$$

with mean πi and variance

$$\pi_i(1 - \pi_i)/(\gamma + 1),$$

where $\gamma \ge 0$. Regression involves a logit or other link,

$$g(\pi_i) = X_i\beta.$$

Setting $\varphi = (\gamma + 1)^{-1}$, the unconditional variance of a beta-binomial response is of the form

$$\mathrm{Var}(y_i) = n_i\pi_i(1 - \pi_i)[1 + (n_i - 1)\varphi].$$
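
The variance formula can be confirmed numerically from the beta-binomial probability function. In the sketch below the values of $n_i$, $\pi_i$ and γ are arbitrary, and the function name is illustrative.

```python
import math

def betabin_logpmf(y, n, pi, gamma):
    """Beta-binomial log pmf with p ~ Be(gamma * pi, gamma * (1 - pi))."""
    a, b = gamma * pi, gamma * (1.0 - pi)
    log_choose = math.lgamma(n + 1) - math.lgamma(y + 1) - math.lgamma(n - y + 1)
    log_beta_ratio = (math.lgamma(y + a) + math.lgamma(n - y + b) - math.lgamma(n + a + b)
                      - math.lgamma(a) - math.lgamma(b) + math.lgamma(a + b))
    return log_choose + log_beta_ratio

n, pi, gamma = 10, 0.3, 4.0
pmf = [math.exp(betabin_logpmf(y, n, pi, gamma)) for y in range(n + 1)]
mean = sum(y * p for y, p in enumerate(pmf))
var = sum((y - mean) ** 2 * p for y, p in enumerate(pmf))

phi = 1.0 / (gamma + 1.0)
formula = n * pi * (1 - pi) * (1 + (n - 1) * phi)
```

Here the binomial variance would be 2.1, whereas the beta-binomial variance is inflated by the factor 1 + 9/5 = 2.8.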

Nonconjugate random mixture models are often adopted for binomial data, with normal
or Student t errors in the regression link. The presence of an error term permits predic-
tor selection using a g-prior approach (Gerlach et al., 2002; Kinney and Dunson, 2007) in
mixed logistic models. For y binomial or binary with probabilities pi, and n observations,
one might specify

$$\mathrm{logit}(p_i) = X_i\beta + \varepsilon_i,$$

$$\varepsilon_i \sim N(0, \sigma^2)$$

with

$$\beta \sim N(B, \sigma^2 g(X'X)^{-1}),$$

and g an unknown, with prior such as $g \sim \mathrm{IG}(0.5, 0.5n)$ (Zellner and Siow, 1980; Perrakis
et al., 2015). This prior can be combined with spike-slab binary selection indicators. For y
binary, and data augmentation (Section 7.5), the g-prior can also be used.
For multinomial data (e.g. on voting patterns yij for parties j by constituency i), over-
dispersion may occur when choice probabilities vary between the Ni individuals in each
observation unit, but clusters of individuals within each unit have similar probabilities.
The individual-level factors associated with such clustering are not observed, so a ran-
dom effect will proxy such unobserved factors; for example, voters with different educa-
tion levels may differ in their voting preferences, but only the average education in each
constituency is observed. The raw percentages yij/Ni are also likely to show erratic fea-
tures, whereas hierarchical models for pooling strength provide frequency smoothing and
model interdependencies between categories.
This form of data may be modelled as a product multinomial likelihood conditioning on
known Ni = yi+. With probabilities πij of choices j = 1,… J , the sampling model is

$$y_{ij} \sim \mathrm{Mult}(N_i, [\pi_{i1}, \ldots, \pi_{iJ}]), \quad i = 1, \ldots, n.$$

The conjugate approach for such heterogeneity is the multinomial-Dirichlet mixture,
where the Dirichlet is the multivariate generalisation of the beta density. However, the
Dirichlet has a restricted covariance structure when there are dependencies between the
response categories j within units i. For example, for n constituencies and J political par-
ties, one may expect both negative and positive correlations between πij for different par-
ties. Greater flexibility is provided by modelling heterogeneity within the regression link,
as in random effects multiple logit models (Hensher and Greene, 2003), or via multinomial
probit models.
Under the multiple logit form, define a J − 1 dimensional random effect $\alpha_i = (\alpha_{i1}, \ldots, \alpha_{i,J-1})$
representing subject or unit level intercepts; these might be exchangeable or correlated (if,
say, the units were areas and behaviours were spatially clustered). Then with Xi excluding
an intercept,

$$\pi_{ij} = \frac{\exp(\alpha_{ij} + X_i \beta_j)}{\sum_{k=1}^{J} \exp(\alpha_{ik} + X_i \beta_k)}$$

with $\alpha_{iJ} = \beta_J = 0$ for identification. For example, one may assume

$$(\alpha_{i1}, \ldots, \alpha_{i,J-1}) \sim N_{J-1}(H_i, D),$$

where D is an unknown covariance matrix, Hi = (Hi1,…,Hi,J−1), where $H_{ij} = A_j + X_i \beta_j$, and Aj is
the intercept for choice j.
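As a minimal sketch of the multiple logit form above, the following Python function computes the choice probabilities for one unit, with category J as the reference (intercept and coefficients fixed at zero). The function name and the numerical inputs are hypothetical and chosen only for illustration.

```python
import math

def choice_probabilities(alpha_i, x_i, beta):
    """Return [pi_i1, ..., pi_iJ] for one unit under the multiple logit form.

    alpha_i : list of J-1 random intercepts (alpha_iJ is fixed at 0)
    x_i     : list of p predictor values (no intercept)
    beta    : list of J-1 coefficient lists, each of length p (beta_J is fixed at 0)
    """
    # Linear predictors for the J-1 free categories, plus 0 for reference category J.
    eta = [a + sum(x * b for x, b in zip(x_i, b_j)) for a, b_j in zip(alpha_i, beta)]
    eta.append(0.0)
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(eta)
    weights = [math.exp(e - m) for e in eta]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical values: J = 3 categories, p = 2 predictors.
probs = choice_probabilities(
    alpha_i=[0.2, -0.5],
    x_i=[1.0, 0.3],
    beta=[[0.4, -0.1], [0.0, 0.8]],
)
print(probs)
```

The probabilities sum to one by construction, and the ratio of any probability to that of the reference category equals the exponential of the corresponding linear predictor.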
This multiple logit may also be fitted by Poisson regression using the fact that a multinomial
is equivalent to a set of independent Poisson counts conditional on their fixed total. This involves
defining n fixed effect predictors ai to ensure the unit totals Ni are maintained. Thus
$y_{ij} \sim \text{Po}(\mu_{ij})$, with

$$\log(\mu_{ij}) = a_i + \alpha_{ij} + X_i \beta_j,$$

for j = 1, …, J, where the ai would typically be fixed effects assigned vague priors, e.g.
$a_i \sim N(0, 1000)$.
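The Poisson-multinomial equivalence used here can be checked numerically. The sketch below, with hypothetical means and counts, compares the joint probability of independent Poisson counts, conditional on their total, with the multinomial probability whose cell probabilities are the normalised Poisson means.

```python
import math

def poisson_pmf(y, mu):
    """Poisson probability Pr(Y = y) with mean mu, computed on the log scale."""
    return math.exp(-mu + y * math.log(mu) - math.lgamma(y + 1))

def multinomial_pmf(counts, probs):
    """Multinomial probability of the count vector given cell probabilities."""
    log_p = math.lgamma(sum(counts) + 1)
    for y, p in zip(counts, probs):
        log_p += y * math.log(p) - math.lgamma(y + 1)
    return math.exp(log_p)

mu = [2.0, 5.0, 3.0]      # hypothetical Poisson means for J = 3 categories
counts = [1, 6, 2]        # hypothetical observed counts
total_mu = sum(mu)

# Joint Poisson probability divided by the Poisson probability of the total.
joint_poisson = math.prod(poisson_pmf(y, m) for y, m in zip(counts, mu))
conditional = joint_poisson / poisson_pmf(sum(counts), total_mu)

# Multinomial probability with probabilities mu_j / sum(mu).
multinomial = multinomial_pmf(counts, [m / total_mu for m in mu])
print(conditional, multinomial)
```

The two quantities agree up to floating point error, which is why unit-specific fixed effects that reproduce the totals Ni leave inferences on the remaining parameters unchanged.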

Example 7.5 Voting in Florida; Multinomial Overdispersion


This example considers normal random effects to model multinomial overdispersion
via multiple logit links. The analysis relates to the 2000 US presidential election voting
data yij for i = 1,… , 67 Florida polling districts and with Ni denoting total votes (Mebane
and Sekhon, 2004). There are J = 5 choices of candidate (Buchanan, Nader, Gore, Bush,
other) and three predictors:

a) x1, the proportion of each county’s votes for different presidential candidates in
1996;
b) x2, changes between 1996 and 2000 in party registration;
c) x3, percentage of census population Cuban in district i.

Specification of x1 and x2 (predictors specific for area and candidate) follows Mebane
and Sekhon (2004), but x3 differs from their variable. Mebane and Sekhon (2004) find
substantial overdispersion in these data.
The sampling model for the random effects multinomial is

$$y_{ij} \sim \text{Mult}(N_i, [\pi_{i1}, \ldots, \pi_{i5}]), \quad i = 1, \ldots, n$$

$$\pi_{ij} = \phi_{ij} \Big/ \sum_{k=1}^{J} \phi_{ik}$$

and to account for overdispersion, normal effects $\alpha_i = \{\alpha_{i1}, \ldots, \alpha_{i(J-1)}\}$ are included in
multiple logit links. These have non-zero means Aj, namely the intercepts for the first
four candidate choices. In a hierarchical parameterisation

$$\log(\phi_{ij}) = \alpha_{ij},$$

$$(\alpha_{i1}, \ldots, \alpha_{i,J-1}) \sim N_{J-1}(H_{i1}, \ldots, H_{i,J-1}; D),$$

$$H_{ij} = A_j + X_i \beta_j,$$

$$\log(\phi_{iJ}) = 0,$$

with $\{\beta_{jk};\ j = 1, \ldots, 4,\ k = 1, \ldots, 3\}$ and Aj assigned diffuse priors, and the precision matrix D−1
assigned a Wishart prior with identity scale matrix and J − 1 = 4 degrees of freedom².
MCMC convergence is considerably assisted by the hierarchical parameterisation
above and by centring predictors. Inferences are based on the second half of a two-chain
run of 20,000 iterations. The β coefficients show 1996 voting to influence later voting,
except for Nader, while change in party registration is important for all candidates,
except Gore. The proportion of Cuban-Americans has a positive effect on voting for
Gore and Bush. A posterior predictive check comparing chi-square values for replicate
and observed data is satisfactory at around 0.49.
Posterior predictive checks are not satisfactory when a fixed effects only model is
applied with

$$\log(\phi_{ij}) = A_j + X_i \beta_j, \quad j = 1, \ldots, J - 1.$$

There is then zero posterior probability that $\chi^2(y_{rep}, \theta) > \chi^2(y, \theta)$. Standard deviations
of predictor effects are also considerably understated if allowance is not made for excess
variation, and, in fact, all coefficient effects are significant (95% CRI either entirely positive
or negative) under this model.

7.5 Latent Scales for Binary and Categorical Data


Sampling and inference in Bayesian general linear models are complicated to the extent
that conjugate priors are only available for normal regression (Holmes and Held, 2006). The
auxiliary variable approach circumvents this by introducing latent continuous responses
underlying binary or categorical observations, resulting in a specification (including priors)
that effectively replicates normal regression. This provides simplified MCMC sampling
and improved residual tests.
Consider first binary responses, and assume latent metric data y* such that y = 1 when
y* > 0 and y = 0 when y* ≤ 0 (Hooten and Hobbs, 2015; Albert and Chib, 1993). In economic
choice applications (e.g. regarding economic participation or not), the latent scale y* arises
by comparing utilities U1i and U0i of options 1 and 0 with

$$U_{ji} = V_{ji} + \varepsilon_{ji} = X_i \beta_j^* + \varepsilon_{ji},$$

$$y_i^* = U_{1i} - U_{0i}.$$

Under the above scheme, one has

$$\Pr(y_i = 1) = \Pr(y_i^* > 0) = \Pr(\varepsilon_{0i} - \varepsilon_{1i} < V_{1i} - V_{0i}) = \Pr(\varepsilon_{0i} - \varepsilon_{1i} < X_i \beta)$$

where $\beta = \beta_1^* - \beta_0^*$. Alternative forms for ε lead to different links: taking εji to be normal with
mean zero and variance σ2 leads to a probit link with $\Pr(y_i = 1) = \Phi(X_i \beta / \sigma)$. It is apparent
that β and σ cannot be separately identified, and the commonest identifying device takes
σ2 = 1.
A probit regression with binary responses yi may therefore be obtained by truncated
normal sampling for $y_i^*$, with the form of constraint determined by the observed y. Thus, if
yi = 1, $y_i^*$ is constrained to be positive, and sampled from a normal with mean Xiβ (including
an intercept in p-dimensional Xi) and variance 1. If yi = 0, $y_i^*$ is sampled from the same density,
but constrained to be negative. With a normal prior on the coefficients $\beta \sim N_p(B_0, V_0)$,
the full conditional distribution of β is also normal, namely

$$\beta \mid y^* \sim N(B, V)$$

$$B = V(V_0^{-1} B_0 + X'y^*)$$

$$V = (V_0^{-1} + X'X)^{-1}.$$
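The two-step scheme just described (truncated normal draws for the latent data, then a normal draw for the coefficient) can be sketched for the simplest case of a single predictor and no intercept, so that B and V are scalars. This is an illustrative sketch on simulated data, not the code used in the book; the function names, prior settings, and true coefficient of 1 are assumptions, and the inverse-CDF truncated normal draw is adequate only when the linear predictor is not extreme.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()

def draw_truncated_normal(mean, positive, rng):
    """Draw from N(mean, 1) restricted to (0, inf) if positive, else to (-inf, 0]."""
    p0 = STD_NORMAL.cdf(-mean)                 # Pr(y* <= 0) under N(mean, 1)
    lo, hi = (p0, 1.0) if positive else (0.0, p0)
    u = lo + (hi - lo) * rng.random()
    u = min(max(u, 1e-12), 1.0 - 1e-12)        # keep inv_cdf away from 0 and 1
    return mean + STD_NORMAL.inv_cdf(u)

def probit_gibbs(x, y, n_iter=600, burn_in=100, b0=0.0, v0=100.0, seed=1):
    """Gibbs sampler for a one-coefficient probit model with prior beta ~ N(b0, v0)."""
    rng = random.Random(seed)
    v = 1.0 / (1.0 / v0 + sum(xi * xi for xi in x))      # V = (V0^-1 + X'X)^-1
    beta, draws = 0.0, []
    for it in range(n_iter):
        # Step 1: latent y* given beta, truncated according to the observed y.
        y_star = [draw_truncated_normal(xi * beta, yi == 1, rng) for xi, yi in zip(x, y)]
        # Step 2: beta given y*, with B = V (V0^-1 B0 + X'y*).
        b = v * (b0 / v0 + sum(xi * ys for xi, ys in zip(x, y_star)))
        beta = rng.gauss(b, math.sqrt(v))
        if it >= burn_in:
            draws.append(beta)
    return draws

# Simulated data with a true coefficient of 1 (an assumption for illustration).
data_rng = random.Random(7)
x = [data_rng.gauss(0.0, 1.0) for _ in range(400)]
y = [1 if xi * 1.0 + data_rng.gauss(0.0, 1.0) > 0 else 0 for xi in x]
draws = probit_gibbs(x, y)
posterior_mean = sum(draws) / len(draws)
print(posterior_mean)
```

With several predictors the same steps apply, but the draw of β requires a multivariate normal sample using a Cholesky factor of V.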

Improved MCMC mixing is obtained by updating y* and β jointly (Holmes and Held,
2006), and justified by the factorisation,

$$p(\beta, y^* \mid y) = p(y^* \mid y)\, p(\beta \mid y^*)$$

where updating of β is as above, but y* is updated from its marginal distribution integrated
over β.
Heavier tailed links are obtained by sampling yi* directly from a Student t with ν degrees
of freedom, or by using the scale mixture version of the Student t density (Chang et al., 2006).

This again involves constrained normal sampling but with gamma subject-specific precisions
$\lambda_i \sim \text{Ga}(\nu/2, \nu/2)$, so that

$$y_i^* \sim N(X_i \beta, 1/\lambda_i)\, I(0, \infty), \quad \text{when } y_i = 1$$

$$y_i^* \sim N(X_i \beta, 1/\lambda_i)\, I(-\infty, 0), \quad \text{when } y_i = 0.$$

Skew densities for ε have also been proposed, such as a skew-probit link (Bazan et al., 2010)
with augmentation scheme

$$y_i^* = X_i \beta + \varepsilon_i,$$

$$\varepsilon_i = \sigma\left[-\varphi V_i - (1 - \varphi^2)^{0.5} W_i\right],$$

where Vi is half normal, $V_i \sim N^+(0, 1)$, $W_i \sim N(0, 1)$, $\varphi \sim U(-1, 1)$, and σ = 1 for identifiability.
In hierarchical form, one has

$$y_i^* \sim N(X_i \beta - \varphi V_i,\ 1 - \varphi^2).$$

Taking ε to be logistic, a logit regression is obtainable (e.g. Holmes and Held, 2006), by the
augmentation scheme

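$$y_i^* \sim \text{Logist}(X_i \beta, 1)\, I(0, \infty), \quad \text{when } y_i = 1$$

$$y_i^* \sim \text{Logist}(X_i \beta, 1)\, I(-\infty, 0), \quad \text{when } y_i = 0$$

where $y \sim \text{Logist}(\mu, \tau)$, namely

$$p(y \mid \tau, \mu) = \frac{\tau \exp(\tau[y - \mu])}{\{1 + \exp(\tau[y - \mu])\}^2}$$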

with variance $\kappa^2/\tau^2$, where $\kappa^2 = \pi^2/3$. A logit link can be approximated by Student t sampling
when ν = 8, or equivalently by scale mixture normal sampling with $\lambda_i \sim \text{Ga}(\nu/2, \nu/2)$,
combined with constrained sampling according to the observed y values. Specifically, a t8
variable is approximately 0.634 times a logistic variable, so that

$$y_i^* \sim t_8\left(X_i \beta, \frac{1}{0.634^2}\right) I(0, \infty), \quad \text{when } y_i = 1$$

$$y_i^* \sim t_8\left(X_i \beta, \frac{1}{0.634^2}\right) I(-\infty, 0), \quad \text{when } y_i = 0.$$

Equivalently with $\lambda_i \sim \text{Ga}(4, 4)$,

$$y_i^* \sim N\left(X_i \beta, \frac{1}{\lambda_i (0.634)^2}\right) I(0, \infty), \quad \text{when } y_i = 1$$

$$y_i^* \sim N\left(X_i \beta, \frac{1}{\lambda_i (0.634)^2}\right) I(-\infty, 0), \quad \text{when } y_i = 0.$$

A different approximation follows since a t7.3 variable is approximately 0.647 times a
logistic variable (Kinney and Dunson, 2007). So, with $\lambda_i \sim \text{Ga}(\nu/2, \nu/2)$, where ν = 7.3, and
$\sigma^2 = \pi^2(\nu - 2)/(3\nu)$,

$$y_i^* \sim N(X_i \beta, \sigma^2/\lambda_i)\, I(0, \infty), \quad \text{when } y_i = 1$$

$$y_i^* \sim N(X_i \beta, \sigma^2/\lambda_i)\, I(-\infty, 0), \quad \text{when } y_i = 0.$$
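The scale σ in this approximation comes from equating the variance of a scaled t variable, σ²ν/(ν − 2), with the logistic variance π²/3. A small sketch of that arithmetic follows (the function name is hypothetical). For ν = 7.3 the reciprocal of σ reproduces the quoted 0.647; for ν = 8 pure variance matching gives a value near 0.637, close to but not identical with the 0.634 quoted earlier, which presumably rests on a different matching criterion.

```python
import math

def t_scale_matching_logistic(nu):
    """Scale sigma such that sigma times a t_nu variable has the logistic variance pi^2/3."""
    sigma2 = math.pi ** 2 * (nu - 2.0) / (3.0 * nu)
    return math.sqrt(sigma2)

for nu in (7.3, 8.0):
    sigma = t_scale_matching_logistic(nu)
    # 1/sigma is the factor by which a logistic variable is scaled to resemble a t_nu variable.
    print(nu, round(sigma, 4), round(1.0 / sigma, 4))
```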

Logit models relate responses yi = 0 or 1 to predictors Xi through proportional exponential
functions of regressors,

$$\Pr(y_i = k) \propto \exp\{X_i \beta_k\}, \quad k = 0, 1,$$

with $\beta_0 = 0$ and $\beta_1 = \beta$, and a latent exponential variable version (Scott, 2011) of the logit link involves
sampling {z0i,z1i} from exponential densities E(λji), with rate parameters λ0i = 1 and $\lambda_{1i} = \exp(X_i \beta)$.
If $y_i = \arg\min_k z_{ki}$, then $\Pr(y_i = k \mid X_i) \propto \lambda_{ki}$ as under a logit regression. This principle
extends to multiple logit regression by sampling $\{z_{0i}, z_{1i}, \ldots, z_{J-1,i}\}$.
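The latent exponential representation can be illustrated by simulation: the share of draws in which the exponential variable with rate exp(Xiβ) is the smaller of the two should approach the logistic probability. The sketch below uses a hypothetical linear predictor of 0.7; the function name and settings are assumptions.

```python
import math
import random

def simulate_choice_frequency(linear_predictor, n_draws=100_000, seed=3):
    """Share of draws in which z1 ~ E(exp(X beta)) is smaller than z0 ~ E(1)."""
    rng = random.Random(seed)
    rate1 = math.exp(linear_predictor)
    wins = 0
    for _ in range(n_draws):
        z0 = rng.expovariate(1.0)      # rate lambda_0i = 1
        z1 = rng.expovariate(rate1)    # rate lambda_1i = exp(X_i beta)
        if z1 < z0:
            wins += 1
    return wins / n_draws

eta = 0.7                                            # hypothetical value of X_i beta
simulated = simulate_choice_frequency(eta)
logit_probability = 1.0 / (1.0 + math.exp(-eta))     # Pr(y_i = 1) under the logit link
print(simulated, logit_probability)
```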
Augmented data sampling for the logit model can also be achieved using a discrete mixture
approximation of the type 1 extreme value error (Fruhwirth-Schnatter and Fruhwirth,
2010). With U0i and U1i as utilities of categories 0 and 1, and

$$U_{1i} = X_i \beta + \varepsilon_i$$

the binary logit is obtained when U0i and εi follow type 1 extreme value distributions.
Using the relation between the exponential and type 1 extreme value distributions, and with
$\nu_i = \exp(X_i \beta)$, one has

$$\exp(-U_{0i}) \sim E(1), \quad \exp(-U_{1i}) \sim E(\nu_i),$$

with the minimum of these variables also exponential,

$$\min[\exp(-U_{0i}), \exp(-U_{1i})] \sim E(1 + \nu_i).$$

When yi = 1, one has $U_{1i} > U_{0i}$, or equivalently $\exp(-U_{1i}) < \exp(-U_{0i})$, so that

$$\exp(-U_{1i}) \sim E(1 + \nu_i).$$

When yi = 0, one has $U_{0i} > U_{1i}$, or equivalently $\exp(-U_{0i}) < \exp(-U_{1i})$, so that

$$\exp(-U_{0i}) \sim E(1 + \nu_i),$$

$$\exp(-U_{1i}) = \exp(-U_{0i}) + \delta_i,$$

where $\delta_i \sim E(\nu_i)$.
A useful diagnostic feature resulting from the augmented data approach is that the residuals
$y_i^* - X_i \beta$ are nominally a random sample from the assumed cumulative distribution

for ε (Johnson and Albert, 1999). So for the latent data probit, $\varepsilon_i = y_i^* - X_i \beta$ is approximately
N(0,1) if the model is appropriate for case i, whereas if the posterior distribution of εi is significantly
different from N(0,1) then the model conflicts with the observed y. So one might
obtain the probability $\Pr(|\varepsilon_i| > 2 \mid y)$ and compare to its prior value of 0.045. For the latent
data logit, one may obtain $\Pr(|\varepsilon_i|/\kappa > 2 \mid y)$.

Example 7.6 Low Birthweight


This example illustrates data augmentation and predictor selection for binary responses.
The data concern low birthweight and are taken from the R library glmulti. There are n = 189
observations with binary response 1 for birth weight under 2.5 kg, 0 otherwise. Potential
predictors are mother’s age (standardised), lwt (mother’s weight at last menstrual period,
standardised), race (1 = white, 2 = black, 3 = other), smoke (smoking status during pregnancy),
ptl (number of previous premature labours), ht (history of hypertension), ui
(presence of uterine irritability), and ftv (number of physician visits during the first
trimester).
Collinearity in this application has been noted, though the interactions ui*smoke
and ftv*age are mentioned as potentially significant by Calcagno and de Mazancourt
(2010), giving p = 11 predictors. Probit and logit regressions with data augmentation, and
initially without any predictor selection, are implemented using R2OpenBUGS. These
show lwt, smoke, ht, ui, and ftv*age as having 95% credible intervals not straddling
zero. However, the ethnic categories (black, other) and the interaction ui*smoke also
have probabilities $\Pr(\beta_j > 0 \mid y)$ over 0.95 or under 0.05 (Table 7.4).
As one approach to predictor selection, a horseshoe prior is applied as part of a probit
regression. Table 7.4 shows the consequent shrinkage in coefficient estimates, together
with a clearer indication of significance, with only lwt, smoke, ui, ht, and ftv*age
having probabilities $\Pr(\beta_j > 0 \mid y)$ over 0.95 or under 0.05.
Using a g-prior method, combined with spike-slab selection, gives a more parsimonious
result. An inverse gamma IG(0.5,0.5n) prior on the g parameter is adopted, implying
a Cauchy prior on regression coefficients. With a prior retention probability of 0.5, only
the ftv*age interaction has a posterior retention probability γj over 0.95. Adjusting the
g-prior to include a ridge parameter (set to 1/p) as in Baragatti and Pommeret (2012)
does not greatly affect the results.
A final analysis considers the skew probit model, but there is no evidence supporting
skewed errors, with the parameter φ having posterior mean close to zero. Although
there is a predominance of 1 responses in the data, using a skew regression also does
not markedly affect the regression coefficients as compared to standard probit regression
(cf. Pérez-Sánchez et al., 2018).

7.5.1 Augmentation for Ordinal Responses


Suppose that a categorical response yi has J categories, with the observations measuring a
latent response y* according to the model

$$y_i = j \quad \text{if } \alpha_{j-1} \le y_i^* < \alpha_j.$$

The αj are cutpoints dividing the values of y* according to the observed y values (Bürkner
and Vuorre, 2018). The regression in the latent data is then

$$y_i^* = X_i \beta_j + \varepsilon_{ji}$$

TABLE 7.4
Birthweight Data. Probit Regressions

                       Probit                          Probit Horseshoe                          Probit g-prior
Predictor  Name        β Mean   β St devn  Pr(βj > 0)  β Mean   β St devn  κj Mean   Pr(βj > 0)  β Mean   β St devn  γ Mean
x1         Age         0.26     0.16       0.948       0.14     0.14       0.77      0.843       0.06     0.12       0.38
x2         Lwt         −0.36    0.14       0.003       −0.26    0.14       0.72      0.023       −0.22    0.16       0.78
x3         Black       0.59     0.33       0.960       0.35     0.32       0.66      0.878       0.11     0.23       0.28
x4         Other       0.51     0.28       0.972       0.32     0.27       0.68      0.889       0.11     0.21       0.33
x5         Smoke       0.81     0.27       1.000       0.51     0.27       0.60      0.974       0.31     0.28       0.67
x6         Ptl         0.29     0.20       0.924       0.27     0.20       0.70      0.920       0.23     0.24       0.59
x7         Ht          1.20     0.44       0.997       0.84     0.44       0.51      0.980       0.83     0.55       0.84
x8         Ui          1.11     0.40       0.996       0.62     0.36       0.57      0.965       0.30     0.38       0.48
x9         Ftv         −0.09    0.13       0.260       −0.04    0.10       0.82      0.328       −0.01    0.05       0.20
x10        ui*smoke    −1.00    0.57       0.042       −0.30    0.47       0.67      0.271       −0.03    0.31       0.26
x11        ftv*age     −0.49    0.15       0.000       −0.38    0.14       0.64      0.002       −0.31    0.13       0.96


where εji is usually either normally or logistically distributed (Albert and Chib, 2001). So
$P(\varepsilon) = \Phi(\varepsilon)$, where Φ is the cumulative normal function, or $P(\varepsilon) = 1/(1 + \exp(-\varepsilon))$, the cumulative
logistic.
The corresponding model for cumulative probabilities is

$$\Pr(y_i^* \le \alpha_j) = \Pr(X_i \beta_j + \varepsilon_{ji} \le \alpha_j) = \Pr(\varepsilon_{ji} \le \alpha_j - X_i \beta_j).$$

Thus

$$\Pr(y_i^* \le \alpha_j) = \Phi(\alpha_j - X_i \beta_j),$$

or

$$\Pr(y_i^* \le \alpha_j) = 1/(1 + \exp(-[\alpha_j - X_i \beta_j])),$$

according to the assumed form for εji. Let $\gamma_{ji} = \Pr(y_i^* \le \alpha_j)$, then

$$\Pr(y_i = j) = \Pr(\alpha_{j-1} \le y_i^* < \alpha_j) = \gamma_{ji} - \gamma_{j-1,i}.$$

The probability that yi = 1, namely

$$\Pr(y_i = 1) = \Pr(\alpha_0 \le y_i^* < \alpha_1) = \gamma_{1i},$$

is obtained by setting α0 = −∞, while the probability that yi = J,

$$\Pr(y_i = J) = \Pr(\alpha_{J-1} \le y_i^* < \alpha_J) = 1 - \gamma_{J-1,i}$$

is obtained by setting αJ = ∞.


Assuming Xi excludes an intercept, the remaining J − 1 cut points $\{\alpha_1, \alpha_2, \ldots, \alpha_{J-1}\}$ are
unknowns subject to an order constrained prior $\alpha_1 \le \alpha_2 \le \ldots \le \alpha_{J-1}$. By the reparameterisation

$$\alpha_j = \alpha_{j-1} + \exp(\Delta_j) \quad (J > j > 1),$$

$$\alpha_1 = \Delta_1,$$

one may, however, specify unconstrained normal priors, such as $\Delta_j \sim N(0, V_\Delta)$ where VΔ is
preset or possibly itself unknown.
An equivalent specification of this model involves sets of J − 1 binary variables for each
subject, namely zji = 1 if yi ≤ j, and zji = 0 otherwise. So if J = 3, and if yi = 1, then z1i = 1, z2i = 1; if
yi = 2, then z1i = 0, z2i = 1. So, for ε normal,

$$\Pr(y_i \le j) = \Pr(y_i^* \le \alpha_j) = \Pr(z_{ji} = 1) = \Phi(\alpha_j - X_i \beta_j).$$
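The cutpoint reparameterisation and the resulting category probabilities can be sketched as follows for the ordered probit case. For simplicity the sketch assumes a common linear predictor across categories (the proportional form, with coefficients not varying over j); the function names and the Δ values are hypothetical.

```python
import math
from statistics import NormalDist

def cutpoints_from_deltas(deltas):
    """alpha_1 = Delta_1 and alpha_j = alpha_{j-1} + exp(Delta_j) for j > 1."""
    alphas = [deltas[0]]
    for d in deltas[1:]:
        alphas.append(alphas[-1] + math.exp(d))
    return alphas

def ordinal_probit_probabilities(alphas, eta):
    """Pr(y = j), j = 1..J, from gamma_j = Phi(alpha_j - eta), with gamma_0 = 0 and gamma_J = 1."""
    cdf = NormalDist().cdf
    gammas = [0.0] + [cdf(a - eta) for a in alphas] + [1.0]
    return [gammas[j] - gammas[j - 1] for j in range(1, len(gammas))]

# Hypothetical unconstrained Delta values giving J = 4 ordered categories.
alphas = cutpoints_from_deltas([-0.5, 0.2, -1.0])
probs = ordinal_probit_probabilities(alphas, eta=0.3)
print(alphas, probs)
```

Because each increment exp(Δj) is positive, any real-valued Δ vector yields ordered cutpoints, which is what allows unconstrained normal priors to be used.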



Example 7.7 Delegation of Discretion in Trade Policy


This example involves direct and augmented data options for ordinal data, with
an ordered probit model analysing changes in discretion in trade policy delegated to the
US President by Congress between 1890 and 1990 (T = 99 observations) (Epstein and
O’Halloran, 1996). The response has J = 3 categories: 3 if discretion is increased between
successive years, 2 if it stays the same, and 1 if it is reduced. Changes in discretion are
related to p = 4 predictors: changes in log GNP (x1), changes in log unemployment rate
(x2), changes in the logged producer price index (x3), and a measure of changes in
government disunity (x4), where disunity in a particular year is measured by a trichotomy
according as one or both Congress chambers are in the same political party as the
President. So x4 can take values {−2, −1, 0, 1, 2}.
In a first analysis using rjags, a proportional odds model over responses j is assumed,
namely $\beta_{jk} = \beta_k$ (k = 1, …, p). Order constrained N(0,1) priors are assumed on the
unknown cut points {α1,α2} (using the rjags sort option), and an MVN prior on $\{\beta_1, \ldots, \beta_4\}$,
with mean zero and diagonal precision matrix B0, with prior variances of 1000.
A two-chain run of 10,000 iterations (with the last 5,000 for inference) gives a non-significant
posterior mean (95% CrI) for the impact β1 of x1, namely 1.02 (−5.3,7.1). The
95% interval for the impact of x2 is inconclusive, though the posterior density is decidedly
concentrated on negative values. The coefficients β3 and β4 have respective means
(95% CrI) of −3.1 (−6.5,0.5) and −0.42 (−0.81,−0.04). The percentage of years where the
actual response is accurately predicted by replicate responses is 62.6%.
Similar coefficients are obtained from an augmented data approach involving
binary responses ztj = 1 if yt ≤ j, and ztj = 0 otherwise. This form of the model is coded in
R2OpenBUGS, initially with normality assumed. Plots of the resulting two sets of residuals
suggest some unusual observations, and a Student t modification of the truncated
sampling is adopted, with scale factors $\lambda_{tj} \sim \text{Ga}(2, 2)$ (corresponding to a fixed 4 degrees
of freedom in the Student t). This shows nine observations with posterior medians for
λtj under 0.5, and downweighting their influence improves predictions: the percentage
of years accurately predicted by replicate responses is raised to 64.5%.

7.6 Heteroscedasticity and Regression Heterogeneity


For data assumed conditionally normal, the canonical normal linear regression specifies

$$y_i = X_i \beta + \varepsilon_i,$$

where $\varepsilon_i \sim N(0, \sigma^2)$. Potential limitations of this specification reflect both the assumptions
regarding errors and the same form of regression effect for all subjects. Assumptions of
discrete regression (e.g. Poisson, binomial) are also vitiated by excess observations at particular
outcomes (e.g. clumping at zero).

7.6.1 Nonconstant Error Variances


Thus, assuming homoscedastic normal errors in linear regression (or in overdispersed
regressions such as the Poisson lognormal) may be restrictive due to the relatively thin
tails of the normal, particularly when unusual observations are present. By contrast, scale
mixtures of normals accommodate a wide variety of heavy-tailed distributions (Fonseca
et al., 2008; Fernandez and Steele, 2000).

In the linear model $y_i = \eta_i + \sigma z_i$, with zi ~ N(0,1), a scale mixture is generated by assuming
the residuals are distributed as

$$\varepsilon_i = z_i / \lambda_i^{0.5}$$

where the λi are independent positive random variables. The tν distribution results as a
scale mixture of normal distributions by taking λi gamma with shape and rate both equal to ν/2, with
the Cauchy when ν = 1 (Boris Choy and Chan, 2008). An alternative scale mixture of uniforms
method can lead to both heavier and lighter tails than the normal (Qin et al., 2000).
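As an illustrative check of the scale mixture construction, the sketch below draws residuals as a standard normal divided by the square root of a gamma variable with shape and rate ν/2, and compares the sample variance with the Student t variance ν/(ν − 2). The choice ν = 8 and the function name are assumptions; note that Python's random.gammavariate is parameterised by shape and scale, so the rate ν/2 enters as a scale of 2/ν.

```python
import random

def draw_t_by_scale_mixture(nu, rng):
    """epsilon = z / sqrt(lambda), lambda ~ Ga(shape nu/2, rate nu/2), z ~ N(0, 1)."""
    # random.gammavariate takes (shape, scale), so a rate of nu/2 is a scale of 2/nu.
    lam = rng.gammavariate(nu / 2.0, 2.0 / nu)
    return rng.gauss(0.0, 1.0) / lam ** 0.5

rng = random.Random(11)
nu = 8.0
draws = [draw_t_by_scale_mixture(nu, rng) for _ in range(200_000)]
mean = sum(draws) / len(draws)
variance = sum((d - mean) ** 2 for d in draws) / (len(draws) - 1)
print(variance, nu / (nu - 2.0))
```

In an MCMC setting the λi are retained as unknowns, and low posterior values flag observations that are being downweighted as potential outliers.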
Other approaches to heteroscedasticity include variance transformation and variance
regression modelling (Cepeda and Gamerman, 2000; Chib and Greenberg, 2013). As an
alternative to the canonical linear regression, one may consider a heteroscedastic model
(Wang and Zhou, 2007)

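$$y_i = \eta_i + \sigma_i z_i,$$

with $\eta_i = X_i \gamma$, zi ~ N(0,1), and σi or $\sigma_i^2$ taken as a function of ηi, such as

$$\sigma_i = \exp(\eta_i),$$

$$\sigma_i = \sigma \eta_i^{\lambda},$$

or

$$\sigma_i = \sigma(1 + \lambda \eta_i^2)^{0.5}.$$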

7.6.2 Varying Regression Effects via Discrete Mixtures


A potential limitation of the standard normal linear model, and of other generalised linear
models, is the assumption of identical regression relationships for all cases. Alternatives
include random coefficient models and discrete mixture regressions.
With regard to the former, residual heteroscedasticity may reflect varying predictor
effects. Consider a linear model specified as $y_i = x_i \beta + w_i \gamma + \varepsilon_i^*$ when the true model is
$y_i = x_i \beta + w_i(\gamma + v_i) + \varepsilon_i$ with $\text{var}(v_i) = \sigma_v^2$ and $\text{var}(\varepsilon_i) = \sigma_\varepsilon^2$. The error $\varepsilon_i^*$ will then have
nonconstant variance $w_i^2 \sigma_v^2 + \sigma_\varepsilon^2$. To address such issues, a random regression effects linear
model specifies

$$y_i = \alpha + \sum_{j=1}^{p} x_{ji}(\beta_j + v_{ji}) + \varepsilon_i,$$

where $\{v_{1i}, \ldots, v_{pi}\}$ are zero mean random effects. Random regression effects are often
applied to structured data (e.g. time or spatially configured data).
Variation in regression effects is also approached using discrete regression mixtures,
with form

$$p(y_i \mid X_i) = \sum_{k=1}^{K} \pi_k f_k(X_i, \beta_k, \phi_k),$$

where βk are component specific regression effects, and ϕk are other parameters defining
densities fk. Such mixtures are useful for detecting subpopulations with different behaviours,
while accounting for excess heterogeneity (e.g. overdispersion) related to varying
regression relationships. Examples include normal regression mixtures

$$p(y_i \mid X_i) = \sum_{k=1}^{K} \pi_k N\left(\alpha_k + \sum_{j=1}^{p} x_{ji} \beta_{jk},\ \sigma_k^2\right),$$

Poisson regression mixtures

$$p(y_i \mid X_i) = \sum_{k=1}^{K} \pi_k \text{Po}\left(\exp\left(\alpha_k + \sum_{j=1}^{p} x_{ji} \beta_{jk}\right)\right),$$

and logit regression mixtures for binary or binomial data

$$p(y_i \mid X_i) = \sum_{k=1}^{K} \pi_k \text{Bern}\left(\frac{\exp\left(\alpha_k + \sum_{j=1}^{p} x_{ji} \beta_{jk}\right)}{1 + \exp\left(\alpha_k + \sum_{j=1}^{p} x_{ji} \beta_{jk}\right)}\right).$$

The probabilities πk for the components may be predicted for each individual via regression
(e.g. multinomial logit).
In Bayesian applications, MCMC sampling is facilitated by the introduction of latent allocation
indicators $G_i \in \{1, \ldots, K\}$, with full conditionals based on multinomial probabilities

$$\frac{\pi_k f_k(X_i, \beta_k, \phi_k)}{\sum_{m=1}^{K} \pi_m f_m(X_i, \beta_m, \phi_m)}.$$
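For a normal regression mixture with a single predictor, the allocation probabilities for one observation can be computed as in the sketch below. All parameter values and names are hypothetical; in an MCMC run they would be the current draws of the component weights, intercepts, slopes, and standard deviations.

```python
import math

def normal_pdf(y, mean, sd):
    """Density of N(mean, sd^2) evaluated at y."""
    return math.exp(-0.5 * ((y - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def allocation_probabilities(y_i, x_i, weights, intercepts, slopes, sds):
    """Pr(G_i = k | y_i, x_i, parameters) for a K-component normal regression mixture."""
    terms = [
        w * normal_pdf(y_i, a + b * x_i, s)
        for w, a, b, s in zip(weights, intercepts, slopes, sds)
    ]
    total = sum(terms)
    return [t / total for t in terms]

# Hypothetical two-component example: component means at x = 1 are 0.5 and 2.0.
probs = allocation_probabilities(
    y_i=2.1, x_i=1.0,
    weights=[0.6, 0.4], intercepts=[0.0, 1.0], slopes=[0.5, 1.0], sds=[1.0, 0.5],
)
print(probs)
```

A draw of Gi then follows by sampling a category with these probabilities, for example with random.choices.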

Certain identification and estimation issues apply to discrete regression mixtures, and a
variety of sampling and post-processing methods, and priors to gain or improve identifiability,
have been proposed. Different component labels cannot be distinguished during
MCMC sampling unless some identifiability constraint is imposed. Another issue involves
small components (with low probabilities πk), especially when combined with small samples,
since at particular MCMC iterations, no cases may be allocated to a particular group,
so that the associated parameters are not updated.
Sampling and estimation methods for discrete regression mixtures differ in whether
they impose identifying constraints or allow switching between different numbers of
components. For example, Viele and Tong (2002) apply identifying restrictions in linear
regression mixtures. Ordering of variances may work better when the variances are well
separated, whereas ordering of particular regression parameters works well when subpopulations
are distinct in substantive terms. Such features might be established by preliminary
classical estimation.

7.6.3 Other Applications of Discrete Mixtures


Discrete mixtures are also relevant for excess observations at particular points. For example,
Poisson overdispersion may result from an excess number of zero counts. Under

a zero-inflated Poisson (ZIP) model, or more generally a zero-modified Poisson (ZMP) model
(Conceição et al., 2013), zero counts may be either true zeroes, or result from a stochastic
mechanism, when the process is “active,” but sometimes produces zero events. A distinction
is similarly made between structural and random zeroes (Martin et al., 2005).
Denote the active stochastic mechanism as f(y), and let di = 1 for true zeroes as against
stochastic zeroes, obtained when di = 0. Setting $\Pr(d_i = 1) = \omega$, one has for discrete f(y),

$$\Pr(y_i = 0) = \Pr(d_i = 1) + f(y_i = 0 \mid d_i = 0)\Pr(d_i = 0)$$

$$\Pr(y_i = j) = f(y_i = j \mid d_i = 0)\Pr(d_i = 0), \quad j = 1, 2, \ldots$$

Regressors Xi may be relevant both to the binary inflation mechanism, and to the parameters
defining the density (Czado et al., 2007). A useful representation for programming
the zero-inflated Poisson involves the data augmentation scheme (Ghosh et al., 2006):

$$d_i \sim \text{Bern}(\omega_i),$$

$$y_i \sim \text{Po}(\mu_i(1 - d_i)),$$

with analogous representations for other zero-inflated densities.
For a zero-inflated Poisson, with ωi and μi defined by regression, one has

$$P(y_i = 0 \mid X_i) = \omega_i + (1 - \omega_i)e^{-\mu_i},$$

$$P(y_i = j \mid X_i) = (1 - \omega_i)e^{-\mu_i}\mu_i^{j}/j!, \quad j = 1, 2, \ldots$$

with variance then

$$\text{Var}(y_i \mid \omega_i, \mu_i) = (1 - \omega_i)[\mu_i + \omega_i \mu_i^2] > \mu_i(1 - \omega_i) = E(y_i \mid \omega_i, \mu_i).$$

So, the modelling of excess zeros implies overdispersion.
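The mean and variance expressions for the zero-inflated Poisson can be verified by direct summation over the probability mass function. The following sketch uses hypothetical values ω = 0.3 and μ = 2.5 and truncates the support at a point beyond which the remaining mass is negligible.

```python
import math

def zip_pmf(j, omega, mu):
    """Pr(y = j) under a zero-inflated Poisson with inflation probability omega and mean mu."""
    poisson = math.exp(-mu + j * math.log(mu) - math.lgamma(j + 1))
    return omega + (1.0 - omega) * poisson if j == 0 else (1.0 - omega) * poisson

omega, mu = 0.3, 2.5            # hypothetical parameter values
support = range(0, 200)         # truncation point; the tail beyond it is negligible
pmf = [zip_pmf(j, omega, mu) for j in support]

# Moments obtained by summing over the (truncated) support.
mean = sum(j * p for j, p in zip(support, pmf))
variance = sum((j - mean) ** 2 * p for j, p in zip(support, pmf))

# Closed-form expressions from the text.
formula_mean = (1.0 - omega) * mu
formula_variance = (1.0 - omega) * (mu + omega * mu ** 2)
print(mean, formula_mean, variance, formula_variance)
```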


Discrete mixture approaches are also used for outlier accommodation and detection
(Verdinelli and Wasserman, 1991). For example, in linear regression, one may have

$$p(y_i \mid X_i) = \sum_{k=1}^{2} \pi_k N\left(\alpha + \sum_{j=1}^{p} x_{ji} \beta_j,\ \sigma_k^2\right)$$

where $\sigma_2^2 \gg \sigma_1^2$, and π2 is taken small (e.g. π2 = 0.05). This provides variance-inflation for
outliers. Mohr (2007) advocates a two-group model allowing for both clustered outliers
(defined by similar predictor values), and for scattered outliers, generated by a variance
inflation mechanism.

Example 7.8 Radioimmunoassay and Esterase


This example compares three heteroscedastic linear models and a varying regression
model for n = 113 radioimmunoassay observations (y) in relation to a single predictor,
namely esterase (x). Fit is based on sampling replicate data, with fit and penalty criteria
derived as in Gelfand and Ghosh (1998).

So, with zi ~ N(0,1) and

$$y_i = \eta_i + \sigma_i z_i = \beta_0 + \beta_1 x_i + \sigma_i z_i,$$

the first model is a variance regression model with

$$\sigma_i^2 = \exp(\gamma_0 + \gamma_1 x_i).$$

This yields a significant γ1, with mean (95% interval) of 0.068 (0.044,0.094), so indicating
heteroscedasticity. The penalty criterion CP (obtained by summing the posterior variances
of replicates) is 1.41E+06, while the predictive fit criterion, the posterior mean of
$\sum_i (y_i - y_{rep,i})^2$, is CF = 2.635E+06.
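Given replicate data sampled at each MCMC iteration, the two criteria as described here can be computed as in the sketch below. The function name, the toy data, and the use of the n − 1 divisor for the posterior variances are assumptions made for illustration.

```python
def predictive_criteria(y, replicates):
    """Return (C_P, C_F) from posterior replicate samples.

    y          : list of n observed responses
    replicates : list of S lists, each holding n replicate responses from one MCMC draw
    """
    n, s = len(y), len(replicates)
    # C_P: sum over cases of the posterior variance of the replicate for that case.
    c_p = 0.0
    for i in range(n):
        column = [rep[i] for rep in replicates]
        mean_i = sum(column) / s
        c_p += sum((v - mean_i) ** 2 for v in column) / (s - 1)
    # C_F: posterior mean over draws of the sum of squared deviations (y_i - y_rep,i)^2.
    c_f = sum(sum((yi - ri) ** 2 for yi, ri in zip(y, rep)) for rep in replicates) / s
    return c_p, c_f

y_obs = [1.0, 2.0, 4.0]                                      # toy data for illustration
y_rep = [[1.5, 2.5, 3.0], [0.5, 1.5, 5.0], [1.0, 2.0, 4.0]]  # three toy replicate draws
c_p, c_f = predictive_criteria(y_obs, y_rep)
print(c_p, c_f)
```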
A variance power model in absolute predictor values, namely

$$\sigma_i = \sigma(1 + |\eta_i|)^{\lambda},$$

is then applied (Bonate, 2011). A U(−2,2) prior is taken on λ, and a U(0,250) prior on σ,
which includes the observed standard deviation of 213. The final 5,000 iterations of a
two-chain run of 15,000 iterations give an estimate for λ of 0.61 (0.44, 0.82), and provide
improved fit criteria (CP, CF) = (1.21E+06, 2.42E+06).
A Student t with ν degrees of freedom via normal scale mixing (centred on a single
variance parameter σ2) is then applied. A U(0.01,1) prior is applied on the inverse of the
degrees of freedom 1/ν. This shows 21 datapoints with posterior mean precision adjustment
factors κi below 0.5, and ν estimated at 2.74. Although such estimates clearly show
non-normality, the fit criteria deteriorate to (CP, CF) = (1.66E+06, 2.87E+06).
Finally, a discrete mixture regression is applied, with group varying intercept, slope,
and scale, namely

$$y_i = \alpha_{G_i} + \beta_{G_i} x_i + \sigma_{G_i} z_i,$$

$$G_i \sim \text{Mult}(1, \pi)$$

with π of dimension K assigned a Dirichlet prior, π ~ Dir(1,1,…,1). A prior constraint
that regression slopes are increasing is applied, using the sort() option in rjags. As well
as the predictive fit criterion CF, choice between different values of K is based on the
Bayesian Information Criterion (BIC) (Lee et al., 2016; Utazi et al., 2016).
There are considerable reductions in both measures in moving from K = 4 to K = 5, but
increases in moving to K = 6. For K = 5, the posterior means for BIC and CF are 621 and
1.10E+06. Hence, heteroscedasticity seems in this dataset to be linked in part to varying
regression relationships between sample sub-groups.

Example 7.9 Predictor Selection in Discrete Mixture Regression, Baseball Salary Data

This example considers discrete mixture linear regression, combined with predictor
selection within each component of the mixture (e.g. Khalili and Chen, 2007; Lee et al.,
2016). The response is salaries for 337 major league baseball players (in units of 100,000
dollars) in the year 1992 with performance measures from the year 1991. Of the 16 avail-
able predictors, three are focused on here, and are in standardised form: x1, batting aver-
age; x2, on-base percentage; and x3, number of runs. All these predictors are positively
correlated with the response (with significant impacts when used as single predictors),
but they are also intercorrelated. Using the flexmix package and comparing K = 2, 3, and

4 components shows the lowest BIC for K = 3, and also shows a considerable differentiation
in the residual standard deviations between the three components.
For a Bayesian analysis in rjags, initially without predictor selection and with K = 3, an
identifying constraint based on ordering of the residual variances is adopted, with normal
N(0,1000) priors on the regression coefficients βjk. A Dirichlet prior assigns weights
of 5 on each component πk. The final 5,000 iterations of a two-chain run of 15,000 iterations
lead to components distinguished firstly by salary level: the respective means
on the response within components are 0.4, 1.3, and 10.8 (this is the node avg.sal in the
code). The first component shows a significant effect of x3, with posterior mean (95% CrI)
of 0.60 (0.45, 0.74). The second component shows a relatively strong impact for x1, namely
1.9 (−0.1,3.9), while the third component shows a pronounced impact for x3, namely 9.0
(7.1, 10.9).
The second analysis uses Laplace priors on the regression coefficients, with Laplace
parameters and selection rate parameters component specific

$$\beta_{jk} \mid \gamma_{jk} = 1 \sim N(0, \sigma_k^2 \eta_{jk}^2),$$

$$\beta_{jk} \mid \gamma_{jk} = 0 \sim \delta_0(),$$

$$\eta_{jk}^2 \sim E(\lambda_k^2/2),$$

$$\lambda_k \sim U(0.01, 100),$$

$$\gamma_{jk} \sim \text{Bern}(\omega_k).$$

The final 5,000 of a two-chain run of 30,000 iterations shows a high retention rate only
for x3 in the first and third component, with mean (95% CrI) for the realised coefficient
$\xi_{jk} = \gamma_{jk} \beta_{jk}$ in the third component of 8.5 (6.7,10.1). The estimated λk for the second component
is relatively high, reflecting the lack of significant predictor effects.
A final analysis uses Laplace priors again, but without a binary selection mechanism.
The impact of x3 in the third component is unaffected, whereas that in the first component
is eliminated. Again, the estimated λk for the first two components are relatively
high, in line with shrinkage in predictor effects.

Example 7.10 ZIP Regression for Pursuit Behaviours


This example involves zero-inflated regression to investigate impacts of education (x1)
and anxious attachment (x2) on numbers of unwanted pursuit behaviour (UPB) incidents
in couple separation contexts, with n = 387 cases (Loeys et al., 2012).
The response yi is the UPB total, and the first model is a zero-inflated Poisson with a
regression for the inflation mechanism. Thus

$$P(y_i = 0 \mid X_i) = \omega_i + (1 - \omega_i)e^{-\mu_i},$$

$$P(y_i = j \mid X_i) = (1 - \omega_i)e^{-\mu_i}\mu_i^{j}/j!, \quad j = 1, 2, \ldots$$

$$\log(\mu_i) = \beta_0 + X_i \beta,$$

$$\operatorname{logit}(\omega_i) = \gamma_0 + X_i \gamma.$$

Anxious attachment has a significant effect in both regressions, with β2 and γ2 having
respective mean (95% CrI) coefficients of 0.13 (0.06,0.20) and −0.49 (−0.71,−0.27). The posterior
mean marginalised likelihood (Millar, 2009) is −805.5.

As a mixed predictive check (Marshall and Spiegelhalter, 2007), replicate zero inflation
indicators $d_i^* \sim \text{Bern}(\omega_i)$ are sampled, and replicate responses sampled from the
corresponding shifted mean $y_i^* \sim \text{Po}(\mu_i(1 - d_i^*))$. There is found to be only 1 case with
probabilities of overprediction, $\Pr(y_i^* > y_i) + 0.5\Pr(y_i^* = y_i)$, exceeding 0.95, but 28 cases
with probabilities under 0.05, indicating underprediction of some larger counts. Hence,
the ZIP model may not be representing the full extent of overdispersion.
A more general representation is obtained by using a zero-inflated negative binomial.
This increases the posterior mean marginalised likelihood to −570 and the number of
underpredicted cases is reduced to 15.
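
The ZIP probabilities and the exceedance probability used in the predictive check can be sketched in Python as below. This is an illustrative sketch only, not the code used for the example: the function names `zip_pmf` and `exceedance_prob` are hypothetical, and the parameters are held at fixed values, whereas the check described above averages over posterior draws.

```python
import math

def zip_pmf(y, mu, w):
    """P(Y = y) for a zero-inflated Poisson: inflation probability w, Poisson mean mu."""
    poisson = math.exp(-mu) * mu ** y / math.factorial(y)
    return w * (y == 0) + (1.0 - w) * poisson

def exceedance_prob(y_obs, mu, w):
    """Pr(y* > y_obs) + 0.5 * Pr(y* = y_obs) for a replicate y* from the same ZIP."""
    below = sum(zip_pmf(j, mu, w) for j in range(y_obs))
    equal = zip_pmf(y_obs, mu, w)
    return 1.0 - below - equal + 0.5 * equal

# Hypothetical parameter values, chosen only for illustration
p0 = zip_pmf(0, mu=2.0, w=0.25)
q_small = exceedance_prob(0, mu=2.0, w=0.25)    # observed zero: replicate tends to exceed it
q_large = exceedance_prob(12, mu=2.0, w=0.25)   # large observed count: under 0.05
```

Values of the exceedance probability near 0 flag observed counts that the model underpredicts, which is the pattern reported for the larger UPB counts.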

7.7 Time Series Regression: Correlated Errors and Time-Varying Regression Effects
Time series regression for generalised linear models raises distinct issues, such as serial
correlation in regression residuals, and time-varying regression coefficients or dispersion
(Jung et al., 2006; Kedem and Fokianos, 2002). There may also be dependence on earlier
responses, observed or latent, and time-varying dependence on predictors (Kitagawa and
Gersch, 1985; Nicholls and Quinn, 1982).
If autocorrelation in the regression errors is suspected or postulated, as opposed to
dependence on past responses or latent data, one option is models with autoregressive and
moving average random effects (Cox, 1981; Chiogna and Gaetan, 2002). For linear regres-
sion over time points t = 1,…,T

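$$y_t = X_t\beta + \varepsilon_t,$$

an ARMA(p,q) error scheme specifies

$$\varepsilon_t - \rho_1\varepsilon_{t-1} - \rho_2\varepsilon_{t-2} - \cdots - \rho_p\varepsilon_{t-p} = u_t - \theta_1 u_{t-1} - \theta_2 u_{t-2} - \cdots - \theta_q u_{t-q},$$

where $u_t \sim N(0, \sigma^2)$ are white noise errors.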


Widely applied options in practice for ARMA error dependence in time series models
are the simple AR(1) and MA(1) schemes. The AR(1) model with

$$\varepsilon_t = \rho\varepsilon_{t-1} + u_t,$$

where $u_t \sim N(0, \sigma^2)$ are iid and independent of εt−1, is an effective scheme for controlling for
temporal error dependence if (as often) most correlation from previous errors is transmitted
through the impact of εt−1. This assumption is widely used in longitudinal models (e.g.
Chi and Reinsel, 1989). With $\sigma_\varepsilon^2 = \mathrm{var}(\varepsilon_t)$, and assuming stationarity with |ρ| < 1, AR(1)
error dependence implies

$$\mathrm{var}(\varepsilon_t) = \rho^2\mathrm{var}(\varepsilon_{t-1}) + \sigma^2 + 2\rho\,\mathrm{cov}(\varepsilon_{t-1}, u_t) = \rho^2\sigma_\varepsilon^2 + \sigma^2,$$

where the covariance term is zero because ut is independent of εt−1, so that
$\sigma_\varepsilon^2 = \sigma^2/(1 - \rho^2)$, and the initial condition for the stationary case is

$$\varepsilon_1 \sim N\left(0, \frac{\sigma^2}{1 - \rho^2}\right).$$
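
The stationary variance result for AR(1) errors can be checked by simulation. The following Python sketch is illustrative only: the function names are hypothetical, the values of ρ and σ are arbitrary, and the first error is drawn from the stationary initial condition so that every term has the same marginal variance.

```python
import math
import random

def stationary_variance(rho, sigma):
    """Marginal variance sigma^2 / (1 - rho^2) of a stationary AR(1) error."""
    if abs(rho) >= 1:
        raise ValueError("stationarity requires |rho| < 1")
    return sigma ** 2 / (1.0 - rho ** 2)

def simulate_ar1(rho, sigma, T, seed=0):
    """Simulate e_1..e_T, with e_1 drawn from the stationary distribution."""
    rng = random.Random(seed)
    e = [rng.gauss(0.0, math.sqrt(stationary_variance(rho, sigma)))]
    for _ in range(1, T):
        e.append(rho * e[-1] + rng.gauss(0.0, sigma))
    return e

errors = simulate_ar1(rho=0.5, sigma=1.0, T=200_000, seed=1)
mean = sum(errors) / len(errors)
sample_var = sum((x - mean) ** 2 for x in errors) / len(errors)
print(round(stationary_variance(0.5, 1.0), 4), round(sample_var, 2))
```

With a long series the sample variance should settle close to the theoretical value of 4/3 for ρ = 0.5 and σ = 1.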

AR(1) error dependence for non-metric responses is illustrated by the Poisson count
outcomes case $y_t \sim \mathrm{Po}(\mu_t)$ (Chan and Ledolter, 1995; Nelson and Leroux, 2006), with

$$\log(\mu_t) = X_t\beta + \varepsilon_t,$$

$$\varepsilon_t = \rho\varepsilon_{t-1} + u_t.$$

Bayesian analysis of AR(1) errors for count data is exemplified by Oh and Lim (2001) and
Jung et al. (2006), who also consider augmented data sampling for count responses, while
Ibrahim and Chen (2000) set out sampling algorithms under a power prior approach (that
assumes historic data with the same form of design are available).
The Durbin–Watson statistic for AR(1) error dependence, namely

$$DW = \frac{\sum_t (e_t - e_{t-1})^2}{\sum_t e_t^2} \approx 2 - 2\frac{\sum_t e_t e_{t-1}}{\sqrt{\sum_t e_t^2 \sum_t e_{t-1}^2}} = 2 - 2\rho,$$

is often used to test temporal autocorrelation (when predictors exclude lagged responses),
and in a Bayesian context can be applied in a posterior predictive check. For example,
Spiegelhalter (1998, p.126) considers a Poisson time series for cancer cases $y_{ijt}$ in ages
$i = 1, \ldots, I$, districts $j = 1, \ldots, J$, and years $t = 1, \ldots, T$, with $\mu_{ijt}$ being Poisson means. At each
iteration, deviance residuals $d_{ijt} = -2\log\{p(y_{ijt} \mid \mu_{ijt})\}$ are obtained, and an average DW
statistic derived for each age and district, namely

$$DW_{ij} = \frac{\sum_{t=2}^{T}(d_{ijt} - d_{ij,t-1})^2}{\sum_{t=1}^{T}(d_{ijt} - \bar{d}_{ij\cdot})^2}.$$

A summary statistic for autocorrelation is then $DW = \sum_i\sum_j DW_{ij}/(IJ)$, which can be
obtained for both actual and replicate data.
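
For a single residual series, the DW calculation and its link to lag-1 dependence can be sketched as follows. This Python sketch is illustrative only: the function names and residual values are hypothetical, and the lag-1 quantity uses the simple ratio of the cross-product sum to the sum of squares. With that definition the relation DW = 2 − 2r holds exactly once the first and last squared residuals are accounted for.

```python
def durbin_watson(e):
    """Sum of squared successive differences divided by the sum of squares."""
    num = sum((e[t] - e[t - 1]) ** 2 for t in range(1, len(e)))
    return num / sum(x ** 2 for x in e)

def lag1_ratio(e):
    """Sum of e_t * e_{t-1} divided by the sum of squares."""
    num = sum(e[t] * e[t - 1] for t in range(1, len(e)))
    return num / sum(x ** 2 for x in e)

# Hypothetical residuals with runs of like signs (positive autocorrelation)
resid = [0.8, 1.1, 0.4, -0.3, -0.9, -0.5, 0.2, 0.7]
dw = durbin_watson(resid)
r = lag1_ratio(resid)
# End effect: share of the sum of squares due to the first and last residuals
end_effect = (resid[0] ** 2 + resid[-1] ** 2) / sum(x ** 2 for x in resid)
print(dw, 2 - 2 * r - end_effect)
```

In a posterior predictive check the same function would be applied at each iteration to residuals from both the observed and the replicate data.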
The latent process driving autocorrelation may also be modelled using discrete mixture
formulations. For example, one may define Markov Poisson regression in which for each
observed count yt, there corresponds an unobserved categorical variable $S_t \in \{1, \ldots, K\}$,
representing the state by which yt is generated (Wang and Puterman, 1999). The latent states
are generated according to a stationary Markov chain with transition probabilities

$$\Pr(S_t = k \mid S_{t-1} = j) = \pi_{jk}, \quad j, k = 1, \ldots, K.$$

Conditional on $S_t = k$, the tth observation yt is Poisson with mean $\mu_t = \exp(X_t\beta_k)$.

7.7.1 Time-Varying Regression Effects


Autocorrelated or heteroscedastic disturbances in time series regression may be caused
by assuming predictor effects are constant when in fact they are time-varying. Consider a
dynamic normal linear model for metric response

$$y_t = X_t\beta_t + \varepsilon_t,$$

with R predictors. A simple way to allow coefficient variation is to take

$$\beta_t = \beta_\mu + u_t$$

with ut taken as iid random effects. However, in time series contexts, it is likely that
deviations from the central coefficient effect βμ will be correlated with nearby deviations in time.
A flexible framework for time-varying parameter effects is provided by the linear
Gaussian state space model (Shumway, 2016), involving first order random walks in scalar
or vector coefficients βt

$$y_t = X_t\beta_t + \varepsilon_t, \quad \varepsilon_t \sim N_T(0, \Sigma_t),$$

$$\beta_t = G_t\beta_{t-1} + w_t, \quad w_t \sim N_R(0, V_t).$$

Often Gt = I, Σt = Σ, and Vt = V, but if there is stochastic volatility, the variances or log vari-
ances can also be brought into a random walk scheme.
Subject matter considerations are likely to govern the anticipated level of smoothness in
the regression effects. For example, the RW(2) scheme

$$\beta_t = 2\beta_{t-1} - \beta_{t-2} + w_t, \quad w_t \sim N(0, V)$$

provides a more plausible smoothly changing evolution for changing regression effects
(Beck, 1983). Dangl and Halling (2012) consider dynamic linear models for asset returns,
and formal model choice between constant regression effects with V = 0, and differing
levels of variation in βt, via a discrete prior over a set of covariance matrix discount factors.
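
The difference between first and second order random walk priors can be illustrated by simulation. The Python sketch below is illustrative only: the function names, starting values, and innovation standard deviation are hypothetical choices. It confirms that second differences of an RW(2) path recover the innovations, which is why an RW(2) prior penalises changes in slope rather than changes in level.

```python
import random

def simulate_random_walk(order, T, sd, seed=0):
    """Simulate an RW(1) or RW(2) path of length T; returns (path, innovations)."""
    if order not in (1, 2):
        raise ValueError("order must be 1 or 2")
    rng = random.Random(seed)
    beta = [0.0] * order                      # arbitrary starting values
    w = [rng.gauss(0.0, sd) for _ in range(T - order)]
    for w_t in w:
        if order == 1:
            beta.append(beta[-1] + w_t)                    # beta_t = beta_{t-1} + w_t
        else:
            beta.append(2 * beta[-1] - beta[-2] + w_t)     # beta_t = 2 beta_{t-1} - beta_{t-2} + w_t
    return beta, w

def differences(x, order):
    """Apply first differencing `order` times."""
    for _ in range(order):
        x = [x[t] - x[t - 1] for t in range(1, len(x))]
    return x

path2, w2 = simulate_random_walk(order=2, T=200, sd=0.1, seed=3)
second_diff = differences(path2, 2)           # equals the innovations w2
```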
Varying regression effects are important in particular applications of dynamic gener-
alised linear models for discrete responses (Gamerman, 1998; Fruhwirth-Schnatter and
Fruhwirth, 2007; Ferreira and Gamerman, 2000). Consider y from an exponential family
density

$$p(y_t \mid \theta_t) \propto \exp\left(\frac{y_t\theta_t - b(\theta_t)}{\phi_t}\right),$$

$$\mu_t = E(y_t \mid \theta_t) = b'(\theta_t),$$

where the predictors Xt may include past responses $\{y_{t-k}, y^*_{t-k}\}$, both observed and latent
(Fahrmeir and Tutz, 2001, p.345). For example, $y_t^*$, the latent response (e.g. utility in
economic applications) when yt is binary, may depend on previous values of both $y_t^*$ and yt.
The link for μt involves random regression parameters

$$g(\mu_t) = X_t\beta_t,$$

where the parameter vector evolves according to a linear Gaussian transition model,

$$\beta_t = G_t\beta_{t-1} + w_t,$$

with multivariate normal errors $w_t \sim N_R(0, V_t)$ independent of lagged responses, and of the
initial condition $\beta_0 \sim N_R(B_0, V_0)$.
Models for binary time series with state-space priors on the coefficients have been
considered in several studies. Thus, Fahrmeir and Tutz (2001) consider a binary dynamic logit
model involving trend and varying effects of a predictor and lagged response,

$$\mathrm{logit}(\pi_t) = \beta_{1t} + \beta_{2t}x_t + \beta_{3t}y_{t-1},$$

$$\beta_t \sim N_3(\beta_{t-1}, V),$$

while Gamerman (1998) considers nonstationary random walk priors in a marketing
application with binomial data, where $\mathrm{logit}(\pi_t) = \beta_{1t} + \beta_{2t}x_t$, and xt is a measure of cumulative
advertising expenditure.

Example 7.11 Epileptic Seizures


This example considers correlated error schemes for a count response, specifically data
from a clinical trial into the effect of intravenous gamma-globulin on suppression of epi-
leptic seizures (Wang et al., 1996). Daily seizure counts are recorded for a single patient
for a period of 140 days, where the first 27 days are a baseline period without treat-
ment, and the remaining 113 days are the treatment period. Predictors are x1t = treatment,
x2t = days treated, and an interaction x3t = x1t x2t between days treated and treatment.
A simple Poisson regression is applied initially, using jagsUI, together with a posterior
predictive check based on the DW statistic. The predictive p-value of 0.76 is not
extreme. However, the 95% credible interval for DW is entirely below 2, indicating
positive error autocorrelation. Monte Carlo estimates of CPO statistics also indicate
model failures (Figure 7.2), with the LPML standing at −591.
A stationary AR(1) error model increases the LPML to −395, with ρ having a posterior
mean (95% CrI) of 0.23 (−0.04, 0.50). The dependence structure may also be modelled
using a latent Markov chain with K = 2 states (Wang and Puterman, 1999). Conditional
on state St = j, the Poisson mean for the seizure count on day t is represented as

$$\mu_t = \exp(\beta_{0j} + \beta_{1j}x_{1t} + \beta_{2j}x_{2t} + \beta_{3j}x_{3t}), \quad j = 1, 2.$$

FIGURE 7.2
CPO estimates, seizures data. (Vertical axis: CPO; horizontal axis: case number.)

TABLE 7.5
Seizures Data. Parameters of Markov Chain Model
Parameter   Mean     St Devn   2.5%     97.5%
β01         −6.96    0.50      −7.94    −6.10
β11          7.41    0.55       6.43     8.49
β21         −0.26    0.06      −0.36    −0.12
β31         −2.28    0.15      −2.58    −2.02
β02         −0.23    0.44      −0.99     0.64
β12          1.24    0.53       0.20     2.16
β22         −0.38    0.13      −0.61    −0.11
β32         −0.43    0.16      −0.72    −0.11
π11          0.75    0.05       0.65     0.83
π21          0.62    0.08       0.45     0.78
π12          0.25    0.05       0.17     0.35
π22          0.38    0.08       0.22     0.55

An identifiability (ordered parameter) constraint is applied to the intercepts, though
classical estimation makes clear that the two regimes have markedly different treatment
effects β1j, and a constraint could be applied to them instead.
The Markov chain model shows a further improved LPML of −348. State 1 has a much higher
positive treatment effect β11 (Table 7.5), and a more negative interaction effect. If a subject
is in that state on day t, the probability π11 of remaining there on the next day is 0.75,
with probability π12 = 0.25 of moving to state 2. If a subject currently occupies state 2, the
probability π21 of moving to state 1 is 0.62, and the probability π22 of remaining in state 2 is 0.38.
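
The long-run share of days spent in each state follows from the transition probabilities. The Python sketch below is illustrative only: the function names are hypothetical, and it plugs in the posterior mean transition probabilities from Table 7.5, which is not the same as the posterior mean of the stationary distribution itself.

```python
def stationary_two_state(p12, p21):
    """Stationary probabilities (state 1, state 2) of a two-state Markov chain."""
    total = p12 + p21
    return p21 / total, p12 / total

def step(dist, P):
    """One transition: new_k = sum_j dist_j * P[j][k]."""
    return [sum(dist[j] * P[j][k] for j in range(len(P))) for k in range(len(P))]

# Posterior mean transition probabilities from Table 7.5
P = [[0.75, 0.25],
     [0.62, 0.38]]
pi1, pi2 = stationary_two_state(p12=0.25, p21=0.62)

# Cross-check by iterating the chain from a start in state 1
dist = [1.0, 0.0]
for _ in range(50):
    dist = step(dist, P)
print(round(pi1, 3), round(pi2, 3))
```

On these plug-in values the chain spends roughly 71% of days in state 1 and 29% in state 2.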

Example 7.12 Mortality and Environment


This example illustrates time-varying regression effects following a state space prior.
It follows Smith et al. (2000) and Chiogna and Gaetan (2002) in analysing the relation-
ship between counts of deaths at ages over 65, meteorological variables, and air pollu-
tion in Birmingham, Alabama between August 3, 1985 and December 31, 1988 (T = 1247
observations).
Here a time constant regression is compared with an analysis involving independent
RW1 priors on a time-varying intercept and time-varying coefficients on three predic-
tors (x1 = minimum temperature, x2 = humidity and x3 = the first lag of PM10). Predictors
are standardised. With yt ~ Po( mt ), t = 1,…,T, one has

$$\log(\mu_t) = \beta_{0t} + \beta_{1t}x_{1t} + \beta_{2t}x_{2t} + \beta_{3t}x_{3t},$$

$$\beta_{0t} \sim N(2\beta_{0,t-1} - \beta_{0,t-2}, \sigma^2_{\beta_0}),$$

$$\beta_{jt} \sim N(\beta_{j,t-1}, \sigma^2_{\beta_j}),$$

with $1/\sigma^2_{\beta_0}$ and $1/\sigma^2_{\beta_j}$ assigned Ga(1,1) priors. As a predictive check, one step ahead
predictions $y_t^* \sim \mathrm{Po}(\mu_{t-1})$ (t > 1) are used to estimate posterior exceedance probabilities
$Q_t = \Pr(y_t^* > y_t) + 0.5\Pr(y_t^* = y_t)$. Low or high values for Qt indicate failures of fit and/or
forward prediction.
For the Poisson regression with constant predictor effects (fit using jagsUI), the aver-
age (scaled) Poisson deviance is 1406, so there appears to be relatively little overdisper-
sion in relation to the 1247 observations. Of the three regression coefficients, only β1
has a 95% posterior interval that excludes zero, namely −0.12 to −0.03. One step ahead
predictive checks show 143 of 1246 values of Qt (11.5%) exceeding 0.95 or under 0.05. The
LOO-IC and LPML are respectively 7046 and −3524.
For the second model, with convergence readily obtained using rstan, a slightly
improved fit is obtained. (Convergence is delayed if an RW2 prior is adopted in the
intercept). Note that standardisation of covariates is important in this example to avoid
numeric errors. One step ahead predictive checks now show 117 of 1246 values of Qt (9.4%)
exceeding 0.95 or under 0.05. The LOO-IC and LPML are respectively 7028 and −3514.
Figures 7.3 and 7.4 plot the time-varying coefficients β2t and β3t. Significant effects of
PM10 (Figure 7.4) are limited to a central period, similar to the findings of Chiogna and
Gaetan (2002).

FIGURE 7.3
Varying beta coefficients, beta2. (Posterior mean and 80% CRI of β2, plotted against day.)

FIGURE 7.4
Varying beta coefficients, beta3. (Posterior mean and 80% CRI of β3, plotted against day.)

7.8 Spatial Regression
In spatial data modelling, just as for time series regression, there may be correlated resid-
uals, spatially varying predictor effects, and/or predictor collinearity. Correlated errors
can bias regression parameter estimates and cause standard errors to be mis-stated (Boyd
et al., 2005). Nonlinear predictor effects may contribute to residual spatial correlation
(Dormann et al., 2007), as may incorrectly assuming homogenous regression coefficients
when a nonstationary process (a varying regression coefficient approach) is appropriate
(Fotheringham et al., 2002). Omitted spatially dependent predictors are another source
(LeSage and Dominguez, 2012) of residual spatial dependence. Regarding collinearity,
recent studies (e.g. Reich et al., 2010; Choi and Lawson, 2016) propose predictor selection
combined with spatially varying coefficients (Section 7.8.2).

7.8.1 Spatial Lag and Spatial Error Models


Spatial correlation in residuals can be measured in various ways, for example, via modi-
fied versions of the Moran I and Geary c statistics (Arbia, 2014), and a satisfactory model
will include the null value in the estimation intervals for these statistics. One may also
calculate these statistics for observed and replicate data and apply a posterior predictive
test of model adequacy.
For example, Moran’s I statistic for residuals e = y − Xβ from a normal linear regression
involving n areas is

$$I = \frac{e'We/S_0}{e'e/n},$$

where $S_0 = \sum_i\sum_j w_{ij}$, and W = [wij] represent spatial interactions. Alternatively, Congdon et al.
(2007) apply a measure suggested by Fotheringham et al. (2002, p.106) obtained via linear
regression of appropriately defined residuals ei on the spatial lag $e_i^* = \sum_j w_{ij}e_j/\sum_j w_{ij}$.
The regression is simply

$$e_i = \rho_0 + \rho_1 e_i^* + u_i,$$

where the ui are taken as unstructured. This is done at each MCMC iteration to provide a
posterior mean and 95% intervals on the spatial correlation index ρ1, the spatial lag regres-
sion coefficient (SLRC). If the 95% interval excludes zero, then spatial correlation is present.
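
Both residual diagnostics can be computed for one set of residuals as in the Python sketch below. It is illustrative only: the function names, the four-area chain adjacency matrix, and the residual values are hypothetical, and in practice the calculation would be repeated at each MCMC iteration to build up a posterior distribution.

```python
def morans_i(e, W):
    """Moran's I = (e'We / S0) / (e'e / n) for residuals e and weights W."""
    n = len(e)
    s0 = sum(sum(row) for row in W)
    cross = sum(W[i][j] * e[i] * e[j] for i in range(n) for j in range(n))
    return (cross / s0) / (sum(x * x for x in e) / n)

def spatial_lag(e, W):
    """Weighted average of neighbouring residuals for each area."""
    n = len(e)
    return [sum(W[i][j] * e[j] for j in range(n)) / sum(W[i]) for i in range(n)]

def slrc(e, W):
    """OLS intercept and slope from regressing e on its spatial lag."""
    lag = spatial_lag(e, W)
    n = len(e)
    mx, my = sum(lag) / n, sum(e) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(lag, e))
    sxx = sum((x - mx) ** 2 for x in lag)
    rho1 = sxy / sxx
    return my - rho1 * mx, rho1

# Hypothetical example: four areas in a chain (1-2-3-4), binary adjacency
W = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
e = [1.0, 1.0, -1.0, -1.0]       # like-signed residuals cluster together
moran = morans_i(e, W)
rho0, rho1 = slrc(e, W)
```

Here neighbouring residuals mostly share a sign, so both Moran's I and the slope ρ1 are positive.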
Standard ways to deal with spatially correlated errors are to include a spatially lagged
response as a predictor, or to incorporate spatial effects in the residual specification.
Correcting for spatial correlation in this way may affect the significance and direction of
predictor effects, as compared to a model with non-spatial error structure (Kuhn, 2007).

7.8.2 Simultaneous Autoregressive Models


Including a lagged response in normal linear models for spatially defined observations
provides the spatial autoregressive or spatial lag model (Darmofal, 2015) whereby:

$$y_i = \rho\sum_j c_{ij}y_j + X_i\beta + u_i,$$

where $-1 \le \rho \le 1$, $u_i \sim N(0, \sigma^2)$ are iid, and $c_{ij} = w_{ij}/\sum_j w_{ij}$ are row-standardised spatial
interactions, with $\sum_j c_{ij} = 1$. This model has been extended to binary responses, possibly
by introducing augmented data, with a widely applied approach being known as the
spatial probit (Holloway et al., 2002; Franzese and Hays, 2007). The spatial lag model can
be estimated using a Bayesian approach in R-INLA, and in rstan (see Chapter 6), while the
spatial lag probit model can be estimated using the spatialprobit package (Wilhelm
and de Matos, 2013).
So, for yi binary, zi is a latent metric variable, positive when yi = 1 and negative when yi = 0.
Then the spatial lag model is

$$z_i = \rho\sum_j c_{ij}z_j + X_i\beta + u_i,$$

$$u_i \sim N(0, \sigma_u^2),$$

with $\sigma_u^2 = 1$ for identifiability. In econometric or voting applications, this might amount to
expecting individuals located at similar points in space to exhibit similar choice behaviour
(Smith and Lesage, 2004). In matrix terms

$$z = \rho Cz + X\beta + u,$$

and solving for z gives

$$z = (I - \rho C)^{-1}X\beta + u^*,$$

where $u^* = (I - \rho C)^{-1}u$ are correlated disturbances with $u^* \sim N(0, \Omega)$, where

$$\Omega = (I - \rho C)^{-1}[(I - \rho C)^{-1}]'.$$
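
The multiplier $(I - \rho C)^{-1}$ in the reduced form can be approximated without a matrix inversion routine, using the series $I + \rho C + \rho^2C^2 + \cdots$, which converges for |ρ| < 1 when C is row-standardised. The Python sketch below is illustrative only: the function names and the four-area chain adjacency matrix are hypothetical, and every area is assumed to have at least one neighbour.

```python
def row_standardise(W):
    """c_ij = w_ij / sum_j w_ij, so that each row sums to 1."""
    return [[w / sum(row) for w in row] for row in W]

def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def sar_multiplier(C, rho, terms=200):
    """Approximate (I - rho*C)^{-1} by I + rho*C + rho^2*C^2 + ... (needs |rho| < 1)."""
    n = len(C)
    total = identity(n)
    power = identity(n)
    for _ in range(terms):
        power = [[rho * v for v in row] for row in matmul(power, C)]   # rho^k C^k
        total = [[total[i][j] + power[i][j] for j in range(n)] for i in range(n)]
    return total

# Hypothetical example: four areas in a chain, binary adjacency
W = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
C = row_standardise(W)
M = sar_multiplier(C, rho=0.5)
```

Each successive term spreads a predictor effect to neighbours of increasing order. Because the rows of C sum to 1, each row of the multiplier sums to 1/(1 − ρ).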
An alternative solution to spatially correlated residuals – especially if there is no strong
evidence for spatial lag effects – is to include spatial structure in the errors. A rationale is
that effects of unknown predictors spill across adjacent areas, causing spatially correlated
errors. Instead of normal linear regression with iid errors, the spatial error model specifies

$$y_i = X_i\beta + \varepsilon_i,$$

$$\varepsilon_i = \rho\sum_j c_{ij}\varepsilon_j + u_i,$$

with $u = (u_1, \ldots, u_n)' \sim N_n(0, \Sigma_u)$, and a maximum possible value of 1 for ρ, since the spatial
weights are standardised. A lower prior limit for ρ of 0 may be assumed, since negative values are
implausible.
Writing the equation for $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)$ as $\varepsilon = (I - \rho C)^{-1}u$, the covariance matrix for ε is

$$(I - \rho C)^{-1}\Sigma_u(I - \rho C')^{-1},$$

and with $D = I - \rho C$, the joint prior for ε is obtained as

$$(\varepsilon_1, \ldots, \varepsilon_n) \sim N_n(0, D^{-1}\Sigma_u(D')^{-1}).$$


Assuming $\Sigma_u = \sigma^2I$ the likelihood is

$$L(\beta, \rho, \sigma^2 \mid y) = \frac{1}{(2\pi\sigma^2)^{n/2}}|D'D|^{0.5}\exp\left[-\frac{1}{2\sigma^2}(y - X\beta)'D'D(y - X\beta)\right].$$

7.8.3 Conditional Autoregression
By contrast to SAR spatial models, conditional autoregressive error schemes (Besag, 1974)
specify εi conditional on remaining effects ε[i]. One option takes unstandardised spatial
interactions with

$$E(\varepsilon_i \mid \varepsilon_{[i]}) = \lambda\sum_{j \ne i} w_{ij}\varepsilon_j,$$

$$\mathrm{Var}(\varepsilon_i \mid \varepsilon_{[i]}) = \sigma^2,$$

with joint covariance $\sigma^2(I - \lambda W)^{-1}$. In this case (Bell and Broemeling, 2000), λ is constrained
by the eigenvalues Ek of W, namely $\lambda \in [1/E_{\min}, 1/E_{\max}]$. Conditional variances may differ
between subjects with $M = \mathrm{diag}(\sigma_i^2)$ and the covariance is then $(I - \lambda W)^{-1}M$ (Lichstein et
al., 2002).
If predictor effects are written $\eta_i = X_i\beta$, this formulation may be restated in terms of an
own area regression effect, and a filtered effect of neighbouring regression residuals. Thus
for y metric

$$y_i \sim N\left(\eta_i + \lambda\sum_{j \ne i} w_{ij}(y_j - \eta_j), \sigma^2\right).$$

In many spatial health applications yi are Poisson counts, with means $\nu_i = E_i\rho_i$ where Ei are
expected events, and ρi are unknown relative risks. One may then (Bell and Broemeling,
2000; Assunção and Krainski, 2009) assume $r_i = \log(\rho_i)$ are Normal with

$$r_i \sim N\left(\eta_i + \lambda\sum_{j \ne i} w_{ij}(r_j - \eta_j), \sigma^2\right).$$
The other conditional autoregressive option takes standardised spatial interactions, with
conditional means and variances

$$E(\varepsilon_i \mid \varepsilon_{[i]}) = \kappa\sum_{j \ne i} c_{ij}\varepsilon_j,$$

$$\mathrm{Var}(\varepsilon_i \mid \varepsilon_{[i]}) = \sigma^2\Big/\sum_{j \ne i} w_{ij}.$$

The joint covariance for the εi is then $\sigma^2(D - \kappa W)^{-1}$, where D is diagonal with $d_i = \sum_{j \ne i} w_{ij}$
(Sun et al., 1999, p.342). Equivalently, for binary wij, the diagonal terms of the precision
matrix are τdi where $\tau = 1/\sigma^2$ (Kruijer et al., 2007), while off-diagonal terms equal −τκ when
i and j are neighbours and 0 otherwise. For κ = 1, one obtains the CAR(1) prior of Besag et
al. (1991), for which the joint precision matrix is singular, so that the joint covariance
matrix is no longer defined and the prior is improper.
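
The structure of the precision matrix τ(D − κW) can be made concrete with a small example. The Python sketch below is illustrative only: the function name `car_precision`, the three-area chain, and the values of τ and κ are hypothetical. It shows that the rows sum to zero when κ = 1, which is the source of the impropriety of the CAR(1) prior.

```python
def car_precision(W, tau, kappa):
    """Precision matrix tau * (D - kappa * W), with D = diag(row sums of W)."""
    n = len(W)
    return [[tau * ((sum(W[i]) if i == j else 0.0) - kappa * W[i][j])
             for j in range(n)] for i in range(n)]

# Hypothetical example: three areas in a chain (1-2-3), binary adjacency
W = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]

Q_intrinsic = car_precision(W, tau=2.0, kappa=1.0)   # CAR(1): rows sum to zero
Q_proper = car_precision(W, tau=2.0, kappa=0.9)      # rows sum to tau * d_i * (1 - kappa)
row_sums_intrinsic = [sum(row) for row in Q_intrinsic]
row_sums_proper = [sum(row) for row in Q_proper]
```

With κ below 1 the matrix is strictly diagonally dominant with positive diagonal, so it is positive definite and the joint prior is proper.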

7.8.4 Spatially Varying Regression Effects: GWR and Bayesian SVC Models


As mentioned above, the assumption of constant parameter values over space may often
be unrealistic, and allowing spatial variation in regression parameters may both improve
fit and account for spatially correlated residuals (e.g. Leung et al., 2000; Osborne et al.,
2007). A widely applied method is geographically weighted linear regression (GWR),
which involves re-using the data n times, with the ith regression regarding area i as origin.
With R predictors, coefficients $\beta_{1i}, \ldots, \beta_{Ri}$ for the ith regression are derived using spatial
interaction weights wik, which in concert with a precision parameter τi define area-specific
precision parameters τiwik. So for the ith regression (centred on area i), the response for area
k is modelled as

$$y_k \sim N\left(\mu_{ik}, \frac{1}{\tau_i w_{ik}}\right), \quad k = 1, \ldots, n,$$

$$\mu_{ik} = \beta_0 + \beta_{1i}x_{1k} + \cdots + \beta_{Ri}x_{Rk}.$$

The corresponding weighted least squares estimator for $\beta_i = (\beta_{1i}, \ldots, \beta_{Ri})$ is

$$\beta_i = (X'W_iX)^{-1}X'W_iy,$$

where Wi is an n × n diagonal matrix with entries wik (Assuncao, 2003).
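
For a single predictor plus intercept, the weighted least squares estimator has a closed form, so the repeated local fitting in GWR can be sketched without matrix routines. The Python sketch below is illustrative only: the function names, the Gaussian distance kernel, the bandwidth, and the data are all hypothetical choices, and both the intercept and the slope are fitted locally.

```python
import math

def gaussian_kernel_weights(distances, bandwidth):
    """Hypothetical kernel: w_ik = exp(-(d_ik / bandwidth)^2)."""
    return [math.exp(-(d / bandwidth) ** 2) for d in distances]

def weighted_least_squares(x, y, w):
    """Intercept and slope minimising sum_k w_k * (y_k - b0 - b1 * x_k)^2."""
    sw = sum(w)
    xbar = sum(wk * xk for wk, xk in zip(w, x)) / sw
    ybar = sum(wk * yk for wk, yk in zip(w, y)) / sw
    sxy = sum(wk * (xk - xbar) * (yk - ybar) for wk, xk, yk in zip(w, x, y))
    sxx = sum(wk * (xk - xbar) ** 2 for wk, xk in zip(w, x))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1

# Hypothetical data: five areas on a line, response exactly linear in x
sites = [0.0, 1.0, 2.0, 3.0, 4.0]
x = [0.5, 1.5, 1.0, 2.5, 3.0]
y = [1.0 + 2.0 * xk for xk in x]

local_fits = []
for s_i in sites:                 # one regression per area, centred on area i
    w = gaussian_kernel_weights([abs(s_i - s_k) for s_k in sites], bandwidth=1.5)
    local_fits.append(weighted_least_squares(x, y, w))
```

Because the data here follow one exact line, every local fit returns the same intercept and slope. With real data the fits differ between areas, and a small bandwidth leaves few effective observations per fit.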


Lesage (2004) notes that GWR estimates may suffer from weak identification as the effec-
tive number of observations used to produce estimates for some points in space may be
small. This problem can be alleviated under a Bayesian approach by incorporating prior
information. Lesage (2004) and Lesage and Kelley Pace (2009) reframe the GWR scheme
to allow spatially nonconstant variance scaling parameters vi, subject to an exchangeable
chi-square prior density, $v_i \sim \chi^2(r)$, with r a hyperparameter. Lesage (2004) also redefines
the wik as normalised distance-based weights (with wii = 0). Then

$$W_iy = W_iX\beta_i + \varepsilon_i,$$

with smoothing of regression effects across space represented as

$$\beta_i = (w_{i1} \otimes I_R, \ldots, w_{in} \otimes I_R)\begin{pmatrix}\beta_1 \\ \vdots \\ \beta_n\end{pmatrix} + u_i.$$

With $V_i = \mathrm{diag}(v_1, \ldots, v_n)$, the error terms have priors

$$\varepsilon_i \sim N(0, \sigma^2V_i),$$

$$u_i \sim N(0, \sigma^2\delta^2(X'W_i^2X)^{-1}),$$

with the specification on ui being a form of Zellner g-prior, in which δ2 governs adherence
to the smoothing specification.

7.8.5 Bayesian Spatially Varying Coefficients


An alternative to the GWR approach is provided by spatially varying coefficient (SVC)
models (Gelfand et al., 2003; Assuncao, 2002; Gamerman et al., 2003; Wheeler and Calder,
2006, 2007). For a continuous space perspective, let Y(s) be the n × 1 response for locations
$s = (s_1, \ldots, s_n)$, and β be a nR × 1 stacked vector of spatially varying regression coefficients.
Then the normal SVC model is (Gelfand et al., 2003),

$$Y(s) \sim N(X(s)'\beta(s), \sigma^2I),$$

where X(s) is a n × nR block diagonal matrix of predictors. The prior for β is

$$\beta(s) \sim N(1_{n \times 1} \otimes \mu_\beta, V_\beta),$$

where $\mu_\beta = (\mu_{\beta 1}, \ldots, \mu_{\beta R})'$ contains mean regression effects, and Vβ is the nR × nR covariance
matrix defined as

$$V_\beta = C(\varsigma) \otimes \Lambda,$$

where Λ is a R × R matrix containing covariances between regression coefficients at any
particular location, and $C(\varsigma) = [c(s_i - s_j; \varsigma)]$ is a n × n correlation matrix representing
spatial interaction between locations or areas, with ς denoting hyperparameters. For example,
under exponential spatial interaction (Wheeler and Calder, 2007)

$$c(s_i - s_j; \varsigma) = \exp(-d_{ij}/\varsigma),$$

where dij is the distance between locations si and sj, and ς is a positive parameter.
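
The separable covariance structure can be assembled directly from its two components. The Python sketch below is illustrative only: the function names, the coordinates, the scale parameter, and the matrix Λ are hypothetical, and Euclidean distance is assumed.

```python
import math

def exponential_correlation(coords, scale):
    """C[i][j] = exp(-d_ij / scale), with d_ij the Euclidean distance."""
    return [[math.exp(-math.dist(a, b) / scale) for b in coords] for a in coords]

def kronecker(A, B):
    """Kronecker product of two matrices stored as lists of rows."""
    return [[a * b for a in row_a for b in row_b] for row_a in A for row_b in B]

# Hypothetical example: n = 3 locations, R = 2 coefficients per location
coords = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)]
C = exponential_correlation(coords, scale=2.0)
Lam = [[1.0, 0.5],
       [0.5, 2.0]]                 # within-location coefficient covariance
V_beta = kronecker(C, Lam)         # nR x nR = 6 x 6
```

Entry (iR + a, jR + b) of the result is C[i][j] times Λ[a][b], so coefficients at nearby locations are more strongly correlated than those at distant ones.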


For discrete areas, distance-based kernel schemes for spatial interaction have a less
substantive basis. For $i = 1, \ldots, n$ such areas, let $\beta_i = (\beta_{1i}, \ldots, \beta_{Ri})$ denote spatially varying
regression effects in the linear predictor

$$\eta_i = \sum_{r=1}^{R}\beta_{ri}x_{ri},$$

of a generalised linear model with mean $\mu_i = E(y_i)$, and link $g(\mu_i) = \eta_i$. With $\beta = (\beta_1, \ldots, \beta_n)$, one
possible spatially structured scheme is a pairwise difference prior (Assuncao, 2003)

$$p(\beta \mid \Phi) \propto |\Phi|^{n/2}\exp\left\{-0.5\sum_i\sum_j w_{ij}(\beta_i - \beta_j)'\Phi(\beta_i - \beta_j)\right\},$$

with R × R precision matrix Φ, and spatial interactions wij usually binary (wij = 1 when areas
i and j are adjacent, zero otherwise). When yi is metric with μi = ηi, with residual precision
τ = 1/σ2, one may, following Gamerman et al. (2003), scale the precision by τ, namely

$$p(\beta \mid \Phi, \tau) \propto \tau^{nR/2}|\Phi|^{n/2}\exp\left\{-0.5\tau\sum_i\sum_j w_{ij}(\beta_i - \beta_j)'\Phi(\beta_i - \beta_j)\right\}.$$
The covariance matrix for β in these specifications is respectively $K^{-1} \otimes \Phi^{-1}$ and $\sigma^2K^{-1} \otimes \Phi^{-1}$,
where K has elements

$$k_{ii} = w_{i+} = \sum_{j \ne i} w_{ij},$$

$$k_{ij} = -w_{ij}, \quad i \ne j.$$

These priors are improper because the elements in each row of K add to zero, so that K is singular.
Assuncao (2003, p.460) notes that propriety can be obtained by a constraint such as
$\sum_i \beta_i = A$, where A is any preset R-vector. This consideration leads to a practical strategy
representing βi as $\beta_i = \mu_\beta + b_i$ where the bi follow the pairwise difference prior, but are zero
centred at each MCMC iteration, and the mean regression effect is $\mu_\beta = (\mu_{\beta 1}, \ldots, \mu_{\beta R})$. This
can be implemented using the car.normal or mvcar options in BUGS.

7.8.6 Bayesian Spatial Predictor Selection Models


For spatially configured datasets, the relevance of particular predictors, and whether their
impact should vary spatially, may itself vary between areas or locations. Allowing for
spatial heterogeneity may affect inferences regarding the importance of predictors. However,
appropriate models are generally highly parameterised and careful specification of priors
may be needed to achieve effective sampling, with potential problems of identifiability,
sensitivity to priors, and poor mixing. As discussed above, either selection indicators or
shrinkage priors may be invoked. Assuming a selection indicator approach, let $\beta_{ij}^{(r)}$ denote
realised regression coefficients for area/location i and predictor j, as determined by priors
on both coefficients and selection indicators.
Under a scheme proposed by Reich et al. (2010), assessing predictor relevance has two
stages: (a) whether a predictor is irrelevant, or is relevant with a homogenous spatial effect
(constant over locations), and conditional on a homogenous effect being relevant, then (b)
assessing whether a spatially varying effect is justified. Let γj denote a binary selection
indicator for stage (a) and δj denote an indicator for stage (b). Then the three possible
relevant indicator pairings are {γj = 0, δj = 0}, {γj = 1, δj = 0}, and {γj = 1, δj = 1}. So, one can define
indicators γaj = γj and γbj = γjδj as relevant to each stage, with two sets of suitably
corresponding priors on regression coefficients. For example, for stage (a) a spike-slab prior
could be set on coefficients $\mu_{\beta j}$, and for stage (b) an SVC prior on zero centred spatial effects
bij, with realised coefficients $\beta_{ij}^{(r)} = \mu_{\beta j} + b_{ij}$ when γbj = 1, and $\beta_{ij}^{(r)} = \mu_{\beta j}$ when γbj = 0.
Under a scheme proposed by Choi and Lawson (2015), spatial structure is applied to the
selection indicators, which are area and predictor specific. Thus, one option is to take a
hierarchical prior on the regression coefficients, with spatial dependence specified through
the inclusion mechanism. Thus

$$\beta_{ij}^{(r)} = \beta_{ij}\gamma_{ij},$$

$$\beta_{ij} \sim N(\mu_{\beta j}, 1/\tau_{\beta j}),$$

$$\gamma_{ij} \sim \mathrm{Bern}(\rho_{ij}),$$

$$\mathrm{logit}(\rho_{ij}) = w_j + r_{ij},$$

where the rij are entirely spatially structured, as under a CAR(1) prior, or admit spatial
structure, as under the Leroux et al. (1999) scheme.

Example 7.13 Very Low Birthweight, New York Counties


This example considers numbers of very low birthweight babies (under 1500 g) in 62
New York counties over 2008–12. Expected VLBW births, Ei, are obtained as total births
in a county multiplied by the region-wide VLBW rate; so $\sum_i y_i = \sum_i E_i$. Predictors are
x1 = median household income; and x2 = income inequality (ratio of household income at
the 80th percentile to income at the 20th percentile). These predictors are standardised.
A negative effect of x1 and positive effect of x2 would be expected.
A Poisson regression is fitted first, and shows significant impacts (negative and posi-
tive respectively) for the predictors. β1 and β2 have respective means (95% CrI) of −0.077
(−0.09,−0.065) and 0.037 (0.03,0.045). This model has a DIC of 589. The spatial lag regres-
sion coefficient (SLRC) discussed above has a 95% credible interval (−0.06,0.48) only just
overlapping zero, indicating spatially correlated residuals.
One possible remedy is to include a spatially structured residual in the regression.
Accordingly, a second model includes an additive CAR(1) spatial error si, and for
identifiability omits the intercept, which is estimated as the mean of the si. The prior on the
residual is specified from first principles, rather than using the car.normal function in
BUGS. Thus with $y_i \sim \mathrm{Po}(E_i\rho_i)$,

$$\log(\rho_i) = \beta_1x_{1i} + \beta_2x_{2i} + s_i,$$

with a Ga(0.5,0.0005) prior on the precision of si. This gives a considerably improved DIC
of 464, with significant effects remaining for both predictors, including an enhanced
effect of x2. Thus, β1 and β2 have respective means (95%CrI) of −0.08 (−0.10,−0.02) and 0.06
(0.01,0.12). Such a change in the strength of predictor effects demonstrates the impor-
tance of correct error specification for inferences regarding risk factors in spatial data.
Spatial correlation in residuals is removed as judged by an SLRC with 95% interval
(−1.03,0.74).
An alternative possible solution to spatially correlated residuals is to consider spatial
nonstationarity in predictor effects. Here

$$\log(\rho_i) = \beta_0 + \beta_{1i}x_{1i} + \beta_{2i}x_{2i},$$

where the βki follow independent CAR(1) priors. This model gives a DIC of 486, while
posterior means (95% intervals) for $\mu_{\beta 1}$ and $\mu_{\beta 2}$ are obtained as −0.03 (−0.09,0.04) and
0.17 (0.09,0.25).
Adding a spatial residual to this model leads to

$$\log(\rho_i) = \beta_{1i}x_{1i} + \beta_{2i}x_{2i} + s_i,$$

where priors on spatial effects are specified from first principles. This improves the DIC
to 472. This is a slight loss of fit compared to the spatial residual model, but acknowledging
heterogeneity in regression effects may often be important on substantive
grounds, and may impact on average regression effects over all areas. Posterior means
(95% intervals) for $\mu_{\beta 1}$ and $\mu_{\beta 2}$ are obtained as −0.05 (−0.13,0.03) and 0.13 (0.05,0.22), so
that recognising heterogeneity has much enhanced the inequality effect, as compared
to the spatial residual model.
Wheeler and Tiefelsdorf (2005, p.169) mention implausibly signed effects when using a
classical GWR approach. The Bayesian SVC approach reveals four counties with posterior
probabilities $\Pr(\beta_{1i} > 0 \mid y)$ over 0.25 (the maximum being 0.27), and no county with
a posterior probability $\Pr(\beta_{2i} > 0 \mid y)$ under 0.75 (the minimum being 0.76).

Example 7.14 EU Referendum Voting


This example considers voting data for 380 local authorities in England, Wales, and
Scotland, namely the proportion voting for Britain’s exit (“Brexit”) from the European
Union in the 2016 Referendum. The level of Brexit voting has been linked, inter alia, to
age structure in different areas, proportions of adults with higher qualifications, pro-
portions of residents born outside the UK, and urban status. Here the proportion of over
65s is used to measure age structure, and population density (1,000 persons per hectare)
is used as a measure of urbanity.
An indication of collinearity distorting regression parameters is provided by spatial
lag probit regression (via the spatialprobit package) applied to the outcome variable,
y = 1 for Brexit votes above the median proportion of 0.543, and y = 0 otherwise. The spa-
tial interaction matrix for this analysis is based on the five nearest neighbours to each of
the 380 areas, with a sparse matrix representation adopted to assist computation.
A univariate regression on the proportion non-UK-born shows a significant negative
effect on Brexit voting, but this coefficient changes sign (while remaining significant)
when all four predictors are included in the regression. By far the strongest predictor,
with a negative effect on Brexit voting, is the area proportion with higher education, with
posterior mean (sd) of −20.8 (2.28). The population density covariate is not significant.
A spatial lag model is also fitted in R-INLA with response defined by the logit of the
proportion voting Brexit. The predictors are higher education, over 65s, non-UK-born,
and population density. This analysis provides a posterior mean (sd) on the higher edu-
cation and age variables of −4.16 (0.57) and 2.18 (1.05). The other predictors have non-
significant effects.
A conditional autoregressive approach is also applied using R2OpenBUGS, without
a spatial lag on Brexit voting. This analysis is based on spatial adjacency of the local
authorities, with binomial response based on Brexit voters yi and total voters Vi. To
ensure all areas have spatial adjacencies, links are introduced for island areas.
An initial analysis retains all predictors (which are standardised), and assumes a
Leroux et al. (1999) prior (denoted LLB prior) for random effects in the logit regression.
From a two-chain run of 10,000 iterations, the LLB spatial dependence parameter has
mean (sd) of 0.92 (0.05) showing high spatial association between regression residuals.
The regression coefficients show a negative effect of the higher education and population
density predictors on Brexit voting, and a positive effect of over 65s, but with the
impact of the non-UK-born predictor inconclusive.
Conventional predictor selection using an SSVS prior (George and McCulloch, 1993)
is then applied. The prior precision on the regression coefficients is set at 1 under reten-
tion, and 1,000 under exclusion. From a two-chain run of 100,000 iterations, this model
shows posterior retention probabilities of 0.95 for higher education, 1 for the over 65s
predictor, 0.93 for population density, but below 0.5 (namely 0.10) for the non-UK-born
predictor. Fit is essentially unchanged under the DIC criterion, namely 5,177 as com-
pared to 5,173 in the analysis without selection.
A final analysis adopts the approach of Choi and Lawson (2015) with spatial selection
indicators γij (for area i, and predictor j) following CAR(1) priors. Thus

$$y_i \sim \mathrm{Bin}(V_i, p_i),$$

$$\mathrm{logit}(p_i) = \beta_0 + \sum_j X_{ij} \beta_{ij}^{(r)} + s_i,$$

$$\beta_{ij}^{(r)} = \beta_{ij} \gamma_{ij},$$

$$\beta_{ij} \sim N(\mu_j, 1/\tau_j),$$

$$\gamma_{ij} \sim \mathrm{Bern}(\rho_{ij}),$$

$$\mathrm{logit}(\rho_{ij}) = \omega_j + r_{ij},$$

where si are Leroux et al. (1999) effects, whereas the rij are CAR(1). A two-chain run
of 100,000 iterations does not attain convergence according to Brooks-Gelman-Rubin
(BGR) statistics. Inferences at this stage show the highest retention probabilities (found
by averaging γij over areas) for the higher education and population density variables.
These predictors both have negative effects, with posterior means (sd) of −0.33 (0.02)
and −0.10 (0.04), as assessed from posterior mean $\beta_{ij}^{(r)}$ averaged over areas. The age 65+
and non-UK-born predictors have retention probabilities below 0.50.

7.9 Adjusting for Selection Bias and Estimating Causal Effects


The aim of much of social and health science research is to understand the effect of a treat-
ment, intervention, or exposure on an outcome when only observational non-randomised
data are available. Conventional regression may misrepresent treatment effects when
there is selection bias (e.g. treated subjects differ from untreated subjects in terms of base-
line health status or income). While many applications focus on the effects of a treatment
or instructional program, the conception of treatment extends to demographic variables
(Davis et al., 2017), and includes events such as dropping out of school (e.g. Vaughn et al.,
2011), where the outcome of interest may be adult earnings or verbal ability, and the goal is
to assess impacts of dropout after control for confounders. Since there may be intervening
variables in the causal pathway, there is also often interest in decomposing the effect of a
treatment or exposure into direct and mediated effects.

7.9.1 Propensity Score Adjustment


Propensity score (PS) methods are used in an attempt to reduce the potential bias in esti-
mated effects (e.g. of a treatment or intervention) obtained from observational studies,
when a treatment is not randomly assigned to subjects. They allow adjustment for mul-
tiple confounders without needing to specify a model for their association with the out-
come. Estimated effects may be of a treatment or focus risk factor X (often binary), while
C denotes remaining covariates, often considered as confounders. The propensity score is
the estimated probability of treatment assignment conditional on confounders. The pro-
pensity score acts as a balancing score, such that conditional on the score, the distribution
of confounders will be similar between treated and untreated subjects.
Let Y(x) denote a possibly counterfactual outcome when the treatment or exposure has
value X = x. So for a binary treatment, the outcome values are Y(1) and Y(0) of which only
one is observed. Let $A \perp B \mid C$ denote that A is independent of B given C. Then a propensity
score is sufficient to adjust for confounding provided

$$\{Y(0), Y(1)\} \perp X \mid C,$$

namely, treatment assignment is ignorable, given the confounders (Vansteelandt and
Daniel, 2014).

Suppose a logit regression is used to predict $\Pr(X_i = 1 \mid C_i, \gamma)$, so that the propensity score
is $S_i = 1/[1 + \exp(-\gamma C_i)]$. It is potentially important to exclude insignificant confounder
variables (Weitzen et al., 2004), so one may include Bayesian variable selection in the
estimation of the propensity score, for example, using binary retention indicators δk,

$$\mathrm{logit}(S_i) = \gamma_0 + \sum_k \delta_k \gamma_k C_{ik}. \qquad (7.3)$$

Options for subsequent analysis are then to apply the score in a regression to predict Y,
stratify the sample according to propensity score (e.g. into quintiles or deciles),
match on the propensity score, or use inverse weighting by the propensity score. Suppose
the subsequent analysis involves regression. Regression to assess effects of exposure or
treatment can then be (a) on X and S; (b) on X and groupings of S (e.g. a categorical variable
based on a decile grouping of the S scores); or (c) on X, S and C.
A Bayesian approach may be specified using a joint likelihood (Zigler et al., 2013;
McCandless et al., 2009). Suppose Y is binary with $p_i = \Pr(Y_i = 1 \mid X_i, C_i)$, then the joint
likelihood consists of (7.3) and an outcome model such as (Zigler and Dominici, 2014, section 2.3)

$$\mathrm{logit}(p_i) = \beta_0 + \beta_1 X_i + \beta_2 S_i + \sum_k \delta_k \theta_k C_{ik} \qquad (7.4)$$

where the selection indicators δk are common to both regressions. An average treatment or
exposure effect

$$\Delta = E(Y = 1 \mid X = 1, C) - E(Y = 1 \mid X = 0, C)$$

may be calculated by comparing estimated responses for each subject at X = 1 and X = 0 in
(7.4) (Davis et al., 2017).
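
The comparison of responses at X = 1 and X = 0 can be sketched in a few lines. The following Python fragment is only an illustration under stated assumptions: the function names, coefficient values, propensity scores, and confounder values are all hypothetical, and a single set of parameter values stands in for one MCMC draw. In a Bayesian analysis the same calculation would be repeated at every iteration to give a posterior distribution for the average effect.

```python
import math

def inv_logit(z):
    """Inverse logit transform."""
    return 1.0 / (1.0 + math.exp(-z))

def average_effect(beta, theta, delta, S, C):
    """Average over subjects of p_i(X=1) - p_i(X=0) under outcome model (7.4)."""
    b0, b1, b2 = beta
    diffs = []
    for s_i, c_i in zip(S, C):
        # residual confounding term: sum_k delta_k * theta_k * C_ik
        resid = sum(d * t * c for d, t, c in zip(delta, theta, c_i))
        base = b0 + b2 * s_i + resid
        diffs.append(inv_logit(base + b1) - inv_logit(base))
    return sum(diffs) / len(diffs)

# Hypothetical values for one parameter draw and three subjects
beta = (-2.0, -0.5, 3.0)      # (beta0, beta1, beta2)
theta = (0.4, 0.0)            # confounder coefficients
delta = (1, 0)                # retention indicators
S = [0.2, 0.5, 0.8]           # propensity scores
C = [(0.1, 1.0), (-0.3, 0.0), (0.7, 1.0)]

ate = average_effect(beta, theta, delta, S, C)
print(round(ate, 4))
```

Because the link is nonlinear, the per-subject differences vary with the propensity score and confounders even though β1 is common to all subjects, which is why the effect is averaged rather than read off a single coefficient.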
The estimation of the propensity score S via a joint likelihood contrasts with a separate
stage perspective whereby the propensity score is intended to approximate the design stage
of a randomised study, without access to the outcome. In accord with a two-stage perspec-
tive, one may instead apply a quasi-Bayesian approach whereby feedback is cut between
the PS model and the outcome model (McCandless et al., 2010; Zigler and Dominici, 2014,
section 3.2).
For hierarchical data (e.g. subjects nested within institutions, or within areas) contri-
butions of covariates to treatment assignment may vary across institutions. In terms of
multilevel coefficients, this implies that a random slope analysis is needed to represent
institutionally varying or area-varying effects of covariates. Such a cross-level interaction
effect on the probability of receiving treatment means that each institution then has a dif-
ferent propensity equation. If Cij denotes individual confounders and Wj denotes institu-
tional confounders, then the ignorability assumption is now stated as

$$\{Y(0), Y(1)\} \perp X \mid C, W.$$

The aim of the propensity score method is to ensure that within groups homogeneous on
the propensity score, the distributions of the covariates are essentially the same for treated
and untreated subjects (Austin, 2009). The achievement of covariate balance may be tested
(Baser, 2006) e.g. by testing for significant differences in covariate distributions within
propensity score strata.
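
A balance check of this kind can be sketched as follows. This is a minimal illustration, not the procedure of Austin (2009) or Baser (2006): the helper names and the toy data are hypothetical, strata are formed by simple rank-based quintiles, and only raw treated-minus-untreated mean differences are reported rather than a formal significance test.

```python
from statistics import mean

def strata_by_score(scores, n_strata=5):
    """Assign each subject to a stratum (0..n_strata-1) by rank of propensity score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    stratum = [0] * len(scores)
    for rank, i in enumerate(order):
        stratum[i] = rank * n_strata // len(scores)
    return stratum

def within_stratum_differences(x, c, stratum, n_strata=5):
    """Treated-minus-untreated mean of covariate c per stratum (None if a group is empty)."""
    out = []
    for g in range(n_strata):
        treated = [ci for xi, ci, si in zip(x, c, stratum) if si == g and xi == 1]
        control = [ci for xi, ci, si in zip(x, c, stratum) if si == g and xi == 0]
        out.append(mean(treated) - mean(control) if treated and control else None)
    return out

# Hypothetical toy data: 10 subjects
scores = [0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95]
x = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
age = [50, 52, 55, 54, 60, 61, 63, 66, 68, 69]

stratum = strata_by_score(scores)
diffs = within_stratum_differences(x, age, stratum)
print(stratum, diffs)
```

Small within-stratum differences relative to the overall treated-untreated difference would be consistent with the balancing property of the score.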

Example 7.15 Patients Hospitalised for Suspected Myocardial Infarction


This example uses the data at http://web.hku.hk/~bcowling/examples/propensity.htm.
The data consist of 400 subjects in a retrospective cohort study of men aged 40–70
admitted to hospital with suspected myocardial infarction, with a binary outcome Y for
30-day mortality. To be assessed is the impact of a newer clot-busting drug (X = 1) versus
a standard therapy (X = 0) on the risk of mortality. Confounders are, respectively, age, an
admission severity score (on a scale from 0 to 10, 10 being worst), and a risk factor score
(on a scale from 0 to 5, 5 being worst).
Two models are compared using jagsUI. The first involves a logit propensity score
model predicting Si from the three confounders, namely

$$\mathrm{logit}(S_i) = \gamma_0 + \sum_{k=1}^{3} \gamma_k C_{ik},$$

and with an outcome model assuming no residual confounding, namely

$$\mathrm{logit}(p_i) = \beta_0 + \beta_1 X_i + \beta_2 S_i.$$

This model has a satisfactory performance in reproducing the data based on posterior
predictive tests using the Brier score.
The three confounders have significant positive effects in the propensity score regres-
sion, so that patients receiving the new drug have a distinctly adverse risk profile.
Within quintiles of the propensity score, differences between confounder profiles are
not significant (these are represented by match.C1, match.C2, and match.C3 in the code).
In the outcome model, β1 and β2 have respective posterior means (sd) of −0.47 (0.28) and
2.83 (1.15). An estimate of the causal effect (of the new drug in reducing mortality) is
based on evaluating outcomes $p_{1i} = \mathrm{logit}^{-1}(\beta_0 + \beta_1 X_i + \beta_2 S_i)$ at the setting
$X_i = 1$, and $p_{0i} = \mathrm{logit}^{-1}(\beta_0 + \beta_2 S_i)$ at the setting $X_i = 0$, for
all subjects. Then the causal effect (mort.X in the code) is estimated as the average of the
differences $p_{1i} - p_{0i}$, which has a mean (95% CRI) of −0.065 (−0.142, 0.006). LOO-IC
criteria are obtained separately for the propensity score and outcome models as 529 and 369.
A second model extends the propensity score model to include quadratic and interaction
terms ($C_1^2, C_2^2, C_3^2, C_1 C_2, C_1 C_3, C_2 C_3$), and also allows for confounder selection
feedback, with a residual confounding effect in the outcome model. Additionally, rather
than selection via SSVS or other spike-slab priors (Zigler and Dominici, 2014), horseshoe
shrinkage priors are used, with a sharing of the shrinkage parameters between propensity
and outcome models. Thus

$$\mathrm{logit}(S_i) = \gamma_0 + \sum_{k=1}^{9} \gamma_k C_{ik},$$

$$\gamma_k \sim N(0, \tau_\gamma^2 \rho_k),$$

$$\kappa_k \sim \mathrm{Be}(0.5, 0.5),$$

$$\rho_k = 1/\kappa_k - 1,$$

$$\mathrm{logit}(p_i) = \beta_0 + \beta_1 X_i + \beta_2 S_i + \sum_{k=1}^{9} \eta_k C_{ik},$$

$$\eta_k \sim N(0, \tau_\eta^2 \rho_k).$$

LOO-IC criteria for the propensity score and outcome models are, in fact, now slightly
higher at 533 and 372. Values of κk are above 0.5 (indicating redundant regressors) except
for the main linear terms in C1, C2 and C3. The average causal effect $p_{1i} - p_{0i}$ is little
changed, with mean (95% CRI) of −0.063 (−0.137, 0.009).
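
The behaviour of the shrinkage prior used in the second model can be explored by simulating from it. The sketch below is an assumption-laden illustration rather than the jagsUI code for the example: the function name, the scale value, and the seed are hypothetical. It draws a shrinkage weight κ from Be(0.5, 0.5), converts it to the variance multiplier 1/κ − 1, and then draws a coefficient from the implied normal.

```python
import random

def horseshoe_draw(tau, rng):
    """One prior draw: kappa ~ Be(0.5, 0.5), multiplier = 1/kappa - 1, coef ~ N(0, tau^2 * multiplier)."""
    # guard against a numerically zero kappa before dividing
    kappa = max(rng.betavariate(0.5, 0.5), 1e-12)
    multiplier = 1.0 / kappa - 1.0
    # random.gauss takes a standard deviation, so use tau * sqrt(multiplier)
    coef = rng.gauss(0.0, tau * multiplier ** 0.5)
    return kappa, multiplier, coef

rng = random.Random(42)          # hypothetical seed
draws = [horseshoe_draw(1.0, rng) for _ in range(2000)]

# Share of draws with kappa above 0.5, the threshold used in the text
share_shrunk = sum(k > 0.5 for k, _, _ in draws) / len(draws)
print(round(share_shrunk, 3))
```

Since Be(0.5, 0.5) is U-shaped and symmetric, the prior puts roughly equal mass on κ near 0 (little shrinkage) and κ near 1 (strong shrinkage), so posterior values of κ above 0.5 can be read as evidence that a regressor is redundant.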

7.9.2 Establishing Causal Effects: Mediation and Marginal Models


One may often have preliminary knowledge about different predictors from a substantive
perspective, as in epidemiology, whereby effects of a specific exposure on a disease are
confounded by other predictors (McNamee, 2005). Among questions that occur are estab-
lishing the causal effect of an exposure after controlling for confounders, and establishing
the direct and indirect effects of an exposure or treatment when there are mediators in the
relationship between the exposure and outcome.
Consider the issue of estimating the direct effect of an exposure or treatment (X)
on an outcome (Y) in the presence of mediator (M) and confounders (C). A standard
strategy would be (i) to estimate a regression for Y conditional on X and C, and then
(ii) add M into the model. The extent to which the coefficient of X changes between
these two models is then interpreted as measuring how far the effect of X on Y is
mediated by M. The coefficient on X in stage (i) is taken to represent the total effect of
X on Y, while that from (ii) is taken to represent the direct effect of X on Y not medi-
ated by M. This strategy assumes no interaction between X and M in their impact on Y.
Additionally, estimation of direct effects by such a strategy is complicated if confound-
ers are affected by the exposure, or certain confounders affect both the mediator and
outcome. However, with suitable regression modelling of both outcome and mediator,
even in the presence of the above complications, one may decompose the total causal
effect (the average effect on the outcome of the exposure or treatment) into an indi-
rect effect mediated through the mediator, and the remaining direct effect. These are
represented specifically in the causal mediation literature (Pearl, 2014; Tchetgen and
Vanderweele, 2014; Greenland, 2000) via two quantities: the natural indirect effect and
the natural direct effect.
Let $E[Y\{X, M(X)\}]$ denote the expected value of Y at a stipulated value X, and at the
corresponding value of the mediator M(X). This expected value will be counterfactual
if the observed X differs from the stipulated X. Causal effects may be assessed by
considering counterfactual settings of exposures, which may correspond to a minimal
risk value, such as X* = 0 for a binary exposure. For a continuous exposure such as
body mass index (BMI), the counterfactual level might be a minimum risk level such
as X* = 21.

7.9.3 Causal Path Sequences


Assume the model for the mediator M specifies dependence on the treatment X and con-
founders C, and the model for the outcome Y allows dependence on X, M, on interactions
between X and M (denoted X.M), and on confounders. Symbolically

$$M \sim f_1(X, C),$$

$$Y \sim f_2(X, M, X.M, C).$$

Let M(X*) denote the prediction of the mediator M under the setting X = X*. For example,
suppose M is continuous, and a normal linear regression is specified with

$$M_i \sim N(\eta_i, \sigma_M^2),$$

where

$$\eta_i = \beta_1 + \beta_2 X_i + \beta_3 C_i.$$

Then expected values of M(X) and M(X*) can be obtained as equal to the corresponding
regression terms, namely as ηi and

$$\eta_i^* = \beta_1 + \beta_2 X_i^* + \beta_3 C_i,$$

respectively. Under a Bayesian perspective, they can also potentially be obtained as the
respective predictions

$$M_{\mathrm{new},i} \sim N(\eta_i, \sigma_M^2),$$

$$M_{\mathrm{new},i}^* \sim N(\eta_i^*, \sigma_M^2).$$

The total effect is

$$E[Y\{X, M(X)\}] - E[Y\{X^*, M(X^*)\}],$$

while the natural direct effect (Lange et al., 2012) compares $E[Y\{X, M(X^*)\}]$ and
$E[Y\{X^*, M(X^*)\}]$, namely

$$E[Y\{X, M(X^*)\}] - E[Y\{X^*, M(X^*)\}].$$

The natural indirect, or mediated, effect is the difference between the total effect and the
natural direct effect, namely

$$E[Y\{X, M(X)\}] - E[Y\{X, M(X^*)\}].$$

An additional effect sometimes of interest (Naimi et al., 2014b; Vanderweele, 2013) is
the controlled direct effect (CDE), namely the effect of exposure on outcome if the mediator
is controlled uniformly at a particular value of M, say M.c. Then

$$\mathrm{CDE} = E[Y\{X, M.c\}] - E[Y\{X^*, M.c\}].$$

Consider a structural model, as in Figure 7.5, with dependencies specified via normal
linear regressions, with regression terms

$$E[M \mid X, C] = \beta_1 + \beta_2 X + \beta_3 C,$$

$$E[Y \mid X, M, C] = \theta_1 + \theta_2 X + \theta_3 M + \theta_4 X.M + \theta_5 C.$$



FIGURE 7.5
Causal path example.

Then the Natural Direct Effect (NDE) and the Natural Indirect Effect (NIE) can be obtained
by effect substitution (substitution of regression means). So defining expected M(X) and
M(X*) from the corresponding regressions,

$$E[M(X)] = \beta_1 + \beta_2 X + \beta_3 C,$$

$$E[M(X^*)] = \beta_1 + \beta_2 X^* + \beta_3 C,$$

one has

$$E[Y\{X, M(X)\}] = \theta_1 + \theta_2 X + \theta_3 M(X) + \theta_4 X.M(X) + \theta_5 C,$$

$$E[Y\{X, M(X^*)\}] = \theta_1 + \theta_2 X + \theta_3 M(X^*) + \theta_4 X.M(X^*) + \theta_5 C,$$

$$E[Y\{X^*, M(X^*)\}] = \theta_1 + \theta_2 X^* + \theta_3 M(X^*) + \theta_4 X^*.M(X^*) + \theta_5 C.$$

The natural direct effect is then obtained as

$$\mathrm{NDE} = E[Y\{X, M(X^*)\}] - E[Y\{X^*, M(X^*)\}]$$

$$= (X - X^*)(\theta_2 + \theta_4 \beta_1 + \theta_4 \beta_2 X^* + \theta_4 \beta_3 C).$$

Also, the mediated, or natural indirect, effect is

$$\mathrm{NIE} = (X - X^*)(\theta_3 \beta_2 + \theta_4 \beta_2 X).$$
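
The closed forms for NDE and NIE can be checked numerically against direct substitution of regression means. The Python sketch below does this for one subject; the function names and all parameter values are hypothetical and chosen only for illustration.

```python
def mediator_mean(beta, x, c):
    """E[M | X, C] = beta1 + beta2*X + beta3*C."""
    b1, b2, b3 = beta
    return b1 + b2 * x + b3 * c

def outcome_mean(theta, x, m, c):
    """E[Y | X, M, C] = theta1 + theta2*X + theta3*M + theta4*X*M + theta5*C."""
    t1, t2, t3, t4, t5 = theta
    return t1 + t2 * x + t3 * m + t4 * x * m + t5 * c

def natural_effects(beta, theta, x, x_star, c):
    """NDE and NIE by substituting mediator means into the outcome mean."""
    m_x = mediator_mean(beta, x, c)
    m_star = mediator_mean(beta, x_star, c)
    nde = outcome_mean(theta, x, m_star, c) - outcome_mean(theta, x_star, m_star, c)
    nie = outcome_mean(theta, x, m_x, c) - outcome_mean(theta, x, m_star, c)
    return nde, nie

# Hypothetical parameter values and settings
beta = (1.0, 0.5, 0.2)
theta = (2.0, 1.5, 0.8, -0.1, 0.3)
x, x_star, c = 1.0, 0.0, 2.0

nde, nie = natural_effects(beta, theta, x, x_star, c)

# Closed forms from the text
nde_closed = (x - x_star) * (theta[1] + theta[3] * beta[0]
                             + theta[3] * beta[1] * x_star + theta[3] * beta[2] * c)
nie_closed = (x - x_star) * (theta[2] * beta[1] + theta[3] * beta[1] * x)
print(nde, nde_closed, nie, nie_closed)
```

With these values both routes give NDE = 1.36 and NIE = 0.35, and their sum equals the total effect obtained by comparing the outcome mean at (X, M(X)) with that at (X*, M(X*)).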

Underlying the effect decomposition in the above model are assumptions of conditional
ignorability

$$Y\{X, M(X)\} \perp X \mid C,$$

$$Y\{X, M(X)\} \perp M(X) \mid X, C.$$

These specify independence of exposure and outcome, given the confounders (Vanderweele,
2015), and of mediator and outcome, given confounders and exposure. Additional assumptions
(VanderWeele and Vansteelandt, 2014) are $M(X) \perp X \mid C$ and $Y\{X, M(X)\} \perp M(X^*) \mid C$.
An alternative method for estimating causal effects is set out by Imai et al. (2010, p.312)
based on assumptions of sequential ignorability. The initial assumptions relate to
ignorability of the treatment (or exposure) given confounders, namely $Y\{X, M(X)\} \perp X \mid C$ and
$M(X) \perp X \mid C$. The subsequent assumption relates to ignorability of the mediator given
confounders and exposure, namely $Y\{X, M(X)\} \perp M(X) \mid C, X$. The essential steps in the
method are (Imai et al., 2010, p.317 and Appendix D)

 i) Estimate the mediator and outcome regression models (regressions of M on X and C,
and of Y on X, M and C);
ii) Obtain J sets of sampled parameters from each regression based on the estimated
sampling distributions;
iii) For each of j = 1, … , J samples from stage ii, carry out K further simulations, namely
(a) of mediator values M | X,C for both X = 1 and X = 0, and (b) of two potential
outcomes given the simulated mediator values, and then (c) compute the causal
mediation effect for sample j by averaging over the K simulations.
iv) Compute summary statistics by averaging over all samples j.
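
Steps ii to iv can be sketched as below, taking the estimates from step i as given. This is an illustrative simplification and not the implementation of Imai et al.: the function name, point estimates, standard errors, and confounder values are hypothetical; both regressions are taken to be linear without an X.M interaction; parameters are drawn independently (ignoring their sampling covariances); and expected rather than noisy potential outcomes are compared.

```python
import random

def quasi_bayes_mediation(est, se, sd_m, c_values, J=200, K=50, seed=1):
    """Monte Carlo mediation and direct effects for a binary treatment."""
    rng = random.Random(seed)
    acme_draws, direct_draws = [], []
    for _ in range(J):
        # Step ii: one set of sampled parameters (independent normals, an assumption)
        p = {name: rng.gauss(est[name], se[name]) for name in est}
        acme_sum, direct_sum, count = 0.0, 0.0, 0
        for c in c_values:
            for _ in range(K):
                # Step iii(a): simulated mediator values under X = 1 and X = 0
                m1 = rng.gauss(p["a0"] + p["a1"] + p["a2"] * c, sd_m)
                m0 = rng.gauss(p["a0"] + p["a2"] * c, sd_m)
                # Step iii(b): expected potential outcomes given those mediators
                y_1_m1 = p["b0"] + p["b1"] + p["b2"] * m1 + p["b3"] * c
                y_1_m0 = p["b0"] + p["b1"] + p["b2"] * m0 + p["b3"] * c
                y_0_m0 = p["b0"] + p["b2"] * m0 + p["b3"] * c
                acme_sum += y_1_m1 - y_1_m0      # mediation effect at x = 1
                direct_sum += y_1_m0 - y_0_m0    # direct effect at mediator M(0)
                count += 1
        # Step iii(c): average over subjects and the K simulations
        acme_draws.append(acme_sum / count)
        direct_draws.append(direct_sum / count)
    # Step iv: summarise over the J parameter samples
    acme = sum(acme_draws) / J
    direct = sum(direct_draws) / J
    return acme, direct, acme + direct

# Hypothetical step-i estimates: mediator model (a*) and outcome model (b*)
est = {"a0": 1.0, "a1": 0.5, "a2": 0.2, "b0": 2.0, "b1": 0.3, "b2": 0.8, "b3": 0.1}
se = {name: 0.05 for name in est}
c_values = [i / 10 for i in range(20)]

acme, direct, total = quasi_bayes_mediation(est, se, sd_m=1.0, c_values=c_values)
print(round(acme, 3), round(direct, 3), round(total, 3))
```

In this linear setting the mediation effect should be close to the product of the treatment coefficient in the mediator model and the mediator coefficient in the outcome model (here 0.5 × 0.8), which gives a simple check on the simulation.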

Assuming a binary treatment, the two causal mediation effects are defined for treatment
settings x = 0,1 as

$$\delta_i(x) = Y_i(x, M_i(1)) - Y_i(x, M_i(0)).$$

Similarly, direct treatment effects are defined as

$$\zeta_i(x) = Y_i(1, M_i(x)) - Y_i(0, M_i(x)).$$

Assuming that causal mediation and direct effects do not vary according to treatment
status, so that $\delta_i(1) = \delta_i(0) = \delta_i$ and $\zeta_i(1) = \zeta_i(0) = \zeta_i$, one has that the total treatment
effect $\tau_i = Y_i(1, M_i(1)) - Y_i(0, M_i(0))$ is the sum of the causal mediation and direct effects,
$\tau_i = \delta_i + \zeta_i$.
The Imai et al. method may be characterised as a quasi-Bayesian Monte Carlo algorithm,
and one may adapt the method using fully Bayesian principles, with substitution of
appropriate replicates. Assuming the treatment X is binary, and based on sampled parameters
at each MCMC iteration, one samples replicate mediator values $M_1 = M_{\mathrm{rep}}(C, X = 1)$ and
$M_0 = M_{\mathrm{rep}}(C, X = 0)$ at different treatment levels. One then substitutes these (as mediator
values) in the regression term for predicting Y, along with settings X = 1 and X = 0 on
the treatment values. This provides predictions $Y_{\mathrm{rep}}(1, M_1)$, $Y_{\mathrm{rep}}(0, M_1)$, $Y_{\mathrm{rep}}(1, M_0)$, and
$Y_{\mathrm{rep}}(0, M_0)$, with the total treatment effect $Y_{\mathrm{rep}}(1, M_1) - Y_{\mathrm{rep}}(0, M_0)$.
In real situations, the exposure X may influence one or more confounders C. Schematically,

$$C \sim f_1(X),$$

$$M \sim f_2(X, C),$$

$$Y \sim f_3(X, M, X.M, C).$$

Then more detailed calculations are obtained, since C(X) and C(X*) will differ. To illustrate
linear effect substitution, suppose C denotes a confounder influenced by X, and D denotes
confounders independent of X. Then one will have an additional linear regression with
expectation such as

$$E[C \mid X, D] = \alpha_1 + \alpha_2 X + \alpha_3 D,$$

whereby $E[C(X)] = \alpha_1 + \alpha_2 X + \alpha_3 D$, and $E[C(X^*)] = \alpha_1 + \alpha_2 X^* + \alpha_3 D$. Then one has

$$E[M(X)] = \beta_1 + \beta_2 X + \beta_3 C(X) + \beta_4 D,$$

$$E[M(X^*)] = \beta_1 + \beta_2 X^* + \beta_3 C(X^*) + \beta_4 D,$$

and

$$E[Y\{X, M(X)\}] = \theta_1 + \theta_2 X + \theta_3 M(X) + \theta_4 X.M(X) + \theta_5 C(X) + \theta_6 D,$$

$$E[Y\{X, M(X^*)\}] = \theta_1 + \theta_2 X + \theta_3 M(X^*) + \theta_4 X.M(X^*) + \theta_5 C(X) + \theta_6 D,$$

$$E[Y\{X^*, M(X^*)\}] = \theta_1 + \theta_2 X^* + \theta_3 M(X^*) + \theta_4 X^*.M(X^*) + \theta_5 C(X^*) + \theta_6 D.$$

Example 7.16 Framing and Public Opinion


This example uses data from Tingley et al. (2014) in which randomised subjects are
exposed to different media perspectives on immigration (the binary treatment, X) with
the aim of assessing how this affects a political outcome measure (Y, also binary), namely
whether or not a letter is sent regarding immigration policy to a Congress representa-
tive. Anxiety is posited as a continuous mediating variable, with the measure of anxiety
(anx) influenced by the form of media framing. By virtue of the binary response, direct
and indirect effects are obtained in terms of probability differences.
The mediator regression specifies dependence on X and four confounders C (age, gen-
der, income, and educational category: less than high school, high school, some college,
bachelor’s degree or higher, with the first level as reference). These are taken not to be
influenced by the exposure. The anxiety mediator has positive values only, and assum-
ing a truncated normal density, the regression is coded in jags as

   # Mediator model: anxiety is positive, so the normal is truncated below at 0
   anx[i] ~ dnorm(mu.anx[i], tau.anx) T(0,)
   mu.anx[i] <- a[1] + a[2]*treat[i] + a[3]*age.c[i] + a[4]*equals(edu[i],2) +
                a[5]*equals(edu[i],3) + a[6]*equals(edu[i],4) + a[7]*gend[i] +
                a[8]*income[i]

Predictions at values X = 1 and X = 0 are obtained as

   # Replicate mediator values under treatment (X = 1) and control (X = 0)
   anx.1[i] ~ dnorm(mu.anx.1[i], tau.anx) T(0,)
   anx.0[i] ~ dnorm(mu.anx.0[i], tau.anx) T(0,)

with corresponding regression settings:

   mu.anx.1[i] <- a[1] + a[2] + a[3]*age.c[i] + a[4]*equals(edu[i],2) +
                  a[5]*equals(edu[i],3) + a[6]*equals(edu[i],4) + a[7]*gend[i] +
                  a[8]*income[i]
   mu.anx.0[i] <- a[1] + a[3]*age.c[i] + a[4]*equals(edu[i],2) +
                  a[5]*equals(edu[i],3) + a[6]*equals(edu[i],4) + a[7]*gend[i] +
                  a[8]*income[i]

The binary outcome has predictors X and C, but also involves the mediator. So assum-
ing a probit regression one has

   # Outcome model: probit regression on treatment, mediator and confounders
   congmesg[i] ~ dbern(p[i])
   p[i] <- phi(b[1] + b[2]*treat[i] + b[3]*anx[i] + b[4]*age.c[i] +
               b[5]*equals(edu[i],2) + b[6]*equals(edu[i],3) +
               b[7]*equals(edu[i],4) + b[8]*gend[i] + b[9]*income[i])

Four alternative predictions of the outcome are obtained at settings X = 1 and
X = 0, crossed with mediator values set at the predictions $M_1 = M_{\mathrm{rep}}(C, X = 1)$ and
$M_0 = M_{\mathrm{rep}}(C, X = 0)$ (anx.1[i] and anx.0[i] in the code). The direct causal effect is
defined as

$$E[Y\{1, M(0)\}] - E[Y\{0, M(0)\}].$$

Two assumptions regarding the density for the positive anxiety score are made. Under a
truncated normal assumption, a two-chain run using jagsUI provides means (95% CRI)
for the average mediation, direct and total effects as 0.083 (0.011, 0.160), 0.012 (−0.113,
0.140) and 0.096 (−0.042, 0.238). These estimates are similar in location to, but less precise
than, those contained in Tingley et al. (2014). In inference terms, the direct impact of the
treatment is insignificant, and the impact of the treatment on the response is mainly due
to its effect on the anxiety mediator. The LOO-IC for the mediator and outcome models
are obtained as 2,099 and 298 respectively.
Assuming instead a lognormal density for anxiety, the respective means (95% CRI)
become 0.090 (0.011, 0.181), 0.016 (−0.106, 0.142), and 0.106 (−0.034, 0.253). The LOO-IC for
the mediator model is reduced to 2,092.
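
The way the four predictions combine into probability differences can be sketched outside JAGS. The fragment below is a hypothetical illustration for a single parameter draw: the coefficient values, the single confounder, and the function names are assumptions, the truncated normal is sampled by simple rejection, and the mediation and direct effects are each averaged over the two treatment settings, which is one possible convention and not necessarily that of the code for this example.

```python
import math
import random

def phi(z):
    """Standard normal CDF (inverse probit link)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def rtruncnorm_pos(mu, sd, rng):
    """Draw from N(mu, sd^2) truncated to positive values, by rejection."""
    while True:
        v = rng.gauss(mu, sd)
        if v > 0:
            return v

def effects_for_subject(a, b, sd_m, c, rng):
    """Mediation, direct and total effects (probability scale) for one subject."""
    m1 = rtruncnorm_pos(a[0] + a[1] + a[2] * c, sd_m, rng)   # mediator replicate at X = 1
    m0 = rtruncnorm_pos(a[0] + a[2] * c, sd_m, rng)          # mediator replicate at X = 0

    def p(x, m):
        return phi(b[0] + b[1] * x + b[2] * m + b[3] * c)

    mediation = 0.5 * ((p(1, m1) - p(1, m0)) + (p(0, m1) - p(0, m0)))
    direct = 0.5 * ((p(1, m1) - p(0, m1)) + (p(1, m0) - p(0, m0)))
    total = p(1, m1) - p(0, m0)
    return mediation, direct, total

# Hypothetical coefficients: mediator model a, outcome model b
a = (1.0, 0.5, 0.1)
b = (-1.0, 0.1, 0.4, 0.05)
rng = random.Random(7)
results = [effects_for_subject(a, b, 1.0, c, rng) for c in (0.0, 1.0, 2.0)]
print(results)
```

Under this averaging convention the mediation and direct effects sum exactly to the total effect for every subject, which mirrors the near-additivity of the reported posterior means.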

Example 7.17 Effect of Alcohol on SBP


This example uses a dataset for n = 10,000 subjects from Daniel et al. (2011), which illus-
trates a case where the exposure affects a confounder. The interest is in estimating direct
and indirect effects of alcohol consumption (ALC, in units per day), which is the expo-
sure, on the outcome, systolic blood pressure (SBP, measured in mmHg), while allowing
for the mediating effect of a liver enzyme, GGT (gamma-glutamyl transpeptidase). GGT
is the logarithm of the enzyme measured in grams per litre. The causal sequence is that
alcohol intake affects levels of the liver enzyme, which in turn affect SBP, though there
is also potentially a direct influence of alcohol on SBP.
Regarding confounders, body mass index (BMI) may affect both the mediator GGT
and the outcome SBP, while socioeconomic status (SES, a trinomial category) may affect
alcohol intake, BMI and SBP. Furthermore, alcohol intake may have a direct effect on
BMI (i.e. the exposure potentially influences one of the confounders).
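Data on the four continuous variables are subject to missingness. This is handled by
assuming missingness at random, and by using a sequence of one-dimensional conditional
distributions (Lipsitz and Ibrahim, 1996)

$$\mathrm{ALC} \sim (\mathrm{ALC} \mid \mathrm{SES})$$

$$\mathrm{BMI} \sim (\mathrm{BMI} \mid \mathrm{ALC}, \mathrm{SES})$$

$$\mathrm{GGT} \sim (\mathrm{GGT} \mid \mathrm{BMI}, \mathrm{ALC})$$

$$\mathrm{SBP} \sim (\mathrm{SBP} \mid \mathrm{ALC}, \mathrm{GGT}, \mathrm{ALC.GGT}, \mathrm{BMI}, \mathrm{SES}).$$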



A different imputation strategy is adopted by Daniel et al. (2011) which may affect findings.
Normal linear regressions are adopted for each outcome. In full, the regression
term assumed for predicting the outcome SBP is

$$\mu_{\mathrm{SBP},i} = \theta_1 + \theta_2 \mathrm{ALC} + \theta_3 \mathrm{GGT} + \theta_4 \mathrm{ALC.GGT} + \theta_5 \mathrm{BMI} + \theta_6 I(\mathrm{SES} = 2) + \theta_7 I(\mathrm{SES} = 3).$$

One aim is to estimate the natural direct effect NDE, defined (in generic symbols) as
the expected value of the difference $Y(X, M(X^*)) - Y(X^*, M(X^*))$. Accordingly, a counterfactual
alcohol consumption level ALC* = 0 is defined, with corresponding predictions
(obtained as Bayesian replicates)

$$\mathrm{BMI}^* = \mathrm{BMI}(\mathrm{ALC}^*, \mathrm{SES})$$

and

$$\mathrm{GGT}^* = \mathrm{GGT}(\mathrm{ALC}^*, \mathrm{BMI}^*).$$

These are bmi.star.new[i] and ggt.star.new[i] in the code. Then

$$\mathrm{NDE} = E[\mathrm{SBP}\{\mathrm{ALC}, \mathrm{GGT}^*, \mathrm{BMI}^*, \mathrm{SES}\}] - E[\mathrm{SBP}\{\mathrm{ALC}^*, \mathrm{GGT}^*, \mathrm{BMI}^*, \mathrm{SES}\}]$$

with the first and second components defining NDE at subject level denoted NDE.a[i]
and NDE.star[i] in the code.
The natural indirect effect NIE is defined generically as the expected value of the
difference $Y(X, M(X)) - Y(X, M(X^*))$. In terms of the application, we have

$$\mathrm{NIE} = E[\mathrm{SBP}\{\mathrm{ALC}, \mathrm{GGT}(\mathrm{BMI}, \mathrm{ALC}), \mathrm{BMI}(\mathrm{ALC}, \mathrm{SES}), \mathrm{SES}\}] - E[\mathrm{SBP}\{\mathrm{ALC}, \mathrm{GGT}^*, \mathrm{BMI}^*, \mathrm{SES}\}].$$

In the first component, GGT and BMI are replicates (ggt.new[i] and bmi.new[i] in the
code). Including the prediction $\mathrm{GGT}^* = \mathrm{GGT}(\mathrm{ALC}^*, \mathrm{BMI}^*)$ in the second component
allows for the fact that BMI (a confounder) depends on the exposure ALC, so that
$\mathrm{BMI}^* = \mathrm{BMI}(\mathrm{ALC}^*, \mathrm{SES})$ and BMI(ALC,SES) differ.
The total causal effect TCE is the sum of NDE and NIE. Additionally, the controlled
direct effect CDE may be obtained at the setting GGT.c = 3, namely

$$\mathrm{CDE} = E[\mathrm{SBP}\{\mathrm{ALC}, \mathrm{GGT.c}, \mathrm{BMI}, \mathrm{SES}\}] - E[\mathrm{SBP}\{\mathrm{ALC}^*, \mathrm{GGT.c}, \mathrm{BMI}, \mathrm{SES}\}].$$

Table 7.6 shows the posterior summary for these quantities and the regression param-
eters θ from a two-chain run using jagsUI. The estimated total causal effect (TCE)
implies that the reduction of alcohol consumption to zero would reduce average SBP
by 8.04 units (95% CRI from 7.71 to 8.40). A relatively small part of the reduction (with
posterior mean 1.30 units) is mediated through GGT. It may be noted that the impact
of alcohol on SBP is possibly nonlinear, with evidence of a U-shaped effect, and an
extended model might allow nonlinearity (Jackson et al., 1985).

TABLE 7.6
Prediction of SBP, Posterior Parameter Summary

Parameter   Predictor    Mean     St devn   2.5%     97.5%
TCE                      8.04     0.18      7.71     8.40
NDE                      6.74     0.16      6.42     7.06
NIE                      1.30     0.10      1.10     1.51
CDE                      6.63     0.15      6.32     6.93
θ1          Intercept    89.86    1.25      87.63    92.58
θ2          ALC          5.94     0.19      5.58     6.33
θ3          GGT          7.03     0.16      6.74     7.35
θ4          GGT.ALC      −0.99    0.05      −1.10    −0.89
θ5          BMI          0.51     0.05      0.41     0.59
θ6          SES2         −5.34    0.24      −5.80    −4.87
θ7          SES3         −10.32   0.31      −10.92   −9.71

7.9.4 Marginal Structural Models


Another strategy to estimate causal effects focuses on a marginal structural model (MSM)
relating Y to exposure X, and possibly other selected risk factors V of interest, after adjust-
ing for other confounders (Snowden et al., 2011; Joffe et al., 2004). Sometimes the marginal
model will involve a regression on X and effect modifiers V (Robins et al., 2000, p.556).
The inverse probability of treatment weighted (IPTW) approach involves a weighted
likelihood for the MSM. Consider the case when the marginal structural model involves
regression of Y on X only, adjusting for confounders C. The IPTW method involves first
deriving weights

$$w_i = 1/\Pr(X_i = x \mid C_i),$$

namely the inverse of the probability that X = x given confounders C, following binary
regression of X on C. The weights are

$$w_i = 1/P(X_i = 1 \mid C_i) = 1/S_i$$

for subjects with observed X = 1, and

$$w_i = 1/P(X_i = 0 \mid C_i) = 1/(1 - S_i)$$

for subjects with observed X = 0. Equivalently

$$w_i = X_i/P(X_i = 1 \mid C_i) + (1 - X_i)/P(X_i = 0 \mid C_i) = X_i/S_i + (1 - X_i)/(1 - S_i).$$
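
The weight calculation can be illustrated with a small sketch. The data and function names below are hypothetical, and the propensity scores are taken as given rather than estimated. For a linear marginal model of Y on a binary X, the weighted least squares slope equals the difference between the weighted means of Y in the two exposure groups, which is what the second function computes.

```python
def iptw_weights(x, s):
    """w_i = X_i/S_i + (1 - X_i)/(1 - S_i) for binary X and propensity score S."""
    return [xi / si + (1 - xi) / (1 - si) for xi, si in zip(x, s)]

def weighted_mean_difference(y, x, w):
    """Weighted mean of Y among X = 1 minus weighted mean among X = 0."""
    num1 = sum(wi * yi for yi, xi, wi in zip(y, x, w) if xi == 1)
    den1 = sum(wi for xi, wi in zip(x, w) if xi == 1)
    num0 = sum(wi * yi for yi, xi, wi in zip(y, x, w) if xi == 0)
    den0 = sum(wi for xi, wi in zip(x, w) if xi == 0)
    return num1 / den1 - num0 / den0

# Hypothetical data: exposure, propensity score, outcome
x = [1, 1, 0, 0]
s = [0.8, 0.5, 0.5, 0.2]
y = [3.0, 2.0, 1.0, 0.5]

w = iptw_weights(x, s)
effect = weighted_mean_difference(y, x, w)
print(w, round(effect, 4))
```

Subjects whose observed exposure was unlikely given their confounders (here the second and third) receive the larger weights, which is how the weighting rebalances the confounder distribution across exposure groups.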

Estimating the marginal structural model then involves a weighted likelihood (normal,
logistic, etc.) of Y on X. Applying weights in this way creates an artificial population
which tends to balance on the covariates C used in deriving the weights (Naimi et al.,
2014a). Doubly robust weights may also be defined that estimate causal effects if either
the propensity score model or the outcome model is correctly specified. Davis et al. (2017)
use Bayesian methods to estimate the parameters needed to define a propensity score in
spatial applications, and substitute relevant posterior means to estimate IPTW weights,
with the latter considered as frequentist.
Marginal structural models may also be estimated using a regression of Y on X (exposure)
and C (confounders) to predict counterfactual outcomes for all subjects. This is in
line with g-computation principles (Wang and Arah, 2015). Then $E(Y[X, C])$ denotes the
prediction, possibly counterfactual, at the value X. So for X binary, and X = 1 as exposed,
the total causal effect is estimated as

$$\mathrm{TCE} = E(Y[1, C]) - E(Y[0, C]).$$

Snowden et al. (2011) use an additional regression step, involving 2n outcomes (half being
actual responses at observed X, half being counterfactual responses at the counterfactual
X*) and estimate the treatment effect by regression of the expanded outcome vector on cor-
responding X (or X*) values. However, the TCE may also be estimated by averaging over
predictions at appropriate settings of X and C (Example 7.18).

Example 7.18 Lung Function and Ozone Exposure


This example involves simulated data from Snowden et al. (2011), with Y being forced
expiratory volume in 1 second (FEV1) measured in litres, X being ozone exposure
(binary), and confounders C1 (male = 1, female = 0), and C2, controller medication use
(1 = effective use, 0 = ineffective). The true model underlying the simulated Y-data
involves terms in X, C1, and an interaction between X and C2.
In the absence of such knowledge about the data generation, a researcher might consider
a full linear regression of Y on X, C1, C2, and the three possible interactions involving
X, C1, and C2. Thus the regression term considered here is

$$\beta_1 + \beta_2 X + \beta_3 C_1 + \beta_4 C_2 + \beta_5 X.C_1 + \beta_6 X.C_2 + \beta_7 C_1.C_2$$

where X.C1 denotes an interaction between X and C1, etc. We estimate this model using
jagsUI, and find the regression coefficient β2 to have posterior mean (sd) of −0.486 (0.08).
By contrast, the marginal causal effect has posterior mean (sd) of −0.337 (0.054). This is
estimated by averaging the difference between the predictions

$$Y(1, C) = \beta_1 + \beta_2 + \beta_3 C_1 + \beta_4 C_2 + \beta_5 C_1 + \beta_6 C_2 + \beta_7 C_1.C_2$$

$$Y(0, C) = \beta_1 + \beta_3 C_1 + \beta_4 C_2 + \beta_7 C_1.C_2$$

So the conventional regression overstates the impact of ozone exposure in reducing
FEV1. The estimates by Snowden et al. (2011, Table 3) are similar.
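
The averaging of prediction differences can be sketched as follows. The coefficient values and confounder data are hypothetical (they are not the estimates from this example), and a single coefficient vector stands in for one posterior draw.

```python
def marginal_effect(beta, c1, c2):
    """Average of Y(1, C) - Y(0, C) over subjects for the interaction model."""
    b1, b2, b3, b4, b5, b6, b7 = beta
    diffs = []
    for u, v in zip(c1, c2):
        y1 = b1 + b2 + b3 * u + b4 * v + b5 * u + b6 * v + b7 * u * v   # prediction at X = 1
        y0 = b1 + b3 * u + b4 * v + b7 * u * v                          # prediction at X = 0
        diffs.append(y1 - y0)
    return sum(diffs) / len(diffs)

# Hypothetical coefficient draw (beta1..beta7) and binary confounders
beta = (3.0, -0.5, 0.4, 0.1, 0.0, 0.3, 0.0)
c1 = [1, 0, 1, 0]
c2 = [1, 1, 0, 0]

mce = marginal_effect(beta, c1, c2)
print(round(mce, 4))
```

Here the conditional coefficient on X is −0.5 but the marginal effect is −0.35, because the X.C2 interaction offsets part of the effect for the half of the sample with C2 = 1; this is the same mechanism by which β2 and the marginal causal effect differ in the example.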
To alleviate impacts of insignificant predictors, one can include spike-slab predictor
selection, whereby

$$\beta_j = \gamma_j J_j,$$

$$\gamma_j \sim N(0, 10),$$

$$J_j \sim \mathrm{Bern}(0.5).$$

This gives posterior probabilities of 1 for retaining β2 and β3, and 0.95 for retaining β6, as
expected given the data generation mechanism. Other coefficients have retention
probabilities below 0.05. Including predictor selection affects estimates slightly: the
regression coefficient β2 now has a posterior mean (sd) of −0.481 (0.059), while the mar-
ginal causal effect has a posterior mean (sd) of −0.340 (0.054).
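
As a rough sketch of how this prior behaves, the Python fragment below draws from a product of a normal slab and a Bernoulli inclusion indicator, and shows how retention probabilities are obtained as the posterior means of the indicators. It assumes that the 10 in N(0, 10) is a variance (in BUGS or JAGS code the second argument of dnorm is a precision), and the indicator draws are hypothetical rather than output from the model above.

```python
import random
from math import sqrt

def draw_spike_slab(rng, slab_var=10.0, incl_prob=0.5):
    """One prior draw of beta_j = gamma_j * J_j."""
    gamma = rng.gauss(0.0, sqrt(slab_var))    # slab: gamma_j ~ N(0, slab_var)
    j = 1 if rng.random() < incl_prob else 0  # indicator: J_j ~ Bern(incl_prob)
    return gamma * j, j

def retention_probabilities(indicator_draws):
    """Retention probability per coefficient = mean of its sampled J_j values."""
    n = len(indicator_draws)
    return [sum(col) / n for col in zip(*indicator_draws)]

# Under the prior, about half of the draws are exactly zero (the spike).
rng = random.Random(1)
prior_draws = [draw_spike_slab(rng) for _ in range(20000)]
share_zero = sum(1 for beta, _ in prior_draws if beta == 0.0) / len(prior_draws)

# Hypothetical MCMC output: rows are iterations, columns are coefficients.
indicator_draws = [[1, 0], [1, 1], [1, 0], [1, 0]]
probs = retention_probabilities(indicator_draws)
```

In this toy output the first coefficient is always retained and the second is retained in a quarter of the iterations.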
We also consider the propensity score approach of Section 7.9.1, regressing the prob-
ability Si that Xi = 1 on C1i, C2i and the interaction C1iC2i. One may then either simply
regress the response Yi on Si and Xi, or also allow for residual confounding (Zigler and
Dominici, 2014), namely,

$$Y_i \sim N(\beta_1 + \beta_2 X_i + \beta_3 S_i + \beta_4 C_{1i} + \beta_6 C_{2i} + \beta_7 C_{1i}.C_{2i}, \sigma^2).$$

This is carried out using a joint likelihood, though feedback between the Y-model and
the X-model can be avoided using the BUGS “cut” function. Results for the coefficient β2,
and hence the ACE, are very similar whether or not residual confounding is allowed for,
and also whether or not feedback is avoided. Allowing feedback, and without allowing
residual confounding, the mean (sd) of the ACE is estimated as −0.352 (0.053).
To illustrate the IPTW approach, we again regress the probability that X = 1 on C1,
C2, and the interaction C1.C2. This provides probabilities Si = Pr(Xi = 1|C1i, C2i), and the
weights

$$w_i = X_i/S_i + (1 - X_i)/(1 - S_i)$$

are then used in a weighted linear regression of Y on X, with observation variances σ²/wi. To avoid
feedback between the logit regression for X on {C1,C2}, and the marginal structural
regression of Y on X, the “cut” function in BUGS is applied to the predicted probabili-
ties Si before they are inserted in the weights. From the second half of a 10,000 iteration
sequence, the posterior mean (sd) for the marginal causal effect (MCE) is −0.36 (0.10). If
feedback between the two regressions is allowed, convergence in the coefficients of the
X-model is impeded, and the MCE has a value closer to null, around −0.27.
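
The weighting logic can be illustrated with a small non-Bayesian sketch in Python. It makes two simplifying assumptions that are not part of the analysis above: the propensities Si are taken as the observed exposure proportions within each (C1, C2) cell (the maximum likelihood fit of a logit model that is saturated in two binary confounders), and the weighted regression of Y on binary X is computed directly as a difference of weighted means. The data are made up.

```python
from collections import defaultdict

def propensity_by_cell(records):
    """Pr(X = 1 | C1, C2) as the exposed share within each confounder cell."""
    counts = defaultdict(lambda: [0, 0])  # cell -> [number exposed, cell size]
    for c1, c2, x, _ in records:
        counts[(c1, c2)][0] += x
        counts[(c1, c2)][1] += 1
    return {cell: exposed / total for cell, (exposed, total) in counts.items()}

def iptw_effect(records):
    """Slope from weighted least squares of Y on binary X with weights w_i.

    Assumes every cell contains both exposed and unexposed units, so 0 < S_i < 1.
    """
    s = propensity_by_cell(records)
    num = {0: 0.0, 1: 0.0}
    den = {0: 0.0, 1: 0.0}
    for c1, c2, x, y in records:
        s_i = s[(c1, c2)]
        w = x / s_i + (1 - x) / (1 - s_i)  # w_i = X_i/S_i + (1 - X_i)/(1 - S_i)
        num[x] += w * y
        den[x] += w
    return num[1] / den[1] - num[0] / den[0]

# Hypothetical records (C1, C2, X, Y): exposure is rarer in the first cell.
records = [
    (0, 0, 1, 2.0), (0, 0, 0, 3.0), (0, 0, 0, 3.0), (0, 0, 0, 3.0),
    (1, 1, 1, 3.0), (1, 1, 1, 3.0), (1, 1, 1, 3.0), (1, 1, 0, 5.0),
]
exposed = [y for _, _, x, y in records if x == 1]
unexposed = [y for _, _, x, y in records if x == 0]
naive = sum(exposed) / len(exposed) - sum(unexposed) / len(unexposed)
mce = iptw_effect(records)
```

In this toy data the within-cell effects are -1 and -2. The weighted contrast recovers their average of about -1.5, while the unweighted contrast of about -0.75 is distorted by the uneven exposure across cells.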

References
Albert J (1996) Bayesian selection of log-linear models. Canadian Journal of Statistics, 24, 327–347.
Albert JH, Chib S (1993) Bayesian analysis of binary and polychotomous response data. Journal of the
American Statistical Association, 88(422), 669–679.
Albert J, Chib S (2001) Sequential ordinal modeling with applications to survival data. Biometrics,
57(3), 829–836.
Arbia G (2014) A Primer for Spatial Econometrics: With Applications in R. Palgrave.
Assunção RM (2003) Space varying coefficient models for small area data. Environmetrics, 14(5),
453–473.
Assunção R, Krainski E (2009) Neighborhood dependence in Bayesian spatial models. Biometrical
Journal, 51(5), 851–869.
Austin P (2009) Balance diagnostics for comparing the distribution of baseline covariates between
treatment groups in propensity-score matched samples. Statistics in Medicine, 28, 3083–3107.
Baragatti M, Pommeret D (2012) A study of variable selection using g-prior distribution with ridge
parameter. Computational Statistics and Data Analysis, 56(6), 1920–1934.
Barbieri M, Berger J (2004) Optimal predictive model selection. Annals of Statistics, 32, 870–897.
Barreto-Souza W, Simas A (2016) General mixed Poisson regression models with varying dispersion.
Statistics and Computing, 26, 1263–1280.
Baser O (2006) Too much ado about propensity score models? Comparing methods of propensity
score matching. Value in Health, 9(6), 377–385.
Bazán J, Bolfarine H, Branco M (2010) A framework for skew-probit links in binary regression.
Communications in Statistics, Theory and Methods, 39(4), 678–697.
Beck N (1983) Time-varying parameter regression models. American Journal of Political Science, 27,
557–600.
Bell S, Broemeling LD (2000) A Bayesian analysis for spatial processes with application to disease
mapping. Statistics in Medicine, 19(7), 957–974.
Besag J (1974) Spatial interaction and the statistical analysis of lattice systems. Journal of the Royal
Statistical Society B, 36, 192–225.
Besag J, York J, Mollié A (1991) Bayesian image restoration, with two applications in spatial statistics.
Annals of the Institute of Statistical Mathematics, 43(1), 1–20.
Bhadra A, Datta J, Polson N, Willard B (2016) Default Bayesian analysis with global-local shrinkage
priors. Biometrika, 103, 955–969.
Bhattacharya A, Pati D, Pillai N, Dunson D (2015) Dirichlet–Laplace priors for optimal shrinkage.
Journal of the American Statistical Association, 110(512), 1479–1490.
Bonate P (2011) Pharmacokinetic-Pharmacodynamic Modeling and Simulation, 2nd Edition. Springer,
New York.
Boris Choy S, Chan J (2008) Scale mixtures distributions in statistical modelling. Australian & New
Zealand Journal of Statistics, 50(2), 135–146.
Boyd H, Flanders W, Addiss D, Waller L (2005) Residual spatial correlation between geographically
referenced observations: A Bayesian hierarchical modeling approach. Epidemiology, 16, 532–541.
Brockmann H (1996) Satellite male groups in horseshoe crabs, Limulus polyphemus. Ethology, 102(1),
1–21.
Bürkner P, Vuorre M (2018, February 28) Ordinal Regression Models in Psychology: A Tutorial.
https://doi.org/10.31234/osf.io/x8swp
Calcagno V, de Mazancourt C (2010) glmulti: An R package for easy automated model selection with
(generalized) linear models. Journal of Statistical Software, 34(12), 1–29.
Carvalho C, Polson N, Scott J (2009) Handling Sparsity via the Horseshoe. Proceedings of Machine
Learning Research, 5, 73–80.
Cepeda E, Gamerman D (2000) Bayesian modeling of variance heterogeneity in normal regression
models. Brazilian Journal of Probability and Statistics, 14(2), 207–221.
Chan K, Ledolter J (1995) Monte Carlo EM estimation for time series models involving counts. Journal
of the American Statistical Association, 90, 242–252.
Chang Y, Gianola D, Heringstad B, Klemetsdal G (2006) A comparison between multivariate Slash,
Student’s t and probit threshold models for analysis of clinical mastitis in first lactation cows.
Journal of Animal Breeding and Genetics, 123, 290–300.
Chen R, Chu C, Yuan S, Wu Y (2016) Bayesian sparse group selection. Journal of Computational and
Graphical Statistics, 25(3), 665–683.
Chi EM, Reinsel GC (1989) Models for longitudinal data with random effects and AR(1) errors.
Journal of the American Statistical Association, 84(406), 452–459.
Chib S, Greenberg E (2013) On conditional variance estimation in nonparametric regression. Statistics
and Computing, 23(2), 261–270.
Chiogna M, Gaetan C (2002) Dynamic generalized linear models with application to environmental
epidemiology. Journal of the Royal Statistical Society: Series C (Applied Statistics), 51(4), 453–468.
Choi J, Lawson A (2016, June 16) Bayesian spatially dependent variable selection for small area health
modeling. Statistical Methods in Medical Research. pii: 0962280215627184.
Choi J, Lawson AB (2018) Bayesian spatially dependent variable selection for small area health mod-
eling. Statistical Methods in Medical Research, 27(1), 234–249.
Conceição K, Andrade M, Louzada F (2013) Zero-modified Poisson model: Bayesian approach, influ-
ence diagnostics, and an application to a Brazilian leptospirosis notification data. Biometrical
Journal, 55(5), 661–678.
Congdon P, Almog M, Curtis S, Ellerman R (2007) A spatial structural equation modelling frame-
work for health count responses. Statistics in Medicine, 26(29), 5267–5284.
Cox D (1981) Statistical analysis of time series: Some recent developments. Scandinavian Journal of
Statistics, 8, 93–115.
Czado C, Erhardt V, Min A, Wagner S (2007) Zero-inflated generalized Poisson models with regres-
sion effects on the mean, dispersion and zero-inflation level applied to patent outsourcing
rates. Statistical Modelling, 7(2), 125–153.
Dangl T, Halling M (2012) Predictive regressions with time-varying coefficients. Journal of Financial
Economics, 106(1), 157–181.
Daniel R, De Stavola B, Cousens S (2011) gformula: Estimating causal effects in the presence of
time-varying confounding or mediation using the g-computation formula. Stata Journal, 11(4),
479–517.
Darmofal D (2015) Spatial Analysis for the Social Sciences. Cambridge University Press.
Davis M, Neelon B, Nietert P, Hunt K, Burgette L, Lawson A, Egede L (2017) Addressing geographic
confounding through spatial propensity scores: A study of racial disparities in diabetes.
Statistical Methods in Medical Research, 28(3), 734–748.
Dormann C, McPherson N, Araújo M et al. (2007) Methods to account for spatial autocorrelation in
the analysis of species distributional data: A review. Ecography, 30(5), 609–628.
Epstein D, O’Halloran S (1996) The partisan paradox and the US tariff, 1877–1934. International
Organization, 50(2), 301–324.
Fahrmeir L, Osuna E (2006) Structured additive regression for overdispersed and zero-inflated count
data. Applied Stochastic Models in Business and Industry, 22(4), 351–369.
Fahrmeir L, Tutz G (2001) Multivariate Statistical Modelling Based on Generalized Linear Models, pp
69–137. Springer, New York.
Fernández C, Steel MF (1998) On Bayesian modeling of fat tails and skewness. Journal of the American
Statistical Association, 93(441), 359–371.
Ferreira M, Gamerman D (2000) Dynamic generalized linear models, pp 57–72, in Generalized Linear
Models: A Bayesian Perspective, eds D Dey, S Ghosh, B Mallick. Marcel Dekker, New York.
Fokianos K, Kedem B (2003) Regression theory for categorical time series. Statistical Science, 18(3),
357–376.
Fonseca TC, Ferreira MA, Migon HS (2008) Objective Bayesian analysis for the Student-t regression
model. Biometrika, 95(2), 325–333.
Fotheringham A, Brunsdon C, Charlton M (2002) Geographically Weighted Regression: The Analysis of
Spatially Varying Relationships. Wiley, Chichester, UK.
Franzese RJ, Hays JC (2007) Spatial econometric models of cross-sectional interdependence in politi-
cal science panel and time-series-cross-section data. Political Analysis, 15(2), 140–164.
Frühwirth-Schnatter S, Frühwirth R (2007) Auxiliary mixture sampling with applications to logistic
models. Computational Statistics and Data Analysis, 51, 3509–3528.
Frühwirth-Schnatter S, Frühwirth R (2010) Data augmentation and MCMC for binary and multinomial
logit models, pp 111–132, in Statistical Modelling and Regression Structures, eds T Kneib, G
Tutz. Physica-Verlag HD.
Gamerman D (1998) Markov chain Monte Carlo for dynamic generalised linear models. Biometrika,
85(1), 215–227.
Gamerman D, Moreira A, Rue H (2003) Space-varying regression models: Specifications and simula-
tion. Computational Statistics and Data Analysis, 42, 513–533.
Garay A, Lachos V, Bolfarine H, Ortega E (2015) Bayesian estimation and case influence diagnos-
tics for the zero-inflated negative binomial regression model. Journal of Applied Statistics, 42(6),
1148–1165.
Garcia-Donato G, Martinez-Beneito M (2013) On sampling strategies in Bayesian variable selec-
tion problems with large model spaces. Journal of the American Statistical Association, 108(501),
340–352.
Geinitz S, Furrer R (2016) Conjugate distributions in hierarchical Bayesian ANOVA for computa-
tional efficiency and assessments of both practical and statistical significance. arXiv:1303.3390.
Geinitz S, Furrer R, Sain S (2015) Bayesian multilevel analysis of variance for relative comparison
across sources of global climate model variability. International Journal of Climatology, 35(3),
433–443.
Gelfand A, Kim H, Sirmans C, Banerjee S (2003) Spatial modelling with spatially varying coefficient
models. Journal of the American Statistical Association, 98, 387–396.
Gelfand AE, Ghosh SK (1998) Model choice: A minimum posterior predictive loss approach.
Biometrika, 85(1), 1–11.
Gelman A (2005) Analysis of variance—Why it is more important than ever. The Annals of Statistics,
33(1), 1–53.
George E, McCulloch R (1993) Variable selection via Gibbs sampling. Journal of the American Statistical
Association, 85, 398–409.
Gerlach R, Bird R, Hall A (2002) Bayesian variable selection in logistic regression: Predicting com-
pany earnings direction. Australian & New Zealand Journal of Statistics, 44, 155–168.
Ghosh J, Ghattas A (2015) Bayesian variable selection under collinearity. The American Statistician,
69(3), 165–173.
Ghosh S, Mukhopadhyay P, Lu J-C (2006) Bayesian analysis of zero-inflated regression models.
Journal of Statistical Planning and Inference, 136, 1360–1375.
Greene W (2008) Functional forms for the negative binomial model for count data. Economics Letters,
99(3), 585–590.
Greenland S (2000) Causal analysis in the health sciences. Journal of the American Statistical Association,
95, 286–289.
Hensher DA, Greene WH (2003) The mixed logit model: the state of practice. Transportation, 30(2),
133–176.
Holloway G, Shankar B, Rahman S (2002) Bayesian spatial probit estimation: A primer and an application
to HYV rice adoption. Agricultural Economics, 27(3), 383–402.
Holmes C, Held L (2006) Bayesian auxiliary variable models for binary and multinomial regression.
Bayesian Analysis, 1, 145–168.
Hooten M, Hobbs N (2015) A guide to Bayesian model selection for ecologists. Ecological Monographs,
85(1), 3–28.
Ibrahim JG, Chen MH (2000) Power prior distributions for regression models. Statistical Science,
15(1), 46–60.
Imai K, Keele L, Tingley D (2010) A general approach to causal mediation analysis. Psychological
Methods, 15(4), 309.
Ishwaran H, Kogalur U, Rao J (2010) spikeslab: Prediction and variable selection using spike and slab
regression. R Journal, 2(2), 68–73.
Ishwaran H, Rao J (2005) Spike and slab variable selection: Frequentist and Bayesian strategies.
Annals of Statistics, 33, 730–773.
Jackson R, Stewart A, Beaglehole R, Scragg R (1985) Alcohol consumption and blood pressure.
American Journal of Epidemiology, 122(6), 1037–1044.
Jia Z, Xu S (2007) Mapping quantitative trait loci for expression abundance. Genetics, 176, 611–623.
Joffe MM, Ten Have TR, Feldman HI, Kimmel SE (2004) Model selection, confounder control, and
marginal structural models: Review and new applications. The American Statistician, 58(4),
272–279.
Johnson VE, Albert JH (1999) Ordinal Data Modeling. Springer-Verlag.
Jung RC, Kukuk M, Liesenfeld R (2006) Time series of count data: Modeling, estimation and diagnostics.
Computational Statistics & Data Analysis, 51, 2350–2364.
Kahn M, Raftery A (1996) Discharge rates of Medicare stroke patients to skilled nursing facilities:
Bayesian logistic regression with unobserved heterogeneity. Journal of the American Statistical
Association, 91, 29–41.
Khalili A, Chen J (2007) Variables selection in finite mixture of regression models. Journal of the
American Statistical Association, 102, 1025–1038.
Kim H, Sun D, Tsutakawa RK (2002) Lognormal vs. gamma: Extra variations. Biometrical Journal,
44(3), 305–323.
Kinney S, Dunson D (2007) Fixed and random effects selection in linear and logistic models.
Biometrics, 63, 690–698.
Kitagawa G, Gersch W (1985) A smoothness priors time-varying AR coefficient modeling of nonstationary
covariance time series. IEEE Transactions on Automatic Control, 30, 48–56.
Kotz S, Kozubowski T, Podgórski K (2001) The Laplace Distribution and Generalizations: A Revisit with
Applications to Communications, Economics, Engineering, and Finance. Springer.
Kruijer W, Stein A, Schaafsma W, Heijting S (2007) Analyzing spatial count data, with an application
to weed counts. Environmental and Ecological Statistics, 14, 399–410.
Kuhn I (2007) Incorporating spatial autocorrelation may invert observed patterns. Diversity and
Distributions, 13, 66–69.
Kuo L, Mallick B (1998) Variable selection for regression models. Sankhya B, 60, 65–81.
Lange T, Vansteelandt S, Bekaert M (2012) A simple unified approach for estimating natural direct
and indirect effects. American Journal of Epidemiology, 176(3), 190–195.
Lee K, Chen R, Wu Y (2016) Bayesian variable selection for finite mixture model of linear regressions.
Computational Statistics & Data Analysis, 95, 1–16.
Lee K-J, Chen R-B (2015) BSGS: Bayesian sparse group selection. The R Journal, 7(2), 122–133.
Leroux B, Lei X, Breslow N (1999) Estimation of disease rates in small areas: A new mixed model
for spatial dependence, pp 135–178, in Statistical Models in Epidemiology, the Environment and
Clinical Trials, eds M Halloran, D Berry. Springer-Verlag, New York.
LeSage J (2004) A family of geographically weighted regression models, Chapter 11, pp 241–264, in
Advances in Spatial Econometrics: Methodology, Tools and Applications, eds L Anselin, R Florax, S
Rey. Springer, New York.
LeSage J, Dominguez M (2012) The importance of modeling spatial spillovers in public choice analy-
sis. Public Choice, 150(3–4), 525–545.
LeSage J, Kelley Pace R (2009) Introduction to Spatial Econometrics. CRC Press/Taylor & Francis.
Leung Y, Mei C-L, Zhang W-X (2000) Statistical tests for spatial nonstationarity based on the geo-
graphically weighted regression model. Environment and Planning A, 32(1), 9–32.
Lichstein J, Simons T, Shriner S, Franzreb K (2002) Spatial autocorrelation and autoregressive models
in ecology. Ecological Monographs, 72, 445–463.
Lipsitz SR, Ibrahim JG (1996) A conditional model for incomplete covariates in parametric regression
models. Biometrika, 83(4), 916–922.
Loeys T, Moerkerke B, De Smet O, Buysse A (2012) The analysis of zero-inflated count data: Beyond
zero-inflated Poisson regression. British Journal of Mathematical and Statistical Psychology, 65(1),
163–180.
Makalic E, Schmidt D (2016) High-dimensional Bayesian regularised regression with the BayesReg
package. arXiv preprint arXiv:1611.06649.
Malsiner-Walli G, Wagner H (2019) Comparing spike and slab priors for Bayesian variable selection.
arXiv preprint arXiv:1812.07259.
Marshall E, Spiegelhalter D (2007) Identifying outliers in Bayesian hierarchical models: A simulation-
based approach. Bayesian Analysis, 2, 409–444.
Martin T, Wintle B, Rhodes J, Kuhnert P, Field S, Low-Choy S, Tyre A, Possingham H (2005) Zero tol-
erance ecology: improving ecological inference by modelling the source of zero observations.
Ecology Letters, 8(11), 1235–1246.
McCandless L, Douglas I, Evans S, Smeeth L (2010) Cutting feedback in Bayesian regression adjust-
ment for the propensity score. International Journal of Biostatistics, 6(2), 16.
McCandless L, Gustafson P, Austin P (2009) Bayesian propensity score analysis for observational
data. Statistics in Medicine, 28, 94–112.
McElreath R (2016) Statistical Rethinking: A Bayesian Course with Examples in R and Stan, Vol. 122. CRC
Press.
McNamee R (2005) Regression modelling and other methods to control confounding. Occupational
and Environmental Medicine, 62(7), 500–506, 472.
Mebane W, Sekhon J (2004) Robust estimation and outlier detection for overdispersed multinomial
models of count data. American Journal of Political Science, 48(2), 391–410.
Millar R (2009) Comparison of hierarchical Bayesian models for overdispersed count data using DIC
and Bayes’ factors. Biometrics, 65, 962–969.
Mohr D (2007) Bayesian identification of clustered outliers in multiple regression. Computational
Statistics & Data Analysis, 51, 3955–3967.
Musio M, Sauleau E, Buemi A (2010) Bayesian semi-parametric ZIP models with space-time interac-
tions: An application to cancer registry data. Mathematical Medicine and Biology, 27(2), 181–194.
Naimi A, Kaufman J, MacLehose R (2014b) Mediation misgivings: Ambiguous clinical and public
health interpretations of natural direct and indirect effects. International Journal of Epidemiology,
43(5), 1656–1661.
Naimi A, Moodie E, Auger N, Kaufman J (2014a) Constructing inverse probability weights for con-
tinuous exposures: A comparison of methods. Epidemiology, 25, 292–299.
Nelson K, Leroux B (2006) Statistical models for autocorrelated count data. Statistics in Medicine, 25,
1413–1430.
Neyens T, Faes C, Molenberghs G (2012) A generalized Poisson-gamma model for spatially-overdis-
persed data. Spatial and Spatio-temporal Epidemiology, 3(3), 185–194.
Nicholls D, Quinn B (1982) Random Coefficient Autoregressive Models: An Introduction. Springer-Verlag,
New York.
Nott D, Leng C (2010) Bayesian projection approaches to variable selection in generalized linear
models. Computational Statistics & Data Analysis, 54(12), 3227–3241.
Oh MS, Lim YB (2001) Bayesian analysis of time series Poisson data. Journal of Applied Statistics, 28(2),
259–271.
O’Hara R, Sillanpaa M (2009) A review of Bayesian variable selection methods: What, how and
which. Bayesian Analysis, 4(1), 85–118.
Osborne P, Foody G, Suárez-Seoane S (2007) Non-stationarity and local approaches to modelling the
distributions of wildlife. Diversity and Distributions, 13, 313–323.
Park T, Casella G (2008) The Bayesian Lasso. Journal of the American Statistical Association, 103, 681–686.
Pearl J (2014) Interpretation and identification of causal mediation. Psychological Methods, 19(4),
459–481.
Pérez-Sánchez J, Salmerón-Gómez R, Ocaña-Peinado F (2018) A Bayesian asymmetric logistic model
of factors underlying team success in top-level basketball in Spain. Statistica Neerlandica, 73(1),
22–43.
Perrakis K, Fouskakis D, Ntzoufras I (2015) Bayesian variable selection for generalized linear mod-
els using the power-conditional-expected-posterior prior, pp 59–73, in Bayesian Statistics from
Methods to Models and Applications: Research from BAYSM 2014, eds S Fruhwirth-Schnatter, A
Bitto, G Kastner, A Posekany, Vol. 126. Springer, New York.
Piironen J, Vehtari A (2016) Projection predictive variable selection using Stan+R. Proceedings of the
2016 IEEE 26th International Workshop on Machine Learning for Signal Processing.
http://arxiv.org/abs/1508.02502
Piironen J, Vehtari A (2017a) Comparison of Bayesian predictive methods for model selection.
Statistics and Computing, 27, 711–735.
Piironen J, Vehtari A (2017b) On the hyperprior choice for the global shrinkage parameter in the
horseshoe prior, in Proceedings of the 20th International Conference on Artificial Intelligence and
Statistics (AISTATS), PMLR, 54, 905–913.
Piironen J, Vehtari A (2017c) Sparsity information and regularization in the horseshoe and other
shrinkage priors. Electronic Journal of Statistics, 11(2), 5018–5051.
Plummer M (2008) Penalized loss functions for Bayesian model comparison. Biostatistics, 9(3),
523–539.
Polson N, Scott J (2010) Shrink globally, act locally: Sparse Bayesian regularization and prediction, pp
501–553, in Bayesian Statistics 9, eds J Bernardo, M Bayarri, J Berger, A Dawid, D Heckerman,
A Smith, M West. Oxford University Press, New York.
Polson N, Scott J (2012) On the half-Cauchy prior for a global scale parameter. Bayesian Analysis, 7(4),
887–902.
Qin ZS, Damien P, Walker S (2003) Scale mixture models with applications to Bayesian inference, pp
394–395, in AIP Conference Proceedings, Vol. 690, No. 1. AIP.
Raftery A, Painter I, Volinsky C (2005) BMA: An R package for Bayesian model averaging. R News,
5(2), 2–8.
Reich B, Fuentes M, Herring A, et al. (2010) Bayesian variable selection for multivariate spatially-
varying coefficient regression. Biometrics, 66, 772–782.
Richardson S, Bottolo L, Rosenthal J (2010) Bayesian models for sparse regression analysis of high
dimensional data. Bayesian Statistics, 9, 539–569.
Robins J, Hernan M, Brumback B (2000) Marginal structural models and causal inference in epidemiology.
Epidemiology, 11(5), 550–560.
Rockova V, Lesaffre E, Luime J, Löwenberg B (2012) Hierarchical Bayesian formulations for selecting
variables in regression models. Statistics in Medicine, 31(11–12), 1221–1237.
Scott S (2011) Data augmentation, frequentist estimation, and the Bayesian analysis of multinomial
logit models. Statistical Papers, 52(1), 87–109.
Shumway R (2016) State space models, Chapter 6, in Time Series Analysis and Its Applications, eds R
Shumway, D Stoffer. Springer, New York.
Smith RL, Davis JM, Sacks J, Speckman P, Styer P (2000) Regression models for air pollution and
daily mortality: Analysis of data from Birmingham, Alabama. Environmetrics: The official journal
of the International Environmetrics Society, 11(6), 719–743.
Smith T, LeSage J (2004) A Bayesian probit model with spatial dependencies, pp 127–160, in
Advances in Econometrics, Vol 18: Spatial and Spatiotemporal Econometrics, eds J LeSage, R Kelley Pace.
Elsevier Science.
Snowden JM, Rose S, Mortimer KM (2011) Implementation of G-computation on a simulated data
set: Demonstration of a causal inference technique. American Journal of Epidemiology, 173(7),
731–738.
Spiegelhalter D (1998) Bayesian graphical modelling: A case-study in monitoring health outcomes.
Applied Statistics, 47, 115–133.
Sun D, Tsutakawa RK, Speckman PL (1999) Posterior distribution of hierarchical models using CAR(1)
distributions. Biometrika, 86(2), 341–350.
Tchetgen E, Vanderweele T (2014) Identification of natural direct effects when a confounder of the
mediator is directly affected by exposure. Epidemiology, 25(2), 282–291.
Tingley D, Yamamoto T, Hirose K, Keele L, Imai K (2014) Mediation: R package for causal mediation
analysis. Journal of Statistical Software, 59. https://www.jstatsoft.org/article/view/v059i05
Tingley M (2012) A Bayesian ANOVA scheme for calculating climate anomalies, with applications to
the instrumental temperature record. Journal of Climate, 25(2), 777–791.
Tutz G, Gertheiss J (2016) Regularized regression for categorical data. Statistical Modelling, 16(3),
161–200.
Utazi C, Sahu S, Atkinson P, Tejedor N, Tatem AJ (2016) A probabilistic predictive Bayesian approach
for determining the representativeness of health and demographic surveillance networks.
Spatial Statistics, 17, 161–178.
VanderWeele T (2013) Policy-relevant proportions for direct effects. Epidemiology, 24(1), 175–176.
VanderWeele T (2015) Explanation in Causal Inference: Methods for Mediation and Interaction. OUP.
VanderWeele T, Vansteelandt S (2014) Mediation analysis with multiple mediators. Epidemiologic
Methods, 2(1), 95–115.
Vansteelandt S, Daniel R (2014) On regression adjustment for the propensity score. Statistics in
Medicine, 33, 4053–4072.
Vaughn M, Beaver K, Wexler J, DeLisi M, Roberts G (2011) The effect of school dropout on verbal
ability in adulthood: A propensity score matching approach. Journal of Youth and Adolescence,
40(2), 197–206.
Verdinelli I, Wasserman L (1991) Bayesian analysis of outlier problems using the Gibbs sampler.
Statistics and Computing, 1(2), 105–117.
Viele K, Tong B (2002) Modeling with mixtures of linear regressions. Statistics and Computing, 12(4),
315–330.
Wagner H, Pauger D (2016) Discussion: Bayesian regularization and effect smoothing for categorical
predictors. Statistical Modelling, 16(3), 220–227.
Wang A, Arah O (2015) G-computation demonstration in causal mediation analysis. European Journal
of Epidemiology, 30(10), 1119–1127.
Wang L, Zhou XH (2007) Assessing the adequacy of variance function in heteroscedastic regression
models. Biometrics, 63(4), 1218–1225.
Wang P, Puterman M (1999) Markov Poisson regression models for discrete time series, part 1:
Methodology. Journal of Applied Statistics, 26, 855–869.
Wang P, Puterman M, Cockburn I, Le N (1996) Mixed Poisson regression models with covariate
dependent rates. Biometrics, 52, 381–400.
Weitzen S, Lapane K, Toledano AY, Hume AL, Mor V (2004) Principles for modeling propensity
scores in medical research: A systematic literature review. Pharmacoepidemiology and Drug Safety,
13(12), 841–853.
Wheeler D, Calder C (2006) Bayesian spatially varying coefficient models in the presence of collinear-
ity. ASA Section on Bayesian Statistical Science, Proceedings of the Joint Statistical Meetings,
Seattle, WA, August 6–10, 2006.
Wheeler D, Calder C (2007) An assessment of coefficient accuracy in linear regression models with
spatially varying coefficients. Journal of Geographical Systems, 9, 145–166.
Wheeler D, Tiefelsdorf M (2005) Multicollinearity and correlation among local regression coefficients
in geographically weighted regression. Journal of Geographical Systems, 7, 161–187.
Wilhelm S, de Matos M (2013) Estimating spatial probit models in R. The R Journal, 5(1), 130–143.
Windle J (2016) BayesLogit. https://www.rdocumentation.org/packages/BayesLogit/versions/0.6
Winkelmann R, Zimmermann KF (1995) Recent developments in count data modelling: Theory and
application. Journal of Economic Surveys, 9(1), 1–24.
Winship C, Western B (2016) Multicollinearity and model misspecification. Sociological Science, 3,
627–649.
Xu X, Ghosh M (2015) Bayesian variable selection and estimation for group lasso. Bayesian Analysis,
10(4), 909–936.
Yi N, Ma S (2012) Hierarchical shrinkage priors and model fitting for high-dimensional generalized
linear models. Statistical Applications in Genetics and Molecular Biology, 11(6).
DOI: https://doi.org/10.1515/1544-6115.1803.
Yuan M, Lin Y (2005) Efficient empirical Bayes variable selection and estimation in linear models.
Journal of the American Statistical Association, 100, 1215–1224.
Zellner A, Siow A (1980) Posterior odds ratios for selected regression hypotheses, pp 585–603, in
Bayesian Statistics: Proceedings of the First International Meeting Held in Valencia, eds Bernardo J,
DeGroot M, Lindley D, Smith A. University of Valencia Press.
Zeugner S, Feldkircher M (2015) Bayesian model averaging employing fixed and flexible priors: The
BMS package for R. Journal of Statistical Software, 68(4), 1–37.
Zhou M, Li L, Dunson D, Carin L (2012) Lognormal and gamma mixed negative binomial regression.
Proceedings of the 29th International Conference on Machine Learning, 2012, 1343–1350.
Zigler C, Dominici F (2014) Uncertainty in propensity score estimation: Bayesian methods for vari-
able selection and model-averaged causal effects. Journal of the American Statistical Association,
109(505), 95–107.
Zigler C, Watts K, Yeh R, Wang Y, Coull B, Dominici F (2013) Model feedback in Bayesian propensity
score estimation. Biometrics, 69(1), 263–273.
8
Bayesian Multilevel Models

8.1 Introduction
The rationale for applying multilevel models to hierarchical data is well-established
(Snijders and Bosker, 1999; Skrondal and Rabe-Hesketh, 2004). When lower level units are
nested within one or more higher level strata, conventional single-level regression analy-
sis is not appropriate, since observations are no longer independent: pupils in the same
schools, or households in the same communities, tend to be more similar to one another
than pupils in different schools or households in different communities. Such dependency
means standard errors are downwardly biased if the nesting is ignored, and spurious
inferences regarding predictor or treatment effects may be made (Hox, 2002; Aarts et al.,
2015; Bliese and Hanges, 2004).
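
To give a sense of scale for this downward bias, a common approximation, not taken from this chapter, is the design effect 1 + (n - 1)ρ for equal-sized clusters of n units with intraclass correlation ρ. The Python sketch below uses hypothetical variance components and a hypothetical naive standard error, and the function names are introduced only for illustration.

```python
from math import sqrt

def intraclass_correlation(between_var, within_var):
    """Share of total outcome variance that lies between clusters."""
    return between_var / (between_var + within_var)

def design_effect(cluster_size, icc):
    """Variance inflation for a sample mean under equal-sized clusters."""
    return 1 + (cluster_size - 1) * icc

# Hypothetical setting: 21 pupils per school, 10% of variance between schools.
icc = intraclass_correlation(between_var=1.0, within_var=9.0)
deff = design_effect(cluster_size=21, icc=icc)

naive_se = 0.10                       # standard error that ignores the nesting
adjusted_se = naive_se * sqrt(deff)   # standard error allowing for clustering
```

Even a modest intraclass correlation of 0.1 roughly triples the sampling variance here, so the naive standard error is too small by a factor of about 1.7.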
In multilevel analysis, predictors may be defined at any level and the interest focuses
on adjusting predictor effects for the simultaneous operation of contextual and individual
variability in the outcome. This may be important in health applications, for example, if
impacts of individual-level risk factors vary by geographic context (Congdon and Lloyd,
2010). Another major goal is variance partitioning (Goldstein et al., 2002; Gelman and
Pardoe, 2006); for example, what proportion of area variations in crime rates is due to
characteristics of those areas (what is sometimes termed “contextual variation”), and how
much is due to the characteristics of the individuals who live in these areas (termed “com-
positional variation”) (Subramanian et al., 2003).
One may also be interested in estimates for geographic areas or institutions that include
both individual and area information; for example, the multilevel model for county radon
estimates discussed by Gelman (2006). Gelman (2006) notes that compared to estimates
involving no pooling or complete pooling, inferences from multilevel models are more
reasonable. Complete pooling leads to identical estimates for all units, while a no-pooling
model (no borrowing of strength) overfits the data, giving implausibly high or low estimates
for particular units and low precision for such estimates.
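The contrast between the three pooling options can be sketched in a few lines of base R. This is an illustration only, not the radon analysis: the cluster sizes, the values of sigma, tau and mu, and the treatment of both variances as known are all assumptions made for the sketch, whereas a full multilevel analysis would estimate them.

```r
set.seed(1)
m  <- 8                                 # number of clusters (hypothetical)
ni <- c(3, 4, 5, 6, 10, 20, 40, 80)     # unequal cluster sizes
sigma <- 2; tau <- 1; mu <- 10          # assumed level 1 sd, level 2 sd, overall mean
theta <- rnorm(m, mu, tau)              # true cluster means
cl <- rep(seq_len(m), ni)               # cluster label of each observation
y  <- rnorm(sum(ni), theta[cl], sigma)

no_pool   <- tapply(y, cl, mean)        # separate estimate for each cluster
comp_pool <- mean(y)                    # one estimate shared by all clusters
# partial pooling with variances treated as known: precision-weighted compromise
w <- (ni / sigma^2) / (ni / sigma^2 + 1 / tau^2)
part_pool <- w * no_pool + (1 - w) * comp_pool
round(cbind(ni, no_pool, part_pool, comp_pool), 2)
```

Small clusters receive a low weight w and are pulled strongly towards the pooled mean, while large clusters retain estimates close to their own averages.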
As well as predictor effects at any level, a multilevel model is likely to involve ran-
dom effects defined over the clusters at higher level(s), and possibly correlation between
different cluster effects. As in Chapter 4, one seeks to pool strength in inferences
about clusters when the number of observations for each cluster might be quite small.
While exchangeable cluster effects dominate the multilevel literature, there may well
be instances where random cluster effects are better regarded as non-exchangeable, as
recognised in the general design generalised linear mixed model of Zhao et al. (2006). For
example, it is possible that the significance level of cluster effects is overstated in area
multilevel applications that disregard spatial dependence between clusters (Chaix et al.,
2005; Dong et al., 2016).


Application of multilevel models from a Bayesian perspective exemplifies many of the
issues referred to in earlier chapters; these include sensitivity to priors and setting priors
to ensure identifiability and satisfactory mixing (Draper, 2006). Devices such as hierarchical
centring may reduce correlation in the joint posterior and increase MCMC effective
sample sizes (Givens and Hoeting, 2012; Browne, 2004). The Bayesian approach has benefits
in ensuring that uncertainty in variance components is fully reflected in posterior
inferences (e.g. regarding cluster effects), an important issue when the number of clusters
is small and the likelihood function of level 2 variance parameters may be asymmetric
(Seltzer et al., 1996, 2002).
Improved software for Bayesian multilevel analysis is exemplified by the rstan-based
brms package (Buerkner, 2017), with an overview provided by Mai and Zhang (2018). The
remaining sections of the chapter consider the normal linear multilevel model (Section
8.2), generalised linear mixed and conjugate models for multilevel discrete data (Section 8.3),
crossed factor and multiple membership random effect models (Section 8.4), and robust
multilevel models (Section 8.5).

8.2 The Normal Linear Mixed Model for Hierarchical Data


A multilevel model typically assumes observations to be independent conditional on fixed
regression and random effects defined at one or more levels in the data hierarchy. The
prototype two-level model for a continuous response $y_{ij}$ with repetitions $j = 1, \ldots, n_i$
(e.g. pupils, patients, households) in clusters $i = 1, \ldots, m$ (e.g. schools, hospitals,
communities) tackles a similar scenario to that considered in Chapter 4, but assumes individual
observations to be available, rather than cluster averages. Consider observation level attribute
vectors $x_{ij}$ of dimension $p$, and $z_{ij}$ of dimension $q$, typically a subvector of
$x_{ij}$ with $q \le p$ (Chen and Dunson, 2003).
Then a widely used form of the normal linear mixed model for nested data (e.g. Snijders
and Berkhof, 2002) specifies

\[ y_{ij} = x_{ij}\beta + z_{ij}b_i + u_{ij}, \tag{8.1} \]

with $b_i$ and $u_{ij}$ denoting random cluster effects and observation level random effects
respectively. The intercept $x_{1ij} = 1$ with parameter $\beta_1$ is included in $x_{ij}$. With
$N = \sum_{i=1}^{m} n_i$ total observations, the nested form of the model is

\[ y = X\beta + Zb + u, \]

where $y$ is $N \times 1$,

\[ X \equiv \begin{pmatrix} X_1 \\ \vdots \\ X_m \end{pmatrix} \]

is $N \times p$, with $X_i = (x_{i1}, \ldots, x_{in_i})'$ of dimension $n_i \times p$, and where
the $N \times mq$ matrix $Z$ is block diagonal with $m$ diagonal blocks
$Z_i = (z_{i1}, \ldots, z_{in_i})'$ of dimension $n_i \times q$ (Gamerman, 1997, p.61; Zhao et al.,
2006, p.3). Here $\beta$ is a $(p \times 1)$ vector of population parameters and
$b_i = (b_{1i}, \ldots, b_{qi})'$ is a $q \times 1$ vector of zero mean cluster specific deviations
around those population parameters, with $b_i$ assumed random.
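As a check on the stacked notation, the following base R sketch builds the block diagonal matrix Z for a toy data set and confirms that the stacked form reproduces the observation level form (8.1). The cluster sizes, parameter values and the choice of letting both predictors vary over clusters (so that each block of Z equals the corresponding rows of X) are assumptions made purely for illustration.

```r
set.seed(2)
ni <- c(2, 3, 2); m <- length(ni); N <- sum(ni)
p <- 2; q <- 2                             # intercept and one predictor, both varying
cl <- rep(seq_len(m), ni)                  # cluster label of each observation
X  <- cbind(1, rnorm(N))                   # N x p design matrix
beta <- c(1, 0.5)                          # hypothetical population parameters
b  <- matrix(rnorm(m * q, 0, 0.3), m, q)   # row i holds b_i
u  <- rnorm(N)

# Block diagonal Z (N x mq): block i holds the rows of cluster i
Z <- matrix(0, N, m * q)
for (i in seq_len(m)) {
  rows <- which(cl == i)
  cols <- (i - 1) * q + seq_len(q)
  Z[rows, cols] <- X[rows, ]
}
y_stacked <- X %*% beta + Z %*% as.vector(t(b)) + u   # y = X beta + Z b + u
y_obs     <- X %*% beta + rowSums(X * b[cl, ]) + u    # observation level form (8.1)
all.equal(as.vector(y_stacked), as.vector(y_obs))
```

Here as.vector(t(b)) stacks the cluster vectors b_1, ..., b_m into a single vector of length mq, matching the column ordering of Z.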

While random effects models offer a way to borrow strength (e.g. when level 2 cluster
sizes $n_i$ are relatively small), fixed effects models, especially for varying intercepts, are
nevertheless advocated in longitudinal applications, particularly in econometrics. Fixed effects
for parameter collections are sometimes used in cross-sectional multilevel applications
(Snijders and Berkhof, 2002). The choice between the two depends on the purpose of the
statistical inference and how far the level 2 units can be regarded as a sample from a
policy-relevant population (Draper, 1995). If the sampled clusters are representative of
(exchangeable with) a wider population, then a random coefficient model is, in principle,
appropriate (Hsiao, 1996). If statistical inference is confined to the particular unique set of
level 2 units included in a data set, then a fixed effects model may be more appropriate.
The conjugate linear normal model with random cluster effects assumes multivariate
normality for these effects and for the observation level errors. Assuming the $z_{ij}$ are a
subvector of $x_{ij}$, the cluster effects have zero mean, so that

\[ (b_{1i}, \ldots, b_{qi})' \sim N_q(0, \Sigma_b). \]

The total impact of $x_{rij}$ is then obtained by cumulating over fixed and random components
as $\beta_r + b_{ri}$.
Assume the unstructured level 1 errors $u_i = (u_{i1}, \ldots, u_{in_i})'$ have prior
$u_i \sim N_{n_i}(0, H_i)$, where $H_i$ represents the within-cluster dispersion matrix. The
stacked form of the linear mixed model at cluster level, namely $y_i = X_i\beta + Z_i b_i + u_i$,
may then be expressed in joint likelihood form as

\[ \begin{pmatrix} y_i \\ b_i \end{pmatrix} \sim N_{n_i+q}\left( \begin{pmatrix} X_i\beta \\ 0 \end{pmatrix}, \begin{pmatrix} Z_i\Sigma_b Z_i' + H_i & Z_i\Sigma_b \\ \Sigma_b Z_i' & \Sigma_b \end{pmatrix} \right), \]

or in marginal form as

\[ y_i \sim N_{n_i}(X_i\beta, Z_i\Sigma_b Z_i' + H_i). \]

The level 1 errors are typically assumed to be independent, given cluster effects and
regression terms, often with $H_i = \sigma^2 I$ for all clusters.
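For the simplest case of a random intercept only, the marginal covariance above has a compound symmetry form, and its implied correlation between two observations in the same cluster is the intraclass correlation. The base R sketch below evaluates this for hypothetical variance values (sigma2 and sigma2_b are assumptions, not estimates from any data set in the chapter).

```r
n_i <- 4
sigma2 <- 4; sigma2_b <- 1                   # hypothetical level 1 and level 2 variances
Z_i <- matrix(1, n_i, 1)                     # random intercept only (q = 1)
Sigma_b <- matrix(sigma2_b, 1, 1)
H_i <- sigma2 * diag(n_i)                    # independent level 1 errors
V_marg <- Z_i %*% Sigma_b %*% t(Z_i) + H_i   # Z_i Sigma_b Z_i' + H_i
V_marg
icc <- sigma2_b / (sigma2_b + sigma2)        # intraclass correlation
cov2cor(V_marg)[1, 2]                        # off-diagonal correlation equals icc
```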
The conjugate model then takes inverse gamma and inverse Wishart priors for $\sigma^2$ and
$\Sigma_b$ respectively (or gamma and Wishart priors on $\sigma^{-2}$ and $\Sigma_b^{-1}$), and
common practice is to adopt just proper priors, e.g. $\sigma^2 \sim IG(\varepsilon, \varepsilon)$
where $\varepsilon$ is small. Research has shown that such priors can lead to effectively improper
posteriors, and also that inferences are sensitive to the choice of hyperparameters (Natarajan
and McCulloch, 1998). Alternatives for the level 1 variance include uniform or half-t priors on
$\sigma$ (Gelman, 2006b), while hierarchical models for $\Sigma_b$ are considered by Daniels and
Kass (1999) and Daniels and Zhao (2003). A separation strategy using the LKJ (Lewandowski,
Kurowicka and Joe) prior is another option (McElreath, 2016).
Following Gamerman (1997), one may sometimes also include random predictor effects
at observation level

\[ y_{ij} = x_{ij}\beta + z_{ij}b_i + w_{ij}u_{ij}, \]

which is one way of specifying what is known as complex level 1 variation or
heteroscedasticity related to level 1 attributes (Browne et al., 2002). This means that variances
depend on subject level predictors (when subjects $j$ are nested in clusters $i$) or in panel data
applications that variances are changing over time (when times $t$ are nested in subjects $i$).
For categorical $w_{ij}$, one may equivalently specify complex variation in terms of
category-specific variances. Thus Goldstein (2005) considers school exam data $y_{ij}$ (pupils $j$
nested in schools $i$), with a single predictor gender $x_{ij}$ (=1 for boy, 0 for girl). Then
level 1 heteroscedasticity can be represented as

\[ y_{ij} = \beta_1 + \beta_2 x_{ij} + x_{ij}u_{1ij} + (1 - x_{ij})u_{0ij}, \]

where $u_{0ij} \sim N(0, \sigma_0^2)$ is the prior for girl observation level errors, and
$u_{1ij} \sim N(0, \sigma_1^2)$ is the prior for boy observation level errors. Equivalently,
setting $w_{ij} = x_{ij}$,

\[ y_{ij} \sim N(\beta_1 + \beta_2 x_{ij}, \sigma^2_{w_{ij}}). \]
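The equivalence of the two ways of writing gender-specific level 1 variances can be illustrated by simulation. The sketch below is not Goldstein's exam data: the sample size, the regression coefficients and the standard deviations s0 and s1 are hypothetical, and cluster effects are omitted to keep the focus on the level 1 variance.

```r
set.seed(3)
n <- 4000
x <- rbinom(n, 1, 0.5)                       # 1 = boy, 0 = girl
beta <- c(50, -2); s0 <- 6; s1 <- 9          # hypothetical coefficients and level 1 sds
u0 <- rnorm(n, 0, s0); u1 <- rnorm(n, 0, s1)
y  <- beta[1] + beta[2] * x + x * u1 + (1 - x) * u0

# Equivalent form: y ~ N(beta1 + beta2 * x, sigma^2_w) with variance indexed by w = x
res <- y - (beta[1] + beta[2] * x)
sds <- tapply(res, x, sd)                    # residual sd by gender
sds                                          # close to s0 (girls) and s1 (boys)
```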

It can be seen that random variation over clusters or at level 1 in specification (8.1) raises
questions of empirical identification (see Chapter 1), as the fixed regression effects are
confounded with the mean of the associated cluster random effect. Suppose
$x_{ij} = (x_{fij}, x_{hij})$ and $\beta = (\beta_f, \beta_h)$, where $x_{fij}$ of dimension
$p - q$ contains predictors where no variation in clusters is posited, while $x_{hij}$ contains
predictors (usually including the constant term) which have a randomly varying effect over
clusters.
Under hierarchical centring of the cluster effects, which has been argued to improve
MCMC convergence (Gelfand et al., 1995), varying cluster effects $\gamma_{ri}$ are centred on
$\beta_r$ so that the $r$th varying predictor effect is $\gamma_{ri} = \beta_{hr} + b_{ri}$ in
cluster $i$. The parameterisation $(\beta, b_i) = ([\beta_f, \beta_h], b_i)$ with zero mean
$b_i$ is replaced by the parameterisation $(\beta_f, \gamma_i)$ where $\gamma_i = \beta_h + b_i$.
Then

\[ y_{ij} = x_{ij}\beta + z_{ij}\gamma_i + u_{ij} \quad (= x_{fij}\beta_f + x_{hij}\gamma_i + u_{ij}), \tag{8.2} \]

\[ (\gamma_{1i}, \ldots, \gamma_{qi}) \sim N_q(\beta_h, \Sigma_\gamma), \]

where the vectors $z_{ij}$ and $x_{ij}$ are now distinct, with $x_{ij}$ now containing only
$x_{fij}$, while $z_{ij} = x_{hij}$.
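The centred and uncentred parameterisations define the same linear predictor, and differ only in how the parameters are sampled. A minimal base R sketch of this equivalence follows, with hypothetical dimensions and parameter values (one fixed-effect predictor, plus an intercept and one slope varying over clusters).

```r
set.seed(4)
m <- 5; ni <- 6; cl <- rep(seq_len(m), each = ni); N <- m * ni
x_f <- rnorm(N)                             # predictor with a fixed effect only
x_h <- cbind(1, rnorm(N))                   # intercept and predictor with varying effects
beta_f <- 0.4; beta_h <- c(2, -1)           # hypothetical values
b   <- matrix(rnorm(m * 2, 0, 0.5), m, 2)   # zero mean deviations b_i
gam <- sweep(b, 2, beta_h, "+")             # centred effects gamma_i = beta_h + b_i

eta_uncentred <- x_f * beta_f + x_h %*% beta_h + rowSums(x_h * b[cl, ])
eta_centred   <- x_f * beta_f + rowSums(x_h * gam[cl, ])
all.equal(as.vector(eta_uncentred), as.vector(eta_centred))
```

Since the likelihood is identical under both forms, any gain from centring comes through reduced posterior correlation between the population parameters and the cluster effects during sampling.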

8.2.1 The Lindley–Smith Model Format


An alternative fully hierarchical presentation of the normal linear multilevel model (e.g.
Seltzer, 1993; Candel and Winkens, 2003) is based on the scheme of Lindley and Smith
(1972). It is assumed that all the effects of level 1 predictors (e.g. pupil characteristics in a
two-level educational attainment application) vary randomly over clusters, with their
variability explained by cluster predictors $W_i = (w_{1i}, \ldots, w_{ri})'$ (e.g. school-level
attributes). The two-level scheme is

\[ y_i = Z_i\beta_i + u_i, \quad i = 1, \ldots, m, \tag{8.3} \]

\[ \beta_i = \kappa W_i + b_i, \]

where $y_i = (y_{i1}, \ldots, y_{in_i})'$ is $n_i \times 1$, $\kappa$ is $q \times r$, $Z_i$ is
$n_i \times q$, $\beta_i$ is a $q \times 1$ vector of random cluster regression parameters, and the
errors $u_i = (u_{i1}, \ldots, u_{in_i})'$ have prior $u_{ij} \sim N(0, \sigma^2)$. The level 2
regression for $\beta_i$ involves a fixed effect parameter matrix $\kappa$, and errors
$b_i = (b_{1i}, \ldots, b_{qi})'$ with mean zero and precision matrix $T_b$. Substituting the
second equation in (8.3) into the first yields the model

\[ y_i = Z_i\kappa W_i + Z_i b_i + u_i. \]

To constrain the effect of one or more level 1 predictors to have an identical effect across all
clusters, the model may be reformulated as the mixed model (8.2) above.
In (8.3), one may assume flat (uniform) priors for $\kappa$, and gamma and Wishart priors for
$\sigma^{-2}$ and $T_b$, namely $1/\sigma^2 \sim Ga(a_u, b_u)$, $T_b \sim W(S_e, \nu_e)$. Also define
$r_{ij} = y_{ij} - z_{ij}\beta_i$, $\hat\beta_i = (Z_i'Z_i)^{-1}Z_i'y_i$,
$V_i = (\sigma^{-2}Z_i'Z_i + T_b)^{-1}$, $\tilde V_i = \sigma^2(Z_i'Z_i)^{-1}$,
$\Lambda_i = (\tilde V_i^{-1} + T_b)^{-1}\tilde V_i^{-1}$, $U_i = (\beta_i - \kappa W_i)$ and
$G = \left[\sum_{i=1}^{m} W_i'T_bW_i\right]^{-1}$. Then the full conditionals for Gibbs sampling are

\[ 1/\sigma^2 \sim Ga\left( 0.5(a_u + m),\; 0.5\left[ b_u + \sum_{i=1}^{m}\sum_{j=1}^{n_i} r_{ij}^2 \right] \right), \]

\[ \beta_i \sim N_q\left( \Lambda_i\hat\beta_i + (I - \Lambda_i)\kappa W_i,\; V_i \right), \]

\[ T_b \sim W\left( S_e + \sum_{i=1}^{m} U_iU_i',\; m + \nu_e \right), \]

\[ \kappa \sim N_r\left( G\sum_{i=1}^{m} W_i'T_b\beta_i,\; G \right). \]
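The full conditional for the cluster regression vector is a precision-weighted compromise between the within-cluster least squares estimate and the level 2 regression prediction. The base R sketch below checks this numerically for one cluster and takes a single Gibbs draw. All numerical values are hypothetical, and the vector prior_mean simply stands in for the level 2 prediction (the product of kappa and W_i) rather than being computed from cluster predictors.

```r
set.seed(5)
n_i <- 12; q <- 2
Z_i <- cbind(1, rnorm(n_i))
sigma2 <- 2.5                                      # current value of the level 1 variance
T_b <- solve(matrix(c(1, 0.3, 0.3, 0.5), 2, 2))    # level 2 precision matrix
prior_mean <- c(1, -0.5)                           # stands in for kappa W_i
y_i <- rnorm(n_i, Z_i %*% c(1.4, -0.2), sqrt(sigma2))

ZtZ      <- crossprod(Z_i)
beta_hat <- solve(ZtZ, crossprod(Z_i, y_i))        # within-cluster least squares estimate
V_i      <- solve(ZtZ / sigma2 + T_b)              # conditional covariance
Lambda_i <- V_i %*% ZtZ / sigma2                   # weight on beta_hat
cond_mean <- Lambda_i %*% beta_hat + (diag(q) - Lambda_i) %*% prior_mean
# the same mean written directly as a precision-weighted combination
cond_mean2 <- V_i %*% (crossprod(Z_i, y_i) / sigma2 + T_b %*% prior_mean)
all.equal(cond_mean, cond_mean2)
beta_draw <- cond_mean + t(chol(V_i)) %*% rnorm(q) # one Gibbs draw for this cluster
```

As the cluster size grows, Lambda_i approaches the identity matrix and the conditional mean moves towards the least squares estimate.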

Example 8.1 Maths Achievement


Hierarchical centring, random predictor effects, and level 1 complex variation are illus-
trated in an educational example from Kreft and de Leeuw (1998) concerning 519 pupils
in 23 schools (the “clusters”). Analysis uses the brms, rstan, and R2OpenBUGS pack-
ages. The response is for Math achievement, with predictors being homework hours
per week and gender (F = 1, M = 0). An initial model (model 1) assumes school-varying
intercepts only, while an extension (model 2) assumes varying intercepts and slopes on
homework, with the effect of gender not varying by cluster.
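Model 2 is

\[ y_{ij} = x_{ij}\beta_f + z_{ij}\gamma_i + u_{ij}, \]

\[ u_{ij} \sim N(0, \sigma^2), \]

\[ (\gamma_{1i}, \gamma_{2i}) \sim N_2(\beta_h, \Sigma_\gamma), \]

where $x_{ij} = (\text{gend})$ excludes an intercept, and $z_{ij} = (1, \text{homework})$, with
$\beta_{h1}$ providing the regression intercept.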
The brms package is applied to assess gain in fit, using WAIC (widely applicable
information criterion) and LOO-IC (leave-one-out information criterion), through adding
the extra source of cluster variability. The command form

    library(brms)
    # model 2: school-varying intercepts and homework slopes, fixed gender effect
    BRMS2 <- brm(y ~ 1 + homework + gend + (1 + homework | sch), data = D,
                 family = "gaussian", chains = 2)

ensures that mean random effects in model 2 are zero. The default setting for the
LKJ prior for the random effects correlation matrix is adopted, with shape parameter
1 (Buerkner, 2017, p.4). Sensitivity may be assessed, for example, by specifying
set_prior("lkj(2)", class = "cor") or set_prior("lkj(0.5)", class = "cor").
There is a substantial gain in fit in adding homework random slopes according to both
WAIC and LOO information criteria, which in this example have very similar values.
The WAIC falls from 3712.7 to 3578, and the LOOIC from 3712.9 to 3579.6.
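For readers who wish to see what lies behind a WAIC value, the criterion can be computed from a matrix of pointwise log-likelihoods. The function below is a minimal base R sketch, not the brms implementation (in brms the corresponding summaries are obtained by applying waic() and loo() to fitted model objects). The function name waic_from_loglik and the toy normal example are assumptions for illustration, and no safeguard against numerical underflow in exp() is included.

```r
waic_from_loglik <- function(ll) {
  # ll: S x N matrix of pointwise log-likelihoods (S posterior draws, N cases)
  lppd   <- sum(log(colMeans(exp(ll))))    # log pointwise predictive density
  p_waic <- sum(apply(ll, 2, var))         # effective number of parameters
  c(lppd = lppd, p_waic = p_waic, waic = -2 * (lppd - p_waic))
}

# Toy check: normal data with known unit variance and a flat prior on the mean
set.seed(6)
y  <- rnorm(50, 1, 1)
mu <- rnorm(1000, mean(y), 1 / sqrt(50))   # draws from the posterior of the mean
ll <- sapply(y, function(yi) dnorm(yi, mu, 1, log = TRUE))   # 1000 x 50 matrix
w  <- waic_from_loglik(ll)
w
```

With a single unknown mean, the effective number of parameters should be in the region of one.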
R2OpenBUGS codes for these models include exceedance checks
$\Pr(y_{ij,\mathrm{rep}} > y_{ij} \mid y)$ based on the mixed predictive method (Marshall and
Spiegelhalter, 2007; Green et al., 2009). Exceedance checks are also included at cluster level,
obtained by checking school averaged replicates $y_{ij,\mathrm{rep}}$ against school averages on
the response. In the R2OpenBUGS code for the second model, a Wishart prior with identity scale
matrix and 2 degrees of freedom is assumed for the cluster precision matrix $\Sigma_\gamma^{-1}$,
and a Ga(1,0.001) prior for the observation level precision $\sigma^{-2}$. Predictors are centred,
but not standardised (as in BRMS).
WAIC measures are very similar between the LKJ and Wishart approaches to model
2, at just under 3580. However, there is sensitivity to priors in covariance estimates:
the LKJ prior identifies a negative correlation of −0.78 between school intercepts and
slopes, whereas the Wishart prior method estimates a positive correlation of around
0.35. In fact, the 23 observed school-level averages on achievement and homework also
show a positive correlation of 0.40. Random intercepts under the Wishart model 2 have
a correlation of 0.90 with observed average school achievement levels, as against a cor-
responding correlation of 0.48 under the LKJ prior. Sensitivity may be partly related to
small cluster sizes (e.g. schools 2 and 3 have under 10 pupils).
Cross-validatory checks at school level (testmx.sch in the R2OpenBUGS code) under
model 2 show a 96% probability of overprediction for school 17, and a 6% probability of
overprediction for school 2. This may indicate the need to adjust for school-level predic-
tors, or to adopt a cluster effects scheme that is more robust to outlier schools. However,
this is an improved performance over model 1 which shows three schools with mixed
exceedance probabilities under 0.05 or over 0.95 (8, 17 and 18).
Individual pupils with extreme cross-validatory checks differ according to cluster ran-
dom effects approach. Both models have under 10% of cases with mixed cross-validatory
probabilities either exceeding 0.95 or under 0.05 (cvtail[1] and cvtail[2] in the code). For
the random intercepts model, the lowest (highest) exceedance probabilities are for sub-
jects 51 and 88 respectively, subject 51 having zero homework hours but a relatively high
achievement of 67, while subject 88 has 5 homework hours but achievement of 33.
The random intercepts and slopes model shows widely discrepant homework effects
between schools (under both LKJ and Wishart priors). Hence, outlier pupils may be
identified if they are discrepant with the cluster sub-model defined by school-specific
intercepts and slopes.
A third analysis illustrates the economy of coding possible with rstan and assumes
$\sigma^2$ differing by gender (complex level 1 variation). An LKJ prior is assumed on the
intercepts-slopes correlation. The posterior mean residual standard deviation is found to
be slightly lower for females as compared to males (7.05 vs. 7.64), but the LOO-IC is
unchanged (in fact slightly increased) at 3581.

8.3 Discrete Responses: GLMM, Conjugate, and Augmented Data Models


While conjugate multilevel structures can be developed for discrete responses such as
counts or proportions (see Section 8.3.2), a more flexible approach is based on the generalised
linear mixed model (GLMM), which extends the linear normal formulation to discrete
outcomes. Thus, consider univariate observations $y_{ij}$, with repetitions $j$ nested in clusters
$i$, that, conditional on cluster effects $b_i$, follow an exponential family density

\[ f(y_{ij} \mid b_i) \propto \exp\left( \frac{y_{ij}\theta_{ij} - d(\theta_{ij})}{\phi_{ij}} + c(y_{ij}, \phi_{ij}) \right), \]

where $\theta_{ij}$ is the canonical parameter and $\phi_{ij}$ is usually a known scale parameter.
Additionally, $E(y_{ij} \mid \theta_{ij}) = d'(\theta_{ij})$ and
$Var(y_{ij} \mid \theta_{ij}, \phi_{ij}) = d''(\theta_{ij})\phi_{ij}$. For example, under the
Poisson, $d(u) = \exp(u)$, and for binomial data, $d(u) = \log(1 + e^u)$. Taking the regression
terms as $\eta_{ij} = g(\theta_{ij})$ where $g$ is a link function, the observation level model
(including a level 2 regression on cluster attributes) is

\[ \eta_{ij} = x_{ij}\beta + z_{ij}b_i, \]

\[ b_i = \kappa W_i + e_i, \]

where $\beta$ and $b_i$ are of dimension $p$ and $q$ respectively.
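The mean and variance identities for the exponential family can be verified numerically by differentiating the function d. The base R sketch below uses central finite differences for the Poisson and Bernoulli cases with unit scale parameter; the helper names num_deriv and num_deriv2 and the value of theta are arbitrary choices for the illustration.

```r
num_deriv  <- function(f, u, h = 1e-4) (f(u + h) - f(u - h)) / (2 * h)
num_deriv2 <- function(f, u, h = 1e-4) (f(u + h) - 2 * f(u) + f(u - h)) / h^2

theta  <- 0.7                                 # arbitrary canonical parameter value
d_pois <- function(u) exp(u)                  # Poisson: mean and variance exp(theta)
d_bin  <- function(u) log(1 + exp(u))         # Bernoulli: mean p, variance p(1 - p)

c(mean = num_deriv(d_pois, theta), var = num_deriv2(d_pois, theta), exact = exp(theta))
p <- plogis(theta)
c(mean = num_deriv(d_bin, theta), exact_mean = p,
  var = num_deriv2(d_bin, theta), exact_var = p * (1 - p))
```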


It is also common to include a residual term $u_i = (u_{i1}, \ldots, u_{in_i})$ to account for
overdispersion, so that

\[ \eta_{ij} = x_{ij}\beta + z_{ij}b_i + u_{ij}. \tag{8.4} \]

Assume priors $\beta \sim N_p(a, R)$, $b_i \sim N_q(0, \Sigma_b)$ and $u_{ij} \sim N_r(0, \Sigma_u)$,
with inverse Wishart priors $\Sigma_b \sim IW(\nu_b, S_b)$ and $\Sigma_u \sim IW(\nu_u, S_u)$. Then
the full posterior conditional for each $b_i$ vector is

\[ p(b_i \mid b_{[i]}, \beta, u, \Sigma_b, \Sigma_u) \propto \exp\left( -0.5\, b_i'\Sigma_b^{-1}b_i + \sum_{j=1}^{n_i} \frac{y_{ij}\theta_{ij} - d(\theta_{ij})}{\phi_{ij}} \right), \]

while the full conditional for each $u_{ij}$ vector is

\[ p(u_{ij} \mid u_{[ij]}, \beta, b, \Sigma_b, \Sigma_u) \propto \exp\left( -0.5\, u_{ij}'\Sigma_u^{-1}u_{ij} + \frac{y_{ij}\theta_{ij} - d(\theta_{ij})}{\phi_{ij}} \right). \]

Additionally, the covariance matrices have inverse Wishart full conditionals, namely

\[ \Sigma_b \sim IW\left( \nu_b + m,\; S_b + \sum_{i=1}^{m} b_ib_i' \right), \]

\[ \Sigma_u \sim IW\left( \nu_u + \sum_{i=1}^{m} n_i,\; S_u + \sum_{i,j} u_{ij}u_{ij}' \right). \]

The GLMM approach includes multilevel multinomial observations in a choice setting
(e.g. brand, political party),

\[ (d_{ij1}, \ldots, d_{ijK}) \sim \text{Mult}(1, [\pi_{ij1}, \ldots, \pi_{ijK}]), \]

with probability $\pi_{ijk}$ that option $k$ is chosen by subject $j$ in cluster $i$, namely that
$y_{ij} = k$ (or $d_{ijk} = 1$) where options are unordered. A particular choice
($k \in 1, \ldots, K$) made by subject $j$ in cluster $i$ results from comparing the latent
utilities of all options $(\eta_{ij1}, \ldots, \eta_{ijK})$, with

\[ \pi_{ijk} = \Pr(y_{ij} = k) = \Pr(\eta_{ijk} > \eta_{ijm}), \quad m \ne k, \]

where the $\eta_{ijk}$ include systematic effects and random errors $\varepsilon_{ijk}$. Suppose
the errors follow a Gumbel (extreme value type I) density, namely
$P(\varepsilon) = \exp(-\varepsilon - \exp(-\varepsilon))$; then, since differences between Gumbel
errors follow a standard logistic distribution, the choice probabilities reduce to the
multinomial logit (Hedeker, 2003, p.1439).
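The link between Gumbel errors and multinomial logit probabilities can be checked by simulation. In the base R sketch below, the systematic utilities v for three options are hypothetical, random effects are omitted, and rgumbel is a helper defined here (base R has no Gumbel generator) that inverts the Gumbel distribution function.

```r
set.seed(7)
n <- 200000
v <- c(0.5, -0.2, 0)                              # hypothetical systematic utilities, K = 3
rgumbel <- function(n) -log(-log(runif(n)))       # inverse-CDF draw, CDF exp(-exp(-e))
util <- sapply(v, function(vk) vk + rgumbel(n))   # n x K matrix of latent utilities
choice <- max.col(util, ties.method = "first")    # option with the highest utility
sim_prob   <- as.vector(table(factor(choice, levels = 1:3))) / n
logit_prob <- exp(v) / sum(exp(v))                # multinomial logit probabilities
round(rbind(sim_prob, logit_prob), 3)
```

The simulated choice frequencies should agree with the multinomial logit probabilities up to Monte Carlo error.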
Predictors in the systematic term may be defined at option-subject, or at option level, but
consider subject level predictors $x_{ij}$ and $z_{ij}$ (e.g. voter age) of respective dimensions
$p$ and $q$, that may vary according to cluster $i$. Then with the final category as a reference,
fixed effect parameters and random effects are specific to choices $k$, with $K - 1$ sets of
random effects $b_{ik}$ each of dimension $q$,

\[ \Pr(y_{ij} = k) = \frac{\exp(\alpha_k + x_{ij}\beta_k + z_{ij}b_{ik})}{1 + \sum_{h=1}^{K-1}\exp(\alpha_h + x_{ij}\beta_h + z_{ij}b_{ih})}, \quad k = 1, \ldots, K - 1, \]

\[ \Pr(y_{ij} = K) = \frac{1}{1 + \sum_{h=1}^{K-1}\exp(\alpha_h + x_{ij}\beta_h + z_{ij}b_{ih})}. \]

The $b_i = (b_{i1}, \ldots, b_{i,K-1})$ are multivariate zero mean effects, typically assumed
multivariate normal.

8.3.1 Augmented Data Multilevel Models


Another option for multilevel binary and multinomial responses is to introduce augmented
metric data $y_{ij}^*$ with sampling constrained according to the observed $y_{ij}$, and apply
the linear mixed model to $y_{ij}^*$. The data augmentation density depends on the assumed
link. Thus, a logit link for two-level binary data implies truncated standard logistic
sampling to generate the augmented data, namely

\[ y_{ij}^* \sim \text{Logistic}(\eta_{ij}, 1)\, I(A_{ij}, B_{ij}), \]

where $A_{ij} = -\infty$ or 0, and $B_{ij} = 0$ or $\infty$, according as $y_{ij} = 0$ or 1. As
mentioned in Chapter 7, data augmentation leads to simpler MCMC sampling and improved residual
tests. In a multilevel setting, it may further assist in assessing variance partitioning.
Consider a regression with a random level 2 intercept

\[ \eta_{ij} = x_{ij}\beta + b_i, \]

where $b_i \sim N(0, \sigma_b^2)$. Since the variance of the standard logistic is $\pi^2/3$, the
intraclass correlation at level 2 may be obtained as $\sigma_b^2/(\sigma_b^2 + \pi^2/3)$, and
monitored over MCMC iterations. Moreover, if the composite fixed effect term $x_{ij}\beta$ is
monitored and its posterior variance $\sigma_F^2$ obtained, one may obtain a proportion of
variance explained by covariates as $\sigma_F^2/[\sigma_F^2 + \bar\sigma_b^2 + \pi^2/3]$, where
$\bar\sigma_b^2$ is the posterior mean of $\sigma_b^2$.
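Monitoring the intraclass correlation over iterations amounts to transforming each sampled value of the level 2 standard deviation. The base R sketch below shows the calculation; the vector sigma_b is a synthetic stand-in for MCMC output, and the value s2_F for the variance of the fixed effect term is likewise assumed rather than taken from a fitted model.

```r
set.seed(8)
S <- 2000
# Stand-in for MCMC output (hypothetical): S retained draws of sigma_b
sigma_b <- abs(rnorm(S, 0.6, 0.08))
lvl1 <- pi^2 / 3                              # variance of the standard logistic
icc  <- sigma_b^2 / (sigma_b^2 + lvl1)        # intraclass correlation at each iteration
c(mean = mean(icc), quantile(icc, c(0.025, 0.975)))

s2_F <- 0.5                                   # assumed variance of the fixed effect term
s2_F / (s2_F + mean(sigma_b^2) + lvl1)        # proportion of variance due to covariates
```

Summarising the transformed draws gives a posterior mean and credible interval for the intraclass correlation directly, without recourse to delta-method approximations.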

For multilevel ordinal outcomes with $K$ levels, the observations

\[ (d_{ij1}, \ldots, d_{ijK}) \sim \text{Mult}(1, [\pi_{ij1}, \ldots, \pi_{ijK}]), \]

provide information about an underlying metric variable $y_{ij}^*$ defined by cutpoints such that

\[ y_{ij} = k \quad (\text{i.e. } d_{ijk} = 1) \]

if

\[ \kappa_{k-1} < y_{ij}^* \le \kappa_k, \]

for $k = 1, \ldots, K$, where $\kappa_0 = -\infty$ and $\kappa_K = \infty$. The corresponding
augmented data regression is

\[ y_{ij}^* = x_{ij}\beta + z_{ij}b_i + \varepsilon_{ij}, \]

where $\varepsilon$ is normal or logistic. If $x_{ij}$ excludes an intercept, there are $K - 1$
unknown cutpoints $(\kappa_1, \ldots, \kappa_{K-1})$, with $y_{ij} = 1$ if $y_{ij}^* \le \kappa_1$,
$y_{ij} = 2$ if $\kappa_1 < y_{ij}^* \le \kappa_2$, etc., and $y_{ij} = K$ if
$y_{ij}^* > \kappa_{K-1}$.
A standard logistic density for $\varepsilon_{ij}$ with mean 0, variance $\pi^2/3$, and
distribution function $F(\varepsilon) = \exp(\varepsilon)/(1 + \exp(\varepsilon))$ leads to a logit
link for the cumulative probabilities

\[ \gamma_{ijk} = \sum_{m=1}^{k} \pi_{ijm} = \Pr(y_{ij}^* \le \kappa_k) = \Pr(y_{ij} \le k), \quad k = 1, \ldots, K - 1, \]

with $\pi_{ijK} = 1 - \sum_{m=1}^{K-1} \pi_{ijm}$. Taking $\varepsilon_{ij} \sim N(0, 1)$
corresponds to a probit link for $\gamma_{ijk}$. For $\varepsilon$ logistic, the hierarchical
regression is expressed as follows

\[ \gamma_{ijk} = \Pr(x_{ij}\beta + z_{ij}b_i + \varepsilon_{ij} \le \kappa_k) \]

\[ = \Pr(\varepsilon_{ij} \le \kappa_k - x_{ij}\beta - z_{ij}b_i) \]

\[ = \frac{\exp(\kappa_k - x_{ij}\beta - z_{ij}b_i)}{1 + \exp(\kappa_k - x_{ij}\beta - z_{ij}b_i)}, \]

that is,

\[ \text{logit}(\gamma_{ijk}) = \kappa_k - x_{ij}\beta - z_{ij}b_i. \]
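Category probabilities follow from the cumulative logit form by differencing the cumulative probabilities. The base R sketch below does this for a single subject; the function name cum_logit_probs, the cutpoints and the two linear predictor values are hypothetical, and eta stands for the combined fixed and random part of the regression.

```r
cum_logit_probs <- function(kappa, eta) {
  # kappa: K - 1 increasing cutpoints; eta: linear predictor excluding an intercept
  gam <- plogis(kappa - eta)          # Pr(y <= k), k = 1, ..., K - 1
  diff(c(0, gam, 1))                  # category probabilities pi_1, ..., pi_K
}
kappa <- c(-1, 0.2, 1.5)              # hypothetical cutpoints for K = 4
p0 <- cum_logit_probs(kappa, eta = 0)
p1 <- cum_logit_probs(kappa, eta = 1) # a larger eta shifts mass to higher categories
round(rbind(p0, p1), 3)
```

Because the linear predictor enters with a negative sign, positive regression coefficients correspond to a higher chance of responses in the upper categories.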

8.3.2 Conjugate Cluster Effects


An alternative to generalised linear mixed models for count and binomial data is provided
by conjugate random effects at different levels. Daniels and Gatsonis (1999) consider
hierarchical conjugate priors for one-parameter exponential family densities (e.g.
Poisson, binomial). For binomial data $y_{ij} \sim \text{Bin}(T_{ij}, p_{ij})$, the $p_{ij}$ are
taken to be beta distributed with means $\pi_{ij}$, and cluster-specific scale parameters
$\delta_i$. With a logit link to predictors, and level 2 regression involving cluster level
predictors $W_i$, one has

\[ p_{ij} \sim \text{Beta}(\pi_{ij}\delta_i, (1 - \pi_{ij})\delta_i), \]

\[ \text{logit}(\pi_{ij}) = x_{ij}\beta + z_{ij}b_i, \]

\[ b_i = \kappa W_i + e_i. \]

To provide robustness (e.g. to outlier clusters), the $e_i$ may be taken as Student t distributed
(see Section 8.5). The prior on $\delta_i$ has the form

\[ p(\delta_i) \propto \frac{\delta_i}{(h_i + \delta_i)^2}, \]

where $h_i = \min_{j \in 1, \ldots, n_i}(T_{ij})$. For Poisson data, one has
$y_{ij} \sim \text{Po}(o_{ij}\theta_{ij})$, where $o_{ij}$ is an offset for the expected response,
and $\theta_{ij} \sim Ga(\mu_{ij}\delta_i, \delta_i)$. The regression model then involves a log
link for the $\mu_{ij}$,

\[ \log(\mu_{ij}) = x_{ij}\beta + z_{ij}b_i. \]
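Under the beta-binomial structure above, the conditional posterior for each binomial probability is again a beta density, and its mean is a weighted average of the observed proportion and the prior mean, with the scale parameter acting as a prior sample size. The base R sketch below illustrates this standard conjugate update for one observation; the data and the values of pi_ij and delta_i are hypothetical.

```r
T_ij <- 20; y_ij <- 14                # hypothetical trials and successes
pi_ij <- 0.5; delta_i <- 10           # assumed prior mean and scale
post_a <- pi_ij * delta_i + y_ij                   # updated first beta parameter
post_b <- (1 - pi_ij) * delta_i + T_ij - y_ij      # updated second beta parameter
post_mean <- post_a / (post_a + post_b)
w <- T_ij / (T_ij + delta_i)          # weight on the observed proportion
c(post_mean = post_mean, shrunk = w * y_ij / T_ij + (1 - w) * pi_ij)
```

Larger values of the scale parameter imply stronger shrinkage of the observed proportion towards the regression mean.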

More specialised models apply for particular data structures. For example, Van Duijn and
Jansen (1995) suggest a model for repeated counts (e.g. tests $j = 1, \ldots, n_i$ within students
$i = 1, \ldots, m$) with Poisson means

\[ \mu_{ij} = \nu_i\delta_{ij}, \]

and gamma distributed student ability effects $\nu_i \sim Ga(a_1, a_2)$, where $a_1$ and $a_2$ are
additional parameters, and the $\delta_{ij}$ represent subject specific difficulty parameters for
tests $j$, with identifiability constraint $\sum_j \delta_{ij} = 1$, and prior

\[ (\delta_{i1}, \ldots, \delta_{iJ}) \sim \text{Dir}(\xi_1, \ldots, \xi_J), \]

where the $\xi_j$ are also unknowns. If the subjects fall into known (or possibly unknown)
groups $k = 1, \ldots, K$ with allocation indicators $S_i \in (1, \ldots, K)$, then a more general
model specifies $(\nu_i \mid S_i = k) \sim Ga(a_{1k}, a_{2k})$.
A conjugate structure for stratified area health counts is considered by Dean and
MacNab (2001). Thus for micro areas $j = 1, \ldots, n_i$ nested within larger areas
$i = 1, \ldots, m$, let $\mu$ be an average event rate across all $m$ areas, and $T_{ij}$ be
populations at risk. Assume first cluster level overdispersion represented by effects $\rho_i$,
so that $y_{ij} \sim \text{Po}(\mu T_{ij}\rho_i)$, where $\rho_i$ have mean 1, and let the mean and
variance of $y_{i+}$ be $T_{i+}\mu$ and $T_{i+}\mu(1 + \sigma_\rho^2)$. Under gamma mixing

\[ \rho_i \sim Ga\left( \frac{T_{i+}\mu}{\sigma_\rho^2}, \frac{T_{i+}\mu}{\sigma_\rho^2} \right), \]

with variance $\sigma_\rho^2/(T_{i+}\mu)$. The interpretation is that $\rho_i$ represents the
average relative risk over the $T_{i+}$ individuals in area $i$.
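The stated mean and variance of the area totals follow from the gamma mixing by the laws of total expectation and total variance. The short derivation below is a worked check under the assumptions already given in the text (Poisson counts that are independent given the area effect, with the area effect having mean 1 and the gamma variance shown above); the symbols for the area total of counts and of populations at risk denote sums over micro areas within area i.

```latex
\[ y_{i+} \mid \rho_i \sim \text{Po}(\mu T_{i+}\rho_i), \qquad
   E(\rho_i) = 1, \qquad Var(\rho_i) = \frac{\sigma_\rho^2}{T_{i+}\mu}, \]
\[ E(y_{i+}) = E[E(y_{i+} \mid \rho_i)] = \mu T_{i+}E(\rho_i) = T_{i+}\mu, \]
\[ Var(y_{i+}) = E[Var(y_{i+} \mid \rho_i)] + Var[E(y_{i+} \mid \rho_i)]
   = T_{i+}\mu + (T_{i+}\mu)^2\frac{\sigma_\rho^2}{T_{i+}\mu}
   = T_{i+}\mu(1 + \sigma_\rho^2). \]
```

The overdispersion factor relative to the Poisson variance is therefore the same in every area, whatever its population at risk.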

Example 8.2 Well-Being and Hours Worked


This example illustrates multilevel binary and augmented data methods applied to data
from the study of Bliese and Halverson (1996), which are included in the R package mul-
tilevel (Bliese, 2016a). There are N = 7382 subjects nested in m = 99 US Army companies,
with the study investigating group level influences on reported well-being. This is an
example of a macro-micro multilevel situation, where a response measured at a lower
level is predicted by variables measured both at that level and a higher level (Croon and
van Veldhoven, 2007).
A binary response analysis sets $y_{ij} = 1$ for well-being score above 3.5, and $y_{ij} = 0$
otherwise. Then Bliese (2016b) specifies random intercept variation in well-being at group
level, together with group level and subject level predictor effects, namely those of group
average hours worked and of subject level hours worked respectively.
This model is investigated using a Bernoulli likelihood and logit regression in brms
and rstan, and an augmented data logistic regression using R2OpenBUGS. Under the
latter approach, a logit link involves the sampling mechanism

\[ y_{ij}^* \sim \text{Logistic}(\eta_{ij}, 1)\, I(A_{ij}, B_{ij}), \]

where $A_{ij} = -\infty$ or 0 (and $B_{ij} = 0$ or $\infty$) according as $y_{ij} = 0$ or 1. The
equivalent code is

    for (h in 1:N) {
      # latent z[h] is truncated to (-100,0) if wb[h]=0 and to (0,100) if wb[h]=1
      z[h] ~ dlogis(eta[h],1) %_% I(A[h],B[h])
      A[h] <- -100*equals(wb[h],0)
      B[h] <- 100*equals(wb[h],1)
    }

where wb[h] denotes the observed binary outcome. The R2OpenBUGS analysis centres
the group intercepts around the impact of group average hours worked.
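The truncated logistic sampling step can also be written directly in R by inverting the logistic distribution function over the permitted interval. The sketch below is a base R illustration rather than a translation of the BUGS code: the function name rtrunc_logis and the simulated inputs are assumptions, and the truncation uses the exact limits (minus infinity, 0, infinity) rather than the bounds of -100 and 100 used in the code above.

```r
rtrunc_logis <- function(eta, y) {
  # draw from Logistic(eta, 1) restricted to below 0 when y = 0, above 0 when y = 1
  p0 <- plogis(0, location = eta)               # Pr(latent value <= 0)
  u  <- runif(length(eta))
  v  <- ifelse(y == 1, p0 + u * (1 - p0), u * p0)
  qlogis(v, location = eta)                     # inverse-CDF transform
}
set.seed(9)
eta <- rnorm(1000)
y   <- rbinom(1000, 1, plogis(eta))
z   <- rtrunc_logis(eta, y)
all((z > 0) == (y == 1))                        # sign of z matches the observed outcome
```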
Both methods of estimation report a stronger impact of group average hours worked
than individual hours worked on individual well-being, with respective posterior means
(sd) of −0.27 (0.05) and −0.10 (0.02) (respectively beta[2] and beta[1] in the R2OpenBUGS
code).
Subject level mixed predictive checks (Marshall and Spiegelhalter, 2007) are based
on sampling replicate cluster intercepts, and these predictive checks are aggregated
to group level (testmx.grp in the R2OpenBUGS code). These show well-being in some
groups to be much better explained than in others, with average predictive success
varying from 0.50 (group 68) to 0.84 (group 57). Both the brms logistic regression and
the augmented data logistic regression show around 22% of subjects with predictive
concordance below 0.50 (the model does not improve on guesswork for such subjects).

Example 8.3 Ordinal Three Level Model


Vermunt (2013) and others have considered data from the Television, School and Family
Smoking Prevention and Cessation Project (Flay et al., 1987). Schools included in the
project were randomised to one of four categories, defined by the presence or absence
of a TV intervention (TV), and by the presence or absence of a social-resistance class-
room curriculum (CC). One outcome measure was a tobacco and health knowledge
(THK) score, here represented using H = 4 ordinal categories, with predictors being a
pre-intervention THK score, the binary intervention variables TV and CC, and a CC by
TV interaction.
The analysis is three level, with schools $i = 1, \ldots, m$ at level 3 (with $m = 28$), and
classrooms $j$ within schools at level 2. With responses $y_{ijk} \in (1, \ldots, H)$, where
$H = 4$, for subjects $k = 1, \ldots, n_{ij}$ in each class, consider a logit model for the
cumulative probabilities

\[ \gamma_{ijkh} = \Pr(y_{ijk} \le h), \]

with two sets of higher level errors, namely level 3 random errors $u_{3i}$ with variance
$\sigma_3^2$, and level 2 class errors $u_{2ij}$ with variance $\sigma_2^2$ (pertaining to effects
of classrooms within schools). Then

\[ \text{logit}(\gamma_{ijkh}) = \kappa_h - \mu_{ijk}, \quad h = 1, \ldots, H - 1, \]

\[ \mu_{ijk} = x_{ijk}\beta + u_{3i} + u_{2ij}, \]

with N(0,1000) priors on fixed effects and U(0,1000) priors on the random effect standard
deviations. Predictors are centred.
A two-chain run of 2,500 iterations with 500 burn-in gives posterior means for the
cutpoints $(\kappa_1, \kappa_2, \kappa_3)$ of -1.39, -0.11, and 1.10, with a significant
coefficient of 0.83 on the curriculum intervention and significant influence also of
pre-intervention score, but no significant effects for TV or the interaction term. The posterior
means for $\sigma_2$ and $\sigma_3$ are 0.41 and 0.30 (for classrooms and schools respectively)
with densities bounded away from zero. Similar estimates are obtained using brms and rstan.
By contrast, a maximum likelihood analysis using numerical quadrature reported by
Rabe-Hesketh et al. (2004) finds an insignificant school variance, and Vermunt (2013)
also finds a model with class effects only to be the best fitting.
The analysis was also carried out using the latent data approach, which may be use-
ful for obtaining intraclass correlations or for model checking. This produces larger
estimates of $\sigma_2$ and $\sigma_3$, namely 0.59 and 0.36, but similar fixed predictor and cutpoint
estimates. The worst fit (using pointwise WAIC) is for subjects 952 and 190, who have
respectively high (low) THK scores, despite low (high) pre-intervention THK scores and
absence (presence) of the curriculum intervention.

8.4 Crossed and Multiple Membership Random Effects


Crossed random effects at level 2 and above occur when classifications in a model are not
completely nested. For a two-level example, let $i$ denote the main level 2 nesting
classification, and $h_{ij}$ denote a crossed nesting. Raudenbush (1993), Browne et al. (2001,
Section 3.2) and Snijders and Bosker (1999) mention educational examples, namely pupils
classified by primary school $i$, and by secondary school $h_{ij}$, or pupils classified by school
$i$ and neighbourhood $h_{ij}$. In the latter situation, a school can draw pupils from multiple
neighbourhoods, and residents in a neighbourhood can choose between multiple schools for their
children. By extension, if pupils are classified by primary school, secondary school and
neighbourhood, then two crossed nestings (denoted say as $h_{1ij}$ and $h_{2ij}$) will be involved.
An important issue is the relationship between the crossed classifications since they may
well not be independent, and introducing extra crossed factors will typically reduce the
variance explained by the main level 2 nesting.
A straightforward extension of the normal linear mixed model to account for a single
extra crossed factor is to add varying intercepts and slopes according to that extra factor.
Assuming $h_{ij}$ varies between 1 and $H$, then adapting the (8.1) format for $z_{ij}$ of
dimension $q$,

\[ y_{ij} = x_{ij}\beta + z_{ij}(b_i + c_{h_{ij}}) + u_{ij}, \]

where

\[ (b_{1i}, \ldots, b_{qi}) \sim N_q(0, \Sigma_b), \quad i = 1, \ldots, m, \]

\[ (c_{h1}, \ldots, c_{hq}) \sim N_q(0, \Sigma_c), \quad h = 1, \ldots, H. \]

Alternatively, variation over the extra crossed factor may be applied to a different
predictor than those subject to random variation over the main level 2 classification. Often the
additional random effects would be confined to intercept variation over the extra crossed
factor, so that with $q = 1$ and $z_{ij} = 1$ also, one has

\[ y_{ij} = x_{ij}\beta + b_i + c_{h_{ij}} + u_{ij}. \]

In these situations, the random effects are confounded and empirical identification may
be impeded. Selection between random effects may well be needed (Browne et al., 2001).
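The structure of a crossed random intercepts model can be made concrete by simulating pupils classified by both school and neighbourhood. The base R sketch below uses hypothetical numbers of units, variances and regression coefficients, and assigns the two classifications independently, which is an assumption that real data may well not satisfy.

```r
set.seed(10)
N <- 600; m <- 10; H <- 8
school <- sample(m, N, replace = TRUE)       # main level 2 classification i
nbhd   <- sample(H, N, replace = TRUE)       # crossed classification h_ij
b  <- rnorm(m, 0, 1.0)                       # school random intercepts
cc <- rnorm(H, 0, 0.7)                       # neighbourhood random intercepts
x  <- rnorm(N)
y  <- 1 + 0.5 * x + b[school] + cc[nbhd] + rnorm(N)

tab <- table(school, nbhd)
# crossed rather than nested: each neighbourhood sends pupils to several schools
colSums(tab > 0)
```

Under strict nesting, each column of the cross-tabulation would have a single non-zero entry; here most cells are occupied, so both sets of intercepts contribute to each school's and each neighbourhood's outcomes.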
Another possible source of variation in crossed models is defined in cells formed by
cross-classification of two or more higher level factors. For example, N patients living in
a particular administrative health district may be classified into subpopulations s based
on intersections of their primary care general practitioner i1 = 1, … , m1 , and small area of
residence i2 = 1, … , m2 (Congdon and Best, 2000). Often there may be no subjects in certain
combinations of higher level factors. So define total non-empty cells as Sn, equal to or less
than the total S = m1m2 of all possible combinations, with different values s = 1,… Sn defined
by cross-hatched factor identifiers [i1,i2]. Let r = 1, … , N denote a single string subject level
identifier. Subjects will be classified by subpopulation sr ∈{1, … Sn } , by higher level factor 1
classification indicator h1r (general practitioner), higher level factor 2 classification indicator
h2r (small area of residence), and so on. Random intercept variation in a metric response
over the two factors and the cells then takes the form

$$y_r = x_r\beta + z_r(\alpha_{1,h_{1r}} + \alpha_{2,h_{2r}} + \eta_{s_r}) + u_r, \qquad (8.5)$$

where $\alpha_{1i_1}$, $\alpha_{2i_2}$, $\eta_s$ and $u_r$ are random effects.


Multiple membership schemes define a generic weighting scheme applicable to cross-
classified data (Browne et al., 2001), and may be illustrated by the case where subjects at
level 1 may belong to more than one level 2 unit. Supposing a pupil’s entire primary school
career is of interest, there may then be moves between schools. Multiple affiliations then
need to be taken account of in terms of school impacts on attainment. Another exam-
ple is analysis of neighbourhood health effects to take account of changes in residence
(Subramanian, 2004). Suppose there are m level 2 units (clusters such as schools) that are
included in the analysis, and that subjects j = 1, … , J (not taken to be nested within schools)
have $K_j$ level 2 affiliations with weights $\{w_{j1}, w_{j2}, \ldots, w_{jK_j}\}$ where $\sum_{k=1}^{K_j} w_{jk} = 1$. The weights
would in many situations be taken as known (e.g. based on the number of terms spent by
a pupil in different schools). Then for pupil level predictors zj of dimension q not varying
over affiliations, the normal linear mixed model becomes

$$y_{ij} = x_{ij}\beta + z_{ij}\sum_{k=1}^{K_j} w_{jk} b_k + u_{ij},$$

where $(b_{1i}, \ldots, b_{qi}) \sim N_q(0, \Sigma_b)$, $i = 1, \ldots, m$. If the pupil predictors vary over affiliations, then

$$y_{ij} = x_{ij}\beta + \sum_{k=1}^{K_j} w_{jk} z_{jk} b_k + u_{ij}.$$
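
As an illustration of known multiple membership weights, the sketch below converts the number of terms a pupil spent in each school into weights summing to one, and forms the weighted school contribution for the random intercept case. The function names, the pupil record and the school effects are hypothetical.

```python
def membership_weights(terms_by_school):
    """Convert terms spent in each school into weights w_jk that sum to 1."""
    total = sum(terms_by_school.values())
    if total <= 0:
        raise ValueError("a pupil needs a positive total exposure")
    return {school: t / total for school, t in terms_by_school.items()}


def weighted_school_effect(weights, school_effects):
    """Return sum_k w_jk * b_k for one pupil (random intercept case)."""
    return sum(w * school_effects[school] for school, w in weights.items())


# Hypothetical pupil: 4 terms in school A, 2 terms in school B.
terms = {"A": 4, "B": 2}
# Hypothetical school effects b_k; school C was never attended.
effects = {"A": 0.3, "B": -0.6, "C": 0.1}

w = membership_weights(terms)
contribution = weighted_school_effect(w, effects)
```

Schools the pupil never attended receive no weight, so only the affiliations actually experienced enter the linear predictor.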

Multiple member schemes extend to data frames which are structured spatially or tem-
porally rather than nested. A particular kind of multiple member prior can be applied
to spatially configured count responses yi subject to random intercept variation. Thus let
$y_i \sim \mathrm{Po}(o_i\mu_i)$ where $o_i$ are expected events, and where the μi measure the Poisson intensity
relative to expected levels (in spatial health applications the μi are termed relative risks).
Then the impact of Ki neighbouring areas can be represented by random effects bk while
own area effects are represented by effects ui in a model

$$\log(\mu_i) = x_i\beta + \sum_{k=1}^{K_i} w_{ik} b_k + u_i,$$

where the $w_{ik}$ are row standardised with $\sum_{k=1}^{K_i} w_{ik} = 1$, obtained from spatial interactions
$C = [c_{ik}]$. These might be based on binary spatial interactions $c_{ik}$ ($c_{ik} = 1$ if areas i and k are
contiguous, $c_{ik} = 0$ otherwise), or based on distances $d_{ik}$ between area centres, such as
$c_{ik} = \exp(-\eta d_{ik})$ where η is positive; then $w_{ik} = c_{ik} / \sum_{k=1}^{K_i} c_{ik}$.
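
Row standardisation of the interaction matrix can be sketched as follows, for both the binary contiguity and the distance decay cases. The three-area matrices, the value of η, and the convention of setting the diagonal to zero (so that an area is not its own neighbour) are assumptions made for illustration.

```python
import math


def row_standardise(c):
    """Divide each row of the interaction matrix c by its row sum."""
    w = []
    for row in c:
        total = sum(row)
        if total == 0:
            w.append([0.0] * len(row))      # isolated area: no neighbours
        else:
            w.append([v / total for v in row])
    return w


def distance_interactions(d, eta):
    """c_ik = exp(-eta * d_ik) for i != k; diagonal set to 0 by assumption."""
    n = len(d)
    return [[0.0 if i == k else math.exp(-eta * d[i][k]) for k in range(n)]
            for i in range(n)]


# Hypothetical map of three areas in a line: 1-2 and 2-3 are contiguous.
contiguity = [[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]]
w_binary = row_standardise(contiguity)

# Hypothetical distances between area centres, with eta = 0.5.
distances = [[0, 1, 2],
             [1, 0, 1],
             [2, 1, 0]]
w_distance = row_standardise(distance_interactions(distances, eta=0.5))
```

Under the binary scheme the middle area gives weight 0.5 to each neighbour, whereas under distance decay every other area receives some weight, declining with distance.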

Example 8.4 Neighbourhood Effects on Educational Attainment


Raudenbush and Bryk (2002) consider attainment data from a Scottish education
authority with R = 2310 pupils classified by crossed factors (neighbourhood and sec-
ondary school). Some neighbourhoods send pupils to multiple schools, while schools
generally draw pupils from several neighbourhoods. They consider effects of a “place
variable,” namely, neighbourhood social deprivation, on educational attainment, after
controlling for the impacts of pupil attributes (pupil aptitude and family background).
The data involves m1 = 524 neighbourhoods, and m2 = 17 schools, with child-specific pre-
dictors being gender (1 = M, 0 = F), and a verbal reasoning quotient (VRQ) and a reading
test score (RTS), both obtained when the child was at primary school. Parent-specific
predictors are father’s status, FSTAT, and three binary indicators (whether father edu-
cated beyond age 15, whether the mother educated beyond 15, and whether father
unemployed).
Two models are considered. In the first, there are random effects for both neighbour-
hoods and schools, with varying neighbourhood effects linked to social deprivation.
Let h1r and h2r denote neighbourhood and school indices for pupils r = 1,… , R in the
model

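$$y_r = x_r\beta + \alpha_{1,h_{1r}} + \alpha_{2,h_{2r}} + u_r,$$

$$\alpha_{1i_1} \sim N(\gamma_2 \mathrm{Dep}_{i_1}, \sigma_1^2), \quad i_1 = 1, \ldots, m_1,$$

$$\alpha_{2i_2} \sim N(\gamma_1, \sigma_2^2), \quad i_2 = 1, \ldots, m_2,$$

$$u_r \sim N(0, \sigma_3^2),$$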

with xr excluding a constant term, and all predictors centred. Centring the random
effects around the intercept γ1, and neighbourhood deprivation effect γ2 improves
convergence.
In the R2OpenBUGS analysis, gamma priors are adopted on neighbourhood, school,
and pupil random effect precisions. The model is also estimated using the brms library
in R, but with half-t priors on standard deviation parameters.
Model 1 results from R2OpenBUGS show residual pupil variation (σ3 has
posterior mean 0.67) as more substantial than either neighbourhood or school variation
(σ1 and σ2 have means 0.09 and 0.08). A negative deprivation impact γ2 on attainment
(with mean −0.156 and 95% interval from −0.202 to −0.106) operates via (is mediated by)
the neighbourhood effects.
A second model allows the deprivation effect to vary by school, expressing potentially
varying effectiveness of schools h2r in countering catchment area effects (also
known as contextual value-added effects). There are now four random variances, with

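$$y_r = x_r\beta + \alpha_{1,h_{1r}} + \alpha_{2,h_{2r}} + \delta_{h_{2r}} \mathrm{Dep}_{h_{1r}} + u_r,$$

$$\alpha_{1i_1} \sim N(0, \sigma_1^2),$$

$$\alpha_{2i_2} \sim N(\gamma_1, \sigma_2^2),$$

$$\delta_{i_2} \sim N(\gamma_2, \sigma_3^2),$$

$$u_r \sim N(0, \sigma_4^2).$$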

In R2OpenBUGS, convergence of sampling the first three random effect components
was improved by partitioning, using a Dirichlet density applied to a total precision
parameter. This model is also estimated by brms.
In fact, this model has a slightly worse WAIC than the first model, increasing from
4792 to 4797; and the performance with regard to the mixed predictive check (cvtail in
the code) is also not improved. The school deprivation effects $\delta_{i_2}$ on attainment from
R2OpenBUGS (delta[] in the code) have a mean γ2 of −0.17, and vary from −0.22 for
school 11 to −0.10 for school 9. Their standard deviation σ3 is 0.08. The brms estimates
show less variability between schools in the deprivation effect.

8.5 Robust Multilevel Models


Under normality assumptions regarding errors at different levels, extreme data points can
influence estimates of fixed effect and variance component parameters, and reduce the
precision of estimates (i.e. widen credible intervals). Sensitivity of the level
2 fixed effect estimates κ to alternative assumptions regarding the prior on level 2 effects
bi in (8.3) is the focus of Seltzer (1993). Estimates of level 2 random cluster effects may also
be sensitive to normality assumptions (Seltzer et al., 1996, p.137). Multilevel logistic regres-
sion in particular may be sensitive to multicollinearity and small cluster sizes (Shieh and
Fouladi, 2003; Moineddin et al., 2007).
For a two-level model, outliers may occur both in level 2 cluster effects and in level 1
within-cluster errors (Pinheiro et al., 2001; Langford and Lewis, 1998), and the two sources
may be confounded. For example, a discordant school effect might be due to a system-
atic effect across all pupils, or because a few pupils in the school are responsible for the
discrepancy. Seltzer et al. (2002) investigate how level-1 outliers affect estimation of fixed
effect regression parameters and inferences regarding level 2 cluster effects (e.g. treatment
contrasts for individual clusters) in two-level models for continuous outcomes.
A robust alternative to the normal linear mixed model based on the multivariate t den-
sity is proposed by Pinheiro et al. (2001), and shown to outperform normality assumptions
when outliers are present in multilevel data. Daniels and Gatsonis (1999, p.31) assume mul-
tivariate t random effects at level 2 by default in a generalised linear mixed model, while
Seltzer et al. (1996) present Gibbs sampling steps for the linear mixed model case where a
multivariate t with a single degrees of freedom parameter is assumed for level 2 random
effects. Seltzer et al. (2002) adopt Student t priors at both levels and apply a U(0, 1) prior to
sample from a discrete grid of values on the degrees of freedom parameter. Thus for an
equally spaced grid of potential values {2.1, 2.2, 2.3, …, 49.9} with equal prior probabilities,
the cumulative probability Pr(ν = 2.1) + Pr(ν = 2.2) + … is calculated for each point, and the
U(0, 1) draw determines which is sampled.
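
The grid step just described is an inverse cumulative probability draw. A minimal sketch follows, assuming equal probabilities over the grid for illustration; in a sampler the probabilities attached to the grid points would be recomputed at each iteration. The function names are hypothetical.

```python
import bisect
import random


def make_grid(start=2.1, stop=49.9, step=0.1):
    """Equally spaced grid of candidate degrees of freedom values."""
    n = int(round((stop - start) / step)) + 1
    return [round(start + k * step, 1) for k in range(n)]


def sample_from_grid(grid, probs, u):
    """Return the first grid value whose cumulative probability reaches u.

    probs need not be normalised; u is a U(0, 1) draw.
    """
    cumulative = []
    total = 0.0
    for p in probs:
        total += p
        cumulative.append(total)
    index = bisect.bisect_left(cumulative, u * total)
    return grid[min(index, len(grid) - 1)]


grid = make_grid()
probs = [1.0 / len(grid)] * len(grid)       # equal probabilities (assumed)
nu = sample_from_grid(grid, probs, random.Random(0).random())
```

Because the cumulative sums are non-decreasing, a binary search locates the sampled grid point without scanning the whole grid.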
Following the Pinheiro et al. (2001) scheme, assume a gamma-normal hierarchical repre-
sentation with scale mixture parameters $s_i \sim \mathrm{Ga}(0.5\nu, 0.5\nu)$, and also that $e_i \sim N_q(0, I)$. Then
for continuous responses $y_i = (y_{i1}, \ldots, y_{in_i})'$, a level 2 assumption of t distributed random
effects $b_i = (b_{1i}, \ldots, b_{qi})'$ with dispersion $\Sigma_b$ leads to

$$y_i = X_i\beta + Z_i b_i + u_i, \quad i = 1, \ldots, m,$$

$$b_i = \kappa W_i + \Sigma_b^{0.5} e_i / s_i^{0.5}.$$

For outlier clusters with low $s_i$ the overall dispersion $\Sigma_b / s_i$ is inflated, but the fixed effect
κ will be less distorted than under normal level 2 errors.
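
The scale mixture construction can be sketched for a single random intercept (q = 1). This is an illustrative sketch under the standard construction, in which the normal draw is divided by the square root of the gamma scale so that the variance is inflated by 1/s; the function name and the settings ν = 4 and unit standard deviation are assumptions.

```python
import random


def draw_t_effect(nu, sd_b, rng):
    """Draw one Student t random effect via a gamma-normal scale mixture.

    s ~ Ga(nu/2, rate nu/2) has mean 1; gammavariate takes (shape, scale),
    so the scale argument is the reciprocal of the rate.
    """
    s = rng.gammavariate(0.5 * nu, 1.0 / (0.5 * nu))
    e = rng.gauss(0.0, 1.0)
    return sd_b * e / s ** 0.5, s


rng = random.Random(42)
draws = [draw_t_effect(nu=4.0, sd_b=1.0, rng=rng) for _ in range(20000)]
mean_scale = sum(s for _, s in draws) / len(draws)
# Clusters with small s have inflated variance and are outlier candidates.
n_low_scale = sum(1 for _, s in draws if s < 0.2)
```

Monitoring the scale parameters in this way mirrors their use as outlier indicators: a posterior for a cluster's scale concentrated well below 1 signals a discordant cluster.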
The degrees of freedom parameters νi of the level 2 multivariate t prior may be taken to
vary between clusters, namely

$$\begin{pmatrix} y_i \\ b_i \end{pmatrix} \sim t_{n_i+q}\left( \begin{pmatrix} X_i\beta \\ 0 \end{pmatrix}, \begin{pmatrix} Z_i\Sigma_b Z_i' + \Lambda_i & Z_i\Sigma_b \\ \Sigma_b Z_i' & \Sigma_b \end{pmatrix}, \nu_i \right),$$

or under a gamma-normal hierarchical representation,

$$\begin{pmatrix} y_i \\ b_i \end{pmatrix} \sim N_{n_i+q}\left( \begin{pmatrix} X_i\beta \\ 0 \end{pmatrix}, \frac{1}{s_i}\begin{pmatrix} Z_i\Sigma_b Z_i' + \Lambda_i & Z_i\Sigma_b \\ \Sigma_b Z_i' & \Sigma_b \end{pmatrix} \right),$$

where $s_i \sim \mathrm{Ga}(0.5\nu_i, 0.5\nu_i)$. The $s_i$ can then be used for identifying cluster outliers. An alter-
native to assuming cluster specific degrees of freedom is to take $\nu_i = \nu_{g_i}$, according to a
known or possibly unknown grouping variable $g_i \in (1, \ldots, G)$ applicable to clusters, for
example, type of school in an educational application.
Discrete mixtures of random effects are also possible for outlier accommodation, model-
ling non-normality or other asymmetry in random effects. Latent mixtures of regression
effects may also be present: Muthén and Asparouhov (2009) show how latent regression
classes may be misrepresented as random cluster variation. To detect outlier random
effects, Daniels and Gatsonis (1999, p.36) adapt the approach of Albert and Chib (1997) in
their models for hierarchical conjugate priors for discrete data.
For nested binomial data $y_{ij} \sim \mathrm{Bin}(n_{ij}, p_{ij})$, a mechanism to detect level 1 outliers may be
specified with $p_{ij}$ drawn from a two-group mixture of beta densities, both with means $\pi_{ij}$.
For the main group, the dispersion parameters are $\delta_i$, while for the outlier group they are
deflated as $\delta_i / K$ where $K > 1$. Then

$$p_{ij} \sim (1 - \lambda)\,\mathrm{Beta}(\pi_{ij}\delta_i, (1 - \pi_{ij})\delta_i) + \lambda\,\mathrm{Beta}\left(\pi_{ij}\frac{\delta_i}{K}, (1 - \pi_{ij})\frac{\delta_i}{K}\right).$$

If the outlier probability λ is preset to a low value (e.g. λ = 0.05), then K might be taken as
an extra parameter. Weiss et al. (1999) suggest a similarly motivated prior for mixtures of
normal random effects at levels 1 and 2 in (8.1) and (8.3), namely

$$b_i \sim (1 - \lambda_b) N_q(0, \Sigma_b) + \lambda_b N_q(0, K_b \Sigma_b),$$

$$u_{ij} \sim (1 - \lambda_u) N(0, \sigma_u^2) + \lambda_u N(0, K_u \sigma_u^2).$$
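
A two-component variance inflation mixture of this kind can be simulated directly. The sketch below covers the univariate level 1 case; the function name and the settings (outlier probability 0.05, inflation factor 10) are illustrative assumptions rather than values from the source.

```python
import random


def draw_contaminated_normal(sd, lam, K, rng):
    """Draw from (1 - lam) N(0, sd^2) + lam N(0, K sd^2).

    Returns the draw and a flag for membership of the inflated component.
    """
    is_outlier = rng.random() < lam
    scale = sd * (K ** 0.5 if is_outlier else 1.0)  # variance inflated by K
    return rng.gauss(0.0, scale), is_outlier


rng = random.Random(7)
sample = [draw_contaminated_normal(sd=1.0, lam=0.05, K=10.0, rng=rng)
          for _ in range(20000)]
outlier_share = sum(flag for _, flag in sample) / len(sample)
```

In estimation the component flags are latent, and their posterior probabilities indicate which units are better described by the inflated variance component.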
An alternative mixture prior to reduce the impact of parametric assumptions is the mix-
ture of Dirichlet process approach (Kleinman and Ibrahim, 1998; Guha, 2008). Thus, a con-
ventional first stage likelihood

$$y_i \sim N(X_i\beta + Z_i b_i, \sigma^2),$$

may be combined with a semiparametric approach for $b_i = (b_{1i}, \ldots, b_{qi})'$, typically with a mul-
tivariate normal base $G_0$ as in

$$b_i \sim G,$$

$$G \sim DP(\alpha, G_0),$$

$$G_0 = N_q(0, D),$$

$$D^{-1} \sim \mathrm{Wishart}(d_0, R_0).$$

Gibbs sampling for $D^{-1}$ is modified for clustering among the sampled $b_i$ (Kleinman and
Ibrahim, 1998, p.94).

Example 8.5 Police Stops and Ethnicity


This example considers representation of cluster effects in multilevel Poisson regres-
sion analysis of 900 counts yij of “stop and frisk” over a 15-month period in 1998–99
(Gelman and Hill, 2006). For each of m = 75 New York police precincts, counts are
disaggregated both by ethnic group (1 = black, 2 = Hispanic, 3 = white), and crime type
(1 = violent, 2 = weapons, 3 = property, 4 = drug), so that there are j = 1, …, ni observations, with
ni = 12, for each precinct i. These 12 categories are called classes here. An offset, oij, is
provided by arrests according to precinct, ethnicity, and type in 1997 (multiplied by
15/12); in fact oij + 1 is used instead, since some 1997 arrest counts are zero.
Here, an initial analysis (using jagsUI) has normal random errors at both precinct and
precinct-class level (model 1), the latter introduced to account for overdispersion, while
the former measure overall crime levels in a precinct. Cluster (precinct) effects are taken
as normal. So $y_{ij} \sim \mathrm{Po}(\mu_{ij}[o_{ij} + 1])$ with

$$\log(\mu_{ij}) = \beta_0 + \beta_{eth_{ij}} + b_i + u_{ij},$$

$$u_{ij} \sim N(0, \sigma_u^2), \quad b_i \sim N(0, \sigma_b^2).$$

The ethnicity fixed effect (β1, β2, β3) has black ethnicity as reference. In practice, the $u_{ij}$
are centred around the regression term $\beta_0 + \beta_{eth_{ij}}$ to improve convergence. U(0,100)
priors are adopted for the random standard deviations.
Using replicate random effects $u_{ij,rep}$ and $b_{i,rep}$, and the resulting replicate data $y_{ij,rep}$
sampled from the model, predictive checks involve the mixed predictive exceedance
criterion $\Pr(y_{ij,rep} > y_{ij} \mid y)$ (Green et al., 2009). The observation level log posterior
predictive densities (LPPDs) associated with the WAIC are also obtained. The significance of
individual precinct effects $b_i$ is assessed using the probabilities $\Pr(b_i > 0 \mid y)$.
The scaled deviance (DV in the code) is estimated as 925, so overdispersion is
accounted for. The Hispanic and white ethnic coefficients (β2 and β3) have 95% intervals
(−0.12,0.22), and (−0.59,−0.23), so whites have lower chances of being subject to “stop and
frisk.” Specifically, they have a 33% lower relative risk, namely 100(1 − exp(−0.4)) where
–0.4 is the posterior mean of β3. Despite the presence of precinct-cell error terms (which
might reduce the need for separate precinct effects), a relatively high number (25 out of 75)
of the precinct effects $b_i$ are significant in the sense that the probabilities $\Pr(b_i > 0 \mid y)$
exceed 0.95 or are under 0.05.

FIGURE 8.1
Precinct random effects, truncated Dirichlet prior. (Histogram of frequency against posterior mean effect.)
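
The relative risk reduction quoted for white ethnicity in this example is a direct transformation of the log relative risk coefficient. The short sketch below reproduces the arithmetic; the function name is hypothetical, and applying the same transformation to the interval endpoints relies on the transformation being monotone.

```python
import math


def percent_risk_reduction(log_relative_risk):
    """Convert a log relative risk into a percentage reduction in risk."""
    return 100.0 * (1.0 - math.exp(log_relative_risk))


# Posterior mean of the white ethnicity coefficient reported in the text.
reduction = percent_risk_reduction(-0.4)

# The 95% interval endpoints (-0.59, -0.23) map to reductions in reverse order.
interval = (percent_risk_reduction(-0.23), percent_risk_reduction(-0.59))
```

The first value is about 33%, matching the figure in the text, and the transformed interval runs from roughly 21% to 45%.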
Around 8.7% of the mixed predictive exceedance checks are in the extreme tails
(under 0.05 or over 0.95), so the model is reproducing the data effectively. The lowest
LPPD values and extreme exceedance probabilities are for subjects with very high stop
counts, and for subjects with zero stop counts, despite relatively large offsets.
Of interest in terms of the robustness of the model assumptions are the character-
istics of the posterior estimates of bi and uij. The proportion of extreme values for the
precinct effects may cast doubt on a normality assumption, and as an alternative, a
truncated Dirichlet process prior is adopted for these effects (model 2). A fixed Dirichlet
concentration parameter is assumed, namely α = 1, to aid convergence. The base density
involves normal random effects over a maximum of 20 clusters.
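
One common way to implement a truncated Dirichlet process prior is the stick-breaking construction, sketched below with α = 1 and a maximum of 20 clusters as in model 2. The source does not state which construction its code uses, so stick-breaking is an assumption here, as are the function name and the unit variance normal base density.

```python
import random


def stick_breaking_weights(alpha, n_clusters, rng):
    """Truncated stick-breaking weights for a DP with concentration alpha."""
    weights = []
    remaining = 1.0
    for _ in range(n_clusters - 1):
        v = rng.betavariate(1.0, alpha)     # fraction of the remaining stick
        weights.append(v * remaining)
        remaining *= 1.0 - v
    weights.append(remaining)               # truncation: last cluster takes the rest
    return weights


rng = random.Random(2)
weights = stick_breaking_weights(alpha=1.0, n_clusters=20, rng=rng)

# Cluster-level effects from an assumed N(0, 1) base density, then one
# prior draw of the allocation of the 75 precincts to clusters.
cluster_effects = [rng.gauss(0.0, 1.0) for _ in range(20)]
labels = rng.choices(range(20), weights=weights, k=75)
precinct_effects = [cluster_effects[z] for z in labels]
```

With α = 1 most of the prior mass falls on the first few clusters, so precinct effects are shared within a small number of groups rather than drawn independently.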
A slight reduction in WAIC (from 6897 to 6891) is obtained. Around 9% of the mixed
predictive exceedance checks are in the extreme tails (under 0.05 or over 0.95), similar
to model 1. Similar results regarding the fixed effects and ethnic differences in risk of
stop and frisk are also estimated for this model. However, a histogram of the posterior
mean bi suggests non-normality (Figure 8.1), shown, for example, by a bimodal pattern,
and five precincts (2,26,28,51,70) with unusually low bi.

Example 8.6 Prenatal Care


This example illustrates the possible sensitivity of both fixed effects and variance com-
ponent estimates to multicollinearity. It involves a study of the adoption of modern
prenatal care (binary response) among Guatemalan women, based on 2,449 births (in a
5-year period before the survey) to 1,558 mothers living in one of m = 161 communities.
The predictor variables are at all levels, namely communities i at level 3, mothers ij at
level 2, and pregnancy episodes ijk at level 1, and denoted Xi, Xij and Xijk respectively.
The predictors include at level 3: proportion of community population indigenous and
distance to the nearest clinic; at level 2: mother’s ethnicity, mother’s education, husband’s
education, husband’s occupation, and presence of a modern toilet; and at level 1: (exist-
ing) child’s age, mother’s age, and birth order.
Then with yijk ~ Bern(pijk), a binary multilevel model specifies normal random intercept
variation at levels 2 and 3, namely according to both mother and community. Using a
non-centred parameterisation and logit link, one has

$$\mathrm{logit}(p_{ijk}) = \alpha + b_{i3} + b_{ij2} + X_i\beta_3 + X_{ij}\beta_2 + X_{ijk}\beta_1,$$

where $b_{i3} \sim N(0, \sigma_3^2)$ are community effects, and $b_{ij2} \sim N(0, \sigma_2^2)$ are mother level effects.
Alternatively, community effects and mother effects could be centred at Xiβ3 and Xijβ2
respectively. Instability across estimation methods in this dataset is noted by Rodriguez
and Goldman (2001) and Guo and Zhao (2000).
This instability may be related partly to small cluster sizes at both levels as well as
the binary form of outcome. Here we illustrate the potential impacts of (fixed effect)
predictor collinearity, comparing a diffuse normal prior on predictors with a horseshoe
prior. Under the horseshoe prior, the Student t prior of the local shrinkage parameters
(see Equation 7.1) has 1 degree of freedom (Piironen and Vehtari, 2016), while the global
parameter has a Cauchy prior with scale 1. Using rstan, convergence is achieved in two
chain runs of 2000 iterations.

TABLE 8.1
Modern Pregnancy Advice

                                  Diffuse Normal Prior on        Horseshoe Prior on
                                  Fixed Regression Effects       Fixed Regression Effects
Fixed effects                       Mean     2.5%    97.5%         Mean     2.5%    97.5%
Intercept                            5.3      0.3     11.1          3.0      0.5      5.9
Pregnancy Level
  Child aged 3–4 years             −1.38    −2.16    −0.66        −0.89    −1.51    −0.29
  Mother aged > 25 years            1.28     0.00     2.65         0.60    −0.16     1.57
  Birth order 2–3                  −1.05    −2.18     0.03        −0.35    −1.16     0.18
  Birth order 4–6                  −0.54    −2.06     0.98        −0.01    −0.88     0.76
  Birth order > 7                  −1.29    −3.45     0.73        −0.32    −1.75     0.65
Mother Level
  Indigenous, no Spanish           −7.85   −13.20    −3.59        −5.30    −8.95    −1.70
  Indigenous Spanish               −4.21    −7.68    −1.18        −2.60    −5.18    −0.10
  Mother’s education primary        2.66     0.85     4.79         1.59     0.18     3.03
  Mother’s education secondary      5.73     1.40    10.95         3.51     0.00     7.22
  Husband’s education primary       1.14    −0.84     3.22         0.57    −0.45     2.18
  Husband’s education secondary     4.89     1.04     9.21         3.35     0.19     6.64
  Husband’s education missing       0.07    −3.03     3.11        −0.05    −1.38     1.25
  Husband professional etc.        −0.54    −5.53     4.36         0.65    −0.87     2.70
  Husband agric. self-employed     −2.73    −7.09     1.46        −0.57    −2.40     0.60
  Husband agric. employee          −3.82    −8.49     0.44        −1.33    −3.52     0.16
  Husband skilled service          −1.17    −5.46     3.17         0.25    −1.03     1.83
  Modern toilet in households       2.80     0.08     5.87         1.82    −0.02     4.00
  Television not watched daily      2.18    −1.81     6.37         0.59    −0.74     3.03
  Television watched daily          2.14    −0.38     4.93         1.00    −0.32     3.08
Community Level
  Proportion indigenous, 1981      −6.60   −11.84    −1.98        −4.95    −8.96    −1.08
  Distance to nearest clinic       −0.07    −0.14    −0.02        −0.06    −0.10    −0.02
Random effect variances
  Family                            10.5      7.7     14.2          7.3      5.7      9.3
  Community                          5.6      3.8      7.8          4.0      2.8      5.3

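
To give a feel for the horseshoe prior adopted on the fixed effects in this example, the sketch below draws coefficients from its usual hierarchical form: a normal with standard deviation equal to the product of a local and a global scale, each half-Cauchy (a Student t with 1 degree of freedom restricted to positive values). Equation 7.1 is not reproduced in this section, so this standard form is an assumption, as are the function names.

```python
import math
import random


def half_cauchy(scale, rng):
    """Half-Cauchy draw via the inverse CDF: scale * tan(pi * u / 2)."""
    return scale * math.tan(0.5 * math.pi * rng.random())


def draw_horseshoe_coefs(n_coef, global_scale, rng):
    """One prior draw of n_coef regression coefficients under a horseshoe."""
    tau = half_cauchy(global_scale, rng)                     # global shrinkage
    lambdas = [half_cauchy(1.0, rng) for _ in range(n_coef)] # local shrinkage
    return [rng.gauss(0.0, lam * tau) for lam in lambdas]


rng = random.Random(5)
coefs = draw_horseshoe_coefs(n_coef=21, global_scale=1.0, rng=rng)
```

Repeated draws typically show most coefficients close to zero with an occasional very large value, which is the behaviour that lets the prior shrink weak predictors while leaving strong ones largely untouched.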
Table 8.1 shows that posterior mean random intercept variances at both family and
community level are reduced by about 30% under the horseshoe prior. Fixed regression
effects show considerable shrinkage, but significant predictor effects (on 8 of the 21 pre-
dictors) are maintained (as assessed by 95% credible intervals either entirely negative
or positive). The indicators κj (see Equation 7.2) show that the “indigenous, no Spanish”
(mother) and “proportion indigenous” (community) predictors have the highest rele-
vance, with posterior mean κj around 0.12 for both (kappa[6] and kappa[20] in the code),
and posterior median κj around 0.06. This strategy improves fit: the LOO-IC falls from
2653 to 1765 on adopting shrinkage priors.

References
Aarts E, Dolan C, Verhage M, van der Sluis S (2015) Multilevel analysis quantifies variation in the
experimental effect while optimizing power and preventing false positives. BMC Neuroscience,
16, 94.
Albert J, Chib S (1997) Bayesian tests and model diagnostics in conditionally independent hierarchi-
cal models. Journal of the American Statistical Association, 92(439), 916–925.
Bliese P (2016a) Package ‘Multilevel’ Manual. https://cran.r-project.org/web/packages/multilevel/
Bliese P (2016b) Multilevel Modeling in R: A Brief Introduction to R, the multilevel Package and the
nlme Package. Darla Moore School of Business, University of South Carolina.
Bliese P, Halverson R (1996) Individual and nomothetic models of job stress: An examination of work
hours, cohesion, and well-being. Journal of Applied Social Psychology, 26(13), 1171–1189.
Bliese P, Hanges P (2004) Being both too liberal and too conservative: The perils of treating grouped
data as though they were independent. Organizational Research Methods, 7(4), 400–417.
Browne W (2004) An illustration of the use of reparameterisation methods for improving MCMC
efficiency in crossed random effect models. Multilevel Modelling Newsletter, 16, 13–25.
Browne W, Draper D, Goldstein H, Rasbash J (2002) Bayesian and likelihood methods for fitting
multilevel models with complex level-1 variation. Computational Statistics and Data Analysis,
39, 203–225.
Browne W, Goldstein H, Rasbash J (2001) Multiple membership multiple classification (MMMC)
models. Statistical Modelling, 1, 103–124.
Buerkner P (2017) brms: An R package for Bayesian multilevel models using Stan. Journal of Statistical
Software, 80(1), 1–28.
Candel J, Winkens B (2003) Performance of empirical Bayes estimators of level-2 random parameters
in multilevel analysis: A Monte Carlo study for longitudinal designs. Journal of Educational and
Behavioral Statistics, 28, 169–194.
Chaix B, Merlo J, Chauvin P (2005) Comparison of a spatial approach with the multilevel approach
for investigating place effects on health: The example of healthcare utilisation in France. Journal
of Epidemiology and Community Health, 59, 517–526.
Chen Z, Dunson D (2003) Random effects selection in linear mixed models. Biometrics, 59, 762–769.
Congdon P, Best N (2000) Small area variation in hospital admission rates: Adjusting for referral and
provider variation. Journal of the Royal Statistical Society: Series C, 49(2), 207–226.
Congdon P, Lloyd P (2010) Estimating small area diabetes prevalence in the US using the behavioral
risk factor surveillance system. Journal of Data Science, 8(2), 235–252.
Croon M, van Veldhoven M (2007) Predicting group-level outcome variables from variables mea-
sured at the individual level: A latent variable multilevel model. Psychological Methods, 12(1),
45–57.
Daniels M, Gatsonis C (1999) Hierarchical generalized linear models in the analysis of variations
in health care utilization. Journal of the American Statistical Association, 94, 29–42.
Daniels M, Kass R (1999) Nonconjugate Bayesian estimation of covariance matrices and its use in
hierarchical models. Journal of the American Statistical Association, 94, 1254–1263.
Daniels M, Zhao Y (2003) Modelling the random effects covariance matrix in longitudinal data.
Statistics in Medicine, 22, 1631–1647.
Dean C, MacNab Y (2001) Modeling of rates over a hierarchical health administrative structure.
Canadian Journal of Statistics, 29, 405–419.
Dong G, Ma J, Harris R, Pryce G (2016) Spatial random slope multilevel modeling using multivari-
ate conditional autoregressive models: A case study of subjective travel satisfaction in Beijing.
Annals of the American Association of Geographers, 106(1), 19–35.
Draper D (1995) Inference and hierarchical modeling in the social sciences. Journal of Educational and
Behavioral Statistics, 20, 115–147.
Draper D (2006) Bayesian multilevel analysis and MCMC, Chapter 2, in Handbook of Quantitative
Multilevel Analysis, eds J de Leeuw, E Meijer. Springer, New York.
Flay B, Hansen W, Johnson C, Collins L, Dent C, Dwyer K, Grossman L, Hockstein G, Rauch J,
Sobol J, Sobel D, Sussman S, Ulene A (1987) Implementation effectiveness trial of a social influ-
ences smoking prevention program using schools and television. Health Education Research, 2,
385–400.
Gamerman D (1997) Sampling from the posterior distribution in generalized linear mixed models.
Statistics and Computing, 7, 57–68.
Gelfand A, Sahu S, Carlin BP (1995) Efficient parameterisations for normal linear mixed models.
Biometrika, 82, 479–488.
Gelman A (2006) Multilevel (hierarchical) modeling: What it can and can’t do. Technometrics, 48,
432–435.
Gelman A, Hill J (2006) Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge
University Press.
Gelman A, Pardoe I (2006) Bayesian measures of explained variance and pooling in multilevel (hier-
archical) models. Technometrics, 48(2), 241–251.
Givens G, Hoeting J (2012) Computational Statistics, 2nd Edition. John Wiley.
Goldstein H (2005) Heteroscedasticity and complex variation, pp 790–795, in Encyclopedia of Statistics
in Behavioral Science, Vol. 2, eds B Everitt, D Howell. Wiley, New York.
Goldstein H, Browne W, Rasbash J (2002) Partitioning variation in multilevel models. Understanding
Statistics, 1, 223–232.
Green MJ, Medley GF, Browne WJ (2009) Use of posterior predictive assessments to evaluate model
fit in multilevel logistic regression. Veterinary Research, 40(4), 1–10.
Guha S (2008) Posterior simulation in the generalized linear mixed model with semiparametric ran-
dom effects. Journal of Computational and Graphical Statistics, 17, 410–425.
Guo G, Zhao H (2000) Multilevel modeling for binary data. Annual Review of Sociology, 26, 441–462.
Hedeker D (2003) A mixed-effects multinomial logistic regression model. Statistics in Medicine, 22,
1433–1446.
Hox J (2002) Multilevel Analysis: Techniques and Applications. Lawrence Erlbaum Associates, Mahwah, NJ.
Hsiao C (1996) Random coefficient models, pp 77–99, in The Econometrics of Panel Data, eds L Matyas,
P Sevestre. Kluwer, Dordrecht, Netherlands.
Kreft I, de Leeuw J (1998) Introducing Multilevel Modeling. Sage, Thousand Oaks, CA.
Kleinman K, Ibrahim J (1998) A semi-parametric Bayesian approach to generalized linear mixed
models. Statistics in Medicine, 17, 2579–2596.
Langford I, Lewis T (1998) Outliers in multilevel data. Journal of the Royal Statistical Society: Series A,
161, 121–160.
Lindley DV, Smith AF (1972) Bayes estimates for the linear model. Journal of the Royal Statistical
Society: Series B (Methodological), 34(1), 1–18.
Mai Y, Zhang Z (2018) Software Packages for Bayesian Multilevel Modeling. Structural Equation
Modeling, 25(4), 650–658.
Marshall E, Spiegelhalter D (2007) Identifying outliers in Bayesian hierarchical models: A
simulation-based approach. Bayesian Analysis, 2(2), 409–444.
McElreath R (2016) Statistical Rethinking: A Bayesian Course with Examples in R and Stan. Chapman &
Hall/CRC.
Moineddin R, Matheson F, Glazier R (2007) A simulation study of sample size for multilevel logistic
regression models. BMC Medical Research Methodology, 7, 34.
Muthén B, Asparouhov T (2009) Multilevel regression mixture analysis. Journal of the Royal Statistical
Society. Series A, 172, 639–657.
Natarajan R, McCulloch C (1998) Gibbs sampling with diffuse proper priors: A valid approach to
data-driven inference? Journal of Computational and Graphical Statistics, 7, 267–277.
Piironen J, Vehtari A (2016) Projection predictive variable selection using Stan+ R. arXiv preprint
arXiv:1508.02502.
Pinheiro J, Liu C, Wu Y (2001) Efficient algorithms for robust estimation in linear mixed-effects mod-
els using the multivariate t distribution. Journal of Computational and Graphical Statistics, 10,
249–276.
Rabe-Hesketh S, Skrondal A, Pickles A (2004) GLLAMM Manual. U.C. Berkeley Division of
Biostatistics Working Paper 160.
Raudenbush S (1993) A crossed random effects model for unbalanced data with applications in cross-
sectional and longitudinal research. Journal of Educational Statistics, 18, 321–349.
Raudenbush S, Bryk A (2002) Hierarchical Linear Models: Applications and Data Analysis Methods. Sage.
Rodriguez G, Goldman N (2001) Improved estimation procedures for multilevel models with binary
response: A case study. Journal of the Royal Statistical Society. Series A, 164, 339–355.
Seltzer M (1993) Sensitivity analysis for fixed effects in the hierarchical model: A Gibbs sampling
approach. Journal of Educational Statistics, 18, 207–235.
Seltzer M, Wong W, Bryk A (1996) Bayesian inference in applications of hierarchical models: Issues
and methods. Journal of Educational and Behavioral Statistics, 21, 131–167.
Seltzer M, Novak J, Choi K, Lim N (2002) Sensitivity analysis for hierarchical models employing t
level-1 assumptions. Journal of Educational and Behavioral Statistics, 27, 181–222.
Shieh Y, Fouladi R (2003) The effect of multicollinearity on multilevel modeling parameter estimates
and standard errors. Educational and Psychological Measurement, 63(6), 951–985.
Skrondal A, Rabe-Hesketh S (2004) Generalized Latent Variable Modelling: Multilevel, Longitudinal and
Structural Equation Models. Chapman & Hall/CRC, Boca Raton, FL.
Snijders T, Berkhof J (2002) Diagnostic checks for multilevel models, in Handbook of Quantitative
Multilevel Analysis, eds J de Leeuw, I Kreft. Kluwer, Boston/Dordrecht/London.
Snijders T, Bosker R (1999) Multilevel Analysis. An Introduction to Basic and Advanced Multilevel
Modelling. Sage, London, UK.
Subramanian S (2004) The relevance of multilevel statistical methods for identifying causal neighbor-
hood effects. Social Science & Medicine, 58, 1961–1967.
Subramanian S, Jones K, Duncan C (2003) Multilevel methods for public health research, pp 65–111,
in Neighborhoods and Health, eds I Kawachi, L Berkman. Oxford University Press, New York.
van Duijn M, Jansen M (1995) Modelling repeated count data: some extensions of the Rasch Poisson
counts model. Journal of Educational and Behavioral Statistics, 20, 241–258.
Vermunt J (2013) Categorical response data, Chapter 16, pp 289–297, in The SAGE Handbook of
Multilevel Modeling, eds M Scott, J Simonoff, B Marx. Sage.
Weiss R, Cho M, Yanuzzi M (1999) On Bayesian calculations for mixture priors and likelihoods.
Statistics in Medicine, 18, 1555–1570.
Zhao Y, Staudenmayer J, Coull B, Wand M (2006) General design Bayesian generalized linear mixed
models. Statistical Science, 21, 35–51.
9
Factor Analysis, Structural Equation
Models, and Multivariate Priors

9.1 Introduction
A range of multivariate techniques are available both for modelling multivariate collec-
tions of metric, binary, or count data, and for modelling multivariate random effects or
regression residuals. These include data reduction (reduced dimension) methods such as
factor and principal component analysis (e.g. Hayashi and Arav, 2006; Lopes and West,
2004), structural equation modelling (Schumacker and Lomax, 2016), discriminant
analysis (e.g. Brown et al., 1999; Rigby, 1997), and data mining, as well as direct (full
dimension) modelling of the joint density of the observations or regression residuals (e.g. Chib
and Winkelmann, 2001; Martinez-Beneito, 2013). Structured multivariate effects in the
analysis of spatial or time configured data raise additional issues, such as representing
inter-variable correlation within units as well as non-exchangeability between units (Song
et al., 2005). Bayesian applications of factor analysis and structural equation modelling
have grown considerably in recent years; for overviews, see Palomo et al. (2007), Merkle
and Wang (2016), Kaplan and Depaoli (2012), Lee (2007), Stromeyer et al. (2015), and Levy
and Mislevy (2016).
The rationale for introducing latent variables lies in parsimonious representation of the
covariance structure of multivariate data, while also revealing underlying clustering of,
or associations between, the variables, ideally with substantive interpretability. The latent
variables are typically unobservable constructs (e.g. authoritarianism, population morbid-
ity, or a common trend over time) that can only be imperfectly measured by observed indi-
cators. The latent variables may be continuous, as in factor analysis (Fokoue, 2004; Lopes
and West, 2004), or categorical, as in latent class analysis (Berkhof et al., 2003). The original
variables might themselves also be discrete or continuous. For example, item response
models typically involve multiple binary observed items and a single latent continuous
ability score (Bazan et al., 2006; Luo and Jiao, 2017; Albert and Ghosh, 2000). Bayesian latent
variable packages in R include blavaan (Merkle and Rosseel, 2017), brms (Byrnes, 2017),
BayesFM (Piatek, 2017), bfa (Murray, 2016), and BayesLCA (White, 2017). Preliminary
analysis using classical estimation is often useful in problem definition, for example, using the
lavaan or OpenMx packages (Boker et al., 2011).
The extraction of information from multivariate observed indicators to derive a smaller
set of latent variables defines a measurement model, as in confirmatory and exploratory
factor analysis (Bartholomew, 1987; Skrondal and Rabe-Hesketh, 2007). The subsequent
use of the latent constructs in describing causal relationships or associations leads into
structural equation modelling (Lee, 2007). Both types of model have been developed,
especially in areas such as psychology, marketing, educational testing, and sociology, where
it is not possible to measure underlying constructs directly. Newer areas of development
include environmental modelling (Malaeb et al., 2000; Nikolov et al., 2007), biomass
models (Arhonditsis et al., 2006a), and time series and spatial data analysis using common
factor approaches.
The observed variables in a measurement model are variously known as “items” (e.g. in
psychometric tests), as “indicators,” or as “manifest variables.” Canonical assumptions are
that (a) conditional on the constructs, the observed indicators are independent, in which
case the constructs explain the observed correlations between the indicators, and (b) that
the construct scores are independent over subjects. As Bollen (2002) points out, the local
independence property in (a) is not an intrinsic feature of structural equation models,
while spatial and time series factor and structural equation models (Hogan and Tchernis,
2004; Congdon et al., 2007) exemplify how construct scores may be dependent over space
or time.
This chapter presents a selective review of multivariate techniques, namely

a) factor modelling via continuous latent constructs, as applied in normal linear and
general linear model contexts (Sections 9.2, 9.3, and 9.4);
b) models for multivariate discrete area (lattice) data, including spatial factor models
(Sections 9.6 and 9.7); and
c) models for multivariate time series (Section 9.8), with a focus on dynamic linear
and general linear models.

A Bayesian approach is arguably of benefit in such multivariate applications. Many
classical applications of factor and structural equation methods assume multivariate
normality of the indicators, with estimation based on minimising a discrepancy between the
observed and predicted covariance matrix – under multivariate normality, the covariance
matrix is sufficient for describing the correlations between observed indicators (Sanchez
et al., 2005). Considerations of robustness to outliers and other departures from normality,
and the ease with which parameter restrictions may be imposed and predictions made for
new cases, may point to a Bayes approach which retains the full observation set as input
(Lee, 2007) (see Section 9.5). The fully Bayes method has further potential advantages in
allowing flexible prior specification (e.g. hierarchical priors on loadings as against fixed
effects priors), and for describing the densities of the parameters of structural equation
models without making asymptotic approximations (Aitkin and Aitkin, 2005). Simpler fit-
ting of models involving interactions between latent factors is also a feature (Merkle and
Wang, 2016).

9.2 Normal Linear Structural Equation and Factor Models


Following Joreskog (1973) and classical presentations (e.g. Bollen, 1989; Hoyle, 1995),
Bayesian treatments of normal linear or general linear structural equation models are now
substantially represented (e.g. Nikolov et  al., 2007; Song et  al., 2006; Palomo et  al., 2007;
Kaplan and Depaoli, 2012). Consider observed multivariate metric indicators y and x, and
continuous endogenous and exogenous construct vectors, denoted F and H respectively.
For subjects $i = 1, \ldots, n$, the measurement model components of a normal linear SEM are

$$y_i = \alpha_y + \Lambda_y F_i + u_i,$$

$$x_i = \alpha_x + \Lambda_x H_i + e_i,$$

where $y_i = (y_{1i}, \ldots, y_{P_y i})'$ is a Py × 1 vector of indicators describing or measuring an
endogenous construct vector $F_i = (F_{1i}, \ldots, F_{Q_y i})'$ of dimension Qy less than or equal to Py; and xi
is a Px × 1 vector of indicators measuring an exogenous construct vector Hi of dimension
$Q_x \le P_x$. The individual factor variables Fqi may be independent of each other or
intercorrelated, and similarly for the Hri. The matrices Λy and Λx are of dimension Py × Qy and
Px × Qx and contain loading parameters describing how observed indicators are related to
the latent constructs. The {F,H} are sometimes known as common factors while the errors
{u,e} are sometimes called unique factors (Skrondal and Rabe-Hesketh, 2007). The errors ui
and ei are assumed independent of the common factors and are typically assumed to have
diagonal covariance matrices (Merkle and Rosseel, 2017).
A structural model may describe (a) interrelations between the Fqi (namely reciprocal
flows between endogenous variables such as social authoritarianism and religiosity) and
(b) effects of exogenous constructs Hri on the endogenous ones (e.g. effects of socioeco-
nomic status on authoritarianism or religiosity). These effects are represented by the equa-
tion system

$$F_i = BF_i + CH_i + w_i,$$

$$w_i \sim N(0, \Phi),$$

where an intercept is typically not identified, and B is a Qy × Qy matrix with zero diago-
nal elements and off-diagonal parameters describing relations between endogenous con-
structs. The matrix C is Qy × Qx with parameters describing the impact of exogenous on
endogenous constructs. The structural model may also contain further observed variables
as responses or predictors.
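The system above defines the endogenous constructs only implicitly, since Fi appears on both sides. As a worked step (a sketch that assumes I − B is invertible, a condition not stated explicitly in the text), solving for Fi gives the reduced form that one would use, for example, to simulate constructs from the model:

```latex
(I - B)F_i = CH_i + w_i
\quad\Longrightarrow\quad
F_i = (I - B)^{-1}(CH_i + w_i),
\qquad
\mathrm{cov}(F_i \mid H_i) = (I - B)^{-1}\,\Phi\,\{(I - B)^{-1}\}' .
```

Here Φ is the covariance matrix of wi, as above. Reciprocal effects in B therefore propagate both the exogenous effects C and the structural disturbances through the multiplier (I − B)^{-1}.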
Many multivariate reduction applications involve just a measurement model (i.e. a sim-
ple factor analysis), and so distinction between different types of observed indicator and
factor is not needed. Then a normal linear factor model is

$$y_i = \alpha + \Lambda F_i + u_i, \tag{9.1}$$

where $y_i = (y_{1i}, \ldots, y_{Pi})'$ is P × 1, $F_i = (F_{1i}, \ldots, F_{Qi})'$ is of dimension Q < P, and
$(u_{1i}, \ldots, u_{Pi})' \sim N_P(0, \Sigma)$. Either interrelated factors may be posited with

$$F_i = BF_i + w_i,$$

or independent factors

$$F_i \sim N_Q(0, \Phi)$$

assumed, with diagonal Φ (Mavridis and Ntzoufras, 2014). Identifying assumptions on Λ and Φ
are considered below. Under a local independence assumption, the residuals
$(u_{1i}, \ldots, u_{Pi})'$ are typically taken to be independent over cases i and variables, so
that $\Sigma = \mathrm{diag}(\sigma_1^2, \sigma_2^2, \ldots, \sigma_P^2)$. This assumption can equivalently be stated as that the
outcome variables are conditionally independent, given the latent variables (Skrondal and
Rabe-Hesketh, 2007).
It may be noted that path analysis models, a special case of SEM, may be estimated
straightforwardly using brms (Byrnes, 2017) [1]. Whereas SEM models in general may
include latent variables, path analysis models assume observed variables measured
without error. Only postulated structural relationships between observed variables are
included in the model. This approach is often used when particular variables are thought
to mediate relationships between others.

9.2.1 Forms of Model
If all loadings λpq in the P × Q matrix Λ are free parameters (apart from those subject to
identification constraints, as discussed below), this structure is known as an exploratory
factor analysis (EFA), and typically assumes independent factors, with Φ = I (Merkle and
Wang, 2016). By contrast, in a confirmatory factor analysis (CFA) or measurement model,
many of the loadings take preset values (usually zero) on the basis of substantive theory,
and correlations between factors may be assumed. A particular form of confirmatory
model is known as simple structure, such that each observed variable ypi loads on only one
of the constructs Fqi. For example, Fleishman and Lawrence (2003) apply a simple struc-
ture model to ordinal items from the SF12 questionnaire, assuming that each item reflects
either a physical or mental health construct.
A multiple indicator-multiple cause (MIMIC) model extends confirmatory models by
incorporating the effects of exogenous observed variables on latent factors (Joreskog and
Goldberger, 1975; Tekwe et  al., 2014). MIMIC models for normal outcomes consist of (a)
measurement equations

$$y_i = \alpha + \Lambda F_i + \varphi X_i + u_i,$$

relating multiple indicator variables yi to latent constructs Fi, and possibly also to known
influences Xi, and (b) structural equations. In the latter, the latent variables Fi are related,
both to one another and to observed exogenous variables Zi, which are viewed as causal
influences on the factors, namely

$$F_i = BF_i + CZ_i + w_i,$$

where Zi excludes a constant term, and the coefficient matrix B allows reciprocal effects
between latent factors. A MIMIC model with a single latent construct, as applied, for
instance, in analyses of the size of underground economies (Wang et al., 2006), would typi-
cally take the form

$$y_{pi} = \alpha_p + \lambda_p F_i + \varphi_p X_i + u_{pi},$$

$$F_i = \gamma Z_i + w_i.$$

As noted by Breusch (2005), the correlation structure in a MIMIC model may need
substantive support, as it typically assumes that (i) the indicators y are conditionally independent
of the causes Z, given the latent construct(s) F, and (ii) that the indicators $y_1, \ldots, y_P$ are
mutually independent given F. This amounts to saying that all connections that indicator
variables y have with the causal variables Z, and with one another, are transmitted
through the latent variable(s).
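Assumption (i) can be made concrete for the single-construct MIMIC model by substituting the structural equation into the measurement equation. The following is a worked step in the notation of this section (loadings λp, structural coefficients γ, direct effects φp), offered as an illustration of the argument rather than as part of Breusch's presentation:

```latex
y_{pi} = \alpha_p + \lambda_p(\gamma Z_i + w_i) + \varphi_p X_i + u_{pi}
       = \alpha_p + (\lambda_p \gamma) Z_i + \varphi_p X_i + (\lambda_p w_i + u_{pi}).
```

The causes Zi enter every indicator only through the products λpγ, so their effects on different indicators are constrained to be proportional to the loadings, and the shared term λp wi is the only source of residual correlation between indicators.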

9.2.2 Model Definition
Bayesian analysis in the normal linear factor model has recently focused on model defini-
tion questions. These include selection of important factor-indicator loadings (analogous
to predictor selection), covariance specification, and uncertainty in the number of factors.
Predictor selection methods such as SSVS (George and McCulloch, 1993) can be adapted
to selection of important loadings using binary indicators γjk for observed item j and latent
factor k. These indicators provide information about which items are associated with par-
ticular factors, and which items are relevant or irrelevant to the overall latent structure. For
a preset number of factors Q, this leads to confirmatory analysis, but subject to uncertainty
(Lu et al., 2016). Thus, analogous to SSVS, one has for γjk = 0,

$$\lambda_{jk} \sim N(0, \varphi_j^2),$$

where $\varphi_j^2$ is set very small so as to shrink λjk towards zero, whereas for γjk = 1,

$$\lambda_{jk} \sim N(0, c_j^2),$$

where cj is chosen large (e.g. cj = 10 or cj = 100) to enable effective search for non-zero λjk
values. Alternatively a spike and slab prior may be used, with λjk = 0 when γjk = 0.
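The two-component prior just described can be sketched in a few lines. The example below is plain Python (standard library only) rather than R or BUGS code, and the function name `draw_loading`, the spike standard deviation of 0.01, and the slab standard deviation of 10 are illustrative assumptions, not values taken from the cited papers.

```python
import random

def draw_loading(gamma, spike_sd=0.01, slab_sd=10.0, rng=random):
    """Draw one loading from the SSVS-style normal mixture prior.

    gamma = 0 -> narrow 'spike' N(0, spike_sd^2), shrinking the loading towards zero
    gamma = 1 -> diffuse 'slab'  N(0, slab_sd^2), allowing a clearly non-zero loading
    (spike_sd and slab_sd are assumed illustrative values)
    """
    sd = slab_sd if gamma == 1 else spike_sd
    return rng.gauss(0.0, sd)

def mean_abs(xs):
    """Average absolute value, used here as a simple measure of prior spread."""
    return sum(abs(x) for x in xs) / len(xs)

rng = random.Random(1)
spike_draws = [draw_loading(0, rng=rng) for _ in range(2000)]
slab_draws = [draw_loading(1, rng=rng) for _ in range(2000)]

# Loadings drawn under gamma = 0 are effectively zero; those under gamma = 1 are not
print(mean_abs(spike_draws), mean_abs(slab_draws))
```

In an MCMC implementation the indicator γjk would itself be sampled, so that the posterior probability Pr(γjk = 1 | y) summarises the evidence that item j loads on factor k.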
Such procedures can be extended with binary indicators δk that allow retention or exclu-
sion of factors (Mavridis and Ntzoufras, 2014). This leads to item and factor selection in an
exploratory factor analysis in which

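$$\lambda_{jk} \mid \gamma_{jk}, \delta_k \sim (1 - \gamma_{jk}\delta_k)N(0, \varphi_{jk}^2) + \gamma_{jk}\delta_k N(0, c_{jk}^2).$$

This involves a hierarchical prior on selection indicators, whereby

$$\delta_k \sim \mathrm{Bern}(\pi_\delta),$$

$$\gamma_{jk} \mid \delta_k \sim \mathrm{Bern}(\pi_\gamma \delta_k).$$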

As opposed to the selection of loadings, sparsity-inducing priors may be applied (Feng
et al., 2017; Bhattacharya and Dunson, 2011). Bhattacharya and Dunson propose shrinkage
parameters τk for the kth column of Λ, combined with a hierarchical Student t prior with ν
degrees of freedom, specified so that sparsity is encouraged for higher k. Thus

$$\lambda_{jk} \sim N(0, \phi_{jk}^{-1}\tau_k^{-1}),$$

$$\phi_{jk} \sim Ga\left(\frac{\nu}{2}, \frac{\nu}{2}\right),$$

$$\tau_k = \prod_{l=1}^{k} \delta_l,$$

$$\delta_1 \sim Ga(a_1, 1),$$

$$\{\delta_2, \ldots, \delta_k\} \sim Ga(a_2, 1),$$

where a2 > 1, so that the precisions τk are stochastically increasing in k (each additional
multiplier δl has prior mean a2 > 1). Fokoue (2004) proposes
to seek relatively simple structure (a Bayesian version of varimax rotation) by taking the
precisions for each loading as unknown gamma variables, namely

$$\lambda_{jk} \sim N(0, \tau_{jk}^{-1}),$$

$$\tau_{jk} \sim Ga(a, b).$$
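To see how the multiplicative gamma prior of Bhattacharya and Dunson encourages shrinkage in later columns, one can simulate the column precisions directly. This is a minimal sketch in plain Python; the hyperparameter values a1 = 2, a2 = 3 and ν = 3, and the function names, are assumptions chosen only for illustration.

```python
import random

def draw_column_precisions(Q, a1=2.0, a2=3.0, rng=random):
    """Draw tau_1..tau_Q with tau_k = delta_1 * ... * delta_k.

    delta_1 ~ Ga(a1, 1) and delta_l ~ Ga(a2, 1) for l >= 2 (shape, rate 1);
    random.gammavariate takes (shape, scale), and scale = 1 / rate = 1 here.
    """
    taus, tau = [], 1.0
    for k in range(Q):
        shape = a1 if k == 0 else a2
        tau *= rng.gammavariate(shape, 1.0)
        taus.append(tau)
    return taus

def draw_loading(tau_k, nu=3.0, rng=random):
    """Draw lambda_jk ~ N(0, 1 / (phi_jk * tau_k)) with phi_jk ~ Ga(nu/2, nu/2)."""
    phi = rng.gammavariate(nu / 2.0, 2.0 / nu)  # rate nu/2 -> scale 2/nu
    return rng.gauss(0.0, (phi * tau_k) ** -0.5)

rng = random.Random(7)
sims = [draw_column_precisions(4, rng=rng) for _ in range(5000)]
# Prior mean of tau_k is a1 * a2^(k-1), so the averages should rise with k
mean_tau = [sum(s[k] for s in sims) / len(sims) for k in range(4)]
example_row = [draw_loading(t, rng=rng) for t in sims[0]]
print([round(m, 1) for m in mean_tau])
```

Individual draws of τk need not increase with k, because a multiplier δl can fall below 1; it is the prior expectation a1·a2^(k−1) that grows, which is what pulls loadings in higher-numbered columns towards zero.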

In related work, Muthén and Asparouhov (2012) propose a modified constraint form of
confirmatory analysis, labelled as a Bayesian SEM, and included in the Mplus package.
Under this approach, the main loadings (those consistent with simple structure) have a
prior variance large enough to represent non-zero effects. However, instead of constrain-
ing other (cross) loadings to zero, they are assigned informative priors with very low vari-
ance (e.g. 0.01), so are approximate rather than exact zeros. If certain cross-loadings are
found to be significant (95% credibility interval excluding zero) despite these priors, such
that an item loads on more than one construct, then this suggests simple structure no
longer holds. The model may be re-estimated with those cross-loadings assigned a less
informative prior (Smith et al., 2017).
Default covariance specifications, such as diagonal Σ and Φ in (9.1), may be restrictive
in certain applications. The package blavaan (Merkle and Rosseel, 2017) uses a form of
parameter expansion, involving phantom latent variables, to facilitate the estimation of
non-diagonal covariance matrices.
Choice between models involving different numbers of factors may be tackled using
parameter expansion, combined with a Bayes factor approximation (Ghosh and Dunson,
2008), by RJMCMC (reversible jump Markov chain Monte Carlo) methods (Lopes and
West, 2004), or by marginal likelihood approximation using path sampling (Lee, 2007).
The latter approach may be extended to full structural equation models (SEMs) (Lee and
Song, 2008). The parameter expansion method may also improve MCMC performance
(Ghosh and Dunson, 2009; Merkle and Wang, 2016), and involves a reference model with
standardised factors, and a lower triangular structure for Λ (see Section 9.3), including the
diagonals constraint λqq > 0.
Thus, the reference model is

$$y_i = \alpha + \Lambda F_i + u_i, \qquad F_i \sim N_Q(0, R),$$

where R allows correlations between factors, but has diagonal 1. The expanded model is

$$y_i = \alpha + \Lambda^* F_i^* + u_i, \qquad F_i^* \sim N_Q(0, \Psi),$$

where Ψ is unconstrained, the loadings Λ* are not subject to the diagonals constraint $\lambda_{qq}^* > 0$,
but Λ* is still lower triangular. Q(Q − 1) parameters in Λ* are set to zero when R is non-diagonal
(Merkle and Wang, 2016). Priors on parameters in the expanded model induce priors on
(Λ,F,R) in the reference model, via

$$\lambda_{pq} = S(\lambda_{qq}^*)\,\lambda_{pq}^*\,\psi_q^{0.5}, \tag{9.2}$$

$$F_{qi} = S(\lambda_{qq}^*)\,F_{qi}^*/\psi_q^{0.5},$$

$$R_{qr} = S(\lambda_{qq}^*)\,S(\lambda_{rr}^*)\,\Psi_{qr}/(\Psi_{qq}\Psi_{rr})^{0.5},$$

where ψq is the qth diagonal element of Ψ, so that R has unit diagonal, and where a sign
function, S(x) = −1 if x < 0 and S(x) = 1 if x ≥ 0, is used to ensure a positive
diagonals constraint in Λ.
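The mapping from the expanded to the reference model can be checked numerically. The sketch below is plain Python with made-up values for Λ*, Ψ and one subject's F*; it assumes that ψq is the qth diagonal element of Ψ and that the correlation is obtained by dividing Ψqr by the square root of ΨqqΨrr, and the function name `to_reference` is hypothetical.

```python
import math

def sign(x):
    """S(x) = -1 if x < 0 and 1 otherwise."""
    return -1.0 if x < 0 else 1.0

def to_reference(lam_star, psi, f_star):
    """Map expanded-model values (Lambda*, Psi, F*_i) to (Lambda, R, F_i)."""
    Q = len(psi)
    s = [sign(lam_star[q][q]) for q in range(Q)]       # signs of diagonal loadings
    sd = [math.sqrt(psi[q][q]) for q in range(Q)]      # psi_q^0.5
    lam = [[s[q] * row[q] * sd[q] for q in range(Q)] for row in lam_star]
    f = [s[q] * f_star[q] / sd[q] for q in range(Q)]
    R = [[s[q] * s[r] * psi[q][r] / (sd[q] * sd[r]) for r in range(Q)]
         for q in range(Q)]
    return lam, R, f

# Illustrative (assumed) values: P = 3 indicators, Q = 2 factors
lam_star = [[-2.0, 0.0], [1.0, 0.5], [0.4, -0.3]]   # lower triangular in its first Q rows
psi = [[4.0, 0.5], [0.5, 0.25]]
f_star = [1.0, -2.0]

lam, R, f = to_reference(lam_star, psi, f_star)

# The linear predictor Lambda F_i is unchanged by the transformation
pred_star = [sum(a * b for a, b in zip(row, f_star)) for row in lam_star]
pred_ref = [sum(a * b for a, b in zip(row, f)) for row in lam]
print(pred_star, pred_ref)
```

Because each column of Λ* is multiplied by the same quantity that divides the matching factor score, the fitted values are identical under both parameterisations, while the reference-model loadings acquire positive diagonal elements and the factors unit variance.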

9.2.3 Marginal and Complete Data Likelihoods, and MCMC Sampling


From (9.1), the conditional likelihood of the normal linear factor model is
$p(y_i \mid F_i, \alpha, \Lambda, \Phi, \Sigma) = N(\alpha + \Lambda F_i, \Sigma)$, with conditional covariance matrix $V(y_i \mid F_i, \Sigma) = \Sigma$, and
hence $\{\mathrm{cov}(y_{ji}, y_{mi}) = 0, m \ne j\}$ if Σ is diagonal. The marginal likelihood obtained by
integrating out the factor scores in the normal linear factor model (Lee and Shi, 2000, p.724;
Fokoue, 2004) is $p(y_i \mid \alpha, \Lambda, \Phi, \Sigma) = N(\alpha, \Lambda\Phi\Lambda' + \Sigma)$. The joint likelihood of yi and Fi, obtained
by multiplying the marginal density of F, $F_i \sim N_Q(0, \Phi)$, and the conditional density of yi
given Fi, is

$$\begin{bmatrix} y_i \\ F_i \end{bmatrix} \sim N_{P+Q}\left( \begin{bmatrix} \alpha \\ 0 \end{bmatrix}, \begin{bmatrix} \Lambda\Phi\Lambda' + \Sigma & \Lambda\Phi \\ \Phi\Lambda' & \Phi \end{bmatrix} \right).$$

When the factors are standardised with Φ = I (Bartholomew et al., 2002, p.150; Lopes and West, 2004,
p.44), the marginal variance of yp is accordingly $\lambda_{p1}^2 + \ldots + \lambda_{pQ}^2 + \sigma_p^2$ and the marginal
covariance of yp and ym is $\lambda_{p1}\lambda_{m1} + \lambda_{p2}\lambda_{m2} + \ldots + \lambda_{pQ}\lambda_{mQ}$. The contribution $\lambda_{p1}^2 + \ldots + \lambda_{pQ}^2$ of the common
factors to explaining the marginal variability in the yp is known as the “communality,”
while that part due to the residual error $\sigma_p^2$ is called the “unique variance” or “uniqueness.”
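The decomposition into communality and uniqueness is easy to verify numerically. The following plain Python sketch uses a small made-up loading matrix (three indicators, two standardised uncorrelated factors); the values and function names are illustrative assumptions only.

```python
def communality(loadings_row):
    """Sum of squared loadings for one indicator: lambda_p1^2 + ... + lambda_pQ^2."""
    return sum(l * l for l in loadings_row)

def implied_cov(lam, sigma2):
    """Marginal covariance Lambda Lambda' + Sigma for standardised, uncorrelated factors."""
    P = len(lam)
    return [[sum(a * b for a, b in zip(lam[p], lam[m])) + (sigma2[p] if p == m else 0.0)
             for m in range(P)] for p in range(P)]

# Assumed illustrative values: rows are indicators, columns are factors
lam = [[0.8, 0.0], [0.6, 0.3], [0.0, 0.7]]
sigma2 = [0.36, 0.55, 0.51]          # unique variances

comm = [communality(row) for row in lam]
V = implied_cov(lam, sigma2)

for p in range(3):
    # communality + uniqueness = marginal variance (diagonal of V)
    print(p + 1, round(comm[p], 2), sigma2[p], round(V[p][p], 2))
```

With these values each indicator has unit marginal variance, so the communalities (0.64, 0.45, 0.49) can be read directly as the proportion of variance explained by the common factors, and the off-diagonal elements of V depend only on the loadings.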
The marginal likelihood structure for cov(y) as ΛΦΛ′ + Σ does not lead to any simple
form for the posterior distributions of the unknowns, though it can be used in RJMCMC
approaches to estimation and factor model selection (Lopes and West, 2004). In Gibbs
sampling estimation of linear Bayesian factor and SEM models, it is simplest to approach
estimation of the parameters (F, α, Λ, Φ, Σ) indirectly through the conditional likelihood or
complete data model (Aitkin and Aitkin, 2005; Fokoue, 2004), with the F scores regarded
as missing data rather than integrated out (Lee and Shi, 2000). Setting θ = (α, Λ, Φ, Σ), the
posterior density is then

$$p(\theta, F \mid y) \propto L(\theta \mid y, F)\,p(\theta).$$

While MCMC sampling is typically used with the conditional likelihood, the marginal
covariance ΛΦΛ′ + Σ may be useful in posterior checking of model assumptions (e.g.
conditional independence between the y variables given the factor scores). For example, Lee and
Shi (2000) suggest a posterior check using $D(y, \theta) = \sum_i y_i'(\Lambda\Phi\Lambda' + \Sigma)^{-1}y_i$. Following Gelman
et al. (1996), replicate data $y_{rep,i}$ are sampled from the predictive distribution $p(y_{rep} \mid y, \theta)$
and D(y,θ) compared to D(yrep,θ).
From a set of MCMC samples, one seeks the marginal posterior density p(θ|y) of the
hyperparameters, and the predictive distribution p(F|y) of the factor scores. Estimation at
iteration t + 1 proceeds by switching between (a) sampling $\theta^{(t+1)}$ from the posterior
conditional $p(\theta \mid y, F^{(t)})$ for θ conditional on y and sampled F scores, and (b) updating $F^{(t+1)}$ from
the conditional density $p(F \mid y, \theta^{(t+1)})$. The latter corresponds to the imputation step in data
augmentation (Tanner, 1996).
A range of inference issues may occur, subject to identifiability being fully considered
(Section 9.3). The patterns of significant loadings and subject factor scores raise questions
of substantive theory, depending on the application area. As noted by Aitkin and Aitkin
(2005), one can assess the significance of parameter or factor score contrasts on the basis
of the MCMC sample, such as pairwise difference or ratio comparisons of scores on the
kth factor for subjects i1 and i2, $F_{i_1 k} - F_{i_2 k}$ and $F_{i_1 k}/F_{i_2 k}$. In contrast to classical analysis, the
posterior means and variances of the factor scores (and of factor contrasts) are routinely
obtained.
To illustrate MCMC complete data sampling, assume Σ is diagonal in the conjugate normal
model (9.1) with priors $\sigma_{pp}^{-1} \sim Ga(\alpha_{0p}, \beta_{0p})$, that the precision matrix for F has a Wishart
prior $\Phi^{-1} \sim W(R_0, \rho_0)$, and that the prior for Λ follows the form proposed by Press and
Shigemasu (1989). Specifically, with Λp as the pth row of Λ,

$$\Lambda_p \sim N_Q(\Lambda_{0p}, \sigma_{pp}H_{0p}),$$

where the Q × Q matrix H0p is positive definite. Often, simple assumptions such as
$H_{0p} = I_Q$ are made (Lee and Shi, 2000, p.729). Letting $y_p'$ be the pth row of y, taking F as the
Q × n matrix of factor scores, and denoting
$\Omega_p = (H_{0p}^{-1} + FF')^{-1}$ and $\eta_p = \Omega_p(H_{0p}^{-1}\Lambda_{0p} + Fy_p)$, the posterior conditional for the unique
variances is (Lee and Shi, 2000, p.725)

$$\sigma_{pp}^{-1} \sim Ga\left(\alpha_{0p} + n/2,\; \beta_{0p} + 0.5[y_p'y_p - \eta_p'\Omega_p^{-1}\eta_p + \Lambda_{0p}'H_{0p}^{-1}\Lambda_{0p}]\right).$$

The conditional for Λp is a Q-variate normal with mean ηp and covariance σppΩp, and the
conditional for Φ−1 is Wishart with scale matrix FF′ + R0 and degrees of freedom n + ρ0.
Finally, the conditional p(Fi | y, θ) for the factor scores for subject i is a Q-variate normal
with mean $[\Phi^{-1} + \Lambda'\Sigma^{-1}\Lambda]^{-1}\Lambda'\Sigma^{-1}y_i$ and covariance $[\Phi^{-1} + \Lambda'\Sigma^{-1}\Lambda]^{-1}$.
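The alternation between parameter updates and factor score updates can be illustrated for the simplest case of a single standardised factor (Q = 1, Φ fixed at 1, centred data so that α = 0), where every conditional above reduces to a scalar expression. The sketch below is plain Python on simulated data; the prior settings (a0 = b0 = 1, Λ0p = 0, H0p = 1), the simulated loadings, and the function name are assumptions made purely for illustration, and it is not the sampler used for the examples in this chapter.

```python
import random

def gibbs_one_factor(y, n_iter=400, a0=1.0, b0=1.0, lam0=0.0, h0=1.0, seed=1):
    """Complete-data Gibbs sampler for y_pi = lam_p * F_i + u_pi, u_pi ~ N(0, s_p).

    y is a list of P lists, each holding n centred observations on one indicator.
    The factor variance is fixed at 1, so Phi is not sampled in this sketch.
    """
    rng = random.Random(seed)
    P, n = len(y), len(y[0])
    lam, s = [1.0] * P, [1.0] * P
    # Start scores at subject means so the chain begins in the positive-loading mode
    F = [sum(y[p][i] for p in range(P)) / P for i in range(n)]
    draws = []
    for _ in range(n_iter):
        # (a) unique variances and loadings given current factor scores
        omega = 1.0 / (1.0 / h0 + sum(f * f for f in F))
        for p in range(P):
            eta = omega * (lam0 / h0 + sum(f * v for f, v in zip(F, y[p])))
            rate = b0 + 0.5 * (sum(v * v for v in y[p])
                               - eta * eta / omega + lam0 * lam0 / h0)
            # gammavariate takes (shape, scale); scale = 1 / rate
            s[p] = 1.0 / rng.gammavariate(a0 + n / 2.0, 1.0 / rate)
            lam[p] = rng.gauss(eta, (s[p] * omega) ** 0.5)
        # (b) factor scores given parameters: precision 1 + sum(lam^2 / s)
        prec = 1.0 + sum(l * l / sp for l, sp in zip(lam, s))
        for i in range(n):
            mean_i = sum(l * y[p][i] / sp
                         for p, (l, sp) in enumerate(zip(lam, s))) / prec
            F[i] = rng.gauss(mean_i, prec ** -0.5)
        draws.append((lam[:], s[:]))
    return draws

# Simulated data with assumed true loadings and residual s.d. 0.5
sim = random.Random(42)
true_lam = [0.9, 0.7, 0.5, 0.3]
n = 200
scores = [sim.gauss(0.0, 1.0) for _ in range(n)]
raw = [[l * f + sim.gauss(0.0, 0.5) for f in scores] for l in true_lam]
y = [[v - sum(row) / n for v in row] for row in raw]   # centre each indicator

kept = gibbs_one_factor(y)[100:]                       # discard 100 burn-in draws
post_mean_lam = [sum(d[0][p] for d in kept) / len(kept) for p in range(len(true_lam))]
print([round(m, 2) for m in post_mean_lam])
```

The unique variance for each indicator is drawn first from its conditional with the loading integrated out, and the loading is then drawn given that variance, mirroring the order of the conditionals stated above. No positivity constraint is imposed here, so in less well-identified settings the sign of the loadings could flip between iterations, an issue taken up in Section 9.3.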

9.3 Identifiability and Priors on Loadings


Under the model (9.1), the marginal covariance of y is V = ΛΦΛ′ + Σ. It can be seen that the
contribution ΛΦΛ′ of the factor scores to explaining variation in the y may be achieved
by an infinite number of pairs (Λ,Φ), and constraints must be imposed to ensure a unique
location and scale for the factor scores (Wedel et al., 2003, pp.358–359). One way of pro-
viding factor score identifiability (the scaling constraint) is to define the factors to be in
standardised form, with zero means and variances of 1 (Mezzetti and Billari, 2005). Under
the alternative anchoring constraint (Skrondal and Rabe-Hesketh, 2004), one among the
set of loadings {λpq, p = 1, …, P} on each construct is preset for identification. The factors
are still required to have zero means (providing unique location), but may have unknown
variances.
For the measurement model to be identifiable, the number of unknown parameters in
θ = (Σ,Φ,Λ) must not exceed the number, P(P + 1)/2, of distinct elements in the residual
variance-covariance matrix V of y. For example, in the standardised factor case, and with
Φ = I excluding correlations, one has

$$V = \Lambda\Lambda' + \Sigma,$$

with PQ + P parameters on the right-hand side under a local independence assumption (Σ
taken as diagonal). For P(P + 1)/2 ≥ PQ + P to apply requires that P ≥ 2Q + 1 (Geweke and
Zhou, 1996).
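The counting condition is simple arithmetic: dividing P(P + 1)/2 ≥ PQ + P by P gives (P + 1)/2 ≥ Q + 1, that is P ≥ 2Q + 1. A small plain Python sketch (the function names are hypothetical) makes the implied limit on the number of factors explicit:

```python
def satisfies_counting_rule(P, Q):
    """True if P(P+1)/2 distinct covariance elements cover the PQ + P parameters."""
    return P * (P + 1) // 2 >= P * Q + P

def max_factors(P):
    """Largest Q with P >= 2Q + 1."""
    return (P - 1) // 2

for P in (6, 11):
    Q = max_factors(P)
    print(P, Q, satisfies_counting_rule(P, Q), satisfies_counting_rule(P, Q + 1))
```

So six indicators support at most two factors under this rule, and eleven indicators at most five. The rule is a necessary counting condition only; rotational identification of Λ still has to be dealt with separately, as discussed next.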
In confirmatory models, certain elements of Λ are generally preset to zero, alleviating
requirements such that Σ be diagonal or that Φ exclude covariances/correlations. However,
in exploratory factor analysis (EFA) with multiple factors (Q > 1), additional identifying
constraints must be set to avoid rotation invariance. Otherwise, there is no unique solu-
tion because any orthogonal transformation of Λ leaves the likelihood unchanged (Everitt,
1984, p.16). Thus for F* = H′F and Λ* = ΛH, where HH′ = I,

$$y = 1\alpha + \Lambda F + u = 1\alpha + (\Lambda H)(H'F) + u = 1\alpha + \Lambda^* F^* + u,$$

where cov(F*) = H′cov(F)H = cov(F) when cov(F) = I. The exception is the simple structure case (each
observed variable loading on only one factor) when rotational identifiability is not an issue
(Wedel et al., 2003, p.358; Liu et al., 2005, p.550).
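Rotation invariance can be demonstrated with a two-factor example: rotating the loading matrix by any orthogonal H leaves the common-factor part of the covariance, ΛΛ′, unchanged. The sketch below is plain Python with an assumed 3 × 2 loading matrix and an arbitrary rotation angle; the helper functions are written out so that no external libraries are needed.

```python
import math

def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

theta = 0.7                                   # arbitrary rotation angle (assumed)
H = [[math.cos(theta), -math.sin(theta)],
     [math.sin(theta),  math.cos(theta)]]     # orthogonal: H H' = I

lam = [[0.8, 0.1], [0.6, 0.4], [0.2, 0.7]]    # assumed illustrative loadings
lam_star = matmul(lam, H)                     # rotated loadings, Lambda* = Lambda H

common = matmul(lam, transpose(lam))
common_star = matmul(lam_star, transpose(lam_star))
max_gap = max(abs(a - b)
              for ra, rb in zip(common, common_star)
              for a, b in zip(ra, rb))

print(lam_star)
print(max_gap)   # numerically zero: both loading matrices imply the same covariance
```

The rotated loadings differ from the originals, yet the implied covariance is the same to machine precision, which is why the likelihood alone cannot distinguish between them when Φ = I.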
In other cases, EFA identification may be achieved by fixing enough λpq to ensure a
unique solution; thus in the case Q = 2, setting any λp2 = 0 would be sufficient. Provided the
variables are ordered in such a way as to ensure substantive justification, a widely adopted
option is to assume Λ to be lower triangular, as in Geweke and Zhou (1996), Ghosh and
Dunson (2009), Zhou et al. (2014), and Mavridis and Ntzoufras (2014), namely

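$$\Lambda = \begin{bmatrix}
\lambda_{11} & 0 & 0 & \cdots & 0 & 0 \\
\lambda_{21} & \lambda_{22} & 0 & \cdots & 0 & 0 \\
\lambda_{31} & \lambda_{32} & \lambda_{33} & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & & \vdots & \vdots \\
\lambda_{Q-1,1} & \lambda_{Q-1,2} & \lambda_{Q-1,3} & \cdots & \lambda_{Q-1,Q-1} & 0 \\
\lambda_{Q1} & \lambda_{Q2} & \lambda_{Q3} & \cdots & \lambda_{Q,Q-1} & \lambda_{QQ} \\
\vdots & \vdots & \vdots & & \vdots & \vdots \\
\lambda_{P1} & \lambda_{P2} & \lambda_{P3} & \cdots & \lambda_{P,Q-1} & \lambda_{PQ}
\end{bmatrix}.$$
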
The required structural zeros can be chosen according to prior knowledge, perhaps requir-
ing rearrangement of the indicators. A possible drawback with this constraint is order
dependence (Bhattacharya and Dunson, 2011), whereby the choice of the first Q responses
becomes an important model feature. Conti et al. (2014) avoid assuming a lower triangular
Λ by including identifying criteria into prior densities for model parameters. This leads
to an EFA in which indicators are uniquely allocated to only one factor, but where neither
the number of factors nor the structure of the loading matrix are specified a priori. This
approach is applied in the R package BayesFM (Piatek, 2017).
To avoid potential labelling issues, a lower triangular Λ can be combined with the diago-
nals constraint

$$\lambda_{qq} > 0.$$

If the λqq are unknowns under a standardised factor scale with Φ = I, one might take

$$\lambda_{qq} \sim N(0, d_{qq})\,I(0, \infty), \tag{9.3}$$

or some other positive prior (e.g. lognormal). Otherwise, without such a constraint, and
since ΛF = (−Λ)(−F), loadings on (and hence scores for) a particular factor may flip over
during MCMC iterations (Geweke and Zhou, 1996, p.566). In fact, this may happen even
if a necessarily positive prior, as in (9.3), is adopted. The effectiveness of the qth indicator,
in acting as a “factor founder” (Aßmann et al., 2016) or “anchor item,” and hence guiding
the remaining loadings on the qth factor, may be influenced in substantive applications by
the ordering of indicators (see Example 9.2). This may be so in applications with a large
number of indicators and/or relatively modest correlations.
To completely avoid possible label-switching, a positivity constraint may be applied to
all loadings (Ghosh and Dunson, 2009; Sahu, 2002). A positivity constraint on all discrimination
parameters (loadings) is in fact standard in item response theory (IRT) (Section 9.4.2) (Natesan et al.,
2016; Luo and Jiao, 2017). Setting one loading for each construct to be fixed (usually at 1.0)
under an anchoring constraint, also usually ensures remaining loadings conform to a con-
sistent interpretation and direction of the factor (Levy and Mislevy, 2016).

9.3.1 An Illustration of Identifiability Issues


To exemplify identifiability constraints, consider a spatial example involving English
local authorities, and suppose six observed indicators { y1 , … , y6 } are taken to measure
two latent area constructs F1 and F2, deprivation and fragmentation. Thus, several studies
have shown that area material deprivation (i.e. economic hardship represented
by observed variables such as high unemployment, and low car and home ownership)
tends to be associated with higher psychiatric morbidity and suicide mortality (Gunnell
et  al., 1995). So also does social fragmentation, meaning relatively weak community
ties associated with observed indices such as one person households, high population
turnover and many adults outside married relationships (Evans et al., 2004). Indicators
{ y1 , y 2 , y 3 } of deprivation are provided by square roots (a normalising transform) of the
UK Census rates of renting from social (public sector) landlords, and of unemployment
among the economically active, together with the square root of the rate of households
claiming income support. Indicators { y 4 , y 5 , y6 } of social fragmentation are provided by
square roots of census rates of one person households, migration in the precensal year,
and people over 15 not married.
A confirmatory factor model (with simple structure) is assumed with {y1, …, y3} loading
only on a deprivation score F1, and with {y4, …, y6} loading only on a fragmentation score
F2. Let Dpi be the denominator (e.g. total population) used to define the transformed census
index ypi. Then the measurement model has the form

$$\begin{aligned}
y_{1i} &= \alpha_1 + \lambda_{11}F_{1i} + u_{1i} \\
y_{2i} &= \alpha_2 + \lambda_{21}F_{1i} + u_{2i} \\
y_{3i} &= \alpha_3 + \lambda_{31}F_{1i} + u_{3i} \\
y_{4i} &= \alpha_4 + \lambda_{42}F_{2i} + u_{4i} \\
y_{5i} &= \alpha_5 + \lambda_{52}F_{2i} + u_{5i} \\
y_{6i} &= \alpha_6 + \lambda_{62}F_{2i} + u_{6i}
\end{aligned}$$

where the upi are mutually uncorrelated, with $u_{pi} \sim N(0, \tau_p/D_{pi})$ (Hogan and Tchernis, 2004,
p.316).
Since F1 and F2 have arbitrary location and scale, one way of providing identifiability
(the variance scaling or standardisation constraint) is to define them to be in standard
form with zero means and variances of 1 (while still possibly allowing a non-zero corre-
lation between the two factors, which is possible under this confirmatory model). Under
the alternative anchoring constraint (Skrondal and Rabe-Hesketh, 2007), one loading on
each construct is preset for identification, for example, λ11 = λ42 = 1. The Fqi may be assumed
independent of one another, although correlation over areas i may still be incorporated
via two separate univariate CAR (conditional autoregressive) priors (Besag et  al., 1991).
Alternatively, correlation both between factors and over areas may be assumed, so that
{F1i,F2i} follow a bivariate CAR prior (see Section 9.6). Under an anchoring constraint, the
within area factor covariance matrix would then contain three unknowns {ϕ11, ϕ22, ρ},

$$\Phi = \begin{pmatrix} \phi_{11} & \rho\sqrt{\phi_{11}\phi_{22}} \\ \rho\sqrt{\phi_{11}\phi_{22}} & \phi_{22} \end{pmatrix},$$

whereas under a standardisation constraint, the diagonal elements in Φ are set to 1, and
only ρ would be unknown.
Adopting an anchoring constraint has utility in helping to prevent “relabelling” of the
construct scores Fqi during MCMC sampling. Since the indicators { y1 , … , y 3 } in this exam-
ple are positive measures of material deprivation, setting λ11 = 1 is consistent with the con-
struct F1i being a positive deprivation measure. If, however, one adopted the standardised
factor assumption with ϕpp = 1 and all the λpq free, it would be necessary, in order to prevent
label switching, to set a prior on one or possibly more loadings constraining positivity, for
example,

$$\lambda_{p1} \sim N(1, 1)\,I(0, \infty),$$

$$\lambda_{p2} \sim N(1, 1)\,I(0, \infty),$$

for one or more p.

Example 9.1 Wechsler Intelligence Scale


This example involves a dataset used by Tabachnick and Fidell (2006) to illustrate con-
firmatory factor analysis, and consisting of subtest scores on the second version of the
Wechsler Intelligence Test for Children (WISC-R). The 175 subjects are school-aged
children diagnosed as learning-disabled. The Wechsler Intelligence Scale for Children-
Revised (WISC-R) is a general test of intelligence, defined as “the global capacity of the
individual to act purposefully, to think rationally, and to deal effectively with his envi-
ronment.” Considering intelligence as an aggregate of mental aptitudes, the WISC-R
data here consists of 11 tests divided into two groups, verbal and performance. The
six verbal tests (Information, Comprehension, Arithmetic, Similarities, Vocabulary,
Digit Span) use language-based items, whereas the five performance tests (Picture
Completion, Picture Arrangement, Block Design, Object Assembly, Coding) are visual-
motor in character. In all analyses below, the observations are standardised. Item scores
range from 0 to 20 and are considered continuous.
Two models are compared using rstan, with factor score covariation modelled using
the cholesky_factor_corr type. The first is an exact confirmatory factor analysis,
with factor 1 corresponding to the verbal tests, and factor 2 to the performance items.
Exact zeros are therefore used for the loadings (λ71, λ81, λ91, λ10,1, λ11,1) of the performance
items on factor 1, and for the loadings (λ12, λ22, λ32, λ42, λ52, λ62) of the verbal items on
factor 2. Thus, with exact zero loadings not shown, one has

$$\begin{aligned}
y_{1i} &= \alpha_1 + \lambda_{11}F_{1i} + u_{1i} \\
y_{2i} &= \alpha_2 + \lambda_{21}F_{1i} + u_{2i} \\
y_{3i} &= \alpha_3 + \lambda_{31}F_{1i} + u_{3i} \\
y_{4i} &= \alpha_4 + \lambda_{41}F_{1i} + u_{4i} \\
y_{5i} &= \alpha_5 + \lambda_{51}F_{1i} + u_{5i} \\
y_{6i} &= \alpha_6 + \lambda_{61}F_{1i} + u_{6i} \\
y_{7i} &= \alpha_7 + \lambda_{72}F_{2i} + u_{7i} \\
y_{8i} &= \alpha_8 + \lambda_{82}F_{2i} + u_{8i} \\
y_{9i} &= \alpha_9 + \lambda_{92}F_{2i} + u_{9i} \\
y_{10,i} &= \alpha_{10} + \lambda_{10,2}F_{2i} + u_{10,i} \\
y_{11,i} &= \alpha_{11} + \lambda_{11,2}F_{2i} + u_{11,i}
\end{aligned}$$

with uncorrelated normally distributed upi. In the second model, exact zeros are
replaced by approximate zeros specified using informative normal priors with a small
variance of 0.01, so that 95% of the prior variation is between −0.2 and 0.2 (Muthén and
Asparouhov, 2012).
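The prior range quoted for the approximate zeros follows directly from the normal quantiles: a variance of 0.01 corresponds to a standard deviation of 0.1, so the central 95% prior interval is roughly ±1.96 × 0.1. A quick check in plain Python, using the standard library `statistics.NormalDist` class:

```python
from statistics import NormalDist

# N(0, 0.01) prior on a cross-loading: variance 0.01, standard deviation 0.1
prior = NormalDist(mu=0.0, sigma=0.01 ** 0.5)

lower = prior.inv_cdf(0.025)
upper = prior.inv_cdf(0.975)
print(round(lower, 3), round(upper, 3))
```

A posterior interval for a cross-loading that lies clearly away from zero despite this tight prior therefore indicates that the data are pulling the loading away from the value favoured by simple structure.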
The second model has a lower WAIC (widely applicable information criterion) than the exact
zero CFA, namely 4828 compared to 4841. Table 9.1 compares the two sets of estimated
loadings. Estimated main loadings under the second model are similar to those under
the exact zero CFA, and both show that indicator 11 (Coding) is essentially unrelated
to the second factor (the loading λ11,2). The second model also suggests a significant
cross-loading (λ22) of indicator 2 (Comprehension) on the performance factor, with 95%
posterior interval (0.07,0.34).
A third analysis via rjags uses binary factor-indicator selection indicators (Mavridis
and Ntzoufras, 2014), while also retaining the approximate zero prior formulation.
Spike-slab priors are adopted on the selection indicators. Thus cross-loadings
(λ71, λ81, λ91, λ10,1, λ11,1) and (λ12, λ22, λ32, λ42, λ52, λ62) are assigned informative N(0,0.01)
priors, while for identifiability the main loadings are assigned N(0,1) priors constrained to
positive values.
One objective of this analysis is to detect indicators not relevant to the postulated
confirmatory analysis scheme. The analysis with rstan indicated that indicator 11 may
not be relevant. Selection indicators γpq therefore have Bernoulli probabilities πp that are
indicator specific. For indicator p, one has

$$y_{pi} = \alpha_p + \gamma_{p1}\lambda_{p1}F_{1i} + \gamma_{p2}\lambda_{p2}F_{2i} + u_{pi},$$

$$\gamma_{pq} \sim \mathrm{Bernoulli}(\pi_p),$$
TABLE 9.1
Posterior Summary. Exact Zero vs Approximate Zero
Confirmatory Factor Analysis
Analysis 1 (Cross Loadings Preset to Zero)
Mean St Devn 2.5% 97.5%
λ11 0.77 0.07 0.63 0.91
λ21 0.70 0.07 0.56 0.85
λ31 0.57 0.08 0.42 0.73
λ41 0.71 0.07 0.57 0.86
λ51 0.78 0.07 0.64 0.93
λ61 0.39 0.08 0.24 0.55
λ72 0.59 0.09 0.42 0.76
λ82 0.47 0.09 0.3 0.65
λ92 0.69 0.08 0.53 0.85
λ10,2 0.57 0.08 0.41 0.74
λ11,2 0.11 0.07 0.01 0.26
Factor Correlation 0.58 0.08 0.41 0.72
Analysis 2 (Informative Prior on Cross Loadings)
Main loadings Mean St Devn 2.5% 97.5%
λ11 0.78 0.07 0.65 0.92
λ21 0.59 0.08 0.44 0.74
λ31 0.55 0.08 0.39 0.71
λ41 0.64 0.08 0.49 0.79
λ51 0.76 0.08 0.61 0.91
λ61 0.39 0.08 0.23 0.55
λ72 0.55 0.10 0.36 0.73
λ82 0.43 0.09 0.25 0.61
λ92 0.65 0.09 0.47 0.83
λ10,2 0.56 0.09 0.39 0.75
λ11,2 0.09 0.06 0 0.24
Cross Loadings Mean St Devn 2.5% 97.5%
Loadings of Performance Items on Verbal Factor
λ71 0.10 0.06 0.01 0.24
λ81 0.08 0.05 0.01 0.19
λ91 0.08 0.05 0 0.20
λ10,1 0.05 0.04 0 0.15
λ11,1 0.06 0.04 0 0.17
Loadings of Verbal Items on Performance Factor
λ12 0.04 0.03 0 0.13
λ22 0.20 0.07 0.07 0.34
λ32 0.06 0.05 0 0.17
λ42 0.14 0.06 0.02 0.27
λ52 0.06 0.05 0 0.17
λ62 0.05 0.04 0 0.14
Factor Correlation 0.35 0.11 0.12 0.57
TABLE 9.2
Posterior Summary. Predictor Selection Combined with Approximate Zero CFA
Main loadings Mean St Devn 2.5% 97.5% Mean Selection Probability (γjk)
λ11 1.15 0.29 0.64 1.80 1
λ21 0.84 0.23 0.45 1.37 1
λ31 0.81 0.22 0.44 1.29 1
λ41 0.92 0.24 0.49 1.47 1
λ51 1.11 0.28 0.62 1.72 1
λ61 0.59 0.19 0.28 1.01 1
λ72 0.65 0.22 0.29 1.14 1
λ82 0.50 0.19 0.19 0.93 1
λ92 0.74 0.27 0.31 1.35 1
λ10,2 0.65 0.23 0.28 1.17 1
λ11,2 0.02 0.06 0.00 0.21 0.14
Cross Loadings Mean St Devn 2.5% 97.5%
Loadings of performance items on verbal factor
λ71 0.03 0.07 −0.11 0.20 0.67
λ81 0.01 0.07 −0.13 0.17 0.63
λ91 0.01 0.07 −0.14 0.16 0.64
λ10,1 −0.03 0.07 −0.19 0.11 0.65
λ11,1 0.01 0.05 −0.09 0.15 0.36
Loadings of verbal items on performance factor
λ12 −0.06 0.07 −0.22 0.06 0.71
λ22 0.15 0.08 0.00 0.30 0.93
λ32 −0.01 0.06 −0.15 0.12 0.60
λ42 0.08 0.08 −0.04 0.24 0.80
λ52 −0.01 0.06 −0.16 0.10 0.61
λ62 −0.04 0.07 −0.20 0.08 0.67
Factor Correlation 0.54 0.10 0.33 0.71

\[ \pi_p \sim \text{Beta}(1, 1), \]

with the covariance matrix of the factor scores taken as unknown.


Table 9.2 shows the posterior summaries for the realised loadings $\gamma_{pq}\lambda_{pq}$ and means for
the selection probabilities, $\Pr(\gamma_{pq} = 1)$. It can be seen that there is support for a significant
loading of indicator 2 (Comprehension) on the performance factor, with posterior mean $\bar{\gamma}_{22} = 0.93$, and
so a marginal Bayes factor of 14 for $\gamma_{22} = 1$. Also $\bar{\gamma}_{11,2} = 0.14$, suggesting that indicator
11 (Coding) is essentially unrelated to the second factor.
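The step from a posterior selection probability to a marginal Bayes factor is the ratio of posterior odds to prior odds. The sketch below is illustrative Python with a hypothetical function name; it assumes a prior inclusion probability of 0.5, which is the prior mean of $\pi_p$ under the Beta(1,1) prior. Using the rounded value 0.93 from Table 9.2 gives a Bayes factor slightly above 13, so the reported value of 14 presumably reflects unrounded output.

```python
def inclusion_bayes_factor(post_prob, prior_prob=0.5):
    """Bayes factor for gamma = 1: posterior odds divided by prior odds."""
    post_odds = post_prob / (1.0 - post_prob)
    prior_odds = prior_prob / (1.0 - prior_prob)
    return post_odds / prior_odds

# Rounded posterior selection probabilities taken from Table 9.2
bf_22 = inclusion_bayes_factor(0.93)     # Comprehension on the performance factor
bf_11_2 = inclusion_bayes_factor(0.14)   # Coding on the performance factor
print(bf_22, bf_11_2)
```

A Bayes factor below 1, as for the Coding loading, counts as evidence against retaining the loading.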

Example 9.2 Job Applicant Data


This example illustrates the application of the parameter expansion method, as in (9.2),
combined with hierarchical priors on loadings. The data are from Kendall (1975) and
correspond to 48 applicants for a position in a firm who have been judged on 15 variables,
treated as continuous $(y_1, \ldots, y_{15})$: form of letter of application; appearance; academic
ability; likeability; self-confidence; lucidity; honesty; salesmanship; experience; drive;
ambition; grasp; potential; keenness to join; and suitability.
The third and fourth columns of Table 9.3 show the results of a maximum likelihood
factor analysis in Stata with two factors; the first two factors account for 59% of the
TABLE 9.3
Estimated Factor Loadings, Two Factors. Applicant Data
Columns 3 and 4: Maximum Likelihood (Factor 1, Factor 2). Columns 5 to 10: Bayesian EFA, Posterior Summary (Factor 1, then Factor 2)
Variable ML Factor 1 ML Factor 2 Factor 1 Mean 2.5% 97.5% Factor 2 Mean 2.5% 97.5%
y1 Form of Letter 0.37 0.56 0.62 0.33 0.93 0.00 −0.17 0.19
y2 Appearance 0.53 0.03 0.16 −0.21 0.54 0.44 0.09 0.79
y3 Academic ability 0.12 0.19 0.21 −0.14 0.58 −0.01 −0.38 0.35
y4 Likeability 0.51 0.08 0.20 −0.17 0.60 0.38 −0.01 0.74
y5 Self confidence 0.84 −0.39 −0.22 −0.67 0.36 0.98 0.67 1.36
y6 Lucidity 0.88 −0.18 0.02 −0.40 0.57 0.88 0.52 1.23
y7 Honesty 0.36 −0.29 −0.18 −0.57 0.22 0.46 0.11 0.83
y8 Salesmanship 0.91 −0.05 0.16 −0.26 0.68 0.83 0.42 1.17
y9 Experience 0.32 0.73 0.81 0.48 1.21 −0.16 −0.68 0.33
y10 Drive 0.86 0.10 0.30 −0.09 0.77 0.68 0.25 1.03
y11 Ambition 0.89 −0.14 0.05 −0.37 0.59 0.87 0.51 1.22
y12 Grasp 0.90 0.00 0.21 −0.20 0.72 0.79 0.37 1.13
y13 Potential 0.89 0.09 0.31 −0.09 0.79 0.71 0.27 1.06
y14 Keenness to join 0.63 0.06 0.21 −0.16 0.61 0.50 0.12 0.85
y15 Suitability 0.61 0.67 0.83 0.49 1.22 0.11 −0.43 0.58

variance in the original indicators. It can be seen that the second factor emphasises form
of letter (y1), experience (y9), and suitability (y15), while the first factor is more generic,
with loadings over 0.8 on seven of the 15 indicators. Similar results are obtained using
the R package lavaan.
With intercorrelation between the two factors allowed under an EFA approach, there
are $Q^2 = 4$ restrictions needed when loadings are treated as fixed effects parameters
(Merkle and Wang, 2016). Under fixed effects priors, and counting of parameter restrictions
under a degrees of freedom approach, this can be achieved by (a) assuming standardised
factors in the reference model; (b) setting $\lambda_{12}^* = \lambda_{12} = 0$ (the loading of indicator 1
on factor 2) for a lower triangular structure; and (c) setting an additional loading on the
first factor to zero (e.g. $\lambda_{p1}^* = \lambda_{p1} = 0$ for some p). However, setting one of the loadings $\lambda_{p1}^*$
on the first factor to be zero is potentially arbitrary, and may affect substantive findings.
We avoid this, and formal parameter counting, by (a) adopting an approximate zero
prior on $\lambda_{12}^*$, namely $\lambda_{12}^* \sim N(0, 0.01)$, instead of an exact zero, and by (b) adopting hierarchical
priors, rather than fixed effects priors, on the remaining $\lambda_{p2}^*$, and on the $\lambda_{p1}^*$. Thus

\[ \lambda_{p1}^* \sim N(0, \omega_1), \]

\[ \lambda_{p2}^* \sim N(0, \omega_2) \quad (p > 1), \]

where ω1 and ω2 are unknown variances. Because of the approximate zero restriction on
$\lambda_{12}^*$, one might anticipate that factor 1 in the EFA would be the most relevant to explaining
indicator y1 (form of application letter).
A two-chain run of 20,000 iterations in jagsUI shows estimated loadings under
Bayesian and MLE estimation as in Table 9.3. The estimated loadings λp1 on factor 1 are
highest for y1, y9, and y15, with other loadings having credible intervals straddling zero.
By contrast, the second factor is a more generic factor, similar to factor 1 in the MLE
estimation, with highest loadings on y5, y6, y8, y10, y11, y12, and y13. The 90% highest pos-
terior density interval for the factor correlation (rho in the jags code) is (0.04,0.75). Both
types of analysis (Bayesian and maximum likelihood) show low loadings of academic
ability (y3) on the first two factors, and one might adopt a variable selection approach to
confirm its relevance, or add an additional factor.
A maximum likelihood analysis with three factors identifies a third factor, loading
highly on likeability (y4), honesty (y7), and keenness to join (y14). A Bayesian analysis
with the implicit constraint λqq > 0 identifies a similar factor (as factor 3) if the originally
labelled indicators y3 and y4 are reordered, so that likeability becomes y3. The Bayesian
analysis then identifies a factor with high positive loadings on likeability and honesty
(y3 and y7 in the revised sorting), with respective 90% hpd intervals (0.15,1.34) and
(−0.04,1.21). This factor is not detected if the original ordering of indicators is retained.

9.4 Multivariate Exponential Family Outcomes and Generalised Linear Factor Models

The normal linear factor and structural equation models considered above extend
straightforwardly to generalised linear factor and SEM models for non-normal data from
the exponential family density: namely binomial, Poisson, and multinomial or ordinal
data. Consider multivariate observations $y_i = (y_{1i}, \ldots, y_{Pi})'$ that, conditional on factor scores
$F_i = (F_{1i}, \ldots, F_{Qi})'$, follow an exponential family density, namely

\[ p(y_{pi} \mid F_i) \propto \exp\left\{ \frac{y_{pi}\theta_{pi} - b(\theta_{pi})}{\phi_{pi}} + c(y_{pi}, \phi_{pi}) \right\}, \]

where $\theta_{pi}$ is the canonical parameter, with the $\phi_{pi}$ typically taken as known scale parameters.
Denoting regression terms as $\eta_{pi} = g(\theta_{pi})$ where g is a link function, and $\eta_i = (\eta_{1i}, \ldots, \eta_{Pi})'$,
intercept $\alpha = (\alpha_1, \ldots, \alpha_P)'$ and P × Q loading matrix Λ, the regression term without extra-variation
is

\[ \eta_i = \alpha + \Lambda F_i, \]

while allowing extra-variation

\[ \eta_i = \alpha + \Lambda F_i + u_i, \]

with $u_i = (u_{1i}, \ldots, u_{Pi})'$, where the $u_{pi}$ are independent of each other under conditional
independence. The errors u (if present) and factor scores F are also independent.
Normality of errors and factors is often assumed with $(u_{1i}, \ldots, u_{Pi})' \sim N_P(0, \Sigma)$, where Σ is
diagonal, and $F_i \sim N_Q(0, \Phi)$, where Φ may be non-diagonal according to the form of model
(e.g. exploratory or confirmatory) assumed. Compared to the normal data-normal factor
model, the marginal densities of y are no longer simply derived, but involve integration
over F, namely

\[ p(y_i \mid \theta, \psi) = \int \prod_{p=1}^{P} p(y_{pi} \mid F_i, \theta)\, p(F_i \mid \psi)\, dF_i, \]
where ψ are hyperparameters defining the density of $F_i$. The usual conditional independence
assumptions are made. For example, for a P-variate categorical response ($K_p$ categories
for the pth response), the conditional probability that subject i with factor scores
$F_i = (F_{1i}, \ldots, F_{Qi})'$ exhibits a particular set of responses is the product of separate categorical
likelihoods

\[ \Pr(y_{1i} = k_1, y_{2i} = k_2, \ldots, y_{Pi} = k_P \mid F_i) = \Pr(y_{1i} = k_1 \mid F_i)\Pr(y_{2i} = k_2 \mid F_i) \cdots \Pr(y_{Pi} = k_P \mid F_i). \]

For factor reduction of binary, multinomial, or ordinal data, there may be benefit (e.g. in
simplified MCMC sampling algorithms) in considering latent variables posited to underlie
the observed discrete responses. The missing data then consists not only of factor scores
but of the latent scale data $y_{pi}^*$ that underlie the observed data $y_{pi}$. Thus for $y_{pi}$ binary, with
$y_{pi} = 1$ if $y_{pi}^* > 0$ and $y_{pi} = 0$ otherwise, one might take $y_i^* = (y_{1i}^*, \ldots, y_{Pi}^*)$ to be normal or logistic,
with the diagonal terms in the unique covariance matrix Σ set (usually to 1) for identifiability.
For instance, a normal model taking the underlying responses to be conditionally
independent given the factors would be

\[ y_{pi}^* \sim N(\alpha_p + \Lambda_p F_i, 1)\, I(A_{pi}, B_{pi}), \]

where $\Lambda_p$ is the pth row of Λ, and the truncation ranges are determined by the observed $y_{pi}$.

9.4.1 Multivariate Count Data


Factor models with Q < P may be more parsimonious than full dimension error models
for multivariate exponential family data. However, multivariate reduction may not
always be preferred in terms of fit, so parsimony may sometimes be at the expense of
predictions that reproduce the data satisfactorily. Chib and Winkelmann (2001) illustrate
how multivariate count data may not always be suitable for reduction using latent
factors. In their full-dimension model $y_{pi} \sim \text{Po}(\mu_{pi})$, with outcome-specific predictors
$x_{pi} = (x_{1pi}, x_{2pi}, \ldots, x_{Rpi})'$ and

\[ \mu_{pi} = \exp(\beta_p x_{pi} + u_{pi}), \]

\[ (u_{1i}, \ldots, u_{Pi}) \sim N_P(0, D), \]

where D is an unrestricted covariance matrix; see also Inouye et al. (2017) on Poisson mixture
formulations, and Rodrigues-Motta et al. (2013) for the case where $y_{pi}$ may follow
different count densities. The $y_{pi}$ are conditionally independent given the correlated errors
$u_i = (u_{1i}, \ldots, u_{Pi})'$. Defining $v_{pi} = \exp(u_{pi})$, one has equivalently $y_{pi} \sim \text{Po}(\lambda_{pi} v_{pi})$ with

\[ \lambda_{pi} = \exp(\beta_p x_{pi}), \]

\[ (v_{1i}, \ldots, v_{Pi}) \sim LN_P(\mu_v, \Sigma_v). \]

That is, the $v_{pi}$ are multivariate lognormal with mean vector $\mu_v = \exp(0.5\,\text{diag}(D))$, and
covariance $\Sigma_v = \text{diag}(\mu_v)[\exp(D) - 11']\,\text{diag}(\mu_v)$.
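The lognormal moments above can be evaluated element by element, reading exp(D) as the elementwise exponential of D rather than a matrix exponential. The following is a minimal Python sketch under that reading; the function name and the example matrix D are hypothetical.

```python
import math

def lognormal_moments(D):
    """Mean vector and covariance of v = exp(u) when u ~ N(0, D).

    D is a covariance matrix given as a list of lists; exp(D) is taken
    elementwise, matching diag(mu_v)[exp(D) - 11']diag(mu_v).
    """
    P = len(D)
    mean = [math.exp(0.5 * D[p][p]) for p in range(P)]
    cov = [[mean[p] * (math.exp(D[p][q]) - 1.0) * mean[q] for q in range(P)]
           for p in range(P)]
    return mean, cov

# Illustrative 2 x 2 covariance matrix for the normal errors u
D = [[0.5, 0.2], [0.2, 0.3]]
mu_v, Sigma_v = lognormal_moments(D)
print(mu_v)
print(Sigma_v)
```

The diagonal entries reproduce the familiar univariate lognormal variance, $(e^{\sigma^2} - 1)e^{\sigma^2}$, and a positive covariance in D induces a positive covariance among the multiplicative effects $v_{pi}$.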
Other ways to generate correlated count data include the overlapping sums technique
(Madsen and Dalthorp, 2007). Thus consider independent Poisson variables $Z_{12}$, $Z_1$ and $Z_2$
with means $\theta_{12}$, $\theta_1$ and $\theta_2$; then $y_1 = Z_1 + Z_{12}$ and $y_2 = Z_2 + Z_{12}$ are correlated with marginal
means $\theta_1 + \theta_{12}$ and $\theta_2 + \theta_{12}$ and covariance $\theta_{12}$. The mean and covariance of the corresponding
joint Poisson density for three variables are provided by Karlis and Meligkotsidou (2005,
p.257).
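The overlapping sums construction is easy to check by simulation. The sketch below is illustrative Python using only the standard library; the Poisson sampler (Knuth's multiplication method), the function names and the parameter values are all choices made for this illustration and do not come from the cited papers.

```python
import math
import random

def rpois(mean, rng):
    """Poisson draw by Knuth's multiplication method (fine for small means)."""
    limit = math.exp(-mean)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def overlapping_sums(theta1, theta2, theta12, n, seed=1):
    """Simulate n pairs (y1, y2) with y1 = Z1 + Z12 and y2 = Z2 + Z12."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        z12 = rpois(theta12, rng)  # shared component inducing the covariance
        pairs.append((rpois(theta1, rng) + z12, rpois(theta2, rng) + z12))
    return pairs

def sample_cov(pairs):
    """Unbiased sample covariance of the two coordinates."""
    n = len(pairs)
    m1 = sum(a for a, _ in pairs) / n
    m2 = sum(b for _, b in pairs) / n
    return sum((a - m1) * (b - m2) for a, b in pairs) / (n - 1)

pairs = overlapping_sums(theta1=1.0, theta2=3.0, theta12=2.0, n=20000)
cov_hat = sample_cov(pairs)
print(cov_hat)
```

With these settings the sample covariance should be close to $\theta_{12} = 2$, and the sample means close to $\theta_1 + \theta_{12} = 3$ and $\theta_2 + \theta_{12} = 5$. Note that the construction can only produce non-negative covariances.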
Factor models for count data typically may include both normal factor scores and residuals
$u_{pi}$, taken as uncorrelated if the usual conditional independence assumption is made.
Thus

\[ y_{pi} \sim \text{Po}(\mu_{pi}), \]

\[ \mu_{pi} = \exp(\beta_p x_{pi} + \Lambda_p F_i + u_{pi}), \]

where $\Lambda_p$ is 1 × Q, $F_i = (F_{1i}, \ldots, F_{Qi})'$ and under a standardised factor constraint $F_i \sim N_Q(0, R_F)$
where $R_F$ is a correlation matrix, with possibly unknown off-diagonal terms subject to
identifiability. Alternatively, Wedel et al. (2003) consider gamma distributed factors in an
identity link model, as well as normal F scores combined with a log link. Gamma factors
would have mean 1 to avoid location invariance, and taking their cumulative impact to be
multiplicative, one could have

\[ \mu_{pi} = \exp(\beta_p x_{pi} + u_{pi}) \left( F_{1i}^{\omega_{p1}} F_{2i}^{\omega_{p2}} \cdots F_{Qi}^{\omega_{pQ}} \right) \]

with comparable identification restrictions on the loadings $\Omega_p = (\omega_{p1}, \ldots, \omega_{pQ})$ to those in
the normal linear factor model (see also Dunson and Herring, 2005). The constraints would
differ according to whether the variance of the F scores were unknown, as in

\[ F_{qi} \sim \text{Ga}(\varphi_q, \varphi_q), \]

with $\varphi_q$ to be estimated, or whether the variance of F is preset, as in $F_{qi} \sim \text{Ga}(1, 1)$.


An alternative to outcome-specific residuals $u_{pi}$ in the above models is a common residual
factor, especially when the $F_i$ are derived as part of a broader structural model involving
further observed indicators. For example, consider count observations $y_{pi}$ $(p = 1, \ldots, P)$
on clinical outcomes for a set of hospitals, while also available are metric measures $x_{ri}$
$(r = 1, \ldots, R)$ of resource inputs, efficiency, etc. The latter variables are relevant to defining
a multivariate latent “care quality” construct $F_i$ in a MIMIC framework that assists
in explaining the clinical outcomes, but this construct may not explain all the covariation
among (or overdispersion in) the y variables, so that correlated residuals and/or common
residual factors are needed. The structural equations for the metric data might take the
form (for standardised x)

\[ F_{qi} = \beta_q x_i + w_{qi}, \]

while the errors in the Poisson likelihood measurement equations for $y_{pi} \sim \text{Po}(\mu_{pi})$ are correlated
over outcomes under a common factor model

\[ \log(\mu_{pi}) = \alpha_p + \Lambda_p F_i + \kappa_p u_i. \]

Assuming $u_i$ is univariate, one of the loadings $\kappa_p$ is preset if the variance $\sigma_u^2$ of the common
residual scores $u_i$ is unknown. Spatial applications of common factors are exemplified by
Wang and Wall (2003) and Nethery et al. (2015).
9.4.2 Multivariate Binary Data and Item Response Models


As for counts, models for P-variate binary outcomes $y_{pi} \sim \text{Bern}(\pi_{pi})$ may retain the observed
binary data likelihood and represent joint or residual correlations by additive full-dimension
multivariate effects $u_{pi}$, for example,

\[ \text{logit}(\pi_{pi}) = \beta_p x_{pi} + u_{pi}, \]

\[ (u_{1i}, \ldots, u_{Pi}) \sim N_P(0, D), \]

where D is an unrestricted covariance matrix. By contrast, multivariate probit or logit
models may also follow from an augmented data perspective in which unobserved metric
variables $y_i^* = (y_{1i}^*, y_{2i}^*, \ldots, y_{Pi}^*)$ result in the observed binary vector (Chen and Dey, 1998;
Chen and Dey, 2000).
Thus, Bayesian estimation of the multivariate probit typically involves augmenting the
data with the latent normal variables obtained by truncated multivariate normal sampling
(Chib and Greenberg, 1998; Talhouk et al., 2012). Denoting $\eta_{pi} = \beta_p x_{pi}$ and $\eta_i = (\eta_{1i}, \ldots, \eta_{Pi})'$,

\[ (y_{1i}^*, y_{2i}^*, \ldots, y_{Pi}^*) \sim N_P(\eta_i, \Sigma)\, I(A_i, B_i), \qquad (9.4) \]

with the observations generated according to

\[ y_{pi} = I\{ y_{pi}^* > 0 \}. \]

The lower and upper sampling limits in the vectors $A_i = (A_{1i}, \ldots, A_{Pi})$ and $B_i = (B_{1i}, \ldots, B_{Pi})$
depend on the observations: sampling of the constituent $y_{pi}^*$ is confined to values above
zero when $y_{pi} = 1$, and to zero or negative values when $y_{pi} = 0$. Scale mixtures of multivariate
normal densities for the $y_{pi}^*$ are also possible, and equivalent to a multivariate Student t for
$(y_{1i}^*, y_{2i}^*, \ldots, y_{Pi}^*)$, which for particular degrees of freedom approximates a multivariate logit
link (Chen and Dey, 1998). A multivariate logit regression may be achieved directly with
suitable mixing strategies (Chen and Dey, 2000; O’Brien and Dunson, 2004).
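The truncation rule can be illustrated for a single latent response with unit variance, which is the building block when components are updated one at a time. The sketch below is illustrative Python using inverse-CDF sampling; it is not the multivariate sampler of the cited papers, the function name is hypothetical, and the clamping of the uniform draw is a numerical safeguard that is only adequate for moderate values of the linear predictor.

```python
import random
from statistics import NormalDist

def draw_latent(eta, y, rng):
    """Draw y* ~ N(eta, 1) restricted to (0, inf) if y == 1, else (-inf, 0]."""
    std = NormalDist()
    p0 = std.cdf(-eta)              # Pr(y* <= 0) for the untruncated normal
    u = rng.random()
    if y == 1:
        q = p0 + u * (1.0 - p0)     # uniform on the upper part (p0, 1)
    else:
        q = u * p0                  # uniform on the lower part (0, p0)
    q = min(max(q, 1e-12), 1.0 - 1e-12)  # keep inv_cdf away from 0 and 1
    return eta + std.inv_cdf(q)

rng = random.Random(42)
y_star_pos = draw_latent(0.3, 1, rng)   # latent draw consistent with y = 1
y_star_neg = draw_latent(0.3, 0, rng)   # latent draw consistent with y = 0
print(y_star_pos, y_star_neg)
```

In a full multivariate probit, each $y_{pi}^*$ would instead be drawn from its conditional normal given the other latent responses, with the same sign restriction applied.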
The covariance matrix Σ in (9.4) is not identified, and when the predictor effects vary by
response, only the correlation matrix can be identified (Rossi et al., 2005). The identification
criteria for the multivariate probit differ from those of the multinomial probit where identification
is obtained by setting one of the diagonal variance elements $\sigma_{pp}$ (e.g. the first) to 1
(McCulloch et al., 2000). It is possible to sample the correlation matrix R directly (Barnard
et al., 2000; Chib and Greenberg, 1998). Talhouk et al. (2012) use a parameter expansion
method to sample R, and the LKJ (Lewandowski, Kurowicka and Joe) prior may be used
(see Example 9.6). One may also (Edwards and Allenby, 2003; McCulloch and Rossi, 1994)
sample the Σ matrix or its inverse from an unrestricted prior, and then scale both the fixed
effects and the covariance matrix to their identified forms, namely

\[ \beta_p^* = \beta_p / \sqrt{\sigma_{pp}}, \]

and the correlation matrix

\[ R = D \Sigma D, \]

where $D = \text{diag}(\sigma_{pp}^{-1/2})$.
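The rescaling to identified forms amounts to dividing each response's coefficients by the corresponding latent standard deviation and converting the covariance matrix to a correlation matrix. The following minimal Python sketch uses hypothetical names and made-up numbers purely for illustration.

```python
import math

def to_identified(beta, Sigma):
    """Rescale unidentified (beta, Sigma) draws to (beta*, R).

    beta[p] holds the coefficients for response p; Sigma is the
    unrestricted covariance matrix of the latent responses.
    """
    P = len(Sigma)
    d = [1.0 / math.sqrt(Sigma[p][p]) for p in range(P)]   # diagonal of D
    beta_star = [[b * d[p] for b in beta[p]] for p in range(P)]
    R = [[d[p] * Sigma[p][q] * d[q] for q in range(P)] for p in range(P)]
    return beta_star, R

# Illustrative draw for P = 2 responses with two predictors each
Sigma = [[4.0, 1.2], [1.2, 1.0]]
beta = [[2.0, -1.0], [0.5, 0.3]]
beta_star, R = to_identified(beta, Sigma)
print(beta_star)
print(R)
```

Applied at each MCMC iteration, this post-processing yields draws of the identified parameters even though the sampler works with the unrestricted Σ.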


Factor models for multiple binary data most typically have the general linear mixed
form

\[ y_{pi} \sim \text{Bern}(\pi_{pi}), \]

\[ g(\pi_{pi}) = \beta_p x_{pi} + \Lambda_p F_i, \]

where g is the link, and $F_i = (F_{1i}, \ldots, F_{Qi})'$. As for the normal linear factor model, a common
assumption for the density of F is normal with a known scale. If, additionally, factors
are independent, then $F_{qi} \sim N(0, 1)$, $q = 1, \ldots, Q$. If instead the assumption $F_{qi} \sim \text{Logist}(0, 1)$ is
made, with loadings $\kappa_{pq}$ in

\[ \eta_{pi} = \alpha_p + \kappa_{p1}F_{1i} + \cdots + \kappa_{pQ}F_{Qi}, \]

then $\kappa_{pq} \approx (\sqrt{3}/\pi)\lambda_{pq}$, since the variance of a standard logistic is $\pi^2/3$ (Bartholomew, 1987).
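The scaling constant follows from matching standard deviations: a standard logistic factor has standard deviation $\pi/\sqrt{3}$, so a loading on it must shrink by $\sqrt{3}/\pi$ to have the same effect as a loading on a N(0,1) factor. The sketch below is illustrative Python with hypothetical names; it confirms the logistic variance by numerical integration and evaluates the constant.

```python
import math

def logistic_pdf(x):
    """Standard logistic density, written via cosh for numerical stability."""
    return 1.0 / (4.0 * math.cosh(x / 2.0) ** 2)

def logistic_variance(half_width=40.0, step=0.01):
    """Trapezoidal integral of x^2 f(x); the mean is zero by symmetry."""
    n = int(round(2 * half_width / step))
    total = 0.0
    for i in range(n + 1):
        x = -half_width + i * step
        weight = 0.5 if i in (0, n) else 1.0
        total += weight * x * x * logistic_pdf(x)
    return total * step

var_logistic = logistic_variance()
scale = math.sqrt(3.0) / math.pi   # multiplier converting lambda to kappa
print(var_logistic, math.pi ** 2 / 3.0, scale)
```

So a loading of 1 on a standard normal factor corresponds to a loading of roughly 0.55 on a standard logistic factor.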
A widely applied method in educational and psychometric evaluation (Albert, 1992;
Rupp et al., 2004; Fox and Glas, 2005) is based on item response theory (IRT for short).
Typically, the observation vector $y_i = (y_{1i}, \ldots, y_{Pi})$ consists of binary items measuring ability,
with 1 denoting a correct answer and 0 an incorrect answer, and a model seeks a single
latent ability factor score $F_i$. Factor score identifiability is generally obtained by assuming
$F_i \sim N(0,1)$. Under conditional independence, the joint success probability given $F_i$ is

\[ \Pr(y_{1i} = 1, y_{2i} = 1, \ldots, y_{Pi} = 1 \mid F_i) = \Pr(y_{1i} = 1 \mid F_i)\Pr(y_{2i} = 1 \mid F_i) \cdots \Pr(y_{Pi} = 1 \mid F_i). \]

With a Bernoulli likelihood, $y_{pi} \sim \text{Bern}(\pi_{pi})$, and link g, one has a factor model

\[ g(\pi_{pi}) = \eta_{pi} = \alpha_p + \lambda_p F_i. \qquad (9.5) \]

IRT rests on relatively strong assumptions, namely that a unidimensional factor is appropriate,
conditional independence of the items given the latent factor, and a monotonic relationship
between latent ability and performance on the items (Arima, 2015).
The intercepts $\alpha_p$ can be interpreted as measures of difficulty of item p, with more negative
$\alpha_p$ implying greater difficulty under the parameterisation in (9.5), while $\lambda_p$ measures
an item’s power to discriminate ability between subjects. A now frequent practice assigns
positive (e.g. lognormal, gamma, truncated normal) priors to the discrimination parameters
$\lambda_p$, and draws these parameters from a hierarchical density with common variance
(Curtis, 2010; Luo and Jiao, 2017). A hierarchical prior may also be assumed for the difficulty
parameters. Using hierarchical priors may improve convergence. An alternative is to
adopt fixed effects priors (Sahu, 2002), as illustrated in the Stan Case Studies. This model
may also be parameterised as

\[ g(\pi_{pi}) = \lambda_p (F_i - \alpha_p), \]

so that $\alpha_p$ increases with difficulty. These are called two-parameter logistic or probit IRT
models. The three-parameter model includes a guessing (or threshold) parameter $c_p$,
whereby

\[ \pi_{pi} = c_p + (1 - c_p)\, g^{-1}[\lambda_p (F_i - \alpha_p)]. \]
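The two- and three-parameter response functions can be written as one function, since the two-parameter model is the special case $c_p = 0$. The following is a minimal Python sketch assuming a logit link and the difficulty parameterisation $\lambda_p(F_i - \alpha_p)$; the function names and parameter values are hypothetical.

```python
import math

def inv_logit(x):
    """Inverse logit link."""
    return 1.0 / (1.0 + math.exp(-x))

def irt_prob(F, alpha, lam, c=0.0):
    """Probability of a correct response under the 2PL (c = 0) or 3PL model.

    F: latent ability; alpha: item difficulty; lam: discrimination;
    c: guessing parameter (lower asymptote).
    """
    return c + (1.0 - c) * inv_logit(lam * (F - alpha))

# Illustrative item: difficulty 0.5, discrimination 1.5, guessing 0.2
p_at_difficulty = irt_prob(0.5, alpha=0.5, lam=1.5, c=0.2)
p_low_ability = irt_prob(-50.0, alpha=0.5, lam=1.5, c=0.2)
print(p_at_difficulty, p_low_ability)
```

At $F_i = \alpha_p$ the success probability is $c_p + (1 - c_p)/2$, and for very low ability it approaches the guessing parameter $c_p$ rather than zero.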


An IRT information function $I_p(F)$ measures how precisely an item measures the latent
ability scale. For example, easy items may provide little information about higher ability
subjects, while difficult items will provide little information regarding lower ability
subjects. Assuming a two-parameter logistic, the item information function can be
obtained as

\[ I_p(F) = \lambda_p^2 \pi_p(F)(1 - \pi_p(F)), \]

where $\pi_p(F) = \text{logit}^{-1}[\lambda_p (F - \alpha_p)]$, and can be displayed graphically with F taken over
the range of ability scores. The total information function is the sum of the item-specific
functions.
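The item and total information functions can be evaluated over a grid of ability values. The sketch below is illustrative Python for the two-parameter logistic case; the function names and the three hypothetical items are assumptions made for this example.

```python
import math

def inv_logit(x):
    """Inverse logit link."""
    return 1.0 / (1.0 + math.exp(-x))

def item_information(F, alpha, lam):
    """Two-parameter logistic item information at ability F."""
    p = inv_logit(lam * (F - alpha))
    return lam ** 2 * p * (1.0 - p)

def total_information(F, items):
    """Sum of item information over (alpha, lam) pairs."""
    return sum(item_information(F, a, l) for a, l in items)

# Three hypothetical items: easy, medium, hard
items = [(-1.0, 1.2), (0.0, 1.5), (1.5, 0.8)]
grid = [i / 10.0 for i in range(-30, 31)]   # ability values from -3 to 3
info_medium = [item_information(F, 0.0, 1.5) for F in grid]
info_total = [total_information(F, items) for F in grid]
peak_F = grid[info_medium.index(max(info_medium))]
print(peak_F, max(info_medium))
```

Each item is most informative at $F = \alpha_p$, where the information equals $\lambda_p^2/4$, which is the sense in which easy items say little about high-ability subjects.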
Soares et al. (2009) and Fox and Glas (2005) describe Bayesian IRT models allowing differential
item functioning (DIF), for example, when one or more items are not appropriate
for measuring ability because the knowledge needed for a correct answer is culturally
specific. Thus, let $x_i = 0$ for a reference population and $x_i = 1$ for a focal group (e.g. disadvantaged
or minority group) (Magis et al., 2015). Then DIF is indicated if the extended
model

\[ \Pr(y_{pi} = 1 \mid F_i) = g^{-1}(\eta_{pi}), \]

\[ \eta_{pi} = \alpha_p + \lambda_p F_i + x_i(\gamma_p + \delta_p F_i), \]

has better fit than the standard model without group differentiation (Choi et al., 2011).

9.4.3 Latent Scale IRT Models


As an alternative to binary likelihood modelling in IRT and binary SEM applications, the
latent scale method may be applied with the appropriate underlying density defined by
the link g (Albert, 2015). Thus, for a probit link with $g^{-1} = \Phi$, the latent metric scale y* is
normal, such that $y_{pi} = 1$ corresponds to the imputation scheme

\[ y_{pi}^* \sim N(\eta_{pi}, 1)\, I(0, \infty) \]

and $y_{pi} = 0$ corresponds to $y_{pi}^* \sim N(\eta_{pi}, 1)\, I(-\infty, 0)$. For a logit link, $g^{-1}(u) = L(u) = e^u/(1 + e^u)$, and
sampling of y* is from a standard logistic. Sahu (2002) considers an extra data imputation
to provide three-parameter IRT models. The three-parameter probit IRT model
specifies

\[ \pi_{pi} = c_p + (1 - c_p)\Phi(\alpha_p + \lambda_p F_i), \]

while the three-parameter logistic IRT is

\[ \pi_{pi} = c_p + (1 - c_p)L(\alpha_p + \lambda_p F_i). \]

Lee and Song (2003) adopt a latent scale approach to a structural equation model for multiple
binary observations. Their model specifies

\[ y_i^* = \alpha + \Lambda F_i + u_i, \]
where the latent constructs $F_i$ are partitioned into endogenous and exogenous vector components
$F_i = (F_{1i}, F_{2i})$ of dimension $Q_1$ and $Q_2$ respectively, with structural model

\[ F_{1i} = B F_{1i} + C F_{2i} + w_i. \]

For identification $u_i \sim N_P(0, I)$, while $F_{1i} \sim N_{Q_1}(0, \Phi_1)$, $F_{2i} \sim N_{Q_2}(0, \Phi_2)$, $w_i \sim N_{Q_1}(0, \Sigma_w)$ and
each row of Λ follows a separate normal prior. The observed binary data y is augmented
with latent data {y*,F} to provide complete data {y,y*,F}. Setting $\theta = (\alpha, \Lambda, \Phi_1, \Phi_2, \Sigma_w, B, C)$, the
updating sequence involves sampling from conditionals $p(\theta^{(t+1)} \mid F^{(t)}, y^{*(t)})$, $p(F^{(t+1)} \mid \theta^{(t+1)}, y^{*(t)})$,
and $p(y^{*(t+1)} \mid F^{(t+1)}, \theta^{(t+1)})$.
Dunson and Herring (2005) consider instead the case where the underlying $y_{pi}^*$ (e.g.
tumour counts) are Poisson, or overdispersed Poisson, and the observations $y_{pi}$ (e.g.
whether tumours are present) are binary. Thus

\[ y_{pi}^* \sim \text{Po}(\exp(x_{pi}\beta)\Lambda_p \xi_i)\, I(A_{pi}, B_{pi}), \]

where $\xi_i = (\xi_{1i}, \ldots, \xi_{Qi})$ are gamma distributed latent constructs, and the loadings $\Lambda_p$ are
also gamma distributed. The sampling limits are ($A_{pi} = 0$, $B_{pi} = 0$) when $y_{pi} = 0$, and ($A_{pi} = 1$,
$B_{pi} = \infty$) when $y_{pi} = 1$.

9.4.4 Categorical Data
For unordered polytomous indicators $y_{pi}$ with $M_p$ categories $(p = 1, \ldots, P)$, intercept and
loading parameters are typically specific to the category of each item, with one category
(e.g. the final one) as reference. Assume a multiple logit link, with multinomial parameter
$\pi_{pi} = (\pi_{pi1}, \ldots, \pi_{piM_p})$ for subject i and indicator p. Then while factors are common across categories,
loadings are specific to indicator p and category h of that indicator,

\[ y_{pi} \sim \text{Categoric}(\pi_{pi1}, \pi_{pi2}, \ldots, \pi_{piM_p}), \]

\[ \pi_{pih} = \varphi_{pih} \Big/ \sum_{m=1}^{M_p} \varphi_{pim}, \quad h = 1, \ldots, M_p, \]

\[ \log(\varphi_{pih}) = \alpha_{ph} + \lambda_{ph1}F_{1i} + \cdots + \lambda_{phQ}F_{Qi}, \quad h = 1, \ldots, M_p - 1, \]

\[ \varphi_{piM_p} = 1, \]

with the usual constraints on Λ and/or F to avoid scale and rotational invariance.
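The category probabilities for one subject and one indicator follow directly from the multiple logit link with the final category as reference. The following minimal Python sketch uses hypothetical names and values for illustration only.

```python
import math

def category_probs(alpha, lam, F):
    """Multinomial logit probabilities with the last category as reference.

    alpha: M-1 category intercepts; lam: M-1 rows of Q loadings;
    F: Q factor scores for one subject.
    """
    phi = [math.exp(a + sum(l * f for l, f in zip(row, F)))
           for a, row in zip(alpha, lam)]
    phi.append(1.0)                 # reference category: phi = 1
    total = sum(phi)
    return [v / total for v in phi]

# Hypothetical indicator with M = 3 categories and Q = 2 factors
alpha = [0.4, -0.2]
lam = [[1.0, 0.0], [0.5, 0.8]]
probs = category_probs(alpha, lam, F=[0.3, -1.1])
print(probs)
```

Because the reference category has its linear predictor fixed at zero, only $M_p - 1$ sets of intercepts and loadings are needed per indicator.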
Factor models for multiple ordinal items $y_{pi} \in (1, \ldots, K_p)$ refer to locations on underlying
continuous scales $z_{pi}$. Thus $y_{pi} = j$ when $\alpha_{p,j-1} \le z_{pi} < \alpha_{pj}$, where $\alpha_{pj}$ are cutpoints on the
underlying scale. Define binary indicators $d_{pij} = 1$ if $y_{pi} = j$, and $d_{pij} = 0$ otherwise, and denote
$d_{pi} = (d_{pi1}, \ldots, d_{piK_p})$. With $\pi_{pi} = (\pi_{pi1}, \ldots, \pi_{piK_p})$, $\eta_{pi}$ denoting a regression term potentially
including latent factors, and $z_{pi} = \eta_{pi} + \varepsilon_{pi}$, where errors $\varepsilon_{pi}$ have cdf P(ε), one has

\[ d_{pi} \sim \text{Mult}(1, \pi_{pi}), \]

\[
\begin{aligned}
\pi_{pij} = \Pr(y_{pi} = j) &= \Pr(\alpha_{p,j-1} \le z_{pi} < \alpha_{pj}) \\
&= \Pr(\alpha_{p,j-1} \le \eta_{pi} + \varepsilon_{pi} < \alpha_{pj}) \\
&= P(\alpha_{pj} - \eta_{pi}) - P(\alpha_{p,j-1} - \eta_{pi}) \\
&= \gamma_{pij} - \gamma_{pi,j-1},
\end{aligned}
\]

where

\[ \gamma_{pij} = \Pr(y_{pi} \le j) = P(\alpha_{pj} - \eta_{pi}), \quad j = 1, \ldots, K_p - 1 \]

are cumulative probabilities over ordered categories, $\gamma_{pij} = \pi_{pi1} + \cdots + \pi_{pij}$.
If ε follows a logistic density, then

\[ \text{logit}(\gamma_{pij}) = \alpha_{pj} - \eta_{pi}, \]

where taking $\eta_{pi}$ as uniform across response categories j defines the proportional odds
assumption. For example, assuming a univariate latent factor $F_i$, and other predictors $X_i$,
one has

\[ \text{logit}(\gamma_{pij}) = \alpha_{pj} - \lambda_p F_i - \beta_p X_i. \]

One application is in the graded response model for ordinal outcome IRT (Luo and
Jiao, 2017). Assuming $X_i$ excludes an intercept, the $K_p - 1$ thresholds $\{\alpha_{p1}, \alpha_{p2}, \ldots, \alpha_{p,K_p-1}\}$ are
unknowns subject to the order constraint $\alpha_{p1} \le \alpha_{p2} \le \cdots \le \alpha_{p,K_p-1}$. An augmented data approach
may also be used for latent variable analysis with ordinal responses (Lee and Tang, 2006;
Poon and Wang, 2012).
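Category probabilities under the cumulative logit (proportional odds) form are obtained by differencing the cumulative probabilities. The sketch below is illustrative Python with hypothetical names and cutpoints; it assumes ordered cutpoints and treats the regression term $\eta_{pi}$ as a given number.

```python
import math

def inv_logit(x):
    """Inverse logit link."""
    return 1.0 / (1.0 + math.exp(-x))

def ordinal_probs(cutpoints, eta):
    """Category probabilities from ordered cutpoints and regression term eta.

    Cumulative probabilities are gamma_j = inv_logit(alpha_j - eta) for
    j = 1, ..., K-1, with gamma_K = 1; probabilities are their differences.
    """
    gamma = [inv_logit(a - eta) for a in cutpoints] + [1.0]
    probs, previous = [], 0.0
    for g in gamma:
        probs.append(g - previous)
        previous = g
    return probs

# Hypothetical item with K = 4 categories, so three ordered cutpoints
cutpoints = [-1.0, 0.0, 1.5]
probs_low = ordinal_probs(cutpoints, eta=0.4)
probs_high = ordinal_probs(cutpoints, eta=2.0)
print(probs_low)
print(probs_high)
```

Raising the regression term, for instance through a higher factor score with a positive loading, shifts probability mass toward the higher categories, which is the role of the minus sign in $\alpha_{pj} - \eta_{pi}$.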

Example 9.3 Greek Crime Totals


This example considers count data and compares a full dimension covariance structure
to a common factor model. The data relate to P = 4 counts of crimes (rapes, arsons,
manslaughter, smuggling of antiquities) in n = 49 Greek prefectures, as used in Karlis
and Meligkotsidou (2005).* The counts are assumed to be Poisson with offset being prefecture
populations, $\text{pop}_i$ (in millions). Predictors are unemployment rate (x1), a binary
indicator (x2) for whether the prefecture is on the Greek border, GDP per capita in euros
(x3), and a binary indicator (x4) for whether the prefecture has at least one large city (over
150,000 inhabitants). GDP and unemployment are centred.
Event rates are low in relation to populations at risk, so a reasonable sampling model
takes $y_{pi} \sim \text{Po}(\text{pop}_i \rho_{pi})$ with $\rho_{pi}$ then being crime rates per million. The full dimension
covariance model specifies

\[ \rho_{pi} = \exp(\alpha_p + X_i \beta_p + u_{pi}), \quad p = 1, \ldots, P, \]

\[ (u_{1i}, \ldots, u_{Pi}) \sim N_P(0, D), \]

* Data kindly provided by Dimitris Karlis.


with a Wishart prior on the precision matrix

\[ D^{-1} \sim W(PI, P). \]

With jagsUI for estimation, the mean scaled deviance is obtained as 209, comparing
closely to the number of observations, namely 196, so that Poisson extra-variation is
accounted for. Most predictor effects are insignificant: the only significant effects are
of unemployment on the rape crime rate (with $\beta_{11}$ having mean 0.064), and of GDP on
manslaughter rates. Most correlations $r_{jm} = \text{corr}(u_{ji}, u_{mi})$ in the regression residuals have
credible intervals straddling zero, though $r_{13}$ has a posterior mean 0.46 with 95% CRI
(−0.03,0.81). Adequate model performance is shown by the fact that only 9 of the 196
observations have mixed predictive p values under 0.05 or over 0.95 (Marshall and
Spiegelhalter, 2007).
To illustrate a factor analytic approach to these data, the four predictors are taken to
be causes of a single underlying crime construct, $F_i$, in a MIMIC analysis. The Poisson
regressions form a measurement model in which crime levels are indicators of $F_i$. A
further common factor $u_i$ is included in the model for the crime types to account for
residual variation in the crime data. So

\[ \rho_{pi} = \exp(\alpha_p + \lambda_p F_i + \kappa_p u_i), \]

\[ F_i = \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{3i} + \beta_4 x_{4i} + w_i, \]

\[ w_i \sim N(0, 1/\tau_w), \]

\[ u_i \sim N(0, 1/\tau_u). \]

Anchoring constraints are used to define the scale of the factor scores $F_i$ and $u_i$. So λ1 = 1
and κ2 = 1, with the latter setting corresponding to a belief that arson is relatively distinct
from the other variables in its pattern.
Inferences are based on a 15,000-iteration run with two chains using jagsUI. The posterior
means (sd’s) of the unknown λp coefficients (p = 2,3,4) are respectively −0.94 (0.95),
0.69 (0.38), 0.82 (0.54). These loadings tend to confirm F as a positive crime construct with
positive loadings on all crime variables except for arson. The posterior mean F scores
range from −0.96 to 1.22, with high F scores in prefectures with above average violent
crime (such as prefecture 13), or where high violent crime is combined with smuggling.
By contrast, low F scores occur in prefectures with little crime (prefecture 40), or in areas
where arson is unduly elevated (e.g. prefecture 48). The predictors x are relatively weak predictors
of $F_i$, though the GDP coefficient ($\beta_3$) has a mainly positive 95% interval (−0.03, 0.28).
The average scaled deviance of this model (278) indicates some residual over-dispersion.
The estimated parameter total is lower at 129 (compared to 170 for model 1), though
the DIC is higher at 756 (compared to 730 for model 1). Model checks are, however, adequate:
14 of the 196 observations have mixed predictive p values under 0.05 or over 0.95.
The fact that this particular data reduction method did not yield a better fit may be
taken to illustrate caveats to discrete data factor reduction, as also illustrated by Chib
and Winkelmann (2001). They undertake a Poisson regression analysis of six health use
outcomes, and conclude that “a flexible model with a full set of correlated latent effects
is needed to adequately describe the correlation structure [in the regression residuals].”

Example 9.4 Attitudes to Science


This example considers a unifactorial model for data on four ordinal indicators of atti-
tudes to science, a subsample from the International Social Survey Program (ISSP) 1993
(Greenacre and Blasius, 2006). The indicators are Likert scales with five levels, with
wording as follows: y1, we believe too often in science, and not enough in feelings and
faith; y2, overall, modern science does more harm than good; y3, any change humans
cause in nature, no matter how scientific, is likely to make things worse; and y4, mod-
ern science will solve our environmental problems. Responses range from 1 = strongly
agree to 5 = strongly disagree. Except for the fourth question, agreement suggests a
negative attitude toward science, while disagreement (higher ordinal ranks) suggests
a positive attitude.
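A logit regression for ordinal responses $y_{pi} \in (1, \ldots, K)$ ($p = 1, \ldots, P$; $k = 1, \ldots, K$), where
P = 4 and K = 5, assumes underlying continuous variables $z_{pi}$ such that $y_{pi} = k$ when
$\alpha_{p,k-1} \le z_{pi} < \alpha_{pk}$. Define binary indicators $d_{pik} = 1$ if $y_{pi} = k$, and $d_{pik} = 0$ otherwise, and
denote $d_{pi} = (d_{pi1}, \ldots, d_{piK})$. So with $\pi_{pi} = (\pi_{p1i}, \ldots, \pi_{pKi})$, and assuming Q latent factors, the
sampling model is $d_{pi} \sim \text{Mult}(1, \pi_{pi})$, where

$$\pi_{pki} = \Pr(\alpha_{p,k-1} \le z_{pi} < \alpha_{pk})$$

$$= P(\alpha_{pk} - \lambda_{p1}F_{1i} - \cdots - \lambda_{pQ}F_{Qi}) - P(\alpha_{p,k-1} - \lambda_{p1}F_{1i} - \cdots - \lambda_{pQ}F_{Qi})$$

$$= \gamma_{pki} - \gamma_{p,k-1,i}$$

where $\gamma_{pki} = \Pr(y_{pi} \le k)$, $k = 1, \ldots, K-1$, are cumulative probabilities, and $P(\cdot)$ here denotes
the logistic distribution function.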
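As a minimal sketch of the category probabilities just defined, the following R function differences logistic cumulative probabilities for one item and one factor score. The function name `ordinal_probs` and the two factor score values are illustrative assumptions; the cutpoints are the posterior means for item 1 reported in Table 9.4, with the loading fixed at 1.

```r
# Category probabilities for one ordinal item under a cumulative logit model.
# alpha: K-1 ordered cutpoints; lambda: loading; Fi: factor score (scalar).
ordinal_probs <- function(alpha, lambda, Fi) {
  gam <- plogis(alpha - lambda * Fi)  # cumulative probabilities Pr(y <= k), k = 1..K-1
  diff(c(0, gam, 1))                  # pi_k = gamma_k - gamma_{k-1}, gamma_0 = 0, gamma_K = 1
}

alpha1 <- c(-2.561, -0.183, 1.157, 3.363)                 # item 1 cutpoints (Table 9.4 means)
probs_low  <- ordinal_probs(alpha1, lambda = 1, Fi = -1)  # hypothetical low factor score
probs_high <- ordinal_probs(alpha1, lambda = 1, Fi = 1)   # hypothetical high factor score
round(rbind(probs_low, probs_high), 3)
```

Under this parameterisation a higher factor score lowers every cumulative probability, so probability mass moves towards the higher response categories.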


Here a single factor is assumed so that $\pi_{pki} = P(\alpha_{pk} - \lambda_p F_i) - P(\alpha_{p,k-1} - \lambda_p F_i)$. To ensure
consistent labelling, we set λ1 = 1 (i.e. an anchoring constraint), so the factor will most likely
measure positive attitudes to science. We also assume a MIMIC model, whereby factor
scores are assumed to be influenced by three observed predictors: x1 = sex (M = 0, F = 1),
x2 = age and x3 = education. The latter two covariates are categorical with 6 levels, but
taken to be continuous for simplicity. The categories are for age: 16–24, 25–34, 35–44,
45–54, 55–64, 65 and older, and for education: primary incomplete, primary completed,
secondary incomplete, secondary completed, tertiary incomplete, tertiary completed. So

$$F_i \sim N(\beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{3i}, \sigma_F^2),$$

where $\sigma_F^2$ is an unknown by virtue of the anchoring constraint.


Table 9.4 shows the estimated parameters obtained using jagsUI. It can be seen that
positive attitudes to science are more likely among males, and among younger, more
highly educated, subjects. Also apparent is the irrelevance of the fourth item to the
latent scale. Outlier diagnostics can be taken at the subject-indicator level, here using
the leave one out criterion. Thus elevated LOO-IC values occur for subjects such as
784, an older, less educated subject (65+, primary completed), with indicator profile for
$(y_1, y_2, y_3, y_4)$ of (2,1,5,2), so that the third indicator value is unusual in terms both of
covariate profile and the other indicators. We can aggregate the LOO-IC criteria within
subjects, and this shows subject 781 with the most extreme criterion. This subject is also
older and less educated, with an indicator profile for $(y_1, y_2, y_3, y_4)$ of (4,1,5,3).
The second indicator value is unusual, and the otherwise favourable science attitudes
are unusual given the covariate profile.

Example 9.5 LSAT Data Item Response Model


This example compares item analysis (IRT) using two and three parameter logit models.
The application involves Law School Admission Test (LSAT) data from the R ltm
package, with n = 1000 subjects and P = 5 items. The two-parameter model for subject i
and item p is $y_{pi} \sim \text{Bern}(\pi_{pi})$, with $\pi_{pi}$ specified as

$$\pi_{pi} = \text{logit}^{-1}[\lambda_p(\theta_i - \alpha_p)],$$


TABLE 9.4
Parameter Estimates. Scientific Attitudes
Parameter Mean St Devn 2.5% 97.5%
β1 −0.346 0.112 −0.566 −0.133
β2 −0.086 0.034 −0.153 −0.018
β3 0.219 0.043 0.128 0.305
α11 −2.561 0.146 −2.869 −2.295
α12 −0.183 0.108 −0.416 0.018
α13 1.157 0.119 0.924 1.387
α14 3.363 0.198 3 3.751
α21 −3.839 0.282 −4.457 −3.354
α22 −1.703 0.18 −2.087 −1.386
α23 −0.141 0.139 −0.432 0.117
α24 2.306 0.184 1.957 2.672
α31 −2.432 0.176 −2.779 −2.087
α32 −0.011 0.124 −0.259 0.223
α33 1.455 0.137 1.187 1.734
α34 3.549 0.224 3.138 4.024
α41 −2.617 0.129 −2.875 −2.375
α42 −0.689 0.072 −0.834 −0.547
α43 0.273 0.069 0.142 0.403
α44 1.568 0.09 1.406 1.757
λ2 1.45 0.199 1.141 1.882
λ3 1.24 0.141 0.984 1.536
λ4 −0.019 0.061 −0.135 0.103
τF = 1/σ²F 0.642 0.103 0.474 0.881

where $\theta_i \sim N(0, 1)$ are ability scores, and the items are all positive measures of ability.
The $\lambda_p$ are assigned a hierarchical $LN(0, \sigma_\lambda^2)$ prior. The model is checked by assessing
whether mixed predictive replicates $y_{new,pi}$ sampled from the model (Marshall and
Spiegelhalter, 2007) are concordant with actual values $y_{pi}$, though one may also compare
actual and predicted totals falling into particular item response patterns (Sahu, 2002).
The three-parameter logit model includes guessing parameters $\gamma_p$ (also called threshold
parameters), whereby

$$\pi_{pi} = \gamma_p + (1 - \gamma_p)\,\text{logit}^{-1}[\lambda_p(\theta_i - \alpha_p)].$$

The $\gamma_p = \text{logit}^{-1}(\xi_p)$ are obtained as inverse logits of $\xi_p$, which are assigned a hierarchical
normal prior.
As an example of IRT outputs, we obtain the item-specific information functions
and the test information function (or total information function), using the formulas
in Baker (2001). The test information function indicates where the set of items provides
most information about students with varying ability.
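The response probabilities and information functions described above can be sketched in R as follows. This is an illustrative plug-in calculation rather than the book's code: the function names are assumptions, the information formula is the standard logistic form (reducing to λ²π(1 − π) for the 2PL), and the inputs are the 2PL posterior means from Table 9.5.

```r
# 2PL and 3PL response probabilities.
p2pl <- function(theta, lambda, alpha) plogis(lambda * (theta - alpha))
p3pl <- function(theta, lambda, alpha, gam) gam + (1 - gam) * p2pl(theta, lambda, alpha)

# Item information; with gam = 0 this equals the 2PL form lambda^2 * P * (1 - P).
info3pl <- function(theta, lambda, alpha, gam = 0) {
  P <- p3pl(theta, lambda, alpha, gam)
  lambda^2 * ((1 - P) / P) * ((P - gam) / (1 - gam))^2
}

# Plug-in inputs: 2PL posterior means from Table 9.5.
lambda <- c(0.90, 0.76, 0.86, 0.76, 0.78)
alpha  <- c(-3.26, -1.36, -0.30, -1.79, -2.83)
theta_grid <- seq(-4, 2, by = 0.1)

# One column of item information per item, then sum across items.
item_info <- sapply(seq_along(lambda),
                    function(p) info3pl(theta_grid, lambda[p], alpha[p]))
test_info <- rowSums(item_info)
theta_grid[which.max(test_info)]   # ability at which the plug-in test information peaks
```

Evaluating the information at posterior means is only an approximation to the posterior mean of the information function, which would instead average the function over MCMC draws.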
The analysis is implemented using rstan, with the rstan analysis having a substantial
advantage in early convergence. Parameters from the 2PL and 3PL models are shown
in Table 9.5. Measures of global fit show little gain in adopting a 3PL model instead of
the 2PL, with the latter in fact having a lower LOO-IC (4910 vs 4914), but a higher WAIC
(4903 vs 4897).
In terms of discrimination, item 3 shows maximum difficulty (highest αp) under both
models. Posterior mean rates of predictive concordance for the five items are (0.87, 0.69,
TABLE 9.5
LSAT Data, Parameter Summary, 2PL vs 3PL IRT
Two Parameter Logistic
Mean St Devn 2.5% 50% 97.5%
λ1 0.90 0.20 0.54 0.89 1.33
λ2 0.76 0.16 0.46 0.76 1.07
λ3 0.86 0.18 0.54 0.85 1.24
λ4 0.76 0.17 0.46 0.76 1.10
λ5 0.78 0.17 0.44 0.78 1.11
α1 −3.26 0.64 −4.81 −3.14 −2.31
α2 −1.36 0.28 −2.03 −1.31 −0.95
α3 −0.30 0.11 −0.52 −0.29 −0.11
α4 −1.79 0.37 −2.71 −1.72 −1.26
α5 −2.83 0.64 −4.43 −2.69 −2.03
Three Parameter Logistic (Hierarchical Normal prior on logit(gamma))
Mean St Devn 2.5% 50% 97.5%
λ1 1.15 0.56 0.65 1.02 2.33
λ2 1.15 0.99 0.67 1.01 2.15
λ3 1.09 0.37 0.68 1.01 1.93
λ4 1.06 0.34 0.65 1.00 1.76
λ5 1.09 0.48 0.65 1.01 1.93
α1 −0.68 0.84 −2.56 −0.49 0.46
α2 −0.12 0.41 −1.01 −0.10 0.69
α3 0.18 0.32 −0.28 0.10 0.94
α4 −0.26 0.47 −1.33 −0.20 0.58
α5 −0.51 0.69 −2.09 −0.34 0.54
γ1 0.74 0.15 0.29 0.80 0.88
γ2 0.37 0.12 0.08 0.39 0.56
γ3 0.16 0.10 0.01 0.14 0.38
γ4 0.45 0.13 0.11 0.48 0.63
γ5 0.64 0.15 0.21 0.69 0.8
Three Parameter Logistic (Hierarchical Beta prior on gamma)
Mean St Devn 2.5% 50% 97.5%
λ1 1.11 0.40 0.68 1.02 2.02
λ2 1.05 0.33 0.66 1.00 1.67
λ3 1.05 0.29 0.69 1.00 1.71
λ4 1.03 0.27 0.65 1.00 1.69
λ5 1.07 0.39 0.63 1.00 1.91
α1 −0.88 1.01 −3.02 −0.58 0.45
α2 −0.23 0.46 −1.19 −0.16 0.60
α3 0.11 0.29 −0.34 0.05 0.80
α4 −0.36 0.54 −1.52 −0.26 0.60
α5 −0.60 0.79 −2.39 −0.38 0.52
γ1 0.69 0.23 0.04 0.79 0.88
γ2 0.34 0.14 0.01 0.37 0.54
γ3 0.14 0.09 0.00 0.13 0.34
γ4 0.42 0.15 0.03 0.46 0.63
γ5 0.61 0.19 0.05 0.69 0.80
FIGURE 9.1
Test information plot (mean and 60% CRI). Horizontal axis: Ability; vertical axis: Test Information.

0.57, 0.74, 0.82) under the 2PL model, suggesting that the third item is the least well
explained by, and possibly less relevant to, the latent structure model. The test informa-
tion plot (Figure 9.1) for the 2PL model peaks at around −2, indicating that this set of
items best identifies learners with an ability less than average.
Estimates (and fit) for the 3PL may be sensitive to the prior adopted for γp. For example,
a hierarchical Beta(a1,b1) prior for the γp, with a1 and b1 assigned Exponential(1) priors,
provides differing estimates of the difficulty parameters, and a lower LOO-IC of 4910.

Example 9.6 Latent Regression vs Differential Item Functioning


The data in this example are from Thissen et al. (1993) and concern student spelling
performance (correct/incorrect) using four words: infidelity, panoramic, succumb, and
girder. The sample includes 284 male and 374 female undergraduate students.
One option, considered in Zheng and Rabe-Hesketh (2007), is latent regression, with
the factor scores depending on gender. Generically

$$y_{pi} \sim \text{Bern}(\pi_{pi}),$$

$$\theta_i \sim N(\delta X_i, 1),$$

$$\pi_{pi} = \text{logit}^{-1}[\lambda_p(\theta_i - \alpha_p)],$$

where $X_i$ consists of centred covariates (an intercept not being identifiable). The λp and αp
parameters are assigned hierarchical normal priors, with λp constrained to positive values.
Table 9.6 shows the 2PL parameter estimates for this model (obtained using jagsUI),
with a significant effect δ of male gender in improving spelling ability. One feature to
note is the more informative nature (compared to Example 9.5) of the total information
function. Figure 9.2 shows that this provides a higher information level centred at average
ability. The LOO-IC is 3033.
TABLE 9.6
Spelling Data, Parameter Estimates Compared
2PL Latent Regression
Mean Sd 2.50% 97.50%
α1 −1.66 0.26 −2.26 −1.24
α2 −0.54 0.11 −0.77 −0.35
α3 0.90 0.14 0.66 1.21
α4 −0.12 0.08 −0.28 0.03
λ1 0.97 0.18 0.64 1.36
λ2 1.26 0.24 0.84 1.78
λ3 1.22 0.23 0.83 1.73
λ4 1.47 0.32 0.98 2.26
Predictive Concordance Item 1 0.67 0.02 0.63 0.70
Predictive Concordance Item 2 0.53 0.02 0.49 0.57
Predictive Concordance Item 3 0.58 0.02 0.54 0.62
Predictive Concordance Item 4 0.51 0.02 0.47 0.54
δ 0.23 0.12 0.01 0.46
Differential Item Functioning
Mean Sd 2.50% 97.50%
α1,1 −1.50 0.29 −2.22 −1.08
α2,1 −1.39 0.32 −2.14 −0.91
α1,2 −0.53 0.15 −0.87 −0.27
α2,2 −0.55 0.15 −0.87 −0.29
α1,3 1.15 0.26 0.73 1.75
α2,3 0.68 0.14 0.44 1.00
α1,4 0.18 0.12 −0.03 0.42
α2,4 −0.52 0.15 −0.86 −0.26
λ1,1 1.35 0.35 0.75 2.16
λ2,1 0.99 0.25 0.57 1.54
λ1,2 1.18 0.31 0.67 1.87
λ2,2 1.45 0.37 0.85 2.30
λ1,3 0.94 0.24 0.56 1.47
λ2,3 1.88 0.58 1.05 3.29
λ1,4 1.31 0.39 0.74 2.31
λ2,4 1.39 0.39 0.80 2.30
Predictive Concordance Item 1 0.67 0.02 0.63 0.70
Predictive Concordance Item 2 0.53 0.02 0.49 0.57
Predictive Concordance Item 3 0.58 0.02 0.54 0.62
Predictive Concordance Item 4 0.52 0.02 0.48 0.55
Differential Item Functioning (Reduced Model)
Mean Sd 2.50% 97.50%
α1,1 −1.49 0.15 −1.80 −1.21
α2,1 −1.15 0.15 −1.45 −0.86
α1,2 −0.47 0.11 −0.69 −0.25
α2,2 −0.55 0.13 −0.81 −0.31
α1,3 0.90 0.12 0.66 1.15
α2,3 0.79 0.14 0.53 1.07
α1,4 0.17 0.11 −0.04 0.38
α2,4 −0.49 0.13 −0.75 −0.24
λ 1.27 0.10 1.08 1.45
Predictive Concordance Item 1 0.67 0.02 0.63 0.70
Predictive Concordance Item 2 0.53 0.02 0.49 0.57
Predictive Concordance Item 3 0.58 0.02 0.54 0.61
Predictive Concordance Item 4 0.52 0.02 0.48 0.55

FIGURE 9.2
Test information function, mean, and 80% CRI. Horizontal axis: Ability; vertical axis: Information.

Fit is improved (with LOO-IC reduced to 3015) using a DIF model with difficulty and
discrimination parameters varying by group. Thus

$$y_{pi} \sim \text{Bern}(\pi_{pi}),$$

$$\theta_i \sim N(0, 1),$$

$$\pi_{pi} = \text{logit}^{-1}[\lambda_{g_i p}(\theta_i - \alpha_{g_i p})],$$

where gender gi is coded as (F = 1, M = 2), and where the λgp and αgp parameters are again
assigned hierarchical normal priors. There is some evidence favouring differential
functioning. For example, Table 9.6 shows higher difficulty for females on items 3 and 4,
and higher discrimination for males on items 2 and 3.
A simplified DIF approach is discussed by Magis et al. (2015) involving a single
discrimination parameter (homogeneous across groups and items), and αgp parameters
subject to (classical) Lasso penalisation. One possible Bayesian option is a Lasso shrinkage
prior

$$\alpha_{gp} \sim N(0, \sigma_\alpha^2 \eta_{gp}^2),$$

$$\eta_{gp}^2 \sim \text{Exponential}\left(\frac{r^2}{2}\right),$$

$$r \sim U(0.001, 1000),$$

$$1/\sigma_\alpha^2 \sim \text{Exponential}(1),$$

with other parameterisations possible. This option provides a further reduction in
LOO-IC to 3007, with the discrimination parameter more precisely estimated than
under the full DIF method. Another option, assuming a half-Cauchy prior with 1 d.f.
(a horseshoe prior) on the ηgp, leads to a LOO-IC of 3010, with shrinkage to zero most
marked for α14. Posterior mean percentage predictive concordances for the four items
are (66.8,53.0,57.8,51.5).
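The sense in which the normal-exponential hierarchy above acts as a Lasso prior can be checked numerically: integrating out the exponential mixing variable yields a double exponential (Laplace) density with rate r/σα. The R sketch below does this for fixed, hypothetical values of σα and r; the function names are illustrative and are not taken from the book's code.

```r
# Marginal density of alpha under alpha ~ N(0, sigma^2 * eta2), eta2 ~ Exponential(r^2 / 2),
# obtained by numerically integrating out eta2.
marginal_density <- function(a, sigma, r) {
  integrand <- function(eta2) {
    dnorm(a, mean = 0, sd = sigma * sqrt(eta2)) * dexp(eta2, rate = r^2 / 2)
  }
  integrate(integrand, lower = 0, upper = Inf)$value
}

# Closed-form Laplace (double exponential) density with rate r / sigma.
laplace_density <- function(a, sigma, r) (r / (2 * sigma)) * exp(-r * abs(a) / sigma)

sigma <- 1.3; r <- 2                 # hypothetical values for illustration
a_vals <- c(-1.5, 0.4, 2.0)
cbind(numeric = sapply(a_vals, marginal_density, sigma = sigma, r = r),
      laplace = laplace_density(a_vals, sigma, r))
```

The two columns should agree to numerical integration accuracy, which is why larger values of r imply stronger shrinkage of the αgp towards zero.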
Since there appears to be an association between gender and spelling ability in these
data, this can be explored more fully using a multivariate probit, as implemented in
rstan. The code involves a Cholesky decomposition of the correlation matrix R = LL′,
combined with an LKJ(1) prior on L. Estimation shows that the effect of male gender β2p
is significantly positive for the last of the four words, and significantly negative for the
first (see Table 9.7). The intercepts show item 3 as the most difficult. The highest element
in the residual correlation matrix is 0.38, between items 2 and 4.

TABLE 9.7
Multivariate Probit, Spelling Data, Posterior Summary
Mean St devn 2.5% 5% 95% 97.5%
β11 0.89 0.07 0.75 0.77 1.01 1.03
β12 0.29 0.06 0.17 0.19 0.40 0.42
β13 −0.54 0.07 −0.67 −0.65 −0.42 −0.41
β14 −0.10 0.07 −0.24 −0.21 0.00 0.02
β21 −0.20 0.11 −0.41 −0.38 −0.02 0.01
β22 0.07 0.10 −0.13 −0.10 0.22 0.26
β23 0.06 0.10 −0.15 −0.11 0.23 0.26
β24 0.42 0.10 0.24 0.27 0.59 0.62
R12 0.30 0.06 0.17 0.19 0.40 0.42
R13 0.28 0.07 0.14 0.16 0.39 0.41
R14 0.33 0.06 0.20 0.23 0.43 0.45
R23 0.37 0.06 0.25 0.27 0.47 0.48
R24 0.38 0.06 0.26 0.28 0.47 0.49
R34 0.36 0.06 0.24 0.26 0.46 0.48
9.5 Robust Density Assumptions in Factor Models


To improve estimability of factor models with data containing unusual observations,
heavy tailed or skew densities may be considered (Yuan et al., 2004; Lai and Zhang, 2017).
Consider the normal linear factor reduction model (9.1), namely $y_i = \alpha + \Lambda F_i + u_i$. Instead
of conventional normality assumptions for residuals $u_i = (u_{1i}, \ldots, u_{Pi})'$ or factor scores
$F_i = (F_{1i}, F_{2i}, \ldots, F_{Qi})'$, one might use options that are robust to measurement or construct
outliers. For example, a Student t model with $\nu_{1p}$ degrees of freedom for the measurement
model regressions for $y_{pi}$ is obtainable via scale mixing, with

$$y_{pi} \sim N(\alpha_p + \Lambda_p F_i, \sigma_p^2/\zeta_{pi}),$$

$$\zeta_{pi} \sim Ga(0.5\nu_{1p}, 0.5\nu_{1p}).$$

To identify possible observation outliers, one may monitor the lowest weights ζpi. Assume
also standardised and uncorrelated factor scores, but following a Student t rather than
normal density. Then the corresponding heavy tailed construct score model for identifying
construct outliers is

$$F_{qi} \sim N(0, 1/\omega_{qi}),$$

$$\omega_{qi} \sim \text{Gamma}(0.5\nu_{2q}, 0.5\nu_{2q}).$$
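The scale mixture construction can be illustrated by simulation. The R sketch below draws gamma weights and conditionally normal values, then compares the empirical quantiles with those of a Student t density; the degrees of freedom, seed, and sample size are arbitrary illustrative choices.

```r
set.seed(1)                                 # arbitrary seed for reproducibility
nu <- 5                                     # hypothetical degrees of freedom
n_draws <- 100000
zeta <- rgamma(n_draws, shape = 0.5 * nu, rate = 0.5 * nu)  # mixing weights, mean 1
y <- rnorm(n_draws, mean = 0, sd = sqrt(1 / zeta))          # N(0, 1/zeta) given zeta

# Compare simulated quantiles with Student t quantiles on nu degrees of freedom.
probs <- c(0.1, 0.25, 0.5, 0.75, 0.9)
rbind(simulated = quantile(y, probs), student_t = qt(probs, df = nu))

# Draws with low weights tend to lie further from the centre.
c(low_weight = mean(abs(y[zeta < 0.5])), high_weight = mean(abs(y[zeta > 1.5])))
```

The second comparison shows why small posterior weights are a natural outlier diagnostic: observations far from their fitted mean are accommodated by inflating their variance.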

Skewness in outcomes or factor scores may also be present. Following Azzalini (1985), let
f and g be symmetric probability density functions, with G being the cumulative distribution
function associated with g. Then for location parameter μ and scale parameter σ, the
density

$$\frac{2}{\sigma}\, f\left(\frac{x - \mu}{\sigma}\right) G\left(\kappa \frac{x - \mu}{\sigma}\right)$$

is a skew pdf for any κ. If f = ϕ and G = Φ (respectively the normal pdf and cdf), one obtains
the skew-normal distribution. Positive (negative) values of κ indicate positive (negative)
skewness, while κ = 0 provides the normal density. Bazan et al. (2006) consider the application
of the skew-normal density in item analysis. For binary items $p = 1, \ldots, P$, and with

$$\delta_p = \frac{\kappa_p}{\sqrt{1 + \kappa_p^2}},$$

they define a skew probit IRT model involving a common factor Fi and item-specific effects
Vpi to allow for skew errors. So

$$y^*_{pi} \sim N(\alpha_p + \lambda_p F_i + \delta_p V_{pi}, 1 - \delta_p^2)\, I(A_{pi}, B_{pi}),$$

$$F_i \sim N(0, 1),$$

$$V_{pi} \sim HN(0, 1),$$

with sampling limits $\{A_{pi}, B_{pi}\}$ defined according to the observed binary responses. This
parameterisation necessitates priors for δp in the interval [−1,1].
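A short R sketch may help fix ideas about the skew-normal density and the role of δ. It assumes the standard relation δ = κ/√(1 + κ²) and the half-normal stochastic representation underlying the skew probit; the function name `dskewnorm`, the value of κ, and the seed are illustrative assumptions.

```r
# Skew-normal density: (2 / sigma) * phi(z) * Phi(kappa * z), z = (x - mu) / sigma.
dskewnorm <- function(x, mu = 0, sigma = 1, kappa = 0) {
  z <- (x - mu) / sigma
  (2 / sigma) * dnorm(z) * pnorm(kappa * z)
}

kappa <- 3                                   # hypothetical skewness parameter
delta <- kappa / sqrt(1 + kappa^2)           # lies in (-1, 1)

# The density integrates to one, and its mean is delta * sqrt(2 / pi).
total    <- integrate(dskewnorm, -Inf, Inf, kappa = kappa)$value
mean_num <- integrate(function(x) x * dskewnorm(x, kappa = kappa), -Inf, Inf)$value
c(total = total, mean_numeric = mean_num, mean_formula = delta * sqrt(2 / pi))

# Stochastic representation: delta * V + sqrt(1 - delta^2) * e, with V half-normal
# and e standard normal, has the standard skew-normal density with parameter kappa.
set.seed(2)                                  # arbitrary seed
V <- abs(rnorm(50000)); e <- rnorm(50000)
x_sim <- delta * V + sqrt(1 - delta^2) * e
c(sim_mean = mean(x_sim), formula = delta * sqrt(2 / pi))
```

With κ = 0 the density reduces to the normal, and δ = 0 removes the half-normal term, recovering the symmetric probit.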

Example 9.7 Greek Crimes by Prefecture: Non-Parametric Prior for Random Effects
The analysis of the Greek crime data in Example 9.3 assumed normally distributed
errors $u_{pi}$ in the log-link model for the crime rates $\rho_{pi}$. As noted by Knorr-Held and
Rasser (2000), a fully parametric specification of the random effects distribution may
result in oversmoothing, and mask local discontinuities, especially when the true
distribution is characterised by a finite number of locations. Here a truncated Dirichlet
process prior (DPP) is adopted to model the density of the residuals $u_{pi}$, with potential
values $\{u^*_{pk}, p = 1, \ldots, P\}$ from K clusters centred on the multivariate normal $G_0 = N_P(0, D)$,
where P = 4. $D^{-1}$ has a Wishart prior with identity scale matrix and P degrees of freedom.
Thus the infinite DPP representation is approximated by one truncated at $K \le n$ components,
with appropriate values $u_{pi}$ for prefecture i chosen according to an allocation
indicator $S_i \in (1, \ldots, K)$. The probabilities $\pi_k$ of allocation to clusters $\{1, \ldots, K\}$ are
determined by K − 1 beta distributed random variables $V_k \sim \text{Beta}(1, \kappa)$, with unknown
concentration parameter κ, and $V_K = 1$ to ensure the random weights $\pi_k$ sum to 1 (Ishwaran
and James, 2001; Sethuraman, 1994). Then $\pi_1 = V_1$ and

$$\pi_k = (1 - V_1)(1 - V_2) \cdots (1 - V_{k-1}) V_k, \quad k > 1.$$

Following Ishwaran and Zarepour (2000, p.377), the gamma prior for κ, namely
$\kappa \sim Ga(\nu_1, \nu_2)$, has relatively large ν1 and ν2, with ν2 set larger than ν1. Such a setting
discourages small and large values for κ. Here ν2 = 4 and ν1 = 2. The maximum possible
number of clusters is set at K = 20.
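The truncated stick-breaking construction can be sketched in a few lines of R. This is a prior simulation only, not the jagsUI model: the function name `stick_break`, the seed, and the number of areas `n_areas` are illustrative assumptions, and the gamma prior is taken to be parameterised by shape and rate.

```r
set.seed(3)                                  # arbitrary seed
K <- 20                                      # truncation level, as in the example
kappa <- rgamma(1, shape = 2, rate = 4)      # one prior draw from Ga(2, 4)

stick_break <- function(K, kappa) {
  V <- c(rbeta(K - 1, 1, kappa), 1)          # V_K = 1 so that the weights sum to one
  V * c(1, cumprod(1 - V[-K]))               # pi_1 = V_1; pi_k = V_k * prod_{l<k} (1 - V_l)
}

pi_k <- stick_break(K, kappa)
n_areas <- 50                                # hypothetical number of areas
S <- sample.int(K, size = n_areas, replace = TRUE, prob = pi_k)  # allocation indicators
c(sum_weights = sum(pi_k), non_empty_clusters = length(unique(S)))
```

Small values of κ concentrate the weights on the first few clusters, so the number of non-empty clusters is typically well below the truncation level K.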
Estimation using jagsUI shows early convergence and replicates Example 9.3 in show-
ing mostly non-significant predictor effects. There are, however, significant positive
effects of unemployment on rape, and of urban centre on manslaughter, and a signifi-
cant negative effect of GDP levels on manslaughter. The posterior mean for κ is 1.26,
with the average number of non-empty clusters K* being 8.26.
Extreme residuals, and departures from normality, are associated with poorly fit-
ted cases (with extreme response values and high pointwise LOO-IC). For example,
Figure 9.3 plots out positively skewed mean residuals for u3i (manslaughter), with the
most extreme positive residual for the elevated observation y3,14.

Example 9.8 Maths Aptitude; Skew Probit


This example uses binary item data from Tanner (1996) concerning maths aptitude; there
are n = 39 students and P = 6 items. An augmented data probit regression is applied, and
an extension to a skew probit link is adopted, following the approach of Bazan et al.
(2006).
So latent metric data $y^*_{pi}$ underlying the observed binary response are sampled
according to

$$y^*_{pi} \sim N(\alpha_p + \lambda_p F_i + \delta_p V_{pi}, 1 - \delta_p^2),$$

with $F_i \sim N(0, 1)$ and the λp all being unknowns. A U(−1,1) prior is adopted on the δp
parameters, with the prior on discrimination parameters

$$\lambda_p \sim N(1, 0.5)\, I(0, \infty),$$

providing an identifying constraint (Sahu, 2002).


FIGURE 9.3
Histogram of residuals, manslaughter. Horizontal axis: Posterior Means; vertical axis: Frequency.

A two-chain run of 10,000 iterations (with convergence at under 1,000) shows none
of the δp (and hence κp) parameters to be significantly positive or negative. Despite the
apparent absence of skew, this model has a lower LOO-IC than the symmetric probit,
297 as against 322. Posterior mean percentages of predictive concordance for the six
items are also higher under the skew probit, namely (57.5,59.1,65.9,66.5,70.5,66.3).

Example 9.9 Student t Factor Model


This example involves simulated continuous data (n = 200 observations) derived using
a multivariate t5 density for the factor scores. The simulated data are then re-analysed
using different assumptions of multivariate normal factors and multivariate Student t
factors. The simulation is focused on exploratory factor analysis with P = 6 indicators
and Q = 2 factors, and with the same assumptions on priors for loadings as in Example
9.2. The code for the simulation takes account of inferences being potentially influenced
by order dependence in the indicators, and the loading of the second factor on the second
indicator is set to ensure the second indicator is an effective factor founder. The
code is as follows, assuming 5 degrees of freedom in the Student t, and requires the
library mvtnorm:

     library(mvtnorm)   # provides rmvt() and rmvnorm()
     n <- 200   # number of observations
     p <- 6     # number of indicators
     q <- 2     # number of factors
     # Loading matrix (p x q), entered row by row
     Lambda <- matrix(c( 1.1, -0.1,
                        -0.2,  1.2,
                         0.8,  0,
                        -0.9,  0.6,
                         0.3,  0.7,
                        -0.8,  0.9),
                      nrow = p, ncol = q, byrow = TRUE)
     DF <- 5                               # degrees of freedom of the Student t
     S <- diag(q)                          # identity scale matrix
     mu <- c(0, 0)                         # factor means
     # MVT factor scores; sigma is rescaled so that the factors have unit variance
     Fs <- sweep(rmvt(n, sigma = S * (DF - 2) / DF, df = DF), 2, mu, "+")
     e <- rmvnorm(n, rep(0, p), diag(p))   # N(0,1) errors
     y <- Fs %*% t(Lambda) + e             # indicator matrix (n x p)
     y <- scale(y)                         # standardise the indicators

Table 9.8 shows the estimated loadings obtained under maximum likelihood (via
the lavaan package), using Bayesian analysis with MVN factors (via jagsUI), and using
Bayesian estimation with MVT factors, obtained using scale mixing (via the rube pack-
age). The prior for the unknown degrees of freedom ν follows Juárez and Steel (2010).
The first factor from the maximum likelihood analysis is reverse signed, but otherwise
the loadings are similar between the alternative estimation methods. The respective
LOO-IC values for the MVN and MVT factor models are 1577 and 1557. All estimated
loadings show shrinkage from the generating loadings.
The MVT analysis provides a posterior mean (95% CRI) of 12.4 (4.0,34.7) for ν, as com-
pared to the generating value ν = 5. The posterior median for ν of 10.1 is a better estima-
tor of the generating value. Around 10% of the observations have scale adjustments
under 0.8, and two observations (103, 152) have scale adjustments with 95% credible
intervals entirely below 1.

9.6 Multivariate Spatial Priors for Discrete Area Frameworks


Consider multivariate spatial responses $(y_{1i}, \ldots, y_{Pi})'$ of dimension P from an exponential
family density observed over n discrete areas (e.g. administrative regions). Conditional
on random spatial effects $s_i = (s_{1i}, \ldots, s_{Pi})'$ of the same dimension, and predictors
$x_i = (x_{1i}, \ldots, x_{Ri})'$, one then has

$$p(y_{pi} \mid s_{pi}, x_i) \propto \exp\left\{\frac{y_{pi}\theta_{pi} - b(\theta_{pi})}{\phi_{pi}} + c(y_{pi}, \phi_{pi})\right\}$$

where $\theta_{pi}$ is the canonical parameter, and $\phi_{pi}$ a known scale. Denoting regression terms
as $\eta_{pi} = g(\theta_{pi})$ with link g, the $s_{pi}$ are included to measure spatially configured but
unmeasured predictors. So, one has at a minimum the representation

$$\eta_{pi} = \alpha_p + \beta_p x_i + s_{pi},$$

where the spatial effects for area i, $s_i = (s_{1i}, \ldots, s_{Pi})'$, follow a multivariate spatial prior. For
certain definitions of spatial effects, it may be appropriate to also include unstructured (i.e.
exchangeable over areas) multivariate effects, in line with a multivariate form of the Besag
et al. (1991) convolution prior. Thus, the full dimension analogue to the convolution prior is

$$\eta_{pi} = \alpha_p + \beta_p x_i + s_{pi} + u_{pi}, \qquad (9.6)$$

where the $u_{pi}$ also follow a multivariate prior. Other possibilities, following Chapter 6,
include regression effects $\beta_{pi}$ that vary spatially as well as over response variables.

TABLE 9.8
EFA of Simulated Data, Estimated Loadings (Bayesian Estimates: Mean, 95% CRI)
            Maximum Likelihood    Bayesian Estimation, MVN Factors               Bayesian Estimation, MVT Factors
                                  Factor 1               Factor 2                Factor 1               Factor 2
Indicator   Factor 1   Factor 2   Mean    2.5%    97.5%  Mean    2.5%    97.5%   Mean    2.5%    97.5%  Mean    2.5%    97.5%
y1          −0.71      0.00       0.69    0.50    0.88   0.01    −0.19   0.25    0.61    0.44    0.79   0.00    −0.18   0.17
y2          0.30       0.78       −0.11   −0.96   0.90   0.84    0.48    1.41    −0.13   −0.79   0.66   0.71    0.43    1.23
y3          −0.75      0.00       0.75    0.46    1.06   0.08    −0.25   0.49    0.65    0.44    0.91   0.05    −0.21   0.35
y4          0.68       0.19       −0.63   −0.88   −0.28  0.14    −0.20   0.48    −0.56   −0.78   −0.28  0.14    −0.10   0.41
y5          0.45       0.00       0.15    −0.42   0.92   0.52    0.28    1.06    0.12    −0.37   0.72   0.46    0.26    0.85
y6          0.59       0.43       −0.47   −0.94   0.16   0.45    0.10    0.90    −0.44   −0.84   0.09   0.40    0.15    0.75


Conditions for a valid multivariate spatial prior, specifically a multivariate Gaussian
Markov random field (MGMRF), are discussed by MacNab (2018), Rue and Held (2005,
section 2.2), and Banerjee et al. (2004, section 9.4). Thus denote the nP length vector over all
areas as $s = (s_1, \ldots, s_n)'$, and denote the mean vector, possibly including regression effects,
as $\mu_i = (\mu_{1i}, \ldots, \mu_{Pi})$, with $\mu = (\mu_1, \ldots, \mu_n)'$. Also denote the matrix describing observed spatial
interactions in the region by $W = [w_{ij}]$, with $w_{ij} = w_{ji}$, and set

$$D = \text{Diag}(d_1, \ldots, d_n),$$

where $d_i = \sum_{j \ne i} w_{ij}$. If $w_{ij} = 1$ when areas i and j are contiguous, and zero otherwise (binary
adjacency), then $d_i$ is the number of neighbours for area i. The neighbourhood for area i is
often denoted ∂i, and if area j is a neighbour of area i, then the neighbour relation (under
binary interaction) is denoted j ~ i.
The joint density for a normal MGMRF for P spatial effects and with nP × nP precision
matrix Q may be expressed

$$p(s \mid Q) = \left(\frac{1}{2\pi}\right)^{nP/2} |Q|^{0.5} \exp\left[-\tfrac{1}{2}(s - \mu)' Q (s - \mu)\right]$$

$$= \left(\frac{1}{2\pi}\right)^{nP/2} |Q|^{0.5} \exp\left[-\tfrac{1}{2}\sum_{i,j} (s_i - \mu_i)' Q_{ij} (s_j - \mu_j)\right].$$

Q is a block matrix with P × P sub-matrices $Q_{ij}$ that, for $i \ne j$, are non-zero (zero) if area j is
(is not) a neighbour of area i. Retaining the possibility of a regression model in the means
$\mu_i$ (Rue and Held, 2005), the corresponding full conditional density is

$$s_i \mid s_{[i]} \sim N\left(\mu_i - Q_{ii}^{-1} \sum_{j \ne i} Q_{ij}(s_j - \mu_j),\; Q_{ii}^{-1}\right),$$

with conditional precision matrices

$$\text{Prec}(s_i \mid s_{[i]}) = Q_{ii} = \Delta_i.$$

Equivalently define P × P matrices $B_{ij} = -Q_{ii}^{-1} Q_{ij}$, with $B_{ii} = 0$, and $\Delta_i = Q_{ii}$. Then

$$E(s_i \mid s_{[i]}) = \mu_i + \sum_{j \ne i} B_{ij}(s_j - \mu_j) \qquad (9.7)$$

$$\text{Prec}(s_i \mid s_{[i]}) = \Delta_i.$$

Most commonly the $\mu_i$ are set to zero.
Under the parameterisation in (9.7), the joint density has mean μ and precision matrix
Q = Δ(I − B), where Δ is block diagonal with blocks $\Delta_i$, and the nP × nP matrix B has
(i,j)th block $B_{ij}$. The requirements for a valid joint density to exist (for example,
if specification is starting from a prior involving the full conditionals) are that (Sain
and Cressie, 2007; Rue and Held, 2005, p.31)

$$\Delta_i B_{ij} = \Delta_j B_{ji}.$$

For example, setting $B_{ij} = [w_{ij}/d_i] I_{P \times P}$, and $\Delta_i = d_i \zeta$ (where ζ is a P × P precision matrix) will
ensure a valid joint density.
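The closing example can be verified numerically. The R sketch below assembles Δ and B for a small, hypothetical map (four areas on a line) and P = 2 outcomes, checks the symmetry condition block by block, and confirms that Q = Δ(I − B) equals (D − W) ⊗ ζ when effects are stacked area by area. The adjacency structure, the matrix ζ, and the helper `blk` are illustrative assumptions.

```r
# Hypothetical example: n = 4 areas on a line (1-2-3-4), P = 2 outcomes.
W <- matrix(0, 4, 4)
W[cbind(1:3, 2:4)] <- 1
W <- W + t(W)                                # symmetric binary adjacency
d <- rowSums(W)                              # number of neighbours d_i
n <- nrow(W); P <- 2
zeta <- matrix(c(2, 0.5, 0.5, 1), P, P)      # hypothetical P x P precision matrix

# Delta is block diagonal with blocks d_i * zeta; B has blocks (w_ij / d_i) * I_P.
Delta <- kronecker(diag(d), zeta)
B     <- kronecker(W / d, diag(P))           # W / d divides row i of W by d_i
Q     <- Delta %*% (diag(n * P) - B)

# Extract the (i, j)th P x P block of an nP x nP matrix.
blk <- function(M, i, j) M[(i - 1) * P + 1:P, (j - 1) * P + 1:P]

# Symmetry condition Delta_i B_ij = Delta_j B_ji for every pair of areas.
sym_ok <- all(sapply(1:n, function(i) sapply(1:n, function(j)
  isTRUE(all.equal(blk(Delta, i, i) %*% blk(B, i, j),
                   blk(Delta, j, j) %*% blk(B, j, i))))))
sym_ok
isTRUE(all.equal(Q, kronecker(diag(d) - W, zeta)))   # Q = (D - W) (x) zeta
```

Because D − W is singular, this particular Q defines an intrinsic (improper) prior; a propriety parameter multiplying W is needed for an invertible precision matrix.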
A number of multivariate priors which incorporate spatial dependence between areas
have been proposed. The generalisation of the intrinsic univariate CAR to a multivariate
setting is denoted as the multivariate CAR or MCAR prior (Mardia, 1988; Jin et al.,
2005, equation 6; Song et al., 2006, p.254). This takes the vector of multivariate area
effects $s = (s_{11}, s_{12}, \ldots, s_{1n}; \ldots; s_{P1}, s_{P2}, \ldots, s_{Pn})$ as multivariate normal with mean consisting
of a vector of zeros of length nP, and with nP × nP precision matrix, $Q = (D - \alpha W) \otimes \zeta$,
namely

$$p(s \mid \zeta, \alpha) = \left(\frac{1}{2\pi}\right)^{nP/2} |D - \alpha W|^{P/2}\, |\zeta|^{n/2} \exp\left[-\tfrac{1}{2} s' Q s\right], \qquad (9.8)$$

where $\alpha \in (0, 1)$ is a propriety parameter. The P × P positive definite symmetric matrix $\zeta^{-1}$
describes covariation between the outcomes, and D − αW is the precision matrix for the spatial
effects. The latter matrix can also be written as D(I − αB) where $B = D^{-1}W$. Let the effects
be arranged by variable rather than subject, so that $S_1 = (s_{11}, s_{12}, \ldots, s_{1n})'$, $S_2 = (s_{21}, s_{22}, \ldots, s_{2n})'$,
etc., then for P = 2, the joint prior is

$$\begin{pmatrix} S_1 \\ S_2 \end{pmatrix} \sim N\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{bmatrix} \zeta_{11}(D - \alpha W) & \zeta_{12}(D - \alpha W) \\ \zeta_{12}(D - \alpha W) & \zeta_{22}(D - \alpha W) \end{bmatrix}^{-1}\right),$$

where each submatrix $\zeta_{pq}(D - \alpha W)$ is of dimension n × n.
The conditional prior under (9.8) for $s_i = (s_{1i}, \ldots, s_{Pi})$ given the remaining effects
$s_{[i]} = (s_1, \ldots, s_{i-1}, s_{i+1}, \ldots, s_n)$ is multivariate normal with means $E(s_i \mid s_{[i]}) = (M_{1i}, \ldots, M_{Pi})$, where

$$E(s_{pi} \mid s_{[i]}) = M_{pi} = \alpha \sum_{j \ne i} w_{ij} s_{pj} \Big/ \sum_{j \ne i} w_{ij}$$

and with precisions

$$\text{Prec}(s_i \mid s_{[i]}) = d_i \zeta.$$

If the $w_{ij}$ are set to 1 for neighbouring areas and to 0 otherwise, then the $M_{pi} = \alpha \sum_{j \in \partial i} s_{pj}/d_i$
are locality averages (times α) of the spatial effect for the pth response. Setting α = 1 provides
the multivariate version of the intrinsic CAR prior of Besag et al. (1991); such intrinsic
GMRFs (for spatial and non-spatial priors) are considered by Rue and Held (2005,
Chapter 3).
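The role of the propriety parameter can be illustrated by examining the smallest eigenvalue of D − αW, and the conditional means can be computed directly as scaled neighbour averages. The R sketch below uses a hypothetical 2 × 3 lattice with rook adjacency; the lattice, the α values, and the spatial effects `s_p` are illustrative assumptions.

```r
# Hypothetical 2 x 3 lattice of areas with rook adjacency.
coords <- expand.grid(row = 1:2, col = 1:3)
W <- 1 * (as.matrix(dist(coords)) == 1)      # binary adjacency: areas at unit distance
D <- diag(rowSums(W))

# Smallest eigenvalue of the spatial precision D - alpha * W.
min_eig <- function(alpha) min(eigen(D - alpha * W, symmetric = TRUE)$values)
sapply(c(0.5, 0.9, 1), min_eig)              # positive for alpha < 1, zero at alpha = 1

# Conditional means for one outcome: alpha times the neighbour average.
alpha <- 0.9
s_p <- c(0.2, -0.1, 0.4, 0.3, -0.5, 0.1)     # hypothetical spatial effects for outcome p
M_p <- alpha * as.vector(W %*% s_p) / rowSums(W)
M_p
```

At α = 1 the precision matrix is singular, which is why the intrinsic version is improper and is usually combined with a centring constraint on the effects.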
MacNab (2007) discusses a multivariate extension of the Leroux et al. (1999) prior, which
allows the data to determine the appropriate mix between spatial or exchangeable dependence.
This may be achieved with a single set of random effects $r_{pi}$ rather than the two
sets $\{s_{pi}, u_{pi}\}$ present in the multivariate extension (9.6) of the convolution prior. Thus with
$r_i = (r_{1i}, \ldots, r_{Pi})'$, parameter $\kappa \in (0, 1)$, and spatial interactions $W = [w_{ij}]$

$$E(r_i \mid r_{[i]}) = [M_{1i}, \ldots, M_{Pi}] = \frac{\kappa \sum_{j \ne i} w_{ij} I_P r_j}{1 - \kappa + \kappa \sum_{j \ne i} w_{ij}}$$

$$\text{Prec}(r_i \mid r_{[i]}) = \left[1 - \kappa + \kappa \sum_{j \ne i} w_{ij}\right] \zeta$$

where, as above, ζ is the within area precision matrix of dimension P × P. Thus

$$B_{ij} = \left[\frac{\kappa w_{ij}}{1 - \kappa + \kappa \sum_{j \ne i} w_{ij}}\right] I_{P \times P}$$

$$\Delta_i = \left[1 - \kappa + \kappa \sum_{j \ne i} w_{ij}\right] \zeta$$

and $\Delta_i B_{ij} = \Delta_j B_{ji}$ holds. When the $w_{ij}$ are binary adjacency indicators, with $d_i$ the number of
neighbours of area i, the conditional expectations become

$$E(r_{pi} \mid r_{[i]}) = M_{pi} = \frac{\kappa \sum_{j \in \partial i} r_{pj}}{1 - \kappa + \kappa d_i}.$$

Define

$$H = \text{diag}\left(1 - \kappa + \kappa \sum_{j \ne 1} w_{1j}, \ldots, 1 - \kappa + \kappa \sum_{j \ne n} w_{nj}\right) = (1 - \kappa) I_n + \kappa D.$$

Then the joint density is multivariate normal with mean vector 0 and nP × nP precision
matrix $(H - \kappa W) \otimes \zeta$.
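The joint precision matrix of this multivariate Leroux prior is easy to build and inspect. The R sketch below uses a hypothetical four-area line graph and a hypothetical ζ; it evaluates the precision at the limiting values κ = 0 and κ = 1 purely to illustrate the two extremes between which the prior interpolates, and checks positive definiteness at an interior value.

```r
W <- matrix(0, 4, 4); W[cbind(1:3, 2:4)] <- 1; W <- W + t(W)   # line graph 1-2-3-4
n <- nrow(W); P <- 2
D <- diag(rowSums(W))
zeta <- matrix(c(2, 0.5, 0.5, 1), P, P)      # hypothetical within-area precision matrix

# Joint precision (H - kappa * W) (x) zeta, with H = (1 - kappa) I_n + kappa D.
leroux_precision <- function(kappa) {
  H <- (1 - kappa) * diag(n) + kappa * D
  kronecker(H - kappa * W, zeta)
}

Q_indep <- leroux_precision(0)    # limiting case: exchangeable effects, I_n (x) zeta
Q_icar  <- leroux_precision(1)    # limiting case: intrinsic MCAR, (D - W) (x) zeta
Q_mid   <- leroux_precision(0.5)
min(eigen(Q_mid, symmetric = TRUE)$values) > 0   # proper for kappa strictly inside (0, 1)
```

Since H − κW = (1 − κ)I + κ(D − W), the precision is a weighted combination of an identity and an intrinsic CAR structure, which is what allows the data to choose the mix.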
Jin et al. (2005) propose a generalised MCAR (GMCAR) model whereby the joint distribution
for a multivariate spatial effect is obtained by specifying a sequence of conditional
and marginal models. Let effects be arranged by variable rather than subject. Then
for a bivariate spatial effect with $P(S_1, S_2) = P(S_1 \mid S_2)P(S_2)$, where $S_1 = (s_{11}, s_{12}, \ldots, s_{1n})'$ and
$S_2 = (s_{21}, s_{22}, \ldots, s_{2n})'$, one has

$$\begin{pmatrix} S_1 \\ S_2 \end{pmatrix} \sim N\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{12}' & \Sigma_{22} \end{bmatrix}\right),$$

where $E(S_1 \mid S_2) = \Sigma_{12}\Sigma_{22}^{-1}S_2$, and $\text{var}(S_1 \mid S_2) = \Sigma_{11.2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{12}'$. Hence with $G = \Sigma_{12}\Sigma_{22}^{-1}$,
one has equivalently

$$\begin{pmatrix} S_1 \\ S_2 \end{pmatrix} \sim N\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{bmatrix} \Sigma_{11.2} + G\Sigma_{22}G' & G\Sigma_{22} \\ (G\Sigma_{22})' & \Sigma_{22} \end{bmatrix}\right).$$

To specify the joint distribution of S1 and S2, it is therefore necessary to specify the matrices
$\Sigma_{11.2}$, $\Sigma_{22}$, and G.
Taking $\Sigma_{11.2}^{-1} = \tau_1[D - \alpha_1 W]$, $\Sigma_{22}^{-1} = \tau_2[D - \alpha_2 W]$ and $G = \gamma_0 I + \gamma_1 W$, the marginal joint prior for
the second set of effects is then

$$S_2 \sim N(0, \tau_2^{-1}[D - \alpha_2 W]^{-1}),$$

and the conditional prior for the first set of effects is

$$S_1 \mid S_2 \sim N(G S_2, \tau_1^{-1}[D - \alpha_1 W]^{-1}).$$

As above the $0 < \alpha_p < 1$ are propriety parameters, and the γ0 parameter links different
variable-same area effects, namely regresses $s_{1i}$ on $s_{2i}$, while γ1 links $s_{1i}$ with other
variable-other area effects $\{s_{2j}, j \ne i\}$. This approach is possibly more suitable for small P,
as P! conditional density sequences are possible, and may give different inferences or fits, though
Jin et al. (2005, p.957) demonstrate how initial regression analysis may lead one to prefer
one sequence to another.
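The GMCAR construction can be checked numerically by building the joint covariance from the three specified matrices and then recovering the conditional moments. The R sketch below does this for a hypothetical four-area line graph; all parameter values are arbitrary illustrative choices.

```r
W <- matrix(0, 4, 4); W[cbind(1:3, 2:4)] <- 1; W <- W + t(W)   # line graph 1-2-3-4
n <- nrow(W); D <- diag(rowSums(W))

# Hypothetical parameter values.
tau1 <- 1.5; tau2 <- 0.8; a1 <- 0.6; a2 <- 0.9; g0 <- 0.4; g1 <- 0.1

Sigma11.2 <- solve(tau1 * (D - a1 * W))      # var(S1 | S2)
Sigma22   <- solve(tau2 * (D - a2 * W))      # var(S2)
G         <- g0 * diag(n) + g1 * W           # regression of S1 on S2

# Joint covariance of (S1, S2) implied by the conditional and marginal models.
Sigma <- rbind(cbind(Sigma11.2 + G %*% Sigma22 %*% t(G), G %*% Sigma22),
               cbind(t(G %*% Sigma22),                   Sigma22))

# Recover the conditional moments from the joint covariance.
S12      <- Sigma[1:n, n + 1:n]
G_back   <- S12 %*% solve(Sigma22)
cond_var <- Sigma[1:n, 1:n] - S12 %*% solve(Sigma22) %*% t(S12)
c(G_recovered   = isTRUE(all.equal(G_back, G)),
  var_recovered = isTRUE(all.equal(cond_var, Sigma11.2)))
```

Reversing the conditioning order (modelling S2 given S1) with the same parameter values would generally give a different joint covariance, which is the order dependence noted above.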
The linear co-regionalisation model of Jin et al. (2007) avoids dependence on any particular
ordering. Assuming binary adjacency, the most general option in Jin et al. (2007),
namely Case 3 (dependent and non-identical latent processes), specifies a conditional mean

$$E(s_{pi} \mid s_{p,k \ne i}, s_{q \ne p,i}, s_{q \ne p,k \ne i}) = \alpha_{pp} \sum_{k \sim i} s_{pk}/d_i + \sum_{q \ne p}\left(\alpha_{pq} \sum_{k \sim i} s_{qk}/d_i\right),$$

where $\alpha_{pp}$ is the spatial autocorrelation measure for the pth outcome, and $\alpha_{pq}$ is a cross-spatial
correlation between $S_p$ and $S_q$. The joint distribution (Martinez-Beneito, 2013, p.4) may
be represented

$$s \sim N_{nP}\left\{0, \begin{pmatrix} (D - \alpha_{11}W)\zeta_{11} & \cdots & (D - \alpha_{1P}W)\zeta_{1P} \\ \vdots & \ddots & \vdots \\ (D - \alpha_{1P}W)\zeta_{1P} & \cdots & (D - \alpha_{PP}W)\zeta_{PP} \end{pmatrix}\right\}$$

with $\vartheta = [\zeta_{pq}]$ denoting the within area between disease precision matrix.
Martinez-Beneito (2013) represents the joint prior for the spatial error s of length nP in
the generic form $s \sim N_{nP}(0, \Sigma_b \otimes \Sigma_w)$ where $\Sigma_b$ and $\Sigma_w$ represent between and within disease
covariance matrices. Denoting $\tilde{\Sigma}_b$ and $\tilde{\Sigma}_w$ as lower triangular matrices such that $\Sigma_b = \tilde{\Sigma}_b \tilde{\Sigma}_b^T$
and $\Sigma_w = \tilde{\Sigma}_w \tilde{\Sigma}_w^T$, one has that $s = \tilde{\Sigma}_w \varepsilon \tilde{\Sigma}_b^T$ with ε of dimension n × P, consisting of
independent N(0,1) variates. Representing $\phi = \tilde{\Sigma}_w \varepsilon$ as a matrix with P columns containing a set of
particular spatial distributions (e.g. P independent ICAR densities), then interdependence
is induced via the product form $s = \text{vec}(\phi \tilde{\Sigma}_b^T)$, which has covariance $\Sigma_b \otimes (D - W)^{-1}$. If ϕ
consists of independent ICAR(αp) densities, then Case 2 of Jin et al. (2007) is obtained. More
flexibility is obtained by representing $\Sigma_b$ as $\Sigma_b = \tilde{\Sigma}_b C C^T \tilde{\Sigma}_b^T$ where C is any square orthogonal
matrix, which enables reproduction of Case 3 of Jin et al. (2007).
Factor Analysis, Structural Equation Models, and Multivariate Priors 379

9.7 Spatial Factor Models


When high correlations are evident in ζ−1, common spatial factor models may be more parsimonious (Tzala and Best, 2007; Congdon et al., 2007; Liu et al., 2005; Gielen et al., 2017). These can extend to full structural equation models (e.g. Arhonditsis et al., 2006b). Standard presentations of the normal linear and generalised linear factor models assume factor scores are independent over subjects, though in fact they might be spatially or temporally structured. So for P outcomes, the factor scores F of dimension Q < P may be correlated over both variables and areas. Then for Poisson or binomial responses with mean $\mu_{pi} = g^{-1}(\eta_{pi})$ for the pth dependent variable and area i, one might have a regression term

$$\eta_{pi} = \alpha_p + \beta_p x_i + \Lambda_p F_i,$$

where the vector Λp is of dimension Q, and the factor score variables Fi = (F1i, …, FQi)′ are spatially dependent over areas i, as well as mutually intercorrelated. For example, a MCAR prior would specify the joint pairwise difference density for the factor scores

$$p(F \mid \Sigma_F) \propto |\Sigma_F|^{-n/2} \exp\left[-0.5 \sum_{i,j} w_{ij} (F_i - F_j)' \Sigma_F^{-1} (F_i - F_j)\right].$$

As in other factor models, constraints are required to deal with label switching and location, scale, and rotational indeterminacy. Constraining one or more loadings to be positive is one strategy for avoiding label switching (Mar-Dell’Olmo et al., 2011). In the multivariate CAR model for Fi = (F1i, …, FQi)′, the location is fixed in practice by centring each of the Q sets of spatial factor scores at each MCMC iteration. Scale may be determined by fixing the Q variances of the Fqi scores at 1, or by fixing one of the loadings (λ1q, …, λPq) linking the P manifest indicators to the qth factor. Additional loadings would need to be fixed to avoid rotational indeterminacy, typically λpq = 0 for q > p. For example, if Q = 2, and the variances of the F scores are free parameters, then the two loadings λqq may be set to 1 to define the scale, while rotational invariance is avoided by setting λ12 = 0.

Example 9.10 Chronic Disease Prevalence


This example contrasts covariance models for multivariate spatial outcomes with a
common spatial factor approach. It considers prevalence totals for common chronic
diseases for 56 wards (small political areas) in three London boroughs (Barking and
Dagenham, Havering, and Redbridge). The P = 3 outcomes are diagnosed counts ypi of
diabetes, hypertension, and chronic kidney disease (CKD) in 2016. Offsets are expected
prevalence totals Epi, based on region-wide age-specific rates.
The first model applied is the multivariate generalisation of the Leroux et al. (1999) conditional autoregressive prior under a Poisson likelihood, with binary adjacencies wij, and di the total number of areas adjacent to area i. The regression involves a constant term and spatially configured effects (s1i, …, sPi) of the same dimension as the response vector

$$y_{pi} \sim Po(E_{pi} e^{\eta_{pi}}),$$

$$\eta_{pi} = \alpha_p + s_{pi}.$$
380 Bayesian Hierarchical Models

The conditional mean of spi is

$$E(s_{pi} \mid s_{[i]}) = S_{pi} = \frac{\kappa \sum_{k \sim i} s_{pk}}{[1 - \kappa + \kappa d_i]},$$

where κ is between 0 and 1. A Wishart prior for the conditional precision matrix Ψ, with prior mean covariance I, is assumed.
This model is estimated using both the CARBayes and R2OpenBUGS packages. The first option, using the MVS.CARleroux function (Kavanagh et al., 2016), provides an estimate of 0.92 for κ and a DIC of 1748.
Under the second option, early convergence is attained in a two-chain run of 10,000 iterations, with a LOO-IC of 1811 and DIC of 1749. Setting Φ = Ψ−1, posterior mean correlations $r_{jk} = \Phi_{jk}/(\Phi_{jj}\Phi_{kk})^{0.5}$ are highest (0.335) between hypertension and CKD. The spatial parameter κ is estimated at 0.97. Six of the 3 × 56 = 168 observations have mixed predictive p-tests under 0.05 or over 0.95 (Marshall and Spiegelhalter, 2007), with under-prediction most apparent for CKD in ward 16, and over-prediction most apparent for diabetes in ward 56.
A second analysis uses the Martinez-Beneito (2013) implementation of Case 3 of Jin et al. (2007), which allows for distinct spatial parameters for each outcome, and also for between disease dependence within areas. The CAR(γp) prior, implemented via the proper.car function within BUGS, is used to model between area spatial dependence within outcomes, where γp represents spatial dependence for outcome p. The precision parameters (tau[j] in the proper.car function in the code) are set to 1 for identifiability. Binary adjacency is assumed with conditional spatial variances proportional to 1/di, where di is the number of areas adjacent to area i. The parameter γp has a value between bounds given by the inverse of the minimum and maximum eigenvalues of the matrix $M^{-0.5} W_s M^{0.5}$, where Ws is the row standardised adjacency matrix and M = diag(1/di).
A two-chain run of 100,000 iterations gives significant correlations (a) between CKD and hypertension, with mean (95% CRI) of 0.57 (0.30, 0.72), and (b) between hypertension and diabetes, namely 0.51 (0.03, 0.71). The γ parameters are similar between the outcomes, with respective posterior means 0.937, 0.933, and 0.943. The overall LOO-IC is estimated at 1828.
The third model combines a common spatial factor and unstructured outcome-specific random effects upi, whereby

$$\eta_{pi} = \alpha_p + \lambda_p F_i + u_{pi}$$

where Fi follows a univariate Leroux et al. (1999) prior. Thus for κ ∈ (0, 1), precision parameter τF, and with F[i] = (F1, …, Fi−1, Fi+1, …, Fn), and binary spatial interactions, the conditional mean and precision for ward i are

$$E(F_i \mid F_{[i]}) = \frac{\kappa \sum_{j \sim i} F_j}{[1 - \kappa + \kappa d_i]},$$

and

$$\mathrm{Prec}(F_i \mid F_{[i]}) = [1 - \kappa + \kappa d_i]\,\tau_F.$$

With $\tau_F = \sigma_F^{-2}$ taken as an unknown, with prior σF ~ U(0,1000), one of the loadings λj must be fixed for identification, and accordingly λ1 = 1. This model has a LOO-IC of 1823, with κ estimated as 0.89.
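The conditional mean and precision of the Leroux prior can be checked numerically. The following is only a minimal Python sketch, not the BUGS or CARBayes code used for the example; the function name `leroux_conditional`, the four-area line map, and the values of `F`, `kappa` and `tau` are all hypothetical and chosen for illustration.

```python
def leroux_conditional(i, values, neighbours, kappa, tau):
    """Conditional mean and precision of effect i given all other effects.

    values     : current effects F_1..F_n (list of floats)
    neighbours : dict mapping area index -> list of adjacent area indices
    kappa      : spatial dependence parameter in [0, 1]
    tau        : precision parameter tau_F
    """
    nbrs = neighbours[i]
    d_i = len(nbrs)                          # number of adjacent areas
    denom = 1.0 - kappa + kappa * d_i        # [1 - kappa + kappa * d_i]
    mean = kappa * sum(values[j] for j in nbrs) / denom
    precision = denom * tau
    return mean, precision


# Hypothetical map of four areas on a line: 0 - 1 - 2 - 3
neighbours = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
F = [0.4, -0.1, 0.3, 0.2]

# kappa = 1 gives the intrinsic CAR form: mean is the neighbour average
icar_mean, icar_prec = leroux_conditional(1, F, neighbours, kappa=1.0, tau=2.0)
# kappa = 0 gives unstructured effects: mean 0, precision tau
iid_mean, iid_prec = leroux_conditional(1, F, neighbours, kappa=0.0, tau=2.0)
```

The two limiting cases illustrate why κ is read as a spatial dependence parameter: at κ = 1 the conditional mean is the average of adjacent effects, while at κ = 0 the effects are exchangeable with common precision.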

9.8 Multivariate Time Series


Multivariate time series can occur in several ways. One example is where the same
measurement process (e.g. repeated environmental readings) is carried out at several
locations and where high correlation between the series is expected. Another situation
occurs with financial data, such as exchange rates or stock returns, where high correla-
tions raise questions such as whether there are feedbacks between different series, or
whether common factors (e.g. market risk) affect all series. Multivariate time series for
count data are also an increasing focus (Aktekin et al., 2017; Chapados, 2014), with appli-
cations in political science (Brandt and Freeman, 2005) and ecology (Wang et al., 2012).
Overviews of Bayesian methods for multivariate time series include Koop and Korobilis
(2010) and Sims and Zha (1998).
Classical multivariate time series analysis includes extending the ARMA model to vector responses (Tiao and Tsay, 1989; Reinsel, 2003). For observation vector yt = (y1t, …, yPt)′, the vector ARMA(R,S) model has the form

$$y_t = \mu + \Phi_1 y_{t-1} + \cdots + \Phi_R y_{t-R} + u_t - \Theta_1 u_{t-1} - \cdots - \Theta_S u_{t-S},$$

where the coefficient matrices are all of order P × P, and ut denotes P-variate white noise, with E(ut) = 0, and

$$E(u_t u'_{t-k}) = 0, \quad k \neq 0;$$

$$E(u_t u'_{t-k}) = \Sigma, \quad k = 0.$$

For the vector autoregressive or VAR model obtained on omitting moving average terms, stationarity requires that the roots of the characteristic equation

$$\det(I - \Phi_1 z - \cdots - \Phi_R z^R) = 0$$

lie outside the unit circle.
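For a VAR(1) model the root condition is equivalent to all eigenvalues of the coefficient matrix lying inside the unit circle, since each root of det(I − Φ1 z) = 0 is the reciprocal of a non-zero eigenvalue of Φ1. The sketch below checks this for the bivariate case in plain Python; the helper names and the two coefficient matrices are hypothetical illustrations rather than estimates from the text.

```python
import cmath


def eigenvalues_2x2(a):
    """Eigenvalues of a 2 x 2 matrix via the characteristic polynomial."""
    trace = a[0][0] + a[1][1]
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    disc = cmath.sqrt(trace * trace - 4.0 * det)   # may be complex
    return (trace + disc) / 2.0, (trace - disc) / 2.0


def is_stationary_var1(phi):
    """True if both eigenvalues of phi lie strictly inside the unit circle."""
    return all(abs(ev) < 1.0 for ev in eigenvalues_2x2(phi))


# Hypothetical coefficient matrices
phi_stable = [[0.5, 0.1], [-0.2, 0.6]]      # complex pair, modulus about 0.57
phi_unit_root = [[1.0, 0.0], [0.0, 0.5]]    # first series is a random walk

stable_flag = is_stationary_var1(phi_stable)
unit_root_flag = is_stationary_var1(phi_unit_root)
```

For higher-order VAR models the same check would be applied to the companion matrix, which is not shown here.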


Bayesian analyses of VAR models are extensive, and include treatments of cointegration* (Koop et al., 2006; Kleibergen and Paap, 2002), model selection and averaging (Andersson and Karlsson, 2007), variable selection (Karlsson, 2015), and informative and restricted priors (Litterman, 1986; Sims and Zha, 1998; Brandt and Freeman, 2005). Relevant packages in R include MTS (Tsay, 2014), BMR (Bayesian Macroeconometrics in R) and MSBVAR (https://cran.revolutionanalytics.com/web/packages/MSBVAR/index.html).

* Classical approaches using autoregressive moving average models may rest on assumptions of stationarity, following transformation or differencing: a time series is integrated of order d, or I(d), when differencing to order d is needed for stationarity. Such series are cointegrated if some linear combination of the series has a lower order of integration than the individual series (Phillips and Durlauf, 1986), for example, when two series yt and xt are both I(1), but there is a parameter α such that ut = yt − αxt is stationary (integrated of order zero).

9.8.1 Multivariate Dynamic Linear Models


The structural model approach is widely applied in Bayesian time series studies (e.g. Petris et al., 2009; Commandeur and Koopman, 2007; West and Harrison, 1997; Durbin and Koopman, 2012) and focuses on underlying components of multiple series without requiring initial differencing. The multivariate normal dynamic linear model specifies

$$y_t = F_t \theta_t + e_t, \quad e_t \sim N(0, V_t), \quad t = 1, \ldots, T$$

$$\theta_{t+1} = G_t \theta_t + H_t u_t, \quad u_t \sim N(0, W_t),$$

where yt is a P × 1 observation vector, and θt is a Q × 1 latent state vector following a Markov process. The disturbance vectors et and ut are assumed normally distributed, and uncorrelated with each other and over time. The initialising prior for the state vector is typically assumed to be a normal fixed effect with mean m1 and covariance matrix C1, θ1 ~ N(m1, C1). The system matrices Ft, Gt, Vt, Wt and Ht may be assumed to be known, in which case simple updating, forecasting and filtering densities can be derived (see West and Harrison, 1997, p.582). In more realistic settings where the covariances Vt and Wt are unknown, time-invariant assumptions such as Vt = Σe and Wt = Σu are one possible parameterisation. A simple case occurs (Koopman and Durbin, 2000) when Vt is diagonal, the assumption being that the observations are independent conditional on the latent states.
Common model forms include the local level (LL) model with measurement and transition equations

$$y_t = \theta_t + e_t, \quad e_t \sim N(0, \Sigma_e), \quad t = 1, \ldots, T$$

$$\theta_{t+1} = \theta_t + u_t, \quad u_t \sim N(0, \Sigma_u),$$

where yt is a P × 1 metric observation, θt also has dimension P, and Σe and Σu are of dimension P × P. A local linear trend (LLT) includes a trend in the underlying level, as in

$$y_t = \theta_t + e_t, \quad e_t \sim N(0, \Sigma_e), \quad t = 1, \ldots, T$$

$$\theta_{t+1} = \theta_t + d_t + u_t, \quad u_t \sim N(0, \Sigma_u),$$

$$d_{t+1} = d_t + w_t, \quad w_t \sim N(0, \Sigma_w).$$
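When the variances are treated as known, the filtering densities for the local level model follow from simple recursions. The sketch below covers only the univariate case (P = 1) with scalar variances, as a simplifying assumption; the function names, the seed, and the variance values are hypothetical, and the code is an illustration of the standard Kalman recursions rather than the MCMC estimation used in the examples.

```python
import random


def simulate_local_level(T, var_e, var_u, theta1=0.0, seed=1):
    """Simulate y_t = theta_t + e_t, theta_{t+1} = theta_t + u_t (scalar case)."""
    rng = random.Random(seed)
    theta, ys, thetas = theta1, [], []
    for _ in range(T):
        thetas.append(theta)
        ys.append(theta + rng.gauss(0.0, var_e ** 0.5))
        theta = theta + rng.gauss(0.0, var_u ** 0.5)
    return ys, thetas


def kalman_filter_local_level(ys, var_e, var_u, m1=0.0, c1=1e6):
    """Filtered means and variances of theta_t given y_1..y_t.

    m1 and c1 are the prior mean and variance of theta_1 (diffuse by default).
    """
    m_pred, c_pred = m1, c1
    means, variances = [], []
    for y in ys:
        gain = c_pred / (c_pred + var_e)           # Kalman gain
        m_filt = m_pred + gain * (y - m_pred)      # update with observation
        c_filt = (1.0 - gain) * c_pred
        means.append(m_filt)
        variances.append(c_filt)
        m_pred, c_pred = m_filt, c_filt + var_u    # one-step-ahead prediction
    return means, variances


ys, states = simulate_local_level(T=50, var_e=1.0, var_u=0.1)
means, variances = kalman_filter_local_level(ys, var_e=1.0, var_u=0.1)
```

With a signal-to-noise ratio of 0.1 the filtered variance settles quickly to a steady-state value of about 0.27, well below the observation variance of 1.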

For example, Proietti (2007) applies a multivariate local level model to measuring core inflation, while Moauro and Savio (2005) apply a LLT approach to temporal disaggregation of multiple economic series. Multivariate signal models may be applied to measure latent risk, as in the accident rate and credit card use examples of Bijleveld et al. (2005). This approach involves time series or panel data on exposure totals (xt or xit), outcomes (yt or yit), and what may be generically termed “losses” (zt or zit). A simple bivariate case with xt = vehicle registrations and yt = motor accidents would lead to a model

$$\log(x_t) = \theta_t^{(E)} + e_t^{(x)}$$

$$\log(y_t) = \theta_t^{(E)} + \theta_t^{(R)} + e_t^{(y)}$$

where the components of $\theta_t = (\theta_t^{(E)}, \theta_t^{(R)})$ represent underlying log exposure and log risk, which evolve according to a bivariate local linear trend

$$\theta_{t+1} = \theta_t + d_t + u_t, \quad u_t \sim N(0, \Sigma_u),$$

$$d_{t+1} = d_t + w_t, \quad w_t \sim N(0, \Sigma_w).$$

A simplifying “homogeneous” model (Harvey, 1989, Chapter 8) for the covariance matrices is obtained for the LL model by setting

$$\Sigma_u = q \Sigma_e$$

where q is an unknown signal-to-noise ratio, and for the LLT model by setting

$$\Sigma_u = q_1 \Sigma_e$$

$$\Sigma_w = q_2 \Sigma_e.$$

Generalisations to include trend, seasonal, and cyclical effects can be made in which each
sort of effect is independent of the other and each follows its own multivariate evolu-
tion prior (Durbin and Koopman, 2001, p.44). These assumptions lead to what is termed a
seemingly unrelated time series equations or SUTSE model (Harvey and Shephard, 1993;
Harvey and Koopman, 1997), since the individual series are connected only via the cor-
related disturbances in the measurement and transition equations. More complex matrix
normal priors (West and Harrison, 1997, p.597) result from assuming interdependence
between different types of parameter.
A model with level, seasonal, and cyclical effects for multivariate yt = (y1t, …, yPt) would specify

$$y_t = \theta_t + \gamma_t + \psi_t + e_t, \quad t = 1, \ldots, T$$

$$e_t \sim N(0, \Sigma_e),$$

$$\theta_{t+1} = \theta_t + u_t,$$

$$u_t \sim N(0, \Sigma_u),$$

where the seasonal components for the pth variable (with s seasons) evolve according to

$$\gamma_{pt} = -\gamma_{p,t-1} - \gamma_{p,t-2} - \cdots - \gamma_{p,t-s+1} + w_{pt},$$

with

$$(w_{1t}, w_{2t}, \ldots, w_{Pt}) \sim N(0, \Sigma_w).$$

Following Harvey and Koopman (1997), the cyclical effects ψt may be assumed “similar,” namely to have the same damping factor ρ and frequency 0 ≤ λ ≤ π across variables. The period is then 2π/λ with the full prior being

$$\psi_t = (\psi_{1t}, \psi_{2t}, \ldots, \psi_{Pt}) \sim N(\mu_{\psi}, \Sigma_{\eta}),$$

with additional shadow period effects

$$\psi_t^* = (\psi_{1t}^*, \psi_{2t}^*, \ldots, \psi_{Pt}^*) \sim N(\mu_{\psi^*}, \Sigma_{\eta^*}),$$

where means $\mu_{\psi p}$ and $\mu_{\psi p^*}$ for the pth variable are obtained according to

$$\begin{bmatrix} \psi_{pt} \\ \psi_{pt}^* \end{bmatrix} = \rho \begin{bmatrix} \cos(\lambda) & \sin(\lambda) \\ -\sin(\lambda) & \cos(\lambda) \end{bmatrix} \begin{bmatrix} \psi_{p,t-1} \\ \psi_{p,t-1}^* \end{bmatrix} + \begin{bmatrix} \eta_{pt} \\ \eta_{pt}^* \end{bmatrix}.$$
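The role of the damping factor and frequency in the cycle recursion can be seen by iterating it without noise. This is a minimal sketch for a single variable; the function names `cycle_step` and `period`, the starting state, and the chosen frequency are hypothetical illustrations.

```python
import math


def cycle_step(psi, psi_star, rho, lam, eta=0.0, eta_star=0.0):
    """One step of the damped cycle recursion for a single variable."""
    c, s = math.cos(lam), math.sin(lam)
    new_psi = rho * (c * psi + s * psi_star) + eta
    new_star = rho * (-s * psi + c * psi_star) + eta_star
    return new_psi, new_star


def period(lam):
    """Cycle length implied by frequency lam."""
    return 2.0 * math.pi / lam


# Undamped cycle (rho = 1) with a ten-step period and no disturbances
lam = 2.0 * math.pi / 10.0
state = (1.0, 0.0)
for _ in range(10):
    state = cycle_step(*state, rho=1.0, lam=lam)

# Damped version: the amplitude shrinks by rho at every step
damped = (1.0, 0.0)
for _ in range(10):
    damped = cycle_step(*damped, rho=0.9, lam=lam)
damped_amplitude = math.hypot(*damped)
```

With ρ = 1 the state returns to its starting point after 2π/λ steps, while with ρ = 0.9 the amplitude after ten steps is 0.9 to the power ten. Frequencies of 0.3 and 1.5 correspond to periods of roughly 20.9 and 4.2 time units.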
It may be noted that multivariate DLMs occur in the analysis of univariate data, for example for categorical and ordinal outcomes. Thus Cargnoni et al. (1997) propose a model for time series of a multinomial outcome with M categories, and denominators nt. One has

$$(y_{1t}, y_{2t}, \ldots, y_{Mt}) \sim \mathrm{Mult}(n_t, [\pi_{1t}, \pi_{2t}, \ldots, \pi_{Mt}])$$

$$\pi_{mt} = \exp(\eta_{mt}) \Big/ \sum_{h=1}^{M} \exp(\eta_{ht})$$

$$\eta_{mt} = \alpha_{mt} + \beta_m x_t, \quad m = 1, \ldots, M - 1$$

$$\eta_{Mt} = 0$$

where the time-varying category intercepts $\alpha_t = (\alpha_{1t}, \ldots, \alpha_{M-1,t})$ follow a multivariate normal random walk prior

$$\alpha_t \sim N_{M-1}(\alpha_{t-1}, \Sigma_{\alpha}).$$
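The mapping from the M − 1 free linear predictors to the M category probabilities, with the final category as reference, can be sketched as follows. The function name `category_probabilities` and the numerical inputs are hypothetical; subtracting the largest predictor before exponentiating is a standard numerical safeguard added here, not something stated in the text.

```python
import math


def category_probabilities(alpha_t, beta, x_t):
    """Multinomial logit probabilities at one time point.

    alpha_t and beta have length M - 1; category M is the reference,
    with linear predictor fixed at zero.
    """
    etas = [a + b * x_t for a, b in zip(alpha_t, beta)] + [0.0]
    top = max(etas)                                # guard against overflow
    weights = [math.exp(e - top) for e in etas]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical intercepts and slopes for M = 3 categories
probs = category_probabilities(alpha_t=[0.2, -0.5], beta=[0.1, 0.3], x_t=1.0)
# All predictors zero gives equal probabilities
equal = category_probabilities(alpha_t=[0.0, 0.0], beta=[0.0, 0.0], x_t=1.0)
```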

Example 9.11 Minks and Muskrats: Multivariate Dynamic Linear Model


Harvey and Koopman (1997) consider a bivariate series, namely numbers of skins of minks and muskrats traded annually (logarithms of the annual sales) for T = 64 years (1848–1911) by the Hudson's Bay Company. There is a predator-prey relationship between the P = 2 species (minks are the main predators of muskrats) leading to inter-linked cycles. A model is fitted including trends and similar cycles, so that

$$y_t = \theta_t + \psi_t + e_t, \quad t = 1, \ldots, T$$

$$e_t \sim N(0, \Sigma_e),$$

$$\theta_{t+1} = \theta_t + b_t + u_t,$$

$$u_t \sim N(0, \Sigma_u),$$

$$b_{t+1} = b_t + w_t,$$

$$w_t \sim N(0, \Sigma_w),$$

$$\psi_t = (\psi_{1t}, \psi_{2t}) \sim N(\mu_{\psi}, \Sigma_{\eta})$$

$$\psi_t^* = (\psi_{1t}^*, \psi_{2t}^*) \sim N(\mu_{\psi^*}, \Sigma_{\eta^*})$$

where the cyclical effects for the two species have the same damping factor ρ ~ U(0, 1) and frequency λ, and the non-diagonal covariance matrices are of order P × P.
Since the series contains 64 points, an informative assumption is made that the period is between 4.2 and 21, namely that λ ~ U(0.3, 1.5). Taking a simple uniform prior on λ between 0 and π is associated with implausibly low λ. Covariances are linked using the homogeneity assumption, namely $\Sigma_u = q_u \Sigma_e$, $\Sigma_w = q_w \Sigma_e$, $\Sigma_{\eta} = q_{\eta} \Sigma_e$, and $\Sigma_{\eta^*} = q_{\eta^*} \Sigma_e$, with the signal to noise ratios $\{q_u, q_w, q_{\eta}, q_{\eta^*}\}$ all assumed to follow Exponential(1) priors. For $\Sigma_e^{-1}$, a Wishart prior assumes 5 degrees of freedom and a prior covariance matrix based on the observed covariance.
Inferences are from the final 75,000 of a two-chain run of 100,000 iterations, using R2OpenBUGS. One finds the cycles to have a mean period of 9.9 years, with 95% CRI (9.3, 10.9). Figures 9.4 and 9.5 show modelled trends in the mink and muskrat series (theta.var[1,] and theta.var[2,] in the code) together with the original data. The posterior means for $q_u$, $q_w$, $q_{\eta}$, and $q_{\eta^*}$ are (0.093, 0.004, 0.058, 0.16). The LOO-IC is 11.1, with pointwise LOO-IC identifying the discordant observation in 1908, when muskrat sales were unduly low.
FIGURE 9.4
Annual mink sales, 1848–1911 (logarithm). [Plot of annual sales (log) by year, showing the data, posterior mean, and 2.5% and 97.5% limits.]

FIGURE 9.5
Annual muskrat sales, 1848–1911 (logarithm). [Plot of annual sales (log) by year, showing the data, posterior mean, and 2.5% and 97.5% limits.]

The interlinking of the two series (and its predator-prey nature) also shows in a VAR(1) model with

$$y_{1t} = \gamma_1 + \alpha_{11} y_{1,t-1} + \alpha_{12} y_{2,t-1} + u_{1t}, \quad t = 2, \ldots, T$$

$$y_{2t} = \gamma_2 + \alpha_{21} y_{1,t-1} + \alpha_{22} y_{2,t-1} + u_{2t},$$

$$u_t \sim N(0, \Sigma_u),$$

with y11 and y21 taken as known, and where Σu is non-diagonal. The estimated α coefficient matrix from a two-chain run of 25,000 iterations is

$$\begin{pmatrix} 0.61 & 0.21 \\ -0.49 & 0.91 \end{pmatrix}$$

where the negative α21 coefficient, with posterior mean (95% interval) of −0.49 (−0.69, −0.12), shows muskrat numbers are lower when mink numbers are higher. Maximum likelihood estimates from the vars package are similar, as are estimates using rstan code, which uses the Cholesky parameterisation of the bivariate normal covariance matrix. Residuals between the two series (u.corr in the code) are positively correlated, with posterior mean 0.26, after accounting for the lag 1 effect of one series on the other. The LOO-IC for this model is 41.

9.8.2 Dynamic Factor Analysis


Time series factor models become sensible for large P, especially when there are high inter-series correlations, as they result in less heavy parameterisation of covariance between series, and may provide insights into latent structure, as well as more efficient inferences and forecasts (Durbin and Koopman, 2001). However, as with all factor models, they are subject to potential identification issues (Aßmann et al., 2016; Bai and Wang, 2015). Typically, the covariance structure between series is attributed to the common factors only, with observation errors assumed independent (Jungbacker et al., 2009). There are a number of application areas, and Bayesian approaches have been important. Prado and West (1997) consider the case of a single latent series Ft underlying multiple series yt = (y1t, …, yPt) of EEG readings, and discuss time-varying autoregressive (TVAR) models for the latent Ft involving time-varying AR coefficients which follow random walk priors. Thus, first order random walk priors in r = 1, …, R autoregressive parameters ϕrt lead to

$$y_{pt} = \alpha_p + \lambda_p F_t + e_{pt},$$

$$e_{pt} \sim N(0, \sigma_p^2),$$

$$F_t \sim N\left(\sum_{r=1}^{R} \phi_{rt} F_{t-r}, \sigma_F^2\right),$$

$$\phi_{rt} \sim N(\phi_{r,t-1}, \sigma_{\phi}^2),$$

with a preset λp (anchoring constraint) if $\sigma_F^2$ is an unknown. Autoregressive dependence in the residuals ept may also be considered (Jackson et al., 2016; Kaufmann and Schumacher, 2013), or lagged effects of the latent factor(s) in the model for ypt (Aßmann et al., 2016). Another application occurs in econometric modelling of asset returns, where the number of assets may exceed the length of the time series and factor models for returns are a clear option (Zivot and Wang, 2006). Factor models may also be a component of multivariate volatility models (see Section 9.8.3).
A relatively simple approach for reducing a P dimensional vector yt to a Q dimensional vector Ft involves a dynamic linear model for the factor score vector. Thus, a local linear factor or factor trend model would propose a measurement model linking indicators and factors

$$y_t = \alpha + \Lambda F_t + e_t, \quad e_t \sim N(0, \Sigma_e), \quad t = 1, \ldots, T$$

with the transition equation specifying a random walk in the factors, namely

$$F_{t+1} = F_t + u_t, \quad u_t \sim N(0, \Sigma_u),$$

where Ft is a Q dimensional latent construct, with Q < P, and Λ is of dimension P × Q. If the series et and Ft are uncorrelated, then the marginal mean and covariance of yt are α and $\Lambda \Sigma_u \Lambda' + \Sigma_e$ respectively. To avoid location invariance in the F scores, devices such as centring at each iteration or setting initial factor scores to known values can be used (see Example 9.12).
The loadings matrix Λ and/or the factor score covariance matrix Σu are parameterised to ensure identification and avoid various forms of invariance. If all elements in Σe are taken as unknown (i.e. off-diagonal as well as diagonal terms) and Σu is also unknown, then Harvey and Koopman (1997) mention the loadings matrix formulation

$$\Lambda = \begin{pmatrix} I_Q \\ \Lambda^* \end{pmatrix}$$

with Λ* of dimension (P − Q) × Q containing unknown loadings. If Σe is diagonal (only residual variances assumed unknown), and Σu is also diagonal, but contains unknown factor variances, then one may set λpp = 1 and λpq = 0 for q > p. This is the anchoring constraint of Skrondal and Rabe-Hesketh (2004), with the latter constraint used to avoid rotation invariance (Geweke and Zhou, 1996, pp.565–566).
If Σe contains just residual variances, and Σu is diagonal with known factor variances (typically of 1), then constraints on the λpq to ensure scale identification are not needed, but the rotational constraint λpq = 0 for q > p still applies. However, Geweke and Zhou (1996) suggest λpp > 0 as an identification device in this case, to ensure a unique labelling of factors.
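The anchoring and rotational constraints (λpp = 1 and λpq = 0 for q > p) determine which loadings remain free. The sketch below builds such a constrained matrix from a list of free values; the function names, the dimensions P = 4 and Q = 2, and the loading values are hypothetical and serve only to show the pattern of fixed and free entries.

```python
def constrained_loadings(free_values, P, Q):
    """Build a P x Q loading matrix with unit diagonal and zeros above it.

    Free values fill the remaining positions row by row.
    """
    values = iter(free_values)
    lam = [[0.0] * Q for _ in range(P)]
    for p in range(P):
        for q in range(Q):
            if q == p:
                lam[p][q] = 1.0            # anchoring constraint
            elif q < p:
                lam[p][q] = next(values)   # free loading
            # q > p stays at zero (rotational constraint)
    return lam


def n_free_loadings(P, Q):
    """Number of unconstrained loadings under both constraints."""
    return P * Q - Q * (Q + 1) // 2


P, Q = 4, 2
free = [0.8, 0.5, 0.3, 0.9, -0.2]          # hypothetical free loadings
Lambda = constrained_loadings(free, P, Q)
```

If the factor variances are instead fixed at 1, the diagonal loadings would also be free (subject to λpp > 0), adding Q parameters to the count above.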

9.8.3 Multivariate Stochastic Volatility


Many multivariate series (e.g. share prices, exchange rates, asset returns) may be subject
to volatility clustering, with the clustering often correlated over different series (Kastner
et  al., 2017; Chib et  al., 2006). For example, Yu and Meyer (2006) mention that financial
decision-making needs to take correlations into account when market volatilities move
together across multiple assets. With initial transformation or differencing of series, the
main focus of stochastic volatility modelling may be on modelling the changing covari-
ances. In other applications, it may be necessary to model time variation both in autore-
gressive (or VAR) coefficients and the covariance matrix of error terms (Primiceri, 2005;
Krueger, 2016). Alternatively, one may investigate stochastic volatility in tandem with a
multivariate dynamic factor model.
With regard to the latter option, a general model may be stated for untransformed data yt = (y1t, y2t, …, yPt)′ with factor scores vector Ft = (F1t, F2t, …, FQt)′. Thus following Zhou et al. (2014)

$$y_t = \alpha_t + \Lambda_t F_t + e_t,$$

$$e_t \sim N(0, \Sigma_t),$$

$$F_t = G F_{t-1} + w_t,$$

$$w_t \sim N(0, \Phi_t),$$

where et and wt are respectively residuals and factor innovations, αt is an intercept, Λt of dimension P × Q denotes time-varying (autoregressive) loadings, $\Sigma_t = \mathrm{diag}(\sigma_{1t}^2, \ldots, \sigma_{Pt}^2)$ contains time-varying residual variances, G = diag(g1, …, gQ) governs AR(1) dependence in the factor scores Ft, and Φt is a diagonal factor volatility matrix.
To exemplify autoregressive dependence in loadings, consider a univariate factor model (Q = 1). Then with AR(1) dependence in the loadings and Σt diagonal one has

$$y_{pt} = \alpha_{pt} + \lambda_{pt} F_t + e_{pt},$$

$$e_{pt} \sim N(0, \sigma_p^2),$$

$$F_t = g F_{t-1} + w_t,$$

$$w_t \sim N(0, \sigma_w^2),$$

$$\lambda_{pt} = \mu_p + \rho_p (\lambda_{p,t-1} - \mu_p) + \eta_{pt},$$

$$\eta_{pt} \sim N(0, \xi_p).$$

For multivariate factors, sparsity-inducing priors on the coefficients λpqt may be indicated, with Zhou et al. (2014) proposing a threshold mechanism.

Simplifications to such a scheme are often the focus, involving decompositions of the residual variance. Applications typically involve metric series yt = (y1t, y2t, …, yPt)′ either mean centred, or in transformed form (e.g. logs of share prices compared between successive time points), with effectively zero means. A latent factor will not necessarily be involved. Thus, for centred or appropriately transformed prices or returns ypt, one possible model (Asai et al., 2006) for a response yt = (y1t, y2t, …, yPt)′ is

$$y_t = H_t e_t,$$

$$H_t = \mathrm{diag}(\exp(h_{1t}/2), \exp(h_{2t}/2), \ldots, \exp(h_{Pt}/2)),$$

where et is a vector of independent standard normal variates, and ht = (h1t, …, hPt)′ is a vector of unobserved log variances (or volatilities), evolving according to stationary autoregressive schemes,

$$h_{pt} = \mu_p + \phi_p (h_{p,t-1} - \mu_p) + u_{pt}, \quad t \geq 2,$$

$$u_{pt} \sim N(0, \tau_p),$$

$$h_{p1} \sim N(\mu_p, \tau_p/(1 - \phi_p^2))$$

with persistence parameters ϕp ∈ (−1, 1) (Kastner, 2016). The autoregression in ht can be extended to VAR or VARMA form with heavier parameterisation (Asai et al., 2006).
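The stochastic volatility scheme for a single series can be simulated directly, drawing the first log variance from its stationary distribution. This is a minimal sketch under that assumption; the function names, the seed, and the parameter values (μ = −1, ϕ = 0.95, τ = 0.05, with τ a variance) are hypothetical illustrations.

```python
import math
import random


def stationary_variance(phi, tau):
    """Variance of the stationary distribution of the AR(1) log variance."""
    return tau / (1.0 - phi ** 2)


def simulate_sv(T, mu, phi, tau, seed=1):
    """Simulate y_t = exp(h_t / 2) * e_t with an AR(1) log variance h_t."""
    rng = random.Random(seed)
    # Initial log variance drawn from the stationary distribution
    h = rng.gauss(mu, math.sqrt(stationary_variance(phi, tau)))
    hs, ys = [], []
    for _ in range(T):
        hs.append(h)
        ys.append(math.exp(h / 2.0) * rng.gauss(0.0, 1.0))
        # Log variance for the next time point
        h = mu + phi * (h - mu) + rng.gauss(0.0, math.sqrt(tau))
    return ys, hs


ys, hs = simulate_sv(T=500, mu=-1.0, phi=0.95, tau=0.05)
ys_again, hs_again = simulate_sv(T=500, mu=-1.0, phi=0.95, tau=0.05)
```

High persistence inflates the marginal variance of the log volatilities: with ϕ = 0.95 the stationary variance is roughly ten times the innovation variance.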
The errors in the price series and in the volatilities may be stated in multivariate normal or multivariate t form, with the MVN assumption expressed

$$\begin{pmatrix} e_t \\ u_t \end{pmatrix} \sim N\left[\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} R_e & 0 \\ 0 & \Sigma_u \end{pmatrix}\right],$$

where Re is a positive definite correlation matrix with a diagonal of ones, and Σu is a P × P covariance matrix for volatility shocks. Taking Re to be non-diagonal means shocks in prices may be correlated, while taking Σu to be non-diagonal allows volatility shocks to be correlated (Yu and Meyer, 2006, pp.365–366). Thus, in a bivariate example, taking

$$\phi = \begin{pmatrix} \phi_{11} & \phi_{12} \\ \phi_{21} & \phi_{22} \end{pmatrix}$$

to be non-diagonal in a VAR(1) regression for (h1t, h2t) amounts to allowing bilateral Granger causality in volatility between the two series. Taking Re to evolve through time according to

$$R_{et} = \begin{pmatrix} 1 & \rho_t \\ \rho_t & 1 \end{pmatrix}$$

means that not only log volatilities ht, but also correlations between the observed series are time-varying. Specifically, with

$$\rho_t = \frac{\exp(g_t) - 1}{\exp(g_t) + 1}$$

an additional autoregression can be set, with

$$g_{t+1} = \mu_g + \phi_g (g_t - \mu_g) + v_t.$$
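The transformation from the unconstrained process gt to the correlation ρt keeps the correlation inside (−1, 1) and is algebraically the same as tanh(gt/2). A short sketch, with hypothetical function names and input values:

```python
import math


def correlation_from_g(g):
    """Map an unconstrained value g to a correlation in (-1, 1)."""
    return (math.exp(g) - 1.0) / (math.exp(g) + 1.0)


def next_g(g, mu_g, phi_g, shock):
    """One step of the autoregression for g, given a shock v_t."""
    return mu_g + phi_g * (g - mu_g) + shock


# Hypothetical path: start at the mean level and apply two shocks
g0 = 0.5
g1 = next_g(g0, mu_g=0.5, phi_g=0.9, shock=0.2)
g2 = next_g(g1, mu_g=0.5, phi_g=0.9, shock=-0.1)
rhos = [correlation_from_g(g) for g in (g0, g1, g2)]
```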

Factor analytic models may also include correlated volatility (Pitt and Shephard, 1999; Chan et al., 2006; Zhou et al., 2014). As an example, for two series {ypt, p = 1, 2} and a univariate factor Ft, one might have

$$y_{1t} = \lambda_1 F_t + e_{1t}$$

$$y_{2t} = \lambda_2 F_t + e_{2t}$$

with evolving variances for Ft and the ept. The stochastic variance prior for the residuals ept may include autoregressive dependence, since a factor structure may be sufficient to account for the non-diagonal elements of the residual variance matrix of the outcomes, but not sufficient to explain all the marginal persistence in volatility (Pitt and Shephard, 1999, p.551). Thus, one might have

$$F_t \sim N(0, e^{h_{1t}}),$$

$$e_{1t} \sim N(0, e^{h_{2t}}),$$

$$e_{2t} \sim N(0, e^{h_{3t}}),$$

with first-order autoregressive dependence in the log variances hpt

$$h_{pt} = \phi_p h_{p,t-1} + u_{pt}, \quad t = 2, \ldots, T$$

with possibly unknown initial conditions hp1. For identification, one may set one or other of the λp parameters to 1 (an anchoring constraint). Alternatively, a standardised factor constraint might be implemented by setting the scale of the factors at one time point, for instance by taking F1 ~ N(0, 1), that is h11 = 0.
Adaptivity in the modelling of stochastic variances can be combined with factor reduction. Chib et al. (2006) propose a multivariate stochastic volatility factor model that permits both series-specific jumps at each time, and Student-t innovations with unknown degrees of freedom. For bivariate data and a univariate factor, this model has the form

$$y_{1t} = \lambda_1 F_t + d_{1t} q_{1t} + \varepsilon_{1t}$$

$$y_{2t} = \lambda_2 F_t + d_{2t} q_{2t} + \varepsilon_{2t}$$

where qpt = 1 with probability πp, and the εpt follow independent Student t densities with unknown degrees of freedom νp. In hierarchical form

$$\varepsilon_{pt} = e_{pt}/g_{pt}^{0.5},$$

$$g_{pt} \sim Ga\left(\frac{\nu_p}{2}, \frac{\nu_p}{2}\right),$$

$$[e_{1t}, e_{2t}] \sim N_2(0, V_t),$$

where Vt is diagonal with elements exp(hpt), with evolution scheme

$$h_{p,t+1} - \mu_p = \phi_p (h_{pt} - \mu_p) + \sigma_p u_{pt},$$

$$u_{pt} \sim N(0, 1).$$

The variables zpt = log(1 + dpt) are assumed to be $N(-0.5\xi_p^2, \xi_p^2)$ where ξp are additional unknowns. The more general form for yt = (y1t, …, yPt)′ and Ft = (F1t, …, FQt)′, Q ≤ P is

$$y_t = B F_t + D_t q_t + \varepsilon_t$$

with identification constraints λpp = 1 and λpq = 0 for q > p. These constraints set a scale and prevent rotation invariance. The covariance matrix for Ft is diagonal with evolution scheme as for the log diagonal elements of Vt.

Example 9.12 Common Factor Model for Flour Prices


Tiao and Tsay (1989) analyse a trivariate series formed by the logarithms of indices of monthly flour prices in Buffalo, Minneapolis and Kansas City between August 1972 and November 1980 (T = 100). The data are plotted in Figure 9.6 which shows that the series are closely related, and a common factor model is indicated. The variance of the factor scores Ft is taken as unknown, so a loading constraint is needed. The factor scores are assumed to follow a random walk with F1 = 0 to identify the level of the scores. The observation residuals after accounting for the common factor are assumed multivariate normal, with a Wishart prior on the precision matrix with 3 degrees of freedom and a diagonal scale matrix. The elements of the Wishart scale matrix are based on the observed variances Vp of the three series, leading to a data-based prior.

FIGURE 9.6
Flour prices in three cities. [Plot of log price by month, August 1972 to November 1980, for Buffalo, Minneapolis and Kansas City.]

Thus, with P = 3, and an anchoring constraint on the loadings,

$$y_{pt} = \alpha_p + \lambda_p F_t + u_{pt}$$

$$(u_{1t}, \ldots, u_{Pt}) \sim N_P(0, \Sigma_u),$$

$$\Sigma_u^{-1} \sim W(PS, P),$$

$$S = \mathrm{diag}(V_1, V_2, \ldots, V_P),$$

$$\lambda_1 = 1;$$

$$\lambda_k \sim N(1, 1), \quad k = 2, \ldots, P,$$

$$F_t \sim N(F_{t-1}, \sigma_F^2), \quad t = 2, \ldots, T,$$

$$F_1 = 0,$$

$$\sigma_F \sim U(0, 10).$$

An alternative model for the factor scores adopts a locally adaptive prior, allowing for changing variance through time (Lang et al., 2002). Thus

$$F_t \sim N(F_{t-1}, \exp(h_t)), \quad t = 2, \ldots, T,$$

$$h_t \sim N(h_{t-1}, \tau_h^{-1}), \quad t = 2, \ldots, T,$$

$$\tau_h \sim Ga(1, 0.001),$$

$$h_1 \sim N(0, 1),$$

$$F_1 = 0.$$

Following Migon and Moreira (2004), fit may be assessed using the predictive approach of Gelfand and Ghosh (1998), based on a goodness of fit term

$$G = \sum_p \sum_t (y_{rep,pt} - y_{pt})^2$$

and a penalty term

$$H = \sum_p \sum_t \mathrm{var}(y_{rep,pt}).$$

The LOO-IC is also used.
Figure 9.7 shows the estimated factor scores through time under the constant variance model. The posterior mean for $\sigma_F^2$ is 0.0018, with posterior mean for G of 1.417 and with H = 1.108. The non-constant variance model has similar fit criteria, namely a posterior mean for G of 1.397 with H = 1.098. The respective LOO-IC values are −1027 and −1028. There seems little to choose between the models, though the plot of the evolving log variances (Figure 9.8) suggests a reduction in volatility in the second half of the observation period.
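The two predictive criteria can be computed from stored replicate draws. The sketch below assumes that the (p, t) observations have been flattened into a single vector, that G is averaged over draws to give its posterior mean, and that the variances in H are sample variances over draws; the function name `gelfand_ghosh` and the toy data are hypothetical.

```python
def gelfand_ghosh(y_obs, y_rep_draws):
    """Posterior mean of G and the penalty H from replicate data.

    y_obs       : observed values, flattened over series and time
    y_rep_draws : list of replicate vectors, one per posterior draw
    """
    n_draws = len(y_rep_draws)
    n_obs = len(y_obs)
    # G evaluated at each draw, then averaged over draws
    g_per_draw = [sum((rep[k] - y_obs[k]) ** 2 for k in range(n_obs))
                  for rep in y_rep_draws]
    g_mean = sum(g_per_draw) / n_draws
    # H sums the sample variances of the replicates for each observation
    h = 0.0
    for k in range(n_obs):
        column = [rep[k] for rep in y_rep_draws]
        mean_k = sum(column) / n_draws
        h += sum((v - mean_k) ** 2 for v in column) / (n_draws - 1)
    return g_mean, h


# Toy data: two observations and three replicate draws
y_obs = [1.0, 2.0]
y_rep_draws = [[1.0, 2.0], [2.0, 2.0], [0.0, 2.0]]
G, H = gelfand_ghosh(y_obs, y_rep_draws)
```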
FIGURE 9.7
Factor scores, Model 1, Example 9.12. [Plot of posterior mean with 2.5% and 97.5% limits, August 1972 to August 1980.]

FIGURE 9.8
Log variances of factor scores, Example 9.12. [Plot of posterior mean with 2.5% and 97.5% limits, August 1972 to August 1980.]

394 Bayesian Hierarchical Models

Example 9.13 Multivariate Stochastic Volatility, FTSE, and S&P fluctuations during 2006–07
This example follows the FTSE 100 and S&P 500 stock indices $\{r_{pt},\ t = 0, \ldots, 252;\ p = 1, 2\}$
over 253 trading days (October 27, 2006 to October 19, 2007, as recorded by
uk.finance.yahoo.com). This period includes two spells of market turbulence, the second being
associated with US sub-prime mortgage lending. Where a trading day is present in one
index but not the other, the gap is filled by averaging the preceding and subsequent
days. Figure 9.9 plots the series relative to their start points, namely in the form
$100 r_{pt}/r_{p1}$. The data to be analysed are obtained as $y_{pt} = 100 r_{p,t}/r_{p,t-1} - 100$, $t = 1, \ldots, 252$.
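The data preparation just described can be sketched in R as follows. The index levels are made-up numbers, and `fill_gaps` is a hypothetical helper that assumes each missing day is isolated and has an observed neighbour on both sides.

```r
# Fill an isolated missing trading day with the average of its neighbours
fill_gaps <- function(x) {
  for (i in which(is.na(x))) x[i] <- (x[i - 1] + x[i + 1]) / 2
  x
}

# Hypothetical index levels for t = 0, ..., 5 (one column per index)
r <- cbind(FTSE = c(6160, 6190, 6150, 6175, 6200, 6180),
           SP   = c(1377, 1380, NA, 1370, 1368, 1372))
r <- apply(r, 2, fill_gaps)

# Percentage changes: y_pt = 100 * r_pt / r_{p,t-1} - 100, t = 1, ..., 5
y <- 100 * r[-1, ] / r[-nrow(r), ] - 100

# Series relative to the t = 1 value (second row of r), as plotted in Figure 9.9
rel <- 100 * sweep(r[-1, ], 2, r[2, ], "/")
```
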
The model allows for correlated shocks in the stock change variables and correlated
log variances, so that

$$y_t = H_t \varepsilon_t,$$

$$H_t = \mathrm{diag}[\exp(h_{1t}/2), \exp(h_{2t}/2)],$$

with unobserved log volatilities $h_t = (h_{1t}, h_{2t})'$ evolving via a VAR(1) model

$$h_{t+1} = \mu + \mathrm{diag}(\phi_{11}, \phi_{22})(h_t - \mu) + u_t,$$

$$h_1 = \mu + u_1.$$

The errors in the price series and volatilities equations are multivariate normal

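$$\begin{pmatrix} \varepsilon_t \\ u_t \end{pmatrix} \sim N\left[ \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} R_\varepsilon & 0 \\ 0 & \Sigma_u \end{pmatrix} \right],$$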

FIGURE 9.9
US and GB share indices, October 30, 2006 to October 19, 2007. [Plot: FTSE and S&P series relative to index start point (30-10-2006).]

where $R_\varepsilon$ is a non-diagonal correlation matrix, and $\Sigma_u$ is also non-diagonal, allowing
volatility shocks to be correlated (Yu and Meyer, 2006).
The jags code samples $h^*_{1t}$ and $h^*_{2t}$ as standard normal, and applies the standard
deviations $(\sigma_{u1}, \sigma_{u2})$ and correlation $\rho_u$ of the $u_{1t}$ and $u_{2t}$ series to the standard normal log
volatilities. Thus

$$h_{1t} = \sigma_{u1} h^*_{1t},$$

$$h_{2t} = \sigma_{u2} \rho_u h^*_{1t} + \sigma_{u2} (1 - \rho_u^2)^{0.5} h^*_{2t}.$$

This raises an identification issue for the correlation $\rho_u$, since $\rho_u h^*_{1t} = (-\rho_u)(-h^*_{1t})$, which
is resolved by assuming $\rho_u$ to be U(0,1) rather than U(−1,1). The correlation $\rho_\varepsilon$ in $R_\varepsilon$ is
taken as U(−1,1). The stationary autocorrelation parameters are obtained as $\phi_{pp} = 2\phi^*_{pp} - 1$,
where $\phi^*_{pp} \sim \mathrm{Beta}(19, 1)$. The diagonal terms in $\Sigma_u^{-1}$ are taken to be Exponential(1).
Estimation using jagsUI provides early convergence, with posterior means for $\rho_\varepsilon$ and
$\rho_u$ of 0.54 and 0.87, and with the autoregressive coefficients in the AR(1) log volatility
equations having means 0.88 and 0.74. Figure 9.10 plots the resulting log volatility series
(posterior means of $h_{1t}$ and $h_{2t}$), with the two periods of market turbulence apparent.
The pointwise LOO-ICs detect aberrant observations for t = 85 (comparing 27/02/2007
and 26/02/2007) when there was a sharp fall in the S&P index, and t = 206 (comparing
16/08/2007 and 15/08/2007) when there was a sharp fall in the FTSE100.
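The construction of correlated log volatilities from independent standard normal draws, and the reparameterised autocorrelation prior, can be checked by simulation. The R sketch below is not the jags code of the example: the values of `sigma_u` are hypothetical, and `rho_u` is simply set to the reported posterior mean of 0.87.

```r
set.seed(2)
n <- 20000
sigma_u <- c(0.3, 0.4)   # hypothetical standard deviations of u_1t and u_2t
rho_u <- 0.87            # set to the reported posterior mean

# Independent standard normal draws h*_1t and h*_2t
h_star <- matrix(rnorm(2 * n), n, 2)

# Apply the scales and correlation as in the two equations above
h1 <- sigma_u[1] * h_star[, 1]
h2 <- sigma_u[2] * rho_u * h_star[, 1] +
  sigma_u[2] * sqrt(1 - rho_u^2) * h_star[, 2]
c(cor = cor(h1, h2), sd1 = sd(h1), sd2 = sd(h2))

# Identification issue: flipping both signs leaves the product unchanged
same_product <- isTRUE(all.equal(rho_u * h_star[, 1], (-rho_u) * (-h_star[, 1])))

# phi = 2 * phi_star - 1 with phi_star ~ Beta(19, 1) lies in (-1, 1),
# with prior mean 2 * 19/20 - 1 = 0.9
phi <- 2 * rbeta(n, 19, 1) - 1
mean(phi)
```

The sample correlation and standard deviations should be close to `rho_u` and `sigma_u`, and the Beta(19, 1) prior concentrates the autocorrelations near 1 while keeping them within the stationary region.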

FIGURE 9.10
Log volatility plot. [Plot: series h1 and h2; x-axis Days (0 to 250), y-axis Log volatility.]

9.9 Computational Notes
1. The application of brms to path models can be illustrated using data from an analysis
of job satisfaction (Bryman and Cramer, 2005), also considered in Congdon (2001, p.98).
Job survey data on age, income, job satisfaction, and job autonomy are available for 68
workers. A path model is proposed with y1 (age) influencing the three remaining variables,
namely y2 = autonomy, y3 = income, and y4 = satisfaction. Autonomy is postulated to
influence both income and satisfaction, while all three variables (age, income, and autonomy)
affect satisfaction. All variables are standardised. Hence, the following regressions are
involved

$$y_2 = \beta_1 y_1 + \varepsilon_2,$$

$$y_3 = \beta_2 y_1 + \beta_3 y_2 + \varepsilon_3,$$

$$y_4 = \beta_4 y_1 + \beta_5 y_2 + \beta_6 y_3 + \varepsilon_4.$$

We may be interested in calculating the total effect of age on satisfaction, which involves
the direct effect $\beta_4$; a path from age to income to satisfaction, calculated as $\beta_2\beta_6$; a path
from age to autonomy to satisfaction, obtained as $\beta_1\beta_5$; and a path from age to autonomy to
income to satisfaction, obtained as $\beta_1\beta_3\beta_6$. The following code encapsulates the anticipated
relationships and obtains the total effect (the coefficients $\beta_1, \ldots, \beta_6$ appear as b1 to b6):

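    library(brms)
    # job survey data with columns age, auton, income, satis
    D = read.table("DS_BRMS_SATIS_CH9.txt", header = TRUE)
    # one regression per endogenous variable in the path model
    auton_mod = bf(auton ~ age)
    income_mod = bf(income ~ age + auton)
    satis_mod = bf(satis ~ age + auton + income)
    # joint fit, without residual correlations between the responses
    fit = brm(auton_mod + income_mod + satis_mod + set_rescor(FALSE),
              data = D, chains = 2)
    # inspect the Stan code generated for the same model
    make_stancode(auton_mod + income_mod + satis_mod + set_rescor(FALSE), data = D)
    # posterior draws of the path coefficients
    # (posterior_samples is deprecated in later brms releases in favour of as_draws_df)
    b1 = posterior_samples(fit, "auton_age")
    b2 = posterior_samples(fit, "income_age")
    b3 = posterior_samples(fit, "income_auton")
    b4 = posterior_samples(fit, "satis_age")
    b5 = posterior_samples(fit, "satis_auton")
    b6 = posterior_samples(fit, "satis_income")
    # total effect of age, posterior mean and sd
    # (2 chains x 1000 post-warmup draws under the brms defaults = 2000 draws)
    tot.age = b4 + b2*b6 + b1*b5 + b1*b3*b6
    mean(tot.age[1:2000,])
    sd(tot.age[1:2000,])
    # indirect effect
    ind.age = tot.age - b4
    mean(ind.age[1:2000,])
    sd(ind.age[1:2000,])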

The output from brms shows a small direct effect of age on satisfaction, with posterior
mean (sd) of −0.06 (0.10). However, the total effect is obtained as 0.36 (0.12), and the indirect
effect as 0.43 (0.11).
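The total effect formula used in the code can be verified by substituting the first two regressions into the third. Writing $\beta_k$ for the coefficients (b1 to b6 in the code) and $\varepsilon_j$ for the regression errors, a short derivation is:

```latex
\begin{aligned}
y_3 &= \beta_2 y_1 + \beta_3(\beta_1 y_1 + \varepsilon_2) + \varepsilon_3
     = (\beta_2 + \beta_1\beta_3)\, y_1 + \beta_3\varepsilon_2 + \varepsilon_3, \\
y_4 &= \beta_4 y_1 + \beta_5(\beta_1 y_1 + \varepsilon_2)
     + \beta_6\left[(\beta_2 + \beta_1\beta_3)\, y_1 + \beta_3\varepsilon_2 + \varepsilon_3\right] + \varepsilon_4 \\
    &= (\beta_4 + \beta_2\beta_6 + \beta_1\beta_5 + \beta_1\beta_3\beta_6)\, y_1
     + (\beta_5 + \beta_3\beta_6)\,\varepsilon_2 + \beta_6\varepsilon_3 + \varepsilon_4 .
\end{aligned}
```

The coefficient on $y_1$ in the last line is the total effect, and subtracting the direct effect $\beta_4$ leaves the indirect effect $\beta_2\beta_6 + \beta_1\beta_5 + \beta_1\beta_3\beta_6$. Because these are products of coefficients, their posterior means are obtained by evaluating the expression draw by draw, as the code does, rather than by multiplying the posterior means of the individual coefficients.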

References
Aßmann C, Boysen-Hogrefe J, Pape M (2016) Bayesian analysis of static and dynamic factor models:
An ex-post approach towards the rotation problem. Journal of Econometrics, 192(1), 190–206.
Aitkin M, Aitkin I (2005) Bayesian inference for factor scores, in Contemporary Psychometrics, eds A
Maydeu-Olivares, J McArdle. Lawrence Erlbaum Associates.
Aktekin T, Polson N, Soyer R (2017) Sequential Bayesian analysis of multivariate count data. Bayesian
Analysis, 13(2), 385–409.
Albert J (1992) Bayesian estimation of normal ogive response curves using Gibbs sampling. Journal of
Educational Statistics, 17, 251–269.
Albert J (2015) Introduction to Bayesian item response modelling. International Journal of Quantitative
Research in Education, 2(3–4), 178–193.
Albert J, Ghosh M (2000) Item response modeling, pp 173–193, in Generalized Linear Models: A Bayesian
Perspective, eds D Dey, S Ghosh, B Mallick. Addison–Wesley, New York.
Andersson M, Karlsson S (2007) Bayesian forecast combination for VAR models. Working paper,
Örebro University.
Anselin L, Hudak S (1992) Spatial econometrics in practice: A review of software options. Regional
Science & Urban Economics, 22, 509–536.
Arhonditsis G, Paerl H, Valdes-Weaver L, Stow C, Steinberg J, Reckhow K (2006a) Application of
Bayesian structural equation modeling for examining phytoplankton dynamics in the Neuse
River Estuary. Estuarine, Coastal & Shelf Science, 72, 63–80.
Arhonditsis G, Stow C, Steinberg L, Kenney M, Lathrop R, McBride S, Reckhow K (2006b) Exploring
ecological patterns with structural equation modeling and Bayesian analysis. Ecological
Modelling, 192(3–4), 385–409.
Arima S (2015) Item selection via Bayesian IRT models. Statistics in Medicine, 34(3), 487–503.
Asai M, McAleer M, Yu J (2006) Multivariate stochastic volatility: A review. Econometric Reviews, 25,
145–175.
Azzalini A (1985) A class of distributions which includes the normal ones. Scandinavian Journal of
Statistics, 12, 171–178.
Bai J, Wang P (2015) Identification and Bayesian estimation of dynamic factor models. Journal of
Business & Economic Statistics, 33(2), 221–240.
Baker F (2001) The Basics of Item Response Theory, 2nd Edition. ERIC Clearinghouse on Assessment
and Evaluation.
Banerjee S, Carlin BP, Gelfand AE (2004) Hierarchical Modeling and Analysis for Spatial Data. Chapman
and Hall/CRC, Boca Raton, FL.
Barnard J, McCulloch R, Meng X-L (2000) Modeling covariance matrices in terms of standard devia-
tions and correlations, with applications to shrinkage. Statistica Sinica, 10, 1281–1312.
Bartholomew D (1987) Latent Variable Models and Factor Analysis. Charles Griffin, London, UK.
Bartholomew D, Steele F, Moustaki I, Galbraith J (2002) The Analysis and Interpretation of Multivariate
Data for Social Scientists. CRC Press.
Bazan J, Branco M, Bolfarine H (2006) A skew item response model. Bayesian Analysis, 1, 861–892.
Berkhof J, van Mechelen I, Gelman A (2003) A Bayesian approach to the selection and testing of mix-
ture models. Statistica Sinica, 13, 423–442.
Besag J, York J, Mollie A (1991) Bayesian image restoration, with two applications in spatial statistics.
Annals of the Institute of Statistical Mathematics, 43, 1–59.
Bhattacharya A, Dunson D (2011) Sparse Bayesian infinite factor models. Biometrika, 98(2), 291–306.
Bijleveld F, Commandeur J, Gould P, Koopman S (2005) Model-based measurement of latent risk in
time series with applications. Tinbergen Institute Discussion Paper No. 05-118/4. Available at
SSRN: http://ssrn.com/abstract=873466
Boker S, Neale M, Maes H, Wilde M, Spiegel M, Brick T, Spies J, Estabrook R, Kenny S, Bates T,
Mehta P (2011) OpenMx: An open source extended structural equation modeling framework.
Psychometrika, 76(2), 306–317.
Bollen K (1989) Structural Equations with Latent Variables. Wiley, New York.
Bollen KA (2002) Latent variables in psychology and the social sciences. Annual Review of Psychology,
53(1), 605–634.
Brandt P, Freeman J (2005) Advances in Bayesian time series modeling and the study of politics:
Theory testing, forecasting, and policy analysis. Political Analysis, 14(1), 1–36.
Breusch T (2005) Estimating the Underground Economy using MIMIC Models. Working Paper,
National University of Australia, Canberra.
Brown P, Fearn T, Haque M (1999) Discrimination with many variables. Journal of the American
Statistical Association, 94, 1320–1329.
Bryman A, Cramer D (2005) Quantitative Data Analysis with SPSS 12 and 13: A Guide for Social Scientists.
Routledge.
Byrnes J (2017) Bayesian SEM with BRMS. http://rpubs.com/jebyrnes/343408
Cargnoni C, Müller P, West M (1997) Bayesian forecasting of multinomial time series through con-
ditionally Gaussian dynamic models. Journal of the American Statistical Association, 92(438),
640–647.
Chan D, Kohn R, Kirby C (2006) Multivariate stochastic volatility models with correlated errors.
Econometric Reviews, 25, 245–274.
Chapados N (2014) Effective Bayesian modeling of groups of related count time series. Proceedings
of the 31st International Conference on Machine Learning, Beijing, China. JMLR: W&CP vol-
ume 32.
Chen M-H, Dey D (1998) Bayesian modeling of correlated binary responses via scale mixture of mul-
tivariate normal link functions. Sankhya, 60A, 322–343.
Chen M-H, Dey D (2000) Bayesian analysis for correlated ordinal data models, in Generalized Linear
Models: A Bayesian Perspective, eds D Dey, S Ghosh, B Mallick. Marcel Dekker, New York.
Chib S, Greenberg E (1998) Analysis of multivariate probit models. Biometrika, 85, 347–361.
Chib S, Nardari F, Shephard N (2006) Analysis of high dimensional multivariate stochastic volatility
models. Journal of Econometrics, 134, 341–371.
Chib S, Winkelmann R (2001) Markov chain Monte Carlo analysis of correlated count data. Journal of
Business & Economic Statistics, 19, 428–435.
Choi S, Gibbons L, Crane P (2011) Lordif: An R package for detecting differential item functioning
using iterative hybrid ordinal logistic regression/item response theory and Monte Carlo simu-
lations. Journal of Statistical Software, 39(8), 1.
Commandeur J, Koopman S (2007) An Introduction to State Space Time Series Analysis. Oxford
University Press.
Congdon P (2001) Bayesian Statistical Modelling. Wiley.
Congdon P, Almog M, Curtis S, Ellerman R (2007) A spatial structural equation modelling frame-
work for health count responses. Statistics in Medicine, 26(29), 5267–5284.
Conti G, Frühwirth-Schnatter S, Heckman J, Piatek R (2014) Bayesian exploratory factor analysis.
Journal of Econometrics, 183(1), 31–57.
Curtis S (2010) BUGS code for item response theory. Journal of Statistical Software, 36(1), 1–34.
Dunson D, Herring A (2005) Bayesian latent variable models for mixed discrete outcomes. Biostatistics,
6, 11–25.
Durbin J, Koopman S (2001) Time Series Analysis by State Space Methods, 1st Edition. OUP.
Durbin J, Koopman S (2012) Time Series Analysis by State Space Methods. Oxford University Press.
Edwards Y, Allenby G (2003) Multivariate analysis of multiple response data. Journal of Marketing
Research, 40, 321–334.
Evans J, Middleton N, Gunnell D (2004) Social fragmentation, severe mental illness and suicide.
Social Psychiatry and Psychiatric Epidemiology, 39, 165–170.
Everitt BS (1984) Introduction to Latent Variable Models. Chapman and Hall, London, UK.
Feng X, Wu H, Song X (2017) Bayesian adaptive Lasso for ordinal regression with latent variables.
Sociological Methods & Research, 46(4), 926–953.
Fleishman J, Lawrence W (2003) Demographic variation in SF−12 scores: True differences or differen-
tial item functioning? Medical Care, 41, 75–86.
Fokoue E (2004) Stochastic determination of the intrinsic structure in Bayesian factor analysis. SAMSI
Technical Report #2004-17. http://www.samsi.info/reports/index.shtml
Fox J, Glas C (2005) Bayesian modification indices for IRT models. Statistica Neerlandica, 59, 95–106.
Gelfand A, Ghosh S (1998) Model choice: A minimum posterior predictive loss approach. Biometrika,
85, 1–11.
Gelman A, Meng X-L, Stern H (1996) Posterior predictive assessment of model fitness. Statistica
Sinica, 6, 733–807.
George E, McCulloch R (1993) Variable selection via Gibbs sampling. Journal of the American Statistical
Association, 88, 881–889.
Geweke J, Zhou G (1996) Measuring the pricing error of the arbitrage pricing theory. Review of
Financial Studies, 9, 557–587.
Ghosh J, Dunson D (2008) Bayesian model selection in factor analytic models, pp 151–163, in Random
Effect and Latent Variable Model Selection, ed D Dunson. Springer.
Ghosh J, Dunson D (2009) Default prior distributions and efficient posterior computation in Bayesian
factor analysis. Journal of Computational and Graphical Statistics, 18(2), 306–320.
Gielen E, Riutort-Mayol G, Palencia-Jiménez J, Cantarino I (2017) An urban sprawl index based on
multivariate and Bayesian factor analysis with application at the municipality level in Valencia.
Environment and Planning B, 45(5), 888–914.
Greenacre M, Blasius J (eds) (2006) Multiple Correspondence Analysis and Related Methods. CRC Press.
Gunnell D, Peters T, Kammerling R, Brooks J (1995) Relation between parasuicide, suicide, psychiat-
ric admissions and socio-economic deprivation. British Medical Journal, 311, 226–230.
Harvey A (1989) Forecasting, Structural Time Series Models and the Kalman Filter. Cambridge University
Press, Cambridge, UK.
Harvey A, Koopman S (1997) Multivariate structural time series models, pp 269–298, in Systematic
Dynamics in Economic and Financial Models, eds C Heij, H Schumacher, B Hanzon, C Praagman.
Wiley, Chichester, UK.
Harvey A, Shephard N (1993) Structural time series models, in Handbook of Statistics, Vol. 11, eds G S
Maddala et al. Elsevier Science Publishers, Barking, UK.
Hayashi K, Arav M (2006) Bayesian factor analysis when only a sample covariance matrix is avail-
able. Educational and Psychological Measurement, 66, 272–284.
Hogan J, Tchernis R (2004) Bayesian factor analysis for spatially correlated data, with application to
summarizing area-level material deprivation from census data. Journal of the American Statistical
Association, 99, 314–324.
Hoyle R (ed) (1995) Structural Equation Modeling: Concepts, Issues, and Applications. Sage.
Inouye D, Yang E, Allen G, Ravikumar P (2017) A review of multivariate distributions for count data
derived from the Poisson distribution. Wiley Interdisciplinary Reviews: Computational Statistics,
9(3), e1398.
Ishwaran H, James L (2001) Gibbs sampling methods for stick-breaking priors. Journal of the American
Statistical Association, 96, 161–173.
Ishwaran H, Zarepour M (2000) Markov chain Monte Carlo in approximate Dirichlet and beta two-
parameter process hierarchical models. Biometrika, 87, 371–390.
Jackson L, Kose M, Otrok C, Owyang M (2016) Specification and estimation of Bayesian dynamic
factor models: A Monte Carlo analysis with an application to global house price comovement,
in Advances in Econometrics, Vol. 35, eds S Koopman, E Hillebrand. Emerald Publishing.
Jin X, Banerjee S, Carlin BP (2007) Order-free co-regionalized areal data models with application to
multiple-disease mapping. Journal of the Royal Statistical Society: Series B (Statistical Methodology),
69(5), 817–838.
Jin X, Carlin B, Banerjee S (2005) Generalized hierarchical multivariate CAR models for areal data.
Biometrics, 61, 950–961.
Jöreskog KG (1973) Analysis of covariance structures, pp 263–285, in Multivariate Analysis–III. ed P
Krishnaiah, Academic Press.
Jöreskog K, Goldberger A (1975) Estimation of a model with multiple indicators and multiple causes
of a single latent variable. Journal of the American Statistical Association, 70, 631–639.
Juárez M, Steel M (2010) Model-based clustering of non-Gaussian panel data based on skew-t distri-
butions. Journal of Business & Economic Statistics, 28, 52–66.
Jungbacker B, Koopman S, van der Wel M (2009) Dynamic factor models with smooth loadings
for analyzing the term structure of interest rates. Tinbergen Institute Discussion Paper, TI
2009-041/4.
Kaplan D, Depaoli S (2012) Bayesian structural equation modeling, pp 650–673, in Handbook of
Structural Equation Modeling, ed R Hoyle. Guilford Publications Inc.
Karlis D, Meligkotsidou L (2005) Multivariate Poisson regression with covariance structure. Statistics
and Computing, 15, 255–265.
Karlsson S (2015) Forecasting with Bayesian vector autoregression. Handbook of Economic Forecasting,
2B, 791–897.
Kastner G (2016) Dealing with stochastic volatility in time series using the R package stochvol. Journal
of Statistical Software, 69(5), 1–30.
Kastner G, Frühwirth-Schnatter S, Lopes H (2017) Efficient Bayesian inference for multivariate fac-
tor stochastic volatility models. Journal of Computational and Graphical Statistics, 26(4), 905–917.
Kaufmann S, Schumacher C (2013) Bayesian estimation of sparse dynamic factor models with order-
independent identification (No. 13.04). Working Paper, Study Center Gerzensee.
Kavanagh L, Lee D, Pryce G (2016) Is poverty decentralising? Quantifying uncertainty in the decen-
tralisation of urban poverty. Annals of the American Association of Geographers, 106(6), 1286–1298.
Kendall M (1975) Multivariate Analysis. Charles Griffin & Co., London, UK.
Kleibergen F, Paap R (2002) Priors, posteriors and Bayes factors for a Bayesian analysis of cointegra-
tion. Journal of Econometrics, 111(2), 223–249.
Knorr-Held L, Rasser G (2000) Bayesian detection of clusters and discontinuities in disease maps.
Biometrics, 56, 13–21.
Koop G, Korobilis D (2010) Bayesian multivariate time series methods for empirical macroeconom-
ics. Foundations and Trends in Econometrics, 3(4), 267–358.
Koop G, Strachan R, van Dijk H, Villani M (2006) Bayesian approaches to cointegration, in The
Palgrave Handbook of Theoretical Econometrics, eds K Patterson, T Mills. MacMillan.
Koopman S, Durbin J (2000) Fast filtering and smoothing for multivariate state space models. The
Journal of Time Series Analysis, 21, 281–296.
Krueger F (2016) bvarsv: Bayesian Analysis of a Vector Autoregressive Model with Stochastic
Volatility and Time-Varying Parameters. https://sites.google.com/site/fk83research/code
Lai M, Zhang J (2017) Evaluating fit indices for multivariate t-based structural equation modeling
with data contamination. Frontiers in Psychology, 8, 1286.
Lang S, Fronk E, Fahrmeir L (2002) Function estimation with locally adaptive dynamic models.
Computational Statistics, 17, 479–500.
Lee S-Y (2007) Structural Equation Modelling: A Bayesian Approach. Wiley.
Lee S-Y, Shi J (2000) Joint Bayesian analysis of factor score and structural parameters in the factor
analysis models. Annals of the Institute of Statistical Mathematics, 52, 722–736.
Lee S-Y, Song X-Y (2003) Bayesian model selection for mixtures of structural equation models with
an unknown number of components. British Journal of Mathematical and Statistical Psychology,
56, 145–165.
Lee S-Y, Song X-Y (2008) Bayesian model comparison of structural equation models, pp 121–149, in
Random Effect and Latent Variable Model Selection, ed D Dunson. Springer.
Lee S-Y, Tang N (2006) Bayesian analysis of structural equation models with mixed exponential fam-
ily and ordered categorical data. British Journal of Mathematical and Statistical Psychology, 59,
151–172.
Leroux B, Lei X, Breslow N (1999) Estimation of disease rates in small areas: A new mixed model
for spatial dependence, pp 135–178, in Statistical Models in Epidemiology, the Environment and
Clinical Trials, eds M Halloran, D Berry. Springer-Verlag, New York.
Levy R, Mislevy R (2016) Bayesian Psychometric Modeling. CRC Press.
Litterman R (1986) Forecasting with Bayesian vector autoregressions – Five years of experience.
Journal of Business & Economic Statistics, 4, 25–38.
Liu X, Wall M, Hodges J (2005) Generalized spatial structural equation modeling. Biostatistics, 6,
539–557.
Lopes H, West M (2004) Bayesian model assessment in factor analysis. Statistica Sinica, 14, 41–67.
Lu Z-H, Chow S, Loken E (2016) Bayesian factor analysis as a variable-selection problem: Alternative
priors and consequences. Multivariate Behavioral Research, 51(4), 519–539.
Luo Y, Jiao H (2017) Using the Stan program for Bayesian item response theory. Educational and
Psychological Measurement, 77, 1–25.
MacNab Y (2007) Mapping disability-adjusted life years: A Bayesian hierarchical model framework
for burden of disease and injury assessment. Statistics in Medicine, 26(26), 4746–4769.
MacNab Y (2018) Some recent work on multivariate Gaussian Markov random fields. Test, 27(3),
1–45.
Madsen L, Dalthorp D (2007) Simulating correlated count data. Environmental and Ecological Statistics,
14, 129–148.
Magis D, Tuerlinckx F, De Boeck P (2015) Detection of differential item functioning using the lasso
approach. Journal of Educational and Behavioral Statistics, 40(2), 111–135.
Malaeb Z, Summers K, Pugesek B (2000) Using structural equation modeling to investigate relation-
ships among ecological variables. Environmental and Ecological Statistics, 7, 93–111.
Mardia K (1988) Multi-dimensional multivariate Gaussian Markov random fields with application to
image processing. Journal of Multivariate Analysis, 24, 265–284.
Marí-Dell’Olmo M, Martínez-Beneito M, Borrell C, Zurriaga O, Nolasco A, Domínguez-Berjón M (2011)
Bayesian factor analysis to calculate a deprivation index and its uncertainty. Epidemiology, 22(3),
356–364.
Marshall EC, Spiegelhalter DJ (2007) Identifying outliers in Bayesian hierarchical models: A simula-
tion-based approach. Bayesian Analysis, 2(2), 409–444.
Martinez-Beneito MA (2013) A general modelling framework for multivariate disease mapping.
Biometrika, 100(3), 539–553.
Mavridis D, Ntzoufras I (2014) Stochastic search item selection for factor analytic models. British
Journal of Mathematical and Statistical Psychology, 67(2), 284–303.
McCulloch R, Rossi P (1994) An exact likelihood analysis of the multinomial probit model. Journal of
Econometrics, 64, 207–240.
McCulloch R, Polson N, Rossi P (2000) A Bayesian analysis of the multinomial probit model with
fully identified parameters. Journal of Econometrics, 99, 173–193.
Merkle E, Rosseel Y (2017) blavaan: Bayesian structural equation models via parameter expansion.
Journal of Statistical Software, 85(4), 1–30.
Merkle E, Wang T (2016) Bayesian latent variable models for the analysis of experimental psychology
data. Psychonomic Bulletin & Review, 25(1), 256–270.
Mezzetti M, Billari F (2005) Bayesian correlated factor analysis of socio-demographic indicators.
Statistical Methods and Applications, 14(2), 223–241.
Migon H, Moreira A (2004) Core inflation: Robust common trend model forecasting. Brazilian Review
of Econometrics, 24, 1–19.
Moauro P, Savio G (2005) Temporal disaggregation using multivariate structural time series models.
Journal of Econometrics, 8, 214–234.
Murray J (2016) R Package ‘bfa’, Bayesian Factor Analysis.
https://cran.r-project.org/web/packages/bfa/bfa.pdf
Muthén B, Asparouhov T (2012) Bayesian structural equation modeling: A more flexible representa-
tion of substantive theory. Psychological Methods, 17(3), 313–335.
Natesan P, Nandakumar R, Minka T, Rubright J (2016) Bayesian prior choice in IRT estimation using
MCMC and variational Bayes. Frontiers in Psychology, 7, 1422.
Nethery R, Warren J, Herring A, Moore K, Evenson K, Diez-Roux A (2015) A common spatial fac-
tor analysis model for measured neighborhood-level characteristics: The multi-ethnic study of
atherosclerosis. Health & Place, 36, 35–46.
Nikolov M, Coull B, Catalano P (2007) An informative Bayesian structural equation model to assess
source-specific health effects of air pollution. Biostatistics, 8, 609–624.
O’Brien S, Dunson D (2004) Bayesian multivariate logistic regression. Biometrics, 60, 739–746.
Palomo J, Dunson D, Bollen K (2007) Bayesian structural equation modeling, in Handbook of Latent
Variable and Related Models, ed S-Y Lee. Elsevier.
Petris G, Petrone S, Campagnoli P (2009) Dynamic Linear Models with R. Springer.
Piatek R (2017) R Package ‘BayesFM’, Bayesian Inference for Factor Modeling.
https://cran.r-project.org/web/packages/BayesFM/BayesFM.pdf
Pitt M, Shephard N (1999) Time varying covariances: A factor stochastic volatility approach, pp 547–
570, in Bayesian Statistics 6, eds J Bernardo, J Berger, A Dawid, A Smith. Oxford University Press.
Poon W, Wang H (2012) Latent variable models with ordinal categorical covariates. Statistics and
Computing, 22(5), 1135–1154.
Prado R, West M (1997) Exploratory modelling of multiple non-stationary time series: Latent pro-
cess structure and decompositions, in Modelling Longitudinal and Spatially Correlated Data, ed T
Gregoire. Springer-Verlag.
Press S, Shigemasu K (1989) Bayesian inference in factor analysis, pp 271–287, in Contributions to
Probability and Statistics. eds L Gleser, M Perlman, S Press, A Sampson. Springer, New York.
Primiceri G E (2005) Time varying structural vector autoregressions and monetary policy. The Review
of Economic Studies, 72(3), 821–852.
Proietti T (2007) Measuring core inflation by multivariate structural time series models, in Optimisation,
Econometric and Financial Analysis, eds E J Kontoghiorghes, C Gatu. Advances in Computational
Management Science, Vol. 9. Springer, Berlin/Heidelberg, Germany.
Reinsel G (2003) Elements of Multivariate Time Series Analysis. Springer Science & Business Media.
Rigby R (1997) Bayesian discrimination between two multivariate normal populations with equal
covariance matrices. Journal of the American Statistical Association, 92, 1151–1154.
Rodrigues-Motta M, Pinheiro H, Martins E, Araujo M, dos Reis S (2013) Multivariate models for cor-
related count data. Journal of Applied Statistics, 40(7), 1586–1596.
Rossi P, Allenby G, McCulloch R (2005) Bayesian Statistics and Marketing. Wiley.
Rue H, Held L (2005) Gaussian Markov Random Fields: Theory and Applications. Chapman and Hall/
CRC.
Rupp A, Dey D, Zumbo B (2004) To Bayes or not to Bayes, from whether to when: Applications of
Bayesian methodology to modeling. Structural Equation Modeling, 11, 424–451.
Sahu S (2002) Bayesian estimation and model choice in item response models. Journal of Statistical
Computation and Simulation, 72, 217–232.
Sain S, Cressie N (2007) A spatial model for multivariate lattice data. Journal of Econometrics, 140,
226–259.
Sanchez B, Budtz-Jørgensen E, Ryan L, Hu H (2005) Structural equation models: A review with
applications to environmental epidemiology. Journal of the American Statistical Association, 100,
1443–1455.
Schumacker R, Lomax R (2016) A Beginner’s Guide to Structural Equation Modeling, 4th Edition.
Routledge.
Sethuraman J (1994) A constructive definition of Dirichlet priors. Statistica Sinica, 4, 639–665.
Sims C, Zha T (1998) Bayesian methods for dynamic multivariate models. International Economic
Review, 39, 949–968.
Skrondal A, Rabe-Hesketh S (2004) Generalized latent variable modeling: Multilevel, longitudinal
and structural equation models. Chapman & Hall/CRC, Boca Raton, FL.
Skrondal A, Rabe-Hesketh S (2007) Latent variable modelling: A survey. Scandinavian Journal of
Statistics, 34, 712–745.
Smith D, Harvey P, Lawn S, Harris M, Battersby M (2017) Measuring chronic condition self-man-
agement in an Australian community: Factor structure of the revised Partners in Health (PIH)
scale. Quality of Life Research, 26(1), 149–159.
Soares T, Gonçalves F, Gamerman D (2009) An integrated Bayesian model for DIF analysis. Journal of
Educational and Behavioral Statistics, 34(3), 348–377.
Song J, Ghosh M, Miaou S, Mallick B (2005) Bayesian multivariate spatial models for roadway traffic
crash mapping. Journal of Multivariate Analysis, 97, 246–273.
Song X-Y, Lee S-Y, Ng M, So W-Y, Chan J (2006) Bayesian analysis of structural equation models with
multinomial variables and an application to type 2 diabetic nephropathy. Statistics in Medicine,
26, 2348–2369.
Stromeyer W, Miller J, Sriramachandramurthy R, DeMartino R (2015) The prowess and pitfalls of
Bayesian structural equation modeling: Important considerations for management research.
Journal of Management, 41(2), 491–520.
Tabachnick B, Fidell L (2006) Using Multivariate Statistics, 5th Edition. Allyn & Bacon, Inc., Needham
Heights, MA.
Talhouk A, Doucet A, Murphy K (2012) Efficient Bayesian inference for multivariate probit models
with sparse inverse correlation matrices. Journal of Computational and Graphical Statistics, 21(3),
739–757.
Tanner M (1996) Tools for Statistical Inference: Methods for the Exploration of Posterior Distributions and
Likelihood Functions, 3rd Edition. Springer-Verlag, New York.
Tekwe C, Carter R, Cullings H, Carroll R (2014) Multiple indicators, multiple causes measurement
error models. Statistics in Medicine, 33(25), 4469–4481.
Thissen D, Steinberg L, Wainer H (1993) Detection of differential item functioning using the param-
eters of item response models, pp 67–113, in Differential Item Functioning: Theory and Practice, eds
P W Holland, H Wainer. Lawrence Erlbaum Associates, Hillsdale, NJ.
Tiao G, Tsay R (1989) Model specification in multivariate time series. Journal of the Royal Statistical
Society: Series B, 51, 157–213.
Tsay R (2014) Multivariate Time Series Analysis with R and Financial Applications. Wiley, Hoboken, NJ.
Tzala E, Best N (2007) Bayesian latent variable modelling of multivariate spatio-temporal variation
in cancer mortality. Statistical Methods in Medical Research, 2007 Sep 13 (epub).
Wang D, Lin J-Y, Yu T (2006) A MIMIC approach to modeling the underground economy in Taiwan.
Physica A, 371, 536–542.
Wang F, Wall M (2003) Generalized common spatial factor model. Biostatistics, 4(4), 569–582.
Wang Y, Neuman U, Wright S, Warton D (2012) mvabund: an R package for model-based analysis of
multivariate abundance data. Methods in Ecology and Evolution, 3, 471–473.
Wedel M, Bockenholt U, Kamakura W (2003) Factor models for multivariate count data. Journal of
Multivariate Analysis, 87, 356–369.
West M, Harrison J (1997) Bayesian Forecasting and Dynamic Models, 2nd Edition. Springer Verlag.
White A, Murphy T (2014) BayesLCA: An R Package for Bayesian Latent Class Analysis. Journal of
Statistical Software, 61(13). https://www.jstatsoft.org/article/view/v061i13
Yu J, Meyer A (2006) Multivariate stochastic volatility models: Bayesian estimation and model com-
parison. Econometric Reviews, 25, 361–384.
Yuan K-H, Bentler P, Chan W (2004) Structural equation modeling with heavy tailed distributions.
Psychometrika, 69, 421–436.
Zheng X, Rabe-Hesketh S (2007) Estimating parameters of dichotomous and ordinal item response
models with gllamm. Stata Journal, 7(3), 313–333.
Zhou X, Nakajima J, West M (2014) Bayesian forecasting and portfolio decisions using dynamic
dependent sparse factor models. International Journal of Forecasting, 30(4), 963–980.
Zivot E, Wang J (2006) Modeling Financial Time Series with S-PLUS. Springer, Berlin, Germany.
10
Hierarchical Models for Longitudinal Data

10.1 Introduction
Longitudinal data sets occur when continuous or discrete observations $y_{it}$ on a set of subjects,
or units $i = 1, \ldots, n$, are repeated over a number of measuring occasions $t = 1, \ldots, T_i$,
possibly differing between subjects (with $N = \sum_i T_i$ total observations). There are many contexts
for such data to occur, with variation in type of unit, study design, and data form. For
instance, in economic and marketing applications (Keane, 2015; Rossi et al., 2005), the unit
is typically an individual consumer, household, or firm, whereas in actuarial applications
(Antonio and Beirlant, 2007), the units may consist of groups of policyholders (risk classes)
with responses being insurance claim counts. Longitudinal studies often feature an inter-
vention or treatment comparison, with intervention studies (Thiese, 2014) including both
observational studies and controlled clinical trials with random treatment assignment.
In balanced studies, repeat measurements on all subjects are contemporaneous, whereas
measurement at different times for different units leads to unbalanced longitudinal data
(Daniels and Hogan, 2008), with unit specific times {ait , t = 1, … , Ti } at which events are
recorded. Furthermore, measuring occasions may be over more than one time scale. An
example is disease incidence by calendar time and age at onset or death, leading to a fur-
ther implicit cohort scale defined by the difference between age and time (Schmid and
Held, 2007; Lagazio et al., 2003); see Section 10.7.
There are also a variety of approaches to analysing longitudinal data, such as ran-
dom effects (conditional) models on the one hand, and marginal or population-averaged
approaches on the other (Heagerty and Zeger, 2000; Lee and Nelder, 2004), with conditional
model and marginal model parameters not necessarily having the same interpretation
(Verbeke et al., 2010). The focus of this chapter is on conditionally specified hierarchical
and random effect models, and on MCMC estimation via conditional likelihood with ran-
dom effects as part of the parameter set (e.g. Daniels and Hogan, 2008; Chib and Carlin,
1999); see Section 10.2.
Longitudinal data offer major advantages over cross-sectional designs in the analysis
of causal interrelationships between variables, including developmental and growth pro-
cesses and clinical studies, and before-after studies (Menard, 2002; Chen et al., 2016). The
accumulation of information over both times and subjects increases the power of statistical
methods to identify treatment effects or values-added (Lockwood et al., 2003), and permits
the estimation of parameters (e.g. permanent random effects or “frailties” for subjects i)
that are not identifiable from cross-sectional analysis or from repeated cross-sections on
different subjects.
On the other hand, analysis of longitudinal data may be problematic if the longitudinal
sequences are subject to missing observations; see Section 10.8. Missingness may involve
intermittently missing values of responses or predictors, or total loss to observation after
a certain point. The latter is variously known as attrition (Schafer and Graham, 2002) or
dropout (Hogan et al., 2004). A particular question relevant to estimation of the main struc-
tural model is whether such permanent exit is random (independent of the response that
would otherwise have been observed) or informatively related to the missing response.
Bayesian estimation via repeated sampling from posterior densities facilitates hierarchi-
cal modelling of longitudinal data, whether of permanent subject effects, correlated or iid
observation level errors, time-varying regressor effects, or common factors in multivariate
longitudinal data. As noted by Davidian and Giltinan (2003), random effects are treated
as parameters in Bayesian MCMC estimation, and not integrated out, as may be done in
frequentist approaches. The integrated or marginal likelihood estimation approach may
become infeasible in complex varying coefficient models (Tutz and Kauermann, 2003), and
different parameter estimates may be obtained according to the maximisation methods
used (Molenberghs and Verbeke, 2004). Bayesian modelling perspectives are also impor-
tant in the application of latent metric augmentation to categorical longitudinal outcomes
(binary, multinomial) (Chib and Carlin, 1999), and for dealing with missing data, espe-
cially attrition of subjects (Little and Rubin, 2002; Ibrahim and Molenberghs, 2009).

10.2 General Linear Mixed Models for Longitudinal Data


General linear mixed models (GLMMs) provide a framework for modelling longitudinal
data, and may be characterised by distributional and structural assumptions. Conditional
on predictors and random effects, it is assumed that data yit for subjects i and times t are
distributed according to the exponential family,

$$p(y_{it} \mid \theta_{it}, \phi) = \exp\left( \frac{y_{it}\theta_{it} - a(\theta_{it})}{\phi} + C(y_{it}, \phi) \right),$$

where $\theta_{it}$ denotes the natural parameter (Tsai and Hsiao, 2008; Fong et al., 2010; Natarajan
and Kass, 2000). The structural assumption governs the forms assumed for the conditional
means $E(y_{it}) = \mu_{it} = a'(\theta_{it})$, with regression link $g(\mu_{it}) = \eta_{it}$, and for the variances
$V_{it} = \phi \mathrm{Var}(\mu_{it})$. This involves questions such as whether the conditional mean is linear or
nonlinear in predictors and random effects, and at what levels random effects are present.
As in Chapter 9, random effects may be used at different levels: enduring differences
between subjects may be represented by time-invariant random effects, while represent-
ing excess dispersion may require observation level random effects. Fixed effects are also
used to represent subject level heterogeneity, especially in econometrics (Frees, 2004). This
is equivalent to generating dummy variables for each subject, and works best for relatively
few subjects and more time periods, as there is no pooling strength in fixed effects models
and the parameter count increases with the number of subjects n. In this chapter, the focus
is on random subject iid effects (exchangeable over units), though if the units are spatially
configured (say), then a structured prior for the permanent effects can be used.
Conjugate priors may be suitable for handling random variation at subject or observa-
tion level for exponential family responses, especially if random variation does not involve
predictors (Lee and Nelder, 2000). However, more flexible models, possibly involving sub-
ject specific regression effects, as well as varying intercepts, involve a vector of subject-
level effects $b_i = (b_{1i}, \ldots, b_{Qi})'$ in a general linear mixed model format. These are typically
taken to be normal with dispersion matrix $D$, with elements $\{\sigma_{bij},\ i, j \in 1, \ldots, Q\}$, and with
mean $B = (B_1, \ldots, B_Q)$, with elements zero or non-zero depending on how the predictors are
defined (see Section 10.2.1). With link $g$, the structural assumption specifies

$$g[E(y_{it} \mid b_i, u_{it})] = \eta_{it} = X_{it}\beta + Z_{it} b_i + u_{it}, \tag{10.1}$$

where $Z_{it} = (z_{1it}, \ldots, z_{Qit})$ is 1 × Q. In a typical analysis, $b_i$ and $u_i = (u_{i1}, \ldots, u_{iT_i})'$ are assumed
independent, and both also taken to be normal, at least initially. Residual autocorrela-
tion may necessitate a correlation structure in the observation errors (e.g. Franco and Bell,
2015). In some applications (e.g. Poisson data without overdispersion), the observation level
errors uit may not be present.
A particular widely applied GLMM is the normal linear mixed model, with

$$y_{it} = X_{it}\beta + Z_{it} b_i + u_{it},$$

with $u_i = (u_{i1}, \ldots, u_{iT_i})'$ iid normal with mean zero, $T_i \times T_i$ covariance matrix $\Sigma_i = \sigma^2 I$, and
conditional expectations

$$E(y_{it} \mid b_i) = \eta_{it} = X_{it}\beta + Z_{it} b_i.$$

The normal linear mixed model may be achieved with latent data $y_{it}^*$ underlying observed
data, either binary or categorical (e.g. Chib, 2008; Chib and Jeliazkov, 2006). Thus for binary $y_{it}$

$$y_{it}^* \mid b_i \sim N(X_{it}\beta + Z_{it} b_i, 1)\, I(0, \infty) \quad \text{if } y_{it} = 1,$$

$$y_{it}^* \mid b_i \sim N(X_{it}\beta + Z_{it} b_i, 1)\, I(-\infty, 0) \quad \text{if } y_{it} = 0,$$

with iid residuals under conditional independence having known variance $\sigma^2 = 1$, but with
$D$ still unknown.
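The augmentation step for a binary response can be sketched as follows. This is a minimal Python illustration, not the book's own code: the function name `draw_latent` and the value of the linear predictor are hypothetical, and the truncated normal draw uses the inverse-CDF method, which is only one of several ways to sample it.

```python
import random
from statistics import NormalDist

def draw_latent(eta, y, rng):
    """Draw y* ~ N(eta, 1) truncated to (0, inf) if y == 1, else to (-inf, 0)."""
    std = NormalDist()
    p0 = std.cdf(-eta)                    # Pr(y* <= 0) when y* ~ N(eta, 1)
    u = rng.random()
    if y == 1:
        p = p0 + u * (1.0 - p0)           # uniform on the upper region (p0, 1)
    else:
        p = u * p0                        # uniform on the lower region (0, p0)
    p = min(max(p, 1e-12), 1.0 - 1e-12)   # keep inv_cdf away from 0 and 1
    return eta + std.inv_cdf(p)

rng = random.Random(1)
eta = 0.3   # hypothetical value of X_it * beta + Z_it * b_i
draws_1 = [draw_latent(eta, 1, rng) for _ in range(1000)]
draws_0 = [draw_latent(eta, 0, rng) for _ in range(1000)]
```

Within an MCMC scheme, a draw of this kind would be made for every observation at each iteration, after which the remaining parameters are updated as in a normal linear mixed model with unit residual variance.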
Stacked over times, the conditional mean in (10.1) is expressed as

$$\eta_i = X_i\beta + Z_i b_i + u_i, \tag{10.2}$$

where $\eta_i$ is $T_i \times 1$, $X_i$ is $T_i \times P$, and $Z_i$ is $T_i \times Q$, while the normal linear mixed model is

$$y_i = X_i\beta + Z_i b_i + u_i.$$

For the normal linear mixed model, the marginal model (with $b_i$ integrated out) is obtainable
analytically as

$$y_i \sim N(X_i\beta, Z_i D Z_i' + \sigma^2 I),$$

which is a feature not present for the broader class of general linear mixed models
(Molenberghs and Verbeke, 2006).
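As a small numerical illustration of the marginal covariance of the normal linear mixed model (Z_i D Z_i' plus sigma^2 times the identity, for zero-mean random effects), the following Python sketch evaluates it for a random intercept and slope design. The function name `marginal_cov` and all numerical values are assumptions chosen for illustration only.

```python
def marginal_cov(Z, D, sigma2):
    """Return Z D Z' + sigma2 * I for a T x Q design Z and a Q x Q matrix D."""
    T, Q = len(Z), len(D)
    V = [[0.0] * T for _ in range(T)]
    for s in range(T):
        for t in range(T):
            V[s][t] = sum(Z[s][q] * D[q][r] * Z[t][r]
                          for q in range(Q) for r in range(Q))
            if s == t:
                V[s][t] += sigma2       # residual variance on the diagonal only
    return V

# Random intercept and slope: Z_it = (1, t) for t = 1, 2, 3
Z = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
D = [[1.0, 0.1], [0.1, 0.04]]           # illustrative values only
V = marginal_cov(Z, D, sigma2=0.5)
```

The off-diagonal entries of `V` show how shared random effects induce correlation between repeated observations on the same subject, even though the residuals are independent.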
Given the fixed regression effects, and subject permanent effects $b = (b_1, \ldots, b_n)$, repeated
observations on the same subject are conditionally independent (Kleinman and Ibrahim,
1998; Tutz and Kauermann, 2003), and the conditional likelihood factors as

$$p(y \mid b, \beta, \sigma^2) = \prod_{i=1}^{n} p(y_i \mid b_i, \beta, \sigma^2),$$

where $p(y_i \mid b_i, \beta, \sigma^2) = \prod_{t=1}^{T_i} p(y_{it} \mid b_i, \beta, \sigma^2)$. Similarly, the joint density
$p(y, b \mid \beta, D, B, \sigma^2) = p(y \mid b, \beta, \sigma^2)\, p(b \mid B, D)$ factors into subject-specific elements

$$\left( \prod_{t=1}^{T_i} p(y_{it} \mid b_i, \beta, \sigma^2) \right) p(b_i \mid B, D).$$

If model checking in fact reveals the conditional independence assumption does not pro-
vide an adequate fit, then the model requires elaboration. For example, if checking shows
regression errors correlated through time, then the iid assumption for uit may have to be
reconsidered, or lagged effects in the response included in predictor sets Xit or Zit (Frees,
2004, p.279); see Sections 10.3 and 10.5.

10.2.1 Centred or Non-Centred Priors


The parameterisation adopted in the prior for bi depends on whether Xit and Zit are speci-
fied to be overlapping or distinct, and also on MCMC convergence considerations. In many
presentations of the GLMM, Zit are assumed to be a subset of Xit, in which case the bi are
typically taken to be zero mean random effects, usually following a standard density (e.g.
multivariate normal). If the Xit and Zit are non-overlapping, then the bqi may be taken to
have non-zero means equal to the central (fixed effect) regression parameters Bq for zqit
(Chib and Carlin, 1999, p.20), namely

$$g[E(y_{it} \mid b_i, u_{it})] = X_{it}\beta + Z_{it} b_i + u_{it},$$

$$b_i \sim N(B, D).$$

Such hierarchical centring may assist precise identification and MCMC convergence. If no
predictors have fixed effect coefficients, one has what is sometimes termed a random coefficient
regression, namely

$$g[E(y_{it} \mid b_i, u_{it})] = Z_{it} b_i + u_{it},$$

where $b_{qi}$ have non-zero means $B_q$ (e.g. Daniels and Hogan, 2008).
Papaspiliopoulos et al. (2003) compare MCMC convergence for centred, non-centred,
and partially non-centred hierarchical model parameterisations, and mention that hierar-
chical centring may be less effective when the latent effects bi are relatively weakly identi-
fied. Consider the normal linear mixed model in the form

$$y_i = X_i\beta + Z_i b_i + (\sigma^2 I_{T_i})^{0.5}\varepsilon_i,$$

$$b_i = B + D^{0.5} v_i,$$

where $\varepsilon_i$ and $v_i$, of dimension $T_i$ and $Q$ respectively, are standard normal variables.
Then the non-centred parameterisation (NCP) and partially non-centred parameterisation
(PNCP) are respectively

$$\tilde{b}_i = b_i - B,$$

and

$$\tilde{b}_i^w = b_i - W_i B,$$

where $W_i = U_i^{-1} D^{-1}$, $U_i = (1/\sigma^2) Z_i' Z_i + D^{-1}$, and

$$\tilde{b}_i \sim N(0, D),$$

$$\tilde{b}_i^w \sim N(B - W_i B, D).$$

The proportion of $B$ subtracted from $b_i$ under the PNCP form (that has favourable MCMC
convergence properties) is observation specific. The longitudinal models under the NCP
and PNCP parameterisations become

$$y_i = X_i\beta + Z_i B + Z_i\tilde{b}_i + (\sigma^2 I_{T_i})^{0.5}\varepsilon_i,$$

$$y_i = X_i\beta + Z_i W_i B + Z_i\tilde{b}_i^w + (\sigma^2 I_{T_i})^{0.5}\varepsilon_i.$$

The NCP form has potential use in random effects selection (see Section 10.2.3).

10.2.2 Priors on Unit Level Random Effects


The most commonly adopted prior for the random subject or cluster effects $b_i = (b_{1i}, \ldots, b_{Qi})$
is an iid multivariate normal

$$(b_{1i}, \ldots, b_{Qi}) \sim N_Q(B, D), \tag{10.3}$$

where the means $B = (B_1, \ldots, B_Q)$ are either zeroes or unknown fixed effects, and
$D = [d_{rs}] = [\sigma_{brs}]$ represents covariation within subjects between the rth and sth random
effects $b_{ri}$ and $b_{si}$. If the $Z_{it}$ are a subset of the $X_{it}$, then the means $B_q$ will be zero. For robustness
against non-normality or outliers, other forms of mixture, including scale mixtures of
normals, or discrete mixtures of random effects, may be assumed for subject effects (Section
10.6). For spatially configured units, a prior for $(b_{1i}, \ldots, b_{Qi})$ including correlation over
areas is likely to be relevant. For doubly nested data (e.g. observations $y_{ijt}$ within subjects i
within clusters j), the second stage parameters are likely to be cluster specific and possibly
also randomly varying, as in

$$(b_{1ij}, \ldots, b_{Qij}) \sim N_Q(B_j, D_j),$$

$$(B_{1j}, \ldots, B_{Qj}) \sim N_Q(M_B, C_B).$$

In many applications, the $Z_{it}$ will be of relatively small dimension, confined to the intercept
or simple time functions. For example, if Q = 1 and $Z_{it} = 1$, one has the normal linear form

$$y_{it} = b_i + X_{it}\beta + u_{it}, \tag{10.4}$$

where bi represent permanent subject effects, namely enduring differences between sub-
jects due to unmeasured attributes. If Xit excludes (or includes) an intercept, then the bi will
be normal with mean B (or zero) and variance D.
In growth curve applications, the $Z_{it}$ typically include transforms of time or age, and the
mean level for an individual changes with time or age (e.g. linearly or quadratically) with
growth rates specific to each subject. For example, under a linear growth model with Q = 2,
each subject has their own linear growth rate (Weiss, 2005)

$$y_{it} = b_{1i} + b_{2i} t + X_{it}\beta + u_{it},$$

where $D_{12}$ measures the covariance between intercepts and slopes. Assuming $X_{it}$ omits an
intercept and linear time term, one may take

$$(b_{1i}, b_{2i}) \sim N([B_1, B_2], D).$$

In particular, under the random intercept and slope (RIAS) model

$$y_{it} = b_{1i} + b_{2i} t + u_{it},$$

so that an individual's response will differ from his/her mean level at a particular time
or age by a random term $u_{it}$. Another option is to replace known time functions by an
unknown time-varying function, $\delta_t$, as in

$$y_{it} = b_{1i} + b_{2i}\delta_t + X_{it}\beta + u_{it},$$

with $\delta_t$ subject to identifying constraints (e.g. $\delta_1 = 0$, and $\delta_T = 1$), or with the variance of $b_{2i}$
preset (Zhang et al., 2007; Zhang, 2016).
To illustrate MCMC sampling, consider the random coefficient normal linear model, namely

$$y_{it} = Z_{it} b_i + u_{it},$$

$$u_{it} \sim N(0, \sigma^2),$$

$$(b_{1i}, b_{2i}, \ldots, b_{Qi}) \sim N_Q([B_1, B_2, \ldots, B_Q], D).$$

Let $\tau = 1/\sigma^2$ and assume, following Wakefield et al. (1994), that $\tau \sim \mathrm{Ga}(\nu_0/2, \tau_0\nu_0/2)$. Also
assume a multivariate normal prior for the second-stage population means $B$, and a
Wishart prior for $D^{-1}$, namely

$$B \sim N(B_0, C),$$

$$D^{-1} \sim W([\rho R]^{-1}, \rho).$$

Setting

$$N = \sum_{i=1}^{n} T_i,$$

$$E_i^{-1} = \tau Z_i' Z_i + D^{-1},$$

$$V^{-1} = n D^{-1} + C^{-1},$$

$$\bar{b} = \sum_{i=1}^{n} b_i / n,$$

then Gibbs sampling involves the posterior conditionals

$$b_i \sim N(E_i[\tau Z_i' y_i + D^{-1} B], E_i), \quad i = 1, \ldots, n,$$

$$B \sim N(V[n D^{-1}\bar{b} + C^{-1} B_0], V),$$

$$D^{-1} \sim W\left( \left[ \sum_{i=1}^{n} (b_i - B)(b_i - B)' + \rho R \right]^{-1}, n + \rho \right),$$

$$\tau \sim \mathrm{Ga}\left( \frac{\nu_0 + N}{2}, \frac{1}{2}\left[ \sum_{i=1}^{n} (y_i - Z_i b_i)'(y_i - Z_i b_i) + \nu_0\tau_0 \right] \right).$$
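The four conditionals can be sketched in Python for the simplest case Q = 1 with Z_it = 1 (a random intercept only), where the Wishart update for the inverse of D reduces to a gamma update with shape (n + rho)/2 and rate (sum of squared deviations of b_i from B, plus rho times R)/2. The function name, the prior settings, and the simulated data below are all assumptions made for illustration; this is a sketch of the scheme under those assumptions, not the book's implementation.

```python
import random

def gibbs_random_intercept(y, n_iter=600, burn=100, seed=1,
                           nu0=1.0, tau0=1.0, B0=0.0, C=100.0, rho=1.0, R=1.0):
    """Gibbs sampler for y_it = b_i + u_it, b_i ~ N(B, D), u_it ~ N(0, 1/tau)."""
    rng = random.Random(seed)
    n = len(y)
    N = sum(len(yi) for yi in y)
    b = [sum(yi) / len(yi) for yi in y]          # start at subject means
    B, Dinv, tau = sum(b) / n, 1.0, 1.0
    keep = {"B": [], "D": [], "sigma2": []}
    for it in range(n_iter):
        # b_i ~ N(E_i [tau * sum_t y_it + D^{-1} B], E_i), with E_i^{-1} = tau * T_i + D^{-1}
        for i, yi in enumerate(y):
            E = 1.0 / (tau * len(yi) + Dinv)
            b[i] = rng.gauss(E * (tau * sum(yi) + Dinv * B), E ** 0.5)
        # B ~ N(V [n D^{-1} bbar + C^{-1} B0], V), with V^{-1} = n D^{-1} + C^{-1}
        V = 1.0 / (n * Dinv + 1.0 / C)
        B = rng.gauss(V * (Dinv * sum(b) + B0 / C), V ** 0.5)
        # D^{-1}: scalar Wishart, i.e. gamma with shape (n + rho)/2 and rate (ssb + rho * R)/2
        ssb = sum((bi - B) ** 2 for bi in b)
        Dinv = rng.gammavariate((n + rho) / 2.0, 2.0 / (ssb + rho * R))
        # tau ~ Ga((nu0 + N)/2, rate (sse + nu0 * tau0)/2); gammavariate takes a scale
        sse = sum((v - b[i]) ** 2 for i, yi in enumerate(y) for v in yi)
        tau = rng.gammavariate((nu0 + N) / 2.0, 2.0 / (sse + nu0 * tau0))
        if it >= burn:
            keep["B"].append(B)
            keep["D"].append(1.0 / Dinv)
            keep["sigma2"].append(1.0 / tau)
    return {k: sum(v) / len(v) for k, v in keep.items()}

# Simulated data: n = 200 subjects, T = 5 occasions, B = 5, D = 1, sigma^2 = 2
sim = random.Random(42)
y = []
for _ in range(200):
    bi = sim.gauss(5.0, 1.0)
    y.append([bi + sim.gauss(0.0, 2.0 ** 0.5) for _ in range(5)])
post = gibbs_random_intercept(y)   # posterior means of B, D and sigma^2
```

For Q > 1 the same structure applies, with the scalar updates replaced by multivariate normal and Wishart draws.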

When predictors are available that might explain heterogeneity between subjects (e.g. treat-
ment allocations), regression priors may be used as means for unit random effects bqi (Chib,
2008). Thus, consider a model with varying intercepts b1i, varying linear and quadratic
growth effects $\{b_{2i}, b_{3i}\}$, and observations at differentially spaced time points $\{a_{i1}, a_{i2}, \ldots, a_{iT_i}\}$
(Muthen et al., 2002). So

$$y_{it} = b_{1i} + b_{2i} a_{it} + b_{3i} a_{it}^2 + u_{it},$$

where random growth coefficients are related to an intervention variable $Tr_i$ according to

$$b_{1i} = B_1 + \varepsilon_{1i}, \tag{10.5}$$

$$b_{2i} = B_2 + \delta_2 Tr_i + \varepsilon_{2i},$$

$$b_{3i} = B_3 + \delta_3 Tr_i + \varepsilon_{3i},$$

with $(\varepsilon_{1i}, \ldots, \varepsilon_{Qi}) \sim N_Q(0, D)$. Treatment is randomised so the baseline effects $b_{1i}$ are taken to
be independent of the intervention $Tr_i$.

10.2.3 Priors for Random Covariance Matrix and Random Effect Selection


Inferences may be sensitive to the form of prior adopted for the dispersion matrix D, and
the amount of information it contains. Improper or overly diffuse priors on D or other
variance hyperparameters may be associated with actual or effectively improper posterior
densities. For example, the Jeffreys rule prior, namely

$$p(D) \propto \det(D)^{-(Q+1)/2}$$

may lead to an improper joint posterior for $D$ and $\beta$ under certain conditions (Natarajan
and Kass, 2000). The conjugate model for Q > 1 random effects involves a Wishart prior for
$D^{-1}$, $D^{-1} \sim \mathrm{Wish}(A, \nu)$, or

$$p(D^{-1} \mid A, \nu) \propto |A|^{\nu/2}\, |D^{-1}|^{0.5(\nu - Q - 1)} \exp[-0.5\,\mathrm{tr}(A D^{-1})],$$

where $E(D^{-1}) = \nu A^{-1}$ and $E(D) = A/(\nu - Q - 1)$.


Setting the elements in A may be difficult in the absence of substantive information.
Greater flexibility, including random effects selection, may be gained with matrix decom-
position alternatives to the Wishart (e.g. Alvarez et al., 2016; Kinney and Dunson, 2007;
Frühwirth-Schnatter and Tüchler, 2008; Tutz and Kauermann, 2003; Hedeker, 2003), or with
adaptations of the uniform shrinkage prior (Natarajan and Kass, 2000; Tsai and Hsiao, 2008).
One can follow a separation strategy (Barnard et al., 2000), decomposing the dispersion
matrix into a product of a correlation matrix $R$ and diagonal matrices $\Delta = \mathrm{diag}(\sigma_{b1}, \ldots, \sigma_{bQ})$,
namely

$$D = \Delta R \Delta.$$

Barnard et al. (2000) construct the correlation matrix $R$ from an inverse Wishart distribution,
but a more versatile approach is provided by the LKJ(ν) prior (Lewandowski et al.,
2009), available in rstan. Thus

$$R \sim \mathrm{LKJ}(\nu),$$

where, as ν increases, large correlations are less plausible and the prior concentrates around
the identity correlation matrix. At ν = 1, the LKJ(ν) correlation distribution reduces to the uniform
distribution over correlation matrices, so that all correlations are equally plausible. A
setting such as LKJ(1.5) might be taken as applicable in many situations, where extreme
correlations of −1 or +1 are downweighted slightly, but relatively high correlations are not
to be ruled out. Any suitable prior (inverse gamma, lognormal, uniform, half Cauchy) may
be used for the standard deviations σbj.
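A short Python sketch of the separation strategy follows. It builds the dispersion matrix from standard deviations and a correlation matrix, and evaluates the unnormalised LKJ(ν) density, which is proportional to det(R) raised to the power ν − 1, in the 2 × 2 case where det(R) = 1 − r². The function names and numerical values are hypothetical.

```python
def separation_cov(sds, R):
    """Build the dispersion matrix from standard deviations and a correlation matrix."""
    Q = len(sds)
    return [[sds[q] * R[q][r] * sds[r] for r in range(Q)] for q in range(Q)]

def lkj_unnormalised_2x2(r, nu):
    """Unnormalised LKJ(nu) density of a 2 x 2 correlation matrix: (1 - r^2)^(nu - 1)."""
    return (1.0 - r * r) ** (nu - 1.0)

sds = [1.0, 0.1]                      # illustrative sigma_b1, sigma_b2
R = [[1.0, 0.3], [0.3, 1.0]]
D = separation_cov(sds, R)            # covariance D[0][1] = 1.0 * 0.3 * 0.1

# nu = 1 gives a flat density in r; nu = 1.5 downweights correlations near -1 or +1
flat = [lkj_unnormalised_2x2(r, 1.0) for r in (0.0, 0.5, 0.9)]
shrunk = [lkj_unnormalised_2x2(r, 1.5) for r in (0.0, 0.5, 0.9)]
```

Comparing `flat` with `shrunk` shows the mild downweighting of extreme correlations that a setting such as LKJ(1.5) provides.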
Cholesky decomposition methods also provide flexibility. Consider the Cholesky decomposition
$D = CC'$ where $C$ is a lower triangular matrix, with $D_{pq} = \sum_{r=1}^{Q} c_{pr} c_{qr}$ and variances
obtained as

$$D_{qq} = \sum_{r=1}^{Q} c_{qr}^2.$$

Then if $Z_{it}$ is a subset of $X_{it}$, (10.1) may be expressed

$$\eta_{it} = X_{it}\beta + Z_{it} C \zeta_i + u_{it},$$

where $(\zeta_{1i}, \ldots, \zeta_{Qi}) \sim N_Q(0, I)$. For example, with Q = 2,

$$(z_{1it}, z_{2it}) \begin{pmatrix} c_{11} & 0 \\ c_{21} & c_{22} \end{pmatrix} \begin{pmatrix} \zeta_{1i} \\ \zeta_{2i} \end{pmatrix} = \zeta_{1i}(z_{1it} c_{11} + z_{2it} c_{21}) + \zeta_{2i} z_{2it} c_{22}.$$

Instead of a Wishart prior on $D^{-1}$, priors are then adopted for each element of $C$. To ensure
$D$ is positive definite, the diagonal terms $c_{11}$ and $c_{22}$ need to be assigned positive priors,
while the prior on $c_{21}$ is unconstrained. For Q = 3, one has

$$(z_{1it}, z_{2it}, z_{3it}) \begin{pmatrix} c_{11} & 0 & 0 \\ c_{21} & c_{22} & 0 \\ c_{31} & c_{32} & c_{33} \end{pmatrix} \begin{pmatrix} \zeta_{1i} \\ \zeta_{2i} \\ \zeta_{3i} \end{pmatrix} = \zeta_{1i}(z_{1it} c_{11} + z_{2it} c_{21} + z_{3it} c_{31}) + \zeta_{2i}(z_{2it} c_{22} + z_{3it} c_{32}) + \zeta_{3i} z_{3it} c_{33},$$

with three positive unknowns $c_{qq}$ and three unconstrained lower diagonal unknowns.
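An alternative Cholesky decomposition (Cai and Dunson, 2006; Chen and Dunson, 2003)
has

$$D = \Lambda\Omega\Omega'\Lambda,$$

where $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_Q)$, and $\Omega$ is lower triangular,

$$\Omega = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ \omega_{21} & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & 0 \\ \omega_{Q1} & \omega_{Q2} & \cdots & 1 \end{pmatrix}.$$

Hence $C$ in $D = CC'$ can be written

$$C = \begin{pmatrix} \lambda_1 & 0 & \cdots & 0 \\ \omega_{21}\lambda_2 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & 0 \\ \omega_{Q1}\lambda_Q & \omega_{Q2}\lambda_Q & \cdots & \lambda_Q \end{pmatrix}.$$

Positive priors (e.g. lognormal, gamma) are taken for $\lambda_q$, while normal $N(0, V_\Omega)$ priors may
be assumed for unconstrained elements of $\Omega$, with $V_\Omega = 0.5$ providing relatively diffuse
priors on correlations between the $b_{qi}$. Retention of terms in $\Lambda$ is determined by binary
indicators $\gamma_{qq} \sim \mathrm{Bern}(\pi_{qq})$, where $\pi_{qq}$ may be preset or unknown. Retention of the unknown
terms in $\Omega$ is determined both by binary indicators $\gamma_{qr} \sim \mathrm{Bern}(\pi_{qr})$, and also by whether $\lambda_q$
and $\lambda_r$ are retained; if either of $\{\lambda_q, \lambda_r\}$ is omitted, then $\omega_{qr}$ necessarily is.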
If $Z_{it}$ is not a subset of $X_{it}$, one may consider the non-centred parameterisation (Frühwirth-Schnatter
and Tüchler, 2008)

$$\eta_{it} = X_{it}\beta + Z_{it}B + Z_{it}C\zeta_i + u_{it},$$

where $(\zeta_{1i}, \ldots, \zeta_{Qi}) \sim N_Q(0, I)$. As above, diagonal terms $c_{qq}$ need to be assigned positive
priors, while priors for $c_{qr}$ ($q > r$) are unconstrained. Selection of which $c_{qq}$ and $c_{qr}$
terms to retain may be based on binary indicators $\{\gamma_{qq} \sim \mathrm{Bern}(\pi_{qq}),\ \pi_{qq} \sim \mathrm{Be}(a_{qq}, b_{qq})\}$,
$\{\gamma_{qr} \sim \mathrm{Bern}(\pi_{qr}),\ \pi_{qr} \sim \mathrm{Be}(a_{qr}, b_{qr}),\ q > r\}$ where $a_{qq} = b_{qq} = a_{qr} = b_{qr} = 1$ is a default option. In
effect, the model involves composite terms,

$$\Gamma_{qq} = c_{qq}\gamma_{qq}, \tag{10.6}$$

$$\Gamma_{qr} = c_{qr}\gamma_{qr}, \quad q > r,$$

so that for Q = 2

$$\eta_{it} = X_{it}\beta + \zeta_{1i}(z_{1it}\Gamma_{11} + z_{2it}\Gamma_{21}) + \zeta_{2i} z_{2it}\Gamma_{22} + u_{it}.$$

The posterior estimate for $D$ would be based on MCMC monitoring of $D_{qr} = \sum_{s=1}^{Q} \Gamma_{qs}\Gamma_{rs}$.
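The monitoring step can be sketched in Python as below: at a given MCMC iteration, the lower triangular Cholesky terms are multiplied elementwise by their binary indicators and the implied covariance matrix is formed from the products of the composite terms. The function name and the values of the Cholesky terms and indicators are hypothetical, standing in for one sampled iteration.

```python
def cov_from_selection(c, gamma):
    """Return D with D[q][r] = sum_s G[q][s] * G[r][s], where G[q][r] = c[q][r] * gamma[q][r]."""
    Q = len(c)
    # Composite terms; entries above the diagonal are zero by construction
    G = [[c[q][r] * gamma[q][r] if r <= q else 0.0 for r in range(Q)]
         for q in range(Q)]
    return [[sum(G[q][s] * G[r][s] for s in range(Q)) for r in range(Q)]
            for q in range(Q)]

# Hypothetical sampled values for Q = 3
c = [[1.0, 0.0, 0.0],
     [0.2, 0.5, 0.0],
     [0.1, 0.3, 0.4]]
gamma = [[1, 0, 0],
         [0, 0, 0],      # second random effect dropped entirely at this iteration
         [1, 0, 1]]
D = cov_from_selection(c, gamma)
```

Averaging such matrices over iterations gives a posterior estimate of the dispersion matrix that accounts for uncertainty about which random effects are retained.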
A regression approach to covariance estimation for longitudinal data is proposed
by Pourahmadi (1999, 2000) and Pourahmadi and Daniels (2002). For the essence of the
method, consider normal metric data $y_{it}$ for subjects $i = 1, \ldots, n$ with individual specific
covariance matrices $\Sigma_i$ of dimension T × T. The model $y_{it} \sim N(\mu_{it}, \Sigma_i)$ may be re-expressed
(for the purposes of decomposing $\Sigma_i$) as an antedependence model

$$y_{it} - \mu_{it} = \sum_{j=1}^{t-1} \phi_{itj}(y_{ij} - \mu_{ij}) + u_{it},$$

where the errors $u_{it} \sim N(0, h_{it})$ are uncorrelated. Denote

$$H_i = \mathrm{diag}(h_{i1}, \ldots, h_{iT}),$$

together with the lower triangular matrix,

$$\Phi_i = \begin{pmatrix} 1 & & & & \\ -\phi_{i21} & 1 & & & \\ -\phi_{i31} & -\phi_{i32} & 1 & & \\ \vdots & \vdots & \vdots & \ddots & \\ -\phi_{iT1} & -\phi_{iT2} & \cdots & -\phi_{iT,T-1} & 1 \end{pmatrix}.$$

One then has the decomposition

$$\mathrm{Var}(u_i) = H_i = \Phi_i \Sigma_i \Phi_i'.$$

The parameters $\phi_{itj}$ and $h_{it}$ may be referred to respectively as the generalised autoregressive
parameters and the innovation variances of $\Sigma_i$ (Pourahmadi and Daniels, 2002).
A parsimonious covariance model, especially for large T, may then be achieved by using
predictors $z_{it}$ and $w_{itj}$ in the regressions

$$\log(h_{it}) = z_{it}\gamma,$$

$$\phi_{itj} = w_{itj}\lambda.$$

Often one might take $\Sigma_i = \Sigma$, in which case

$$y_{it} - \mu_{it} = \sum_{j=1}^{t-1} \phi_{tj}(y_{ij} - \mu_{ij}) + u_{it},$$

with $u_{it} \sim N(0, h_t)$, $H = \mathrm{diag}(h_1, \ldots, h_T)$, and

$$\Phi = \begin{pmatrix} 1 & & & & \\ -\phi_{21} & 1 & & & \\ -\phi_{31} & -\phi_{32} & 1 & & \\ \vdots & \vdots & \vdots & \ddots & \\ -\phi_{T1} & -\phi_{T2} & \cdots & -\phi_{T,T-1} & 1 \end{pmatrix},$$

with $\mathrm{Var}(u_i) = H = \Phi \Sigma \Phi'$. The covariates used for the covariance model become $\{z_t, w_{tj}\}$, where
the wtj might simply be powers in (t − j) as illustrated by Cepeda and Gamerman (2004),
and the zt are simply powers of t. A possible drawback to using polynomial functions of
time is the multicollinearity that may be encountered, and Bayesian regression selection
may then be applied. One may also consider autoregressive or random walk priors in ht
and modelling ϕtj as a collection of iid random effects under a shrinkage prior strategy
(Daniels and Pourahmadi, 2002, p.558).
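The link between the generalised autoregressive parameters, the innovation variances, and the implied covariance matrix can be checked numerically. The Python sketch below rebuilds the covariance matrix from hypothetical values for T = 3 by unrolling the antedependence recursion, and then verifies that pre- and post-multiplying by the unit lower triangular matrix of negated autoregressive parameters returns the diagonal matrix of innovation variances. Function names and values are assumptions for illustration.

```python
def sigma_from_antedependence(phi, h):
    """phi[t][j] (j < t): generalised autoregressive parameters; h[t]: innovation variances."""
    T = len(h)
    # L is the inverse of the unit lower triangular matrix: y_t - mu_t = sum_k L[t][k] * u_k
    L = [[0.0] * T for _ in range(T)]
    for t in range(T):
        L[t][t] = 1.0
        for k in range(t):
            L[t][k] = sum(phi[t][j] * L[j][k] for j in range(k, t))
    return [[sum(L[s][k] * h[k] * L[t][k] for k in range(T)) for t in range(T)]
            for s in range(T)]

def innovation_cov(phi, Sigma):
    """Return Phi * Sigma * Phi', with Phi unit lower triangular and entries -phi[t][j]."""
    T = len(Sigma)
    Phi = [[1.0 if s == t else (-phi[s][t] if t < s else 0.0) for t in range(T)]
           for s in range(T)]
    PS = [[sum(Phi[s][k] * Sigma[k][t] for k in range(T)) for t in range(T)]
          for s in range(T)]
    return [[sum(PS[s][k] * Phi[t][k] for k in range(T)) for t in range(T)]
            for s in range(T)]

# Hypothetical parameters for T = 3 occasions
phi = [[0.0, 0.0, 0.0],
       [0.5, 0.0, 0.0],
       [0.2, 0.4, 0.0]]
h = [1.0, 0.8, 0.6]
Sigma = sigma_from_antedependence(phi, h)
H_check = innovation_cov(phi, Sigma)   # should be diagonal with entries h
```

Because any real values of the autoregressive parameters and any positive innovation variances yield a valid covariance matrix, both can be modelled by unconstrained regressions, which is the main attraction of the approach.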

10.2.4 Priors for Multiple Sources of Error Variation


Estimation of variance components and convergence of MCMC samplers for longitudinal
data may also be sensitive to the assumed prior interlinkages (or not) between multiple
sources of random variation. Consider a random intercept model

$$y_{it} = b_i + X_{it}\beta + u_{it},$$

with unknown variances $D = \mathrm{var}(b_i)$ and $\sigma^2 = \mathrm{var}(u_{it})$. The conjugate approach with the
advantage of simple posterior conditionals involves separate gamma priors on $D^{-1}$ and
$\tau = \sigma^{-2}$. These could be informative (e.g. downweighted results from a maximum likelihood
fit), but are often taken to be diffuse with small scale and shape parameters, leading to
potentially delayed convergence of Gibbs sampling methods since sampling is from an
almost improper posterior (Natarajan and McCulloch, 1998). These problems may increase
if an autocorrelated error term is added to the white noise error as in

$$y_{it} = b_i + X_{it}\beta + u_{it} + \varepsilon_{it},$$

$$\varepsilon_{it} = \rho\varepsilon_{i,t-1} + v_{it},$$

where $v_{it} \sim N(0, \sigma_v^2)$, and there are three variances.


An alternative is to allow for potential interdependence between variance components
via adaptations of the uniform shrinkage prior (Natarajan and Kass, 2000). The uniform
shrinkage principle extends to beta priors on the relative shares for two variances and
to Dirichlet priors on the relative shares for three or more variance components. So,
one might set a prior on one or other of $D$ or $\sigma^2$, but then specify a variance partitioning
rule such as $\kappa = D/(D + \sigma^2) \sim U(0, 1)$ to obtain the other. A related strategy might take
$\kappa = D/(D + \sigma^2) \sim \mathrm{Be}(a_\kappa, b_\kappa)$, where $a_\kappa$ and $b_\kappa$ are preset or hyperparameters. Lee and Hwang
(2000) use uniform shrinkage priors in a multilevel longitudinal context when repetitions
$t = 1, \ldots, T_{ij}$ are for subjects i nested within clusters j, and where there is an autocorrelated
error $\varepsilon_{ijt}$ as well as a white noise error $u_{ijt}$. An extension to multivariate $b_i$ of the uniform
shrinkage prior proposed by Natarajan and Kass (2000) takes the form

$$p(D) \propto \det\left[ I_Q + \left\{ \frac{1}{n}\sum_{i=1}^{n} Z_i' W_i Z_i \right\} D \right]$$

where $W_i$ is diagonal of dimension $T_i$ with elements $1/(V_{it}[\partial\eta_{it}/\partial\mu_{it}]^2)$, where $V_{it} = \phi\mathrm{Var}(\mu_{it})$
and $g(\mu_{it}) = \eta_{it}$.
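For the scalar case, the variance partitioning rule can be sketched as a prior simulation in Python: a prior is set on the residual variance, a beta (or uniform) prior on the share of the subject-level variance in the total, and the subject-level variance is then obtained by solving the partition for it. The inverse gamma prior on the residual variance and the function name are assumptions made purely for illustration.

```python
import random

def draw_variances(rng, a_kappa=1.0, b_kappa=1.0):
    """Draw (D, sigma2, kappa) with kappa = D / (D + sigma2) ~ Be(a_kappa, b_kappa)."""
    sigma2 = 1.0 / rng.gammavariate(2.0, 1.0)    # hypothetical inverse gamma prior on sigma^2
    kappa = rng.betavariate(a_kappa, b_kappa)    # Be(1, 1) is the U(0, 1) case
    D = kappa * sigma2 / (1.0 - kappa)           # solve kappa = D / (D + sigma2) for D
    return D, sigma2, kappa

rng = random.Random(3)
draws = [draw_variances(rng) for _ in range(5)]
```

With three or more variance components, the beta draw would be replaced by a Dirichlet draw of shares of a total variance.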

Example 10.1 Growth Model Simulation


This example illustrates sensitivity to prior specifications and also the application of
random effects selection. Thus observations $y_{it}$ ($i = 1, \ldots, 500$; $t = 1, \ldots, 5$) are generated
according to a prototypical growth model, namely:

$$y_{it} = b_{1i} + b_{2i} t + b_{3i} x_{it} + u_{it},$$

$$x_{it} \sim U(-1, 1),$$

$$(b_{1i}, b_{2i}, b_{3i}) \sim N_3(B, D),$$

$$u_{it} \sim N(0, 1/\tau),$$

$$\tau = 0.5,$$

$$D^{-1} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 100 & 0 \\ 0 & 0 & 10000 \end{pmatrix},$$

$$B = (5, 0.5, 0.5).$$

So, the known covariance structure involves uncorrelated random effects $b_{qi}$ with
respective standard deviations {1,0.1,0.01}.
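The data-generating process just described can be sketched in Python as follows. The function name, seed, and long-format output layout are assumptions; the parameter values are those stated above, with the random effects drawn independently since the dispersion matrix is diagonal.

```python
import random

def simulate_growth(n=500, T=5, seed=1):
    """Simulate y_it = b_1i + b_2i * t + b_3i * x_it + u_it in long format."""
    rng = random.Random(seed)
    B = (5.0, 0.5, 0.5)              # population means of the random effects
    sd_b = (1.0, 0.1, 0.01)          # standard deviations implied by the diagonal precision matrix
    tau = 0.5                        # residual precision, so Var(u_it) = 1 / tau = 2
    rows = []
    for i in range(1, n + 1):
        b = [rng.gauss(B[q], sd_b[q]) for q in range(3)]
        for t in range(1, T + 1):
            x = rng.uniform(-1.0, 1.0)
            y = b[0] + b[1] * t + b[2] * x + rng.gauss(0.0, (1.0 / tau) ** 0.5)
            rows.append((i, t, x, y))    # subject, occasion, covariate, response
    return rows

data = simulate_growth()
mean_y = sum(r[3] for r in data) / len(data)   # close to 5 + 0.5 * 3 = 6.5
```

Note that the residual standard deviation (about 1.41) is large relative to the slope standard deviations 0.1 and 0.01, which helps explain why these small variance components are difficult to recover with only five occasions per subject.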
The parameters are first re-estimated (in R2OpenBUGS) under a conjugate
Wishart prior for the precision matrix $D^{-1} \sim W(I, 3)$, and with $\tau \sim \mathrm{Ga}(1, 0.001)$ and
$B_q \sim N(0, 100)$, $q = 1, \ldots, 3$. The last 4,000 of a 5,000 iteration two-chain run provides estimated
means (sd) for $\sigma_{bq} = D_{qq}^{0.5}$ of 1.13 (0.047), 0.22 (0.022), and 0.35 (0.06). The standard
deviations σb2 and σb3 are therefore overestimated as compared to the simulation parameters
0.1 and 0.01. The correlation r(b1,b2) is estimated as 0.23 with 95% interval (0.02,0.45).
A more diffuse prior on D−1 (e.g. a scale matrix with diagonal terms 0.1) provides lower
posterior means for (σb2,σb3), namely 0.14 and 0.16, but an increased posterior mean of
0.38 for r(b1,b2). Possible limitations of the Wishart prior are mentioned in the literature
(Alvarez et al., 2016).
The posterior means and standard deviations on σbq under the W(I,3) prior are used
to set gamma priors on the Cholesky terms $c_{qq}$ in a selection model (10.6) (Frühwirth-Schnatter
and Tüchler, 2008), but with precision downweighted 100 times, namely
Ga(5.7,5.1), Ga(1.1,4.7), and Ga(0.35,0.95). Heavier downweighting (e.g. a thousandfold) is
avoided, as it may lead to over-diffuse priors on $c_{qq}$. Inferences from such selection may be
sensitive to priors on the Cholesky elements, and a full analysis would consider several
choices of prior. As in Section 10.2.3, with composite terms $\Gamma_{qr} = c_{qr}\gamma_{qr}$ and $Z_{it} = (1, t, x_{it})$,
the linear predictor is

$$y_{it} = Z_{it}B + Z_{it}\Gamma\zeta_i + u_{it}$$

$$= B_1 + B_2 t + B_3 x_{it} + \zeta_{1i}(z_{1it}\Gamma_{11} + z_{2it}\Gamma_{21} + z_{3it}\Gamma_{31}) + \zeta_{2i}(z_{2it}\Gamma_{22} + z_{3it}\Gamma_{32}) + \zeta_{3i} z_{3it}\Gamma_{33} + u_{it},$$

with selection indicators, $\gamma_{qr} \sim \mathrm{Bern}(\pi_{qr})$, and with $\pi_{qr}$ unknown and assigned uniform
priors. The off-diagonal Cholesky terms $c_{qr}$ ($q > r$) are assigned N(0,1) priors.
The last 4,000 of a 5,000 iteration two-chain run (in R2OpenBUGS) with $D = CC'$ provides
estimated means (medians) for $\sigma_{bq} = D_{qq}^{0.5}$ of 1.12 (1.12), 0.033 (0), and 0.049 (0), with
the densities for σb2 and σb3 both having spikes at zero (consistent with zero random variation).
The retention probabilities γ22 and γ33 are respectively 0.30 and 0.38. It is not possible
to monitor the correlation matrix, but the covariance D12 has posterior mean 0.02,
with γ21 estimated at 0.27. The selection approach (with the priors adopted) provides
a more accurate estimate of the original σb3, and of the correlation structure, but also
essentially eliminates the random variability in the slopes on time.
Random effect selection is also undertaken with $D = \Lambda\Omega\Omega'\Lambda$, namely $C = \Lambda\Omega$. Priors
on λq are the same as for cqq under the decomposition D = CC′, while N(0,0.5) priors are
used for the elements of Ω, and Bernoulli priors with preset probability 0.5 for γqr. The
last 4,000 of a 5,000 iteration two-chain run provides posterior means (medians) for
$\sigma_{bq} = D_{qq}^{0.5}$ of 1.12 (1.12), 0.040 (0), and 0.021 (0). The densities for σb2 and σb3 again both have
spikes at zero. So, a more accurate estimate of the original σb3 is obtained than under
a Wishart prior, but random variability in the slopes on time (as summarised in σb2) is
understated.
Either Cholesky decomposition approach can also be applied without selection (in
effect setting γqr = 1). For example, with Ga(1,1) priors on λq and N(0,0.5) priors on the ωqr,
the second method gives posterior means for σb2 and σb3 of 0.09 and 0.11, with σb3 less
inflated (as compared to the true value) than under a Wishart prior.
Also considered is the rstan option whereby an LKJ(ν) prior is applied to the lower
Cholesky factor of the correlation matrix between (b1i,b2i,b3i) (Vaidyanathan, 2016;
Baldwin, 2014). Half Cauchy priors with scale 5 are assumed for σbq. With a shape param-
eter ν = 1 for the LKJ(ν) prior, a two-chain run of 5,000 iterations provides posterior
means for σb2 and σb3 of 0.09 and 0.15. So again the estimated σb3 is less inflated (as com-
pared to the true value) than under a Wishart prior. However, the estimated correlation
matrix shows r(b1,b2) to be positive with mean 0.54 and 95% limits (0.04,0.91). Setting
ν = 1.5 provides posterior means for σb2 and σb3 of 0.08 and 0.10, with r(b1,b2) having mean
0.48 and 95% limits (−0.09,0.87).
The Cholesky factor correlation matrix approach is also applied using a Dirichlet partition
to a total random variance parameter $\sigma_T^2$. This recognises the interdependence of the
sources of random variation. Thus, one has $\sigma_{bq}^2 = \phi_q\sigma_T^2$, where $(\phi_1, \phi_2, \phi_3) \sim \mathrm{Dir}(w_1, w_2, w_3)$,
with the w vector itself Dirichlet distributed with prior weights 1. The total variance $\sigma_T^2$
is taken as half Cauchy. This provides posterior means for σb2 and σb3 of 0.08 and 0.10,
with r(b1,b2) estimated at 0.49.
Finally, a direct separation strategy is applied (McElreath, 2015, p.393). Thus with D =
ΔRΔ, the correlation matrix R is assigned an LKJ(1.5) prior, and σbj (the diagonal ele-
ments in Δ) taken as lognormal with variance 1. This gives posterior means for σb2 and
σb3 of 0.09 and 0.13, with r(b1,b2) estimated at 0.55.
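
The separation D = ΔRΔ can be illustrated with a minimal sketch, which only assembles a covariance matrix from standard deviations and a correlation matrix. The function name and the input values are illustrative assumptions (σb1 = 1.1 and the zero correlations involving b3 are not taken from the fitted model).

```python
def separation_covariance(sds, corr):
    """Return D = Delta R Delta, with Delta = diag(sds) and R = corr."""
    q = len(sds)
    return [[sds[a] * corr[a][b] * sds[b] for b in range(q)] for a in range(q)]


# Illustrative inputs only (hypothetical, not estimates from the example).
sds = [1.1, 0.09, 0.13]
corr = [
    [1.0, 0.55, 0.0],
    [0.55, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
D = separation_covariance(sds, corr)
```

Priors are then placed separately on the elements of Δ and on R, rather than on D as a whole.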
One can say in conclusion that some of the options considered provide better perfor-
mance in certain regards, but that no approach satisfactorily reproduces all aspects of
the known covariance structure. It may be that a more extended longitudinal simula-
tion (e.g. with T = 10), would be less sensitive, as there is more information on each unit.

Example 10.2 Joint Regression for Mean and Covariance


This example follows Cepeda and Gamerman (2004) in applying the method of
Pourahmadi (1999) to T = 24 monthly height readings yit for n = 6 students. The antedependence
re-expression of the model yit ~ N(μt, Σ) is

$$y_{it} - \mu_t = \sum_{j=1}^{t-1} \phi_{tj}(y_{ij} - \mu_j) + u_{it},$$

with uit ~ N(0, ht). Taking H = diag(h1, …, hT), and

$$F = \begin{bmatrix}
1 & & & & \\
-\phi_{21} & 1 & & & \\
-\phi_{31} & -\phi_{32} & 1 & & \\
\vdots & \vdots & & \ddots & \\
-\phi_{T1} & -\phi_{T2} & \cdots & -\phi_{T,T-1} & 1
\end{bmatrix},$$

provides the covariance decomposition Var(ui) = H = FΣF′. The covariates {wtj,zt} used
for the covariance regression model are powers of t − j for wtj, and powers of t for zt. Then
the model takes

$$\mu_t = \beta_1 + \beta_2 t + \beta_3 t^2 + \beta_4 t^3,$$

$$\log(h_t) = \gamma_1 + \gamma_2 t + \gamma_3 t^2,$$

$$\phi_{tj} = \lambda_1 + \lambda_2 (t - j) + \lambda_3 (t - j)^2,$$

with the model for ϕtj here extending only to a quadratic term in (t − j) rather than a
quartic as in Cepeda and Gamerman (2004).
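
As a hedged sketch of how the two covariance regressions determine Σ, the code below builds the innovation variances and the antedependence coefficients from (γ, λ), and recovers Σ by solving the recursion for the centred responses. The function names are hypothetical, and the approach (forward substitution in plain Python) is one convenient choice rather than the method used in the cited papers.

```python
import math


def phi_tj(t, j, lam):
    """Antedependence coefficient as a quadratic in the lag t - j."""
    d = t - j
    return lam[0] + lam[1] * d + lam[2] * d * d


def antedependence_covariance(T, gamma, lam):
    """Return (h, Sigma) for times t = 1, ..., T."""
    # Innovation variances from the log-linear model in t.
    h = [math.exp(gamma[0] + gamma[1] * t + gamma[2] * t * t)
         for t in range(1, T + 1)]
    # The centred responses satisfy y = A u, with A unit lower triangular;
    # each row of A follows from the antedependence recursion.
    A = [[0.0] * T for _ in range(T)]
    for t in range(T):
        A[t][t] = 1.0
        for k in range(t):
            A[t][k] = sum(phi_tj(t + 1, j + 1, lam) * A[j][k]
                          for j in range(k, t))
    # Sigma = A H A'
    Sigma = [[sum(A[s][k] * h[k] * A[t][k] for k in range(T))
              for t in range(T)] for s in range(T)]
    return h, Sigma


h, Sigma = antedependence_covariance(4, (0.36, -0.189, 0.0085),
                                     (0.34, -0.053, -0.00181))
```

Because Σ is assembled from unconstrained regression parameters, it is positive definite by construction whenever the ht are positive.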
The last 9,000 iterations from a two-chain run of 10,000 iterations in R2OpenBUGS
provide posterior mean (sd) estimates for the parameters as follows: β1 = 94.2(0.39),
β2 = 0.82(0.11), β3 = −0.021(0.011), β4 = 4.0E − 4(3.4E−4), γ1 = 0.36(0.37), γ2 = −0.189(0.071),
γ3 = 0.0085(0.0029), λ1 = 0.34(0.39), λ2 = −0.053(0.011), and λ3 = −0.00181(5.3E−4). Predictions
from the model reproduce the observations satisfactorily, with 11 of the 144 data points
having predictive exceedance probabilities Pr(ynew,it > yit | y) under 0.05 or over 0.95.
Despite providing an insight into the temporal aspects of covariation, this model has
worse fit measures than a standard approach, with a Wishart prior on trivariate normal
random intercepts and slopes in a quadratic growth model. The LOO-IC for the latter is
532, compared to over 2,380 for the joint regression model.
Finally, a quadratic growth curve model is combined with AR1 dependence in the
errors, using the appropriate form of error covariance matrix represented by a function
(see section 10.3.1). This is implemented in rstan. We find a significant AR1 parameter
with posterior mean (sd) 0.93 (0.03). This model can also be coded from first principles.
The LOO-IC is reduced to 359.

10.3 Temporal Correlation and Autocorrelated Residuals


Correlation between regression errors at different times is obtained as a by-product of
other random effect schemes, not only from explicit time series priors. In particular, the
random intercept model in (10.4) illustrates how subject random effects induce temporal
correlation. It is important to control for such heterogeneity to avoid spurious “state depen-
dence,” namely dependence of the current outcome or probability on past outcomes (Chib
and Jeliazkov, 2006). Thus, for metric data, suppose

$$y_{it} = b_i + X_{it}\beta + u_{it},$$

where Xit includes an intercept, bi ~ N(0,D), and uit ~ N(0,σ2). Assuming uit and bi are inde-
pendent, the correlation between wit = uit + bi and wis = uis + bi at periods t and s is

$$\kappa = \mathrm{cov}(w_{it}, w_{is})/\mathrm{Var}(w_{it}) = D/(D + \sigma^2),$$

sometimes called the intraclass correlation. The random intercept model leads to the “compound
symmetry” form for the intra-subject covariance matrix Σi (Weiss, 2005, pp.246–250), with
diagonal terms Σitt = σ², and off-diagonal terms Σist = σ²κ, s ≠ t. Equivalently

$$\Sigma_i = \sigma^2[(1 - \kappa)I + \kappa J],$$

where J is a Ti × Ti matrix of ones.
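
A small numerical sketch may help fix ideas. It assumes that the common diagonal term of the compound symmetry form is read as the total variance Var(wit) = D + σ², so that the off-diagonal terms equal D; the function names and input values are illustrative.

```python
def intraclass_correlation(D, sigma2):
    """kappa = D / (D + sigma^2) for a random intercept model."""
    return D / (D + sigma2)


def compound_symmetry(T, total_var, kappa):
    """T x T matrix total_var * [(1 - kappa) I + kappa J]."""
    return [[total_var if s == t else total_var * kappa for t in range(T)]
            for s in range(T)]


D, sigma2 = 2.0, 1.0           # illustrative variances
kappa = intraclass_correlation(D, sigma2)
Sigma = compound_symmetry(3, D + sigma2, kappa)
```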


A factor analytic form of the random intercept model (Weiss, 2005, p.269) includes period
specific scale parameters Dt,

$$y_{it} = D_t^{0.5} b_i + X_{it}\beta + u_{it},$$

with bi ~ N(0,1) having a known variance to provide identification, so that

$$\kappa_t = D_t/(D_t + \sigma^2),$$

and the correlation between wit = Dt^0.5 bi + uit at times t and s is (κtκs)^0.5. The corresponding
RIAS model has loadings D1t^0.5 and D2t^0.5 and standard normal random effects b1i and b2i,
so that

$$y_{it} = D_{1t}^{0.5} b_{1i} + D_{2t}^{0.5} b_{2i} t + X_{it}\beta + u_{it}.$$

For discrete data, the temporal correlation under random intercept and RIAS
models may be confined to positive values only. Thus, for Poisson counts yit, with
log[E(yit|bi)] = log(μit) = Xitβ + bi, and git = exp(Xitβ), one has under conditional independence
that

$$\begin{aligned}
\mathrm{cov}(y_{it}, y_{is}) &= E[\mathrm{cov}(y_{it}, y_{is} \mid b_i)] + \mathrm{cov}[E(y_{it} \mid b_i), E(y_{is} \mid b_i)] \\
&= \mathrm{cov}(e^{b_i} g_{it}, e^{b_i} g_{is}) = g_{it} g_{is}\,\mathrm{var}(e^{b_i}),
\end{aligned}$$

while

$$\begin{aligned}
\mathrm{var}(y_{it}) &= E[\mathrm{var}(y_{it} \mid b_i)] + \mathrm{var}[E(y_{it} \mid b_i)] \\
&= E[e^{b_i} g_{it}] + \mathrm{var}[e^{b_i} g_{it}] = \mu_{it} + g_{it}^2\,\mathrm{var}(e^{b_i}),
\end{aligned}$$

with correlation then necessarily positive.
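
The positivity of this correlation can be checked numerically. The sketch below assumes a normal random intercept, bi ~ N(0, D), so that exp(bi) is lognormal with mean exp(D/2) and variance (exp(D) − 1)exp(D); this distributional choice and the function name are assumptions made for illustration.

```python
import math


def poisson_random_intercept_correlation(g_t, g_s, D):
    """Marginal correlation between y_it and y_is when b_i ~ N(0, D)."""
    mean_eb = math.exp(D / 2.0)                    # E[exp(b_i)]
    var_eb = (math.exp(D) - 1.0) * math.exp(D)     # var[exp(b_i)]
    cov = g_t * g_s * var_eb
    var_t = g_t * mean_eb + g_t ** 2 * var_eb
    var_s = g_s * mean_eb + g_s ** 2 * var_eb
    return cov / math.sqrt(var_t * var_s)


corr_small = poisson_random_intercept_correlation(2.0, 3.0, 0.01)
corr_large = poisson_random_intercept_correlation(2.0, 3.0, 0.5)
```

The correlation shrinks towards zero as D falls, but cannot become negative because every term in the covariance is non-negative.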

10.3.1 Explicit Temporal Schemes for Errors


When the residuals from (10.1)–(10.2) show temporal correlation, autocorrelated residu-
als may be used instead of, or in addition to, the white noise errors uit. Let these take the
generic form

$$(\varepsilon_{i1}, \ldots, \varepsilon_{iT_i}) \sim N(0, \Sigma_i),$$

where Σi is a unit level covariance matrix of dimension Ti × Ti. Commonly adopted schemes
for such residuals include low order random walks (e.g. first order or RW1 priors), or low
order stationary schemes (typically AR1 or MA1). For example, Xu et al. (2007), Oh and Lim
(2001), and Ibrahim et al. (2000) adopt stationary AR1 errors in models for longitudinal
count data, with yit ~ Po(μit),

$$\log(\mu_{it}) = X_{it}\beta + \varepsilon_{it},$$

$$\varepsilon_{it} = \rho\varepsilon_{i,t-1} + u_{it},$$

where uit ~ N(0, σu²) are iid, and |ρ| < 1. For metric data, a stationary AR1 error scheme with

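$$y_{it} = X_{it}\beta + \varepsilon_{it},$$

$$\varepsilon_{it} = \rho\varepsilon_{i,t-1} + u_{it},$$

where |ρ| < 1, and uit ~ N(0, σu²), leads to error covariance matrix Σi with elements

$$\Sigma_{ist} = \mathrm{var}(\varepsilon_{it})\rho^{|s-t|} = \frac{\sigma_u^2}{1 - \rho^2}\rho^{|s-t|},$$

with

$$\Sigma_i = \frac{\sigma_u^2}{1 - \rho^2}
\begin{pmatrix}
1 & \rho & \rho^2 & \cdots & \rho^{T_i-1} \\
\rho & 1 & \rho & \cdots & \rho^{T_i-2} \\
\vdots & \vdots & \ddots & & \vdots \\
\rho^{T_i-2} & \cdots & \rho & 1 & \rho \\
\rho^{T_i-1} & \cdots & \rho^2 & \rho & 1
\end{pmatrix}.$$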
Assuming homogeneous parameters across subjects so that Σi = Σ, and that subjects are
independent, the full population covariance matrix is

$$F = \frac{\sigma_u^2}{1 - \rho^2} I_n \otimes \Sigma,$$

where In is an identity matrix of order n. With εit = yit − Xitβ, the marginal log-likelihood for
parameters χ = (ρ, σu², β) is then of the form L(χ|y) = const − 0.5[log|F| + ε′F⁻¹ε].
A stationary first-order moving average or MA1 scheme, namely

$$y_{it} = X_{it}\beta + u_{it} + \theta u_{i,t-1},$$

and with |θ| < 1, leads to a particular form of a Toeplitz covariance matrix (Weiss, 2005,
p.267). Thus set

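$$\varphi^2 = \mathrm{var}(u_{it} + \theta u_{i,t-1}) = \sigma^2(1 + \theta^2),$$

$$\gamma = \theta/(1 + \theta^2),$$

then

$$\Sigma_i = \varphi^2
\begin{pmatrix}
1 & \gamma & 0 & 0 & \cdots \\
\gamma & 1 & \gamma & 0 & \cdots \\
\cdots & \cdots & \cdots & \cdots & \cdots \\
\cdots & 0 & \gamma & 1 & \gamma \\
\cdots & \cdots & 0 & \gamma & 1
\end{pmatrix}
= \sigma^2
\begin{pmatrix}
1+\theta^2 & \theta & 0 & 0 & \cdots \\
\theta & 1+\theta^2 & \theta & 0 & \cdots \\
\cdots & \cdots & \cdots & \cdots & \cdots \\
\cdots & 0 & \theta & 1+\theta^2 & \theta \\
\cdots & \cdots & 0 & \theta & 1+\theta^2
\end{pmatrix}.$$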
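
The banded structure of the MA1 covariance can be made concrete with a short sketch. The function name and parameter values are illustrative; the code simply fills the diagonal with σ²(1 + θ²) and the first off-diagonals with σ²θ.

```python
def ma1_covariance(T, theta, sigma2):
    """T x T covariance matrix of u_t + theta * u_{t-1}, var(u_t) = sigma2."""
    Sigma = [[0.0] * T for _ in range(T)]
    for t in range(T):
        Sigma[t][t] = sigma2 * (1.0 + theta ** 2)
        if t + 1 < T:
            # Only adjacent periods share an innovation, so only lag 1 is non-zero.
            Sigma[t][t + 1] = sigma2 * theta
            Sigma[t + 1][t] = sigma2 * theta
    return Sigma


theta, sigma2 = 0.4, 2.0       # illustrative values
Sigma = ma1_covariance(5, theta, sigma2)
lag1_corr = Sigma[0][1] / Sigma[0][0]   # equals theta / (1 + theta^2)
```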

Stationary or random walk models for errors can be extended in various ways. Thus, for
unequally spaced data at points {ai1, ai2, …, aiT}, the AR1 model becomes

$$\varepsilon_{it} = \rho^{|a_{it} - a_{i,t-1}|}\varepsilon_{i,t-1} + u_{it},$$

with covariance between errors given by

$$\mathrm{cov}(\varepsilon_{it}, \varepsilon_{is}) = \rho^{|a_{it} - a_{is}|}\sigma_u^2.$$

Another option when Ti is relatively large is to use subject-varying autocorrelation parameters,
possibly independently distributed ρi ~ U(−1, 1) (Ryu et al., 2007) or hierarchically specified;
see Example 10.3.
The use of autocorrelated or random walk effects raises issues about how to specify the
initial conditions (initial random effects) such as εi1 under an AR1 or RW1 prior on εit, and
{εi1,εi2} under an AR2 or RW2 prior. For stationary autoregressive errors, such as the AR1
prior

$$\varepsilon_{it} = \rho\varepsilon_{i,t-1} + u_{it},$$

the variances of εit and uit are analytically linked, so that the initial conditions are necessarily
specified as part of the prior. So, for stationary AR1 dependence in εit and equally
spaced data, one has

$$\varepsilon_{i1} = u_{i1}/(1 - \rho^2)^{0.5},$$

and

$$\mathrm{var}(\varepsilon_{i1}) = \sigma_u^2/(1 - \rho^2),$$

and the joint distribution of the εit is obtained (Xu et al., 2007) as

$$p(\varepsilon_{i1}) \prod_{t=2}^{T_i} p(\varepsilon_{it} \mid \varepsilon_{i,t-1}),$$

where

$$p(\varepsilon_{it} \mid \varepsilon_{i,t-1}) = \frac{1}{\sigma_u (2\pi)^{0.5}} \exp(-0.5[\varepsilon_{it} - \rho\varepsilon_{i,t-1}]^2/\sigma_u^2).$$
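
The factorised form of the stationary AR1 prior translates directly into a log-density calculation. The following is a minimal sketch with hypothetical function names: the first effect uses the stationary variance, and each later effect is conditioned on its predecessor.

```python
import math


def normal_log_pdf(x, mean, var):
    """Log density of N(mean, var) evaluated at x."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def ar1_log_density(eps, rho, sigma2_u):
    """Joint log density of a stationary AR1 sequence eps[0], ..., eps[T-1]."""
    # Initial condition: stationary variance sigma2_u / (1 - rho^2).
    total = normal_log_pdf(eps[0], 0.0, sigma2_u / (1.0 - rho ** 2))
    # Remaining terms: conditional normal densities given the previous effect.
    for t in range(1, len(eps)):
        total += normal_log_pdf(eps[t], rho * eps[t - 1], sigma2_u)
    return total


log_p = ar1_log_density([0.3, -0.8, 0.1], rho=0.6, sigma2_u=1.5)
```

For two periods this reproduces the bivariate normal density with common variance σu²/(1 − ρ²) and correlation ρ, which offers a simple check on an implementation.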
In non-stationary and random walk models with |ρ| ≥ 1, initial conditions are usually
specified by diffuse fixed effect priors, though Chib and Jeliazkov (2006) interlink the vari-
ance of the initial conditions with that of the main sequence of effects to provide a proper
joint prior on {εi1, …, εiTi}. One may also link initial conditions εi1 and subject heterogeneity,
as in

$$b_i \sim N(\psi\varepsilon_{i1}, \sigma_b^2),$$

where ψ can be positive or negative (Chamberlain and Hirano, 1999). This amounts to
assuming a bivariate density for bi and εi1.

Example 10.3 Capital Asset Pricing Model


This example considers residual autocorrelation and associated model checking in an
application of the capital asset pricing model, considering links between the perfor-
mance of a particular security and market performance in general (Frees, 2004). The
particular application is to n = 90 insurance firms observed over T = 60 months (January
1995 to December 1999). The response yit is the security return for firm i in excess of the
risk-free rate, and the predictor xt is the market return in excess of the risk-free rate.
To allow for varying impacts of xt on yit, a baseline model (model 1) is the RIAS
specification

$$y_{it} = b_{1i} + b_{2i}x_t + u_{it},$$

$$u_{it} \sim N(0, 1/\tau_u).$$

The coefficients b2i measure how far the return of security i is attributable to market factors.
A bivariate normal prior is assumed for {b1i, b2i}, with mean (B1,B2), and covariance
D. A Wishart W(I,2) prior for D−1 is assumed, with the prior mean for the covariance
matrix D then being the identity matrix.
The last 4,000 of a 5,000 iteration two-chain run in R2OpenBUGS provide a signifi-
cant effect for xt with B2 having posterior mean (95% credible interval) of 0.72 (0.63,0.81).
The mixed predictive procedure (Marshall and Spiegelhalter) shows a satisfactory
fit, with around 8% of the 5,400 observations having predictive exceedance probabilities
Pr(yrep.mix,it > yit | y) over 0.95 or under 0.05. However, to assess whether first-order
autoregressive dependence might be present, define realised residuals eit = yit − b1i − b2i xt.
Then a firm-specific measure of AR1 error dependence is

$$r_{1i} = \sum_{t=2}^{T} e_{it}e_{i,t-1} \Big/ \sum_{t=1}^{T} e_{it}^2.$$

Thus 58 of the 90 firms have probabilities below 0.05 that r1i > 0 , with the sample-wide
AR1 dependence parameter (the mean of the r1i ) estimated at −0.097 with 95% CRI
(−0.103,−0.091).
Another evaluation involves a posterior predictive check based on an average of
Durbin-Watson (DW) statistics taken over all 90 firms. Thus at each iteration r, a DW
statistic is derived for each firm, namely

$$DW_i^{(r)} = \sum_{t=2}^{T} \left(e_{it}^{(r)} - e_{i,t-1}^{(r)}\right)^2 \Big/ \sum_{t=1}^{T} \left(e_{it}^{(r)}\right)^2.$$

A summary statistic for autocorrelation is then the average over firms $\overline{DW}^{(r)} = \sum_i DW_i^{(r)}/n$,
which is obtained for actual data, $\overline{DW}_{obs}^{(r)}$, and for replicate data, $\overline{DW}_{new}^{(r)}$ (Gelman et al.,
1996). The resulting posterior probability $\Pr(\overline{DW}_{obs} \ge \overline{DW}_{new} \mid y)$ is 1, indicating
inadequate fit.
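
Both residual diagnostics are simple functions of a firm's residual series at a given iteration. The sketch below uses hypothetical function names and a toy residual series, and shows the per-firm calculations together with the average over firms.

```python
def lag1_autocorrelation(e):
    """Sum of e_t * e_{t-1} over t >= 2, divided by the sum of squares."""
    num = sum(e[t] * e[t - 1] for t in range(1, len(e)))
    den = sum(x * x for x in e)
    return num / den


def durbin_watson(e):
    """Sum of squared first differences divided by the sum of squares."""
    num = sum((e[t] - e[t - 1]) ** 2 for t in range(1, len(e)))
    den = sum(x * x for x in e)
    return num / den


def mean_durbin_watson(residuals_by_firm):
    """Average of the firm-level DW statistics."""
    stats = [durbin_watson(e) for e in residuals_by_firm]
    return sum(stats) / len(stats)


alternating = [1.0, -1.0, 1.0, -1.0]   # toy series with negative dependence
r1 = lag1_autocorrelation(alternating)
dw = durbin_watson(alternating)
```

In a posterior predictive check, the same functions would be applied at each iteration to residuals from the observed data and from replicate data, and the two averages compared.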
Accordingly, a revised model (model 2) includes a stationary AR1 error, so that

$$y_{it} = b_{1i} + b_{2i}x_t + \varepsilon_{it},$$

$$\varepsilon_{it} = \rho\varepsilon_{i,t-1} + u_{it},$$

and a stationary prior, ρ ~ U(−1, 1). A 5,000 iteration two-chain run (with the last 4,000
for inference) gives a significant ρ estimate, with posterior mean (sd) −0.088 (0.014).

The LOO-IC for this model is 39,696, compared to 39,733 for model 1. However, checking
based on firm-specific r1i shows 39 firms with probability under 0.05 that r1i > 0 , and 30
firms with probability over 0.95 that r1i > 0 .
An extension to unit-specific AR1 parameters is therefore adopted. Thus

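$$y_{it} = b_{1i} + b_{2i}x_t + \varepsilon_{it},$$

$$\varepsilon_{it} = \rho_i\varepsilon_{i,t-1} + u_{it},$$

with the prior on the firm-specific ρi specified indirectly in a hierarchical prior:

$$d_i \sim \mathrm{Beta}(a_d, b_d), \quad \rho_i = 2d_i - 1, \quad a_d \sim E(1), \quad b_d \sim E(1).$$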

Priors are as above on (b1i,b2i), and D. Estimation of this model shows that the probabilities
that r1i > 0 are now considerably less concentrated in the tails, with only one firm now having
a probability under 0.05 that r1i > 0, and no firms with probability over 0.95 that
r1i > 0. However, possibly illustrating that improved model checks are not necessarily
associated with improved overall fit, the LOO-IC rises to 39,737, as the complexity index
(p_loo) rises to 174.

10.4 Longitudinal Categorical Choice Data


Repeated categorical data involving ordered or unordered options or choices k = 1, … , K by
subjects i = 1, … , n for repetitions t = 1, … , Ti are often found in brand choice, labour mar-
ket, political science, or clinical applications (Rossi et al., 2005; Pettitt et al., 2006; Terzi and
Cengiz, 2013). These may be expressed via binary indicators dikt = 1 if category or choice k
applies (dikt = 0 for remaining categories), or by categorical responses yit ∈ (1, …, K). Clinical
and pharmaceutical applications commonly involve ordinal rating scales (e.g. Zayeri et al.,
2005; Qiu et al. 2002; Agresti and Natarajan, 2001). Particular issues raised by such data
include the possibility that permanent subject effects vary between choices (or more gener-
ally between categories), and that predictor effects may vary over one or more of choices,
as well as over subjects or times. If lagged effects of the dependent variable are included
(Section 10.5), these may include both own category and cross-category lags, leading to
categorical transition models (Fokianos and Kedem, 2003).
Chintagunta et al. (2001) consider repeated brand choice data and allow subject heterogeneity
in relation to attributes of the choices (e.g. variable consumer responsiveness
to brand prices), as well as randomly varying subject-choice intercepts bik. A Bayesian
perspective, including optimal MCMC sampling schemes, on consumer heterogeneity
in multinomial longitudinal data for purchase choices is provided by Rossi et al. (2005,
Chapter 5). For identifiability, choice or category specific parameters must be set to a fixed
value (usually zero) in a reference category. For example, the probability that a consumer

chooses brand k in period t might be modelled using a multinomial logit (MNL) regres-
sion, with choice K as a reference,

$$\pi_{ikt} = \Pr(y_{it} = k) = \phi_{ikt} \Big/ \sum_{k=1}^{K} \phi_{ikt},$$

$$\log(\phi_{ikt}) = \beta_{0k} + b_{ik} + P_{kt}\beta_k + A_{kt}\gamma, \quad k = 1, \ldots, K - 1,$$

$$\log(\phi_{iKt}) = A_{Kt}\gamma,$$

where β0k are intercept terms, Pkt and Akt are brand-time specific characteristics (e.g. price
and advertising spend) varying in whether associated regression parameters are choice
specific, and bik are random consumer-brand taste effects. These are typically taken as
multivariate normal of dimension K − 1, with biK = 0 for identifiability (Malchow-Moller and
Svarer, 2003).
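
The mapping from the log-linear predictors to choice probabilities can be sketched as follows. The function names and the numerical inputs are hypothetical; subtracting the maximum before exponentiating is a standard numerical safeguard that leaves the probabilities unchanged.

```python
import math


def mnl_probabilities(log_phi):
    """Choice probabilities from log(phi_k), k = 1, ..., K."""
    m = max(log_phi)
    weights = [math.exp(v - m) for v in log_phi]
    total = sum(weights)
    return [w / total for w in weights]


def brand_log_phi(beta0, b_i, price, beta_price, advert, gamma):
    """log(phi_k) with the last brand as reference (no intercept, taste or price term)."""
    K = len(price)
    out = [beta0[k] + b_i[k] + price[k] * beta_price[k] + advert[k] * gamma
           for k in range(K - 1)]
    out.append(advert[K - 1] * gamma)
    return out


# Hypothetical inputs for K = 3 brands in one period.
log_phi = brand_log_phi(beta0=[0.2, -0.1], b_i=[0.5, 0.0],
                        price=[1.0, 1.2, 0.9], beta_price=[-0.8, -0.6],
                        advert=[1, 0, 0], gamma=0.4)
probs = mnl_probabilities(log_phi)
```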
Consumer variation in response to prices or attributes would involve making the βk and γ
coefficients specific to each consumer, and defining hyperparameters for the densities of βki
and γi. For Pkt of dimension R, Rossi et al. (2005, p.136) propose a conjugate normal hierarchical
prior structure for βi = (β1i, …, βRi), with mean ZiΔ, where Zi are consumer attributes, and
with variance Vβ of dimension dim(βi) = (R − 1)R. Vβ is assigned an inverse Wishart prior having
expectation I and dim(βi) + 3 degrees of freedom. They demonstrate the improved
MCMC convergence for βi obtained by using a random walk Metropolis with increments that
have covariance s²(Hi + (Vβ^(r))⁻¹)⁻¹, where Hi is the Hessian of a composite likelihood based on
multiplying the MNL subject specific likelihood by the pooled (all subject) likelihood raised
to power ri = Ti/cN, and c > 1 and s = 2.93/sqrt[dim(βi)] are tuning constants.
Consider categorical longitudinal data with subject level predictors only, namely Xit and
Zit of dimension P and Q, and category-specific fixed regression effects, namely

$$\log(\phi_{ikt}) = \beta_{0k} + X_{it}\beta_k + Z_{it}b_{ik},$$

$$\log(\phi_{iKt}) = 0,$$

where Zitbik = z1it bik1 + z2it bik2 + … + zQit bikQ. Assuming the Xit and Zit are non-overlapping,
one may adopt Q independent sets of subject-category effects each of dimension K − 1, one
for each predictor zqit,

$$(b_{i1q}, \ldots, b_{i,K-1,q}) \sim N_{K-1}(B_q, D_q).$$

Alternatively, the covariance matrix of the random effects may be of dimension (K − 1)Q with
the bik correlated over both categories and predictors. In the case where Zit is a subset of Xit,
the bikq are zero mean random effects, and covariance matrices may be choice specific Dk of
dimension Q, so that

$$(b_{ik1}, \ldots, b_{ikQ}) \sim N_Q(0, D_k).$$

This permits a latent variable interpretation based on a Cholesky decomposition of Dk and
standardised random effects ζik, namely

$$\log(\phi_{ikt}) = \beta_{0k} + X_{it}\beta_k + Z_{it}C_k\zeta_{ik},$$

where CkCk′ = Dk.

Examples of repeated ordinal observations are provided by labour market perception
data (Spiess, 2006), changing attitudes to divorce (Berrington et al., 2005), and repeated
ordinal scores in horticultural research (Parsons et al., 2006). Suppose responses yit have K
ordered categories, with corresponding latent responses y*it specified by thresholds, possibly
time-varying κkt, or subject varying κik. For time-varying thresholds

$$y_{it} = k \ \text{ if } \ \kappa_{k-1,t} < y_{it}^* \le \kappa_{kt},$$

with predictor effects also possibly varying over (at least one of) categories, subjects or
times. For example, Spiess (2006) considers predictor effects varying over times, as in

$$y_{it}^* = X_{it}\beta_t + \varepsilon_{ikt},$$

where P(εikt) is usually a normal or logistic distribution. These distributions are very simi-
lar though the logistic places more probability in the tails (Hedeker, 2003). So

$$\Pr(y_{it}^* \le \kappa_{kt}) = \Pr(X_{it}\beta_t + \varepsilon_{ikt} \le \kappa_{kt}) = \Pr(\varepsilon_{ikt} \le \kappa_{kt} - X_{it}\beta_t).$$

Depending on the form for P(εikt), one has

$$\Pr(y_{it}^* \le \kappa_{kt}) = \Phi(\kappa_{kt} - X_{it}\beta_t),$$

or

$$\Pr(y_{it}^* \le \kappa_{kt}) = 1/(1 + \exp(-[\kappa_{kt} - X_{it}\beta_t])).$$

Let gikt = Pr(y*it ≤ κkt), then

$$\Pr(y_{it} = k) = \Pr(\kappa_{k-1,t} < y_{it}^* \le \kappa_{kt}) = g_{ikt} - g_{i,k-1,t}.$$

An equivalent specification of this model involves sets of K − 1 binary variables for each
subject-time pairing, namely dikt = 1 if yit ≤ k, and dikt = 0 otherwise. Then for ε logistic,

$$\Pr(y_{it} \le k) = \Pr(d_{ikt} = 1) = 1/(1 + \exp(-\kappa_{kt} + X_{it}\beta_t)).$$
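
The step from cumulative to category probabilities under a logistic link can be sketched in a few lines. The function names and the threshold values are illustrative; the K − 1 thresholds are assumed to be supplied in increasing order.

```python
import math


def logistic_cdf(x):
    return 1.0 / (1.0 + math.exp(-x))


def ordinal_probabilities(thresholds, eta):
    """Category probabilities for K = len(thresholds) + 1 ordered categories.

    thresholds: increasing cut points kappa_1 < ... < kappa_{K-1}
    eta: linear predictor (X_it beta_t in the text)
    """
    # Cumulative probabilities Pr(y <= k); the final category has value 1.
    cumulative = [logistic_cdf(k - eta) for k in thresholds] + [1.0]
    # Successive differences give Pr(y = k).
    probs = [cumulative[0]]
    probs += [cumulative[k] - cumulative[k - 1]
              for k in range(1, len(cumulative))]
    return probs


probs = ordinal_probabilities([-1.0, 0.0, 1.5], eta=0.3)
```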

Example 10.4 Yoghurt Purchases


Data on yoghurt brand choice from Chen and Kuo (2001) exemplify random effects to
represent household heterogeneity in consumer behaviour, specifically longitudinal
analysis of unordered choices data. The yoghurt choice data relate to repeated purchases
by i = 1, …, n households (n = 100) between K = 4 brands, with widely varying
numbers of repetitions Ti for each household (between 4 and 185). The total of observations
is $N = \sum_{i=1}^{n} T_i = 2412$. Known influences on brand choice are brand and time

specific, namely features Akt (=1 if the brand k was subject to an advertising feature at
the time t of purchase, =0 otherwise), and shelf price Pkt.

A baseline fixed effects model (model 1) has the form

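$$\pi_{ikt} = \Pr(d_{ikt} = 1) = \phi_{ikt} \Big/ \sum_{k=1}^{K} \phi_{ikt},$$

$$\log(\phi_{ikt}) = \beta_{0k} + A_{kt}\gamma_1 + P_{kt}\gamma_2, \quad k = 1, \ldots, K - 1,$$

$$\log(\phi_{iKt}) = A_{Kt}\gamma_1 + P_{Kt}\gamma_2.$$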

As mentioned by Chen and Kuo (2001), observations from the same household are usu-
ally correlated in brand choice applications, and not accounting for such dependence
may produce biased estimates. A random intercepts model (model 2) accordingly allows
for heterogeneity at household-choice level, though retaining homogeneous impacts for
brand attributes. This has the form

$$\log(\phi_{ikt}) = A_{kt}\gamma_1 + b_{ik} + P_{kt}\gamma_2, \quad k = 1, \ldots, K - 1,$$

$$\log(\phi_{iKt}) = A_{Kt}\gamma_1 + P_{Kt}\gamma_2,$$

$$(b_{i1}, \ldots, b_{i,K-1}) \sim N(B, D),$$

where the vector B denotes the average category intercepts (β01, …, β0,K−1). A Wishart
prior for the precision matrix, D⁻¹ ~ W(I, 3), is assumed.
Inferences (from jagsUI) show significant fixed effects, (γ1,γ2), under model 1, for both
feature and price, with posterior means (sd) of 0.49 (0.12) and −36.7 (2.3). This model
has a LOO-IC of 5,324, whereas the trivariate normal random intercept model has a
LOO-IC of 2,181. Estimates of the correlation matrix under model 2 show brand 1 and
3 choices to be positively correlated, with r(bi1,bi3) = 0.44. The impacts for feature, and
to a lesser degree, price, are enhanced, though with reduced precision. Thus
posterior means (sd) of (γ1,γ2) are now 0.86 (0.18) and −44.8 (3.8). While model 2 yields a
pronounced gain in fit, it has not controlled for consumer variation in price or advertis-
ing responsiveness, which would involve making the γ1 and γ2 coefficients household
specific.

Example 10.5 NIMH Schizophrenic Collaborative Study: Ordinal Symptom Score


This example illustrates model checks for longitudinal ordinal outcomes, and involves a
study evaluating four drug treatments to alleviate symptoms in schizophrenia subjects:
chlorpromazine, fluphenazine, thioridazine, and a placebo (Hedeker and Gibbons,
2006). Similar effects were obtained for the three anti-psychotic drugs, and so here the
treatment is reduced to a binary comparison of any drug vs the placebo. Symptom sever-
ity scores yit are observed for n = 324 subjects on three occasions after the first reading
(at week 0), which is coincident with treatment commencing, namely at weeks 1, 3, and
6. The score is ordinal with K = 7 levels, namely 1 = normal, 2 = borderline, 3 = mildly ill,
4 = moderately ill, 5 = markedly ill, 6 = severely ill, and 7 = extremely ill. Random base-
line intercepts are assumed, together with random slopes on a time variable Zt, obtained
as the square root of weeks. Fixed effect predictors Xit are baseline treatment status, a
treatment by time interaction, and the patient’s sex.
The responses are ordinal subject-time pairs yit, with corresponding binary indicators
ditj = 1 if yit = j, ditj = 0 otherwise. Then with dit = (dit1, …, ditK), and P denoting the logistic
distribution function, one has

$$d_{it} \sim \mathrm{Mult}(1, \pi_{it}),$$

$$\pi_{it} = (\pi_{it1}, \ldots, \pi_{itK}),$$

$$\pi_{itj} = \Pr(y_{it} = j) = P(\kappa_j - \mu_{it}) - P(\kappa_{j-1} - \mu_{it}) = g_{itj} - g_{it,j-1},$$

where

$$g_{itj} = \Pr(y_{it} \le j) = P(\kappa_j - \mu_{it}), \quad j = 1, \ldots, K - 1$$

are cumulative probabilities over ranked categories, gitj = πit1 + … + πitj.


The regression term is a random intercepts and slopes specification,

$$\mu_{it} = b_{1i} + b_{2i}Z_t + \beta_1 Tr_i + \beta_2 Z_t Tr_i + \beta_3 Gend_i,$$

$$(b_{1i}, b_{2i}) \sim N_2(B, D),$$

and B = (B1,B2) contains an overall intercept and time slope. Since the overall intercept
is an unknown, identification of the K − 1 = 6 thresholds requires setting κ1 to zero.
The remaining five threshold parameters are subject to monotonicity constraints:
κk = κk−1 + δk, where δk ~ Ga(1, 1).
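
The monotonicity constraint amounts to building thresholds as cumulative sums of positive increments. A minimal sketch of a single prior draw follows, with a hypothetical function name; `random.gammavariate` takes shape and scale arguments, and a rate of 1 corresponds to a scale of 1.

```python
import random


def draw_thresholds(K, rng):
    """One prior draw of the K - 1 ordered thresholds, with kappa_1 fixed at 0."""
    kappa = [0.0]
    for _ in range(K - 2):
        # Positive Ga(1, 1) increment keeps the sequence strictly increasing.
        kappa.append(kappa[-1] + rng.gammavariate(1.0, 1.0))
    return kappa


rng = random.Random(1)       # fixed seed so the draw is reproducible
kappa = draw_thresholds(7, rng)
```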
The model is first fitted in rjags, with inferences from a two-chain run of 10,000
iterations. The coefficient β1 (a measure of differences in symptom level between treat-
ment options at baseline) is not significant, but there is a steeper decline in ill-health
for treated subjects. Thus, the coefficient β2 has a posterior mean (95%CRI) of −0.69
(−1.05,−0.32). Posterior means for σb1 and σb2 are 1.71 and 0.86 respectively, with r(b1,b2)
estimated at −0.47, showing steeper decline effects for higher initial symptom levels. A
posterior predictive check based on comparing total likelihoods for actual and replicate
data gave probability 0.54, indicating a satisfactory model. Diagnostic tests such as Q-Q
plots and Jarque–Bera tests support normality of the permanent effects (b1i,b2i).
To assess sensitivity to alternative priors regarding random effect covariance, the
above model is also fitted in rstan using an LKJ(1.5) prior applied to the lower Cholesky
factor of the correlation matrix between b1i and b2i. The code for this model involves six
threshold parameters, with B1 set at 0. From a run of 2,000 iterations, estimates for β1 and
β2 are little changed, while posterior means for (σb1, σb2) are 1.80 and 0.92 respectively,
with r(b1,b2) estimated at −0.49.

10.5 Observation Driven Autocorrelation: Dynamic Longitudinal Models


Differences in behaviour or event proneness between individuals (e.g. in econometric or
health applications) may operate through an autoregression in the observations, latent or
observed. Longitudinal models including lagged observations are often termed “dynamic
longitudinal models,” whereas static longitudinal models do not include lagged response
values (e.g. Nerlove, 2002; Liu et al., 2017). A canonical dynamic model for metric data
involves lagged values of the dependent variable with the overall error combining a time-
invariant individual effect and observation level random noise (Bond, 2002).
Thus, with a first order lag in the response, one has

$$y_{it} = \phi y_{i,t-1} + X_{it}\beta + b_i + u_{it}, \quad t = 2, \ldots, T$$



where the uit ~ N(0,σ2) are independent of each other, and under standard assumptions are
also uncorrelated with the initial observations yi1 and with permanent subject effects bi.
If Xit contains a constant term, then the bi have mean zero, and bi ~ N(0,D). Allowing for
subject level variation in a Q length vector of predictors Zit, as well as for first-order lagged
response, leads to

$$y_{it} = \phi y_{i,t-1} + X_{it}\beta + Z_{it}b_i + u_{it}.$$

Assuming a stationary process with |ϕ| < 1, one possible model for yi1 is

$$y_{i1} = \frac{Z_{it}b_i}{1 - \phi} + \frac{X_{it}\beta}{1 - \phi} + u_{i1},$$

with ui1 ~ N(0, σ²/(1 − ϕ²)). A simplifying approach, more feasible for large T, is to condition
on the first observation in a model involving a first-order lag in y, so that y1 is non-stochas-
tic (Bauwens et al. 1999, p.135). Geweke and Keane (2000) and Lancaster (2002) consider
Bayesian approaches to the dynamic linear longitudinal model, in which the model for
period 1 is not necessarily linked to those for subsequent periods in a way consistent with
stationarity.
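
The stationary initial condition can be motivated by iterating the first two moments of the lagged response model. The sketch below assumes, for illustration only, that the subject's regression level Xitβ + Zitbi is constant over time; the function names are hypothetical.

```python
def stationary_moments(phi, level, sigma2):
    """Stationary mean and variance of y_t = phi * y_{t-1} + level + u_t."""
    return level / (1.0 - phi), sigma2 / (1.0 - phi ** 2)


def iterate_moments(phi, level, sigma2, mean0, var0, steps):
    """Propagate the mean and variance of y_t forward from arbitrary starting values."""
    m, v = mean0, var0
    for _ in range(steps):
        m = phi * m + level
        v = phi * phi * v + sigma2
    return m, v


phi, level, sigma2 = 0.5, 2.0, 1.0     # illustrative values with |phi| < 1
m_stat, v_stat = stationary_moments(phi, level, sigma2)
m_iter, v_iter = iterate_moments(phi, level, sigma2, 0.0, 0.0, 200)
```

With |ϕ| < 1 the iterated moments converge to the stationary values, which is why a first period model with mean level/(1 − ϕ) and variance σ²/(1 − ϕ²) is consistent with the later periods.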
Maximum likelihood analysis of dynamic longitudinal models is subject to an initial
conditions problem if in fact there is correlation between the permanent subject effects bi
and the initial observations (Hsiao, 1986). In case of such correlation, possible options are a
joint random prior (e.g. bivariate normal) involving bi and ui1 (Dorsett, 1999), or a prior for
bi that is conditional on yi1, such as (Wooldridge, 2005; Hirano, 2002)

$$b_i \mid y_{i1} \sim N(\varphi y_{i1}, \sigma_1^2).$$

Dynamic linear models may be extended in several ways, to include ARMA(p,q) error
schemes, effects of time functions, or random variation over subjects or times in the
impacts of lagged predictors. For example, a dynamic model for earnings (e.g. Galler, 2001)
might include AR1 autocorrelated errors as in

$$y_{it} = b_i + \phi y_{i,t-1} + X_{it}\beta + W_i\gamma + \varepsilon_{it},$$

$$\varepsilon_{it} = \rho\varepsilon_{i,t-1} + u_{it},$$

where Wi are fixed human capital attributes, or in RIAS form,

$$y_{it} = b_{1i} + b_{2i}t + \phi y_{i,t-1} + X_{it}\beta + W_i\gamma + \varepsilon_{it},$$

where the random effects b1i and b2i allow subject specific variation in wage level and wage
growth. Taking the time function to be an unknown function of t, δt, leads to autoregressive
latent trait models (Bollen and Curran, 2004). Allowing for time-varying coefficients on
lagged responses yi,t−1, as well as random subject intercepts and growth rates, one might
then have

$$y_{it} = b_{1i} + b_{2i}\delta_t + X_{it}\beta + \phi_t y_{i,t-1} + u_{it},$$

with δt subject to identifying constraints, such as δ1 = 1.



10.5.1 Dynamic Models for Discrete Data


For discrete data, a range of dynamic longitudinal approaches have been proposed, vary-
ing according to form of response (e.g. count or binary) and initial conditions prior. These
include using a conditional prior method relating bi and yi1 (Wooldridge, 2005), or specify-
ing an initial period model without subject effects or a lagged response effect (Pettitt et
al., 2006).
For counts yit taken as Poisson, yit ~ Po(μit), problems with taking a linear impact of the
first lag outcome yi,t−1, as in

$$\mu_{it} = \exp(X_{it}\beta + \phi y_{i,t-1} + b_i), \quad t = 2, \ldots, T$$

are mentioned by Fahrmeir and Tutz (2001). This option for modelling lag response impacts
defines the Markov property scheme studied by Fotouhi (2007), under which the initial
observation is modelled as

$$\mu_{i1} = \exp(X_{i1}\beta + c_i),$$

where Xi1 includes any relevant predictors for the first period, and the subject effects bi and
ci follow a bivariate normal with correlation ρ.
Alternatively, the impact of a lagged count response may be modelled by a log or other
transform g(y), with extra preset or unknown parameters in case the lagged y is zero. Thus
if g(y) = log(y + c), where c = 1 (say), one has

$$\mu_{it} = \exp(X_{it}\beta + \phi g(y_{i,t-1}) + b_i), \quad t = 2, \ldots, T$$

where one might assume

$$b_i \mid y_{i1} \sim N(\varphi y_{i1}, \sigma_1^2).$$

By contrast, applying the conditional linear autoregressive process to longitudinal data
(Grunwald et al., 2000) leads to means

$$\mu_{it} = \phi y_{i,t-1} + \exp(X_{it}\beta + b_i),$$

while the full autoregressive conditional Poisson specification (Jung et al., 2006) specifies

$$\mu_{it} = \phi y_{i,t-1} + \eta\mu_{i,t-1} + \exp(X_{it}\beta + b_i).$$

In contrast to count regression, regression for binary responses yit ~ Bern(πit) may straightforwardly
include lags in observed outcomes yi,t−s leading to Markov Chain models
(Kedem and Fokianos, 2005). First order Markov dependence, as in

$$\mathrm{logit}(\pi_{it}) = \alpha_0 + \alpha_1 y_{i,t-1} + X_{it}\beta + b_i,$$

may be extended to higher order Markov dependence,

$$\mathrm{logit}(\pi_{it}) = \alpha_0 + \sum_{s=1}^{L} \alpha_s y_{i,t-s} + X_{it}\beta + b_i,$$

with L preset or determined by selection (Erkanli et al., 2001). Alternatively, fixed predictor
effects β, and parameters for random effects bi, may vary according to the previous value
s of the binary response; so { bs , Ds } are specific to previous response yi ,t -1 = s (Islam and
Chowdhury, 2006).
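
For the first order case, the logit model implies a pair of transition probabilities for each subject, one for each value of the previous response. The sketch below uses hypothetical function names and illustrative parameter values, and holds the covariate term fixed over time; the long-run probability uses the standard result for a two-state Markov chain.

```python
import math


def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))


def transition_probabilities(alpha0, alpha1, xb, b_i):
    """Pr(y_t = 1 | y_{t-1} = prev) for prev in {0, 1}."""
    return {prev: inv_logit(alpha0 + alpha1 * prev + xb + b_i)
            for prev in (0, 1)}


def long_run_probability(p):
    """Stationary Pr(y = 1) of the two-state chain with the given transitions."""
    p01 = p[0]          # move from 0 to 1
    p10 = 1.0 - p[1]    # move from 1 to 0
    return p01 / (p01 + p10)


p = transition_probabilities(alpha0=-1.0, alpha1=1.5, xb=0.2, b_i=0.0)
pi_long_run = long_run_probability(p)
```

A positive α1 raises the probability of an event following an event, so the gap between the two transition probabilities measures state dependence for that subject.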
Such alternatives extend in principle to multinomial outcomes yit ∈ (1, …, K), or equivalently
dikt = 1 if category k applies (or is chosen), and dikt = 0 otherwise. So

$$(d_{it1}, \ldots, d_{itK}) \sim \mathrm{Mult}(n_{it}, [\pi_{it1}, \ldots, \pi_{itK}]),$$

where nit = 1. Use of lags is complicated by the possible influence of cross-category lags as
well as own-category lags. Pettitt et al. (2006) consider a Bayesian hierarchical multinomial
model for changes in employment status (a trichotomy), with one period lags in status as
predictors. Thus, with employment status 1 as the reference (and so ϕi1t = 1), one has for t > 1

$$\pi_{ikt} = \Pr(y_{it} = k) = \phi_{ikt} \Big/ \sum_{k=1}^{K} \phi_{ikt},$$

$$\log(\phi_{ikt}) = b_{ik} + \beta_k X_{it} + \gamma_{k1} I(y_{i,t-1} = 2) + \gamma_{k2} I(y_{i,t-1} = 3), \quad k = 2, \ldots, K$$

where bik are category specific random effects. For the initial period, a static multinomial
logit model can be adopted, without lag effects or bik, and with distinct regression effects,
namely

$$\log(\phi_{ik1}) = \delta_k X_{i1}, \quad k = 2, \ldots, K.$$

This follows from a linear approximation to the reduced form obtained when lagged
response variables are replaced by their specifications under the dynamic model for peri-
ods preceding t = 1.
Dynamic modelling approaches may also be applied using latent metric responses, associated
with binary or ordinal observations. Suppose observations yit are binary such that
the latent continuous response y*it > 0 if and only if yit = 1, and y*it ≤ 0 if yit = 0. Then one
might specify

$$y_{it}^* = X_{it}\beta + \phi_1 y_{i,t-1} + \phi_2 y_{i,t-1}^* + u_{it},$$

with uit ~ N(0, 1), and lag one dependence on both previous events and latent utilities. If
there is serial correlation (e.g. AR1 dependence) in the errors, then εit = ρ1εi,t−1 + uit, with
uit ~ N(0, 1). In this way, one may avoid spurious state dependence in which previous
responses proxy unobserved variation.

Example 10.6 National Longitudinal Study of Youth: Lagged Earnings Model


This example considers unbalanced data and the modelling of initial conditions and
autocorrelation in such data. It involves earnings data from the US National Longitudinal
Survey relating to young women aged between 14 and 26 in 1968, and either already in
the labour market in 1968, or entering the labour market during the period 1968–88. In
this period, there were fifteen measuring occasions, namely each year during 1968–88
except 1974, 1976, 1979, 1981, 1984, and 1986. There are 4,711 subjects varying consider-
ably in their observed histories; many subjects are subject to attrition or intermittent
observation. The analysis here is based on a 10% sample of the n = 4164 subjects who
have at least two measurements on yearly log earnings, where earnings figures for each
subject are divided by calendar year averages to correct for inflation. In this way, the
earnings profile of a subject observed over 1968–1975 (say) can be compared with that
for a subject observed over 1978–1985. An alternative might be to have fixed or random
effects for each calendar year to model population trends in average income.
Although not all years were subject to survey updates, the analysis here takes a sub-
ject’s entire observation span (obtained by comparing initial and last observation year)
to define that subject’s total times Ti. Any intervening years without observations are
treated as missing data, whether this is due to intermittent missingness or the absence
of an NLS update in particular years. Thus the first subject is observed on twelve occa-
sions (in the studies in 1970, 1971, 1972, 1973, 1975, 1977, 1978, 1980, 1983, 1985, 1987, and
1988), but that subject’s total times Ti is set at 19, with the intervening years without
observations (e.g. 1974, 1976, etc.) treated as missing data. Missingness is taken to be at
random, not depending on the possibly missing response value.
With yit denoting (inflation corrected) log earnings, the initial regression model
includes subject effects bi, and fixed binary attributes {W1i , W2i , W3 i } , with W1i for college
graduate (=1, 0 otherwise), W2i for white ethnicity (=1, 0 for other ethnicities), and an
interaction W3i = W1iW2i. So, for i = 1,… , n ,

$$y_{it} = b_i + W_i\gamma + u_{it}, \quad t = 1, \ldots, T_i$$

where $b_i \sim N(\beta_1, D)$, and $u_{it} \sim N(0, \sigma^2)$. Uniform U(0,10) priors are assumed for σ and
$\sigma_b = D^{0.5}$, and N(0,1000) priors for fixed effects $\{\beta_1, \gamma_1, \gamma_2, \gamma_3\}$.
Estimation using jagsUI gives posterior means (sd) for γ1 and γ2 of 0.32 (0.05) and 0.036
(0.022), showing significantly higher earnings for college graduates, and a positive but not
significant white ethnicity effect. The effect of the interaction term is significantly negative,
with mean (sd) of −0.12 (0.06), suggesting a greater positive impact of college education
on earnings for non-white subjects. The posterior mean for the standard deviation
of the bi is 0.18, so that a subject for whom bi is one standard deviation above the average
would have earnings about 20% above average, since exp(0.18) ≈ 1.20, given observed
personal characteristics Wi. Taking $\hat{u}_{it} = y_{it} - b_i - W_i\gamma$, there is evidence of autocorrelated
errors, with the 95% interval for the statistic

$$r_u = \sum_{i=1}^{n} \sum_{t=2}^{T_i} \hat{u}_{it}\hat{u}_{i,t-1} \Big/ \sum_{i=1}^{n} \sum_{t=1}^{T_i} \hat{u}_{it}^2$$

being (0.06,0.11).
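The autocorrelation statistic is simple to evaluate for a given set of residuals. The following Python sketch assumes a complete residual sequence for each subject (no missing periods) and uses the hypothetical function name `lag1_autocorr`; in an MCMC setting it would be evaluated at each iteration to obtain a posterior interval.

```python
def lag1_autocorr(residuals):
    """Pooled lag one autocorrelation statistic r_u.

    residuals: list of per-subject residual sequences, possibly of unequal length.
    """
    numerator = sum(u[t] * u[t - 1] for u in residuals for t in range(1, len(u)))
    denominator = sum(value * value for u in residuals for value in u)
    return numerator / denominator

# Two hypothetical subjects with constant residuals
r_u = lag1_autocorr([[1.0, 1.0], [2.0, 2.0]])
```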
To improve fit, a second dynamic model is non-stationary, in that there is a distinct
model for the first period for each subject (Geweke and Keane, 2000), and a one period
lag effect ϕ of earnings, with this effect not constrained to stationarity. Random subject
effects are also included in the model for periods t = 2,… , Ti so that

$$y_{it} = b_i + W_i\gamma + \phi y_{i,t-1} + u_{it}, \quad t = 2, \ldots, T_i$$

$$y_{i1} = W_i\gamma_1 + u_{i1},$$

with an N(0,1) prior on ϕ, and with $u_{it} \sim N(0, \sigma^2)$ and $u_{i1} \sim N(0, \sigma_1^2)$ taken independently.
The 95% interval for ϕ is obtained as (0.57, 0.64), along with considerably reduced auto-
correlation, with 95% interval for ru now from −0.048 to −0.003. Fit is improved, with
WAIC now lower at −1530, compared to −387 under the non-dynamic model. The γ coef-
ficients are reduced in absolute size, but the college effect γ1 remains significant, with
95% interval (0.08,0.17). The posterior mean for σb is also reduced, to 0.072. There is scope
for further model development, as the probability that ru is positive is low (under 0.02),
and this might involve subject specific lag parameters, random slopes on time, or auto-
correlated errors.

Example 10.7 Epileptic Seizure Data: Lagged Count Model


This example illustrates model checks for a dynamic count regression using the epilep-
tic seizure data from Thall and Vail (1990). An anti-epileptic drug treatment (progabide)
was applied for some of the n = 59 patients, with others receiving a placebo. A pre-treat-
ment eight-week baseline seizure count was also obtained, and may be treated either
as exogenous, or as an endogenous initial condition (Fotouhi, 2007). Here the baseline
count is included in the outcome profile, so that T = 5 with the baseline seizure count
denoted yi1. The analysis here follows Lindsey (1993) in including a lagged response as
one predictor. The predictors Xit for t ≥ 2 are age, treatment, treatment by time interac-
tion, and lagged seizure count, while the predictor set Xi1 consists of age at baseline, and
treatment (to measure any differential baseline morbidity between treatment groups).
A trivariate normal model correlates the random intercept and slopes (bi1,bi2) for periods
t ≥ 2 with the random intercepts ci in the model for yi1.
So, model 1 takes

$$y_{it} \sim \text{Po}(\mu_{it}), \quad t = 1, \ldots, 5$$

with the means modelled as

$$\mu_{it} = \exp(X_{it}\beta + \phi y_{i,t-1} + b_{i1} + b_{i2}t), \quad t = 2, \ldots, T$$

$$\mu_{i1} = \exp(X_{i1}\beta + c_i),$$

$$(b_{i1}, b_{i2}, c_i) \sim N_3(0, D),$$

where $D^{-1} \sim W(I, 3)$, and fixed effects are assigned N(0,10) priors. Estimation using jagsUI
shows neither the main treatment effect nor the treatment by time effect to be significant,
while the 95% interval for the coefficient ϕ on lagged seizure counts is (−0.009,−0.002).
The correlation between bi1 and ci is 0.77. This model leaves excess dispersion: the mean
scaled deviance of 440 (Fit[1] in the code) exceeds the number, 5 × 59 = 295, of observa-
tions, an issue returned to in Example 10.10. Predictive discrepancy shows in a posterior
predictive check involving the deviance, with zero probability that the deviance involv-
ing replicate data exceeds the deviance for the actual data. On the other hand, mixed
predictive checks (Marshall and Spiegelhalter, 2007), denoted exc.mx[i,t] in the code, do
not show an excess of tail value probabilities: 10 under 0.05 and 10 over 0.95.
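The deviance-based checks described here can be sketched as follows. This Python illustration is not the book's JAGS code: the function names are hypothetical, the Poisson scaled deviance is computed from a vector of counts and fitted means, and the predictive check is taken as the share of MCMC draws in which the replicate-data deviance exceeds the observed-data deviance.

```python
import math

def poisson_deviance(y, mu):
    """Scaled deviance 2 * sum[y log(y / mu) - (y - mu)] for Poisson counts."""
    total = 0.0
    for y_i, mu_i in zip(y, mu):
        log_term = y_i * math.log(y_i / mu_i) if y_i > 0 else 0.0
        total += 2.0 * (log_term - (y_i - mu_i))
    return total

def ppc_exceedance(dev_replicate, dev_observed):
    """Share of draws where the replicate deviance exceeds the observed deviance."""
    hits = sum(1 for rep, obs in zip(dev_replicate, dev_observed) if rep > obs)
    return hits / len(dev_observed)

# Hypothetical deviance draws from four MCMC iterations
p_exceed = ppc_exceedance([1.0, 5.0, 3.0, 2.0], [2.0, 4.0, 4.0, 1.0])
```

An exceedance share near 0 or 1 signals predictive discrepancy, while values away from the extremes indicate that replicate data resemble the observations in terms of deviance.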
A second analysis replicates model 1 except in taking (bi1 , bi 2 , ci ) as multivariate skew
student t, to account for possibly heavy tailed or skew random effects. Thus

$$\log(\mu_{it}) = X_{it}\beta + \phi y_{i,t-1} + b_{i1} + \delta_2 W_{2i} + (b_{i2} + \delta_3 W_{3i})t, \quad t = 2, \ldots, T$$

$$\log(\mu_{i1}) = X_{i1}\beta + c_i + \delta_1 W_{1i},$$

$$(b_{i1}, b_{i2}, c_i) \sim N_3(0, D/\xi_i),$$

$$\xi_i \sim \text{Ga}\left(\frac{\nu}{2}, \frac{\nu}{2}\right),$$

where the Wji are independently half normal N+(0,1), and the skew parameters have
$\delta_k \sim N(0, 10)$ priors. The degrees of freedom ν has a set value, ν = 4, providing a robust
setting (Gelman et al., 2014), as estimation of ν may be sensitive to priors adopted.
The skew parameters have 95% credible intervals (−0.04,0.82), (−1.01,0.28), and (−0.16,
0.13). The lowest scale factors (xi[i] in the code) are for subjects 49, 18, 15, and 8, namely
$\xi_{49} = 0.28$, $\xi_{18} = 0.37$, $\xi_{15} = 0.44$, and $\xi_8 = 0.45$ (cf. Fotouhi, 2007). This extension reduces
the LOO-IC slightly, from 1,764 to 1,762.

10.6 Robust Longitudinal Models: Heteroscedasticity, Generalised Error Densities, and Discrete Mixtures
Preceding sections consider the normal linear mixed model for continuous longitudinal
outcomes $y_i = (y_{i1}, \ldots, y_{iT_i})'$ assuming normal errors at both levels, and constant variances
(or dispersion matrices) across subjects and observations. Thus, assuming Zit is a subset of
Xit, one has

$$y_{it} = X_{it}\beta + Z_{it}b_i + u_{it},$$

$$(b_{1i}, b_{2i}, \ldots, b_{Qi}) \sim N_Q(0, D),$$

and with residuals

$$(u_{i1}, \ldots, u_{iT_i}) \sim N(0, \sigma^2 I_{T_i}).$$

The general linear mixed model, in which y may be a discrete response, may not include
observation level residuals; for overlapping Xit and Zit, it takes the form

$$g[E(y_{it} \mid b_i)] = X_{it}\beta + Z_{it}b_i,$$

with $(b_{1i}, b_{2i}, \ldots, b_{Qi}) \sim N_Q(0, D)$, where, again, normality of errors and constant dispersion
D are default assumptions.
Violations of standard assumptions regarding the form of the error density, or of homoscedasticity,
are likely to affect inferences. One principle that may provide a robust
approach to departures from such standard assumptions is that of embedding the model
in a more general framework (Zhang et al., 2014; Ma et al., 2004; Rice, 2005), with conventional
assumptions (e.g. normality and homoscedasticity of errors) as special cases of a
broader model.
Following Chapter 8, assumptions of homoscedasticity at level 1 (repeated observations
within subjects) or at level 2 (heterogeneity between subjects) may be modified to allow
more general variance functions varying over subjects, times, or both, including dependence
of the variance on subject or observation level attributes. For example, heteroscedasticity
may exist in the permanent random effects component of longitudinal models,
which may be modelled by regressing the variance on predictors through a positive function. For varying
intercepts bi as in (10.4), one might relate subject specific variances Di to predictor values
averaged over time, $\bar{X}_i$, as in

$$D_i = \alpha^2 (1 + \varphi \bar{X}_i)^2,$$

where terms in the scalar or vector φ are positive. Heteroscedasticity may also be considered at
observation level, so that for $y_{it} = X_{it}\beta + Z_{it}b_i + u_{it}$ one might take

$$u_{it} \sim N(0, \sigma_{it}^2),$$

$$\sigma_{it}^2 = \alpha^2 (1 + \varphi X_{it})^2.$$

Wakefield et al. (1994), in a nonlinear pharmacokinetic longitudinal analysis with positive
structural effects ηit, specify a Bayesian heteroscedastic model at observation level.
Thus $y_{it} \sim N(\eta_{it}, \eta_{it}^{\omega}/\tau)$, where ω is an unknown power and τ is an overall precision parameter,
and ω = 0 corresponds to homoscedasticity.
Similarly, more general error densities allowing for skewness, heavy tails, or other non-
normal features may be adopted, with the standard assumptions embedded within them.
Alternatives to assuming multivariate normal subject effects may include heavy tailed
Student t heterogeneity (Zhang et al., 2014; Chib, 2008; Lin and Lee, 2006), skew normal
and skew-t densities, and skew-elliptical densities (Ma et al., 2004). Thus, the normal linear
mixed model can be embedded within a wider class of scale mixture normal densities,
with the subject or observation level scale parameters measuring outlier status (Wakefield
et al., 1994; Chib, 2008). Thus, the model of (10.2), with normal cluster effects bi and normal
residuals uit, is a special case of a scale mixture model with

$$y_{it} = X_{it}\beta + Z_{it}b_i + u_{it},$$

$$u_{it} \sim N\left(0, \frac{1}{\lambda_i}\Sigma_i\right),$$

$$b_i \sim N\left(B, \frac{1}{\xi_i}D\right),$$

$$\lambda_i \sim G_\lambda,$$

$$\xi_i \sim G_\xi.$$

A widely applied option takes the densities {Gλ,Gξ} to be gamma with equal shape and
rate parameters, νλ/2 and νξ/2 respectively, leading to multivariate t densities with {νλ,νξ} degrees of
freedom. This provides resistance to atypical data at both observation and cluster levels.
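The link between gamma scale mixing and Student t errors can be checked by simulation. The Python sketch below is a univariate illustration under stated assumptions: the function names are hypothetical, and the gamma density is taken to have shape ν/2 and rate ν/2 (so that the mixing variable has mean 1), which gives a marginal variance of νσ²/(ν − 2) for ν > 2.

```python
import random

def scale_mixture_draws(n, nu, sigma=1.0, seed=1):
    """Draw u = N(0, sigma^2 / lam) with lam ~ Gamma(shape nu/2, rate nu/2)."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n):
        # random.gammavariate takes (shape, scale); a rate of nu/2 is a scale of 2/nu
        lam = rng.gammavariate(nu / 2.0, 2.0 / nu)
        draws.append(rng.gauss(0.0, sigma / lam ** 0.5))
    return draws

def sample_variance(x):
    mean = sum(x) / len(x)
    return sum((value - mean) ** 2 for value in x) / (len(x) - 1)

draws = scale_mixture_draws(n=50000, nu=10)
empirical_variance = sample_variance(draws)   # theoretical value is 10 / 8 = 1.25
```

Low sampled values of the mixing variable inflate the variance for the corresponding case, which is why the posterior scale parameters can be read as indicators of outlier status.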
For possibly skew residual or subject effects, skew-normal or skew-t densities may be
adopted. Ghosh et al. (2007) consider bivariate skew-normal errors at both subject and
observation level in a linear longitudinal model for metric responses, while Jara et al.
(2008) allow both subject random effects and observation level errors to follow a multivariate
skew-t distribution. Thus, for a linear mixed model for y of dimension Ti

$$y_i = X_i\beta + Z_ib_i + u_i, \tag{10.7}$$

suppose yi follows the multivariate skew-t density (Sahu et al., 2003). Then

$$y_i \mid \beta, b_i, \sigma^2, R_i, \Delta_i \sim ST_\nu(X_i\beta + Z_ib_i, \sigma^2 R_i, \Delta_i),$$

where ν is the degrees of freedom, Ri is a Ti × Ti matrix, and $\Delta_i = \text{diag}(\delta_{1i}, \ldots, \delta_{T_i,i})$ contains
skewness parameters relevant to the observation level residuals that may in principle be
specific to individuals and times. The density of the entire observation set $y = (y_1, \ldots, y_n)$,
conditional on collections of bi, Ri and Δi, is (Jara et al., 2008)

$$p(y \mid \beta, b, \sigma^2, R, \Delta) \propto \prod_{i=1}^{n} 2^{T_i}\, t_{T_i,\nu}(y_i \mid X_i\beta + Z_ib_i, \sigma^2 R_i + \Delta_i^2) \times \int_0^\infty t_{T_i,\nu}(w_i \mid \mu_w, \Sigma_w)\, dw_i,$$

where $t_{m,\nu}(x \mid \mu_x, \Sigma_x)$ denotes a multivariate t density of dimension m. When Ri reduces to
an identity matrix $I_{T_i}$, and the subject-time skewness parameters δit to a global parameter
δ, namely δit = δ, the conditional expectation and variances for each subject are

$$E(y_i \mid \beta, b_i, \sigma^2, \delta) = X_i\beta + Z_ib_i + (\nu/\pi)^{0.5}\, \frac{\Gamma[(\nu - 1)/2]}{\Gamma(\nu/2)}\, \delta 1_{T_i},$$

$$\text{Var}(y_i \mid \beta, b_i, \sigma^2, \delta) = \frac{\nu}{\nu - 2}(\sigma^2 + \delta^2) I_{T_i} - (\nu/\pi) \left[\frac{\Gamma[(\nu - 1)/2]}{\Gamma(\nu/2)}\right]^2 \delta^2 I_{T_i}.$$

Under the reductions $R_i = I_{T_i}$, $\delta_{it} = \delta$, the conditional density may be described by a mixture
of normal distributions by conditioning on positive variables $w_i = (w_{1i}, \ldots, w_{T_i i})$ obtained by
truncated sampling from a multivariate normal with identity covariance matrix of dimension
Ti and subject-specific scalings $\lambda_i \sim \text{Ga}(\nu/2, \nu/2)$, so that

$$y_i \mid \beta, b_i, \sigma^2, w_i, \lambda_i, \delta \sim N_{T_i}\left(X_i\beta + Z_ib_i + \delta w_i, \frac{\sigma^2}{\lambda_i} I\right),$$

$$w_i \sim N_{T_i}\left(0, \frac{1}{\lambda_i} I\right) I(0, \infty).$$

In the (usual) case when $X_i\beta + Z_ib_i$ contains an intercept, then for identifiability, the elements
in the vector wi may be centred (subsequent to truncated sampling) (Jara et al., 2008).
Thus, at each iteration, the average of the wit can be obtained, and then the centred variables
$W_{it} = w_{it} - \bar{w}_i$, so that

$$y_{it} \sim N\left(X_{it}\beta + Z_{it}b_i + \delta W_{it}, \frac{\sigma^2}{\lambda_i}\right).$$
Additionally, in the model (10.7), the permanent random effects bi may also be taken as skew
multivariate t. Assuming the Z predictors are a subset of the X predictors, one then has

$$b_i \mid D, \Gamma_i \sim ST_{\nu_b}(0, D, \Gamma_i),$$

where D is Q × Q, νb is the degrees of freedom, and $\Gamma_i = \text{diag}(\gamma_{1i}, \ldots, \gamma_{Qi})$ contains skewness
parameters relevant to the permanent effects. Assuming common skew parameters
$\Gamma_i = \Gamma = \text{diag}(\gamma_1, \ldots, \gamma_Q)$, and conditional on a Q vector of positive variables, $h_i = (h_{1i}, \ldots, h_{Qi})$,
with

$$h_i \sim N_Q\left(0, \frac{1}{\xi_i} I\right) I(h_i > 0), \tag{10.8}$$

$$\xi_i \sim \text{Ga}\left(\frac{\nu_b}{2}, \frac{\nu_b}{2}\right),$$

the random effects are mixtures of normals, namely

$$b_i \sim N_Q\left(\Gamma h_i, \frac{1}{\xi_i} D\right).$$

For improved identification, the hi can be centred around their means (at each MCMC iteration),
namely $H_{qi} = h_{qi} - \bar{h}_{q\cdot}$, so that

$$b_i \sim N_Q\left(\Gamma H_i, \frac{1}{\xi_i} D\right).$$
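The effect of centring the positive variables can be seen in a small simulation. The Python sketch below makes simplifying assumptions that are not in the text: a single random effect (Q = 1), the skew normal case with the scale factor fixed at 1, and hypothetical function names and parameter values. Without centring, the mean of the effects shifts by the skew parameter times (2/π)^0.5, which competes with any intercept; centring removes this shift while retaining the skew.

```python
import random

def skew_effects(n, gamma, d_sd, seed=2, centre=True):
    """Draw b_i = gamma * h_i + e_i with h_i half normal and e_i ~ N(0, d_sd^2)."""
    rng = random.Random(seed)
    h = [abs(rng.gauss(0.0, 1.0)) for _ in range(n)]  # N(0,1) truncated to h > 0
    if centre:
        h_bar = sum(h) / n
        h = [value - h_bar for value in h]
    return [gamma * value + rng.gauss(0.0, d_sd) for value in h]

def skewness(x):
    n = len(x)
    mean = sum(x) / n
    m2 = sum((value - mean) ** 2 for value in x) / n
    m3 = sum((value - mean) ** 3 for value in x) / n
    return m3 / m2 ** 1.5

centred = skew_effects(20000, gamma=1.0, d_sd=0.5, centre=True)
uncentred = skew_effects(20000, gamma=1.0, d_sd=0.5, centre=False)
```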

10.6.1 Robust Longitudinal Data Models: Discrete Mixture Models


Another way of reducing the impact of arbitrarily selecting a particular parametric form
for random variation in bi and/or uit is by using discrete mixtures of random effects pri-
ors (e.g. Weiss et al., 1999). A discrete mixture prior may be more flexible in dealing with
unusual cases, skewness, and multiple modes. The possibly conflicting criteria required in
the case of a prior on bi are considered by Muller and Rosner (1997): namely, that the prior
should be flexible to allow for heterogeneity in the population, though on the other hand,
unusual cases should not have an undue predictive influence.
An often suitable approach would involve two group normal mixture priors with the
groups typically conceived of as a main group and outlier group (Weiss et al., 1999, p.1563).
Such schemes may apply both for random intercepts bi

$$b_i \sim \pi_b N_Q(0, D) + (1 - \pi_b) N_Q(0, \varphi_b^2 D),$$

and for iid observation level uit

$$u_{it} \sim \pi_u N(0, \sigma^2) + (1 - \pi_u) N(0, \varphi_u^2 \sigma^2),$$

where the factors $\{\varphi_b > 1, \varphi_u > 1\}$ are used for variance inflation for the outlier group. The
prior probabilities of being in the main population are set high (e.g. $\pi_b = \pi_u = 0.95$), and
variance inflation factors are typically large, e.g. $\varphi_b = \varphi_u = 5$ or 10. Provided one or other of
the parameter sets $\{\pi_b, \pi_u\}$ or $\{\varphi_b, \varphi_u\}$ is assumed known (i.e. is assigned preset values), the
other set may be taken as unknowns.
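Under preset mixture probabilities and inflation factors, the probability that a given residual belongs to the outlier group follows from Bayes' rule. The Python sketch below treats the residual and σ as known, which is a simplification of the full MCMC analysis; the function names are hypothetical, and the defaults follow the illustrative values 0.95 and 5 mentioned above.

```python
import math

def normal_pdf(x, sd):
    return math.exp(-0.5 * (x / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def outlier_probability(u, sigma, pi_main=0.95, inflation=5.0):
    """Probability that residual u comes from the inflated-variance component."""
    main = pi_main * normal_pdf(u, sigma)
    # variance inflation by inflation^2 means the standard deviation is inflation * sigma
    outlier = (1.0 - pi_main) * normal_pdf(u, inflation * sigma)
    return outlier / (main + outlier)

p_typical = outlier_probability(0.0, sigma=1.0)
p_extreme = outlier_probability(4.0, sigma=1.0)
```

A residual near zero has an outlier probability below the prior value of 0.05, whereas a residual four standard deviations from zero is allocated to the outlier group with high probability.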
Another option is “switching” or shift priors whereby one group has zero effects, but a
minority group has non-zero effects. These may be used for iid errors introduced to reflect
overdispersion in count or binomial data. For example, for $y_{it} \sim \text{Po}(\mu_{it})$, one may have

$$\log(\mu_{it}) = X_{it}\beta + Z_{it}b_i + \sigma k_{it} u_{it}$$

where σ is a scale factor, $k_{it} \sim \text{Bern}(\pi_u)$, $u_{it} \sim N(0, 1)$, such that observation level effects are
zero when $k_{it} = 0$. One may preset πu low, say πu = 0.05. For a longitudinal series with level
ct subject to possible shifts, and Xit not containing an intercept, one may similarly propose
that

$$y_{it} = c_t + X_{it}\beta + Z_{it}b_i + \sigma u_{it},$$

$$c_t = c_{t-1} + k_t \sigma w_t,$$

where $u_{it} \sim N(0, 1)$, $w_t \sim N(0, 1)$, and $k_t \sim \text{Bern}(\pi_c)$ with πc low.


A different emphasis, as in latent growth curve models (Depaoli and Boyajian, 2014;
Galatzer-Levy, 2015; Galatzer-Levy and Bonanno, 2012), is when there is a substantive
rationale for assuming subject level effects bqi follow a discrete prior at subject level. The
hyperparameters governing the subject effects $\{b_{1i}, b_{2i}, \ldots, b_{Qi}\}$ then become specific for the
latent category. Thus, in a growth curve model for modelling changes in aggression ratings,
Muthen et al. (2002) assume that a small number of latent trajectories characterise
growth in aggression. For subject i, let the latent category be denoted $k_i \in (1, \ldots, K)$. Then
conditional on $k_i = k$, (10.5) would become

$$b_{1i} = B_{1k} + e_{1i},$$

$$b_{2i} = B_{2k} + d_{2k} Tr_i + e_{2i},$$

$$b_{3i} = B_{3k} + d_{3k} Tr_i + e_{3i},$$

where $(e_{1i}, e_{2i}, e_{3i}) \sim N_3(0, D_k)$. Observation level dispersion parameters may also differ
according to latent group.
Flexible discrete mixture models are also obtained under Dirichlet process and related
semiparametric priors (Dunson, 2009), as considered for repeated binary data by Quintana
et al. (2008), for longitudinal count data by Kleinman and Ibrahim (1998), and for multiple
membership longitudinal models by Savitsky and Paddock (2014) and Paddock and
Savitsky (2013). Averaging over different numbers of mixture components K is possible
under discrete parametric mixture models using the RJMCMC (reversible jump MCMC)
algorithm; see Ho and Hu (2008) for an application to the linear mixed model. In the nonparametric
mixture approach, the number of clusters is an outcome of other parameters
such as the Dirichlet process mass parameter κ. Under the truncated Dirichlet process
(Ohlssen et al., 2007), one may set a maximum Km possible clusters, with the realised number
at each iteration being $K \le K_m$. The posterior density of K will indicate whether the
assumed maximum Km is sufficient.
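The truncated Dirichlet process can be illustrated through its stick-breaking representation, a standard construction that is assumed here rather than taken from the text. In this Python sketch, the function names are hypothetical, the mass parameter and maximum number of clusters are illustrative values, and the final stick fraction is set to 1 so that the Km weights sum to one.

```python
import random

def truncated_stick_breaking(kappa, k_max, rng):
    """Cluster weights from stick fractions V_j ~ Beta(1, kappa), truncated at k_max."""
    weights, remaining = [], 1.0
    for j in range(k_max):
        v = 1.0 if j == k_max - 1 else rng.betavariate(1.0, kappa)
        weights.append(v * remaining)
        remaining *= 1.0 - v
    return weights

def realised_clusters(n_subjects, weights, rng):
    """Allocate subjects to clusters and count the non-empty clusters."""
    labels = rng.choices(range(len(weights)), weights=weights, k=n_subjects)
    return len(set(labels))

rng = random.Random(3)
weights = truncated_stick_breaking(kappa=1.5, k_max=20, rng=rng)
k_realised = realised_clusters(n_subjects=59, weights=weights, rng=rng)
```

If the realised number of clusters repeatedly sits at or near the maximum across iterations, the truncation level is likely too low.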
Hirano (2000, 2002) discusses non-parametric alternatives regarding white noise obser-
vation errors uit in longitudinal data, while Kleinman and Ibrahim (1998) and Muller and
Rosner (1997) consider mixed Dirichlet process (MDP) modelling of Q dimensional unit
level effects bi. Under the MDP option, one has bi following a density G which is itself
unknown, centred on a specified base density G0 with precision κ. For example, with a base
density G0 = NQ (B, D), one has

$$g[E(y_{it} \mid b_i)] = \eta_{it} = X_{it}\beta + Z_{it}b_i + u_{it},$$

$$u_{it} \sim N(0, \sigma^2),$$

$$b_i \sim G,$$

$$G \sim DP(\kappa G_0),$$

$$G_0 = N_Q(B, D),$$

where priors on $\beta, B, D, \sigma^2$ are typically as considered above. This is the conjugate MDP
prior for the normal linear mixed model which tends to the conventional hierarchical prior
as κ → ∞.
The model considered by Hirano (2002) is also conjugate, and based on a dynamic model

$$y_{it} = b_i + \rho y_{i,t-1} + u_{it},$$

where the bi are zero mean effects that are modelled parametrically, and $u_{it} = y_{it} - b_i - \rho y_{i,t-1}$
may have non-zero means. One has for $\theta_{it} = \{\mu_{it}, \sigma_{it}^2\}$

$$u_{it} \sim N(\mu_{it}, \sigma_{it}^2),$$

$$\theta_{it} \sim G,$$

$$G \sim DP(\kappa G_0),$$

where G0 specifies

$$G_0(\mu, \sigma^2): \quad \frac{1}{\sigma^2} \sim \frac{\chi^2(s)}{sL}; \quad \mu \sim N(m, b\sigma^2),$$

where s, L, m and b are preset. As discussed in Chapter 3, κ may be preset or taken as an
unknown. Thus, Kleinman and Ibrahim (1998, p.2592) consider defaults such as κ = 1.5 and
κ = 100, while Hirano (2002) takes $\kappa \sim \text{Ga}(2, 0.5)$.

Example 10.8 A Pharmacokinetic Application


To exemplify heteroscedastic longitudinal analysis, this example considers pharmacokinetic
longitudinal data. The dataset consists of plasma concentrations yit of the drug Cadralazine
in n = 10 cardiac failure patients at various times t = 1,… , Ti (in hours) following administra-
tion of a single dosage of G = 30 mg. A one-compartment nonlinear model for these data
(Bauer et al., 2007; Bonate, 2008) with mean concentration ηit at time t can be expressed as

$$\eta_{it} = (G/\alpha_i) \exp(-\beta_i t/\alpha_i),$$

where αi > 0 and βi > 0 are respectively the volume of distribution and clearance parameters
for each subject. A hierarchical model is proposed with the second stage consisting
of a multivariate normal or multivariate Student t for the transformed subject effects
$(b_{1i}, b_{2i}) = \{\log(\alpha_i), \log(\beta_i)\}$.
For the first stage density, one option is a log-normal since y is positive, or a truncated
normal, with yit constrained to be positive. Under the latter, a heteroscedastic power model,
with a single precision parameter τ, leads to a variance $\eta_{it}^{\omega}/\tau$, and the first stage model is

$$y_{it} \sim N(\eta_{it}, \eta_{it}^{\omega}/\tau)\, I(0, \infty).$$

Note that zero y values are replaced by 0.001 to avoid conflict with this density assumption.
Another option for the first stage model involves a normal scale mixture, namely

$$y_{it} \sim N(\eta_{it}, \eta_{it}^{\omega}/[\lambda_i \tau])\, I(0, \infty),$$

$$\lambda_i \sim \text{Ga}(0.5\nu, 0.5\nu).$$

Here these options are compared under the priors $\tau \sim \text{Ga}(1, 0.001)$ and $\nu \sim \text{Ga}(2, 0.1)$. A
uniform U(0,5) prior is assumed for ω, as in Wakefield et al. (1994).
At the second stage, a bivariate normal for $(b_{1i}, b_{2i})$ is assumed with

$$(b_{1i}, b_{2i}) \sim N_2([B_1, B_2], D),$$

$$B \sim N(B_0, C),$$

$$D^{-1} \sim W([\rho R]^{-1}, \rho),$$

with ρ, R, B0, and C as in Wakefield et al. (1994).


[Figure: histogram with horizontal axis y2 and vertical axis Frequency]

FIGURE 10.1
Predictive distribution of y2.

Inferences are based on runs of 10,000 iterations using the rube package. The scale
mixture model with variances $\eta_{it}^{\omega}/[\lambda_i \tau]$ has a lower DIC, namely −183.5, as compared to
−180 for the model with variances $\eta_{it}^{\omega}/\tau$. The posterior mean ν under the scale mixture
is 16, with the lowest scale parameter for subject 2, λ2 = 0.66. The power ω is estimated
at 0.86 under the better performing model, whereas a log-normal model would imply
ω ≈ 2.
Out-of-sample predictions of concentrations are made for a duration of 32 hours.
For subject 2, whose plasma concentrations remain relatively high compared to other
subjects, the mean prediction is 0.11. Figure 10.1 shows the predictive distribution.
Inferences on the population distribution of concentration parameters are important,
for example, the half-life (period of time required for a drug concentration to be reduced
by one-half), which for patient i is $\alpha_i \log(2)/\beta_i$. The median half-life and clearance are
obtained, and Figure 10.2 shows the corresponding bivariate posterior plot.
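The relation between the one-compartment mean curve and the half-life can be verified directly. In this Python sketch the function names and the subject parameter values are hypothetical; only the dose of 30 mg and the functional forms are taken from the example.

```python
import math

def concentration(t, alpha, beta, dose=30.0):
    """Mean concentration (dose / alpha) * exp(-beta * t / alpha) at time t (hours)."""
    return (dose / alpha) * math.exp(-beta * t / alpha)

def half_life(alpha, beta):
    """Time for the mean concentration to halve: alpha * log(2) / beta."""
    return alpha * math.log(2.0) / beta

# Hypothetical subject: volume of distribution 18, clearance 3
alpha, beta = 18.0, 3.0
c_start = concentration(0.0, alpha, beta)
c_half = concentration(half_life(alpha, beta), alpha, beta)
```

Evaluating these functions at each MCMC draw of the subject parameters would give posterior samples of the half-life.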

Example 10.9 Skewed Cholesterol Data


This example relates to longitudinal data on cholesterol levels collected during the
Framingham heart study for n = 200 randomly selected subjects, as considered by Zhang
and Davidian (2001) and Ma et al. (2004). Relevant subject attributes are sex (1 = M, 0 = F),
and age at baseline. Several studies have re-considered the linear mixed model used by
those authors, namely

$$y_{it} = X_{it}\beta + Z_{it}b_i + u_{it} = \beta_1 + \beta_2 \text{Sex}_i + \beta_3 \text{Age}_i + \beta_4 a_{it} + b_{1i} + b_{2i} a_{it} + u_{it}, \tag{10.9}$$

where yit is cholesterol level divided by 100, and $a_{it} = (\text{years} - 5)/10$, derived using years
from baseline. Total periods Ti differ between subjects, varying from 1 to 6.
FIGURE 10.2
Bivariate posterior.

Here two models are considered to reflect positive skew apparent from plots of the
outcome. One may, for example, consider multivariate skew normal or multivariate
skew in the random effects $(b_{1i}, b_{2i})$. For skew bivariate normal random effects, one has
(as model 1)

$$b_i \mid D, \Gamma \sim SN(0, D, \Gamma),$$

where D is 2 × 2, and $\Gamma = \text{diag}(\gamma_1, \gamma_2)$. Equivalently, conditional on the positive standard
normal effects

$$h_i \sim N_Q(0, I)\, I(0, \infty),$$

(with Q = 2), the random intercepts and slopes in (10.8) are obtained as

$$b_i \sim N_Q(\Gamma h_i, D).$$

An alternative perspective (model 2) is provided by allowing changing skew through
time. This involves a Ti vector of period-specific skewness parameters $\delta_i = (\delta_1, \ldots, \delta_{T_i})$,
that is $\delta_{it} = \delta_t$, in a multivariate skew normal scheme for the observation level errors.
Hence

$$u_{it} \mid \sigma^2, w_{it}, \delta_t \sim N(\delta_t w_{it}, \sigma^2),$$

$$w_{it} \sim N(0, 1)\, I(0, \infty).$$

While centred positive variables hi and wit may be preferred for identification, this slows
MCMC analysis considerably and uncentred effects are used for illustration.
Estimates for model 1 using jagsUI show significant skewness in subject intercepts
b1i, but not in the time slopes, with the respective γ parameters having 95% intervals
(0.38,0.60) and (−0.26,0.23). The LOO-IC is obtained as −20.5 under both models. Under
model 2, the δt parameters all have credible intervals straddling zero, but earlier ones
are biased to positive values.

Example 10.10 Robust Random Effects for Epilepsy Data


This example considers forms of robust modelling for the seizure data discussed in
Example 10.7. For example, Yau and Kuk (2002) consider sensitivity of fixed effects
parameter estimates to specification of random effects at both subject and observation
level. Following their analysis, we condition on the initial observation, treated as a base-
line measure of severity; the five predictors (X1 to X5) are then: log of baseline seizure,
treatment, treatment interaction with baseline, log of patient age, and a binary variable
equal to 1 for the final visit. As Example 10.7 shows, including subject level random
effects alone does not eliminate excess dispersion.
Instead consider random variation at both subject and observation levels, with hierarchically
centred random effect priors (cf. Roberts and Sahu, 2001). So $y_{it} \sim \text{Po}(\mu_{it})$, with

$$\log(\mu_{it}) = u_{it} + X_{it}\beta,$$

$$u_{it} \sim N(b_i, \sigma_u^2),$$

$$b_i \sim N(\alpha, D),$$

where α is the regression intercept, and a U(0,100) prior is assumed for $D^{0.5}$. Additionally,
a uniform shrinkage prior (Natarajan and Kass, 2000) is adopted in relation to the other
variance component $\text{var}(u_{it}) = \sigma_u^2$, with

$$f = \frac{D}{D + \sigma_u^2} \sim U(0, 1).$$
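The uniform shrinkage prior links the two variance components through a one-to-one mapping, which the following Python sketch makes explicit. The function names are hypothetical, and the numerical values are used only for illustration; a plug-in value computed from posterior means is not the same as the posterior mean of the shrinkage ratio itself.

```python
def shrinkage(d, sigma2_u):
    """Shrinkage ratio f = D / (D + sigma_u^2), lying in (0, 1)."""
    return d / (d + sigma2_u)

def d_from_shrinkage(f, sigma2_u):
    """Invert the ratio: D = sigma_u^2 * f / (1 - f) for 0 < f < 1."""
    return sigma2_u * f / (1.0 - f)

# Illustrative values: D = 0.27 and sigma_u^2 = 0.13
f_plug_in = shrinkage(0.27, 0.13)
d_recovered = d_from_shrinkage(f_plug_in, 0.13)
```

A uniform prior on the ratio therefore implies a prior on one variance component conditional on the other, with equal prior weight on the between-subject variance dominating or being dominated.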

Estimation using jagsUI gives posterior means for $\sigma_u^2$ and D of 0.13 and 0.27. The posterior
mean of the scaled deviance is now 271, and a posterior predictive check is satisfactory,
providing a probability of 0.26 that the deviance involving replicate data exceeds
the deviance for the actual data. The LOO-IC is 1200.
One may also model subject intercept heterogeneity using a discrete mixture of inter-
cepts, so avoiding parametric assumptions about such heterogeneity. Thus

$$\log(\mu_{it}) = u_{it} + x_{it}\beta,$$

$$u_{it} \sim N(\alpha_{k_i}, \sigma_u^2),$$

where the latent categorical allocation $k_i \in (1, \ldots, K)$ is multinomial with probabilities
$(\pi_1, \ldots, \pi_K)$ following a diffuse Dirichlet prior. The intercepts $\alpha_1, \ldots, \alpha_K$ are subject to an
order constraint. For illustrative purposes, this approach is applied with K = 2, providing
a LOO-IC of 1224. The intercepts are estimated with posterior means (sd) 1.57 (0.15)
and 2.48 (0.19).
The third model adopts a selection mechanism for observation level effects uit, adapt-
ing to a scenario where many patients exhibit a stable differential over the visits (mod-
elled by a level 2 effect bi), with only a subset of patients exhibiting erratic trajectories
that require a random effect for each visit. Thus, binary indicators δi ~ Bern(πδ) are intro-
duced for each subject in a model where

$$\log(\mu_{it}) = \alpha + x_{it}\beta + b_i + \delta_i u_{it},$$

with a U(0,100) prior for $D^{0.5}$, and $f = D/(D + \sigma_u^2) \sim U(0, 1)$. πδ may be an unknown or
preset. Here a value πδ = 0.10 is adopted, so that the posterior values $\Pr(\delta_i = 1 \mid y)$ can
provide clear contrasts to the prior values $\Pr(\delta_i = 1) = \pi_\delta$. From a two-chain run of
10,000 iterations, it emerges that seven patients have sufficiently high posterior odds
$\Pr(\delta_i = 1 \mid y)/(1 - \Pr(\delta_i = 1 \mid y))$ to provide marginal Bayes factors exceeding 3, namely patients 10,
11, 16, 25, 39, 53, and 56. This model gives a LOO-IC of 1,215.
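The marginal Bayes factor for each patient is the ratio of posterior odds to prior odds of selection. A minimal Python sketch follows, with hypothetical function names and the preset prior probability of 0.10; it also shows the posterior probability needed to reach a given Bayes factor.

```python
def bayes_factor(post_prob, prior_prob=0.10):
    """Posterior odds divided by prior odds for the event that the indicator is 1."""
    post_odds = post_prob / (1.0 - post_prob)
    prior_odds = prior_prob / (1.0 - prior_prob)
    return post_odds / prior_odds

def min_posterior_prob(bf, prior_prob=0.10):
    """Posterior probability at which the Bayes factor equals bf."""
    odds = bf * prior_prob / (1.0 - prior_prob)
    return odds / (1.0 + odds)

threshold = min_posterior_prob(3.0)
bf_at_threshold = bayes_factor(threshold)
```

With a prior probability of 0.10, a Bayes factor above 3 corresponds to a posterior selection probability above 0.25.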
A fourth analysis reverts to random subject effects only, but allows for possible
non-normality via a mixed Dirichlet process. The random effects are bivariate, with
non-zero means, one for the intercept B1 (α in above models) and one for a linear slope
B2 on visit. Thus

$$\log(\mu_{it}) = X_i\beta + Z_{it}b_i,$$

where $X_i = (\text{Base}, \text{Tr}, \text{Base} \times \text{Tr}, \text{Age})$, with predictor variables defined as in Kleinman
and Ibrahim (1998), and with $Z_{it} = (1, \text{Visit})$, where the visit times are centred weeks/10.
The patient random effects have prior

$$(b_{1i}, b_{2i}) \sim G,$$

$$G \sim DP(\kappa G_0),$$

$$G_0 = N_Q(B, D),$$

$$\kappa \sim \text{Ga}(2, 4),$$

$$(B_1, B_2) \sim N_2(0, 1000I),$$

$$D^{-1} \sim W(R, r),$$

$$R = \text{diag}(20),$$

$$r = 10,$$

so that $E(D^{-1}) = \text{diag}(0.5)$, as in Kleinman and Ibrahim (1998). The maximum number of
possible clusters is set at Km = 20.
Conditional on the particular choice made for the prior on κ, one obtains a mean
scaled deviance of 417, still leaving excess variability. The posterior density for the
realised number of clusters has 0.025 and 0.975 percentiles at 6 and 13, with mean 8.8,
while κ has posterior mean 1.32. Histograms of the mean {b1i,b2i} with superimposed
normal curves show excess kurtosis (i.e. peaked densities) (Figures 10.3 and 10.4).

FIGURE 10.3
Histogram with normal curve, varying intercepts.

FIGURE 10.4
Histogram with normal curve, varying slopes.

10.7 Multilevel, Multivariate, and Multiple Time Scale Longitudinal Data


Applications involving longitudinal data often involve contextual nesting of subjects
(Steele, 2008; Lockwood et al., 2003), multivariate responses (Verbeke et al., 2014; Curran
et al., 2010), or multiple time scales. Consider data $y_{ijt}$ for repetitions $t = 1, \ldots, T_{ij}$ for subjects
$i = 1, \ldots, n_j$ nested within clusters $j = 1, \ldots, J$. The general linear mixed model assumes
that conditional on predictors and random effects, the data are distributed independently
according to the exponential family,

$$p(y_{ijt} \mid \theta_{ijt}, \phi) = \exp\left(\frac{y_{ijt}\theta_{ijt} - a(\theta_{ijt})}{\phi} + C(y_{ijt}, \phi)\right),$$

with conditional means $E(y_{ijt}) = \mu_{ijt} = a'(\theta_{ijt})$, and link $g(\mu_{ijt}) = \eta_{ijt}$ to regression terms ηijt.
The structural model may specify permanent random effects {dj,bij} for both clusters and
subjects within clusters. Fixed effect regression parameters may now be cluster specific,
namely

$$g[E(y_{ijt} \mid \beta_j, d_j, b_{ij})] = X_{ijt}\beta_j + W_{ijt}d_j + Z_{ijt}b_{ij}.$$

Taking βj as cluster fixed effects is appropriate when the categorisation $j = 1, \ldots, J$ refers to a
small number of treatment groups or demographic categories, as in the data from Oman et al.
(1999) where four groups are formed by crossing treatment by gender (see Example 10.11).
Where predictor effects vary randomly over time, one may now include time and cluster-specific
effects cjt, so that

$$g[E(y_{ijt} \mid \beta_j, d_j, c_{jt}, b_{ij})] = X_{ijt}\beta_j + W_{ijt}d_j + H_{ijt}c_{jt} + Z_{ijt}b_{ij}.$$

Autocorrelated errors may also be required to model temporal dependencies, so that the
unexplained variance may be due to a number of sources. For example, Lee and Hwang
(2000) consider a normal mixed effects model, applicable in growth curve applications
with multiple groups of subjects $j = 1, \ldots, J$, with

$$y_{ijt} = X_{ijt}\beta_j + b_{ij} + e_{ijt} + u_{ijt},$$

where

$$e_{ijt} = \rho e_{ij,t-1} + v_{ijt},$$

and the variances of bij, uijt, and vijt are subject to uniform shrinkage priors (see Section
10.2.3).
For nested longitudinal data inferences (e.g. on growth patterns) may be improved by
borrowing strength over clusters. Similarly, with longitudinal data on multiple outcomes,
ymit for subjects i = 1, … , n , outcomes m = 1, … , M , and repetitions t = 1, … , T , inferences
on particular outcomes may be strengthened by incorporating correlations between out-
comes. An example might be for longitudinal data on correlated, but relatively rare, spa-
tially configured health events, such as cancer types. Multiple outcome longitudinal data
are common in clinical and educational applications, and the effectiveness of interventions
may be judged in terms of multiple (usually) correlated outcomes rather than by a single
criterion (Dunson, 2007). In environmental applications, multiple outcomes with related
aetiology are likely to be correlated (e.g. Liu and Hedeker, 2006; Jorgensen et al., 1999).
With metric or discrete data ymit for multiple outcomes, the general linear mixed model
with time-homogeneous, time-varying, and subject-varying predictor effects becomes

$$g[E(y_{mit} \mid \beta_m, b_{mi}, c_{mt})] = \eta_{mit} = X_{it}\beta_m + Z_{it}b_{mi} + H_{it}c_{mt},$$

where Xit, Zit, and Hit are of length P, Q, and R. For example, consider multivariate repeated
binary responses,

$$y_{mit} \sim \mathrm{Bern}(\pi_{mit}),$$

and a prediction model based on outcome-subject and outcome-time effects (Agresti,
1997), namely

$$\mathrm{logit}(\pi_{mit}) = \beta_m + b_{mi} + c_{mt}.$$

Given multivariate random outcome-subject effects bmi, and fixed effects cmt subject to an
identifying corner constraint such as cm1 = 0 or cmT = 0, the ymit are assumed conditionally
independent.
The corresponding normal linear mixed model for multivariate metric responses is

$$y_{mit} \mid \beta_m, b_{mi}, c_{mt}, \sigma_m^2 = X_{it}\beta_m + Z_{it}b_{mi} + H_{it}c_{mt} + u_{mit},$$

where the residuals umit are typically iid normal with variances σm² specific to outcome m.
The M sets of permanent effect priors bmi, each of dimension Q, may be correlated between
predictors q within outcomes m, or between outcomes m within predictors q, or most
generally over both predictors q and outcomes m. The same applies to the outcome-time
effects cmt, which may be random, and incorporate short range temporal dependence. For

example, time-varying intercepts ct = (c1t , … , cMt ) in the case R = 1 (and Hit = 1) could follow
autoregressive or random walk priors correlated over outcomes, as in

$$(c_{1t}, \ldots, c_{Mt}) \sim N_M(c_{t-1}, \Sigma_c).$$

Suppose Tmi responses are observed on outcome m for subject i, with Si = Σm Tmi. In vector
form, the multivariate normal longitudinal model is then

$$Y_i = X_i\beta + Z_i b_i + H_i c_t + u_i,$$

where Yi is of length Si (Beckett et al., 2004).


Conjugate structures (e.g. Poisson-gamma, beta-binomial) may also be used instead of
the GLMM approach for discrete multivariate longitudinal outcomes. For example, over-
dispersed count data ymit may be assumed Poisson with

$$y_{mit} \sim \mathrm{Po}(\mu_{mit}\theta_{im}\xi_{mit}),$$

where μmit = exp(Xmit βm), and

$$\theta_{im} \sim \mathrm{Ga}(\alpha_m, \alpha_m),$$

represent subject-outcome permanent random effects. The ξmit represent observation level
effects that are iid, or autoregressive, as in

$$\xi_{mit} \sim \mathrm{Ga}(b_m \xi_{mi,t-1}, b_m)$$

with variance parameters bm (Jorgensen et al., 1999).
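A short worked step may help show why the gamma effects capture overdispersion. Assuming that Ga(α, α) denotes a gamma density with shape α and rate α, and setting the observation-level effects ξmit to 1 for simplicity (both are assumptions of this sketch), the marginal moments follow from the laws of total expectation and variance:

```latex
\begin{aligned}
E(\theta_{im}) &= \alpha_m/\alpha_m = 1, \qquad
\mathrm{Var}(\theta_{im}) = \alpha_m/\alpha_m^2 = 1/\alpha_m, \\
E(y_{mit}) &= \mu_{mit}, \qquad
\mathrm{Var}(y_{mit}) = \mu_{mit} + \mu_{mit}^2/\alpha_m .
\end{aligned}
```

Under these assumptions, smaller values of αm imply greater extra-Poisson variation, while large αm approaches the Poisson case.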

10.7.1 Latent Trait Longitudinal Models


As the number of responses M increases, the full dimensional approach becomes cumber-
some, and factor analytic or latent trait approaches may pool information just as effectively
and more parsimoniously (e.g. Roy and Lin, 2000; Jorgensen et al., 1999; Dunson, 2003,
2006, 2007). Longitudinal data on multiple outcomes raise the possibility of shared ran-
dom effects across outcomes, instead of outcome-specific effects. For example, in spatio-
temporal health applications, it is common to have correlated count responses ymit such as
different types of cancer or psychiatric illness for areas i. Observed risk factors for such
outcomes may be limited or incomplete. Common unobserved area-time risks may be
summarised in effects rit, with loadings λm linking the common factor scores to each out-
come. These may be taken as iid (Tzala and Best, 2008), or assumed to be spatially and/or
temporally correlated. For identifiability, one may either set var(rit) = 1, in which case the
loadings are free parameters, or set one of the loadings to a fixed value, such as λ1 = 1, in
which case var(rit) is an unknown. Time and area common effects may be combined with
common area effects bi with loadings γm, and common time effects ct with loadings κm, and
the same type of identifying rules. Then ymit ~ Po(μmit), with fixed regression effects that
might vary over outcomes, as in

$$\log(\mu_{mit}) = X_{it}\beta_m + \lambda_m r_{it} + \gamma_m b_i + \kappa_m c_t.$$



Additionally, iid effects umit may be included to represent remaining overdispersion.


In item analysis and psychometric longitudinal applications, a measurement model might
involve both constant and time-varying common factors. Thus for N items or tests carried
out on S occasions, responses may be determined by the interaction of M < N item specific
factors bmi and by T < S time-specific factors rit (Marsh and Grayson, 1994). The impact of
these is governed by time and outcome-specific loadings γt and λm respectively, so that

$$y_{mit} = \alpha_m + \gamma_t b_{mi} + \lambda_m r_{it} + u_{mit}.$$

Structural equation models for longitudinal data typically involve both response indica-
tors ymit of dimension Py which measure latent outcomes ηqit of dimension Qy < Py, and
exogenous predictors xkit of dimension Px which measure latent causal influences ξqit of
dimension Qx < Px (Dunson, 2007). For example, for Qy = Qx = 1, ξit might be a time-varying
stress severity scale related to short-term stressors {xkit, k = 1, …, Px}, and ηit might be a
time-varying latent depression scale related to mood scale measures {ymit, m = 1, …, Py}.
Then the measurement model is

$$y_{mit} = \alpha_{1m} + \lambda_{1m}\eta_{it} + u_{1mit}, \quad m = 1, \ldots, P_y,$$

$$x_{kit} = \alpha_{2k} + \lambda_{2k}\xi_{it} + u_{2kit}, \quad k = 1, \ldots, P_x,$$

while the structural model might include a linear effect, possibly time-varying, of ξit on ηit.
A simple common factor model may be applied when there are alternative measuring
scales, typically a gold standard measure, and one or more measures of the same quan-
tity, but less expensive to obtain. Consider a situation where bivariate data { y1ijt , y 2ijt } are
obtained for subjects i within clusters j, where y1ijt denotes repetitions on the standard mea-
sure, and y2ijt denotes repetitions on the proxy measure. The goal is to assess the reliability
of the proxy measure. One may postulate a shared permanent effect bij between the two
outcomes, as well as a unique permanent effect cij for the proxy measure. In the absence of
intercepts for the y1 model, one has

$$y_{1ijt} = b_{ij} + u_{1ijt},$$

$$y_{2ijt} = \alpha_j + \lambda_j b_{ij} + c_{ij} + u_{2ijt},$$

where the bij have non-zero cluster means Bj, but the cij are zero mean effects, namely:

$$b_{ij} \sim N(B_j, D_{1j}),$$

$$c_{ij} \sim N(0, D_{2j}).$$

The residuals are distributed as umijt ~ N(0, 1/τmj). The hypothesis {αj = 0, λj = 1}
corresponds to y1 and y2 being identically calibrated in group j (Oman et al., 1999, p.43), that is,
they both measure the same quantity on the same scale (see Example 10.11).

10.7.2 Multiple Scale Longitudinal Data


Aggregate health and demographic event data are often available as totals yixt for mul-
tiple time scales, for example, by age group x = 1, … , X , as well as by period t = 1,…,T, and

possibly also by area or actuarial risk group i = 1, … , n (Chernyavskiy et al., 2019). A fur-
ther cohort dimension c = 1, … , C is implicit in biological age-time data via the relation
c = t - x + X , and there have been extensive developments in Bayesian age-period-cohort
(APC) and area APC (AAPC) models (Lagazio et al., 2003; Schmid and Held,
2004; Bray, 2002; Baker and Bray, 2005), which draw on developments in space-time mod-
els involving spatial and temporal autocorrelation (Quick et al., 2018; Rushworth et al.,
2014; Donald et al., 2015). For rare event totals yixt in relation to large populations Nixt, and
assuming yixt ~ Po(Nixt μixt), a baseline age-period (AP) model might assume independence
of age and period dimensions, with

$$\mu_{ixt} = \exp(\eta_{ix})\exp(\theta_{it}),$$

or equivalently

$$\log(\mu_{ixt}) = \kappa + \eta_{ix} + \theta_{it},$$

where structured (e.g. random walk or autoregressive) priors might be adopted for age-area
effects ηix and area-time effects θit, and the intercept κ is identified according to possible
constraints on the random effects.
Thus Clayton and Schifflers (1987) consider data of the form yxt (i.e. without further strat-
ification), with means μxt where

$$\log(\mu_{xt}) = \eta_x + \theta_t,$$

with both sets of effects assumed to be random, though fixed effects may be used when
X or T is small. In the absence of an overall intercept in this model, one or other series
(say ηx) sets the level, and identifiability may be gained by centring the remaining series
θt at zero (possibly repeatedly at each MCMC iteration), or by setting one parameter in the
remaining series to a fixed value e.g. θ1 = 0. If the model includes an overall intercept κ, then
centring both sets of effects, namely Σx ηx = Σt θt = 0, provides a way of ensuring
identifiability. An APC model including a mean and structured age, period and cohort effects is

$$\log(\mu_{xt}) = \kappa + \eta_x + \theta_t + \gamma_c,$$

and identifiability requires either that the three sets of effects be centred, or that edge
constraints such as η1 = θ1 = γ1 = 0 are used to avoid confounding of the three series.
Additionally, the relation c = X − x + t means an extra constraint is needed for full
identification, for example, by taking γ1 = γ2 = 0 (Clayton and Schifflers, 1987).
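The linear dependence behind the extra constraint can be seen by tabulating the cohort index. The Python sketch below assumes age groups and periods of equal width (so that diagonal cells share a cohort) and uses a hypothetical 4 × 3 age-period table; the function name is illustrative only.

```python
def cohort_index(x, t, X):
    """Cohort index c = t - x + X for age group x (1..X) and period t (1..T)."""
    return t - x + X


X, T = 4, 3  # hypothetical numbers of age groups and periods

# Rows are age groups x = 1..X, columns are periods t = 1..T.
table = [[cohort_index(x, t, X) for t in range(1, T + 1)]
         for x in range(1, X + 1)]

# Distinct cohorts run from 1 (oldest age, first period)
# to X + T - 1 (youngest age, last period).
n_cohorts = len({c for row in table for c in row})
```

Because c − t + x equals the constant X in every cell, any linear trend can be moved between the age, period and cohort series without changing the fitted means, which is why centring or corner constraints alone do not fully identify the model.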
The convolution prior of Besag et al. (1991) may be generalised by adopting structured
and iid effects for each time scale, as well as for areas (Knorr-Held, 2000). Hence an APC
model would then become

$$\log(\mu_{xt}) = \kappa + \eta_x + \theta_t + \gamma_c + u_{1x} + u_{2t} + u_{3c},$$

where u1x, u2t and u3c are iid zero mean random effects, while {ηx, θt, γc} follow a structured
(i.e. random walk or other autoregressive) form. For area-age-period data, yixt ~ Po(Nixt μixt),
this approach leads to

$$\log(\mu_{ixt}) = \kappa + \eta_x + \theta_t + \gamma_c + s_i + u_{1x} + u_{2t} + u_{3c} + u_{4i},$$



where si follows a structured spatial autoregressive prior, but the u4i are iid zero mean
random effects.
In the preceding models, the dimensions are independent and multiplicative in the risk
scale (additive in the log risk scale). In practice, interactions between one or more of the
different time scales, or between the time scales and the units (e.g. areas or actuarial risk
groups), are likely. Interactions ψxc between age and cohort are relevant if the age slope is
changing between cohorts (e.g. cancer deaths at younger ages are less common in recent
cohorts), while in mortality forecasting, age-time interactions ψxt are of interest, since dif-
ferent age groups may be subject to different mortality improvements (Pedroza, 2006; Lee
and Carter, 1992). In area APC models, area-cohort and area-time interactions might be
relevant (Lagazio et al., 2003), while in area life table models (Congdon, 2006), age-area
interactions may be investigated, since deprived areas may have relatively high “prema-
ture” mortality (sometimes defined by death before age 75).
In area-time (spatio-temporal) models, one may extend the RIAS principle, and assume
area-specific random variation for both the level and a time covariate. This amounts to
taking the interaction ψit as a linear trend model, with neighbouring areas having similar
trend parameters, as in Bernardinelli et al. (1995). Thus with yit ~ Po(Nit μit),

$$\log(\mu_{it}) = \kappa + \omega_{1i} + \omega_{2i}(t - \bar{t}),$$

where ω1i and ω2i are spatially correlated over areas. One may further adopt a bivariate
spatial (e.g. bivariate CAR) prior for {ω1i, ω2i}, allowing level and trend parameters to be cor-
related. Additionally, a convolution form may be adopted both for level and trend, so that

$$\log(\mu_{it}) = \kappa + \omega_{1i} + u_{1i} + (\omega_{2i} + u_{2i})(t - \bar{t}),$$

where u1i and u2i are iid random effects. Equivalently, letting cji = ωji + uji one has

$$\log(\mu_{it}) = \kappa + c_{1i} + c_{2i}(t - \bar{t}).$$

A variation is to introduce an overall nonlinear trend via parameters δt, along with time
specific spatial and iid effects {ωit,uit}, and stationary AR1 dependence in the total lagged
spatial effect cit = ωit + uit (Martinez-Beneito et al., 2008). Thus for t ≥ 2,

$$\log(\mu_{it}) = \kappa + \delta_t + c_{it} + \rho c_{i,t-1},$$

with ρ ∈ (−1, 1), while for t = 1,

$$\log(\mu_{i1}) = \kappa + \delta_1 + \frac{c_{i1}}{(1 - \rho^2)^{0.5}}.$$

This is equivalent to assuming

$$\log(\mu_{it}) = \kappa + \delta_t + \rho^{t-1}(1 - \rho^2)^{-0.5} c_{i1} + \sum_{k=2}^{t} \rho^{t-k} c_{ik},$$

where the last term is zero when t = 1.
In area-age-time models for mortality counts yixt ~ Po(μixt), area-age-time interactions
ψixt may be parsimoniously modelled by separate linear time trends for each age and area,
namely

$$\psi_{ixt} = (\omega_{1x} + \omega_{2i})(t - \bar{t}),$$



as in Sun et al. (2000), where the random coefficients ω1x and ω2i may be structured over
ages and areas respectively. Sun et al. (2000) actually assume a spatial CAR(ρ) prior with
mean zero for the ω2i (section 6.3.3), but take the ω1x to be unrelated fixed effects. The full
model of Sun et al. (2000) also includes iid age-area-time effects, uixt, so that

$$\log(\mu_{ixt}) = \kappa + s_i + \eta_x + (\omega_{1x} + \omega_{2i})(t - \bar{t}) + u_{ixt}.$$

Alternatively, the time function in ψixt may be unknown, as in

$$\psi_{ixt} = (\omega_{1x} + \omega_{2i})\delta_t,$$

where positive loadings ω1x and ω2i specify which ages and areas are most sensitive to trend effects δt.
For identification, the δt are centred at zero or have a corner constraint such as δ1 = 0, and the
loadings ω1x and ω2i may be centred at 1, constrained to sum to 1, or have a minimum of 1.
So, for declining mortality, represented by δt following (say) a 1st order random walk, larger
ω1x and ω2i indicate which age groups and areas contribute most to the mortality decline.
Lee and Carter (1992) apply the age-time product model ψxt = ωx δt in mortality forecasting,
with identification obtained by ensuring that the δt sum to zero, and that the ωx sum to 1.
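The sum-to-zero and sum-to-one rules can be imposed after the fact without altering fitted values. The following Python sketch assumes a log rate of the form ηx + ωx δt, with age effects ηx absorbing the shift; the function name and the numbers are hypothetical.

```python
def lee_carter_normalise(eta, omega, delta):
    """Rescale so that sum(delta) = 0 and sum(omega) = 1,
    leaving every eta[x] + omega[x] * delta[t] unchanged."""
    d_bar = sum(delta) / len(delta)   # mean of the time effects
    s = sum(omega)                    # current sum of the age loadings
    eta_new = [e + w * d_bar for e, w in zip(eta, omega)]  # absorb the shift
    omega_new = [w / s for w in omega]                     # loadings sum to 1
    delta_new = [s * (d - d_bar) for d in delta]           # time effects sum to 0
    return eta_new, omega_new, delta_new


eta, omega, delta = [1.0, 2.0], [0.5, 1.5], [1.0, 2.0, 6.0]
eta_n, omega_n, delta_n = lee_carter_normalise(eta, omega, delta)
```

The same transformation could be applied at each MCMC iteration to sampled values, which is one way of reading the repeated centring mentioned earlier for the age-period model.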
Interaction priors may also be based on a Kronecker product of the structure matrices
for the relevant dimensions (Knorr-Held, 2000; Clayton, 1996), where a structure matrix is
a constituent part of the precision (inverse covariance) matrix. For example, if the structure
matrices of separate area and age effects are denoted Ks and Kx, then Ksx = Ks ⊗ Kx defines
the structure matrix for the joint prior for ψix, and conditional priors on ψix can be obtained
from Ksx. Thus an RW1 prior in age has a structure matrix with off-diagonal elements
Kx[ab] = −1 if ages a and b are adjacent, and Kx[ab] = 0 otherwise. Diagonal elements are 1 if
a = b = 1 or a = b = X, and equal 2 for other diagonal terms. An RW2 prior for age has struc-
ture matrix

$$K_x = \begin{bmatrix}
1 & -2 & 1 & & & & \\
-2 & 5 & -4 & 1 & & & \\
1 & -4 & 6 & -4 & 1 & & \\
 & \ddots & \ddots & \ddots & \ddots & \ddots & \\
 & & 1 & -4 & 6 & -4 & 1 \\
 & & & 1 & -4 & 5 & -2 \\
 & & & & 1 & -2 & 1
\end{bmatrix}.$$
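Both random walk structure matrices can be generated as D'D, where D is the first or second difference matrix. The Python sketch below uses this standard construction to reproduce the RW1 and RW2 patterns described above; the function name is an assumption of the example.

```python
def rw_structure(X, order):
    """Structure matrix K = D'D for a random walk of order 1 or 2 over X ages."""
    stencil = [-1, 1] if order == 1 else [1, -2, 1]  # difference coefficients
    rows = X - order
    # D has one row per differenced term
    D = [[0] * X for _ in range(rows)]
    for r in range(rows):
        for k, v in enumerate(stencil):
            D[r][r + k] = v
    # K[a][b] = sum over rows of D[r][a] * D[r][b]
    return [[sum(D[r][a] * D[r][b] for r in range(rows)) for b in range(X)]
            for a in range(X)]


K_rw1 = rw_structure(4, order=1)
K_rw2 = rw_structure(6, order=2)
```

Each row of either matrix sums to zero, reflecting the fact that these priors are improper with respect to the overall level (and also the linear trend, for RW2).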
Similarly, the CAR(1) prior for spatially structured errors s = (s1 , … , sn ) based on adjacency
of areas is multivariate normal with precision matrix τsKs, where τs is an overall precision
parameter, and off-diagonal terms Ks[ij] = −1 if areas i and j are neighbours, and Ks[ij] = 0
for non-adjacent areas. The diagonal terms in Ks are Li where Li is the cardinality of area i
(its total number of neighbours). Then an area-age interaction effect ψix formed by crossing
an RW1 age prior with a CAR(1) spatial effect has joint precision

$$\frac{1}{\sigma_\psi^2} K_s \otimes K_x,$$

and full prior conditionals with variances σψ²/Li when x = 1 or x = X, and σψ²/(2Li) otherwise.
With ∂i denoting the neighbourhood of area i, the prior conditional means Ψix for
ψix are

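$$\Psi_{i1} = \psi_{i2} + \sum_{j \in \partial_i} \psi_{j1}/L_i - \sum_{j \in \partial_i} \psi_{j2}/L_i,$$

$$\Psi_{ix} = 0.5(\psi_{i,x-1} + \psi_{i,x+1}) + \sum_{j \in \partial_i} \psi_{jx}/L_i - \sum_{j \in \partial_i} (\psi_{j,x+1} + \psi_{j,x-1})/(2L_i), \quad 1 < x < X,$$

$$\Psi_{iX} = \psi_{i,X-1} + \sum_{j \in \partial_i} \psi_{jX}/L_i - \sum_{j \in \partial_i} \psi_{j,X-1}/L_i.$$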

For identification, the ψix should be doubly centred at each iteration (over areas for a given
age x, and over ages for a given area i).
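The Kronecker construction and the double centring step can be checked numerically. The Python sketch below builds Ks ⊗ Kx for a hypothetical map of three areas in a line and four age groups, and centres a small table of interaction values; the helper names and all inputs are assumptions of the example.

```python
def car_structure(neighbours):
    """CAR(1) structure matrix: L_i on the diagonal, -1 for adjacent areas."""
    n = len(neighbours)
    K = [[0] * n for _ in range(n)]
    for i, nb in enumerate(neighbours):
        K[i][i] = len(nb)
        for j in nb:
            K[i][j] = -1
    return K


def rw1_structure(X):
    """RW1 structure matrix: 1 at the two ends, 2 elsewhere, -1 for adjacent ages."""
    K = [[0] * X for _ in range(X)]
    for a in range(X):
        K[a][a] = 1 if a in (0, X - 1) else 2
        if a + 1 < X:
            K[a][a + 1] = K[a + 1][a] = -1
    return K


def kronecker(A, B):
    """Kronecker product; row index is i * len(B) + k (area-major ordering)."""
    return [[A[i][j] * B[k][l] for j in range(len(A[0])) for l in range(len(B[0]))]
            for i in range(len(A)) for k in range(len(B))]


def double_centre(psi):
    """Centre psi[i][x] over areas for each age and over ages for each area."""
    n, X = len(psi), len(psi[0])
    row = [sum(r) / X for r in psi]
    col = [sum(psi[i][x] for i in range(n)) / n for x in range(X)]
    grand = sum(row) / n
    return [[psi[i][x] - row[i] - col[x] + grand for x in range(X)] for i in range(n)]


neighbours = [[1], [0, 2], [1]]   # hypothetical map: three areas in a line
K_sx = kronecker(car_structure(neighbours), rw1_structure(4))
psi_c = double_centre([[1.0, 2.0, 3.0, 4.0], [2.0, 0.0, 1.0, 5.0], [0.0, 1.0, 1.0, 2.0]])
```

The diagonal of K_sx equals Li at the first and last ages and 2Li otherwise, which matches the conditional variances stated above.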

Example 10.11 Alternative Measures of Creatinine Clearance


Oman et al. (1999) compare a standard measure of creatinine clearance (MCC) with a
proxy measure ECC for 113 patients with 437 clinic visits. MCC is obtained as the
amount of creatinine (CR24) excreted in the urine over 24 hours, divided by the
creatinine concentration (SERUMCR) and by the number of minutes in the period, namely

$$\mathrm{MCC} = \mathrm{CR24}/(\mathrm{SERUMCR} \times 60 \times 24).$$

ECC is obtained from patient age and weight WT as

$$\mathrm{ECC} = (140 - \mathrm{Age}) \times \mathrm{WT}/(\mathrm{SERUMCR} \times 60 \times 24),$$

with a further scaling by 0.85 for women only. There are four patient groups formed
by crossing gender with whether third-space body fluids were present on at least one
visit. The J = 4 groups are then (1 = female, no fluids; 2 = female, fluids; 3 = male, no flu-
ids; 4 = male, fluids), with group sizes n = (51, 12, 41, 9) and total visits within groups
N = (211, 42, 148, 36) .
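The two clearance measures can be coded directly from the formulas as printed above. This Python sketch is illustrative only: the function and argument names are assumptions, units follow whatever convention the data use, and the input values in the example are invented.

```python
import math

MINUTES_PER_DAY = 60 * 24


def mcc(cr24, serumcr):
    """Measured clearance: 24-hour creatinine over (concentration x minutes)."""
    return cr24 / (serumcr * MINUTES_PER_DAY)


def ecc(age, wt, serumcr, female):
    """Estimated clearance from age and weight, scaled by 0.85 for women."""
    value = (140 - age) * wt / (serumcr * MINUTES_PER_DAY)
    return 0.85 * value if female else value


# Responses in the model are on the log scale (invented inputs).
y1 = math.log(mcc(cr24=1440.0, serumcr=1.0))
y2 = math.log(ecc(age=40, wt=72.0, serumcr=1.0, female=True))
```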
The repeated responses for patients i = 1,… , n j within groups j = 1,… , J are
y1ijt = log(MCCijt ) and y 2ijt = log(ECCijt ). The model involves a patient-group common
factor bij, and a unique factor cij, for each outcome-cluster pair, namely

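$$\begin{aligned}
y_{1ijt} &= b_{ij} + u_{1ijt}, \\
y_{2ijt} &= \lambda_j b_{ij} + c_{ij} + u_{2ijt}, \\
b_{ij} &\sim N(B_j, D_{1j}), \\
c_{ij} &\sim N(0, D_{2j}), \\
u_{mijt} &\sim N(0, 1/\tau_{mj}).
\end{aligned}$$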

The intercepts for y2 are represented by the product of loadings and B-coefficients.
Identification issues are lessened by the fact that period 1 defines the direction of the bij
effects. Gamma priors with index and shape parameters of unity are assumed for the
precisions {1/Dmj, τmj}, and N(0,1000) priors for the fixed effects {λj, Bj}.

Estimation via jagsUI provides posterior means (95% CRI) for λj of 1.03 (1.01, 1.06), 1.06
(0.98, 1.14), 1.02 (0.99, 1.04), and 1.05 (0.96, 1.15). The representation adopted avoids including
weakly identified separate intercepts for y2, and the results support identical calibration
except for the first cluster, where there is a high probability that the λ coefficient
exceeds 1. This probability is obtained from monitoring the node step.lambda[1:4] in the
rjags code. These conclusions are unaffected by adopting a robust Student t (with preset
d.f. = 4) option for the bij effects.

Example 10.12 Mortality Change, with Area and Age Dimensions


This example considers age-area interactions in mortality level and trend, and how
these can be modelled parsimoniously. Consider deaths and population data {dixt, Pixt} for
areas i = 1, …, n, ages x = 1, …, X, times t = 1, …, T. One may assume binomial sampling,

$$d_{ixt} \sim \mathrm{Bin}(P_{ixt}, \mu_{ixt}),$$

with a logit link for the mortality rates μixt. The application here involves annual
deaths to male white non-Hispanics over the period 1999–2014 (T = 16 years), in n = 51
US states (including District of Columbia). Deaths dixt and SEER population data Pixt
are for X = 13 age bands (<1, 1–4, 5–9, 10–14, 15–19, 20–24, 25–34, 35–44, …, 75–84, 85+).
Recent research (Squires and Blumenthal, 2016; Case and Deaton, 2015) reports an
unexpected rise in death rates among middle-aged, white Americans between 1999
and 2014; see also www.commonwealthfund.org/publications/issue-briefs/2016/jan/mortality-trends-among-middle-aged-whites.
Thus, one might adopt a linear trend model (model 1) with independent age and area
impacts (ηx and ri) on the mortality level, and parallel effects (ρ1x and ρ2i) on the trend
also. This leads to

$$\mathrm{logit}(\mu_{ixt}) = \kappa + \eta_x + r_i + (\rho_{1x} + \rho_{2i})(t - \bar{t}) + u_{ixt},$$

where the intercept κ is assigned a normal N(0,1000) prior, the area effects ri follow the
Leroux et al. (1999) prior allowing for spatial dependence, and age effects ηx follow a nor-
mal first order random walk. The associated conditional precisions (τr,τη) are assigned
gamma Ga(1,0.01) priors. The ρ1x and ρ2i linear trend coefficients are taken to be iid nor-
mal random effects with zero means, and precisions τρ1 and τρ2 that are assigned gamma
Ga(1,0.01) priors. Model 1 allows for miscellaneous departures (e.g. age-area interactions
in level and trend) from a linear trend by adding iid normal errors uixt ~ N(0, 1/τu) for
each observation, where τu ~ Ga(1, 0.01).
Hierarchical centring is applied with uixt ~ N(ηx + ri + (ρ1x + ρ2i)(t − t̄), 1/τu), with the
spatial effects additionally centred around the overall intercept κ. For the continental
states (i = 1,…,49), the prior is then

$$r_i \sim N\left( \kappa + \frac{\lambda \sum_{j \in L_i} (r_j - \kappa)}{1 - \lambda + \lambda D_i}, \; \frac{1}{\tau_r [1 - \lambda + \lambda D_i]} \right),$$

where λ ∈ [0, 1] measures spatial dependence, Li denotes the locality of area i (i.e. the set
of states adjacent to state i), and there are Di states in that locality. For the remaining two
states (Alaska, Hawaii) without neighbours, for which Di = λ = 0, one specifies

$$r_i \sim N(\kappa, 1/\tau_r).$$
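The way λ interpolates between unstructured and fully spatial priors can be seen by evaluating the conditional moments directly. The Python sketch below codes the conditional mean and variance of the centred prior above; the function name and the numerical inputs are hypothetical.

```python
def leroux_conditional(r_neighbours, kappa, lam, tau_r):
    """Conditional prior mean and variance of r_i given neighbouring effects.

    r_neighbours: current values r_j for the D_i areas adjacent to area i
    kappa: overall intercept around which effects are centred
    lam: spatial dependence parameter in [0, 1]
    tau_r: conditional precision parameter
    """
    D = len(r_neighbours)
    denom = 1.0 - lam + lam * D
    mean = kappa + lam * sum(r_j - kappa for r_j in r_neighbours) / denom
    variance = 1.0 / (tau_r * denom)
    return mean, variance


# lam = 1: mean is the neighbour average, variance is 1 / (tau_r * D_i)
spatial = leroux_conditional([1.0, 3.0], kappa=0.0, lam=1.0, tau_r=2.0)
# lam = 0: the exchangeable prior N(kappa, 1 / tau_r)
unstructured = leroux_conditional([1.0, 3.0], kappa=0.5, lam=0.0, tau_r=4.0)
```

With an empty neighbour list and lam = 0 the function returns (kappa, 1/tau_r), matching the prior used for the two states without neighbours.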

[Figure 10.5 plot: posterior mean with 2.5% and 97.5% limits of the linear slope (vertical axis, −0.05 to 0.03) by age band (<1 to 85+).]

FIGURE 10.5
Linear trend slopes for mortality by age band, white non-Hispanic males, 1999–2014.

By contrast, model 2 allows explicit age-area interactions in trend. Thus

$$\mathrm{logit}(\mu_{ixt}) = \kappa + \eta_x + r_i + \rho_{ix}(t - \bar{t}) + u_{ixt},$$

where priors on κ, ri, ηx and (τr,τη) are as for Model 1.


Inferences for both models (using rube) are based on two-chain runs of 2,000 iterations,
with thinning to retain every other sample. The DIC in fact prefers the less heavily param-
eterised model 1, namely 90,665 as against 90,807, so supporting independent age and area
effects on the time evolution of US white non-Hispanic male mortality. Figure 10.5, based
on Model 1, shows the most significant adverse trend (in terms of positive ρ1x, namely
mortality increase) to be for age groups 7 and 9 (namely 25–34 and 45–54).

10.8 Missing Data in Longitudinal Models


Attrition and intermittently missing data are frequently found in longitudinal data, and
disregarding the process underlying such missingness may lead to biased and inefficient
estimates of parameters in the outcome model, though different mechanisms may apply
for attrition as opposed to intermittent missingness (Ma et al., 2005). In particular, miss-
ingness may be non-ignorable, meaning that the probability of a missing observation or
of permanent drop-out is associated with the value or values of the variable that would
otherwise have been observed (Troxel et al., 1998). Thus, in clinical trials, patients may
drop out because of adverse treatment effects, or because they don’t feel the treatment is
of benefit, leading to biased estimates of treatment effects unless the missing data mecha-
nism is allowed for.
Missingness generates an additional form of binary (or sometimes categorical) data R,
depending on whether responses Y and/or predictor variables X are missing. Li et al. (2007)

obtain a categorical (trinomial) missing data indicator by distinguishing between intermittent
and permanent missingness. Similarly, if missing data is entirely due to attrition,
it may be summarised in a single multinomial indicator Ri = j if an individual drops out
between the (j − 1)th and jth measurement (Hedeker and Gibbons, 2006, p.290; Fitzmaurice
et al., 2004, section 14.4). In pattern mixture models for attrition, the dropout pattern may
be summarised in various ways, most simply via a binary variable contrasting completers
against dropouts, regardless of when the dropout occurred (Hedeker and Gibbons, 1997).
Finally, for longitudinal datasets with continuous measurement at differing observation
times ait, one may record actual dropout times Ui (Hogan et al., 2004), and use these in the
model for the observed Y.
However, initially consider binary indicators Rit = 1 when a response variable Yit is miss-
ing, whether intermittently or permanently, and Rit = 0 for when the response is observed.
Further, let Y = (Yobs , Ymis ) denote the observed and unobserved response data. The totality
(R,Y) is sometimes known as the complete or full data (Ibrahim et al., 1999; Daniels and
Hogan, 2008, p.89).
How one deals with missing data depends on the generating mechanism assumed. Two
broad missing data schemes (the selection approach and the pattern mixture approach)
involve a different conditioning for the joint density P(Y, R | θY, θR) of the responses and the
missingness indicators. The pattern mixture model (Little, 1993) starts with a model for the
missing data P(R | θR), and models Y conditional on R, namely P(Y | R, θY), so that
P(Y, R | θY, θR) = P(Y | R, θY) P(R | θR). When dropout
times are discrete, the model for P(R | θR) is often not specified (Hogan et al., 2004), or when
missingness is expressed in various dropout patterns, P(R | θR) may be specified simply
by the relevant multinomial probabilities of different dropout options (Curran et al., 2002,
p.13). By contrast, the selection model (Diggle and Kenward, 1994) starts with the data
likelihood P(Y | θY), and models missingness conditional on the responses, P(R | Y, θR), so that
P(Y, R | θY, θR) = P(R | Y, θR) P(Y | θY).
A classification of missingness mechanisms is set out by Little and Rubin (2002), and
framed in terms of the selection approach, though it is also applicable to pattern mixture
analysis. They distinguish between

a) missingness completely at random (abbreviated as MCAR), when the probability
Pr(R = 1 | Y) of a missing response is independent of both observed and missing
data Y = (Yobs, Ymis), namely P(R | Y) = P(R);
b) missingness at random (MAR), when missingness is independent of the unobserved
data Ymis, but may depend on observed data Yobs, such as when the chance
that Rit = 1 depends on preceding non-missing observations yi,t−s; in this case one
has the simplification P(R | Y) = P(R | Yobs);
c) missingness not at random (MNAR), when the probabilities of missingness
depend on unobserved missing responses, namely P(R | Y) = P(R | Yobs, Ymis). Since
the data are partly missing, and R now depends on the complete outcome data
(Yobs, Ymis), the selection model factors the joint distribution into a complete
outcome model, and a missing-data mechanism given the partially unobserved
complete outcomes (Troxel et al., 2004).
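The three mechanisms differ only in which responses are allowed to enter the missingness probability. The Python sketch below makes this concrete with a logit form; the function name, the coefficient names g1, g_obs and g_mis, and their default values are assumptions chosen for illustration.

```python
import math


def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-z))


def prob_missing(mechanism, y_prev, y_curr, g1=-1.0, g_obs=0.8, g_mis=0.8):
    """Pr(R_it = 1) under each mechanism.

    y_prev: preceding (observed) response; y_curr: current, possibly missing, response.
    """
    if mechanism == "MCAR":
        return expit(g1)                                   # depends on no responses
    if mechanism == "MAR":
        return expit(g1 + g_obs * y_prev)                  # observed history only
    if mechanism == "MNAR":
        return expit(g1 + g_obs * y_prev + g_mis * y_curr)  # also the unseen value
    raise ValueError("mechanism must be 'MCAR', 'MAR' or 'MNAR'")


p_mar = prob_missing("MAR", y_prev=1.0, y_curr=5.0)
p_mnar = prob_missing("MNAR", y_prev=1.0, y_curr=5.0)
```

In this parameterisation, a nonzero g_mis is what makes the mechanism non-random, since y_curr is unavailable whenever R_it = 1.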

An additional distinction is made between ignorable and non-ignorable missingness.
Assume a MAR mechanism, and that the missing-data model is independent of the
response data parameters θY. Then the missing-data process is ignorable in the sense that
a model for missingness is not needed in order to make valid inferences from the main

Y-likelihood (Rubin, 1976; Fichman and Cummings, 2003). However, for non-ignorable
missingness, both the R-likelihood and Y-likelihood must be modelled.
As an illustration, drop-out at time t is classed as being at random if
Pr(Rt = 1 | Y) = Pr(Rt = 1 | Y1, …, Yt−1), namely when the missingness probability is related to
lagged observed responses. However, if the probability of missingness at time t is related
also to the current outcome Yt, possibly missing, so that Pr(Rt = 1 | Y) = Pr(Rt = 1 | Y1, …, Yt),
then missingness is non-random or informative (Diggle and Kenward, 1994). In practice,
informative missingness is assessed empirically, and would require a significant effect of
(possibly missing) Yit on πit = Pr(Rit = 1) in a binary regression also involving other
influences on missingness, with the regression taken over subjects i and repetitions t = 1, …, Ti.
For dropouts, one takes Ti = Ti,obs + 1 where Ti,obs is the last interval where data on subject i
was obtained (Roy and Lin, 2002). Since MNAR missingness can never be excluded as a
generating mechanism, a sensitivity analysis under different mechanisms may be consid-
ered (Kenward, 1998). This means estimating the model under a “range of assumptions
about the non-ignorability parameters and assessing the impact of these parameters on
key inferences” (Ma et al., 2005).
A common set of predictors Xit may be relevant to modelling both the data Yit, and
missingness indicators Rit, or different predictors Wit may be used in the R model. King
(2001) accordingly presents a statement of the MCAR-MAR-MNAR alternatives as above,
but replacing Y by D = (Y,X), namely predictor and outcome data combined, and where
D = (Dobs,Dmis) denotes the subdivision of the data according to observation status. For
example, the MCAR assumption then requires P(R|D) = P(R), while missingness at random
requires P(R|D) = P(R|Dobs ) .
An alternative less stringent definition of MCAR missingness is used by Little (1995),
in which missingness is independent of Y, whether observed or not, but may depend on
fully observed covariates X (Curran et al., 2002, p.12; Daniels and Hogan, 2008, p.92). Such
covariates might for instance include time, as missingness rates often increase at later
stages of longitudinal studies (Hedeker and Gibbons, 2006, p.281). So, given Xobs, R is independent
of both Yobs and Ymis, leading to what is sometimes termed “covariate dependent MCAR
missingness.”

10.8.1 Forms of Missingness Regression (Selection Approach)


A logit or probit regression is the most common approach to predicting πit = Pr(Rit = 1),
and to assessing ignorability and MCAR assumptions. For example, πit might at a mini-
mum be a function of immediately preceding and current Y values, namely (Curran et al.,
2002; Mazumdar et al., 2007)

$$\mathrm{logit}(\pi_{it}) = \gamma_1 + \gamma_2 y_{it} + \gamma_3 y_{i,t-1},$$

with a significant γ2 indicating non-ignorable missingness. Refinements, especially in
problems with intermittently missing data, include transition probability approaches (Li
et al., 2007) with the model for

$$\pi_{i01t} = \Pr(R_{it} = 1 \mid R_{i,t-1} = 0),$$

having distinct parameters from that for

$$\pi_{i11t} = \Pr(R_{it} = 1 \mid R_{i,t-1} = 1).$$



If missingness is restricted to dropout only (i.e. there is no intermittent missingness), then
one may use a logit or complementary log-log (clog-log) link for the probability
Pr(Ri = j | Ri ≥ j), where Ri = j if a
subject drops out between the (j − 1)th and jth measurement.
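The conditional dropout probabilities act as discrete-time hazards, from which the distribution over dropout occasions follows by multiplication. The Python sketch below shows both links and the conversion; the function names and the linear predictor values are hypothetical.

```python
import math


def logit_hazard(eta):
    """Pr(R = j | R >= j) under a logit link."""
    return 1.0 / (1.0 + math.exp(-eta))


def cloglog_hazard(eta):
    """Pr(R = j | R >= j) under a complementary log-log link."""
    return 1.0 - math.exp(-math.exp(eta))


def dropout_probabilities(hazards):
    """Unconditional Pr(R = j) for successive occasions, plus Pr(no dropout)."""
    remaining = 1.0          # probability of still being in the study
    probs = []
    for h in hazards:
        probs.append(remaining * h)
        remaining *= 1.0 - h
    return probs, remaining


hazards = [logit_hazard(eta) for eta in (-2.0, -1.5, -1.0)]
probs, completing = dropout_probabilities(hazards)
```

The dropout probabilities and the completion probability sum to one, so together they define the multinomial indicator for dropout pattern.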
Choice of additional predictors in the missingness model is an area of potential sensitivity
in terms of whether the coefficient on the current Y value is found to be significant. Hedeker
and Gibbons (2006) use logit or clog-log link models to assess whether a covariate-dependent
MCAR assumption applies for a given data set. They relate Pr(Ri = j | Ri ≥ j) to observed
covariates Xobs such as time and treatment, as well as to the history h(yit) of observed Y
values, and to interactions between Xobs and h(y). For example, h(yij) might be the aver-
age of all yit between periods 1 and j. Then to test for covariate-dependent MCAR, one
might use a logit regression for Pr(Ri = j | Ri ≥ j) that includes main effects {t, Tr, h(y)},
as well as interactions between h(y) and t, between h(y) and Tr, and between h(y), Tr and t
jointly.
The missingness model may have a role not only as part of a likelihood analysis allowing
non-random missing data, or testing for different types of missingness, but as a method
for imputing missing data. Thus a “propensity score” analysis may be based on categoris-
ing the regression terms ηit in

$$\mathrm{logit}(\pi_{it}) = \eta_{it},$$

into quantile groups (e.g. quartiles) (Rosenbaum and Rubin, 1983). Among subjects located
within particular quantiles of ηit, some subjects will exit but some remain. Sampling of
the missing yit for exiting subjects may be based on sampling with replacement from the
known yit values of stayers in the same quantile – this is sometimes called the “approxi-
mate Bayesian bootstrap method” (Rubin and Schenker, 1986; Lavori et al., 1995). In mul-
tiple imputation, this imputation process would be repeated several times to provide
multiple filled-in datasets.
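The within-quantile resampling step might be sketched as follows in Python. This assumes the usual two-stage form of the approximate Bayesian bootstrap (first resample a donor pool from the stayers, then draw imputations from that pool); the function names, the dictionary layout keyed by quantile, and the data are all hypothetical.

```python
import random


def abb_impute(stayer_values, n_missing, rng):
    """Impute n_missing values from stayers in one propensity quantile."""
    # Stage 1: donor pool drawn with replacement from the observed values
    pool = [rng.choice(stayer_values) for _ in stayer_values]
    # Stage 2: imputations drawn with replacement from the donor pool
    return [rng.choice(pool) for _ in range(n_missing)]


def multiple_imputation(stayers_by_quantile, missing_by_quantile, m, seed=0):
    """Repeat the imputation m times to give m sets of filled-in values."""
    rng = random.Random(seed)
    return [{q: abb_impute(stayers_by_quantile[q], n_mis, rng)
             for q, n_mis in missing_by_quantile.items()}
            for _ in range(m)]


stayers = {1: [1.0, 2.0, 3.0], 2: [10.0, 12.0]}   # observed y by quantile
missing = {1: 2, 2: 1}                            # numbers of exiting subjects
imputed_sets = multiple_imputation(stayers, missing, m=5, seed=42)
```

The first resampling stage is what distinguishes this from a simple hot-deck draw, since it propagates uncertainty about the donor distribution across the imputed datasets.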

10.8.2 Common Factor Models


Latent variables may be introduced to explain both the Y and R data. Thus, a latent data
perspective on the selection model might consider bivariate data (Y,Z) where Yit is observed
if the latent data Zit is positive (Copas and Li, 1997). Furthermore, let Xit be predictor data
potentially relevant to explaining both Y and Z and define bivariate standard normal
errors (ε1i,ε2i) with correlation ρ. Assume a linear regression for Y with

$$y_{it} = X_{it}\beta + \sigma_1 \varepsilon_{1i},$$

and a missingness model

$$Z_{it} = X_{it}\gamma + \varepsilon_{2i}.$$

Then if ρ ≠ 0, the missing data are informative or non-ignorable, whereas ρ = 0 corresponds
to missingness at random.
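The effect of ρ can be seen in a small simulation. The Python sketch below is an illustration only: it assumes an intercept-only version of the two regressions (Xβ = 1, Xγ = 0, σ1 = 1), a single observation per subject, and a hypothetical function name; it compares the mean of the observed Y values with the true mean.

```python
import math
import random


def observed_mean(rho, n=20000, beta=1.0, gamma=0.0, sigma1=1.0, seed=1):
    """Mean of y among cases with latent Z > 0 under the bivariate normal model."""
    rng = random.Random(seed)
    kept = []
    for _ in range(n):
        e1 = rng.gauss(0.0, 1.0)
        # e2 is standard normal with correlation rho with e1.
        e2 = rho * e1 + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        y = beta + sigma1 * e1
        z = gamma + e2
        if z > 0:  # y is observed only when the latent Z is positive
            kept.append(y)
    return sum(kept) / len(kept)


# Bias of the complete-case mean relative to the true mean beta = 1.
bias_informative = observed_mean(rho=0.8) - 1.0
bias_ignorable = observed_mean(rho=0.0) - 1.0
```

With ρ = 0.8 the observed cases over-represent large values of Y, whereas with ρ = 0 the complete-case mean is close to the true mean.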
A similar principle involves low dimension random effects F, also known as common
factors, that are shared between outcome and missingness models; similar shared frailty
models are used for models with outcome-dependent follow-up (Ryu et al., 2007). As often
in factor models, the outcome data and missingness patterns may be viewed as condi-
tionally independent, given the common factors (Song and Belin, 2004; Albert et al., 2002;
Roy and Lin, 2002; Ten Have et al., 1998). Equivalently, it is assumed that “all information
about the missing data in the observed response is accounted for through the shared ran-
dom effects” (Albert and Follmann, 2007). In fact, Li et al. (2007) and Yang and Shoptaw
(2005) distinguish such models as an alternative to selection and pattern mixture methods,
since under conditional independence one may represent the (R,Y,F) joint density as

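$$\begin{aligned}
P(R_i, Y_i, F_i \mid \theta_R, \theta_Y, \theta_F) &= P(R_i, Y_i \mid \theta_Y, \theta_R, F_i)\, P(F_i \mid \theta_F) \\
&= P(R_i \mid \theta_R, F_i)\, P(Y_{obs,i}, Y_{mis,i} \mid \theta_Y, F_i)\, P(F_i \mid \theta_F).
\end{aligned}$$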

Integrating out the Fi, one has


$$P(R_i, Y_i \mid \theta_Y, \theta_R) = \int P(R_i \mid \theta_R, F_i)\, P(Y_{obs,i}, Y_{mis,i} \mid \theta_Y, F_i)\, P(F_i \mid \theta_F)\, dF_i.$$

Other assumptions are possible, as under the “conditional linear model” (Daniels and
Hogan, 2008, p.112), with the conditioning sequence

$$P(R_i, Y_i, F_i \mid \theta_R, \theta_Y, \theta_F) = P(Y_i \mid \theta_Y, F_i, R_i)\, P(F_i \mid \theta_F, R_i)\, P(R_i \mid \theta_R).$$

One form of common effect that may be used to model informative missingness is based
on shared heterogeneity (e.g. Li et al., 2007; Chib, 2008, p.507). An example is a general lin-
ear mixed outcome model with permanent subject random effects bi = (b1i , … , bQi )

$$g[E(y_{it} \mid b_i)] = X_{it}\beta + Z_{it} b_i,$$

where the missingness model for Pr(Rit = 1) also conditions on the bi, and possibly on separate
predictors Wit, and on the history of responses Hit = {yi1, …, yit}. Consider the case
Q = 1 with Zit = 1, and suppose predictors Wit are relevant to dropout (e.g. baseline health
status in a clinical trial). Then a common factor model adapted to predicting the missingness
probability πit = Pr(Rit = 1 | Wit, Hit) might take the form

$$g[E(y_{it} \mid b_i)] = X_{it}\beta + b_i \tag{10.10}$$

$$\operatorname{logit}(\pi_{it}) = W_{it}\gamma + \lambda b_i + y_{it}\delta_1 + y_{i,t-1}\delta_2,$$

where bi are zero mean random effects, and the predictors {Xit, Wit} both include an intercept.
For example, Li et al. (2007) consider Poisson data with yit ~ Po(λit),

$$\log(\lambda_{it}) = X_{it}\beta + b_i,$$

and with binary indicators for missingness, and a lagged outcome scheme adapted to
counts, one would obtain

$$\operatorname{logit}(\pi_{it}) = W_{it}\gamma + \lambda b_i + \log(y_{it} + 1)\delta_1 + \log(y_{i,t-1} + 1)\delta_2.$$

In fact, the model of Li et al. (2007) distinguishes between intermittently missing data and
permanent attrition via a multinomial rather than binary regression, and uses a transition
probability missingness model.
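A minimal Python sketch of the binary shared-effect scheme for counts follows. It is not the multinomial transition model of Li et al. (2007); it simply evaluates the Poisson mean and the response probability πit = Pr(Rit = 1) for given inputs. The function names and all numeric values are hypothetical.

```python
import math


def inv_logit(x):
    """Inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-x))


def poisson_mean(x_beta, b_i):
    """Poisson mean under the log link: exp(X_it beta + b_i)."""
    return math.exp(x_beta + b_i)


def response_prob(w_gamma, lam, b_i, y_t, y_prev, delta1, delta2):
    """Pr(R_it = 1) with a shared effect b_i and log(y + 1) transformed counts."""
    eta = (w_gamma + lam * b_i
           + delta1 * math.log(y_t + 1)
           + delta2 * math.log(y_prev + 1))
    return inv_logit(eta)


# Hypothetical values: a negative loading means that subjects with a high
# random effect (high expected counts) are less likely to be observed.
p_low_b = response_prob(w_gamma=1.0, lam=-0.8, b_i=-1.0,
                        y_t=2, y_prev=3, delta1=0.0, delta2=-0.2)
p_high_b = response_prob(w_gamma=1.0, lam=-0.8, b_i=1.0,
                         y_t=2, y_prev=3, delta1=0.0, delta2=-0.2)
```

Setting the loading and the coefficient on the current count to zero removes the dependence of missingness on unobserved quantities.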
A model with shared latent effects exemplified by (10.10) imposes possibly restrictive
assumptions on the correlations among repeated responses for a given subject. Conditional
on the time-invariant shared effects bi, observations on a subject are uncorrelated (Albert
and Follmann, 2007). An alternative is a shared autoregressive process, as in

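$$\begin{aligned}
g[E(y_{it} \mid F_{it})] &= X_{it}\beta + F_{it}, \\
F_{it} &= \rho F_{i,t-1} + u_{it}, \\
\operatorname{logit}(\pi_{it}) &= W_{it}\gamma + \lambda F_{it} + y_{it}\delta_1 + y_{i,t-1}\delta_2,
\end{aligned}$$

where the uit are white noise, and ρ ∈ (−1, 1).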
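The latent autoregressive process can be simulated directly. The Python sketch below assumes Gaussian white noise, a stationary starting value, and a simplified missingness predictor containing only an intercept and the latent effect; the function names and parameter values are hypothetical.

```python
import math
import random


def ar1_path(rho, length, rng, sd_u=1.0):
    """Simulate F_t = rho * F_{t-1} + u_t, started from the stationary law."""
    f = rng.gauss(0.0, sd_u / math.sqrt(1.0 - rho ** 2))
    path = [f]
    for _ in range(length - 1):
        f = rho * f + rng.gauss(0.0, sd_u)
        path.append(f)
    return path


def lag1_autocorr(x):
    """Sample lag-1 autocorrelation of a series."""
    m = sum(x) / len(x)
    num = sum((a - m) * (b - m) for a, b in zip(x[:-1], x[1:]))
    den = sum((a - m) ** 2 for a in x)
    return num / den


rng = random.Random(42)
path = ar1_path(rho=0.7, length=20000, rng=rng)
r1 = lag1_autocorr(path)

# Response probabilities driven by the same latent process
# (hypothetical intercept -0.5 and loading 1.0; no lagged responses).
probs = [1.0 / (1.0 + math.exp(-(-0.5 + 1.0 * f))) for f in path]
```

Unlike a time-invariant shared effect, the latent values at nearby periods are more alike than those at distant periods, so serial dependence is allowed in both outcome and missingness.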


For multivariate responses {ymit, m = 1, …, M}, one might propose common factors to
model both correlation between the observed responses, and the probabilities of missing
response, especially attrition affecting all outcomes (Lin et al., 2004). Thus, consider a sin-
gle time-varying factor Fit, and loadings {λm,κ} in the Y and R likelihoods, and let Hit denote
a subset of the history of the observed X and Y variables up to time t. Then for outcomes
m = 1, …, M, one might have

$$g[E(y_{mit} \mid X_{it}, F_{it})] = X_{it}\beta + \lambda_m F_{it}$$

while the probability πit in Rit ~ Bern(πit) is modelled as

$$\operatorname{logit}(\pi_{it}) = W_{it}\gamma + \varphi H_{i,t-1} + \kappa F_{it}$$

for t = 1, …, Ti, where for dropouts Ti = Ti,obs + 1 and Ti,obs is the last interval where data were
observed. Furthermore, the factor scores may depend on known predictors {Uit, Zit} and
zero mean random permanent effects bi, as in

$$F_{it} = U_{it}\eta + Z_{it} b_i + u_{it},$$

with uit ~ N(0, 1) if all loadings κ and λm are unknowns, and with Uit omitting an intercept
for identifiability (Roy and Lin, 2002, p.42). The missingness model is non-ignorable by
virtue of dependence of πit on Fit, which represents possibly missing ymit (Roy and Lin, 2002,
p.43).

10.8.3 Missing Predictor Data


Often longitudinal data will have missingness on covariates as well as on the response, so
that binary or categorical indicators RX are defined according as covariates have missing
values or not. With R = (RY , RX ), the joint density under a selection approach has the form

$$p(Y, X, R_Y, R_X \mid \eta, \beta, \theta) = p_R(R_X, R_Y \mid Y, X, \eta)\, p_Y(Y \mid X, \beta)\, p_X(X \mid \theta),$$

where pX now models the likelihood of the predictors. If RY is conditional on all the com-
ponents of RX, one has

$$\begin{aligned}
p(Y, X, R_Y, R_X \mid \eta, \beta, \theta) = {} & p(R_Y \mid R_X, Y, X, \eta_Y)\, p(R_X \mid Y, X, \eta_X) \\
& \times p_Y(Y \mid X, \beta)\, p_X(X \mid \theta).
\end{aligned}$$
Alternatively, RY may be modelled jointly with the RX, though complexity increases as the
number of predictors subject to missingness rises, giving rise to different possible condi-
tional sequences for RY and the components of RX.
Suppose a subset of q predictors have missing values, with Rji = 1 if Xji is missing, and
Rji = 0 otherwise. If Y is fully observed, a selection approach specifies

$$p(Y, X, R_X \mid \eta, \beta, \theta) = p(R_X \mid Y, X, \eta)\, p_Y(Y \mid X, \beta)\, p_X(X \mid \theta),$$

where p(RX) is a multinomial with 2^q cells. To define pX, one needs to specify the joint distribution
of Xi,mis = {X1i, …, Xqi}. Suppose the incompletely observed covariates Xmis = (X1, …, Xq)
are both categorical, Xmis,D = {X1, …, Xr}, and continuous, Xmis,C = {Xr+1, …, Xq}, with fully
observed covariates denoted Xobs = {Xq+1, …, Xp}. Ibrahim et al. (1999) proposed the joint
density of Xmis be specified as a series of conditional distributions, namely

$$p(X_1, \ldots, X_q \mid \theta) = p_q(X_q \mid X_{q-1}, \ldots, X_1, \theta_q, X_{obs}) \cdots p_2(X_2 \mid X_1, \theta_2, X_{obs})\, p_1(X_1 \mid \theta_1, X_{obs})$$

though there may be sensitivity as to which of the q! conditioning sequences is adopted.


The completely observed predictors may be used in predicting the missing covariates. For
continuous predictors, the form of density (e.g. gamma, normal) can be adapted to whether
only positive values are observed. Ibrahim et al. (1999, p.180) suggested one-dimensional
or joint distributions for the continuous predictors in the lower stages (p1, p2 etc.), with the
higher stages being models for categorical predictors that are based on the imputed con-
tinuous covariates (e.g. logistic regression models).
Another general scheme for specifying the joint density of Xmis adopts a different strat-
egy by first representing the joint density of categorical predictors. This is the general
location model

$$\begin{aligned}
p(X_{mis} \mid \theta, X_{obs}) &= p(X_{mis,C}, X_{mis,D} \mid \theta_C, \theta_D, X_{obs}) \\
&= p(X_{mis,C} \mid X_{mis,D}, \theta_C, X_{obs})\, p(X_{mis,D} \mid \theta_D, X_{obs}),
\end{aligned}$$

typically involving a multivariate normal or multivariate Student t distribution for the
continuous predictors, conditional on a given combination of values of the categorical
covariates. For example, means and covariances for the MVN model could be specific
to each combination of the categorical predictors. The first stage of the joint density
for predicting missing categorical covariates, p(Xmis,D | θD, Xobs), would be a multinomial
distribution, or possibly log-linear regression, over discrete outcomes, missing and
observed.
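The two stages of the general location model can be sketched in Python as follows. The sketch assumes two binary categorical covariates (four cells), a single continuous covariate that is normal with a cell-specific mean and standard deviation, and no dependence on Xobs; all names and parameter values are hypothetical.

```python
import random


def draw_general_location(cell_probs, cell_means, cell_sds, rng):
    """One draw of (categorical cell, continuous value).

    Stage 1: multinomial draw of the combination of categorical covariates.
    Stage 2: normal draw of the continuous covariate given that cell.
    """
    cells = list(cell_probs)
    weights = [cell_probs[c] for c in cells]
    cell = rng.choices(cells, weights=weights, k=1)[0]
    value = rng.gauss(cell_means[cell], cell_sds[cell])
    return cell, value


rng = random.Random(11)
# Hypothetical parameters for two binary covariates and one continuous one.
cell_probs = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
cell_means = {(0, 0): 0.0, (0, 1): 2.0, (1, 0): 5.0, (1, 1): 10.0}
cell_sds = {cell: 1.0 for cell in cell_probs}

draws = [draw_general_location(cell_probs, cell_means, cell_sds, rng)
         for _ in range(5000)]
```

In an MCMC setting the same two steps would be applied only to the missing components, conditional on current parameter values.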
Possible approaches for modelling the covariate missingness indicators p(Rji | Yi, Xi, η)
under a selection approach include a joint log-linear model with Xi = (Xi,mis, Xi,obs) as
predictors, or equivalently a multinomial model with all possible classifications of non-
response as categories (Schafer, 1997, chapter 9). For example, if Xi,mis contains two vari-
ables subject to missingness, then there are four possible combinations of values of R1i and
R2i for each subject. The joint density of missingness indicators can be expressed (Ibrahim
et al., 1999) as a series of conditional distributions, namely

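$$\begin{aligned}
p(R_{1i}, \ldots, R_{qi} \mid \eta, X_i, Y_i) = {} & p(R_{qi} \mid R_{q-1,i}, \ldots, R_{1i}, \eta_q, X_i, Y_i) \\
& \cdots p(R_{2i} \mid R_{1i}, \eta_2, X_i, Y_i)\, p(R_{1i} \mid \eta_1, X_i, Y_i),
\end{aligned}$$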
which in practice implies a series of binary regressions. For assessing non-randomness in
covariate missingness, one allows Pr(R2i = 1 | R1i, η2, Xi, Yi) to depend on predictors Xi that
may be subject to missing values, as well as on earlier Rji in the conditional sequence.
In practice, a multivariate density for a set of continuous variables might be repre-
sented indirectly by a series of regressions, and missing values for binary or categori-
cal data items modelled or imputed via regressions on other predictors – see Austin and
Escobar (2005) for an illustration of such methods. Such procedures are related to multiple
imputation procedures for covariates, and possibly responses also (Schafer, 1997; Allison,
2000). Consider the case where Z and X are predictors, with Z subject to missingness.
Schafer (1997) proposes random regression imputation by initially regressing Z on X and
Y, but using only cases with Z observed, and from this regression forms point estimates
Ẑ for cases with missing data. Let ŝ be the square root of the mean square error from
the observed data regression, then for subjects with missing Z, one obtains imputations
Z̃ = Ẑ + ŝU where U is a draw from a standard normal. For cases with observed Z, one
sets Z̃ = Z. One then carries out a filled-in data regression of Y on X and Z̃ for all subjects.
The Z-imputation and filled-in data regressions may be repeated M times. Such a proce-
dure is, however, not proper in the sense of Rubin (1987).
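The imputation step can be sketched in Python as below. The sketch assumes a single X predictor, an ordinary least squares fit obtained from the normal equations, and a root mean square error with residual degrees of freedom; the function names and simulated data are hypothetical and are not taken from Schafer (1997).

```python
import math
import random


def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [u - f * v for u, v in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def regression_impute(x, y, z, rng):
    """Fill None entries of z with z_hat + s_hat * U, where U ~ N(0, 1)."""
    rows = [(1.0, xi, yi) for xi, yi in zip(x, y)]
    obs = [i for i, zi in enumerate(z) if zi is not None]
    # Normal equations from the cases with z observed only.
    xtx = [[sum(rows[i][j] * rows[i][k] for i in obs) for k in range(3)]
           for j in range(3)]
    xtz = [sum(rows[i][j] * z[i] for i in obs) for j in range(3)]
    coef = solve(xtx, xtz)
    fitted = [sum(c * v for c, v in zip(coef, row)) for row in rows]
    rss = sum((z[i] - fitted[i]) ** 2 for i in obs)
    s_hat = math.sqrt(rss / (len(obs) - 3))
    return [zi if zi is not None else fitted[i] + s_hat * rng.gauss(0.0, 1.0)
            for i, zi in enumerate(z)]


# Simulated data with every fourth z value set to missing.
rng = random.Random(7)
x = [rng.gauss(0.0, 1.0) for _ in range(200)]
y = [rng.gauss(0.0, 1.0) for _ in range(200)]
z_full = [1.0 + 2.0 * xi + 0.5 * yi + rng.gauss(0.0, 0.3)
          for xi, yi in zip(x, y)]
z = [None if i % 4 == 0 else zf for i, zf in enumerate(z_full)]
z_filled = regression_impute(x, y, z, rng)
```

Because the regression coefficients and error variance are held at their point estimates rather than drawn from their posterior, repeated calls understate imputation uncertainty, which is the sense in which the procedure is not proper.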

10.8.4 Pattern Mixture Models


Pattern mixture models may have a benefit in avoiding intricate modelling of the miss-
ingness indicators (Daniels et al., 2015). For regular longitudinal data (repeat measures at
fixed intervals for all subjects) subject to missingness only through attrition, a pattern mix-
ture analysis (Little, 1995) might simply involve differentiating regression effects in the
Y-model according to discrete drop out times Ui ∈ (2, …, T − 1), as well as completers with
Ui = T. Thus “the missing-data patterns can be used as grouping variables in the analysis”
(Hedeker and Gibbons, 1997). If there are hm subjects belonging to each of M different missingness
patterns (m = 1, …, M), with associated proportions ϕm = hm/n, then the “marginal” or composite
parameter (e.g. the regression impact of a predictor xp) is obtained as a weighted average of
the pattern specific parameters βpm, namely $\beta_p = \sum_{m=1}^{M} \phi_m \beta_{pm}$ (Curran et al., 2002). A Bayesian
analysis might involve repeated multinomial sampling of the ϕm at each MCMC iteration,
and monitoring the composite parameters $\beta_p^{(r)} = \sum_{m=1}^{M} \phi_m^{(r)} \beta_{pm}^{(r)}$. Often the preliminary model
for missingness P(R | θR) would be confined to such multinomial sampling.
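A Python sketch of monitoring the composite parameter is given below. It assumes that the pattern proportions are updated from a Dirichlet posterior implied by multinomial pattern counts and a flat Dirichlet(1, …, 1) prior, which is one possible reading of the multinomial sampling step. The pattern counts and the pattern-specific draws are hypothetical stand-ins for MCMC output.

```python
import random


def dirichlet(alpha, rng):
    """One draw from a Dirichlet distribution via normalised gamma variates."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(g)
    return [v / total for v in g]


def composite_draws(pattern_counts, beta_draws, rng):
    """Composite parameter at each iteration r.

    beta_draws[r][m] is the draw of the pattern-specific coefficient
    for pattern m at iteration r.
    """
    # Assumption: flat Dirichlet prior on the pattern proportions.
    alpha = [h + 1.0 for h in pattern_counts]
    draws = []
    for beta_r in beta_draws:
        phi_r = dirichlet(alpha, rng)
        draws.append(sum(p * b for p, b in zip(phi_r, beta_r)))
    return draws


rng = random.Random(3)
counts = [30, 70]  # hypothetical numbers of subjects in M = 2 patterns
# Hypothetical MCMC draws of the pattern-specific coefficients.
beta_draws = [[rng.gauss(-1.0, 0.2), rng.gauss(1.0, 0.2)]
              for _ in range(2000)]
beta_p = composite_draws(counts, beta_draws, rng)
posterior_mean = sum(beta_p) / len(beta_p)
```

The spread of `beta_p` then reflects uncertainty in both the pattern-specific coefficients and the pattern proportions.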
For example, in a clinical application, separate intercepts, growth coefficients and treat-
ment effects would be estimated according to dropout category. The variance or cova-
riance parameters for random effects may also be differentiated. In an initial analysis,
dropout category might just be binary, differentiating between completers and dropouts,
regardless of the interval when the dropout occurred. Thus, set Gi = 1 for dropouts and
Gi = 2 for completers, and consider a regression model for fixed interval (balanced) longitu-
dinal data yit with intercept, time, treatment effect (Tr), and time-treatment interaction (e.g.
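Hedeker and Gibbons, 1997; Mazumdar et al., 2007). Then a grouped linear regression with
varying intercepts and slopes could take the form

$$\begin{aligned}
y_{it} &= \beta_{1,G_i} + \beta_{2,G_i} t + \beta_{3,G_i} Tr_i + \beta_{4,G_i}(t \cdot Tr_i) + b_{1i} + b_{2i} t + \varepsilon_{it}, \\
b_i &\sim N(0, D_{G_i}), \\
\varepsilon_{it} &\sim N(0, \sigma^2_{G_i}).
\end{aligned}$$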
Curran et al. (2002, p.13) allow for an additional autocorrelated error εit with pattern-specific
covariance matrix RGi.
The conditional linear model (Paddock, 2007; Hogan et al., 2004) is a version of the pat-
tern mixture model that may be applied to continuously recorded longitudinal data (rather
than fixed interval longitudinal data). The impact of missingness on Y involves functions
βj(Ui) of possibly continuous dropout times Ui though this reduces to a grouping approach
for fixed intervals; that is, the βj(Ui) become step functions (Hogan et al., 2004, p.856). At
their most simple, such functions are linear in U, but polynomial functions or non-para-
metric models (e.g. splines) can be used. In the preceding example, one might have

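$$\begin{aligned}
y_{it} &= \beta_1(U_i) + \beta_2(U_i)t + \beta_3(U_i)Tr_i + \beta_4(U_i)(t \cdot Tr_i) + b_{1i} + b_{2i}t + \varepsilon_{it}, \\
b_i &\sim N(0, D), \\
\varepsilon_{it} &\sim N(0, \sigma^2), \\
\beta_j(U_i) &= \alpha_{j0} + \alpha_{j1}U_i, \quad j = 1, \ldots, 4,
\end{aligned}$$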

and a test for missingness at random is whether the αj1 are zero. Paddock (2007) applies
a Bayesian regression selection approach to coefficients in models involving quadratic
effects of Ui.
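The linear version of the conditional linear model can be illustrated with a short Python function that evaluates the mean response for a subject whose random effects are set to zero. The function name and the coefficient values are hypothetical.

```python
def clm_mean(t, tr, u, alpha):
    """Mean of y_it given dropout time u, with random effects set to zero.

    alpha is a list of four pairs (alpha_j0, alpha_j1), one for each of
    the intercept, time, treatment and time-by-treatment coefficients.
    """
    # beta_j(U_i) = alpha_j0 + alpha_j1 * U_i
    b = [a0 + a1 * u for a0, a1 in alpha]
    return b[0] + b[1] * t + b[2] * tr + b[3] * t * tr


# Hypothetical coefficients: intercept and time slope depend on dropout time.
alpha_informative = [(10.0, -0.5), (-1.0, 0.1), (0.0, 0.0), (-0.5, 0.0)]
# All alpha_j1 equal to zero: the mean no longer depends on dropout time.
alpha_random = [(10.0, 0.0), (-1.0, 0.0), (0.0, 0.0), (-0.5, 0.0)]

mean_early = clm_mean(t=2, tr=1, u=4, alpha=alpha_informative)
mean_late = clm_mean(t=2, tr=1, u=10, alpha=alpha_informative)
```

When every αj1 is zero, subjects with different dropout times share the same mean trajectory.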

Example 10.13 Cocaine Use and Desipramine


This example compares some of the missing data techniques described above, including
common factor and pattern mixture approaches, for data subject to dropout and intermit-
tent missing data. The data are from a 12-week trial of the antidepressant desipramine
in cocaine-dependent patients with depressive comorbidity, with 106 patients, 52 in the
treatment arm, and the remainder given a placebo (Ma et al., 2005). The responses yit are
average dollars per day spent on cocaine use. Only 47 patients completed the full 12 weeks
of observation. Let Ti* = 12 for completers, while for dropouts let Ti* denote the week sub-
sequent to the last week Ti* - 1 when an observation is obtained. So Ti* = 7 if a subject is
observed (with Rit = 1) for the first six weeks, but is missing (Rit = 0) for all the last six weeks.
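The definition of Ti* can be made concrete with a short Python sketch. It assumes that a completer is a subject observed at week 12 and that the missingness indicators entering the likelihood are those for weeks 1 to Ti*; the function names are hypothetical.

```python
def dropout_week(r, n_weeks=12):
    """T_i*: n_weeks for completers, else the week after the last observed week.

    r[t - 1] is 1 if the response in week t is observed and 0 otherwise.
    """
    last_observed = max(
        (t for t in range(1, n_weeks + 1) if r[t - 1] == 1), default=0
    )
    return n_weeks if last_observed == n_weeks else last_observed + 1


def modelled_indicators(r, n_weeks=12):
    """Indicators R_it for t = 1, ..., T_i* that enter the missingness model."""
    return r[:dropout_week(r, n_weeks)]


six_then_gone = [1] * 6 + [0] * 6        # observed for the first six weeks only
completer = [1] * 12
intermittent = [1, 0, 1, 1] + [0] * 8    # a gap in week 2, dropout after week 4
```

For `six_then_gone` the function returns 7, matching the case described above, and intermittent gaps before the last observed week remain in the modelled sequence as zeros.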
A plot of the average responses for the two arms (including the baseline) shows that
the treatment group begins with a higher average baseline spending level, and reduces
its cocaine spending more. The y-model involves predictors X = {1, Tr, t, Tr·t, Base} where
Base = baseline cocaine spending. So

$$y_{it} = \beta_1 + \beta_2 Tr_i + \beta_3 (t - \bar{t}) + \beta_4 Tr_i (t - \bar{t}) + \beta_5 Base_i + u_{it},$$

$$u_{it} \sim N(0, 1/\tau_u).$$

Assessment of desipramine efficacy in the outcome model focuses especially on the
coefficient for treatment-time interaction, Tr × (t − t̄). N(0,100) priors are assumed on the
first three predictors, but for numeric stability, an informative N(0,0.1) prior is assumed
for the impact of baseline spend (as large predictor values are observed). Assessing
whether missingness is informative or not is initially based on a selection approach,
with Rit ~ Bern(πit), t = 1, …, Ti*, and

$$\operatorname{logit}(\pi_{it}) = \gamma_1 + \gamma_2 y_{it} + \gamma_3 y_{i,t-1} + \gamma_4 Tr_i + \gamma_5 Tr_i (t - \bar{t}) + \gamma_6 Base_i,$$

using both yit and yi,t−1 as predictors (Mazumdar et al., 2007).


Estimates from iterations 1,001–10,000 of a two-chain run with rjags show no effect on
missingness probabilities πit of the possibly unobserved current outcome yit (gamma[2]
in the code) with 95% interval (−0.008, 0.012). The treatment effect β2 in the outcome
model is not significant, but the treatment-time interaction parameter β4 has a predominantly
negative density, albeit with an inconclusive 95% credible interval (−16.2, 0.7). The
WAIC is 17048 for the y-model, and 544 for the R-model.
An alternative model involves a common factor Fi (multiple indicator, multiple cause)
that depends on Basei = baseline spend (standardised). The prior mean for Fi specifies
a regression with intercept omitted for identifiability; the prior variance of the factor
scores is set at 1. The missingness model now involves a lagged response and the com-
mon factor, while the Y likelihood no longer involves baseline spending. Thus

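$$\begin{aligned}
F_i &\sim N(\eta \times Base_i, 1), \\
y_{it} &= \beta_1 + \beta_2 Tr_i + \beta_3 (t - \bar{t}) + \beta_4 Tr_i (t - \bar{t}) + \lambda F_i + u_{it}, \\
R_{it} &\sim \text{Bern}(\pi_{it}), \quad t = 1, \ldots, T_i^*, \\
\operatorname{logit}(\pi_{it}) &= \gamma_1 + \gamma_2 y_{i,t-1} + \kappa F_i,
\end{aligned}$$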

where a LN(0,1) prior is adopted for κ, and the prior on λ is N(1,1) and constrained to
positive values. Estimates show the common factor is a positive function of baseline
spend with η having 95% CRI (6.1,15.9). Its impact on πit is positive, with κ having 95%
CRI (0.01,0.05). So, the chance of a missing value (Rit = 0) tends to diminish with the score
on the common factor. The WAIC for the y-data under this model is broadly similar
(17055) to that for the earlier one.
Finally, a pattern mixture analysis is applied, distinguishing simply between non-
completer (Gi = 1) and completer groups (Gi = 2). The assumed model is

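$$\begin{aligned}
y_{it} &= \beta_{1,G_i} + \beta_{2,G_i} Tr_i + \beta_{3,G_i} t + \beta_{4,G_i} Tr_i (t - \bar{t}) + \beta_{5,G_i} Base_i + b_i + \varepsilon_{it}, \\
b_i &\sim N(0, D_{G_i}), \\
\varepsilon_{it} &\sim N(0, \sigma^2).
\end{aligned}$$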

The group-specific precisions 1/Dj (j = 1, 2) are assumed to follow independent
gamma priors with shape 1 and scale 1.
Posterior estimates (from iterations 1,001–14,000 of a two-chain run) show a signifi-
cantly positive baseline effect, β51, for dropouts, with 95% interval (0.17,0.76), whereas
completers do not have a significant baseline effect. Time-treatment interactions are
similar between the two groups. The pooled estimates for β2 and β4 (pooling over drop-
out patterns) show an insignificant main treatment effect, but the interaction parameter
β4 has a 95% credible interval (−8.2,−1.4).

Example 10.14 Shared Effect Missingness Model for IMPS (Inpatient Multidimensional Psychiatric Scale) Data

Hedeker and Gibbons (2006, pp.297–302) consider data relating to psychiatric morbidity;
in particular, item 79 of the IMPS scale is a positive measure of morbidity with values
ranging from 0 (normal) to 7 (extremely ill). The analysis here follows Hedeker and
Gibbons (2006) in treating the outcomes as metric, and adopts a shared effects model.
The data involve n = 437 patients with up to Ti = 5 repeat measurements not necessar-
ily at the same times (in weeks) after the baseline at 0 weeks; most patients have four
measurements. Treatment is coded as Drug = 1 for patients receiving any of the drugs
Chlorpromazine, Fluphenazine, and Thioridazine; and Drug = 0 otherwise. Follow up is
terminated after six weeks, with most patients only measured at weeks ai1 = 0, ai2 = 1, ai3 = 3
and ai4 = 6. Completers are those terminating at six weeks, with all sequences ending in
earlier weeks considered as dropouts. So, in a similar way to that used for discrete time
hazards in Chapter 11 (Section 11.5), one may define event indicators {wij = 0, j = 1, …, Ti}
for subjects whose last week is 6, and

$$w_{ij} = 0, \; j = 1, \ldots, T_i - 1; \qquad w_{iT_i} = 1,$$

for subjects whose last observation is before six weeks.
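The construction of the event indicators can be sketched in Python as follows, taking as input the ordered weeks at which a patient is measured. The function name is hypothetical.

```python
def event_indicators(weeks, final_week=6):
    """Discrete-time dropout indicators w_i1, ..., w_iTi for one patient.

    weeks: ordered measurement weeks, e.g. [0, 1, 3, 6]. All indicators are
    zero for completers; for dropouts only the last one equals one.
    """
    w = [0] * len(weeks)
    if weeks[-1] < final_week:
        w[-1] = 1  # sequence ends before the final week: dropout event
    return w


w_completer = event_indicators([0, 1, 3, 6])
w_dropout = event_indicators([0, 1, 3])
```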


Let Drugi = 1 for the treatment group subjects, with Drugi = 0 otherwise. Also let
Sit = ait^0.5 be the square root of the number of weeks at which the tth observation of
patient i is obtained. The model for the morbidity outcome then has the form

$$y_{it} = \beta_1 + \beta_2 Drug_i + \beta_3 S_{it} + \beta_4 S_{it} Drug_i + b_{1i} + b_{2i} S_{it} + u_{it},$$

with priors

$$\begin{aligned}
(b_{1i}, b_{2i}) &\sim N([0, 0], D), \\
D^{-1} &\sim \text{Wish}(I, 2), \\
u_{it} &\sim N(0, 1/\tau_u), \\
\tau_u &\sim \text{Ga}(1, 0.001).
\end{aligned}$$

The missing data model is a complementary log-log regression, sharing the random
intercept b1i and with an interaction between treatment and the shared effect. Thus

$$\log(-\log[1 - \Pr(w_{ij} = 1 \mid T_i \ge j)]) = \gamma_{1j} + \gamma_2 Drug_i + \alpha_1 b_{1i} + \alpha_2 b_{1i} Drug_i,$$

with {γ1j ~ N(0, 1000), j = 1, …, max(Ti)}, γ2 ~ N(0, 1000), and {αk ~ N(1, 1), k = 1, 2}.
Non-ignorable missingness corresponds to any of the αk coefficients being distinct from zero
(Hedeker and Gibbons, 2006, p.298).
A two-chain run of 5,000 iterations using rjags shows early convergence, and pos-
terior means on β coefficients (from the last 4000 iterations) similar to those reported
by Hedeker and Gibbons (2006). In particular, β4 has posterior mean (sd) of −0.65 (0.08)
consistent with a greater reduction in morbidity for the treatment group. Dropout is
lower for treated patients, with γ2 having mean (sd) of −0.65 (0.23). Both the α coefficients
have 95% credible intervals excluding zero: α1 has mean and 95% interval 0.84 (0.17,1.57)
indicating that among untreated patients, those more ill are likely to drop out, while the
sum (α1 + α2) has mean (95% CrI) −0.50 (−1.08, 0.03), showing that for those being treated,
the more ill are in fact less likely to drop out.
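The direction of these effects can be checked by plugging the reported posterior means into the complementary log-log hazard. The Python sketch below is a plug-in illustration only, not a posterior summary of the hazard: the interval intercept γ1j is not reported above, so a hypothetical value is used, and α2 is obtained as the reported mean of (α1 + α2) minus the reported mean of α1.

```python
import math


def dropout_hazard(gamma1_j, gamma2, alpha1, alpha2, drug, b1):
    """Pr(w_ij = 1 | T_i >= j) under the complementary log-log link."""
    eta = gamma1_j + gamma2 * drug + alpha1 * b1 + alpha2 * b1 * drug
    return 1.0 - math.exp(-math.exp(eta))


GAMMA1_J = -2.0         # hypothetical interval intercept (not reported)
GAMMA2 = -0.65          # reported posterior mean
ALPHA1 = 0.84           # reported posterior mean
ALPHA2 = -0.50 - 0.84   # implied by the reported mean of alpha1 + alpha2

# Compare patients one unit below and above the average random intercept.
untreated = [dropout_hazard(GAMMA1_J, GAMMA2, ALPHA1, ALPHA2, 0, b)
             for b in (-1.0, 1.0)]
treated = [dropout_hazard(GAMMA1_J, GAMMA2, ALPHA1, ALPHA2, 1, b)
           for b in (-1.0, 1.0)]
```

Among untreated patients the hazard rises with the random intercept, while among treated patients it falls, in line with the interpretation of α1 and (α1 + α2).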

References
Agresti A (1997) A model for repeated measurements of a multivariate binary response. Journal of the
American Statistical Association, 92, 315–321.
Agresti A, Natarajan R (2001) Modeling clustered ordered categorical data: A survey. International
Statistical Review, 69, 345–371.
Albert P, Follmann D (2007) Random effects and latent processes approaches for analyzing binary
longitudinal data with missingness: A comparison of approaches using opiate clinical trial
data. Statistical Methods in Medical Research, 16, 417–439.
Albert PS, Follmann DA, Wang SA, Suh EB (2002) A latent autoregressive model for longitudinal
binary data subject to informative missingness. Biometrics, 58, 631–642.
Allison P (2000) Multiple imputation for missing data: A cautionary tale. Sociological Methods and
Research, 28, 301–309.
Alvarez I, Niemi J, Simpson M (2016) Bayesian inference for a covariance matrix. arXiv:1408.4050v2
Antonio K, Beirlant J (2007) Actuarial statistics with generalized linear mixed models. Insurance:
Mathematics and Economics, 40, 58–76.
Austin P, Escobar M (2005) Bayesian modeling of missing data in clinical research. Computational
Statistics and Data Analysis, 49, 821–836.
Baker A, Bray I (2005) Bayesian projections: What are the effects of excluding data from younger age
groups? American Journal of Epidemiology, 162, 798–805.
Baldwin S (2014) Visualizing the LKJ Correlation Distribution. https://www.psychstatistics.com/2014/12/27/d-lkj-priors/
Barnard J, McCulloch R, Meng XL (2000) Modeling covariance matrices in terms of standard
deviations and correlations, with applications to shrinkage. Statistica Sinica, 10, 1281–1311.
Bauer R, Guzy S, Ng C (2007) A survey of population analysis methods and software for complex phar-
macokinetic and pharmacodynamic models with examples. The AAPS Journal, 9(1), E60–E83.
Bauwens L, Lubrano M, Richard J-F (1999) Bayesian Inference in Dynamic Econometric Models. Oxford
University Press, Oxford, UK.
Beckett L, Tancredi D, Wilson R (2004) Multivariate longitudinal models for complex change pro-
cesses. Statistics in Medicine, 23, 231–239.
Bernardinelli L, Clayton D, Pascutto C, Montomoli C, Ghislandi M, Songini M (1995) Bayesian analy-
sis of space-time variation in disease risk. Statistics in Medicine, 14, 2433–2443.
Berrington A, Hu Y, Ramirez-Ducoing K, Smith P (2005) Multilevel modelling of repeated ordi-
nal measures: An application to attitude towards divorce. Southampton Statistical Sciences
Research Institute Applications and Policy Working Paper M05/10 and ESRC Research Method
Programme Working Paper No. 26.
Besag J, York J, Mollié A (1991) Bayesian image restoration, with two applications in spatial statistics.
Annals of the Institute of Statistical Mathematics, 43(1), 1–20.
Bollen K, Curran P (2004) Autoregressive latent trajectory (ALT) models: A synthesis of two tradi-
tions. Sociological Methods and Research, 32, 336–383.
Bonate P (2008) Pharmacokinetic-Pharmacodynamic Modeling and Simulation, 2nd Edition. Springer,
New York.
Bond S (2002) Dynamic panel data models: A guide to microdata methods and practice. Portuguese
Economic Journal, 1, 141–162.
Bray I (2002) Application of Markov chain Monte Carlo methods to projecting cancer incidence and
mortality. Journal of the Royal Statistical Society: Series C (Applied Statistics), 51(2), 151–164.
Case A, Deaton A (2015) Rising morbidity and mortality in midlife among white non-hispanic
Americans in the 21st century. Proceedings of the National Academy of Sciences of the United States
of America, 112(49), 15078–15083.
Cepeda E, Gamerman D (2004) Bayesian modeling of joint regressions for the mean and covariance
matrix. Biometrical Journal, 46, 430–440.
Chamberlain G, Hirano K (1999) Predictive distributions based on longitudinal earnings data.
Annales d’Economie et de Statistique, 55–56, 211–242.
Chen Z, Kuo L (2001) A note on the estimation of the multinomial logit model with random effects.
The American Statistician, 55, 89–95.
Chen Z, Rus H, Sen A (2016) Border effects before and after 9/11: Longitudinal data evidence across
industries. World Economics. DOI:10.1111/twec.12413.
Chernyavskiy P, Little M, Rosenberg P (2019) A unified approach for assessing heterogeneity in age–
period–cohort model parameters using random effects. Statistical Methods in Medical Research,
28(1). https://journals.sagepub.com/doi/abs/10.1177/0962280217713033
Chib S (2008) Panel data modeling and inference: A Bayesian primer, pp 479–515, in The Econometrics
of Panel Data, 3rd Edition, eds L Matyas, P Sevestre. Springer-Verlag, Berlin, Germany.
Chib S, Carlin B (1999) On MCMC sampling in hierarchical longitudinal models. Statistics and
Computing, 9, 17–26.
Chib S, Jeliazkov I (2006) Inference in semiparametric dynamic models for binary longitudinal data.
Journal of the American Statistical Association, 101, 685–700.
Chintagunta P, Kyriazidou E, Perktold J (2001) Panel data analysis of household brand choices.
Journal of Economics, 103, 111–153.
Clayton D (1996) Generalized linear mixed models, pp 275–301, in Markov Chain Monte Carlo in
Practice, eds WR Gilks, S Richardson, DJ Spiegelhalter. Chapman & Hall, London, UK.
Clayton D, Schifflers E (1987) Models for temporal variation in cancer rates. II: Age-period-cohort
models. Statistics in Medicine, 6, 467–810.
Congdon P (2006) A model for geographical variation in health and total life expectancy. Demographic
Research, 14, 157–178.
Copas JB, Li HG (1997) Inference for non-random samples. Journal of the Royal Statistical Society: Series
B (Statistical Methodology), 59(1), 55–95.
Curran D, Molenberghs G, Aaronson N, Fossa S, Sylvester R (2002) Analyzing longitudinal continu-
ous quality of life data with dropout. Statistical Methods in Medical Research, 11, 5–23.
Curran P, Obeidat K, Losardo D (2010) Twelve frequently asked questions about growth curve mod-
eling. Journal of Cognition and Development, 11(2), 121–136.
Daniels M, Normand S (2006) Longitudinal profiling of health care units based on continuous and
discrete patient outcomes. Biostatistics, 7, 1–15.
Daniels M J, Jackson D, Feng W, White I (2015) Pattern mixture models for the analysis of repeated
attempt designs. Biometrics, 71(4), 1160–1167.
Davidian M, Giltinan D (2003) Nonlinear models for repeated measures data: An overview and
update. Journal of Agricultural, Biological, and Environmental Statistics, 8, 387–419.
Depaoli S, Boyajian J (2014) Linear and nonlinear growth models: Describing a Bayesian perspective.
Journal of Consulting and Clinical Psychology, 82(5), 784–802.
Diggle P, Kenward M (1994) Informative dropout in longitudinal data analysis. Journal of the Royal
Statistical Society: Series C, 43, 49–94.
Donald M, Mengersen K, Young R (2015) A four dimensional spatio-temporal analysis of an agricul-
tural dataset. PLOS ONE, 10(10), e0141120.
Dorsett R (1999) An econometric analysis of smoking prevalence among lone mothers. Journal of
Health Economics, 18, 429–441.
Dunson D (2003) Dynamic latent trait models for multidimensional longitudinal data. Journal of the
American Statistical Association, 98, 555–563.
Dunson D (2006) Bayesian dynamic modeling of latent trait distributions. Biostatistics, 7, 551–568.
Dunson D (2007) Bayesian methods for latent trait modeling of longitudinal data. Statistical Methods
in Medical Research, 16, 399–415.
Dunson D (2009) Bayesian nonparametric hierarchical modeling. Biometrical Journal, 51(2), 273–284.
Erkanli A, Soyer R, Angold A (2001) Bayesian analyses of longitudinal binary data using Markov
regression models of unknown order. Statistics in Medicine, 20, 755–770.
Fahrmeir L, Tutz G (2001) Multivariate Statistical Modelling Based on Generalized Linear Models, pp
69–137. Springer, New York.
Fichman M, Cummings J (2003) Multiple imputation for missing data: Making the most of what you
know. Organizational Research Methods, 6, 282–308.
Fitzmaurice G, Laird N, Ware J (2004) Applied Longitudinal Analysis. Wiley.
Fokianos K, Kedem B (2003) Regression theory for categorical time series. Statistical Science, 18,
357–376.
Fong Y, Rue H, Wakefield J (2010) Bayesian inference for generalized linear mixed models. Biostatistics,
11(3), 397–412.
Fotouhi A (2007) The initial conditions problem in longitudinal count process: A simulation study.
Simulation Modelling Practice and Theory, 15, 589–604.
Franco C, Bell W (2015) Borrowing information over time in binomial/logit normal models for small
area estimation. Statistics in Transition, 16(4), 563–584.
Frees E (2004) Longitudinal and Panel Data. Cambridge University Press, Cambridge, UK.
Frühwirth-Schnatter S, Tüchler R (2008) Bayesian parsimonious covariance estimation for hierarchi-
cal linear mixed models. Statistics and Computing, 18, 1–13.
Galatzer-Levy I (2015) Applications of Latent Growth Mixture Modeling and allied methods to post-
traumatic stress response data. European Journal of Psychotraumatology, 6, 27515.
Galatzer-Levy I, Bonanno G (2012) Beyond normality in the study of bereavement: Heterogeneity in
depression outcomes following loss in older adults. Social Science & Medicine, 74(12), 1987–1994.
Galler H (2001) On the dynamics of individual wage rates – Heterogeneity and stationarity of wage
rates of West German Men, pp 269–293, in Econometric Studies. A Festschrift in Honour of Joachim
Frohn, eds R Friedmann, L Knüppel, H Lütkepohl. LIT, Münster, Germany.
Gelman A, Carlin J, Stern H, Dunson D, Vehtari A, Rubin D (2014) Bayesian Data Analysis. CRC, Boca
Raton, FL.
Gelman A, Meng XL, Stern H (1996) Posterior predictive assessment of model fitness via realized
discrepancies. Statistica Sinica, 733–760.
Geweke J, Keane M (2000) An empirical analysis of earnings dynamics among men in the PSID:
1968–1989. Journal of Econometrics, 96, 293–356.
Ghosh P, Branco M, Chakraborty H (2007) Bivariate random effect model using skew-normal distri-
bution with application to HIV-RNA. Statistics in Medicine, 26, 1255–1267.
Grunwald GK, Hyndman RJ, Tedesco LM, Tweedie RL (2000) Non-Gaussian conditional linear AR(1)
models. Australian and New Zealand Journal of Statistics, 42, 479–495.
Heagerty P, Zeger S (2000) Marginalized multilevel models and likelihood inference. Statistical
Science, 15, 1–26.
Hedeker D (2003) A mixed-effects multinomial logistic regression model. Statistics in Medicine, 22,
1433–1446.
Hedeker D, Gibbons R (2006) Longitudinal Data Analysis. Wiley, New York.
Hedeker D, Gibbons R (1997) Application of random-effects pattern-mixture models for missing data
in longitudinal studies. Psychological Methods, 2, 64–78.
Hirano K (2000) A semiparametric model for labor earnings dynamics, in Practical Nonparametric and
Semiparametric Bayesian Statistics, eds D Dey, P Mueller, D Sinha. Springer-Verlag, New York.
Hirano K (2002) Semiparametric Bayesian inference in autoregressive panel data models. Econometrica,
70, 781–799.
Ho R, Hu I (2008) Flexible modelling of random effects in linear mixed models – A Bayesian approach.
Computational Statistics and Data Analysis, 52, 1347–1361.
Hogan J, Lin X, Herman B (2004) Mixtures of varying coefficient models for longitudinal data with
discrete or continuous non-ignorable dropout. Biometrics, 60, 854–864.
Hsiao C (2014) Analysis of Panel Data (No. 54). Cambridge University Press.
Ibrahim J, Chen M-H, Ryan L (2000) Bayesian variable selection for time series count data. Statistica
Sinica, 10, 971–987.
Ibrahim J, Lipsitz S, Chen M-H (1999) Missing covariates in generalized linear models when the
missing data mechanism is non-ignorable. Journal of the Royal Statistical Society, Series B, 61,
173–190.
Ibrahim J, Molenberghs G (2009) Missing data methods in longitudinal studies: A review. Test, 18(1),
1–43.
Islam M, Chowdhury R (2006) A higher order Markov model for analyzing covariate dependence.
Applied Mathematical Modelling, 30, 477–488.
Jara A, Quintana F, San Martin E (2008) Linear effects mixed models with skew-elliptical distribu-
tions: A Bayesian approach. Computational Statistics & Data Analysis, 52, 5033–5045.
Jorgensen B, Lundbye-Christensen S, Song P, Sun L (1999) A state space model for multivariate lon-
gitudinal count data. Biometrika, 86, 169–181.
Jung R, Kukuk M, Liesenfeld R (2006) Time series of count data: Modeling, estimation and diagnos-
tics. Computational Statistics and Data Analysis, 51, 2350–2364.
466 Bayesian Hierarchical Models

Keane M (2015) Longitudinal data discrete choice models of consumer demand, in The Oxford
Handbook of longitudinal Data, ed B Baltagi. OUP.
Kedem B, Fokianos K (2005) Regression models for binary time series, pp 185–199, in Modeling
Uncertainty, eds M Dror, P L’Ecuyer, F Szidarovszky. Springer.
Kenward M (1998) Selection models for repeated measurements with non-random dropout: An illus-
tration of sensitivity. Statistics in Medicine, 17, 2723–2732.
King G (2001) Analyzing incomplete political science data: An alternative algorithm for multiple
imputation. American Political Science Review, 95, 49–69.
Kinney S, Dunson D (2007) Fixed and random effects selection in linear and logistic models.
Biometrics, 63, 690–698.
Kleinman K, Ibrahim J (1998) A semi-parametric Bayesian approach to generalized linear mixed
models. Statistics in Medicine, 17, 2579–2596.
Knorr-Held L (2000) Bayesian modelling of inseparable space-time variation in disease risk. Statistics
in Medicine, 19, 2555–2567.
Lagazio C, Biggeri A, Dreassi E (2003) Age-period-cohort models and disease mapping. Environmetrics,
14, 475–490.
Lancaster T (2002) Orthogonal parameters and panel data. The Review of Economic Studies, 69, 647–666.
Lavori P, Dawson R, Shera D (1995) A multiple imputation strategy for clinical trials with truncation
of patient data. Statistics in Medicine, 14, 1913–1925.
Lee J, Hwang R (2000) On estimation and prediction for temporally correlated longitudinal data.
Journal of Statistical Planning and Inference, 87, 87–104.
Lee RD and Carter LR (1992) Modeling and forecasting U.S. mortality. Journal of the American Statistical
Association, 87(419), 659–671.
Lee Y, Nelder J (2000) Two ways of modelling overdispersion in non-normal data. Applied Statistics,
49, 591–598.
Lee Y, Nelder J (2004) Conditional and marginal models: another view. Statistical Science, 19, 219–238.
Leroux B, Lei X, Breslow N (1999) Estimation of disease rates in small areas: A new mixed model
for spatial dependence, pp 135–178, in Statistical Models in Epidemiology, the Environment and
Clinical Trials, eds M Halloran, D Berry. Springer-Verlag, New York.
Lewandowski D, Kurowicka D, Joe H (2009) Generating random correlation matrices based on vines
and extended onion method. Journal of Multivariate Analysis, 100, 1989–2001.
Li J, Yang X, Wu Y, Shoptaw S (2007) A random-effects Markov transition model for Poisson-
distributed repeated measures with non-ignorable missing values. Statistics in Medicine, 26,
2519–2532.
Lin H, McCulloch C, Rosenheck R (2004) Latent pattern mixture models for informative intermittent
missing data in longitudinal studies. Biometrics, 60, 295–305.
Lin T, Lee J (2006) A robust approach to t linear mixed models applied to multiple sclerosis data.
Statistics in Medicine, 25, 1397–1412.
Lindsey J (1993) Models for Repeated Measurements. Oxford University Press, New York.
Little RJA (1993) Pattern-mixture models for multi-variate incomplete data. Journal of the American
Statistical Association, 88, 125–133.
Little R (1995) Modeling the drop-out mechanism in repeated-measures studies. Journal of the
American Statistical Association, 90, 1112–1121.
Little R, Rubin D (2002) Statistical Analysis with Missing Data, 2nd Edition. Wiley-Interscience,
Hoboken, NJ.
Liu F, Zhang P, Erkan I, Small D S (2017) Bayesian inference for random coefficient dynamic panel
data models. Journal of Applied Statistics, 44(9), 1543–1559.
Liu L, Hedeker D (2006) A mixed-effects regression model for longitudinal multivariate ordinal data.
Biometrics, 62, 261–268.
Lockwood J, Doran H, McCaffrey D (December 2003) Using R for estimating longitudinal student
achievement models. R Newsletter, 3(3), 17–23.
Ma Y, Genton M, Davidian M (2004) Linear mixed effects models with semiparametric generalized
skew elliptical random effects, pp 339–358, in Skew-Elliptical Distributions and their Applications:
A Journey Beyond Normality, ed M Genton. Chapman and Hall/CRC, Boca Raton, FL.
Hierarchical Models for Longitudinal Data 467

Ma G, Troxel A, Heitjan D (2005) An index of local sensitivity to nonignorable drop-out in longitudi-


nal modelling. Statistics in Medicine, 24, 2129–2150.
Malchow-Moller N, Svarer M (2003) Estimation of the multinomial logit model with random effects.
Applied Economics Letters, 10, 389–392.
Marsh H, Grayson D (1994) Longitudinal confirmatory factor analysis: Common, time-specific, item-
specific, and residual-error components of variance. Structural Equation Modeling, 1, 116–145.
Marshall EC, Spiegelhalter DJ (2007) Identifying outliers in Bayesian hierarchical models: A simula-
tion-based approach. Bayesian Analysis, 2(2), 409–444.
Martinez-Beneito M, Lopez-Quilez A, Botella-Rocamora P (2008) An autoregressive approach to
spatio-temporal disease mapping. Statistics in Medicine, 27, 2874–2889.
Mazumdar S, Tang G, Houck P, Dew M, Begley A, Scott J, Mulsant B, Reynolds C (2007) Statistical
analysis of longitudinal psychiatric data with dropouts. Journal of Psychiatric Research, 41,
1032–1041.
McElreath R (2015) Statistical Rethinking: A Bayesian Course with Examples in R and Stan. CRC Press.
Menard S (2002) Longitudinal Research, 2nd Edition. Sage, London, UK.
Molenberghs G, Verbeke G (2004) An introduction to (generalized) (non)linear mixed models,
pp 111–153, in Explanatory Item Response Models: A Generalized Linear and Nonlinear Approach, ed
P de Boeck. Springer, New York.
Molenberghs G, Verbeke G (2006) Models for Discrete Longitudinal Data. Springer, New York.
Müller P, Rosner G (1997) A Bayesian population model with hierarchical mixture priors applied to
blood count data. Journal of the American Statistical Association, 92, 1279–1292.
Muthén B, Brown C, Masyn K, Jo B, Khoo S, Yang C, Wang C, Kellam S, Carlin J, Liao J (2002) General
growth mixture modeling for randomized preventive interventions. Biostatistics, 3, 459–475.
Natarajan R, Kass R (2000) Reference Bayesian methods for generalized linear mixed models. Journal
of the American Statistical Association, 95, 227–237.
Natarajan R, McCulloch C (1998) Gibbs sampling with diffuse proper priors: A valid approach to
data-driven inference? Journal of Computational and Graphical Statistics, 7, 267–277.
Nerlove M (2002) The history of panel data econometrics, 1861–1997, Chapter 1, in Essays in
Longitudinal Data Econometrics. ed M Nerlove. Cambridge University Press.
Oh MS, Lim YB (2001) Bayesian analysis of time series Poisson data. Journal of Applied Statistics, 28(2),
259–271.
Ohlssen D, Sharples L, Spiegelhalter D (2007) Flexible random-effects models using Bayesian
semi-parametric models: Applications to institutional comparisons. Statistics in Medicine, 26,
2088–2112.
Oman S, Meir N, Halm N (1999) Comparing two measures of creatinine clearance: An application of
errors-in-variables and bootstrap techniques. Applied Statistics, 48, 39–52.
Paddock S (2007) Bayesian variable selection for longitudinal substance abuse treatment data subject
to informative censoring. Journal of the Royal Statistical Society: Series C, 56, 293–311.
Paddock S, Savitsky T (2013) Bayesian hierarchical semiparametric modelling of longitudinal post-
treatment outcomes from open enrolment therapy groups. Journal of the Royal Statistical Society:
Series A, 176(3), 795–808.
Papaspiliopoulos O, Roberts G, Skold M (2003) Non-centered parameterisations for hierarchical
models and data augmentation, pp 307–326, in Bayesian Statistics 7, eds J Bernardo, M Bayarri,
J Berger, A Dawid, D Heckerman, A Smith, M West. OUP.
Parsons N, Edmondson R, Gilmour S (2006) A generalized estimating equation method for fit-
ting autocorrelated ordinal score data with an application in horticultural research. Applied
Statistics, 55, 507–524.
Pedroza C (2006) A Bayesian forecasting model: predicting U.S. male mortality. Biostatistics, 7,
530–550.
Pettitt A, Tran T, Haynes M, Hay J (2006) A Bayesian hierarchical model for categorical longitudinal
data from a social survey of immigrants. Journal of the Royal Statistical Society: Series A, 127,
97–114.
Pourahmadi M (1999) Joint mean-covariance models with applications to longitudinal data:
Unconstrained parameterisation. Biometrika, 86, 677–690.
468 Bayesian Hierarchical Models

Pourahmadi M (2000) Maximum likelihood estimation of generalized linear models for multivariate
normal covariance matrix. Biometrika, 87, 425–435.
Pourahmadi M, Daniels M (2002) Dynamic conditionally linear mixed models. Biometrics, 58, 225–231.
Qiu Z, Song P, Tan M (2002) Bayesian hierarchical models for multi-level repeated ordinal data using
WinBUGS. Journal of Biopharmaceutical Statistics, 12, 121–135.
Quick H, Waller L A, Casper M (2018) A multivariate space–time model for analysing county level
heart disease death rates by race and sex. Journal of the Royal Statistical Society: Series C (Applied
Statistics), 67(1), 291–304.
Quintana F, Müller P, Rosner G (2008) A semiparametric Bayesian model for repeated binary mea-
surements. The Journal of the Royal Statistical Society: Series C (Applied Statistics), 57(4):419–431.
Rice K (2005) Bayesian measures of goodness of fit, in Encyclopedia of Biostatistics, eds P Armitage, T
Colton. John Wiley, Chichester, UK.
Roberts GO, Sahu SK (2001) Approximate predetermined convergence properties of the Gibbs sam-
pler. Journal of Computational and Graphical Statistics, 10(2), 216–229.
Rosenbaum P, Rubin D (1983) The central role of the propensity score in observational studies for
causal effects. Biometrika, 70, 41–55.
Rossi P, Allenby G, McCulloch R (2005) Bayesian Statistics and Marketing. Wiley.
Roy J, Lin X (2000) Latent variable models for longitudinal data with multiple continuous outcomes.
Biometrics, 56, 1047–1054.
Roy J, Lin X (2002) Analysis of multivariate longitudinal outcomes with non-ignorable dropouts and
missing covariates: Changes in methadone treatment practices. Journal of the American Statistical
Association, 97, 40–52.
Rubin DB (1976) Inference and missing data. Biomelrika, 63, 581–592.
Rubin D, Schenker N (1986) Multiple imputation for interval estimation from simple random sam-
ples with ignorable nonresponse. Journal of the American Statistical Association, 81, 366–374.
Rushworth A, Lee D, Mitchell R (2014) A spatio-temporal model for estimating the long-term effects
of air pollution on respiratory hospital admissions in Greater London. Spatial and Spatio-
Temporal Epidemiology, 10, 29–38.
Ryu D, Sinha D, Mallick B, Lipsitz S, Lipshultz S (2007) Longitudinal Studies with outcome-depen-
dent follow-up: Models and Bayesian regression. Journal of the American Statistical Association,
102, 952–961.
Sahu S, Dey D, Branco M (2003) A new class of multivariateskew distributions with applications to
bayesian regression models. The Canadian Journal of Statistics, 31, 129–150.
Savitsky T, Paddock S (2014) Bayesian semi-and non-parametric models for longitudinal data with
multiple membership effects in R. Journal of Statistical Software, 57(3), 1–35.
Schafer J (1997) Imputation of missing covariates under a multivariate linear mixed model. Technical
report, Dept. of Statistics, The Pennsylvania State University.
Schafer J, Graham J (2002) Missing data: Our view of the state of the art. Psychological Methods, 7,
147–177.
Schmid V, Held L (2004) Bayesian extrapolation of space–time trends in cancer registry data.
Biometrics, 60(4), 1034–1042.
Schmid V, Held L (2007) Bayesian age-period-cohort modeling and prediction – BAMP. Journal of
Statistical Software, 21(8). http://www.jstatsoft.org/
Song J, Belin TR (2004) Imputation for incomplete high-dimensional multivariate normal data using
a common factor model. Statistics in Medicine, 23(18), 2827–2843.
Spiess M (2006) Estimation of a two-equation panel model with mixed continuous and ordered cat-
egorical outcomes and missing data. Jour Roy Stat Soc C 55: 525–538.
Squires D, Blumenthal D (2016) Mortality trends among working-age whites: The untold story. Issue
Brief (Commonwealth Fund), 3, 1–11.
Steele F (2008) Multilevel models for longitudinal data. Journal of the Royal Statistical Society: Series A,
171(1), 5–19.
Sun D, Tsutakawa R, Kim H, He Z (2000) Spatio-temporal interaction with disease mapping. Statistics
in Medicine, 19, 2015–2035.
Hierarchical Models for Longitudinal Data 469

Ten Have T, Kunselman A, Pulkstenis E, Landis R (1998) Mixed effects logistic regression models for
longitudinal binary response data with informative drop-out. Biometrics, 54, 367–383.
Terzi E, Cengiz M (2013) Bayesian hierarchical modeling for categorical longitudinal data from seda-
tion measurements. Computational and Mathematical Methods in Medicine, 2013, 579214.
Thall P, Vail S (1990) Some covariance models for longitudinal count data with overdispersion.
Biometrics, 46, 657–671.
Thiese M S (2014) Observational and interventional study design types; an overview. Biochemia
Medica, 24(2), 199–210.
Troxel A, Harrington D, Lipsitz S (1998) Analysis of longitudinal data with non-ignorable non-mono-
tone missing values. Applied Statistics, 47, 425–438.
Troxel A, Ma G, Heitjan D (2004) An index of local sensitivity to nonignorability. Statistica Sinica, 14,
1221–1237.
Tsai M-Y, Hsiao C (2008) Computation of reference Bayesian inference for variance components in
longitudinal studies. Computational Statistics, 23(4), 587–604.
Tutz G, Kauermann G (2003) Generalized linear random effects models with varying coefficients.
Computational Statistics and Data Analysis, 43, 13–28.
Tzala E, Best N (2008) Bayesian latent variable modelling of multivariate spatio-temporal variation
in cancer mortality. Statistical Methods in Medical Research, 17, 97–118.
Vaidyanathan R (2016) Using a LKJ Prior in Stan. http://stla.github.io/stlapblog/posts/
StanLKJprior.html
Verbeke G, Fieuws S, Molenberghs G, Davidian M (2014) The analysis of multivariate longitudinal
data: A review. Statistical Methods in Medical Research, 23(1), 42–59.
Verbeke G, Molenberghs G, Rizopoulos D (2010) Random effects models for longitudinal data,
Chapter 2, pp 37–96, in Longitudinal Research with Latent Variables, eds K van Montfort, J Oud,
A Satorra. Springer.
Wakefield J, Smith A, Racine-Poon A, Gelfand A (1994) Bayesian analysis of linear and non-linear
population models using the Gibbs sampler. Journal of the Royal Statistical Society: Series C
(Applied Statistics), 43, 201–221.
Weiss R (2005) Modelling Longitudinal Data. Springer, New York.
Weiss R, Cho M, Yanuzzi M (1999) On Bayesian calculations for mixture priors and likelihoods.
Statistics in Medicine, 18, 1555–1570.
Wooldridge J (2005) Simple solutions to the initial conditions problem in dynamic, nonlinear panel
data models with unobserved heterogeneity. Journal of Applied Econometrics, 20, 39–54.
Xu S, Jones R, Grunwald G (2007) Analysis of longitudinal count data with serial correlation.
Biometrical Journal, 49, 416–428.
Yang X, Shoptaw S (2005) Assessing missing data assumptions in longitudinal studies: An example
using a smoking cessation trial. Drug and Alcohol Dependence, 77, 213–225.
Yau KK, Kuk AY (2002) Robust estimation in generalized linear mixed models. Journal of the Royal
Statistical Society: Series B (Statistical Methodology), 64(1), 101–117.
Zayeri F, Kazemnejad A, Khanafshar N, Nayeri F (2005) Modeling repeated ordinal responses using
a family of power transformations: Application to neonatal hypothermia data. BMC Medical
Research Methodology, 5, 29.
Zhang D, Davidian M (2001) Linear mixed models with flexible distributions of random effects for
longitudinal data. Biometrics, 57(3), 795–802.
Zhang Z (2016) Modeling error distributions of growth curve models through Bayesian methods.
Behavioral Research, 48, 427–444.
Zhang Z, Hamagami F, Wang L, Grimm K, Nesselroade J (2007) Bayesian analysis of longitudinal
data using growth curve models. International Journal of Behavioral Development, 31(4), 374–383.
Zhang Z, Keke L, Zhenqiu L, Xin T (2014) Bayesian inference and application of robust growth curve
models using student’s t distribution. Structural Equation Modeling, 20, 47–78.
11
Survival and Event History Models

11.1 Introduction
In many applications in the health and social sciences, the response of interest is duration
to a certain event, such as age at first maternity, survival time after diagnosis, or times
spent in different jobs or places of residence. In clinical applications, the interest is typically
in representing and comparing the distribution of times to an event among different
patient groups (e.g. treatment vs control groups) (Brard et al., 2017), whereas in social science
applications, the interest may focus on the impacts of demographic or socioeconomic
attributes on human behaviours.
Typically, durations or event times are not observed for all subjects, either because not
all subjects are followed up, or because for some subjects the event may never occur (e.g. age
at first marriage). So some times are missing or censored, and the missingness mechanism
is generally assumed to be at random. The most common form is right-censoring, when
the event has not occurred by the end of the observation period; the unknown failure time
then exceeds the subject's survival time c when observation ceased. A failure time is left
censored at c if its unobserved actual value is less than c (e.g. a population census may record
limiting illness status by current age, but not the age when it commenced). A failure time
is interval censored if it is known only that it lies in the interval (c1, c2).
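
As a rough illustration of the three censoring types, the sketch below encodes them with Surv() from the R survival package. The times are made-up values rather than data from the text, and the object names are purely illustrative.

```r
library(survival)

# Right censoring: event = 1 if the exit was observed, 0 if the subject was
# still event-free when observation ceased at time c
right <- Surv(time = c(5, 8, 12), event = c(1, 0, 1))

# Left censoring: event = 0 flags a time known only to be below the recorded value
left <- Surv(time = c(40, 55), event = c(0, 1), type = "left")

# Interval censoring in "interval2" form: bounds (c1, c2), with NA for an open end
interval <- Surv(time = c(2, NA, 6), time2 = c(4, 3, NA), type = "interval2")

print(right)
print(left)
print(interval)
```

In the "interval2" form, a missing lower bound is read as left censoring and a missing upper bound as right censoring, so one object can mix all three types.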
Distributions of durations or survival times are equivalently described by hazard rates,
also known as failure rates, exit rates, or forces of mortality according to the application. The
modelling of the hazard rate through time may be undertaken parametrically. Alternatively,
one may adopt semiparametric methods, such as assuming piecewise constancy in the
rates within sub-intervals of the observation span (Ibrahim et al., 2001). Pooling strength
through correlated priors is then relevant, as rates in successive intervals tend to be similar.
Imposing smoothness conditions on the baseline hazard also provides stable estimators
when observations are sparse at particular durations (Omori, 2003).
Variations in failure rates between subjects or other units may be explained to a large
degree by observed covariates, the impact of which may also vary over intervals or time.
Selection of covariates may be relevant in particular applications (Lee et al., 2015). However,
unobserved random variations between subjects are present in many applications and may
be modelled by introducing subject level frailty (see Section 11.4). Additionally, duration
times may be hierarchically stratified (e.g. patient survival by hospital or by area of residence;
Austin, 2017). Durations or survival times may also be differentiated by types of possible
exit, as in competing risk analysis (see Section 11.7). One may also consider multivariate
survival outcomes, as in multiple component failure (Damien and Muller, 1998) or in familial
survival studies (Viswanathan and Manatunga, 2001). In such situations, shared frailty models
may account for correlated unobserved variation over different strata or causes of exit.


Survival analysis options in R are summarised at https://cran.r-project.org/web/views/Survival.html,
with a review provided by Crippa (2018). Recently developed R
survival packages using Bayesian computing include biostan (https://github.com/jburos/biostan),
BayesMixSurv (https://cran.r-project.org/web/packages/BayesMixSurv/index.html),
dynsurv (https://cran.r-project.org/web/packages/dynsurv/dynsurv.pdf),
MRH (Hagar et al., 2017), bamlss (Umlauf et al., 2017), R2BayesX, spBayesSurv
(Zhou et al., 2017, 2018) for spatially nested data, CFC
(https://cran.r-project.org/web/packages/CFC/CFC.pdf), and icensBKL (Bogaerts et al., 2017).
The survivalstan package (https://jburos.github.io/survivalstan/index.html) is for Python,
but with transferable rstan codes.

11.2 Survival Analysis in Continuous Time


Let T denote a survival time. The distribution function of T, giving the probability of
exit by time t, is then

F(t) = Pr(T ≤ t),

while the probability of surviving beyond t is S(t) = 1 − F(t) = Pr(T > t). Note that one has
S(∞) = 0, except for applications with a cure fraction (Lambert, 2007). So, the density of T
can be expressed as

$$f(t) = \frac{dF(t)}{dt} = -\frac{dS(t)}{dt}.$$

The chance of an event occurring in a short interval (t, t + dt), given survival to t, is

$$\Pr(t < T \le t + dt \mid T > t) = \frac{\Pr(t < T \le t + dt)}{\Pr(T > t)} = \frac{F(t + dt) - F(t)}{S(t)}.$$

The hazard function h(t) is the instantaneous event rate, obtained as dt → 0 in the ratio of
the preceding probability to the length of the interval dt. That is

$$h(t) = \lim_{dt \to 0} \frac{F(t + dt) - F(t)}{dt}\,\frac{1}{S(t)} = \lim_{dt \to 0} \frac{S(t) - S(t + dt)}{dt}\,\frac{1}{S(t)} = \frac{f(t)}{S(t)}.$$

Since −f(t) is the derivative of S(t), one obtains that h(t) = −S′(t)/S(t), and so

$$h(t) = -\frac{d \log S(t)}{dt}. \tag{11.1}$$

On integrating both sides in (11.1), one obtains the cumulative hazard rate

$$H(t) = \int_0^t h(u)\,du = \int_0^t \left[-\frac{d \log S(u)}{du}\right] du = -\int_0^t d \log S(u) = -\log S(t),$$

and so

$$S(t) = \exp[-H(t)] = \exp\left[-\int_0^t h(u)\,du\right].$$
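
The links between h(t), H(t), S(t) and f(t) can be checked numerically. The following is a minimal sketch in base R, assuming an illustrative hazard of the form h(t) = λκt^(κ−1) with arbitrary parameter values; none of the object names come from the text.

```r
# Illustrative hazard h(t) = lambda * kappa * t^(kappa - 1); values are arbitrary
lambda <- 0.5
kappa  <- 1.5
h <- function(t) lambda * kappa * t^(kappa - 1)
S <- function(t) exp(-lambda * t^kappa)            # closed-form survivor function

t0 <- 2
H_num <- integrate(h, lower = 0, upper = t0)$value # cumulative hazard by quadrature
S_num <- exp(-H_num)                               # S(t) = exp[-H(t)]

# f(t) = h(t) S(t) should match -dS/dt, here a central finite difference
eps  <- 1e-6
f_fd <- -(S(t0 + eps) - S(t0 - eps)) / (2 * eps)

c(H_numeric = H_num, H_closed = -log(S(t0)))
c(S_numeric = S_num, S_closed = S(t0))
c(f_from_hS = h(t0) * S(t0), f_finite_diff = f_fd)
```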

The hazard function is the central focus for modelling variations in survival. Assume predictors
Zi are available (excluding a constant). Their impact is most simply modelled using
a proportional hazards form (e.g. Kiefer, 1988; Li, 2007)

$$h(t \mid Z_i) = h_0(t)\exp(Z_i\beta),$$

where h0(t) is known as the baseline hazard, and the regression impact is constant across
time. Letting $\eta_i = Z_i\beta$, the associated survivor function is

$$\begin{aligned}
S(t \mid Z_i) &= \exp\left[-\int_0^t h(u \mid Z_i)\,du\right] \\
&= \exp\left[-H_0(t)e^{\eta_i}\right] \\
&= [S_0(t)]^{\exp(\eta_i)} \\
&= \exp\left\{-\exp\left[\eta_i + \log H_0(t)\right]\right\},
\end{aligned}$$

where H0(t) is the integrated baseline hazard. The proportional hazard assumption is often
restrictive, though Yin and Ibrahim (2006) show the proportional hazard model (PHM) may
be nested in a broader class of transformation hazard models, with parameter $0 \le \gamma \le 1$ and

$$h(t \mid Z_i) = \left[h_0(t)^{\gamma} + \exp(Z_i\beta)\right]^{1/\gamma},$$

which reduces to the proportional model when γ = 0 and to an additive model when γ = 1.
Consider an absorbing (non-repeatable) type of exit, and let $d_i = 1$ for an observed exit
and $d_i = 0$ for a censored time. Assuming censoring is non-informative, the likelihood
contribution for subject i is

$$f(t_i \mid Z_i) = h(t_i \mid Z_i)S(t_i \mid Z_i)$$

if $d_i = 1$, and $S(t_i \mid Z_i)$ if $d_i = 0$. The likelihood contribution may therefore be expressed in
equivalent form as

$$h(t_i \mid Z_i)^{d_i} S(t_i \mid Z_i) = f(t_i \mid Z_i)^{d_i} S(t_i \mid Z_i)^{1-d_i}.$$
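
The censored-data likelihood above translates directly into code. The sketch below assumes a proportional hazards model with a Weibull-type baseline h0(t) = λκt^(κ−1), simulated data, and maximisation by optim() rather than posterior sampling; the function name loglik_ph and all parameter values are hypothetical.

```r
# Log-likelihood sum_i [ d_i * log h(t_i|Z_i) + log S(t_i|Z_i) ] for a proportional
# hazards model with assumed baseline h0(t) = lambda * kappa * t^(kappa - 1)
loglik_ph <- function(par, time, status, Z) {
  lambda <- exp(par[1]); kappa <- exp(par[2])   # positivity via the log scale
  beta   <- par[-(1:2)]
  eta    <- as.vector(Z %*% beta)
  log_h  <- log(lambda) + log(kappa) + (kappa - 1) * log(time) + eta
  log_S  <- -lambda * time^kappa * exp(eta)     # -H0(t) * exp(eta)
  sum(status * log_h + log_S)
}

set.seed(1)
n <- 200
Z <- cbind(z = rbinom(n, 1, 0.5))
# Simulate event times by inverting S(t|Z) = exp(-lambda * t^kappa * exp(eta))
t_event <- (-log(runif(n)) / (0.2 * exp(0.7 * Z[, 1])))^(1 / 1.3)
cens    <- runif(n, 0, 6)                       # non-informative censoring times
time    <- pmin(t_event, cens)
status  <- as.numeric(t_event <= cens)

fit <- optim(c(0, 0, 0), function(p) -loglik_ph(p, time, status, Z), method = "BFGS")
c(lambda = exp(fit$par[1]), kappa = exp(fit$par[2]), beta = fit$par[3])
```

The same function could serve as the log-likelihood component of a sampler once priors on λ, κ and β are added.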

For a PHM, the likelihood contribution also may be written (Aitkin and Clayton, 1980;
Orbe and Nunez-Anton, 2006) as

$$\left[h_0(t_i)\exp\{Z_i\beta - H_0(t_i)\exp(Z_i\beta)\}\right]^{d_i}\left[\exp\{-H_0(t_i)\exp(Z_i\beta)\}\right]^{1-d_i} = \left(\mu_i^{d_i}e^{-\mu_i}\right)\left[\frac{h_0(t_i)}{H_0(t_i)}\right]^{d_i} \tag{11.2}$$

where

$$\mu_i = H_0(t_i)\exp(Z_i\beta) \tag{11.3}$$

and the second bracketed term in (11.2) depends only on the baseline hazard and is independent
of β. The first term in (11.2) is the kernel of a Poisson likelihood for the event status
indicators $d_i \sim \mathrm{Po}(\mu_i)$. From (11.3), the corresponding log-linear model is

$$\log(\mu_i \mid t_i, Z_i) = \log(H_0(t_i)) + Z_i\beta \tag{11.4}$$

where log(H0(ti)) is an offset using the observed time, whether censored or uncensored.

11.2.1 Counting Process Functions


For repeated events in continuous time, especially with successive durations not necessarily
independent, it may be advantageous to use additional functions. The count of failures
N(t) occurring over (0,t] for a given individual or component system defines a counting
process satisfying N(s) ≤ N(t) for s < t. For a non-repeatable event, the counting process
may still be useful (e.g. in modelling time-varying impacts of predictors), and one may
define N(t) = I(T ≤ t), an indicator of whether the event has occurred by t. The
counting process formulation extends straightforwardly to time-dependent covariates and
frailty (Chen et al., 2014).
For event types considered over sufficiently small intervals, the counting process increments
$dN(t) = N(t) - N(t-)$ are either 1 or 0, where N(t−) denotes $\lim_{\delta \downarrow 0} N(t - \delta)$ (Manda
et al., 2005). Let A(t−) denote the antecedent history of the event sequence up to, but not
including, t. Then conditional on A(t−), the probability that dN(t) = 1 can be written in terms
of an intensity process λ(t), namely,

$$\Pr\{N(t + \delta) - N(t-) = 1 \mid A(t-)\} \approx \lambda(t)\delta.$$

Equivalently

$$\Pr\{dN(t) = 1 \mid A(t-)\} \approx d\Lambda(t),$$

where $\Lambda(t) = \int_0^t \lambda(u)\,du$ is the integrated intensity, with Λ(t) = E(N(t)).
The intensity is equal to the hazard while the subject or system is still under observation,
that is, still at risk, but is zero when the event has happened (when the event is
non-repeatable), or when a sequence of (repeatable) events has finished. An example of
the latter might be when a repairable system subject to repeated breakdowns is finally
decommissioned – see Watson et al. (2002) for a counting process analysis of failure times
of water pipes. Let Y(t) = I(T ≥ t) denote the at-risk indicator, then

$$\lambda(t) = Y(t)h(t).$$

This representation of the intensity function generalises to include predictors and random
effects (or frailties). So for proportional hazards, effects of predictor Zi would be included via

$$\lambda(t_i \mid Z_i) = Y(t_i)h_0(t_i)\exp(Z_i\beta).$$



One may then compare observed and predicted counts via the martingale residual at t,
defined as

$$M_i(t) = N(t_i) - \Lambda_0(t_i \mid Z_i) = N(t_i) - \int_0^{t_i} Y_i(u)\exp(Z_i\beta)\,dH_0(u).$$

The total residual Mi = Mi(∞) for a subject with observation time ti is obtainable for a
non-repeatable event, and event indicator di, as

$$M_i = d_i - \Lambda_0(t_i \mid Z_i).$$

Deviance residuals ri are obtained as

$$r_i = \mathrm{sgn}(M_i)\sqrt{-2\left[M_i + N_i(\infty)\log\left(\frac{N_i(\infty) - M_i}{N_i(\infty)}\right)\right]}.$$

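For a non-repeatable event, the residuals just described reduce to simple functions of the event indicator and the integrated intensity. The sketch below assumes a constant baseline hazard with known, illustrative regression parameters (in practice these would be posterior draws or estimates). It uses the standard form of the deviance residual, in which the quantity under the square root is −2[M + d log((d − M)/d)], with the logarithmic term taken as zero for censored subjects.

```r
set.seed(2)
n <- 100
z <- rnorm(n)
beta0 <- -1; beta <- 0.5                 # illustrative parameter values, treated as known
rate    <- exp(beta0 + beta * z)         # constant hazard for each subject
t_event <- rexp(n, rate)
cens    <- rexp(n, 0.2)                  # independent censoring times
time    <- pmin(t_event, cens)
d       <- as.numeric(t_event <= cens)   # N_i(Inf) for a non-repeatable event

Lambda <- time * rate                    # integrated intensity up to t_i
M <- d - Lambda                          # martingale residuals

# Deviance residuals: sgn(M) * sqrt(-2 * [M + d * log((d - M) / d)]),
# with the d * log(.) term set to 0 when d = 0; pmax() guards against rounding error
log_term <- ifelse(d == 1, log(Lambda), 0)
r <- sign(M) * sqrt(pmax(0, -2 * (M + d * log_term)))

summary(M)
summary(r)
```

Martingale residuals are bounded above by 1 and can be strongly skewed, whereas the deviance residuals are more nearly symmetric about zero.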
11.2.2 Parametric Hazards
The hazard rate h(t) is called “duration dependent” if its value changes over t. Under
negative duration dependence (often observed in occupational or residential careers),
h(t) decreases with time. In practice, plots of survivor proportions are often jagged with
respect to time, and semiparametric or non-parametric methods for representing the hazard
function reflect this. However, parametric lifetime models are also often applied to
test whether certain basic features of duration dependence are supported by the data; see
http://rstudio-pubs-static.s3.amazonaws.com/5560_c24449c468224fd4af9f3e512a24e07d.html
for a discussion of exploratory graphical comparison of parametric and non-parametric
approaches in R.
The simplest parametric model is the exponential model, under which the leaving rate is
constant, defining a stationary process with hazard

$$h(t) = \lambda,$$

survival function $S(t \mid \lambda) = \exp(-\lambda t)$, and density

$$f(t \mid \lambda) = \lambda\exp(-\lambda t).$$

Contributions to the likelihood depend on event status di, namely

$$f(t_i \mid \lambda, d_i) = \lambda^{d_i}\exp(-\lambda t_i).$$
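
With this likelihood, the rate has a closed-form maximum likelihood estimate (events divided by total exposure), and a gamma prior is conjugate. The following sketch uses simulated data and an assumed Gamma(a, b) prior with shape a and rate b; the prior values are illustrative and not taken from the text.

```r
set.seed(3)
n <- 50
t_event <- rexp(n, rate = 0.3)
cens    <- runif(n, 0, 8)
time    <- pmin(t_event, cens)
d       <- as.numeric(t_event <= cens)

# Log-likelihood: sum_i [ d_i * log(lambda) - lambda * t_i ]
loglik <- function(lambda) sum(d) * log(lambda) - lambda * sum(time)

lambda_mle <- sum(d) / sum(time)                       # closed-form maximiser
lambda_num <- optimize(loglik, c(1e-6, 5), maximum = TRUE)$maximum

# Assumed Gamma(a, b) prior gives a Gamma(a + sum(d), b + sum(time)) posterior
a <- 0.01; b <- 0.01
post_mean <- (a + sum(d)) / (b + sum(time))
post_ci   <- qgamma(c(0.025, 0.975), shape = a + sum(d), rate = b + sum(time))

c(mle = lambda_mle, numeric = lambda_num, post_mean = post_mean)
post_ci
```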

With covariates Zi (excluding an intercept), and assuming proportionality

$$h(t \mid Z_i) = \lambda e^{Z_i\beta}.$$

Alternatively, setting $\lambda = e^{\beta_0}$, exponential hazard regression can be represented as

$$h(t \mid Z_i) = e^{\beta_0 + Z_i\beta}.$$

Equivalently, under the Poisson likelihood approach of Aitkin and Clayton (1980), one has,
for event indicators di,

$$d_i \sim \mathrm{Po}(\mu_i),$$

where, from (11.4),

$$\log(\mu_i) = \log(\lambda t_i) + Z_i\beta,$$

since $H_0(t) = \lambda t$. Absorbing λ into the regression term, one has

$$\log(\mu_i) = \log(t_i) + \beta_0 + Z_i\beta.$$

This Poisson likelihood device can be used in piecewise exponential models as considered
below.
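
The Poisson device for the exponential model can be tried with glm() in base R, treating log(ti) as an offset. This is a maximum likelihood sketch on simulated data, not a Bayesian fit; with a single binary covariate the estimates can be checked against events divided by exposure within each group.

```r
set.seed(4)
n <- 300
z <- rbinom(n, 1, 0.5)
t_event <- rexp(n, rate = exp(-1 + 0.6 * z))   # illustrative true values
cens    <- runif(n, 0, 10)
time    <- pmin(t_event, cens)
d       <- as.numeric(t_event <= cens)

# d_i ~ Po(mu_i) with log(mu_i) = log(t_i) + beta0 + beta * z_i
fit <- glm(d ~ z + offset(log(time)), family = poisson)
coef(fit)

# Closed-form check: group-specific rates are events / total exposure
rate0 <- sum(d[z == 0]) / sum(time[z == 0])
rate1 <- sum(d[z == 1]) / sum(time[z == 1])
c(beta0 = log(rate0), beta = log(rate1 / rate0))
```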
Another commonly used parametric form is the Weibull (e.g. Thamrin et al., 2013), with
scale parameter λ and shape κ, namely

$$h(t \mid \lambda, \kappa) = \lambda\kappa t^{\kappa-1},$$

so that

$$S(t \mid \lambda, \kappa) = \exp[-\lambda t^{\kappa}],$$

$$f(t \mid \lambda, \kappa) = \lambda\kappa t^{\kappa-1}\exp[-\lambda t^{\kappa}].$$

The Weibull hazard rate is monotonic, with positive duration dependence if κ > 1 (supported
when the 95% credible interval for κ lies entirely above 1), and negative dependence if κ < 1.
Impacts of covariates, including an intercept, on the hazard, are represented via

$$h(t_i \mid \lambda, \kappa, Z_i, \beta) = e^{\beta_0 + Z_i\beta}\kappa t_i^{\kappa-1}.$$
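
The Weibull form used here differs from the parameterisation of the Weibull functions in base R, which take S(t) = exp[−(t/scale)^shape]. A short check, with illustrative parameter values, shows that the two agree when shape = κ and scale = λ^(−1/κ).

```r
lambda <- 0.4; kappa <- 1.8            # illustrative values
t <- c(0.5, 1, 2, 3)

S_text <- exp(-lambda * t^kappa)
f_text <- lambda * kappa * t^(kappa - 1) * exp(-lambda * t^kappa)

# Base R uses S(t) = exp(-(t / scale)^shape), so scale = lambda^(-1 / kappa)
scale_R <- lambda^(-1 / kappa)
S_R <- pweibull(t, shape = kappa, scale = scale_R, lower.tail = FALSE)
f_R <- dweibull(t, shape = kappa, scale = scale_R)

all.equal(S_text, S_R)
all.equal(f_text, f_R)
f_text / S_text                        # hazard, increasing in t because kappa > 1
```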

The generalised gamma density (Stacy, 1962; Cox and Matheson, 2014) is of interest in
including the Weibull, gamma, and lognormal as special cases. This has various parameterisations,
with BUGS using the representation

$$f(t \mid \alpha, \lambda, \gamma) = \frac{\gamma}{\Gamma(\alpha)}\lambda^{\alpha\gamma}t^{\alpha\gamma-1}\exp\left[-(\lambda t)^{\gamma}\right],$$

as in Morris et al. (1994) (see Example 11.1 below). Instead, one may take $B = \lambda^{\gamma}$, leading to

$$f(t \mid \alpha, \lambda, \gamma) = \frac{\gamma}{\Gamma(\alpha)}B^{\alpha}t^{\alpha\gamma-1}\exp[-t^{\gamma}B],$$

where setting γ = 1 gives the gamma, α = 1 gives the Weibull, and α → ∞ provides the lognormal.
The survival function is $1 - I(\alpha, t^{\gamma}B)$, where $I(a, x) = (1/\Gamma(a))\int_0^x u^{a-1}e^{-u}\,du$. Cox and
Matheson (2014) consider a parameterisation involving δ = γ0.5, allowed to take both positive
and negative values, with the sign of δ leading to different survivor functions.
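
The special cases of the generalised gamma can be verified numerically. The sketch below codes the density in the B = λ^γ form and obtains the survivor function from the regularised incomplete gamma function via pgamma(), taking the shape α as its first argument (an assumption that the final comparison against numerical integration is intended to check). The function names dgengamma and sgengamma are hypothetical, and gam stands in for γ to avoid masking the gamma() function.

```r
# Generalised gamma density in the B = lambda^gam form
dgengamma <- function(t, alpha, lambda, gam) {
  B <- lambda^gam
  gam / gamma(alpha) * B^alpha * t^(alpha * gam - 1) * exp(-B * t^gam)
}
# Survivor function 1 - I(alpha, B * t^gam) via the regularised incomplete gamma
sgengamma <- function(t, alpha, lambda, gam) {
  pgamma(lambda^gam * t^gam, shape = alpha, lower.tail = FALSE)
}

t <- c(0.5, 1, 2.5)
# gam = 1: gamma density with shape alpha and rate lambda
chk_gamma <- all.equal(dgengamma(t, alpha = 2, lambda = 0.7, gam = 1),
                       dgamma(t, shape = 2, rate = 0.7))
# alpha = 1: Weibull with shape gam and R scale 1 / lambda
chk_weib <- all.equal(dgengamma(t, alpha = 1, lambda = 0.7, gam = 1.5),
                      dweibull(t, shape = 1.5, scale = 1 / 0.7))
# Survivor function against numerical integration of the density over (1, Inf)
num <- integrate(dgengamma, 1, Inf, alpha = 2, lambda = 0.7, gam = 1.5)$value

c(gamma_case = chk_gamma, weibull_case = chk_weib)
c(closed = sgengamma(1, 2, 0.7, 1.5), numeric = num)
```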

Preliminary assessment of different parametric hazards can be obtained by density estimation
without covariates (e.g. using fitdistr from the MASS package in R), or by graphical means.
For example, under the Weibull one has $\log S(t \mid \lambda, \kappa) = -\lambda t^{\kappa}$, and hence
$\log(-\log S(t \mid \lambda, \kappa)) = \log\lambda + \kappa\log t$.
Therefore, a plot of $\log(-\log\hat{S}(t \mid \lambda, \kappa))$ against log(t) should be approximately linear when
a Weibull is appropriate. An initial assessment can be made using a Kaplan–Meier estimate
of S(t) from the R survival package. Assume a dataset named survdat containing variables time
(corresponding to observed survival times ti, whether censored or not), and status (corresponding
to di). Then the procedure is

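  library(survival)
  # Surv object from follow-up times and event indicators (1 = event, 0 = censored)
  KMinputs <- Surv(survdat$time, survdat$status)
  # Kaplan-Meier estimate without covariates (survfit needs a formula)
  KM <- survfit(KMinputs ~ 1)
  # log(-log S(t)) against log(t); near-linearity supports a Weibull
  plot(log(KM$time), log(-log(KM$surv)), type = "S")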

However, many processes exhibit peaks in exit rates; for example, the rate may at first
increase, but after reaching a peak, tail off again (Gore et al., 1984; Shao and Zhou, 2004).
Parametric models accommodating such a pattern include the log-logistic model and the
sickle model (Bennett, 1983; Brüderl and Diekmann, 1995; Diekmann and Mitter, 1983). The
log-logistic density has hazard

$$h(t) = \lambda\kappa t^{\kappa-1}\left[1 + \lambda t^{\kappa}\right]^{-1},$$

and survivor function

$$S(t) = \left[1 + \lambda t^{\kappa}\right]^{-1},$$

where all parameters are positive, and the scale parameter λ can be adapted to model the
impact of predictors; see Li (1999) for a Bayesian application to Chapter 11 bankruptcies.
An alternative common parameterisation (Florens et al., 1995) sets l = n k , so that

n k k tk -1
h(t) = . (11.5)
[1 + (n t)k ]
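The peaked shape of the log-logistic hazard can be illustrated with a short R sketch. Differentiating log h(t) for the hazard above gives a maximum at $t^* = ((\kappa-1)/\lambda)^{1/\kappa}$ when κ > 1 (for κ ≤ 1 the hazard declines throughout); the parameter values below are purely illustrative.

```r
# Log-logistic hazard and survivor function in the (lambda, kappa) form
h_loglogistic <- function(t, lambda, kappa) {
  lambda * kappa * t^(kappa - 1) / (1 + lambda * t^kappa)
}
S_loglogistic <- function(t, lambda, kappa) 1 / (1 + lambda * t^kappa)

lambda <- 0.05; kappa <- 2.5   # illustrative values only

# Setting d log h / dt = 0 gives a peak at ((kappa - 1) / lambda)^(1 / kappa)
t_peak_formula <- ((kappa - 1) / lambda)^(1 / kappa)
t_peak_numeric <- optimize(h_loglogistic, interval = c(0.01, 50),
                           lambda = lambda, kappa = kappa,
                           maximum = TRUE)$maximum

# Alternative parameterisation (11.5) with lambda = nu^kappa
nu <- lambda^(1 / kappa)
h_alt <- function(t) nu^kappa * kappa * t^(kappa - 1) / (1 + (nu * t)^kappa)
```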
The sickle model has corresponding functions

$$h(t) = c\,t\,e^{-t/\lambda},$$

$$S(t) = \exp\left[-\lambda c\left\{\lambda - (t+\lambda)e^{-t/\lambda}\right\}\right],$$

with both c and λ positive. The sickle model has a permanent survival probability or “cure rate” (Chen et al., 1999) in that S(∞) > 0 (see Section 11.4.1). In general, one may define a cure rate r = 1 − π as the limit as t → ∞ of the survivor function, namely (Tsodikov et al., 2003)

$$r = \lim_{t\to\infty} S(t) = \exp\left(-\int_0^{\infty} h(u)\,du\right),$$

with π denoting the proportion of susceptibles.
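For the sickle model the integrated hazard over (0, ∞) is cλ², so the cure rate is exp(−cλ²). A minimal R sketch, using illustrative parameter values and the name c_par for the constant c, checks this against numerical integration.

```r
# Sickle model hazard and survivor function (c_par stands for the constant c)
h_sickle <- function(t, c_par, lambda) c_par * t * exp(-t / lambda)
S_sickle <- function(t, c_par, lambda) {
  exp(-lambda * c_par * (lambda - (t + lambda) * exp(-t / lambda)))
}

c_par <- 0.02; lambda <- 5   # illustrative values only

# Cure rate r = exp(-integral of the hazard over (0, Inf)); for the sickle
# model the integral is c * lambda^2
cum_hazard_total <- integrate(h_sickle, 0, Inf,
                              c_par = c_par, lambda = lambda)$value
cure_rate <- exp(-cum_hazard_total)
cure_rate_closed <- exp(-c_par * lambda^2)

# The hazard peaks at t = lambda and then tails off
t_peak <- optimize(h_sickle, c(0, 100), c_par = c_par, lambda = lambda,
                   maximum = TRUE)$maximum
```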



11.2.3 Accelerated Hazards
In contrast to the proportional hazard model with $h(t_i|Z_i) = h_0(t_i)\exp(Z_i\beta)$, in an accelerated failure time (AFT) model the explanatory variates are assumed to act multiplicatively on time (Wei, 1992; Swindell, 2009; Rivas-López et al., 2014). AFT models focus on the effect of the explanatory variates on the survival function, rather than on the hazard function as under proportional hazards models. With $B_i = \exp(Z_i\beta)$, one has

$$h(t_i|Z_i) = h_0(t_iB_i)B_i,$$

$$S(t_i|Z_i) = S_0(t_iB_i),$$

and the effect of the predictors Zi on survival time is more direct, acting to accelerate or decelerate the time to failure. To illustrate this in the case of a treatment comparison, assume Zi excludes an intercept, and that the baseline hazard includes a scale parameter to model the mean hazard (e.g. the parameter λ in exponential and Weibull models). Also assume a single predictor such as zi = 1 for a new treatment and zi = 0 for control. Then, with $B_i = e^{\beta z_i} = e^{\beta}\ (= \phi)$ for a treated subject, one has a hazard $\phi h_0(\phi t_i)$ and a survivor function $S_0(\phi t_i)$ for a treated subject, but a hazard $h_0(t_i)$ and a survivor function $S_0(t_i)$ for a control subject. So, the lifetime under the new treatment is 1/ϕ times the lifetime under the control regime, with ϕ > 1 accelerating failure.
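The rescaling of the time axis can be made concrete with a small R sketch. It assumes a Weibull baseline and a hypothetical treatment coefficient; under the hazard ϕh0(ϕt) and survivor function S0(ϕt) given above, each quantile of the treated lifetime equals the corresponding control quantile divided by ϕ.

```r
# Baseline (control) lifetimes: Weibull with illustrative shape and scale
shape0 <- 1.4; scale0 <- 10
S0 <- function(t) pweibull(t, shape0, scale0, lower.tail = FALSE)
h0 <- function(t) dweibull(t, shape0, scale0) / S0(t)

beta <- log(2)        # hypothetical treatment coefficient
phi <- exp(beta)      # acceleration factor for a treated subject (z = 1)

# AFT form: treated survivor function S0(phi * t), hazard phi * h0(phi * t)
S_treated <- function(t) S0(phi * t)
h_treated <- function(t) phi * h0(phi * t)

# Quantiles of the treated lifetime, found by solving S_treated(t) = 1 - p
probs <- c(0.25, 0.5, 0.75)
q_control <- qweibull(probs, shape0, scale0)
q_treated <- sapply(probs, function(p) {
  uniroot(function(t) S_treated(t) - (1 - p), c(1e-6, 1e3), tol = 1e-10)$root
})

# Ratio of control to treated quantiles; each element should equal phi
quantile_ratio <- q_control / q_treated
```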
More inclusive schemes are possible. For example, defining $G_i = \exp(Z_i\gamma)$, one has

$$h(t_i|Z_i) = h_0(t_iG_i)B_i,$$

which includes the AFT and PHM forms as special cases (Chen and Jewell, 2001). For example, for the log-logistic density, this would imply

$$h(t_i|Z_i) = B_i\lambda\kappa(t_iG_i)^{\kappa-1}\left[1 + B_i\lambda(t_iG_i)^{\kappa}\right]^{-1}.$$

Apart from avoiding the assumption of proportional hazards, the AFT approach has the advantage of a direct regression form which may be useful in modelling nonlinear effects of predictors (Orbe and Nunez-Anton, 2006). Let Zi be of dimension p, and Ti denote the completed failure time, which for censored subjects is unobserved. Then Ti = ti when di = 1 but Ti > ti when di = 0, so truncated sampling with the censored time as the lower limit is necessary. The regression formulation is then

$$\log(T_i) = Z_i\gamma + \sigma u_i,$$

where σ is a scale parameter, and the errors are defined by the survivor function, namely

$$S(t_i) = \Pr(T_i > t_i) = \Pr(\log(T_i) > \log(t_i)) = \Pr\left(u_i > \frac{\log(t_i) - \gamma_0 - z_{1i}\gamma_1 - \ldots - z_{pi}\gamma_p}{\sigma}\right).$$

A positive γj coefficient means that zj leads to longer survival or length of stay.
Taking u to be standard normal corresponds to a log-normal density for failure times ti, under which

$$S(t_i) = 1 - \Phi\left(\frac{\log(t_i) - Z_i\gamma}{\sigma}\right).$$

Taking u to be standard logistic, with density $p(u) = e^{u}/(1 + e^{u})^2$, corresponds to a log-logistic failure time density with

$$S(t_i) = \left\{1 + \exp\left[\frac{\log(t_i) - Z_i\gamma}{\sigma}\right]\right\}^{-1},$$

with σ corresponding to the inverse of the shape parameter κ. Finally, consider a Weibull density for failure times with hazard $h(t_i|Z_i) = \lambda\kappa t_i^{\kappa-1}\exp(\beta Z_i)$ where Zi excludes a constant term. Taking u to follow a standard extreme value density, namely $p(u) = \exp(u - e^{u})$, the AFT regression takes the form (Keiding et al., 1997)

$$\log(T_i) = -\frac{\log\lambda}{\kappa} - z_{1i}\frac{\beta_1}{\kappa} - \ldots - z_{pi}\frac{\beta_p}{\kappa} + \frac{u_i}{\kappa},$$

so that $\gamma_j = -\beta_j/\kappa$.
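The correspondence between the Weibull PH and AFT forms can be verified through quantiles. The sketch below uses illustrative parameter values and a single covariate z; it relies on the fact that the standard extreme value density exp(u − e^u) has quantile function log(−log(1 − p)).

```r
lambda <- 0.03; kappa <- 0.8; beta <- 0.5   # illustrative values only

# Weibull PH survivor function: S(t | z) = exp(-lambda * exp(beta * z) * t^kappa)
S_ph <- function(t, z) exp(-lambda * exp(beta * z) * t^kappa)

# Quantile function obtained by inverting S_ph
q_ph <- function(p, z) (-log(1 - p) / (lambda * exp(beta * z)))^(1 / kappa)

# AFT form: log(T) = -log(lambda)/kappa + gamma * z + u / kappa, where u has
# the standard extreme value density, with quantiles log(-log(1 - p))
gamma_aft <- -beta / kappa
q_aft <- function(p, z) {
  exp(-log(lambda) / kappa + gamma_aft * z + log(-log(1 - p)) / kappa)
}

probs <- c(0.1, 0.5, 0.9)

# Shift in log median survival between z = 1 and z = 0 equals gamma_aft
log_median_shift <- log(q_ph(0.5, 1)) - log(q_ph(0.5, 0))
```

A positive β (higher hazard) therefore maps to a negative γ (shorter survival), which is why the PH and AFT columns of a fitted model carry opposite signs.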

Example 11.1 Nursing Home Stays


Morris et al. (1994) analyse lengths of stay ti for n = 1601 nursing home patients, with stay usually terminated by death, with predictors being patient age and gender, their marital status and dependency level, and the type of nursing home assignment (1: receive treatment, 0: control) (www.stats.ox.ac.uk/pub/datasets/csb/). There are 322 censored lengths of stay.
Predictor effects are initially assessed via proportional hazard (PH) Weibull regression. Under the Weibull, one has hazard

$$h(t|\lambda_i,\kappa) = \lambda_i\kappa t_i^{\kappa-1}, \qquad (11.6)$$

$$\lambda_i = \exp(Z_i\beta),$$

with density

$$f(t|\lambda_i,\kappa) = \lambda_i\kappa t_i^{\kappa-1}\exp(-\lambda_i t_i^{\kappa}).$$

This is equivalent to accelerated hazards regression for logged length of stay with error ui,

$$\log(t_i) = Z_i\gamma + \sigma u_i,$$

where $\gamma = -\beta/\kappa$. The γ coefficients express influences on length of stay (i.e. survival) while the β coefficients express influences on mortality.
We consider rstan estimation for the PH Weibull model, with both priors and likelihoods represented using the target += option, and an implicit flat prior on the regression coefficients. For the Weibull analysis, the log likelihood for censored cases is provided by the weibull_lccdf function in rstan, namely the log of the Weibull complementary cumulative distribution function for the response ti. It may be noted that the rstan parameterisation of the Weibull is

$$h(t|\lambda_i,\kappa) = \frac{\kappa}{\lambda_i}\left(\frac{t_i}{\lambda_i}\right)^{\kappa-1},$$
differing from that in BUGS and JAGS, and a re-expression of the regression term ηi = Ziβ is needed to achieve results in line with the parameterisation (11.6) (Buros, 2016). Thus, one obtains the code elements

   transformed parameters {
     real eta[n];   // linear predictor Z_i * beta
     real nu[n];    // Weibull scale in the Stan parameterisation
     for (i in 1:n) {
       eta[i] = beta[1] + beta[2]*age[i]/100 + beta[3]*trt[i] + beta[4]*gender[i]
                + beta[5]*marstat[i] + beta[6]*hltst3[i] + beta[7]*hltst4[i]
                + beta[8]*hltst5[i];
       nu[i] = exp(-eta[i]/kappa);
     }
   }
   model {
     for (i in 1:n) {
       // observed exits contribute the log density, censored stays the log survivor function
       if (censored[i] == 0) {target += weibull_lpdf(time[i] | kappa, nu[i]);}
       else if (censored[i] == 1) {target += weibull_lccdf(time[i] | kappa, nu[i]);}
     }
   }
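The re-expression of the scale can be checked outside Stan, since R's dweibull and pweibull use the same shape-scale form. The sketch below uses hypothetical linear predictors and stay lengths; it confirms that a scale of exp(−η/κ) reproduces the hazard in (11.6) with λ = exp(η).

```r
kappa <- 0.61                   # shape, of the order reported for this example
eta <- c(-1.2, -0.4, 0.3)       # hypothetical linear predictors Z_i * beta
t <- c(5, 50, 500)              # hypothetical lengths of stay

# Scale as in the code elements above
nu <- exp(-eta / kappa)

# Hazard implied by the shape-scale form: density / survivor function
h_scale_form <- dweibull(t, shape = kappa, scale = nu) /
  pweibull(t, shape = kappa, scale = nu, lower.tail = FALSE)

# Hazard under parameterisation (11.6): lambda_i * kappa * t^(kappa - 1)
h_11_6 <- exp(eta) * kappa * t^(kappa - 1)

# Log-likelihood terms mirroring weibull_lpdf and weibull_lccdf
loglik_event <- dweibull(t, kappa, nu, log = TRUE)
loglik_censored <- pweibull(t, kappa, nu, lower.tail = FALSE, log.p = TRUE)
```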

The Weibull shape parameter κ has a posterior 95% credible interval entirely under 1, so the hazard of exit (mortality) declines with length of stay (sometimes denoted negative duration dependence). This feature, combined with a negative age effect (albeit not significant), may reflect varying frailty (selection effects). Other regression coefficient estimates (Table 11.1) for the treatment and attribute variables replicate those of Morris et al. (1994). Health status is measured in terms of dependency in activities of daily living (ADL), with health=2 if there are four or fewer activities with dependence (reference category), health=3 for five dependencies, health=4 for six dependencies, and health=5 if there were special medical conditions requiring extra care. It can be seen that higher ADL dependency is associated with earlier mortality and shorter stays. The LOO-IC for the Weibull model is 16463, with κ estimated as 0.61. Estimation using the AFT form gives the same result.

TABLE 11.1
Nursing Home Stays. Parameter Posterior Summary
(Weibull PH model: influences on leaving NH. Generalised gamma AFT and Weibull AFT: influences on stay length.)

| Parameter | Weibull PH Mean | 2.5% | 97.5% | Generalised Gamma AFT Mean | 2.5% | 97.5% | Weibull AFT Mean | 2.5% | 97.5% |
|---|---|---|---|---|---|---|---|---|---|
| Age (a) | −0.45 | −1.18 | 0.30 | 1.27 | 0.06 | 2.57 | 0.71 | −0.49 | 1.93 |
| Treatment | −0.13 | −0.24 | −0.02 | 0.11 | −0.09 | 0.28 | 0.21 | 0.02 | 0.39 |
| Male | 0.35 | 0.22 | 0.48 | −0.58 | −0.80 | −0.34 | −0.57 | −0.79 | −0.35 |
| Married | 0.16 | 0.01 | 0.31 | −0.22 | −0.50 | 0.07 | −0.26 | −0.51 | −0.01 |
| ADL Status 3 | −0.03 | −0.19 | 0.13 | 0.07 | −0.10 | 0.27 | 0.05 | −0.21 | 0.28 |
| ADL Status 4 | 0.23 | 0.08 | 0.39 | −0.44 | −0.61 | −0.30 | −0.38 | −0.64 | −0.14 |
| ADL Status 5 | 0.53 | 0.33 | 0.74 | −0.85 | −1.20 | −0.44 | −0.88 | −1.20 | −0.54 |
| κ | 0.61 | 0.58 | 0.64 | | | | | | |
| α | | | | 8.60 | 7.53 | 9.25 | | | |
| γ | | | | 0.18 | 0.17 | 0.19 | | | |
| σ | | | | | | | 1.64 | 1.57 | 1.72 |
| LOO-IC | 16463 | | | 16364 | | | 16463 | | |

(a) Actual age divided by 100.

To assess poorly fitted observations, one may implement different forms of residual. Here the Martingale residual and the normal deviate residual are obtained, with simulation as in Nardi and Schemper (2003) to obtain estimated normal deviate residuals for censored observations. These two residuals have a correlation of 0.94, and high negative values on both highlight subjects (e.g. 1589 and 1596) with long lengths of stay despite high ADL dependency.
There is evidence of redundancy among the predictors used above, and covariate selection or shrinkage priors could be applied (e.g. Zhang et al., 2018). The impact of the latter can be demonstrated simply by an application of the BayesMixSurv package, which estimates a two-component discrete mixture of Weibull regressions, but allows a single component option. Lasso shrinkage priors are assumed for regression coefficients in this package. Thus, defining an event indicator (endstay=1-censored) as the complement of the censoring indicator, and defining agec=age/100, one has

  C1=bayesmixsurv(Surv(time,endstay)~trt+agec+marstat+gender+hltst3+hltst4+hltst5,
     D, control=bayesmixsurv.control(iter=1000,single=TRUE))

This shows an age coefficient much closer to zero (with posterior mean −0.03) than obtained using the flat prior in the rstan code.
A Weibull accelerated failure time regression can be obtained by simply replacing nu[i] = exp(-eta[i]/kappa) by nu[i] = exp(eta[i]) in the preceding code. As in Morris et al. (1994), we compare Weibull and generalised gamma AFT regressions. The latter requires specific functions in the rstan code to define the density and survivor likelihoods.
Convergence issues have been noted for the generalised gamma, even under maximum likelihood, though convergence may be improved by fixing one of the extra generalised gamma parameters (Lawless, 1980). Estimates here are based on a single chain run of 5,000 iterations in rstan, at which point scale reduction factors (SRFs) for the hyperparameters are 1.3 or less. Table 11.1 shows a more pronounced effect of age, and a diminished treatment effect, under the generalised gamma, which produces a lower LOO-IC (16364) than the Weibull AFT. The estimate of α, with 95% CRI (7.5,9.3), suggests the lognormal may be preferred to the Weibull.

11.3 Semiparametric Hazards
In the proportional hazards model

$$h(t_i|Z_i) = h_0(t_i)\exp(Z_i\beta),$$

it may be difficult to choose a parametric form for the baseline hazard h0(t), and semipara-
metric or non-parametric approaches are often preferable. These have benefits in avoiding
possible mis-specification of parametric hazard forms, and in facilitating other aspects
of hazard regression, such as time-varying predictor effects (Gamerman, 1991). Such
approaches have been applied to the cumulative hazard, and implemented in counting
process models (Clayton, 1991). However, they may also be specified for the baseline haz-
ard h0 itself (e.g. Gamerman, 1991; Sinha and Dey, 1997) and typically use only information
about the time intervals in which exit occurs.
Consider a partition of the response time scale into J intervals $(a_0,a_1],\ldots,(a_{J-1},a_J]$, where aJ equals or exceeds the largest observed time, censored or uncensored (Ibrahim et al., 2001, p.106). The partition scheme can be based on distinct values in the profile of observed times $\{t_1,\ldots,t_n\}$, whether censored or not, or by siting knots aj at selected points in the range (tmin,tmax). Yin and Ibrahim (2006, p.173) propose that the partitioning should ensure an approximately equal number of failures in each of the J intervals, with each interval
containing at least one failure. Among alternatives are knots sited at (( j − 1)/J )th quantiles
of observed times (Gustafson et al., 2000), or evenly spaced along the range of the observed
t values. As the number of intervals J tends to infinity, a truly non-parametric model is
obtained, but is not likely to be empirically well identified (Lopes et al., 2007).
Different approaches may be based on the assumption that the baseline hazard is constant within each interval. Thus Ibrahim et al. (1999) and Ibrahim et al. (2001, p.55) consider a discrete approximation to the gamma process of Dykstra and Laud (1981). This involves a prior on the increments

$$\Delta_j = h_0(a_j) - h_0(a_{j-1}), \quad j = 1,\ldots,J$$

in the baseline hazard, and use of the approximate survival function

$$S(t|Z_i) = \exp\left\{-B_i\int_0^t h_0(u)\,du\right\} \approx \exp\left\{-B_i\sum_{j=1}^{J}\Delta_j(t - a_{j-1})_+\right\},$$

where $B_i = e^{Z_i\beta}$, and $(u)_+ = u$ if u > 0 and is zero otherwise. The probability of exit in interval j is then

$$q_j = S(a_{j-1}) - S(a_j) \approx \left[\exp\left\{-B_i\sum_{m=1}^{j-1}\Delta_m(a_{j-1} - a_{m-1})\right\}\right]\left[1 - \exp\left\{-B_i(a_j - a_{j-1})\sum_{m=1}^{j}\Delta_m\right\}\right].$$
11.3.1 Piecewise Exponential Priors
Piecewise exponential (PE) priors (Ibrahim et al., 2001; Bender et al., 2018; Sinha et al., 1999; Demarqui et al., 2008; Brezger et al., 2005) are one approach to estimating the hazard function without specifying the hazard parametrically, though semiparametric approaches avoiding the simplifying PE assumptions have been proposed (Murray et al., 2016; Marano et al., 2016). The PE prior specifies a baseline parameter λj for each interval, possibly combined with interval-specific regression parameters βj, so that

$$h(t_i \in (a_{j-1},a_j]\,|\,Z_i) = \lambda_j\exp(Z_i\beta_j),$$

where Zi excludes an intercept. Let $B_{ij} = \exp(Z_i\beta_j)$. For a subject surviving beyond the jth interval, namely with ti > aj, the likelihood contribution during interval j is

$$\exp(-\lambda_j(a_j - a_{j-1})B_{ij}).$$

For a subject with $a_{j-1} < t_i \le a_j$, either failing (dij = 1) in interval j, or censored but nevertheless exiting (dij = 0) in the jth interval, the likelihood contribution is

$$[\lambda_jB_{ij}]^{d_{ij}}\exp[-\lambda_j(t_i - a_{j-1})B_{ij}].$$

So, a Poisson likelihood approach may be applied as in (11.2)–(11.4), with responses yij defined by the event type in each interval, and with offsets Δij, defined according to whether the subject survives the interval (see Example 11.2).
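The equivalence with a Poisson likelihood can be seen for a single subject. The R sketch below uses hypothetical cut-points, hazards, and a time-constant regression effect (so that Bij = B for all j, a simplifying assumption); the piecewise exponential and Poisson log-likelihoods then differ only by a term free of the parameters.

```r
# Illustrative inputs (hypothetical values, not taken from the examples)
a <- c(0, 2, 5, 10)            # cut-points a_0, ..., a_J with J = 3
lambda <- c(0.10, 0.20, 0.15)  # baseline hazard in each interval
B <- exp(0.4)                  # exp(Z_i * beta), taken as constant over j
t_i <- 6.5; d_i <- 1           # exit in interval 3, observed failure

J <- length(lambda)
lower <- a[1:J]; upper <- a[2:(J + 1)]

# Time at risk in each interval (offsets) and event indicator by interval
expo <- pmax(pmin(t_i, upper) - lower, 0)
y <- d_i * as.numeric(t_i > lower & t_i <= upper)

# Piecewise exponential log-likelihood: sum over intervals of
# y_j * log(lambda_j * B) - lambda_j * B * exposure_j
loglik_pe <- sum(y * log(lambda * B) - lambda * B * expo)

# Poisson log-likelihood with means exposure_j * lambda_j * B,
# restricted to intervals in which the subject is at risk
at_risk <- expo > 0
loglik_pois <- sum(dpois(y[at_risk], (expo * lambda * B)[at_risk], log = TRUE))

# The two differ only by a term not involving lambda or beta
const <- sum(y[at_risk] * log(expo[at_risk]))
```

Because the difference is constant in the parameters, posterior inference on λj and β is the same under either representation, which is what allows standard Poisson regression machinery with log-exposure offsets to be used.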

The successive baseline parameters λj are likely to be correlated, but may also show erratic fluctuations or be imprecisely estimated if treated as fixed effects. Hence, a smoothing prior is indicated. One might assume a parametric model (e.g. polynomial in j) but allowing for additional random variation. Thus, Albert and Chib (2001) and Omori (2003) assume a polynomial for $\alpha_j = \log(\lambda_j)$, whereby

$$\alpha_j = \psi_0 + \psi_1(j-1) + \psi_2(j-1)^2 + u_j,$$

with uj normal. Pooling strength under autocorrelated priors linking successive λj or αj is also widely applied. These are known as correlated prior processes or Martingale prior processes for the baseline hazard. Possibilities are first or second order random walks (RW) in the αj, possibly adjusted to reflect unequal widths $\delta_j = (a_j - a_{j-1})$ of the intervals. Thus, one might take a first order random walk,

$$\alpha_j \sim N(\alpha_{j-1}, \sigma_\alpha^2\delta_j),$$

with α1 a separate fixed effect, and with $\tau_\alpha = 1/\sigma_\alpha^2$ following a gamma prior. Alternatively, as in Gustafson et al. (2003), one may take

$$w_j = 0.5(a_j + a_{j+1}),$$

$$\zeta_j = w_j - w_{j-1},$$

$$\alpha_j \sim N\left(\alpha_{j-1} + (\alpha_{j-1} - \alpha_{j-2})\frac{\zeta_j}{\zeta_{j-1}},\ \sigma_\alpha^2(\zeta_j/\bar{\zeta})^2\right).$$

Since setting particular partitions of the time scale involves an element of arbitrariness, Sahu and Dey (2004) apply RJMCMC (reversible jump MCMC) techniques in which J is an additional unknown; they specify a sparse precision matrix formulation for the joint prior for the $(\alpha_1,\ldots,\alpha_J)$ under an RW1 prior for particular J values.
Because random walk priors of degree r set a mean level not on the αj themselves, but on differences of order r (e.g. an RW1 prior specifies a zero mean for $\alpha_j - \alpha_{j-1}$), identifiability may require that a separate regression intercept is omitted or that the αj are centred to sum to zero at each MCMC iteration, by the operation $\alpha'_j = \alpha_j - \bar{\alpha}$. Alternatives are to set any value, say the hth, to zero (by the operation $\alpha'_j = \alpha_j - \alpha_h$ at each iteration), or set the first effect α1 to zero (Sahu and Dey, 2004).
A gamma prior in the baseline hazard rates λj is also possible (Arjas and Gasbarra, 1994), namely

$$\lambda_j \sim Ga(b, b/\lambda_{j-1}),$$

where λ1 is a separate positive effect, and larger b values lead to smoother sequences of λj. The same identifiability issues obtain as for $\alpha_j = \log(\lambda_j)$ and devices such as normalisation of the λj (to value 1) at each iteration may be applied.
Piecewise priors may also be used to model non-constant predictor effects, though typically values of time-varying regression coefficients βj in successive intervals are expected to be close (Sinha et al., 1999). Sargent (1998) considers alternative gamma priors for the precision $\tau_\beta = 1/\sigma_\beta^2$ of regression coefficients assumed to evolve according to a first order random walk, adjusted for different interval widths,

$$\beta_j \sim N(\beta_{j-1}, \sigma_\beta^2\delta_j).$$

Prior knowledge in this application (from the Veterans Administration lung cancer trial) suggests that values of time-varying coefficients on successive days would differ by at most 0.001. Taking this as the standard deviation of the normal distribution, the prior mean precision for the gamma is $10^6$. This corresponds to quartiles (0.0027, 0.0038, 0.0059) for σβ. An alternative prior adopted by Sargent has mean precision $10^5$. Posterior inferences for the mean precision were different under the alternative priors, but not those for the estimated βj. Fahrmeir and Knorr-Held (1997) suggest gamma Ga(1,b) priors on precision parameters τα on varying log baseline rates, or precisions τβ on varying predictor effects. Sensitivity is gauged by taking alternative values for b (e.g. b = 0.05 and b = 0.0005), since b determines how close to zero the variances are allowed to be a priori.

11.3.2 Cumulative Hazard Specifications


Semiparametric approaches may also be applied to the cumulative baseline hazard H0 (Kalbfleisch, 1978). Consider a counting process approach with data $(N_i(t), Y_i(t), Z_i(t))$ and independent priors on β and H0. For an individual i exiting or censored before t, so that Yi(s) = 0 for s > t, one may apply a Poisson likelihood with binary responses dNi(t) and means $Y_i(t)\exp(Z_i(t)\beta)\,dH_0(t)$.
An independent gamma increments prior for dH0 (Phadia, 2015) may be adopted (assuming a constant baseline hazard in each interval), namely

$$dH_0(t) \sim Ga(c[dH^*(t)], c),$$

where dH*(t) is a prior estimate of the hazard rate per unit time. Other possibilities include normal priors on log(dH0).
Let J + 1 intervals $(s_0,s_1],\ldots,(s_J,s_{J+1}]$ be defined by the J distinct failure times in a dataset, with s1 equal to the minimum observed failure time, and sJ+1 exceeding the largest failure time sJ (Sargent, 1997, p.16). The likelihood for individual i exiting or censored before sj, so that Yij = 0 for t > sj, reduces to a discretised form of Poisson likelihood over all possible intervals j with binary responses dNij and means $Y_{ij}\exp(Z_{ij}\beta_j)\,dH_{0j}$. This model may be adapted to allow for unobserved covariates or frailty, as considered in Section 11.4. It also allows for autoregressive dependencies between intervals.

Example 11.2 Veterans Lung Cancer Trial


To illustrate the implementation of semiparametric hazards and the opportunity they offer to model time-varying regression effects, consider data from the Veterans Administration lung cancer trial (e.g. Bender et al., 2018). In this trial, n = 137 male subjects with advanced inoperable lung cancer were randomised to either a standard or a test chemotherapy, with the end point being time to death in days. Only 13 of the 137 survival times are censored. Most analyses find the treatment to be insignificant and consider the remaining predictors, namely:

celltype (1=squamous, 2=smallcell, 3=adeno, 4=large);

the Karnofsky score (KS), reflecting ability to perform common tasks (with scores 0–100, where 100 signifies normal physical abilities);

prior therapy or PT (0=no, 1=yes);

an interaction between Karnofsky score and PT.

Sargent (1997) considers a counting process version of the Cox model for these data with time-varying effects on the KS predictor.
Here we first consider the piecewise exponential model

$$h(t_i \in (a_{j-1},a_j]\,|\,Z_i) = \lambda_j\exp(Z_i\beta) = \exp(\alpha_j + Z_i\beta),$$

and a partitioning of the time scale involving J = 20 intervals, and 21 cut-points aj at the {0th,5th,10th,…,95th,100th} percentiles of the survival times ti. With di denoting event indicators, the responses yij and offsets Δij are defined in R using the commands

   y <- del <- matrix(0, N, J)
   for (i in 1:N) {for (j in 1:J){
     # event indicator for subject i in interval j
     y[i,j] <- d[i]*(t[i]>=a[j])*(a[j+1] >= t[i]);
     # offset terms: time at risk in interval j
     del[i,j] <- (min(t[i],a[j+1])-a[j])*(t[i]>=a[j])}}
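The same responses and offsets can be built without explicit loops. The sketch below uses hypothetical toy data and cut-points (with a[1] = 0, an assumption made for this illustration) and reshapes the result to long format for a Poisson regression with log-offsets. It also shows that, with ">=" on both sides, a failure time equal to an interior cut-point registers in two intervals, and that dropping rows with zero offset removes the duplicate.

```r
# Toy data (hypothetical): survival times and event indicators
t <- c(3, 8, 12, 20, 20, 35)
d <- c(1, 0, 1, 1, 0, 1)
N <- length(t)

# Cut-points, with the last one at the largest observed time
a <- c(0, 5, 10, 20, 35)
J <- length(a) - 1

# Same definitions as the loop, built with outer()
at_or_after_start <- outer(t, a[1:J], ">=")          # t[i] >= a[j]
at_or_before_end <- outer(t, a[2:(J + 1)], "<=")     # a[j+1] >= t[i]
y <- d * at_or_after_start * at_or_before_end
del <- (outer(t, a[2:(J + 1)], pmin) - matrix(a[1:J], N, J, byrow = TRUE)) *
  at_or_after_start

# Long format, one row per subject-interval with positive time at risk
long <- data.frame(id = rep(1:N, times = J), interval = rep(1:J, each = N),
                   y = as.vector(y), del = as.vector(del))
long <- long[long$del > 0, ]

# Events per subject after dropping zero-offset rows
events_per_subject <- as.vector(tapply(long$y, long$id, sum))
```

The long data frame could then be passed to a Poisson model with interval-specific intercepts and offset log(del), which is the structure the BUGS-style code relies on.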

The first model assumes a constant Karnofsky score effect, but a time-varying (log) baseline hazard, namely

$$\alpha_j \sim N(\alpha_{j-1}, \sigma_\alpha^2),$$

with a gamma prior on $1/\sigma_\alpha^2$. To assist identifiability, estimation with the R package rube uses the BUGS car.normal prior, and the re-expression

$$\alpha_j = \beta_0 + \alpha_{0j},$$

where α0j are RW1 centred random effects. With a two-chain run, the posterior density of σα is found to be bounded away from zero, with 95% interval (0.05,0.22). The posterior mean αj show an irregular rise in mortality over the intervals (Figure 11.1). The LOO-IC is 984. This LOO-IC is in fact lower than if more intervals are used (e.g. with J = 50, and using either quantile cutpoints or equally spaced cutpoints). The coefficient on the Karnofsky score has 95% interval (−0.38,−0.15), while the interaction term has coefficient with 95% interval (−0.57,−0.11). Coefficients for celltypes 1 and 2 are also significantly positive.
A second model additionally takes the Karnofsky score to have a time-varying coefficient, using a random walk prior adjusted for intervals of unequal length. Thus

$$\beta_j \sim N(\beta_{j-1}, \sigma_\beta^2\delta_j),$$

where the Ga(1,0.0001) prior for $1/\sigma_\beta^2$ has mean and variance $10^5$, so supporting large values (Sargent, 1997). The LOO-IC is lower at 971, and there is an upward trend (towards a null value) in the coefficient, with an insignificant effect at higher intervals (Figure 11.2). Topics for further investigation might be the sensitivity of the form of time variation in the KS effect to the partitioning scheme of the durations, or to unobserved heterogeneity between subjects.
FIGURE 11.1
Trend in αj by interval (y-axis: posterior mean and 80% CRI; x-axis: interval).

FIGURE 11.2
Trend in beta coefficient (y-axis: posterior mean of βKS; x-axis: interval).

We also demonstrate the changing effect of the KS score using a rstan coding of the counting process model [1]. The settings for the gamma increments prior are as for the WinBUGS Volume 1 “Leuk: Cox regression” example. Predictors are the KS score, prior therapy, and their interaction (the latter two having time-constant effects). A random walk prior is assumed for the varying KS score effect.
Figure 11.3 shows the diminution of the KS effect at higher intervals (based on distinct event times). Figure 11.4 shows differing survival chances according to whether prior therapy was received (upper curve) or not (lower curve), with the KS score set at its upper quartile.

FIGURE 11.3
Trend in KS effect (y-axis: posterior mean and 80% CRI of βKS; x-axis: interval).

FIGURE 11.4
Survival chances according to therapy.

11.4 Including Frailty
Subjects with a given profile of attributes are still likely to show variations in survival times due to unobserved factors. Such factors mean that subjects have different frailties (i.e. liabilities to experience the event) and the most frail will typically exit before others, so that survivors are subject to a selection effect (Aalen, 1988; Wienke, 2010). Inferences from survival analysis may be incorrect if unobserved heterogeneity is ignored (Lancaster, 1990), with a possibility of negative duration bias (Boring, 2009). Another consideration is possible sensitivity of inferences to the assumed form of unobserved heterogeneity.
The canonical form for introducing unobserved differences between observations is via a multiplicative frailty, γi, distributed independently of Zi and ti, with

$$h(t_i|Z_i,\gamma_i) = \gamma_ih_0(t_i)\exp(Z_i\beta),$$

leading to mixed proportional hazard or MPH models (Mosler, 2003; Van den Berg, 2001; Abbring and van den Berg, 2007). Except for the case of positive stable frailty distributions, the MPH model is inconsistent with the usual Cox proportional hazard formulation (Henderson and Oman, 1999).
A typical assumption for the distribution p(γi) of multiplicative frailties is that they are gamma distributed (Perperoglou et al., 2006), typically γi ∼ Ga(k,k) where k is unknown. So, the frailties have mean 1, and variance 1/k, with normalisation to ensure identification when Zi includes an intercept. Another possibility is to include the regression effect exp(Ziβ) in the specification of the frailty density. So, for example,

$$h(t_i|Z_i,\gamma_i) = \gamma_ih_0(t_i),$$

$$\gamma_i \sim Ga(k\exp[Z_i\beta], k).$$

As one option, Sohn et al. (2007) assume Weibull distributed survival times, with density form

$$f(t_i) = \frac{\alpha}{\gamma_i}t_i^{\alpha-1}\exp(-t_i^{\alpha}/\gamma_i),$$

and then take γi as inverse gamma. With the form x ~ IG(a,b) corresponding to $f(x) = (b^a/\Gamma(a))x^{-(a+1)}\exp[-b/x]$, the frailty density is then

$$\gamma_i \sim IG(a+1, a\exp[Z_i\beta]).$$

Other positive parametric densities can be used to represent frailty, such as the log-normal (Gustafson, 1997). An advantage of gamma frailty combined with Weibull hazard is that joint and marginal survival functions can be obtained analytically. An alternative is to assume the γi have a positive stable distribution (Hougaard, 2000), in which case the proportional hazards property is preserved after the γi are integrated out (Aalen and Hjort, 2002).
Estimates resulting from the mixed proportional hazard model are often sensitive to the functional form of the heterogeneity distribution, and may be biased if the functional form of the distribution is mis-specified (Baker and Melino, 2000; Keiding et al., 1997). Heckman and Singer (1984) report sensitivity of regression estimates according to different parametric distributions of frailty. They propose discrete mixture models with finite support at a small number K of points, so that

$$h(t_i|Z_i,\gamma_i) = \gamma_{G_i}h_0(t_i)\exp(Z_i\beta),$$

where Gi is a multinomial indicator with K categories. Sahu and Dey (2004) compare gamma, stable, and skewed log-t frailty models, and show how the gamma assumption may attenuate covariate effects as compared to the other forms.
Despite such sensitivity, it is important to consider possible heterogeneity. One can show (Lancaster, 1990) that a model neglecting frailty will show spurious duration dependence, and specifically overestimate the extent of negative duration dependence in the true baseline hazard, and underestimate the extent of positive duration dependence. This is a consequence of selection, since in the presence of negative duration dependence, subjects with high values of γi exit faster, so survivors at a given survival time are increasingly biased towards relatively low γi values, and lower hazard rates. These features can be illustrated with the MPH assumption and particular parametric hazards. Conditional on a particular value of γi, the survivor function is

$$S(t_i|Z_i,\gamma_i) = \exp\left[-\gamma_i\exp(Z_i\beta)\int_0^{t_i}h_0(u)\,du\right],$$

or in terms of the cumulative hazard H0(ti),

$$S(t_i|Z_i,\gamma_i) = \exp[-\gamma_i\exp(Z_i\beta)H_0(t_i)].$$

The unconditional survival function (integrating out the frailties) is therefore

$$S(t_i|Z_i) = \int_0^{\infty}S(t_i|Z_i,\gamma_i)p(\gamma_i)\,d\gamma_i = \int_0^{\infty}p(\gamma_i)\exp\left[-\gamma_iH_0(t_i)e^{Z_i\beta}\right]d\gamma_i.$$

For γi following a gamma density, γi ∼ Ga(a,b), the unconditional survivor function is

$$S(t_i|Z_i) = b^a\left[b + H_0(t_i)e^{Z_i\beta}\right]^{-a},$$

which for a = b = k (with Zi including a constant) reduces to

$$S(t_i|Z_i) = \left[1 + k^{-1}H_0(t_i)e^{Z_i\beta}\right]^{-k}.$$

Consider exponentially distributed times so that h0(ti) = 1 and H0(ti) = ti. Then

$$S(t_i|Z_i) = \left[1 + k^{-1}e^{Z_i\beta}t_i\right]^{-k},$$

$$f(t_i|Z_i) = e^{Z_i\beta}\left[1 + k^{-1}e^{Z_i\beta}t_i\right]^{-k-1},$$

$$h(t_i|Z_i) = e^{Z_i\beta}\left[1 + k^{-1}e^{Z_i\beta}t_i\right]^{-1}.$$

For a frailty variance 1/k > 0, the hazard rate is a decreasing function of t, an example of spurious duration dependence. If frailty is present, but ignored, not only will duration effects be mis-stated, but covariate effects will be underestimated (Hougaard et al., 1994; Pickles and Crouchley, 1995). Lancaster (1990) confirmed this analytically for uncensored Weibull survival data.
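The closed forms for gamma frailty with a constant baseline hazard can be confirmed by integrating out the frailty numerically. In the R sketch below the values of k and of the linear predictor are hypothetical; the marginal hazard is seen to decline with t even though each subject's conditional hazard is constant.

```r
k <- 2           # frailties ~ Ga(k, k): mean 1, variance 1/k
lin_pred <- 0.3  # hypothetical value of Z_i * beta
t_grid <- c(0.5, 1, 2, 5, 10)

# Conditional survivor function for exponential times with h0 = 1
S_cond <- function(t, frailty) exp(-frailty * exp(lin_pred) * t)

# Unconditional survivor function by numerical integration over the frailty
S_numeric <- sapply(t_grid, function(t) {
  integrate(function(g) S_cond(t, g) * dgamma(g, shape = k, rate = k),
            lower = 0, upper = Inf)$value
})

# Closed forms for the marginal survivor function and hazard
S_closed <- (1 + exp(lin_pred) * t_grid / k)^(-k)
h_marginal <- exp(lin_pred) / (1 + exp(lin_pred) * t_grid / k)
```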
More general forms of subject-level random variation can be achieved by a general linear mixed model form, where the impact of selected predictors $w_i = (w_{1i},\ldots,w_{ri})$ is assumed to vary over subjects, or clusters of subjects. Thus

$$h(t_i|Z_i,w_i) = h_0(t)\exp(Z_i\beta + w_ib_i).$$

When r = 1, and wi = 1, the random effect bi ~ N(B,1/τb) is used to represent variability in frailties between subjects. If Zi contains an intercept, the bi are constrained to have zero mean, namely bi ~ N(0,1/τb).
A general linear mixed model form for frailty may take account of spatial locations of subjects, as in geoadditive hazard regression (Kneib, 2006; Henderson et al., 2002). Suppose subjects ij are nested within J locations, then $\{b_j, j = 1,\ldots,J\}$ would be spatially correlated with local pooling of strength (Zhou et al., 2017). For example, if individuals in neighbouring locations are subject to similar (unobserved) environmental risks, this will affect survival.
In accelerated failure time models (Section 11.2.3), frailty is conveniently obtained by discrete mixture modelling of the error term. Following Roeder and Wasserman (1997), a mixture of normals provides a flexible model for estimation of densities. Suppose membership of latent sub-groups is denoted by a categorical variable Gi with K options, and prior $G_i \sim Mult(1,[\pi_1,\ldots,\pi_K])$. Assuming a log-normal density for exit times, one may transform observed failure or censoring times as ri = log(ti), and to account for right censoring, define lower sampling limits Li = log(ti) if di = 0, and Li = 0 if di = 1. Then the discrete mixture adopts varying group intercepts and variances in the survivor function

$$S(t_i) = 1 - \sum_{k=1}^{K}\pi_k\Phi\left(\frac{\log t_i - \gamma_{0k} - \gamma_1z_{1i} - \gamma_2z_{2i} - \ldots}{\sigma_k}\right).$$
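The mixture survivor function is straightforward to evaluate directly. The following R sketch uses a hypothetical two-group setting with two covariates; the function name S_mix and all parameter values are illustrative. With K = 1 the expression reduces to the log-normal survivor function.

```r
# Survivor function for a K-component normal mixture on log(t), with
# component-specific intercepts gamma0[k] and scales sigma[k]
S_mix <- function(t, z, pi_k, gamma0, gamma_z, sigma) {
  lin <- sum(z * gamma_z)                       # shared covariate effects
  1 - sum(pi_k * pnorm((log(t) - gamma0 - lin) / sigma))
}

# Hypothetical two-group setting
pi_k <- c(0.7, 0.3)
gamma0 <- c(2.0, 4.0)
sigma <- c(0.8, 0.5)
gamma_z <- c(0.5, -0.3)
z <- c(1, 0)

times <- c(1, 5, 20, 100, 500)
S_values <- sapply(times, S_mix, z = z, pi_k = pi_k, gamma0 = gamma0,
                   gamma_z = gamma_z, sigma = sigma)

# With K = 1 the form reduces to the log-normal survivor function
S_single <- S_mix(10, z, 1, 2.0, gamma_z, 0.8)
S_lognormal <- plnorm(10, meanlog = 2.0 + sum(z * gamma_z), sdlog = 0.8,
                      lower.tail = FALSE)
```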

Mixed Dirichlet process and Polya Tree priors for the errors u in an AFT regression are
used by Kuo and Mallick (1997) and Walker and Mallick (1999).

11.4.1 Cure Rate Models


A particular form of heterogeneity may arise when permanent survival from an event is
possible. Demographic examples are provided by age at first marriage or age at first mater-
nity. The issue is then to identify latent subpopulations in the censored group, namely
to distinguish a permanent survival subgroup from a subgroup still liable or susceptible
to experience the event, but exhibiting extended survival. Not allowing for permanent
survival when it can occur will distort the failure time parameter estimates for the true
Survival and Event History Models 491

susceptible population. Herring and Ibrahim (2002) point out – in the context of cancer
survival – that improved treatment means that a substantial proportion of patients may
now be cured, whereas traditional survival analysis, including the Cox (1972) regression
model, assumes that no patients are cured, but that all remain at risk of death or relapse.
Similarly, in the context of component reliability, Sinha et al. (2003) consider the case where
if a unit is free of manufacturing faults, it will never fail in its technological lifetime under
usual stress levels.
The most common approach to modelling events with a permanent survival fraction or
cure rate assumes the total survival rate is a binary mixture (Ibrahim et al., 2001). The non-
susceptible subpopulation has Sc(t) = 1 with probability (1 − π), and the other (the non-cured
or susceptible subpopulation) follows a conventional survival pattern in which Sn(t) → 0 as
t → ∞. So the overall survivor function is

$$S^*(t) = (1 - \pi) + \pi S_n(t),$$

and the overall distribution function (Brüderl and Diekmann, 1995) is

$$F^*(t) = \pi F_n(t).$$

Ibrahim et al. (2001, p.157) point out that if covariate effects are modelled via binary regres-
sion for πi then the proportional hazard property no longer obtains.
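
As a minimal Python sketch of the binary mixture, assume (purely for illustration) that susceptibles follow a Weibull survivor function exp(−λt^κ). The function names and parameter values are hypothetical.

```python
import math

def weibull_survivor(t, lam, kappa):
    """Survivor function S_n(t) for susceptibles (assumed Weibull)."""
    return math.exp(-lam * t ** kappa)

def cure_mixture_survivor(t, pi, lam, kappa):
    """Overall survivor function S*(t) = (1 - pi) + pi * S_n(t)."""
    return (1.0 - pi) + pi * weibull_survivor(t, lam, kappa)

def cure_mixture_cdf(t, pi, lam, kappa):
    """Overall distribution function F*(t) = pi * F_n(t)."""
    return pi * (1.0 - weibull_survivor(t, lam, kappa))
```

For large t the overall survivor function levels off at the cured fraction 1 − π rather than falling to zero.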
Let Ri be a partially unobserved binary indicator with Ri = 1 if a subject is susceptible.
Schmidt and Witte (1989) and Banerjee and Carlin (2004) follow the standard cure rate
model and take Ri to be Bernoulli with Pr(Ri = 1) = πi being a propensity to experience the
event (e.g. propensity to relapse). For simplicity, omit the subscript n in the survivor function
for susceptibles. Then for subjects observed to fail, namely with di = 1, it necessarily
follows that Ri = 1, and so the likelihood contribution from such cases is

$$\Pr(R_i = 1) f(t_i) = \pi_i f(t_i).$$

Censored subjects may be either susceptibles or non-susceptibles, with likelihood contribution

$$\Pr(R_i = 0) + \Pr(R_i = 1)\Pr(T > t_i) = (1 - \pi_i) + \pi_i S(t_i).$$

The total likelihood contribution is then

$$[\pi_i f(t_i)]^{d_i} [(1 - \pi_i) + \pi_i S(t_i)]^{1 - d_i},$$

which reduces to the usual form $f(t_i)^{d_i} S(t_i)^{1 - d_i}$ when Ri = 1 for all subjects, and so πi = 1
for all i (i.e. there is no permanent survivor fraction). Any form of binary regression (e.g.
logit) may be used for predicting πi (Schmidt and Witte, 1989). Banerjee and Carlin (2004)
carry out a Bayesian analysis with individual level regression in the scale parameter of the
failure distribution f(t), but without a regression for the susceptible probability. However,
their observations are hierarchical (spatially configured) response times tij (subjects i
within areas j), and they allow spatial variability in the propensities so that πij = πj; see
also Cooner et al. (2006).
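
The likelihood for the standard cure rate model can be sketched in Python as follows. This is an illustration under the assumption of a Weibull failure distribution with a common susceptibility probability; the function names are hypothetical and no regression terms are included.

```python
import math

def weibull_density(t, lam, kappa):
    """Density f(t) for susceptibles (assumed Weibull)."""
    return lam * kappa * t ** (kappa - 1.0) * math.exp(-lam * t ** kappa)

def weibull_survivor(t, lam, kappa):
    """Survivor function S(t) for susceptibles (assumed Weibull)."""
    return math.exp(-lam * t ** kappa)

def cure_loglik(times, events, pi, lam, kappa):
    """Sum of log{[pi f(t)]^d [(1 - pi) + pi S(t)]^(1 - d)} over subjects."""
    total = 0.0
    for t, d in zip(times, events):
        if d == 1:
            total += math.log(pi * weibull_density(t, lam, kappa))
        else:
            total += math.log((1.0 - pi) + pi * weibull_survivor(t, lam, kappa))
    return total

def standard_loglik(times, events, lam, kappa):
    """Usual survival log-likelihood without a cured fraction."""
    return sum(math.log(weibull_density(t, lam, kappa)) if d == 1
               else math.log(weibull_survivor(t, lam, kappa))
               for t, d in zip(times, events))
```

Setting pi = 1 recovers the usual likelihood, while a very long censored time contributes approximately log(1 − pi).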
Chen et al. (1999) describe an alternative structure in which there is a latent count of risks
Ci, taken to be Poisson with mean θ (for example, tumour cells remaining after treatment
that have varying potentials to cause relapse), and unobserved times Ui1, …, UiCi associated
with each of these risks. The Uic are assumed to follow the same failure distribution
F(t) = 1 − S(t). An observed failure time ti is the minimum of these times. If Ci = 0 then a
subject survives permanently from the event being modelled (e.g. a form of cancer). In this
case the composite survival function is

$$\begin{aligned}
S^*(t_i) &= \Pr(C_i = 0) + \Pr(U_{i1} > t_i, \dots, U_{iC_i} > t_i, C_i \ge 1) \\
&= \exp(-\theta) + \sum_{k=1}^{\infty} S(t_i)^k \frac{\theta^k}{k!} \exp(-\theta) \\
&= \exp(-\theta + \theta S(t_i)) = \exp(-\theta F(t_i)),
\end{aligned}$$

and the composite hazard rate is

$$h^*(t_i) = \theta f(t_i).$$

An alternative derivation of this model, not tied to the notion of multiple latent risks, is that
the cumulative hazard $H(t) = \int_0^t h(u)\,du$ tends to a finite positive limit θ as t → ∞ (Tsodikov
et al., 2003). Chen et al. (1999) and Ibrahim et al. (2000, p.158) mention that the survivor
function of the non-cured subpopulation can be written

$$S_n(t_i) = \frac{\exp(-\theta F(t_i)) - \exp(-\theta)}{1 - \exp(-\theta)},$$

so that the composite survival function is in fact also representable as a binary mixture,
namely

$$S^*(t_i) = \exp(-\theta) + (1 - \exp(-\theta)) S_n(t_i).$$
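
The equivalence between the latent count form, its series expansion, and the binary mixture form can be checked numerically. The Python sketch below assumes a Weibull F(t) purely for illustration; the function names and the truncation at 60 terms are choices made here, not part of the original derivation.

```python
import math

def weibull_cdf(t, lam, kappa):
    """Failure distribution F(t) of each latent risk (assumed Weibull)."""
    return 1.0 - math.exp(-lam * t ** kappa)

def promotion_time_survivor(t, theta, lam, kappa):
    """Composite survivor function S*(t) = exp(-theta * F(t))."""
    return math.exp(-theta * weibull_cdf(t, lam, kappa))

def noncured_survivor(t, theta, lam, kappa):
    """S_n(t) = [exp(-theta F(t)) - exp(-theta)] / [1 - exp(-theta)]."""
    num = math.exp(-theta * weibull_cdf(t, lam, kappa)) - math.exp(-theta)
    return num / (1.0 - math.exp(-theta))

def series_survivor(t, theta, lam, kappa, terms=60):
    """exp(-theta) + sum_{k>=1} S(t)^k theta^k / k! exp(-theta), truncated."""
    s = 1.0 - weibull_cdf(t, lam, kappa)
    total = sum((theta * s) ** k / math.factorial(k) for k in range(1, terms + 1))
    return math.exp(-theta) * (1.0 + total)
```

As t grows, the composite survivor function approaches the cure fraction exp(−θ).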

Chen et al. (1999) introduce covariates into a Poisson regression model for subject-specific
θi. Consider Weibull distributed times with $F(t_i|Z_i) = 1 - \exp[-\lambda_i t_i^\kappa]$, $\lambda_i = \exp(Z_i\beta)$, and
$f(t_i|\kappa, Z_i) = \lambda_i \kappa t_i^{\kappa-1} \exp(-\lambda_i t_i^\kappa)$. The likelihood when predictors are used to explain both θi
and λi, and with di being event status indicators, is then

$$[h^*(t_i)]^{d_i} S^*(t_i) = [\theta_i \lambda_i \kappa t_i^{\kappa-1} \exp(-\lambda_i t_i^\kappa)]^{d_i} \exp\left(-\theta_i \{1 - \exp(-\lambda_i t_i^\kappa)\}\right).$$

Multiplicative frailty, as in the MPH setup above, can be introduced in cure rate models,
but identifiability may be weak because susceptibility responses are partially unobserved
themselves. Models for frailty in multivariate cure fraction models are considered by Yin
(2005). Thus, for times tij observed on subjects i and events j, Yin proposes multiplica-
tive frailty at subject level combined with Poisson regression for θij in the cure fractions
exp(−θij). One option takes

$$S^*(t_{ij}) = \exp(-\theta_{ij}\gamma_i F(t_{ij})),$$

with hazard rates $h^*(t_{ij}) = \theta_{ij}\gamma_i f(t_{ij})$.



Example 11.3 Age at First Maternity


To illustrate frailty modelling in a cure rate model, this example follows Winkelmann
and Boes (2005) in analysing ages at first maternity for women in the German General
Social Survey for 2002. The subsample considered by Winkelmann and Boes involves
1371 women, comprised of (a) uncensored subjects (event indicator di = 1) who may have
been over 40 at the time of the survey, but whose age at first maternity (AFM) was under
40, and (b) women aged under 40 in 2002, but who had not yet had a child (di = 0). Here
we consider all 1,508 women (including childless women aged over 40).
A log-logistic model with hazard and survivor functions

$$h(t) = \frac{\lambda\kappa t^{\kappa-1}}{1 + \lambda t^\kappa},$$

$$S(t) = [1 + \lambda t^\kappa]^{-1},$$

is appropriate to the non-monotonic form of hazard for first maternity, typically peak-
ing between ages 25 to 35. This model is implemented in rstan using the custom likeli-
hood approach. A standard log-logistic is compared with a log-logistic model with a
permanent survivor fraction (PSF), modelled according to the latent count approach
(Chen et al., 1999). The PSF log-logistic model is then generalised to allow for unmea-
sured heterogeneity in the age at first maternity. Permanent survivorship in this case is
equivalent to a woman never undergoing a maternity, and at population level is essen-
tially equivalent to the rate of childlessness.
Regression effects are included in the scale parameter of the log-logistic hazard via
λi = exp(Zi β), with θ assumed constant. However, a Poisson regression for θi could be
included. Predictors Zi and regression effects β under the standard log-logistic are as in
Table 11.2. Predictors are binary apart from number of siblings and education years. The
modal age $c = [(\kappa - 1)\exp(-Z_T\beta)]^{1/\kappa}$ reported in Table 11.2 is based on a predictor vector
ZT for a white subject with 13 years of education, and 3 siblings.
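
The modal age formula follows from setting the derivative of the log-logistic hazard to zero, which gives λt^κ = κ − 1. A short Python sketch of this calculation is given below; the values of κ and λ are hypothetical and are not the fitted estimates from this example.

```python
import math

def loglogistic_hazard(t, lam, kappa):
    """h(t) = lam * kappa * t^(kappa - 1) / (1 + lam * t^kappa)."""
    return lam * kappa * t ** (kappa - 1.0) / (1.0 + lam * t ** kappa)

def loglogistic_survivor(t, lam, kappa):
    """S(t) = 1 / (1 + lam * t^kappa)."""
    return 1.0 / (1.0 + lam * t ** kappa)

def modal_time(lam, kappa):
    """Time at which the hazard peaks (requires kappa > 1)."""
    return ((kappa - 1.0) / lam) ** (1.0 / kappa)

# Hypothetical shape and scale, chosen only to give a peak in the early thirties.
kappa = 5.0
lam = math.exp(-16.0)
peak = modal_time(lam, kappa)
```

Here lam plays the role of exp(Zβ) for a given predictor vector, so the mode can be recomputed for any covariate profile.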
TABLE 11.2
Age at First Maternity. Parameter Posterior Summaries

                                   Standard log-logistic       Log-logistic with Cure Fraction     Cure Fraction and Frailty
Predictor                           Mean       2.5%      97.5%       Mean       2.5%      97.5%       Mean       2.5%      97.5%
Years of education                −0.213     −0.249     −0.184     −0.294     −0.334      −0.26     −0.504     −0.686     −0.377
Number of siblings                 0.018     −0.012      0.045      0.017     −0.017      0.046      0.032      −0.03      0.086
White                             −0.594     −0.807     −0.407     −0.909      −1.14     −0.706     −1.649     −2.364     −1.133
Immigrant                         −0.312     −0.589     −0.083     −0.585     −0.912     −0.317     −0.921     −1.566      −0.45
Low income (age 16)               −0.009     −0.219      0.167      0.122     −0.126      0.325      0.217     −0.244      0.615
Living in city (age 16)           −0.074     −0.256      0.074     −0.007     −0.214      0.157      0.019     −0.343      0.334
Shape parameter                     5.04       4.79       5.26       8.79       8.36       9.17       15.9      11.81      20.44
Modal age (typical subject)           33       32.1       33.7       31.7       31.1       32.4       31.5       22.8       38.7
Proportion childless                                                0.172      0.151       0.19      0.167      0.146      0.185
Frailty SD                                                                                            2.45       1.44       3.47

The standard log-logistic model gives a LOO-IC of 8,700. Significant coefficients in
Table 11.2 show that delayed AFM is associated with longer education, being white, and
immigrant status. Allowing for a permanently childless subpopulation, but without
allowing for frailty, reduces the LOO-IC to 7,903. Finally, adding log-normal frailty via

$$\lambda_i = \exp(Z_i\beta + u_i),$$

$$u_i \sim N(0, \sigma_u^2),$$

reduces the LOO-IC to 7,855. This analysis uses a corner constraint on the ui for iden-
tifiability, and this option provides better convergence than (a) excluding the intercept
from Ziβ and centring the ui at β0, namely ui ∼ N(β0, 1/τu), or (b) expressing the random
effect as a product of σu and N(0,1) terms. The option of centring the ui is more computa-
tionally intensive. The lowest (most negative) frailty values are for subjects with delayed
age at first maternity, combined with low education and non-white ethnicity.
Allowing for a childless subpopulation (as a cure rate) is a form of frailty in itself,
and enhances (absolutely) the coefficients on significant predictor effects [3]. Formally
including frailty in the modelling of the event density further enhances predictor
effects. The childless fraction (i.e. the permanent survival fraction), exp(−θ), is estimated
at around 0.17, regardless of the presence or not of frailty. A standard log-logistic model
leads to a significantly later modal age than the extended models. In fact, a better repre-
sentation of the age at first maternity process may be provided by the generalised log-
logistic of Brüderl and Diekmann (1995), as discussed in Congdon (2008).

11.5 Discrete Time Hazard Models


In applications with interval censored times, analysis using a discrete time scale
becomes appropriate, and in fact such analysis has certain benefits also for modelling
time-varying or lagged predictor effects. Let the time scale be grouped into J intervals
A1 = [a0, a1), …, AJ = [aJ, aJ+1), with interval j being [aj−1, aj), and a0 = 0, aJ+1 = ∞, where aJ
denotes either the end of the observation period, or the largest time (censored or failed).
The intervals may be of equal length dj = aj − aj−1, but are not necessarily so. Instead of continuous
observed failure times, only the discrete times ti ∈ Aj are observed. Equivalently,
let ti = j denote that a time of failure or censoring is observed within [aj−1, aj).
With Sj denoting the probability of surviving to the end of interval j, the unconditional
probability of failing in interval j is

$$f_j = \Pr(t \in [a_{j-1}, a_j)) = S_{j-1} - S_j,$$

and the hazard function (the conditional probability of failing in interval j given survival
till the start of the interval) is

$$q_j = \Pr(t \in [a_{j-1}, a_j) \mid t \ge a_{j-1}) = \Pr(t = j \mid t \ge j) = f_j / S_{j-1} = \frac{S_{j-1} - S_j}{S_{j-1}}.$$

Alternatively stated, qj is the proportion of subjects at risk at the beginning of interval j
who experience the event sometime during the interval. The survivor function (the prob-
ability of surviving beyond interval j) is obtained as

$$S_j = \Pr(t > a_j) = \prod_{k=1}^{j} (1 - q_k) = f_{j+1} + f_{j+2} + \dots + f_J = S_{j-1}(1 - q_j),$$

though an alternative survivor function Sj = Pr(t > aj−1) may be defined as the probability of
surviving to the start of interval j (Fahrmeir and Tutz, 2001, p.396; Aitkin et al., 2004, p.350).
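
The relations between the discrete hazards qj, the survivor function Sj, and the failure probabilities fj can be illustrated with a small Python sketch. The function names and the three hazard values are hypothetical.

```python
def survivor_from_hazards(hazards):
    """S_j = prod_{k<=j} (1 - q_k): probability of surviving beyond interval j."""
    survivors = []
    alive = 1.0
    for q in hazards:
        alive *= (1.0 - q)
        survivors.append(alive)
    return survivors

def failure_probs(hazards):
    """f_j = S_{j-1} - S_j, taking S_0 = 1."""
    survivors = survivor_from_hazards(hazards)
    previous = [1.0] + survivors[:-1]
    return [p - s for p, s in zip(previous, survivors)]

# Hypothetical interval hazards q_1, q_2, q_3.
q = [0.1, 0.2, 0.5]
S = survivor_from_hazards(q)
f = failure_probs(q)
```

Dividing each fj by the preceding survivor probability recovers the hazard qj, as in the definition above.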
Let wij = 1 if individual i undergoes the event during interval j and wij = 0 otherwise. The
likelihood up to interval k for that individual is then (Aitkin et al., 2004, p.351),

$$\begin{aligned}
f_{ik}^{w_{ik}} S_{ik}^{1 - w_{ik}} &= (q_{ik} S_{i,k-1})^{w_{ik}} [S_{i,k-1}(1 - q_{ik})]^{1 - w_{ik}} \\
&= S_{i,k-1}\, q_{ik}^{w_{ik}} (1 - q_{ik})^{1 - w_{ik}} \\
&= q_{ik}^{w_{ik}} (1 - q_{ik})^{1 - w_{ik}} \prod_{j=1}^{k-1} (1 - q_{ij})^{(1 - w_{ij})} \\
&= \prod_{j=1}^{k} q_{ij}^{w_{ij}} (1 - q_{ij})^{(1 - w_{ij})}.
\end{aligned}$$

This shows that the likelihood involves binary responses wij ~ Bern(qij), where the qij may
vary between time intervals, but are assumed constant within them. So, the hazard prob-
ability can be represented as

$$q(j|Z_{ij}) = \Pr(t = j \mid t \ge j, Z_{ij}) = F(\alpha_j + Z_{ij}\beta_j),$$

where F is a suitable distribution function, and αj models the baseline hazard (Singer and
Willetts, 1993). If the predictors include lagged event status indicators {wi,j−1, wi,j−2, etc.},
one is led to discrete Markov event histories (e.g. Barmby, 2002). Lagged predictor effects
may also be used (Fahrmeir and Tutz, 2001, p.410).
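
The Bernoulli form of the likelihood corresponds to expanding each subject into one binary record per interval at risk. The Python sketch below is illustrative: the function names are hypothetical, and it assumes that a subject censored in interval k is known to have survived through that interval.

```python
import math

def person_period(last_interval, event):
    """Binary responses w_ij for intervals 1..last_interval."""
    return [1 if (event == 1 and j == last_interval) else 0
            for j in range(1, last_interval + 1)]

def discrete_loglik(hazards, last_interval, event):
    """Sum of w_ij log(q_ij) + (1 - w_ij) log(1 - q_ij) over intervals at risk."""
    w = person_period(last_interval, event)
    return sum(wj * math.log(qj) + (1 - wj) * math.log(1.0 - qj)
               for wj, qj in zip(w, hazards))
```

For a failure in interval 3 the result equals log f3, and for a censored subject it equals log S3, matching the derivation above.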
A benefit of the discrete framework is that the baseline hazard can be modelled via poly-
nomial functions of j (Efron, 1988), for example:

$$\alpha_j = \psi_0 + \psi_1(j - 1) + \psi_2(j - 1)^2 + u_j,$$

where uj ∼ N(0, σu²). Parametric time models can also be modelled straightforwardly: a
Weibull model is represented in a complementary log-log link for F by taking the log of the
time interval as a covariate (Allison, 1997). Non-parametric models for time (e.g. via splines)
can also be applied, or a correlated random effect prior assumed, as in Section 11.3.1. Time-
varying predictor effects are straightforward to use (Muthen and Masyn, 2005), and non-
proportional effects are modelled by including interactions between subject attributes Zij
and j.
Commonly used links for the probabilities qij are the logit, probit, and complementary
log-log. For example, a logit link with time-varying intercepts and predictor effects (where
the vector Zij excludes a constant term) would mean

$$q(j|Z_{ij}) = \frac{\exp(\alpha_j + Z_{ij}\beta_j)}{1 + \exp(\alpha_j + Z_{ij}\beta_j)}.$$

Adopting a logit link means the log-odds of the event occurring are modelled as functions
of predictors and time (i.e. interval). The complementary log-log link model with

$$q(j|Z_{ij}) = 1 - \exp(-\exp(\alpha_j + Z_{ij}\beta_j)),$$

can be derived by assuming an underlying proportional hazard in continuous time, under which

$$S(t_i|Z_i) = \exp\left\{-\int_0^{t_i} h(u|Z_i)\,du\right\} = \exp\left\{-\exp\left[Z_i\beta + \log H_0(t_i)\right]\right\}.$$

Then taking $\alpha_j = \log \int_{a_{j-1}}^{a_j} h_0(t)\,dt$ leads to the complementary log-log model, with the same
predictor effects as under a PH model (Kalbfleisch and Prentice, 1980; Fahrmeir and Tutz,
2001, p.401).
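
The link between the proportional hazards model and the complementary log-log form can be verified numerically. The following Python sketch assumes a Weibull baseline with cumulative hazard H0(t) = t^κ, which is an illustrative choice; the function names and parameter values are hypothetical.

```python
import math

def ph_survivor(t, eta, kappa):
    """S(t|Z) = exp(-exp(eta) * H0(t)), with assumed baseline H0(t) = t**kappa."""
    return math.exp(-math.exp(eta) * t ** kappa)

def interval_hazard_from_ph(a_prev, a_next, eta, kappa):
    """Pr(fail in [a_prev, a_next) | survive to a_prev) under the PH model."""
    return 1.0 - ph_survivor(a_next, eta, kappa) / ph_survivor(a_prev, eta, kappa)

def baseline_alpha(a_prev, a_next, kappa):
    """alpha_j = log of the baseline hazard integrated over the interval."""
    return math.log(a_next ** kappa - a_prev ** kappa)

def cloglog_hazard(alpha_j, eta):
    """q = 1 - exp(-exp(alpha_j + eta))."""
    return 1.0 - math.exp(-math.exp(alpha_j + eta))
```

Here eta stands for the linear predictor Zβ, so the same coefficients appear in both the continuous and the grouped-time formulations.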
If correlated priors (e.g. random walks) on the αj and βj are adopted, the setting of priors on
the hyperparameters (e.g. precisions) follows the same considerations as discussed above
in connection with semiparametric models for continuous time hazards (Section  11.3.1).
Fahrmeir and Knorr-Held (1997) discuss alternative Hastings sampling schemes for collections
of time-varying coefficients {αj, βj1, …, βjp} in discrete hazard regression.
As for continuous time survival modelling, neglecting unobserved heterogeneity may
mean that the estimated baseline hazard parameters are biased downwards, the impact
of constant covariates is underestimated, or that spurious time-dependent effects for
observed predictors are obtained. For improved identification, frailties may be included
at subject level, rather than at subject-interval level, though bilinear schemes are possible.
Thus, a log-normal frailty might specify

$$q_{ij} = F(\alpha_j + Z_{ij}\beta + b_i),$$

where bi ∼ N(0, σb²). Alternatively, a bilinear scheme might be used

$$q_{ij} = F(\alpha_j + Z_{ij}\beta + \delta_j b_i),$$

where one of the δj is set to a fixed value for identification if the variance of bi is unknown.
Muthen and Masyn (2005) use a discrete mixture approach in which Gi ∈ (1, …, K) are
latent groups (e.g. developmental trajectories in educational applications). Then

$$F^{-1}(q_{ij}) = \alpha_{j,G_i} + Z_{ij}\beta_{G_i} + \delta_{j,G_i} b_i,$$

where the probability that Gi = k is defined by predictors Ui in a separate multiple logit
regression. The factor scores bi may be defined by bi ∼ N(0, σb²), or by a hierarchical linear
regression on the predictors Ui.

11.5.1 Life Tables
Life tables are a particular way of analysing discrete time survival data. They may be
applied to situations where permanent survival or withdrawal is possible, such as marital
status life tables (Schoen, 2016), or to population mortality. The intervals in such applica-
tions refer to age or duration bands, and discretisation may extend beyond that present in
the data, as in abridged life tables (Kostaki and Panousis, 2001). The intervals are not nec-
essarily of equal length (Wong, 1977). For example, in one common scheme for human life
tables, ages under 1 form the first interval, ages one to four comprise the second interval,
ages five to nine, the third interval, and so on for successive five year bands, with the final
interval typically open ended, such as ages over 90. Often human life tables are estimated
from population deaths data over a specified calendar period, to provide “period” life
tables, based on current mortality in individuals born in different periods, as distinct from
cohort life tables, based on follow-up studies of mortality in a group of individuals born in
the same time period (Richards and Barry, 1998).
Following life table conventions, ages are denoted x and age intervals are denoted
[x , x + n) , e.g. n = 5 if intervals are five years in length. Let T denote a random variable for
the total lifetime (age of death) of an individual. Also, in line with life table conventions,
the probability Pr(T > x) that the age of death T is x or higher (the survivor function) is
denoted l(x). The hazard rate – also called the force of mortality in life table applications
– is then

$$h(x) = \lim_{\Delta x \to 0} \frac{l(x) - l(x + \Delta x)}{l(x)\Delta x} = \frac{-l'(x)}{l(x)},$$

with solution

$$l(x) = l(0)\exp\left\{-\int_0^x h(u)\,du\right\}.$$

With l(0) = 1, the density of the age at death is f(x) = h(x)l(x). The probability of surviving
from age x to age x + n, given survival to x, namely Pr(T > x + n|T > x), is denoted npx with

$${}_n p_x = \frac{l(x + n)}{l(x)} = \frac{\exp\left\{-\int_0^{x+n} h(u)\,du\right\}}{\exp\left\{-\int_0^{x} h(u)\,du\right\}} = \exp\left\{-\int_0^{n} h(x + u)\,du\right\},$$

while the probability of dying before age x + n conditional on reaching age x is

$${}_n q_x = 1 - {}_n p_x = 1 - \frac{l(x + n)}{l(x)} = \frac{l(x) - l(x + n)}{l(x)}.$$

Important in linking these functions to estimable quantities is the central rate of mortal-
ity, which represents a weighted average of the force of mortality applying over the inter-
val [x , x + n) . Let P(x) denote the population of age x. Then the death rate for age interval
[x , x + n) is

$${}_n M_x = \frac{\int_x^{x+n} h(a)P(a)\,da}{\int_x^{x+n} P(a)\,da}.$$

Assuming linearity of l(a) in the interval from x to x + n, this can be simplified (Namboodiri
and Suchindran, 1987, p.36) to

$${}_n M_x = \frac{l(x) - l(x + n)}{0.5n[l(x) + l(x + n)]}.$$

Hence the survivor probability can be written

$$\frac{l(x + n)}{l(x)} = {}_n p_x = \frac{1 - 0.5n({}_n M_x)}{1 + 0.5n({}_n M_x)},$$

giving

$${}_n q_x = \frac{n({}_n M_x)}{1 + 0.5n({}_n M_x)}.$$

To clarify the operations involved, life tables involve hypothetical populations of initial size
l0 = 100,000 (the radix) with lx denoting numbers still alive at age x from the initial population.
The number dying between age x and x + n is denoted ndx = lx − lx+n, and from above

$${}_n q_x = 1 - \frac{l_{x+n}}{l_x} = \frac{l_x - l_{x+n}}{l_x} = \frac{{}_n d_x}{l_x}.$$

To develop the life table from observed deaths and populations requires an estimator
for the probability nqx. Let Dx denote observed deaths for age band [x , x + n) over a cer-
tain period, Px denote observed mid-period populations at risk (or person-years), and Mx
denote age-specific death rates. One estimator of probability of dying in interval [x , x + n)
conditional on being alive at the start of the interval is then (Chiang, 1984)

$${}_n q_x = \frac{n\,{}_n M_x}{1 + n(1 - {}_n a_x)\,{}_n M_x},$$

where nax is the fraction of the interval lived by those dying during it. For most age groups,
nax is taken as a half, but for infants (ages under one), it can be taken as 0.1, and for the one
to four age group as 0.4.
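
The life table operations can be sketched in Python as follows. The function names, the three age bands, and the death rates are hypothetical; the sketch omits the final open-ended interval, for which the probability of dying is one.

```python
def chiang_q(n, rate, frac):
    """nqx = n * nMx / (1 + n * (1 - nax) * nMx)."""
    return n * rate / (1.0 + n * (1.0 - frac) * rate)

def life_table(widths, rates, fracs, radix=100000.0):
    """Return rows with survivors l_x, probability q_x and deaths d_x."""
    rows = []
    alive = radix
    for n, m, a in zip(widths, rates, fracs):
        q = chiang_q(n, m, a)
        deaths = alive * q
        rows.append({"l": alive, "q": q, "d": deaths})
        alive -= deaths
    return rows

# Hypothetical death rates for ages 0, 1-4 and 5-9.
table = life_table([1, 4, 5], [0.006, 0.0003, 0.0001], [0.1, 0.4, 0.5])
```

With nax set to a half, chiang_q reduces to the expression n(nMx)/[1 + 0.5n(nMx)] obtained under the linearity assumption.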
Under conventional life table methods that are usually applied to large populations, the
Mx are treated as unrelated fixed effects and estimated by assuming binomial sampling
Dx ∼ Bin(Px, Mx) or Poisson sampling Dx ∼ Po(Mx Px). In a Bayesian version of the fixed
effect approach, the Mx would be assigned diffuse beta or gamma priors with known
hyperparameters, e.g. Mx ∼ Beta(1, 1). Overdispersed versions of binomial or Poisson densities
may also be used, involving hierarchical schemes for “borrowing strength” over
correlated mortality rates, with a higher stage density for the Mx involving unknown
hyperparameters. An example might be when age-specific deaths Dix for a set of areas or
hospitals (i = 1, …, I) are to be analysed, and populations at risk are relatively small. Then
the conjugate binomial-beta approach would mean taking death rates Mix to be distributed
according to a hierarchical model, namely

$$D_{ix} \sim \mathrm{Bin}(P_{ix}, M_{ix}),$$

$$M_{ix} \sim \mathrm{Beta}(a, b),$$

where {a,b} are unknown parameters. Congdon (2009) adopts a general linear mixed model
approach for data involving an additional stratifying group g in which

$$D_{ixg} \sim \mathrm{Bin}(P_{ixg}, M_{ixg}),$$

where i and x denote areas and ages, and a logistic regression with group-specific autore-
gressive area and age effects has the form

$$\mathrm{logit}(M_{ixg}) = a_g + s_{ig} + h_{xg}.$$

Other options might be to model the impact of age by a parametric function; for example,
Neves and Migon (2007) use Makeham’s Law, by which

$$D_x \sim \mathrm{Po}(M_x P_x),$$

$$M_x = \alpha + \beta\delta^x,$$

and extend this to a time series model for age-specific death rates and times t, namely

$$M_{xt} = \alpha_t + \beta_t\delta_t^x.$$
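
A minimal Python sketch of the Makeham rate and the associated Poisson log-likelihood is given below. The function names and parameter values are hypothetical, and no priors or time series structure are included.

```python
import math

def makeham_rate(x, alpha, beta, delta):
    """Makeham death rate M_x = alpha + beta * delta**x."""
    return alpha + beta * delta ** x

def poisson_loglik(deaths, exposures, ages, alpha, beta, delta):
    """Log-likelihood for D_x ~ Po(M_x * P_x) across ages."""
    total = 0.0
    for d, e, x in zip(deaths, exposures, ages):
        mu = e * makeham_rate(x, alpha, beta, delta)
        total += d * math.log(mu) - mu - math.lgamma(d + 1.0)
    return total
```

With δ above one the rate rises geometrically with age on top of the constant component α.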

Example 11.4 Cancer Survival


This example illustrates discrete survival with potential unobserved frailty. It involves
survival times in months for 48 participants in a cancer drug trial. Of the 48 patients,
28 receive an experimental drug treatment (Z1 = 1) and 20 receive a control treatment
(Z1 = 0). The other predictor is patient age at the start of the trial, ranging from 47 to 67
years. The observed times provide the month of death, or the last month the patient was
known to be alive.
With a complementary log-log link, Weibull time dependence (model 1) is specified as

$$q(j|Z_i) = 1 - \exp(-\exp(Z_i\beta + \kappa\log(j))),$$

with κ a positive parameter, and the regression term Zi including an intercept. This
representation is compared with a semiparametric baseline hazard modelled via a
first-order random walk (model 2), namely

$$q(j|Z_i) = 1 - \exp(-\exp(\alpha_j + Z_{i1}\beta_1 + Z_{i2}\beta_2 + \kappa\log(j))).$$

Convergence in the latter is assisted by the parameterisation αj = β0 + α0j, where β0 is
the intercept, and the α0j are normal RW1 effects with precision τα = 1/σα². The α0j are
centred to have mean zero at each MCMC iteration. For numeric stability, age values are
divided by 100.
Using rube in R, the semiparametric model gives a WAIC (widely applicable informa-
tion criterion) of 307 with effective parameter count 33. There is only slight fluctuation
about the central value of β0 which has posterior mean −8.0. The simpler Weibull model
has a lower WAIC of 295, with the Weibull parameter having mean (95% CRI) of 0.29
(0.01, 0.76). The treatment and age effects under the Weibull model are −2.1 (−2.9,−1.3) and
9.7 (3.1,16.1). So, mortality declines with time (since κ < 1) after allowing for the impact of
age on mortality, though this decline might be attributable to unmodelled frailty.
A lognormal frailty effect at subject level is accordingly added to the Weibull model,
so that

$$q(j|Z_i) = 1 - \exp(-\exp(\beta_0 + Z_{i1}\beta_1 + Z_{i2}\beta_2 + \kappa\log(j) + \sigma_b b_i)),$$

where bi ∼ N(0, 1), and the precision 1/σb² of bi is assigned a gamma prior. A two-chain
run of 10,000 iterations provides a mean (95% CRI) of 0.33 (0.01, 0.81) for the κ coefficient.
The treatment and age effects, namely −2.1 (−3,−1.3) and 10.3 (4.0,16.6), are changed only
slightly. The WAIC falls slightly to 293, with 31 effective parameters. Figure 11.5 plots
out the bi and shows a negative skew, with negative frailty effects for older subjects still
surviving at higher intervals (e.g. subject 34).

FIGURE 11.5
Posterior mean frailties (histogram; horizontal axis: frailties, vertical axis: frequency).

Example 11.5 Life Tables and Actuarial Implications


This example considers graduation of mortality data, and illustrates the potential for
identifiability issues in models involving multiple random effects. The data are from
Neves and Migon (2007) and consist of central numbers exposed to risk (ex) by age x, and
number of deaths observed (dx) during 1998–2001, collected by insurance companies in
Brazil. We consider the male data for ages 0 to 90. Assuming Poisson sampling, forces
of mortality μx can be estimated using the likelihood dx ∼ Po(ex μx), with probabilities of
death then estimated as qx = 1 − exp(−μx) (Brouhns et al., 2002).
Mortality rates by age for different populations often exhibit underlying regularities
despite stochastic fluctuations, leading to parametric and non-parametric methods to
represent the smooth underlying pattern, under which it is expected that probabilities
of death for consecutive ages should be close. Parametric graduation procedures aim
to estimate the underlying smooth mortality curve, facilitating actuarial calculation of
premiums and reserves. Parametric methods include Makeham’s formula, whereby

$$\mu_x = \alpha + \beta\delta^x,$$

with the three parameters all positive. Allowing for variability in these parameters over
age groups, one may propose

$$\mu_x = \alpha_x + \beta_x\delta_x^x,$$

$$\log(\alpha_x) = \log(\alpha_{x-1}) + w_{1x},$$

$$\log(\beta_x) = \log(\beta_{x-1}) + w_{2x},$$

$$\log(\delta_x) = \log(\delta_{x-1}) + w_{3x},$$

where the wjx are initially taken as iid normal. Initial conditions (log(α1), etc) are taken
as N(0,25).
Implementing this model in rstan shows problematic convergence, unless informative
priors are assumed for the standard deviations σj of the errors wjx. Informative priors can
be motivated by an expectation of small changes in death rates between successive ages,
and we assume σj ∼ N+(0, 0.25). This model shows impaired convergence in log(βx) and σ2.
Improved convergence is shown by a model taking w2x and w3x to be Student t with known
d.f. = 4, and also σj ∼ t4+(0, 0.25) (a half-t with 4 df). This option has LOO-IC of 605. As one
example of the parametric outputs, Figure 11.6 plots out the posterior mean αx. Predictive
checks from this model (comparing replicates and actual dx) are satisfactory, with no exceed-
ance probability under 0.05 or over 0.95. The minimum exceedance probability is 0.056 for
age x = 84, with a relatively large observed death total as compared to the modelled total.
Neves and Migon (2007) consider the implications of the fitted curve in deriving the
monthly whole life annuity-due to a life aged x (Bowers et al., 1986). Figure 11.7 plots out
the posterior density of this quantity for a male subject aged 60, at an assumed annual
real interest of 6%. This compares closely with Figure 3 in Neves and Migon (2007).
Convergence problems are completely alleviated if a constancy assumption δx = δ is
made, with the specification now

$$\mu_x = \alpha_x + \beta_x\delta^x,$$

$$\log(\alpha_x) = \log(\alpha_{x-1}) + w_{1x},$$

$$\log(\beta_x) = \log(\beta_{x-1}) + w_{2x},$$

with w1x and w2x normal and σj ∼ N+(0, 0.25). However, this raises the LOO-IC to 606.
The predictive exceedance probability for age x = 84 is now under 0.05.

FIGURE 11.6
Alpha by age (horizontal axis: age, vertical axis: posterior mean α).

FIGURE 11.7
Posterior density, monthly whole life annuity (horizontal axis: life annuity, vertical axis: density).

11.6 Dependent Survival Times: Multivariate and Nested Survival Times


Multivariate and nested survival data can occur in a number of different ways; for dis-
cussions, see Hougaard (1987) and Sinha and Ghosh (2005) for a Bayesian perspective.
Examples are when each subject may experience repetitions of the same event; when sub-
jects may experience more than one event; when times are for subjects arranged in clusters
(including spatially defined units); or in competing risks situations (considered in Section
11.7). For example, bivariate survival models can be used to analyse:

Survival data on twins or other types of matched pair (Anderson et al., 1992);
Reliability data when the lifetime of one component is related to the lifetimes of other components;
Failure times of paired human organs (Sahu and Dey, 2000; Tosch and Holmes, 1980).

Examples of grouped or clustered data are provided by Gustafson (1997) as when several
response times are measured for a single patient in a clinical trial, or when responses are
for patients categorised according to clinic of treatment. Multivariate perspectives on more
specialised survival models are exemplified by Bayesian multivariate cure rate models
(Chen et al., 2002; Yin, 2005), and multivariate counting processes (Sinha and Ghosh, 2005).
The statistical model applied to such data needs to account for the intra-cluster or
inter-event correlation. It may be possible to model the dependence structure directly, for
example, via multivariate versions of widely adopted parametric survival models (Yashin
et al., 2001). Thus Sahu and Dey (2000) consider bivariate exponential and Weibull survival
models for data on times to visual impairment for paired eyes, while Damien and Muller
(1998) provide a Bayesian treatment of a bivariate Gumbel model. The multivariate lognor-
mal is another possibility, which adapts to the situation of conditional multivariate data,
when durations on a second event are obtained conditional on the duration in a first event
(Henderson and Prince, 2000).
Another approach is to introduce random frailty terms at the cluster level or common
frailties across events. The frailty term represents common influences across clusters or
events that are neglected or not observed. Responses on members of a cluster (or on cor-
related events) are typically assumed independent given the value of the cluster effect (or
shared frailty factor) (Castro et al., 2014). Sahu and Dey (2004, p.325) describe how different
frailty assumptions lead to different correlations between log survival times in a bivariate
situation (under the assumption of a Weibull baseline hazard).
Let tij be the failure time for the jth component or outcome (j = 1, …, mi) of the ith subject
(i = 1, …, n). Then the hazard function assuming a common multiplicative frailty takes the
form (Sahu et al., 1997; Yin and Ibrahim, 2005)

$$h(t_{ij}|Z_{ij}, \gamma_i) = \gamma_i h_0(t_{ij})\exp(Z_{ij}\beta_j),$$

with the unit frailty effect γi distributed independently of Zij and tij. If γi is high, then all
hazards are raised, and so times tij tend to be low; if γi is low then all hazards are lowered
and the tij tend to be relatively extended. In this way, the common frailty induces a positive
association between observed times.
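
The positive association induced by a shared frailty can be checked analytically in a simple case. The Python sketch below assumes, for illustration only, a gamma frailty with mean one and variance 1/h, for which integrating out the frailty gives closed-form marginal and joint survivor functions. The function names are hypothetical, and the arguments are cumulative hazards (including any covariate effects).

```python
def marginal_survivor(cum_hazard, h):
    """E[exp(-gamma * H)] for gamma ~ Ga(h, h): (1 + H/h)^(-h)."""
    return (1.0 + cum_hazard / h) ** (-h)

def joint_survivor(cum_hazard_1, cum_hazard_2, h):
    """Joint survivor function for two times sharing the same frailty."""
    return (1.0 + (cum_hazard_1 + cum_hazard_2) / h) ** (-h)

def kendall_tau(h):
    """Kendall's tau under shared gamma frailty with variance 1/h."""
    return 1.0 / (1.0 + 2.0 * h)
```

The joint survivor probability exceeds the product of the marginals, and the gap vanishes as h grows and the frailty variance shrinks to zero.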
In the case of repeated occurrences r = 1, … , Ri of the same outcome to the same subject
(e.g. multiple occupation shifts or repeat cardiac events), the hazard function conditional
on γi is independent of the number r of previous occurrences (Sinha, 1993). Unconditionally,
however, the hazard for the (r + 1)th occurrence is

$$h_r(t_{ir}|Z_i) = h_0(t_{ir})\exp(Z_i\beta)(1 + r\,\mathrm{Var}(\gamma_i)).$$

The same scenario applies when subjects i are nested within clusters j, with cluster effects
γj shared between the nj individuals in the same cluster

$$h(t_{ij}|Z_{ij}, \gamma_j) = \gamma_j h_0(t_{ij})\exp(Z_{ij}\beta_j), \quad i = 1, \dots, n_j;\ j = 1, \dots, J.$$

If the γj are assumed gamma distributed γj ∼ Ga(h, h) with variance 1/h, then smaller
values of h signify a closer relationship between subjects in the same group and greater
heterogeneity between the groups. For models including cure rates, Yin (2005) proposes
multiplicative frailty at cluster level combined with Poisson regression for θij in the cure
fractions exp(−θij). One option takes
$$S^*(t_{ij}) = \exp(-\theta_{ij}\gamma_j F(t_{ij})),$$

with hazard rates $h^*(t_{ij}) = \theta_{ij}\gamma_j f(t_{ij})$.


Survival time data are often highly skewed and this may affect the appropriate form
of frailty. Frailty models allowing for fat tails and skewness are obtained under the skew
log-normal or skew log-t common frailty approach (Sahu and Dey, 2004). Consider a parametric
hazard (e.g. Weibull) for multiple event time data (subjects i = 1, …, n and events
j = 1, …, m) and with subject level scale parameters λij for event j, namely

$$h(t_{ij} \mid \lambda_{ij}, \kappa_j) = \lambda_{ij}\kappa_j t_{ij}^{\kappa_j - 1}.$$

Then a skew log-normal frailty model implies

$$\log(\lambda_{ij}) = Z_i\beta_j + b_i + \delta u_i,$$

where $b_i \sim N(0, \sigma_b^2)$, δ is positive, and $u_i \sim N^+(0, 1)$ with ui independent of bi. Under the skew
log-t model,

$$b_i \sim t(0, \nu, \sigma_b^2),$$

where ν is a degrees of freedom parameter.


In practice, this kind of model may need informative priors for stable identification, bearing
in mind that censoring reduces identifiability of complex random effect models, that bi
and ui are to some extent overlapping in their roles, and that $\sigma_b^2$ and $\delta^2$ are confounded in
$\mathrm{var}(b_i + \delta u_i) = \sigma_b^2 + \delta^2$. To illustrate relevant strategies for priors on the variance components,
uncensored bivariate times (n = 100, m = 2) are generated with Weibull hazards, and scales
$\lambda_{ij} = \exp(\beta_{1j} + \beta_{2j} x_i + b_i + \delta u_i)$ where the xi are standard normal, with $\beta_1 = (-5, -6)$, $\beta_2 = (0.5, 1)$,
κ = (1.5, 2), δ = 0.5, and $1/\sigma_b^2 = \tau_b = 2$, so that $\sigma_b \approx 0.7$. A U(0,5) prior on δ is adopted in the analysis
to re-estimate the parameters (cf. Sahu and Dey, 2004). The re-estimated parameters lead
to considerable under-estimation of σb, with the second half of a single-chain run of 10,000
iterations leading to a posterior mean of 0.054, whereas the posterior mean of δ is 1.6. Assuming
instead a U(0,100) prior on $V = \delta^2 + \sigma_b^2$, and a U(0,1) prior on the ratio $\delta^2/(\delta^2 + \sigma_b^2)$, improves
the estimation of σb with posterior mean 0.55, while the posterior mean of δ is now 1.17.
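The data generation and the second prior scheme can be sketched in base R. This is an illustrative sketch under the stated settings, not the code used for the reported results; the seed and object names are arbitrary. The Weibull hazard λκt^(κ−1) corresponds to `rweibull` with shape κ and scale λ^(−1/κ), and the half-normal ui is obtained as the absolute value of a standard normal draw.

```r
set.seed(123)
n <- 100; m <- 2
beta1 <- c(-5, -6); beta2 <- c(0.5, 1)
kappa <- c(1.5, 2); delta <- 0.5
sigma_b <- sqrt(1 / 2)             # tau_b = 1/sigma_b^2 = 2

x <- rnorm(n)                      # standard normal predictor
b <- rnorm(n, 0, sigma_b)          # symmetric frailty component
u <- abs(rnorm(n))                 # positive (half-normal) component

t <- matrix(NA, n, m)
for (j in 1:m) {
  lambda <- exp(beta1[j] + beta2[j] * x + b + delta * u)
  # hazard lambda*kappa*t^(kappa-1) <=> Weibull with scale lambda^(-1/kappa)
  t[, j] <- rweibull(n, shape = kappa[j], scale = lambda^(-1 / kappa[j]))
}

# prior draws under the total-variance / ratio parameterisation
V <- runif(1000, 0, 100)           # V = delta^2 + sigma_b^2
ratio <- runif(1000, 0, 1)         # delta^2 / (delta^2 + sigma_b^2)
delta_prior <- sqrt(ratio * V)
sigma_b_prior <- sqrt((1 - ratio) * V)
```

Placing priors on V and the ratio keeps the two confounded components tied to a single total, rather than letting each vary freely.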
As for univariate models, flexibility is obtained by adopting a semiparametric hazard,
while allowing also for common frailty. An example involves a semiparametric counting
process including multiplicative frailty for repeated occurrences of the same event (Sinha,
1993). The semiparametric hazard is based on J − 1 intervals $A_j = [a_{j-1}, a_j)$ obtained by considering
distinct failure times, with aJ equal to the maximum time (censored or failed).
Thus, for subject-occurrence index i, subject s, and interval j define an intensity function

$$\lambda(t_{ij} \mid Z_i) = Y(t_{ij})\, b_0(t_{ij}) \exp(Z_{ij}\beta)\,\gamma_s,$$

where b0(t) is the baseline intensity function, and the γs represent subject level frailty. The
integrated baseline intensity $B_0(t) = \int_0^t b_0(u)\,du$ is assumed to follow an independent increments
gamma process, namely

$$dB_0(t) \sim \mathrm{Ga}(c\,dB_0^*(t), c),$$

where $B_0^*(t)$ is an assumed mean intensity. The likelihood kernel for each spell within
each subject is Poisson in form [3] with response variables dNij = 1 or 0, and means
$dB_0(t_{ij}) \exp(Z_{ij}\beta)\gamma_s$.

Example 11.6 Clustered Trial of Infection Treatment


This example involves two forms of nesting: repetitions of events within patients, and
nesting of patients within hospitals. The data are from Fleming and Harrington (1991)
and Yau (2001), and concern a randomised trial of gamma interferon in treating infec-
tions among patients with chronic granulomatous disease. The 126 patients were nested
in 13 hospitals and patients may experience more than one infection. Of 63 patients in
the treatment group, 14 had at least one infection, and 20 infections were recorded in
all, whereas in the placebo group, 30 patients had at least one infection, and there were
56 infections in all.
The n = 201 observations are therefore at three levels: infections at level 1, patients
at level 2, and hospitals as level 3 units. Let tklm be times between recurrent infections,
with k denoting events within patients, l denoting patients (l = 1,… , L), and m denoting
hospitals (m = 1,… , M ) . The analysis seeks to assess the effect of gamma interferon in
reducing the rate of infection as well as taking account of the clustering in the data;
ignoring such clustering may affect the estimated treatment effect. A piecewise exponential
baseline hazard is assumed with J = 20 intervals $(a_{j-1}, a_j]$ based on 5th percentiles
of the observed times, with aJ being the maximum time of 389. Then with a single predictor
(zlm = 1 for treated subjects, 0 otherwise)

$$h(t_{klm} \in (a_{j-1}, a_j] \mid z_{lm}) = \lambda_j \exp(z_{lm}\beta + e_{lm} + u_m),$$

with a gamma process prior on the λj, and normally distributed patient and hospital
effects $e_{lm} \sim N(0, \sigma_e^2)$, and $u_m \sim N(0, \sigma_u^2)$. Because the two sources of variation are confounded,
a uniform prior V ~ U(0,100) is adopted on the total variance $V = \sigma_u^2 + \sigma_e^2$, and a
U(0,1) prior on the ratio $\sigma_u^2/(\sigma_u^2 + \sigma_e^2)$.
A two-chain run using jagsUI converges in 10,000 iterations and provides a LOO-IC
of 755 and WAIC of 754. Although Yau (2001) reported no significant hospital variation,
here posterior means (95% CRI) for σe and σu of 1.09 (0.59, 1.77) and 0.51 (0.11, 1.12) are
obtained, and within hospital correlation estimated as 0.21. The treatment effect is esti-
mated as −1.20 (−1.97,−0.49). Centred hospital effects have posterior means varying from
0.42 (hospital 2) to −0.27 (hospital 10), with the site 2 effect having an 87% probability of
being positive.
Q-Q plots of the posterior mean elm and um both suggest a departure from normality,
associated with positive skew in both sets of effects. Assuming Student t4 priors for both
sets of effects does not improve fit, and the corresponding Q-Q plots (using the R func-
tion TQQPlot) are also not satisfactory.
We then consider a Dirichlet process mixture prior to model the patient random
effects (model 3). A gamma(3.5,0.5) prior is adopted on the Dirichlet precision param-
eter (Dorazio, 2009), with the base density for the random effects taken as normal. This
produces a LOO-IC of 749. A skew-normal model also slightly improves fit in terms of
the LOO-IC, reducing it to 754. The estimated effective parameters total is unchanged
despite the addition of positive normal effects, as in the conditional representation of
the skew-normal (Huang and Dagne, 2012; Ghosh et al., 2007).
As a final option, the random patient and hospital effects are represented as multipli-
cative effects with gamma densities (Glidden and Vittinghoff, 2004). Thus

$$h(t_{klm} \in (a_{j-1}, a_j] \mid z_{lm}) = \lambda_j \alpha_{1lm} \alpha_{2m} \exp(z_{lm}\beta),$$

$$\alpha_{1lm} \sim \mathrm{Ga}(\delta_1, \delta_1),$$

$$\alpha_{2m} \sim \mathrm{Ga}(\delta_2, \delta_2),$$

$$\delta_j \sim \mathrm{Ga}(a_\delta, b_\delta).$$

The prior mean of the gamma effects α1lm and α2m is set at 1 for identification. With the
setting aδ = 1, bδ = 0.1, a two-chain run using jagsUI converges in 10,000 iterations. This
provides similar fit statistics to the normal errors model, namely a LOO-IC of 755 and
WAIC of 753. The estimated gamma parameters δ1 and δ2 (posterior mean and standard deviation)
are 1.62 (1.23) and 12.46 (10.29). A plot of actual patient effects as against the implied
gamma density, Ga(1.62,1.62), shows a better representation of the skew in the patient
effects, albeit with still some discrepancies.

Example 11.7 Bivariate Survival


The Diabetic Retinopathy Study was conducted by the US National Eye Institute to assess
the effect of laser photocoagulation in delaying onset of severe visual loss in patients
with diabetic retinopathy. One eye of each patient was randomly selected for photoco-
agulation, and the other was observed without treatment, with patients followed up over
several years for the occurrence of blindness in one or other eye. The follow-up time is in
months, with 80 patients censored on both eyes at the same time, while 36 patients have
onset in both eyes. Censoring is caused by dropout, death, or termination of the study.
Following Huster et al. (1989), a subset of the data set containing n=197 high-risk patients
is considered here, so there are i = 1,… , n patients and j = 1,…, m events with m = 2. The
correlation between pairs of uncensored times tij for patients in both treatment and con-
trol groups is 0.28, indicating possible dependence between the two times.
The Weibull is judged to be appropriate for these data, as a plot of the transformed
Kaplan-Meier survivor function, namely log[−log(SKM(t))] on log(t), is approximately linear
when either ti1 or ti2 is considered. Alternative analyses consider Weibull survival
with and without lognormal frailty at patient level. Under the frailty model

$$t_{ij} \sim \mathrm{Wei}(\kappa_j, \lambda_{ij}),$$

$$\log(\lambda_{ij}) = \beta_0 + \beta_1 \mathrm{Age}_i/10 + \beta_2 \mathrm{Trt}_{ij} + \beta_3 \mathrm{Type}_i + b_i,$$

$$b_i \sim N(0, \sigma_b^2),$$

where bi are random patient effects, Age is age at diagnosis and Type relates to diabetes
type. A model (implemented using jagsUI) without frailty produces significant Weibull
shape effects: both shape parameters κj have 95% intervals below 1, suggesting a lesser
chance of impairment at longer follow-ups. The LOO-IC is 2,756, with the worst fitted
observation being eye 2 for patient 95. The treatment coefficient β2 has mean (95% CRI)
−0.79 (−1.12, −0.48), and the corresponding hazard ratio θ = exp(−β2) for untreated eyes
averages 2.24, with a 95% credible interval (1.62, 3.06).
In the frailty model, a U(0,10) prior on the standard deviation σb of the random effects
is adopted, with a posterior mean for σb of 0.45 (0.12, 0.82). Under this model, the time
effects are attenuated, with the 95% interval for κ1 now straddling 1. Despite the signifi-
cant patient heterogeneity, the LOO-IC increases slightly to 2,762. The treatment effect
increases, with θ now averaging 2.31 with 95% interval (1.65, 3.22). A histogram and
kernel density plot of the posterior mean bi show a subgroup with high negative values
(see Figure 11.8), suggesting that a discrete mixture approach (e.g. a two-group discrete
mixture normal) to frailty might be appropriate.

FIGURE 11.8
Posterior mean frailties. [Histogram and kernel density plot; x-axis: Frailty, y-axis: Density.]

11.7 Competing Risks
Competing risks (CR) models involve the tracking of multiple durations corresponding
to different types of exit or transition (Haller et al., 2013). A number of packages in R can
estimate competing risks survival regression (Scrucca et al., 2010; Putter, 2018; Scheike and
Zhang, 2011). With non-repeatable events, subjects are observed until the first exit and
completion of one of the multiple durations, but for repeatable events (e.g. occupational
or migration histories), event histories might include repeated transitions between differ-
ent job or residential destinations. Sometimes the cause of exit may be masked (Sen et al.,
2010): exact information on the cause of exit is missing, but information is available that
can determine a set of potential causes of exit.
Assume that there are K possible mutually exclusive causes of exit, and let Ci be a subject
level categorical random variable with K possible outcomes representing observed cause of
exit. Under the latent failure time approach (Crowder, 2001; Box-Steffensmeier and Jones,
2004; Kozumi, 2004; Gelfand et al., 2000) with independent risks, there is a latent failure
time Tik corresponding to each outcome, but only the minimum time is observed when
individual i exits for cause ki, so that ti = min(Ti1 , … , TiK ) with ki = arg min(Ti1 , … , TiK ). The
remaining times are censored. All times are censored if an individual does not exit for any
of the K possible reasons.
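The latent failure time mechanism is easy to simulate. The sketch below assumes independent Weibull latent times for K = 2 causes and an independent uniform censoring time; all parameter values and object names are hypothetical and chosen only for illustration.

```r
set.seed(42)
n <- 8; K <- 2
lambda <- c(0.05, 0.02)   # hypothetical Weibull scale parameters
kappa  <- c(1.2, 2.0)     # hypothetical Weibull shape parameters

# latent times T_ik, one column per cause (independent risks)
Tlat <- sapply(1:K, function(k)
  rweibull(n, shape = kappa[k], scale = lambda[k]^(-1 / kappa[k])))

cens  <- runif(n, 5, 30)                 # hypothetical censoring times
t_min <- apply(Tlat, 1, min)             # earliest latent failure
t_obs <- pmin(t_min, cens)               # observed time
# observed cause: index of the minimum latent time, or 0 if censored on all risks
cause <- ifelse(t_min <= cens, apply(Tlat, 1, which.min), 0)

data.frame(t_obs = round(t_obs, 2), cause = cause)
```

Only `t_obs` and `cause` would be available to the analyst; the non-minimal latent times are never observed.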
With these assumptions, and conditioning on possibly cause-specific predictors Zk, the
total hazard rate may be expressed as a sum of cause-specific hazards,

$$h(t \mid Z_k) = \sum_{k=1}^{K} h_k(t \mid Z_k),$$

where

$$h_k(t \mid Z_k) = \lim_{\Delta t \to 0} \frac{\Pr(t < T \le t + \Delta t, C = k \mid T > t, Z_k)}{\Delta t}.$$

The survival function may be similarly decomposed into K marginal survival functions,
with

$$S(t \mid Z) = \prod_{k=1}^{K} S_k(t \mid Z_k).$$

Assuming a failure to risk ki is observed, the contribution of the ith subject to the likelihood
has the form

$$f_{k_i}(t_i \mid Z_{ik_i}) \prod_{l \ne k_i}^{K} S_l(t_i \mid Z_{il}) = h_{k_i}(t_i \mid Z_{ik_i}) \prod_{l=1}^{K} S_l(t_i \mid Z_{il}),$$

while for a subject censored on all risks, the contribution is $\prod_{l=1}^{K} S_l(t_i \mid Z_{il})$. With event indicators
dik = 1 if Ci = k, and dir = 0 for r ≠ k, the likelihood contribution is equivalently

$$\prod_{r=1}^{K} [f_r(t_i \mid Z_{ir})]^{d_{ir}} [S_r(t_i \mid Z_{ir})]^{1 - d_{ir}}.$$

For continuous survival times, one may assume parametric forms for the time effect, e.g.
a Weibull hazard

$$h_k(t) = \lambda_k \kappa_k t^{\kappa_k - 1},$$

or model risk-specific semiparametric hazard sequences that may be correlated over
causes. Possible label switching problems under the latent failure approach may require
parameter constraints, such as ordering the shape parameters κk (Gelfand et al., 2000).
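For Weibull cause-specific hazards without predictors, the likelihood above reduces on the log scale to a sum over causes of d_ik log h_k(t_i) minus the cumulative hazard λ_k t_i^κ_k. A minimal sketch follows; the function name `cr_weibull_loglik` and the toy data are hypothetical, and cause 0 denotes censoring on all risks.

```r
# Log-likelihood for independent Weibull competing risks (no predictors).
# t: observed times; cause: 0 = censored, k = exit for cause k
cr_weibull_loglik <- function(t, cause, lambda, kappa) {
  K <- length(lambda)
  ll <- 0
  for (k in 1:K) {
    d <- as.numeric(cause == k)                       # event indicator d_ik
    log_h <- log(lambda[k]) + log(kappa[k]) + (kappa[k] - 1) * log(t)
    cum_h <- lambda[k] * t^kappa[k]                   # cumulative hazard
    ll <- ll + sum(d * log_h - cum_h)                 # log h^d + log S
  }
  ll
}

t <- c(2.0, 5.5, 3.1, 8.0)
cause <- c(1, 2, 0, 1)
ll <- cr_weibull_loglik(t, cause, lambda = c(0.05, 0.02), kappa = c(1.2, 2.0))
```

Every subject contributes a survival term for every cause, and a hazard term only for the cause actually observed.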
Often competing risks models are applied to repeated transitions between occupational,
residential, or marital states. The hazard rate then generalises to reflect moves between the
mth observed state and the (m + 1)th state. If Tim denotes the time spent in the mth state, and
occupancy of the mth state for subject i is denoted Cim = k, then

$$h_{kl}(t \mid Z_{ik}) = \lim_{\Delta t \to 0} \Pr(t < T_{im} \le t + \Delta t, C_{i,m+1} = l \mid T_{im} > t, C_{im} = k, Z_{ik})/\Delta t,$$

is the instantaneous risk of moving from state k to state l (with l ≠ k ), given survival in the
mth state until t. Under independent risks, the overall hazard for leaving state k is then

$$h_k(t \mid Z_{ik}) = \sum_{l \ne k} h_{kl}(t \mid Z_{ik}).$$

For discrete time data, the functions described in Section 11.5 similarly generalise to the
competing risk case. For non-repeated events, intervals $[a_{j-1}, a_j)$ for j = 1, …, J + 1, and
Ci ∈ (1, …, K), event probabilities are

$$f_{jk} = \Pr(t \in [a_{j-1}, a_j), C = k),$$

with risk specific hazard functions

$$q_{jk} = \Pr(t \in [a_{j-1}, a_j), C = k \mid t > a_{j-1}) = f_{jk}/S_{j-1},$$

and survivor functions obtained as

$$S_j = \prod_{m=1}^{j} \prod_{h=1}^{K} (1 - q_{mh}).$$

Define event indicators dimh = 1 when a non-repeatable event h occurs in interval m, and 0
otherwise. Then for subject i undergoing the kth event in the jth interval, the event indicators
are dijk = 1, {dijh = 0, h ≠ k} and di1h = di2h = … = di,j−1,h = 0 for all h, with likelihood

$$q_{ijk} \left[ \prod_{m=1}^{j-1} \prod_{h=1}^{K} (1 - q_{imh}) \right] = q_{ijk} S_{i,j-1}.$$

The response at each interval for discrete competing risks is multinomial, and to model
the impact of predictors different links may be used such as the multiple logit, or multiple
probit. Consider a multiple logit link with K + 1 categories (K alternative risks plus an extra
category for survival, denoted by Ci = 0). Let the reference category be for survival, and
define regression coefficients βk for the kth risk. Assuming the βr (r = 1, …, K) do not contain
an intercept would lead to

$$q(t \in [a_{j-1}, a_j), C_i = 0) = \frac{1}{1 + \sum_{r=1}^{K} \exp(\alpha_{jr} + Z_{ir}\beta_r)},$$

$$q(t \in [a_{j-1}, a_j), C_i = h \mid Z_{ih}) = \frac{\exp(\alpha_{jh} + Z_{ih}\beta_h)}{1 + \sum_{r=1}^{K} \exp(\alpha_{jr} + Z_{ir}\beta_r)}, \quad h = 1, \ldots, K,$$

where the parameters αjh describe the baseline hazard for risk h. K-dimensional versions of
the correlated prior processes discussed in Section 11.3 may be used for the αjh, for exam-
ple, multivariate normal first- or second-order random walks.
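The multiple logit probabilities for a single interval can be computed directly. The sketch below is illustrative only: the function name `cr_logit_probs` and all numerical values are hypothetical, and it assumes, for simplicity, that the same predictors enter every risk.

```r
# Multiple logit probabilities for one interval j with K competing risks.
# alpha_j: length-K baseline terms for interval j
# Z:       n x p predictor matrix (same predictors for every risk; an assumption)
# beta:    p x K matrix of risk-specific coefficients (no intercepts)
cr_logit_probs <- function(alpha_j, Z, beta) {
  eta <- sweep(Z %*% beta, 2, alpha_j, "+")   # n x K linear predictors
  num <- exp(eta)
  denom <- 1 + rowSums(num)                   # survival is the reference category
  cbind(survive = 1 / denom, num / denom)
}

Z <- cbind(c(0, 1, 1), c(0.2, -0.5, 1.0))     # 3 subjects, 2 predictors
beta <- matrix(c(0.5, -0.3, -0.2, 0.4), 2, 2) # column k holds beta_k
p <- cr_logit_probs(alpha_j = c(-3, -4), Z, beta)
rowSums(p)                                    # each row sums to 1
```

The first column is the interval survival probability; the remaining K columns are the risk-specific hazards for that interval.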

11.7.1 Modelling Frailty
Assuming independent risks, one may introduce unobserved frailties γik that impact on
each risk, but are uncorrelated across risks, such as independent gamma densities with
mean 1 for each possible cause. Under proportionality, the risk specific hazard in a con-
tinuous time CR hazard is then

$$h_k(t_i \mid Z_{ik}) = \gamma_{ik} h_{0k}(t_i) \exp(Z_{ik}\beta_k).$$

The assumption of independent risks may not hold in practice because particular groups
of subjects may be more likely to experience subsets of the events. Just as it may be unre-
alistic in multinomial discrete choice situations to assume independence of irrelevant
alternatives (i.e. that ratios of choice probabilities of any two alternatives are unaffected
by changes in utilities of any other alternatives, or by their removal), so it may be unreal-
istic in survival analysis that the relative risks of two outcomes will be unaffected by the
removal of a third (Gordon, 2002).
To allow for dependent competing risks, especially for multiple spell data, one may assume
correlated or dependent frailties. In a generalisation of the MPH scheme, Abbring and van
den Berg (2003) mention that the joint distribution of $(T_{i1}, \ldots, T_{iK})$, given predictors Zik and
correlated frailties $(\gamma_{i1}, \ldots, \gamma_{iK})$, factorises into independent densities $f(T_k \mid Z_{ik}, \{\gamma_{i1}, \ldots, \gamma_{iK}\})$
which are fully characterised by cause-specific hazard rates

$$h(T_k \mid Z_k, \{\gamma_1, \ldots, \gamma_K\}) = \gamma_k \lambda_k(t) \exp(Z_k\beta_k).$$

Correlated frailties are also obtained by expanding the regression term to a general mixed
form, as in Section 11.4, so that in a continuous time analysis,

$$h_k(t_i \mid b_{ik}, Z_{ik}) = \lambda_k(t_i) \exp(\beta_k Z_{ik} + b_{ik}),$$

where bik are zero mean effects that might be multivariate normal, discrete mixtures of
multivariate normal, etc.
Assuming a multivariate normal bik with covariance matrix Σb, dependent risks will be
apparent in significant off-diagonal terms. Whether there are significant correlations in
the frailty effects over different risks will depend in part on whether observed predictors
successfully explain variations in event proneness. Another possibility is a common frailty
model with risk specific loadings, so that

$$b_{ik} = \lambda_k b_i,$$

where λk > 0 and bi ∼ N (0, 1) for identification.
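The covariance structure implied by the common frailty model with loadings can be illustrated numerically. In this sketch the loadings are hypothetical values; since every risk-specific effect is a multiple of the same bi, the implied covariance between risks k and l is λkλl and the effects are perfectly correlated across risks.

```r
set.seed(7)
lambda <- c(0.8, 1.5)              # hypothetical risk-specific loadings
b <- rnorm(2000)                   # common frailty b_i ~ N(0, 1)
B <- outer(b, lambda)              # column k holds b_ik = lambda_k * b_i

implied <- outer(lambda, lambda)   # Cov(b_ik, b_il) = lambda_k * lambda_l
round(cov(B), 2)                   # sample covariance, close to 'implied'
cor(B)[1, 2]                       # equals 1 up to rounding error
```

This is more restrictive than an unstructured multivariate normal for the bik, which allows correlations below 1.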

Example 11.8 Hospital Infection


This example demonstrates parametric competing risk analysis of data on hospital
infections from Beyersmann and Scheike (2013), also included as the dataset okiss in the
R package compeir. There are 1,000 subjects, and the data concern treatment for hematologic
disease by peripheral blood stem cell transplantation. Transplants are either
autologous (from the patient's own blood), or allogeneic. After transplantation, patients
are deficient in white blood cells (neutropenic), and at risk of bloodstream infection, a
severe complication (competing risk 1). Alternatively, competing risk 2 is defined as the
end of neutropenia, or death, without bloodstream infection. Predictors are
gender (1 = F, 0 = M), and transplant source (allo = 1 for allogeneic, 0 for autologous).
Figure 11.9 compares Kaplan–Meier, Weibull, and log-logistic survival plots for blood-
stream infection, and suggests the Weibull as a working approximation. So

$$h_k(t_i \mid z_i) = \lambda_{ik}\kappa_k t_i^{\kappa_k - 1},$$

for risks k = 1, 2, where $\lambda_{ik} = \exp(\beta_k x_i)$. The cause-specific cumulative hazards are

$$\Lambda_k(t_i \mid z_i) = \lambda_{ik} t_i^{\kappa_k}.$$

FIGURE 11.9
Survival curve without predictors. [Kaplan-Meier, Weibull, and log-logistic curves; x-axis: Days in Neutropenia.]

Estimation uses rstan [4], with early convergence non-problematic. Table 11.3 shows that
allogeneic transplants are associated with a lower risk of bloodstream infection, and
Figure 11.10 plots out the contrasting cumulative hazard curves for this cause of exit
up to 25 days. The effect of allogeneic transplantation is also negative for the other risk
(end of neutropenia), meaning that events of either type are delayed for the allogeneic
treatment group.
We also apply Cox regression to these data, based on the distinct event times for each
risk. Figure 11.11 shows the resulting cumulative hazard plots for the end of neutropenia.

TABLE 11.3
Posterior Summary. Transplant Data

Weibull Regression                  Mean    St Devn   2.5%    97.5%   Standardised Variability
Risk 1   Intercept                 −4.20    0.20     −4.58   −3.83
         Allogeneic Transplant     −0.54    0.14     −0.82   −0.28    0.26
         Female                    −0.18    0.14     −0.46    0.09    0.79
         Weibull Shape              1.22    0.07      1.09    1.36
Risk 2   Intercept                 −4.98    0.13     −5.25   −4.72
         Allogeneic Transplant     −1.19    0.07     −1.33   −1.04    0.06
         Female                     0.09    0.07     −0.06    0.23    0.84
         Weibull Shape              2.04    0.04      1.96    2.13

Cox Regression                      Mean    St Devn   2.5%    97.5%   Standardised Variability
Risk 1   Allogeneic Transplant     −0.25    0.15     −0.54    0.03    0.58
         Female                    −0.17    0.14     −0.46    0.11    0.83
Risk 2   Allogeneic Transplant     −1.26    0.08     −1.41   −1.10    0.06
         Female                     0.06    0.08     −0.08    0.21    1.17

FIGURE 11.10
Cumulative hazard by transplant source. [Allogeneic versus autologous; x-axis: Days (0 to 25), y-axis: Cumulative Cause-Specific Hazards.]

FIGURE 11.11
Cox regression. Cumulative hazard, end of neutropenia, by transplant source. [Allogeneic versus autologous; x-axis: Days (0 to 40), y-axis: Cumulative Cause-Specific Hazards.]


An issue in comparing Cox and parametric regression is the possibility of differing precision
of estimated covariate and treatment effects (Nardi and Schemper, 2003). It can be
seen from Table 11.3 that the estimated coefficients for allo and sex from Cox regression
have higher standardised variability (i.e. are less efficient). This is especially so for the
impact of transplant type on the risk of bloodstream infection. Standardised variability
is measured as sd(β)/|β|.
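As a quick check on the definition, the measure can be recomputed from the rounded risk 1 summaries in Table 11.3. This is a sketch only: the object names are arbitrary, and because the inputs are rounded to two decimals the results need not match the tabulated values exactly (those would be based on unrounded posterior summaries).

```r
# Posterior means and standard deviations for risk 1, as printed in Table 11.3
post_mean <- c(allo_weibull = -0.54, sex_weibull = -0.18,
               allo_cox = -0.25, sex_cox = -0.17)
post_sd <- c(0.14, 0.14, 0.15, 0.14)

# standardised variability: sd(beta) / |beta|
sv <- post_sd / abs(post_mean)
round(sv, 2)
```

The Cox value for the allogeneic effect on bloodstream infection is more than double the Weibull one, which is the contrast the text draws attention to.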

Example 11.9 Follicular Cell Lymphoma, Cause-Specific and Subdistribution Hazards

This example considers the follicular cell lymphoma data from Pintilie (2006), also
included in the R package timereg (Scheike and Zhang, 2011). The influence of
four covariates is explored on two alternative outcomes: (1) relapse or no response,
or (2) death in remission. There are 272 subjects with no response or relapse, 76
deaths without relapse, and 193 censored subjects. Predictors are stage of disease
(1=stage  II, 0=stage I), treatment (1=radiotherapy and chemotherapy combined,
0=radiotherapy only), age and haemoglobin level, with the latter two predictors
divided by 100. Event times are in years since diagnosis. Following discussions
such as Latouche et al. (2013), both cause-specific and subdistribution hazard
regressions are estimated.
Using the counting process version of the Cox model requires that at risk indicators
and increment indicators be set for each of the competing responses, based on
distinct event times. So, define a variable cause in R having values 0 for censoring, 1
for relapse/no response, and 2 for death in remission. With obs_t denoting event (or
censoring) times, the command sequence in R to define the two sets of indicators for
event 1 is:

   d1=ifelse(cause==1,1,0)
   t.d1=subset(obs_t,cause==1)
   # unique event times
   t.d1.unique=unique(t.d1)
   NT1=length(t.d1.unique)
   # sorted unique event times, plus an end point beyond the last observed time
   t1_unique=c(sort(t.d1.unique),max(obs_t)+1)
   # define at risk and counting process increments
   Y1=dN1=matrix(NA,N,NT1)
   for (i in 1:N) {for (j in 1:NT1) {
     Y1[i,j]=ifelse(obs_t[i]>=t1_unique[j],1,0)}}
   for (i in 1:N) {for (j in 1:NT1) {
     dN1[i,j]=Y1[i,j]*(t1_unique[j+1]>obs_t[i])*d1[i]}}

This is the usual assignment of at risk and increment indicators in cause-specific hazard
regression. The cause-specific hazard is the instantaneous risk of the event (i.e. a
specific cause of exit) in subjects currently event-free, namely for cause k,

$$h_k(t) = \lim_{\Delta t \to 0} \left\{ \frac{\Pr(t \le T < t + \Delta t, K = k \mid T > t)}{\Delta t} \right\}.$$

By contrast, the subdistribution hazard is the instantaneous risk of an event in subjects
yet to experience an event of that type. So

$$\tilde{h}_k(t) = \lim_{\Delta t \to 0} \left\{ \frac{\Pr(t \le T < t + \Delta t, K = k \mid (T > t) \text{ or } (T \le t \text{ and } K \ne k))}{\Delta t} \right\}.$$

TABLE 11.4
Cell Lymphoma. Alternative Hazard Regression Coefficients, Posterior Summaries

                                      Cause-Specific Hazard     Subdistribution Hazard
Competing Events     Predictor        Mean    2.5%    97.5%     Mean    2.5%    97.5%
Relapse              Stage            0.35    0.09    0.61      0.40    0.16    0.65
                     Chemotherapy     0.09   −0.26    0.41     −0.03   −0.39    0.30
                     Age/100          4.29    3.35    5.25      1.85    0.97    2.71
                     HGB/100          0.79    0.12    1.47      0.60   −0.07    1.26
Death in Remission   Stage            0.11   −0.44    0.63     −0.09   −0.62    0.42
                     Chemotherapy    −0.08   −0.85    0.63     −0.38   −1.09    0.23
                     Age/100          8.32    6.12   10.52      4.36    2.66    5.95
                     HGB/100         −0.01   −1.61    1.56     −0.54   −1.89    0.71

The risk set now includes subjects who have previously experienced a competing cause
of exit, as well as subjects currently event-free. For the cause-specific hazard, the risk set
reduces every time there is an exit from another cause, with such an exit viewed as censored. With
the subdistribution hazard, subjects that exit for a cause j ≠ k remain in the risk set for
cause k and are given a censoring time larger than all event times. The coefficients from
a subdistribution hazard model may be interpreted as the impacts of covariates on the
incidence of the event (Austin and Fine, 2017).
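The difference between the two risk sets can be seen on a toy dataset. The sketch below uses hypothetical values for `obs_t` and `cause` and builds the at risk matrices for event 1 in vectorised form with `outer()`; following the description above, subjects exiting for the competing cause are simply assigned a time beyond all observed times (no censoring weights are used, which is a simplification).

```r
obs_t <- c(2, 3, 3, 5, 7, 8)       # toy event or censoring times
cause <- c(1, 2, 0, 1, 2, 1)       # 0 = censored, 1 = event of interest, 2 = competing

# distinct event times for cause 1
t1 <- sort(unique(obs_t[cause == 1]))

# cause-specific risk set: a competing exit removes the subject (treated as censored)
Y_cs <- outer(obs_t, t1, ">=") * 1

# subdistribution risk set: competing exits are kept at risk by giving them
# a time larger than all event times
t_sub <- ifelse(cause == 2, max(obs_t) + 1, obs_t)
Y_sd <- outer(t_sub, t1, ">=") * 1

Y_cs[2, ]   # subject 2 (competing exit at t = 3) drops out after the first event time
Y_sd[2, ]   # but stays at risk throughout under the subdistribution definition
```

Rows for subjects censored or failing from cause 1 are identical in the two matrices; only the rows for competing exits differ.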
Table 11.4 summarises the posterior distributions of the covariate effects under these
alternative approaches. The effects are not that dissimilar, mainly differing in a lower
impact of age on incidence, while the impact of stage on the incidence of relapse is
enhanced. The estimated coefficients under classical methods (using coxph from survival,
and crr from cmprsk) are similar to the Bayesian estimates.

11.8 Computational Notes
[1] The rstan code for the counting process model in Example 11.2 includes R calculations
to convert time and event indicators (ti, di) into suitable form. Thus

   obs_t=t
   t.d=subset(obs_t,d==1)
   # unique event times
   t.d.unique=unique(t.d)
   NT=length(t.d.unique)
   t_unique=c(sort(t.d.unique),max(obs_t)+1)
   # define at risk and counting process increments
   Y=dN=matrix(NA,N,NT)
   for (i in 1:N) {for (j in 1:NT) {
     Y[i,j]=ifelse(obs_t[i]>=t_unique[j],1,0)}}
   for (i in 1:N) {for (j in 1:NT) {
     dN[i,j]=Y[i,j]*(t_unique[j+1]>obs_t[i])*d[i]}}
   # centred and scaled Karnofsky score
   KS.c=(KS-mean(KS))/10
   # dataset (list names must match the Stan data block, so prior therapy is passed as PT)
   Dstan=list(N=N,NT=NT,t_unique=t_unique,Y=Y,dN=dN,PT=PT,KS=KS.c)
   CP.stan ="
   data {
   int<lower=0> N;
   real KS[N];
   int<lower=0> NT;
   int<lower=0> Y[N,NT];
   int<lower=0> dN[N,NT];
   int<lower=0> t_unique[NT + 1];
   real PT[N];
   }
   transformed data {
   real c;
   real r;
   c = 0.001;
   r = 0.1;
   }
   parameters {
   real beta[2];
   real betaKS[NT];
   real<lower=0.001> sigmaKS;
   real<lower=0> dL0[NT];
   }
   model {
   real dt[NT];
   beta ~ normal(0, 10);
   sigmaKS ~ uniform(0,1);
   betaKS[1] ~ normal(0,1);
   //RW prior on KS coefficients
   for (j in 2:NT){betaKS[j] ~ normal(betaKS[j-1],sigmaKS);}
   //gamma increments prior
   for (j in 1:NT) {dt[j] = t_unique[j+1] - t_unique[j];
               dL0[j] ~ gamma(r * dt[j] * c, c);
   //Poisson kernel, evaluated only while the subject is at risk
   for (i in 1:N) {if (Y[i, j]!= 0)
          target += poisson_lpmf(dN[i, j] |
          Y[i, j]*exp(betaKS[j]*KS[i]+beta[1]*PT[i]+beta[2]*PT[i]*KS[i]) * dL0[j]);}}}
   generated quantities {
   real S_noPT[NT];
   real S_PT[NT];
   //Survivor functions by prior therapy, Karnofsky score set at upper quartile
   for (j in 1:NT) {
   real s;
   s = 0;
   for (i in 1:j)
   s = s + dL0[i];
   S_PT[j] = pow(exp(-s), exp(betaKS[j]*1.64+beta[1]+beta[2]*1.64));
   S_noPT[j] = pow(exp(-s), exp(betaKS[j]*1.64));}}
   "
   # Compilation and Estimation
   sm = stan_model(model_code=CP.stan)
   fit = sampling(sm,data=Dstan,iter=1500,warmup=250,chains=2,seed=12345)
   print(fit)
   betaKS <- extract(fit,"betaKS",permuted=FALSE)


[2] The code for the cure fraction age of maternity model is

   loglogistCF.stan ="
    functions{
    //log density for an observed event under the log-logistic cure fraction model
    real loglogistCF_lpdf(real t, real kappa, real lambda, real theta) {
    return(log(theta)+log(kappa)+log(lambda)+(kappa-1)*log(t)
    -2*log(1+lambda*t^kappa)-theta*(1-1/(1+lambda*t^kappa)));}
    //log survivor function, used for censored observations
    real loglogistCF_S_lpdf(real t, real kappa, real lambda, real theta) {
    return(-theta*(1-1/(1+lambda*t^kappa)));}
    }
    data {int<lower=1> n;//number of cases
    vector[n] t;//response
    int<lower=0,upper=1> d[n];//event indicator(1=occurred, 0=censored)
    int<lower=0> p;//total regression parameters, incl. intercept
    int<lower=0> educ[n];
    int<lower=0> sibs[n];
    int<lower=0> white[n];
    int<lower=0> immig[n];
    int<lower=0> lowinc[n];
    int<lower=0> city[n];
    }
    parameters {vector[p] beta;
    real<lower=1> kappa;//shape parameter
    real<lower=0> theta;//cure fraction parameter
    }
    transformed parameters {
    real eta[n];
    real lambda[n];
    real lambdaT;
    real p_nochild;
    //modal age first maternity (13 years education, 3 siblings, white)
    real modeT;
    p_nochild = exp(-theta);//rate of childlessness
    lambdaT = exp(beta[1]+beta[2]*13+beta[3]*3+beta[4]);
    modeT = ((kappa-1)/lambdaT)^(1/kappa);
    for (i in 1:n) {eta[i]= beta[1]+beta[2]*educ[i]+beta[3]*sibs[i]
    +beta[4]*white[i]+beta[5]*immig[i]
    +beta[6]*lowinc[i]+beta[7]*city[i];
    lambda[i] =exp(eta[i]);}}
    model {target += gamma_lpdf(kappa | 0.01, 0.01);
    target += gamma_lpdf(theta | 0.01, 0.01);
    for (i in 1:n) {
    if (d[i] == 1) {target += loglogistCF_lpdf(t[i] | kappa, lambda[i], theta);}
    else if (d[i] == 0) {target += loglogistCF_S_lpdf(t[i] | kappa, lambda[i], theta);}}}
    generated quantities{real log_lik[n];
    for (i in 1:n) {
    if (d[i] == 1) {log_lik[i]= loglogistCF_lpdf(t[i] | kappa, lambda[i], theta);}
    else if (d[i] == 0) {log_lik[i]= loglogistCF_S_lpdf(t[i] | kappa, lambda[i], theta);}}}
    "

[3] An example of the computation involves repeated event times for mammary
tumours in rats randomly assigned to treatment and control groups (Sinha,
1993). Totals of tumours diagnosed in each rat vary between 0 and 13, so spell
totals for each rat (including possibly censored final spells) range from 1 to 14.
There are n=253 spells in all, for K=48 rats, and J=35 distinct times relevant to
defining the intervals, with aJ = tmax = 182. A BUGS/JAGS code for such an analysis,
including gamma frailty for each rat, a treatment covariate, and indicators d[i]
of tumour occurrence or censoring, is

   model {for (j in 1:J) {for(i in 1:n) {
   # Y indicates whether case still at risk
   Y[i,j] <- step(t[i] - a[j] + eps)
   dN[i, j] <- Y[i, j] * step(a[j + 1] - t[i] - eps) * d[i]
   dN[i, j] ~ dpois(lam[i, j])
   lam[i, j] <- Y[i, j] * exp(beta * trt[i]) * dB0[j] * gam[rat[i]]}
   # independent increment gamma process
   dB0[j] ~ dgamma(mu[j], c); mu[j] <- dB0.star[j] * c
   dB0.star[j] <- M * (a[j + 1] - a[j])
   # Survivorship in two groups
   S.tr[j] <- pow(exp(-sum(dB0[1: j])), exp(beta));
   S.cntr[j] <- exp(-sum(dB0[1: j]))}
   # priors on hyperparameters
   c <- 1; M ~ dexp(1); beta ~ dnorm(0,0.001)
   # frailty prior
   for (k in 1:K) {gam[k] ~ dgamma(h,h)}
   h ~ dgamma(1,0.001)
   var.gam <- 1/h}

where eps is a small positive value to ensure at risk and counting indices are correctly
defined. The gamma process includes an unknown parameter M defining
the mean intensity. The first few records for the spell level data take the form

   rat[]  trt[]  t[]   d[]
   1      1      182   1
   2      1      182   0
   3      1      63    1
   3      1      68    1
   3      1      182   0
   4      1      152   1
   4      1      182   0

while the other data inputs are

   list(n=253,J=34,a=c(63,66,68,71,74,77,81,84,85,88,
   91,95,98,102,105,108,112,116,119,123,
   126,130,134,137,140,145,150,152,157,
   161,167,172,174,179,182),eps=0.001,K=48).

[4] The code for this analysis is

   weibCR.stan ="
   data {
   int<lower=1> N;//number of cases
   int<lower=1> N2;//length of expanded log-likelihood vector (indexed up to 2*N below)
   int<lower=1> T;//number of time points for CH profiles
   int<lower=1> K;//number of competing causes of exit
   vector[N] time;//observed or censored times
   vector[T] timeprof;//time points for CH profiles
   int<lower=0,upper=1> cens1[N];//right censoring, cause 1
   int<lower=0,upper=1> cens2[N];//right censoring, cause 2
   int<lower=0> p;//total regression parameters, including intercept
   int<lower=0> allo[N];
   int<lower=0> sex[N];
   }
   parameters {vector[p] beta1;
   vector[p] beta2;
   real<lower=0> shape[K];//shape parameters
   }
   transformed parameters {
   real eta1[N];
   real nu1[N];
   real eta2[N];
   real nu2[N];
   real S1allo[T];//survival functions
   real S1auto[T];
   real S2allo[T];
   real S2auto[T];
   real CH1allo[T];//cumulative hazards
   real CH1auto[T];
   real CH2allo[T];
   real CH2auto[T];
   for (t in 1:T) {
   S1allo[t] = exp(-exp(beta1[1]+beta1[2])*timeprof[t]^shape[1]);
   S1auto[t] = exp(-exp(beta1[1])*timeprof[t]^shape[1]);
   S2allo[t] = exp(-exp(beta2[1]+beta2[2])*timeprof[t]^shape[2]);
   S2auto[t] = exp(-exp(beta2[1])*timeprof[t]^shape[2]);
   CH1allo[t] = -log(S1allo[t]);
   CH1auto[t] = -log(S1auto[t]);
   CH2allo[t] = -log(S2allo[t]);
   CH2auto[t] = -log(S2auto[t]);}
   //Stan Weibull scale nu = exp(-eta/shape), so that lambda = exp(eta)
   for (i in 1:N) {eta1[i]=beta1[1]+beta1[2]*allo[i]+beta1[3]*sex[i];
   nu1[i] = exp(-eta1[i]/shape[1]);
   eta2[i]=beta2[1]+beta2[2]*allo[i]+beta2[3]*sex[i];
   nu2[i] = exp(-eta2[i]/shape[2]);}
   }
   model {target += gamma_lpdf(shape | 0.01, 0.01);
   for (i in 1:N) {
   if (cens1[i] == 0) {target += weibull_lpdf(time[i] | shape[1], nu1[i]);}
   else if (cens1[i] == 1) {target += weibull_lccdf(time[i] | shape[1], nu1[i]);}
   if (cens2[i] == 0) {target += weibull_lpdf(time[i] | shape[2], nu2[i]);}
   else if (cens2[i] == 1) {target += weibull_lccdf(time[i] | shape[2], nu2[i]);}
   }}
   generated quantities{real log_lik[N2];
   //expanded log-likelihood vector over K=2 causes
   for (i in 1:N) {
   if (cens1[i] == 0) {log_lik[i]= weibull_lpdf(time[i] | shape[1], nu1[i]);}
   else if (cens1[i] == 1) {log_lik[i]= weibull_lccdf(time[i] | shape[1], nu1[i]);}
   if (cens2[i] == 0) {log_lik[i+N]= weibull_lpdf(time[i] | shape[2], nu2[i]);}
   else if (cens2[i] == 1) {log_lik[i+N]= weibull_lccdf(time[i] | shape[2], nu2[i]);}
   }}
   "

References
Aalen O (1988) Heterogeneity in survival analysis. Statistics in Medicine, 7, 1121–1137.
Aalen O, Hjort N (2002) Frailty models that yield proportional hazards. Statistics & Probability Letters,
58, 335–342.
Abbring J, van den Berg G (2003) The identifiability of the mixed proportional hazards competing
risks model. Journal Royal Statistical Society: Series B, 65, 701–710.
Abbring J, van den Berg G (2007) The unobserved heterogeneity distribution in duration analysis.
Biometrika, 94, 87–99.
Aitkin M, Clayton D (1980) The fitting of exponential, Weibull and extreme value distributions to
complex censored survival data using GLIM. Journal of Applied Statistics, 29, 156–163.
Aitkin MA, Aitkin M, Francis B, Hinde J (2005) Statistical Modelling in GLIM 4. OUP, Oxford, UK.
Albert JH, Chib S (2001) Sequential ordinal modeling with applications to survival data. Biometrics,
57(3), 829–836.
Allison P (1997) Survival Analysis Using the SAS System: A Practical Guide. SAS Institute Inc., Cary, NC.
Anderson JE, Louis TA, Holm NV, Harvald B (1992) Time-dependent association measures for bivari-
ate survival distributions. Journal of the American Statistical Association, 87(419), 641–650.
Arjas E, Gasbarra D (1994) Nonparametric Bayesian inference from right censored survival data,
using the Gibbs sampler. Statistica Sinica, 4, 505–524.
Austin P (2017) A tutorial on multilevel survival analysis: Methods, models and applications.
International Statistical Review, 85(2), 185–203.
Austin P, Fine J (2017) Practical recommendations for reporting Fine-Gray model analyses for com-
peting risk data. Statistics in Medicine, 36(27), 4391–4400.
Baker M, Melino A (2000) Duration dependence and nonparametric heterogeneity: A Monte Carlo
study. Journal of Econometrics, 96, 357–393.
Banerjee S, Carlin BP (2004) Parametric spatial cure rate models for interval-censored time-to-relapse
data. Biometrics, 60(1), 268–275.
Barmby T (2002) Worker absenteeism: A discrete hazard model with bivariate heterogeneity. Labour
Economics, 9, 469–447.
Bender A, Groll A, Scheipl F (2018) A generalized additive model approach to time-to-event analysis.
Statistical Modelling, 18, 1–23.
Bennett S (1983) Log-logistic regression models for survival data. Applied Statistics, 32, 165–171.
Beyersmann J, Scheike T (2013) Classical regression models for competing risks, Chapter 8, pp 157–177,
in Handbook of Survival Analysis, eds J Klein, H C van Houwelingen, J G Ibrahim, T Scheike. CRC.
Bogaerts K, Komarek A, Lesaffre E (2017) Survival Analysis with Interval-Censored Data: A Practical
Approach with Examples in R, SAS, and BUGS. CRC Press.
Børing P (2009) Gamma unobserved heterogeneity and duration bias. Econometric Reviews, 29(1),
1–19.
Bowers N, Gerber H, Hickman J (1986) Actuarial Mathematics, 1st Edition. The Society of Actuaries,
Itasca, IL.
Box-Steffensmeier JM, Box-Steffensmeier JM, Jones BS (2004) Event History Modeling: A Guide for
Social Scientists. Cambridge University Press.
520 Bayesian Hierarchical Models

Brard C, Le Teuff G, Le Deley M, Hampson L (2017) Bayesian survival analysis in clinical trials: What
methods are used in practice? Clinical Trials, 14(1), 78–87.
Brezger A, Kneib T, Lang S (2005) BayesX: Analyzing Bayesian Structured Additive Regression
Models , Journal of Statistical Software September 14 (11), 1–22.
Brouhns N, Denuit M, Vermunt J (2002) A Poisson log-bilinear regression approach to the construc-
tion of projected lifetables. Insurance: Mathematics and Economics, 31(3), 373–393.
Brüderl J, Diekmann A (1995) The log-logistic rate model: two generalizations with an application to
demographic data. Sociological Methods & Research, 24, 158–186.
Buros J (2016) Model Checking with Simulated Data (Survival Model Example) https​://ww​w.bio​
condu​ctor.​org/h​elp/c​ourse​-mate​rials​/2016​/BioC​2016/​Concu​rrent​Works​hops4​/Buro​s/wei​
bull-​survi​val-m​odel.​html
Castro M, Chen M-H, Ibrahim J, Klein J (2014) Bayesian transformation models for multivariate sur-
vival data. Scandinavian Journal of Statistics, 41(1), 187–199.
Chen Q, Wu H, Ware L B, Koyama T (2014) A Bayesian approach for the cox proportional hazards
model with covariates subject to detection limit. International Journal of Statistics in Medical
Research, 3(1), 32–43.
Chen M-H, Ibrahim J, Sinha D (1999) A new Bayesian model for survival data with a surviving frac-
tion. Journal of the American Statistical Association, 94, 909–919.
Chen M-H, Ibrahim J, Sinha D (2002) Bayesian inference for multivariate survival data with a cure
fraction. Journal of Multivariate Analysis, 80, 101–126.
Chen Y, Jewell N (2001) On a general class of semiparametric hazards regression models. Biometrika,
88, 687–702.
Chiang C (1984) The Life Table and its Applications. R.E. Krieger, Malabar, FL.
Clayton D (1991) A Monte Carlo method for bayesian inference in frailty models. Biometrics, 47,
467–485.
Congdon P (2008) A bivariate frailty model for events with a permanent survivor fraction and non-
monotonic hazards; with an application to age at first maternity. Computational Statistics & Data
Analysis, 52, 4346–4356.
Congdon P (2009) Life expectancies for small areas: A Bayesian random effects methodology.
International Statistical Review, 77(2), 222–240.
Cooner F, Banerjee S, McBean A (2006) Modelling geographically referenced survival data with a
cure fraction. Statistical Methods in Medical Research, 15, 307–324.
Cox C, Matheson M (2014) A comparison of the generalized gamma and exponentiated Weibull dis-
tributions. Statistics in Medicine, 33(21), 3772–3780.
Cox D (1972) Regression models and life-tables. Journal of the Royal Statistical Society: Series B, 34,
187–220.
Crippa A (2018) A Not So Short Review on Survival Analysis in R. https://rpubs.com/alecri/258589
Crowder M (2001) Classical Competing Risks. CRC Press.
Damien P, Muller P (1998) A Bayesian bivariate failure time regression model. Computational Statistics
& Data Analysis, 28, 77–85.
Demarqui F, Loschi R, Colosimo E (2008) Estimating the grid of time-points for the piecewise expo-
nential model. Lifetime Data Analysis, 14(3), 333–356.
Diekmann A, Mitter P (1983) The “Sickle Hypothesis”: A time-dependent Poisson model with appli-
cations to deviant behavior and occupational mobility. The Journal of Mathematical Sociology, 9,
85–101.
Dorazio R (2009) On selecting a prior for the precision parameter of the Dirichlet process mixture
models. Journal of Statistical Planning and Inference, 139, 3384–3390.
Dykstra RL, Laud P (1981) A Bayesian nonparametric approach to reliability. The Annals of Statistics,
9(2), 356–367.
Efron B (1988) Logistic regression, survival analysis, and the Kaplan-Meier curve. Journal of the
American Statistical Association, 83(402), 414–425.
Fahrmeir L, Knorr Held L (1997) Dynamic discrete time duration models. Sociological Methodology,
27, 417–452.
Survival and Event History Models 521

Fahrmeir L, Tutz G (2001) Multivariate Statistical Modelling Based on Generalized Linear Models, 2nd
Edition. Springer Series in Statistics. Springer Verlag, New-York, Berlin, Heidelberg.
Fleming TR, Harrington DP (1991) Counting Processes and Survival Analysis, Vol. 169. John Wiley &
Sons.
Florens J, Fougere D, Mouchart M (1995) Duration models, pp 491–534, in The Econometrics of Panel
Data, eds L Matyas, P Sevestre. Kluwer.
Gamerman D (1991) Dynamic Bayesian models for survival data. Applied Statistics, 40, 63–79.
Gelfand A, Ghosh S, Christiansen C, Soumerai S, McLaughlin T (2000) Proportional hazard models:
a latent competing risk approach. Journal of Applied Statistics, 49, 385–397.
Ghosh P, Branco MD, Chakraborty H (2007) Bivariate random effect model using skew-normal dis-
tribution with application to HIV-RNA. Statistics in Medicine, 26(6), 1255–1267.
Glidden D, Vittinghoff E (2004) Modelling clustered survival data from multicentre clinical trials.
Statistics in Medicine, 23(3), 369–388.
Gordon S (2002) Stochastic dependence in competing risks. American Journal of Political Science, 46,
200–217.
Gore S, Pocock S, Kerr G (1984) Regression models and non-proportional hazards in the analysis of
breast cancer survival. Journal of Applied Statistics, 33, 176–195.
Gustafson P (1997) Large hierarchical Bayesian analysis of multivariate survival data. Biometrics, 53,
230–242.
Gustafson P (2000) Bayesian regression modeling with interactions and smooth effects. Journal of the
American Statistical Association, 95(451), 795–806.
Gustafson P, Aeschliman D, Levy A (2003) A simple approach to fitting Bayesian survival models.
Lifetime Data Analysis, 9, 5–19.
Hagar Y, Dignam J, Dukic V (2017) Flexible modeling of the hazard rate and treatment effects in long-
term survival studies. Statistical Methods in Medical Research, 26(5), 2455–2480.
Haller B, Schmidt G, Ulm K (2013) Applying competing risks regression models: An overview.
Lifetime Data Analysis, 19(1), 33–58.
Heckman J, Singer B (1984) A method for minimizing the impact of distributional assumptions in
econometric models for duration data. Econometrica, 52, 271–320.
Henderson R, Oman P (1999) Effect of frailty on marginal regression estimates in survival analysis.
Journal of the Royal Statistical Society: Series B, 61, 367–379.
Henderson R, Shimakura S, Gorst D (2002) Modeling spatial variation in leukemia survival data.
Journal of the American Statistical Association, 97, 965–972.
Henderson R, Prince H (2000) Choice of conditional models in bivariate survival. Statistics in Medicine,
19, 563–574.
Herring A, Ibrahim J (2002) Maximum likelihood estimation in random effects cure rate models with
nonignorable missing covariates. Biostatistics, 3, 387–405.
Hougaard P (1987) Modelling multivariate survival. Scandinavian Journal of Statistics, 14(4), 291–304.
Hougaard P (2000) Analysis of Multivariate Survival Data. Springer, New York.
Hougaard P, Myglegaard P, Borch-Johnsen K (1994) Heterogeneity models of disease susceptibility,
with application to diabetic nephropathy. Biometrics, 50, 1178–1188.
Huang Y, Dagne G (2012) Bayesian semiparametric nonlinear mixed-effects joint models for data
with skewness, missing responses, and measurement errors in covariates. Biometrics, 68(3),
943–953.
Huster W, Brookmeyer R, Self S (1989) Modeling paired survival data with covariates. Biometrics, 45,
145–156.
Ibrahim J, Chen M-H, MacEachern S (1999) Bayesian variable selection for proportional hazards
models. The Canadian Journal of Statistics, 27, 701–717.
Ibrahim J, Chen M-H, Sinha D (2001) Bayesian Survival Analysis. Springer-Verlag.
Kalbfleisch JD (1978) Non-parametric Bayesian analysis of survival time data. Journal of the Royal
Statistical Society: Series B (Methodological), 40(2), 214–221.
Kalbfleisch JD, Prentice R (1980) The Statistical Analysis of Failure Time Data. Wiley, New York.
522 Bayesian Hierarchical Models

Keiding N, Andersen P, Klein J (1997) The role of frailty models and accelerated failure time models
in describing heterogeneity due to omitted covariates. Statistics in Medicine, 16, 215–224.
Kiefer N (1988) Economic duration data and hazard functions. Journal of Economic Literature, 26,
646–679.
Kneib T (2006) Mixed model-based inference in geoadditive hazard regression for interval-censored
survival times. Computational Statistics & Data Analysis, 51, 777–792.
Kostaki A, Panousis V (2001) Expanding an abridged life table. Demographic Research, 5, 1.
Kozumi H (2004) Posterior analysis of latent competing risk models by parallel tempering.
Computational Statistics & Data Analysis, 46, 441–458.
Kuo L, Mallick B (1997) Bayesian semiparametric inference for the accelerated failure-time model.
Canadian Journal of Statistics, 25, 457–472.
Lambert P (2007) Modeling of the cure fraction in survival studies. The Stata Journal, 7(3), 351–375.
Lancaster T (1990) The Econometric Analysis of Transition Data. Cambridge University Press.
Latouche A, Allignol A, Beyersmann J, Labopin M, Fine J (2013) A competing risks analysis should
report results on all cause-specific hazards and cumulative incidence functions. Journal of
Clinical Epidemiology, 66(6), 648–653.
Lawless J (1980) Inference in the generalized gamma and log gamma distributions. Technometrics,
22(3), 409–419.
Lee K, Chakraborty S, Sun J (2015) Survival prediction and variable selection with simultaneous
shrinkage and grouping priors. Statistical Analysis and Data Mining, 8(2), 114–127.
Li K (1999) Bayesian analysis of duration models: An application to Chapter 11 bankruptcy. Economics
Letters, 63(3), 305–312.
Li M (2007) Bayesian proportional hazard analysis of the timing of high school dropout decisions.
Econometric Reviews, 26, 529–556.
Lopes H, Muller P, Ravishanker N (2007) Bayesian computational methods in biomedical research, in
Computational Methods in Biomedical Research, eds R Khattree, D Naik.
Manda S, Gilthorpe M, Tu Y, Blance A, Mayhew M (2005) A Bayesian analysis of amalgam restora-
tions in the Royal Air Force using the counting process approach with nested frailty effects.
Statistical Methods in Medical Research, 14, 567–578.
Marano G, Boracchi P, Biganzoli E (2016) Estimation of the piecewise exponential model by Bayesian
P-splines via Gibbs sampling: Robustness and reliability of posterior estimates. Open Journal of
Statistics, 6, 451–468.
Morris CN, Norton EC, Zhou XH (1994) Parametric duration analysis of nursing home usage,
pp 231–248, in Case Studies in Biometry, eds N Lange, L Ryan, L Billard, D Brillinger, L Conquest,
J Greenhouse. Wiley.
Mosler K (2003) Mixture models in econometric duration analysis. Applied Stochastic Models in
Business and Industry, 19, 91–104.
Murray T, Hobbs B, Sargent D, Carlin B (2016) Flexible Bayesian survival modeling with semipara-
metric time-dependent and shape-restricted covariate effects. Bayesian Analysis, 11(2), 381–402.
Muthen B, Masyn K (2005) Discrete-time survival mixture analysis. Journal of Educational and
Behavioral Statistics, 30, 27–58.
Namboodiri K, Suchindran C (1987) Life Table Techniques and Their Applications. Academic Press, New
York.
Nardi A, Schemper M (2003) Comparing Cox and parametric models in clinical studies. Statistics in
Medicine, 22(23), 3597–3610.
Neves C, Migon H (2007) Bayesian graduation of mortality rates: An application to reserve evalua-
tion. Insurance: Mathematics and Economics, 40, 424–434.
Omori Y (2003) Discrete duration model having autoregressive random effects with application to
Japanese diffusion index. Journal of the Japan Statistical Society, 33, 1–22.
Orbe J, Núñez-Antón V (2006) Alternative approaches to study lifetime data under different sce-
narios: From the PH to the modified semiparametric AFT model. Computational Statistics & Data
Analysis, 50, 1565–1582.
Survival and Event History Models 523

Perperoglou A, van Houwelingen H, Henderson R (2006) A relaxation of the Gamma frailty (Burr)
model 2006. Statistics in Medicine, 25, 4253–4266.
Phadia E (2015) Prior Processes and Their Applications; Nonparametric Bayesian Estimation. Springer.
Pickles A, Crouchley R (1995) A comparison of frailty models for multivariate survival data. Statistics
in Medicine, 14, 1447–1461.
Pintilie M (2006) Competing Risks: A Practical Perspective. John Wiley, West Sussex, UK.
Putter H. (2018) Tutorial in biostatistics: Competing risks and multi-state models Analyses using
the mstate package. Leiden University Medical Center, Department of Medical Statistics and
Bioinformatics. https​://cr​an.r-​proje​ct.or​g/web​/pack​ages/​mstat​e/vig​nette​s/Tut​orial​.pdf
Richards H, Barry R (1998) U.S. Life tables for 1990 by sex, race, and education. Journal of Forensic
Economics, 11, 9–26.
Rivas-López M, López-Fidalgo J, Campo R (2014) Optimal experimental designs for accelerated fail-
ure time with Type I and random censoring. Biometrical Journal, 56(5), 819–837.
Roeder K, Wasserman L (1997) Practical Bayesian density estimation using mixtures of normals.
Journal of the American Statistical Association, 92(439), 894–902.
Sahu S, Dey D (2000) A comparison of frailty and other models for bivariate survival dataata. Lifetime
Data Analysis, 6, 207–228.
Sahu S, Dey D (2004) On a Bayesian multivariate survival model with skewed frailty, pp 321–338, in
Skew-Elliptical Distributions and Their Applications: A Journey Beyond Normality, eds M Genton.
CRC/Chapman & Hall, Boca Raton, FL.
Sahu S, Dey D, Aslanidou H, Sinha D (1997) A Weibull regression model with gamma frailties for
multivariate survival data. Lifetime Data Analysis, 3, 123–137.
Sargent DJ (1998) A general framework for random effects survival analysis in the Cox proportional
hazards setting. Biometrics, 54(4), 1486–1497.
Scheike T, Zhang M (2011) Analyzing competing risk data using the R timereg package. Journal of
Statistical Software, 38(2), i02.
Schmidt P, Witte A (1989) Predicting criminal recidivism using ‘split population’ survival time mod-
els. Journal of Econometrics, 40, 141–159.
Schoen R (2016) The continuing retreat of marriage: Figures from marital status life tables for United
States females, 2000–2005 and 2005–2010, pp 203–215, in Dynamic Demographic Analysis, ed R
Schoen. Springer.
Scrucca L, Santucci A, Aversa F (2010) Regression modeling of competing risk using R: An in depth
guide for clinicians. Bone Marrow Transplantation, 45(9), 1388–1395.
Sen A, Banerjee M, Li Y, Noone A (2010) A Bayesian approach to competing risks analysis with
masked cause of death. Statistics in Medicine, 29(16), 1681–1695.
Shao Q, Zhou X (2004) A new parametric model for survival data with long-term survivors. Statistics
in Medicine, 23, 3525–3543.
Singer JD, Willett JB (1993) It’s about time: Using discrete-time survival analysis to study duration
and the timing of events. Journal of Educational Statistics, 18(2), 155–195.
Sinha D (1993) Semiparametric Bayesian analysis of multiple event time data. Journal of the American
Statistical Association, 88, 979–983.
Sinha D, Chen M-H, Ghosh S (1999) Bayesian analysis and model selection for interval-censored
survival data. Biometrics, 55, 585–590.
Sinha D, Dey DK (1997) Semiparametric Bayesian analysis of survival data. Journal of the American
Statistical Association, 92(439), 1195–1212.
Sinha D, Patra K, Dey DK (2003) Modelling accelerated life test data by using a Bayesian approach.
Journal of the Royal Statistical Society: Series C (Applied Statistics), 52(2), 249–259.
Sohn Y, Chang I, Moon T (2007) Random effects Weibull regression model for occupational lifetime.
European Journal of Operational Research, 179, 124–131.
Stacy EW (1962) A generalization of the gamma distribution. The Annals of Mathematical Statistics,
33(3), 1187–1192.
Swindell W (2009) Accelerated failure time models provide a useful statistical framework for aging
research. Experimental Gerontology, 44(3), 190–200.
524 Bayesian Hierarchical Models

Thamrin S, McGree J, Mengersen K (2013) Bayesian Weibull survival model for gene expression
data, Chapter 10, pp 171–185, in Case Studies in Bayesian Statistical Modelling and Analysis, eds C
Alston, K Mengersen, A Pettitt. Wiley.
Tosch TJ, Holmes PT (1980) A bivariate failure model. Journal of the American Statistical Association,
75(370), 415–417.
Tsodikov A, Ibrahim J, Yakovlev A (2003) Estimating cure rates from survival data: An alternative
to two-component mixture models. Journal of the American Statistical Association, 98, 1063–1078.
Umlauf N, Klein N, Zeileis A (2018) BAMLSS: Bayesian additive models for location, scale, and shape
(and beyond). Journal of Computational and Graphical Statistics, 27(3), 612–627.
Van den Berg G (2001) Duration models: Specification, identification, and multiple durations, in
Handbook of Econometrics 5, eds J Heckman, E Leamer. North Holland, Amsterdam, Netherlands.
Viswanathan B, Manatunga A (2001) Diagnostic plots for assessing the frailty distribution in multi-
variate survival data. Lifetime Data Analysis, 7, 143–155.
Walker S, Mallick B (1999) A Bayesian semiparametric accelerated failure time model. Biometrics, 55,
477–483.
Watson T, Christian C, Mason A, Smith M, Meyer R (2002) Bayesian-based decision support sys-
tem for water distribution systems. 5th International Conference on Hydroinformatics, Cardiff
University, UK.
Wei L (1992) The accelerated failure time model: A useful alternative to the Cox regression model in
survival analysis. Statistics in Medicine, 11, 1871–1879.
Wienke A (2010) Frailty Models in Survival Analysis. Chapman and Hall/CRC.
Winkelmann R, Boes S (2005) Analysis of Microdata. Springer-Verlag.
Wong O (1977) A competing-risk model based on the life table procedure in epidemiologic studiesIn-
ternational Journal of Epidemiology, 6, 153–159.
Yashin A, Iachine I, Begun A, Vaupel J (2001) Hidden frailty: Myths and reality. Research Report 34,
Department of Statistics and Demography, SDU - Odense University.
Yau KK (2001) Multilevel models for survival analysis with random effects. Biometrics, 57(1), 96–102.
Yin G (2005) Bayesian cure rate frailty models with application to a root canal therapy study.
Biometrics, 61, 552–558.
Yin G, Ibrahim J (2005) A class of Bayesian shared gamma frailty models with multivariate failure
time data. Biometrics, 61, 208–216.
Yin G, Ibrahim J (2006) Bayesian transformation hazard models, pp 170–182, in IMS Monograph Series,
Vol. 49. Institute of Mathematical Statistics.
Zhang Z, Sinha S, Maiti T, Shipp E (2018) Bayesian variable selection in the accelerated failure time
model with an application to the surveillance, epidemiology, and end results breast cancer
data. Statistical Methods in Medical Research, 27(4), 971–990.
Zhou H, Hanson T, Zhang J (2017) Generalized accelerated failure time spatial frailty model for arbi-
trarily censored data. Lifetime Data Analysis, 23(3), 495–515.
Zhou H, Hanson T, Zhang J (2018) spBayesSurv: Fitting Bayesian spatial survival models using R.
https://arxiv.org/abs/1705.04584
12
Hierarchical Methods for Nonlinear
and Quantile Regression

12.1 Introduction
Standard versions of the normal linear model and general linear models assume additive
and linear predictor effects in the regression mean, and a constant variance. While lin-
ear regression effects are often suitable, nonlinear predictor effects are common in areas
as diverse as economics, hydrology (Qian et al., 2005), and epidemiology (Natario and
Knorr-Held, 2003). In some applications, there may be a theoretical basis for a particular
form of nonlinearity, though some elements of specification will be uncertain – see Borsuk
and Stow (2000) on biochemical oxygen demand, and Meyer and Millar (1998) on mod-
els of fishery stock. In other situations, the form of nonlinearity is unknown and to be
assessed from the data – hence the term “non-parametric”, since a particular form for the
mean function is not assumed. Bayesian application of non-parametric smooth regression
is facilitated by R tools such as jagam in mgcv (Wood, 2016) (www.rdocumentation.org/packages/
mgcv/versions/1.8-17/topics/jagam), bamlss (Umlauf et al., 2016; https://rdrr.io/rforge/
bamlss/), gammSlice (Pham and Wand, 2015), stan_gamm4 within rstanarm
(https://cran.rstudio.com/web/packages/rstanarm/index.html), and spikeSlabGAM (Scheipl, 2011).
In many applications, a nonlinear effect is present, or suspected, in only a subset of
predictors, leading to partially linear models or semiparametric regression models. Consider
outcomes $\{y_i, i = 1, \ldots, n\}$ from an exponential family density

$$p(y_i \mid \theta_i, \phi) = \exp\left(\frac{y_i \theta_i - a(\theta_i)}{\phi} + c(y_i, \phi)\right),$$

with $E(y_i) = \mu_i = a'(\theta_i)$, and link $g(\mu_i) = \eta_i$ to a regression term $\eta_i$. Suppose it is intended that
R metric predictors $W_i = (w_{1i}, w_{2i}, \ldots, w_{Ri})$ be modelled non-parametrically via unknown
smooth functions $S_r(w_{ri})$, then

$$g(\mu_i) = \eta_i = \alpha + X_i \beta + S_1(w_{1i}) + \ldots + S_R(w_{Ri}) + u_i,$$

$$u_i \sim N(0, \sigma^2).$$

For instance, Engle et al. (1986) analyse the relationship between temperature and monthly
electricity sales (y metric and u normal, and with g an identity link) for four US cities.
The impact of electricity price, month (11 dummy variables), and income is modelled
parametrically, but an unknown smooth function is adopted to model the impact of
monthly temperature.
Residual errors ui will be present when yi is metric, and may also be present for over-
dispersed discrete outcomes. While an assumption of independent errors with constant
variance is standard, non-parametric regression for the regression mean may be extended
to modelling heteroscedastic errors (Yau and Kohn, 2003; Krivobokova et al., 2008), or
to other distributional features (Mayr et al., 2012). When the observations are recorded
through time or over space, it may also be important to control for correlations in the $u_i$.
Smith et al. (1998) and Kohn et al. (2000) consider the case when the observations yt are
arranged in time, smooth functions are used for predictor effects, and the ut are autocor-
related. The estimate of the predictor smooth functions will be adversely affected if inde-
pendent residuals are incorrectly assumed.
The two major forms of non-parametric regression involve basis functions (e.g. polyno-
mial spline methods) and general additive methods based on smoothness priors. These
are considered in Sections 12.2 and 12.5 respectively. Extending non-parametric regression
to multiple predictors raises the same issues as multiple linear regression, for example,
whether interactions are necessary and how the presence of smooths for other predictors
alters the smooth for a given predictor – see Section 12.3. Robustness in non-paramet-
ric regression (e.g. to heteroscedastic errors) may be obtained through spatially adaptive
methods which allow the level of smoothness to vary over the space of the covariates
(Wood et al., 2002; Baladandayuthapani et al., 2005) – see Section 12.4. A major application
area for non-parametric regression is in longitudinal settings, as discussed in Section 12.6.

12.2 Non-Parametric Basis Function Models for the Regression Mean


A wide range of methods for non-parametric regression in one or more predictors typically
assume linear combinations of basis functions $S_r(w_r)$ of predictors $(w_1, \ldots, w_R)$. Numerous
basis functions can be used, including truncated polynomial functions, B-spline functions,
radial basis functions (Yau et al., 2003), logistic functions (Hooper, 2001), trigonometric
basis functions, and wavelets (Denison et al., 2002). For exponential family responses
$y_i$ with mean $\mu_i$ and link g, a truncated polynomial spline (or piecewise polynomial spline)
regression on a single predictor $w_i$ has the form (Denison et al., 2002, p.52)

$$g(\mu_i) = \alpha + S(w_i) + u_i = \alpha + \sum_{k=1}^{K} b_k (w_i - \kappa_k)_+^q + u_i, \qquad (12.1)$$

$$u_i \sim N(0, \sigma^2),$$

where q is a known positive integer, and the κk are knots placed within the range [wmin,wmax]
of w. In (12.1), the piecewise polynomials are fitted in each interval [κk,κk+1) and preferably
join smoothly at each knot (e.g. this applies for a cubic spline, as it has continuous 1st and
2nd derivatives at each knot).
An alternative spline specification (e.g. Meyer, 2005; Tutz and Reithinger, 2007,
p.2877) matches the degree q of the truncated function $T(w_i) = \sum_{k=1}^{K} b_k (w_i - \kappa_k)_+^q$ by a
standard polynomial of order q, namely $Q(w_i) = \beta_1 w_i + \ldots + \beta_q w_i^q$. So, the total smooth is
$S(w_i) = Q(w_i) + T(w_i)$, and one has

$$g(\mu_i) = \alpha + \beta_1 w_i + \ldots + \beta_q w_i^q + \sum_{k=1}^{K} b_k (w_i - \kappa_k)_+^q + u_i. \qquad (12.2)$$

Values q = 1, 2, or 3 are most typical, with q = 1 often being suitable for reproducing a smooth
function given a large enough set of knots (Ruppert et al., 2003, p.68), but also capable of
reproducing abrupt changes in the underlying function (Denison et al., 2002, p.52).
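
As a minimal sketch of how the design columns in (12.2) might be built in base R, the function below forms the polynomial part Q(w) and the truncated part T(w) for given knots. The function name trunc_power_basis, the list element names, and the toy predictor values are illustrative assumptions rather than code from the book's examples.

```r
# Truncated power basis for a single predictor w, as in equation (12.2).
# All names here are hypothetical and chosen for illustration only.
trunc_power_basis <- function(w, knots, q = 1) {
  # Polynomial part Q(w): columns w, w^2, ..., w^q
  poly_part <- outer(w, seq_len(q), FUN = "^")
  # Truncated part T(w): one column (w - kappa_k)_+^q per knot
  trunc_part <- outer(w, knots, FUN = function(x, k) pmax(x - k, 0)^q)
  list(poly = poly_part, trunc = trunc_part)
}

# Toy predictor and two knots, linear spline (q = 1)
w <- c(0, 1, 2, 3, 4)
basis <- trunc_power_basis(w, knots = c(1, 3), q = 1)
basis$trunc
```

Each column of basis$trunc is zero to the left of its knot and rises linearly to the right, so a linear combination of the columns changes slope only at the knots.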
The knots in (12.1) and (12.2) may be known or unknown. If known, their number is
typically much smaller than the sample size. They could be sited at percentile points
(e.g. deciles) of w, or possibly placed more densely at points where the function is known
to be rapidly changing and less densely elsewhere. Choosing too few knots can result in
oversmoothing, and choosing too many in overfitting – see the LIDAR data examples dis-
cussed by Ruppert et al. (2003, p.63). Coull et al. (2001, p.540) suggest the allocation of one
knot for every four to five observations, up to a maximum of about 40 knots. Yau and Kohn
(2003) suggest fitting a model with a small number of knots first and gradually increasing
their number until estimates and fit stabilise. An alternative procedure known as smooth-
ing splines places a knot at every observed distinct predictor value (Berry et al., 2002; Dias
and Gamerman, 2002). The most general model averaging approach takes both the number
of knots and their sitings as unknowns, while both Denison et al. (1998) and Biller (2000)
assume a large number of potential, but prespecified, candidate knot locations. If knots are
taken to have unknown locations within $[w_{\min}, w_{\max}]$, identification may rely on order
constraints such as $\kappa_k > \kappa_{k-1}$, and analysis resembles time series with multiple change points.
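
For the case of known knots, the placement rules mentioned above can be sketched in base R as follows. The simulated predictor, the helper name n_knots, and the reading of the Coull et al. guideline as "one knot per four observations, capped at 40" are assumptions made for illustration.

```r
# Illustrative simulated predictor (an assumption, not data from the text)
set.seed(1)
w <- runif(200)

# Knots at the deciles of the distinct predictor values: K = 9 interior knots
K <- 9
probs <- seq_len(K) / (K + 1)            # 0.1, 0.2, ..., 0.9
knots <- unname(quantile(unique(w), probs = probs))

# One possible reading of the "one knot per four to five observations,
# up to about 40" guideline (hypothetical helper)
n_knots <- function(n) min(floor(n / 4), 40)
n_knots(length(unique(w)))
```

Quantile-based placement puts more knots where the predictor values are dense, which is one simple way of keeping every interval between knots populated with data.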
If the bk in (12.1)–(12.3) are modelled as fixed effects, predictor coefficient selection is open
as a way of achieving model parsimony, and is especially indicated under the smoothing
spline method (Smith and Kohn, 1996). With a large number of preset potential knot
sitings, predictor selection involves obtaining posterior probabilities $\Pr(\delta_{jk} = 1 \mid y)$ on binary
indicator variables $\delta_{1k}$ (k = 1, …, q) for retaining coefficients in the Q(w) component, and
$\delta_{2k}$ (k = 1, …, K) in the T(w) component. One then has

$$g(\mu_i) = \alpha + \delta_{11}\beta_1 w_i + \ldots + \delta_{1q}\beta_q w_i^q + \sum_{k=1}^{K} \delta_{2k} b_k (w_i - \kappa_k)_+^q + u_i, \qquad (12.3)$$

with coefficients estimated by means of the products $\delta_{1j}\beta_j$ and $\delta_{2k}b_k$.

12.2.1 Mixed Model Splines


In contrast to the fixed effects models (12.1)–(12.3), under a mixed model spline regres-
sion, or penalised spline method, the coefficients in Q(wi) usually remain fixed effects, but
those in T(wi) follow a penalising random effects or P-spline prior (Brumback et al., 1999;
Ruppert et al., 2003; Wand, 2003; Yue et al., 2012). Under a P-spline approach the problem of
choosing the number and position of the knots is alleviated since, provided enough knots are used, the penalty function should ensure that the resulting fits are very similar (Currie and Durban, 2002). Under the P-spline approach, (12.1) and (12.2) become

$$g(\mu_i) = \alpha + S(w_i) = \alpha + \sum_{k=1}^{K} b_k (w_i - \kappa_k)_+^q + u_i, \tag{12.4}$$

$$g(\mu_i) = \alpha + \beta_1 w_i + \ldots + \beta_q w_i^q + \sum_{k=1}^{K} b_k (w_i - \kappa_k)_+^q + u_i, \tag{12.5}$$

where bk are a collection of random parameters from a common density with unknown
hyperparameters.
Possible priors for the random bk include an unstructured normal (Ruppert et al., 2003)

$$b_k \sim N(0, \phi), \tag{12.6}$$

which, by comparison with a fixed effects prior, imposes a restriction on the bk when ϕ < ∞,
and tends to shrink the bk, leading to a smooth fit (Wand, 2003). A standard approach (e.g.
Lang and Brezger, 2004) assumes $\phi \sim IG(g, h)$ with $g = 1$ and $h$ small (e.g. 0.001, 0.0001, or
0.00001), though there may be sensitivity to the value of h.
To illustrate equivalence to the broader class of mixed models, define design matrices

$$W = [1, w_i, \ldots, w_i^q]_{1 \le i \le n},$$

$$Z = [(w_i - \kappa_k)_+^q]_{1 \le k \le K,\; 1 \le i \le n},$$

and vectors $\beta = (\alpha, \beta_1, \ldots, \beta_q)'$ and $b = (b_1, \ldots, b_K)'$. Then, under normal error assumptions, model (12.5) can be written in the mixed model form

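$$y = W\beta + Zb + u,$$

$$\begin{bmatrix} b \\ u \end{bmatrix} \sim N\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} \phi I & 0 \\ 0 & \sigma^2 I \end{bmatrix} \right).$$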
Alternative schemes for $b_k$ include a random walk penalty (Eilers and Marx, 1996), such as $\Delta^d b_k \sim N(0, \phi)$. For instance, taking $d = 1$ gives

$$b_k \sim N(b_{k-1}, \phi). \tag{12.7}$$

Another option provides monotonic smooths, in applications where such smooths have a substantive rationale (Brezger and Steiner, 2008), and stipulates $b_k \sim N(0, \phi)$, but subject to

$$b_k \ge b_{k-1}, \quad k = 2, \ldots, K,$$

for an increasing function $T(w_i) = \sum_{k=1}^{K} b_k (w_i - \kappa_k)_+^q$, or to $b_k \le b_{k-1}$ for a decreasing function.
The function S(wi) resulting from a fixed effects prior on {bk} in (12.1)–(12.3) may be quite
rough, due to the large number of truncated polynomials being fitted, whereas the shrink-
age prior under the mixed model approach tends to penalise large coefficients and lead to
a smoother fit (Yau et al., 2003; Ngo and Wand, 2004; Meyer, 2005). Under an unstructured
prior $b_k \sim N(0, \phi)$, and smoothing or penalty parameter λ, the mode of the posterior density of {β,b,ϕ} is the same as that obtained by maximising a penalised likelihood

$$PL = \log[P(y \mid \beta, b, \phi)] - \lambda \sum_{k=1}^{K} b_k^2,$$

where the form of λ (in terms of variance parameters) depends on whether or not there
is an unstructured residual term ui in the regression model. For a metric outcome and
$u_i \sim N(0, \sigma^2)$, one has $\lambda = \sigma^2/\phi$ (Fahrmeir and Knorr-Held, 2000). This penalised likelihood
is analogous to “ridge” penalties sometimes used with correlated predictors (Eilers and
Marx, 2004). For random walk priors of order d, one has (Lang and Brezger, 2004)

$$PL = \log[P(y \mid \beta, b, \phi)] - \lambda \sum_{k=d+1}^{K} (\Delta^d b_k)^2.$$
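As a rough illustration of the penalised likelihood idea (not the book's own code), the following R sketch fits a linear truncated-power spline to simulated data by penalised least squares. For a normal response, maximising the penalised likelihood is equivalent to minimising the residual sum of squares plus $\lambda \sum_k b_k^2$ with $\lambda = \sigma^2/\phi$. The simulated data, the knot choice, and the helper name `pspline_fit` are assumptions made for the example.

```r
# Penalised (ridge-type) fit of a linear truncated-power spline.
# All data below are simulated purely for illustration.
set.seed(1)
n <- 200
w <- sort(runif(n))
y <- sin(2 * pi * w) + rnorm(n, sd = 0.3)

q <- 1                                        # spline degree
K <- 10                                       # number of knots
kappa <- quantile(w, probs = (1:K) / (K + 1), names = FALSE)

W <- outer(w, 0:q, `^`)                       # fixed part: 1, w, ..., w^q
Z <- pmax(outer(w, kappa, `-`), 0)^q          # truncated terms (w - kappa_k)_+^q
C <- cbind(W, Z)

# Penalise only the b_k (columns of Z), not the polynomial coefficients
D <- diag(c(rep(0, q + 1), rep(1, K)))

# Hypothetical helper: solves (C'C + lambda * D) theta = C'y
pspline_fit <- function(lambda) {
  coefs <- solve(crossprod(C) + lambda * D, crossprod(C, y))
  list(beta = coefs[1:(q + 1)],
       b = coefs[-(1:(q + 1))],
       fitted = as.vector(C %*% coefs))
}

fit_rough  <- pspline_fit(lambda = 1e-4)      # close to the fixed effects fit
fit_smooth <- pspline_fit(lambda = 10)        # stronger shrinkage of the b_k
c(rough = sum(fit_rough$b^2), smooth = sum(fit_smooth$b^2))
```

Larger values of `lambda` (a smaller prior variance ϕ relative to σ²) shrink the truncated-line coefficients and so give a smoother fitted curve.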

12.2.2 Basis Functions Other Than Truncated Polynomials


The fixed or random coefficient approaches can equally be applied with other basis func-
tions to represent T(wi). Truncated polynomial basis functions span the space of degree q
polynomials with knots located at $\kappa_1, \ldots, \kappa_K$ (Friedman, 1991). This property also holds for radial basis functions (Koop et al., 2003, p.252) based on distances $r_{ik} = |w_i - \kappa_k|$, such as the polyharmonic spline $T(w_i) = \sum_{k=1}^{K} H(r_{ik})$, with

$$H(r_{ik}) = r_{ik}^q, \quad q = 1, 3, 5, \ldots$$

$$H(r_{ik}) = r_{ik}^q \log(r_{ik}), \quad q = 2, 4, 6, \ldots$$

of which the thin-plate spline (Kohn et al., 2001; Koop and Tole, 2004)

$$H(r_{ik}) = r_{ik}^2 \log(r_{ik})$$

is a special case. These are examples of functions which are radially symmetric around
knots κk, such that the value of the function at wi depends only on the distance between wi
and the knot location. They have the form $H(u) = H(\lVert w - \kappa_k \rVert)$, where $\lVert v \rVert = \sqrt{v'v}$ is the length of the vector v. Other types of radial basis include Gaussian functions (Konishi et al., 2004) with

$$H_k(w_i) = \exp\left( -\frac{\lVert w_i - \kappa_k \rVert^2}{2\nu h_k^2} \right),$$
where ν is the same over different knots. As for truncated splines, smoothing based on
radial basis functions may include a parametric polynomial term to degree q to match the
degree of the radial function, for example, with q = 1

$$g(\mu_i) = \alpha + \beta_1 w_i + \sum_{k=1}^{K} b_k |w_i - \kappa_k| + u_i.$$

Both radial and truncated power splines may be ill-conditioned in terms of broader regres-
sion considerations (Eilers and Marx, 2004). An alternative basis less prone to ill-condition-
ing is provided by B-splines, with health mapping applications exemplified by Silva et al.
(2008) and MacNab and Gustafson (2007). B-splines are defined to be non-zero for at most
q + 2 interior knots for a qth degree B-spline (also called a B-spline of order q + 1), which
means the condition number of the design matrix product is relatively low (Eilers and
Marx, 1996, p.90; Biller, 2000; Denison et al., 2002, p.75). A B-spline of degree q consists
of q + 1 polynomial pieces of degree q and overlaps with 2q of its neighbours. For K knots,
and so K + 1 intervals, in the domain [wmin, wmax] of a predictor, there will be K* = K + 1 + q
B-spline schedules, because extra knots are placed outside the domain of w to get q over-
lapping B-splines in each interval.
Let $B_k(w_i, q)$ be the value at $w_i$ of the kth B-spline of degree q, with $k = 1, \ldots, K^*$. Successive B-spline values are defined by the recursion

$$B_k(w_i, 0) = I(\kappa_k \le w_i < \kappa_{k+1}),$$

$$B_k(w_i, q) = \frac{w_i - \kappa_k}{\kappa_{k+q} - \kappa_k} B_k(w_i, q-1) + \frac{\kappa_{k+q+1} - w_i}{\kappa_{k+q+1} - \kappa_{k+1}} B_{k+1}(w_i, q-1).$$
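The recursion can be checked numerically. The sketch below is an illustrative R implementation (the function name `bspline_rec` and the knot sequence are assumptions, not taken from the book's code), compared against `splineDesign` from the splines package. The knot sequence is extended beyond the domain [0, 1], so that the 9 interior knots give K* = 9 + 1 + 3 = 13 cubic B-splines.

```r
library(splines)

# Recursive evaluation of the k-th B-spline of degree q at the points w;
# `knots` is the full (extended) non-decreasing knot sequence.
bspline_rec <- function(w, k, q, knots) {
  if (q == 0) {
    return(as.numeric(knots[k] <= w & w < knots[k + 1]))
  }
  d1 <- knots[k + q] - knots[k]
  d2 <- knots[k + q + 1] - knots[k + 1]
  left <- if (d1 > 0) {
    (w - knots[k]) / d1 * bspline_rec(w, k, q - 1, knots)
  } else 0
  right <- if (d2 > 0) {
    (knots[k + q + 1] - w) / d2 * bspline_rec(w, k + 1, q - 1, knots)
  } else 0
  left + right
}

q <- 3
knots <- seq(-0.3, 1.3, by = 0.1)          # equally spaced, extending beyond [0, 1]
w <- seq(0.01, 0.99, length.out = 50)
n_basis <- length(knots) - q - 1           # number of B-splines, K*

B_rec <- sapply(seq_len(n_basis), function(k) bspline_rec(w, k, q, knots))
B_ref <- splineDesign(knots, w, ord = q + 1)

max(abs(B_rec - B_ref))                    # agreement with the library routine
range(rowSums(B_rec))                      # B-splines sum to one inside the domain
```

In practice the library routines (`splineDesign`, `bs`) would normally be preferred; the explicit recursion is shown only to make the definition concrete.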

The initial terms in the recursion are simply binary indicators defining a partition of the
w values. For equally spaced knots, a simplified B-spline recursion applies involving dif-
ferences in truncated power splines (Eilers and Marx, 2004). B-spline bases for T(w) can
be combined with random or fixed effects priors for the spline coefficients. For example,
random bk in an analysis with a single predictor wi leads to

$$g(\mu_i) = \alpha + \beta_1 w_i + \ldots + \beta_q w_i^q + \sum_{k=1}^{K^*} b_k B_k(w_i, q) + u_i.$$

In particular, Eilers and Marx (1996) combine a B-spline basis with a penalty on dth order
differences in adjacent bk coefficients. As mentioned above, difference penalties can be
achieved by random walk priors under a Bayesian approach (e.g. a second order random
walk prior if d = 2).
Relatively small numbers of knots may be needed to provide an effective smooth, as may be illustrated by drawing on the Stan case study of Kharratzadeh (2017), with B-spline schedules defined either by a function, or by using the package splines. Consider the Boston data set (in the R MASS package), and the prediction of median house values from the percentage of the population of lower status (lstat) (Figure 12.1). A B-spline of degree 3 is used, with a random walk prior on the coefficients $\{b_k\}$ of the B-spline basis,

$$\begin{aligned} b_1 &\sim N(0, 5), \\ b_k &\sim N(b_{k-1}, \tau_b), \\ \tau_b &\sim C^+(0, 5). \end{aligned}$$

There are 508 observations. For K = 10 knots located at corresponding quantiles of lstat, we
find a LOO-IC (leave-one-out information criterion) of 3117 (Figure 12.2), while K = 5 knots
also gives a LOO-IC of 3117. By contrast, a larger number of knots, K = 20, shows evidence
of undersmoothing (overfitting) with a LOO-IC of 3122.
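A minimal sketch of how such a basis might be set up in R is given below. It is not the code of the Stan case study: the exact knot placement (here K equally spaced interior quantiles), the value of `lambda`, and the penalised least-squares fit standing in for the random walk prior are all assumptions made for illustration.

```r
library(MASS)      # Boston data
library(splines)   # bs()

K <- 10            # number of interior knots
q <- 3             # cubic B-spline
knots <- quantile(Boston$lstat, probs = (1:K) / (K + 1), names = FALSE)

B_bs <- bs(Boston$lstat, knots = knots, degree = q, intercept = TRUE)
B <- matrix(as.numeric(B_bs), nrow = nrow(B_bs))   # plain numeric matrix
dim(B)                                             # n rows, K + 1 + q columns

# First-difference matrix: t(D1) %*% D1 is the penalty implied by a
# first-order random walk on the b_k
D1 <- diff(diag(ncol(B)), differences = 1)

# Penalised least-squares analogue of the random walk prior
# (lambda is an arbitrary illustrative value)
lambda <- 1
b_hat <- solve(crossprod(B) + lambda * crossprod(D1),
               crossprod(B, Boston$medv))
smooth <- as.vector(B %*% b_hat)
```

A fully Bayesian fit would instead treat the random walk variance as unknown, as in the prior specification above.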
FIGURE 12.1
Median house values and status (House Value plotted against Lower Status).

FIGURE 12.2
Smooth for K = 10 knots (Median House Value plotted against Lower Status).

Bayesian application of spectral basis functions is discussed by Lenk (1999), Fahrmeir and Tutz (2001, Chapter 5), and Kitagawa and Gersch (1996). Here the smooth may be represented by the series

$$T(w_i) = \sum_{k=1}^{\infty} b_k H_k(w_i),$$

with $H_k$ including sine and/or cosine terms. Setting $z_i = (w_i - w_{\min})/(w_{\max} - w_{\min})$ and including only cosine terms in $H_k$, as in Lenk (1999), gives

$$H_k(w_i) = \left( \frac{2}{w_{\max} - w_{\min}} \right)^{0.5} \cos(\pi k z_i).$$

Since a smooth T will not have high frequency components, a natural prior on the bk
expresses decay as k increases (penalises terms at higher k values) as in

$$b_k \sim N(0, \phi \exp[-\delta c_k]),$$

where $c_k$ can be taken as a known increasing function of k, and δ determines the rate of decay of the Fourier coefficients. Possibilities are $c_k = \log(k)$ with δ > 1, and $c_k = k$ with δ > 0. An alternative is a power function such as

$$b_k \sim N(0, \phi \delta^k),$$

where $\delta \in (0, 1)$. For practical application, the Fourier series is truncated above at K, namely

$$T(w_i) = \sum_{k=1}^{K} b_k H_k(w_i),$$

where K can be regarded as another parameter (cf. Ruppert et al., 2003, p.86).
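The cosine basis and the decaying prior variances can be sketched in R as follows. The function names `cosine_basis` and `decay_variance`, the domain [0, 10], and the values of ϕ and δ are hypothetical choices for illustration. The Gram matrix check uses a midpoint-rule approximation to the integrals of products of basis functions over the domain.

```r
# Cosine basis H_k(w) = sqrt(2 / (wmax - wmin)) * cos(pi * k * z),
# with z = (w - wmin) / (wmax - wmin)
cosine_basis <- function(w, K, wmin = min(w), wmax = max(w)) {
  z <- (w - wmin) / (wmax - wmin)
  sqrt(2 / (wmax - wmin)) * cos(pi * outer(z, 1:K))
}

# Prior variances phi * exp(-delta * c_k) with c_k = log(k)
decay_variance <- function(K, phi = 1, delta = 2) {
  phi * exp(-delta * log(1:K))
}

# Approximate orthonormality on a fine midpoint grid over [0, 10]
wmin <- 0; wmax <- 10; n <- 2000
grid <- wmin + (wmax - wmin) * ((1:n) - 0.5) / n
H <- cosine_basis(grid, K = 8, wmin = wmin, wmax = wmax)
gram <- crossprod(H) * (wmax - wmin) / n
round(gram, 6)

# One draw of a truncated-series smooth from the prior
set.seed(2)
b <- rnorm(8, mean = 0, sd = sqrt(decay_variance(8)))
T_w <- as.vector(H %*% b)
```

Because the prior variances fall with k, high-frequency terms contribute little to a typical draw, which is what produces a smooth T.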

12.2.3 Model Selection
Non-parametric regressions are often heavily parameterised and parameter redundancies
are likely, indicating that selection among predictor effects, including smooths, is neces-
sary (Yau et al., 2003; Belitz and Lang, 2008; Panagiotelis and Smith, 2008; Wood, 2008;
Marra and Wood, 2011; Banerjee and Ghosal, 2014; Gelman et al., 2014). Smooth selection
may be approached using binary indicators $J_r$ (Yau et al., 2003), combined with conventional selection for fixed effect predictor terms. Assume the framework in (12.1), with $r = 1, \ldots, R$ predictor effects represented as smooths $S_r(w_{ri}) = T_r(w_{ri})$, where $T_r(w_{ri}) = \sum_{j=1}^{K_r} b_{rj} Z_j(w_{ri})$, and $Z_j(w_{ri})$ generically denotes a polynomial or B-spline basis function. With binary selectors $\gamma_j$ for fixed effect predictors $X_i$ of dimension p, the regression term would be

$$g(\mu_i) = \alpha + \gamma_1 \beta_1 x_{1i} + \ldots + \gamma_p \beta_p x_{pi} + J_1 T_1(w_{1i}) + J_2 T_2(w_{2i}) + \ldots + J_R T_R(w_{Ri}) + u_i.$$

Alternatively (e.g. Cottet et al., 2008), one may have both linear and smooth terms for each
wri. Numerical performance may be improved by scaling both the X and W predictors,
e.g. standardisation or transformation to the [0,1] interval (Cottet et al., 2008; Scheipl et al.,
2012).
Selection for retention (Jr = 1) is influenced by the degree of informativeness of the prior
adopted for the variances (ϕ1, …, ϕR) of $(b_{1k}, \ldots, b_{Rk})$. Flat priors will tend to lead to low posterior probabilities $\Pr(J_r = 1 \mid y)$ for retaining random components. One option is to undertake initial runs with diffuse priors to develop an informative data-based prior (Shively et al., 1999).
Related approaches include hierarchical priors (e.g. log-normal) on ϕr (Cottet et al., 2008; Panagiotelis and Smith, 2008) as in

$$\begin{aligned} \log(\phi_r) &\sim N(g_r, h_r), \\ g_r &\sim N(0, 100), \\ h_r &\sim IG(101, 10100), \end{aligned}$$

independent of those for $J_r$. Another option involves a multiplicative reparameterisation combined with spike-slab selection on the variances (Scheipl et al., 2012), namely,

$$\begin{aligned} b_{rk} &= c_{rk} d_{rk}, \\ c_{rk} &\sim N(0, \gamma_r \phi_r), \\ \gamma_r &= h_r + \varepsilon (1 - h_r), \\ h_r &\sim \text{Bern}(\omega_r), \\ d_{rk} &\sim N(m, 1), \end{aligned}$$

where the γr are scaling factors, ε is a predefined small constant, and the mean m is set to −1
or 1 with equal probability. So for hr = 0 the random effects are effectively excluded, since
their variance is near zero. Assuming $\phi_r \sim IG(a_\phi, b_\phi)$, then $\{a_\phi, b_\phi\}$ may be set by default, or
set in line with subject matter considerations. Default settings of ε = 0.00025, aϕ = 5, and
bϕ = 25 are proposed by Scheipl et al. (2012, p.1525).
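The effect of the indicator on the implied coefficients can be seen by simulating from this prior. The R sketch below is illustrative only: the function name `draw_b` is hypothetical, the default settings are those quoted above, and drawing the sign of m separately for each coefficient is an assumption about how the "−1 or 1 with equal probability" rule is applied.

```r
# Draw spline coefficients b_rk = c_rk * d_rk for one smooth term r,
# conditional on the inclusion indicator h (0 = spike, 1 = slab).
draw_b <- function(K, h, eps = 0.00025, a_phi = 5, b_phi = 25) {
  phi  <- 1 / rgamma(1, shape = a_phi, rate = b_phi)   # phi_r ~ IG(a_phi, b_phi)
  gam  <- h + eps * (1 - h)                            # scaling factor gamma_r
  c_rk <- rnorm(K, mean = 0, sd = sqrt(gam * phi))
  m    <- sample(c(-1, 1), size = K, replace = TRUE)   # assumed: one sign per coefficient
  d_rk <- rnorm(K, mean = m, sd = 1)
  c_rk * d_rk
}

set.seed(3)
b_spike <- replicate(500, draw_b(K = 20, h = 0))
b_slab  <- replicate(500, draw_b(K = 20, h = 1))
c(spike = sd(b_spike), slab = sd(b_slab))
```

With h = 0 the prior standard deviation of the coefficients is scaled by the square root of ε, so the smooth is effectively removed from the model.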

Example 12.1 Beta-Carotene Plasma


This example concerns a continuous outcome, the dependence of blood plasma concen-
trations of beta-carotene on regulatory factors such as age, gender, vitamin use, dietary
intake, smoking status, alcohol intake, cholesterol intake, etc. (details at http://lib.stat.cmu.edu/datasets/Plasma_Retinol). There are 315 subjects. Here impacts of p = 11 predictors (excluding age and cholesterol) are modelled via linear terms, and the impacts
of age and cholesterol intake as smooth terms, S(AGE) and S(CHOL) (Liu et al., 2011; cf.
Banerjee and Ghosal, 2014). The initial two analyses compare a linear truncated spline
with a cubic B-spline ($q = 3$ in $B_k(w_i, q)$). These two analyses use K = 19 knots sited at the
5th, 10th, …, 95th percentiles of AGE and CHOL (which are untransformed).
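Preliminary coding in R to obtain knots and B-spline values in cholesterol is as follows:

# read the data and make its columns (e.g. chol) directly accessible
D <- read.table("betacarotene.txt", header = TRUE)
attach(D)
library(splines)
# K = 19 knots at the 5th, 10th, ..., 95th percentiles of chol
knots <- quantile(chol, probs = seq(0.05, 0.95, 0.05))
kap.chol <- as.vector(knots)
# cubic B-spline basis with intercept column (K + 1 + q = 23 columns)
bs.chol <- bs(chol, df = NULL, knots = kap.chol, degree = 3, intercept = TRUE,
              Boundary.knots = range(chol))
# strip the basis attributes to leave a plain numeric matrix
bs.chol <- matrix(as.numeric(bs.chol), nrow = nrow(bs.chol))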

The linear spline model is


$$y_i = \beta_0 + \beta_1 x_{1i} + \ldots + \beta_p x_{pi} + \sum_{k=1}^{19} b_{1k} (\text{AGE}_i - \kappa_{1k})_+ + \sum_{k=1}^{19} b_{2k} (\text{CHOL}_i - \kappa_{2k})_+ + u_i,$$

where

$$u_i \sim N(0, \sigma^2), \quad b_{rk} \sim N(0, \phi_r).$$

The B-spline model is analogous, namely, with K* = 23,

$$y_i = \beta_0 + \beta_1 x_{1i} + \ldots + \beta_p x_{pi} + \sum_{k=1}^{K^*} b_{1k} B_{1k} + \sum_{k=1}^{K^*} b_{2k} B_{2k} + u_i,$$

and assumes a first-order random walk for $b_{rk}$, penalising first differences in the $b_{rk}$. For identification, a corner rather than centring constraint is used, namely $b_{r1} = 0$. For the precisions $1/\sigma^2$ and $\theta_r = 1/\phi_r$, gamma Ga(1,0.001) priors are assumed.
Applying jagsUI to the linear spline and B-spline models (models 1 and 2) shows similar smooths in AGE and CHOL. Fit values favour the linear spline: a LOO-IC of 673, as against 695 for the B-spline, though both fit values have large SE values. However, convergence is achieved much earlier using the B-spline method, and effective sample sizes are larger.
A third analysis involves predictor and smooth selection in the B-spline model, as in

$$y_i = \beta_0 + \gamma_1 \beta_1 x_{1i} + \ldots + \gamma_p \beta_p x_{pi} + J_1 \sum_{k=1}^{K^*} b_{1k} B_{1k} + J_2 \sum_{k=1}^{K^*} b_{2k} B_{2k} + u_i.$$

In this application, all predictors are standardised and the B-spline coefficients accord-
ingly revised. Diffuse priors for βj and brk are likely to lead to an overly parsimonious
model, with few predictors retained. Instead, for the βj, normal N(0,1) priors are adopted
(McElreath, 2016). A data-based prior, based on posterior inferences from model 2, is
adopted for the precisions $\theta_r = 1/\phi_r$ in the prior on the $b_{rk}$ terms. Specifically, gamma
priors Ga(0.265,0.0007) and Ga(0.6815,0.0014) correspond to the posterior mean and vari-
ance of θr in model 2. In model 3, a four-point discrete prior for the θr for CHOL and
AGE is accordingly adopted, using values set by the four quintiles of Ga(0.265,0.0007)
and Ga(0.6815,0.0014) densities. The 13 retention indicators, {γj, Jr}, are assigned Bernoulli
priors, with the prior probability ω being an overarching complexity hyperparameter,
with prior ω ~ Be(1,1).
From a two-chain run of 10,000 iterations, vitamin use (vituse) and BMI among the X
predictors have posterior retention probabilities of 0.88 and 1.00 respectively (the indi-
cators J[1] and J[3] in the code), but otherwise retention probabilities are below 0.7. CHOL
and AGE have respective retention probabilities of 1 and 0.99 (J[12] and J[13] in the code).
Figure 12.3 shows the corresponding smooths. Retention of both smooth terms is also
reported by Liu et al. (2011).
FIGURE 12.3
(a) Beta-carotene smooth in CHOL. (b) Beta-carotene smooth in AGE. (Each panel shows Predicted Y against the predictor, with the posterior mean and a credible interval, CI_05 to CI_95.)

A fourth analysis also involves selection, but using a partial adaptation of Scheipl et al. (2012). Thus, it is assumed that

$$\begin{aligned} b_{rk} &\sim N(0, \gamma_r \phi_r), \\ \gamma_r &= h_r + \varepsilon (1 - h_r), \\ h_r &\sim \text{Bern}(\omega_r), \\ 1/\phi_r &\sim Ga(a_\phi, b_\phi), \end{aligned}$$

with ε = 0.00025, aϕ = 5, and bϕ = 25. Alternative settings $(a_\phi, b_\phi) = (5, 50)$ and $(a_\phi, b_\phi) = (10, 30)$
were also investigated. All three settings gave retention probabilities of 1 for the smooths
in both AGE and CHOL (J[12] and J[13] in the code). The first setting gives a LOO-IC of
678. Vitamin use (vituse) has posterior retention probabilities between 0.90 and 0.95,
while BMI has a retention rate of 1.00 for all three settings. Retention probabilities for
other predictors are below 0.75.

12.3 Multivariate Basis Function Regression


Generalisation of Bayesian basis function methods to multiple metric predictors follows
three main methodologies. The first involves tensor product truncated polynomial bases,
including the multivariate adaptive regression spline (MARS) method of Friedman (1991);
the second is the generalisation of radial basis methods (e.g. Yau et al., 2003); and the
third is the generalisation of mixed model P-splines (e.g. Durban et al., 2006). The full tensor product approach is a multiplicative generalisation of (12.2) or (12.5), with particular versions discussed by Smith and Kohn (1997, p.1524), Brezger et al. (2005), Chen (1993), Denison et al. (2002, p.104), and Ruppert et al. (2003, p.240). Interactions between categorical and metric predictors in non-parametric regression are considered by Coull et al. (2001)
and Ruppert et al. (2003).
Assume a predictor vector wi = (w1i , … , wRi ) of dimension R, with spline degree q for
all predictors. Omitting the corresponding standard polynomial effects Qr(wr), the tensor
product generalisation of (12.1) for two or more metric predictors involves an analysis of
variance type representation with main and various order interaction effects,
$$\begin{aligned} g(\mu_i) = {} & \alpha + \sum_{r=1}^{R} \sum_{k=1}^{K_r} b_{rk} (w_{ri} - \kappa_{rk})_+^q + \sum_{r \ne s}^{R} \sum_{k=1}^{K_r} \sum_{l=1}^{K_s} c_{rs,kl} (w_{ri} - \kappa_{rk})_+^q (w_{si} - \kappa_{sl})_+^q \\ & + \sum_{r \ne s \ne t}^{R} \sum_{k=1}^{K_r} \sum_{l=1}^{K_s} \sum_{m=1}^{K_t} d_{rst,klm} (w_{ri} - \kappa_{rk})_+^q (w_{si} - \kappa_{sl})_+^q (w_{ti} - \kappa_{tm})_+^q + \ldots + u_i, \end{aligned}$$

where $K_r$ is the number of knots for predictor $w_r$. There may be R main effects, $\binom{R}{2}$ second-order interactions, $\binom{R}{3}$ third-order interactions and so on, with the associated parameters {b,c,d,…} having dimension determined by the number of knots in $K_r$, $\{K_r, K_s\}$, $\{K_r, K_s, K_t\}$, etc.
Higher order interactions may be excluded, even if definable in principle, as an acceptable smooth may often be obtained by restricting attention to main effects and low order interactions. So, a model with main and second order effects only would have $R + \binom{R}{2}$ parameter sets. Gustafson (2000) considers a BWISE approximation to smooth functions involving main effects $S_1(w_1), \ldots, S_R(w_R)$, and second-order interactions only, namely,

$$S_{12}(w_1, w_2), \ldots, S_{1R}(w_1, w_R), \ldots, S_{(R-1),R}(w_{R-1}, w_R).$$


Main effects are assumed to be either conventional linear regression effects or cubic
splines. The form of the interaction depends on which form of main effect is selected for
predictors wr and ws.

As an example, consider a tensor product of truncated polynomials with q = 1, and R = 3, so that $w_i = (w_{1i}, w_{2i}, w_{3i})$ and with $K_1 = K_2 = K_3 = 5$ knots. Also just consider linear step functions $(w - \kappa)_+$. Then there may be R = 3 main effects, $\binom{R}{2} = 3$ second-order interactions, and $\binom{R}{3} = 1$ third-order interaction. In a model confined to main effects and second-order interactions, the main effects would be terms $\sum_{k=1}^{K_1} b_{1k} (w_{1i} - \kappa_{1k})_+$, $\sum_{k=1}^{K_2} b_{2k} (w_{2i} - \kappa_{2k})_+$, and $\sum_{k=1}^{K_3} b_{3k} (w_{3i} - \kappa_{3k})_+$, involving 15 parameters. The second-order interactions would be terms $\sum_{k=1}^{K_1} \sum_{l=1}^{K_2} c_{12,kl} (w_{1i} - \kappa_{1k})_+ (w_{2i} - \kappa_{2l})_+$, $\sum_{k=1}^{K_1} \sum_{l=1}^{K_3} c_{13,kl} (w_{1i} - \kappa_{1k})_+ (w_{3i} - \kappa_{3l})_+$, and $\sum_{k=1}^{K_2} \sum_{l=1}^{K_3} c_{23,kl} (w_{2i} - \kappa_{2k})_+ (w_{3i} - \kappa_{3l})_+$, involving 75 parameters. If the coefficients $\{b_{rk}, c_{rs,kl}\}$ are assumed to be fixed effects, then predictor selection methods are relevant, as in Smith and Kohn (1996) or the RJMCMC (reversible jump MCMC) methods discussed by Denison et al. (2002, p.105). If the $\{b_{rk}, c_{rs,kl}\}$ are assumed to be random effects, smoothness may be achieved by penalising large coefficients, and parsimony achieved by selection between zero and positive variance components $\{\phi_{b1}, \phi_{b2}, \phi_{b3}, \phi_{c12}, \phi_{c13}, \phi_{c23}\}$.
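The parameter counts in this example can be verified by constructing the design matrices directly. The following R sketch uses simulated predictors and knots at sample quantiles, both of which are assumptions for illustration, as are the helper names `trunc_basis` and `row_tensor`.

```r
set.seed(4)
n <- 50; R <- 3; K <- 5
w <- matrix(runif(n * R), n, R)             # simulated predictors w_1, w_2, w_3

# Linear truncated basis (w - kappa_k)_+ for one predictor
trunc_basis <- function(x, K) {
  kappa <- quantile(x, probs = (1:K) / (K + 1), names = FALSE)
  pmax(outer(x, kappa, `-`), 0)
}
main <- lapply(seq_len(R), function(r) trunc_basis(w[, r], K))

# Row-wise tensor product: all products of a column of A with a column of B
row_tensor <- function(A, B) {
  A[, rep(seq_len(ncol(A)), each = ncol(B)), drop = FALSE] *
    B[, rep(seq_len(ncol(B)), times = ncol(A)), drop = FALSE]
}
pair_idx <- combn(R, 2)                     # the choose(R, 2) predictor pairs
inter <- lapply(seq_len(ncol(pair_idx)), function(j) {
  row_tensor(main[[pair_idx[1, j]]], main[[pair_idx[2, j]]])
})

n_main  <- sum(sapply(main, ncol))          # R * K coefficients b_rk
n_inter <- sum(sapply(inter, ncol))         # choose(R, 2) * K^2 coefficients c_rs,kl
c(main = n_main, interaction = n_inter)
```

The rapid growth of the interaction block with K is the practical reason for restricting attention to low order interactions or for using selection on the variance components.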
The tensor product generalisation of (12.2) or (12.5) includes interactions between the
terms in T(w) and Q(w) (Smith and Kohn, 1997; Ruppert et al., 2003, p.240). Consider a
situation with R = 2, with K1 knots in w1i and K2 knots in w2i. For a linear spline (q = 1), and
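random effect spline coefficients $\{b_{rk}, d_{rsk}, c_{rskm}\}$ one would have

$$\begin{aligned} g(\mu_i) = {} & \alpha + \beta_1 w_{1i} + \beta_2 w_{2i} + \beta_3 w_{1i} w_{2i} + \sum_{k=1}^{K_1} b_{1k} (w_{1i} - \kappa_{1k})_+ + \sum_{k=1}^{K_2} b_{2k} (w_{2i} - \kappa_{2k})_+ \\ & + \sum_{k=1}^{K_2} d_{12,k} w_{1i} (w_{2i} - \kappa_{2k})_+ + \sum_{k=1}^{K_1} d_{21,k} w_{2i} (w_{1i} - \kappa_{1k})_+ \\ & + \sum_{k=1}^{K_1} \sum_{m=1}^{K_2} c_{12,km} (w_{1i} - \kappa_{1k})_+ (w_{2i} - \kappa_{2m})_+ + u_i, \end{aligned}$$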

where there are six variance components $(\phi_{b1}, \phi_{b2}, \phi_{d12}, \phi_{d21}, \phi_{c12}, \sigma^2)$. In the bivariate example
of Smith and Kohn (1997, p.1530), K1 = K2 = 9 and q = 3 leading to a (fixed effects) analysis
involving 169 coefficients.
A similar scheme applies when interactions between metric and categorical predictors
are considered. Thus let $C_i \in \{1, \ldots, L\}$ be a categorical predictor, and $w_{1i}$ and $w_{2i}$ be metric predictors. Suppose that only the smooth in $w_2$ is postulated to vary according to the level of C, and define

$$z_{il} = \begin{cases} 1 & \text{if } C_i = l, \\ 0 & \text{otherwise.} \end{cases}$$

Also consider a metric response yi, and assume that interactions between w1 and w2 are not
present. Then, with a qth degree truncated polynomial basis in both predictors, one pos-
sible representation is

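$$\begin{aligned} y_i = {} & \alpha + Q_1(w_{1i}) + Q_2(w_{2i}) + T_1(w_{1i}) + T_{2,C_i}(w_{2i}) + u_i \\ = {} & \alpha + \beta_{11} w_{1i} + \ldots + \beta_{1q} w_{1i}^q + \beta_{21} w_{2i} + \ldots + \beta_{2q} w_{2i}^q + \sum_{k=1}^{K_1} b_{1k} (w_{1i} - \kappa_{1k})_+^q \\ & + \sum_{k=1}^{K_2} b_{2k} (w_{2i} - \kappa_{2k})_+^q + \sum_{l=2}^{L} z_{il} \left\{ \sum_{k=1}^{K_2} c_{kl} (w_{2i} - \kappa_{2k})_+^q \right\} + u_i, \end{aligned}$$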

where $b_{1k} \sim N(0, \phi_{b1})$, $b_{2k} \sim N(0, \phi_{b2})$, $c_{kl} \sim N(0, \phi_{cl})$ and $u_i \sim N(0, \sigma^2)$ (Coull et al., 2001). The amount of smoothing under $S_1 = Q_1 + T_1$ and $S_{2,C_i} = Q_2 + T_{2,C_i}$ then depends on the ratios $\sigma^2/\phi_{b1}$ and $\sigma^2/[\phi_{b2} + \phi_{cl}]$.
In a multivariate mixed model generalisation of the radial basis, one may consider thin-plate
functions with exponents (2q − d) specified by integer combinations (q,d), where d is the
dimension of the covariate vectors in the relevant interaction (Yau et al., 2003). So

$$H_k(z) = |z - t_k|^{(2q - d)} \log(|z - t_k|) \quad \text{for } (2q - d) \text{ even}$$

$$H_k(z) = |z - t_k|^{(2q - d)} \quad \text{for } (2q - d) \text{ odd}$$

where z are univariate or multivariate vector predictor values, and tk are univariate or
multivariate knots. In applying such functions, heavily parameterised multivariate spline
models are often not likely to be well identified, and simpler options involving univariate
smooths in each predictor (with d = 1), and all possible bivariate interactions (with d = 2),
may be considered (Yau and Kohn, 2003). Consider the setting q = 2, with predictors, w1 and
w2, and let zi = (w1i , w2i ) denote bivariate covariate combinations, with K12 bivariate centres
tk = (t1k , t2 k ) that might be provided by an initial cluster analysis. Also denoting distances
$h_{ik} = |z_i - t_k|$, the bivariate basis for 2q − d = 2 is of the form $h^2 \log(h)$. With linear terms in the
parametric component Q(w), this leads to the representation

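$$\begin{aligned} g(\mu_i) = {} & \alpha + S_1(w_{1i}) + S_2(w_{2i}) + S_{12}(w_{1i}, w_{2i}) \\ = {} & \alpha + \beta_1 w_{1i} + \sum_{k=1}^{K_1} b_{1k} |w_{1i} - \kappa_{1k}|^3 + \beta_2 w_{2i} + \sum_{k=1}^{K_2} b_{2k} |w_{2i} - \kappa_{2k}|^3 + \sum_{k=1}^{K_{12}} c_k h_{ik}^2 \log(h_{ik}). \end{aligned}$$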
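A design matrix for this representation might be assembled as in the R sketch below. The data are simulated, the bivariate centres come from a k-means cluster analysis as one possible choice, and the numbers of knots and centres and the helper name `cubic_radial` are arbitrary assumptions for illustration.

```r
set.seed(5)
n <- 100
z <- cbind(w1 = runif(n), w2 = runif(n))      # simulated bivariate predictor values

# Bivariate centres t_k from an initial k-means cluster analysis
K12 <- 8
centres <- kmeans(z, centers = K12, nstart = 5)$centers

# Euclidean distances h_ik = |z_i - t_k|
h <- sqrt(outer(z[, 1], centres[, 1], `-`)^2 +
          outer(z[, 2], centres[, 2], `-`)^2)

# Thin-plate basis h^2 log(h), set to its limiting value 0 when h = 0
H12 <- ifelse(h > 0, h^2 * log(h), 0)

# Univariate cubic radial bases |w - kappa_k|^3 (2q - d = 3 for q = 2, d = 1)
cubic_radial <- function(x, K) {
  kappa <- quantile(x, probs = (1:K) / (K + 1), names = FALSE)
  abs(outer(x, kappa, `-`))^3
}
H1 <- cubic_radial(z[, 1], K = 5)
H2 <- cubic_radial(z[, 2], K = 5)

# Columns: intercept, linear terms in w1 and w2, then the three radial bases
X <- cbind(1, z, H1, H2, H12)
dim(X)
```

In a mixed model treatment the first three columns would carry fixed effects, while the coefficients on the columns of `H1`, `H2`, and `H12` would each be assigned their own random effects variance.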

With $K_r$ knots $\{\kappa_{r1}, \ldots, \kappa_{rK_r}\}$ for predictor $w_{ri}$, the R main effects are

$$S_r(w_{ri}) = \beta_r w_{ri} + \sum_{k=1}^{K_r} b_{rk} |w_{ri} - \kappa_{rk}|^3,$$

where the R sets of coefficients $\{[b_{r1}, b_{r2}, \ldots, b_{rK_r}], r = 1, \ldots, R\}$ are assumed random with variances $\phi_{b1}, \ldots, \phi_{bR}$. Let the $K_{rs}$ bivariate knots for first order $(w_r, w_s)$ interaction effects be denoted $t_{rs,k} = (t_{rk}, t_{sk})$. Then the interaction bases have the form

$$T_{rs}(w_{ri}, w_{si}) = \sum_{k=1}^{K_{rs}} c_{rs,k} \lVert (w_{ri}, w_{si}) - (t_{rk}, t_{sk}) \rVert^2 \log( \lVert (w_{ri}, w_{si}) - (t_{rk}, t_{sk}) \rVert ).$$

The $\binom{R}{2}$ sets of coefficients $c_{rs,k}$ are also assumed to be random.
Low-rank thin-plate spline smooths as an approximation to the full thin-plate regression spline (TPRS) smoother are considered by Wood (2003, 2006, 2016). Thus, for an R-dimensional predictor vector $w_i = (w_{1i}, \ldots, w_{Ri})$, with model

$$y_i = f(w_i) + u_i$$

with $u_i$ random, the full TPS smooth of degree m involves a function g minimising

$$\lVert y - g \rVert^2 + \lambda J_{mR}(g),$$

where $J_{mR}(g)$ is a roughness penalty and λ is a smoothing parameter. This penalty is an integral of dimension $M = \binom{m + R - 1}{R - 1}$ involving all possible terms

$$\frac{m!}{\nu_1! \ldots \nu_R!} \left( \frac{\partial^m g}{\partial w_1^{\nu_1} \ldots \partial w_R^{\nu_R}} \right)^2,$$

where $\nu_1 + \ldots + \nu_R = m$. So, for R = 2 and m = 2,

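$$J_{mR}(g) = \iint \left[ \left( \frac{\partial^2 g}{\partial w_1^2} \right)^2 + 2 \left( \frac{\partial^2 g}{\partial w_1 \partial w_2} \right)^2 + \left( \frac{\partial^2 g}{\partial w_2^2} \right)^2 \right] dw_1 \, dw_2.$$

The function g has the form

$$g(w) = \sum_{i=1}^{n} \delta_i h_{mR}(\lVert w - w_i \rVert) + \sum_{j=1}^{M} \alpha_j f_j(w),$$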

where δi and αj are unknowns. To reduce the number of unknowns, especially for larger
samples, a rank k orthonormal basis for the δ parameters is used instead. This approach
avoids the knot placement problems of conventional regression spline modelling. Thin-plate
regression splines with truncated basis are implemented in the R package mgcv, with the
jagam option (Wood, 2016) producing a modifiable rjags code incorporating the TPRS com-
mands (see Example 12.3).

Example 12.2 Fertility, GDP, and Female Education


FIGURE 12.4
(a) Smooth for TFR as function of GDP per capita. (b) Smooth for TFR as function of GDP per capita and female education.

These data are from the UN Human Development Report for 2015 (http://report.hdr.undp.org/), and relate to a measure of fertility (TFR, the total fertility rate) over 167 countries. The analysis concerns the relation of TFR to GDP per head and average years of education for females. TPRS models are applied and can be fitted using jagam/mgcv, or the stan_gamm4 option in rstanarm.
The first model involves a smooth in GDP only, and a truncated TPRS representation
with rank k = 20 and m = 2. This provides a penalised DIC of 415, with Figure 12.4a show-
ing the resulting centred smooth. Including separate univariate smooths in both GDP
and female education improves the pDIC to 380. Both analyses show rapid convergence.
The second model is illustrated both by jagam/mgcv and stan_gamm4 codes.
A third model involves a joint smooth s(gdp,fschool) in the predictors, and provides a
pDIC of 379. Figure 12.4b shows the resulting three-dimensional scatter plot. Combining
both univariate smooths and a joint smooth provides a slightly improved pDIC of 374.
A final analysis modifies the rjags code for this model to include likelihood calcu-
lations from which WAIC (widely applicable information criterion) and LOO-IC may
be derived, and also includes binary selection indicators, $J_k \sim \text{Bern}(0.5)$, for the three
smooths. A penalising complexity prior is adopted for the residual standard deviation,
based on an assumed 0.01 probability that this exceeds 2, and exponential, E(1), pri-
ors are adopted on the smoothing parameters. This analysis shows a 0.06 probability
for retaining the univariate smooth in gdp, and if that smooth is excluded (so that the
model consists only of a univariate smooth in fschool and a bivariate smooth), the pDIC
falls to 372.

12.4 Heteroscedasticity via Adaptive Non-Parametric Regression


As mentioned above, a random effects spline regression in a predictor wi typically takes
the form

$$g(\mu_i) = \alpha + S(w_i) = \alpha + \beta_1 w_i + \ldots + \beta_q w_i^q + \sum_{k=1}^{K} b_k (w_i - \kappa_k)_+^q + u_i,$$

where $u_i \sim N(0, \sigma^2)$, the κk are knots, and the spline coefficients $b_k$ may be taken as normal, for example $b_k \sim N(0, \phi)$. This approach is spatially homogeneous (in terms of the predictor space), whereas a spatially adaptive regression may be used to represent heteroscedasticity, which is also related to w values, or possibly to the values of other predictors (Currie and Durban, 2002). Spatially adaptive regression may also be used to allow non-constant variance in the $b_k$, namely $b_k \sim N(0, \phi_k)$, with log(ϕk) determined by a spline regression on the knots (Yue et al., 2012).
For modelling heteroscedasticity, with $u_i \sim N(0, \sigma_i^2)$, a subsidiary spline regression may be applied to the variances $\sigma_i^2 = \exp(h_i)$, with M knots in the same predictor

$$h_i = \gamma_0 + \gamma_1 w_i + \ldots + \gamma_q w_i^q + \sum_{m=1}^{M} c_m (w_i - \psi_m)_+^q,$$

with $c_m \sim N(0, \phi_c)$ (e.g. Chib and Greenberg, 2013). Other options (Jerak and Lang, 2005) are random walk priors in $h_i$, such as an RW1

$$h_i \sim N(h_{i-1}, 1/\tau_h),$$

or discrete mixture over smoothing functions, with mixture probabilities based on mul-
tinomial logit regression involving additional covariates xi. For y metric and M mixture
components, one might have

$$p(y_i \mid x_i, w_i) \sim \sum_{m=1}^{M} \pi_m(x_i) N(S_m(w_i, \theta_m), V_m),$$

$$\sum_{m=1}^{M} \pi_m(x_i) = 1,$$

where each smooth function $S_m(w, \theta_m)$ has its own parameter set θm.

Models allowing for heteroscedasticity via non-parametric regression belong to a broader class of generalised additive models for location, scale, and shape (GAMLSSs).
These regress not only the expected mean, but other distribution parameters (e.g. location,
scale, and shape) on covariates, leading to what is termed distributional regression (Mayr
et al., 2012; Wood et al., 2016). In zero-inflated Poisson or negative binomial (NB) regres-
sions, the additive model framework is used for the Poisson or NB rate, the probability of
inflation, and for the NB scale parameter (Klein et al., 2015).

Example 12.3 Elementary School Attainment


FIGURE 12.5
(a) Residuals against fitted, homoscedastic model. (b) Residuals against free school meals. (c) Plot of log variance

The data for this example are a random sample of 400 elementary schools from California Education Department’s API datafile for 2000 (www.cde.ca.gov/ta/ac/ap/apidatafiles.asp), which reports school academic performance (yi) together with school characteristics, such as average class size and the poverty rate among the pupil intake (Chen et al., 2003). A linear regression analysis involves regressing 2000 performance on the percentage of pupils receiving free meals (FSM), percentage of English language pupils (ELP), and percentage of teachers with emergency credentials (EMCRED).
Let $\sigma^2 = \text{Var}(u_i)$, and assume $1/\sigma^2 \sim Ga(1, 0.001)$ in a homoscedastic linear regression

$$y_i = \mu_i + u_i = \beta_0 + \beta_1 \text{FSM}_i + \beta_2 \text{ELP}_i + \beta_3 \text{EMCRED}_i + u_i.$$

With computation via jagsUI, this provides a LOO-IC of 4387, with pe = 5.1. However, a
plot of the residuals shows residual variation to decrease as fitted attainment increases
(Figure 12.5a). All three predictors have significant (negative) effects on attainment, but
the highest ratio of posterior mean to standard deviation is for FSM, and a plot of the
residuals against FSM (Figure 12.5b) suggests residual variation increases with FSM.
A second model therefore specifies $y_i \sim N(\mu_i, \sigma_i^2)$, where $u_i \sim N(0, \sigma_i^2)$, with $\log(\sigma_i^2)$ modelled by a cubic spline regression

$$\log(\sigma_i^2) = \gamma_0 + \sum_{m=1}^{M} c_m (\text{FSM}_i - \psi_m)_+^3.$$

The spline coefficients are random, $c_m \sim N(0, \phi_c)$ with $1/\phi_c \sim Ga(1, 1)$. There are M = 9 knots, sited at the 10th, 20th, …, 90th percentiles of FSM. A corner constraint $c_1 = 0$ is used for identifiability. A two-chain run of 20,000 iterations gives an estimate for $\phi_c^{0.5}$ of 0.52 with 95% interval (0.33,0.84), whereas homoscedasticity would imply $\phi_c^{0.5} = 0$.
Figure 12.5c accordingly demonstrates non-constancy in $\log(\sigma_i^2)$ as FSM varies, though
there is no consistent monotonic upward or downward trend in variability as FSM
increases. The LOO-IC under the second model falls to 4375 (pe = 11.3).
A third model employs a different identification device, namely centring (at each iteration)
the observation level smooth $S_i(\mathrm{FSM}) = \sum_{m=1}^{M} c_m (\mathrm{FSM}_i - \psi_m)_+^3$ around the overall
mean of such smooths. The centred smooth is then included in the spline regression for
$\log(\sigma_i^2)$. This produces a similar fit (LOO-IC = 4376), and a similar non-monotonic relation
between $\log(\sigma_i^2)$ and FSM. The centred $c_m$ (c.cent in the R code) for this implementation
have a correlation of 0.99 with those from the corner constraint option.
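
The construction of the truncated cubic basis and the two identification devices can be sketched in R as follows. This is only an illustration under stated assumptions: the FSM values are simulated, the coefficient values are arbitrary placeholders rather than posterior estimates, and the names `trunc_cubic`, `c_m`, and `S_cent` are hypothetical (they are not taken from the book's code).

```r
# Sketch: truncated cubic spline basis for a log-variance regression.
# All data and coefficient values below are simulated placeholders.
set.seed(1)
fsm <- runif(200, 0, 100)                                     # stand-in for FSM percentages
knots <- as.numeric(quantile(fsm, seq(0.1, 0.9, by = 0.1)))   # M = 9 knots at the deciles

# n x M matrix with entries (FSM_i - psi_m)_+^3
trunc_cubic <- function(x, knots) {
  outer(x, knots, function(xi, k) pmax(xi - k, 0)^3)
}
B <- trunc_cubic(fsm, knots)

gamma0 <- 1
c_m <- c(0, rnorm(8, 0, 1e-5))   # corner constraint c_1 = 0; other values hypothetical
S <- as.vector(B %*% c_m)        # observation-level smooth S_i(FSM)
log_var_corner <- gamma0 + S     # log-variance under the corner constraint

# Centring device: subtract the overall mean of the smooth
# (in that model c_1 would be free; c_m is reused here only for brevity)
S_cent <- S - mean(S)
log_var_centred <- gamma0 + S_cent
```

In an MCMC setting the centring step would be repeated at every iteration, since the mean of the smooth changes with the sampled coefficients.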

12.5 General Additive Methods


Consider ranked values of a single predictor w1 , … , wn such that

w1 < w2 < … < wn ,

and let $S_t = S(w_t)$ be a smooth function representing the locally changing impact of $w_t$ on
$g(\mu_t)$ as it varies over its range. Thus

$$g(\mu_t) = \alpha + S(w_t) + u_t,$$
$$u_t \sim N(0, \sigma^2),$$

where depending on identification procedures used, the intercept α may not be present
(Koop and Poirier, 2004). Appropriate priors for St reflect the ordering and spacing of the w
values, and typically follow dynamic linear priors or other time series schemes. Normal or

Student t random walks in the first, second, or higher differences of $S_t$ are one possibility
(Knorr-Held, 1999; Fahrmeir and Lang, 2001; Chib and Jeliazkov, 2006). For identifiability,
especially when there are smooths $S_{rt} = S(w_{rt})$ in several predictors, one may adopt devices
such as centring of the $S_{rt}$, or corner constraints (e.g. $S_{r1} = 0$). Alternatively, to expedite
computing, one may monitor identified quantities such as the centred series $S_{rt} - \bar{S}_r$
without actually imposing centring constraints within the estimation. Because there is
only local smoothing, inferences may also be sensitive to priors assumed for the evolution
variance $\tau^2$ of the $S_t$ and for other aspects of the model.
If the w values are equally spaced and distinct, then first- and second-order random walk
priors are just

$$S_t \sim N(S_{t-1}, \tau^2),$$
$$S_t \sim N(2S_{t-1} - S_{t-2}, \tau^2),$$

where smaller values of $\tau^2$ result in a smoother curve. For metric or overdispersed discrete
responses, the parameterisation $\tau^2 \lambda = \sigma^2$ may be used, allowing for a trade-off between the
residual variance and the variance of the smooth (Koop and Poirier, 2004).
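
To make the role of the evolution variance concrete, the following R sketch simulates paths from the equally spaced first- and second-order random walk priors. It is an illustration only: the starting values are fixed at zero for convenience, and the function names (`sim_rw1`, `sim_rw2`, `rough`) are hypothetical.

```r
# Sketch: simulate RW1 and RW2 prior paths on an equally spaced grid.
# Note that rnorm() takes a standard deviation, so tau here is sqrt(tau^2).
set.seed(42)
n <- 100

sim_rw1 <- function(n, tau) cumsum(c(0, rnorm(n - 1, 0, tau)))

sim_rw2 <- function(n, tau) {
  S <- numeric(n)                       # S[1] = S[2] = 0 as fixed starting values
  for (t in 3:n) S[t] <- rnorm(1, 2 * S[t - 1] - S[t - 2], tau)
  S
}

# Roughness summary: mean squared second difference of a path
rough <- function(S) mean(diff(S, differences = 2)^2)

s_small <- sim_rw2(n, tau = 0.1)
s_large <- sim_rw2(n, tau = 1)
c(rough(s_small), rough(s_large))       # the smaller tau gives the smoother path
```

Under the RW2 prior the second differences are exactly the innovations, so the roughness summary is an empirical estimate of $\tau^2$.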
In ordinary regression applications, values of the $w_t$ are typically unequally spaced, and
there may be tied values. To take account of unequal spacing between successive $w_t$, the
prior is modified such that for second and higher order walks, the weighting on lagged
values is varied according to how distant they are from the current value (Fahrmeir and
Lang, 2001). In all orders of random walk, the precision of $S_t$ is reduced the wider the
gap between $w_t$ and its preceding ordered value. Let gaps between points be denoted
$\delta_2 = w_2 - w_1, \delta_3 = w_3 - w_2, \ldots, \delta_n = w_n - w_{n-1}$ (with $\delta_1 = 0$). Then a first-order Normal random
walk becomes

$$S_t \sim N(S_{t-1}, \delta_t \tau^2),$$

and a second-order one becomes

$$S_t \sim N([1 + \delta_t/\delta_{t-1}]S_{t-1} - [\delta_t/\delta_{t-1}]S_{t-2}, \delta_t \tau^2).$$
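
The unequally spaced second-order prior mean is a linear extrapolation through the two preceding points, which the following R sketch makes explicit. The helper names (`rw2_mean`, `sim_rw2_unequal`) and the zero starting values are assumptions for illustration; the predictor values are assumed sorted and distinct.

```r
# Sketch: RW2 prior for unequally spaced, sorted, distinct predictor values w.
rw2_mean <- function(S_prev1, S_prev2, d_t, d_prev) {
  r <- d_t / d_prev                      # ratio of current gap to preceding gap
  (1 + r) * S_prev1 - r * S_prev2
}

sim_rw2_unequal <- function(w, tau, S1 = 0, S2 = 0) {
  n <- length(w)
  d <- c(0, diff(w))                     # delta_1 = 0, delta_t = w_t - w_{t-1}
  S <- numeric(n)
  S[1] <- S1
  S[2] <- S2
  for (t in 3:n) {
    m <- rw2_mean(S[t - 1], S[t - 2], d[t], d[t - 1])
    S[t] <- rnorm(1, m, sqrt(d[t]) * tau)   # variance delta_t * tau^2
  }
  S
}

# Points on the line S = 3 + 2w at w = 0, 1, 4: the prior mean at w = 4 is
# the extrapolated value 3 + 2 * 4 = 11
rw2_mean(S_prev1 = 5, S_prev2 = 3, d_t = 3, d_prev = 1)

set.seed(3)
path <- sim_rw2_unequal(w = c(0, 1, 4, 4.5, 7, 10), tau = 0.2)
```

With equal gaps the ratio is 1 and the mean reduces to $2S_{t-1} - S_{t-2}$, recovering the equally spaced prior.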

Separate priors, usually fixed effects, are assumed for the initial values (e.g. $S_1$ in a
first-order random walk). A scheme allowing choice between RW1 and RW2 dependence for
unequally spaced w is proposed by Berzuini and Larizza (1996), namely

$$S_t \sim N(M_t, \delta_t \tau^2),$$

where

$$M_t = S_{t-1}[1 + (\delta_t/\delta_{t-1})\exp(-\eta\delta_t)] - S_{t-2}[(\delta_t/\delta_{t-1})\exp(-\eta\delta_t)].$$

Larger values of $\eta > 0$, such that $\exp(-\eta\delta_t)$ tends to zero, imply an approximate RW1 prior
and less smoothness.
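
The way this prior mean interpolates between the two random walk orders can be checked numerically. The function name `bl_mean` below is hypothetical, and the values are arbitrary illustrations.

```r
# Sketch: prior mean under the Berzuini-Larizza scheme for unequally spaced w.
bl_mean <- function(S_prev1, S_prev2, d_t, d_prev, eta) {
  r <- (d_t / d_prev) * exp(-eta * d_t)   # damped gap ratio
  S_prev1 * (1 + r) - S_prev2 * r
}

# eta = 0 reproduces the RW2 mean (linear extrapolation)
bl_mean(5, 3, d_t = 3, d_prev = 1, eta = 0)

# a large eta shrinks the damped ratio to zero, leaving the RW1 mean S_{t-1}
bl_mean(5, 3, d_t = 3, d_prev = 1, eta = 50)
```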
If there are ties in the w values, with only m < n distinct values, denoted $\{w^*_j, j = 1, \ldots, m\}$,
then the above priors would be on the differences $\delta_j = w^*_j - w^*_{j-1}$ in the ranked distinct values,
and it is necessary to specify a grouping index $G_t$ (ranging between 1 and m) for each
observation t = 1, …, n to indicate which distinct value it takes. Assuming an RW1 prior in
the smooth of the predictor effects, the regression in $w_t$ can then be written

$$g(\mu_t) = \alpha + S(G_t) + u_t, \quad t = 1, \ldots, n,$$
$$S_j \sim N(S_{j-1}, \delta_j \tau^2), \quad j = 1, \ldots, m,$$

with $G_t \in (1, \ldots, m)$.
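
The grouping index and the gaps between ranked distinct values are straightforward to obtain in R. The toy predictor values below are invented for illustration.

```r
# Sketch: grouping index G_t and gaps between ranked distinct predictor values.
w <- c(2.1, 0.5, 2.1, 3.7, 0.5, 1.2)   # n = 6 observations with ties
w_star <- sort(unique(w))              # m = 4 ranked distinct values
G <- match(w, w_star)                  # G_t: which distinct value observation t takes
delta <- c(0, diff(w_star))            # delta_1 = 0, delta_j = w*_j - w*_{j-1}
G                                      # 3 1 3 4 1 2
```

The smooth is then defined on the m distinct values, and each observation picks up its effect through `S[G[t]]`.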
If there is more than one predictor then a semiparametric model might be adopted with
smooth functions $S_r(w_r)$ on a subset r = 1, …, q of R predictors, with the remainder modelled
by assuming global linearity. So

$$g(\mu_t) = \alpha + S_1(w_{1t}) + S_2(w_{2t}) + \ldots + S_q(w_{qt}) + \beta_1 w_{q+1,t} + \ldots + \beta_{R-q} w_{R,t} + u_t.$$

If non-parametric functions are estimated for several regressors $w_{1t}, w_{2t}, \ldots, w_{qt}$, then
a unique ordering across all predictors is usually infeasible and grouping indices
$G_{1t}, G_{2t}, \ldots, G_{qt}$ for each of q regressors are necessary, even if the regressors have no tied
values. In the case of tied values, the indices range between 1 and $m_1$, 1 and $m_2$, …, 1 and $m_q$
(rather than between 1 and n).
Another approach (Wahba, 1983; Biller and Fahrmeir, 1997; Wood and Kohn, 1998) to
Bayesian general additive modelling involves the state space version of the polynomial
smoothing spline. For a spline of general order 2h − 1, $S_t = S(w_t)$ is generated by a
differential equation

$$\frac{d^h S_t}{dt^h} = \tau \frac{dW_t}{dt},$$

with $W_t$ a Wiener process, and $\tau^2$ the evolution variance. The state vector

$$Z_t = \left( S_t, \frac{dS_t}{dt}, \frac{d^2 S_t}{dt^2}, \ldots, \frac{d^{(h-1)} S_t}{dt^{(h-1)}} \right),$$

is then of order h, evolving stochastically according to

$$Z_t = F_t Z_{t-1} + e_t, \qquad (12.8)$$

where $F_t$ is an h × h transition matrix and $e_t$ is a multivariate error. For the cubic spline case
with h = 2, $Z_t = (S_t, dS_t/dt)$ is bivariate and the transition matrix is

$$F_t = \begin{pmatrix} 1 & \delta_t \\ 0 & 1 \end{pmatrix},$$

where $\delta_t = w_{t+1} - w_t$. The $e_t$ are also bivariate, for example, MVN with zero mean and
covariance $\tau^2 E_t$, where

$$E_t = \begin{pmatrix} \delta_t^3/3 & \delta_t^2/2 \\ \delta_t^2/2 & \delta_t \end{pmatrix}.$$

As usual there may be ties in the w values, and the prior (12.8) would be on j = 1, … , m dis-
tinct ranked values. Each observation for t = 1, …, n would have a grouping index Gt with
values between 1 and m.
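
A minimal R sketch of the cubic spline case (h = 2) of prior (12.8) is given below. It assumes sorted distinct predictor values, takes the gap entering each step to be the distance between consecutive points, and starts the state at zero; the function names (`make_F`, `make_E`, `sim_cubic_state`) are hypothetical.

```r
# Sketch: simulate the bivariate state Z_t = (S_t, dS_t/dt) for a cubic
# smoothing spline prior in state space form.
make_F <- function(d) matrix(c(1, 0, d, 1), nrow = 2)                  # [1 d; 0 1]
make_E <- function(d) matrix(c(d^3 / 3, d^2 / 2, d^2 / 2, d), nrow = 2)

sim_cubic_state <- function(w, tau, Z1 = c(0, 0)) {
  n <- length(w)
  Z <- matrix(NA_real_, n, 2, dimnames = list(NULL, c("S", "dS")))
  Z[1, ] <- Z1
  for (t in 2:n) {
    d <- w[t] - w[t - 1]                    # gap between consecutive points (assumed indexing)
    L <- t(chol(tau^2 * make_E(d)))         # lower Cholesky factor of the error covariance
    Z[t, ] <- make_F(d) %*% Z[t - 1, ] + L %*% rnorm(2)
  }
  Z
}

set.seed(11)
w <- sort(runif(30, 0, 10))
Z <- sim_cubic_state(w, tau = 0.5)
```

The first column of `Z` is the smooth itself and the second its derivative; with ties in w, the same recursion would be run over the m distinct ranked values only.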

Example 12.4 Conceptions under 18, RW2 Smooths


This example considers data on conceptions to women aged under 18 (yi) in 352 English
local authorities. Explanatory factors are area deprivation, measured by an Index of
Multiple Deprivation (IMD), and the percentage of 15-year-old pupils not achieving five
or more GCSE subjects at grade C or above. The acronym GCSE refers to the General
Certificate of Secondary Education, and educational proficiency is set by the criterion of
grade C or above. The model involves additive RW2 priors in $w_1$ = IMD and $w_2$ = GCSE.
Let $G_{1i}$ and $G_{2i}$ indicate which of the unique IMD and GCSE values is taken by area i,
where such unique values are ranked, with $m_1 = 352$ and $m_2 = 351$ unique values (there is
a single tie in the GCSE values). Some of these distinct values are, however, very close to
each other (a consideration relevant in a BayesX application). With $j = 1, \ldots, m_r$ denoting
ranked predictor values, $\delta_{rj} = w^*_{rj} - w^*_{r,j-1}$, and assuming RW2 dependence,

$$y_i \sim \mathrm{Bin}(n_i, \mu_i),$$

$$\mathrm{logit}(\mu_i) = \alpha + s_1(w_{1,G_{1i}}) + s_2(w_{2,G_{2i}}),$$

$$s_{rj} \sim N([1 + \delta_{rj}/\delta_{r,j-1}]s_{r,j-1} - [\delta_{rj}/\delta_{r,j-1}]s_{r,j-2}, \tau_r^2 \delta_{rj}), \quad r = 1, 2; \; j = 1, \ldots, m_r,$$

where $\tau_r^2$ is the variance for the randomly varying $s_{rj}$. There is excess dispersion, which
may be removed by a model also including an unstructured effect

$$\mathrm{logit}(\mu_i) = \alpha + s_1(w_{1,G_{1i}}) + s_2(w_{2,G_{2i}}) + u_i,$$

where $u_i \sim N(0, \sigma_u^2)$.


A BayesX analysis is applied within R using the BayesXsrc package. Inverse gamma
priors IG(g, h) are assumed on variance parameters, with a setting of {g = 1, h = 0.0001}.
To avoid estimation of a large number of coefficients, BayesX performs internal grouping
if a covariate has a large number of distinct values (for first- and second-order
random walks), so the actual number of distinct values used will be lower than the
observed $m_r$. Plots of the smooths under h = 0.0001 are based on 94 distinct IMD values
and 89 distinct GCSE values.
Figures 12.6a and 12.6b (which include the intercept) show the resulting smooth functions.
The extent of smoothing may depend on prior settings: setting h larger (e.g.
h = 0.001) produces more short-term variability. A stan_gamm4 code using TPRS smooth
functions shows similar results.

12.6 Non-Parametric Regression Methods for Longitudinal Analysis


FIGURE 12.6
(a) Smooth in GCSE (80% CRI). (b) Smooth in IMD (80% CRI). Each panel plots the smooth
against the predictor, showing the mean with 10% and 90% limits.

Two major applications of non-parametric regression to longitudinal datasets are to
time-varying regression coefficients and subject-specific curves (James et al., 2000; Wu and
Zhang, 2006). Applications to joint models for longitudinal and time-to-event data are also
increasing (Kohler et al., 2016). Time-varying regression effects are a special case of the
general varying coefficient model of Hastie and Tibshirani (1993), namely

$$g(\mu_{i,u}) = \beta_0(u_0) + w_{1i}\beta_1(u_1) + \ldots + w_{Ri}\beta_R(u_R),$$

where the effect modifiers $u = (u_1, \ldots, u_R)$ govern the effect of predictors $w = (w_1, \ldots, w_R)$. If
the modifiers are all the same (e.g. time) with $u_1 = u_2 = \ldots = u_R = t$ then

$$g(\mu_{it}) = \beta_0(t) + w_{1i}\beta_1(t) + \ldots + w_{Ri}\beta_R(t),$$

and the time-varying coefficient model, or dynamic general linear model (West and
Harrison, 1997), is obtained. This extends to time-varying predictors $w_{rit}$, with

$$g(\mu_{it}) = \beta_0(t) + w_{1it}\beta_1(t) + \ldots + w_{Rit}\beta_R(t).$$



Time-varying intercept or regression effects $\beta_r(t)$ of unknown form can be fitted by any
non-parametric method, such as regression splines, penalised splines, or random walks. For
example, a B-spline approach would take

$$\beta_r(t) = \sum_{k=1}^{K^*} b_{rk} B_k(w_{rit}, q),$$

where $b_{rk}$ are modelled as fixed or random effects. The fixed effects approach would typically
be combined with selection of significant coefficients.
Allowing for intercepts or regression effects to vary by subject makes random effects
a more sensible option. A comprehensive review of frequentist approaches to such
non-parametric mixed models is provided by Wu and Zhang (2006); see also Chapter 9 in
Ruppert et al. (2003). A typical application is in growth curve analysis and involves
subject-specific non-parametric growth curves in time or age. For example, a growth curve model
where observations at each wave included age could be modelled using a truncated spline

$$g(\mu_{it}) = \alpha_t + c_i + S_i(\mathrm{Age}_{it}) + u_{it} = \alpha_t + c_i + \sum_{k=1}^{K} b_{ik}(\mathrm{Age}_{it} - \kappa_k)_+^q + u_{it},$$

$$u_{it} \sim N(0, \sigma^2),$$

with $\sigma^2$ representing within-subject variation, while $c_i \sim N(0, \sigma_c^2)$, with $\sigma_c^2$ measuring
between-subject heterogeneity. The subject-specific spline coefficients $b_{ik}$ are subject to a
roughness penalty, such as a normal first difference penalty

$$b_{ik} \sim N(b_{i,k-1}, 1/\theta_i),$$

with subject-specific precisions potentially modelled hierarchically. For example, one
might take the $\log(\theta_i)$ to be normal with unknown variance. For applications with distinct
recording times $a_{it}$, extended general linear mixed models can be used (Wu and Zhang,
2006), with

$$g(\mu_{it}) = X_{it}\beta + \eta(a_{it}) + Z_{it}b_i + S_i(a_{it}) + u_{it},$$

where $\eta(a)$ is the population mean function, estimated non-parametrically, and $S_i(a)$ are
subject-specific deviation functions. Silva et al. (2008) consider cubic B-spline bases to
model region-wide and area-specific trends for health outcomes $y_{it} \sim \mathrm{Bin}(n_{it}, \pi_{it})$, namely

$$\mathrm{logit}(\pi_{it}) = \alpha + \eta(t) + S_i(t) + d_i = \alpha + \sum_{k=1}^{K^*} b_k B_k(t, 3) + \sum_{k=1}^{K^*} c_{ik} B_k(t, 3) + d_i,$$

where $d_i$ and $c_{ik}$ are random area effects.


Another possible scheme for allowing variability across subjects is by random “slopes”
around the population smooth functions, also sometimes denoted as random scaling of
nonlinear functions (Tutz and Reithinger, 2007). For example, consider a longitudinal
(e.g. growth curve) application with a single predictor $w_{it}$, the impact of which is modelled
at population level by a smooth function $S(w_{it})$. Then one may wish to allow both for
intercept (baseline) variation and for subject level variation around the average function
S(w). Thus

$$g(\mu_{it}) = \alpha + b_{1i} + S(w_{it}) + b_{2i}S(w_{it}) + u_{it}$$
$$= \alpha + b_{1i} + S(w_{it})(1 + b_{2i}) + u_{it},$$

with

$$(b_{1i}, b_{2i}) \sim N(0, D),$$

and for identification $\sum_{it} S_{it} = 0$ where $S_{it} = S(w_{it})$. The smooth function $S(w_{it})$ represents
the mean effect of predictor $w_{it}$, but this effect is stronger for subjects with $b_{2i} > 0$, and
weaker for subjects with $b_{2i} < 0$. So $b_{2i}$ acts to amplify or attenuate the non-parametric
impact of the variable $w_{it}$. For some subjects, one may even obtain large negative estimates,
$b_{2i} < -1$, so that the effect of $w_{it}$ is inverted. This model adapts to cross-sectional data where

$$g(\mu_i) = \alpha + S(w_i) + b_i S(w_i) + u_i,$$

particularly in cases where the units are non-exchangeable, for example, if the units were
areas, and $b_i$ followed a spatial prior.
The impact of $(1 + b_{2i})$ on the unknown function $S(w_{it})$ is analogous to (subject-specific)
factor loadings operating on factor scores, and is subject to identifiability (label switching)
issues, since $[-(1 + b_{2i})][-S(w_{it})] = S(w_{it})(1 + b_{2i})$. However, labelling issues should be
avoided in practice if the impact of $w_{it}$ represented by S(w) is well-identified by the data.
An alternative product scheme is applied by Congdon (2006), based on the Lee and Carter
(1992) mortality forecasting model. In this scheme, subject-specific weights $q_i$ that sum to
1 over all subjects operate on $S(w_{it})$, so that for $\sum_i q_i = 1$ the product scheme is $q_i S(w_{it})$. The
effect of w is stronger for subjects with higher $q_i$, and weaker for subjects with lower $q_i$, with
the average $q_i$ being 1/n.

Example 12.5 Progesterone Readings over Menstrual Cycle


This example uses progesterone readings yit (log progesterone) in a study of early preg-
nancy loss (Brumback and Rice, 1998; Wu and Zhang, 2006). There are n = 91 observed
cycles of length T = 24 days, so the total number of observations is n × T = 2184. The days
are coded as −8, −7, …, 13, 14, 15, with 0 as day of ovulation. There are J = 2 groups of
observations, the first 69 cycles being nonconceptive, the last 22 being conceptive. The group
level growth paths (model 1), or subject level growth paths (model 2), are modelled
non-parametrically. So instead of a linear or polynomial function in the days variable,
cubic B-splines are used with knots at (−5,0,5,10).
The K* = 8 basis functions are obtained from the R splines package via the commands:

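library(splines)
cycval <- seq(-8, 15)
# cubic B-spline basis with interior knots at -5, 0, 5, 10 (K* = 8 columns)
bs.cycval <- bs(cycval, df = NULL, knots = c(-5, 0, 5, 10), degree = 3,
                intercept = TRUE, Boundary.knots = range(cycval))
# drop the "bs" class and attributes, leaving a plain numeric matrix
bs.cycval <- matrix(as.numeric(bs.cycval), nrow = nrow(bs.cycval))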

In a baseline group-specific model, the spline coefficients are group-specific random
coefficients $\{b_{jk}, j = 1, 2; k = 1, \ldots, K^*\}$, with group-specific precisions. Let $G_i \in (1, 2)$ denote
conceptive group, then $y_{it} \sim N(\mu_{it}, 1/\tau)$, with

$$\mu_{it} = \alpha_{G_i} + \sum_{k=1}^{K^*} b_{G_i k} B_k(t, 3),$$

$$b_{jk} \sim N(0, 1/\phi_j),$$
$$\phi_j \sim \mathrm{Ga}(1, 0.001),$$
$$\tau \sim \mathrm{Ga}(1, 0.001).$$

A two-chain run of 20,000 iterations is undertaken, with centring of $c_{jt} = \sum_{k=1}^{K^*} b_{jk} B_k(t, 3)$
within groups for identification. The two groups follow similar paths, in
terms of posterior means of $\{\alpha_j + c_{jt}\}$, up to the week after ovulation, but distinct trends
thereafter (Figure 12.7a). The LOO-IC is 6252.
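
The within-group centring of the spline term can be sketched as follows. The coefficient values are random placeholders, not estimates from the progesterone data, and the object names (`b`, `c_jt`, `c_cent`) are hypothetical.

```r
# Sketch: group-specific spline terms c_jt = sum_k b_jk B_k(t, 3), centred within groups.
library(splines)
cycval <- seq(-8, 15)
B <- bs(cycval, knots = c(-5, 0, 5, 10), degree = 3,
        intercept = TRUE, Boundary.knots = range(cycval))   # 24 x 8 basis matrix

set.seed(5)
b <- matrix(rnorm(2 * ncol(B), 0, 0.5), nrow = 2)   # hypothetical b_jk for J = 2 groups
c_jt <- b %*% t(B)                                  # 2 x 24 matrix of spline terms
c_cent <- c_jt - rowMeans(c_jt)                     # centre each group's curve over days
```

Adding a group intercept to each row of `c_cent` would then give the group growth path over the 24 days.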

FIGURE 12.7
(a) Growth curve smooths (Model 1). (b) Growth curve smooths (Model 2). Each panel plots
progesterone against day for the nonconceptive and conceptive groups.

A subject-specific model adds both subject heterogeneity and subject (cycle)-specific
growth effects, so that

$$\mu_{it} = \alpha_{G_i} + b_{i0} + \sum_{k=1}^{K^*} b_{ik} B_k(t, 3),$$

with

$$b_{i0} \sim N(0, 1/\tau_0), \quad i = 2, \ldots, n,$$
$$b_{i1} = 0,$$
$$b_{ik} \sim N(b_{i,k-1}, 1/\theta_{G_i}), \quad k = 2, \ldots, K^*,$$
$$\tau_j \sim \mathrm{Ga}(1, 0.001), \quad j = 0, 1,$$
$$\theta_j \sim \mathrm{Ga}(1, 0.001), \quad j = 0, 1.$$

The corner constraint $b_{i1} = 0$ aids in identification. Average growth curves are shown
in Figure 12.7b. The LOO-IC for this model is 3319.

Example 12.6 Birthweight and maternal age


Neuhaus and McCulloch (2006) consider a subset of data from a more extensive longitudinal
study that involves the birthweights of babies born to n = 878 mothers from the
state of Georgia, USA, each of whom had at least T = 5 babies. The analysis here is focused
on the impact on birthweight $y_{it}$ of mother’s age at birth $w_{it}$, and the extent to which there
is heterogeneity in the overall smooth $S(w_{it})$, which is based on a second-order random
walk.
Thus, for the five-birth history of mother i one may stipulate

$$y_{it} = \beta_0 + b_{1i} + S(w_{it}) + b_{2i}S(w_{it}) + u_{it}$$
$$= \beta_0 + b_{1i} + (1 + b_{2i})S(w_{it}) + u_{it},$$

where

$$(b_{1i}, b_{2i}) \sim N(0, D),$$

and $D^{-1}$ follows a Wishart prior with identity scale matrix and 2 degrees of freedom.
A second-order random walk smooth is estimated over all (i, t) pairs using a normal
prior with a single variance parameter, rather than on the basis of successive ages
within each fertility sequence, which would permit distinct variance parameters for
each subject. The smooth involves 31 random parameters, namely for maternal ages
12 to 42. Identification is achieved by centring S(w) at each iteration.
A two-chain run of 10,000 iterations using the rube library shows significant heterogeneity
around the overall smooth in age, with a posterior mean for $\mathrm{var}(b_2)$ of 2.0,
and 95% interval (1.2, 3.3). Figure 12.8a shows the varying non-parametric impact of
maternal age $w_{it}$ on birthweight according to $b_{2i}$, namely for subjects with $b_{2i} = \mathrm{sd}(b_2)$,
$b_{2i} = 0$, and $b_{2i} = -\mathrm{sd}(b_2)$, where the standard deviations are those at particular MCMC
iterations. A histogram plot of the posterior mean $b_{2i}$ (Figure 12.8b) indicates normality,
though an extreme negative outlier of −4.9 occurs for subject 470, whose fourth and fifth
infants weighed under 1 kg, whereas the first two exceeded 3 kg in weight. To assess outlier
status at observation level, one may derive WAIC component scores for individual
(mother, infant) pairs: the largest such score (44 out of a total WAIC of 5788) is for the
fifth infant to mother 838.

FIGURE 12.8
(a) Smooth impacts of maternal age on birthweight, according to variability in b2. (b) Histogram
of b2 (density against posterior mean b2).

12.7 Quantile Regression
Normal linear regression and generalised linear models focus on estimating the condi-
tional mean of the response yi. Quantile regression (Koenker, 2005) provides a more com-
plete perspective on the conditional density of yi, and focuses on estimating conditional
quantiles (such as the conditional median) of the response. Sometimes, conditional mean
regression will show a predictor as having no impact, whereas quantile regression will
show a significant impact over at least part of the quantile range (Cade and Noon, 2003),
though collinearity between predictors (and hence, predictor selection) may still be an
issue (Xi et al., 2016; El Adlouni et al., 2018). With quantiles denoted $q \in [0, 1]$, the conditional
quantile is given by the quantile (inverse cumulative distribution) function
$Q(q|X_i)$, defined by $\Pr[y_i < Q(q|X_i)] = q$.

For linear regression involving a continuous response, the frequentist quantile regression
estimator at quantile q minimises the function

$$Q(q|X_i) = q \sum_{y_i \ge X_i\beta_q} |y_i - X_i\beta_q| + (1 - q) \sum_{y_i < X_i\beta_q} |y_i - X_i\beta_q|.$$

Equivalently, quantile regression involves minimising $\sum_{i=1}^{n} \rho_q(y_i - X_i\beta_q)$, as defined by the
loss function (Yu and Moyeed, 2001)

$$\rho_q(u) = u(q - I(u < 0)) = |u|\,(qI(u \ge 0) + (1 - q)I(u < 0)).$$

This loss function downweights or emphasises absolute errors according to the quantile
q. For example, setting q = 0.9 results in a loss nine times larger for positive residuals with
$y_i \ge X_i\beta$ than for negative residuals with $y_i < X_i\beta$. So, the upper tail of the conditional
distribution is emphasised.
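
The asymmetry of the loss, and the fact that minimising it recovers a quantile, can be checked in R for the simplest case of an intercept-only model. The data are simulated and the function names (`check_loss`, `obj`) are hypothetical.

```r
# Sketch: the check loss rho_q(u) and its minimiser in an intercept-only model.
check_loss <- function(u, q) u * (q - (u < 0))

check_loss(1, 0.9)     # loss for a positive residual of size 1
check_loss(-1, 0.9)    # loss for a negative residual of size 1 (nine times smaller)

set.seed(7)
y <- rexp(501)         # skewed simulated response
q <- 0.9
obj <- function(b) sum(check_loss(y - b, q))

# The objective is piecewise linear and convex, so a minimiser lies at a data point
b_hat <- y[which.min(sapply(y, obj))]
b_hat == sort(y)[ceiling(q * length(y))]   # the minimiser is the empirical 0.9 quantile
```

With predictors, the same objective is minimised over the regression coefficients rather than over a single constant.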
A special case is provided by median regression, via minimisation of the absolute
deviations:

$$Q(0.5|X_i) = \sum_i |y_i - X_i\beta|.$$

This reduces the impact of outliers (influential observations) in the response space on esti-
mation, so as to provide a better fit for the majority of observations. Credible intervals (e.g.
for observation level predictions) estimated using conditional mean regression by averag-
ing over MCMC samples may also be affected by outliers. By contrast, median regression
is more robust to skewness and other departures from normality (Geraci and Bottai, 2006).
Thus, Min and Kim (2004) consider different forms of non-Gaussian errors, with asym-
metric and long-tailed distributions, and show that median regression outperforms con-
ditional mean regression, since the median is a more suitable centrality measure for data
with a skewed response.
Methods for Bayesian quantile regression include asymmetric Laplace likelihood (Yu
and Moyeed, 2001), exponentially tilted empirical likelihood (Schennach, 2005), and
Dirichlet process mixture median regression (Kottas and Gelfand, 2001). Yu and Moyeed
(2001) demonstrate that loss function minimisation is equivalent to estimation using an
asymmetric Laplace distribution (ALD), with density function

$$\mathrm{ALD}(y|\eta_q, \sigma, q) = \frac{q(1 - q)}{\sigma} \exp\left[ -\rho_q\left( \frac{y - \eta_q}{\sigma} \right) \right].$$

This density can be represented as a scale mixture of normals, thus facilitating Gibbs sampling
(Kozumi and Kobayashi, 2011).
Thus, for $y \sim \mathrm{ALD}(\eta_q, \sigma, q)$, one has for quantiles q = 1, …, Q the quantile-specific
representation

$$y_i = \eta_{iq} + \xi_q W_{iq} + \left[ \frac{2\sigma_q W_{iq}}{q(1 - q)} \right]^{0.5} Z_{iq},$$

where $\eta_{iq}$ is the regression term, $\xi_q = (1 - 2q)/[q(1 - q)]$, $W_{iq} \sim \mathrm{Exp}(\sigma_q)$, and $Z_{iq} \sim N(0, 1)$. The
practical role of the $\xi_q W_{iq}$ terms is to maintain the model as a satisfactory representation
of y, compensating for shifts in $\eta_{iq}$ between quantiles. The $W_{iq}$ are measures of outlier status.
Observations with higher $W_{iq}$ have higher variances and lessened influence on the
likelihood. R packages to implement Bayesian quantile linear regression include brq
(Alhamzawi, 2012), bayesQR (Benoit and Van den Poel, 2014), and ALDqr (Sanchez et al.,
2017).
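
The scale mixture representation can be checked by simulation. The sketch below fixes the scale at 1 (so that the exponential mixing variable has unit mean and no choice of rate versus mean parameterisation arises) and verifies that the location term is the q-th quantile of the simulated response. The function name `r_ald_mixture` is hypothetical.

```r
# Sketch: draw from the normal scale mixture form of the ALD with scale fixed at 1,
# and check that eta is the q-th quantile of y.
r_ald_mixture <- function(n, eta, q) {
  xi <- (1 - 2 * q) / (q * (1 - q))
  W <- rexp(n, rate = 1)                 # exponential mixing variable (unit scale assumed)
  Z <- rnorm(n)
  eta + xi * W + sqrt(2 * W / (q * (1 - q))) * Z
}

set.seed(123)
y <- r_ald_mixture(2e5, eta = 2, q = 0.75)
mean(y <= 2)                             # should be close to q = 0.75
```

Draws with large W have both a larger location shift and a larger conditional variance, which is why high values of the mixing variable flag potential outliers.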
In practice, it is not necessarily guaranteed that estimated quantile curves will be non-
crossing, especially for quantiles not widely separated (e.g. q = 0.05 compared to q = 0.10)
(Bondell et al., 2010). Methods to circumvent this, not necessarily fully Bayesian, have been
proposed (Cai and Jiang, 2015). An ad hoc approach involves simultaneous estimation of
all quantiles of interest, and omitting MCMC samples where the expected ordering of the
quantile regression terms ηiq is not satisfied.
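
The ad hoc filtering step can be sketched as below, assuming the sampled regression terms have been stored in an array indexed by (MCMC draw, observation, quantile). The draws here are simulated stand-ins, and the object names (`eta`, `is_ordered`, `keep`) are hypothetical.

```r
# Sketch: discard MCMC draws in which the regression terms cross between quantiles.
set.seed(99)
n_draws <- 1000
n_obs <- 5
shifts <- c(-0.5, 0, 0.5)                 # three quantiles, ordered low to high
base <- matrix(rnorm(n_draws * n_obs), n_draws, n_obs)
eta <- array(NA_real_, dim = c(n_draws, n_obs, 3))
for (k in 1:3) {
  noise <- matrix(rnorm(n_draws * n_obs, 0, 0.2), n_draws, n_obs)
  eta[, , k] <- base + shifts[k] + noise  # stand-in for sampled eta_iq
}

# A draw is retained only if every observation has non-decreasing terms across quantiles
is_ordered <- function(m) all(apply(m, 1, function(r) !is.unsorted(r)))
keep <- apply(eta, 1, is_ordered)
eta_keep <- eta[keep, , , drop = FALSE]
mean(keep)                                # proportion of draws retained
```

If only a small proportion of draws is retained, the filtered sample may be a poor summary of the posterior, which is one reason to regard this device as ad hoc.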
For longitudinal data (with units i, and times t) (e.g. Geraci and Bottai, 2006; Alhamzawi
et al., 2011), the regression term might include quantile-specific unit level random effects
$b_{iq}$. Assuming normal subject effects, the representation would then be

$$y_{it} = X_{it}\beta_q + b_{iq} + \xi_q W_{itq} + \left[ \frac{2\sigma_q W_{itq}}{q(1 - q)} \right]^{0.5} Z_{itq},$$

with $b_{iq} \sim N(0, \sigma_b^2)$.

12.7.1 Non-Metric Responses
For binary responses, the augmented data method can be applied, combined with the scale
mixture version of the ALD (Benoit and Van den Poel, 2012; Benoit and Van den Poel, 2017).
Thus, binary responses $y_i$ can be regarded as determined by a continuous latent variable $y_i^*$.
To implement quantile regression for these latent variables, one specifies

$$y_i^* \sim \mathrm{ALD}(\eta_q, \sigma_q = 1, q),$$

with set scale parameter for identifiability and truncated sampling according to the
observed value of $y_i$. Thus

$$y_i^* \sim \mathrm{ALD}(\eta_q, \sigma_q = 1, q)\, I(-\infty, 0), \quad y_i = 0;$$

$$y_i^* \sim \mathrm{ALD}(\eta_q, \sigma_q = 1, q)\, I(0, \infty), \quad y_i = 1.$$

Yue and Hong (2012) apply quantile tobit regression to highly skewed medical expenditure
data, focusing on the latent outcome in combination with the scale mixture ALD, while
Rahman (2016) uses the augmented data approach for quantile regression of ordinal data.
To extend quantile regression to count data, Machado and Santos Silva (2005) propose
adding uniform noise u to count responses, giving $z_i = y_i + u_i$, where $u_i \sim U(0, 1)$, and apply
quantile regression of the form

$$Q_{Z_i}(q|X_i) = \eta_{qi} = q + \exp(X_i\beta_q).$$

With offsets $E_i$, the quantile regression is

$$Q_{Z_i}(q|X_i) = \eta_{qi} = q + E_i \exp(X_i\beta_q).$$

This can be rearranged (Fuzi et al., 2016) into a linear model

$$Q_{Z_i^*}(q|X_i) = \eta_{qi} = X_i\beta_q + \log(E_i),$$

for quantities

$$z_i^* = \log(z_i - q) \quad \text{for } z_i > q;$$

$$z_i^* = \log(\phi) \quad \text{for } z_i \le q \text{ (with } q > \phi > 0).$$
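
The jittering and transformation steps can be written out in a few lines of R. The counts are simulated, and the value chosen for the small positive constant (called `phi` here) is an arbitrary assumption.

```r
# Sketch: jitter count responses and form the transformed outcome z* at quantile q.
set.seed(2005)
y <- rpois(50, 3)                 # simulated counts
q <- 0.5
phi <- 1e-5                       # small constant with q > phi > 0 (value assumed)

z <- y + runif(length(y))         # add U(0, 1) noise to each count
z_star <- rep(log(phi), length(z))
above <- z > q
z_star[above] <- log(z[above] - q)   # log(z - q) where z exceeds q, log(phi) otherwise
```

Because the noise is random, the transformed responses differ from one jittered sample to the next; the frequentist procedure handles this by averaging estimates over several jittered samples.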

Another approach to quantile regression for overdispersed count data involves a scale
mixture version of the ALD (Yu and Moyeed, 2001), within a hierarchical Poisson lognor-
mal representation to account for overdispersion (e.g. Connolly and Thibaut, 2012). The
quantile regression is for latent outcomes at the second stage of the hierarchical model,
focused on estimating latent incidence rates or relative risks (Congdon, 2017). The Poisson
lognormal representation is in itself beneficial, since the tails of the lognormal are heavier
than for the gamma distribution, and for data with outliers, the Poisson lognormal model
may give a better fit than the negative-binomial model. Thus, for observed counts $y_i$, one
specifies for quantiles q = 1, …, Q,

$$y_i \sim \mathrm{Poi}(\mu_{iq}),$$

$$\mu_{iq} = \exp(\nu_{iq}),$$

$$\nu_{iq} \sim N\left( X_i\beta_q + \xi_q W_{iq}, \frac{2W_{iq}\delta_q}{q(1 - q)} \right),$$

$$W_{iq} \sim \mathrm{Exp}(\delta_q).$$

This approach is less computationally intensive than the uniform noise (jittering) method.

Example 12.7 Trout Density


We consider data on a continuous variable, the density of Lahontan cutthroat trout y
(trout numbers per metre of stream) as response, and its varying relationship to stream
width-depth (w-d) ratio (x). The code for this example estimates several quantile regres-
sions simultaneously, and illustrates the plots that can be made. There are n = 71 obser-
vations from 13 streams across 7 years in Nevada (Dunham et al., 2002; Cade and Noon,
2003). Dunham et al. (2002) compare quantile linear regression, and a nonlinear quantile
regression $y_i = \exp(\beta_0 + \beta_1 x_i + u_i)$, which can be obtained by taking a log transform of y.
To obtain a plot of the varying influence of the w-d ratio on y, as in Figure 4 of Dunham
et al. (2002), regressions are performed at Q = 19 quantiles, namely 0.05, 0.1, 0.15, etc.
through to 0.95. The varying influence is represented by the b2[q] parameters in the
code. Because the outcome is necessarily positive, the linear regression is constrained
to produce positive values. The b2[q] show no impact of the predictor until significant
negative impacts at q = 0.75 and above for the linear regression, and above q = 0.8 for
the exponential regression (Figure 12.9a). Figure 12.9b shows the predicted relationship
between density and w-d ratios (from 0 to 60) for selected quantiles under the exponen-
tial model, analogous to Figure 5 in Dunham et al. (2002). This plot uses the posterior
median of replicate density values.

FIGURE 12.9
(a) Trout per meter by width/depth ratio, according to regression quantiles. (b) Relationship
between density and W-D ratios. Panel (a) plots the rate of change in trout density against
quantile (mean, 2.5% and 97.5% limits) for the linear and nonlinear regressions; panel (b) plots
density against width:depth ratio for the 10th, 50th, and 90th quantiles.

Posterior predictive p-tests at each quantile are made using total absolute deviations
between actual responses (or replicates) and model predictions. Under the linear model
these mostly remain between 0.1 and 0.9, but fall below 0.1 for middle quantiles (between
0.4 and 0.7). Predictive tests are satisfactory across all quantiles for the exponential
model.
An rstan implementation of the linear option confirms the lack of impact of w-d ratio
at q = 0.5, with β1 having mean (95% interval) of −0.0029 (−0.0072, 0.0022). However, at q =
0.9, the estimate is −0.0114 (−0.0146, −0.0081) (Figures 12.10 and 12.11).

FIGURE 12.10
Quantile Regression Coefficient Plots, Linear (left), Exponential (right), Slope of Density on Width-Depth Ratio.
Each panel plots the slope coefficient against quantile, with 2.5% and 97.5% limits.

FIGURE 12.11
Conditional predictive profile, trout density against width-depth ratio, 10th, 50th and 90th quantiles,
exponential transform model.

Example 12.8 Binary Work Trip Data


This example shows how quantile regression can be applied to binary data using data
augmentation. The application is to work trip data, specifically use of car or not as
response, with predictors: DCOST (transit fare minus automobile travel cost in cents);
CAR (number of cars owned by household); DOVTT (transit out-of-vehicle travel time minus
automobile out-of-vehicle travel time in minutes); and DIVTT (transit in-vehicle travel
time minus automobile in-vehicle travel time in minutes). For identifiability, the parameter
$\sigma_q$ is set to 1. This model can be fitted using R2OpenBUGS, with the relevant coding
being:

model1 <- function() {
  xi <- (1 - 2*q)/(q*(1 - q))
  for (i in 1:n) {
    eta[i] <- b0 + b[1]*x1[i] + b[2]*x2[i] + b[3]*x3[i] + b[4]*x4[i]
    w[i] ~ dexp(sigmaq)
    mu[i] <- xi*w[i] + eta[i]
    tau[i] <- (q*(1 - q)*sigmaq)/(2*w[i])
    # latent response, truncated according to the observed binary y[i]
    ystar[i] ~ dnorm(mu[i], tau[i]) %_% I(A[i], B[i])
    A[i] <- -100*equals(y[i], 0)
    B[i] <- 100*equals(y[i], 1)
  }
  # priors for b0 and b[1:4] are not shown in this excerpt
}

The estimation concerns only median regression (q = 0.5 in the above coding).
Table 12.1 shows the estimated coefficients. The WAIC, on the basis of a normal like-
lihood calculation, is 4,344, albeit with the least well-fitted cases having subject level
WAIC scores of 10 or more.

Example 12.9 Quantile Regression of Physician Visits Counts


Deb and Trivedi (1997) analyse data on patient visits to their primary care physician.
Data refer to 4,406 individuals covered by Medicare, a public insurance program.
Predictors are total hospital stays (hosp), health (binary, excellent self-perceived health
status), numchron (number of chronic conditions), gender, school (number of years of
education), and privins (binary, a private insurance indicator). As in Zeileis et al. (2008),
the predictors hosp, numchron, and school are treated as continuous.
Quantile regression using the Machado and Santos Silva (2005) procedure can be
implemented using the zeroes trick available in BUGS and JAGS. Inferences are based
on a two-chain run of 5,000 iterations using jagsUI. We also consider hierarchical quantile
regression using the Poisson lognormal (HQRPLN) method (Congdon, 2017), as in
model2.jag in the code [1]. The quantiles considered are q = 0.5, q = 0.75, and q = 0.95 with
a focus on identifying influences on higher levels of health care usage.
TABLE 12.1
Car Work Trip, Parameter Estimates
                  Mean    2.5%    50%     97.5%
β0  Intercept     4.63    3.98    4.63    5.30
β1  DCOST         0.97    0.53    0.97    1.41
β2  CAR           3.20    2.64    3.20    3.79
β3  DOVTT         1.00    0.38    0.99    1.75
β4  DIVTT         0.24   −0.32    0.23    0.82

Tables 12.2 and 12.3 compare results from frequentist estimation (via the lqm.counts
function in R), and results from the Bayesian estimation. The median regression (q = 0.5)
can also be compared to a negative binomial (conditional mean) estimation using the
glm.nb function (Table 12.2). Median regression tends to show stronger regression effects,
though less precisely estimated, than negative binomial regression. Posterior mean $W_{iq}$
from the HQRPLN estimation show subject 3735 as the most extreme outlier. This subject
has no physician visits, despite a high number of hospital stays and chronic conditions.
Estimated regression coefficients for higher quantiles show a diminished influence of
gender and insurance status. The Bayesian estimates for q = 0.95 also show a lessened
influence of total chronic conditions.

TABLE 12.2
Physician Visits, Comparison of Estimates, Median Regression
            Negative Binomial     Machado-Santos Silva   Machado-Santos Silva   Hierarchical HQRPLN
            (Conditional Mean)    (via lqm.counts)       (Bayesian Estimates)   (Bayesian Estimates)
            Estimate   SE         Estimate   SE          Mean      Std          Mean      Std
Intercept    1.00      0.05        0.72      0.08         0.43     0.05          0.48     0.06
hosp         0.23      0.02        0.26      0.03         0.28     0.02          0.25     0.02
health      −0.36      0.06       −0.40      0.10        −0.39     0.07         −0.37     0.06
numchron     0.19      0.01        0.22      0.01         0.24     0.01          0.23     0.01
gender      −0.13      0.03       −0.20      0.05        −0.20     0.04         −0.18     0.03
school       0.023     0.004       0.017     0.006        0.029    0.005         0.028    0.005
privins      0.19      0.04        0.22      0.05         0.33     0.05          0.30     0.05

TABLE 12.3
Physician Visits, Comparison of Estimates, Higher Quantiles
            Machado-Santos Silva   Machado-Santos Silva   Hierarchical HQRPLN
            (via lqm.counts)       (Bayesian Estimates)   (Bayesian Estimates)
            Estimate   SE          Mean      Std          Mean      Std
q = 0.75
Intercept    1.16      0.07         1.26     0.05          1.34     0.05
Hosp         0.26      0.04         0.26     0.02          0.25     0.02
Health      −0.37      0.06        −0.36     0.06         −0.38     0.06
Numchron     0.21      0.02         0.20     0.01          0.19     0.01
Gender      −0.13      0.04        −0.15     0.03         −0.15     0.03
School       0.026     0.005        0.020    0.003         0.019    0.004
Privins      0.21      0.05         0.21     0.04          0.18     0.04
q = 0.95
Intercept    1.90      0.07         2.32     0.05          2.12     0.05
Hosp         0.21      0.04         0.20     0.02          0.23     0.02
Health      −0.37      0.09        −0.35     0.05         −0.42     0.05
Numchron     0.18      0.02         0.12     0.01          0.15     0.01
Gender      −0.01      0.05        −0.07     0.03         −0.08     0.03
School       0.036     0.006        0.018    0.003         0.020    0.003
Privins      0.20      0.06         0.04     0.05          0.08     0.03

12.8 Computational Notes
[1] The JAGS code for the HQRPLN model is as follows:

cat("model{
  # asymmetry term of the normal-exponential mixture at quantile q
  xi <- (1-2*q)/(q*(1-q))
  for (i in 1:n){
    y[i] ~ dpois(mu[i])
    log(mu[i]) <- nu[i]
    # regression predictor for the q-th quantile of nu[i]
    eta[i] <- b[1]+b[2]*hosp[i]+b[3]*excelhlth[i]+
              b[4]*numchron[i]+b[5]*gender[i]+
              b[6]*school[i]+b[7]*privins[i]
    # exponential mixing variable and implied normal precision
    w[i] ~ dexp(sigmaq)
    tau[i] <- (q*(1-q)*sigmaq)/(2*w[i])
    nu[i] ~ dnorm(xi*w[i] + eta[i],tau[i])
    # pointwise Poisson log-likelihood
    log(L[i]) <- -mu[i]+y[i]*log(mu[i])-logfact(y[i])
    LL[i] <- log(L[i])}
  sigmaq ~ dgamma(1,0.001)
  for (j in 1:7) {b[j] ~ dnorm(0,0.001) }}
", file="model2.jag")
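
The normal-exponential mixture used for nu[i] in this code implies that, once w[i] is integrated out, eta[i] is the q-th quantile of nu[i]. The base R sketch below checks this by simulation for a single unit. The values chosen for q, sigmaq, and eta, the seed, and the object names nsim and p_below are arbitrary assumptions for illustration. Note that dexp in JAGS is parameterised by a rate and dnorm by a precision, which the sketch mirrors.

```r
# Simulation check of the mixture representation in model2.jag.
# Values of q, sigmaq and eta are illustrative assumptions.
set.seed(42)
q <- 0.75
sigmaq <- 2                   # rate of the exponential, as in dexp(sigmaq)
eta <- 1.3                    # assumed value of the quantile predictor
nsim <- 200000

xi <- (1 - 2 * q) / (q * (1 - q))
w <- rexp(nsim, rate = sigmaq)
tau <- (q * (1 - q) * sigmaq) / (2 * w)   # precision, as in JAGS dnorm
nu <- rnorm(nsim, mean = xi * w + eta, sd = 1 / sqrt(tau))

# Proportion of draws at or below eta; this should be close to q
p_below <- mean(nu <= eta)
print(round(p_below, 3))
```

Repeating the check with other values of q should give a proportion close to the chosen q in each case.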

References
Alhamzawi R (2012) R Package ‘Brq’, Bayesian Analysis of Quantile Regression Models.
https://cran.r-project.org/web/packages/Brq/Brq.pdf
Alhamzawi R, Yu K, Pan J (2011) Prior elicitation in Bayesian quantile regression for longitudinal
data. Journal of Biometrics and Biostatistics, 2, 115.
Baladandayuthapani V, Mallick B, Carroll R (2005) Spatially adaptive Bayesian penalized regression
splines (P-splines). Journal of Computational and Graphical Statistics, 14, 378–394.
Banerjee S, Ghosal S (2014) Bayesian variable selection in generalized additive partial linear models.
Stat, 3(1), 363–378.
Belitz C, Lang S (2008) Simultaneous selection of variables and smoothing parameters in structured
additive regression models. Computational Statistics & Data Analysis, 53, 61–81.
Benoit D, Van den Poel D (2012) Binary quantile regression: A Bayesian approach based on the
asymmetric Laplace distribution. Journal of Applied Econometrics, 27(7), 1174–1188.
Benoit D, Van den Poel D (2014) bayesQR: A Bayesian approach to quantile regression. Journal of
Statistical Software, 76(7). https://www.jstatsoft.org/article/view/v076i07
Benoit D, Van den Poel D (2017) bayesQR: A Bayesian approach to quantile regression. Journal of
Statistical Software, 76(7). https://www.jstatsoft.org/article/view/v076i07
Berry S, Carroll R, Ruppert D (2002) Bayesian smoothing and regression splines for measurement
error problems. Journal of the American Statistical Association, 97, 160–169.
Berzuini C, Larizza C (1996) A unified approach for modeling longitudinal and failure time data,
with application in medical monitoring. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 18(2), 109–123.
Biller C (2000) Adaptive Bayesian regression splines in semiparametric generalized linear models.
Journal of Computational and Graphical Statistics, 9, 122–140.
Biller C, Fahrmeir L (1997) Bayesian spline-type smoothing in generalized regression models.
Computational Statistics, 12, 135–151.
Bondell H, Reich B, Wang H (2010) Noncrossing quantile regression curve estimation. Biometrika,
97(4), 825–838.
Borsuk M, Stow C (2000) Bayesian parameter estimation in a mixed-order model of BOD decay.
Water Research, 34, 1830–1836.
Brezger A, Lang S (2006) Generalized structured additive regression based on Bayesian P-splines.
Computational Statistics and Data Analysis, 50, 967–991.
Brezger A, Steiner W (2008) Monotonic regression based on Bayesian P-splines: An application to
estimating price response functions from store-level scanner data. Journal of Business & Economic
Statistics, 26, 90–104.
Brumback B, Ruppert D, Wand M (1999) Variable selection and function estimation in additive
nonparametric regression using a data-based prior: Comment. Journal of the American Statistical
Association, 94, 794–797.
Brumback BA, Rice JA (1998) Smoothing spline models for the analysis of nested and crossed
samples of curves. Journal of the American Statistical Association, 93(443), 961–976.
Cade B, Noon B (2003) A gentle introduction to quantile regression for ecologists. Frontiers in Ecology
and the Environment, 1(8), 412–420.
Cai Y, Jiang T (2015) Estimation of non-crossing quantile regression curves. Australian & New Zealand
Journal of Statistics, 57, 139–162.
Chen X, Ender P, Mitchell M, Wells C (2003) Regression with Stata, from
http://www.ats.ucla.edu/stat/stata/webbooks/reg/default.htm
Chen Z (1993) Fitting multivariate regression functions by interaction spline models. Journal of the
Royal Statistical Society, Series B, 55, 473–491.
Chib S, Greenberg E (2013) On conditional variance estimation in nonparametric regression. Statistics
and Computing, 23(2), 261–270.
Chib S, Jeliazkov I (2006) Inference in semiparametric dynamic models for binary longitudinal data.
Journal of the American Statistical Association, 101(474), 685–700.
Congdon P (2006) A model framework for mortality and health data classified by age, area, and time.
Biometrics, 62(1), 269–278.
Congdon P (2017) Quantile regression for overdispersed count data: A hierarchical method. Journal
of Statistical Distributions and Applications, 4, 18.
Connolly SR, Thibaut LM (2012) A comparative analysis of alternative approaches to fitting species-
abundance models. Journal of Plant Ecology, 5(1), 32–45.
Cottet R, Kohn R, Nott D (2008) Variable selection and model averaging in semiparametric
overdispersed generalized linear models. Journal of the American Statistical Association, 103, 661–671.
Coull B, Ruppert D, Wand M (2001) Simple incorporation of interactions into additive models.
Biometrics, 57, 539–545.
Currie I, Durban M (2002) Flexible smoothing with P-splines: A unified approach. Statistical Modelling,
2, 333–349.
Deb P, Trivedi PK (1997) Demand for medical care by the elderly: A finite mixture approach. Journal
of Applied Econometrics, 12(3), 313–336.
Denison DG, Mallick BK, Smith AF (1998) Bayesian MARS. Statistics and Computing, 8(4), 337–346.
Denison D, Holmes C, Mallick B, Smith A (2002) Bayesian Methods for Non-linear Classification and
Regression. John Wiley, Chichester, UK.
Dias R, Gamerman D (2002) A Bayesian approach to hybrid splines non-parametric regression.
Journal of Statistical Computation and Simulation, 72, 285–298.
Dunham JB, Cade BS, Terrell JW (2002) Influences of spatial and temporal variation on
fish-habitat relationships defined by regression quantiles. Transactions of the American Fisheries
Society, 131(1), 86–98.
Durban M, Currie I, Eilers P (2006) Multidimensional P-spline mixed models: A unified approach to
smoothing on large grids. Working Paper, Department of Statistics, Universidad Carlos III de
Madrid, Spain. http://www.unavarra.es/metma3/Papers/Invited/Durban.pdf
Eilers P, Marx B (1996) Flexible smoothing with B-splines and penalties. Statistical Science, 11, 89–121.
Eilers P, Marx B (2004) Splines, knots, and penalties. Working Paper. www.stat.lsu.edu/faculty/marx/
El Adlouni S, Salaou G, St-Hilaire A (2018) Regularized Bayesian quantile regression. Communications
in Statistics – Simulation and Computation, 47(1), 277–293.
Engle R, Granger C, Rice J, Weiss A (1986) Semiparametric estimates of the relation between weather
and electricity sales. Journal of the American Statistical Association, 81, 310–320.
Fahrmeir L, Knorr-Held L (2000) Dynamic and semiparametric models, pp 513–543, in Smoothing and
Regression: Approaches, Computation and Application, ed M Schimek. John Wiley.
Fahrmeir L, Lang S (2001) Bayesian inference for generalized additive mixed models based on
Markov random field priors. Journal of the Royal Statistical Society C, 50, 201–220.
Fahrmeir L, Tutz G (2001) Multivariate Statistical Modeling Based on Generalized Linear Models. Springer,
Berlin.
Friedman J (1991) Multivariate adaptive regression splines. Annals of Statistics, 19, 1–67.
Fuzi M, Jemain A, Ismail N (2016) Bayesian quantile regression model for claim count data. Insurance:
Mathematics and Economics, 66, 124–137.
Gelman A, Stern H, Carlin J, Dunson D, Vehtari A, Rubin D (2014) Bayesian Data Analysis, 3rd Edition.
Chapman and Hall/CRC.
Geraci M, Bottai M (2006) Quantile regression for longitudinal data using the asymmetric Laplace
distribution. Biostatistics, 8(1), 140–154.
Gustafson P (2000) Bayesian regression modelling with interactions and smooth effects. Journal of the
American Statistical Association, 95, 795–806.
Hastie T, Tibshirani R (1993) Varying coefficient models. Journal of the Royal Statistical Society B, 55,
757–796.
Hooper P (2001) Flexible regression modeling with adaptive logistic basis functions. Canadian Journal
of Statistics, 29, 343–378.
James G, Hastie T, Sugar C (2000) Principal component models for sparse functional data. Biometrika,
87, 587–602.
Jerak A, Lang S (2005) Locally adaptive function estimation for binary regression models. Biometrical
Journal, 47, 151–166.
Kharratzadeh M (2017) Splines in Stan.
https://mc-stan.org/users/documentation/case-studies/splines_in_stan.html
Kitagawa G, Gersch W (1996) Smoothness Priors Analysis of Time Series. Springer Verlag, New York.
Klein N, Kneib T, Lang S (2015) Bayesian generalized additive models for location, scale, and shape
for zero-inflated and overdispersed count data. Journal of the American Statistical Association,
110(509), 405–419.
Knorr-Held L (1999) Conditional prior proposals in dynamic models. Scandinavian Journal of Statistics,
26, 129–144.
Koenker R (2005) Quantile Regression. Cambridge University Press, Cambridge, UK.
Kohler M, Umlauf N, Beyerlein A, Winkler C, Ziegler A-G, Greven S (2016) Flexible Bayesian additive
joint models with an application to type 1 diabetes research. arXiv preprint arXiv:1611.01485
Kohn R, Schimek M, Smith M (2000) Spline and kernel regression for dependent data, Chapter 6,
pp 135–158, in Smoothing and Regression Approaches, Computation and Estimation, ed M Schimek.
John Wiley.
Kohn R, Smith M, Chan D (2001) Nonparametric regression using linear combinations of basis
functions. Statistics and Computing, 11, 313–322.
Konishi S, Ando T, Imoto S (2004) Bayesian information criteria and smoothing parameter selection
in radial basis function networks. Biometrika, 91, 27–43.
Koop G, Poirier D (2004) Bayesian variants of some classical semiparametric regression techniques.
Journal of Econometrics, 123, 259–282.
Koop G, Tole L (2004) Measuring the health effects of air pollution: To what extent can we really say
that people are dying from bad air? Journal of Environmental Economics and Management, 47,
30–54.
Koop GM (2003) Bayesian Econometrics. John Wiley & Sons Inc.
Kottas A, Gelfand AE (2001) Bayesian semiparametric median regression modeling. Journal of the
American Statistical Association, 96(456), 1458–1468.
Kozumi H, Kobayashi G (2011) Gibbs sampling methods for Bayesian quantile regression. Journal of
Statistical Computation and Simulation, 81(11), 1565–1578.
Krivobokova T, Crainiceanu CM, Kauermann G (2008) Fast adaptive penalized splines. Journal of
Computational and Graphical Statistics, 17, 1–20.
Lang S, Brezger A (2004) Bayesian P-splines. Journal of Computational and Graphical Statistics, 13,
183–212.
Lee R, Carter L (1992) Modeling and forecasting U.S. mortality. Journal of the American Statistical
Association, 87, 659–675.
Lenk P (1999) Bayesian inference for semiparametric regression using a Fourier representation.
Journal of the Royal Statistical Society B, 61, 863–879.
Liu X, Wang L, Liang H (2011) Estimation and variable selection for semiparametric additive partial
linear models. Statistica Sinica, 21(3), 1225.
Machado J, Santos Silva J (2005) Quantiles for counts. Journal of the American Statistical Association,
100(472), 1226–1237.
MacNab Y, Gustafson P (2007) Regression B-spline smoothing in Bayesian disease mapping: with an
application to patient safety surveillance. Statistics in Medicine, 26, 4455–4474.
Marra G, Wood S (2011) Practical variable selection for generalized additive models. Computational
Statistics & Data Analysis, 55(7), 2372–2387.
Mayr A, Fenske N, Hofner B, Kneib T, Schmid M (2012) Generalized additive models for location,
scale and shape for high dimensional data—A flexible approach based on boosting. Journal of
the Royal Statistical Society: Series C (Applied Statistics), 61(3), 403–427.
McElreath R (2016) Statistical Rethinking: A Bayesian Course with Examples in R and Stan. Chapman &
Hall/CRC.
Meyer K (2005) Random regression analyses using B-splines to model growth of Australian Angus
cattle. Genetics Selection Evolution, 37, 473–500.
Meyer R, Millar B (1998) Bayesian stock assessment using a nonlinear state-space model, in Statistical
Modeling, eds B Marx, H Friedl, Proceedings of the 13th International Workshop on Statistical
Modelling, New Orleans, pp 284–291.
Min I, Kim I (2004) A Monte Carlo comparison of parametric and nonparametric quantile
regressions. Applied Economics Letters, 11(2), 71–74.
Natario I, Knorr-Held L (2003) Non-parametric ecological regression and spatial variation. Biometrical
Journal, 45, 670–688.
Neuhaus JM, McCulloch CE (2006) Separating between- and within-cluster covariate effects by using
conditional and partitioning methods. Journal of the Royal Statistical Society: Series B (Statistical
Methodology), 68(5), 859–872.
Ngo L, Wand M (2004) Smoothing with mixed model software. Journal of Statistical Software, 9(1), 1–54.
Panagiotelis A, Smith M (2008) Bayesian identification, selection and estimation of functions in high-
dimensional additive models. Journal of Econometrics, 143, 291–316.
Pham TH, Wand MP (2015) Generalized Additive Mixed Model Analysis via gammSlice.
http://www.matt-wand.utsacademics.info/PhamWand.pdf
Qian S, Reckhow K, Zhai J, McMahon G (2005) Nonlinear regression modeling of nutrient loads in
streams: A Bayesian approach. Water Resources Research, 41, W07012. doi:10.1029/2005WR003986
Rahman M (2016) Bayesian quantile regression for ordinal models. Bayesian Analysis, 11(1), 1–24.
Ruppert D, Wand M, Carroll R (2003) Semiparametric Regression. Cambridge University Press.
Sanchez LB, Galarza CE, Lachos VH (2017) R Package ‘ALDqr’, Quantile Regression Using
Asymmetric Laplace Distribution. https://cran.r-project.org/web/packages/ALDqr/ALDqr.pdf
Scheipl F (2011) spikeSlabGAM: Bayesian variable selection, model choice and regularization for
generalized additive mixed models in R. Journal of Statistical Software, 43(14), 1–24.
Scheipl F, Fahrmeir L, Kneib T (2012) Spike-and-slab priors for function selection in structured
additive regression models. Journal of the American Statistical Association, 107(500), 1518–1532.
Schennach SM (2005) Bayesian exponentially tilted empirical likelihood. Biometrika, 92, 31–46.
Shively TS, Kohn R, Wood S (1999) Variable selection and function estimation in additive
nonparametric regression using a data-based prior. Journal of the American Statistical Association,
94(447), 777–794.
Silva G, Dean C, Niyonsenga T, Vanasse A (2008) Hierarchical Bayesian spatiotemporal analysis of
revascularization odds using smoothing splines. Statistics in Medicine, 27, 2381–2401.
Smith M, Kohn R (1996) Nonparametric regression using Bayesian variable selection. Journal of
Econometrics, 75, 317–344.
Smith M, Kohn R (1997) A Bayesian approach to nonparametric bivariate regression. Journal of the
American Statistical Association, 92, 1522–1535.
Smith M, Wong C-M, Kohn R (1998) Additive nonparametric regression with autocorrelated errors.
Journal of the Royal Statistical Society, Series B, 60, 311–331.
Tutz G, Reithinger F (2007) A boosting approach to flexible semiparametric mixed models. Statistics
in Medicine, 26, 2872–2900.
Umlauf N, Klein N, Zeileis A, Koehler M (2016) BAMLSS: Bayesian additive models for location scale
and shape (and beyond). Working Papers in Economics and Statistics, 2017–04, University of
Innsbruck.
Wahba G (1983) Bayesian confidence intervals for the cross validated smoothing spline. Journal of the
Royal Statistical Society, Series B, 45, 133–150.
Wand M (2003) Smoothing and mixed models. Computational Statistics, 18, 223–249.
West M, Harrison P (1997) Bayesian Forecasting and Dynamic Models, 2nd Edition. Springer-Verlag,
New York.
Wood S (2006) Generalized Additive Models: An Introduction with R. CRC Press.
Wood S (2008) Fast stable direct fitting and smoothness selection for generalized additive models.
Journal of the Royal Statistical Society, Series B, 70(3), 495–518.
Wood S (2016) Just another Gibbs additive modeller: Interfacing JAGS and mgcv. Journal of Statistical
Software, 75(7). doi:10.18637/jss.v075.i07
Wood S, Kohn R (1998) A Bayesian approach to robust nonparametric binary regression. Journal of the
American Statistical Association, 93, 203–213.
Wood S, Pya N, Safken B (2016) Smoothing parameter and model selection for general smooth
models. Journal of the American Statistical Association, 111(516), 1548–1563.
Wood SN (2003) Thin plate regression splines. Journal of the Royal Statistical Society: Series B (Statistical
Methodology), 65(1), 95–114.
Wood SN, Augustin NH (2002) GAMs with integrated model selection using penalized regression
splines and applications to environmental modelling. Ecological Modelling, 157(2–3), 157–177.
Wu H, Zhang JT (2006) Nonparametric Regression Methods for Longitudinal Data Analysis: Mixed-Effects
Modeling Approaches, Vol. 515. John Wiley & Sons.
Xi R, Li Y, Hu Y (2016) Bayesian quantile regression based on the empirical likelihood with spike and
slab priors. Bayesian Analysis, 11(3), 821–855.
Yau P, Kohn R (2003) Estimation and variable selection in nonparametric heteroscedastic regression.
Statistics and Computing, 13, 191–208.
Yau P, Kohn R, Wood S (2003) Bayesian variable selection and model averaging in high dimensional
multinomial nonparametric regression. Journal of Computational and Graphical Statistics, 12,
23–54.
Yu K, Moyeed RA (2001) Bayesian quantile regression. Statistics & Probability Letters, 54(4), 437–447.
Yue Y, Hong H (2012) Bayesian Tobit quantile regression model for medical expenditure panel
survey data. Statistical Modelling, 12(4), 323–346.
Yue Y, Speckman P, Sun D (2012) Priors for Bayesian adaptive spline smoothing. Annals of the Institute
of Statistical Mathematics, 64(3), 577–613.
Zeileis A, Kleiber C, Jackman S (2008) Regression models for count data in R. Journal of Statistical
Software, 27(8), 1–25.
Index

AAPC, see Area APC model Autocorrelation, 14, 20, 21, 173, 282, 283, 422,
Abrams, K., 107 427–428
Absolute risk difference (ARD), 133 Autoregression parameters, 30
Accelerated failure time (AFT) model, Autoregressive (AR) model, 166–172, 193, 204,
478, 481, 490 288–290, 418
ACP, see Autoregressive conditional Poisson low order, 169–170
models random coefficient, 168–169
Adaptive non-parametric Autoregressive conditional Poisson (ACP)
regression, 541–543 models, 189, 189–191, 190
AFT, see Accelerated failure time model Autoregressive moving average (ARMA)
Age-area interactions, 451–452 models, 165, 167–168, 173, 200, 202–204,
Age-period-cohort (APC) model, 447, 448 216, 282, 381
AIC, see Akaike Information Criterion Auxiliary momentum vector, 18
AICcmodavg, 59
Air passenger data, 181–182 Baker, R., 109
Aitchison, J., 142 Bamlss, see Bayesian Additive Models for
Aitkin, I., 346 Location, Scale, and Shape
Aitkin, M., 346 Banerjee, S., 215
Akaike Information Criterion (AIC), 71, BARMA, see Binary autoregressive moving
72, 73, 137 average models
Albert, J., 10, 332 Barnard, J., 412
Albert, J. H., 131 Barry, R., 67, 69
Alcohol effect, 304–305 Baseball salary data, 280–281
ALD, see Asymmetric Laplace distribution Baseline fixed effects model, 426
Anaesthesia, 110 Basic structural model (BSM), 178–179
Analysis of variance (ANOVA), categorical Basu, S., 139
predictors and, 259–263 Bayarri, M., 92
Ando, T., 143 Bayes approach, 1, 80, 187, 340
ANOVA, see Analysis of variance Bayes factor, 64, 68–71, 78, 86, 87, 88, 344
Antedependence model, 170–172, 204–205 Bayes formula, 2–3, 62
APC, see Age-period-cohort model Bayesian Additive Models for Location, Scale,
Approximate Bayesian bootstrap method, 455 and Shape (Bamlss), 45
Approximate methods, 60 Bayesian chi-square method, 90, 96–97
AR, see Autoregressive model Bayesian general linear models, 270
ARD, see Absolute risk difference Bayesian hierarchical methods, 103
Area APC (AAPC) model, 447 Bayesian Information Criterion (BIC), 60, 71, 72,
ARMA, see Autoregressive moving average 73, 77, 137, 280
models Bayesian Macroeconometrics (BMR), 166
ARMA-GLM model, 189 Bayesian spatial predictor selection models,
Asparouhov, T., 332, 344 293–296
Assuncao, R., 30 Bayesian spatial smoothing, 225
Asymmetric Laplace distribution (ALD), 110, Bayesian variable selection algorithms, 60
553, 554, 555 BayesMixSurv package, 481
Attitudes to science, 362–363 Bayes’ theorem, 6
Augmented data likelihood, see Complete data BayesVarSel, 59
likelihood BayesXsrc package, 546
Augmented data multilevel models, 324–325 Bazan, J., 370
Augmented data representations, 55 BCG vaccine, 119–120

Beath, K., 109 finite mixtures of standard densities,


Bernoulli likelihood, 327 136–137
Berzuini, C., 179 inference in mixture models, 137–141
Besag, J., 30, 80, 222, 227, 373, 376, 447 logistic-normal, 142–144
Best, N., 243 model types, 141–142
Beta-binomial regression model, 267 heterogeneity in count data, 121–126
Beta-binomial representation, 134 non-conjugate poisson mixing, 124–126
Beta-carotene plasma, 533–536 hierarchical priors using continuous
Betancourt, M., 140, 143 mixtures, 105–106
BGR, see Brooks-Gelman-Rubin statistics multivariate meta-analysis, 116–121
“Bias-variance trade-off,” 104 normal-normal hierarchical model and
BIC, see Bayesian Information Criterion applications, 106–111
Binary autoregressive moving average meta-regression, 110–111
(BARMA) models, 191–193, 193 overview, 103–105
Binary response analysis, 327 prior for second stage variance, 111–116
Binary selection indicators, 254 non-conjugate priors, 113–116
Binary work trip data, 558 semiparametric modelling, 144–153
Binomial and multinomial mixture methods, polya tree priors, 149–153
126–134 specifying baseline density, 146–148
Dirichlet parameters, 130–131 truncated Dirichlet processes and
ecological inference, 131–134 stick-breaking priors, 148–149
non-conjugate priors, 128–130 Box–Cox power law, 171
Binomial-beta model, 106, 131 Box–Jenkins methods, 173, 202
Binomial disease prevalence data, 240 Breast cancer recurrence, 132–133
Binomial logit analysis, 83 Brexit voting, 295
Binomial logit-normal (BLN) representation, 134 Bridge sampling estimates, 63–65
Binomial meta-analysis, 132–133 bridgesampling package, 59, 66, 70, 95–96
Binomial regression, 267–269 brms package, 321
Birthweight and maternal age, 551–552 Brockmann, H., 261
Bitcoin price, 195–197 Brooks, S., 25
Bivariate density, 12 Brooks-Gelman-Rubin (BGR) statistics, 296
Bivariate exponential model, 503 Brown, H., 85
Bivariate Gumbel model, 503 BSM, see Basic structural model
Bivariate normal model, 119–120 B-splines, 236, 529–530, 533, 548
Bivariate Student t model, 120 Buck, C., 88
Bivariate survival models, 502, 506 BUGS coding, 45, 46–47, 51, 55, 139, 152, 308, 380,
blavaan package, 344 480, 558
Bliese, P., 327 Bulmer, M., 124
BLN, see Binomial logit-normal representation Burr, D., 152
Blood level readings, 228–229 BYM model, 228, 229, 233
BMI, see Body mass index
BMR, see Bayesian Macroeconometrics Cai, B., 81, 82
Body mass index (BMI), 304 Calcagno, V., 273
Borrowing strength, 103–156 Calder, C., 238
binomial and multinomial mixture methods, California Education Department, 542
126–134 Cancer survival, 499–500
Dirichlet parameters, 130–131 Canonical dynamic model, 427
ecological inference, 131–134 Capital asset pricing model, 422–423
non-conjugate priors, 128–130 CAR, see Conditional autoregressive
discrete mixtures and semiparametric CARBayes package, 224, 232, 380
smoothing methods, 134–144 Cargnoni, C., 384
finite mixtures of parametric densities, Carlin, B., 88, 178
135–136 Carter–Lee mortality forecasting model, 178
Carvalho, C., 84 Gibbs sampling, 17–18


Carvedilol, 85 Hamiltonian Monte Carlo (HMC) algorithm,
Casella, G., 30 18–19
Castellanos, M., 92 hierarchical Bayes applications, 5–8
Causal mediation effects, 302 Latent Gaussian models, 19–20
Cause-specific and subdistribution hazards, Metropolis–Hastings (M–H) sampling, 14–17
513–514 Metropolis sampling, 8–9
CDE, see Controlled direct effect overview, 1–2
Celeux, G., 139 posterior inference
Cepeda, E., 415, 417, 418 from Bayes formula, 2–3
CFA, see Confirmatory factor analysis Markov chain Monte Carlo (MCMC)
Chan, K., 133, 189 sampling, 3–5
Chatuverdi, A., 170 Conceptions data, 546
Chelgren, N., 115 Conditional autoregressive (CAR) approach,
Chen, M-H., 68, 283, 491, 492 290–291, 295
Chen, R-B., 255 Conditional autoregressive (CAR) priors,
Chen, Z., 81, 425, 426 221–227, 229, 233, 349, 376
Chib, S., 30, 63, 147, 167, 332, 355, 362, 421 alternative conditional priors, 223–225
Chintagunta, P., 423 ICAR(1) and convolution priors, 226–227
Choi, J., 293, 295 linking conditional and joint specifications,
Cholesky decomposition methods, 121, 240, 369, 222–223
412, 413, 417 Conditional likelihood, 6, 74, 128, 140, 345,
Cholesky factor correlation matrix 405, 407
approach, 417 Conditional linear model, 460
Cholesky parameterisation, 386 Conditional spatial variance, 227–229
Christiansen, C., 113, 124, 125, 126 Confirmatory factor analysis (CFA), 342,
Chronic disease, 379–380 349, 350
Chronic obstructive pulmonary disease Confirmatory factor model, 348
(COPD), 240 Congdon, P., 230, 234, 396, 498
Clark, A., 230 Conjugate cluster effects, 325–328
Clayton, D., 179, 447 Conjugate linear normal model, 319
Clifford–Hammersley theorem, 7 Conjugate priors, 26, 136, 187, 270, 325, 332, 406
Clinical meta-analysis, 106, 107, 108, 109 Conlon, E., 112
Clinical trial, 85, 129 Controlled direct effect (CDE), 300
Cluster labelling issues, 137–138 Conventional ANOVA, 260
Clutch random effects, 17 Conventional discrete mixture methods, 199
Cocaine use, 460–461 Conventional Monte Carlo methods, 3, 4
Common factor model, 391–392 COPD, see Chronic obstructive pulmonary
Competing risks (CR), 507–514 disease
modelling frailty, 509–514 Correlation matrix, 55, 120, 292, 322, 357, 369,
Complete data likelihood, 6, 7, 67, 73, 345–346 389, 395, 412, 416, 417, 426, 427
Complex data, Bayesian methods for, 1–37 Counting process functions, 474–475
assessing efficiency and convergence, 20–25 Covariance functions, 237–238
hierarchical model parameterisation, Covariance matrix, 80, 117, 118, 131, 170, 215, 223,
22–24 323, 340, 357, 378, 383, 385, 387
multiple chain methods, 24–25 Covariance matrix decomposition, 82
choice of prior density, 25–31 Covariance regression model, 418
assessing posterior sensitivity, 27–29 Covariance selection, 80, 82, 85, 86
including evidence, 26–27 Cox model, 485, 513
selection problems in hierarchical Bayes Cox regression, 512
models, 29–31 CR, see Competing risks
choice of proposal density, 9–10 Crime data, 361–362, 371
full conditional densities, 10–14 Crime type and ethnicity, 333–334
Crossed random effects, 328–331 logistic-normal, 142–144


Cross-validation, 61, 87, 88, 91, 92, 94, 97–98 model types, 141–142
Crowder, M. J., 83 Discrete time hazard models, 494–502
Cubic splines, 236, 536, 543, 545 life tables, 496–502
Cumulative density function, 90 Discretion in trade policy, 276
Cumulative hazard specifications, 484–486 Disease risk, 221
Cure rate models, 490–494 Diserud, O., 124
Doss, H., 152
Dangl, T., 284 DP, see Dirichlet process
Daniels, M., 25, 113, 319, 325, 331, 332, 414 DPMmeta, 152
Data augmentation, 7, 27, 273, 279, 324–325, 558 DPP, see Dirichlet process prior
Data manipulation, 45 DPpackage, 103, 145, 152
Davidian, M., 406 Drton, M., 143
DBP, see Diastolic blood pressure Druyts, E., 111
Dean, C., 326 DuMouchel, W., 113
Deely, N., 104 Dunson, D., 81, 82, 148, 360
de la Horra, J., 89 Durbin-Watson (DW) statistics, 283, 422
de Leeuw, J., 321 Dynamic factor analysis, 386–388
de Mazancourt, C., 273 Dynamic generalised linear models, 166,
DerSimonian, R., 110 186, 284
DeSarbo, W., 62 Dynamic linear models, 165, 166, 174, 175, 198,
Desipramine, 460–461 284, 387, 428
Deviance information criterion (DIC), 59, 61, Dynamic longitudinal models, 427–432
72–73, 78, 218, 219, 220, 258, 380 for discrete data, 429–432
Diabetes progression, 257–259
Diabetic Retinopathy Study, 506 Earthquake locations, 240–241
Diastolic blood pressure (DBP), 120 Earthquake magnitudes, 244
DIC, see Deviance information criterion EDF, see Empirical distribution function
Diebolt, N., 135, 137 EFA, see Exploratory factor analysis
Differential item functioning (DIF), 359, 369 Elementary school attainment, 542–543
Digestive tract decontamination, 152 Elicitation techniques, 26, 27
Diggle, P., 238 El Niño/Southern Oscillation (ENSO), 167–168
Direct treatment effects, 302 ELPD, see Expected log posterior predictive
Dirichlet process (DP), 138, 144–145, 147, density
155–156, 333, 437 Empirical distribution function (EDF), 7
Dirichlet process mixture median Engen, S., 124
regression, 553 ENSO, see El Niño/Southern Oscillation
Dirichlet process prior (DPP), 371 Epileptic seizure data, 285–286, 432, 441–442
Dirichlet weight mixture, 29 Escobar, M., 152
Discrete convolution models, 241–244 EU Referendum voting, 295–296
Discrete kernel approach, 244 Expected log posterior predictive density
Discrete mixture, 142 (ELPD), 75, 76
approach, 254, 279 Exploratory factor analysis (EFA), 342, 347, 353
model, 74, 103, 136, 230–231, 436–442 Exponentially tilted empirical likelihood, 553
parametric, 60, 135, 437 Extended logistic model, 13–14, 32–34
regression, 277, 280–281 Extreme values and outliers, 133–134
and semiparametric smoothing methods, Eye tracking data, 152–153
134–144
finite mixtures of parametric densities, Factor models
135–136 normal linear structural equation and,
finite mixtures of standard densities, 340–346
136–137 definition, 343–345
inference in mixture models, 137–141 forms, 342–343
marginal and complete data likelihoods, multivariate binary data, 357–359


345–346 multivariate count data, 355–356
Markov chain Monte Carlo (MCMC) Generalised linear model (GLM), 45, 166, 186,
sampling, 345–346 264, 277, 282, 284, 525, 552
robust density assumptions in, 370–373 Generalised MCAR (GMCAR) model, 377
spatial, 379–380 Generalised Poisson model, 125
Factor score identifiability, 346–354 General linear mixed model (GLMM), 21, 81, 91,
Fahrmeir, L., 27, 429 147, 322–324, 325, 331, 433, 443, 490, 498
Fernandez, C., 230 for longitudinal data, 406–418
First maternity age, 493–494 centred/non-centred priors, 408–409
Fisher information matrix, 60 multiple sources of error variation, 415–418
Fisher z-transformed correlation, 118 random covariance matrix and effect
FitARMA package, 168 selection, 411–415
Fixed effect model, 70, 78, 103, 319, 527 unit level random effects, 409–411
Fixed regression effects, 21, 26, 320, 336, 407, General model averaging approach, 527
424, 445 Genton, M., 118
Follicular cell lymphoma data, 513–514 Geographically weighted linear regression
Formal model choice, 59–61, 253 (GWR), 291–292
Fotouhi, A., 429 George, E., 29, 106
Friel, N., 66, 67, 77 Geostatistical models, 215, 238
Fruhwirth-Schnatter, S., 65, 81, 82, 83, 86, 179 German General Social Survey, 493
Full conditional density, 9, 10–14, 16, 24, 62 Geweke, J., 198, 428
Geyer, C., 10
Galaxy data, 143–144 GGT, see Gamma-glutamyl transpeptidase
Gamerman, D., 319, 415, 417, 418 Ghosh, K., 178
Gamma-glutamyl transpeptidase (GGT), Ghosh, M., 260
304, 305 Ghosh, S., 392
Gamma parameters, 103, 481, 506 “Ghosting,” 91
Gao, S., 129 Gibbs sampling, 17–18, 24, 109, 136, 146, 214, 321,
GARCH, see Generalised autoregressive 331, 333, 415
conditional heteroscedastic models Giltinan, D., 406
GARMA, see Generalised autoregressive GLM, see Generalised linear model
moving average representation GLMM, see General linear mixed model
Gatsonis, C., 325, 331, 332 Global sea level change, 183–184
Gaussian Markov random field (GMRF), GMCAR, see Generalised MCAR model
238–239 GMRF, see Gaussian Markov random field
Gaussian process, 231, 234, 236, 241–242, 243 Goldstein, H., 320
GCSE, see General Certificate of Secondary Gotway, C., 228
Education G-prior method, 273
GDP and female education, 539–541 Graphical analysis, 68
Gelfand, A., 87, 215, 231, 392 Green, P., 135, 136, 149, 230
Gelman, A., 25, 114, 115 Greenberg, E., 167
Gelman–Rubin scale reduction factors, 14, 24 Griddy–Gibbs technique, 195
General Certificate of Secondary Education Griffin, J., 231
(GCSE), 546 Growth model simulation, 416–417
Generalised autoregressive conditional Gupta, A. K., 131
heteroscedastic (GARCH) models, 171, Gustafson, P., 25, 112
172, 194, 195 GWR, see Geographically weighted linear
Generalised autoregressive moving average regression
(GARMA) representation, 188–189
Generalised linear factor models, 354–369 Half Cauchy priors, 114, 115, 116, 196, 256,
categorical data, 360–369 369, 417
latent scale IRT models, 359–360 Halling, M., 284
Halverson, R., 327 Horseshoe crabs, 261–263
Hamilton, B., 147 Horseshoe prior, 84, 110, 192, 256, 257, 258, 298
Hamiltonian dynamics, 18 Hospital infection, 510–513
Hamiltonian Monte Carlo (HMC), 2, 51, 55 Hospital mortality, 125–126, 135
algorithm, 18–19 Howley, P., 127
scheme, 49 HQRPLN, see Hierarchical quantile regression
Hanson, T., 135, 148 using the Poisson lognormal method
Harun, N., 81 Hu, J., 148
Harvey, A., 384 Hudson Bay Company, 384
HB, see Hierarchical Bayes Hurn, M., 136
Held, L., 376 Hwang, R., 415
Herring, A., 81, 86, 360, 491 Hypertension treatments, 120–121
Heteroscedasticity, 541–543 Hypertension trial, 85–86
and generalised error densities, 433–442
discrete mixture models, 436–442 Ibrahim, J., 88, 147, 148, 283, 491
and regression heterogeneity, 276–282 ICAR(1) and convolution priors, 226–227
nonconstant error variances, 276–277 Identifiability issues, 348–354
varying effects using discrete mixtures, Imai, K., 302
277–278 IMD, see Index of multiple deprivation
zero-inflated Poisson (ZIP), 278–282 Immigration policy, 303–304
Hierarchical Bayes (HB) applications, 5–8 IMPS, see Inpatient multidimensional
Hierarchical Bayes (HB) model, 16, 29–31 psychiatric scale data
Hierarchical Bayes (HB) setting, 6, 7 Index of multiple deprivation (IMD), 218, 546
Hierarchical binary regression, 16–17 Infection treatment trial, 505–506
Hierarchical linear normal model, 82 INLA, see Integrated nested Laplace
Hierarchical methods, 525–560 approximation
general additive methods, 543–546 INLABMA package, 218, 224
heteroscedasticity, 541–543 Inpatient multidimensional psychiatric scale
multivariate basis function regression, (IMPS) data, 461–462
536–541 Integrated likelihood approach, 6, 7, 67, 73
non-parametric basis function models, Integrated nested Laplace approximation
526–536 (INLA), 20, 45–46, 238, 240
mixed model splines, 527–529 Integrated WAIC (iWAIC), 93, 94
model selection, 532–536 International Social Survey Program (ISSP), 362
other than truncated polynomials, 529–532 Intraclass correlation, 418
non-parametric regression methods, 546–552 Inverse chi-squared density, 111
overview, 525–526 Inverse probability of treatment weighted
quantile regression, 552–559 (IPTW) approach, 306, 308
non-metric responses, 554–559 Inverse Wishart distribution, 412
Hierarchical model parameterisation, 22–24 Inverse Wishart prior, 319, 323, 424
Hierarchical Poisson models, 121–126 IPTW, see Inverse probability of treatment
Hierarchical quantile regression using the weighted approach
Poisson lognormal (HQRPLN) IRT, see Item response theory
method, 558, 559, 560 Ishwaran, H., 143, 148, 149, 371
Hierarchical regression models, 225 ISSP, see International Social Survey Program
Hierarchical smoothing methods, 104 Item response theory (IRT), 348, 363
Histogram smoothing, 131, 143 models, 359–360
Histogram technique, 27 iWAIC, see Integrated WAIC
HMC, see Hamiltonian Monte Carlo
Hobert, J., 30 Jackson, D., 109
Hoff, P., 142, 143 Jacobian adjustments, 13
Hoijtink, H., 91 JAGS coding, 45, 46, 47–49, 55, 78, 236, 480,
Hong, H., 554 558, 560
jagsUI package, 2, 48, 67, 78, 85, 93, 94, 115, 121, Latent growth curve models, 436, 437
140, 144, 154–155, 184, 202, 353, 362, 371, Latent regression vs differential item
395, 431, 440, 451, 534 functioning, 366–369
James, L., 143, 148 Latent trait longitudinal models, 445–446
Jansen, M., 326 Laud, P., 88
Jarque–Bera test, 91, 115 Lawson, A., 230, 293, 295
Jeffreys, H., 26, 70 Leave-one-out information criterion (LOO-IC), 47,
Jeliazkov, I., 30, 63, 421 61, 75–77, 78, 116, 143, 168, 172, 182, 190,
Jin, X., 377, 378 192, 196, 197, 201, 228–229, 232, 233, 234,
Job applicant data, 352–354 258, 267, 287, 298, 299, 304, 321, 336, 363,
Joe, H., 29, 54 369, 380, 385, 392, 418, 423, 485, 530, 540
Johnson, V., 90 Ledolter, J., 189
Joint density, 2, 10, 51, 61, 132, 169, 177, 216, 222, Lee, J., 415
223, 224, 239, 339, 375, 377, 453, 456, Lee, K., 118, 255
457, 458 Lee, K-J., 255
Joint posterior density, 4, 27 Lenk, P., 62, 142
Joint regression model, 417–418 Leroux, B., 224, 376
Jonsen, I., 29 Leroux global index, 232
Joreskog, K. G., 340 Lesage, J., 291
Jung, R. C., 283 Lewandowski, D., 29
Li, L., 91
Kaplan–Meier estimate, 477, 506, 510 Life tables, 496–502
Kashiwagi, N., 187 Lim, Y., 189, 283
Kass, R., 62, 70, 319, 415 Limiting long-term illness (LLTI), 218–221, 232
Kato, B., 91 Lin, T., 136
Keane, M., 428 Lindley–Smith model format, 320–322
Kernel density methods, 4, 62 Linear Bayes approach, 187
Kinney, S., 81 Linear co-regionalisation model, 378
Kleinman, K. P., 147, 148 Linear factor reduction model, 370
Knorr-Held, L., 27, 177, 230 Linear Gaussian state space model, 284
Kohn, R., 80, 81 Linear Gaussian transition model, 284
Kooperberg, C., 222 Linear regression, 46, 60, 300, 525, 542, 555
Koopman, S., 384 Little, R., 453
Kreft, I., 321 LLT, see Local linear trend
Kuk, A. Y., 441 LLTI, see Limiting long-term illness
Kumar, J., 170 Local level model, 175
Kuo, L., 80, 425, 426 Local linear trend (LLT), 382
Kurowicka, D., 29 Logistic-normal model, 128
Logistic regression, 27, 52–53, 255, 327, 331, 499
“Label switching,” 135, 138 Logit-binomial model, 93–95
Lag and error models, 288 Logit regression, 125, 265, 271, 272, 273, 278, 295,
Lagged count model, 432 297, 308, 327, 357, 363, 455, 541
Lagged earnings model, 430–431 Log likelihood ratio, 71
Laird, N., 110 Log-logistic model, 477, 493, 494
Lambert, P., 110 Log marginal likelihood, 60, 66, 68
Lancaster, T., 428 Log odds ratio (LOR), 133
Langevin random walk scheme, 10 Log posterior predictive density (LPPD), 75–76,
Laplace approximation, 3, 10, 20, 67 333, 334
Laplace methods, 86 Log relative risk (LRR), 133
LaplacesDemon package, 45 Longitudinal data, 405–462
Lasso prior, 256, 257, 258, 259, 481 categorical choice, 423–427
Lasso random effect models, 83 dynamic models, 427–432
Latent Gaussian models, 19–20 for discrete data, 429–432
general linear mixed models for, 406–418 algorithms, 1–2, 9, 21, 45, 47
centred/non-centred priors, 408–409 sampling, 3–5, 14, 20, 22, 24, 29–30, 31, 37, 46,
multiple sources of error variation, 67, 80, 91, 139, 176, 180, 221, 256, 261,
415–418 264, 270, 278, 324, 345–346, 349, 410, 423
random covariance matrix and effect Markov Poisson regression, 283
selection, 411–415 Markov random field (MRF), 30, 214
unit level random effects, 409–411 Marra, G., 237
heteroscedasticity and generalised error Marriott, J., 167
densities, 433–442 MARS, see Multivariate adaptive regression
discrete mixture models, 436–442 spline method
missing data, 452–462 Marshall, C., 91, 92, 93, 110, 112
common factor models, 455–457 Martinez-Beneito, M. A., 378, 380
forms of regression, 454–455 Martin Marietta company, 46
pattern mixture models, 459–462 Math achievement, 321–322
predictor data, 457–459 Maths aptitude, 371–372
multilevel and multivariate, 443–452 Mavridis, D., 121
latent trait, 445–446 Maximum-entropy priors, 27
multiple scale, 446–452 Maximum likelihood (ML) analysis, 27, 219,
overview, 405–406 328, 428
temporal correlation and autocorrelated Maximum likelihood (ML) estimation, 26, 386
residuals, 418–423 Maximum likelihood (ML) factor analysis,
explicit temporal schemes, 419–423 352, 354
LOO-IC, see Leave-one-out information MCAR, see Missingness completely at random;
criterion Multivariate CAR prior
Lopes, H. F., 65 MCC, see Measure of creatinine clearance
LOR, see Log odds ratio MCE, see Marginal causal effect
Louis, T., 88 MCMC, see Markov chain Monte Carlo
Low birthweight babies, 273, 294 MCMCpack, 45
Low order autoregressive models, 169–170 MCMCvis, 45
LPPD, see Log posterior predictive density MDP, see Mixed Dirichlet process
LRR, see Log relative risk Measure of creatinine clearance (MCC),
LSAT data, 363–366 450–451
Lubrano, M., 170 Median regression, 110, 553, 558
Lung cancer trial, 484–486 Meta-analysis model, 22, 26, 27
Lung function and ozone exposure, 307–308 Meta-regression, 110–111
Metropolis–Hastings (M–H) sampling, 8, 10,
McCulloch, C. E., 551 14–17, 49, 63, 178, 195
McCulloch, R., 29 Metropolis sampling, 8–9, 32–34
MacNab, Y., 225, 326, 376 extended logistic model, 13–14
Mallick, B., 80 normal density parameters estimation, 11–12
maptools package, 218 Meyer, M., 88, 388
MAR, see Missingness at random MGMRF, see Multivariate Gaussian Markov
Marginal causal effect (MCE), 308 random field
Marginal likelihood, 3, 8, 12, 59, 60, 77, 80, 84, 87, M-H, see Metropolis-Hastings sampling
128, 345–346 Migon, H., 392, 501
approximation, 62–63, 344 Militino, A., 136
estimation, 51, 60, 61, 67, 68–71, 406 MIMIC, see Multiple indicator-multiple cause
Marginal structural model (MSM), 306–308 model
Markham, F., 111 Missing data in longitudinal data, 55, 452–462
Markov chain model, 3, 14, 16, 283, 285, 429 common factor models, 455–457
Markov chain Monte Carlo (MCMC), 61, 62, 72, forms of regression, 454–455
75, 76, 80, 91, 93, 106, 114, 135, 136, 137, pattern mixture models, 459–462
144, 269, 288, 293, 302, 320, 344, 483 predictor data, 457–459
Missingness at random (MAR), 453, 454 Mortality and environment, 286–287
Missingness completely at random (MCAR), Moyeed, R., 110
453, 454 MPH, see Mixed proportional hazard models
Missingness not at random (MNAR), 453, 454 Mplus package, 344
Mixed Dirichlet process (MDP), 145, 146, 147, MRF, see Markov random field
437, 442 MSM, see Marginal structural model
Mixed model P splines, 536 MSOE, see Multiple source of error
Mixed model splines, 527–529 MultiBUGS, 45, 46, 47
Mixed predictive checks, 91–95 Multilevel and multivariate longitudinal data,
Mixed predictive method, 322 443–452
Mixed predictive p-test, 91 latent trait, 445–446
Mixed proportional hazard (MPH) models, multiple scale, 446–452
488, 510 Multilevel logistic regression, 331
ML, see Maximum likelihood Multilevel models, 317–336
MNAR, see Missingness not at random crossed and multiple membership random
Model fit and checks, 59–98 effects, 328–331
effective model dimension and penalised fit discrete responses, 322–328
measures, 71–78 augmented data multilevel models,
alternative complexity measures, 73–75 324–325
deviance information criterion (DIC), conjugate cluster effects, 325–328
72–73 normal linear mixed model, 318–322
leave one out information criterion Lindley–Smith model format, 320–322
(LOO-IC), 75–77 overview, 317–318
widely applicable Bayesian information robust, 331–336
criterion (WBIC), 77–78 Multinomial-Dirichlet mixture, 268
widely applicable information criterion Multinomial-Dirichlet model, 106, 131
(WAIC), 75–77 Multinomial regression, 267–269
formal model selection, 59–71 Multiple chain convergence criterion, 25
approximating marginal likelihoods, Multiple chain methods, 24–25
62–63 Multiple indicator-multiple cause (MIMIC)
importance and bridge sampling model, 342
estimates, 63–65 Multiple membership random effects, 328–331
marginal likelihood for hierarchical Multiple random effects, 22, 24, 67, 85, 92,
models, 67–71 166, 500
path sampling, 65–67 Multiple scale longitudinal data, 446–452
overview, 59 Multiple source of error (MSOE), 181
predictive methods, 87–95 Multivariate adaptive regression spline (MARS)
mixed checks, 91–95 method, 536
model checking and choice, 87–89 Multivariate and nested survival times, 502–506
posterior predictive density, 89–91 Multivariate CAR (MCAR) prior, 376
variance component choice and model Multivariate count data, 124, 355–356
averaging, 80–86 Multivariate dynamic linear model (DLM),
random effects selection, 80–86 381–386
“Model/M-closed” scenario, 59 Multivariate Gaussian Markov random field
Moment approximations, 62 (MGMRF), 375
Monte Carlo methods, 3–5 Multivariate meta-analysis, 116–121
Monte Carlo permutation test, 229 Multivariate mixed model generalisation, 538
Monte Carlo simulation, 105 Multivariate normal density, 55
Monte Carlo variance, 21, 66 Multivariate normal distribution, 458
Moran’s I statistic, 229, 288 Multivariate random effects, 80
Moreira, A., 392 Multivariate regression, 255
Morris, C., 113, 124, 125, 126 Multivariate skew-normal model, 118
Morris, C. N., 479 Multivariate skew-t model, 118, 434
Multivariate spatial responses, 373–378 Lindley–Smith model format, 320–322
Multivariate stochastic volatility, 394–395 Normal linear model, 28, 236, 277, 283, 288,
Multivariate time series, 381–395 410, 525
dynamic factor analysis, 386–388 Normal linear regression, 5, 46, 60, 253, 256, 276,
dynamic linear models, 381–386 288, 289, 300, 305, 552
stochastic volatility, 388–395 Normal linear structural equation and factor
Muthén, B., 332, 344 models, 340–346
definition, 343–345
NASDAQ daily trading volume statistics, forms, 342–343
171–172 marginal and complete data likelihoods,
Natarajan, R., 415 345–346
Natural direct effect (NDE), 301 Markov chain Monte Carlo (MCMC)
Natural indirect effect (NIE), 301, 305 sampling, 345–346
Naylor, J., 167 Normal-normal hierarchical model, 106–111
NCP, see Non-centred parameterisation Normal proposal densities, 10, 14
NDE, see Natural direct effect Normal random effects, 16–17
Nearest neighbour Gaussian process (NNGP) Normal-t approach, 109, 110
approach, 239, 241 Norton, J., 227
Neighbourhood social deprivation, 330–331 No U-Turn Sampler (NUTS) algorithm,
Neuhaus, J. M., 551 19, 49, 257
Neves, C., 501 NRT, see Nicotine replacement therapy
New York Stock Exchange, 46 Numeric integration methods, 67
Nicotine replacement therapy (NRT), 115–116, Nursing home patients, 479–481
150–151 NUTS, see No U-Turn Sampler algorithm
NIE, see Natural indirect effect
Nile discharges data, 200–201 Oh, M-S, 189, 283
Nimble package, 45 Old Faithful geyser data, 191–193
NIMH schizophrenic collaborative study, OPENBUGS package, 46
426–427 Ordinal symptom score, 426–427
Niu, X., 227 Ordinal three level model, 327–328
NNGP, see Nearest neighbour Gaussian process
approach Pacific blue shark population, 185–186
Non-centred parameterisation (NCP), 22, 24, Papaspiliopoulos, O., 408
408, 409, 413 Parametric hazards, 475–477
Non-conjugate Poisson mixing, 124–126 Papaspiliopoulos, O., 408
Non-conjugate priors, 26, 113–115, 128–130 Parametric methods, 500
Nonconjugate random mixture models, 265, 267 Parametric mixture models, 137
Nonconstant error variances, 276–277 Parametric survival models, 503
Non-informative prior, 26 Pareto prior, 113
Nonlinear predictor effects, 525 Pareto smoothed importance sampling
Non-parametric approach, 29 (PSIS), 77
Non-parametric basis function models, 526–536 Partially non-centred parameterisations
mixed model splines, 527–529 (PNCP), 408, 409
model selection, 532–536 Path sampling, 65–67, 70
other than truncated polynomials, 529–532 Pattern mixture models, 459–462
Non-parametric maximum likelihood Pauger, D., 259
estimation, 135 Pauler, D., 114
Non-parametric prior for random effects, 371 PC, see Penalising complexity
Non-parametric regression, 526, 546–552 PE, see Piecewise exponential
Nonsingular models, 60 Pedroza, C., 178
Normal density parameters estimation, 11–12 Penalised fit measures, 78
Normal linear mixed model, 318–322, 328, 329, Penalising complexity (PC), 113
331, 407, 408, 433, 434, 437, 444 Permanent survivor fraction (PSF), 493
Pettitt, A., 66, 67, 225 Prenatal care, 334–336
Pharmacokinetic application, 438–439 Prescott, R., 85
PHM, see Proportional hazard model Progesterone readings, 549–551
PH Weibull regression, 479 Propensity score (PS) methods, 296–297, 298
Physician visits counts, 558–559 Proportional hazard model (PHM), 473, 478, 481
Piecewise exponential (PE) models, 476 PS, see Propensity score methods
Piecewise exponential (PE) priors, 482–484 PSF, see Permanent survivor fraction
Piironen, J., 257 PSIS, see Pareto smoothed importance sampling
Pinheiro, J., 331, 332 P-spline approach, 527
PLN, see Poisson lognormal model PSRF, see Potential scale reduction factor
Plug-in deviance, 72, 73 PT, see Polya tree
Plummer, M., 73, 78, 143
PNCP, see Partially non-centred Quadratic growth curve model, 418
parameterisations Quantile regression, 552–559
Point processes, models for, 234–241 non-metric responses, 554–559
covariance functions, 237–238 Quasi-Bayesian Monte Carlo algorithm, 302
sparse and low rank approaches, 238–241 Quintana, F., 128
Poisson-gamma methodology, 125, 126, 127
Poisson-gamma mixture, 103, 134 R, Bayesian analysis in, 2, 12, 14, 18, 32, 35–36,
Poisson-gamma model, 73, 106, 121, 122, 123, 37, 45–55
264, 265, 266 BUGS coding, 46–47
Poisson likelihood approach, 476, 482 coding for rstan, 49–54
Poisson lognormal (PLN) model, 73, 124, 134, custom distributions through functions
153, 555 block, 53–54
Poisson process model, 125 differences between generic packages, 55
Poisson regression, 264–267, 294 Hamiltonian Monte Carlo, 49
Polya process model, 29 Stan program syntax, 49–51
Polya tree (PT), 149–153 target + representation, 51–53
Pooling strength applications, 134 JAGS coding, 47–49
Posterior density, 3, 4, 9–10, 12, 20, 24, 27, 62, 63, overview, 45–46
68, 72, 74, 75, 105, 131, 239, 276, 345, 354, R2MultiBUGS, 45, 46, 47
437, 442, 485, 501, 528 R2OpenBUGS package, 2, 45, 46, 93, 151, 153,
Posterior model probabilities, 59, 60 192, 224, 232, 240, 273, 295, 321, 322, 327,
Posterior predictive check (PPC), 110 330, 331, 380, 385, 416, 418
Posterior predictive loss (PPL) model, 88, 89 R2WinBUGS, 46
Posterior predictive model checks, 89–91 Rabe-Hesketh, S., 366
Posterior predictive p-tests, 557 Radial basis methods, 536
Posterior probability, 8, 12, 21, 86, 105, 167, 255 Radioimmunoassay and esterase, 279–280
Potential scale reduction factor (PSRF), 25 Raftery, A., 62, 70
Potts prior, 232, 233, 234 Rails data, 263
Pourahmadi, M., 414 Random coefficient autoregressive models
PPC, see Posterior predictive check (RCAR), 168–169
PPL, see Posterior predictive loss model Random coefficient normal linear model, 410
Prabhakaran, S., 148 Random covariance matrix and effect
Precision and covariance matrices, 55 selection, 411
Predictive approach, 61, 87 Random effect model, 2, 8, 22, 24, 26, 29, 60, 61,
Predictive methods, 87–95 71, 72, 74, 111, 179, 319
mixed checks, 91–95 Random intercept and slope (RIAS) model,
model checking and choice, 87–89 410, 448
posterior model checks, 89–91 Random intercept model, 418, 419, 426
Predictor retention and discrete mixture Random regression imputation, 459
models, 55 Rasser, G., 230
Predictor selection, 80, 273, 280–281, 295 Rattanasiri, S., 135
RCAR, see Random coefficient autoregressive RJMCMC, see Reversible jump Markov chain
models Monte Carlo methods
Regime switching models, 200 R MASS library, 53
Regression coefficients, 60, 67, 254 R mgcv package, 236
Regression parameter models, 30 Road fatalities in Ontario, 190–191
Regression techniques, 253–308 Robert, C., 135, 137
categorical predictors and analysis of Roberts, G., 25, 136
variance (ANOVA), 259–263 Robust random effects, 441–442
variance components testing, 260–263 Rodríguez-Bernal, M., 89
heteroscedasticity and heterogeneity, 276–282 Rodrigues, A., 30
nonconstant error variances, 276–277 rstan, 2, 14, 18, 29, 45, 49–54, 55, 70, 78, 143, 166,
varying effects using discrete mixtures, 167, 224, 233, 258, 321, 327, 349, 364, 417,
277–278 514–516
zero-inflated Poisson (ZIP), 278–282 beetle mortality, 34–35
latent scales for binary and categorical data, custom distributions through functions
270–276 block, 53–54
augmentation for ordinal responses, 273, Hamiltonian Monte Carlo, 49
275–276 Stan program syntax, 49–51
for overdispersed data, 264–269 target + representation, 51–53
binomial and multinomial, 267–269 rstanarm package, 45
Poisson regression, 264–267 rube package, 45, 46, 116, 151, 152, 373, 439, 452,
overview, 253 485, 499, 551
predictor selection, 254–256 Rubin, D., 453
selection bias and causal effects, 296–308 Rue, H., 376
causal path sequences, 299–305 runjags package, 48
marginal structural models, 306–308
mediation and marginal models, 299 Sahu, S., 88, 118
propensity score adjustment, 296–299 Salanti, G., 121
shrinkage priors, 256–259 SAR, see Spatial autoregressive
spatial, 288–296 Sargent, D., 27
Bayesian spatially varying coefficients, Savage, J., 143
292–293 Saville, B., 81, 86
Bayesian spatial predictor selection SBP, see Systolic blood pressure
models, 293–296 Scaled inverse chi-squared density, see Inverse
conditional autoregression, 290–291 chi-squared density
GWR and Bayesian SVC models, 291–292 Schabenberger, O., 228
lag and error models, 288 Schaefer, M. B., 185
simultaneous autoregressive models, Schifflers, E., 447
288–290 School attendance data, 266–267
time series, 282–287 Schools data meta analysis, 17–18
time-varying effects, 283–287 Schotman, P., 170
Reich, B., 293 SDM, see Spatial Durbin model
Residual autocorrelation, 422 Second-stage covariance, 119, 121
Residual variance, 60 Seeds data, 83–84, 93–95
Reverse-mode algorithmic differentiation, 19 Seemingly unrelated time series equations
Reversible jump Markov chain Monte Carlo (SUTSE) model, 383
(RJMCMC) methods, 344, 345, 437, 483 Seismic Hazard Harmonization in Europe
RIAS, see Random intercept and slope model (SHARE), 240
Richardson, S., 135, 136, 149, 230, 231 Self-exciting threshold autoregressive (SETAR)
R-INLA package, 20, 45, 166, 182, 191, 229, 295 models, 200
Risser, M., 238 Seltzer, M., 331
rjags, 2, 76, 78, 139, 151, 276, 281, 350, 427, 451, SEM, see Spatial errors model
461, 462, 539, 540 Semiparametric approaches, 243
Semiparametric hazards, 481–486 spatial smoothing and prediction for area
cumulative hazard specifications, 484–486 data, 214–221
piecewise exponential priors, 482–484 SAR schemes, 216–221
Semiparametric mixture model, 103 Spatial discontinuity and robust smoothing,
Semiparametric modelling, 144–153, 499 229–234
polya tree priors, 149–153 Spatial Durbin model (SDM), 218
specifying baseline density, 146–148 Spatial errors model (SEM), 217, 218, 220,
truncated Dirichlet processes and stick- 342, 354
breaking priors, 148–149 Spatial lag model (SLM), 217, 220, 288, 289, 295
Semiparametric regression models, 525 Spatial lag regression coefficient (SLRC), 288, 294
SETAR, see Self-exciting threshold Spatially varying coefficient (SVC), 292–293
autoregressive models
Sethuraman, J., 231 Spatial median relative risks, 234
Shannon–Weaver entropy, 27 Spatial regression, 225, 288–296
Shapiro–Wilk normality test, 115 Bayesian spatially varying coefficients (SVC),
Shapiro–Wilk W statistic, 91 292–293
SHARE, see Seismic Hazard Harmonization in Bayesian spatial predictor selection models,
Europe 293–296
Shared effect missingness model, 461–462 conditional autoregression, 290–291
Shen, S., 142 Geographically weighted linear regression
Simpson, D., 113 (GWR), 291–292
Single source of error (SSOE), 181 lag and error models, 288
Sinharay, S., 68, 69, 91 simultaneous autoregressive models,
Site-specific effects, 244 288–290
Skewed cholesterol data, 439–440 “Spatial switching” model, 230
Skew-elliptical densities, 118 Spatio-temporal error, 170
Skew probit, 273, 371–372 SPDE, see Stochastic partial differentiation
Skew t model, 116, 118 approach
SLM, see Spatial lag model spdep package, 218
SLRC, see Spatial lag regression coefficient Spiegelhalter, D., 72, 73, 78, 91, 92, 93, 110, 112
Smith, A., 104 Spike and slab prior (SSP), 254, 255, 257, 258, 298
Smith, M., 80, 81 spNNGP package, 239, 241
Smoothing splines, 527 SSOE, see Single source of error
Sparse precision matrix method, 233 SSP, see Spike and slab prior
Spatial autoregressive (SAR) model, 288 SSVS, see Stochastic search variable selection
Spatial autoregressive (SAR) schemes, 216–221 Stan program syntax, 45, 46, 49–51, 55
Spatial covariance modelling, 225, 240 Stan User’s Guide and Reference Manual, 55
Spatial data analysis, 225, 340 State-space approach, 187
Spatial dependence representation, 213–246 State-space method, 173
conditional autoregressive priors, 221–227 State-space models, 28, 29, 165–166, 187
alternative conditional priors, 223–225 State-space priors, 172–186
ICAR(1) and convolution priors, 226–227 basic structural model, 178–179
linking conditional and joint identification questions, 179–184
specifications, 222–223 nonlinear models for continuous data,
discontinuity and robust smoothing, 184–186
229–234 sampling schemes, 176–178
discrete convolution models, 241–244 simple signal models, 175–176
models for point processes, 234–241 State-space techniques, 193, 194
covariance functions, 237–238 Statistical analysis, 2, 45
sparse and low rank approaches, 238–241 Steel, M., 231
overview, 213–214 Stern, H., 68, 69, 91
priors on conditional spatial variance, Stochastic partial differentiation (SPDE)
227–229 approach, 238
Stochastic search variable selection (SSVS), 80, binary autoregressive moving average
254, 255, 295, 298, 343 (BARMA) models, 191–193
Stochastic volatility model, 6, 193, 196 generalised autoregressive moving
Structural equation models (SEMs), 344 average (GARMA) representation,
Structural time series models, 28 188–189
Student t densities, 143 modelling discontinuities, 197–202
Student t distribution, 136 modelling temporal structure, 166–172
Student t model, 370, 372–373 antedependence models, 170–172
Suicides, London, 232–234 low order autoregressive models, 169–170
Survival and event history models, 471–519 random coefficient autoregressive
competing risks (CR), 507–514 models, 168–169
modelling frailty, 509–514 overview, 165–166
in continuous time, 472–481 state-space priors, 172–186
accelerated hazards, 478–481 basic structural model, 178–179
counting process functions, 474–475 identification questions, 179–184
parametric hazards, 475–477 nonlinear models for continuous data,
discrete time hazard models, 494–502 184–186
life tables, 496–502 sampling schemes, 176–178
including frailty, 488–494 simple signal models, 175–176
cure rate models, 490–494 stochastic variances, 193–197
multivariate and nested survival times, Time series regression, 282–287
502–506 time-varying effects, 283–287
overview, 471–472 Time-varying autoregressive (TVAR) models, 168
semiparametric hazards, 481–486 Tingley, D., 303
cumulative hazard specifications, Tiwari, R., 178
484–486 Tong, B., 277
piecewise exponential priors, 482–484 Total fertility rate (TFR), 539–541
Suspected myocardial infarction, 298–299 TPRS, see Thin-plate regression spline smoother
SUTSE, see Seemingly unrelated time series Trend-surface models, 236
equations model Trivariate normal random intercept model, 426
SVC, see Spatially varying coefficient Trout density, 555, 557
Symmetric proposal densities, 11 Truncated Dirichlet processes (TDP), 148–149,
Symmetric smoothing kernel, 242 334, 437
Systolic blood pressure (SBP), 120, 304–305 Truncated stick-breaking scheme, 148–149
Tsay, R., 391
Tam, W., 128 Tuchler, R., 81, 82, 86
Tanner, M., 371 Turtle mortality data, 68–71
Target + representation, 51–53 Turtle survival data, 35–36, 78
Taylor approximation, 20 Tutz, G., 429
TDP, see Truncated Dirichlet processes TVAR, see Time-varying autoregressive models
Terbinafine, 133–134 Two-chain analysis, 184
TFR, see Total fertility rate Two-level normal linear model, 28
Thin-plate regression spline (TPRS) smoother,
539, 540 UN Human Development Report, 539
Thin plate splines, 236 Unit level random effects, 409–411
Thompson, E., 10 Univariate random effects, 80
Thompson, S., 118 Univariate regression, 295
Tiao, G., 391 Unwanted pursuit behaviour (UPB), 281
Time series analysis, 165–205 Unweighted logistic regression, 52
for discrete responses and alternative state- UPB, see Unwanted pursuit behaviour
space approach, 186–193 US National Eye Institute, 506
autoregressive conditional Poisson (ACP) US National Longitudinal Survey, 430–431
models, 189–191 US presidential election voting data, 269
van Dijk, H., 170 WINBUGS package, 46, 120
van Dongen, S., 115 Winkelmann, R., 124, 355, 362
Van Duijn, M., 326 Winship, D. A., 129
van Houwelingen, H., 119 WISC-R, see Wechsler Intelligence Scale for
Variable selection, see Predictor selection Children-Revised
Variance-covariance matrix, 81 Wishart distribution, 148
Varying regression coefficient approach, 288 Wishart prior method, 322
Vectorisation, 50 Wishart scale matrix, 120, 392
Vehtari, A., 77, 257 Wood, S., 236, 237
Vermunt, J., 327
Veteran’s Administration, 484 Xie, W., 66
Viele, K., 277 Xu, X., 260

Wagner, H., 83, 259 Yanagimoto, T., 187
WAIC, see Widely applicable information Yang, C., 81
criterion Yau, K. K., 441
Wakefield, J., 114 Yoghurt brand choice, 425–426
Walfish, S., 133 Young-Xu, Y., 133
Waller, L., 215 Yu, K., 110, 388
Watanabe, S., 77 Yue, Y., 554
WBIC, see Widely applicable Bayesian
information criterion Zarepour, M., 148, 149, 371
Wechsler Intelligence Scale for Children- Zelig package, 52
Revised (WISC-R), 349–352 Zero-inflated Poisson (ZIP), 124, 142, 278–282
Weibull hazard rate, 476
Weibull model, 478, 479, 480, 495, 499, 503 Zhang, Z., 106
Weighted log-likelihood estimation, 51, 52 Zhao, Y., 114, 319
Weiss, R., 332 Zheng, X., 366
West, M., 65, 145, 152 Zhu, R., 54
Widely applicable Bayesian information Zimmermann, K. F., 124
criterion (WBIC), 60, 77–78 ZIP, see Zero-inflated Poisson
Widely applicable information criterion
(WAIC), 75–77, 78, 93, 115, 172, 182, 190,
191, 232, 258, 321, 322, 333, 334, 350, 499,
500, 540
