Course Notes
February 8, 2019
Introduction
These are (incomplete) course notes about generalised linear mixed models
(GLMM). Special emphasis is placed on understanding the underlying struc-
ture of a GLMM in order to show that slight modifications of this structure can
produce a wide range of models. These include familiar models like regression
and ANOVA, but also models with intimidating names: animal models, thresh-
old models, meta-analysis, MANCOVA and random regression . . . The primary
aim of the course is to show that these models are only daunting by name.
The secondary aim is to show how these models can be fitted in a Bayesian
framework using Markov chain Monte Carlo (MCMC) methods in the R pack-
age MCMCglmm. For those not comfortable using Bayesian methods, many of the
models outlined in the course notes can be fitted in asreml or lmer with little
extra work. If you do use MCMCglmm, please cite ?.
Contents

Introduction
2 GLMM
  2.1 Linear Model (LM)
    2.1.1 Linear Predictors
  2.2 Generalised Linear Model (GLM)
  2.3 Over-dispersion
    2.3.1 Multiplicative Over-dispersion
    2.3.2 Additive Over-dispersion
  2.4 Random effects
  2.5 Prediction with Random effects
  2.6 Categorical Data
  2.7 A note on fixed effect priors and covariances
5 Multi-response models
  5.1 Relaxing the univariate assumptions of causality
  5.2 Multinomial Models
  5.3 Zero-inflated Models
    5.3.1 Posterior predictive checks
  5.4 Hurdle Models
  5.5 Zero-altered Models
Acknowledgments
Bibliography
Chapter 1
Bayesian Statistics & MCMC
In the context of a generalised linear mixed model (GLMM), here are what I
see as the pros and cons of using (restricted) maximum likelihood (REML)
versus Bayesian Markov chain Monte Carlo (MCMC) methods. REML is
fast and easy to use, whereas MCMC can be slow and technically more challeng-
ing. Particularly challenging is the specification of a sensible prior, something
which is a non-issue in a REML analysis. However, analytical results for non-
Gaussian GLMM are generally not available, and REML based procedures use
approximate likelihood methods that may not work well. MCMC is also an
approximation but the accuracy of the approximation increases the longer the
analysis is run for, being exact at the limit. In addition REML uses large-
sample theory to derive approximate confidence intervals that may have very
poor coverage, especially for variance components. Again, MCMC measures of
confidence are exact, up to Monte Carlo error, and provide an easy and intuitive
way of obtaining measures of confidence on derived statistics such as ratios of
variances, correlations and predictions.
To illustrate the differences between the approaches let's imagine we've ob-
served several random deviates (y) from a standard normal (i.e. µ = 0 and
σ² = 1). The likelihood is the probability of the data given the parameters:
Pr(y|µ, σ²)

In a Bayesian analysis we are instead interested in the probability of the parameters given the data:

Pr(µ, σ²|y)
which seems more reasonable, until we realise that this probability is pro-
portional to
Pr(y|µ, σ²)Pr(µ, σ²)
where the first term is the likelihood, and the second term represents our
prior belief in the values that the model parameters could take. Because the
choice of prior is rarely justified by an objective quantification of the state of
knowledge it has come under criticism, and indeed we will see later that the
choice of prior can make a difference.
1.1 Likelihood
We can generate 5 observations from this distribution using rnorm:
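A sketch of this step (the exact call is not shown in this extract; the data frame name Ndata matches the glm and optim calls used later):

> Ndata <- data.frame(y = rnorm(5, mean = 0, sd = sqrt(1)))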
We can plot the probability density function for the standard normal using
dnorm and we can then place the 5 data on it:
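A minimal sketch of the plot and of the likelihood calculation (the object name possible.y is taken from the axis label of Figure 1.1; because the observations are independent, their joint probability is the product of their individual densities):

> possible.y <- seq(-3, 3, 0.1)
> plot(possible.y, dnorm(possible.y), type = "l", ylab = "Probability")
> points(Ndata$y, dnorm(Ndata$y))
> prod(dnorm(Ndata$y))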
[1] 0.003015919
Of course we don’t know the true mean and variance and so we may want
to ask how probable the data would be if, say, µ = 0, and σ 2 = 0.5:
Figure 1.1: Probability density function for the unit normal with the data points
overlaid.
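The same calculation under the alternative parameter values (a sketch):

> prod(dnorm(Ndata$y, mean = 0, sd = sqrt(0.5)))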
[1] 0.005091715
It would seem that the data are more likely under this set of parameters than
the true parameters, which we must expect some of the time just from random
sampling. To get some idea as to why this might be the case we can overlay the
two densities (Figure 1.2), and we can see that although some data points (e.g.
1.2) are more likely with the true parameters, in aggregate the new parameters
produce a higher likelihood.
Figure 1.2: Two probability density functions for normal distributions with
means of zero, and a variance of one (black line) and a variance of 0.5 (red line).
The data points are overlaid.
Figure 1.3: Likelihood surface for the likelihood Pr(y|µ, σ²). The likelihood has
been normalised so that the maximum likelihood has a value of one.
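The ML estimates below were obtained numerically; a sketch of one way to do this (the function name loglik is hypothetical, but the optim settings mirror the loglikprior call that appears later):

> loglik <- function(par, y) {
+     sum(dnorm(y, mean = par[1], sd = sqrt(par[2]), log = TRUE))
+ }
> MLest <- optim(c(mean = 0, var = 1), fn = loglik, y = Ndata$y,
+     method = "L-BFGS-B", lower = c(-1e+05, 1e-05), upper = c(1e+05, 1e+05),
+     control = list(fnscale = -1))$par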
> MLest
mean var
-0.1051258 0.4726117
Call:
glm(formula = y ~ 1, data = Ndata)
Deviance Residuals:
1 2 3 4 5
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.1051 0.3437 -0.306 0.775
Here we see that although the estimate of the mean (intercept) is the same,
the estimate of the variance (the dispersion parameter: 0.591) is higher when
fitting the model using glm. In fact the ML estimate is a factor of n/(n − 1) smaller.
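With n = 5 this can be checked directly (a one-line sketch):

> MLest["var"] * (5/4)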
var
0.5907647
Figure 1.4: Probability density function for the unit normal with 2 realisations
overlaid. The solid vertical line is the true mean, whereas the vertical dashed line
is the mean of the two realisations (the ML estimator of the mean). The variance
is the expected squared distance between the true mean and the realisations.
The ML estimator of the variance is the average squared distance between the
ML mean and the realisations (horizontal dashed lines), which is always smaller
than the average squared distance between the true mean and the realisations
(horizontal solid lines)
For a single variance component the inverse Wishart takes two scalar parameters, V and nu. The distribution tends to a point mass on V as the degree of belief parameter, nu, goes to infinity. The distribution tends to be right skewed when nu is not very large, with a mode of V∗nu/(nu + 2) but a mean of V∗nu/(nu − 2) (which is not defined for nu < 2).
Figure 1.5: Probability density function for a univariate inverse Wishart with
the variance at the limit set to 1 (V=1) and varying degree of belief parameter
(nu). With V=1 these distributions are equivalent to inverse gamma distributions
with shape and scale parameters set to nu/2.
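These densities can be drawn directly; a sketch using dinvgamma (as used in the logprior function below), here for nu = 1 and nu = 0.002 with V = 1:

> xv <- seq(1e-04, 5, length = 1000)
> plot(xv, dinvgamma(xv, shape = 1/2, scale = 1/2), type = "l")
> lines(xv, dinvgamma(xv, shape = 0.001, scale = 0.001), col = "red")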
For the posterior to be a valid probability distribution the prior should integrate to some finite value. It therefore seems reasonable that when specifying a prior,
care must be taken that this condition is met. In the example here where V
is a single variance this condition is met if V>0 and nu>0. If this condition is
not met then the prior is said to be improper, and in WinBUGS (and possibly
other software) improper priors cannot be specified. Although great care has to
be taken when using improper priors, MCMCglmm does allow them as they have
some useful properties, and some common improper priors are discussed in sec-
tion 1.5. However, for now we will use the prior specification V=1 and nu=0.002
which is frequently used for variance components. For the mean we will use a
diffuse normal prior centred around zero but with very large variance (10^8). If
the variance is finite then the prior is always proper.
As before we can write a function for calculating the (log) prior probability:
> logprior <- function(par, priorR, priorB) {
+ dnorm(par[1], mean = priorB$mu, sd = sqrt(priorB$V),
+ log = TRUE) + log(dinvgamma(par[2], shape = priorR$nu/2,
+ scale = (priorR$nu * priorR$V)/2))
+ }
where priorR is a list with elements V and nu specifying the prior for the
variance, and priorB is a list with elements mu and V specifying the prior for
the mean. MCMCglmm takes these prior specifications as a list:
> prior <- list(R = list(V = 1, nu = 0.002), B = list(mu = 0,
+ V = 1e+08))
The prior has some influence on the posterior mode of the variance, and we
can use an optimisation algorithm again to locate the mode:
> Best <- optim(c(mean = 0, var = 1), fn = loglikprior,
+ y = Ndata$y, priorR = prior$R, priorB = prior$B,
+ method = "L-BFGS-B", lower = c(-1e+05, 1e-05),
+ upper = c(1e+05, 1e+05), control = list(fnscale = -1,
+ factr = 1e-16))$par
> Best
Figure 1.6: Likelihood surface for the likelihood Pr(y|µ, σ²) in black, and the
posterior distribution Pr(µ, σ²|y) in red. The likelihood has been normalised so
that the maximum likelihood has a value of one, and the posterior distribution
has been normalised so that the posterior mode has a value of one. The prior
distributions Pr(µ) ∼ N(0, 10^8) and Pr(σ²) ∼ IW(V = 1, nu = 0.002) were
used.
mean var
-0.1051258 0.3377710
The posterior mode for the mean is identical to the ML estimate, but the
posterior mode for the variance is even less than the ML estimate which is known
to be downwardly biased. The reason that the ML estimate is downwardly
biased is because it did not take into account the uncertainty in the mean. In a
Bayesian analysis we can do this by evaluating the marginal distribution of σ²
and averaging over the uncertainty in the mean:

Pr(σ²|y) ∝ ∫ Pr(µ, σ²|y) dµ

that is, after averaging over any nuisance parameters, such as the mean in this case.
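The Markov chain discussed next comes from fitting this simple model with MCMCglmm; the call is not shown in this extract, so here is a minimal sketch (the model name m1a.1 is an assumption), using the prior defined above:

> m1a.1 <- MCMCglmm(y ~ 1, data = Ndata, prior = prior, verbose = FALSE)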
The Markov chain is drawing random (but often correlated) samples from
the joint posterior distribution (depicted by the red contours in Figure 1.6). The
element of the output called Sol contains the distribution for the mean, and the
element called VCV contains the distribution for the variance. We can produce
a scatter plot:
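For example (a sketch; Figure 1.7 overlays these samples on the gridded posterior):

> plot(as.vector(m1a.1$Sol), as.vector(m1a.1$VCV), xlab = "mean", ylab = "variance")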
and we see that MCMCglmm is sampling the same distribution as the posterior
distribution calculated on a grid of possible parameter values (Figure 1.7).
A very nice property of MCMC is that we can normalise the density so that
it integrates to 1 (a true probability) rather than normalising it with respect to
some other aspect of the distribution, such as the density at the ML estimator
or the joint posterior mode as in Figures 1.3 and 1.6. To make this clearer,
imagine we wanted to know how much more probable the unit normal (i.e. with
µ = 0 and σ 2 = 1) was than a normal distribution with the posterior modal
parameters. We can calculate this by taking the ratio of the posterior densities
at these two points:
[1] 4.522744
Now, if we wanted to know the probability that the parameters lay in the
region of parameter space we were plotting, i.e. lay in the square µ = (−2, 2)
and σ 2 = (0, 5) then this would be more difficult. We would have to evaluate the
density at a much larger range of parameter values than we had done, ensuring
that we had covered all regions with positive probability. Because MCMC has
sampled the distribution randomly, this probability will be equal to the expected
probability that we have drawn an MCMC sample from the region. We can
obtain an estimate of this by seeing what proportion of our actual samples lie
in this square:
Figure 1.7: The posterior distribution Pr(µ, σ²|y). The black dots are samples
from the posterior using MCMC, and the red contours are calculated by eval-
uating the posterior density on a grid of parameter values. The contours are
normalised so that the posterior mode has a value of one.
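One way to do this (a sketch using the hypothetical model object m1a.1 from above):

> prop.table(table(m1a.1$Sol[, 1] > -2 & m1a.1$Sol[, 1] < 2 & m1a.1$VCV[, 1] < 5))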
FALSE TRUE
0.0225 0.9775
There is Monte Carlo error in the answer (0.978) but if we collect a large
number of samples then this can be minimised.
Using a similar logic we can obtain the marginal distribution of the variance
by simply evaluating the draws in VCV ignoring (averaging over) the draws in
Sol:
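A histogram of the sampled variances approximates this marginal distribution (a sketch):

> hist(m1a.1$VCV)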
Figure 1.8: Histogram of samples from the marginal distribution of the variance
Pr(σ²|y) using MCMC. The vertical line is the joint posterior mode, which
differs slightly from the marginal posterior mode (the peak of the marginal
distribution).
In this example (see Figure 1.8) the marginal mode and the joint mode are
very similar, although this is not necessarily the case and can depend both on
the data and the prior. Section 1.5 introduces improper priors that are non-
informative with regard to the marginal distribution of a variance.
1.4 MCMC
In order to be confident that MCMCglmm has successfully sampled the poste-
rior distribution it will be necessary to have a basic understanding of MCMC
methods. MCMC methods are often used when the joint posterior distribution
cannot be derived analytically, which is nearly always the case. MCMC relies
on the fact that although we cannot derive the complete posterior, we can cal-
culate the height of the posterior distribution at a particular set of parameter
values, as we did to obtain the contour plot in Figure 1.6. However, rather than
going systematically through every likely combination of µ and σ² and calculating
the height of the distribution at regular distances, MCMC moves stochastically
through parameter space, hence the name ‘Monte Carlo’.
Figure 1.9: The posterior distribution Pr(µ, σ²|y). This perspective plot is
equivalent to the contour plot in Figure 1.6.
Figure 1.10: The posterior distribution Pr(µ, σ²|y), but only for values of σ²
between 1 and 5, rather than 0 to 5 (Figure 1.9). The edge of the surface facing
left is the conditional distribution of the mean when σ² = 1 (Pr(µ|y, σ² = 1)).
This conditional distribution follows a normal distribution.
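The diagnostics below come from a model stored as m1a.2; its call is not shown in this extract, so here is a sketch consistent with the settings described in the text (thin = 1; the nitt and burnin values are assumptions based on MCMCglmm's defaults):

> m1a.2 <- MCMCglmm(y ~ 1, data = Ndata, prior = prior, nitt = 13000,
+     thin = 1, burnin = 3000, verbose = FALSE)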
> autocorr(m1a.2$Sol)
, , (Intercept)
(Intercept)
Lag 0 1.0000000000
Lag 1 -0.0157652146
Lag 5 0.0094886774
Lag 10 0.0093923394
Lag 50 0.0002389178
> autocorr(m1a.2$VCV)
, , units
units
Lag 0 1.000000000
Lag 1 0.175580402
Lag 5 -0.007972959
Lag 10 -0.011741307
Lag 50 0.003373268
The correlation between successive samples is low for the mean (-0.016) but
a bit high for the variance (0.176). When auto-correlation is high the chain
needs to be run for longer, and this can lead to storage problems for high di-
mensional problems. The argument thin can be passed to MCMCglmm specifying
the intervals at which the Markov chain is stored. In model m1a.2 we specified
thin=1 meaning we stored every iteration (the default is thin=10). I usually
aim to store 1,000-2,000 iterations and have the autocorrelation between suc-
cessive stored iterations less than 0.1.
> plot(m1a.2$Sol)
On the left of Figure 1.11 is a time series of the parameter as the MCMC
iterates, and on the right is a posterior density estimate of the parameter (a
smoothed histogram of the output). If the model has converged there should be
no trend in the time series. The equivalent plot for the variance is a little hard
to see on the original scale, but on the log scale the chain looks good (Figure
1.12):
> plot(log(m1a.2$VCV))
Figure 1.11: Summary plot of the Markov Chain for the intercept. The left plot
is a trace of the sampled posterior, and can be thought of as a time series. The
right plot is a density estimate, and can be thought of as a smoothed histogram
approximating the posterior.
Another problem is a chain which gets ‘stuck’ at some parameter value and cannot escape. This
is usually obvious from the mcmc plots but MCMCglmm will often terminate before
the analysis has finished with an error message of the form:
Figure 1.12: Summary plot of the Markov Chain for the logged variance. The
logged variance was plotted rather than the variance because it was easier to
visualise. The left plot is a trace of the sampled posterior, and can be thought
of as a time series. The right plot is a density estimate, and can be thought of
as a smoothed histogram approximating the posterior.
Figure 1.13: Likelihood surface for the likelihood Pr(y|µ, σ²) in black, and an
MCMC approximation for the posterior distribution Pr(µ, σ²|y) in red. The
likelihood has been normalised so that the maximum likelihood has a value
of one, and the posterior distribution has been normalised so that the poste-
rior mode has a value of one. Flat priors were used (Pr(µ) ∼ N(0, 10^8) and
Pr(σ²) ∼ IW(V = 0, nu = 0)) and so the posterior distribution is equivalent to
the likelihood.
Figure 1.14: Likelihood surface for the likelihood Pr(y|µ, σ²) in black, and an
MCMC approximation for the posterior distribution Pr(µ, σ²|y) in red. The
likelihood has been normalised so that the maximum likelihood has a value of
one, and the posterior distribution has been normalised so that the posterior
mode has a value of one. A non-informative prior was used (Pr(µ) ∼ N(0, 10^8)
and Pr(σ²) ∼ IW(V = 0, nu = −2)).
Chapter 2
GLMM
where the β’s are the unknown coefficients to be estimated, and the variables
in typewriter font are observed predictors. Continuous predictors such as day re-
main unchanged, but categorical predictors are expanded into a series of binary
variables of the form ‘do the data come from 1961, yes or no? ’, ‘do the data
come from 1962, yes or no? ’, and so on for as many years for which there are
data.
It is cumbersome to write out the equation for each data point in this way,
and a more compact way of representing the system of equations is
E[y] = Xβ (2.1)
where X is called a design matrix and contains the predictor information,
and β = [β1 β2 β3 β4]′ is the vector of parameters.
The binary predictors do the data come from 1961, yes or no? and there
was no speed limit, yes or no? do not appear. These are the first factor levels
of year and limit respectively, and are absorbed into the global intercept (β1 )
which is fitted by default in R. Hence the expected number of injuries for the
four combinations (on day zero) are β1 for 1961 with no speed limit, β1 + β2 for
1961 with a speed limit, β1 + β3 for 1962 with no speed limit and β1 + β2 + β3
for 1962 with a speed limit.
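In R the design matrix X implied by a model formula can be inspected directly. A sketch using the Traffic data (from the MASS package) that is analysed below; converting year to a factor is an assumption, but it is consistent with the year1962 contrast that appears in the model summaries:

> library(MASS)
> data(Traffic)
> Traffic$year <- as.factor(Traffic$year)
> head(model.matrix(~ limit + year + day, data = Traffic))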
I is an identity matrix. It has ones along the diagonal, and zeros in the
off-diagonals. The zero off-diagonals imply that the residuals are uncorrelated,
and the ones along the diagonal imply that they have the same variance σe2 . We
could use glm to estimate β and σe2 assuming that y is normally distributed:
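A sketch of such a model (the name m2a.1 matches the residuals shown in Figure 2.1; with no family argument glm fits a Gaussian model):

> m2a.1 <- glm(y ~ limit + year + day, data = Traffic)
> hist(m2a.1$resid)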
but the injuries are count data and the residuals show the typical right skew:
Figure 2.1: Histogram of residuals from model m2a.1 which assumed they fol-
lowed a Gaussian distribution.
It's not extreme, and the conclusions probably won't change, but we could
assume that the data follow some other distribution. In a generalised linear
model the linear predictor is related to some function of the mean response.
This function is called the link function. For example, with a log link we are
trying to predict the logged expectation:

log(E[y]) = Xβ (2.3)

or alternatively

E[y] = exp(Xβ)
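The Poisson model can then be fitted with glm; a sketch whose Call matches the summary shown below:

> m2a.2 <- glm(y ~ limit + year + day, family = poisson, data = Traffic)
> summary(m2a.2)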
Call:
glm(formula = y ~ limit + year + day, family = poisson, data = Traffic)
Deviance Residuals:
Min 1Q Median 3Q Max
-4.1774 -1.4067 -0.4040 0.9725 4.9920
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.0467406 0.0372985 81.685 < 2e-16 ***
limityes -0.1749337 0.0355784 -4.917 8.79e-07 ***
year1962 -0.0605503 0.0334364 -1.811 0.0702 .
day 0.0024164 0.0005964 4.052 5.09e-05 ***
---
Signif. codes:
0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The results look fairly straightforward, having a speed limit reduces the
number of injuries significantly, there are fewer injuries in 1962 (although sig-
nificance is marginal) and there is a significant increase in the number of injuries
over the year. Are these big effects or small effects? The coefficients are on the
log scale so to get back to the data scale we need to exponentiate. The exponent
of the intercept is the predicted number of injuries on day zero in 1961 without
a speed limit:
> exp(m2a.2$coef["(Intercept)"])
(Intercept)
21.04663
To get the prediction for the same day with a speed limit we need to add
the limityes coefficient
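For example, adding the two coefficients before exponentiating:

> exp(m2a.2$coef["(Intercept)"] + m2a.2$coef["limityes"])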
(Intercept)
17.66892
With a speed limit the expected number of injuries is 0.840 times that expected if
there were no speed limit. This value can be obtained more directly:
> exp(m2a.2$coef["limityes"])
limityes
0.8395127
and holds true for any given day in either year. For example, without a
speed limit on the final day of the year (92) in 1961 we expect 24.742 injuries:
(Intercept)
24.74191
(Intercept)
20.77115
The proportional change is identical because the model is linear on the log
scale.
2.3 Over-dispersion
Most count data do not conform to a Poisson distribution because the variance
in the response exceeds the expectation. This is known as over-dispersion and it
is easy to see how it arises, and why it is so common. In the summary to m2a.2
note that the ratio of the residual deviance to the residual degrees of freedom
is 3.162, which means, roughly speaking, there is about 3.2 times as much variation
in the residuals as we would expect under a Poisson distribution.
If the predictor data had not been available to us then the only model we
could have fitted was one with just an intercept:
> m2a.3 <- glm(y ~ 1, data = Traffic, family = "poisson")
> summary(m2a.3)
Call:
glm(formula = y ~ 1, family = "poisson", data = Traffic)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.6546 -1.4932 -0.3378 0.9284 5.0601
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.07033 0.01588 193.3 <2e-16 ***
---
Signif. codes:
0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Call:
glm(formula = y ~ limit + year + day, family = quasipoisson,
data = Traffic)
Deviance Residuals:
Min 1Q Median 3Q Max
-4.1774 -1.4067 -0.4040 0.9725 4.9920
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.046741 0.067843 44.909 < 2e-16 ***
limityes -0.174934 0.064714 -2.703 0.00753 **
year1962 -0.060550 0.060818 -0.996 0.32078
day 0.002416 0.001085 2.227 0.02716 *
---
Signif. codes:
0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
[1] 9.80056e-05
However, perhaps it was Christmas day and everything was under 5 foot of
snow. Although the accidents may have been independent in the sense that all
9 cars didn’t crash into each other, they are non-independent in the sense that
they all happened on a day where the underlying probability may be different
from that underlying any other day (data point).
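The MCMCglmm call that produced the model m2a.5 discussed below is not shown in this extract; a minimal sketch consistent with the text (pl = TRUE is mentioned later):

> m2a.5 <- MCMCglmm(y ~ limit + year + day, family = "poisson",
+     data = Traffic, pl = TRUE, verbose = FALSE)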
The element Sol contains the posterior distribution of the coefficients of the
linear model, and we can plot their marginal distributions:
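The marginal distributions shown in Figure 2.2 can be drawn with coda's plot method:

> plot(m2a.5$Sol)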
Notice that the year1962 coefficient has a high posterior density around
zero, in agreement with the over-dispersed glm model, and that in general the
estimates for the two models are broadly similar. This agreement is superficial.
l = η + e (2.6)

where l is a vector of latent variables (log(E[y]) in this case) and eta (η) is the
usual symbol for the linear predictor (Xβ). The data we observe are assumed
to be Poisson variables with expectation equal to the exponentiated latent vari-
ables:

y ∼ Pois(exp(l)) (2.7)
Note that the latent variable does not exactly predict y, as it would if the data
were Gaussian, because there is additional variability in the Poisson process. In
the call to MCMCglmm I specified pl=TRUE to indicate that I wanted to store the
posterior distributions of latent variables. This is not usually necessary and
can require a lot of memory (we have 1000 realisations for each of the 182 data points).
Figure 2.2: MCMC summary plot for the coefficients from a Poisson glm (model
m2a.5).
However, as an example, we can obtain the posterior mean residual for
data point 92 which is the data from day 92 in 1961 when there was no speed
limit:
> lat92 <- m2a.5$Liab[, 92]
> eta92 <- m2a.5$Sol[, "(Intercept)"] + m2a.5$Sol[,
+ "day"] * Traffic$day[92]
> resid92 <- lat92 - eta92
> mean(resid92)
[1] -0.1417317
This particular day has a negative expected residual indicating that the prob-
ability of getting injured was less than expected for this particular realisation
of that day in that year. If that particular day could be repeated it does not
necessarily mean that the actual number of injuries would always be less than
expected, because it would follow a Poisson distribution with rate parameter
λ =exp(lat92)=21.920. In fact there would be a 21.767% chance of having
more injuries than if the residual had been zero:
Figure 2.3: MCMC summary plot for the residual (units) variance from a
Poisson glm (model m2a.5). The residual variance models any over-dispersion,
and a residual variance of zero implies that the response conforms to a standard
Poisson.
The forces that created this residual were only realised on day 92 in 1961;
however, we could ask hypothetically what would happen if those forces were present
on another day. Figure 2.4 plots the first 92 residuals as a function of day (red lines)
as scatter around the expectation on the log scale (solid black line). Each resid-
ual is only realised once, and the black dashed line shows the hypothetical prediction
for other days had they had the same residual value as day 92 (resid92).
Figure 2.4: The predicted number of injuries on the log scale (left) and data
scale (right) as a function of the continuous covariate day for 1961 without a
speed limit. In order to highlight a point, the slope of the plotted relationship
is an order of magnitude steeper than the model m2a.5 estimate. The solid
black line is the value of the linear predictor, and the red dashed lines represent
noise around the linear predictor. Each dashed line is a residual from the model,
which is only observed for a particular data point. The vertical distance between
the black dot and the solid black line is the observed residual on day 92. The
black dashed line is the predicted value of a data point observed on other days
but with the same residual value. All lines are parallel and linear on the log
scale, but this is not the case on the data scale.
In the Traffic example the non-linearities are small so the differences in
parameter estimates are not large using either multiplicative or additive models.
However, exponentiating the m2a.5 intercept after adding half the residual variance
gives a prediction in closer agreement with the quasipoisson model than the raw intercept:
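A sketch of this back-transformation (the exact call is not shown in this extract; posterior means are used here):

> exp(mean(m2a.5$Sol[, "(Intercept)"]) + mean(m2a.5$VCV[, "units"])/2)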
[1] 20.92171
Figure 2.5: The hypothetical distribution for the number of injuries on the log
scale (left) and data scale (right) for day 92 in 1961 without a speed limit. These
can be viewed as vertical slices from Figure 2.4 on day 92. On the log scale the
distribution is assumed to be normal and so the residuals are symmetrically
distributed around the linear predictor. As a consequence the linear predictor
(η) is equal to the mean, median and mode of the distribution on the log scale.
Because the exponential function is non-linear this symmetry is lost on the data
scale, and the different measures of central tendency do not coincide. Since the
residuals are normal on the log scale, the distribution on the data scale is log-
normal and so analytical solutions exist for the mean, mode and median. σ² is
the residual variance.
[1] 19.89901
> exp(m2a.3$coef["(Intercept)"])
(Intercept)
21.54891
Analytical results for these transformations can be obtained for the Poisson
log-normal, but for other distributions this is not always the case. Section 5.2
gives prediction functions for other types of distribution. One could reasonably
ask, why have this additional layer of complexity, why not just stick with the
multiplicative model? This brings us to random effects.
2.4 Random effects

When we treat an effect as fixed we believe that the only information regard-
ing its value comes from data associated with that particular level. If we treat
an effect as random we also use this information, but we weight it by what other
data tell us about the likely values that the effects could take. In a Bayesian
analysis this additional information could come from data not formally included
in the analysis, in which case it would be called a prior. In hierarchical models
this additional information comes from data associated with other factor levels
of the same type.
It is common to hear things like ‘year is a random effect’ as if you just have
to estimate a single effect for all years. It is also common to hear things like
‘years is random’ as if years were sampled at random. Better to say year effects
are random and understand that it is the effects that are random not the years,
and that we’re trying to estimate as many effects as there are years. In this
sense they’re the same as fixed effects, and we can easily treat the year effects
as random to see what difference it makes.
random = ~ year
although we don’t need anything to the left of the ∼ because the response
is known from the fixed effect specification. In addition, the global intercept is
suppressed by default, so in fact this specification produces the design matrix:
year1961 year1962
1 1 0
2 1 0
184 0 1
Earlier I said that there was no distinction between fixed and random effects
in a Bayesian analysis - all effects are random - so let's not make the distinction
and combine the design matrices (W = [X, Z]) and combine the vectors of
parameters (θ = [β′, u′]′).
You will notice that this new design matrix is exactly equivalent to the
original design matrix X except we have one additional variable year1961. In
our first model this variable was absorbed into the global intercept because it
could not be uniquely estimated from the data. What has changed that could
make this additional parameter estimable? As is usual in a Bayesian analysis,
if there is no information in the data it has to come from the prior. In model
m2a.5 we used the default normal prior for the fixed effects with means of zero,
large variances of 10^8, and no covariances. Let's treat the year effects as random,
but rather than estimate a variance component for them we'll fix the variance
at 10^8 in the prior:
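A sketch of such a model (the fixed formula, with year moved to the random term, and the use of fix = 1 to hold the year variance at 10^8 are assumptions consistent with Figure 2.6 and MCMCglmm's prior specification; pr = TRUE is mentioned below):

> prior <- list(G = list(G1 = list(V = 1e+08, fix = 1)), R = list(V = 1, nu = 0.002))
> m2a.6 <- MCMCglmm(y ~ limit + day, random = ~year, family = "poisson",
+     data = Traffic, prior = prior, pr = TRUE, verbose = FALSE)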
The estimates for the intercept, day and the effect of a speed limit now appear
completely different (Figure 2.6). However, in the original model (m2a.5) the
prediction for each year is obtained from the intercept (1961) and the intercept
plus the year1962 contrast (1962), whereas for this model we have to add the
intercept to both random effects to get the year predictions. MCMCglmm does not
store the posterior distribution
of the random effects by default, but because we specified pr=TRUE, the whole
of θ is stored rather than just β:
We can merge the two posterior distributions to see how they compare:
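A sketch of one way to do this: build the year predictions from each model and overlay the two chains (the column names year.1961 and year.1962 follow the usage later in this section; coda is loaded with MCMCglmm):

> year.fixed <- cbind(m2a.5$Sol[, "(Intercept)"],
+     m2a.5$Sol[, "(Intercept)"] + m2a.5$Sol[, "year1962"])
> year.random <- cbind(m2a.6$Sol[, "(Intercept)"] + m2a.6$Sol[, "year.1961"],
+     m2a.6$Sol[, "(Intercept)"] + m2a.6$Sol[, "year.1962"])
> colnames(year.fixed) <- colnames(year.random) <- c("1961", "1962")
> plot(mcmc.list(mcmc(year.fixed), mcmc(year.random)))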
Figure 2.6: MCMC summary plots for the intercept, speed limit and day coef-
ficients from model m2a.6 where year effects were treated as random. Note the
high posterior variance for the intercept.
The posterior distributions are very similar (Figure 2.7; but see Section 2.7
for why they are not identical), highlighting the fact that effects that are fixed
are those associated with a variance component which has been set a priori to
something large (10^8 in this case), whereas effects that are random are associated
with a variance component which is not set a priori but is estimated from
the data. As the variance component tends to zero then no matter how many
random effects there are, we are effectively only estimating a single parameter
(the variance). This makes sense, if there were no differences between years we
only need to estimate a global intercept and not separate effects for each year.
Alternatively if the variance is infinite then we need to estimate separate effects
for each year. In this case the intercept is confounded with the average value
of the random effect, resulting in a wide marginal distribution for the intercept,
and strong posterior correlations between the intercept and the mean of the
random effects:
> plot(c(m2a.6$Sol[, "year.1961"] + m2a.6$Sol[,
+ "year.1962"])/2, c(m2a.6$Sol[, "(Intercept)"]))
Figure 2.7: MCMC summary plots for the year effects from a model where year
effects were treated as fixed (black) and where they were treated as random
(red) but with the variance component set at a large value rather than being
estimated. The posterior distributions are virtually identical.
With only two levels, there is very little information to estimate the variance,
and so we would often make the a priori decision to treat year effects as fixed,
and fix the variance components to something large (or infinity in a frequentist
analysis).
At the moment we have day as a continuous covariate, but we could also have
random day effects and ask whether the number of injuries on the same day but
in different years are correlated. Rather than fixing the variance component at
something large, we’ll use the same weaker prior that we used for the residual
variance:
> Traffic$day <- as.factor(Traffic$day)
> prior <- list(R = list(V = 1, nu = 0.002), G = list(G1 = list(V = 1,
+ nu = 0.002)))
> m2a.7 <- MCMCglmm(y ~ year + limit + as.numeric(day),
+     random = ~day, family = "poisson", data = Traffic, prior = prior)
Figure 2.8: Joint posterior distribution of the intercept and the mean of the two
random year effects. The variance component associated with year was fixed at
a large value (10^8) and so the effects are almost completely confounded.
day has also gone in the fixed formula, but as a numeric variable, in order to
capture any time trends in the number of injuries. Most of the over-dispersion
seems to be captured by fitting day as a random term (Figure 2.9):
> plot(m2a.7$VCV)
In fact it explains so much that the residual variance is close to zero and
mixing seems to be a problem. The chain would have to be run for longer, and
perhaps an alternative prior specification used.
Figure 2.9: MCMC summary plot of the variance component associated with day
(top) and the residual variance component (below). The trace for the residual
variance shows strong autocorrelation and the chain needs to be run for longer.
Often it is important to get the expectation after marginalis-
ing residuals, and indeed after marginalising other random effects. For example
we may not be so interested in knowing the expected number of injuries on the
average day, but knowing the expected number of injuries on any random day.
u.1 u.2
success 1 2
failure 4 3
u.1 u.1 u.1 u.1 u.1 u.2 u.2 u.2 u.2 u.2
[1,] 1 0 0 0 0 1 1 0 0 0
If the original data are already binary then there is no information to mea-
sure how repeatable trials are within a binomial unit because we only have a
single trial per observation. This does not necessarily mean that heterogeneity
in the underlying probabilities does not exist, only that we can’t estimate it.
Imagine we are in a room of 100 people and we are told that 5% of the people
will be dead the following day. If the people in the room were a random sample
from the UK population I would worry - I probably have a 5% chance of dying.
If on the other hand the room was a hospital ward and I was a visitor, I may not
worry too much for my safety. The point is that in the absence of information,
the binary data look the same if each person has a 5% chance of dying or if 5
people have a 100% chance of dying. Most programs set the residual variance
to zero and assume the former, but it is important to understand that this is a
convenient but arbitrary choice. Given this, it is desirable that any conclusions
drawn from the model do not depend on this arbitrary choice. Worryingly, both
the location effects (fixed and random) and variance components are completely
dependent on the magnitude of the residual variance.
To demonstrate we will use some data from a pilot study on the Indian
meal moth (Plodia interpunctella) and its granulosis virus (PiGV) collected by
Hannah Tidbury & Mike Boots at the University of Sheffield.
> data(PlodiaRB)
The data are taken from 874 moth pupae for which the Pupated variable is
zero if they failed to pupate (because they were infected with the virus) or one
if they successfully pupated. The 874 individuals are spread across 49 full-sib
families, with family sizes ranging from 6 to 38.
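The model calls themselves are not shown in this extract. A minimal sketch of the first model, with the residual variance fixed at one (the object names are hypothetical, and Pupated and FSfamily are assumed to be the response and family columns of PlodiaRB):

> prior1 <- list(R = list(V = 1, fix = 1), G = list(G1 = list(V = 1, nu = 0.002)))
> m2b.1 <- MCMCglmm(Pupated ~ 1, random = ~FSfamily, family = "categorical",
+     data = PlodiaRB, prior = prior1, verbose = FALSE)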
and then fit a second model where the residual variance is fixed at 2:
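A matching sketch of the second model, changing only the fixed residual variance:

> prior2 <- list(R = list(V = 2, fix = 1), G = list(G1 = list(V = 1, nu = 0.002)))
> m2b.2 <- MCMCglmm(Pupated ~ 1, random = ~FSfamily, family = "categorical",
+     data = PlodiaRB, prior = prior2, verbose = FALSE)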
The posterior distribution for the intercept differs between the two models
(see Figure 2.10):
Figure 2.10: MCMC summary plots for the intercept of a binary GLMM where
the residual variance was fixed at one (black) and two (red).
Figure 2.11: MCMC summary plots for the between family variance component
of a binary GLMM where the residual variance was fixed at one (black) and two
(red).
Using the approximation due to ? described earlier we can also rescale the
estimates by the estimated residual variance (σ²units) in order to obtain the
posterior distributions of the parameters under the assumption that the actual
residual variance (σ²e) is equal to some other value. For location effects the pos-
terior distribution needs to be multiplied by √((1 + c²σ²e)/(1 + c²σ²units)) and
for the variance components the posterior distribution needs to be multiplied by
(1 + c²σ²e)/(1 + c²σ²units), where c is some constant that depends on the link
function. For the probit c = 1 and for the logit c = 16√3/(15π).
Figure 2.12: MCMC summary plots for the intra-family correlation from a bi-
nary GLMM where the residual variance was fixed at one (black) and two (red).
We can obtain estimates under the assumption that σ²e = 0:
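A sketch of this rescaling for the first model (hypothetical object names as above; c is the logit constant just defined):

> c2 <- (16 * sqrt(3)/(15 * pi))^2
> Int.0 <- m2b.1$Sol/sqrt(1 + c2 * m2b.1$VCV[, "units"])
> VCV.0 <- m2b.1$VCV[, "FSfamily"]/(1 + c2 * m2b.1$VCV[, "units"])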
The posteriors should be virtually identical under a flat prior (See Figure
2.13) although with different priors this is not always the case. Remarkably, ?
show that by leaving a diffuse prior on σ²units and rescaling the estimates each
iteration, a Markov chain with superior mixing and convergence properties can
be obtained (see Section 8).
It should also be noted that a diffuse prior on the logit scale is not necessarily
weakly informative on the probability scale. For example, the default setting
for the prior on the intercept is N(0, 10^8) on the logit scale, which, although
relatively flat across most of the probability scale, has a lot of density close to
zero and one:
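For example, drawing from the prior and transforming to the probability scale:

> hist(plogis(rnorm(1000, 0, sqrt(1e+08))))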
Figure 2.13: MCMC summary plots for the expected proportion of caterpillars
pupating from a binary GLMM where the residual variance was fixed at one
(black) and two (red).
This diffuse prior can cause problems if there is complete (or near complete)
separation. Generally this happens when the binary data associated with some
level of a categorical predictor are all success or all failures. For example, imagine
we had 50 binary observations from an experiment with two treatments, for the
first treatment the probability of success is 0.5 but in the second it is only one
in a thousand:
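A sketch of such a simulation (the data frame name data.bin matches the glm call below; the exact call and random seed are assumptions):

> data.bin <- data.frame(treatment = gl(2, 25),
+     y = rbinom(50, 1, prob = rep(c(0.5, 0.001), each = 25)))
> table(data.bin)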
y
treatment 0 1
1 15 10
2 25 0
Call:
glm(formula = y ~ treatment, family = "binomial", data = data.bin)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.01077 -1.01077 -0.00008 -0.00008 1.35373
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.4055 0.4082 -0.993 0.321
treatment2 -19.1606 2150.8026 -0.009 0.993
the effect of treatment does not appear significant despite the large effect
size. This is in direct contrast to an exact binomial test:
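A sketch of such a test for treatment 2 (0 successes out of 25 trials):

> binom.test(0, 25)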
where the 95% confidence interval for the probability of success is 0.000 to
0.137.
The default MCMCglmm model also behaves oddly (see Figure 2.15):
For these types of problems, I usually remove the global intercept (-1) and
use the prior N(0, σ²units + π²/3) because this is reasonably flat on the proba-
bility scale when a logit link is used. For example,
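a sketch of this approach (the object names are hypothetical; with the residual variance fixed at 1, the prior variance for each treatment effect is 1 + π²/3):

> prior.sep <- list(B = list(mu = c(0, 0), V = diag(2) * (1 + pi^2/3)),
+     R = list(V = 1, fix = 1))
> m.sep <- MCMCglmm(y ~ treatment - 1, family = "categorical", data = data.bin,
+     prior = prior.sep, verbose = FALSE)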
Figure 2.15: MCMC summary plots for the intercept and treatment effect in
a binary GLM. In treatment 2 all 25 observations were failures and so the ML
estimator on the probability scale is zero and −∞ on the logit scale. With a flat
prior on the treatment effect the posterior distribution is improper, and with a
diffuse prior (as used here) the posterior is dominated by the high prior densities
at extreme values.
looks a little better (see Figure 2.16), and the posterior distribution for the probability of success in treatment 2 is consistent with the exact binomial test for which the 95% CI were (0.000 - 0.137). With such a simple model, the prediction for observation 26 is equal to the treatment 2 effect and so we can get the credible interval (on the data scale) for treatment 2 using the predict function:
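A hedged sketch of that call, using the model object assumed above (observation 26 being the first observation in treatment 2):

predict(m2b, interval = "confidence")[26, ]    # fit and 95% credible interval on the data scale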
Figure 2.16: MCMC summary plots for the intercept and treatment effect in
a binary GLM. In treatment 2 all 25 observations were failures and so the ML
estimator on the probability scale is zero and −∞ on the logit scale. A flat
prior on the probability scale was used and the posterior distribution is better
behaved than if a flat prior on the logit scale had been used (see Figure 2.15).
Remembering the identity $\sigma^2_{(a+b)} = \sigma^2_{a} + \sigma^2_{b} + 2\sigma_{a,b}$, this implies:
$$\begin{bmatrix} \beta_{1961}\\ \beta_{1962} \end{bmatrix} = \begin{bmatrix} \beta_{(Intercept)}\\ \beta_{(Intercept)} + \beta_{year1962} \end{bmatrix} \sim N\left(\mathbf{0},\ \begin{bmatrix} 10^{8} & 10^{8}\\ 10^{8} & 2\times 10^{8} \end{bmatrix}\right) \qquad (2.18)$$
where $\beta_{1961}$ and $\beta_{1962}$ are the actual year effects, rather than the global intercept and the contrast. In hindsight this is a bit odd; for one thing, we expect the 1962 effect to be twice as variable as the 1961 effect. With such weak priors it makes little difference, but let's reparameterise the model anyway.
Rather than having a global intercept and a year contrast, we will have
separate intercepts for each year:
      year1961 year1962
1            1        0
2            1        0
...
184          0        1
and a prior that has a covariance between the two year effects:
[,1] [,2]
[1,] 1e+08 5e+07
[2,] 5e+07 1e+08
has the same form as a mixed effect model with a prior variance of $10^{8}/2$ for the intercept, and the variance component associated with the random year effects also fixed at $10^{8}/2$:
This arises because the two random effects have the joint prior distribution:
$$\begin{bmatrix} \beta_{year.1961}\\ \beta_{year.1962} \end{bmatrix} \sim N\left(\begin{bmatrix}0\\0\end{bmatrix},\ \begin{bmatrix} \frac{10^{8}}{2} & 0\\ 0 & \frac{10^{8}}{2} \end{bmatrix}\right) \qquad (2.19)$$
which, when combined with the prior for the intercept, $N(0, 10^{8}/2)$, gives:
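Spelling out the algebra implied by the identity above: each year effect then has prior variance $10^{8}/2 + 10^{8}/2 = 10^{8}$, and the two year effects share the intercept variance as a covariance, $10^{8}/2 = 5\times 10^{7}$, which recovers the prior covariance matrix specified earlier for the separate year intercepts:
$$\textrm{VAR}\begin{bmatrix} \beta_{(Intercept)} + \beta_{year.1961}\\ \beta_{(Intercept)} + \beta_{year.1962} \end{bmatrix} = \begin{bmatrix} 10^{8} & 5\times 10^{7}\\ 5\times 10^{7} & 10^{8} \end{bmatrix}$$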
The model:

Chapter 3

Categorical Random Interactions
> data(BTdata)
The data are morphological measurements (tarsus length and back colour)
made on 828 blue tit chicks from 106 mothers (dam). Half the offspring from
each mother were swapped with half the offspring from another mother soon
after hatching. The nest they were reared in is recorded as fosternest.
The model fits sex as a fixed effect, and dam and fosternest as random effects.
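A sketch of such a call (the prior object and its values are assumptions; V=1, nu=0.002 is the weakly informative setting used elsewhere in these notes):

prior3a.1 <- list(R = list(V = 1, nu = 0.002),
                  G = list(G1 = list(V = 1, nu = 0.002),
                           G2 = list(V = 1, nu = 0.002)))
m3a.1 <- MCMCglmm(tarsus ~ sex, random = ~dam + fosternest,
                  data = BTdata, prior = prior3a.1, verbose = FALSE)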
> diag(autocorr(m3a.1$VCV)[2, , ])
> plot(m3a.1$VCV)
> effectiveSize(m3a.1$VCV)
Figure 3.1: MCMC summary plot for the variance components from model
m3a.1.
The variance components will often show negative posterior correlations because the total variance is being divided up. Imagine cutting a piece of string: making one bit longer has to reduce the size of the other bits, by necessity. If we hadn't experimentally manipulated the birds then all chicks with the same mother would be raised in the same nest, and there would be no information in the data to separate these terms. In this case the posterior correlation between these parameters would approach -1 as the prior information goes to zero.
The lower 95% credible interval for the fosternest variance is low
> HPDinterval(m3a.1$VCV)
lower upper
dam 0.13766735 0.3101992
fosternest 0.01904245 0.1300758
units 0.51064413 0.6352781
attr(,"Probability")
[1] 0.95
[1] 2014.776
> m3a.1$DIC
[1] 1991.29
The tarsus lengths were standardised prior to analysis - this is not recom-
mended, but was done in the original analyses of these data (?) so that com-
parisons would be scale invariant. The original analyses were done in REML
where it is hard to get accurate confidence intervals for functions of variance
components. With MCMC procedures this is simple. For example if we want
to know what proportion of the total variance is explained by dams
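we can compute this directly from the posterior samples; a sketch (the VCV column names follow the HPDinterval output above):

prop.dam <- m3a.1$VCV[, "dam"]/rowSums(m3a.1$VCV)
HPDinterval(as.mcmc(prop.dam))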
lower upper
var1 0.1792323 0.3452472
attr(,"Probability")
[1] 0.95
One nice thing about standardised data, though, is that effect sizes are immediately apparent. For example, fixed effects are in standard deviation units and the sex effects are non-trivial:
> summary(m3a.1)
Iterations = 3001:12991
Thinning interval = 10
Sample size = 1000
DIC: 1991.29
G-structure: ~dam
~fosternest
R-structure: ~units
Given that the sexes differ in their mean phenotype it may be worth exploring
whether they vary in other ways. For example, perhaps there are sex-limited
genes that mean that related brothers resemble each other more than they do
their sisters. Perhaps females are less sensitive to environmental variation? To
fit these models it will be necessary to understand how the variance functions,
such as us() and idh(), work. We could refit the model m3a.1 using the random
effect specifications:
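The specifications meant here are presumably of the form (a sketch of the random-effect formulas only):

random = ~us(1):dam + us(1):fosternest
random = ~idh(1):dam + idh(1):fosternest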
and these would give exactly the same answer as the model specified as ~dam + fosternest. The term inside the brackets is a model formula and is interpreted exactly as you would interpret any R formula, except that the intercept is not fitted by default. These formulae are therefore fitting an intercept which is interacted with the random effects. For the dam terms we can get a representation of the interaction for the first few levels of dam:
> levels(BTdata$dam)[1:5]
Across the top, we have the original dam effects in red, and along the side
we have the term defined by the variance structure formula (just the intercept
in this case). The interaction forms a new set of factors. Although they have
different names from the original dam levels, it is clear that there is a one to
one mapping between the original and the new factor levels and the models are
therefore equivalent. For more complex interactions this is not the case.
We could also fit sex in the variance structure model (i.e. us(sex):dam or idh(sex):dam)¹:

which creates three times as many random factors, one associated with offspring of each sex for each dam.

¹ The model formula inside the variance function is essentially ~sex - 1. To add the global intercept, us(1+sex):dam could be fitted, but this can be harder to interpret because the effects are then Fem, Male-Fem and UNK-Fem. If a us structure is fitted, the two models are equivalent reparameterisations of each other, although the priors have to be modified accordingly. This is not the case if the variance function is idh; in that case the sex-specific variances are allowed to vary as before, but a constant covariance equal to $\sigma^2_{Fem}$ is also assumed.
We may want to allow the variance in the effects to be different for each row of factors; i.e. does the identity of a chick's mother explain different amounts of variation depending on the sex of the chick? We can fit this model using the idh function and represent our belief in how the effects are distributed as a 3 × 3 covariance matrix V:
$$\mathbf{V}_{dam} = \begin{bmatrix} \sigma^2_{Female} & 0 & 0\\ 0 & \sigma^2_{Male} & 0\\ 0 & 0 & \sigma^2_{UNK} \end{bmatrix}$$
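A sketch of the corresponding call (the prior object is an assumption; the marginal prior on each dam variance, V=1 and nu=0.002, is the one referred to in the caption of Figure 3.2):

prior3a.3 <- list(R = list(V = 1, nu = 0.002),
                  G = list(G1 = list(V = diag(3), nu = 0.002),
                           G2 = list(V = 1, nu = 0.002)))
m3a.3 <- MCMCglmm(tarsus ~ sex, random = ~idh(sex):dam + fosternest,
                  data = BTdata, prior = prior3a.3, verbose = FALSE)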
The sex specific variances for males and females look pretty similar, but the
sex-specific variance for birds with unknown sex is not behaving well. This is
not that surprising given that there are only 47 birds with unknown sex and
these tend to be thinly spread across dams. This variance component is likely
to be dominated by the prior, but for now we’ll leave the model as it is and
come back to some possible alternative solutions later.
We can extract the marginal means for each variance and place them into a
matrix:
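A sketch of that step (the VCV column names follow the axis labels of Figure 3.3):

Vdam.hat <- diag(colMeans(m3a.3$VCV[, c("sexFem.dam", "sexMale.dam", "sexUNK.dam")]))
Vdam.hat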
Note that they are in general less than the marginal mean of the dam variance in model m3a.1 (0.224), where a sex interaction was not fitted. Because the dam effects are assumed to be multivariate normal we can plot an ellipsoid that completely represents their distribution (you can rotate the figure in R):
Figure 3.2: MCMC summary plot for the sex-specific dam variance components
from model m3a.3. The number of chicks with unknown (UNK) sex is low, with
very little replication within dams. The posterior distribution for the UNK vari-
ance component is dominated by the prior which has a marginal distribution of
V=1 and nu=0.002.
If we had measured the offspring of a lot of dams, and for each dam we had
measured a very large number of offspring of each sex, then we could calculate
the average tarsus lengths within a nest for males, females and unknowns sep-
arately. If we produced a scatter plot of these means the data would have the
same shape as this ellipsoid and 95% of the data would lie inside.
Figure 3.3: Ellipsoid that circumscribes 95% of the expected dam effects as
estimated in model m3a.3. This can be thought of as a scatter plot of the dam
effects between each sex, if the dam effects could be directly measured. Because
the covariances of the dam effects between the sexes were set to zero the axes
of the ellipsoids are all parallel to the figure axes.
$$\mathbf{V}_{dam} = \begin{bmatrix} \sigma^2_{Female} & \sigma_{Female,Male} & \sigma_{Female,UNK}\\ \sigma_{Female,Male} & \sigma^2_{Male} & \sigma_{Male,UNK}\\ \sigma_{Female,UNK} & \sigma_{Male,UNK} & \sigma^2_{UNK} \end{bmatrix}$$
We will now use a prior for the covariance matrix where nu=4 (1 more than the dimension of V) and the prior covariance matrix is a diagonal matrix with small variances. This may seem surprising, but the motivation is laid out in Section 3.6.3:
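A sketch of that prior and model (the exact small variances on the diagonal, and the object names, are assumptions):

prior3a.4 <- list(R = list(V = 1, nu = 0.002),
                  G = list(G1 = list(V = diag(3) * 0.02, nu = 4),
                           G2 = list(V = 1, nu = 0.002)))
m3a.4 <- MCMCglmm(tarsus ~ sex, random = ~us(sex):dam + fosternest,
                  data = BTdata, prior = prior3a.4, verbose = FALSE)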
The posterior mean (co)variances for this model show that the covariances are almost the same magnitude as the variances, suggesting strong correlations:
Figure 3.4: Ellipsoid that circumscribes 95% of the expected dam effects as
estimated in model m3a.4. This can be thought of as a scatter plot of the dam
effects between each sex, if the dam effects could be directly measured. The
correlations of the dam effects between the sexes were estimated and found to be
close to one, and the sex-specific variances were all roughly equal in magnitude.
Consequently the major axis of the ellipsoid lies at 45° to the figure axes.
Figure 3.5: MCMC summary plot for the between sex correlations in dam effects
from model m3a.4.
> m3a.4$DIC
[1] 1997.34
> m3a.1$DIC
[1] 1991.29
To be completed ....
lmer                  MCMCglmm/asreml   No. parameters   Variance structure                                              Correlation structure
(1|dam)               dam               1                every element V                                                 every element 1
(sex-1|dam)           us(sex):dam       6                V11, V22, V33 on the diagonal; C12, C13, C23 off-diagonal       r12, r13, r23 off-diagonal
(1|sex:dam)           sex:dam           1                V on the diagonal; 0 off-diagonal                               0 off-diagonal
(1|dam)+(1|sex:dam)   dam+sex:dam       2                V1+V2 on the diagonal; V1 off-diagonal                          all off-diagonals equal (r)
-                     idh(sex):dam      3                V11, V22, V33 on the diagonal; 0 off-diagonal                   0 off-diagonal
-                     corg(sex):dam     3                1 (fixed) on the diagonal; r12, r13, r23 off-diagonal           r12, r13, r23 off-diagonal
-                     corgh(sex):dam    3                V11, V22, V33 (fixed) on the diagonal; rij*sqrt(Vii*Vjj) off    r12, r13, r23 off-diagonal

Table 3.1: Different random effect specifications in lmer, MCMCglmm and asreml. sex is a factor with three levels so the resulting matrix is 3 × 3. Continuous variables can also go on the LHS of the pipe, or within the variance structure functions (e.g. us, idh). In this case the associated parameters are regression coefficients for which a variance is estimated. For example, if the chicks were of different ages (or we'd measured the same chicks at different ages) we may want to see if the growth rate is more similar for chicks raised by the same mother. (1+age|dam) or us(1+age):dam estimates a 2 × 2 matrix which includes the variance in intercepts (when age=0), the variance in slopes, and the covariance that exists between them. Parameters marked as fixed are set in the prior rather than estimated.
∼ IW (nu∗ =-2, V∗ = 0)
In most cases correlation matrices do not have known form and so cannot be
directly Gibbs sampled. MCMCglmm uses a method proposed by ? with the target
prior as in ?. Generally this algorithm is very efficient as the Metropolis-Hastings
acceptance probability only depends on the degree to which the candidate prior
and the target prior (the prior you specify) conflict. The candidate prior is
equivalent to the prior in ? with nu=0 so as long as a diffuse prior is set, mixing
is generally not a problem. If nu=0 is set (the default) then the Metropolis-
Hastings steps are always accepted resulting in Gibbs sampling. However, a
prior of this form puts high density on extreme correlations which can cause
problems if the data give support to correlations in this region.
2 IMPORTANT: In versions < 2.05 priors on each variance of an idh structure were dis-
tributed as IW (nu∗ =nu-dim(V)+1, V∗ = V[1,1]) but this was a source of confusion and was
changed.
3 In versions < 2.18 cor fitted what is now a corg structure. The reason for the change is
to keep the asreml and MCMCglmm syntax equivalent. However, the corgh structure in asreml
is a reparameterised us structure whereas in MCMCglmm the variances are fixed in the prior.
Chapter 4
Continuous Random
Interactions
us(sex):dam
The term entering into the variance function model was categorical, and we
saw that by fitting the interaction we were essentially estimating the parameters
of the covariance matrix:
$$\mathbf{V}_{dam} = \begin{bmatrix} \sigma^2_{Female} & \sigma_{Female,Male} & \sigma_{Female,UNK}\\ \sigma_{Female,Male} & \sigma^2_{Male} & \sigma_{Male,UNK}\\ \sigma_{Female,UNK} & \sigma_{Male,UNK} & \sigma^2_{UNK} \end{bmatrix}$$
We are also free to define the variance function model with continuous covari-
ates, or even a mixture of continuous and categorical factors, and the resulting
covariance matrix is interpreted in the same way.
> data(ChickWeight)
The data consist of body weights (weight) for 50 chicks (Chick) measured
up to 12 times over a 3 week period. The variable Time is the number of days
since hatching, and Diet is a four level factor indicating the type of protein diet
the chicks received.
Figure 4.1: Weight data of 50 chicks from hatching until three weeks old.
We’ve saved the random chick effects so we can plot the predicted growth
functions for each bird. For now we will just predict the growth function as-
suming that all birds were on Diet 1 (the intercept):
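A sketch of the kind of model meant here (the fixed-effect formula and the use of the default prior are assumptions; pr = TRUE stores the posterior of the random chick effects in the Sol element so that they can be used for prediction):

m4a.1 <- MCMCglmm(weight ~ Diet + poly(Time, 2, raw = TRUE), random = ~Chick,
                  pr = TRUE, data = ChickWeight, verbose = FALSE)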
Figure 4.3: Weights of each chick as a function of age in blue, with the predicted
weights in purple. A quadratic population growth curve was fitted with random
chick intercepts.
The predictions don’t look that bad, but you will notice that for some chicks
(e.g. 13,19,34) the slope of the predicted growth seems either to shallow, or
too steep. To account for this we can start by fitting us(1+time):Chick. The
linear model inside the variance function has two parameters, an intercept (1)
and a regression slope associated with Time which define the set of interactions:
Each chick now has an intercept and a slope, and because we have used the
us function we are estimating the 2 × 2 matrix:
$$\mathbf{V}_{Chick} = \begin{bmatrix} \sigma^2_{(Intercept)} & \sigma_{(Intercept),Time}\\ \sigma_{(Intercept),Time} & \sigma^2_{Time} \end{bmatrix}$$
$\sigma^2_{(Intercept)}$ is the amount of variation in intercepts between chicks, and $\sigma^2_{Time}$ is the amount of variation in the regression slopes between chicks. If the idh function had been used the covariance would have been set to zero and we could have interpreted variation in intercepts as variation in overall size, and variation in slopes as variation in growth rate. However, there is often covariance between intercepts and slopes and it is usually a good idea to use the us function and estimate them (see Section 4.3). We shall do so:
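A sketch of the first-order random regression (fixed effects as in the earlier sketch; the prior is omitted):

m4a.2 <- MCMCglmm(weight ~ Diet + poly(Time, 2, raw = TRUE),
                  random = ~us(1 + Time):Chick, pr = TRUE,
                  data = ChickWeight, verbose = FALSE)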
The traces look OK-ish for the chick (co)variance matrices (Figure 4.4), but notice that the estimate of the intercept-slope correlation is close to the boundary of parameter space (-1):
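The correlation can be formed from the sampled (co)variances; a sketch (the column order of the us structure is assumed):

int.slope.cor <- m4a.2$VCV[, 2]/sqrt(m4a.2$VCV[, 1] * m4a.2$VCV[, 4])
posterior.mode(as.mcmc(int.slope.cor))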
var1
-0.9817454
> autocorr(int.slope.cor)
, , 1
[,1]
Lag 0 1.000000000
Lag 10 0.198725570
Lag 50 0.002783530
Lag 100 0.005720720
Lag 500 -0.001004745
and we should run it for longer in order to sample the posterior adequately. For now we will carry on and obtain the predictions from the model we ran, but using the predict function rather than doing it 'by hand':
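The call is presumably of this form (the same expression appears as the axis label of Figure 4.5):

pred.2 <- predict(m4a.2, marginal = NULL)    # keep the chick-specific effects in the prediction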
Figure 4.4: MCMC summary plots for the chick covariance components from
model m4a.2. The lower and upper plots are the intercept and slope variance
components respectively, and the middle two plots are the intercept-slope co-
variance.
and we can see that the fit is much better (See Figure 4.5). In theory we
could fit higher order random regressions (data and prior permitting) and use
something like DIC to choose which is the best compromise between the fit
of the model to the data and how many effective parameters were fitted. For
example we could go from the 1st order random regression to a 2nd order model:
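A sketch of the second-order model (the random term is inferred from the description in the caption of Figure 4.6):

m4a.3 <- MCMCglmm(weight ~ Diet + poly(Time, 2, raw = TRUE),
                  random = ~us(1 + Time + I(Time^2)):Chick, pr = TRUE,
                  data = ChickWeight, verbose = FALSE)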
Figure 4.5: Weights of each chick as a function of age in blue, with the predicted
weights in purple. A quadratic population growth curve was fitted with a first
order random regression for chicks (i.e. a random intercept-slope model).
and the DIC has gone down, suggesting that the model is better:
> m4a.1$DIC
[1] 5525.147
Figure 4.6: Weights of each chick as a function of age in blue, with the predicted
weights in purple. A quadratic population growth curve was fitted with a sec-
ond order random regression for chicks (i.e. a random intercept-slope-quadratic
model).
> m4a.2$DIC
[1] 4544.509
> m4a.3$DIC
[1] 3932.327
It is worth seeing whether an AIC measure based on REML also suggests that the highest order model is the better model.
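A sketch of such a comparison (the lmer formulas mirror the MCMCglmm sketches above; the three numbers printed below are presumably the corresponding AIC values):

library(lme4)
AIC(lmer(weight ~ Diet + poly(Time, 2, raw = TRUE) + (1 | Chick), data = ChickWeight))
AIC(lmer(weight ~ Diet + poly(Time, 2, raw = TRUE) + (1 + Time | Chick), data = ChickWeight))
AIC(lmer(weight ~ Diet + poly(Time, 2, raw = TRUE) + (1 + Time + I(Time^2) | Chick), data = ChickWeight))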
[1] 5578.963
[1] 4732.387
[1] 4274.606
> detach(package:lme4)
with the variance increasing at extreme values of Time and being zero at Time=0. For an intercept-slope model such as this the expected variance is quadratic in the predictor, and for an intercept-slope-quadratic model the variance is a quartic polynomial in the predictor. Generally the expected variance can be obtained using:
$$\textrm{VAR}[\mathbf{y}] = \textrm{diag}(\mathbf{Z}\mathbf{V}\mathbf{Z}')$$
and we can use this to predict the change in variance as a function of Time for the three models:
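A sketch for the intercept-slope model, using posterior mean (co)variances (the column order of the us structure is assumed; VCV.2 also appears in the plotting code further below):

VCV.2 <- matrix(colMeans(m4a.2$VCV)[1:4], 2, 2)      # 2 x 2 chick covariance matrix
Z <- cbind(1, 0:21)                                   # intercept and Time
plot(0:21, diag(Z %*% VCV.2 %*% t(Z)), type = "l",
     xlab = "Time", ylab = "variance due to chick effects")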
Figure 4.7: Hypothetical regression lines where the variance in slopes is one but
the variance in intercepts is zero. The expected variance of y is a quadratic
function of Time, being zero when Time=0, and increasing with positive or
negative values.
Figure 4.8: Chick weights plotted as a function of time. 95% of the data are
expected to fall within the dashed lines assuming the model with random inter-
cepts is the correct model, and the diet treatments have small effects.
The simple model, without a slope term, has constant variance across the range, and is clearly inconsistent with the data (Figure 4.8). The second model, on the other hand, captures the increase in variance with age (see Figure 4.9):
> plot(weight ~ Time, data = ChickWeight, cex.lab = 1.5)
> mu.2 <- polynomial %*% beta.2
> sd.2 <- sqrt(units.2 + diag(polynomial[, 1:2, drop = FALSE] %*% VCV.2 %*%
+     t(polynomial[, 1:2, drop = FALSE])))
Figure 4.9: Chick weights plotted as a function of time. 95% of the data are
expected to fall within the dashed lines assuming the model with random in-
tercepts and slopes is the correct model, and the diet treatments have small
effects.
Figure 4.10: Chick weights plotted as a function of time. 95% of the data are
expected to fall within the dashed lines assuming the model with random inter-
cepts, slopes and quadratic effects is the correct model, and the diet treatments
have small effects.
4.4 Meta-analysis
Random intercept-slope models implicitly assume that the variance changes as a
quadratic function of the predictor. This can be used to our advantage because
it allows us to fit meta-analytic models. In meta-analysis the data are usually
some standardised statistic which has been estimated with different levels of
measurement error. If we wanted to know the expected value of these statistics
we would want to weight our answer to those measurements made with the
smallest amount of error. If we assume that measurement error around the true
value is normally distributed then we could assume the model:
$$y_i = \beta_1 + m_i + e_i \qquad (4.1)$$
where β1 is the expected value, mi is some deviation due to measurement
error, and ei is the deviation of the statistic from the global intercept not due to
measurement error. Some types of meta-analysis presume ei does not exist and
that the only variation between studies is due to measurement error. This is not
realistic, I think. Often, standard errors are reported in the literature, and these
can be viewed as an approximation to the expected standard deviation of the
measurement error. If we put the standard errors for each statistic as a column
in the data frame (and call it SE) then the random term idh(SE):units defines
a diagonal matrix with the standard errors on the diagonal. Using results from
Equation 4.2,
$$\begin{aligned}\textrm{VAR}[\mathbf{m}] &= \mathbf{Z}\mathbf{V}\mathbf{Z}'\\ &= \mathbf{Z}\sigma^2_{m}\mathbf{I}\mathbf{Z}' \qquad (4.2)\\ &= \sigma^2_{m}\mathbf{Z}\mathbf{Z}'\end{aligned}$$
and fixing $\sigma^2_{m} = 1$ in the prior, the expected variance of each measurement error is therefore the standard error squared (the sampling variance) and all measurement errors are assumed to be independent of each other. The random regression therefore fits a random effect meta-analysis.
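A sketch of such a model (the data frame, its column names and the prior object are assumptions; the essential point is that the variance associated with idh(SE):units is fixed at one so that each measurement error has variance SE^2):

prior.meta <- list(R = list(V = 1, nu = 0.002),
                   G = list(G1 = list(V = 1, fix = 1)))
m.meta <- MCMCglmm(estimate ~ 1, random = ~idh(SE):units,
                   data = meta.dat, prior = prior.meta, verbose = FALSE)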
4.5 Splines
blah blah
fits a penalised cubic spline. The coefficients are random effects, stored in
the Sol element of the model output and the single variance component (the
penalising bit) is in the VCV element. Its usually a good idea to scale the covariate
to lie in the interval [0,1] or some such thing.
Chapter 5
Multi-response models
In practice does this matter? Let's imagine there was only one unmeasured variable: disposable income.
als in their disposable income, but also some variation within individuals across
the four years. Likewise, people vary in what proportion of their disposable
income they are willing to spend on a holiday versus a car, but this also changes
from year to year. We can simulate some toy data to get a feel for the issues:
> # (the definitions of id and the wealth variables are not shown here; the
> # following lines are assumptions consistent with the text: 200 people each
> # observed in 4 years, with repeatable and yearly variation in disposable income)
> id <- gl(200, 4)
> av_wealth <- rlnorm(200, 0, 1)
> ac_wealth <- av_wealth[id] + rlnorm(800, 0, 1)
> av_ratio <- rbeta(200, 10, 10)
> ac_ratio <- rbeta(800, 2*(av_ratio[id]), 2*(1-av_ratio[id]))
> # expected proportion spent on car + some year to year variation
> y.car <- (ac_wealth*ac_ratio)^0.25        # disposable income * proportion spent on car
> y.hol <- (ac_wealth*(1-ac_ratio))^0.25    # disposable income * proportion spent on holiday
> Spending <- data.frame(y.hol=y.hol, y.car=y.car, id=id)
A simple regression suggests the two types of spending are negatively related, but the association is weak, with an $R^2$ of 0.012.
Call:
lm(formula = y.car ~ y.hol, data = Spending)
Residuals:
Min 1Q Median 3Q Max
-0.88552 -0.19737 -0.00205 0.18858 1.26133
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.11702 0.03873 28.84 < 2e-16 ***
y.hol -0.11453 0.03731 -3.07 0.00221 **
---
Signif. codes:
0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
With id added as a random term to deal with the repeated measures, a similar conclusion is reached although the estimate is more negative:
Iterations = 3001:12991
Thinning interval = 10
Number of chains = 1
Sample size per chain = 1000
It is useful to think of a new data frame where the response variables have
been stacked column-wise and the other predictors duplicated accordingly. Be-
low is the original data frame on the left (Spending) and the stacked data frame
on the right:
Original data frame (Spending):

        y.hol     y.car   id
1    1.063890  0.191558    1
2    1.135753  0.724603    1
...
800  0.983501  1.310201  200

=⇒ stacked data frame:

             y  trait   id  units
1     1.063890  y.hol    1      1
2     1.135753  y.hol    1      2
...
800   0.983501  y.hol  200    800
801   0.191558  y.car    1      1
802   0.724603  y.car    1      2
...
1600  1.310201  y.car  200    800
From this we can see that fitting a multi-response model is a direct extension
to how we fitted models with categorical random interactions (Chapter 3):
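A sketch of the model described below (the object name and the omission of an explicit prior are assumptions):

m5a <- MCMCglmm(cbind(y.hol, y.car) ~ trait - 1,
                random = ~us(trait):id, rcov = ~us(trait):units,
                family = c("gaussian", "gaussian"),
                data = Spending, verbose = FALSE)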
We have fitted the fixed effect trait so that the two types of spending can
have different intercepts. I usually suppress the intercept (-1) for these types
of models so the second coefficient is not the difference between the intercept
for the first level of trait (y.hol) and the second level (y.car) but the actual
trait specific intercepts. In other words the design matrix for the fixed effects
has the form:
$$\mathbf{X} = \begin{bmatrix}
\texttt{trait[1]=="y.hol"} & \texttt{trait[1]=="y.car"}\\
\texttt{trait[2]=="y.hol"} & \texttt{trait[2]=="y.car"}\\
\vdots & \vdots\\
\texttt{trait[800]=="y.hol"} & \texttt{trait[800]=="y.car"}\\
\texttt{trait[801]=="y.hol"} & \texttt{trait[801]=="y.car"}\\
\texttt{trait[802]=="y.hol"} & \texttt{trait[802]=="y.car"}\\
\vdots & \vdots\\
\texttt{trait[1600]=="y.hol"} & \texttt{trait[1600]=="y.car"}
\end{bmatrix}
= \begin{bmatrix}
1 & 0\\ 1 & 0\\ \vdots & \vdots\\ 1 & 0\\ 0 & 1\\ 0 & 1\\ \vdots & \vdots\\ 0 & 1
\end{bmatrix}$$
A 2 × 2 covariance matrix is estimated for the random term where the di-
agonal elements are the variance in consistent individual effects for each type
of spending. The off-diagonal is the covariance between these effects which if
positive suggests that people that consistently spend more on their holidays
consistently spend more on their cars. A 2 × 2 residual covariance matrix is also
fitted. In Section 3.4 we fitted heterogeneous error models using idh():units, which made sense in that case because each level of units was specific to a particular datum and so any covariances could not be estimated. In multi-response models this is not the case because both traits have often been measured on the same observational unit, and so the covariance can be measured. In the context of this example a positive covariance would indicate that in years when an individual spent a lot on their car they also spent a lot on their holiday.
The regression coefficients (see Figure 5.1) differ substantially at the within-individual (green) and between-individual (red) levels, and neither is entirely consistent with the regression coefficient from the univariate model (black). The process by which we generated the data gives rise to this phenomenon - large variation between individuals in their disposable income means that people who are able to spend a lot on their holidays can also afford to spend a lot on their cars (hence positive covariation between id effects). However, a person that spent a large proportion of their disposable income in a particular year on a holiday must have less to spend that year on a car (hence negative residual (within year) covariation).
Figure 5.1: MCMC summary plot of the coefficient from a regression of car spending on holiday spending in black. The red and green traces are from a model where the regression coefficient is estimated at two levels: within an individual (green) and across individuals (red). The relationship between the two types of spending is in part mediated by a third unmeasured variable, disposable income.
When fitting the simpler univariate model we make the assumption that the amount spent on a car directly affects how much is spent on a holiday. If this relationship was purely causal then all regression coefficients would have the same expectation, and the simpler model would be justified.
For example, we could simulate data from a simpler model where two thirds of the variation in holiday expenditure is due to between-individual differences, and holiday expenditure directly affects how much an individual will spend on their car (using a regression coefficient of -0.3). The variation in car expenditure not caused by holiday expenditure is also due to individual differences, but in this case they only explain a third of the variance.
> Spending$y.hol2 <- rnorm(200, 0, sqrt(2))[Spending$id] +
+ rnorm(800, 0, sqrt(1))
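The corresponding line for car expenditure is not shown above; a hedged sketch consistent with the description (a direct effect of -0.3, plus individual and yearly deviations in which the individual differences explain a third of the variance) is:

Spending$y.car2 <- -0.3 * Spending$y.hol2 +
    rnorm(200, 0, sqrt(1))[Spending$id] + rnorm(800, 0, sqrt(2))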
We can fit the univariate and multivariate models to these data, and compare
the regression coefficients as we did before. Figure 5.2 shows that the regression
coefficients are all very similar and a value of -0.3 has a reasonably high posterior
probability. However, it should be noted that the posterior standard deviation is
smaller in the simpler model because the more strict assumptions have allowed
us to pool information across the two levels to get a more precise answer.
Figure 5.2: MCMC summary plot of the coefficient from a regression of car
spending on holiday spending in black. The red and green traces are from a
model where the regression coefficient is estimated at two levels: within an in-
dividual (green) and across individuals (red). In this model the relationship
between the two types of spending is causal and the regression coefficients have
the same expectation. However, the posterior standard deviation from the sim-
ple regression is smaller because information from the two different levels is
pooled.
5.2 Multinomial Models
> data(SShorns)
> head(SShorns)
id horn sex
1 1 scurred female
2 2 scurred female
3 3 scurred female
4 4 scurred female
5 5 polled female
6 6 polled female
The sex and horn morph were recorded for each individual, giving the con-
tingency table:
female male
normal 83 352
polled 65 0
scurred 96 70
and we’ll see if the frequencies of the three horn types differ, and if the trait
is sex dependent. The usual way to do this would be to use a Chi square test,
and to address the first question we could add the counts of the two sexes:
> chisq.test(rowSums(Ctable))
data: rowSums(Ctable)
X-squared = 329.52, df = 2, p-value < 2.2e-16
which strongly suggests the three morphs differ in frequency. We could then
ask whether the frequencies differ by sex:
> chisq.test(Ctable)
data: Ctable
X-squared = 202.3, df = 2, p-value < 2.2e-16
which again they do, which is not that surprising since the trait is partly sex
limited, with males not expressing the polled phenotype.
If there were only two horn types, polled and normal for example, then we could have considered transforming the data into the binary variable 'polled or not?' and analysing it using a glm with sex as a predictor. In doing this we have reduced the dimension of the data from J = 2 categories to a single (J − 1 = 1) contrast. The motivation for the dimension reduction is obvious; if being a male increased the probability of expressing normal horns by 10%, it must by necessity reduce the probability of expressing the polled horn type by 10%, because an individual cannot express both horn types simultaneously. The dimension reduction essentially constrains the probabilities of expressing the horn types to sum to unity:
For binary data we designated one category to be the success (polled) and
one category to be the failure (normal) which we will call the baseline category.
The latent variable in this case was the log-odds ratio of succeeding versus
failing:
$$l_{i} = \log\left(\frac{Pr(\texttt{horn}[i]=\texttt{polled})}{Pr(\texttt{horn}[i]=\texttt{normal})}\right) = \textrm{logit}\left(Pr(\texttt{horn}[i]=\texttt{polled})\right) \qquad (5.3)$$
With more than two categories we need to have J − 1 latent variables, which in the original horn type example are:
$$l_{i,\texttt{polled}} = \log\left(\frac{Pr(\texttt{horn}[i]=\texttt{polled})}{Pr(\texttt{horn}[i]=\texttt{normal})}\right) \qquad (5.4)$$
and
$$l_{i,\texttt{scurred}} = \log\left(\frac{Pr(\texttt{horn}[i]=\texttt{scurred})}{Pr(\texttt{horn}[i]=\texttt{normal})}\right) \qquad (5.5)$$
The two latent variables are indexed as trait, and the unit of observation (i) as units, as in multi-response models. As with binary models the residual variance is not identified, and can be set to any arbitrary value. For reasons that will become clearer later I like to work with the residual covariance matrix $\frac{1}{J}(\mathbf{I} + \mathbf{J})$, where $\mathbf{I}$ and $\mathbf{J}$ are $J-1$ dimensional identity and unit matrices, respectively.
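A sketch of the intercept-only multinomial model (m5c.1 is the name used in Figure 5.3; the prior construction follows the residual matrix just described, and the other object names are assumptions):

IJ <- (1/3) * (diag(2) + matrix(1, 2, 2))      # (1/J)(I + J) with J = 3
prior.m5c.1 <- list(R = list(V = IJ, fix = 1))
m5c.1 <- MCMCglmm(horn ~ trait - 1, rcov = ~us(trait):units,
                  family = "categorical", prior = prior.m5c.1,
                  data = SShorns, verbose = FALSE)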
The posterior distribution for the intercepts is shown in Figure 5.3, and the model clearly needs to be run for longer. However...
The problem can also be represented using the contrast matrix $\Delta$ (?):
$$\Delta = \begin{bmatrix} -1 & -1\\ 1 & 0\\ 0 & 1 \end{bmatrix} \qquad (5.6)$$
where the rows correspond to the factor levels (normal, polled and scurred) and the columns to the two latent variables. For example, column one corresponds to $l_{i,\texttt{polled}}$, which on the log scale is $\log Pr(\texttt{horn}[i]=\texttt{polled}) - \log Pr(\texttt{horn}[i]=\texttt{normal})$.
$$\exp\left(\Delta(\Delta'\Delta)^{-1}\mathbf{l}_{i}\right) \propto \textrm{E}\begin{bmatrix} Pr(\texttt{horn}[i]=\texttt{normal})\\ Pr(\texttt{horn}[i]=\texttt{polled})\\ Pr(\texttt{horn}[i]=\texttt{scurred}) \end{bmatrix} \qquad (5.7)$$
The residual and any random effect covariance matrices are, for estimability purposes, estimated on the J − 1 space, with $\mathbf{V} = \Delta\tilde{\mathbf{V}}\Delta'$ giving the corresponding matrix on the J space, where $\tilde{\mathbf{V}}$ is the covariance matrix estimated on the J − 1 space. To illustrate, we will rescale the intercepts as if the residual covariance matrix was zero (see Sections and ) and predict the expected probability for each horn type:
Figure 5.3: Posterior distribution of fixed effects from model m5c.1: a simple
multinomial logit model with intercepts only
Iterations = 1:1000
Thinning interval = 1
Number of chains = 1
Sample size per chain = 1000
> prop.table(rowSums(Ctable))
To test for the effects of sex-specific expression we can also fit a model with a sex effect:
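A sketch, adding a main effect of sex to the previous model (m5c.2 is the name used in Figure 5.4):

m5c.2 <- MCMCglmm(horn ~ trait - 1 + sex, rcov = ~us(trait):units,
                  family = "categorical", prior = prior.m5c.1,
                  data = SShorns, verbose = FALSE)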
In this case we have not interacted sex with trait, and so we are estimating
the difference between the sexes in their expression of normal and polled+scurred
jointly. The posterior distribution is plotted in Figure 5.4 and clearly shows that
males are more likely to express the normal horn phenotype than females.
A more general model would be to estimate separate probabilities for each cell, but the contingency table indicates that one cell (polled males) has zero counts, which will cause extreme separation problems. We could choose a better prior for the fixed effects, one that is close to being flat for the two-way (i.e. polled vs. scurred, normal vs. scurred & polled vs. normal) marginal probabilities within each sex:
Iterations = 1:1000
Thinning interval = 1
Number of chains = 1
Sample size per chain = 1000
Figure 5.4: Posterior distribution of fixed effects from model m5c.2 in which a
main effect of sex was included
Iterations = 1:1000
Thinning interval = 1
Number of chains = 1
Sample size per chain = 1000
5.3 Zero-inflated Models

In a zero-inflated Poisson model a second latent variable models the probability (on the logit scale) that a zero is from the zero-inflation process. The likelihood has the form:
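In the same notation as the hurdle-model likelihood of Equation 5.9 below, the zero-inflated Poisson likelihood can be written as:
$$\begin{aligned} Pr(y = 0) &= \textrm{plogis}(l_{2}) + (1 - \textrm{plogis}(l_{2})) * \textrm{dpois}(0, \exp(l_{1}))\\ Pr(y) &= (1 - \textrm{plogis}(l_{2})) * \textrm{dpois}(y, \exp(l_{1})) \quad \textrm{for } y > 0 \end{aligned}$$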
pscl fits zero-inflated models very well through the zeroinfl function, and
I strongly recommend using it if you do not want to fit random effects. To
illustrate the syntax for fitting ZIP models in MCMCglmm I will take one of
their examples:
> table(bioChemists$art == 0)
FALSE TRUE
640 275
[1] 0.1839859
As with binary models we do not observe any residual variance for the zero-inflated process, and in addition the residual covariance between the zero-inflation and the Poisson process cannot be estimated because both processes cannot be observed in a single data point. To deal with this I've fixed the residual variance for the zero-inflation at 1, and the covariance is set to zero using the idh structure. Setting V=diag(2) and nu=0.002¹ we have the inverse-gamma prior with shape=scale=0.001 for the residual component of the Poisson process, which captures over-dispersion:
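A sketch of the ZIP model described (m5d.1 is the name used in Figure 5.5; fix = 2 fixes the zero-inflation residual variance at one, and the fixed-effect side is abbreviated to the trait intercepts although the original example also conditions on covariates such as phd and the mentor's publication record):

prior.zip <- list(R = list(V = diag(2), nu = 0.002, fix = 2))
m5d.1 <- MCMCglmm(art ~ trait - 1, rcov = ~idh(trait):units,
                  family = "zipoisson", prior = prior.zip,
                  data = bioChemists, verbose = FALSE)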
1 Earlier versions of the CourseNotes had nu=1.002. In versions <2.05 the marginal prior
of a variance associated with an idh structure was inverse-Wishart with nu∗ = nu − 1 where
nu∗ is the marginal degree of belief. In versions >=2.05 I changed this so that nu∗ = nu as it
was leading to confusion.
As is often the case, the parameters of the zero-inflation model mix poorly (see Figure 5.5), especially when compared to equivalent hurdle models (see Section 5.4). Poor mixing is often associated with distributions that may not be zero-inflated but instead over-dispersed.
Figure 5.5: Posterior distribution of fixed effects from model m5d.1 in which
trait 1 (art) is the Poisson process and trait 2 (zi.art) is the zero-inflation.
The model would have to be run for (much) longer to say something concrete
about the level of zero-inflation but my guess would be it’s not a big issue, given
the probability is probably quite small:
5.4 Hurdle Models

In a hurdle-Poisson model the second latent variable models, on the logit scale, the probability that an observation is zero, and the counts that clear the hurdle come from a zero-truncated Poisson process:
$$\begin{aligned} Pr(y = 0) &= \textrm{plogis}(l_{2})\\ Pr(y|y>0) &= \textrm{plogis}(-l_{2}) * \textrm{dpois}(y, \exp(l_{1}))/(1 - \textrm{ppois}(0, \exp(l_{1}))) \end{aligned} \qquad (5.9)$$
To illustrate, we will refit the ZIP model (m5d.1) as a hurdle-Poisson model.
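A sketch of the refit (m5d.3 is the name used in Figure 5.7; as before the fixed-effect side is abbreviated and the prior follows the ZIP sketch above):

m5d.3 <- MCMCglmm(art ~ trait - 1, rcov = ~idh(trait):units,
                  family = "hupoisson", prior = prior.zip,
                  data = bioChemists, verbose = FALSE)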
Figure 5.6: Posterior predictive distribution of zeros from model m5d.2 with the
observed number in red.
Plotting the Markov chain for the equivalent parameters that were plotted
for the ZIP model shows that the mixing properties are much better (compare
Figure 5.5 with Figure 5.7).
Figure 5.7: Posterior distribution of fixed effects from model m5d.3 in which
trait 1 (art) is the zero-truncated Poisson process and trait 2 (hu.art) is the
binary trait zero or non-zero.
lower upper
var1 0.2682566 0.3260789
attr(,"Probability")
[1] 0.95
and we can compare this to the predicted number of zeros from the Poisson process if it had not been zero-truncated:
> HPDinterval(ppois(0, exp(m5d.3$Sol[, 1] + 0.5 *
+ m5d.3$VCV[, 1])))
lower upper
var1 0.1504119 0.3605811
attr(,"Probability")
[1] 0.95
The credible intervals largely overlap, strongly suggesting a standard Poisson model would be adequate. However, our prediction for the number of zeros that would arise from a non-truncated Poisson process only involved the intercept term. This prediction therefore pertains to the number of articles published by single women with no young children who obtained their Ph.D.s from departments scoring zero for prestige (phd) and whose mentors had published nothing in the previous 3 years. Our equivalent prediction for men is a little lower
lower upper
var1 0.0903996 0.289349
attr(,"Probability")
[1] 0.95
suggesting that perhaps the number of zeros is greater than we expected for this group. However, this may just be a consequence of us fixing the proportion of zeros to be constant across these groups. We can relax this assumption by fitting a separate term for the proportion of zeros for men:
lower upper
var1 0.2300976 0.3060856
attr(,"Probability")
[1] 0.95
the proportion of zeros expected for men is probably still less than what
we expect from a non-truncated Poisson process for which the estimates have
changed very little:
lower upper
var1 0.07802243 0.2463355
attr(,"Probability")
[1] 0.95
5.5 Zero-altered Models

$$\begin{aligned} Pr(y = 0) &= 1 - \textrm{pexp}(\exp(l_{2}))\\ Pr(y|y>0) &= \textrm{pexp}(\exp(l_{2})) * \textrm{dpois}(y, \exp(l_{1}))/(1 - \textrm{ppois}(0, \exp(l_{1}))) \end{aligned} \qquad (5.10)$$
since the inverse of the complementary log-log transformation is the distribution function of the extreme value (log-exponential) distribution.
Iterations = 3001:12991
Thinning interval = 10
DIC: 3039.935
R-structure: ~trait:units
we can see from this that the more papers a mentor produces, the more zero-deflation (or conversely, the fewer papers a mentor produces, the more zero-inflation).
Chapter 6

Pedigrees and Phylogenies
> library(kinship2)
Pedigrees and phylogenies are similar things: they are both ways of representing shared ancestry. Under a quantitative genetic model of inheritance, or a Brownian motion model of evolution, GLMMs can be readily extended to model the similarities that exist between the phenotypes of related individuals or taxa. In the context of quantitative genetics these models are known as 'animal' models (?), and in the context of the comparative method these models are known as phylogenetic mixed models (?). The two models are almost identical, and are relatively minor modifications to the basic mixed model (?).
To illustrate, we can load a pedigree for a population of blue tits and dis-
play the pedigree for the nuclear family that has individuals "R187920" and
"R187921" as parents:
> data(BTped)
> Nped <- BTped[which(apply(BTped, 1, function(x) {
+ any(x == "R187920" | x == "R187921")
+ })), ]
> Nped
Both parents form part of what is known as the base population - they are
outbred and unrelated to anybody else in the pedigree.
6.1.2 Phylogenies
Phylogenies can be expressed in tabular form, although only two columns are
required because each species only has a single parent. In general however, phy-
logenies are not expressed in this form presumably because it is hard to traverse
phylogenies (and pedigrees) backwards in time when they are stored this way.
For phylogenetic mixed models we generally only need to traverse phylogenies
forward in time (if at all) but I have stuck with convention and used the phylo
class from the ape package to store phylogenies. As with pedigrees, all species
appearing in the data frame passed to data need to appear in the phylogeny.
Typically, this will only include species at the tips of the phylogeny and so the
measured species should appear in the tip.label element of the phylo object.
An error message will be issued if this is not the case. Data may also exist
for ancestral species, or even for species present at the tips but measured many
generations before. It is possible to include these data as long as the phylogeny
has labelled internal nodes. If nodes are unlabeled then MCMCglmm names them
internally using the default arguments of makeNodeLabel from ape.
To illustrate, lets take the phylogeny of bird families included in the ape pack-
age, and extract the phylogeny in tabular form for the Paridae (Tits), Certhiidae
(Treecreepers), Gruidae (Cranes) and the Struthionidae (Ostriches):
> data("bird.families")
> bird.families <- makeNodeLabel(bird.families)
> some.families <- c("Certhiidae", "Paridae", "Gruidae",
+ "Struthionidae")
> Nphylo <- drop.tip(bird.families, setdiff(bird.families$tip.label,
+ some.families))
> INphylo <- inverseA(Nphylo)
> INphylo$pedigree
node.names
[1,] "Node58" NA NA
[2,] "Node122" "Node58" NA
[3,] "Struthionidae" NA NA
[4,] "Gruidae" "Node58" NA
[5,] "Certhiidae" "Node122" NA
[6,] "Paridae" "Node122" NA
The full phylogeny, with these families and their connecting nodes displayed, is shown in Figure 6.1. You will notice that Node1 - the root - does not appear in
the phylogeny in tabular form. This is because the root is equivalent to the base
population in a pedigree analysis, an issue which we will come back to later.
Another piece of information that seems to be lacking in the tabular form is the
branch length information. Branch lengths are equivalent to inbreeding coeffi-
cients in a pedigree. As with pedigrees the inbreeding coefficients are calculated
by inverseA:
Figure 6.1: A phylogeny of bird families from ?. The families in red are the Tits (Paridae), Treecreepers (Certhiidae), Cranes (Gruidae) and the Ostriches (Struthionidae), from top to bottom. Blue tits are in the Paridae, and the word pedigree comes from the French for crane's foot.
> INphylo$inbreeding
[1] 0.2285714 0.3857143 1.0000000 0.7714286 0.3857143
[6] 0.3857143
You will notice that the Struthionidae have an inbreeding coefficient of 1 be-
cause we used the default scale=TRUE in the call to inverseA. Only ultrametric
trees can be scaled in MCMCglmm and in this case the sum of the inbreeding co-
efficients connecting the root to a terminal node is one. To take the Paridae as
an example:
> sum(INphylo$inbreeding[which(INphylo$pedigree[,
+ 1] %in% c("Paridae", "Node122", "Node58"))])
[1] 1
The inbreeding coefficients for the members of the blue tit nuclear family
are of course all zero:
> inverseA(Nped)$inbreeding
[1] 0 0 0 0 0 0 0 0
Note that specifying cor=T is equivalent to scaling the tree as we did in the
argument to inverseA.
In fact, all of the mixed models we fitted in earlier sections also used an A
matrix, but in those cases the matrix was an identity matrix (i.e. A = I) and
we didn’t have to worry about it. Let’s reconsider the Blue tit model m3a.1
from Section 3 where we were interested in estimating sex effects for tarsus
length together with the amount of variance explained by genetic mother (dam)
and foster mother (fosternest):
All individuals that contributed to that analysis are from a single generation
and appear in BTped together with their parents. However, individuals in the
parental generation do not have tarsus length measurements so they do not have
their own records in BTdata.
$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}_1\mathbf{u}_1 + \mathbf{Z}_2\mathbf{u}_2 + \mathbf{e} \qquad (6.1)$$
where the design matrices contain information relating each individual to a sex ($\mathbf{X}$), a dam ($\mathbf{Z}_1$) and a fosternest ($\mathbf{Z}_2$). The associated parameter vectors ($\boldsymbol{\beta}$, $\mathbf{u}_1$ and $\mathbf{u}_2$) are the effects of each sex, mother and fosternest on tarsus length, and $\mathbf{e}$ is the vector of residuals.
In the model, the u’s are treated as random so we estimate their variance
instead of fixing it in the prior at some (large) value, as we did with the β’s.
We can be a little more explicit about what this means:
variance component.
Since dams have very little interaction with the subset of offspring that were moved to a fosternest, we may be willing to assume that any similarity that exists between the tarsus lengths of this subset and the subset that remained at home must be due to genetic effects. Although not strictly true, we can assume that individuals that shared the same dam also shared the same sire, and so share around 50% of their genes.
to be completed ...
Chapter 7
Technical Details
$$\mathbf{l} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\mathbf{u} + \mathbf{e} \qquad (7.3)$$
where X is a design matrix relating fixed predictors to the data, and Z is
a design matrix relating random predictors to the data. These predictors have
associated parameter vectors β and u, and e is a vector of residuals. In the
Poisson case these residuals deal with any over-dispersion in the data after ac-
counting for fixed and random sources of variation.
The location effects (β and u), and the residuals (e) are assumed to come
from a multivariate normal distribution:
$$\begin{bmatrix}\boldsymbol{\beta}\\ \mathbf{u}\\ \mathbf{e}\end{bmatrix} \sim N\left(\begin{bmatrix}\boldsymbol{\beta}_0\\ \mathbf{0}\\ \mathbf{0}\end{bmatrix},\ \begin{bmatrix}\mathbf{B} & \mathbf{0} & \mathbf{0}\\ \mathbf{0} & \mathbf{G} & \mathbf{0}\\ \mathbf{0} & \mathbf{0} & \mathbf{R}\end{bmatrix}\right) \qquad (7.4)$$
where β0 is a vector of prior means for the fixed effects with prior (co)variance
B, and G and R are the expected (co)variances of the random effects and resid-
uals respectively. The zero off-diagonal matrices imply a priori independence
between fixed effects, random effects, and residuals. Generally, G and R are
large square matrices with dimensions equal to the number of random effects or
residuals. Typically they are unknown, and must be estimated from the data,
usually by assuming they are structured in a way that they can be parameterised
by few parameters. Below we will focus on the structure of G, but the same
logic can be applied to R.
In the case of ordinal probit models with > 2 categories (i.e. "threshold" or "ordinal" models), $f_T/f_O$ depends on an extra set of parameters in addition to the latent variable: the max(y) + 1 cutpoints $\gamma$. The probability of $y_i$ is then:
$y_i = l_i$ and zero otherwise, so $l_i = y_i$ without the need for updating; ii) when $y_i$ is discrete and modelled using family="threshold" then the equation defines a truncated normal distribution and can be slice sampled (?); and iii) when $y_i$ is missing, $f_i(y_i|l_i)$ is not defined and samples can be drawn directly from the normal.
where j indexes blocks of latent variables that have non-zero residual covari-
ances. For response variables that are neither Gaussian nor threshold, the den-
sity in equation 7.9 is in non-standard form and so Metropolis-Hastings updates
are employed. We use adaptive methods during the burn-in phase to determine
an efficient multivariate normal proposal distribution centered at the previous
value of lj with covariance matrix mM. For computational efficiency we use the
same M for each block j, where M is the average posterior (co)variance of lj
within blocks and is updated each iteration of the burn-in period ?. The scalar
m is chosen using the method of ? so that the proportion of successful jumps is
optimal, with a rate of 0.44 when lj is a scalar declining to 0.23 when lj is high
dimensional (?).
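A minimal sketch of this kind of adaptive tuning (not MCMCglmm's internal code;
the function name and the simple stochastic-approximation update are assumptions)
for a one-dimensional latent variable:

adapt.mh <- function(logdens, init, n.burn = 2000, target = 0.44) {
    x <- init
    m <- 1                                        # proposal variance to be tuned
    n.acc <- 0
    for (i in seq_len(n.burn)) {
        prop <- rnorm(1, x, sqrt(m))              # random-walk proposal
        if (log(runif(1)) < logdens(prop) - logdens(x)) {
            x <- prop
            n.acc <- n.acc + 1                    # accepted jump
        }
        m <- m * exp((n.acc/i - target)/sqrt(i))  # nudge scale towards target rate
    }
    c(scale = m, acceptance = n.acc/n.burn)
}
adapt.mh(function(z) dnorm(z, log = TRUE), init = 0)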
7.2.2 Updating the location vector θ = [β′ u′]′
? provide a method for sampling θ as a complete block that involves solving
the sparse linear system:
θ̃ = C⁻¹W′R⁻¹(l − Wθ⋆ − e⋆)                                    (7.11)
e⋆ ∼ N(0, R)                                    (7.14)
θ̃ + θ⋆ gives a realisation from the required probability distribution:
Pr(θ | l, W, R, G)                                    (7.15)
Equation 7.11 is solved using Cholesky factorisation. Because C is sparse
and the pattern of non-zero elements fixed, an initial symbolic Cholesky
factorisation of PCP′ is performed, where P is a fill-reducing permutation matrix (?).
Numerical factorisation must be performed each iteration but the fill-reducing
permutation (found via a minimum degree ordering of C + C′) reduces the
computational burden dramatically compared to a direct factorisation of C (?).
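The same idea can be illustrated with the Matrix package (a generic sketch, not
MCMCglmm's internal code): the symbolic factorisation with its fill-reducing
permutation is computed once, and only the numeric factorisation is repeated
when the values in C change.

library(Matrix)
set.seed(1)
C <- forceSymmetric(crossprod(rsparsematrix(200, 200, density = 0.02)) + Diagonal(200))
ch <- Cholesky(C, perm = TRUE)                   # symbolic + numeric factorisation of PCP'
Cnew <- forceSymmetric(C + Diagonal(200, 0.5))   # same sparsity pattern, new values
ch <- update(ch, Cnew)                           # numeric factorisation only
x <- solve(ch, rep(1, 200))                      # solves Cnew x = b using the factorisation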
Λl = Xβ + Zu + e (7.20)
where Λ is a square matrix of the form:
Λ = I − Σl Ψ(l)λl                                    (7.21)
This sets up a regression where the jth element of the response vector acts
as a weighted (by Ψ(l)i,j) predictor for the ith element of the response vector,
with associated regression parameter λl. Often Ψ(l) is an incidence matrix with
the pattern of ones determining which elements of the response are regressed on
each other.
the necessary equations. ? provide a simple scheme for updating λ. Note that
Equation 7.20 can be rewritten as:
l − Xβ − Zu = e + Σl Ψ(l)lλl
            = e + Lλ                                    (7.22)
where L is the design matrix [Ψ(1)l, Ψ(2)l, ..., Ψ(L)l] for the L path
coefficients. Conditional on β and u, λ can then be sampled using the method of
? with l − Xβ − Zu as response and L as predictor. However, only in a fully
recursive system (there exists a row/column permutation by which all Ψ's are
triangular) are the resulting draws from the appropriate conditional distribution,
which requires multiplication by the Jacobian of the transform, |Λ|. An
extra Metropolis-Hastings step is used to accept/reject the proposed draw when
|Λ| ≠ 1.
When the response vector is Gaussian and fully observed, the latent variable
does not need updating. For non-Gaussian data, or with missing responses,
updating the latent variable is difficult because Equation 7.2.1 becomes:
Pr(li | y, θ, R, G, λ) ∝ fi(yi|li) fN((Λ⁻¹e)i | qi′Q/i⁻¹e/i, qi,i − qi′Q/i⁻¹qi)        (7.23)
where Q = Λ⁻¹RΛ⁻ᵀ. In the general case Q will not have block diagonal
structure like R and so the scheme for updating latent variables within residual
blocks (i.e. Equation 7.9) is not possible. However, in some cases Λ may have
the form where all non-zero elements correspond to elements of the response
vector that are in the same residual block. In such cases updating the latent
variables remains relatively simple:
D = −2log(Pr(y|Ω)) (7.25)
where Ω is some parameter set of the model. The deviance can be calculated
in different ways depending on what is in ‘focus’, and MCMCglmm calculates
this probability for the lowest level of the hierarchy (?). For fully-observed
Gaussian response variables, the likelihood is the density:
fN (y|Wθ, R) (7.26)
where Ω = {θ, R}. For discrete response variables in univariate analyses
modeled using family="threshold" the density is
∏i [FN(γyi+1 | wiθ, ri,i) − FN(γyi | wiθ, ri,i)]                (7.27)
with Ω = l.
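A small numerical sketch of a single datum's contribution to Equation 7.27, with
hypothetical values for the linear predictor, residual variance and cutpoints:

> w.theta <- 0.3                        # linear predictor w_i theta
> r.ii <- 1                             # residual variance
> gam <- c(-Inf, 0, 0.8, Inf)           # cutpoint boundaries for 3 ordered categories
> y <- 2                                # observed category
> pnorm(gam[y + 1], w.theta, sqrt(r.ii)) - pnorm(gam[y], w.theta, sqrt(r.ii))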
For multivariate models with mixtures of Gaussian (g), threshold (t) and
other non-Gaussian (n) data (including missing data) we can define the deviance
in terms of three conditional densities:
∏i [FN(γyi+1 | (Wθ)ti|g,n, rti|g,n) − FN(γyi | (Wθ)ti|g,n, rti|g,n)]                (7.31)
"categorical"    (1 data column, J−1 latent variables)
    Pr(y = k | k ≠ 1) = exp(lk) / (1 + Σ_{j=1}^{J−1} exp(lj))
    Pr(y = 1) = 1 / (1 + Σ_{j=1}^{J−1} exp(lj))
"multinomialJ"   (J data columns, J−1 latent variables)
    Pr(yk = nk | k ≠ J) = [exp(lk) / (1 + Σ_{j=1}^{J−1} exp(lj))]^nk
    Pr(yk = nk | k = J) = [1 / (1 + Σ_{j=1}^{J−1} exp(lj))]^nk
"geometric"      (1 data column, 1 latent variable)
    Pr(y) = fG(exp(l) / (1 + exp(l)))
"cengaussian"    (2 data columns, 1 latent variable)
    Pr(y1 > y > y2) = FN(y2 | wθ, σe²) − FN(y1 | wθ, σe²)
"ztpoisson"      (1 data column, 1 latent variable)
    Pr(y) = fP(y | exp(l)) / (1 − fP(0 | exp(l)))
"hupoisson"      (1 data column, 2 latent variables)
    Pr(y = 0) = exp(l2) / (1 + exp(l2))
    Pr(y | y > 0) = [1 − exp(l2) / (1 + exp(l2))] fP(y | exp(l1)) / (1 − fP(0 | exp(l1)))
"zapoisson"      (1 data column, 2 latent variables)
    Pr(y | y > 0) = exp(exp(l2)) fP(y | exp(l1)) / (1 − fP(0 | exp(l1)))
"zibinomial" 2 2 P r(y1 = 0) = + 1 − 1+exp(l 2 ) fB (0, n = y1 + y2 | 1+exp(l 1)
)
1+exp(l2 )
exp(l2 ) exp(l1 )
P r(y1 |y1 > 0) = 1 − 1+exp(l 2)
fB (y1 , n = y1 + y2 | 1+exp(l 1)
)
Table 7.1: Distribution types that can be fitted using MCMCglmm. The prefixes "zi", "zt", "hu" and "za" stand for zero-inflated,
zero-truncated, hurdle and zero-altered respectively. The prefix "cen" stands for censored, where y1 and y2 are the upper and
lower bounds for the unobserved datum y. J stands for the number of categories in the multinomial/categorical distributions,
and this must be specified in the family argument for the multinomial distribution. The density function is for a single datum
in a univariate model, with w being a row vector of W. f and F are the density and distribution functions for the subscripted
distribution (N = Normal, P = Poisson, E = Exponential, G = Geometric, B = Binomial). The J − 1 γ's in the ordinal models
are the cutpoints, with γ1 set to zero.
Chapter 8
Parameter Expansion
As the covariance matrix approaches a singularity the mixing of the chain be-
comes notoriously slow. This problem is often encountered in single-response
models when a variance component is small and the chain becomes stuck at
values close to zero. Similar problems occur for the EM algorithm and ? intro-
duced parameter expansion to speed up the rate of convergence. The idea was
quickly applied to Gibbs sampling problems (?) and has now been extensively
used to develop more efficient mixed-model samplers (e.g. ???).
The columns of the design matrix (W) can be multiplied by the non-identified
working parameters α = [1, α1, α2, ..., αk]′:
Wα = [X  Z1α1  Z2α2  ...  Zkαk]                                    (8.1)
where the indices denote submatrices of Z which pertain to effects associated
with the same variance component. Replacing W with Wα we can sample the
new location effects θα as described above, and rescale them to obtain θ:
Likewise, the (co)variance matrices can be rescaled by the set of α’s associ-
ated with the variances of a particular variance structure component (αV ):
l = Xα α + e                                    (8.4)
The prior densities in the two models are very similar across the range of
variances with reasonable posterior support, and running the models for long
enough will verify that they are sampling from very similar posterior densities.
However, the mixing properties of the two chains are very different, with the
non-parameter expanded chain (in red) getting stuck at values close to zero
(Figure 8.1).
The parameter expanded model is 25% slower per iteration but the effective
sample size is 3.049 times greater:
var1
194.9393
var1
63.93423
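The calls behind this comparison are not reproduced here, but a sketch of the two
prior specifications being contrasted might look like the following (the object
names, alpha.V = 1000 and fixing the residual variance at 1 for the binary trait
are assumptions):

> prior.simple <- list(R = list(V = 1, fix = 1),
+     G = list(G1 = list(V = 1, nu = 0.002)))
> prior.px <- list(R = list(V = 1, fix = 1),
+     G = list(G1 = list(V = 1, nu = 1, alpha.mu = 0, alpha.V = 1000)))
> m8.px <- MCMCglmm(sex ~ 1, random = ~dam, data = BTdata,
+     family = "categorical", prior = prior.px, verbose = FALSE)

Specifying alpha.mu and alpha.V in the G element is what switches on parameter
expansion for that variance component.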
Figure 8.1: Traces of the sampled posterior distribution for between female
variance in sex ratio. The black trace is from a parameter expanded model, and
the red trace from a non-parameter expanded model.
We can also use the inverse-gamma prior with scale and shape equal to 0.001:
> prior2 <- list(R = list(V = diag(schools$sd^2),
+ fix = 1), G = list(G1 = list(V = 1, nu = 0.002)))
> m7a.2 <- MCMCglmm(estimate ~ 1, random = ~school,
+ rcov = ~idh(school):units, data = schools,
+ prior = prior2, verbose = FALSE)
but Figure 8.3 indicates that such a prior in this context may put too much
density at values close to zero.
For the final prior we have V=1, nu=1, alpha.mu=0, which is equivalent to a
proper Cauchy prior for the standard deviation with scale equal to √alpha.V.
Following Gelman (2006) we use a scale of 25:
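The call itself is not shown here; a sketch of what it might look like, reusing the
fixed residual specification from prior2 (the object names are assumptions):

> prior.cauchy <- list(R = list(V = diag(schools$sd^2), fix = 1),
+     G = list(G1 = list(V = 1, nu = 1, alpha.mu = 0, alpha.V = 25^2)))
> m7a.cauchy <- MCMCglmm(estimate ~ 1, random = ~school,
+     rcov = ~idh(school):units, data = schools,
+     prior = prior.cauchy, verbose = FALSE)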
Figure 8.2: Between school standard deviation in educational test scores, with
an improper uniform prior
Figure 8.4 shows that this prior may have better properties than the
inverse-gamma, and that the posterior is less distorted.
Figure 8.3: Between school standard deviation in educational test scores, with
an inverse-gamma prior with shape and scale set to 0.001
for some fixed value of the actual residual variance. For example, we can refit
the sex ratio model using a residual variance fixed at ten rather than one:
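The refitted model is not shown here; a minimal sketch of what fixing the residual
variance at ten might look like (the object names and the other prior values are
assumptions):

> prior.v10 <- list(R = list(V = 10, fix = 1),
+     G = list(G1 = list(V = 1, nu = 1, alpha.mu = 0, alpha.V = 1000)))
> m.v10 <- MCMCglmm(sex ~ 1, random = ~dam, data = BTdata,
+     family = "categorical", prior = prior.v10, verbose = FALSE)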
The two models appear to give completely different posteriors (Figure 8.5).
Figure 8.4: Between school standard deviation in educational test scores, with
a Cauchy prior with a scale of 25.
The prior specification for the between mother variance is different in the two
models but Figure 8.6 suggests that the difference has little influence. However,
the mixing properties of the second chain are much better (?):
var1
194.9393
var1
636.7686
Although the chain mixes faster as the residual variance is set to be larger,
numerical problems are often encountered because the latent variables can take
Figure 8.5: Between mother variation in sex ratio with the residual variance
fixed at 1 (black trace) and 10 (red trace).
on extreme values. For most models a variance of 1 is safe, but care needs to be
taken so that the absolute value of the latent variable is less than 20 in the case
of the logit link and less than 7 for the probit link. If the residual variance is
not fixed but has an alternative proper prior placed on it then the Metropolis-
Hastings proposal distribution for the latent variables may not be well suited
to the local properties of the conditional distribution and the acceptance ratio
may fluctuate widely around the optimal 0.44. This can be fixed by using the
slice sampling methods outlined in ? by passing slice=TRUE to MCMCglmm. Slice
sampling can also be more efficient even if the residual variance is fixed at some value:
> m7b.4 <- MCMCglmm(sex ~ 1, random = ~dam, data = BTdata,
+ family = "categorical", prior = prior3b, verbose = FALSE,
+ slice = TRUE)
> effectiveSize(m7b.4$VCV[, 1]/(1 + c2 * m7b.3$VCV[,
+ "units"]))
var1
625.4758
Figure 8.6: Between mother variation in sex ratio with the residual variance
fixed at 1 (black trace) and 10 (red trace) but with both estimates rescaled to
what would be observed under no residual variance.
Chapter 9
Path Analysis & Antedependence Structures
There are many situations where it would seem reasonable to put some aspect of
a response variable in as a predictor, and the only thing that stops us is some
(often vague) notion that this is a bad thing to do from a statistical point of
view. The approach appears to have a long history in economics, but I came
across the idea in a paper written by ?. The notation of this section, and
indeed the sampling strategy employed in MCMCglmm, is derived from this paper.
Λy = y − Ψ(IL ⊗ y)λ
   = y − Ψ(λ ⊗ IN)y                                    (9.2)
where λ = [λ1 λ2 ... λL−1 λL]′, and Λ = IN − Ψ(λ ⊗ IN).
Each Ψ(l) can be formed using the function sir, which takes two formulae.
Ψ = X1X2′ where X1 and X2 are the model matrices defined by the formulae
(with intercept removed). X1 and X2′ have to be conformable, and although
this could be achieved in many ways, one way to ensure this is to have categorical
predictors in each which have common factor levels. To give a concrete example,
let's take a sample of individuals measured a variable number of times for 2 traits:
Let's then imagine that each of these individuals interacts with another
randomly chosen individual, indexed in the vector id1.
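The code that sets these data up is not included here; a purely hypothetical
sketch (all names and sizes are assumptions) might look like:

> id <- factor(sample(1:50, 200, replace = TRUE))         # focal individual
> id1 <- factor(sample(levels(id), 200, replace = TRUE),
+     levels = levels(id))                                # interaction partner

with Ψ then formed by a call like sir(~id1, ~id).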
We can see that the first record for individual id[1]=10 is directly affected
by individual id1[1]=48's traits:
31 90 131 190
1 1 1 1
We can build on this simple model by stating that only trait 2 affects trait
1:
      31  90  131  190
1      0   0    1    1
101    0   0    0    0

      31  90  131  190
1      0   0    1    1
101    0   0    1    1
One problem is that e⋆, the residual vector that appears in the likelihood for
the latent variable, does not have a simple (block) diagonal structure when (as
in the case above) the elements of the response vector that are regressed on each
other are not grouped in the R-structure:
e⋆ ∼ N(0, Λ⁻¹RΛ⁻ᵀ)                                    (9.3)
Consequently, analyses that involve latent variables (i.e. non-Gaussian data,
or analyses that have incomplete records for determining the R-structure) are
currently not implemented in MCMCglmm.
The path function is a way of specifying path models that are less general
than those formed by sir but are simple enough to allow updating of the latent
variables associated with non-Gaussian data. Imagine a residual structure is
fitted where the N observations are grouped into n blocks of k. For instance
this might be k different characteristics measured in n individuals. A path model
may be entertained whereby an individual’s characteristics only affect their own
characteristics rather than anyone else's. In this case, Ψ(l) = ψ(l) ⊗ In is block
diagonal, and Ψ = ψ ⊗ In where ψ = [ψ(1), ψ(2), ..., ψ(L)]. Consequently,
Λ = IN − Ψ(λ ⊗ IN)
  = IN − (ψ ⊗ In)(λ ⊗ IN)
  = (Ik ⊗ In) − (ψ ⊗ In)(λ ⊗ Ik ⊗ In)                                    (9.4)
  = (Ik − ψ(λ ⊗ Ik)) ⊗ In
and so Λ⁻¹ = (Ik − ψ(λ ⊗ Ik))⁻¹ ⊗ In and |Λ| = |Ik − ψ(λ ⊗ Ik)|ⁿ.
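A small numerical check of this identity (an arbitrary example with k = 2 traits
measured on n = 3 individuals and a single path coefficient; all values are
assumptions):

> k <- 2; n <- 3
> psi <- matrix(c(0, 1,
+                 0, 0), k, k, byrow = TRUE)    # trait 2 affects trait 1
> lambda <- 0.4
> Psi <- kronecker(psi, diag(n))                # Psi = psi (x) I_n
> L.direct <- diag(k * n) - Psi %*% kronecker(lambda, diag(k * n))
> L.kron <- kronecker(diag(k) - psi %*% kronecker(lambda, diag(k)), diag(n))
> all.equal(L.direct, L.kron)                   # TRUE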
Mean centering responses can help mixing, because theta and lambda are
not sampled in a block. (Jarrod - I guess when |Λ| = 1 this could be detected
and updating occur in a block?)
9.2 Antedependence
An n × n unstructured covariance matrix can be reparameterised in terms of
regression coefficients and residual variances from a set of n nested multiple
regressions. For example, for n = 3 the following 3 multiple regressions can be
defined:
u1 = eu1
u2 = β2|1 u1 + eu2
u3 = β3|1 u1 + β3|2 u2 + eu3
so that
     [  1       0      0 ]
Lu = [ −β2|1    1      0 ]                                    (9.6)
     [ −β3|1   −β3|2    1 ]
and
Du = diag(σ²eu1, σ²eu2, σ²eu3)                                    (9.7)
gives
Vu = Lu⁻¹ Du Lu⁻ᵀ                                    (9.8)
Rather than fit the saturated model (in this case all 3 regression coefficients)
kth order antedependence models seek to model Vu whilst constraining the
regression coefficients in Lu to be zero if they are on sub-diagonals > k. For ex-
ample, a first order antedependence model would set the regression coefficients
in the second off-diagonal (i.e. β3|1 ) to zero, but estimate those in the first sub-
diagonal (i.e. β2|1 and β3|2 ). For a 3×3 matrix, a second order antedependence
model would fit a fully unstructured covariance matrix. In terms of Gibbs sam-
pling this parameterisation is less efficient because Vu is sampled in two blocks
(the regression coefficients followed by the innovation variances) rather than
in a single block from the inverse Wishart. However, more flexible conjugate
prior specifications are possible by placing multivariate normal priors on the
regression coefficients and independent inverse Wishart priors on the innovation
variances. Constraining arbitrary regression coefficients to be zero in a fully
unstructured model allows any fully recursive path model to be constructed for
a set of random effects.
9.3 Scaling
The path analyses described above essentially allow elements of the response
vector to be regressed on each other. Regressing an observation on itself would
seem like a peculiar thing to do, although with a little work we can show that by
doing this we can allow two sets of observations to conform to the same model
except for a difference in scale.
Acknowledgments
Bibliography