Theory and ideas behind INLA
What are the main ideas behind R-INLA? What would you need to
know to implement it yourself?
Likelihood
y | x, θ_1 ∼ ∏_{i∈I} π(y_i | x_i, θ_1)
Posterior
π(x, θ | y) ∝ π(θ) π(x | θ) ∏_{i∈I} π(y_i | x_i, θ).    (1)
We make the following critical assumption: the latent field
x = (µ, β, u, v, η)
is jointly Gaussian.
Example 1 - Precision matrix
Precision matrix of (η, u, v, µ, β), with N = 100 and |θ| = 5.
[Figure: image of the joint precision matrix; colour scale from 0.0 to 1.0]
p_t = exp(η_t) / (1 + exp(η_t))
- Linear predictor
  η_t = µ + β c_t + u_t + v_t,   t = 1, . . . , n
Example 2 - Continued
Prior models
- µ and β are Normal
- u is an AR-model, like
  u_t = φ u_{t-1} + ε_t
Hyperparameters
θ = (φ, σ_u², σ_v²)
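To make the structure of Example 2 concrete, here is a minimal simulation sketch. The Bernoulli observation model (with the logit link above), the covariate c_t, and all parameter values are assumptions chosen for illustration, not taken from the slides.

set.seed(1)
n   <- 100
ct  <- rnorm(n)                                  # covariate c_t (assumed)
mu  <- -1; beta <- 0.5                           # fixed effects (assumed values)
phi <- 0.9; sigma_u <- 0.3; sigma_v <- 0.1       # hyperparameters (assumed values)
u   <- as.numeric(arima.sim(list(ar = phi), n, sd = sigma_u))  # AR(1) component u_t
v   <- rnorm(n, sd = sigma_v)                    # unstructured component v_t
eta <- mu + beta * ct + u + v                    # linear predictor eta_t
p   <- plogis(eta)                               # p_t = exp(eta_t) / (1 + exp(eta_t))
ybin <- rbinom(n, size = 1, prob = p)            # Bernoulli observations (assumed likelihood)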
Example 2 - Rewritten
θ ∼ π(θ)
x | θ ∼ π(x | θ) = N(0, Σ(θ))
y | x, θ ∼ ∏_i π(y_i | η_i, θ)
x = (η, γ, α, β, f_1, f_2, . . .)
θ = (θ_{f_1}, θ_{f_2}, θ_{f_3}, θ_{f_4}, . . .)
What is it that we want to do?
A GMRF is a Gaussian vector x = (x_1, . . . , x_n)^T with conditional-independence (Markov) properties
x_i ⊥ x_j | x_{-ij},
and these are encoded exactly by the zeros of the precision matrix Q:
x_i ⊥ x_j | x_{-ij} ⟺ Q_ij = 0.
Example: Auto-regressive process
A simple example of a GMRF: the auto-regressive process of order 1.
[Figure: chain graph x_1 — x_2 — x_3 — x_4 — x_5 — · · · — x_n]
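A quick sketch of the "⟺ Q_ij = 0" point: for a stationary AR(1) process the precision matrix is tridiagonal, so x_i ⊥ x_j | x_{-ij} whenever |i − j| > 1. The helper name and parameter values below are just for illustration.

# Tridiagonal precision matrix of a stationary AR(1) with parameter phi and innovation sd sigma
ar1_precision <- function(n, phi, sigma = 1) {
  Q <- matrix(0, n, n)
  diag(Q) <- c(1, rep(1 + phi^2, n - 2), 1)
  Q[cbind(1:(n - 1), 2:n)] <- -phi        # first super-diagonal
  Q[cbind(2:n, 1:(n - 1))] <- -phi        # first sub-diagonal
  Q / sigma^2
}
round(ar1_precision(6, phi = 0.8), 2)     # zero everywhere except |i - j| <= 1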
- Analytically tractable
- Modelling using conditional independence
- Merging GMRFs using conditioning (hierarchical models)
- Unified framework for
  - understanding
  - representation
- Computation using numerical methods for sparse matrices (see the sketch after this list)
- Why and how this works can be understood using conditional independence (maybe today...)
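In practice the "numerical methods for sparse matrices" are sparse Cholesky factorisations. A minimal sketch (AR(1) precision with assumed values), using the Matrix package, of how a GMRF is sampled from its sparse precision:

library(Matrix)
n   <- 1000; phi <- 0.8
Q   <- bandSparse(n, k = c(0, 1), symmetric = TRUE,
                  diagonals = list(c(1, rep(1 + phi^2, n - 2), 1), rep(-phi, n - 1)))
R   <- chol(Q)                  # sparse upper-triangular factor, Q = R'R
x   <- solve(R, rnorm(n))       # x ~ N(0, Q^{-1}); the cost is driven by the sparsity of R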
Waypoint
η = µ1 + β z + A_1 f_1 + A_2 f_2 + ε,
where ε is a small noise term with fixed (high) precision τ.
Let τ_µ and τ_β be the (fixed) prior precisions for µ and β. The two model components f_1 and f_2 have sparse precision matrices Q_1(θ) and Q_2(θ). Also, A_1 (and similarly A_2) is an n × m_1 sparse matrix, which is zero except for exactly one 1 in each row. The joint precision matrix Q_joint(θ) of (η, f_1, f_2, β, µ) is (upper triangle shown; the matrix is symmetric)

  τI   −τA_1                  −τA_2                  −τz             −τ1
       Q_1(θ) + τA_1^T A_1    τA_1^T A_2             τA_1^T z        τA_1^T 1
                              Q_2(θ) + τA_2^T A_2    τA_2^T z        τA_2^T 1
                                                     τ_β + τ z^T z   τ z^T 1
                                                                     τ_µ + τ 1^T 1
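A numerical sketch of this block structure, with made-up small dimensions and stand-in matrices for Q_1(θ) and Q_2(θ) (everything below is illustrative only):

library(Matrix)
set.seed(1)
n <- 8; m1 <- 3; m2 <- 4
tau <- 1e3                                         # fixed precision of the small noise term
tau_mu <- 1e-3; tau_beta <- 1e-3                   # fixed prior precisions for mu and beta
A1 <- sparseMatrix(i = 1:n, j = sample(m1, n, replace = TRUE), x = 1, dims = c(n, m1))
A2 <- sparseMatrix(i = 1:n, j = sample(m2, n, replace = TRUE), x = 1, dims = c(n, m2))
Q1 <- Diagonal(m1, 2); Q2 <- Diagonal(m2, 3)       # stand-ins for Q1(theta), Q2(theta)
z  <- rnorm(n)
# eta - A1 f1 - A2 f2 - beta z - mu 1 = eps, with precision tau:
B  <- cbind(Diagonal(n), -A1, -A2, -z, rep(-1, n))
Qjoint <- tau * crossprod(B) +
  bdiag(Diagonal(n, 0), Q1, Q2, matrix(tau_beta), matrix(tau_mu))
image(Qjoint)                                      # displays the sparse block pattern shown above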
Looking at Q
We want to compute the posterior marginals π(x_i | y) or π(θ_j | y).
These are very high-dimensional integrals and are typically not analytically tractable.
(With a Gaussian likelihood, the posterior of x for fixed hyperparameters, π(x | θ, y), is Gaussian and available directly.)
n   = 50
idx = 1:n
# generate something smooth representing the mean
# (the exact definition used on the slide is not recoverable; this is a plausible stand-in)
fun = 10 * sin(idx / 8)
y   = fun + rnorm(n, mean = 0, sd = 1)
plot(idx, y)

[Figure: scatter plot of y against idx (0 to 50), noisy points around a smooth trend]
Assumed hierarchical model
y_i | x_i, θ ∼ N(x_i, 1/τ_0)
x | θ: a smooth model for the mean (model="rw2" in R-INLA), with precision θ
Derivation of posterior marginals (I)
Since
x, y | θ ∼ N (·, ·)
(derived using π(x, y | θ) ∝ π(y | x, θ) π(x | θ)),
we can compute (numerically) all marginals, using that
π(θ | y) = π(x, θ | y) / π(x | y, θ)
         ∝ π(x, y | θ) π(θ) / π(x | y, θ),
where both π(x, y | θ) and π(x | y, θ) are Gaussian.
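For the Gaussian running example this identity can be evaluated directly on a grid of θ values. A sketch, reusing y from the simulation above; the second-difference (rw2-type) precision gets a tiny ridge so it is non-singular here, and the Gamma prior for the precision is an assumption:

log_post <- function(log_tau, y, tau0 = 1) {
  n    <- length(y)
  D    <- diff(diag(n), differences = 2)                 # second-difference matrix
  Q    <- exp(log_tau) * crossprod(D) + 1e-5 * diag(n)   # prior precision of x given theta
  Qc   <- Q + tau0 * diag(n)                             # precision of x | y, theta
  m    <- solve(Qc, tau0 * y)                            # its mean (and mode)
  ldet <- function(M) as.numeric(determinant(M)$modulus)
  lp_x   <- 0.5 * ldet(Q) - 0.5 * sum(m * (Q %*% m))     # log pi(x | theta) at x = m (up to constants)
  lp_y_x <- -0.5 * tau0 * sum((y - m)^2)                 # log pi(y | x) at x = m (up to constants)
  lp_xy  <- 0.5 * ldet(Qc)                               # log pi(x | y, theta) at its mode
  lp_th  <- dgamma(exp(log_tau), 1, 5e-5, log = TRUE) + log_tau  # assumed prior, on the log scale
  lp_y_x + lp_x + lp_th - lp_xy
}
theta_grid <- seq(0, 6, by = 0.1)                        # grid over log(theta)
lp         <- sapply(theta_grid, log_post, y = y)
post_theta <- exp(lp - max(lp))
post_theta <- post_theta / sum(post_theta * 0.1)         # normalise on the grid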
Posterior marginal for hyperparameter
[Figure: two panels showing the exponential of the log density of the hyperparameter, evaluated at grid points, against the log precision log(θ) from 1 to 6]
Derivation of posterior marginals (II)
From
x | y, θ ∼ N (·, ·)
we can compute
π(x_i | y) = ∫ π(x_i | y, θ) π(θ | y) dθ
           ≈ Σ_k π(x_i | y, θ_k) π(θ_k | y) Δ_k,
where each π(x_i | y, θ_k) is Gaussian.
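Continuing the sketch above (same assumptions; theta_grid and post_theta come from the previous code), the finite sum is simply a mixture of the Gaussian conditionals:

xi_marginal <- function(i, xvals, y, theta_grid, post_theta, delta = 0.1, tau0 = 1) {
  n <- length(y)
  D <- diff(diag(n), differences = 2)
  dens <- 0
  for (k in seq_along(theta_grid)) {
    Q  <- exp(theta_grid[k]) * crossprod(D) + 1e-5 * diag(n)
    Qc <- Q + tau0 * diag(n)                 # precision of x | y, theta_k
    m  <- solve(Qc, tau0 * y)                # conditional mean
    s  <- sqrt(diag(solve(Qc)))              # conditional marginal standard deviations
    dens <- dens + dnorm(xvals, m[i], s[i]) * post_theta[k] * delta
  }
  dens
}
xs <- seq(-15, 15, by = 0.1)
plot(xs, xi_marginal(25, xs, y, theta_grid, post_theta), type = "l")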
[Figures: estimated posterior density (labelled "density"), and the data y plotted against idx (0 to 50)]
R-code
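The code on this slide did not survive extraction. As a stand-in, here is a minimal R-INLA call fitting the assumed hierarchical model above (a sketch, not the original slide's code; y and idx come from the simulation earlier):

library(INLA)
d      <- data.frame(y = y, idx = idx)
result <- inla(y ~ f(idx, model = "rw2"),
               family = "gaussian", data = d,
               control.compute = list(config = TRUE))     # config = TRUE enables sampling later
summary(result)
lines(idx, result$summary.fitted.values$mean, lwd = 2)    # add the fitted smooth to plot(idx, y)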
1. More than one hyperparameter
[Figure: the (θ_1, θ_2) plane with principal-component axis z_1, illustrating the exploration of π(θ | y)]
- Compute the Hessian to construct principal components
- Grid-search to locate the bulk of the probability mass
Alternatives:
- Extreme: use just the modal configuration (empirical Bayes)
- Use a central composite design (CCD), e.g. for m = 2
  [Figure: CCD layout showing design points and circle points]
(A code sketch of this exploration follows.)
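A sketch of the exploration step for a 2-dimensional θ. Here lpost stands for any function returning log π̃(θ | y) up to a constant (hypothetical, not defined in the slides), and the drop threshold of 2.5 is only illustrative:

explore_theta <- function(lpost, start = c(0, 0), step = 1, drop = 2.5) {
  opt <- optim(start, function(th) -lpost(th), hessian = TRUE)   # mode and Hessian of -log post
  E   <- eigen(opt$hessian)
  theta_of_z <- function(z) opt$par + E$vectors %*% (z / sqrt(E$values))  # z-parametrisation
  zgrid <- as.matrix(expand.grid(z1 = seq(-5, 5, step), z2 = seq(-5, 5, step)))
  lp    <- apply(zgrid, 1, function(z) lpost(theta_of_z(z)))
  keep  <- lp > (-opt$value - drop)            # retain points carrying the bulk of the mass
  list(theta  = t(apply(zgrid[keep, , drop = FALSE], 1, theta_of_z)),
       weight = exp(lp[keep] - max(lp[keep])))
}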
2. Non-Gaussian observations
π(x | θ, y) ∝ exp( −(1/2) x^T Q x + Σ_{i=1}^n log π(y_i | x_i) )

The GMRF approximation (I)
Each term f(x_i) = log π(y_i | x_i) is replaced by its second-order Taylor expansion around a point x_0:

f(x) ≈ f(x_0) + f′(x_0)(x − x_0) + (1/2) f″(x_0)(x − x_0)² = a + b x − (1/2) c x²

with b = f′(x_0) − f″(x_0) x_0 and c = −f″(x_0). (Note: a is not relevant.)
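For instance, with a Bernoulli likelihood (the logit model of Example 2), f(x) = y x − log(1 + e^x), and b and c have simple closed forms; the expansion point below is arbitrary:

taylor_bc <- function(x0, y) {
  p  <- plogis(x0)               # exp(x0) / (1 + exp(x0))
  f1 <- y - p                    # f'(x0)
  f2 <- -p * (1 - p)             # f''(x0)
  c(b = f1 - f2 * x0, c = -f2)   # as in the display above
}
taylor_bc(x0 = 0.5, y = 1)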
The GMRF approximation (II)
Thus,

π̃(x | θ, y) ∝ exp( −(1/2) x^T Q x + Σ_{i=1}^n (a_i + b_i x_i − 0.5 c_i x_i²) )
            ∝ exp( −(1/2) x^T (Q + diag(c)) x + b^T x ),

which corresponds to the canonical form N_C(b, Q + diag(c)).
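The expansion point is then updated iteratively: expand at the current x, solve the resulting Gaussian, and repeat until the mode is reached. A minimal dense-matrix sketch for Bernoulli data (the fixed iteration count and the example call are simplifications):

gmrf_approx <- function(Q, y, x = rep(0, length(y)), niter = 25) {
  for (it in 1:niter) {
    p    <- plogis(x)
    cvec <- p * (1 - p)                              # c_i = -f''(x_i), Bernoulli log-likelihood
    bvec <- (y - p) + cvec * x                       # b_i = f'(x_i) - f''(x_i) x_i
    x    <- as.numeric(solve(Q + diag(cvec), bvec))  # mean of N_C(b, Q + diag(c))
  }
  p <- plogis(x)
  list(mean = x, precision = Q + diag(p * (1 - p)))
}
# e.g. gmrf_approx(ar1_precision(100, phi = 0.9), y = rbinom(100, 1, 0.5))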
[Figure: four panels on (−1, 4) comparing a full conditional with its normal approximation; the mode and the expansion point are marked in each panel]
π̃(θ | y) ∝ [ π(x, y | θ) π(θ) / π̃(x | y, θ) ] evaluated at x = x*(θ),

where x*(θ) is the mode of the Gaussian approximation π̃(x | y, θ), and

π(x_i | y) ≈ Σ_k π(x_i | y, θ_k) π̃(θ_k | y) Δ_k,

where π(x_i | y, θ_k) is now not Gaussian!
Approximating π(x_i | y, θ)
2. Laplace approximation

π̃_LA(x_i | θ, y) ∝ [ π(x, θ, y) / π̃_GG(x_{−i} | x_i, θ, y) ] evaluated at x_{−i} = x*_{−i}(x_i, θ)

This is the default option when using INLA, but this choice can be modified.
The integrated nested Laplace approximation (INLA)
The dimension of θ must stay small, since π(θ | y) is explored numerically. In other words, each random effect can be big, but there cannot be too many random effects unless they share parameters.
How to represent the posterior?
The posterior
π(θ, x | y = data)
is a massive object!
Discuss:
- Samples from π(θ, x | y = data)
- Marginal, marginal-joint, and conditional distributions
- Samples from θ and Gaussian π(x | θ = sample, y = data)
(A short R-INLA sketch follows.)
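For the R-INLA fit sketched earlier, several of these representations are directly available (a sketch; element names can differ between INLA versions):

marg_x25 <- result$marginals.random$idx[[25]]      # marginal density of one latent component
marg_hyp <- result$marginals.hyperpar[[1]]         # marginal of a hyperparameter
samples  <- inla.posterior.sample(100, result)     # joint samples of (theta, x); needs config = TRUE
hyp_samp <- inla.hyperpar.sample(100, result)      # samples of theta only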
Any questions?