
Introduction and motivation

for the INLA course in Iceland 2016

Haakon Chris. Bakka

Norwegian University of Science and Technology

What are the main ideas behind R-INLA? What would you need to know to implement it yourself?

Rough time estimate: 4h presentation, 2h exercises.


Source

This presentation is partially based on the wonderful and very readable paper "Bayesian Computing with INLA: A Review" by Rue et al. (2016).
Goal

The goal of this presentation is to improve your understanding of the INLA method.

The level of detail is set to the maximum we can get through in 2-3 hours. This is not about coding.
What kind of models can we work with?
Latent Gaussian Models

Likelihood

    y | x, θ1 ∼ ∏_{i∈I} π(yi | xi, θ1)

Latent variable (linear predictor)

    x | θ2 ∼ N(µ(θ2), Q⁻¹(θ2))

Posterior

    π(x, θ | y) ∝ π(θ) π(x | θ) ∏_{i∈I} π(yi | xi, θ).     (1)
We make the following critical assumptions

1. The number of hyperparameters |θ| is small, typically 2 to 5, but not exceeding 20.
2. The distribution of the latent field x | θ is Gaussian, and is required to be a Gaussian Markov random field (GMRF), possibly of high dimension (e.g. 10⁵). Many xi will not be observed.
3. Conditional on θ and x, the observations y are mutually independent.
Are there many interesting models of this type?
Example 1: Groups and individuals

ηij = µ + cij β + ui + vi + wij , i = 1, . . . , N, j = 1, . . . , M

for covariates cij and “random effects” u, v and w.

If we assign Gaussian priors on µ, β, u and v, then

x = (µ, β, u, v, η)

is jointly Gaussian.
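As an aside, here is a minimal R-INLA sketch of how a model of this kind could be specified. The simulated data, object names and the choice of an iid effect are assumptions for illustration, not part of the slides; each of u, v and w would get its own f() term with its own index column.

library(INLA)

N <- 100; M <- 5
group <- rep(1:N, each = M)          # index i
cij   <- rnorm(N * M)                # covariate
u     <- rnorm(N, sd = 0.5)          # group-level effect
y     <- 1 + 0.3 * cij + u[group] + rnorm(N * M, sd = 1)
dat   <- data.frame(y = y, cij = cij, group = group)

# mu and beta are fixed effects with Gaussian priors;
# f(group, model = "iid") plays the role of the group-level "random effect"
formula <- y ~ 1 + cij + f(group, model = "iid")
result  <- inla(formula, family = "gaussian", data = dat)
summary(result)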
Example 1 - Precision matrix
[Figure: sparsity pattern of the precision matrix of (η, u, v, µ, β), with N = 100 and |θ| = 5.]


Example 2: Smoothing of binary time-series

▶ Data is a sequence of 0s and 1s
▶ The probability of a 1 at time t, pt, depends on time:

    pt = exp(ηt) / (1 + exp(ηt))

▶ Linear predictor

    ηt = µ + β ct + ut + vt,    t = 1, . . . , n
Example 2 - Continued
Prior models
▶ µ and β are Normal
▶ u is an AR(1) model,

    ut = φ ut−1 + εt

  with parameters (φ, σε²).
▶ v could be an unstructured term / "random effect" / slowly varying trend

Then

    x = (µ, β, u, v, η)

is jointly Gaussian.

Hyperparameters

    θ = (φ, σε², σv²)
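A hedged R-INLA sketch of how such a binary time series could be fitted. The simulated data, the covariate and all object names are assumptions for illustration; note that the same time index has to be copied into a second column to use it in two f() terms.

library(INLA)

n    <- 200
tm   <- 1:n
ct   <- rnorm(n)                                   # covariate
eta  <- -0.5 + 0.3 * ct + 2 * sin(2 * pi * tm / n) # "true" linear predictor
y    <- rbinom(n, size = 1, prob = plogis(eta))
dat  <- data.frame(y = y, ct = ct, t = tm, t2 = tm)

# u_t as an AR(1) term, v_t as an unstructured iid term
formula <- y ~ 1 + ct + f(t, model = "ar1") + f(t2, model = "iid")
result  <- inla(formula, family = "binomial", Ntrials = 1, data = dat)
summary(result)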
Example 2 - Rewritten

We can reinterpret the model as

    θ ∼ π(θ)
    x | θ ∼ π(x | θ) = N(0, Σ(θ))
    y | x, θ ∼ ∏_i π(yi | ηi, θ)

▶ dim(x) could be large, 10²-10⁵
▶ dim(θ) is small, 1-5
[Figure: sparsity pattern of the precision matrix of (η, u, v, µ, β), with N = 100 and M = 5.]


Example 3: Presence-Absence-Abundance

▶ Count data, with lots of zeros
▶ Bernoulli model for presence-absence, with probability pi = invlogit(ηi)
▶ Poisson model for abundance, with intensity λi = exp(γi)
▶ ηi = Xi α + f1,i + ...
▶ γi = Xi β + f4,i + ...

    x = (η, γ, α, β, f1, f2, ...)
    θ = (θf1, θf2, θf3, θf4, ...)
What is it that we want to do?

We want to compute probabilities in, and to sample from, big multivariate Gaussians.
Waypoint

So far: There are a million interesting models. These models have huge Gaussian components.
Next goal: How does the precision matrix Q(θ) look?
First: What is a GMRF? (And why are we talking about the precision matrix?)
What is a Gaussian Markov random field (GMRF)?

A GMRF is a simple construct

▶ A normally distributed random vector

    x = (x1, . . . , xn)^T

▶ Additional Markov properties:

    xi ⊥ xj | x−ij

  xi and xj are conditionally independent (CI).


The precision matrix

If xi ⊥ xj | x−ij for a set of {i, j}, then we need to constrain the parametrisation of the GMRF.

▶ Covariance matrix: difficult
▶ Precision matrix: easy
Conditional independence and the precision matrix

The density of a zero mean Gaussian is

    π(x) ∝ |Q|^{1/2} exp( −½ x^T Q x )

Constraining the parametrisation to obey CI properties:

Theorem

    xi ⊥ xj | x−ij  ⟺  Qij = 0
Example: Auto-regressive process
Simple example of a GMRF
Auto-regressive process of order 1

    xt | xt−1, . . . , x1 ∼ N(φ xt−1, 1),    t = 2, . . . , n

and x1 ∼ N(0, (1 − φ²)⁻¹).

[Figure: graph of the chain x1 — x2 — x3 — x4 — x5 — · · · — xn]

Tridiagonal precision matrix

        | 1     −φ                            |
        | −φ    1+φ²   −φ                     |
    Q = |        ·       ·       ·            |
        |               −φ     1+φ²    −φ     |
        |                       −φ     1      |
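A small sketch (using the Matrix package; the values of n and φ are arbitrary) that builds this tridiagonal precision matrix and shows that its zeros encode conditional independence even though the implied covariance is dense.

library(Matrix)

n   <- 8
phi <- 0.7
Q <- bandSparse(n, k = 0:1,
                diagonals = list(c(1, rep(1 + phi^2, n - 2), 1),
                                 rep(-phi, n - 1)),
                symmetric = TRUE)
print(round(as.matrix(Q), 2))        # tridiagonal: only neighbours interact

Sigma <- solve(Q)                    # dense: marginal correlations decay as phi^|i-j|
round(cov2cor(as.matrix(Sigma)), 2)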
Main features of GMRFs

▶ Analytically tractable
▶ Modelling using conditional independence
▶ Merging GMRFs using conditioning (hierarchical models)
▶ Unified framework for
  ▶ understanding
  ▶ representation
▶ Computation using numerical methods for sparse matrices
▶ Why and how this works can be understood using conditional independence (maybe today...)
Waypoint

So far: We know what a GMRF (Gaussian Markov random field) is, why GMRFs make sense computationally, and why they make sense for modelling relationships.
Next: How does the precision matrix Q(θ) look in a typical model?
General example Q(θ)

    η = µ1 + βz + A1 f1 + A2 f2 + ε.

Let τµ and τβ be the (fixed) prior precisions for µ and β, and let τ be the precision of the small noise term ε. The two model components f1 and f2 have sparse precision matrices Q1(θ) and Q2(θ). Also, A1 (and similarly A2) is an n × m1 sparse matrix, which is zero except for exactly one 1 in each row. The joint precision matrix Qjoint(θ) of (η, f1, f2, β, µ) is (upper triangle shown; the matrix is symmetric)

    | τI   −τA1               −τA2               −τz           −τ1          |
    |      Q1(θ) + τA1^T A1   τA1^T A2           τA1^T z       τA1^T 1      |
    |                         Q2(θ) + τA2^T A2   τA2^T z       τA2^T 1      |
    |                                            τβ + τz^T z   τz^T 1       |
    |                                                          τµ + τ1^T 1  |
Looking at Q

Q = INLA:::inla.rw(100, 2)
image(Q)

demo("Epil")
result = inla(formula, family = "poisson", data = Epil,
              control.compute = list(config = TRUE))
image(forceSymmetric(result$misc$configs$config[[1]]$Q))
image(chol(forceSymmetric(result$misc$configs$config[[1]]$Q)))
What do we care about?
The most important quantity in Bayesian statistics is the posterior distribution:

    π(x, θ | y) ∝ π(θ) π(x | θ) ∏_{i∈I} π(yi | xi, θ)
     (posterior)    (---- prior ----)   (likelihood)

from which we can derive the quantities of interest, such as

    π(xi | y) ∝ ∫∫ π(x, θ | y) dx−i dθ
              = ∫ π(xi | θ, y) π(θ | y) dθ

or π(θj | y).

These are very high-dimensional integrals and are typically not analytically tractable.
Gaussian likelihood gives Gaussian fixed-hyper posterior

Fix the hyperparameters to θ0; then

    π(x | y = data) ∝ π(x, y = data)
                    ∝ π(y = data | x) π(x)

where everything is Gaussian. To renormalise, we need to compute a determinant.
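A sketch of this computation for a toy model (an AR(1) prior on x and simulated data; these choices and all names are assumptions for illustration): the posterior precision is the prior precision plus τ0·I, the mean comes from one sparse solve, and the log-determinant needed for renormalisation comes from the same sparse matrix.

library(Matrix)

n    <- 100
phi  <- 0.95
tau0 <- 1                                        # known observation precision
Qprior <- bandSparse(n, k = 0:1,
                     diagonals = list(c(1, rep(1 + phi^2, n - 2), 1),
                                      rep(-phi, n - 1)),
                     symmetric = TRUE)
x_true <- as.numeric(arima.sim(model = list(ar = phi), n = n))
y      <- x_true + rnorm(n, sd = 1 / sqrt(tau0))

# pi(x | y, theta0) = N(mu, Qpost^{-1}) with Qpost = Qprior + tau0 * I
# and Qpost %*% mu = tau0 * y
Qpost  <- forceSymmetric(Qprior + tau0 * Diagonal(n))
L      <- Cholesky(Qpost)
mu     <- solve(L, tau0 * y)                            # posterior mean
logdet <- determinant(Qpost, logarithm = TRUE)$modulus  # for the normalising constant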
Recall: What is our model framework?

Latent Gaussian models

    y | x, θ ∼ ∏_i π(yi | ηi, θ)
    x | θ ∼ π(x | θ) = N(0, Q(θ)⁻¹)     Gaussian!
    θ ∼ π(θ)                            Not Gaussian

where the precision matrix Q(θ) is sparse. Generally these "sparse" Gaussian distributions are called Gaussian Markov random fields (GMRFs).

The sparseness can be exploited for very quick computations for the Gaussian part of the model through numerical algorithms for sparse matrices.
The INLA idea

Use the posterior distribution

π(x, θ | y) ∝ π(θ)π(x | θ)π(y | x, θ)

to approximate the posterior marginals

π(xi | y ) and π(θj | y )

directly.

Let us consider a toy example to illustrate the ideas.


How does INLA work?

Observations

    yi = m(i) + εi,    i = 1, . . . , n

Here, we assume that m(i) is a smooth function of i, and that the εi are iid N(0, τ0) with known precision τ0.

n = 50
idx = 1:n
# generate something smooth representing m
fun = 100 * ((idx - n/2) / n)^3
# add some noise
y = fun + rnorm(n, mean = 0, sd = 1)
plot(idx, y)

[Figure: scatter plot of the simulated data y against idx]
Assumed hierarchical model

1. Data: Gaussian observations with known precision

    yi | xi, θ ∼ N(xi, τ0)

2. Latent model: a random walk of second order¹

    π(x | θ) ∝ θ^((n−2)/2) exp( −(θ/2) Σ_{i=3}^{n} (xi − 2xi−1 + xi−2)² )

3. Hyperparameter: the smoothing parameter θ, which we assign a Γ(a, b) prior

    π(θ) ∝ θ^(a−1) exp(−bθ),    θ > 0

¹ model="rw2"
Derivation of posterior marginals (I)

Since

    x, y | θ ∼ N(·, ·)

(derived using π(x, y | θ) ∝ π(y | x, θ) π(x | θ)), we can compute (numerically) all marginals, using that

    π(θ | y) = π(x, θ | y) / π(x | y, θ)
             ∝ π(x, y | θ) π(θ) / π(x | y, θ)

where both the numerator π(x, y | θ) and the denominator π(x | y, θ) are Gaussian.
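A sketch of this identity for the toy RW2 model, up to an additive constant. It assumes y, tau0 and the structure matrix Q_rw2 from the earlier sketches; the small jitter is an assumption needed because the RW2 prior is intrinsic (rank deficient), and the Gamma-prior parameters are placeholders.

log_post_theta <- function(log_theta, y, Qstruct, tau0, a = 1, b = 5e-5) {
  theta <- exp(log_theta)
  n     <- length(y)
  Qx    <- theta * Qstruct + 1e-5 * diag(n)   # jittered prior precision of x
  Qpost <- Qx + tau0 * diag(n)                # precision of x | y, theta
  mu    <- solve(Qpost, tau0 * y)             # mean of x | y, theta
  # log pi(x | theta) + log pi(y | x), evaluated at x = mu
  lp_joint <- 0.5 * determinant(Qx)$modulus -
              0.5 * sum(mu * (Qx %*% mu)) -
              0.5 * tau0 * sum((y - mu)^2)
  # log pi(x | y, theta) at its mean: only the determinant term remains
  lp_cond  <- 0.5 * determinant(Qpost)$modulus
  # Gamma(a, b) prior on theta, including the Jacobian for log(theta)
  lp_prior <- a * log_theta - b * theta
  as.numeric(lp_joint - lp_cond + lp_prior)
}

Evaluating this on a grid of log θ values and exponentiating (after subtracting the maximum) gives an unnormalised version of the curve on the following slides.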
Posterior marginal for hyperparameter

Select a grid of points t1, . . . , tk to represent the density of θ | y. (Here, the points are chosen to be equi-distant.)

[Figure, two panels: "Posterior marginal for θ" and "Posterior marginal for θ (interpolated)"; the exponential of the log density plotted against the log precision log(θ).]
Derivation of posterior marginals (II)

From

    x | y, θ ∼ N(·, ·)

we can compute

    π(xi | y) = ∫ π(xi | y, θ) π(θ | y) dθ        (with π(xi | y, θ) Gaussian)
              ≈ Σ_k π(xi | y, tk) π(tk | y) ∆k

where tk, k = 1, . . . , K, correspond to representative points of θ | y and ∆k are the corresponding area weights (all equal if the points are equi-distant).
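Continuing the toy sketch under the same assumptions (reusing log_post_theta, y, tau0 and Q_rw2 from above): the grid weights are the normalised exponentiated log-densities, and π(x1 | y) is a finite mixture of the conditional Gaussians.

log_theta_grid <- seq(0, 6, by = 0.25)
lp <- sapply(log_theta_grid, log_post_theta, y = y, Qstruct = Q_rw2, tau0 = tau0)
w  <- exp(lp - max(lp)); w <- w / sum(w)       # equi-distant grid: all Delta_k equal

n <- length(y)
cond <- sapply(log_theta_grid, function(lt) {  # mean and sd of x_1 | y, t_k
  Qpost <- exp(lt) * Q_rw2 + 1e-5 * diag(n) + tau0 * diag(n)
  S     <- solve(Qpost)
  c(mean = (S %*% (tau0 * y))[1], sd = sqrt(S[1, 1]))
})

x1_grid <- seq(min(cond["mean", ]) - 4, max(cond["mean", ]) + 4, length.out = 200)
dens_x1 <- colSums(w * t(sapply(seq_along(w), function(k)
  dnorm(x1_grid, cond["mean", k], cond["sd", k]))))
plot(x1_grid, dens_x1, type = "l")             # approximate pi(x_1 | y)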
Posterior marginal for latent parameters

Compute the conditional marginal posterior for each xi given tk. Here, shown for x1.

[Figure: the conditional densities of x1 for each θ grid point (unweighted), density against x1.]
Posterior marginal for latent parameters

Weigh the resulting (conditional) marginal posterior by the density associated with each θk.

[Figure: the same conditional densities of x1, now weighted; weighted density against x1.]
Posterior marginal for latent parameters

Numerically sum over all conditional densities to obtain the posterior marginal for each xi.

[Figure: the resulting posterior marginal of x1, obtained by summing the weighted conditional densities.]
Fitted spline

The posterior marginals are used to calculate summary statistics, like means, variances and credible intervals.

[Figure: the data y against idx with the fitted posterior mean curve.]
R-code

formula = y ~ -1 + f(idx, model = "rw2", constr = FALSE,
                     hyper = list(prec = list(prior = "loggamma", param = c(a, b))))

result = inla(formula,
              data = data.frame(y = y, idx = idx),
              control.family = list(initial = log(tau_0), fixed = TRUE))

plot(idx, y, pch = 19)
lines(result$summary.random[[1]]$mean, col = 2, lwd = 2)
Extensions

This is the basic idea behind INLA. It is quite simple.

However, we need to extend this basic idea so we can deal with

1. More than one hyperparameter
2. Non-Gaussian observations
1. More than one hyperparameter

▶ Locate the mode
▶ Compute the Hessian to construct principal components (a reparametrisation z1, z2, . . . of θ1, θ2, . . .)
▶ Grid-search to locate the bulk of the probability mass

[Figure: grid exploration in the (θ1, θ2) plane along the principal-component axes z1, z2.]

All points found have equal area weight ∆k.
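A generic sketch of this exploration scheme, using a made-up 2-dimensional log-posterior as a placeholder; the threshold and grid spacing are arbitrary illustrations, not INLA's actual defaults.

nlpost <- function(theta) 0.5 * sum((theta - c(1, 2))^2 / c(0.5, 2))  # -log pi(theta | y), placeholder

# 1. Locate the mode (by minimising the negative log-posterior)
opt <- optim(c(0, 0), nlpost, hessian = TRUE)

# 2. Eigen-decompose the Hessian at the mode to get principal components z
eg <- eigen(opt$hessian)
theta_of_z <- function(z) as.numeric(opt$par + eg$vectors %*% (z / sqrt(eg$values)))

# 3. Grid-search in the z-parametrisation, keeping points whose log-density
#    is not too far below the mode; all retained points get equal area weight
z_grid <- as.matrix(expand.grid(z1 = seq(-3, 3, by = 1), z2 = seq(-3, 3, by = 1)))
keep   <- apply(z_grid, 1, function(z) nlpost(theta_of_z(z)) < nlpost(opt$par) + 2.5)
t_k    <- t(apply(z_grid[keep, ], 1, theta_of_z))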


Alternatives for moderate number of hyperparameters

Integrating out the hyperparameters for moderate m (6 to 12) is expensive, as the number of evaluation points is exponential in m.

Alternatives:
▶ Extreme: use just the modal configuration (empirical Bayes)
▶ Use a central composite design (CCD)

[Figure: CCD layout for m = 2, showing the design points and the circle points.]
2. Non-Gaussian observations

In applications we may choose likelihoods other than a Gaussian. How does this change things?

    π(θ | y) ∝ π(x, y | θ) π(θ) / π(x | y, θ)

where the numerator π(x, y | θ) is non-Gaussian BUT KNOWN, and the denominator π(x | y, θ) is non-Gaussian and UNKNOWN.

▶ In many cases π(x | y, θ) is very close to a Gaussian distribution, and can be replaced with a Laplace approximation.
The GMRF (Laplace) approximation
Let x denote a GMRF with precision matrix Q and mean µ. Approximate

    π(x | θ, y) ∝ exp( −½ x^T Q x + Σ_{i=1}^{n} log π(yi | xi) )

by using a second-order Taylor expansion of log π(yi | xi) around µ0, say.

Recall

    f(x) ≈ f(x0) + f′(x0)(x − x0) + ½ f″(x0)(x − x0)² = a + bx − ½ c x²

with b = f′(x0) − f″(x0) x0 and c = −f″(x0). (Note: a is not relevant.)
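For concreteness, a tiny sketch of these coefficients for a single Poisson observation with log-link, where (up to a constant) log π(y | x) = yx − exp(x); the function name and example values are made up.

taylor_bc <- function(y, x0) {
  d1 <- y - exp(x0)              # f'(x0)
  d2 <- -exp(x0)                 # f''(x0)
  c(b = d1 - d2 * x0, c = -d2)   # as in the formulas above
}
taylor_bc(y = 3, x0 = 1)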
The GMRF approximation (II)
Thus,

    π̃(x | θ, y) ∝ exp( −½ x^T Q x + Σ_{i=1}^{n} (ai + bi xi − ½ ci xi²) )
                ∝ exp( −½ x^T (Q + diag(c)) x + b^T x )

to get a Gaussian approximation with precision matrix Q + diag(c) and mean given by the solution of (Q + diag(c)) µ = b. The canonical parameterisation is

    N_C(b, Q + diag(c))

which corresponds to

    N((Q + diag(c))⁻¹ b, (Q + diag(c))⁻¹).
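In practice the expansion point is then updated and the expansion repeated. A minimal sketch of such an iteration for a Poisson likelihood with log-link, where Q is the prior precision of x as a dense matrix and y the counts; the fixed number of iterations and the zero starting point are simplifying assumptions, a real implementation would iterate to convergence.

gmrf_approx <- function(Q, y, niter = 20) {
  n  <- nrow(Q)
  mu <- rep(0, n)                           # expansion point mu_0
  for (k in 1:niter) {
    bvec <- (y - exp(mu)) + exp(mu) * mu    # b_i = f'(mu_i) - f''(mu_i) * mu_i
    cvec <- exp(mu)                         # c_i = -f''(mu_i)
    Qc   <- Q + diag(cvec, n)
    mu   <- as.numeric(solve(Qc, bvec))     # new mean: (Q + diag(c)) mu = b
  }
  list(mean = mu, precision = Qc)
}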


Illustration

[Figure, four panels: second-order expansions of a full conditional around 0, 0.5, 1 and 1.5; each panel shows the full conditional, its mode, the normal approximation and the expansion point.]

If y | x, θ is Gaussian, "the approximation" is exact!


What do we get ...

    π̃(θ | y) ∝ [ π(x, y | θ) π(θ) / π̃(x | y, θ) ]  evaluated at x = x*(θ)

▶ find the mode of π̃(θ | y) (optimisation)
▶ explore π̃(θ | y) to find grid points tk for numerical integration.

However, why is it called the integrated nested Laplace approximation? There is another step that changes:

    π(xi | y) ≈ Σ_k π(xi | y, tk) π̃(tk | y) ∆k

where π(xi | y, tk) is not Gaussian!
Approximating π(xi |y , θ)

Three possible approximations:

1. Gaussian distribution derived from π̃_G(x | θ, y), i.e.

    π̃(xi | θ, y) = N(xi; µi(θ), σi²(θ))

   with mean µi(θ) and marginal variance σi²(θ). However, errors in location and/or lack of skewness are possible.

2. Laplace approximation

3. Simplified Laplace approximation

Laplace approximation of π(xi | θ, y)

    π̃_LA(xi | θ, y) ∝ [ π(x, θ, y) / π̃_GG(x−i | xi, θ, y) ]  evaluated at x−i = x*−i(xi, θ)

The approximation is very good but expensive, as n factorizations of (n − 1) × (n − 1) matrices are required to get the n marginals.

Computational modifications exist:

1. Approximate the modal configuration of the GMRF approximation.
2. Reduce the size n by only involving the "neighbors".
Simplified Laplace approximation

Faster alternative to the Laplace approximation

▶ based on a series expansion up to third order of the numerator and denominator of π̃_LA(xi | θ, y)
▶ corrects the Gaussian approximation for error in location and lack of skewness.

This is the default option when using INLA, but this choice can be modified.
The integrated nested Laplace approximation (INLA)

Step I: Approximate π(θ | y) using the Laplace approximation and select good evaluation points tk.

Step II: For each tk and each i, approximate π(xi | y, tk) using the Laplace or simplified Laplace approximation for selected values of xi.

Step III: For each i, sum out tk:

    π̃(xi | y) = Σ_k π̃(xi | tk, y) × π̃(tk | y) × ∆k.
How can we assess the error in the approximations?

Tool 1: Compare a sequence of improved approximations


1. Gaussian approximation
2. Simplified Laplace
3. Laplace
No big differences → good approximation.
How can we assess the error in the approximations?

Tool 2: Estimate the "effective" number of parameters and compare this with the number of observations.

Experience has shown that n = 2 is usually very good.


Limitations

▶ The dimension of the latent field x can be large (10²–10⁶)
▶ But the dimension of the hyperparameters θ must be small (≤ 15)

In other words, each random effect can be big, but there cannot be too many random effects unless they share parameters.
How to represent the posterior?
The posterior

    π(θ, x | y = data)

is a massive object!

Discuss:
▶ Samples from π(θ, x | y = data)
▶ Marginal, marginal-joint, and conditional distributions
▶ Samples from θ and the Gaussian π(x | θ = sample, y = data)

demo("Epil")
result = inla(formula, family = "poisson", data = Epil,
              control.compute = list(config = TRUE))
str(result$misc$configs)
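One concrete option from the list above is drawing joint samples from the fitted model: R-INLA provides inla.posterior.sample() for results fitted with config = TRUE (continuing the Epil example; the number of samples is arbitrary).

samples <- inla.posterior.sample(n = 100, result)
str(samples[[1]], max.level = 1)   # one draw: hyperparameters, latent field, log-density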
Optimisation and exploring hyper-space

Make sure the parametrisation is good: the posterior of θ is asymptotically Gaussian, but can we make it more Gaussian? More unimodal?

Optimise through gradient descent.
Explore through the Hessian and a grid or CCD.

All of these are choices you can make. Inside INLA you can trade precision against speed: all the way from high-dimensional space-time models down to 100 data points and a simple model, you can choose an appropriate tradeoff.
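These tradeoffs are exposed through control.inla. A sketch, reusing the toy-example objects (formula, y, idx, tau_0) from the R-code slide earlier; the option values shown are some of the available choices.

result <- inla(formula,
               data = data.frame(y = y, idx = idx),
               control.family = list(initial = log(tau_0), fixed = TRUE),
               control.inla = list(strategy     = "simplified.laplace",  # or "gaussian", "laplace"
                                   int.strategy = "ccd"))                # or "grid", "eb"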
Waypoint

So far: Many details about INLA
Later: Go over this, and read the paper
Now: A few additional notes
Internal representation of hyper-parameters
Look through the logfile.

Practical: Use the logfile I have.


Summary
Thank you for your attention.

Any questions?
