
Automatic Reparameterisation of Probabilistic Programs

Maria I. Gorinova * 1 Dave Moore 2 Matthew D. Hoffman 2

Abstract

Probabilistic programming has emerged as a powerful paradigm in statistics, applied science, and machine learning: by decoupling modelling from inference, it promises to allow modellers to directly reason about the processes generating data. However, the performance of inference algorithms can be dramatically affected by the parameterisation used to express a model, requiring users to transform their programs in non-intuitive ways. We argue for automating these transformations, and demonstrate that mechanisms available in recent modelling frameworks can implement non-centring and related reparameterisations. This enables new inference algorithms, and we propose two: a simple approach using interleaved sampling and a novel variational formulation that searches over a continuous space of parameterisations. We show that these approaches enable robust inference across a range of models, and can yield more efficient samplers than the best fixed parameterisation.

1. Introduction

Reparameterising a probabilistic model means expressing it in terms of new variables defined by a bijective transformation of the original variables of interest. The reparameterised model expresses the same statistical assumptions as the original, but can have drastically different posterior geometry, with significant implications for both variational and sampling-based inference algorithms.

Non-centring is a particularly common form of reparameterisation in Bayesian hierarchical models. Consider a random variable z ∼ N(µ, σ); we say this is in centred parameterisation (CP). If we instead work with an auxiliary, standard normal variable z̃ ∼ N(0, 1), and obtain z by applying the transformation z = µ + σz̃, we say the variable z̃ is in its non-centred parameterisation (NCP). Although the centred parameterisation is often more intuitive, non-centring can dramatically improve the performance of inference (Betancourt & Girolami, 2015). Neal's funnel (Figure 1a) provides a simple example: most Markov chain Monte Carlo (MCMC) algorithms have trouble sampling from the funnel due to the strong non-linear dependence between latent variables. Non-centring the model removes this dependence, converting the funnel into a spherical Gaussian distribution.

Bayesian practitioners are often advised to manually non-centre their models (Stan Development Team et al., 2016); however, this breaks the separation between modelling and inference and requires expressing the model in a potentially less intuitive form. Moreover, it requires the user to understand the concept of non-centring and to know a priori where in the model it might be appropriate. Because the best parameterisation for a given model may vary across datasets, even experts may need to find the optimal parameterisation by trial and error, burdening modellers and slowing down the model development loop (Blei, 2014).

We propose that non-centring and similar reparameterisations be handled automatically by probabilistic programming systems. We demonstrate how such program transformations may be implemented using the effect handling mechanisms present in several modern deep probabilistic programming frameworks, and consider two inference algorithms enabled by automatic reparameterisation: interleaved Hamiltonian Monte Carlo (iHMC), which alternates HMC steps between centred and non-centred parameterisations, and a novel algorithm we call Variationally Inferred Parameterisation (VIP), which searches over a continuous space of reparameterisations that includes non-centring as a special case.1 We compare these strategies to fixed centred and non-centred parameterisations across a range of well-known hierarchical models. Our results suggest that both VIP and iHMC can enable more automated, robust inference, often performing at least as well as the best fixed parameterisation and sometimes better, without requiring a priori knowledge of the optimal parameterisation. Both strategies have the potential to free modellers from thinking about manual reparameterisation, accelerate the modelling cycle, and improve the robustness of inference in next-generation modelling frameworks.

* Work done while interning at Google. 1 University of Edinburgh, Edinburgh, UK. 2 Google, San Francisco, CA, USA. Correspondence to: Maria I. Gorinova <[email protected]>, Dave Moore <[email protected]>, Matthew D. Hoffman <[email protected]>.

Proceedings of the 37th International Conference on Machine Learning, Vienna, Austria, PMLR 119, 2020. Copyright 2020 by the author(s).

1 Code for these algorithms and experiments is available at https://github.com/mgorinova/autoreparam.

Figure 1. Neal's funnel (Neal, 2003): z ∼ N(0, 3); x ∼ N(0, e^(z/2)).
(a) Centred (left) and non-centred (right) parameterisation.
(b) Model that generates variables z and x:
    NealsFunnel(z, x):
        z ∼ N(0, 3)
        x ∼ N(0, exp(z/2))
(c) The model in the context of log_prob_at_0:
    z = 0; lpz = log pN(z | 0, 3)
    x = 0; lpx = log pN(x | 0, exp(z/2))
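To make the two parameterisations concrete, the following standalone sketch (our own, using NumPy; it is not part of the paper's code release, and the variable names are ours) draws from Neal's funnel both ways and checks that they agree in distribution. The non-centred program only ever samples standard normal latents and recovers (z, x) deterministically.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 200_000

    # Centred parameterisation: sample z and x directly.
    z_cp = rng.normal(0.0, 3.0, size=n)
    x_cp = rng.normal(0.0, np.exp(z_cp / 2.0))

    # Non-centred parameterisation: sample standard normal auxiliaries,
    # then transform: z = 0 + 3 * z_tilde,  x = 0 + exp(z/2) * x_tilde.
    z_tilde = rng.normal(0.0, 1.0, size=n)
    x_tilde = rng.normal(0.0, 1.0, size=n)
    z_ncp = 3.0 * z_tilde
    x_ncp = np.exp(z_ncp / 2.0) * x_tilde

    # Both programs define the same joint distribution over (z, x) ...
    print(np.std(z_cp), np.std(z_ncp))              # both close to 3
    print(np.quantile(x_cp, [0.25, 0.5, 0.75]))
    print(np.quantile(x_ncp, [0.25, 0.5, 0.75]))    # close to the line above
    # ... but the non-centred latents (z_tilde, x_tilde) form a spherical Gaussian,
    # which is far easier for MCMC to explore than the funnel in (z, x).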

2. Related Work

The value of non-centring is well-known to MCMC practitioners and researchers (Stan Development Team et al., 2016; Betancourt & Girolami, 2015), and it can also lead to better variational fits in hierarchical models (Yao et al., 2018). However, the literature largely treats this as a modelling choice; Yao et al. (2018) propose that "there is no general rule to determine whether non-centred parameterisation is better than the centred one." We are not aware of prior work that treats non-centring directly as a computational phenomenon to be exploited by inference systems.

Non-centred parameterisation of probabilistic models can be seen as analogous to the reparameterisation trick in stochastic optimisation (Kingma & Welling, 2013); both involve expressing a variable in terms of a diffeomorphic transformation from a "standardised" variable. In the context of probabilistic inference, these are complementary tools: the reparameterisation trick yields low-variance stochastic gradients of variational objectives, whereas non-centring changes the geometry of the posterior itself, leading to qualitatively different variational fits and MCMC trajectories.

In the context of Gibbs sampling, Papaspiliopoulos et al. (2007) introduce a family of partially non-centred parameterisations similar to those we use in VIP (described below) and show that it improves mixing in a spatial GLMM. Our current work can be viewed as a general-purpose extension of this work that mechanically reparameterises user-provided models and automates the choice of parameterisation. Similarly, Yu & Meng (2011) proposed a Gibbs sampling scheme that interleaves steps in centred and non-centred parameterisations; our interleaved HMC algorithm can be viewed as an automated, gradient-based descendent of their scheme.

Recently, there has been work on accelerating MCMC inference through learned reparameterisation: Parno & Marzouk (2018) and Hoffman et al. (2019) run samplers in the image of a bijective map fitted to transform the target distribution approximately to an isotropic Gaussian. These may be viewed as 'black-box' methods that rely on learning the target geometry, potentially using highly expressive neural variational models, while we use probabilistic-program transformations to apply 'white-box' reparameterisations similar to those a modeller could in principle implement themselves. Because they exploit model structure, white-box approaches can correct pathologies such as those of Neal's funnel (Figure 1a) directly, reliably, and at much lower cost (in parameters and inference overhead) than black-box models. White- and black-box reparameterisations are not mutually exclusive, and may have complementary advantages; combining them is a likely fruitful direction for improving inference in structured models.

Previous work in probabilistic programming has explored other 'white-box' approaches to perform or optimise inference. For example, Hakaru (Narayanan et al., 2016; Zinkov & Shan, 2017) and PSI (Gehr et al., 2016; 2020) use program transformations to perform symbolic inference, while Gen (Cusumano-Towner et al., 2019) and SlicStan (Gorinova et al., 2019) can statically analyse the model structure to compile to a more efficient inference strategy. To the best of our knowledge, the approach presented in this paper is the first to apply variational inference as a dynamic pre-processing step, which optimises the program based on both the program structure and observed data.

3. Understanding the Effect of Reparameterisation

Non-centring reparameterisation is not always optimal; its usefulness depends on properties of both the model and the observed data. In this section, we develop intuition by working with a simple hierarchical model for which we can derive the posterior analytically.

Consider a simple realisation of a model discussed by Betancourt & Girolami (2015, (2)), where for a vector of N datapoints y, and some given constants σ and σµ, we have:

    θ ∼ N(0, 1)    µ ∼ N(θ, σµ)
    yn ∼ N(µ, σ) for all n ∈ 1 . . . N

In the non-centred model, y is defined in terms of µ̃ and θ, where µ̃ is a standard Gaussian variable:

    θ ∼ N(0, 1)    µ̃ ∼ N(0, 1)
    yn ∼ N(θ + σµ µ̃, σ) for all n ∈ 1 . . . N

Figure 2a and Figure 2b show the graphical models for the two parameterisations. In the non-centred case, the direct dependency between θ and µ is substituted by a conditional dependency given the data y, which creates an "explaining away" effect. Intuitively, this means that the stronger the evidence y is (large N, and small variance), the stronger the dependency between θ and µ̃ becomes, creating a poorly-conditioned posterior that may slow inference.

As the Gaussian distribution is self-conjugate, the posterior in each case (centred or non-centred) is also a Gaussian distribution, and we can analytically inspect its covariance matrix V. To quantify the quality of the parameterisation in each case, we investigate the condition number κ of the posterior covariance matrix under the optimal diagonal preconditioner. This models the common practice (implemented in tools such as PyMC3 and Stan, and followed in our experiments) of sampling using a fitted diagonal preconditioner.

Figure 2c shows the condition numbers κCP and κNCP for each parameterisation as a function of q = N/σ²; the full derivation is in Appendix A. This figure confirms the intuition that the non-centred parameterisation is better suited to situations where the evidence is weak, while strong evidence calls for the centred parameterisation. In this example we can exactly determine the optimal parameterisation, since the model has only one variable that can be reparameterised and the posterior has a closed form. In more realistic settings, even experts cannot predict the optimal parameterisation for hierarchical models with many variables and groups of data, and the wrong choice can lead to poor conditioning, heavy tails or other pathological geometry.

4. Reparameterising Probabilistic Programs

An advantage of probabilistic programming is that the program itself provides a structured model representation, and we can explore model reparameterisation through the lens of program transformations. In this paper, we focus on transforming generative probabilistic programs, where the program represents a sampling process describing how the data was generated from some unknown latent variables. Most probabilistic programming languages (PPLs) provide some mechanism for transforming a generative process into an inference program; our automatic reparameterisation approach is applicable to PPLs that transform generative programs using effect handling. This includes modern deep PPLs such as Pyro (Uber AI Labs, 2017) and Edward2 (Tran et al., 2018).

4.1. Effect Handling-based Probabilistic Programming

Consider a generative program, where running the program forward generates samples from the prior over latent variables and data. Effect handling-based PPLs treat generating a random variable within such a model as an effectful operation (an operation that is understood as having side effects) and provide ways of resolving this operation in the form of effect handlers, to allow for inference. For example, we often need to transform a statement that generates a random variable into a statement that evaluates some (log) density or mass function. We can implement this using an effect handler:

    log_prob_at_0 =
        handler { v ∼ D(a1, . . . , aN) ↦
                  v = 0; lpv = log pD(v | a1, . . . , aN) }2

The handler log_prob_at_0 handles statements of the form v ∼ D(a1, . . . , aN). Such statements normally mean "sample a random variable from the distribution D(a1, . . . , aN) and record its value in v". However, when executed in the context of log_prob_at_0 (we write with log_prob_at_0 handle model), statements that contain random-variable constructions are handled by setting the value of the variable v to 0, then evaluating the log density (or mass) function of D(a1, . . . , aN) at v = 0 and recording its value in a new (program) variable lpv.

For example, consider the function implementing Neal's funnel in Figure 1b. When executed without any context, this function generates two random variables, z and x. When executed in the context of the log_prob_at_0 handler, it does not generate random variables; instead, it evaluates log pN(z | 0, 3) and log pN(x | 0, exp(z/2)) (Figure 1c).

This approach can be extended to produce a function that corresponds to the log joint density (or mass) function of the latent variables of the model. In §§ B.1, we give the pseudo-code implementation of a function make_log_joint, which takes a model M(z | x) (one that generates latent variables z and generates and observes data x) and returns the function f(z) = log p(z, x). This is a core operation, as it transforms a generative model into a function proportional to the posterior distribution, which can be repeatedly evaluated and automatically differentiated to perform inference.

2 Algebraic effects and handlers typically involve passing a continuation within the handler. We make the continuation implicit to stay close to Edward2's implementation.
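The following self-contained Python sketch illustrates the idea just described. The sample primitive, the handle context manager, and make_log_joint below are our own minimal stand-ins for illustration, not Edward2's or Pyro's actual APIs, and (unlike log_prob_at_0) the handler substitutes caller-supplied values rather than fixing them to 0.

    import math
    import random
    from contextlib import contextmanager

    _HANDLER = None  # the currently installed effect handler, if any

    def sample(name, loc, scale):
        """Effectful operation: 'name ~ N(loc, scale)', unless a handler intercepts it."""
        if _HANDLER is not None:
            return _HANDLER(name, loc, scale)
        return random.gauss(loc, scale)

    @contextmanager
    def handle(handler):
        """'with handle(h): model()' runs the model with h resolving sample statements."""
        global _HANDLER
        previous, _HANDLER = _HANDLER, handler
        try:
            yield
        finally:
            _HANDLER = previous

    def normal_logpdf(v, loc, scale):
        return -0.5 * ((v - loc) / scale) ** 2 - math.log(scale) - 0.5 * math.log(2 * math.pi)

    def neals_funnel():
        z = sample("z", 0.0, 3.0)
        x = sample("x", 0.0, math.exp(z / 2.0))
        return z, x

    def make_log_joint(model):
        """Turn a generative model into a log joint density over its named variables."""
        def log_joint(**values):
            total = 0.0
            def log_prob_handler(name, loc, scale):
                nonlocal total
                value = values[name]      # substitute the requested value instead of sampling
                total += normal_logpdf(value, loc, scale)
                return value
            with handle(log_prob_handler):
                model()
            return total
        return log_joint

    print(neals_funnel())                  # forward execution: draws a prior sample
    log_joint = make_log_joint(neals_funnel)
    print(log_joint(z=0.0, x=0.0))         # log N(0 | 0, 3) + log N(0 | 0, exp(0))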

Figure 2. Effects of reparameterising a simple model with known posterior.
(a) Centred graphical model: θ → µ → yn, n = 1, . . . , N.
(b) Non-centred graphical model: θ → yn ← µ̃, n = 1, . . . , N.
(c) The condition number as a function of the data's strength.

More generally, effectful operations are operations that can have side effects, e.g. writing to a file. The programming languages literature formalises cases where impure behaviour arises from a set of effectful operations in terms of algebraic effects and their handlers (Plotkin & Power, 2001; Plotkin & Pretnar, 2009; Pretnar, 2015). A concrete implementation for an effectful operation is given in the form of effect handlers, which (similarly to exception handlers) are responsible for resolving the operation. Effect handlers can be used as a powerful abstraction in probabilistic programming, and have been incorporated into recent frameworks such as Pyro and Edward2 (Moore & Gorinova, 2018).

4.2. Model Reparameterisation Using Effect Handlers

Once equipped with an effect handling-based PPL, we can easily construct handlers to perform many model transformations, including model reparameterisation.

Non-centring Handler.

    ncp = handler { v ∼ N(µ, σ), v ∉ data ↦
                    ṽ ∼ N(0, 1); v = µ + σṽ }

A non-centring handler can be used to non-centre all standardisable3 latent variables in a model. The handler simply applies to statements of the form v ∼ N(µ, σ), where v is not a data variable, and transforms them to ṽ ∼ N(0, 1), v = µ + σṽ. When nested within a log_prob handler (like the one from §§ 4.1), log_prob handles the transformed standard normal statement ṽ ∼ N(0, 1). Thus, make_log_joint applied to a model in the ncp context returns the log joint function of the transformed variables z̃ rather than the original variables z.

For example, make_log_joint(NealsFunnel(z, x)) gives:

    log p(z, x) = log N(z | 0, 3) + log N(x | 0, exp(z/2))

make_log_joint(with ncp handle NealsFunnel(z, x)) corresponds to the function:

    log p(z̃, x̃) = log N(z̃ | 0, 1) + log N(x̃ | 0, 1)
    where z = 3z̃ and x = exp(z/2)x̃.

This approach can easily be extended to other parameterisations, including partially centred parameterisations (as shown later in §§ 5.2), non-centring and whitening multivariate Gaussians, and transforming constrained variables to have unbounded support.

Edward2 Implementation. We implement reparameterisation handlers in Edward2, a deep PPL embedded in Python and TensorFlow (Tran et al., 2018). A model in Edward2 is a Python function that generates random variables. At the core of Edward2 is a special case of effect handling called interception. To obtain the joint density of a model, the language provides the function make_log_joint_fn(model)4, which uses a log_prob interceptor (handler) as previously described.

We extend the usage of interception to treat sample statements in one parameterisation as sample statements in another parameterisation (similarly to the ncp handler above):

    def noncentre(rv_fn, **d):
        # Assumes a location-scale family.
        rv_fn = ed.interceptable(rv_fn)5
        rv_std = rv_fn(loc=0, scale=1)
        return d["loc"] + d["scale"] * rv_std

3 We focus on Gaussian variables, but non-centring is broadly applicable, e.g. to the location-scale family and random variables that can be expressed as a bijective transformation z = fθ(z̃) of a "standardised" variable z̃.
4 Corresponds to make_log_joint(model) in our example.
5 Wrapping the constructor with ed.interceptable ensures that we can nest this interceptor in the context of other interceptors.
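The two log joints for Neal's funnel given above describe the same model on different coordinates; concretely, they differ by the log-Jacobian of the transformation z = 3z̃, x = exp(z/2)x̃. The short script below (our own numerical check, not from the paper's code release) verifies this relationship.

    import numpy as np
    from scipy.stats import norm

    def log_joint_cp(z, x):        # centred: log N(z | 0, 3) + log N(x | 0, exp(z/2))
        return norm.logpdf(z, 0.0, 3.0) + norm.logpdf(x, 0.0, np.exp(z / 2.0))

    def log_joint_ncp(z_t, x_t):   # non-centred: log N(z_t | 0, 1) + log N(x_t | 0, 1)
        return norm.logpdf(z_t, 0.0, 1.0) + norm.logpdf(x_t, 0.0, 1.0)

    for z_t, x_t in [(0.3, -1.2), (2.0, 0.5), (-1.5, 0.1)]:
        z = 3.0 * z_t
        x = np.exp(z / 2.0) * x_t
        log_jacobian = np.log(3.0) + z / 2.0   # |det d(z, x)/d(z_t, x_t)| = 3 exp(z/2)
        assert np.isclose(log_joint_cp(z, x) + log_jacobian, log_joint_ncp(z_t, x_t))

    print("the two parameterisations agree up to the change-of-variables Jacobian")

An MCMC sampler run on (z̃, x̃) therefore targets a density with spherical geometry, while the pushed-forward samples (z, x) remain distributed according to the original model.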

We use the interceptor by executing a model of interest within the interceptor's context (using Python's context managers). This overrides each random variable's constructor to construct a variable with location 0 and scale 1, and to scale and shift that variable appropriately:

    with ed.interception(noncentre):
        neals_funnel()

We present and explain in more detail all interceptors used for this work in Appendix B.

5. Automatic Model Reparameterisation

We introduce two inference strategies that exploit automatic reparameterisation: interleaved Hamiltonian Monte Carlo (iHMC), and the Variationally Inferred Parameterisation (VIP).

5.1. Interleaved Hamiltonian Monte Carlo

Automatic reparameterisation opens up the possibility of algorithms that exploit multiple parameterisations of a single model. We consider interleaved Hamiltonian Monte Carlo (iHMC), which uses two HMC steps to produce each sample from the target distribution: the first step is made in CP, using the original model latent variables, while the second step is made in NCP, using the auxiliary standardised variables. Interleaving MCMC kernels across parameterisations has been explored in previous work on Gibbs sampling (Yu & Meng, 2011; Kastner & Frühwirth-Schnatter, 2014), which demonstrated that CP and NCP steps can be combined to achieve more robust and performant samplers. Our contribution is to make the interleaving automatic and model-agnostic: instead of requiring the user to write multiple versions of their model and a custom inference algorithm, we implement iHMC as a black-box inference algorithm for centred Edward2 models.

Algorithm 1 outlines iHMC. It takes a single centred model Mcp(z | x) that defines latent variables z and generates data x. It uses the function make_ncp to automatically obtain a non-centred version of the model, Mncp(z̃ | x), which defines auxiliary variables z̃ and a function f, such that z = f(z̃).

5.2. Variationally Inferred Parameterisation

The best parameterisation for a given model may mix centred and non-centred representations for different variables. To efficiently search the space of reparameterisations, we propose the variationally inferred parameterisation (VIP) algorithm, which selects a parameterisation by gradient-based optimisation of a differentiable variational objective. VIP can be used as a pre-processing step for another inference algorithm; as it only changes the parameterisation of the model, MCMC methods applied to the learned parameterisation maintain their asymptotic guarantees.

Consider a model with latent variables z. We introduce parameterisation parameters λ = (λi) ∈ [0, 1] for each variable zi, and transform zi ∼ N(zi | µi, σi) by defining z̃i ∼ N(λi µi, σi^λi) and zi = µi + σi^(1−λi)(z̃i − λi µi). This defines a continuous relaxation that includes NCP as the special case λ = 0 and CP as λ = 1. More generally, it supports a combinatorially large class of per-variable and partial centrings.

Example. Recall the example model from Section 3, which defines the joint density p(θ, µ, y) = N(θ | 0, 1) × N(µ | θ, σµ) × N(y | µ, σ). Using the parameterisation above to reparameterise µ, we get:

    p(θ, µ̂, y) = N(θ | 0, 1) × N(µ̂ | λθ, σµ^λ)
                × N(y | θ + σµ^(1−λ)(µ̂ − λθ), σ)

Similarly to before, we analytically derive an expression for the posterior under different values of λ. Figure 4 shows the condition number κ(λ) of the diagonally preconditioned posterior, for different values of q = N/σ² with fixed prior scale σµ = 1. As expected, when the data is weak (q = 0.01), setting the parameterisation parameter λ close to 0 (NCP) results in a better conditioned posterior than setting it close to 1 (CP), and conversely for strong data (q = 100). More interestingly, in intermediate cases (q = 1) the optimal value for λ is truly between 0 and 1, yielding a modest but real improvement over the extreme points.

Optimisation. For a general model with latent variables z and data x, we aim to choose the parameterisation λ under which the posterior p(z̃ | x; λ) is "most like" an independent normal distribution. A natural objective to minimise is KL(q(z̃; θ) || p(z̃ | x; λ)), where q(z̃; θ) = N(z̃ | µ, diag(σ)) is an independent normal model with variational parameters θ = (µ, σ). Minimising this divergence corresponds to maximising a variational lower bound, the ELBO (Bishop, 2006):

    L(θ, λ) = E_q(z̃;θ) [log p(x, z̃; λ) − log q(z̃; θ)]

Note that the auxiliary parameters λ are not statistically identifiable: the marginal likelihood log p(x; λ) = log p(x) is constant with respect to λ. However, the computational properties of the reparameterised models differ, and the variational bound will prefer models for which the posterior is close in KL to a diagonal normal. Our key hypothesis (which the results in Figure 6 seem to support) is that diagonal-normal approximability is a good proxy for MCMC sampling efficiency.
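As a concrete, runnable counterpart to the Example above (our own script, not the paper's code; the number of datapoints, the λ grid, and the helper names are arbitrary choices), the sketch below builds the partially centred joint density, recovers the Gaussian posterior covariance from its Hessian, rescales it to unit diagonal as a stand-in for a fitted diagonal preconditioner, and reports the best λ on a coarse grid for weak, moderate, and strong data. The endpoints λ = 0 and λ = 1 correspond to the κNCP and κCP curves of Figure 2c.

    import numpy as np

    def neg_log_joint(latents, lam, y, sigma, sigma_mu):
        """Negative log of p(theta, mu_hat, y) from Section 5.2, up to additive constants."""
        theta, mu_hat = latents
        mu = theta + sigma_mu ** (1.0 - lam) * (mu_hat - lam * theta)
        lp = -0.5 * theta ** 2                                        # theta ~ N(0, 1)
        lp += -0.5 * ((mu_hat - lam * theta) / sigma_mu ** lam) ** 2  # mu_hat ~ N(lam*theta, sigma_mu^lam)
        lp += -0.5 * np.sum(((y - mu) / sigma) ** 2)                  # y_n ~ N(mu, sigma)
        return -lp

    def hessian(f, x, eps=1e-4):
        """Finite-difference Hessian; effectively exact here because f is quadratic."""
        n = len(x)
        H = np.zeros((n, n))
        for i in range(n):
            for j in range(n):
                ei, ej = eps * np.eye(n)[i], eps * np.eye(n)[j]
                H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                           - f(x - ei + ej) + f(x - ei - ej)) / (4.0 * eps ** 2)
        return H

    def preconditioned_condition_number(lam, q, sigma_mu=1.0, n_data=10):
        sigma = np.sqrt(n_data / q)              # q = N / sigma^2
        y = np.zeros(n_data)                     # the data values do not affect the curvature
        f = lambda latents: neg_log_joint(latents, lam, y, sigma, sigma_mu)
        cov = np.linalg.inv(hessian(f, np.zeros(2)))    # Gaussian posterior covariance
        d = np.sqrt(np.diag(cov))
        return np.linalg.cond(cov / np.outer(d, d))     # condition number after diagonal rescaling

    lambdas = np.linspace(0.0, 1.0, 11)
    for q in [0.01, 1.0, 100.0]:
        kappas = [preconditioned_condition_number(lam, q) for lam in lambdas]
        print(f"q = {q:>6}: best lambda on this grid = {lambdas[int(np.argmin(kappas))]:.1f}")
    # Expected pattern: weak data favours lambda near 0 (NCP), strong data favours
    # lambda near 1 (CP), and q = 1 is best served by an intermediate value.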

Algorithm 1: Interleaved Hamiltonian Monte Carlo
  Arguments: data x; a centred model Mcp(z | x)
  Returns: S samples z(1), . . . , z(S) from p(z | x)
  1: Mncp(z̃ | x), f = make_ncp(Mcp(z | x))
  2: log pcp = make_log_joint(Mcp(z | x))
  3: log pncp = make_log_joint(Mncp(z̃ | x))
  4: z(0) = init()
  5: for s ∈ [1, . . . , S] do
  6:     z′ = hmc_step(log pcp, z(s−1))
  7:     z′′ = hmc_step(log pncp, f⁻¹(z′))
  8:     z(s) = f(z′′)
  9: return z(1), . . . , z(S)

Algorithm 2: Variationally Inferred Parameterisation
  Arguments: data x; a centred model Mcp(z | x)
  Returns: S samples z(1), . . . , z(S) from p(z | x)
  1: Mvip(z̃ | x; λ), f = make_vip(Mcp(z | x))
  2: log p(x, z̃) = make_log_joint(Mvip(z̃ | x; λ))
  3: q(z̃; θ) = make_variational(Mvip(z̃ | x; λ))
  4: log q(z̃; θ) = make_log_joint(q(z̃; θ))
  5: L(θ, λ) = Eq(log p(x, z̃; λ)) − Eq(log q(z̃; θ))
  6: θ*, λ* = argmax L(θ, λ)
  7: log p(x, z̃) = make_log_joint(Mvip(z̃ | x; λ*))
  8: z(1), . . . , z(S) = hmc(log p)
  9: return f(z(1)), . . . , f(z(S))
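To show the interleaving pattern of Algorithm 1 end to end, here is a small self-contained sketch on Neal's funnel (ours, not the authors' implementation). Random-walk Metropolis steps stand in for hmc_step so the example stays short; the point being illustrated is the alternation between the centred and non-centred log joints via f and f⁻¹.

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)

    def log_p_cp(state):                        # centred Neal's funnel: log p(z, x)
        z, x = state
        return norm.logpdf(z, 0.0, 3.0) + norm.logpdf(x, 0.0, np.exp(z / 2.0))

    def log_p_ncp(aux):                         # non-centred: log p(z_t, x_t)
        return np.sum(norm.logpdf(aux, 0.0, 1.0))

    def f(aux):                                 # z = 3 z_t,  x = exp(z/2) x_t
        z = 3.0 * aux[0]
        return np.array([z, np.exp(z / 2.0) * aux[1]])

    def f_inv(state):
        z, x = state
        return np.array([z / 3.0, x / np.exp(z / 2.0)])

    def mh_step(log_p, state, scale=0.5):       # stand-in for hmc_step
        proposal = state + scale * rng.normal(size=state.shape)
        accept = np.log(rng.uniform()) < log_p(proposal) - log_p(state)
        return proposal if accept else state

    state, samples = np.zeros(2), []
    for _ in range(5000):
        state = mh_step(log_p_cp, state)                 # step in the centred space
        state = f(mh_step(log_p_ncp, f_inv(state)))      # step in the non-centred space
        samples.append(state)

    samples = np.array(samples)
    print("sample std of z:", samples[:, 0].std())       # should approach 3 as the chain mixes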

Figure 3. Neal's funnel: z ∼ N(0, 3); x ∼ N(0, e^(z/2)), with mean-field normal variational fit overlayed.
(a) Different parameterisations λ of the funnel, with mean-field normal variational fit q(z̃) (overlayed in white).
(b) Alternative view as implicit variational distributions qλ(z) (overlayed in white) on the original space.

Figure 4. The condition number κ(λ) for varying q = N/σ² and σµ = 1 in the simple model from Section 3.

To search for a good model reparameterisation, we optimise L(θ, λ) using stochastic gradients to simultaneously fit the variational distribution q to the posterior p and optimise the shape of that posterior. Figure 3a provides a visual example: an independent normal variational distribution is a poor fit to the pathological geometry of a centred Neal's funnel, but non-centring leads to a well-conditioned posterior, where the variational distribution is a perfect fit. In general settings where the reparameterised model is not exactly Gaussian, sampling-based inference can be used to refine the posterior; we apply VIP as a preprocessing step for HMC (summarised in Algorithm 2). Both the reparameterisation and the construction of the variational model q are implemented as automatic program transformations using Edward2's interceptors.
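The sketch below mirrors this optimisation on Neal's funnel, jointly fitting mean-field parameters and per-variable λ's by stochastic gradient ascent on a Monte Carlo ELBO estimate. It is our own illustrative code, written in JAX rather than the paper's TensorFlow/Edward2 stack; the learning rate, sample size, and step count are arbitrary. Since the funnel has no observed data, the optimiser should push both λ's towards 0, recovering the fully non-centred parameterisation shown in Figure 3.

    import jax
    import jax.numpy as jnp

    def log_normal(v, loc, scale):
        return -0.5 * ((v - loc) / scale) ** 2 - jnp.log(scale) - 0.5 * jnp.log(2.0 * jnp.pi)

    def log_joint_vip(latents, lam):
        """Partially centred Neal's funnel: lam = (lam_z, lam_x) in [0, 1]^2."""
        z_t, x_t = latents
        lam_z, lam_x = lam
        z = 3.0 ** (1.0 - lam_z) * z_t                        # recover z from its auxiliary form
        lp = log_normal(z_t, 0.0, 3.0 ** lam_z)               # z_t ~ N(0, 3^lam_z)
        lp += log_normal(x_t, 0.0, jnp.exp(lam_x * z / 2.0))  # x_t ~ N(0, exp(z/2)^lam_x)
        return lp

    def elbo(params, eps):
        scale = jnp.exp(params["log_scale"])
        lam = jax.nn.sigmoid(params["logit_lam"])
        latents = params["mean"] + scale * eps                # reparameterisation trick
        log_p = jax.vmap(lambda l: log_joint_vip(l, lam))(latents)
        log_q = jax.vmap(lambda l: jnp.sum(log_normal(l, params["mean"], scale)))(latents)
        return jnp.mean(log_p - log_q)

    params = {"mean": jnp.zeros(2), "log_scale": jnp.zeros(2), "logit_lam": jnp.zeros(2)}
    elbo_grad = jax.jit(jax.grad(elbo))
    key = jax.random.PRNGKey(0)
    for _ in range(2000):
        key, subkey = jax.random.split(key)
        eps = jax.random.normal(subkey, (64, 2))
        grads = elbo_grad(params, eps)
        params = jax.tree_util.tree_map(lambda p, g: p + 0.01 * g, params, grads)  # gradient ascent

    print("learned lambda:", jax.nn.sigmoid(params["logit_lam"]))  # expected to move towards (0, 0)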

An alternate interpretation of VIP is that it expands a variational family to a more expressive family capable of representing prior dependence. Letting z = fλ(z̃) represent the partial centring transformation, an independent normal family q(z̃) on the transformed model corresponds to an implicit posterior qλ*(z) = q(fλ⁻¹(z)) |det J_fλ⁻¹(z)| on the original model variables. Under this interpretation, λ are variational parameters that serve to add freedom to the variational family, allowing it to interpolate from independent normal (at λi = 1, Figure 3b left) to a representation that captures the exact prior dependence structure of the model (at λi = 0, Figure 3b right).

6. Experiments

We evaluate the usefulness of our approach as a robust and fully automatic alternative to manual reparameterisation. We compare our methods to HMC run on fully centred or fully non-centred models, one of which often gives catastrophically bad results. Our results show not only that VIP improves robustness by avoiding catastrophic reparameterisations, but also that it sometimes finds a parameterisation that is better than both the fully centred and fully non-centred alternatives.

6.1. Models and Datasets

We evaluate our proposed approaches by using Hamiltonian Monte Carlo to sample from the posterior of hierarchical Bayesian models on several datasets:

Eight schools (Rubin, 1981): estimating the treatment effects θi of a course taught at each of i = 1 . . . 8 schools, given test scores yi and standard errors σi:

    µ ∼ N(0, 5)    log τ ∼ N(0, 5)
    θi ∼ N(µ, τ)   yi ∼ N(θi, σi)

Radon (Gelman & Hill, 2006): hierarchical linear regression, in which the radon level ri in a home i in county c is modelled as a function of the (unobserved) county-level effect mc, the county uranium reading uc, and xi, the number of floors in the home:

    µ, a, b ∼ N(0, 1)    mc ∼ N(µ + a uc, 1)
    log ri ∼ N(mc[i] + b xi, σ)

German credit (Dua & Graff, 2017): logistic regression with a hierarchical prior on the coefficient scales:

    log τ0 ∼ N(0, 10)    log τi ∼ N(log τ0, 1)
    βi ∼ N(0, τi)        y ∼ Bernoulli(σ(βX^T))

Election '88 (Gelman & Hill, 2006): logistic model of 1988 US presidential election outcomes by county, given demographic covariates xi and state-level effects αs:

    βd ∼ N(0, 100)    µ ∼ N(0, 100)    log τ ∼ N(0, 10)
    αs ∼ N(µ, τ)      yi ∼ Bernoulli(σ(αs[i] + β^T xi))

Electric Company (Gelman & Hill, 2006): paired causal analysis of the effect of viewing an educational TV show on each of 192 classrooms over G = 4 grades. The classrooms were divided into P = 96 pairs, and one class in each pair was treated (xi = 1) at random:

    µg ∼ N(0, 1)         ap ∼ N(µg[p], 1)    bg ∼ N(0, 100)
    log σg ∼ N(0, 1)     yi ∼ N(ap[i] + bg[i] xi, σg[i])

6.2. Algorithms and Experimental Details

For each model and dataset, we compare our methods, interleaved HMC (iHMC) and VIP-HMC, with baselines of running HMC on either fully centred (CP-HMC) or fully non-centred (NCP-HMC) models. We initialise each HMC chain with samples from an independent Gaussian variational posterior, and use the posterior scales as a diagonal preconditioner; for VIP-HMC this variational optimisation also includes the parameterisation parameters λ. All variational optimisations were run for the same number of steps, so they were a fixed cost across all methods except iHMC (which depends on preconditioners for both the centred and non-centred transition kernels). The HMC step size and number of leapfrog steps were tuned following the procedures described in Appendix C, which also contains additional details of the experimental setup.

We report the average effective sample size per 1000 gradient evaluations (ESS/∇), with standard errors computed from 200 chains. We use gradient evaluations, rather than wallclock time, as they are the dominant operation in both HMC and VI and are easier to measure reliably; in practice, the wallclock times we observed per gradient evaluation did not differ significantly between methods. This is not surprising, since the (minimal) overhead of interception is incurred only once at graph-building time. This metric is a direct evaluation of the sampler; we do not count the gradient steps taken during the initial variational optimization.

In addition to effective sample size, we also directly examined the convergence of posterior moments for each method. This yielded similar qualitative conclusions to the results we report here; more analysis can be found in Appendix D.

6.3. Results

Figures 5 and 6 show the results of the experiments. In most cases, either the centred or non-centred parameterisation works well, while the other does not. An exception is the German credit dataset, where both CP-HMC and NCP-HMC give a small ESS: 1.2 ± 0.2 and 1.3 ± 0.2 ESS/∇ respectively.

iHMC. Across the datasets in both figures, we see that iHMC is a robust alternative to CP-HMC and NCP-HMC. Its performance is always within a factor of two of the best of CP-HMC and NCP-HMC, and sometimes better.

Figure 5. Effective sample size and 95% confidence intervals for the radon model across US states.

Figure 6. Effective sample size (w/ 95% intervals) and the optimised ELBO across several models.

In addition to being robust, iHMC can sometimes navigate the posterior more efficiently than either CP-HMC or NCP-HMC can: in the case of German credit, it performs better than both (3.0 ± 0.2 ESS/∇).

VIP. The performance of VIP-HMC is typically as good as the better of CP-HMC and NCP-HMC, and sometimes better. On the German credit dataset, it achieves 5.6 ± 0.6 ESS/∇, more than three times the rate of CP-HMC and NCP-HMC, and significantly better than iHMC. Figure 6 shows the correspondence between the optimised mean-field ELBO and the effective sampling rate. We see that parameterisations with higher ELBOs tend to yield better samplers, which supports the ELBO as a reasonable predictor of the conditioning of a model.

We show some of the parameterisations that VIP finds in Figure 7. VIP's behaviour appears reasonable: for most datasets we looked at, VIP finds the "correct" global parameterisation: most parameterisation parameters are set to either 0 or 1 (Figure 7, left). In the cases where a global parameterisation is not optimal (e.g. radon MO, radon PA and, most notably, German credit), VIP finds a mixed parameterisation, combining centred, non-centred, and partially centred variables (Figure 7, centre and right). These examples demonstrate the significance of the effect that automatic reparameterisation can have on the quality of inference: manually finding an adequate parameterisation in the German credit case would, at best, require an unreasonable amount of hand tuning, while VIP finds such a parameterisation automatically.

It is interesting to examine the shape of the posterior landscape under different parameterisations. Figure 8 shows typical marginals of the German credit model. In the centred case, the geometry is funnel-like both in the prior (in grey) and the posterior (in red). In the non-centred case, the prior is an independent Gaussian, but the posteriors still possess significant curvature. The partially centred parameterisations chosen by VIP appear to yield more favourable posterior geometry, where the change in curvature is smaller than that present in the CP and NCP cases.

A practical lesson from our experiments is that while the ELBO appears to correlate with sampler quality, the two are not necessarily equally sensitive. A variational model that gives zero mass to half of the posterior is only log 2 away from perfect in the ELBO, but the corresponding sampler may be quite bad. We found it helpful to estimate the ELBO with a relatively large number of Monte Carlo samples (tens to hundreds; we used 256). As with most variational methods, the VIP optimisation is nonconvex in general, and local optima are also a concern. We occasionally encountered local optima during development, though we found VIP to be generally well-behaved on models for which simpler optimisations are well-behaved. In a practical implementation, one might detect optimisation failure by comparing the VIP ELBO to those obtained from fixed parameterisations; for modest-sized models, a deep PPL can often run multiple such optimisations in parallel at minimal cost.

7. Discussion

Our results demonstrate that automated reparameterisation of probabilistic models is practical, and enables inference algorithms that can in some cases find parameterisations even better than those a human could realistically express.

These techniques allow modellers to focus on expressing statistical assumptions, leaving computation to the computer. We view the methods in this paper as exciting proofs of concept, and hope that they will inspire additional work in this space.

Like all variational methods, VIP assumes the posterior can be approximated by a particular functional form; in this case, independent Gaussians 'pulled back' through the non-centring transform. If this family of posteriors does not contain a reasonable approximation of the true posterior, then VIP will not be effective at whitening the posterior geometry. Some cases where this might happen include models where difficult geometry arises from heavy-tailed components (for example, x ∼ Cauchy(0, 1); y ∼ Cauchy(x, 1)), or where the true posterior has structured dependencies that are not well captured by partial centring (for example, many time-series models). Such cases can likely be handled by optimising over augmented families of reparameterisations, and designing such families is an interesting topic for future work.

While we focus on reparameterising hierarchical models naturally written in centred form, the inverse transformation (detecting and exploiting implicit hierarchical structure in models expressed as algebraic equations) is an important area of future work. This may be compatible with recent trends exploring the use of symbolic algebra systems in PPL runtimes (Narayanan et al., 2016; Hoffman et al., 2018). We also see promise in automating reparameterisations of heavy-tailed and multivariate distributions, and in designing new inference algorithms to exploit these capabilities.

Figure 7. A heat map of VIP parameterisations. Each square represents the parameterisation parameter λ obtained using VIP for a different latent variable in the model(s) (e.g. the top left corner of German credit corresponds to the λ for log τ1). Light regions correspond to CP and dark regions to NCP.

Figure 8. Selected prior and posterior marginals under different parameterisations of the German credit model.

Acknowledgements

We thank the anonymous reviewers for their useful comments and suggestions. Maria Gorinova was supported in part by the EPSRC Centre for Doctoral Training in Data Science, funded by the UK Engineering and Physical Sciences Research Council (grant EP/L016427/1) and the University of Edinburgh.

References

Andrieu, C. and Thoms, J. A tutorial on adaptive MCMC. Statistics and Computing, 18(4):343–373, 2008.

Betancourt, M. and Girolami, M. Hamiltonian Monte Carlo for hierarchical models. Current Trends in Bayesian Methodology with Applications, 79:30, 2015.

Bishop, C. M. Pattern Recognition and Machine Learning. Springer, 2006.

Blei, D. M. Build, compute, critique, repeat: Data analysis with latent variable models. Annual Review of Statistics and Its Application, 1:203–232, 2014.

Cusumano-Towner, M. F., Saad, F. A., Lew, A. K., and Mansinghka, V. K. Gen: A general-purpose probabilistic programming system with programmable inference. In Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2019, pp. 221–236, 2019. doi: 10.1145/3314221.3314642.

Dua, D. and Graff, C. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.

Gehr, T., Misailovic, S., and Vechev, M. PSI: Exact symbolic inference for probabilistic programs. In International Conference on Computer Aided Verification, pp. 62–83. Springer, 2016.

Gehr, T., Steffen, S., and Vechev, M. λPSI: Exact inference for higher-order probabilistic programs. In Proceedings of the 41st ACM SIGPLAN Conference on Programming Language Design and Implementation, pp. 883–897, 2020.

Gelman, A. and Hill, J. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press, 2006.

Gorinova, M. I., Gordon, A. D., and Sutton, C. Probabilistic programming with densities in SlicStan: Efficient, flexible, and deterministic. Proceedings of the ACM on Programming Languages, 3(POPL):35, 2019.

Hoffman, M., Sountsov, P., Dillon, J. V., Langmore, I., Tran, D., and Vasudevan, S. NeuTra-lizing bad geometry in Hamiltonian Monte Carlo using neural transport. arXiv preprint arXiv:1903.03704, 2019.

Hoffman, M. D., Johnson, M., and Tran, D. Autoconj: Recognizing and exploiting conjugacy without a domain-specific language. In Neural Information Processing Systems, 2018.

Kastner, G. and Frühwirth-Schnatter, S. Ancillarity-sufficiency interweaving strategy (ASIS) for boosting MCMC estimation of stochastic volatility models. Computational Statistics & Data Analysis, 76:408–423, 2014.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

Moore, D. and Gorinova, M. I. Effect handling for composable program transformations in Edward2. International Conference on Probabilistic Programming, 2018. URL https://arxiv.org/abs/1811.06150.

Narayanan, P., Carette, J., Romano, W., Shan, C., and Zinkov, R. Probabilistic inference by program transformation in Hakaru (system description). In International Symposium on Functional and Logic Programming, 13th International Symposium, FLOPS 2016, Kochi, Japan, March 4-6, 2016, Proceedings, pp. 62–79. Springer, 2016. doi: 10.1007/978-3-319-29604-3_5. URL http://dx.doi.org/10.1007/978-3-319-29604-3_5.

Neal, R. M. Slice sampling. The Annals of Statistics, 31(3):705–741, 2003. ISSN 00905364. URL http://www.jstor.org/stable/3448413.

Papaspiliopoulos, O., Roberts, G. O., and Sköld, M. A general framework for the parametrization of hierarchical models. Statistical Science, pp. 59–73, 2007.

Parno, M. D. and Marzouk, Y. M. Transport map accelerated Markov chain Monte Carlo. SIAM/ASA Journal on Uncertainty Quantification, 6(2):645–682, 2018. doi: 10.1137/17M1134640. URL https://doi.org/10.1137/17M1134640.

Plotkin, G. and Power, J. Adequacy for algebraic effects. In Honsell, F. and Miculan, M. (eds.), Foundations of Software Science and Computation Structures, pp. 1–24, Berlin, Heidelberg, 2001. Springer Berlin Heidelberg. ISBN 978-3-540-45315-4.

Plotkin, G. and Pretnar, M. Handlers of algebraic effects. In Castagna, G. (ed.), Programming Languages and Systems, pp. 80–94, Berlin, Heidelberg, 2009. Springer Berlin Heidelberg. ISBN 978-3-642-00590-9.

Pretnar, M. An introduction to algebraic effects and handlers. Invited tutorial paper. Electronic Notes in Theoretical Computer Science, 319:19–35, 2015. ISSN 1571-0661. The 31st Conference on the Mathematical Foundations of Programming Semantics (MFPS XXXI).

Rubin, D. B. Estimation in parallel randomized experiments. Journal of Educational Statistics, 6(4):377–401, 1981. ISSN 03629791. URL http://www.jstor.org/stable/1164617.

Stan Development Team et al. Stan modelling language users guide and reference manual. Technical report, 2016. https://mc-stan.org/docs/2_19/stan-users-guide/.

Tran, D., Hoffman, M. D., Vasudevan, S., Suter, C., Moore, D., Radul, A., Johnson, M., and Saurous, R. A. Simple, distributed, and accelerated probabilistic programming. Advances in Neural Information Processing Systems, 2018. URL https://arxiv.org/abs/1811.02091.

Uber AI Labs. Pyro: A deep probabilistic programming language, 2017. http://pyro.ai/.

Yao, Y., Vehtari, A., Simpson, D., and Gelman, A. Yes, but did it work?: Evaluating variational inference. arXiv preprint arXiv:1802.02538, 2018.

Yu, Y. and Meng, X.-L. To center or not to center: That is not the question. An ancillarity-sufficiency interweaving strategy (ASIS) for boosting MCMC efficiency. Journal of Computational and Graphical Statistics, 20(3):531–570, 2011.

Zinkov, R. and Shan, C. Composing inference algorithms as program transformations. In Elidan, G., Kersting, K., and Ihler, A. T. (eds.), Proceedings of the Thirty-Third Conference on Uncertainty in Artificial Intelligence, UAI 2017, Sydney, Australia, August 11-15, 2017. AUAI Press, 2017.
