Automatic Reparameterisation in Probabilistic Programming

M.I. Gorinova∗, D. Moore and M.D. Hoffman
∗ Work done while interning at Google.
Abstract
The performance of approximate posterior inference algorithms can depend strongly on how
the model is parameterised. In particular, non-centring a model can break dependencies
between different levels in hierarchical models and drastically reduce the difficulty of the
inference task. However, it is not obvious how to reparameterise a model in the best possible
way, as the shape of the posterior depends on the properties of the observed data.
We propose two inference strategies that utilise the power of probabilistic programming
to free modellers from the need to choose a parameterisation. The first strategy alternates
between sampling using the centred or the non-centred parameterisation, while the second
strategy learns a partially non-centred parameterisation by optimising a variational objective.
1. Introduction
Reparameterising a probabilistic model means expressing it in terms of a different set of
random variables, representing a bijective transformation of the original variables of interest.
The reparameterised model can have drastically different posterior geometry from the original,
with significant implications for both variational and sampling-based inference algorithms.
In this paper, we focus on non-centring, a particularly common form of reparameterisation
in hierarchical Bayesian modelling. Consider a parameter z ∼ N(µ, σ); we say this is in
centred parameterisation (CP). If we instead work with an auxiliary, standard normal
parameter ε ∼ N(0, 1), and obtain z by applying the transformation z = µ + σε, we
say the parameter is in its non-centred parameterisation (NCP).1 Although the centred
parameterisation is often more intuitive and interpretable, non-centring can sometimes
drastically improve the performance of inference (Betancourt and Girolami, 2015). Figure 1
illustrates a simple example of such a case.

1. More generally, non-centring is applicable to location-scale families and any random variable that can
be expressed as a bijective transformation z = fθ(ε) of a "standardized" variable ε, analogous to the
"reparameterisation trick" in stochastic optimisation (Kingma and Welling, 2013).
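In a probabilistic program, the two parameterisations differ only in how the sample statement
for z is written. As a minimal sketch in Edward2 (with Edward2 imported as ed, and assuming
mu and sigma are already defined):

# Centred parameterisation: z is sampled directly from its prior.
z = ed.Normal(loc=mu, scale=sigma, name="z")

# Non-centred parameterisation: a standard normal auxiliary variable is
# sampled, and z is recovered by a deterministic shift-and-scale.
z_std = ed.Normal(loc=0., scale=1., name="z_std")
z = mu + sigma * z_std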
Bayesian practitioners are often advised to manually non-centre their models; however,
this breaks the separation between modelling and inference and requires expressing the
model in a potentially less intuitive form. Moreover, non-centring is not universally better
than centring: the best parameterisation depends on many factors including the statistical
properties of the observed data. The user must possess the sophistication to understand
what reparameterisation is needed, and where in the model it should be applied.
We explore strategies to tackle this problem automatically via transformations of
probabilistic programs. Using the Edward2 probabilistic programming language, we implement
two approaches: Interleaved Hamiltonian Monte Carlo (i-hmc), which alternates HMC
steps between centred and non-centred parameterisations, and a novel algorithm we call
Variationally Inferred Parameterisation (vip), which uses a continuous relaxation to search
over possible ways of reparameterising the model.2 Experiments across a range of models
demonstrate that these strategies enable robust inference, performing at least as well as the
best fixed parameterisation, and sometimes better.
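As a rough illustration of the interleaving idea, a single i-hmc step can be sketched as
follows; hmc_step_cp, hmc_step_ncp, to_ncp and to_cp are hypothetical helpers standing in
for HMC transitions under each parameterisation and the deterministic bijections between
them (the actual implementation derives both parameterisations automatically by program
transformation):

def i_hmc_step(z, mu, sigma):
    z = hmc_step_cp(z)              # one HMC transition on the centred variable
    z_std = to_ncp(z, mu, sigma)    # reparameterise: z_std = (z - mu) / sigma
    z_std = hmc_step_ncp(z_std)     # one HMC transition on the auxiliary variable
    return to_cp(z_std, mu, sigma)  # map back: z = mu + sigma * z_std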
2. Automatic Reparameterisation

vip treats the choice of parameterisation as part of the inference problem: it expresses the
model in terms of transformed variables z̃, with the transformation governed by continuous
parameters φ, and learns φ jointly with the variational parameters θ by maximising the
evidence lower bound

L(θ, φ) = E_{q(z̃; θ)}[log p(x, z̃; φ)] − E_{q(z̃; θ)}[log q(z̃; θ)].
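In practice, the expectations can be estimated by Monte Carlo using reparameterised samples;
a one-sample sketch, where surrogate_posterior and log_joint_fn are assumptions standing in
for the learned variational distribution q(z̃; θ) and the model's log joint density
log p(x, z̃; φ):

# Single-sample Monte Carlo estimate of the ELBO L(theta, phi).
z_tilde = surrogate_posterior.sample()
elbo = log_joint_fn(z_tilde) - surrogate_posterior.log_prob(z_tilde)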
3. Experiments
We report experimental results for hierarchical Bayesian regression on the Eight Schools
(Rubin, 1981), Radon (Gelman and Hill, 2006) and German credit datasets. For each, we
specify a model and evaluate i-hmc and vip-hmc by comparing to HMC run on the fully
centred model (hmc-cp) and the fully non-centred model (hmc-ncp). On each run of the
experiment, we obtain 50000 samples after burn-in. We tune the step sizes and number of
leapfrog steps for HMC automatically, as described in more detail in Appendix C.

Figure 1: Neal's funnel: z ∼ N(0, 3); x ∼ N(0, e^{z/2}) (Neal, 2003), with a mean-field
normal variational fit overlaid. Although half of the mass is inside the "funnel" (z < 0),
centred samplers have difficulty reaching it.

Table 1: Ratio of effective sample size to number of leapfrog steps (larger is better),
with standard errors from five trials, on the Eight Schools and German credit models.

            8 Schools     German Credit
hmc-cp         92 ± 4           64 ± 21
hmc-ncp    3475 ± 849           35 ± 15
i-hmc      3879 ± 281          101 ± 32
vip-hmc    4986 ± 660          115 ± 12
            MN           IN           PA           MO          ND           MA           AZ
hmc-cp      798 ± 276    1034 ± 54    425 ± 208    427 ± 45    2840 ± 347   5796 ± 393   3644 ± 129
hmc-ncp     340 ± 35     75 ± 14      43 ± 9       16 ± 7      187 ± 36     179 ± 66     100 ± 26
i-hmc       1495 ± 129   590 ± 287    410 ± 183    233 ± 29    2421 ± 89    6696 ± 97    2472 ± 267
vip-hmc     1144 ± 279   865 ± 98     816 ± 184    416 ± 51    3273 ± 145   5551 ± 336   3875 ± 73

Table 2: Ratio of effective sample size to number of leapfrog steps (larger is better),
with standard errors from five trials. Radon data for different US states.
Across the datasets in Table 1 and Table 2, we see that i-hmc is a robust alternative to
using a fully centred or non-centred model. By taking alternating steps in CP and NCP,
i-hmc ensures that it makes reasonable progress, regardless of which of hmc-cp or hmc-ncp
is better. Moreover, we see that vip-hmc finds a reasonable reparameterisation in each
case, typically as good as the better of hmc-cp and hmc-ncp. On initial inspection, the
learned parameterisations are often very close to fully centred or non-centred (implying that
vip-hmc successfully learns the “correct” global parameterisation for each problem), but a
small number of groups are sometimes flipped to the alternative parameterisation. These
preliminary results suggest that these learned, mixed parameterisations may sometimes be
superior to the best global parameterisation; we are excited to explore this further.
4. Discussion
We presented two inference strategies that use program transformations on probabilistic
programs to automatically make use of different model reparameterisations, and we showed
that both strategies are robust. We hope that the idea of automatically reparameterising
probabilistic models with the aid of program transformations can lead to new ways of easing
the inference task, potentially allowing us to work with models that were previously infeasible.
References

Christophe Andrieu and Johannes Thoms. A tutorial on adaptive MCMC. Statistics and
Computing, 18(4):343–373, 2008.

Michael Betancourt and Mark Girolami. Hamiltonian Monte Carlo for hierarchical models.
Current trends in Bayesian methodology with applications, 79:30, 2015.

Andrew Gelman and Jennifer Hill. Data analysis using regression and multilevel/hierarchical
models. Cambridge University Press, 2006.

Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint
arXiv:1312.6114, 2013.

Dave Moore and Maria I. Gorinova. Effect handling for composable program transformations
in Edward2. International Conference on Probabilistic Programming, 2018. URL
https://arxiv.org/abs/1811.06150.

Radford M. Neal. Slice sampling. The Annals of Statistics, 31(3):705–741, 2003. URL
http://www.jstor.org/stable/3448413.

Gordon Plotkin and Matija Pretnar. Handlers of algebraic effects. In Giuseppe Castagna,
editor, Programming Languages and Systems, pages 80–94. Springer Berlin Heidelberg, 2009.

Matija Pretnar. An introduction to algebraic effects and handlers (invited tutorial paper).
Electronic Notes in Theoretical Computer Science, 319:19–35, 2015. URL
http://www.sciencedirect.com/science/article/pii/S1571066115000705.

Donald B. Rubin. Estimation in parallel randomized experiments. Journal of Educational
Statistics, 6(4):377–401, 1981.

Dustin Tran, Matthew D. Hoffman, Srinivas Vasudevan, Christopher Suter, Dave Moore,
Alexey Radul, Matthew Johnson, and Rif A. Saurous. Simple, distributed, and accelerated
probabilistic programming. In Advances in Neural Information Processing Systems, 2018.
URL https://arxiv.org/abs/1811.02091.

Yaming Yu and Xiao-Li Meng. To center or not to center: That is not the question—an
ancillarity–sufficiency interweaving strategy (ASIS) for boosting MCMC efficiency. Journal
of Computational and Graphical Statistics, 20(3):531–570, 2011.
with ed.interception(log_prob_interceptor):
  neals_funnel()
return log_prob
By executing the neals_funnel function in the context of log_prob_interceptor, we
override each sample statement (a call to a random variable constructor rv_constructor)
to generate a variable that takes on the value provided in the arguments of log_joint_fn.
As a side effect, we also accumulate the result of evaluating each variable's prior density at
the provided value, which, by the chain rule, gives us the log joint density. The function
log_joint_fn is then equivalent to the function log p, where log p(z, x) = log N(z | 0, 3) +
log N(x | 0, e^{z/2}).
Appendix B. Interceptors
Interceptors are a powerful abstraction in probabilistic programming systems,
as discussed previously by Moore and Gorinova (2018) and demonstrated by both Pyro and
Edward2. In particular, we can use interceptors to automatically reparameterise a model, as
well as to specify variational families. In this section, we show Edward2 pseudo-code for the
interceptors used to implement i-hmc and vip-hmc.
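For instance, a minimal sketch of a non-centring interceptor, under the simplifying
assumption that every variable it sees is a normal to be non-centred (the actual interceptor
must also leave observed variables and non-location-scale families untouched):

def ncp_interceptor(rv_constructor, *args, **kwargs):
  # Replace z ~ Normal(loc, scale) by z_std ~ Normal(0, 1) and return
  # the deterministically transformed value z = loc + scale * z_std.
  loc = kwargs.pop("loc")
  scale = kwargs.pop("scale")
  kwargs["name"] = kwargs["name"] + "_std"
  rv_std = rv_constructor(*args, loc=tf.zeros_like(loc),
                          scale=tf.ones_like(scale), **kwargs)
  return loc + scale * rv_std

Executing a model in the context of this interceptor, with ed.interception(ncp_interceptor),
yields its non-centred version.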
# Learnable parameterisation parameters a and b, one per element of the
# variable's location and scale; a sigmoid keeps them in (0, 1).
a = tf.nn.sigmoid(tf.get_variable(
    name + "_a_unconstrained", initializer=tf.zeros_like(rv_loc)))
b = tf.nn.sigmoid(tf.get_variable(
    name + "_b_unconstrained", initializer=tf.zeros_like(rv_scale)))