Abstract. Software tools for Bayesian inference have undergone rapid evolution in the past three decades, following popularisation of the first-generation MCMC-sampler implementations. More recently, exponential growth in the number of users has been stimulated both by the active development of new packages by the machine learning community and the popularity of specialist software for particular applications. This review aims to summarize the most popular software and provide a useful map for a reader to navigate the world of Bayesian computation. We anticipate vigorous continued development of algorithms and corresponding software in multiple research fields, such as probabilistic programming, likelihood-free inference and Bayesian neural networks, which will further broaden the possibilities for employing the Bayesian paradigm in exciting applications.

Key words and phrases: Statistics, data analysis, MCMC, computation, probabilistic programming.
1. INTRODUCTION
p(θ, y)) with the data y to compute the posterior distribution of the parameters p(θ|y). We do it with Bayes' rule:

p(θ|y) = p(y|θ)p(θ) / p(y) = p(y|θ)p(θ) / ∫ p(y|θ)p(θ) dθ ∝ p(y|θ)p(θ).

The most common quantities of interest in Bayesian inference are posterior properties of parameters or functions thereof, which can be expressed in terms of expectations over the posterior distribution p(θ|y):

E[g(θ)|y] = ∫ g(θ)p(θ|y) dθ.

Prior and posterior predictions, model selection and other quantities of interest follow a similar pattern. Thus, the main computational problem of Bayesian inference is computing integrals.

Our choice of likelihood and prior rarely leads to a closed-form solution for p(θ|y), which in most cases can only be evaluated up to a multiplicative constant, and even less often to a closed-form solution for the integral. Therefore, computing the quantities of interest is a numerical problem and a challenge in itself.

1.2 Software for Bayesian Inference

For our discussion of software for Bayesian inference, we divide the software components into three groups: the modeling language, the computation methods and the utilities.

1.2.1 Modeling language. We use the term modeling language in the broadest sense of a component that allows the user to specify the likelihood, prior and data (from now on, we use model to refer to all of these combined). Alternatively, Bayesian inference can be done by specifying a generative model for p(θ, y) instead (see Section 4.3) and some languages support specifying both. See the Appendix for an illustrative example in different modeling languages.

Every modeling language represents some kind of trade-off between generality and accessibility. On one end of the spectrum are expressive languages, such as probabilistic programming languages (PPLs) and general-purpose programming languages like Python. On the other end, we have software that allows for a single model or a limited number of options. And in between we have Bayesian inference-specific declarative (e.g., WinBUGS [74]), imperative (e.g., Stan [21]) or formula-based languages (e.g., R-INLA [71] and rstanarm [50]) that use syntax similar to the formula object used by generalized linear models (GLMs) in the core R stats package [100], etc.

The choice of modeling language more so than any other component determines the target user. Or, when the software is designed with a target user in mind, no component is more influenced by the requirements of the target user than the modeling language. And, as demonstrated by the variety of different modeling languages, Bayesian inference users are a heterogeneous group and there is no one-size-fits-all approach.

1.2.2 Computation methods. Once the model is specified, the next step is to perform the computation of the posterior and other quantities of interest. Therefore, complete software for Bayesian inference must implement one or more Bayesian computation methods.

There is no method that is able to perform practically feasible Bayesian computation for every model. Therefore, many different computation methods have been developed, and each method represents a trade-off between generality and efficiency. The computation method determines the class of models that can be computed and usually limits the software more than the modeling language. That is, it is not uncommon that the modeling language allows for the specification of models that the computation method is not able to compute, not even in theory. And, as a rule, there always exist models that a computation method will not be able to deal with in practice, even though it is able to do so in theory.

In this paper, our treatment of Bayesian computation is from a Bayesian software perspective: we limit ourselves to discussing methods that were key for the development of software for Bayesian inference and listing the methods implemented in the software. For details about the history and the state-of-the-art of Bayesian computation, we refer the reader to [80].

1.2.3 Utilities. With utilities, we refer to all software components that do not fall into the previous two groups, but are still common in Bayesian software and convenient if not essential to the Bayesian inference workflow (for a detailed treatment of the Bayesian workflow, we refer the reader to [41, 45]):

• Diagnosing Bayesian computation: Bayesian computation methods can and often do fail to find the optimum solution or, in the case of Markov chain Monte Carlo (MCMC), properly explore the posterior distribution. Diagnostic tools are essential for identifying potential issues before proceeding with the interpretation of the results. Furthermore, most key methods are MCMC and, therefore, sampling-based and approximate. Approximation error must also be quantified and included in the interpretation of the results. Common diagnostics are traceplots, Monte Carlo standard errors, effective sample size (ESS), R̂ and simulation-based calibration [119].
• Model validation and comparison: Prior, posterior and model visualization, prior and posterior predictive checks, (approximate) leave-one-out cross-validation
and model evaluation criteria such as WAIC [134] and computing Bayes factors. The modeling language determines how easy or difficult it is to compute these [20]. For example, for prior and posterior predictive checks, we have to draw samples from the prior predictive distribution p(y) and the posterior predictive distribution p(y_new|y). For Bayes factors, we have to evaluate the marginal p(y) and for cross-validation we have to evaluate the posterior predictive p(y_new|y).
• Computation libraries: Matrix algebra libraries, support for probability distributions and other statistical computation, support for high-performance computing and automatic differentiation (AD) libraries.
• Interfaces: Often, Bayesian software provides the user with only a low-level command-line interface to the computation, where the data and model are passed as files. For convenience, interfaces are then developed that allow the user to access the computation from a popular higher-level language such as Python and R.
• Documentation: This includes software documentation, language definition, examples, case studies and other material that make it easier to use the software.

1.3 Scope and Organization

In part, this paper is a survey of the most popular and historically most relevant software for general-purpose Bayesian inference. We also include popular software that serves a more specific purpose, for example, software that provides only Bayesian computation of a utility or software that focuses on a narrower class of models. When it comes to less commonly used and more specific software, this paper is biased toward Python and R, the two most popular languages for data analysis. See Tables 1 and 2 for an estimate of the relative popularity of Bayesian software packages for Python and R, respectively.

General-purpose Bayesian computation has had two distinct periods, each dominated by a certain type of Bayesian computation and software. From the early 1990s to the 2010s, it was Gibbs sampling and the quintessential representative of software is BUGS. From the 2010s up to now, it is Hamiltonian Monte Carlo (HMC) and the quintessential representative is Stan. The first part of the remainder of the paper roughly corresponds to these two periods. In Section 2, we describe Gibbs sampling, the typical structure of Gibbs sampling-based software and the BUGS language. We also include software that might have been developed later but is related to, was inspired by or is a continuation of BUGS. Similarly, Section 3 focuses on HMC and Stan.

We dedicate Section 4 to software that we were not able to meaningfully assign to either of the two periods. It features software that focuses on computation, software that targets a more specific class of models and the latest developments in Bayesian software and universal PPLs.
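Whichever package produces them, posterior draws θ(1), …, θ(m) turn the expectations above into simple averages. A minimal Python sketch (the draws below are placeholders for actual MCMC output):

import numpy as np

rng = np.random.default_rng(0)

# Placeholder posterior draws; in practice these come from the packages
# surveyed in this paper.
theta_draws = rng.normal(loc=1.0, scale=0.5, size=10_000)

def g(theta):
    return theta ** 2  # any function of the parameters

values = g(theta_draws)
estimate = values.mean()                           # Monte Carlo estimate of E[g(θ)|y]
mc_se = values.std(ddof=1) / np.sqrt(len(values))  # valid for independent draws
print(estimate, mc_se)

For autocorrelated MCMC draws, the standard error must be based on the effective sample size (Section 1.2.3) rather than the raw number of draws.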
TABLE 1
Total Python Package Index (PyPI) downloads for Bayesian inference-related Python packages referenced in this paper for the period between January 1, 2022 and December 31, 2022. We obtained the information from the PyPI data set (bigquery-public-data.pypi). We include matplotlib [63], the most popular Python package for statistical graphics, as a baseline for comparison. While these counts should in most cases be a reasonable proxy for relative popularity, we have to keep in mind that users can also download these packages from other sources. Inclusion in other packages and automated downloads can also bias the results.
TABLE 2
Total RStudio [104] CRAN mirror downloads for Bayesian inference-related R packages referenced in this paper for the period between January 1, 2022 and December 31, 2022. We used the cranlogs package [26]. We include ggplot2 [135], the most popular R package for statistical graphics, as a baseline for comparison. While these counts should in most cases be a good proxy for relative popularity, we have to keep in mind that users can also download these packages from other CRAN mirrors or directly from code repositories. Also, some popular R packages are not available on CRAN, for example, R-INLA; cmdstanr, the R interface to Stan; or R2MultiBUGS, the R interface to MultiBUGS.

Package  Downloads  Description
ggplot2 31,457,872 Create Elegant Data Visualisations Using the Grammar of Graphics
mgcv 1,523,237 Mixed GAM Computation Vehicle with Automatic Smoothness Estimation
coda 1,190,640 Output Analysis and Diagnostics for MCMC
rstan 993,086 R Interface to Stan
loo 738,325 Efficient Leave-One-Out Cross-Validation and WAIC for Bayesian Models
bayestestR 599,283 Understand and Describe Bayesian Models and Posterior Distributions
prophet 338,276 Automatic Forecasting Procedure
posterior 314,669 Tools for Working with Posterior Distributions
bayesplot 308,747 Plotting for Bayesian Models
bnlearn 286,003 Bayesian Network Structure Learning, Parameter Learning and Inference
shinystan 272,855 Interactive Visual and Numerical Diagnostics and Posterior Analysis for Bayesian Models
BayesFactor 239,538 Computation of Bayes Factors for Common Designs
rjags 228,433 Bayesian Graphical Models using MCMC
brms 215,302 Bayesian Regression Models using Stan
MCMCpack 186,124 Markov Chain Monte Carlo (MCMC) Package
rstanarm 164,469 Bayesian Applied Regression Modeling via Stan
bridgesampling 155,278 Bridge Sampling for Marginal Likelihoods and Bayes Factors
R2WinBUGS 61,926 Running WinBUGS and OpenBUGS from R SPLUS
nimble 36,471 MCMC, Particle Filtering and Programmable Hierarchical Modeling
abc 36,251 Tools for Approximate Bayesian Computation (ABC)
R2OpenBUGS 27,284 Running OpenBUGS from R
greta 8453 Simple and Scalable Statistical Modeling in R
abctools 6404 Tools for ABC Analyses
EasyABC 5344 Efficient Approximate Bayesian Computation Sampling Schemes
We discuss the future of software for Bayesian inference in Section 5.

2. FIRST GENERATION—GIBBS SAMPLING-BASED

In the period between the early 1990s and early 2010s, the most popular software for general-purpose Bayesian inference was based on graphical models and Gibbs sampling as the method of Bayesian computation.

The main assumption of this approach is that the conditional independence between variables in our joint distribution p(V) = p(θ, y) can be represented by a directed acyclic graph (DAG), where each variable is represented by a node and every node is conditionally independent of all other nodes, given its Markov blanket.

A model that admits such a representation is called a Bayesian network and is a class of probabilistic graphical models (see the Appendix for an example). The joint distribution can then be factored as

p(V) = ∏_{v∈V} p(v|P(v))

and the full conditional of a node is

(1)  p(v|V\v) ∝ p(V) ∝ p(v|P(v)) ∏_{u∈V : v∈P(u)} p(u|P(u)),

where P(v) are the parent nodes of v.

A Markov chain that updates one node at a time using its full conditional will converge to the posterior distribution under weak conditions. From a practical perspective, this means that we only have to be able to iteratively sample from the full conditionals. Algorithm 1 is a summary of the Gibbs sampling algorithm. A major appeal of the algorithm is that there are no algorithm parameters that need to be tuned, which is a useful property for automated inference. For the purpose of sampling from a full conditional, a hierarchy of samplers is typically used. Because the model is stated in a symbolic way, it is straightforward to check the properties of the full conditional. In most cases, the more specific the distribution, the more efficient the sampling algorithm that we can use.

Algorithm 1 (Gibbs sampling). k—number of nodes, p(x_j | x_−j)—the k full conditionals, x^(0)—starting value, m—number of samples.
1: procedure GIBBSSAMPLING()
2:   for i ← 1 : m do
3:     for j ← 1 : k do
4:       x_j^(i) ∼ p(x_j | x_1^(i), …, x_{j−1}^(i), x_{j+1}^(i−1), …, x_k^(i−1))
5:     end for
6:     ith sample ← x^(i)
7:   end for
8: end procedure
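As a concrete illustration of Algorithm 1, a minimal Python sketch of a two-node Gibbs sampler for a bivariate standard normal with correlation ρ, whose full conditionals are known univariate normals (the target and parameter values are illustrative):

import numpy as np

rng = np.random.default_rng(1)
rho, m = 0.9, 5_000                  # correlation and number of samples
samples = np.zeros((m, 2))
x1, x2 = 0.0, 0.0                    # starting values

for i in range(m):
    # Full conditionals of a standard bivariate normal with correlation rho:
    # x1 | x2 ~ N(rho * x2, 1 - rho^2), and symmetrically for x2 | x1.
    x1 = rng.normal(rho * x2, np.sqrt(1 - rho**2))
    x2 = rng.normal(rho * x1, np.sqrt(1 - rho**2))
    samples[i] = (x1, x2)

With ρ = 0.9 the chain illustrates the autocorrelation problem discussed below: single-site updates of highly correlated nodes move slowly through the posterior.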
2.1 BUGS

The quintessential representative of this approach is BUGS (Bayesian inference Using Gibbs Sampling) [112]. The BUGS project started at the Medical Research Council Biostatistics Unit in Cambridge in 1989. The BUGS software evolved into WinBUGS [74, 76, 113], which updated the BUGS language and the sampling algorithms, and OpenBUGS, a GNU General Public License release of WinBUGS that also runs on Linux (with some limitations) [114]. BUGS, WinBUGS and OpenBUGS are no longer developed, but the BUGS project has inspired other software, which we discuss at the end of this section. A detailed history of BUGS is provided by Lunn et al. [75].

The factorization from equation (1) is the basis for both the BUGS language and the Gibbs sampling-based computation. The BUGS language is a declarative language in which the user states all the parent–child relationships P(v) between the variables in the model. See the Appendix for an example of a Bayesian network model in JAGS, a language which is very similar to WinBUGS.

For sampling from the full conditionals, WinBUGS implements a different approach for each of the following contingencies ([74], Chapter 12.1):

• Discrete distribution: The inverse CDF method.
• Standard distribution: A standard algorithm for that distribution.
• Log-concave: Derivative-free adaptive rejection sampling [46]. Many standard distributions are log-concave, including the exponential family. The product of log-concave functions is also log-concave, so it is common for the full conditional to be log-concave.
• Restricted range: Slice sampling [90].
• Unrestricted range: Current point Metropolis.

OpenBUGS includes block sampling methods that jointly sample from groups of nodes that are likely to be correlated based on the structure of the model. Block updating solves one of the disadvantages of Gibbs sampling: it is strongly dependent on the parameterization of the model. If two variables have a high posterior correlation but are updated independently using Gibbs sampling, then the Markov chain will exhibit high autocorrelation for both variables. Block updating of correlated nodes solves this problem, which otherwise falls to the user to solve by reparameterizing the model.

A strong point of the BUGS PPL is that the distinction between data and parameters is made at run time, based on provided observations. Vectors can also be partially observed, by leaving the unobserved elements unknown (NA). This simplifies the simulation of draws for posterior checks. Although WinBUGS focuses on Bayesian networks, there is some limited support for undirected graphs (factor models) as long as the entire subset of variables is represented as a single multivariate node so that their values are sampled jointly. WinBUGS also supports graphical model specification in plate notation with the DoodleBUGS editor.

MultiBUGS [54] is a continuation of the BUGS project. The major contribution of MultiBUGS is that it provides a more efficient implementation and several parallelization techniques. In a multicore environment, MultiBUGS can be several orders of magnitude more efficient than OpenBUGS.

R interfaces are available for WinBUGS, OpenBUGS and MultiBUGS: R2WinBUGS [117], R2OpenBUGS [126] and R2MultiBUGS (https://github.com/MultiBUGS/R2MultiBUGS).

2.2 JAGS

JAGS (Just Another Gibbs Sampler) [96] is a clone of BUGS that has a completely independent code base but aims for similar functionality, although it notably lacks a graphical user interface (see [74], Chapter 12.6, for a summary of differences). Unlike WinBUGS and OpenBUGS, which are written in Component Pascal, JAGS is written in C++ and runs on Windows, MacOS and Linux. This portability has contributed to its popularity and the fact that it is still being actively developed. It is published under the GNU General Public License, version 2. See the Appendix for an example of a model written in JAGS.

JAGS incorporates a copy of the R math library, which provides high-quality algorithms for random number generation and calculation of quantities associated with probability distributions. The workhorse sampling method for JAGS is slice sampling [90], which can be applied to both continuous- and discrete-valued nodes. The "glm" module of JAGS incorporates efficient samplers for generalized linear mixed models (GLMMs). These samplers are based on the principle of data augmentation, a commonly used technique to simplify sampling from a graphical model by adding new nodes [58]. In this case, data augmentation reduces GLMMs with binary outcomes [2, 62, 98] or binary and Poisson outcomes [38] to a linear model with normal outcomes. This reduction to a normal linear model allows block updating of all the parameters in the linear predictor, which is much more efficient than Gibbs sampling. The underlying engine for the linear model uses sparse matrix algebra [29], which handles fixed and random effects simultaneously.
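As a minimal sketch of the data augmentation idea (not the JAGS implementation itself), consider Bayesian probit regression with a flat prior on the coefficients: augmenting with latent normal variables, in the style of the augmentation schemes cited above, reduces each update of the coefficient vector to a draw from a normal linear model, allowing a block update. The data here are simulated placeholders:

import numpy as np
from scipy.stats import truncnorm

rng = np.random.default_rng(4)

# Simulated probit regression data (illustrative).
n, d = 200, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([-0.5, 1.0])
y = (X @ beta_true + rng.standard_normal(n) > 0).astype(float)

XtX_inv = np.linalg.inv(X.T @ X)  # posterior covariance under a flat prior
beta = np.zeros(d)
draws = []
for _ in range(2_000):
    # 1) Augment: z_i | beta, y_i is normal, truncated to (0, inf) if y_i = 1
    #    and to (-inf, 0) if y_i = 0.
    mu = X @ beta
    lo = np.where(y == 1, -mu, -np.inf)  # standardized lower bound
    hi = np.where(y == 1, np.inf, -mu)   # standardized upper bound
    z = mu + truncnorm.rvs(lo, hi, size=n, random_state=rng)
    # 2) Block update: beta | z is multivariate normal (a linear model in z).
    beta = rng.multivariate_normal(XtX_inv @ X.T @ z, XtX_inv)
    draws.append(beta)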
2.3 Nimble

Nimble [30], similar to BUGS, focuses on graphical models. It is an extension of the BUGS language but also implements a modeling language embedded in R, both of which are compiled to C++. Several Bayesian computation methods are implemented, including Metropolis–Hastings (MH), Gibbs sampling and sequential Monte Carlo (SMC), and the user has the flexibility of assigning different sampling methods to different nodes. Recently, support for AD and HMC has also been added. See the Appendix for an example of a model written in Nimble's R-embedded language.

3. SECOND GENERATION—HMC-BASED

The two main drawbacks of the BUGS-like approach are the limited expressiveness of the language (no local variables, conditional statements or other imperative-language features) and the inefficiency of computation. The single-node exploration of Gibbs sampling is inefficient when the nodes are highly correlated in the posterior; in particular, when the dimensionality in terms of parameters is high, it reverts to random walk behavior [59, 89].

The only MCMC algorithm that theoretically scales to high dimensions on a broad class of models is HMC. Introductions to HMC have been provided, for example, by Neal [91] and Betancourt [10], and a more detailed mathematical treatment by Betancourt [11].

HMC is a physics-inspired approach to proposing the next state that uses the gradient of the target density for a better understanding of its geometry. Hamiltonian dynamics consist of a d-dimensional position vector q and a d-dimensional momentum vector p. The evolution of the system as a function of "algorithmic time" t is determined by the function H(q, p) (the Hamiltonian) and the ordinary differential equations

dq_i/dt = ∂H/∂p_i,   dp_i/dt = −∂H/∂q_i.

To simulate Hamiltonian dynamics, we need to discretize time with some step size ε. The most commonly used method is the leapfrog symplectic integrator. Hamiltonian dynamics have several properties which are important for HMC to work: they preserve the Hamiltonian, they are reversible and they are symplectic, and thus volume preserving.

For HMC, the Hamiltonian H is typically chosen so that it is separable: H(q, p) = U(q) + K(p), where U(q) is the potential energy and K(p) the kinetic energy of the system. The main idea of HMC is to use the Hamiltonian to define a joint density of position and momentum:

p(q, p) ∝ e^{−H(q,p)} = e^{−U(q)} e^{−K(p)}.

Substituting U(q) = −log f(q), where f is proportional to the density we want to sample from, and using the standard kinetic energy, we get

p(q, p) ∝ f(q) e^{−(1/2) pᵀ M⁻¹ p}.

The joint density p(q, p) can be seen as the target density over the position vector q augmented by an independent multivariate Gaussian for the momentum vector p, with mean 0 and covariance M.

Hamiltonian dynamics conserve the Hamiltonian, so all states on a trajectory will have the same density p(·, ·). That makes Hamiltonian dynamics suitable for proposing the next state in an MCMC algorithm, because a trajectory can propose a state far away in position q from the current state, but still with acceptance probability 1. To reach every possible state, we have to sample a new momentum. Because the kinetic and potential energy parts of the joint density are independent and we are sampling from the actual distribution of the momentum p, this sampling leaves the target distribution invariant. That is, p(q, p) remains the stationary distribution of the Markov chain. In practice, however, the leapfrog method, while being a stable simulation of Hamiltonian dynamics, will not conserve the Hamiltonian exactly, and there will be relatively small fluctuations. That is why we still have to apply a Metropolis correction. Putting it all together, Algorithm 2 summarizes the basic HMC algorithm.

Algorithm 2 (HMC). f—a function proportional to our target density, q_0—starting value, ε—step size, L—number of steps, M—mass matrix, m—number of samples.
1: procedure HMC()
2:   for i ← 1 : m do
3:     p ∼ N(0, M)                                      ▷ resample momentum
4:     obtain (q*, p*) with L leapfrog steps of size ε from (q_{i−1}, p)
5:     α ← min{1, exp(−H(q*, p*) + H(q_{i−1}, p))}
6:     sample u ∼ U(0, 1)
7:     if u ≤ α then                                    ▷ Metropolis correction
8:       q_i ← q*                                       ▷ accept transition
9:     else
10:      q_i ← q_{i−1}
11:    end if
12:    ith sample ← q_i
13:  end for
14: end procedure
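A minimal Python sketch of Algorithm 2 for a standard normal target with identity mass matrix; the hand-coded gradient and fixed ε and L are illustrative simplifications of what, as discussed next, AD and adaptive tuning provide in production software:

import numpy as np

rng = np.random.default_rng(2)

def U(q):          # potential energy: -log f(q), standard normal target
    return 0.5 * np.dot(q, q)

def grad_U(q):     # hand-coded gradient; real software uses AD
    return q

def hmc(q0, eps=0.1, L=20, m=1_000):
    d, q = len(q0), np.array(q0, dtype=float)
    samples = np.zeros((m, d))
    for i in range(m):
        p = rng.standard_normal(d)          # resample momentum, M = I
        q_new, p_new = q.copy(), p.copy()
        p_new -= 0.5 * eps * grad_U(q_new)  # leapfrog: half step in momentum
        for _ in range(L - 1):
            q_new += eps * p_new
            p_new -= eps * grad_U(q_new)
        q_new += eps * p_new
        p_new -= 0.5 * eps * grad_U(q_new)  # final half step
        # Metropolis correction for the leapfrog discretization error.
        H_old = U(q) + 0.5 * np.dot(p, p)
        H_new = U(q_new) + 0.5 * np.dot(p_new, p_new)
        if rng.uniform() <= min(1.0, np.exp(H_old - H_new)):
            q = q_new
        samples[i] = q
    return samples

draws = hmc(np.zeros(5))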
The main ideas behind HMC had been known for more than 20 years before HMC featured in popular Bayesian software [33]. The key enabler of more automatic use of HMC was the development of automatic differentiation (AD; see [6] for an introduction and survey). Simulating the Hamiltonian dynamics of HMC requires the gradient of the density, and in order for the software to be general purpose, we must be able to compute the gradient for any program the user can code. Of the four general approaches to computing derivatives, three will not work: manually deriving them is not practical; numerical differentiation via finite differences is too unstable due to rounding and truncation errors and is also slow in high dimensions; and symbolic differentiation suffers from expression swell and leads to inefficient code. AD instead exploits the fact that every program is a composition of elementary operations and, as long as each elementary operation also implements a derivative, we can apply the chain rule to derive the gradient of the composition. This leads to machine-level precision of gradients. Most modern inference software implements or imports an AD library. An important limitation of HMC is that it can only be used on smooth spaces.

A challenge in making HMC useful in general-purpose Bayesian inference is automatically tuning its parameters (mass matrix, step size, number of steps). HMC-based software typically implements one or more warmup phases for parameter tuning. Software then proceeds with sampling and the warmup samples are discarded. The key development was the no-U-turn sampler (NUTS) [59], which is, with some modifications, still the core Bayesian computation method in Stan. The basic idea of NUTS is to have a dynamic number of steps by simulating the Hamiltonian trajectory until we detect a turn back toward the starting state (or reach the maximum number of steps). While there are promising alternatives for tuning the number of steps, including more GPU computation-friendly variants [61, 60], NUTS is still the most common implementation of HMC and is also available in most modern software for general-purpose Bayesian inference.

HMC/NUTS admits several specific MCMC diagnostics [10]: when the step size is too large to capture a feature of the target density (which can lead to nonnegligible bias), this is likely to manifest as a diverging simulation, which can be detected so that we can use a smaller step size; reaching the maximum number of steps before terminating the trajectory is an indication of inefficient exploration; and the Bayesian fraction of missing information (BFMI) [9] quantifies how well momentum resampling matches the marginal energy distribution and can be used to detect poor adaptation during warmup or inefficient exploration.

3.1 Stan

Stan [21] is by far the most popular software for general-purpose Bayesian inference. Stan is implemented in C++ and has a standalone command line interface, but also has mature interfaces for Python and R (RStan [115], PyStan [102]) and lightweight wrappers for Python and R (CmdStanPy [116], CmdStanR [39], BridgeStan, github.com/roualdes/bridgestan). There are also interfaces for most languages that are traditionally used for data analysis: Matlab (MatlabStan), Julia (Stan.jl), Stata (StataStan), Mathematica (MathematicaStan), Scala (ScalaStan) and an HTTP request-based interface (httpstan).

While Stan implements black-box variational inference [69], Laplace approximation and standard optimization methods, the core Bayesian computation method is NUTS, a variant of HMC. Stan has a rich mathematics library with AD [22], and OpenCL-based GPU support with kernel fusion [23, 24].

The Stan PPL is an imperative language with which the user specifies the computation of the (log-)posterior. A program is divided into blocks, the most important of which are data, parameters and model. See the Appendix for an example of a model written in Stan. The distinction between data and parameters is made at compile time, so changing a variable from data to a parameter (or vice versa) requires moving it from the data to the parameters block (or vice versa) and recompiling. Notable work on the Stan language includes SlicStan [53, 51, 52], which contains several improvements, and translating Stan to Pyro [5].

Because of HMC-based computation, the class of models that can be fit by Stan is models with a smooth density. An important omission are models with discrete parameters, which currently have to be manually marginalized out. This means that Stan does not subsume what can be fit with BUGS and that HMC does not make Gibbs sampling-based software obsolete. However, empirical evidence suggests that, when applicable, Stan is currently the go-to software for general-purpose Bayesian inference [7].

The majority of Stan users are not writing models directly in the Stan language. There are several popular packages that provide a simplified formula- or options-based modeling language for a more specific class of models and use a Stan backend for modeling and computation: the R package brms [16] for modeling with hierarchical models; Prophet [121], implemented in Python [123] and R [122], for nonlinear time series forecasting with trend, seasonality and holiday effects; and the R package rstanarm [50] for a Bayesian analogue to R lm, glm, aov, etc. Overall, there are more than 140 R packages built on top of Stan, providing easy-to-use interfaces for various types of models common in different applications. The success of these packages is not only due to Stan, but also due to the increasing number of useful utilities in R, Python and Julia.
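To illustrate the typical workflow through one of the lightweight wrappers mentioned above, a minimal CmdStanPy sketch (the Stan file name and data are hypothetical placeholders; consult the CmdStanPy documentation for the authoritative API):

from cmdstanpy import CmdStanModel

# Compile a Stan program with data, parameters and model blocks
# (here a hypothetical simple linear regression in regression.stan).
model = CmdStanModel(stan_file="regression.stan")

data = {"N": 10,
        "x": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
        "y": [1.1, 1.9, 3.2, 3.9, 5.1, 6.0, 6.8, 8.1, 9.0, 9.9]}

# Run NUTS with a warmup phase for tuning; warmup draws are discarded.
fit = model.sample(data=data, chains=4)
print(fit.summary())  # posterior means, MCSE, ESS and R-hat diagnostics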
4.2.2 Birch. Birch [88] is a universal PPL that transpiles to C++, with GPU support. Users implement the joint distribution of their model in a generative manner, with a preference for generic and object-oriented programming paradigms. Inference methods are based on SMC with gradient-based kernels. A defining feature of Birch is support for automatic marginalization and automatic conditioning. Much like AD, these recognize known forms, such as conjugacies and discrete enumerations, to marginalize out random variables where possible, and condition them on later simulations where necessary. The implementation of these is based on a heuristic known as delayed sampling [87], which reveals these opportunities during program execution by deferring the simulation of random variables for as long as possible. The result is the automatic enhancement of inference methods with features such as Rao–Blackwellization [87] and variable elimination [136]. Birch has been demonstrated on problems where the number of random variables is unknown, such as multiobject tracking [88] (where the number of objects is unknown) and statistical phylogenetics [103] (where the number of extinct side branches of a phylogeny is unknown). See the Appendix for an example of a model written in Birch.

4.2.3 Pyro. Pyro [12] is a Python PPL built on the PyTorch [93] backend. The main computation method in Pyro is stochastic variational inference, so the software is aimed at scalable probabilistic machine learning. NumPyro [94] is a NumPy-based backend for the Pyro PPL that uses JAX for AD and compilation to CPU/GPU.

4.2.4 Blang. Blang [14] is an open source package for approximating posterior distributions over arbitrary spaces, that is, Bayesian models containing not only integer and real variables but also user-defined datatypes such as phylogenetic trees, random graphs and sequence alignments. The Blang project includes a standard library of common datatypes and distributions, written in the Blang language, and extension points to create new datatypes and associated distributions. Users can publish versioned Blang packages containing new datatypes and distributions and import contributed packages and their transitive dependencies. The Blang language's scoping rules are used to automatically detect sparsity patterns and construct a type of graphical model known as a factor graph. Based on this factor graph, the posterior distribution is approximated via an adaptive nonreversible parallel tempering algorithm [118], which by default is parallelized over the user's CPU cores, but can also be distributed over MPI (Message Passing Interface) thanks to Blang's integration with the Pigeons distributed parallel tempering package (https://github.com/Julia-Tempering/Pigeons.jl). See the Appendix for an example of a model written in Blang.

4.3 Likelihood-Free Bayesian Inference

Likelihood-free inference (LFI) methods, such as approximate Bayesian computation (ABC) [110], Bayesian synthetic likelihood (BSL) [99], machine learning-based posterior approximations and surrogate likelihood methods [56, 25], refer to (mostly) Bayesian computation methods that can be used when it is impossible or infeasible to evaluate the likelihood function, but a generative simulator model exists. Such methods are popular, for example, in astrostatistics, genetics, ecology, systems biology and human cognition modeling. Engine for Likelihood-Free Inference (ELFI) [73] is a Python package for LFI that covers all the main approaches (ABC, BSL and ML-based methods). ELFI has a modular design that consists of a DAG-based modeling API and a separate API for inference, allowing a user to choose flexibly from a selection of algorithms that generate a sample from the approximate posterior distribution. Sampling can be done using adaptive importance sampling or MCMC/HMC, and with or without the use of a surrogate model for the likelihood function approximation. The surrogate model emulates a target function using Gaussian processes (GPs) and active learning (Bayesian optimization). The active learning approach has been demonstrated to accelerate likelihood-free inference by up to several orders of magnitude. Other general-purpose ABC packages are the Python packages pyABC [108] and ABCpy [34], and the R packages abc [27], EasyABC [64], abctools [92] and ABCreg [127]. Neural network-based surrogate models are accessible via the Python package sbi [125], and an R package [3] provides a toolbox for BSL. More detailed surveys of ABC software are provided by Nunes and Prangle [92] and Kousathanas et al. [67].

4.4 Software That Focuses on Computation

Blackjax (https://github.com/blackjax-devs/blackjax) is a Python library of MCMC methods for JAX. It works on CPU and GPU, is robust, efficient and easily integrates with PPLs that provide densities compatible with JAX (TFP, Oryx, NumPyro, Aesara, PyTensor/PyMC).

Emcee [35, 36] is a Python implementation of the affine invariant MCMC ensemble Bayesian computation method [49]. This derivative-free approach is suitable for low-dimensional problems with black-box likelihoods, which are common in astrophysics. Another Python package that is popular in astrophysics is dynesty [111], which implements dynamic nested sampling [57].

Mamba.jl (https://github.com/brian-j-smith/Mamba.jl) is a Julia package aimed at users who want to use and develop MCMC methods. It implements several MCMC methods (HMC, NUTS, Metropolis-within-Gibbs, etc.) and MCMC diagnostics. Another popular Julia package that implements state-of-the-art Bayesian computation methods is DynamicHMC.jl (https://github.com/tpapp/DynamicHMC.jl).
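As a brief illustration of the derivative-free ensemble approach from this subsection, emcee needs only a function returning the log-density (the three-dimensional Gaussian target here is a placeholder for a black-box likelihood):

import numpy as np
import emcee

def log_prob(theta):
    # Black-box log-density: standard normal in 3 dimensions.
    return -0.5 * np.dot(theta, theta)

ndim, nwalkers = 3, 32
p0 = np.random.default_rng(3).standard_normal((nwalkers, ndim))

sampler = emcee.EnsembleSampler(nwalkers, ndim, log_prob)
sampler.run_mcmc(p0, 2_000)
draws = sampler.get_chain(discard=500, flat=True)  # drop burn-in, flatten walkers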
4.5 Other General-Purpose Software

Other general-purpose software includes Infer.NET [83], a machine learning library written in C# for the .NET framework. It facilitates automatic approximate inference for Bayesian networks and Markov random fields. Bayesian computation is mostly limited to message passing. TensorFlow Probability (TFP) [31] is a Python library built on TensorFlow [1]. An example of a PPL with a very compact syntax is greta [47], a PPL embedded in R but based on TensorFlow and TFP. While limited in the Bayesian computation methods provided, it is extensible. See the Appendix for an example of a model written in greta. Oryx (https://github.com/jax-ml/oryx/) is a PPL built on top of JAX. The Journal of Statistical Software also recently published a special issue on Bayesian software [18], which includes some software covered by this paper and other specialized software.

4.6 Popular Utilities and Specialized Software

CODA [97] is a still popular R package for post-hoc diagnostics and analysis of MCMC output. ArviZ [70] is a Python package which provides MCMC diagnostics, model evaluation and model validation tools. ArviZ is backend-agnostic and currently the most popular such tool in Python. The R package bridgesampling [55] estimates marginal likelihoods, Bayes factors, posterior model probabilities and normalizing constants. The R package posterior [17] subsets, binds, mutates and converts between formats of MCMC samples and includes lightweight implementations of state-of-the-art posterior inference diagnostics. BayesFactor [85] is an R package for computing Bayes factors for contingency tables, one- and two-sample designs, one-way designs, general ANOVA designs and linear regression. The R package shinystan [42] provides a graphical user interface for interactive Markov chain Monte Carlo (MCMC) diagnostics and other tools for analyzing a posterior sample. The procedures in shinystan are agnostic to what generated the MCMC samples, but with some added functionality for models fit with RStan. The R package loo [133] performs efficient approximate leave-one-out cross-validation for Bayesian models fit using MCMC methods. The R package bayestestR [78] has tools for dealing with uncertainty and effects in a Bayesian statistics framework. It is agnostic of the software that generated the posterior samples and includes MAP estimates, measures of dispersion, ROPE and Bayes factors. The R package bayesplot [40] has graphing functions for Bayesian models, including posterior draws, visual MCMC diagnostics and graphical predictive checking. The R package projpred [95] performs projection predictive variable selection for Bayesian generalized linear and additive multilevel models fit using MCMC methods. The R package priorsense [66] performs efficient prior and likelihood sensitivity analysis for Bayesian models fit using MCMC methods. MCMCpack [81] is an R package that implements MCMC-based computation for several statistical methods. Tools for structural learning and parameter estimation of Bayesian networks include bnlearn [109], Bayes Net for Matlab [86], HUGIN [77], VIBES [13], MSBNx [65], along with the commercial tools GEnIe/SMILE [32] and Netica (https://www.norsys.com/netica.html).

5. CHALLENGES AND FUTURE PERSPECTIVES

The field of software for Bayesian inference has never been more active or varied. There are developments in all directions, providing better tools that allow for more accessible, robust or efficient treatment of typical modeling, as well as pushing the boundaries of what can be done.

Similar to programming languages, where one might prefer Python for general-purpose programming, R for data wrangling and visualization or the emerging Julia for high-performance data analytics, there is no one-size-fits-all approach to software for Bayesian inference. Stan is the typical choice for Bayesian model building and inference, Pyro or TFP for Bayesian machine learning, and numerous other tools for more specialized tasks. Such diversity is understandable, because limiting the tool simplifies it and allows for more efficient computation. While there has been some encouraging progress in universal PPLs and underlying Bayesian computation, it is not yet clear if a novel trade-off between expressivity and efficiency can be struck, leading to a third generation of tools.

As a result, users have to either accept the limitations of their tool of choice or learn how to work with multiple tools and languages. A natural solution would be to automatically translate between languages or from statistical notation into code, as illustrated in the Appendix. This is a difficult problem, because languages differ in expressivity, and even when translations exist, automatic translations could result in inefficient code. Regardless, there appears to be a relative lack of incentive in this area.

A new PPL is most often learned from model examples with code and data, or from translations from a language we are already familiar with. So, it is not a surprise that popular PPLs such as BUGS and Stan have extensive documentation, including user's manuals, case studies and, in the case of Stan, examples of translations from BUGS to Stan. Popular PPLs are also accessible on all popular platforms and through major programming languages (typically standalone with interfaces), have open governance
α ∼ N(0, 5²),
β ∼ N(0, 5²),
σ ∼ U(0, 10),
y_i ∼ N(α + βx_i, σ),

where y_i is the dependent variable and x_i is the predictor. This model is in the class of Bayesian networks. Its rep-

Bean Machine [124]:

import beanmachine.ppl as bm
from torch.distributions import Normal, Uniform

@bm.random_variable
def alpha():
    return Normal(0, 5)

@bm.random_variable
def beta():
    return Normal(0, 5)

@bm.random_variable
def sigma():
    return Uniform(0, 10)

@bm.random_variable
def y(i):
    # x: observed predictor values (data)
    return Normal(alpha() + beta() * x[i], sigma())
[20] CARPENTER, B. (2021). What do we need from a probabilistic programming language to support Bayesian workflow? In International Conference on Probabilistic Programming (PROBPROG) 46.
[21] CARPENTER, B., GELMAN, A., HOFFMAN, M. D., LEE, D., GOODRICH, B., BETANCOURT, M., BRUBAKER, M., GUO, J., LI, P. et al. (2017). Stan: A probabilistic programming language. J. Stat. Softw. 76.
[22] CARPENTER, B., HOFFMAN, M. D., BRUBAKER, M., LEE, D., LI, P. and BETANCOURT, M. (2015). The Stan math library: Reverse-mode automatic differentiation in C++. Available at arXiv:1509.07164.
[23] ČEŠNOVAR, R. (2022). Parallel computation in the Stan probabilistic programming language. Ph.D. thesis, Univerza v Ljubljani, Fakulteta za računalništvo in informatiko.
[24] CIGLARIČ, T., ČEŠNOVAR, R. and ŠTRUMBELJ, E. (2020). Automated OpenCL GPU kernel fusion for Stan math. In Proceedings of the International Workshop on OpenCL 1–6.
[25] CRANMER, K., BREHMER, J. and LOUPPE, G. (2020). The frontier of simulation-based inference. Proc. Natl. Acad. Sci. USA 117 30055–30062. MR4263287 https://doi.org/10.1073/pnas.1912789117
[26] CSÁRDI, G. (2019). cranlogs: Download logs from the 'RStudio' 'CRAN' mirror. R package version 2.1.1.
[27] CSILLÉRY, K., FRANÇOIS, O. and BLUM, M. G. (2012). abc: An R package for approximate Bayesian computation (ABC). Methods Ecol. Evol. 3 475–479.
[28] CUSUMANO-TOWNER, M. F., SAAD, F. A., LEW, A. K. and MANSINGHKA, V. K. (2019). Gen: A general-purpose probabilistic programming system with programmable inference. In Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation 221–236.
[29] DAVIS, T. A. (2006). Direct Methods for Sparse Linear Systems. Fundamentals of Algorithms 2. SIAM, Philadelphia, PA. MR2270673 https://doi.org/10.1137/1.9780898718881
[30] DE VALPINE, P., TUREK, D., PACIOREK, C. J., ANDERSON-BERGMAN, C., TEMPLE LANG, D. and BODIK, R. (2017). Programming with models: Writing statistical algorithms for general model structures with NIMBLE. J. Comput. Graph. Statist. 26 403–413. MR3640196 https://doi.org/10.1080/10618600.2016.1172487
[31] DILLON, J. V., LANGMORE, I., TRAN, D., BREVDO, E., VASUDEVAN, S., MOORE, D., PATTON, B., ALEMI, A., HOFFMAN, M. et al. (2017). TensorFlow distributions. Available at arXiv:1711.10604.
[32] DRUZDZEL, M. J. (1999). SMILE: Structural modeling, inference, and learning engine and GeNIe: A development environment for graphical decision-theoretic models. In American Association for Artificial Intelligence Proceedings 902–903.
[33] DUANE, S., KENNEDY, A. D., PENDLETON, B. J. and ROWETH, D. (1987). Hybrid Monte Carlo. Phys. Lett. B 195 216–222. MR3960671 https://doi.org/10.1016/0370-2693(87)91197-x
[34] DUTTA, R., SCHOENGENS, M., PACCHIARDI, L., UMMADISINGU, A., WIDMER, N., KÜNZLI, P., ONNELA, J.-P. and MIRA, A. (2021). ABCpy: A high-performance computing perspective to approximate Bayesian computation. J. Stat. Softw. 100 1–38.
[35] FOREMAN-MACKEY, D., FARR, W. M., SINHA, M., ARCHIBALD, A. M., HOGG, D. W., SANDERS, J. S., ZUNTZ, J., WILLIAMS, P. K., NELSON, A. R. et al. (2019). emcee v3: A Python ensemble sampling toolkit for affine-invariant MCMC. Available at arXiv:1911.07688.
[36] FOREMAN-MACKEY, D., HOGG, D. W., LANG, D. and GOODMAN, J. (2013). emcee: The MCMC hammer. Publ. Astron. Soc. Pac. 125 306.
[37] FOURNIER, D. A., SKAUG, H. J., ANCHETA, J., IANELLI, J., MAGNUSSON, A., MAUNDER, M. N., NIELSEN, A. and SIBERT, J. (2012). AD Model Builder: Using automatic differentiation for statistical inference of highly parameterized complex nonlinear models. Optim. Methods Softw. 27 233–249. MR2901959 https://doi.org/10.1080/10556788.2011.597854
[38] FRÜHWIRTH-SCHNATTER, S., FRÜHWIRTH, R., HELD, L. and RUE, H. (2009). Improved auxiliary mixture sampling for hierarchical models of non-Gaussian data. Stat. Comput. 19 479–492. MR2565319 https://doi.org/10.1007/s11222-008-9109-4
[39] GABRY, J. and ČEŠNOVAR, R. (2022). A lightweight R interface to CmdStan.
[40] GABRY, J. and MAHR, T. (2022). bayesplot: Plotting for Bayesian models. R package version 1.10.0.
[41] GABRY, J., SIMPSON, D., VEHTARI, A., BETANCOURT, M. and GELMAN, A. (2019). Visualization in Bayesian workflow. J. Roy. Statist. Soc. Ser. A 182 389–402. MR3902665 https://doi.org/10.1111/rssa.12378
[42] GABRY, J. and VEEN, D. (2022). shinystan: Interactive visual and numerical diagnostics and posterior analysis for Bayesian models. R package version 2.6.0. Available at https://CRAN.R-project.org/package=shinystan.
[43] GAEDKE-MERZHÄUSER, L., VAN NIEKERK, J., SCHENK, O. and RUE, H. (2023). Parallelized integrated nested Laplace approximations for fast Bayesian inference. Stat. Comput. 33 25. MR4526361 https://doi.org/10.1007/s11222-022-10192-1
[44] GE, H., XU, K. and GHAHRAMANI, Z. (2018). Turing: A language for flexible probabilistic inference. In International Conference on Artificial Intelligence and Statistics 1682–1690. PMLR.
[45] GELMAN, A., VEHTARI, A., SIMPSON, D., MARGOSSIAN, C. C., CARPENTER, B., YAO, Y., KENNEDY, L., GABRY, J., BÜRKNER, P.-C. et al. (2020). Bayesian workflow. Available at arXiv:2011.01808.
[46] GILKS, W. R. and WILD, P. (1992). Adaptive rejection sampling for Gibbs sampling. J. R. Stat. Soc., Ser. C 41 337–348.
[47] GOLDING, N. (2019). greta: Simple and scalable statistical modelling in R. J. Open Sour. Softw. 4 1601.
[48] GOODMAN, N., MANSINGHKA, V., ROY, D. M., BONAWITZ, K. and TENENBAUM, J. B. (2012). Church: A language for generative models. Available at arXiv:1206.3255.
[49] GOODMAN, J. and WEARE, J. (2010). Ensemble samplers with affine invariance. Commun. Appl. Math. Comput. Sci. 5 65–80. MR2600822 https://doi.org/10.2140/camcos.2010.5.65
[50] GOODRICH, B., ALI, I., GABRY, J. and SAM, B. (2021). rstanarm: Bayesian applied regression modeling via Stan. R package version 2.21.3. Available at https://CRAN.R-project.org/package=rstanarm.
[51] GORINOVA, M. I. (2022). Program analysis of probabilistic programs. Available at arXiv:2204.06868.
[52] GORINOVA, M. I., GORDON, A. D. and SUTTON, C. (2019). Probabilistic programming with densities in SlicStan: Efficient, flexible, and deterministic. Proc. ACM Program. Lang. 3 1–30.
[53] GORINOVA, M., MOORE, D. and HOFFMAN, M. (2020). Automatic reparameterisation of probabilistic programs. In International Conference on Machine Learning 3648–3657. PMLR.
[54] GOUDIE, R. J., TURNER, R. M., DE ANGELIS, D. and THOMAS, A. (2020). MultiBUGS: A parallel implementation of the BUGS modelling framework for faster Bayesian inference. J. Stat. Softw. 95.
[55] GRONAU, Q. F., SINGMANN, H. and WAGENMAKERS, E.-J. (2020). bridgesampling: An R package for estimating normalizing constants. J. Stat. Softw. 92 1–29.
[56] GUTMANN, M. U. and CORANDER, J. (2016). Bayesian optimization for likelihood-free inference of simulator-based statistical models. J. Mach. Learn. Res. 17 125. MR3555016
[57] HIGSON, E., HANDLEY, W., HOBSON, M. and LASENBY, A. (2019). Dynamic nested sampling: An improved algorithm for parameter estimation and evidence calculation. Stat. Comput. 29 891–913. MR3994608 https://doi.org/10.1007/s11222-018-9844-0
[58] HOBERT, J. P. (2011). The data augmentation algorithm: Theory and methodology. In Handbook of Markov Chain Monte Carlo. Chapman & Hall/CRC Handb. Mod. Stat. Methods 253–293. CRC Press, Boca Raton, FL. MR2858452
[59] HOFFMAN, M. D. and GELMAN, A. (2014). The no-U-turn sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. J. Mach. Learn. Res. 15 1593–1623. MR3214779
[60] HOFFMAN, M. D., RADUL, A. and SOUNTSOV, P. (2021). An adaptive MCMC scheme for setting trajectory lengths in Hamiltonian Monte Carlo. Int. Conf. Artif. Intell. Stat.
[61] HOFFMAN, M. and SOUNTSOV, P. (2022). Tuning-free generalized Hamiltonian Monte Carlo. Proc. Mach. Learn. Res. 151 7799–7813.
[62] HOLMES, C. C. and HELD, L. (2006). Bayesian auxiliary variable models for binary and multinomial regression. Bayesian Anal. 1 145–168. MR2227368 https://doi.org/10.1214/06-BA105
[63] HUNTER, J. D. (2007). Matplotlib: A 2D graphics environment. Comput. Sci. Eng. 9 90–95.
[64] JABOT, F., FAURE, T. and DUMOULIN, N. (2013). EasyABC: Performing efficient approximate Bayesian computation sampling schemes using R. Methods Ecol. Evol. 4 684–687.
[65] KADIE, C. M., HOVEL, D. and HORVITZ, E. (2001). MSBNx: A Component-Centric Toolkit for Modeling and Inference with Bayesian Networks. Technical Report MSR-TR-2001-67, Microsoft Research, Redmond, WA.
[66] KALLIOINEN, N., PAANANEN, T., BÜRKNER, P.-C. and VEHTARI, A. (2021). Detecting and diagnosing prior and likelihood sensitivity with power-scaling. Available at arXiv:2107.14054.
[67] KOUSATHANAS, A., DUCHEN, P. and WEGMANN, D. (2019). A guide to general-purpose ABC software. In Handbook of Approximate Bayesian Computation. Chapman & Hall/CRC Handb. Mod. Stat. Methods 369–413. CRC Press, Boca Raton, FL. MR3889290
[68] KRAINSKI, E., GÓMEZ-RUBIO, V., BAKKA, H., LENZI, A., CASTRO-CAMILO, D., SIMPSON, D., LINDGREN, F. and RUE, H. (2018). Advanced Spatial Modeling with Stochastic Partial Differential Equations Using R and INLA. CRC Press, Boca Raton, FL.
[69] KUCUKELBIR, A., TRAN, D., RANGANATH, R., GELMAN, A. and BLEI, D. M. (2017). Automatic differentiation variational inference. J. Mach. Learn. Res. 18 14. MR3634881
[70] KUMAR, R., CARROLL, C., HARTIKAINEN, A. and MARTÍN, O. A. (2019). ArviZ: A unified library for exploratory analysis of Bayesian models in Python. J. Open Sour. Softw.
[71] LINDGREN, F. and RUE, H. (2015). Bayesian spatial modelling with R-INLA. J. Stat. Softw. 63 1–25.
[72] LINDGREN, F., RUE, H. and LINDSTRÖM, J. (2011). An explicit link between Gaussian fields and Gaussian Markov random fields: The stochastic partial differential equation approach. J. R. Stat. Soc. Ser. B. Stat. Methodol. 73 423–498. MR2853727 https://doi.org/10.1111/j.1467-9868.2011.00777.x
[73] LINTUSAARI, J., VUOLLEKOSKI, H., KANGASRÄÄSIÖ, A., SKYTÉN, K., JÄRVENPÄÄ, M., MARTTINEN, P., GUTMANN, M. U., VEHTARI, A., CORANDER, J. et al. (2018). ELFI: Engine for likelihood-free inference. J. Mach. Learn. Res. 19 16. MR3862423
[74] LUNN, D., JACKSON, C., BEST, N., THOMAS, A. and SPIEGELHALTER, D. (2012). The BUGS Book: A Practical Introduction to Bayesian Analysis. CRC Press, Boca Raton, FL.
[75] LUNN, D., SPIEGELHALTER, D., THOMAS, A. and BEST, N. (2009). The BUGS project: Evolution, critique and future directions. Stat. Med. 28 3049–3067. MR2750401 https://doi.org/10.1002/sim.3680
[76] LUNN, D. J., THOMAS, A., BEST, N. and SPIEGELHALTER, D. (2000). WinBUGS—a Bayesian modelling framework: Concepts, structure, and extensibility. Stat. Comput. 10 325–337.
[77] MADSEN, A. L., LANG, M., KJÆRULFF, U. B. and JENSEN, F. (2003). The Hugin tool for learning Bayesian networks. In Symbolic and Quantitative Approaches to Reasoning with Uncertainty. Lecture Notes in Computer Science 2711 594–605. Springer, New York. MR2050972 https://doi.org/10.1007/978-3-540-45062-7_49
[78] MAKOWSKI, D., BEN-SHACHAR, M. S. and LÜDECKE, D. (2019). bayestestR: Describing effects and their uncertainty, existence and significance within the Bayesian framework. J. Open Sour. Softw. 4 1541.
[79] MANSINGHKA, V., SELSAM, D. and PEROV, Y. (2014). Venture: A higher-order probabilistic programming platform with programmable inference. Available at arXiv:1404.0099.
[80] MARTIN, G. M., FRAZIER, D. T. and ROBERT, C. P. (2022). Computing Bayes: From then 'til now'. Available at arXiv:2208.00646.
[81] MARTIN, A. D., QUINN, K. M. and PARK, J. H. (2011). MCMCpack: Markov chain Monte Carlo in R. J. Stat. Softw. 42 22.
[82] MARTINS, T. G., SIMPSON, D., LINDGREN, F. and RUE, H. (2013). Bayesian computing with INLA: New features. Comput. Statist. Data Anal. 67 68–83. MR3079584 https://doi.org/10.1016/j.csda.2013.04.014
[83] MINKA, T., WINN, J. M., GUIVER, J. P., ZAYKOV, Y., FABIAN, D. and BRONSKILL, J. (2018). Infer.NET 0.3. Microsoft Research Cambridge. Available at http://dotnet.github.io/infer.
[84] MONNAHAN, C. C. and KRISTENSEN, K. (2018). No-U-turn sampling for fast Bayesian inference in ADMB and TMB: Introducing the adnuts and tmbstan R packages. PLoS ONE 13 e0197954. https://doi.org/10.1371/journal.pone.0197954
[85] MOREY, R. D. and ROUDER, J. N. (2022). BayesFactor: Computation of Bayes factors for common designs.
[86] MURPHY, K. (2001). The Bayes Net toolbox for Matlab. Comput. Sci. Stat. 33 1024–1034.
[87] MURRAY, L. M., LUNDÉN, D., KUDLICKA, J., BROMAN, D. and SCHÖN, T. B. (2018). Delayed sampling and automatic Rao–Blackwellization of probabilistic programs. In Proceedings of the 21st International Conference on Artificial Intelligence and Statistics (AISTATS).
[88] MURRAY, L. M. and SCHÖN, T. B. (2018). Automated learning with a probabilistic programming language: Birch. Annu. Rev. Control 46 29–43. MR3907522 https://doi.org/10.1016/j.arcontrol.2018.10.013
[89] NEAL, R. M. (1993). Probabilistic Inference Using Markov Chain Monte Carlo Methods. Department of Computer Science, Univ. Toronto.
[90] NEAL, R. M. (2003). Slice sampling. Ann. Statist. 31 705–767. MR1994729 https://doi.org/10.1214/aos/1056562461
[91] NEAL, R. M. (2011). MCMC using Hamiltonian dynamics. Handb. Markov Chain Monte Carlo.
[92] NUNES, M. A. and PRANGLE, D. (2015). abctools: An R package for tuning approximate Bayesian computation analyses. R J. 7 189–205.
[93] PASZKE, A., GROSS, S., MASSA, F., LERER, A., BRADBURY, J., CHANAN, G., KILLEEN, T., LIN, Z., GIMELSHEIN, N. et al. (2019). PyTorch: An imperative style, high-performance deep learning library. Adv. Neural Inf. Process. Syst. 32.
[94] PHAN, D., PRADHAN, N. and JANKOWIAK, M. (2019). Composable effects for flexible and accelerated probabilistic programming in NumPyro. Available at arXiv:1912.11554.
[95] PIIRONEN, J., PAASINIEMI, M., CATALINA, A., WEBER, F. and VEHTARI, A. (2023). projpred: Projection predictive feature selection.
[96] PLUMMER, M. et al. (2003). JAGS: A program for analysis of Bayesian graphical models using Gibbs sampling. In Proceedings of the 3rd International Workshop on Distributed Statistical Computing 124 1–10.
[97] PLUMMER, M., BEST, N., COWLES, K. and VINES, K. (2006). CODA: Convergence diagnosis and output analysis for MCMC. R News 6 7–11.
[98] POLSON, N. G., SCOTT, J. G. and WINDLE, J. (2013). Bayesian inference for logistic models using Pólya-Gamma latent variables. J. Amer. Statist. Assoc. 108 1339–1349. MR3174712 https://doi.org/10.1080/01621459.2013.829001
[99] PRICE, L. F., DROVANDI, C. C., LEE, A. and NOTT, D. J. (2018). Bayesian synthetic likelihood. J. Comput. Graph. Statist. 27 1–11. MR3788296 https://doi.org/10.1080/10618600.2017.1302882
[100] R CORE TEAM (2022). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
[101] RAINFORTH, T. W. G. (2017). Automating inference, learning, and design using probabilistic programming. Ph.D. thesis, Univ. Oxford.
[102] RIDDELL, A. (2022). Python interface to Stan.
[103] RONQUIST, F., KUDLICKA, J., SENDEROV, V., BORGSTRÖM, J., LARTILLOT, N., LUNDÉN, D., MURRAY, L., SCHÖN, T. B. and BROMAN, D. (2021). Universal probabilistic programming offers a powerful approach to statistical phylogenetics. Commun. Biol. 4.
[104] RSTUDIO TEAM (2021). RStudio: Integrated Development Environment for R. RStudio, PBC, Boston, MA.
[105] RUE, H., MARTINO, S. and CHOPIN, N. (2009). Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. J. R. Stat. Soc. Ser. B. Stat. Methodol. 71 319–392. MR2649602 https://doi.org/10.1111/j.1467-9868.2008.00700.x
[106] RUE, H., RIEBLER, A., SØRBYE, S. H., ILLIAN, J. B., SIMPSON, D. P. and LINDGREN, F. K. (2017). Bayesian computing with INLA: A review. Annu. Rev. Stat. Appl. 4 395–421.
[107] SALVATIER, J., WIECKI, T. V. and FONNESBECK, C. (2016). Probabilistic programming in Python using PyMC3. PeerJ Comput. Sci. 2 e55.
[108] SCHÄLTE, Y., KLINGER, E., ALAMOUDI, E. and HASENAUER, J. (2022). pyABC: Efficient and robust easy-to-use approximate Bayesian computation. Available at arXiv:2203.13043.
[109] SCUTARI, M. (2010). Learning Bayesian networks with the bnlearn R package. J. Stat. Softw. 35.
[110] SISSON, S. A., FAN, Y. and BEAUMONT, M. (2018). Handbook of Approximate Bayesian Computation. CRC Press, Boca Raton, FL.
[111] SPEAGLE, J. S. (2020). dynesty: A dynamic nested sampling package for estimating Bayesian posteriors and evidences. Mon. Not. R. Astron. Soc. 493 3132–3158.
[112] SPIEGELHALTER, D., THOMAS, A., BEST, N. and GILKS, W. (1996). BUGS 0.5: Bayesian inference using Gibbs sampling manual (version II). MRC Biostatistics Unit, Institute of Public Health, Cambridge, UK 1–59.
[113] SPIEGELHALTER, D. J., THOMAS, A., BEST, N. and LUNN, D. (2003). WinBUGS Version 1.4 User Manual. MRC Biostatistics Unit, Cambridge. Available at http://www.mrc-bsu.cam.ac.uk/bugs.
[114] SPIEGELHALTER, D., THOMAS, A., BEST, N. and LUNN, D. (2014). OpenBUGS user manual. Version 3.2.3.
[115] STAN DEVELOPMENT TEAM (2022). RStan: The R interface to Stan.
[116] STAN DEVELOPMENT TEAM (2022). A lightweight Python interface to CmdStan.
[117] STURTZ, S., LIGGES, U. and GELMAN, A. (2005). R2WinBUGS: A package for running WinBUGS from R. J. Stat. Softw. 12 1–16.
[118] SYED, S., BOUCHARD-CÔTÉ, A., DELIGIANNIDIS, G. and DOUCET, A. (2022). Non-reversible parallel tempering: A scalable highly parallel MCMC scheme. J. R. Stat. Soc. Ser. B. Stat. Methodol. 84 321–350. MR4412989
[119] TALTS, S., BETANCOURT, M., SIMPSON, D., VEHTARI, A. and GELMAN, A. (2018). Validating Bayesian inference algorithms with simulation-based calibration. Available at arXiv:1804.06788.
[120] TAREK, M., XU, K., TRAPP, M., GE, H. and GHAHRAMANI, Z. (2020). DynamicPPL: Stan-like speed for dynamic probabilistic models. Available at arXiv:2002.02702.
[121] TAYLOR, S. J. and LETHAM, B. (2018). Forecasting at scale. Amer. Statist. 72 37–45. MR3790566 https://doi.org/10.1080/00031305.2017.1380080
[122] TAYLOR, S. J. and LETHAM, B. (2021). prophet: Automatic Forecasting Procedure.
[123] TAYLOR, S. J. and LETHAM, B. (2022). Prophet: Automatic Forecasting Procedure.
[124] TEHRANI, N., ARORA, N. S., LI, Y. L., SHAH, K. D., NOURSI, D., TINGLEY, M., TORABI, N., LIPPERT, E., MEIJER, E. et al. (2020). Bean Machine: A declarative probabilistic programming language for efficient programmable inference. In International Conference on Probabilistic Graphical Models 485–496. PMLR.
[125] TEJERO-CANTERO, A., BOELTS, J., DEISTLER, M., LUECKMANN, J.-M., DURKAN, C., GONÇALVES, P. J., GREENBERG, D. S. and MACKE, J. H. (2020). sbi: A toolkit for simulation-based inference. J. Open Sour. Softw. 5 2505.
[126] THOMAS, N. (2020). R2OpenBUGS: Running OpenBUGS from R.
[127] THORNTON, K. R. (2009). Automating approximate Bayesian computation by local linear regression. BMC Genet. 10 1–5.
[128] TOLPIN, D., VAN DE MEENT, J.-W., YANG, H. and WOOD, F. (2016). Design and implementation of probabilistic programming language Anglican. In Proceedings of the 28th Symposium on the Implementation and Application of Functional Programming Languages 1–12.
[129] TRAN, D., HOFFMAN, M. W., MOORE, D., SUTER, C., VASUDEVAN, S. and RADUL, A. (2018). Simple, distributed, and accelerated probabilistic programming. Adv. Neural Inf. Process. Syst. 31.
[130] VAN NIEKERK, J., BAKKA, H., RUE, H. and SCHENK, O. (2021). New frontiers in Bayesian modeling using the INLA package in R. J. Stat. Softw. 100 1–28.
[131] VAN NIEKERK, J., KRAINSKI, E., RUSTAND, D. and RUE, H. (2023). A new avenue for Bayesian inference with INLA. Comput. Statist. Data Anal. 181 107692. MR4540934 https://doi.org/10.1016/j.csda.2023.107692
[132] VAN NIEKERK, J. and RUE, H. (2021). Correcting the Laplace method with variational Bayes. Available at arXiv:2111.12945.
[133] VEHTARI, A., GABRY, J., MAGNUSSON, M., YAO, Y., BÜRKNER, P.-C., PAANANEN, T. and GELMAN, A. (2022). loo: Efficient leave-one-out cross-validation and WAIC for Bayesian models. R package version 2.5.1.
[134] VEHTARI, A., GELMAN, A. and GABRY, J. (2017). Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Stat. Comput. 27 1413–1432. MR3647105 https://doi.org/10.1007/s11222-016-9696-4
[135] WICKHAM, H. (2016). ggplot2: Elegant Graphics for Data Analysis. Springer, Berlin.
[136] WIGREN, A., RISULEO, R. S., MURRAY, L. M. and LINDSTEN, F. (2019). Parameter elimination in particle Gibbs sampling. Advances in Neural Information Processing Systems 32 (NeurIPS 2019).
[137] WOOD, S. N. (2017). Generalized Additive Models: An Introduction with R, Second ed. CRC Press, Boca Raton, FL.