PyMC: A Modern and Comprehensive Probabilistic Programming Framework in Python
ABSTRACT
PyMC is a probabilistic programming library for Python that provides tools for constructing and fitting Bayesian models. It offers an intuitive, readable syntax that is close to the natural syntax statisticians use to describe models. PyMC leverages the symbolic computation library PyTensor, allowing it to be compiled into a variety of computational backends, such as C, JAX, and Numba, which in turn offer access to different computational architectures including CPU, GPU, and TPU. Being a general modeling framework, PyMC supports a variety of models including generalized hierarchical linear regression and classification, time series, ordinary differential equations (ODEs), and non-parametric models such as Gaussian processes (GPs). We demonstrate PyMC's versatility and ease of use with examples spanning a range of common statistical models. Additionally, we discuss the positive role of PyMC in the development of the open-source ecosystem for probabilistic programming.

Subjects Data Science, Scientific Computing and Simulation, Programming Languages
Keywords Bayesian statistics, Probabilistic programming, Python, Markov chain Monte Carlo, Statistical modeling
INTRODUCTION
Probabilistic programming languages (PPLs) are general-purpose programming languages
or libraries with built-in tools for Bayesian model specification and inference, allowing
practitioners to focus on the creation of models rather than on computational details
(Rainforth, 2017). PPLs have dramatically changed Bayesian modeling, enabling
practitioners to perform analyses of increasing complexity while simultaneously lowering
barriers to entry. Additionally, PPLs facilitate an iterative modeling process that is now
more relevant than ever (Gelman et al., 2020; Martin, Kumar & Lao, 2021).
Since the early 2000s, the PyMC project has provided scientists with an open-source
PPL in Python (Salvatier, Wiecki & Fonnesbeck, 2016). The first release of PyMC version
2.0 in January 2009 introduced a general implementation of the Metropolis-Hastings
sampler (Metropolis et al., 1953; Hastings, 1970), accelerated by probability distributions
implemented in Fortran. In 2016 a major rewrite of PyMC introduced gradient-based
sampling algorithms, notably Hamiltonian Monte Carlo (HMC) and the No-U-Turn-
Sampler (NUTS) algorithm (Hoffman & Gelman, 2014) that dramatically improved the
speed and robustness of model fitting. This was facilitated by leveraging Theano (The
Theano Development Team et al., 2016), a numerical computation library widely used in
deep learning at the time, whose automatic differentiation capabilities enabled these
gradient-based algorithms.
The announcement of the discontinuation of Theano in 2017 prompted the PyMC
project to adopt a new computational backend strategy. After considering other deep
learning libraries, most notably TensorFlow and TensorFlow Probability, the Theano
project was forked into the non-affiliated Aesara project (Bastien et al., 2022a), and later
the PyMC team forked Aesara once again into PyTensor (Bastien et al., 2022b). The focus
of PyTensor is no longer to support deep learning, but instead to build, optimize, and
compile symbolic computational graphs to serve the needs of PyMC.
PyMC's explicit use of a computational graph also distinguishes it from systems like Pyro, which implements inference using effect handlers, and Stan, where the program is compiled into a C++ log density function with relatively little introspection of the program's AST representation beforehand.
PyMC has played a significant role as an incubator for other libraries, nurturing and
facilitating the development of specialized tools and functionalities. For example, ArviZ
(Kumar et al., 2019) has emerged as a dedicated package for exploratory analysis of
Bayesian models, providing a library-agnostic solution for sampling diagnostics, model
criticism, model comparison, and result preparation. Additionally, the handling of
generalized linear models (GLMs) has found a new home in the Bambi library (Capretto
et al., 2022), which has been able to focus on refining and expanding this functionality. For
instance, Bambi also supports models such as splines, Gaussian processes, and
distributional models. PyMC’s collaborative ecosystem has thus enabled the evolution and
diversification of tools, allowing each library to excel in its respective domain while
collectively advancing the field of Bayesian modeling.
API OVERVIEW
Random variables (RVs)—variables whose values are described by a probability
distribution—serve as the building blocks of a PPL. In PyMC, probability distributions are
PyTensor RandomVariable instances that are created from subclasses of the
Distribution class. These subclasses define the specific type of distribution, such as a
normal (Normal) distribution or a binomial distribution (Binomial), and specify the
parameters of the distribution as arguments, such as the mean and standard deviation for a
normal distribution. The suite of probability distributions is organized into groups: continuous, discrete, multivariate, mixture, time series, censored1, simulator, and symbolic distributions.
1 This allows transforming regular PyMC distributions into censored ones.
In general, users work with PyMC random variables as part of a larger model that
describes how variables are related to each other and to the data. To this end, PyMC’s
Model class is a container for all of the variables and other attributes that define a
probabilistic model. The Model is a context manager that gathers the collection of
interrelated observed and unobserved random variables as they are specified by the user. In
this sense, an RV can be understood as the symbolic expression describing how a variable
in the model is the result of mathematical operations on probability distributions. For
example, an RV might be the distribution Normal(0, 1), or an expression such as the sum
Normal(0, 1) + Uniform(2, 3).
Each random variable in a model can be either observed or unobserved, depending on
whether it is associated with available data or is latent. An unobserved RV can be specified
in PyMC by a name (string) and zero or more arguments, corresponding to the parameters
of the statistical distribution. For example, a normal prior can be defined in a Model
context like this:
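import pymc as pm

with pm.Model():
    x = pm.Normal("x", mu=0, sigma=1)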
Here, the name “x” is the same as the assigned Python variable, but it need not be.
Observed RVs are defined similarly, but with an additional observed keyword
argument to which the data is passed:
with pm.Model():
    obs = pm.Normal("x", mu=0, sigma=1, observed=[-1, 0, 0, 1])
In this example, the list passed to the observed argument represents four samples of a
one-dimensional random variable. The observed argument supports lists, NumPy arrays,
Pandas Series, DataFrames, and PyTensor arrays. When used in combination with
variational inference, observed supports minibatching using pm.Minibatch.
To describe a parent-child relationship within a model, random variables can be used as
parameters of other random variables. For example, if we wish to assign a normal prior and
an exponential prior to the mean and standard deviation, respectively, of a normal
distribution:
with pm.Model():
    mu = pm.Normal("mu", mu=0, sigma=1)
    sigma = pm.Exponential("sigma", 1)
    x = pm.Normal("x", mu=mu, sigma=sigma)
Random variables can be transformed and combined with other random variables
algebraically using built-in operators and a suite of functions to yield deterministic nodes
in the model:
with pm.Model():
    x = pm.Normal("x", mu=0, sigma=1)
    y = pm.Gamma("y", alpha=1, beta=1)
    plus_2 = x + 2
    summed = x + y
    squared = x**2
    sined = pm.math.sin(x)
Though these transformations work seamlessly, the resulting values are by default treated as intermediate and are not stored during model fitting. To store them, the pm.Deterministic wrapper should be applied, which gives the deterministic node a name and a value.
with pm.Model():
    x = pm.Normal("x", mu=0, sigma=1)
    plus_2 = pm.Deterministic("x plus 2", x + 2)
To create a vector of random variables, one might be tempted to build a Python list of scalar RVs, but this is not recommended:

with pm.Model():
    # not recommended:
    x = [pm.Normal(f"x_{i}", mu=0, sigma=1) for i in range(3)]
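A better approach is to register named dimensions with the model through the coords argument and pass dims when creating the variable. In the following sketch the city labels are illustrative (apart from "Mumbai", which is referenced below):

coords = {"city": ["Mumbai", "Bangkok", "Kigali"]}
with pm.Model(coords=coords) as model:
    x = pm.Normal("x", mu=0, sigma=1, dims="city")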
Here, we parameterize x using dims instead of shape. The Model associates the first
dimension of x with the dimension named “city” in the coords argument, thereby
creating the RV as a length-3 array. This information will also be associated with the
output from model fitting, which simplifies working with the results.
Operations, including indexing, can be applied to vector-valued RVs in the same way
one would operate on a NumPy array. Selection based on the dims and coords is not yet
implemented.
with model:
    y = x[0] * x[1]        # indexing is supported
    x.dot(x.T)             # linear algebra is supported
    x.sel(city="Mumbai")   # not yet supported
A PyMC Model has references to all random variables (RVs) defined within it and
provides access to the model log-probability and its gradients. Consider the following
model:
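The original listing is not reproduced here; a representative sketch, with a free variable a, a positive (and therefore log-transformed) variable b, and hypothetical observed data, could look as follows:

data = [0.3, -0.1, 1.2, 0.8]  # hypothetical observations
with pm.Model() as model:
    a = pm.Normal("a", mu=0, sigma=1)
    b = pm.HalfNormal("b", sigma=1)
    y = pm.Normal("y", mu=a, sigma=b, observed=data)

The numerical results quoted below correspond to the model and data used in the original example.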
The Model retains collections of different subsets of variables as properties. For example, model.basic_RVs contains all RVs except the deterministic ones, model.free_RVs the unobserved RVs, and model.observed_RVs the observed RVs:
model.basic_RVs
model.free_RVs
model.observed_RVs
The joint log probability of the unobserved and observed variables is the sum of the log
probability of each variable, taking dependencies into account:
logp = model.compile_logp()
logp({"a": 0.5, "b_log___": 1.2})
array(-73.39055683)
The compile_logp function optimizes and compiles the symbolic PyTensor
computation graph into the selected backend (C, Numba or JAX), conditioned on the
observed_RVs. The returned compiled function can be evaluated numerically for
arbitrary values of the free RVs. The optimization and compilation of the computational
graph done by the compile_logp method is expensive, but once this is done, the resulting
compiled function can be called efficiently on different inputs of similar types.
Having fully specified a probabilistic model, we can select an appropriate method for
inference. This may be a simple optimization, where we find the maximum a posteriori
estimates for all model parameters:
with model:
    fit = pm.find_MAP()

fit['a'], fit['b']
(array(0.12646729), array(3.38501407))
Alternatively, we may choose a fully Bayesian approach and use Markov chain Monte
Carlo to fit the model. This returns a trace of sampled values as output, representing the
marginal posterior distribution of all model parameters. The sample function provides the
interface to all of PyMC’s MCMC methods. For example, if we want 1,000 posterior
samples after tuning the sampler for 1,000 iterations, we can call:
with model:
    idata = pm.sample(1000, tune=1000)
Distributions
Because Bayesian statistics involves constructing probabilistic models, PyMC provides
robust implementations of all commonly-used probability distributions as classes in its
distributions module, as well as a large set of more specialized distributions. The
availability of these classes facilitates the construction of probabilistic graphs for forward
sampling and inference. Stochastic random variables in PyMC models share a common set of methods and attributes.
Model
As introduced in Section API overview, the Model module provides the Model class, which
encapsulates all components of the Bayesian model the user specifies. The user API for a
Model instance is through a context manager, which serves to automatically add variables
to the model graph, according to the associations between variables specified by the user;
specifically, a random variable passed as an argument to another random variable
establishes a parent-child relationship between them. All interaction with a Model instance
by the user (e.g., model modification, model fitting) is done through the context manager.
While the Model class has a large suite of methods, almost none of them are user-facing
but are instead called by other methods and functions in PyMC in order to set up,
introspect, or manipulate the model.
As a convenience to users, the Model module includes functions for generating
visualizations of the model DAG (see for example Fig. 3), using either GraphViz (Gansner
& North, 2000) or NetworkX (Hagberg, Schult & Swart, 2008).
Logprob
This module contains the logic for operating with RandomVariable objects, including: converting RandomVariable graphs into joint log-probability graphs, transforming constrained RandomVariables so that their support is on unconstrained spaces, RandomVariable-aware pretty printing, and LaTeX output.
Step methods
PyMC constructs posterior samplers using a Metropolis-within-Gibbs scheme, where
blocks of parameters are assigned the same MCMC step algorithm. Distributions define
their own default step algorithm, but this may be manually overridden by the user. For
instance, in the coal mining disaster example (see Section Coal mining disasters) the
Metropolis step method was assigned to the discrete sp variable and NUTS was assigned to
the rest of the variables, which are continuous.
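For illustration, assuming the coal mining model is in scope with the discrete variable sp and the continuous rates t_0 and t_1 (see Section Coal mining disasters), a manual assignment could be sketched as:

with disaster_model:
    step_discrete = pm.Metropolis([sp])      # Metropolis for the discrete switch-point
    step_continuous = pm.NUTS([t_0, t_1])    # NUTS for the continuous rates
    idata = pm.sample(step=[step_discrete, step_continuous])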
The default sampler for all continuous distributions is based on the No-U-Turn sampler (Hoffman & Gelman, 2014)4. In practice, many models use only continuous parameters, and so will only use this algorithm, which therefore sees particular attention for performance. Some changes from the algorithm as originally described include multinomial sampling of the tree (Betancourt, 2016) and a corrected U-turn check. Other available inference methods include random-walk Metropolis (including specialized versions for binary and categorical parameters), a slice sampler (Neal, 2003), and Differential Evolution Metropolis sampling (Ter Braak, 2006).
4 This is the first description of the sampler, but the implementation in PyMC includes several improvements that have been developed since then, for example on mass matrix tuning.
Sampling
In addition to the MCMC step methods described in the previous section, which are used to sample from the posterior distribution, PyMC supports sampling from both the prior predictive and posterior predictive distributions, with flexible shapes allowing for in-sample estimates as well as sample interpolation or extrapolation. This can be done using the pm.sample_prior_predictive or pm.sample_posterior_predictive functions. It is important to remark that PyMC allows using the same model definition to compute posterior distributions (backward sampling) or predictive distributions (forward sampling), without requiring any intervention from the user.
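For a model defined as in the previous sections, a typical workflow could be sketched as:

with model:
    prior_pred = pm.sample_prior_predictive()           # forward sampling from the prior
    idata = pm.sample()                                  # backward (posterior) sampling
    post_pred = pm.sample_posterior_predictive(idata)    # forward sampling from the posterior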
Variational
The variational inference implementation is inspired by Ranganath et al. (2016) and defined as follows:

$$\sup_{f \in \mathcal{F}} t\left(\mathbb{E}_q\left[(O^{p,q} f)(z)\right]\right) \qquad (1)$$

The simplified objective now does not need the supremum over $f \in \mathcal{F}$ anymore, and setting the distance function to the identity $t_I(x) = x$ we obtain

$$\sup_{f \in \mathcal{F}} t_I\left(\mathbb{E}_q\left[(O_{KL}^{p,q} f)(z)\right]\right) = \mathbb{E}_q\left[\log q(z) - \log p(z \mid x)\right] \qquad (4)$$
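In practice, users typically access variational inference through pm.fit. A minimal sketch, assuming a model defined as in the previous sections, is:

with model:
    approx = pm.fit(n=10_000, method="advi")   # mean-field ADVI
    idata_vi = approx.sample(1_000)            # draws from the fitted approximation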
ODE
PyMC includes an Ordinary Differential Equations (ODE) module. The API requires a function f(y, t, θ), which takes as arguments an array of states y, a time argument t, and an array of parameters θ. The solution to the ODE is then evaluated at a user-supplied array of times.
It is implemented as a PyTensor Op, which allows Hamiltonian Monte Carlo to
differentiate through the ODE solution. The underlying implementation uses Scipy’s ODE
solver which in turn uses the lsoda routine in the ODEPACK library written in
FORTRAN.
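A minimal sketch of this API, using a hypothetical exponential-decay ODE and synthetic data (exact argument names may differ slightly between versions), is:

import numpy as np
import pymc as pm
from pymc.ode import DifferentialEquation

def decay(y, t, theta):
    # dy/dt = -theta[0] * y
    return [-theta[0] * y[0]]

times = np.arange(0.5, 10, 0.5)
observed = 3.0 * np.exp(-0.4 * times)  # noiseless synthetic observations

ode_model = DifferentialEquation(func=decay, times=times, n_states=1, n_theta=1, t0=0)

with pm.Model():
    rate = pm.HalfNormal("rate", 1.0)
    y0 = pm.HalfNormal("y0", 3.0)
    solution = ode_model(y0=[y0], theta=[rate])   # shape (len(times), n_states)
    pm.Normal("obs", mu=solution[:, 0], sigma=0.1, observed=observed)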
Gaussian process
A Gaussian process (GP) can be used as a prior probability distribution over the space of
continuous functions for modeling non-linear processes non-parametrically. PyMC
includes a specialized API to define and fit GP models, specifically using latent and
marginal approximations, and generate predictions for arbitrary inputs. GP estimation is fully compatible with the MCMC, ADVI, and MAP methods, allowing users to flexibly fit models of varying complexity while managing the associated computation time.
Additionally, PyMC includes a number of kernels, including exponentiated quadratic, Matérn, and periodic kernels, while allowing users the flexibility to define their own kernels if needed.
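A minimal sketch of the marginal GP API, with synthetic one-dimensional data (argument names such as sigma may vary slightly between versions), is:

import numpy as np
import pymc as pm

X = np.linspace(0, 10, 50)[:, None]    # inputs
y = np.sin(X).ravel()                  # noiseless synthetic targets

with pm.Model() as gp_model:
    ell = pm.Gamma("ell", alpha=2, beta=1)          # lengthscale
    eta = pm.HalfNormal("eta", sigma=1)             # amplitude
    cov = eta**2 * pm.gp.cov.ExpQuad(1, ls=ell)     # exponentiated quadratic kernel
    gp = pm.gp.Marginal(cov_func=cov)
    noise = pm.HalfNormal("noise", sigma=1)
    gp.marginal_likelihood("y_obs", X=X, y=y, sigma=noise)
    idata = pm.sample()

X_new = np.linspace(0, 15, 100)[:, None]
with gp_model:
    f_pred = gp.conditional("f_pred", X_new)        # predictions at new inputs
    pred = pm.sample_posterior_predictive(idata, var_names=["f_pred"])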
CASE STUDIES
Coal mining disasters
Between 1851 and 1962, a record number of accidents occurred in coal mines located in
the United Kingdom (Jarrett, 1979). It is suspected that the application of certain safety
regulations had the effect of reducing the number of accidents. Therefore, we are interested
in estimating three quantities: the year in which the rate changed (the switch-point), the
rate of accidents prior to regulation, and the rate after the regulation change.
The data is shown in Code Block 1: the variable disasters contains the number of accidents per year, and the variable years contains the range of years covered by the data.
How can we build a model for this problem? One approach is to think generatively, that
is, create a story of how data may have been generated. Generative stories can be very
powerful, informal devices that aid model construction and understanding (Blitzstein &
Hwang, 2019).
We can think of our problem as having a moving slider: years to the left of this slider are assigned one average number of accidents, while years to the right are assigned a different average number of accidents. A key property of Bayesian modeling is that we consider
multiple plausible scenarios that could have generated the data (see Fig. 4).
Next, we need to specify prior probability distributions which quantify the information
we have about plausible parameter values before we have observed any data. For the
“slider” we will use a discrete uniform distribution that assigns equal probability to all years
in a given interval, although other choices are possible. This distribution has two
parameters: the lowest possible value, and the highest one. In our problem, those
correspond to the year 1851 and 1962, respectively. A range wider than this does not make
sense given that we only have data for this particular range, and a narrower range will
imply that we have some external information indicating that not all years between 1851
and 1962 are equally plausible candidates.
Our observed data consists of counts, i.e., the number of disasters. Commonly, the
Poisson distribution is used for count data. The Poisson distribution is defined using a
single parameter that represents the average rate of events (disasters) in our example. As
we do not know the rate and want to estimate it, we have to set a prior distribution for it.
We do not have much information other than the rate must be positive and “most likely”
have a small value, given that coal mining disasters are not very frequent events. One
option could be to pick the distribution Expon(1), which says that the expected rate is 1 but
is wide enough to allow for lower and higher values.
Using standard statistical notation, we can write the model as:
$$
\begin{aligned}
sp &\sim \mathcal{U}(A_0, A_1) \\
t_0 &\sim \operatorname{Expon}(1) \\
t_1 &\sim \operatorname{Expon}(1) \\
\text{rate}_t &= \begin{cases} t_0 & \text{if } t < sp \\ t_1 & \text{if } t \ge sp \end{cases} \\
\text{acc}_t &\sim \operatorname{Poisson}(\text{rate}_t)
\end{aligned} \qquad (6)
$$
And using PyMC we can write this model as described in Code Block 2.
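Code Block 2 is not reproduced here; a sketch consistent with Eq. (6), assuming the disasters and years arrays from Code Block 1 are available as NumPy arrays, could look like:

with pm.Model() as disaster_model:
    sp = pm.DiscreteUniform("sp", lower=years.min(), upper=years.max())
    t_0 = pm.Exponential("t_0", 1)
    t_1 = pm.Exponential("t_1", 1)
    rate = pm.math.switch(years < sp, t_0, t_1)
    acc = pm.Poisson("acc", rate, observed=disasters)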
Once defined in PyMC we can get a visual representation of the model as shown in
Fig. 3. We can use this visual representation to check that we do not have a semantic error
in the model and to communicate the model to others.
One remarkable feature of PyMC is that its syntax is very close to the statistical notation,
as we can see by comparing Eq. (6) with Code Block 2. The cases in Eq. (6) are coded using
the pm.math.switch(condition, true, false) function, which uses the first
argument to select either of the next two arguments.
Missing values are handled concisely by passing a numpy.ndarray, pandas.DataFrame, or pandas.Series with NaN values (see Code Block 1) to the observed
argument when creating an observed stochastic random variable. This means that we will
automatically get a posterior predictive distribution over the missing values. The imputed
and observed values are combined into a Deterministic node acc that represents the
original vector specified as an observed random variable.
ax[0, 0].plot(
    years[np.isfinite(disasters)],
    idata.prior_predictive["acc_observed"].sel(draw=50, chain=0),
    ".",
)
az.plot_dist(
    idata.prior_predictive["acc_observed"],
    ax=ax[1, 1],
    rotated=True,
)
Code 3. Part of the code to generate Fig. 4. Lines related to the style of the plot have been
omitted.
On the top panel, we can see one sample from the prior predictive distribution; for this sample, the mean rate before the year 1880 is around 0.3 and after that around 1.3 (orange line). From the bottom panel, we can see that the average prior predictive distribution is uniform across years, which is expected given that we defined the same prior for t_0 and t_1. Additionally, just by eyeballing, we can see that our model favors relatively low numbers of accidents per year, with around 85 percent of the mass assigned to values equal to or lower than 3. A more accurate estimate can be obtained by counting the samples satisfying this property: np.mean(idata.prior_predictive["acc_observed"] <= 3).
Once confident with the model specification, we can estimate the parameters using one of the multiple inferential methods available in PyMC. If we decide to use Markov chain Monte Carlo (MCMC) methods, the continuous variables are sampled using an adaptive dynamic Hamiltonian Monte Carlo sampler, NUTS (Hoffman & Gelman, 2014). Solving models with discrete variables, or a mix of discrete and continuous variables like the one in Code Block 2, is also possible using compound samplers that can be manually specified by the user or automatically assigned by PyMC. For example, the variable sp in Code Block 2, being discrete, will be assigned to the Metropolis sampler, and the rate variables, being continuous, to NUTS. Other samplers, such as Sequential Monte Carlo (SMC), are also suitable for a mix of discrete and continuous variables.
Posterior sampling
A common way to visually inspect the posterior is by plotting the marginal distributions
for each parameter, as in Fig. 5. Once we have computed the posterior, we can use it to
answer questions of interest: this can be done by computing numerical quantities,
generating plots, and more often than not by a combination of both.
From Fig. 6 we can see that the switch-point (orange line) is most likely around 1890,
but we still have some uncertainty. The orange band represents the 94% credible interval
and goes from 1885 to 1894. That is, according to the data and model, we think there is a
94% chance that the rate of accidents changed between 1885 and 1894. The black line
represents the posterior mean for the rate of accidents, and the gray band is the 94%
credible interval for that mean. Notice that in our model we specify prior distributions for two rate values, but we do not get just two point estimates for those rates; we get two distributions, one with mean 3 and one with mean 1. Moreover, from approximately 1885 to 1894 we get a mix of those two distributions. The uncertainty about when the transition occurred is reflected in a rather smooth transition of the rates around these years.
Dirichlet-multinomial distribution
This example demonstrates the use of a Dirichlet compound multinomial distribution to
model categorical count data. Models like this one are important in a variety of areas,
including natural language processing (Madsen, Kauchak & Elkan, 2005), ecology
(Harrison et al., 2019), and genomics (Holmes, Harris & Quince, 2012; Nowicka &
Robinson, 2016).
The Dirichlet-multinomial can be understood as drawing from a multinomial distribution where each sample has a slightly different probability vector, itself drawn from a common Dirichlet distribution. This is in contrast with the multinomial distribution, which assumes that all observations arise from a single fixed probability vector. This enables the Dirichlet-multinomial to accommodate over-dispersed count data. Here we will discuss a community ecology example: let's assume we have observed counts of k = 5 different tree species in n = 10 different forests.

Our data is arranged in a two-dimensional matrix of integers where each row, indexed by i ∈ (0, …, n − 1), is an observation (a different forest), and each column, indexed by j ∈ (0, …, k − 1), is a category (tree species).
We could write this model as shown in Code Block 4, where we have a multinomial likelihood, counts, with a Dirichlet prior, p. Furthermore, the prior is parameterized in terms of two hyper-priors: a Dirichlet distribution, frac, for the expected fraction of each category, and a log-normal, conc, controlling the concentration of the Dirichlet prior or, in terms of the data, the level of over-dispersion.
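Code Blocks 4 and 5 are not reproduced here; a sketch in the spirit of Code Block 5, with hypothetical species labels (only "pine" is mentioned in the text) and synthetic counts, could look like:

import numpy as np
import pymc as pm

rng = np.random.default_rng(0)
observed_counts = rng.multinomial(100, [0.4, 0.3, 0.15, 0.1, 0.05], size=10)
total_counts = observed_counts.sum(axis=1)

coords = {"forest": np.arange(10), "tree": ["pine", "oak", "birch", "cedar", "fir"]}
with pm.Model(coords=coords) as model_dmm:
    frac = pm.Dirichlet("frac", a=np.ones(5), dims="tree")   # expected fraction of each species
    conc = pm.LogNormal("conc", mu=1, sigma=1)               # concentration / over-dispersion
    counts = pm.DirichletMultinomial(
        "counts", n=total_counts, a=frac * conc,
        observed=observed_counts, dims=("forest", "tree"),
    )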
Working with labeled arrays reduces the cognitive load of working with multidimensional arrays, along with the chance of making errors and the accompanying frustration. In Code Block 5, by defining the frac Dirichlet parameters using dims="tree", we are guaranteed that the dimensions of the prior match the number of tree species in our dataset. The advantage of using labels also extends to the post-inference processing stage.
For example, idata_dmm.posterior.sel(tree="pine") will return the subset of the
posterior related to pine and idata_dmm.posterior_predictive.counts.sel
(tree="pine") will do the same for the predictive counts of pine.
Additionally, automatic labeling becomes possible. For instance, after sampling from
the model in Code Block 5, calling az.plot_posterior(idata_dmm) generates Fig. 8.
Notice how the frac parameter is meaningfully labeled with the names of the trees. The
alternative would be integer labels with no intrinsic meaning.
JAX-based sampling
The most recent major version of PyMC is built on top of the PyTensor Python package,
which allows the definition, optimization, and efficient evaluation of mathematical
expressions involving multi-dimensional arrays. PyMC models, through PyTensor, can be
compiled to C, Numba and JAX, and in principle, other computational backends could be
added with relatively little effort. This allows for the efficient and fast evaluation of the log-
probability density (see Section PyTensor for details). Still, the samplers accessible by
calling pm.sample() are coded in Python and NumPy. A good approach to improve the
performance of PyMC’s samplers is to write them in PyTensor. This will reduce the
overhead of calling Python code and—more importantly—enable a series of optimizations
due to PyTensor’s ability to manipulate graphs, including the ability to customize the
sampler based on patterns in the model structure. Details of such optimizations are out of
the scope of this article and will be discussed in a future manuscript.
An alternative to PyMC’s PyTensor-based samplers is samplers written in JAX. Using
these samplers, all the operations needed to compute a posterior can be performed under
JAX, reducing the Python overhead during sampling and leveraging all JAX performance
improvements and features like the ability to sample on GPUs or TPUs. Currently, PyMC
offers NUTS JAX samplers via NumPyro (Phan, Pradhan & Jankowiak, 2019) or BlackJAX
(BlackJax devs, 2022) with the functions pm.sample_numpyro_nuts and pm.
sample_blackjax_nuts, respectively. Significantly, BlackJAX and NumPyro can both be
used because in PyMC the modeling language is decoupled from the inference methods;
BlackJAX and NumPyro only require a log-probability density function written in JAX.
This demonstrates that samplers can be developed independently of PyMC and then be
made available to users of the library.
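For instance, assuming a previously defined model, a JAX-based run could be sketched as:

with model:
    idata_jax = pm.sample_numpyro_nuts(draws=1000, tune=1000, chains=4)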
In the following example, we compare PyMC with its default Python/NumPy NUTS
sampler, PyMC running the BlackJAX NUTS sampler, and PyMC running the NumPyro
sampler. We also include cmdstanpy (Lee et al., 2017), a command line interface to Stan.
The motivation for these comparisons is not to provide an exhaustive benchmark over a
wide range of models and datasets but instead to give an example of the attainable speed-
ups when using PyMC with a JAX-based sampler.
Suppose we are interested in ranking tennis players from 1968 until now5. To do so, we can use the Bradley-Terry model. The central idea of this model is that each player has a latent skill θ. When players i and j play each other, player i wins with probability p(i wins | θ_i, θ_j) = logistic(θ_i − θ_j). For example, if player i has a skill value of 1 and player j has a skill value of −1, then the Bradley-Terry model implies that player i beats player j with probability logistic(2) ≈ 88.1%.
5 This model and benchmarks were initially run by Martin Ingram and can be found at https://martiningram.github.io/mcmc-comparison/.
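Code Block 6 is not reproduced here; a minimal Bradley-Terry sketch, with hypothetical match data encoded as integer arrays of winner and loser indices, could look like:

import numpy as np
import pymc as pm

winner_ids = np.array([0, 2, 1])   # hypothetical match outcomes
loser_ids = np.array([1, 0, 2])
n_players = 3

with pm.Model() as tennis_model:
    sd = pm.HalfNormal("sd", 1.0)                                # scale of the latent skills
    skill = pm.Normal("skill", mu=0.0, sigma=sd, shape=n_players)
    logit_p = skill[winner_ids] - skill[loser_ids]               # log-odds that the recorded winner wins
    pm.Bernoulli("wins", logit_p=logit_p, observed=np.ones(len(winner_ids)))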
We run the tennis_model in Code Block 6 using the NUTS sampler under six conditions: PyMC with its default sampler, PyMC with the BlackJAX sampler on the CPU, PyMC with the NumPyro sampler on the CPU, PyMC with the BlackJAX sampler on the GPU, PyMC with the NumPyro sampler on the GPU, and CmdStanPy.
To see how the runtime changes with the size of the dataset, we choose different start years for the fits: 2020, 2019, 2015, 2010, 2000, 1990, 1980, and finally 1968. This means datasets ranging from 3,620 to 160,420 observations.
For all conditions, we run 1,000 warm-up steps and 1,000 draws per chain, for a total of
four chains.
Figure 9 shows the effective sample size per second for all the samplers previously
mentioned. The values are an average of four separate runs. CmdStanPy performs better
than PyMC on smaller datasets, but is slower on larger ones. PyMC with either the
BlackJAX or NumPyro backends performs the best on the CPU, shown in yellow and
magenta respectively. These JAX-based samplers have similar performance. On the other
hand, when running on the GPU, the samplers are more efficient on larger datasets. For the largest dataset, PyMC with NumPyro and BlackJAX on the vectorized GPU performs the best, while PyMC with its default sampler and CmdStanPy (both on the CPU) show the worst results.
EXTENSIONS
PyMC is a very flexible tool, and the PyMC community is quite active; the combination of the two enables many specialized packages to be built by others, providing PyMC users with a coherent ecosystem of tools. We note the following here:
Besides the main PyMC package, the PyMC developers also maintain a PyMC-
experimental package, a collection of features extending the core functionality of PyMC in
diverse directions. PyMC-experimental is intended to host unusual probability
distributions, advanced model fitting algorithms, innovative yet not fully tested methods,
or any code that may be inappropriate to include in the main PyMC repository but that should nevertheless be made available to users. This package is ideal for researchers and developers wanting to
contribute new research as features to PyMC.
At the time of writing, PyMC-experimental includes a number of such features.
PYTENSOR
PyTensor is a pure-Python library that allows one to define, optimize, and efficiently
evaluate mathematical expressions involving multi-dimensional arrays, including
automatic differentiation. It is used as the computational backend of the PyMC library and
was developed from its predecessors Theano (The Theano Development Team et al., 2016)
and Aesara (Bastien et al., 2022a). At its core, PyTensor implements an extensible
computational graph framework that is accompanied by graph rewriting optimizations
and linkers to various compilable backends. At the time of writing, PyTensor graphs can be
readily linked to backends including C (Kernighan & Ritchie, 1988), JAX (Bradbury et al.,
2018), and Numba (Lam, Pitrou & Seibert, 2015), yielding compiled functions that are
much faster to evaluate than pure Python implementations of the computational graph. It
combines aspects of a computer algebra system (CAS) with aspects of an optimizing
compiler. This combination of CAS with optimizing compilation is particularly useful for
tasks in which complicated mathematical expressions are evaluated repeatedly, and
evaluation speed is critical, as is the case in MCMC applications. PyTensor not only provides a powerful computational backend for PyMC but also decouples PyMC from the
underlying compilation backends, making it easier to use new compilers without
disrupting the existing PyMC code-base.
Code 7. Definition and call of a PyTensor function. Notice that the tensors x, y, and z have been defined previously. When debugging, it may be useful to avoid defining a function and instead perform a direct evaluation of the tensor, such as z.eval({x: 0, y: [1, np.e]}).
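The listing itself is not reproduced here; a minimal sketch of such a function, using hypothetical tensors, is:

import numpy as np
import pytensor
import pytensor.tensor as pt

x = pt.scalar("x")                # symbolic scalar
y = pt.vector("y")                # symbolic vector
z = x + pt.log(y).sum()           # symbolic expression combining them

f = pytensor.function(inputs=[x, y], outputs=z)   # compile the graph
f(0, [1, np.e])                                   # returns array(1.)

# direct evaluation without building a named function:
z.eval({x: 0, y: [1, np.e]})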
This separation of the abstract definitions of mathematical expressions and the actual
computation of those expressions is central to PyTensor and hence PyMC. When defining
a PyMC model, we are just defining a PyTensor computational graph that we will later use
to obtain quantities like prior predictive samples, posterior samples, log-probabilities, etc.
This separation is useful as PyTensor can automatically optimize the mathematical
operations inside a graph. For example, if we define w = pt.exp(pt.log(x + y)),
PyTensor will first simplify the graph to w = x + y and then perform the computation.
Other optimizations include constant propagation, replacing numerically unstable
operations with numerically stable versions, avoiding computing the same quantity more
than once, and efficient sparse matrix multiplication.
Random variables
We now show how to manually generate samples from a PyMC distribution and evaluate
their log-probability. On line 1 of Code Block 8 we define a Normal distribution with mean
0 and standard deviation 1. On line 2 we take 1,000 draws from that distribution, using the
pm.draw(.) function. Finally, on line 3 we use the pm.logp(.) function to compute the
log-probability of the samples generated in the previous step. We use the eval() method
to obtain the actual values, instead of a symbolic representation. Figure 10 shows the
results.
x = pm.Normal.dist(mu=0, sigma=1)
x_draws = pm.draw(x, draws=1_000)
x_logp = pm.logp(rv=x, value=[x_draws]).eval()
Code 8. In line 1 a PyMC distribution is used to specify a symbolic random variable corresponding to a Normal distribution. The pm.draw(.) call in line 2 invokes PyTensor
graph compilation while accounting for automatic updating of random number
generators, and returns an array with 1,000 draws of variable x. Using pm.logp(.), the
log-probability densities of x for each element in x_draws are derived symbolically, and
then evaluated by the call to the .eval() method.
While the example from Code Block 8 may seem trivial, it is important to note that the
variable x which is passed to pm.draw(.) or pm.logp(.) can be the result of a symbolic
computation involving multiple random variables and other tensors. A simple example is
given in Code Block 9 where the random variable b depends on another random variable a,
and the variable x is a tensor variable that merely depends on other variables, some of
which represent random variables. To a much larger extent, this is how computational
graphs of PyMC models are handled behind the scenes, and how users can access
properties such as the log-probability densities of probabilistic graphs built with PyTensor.
Code 9. Random variables such as the ones defined in lines 1 and 2 behave like tensor
variables and can be used as such in standard operations such as addition (line 3). The
sampling and derivation of log-probabilities (lines 4 and 5) can operate on any graph that
involves random variables.
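The original listing is not reproduced here; a sketch matching the description above, with hypothetical distributions and values, could be:

import pymc as pm

a = pm.Normal.dist(mu=0.0, sigma=2.0)    # line 1: a random variable
b = pm.Normal.dist(mu=a, sigma=1.0)      # line 2: a random variable that depends on a
x = a + b                                # line 3: a tensor built from random variables
x_draws = pm.draw(x, draws=1_000)        # line 4: forward sampling of the composite graph
b_logp = pm.logp(rv=b, value=0.5)        # line 5: symbolic log-density of b (still depends on a)

Because b depends on a, the resulting log-density is itself a graph; it can be evaluated, for instance, with pm.draw(b_logp), which draws a value of a and computes the density.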
DISCUSSION
PyMC has been the leading probabilistic programming language in Python for years. Its
intuitive syntax that balances simplicity with flexibility has been key to this success. The
contributor community is varied, composed of users, technical writers, developers, and
even visual designers for artifacts such as logos. This diversity of contributors has helped improve the library and broaden its adoption in many ways. As
PyMC has grown, its functionality has spun off into more specialized and feature-rich
packages for the Bayesian community. For instance, sampling diagnostics, model
comparison, and visualizations have been forked into ArviZ, which supports not only PyMC but many other PPLs as well. Similarly, the definition of complex generalized linear
hierarchical models using a formula notation similar to those found in R has now been
delegated to Bambi.
In this way, the PyMC contributor environment has been incredibly beneficial for the
computational Bayesian community. This is evidenced by the numerous sister packages
that PyMC has seeded, each with a more focused development process. This makes it easier to maintain the software and add new features, and it helps users find specialized packages that fit their needs, while continuing to grow an ever larger and more interconnected community.
In this manuscript, we have highlighted some of the most relevant features of the current state of PyMC development and mentioned some changes to come in the near future.
We trust that the technical innovations, strong community, and interoperability with the
Scientific Python ecosystem herald a bright future for PyMC.
CONTRIBUTING TO PYMC
As a community-driven project, we are always excited to collaborate with new
contributors. For those interested in working on PyMC, we invite them to read our
contributing guidelines (https://www.pymc.io/projects/docs/en/latest/contributing/index.
html). As part of our quality control process, contributions are submitted as a pull request
(PR) that is subject to review and revision prior to being merged into the appropriate
project repository; most PRs need approval from at least one core developer. Major innovations and changes to the API are subject to collective agreement from the core developers.
APPENDIX
PyMC is available from the Python Package Index at https://pypi.org/project/pymc/.
Alternatively, it can be installed using conda, which is the recommended way of installing
PyMC. The project is hosted and developed at https://github.com/pymc-devs/pymc. The
package documentation, including installation instructions and many examples of how to use PyMC to conduct different statistical analyses, can be found at https://docs.pymc.io.
ACKNOWLEDGEMENTS
We thank Google Summer of Code (GSoC), a global program that offers student
developers stipends to write code for open-source projects. We also want to thank all the
students that participated in the GSoC and contributed to PyMC. We thank Martin
Ingram who is the original author of the model in Section JAX-based sampling, the
original model and benchmark can be found at https://martiningram.github.io/mcmc-
comparison/ and Kevin Murphy for his helpful comments on an earlier version of this
manuscript. We want to thank Adrian Seyboldt for his significant contributions to PyMC.
Finally, PyMC would not be the same without the work of hundreds of volunteers
reporting issues, fixing bugs, and contributing features to the project, to whom we are also
indebted.
Funding
NumFOCUS, a nonprofit 501(c)(3) public charity, provides operational and financial
support to PyMC. PyMC Labs, a Bayesian consulting company, also provides funding.
Grant Disclosures
The following grant information was disclosed by the authors:
NumFOCUS.
PyMC Labs.
National Agency of Scientific and Technological Promotion ANPCyT: PICT-02212.
National Scientific and Technical Research Council CONICET: PIP-0087.
Competing Interests
The authors declare that they have no competing interests.
Colin Carroll, Ravin Kumar and Junpeng Lao are employed by Google Inc.; Christopher J. Fonnesbeck is employed by Baseball Operations Research and Development; Maxim Kochurov, Ricardo Vieira and Thomas Wiecki are employed by PyMC Labs; and Michael Osthege is employed by Forschungszentrum Jülich GmbH.
Author Contributions
Oriol Abril-Pla conceived and designed the experiments, performed the experiments,
analyzed the data, performed the computation work, prepared figures and/or tables,
authored or reviewed drafts of the article, and approved the final draft.
Virgile Andreani conceived and designed the experiments, performed the experiments,
analyzed the data, performed the computation work, prepared figures and/or tables,
authored or reviewed drafts of the article, and approved the final draft.
Colin Carroll conceived and designed the experiments, performed the experiments,
analyzed the data, performed the computation work, prepared figures and/or tables,
authored or reviewed drafts of the article, and approved the final draft.
Larry Dong conceived and designed the experiments, performed the experiments,
analyzed the data, performed the computation work, prepared figures and/or tables,
authored or reviewed drafts of the article, and approved the final draft.
Christopher J Fonnesbeck conceived and designed the experiments, performed the
experiments, analyzed the data, performed the computation work, prepared figures and/
or tables, authored or reviewed drafts of the article, and approved the final draft.
Maxim Kochurov conceived and designed the experiments, performed the experiments,
analyzed the data, performed the computation work, prepared figures and/or tables,
authored or reviewed drafts of the article, and approved the final draft.
Ravin Kumar conceived and designed the experiments, performed the experiments,
analyzed the data, performed the computation work, prepared figures and/or tables,
authored or reviewed drafts of the article, and approved the final draft.
Data Availability
The following information was supplied regarding data availability:
Code files are available at GitHub and Zenodo:
https://github.com/pymc-devs/paper_v5.
Osvaldo A Martin, Ravin Kumar, Oriol Abril-Pla, & Thomas Wiecki. (2023). pymc-
devs/paper_v5: submission 1 (submission_1). Zenodo. https://doi.org/10.5281/zenodo.
8121048.
REFERENCES
Bastien F, Lamblin P, Abergeron, Willard BT, Goodfellow I, Pascanu R, Carriepl, Breuleux O,
Notoraptor, Warde-Farley D, Xue R, Bergstra J, Harlouci, Affan M, Sundararaman R, Askari
R, Maqianlie, Panneerselvam S, Belopolsky A, Turian J, Ballasn, van Tulder G, Lefrancois S,
Vieira R, Almahairi A, Hengjean, Bouchard N, Khaotik, Gulcehre C, Lowin J. 2022a. aesara-
devs/aesara: rel-2.8.9. Available at https://aesara.readthedocs.io/en/latest/.
Bastien F, Lamblin P, Abergeron, Willard BT, Goodfellow I, Pascanu R, Carriepl, Breuleux O,
Notoraptor, Warde-Farley D, Xue R, Bergstra J, Harlouci, Affan M, Sundararaman R,
Askari R, Maqianlie, Panneerselvam S, Belopolsky A, Turian J, Ballasn, van Tulder G, Vieira
R, Lefrancois S, Almahairi A, Hengjean, Bouchard N, Khaotik, Gulcehre C, Lowin J. 2022b.
pymc-devs/pytensor: rel-2.8.12. Available at https://pytensor.readthedocs.io.
Betancourt M. 2016. Identifying the optimal integration time in Hamiltonian Monte Carlo. ArXiv
preprint. DOI 10.48550/arXiv.1601.00225.