Probabilistic Programming in Julia
Kai Xu
Department of Engineering
University of Cambridge
Contents

1 Introduction
2 Background
  2.1 Bayesian Inference
  2.2 General Inference Algorithms
    2.2.1 Sequential Monte Carlo and Particle Gibbs
    2.2.2 Hamiltonian Monte Carlo
    2.2.3 Evaluating Inference Results
  2.3 Probabilistic Programming
    2.3.1 Approaches to Probabilistic Programs
    2.3.2 Exemplar Probabilistic Programs
3 Related Work
  3.1 An Existing HMC Implementation - Stan
  3.2 The Turing Infrastructure
    3.2.1 The Framework
  3.3 MCMC Libraries in Julia
4 Design and Implementation
5 Evaluation
  5.1 Validation of Correctness
  5.2 Performance
    5.2.1 Comparison within Turing
    5.2.2 Comparison against Stan
  5.3 Robustness
  5.4 More Experiments
    5.4.1 Bayesian Inference versus Gradient Descent
    5.4.2 Sensitivity of Different Variables
6 Conclusion
Bibliography
Chapter 1
Introduction
This dissertation first reviews the background knowledge of this project in terms of three major topics, Bayesian inference, general inference algorithms and probabilistic programming, along with some motivations behind them in Chapter 2. Then in Chapter 3, related work including some other existing PPLs (especially Stan) as well as the existing infrastructure of Turing is introduced. In Chapter 4, the design and implementation of the HMC sampler is discussed in detail. After that, experimental results for the HMC sampler, covering validation of correctness and evaluations of performance and robustness, are given in Chapter 5. Finally, Chapter 6 concludes the dissertation with a summary of this project, discusses the current limitations of Turing, and proposes some future work to overcome these limitations.
Chapter 2
Background
This chapter gives a brief review of the three core components involved in this project: the Bayesian inference framework, general inference algorithms and probabilistic programming.
2.1 Bayesian Inference

Bayesian inference uses Bayes' rule to learn model parameters from given data. In particular, for a given model m with unknown parameters θ, Bayesian inference estimates the distribution of θ from observed data D using Bayes' rule (Equation 2.1):

p(θ|D, m) = p(D|θ, m) p(θ|m) / p(D|m) = p(D|θ, m) p(θ|m) / ∫ p(D|θ, m) p(θ|m) dθ,   (2.1)
where p(D|θ, m) is called the likelihood, p(θ|m) the prior and p(θ|D, m) the posterior.
Bayes' rule combines the prior knowledge p(θ|m) with the likelihood p(D|θ, m) obtained from the data, and updates the posterior knowledge about the parameters, p(θ|D, m). This posterior is effectively an updated prior and can be used as the prior knowledge for further data.
With model parameters learnt, models can be used to test new data D_new by calculating

P(D_new|D, m) = ∫ P(D_new|θ, D, m) P(θ|D, m) dθ.   (2.2)
A large predictive probability P(D_new|D, m) indicates that the model predicts the data well, or that the new data fits the model well, and vice versa. This can be used either 1) to test whether a model is satisfactory, by i) splitting a dataset into a training set and a testing set, ii) training the model on the training set and iii) testing the model on the testing set, or 2) to test whether some incoming data comes from the same dataset or not.
In addition, simpler models are usually preferred in the sense that they are more probable, which is called Bayesian Ockham's razor. Simpler models can be chosen by comparing models using p(D|m) in Equation 2.1 as a metric. The term p(D|m) here is called the marginal likelihood or model evidence [1, 2].
A simpler model usually concentrates its model evidence within a small range of data, while a more complex model is more expressive and so spreads its marginal probability more thinly over the data space [1]. Figure 2.1 gives an illustration of how the model evidence is distributed differently for a simple and a complex model. Here the x-axis represents the space of data, the range C is the range corresponding to the data D, model m1 is the simpler model and model m2 is the more complex one. It can be seen that the marginal probability of m1 is more concentrated while that of m2 spreads wider. Although model m2 can explain a wider range of data than m1, for the given data D, m1 is more probable than m2 because the corresponding integral of the model evidence over C is larger. This shows a situation where a more expressive model can be less probable [1].
Figure 2.1: The model evidence P(D|m1) and P(D|m2) plotted over the data space, with C marking the range corresponding to the data D.
2.2 General Inference Algorithms
Inference means estimating model parameters from data. There are gener-
ally two categories of inference methods. The first category contains exact
methods including complete enumeration and exact marginalisation, and the
second one consists of approximate methods including deterministic approx-
imations and Monte Carlo methods.
Methods in the first category usually fail when distributions are intractable, i.e. when the required computations cannot be carried out analytically. Monte Carlo methods, however, can be used to perform inference on any distribution, and are thus also called general inference algorithms. As probabilistic programming requires automatic inference for any probabilistic model defined by the user, Monte Carlo methods are necessary in this project.
In the Turing project, there are mainly three inference algorithms involved (at the time of writing; more samplers may be supported in Turing in the future). They are Sequential Monte Carlo (SMC), Particle Gibbs (PG) and Hamiltonian Monte Carlo (HMC), the third of which is the target algorithm implemented in this M.Phil project. This section describes the theory behind these three inference algorithms.
2.2.1 Sequential Monte Carlo and Particle Gibbs

Specifically, the unknown quantity in the Markov process {X_k}_{k≥1} is characterised by the initial density

X_1 ∼ µ(·)   (2.3)

and the transition density

X_k | (X_{k−1} = x_{k−1}) ∼ f(·|x_{k−1}),   (2.4)

so that the joint density of the process is

p(x_{1:n}) = p(x_1) ∏_{k=2}^{n} p(x_k|x_{1:k−1}) = µ(x_1) ∏_{k=2}^{n} f(x_k|x_{k−1}).   (2.5)
Also, each observation Y_k depends only on the current state X_k,

Y_k | (X_k = x_k) ∼ g(·|x_k),   (2.6)

so the conditional independence of the observations {Y_k}_{k≥1} given the unobserved signal is characterised by

p(y_{1:n}|x_{1:n}) = ∏_{k=1}^{n} g(y_k|x_k).   (2.7)
The target of inference is then the posterior

p(x_{1:n}|y_{1:n}) = p(x_{1:n}, y_{1:n}) / p(y_{1:n}) ∝ p(x_{1:n}, y_{1:n}),   (2.8)

where n is fixed.
The plain Monte Carlo method approximates this posterior by the empirical distribution of N samples drawn from it,

p̂(x_{1:n}|y_{1:n}) = (1/N) ∑_{i=1}^{N} δ_{X_{1:n}^{(i)}}(x_{1:n}),   (2.9)

where δ_{a_{1:n}} is the Dirac delta, with

∫_A δ_{a_{1:n}}(x_{1:n}) dx_{1:n} = 1 if a_{1:n} ∈ A ⊂ E^n, and 0 otherwise.   (2.10)
The Monte Carlo method works well when it is possible to sample from the probability density p(x_{1:n}|y_{1:n}) directly and n is not large. In practice, however, direct sampling is often impossible and n can be large. These two problems can be tackled by using a Monte Carlo sampling method called Sequential Importance Sampling (SIS), which is a sequential version of Importance Sampling (IS) [5].
SIS instead draws samples from an importance distribution q(x_{1:n}|y_{1:n}). This importance distribution is chosen to be easy to sample from, and is usually the prior distribution p(x_{1:n}). After samples are drawn from the importance distribution, each of them is weighted by the importance weight

w_n^{(i)} ∝ p(X_{1:n}^{(i)}|y_{1:n}) / q(X_{1:n}^{(i)}|y_{1:n}),   (2.11)

and expectations under the target are estimated by

E_{p(x_{1:n}|y_{1:n})}(ϕ) ≈ ∑_{i=1}^{N} w_n^{(i)} ϕ(X_{1:n}^{(i)}) / ∑_{i=1}^{N} w_n^{(i)}.   (2.12)
For example, the mean of the target distribution can be computed by setting ϕ(x) = x and applying Equation 2.12, which gives

E(X_{1:n}) ≈ ∑_{i=1}^{N} w_n^{(i)} X_{1:n}^{(i)} / ∑_{i=1}^{N} w_n^{(i)}.   (2.13)
Knowing that the unnormalised weights can be updated recursively (with the prior as the importance distribution, the new weight is w_{n−1}^{(i)} × g(y_n|X_n^{(i)})), each sample i is extended at step n by

1. Sample X_n^{(i)} from f(·|X_{n−1}^{(i)}) (or µ(·) if n = 1);
2. Set w_{1:n}^{(i)} = (w_{1:n−1}^{(i)}, w_{n−1}^{(i)} × g(y_n|X_n^{(i)}));
3. Set X_{1:n}^{(i)} = (X_{1:n−1}^{(i)}, X_n^{(i)}).
This sequential scheme has the property that, no matter how large n is, only one component X_n needs to be sampled at a time. In addition, this algorithm can be easily parallelised and distributed over multiple computers or processors, which means that the sampling speed can be further accelerated.
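As a concrete illustration, one step of this scheme can be sketched in Julia as below. This is a minimal sketch, not Turing's implementation: the functions f_sample (drawing from the transition f) and g_density (evaluating the likelihood g) are hypothetical stand-ins for model-specific code.

# One SIS step over N particles: propagate each trajectory through the
# transition, then update its weight with the likelihood of the new
# observation y. `X` is a vector of trajectories (vectors of states).
function sis_step!(X, w, y, f_sample, g_density)
    for i in 1:length(w)
        x_new = f_sample(X[i][end])    # 1. sample X_n from f(.|X_{n-1})
        w[i] *= g_density(y, x_new)    # 2. extend the weight product
        push!(X[i], x_new)             # 3. extend the trajectory
    end
    return X, w
end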
A well-known issue of SIS is that, as n grows, the weights degenerate: most particles end up with negligible weights. The solution to this problem is to discard particles with low weights and multiply particles with high weights, a step known as resampling.
The weighted approximation of the posterior is

p̂(x_{1:n}|y_{1:n}) = ∑_{i=1}^{N} W_n^{(i)} δ_{X_{1:n}^{(i)}}(x_{1:n}),   (2.15)

where W_n^{(i)} = w_n^{(i)} / ∑_{i=1}^{N} w_n^{(i)} is named the normalised weight.
When the distribution has its mass concentrated in a small number of particles, it is resampled by sampling N times X_{1:n}^{(i)} ∼ p̂(x_{1:n}|y_{1:n}) to build the new approximation

p̃(x_{1:n}|y_{1:n}) = (1/N) ∑_{i=1}^{N} δ_{X_{1:n}^{(i)}}(x_{1:n}).   (2.16)
The criterion for performing the resampling step can be that the Effective Sample Size (ESS) of the samples falls below a threshold, which is the approach Turing takes. The resampled particles can be shown to be approximately distributed according to p(x_{1:n}|y_{1:n}), but they are statistically dependent [5].
Particle Gibbs
The PG method is an SMC-based method that runs multiple passes of the SMC algorithm, where each pass is conditioned on the trajectory sampled in the last run of the SMC sampler [7]. The trajectory conditioned on is known as the reference trajectory.
To describe the PG algorithm, let x_{1:T} = (x_1^{b_1}, x_2^{b_2}, . . . , x_T^{b_T}) be the reference trajectory with ancestor indices b_{1:T}. For initialisation, we sample x_0^k ∼ p(x_0) for k ≠ b_1 and set π_0^k = 1/M for all k, where M is the number of particles in each particle set.
Now suppose we have a weighted sample from p(x_t|y_{1:t}) at iteration t < T. Each iteration of PG sampling is then conducted as below.

1. Sample ancestor indices a_t^k according to the weights, P(a_t^k = i) = π_t^i, ∀k ≠ b_t;
2. Sample x_{t+1}^k ∼ p(x_{t+1}^k | x_t^{a_t^k}), ∀k ≠ b_t;
3. Set the weights as w_{t+1}^k = p(y_{t+1}|x_{t+1}^k) and normalise them by π_{t+1}^k = w_{t+1}^k / ∑_{i=1}^{M} w_{t+1}^i for all k.
Thanks to the fact that PG uses x_{1:T} as a reference trajectory in its SMC pass, it leaves the exact target distribution invariant, which ensures that PG is an unbiased sampling method [8].
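One PG iteration can be sketched as below, reusing the hypothetical f_sample and g_density from the SIS sketch; the real sampler in Turing is built on coroutines and differs substantially.

# One PG step at time t: keep the reference particle fixed, resample
# ancestors for the others, propagate, then reweight and normalise.
function pg_step(xs, probs, b, y_next, x_ref_next, f_sample, g_density)
    M = length(xs)
    xs_new = similar(xs)
    for k in 1:M
        if k == b
            xs_new[k] = x_ref_next                    # keep the reference trajectory
        else
            a = findfirst(>=(rand()), cumsum(probs))  # 1. sample an ancestor index
            xs_new[k] = f_sample(xs[a])               # 2. propagate from the ancestor
        end
    end
    w = [g_density(y_next, x) for x in xs_new]        # 3. weight the particles ...
    return xs_new, w ./ sum(w)                        #    ... and normalise
end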
2.2.2 Hamiltonian Monte Carlo
Figure 2.2 gives an illustration of how HMC performs differently from MH. As can be seen in this figure, for a given number of samples (30 here), HMC (in red) explores the target distribution better, while the MH method (in green) gets stuck in a corner of the distribution.
Figure 2.2: Samples drawn by HMC (red) and MH (green) from a two-dimensional target distribution, 30 samples each.
When a new state x′ is proposed from a proposal density Q(x′; x^{(t)}), it is accepted or rejected based on the probability ratio between the proposed state and the previous one,

a = P*(x′) Q(x^{(t)}; x′) / (P*(x^{(t)}) Q(x′; x^{(t)})),   (2.17)

using the following rule: if a ≥ 1 the new state is accepted; otherwise it is accepted with probability a. Samples generated this way are correlated with each other. However, it can be proved that the MH method converges, i.e. as t → ∞, the probability distribution of x^{(t)} becomes asymptotic to P(x) = P*(x)/Z [1].
HMC targets distributions of the form

P(x) = e^{−E(x)} / Z,   (2.18)

where both E(x) and its gradient dE(x)/dx can be evaluated.
This E(x) can be seen as the energy level of the corresponding physical system and is called the energy function. In HMC, the state space x is augmented by an additional momentum variable p, and the sampling process then consists of two steps, described below.
In the dynamics step, the state evolves according to Hamilton's equations

ẋ = p   (2.20)

and

ṗ = −dE(x)/dx.   (2.21)
These two proposals (x and p) are then used to create samples asymptotically from the joint density

P_H(x, p) = (1/Z_H) exp[−H(x, p)] = (1/Z_H) exp[−E(x)] exp[−K(p)].   (2.23)
As this joint density is separable, the desired distribution (i.e. the marginal distribution of x) can be obtained by simply discarding p, which gives

P(x) = e^{−E(x)} / Z.   (2.24)
In terms of time complexity, compared with MH, HMC lowers the time complexity from O(N²) to O(N) by reducing random-walk behaviour [1].
2.2.3 Evaluating Inference Results

Samples generated by MCMC algorithms are correlated with each other, thus the 'real' sample size is reduced by autocorrelation. Also, because MCMC methods are Monte Carlo methods, there exist estimation errors due to the limitations of the Monte Carlo approximation [10]. How 'useful' these samples are can be determined by two metrics: Effective Sample Size (ESS) and Monte Carlo Standard Error (MCSE).
The ESS of a chain is defined as

ESS = n / (1 + ∑_{k=1}^{∞} ρ_k),   (2.25)

where n is the total number of samples in the chain and ρ_k is the autocorrelation at lag k of the chain [11].
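A sketch of this computation is given below; the infinite sum is truncated at a fixed maximum lag, which is a pragmatic simplification assumed here (packages such as Mamba.jl use more careful estimators).

using Statistics

# Estimate the ESS of a scalar chain following Equation 2.25, truncating
# the autocorrelation sum at `max_lag`.
function ess(chain::Vector{Float64}; max_lag::Int = 100)
    n, m, v = length(chain), mean(chain), var(chain)
    ρ(k) = sum((chain[1:n-k] .- m) .* (chain[k+1:n] .- m)) / ((n - k) * v)
    return n / (1 + sum(ρ(k) for k in 1:max_lag))
end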
As ESS measures how many effectively independent samples the chain contains, the larger this value, the higher the sampling efficiency. Different MCMC samplers can be evaluated by generating the same number of samples and comparing the ESS of each sampling result.
MCSE can be computed with the batch means method, in which a chain of n = ab samples is divided into a batches of size b and the variance is estimated by

σ̂_g² = (b/(a − 1)) ∑_{j=1}^{a} (Ȳ_j − ḡ_n)²,   (2.26)

where Ȳ_j = (1/b) ∑_{i=(j−1)b+1}^{jb} g(X_i) (for j = 1, . . . , a), ḡ_n = (1/n) ∑_{i=1}^{n} g(X_i) and g is any real-valued function [12]. The MCSE is then

MCSE = σ̂_g / √n.   (2.27)
As MCSE measures the inaccuracy of MC samples, a smaller value of MCSE is an indicator of better sampling performance.

However, it has been argued that MCSE is generally unimportant when the goal of inference is the parameters themselves rather than expectations of parameters, in which case ESS would be the more important measure [13]. As some applications are interested in expectations of parameters while others are not, both MCSE and ESS will be used as performance metrics in this project.
2.3 Probabilistic Programming
There are two main challenges in achieving a universal PPL framework: the first is the development of efficient inference algorithms, and the second is how to utilise existing programming language infrastructure. In many PPLs, the second issue is tackled by embedding the PPL into an existing programming language, so as to reuse as much of the infrastructure of the 'host' programming language as possible [16].
In the rest of this section, how probabilistic models are represented by probabilistic programs is first introduced, and then three probabilistic models written in Turing and Stan (a popular PPL which also implements the HMC algorithm) are provided as illustrations of how models are written in practice. More concrete discussions of existing PPLs will be provided in Chapter 3.
2.3.1 Approaches to Probabilistic Programs
In the determinised approach, a probabilistic program is viewed as a deterministic function that evaluates an (unnormalised) joint density,

f : (θ, D) ↦ π_0(θ) g_θ(D).   (2.28)

In the Bayesian framework, this joint distribution could be the product of the prior P(θ) and the likelihood P(D|θ). Many existing PPLs, including Stan, Infer.NET and WinBUGS, take this approach [16].
where z is drawn from some distribution which supports sampling.

The hybrid approach combines the determinised approach and the randomised one, in the sense that it adds noise to the determinised function. This approach is able to describe any model in the Bayesian inference framework and is the approach taken by Turing.
2.3.2 Exemplar Probabilistic Programs

This section lists three probabilistic models written in two PPLs, Turing and Stan, as an illustration of how probabilistic models are expressed in practice. These three models will also be used for experiments in Chapter 5.
The first model is a univariate Gaussian model with conjugate priors. Conjugate priors are useful for constructing posteriors that keep the same form after being updated by Bayes' rule [18].
x ∼ Normal(µ, √σ),   (2.30)

where σ ∼ InverseGamma(α, θ) and µ ∼ Normal(0, √σ).
Listing 2.2 shows how this model is written in Turing and Listing 2.3 gives the corresponding code in Stan³.

³ The Turing version of this model is available at the home page of Turing.
Listing 2.2: The Gaussian Model in Turing
1 xs = [1.5, 2.0] # the data
2
3 @model gauss begin
4 @assume s ~ InverseGamma(2, 3) # define the variance
5 @assume m ~ Normal(0, sqrt(s)) # define the mean
6 for i = 1:length(xs)
7 @observe xs[i] ~ Normal(m, sqrt(s)) # observe data points
8 end
9 @predict s m # predict s and m
10 end
According to [19], there is an exact inference result for this model. In general, for a Gaussian model with priors µ ∼ Normal(µ_0, σ²/k_0) and σ² ∼ InverseGamma(α, θ)⁴, the posterior means are

µ̄ = (k_0 µ_0 + n x̄) / (k_0 + n)   (2.31)

and

σ̄² = (2α + n)/(2α + n − 2) · 1/(2α + n) · (2θ + ∑_i (x_i − x̄)² + (n k_0)/(k_0 + n) (µ_0 − x̄)²).   (2.32)
For the data xs = [1.5, 2.0] (so n = 2 and x̄ = 1.75) with k_0 = 1, µ_0 = 0, α = 2 and θ = 3, this gives

µ̄ = (1 × 0 + 2 × 1.75) / (1 + 2) = 7/6   (2.33)

and

σ̄² = (2×2 + 2)/(2×2 + 2 − 2) · 1/(2×2 + 2) · (2×3 + 2×(1/4)² + (2×1)/(1 + 2) × (7/4)²) = 49/24.   (2.34)
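These values can be checked numerically with a few lines of plain Julia (a standalone check of Equations 2.31 and 2.32, independent of Turing):

# Reproduces μ̄ = 7/6 ≈ 1.167 and σ̄² = 49/24 ≈ 2.042 for xs = [1.5, 2.0].
xs = [1.5, 2.0]
k0, μ0, α, θ = 1, 0, 2, 3
n = length(xs)
x̄ = sum(xs) / n

μ̄ = (k0 * μ0 + n * x̄) / (k0 + n)                        # Equation 2.31
σ̄² = (2α + n) / (2α + n - 2) *
     (2θ + sum((xs .- x̄) .^ 2) + n * k0 / (k0 + n) * (μ0 - x̄)^2) / (2α + n)  # Equation 2.32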
This exact inference result will be used to validate the correctness of the
HMC implementation in Section 5.1.
⁴ This kind of prior is called the Normal-inverse-Gamma (NIG) prior, which is a special case of the Normal-inverse-chi-squared (NIX) prior. [19] gives the expressions for the posterior means of NIX, and the corresponding means for NIG are found by using the relation between the scaled inverse chi-squared distribution and the inverse gamma distribution, which says that if x ∼ ScaledInverseChiSquared(ν, τ²) then x ∼ InverseGamma(ν/2, ντ²/2).
Beta-Binomial Model

The second model is a beta-binomial model,

x ∼ Bernoulli(p),   (2.35)

where p ∼ Beta(α, β). In our example, the parameters of the Beta distribution are set as α = 1 and β = 1.
Listing 2.4 shows how this model is written in Turing and Listing 2.5 gives the corresponding program in Stan⁵.

⁵ The Stan version of this model is adapted from the one at the home page of the Julia interface of Stan (https://github.com/goedman/Stan.jl), and the corresponding Turing version is written by the student.
8 int<lower=0,upper=1> obs[N];
9 }
10 parameters { # parameter definition
11 real<lower=0,upper=1> theta;
12 }
13 model { # model definition
14 theta ~ beta(1, 1); # define the prior
15 obs ~ bernoulli(theta); # observe data points
16 }
17 "
18
19 betabinomial = Stanmodel(name="betabinomial", model=betabinomial_str)
According to [18], there is an exact inference result for this model: for a beta-binomial model with a prior Beta(α, β), the posterior mean of p is

p̄ = (α + N_1) / (α + β + N),   (2.36)

where N is the number of observations and N_1 the number of 1s among them. With N = 10 observations of which N_1 = 3 are 1s, this gives

p̄ = (1 + 3) / (1 + 1 + 10) = 1/3.   (2.37)

Again, this exact inference result will be used to validate the correctness of the HMC implementation in Section 5.1.
Logistic Regression
In our example, the label of a data point x_i ∈ D ⊂ R² follows a Bernoulli distribution

t_i ∼ Bernoulli(y_i),   (2.38)

where y_i = f(β_0 + β′x_i) = 1/(1 + exp(−(β_0 + β′x_i))) and t_i is the true label.
Here the function f(x) = 1/(1 + exp(−x)) is called the logistic function, and logistic regression is so named because the probability is estimated by this logistic function. Also, in this model, t_i and x_i are related through a linear combination with weight vector β = (β_1, β_2)′ and bias β_0.
In the Bayesian framework, the bias and the weight vector are given Gaussian priors,

β_i ∼ Normal(0, σ²).   (2.39)

The value of σ determines how closely the model fits the data by controlling the magnitude of β_i.
Listing 2.6 shows how this model is written in Turing and Listing 2.7 gives
the corresponding model in Stan.
14 end
15 for i = 1:4
16 y = f(xs[i], beta) # compute label
17 @observe ts[i] ~ Bernoulli(y) # observe data point
18 end
19 @predict beta # output beta
20 end
30 beta_2 ~ normal(0, 2);
31 for (i in 1:N)
32 ts[i] ~ bernoulli(ys[i]); # likelihood
33 }
34 "
Figure: graphical model of the single neuron, with input x, weight w and bias w0, pre-activation t and output y = ϕ(t).
For comparison with the Bayesian approach, the same model can be trained by gradient descent (GD), which updates the weights by

β′ = β + α∇L(β),   (2.40)

where α is the learning rate and L(β) is the loss function defined by

L(β) = ∑_i [t_i log(y_i) + (1 − t_i) log(1 − y_i)] + (α/2)(β_0² + β_1² + β_2²).   (2.41)
[1] indicates a relationship between the variance of the Gaussian prior and the learning rate of GD, namely σ² = 1/α. Also, the 'leapfrog' step size ε in HMC can be chosen using the learning rate by ε = √(2η) [1]. These two relations can be used to set parameters that make the two methods comparable, and the experiment comparing Bayesian prediction against GD will be given in Section 5.4.1.
Chapter 3
Related Work
The first popular PPL was WinBUGS, released in 2000, which can be used to describe graphical models and utilises a Gibbs sampler to do inference [22]. After that, many PPLs with different inference algorithms were developed. For example, Infer.NET from Microsoft makes use of message passing (2014); Stan from the Stan development team uses HMC as the inference algorithm; LibBi uses particle MCMC (2013); and in 2014, Anglican, from Wood's group, was released with various inference algorithms [16, 23, 24, 25].
This chapter gives more practical insight into how PPLs are implemented, and is structured as follows. In Section 3.1, Stan is discussed in detail as it also implements the HMC algorithm. Some features of Stan mentioned here will be linked to its performance later in Section 5.2. Then Section 3.2 introduces the fundamental infrastructure of Turing in terms of the basic architecture, key components and the flow of how a probabilistic model is learnt.
3.1 An Existing HMC Implementation - Stan
Stan, named after the mathematician Stanislaw Ulam, one of the fathers of the MC method, is a C++ program for performing Bayesian inference [26]. The first version of Stan was released in 2012, and it now has a large and active development community as well as extensive documentation. There currently exist several interfaces to Stan, including command line, R, Python, Matlab and Julia. Stan models can be defined in the host language as strings, and the host language can then call the Stan compiler to compile the defined models into C++ programs [26].
To use Stan, a user defines a Stan program in Stan's own syntax, which is similar to BUGS and JAGS in the sense that it allows the user to write a Bayesian model in a convenient language whose code looks like statistical notation [22, 26, 27]. However, as Stan enforces typing of variables, users also need to define the types of the data, the model parameters and any intermediate parameters in the model, which makes model definition somewhat troublesome.
After a Stan model is defined, it is compiled to a C++ program and run along with the data. Note that each time the model is amended, it needs to be compiled again. Compilation takes a relatively long time compared with Stan's fast sampling speed. As probabilistic modelling usually requires frequent amendment of models, this is arguably a disadvantage of Stan. Some concrete compilation times of Stan programs will be given in Section 5.2.
The output from Stan is a chain of samples of the posterior parameters in the model, generated by NUTS, an adaptive variant of the HMC algorithm. As Stan supports only HMC, it cannot infer parameters in discrete spaces [26]. In addition to samples, Stan by default also generates and outputs useful statistics of the samples, such as their mean, variance and ESS.
3.2 The Turing Infrastructure
Because Julia is a dynamic programming language, Turing does not have the tedious compilation process each time a model is changed.
There are two major components in the Turing framework, which are the
compiler and the samplers.
The compiler
A model defined within @model is translated into a normal Julia program by the compiler and passed to a sampler as an expression when the sampling function sample() is called. The detailed design of the compiler and how each operation works will be given in Section 4.1.2.
The samplers
Up to the time this thesis is submitted, Turing supports four inference algorithms: Importance Sampling (IS), Particle Gibbs (PG), Sequential Monte Carlo (SMC) and Hamiltonian Monte Carlo (HMC). (IS, PG and SMC already existed when the student joined the project; the student mainly contributed the implementation of the HMC sampler, and also re-wrote the IS sampler using the new data structure ParticleContainer.) These
32
samplers can be constructed by calling the respective algorithm with its corresponding parameter(s), as follows (the HMC constructor additionally takes the 'leapfrog' step size and step number, as illustrated later in this section):
• IS: IS(n_samples)
• SMC: SMC(n_samples)
Among these four samplers, three of them, IS, PG and SMC, are particle-based methods, in which samples are represented as particles and each particle needs to run a separate copy of the program independently. In order to improve the sampling efficiency of these methods, coroutines, a technique similar to subroutines, are used to execute programs in Turing.
1. A coroutine can yield control multiple times during its execution, and the result can be requested either at that point or later;
The HMC sampler is the one this project focuses on and the corresponding
design and implementation will be given in Section 4.2.
Sampling a defined model
For instance, to sample the Gaussian model gauss from Section 2.3.2 using a HMC sampler for 1000 samples, with 'leapfrog' step size 0.5 and 'leapfrog' step number 15, the statement below could be used.
1 chain = sample(gauss, HMC(1000, 0.5, 15))
Using the Gaussian model as an example, one could extract all weighted
samples of parameter s using a single indexing notation as below.
1 chain[:s]
3.3 MCMC Libraries in Julia
There are three notable MCMC packages in Julia. The first supports sampling from distributions defined in a simple customised Julia expression using random-walk Metropolis and HMC; the second supports applying inference algorithms (MH, HMC and slice sampling) to probability distributions defined in the form of functions; and the third allows users to define models in its own types and applies MCMC algorithms to them, which is similar to a PPL to some extent.

Among these, the package Mamba.jl also provides convenient tools to diagnose MCMC sampling results by computing MCSE and ESS, and thus this package is used in the experimental part of this project.
Chapter 4
Design and Implementation
The probabilistic program is then translated into a standard Julia program by a compiler, which makes use of Julia's metaprogramming facility, macros. The translated program is essentially a noisy evaluator that can evaluate the density of the probability distribution the model defines. The implementation of the compiler will be discussed in Section 4.1.2.
In detail, these three operations are supported with the syntax below².

² The Julia package Distributions supports most of the common distributions. However, because the distributions in the package are not differentiable, a wrapper of common distributions was written by the student. This will be discussed in Section 4.2.2.
These annotations are intended to be used as below.
• static: if this argument is set, it means that the existence of the corresponding variable does not rely on other variables; therefore the variable exists in every execution of the program.
Note: when this argument is set to true and the distribution is differentiable, the corresponding variable can be efficiently sampled by samplers like HMC.
Macros provide a way to include generated code in the final body of a program. They map a tuple of arguments to a returned expression. Unlike functions, which are executed at runtime, macros are executed at parse time [29, 30]. This allows programmers to generate and include pieces of customised code before the full program is executed.
The macro change_op below illustrates how macros work and the difference
between macros and functions.
1 macro change_op(ex)
2 # Change the operation to multiplication
3 ex.args[1] = :*
4 # Return an expression to print the result
5 return :(print($(ex)))
6 end
This macro is aimed at changing the operation of the input expression to multiplication. Given the argument 1 + 2, it produces the result below.
1 @change_op 1 + 2 # call the macro
2 > 2 # the result of the expression "print(1 * 2)"
Using
1 @assume m ~ Normal(0, 1; static=true)
as an example, the @assume macro will, among other steps, prepend d to the distribution constructor in the expression, i.e. turn Normal(0, 1) into dNormal(0, 1).
If the prior is a single variable (like m in the example above), the expression
will become
1 m = Turing.assume(
2 Turing.sampler, # the sampler defined by the user
3 dNormal(0, 1), # the distribution
4 PriorSym(:m) # the prior identity
5 )
If the prior is instead an element of an array (say priors[i] inside a loop), the expression will become

1 m = Turing.assume(
2 Turing.sampler, # the sampler defined by the user
3 dNormal(0, 1), # the distribution
4 PriorArr(:(priors[i], :i, 1)) # the prior identity
5 )

Here the value 1 in the prior identity construction (Line 4 of the code block above) is the concrete value of the loop index i at the specific iteration in which the statement is executed.
The aim of using the types PriorSym and PriorArr is to pass an identity for each prior so that it can be replayed by the HMC sampler specifically. This way of replaying is not perfect, and the replay issue will be discussed in detail in Section 4.2.3.
The @observe macro

For instance,
1 xs = [1.5, 2.0]
2
3 @observe xs[1] ~ Normal(0, 1)
will become
1 Turing.observe(
2 Turing.sampler, # the sampler defined by the user
3 dNormal(0, 1), # the distribution
4 1.5 # the concrete value of observation
5 )
Using
1 @predict m
as an example, if the value of m is 1.1 at the time when this macro is called,
the statement above will become
1 Turing.predict(
2 Turing.sampler, # the sampler defined by the user
3 :m, # the symbol of variable
4 1.1 # the current value of variable
5 )
The @model macro

The macro @model wraps all the original and generated statements in its scope into an expression ex, and stores this expression in Turing's global scope as
1 TURING[:modelex] = ex
4.2 The HMC Sampler

The general inference engine developed in this project uses the standard HMC algorithm. In this section, the implementation of the HMC sampler in Turing is described in detail, with how automatic differentiation works, both generally and inside the sampler, explained in Section 4.2.2.
Notice that this algorithm requires initial states. These are obtained by simply drawing samples from the priors and using them as initial states.
Algorithm 1: The Hamiltonian Monte Carlo Algorithm
Data: sample number n, 'leapfrog' step number τ, 'leapfrog' step size ε, initial state x, energy function E(), gradient function gradE()
Result: n samples
1 for i = 1 to n do
2   p = randn(length(x)); // draw momentum from Normal(0, 1)
3   oldH = p' ∗ p / 2 + E(x); // record old Hamiltonian
4   oldx = x; // record old state
5   // Make τ 'leapfrog' steps
6   g = gradE(x); // evaluate gradient
7   for t = 1 to τ do
8     p −= ε ∗ g / 2; // make a half step for momentum
9     x += ε ∗ p; // make a full step for state
10    g = gradE(x); // evaluate gradient
11    p −= ε ∗ g / 2; // make a half step for momentum
12  end
13  // Decide whether to accept the proposal state or not
14  H = p' ∗ p / 2 + E(x); // compute new Hamiltonian
15  dH = H − oldH; // compute the difference in Hamiltonian
16  if dH < 0 then
17    acc = true
18  else
19    if rand() < exp(−dH) then
20      acc = true
21    else
22      acc = false
23    end
24  end
25  if ¬acc then
26    x = oldx; // rewind if rejected
27  end
28 end
It is worth mentioning that in each iteration the number of evaluations of E(x) is O(1) (Line 3 and Line 14), and the number of evaluations of gradE(x) (Line 6, and Line 10 inside a τ-step loop) is O(τ). So over the whole sampling process these two counts are O(n) and O(nτ) respectively.
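For concreteness, one transition of Algorithm 1 can be transcribed into Julia as below. This is a direct sketch assuming the energy and gradient functions E and gradE are given as ordinary Julia functions; the actual sampler additionally manages the program replay and logjoint bookkeeping described later, and the default ϵ and τ values here are arbitrary.

# One HMC transition: draw a momentum, simulate τ 'leapfrog' steps of
# size ϵ, then accept or reject based on the change in the Hamiltonian.
function hmc_step(x, E, gradE; ϵ = 0.05, τ = 10)
    p = randn(length(x))                  # draw momentum from Normal(0, 1)
    oldH = p' * p / 2 + E(x)              # record old Hamiltonian
    oldx = x                              # record old state
    g = gradE(x)
    for _ in 1:τ
        p -= ϵ * g / 2                    # half step for momentum
        x += ϵ * p                        # full step for state
        g = gradE(x)
        p -= ϵ * g / 2                    # half step for momentum
    end
    dH = p' * p / 2 + E(x) - oldH         # difference in Hamiltonian
    return (dH < 0 || rand() < exp(-dH)) ? x : oldx   # accept or rewind
end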
Evaluating E(x)

The energy function required by HMC is, up to an additive constant, the negative log density of the target,

E(x) = −log P(x) + const,   (4.1)

where P(x) is the posterior distribution P(θ|D) that we are interested in, i.e. P(x) = P(θ|D). By Bayes' rule,

P(θ|D) ∝ P(D|θ)P(θ) = P(D, θ).   (4.2)

By plugging Equation 4.2 into Equation 4.1 with P(x) = P(θ|D), we have

E(x) = −log P(D, θ) + const.   (4.3)

That is to say, since the only information required by the HMC algorithm is the evaluation of E(x) and gradE(x), it actually only needs to extract the log-joint probability and the corresponding gradient from the probabilistic program.
The log-joint probability factorises over the observations and the priors,

log P(D, θ) = log(∏_{x∈D} P(x|θ)) + log(∏_{θ_i∈θ} P(θ_i|θ_{−i})) = ∑_{x∈D} log P(x|θ) + ∑_{θ_i∈θ} log P(θ_i|θ_{−i}).   (4.4)
The core idea behind AD is a new type of number called dual numbers, which will be introduced next, followed by an example and the detailed application of AD in Turing.
Dual Numbers

The mathematics behind forward mode AD is that of dual numbers, which are defined as formal truncated Taylor series of the form

v + v̇ε,   (4.5)

where v is called the real part and v̇ the dual part. Note that any non-dual number v can be viewed as a dual number v + 0ε.
Arithmetic on dual numbers follows from setting ε² = 0:

(v + v̇ε) + (u + u̇ε) = (v + u) + (v̇ + u̇)ε   (4.6)

and

(v + v̇ε)(u + u̇ε) = vu + (vu̇ + v̇u)ε,   (4.7)

where we can see that in both of them the coefficients of ε have the same form as what the symbolic differentiation rules give³. This means that we can use dual numbers as data structures and evaluate functions of dual numbers by

f(v + v̇ε) = f(v) + f′(v)v̇ε.   (4.8)

Also, the chain rule of differentiation holds, as illustrated in Equation 4.9 below.

³ Symbolic differentiation rules are d/dx (f(x) + g(x)) = d/dx f(x) + d/dx g(x) and d/dx (f(x)g(x)) = (d/dx f(x)) g(x) + f(x) (d/dx g(x)).
f(g(v + v̇ε)) = f(g(v) + g′(v)v̇ε) = f(g(v)) + f′(g(v))g′(v)v̇ε.   (4.9)

Here we can see that the coefficient of ε in the second line of Equation 4.9 has the same form as the derivative of a composition of functions.
In summary, we can use dual numbers as the basic data structure, with elementary functions implemented to pass dual numbers through according to Equations 4.6, 4.7 and 4.8. In this way, the derivative of any function f() of interest is given by

df(x)/dx |_{x=v} = epsilon-coefficient(dual-version(f)(v + 1ε)).   (4.10)
Table 4.1: Example of evaluating an expression with dual numbers.
Example
Consider evaluating the derivative

df(x)/dx |_{x=3},   (4.11)

where

f(x) = exp(−(x − 1)²/2).   (4.12)
We can evaluate f (x) with dual numbers using Equation 4.8 and Equa-
tion 4.9. The evaluation steps are shown in Table 4.1 with the corresponding
computational graph shown in Figure 4.1.
df(x)/dx |_{x=3} = −0.271.   (4.13)
Figure 4.1: The computational graph of f(x), with intermediate variables v0 to v4.
This result can be verified symbolically:

df(x)/dx |_{x=3} = exp(−(x − 1)²/2)(−x + 1) |_{x=3} = exp(−4/2)(−3 + 1) = −2 exp(−2) = −0.271.   (4.14)
Evaluating gradE(x)

With dual numbers, the gradient of the negative log-joint with respect to each variable is obtained as follows:

1. Set the dual part of the variable we want the gradient w.r.t. to 1 (and the dual parts of all other variables to 0);
2. Run the probabilistic program with the variables in Dual type, so that the log-joint is accumulated as a dual number;
3. Collect the dual part of the negative log-joint accumulated by the model as the gradient.
In Step 2, this requires passing variables of Dual type through the density functions of the Distributions package, which is unfortunately not supported by the package. Therefore, a custom wrapper of the distributions package, dDistribution, was written, in which the density functions are manually implemented by the student in plain Julia and the other functions (like rand()) are forwarded to the corresponding ones from the Distributions package. This is only a compromise, and a more ideal way would be to patch the Distributions package to make it support the Dual type.
With all key components introduced in Section 4.2.1 and Section 4.2.2, it is
possible to show the implementation of the HMC sampler now.
• logjoint - log-joint probability of data (Dual{Float64})
• first - a flag to tell if the program runs for the first time (Bool)
assume()

The first step of the assume() function is to produce priors, and it behaves differently depending on the first flag. If the program runs for the first time, priors are drawn from the prior distributions, converted into the Dual type and stored in the dictionary priors; if it is not the first time, priors are fetched from priors, which is called the replay of priors. In the second step, the value of the prior is used to compute the log probability density, which is then accumulated into logjoint.
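The two-phase behaviour can be sketched as below, reusing the hypothetical Dual and logpdf_d from the earlier sketches; the field names follow the sampler fields listed above, and the seeding of dual parts for differentiation is omitted for brevity.

# Sketch of assume(): draw and store the prior on the first run, replay
# it on later runs, and accumulate the log density either way. `spl` is
# assumed to be a mutable sampler state with fields first, priors and
# logjoint.
function assume_sketch!(spl, dist, name::Symbol)
    if spl.first
        θ = Dual(rand(dist), 0.0)       # first run: draw from the prior
        spl.priors[name] = θ            # ... and store it for replay
    else
        θ = spl.priors[name]            # later runs: replay the stored value
    end
    spl.logjoint += logpdf_d(dist, θ)   # accumulate the log-joint
    return θ
end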
The replay of priors is done by using the corresponding symbols as keys to store and fetch values in a dictionary. To be more specific, if the prior is stored in a single variable (say s), the key will be :s; if the prior is stored at some position within an array (say p[i]), the key will be p[1], where i is converted to its concrete value when the prior is called (1 here). The passing and conversion of prior variables is done by a customised type Prior, which determines the type of each prior and constructs the corresponding subtype of Prior (PriorSym for single variables and PriorArr for arrays or dictionaries) at compile time, and converts the Prior to the corresponding symbol at runtime.
In fact, this way of replaying is not necessary in most scenarios, because for a normal probabilistic model the order in which priors are generated is the same each time the program runs. In such scenarios, a simpler implementation of replaying priors is to store them in an array in the order they are called and fetch them in order as well. However, since Turing allows users to use branches and loops when defining probabilistic programs, the order in which priors are called may differ when branches exist. In such cases, this new implementation of replaying priors is useful.
This implementation still has problems when the container of priors is more complex, e.g. nested arrays or customised types. A more universal approach to the replay issue is to combine the two methods above. To be more specific, Julia can generate an ID for each macro it expands. The IDs of macros can be used as keys to store the corresponding priors for each statement in a dictionary. Additionally, the data structure under each key can be designed as a linked list, which allows priors defined in a loop to be stored in the dictionary under the same key but at different indices in the list. This way of replaying priors can support priors stored in any Julia container, but it has not been implemented due to the limitation of time.
observe()

The observe() function simply computes the log density of the observed data point and adds it to logjoint.

predict()

The predict() function records the current value of the queried variable (passed in by the compiler, as shown in Section 4.1.2) so that it can be included in the output samples.
run()

The main body of the sampler follows the algorithm description in Algorithm 1, with initial states drawn from the priors, and E(x) and gradE(x) evaluated in the ways introduced previously in this section. One thing to mention is that the logjoint field of the sampler needs to be manually reset to 0 after each run of the program, because its value must be consumed after the evaluation of the program but must also be 0 before each evaluation.

When the sampling is completed, the number of re-runs as well as the accept ratio of the HMC algorithm itself are reported, to help users evaluate the sampling result and tune the HMC settings.
Chapter 5
Evaluation
5.1 Validation of Correctness

Table 5.1 shows the inference results using different samplers (and exact inference where available), and Table 5.2 gives the corresponding settings the samplers used.
Table 5.1: Inference results of different samplers.
It can be seen that, for all models tested, the inference results of the new HMC sampler are consistent with those of the other samplers, which indicates that the implementation is valid.
5.2 Performance
In this section, the HMC sampler is firstly compared against other samplers
within Turing, and then against Stan.
5.2.1 Comparison within Turing
Figure 5.1: Sampling time of different samplers in Turing as the number of samples increases, for (a) the Gaussian model, (b) the beta-binomial model and (c) the logistic regression model.
Regarding the results for the logistic regression model in Figure 5.1c, one setting of PG (in light blue) performs best, while all HMC settings perform extremely badly. This is argued to be a drawback of the forward mode of AD, as the logistic regression model has more variables than the first two models.
5.2.2 Comparison against Stan

As the inference engine in Stan is also based on HMC, the inference results from Stan can be used for comparison with the HMC sampler in Turing. Note that the HMC algorithm implemented in Stan is not the standard HMC but an optimised variant called NUTS.
Table 5.3: Comparison of HMC samplers in Turing and Stan (the Gaussian model).

PPL    | MCSE (s) | ESS (s) | MCSE (m) | ESS (m) | ε̄    | τ̄   | Time
Turing | 0.195    | 181     | 0.027    | 825     | 0.55 | 4   | 0.594s
Stan   | 0.106    | 356     | 0.046    | 378     | 0.55 | 3.9 | 0.180s
As NUTS automatically sets the 'leapfrog' step size and 'leapfrog' step number, in order to get relatively reasonable results, in our experiments models were first learnt in Stan, and the means of the 'leapfrog' step size and step number reported by Stan were used as the parameters for the HMC sampler in Turing when feasible¹.
To compare the HMC sampler in Turing against that in Stan, the models in Section 2.3.2 were used, and inference was run for 1000 samples, 100 times for each model in each language. The generated samples were then built into chains to compute MCSE and ESS using the Mamba.jl package.
¹ Some of the parameters used by Stan are not suitable for the HMC sampler in Turing, because Turing implements unbounded HMC; in some situations, a large step size causes too many rejections from variables going out of range.
Table 5.4: Comparison of HMC samplers in Turing and Stan (the beta-binomial model).

PPL    | MCSE (p) | ESS (p) | ε̄   | τ̄   | Time
Turing | 0.00106  | 208     | 0.1 | 2   | 0.235s
Stan   | 0.00653  | 459     | 1.0 | 2.5 | 0.172s
Table 5.5: Comparison of HMC samplers in Turing and Stan (the logistic regression model).

PPL    | MCSE (β0) | ESS (β0) | MCSE (β1) | ESS (β1) | MCSE (β2) | ESS (β2) | ε̄    | τ̄   | Time
Turing | 0.0488    | 903      | 0.0465    | 871      | 0.0449    | 895      | 0.55 | 5   | 1.25s
Stan   | 0.0728    | 558      | 0.0722    | 503      | 0.0709    | 515      | 0.55 | 5.4 | 0.215s
The sampling efficiency, measured by the MCSE and ESS for 1000 samples, varies from model to model and even from parameter to parameter. For the Gaussian model, Turing has better sampling efficiency for the parameter m but worse for s. This is interesting in the sense that neither Stan nor Turing wins on both, and it also indicates that some variables may be more sensitive to the parameter settings in Turing, which will be further investigated in Section 5.4.2. For the beta-binomial model, Stan outperforms Turing by an obvious margin. For the logistic regression model, Turing shows better efficiency in all three parameters. This loss of efficiency in Stan may be due to its implementation of NUTS, which automates the tuning process but sacrifices some sampling efficiency. Note, however, that Turing still takes longer sampling time and requires manual tuning in practice.
Table 5.6: Comparison between Turing and Stan (ESS per second)
In terms of concrete time, Stan shows an obvious advantage over Turing, being about 6 times faster than Turing for the logistic regression model at most. However, as Stan needs to compile its model to C++ each time the model changes, it takes a relatively long time to run a new model (or a newly-amended one). For instance, it takes 5.06s to compile the Gaussian model in Stan (via the Julia interface) to a C++ program, 5.66s for the beta-binomial model and 4.87s for the logistic regression model. Turing does not have such a drawback, which means that if the user iterates on his or her model very frequently, Turing shows its advantage.
Taking the time into account, it is also worth measuring the sampling efficiency by the number of effective samples per second, which is shown in Table 5.6.
5.3 Robustness
This section discusses how the HMC sampler performs as the number of samples and the number of variables increase.
Increasing the number of samples. In order to evaluate how the sampler performs as the number of samples increases, the three HMC results for the different models are taken from Figure 5.1 and put together in Figure 5.2. (Recall that the HMC settings here are ε = 0.05 and τ = 2 for gauss, ε = 0.01 and τ = 10 for beta, and ε = 1 and τ = 5 for lr.)
Figure 5.2: Time used by the HMC sampler as the number of samples varies.
As the figure shows, the sampling time has a (roughly) linear relationship with the number of samples. This is satisfying because it shows that the performance of the HMC sampler is stable as the number of samples increases. In fact, this is in the nature of the HMC algorithm, as samples are generated consecutively, each depending on its predecessor.
Note that SMC and PG do not naturally have this property. Turing makes their sampling time also proportional to the number of samples through the use of coroutines, which allow multiple particles to be simulated at the same time.
Increasing the number of variables. In order to evaluate how the sampler performs as the number of variables increases, a set of toy models as below, with the number of priors M varying from 1 to 9, is used. Note that the number of variables here is the number of priors, i.e. the dimensionality.
Listing 5.1: Toy Model for the Experiment of Varying the Number of Variables
1 xs = [1.5, 2.0] # the observations
2 M = 1 # number of means, varying from 1 to 9
3 @model gauss_var begin
4 ms = Vector{Dual}(M) # initialise an array to store means
5 for i = 1:M
6 @assume ms[i] ~ Normal(0, sqrt(2)) # define the means
7 end
8 for i = 1:length(xs)
9 m = mean(ms)
10 @observe xs[i] ~ Normal(m, sqrt(2)) # observe data points
11 end
12 @predict ms # output predictions of ms
13 end
The experiment was conducted by learning these models with the HMC sampler 10 times with n = 250, ε = 0.45 and τ = 5, and the average running times were recorded. A plot showing how the sampling time changes with the number of variables is given in Figure 5.3.
As the figure shows, the sampling performance of the HMC sampler degenerates as the number of variables, i.e. the dimensionality, increases. This is not inherent to HMC but a limitation of the implementation, because MH-based methods do not suffer from high dimensionality in this way [1]. The degeneration is believed to be caused by the use of the forward mode of AD, which requires running the probabilistic program n times for a model with n priors. This is not satisfying, and a possible approach to solving this problem will be proposed in Section 6.2.
Figure 5.3: Time used by the HMC sampler as the number of variables varies.
5.4 More Experiments

5.4.1 Bayesian Inference versus Gradient Descent
Figure 5.4: Predictions of the Single Neuron BNN from GD (decision bound-
ary in blue) and Bayesian approach (coloured contour for probability).
In this figure, the training data are the two red and two blue points, which are labelled with 1s and 0s respectively. These prediction results illustrate the advantage of the Bayesian approach mentioned in Section 2.1: GD can only give a hard prediction via its decision boundary, while the Bayesian prediction is not only soft but also provides the certainty of each prediction.
5.4.2 Sensitivity of Different Variables
As the figure shows, the prior m is more sensitive to the change of parameters than s. This may be due to the nature of the prior distributions they are drawn from. Also, the increase of ESS for s tends to stop earlier than that for m in Figure 5.5c, which indicates that there can be different optimal settings for different variables in the same model. In addition, compared with HMC, which shows obvious gaps in ESS between s and m, the NUTS results in Table 5.3 show similar ESS for both s and m. Both of these findings indicate that the NUTS algorithm has the advantage of automatically tuning its parameters during sampling to balance the HMC moves in different dimensions.
(Panels: (a) τ = 1, (b) τ = 2, (c) τ = 5.)
Figure 5.5: Sensitivity of s and m in the Gaussian model for different parameter settings of the HMC sampler.
Chapter 6

Conclusion
The two main challenges of this project, computing the target density function and the corresponding gradient function for the HMC sampler, were successfully solved. The first was done by building a connection between the probabilistic program and the energy function required by HMC using Bayes' rule, and the second was accomplished by applying the forward mode of AD through the probabilistic program.
The final HMC sampler is workable and can be used in practice, but there are still some limitations, which will be discussed in Section 6.1. Some potential approaches to overcome these limitations will be given in Section 6.2.
6.1 Limitations
There are several limitations of the current HMC sampler, which can be divided into two categories. The first category concerns sampling efficiency. According to the evaluations given, this implementation of HMC has an acceptable sampling speed in most circumstances, similar to the PG and SMC samplers in Turing. However, the implementation has some potential flaws. One is that the performance of the HMC sampler degenerates as the number of parameters increases, because it uses the forward mode of AD. Another is that, compared with the NUTS implementation in Stan, the HMC sampler shows worse sampling efficiency in cases where different settings are needed for different parameters. In addition, the sampling time used by the HMC sampler in Turing is still much higher than that of Stan in general.
The second category concerns the functionality of the current HMC sampler. First of all, as the HMC algorithm can only be applied to continuous spaces, inference for models with both continuous and discrete parameters is not supported by the HMC sampler alone. Secondly, the parameters in the sampler are unbounded, which causes situations where out-of-bound errors occur; this is currently handled by re-running the failed step, a solution that is inefficient and fails in some circumstances. Thirdly, the HMC sampler needs to be tuned for good performance, which is tedious and time-consuming.
6.2 Future Work

There are several future tasks to be done to improve the performance of the HMC sampler and the functionality of Turing, which are listed below.
• Optimise the fundamental implementation by getting rid of some du-
plicated calculations and object copying.
Enhancing functionality
problems by simply re-running the corresponding HMC step, much time is wasted when there are too many re-runs. This could be completely avoided by a bounded HMC version. Implementing this would also involve re-designing the compiler to support the definition of variable bounds in the program, or automatically extracting and setting domain restrictions from the Distributions package.
but is worth doing if there is more time, because it would solve the problem radically. Besides, it is also feasible for now to separate the distribution references used by HMC from those used by the other samplers, so that the other samplers work on the Distributions package independently.
Bibliography
[11] Dr. Orlaith Burke. Statistical Methods Autocorrelation: MCMC Output
Analysis. Department of Statistics, University of Oxford, 2012.
[12] James M Flegal, Murali Haran, and Galin L Jones. Markov chain monte
carlo: Can we trust the third significant figure? Statistical Science, pages
250–260, 2008.
[13] Andrew Gelman, John B Carlin, Hal S Stern, and Donald B Rubin.
Bayesian data analysis. texts in statistical science series, 2004.
[14] Razvan Ranca. Improving inference performance in probabilistic pro-
gramming languages. 2014.
[15] Noah Goodman, Vikash Mansinghka, Daniel M Roy, Keith Bonawitz,
and Joshua B Tenenbaum. Church: a language for generative models.
arXiv preprint arXiv:1206.3255, 2012.
[16] Hong Ge, Adam Scibior, and Zoubin Ghahramani. Turing: rejuvenating
probabilistic programming in Julia. 2016.
[17] Dexter Kozen. Semantics of probabilistic programs. Journal of Com-
puter and System Sciences, 22(3):328 – 350, 1981.
[18] Kevin P Murphy. Machine learning: a probabilistic perspective. MIT
press, 2012.
[19] Kevin P. Murphy. Conjugate bayesian analysis of the gaussian distribu-
tion. 2007.
[20] David A Freedman. Statistical models: theory and practice. cambridge
university press, 2009.
[21] David R Cox. The regression analysis of binary sequences. Journal of
the Royal Statistical Society. Series B (Methodological), pages 215–242,
1958.
[22] David J Lunn, Andrew Thomas, Nicky Best, and David Spiegelhalter.
Winbugs-a bayesian modelling framework: concepts, structure, and ex-
tensibility. Statistics and computing, 10(4):325–337, 2000.
[23] Tom Minka, John Winn, John Guiver, and David Knowles. Infer.NET
2.4, 2010. Microsoft Research Cambridge.
[24] Bob Carpenter, Andrew Gelman, Matt Hoffman, Daniel Lee, Ben
Goodrich, Michael Betancourt, Michael A Brubaker, Jiqiang Guo, Peter
Li, and Allen Riddell. Stan: A probabilistic programming language. J
Stat Softw, 2016.
[25] Frank Wood, Jan Willem van de Meent, and Vikash Mansinghka. A new
approach to probabilistic programming inference. In Proceedings of the
17th International conference on Artificial Intelligence and Statistics,
pages 1024–1032, 2014.
[26] Andrew Gelman, Daniel Lee, and Jiqiang Guo. Stan a probabilistic
programming language for bayesian inference and optimization. Journal
of Educational and Behavioral Statistics, page 1076998615606113, 2015.
[27] Martyn Plummer et al. Jags: A program for analysis of bayesian graphi-
cal models using gibbs sampling. In Proceedings of the 3rd international
workshop on distributed statistical computing, volume 124, page 125. Vi-
enna, 2003.
[28] Jeff Bezanson, Alan Edelman, Stefan Karpinski, and Viral B Shah.
Julia: A fresh approach to numerical computing. arXiv preprint
arXiv:1411.1607, 2014.
[29] Jeff Bezanson, Stefan Karpinski, Viral B. Shah, and Alan Edelman.
Julia: A fast dynamic language for technical computing. CoRR,
abs/1209.5145, 2012.
[30] Julia Community. Julia documentation. 2016.
[31] Atilim Baydin, Barak A Pearlmutter, and Alexey Radul. Automatic
differentiation in machine learning: a survey. 2015.
[32] Bob Carpenter, Matthew D Hoffman, Marcus Brubaker, Daniel Lee, Pe-
ter Li, and Michael Betancourt. The stan math library: Reverse-mode
automatic differentiation in c++. arXiv preprint arXiv:1509.07164,
2015.
[33] Matthew D Hoffman and Andrew Gelman. The no-u-turn sampler:
adaptively setting path lengths in hamiltonian monte carlo. Journal
of Machine Learning Research, 15(1):1593–1623, 2014.