
TensorFlow Distributions

Joshua V. Dillon∗, Ian Langmore∗, Dustin Tran∗†, Eugene Brevdo∗, Srinivas Vasudevan∗,
Dave Moore∗, Brian Patton∗, Alex Alemi∗, Matt Hoffman∗, Rif A. Saurous∗
∗ Google, † Columbia University

arXiv:1711.10604v1 [cs.LG] 28 Nov 2017

Abstract
The TensorFlow Distributions library implements a vision of probability theory adapted to the modern deep-learning paradigm of end-to-end differentiable computation. Building on two basic abstractions, it offers flexible building blocks for probabilistic computation. Distributions provide fast, numerically stable methods for generating samples and computing statistics, e.g., log density. Bijectors provide composable volume-tracking transformations with automatic caching. Together these enable modular construction of high dimensional distributions and transformations not possible with previous libraries (e.g., pixelCNNs, autoregressive flows, and reversible residual networks). They are the workhorse behind deep probabilistic programming systems like Edward and empower fast black-box inference in probabilistic models built on deep-network components. TensorFlow Distributions has proven an important part of the TensorFlow toolkit within Google and in the broader deep learning community.

Keywords: probabilistic programming, deep learning, probability distributions, transformations

e = make_encoder(x)
z = e.sample(n)
d = make_decoder(z)
r = make_prior()
avg_elbo_loss = tf.reduce_mean(
    e.log_prob(z) - d.log_prob(x) - r.log_prob(z))
train = tf.train.AdamOptimizer().minimize(
    avg_elbo_loss)

Figure 1. General pattern for training a variational auto-encoder (VAE) [27].

1 Introduction
The success of deep learning—and in particular, deep generative models—presents exciting opportunities for probabilistic programming. Innovations with deep probabilistic models and inference algorithms have enabled new successes in perceptual domains such as images [19], text [4], and audio [48]; and they have advanced scientific applications such as understanding mouse behavior [23], learning causal factors in genomics [45], and synthesizing new drugs and materials [18].

Reifying these applications in code falls naturally under the scope of probabilistic programming systems, systems which build and manipulate computable probability distributions. Within the past year, languages for deep probabilistic programming such as Edward [46] have expanded deep-learning research by enabling new forms of experimentation, faster iteration cycles, and improved reproducibility.

While there have been many developments in probabilistic programming languages, there has been limited progress in backend systems for probability distributions. This is despite their fundamental necessity for computing log-densities, sampling, and statistics, as well as for manipulations when composed as part of probabilistic programs. Existing distributions libraries lack modern tools necessary for deep probabilistic programming. Absent are: batching, automatic differentiation, GPU support, compiler optimization, composability with numerical operations and higher-level modules such as neural network layers, and efficient memory management.

To this end, we describe TensorFlow Distributions (r1.4), a TensorFlow (TF) library offering efficient, composable manipulation of probability distributions.1

Illustration. Figure 1 presents a template for a variational autoencoder under the TensorFlow Python API;2 this is a generative model of binarized MNIST digits trained using amortized variational inference [27]. Figure 2 implements a standard version with a Bernoulli decoder, fully factorized Gaussian encoder, and Gaussian prior. By changing a few lines, Figure 3 implements a state-of-the-art architecture: a PixelCNN++ [41] decoder and a convolutional encoder and prior pushed through autoregressive flows [26, 36]. (convnet and pixelcnnpp are omitted for space.)

Figures 1 to 3 demonstrate the power of TensorFlow Distributions: fast, idiomatic modules are composed to express rich, deep structure. Section 5 demonstrates more applications: kernel density estimators, pixelCNN as a fully-visible likelihood, and how TensorFlow Distributions is used within higher-level abstractions (Edward and TF Estimator).

1 Home: tensorflow.org; Source: github.com/tensorflow/tensorflow.
2 Namespaces: tf=tensorflow; tfd=tf.contrib.distributions; tfb=tf.contrib.distributions.bijectors.
def make_encoder(x, z_size=8):
  net = make_nn(x, z_size*2)
  return tfd.MultivariateNormalDiag(
      loc=net[..., :z_size],
      scale_diag=tf.nn.softplus(net[..., z_size:]))

def make_decoder(z, x_shape=(28, 28, 1)):
  net = make_nn(z, tf.reduce_prod(x_shape))
  logits = tf.reshape(
      net, tf.concat([[-1], x_shape], axis=0))
  return tfd.Independent(tfd.Bernoulli(logits))

def make_prior(z_size=8, dtype=tf.float32):
  return tfd.MultivariateNormalDiag(
      loc=tf.zeros(z_size, dtype))

def make_nn(x, out_size, hidden_size=(128, 64)):
  net = tf.layers.flatten(x)
  for h in hidden_size:
    net = tf.layers.dense(
        net, h, activation=tf.nn.relu)
  return tf.layers.dense(net, out_size)

Figure 2. Standard VAE on MNIST with mean-field Gaussian encoder, Gaussian prior, Bernoulli decoder.

import convnet, pixelcnnpp

def make_encoder(x, z_size=8):
  net = convnet(x, z_size*2)
  return make_arflow(
      tfd.MultivariateNormalDiag(
          loc=net[..., :z_size],
          scale_diag=net[..., z_size:]),
      invert=True)

def make_decoder(z, x_shape=(28, 28, 1)):
  def _logit_func(features):
    # implement single autoregressive step,
    # combining observed features with
    # conditioning information in z.
    cond = tf.layers.dense(z,
        tf.reduce_prod(x_shape))
    cond = tf.reshape(cond, features.shape)
    logits = pixelcnnpp(
        tf.concat((features, cond), -1))
    return logits
  logit_template = tf.make_template(
      "pixelcnn++", _logit_func)
  make_dist = lambda x: tfd.Independent(
      tfd.Bernoulli(logit_template(x)))
  return tfd.Autoregressive(
      make_dist, tf.reduce_prod(x_shape))

def make_prior(z_size=8, dtype=tf.float32):
  return make_arflow(
      tfd.MultivariateNormalDiag(
          loc=tf.zeros([z_size], dtype)))

def make_arflow(z_dist, n_flows=4,
                hidden_size=(640,)*3, invert=False):
  maybe_invert = tfb.Invert if invert else tfb.Identity
  chain = list(itertools.chain.from_iterable([
      maybe_invert(tfb.MaskedAutoregressiveFlow(
          shift_and_log_scale_fn=tfb.\
              masked_autoregressive_default_template(
                  hidden_size))),
      tfb.Permute(np.random.permutation(n_z)),
  ] for _ in range(n_flows)))
  return tfd.TransformedDistribution(
      distribution=z_dist,
      bijector=tfb.Chain(chain[:-1]))

Figure 3. State-of-the-art architecture. It uses a PixelCNN++ decoder [41] and autoregressive flows [26, 36] for encoder and prior.

Contributions. TensorFlow Distributions (r1.4) defines two abstractions: Distributions and Bijectors. Distributions provides a collection of 56 distributions with fast, numerically stable methods for sampling, computing log densities, and many statistics. Bijectors provides a collection of 22 composable transformations with efficient volume-tracking and caching.

TensorFlow Distributions is integrated with the TensorFlow ecosystem [1]: for example, it is compatible with tf.layers for neural net architectures, tf.data for data pipelines, TF Serving for distributed computing, and TensorBoard for visualizations. As part of the ecosystem, TensorFlow Distributions inherits and maintains integration with TensorFlow graph operations, automatic differentiation, idiomatic batching and vectorization, device-specific kernel optimizations and XLA, and accelerator support for CPUs, GPUs, and tensor processing units (TPUs) [25].

TensorFlow Distributions is widely used in diverse applications. It is used by production systems within Google and by Google Brain and DeepMind for research prototypes. It is the backend for Edward [47].

1.1 Goals
TensorFlow Distributions is designed with three goals:

Fast. TensorFlow Distributions is computationally and memory efficient. For example, it strives to use only
XLA-compatible ops (which enable compiler optimizations and portability to mobile devices), and whenever possible it uses differentiable ops (to enable end-to-end automatic differentiation). Random number generators for sampling call device-specific kernels implemented in C++. Functions with Tensor inputs also exploit vectorization through batches (Section 3.3). Multivariate distributions may be able to exploit additional vectorization structure.

Numerically Stable. All operations in TensorFlow Distributions are numerically stable across half, single, and double floating-point precisions (as TensorFlow dtypes: tf.bfloat16 (truncated floating point), tf.float16, tf.float32, tf.float64). Class constructors have a validate_args flag for numerical asserts.

Idiomatic. As part of the TensorFlow ecosystem, TensorFlow Distributions maintains idioms such as inputs and outputs following a "Tensor-in, Tensor-out" pattern (though deviations exist; see Section 3.5), outputs preserving the inputs' dtype, and preferring statically determined shapes. Similarly, TensorFlow Distributions has no library dependencies besides NumPy [50] and six [37], further manages dtypes, supports TF-style broadcasting, and simplifies shape manipulation.

1.2 Non-Goals
TensorFlow Distributions does not cover all use-cases. Here we highlight goals common to probabilistic programming languages which are specifically not goals of this library.3

3 Users can subclass Distribution, relaxing these properties.

Universality. In order to be fast, the Distribution abstraction makes an explicit restriction on the class of computable distributions. Namely, any Distribution should offer sample and log_prob implementations that are computable in expected polynomial time. For example, the Multinomial-LogisticNormal distribution [7] fails to meet this contract.

This also precludes supporting a distributional calculus. For example, convolution is generally not analytic, so Distributions do not support the __add__ operator: if X ∼ f_X, Y ∼ f_Y, and the two share domain D, then Z = X + Y implies f_Z(z) = ∫_D f_X(z − y) f_Y(y) dy = (f_X ∗ f_Y)(z).4

4 In future work, we may support this operation in cases when it satisfies our goals, e.g., for the analytic subset of stable distributions such as Normal, Levy.

Approximate Inference. Distributions do not implement approximate inference or approximations of properties and statistics. For example, a Monte Carlo approximation of entropy is disallowed, yet a function which computes an analytical, deterministic bound on entropy is allowed. Compound distributions with conjugate priors such as Dirichlet-Multinomial are allowed. The marginal distribution of a hidden Markov model is also allowed, since hidden states can be efficiently collapsed with the forward-backward algorithm [33].

2 Related Work
The R statistical computing language [21] provides a comprehensive collection of probability distributions. It inherits from the classic S language [6] and has accumulated user contributions over decades. We use R's collection as a goal for comprehensiveness and ease of user contribution. TensorFlow Distributions differs in being object-oriented instead of functional, enabling manipulation of Distribution objects; operations are also designed to be fast and differentiable. Most developers of the TensorFlow ecosystem are also Google-employed, meaning we benefit from more unification than R's ecosystem. For example, the popular glmnet and lme4 R packages support only specific distributions for model-specific algorithms; all Distributions support generic TensorFlow optimizers for training/testing.

The SciPy stats module in Python collects probability distributions and statistical functions [24]. TensorFlow's primary demographic is machine learning users and researchers; they typically use Python. Subsequently, we modelled our API after SciPy; this mimics TensorFlow's API, which is modelled after NumPy. Beyond the API, the design details and implementations drastically differ. For example, TensorFlow Distributions enables arbitrary tensor-dimensional vectorization, builds operations in the TensorFlow computational graph, supports automatic differentiation, and can run on accelerators. The TensorFlow Distributions API also introduces innovations such as higher-order distributions (Section 3.5), distribution functionals (Section 3.6), and Bijectors (Section 4).

Stan Math [10] is a C++ templated library for numerical and statistical functions with automatic differentiation, serving as the backend for the Stan probabilistic programming language [9]. Different from Stan, we focus on enabling deep probabilistic programming. This led to new innovations with bijectors, shape semantics, higher-order distributions, and distribution functionals. Computationally, TensorFlow Distributions also enables a variety of non-CPU accelerators, and compiler optimizations in static over dynamically executed graphs.

3 Distributions
TensorFlow Distributions provides a collection of approximately 60 distributions with fast, numerically stable methods for sampling, log density, and many statistics. We describe key properties and innovations below.
3.1 Constructor
TensorFlow Distributions are object-oriented. A distribution implementation subclasses the Distribution base class. The base class follows a "public calls private" design pattern where, e.g., the public sample method calls a private _sample implemented by each subclass. This handles basic argument validation (e.g., type, shape) and simplifies sharing function documentation.

Distributions take the following arguments:

  parameters               indexes to family
  dtype                    dtype of samples
  reparameterization_type  sampling (Section 3.4)
  validate_args            bool permitting numerical checks
  allow_nan_stats          bool permitting NaN outputs
  name                     str prepended to TF ops

Parameter arguments support TF-style broadcasting. For example, Normal(loc=1., scale=[0.5, 1., 1.5]) is effectively equivalent to Normal(loc=[1., 1., 1.], scale=[0.5, 1., 1.5]). Distributions use self-documenting argument names from a concise lexicon. We never use Greek and prefer, for example, loc, scale, rate, concentration, rather than µ, σ, λ, α.

Alternative parameterizations can sometimes lead to an "argument zoo." To mitigate this, we distinguish between two cases. When numerical stability necessitates them, distributions permit mutually exclusive parameters (this produces only one extra argument). For example, Bernoulli accepts logits or probs, Poisson accepts rate or log_rate; neither permits specifying both. When alternative parameterizations are structural, we specify different classes: MultivariateNormalTriL, MultivariateNormalDiag, and MultivariateNormalDiagPlusLowRank implement multivariate normal distributions with different covariance structures.

The dtype defaults to floats or ints, depending on the distribution's support, and with precision given by its parameters'. Distributions over integer-valued support (e.g., Poisson) use tf.int*. Distributions over real-valued support (e.g., Dirichlet) use tf.float*. This distinction exists because of mathematical consistency; and in practice, integer-valued distributions are often used as indexes into Tensors.5

5 Currently, TensorFlow Distributions' dtype does not follow this standard. For backwards compatibility, we are in the process of implementing it by adding a new sample_dtype kwarg.

If validate_args=True, argument validation happens during graph construction when possible; any validation at graph execution (runtime) is gated by a Boolean. Among other checks, validate_args=True limits integer distributions' support to integers exactly representable by same-size IEEE754 floats, i.e., integers cannot exceed 2^significand_bits. If allow_nan_stats=True, operations can return NaNs; otherwise an error is raised.
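As a concrete illustration of these constructor arguments, the following is our own sketch (not from the paper), using the tfd namespace of Figure 1; the mutually exclusive logits/probs arguments and the validate_args flag behave as described above.

b1 = tfd.Bernoulli(logits=tf.zeros(3))         # parameterized by log-odds
b2 = tfd.Bernoulli(probs=0.5 * tf.ones(3))     # or by probabilities...
# ...but not both: passing logits and probs together raises an error.
p = tfd.Poisson(rate=1.5, validate_args=True)  # adds runtime asserts (e.g., rate > 0)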
3.2 Methods
At minimum, supported Distributions implement the following methods: sample to generate random outcome Tensors, log_prob to compute the natural logarithm of the probability density (or mass) function of random outcome Tensors, and batch_shape_tensor and event_shape_tensor to describe the dimensions of random outcome Tensors (Section 3.3), themselves returned as Tensors.

Supported Distributions implement many additional methods, including cdf, survival_function, quantile, mean, variance, and entropy. The Distribution base class automates implementation of related functions such as prob given log_prob and log_survival_fn given log_cdf (unless a more efficient or numerically stable implementation is available). Distribution-specific statistics are permitted; for example, Wishart implements the expected log determinant (mean_log_det) of matrix variates, which would not be meaningful for univariate distributions.

All methods of supported distributions satisfy the following contract:

Efficiently Computable. All member functions have expected polynomial-time complexity. Further, they are vectorized (Section 3.3) and have optimized sampling routines (Section 3.4). TensorFlow Distributions also favors efficient parameterizations: for example, we favor MultivariateNormalTriL, whose covariance is parameterized by the outer product of a lower triangular matrix, over MultivariateNormalFullCovariance, which requires a Cholesky factorization.

Statistically Consistent. Under sample, the Monte Carlo approximation of any statistic converges to the statistic's implementation as the number of samples approaches ∞. Similarly, pdf is equal to the derivative of cdf with respect to its input; and sample is equal in distribution to uniform sampling followed by the inverse of cdf.

Analytical. All member functions are analytical excluding sample, which is non-deterministic. For example, Mixture implements an analytic expression for an entropy lower bound method, entropy_lower_bound; its exact entropy is intractable. However, no method function's implementation uses a Monte Carlo estimate (even with a fixed seed, or a low-discrepancy sequence [35]), which we qualify as non-analytical.

Fixed Properties. In keeping with TensorFlow idioms, Distribution instances have fixed shape semantics (Section 3.3), dtype, class methods, and class properties throughout the instance's lifetime. Member functions have no side effects other than to add ops to the TensorFlow graph.

Note this is unlike the statefulness of exchangeable random primitives [2], where sampling can memoize over calls to lazily evaluate infinite-dimensional data structures. To handle such distributions, future work may involve a sampling method which returns another distribution storing those samples. This preserves immutability while enabling marginal representations of completely random measures such as a Chinese restaurant process, which is the compound distribution of a Dirichlet process and multinomial distribution [3]. Namely, its sample computes a Pólya urn-like scheme, caching the number of customers at each table.6

6 While the Chinese restaurant process is admittable as a (sequence of) Distribution, the Dirichlet process is not: its probability mass function involves a countable summation.
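Before turning to shape semantics, a minimal sketch of our own (not from the paper) exercises the method interface described in this section; shapes follow the semantics of Section 3.3.

d = tfd.Normal(loc=1., scale=[0.5, 1., 1.5])  # loc broadcasts against scale: a 3-batch
x = d.sample(7)                               # shape [7, 3]: 7 draws per batch member
lp = d.log_prob(x)                            # shape [7, 3]
pr = d.prob(x)                                # derived automatically from log_prob
c = d.cdf(x)                                  # analytic; no Monte Carlo estimate
m, v = d.mean(), d.variance()                 # each of shape [3]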
3.3 Shape Semantics
To make Distributions fast, a key challenge is to enable arbitrary tensor-dimensional vectorization. This allows users to properly utilize multi-threaded computation as well as array data-paths in modern accelerators. However, probability distributions involve a number of notions for different dimensions; they are often conflated and thus difficult to vectorize.

To solve this, we (conceptually) partition a Tensor's shape into three groups:

1. Sample shape describes independent, identically distributed draws from the distribution.
2. Batch shape describes independent, not identically distributed draws. Namely, we may have a set of (different) parameterizations to the same distribution. This enables the common use case in machine learning of a "batch" of examples, each modelled by its own distribution.
3. Event shape describes the shape of a single draw (event space) from the distribution; it may be dependent across dimensions.

Figure 4. Shape semantics, for the variational distribution in Figure 1: a sample Tensor's shape partitions as [n Monte Carlo draws, b examples per batch, s latent dimensions (event space)], i.e., sample_shape (independent, identically distributed), batch_shape (independent, not identical), and event_shape (can be dependent).

Figure 4 illustrates this partition for the latent code in a variational autoencoder (Figure 1). Combining these three shapes in a single Tensor enables efficient, idiomatic vectorization and broadcasting.

Member functions all comply with the distribution's shape semantics and dtype. As another example, we initialize a batch of three multivariate normal distributions in R^2. Each batch member has a different mean.

# Initialize 3-batch of 2-variate
# MultivariateNormals, each with different mean.
mvn = tfd.MultivariateNormalDiag(
    loc=[[1., 1.], [2., 2.], [3., 3.]])
x = mvn.sample(10)
# ==> x.shape=[10, 3, 2]. 10 samples across
#     3 batch members. Each sample in R^2.
pdf = mvn.prob(x)
# ==> pdf.shape=[10, 3]. One pdf calculation
#     for 10 samples across 3 batch members.

Partitioning Tensor dimensions by "sample", "batch", and "event" dramatically simplifies user code while naturally exploiting vectorization. For example, we describe a Monte Carlo approximation of a Normal-Laplace compound distribution,

  p(x | σ, µ₀, σ₀) = ∫_R Normal(x | µ, σ) Laplace(µ | µ₀, σ₀) dµ.

# Draw n iid samples from a Laplace.
mu = tfd.Laplace(
    loc=mu0, scale=sigma0).sample(n)
# ==> sample_shape = [n]
#     batch_shape = []
#     event_shape = []
# Compute n different Normal pdfs at
# scalar x, one for each Laplace draw.
pr_x_given_mu = tfd.Normal(
    loc=mu, scale=sigma).prob(x)
# ==> sample_shape = []
#     batch_shape = [n]
#     event_shape = []
# Average across each Normal's pdf.
pr_x = tf.reduce_mean(pr_x_given_mu, axis=0)
# ==> pr_x.shape=x.shape=[]

This procedure is automatically vectorized because the internal calculations are over tensors, where each represents the differently parameterized Normal distributions. sigma and x are automatically broadcast; their value is applied pointwise, thus eliding n copies.

To determine batch and event shapes (sample shape is determined from the sample method), we perform
shape inference from parameters at construction time. Parameter dimensions beyond that necessary for a single distribution instance always determine batch shape. Inference of event shapes is typically not required, as distributions often know it a priori; for example, Normal is univariate. On the other hand, Multinomial infers its event shape from the rightmost dimension of its logits argument. Dynamic sample and batch ranks are not allowed because they conflate the shape semantics (and thus efficient computation); dynamic event ranks are not allowed as a design choice for consistency.

Note that event shape (and shapes in general) reflects the numerical shape and not the mathematical definition of dimensionality. For example, Categorical has a scalar event shape over a finite set of integers; while a one-to-one mapping exists, OneHotCategorical has a vector event shape over one-hot vectors. Other distributions with non-scalar event shape include Dirichlet (simplexes) and Wishart (positive semi-definite matrices).
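To make these shape properties concrete, the following small sketch of our own (not from the paper) inspects the inferred batch and event shapes of a batched multivariate normal.

mvn = tfd.MultivariateNormalDiag(
    loc=tf.zeros([3, 2]))  # 3-batch of distributions over R^2
mvn.batch_shape            # ==> [3]
mvn.event_shape            # ==> [2]
x = mvn.sample(5)          # sample_shape = [5]; x.shape = [5, 3, 2]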
3.4 Sampling
Sampling is one of the most common applications of a Distribution. To optimize speed, we implement sampling by registering device-specific kernels in C++ to TensorFlow operations. We also use well-established algorithms for random number generation. For example, draws from Normal use the Box-Muller transform to return an independent pair of normal samples from an independent pair of uniform samples [8]; CPU, GPU, and TPU implementations exist. Draws from Gamma use the rejection sampling algorithm of Marsaglia and Tsang [30]; currently, only a CPU implementation exists.

Reparameterization. Distributions employ a reparameterization_type property (Section 3.1) which annotates the interaction between automatic differentiation and sampling. Currently, there are two such annotations: "fully reparameterized" and "not reparameterized". To illustrate "fully reparameterized", consider dist = Normal(loc, scale). The sample y = dist.sample() is implemented internally via x = tf.random_normal([]); y = scale * x + loc. The sample y is "reparameterized" because it is a smooth function of the parameters loc, scale, and a parameter-free sample x. In contrast, the most common Gamma sampler is "not reparameterized": it uses an accept-reject scheme that makes the samples depend non-smoothly on the parameters [34].

When composed with other TensorFlow ops, a "fully reparameterized" Distribution enables end-to-end automatic differentiation on functions of its samples. A common use case is a loss depending on expectations of the form E[φ(Y)] for some function φ. For example, variational inference algorithms minimize the KL divergence between p_Y and another distribution h, E[log(p_Y(Y)/h(Y))], using gradient-based optimization. To do so, one can compute a Monte Carlo approximation

  S_N := (1/N) Σ_{n=1}^{N} φ(Y_n),  where Y_n ∼ p_Y.    (1)

This lets us use S_N not only as an estimate of our expected loss E[φ(Y)], but also use ∇_λ S_N as an estimate of the gradient ∇_λ E[φ(Y)] with respect to parameters λ of p_Y. If the samples Y_n are reparameterized (in a smooth enough way), then both approximations are justified [14, 27, 42].
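For instance, the following sketch of our own (not from the paper) forms the estimator S_N of Equation 1 for φ(y) = y² under a fully reparameterized Normal, so its TensorFlow gradient with respect to loc and scale is a valid estimate of ∇_λ E[φ(Y)].

loc = tf.Variable(0.)
scale = tf.Variable(1.)
y = tfd.Normal(loc=loc, scale=scale).sample(1000)  # fully reparameterized draws
s_n = tf.reduce_mean(tf.square(y))                 # S_N for phi(y) = y^2
grads = tf.gradients(s_n, [loc, scale])            # estimates of grad_lambda E[phi(Y)]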
3.5 Higher-Order Distributions
Higher-order distributions are Distributions which are functions of other Distributions. This deviation from the Tensor-in, Tensor-out pattern enables modular, inherited construction of an enormous number of distributions. We outline three examples below and use a running illustration of composing distributions.

TransformedDistribution is a distribution p(y) consisting of a base distribution p(x) and an invertible, differentiable transform Y = g(X). The base distribution is an instance of the Distribution class and the transform is an instance of the Bijector class (Section 4).

For example, we can construct a (standard) Gumbel distribution from an Exponential distribution.

standard_gumbel = tfd.TransformedDistribution(
    distribution=tfd.Exponential(rate=1.),
    bijector=tfb.Chain([
        tfb.Affine(
            scale_identity_multiplier=-1.,
            event_ndims=0),
        tfb.Invert(tfb.Exp()),
    ]))
standard_gumbel.batch_shape  # ==> []
standard_gumbel.event_shape  # ==> []

The Invert(Exp) transforms the Exponential distribution by the natural log, and the Affine negates. In general, algebraic relationships of random variables are powerful, enabling distributions to inherit method implementations from parents (e.g., internally, we implement multivariate normal distributions as Affine transforms of Normal).

Building on standard_gumbel, we can also construct 2∗28∗28 independent relaxations of the Bernoulli distribution, known as Gumbel-Softmax or Concrete [22, 29].

alpha = tf.stack([
    tf.fill([28 * 28], 2.),
    tf.ones(28 * 28)])
concrete_pixel = tfd.TransformedDistribution(
    distribution=standard_gumbel,
    bijector=tfb.Chain([
        tfb.Sigmoid(),
        tfb.Affine(shift=tf.log(alpha)),
    ]),
    batch_shape=[2, 28 * 28])
concrete_pixel.batch_shape  # ==> [2, 784]
concrete_pixel.event_shape  # ==> []

The Affine shifts by log(alpha) for two batches. Applying Sigmoid renders a batch of [2, 28∗28] univariate Concrete distributions.

Independent enables idiomatic manipulations between batch and event dimensions. Given a Distribution instance dist with batch dimensions, Independent builds a vector (or matrix, or tensor) valued distribution whose event components default to the rightmost batch dimension of dist.

Building on concrete_pixel, we reinterpret the 784 batches as jointly characterizing a distribution.

image_dist = tfd.TransformedDistribution(
    distribution=tfd.Independent(concrete_pixel),
    bijector=tfb.Reshape(
        event_shape_out=[28, 28, 1],
        event_shape_in=[28 * 28]))
image_dist.batch_shape  # ==> [2]
image_dist.event_shape  # ==> [28, 28, 1]

The image_dist distribution is over 28 × 28 × 1-dimensional events (e.g., MNIST-resolution pixel images).

Mixture defines a probability mass function p(x) = Σ_{k=1}^{K} π_k p(x | k), where the mixing weights π_k are provided by a Categorical distribution as input, and the components p(x | k) are arbitrary Distributions with the same support. For components that are batches of the same family, MixtureSameFamily simplifies the construction, where its components are from the rightmost batch dimension. Building on image_dist:

image_mixture = tfd.MixtureSameFamily(
    mixture_distribution=tfd.Categorical(
        probs=[0.2, 0.8]),
    components_distribution=image_dist)
image_mixture.batch_shape  # ==> []
image_mixture.event_shape  # ==> [28, 28, 1]

Here, MixtureSameFamily creates a mixture of two components with weights [0.2, 0.8]. The components are slices along the batch axis of image_dist.

3.6 Distribution Functionals
Functionals which take probability distribution(s) as input and return a scalar are ubiquitous. They include information measures such as entropy, cross entropy, and mutual information [11]; divergence measures such as Kullback-Leibler, Csiszár-Morimoto f-divergence [12, 31], and multi-distribution divergences [15]; and distance metrics such as integral probability metrics [32].

We implement all (analytic) distribution functionals as methods in Distributions. For example, below we write functionals of Normal distributions:

p = tfd.Normal(loc=0., scale=1.)
q = tfd.Normal(loc=-1., scale=2.)
xent = p.cross_entropy(q)
kl = p.kl_divergence(q)
# ==> xent - p.entropy()

4 Bijectors
We described Distributions, sources of stochasticity which collect properties of probability distributions. Bijectors are deterministic transformations of random outcomes and of equal importance in the library. The Bijectors collection consists of 22 composable transformations for manipulating Distributions, with efficient volume-tracking and caching of pre-transformed samples. We describe key properties and innovations below.

4.1 Motivation
The Bijector abstraction is motivated by two challenges for enabling efficient, composable manipulations of probability distributions:

1. We seek a minimally invasive interface for manipulating distributions. Implementing every transformation of every distribution results in a combinatorial blowup and is not realistically maintainable. Such a policy is unlikely to keep up with the pace of research. Lack of encapsulation exacerbates already complex ideas/code and discourages community contributions.
2. In deep learning, rich high-dimensional densities typically use invertible volume-preserving mappings or mappings with fast volume adjustments (namely, the logarithm of the Jacobian's determinant has linear complexity with respect to dimensionality) [36]. We'd like to efficiently and idiomatically support them.

By isolating stochasticity from determinism, Distributions are easy to design, implement, and validate. As we illustrate with the flexibility of TransformedDistribution in Section 3.5, the ability to simply swap out functions applied to the distribution is a surprisingly powerful asset. Programmatically, the Bijector distinction enables encapsulation and modular distribution constructions with inherited, fast method implementations. Statistically, Bijectors enable exploiting algebraic relationships among random variables.
4.2 Definition
To address Section 4.1, the Bijector API provides an interface for transformations of distributions suitable for any differentiable and bijective map (diffeomorphism), as well as certain non-injective maps (Section 4.5).

Formally, given a random variable X and diffeomorphism F, we can define a new random variable Y whose density can be written in terms of X's,

  p_Y(y) = p_X(F⁻¹(y)) |DF⁻¹(y)|,    (2)

where DF⁻¹ is the inverse of the Jacobian of F. Each Bijector subclass corresponds to such a function F, and TransformedDistribution uses the bijector to automate the details of the transformation Y = F(X)'s density (Equation 2). This allows us to define many new distributions in terms of existing ones.

A Bijector implements how to transform a Tensor and how the input Tensor's shape changes; this Tensor is presumed to be a random outcome possessing Distribution shape semantics. Three functions characterize how the Tensor is transformed:

1. forward implements x ↦ F(x), and is used by TransformedDistribution.sample to convert one random outcome into another. It also establishes the name of the bijector.
2. inverse undoes the transformation y ↦ F⁻¹(y) and is used by TransformedDistribution.log_prob.
3. inverse_log_det_jacobian computes log |DF⁻¹(y)| and is used by TransformedDistribution.log_prob to adjust for how the volume changes under the transformation. In certain settings, it is more numerically stable (or generally preferable) to implement the forward_log_det_jacobian. Because forward and inverse log ∘ det ∘ Jacobians differ only in sign, either (or both) may be implemented.

A Bijector also describes how it changes the Tensor's shape so that TransformedDistribution can implement functions that compute event and batch shapes. Most bijectors do not change the Tensor's shape. Those which do implement forward_event_shape_tensor and inverse_event_shape_tensor. Each takes an input shape (vector) and returns a new shape representing the Tensor's event/batch shape after forward or inverse transformations. Excluding higher-order bijectors, currently only SoftmaxCentered changes the shape.7

7 To implement softmax(x) = exp(x)/Σ_i exp(x_i) as a diffeomorphism, its forward appends a zero to the event and its inverse strips this padding. The result is a bijective map which avoids the fact that softmax(x) = exp(x − c)/Σ_i exp(x_i − c) for any c.

Using a Bijector, TransformedDistribution automatically and efficiently implements sample, log_prob, and prob. For bijectors with constant Jacobian such as Affine, TransformedDistribution automatically implements statistics such as mean, variance, and entropy. The following example implements an affine-transformed Laplace distribution.

vector_laplace = tfd.TransformedDistribution(
    distribution=tfd.Laplace(loc=0., scale=1.),
    bijector=tfb.Affine(
        shift=tf.Variable(tf.zeros(d)),
        scale_tril=tfd.fill_triangular(
            tf.Variable(tf.ones(d*(d+1)/2)))),
    event_shape=[d])

The distribution is learnable via tf.Variables, and the Affine is parameterized by what is essentially the Cholesky factor of the covariance matrix. This makes the multivariate construction computationally efficient and more numerically stable; bijector caching (Section 4.4) may even eliminate back substitution.
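To make Equation 2 and the three characterizing functions concrete, here is a small sketch of our own (not from the paper; it assumes the r1.4 method signatures): a log-normal built by pushing a standard Normal through tfb.Exp(), with its log-density recovered manually from the base density plus the volume correction.

log_normal = tfd.TransformedDistribution(
    distribution=tfd.Normal(loc=0., scale=1.),
    bijector=tfb.Exp())
y = log_normal.sample(5)
x = tfb.Exp().inverse(y)                      # F^{-1}(y) = log y
manual_log_prob = (
    tfd.Normal(loc=0., scale=1.).log_prob(x)
    + tfb.Exp().inverse_log_det_jacobian(y))  # log |DF^{-1}(y)|; assumes the r1.4
                                              # signature without an event_ndims arg
# manual_log_prob matches log_normal.log_prob(y) up to numerical precision.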
4.3 Composability
Bijectors can compose using higher-order Bijectors such as Chain and Invert. Figure 3 illustrates a powerful example where the arflow method composes a sequence of autoregressive and permutation Bijectors to compactly describe an autoregressive flow [26, 36].

The Chain bijector enables simple construction of rich Distributions. Below we construct a multivariate logit-Normal with matrix-shaped events.

matrix_logit_mvn = tfd.TransformedDistribution(
    distribution=tfd.Normal(0., 1.),
    bijector=tfb.Chain([
        tfb.Reshape([d, d]),
        tfb.SoftmaxCentered(),
        tfb.Affine(scale_diag=diag),
    ]),
    event_shape=[d * d])

The Invert bijector effectively doubles the number of bijectors by swapping forward and inverse. It is useful in situations such as the Gumbel construction in Section 3.5. It is also useful for transforming constrained continuous distributions onto an unconstrained real-valued space. For example:

softminus_gamma = tfd.TransformedDistribution(
    distribution=tfd.Gamma(
        concentration=alpha,
        rate=beta),
    bijector=tfb.Invert(tfb.Softplus()))
This performs a softplus inversion to robustly transform Gamma to be unconstrained. This enables a key component of automated algorithms such as automatic differentiation variational inference [28] and the No-U-Turn Sampler [20]. They only operate on real-valued spaces, so unconstraining expands their scope.

4.4 Caching
Bijectors automatically cache input/output pairs of operations, including the log ∘ det ∘ Jacobian. This is advantageous when the inverse calculation is slow, numerically unstable, or not easily implementable. A cache hit occurs when computing the probability of results of sample. That is, if q(x) is the distribution associated with x = f(ε) and ε ∼ r, then caching lowers the cost of computing q(x_i), since

  q(x_i) = r((f⁻¹ ∘ f)(ε_i)) |∂/∂ε (f⁻¹ ∘ f)(ε_i)|⁻¹ = r(ε_i).

Because TensorFlow follows a deferred execution model, Bijector caching is nominal; it has zero memory or computational cost. The Bijector base class merely replaces one graph node for another already existing node. Since the existing node ("cache hit") is already an execution dependency, the only cost of "caching" is during graph construction.

Caching is computationally and numerically beneficial for importance sampling algorithms, which compute expectations by weighting each drawn sample by its reciprocal probability. Namely,

  µ = ∫ f(x) p(x) dx
    = ∫ [f(x) p(x) / q(x)] q(x) dx
    = lim_{n→∞} n⁻¹ Σ_i f(x_i) p(x_i) / q(x_i),  where x_i ∼ q (iid).

Caching also nicely complements black-box variational inference algorithms [28, 40]. In these procedures, the approximate posterior distribution only computes log_prob over its own samples. In this setting the sample's preimage (ε_i) is known without computing f⁻¹(x_i).

MultivariateNormalTriL is implemented as a TransformedDistribution with the Affine bijector. Caching removes the need for quadratic-complexity back substitution. For an InverseAutoregressiveFlow [26],

laplace_iaf = tfd.TransformedDistribution(
    distribution=tfd.Laplace(loc=0., scale=1.),
    bijector=tfb.Invert(
        tfb.MaskedAutoregressiveFlow(
            shift_and_log_scale_fn=tfb.\
                masked_autoregressive_default_template(
                    hidden_layers))),
    event_shape=[d])

caching reduces the overall complexity from quadratic to linear (in event size).
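As a usage note (our own sketch, not from the paper), a cache hit looks like the following: because the log-density is evaluated on the bijector's own outputs, no inverse pass through the MaskedAutoregressiveFlow is added to the graph.

z = laplace_iaf.sample(100)   # forward pass; (input, output) pairs are cached
lp = laplace_iaf.log_prob(z)  # cache hit: reuses the pre-transformed samples,
                              # so no f^{-1}(z) inverse pass is constructed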
4.5 Smooth Coverings
The Bijector framework extends to non-injective transformations, i.e., smooth coverings [44].8 Formally, a smooth covering is a continuous function F on the entire domain D where, ignoring sets of measure zero, the domain can be partitioned as a finite union D = D₁ ∪ · · · ∪ D_K such that each restriction F : D_k → F(D) is a diffeomorphism. Examples include AbsValue and Square. We implement them by having the inverse method return the set inverse F⁻¹(y) := {x : F(x) = y} as a tuple.

8 Bijector caching is currently not supported for smooth coverings.

Smooth covering Bijectors let us easily build half-distributions, which allocate probability mass over only the positive half-plane of the original distribution. For example, we build a half-Cauchy distribution as

half_cauchy = tfd.TransformedDistribution(
    bijector=tfb.AbsValue(),
    distribution=tfd.Cauchy(loc=0., scale=1.))

The half-Cauchy and half-Student t distributions are often used as "weakly informative" priors, which exhibit heavy tails, for variance parameters in hierarchical models [16].

5 Applications
We described two abstractions: Distributions and Bijectors. Recall Figures 1 to 3, where we showed the power of combining these abstractions for changing from simple to state-of-the-art variational auto-encoders. Below we show additional applications of TensorFlow Distributions as part of the TensorFlow ecosystem.

5.1 Kernel Density Estimators
A kernel density estimator (KDE) is a nonparametric estimator of an unknown probability distribution [51]. Kernel density estimation provides a fundamental smoothing operation on finite samples that is useful across many applications. With TensorFlow Distributions, KDEs can be flexibly constructed as a MixtureSameFamily. Given a finite set of points x in R^D, we write

f = lambda x: tfd.Independent(tfd.Normal(
    loc=x, scale=1.))
n = x.shape[0].value
kde = tfd.MixtureSameFamily(
    mixture_distribution=tfd.Categorical(
        probs=[1 / n] * n),
    components_distribution=f(x))

Here, f is a callable taking x as input and returning a distribution. Above, we use an independent D-dimensional
Normal distribution (equivalent to MultivariateNormalDiag), which induces a Gaussian kernel density estimator.

Changing the callable extends the KDE to alternative distribution-based kernels. For example, we can use lambda x: MultivariateNormalTriL(loc=x) for a multivariate kernel, and alternative distributions such as lambda x: Independent(StudentT(loc=x, scale=0.5, df=3)). The same concept also applies for bootstrap techniques [13]. We can employ the parametric bootstrap or stratified sampling to replace the equal mixing weights.
5.2 PixelCNN Distribution hidden state is random [5]. Formally, for T time steps,
the model specifies the joint distribution
Building from the KDE example above, we now show a
modern, high-dimensional density estimator. Figure 5 Ö T
builds a PixelCNN [49] as a fully-visible autoregressive p(x, z) = Normal(z1 | 0, I) Normal(zt | f (zt −1 ), 0.1)
distribution on images, which is a batch of 32 × 32 × 3 t =2
T
RGB pixel images from Small ImageNet. Ö
The variable x is the pixelCNN distribution. It makes Normal(xt | д(zt ), 0.5),
t =1
use of the higher-order Autoregressive distribution,
which takes as input a Python callable and number of where each time step in an observed real-valued sequence
autoregressive steps. The Python callable takes in cur- x = [x1 , . . . , xT ] ∈ RT ×D is associated with an un-
rently observed features and returns the per-time step observed state zt ∈ RK ; the initial latent variable z1
distribution. -tf.reduce_sum(x.log_prob(images)) is drawn randomly from a standard normal. The noise
is the loss function for maximum likelihood training; standard deviations are fixed and broadcasted over the
x.sample generates new images. batch. The latent variable and likelihood are parameter-
We also emphasize modularity. Note here, we used ized by neural networks.
the pixelCNN as a fully-visible distribution. This differs The program is generative: starting from a latent state,
from Figure 3 which employs pixelCNN as a decoder it unrolls state dynamics through time. Given this pro-
(conditional likelihood). gram and data, Edward’s algorithms enable approximate
10
inference (a second non-goal of TensorFlow Distributions).

5.4 TensorFlow Estimator API
As part of the TensorFlow ecosystem, TensorFlow Distributions complements other TensorFlow libraries. We show how it complements TensorFlow Estimator. Figure 7 demonstrates multivariate linear regression in the presence of heteroscedastic noise,

  U ∼ MultivariateNormal(0, I_d),
  Y = Σ^{1/2}(X) U + µ(X),

where Σ : R^d → {Z ∈ R^{d×d} : Z ⪰ 0}, µ : R^d → R^d, and Σ^{1/2} denotes the Cholesky decomposition. Adding more tf.layers to the parameterization of the MultivariateNormalTriL enables learning nonlinear transformations. (Σ = I_d would be appropriate in homoscedastic regimes.)

def mvn_regression_fn(
    features, labels, mode, params=None):
  d = features.shape[-1].value
  mvn = tfd.MultivariateNormalTriL(
      loc=tf.layers.dense(features, d),
      scale_tril=tfd.fill_triangular(
          tf.layers.dense(features, d*(d+1)/2)))
  if mode == tf.estimator.ModeKeys.PREDICT:
    return mvn.mean()
  loss = -tf.reduce_sum(mvn.log_prob(labels))
  if mode == tf.estimator.ModeKeys.EVAL:
    metric_fn = lambda x, y: \
        tf.metrics.mean_squared_error(x, y)
    return tpu_estimator.TPUEstimatorSpec(
        mode=mode,
        loss=loss,
        eval_metrics=(
            metric_fn, [labels, mvn.mean()]))
  optimizer = tf.train.AdamOptimizer()
  if FLAGS.use_tpu:
    optimizer = tpu_optimizer.CrossShardOptimizer(
        optimizer)
  train_op = optimizer.minimize(loss)
  return tpu_estimator.TPUEstimatorSpec(
      mode=mode, loss=loss, train_op=train_op)

# TPUEstimator boilerplate.
run_config = tpu_config.RunConfig(
    master=FLAGS.master,
    model_dir=FLAGS.model_dir,
    session_config=tf.ConfigProto(
        allow_soft_placement=True,
        log_device_placement=True),
    tpu_config=tpu_config.TPUConfig(
        FLAGS.iterations,
        FLAGS.num_shards))
estimator = tpu_estimator.TPUEstimator(
    model_fn=mvn_regression_fn,
    config=run_config,
    use_tpu=FLAGS.use_tpu,
    train_batch_size=FLAGS.batch_size,
    eval_batch_size=FLAGS.batch_size)

Figure 7. Multivariate regression with TPUs.

Using Distributions to build Estimators is natural and ergonomic. We use TPUEstimator in particular, which extends Estimator with configurations for TPUs [25]. Together, Distributions and Estimators provide a simple, scalable platform for efficiently deploying training, evaluation, and prediction over diverse hardware and network topologies.

Figure 7 only writes the Estimator object. For training, call estimator.train(); for evaluation, call estimator.evaluate(); for prediction, call estimator.predict(). Each takes an input function to load in data.

6 Discussion
The TensorFlow Distributions library implements a vision of probability theory adapted to the modern deep-learning paradigm of end-to-end differentiable computation. Distributions provides a collection of 56 distributions with fast, numerically stable methods for sampling, computing log densities, and many statistics. Bijectors provides a collection of 22 composable transformations with efficient volume-tracking and caching.

Although TensorFlow Distributions is relatively new, it has already seen widespread adoption both inside and outside of Google. External developers have built on it to design probabilistic programming and statistical systems including Edward [47] and Greta [17]. Further, Distribution and Bijector are being used as the design basis for similar functionality in the PyTorch computational graph framework [39], as well as the Pyro and ZhuSuan probabilistic programming systems [38, 43].

Looking forward, we plan to continue expanding the set of supported Distributions and Bijectors. We intend to expand the Distribution interface to include supports, e.g., real-valued, positive, unit interval, etc., as a class property. We are also exploring the possibility of exposing exponential family structure, for example providing separate unnormalized_log_prob and log_normalizer methods where appropriate.

As part of the trend towards hardware-accelerated linear algebra, we are working to ensure that all distribution and bijector methods are compatible with TPUs [25], including special functions such as gamma, as well as rejection sampling-based (e.g., Gamma) and while-loop based sampling mechanisms (e.g., Poisson). We also aim to natively support Distributions over SparseTensors.
Acknowledgments
We thank Jasper Snoek for feedback and comments, and Kevin Murphy for thoughtful discussions since the beginning of TensorFlow Distributions. DT is supported by a Google Ph.D. Fellowship in Machine Learning and an Adobe Research Fellowship.

References
[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. https://www.tensorflow.org/. Software available from tensorflow.org.
[2] Nathanael L Ackerman, Cameron E Freer, and Daniel M Roy. 2016. Exchangeable random primitives. In Workshop on Probabilistic Programming Semantics.
[3] David J Aldous. 1985. Exchangeability and related topics. In École d'Été de Probabilités de Saint-Flour XIII—1983. Springer, 1–198.
[4] Anonymous. 2017. Generative Models for Data Efficiency and Alignment in Language. OpenReview (2017).
[5] Justin Bayer and Christian Osendorfer. 2014. Learning Stochastic Recurrent Networks. arXiv preprint arXiv:1411.7610.
[6] Richard A Becker and John M Chambers. 1984. S: An Interactive Environment for Data Analysis and Graphics. CRC Press.
[7] David M Blei and John Lafferty. 2006. Correlated topic models. In Neural Information Processing Systems.
[8] G. E. P. Box and Mervin E. Muller. 1958. A Note on the Generation of Random Normal Deviates. The Annals of Mathematical Statistics (1958), 610–611.
[9] Bob Carpenter, Andrew Gelman, Matthew D Hoffman, Daniel Lee, Ben Goodrich, Michael Betancourt, Marcus Brubaker, Jiqiang Guo, Peter Li, and Allen Riddell. 2016. Stan: A probabilistic programming language. Journal of Statistical Software (2016).
[10] Bob Carpenter, Matthew D Hoffman, Marcus Brubaker, Daniel Lee, Peter Li, and Michael Betancourt. 2015. The Stan Math Library: Reverse-mode automatic differentiation in C++. arXiv preprint arXiv:1509.07164.
[11] Thomas M Cover and Joy A Thomas. 1991. Elements of Information Theory. Wiley Series in Telecommunications and Signal Processing.
[12] Imre Csiszár. 1963. Eine informationstheoretische Ungleichung und ihre Anwendung auf Beweis der Ergodizität von Markoffschen Ketten. Magyar Tud. Akad. Mat. Kutató Int. Közl. 8 (1963), 85–108.
[13] Bradley Efron and Robert J Tibshirani. 1994. An Introduction to the Bootstrap. CRC Press.
[14] M. C. Fu. 2006. Simulation. Handbook in Operations Research and Management Science, Vol. 13. North Holland.
[15] Dario Garcia-Garcia and Robert C Williamson. 2012. Divergences and risks for multiclass experiments. In Conference on Learning Theory.
[16] Andrew Gelman et al. 2006. Prior distributions for variance parameters in hierarchical models (comment on article by Browne and Draper). Bayesian Analysis 1, 3 (2006), 515–534.
[17] Nick Golding. 2017. greta: Simple and Scalable Statistical Modelling in R. https://CRAN.R-project.org/package=greta. R package version 0.2.0.
[18] Rafael Gómez-Bombarelli, David Duvenaud, José Miguel Hernández-Lobato, Jorge Aguilera-Iparraguirre, Timothy D Hirzel, Ryan P Adams, and Alán Aspuru-Guzik. 2016. Automatic chemical design using a data-driven continuous representation of molecules. arXiv preprint arXiv:1610.02415.
[19] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative Adversarial Nets. In Neural Information Processing Systems.
[20] Matthew D Hoffman and Andrew Gelman. 2014. The No-U-Turn Sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research 15 (2014), 1593–1623.
[21] Ross Ihaka and Robert Gentleman. 1996. R: A Language for Data Analysis and Graphics. Journal of Computational and Graphical Statistics 5, 3 (1996), 299–314. https://doi.org/10.1080/10618600.1996.10474713
[22] Eric Jang, Shixiang Gu, and Ben Poole. 2017. Categorical reparameterization with Gumbel-Softmax. In International Conference on Learning Representations.
[23] Matthew Johnson, David K Duvenaud, Alex Wiltschko, Ryan P Adams, and Sandeep R Datta. 2016. Composing graphical models with neural networks for structured representations and fast inference. In Neural Information Processing Systems.
[24] Eric Jones, Travis Oliphant, Pearu Peterson, et al. 2001. SciPy: Open source scientific tools for Python. http://www.scipy.org/
[25] Norman P Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, Rick Boyle, Pierre-luc Cantin, Clifford Chao, Chris Clark, Jeremy Coriell, Mike Daley, Matt Dau, Jeffrey Dean, Ben Gelb, Tara Vazir Ghaemmaghami, Rajendra Gottipati, William Gulland, Robert Hagmann, C Richard Ho, Doug Hogberg, John Hu, Robert Hundt, Dan Hurt, Julian Ibarz, Aaron Jaffey, Alek Jaworski, Alexander Kaplan, Harshit Khaitan, Andy Koch, Naveen Kumar, Steve Lacy, James Laudon, James Law, Diemthu Le, Chris Leary, Zhuyuan Liu, Kyle Lucke, Alan Lundin, Gordon MacKean, Adriana Maggiore, Maire Mahony, Kieran Miller, Rahul Nagarajan, Ravi Narayanaswami, Ray Ni, Kathy Nix, Thomas Norrie, Mark Omernick, Narayana Penukonda, Andy Phelps, Jonathan Ross, Matt Ross, Amir Salek, Emad Samadiani, Chris Severn, Gregory Sizikov, Matthew Snelham, Jed Souter, Dan Steinberg, Andy Swing, Mercedes Tan, Gregory Thorson, Bo Tian, Horia Toma, Erick Tuttle, Vijay Vasudevan, Richard Walter, Walter Wang, Eric Wilcox, and Doe Hyun Yoon. 2017. In-Datacenter Performance Analysis of a Tensor Processing Unit. arXiv preprint arXiv:1704.04760.
[26] Diederik P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. 2016. Improving Variational Inference with Inverse Autoregressive Flow. In Neural Information Processing Systems.
[27] Diederik P Kingma and Max Welling. 2014. Auto-Encoding Variational Bayes. In International Conference on Learning Representations.
[28] Alp Kucukelbir, Dustin Tran, Rajesh Ranganath, Andrew Gelman, and David M Blei. 2017. Automatic Differentiation Variational Inference. Journal of Machine Learning Research 18, 14 (2017), 1–45.
[29] Chris J. Maddison, Andriy Mnih, and Yee Whye Teh. 2017. The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables. In International Conference on Learning Representations.
[30] George Marsaglia and Wai Wan Tsang. 2000. A simple method for generating gamma variables. ACM Transactions on Mathematical Software (TOMS) 26, 3 (2000), 363–372.
[31] Tetsuzo Morimoto. 1963. Markov processes and the H-theorem. Journal of the Physical Society of Japan 18, 3 (1963), 328–331.
[32] Alfred Müller. 1997. Integral probability metrics and their generating classes of functions. Advances in Applied Probability 29, 2 (1997), 429–443.
[33] Kevin P Murphy. 2012. Machine Learning: A Probabilistic Perspective. MIT Press.
[34] Christian Naesseth, Francisco Ruiz, Scott Linderman, and David M Blei. 2017. Reparameterization Gradients through Acceptance-Rejection Sampling Algorithms. In Artificial Intelligence and Statistics.
[35] Harald Niederreiter. 1978. Quasi-Monte Carlo methods and pseudo-random numbers. Bull. Amer. Math. Soc. 84, 6 (1978), 957–1041.
[36] George Papamakarios, Theo Pavlakou, and Iain Murray. 2017. Masked Autoregressive Flow for Density Estimation. In Neural Information Processing Systems.
[37] Benjamin Peterson. [n. d.]. six: Python 2 and 3 compatibility utilities. https://github.com/benjaminp/six.
[38] Pyro Developers. 2017. Pyro. https://github.com/pyro/pyro.
[39] PyTorch Developers. 2017. PyTorch. https://github.com/pytorch/pytorch.
[40] Rajesh Ranganath, Sean Gerrish, and David Blei. 2014. Black box variational inference. In Artificial Intelligence and Statistics.
[41] Tim Salimans, Andrej Karpathy, Xi Chen, and Diederik P Kingma. 2017. PixelCNN++: Improving the PixelCNN with Discretized Logistic Mixture Likelihood and Other Modifications. In International Conference on Learning Representations.
[42] John Schulman, Nicolas Heess, Theophane Weber, and Pieter Abbeel. 2015. Gradient Estimation Using Stochastic Computation Graphs. In Neural Information Processing Systems.
[43] Jiaxin Shi, Jianfei Chen, Jun Zhu, Shengyang Sun, Yucen Luo, Yihong Gu, and Yuhao Zhou. 2017. ZhuSuan: A Library for Bayesian Deep Learning. arXiv preprint arXiv:1709.05870.
[44] Michael Spivak. [n. d.]. A Comprehensive Introduction to Differential Geometry, Vol. III. AMC 10 ([n. d.]), 12.
[45] Dustin Tran and David Blei. 2017. Implicit Causal Models for Genome-wide Association Studies. arXiv preprint arXiv:1710.10742.
[46] Dustin Tran, Matthew D Hoffman, Rif A Saurous, Eugene Brevdo, Kevin Murphy, and David M Blei. 2017. Deep Probabilistic Programming. In International Conference on Learning Representations.
[47] Dustin Tran, Alp Kucukelbir, Adji B Dieng, Maja Rudolph, Dawen Liang, and David M Blei. 2016. Edward: A library for probabilistic modeling, inference, and criticism. arXiv preprint arXiv:1610.09787.
[48] Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. 2016. WaveNet: A Generative Model for Raw Audio. arXiv preprint arXiv:1609.03499.
[49] Aäron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. 2016. Pixel recurrent neural networks. In International Conference on Machine Learning.
[50] Stéfan van der Walt, S Chris Colbert, and Gaël Varoquaux. 2011. The NumPy array: a structure for efficient numerical computation. Computing in Science & Engineering 13, 2 (2011), 22–30.
[51] Larry Wasserman. 2013. All of Statistics: A Concise Course in Statistical Inference. Springer Science & Business Media.
