TensorFlow Distributions

Joshua V. Dillon∗, Ian Langmore∗, Dustin Tran∗†, Eugene Brevdo∗, Srinivas Vasudevan∗,
Dave Moore∗, Brian Patton∗, Alex Alemi∗, Matt Hoffman∗, Rif A. Saurous∗
∗Google, †Columbia University
Abstract
The TensorFlow Distributions library implements a vision of probability theory adapted to the modern deep-learning paradigm of end-to-end differentiable computation.
The image_dist distribution is over 28 × 28 × 1-dimensional events (e.g., MNIST-resolution pixel images).

Mixture defines a probability mass function p(x) = Σ_{k=1}^{K} π_k p(x | k), where the mixing weights π_k are provided by a Categorical distribution as input, and the components p(x | k) are arbitrary Distributions with the same support. For components that are batches of the same family, MixtureSameFamily simplifies the construction: its mixture components are indexed along the rightmost batch dimension. Building on image_dist:

image_mixture = tfd.MixtureSameFamily(
    mixture_distribution=tfd.Categorical(
        probs=[0.2, 0.8]),
    components_distribution=image_dist)

image_mixture.batch_shape  # ==> []
image_mixture.event_shape  # ==> [28, 28, 1]

Here, MixtureSameFamily creates a mixture of two components with weights [0.2, 0.8]. The components are slices along the batch axis of image_dist.

3.6 Distribution Functionals

Functionals which take probability distribution(s) as input and return a scalar are ubiquitous. They include information measures such as entropy, cross entropy, and mutual information, as well as divergences such as the Kullback-Leibler divergence.

4 Bijectors

4.1 Motivation

1. We seek a minimally invasive interface for manipulating distributions. Implementing every transformation of every distribution results in a combinatorial blowup and is not realistically maintainable. Such a policy is unlikely to keep up with the pace of research. Lack of encapsulation exacerbates already complex ideas/code and discourages community contributions.

2. In deep learning, rich high-dimensional densities typically use invertible volume-preserving mappings or mappings with fast volume adjustments (namely, the logarithm of the Jacobian's determinant has linear complexity with respect to dimensionality) [36]. We'd like to efficiently and idiomatically support them.

By isolating stochasticity from determinism, Distributions are easy to design, implement, and validate. As we illustrate with the flexibility of TransformedDistribution in Section 3.5, the ability to simply swap out functions applied to the distribution is a surprisingly powerful asset. Programmatically, the Bijector distinction enables encapsulation and modular distribution constructions with inherited, fast method implementations. Statistically, Bijectors enable exploiting algebraic relationships among random variables.
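For example, the algebraic relationship Y = exp(X) between a normal X and a log-normal Y can be exploited directly, reusing the existing Normal rather than hand-coding a separate log-normal density. A minimal sketch, assuming the tfd/tfb module aliases used in the listings throughout:

# Illustrative sketch: a log-normal obtained by pushing a standard
# normal through the Exp bijector. log_prob is derived automatically
# from Normal.log_prob plus the Jacobian correction supplied by Exp.
log_normal = tfd.TransformedDistribution(
    distribution=tfd.Normal(loc=0., scale=1.),
    bijector=tfb.Exp())

y = log_normal.sample(3)
log_normal.log_prob(y)  # no hand-written log-normal density required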
4.2 Definition

To address Section 4.1, the Bijector API provides an interface for transformations of distributions suitable for any differentiable and bijective map (diffeomorphism) as well as certain non-injective maps (Section 4.5). Formally, given a random variable X and diffeomorphism F, we can define a new random variable Y whose density can be written in terms of X's,

    p_Y(y) = p_X(F^{-1}(y)) |DF^{-1}(y)|,    (2)

where DF^{-1} is the inverse of the Jacobian of F. Each Bijector subclass corresponds to such a function F, and TransformedDistribution uses the bijector to automate the details of the transformation Y = F(X)'s density (Equation 2). This allows us to define many new distributions in terms of existing ones.

A Bijector implements how to transform a Tensor and how the input Tensor's shape changes; this Tensor is presumed to be a random outcome possessing Distribution shape semantics. Three functions characterize how the Tensor is transformed:

1. forward implements x ↦ F(x), and is used by TransformedDistribution.sample to convert one random outcome into another. It also establishes the name of the bijector.

2. inverse undoes the transformation y ↦ F^{-1}(y) and is used by TransformedDistribution.log_prob.

3. inverse_log_det_jacobian computes log |DF^{-1}(y)| and is used by TransformedDistribution.log_prob to adjust for how the volume changes under the transformation. In certain settings, it is more numerically stable (or generally preferable) to implement the forward_log_det_jacobian. Because forward and reverse log ∘ det ∘ Jacobians differ only in sign, either (or both) may be implemented.

A Bijector also describes how it changes the Tensor's shape so that TransformedDistribution can implement functions that compute event and batch shapes. Most bijectors do not change the Tensor's shape. Those which do implement forward_event_shape_tensor and inverse_event_shape_tensor. Each takes an input shape (vector) and returns a new shape representing the Tensor's event/batch shape after the forward or inverse transformation. Excluding higher-order bijectors, currently only SoftmaxCentered changes the shape.⁷

⁷ To implement softmax(x) = exp(x)/Σ_i exp(x_i) as a diffeomorphism, its forward appends a zero to the event and its reverse strips this padding. The result is a bijective map, which sidesteps the fact that softmax(x) = exp(x − c)/Σ_i exp(x_i − c) for any c.

Using a Bijector, TransformedDistribution automatically and efficiently implements sample, log_prob, and prob. For bijectors with constant Jacobian such as Affine, TransformedDistribution automatically implements statistics such as mean, variance, and entropy. The following example implements an affine-transformed Laplace distribution.

vector_laplace = tfd.TransformedDistribution(
    distribution=tfd.Laplace(loc=0., scale=1.),
    bijector=tfb.Affine(
        shift=tf.Variable(tf.zeros(d)),
        scale_tril=tfd.fill_triangular(
            tf.Variable(tf.ones(d * (d + 1) // 2)))),
    event_shape=[d])

The distribution is learnable via tf.Variables, and the Affine is parameterized by what is essentially the Cholesky factor of the covariance matrix. This makes the multivariate construction computationally efficient and more numerically stable; bijector caching (Section 4.4) may even eliminate back substitution.

4.3 Composability

Bijectors can compose using higher-order Bijectors such as Chain and Invert. Figure 3 illustrates a powerful example where the arflow method composes a sequence of autoregressive and permutation Bijectors to compactly describe an autoregressive flow [26, 36].

The Chain bijector enables simple construction of rich Distributions. Below we construct a multivariate logit-Normal with matrix-shaped events.

matrix_logit_mvn = tfd.TransformedDistribution(
    distribution=tfd.Normal(0., 1.),
    bijector=tfb.Chain([
        tfb.Reshape([d, d]),
        tfb.SoftmaxCentered(),
        tfb.Affine(scale_diag=diag),
    ]),
    event_shape=[d * d])

The Invert bijector effectively doubles the number of bijectors by swapping forward and inverse. It is useful in situations such as the Gumbel construction in Section 3.5. It is also useful for transforming constrained continuous distributions onto an unconstrained real-valued space. For example:

softminus_gamma = tfd.TransformedDistribution(
    distribution=tfd.Gamma(
        concentration=alpha,
        rate=beta),
    bijector=tfb.Invert(tfb.Softplus()))
This performs a softplus-inversion to robustly transform Gamma to be unconstrained. It enables a key component of automated algorithms such as automatic differentiation variational inference [28] and the No U-Turn Sampler [20]. These algorithms operate only on real-valued spaces, so unconstraining transformations expand their scope.

4.4 Caching

Bijectors automatically cache input/output pairs of operations, including the log ∘ det ∘ Jacobian. This is advantageous when the inverse calculation is slow, numerically unstable, or not easily implementable. A cache hit occurs when computing the probability of results of sample. That is, if q(x) is the distribution associated with x = f(ε) and ε ∼ r, then caching lowers the cost of computing q(x_i) since

    q(x_i) = r((f^{-1} ∘ f)(ε_i)) |∂/∂ε (f^{-1} ∘ f)(ε_i)|^{-1} = r(ε_i).

Because TensorFlow follows a deferred execution model, Bijector caching is nominal; it has zero memory or computational cost. The Bijector base class merely replaces one graph node with another already existing node. Since the existing node ("cache hit") is already an execution dependency, the only cost of "caching" is during graph construction.

Caching is computationally and numerically beneficial for importance sampling algorithms, which compute expectations by weighting each drawn sample by its reciprocal probability. Namely,

    μ = ∫ f(x) p(x) dx
      = ∫ [f(x) p(x) / q(x)] q(x) dx
      = lim_{n→∞} n^{-1} Σ_{i=1}^{n} f(x_i) p(x_i) / q(x_i),  where x_i ∼ q i.i.d.

Caching also nicely complements black-box variational inference algorithms [28, 40]. In these procedures, the approximate posterior distribution only computes log_prob over its own samples. In this setting the sample's preimage (ε_i) is known without computing f^{-1}(x_i).

MultivariateNormalTriL is implemented as a TransformedDistribution with the Affine bijector. Caching removes the need for quadratic-complexity back substitution. For an InverseAutoregressiveFlow [26],

laplace_iaf = tfd.TransformedDistribution(
    distribution=tfd.Laplace(loc=0., scale=1.),
    bijector=tfb.Invert(
        tfb.MaskedAutoregressiveFlow(
            shift_and_log_scale_fn=tfb.masked_autoregressive_default_template(
                hidden_layers))),
    event_shape=[d])

caching reduces the overall complexity from quadratic to linear (in event size).

4.5 Smooth Coverings

The Bijector framework extends to non-injective transformations, i.e., smooth coverings [44].⁸ Formally, a smooth covering is a continuous function F on the entire domain D where, ignoring sets of measure zero, the domain can be partitioned as a finite union D = D_1 ∪ · · · ∪ D_K such that each restriction F : D_k → F(D) is a diffeomorphism. Examples include AbsValue and Square. We implement them by having the inverse method return the set inverse F^{-1}(y) := {x : F(x) = y} as a tuple.

⁸ Bijector caching is currently not supported for smooth coverings.

Smooth covering Bijectors let us easily build half-distributions, which allocate probability mass over only the positive half-plane of the original distribution. For example, we build a half-Cauchy distribution as

half_cauchy = tfd.TransformedDistribution(
    bijector=tfb.AbsValue(),
    distribution=tfd.Cauchy(loc=0., scale=1.))

The half-Cauchy and half-Student t distributions are often used as "weakly informative" priors, which exhibit heavy tails, for variance parameters in hierarchical models [16].

5 Applications

We described two abstractions: Distributions and Bijectors. Recall Figures 1 to 3, where we showed the power of combining these abstractions for moving from simple to state-of-the-art variational auto-encoders. Below we show additional applications of TensorFlow Distributions as part of the TensorFlow ecosystem.

5.1 Kernel Density Estimators

A kernel density estimator (KDE) is a nonparametric estimator of an unknown probability distribution [51]. Kernel density estimation provides a fundamental smoothing operation on finite samples that is useful across many applications. With TensorFlow Distributions, KDEs can be flexibly constructed as a MixtureSameFamily. Given a finite set of points x in R^D, we write

f = lambda x: tfd.Independent(tfd.Normal(
    loc=x, scale=1.))
n = x.shape[0].value

kde = tfd.MixtureSameFamily(
    mixture_distribution=tfd.Categorical(
        probs=[1 / n] * n),
    components_distribution=f(x))

Here, f is a callable taking x as input and returning a distribution. Above, we use an independent D-dimensional Normal centered at each point of x with unit scale (the bandwidth), mixing the n components with equal weight 1/n.
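The resulting kde behaves like any other Distribution. A brief usage sketch, assuming (as above) that x holds n points in R^D and that D is known; the query point below is purely illustrative:

# Sketch: querying and sampling the KDE like any other Distribution.
query = tf.zeros([D])              # an illustrative query point in R^D
log_density = kde.log_prob(query)  # log of the equally weighted kernel mixture
draws = kde.sample(1000)           # shape [1000, D]: pick a data point, add Normal noise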