
1 - Early Work - Deep Unsupervised Learning

This paper proposes a new method called diffusion probabilistic models for defining highly flexible and tractable generative models. The method uses a Markov chain to gradually transform one distribution into a target distribution, and learns the reverse transformation. This allows for exact sampling, cheap likelihood evaluation, and tractable computation of posteriors. The paper demonstrates the approach on several image datasets.


Deep Unsupervised Learning using Nonequilibrium Thermodynamics

Jascha Sohl-Dickstein (JASCHA@STANFORD.EDU), Stanford University
Eric A. Weiss (EWEISS@BERKELEY.EDU), University of California, Berkeley
Niru Maheswaranathan (NIRUM@STANFORD.EDU), Stanford University
Surya Ganguli (SGANGULI@STANFORD.EDU), Stanford University

Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 2015. JMLR: W&CP volume 37. Copyright 2015 by the author(s).

Abstract

A central problem in machine learning involves modeling complex data-sets using highly flexible families of probability distributions in which learning, sampling, inference, and evaluation are still analytically or computationally tractable. Here, we develop an approach that simultaneously achieves both flexibility and tractability. The essential idea, inspired by non-equilibrium statistical physics, is to systematically and slowly destroy structure in a data distribution through an iterative forward diffusion process. We then learn a reverse diffusion process that restores structure in data, yielding a highly flexible and tractable generative model of the data. This approach allows us to rapidly learn, sample from, and evaluate probabilities in deep generative models with thousands of layers or time steps, as well as to compute conditional and posterior probabilities under the learned model. We additionally release an open source reference implementation of the algorithm.

1. Introduction

Historically, probabilistic models suffer from a tradeoff between two conflicting objectives: tractability and flexibility. Models that are tractable can be analytically evaluated and easily fit to data (e.g. a Gaussian or Laplace). However, these models are unable to aptly describe structure in rich datasets. On the other hand, models that are flexible can be molded to fit structure in arbitrary data. For example, we can define models in terms of any (non-negative) function $\phi(x)$ yielding the flexible distribution $p(x) = \frac{\phi(x)}{Z}$, where $Z$ is a normalization constant. However, computing this normalization constant is generally intractable. Evaluating, training, or drawing samples from such flexible models typically requires a very expensive Monte Carlo process.

A variety of analytic approximations exist which ameliorate, but do not remove, this tradeoff, for instance mean field theory and its expansions (T, 1982; Tanaka, 1998), variational Bayes (Jordan et al., 1999), contrastive divergence (Welling & Hinton, 2002; Hinton, 2002), minimum probability flow (Sohl-Dickstein et al., 2011b;a), minimum KL contraction (Lyu, 2011), proper scoring rules (Gneiting & Raftery, 2007; Parry et al., 2012), score matching (Hyvärinen, 2005), pseudolikelihood (Besag, 1975), loopy belief propagation (Murphy et al., 1999), and many, many more. Non-parametric methods (Gershman & Blei, 2012) can also be very effective¹.

¹ Non-parametric methods can be seen as transitioning smoothly between tractable and flexible models. For instance, a non-parametric Gaussian mixture model will represent a small amount of data using a single Gaussian, but may represent infinite data as a mixture of an infinite number of Gaussians.

1.1. Diffusion probabilistic models

We present a novel way to define probabilistic models that allows:

1. extreme flexibility in model structure,
2. exact sampling,

3. easy multiplication with other distributions, e.g. in order to compute a posterior, and
4. the model log likelihood, and the probability of individual states, to be cheaply evaluated.

Our method uses a Markov chain to gradually convert one distribution into another, an idea used in non-equilibrium statistical physics (Jarzynski, 1997) and sequential Monte Carlo (Neal, 2001). We build a generative Markov chain which converts a simple known distribution (e.g. a Gaussian) into a target (data) distribution using a diffusion process. Rather than use this Markov chain to approximately evaluate a model which has been otherwise defined, we explicitly define the probabilistic model as the endpoint of the Markov chain. Since each step in the diffusion chain has an analytically evaluable probability, the full chain can also be analytically evaluated.

Learning in this framework involves estimating small perturbations to a diffusion process. Estimating small perturbations is more tractable than explicitly describing the full distribution with a single, non-analytically-normalizable, potential function. Furthermore, since a diffusion process exists for any smooth target distribution, this method can capture data distributions of arbitrary form.

We demonstrate the utility of these diffusion probabilistic models by training high log likelihood models for a two-dimensional swiss roll, binary sequence, handwritten digit (MNIST), and several natural image (CIFAR-10, bark, and dead leaves) datasets.

1.2. Relationship to other work

The wake-sleep algorithm (Hinton, 1995; Dayan et al., 1995) introduced the idea of training inference and generative probabilistic models against each other. This approach remained largely unexplored for nearly two decades, though with some exceptions (Sminchisescu et al., 2006; Kavukcuoglu et al., 2010). There has been a recent explosion of work developing this idea. In (Kingma & Welling, 2013; Gregor et al., 2013; Rezende et al., 2014; Ozair & Bengio, 2014) variational learning and inference algorithms were developed which allow a flexible generative model and posterior distribution over latent variables to be directly trained against each other.

The variational bound in these papers is similar to the one used in our training objective and in the earlier work of (Sminchisescu et al., 2006). However, our motivation and model form are both quite different, and the present work retains the following differences and advantages relative to these techniques:

1. We develop our framework using ideas from physics, quasi-static processes, and annealed importance sampling rather than from variational Bayesian methods.
2. We show how to easily multiply the learned distribution with another probability distribution (e.g. with a conditional distribution in order to compute a posterior).
3. We address the difficulty that training the inference model can prove particularly challenging in variational inference methods, due to the asymmetry in the objective between the inference and generative models. We restrict the forward (inference) process to a simple functional form, in such a way that the reverse (generative) process will have the same functional form.
4. We train models with thousands of layers (or time steps), rather than only a handful of layers.
5. We provide upper and lower bounds on the entropy production in each layer (or time step).

There are a number of related techniques for training probabilistic models (summarized below) that develop highly flexible forms for generative models, train stochastic trajectories, or learn the reversal of a Bayesian network. Reweighted wake-sleep (Bornschein & Bengio, 2015) develops extensions and improved learning rules for the original wake-sleep algorithm. Generative stochastic networks (Bengio & Thibodeau-Laufer, 2013; Yao et al., 2014) train a Markov kernel to match its equilibrium distribution to the data distribution. Neural autoregressive distribution estimators (Larochelle & Murray, 2011) (and their recurrent (Uria et al., 2013a) and deep (Uria et al., 2013b) extensions) decompose a joint distribution into a sequence of tractable conditional distributions over each dimension. Adversarial networks (Goodfellow et al., 2014) train a generative model against a classifier which attempts to distinguish generated samples from true data. A similar objective in (Schmidhuber, 1992) learns a two-way mapping to a representation with marginally independent units. In (Rippel & Adams, 2013; Dinh et al., 2014) bijective deterministic maps are learned to a latent representation with a simple factorial density function. In (Stuhlmüller et al., 2013) stochastic inverses are learned for Bayesian networks. Mixtures of conditional Gaussian scale mixtures (MCGSMs) (Theis et al., 2012) describe a dataset using Gaussian scale mixtures, with parameters which depend on a sequence of causal neighborhoods. There is additionally significant work learning flexible generative mappings from simple latent distributions to data distributions, with early examples including (MacKay, 1995) where neural networks are introduced as generative models, and (Bishop et al., 1998) where a stochastic manifold mapping is learned from a latent space to the data space. We will compare experimentally against adversarial networks and MCGSMs.

Related ideas from physics include the Jarzynski equality (Jarzynski, 1997), known in machine learning as Annealed Importance Sampling (AIS) (Neal, 2001), which uses a Markov chain which slowly converts one distribution into another to compute a ratio of normalizing constants. In (Burda et al., 2014) it is shown that AIS can also be performed using the reverse rather than forward trajectory. Langevin dynamics (Langevin, 1908), which are the stochastic realization of the Fokker-Planck equation, show how to define a Gaussian diffusion process which has any target distribution as its equilibrium. In (Suykens & Vandewalle, 1995) the Fokker-Planck equation is used to perform stochastic optimization. Finally, the Kolmogorov forward and backward equations (Feller, 1949) show that forward and reverse diffusion processes can be described using the same functional form. The Kolmogorov forward equation corresponds to the Fokker-Planck equation, while the Kolmogorov backward equation describes the time-reversal of this diffusion process, but requires knowing gradients of the density function as a function of time.

[Figure 1 panels: columns show $t = 0$, $t = \frac{T}{2}$, and $t = T$; rows show $q(x^{(0\cdots T)})$, $p(x^{(0\cdots T)})$, and $f_\mu(x^{(t)}, t) - x^{(t)}$.]

Figure 1. The proposed modeling framework trained on 2-d swiss roll data. The top row shows time slices from the forward trajectory $q(x^{(0\cdots T)})$. The data distribution (left) undergoes Gaussian diffusion, which gradually transforms it into an identity-covariance Gaussian (right). The middle row shows the corresponding time slices from the trained reverse trajectory $p(x^{(0\cdots T)})$. An identity-covariance Gaussian (right) undergoes a Gaussian diffusion process with learned mean and covariance functions, and is gradually transformed back into the data distribution (left). The bottom row shows the drift term, $f_\mu(x^{(t)}, t) - x^{(t)}$, for the same reverse diffusion process.

2. Algorithm

Our goal is to define a forward (or inference) diffusion process which converts any complex data distribution into a simple, tractable, distribution, and then learn a finite-time reversal of this diffusion process which defines our generative model distribution (See Figure 1). We first describe the forward, inference diffusion process. We then show how the reverse, generative diffusion process can be trained and used to evaluate probabilities. We also derive entropy bounds for the reverse process, and show how the learned distributions can be multiplied by any second distribution (e.g. as would be done to compute a posterior when inpainting or denoising an image).

2.1. Forward Trajectory

We label the data distribution $q(x^{(0)})$. The data distribution is gradually converted into a well behaved (analytically tractable) distribution $\pi(y)$ by repeated application of a Markov diffusion kernel $T_\pi(y|y'; \beta)$ for $\pi(y)$, where $\beta$ is the diffusion rate,

$$\pi(y) = \int dy'\, T_\pi(y|y'; \beta)\, \pi(y') \tag{1}$$

$$q(x^{(t)}|x^{(t-1)}) = T_\pi(x^{(t)}|x^{(t-1)}; \beta_t). \tag{2}$$
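As a concrete illustration of Equations 1 and 2, the sketch below applies a Gaussian diffusion kernel repeatedly to one-dimensional toy data. It assumes the kernel form $q(x^{(t)}|x^{(t-1)}) = \mathcal{N}(x^{(t)}; x^{(t-1)}\sqrt{1-\beta_t}, \beta_t I)$, which leaves a unit Gaussian $\pi$ invariant. The function names, the constant schedule, and the toy data are illustrative choices and are not taken from the reference implementation.

```python
import numpy as np


def gaussian_diffusion_step(x_prev, beta_t, rng):
    """Draw x^(t) ~ N(sqrt(1 - beta_t) * x^(t-1), beta_t * I) (assumed kernel form)."""
    noise = rng.standard_normal(x_prev.shape)
    return np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * noise


def forward_trajectory(x0, betas, rng):
    """Return [x^(0), x^(1), ..., x^(T)] by applying the kernel once per beta_t."""
    xs = [x0]
    for beta_t in betas:
        xs.append(gaussian_diffusion_step(xs[-1], beta_t, rng))
    return xs


rng = np.random.default_rng(0)
# Toy bimodal "data distribution" q(x^(0)): points clustered near -2 and +2.
x0 = np.concatenate([rng.normal(-2.0, 0.1, 5000), rng.normal(2.0, 0.1, 5000)])
betas = np.full(200, 0.05)  # hypothetical constant schedule, T = 200
xs = forward_trajectory(x0, betas, rng)
x_T = xs[-1]
# After many steps the samples should look like a unit Gaussian.
print(f"mean={x_T.mean():.3f}  var={x_T.var():.3f}")
```

With this schedule the structure of the two clusters is almost entirely erased by the final step, so the empirical mean and variance of `x_T` should land near 0 and 1.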
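The same forward construction applies to binary data. The sketch below assumes the binomial kernel $q(x^{(t)}|x^{(t-1)}) = \mathcal{B}(x^{(t)}; x^{(t-1)}(1-\beta_t) + 0.5\beta_t)$, so each step keeps a bit with weight $1-\beta_t$ and mixes in a fair coin with weight $\beta_t$. With the schedule $\beta_t = (T-t+1)^{-1}$ that Section 2.4.1 gives for binomial diffusion, the fraction of the original signal surviving $t$ steps is $\prod_{s \le t}(1-\beta_s) = (T-t)/T$, so a constant $1/T$ is erased per step. Function names and the value of $T$ are illustrative.

```python
import numpy as np


def binomial_schedule(T):
    """beta_t = 1 / (T - t + 1) for t = 1..T."""
    t = np.arange(1, T + 1)
    return 1.0 / (T - t + 1)


def surviving_signal(betas):
    """Fraction of the original signal left after each step: prod_{s<=t} (1 - beta_s)."""
    return np.cumprod(1.0 - betas)


def binomial_diffusion_step(x_prev, beta_t, rng):
    """Sample x^(t) ~ Bernoulli(x^(t-1) * (1 - beta_t) + 0.5 * beta_t) (assumed kernel form)."""
    p_one = x_prev * (1.0 - beta_t) + 0.5 * beta_t
    return (rng.random(x_prev.shape) < p_one).astype(np.int64)


T = 10
betas = binomial_schedule(T)
signal = surviving_signal(betas)
for t, s in enumerate(signal, start=1):
    print(f"t={t:2d}  beta_t={betas[t - 1]:.3f}  surviving signal={s:.3f}")

# Diffuse a 'heartbeat' sequence (a 1 in every 5th bin) all the way to noise.
rng = np.random.default_rng(0)
x = np.tile([1, 0, 0, 0, 0], 4)
for beta_t in betas:
    x = binomial_diffusion_step(x, beta_t, rng)
```

Because $\beta_T = 1$, the final step draws every bit from a fair coin regardless of its previous value, which matches the independent binomial noise used to initialize the reverse trajectory.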
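[Figure 2 panels: columns show $t = 0$, $t = \frac{T}{2}$, and $t = T$; each panel shows samples from the reverse trajectory $p(x^{(0\cdots T)})$.]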

Figure 2. Binary sequence learning via binomial diffusion. A binomial diffusion model was trained on binary ‘heartbeat’ data, where a
pulse occurs every 5th bin. Generated samples (left) are identical to the training data. The sampling procedure consists of initialization
at independent binomial noise (right), which is then transformed into the data distribution by a binomial diffusion process, with trained
bit flip probabilities. Each row contains an independent sample. For ease of visualization, all samples have been shifted so that a pulse
occurs in the first column. In the raw sequence data, the first pulse is uniformly distributed over the first five bins.

(a) (b)

Figure 3. The proposed framework trained on the CIFAR-10 (Krizhevsky & Hinton, 2009) dataset. (a) Example training data. (b)
Random samples generated by the diffusion model.

The forward trajectory, corresponding to starting at the data distribution and performing $T$ steps of diffusion, is thus

$$q(x^{(0\cdots T)}) = q(x^{(0)}) \prod_{t=1}^{T} q(x^{(t)}|x^{(t-1)}) \tag{3}$$

For the experiments shown below, $q(x^{(t)}|x^{(t-1)})$ corresponds to either Gaussian diffusion into a Gaussian distribution with identity-covariance, or binomial diffusion into an independent binomial distribution. Table C.1 gives the diffusion kernels for both Gaussian and binomial distributions.

2.2. Reverse Trajectory

The generative distribution will be trained to describe the same trajectory, but in reverse,

$$p(x^{(T)}) = \pi(x^{(T)}) \tag{4}$$

$$p(x^{(0\cdots T)}) = p(x^{(T)}) \prod_{t=1}^{T} p(x^{(t-1)}|x^{(t)}). \tag{5}$$

For both Gaussian and binomial diffusion, for continuous diffusion (limit of small step size $\beta$) the reversal of the diffusion process has the identical functional form as the forward process (Feller, 1949). Since $q(x^{(t)}|x^{(t-1)})$ is a Gaussian (binomial) distribution, and if $\beta_t$ is small, then $q(x^{(t-1)}|x^{(t)})$ will also be a Gaussian (binomial) distribution. The longer the trajectory the smaller the diffusion rate $\beta$ can be made.

During learning only the mean and covariance for a Gaussian diffusion kernel, or the bit flip probability for a binomial kernel, need be estimated. As shown in Table C.1, $f_\mu(x^{(t)}, t)$ and $f_\Sigma(x^{(t)}, t)$ are functions defining the mean and covariance of the reverse Markov transitions for a Gaussian, and $f_b(x^{(t)}, t)$ is a function providing the bit flip probability for a binomial distribution. The computational cost of running this algorithm is the cost of these functions, times the number of time-steps. For all results in this paper, multi-layer perceptrons are used to define these functions. A wide range of regression or function fitting techniques would be applicable however, including nonparametric methods.

2.3. Model Probability

The probability the generative model assigns to the data is

$$p(x^{(0)}) = \int dx^{(1\cdots T)}\, p(x^{(0\cdots T)}). \tag{6}$$
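Ancestral sampling from Equations 4 and 5 can be sketched for a case where the reverse kernel is known analytically. Assume one-dimensional data $q(x^{(0)}) = \mathcal{N}(m, s^2)$ and the Gaussian forward kernel $\mathcal{N}(x^{(t)}; \sqrt{1-\beta_t}\,x^{(t-1)}, \beta_t)$. Every marginal $q(x^{(t)})$ is then Gaussian, and Bayes' rule gives the exact reverse kernel $q(x^{(t-1)}|x^{(t)})$ in closed form. In this sketch those closed-form expressions stand in for the learned functions $f_\mu$ and $f_\Sigma$, which the paper instead fits with a neural network; all names and constants below are illustrative.

```python
import numpy as np


def forward_marginals(m, s2, betas):
    """Mean and variance of q(x^(t)) for t = 0..T when q(x^(0)) = N(m, s2)."""
    means, variances = [m], [s2]
    for beta in betas:
        means.append(np.sqrt(1.0 - beta) * means[-1])
        variances.append((1.0 - beta) * variances[-1] + beta)
    return np.array(means), np.array(variances)


def reverse_kernel(x_t, t, betas, means, variances):
    """Exact Gaussian q(x^(t-1) | x^(t)); plays the role of f_mu and f_Sigma."""
    beta = betas[t - 1]
    a = np.sqrt(1.0 - beta)
    post_var = 1.0 / (1.0 / variances[t - 1] + a**2 / beta)
    post_mean = post_var * (means[t - 1] / variances[t - 1] + a * x_t / beta)
    return post_mean, post_var


def sample_reverse(n, betas, means, variances, rng):
    """Start from pi = N(0, 1) at t = T and step back to t = 0 (Equations 4 and 5)."""
    x = rng.standard_normal(n)
    for t in range(len(betas), 0, -1):
        mu, var = reverse_kernel(x, t, betas, means, variances)
        x = mu + np.sqrt(var) * rng.standard_normal(n)
    return x


rng = np.random.default_rng(0)
m, s2 = 3.0, 0.25  # toy data distribution N(3, 0.5^2)
betas = np.full(200, 0.05)  # hypothetical schedule, T = 200
means, variances = forward_marginals(m, s2, betas)
x0 = sample_reverse(20000, betas, means, variances, rng)
# The recovered samples should match the data distribution's mean and std.
print(f"sample mean={x0.mean():.3f}  sample std={x0.std():.3f}")
```

The cost of drawing a sample is one evaluation of the reverse kernel per time step, which mirrors the statement in Section 2.2 that the computational cost is the cost of these functions times the number of time steps.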

Naively this integral is intractable, but taking a cue from annealed importance sampling and the Jarzynski equality, we instead evaluate the relative probability of the forward and reverse trajectories, averaged over forward trajectories,

$$p(x^{(0)}) = \int dx^{(1\cdots T)}\, p(x^{(0\cdots T)})\, \frac{q(x^{(1\cdots T)}|x^{(0)})}{q(x^{(1\cdots T)}|x^{(0)})} \tag{7}$$

$$= \int dx^{(1\cdots T)}\, q(x^{(1\cdots T)}|x^{(0)})\, \frac{p(x^{(0\cdots T)})}{q(x^{(1\cdots T)}|x^{(0)})} \tag{8}$$

$$= \int dx^{(1\cdots T)}\, q(x^{(1\cdots T)}|x^{(0)}) \cdot p(x^{(T)}) \prod_{t=1}^{T} \frac{p(x^{(t-1)}|x^{(t)})}{q(x^{(t)}|x^{(t-1)})}. \tag{9}$$

This can be evaluated rapidly by averaging over samples from the forward trajectory $q(x^{(1\cdots T)}|x^{(0)})$. For infinitesimal $\beta$ the forward and reverse distribution over trajectories can be made identical (see Section 2.2). If they are identical then only a single sample from $q(x^{(1\cdots T)}|x^{(0)})$ is required to exactly evaluate the above integral, as can be seen by substitution. This corresponds to the case of a quasi-static process in statistical physics (Spinney & Ford, 2013; Jarzynski, 2011).

2.4. Training

Training amounts to maximizing the model log likelihood,

$$L = \int dx^{(0)}\, q(x^{(0)}) \log p(x^{(0)}) \tag{10}$$

$$= \int dx^{(0)}\, q(x^{(0)}) \cdot \log\left[ \int dx^{(1\cdots T)}\, q(x^{(1\cdots T)}|x^{(0)}) \cdot p(x^{(T)}) \prod_{t=1}^{T} \frac{p(x^{(t-1)}|x^{(t)})}{q(x^{(t)}|x^{(t-1)})} \right], \tag{11}$$

which has a lower bound provided by Jensen's inequality,

$$L \ge \int dx^{(0\cdots T)}\, q(x^{(0\cdots T)}) \cdot \log\left[ p(x^{(T)}) \prod_{t=1}^{T} \frac{p(x^{(t-1)}|x^{(t)})}{q(x^{(t)}|x^{(t-1)})} \right]. \tag{12}$$

As described in Appendix B, for our diffusion trajectories this reduces to,

$$L \ge K \tag{13}$$

$$K = -\sum_{t=2}^{T} \int dx^{(0)}\, dx^{(t)}\, q(x^{(0)}, x^{(t)}) \cdot D_{KL}\left( q(x^{(t-1)}|x^{(t)}, x^{(0)}) \,\Big\|\, p(x^{(t-1)}|x^{(t)}) \right) + H_q(X^{(T)}|X^{(0)}) - H_q(X^{(1)}|X^{(0)}) - H_p(X^{(T)}), \tag{14}$$

where the entropies and KL divergences can be analytically computed. The derivation of this bound parallels the derivation of the log likelihood bound in variational Bayesian methods.

As in Section 2.3 if the forward and reverse trajectories are identical, corresponding to a quasi-static process, then the inequality in Equation 13 becomes an equality.

Training consists of finding the reverse Markov transitions which maximize this lower bound on the log likelihood,

$$\hat{p}(x^{(t-1)}|x^{(t)}) = \underset{p(x^{(t-1)}|x^{(t)})}{\operatorname{argmax}}\; K. \tag{15}$$

The specific targets of estimation for Gaussian and binomial diffusion are given in Table C.1.

Thus, the task of estimating a probability distribution has been reduced to the task of performing regression on the functions which set the mean and covariance of a sequence of Gaussians (or set the state flip probability for a sequence of Bernoulli trials).

2.4.1. Setting the Diffusion Rate $\beta_t$

The choice of $\beta_t$ in the forward trajectory is important for the performance of the trained model. In AIS, the right schedule of intermediate distributions can greatly improve the accuracy of the log partition function estimate (Grosse et al., 2013). In thermodynamics the schedule taken when moving between equilibrium distributions determines how much free energy is lost (Spinney & Ford, 2013; Jarzynski, 2011).

In the case of Gaussian diffusion, we learn² the forward diffusion schedule $\beta_{2\cdots T}$ by gradient ascent on $K$. The variance $\beta_1$ of the first step is fixed to a small constant to prevent overfitting. The dependence of samples from $q(x^{(1\cdots T)}|x^{(0)})$ on $\beta_{1\cdots T}$ is made explicit by using 'frozen noise': as in (Kingma & Welling, 2013) the noise is treated as an additional auxiliary variable, and held constant while computing partial derivatives of $K$ with respect to the parameters.

For binomial diffusion, the discrete state space makes gradient ascent with frozen noise impossible. We instead choose the forward diffusion schedule $\beta_{1\cdots T}$ to erase a constant fraction $\frac{1}{T}$ of the original signal per diffusion step, yielding a diffusion rate of $\beta_t = (T - t + 1)^{-1}$.

² Recent experiments suggest that it is just as effective to instead use the same fixed $\beta_t$ schedule as for binomial diffusion.

[Figure 4 panels: (a), (b), (c).]

Figure 4. The proposed framework trained on dead leaf images (Jeulin, 1997; Lee et al., 2001). (a) Example training image. (b) A sample from the previous state of the art natural image model (Theis et al., 2012) trained on identical data, reproduced here with permission. (c) A sample generated by the diffusion model. Note that it demonstrates fairly consistent occlusion relationships, displays a multiscale distribution over object sizes, and produces circle-like objects, especially at smaller scales. As shown in Table 2, the diffusion model has the highest log likelihood on the test set.

2.5. Multiplying Distributions, and Computing Posteriors

Tasks such as computing a posterior in order to do signal denoising or inference of missing values require multiplication of the model distribution $p(x^{(0)})$ with a second distribution, or bounded positive function, $r(x^{(0)})$, producing a new distribution $\tilde{p}(x^{(0)}) \propto p(x^{(0)})\, r(x^{(0)})$.

Multiplying distributions is costly and difficult for many techniques, including variational autoencoders, GSNs, NADEs, and most graphical models. However, under a diffusion model it is straightforward, since the second distribution can be treated either as a small perturbation to each step in the diffusion process, or often exactly multiplied into each diffusion step. Figure 5 demonstrates the use of a diffusion model to perform inpainting of a natural image. The following sections describe how to multiply distributions in the context of diffusion probabilistic models.

2.5.1. Modified Marginal Distributions

First, in order to compute $\tilde{p}(x^{(0)})$, we multiply each of the intermediate distributions by a corresponding function $r(x^{(t)})$. We use a tilde above a distribution or Markov transition to denote that it belongs to a trajectory that has been modified in this way. $\tilde{q}(x^{(0\cdots T)})$ is the modified forward trajectory, which starts at the distribution $\tilde{q}(x^{(0)}) = \frac{1}{\tilde{Z}_0} q(x^{(0)})\, r(x^{(0)})$ and proceeds through the sequence of intermediate distributions

$$\tilde{q}(x^{(t)}) = \frac{1}{\tilde{Z}_t}\, q(x^{(t)})\, r(x^{(t)}), \tag{16}$$

where $\tilde{Z}_t$ is the normalizing constant for the $t$th intermediate distribution.

2.5.2. Modified Conditional Distributions

Next, writing the relationship between the forward and reverse conditional distributions demonstrates how multiplying each intermediate distribution by $r(x^{(t)})$ changes the Markov diffusion chain. By Bayes' rule the forward chain presented in Section 2.1 satisfies

$$q(x^{(t+1)}|x^{(t)})\, q(x^{(t)}) = q(x^{(t)}|x^{(t+1)})\, q(x^{(t+1)}). \tag{17}$$

The new chain must instead satisfy

$$\tilde{q}(x^{(t+1)}|x^{(t)})\, \tilde{q}(x^{(t)}) = \tilde{q}(x^{(t)}|x^{(t+1)})\, \tilde{q}(x^{(t+1)}). \tag{18}$$

As derived in Appendix C, one way to choose a new Markov chain which satisfies Equation 18 is to set

$$\tilde{q}(x^{(t+1)}|x^{(t)}) \propto q(x^{(t+1)}|x^{(t)})\, r(x^{(t+1)}), \tag{19}$$

$$\tilde{q}(x^{(t)}|x^{(t+1)}) \propto q(x^{(t)}|x^{(t+1)})\, r(x^{(t)}). \tag{20}$$

So that $\tilde{p}(x^{(t)}|x^{(t+1)})$ corresponds to $\tilde{q}(x^{(t)}|x^{(t+1)})$, $p(x^{(t)}|x^{(t+1)})$ is modified in the corresponding fashion,

$$\tilde{p}(x^{(t)}|x^{(t+1)}) \propto p(x^{(t)}|x^{(t+1)})\, r(x^{(t)}). \tag{21}$$

2.5.3. Applying $r(x^{(t)})$

If $r(x^{(t)})$ is sufficiently smooth, then it can be treated as a small perturbation to the reverse diffusion kernel $p(x^{(t)}|x^{(t+1)})$. In this case $\tilde{p}(x^{(t)}|x^{(t+1)})$ will have an identical functional form to $p(x^{(t)}|x^{(t+1)})$, but with perturbed mean and covariance for the Gaussian kernel, or with perturbed flip rate for the binomial kernel. The perturbed diffusion kernels are given in Table C.1.

If $r(x^{(t)})$ can be multiplied with a Gaussian (or binomial) distribution in closed form, then it can be directly multiplied with the reverse diffusion kernel $p(x^{(t)}|x^{(t+1)})$ in closed form, and need not be treated as a perturbation. This

applies in the case where $r(x^{(t)})$ consists of a delta function for some subset of coordinates, as in the inpainting example in Figure 5.

Dataset             K                   K − L_null
Swiss Roll          2.35 bits           6.45 bits
Binary Heartbeat    -2.414 bits/seq.    12.024 bits/seq.
Bark                -0.55 bits/pixel    1.5 bits/pixel
Dead Leaves         1.489 bits/pixel    3.536 bits/pixel
CIFAR-10            11.895 bits/pixel   18.037 bits/pixel
MNIST               See table 2

Table 1. The lower bound $K$ on the log likelihood, computed on a holdout set, for each of the trained models. See Equation 12. The right column is the improvement relative to an isotropic Gaussian or independent binomial distribution. $L_{null}$ is the log likelihood of $\pi(x^{(0)})$.

Model               Log Likelihood
Dead Leaves
  MCGSM             1.244 bits/pixel
  Diffusion         1.489 bits/pixel
MNIST
  Stacked CAE       121 ± 1.6 bits
  DBN               138 ± 2 bits
  Deep GSN          214 ± 1.1 bits
  Diffusion         220 ± 1.9 bits
  Adversarial net   225 ± 2 bits

Table 2. Log likelihood comparisons to other algorithms. Dead leaves images were evaluated using identical training and test data as in (Theis et al., 2012). MNIST log likelihoods were estimated using the Parzen-window code from (Goodfellow et al., 2014), and show that our performance is comparable to other recent techniques.
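The closed-form case of Section 2.5.3 can be illustrated in one dimension. When the reverse kernel is $\mathcal{N}(\mu, \sigma^2)$ and $r$ is itself an unnormalized Gaussian $\exp(-(x-c)^2 / (2\tau^2))$, the product in Equation 21 is again Gaussian, with precision equal to the sum of the two precisions and a precision-weighted mean. As $\tau \to 0$, $r$ approaches a delta function and the kernel collapses onto the known value $c$, which is the inpainting case. The sketch below checks the formula against a numerical normalization on a grid; the function name and all numbers are illustrative.

```python
import numpy as np


def multiply_gaussian_kernel(mu, var, c, tau2):
    """Mean and variance of the normalized product N(x; mu, var) * exp(-(x - c)^2 / (2 tau2))."""
    post_var = 1.0 / (1.0 / var + 1.0 / tau2)
    post_mean = post_var * (mu / var + c / tau2)
    return post_mean, post_var


mu, var = 0.3, 0.04  # hypothetical reverse kernel p(x^(t) | x^(t+1))
c, tau2 = 1.0, 0.01  # hypothetical r(x^(t)): soft constraint pulling x toward 1.0
post_mean, post_var = multiply_gaussian_kernel(mu, var, c, tau2)

# Numerical check: normalize p * r on a dense grid and compare the first two moments.
x = np.linspace(-3.0, 4.0, 200001)
unnorm = np.exp(-((x - mu) ** 2) / (2 * var)) * np.exp(-((x - c) ** 2) / (2 * tau2))
w = unnorm / unnorm.sum()
grid_mean = (w * x).sum()
grid_var = (w * (x - grid_mean) ** 2).sum()
print(post_mean, grid_mean)
print(post_var, grid_var)

# A very narrow r behaves like a delta function: the kernel collapses onto c.
delta_mean, delta_var = multiply_gaussian_kernel(mu, var, c, 1e-12)
print(delta_mean, delta_var)
```

For coordinates where $r$ is constant (missing data in the inpainting example), the product leaves the reverse kernel unchanged, so only the known coordinates are affected.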

2.5.4. Choosing $r(x^{(t)})$

Typically, $r(x^{(t)})$ should be chosen to change slowly over the course of the trajectory. For the experiments in this paper we chose it to be constant,

$$r(x^{(t)}) = r(x^{(0)}). \tag{22}$$

Another convenient choice is $r(x^{(t)}) = r(x^{(0)})^{\frac{T-t}{T}}$. Under this second choice $r(x^{(t)})$ makes no contribution to the starting distribution for the reverse trajectory. This guarantees that drawing the initial sample from $\tilde{p}(x^{(T)})$ for the reverse trajectory remains straightforward.

2.6. Entropy of Reverse Process

Since the forward process is known, it is possible to place upper and lower bounds on the entropy of each step in the reverse trajectory. These bounds can be used to constrain the learned reverse transitions $p(x^{(t-1)}|x^{(t)})$. The bounds on the conditional entropy of a step in the reverse trajectory are

$$H_q(X^{(t)}|X^{(t-1)}) + H_q(X^{(t-1)}|X^{(0)}) - H_q(X^{(t)}|X^{(0)}) \le H_q(X^{(t-1)}|X^{(t)}) \le H_q(X^{(t)}|X^{(t-1)}), \tag{23}$$

where both the upper and lower bounds depend only on the conditional forward trajectory $q(x^{(1\cdots T)}|x^{(0)})$, and can be analytically computed. The derivation is provided in Appendix A.

3. Experiments

We train diffusion probabilistic models on a variety of continuous datasets, and a binary dataset. We then demonstrate sampling from the trained model and inpainting of missing data, and compare model performance against other techniques. In all cases the objective function and gradient were computed using Theano (Bergstra & Breuleux, 2010), and model training was with SFO (Sohl-Dickstein et al., 2014). The lower bound on the log likelihood provided by our model is reported for all datasets in Table 1. A reference implementation of the algorithm utilizing Blocks (van Merriënboer et al., 2015) is available at https://github.com/Sohl-Dickstein/Diffusion-Probabilistic-Models.

3.1. Toy Problems

3.1.1. Swiss Roll

A diffusion probabilistic model was built of a two dimensional swiss roll distribution, using a radial basis function network to generate $f_\mu(x^{(t)}, t)$ and $f_\Sigma(x^{(t)}, t)$. As illustrated in Figure 1, the swiss roll distribution was successfully learned. See Appendix Section D.1.1 for more details.

3.1.2. Binary Heartbeat Distribution

A diffusion probabilistic model was trained on simple binary sequences of length 20, where a 1 occurs every 5th time bin, and the remainder of the bins are 0, using a multi-layer perceptron to generate the Bernoulli rates $f_b(x^{(t)}, t)$ of the reverse trajectory. The log likelihood under the true distribution is $\log_2\left(\frac{1}{5}\right) = -2.322$ bits per sequence. As can be seen in Figure 2 and Table 1 learning was nearly perfect. See Appendix Section D.1.2 for more details.

3.2. Images

We trained Gaussian diffusion probabilistic models on several image datasets. The multi-scale convolutional architecture shared by these experiments is described in Appendix Section D.2.1, and illustrated in Figure D.1.

[Figure 5 panels: (a), (b), (c).]

Figure 5. Inpainting. (a) A bark image from (Lazebnik et al., 2005). (b) The same image with the central 100×100 pixel region replaced with isotropic Gaussian noise. This is the initialization $\tilde{p}(x^{(T)})$ for the reverse trajectory. (c) The central 100×100 region has been inpainted using a diffusion probabilistic model trained on images of bark, by sampling from the posterior distribution over the missing region conditioned on the rest of the image. Note the long-range spatial structure, for instance in the crack entering on the left side of the inpainted region. The sample from the posterior was generated as described in Section 2.5, where $r(x^{(0)})$ was set to a delta function for known data, and a constant for missing data.

3.2.1. Datasets

MNIST. In order to allow a direct comparison against previous work on a simple dataset, we trained on MNIST digits (LeCun & Cortes, 1998). Log likelihoods relative to a variety of techniques (Bengio et al., 2012; Bengio & Thibodeau-Laufer, 2013; Goodfellow et al., 2014) are given in Table 2. Samples from the MNIST model are given in Figure App.1 in the Appendix. Our training algorithm provides an asymptotically exact lower bound on the log likelihood. However, most previously reported results on MNIST log likelihood rely on Parzen-window based estimates computed from model samples. For this comparison we therefore estimate MNIST log likelihood using the Parzen-window code released with (Goodfellow et al., 2014).

CIFAR-10. A probabilistic model was fit to the training images for the CIFAR-10 challenge dataset (Krizhevsky & Hinton, 2009). Samples from the trained model are provided in Figure 3.

Dead Leaf Images. Dead leaf images (Jeulin, 1997; Lee et al., 2001) consist of layered occluding circles, drawn from a power law distribution over scales. They have an analytically tractable structure, but capture many of the statistical complexities of natural images, and therefore provide a compelling test case for natural image models. As illustrated in Table 2 and Figure 4, we achieve state of the art performance on the dead leaves dataset.

Bark Texture Images. A probabilistic model was trained on bark texture images (T01-T04) from (Lazebnik et al., 2005). For this dataset we demonstrate that it is straightforward to evaluate or generate from a posterior distribution, by inpainting a large region of missing data using a sample from the model posterior in Figure 5.

4. Conclusion

We have introduced a novel algorithm for modeling probability distributions that enables exact sampling and evaluation of probabilities and demonstrated its effectiveness on a variety of toy and real datasets, including challenging natural image datasets. For each of these tests we used a similar basic algorithm, showing that our method can accurately model a wide variety of distributions. Most existing density estimation techniques must sacrifice modeling power in order to stay tractable and efficient, and sampling or evaluation are often extremely expensive. The core of our algorithm consists of estimating the reversal of a Markov diffusion chain which maps data to a noise distribution; as the number of steps is made large, the reversal distribution of each diffusion step becomes simple and easy to estimate. The result is an algorithm that can learn a fit to any data distribution, but which remains tractable to train, exactly sample from, and evaluate, and under which it is straightforward to manipulate conditional and posterior distributions.

Acknowledgements

We thank Lucas Theis, Subhaneil Lahiri, Ben Poole, Diederik P. Kingma, Taco Cohen, and Philip Bachman for extremely helpful discussion, and Ian Goodfellow for sharing Parzen-window code. We thank Khan Academy and the Office of Naval Research for funding Jascha Sohl-Dickstein. We further thank the
Office of Naval Research, the Burroughs-Wellcome foundation, Sloan foundation, and James S. McDonnell foundation for funding Surya Ganguli.

References

Barron, J. T., Biggin, M. D., Arbelaez, P., Knowles, D. W., Keranen, S. V., and Malik, J. Volumetric Semantic Segmentation Using Pyramid Context Features. In 2013 IEEE International Conference on Computer Vision, pp. 3448–3455. IEEE, December 2013. ISBN 978-1-4799-2840-8. doi: 10.1109/ICCV.2013.428.

Bengio, Y. and Thibodeau-Laufer, E. Deep generative stochastic networks trainable by backprop. arXiv preprint arXiv:1306.1091, 2013.

Bengio, Y., Mesnil, G., Dauphin, Y., and Rifai, S. Better Mixing via Deep Representations. arXiv preprint arXiv:1207.4404, July 2012.

Bergstra, J. and Breuleux, O. Theano: a CPU and GPU math expression compiler. Proceedings of the Python for Scientific Computing Conference (SciPy), 2010.

Besag, J. Statistical Analysis of Non-Lattice Data. The Statistician, 24(3):179–195, 1975.

Bishop, C., Svensén, M., and Williams, C. GTM: The generative topographic mapping. Neural computation, 1998.

Bornschein, J. and Bengio, Y. Reweighted Wake-Sleep. International Conference on Learning Representations, June 2015.

Burda, Y., Grosse, R. B., and Salakhutdinov, R. Accurate and Conservative Estimates of MRF Log-likelihood using Reverse Annealing. arXiv:1412.8566, December 2014.

Dayan, P., Hinton, G. E., Neal, R. M., and Zemel, R. S. The helmholtz machine. Neural computation, 7(5):889–904, 1995.

Dinh, L., Krueger, D., and Bengio, Y. NICE: Non-linear Independent Components Estimation. arXiv:1410.8516, pp. 11, October 2014.

Feller, W. On the theory of stochastic processes, with particular reference to applications. In Proceedings of the [First] Berkeley Symposium on Mathematical Statistics and Probability. The Regents of the University of California, 1949.

Gershman, S. J. and Blei, D. M. A tutorial on Bayesian nonparametric models. Journal of Mathematical Psychology, 56(1):1–12, 2012.

Gneiting, T. and Raftery, A. E. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477):359–378, 2007.

Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative Adversarial Nets. Advances in Neural Information Processing Systems, 2014.

Gregor, K., Danihelka, I., Mnih, A., Blundell, C., and Wierstra, D. Deep AutoRegressive Networks. arXiv preprint arXiv:1310.8499, October 2013.

Grosse, R. B., Maddison, C. J., and Salakhutdinov, R. Annealing between distributions by averaging moments. In Advances in Neural Information Processing Systems, pp. 2769–2777, 2013.

Hinton, G. E. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771–1800, 2002.

Hinton, G. E. The wake-sleep algorithm for unsupervised neural networks. Science, 1995.

Hyvärinen, A. Estimation of non-normalized statistical models using score matching. Journal of Machine Learning Research, 6:695–709, 2005.

Jarzynski, C. Equilibrium free-energy differences from nonequilibrium measurements: A master-equation approach. Physical Review E, January 1997.

Jarzynski, C. Equalities and inequalities: irreversibility and the second law of thermodynamics at the nanoscale. In Annu. Rev. Condens. Matter Phys. Springer, 2011.

Jeulin, D. Dead leaves models: from space tesselation to random functions. Proc. of the Symposium on the Advances in the Theory and Applications of Random Sets, 1997.

Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., and Saul, L. K. An introduction to variational methods for graphical models. Machine learning, 37(2):183–233, 1999.

Kavukcuoglu, K., Ranzato, M., and LeCun, Y. Fast inference in sparse coding algorithms with applications to object recognition. arXiv preprint arXiv:1010.3467, 2010.

Kingma, D. P. and Welling, M. Auto-Encoding Variational Bayes. International Conference on Learning Representations, December 2013.

Krizhevsky, A. and Hinton, G. Learning multiple layers of features from tiny images. Computer Science Department University of Toronto Tech. Rep., 2009.

Langevin, P. Sur la théorie du mouvement brownien. CR Acad. Sci. Paris, 146(530-533), 1908.

Larochelle, H. and Murray, I. The neural autoregressive distribution estimator. Journal of Machine Learning Research, 2011.

Lazebnik, S., Schmid, C., and Ponce, J. A sparse texture representation using local affine regions. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 27(8):1265–1278, 2005.

LeCun, Y. and Cortes, C. The MNIST database of handwritten digits. 1998.

Lee, A., Mumford, D., and Huang, J. Occlusion models for natural images: A statistical study of a scale-invariant dead leaves model. International Journal of Computer Vision, 2001.

Lyu, S. Unifying Non-Maximum Likelihood Learning Objectives with Minimum KL Contraction. In Shawe-Taylor, J., Zemel, R. S., Bartlett, P., Pereira, F. C. N., and Weinberger, K. Q. (eds.), Advances in Neural Information Processing Systems 24, pp. 64–72. 2011.

MacKay, D. Bayesian neural networks and density networks. Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment, 1995.

Murphy, K. P., Weiss, Y., and Jordan, M. I. Loopy belief propagation for approximate inference: An empirical study. In Proceedings of the Fifteenth conference on Uncertainty in artificial intelligence, pp. 467–475. Morgan Kaufmann Publishers Inc., 1999.

Neal, R. Annealed importance sampling. Statistics and Computing, January 2001.

Ozair, S. and Bengio, Y. Deep Directed Generative Autoencoders. arXiv:1410.0630, October 2014.

Parry, M., Dawid, A. P., Lauritzen, S., and Others. Proper local scoring rules. The Annals of Statistics, 40(1):561–592, 2012.

Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic Backpropagation and Approximate Inference in Deep Generative Models. Proceedings of the 31st International Conference on Machine Learning (ICML-14), January 2014.

Rippel, O. and Adams, R. P. High-Dimensional Probability Estimation with Deep Density Models. arXiv:1410.8516, pp. 12, February 2013.

Schmidhuber, J. Learning factorial codes by predictability minimization. Neural Computation, 1992.

Sminchisescu, C., Kanaujia, A., and Metaxas, D. Learning joint top-down and bottom-up processes for 3D visual inference. In Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, volume 2, pp. 1743–1752. IEEE, 2006.

Sohl-Dickstein, J., Battaglino, P., and DeWeese, M. New Method for Parameter Estimation in Probabilistic Models: Minimum Probability Flow. Physical Review Letters, 107(22):11–14, November 2011a. ISSN 0031-9007. doi: 10.1103/PhysRevLett.107.220601.

Sohl-Dickstein, J., Battaglino, P. B., and DeWeese, M. R. Minimum Probability Flow Learning. International Conference on Machine Learning, 107(22):11–14, November 2011b. ISSN 0031-9007. doi: 10.1103/PhysRevLett.107.220601.

Sohl-Dickstein, J., Poole, B., and Ganguli, S. Fast large-scale optimization by unifying stochastic gradient and quasi-Newton methods. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pp. 604–612, 2014.

Spinney, R. and Ford, I. Fluctuation Relations: A Pedagogical Overview. arXiv preprint arXiv:1201.6381, pp. 3–56, 2013.

Stuhlmüller, A., Taylor, J., and Goodman, N. Learning stochastic inverses. Advances in Neural Information Processing Systems, 2013.

Suykens, J. and Vandewalle, J. Nonconvex optimization using a Fokker-Planck learning machine. In 12th European Conference on Circuit Theory and Design, 1995.

T, P. Convergence condition of the TAP equation for the infinite-ranged Ising spin glass model. J. Phys. A: Math. Gen. 15 1971, 1982.

Tanaka, T. Mean-field theory of Boltzmann machine learning. Physical Review Letters E, January 1998.

Theis, L., Hosseini, R., and Bethge, M. Mixtures of conditional Gaussian scale mixtures applied to multiscale image representations. PloS one, 7(7):e39857, 2012.

Uria, B., Murray, I., and Larochelle, H. RNADE: The real-valued neural autoregressive density-estimator. Advances in Neural Information Processing Systems, 2013a.

Uria, B., Murray, I., and Larochelle, H. A Deep and Tractable Density Estimator. arXiv:1310.1757, pp. 9, October 2013b.

van Merriënboer, B., Chorowski, J., Serdyuk, D., Bengio, Y., Bogdanov, D., Dumoulin, V., and Warde-Farley, D. Blocks and Fuel. Zenodo, May 2015. doi: 10.5281/zenodo.17721.

Welling, M. and Hinton, G. A new learning algorithm for mean field Boltzmann machines. Lecture Notes in Computer Science, January 2002.

Yao, L., Ozair, S., Cho, K., and Bengio, Y. On the Equivalence Between Deep NADE and Generative Stochastic Networks. In Machine Learning and Knowledge Discovery in Databases, pp. 322–336. Springer, 2014.
