
Deep Generative Models: Survey

Achraf Oussidi
Master, Data Science and Big Data
ENSIAS
Mohammed V University, Rabat, Morocco
Email: [email protected]

Azeddine Elhassouny
Ph.D., Assistant Professor
Software Engineering Department - ENSIAS
Mohammed V University, Rabat, Morocco
Email: [email protected]

Abstract—Generative models have found their way to the forefront of deep learning in the last decade and so far, it seems that the hype will not fade away any time soon. In this paper, we give an overview of the most important building blocks of the most recent revolutionary deep generative models, such as RBM, DBM, DBN, VAE and GAN. We also take a look at three state-of-the-art generative models, namely PixelRNN, DRAW and NADE. We delve into their unique architectures, their learning procedures and their potential and limitations. We also review some of the known issues that arise when trying to design and train deep generative architectures using shallow ones, and how different models deal with these issues. This paper is not meant to be a comprehensive study of these models, but rather a starting point for those who bear an interest in the field.

I. INTRODUCTION

Generative models have been at the forefront of deep unsupervised learning for the last decade. The reason is that they offer a very efficient way to analyze and understand unlabeled data. The idea behind generative models is to capture the inner probabilistic distribution that generates a class of data in order to generate similar data. This can be used for fast data indexing and retrieval [1] [2] and plenty of other tasks. Generative models have been used in numerous fields and problems such as visual recognition tasks [3], speech recognition and generation [4], natural language processing [5] [6] and robotics [7]. In this paper, we go over the most common building blocks for generative models in the first section, and survey the state-of-the-art deep generative models in use today in the second section.

Generative models, in general, can be divided into two main categories:
• Cost function-based models, such as autoencoders and generative adversarial networks.
• Energy-based models [8], where the joint probability is defined using an energy function; for instance, the Boltzmann machine and its variants and deep belief networks.

Depending on its nature and depth, a model can admit different types of training. In general, some training strategies are fast but inefficient and others are more efficient but hard to carry out or take too long. There are also techniques used to avoid this tradeoff, such as two-phased training. The most notable example is the deep belief network, which often undergoes a separate training for its components (two layers at a time in general) in a phase referred to as pre-training [9], before the final training of the whole network at once in the fine-tuning phase.

To construct a deep generative model by combining others [10], we need to keep in mind that the probability distribution of the resulting model must be computable and evaluable, explicitly or implicitly, in order to provide a probabilistic ground for sampling or eventually doing inference on the model. In general, feedforward networks are easier to stack and combine, while energy-based models are harder to combine without losing tractability of the joint probabilities.

II. GENERATIVE MODELS

A. Boltzmann Machines

The Boltzmann Machine [10] is an energy-based model [8]. That is, it associates with the model a scalar energy function that takes in a configuration of the input variables and returns a scalar value describing the "badness" level of the configuration in question. The goal of learning is therefore to find an energy function (within a predetermined functional space) that associates smaller values to the correct configurations and higher values to incorrect ones, both within and out of the training examples. Predictions are then made by selecting the configurations that minimize the energy.

The Boltzmann Machine was first introduced by Geoffrey Hinton et al. in 1983 [10] [11]. Its main purpose was to carry out efficient searches for combinations of "hypotheses" that maximally satisfy some constrained data input. The original Boltzmann Machine is an undirected symmetric network of binary units that are divided into visible and hidden units (fig. 1). However, alternative real-valued variations have been proposed and have even taken over the scene and surpassed binary ones in popularity.

Fig. 1. A Boltzmann Machine with 5 visible units (in blue) and 5 hidden units (in red).

In this section, we start by introducing the binary Boltzmann Machine, the main inspiration for the Restricted Boltzmann Machine (RBM), which in turn is the building block for many sophisticated and more powerful generative models, including Deep Boltzmann Machines (DBM) and Deep Belief Networks (DBN).

1) Binary Boltzmann Machine: Binary Boltzmann Machines are among the easiest networks to implement and could theoretically, given enough time and computational power,
learn complex distributions. However, they have not proven useful on a practical level. Similar to Hopfield networks, Boltzmann machines are fully connected networks of binary units that use the same energy function. However, unlike Hopfield networks, Boltzmann machines are not memory driven and instead try to capture the inner structure and regularities of the data. The power of the binary Boltzmann Machine lies in the hidden units, which allow it to extend the simple linear interactions to higher-order ones and give it the possibility to model virtually any probabilistic distribution. The energy of the binary Boltzmann Machine is given by:

E(x) = −( (1/2) Σ_{i,j} w_ij x_i x_j + Σ_i b_i x_i )   (1)

where x = (x1, x2, ..., xd) ∈ {0, 1}^d, W = (w_ij)_ij is the weight matrix and B = (b1, b2, ..., bd) is the bias vector.
The joint probability of the network is given by:

P(x) = (1/Z(b)) exp(−E(x))   (2)

where Z(b) is the partition function that ensures P(x) ≤ 1.
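As a concrete illustration of Eqs. (1) and (2), the following minimal NumPy sketch (our own illustrative code, not taken from the paper) evaluates the energy and the unnormalized probability of a binary configuration, assuming a symmetric weight matrix with a zero diagonal:

import numpy as np

rng = np.random.default_rng(0)
d = 6                                   # number of units (visible + hidden)
W = rng.normal(scale=0.1, size=(d, d))
W = (W + W.T) / 2                       # symmetric weights
np.fill_diagonal(W, 0.0)                # no self-connections
b = rng.normal(scale=0.1, size=d)       # biases

def energy(x, W, b):
    """E(x) = -(0.5 * sum_ij w_ij x_i x_j + sum_i b_i x_i), Eq. (1)."""
    return -(0.5 * x @ W @ x + b @ x)

x = rng.integers(0, 2, size=d)          # one binary configuration
print("E(x) =", energy(x, W, b))
print("unnormalized P(x) =", np.exp(-energy(x, W, b)))   # Eq. (2) without 1/Z

Dividing the last quantity by the partition function Z(b) would give the actual probability P(x).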
Boltzmann machines are theoretically capable of learning any given distribution simply by being shown examples sampled from it. Essentially, the network sets the strengths of the connections between the units to capture the correlations that tie them together, in order to build a generative network capable of, among other things, producing new examples of the same distribution. And since not all the variables (units) in a Boltzmann machine are directly observed, this gives us a handle to control the sampling of new examples. Furthermore, the model can take in an incomplete example and use it to output the complete version.

Learning in Boltzmann machines is of a Hebbian nature, meaning that to update a weight, we only need information from the neighboring neurons. This means that learning in Boltzmann machines is more biologically plausible. Hebbian learning is one of the oldest learning algorithms. It can be summarized as "Cells that fire together wire together." [12] In practice, neurons either strengthen or weaken their link based on how often they agree in their outputs. If two neurons would more often than not have the same output, the learning algorithm puts more weight on their link. Similarly, if they disagree most often, the link between them is weakened. This learning process is said to be more biologically plausible because it does not require any backlinks to be maintained by the network to receive gradient information, and every weight update relies only on the neighboring units.

2) Restricted Boltzmann Machine: The intractability of the joint distribution is one of the biggest drawbacks of Boltzmann machines. Restricted Boltzmann Machines (originally called Harmonium) [13] are a special type of Boltzmann machine with two layers, one visible and one hidden, designed to solve this problem. The RBM is a graphical model of binary units; however, real-valued generalization is straightforward [14] [15]. The connections in an RBM are undirected and there are no visible-visible or hidden-hidden connections (fig. 2). Among other things, this bipartite architecture allows us to have more control over the joint distribution by casting it into products of conditional probabilities. RBMs are a powerful replacement for fully connected Boltzmann machines when building a deep architecture, because of the independence of units within the same layer, which allows for more freedom and flexibility.

Fig. 2. A Restricted Boltzmann Machine with 3 visible units (in blue) and 4 hidden units (in red).

RBMs can be trained using the traditional techniques of maximum likelihood [16]. Sampling from an RBM can be done using Gibbs sampling or any other Markov Chain Monte Carlo (MCMC) [17] method.
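To make the sampling procedure concrete, here is a minimal NumPy sketch of block Gibbs sampling in a binary RBM. It relies on the fact that, given the visible units, the hidden units are conditionally independent (and vice versa); the dimensions and parameters are made up for illustration and are not taken from the paper:

import numpy as np

rng = np.random.default_rng(1)
n_vis, n_hid = 6, 4
W = rng.normal(scale=0.1, size=(n_vis, n_hid))   # visible-hidden weights
b = np.zeros(n_vis)                              # visible biases
c = np.zeros(n_hid)                              # hidden biases

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def gibbs_step(v):
    # p(h_j = 1 | v) for every hidden unit, then sample h
    p_h = sigmoid(v @ W + c)
    h = (rng.random(n_hid) < p_h).astype(float)
    # p(v_i = 1 | h) for every visible unit, then sample v
    p_v = sigmoid(h @ W.T + b)
    return (rng.random(n_vis) < p_v).astype(float), h

v = rng.integers(0, 2, size=n_vis).astype(float)
for _ in range(1000):          # run the chain; keep the last state as a sample
    v, h = gibbs_step(v)
print("sampled visible configuration:", v)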
3) Deep Boltzmann Machine: The Deep Boltzmann Machine (DBM) [18] is an undirected deep network of several hidden layers. In DBMs, every unit is connected to every unit from
the adjacent layers. Similar to RBMs, there are no connections between units of the same layer (fig. 3). DBMs can also be viewed as a group of RBMs stacked together (fig. 4).

Fig. 3. A Deep Boltzmann Machine with 1 visible layer (in blue) and 3 hidden layers (in red).

Fig. 4. The same DBM from fig. 3 decomposed into 3 RBMs. Every two consecutive layers, when taken together, form an RBM.

Training of DBMs is often done in two stages: a pre-training stage where every RBM is trained independently, and a fine-tuning stage where the network is trained at once using backpropagation [19].

4) Deep Belief Networks: Deep belief networks are another deep architecture consisting of many hidden layers; they revolutionized the deep learning scene when they were first introduced in 2006 [20]. Similar to DBMs, DBNs do not have connections within the same layer. The difference between the two is that in a DBN only the top two layers have undirected connections, while the rest have directed connections pointing towards the visible layer (fig. 5). The most widely used method to train a DBN is a greedy layer-wise fast algorithm introduced by Hinton et al. [21]. Similar to the algorithm described above for training DBMs, this algorithm consists of two stages: an initialization (fast) stage where every layer
is trained independently, and a fine-tuning (slow) stage where the network as a whole is trained using a variation of the wake-sleep algorithm [22].

Fig. 5. A Deep Belief Network with a similar architecture to the DBM in fig. 3. In DBNs, all connections are directed except between the top two layers.

Sampling from a DBN is done by first running multiple steps of Gibbs sampling on the top two layers, which have undirected connections. We then use the sampled latent variables to draw samples from the visible units by running a step of ancestral sampling through the network.
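The greedy layer-wise recipe can be sketched as follows: each RBM in the stack is trained on the representation produced by the layer below it. The toy trainer below uses one step of contrastive divergence (CD-1), a common approximation to the maximum-likelihood gradient; it is an illustrative simplification, not the exact procedure of [18] or [21], and it omits the fine-tuning phase:

import numpy as np

rng = np.random.default_rng(2)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def train_rbm(data, n_hid, epochs=10, lr=0.05):
    """Toy CD-1 trainer for one binary RBM; returns weights and biases."""
    n_vis = data.shape[1]
    W = rng.normal(scale=0.01, size=(n_vis, n_hid))
    b, c = np.zeros(n_vis), np.zeros(n_hid)
    for _ in range(epochs):
        v0 = data
        p_h0 = sigmoid(v0 @ W + c)
        h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
        p_v1 = sigmoid(h0 @ W.T + b)                 # one reconstruction step
        p_h1 = sigmoid(p_v1 @ W + c)
        W += lr * (v0.T @ p_h0 - p_v1.T @ p_h1) / len(data)
        b += lr * (v0 - p_v1).mean(axis=0)
        c += lr * (p_h0 - p_h1).mean(axis=0)
    return W, b, c

# Greedy layer-wise pre-training: each RBM is trained on the hidden
# representation produced by the RBM below it.
data = rng.integers(0, 2, size=(100, 20)).astype(float)   # toy binary data
layer_sizes = [12, 8]
layers, x = [], data
for n_hid in layer_sizes:
    W, b, c = train_rbm(x, n_hid)
    layers.append((W, b, c))
    x = sigmoid(x @ W + c)        # feed hidden probabilities to the next RBM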
B. Autoencoders

An autoencoder is a neural network trained for the purpose of recreating its input as the output. It is a feedforward non-recurrent network whose aim is to continually reduce the dimensionality down to a smaller hidden layer, often called the code, representative of the input. In a similar but mirroring process, the network then recreates the same input structure from the code layer. The first part is called the encoder and the second the decoder (fig. 6).

The goal of an autoencoder is not to perfectly copy the input to the output. Therefore, we must prevent it from learning a trivial identity function, which comes easily if the autoencoder is not properly "restrained". The aim is for our model to pick up the underlying patterns and characteristics of the data distribution, to be able to generate new, never seen before examples of the same distribution as the examples provided during the training phase. Formally, an autoencoder can be written in a deterministic way (although it is not usually the case) as a composition of two functions:

x̂ = fd(h) where h = fe(x)

where fe is the encoder, fd is the decoder, x is the input variable, h is the code and x̂ is the reconstruction.

Since an autoencoder is a particular case of neural networks, it can be trained using the standard techniques for training feedforward neural networks, such as mini-batch gradient descent and back-propagation.
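As a minimal, hedged illustration of x̂ = fd(fe(x)) trained with mini-batch gradient descent and back-propagation, here is a PyTorch sketch (the layer sizes and the framework are our own choices, not the paper's):

import torch
from torch import nn, optim

encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 32))
decoder = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 784), nn.Sigmoid())
model = nn.Sequential(encoder, decoder)

opt = optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()                      # reconstruction error

x = torch.rand(64, 784)                     # a toy mini-batch of "images"
for step in range(100):
    x_hat = model(x)                        # x_hat = f_d(f_e(x))
    loss = loss_fn(x_hat, x)
    opt.zero_grad()
    loss.backward()                         # back-propagation
    opt.step()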
Fig. 6. An autoencoder with a two-layer encoder and a two-layer decoder.

1) Undercomplete Autoencoders: As we mentioned earlier, a neural network that "clones" the input to the output serves practically no purpose and is therefore useless. What we really hope for is to learn the underlying features of the input examples. To do so, we need to restrict our network to only copy the shared properties and patterns.
One way to do it is to make the code smaller than the input. The resulting model is referred to as an undercomplete autoencoder. This gives the network a nice bottleneck shape, and since in a feedforward network information flows in one direction, this more or less forces it to pick the relevant features to let pass through the narrow tunnel, instead of mimicking a copying mechanism and driving the raw information contained in the input forward towards the output.
Although this sounds like a promising solution, networks with large enough capacities still run the risk of learning the trivial identity function.

2) Denoising Autoencoders: Another method for restraining a network is to "conceal" the real data and show the network a slightly modified version of it instead. The catch here is to penalize the model on the real data. For instance, we could randomly corrupt every input by adding stochastic noise before showing it to the network. Thus, by continually improving the weights on the basis of the network's capability to generate uncorrupted versions of the input examples, we are in fact training it to generate data from the same distribution minus the artificially added noise; hence the name denoising autoencoder [23].
It is crucial that the added noise is stochastic, as deterministic noise can easily be learned by the network, which will yield, once again, a trivial composition of the identity function and a (deterministic) "de-noising" function, rendering the network once again useless. In general, the added noise is quantified as the percentage of altered pixels (the common practice is to set them to zero). Depending on the amount of training data we possess, the fraction of corrupted pixels ranges from 30% (for very large sets of training data) to 50% (for relatively smaller ones).
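The corruption step described above can be sketched in a few lines; in this illustrative PyTorch example the noise is masking noise (a random fraction of pixels set to zero) and the reconstruction loss is still computed against the clean input:

import torch

def corrupt(x, rate=0.3):
    """Randomly set `rate` of the pixels to zero (masking noise)."""
    mask = (torch.rand_like(x) > rate).float()
    return x * mask

x_clean = torch.rand(64, 784)
x_noisy = corrupt(x_clean, rate=0.3)
# During training, the reconstruction loss is computed against x_clean:
# loss = loss_fn(model(x_noisy), x_clean)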
3) Sparse Autoencoders: A sparse autoencoder is an autoencoder with a sparsity constraint imposed on its loss function. The sparsity constraint ensures that the number of active units in the code layer is minimal, which leads to a sparse representation of the input data. There are many ways to add a sparsity constraint, a few of which are:
• Penalizing the derivative – we can obtain a network that sparsely encodes the input data by adding, to the mean squared error cost function often used with this type of network, a regularizing term that ensures that most of the code neurons have low values (close to 0 when using the sigmoid activation function and close to −1 when using the tanh activation function).
• K-sparse autoencoders [24] – perhaps the simplest and most effective way (see the sketch after this list). This method consists of manually setting all the code neurons to 0, except for the k neurons with the highest activation. The error is then backpropagated through only the k nonzero neurons. Experimental results [24] have shown that the lower the value of k (higher sparsity), the better the results.
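The k-sparse constraint of [24] boils down to a top-k operation on the code layer during the forward pass. A minimal PyTorch sketch (illustrative shapes; when used inside a network, gradients flow only through the k kept units):

import torch

def k_sparse(code, k=5):
    """Keep only the k highest activations per example; zero out the rest."""
    topk = torch.topk(code, k, dim=1)
    sparse = torch.zeros_like(code)
    return sparse.scatter(1, topk.indices, topk.values)

code = torch.randn(8, 32)          # toy code-layer activations
print(k_sparse(code, k=5))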
4) Variational Autoencoders: Autoencoders are among the simplest yet most elegant approaches to generative modeling. In principle, autoencoders are designed to learn to generate data through extraction of the internal regularities of a given sample. To do so, an autoencoder would have to "decode" a code from the code layer distribution. However, the issue that arises is that while the network has successfully learned the data distribution and is able to encode it without any problems, the code distribution remains unknown. Consequently, in order to generate new unseen data, we can only use completely random codes, which can only be expected to yield bad and unsatisfying results since the code is not sampled from its distribution. To address this issue, and in addition to learning the data distribution, variational autoencoders [25] set out to learn the distribution of the code layer. To achieve that, variational autoencoders make the assumption that the code layer comes from a Gaussian distribution of mean μ and variance σ². This new parametrization (often called the "re-parametrization trick") allows us to sample noisy data from the code distribution.
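The re-parametrization trick can be written in a few lines. In this hedged PyTorch sketch, mu and log_var stand in for the outputs of the encoder (here just placeholder tensors), following the formulation of [25]:

import torch

mu = torch.zeros(64, 32)        # would be predicted by the encoder
log_var = torch.zeros(64, 32)   # would be predicted by the encoder

eps = torch.randn_like(mu)                 # noise from a standard Gaussian
z = mu + torch.exp(0.5 * log_var) * eps    # z ~ N(mu, sigma^2), differentiable in mu and log_var

# At generation time we simply draw z ~ N(0, I) and feed it to the decoder.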
C. Generative Adversarial Networks

Generative adversarial networks [26] are based on a game theory scenario called the minimax game, where a discriminator D and a generator G compete against each other. The generator network generates data from stochastic noise and the discriminator tries to tell whether it is real (coming from a training set) or fabricated (from the generator network). This game scenario is modeled as a zero-sum game where the absolute difference of carefully calculated rewards from both networks is minimized (kept close to zero), so that both networks learn simultaneously as they try to outperform each other. It is important that the generator network has no access
to the real data and only learns through the discriminator's feedback.
More formally, the generator network can be modeled as a differentiable function that takes in random noise from a latent space Z following a distribution pz(z) and outputs data in the same space as the real data and (hopefully) of the same distribution pdata(x):

G : Z → R^n   (3)

where Z is the latent space and n is the dimensionality of the data space.
The discriminator network D is a simple classifier neural network that can be formally represented as a function that maps from the data space to a probability p ∈ [0, 1] representing how likely the input data vector is to be real:

D : R^n → [0, 1]   (4)

The zero-sum game is modeled as the following optimization problem:

min_G max_D V(D, G)   (5)

where:

V(D, G) = E_(x∼pdata(x))[log D(x)] + E_(z∼pz(z))[log(1 − D(G(z)))]   (6)
Training of GAN– [26] [27] Training of a GAN is done in alternation between the discriminator and the generator. We begin by training the discriminator on the real data for a few epochs. The goal is for it to become able to associate higher values to the real data. We then train the same network on fake data generated by the generator network. At this point, the generator is on pause and is not receiving any feedback from the training; only the discriminator is being trained. In other words, the error is not being backpropagated through the generator network.
As a consequence of the previous steps, the discriminator network is considerably better at its job than the generator network, which so far has not received any training and is still generating noise. So now, we put the discriminator on pause and train the generator network using the discriminator's feedback. The goal is to generate data that can fool the discriminator into classifying it as real. As soon as this happens, the generator is then put on pause and we go back to training the discriminator again. We keep alternating the training between the two networks until we get good enough results on the generated data. We can check manually whether the results are satisfying or not (fig. 7).

Fig. 7. Generative adversarial networks architecture.
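The alternation described above can be sketched as follows in PyTorch. The network sizes, hyper-parameters and the random stand-in for real mini-batches are illustrative, and the generator loss uses the commonly adopted non-saturating heuristic rather than the exact minimax objective of Eq. (6):

import torch
from torch import nn, optim

G = nn.Sequential(nn.Linear(100, 256), nn.ReLU(), nn.Linear(256, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1), nn.Sigmoid())
opt_g = optim.Adam(G.parameters(), lr=2e-4)
opt_d = optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

real_label, fake_label = torch.ones(64, 1), torch.zeros(64, 1)
for step in range(1000):
    x_real = torch.rand(64, 784) * 2 - 1          # stand-in for a real mini-batch
    z = torch.randn(64, 100)                      # latent noise

    # Discriminator step: the generator is "on pause" (detach stops its gradients)
    d_loss = bce(D(x_real), real_label) + bce(D(G(z).detach()), fake_label)
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator step: only G is updated; it tries to make D output "real"
    g_loss = bce(D(G(z)), real_label)
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()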
GAN limitations– In general, GANs give the best results on image generation tasks, but they have their drawbacks [28]. One of the major drawbacks of GANs is that they can sometimes be extremely hard to train. They also fall into the divergence trap very easily. The optimization algorithm can sometimes get stuck in a poor local minimum. The game scenario can sometimes stray and not reach any equilibrium at all. And perhaps the most vicious form of divergence is mode collapse, which can be summarized as follows:

min_G max_D V(G, D) = max_D min_G V(G, D)   (7)

For images, other problems with GANs include:
• Counting– GANs find it hard to understand the importance of the number of occurrences of certain objects (eyes, for instance), even though they have only been shown examples that outlined that property.
• Perspective– They also find it hard to grasp the concept of 3D space and come up with images with distorted perspective.
• Global structure– Another struggle for GANs is the ability to understand shapes and the global structure of objects. This one in particular is a serious issue, because one of the main requirements for an image to look real is how realistic the shapes of the objects it contains look. For instance, an image of an eight-legged cow would not pass as an image of an animal no matter how realistic it might look, simply because eight-legged cows are not a thing and the network has certainly not been given any examples of them as part of the training!

III. GENERATIVE MODELS STATE OF THE ART

Throughout the previous section, we inspected the most promising generative models. Those are the building blocks of the most powerful, state-of-the-art deep generative models in use today. However, as good as it sounds, the task
of stacking and combining these models to build deeper ones is not always an easy one. Two of the major issues one may face are the tractability of the joint distribution (fig. 8) and the training of these models.
In general, directed graphical models such as feedforward neural networks that use a latent distribution p(h) to sample from the data distribution p(x) = Σ_h p(x|h) p(h) are easier to combine, as long as the distribution of the latent variables p(h) and the conditional distribution p(x|h) are kept in check. Energy-based models, on the other hand, present more difficulties. Restricted Boltzmann Machines, for example, define the joint distribution using the energy function E(x) as follows: p(x) = (1/Z) exp(−E(x)), where Z is a normalizing term that makes sure p(x) sums up to 1 over all states of x. Unfortunately, more often than not, there is no way for us to explicitly compute Z = Σ_x exp(−E(x)) since we do not have access to the entire data space. In fact, it has been proven that Z is intractable in the case of RBMs. However, we can approximate the conditional distributions using mean-field variational inference. This, among other problems, makes it harder to stack energy-based models. A third approach to modeling probability distributions is proposed by NADE, which we will see in detail later on.
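To see why Z is problematic, note that computing it exactly requires summing exp(−E(x)) over every possible configuration, i.e. 2^d terms for d binary units. A brute-force sketch, feasible only for very small d (our own illustration, not from the paper):

import numpy as np
from itertools import product

rng = np.random.default_rng(3)
d = 10                                     # 2**10 = 1024 states; 2**d blows up quickly
W = rng.normal(scale=0.1, size=(d, d))
W = (W + W.T) / 2
np.fill_diagonal(W, 0.0)
b = rng.normal(scale=0.1, size=d)

energy = lambda x: -(0.5 * x @ W @ x + b @ x)

Z = sum(np.exp(-energy(np.array(x, dtype=float))) for x in product([0, 1], repeat=d))
print("partition function Z =", Z)         # already 1024 evaluations for d = 10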
Fig. 8. Taxonomy of generative models based on the tractability of the density distribution. Adapted from Ian Goodfellow, Tutorial: Generative adversarial networks [28].

Another issue that arises when building deep generative models is the training. While backpropagation can be very efficient for shallow architectures, deep architectures require other forms of training since backpropagation can be extremely slow to perform. Another general problem encountered while training deep architectures using backpropagation is what is referred to as the vanishing and exploding gradient problems [29]. Basically, different layers of deep architectures learn at different rates. Top layers generally learn faster than bottom layers. This major difference in learning speed among layers of the same deep architecture sometimes causes bottom layers to get stuck during training, experiencing almost no change at all. To solve this problem, models with deep architectures often undergo training sessions for each layer separately (pre-training) before training the model as a whole (fine-tuning). Pre-training, in a way, gives the network a head start that makes the training faster and more efficient, and prevents the vanishing gradient problem. Another form of pre-training consists of using weights previously trained on similar tasks, instead of setting them at random. For instance, if we are planning to use a deep model comprised of several RBMs on an image classification problem, it makes sense to initialize the network with the weights of RBMs previously trained on image classification problems, instead of setting them at random.

A. PixelRNN

PixelRNN [30] is a type of fully visible belief network that takes advantage of the dependency between pixels that are close together. In principle, the value of a given pixel of an image depends on the previously predicted ones. This dependency aggregates to the probability of predicting the image being the product of the conditional probabilities of predicting each pixel given the previous pixels. More formally:

p(x) = Π_{i=1}^{n²} p(x_i | x_1, ..., x_{i−1})   (8)

where x is the image to be predicted, x_i for i = 1, ..., n² represent the pixels and n × n is the resolution of the image.
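Equation (8) is the chain-rule factorization of the joint distribution, and sampling proceeds pixel by pixel, each draw conditioned on the pixels generated so far, which is what makes generation slow. A toy Python sketch with a stand-in conditional model (the real PixelRNN conditionals are produced by the LSTM layers described next):

import numpy as np

rng = np.random.default_rng(4)
n = 8                                   # an n x n binary "image", generated pixel by pixel

def p_next(prev_pixels):
    """Stand-in for p(x_i | x_1, ..., x_{i-1}); here just a smoothed average."""
    if len(prev_pixels) == 0:
        return 0.5
    return (sum(prev_pixels) + 1) / (len(prev_pixels) + 2)

pixels = []
for i in range(n * n):                  # one pixel at a time: slow, sequential sampling
    p = p_next(pixels)
    pixels.append(int(rng.random() < p))
image = np.array(pixels).reshape(n, n)
print(image)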
This approach is particularly interesting because it gives a handle on the tractability of the joint distribution from which the images have been drawn. This factorization turns the prediction problem into a sequence problem, and to take full advantage of this representation, a strong and highly representative recurrent model is needed.
Recurrent Neural Networks (RNN) are neural networks that allow for cyclic connections between their units. In other words, RNNs evaluate similar pieces of information in a sequential order to make use of the previously evaluated information. The inputs of these micro-tasks are strongly dependent on the previous outputs. RNNs are widely used in Natural Language Processing (NLP) tasks where, given a sentence (a sequence of words), we predict the word most likely to appear next given the previous ones. This similarity in concept to the prediction of pixels given the previous ones makes RNNs a great candidate for this task.
The resulting model is referred to as PixelRNN. It is comprised of up to twelve special types of two-dimensional RNNs called LSTM (Long Short-Term Memory) layers [31]. These special types of RNNs solve a major problem in neural networks in general: long-term dependencies. Most neural network architectures do not allow for the utilization of previous
outputs, which is unfortunate in many cases where completely discarding the previously acquired information is not the best practice. In theory, RNNs have the ability to make a current decision based on the previous ones. However, in practice, they have not proven useful in cases where a long-term dependency is needed, although they perform just fine in short-term dependency tasks. The reasons for this shortcoming are explored in depth by Bengio et al. [32]. LSTMs conceptually solve this problem and can seamlessly store and reuse long-term information outputted by the network.
PixelRNN as a generative model has many advantages, but perhaps the most important of them is the tractability of the joint probability p(x), which in turn makes it easy to build metrics to evaluate its performance. PixelRNN, in general, produces good-looking images; its only defect is the slow generation of samples due to its sequential nature.
B. Deep Recurrent Attentive Writer (DRAW)

The Deep Recurrent Attentive Writer [33] is a neural network architecture that generates images in a sequential fashion, mimicking the process of painting as would be performed by an artist. It works by gradually improving upon the generated image in a process that resembles the brush strokes of an artist. Unlike most generative models that generate entire images at once (in the sense that generation is one continuous, uninterrupted process), which effectively breaks any analogy that could be made with the human creative process, DRAW does it in a more human-like fashion.
Broadly speaking, DRAW has the same general shape as a variational autoencoder. It is composed of two main recurrent networks, an encoder and a decoder, and is trained using stochastic gradient descent. What differentiates DRAW from the traditional variational autoencoder is that generation is done over several passes by the decoder, based on the signal emitted by the encoder. The image is constructed little by little, perfecting the outlines, sharpening the details and retouching until the generation is complete.
However, in real life, artists do not follow the same exact process described so far. What the decoder does is make small changes on multiple regions of the image at once. So even though the generation is done sequentially, it does not quite resemble what goes into making a painting. Artists proceed by selectively choosing specific regions each time until the painting is done. To model that, DRAW uses attention gates to focus attention on specific regions at every iteration and ignore the rest. In other words, the encoder does not receive the entire image as input, but rather a "cropped" area which it will then use to instruct the decoder on how to improve the reconstruction. To select the area to be "cropped", DRAW uses previously acquired information to decide what portion of the reconstructed image needs "attention." The word "cropped" here is not used in the conventional sense because in practice, what we do is apply a Gaussian filter centered around the chosen area to give more influence to pixels near the center.
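A rough, hedged illustration of that last idea: a 2-D Gaussian weighting mask centered on the attended location softly "crops" a region of the image. This is a simplification of DRAW's actual filter-bank read/write attention [33], with made-up sizes:

import numpy as np

def gaussian_mask(height, width, center_y, center_x, sigma):
    """2-D Gaussian weights: pixels near (center_y, center_x) get weight close to 1."""
    ys, xs = np.mgrid[0:height, 0:width]
    return np.exp(-((ys - center_y) ** 2 + (xs - center_x) ** 2) / (2 * sigma ** 2))

image = np.random.rand(28, 28)                 # toy image
mask = gaussian_mask(28, 28, center_y=10.0, center_x=18.0, sigma=3.0)
glimpse = image * mask                         # soft "crop" fed to the encoder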
C. Neural Autoregressive Distribution Estimation (NADE)

NADE [34] addresses the problem of tractable modeling of the joint distribution. As a starter, we assume that our data points are binary vectors; in other terms, x ∈ {0, 1}^n, where x is a data point coming from the distribution p(x) and n is the dimensionality of our data.
To estimate the distribution p(x), NADE begins by making the observation that p(x) can be cast into a product of one-dimensional conditional distributions:

p(x) = Π_{i=1}^{n} p(x_{o(i−1)_i} | x_{o(i−1)})   (9)

where o(j) is a permutation of the integers 1, 2, ..., j, x_{o(j)} is the sub-vector of j dimensions in the order o(j) and x_{o(j)_i} is the i-th component of the vector x_{o(j)}. In turn, each one-dimensional conditional distribution is modeled using a neural network of its own:

p(x_i | x_{o(i−1)}) = σ(V_{o(i−1)_i,·} h_i + b_{o(i−1)_i}), where h_i = W_{·,o(i−1)} x_{o(i−1)} + c   (10)

σ is the logistic sigmoid function; V, W, b and c are the parameters of the neural network.
Where RBMs solve the problem of tractability of distributions by assuming no correlations between units of different layers, in particular between visible and hidden units, NADE takes a different approach to this issue by assuming dependency of every hidden layer on the previous layers. This similarity in principle to RNNs makes it possible for NADE to share information across the upcoming hidden units, since every conditional distribution p(x_i | x_{o(i−1)}) is modeled using the same set of parameters (same network).
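A minimal NumPy sketch of Eqs. (9) and (10) for the identity ordering o = (1, ..., n): every conditional reuses the same parameters W, V, b and c, and the hidden pre-activation can be updated incrementally. The parameter shapes are illustrative; this is not the reference implementation of [34]:

import numpy as np

rng = np.random.default_rng(5)
n, n_hid = 10, 16
W = rng.normal(scale=0.1, size=(n_hid, n))   # shared input-to-hidden weights
V = rng.normal(scale=0.1, size=(n, n_hid))   # shared hidden-to-output weights
b = np.zeros(n)                              # output biases
c = np.zeros(n_hid)                          # hidden biases
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def log_prob(x):
    """log p(x) = sum_i log p(x_i | x_<i), sharing parameters across conditionals."""
    a = c.copy()                             # running pre-activation W[:, :i] @ x[:i] + c
    logp = 0.0
    for i in range(n):
        h_i = sigmoid(a)                     # h_i depends only on x_<i
        p_i = sigmoid(V[i] @ h_i + b[i])     # p(x_i = 1 | x_<i), Eq. (10)
        logp += np.log(p_i if x[i] == 1 else 1.0 - p_i)
        a += W[:, i] * x[i]                  # incremental update for the next step
    return logp

x = rng.integers(0, 2, size=n)
print("log p(x) =", log_prob(x))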
NADE is trained by maximizing the average log-likelihood over the D training examples:

max (1/D) Σ_{d=1}^{D} Σ_{i=1}^{n} log p(x_i | x_{o(i−1)})   (11)

The NADE generalization to real-valued NADE (RNADE) is similar in principle to the RBM generalization to the Gaussian-Bernoulli RBM (GBRBM). By applying the same procedures, we can build a network that outputs real values in the form of the means of the Gaussian conditional distributions p(x_i | x_{o(i−1)}).
In short, NADE in all its forms is a very decent alternative to directed and undirected graphical models that offers a very flexible way to tractably estimate the joint distribution. It has been used for a wide variety of tasks including image generation and classification, topic modeling, music generation and many more. It is computationally efficient, produces decent results and often performs well on benchmark data sets.

IV. CONCLUSION

To wrap up, deep generative modeling has experienced rapid growth over the last few years, which has made it gain more and more attention from the scientific community and the business world alike. In this paper, we presented the most
promising generative models that make up most deep generative architectures. We studied their advantages, potential and limitations. We also listed some state-of-the-art architectures that had the best results on benchmark generation problems. If the research in this area continues at this rate, we can expect to soon see emerging learning algorithms that not only do the job of generating pseudo-real data, but also provide us with a better and deeper understanding of the world around us.
REFERENCES

[1] T. H.-W. Westerveld, "Using generative probabilistic models for multimedia retrieval," Ph.D. dissertation, Neslia Paniculata, 2004.
[2] R. Miotto and G. Lanckriet, "A generative context model for semantic music annotation and retrieval," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 4, pp. 1096–1108, 2012.
[3] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[4] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, et al., "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups," IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, 2012.
[5] D. Klein and C. D. Manning, "Fast exact inference with a factored model for natural language parsing," in Advances in Neural Information Processing Systems, 2003, pp. 3–10.
[6] R. Cotterell and J. Eisner, "Probabilistic typology: Deep generative models of vowel inventories," arXiv:1705.01684 [cs], May 2017.
[7] S. Thrun et al., "Robotic mapping: A survey," Exploring Artificial Intelligence in the New Millennium, vol. 1, pp. 1–35, 2002.
[8] Y. LeCun, S. Chopra, R. Hadsell, M. Ranzato, and F. Huang, "A tutorial on energy-based learning," Predicting Structured Data, vol. 1, 2006.
[9] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, "Greedy layer-wise training of deep networks," in Advances in Neural Information Processing Systems, 2007, pp. 153–160.
[10] S. E. Fahlman, G. E. Hinton, and T. J. Sejnowski, "Massively parallel architectures for AI: NETL, Thistle, and Boltzmann machines," in Proceedings of AAAI-83, 1983, pp. 109–113.
[11] D. H. Ackley, G. E. Hinton, and T. J. Sejnowski, "A learning algorithm for Boltzmann machines," Cognitive Science, vol. 9, no. 1, pp. 147–169, 1985.
[12] S. Lowel and W. Singer, "Selection of intrinsic horizontal connections in the visual cortex by correlated neuronal activity," Science, vol. 255, no. 5041, pp. 209–212, Jan 1992.
[13] P. Smolensky, "Information processing in dynamical systems: Foundations of harmony theory," Colorado Univ. at Boulder, Dept. of Computer Science, Tech. Rep., 1986.
[14] M. Welling, M. Rosen-Zvi, and G. E. Hinton, "Exponential family harmoniums with an application to information retrieval," in Advances in Neural Information Processing Systems, 2005, pp. 1481–1488.
[15] A. Courville, J. Bergstra, and Y. Bengio, "A spike and slab restricted Boltzmann machine," in Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 2011, pp. 233–241.
[16] T. Tieleman, "Training restricted Boltzmann machines using approximations to the likelihood gradient," ACM, 2008, pp. 1064–1071.
[17] C. Andrieu, N. De Freitas, A. Doucet, and M. I. Jordan, "An introduction to MCMC for machine learning," Machine Learning, vol. 50, no. 1–2, pp. 5–43, 2003.
[18] R. Salakhutdinov and G. Hinton, "Deep Boltzmann machines," in Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, ser. Proceedings of Machine Learning Research, vol. 5. PMLR, 16–18 Apr 2009, pp. 448–455.
[19] G. E. Hinton and R. R. Salakhutdinov, "A better way to pretrain deep Boltzmann machines," in Advances in Neural Information Processing Systems, 2012, pp. 2447–2455.
[20] R. Salakhutdinov and G. E. Hinton, "Deep belief networks," 2007.
[21] G. E. Hinton, S. Osindero, and Y.-W. Teh, "A fast learning algorithm for deep belief nets," Neural Computation, vol. 18, no. 7, pp. 1527–1554, 2006.
[22] G. E. Hinton, P. Dayan, B. J. Frey, and R. M. Neal, "The 'wake-sleep' algorithm for unsupervised neural networks," Science, vol. 268, no. 5214, pp. 1158–1161, 1995.
[23] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol, "Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion," Journal of Machine Learning Research, vol. 11, no. Dec, pp. 3371–3408, 2010.
[24] A. Makhzani and B. Frey, "k-sparse autoencoders," arXiv:1312.5663 [cs], Dec 2013.
[25] D. P. Kingma and M. Welling, "Auto-encoding variational Bayes," arXiv preprint arXiv:1312.6114, 2013.
[26] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.
[27] A. Radford, L. Metz, and S. Chintala, "Unsupervised representation learning with deep convolutional generative adversarial networks," arXiv:1511.06434 [cs], Nov 2015.
[28] I. Goodfellow, "NIPS 2016 tutorial: Generative adversarial networks," arXiv preprint arXiv:1701.00160, 2016.
[29] R. Pascanu, T. Mikolov, and Y. Bengio, "Understanding the exploding gradient problem," CoRR, vol. abs/1211.5063, 2012. [Online]. Available: http://arxiv.org/abs/1211.5063
[30] A. v. d. Oord, N. Kalchbrenner, and K. Kavukcuoglu, "Pixel recurrent neural networks," arXiv:1601.06759 [cs], Jan 2016.
[31] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[32] Y. Bengio, P. Simard, and P. Frasconi, "Learning long-term dependencies with gradient descent is difficult," IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 157–166, 1994.
[33] K. Gregor, I. Danihelka, A. Graves, D. J. Rezende, and D. Wierstra, "DRAW: A recurrent neural network for image generation," arXiv:1502.04623 [cs], Feb 2015.
[34] B. Uria, M.-A. Côté, K. Gregor, I. Murray, and H. Larochelle, "Neural autoregressive distribution estimation," Journal of Machine Learning Research, vol. 17, no. 205, pp. 1–37, 2016.
[35] S. E. Fahlman, G. E. Hinton, and T. J. Sejnowski, "Massively parallel architectures for AI: NETL, Thistle, and Boltzmann machines," in Proceedings of AAAI-83, 1983, pp. 109–113.
[36] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.
[37] Z. Hu, Z. Yang, R. Salakhutdinov, and E. P. Xing, "On unifying deep generative models," arXiv:1706.00550 [cs, stat], Jun 2017.
[38] K. Kavukcuoglu, M. Ranzato, and Y. LeCun, "Fast inference in sparse coding algorithms with applications to object recognition," arXiv preprint arXiv:1010.3467, 2010.
[39] D. P. Kingma, S. Mohamed, D. J. Rezende, and M. Welling, "Semi-supervised learning with deep generative models," in Advances in Neural Information Processing Systems, 2014, pp. 3581–3589.
[40] D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling, "Improving variational inference with inverse autoregressive flow," arXiv:1606.04934 [cs, stat], Jun 2016.