Flow Based Deep Generative Models Report
Abstract
1 Background
Recent years have witnessed great progress in deep generative models. Figure 1 shows the taxonomy of
these models [1]. On the left branch of this taxonomic tree, an explicit likelihood can be maximized
by constructing an explicit density. However, the density may be computationally intractable in some
cases, where we have to approximate the density with variational methods or Markov chain Monte
Carlo (MCMC). On the right branch of the tree, the model does not explicitly represent a probability
distribution over the space where the data lies. Instead, the model provides some way of interacting
less directly with the data distribution.
In the following sections, we will briefly introduce generative adversarial networks (GANs) and
variational autoencoders (VAEs) and compare them with flow-based models.
Final project for “Probabilistic Approaches to Unsupervised Learning” (UCSD CSE 291, Fall 2020)
1.1 Generative adversarial networks (GANs)
Generative adversarial networks (GANs) [2] have demonstrated great success in many fields at
replicating rich, real-world content such as images, human language, and music. Originally inspired
by game theory, a GAN is essentially a game between two players—a generator and a discriminator.
The game goes as follows.
• The discriminator D estimates the probability of a given sample coming from the real data
distribution. It works as a critic and is optimized to tell the fake samples from the real ones.
• The generator G outputs synthetic samples given a noise variable input. It is trained to
capture the real data distribution so that the generated samples are authentic enough to trick
the discriminator into assigning them a high probability of being real samples.
The two models compete against each other during the training—the generator aims at fooling the
discriminator, whereas the discriminator aims not to be cheated. The zero-sum game between them
motivates both to improve their performance. Figure 2 illustrates the pipeline of a GAN.
Formally, we have the data distribution p_data(x) and a prior distribution p(z) on the latent variable z
(usually modeled by a uniform distribution). Moreover, we have a generator G with parameters θ_g and
a discriminator D with parameters θ_d.
Then, the objective function for the discriminator is
$$\max_{\theta_d} \; \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p(z)}[\log(1 - D(G(z)))],$$
which encourages the discriminator to distinguish between real and fake data. The objective function
for the generator is
$$\min_{\theta_g} \; \mathbb{E}_{z \sim p(z)}[\log(1 - D(G(z)))],$$
which pushes the generator to try to fool the discriminator. Finally, combining the two objectives
gives a minimax game between the generator G and the discriminator D:
$$\min_{\theta_g} \max_{\theta_d} \; \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p(z)}[\log(1 - D(G(z)))].$$
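To make these objectives concrete, here is a minimal PyTorch-style sketch of one alternating update; the generator G, the discriminator D, their optimizers, and the noise dimension are hypothetical placeholders supplied by the caller, and the generator loss is written literally in the saturating form used above.

```python
import torch
import torch.nn.functional as F

def gan_step(G, D, x_real, opt_g, opt_d, z_dim=64):
    """One simplified GAN training step. G maps noise to samples; D outputs a
    single logit per example (shape (n, 1)). Both are hypothetical modules."""
    n = x_real.size(0)
    ones, zeros = torch.ones(n, 1), torch.zeros(n, 1)

    # Discriminator: maximize log D(x) + log(1 - D(G(z)))  (= minimize BCE)
    x_fake = G(torch.randn(n, z_dim)).detach()          # block gradients into G
    d_loss = (F.binary_cross_entropy_with_logits(D(x_real), ones)
              + F.binary_cross_entropy_with_logits(D(x_fake), zeros))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator: minimize E[log(1 - D(G(z)))], written out literally as in the text
    g_loss = torch.log1p(-torch.sigmoid(D(G(torch.randn(n, z_dim))))).mean()
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```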
1.2 Variational autoencoders (VAEs)
The idea of the variational autoencoder (VAE) [4] is different from that of a traditional autoencoder [5].
An autoencoder is a neural network designed to learn an identity function in an unsupervised way: it
reconstructs the original input while compressing the data in the process, so as to discover a more
efficient, compressed representation.
Instead of mapping the input to a fixed vector, a VAE maps it to a distribution. We sample a z
from some prior distribution pθ(z). Then, x is generated from a conditional distribution pθ(x | z).
Mathematically, the process can be modeled as
$$p_\theta(x) = \int p_\theta(x \mid z)\, p_\theta(z)\, dz.$$
However, it is computationally intractable to integrate over all z. To narrow down the value
space, consider the posterior pθ(z | x) and approximate it by an inference model qφ(z | x). Then, we
can expand the data log-likelihood as follows.
$$
\begin{aligned}
\log p_\theta(x)
&= \mathbb{E}_{z \sim q_\phi(z \mid x)}[\log p_\theta(x)] \\
&= \mathbb{E}_{z \sim q_\phi(z \mid x)}\left[\log \frac{p_\theta(x \mid z)\, p_\theta(z)}{p_\theta(z \mid x)}\right] \\
&= \mathbb{E}_{z \sim q_\phi(z \mid x)}\left[\log \frac{p_\theta(x \mid z)\, p_\theta(z)}{p_\theta(z \mid x)} \cdot \frac{q_\phi(z \mid x)}{q_\phi(z \mid x)}\right] \\
&= \mathbb{E}_{z \sim q_\phi(z \mid x)}[\log p_\theta(x \mid z)]
   - \mathbb{E}_{z \sim q_\phi(z \mid x)}\left[\log \frac{q_\phi(z \mid x)}{p_\theta(z)}\right]
   + \mathbb{E}_{z \sim q_\phi(z \mid x)}\left[\log \frac{q_\phi(z \mid x)}{p_\theta(z \mid x)}\right] \\
&= \mathbb{E}_{z \sim q_\phi(z \mid x)}[\log p_\theta(x \mid z)]
   - D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p_\theta(z)\big)
   + D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\big).
\end{aligned}
$$
Now, a VAE adopts the variational method to obtain the evidence lower bound (ELBO) of the data
likelihood. Since the last KL term is non-negative,
$$
\begin{aligned}
\log p_\theta(x)
&= \mathbb{E}_{z \sim q_\phi(z \mid x)}[\log p_\theta(x \mid z)]
   - D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p_\theta(z)\big)
   + D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\big) \\
&\ge \mathbb{E}_{z \sim q_\phi(z \mid x)}[\log p_\theta(x \mid z)]
   - D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p_\theta(z)\big) \\
&= \mathrm{ELBO}(x; \theta, \phi).
\end{aligned}
$$
Since ELBO(x; θ, φ) is tractable, we maximize the ELBO rather than the exact data likelihood. That is,
the objective of a VAE is
$$\max_{\theta, \phi} \; \mathrm{ELBO}(x; \theta, \phi).$$
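As an illustration, the following is a minimal sketch of a single-sample ELBO estimate for the common choice of a Gaussian encoder and a Bernoulli decoder; the encoder and decoder modules (and their output conventions) are placeholders, not a prescribed architecture.

```python
import torch
import torch.nn.functional as F

def elbo(x, encoder, decoder):
    """ELBO(x) = E_q[log p(x|z)] - KL(q(z|x) || N(0, I)).
    `encoder(x)` is assumed to return (mu, logvar); `decoder(z)` returns
    Bernoulli logits; x is assumed to lie in [0, 1]."""
    mu, logvar = encoder(x)

    # Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I)
    z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)

    # Reconstruction term E_q[log p(x|z)] (single-sample Monte Carlo estimate)
    recon = -F.binary_cross_entropy_with_logits(decoder(z), x, reduction="sum")

    # KL(N(mu, diag(sigma^2)) || N(0, I)) in closed form
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())

    return recon - kl   # maximize this (or minimize its negative)
```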
1.3 Comparisons of GANs, VAEs and flow-based models
Mathematically, the objective functions for the three types of models are

(GANs)  $$\min_{\theta_g} \max_{\theta_d} \; \mathbb{E}_{x \sim p_{\text{data}}}[\log D_{\theta_d}(x)] + \mathbb{E}_{z \sim p(z)}[\log(1 - D_{\theta_d}(G_{\theta_g}(z)))]$$

(VAEs)  $$\max_{\theta, \phi} \; \mathbb{E}_{z \sim q_\phi(z \mid x)}[\log p_\theta(x \mid z)] - D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p_\theta(z)\big)$$

(Flow-based models)  $$\max_{\theta} \; \sum_{x \in \mathcal{D}} \log p_\theta(x)$$
Below we summarize the differences between GANs, VAEs, and flow-based models.
• Generative adversarial networks: A GAN provides a clever way to model data generation, an
unsupervised learning problem, as a supervised one. The discriminator learns to distinguish
real data from the fake samples produced by the generator. The two models are trained as if
they were playing a minimax game.
• Variational autoencoders: VAE implicitly optimizes the log-likelihood of the data by maxi-
mizing the evidence lower bound (ELBO).
• Flow-based generative models: A flow-based generative model is constructed by a sequence
of invertible transformations. Unlike the other two, the model explicitly learns the data distribution
p(x), and therefore the loss function is simply the negative log-likelihood. Please refer
to Section 3 for details.
Figure 4: Differences between GAN, VAE, and flow-based models (Source: [7])
2.2 Change of variables theorem
Given a random variable z ∼ π(z) and an invertible mapping x = f(z) (i.e., z = f⁻¹(x) = g(x)),
the distribution of x is
$$p(x) = \pi(z) \left| \frac{dz}{dx} \right| = \pi(g(x)) \left| \frac{dg}{dx} \right|.$$
In the multivariate case, this becomes
$$p(x) = \pi(z) \left| \det \frac{dz}{dx} \right| = \pi(g(x)) \left| \det \frac{dg}{dx} \right|,$$
where det(dg/dx) is the Jacobian determinant of g.
3 Normalizing Flows
Figure 5 illustrates the core idea behind normalizing flows [7, 8]. In each step, we substitute the
variable with a new one via the change of variables theorem. Eventually, we hope that the final
distribution we obtain is close enough to the target distribution.
Figure 5: Transform a simple distribution into a complex one by applying a sequence of invertible
transformations (Source: [7])
Mathematically, for each step we have zᵢ ∼ pᵢ(zᵢ), zᵢ = fᵢ(zᵢ₋₁), and zᵢ₋₁ = gᵢ(zᵢ). Next, we want to
express pᵢ(zᵢ) in terms of pᵢ₋₁(zᵢ₋₁) and zᵢ₋₁:
$$
\begin{aligned}
p_i(z_i)
&= p_{i-1}(g_i(z_i)) \left| \det \frac{d g_i(z_i)}{d z_i} \right| && \text{(by the change of variables theorem)} \\
&= p_{i-1}(z_{i-1}) \left| \det \frac{d z_{i-1}}{d f_i(z_{i-1})} \right| && \text{(by definition)} \\
&= p_{i-1}(z_{i-1}) \left| \det \left( \frac{d f_i(z_{i-1})}{d z_{i-1}} \right)^{-1} \right| && \text{(by the inverse function theorem)} \\
&= p_{i-1}(z_{i-1}) \left| \det \frac{d f_i}{d z_{i-1}} \right|^{-1}. && \text{(since } \det M \det(M^{-1}) = \det I = 1\text{)}
\end{aligned}
$$
Taking the logarithm of both sides, we obtain
$$\log p_i(z_i) = \log p_{i-1}(z_{i-1}) - \log \left| \det \frac{d f_i}{d z_{i-1}} \right|.$$
Now, recall that x = z_K = f_K ∘ f_{K−1} ∘ ⋯ ∘ f_1(z_0). Thus, we can compute the exact log-likelihood
log p(x) of an input data point x as follows.
$$
\begin{aligned}
\log p(x) = \log p_K(z_K)
&= \log p_{K-1}(z_{K-1}) - \log \left| \det \frac{d f_K}{d z_{K-1}} \right| \\
&= \cdots \\
&= \log p_0(z_0) - \sum_{i=1}^{K} \log \left| \det \frac{d f_i}{d z_{i-1}} \right|.
\end{aligned}
$$
To make this computation tractable, each transformation fᵢ should satisfy two requirements:
• fᵢ is easily invertible;
• the Jacobian determinant of fᵢ is easy to compute.
Finally, we can train the model by maximizing the log-likelihood over a training dataset D:
$$\mathrm{LL}(\mathcal{D}) = \sum_{x \in \mathcal{D}} \log p(x).$$
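The following sketch shows how this log-likelihood would be accumulated in code, assuming each flow layer exposes an inverse(...) method returning both the previous variable and the corresponding log-determinant; this interface is an assumption for illustration, not a standard API.

```python
import torch

def flow_log_prob(x, flows, base_log_prob):
    """Exact log-likelihood of a normalizing flow.

    `flows`         : list [f_1, ..., f_K]; each f is assumed to expose
                      f.inverse(y) -> (y_prev, log_det), where log_det is
                      log|det(df/dy_prev)| per example.
    `base_log_prob` : callable returning log p_0(z_0) per example.
    """
    log_det_sum = torch.zeros(x.shape[0])
    z = x
    for f in reversed(flows):          # invert f_K, ..., f_1
        z, log_det = f.inverse(z)
        log_det_sum = log_det_sum + log_det
    # log p(x) = log p_0(z_0) - sum_i log|det df_i/dz_{i-1}|
    return base_log_prob(z) - log_det_sum

def nll_loss(x, flows, base_log_prob):
    """Negative log-likelihood, i.e. the training loss of a flow model."""
    return -flow_log_prob(x, flows, base_log_prob).mean()
```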
In the following sections, we will introduce three representative normalizing flow models—non-linear
independent components estimation (NICE) [9], real-valued non-volume preserving (RealNVP) [10]
and Glow [11].
3.1 NICE
The core idea behind non-linear independent components estimation (NICE) [9] is as follows. Split the
input x ∈ ℝᴰ into two parts, x_{1:d} and x_{d+1:D}, keep the first part unchanged, and shift the second
part by an arbitrary function m of the first part:
$$y_{1:d} = x_{1:d}, \qquad y_{d+1:D} = x_{d+1:D} + m(x_{1:d}).$$
The transformation is called an additive coupling layer. It satisfies the requirements described above:
it is trivially invertible (x_{d+1:D} = y_{d+1:D} − m(y_{1:d})), and its Jacobian is the lower-triangular matrix
$$J = \begin{pmatrix} I_d & 0_{d \times (D-d)} \\ \dfrac{\partial m(x_{1:d})}{\partial x_{1:d}} & I_{D-d} \end{pmatrix},$$
which has a unit Jacobian determinant, det(J) = 1. Note that NICE is a volume-preserving flow, as its
Jacobian determinant is 1.
However, some dimensions remain unchanged after the transform. Thus, we need to alternate the
dimensions being modified in each step, as illustrated in Figure 6. Note that at least three coupling
layers are necessary to allow all dimensions to influence one another [9].
Figure 6: Alternating Pattern (Source: [9])
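A minimal PyTorch sketch of such an additive coupling layer is given below; the class name, the architecture of m, and the swap flag used to alternate the modified dimensions are illustrative choices, not the configuration from the NICE paper.

```python
import torch
import torch.nn as nn

class AdditiveCoupling(nn.Module):
    """Sketch of NICE's additive coupling layer (assumes an even `dim`):
        y_{1:d}   = x_{1:d}
        y_{d+1:D} = x_{d+1:D} + m(x_{1:d}),
    with log|det J| = 0 (volume preserving). `m` is an arbitrary network.
    Set `swap=True` in alternating layers so every dimension gets updated."""
    def __init__(self, dim, hidden=128, swap=False):
        super().__init__()
        self.d = dim // 2
        self.swap = swap
        self.m = nn.Sequential(
            nn.Linear(self.d, hidden), nn.ReLU(),
            nn.Linear(hidden, dim - self.d),
        )

    def forward(self, x):
        x1, x2 = x[:, :self.d], x[:, self.d:]
        if self.swap:
            x1, x2 = x2, x1
        y1, y2 = x1, x2 + self.m(x1)          # shift the second partition
        if self.swap:
            y1, y2 = y2, y1
        log_det = torch.zeros(x.shape[0])     # unit Jacobian determinant
        return torch.cat([y1, y2], dim=1), log_det

    def inverse(self, y):
        y1, y2 = y[:, :self.d], y[:, self.d:]
        if self.swap:
            y1, y2 = y2, y1
        x1, x2 = y1, y2 - self.m(y1)          # undo the shift
        if self.swap:
            x1, x2 = x2, x1
        return torch.cat([x1, x2], dim=1), torch.zeros(y.shape[0])
```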
3.2 RealNVP
The core idea behind real-valued non-volume preserving (RealNVP) [10] is as follows. Again split the
input into two parts, keep the first part unchanged, and apply an affine transformation, whose scale s
and translation t are functions of the first part, to the second part:
$$y_{1:d} = x_{1:d}, \qquad y_{d+1:D} = x_{d+1:D} \odot \exp(s(x_{1:d})) + t(x_{1:d}).$$
The transformation is called an affine coupling layer. It satisfies the requirements described above: it is
easily invertible (x_{d+1:D} = (y_{d+1:D} − t(y_{1:d})) ⊙ exp(−s(y_{1:d}))), and its Jacobian is the
lower-triangular matrix
$$J = \begin{pmatrix} I_d & 0_{d \times (D-d)} \\ \dfrac{\partial y_{d+1:D}}{\partial x_{1:d}} & \mathrm{diag}\big(\exp(s(x_{1:d}))\big) \end{pmatrix},$$
which has a Jacobian determinant that is easy to compute:
$$\det(J) = \prod_{j=1}^{D-d} \exp\!\big(s(x_{1:d})_j\big) = \exp\!\Big(\sum_{j=1}^{D-d} s(x_{1:d})_j\Big).$$
Note that the above computation does not involve computing the Jacobians of s and t, so both can be
arbitrarily complex functions.
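Below is a minimal PyTorch sketch of an affine coupling layer; the joint s/t network, the hidden width, and the tanh used to keep s bounded are common implementation choices rather than details fixed by the RealNVP paper.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """Sketch of RealNVP's affine coupling layer:
        y_{1:d}   = x_{1:d}
        y_{d+1:D} = x_{d+1:D} * exp(s(x_{1:d})) + t(x_{1:d}),
    with log|det J| = sum_j s(x_{1:d})_j."""
    def __init__(self, dim, hidden=128):
        super().__init__()
        self.d = dim // 2
        self.net = nn.Sequential(               # outputs [s, t] jointly
            nn.Linear(self.d, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - self.d)),
        )

    def forward(self, x):
        x1, x2 = x[:, :self.d], x[:, self.d:]
        s, t = self.net(x1).chunk(2, dim=1)
        s = torch.tanh(s)                       # common trick to keep s bounded
        y2 = x2 * torch.exp(s) + t
        log_det = s.sum(dim=1)                  # log|det J| = sum_j s_j
        return torch.cat([x1, y2], dim=1), log_det

    def inverse(self, y):
        y1, y2 = y[:, :self.d], y[:, self.d:]
        s, t = self.net(y1).chunk(2, dim=1)
        s = torch.tanh(s)
        x2 = (y2 - t) * torch.exp(-s)
        return torch.cat([y1, x2], dim=1), -s.sum(dim=1)   # log|det| of inverse
```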
3.3 Glow
Glow [11] further extends RealNVP with invertible 1×1 convolutions to improve the modeling
capability. Each step of flow in Glow consists of the following operations (see Figure 7).
• Actnorm: an affine transformation with a per-channel scale s and bias b.
  – Forward: y = s ⊙ x + b
  – Backward: x = (y − b) / s
  – Log-determinant: h · w · Σᵢ log |sᵢ|
• Invertible 1×1 convolution: each spatial position's channel vector is multiplied by a c × c matrix W.
  – Forward: y = Wx
  – Backward: x = W⁻¹y
  – Log-determinant: h · w · log |det(W)|
• Affine coupling layer: the same affine coupling transformation as in RealNVP (Section 3.2).
Figure 7: A step of the Glow model (Source: [11])
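The log-determinant bookkeeping of the first two operations can be sketched as follows; this is a simplified illustration (actnorm's data-dependent initialization is omitted and the scale is kept in log space), not Glow's reference implementation.

```python
import torch
import torch.nn as nn

class ActNorm(nn.Module):
    """Sketch of actnorm on (N, C, H, W) tensors: y = s * x + b with per-channel
    s and b; log|det| = H * W * sum_c log|s_c|."""
    def __init__(self, channels):
        super().__init__()
        self.log_s = nn.Parameter(torch.zeros(1, channels, 1, 1))
        self.b = nn.Parameter(torch.zeros(1, channels, 1, 1))

    def forward(self, x):
        _, _, h, w = x.shape
        y = x * torch.exp(self.log_s) + self.b
        log_det = h * w * self.log_s.sum()
        return y, log_det.expand(x.shape[0])

class Invertible1x1Conv(nn.Module):
    """Sketch of the invertible 1x1 convolution: every pixel's channel vector is
    multiplied by the same C x C matrix W; log|det| = H * W * log|det(W)|."""
    def __init__(self, channels):
        super().__init__()
        q, _ = torch.linalg.qr(torch.randn(channels, channels))  # random rotation
        self.W = nn.Parameter(q)

    def forward(self, x):
        _, _, h, w = x.shape
        y = torch.einsum("oc,nchw->nohw", self.W, x)
        log_det = h * w * torch.linalg.slogdet(self.W)[1]
        return y, log_det.expand(x.shape[0])
```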
Figure 8 shows some examples generated by the Glow model on the CelebA-HQ dataset [12] with
different temperatures.
Figure 8: Samples of the Glow model with different temperatures (Source: [11])
4 Autoregressive Flows
The key idea behind an autoregressive flow is to model the transformation in a normalizing flow as an
autoregressive model. In an autoregressive model, we assume that the current output depends only on
the data observed in the past and factorize the joint probability p(x₁, x₂, …, x_D) into the product of
the probabilities of observing each xᵢ conditioned on the past observations x₁, x₂, …, xᵢ₋₁:
$$
\begin{aligned}
p(x) = p(x_1, x_2, \ldots, x_D)
&= p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_1, x_2) \cdots p(x_D \mid x_1, x_2, \ldots, x_{D-1}) \\
&= \prod_{i=1}^{D} p(x_i \mid x_1, x_2, \ldots, x_{i-1})
 = \prod_{i=1}^{D} p(x_i \mid x_{1:i-1}).
\end{aligned}
$$
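As a small sanity check of this factorization (purely illustrative, with a made-up joint distribution), the sketch below verifies that multiplying the conditionals recovers the joint for three binary variables.

```python
import numpy as np

# Check the chain rule p(x1, x2, x3) = p(x1) p(x2 | x1) p(x3 | x1, x2)
# on an arbitrary joint table over three binary variables.
rng = np.random.default_rng(0)
joint = rng.random((2, 2, 2))
joint /= joint.sum()                                   # normalize to a valid pmf

def factorized(i, j, k):
    p1 = joint.sum(axis=(1, 2))[i]                     # p(x1)
    p2_given_1 = joint.sum(axis=2)[i, j] / p1          # p(x2 | x1)
    p3_given_12 = joint[i, j, k] / joint.sum(axis=2)[i, j]   # p(x3 | x1, x2)
    return p1 * p2_given_1 * p3_given_12

recon = np.array([[[factorized(i, j, k) for k in range(2)]
                   for j in range(2)] for i in range(2)])
print(np.allclose(recon, joint))                       # True
```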
In the following sections, we will introduce two representative autoregressive flow models—masked
autoregressive flow (MAF) [13] and inverse autoregressive flow (IAF) [14].
4.1 Masked autoregressive flow (MAF)
Consider two random variables z ∼ π(z) and x ∼ p(x), where π(z) is known but p(x) is unknown.
Masked autoregressive flow (MAF) aims to learn p(x). To sample from the model, we compute
$$x_i = z_i\, \sigma_i(x_{1:i-1}) + \mu_i(x_{1:i-1}), \qquad z_i \sim \pi(z_i).$$
Note that this computation is slow, as generating the whole vector x is sequential and autoregressive.
To estimate the density of a sample x, we have
$$p(x) = \prod_{i=1}^{D} p(x_i \mid x_{1:i-1}).$$
Note that this computation can be fast if we use the masking approach introduced in MADE [15], as it
requires only a single pass through the network.
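A sketch of the two directions of MAF is given below; the made network is a hypothetical MADE-style module returning per-dimension µ and log σ such that the i-th outputs depend only on x_{1:i−1}.

```python
import torch

def maf_log_prob(x, made, base_log_prob):
    """Density estimation with MAF in a single pass.
    `made(x)` is assumed to return (mu, log_sigma), each shaped like x, where
    the i-th outputs depend only on x_{1:i-1} (MADE-style masking). Then
    z_i = (x_i - mu_i) / sigma_i and log p(x) = log pi(z) - sum_i log sigma_i."""
    mu, log_sigma = made(x)
    z = (x - mu) * torch.exp(-log_sigma)
    return base_log_prob(z) - log_sigma.sum(dim=1)

def maf_sample(made, z):
    """Sampling from MAF is sequential: x_i = z_i * sigma_i(x_{1:i-1}) + mu_i(x_{1:i-1}),
    so each dimension requires another pass through the network."""
    x = torch.zeros_like(z)
    for i in range(z.shape[1]):
        mu, log_sigma = made(x)          # outputs for dim i use x_{1:i-1} only
        x[:, i] = z[:, i] * torch.exp(log_sigma[:, i]) + mu[:, i]
    return x
```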
4.2 Inverse autoregressive flow (IAF)
Now, if we swap x and z by letting z̃ = x and x̃ = z, we get the inverse autoregressive flow (IAF) [14]:
$$
\tilde{x}_i = \tilde{z}_i \frac{1}{\sigma_i(\tilde{z}_{1:i-1})} - \frac{\mu_i(\tilde{z}_{1:i-1})}{\sigma_i(\tilde{z}_{1:i-1})}
= \tilde{z}_i\, \tilde{\sigma}_i(\tilde{z}_{1:i-1}) + \tilde{\mu}_i(\tilde{z}_{1:i-1}),
$$
where
$$
\tilde{\sigma}_i(\tilde{z}_{1:i-1}) = \frac{1}{\sigma_i(\tilde{z}_{1:i-1})}, \qquad
\tilde{\mu}_i(\tilde{z}_{1:i-1}) = -\frac{\mu_i(\tilde{z}_{1:i-1})}{\sigma_i(\tilde{z}_{1:i-1})}.
$$
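Mirroring the previous sketch, IAF swaps which direction is parallel; again, made is a hypothetical MADE-style module, here returning µ̃ and log σ̃ conditioned on the noise variables z̃.

```python
import torch

def iaf_sample(made, z):
    """Sampling from IAF is a single pass: x_i = z_i * sigma_i(z_{1:i-1}) + mu_i(z_{1:i-1}),
    since the noise z is fully known up front."""
    mu, log_sigma = made(z)
    return z * torch.exp(log_sigma) + mu

def iaf_log_prob(x, made, base_log_prob):
    """Density estimation with IAF is sequential: recovering z from x requires
    z_i = (x_i - mu_i(z_{1:i-1})) / sigma_i(z_{1:i-1}), one dimension at a time."""
    z = torch.zeros_like(x)
    log_sigma_all = torch.zeros_like(x)
    for i in range(x.shape[1]):
        mu, log_sigma = made(z)          # outputs for dim i use z_{1:i-1} only
        z[:, i] = (x[:, i] - mu[:, i]) * torch.exp(-log_sigma[:, i])
        log_sigma_all[:, i] = log_sigma[:, i]
    return base_log_prob(z) - log_sigma_all.sum(dim=1)
```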
Figure 9 illustrates how MAF and IAF work differently.
Figure 9: Comparison of MAF and IAF. Note that z̃ = x, x̃ = z, π̃ = p and p̃ = π. (Source: [7])
Moreover, Table 1 summarizes the differences between MAF and IAF. We can see the computational
trade-offs between them [8]: MAF can calculate the density of a sample x in a single pass through
the model, while sampling from it requires D sequential passes, where D is the dimensionality of
x. In contrast, IAF can generate samples in a single pass, while calculating the density p(x) of a
sample x requires D sequential passes.
Table 1: Comparison of MAF and IAF

                       MAF                                          IAF
Base distribution      z ∼ π(z)                                     z̃ ∼ π̃(z̃)
Target distribution    x ∼ p(x)                                     x̃ ∼ p̃(x̃)
Model                  x_i = z_i σ_i(x_{1:i−1}) + µ_i(x_{1:i−1})    x̃_i = z̃_i σ̃_i(z̃_{1:i−1}) + µ̃_i(z̃_{1:i−1})
Sampling               slow (sequential)                            fast (single pass)
Density estimation     fast (single pass)                           slow (sequential)
5 Experiments
5.1 Toy data
We implement RealNVP on a toy dataset to examine its effectiveness. We consider a simple 2D dataset
and use 5 affine coupling layers in the model. Figure 10 shows the results. We can see how the latent
distribution is transformed into the target distribution.
5.2 MNIST
We implement NICE and RealNVP on the MNIST handwritten digit database [16] to examine their
effectiveness on more complex data. Each MNIST digit (originally a 28×28 image) is flattened into a
vector of 784 dimensions. For NICE, we use 6 additive coupling layers; Figure 11 shows the
generated MNIST digits. For RealNVP, we use 5 affine coupling layers; Figure 12 shows the
generated MNIST digits. We can see that both models are able to capture some characteristics of
MNIST digits.
Figure 12: RealNVP on MNIST
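For reference, here is a minimal, self-contained sketch of how such an MNIST experiment could be set up; the layer widths, number of layers and epochs, learning rate, and dequantization details are illustrative guesses rather than the exact configuration used to produce Figures 11 and 12, and sampling digits would additionally require the inverse of each coupling layer (as in the earlier sketches).

```python
import torch
import torch.nn as nn
from torchvision import datasets, transforms

D = 784  # flattened 28x28 images

class Coupling(nn.Module):
    """Tiny affine coupling layer; `flip` alternates which half is updated."""
    def __init__(self, flip):
        super().__init__()
        self.flip = flip
        self.net = nn.Sequential(nn.Linear(D // 2, 512), nn.ReLU(),
                                 nn.Linear(512, D))            # outputs [s, t]

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=1)
        if self.flip:
            x1, x2 = x2, x1
        s, t = self.net(x1).chunk(2, dim=1)
        s = torch.tanh(s)
        y2 = x2 * torch.exp(s) + t
        out = torch.cat([y2, x1] if self.flip else [x1, y2], dim=1)
        return out, s.sum(dim=1)                               # log|det J|

flows = nn.ModuleList([Coupling(flip=(i % 2 == 1)) for i in range(5)])
opt = torch.optim.Adam(flows.parameters(), lr=1e-4)
base = torch.distributions.Normal(0.0, 1.0)

data = datasets.MNIST("data", train=True, download=True,
                      transform=transforms.ToTensor())
loader = torch.utils.data.DataLoader(data, batch_size=128, shuffle=True)

for epoch in range(5):
    for x, _ in loader:
        x = x.view(x.size(0), -1)                              # flatten to 784 dims
        x = (x * 255 + torch.rand_like(x)) / 256               # dequantize to [0, 1)
        z, log_det = x, 0.0
        for f in flows:                                        # data -> latent
            z, ld = f(z)
            log_det = log_det + ld
        # log p(x) = log p0(z) + sum of log|det dz/dx| over layers
        log_px = base.log_prob(z).sum(dim=1) + log_det
        loss = -log_px.mean()                                  # negative log-likelihood
        opt.zero_grad(); loss.backward(); opt.step()
    print("epoch", epoch, "nll", loss.item())
```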
6 Summary
In this report, we investigated flow-based deep generative models, as summarized below.
• We compared different generative models, including GANs, VAEs and flow-based models.
• We surveyed different normalizing flow models, including NICE, RealNVP, Glow, MAF
and IAF.
• We conducted experiments on generating MNIST handwritten digits using NICE and
RealNVP.
References
[1] Ian Goodfellow. “Generative Adversarial Networks.” In NeurIPS tutorial. 2016.
[2] Ian J. Goodfellow et al. “Generative Adversarial Nets.” In NeurIPS. 2014.
[3] Al Gharakhanian. Generative Adversarial Networks. Blog post. 2017. URL: https://www.kdnuggets.com/2017/01/generative-adversarial-networks-hot-topic-machine-learning.html.
[4] Diederik P. Kingma and Max Welling. “Auto-Encoding Variational Bayes.” In ICLR. 2014.
[5] Geoffrey E. Hinton and Ruslan R. Salakhutdinov. “Reducing the dimensionality of data with
neural networks.” Science 313.5786 (2006), pp. 504–507.
[6] Lilian Weng. From Autoencoder to Beta-VAE. Blog post. 2018. URL: https://lilianweng.github.io/lil-log/2018/08/12/from-autoencoder-to-beta-vae.html.
[7] Lilian Weng. Flow-based Deep Generative Models. Blog post. 2018. URL: https://lilianweng.github.io/lil-log/2018/10/13/flow-based-deep-generative-models.html.
[8] Danilo Rezende and Shakir Mohamed. “Variational Inference with Normalizing Flows.” In
ICML. 2015.
[9] Laurent Dinh, David Krueger, and Yoshua Bengio. “NICE: Non-linear Independent Compo-
nents Estimation.” In ICLR. 2015.
[10] Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. “Density Estimation using Real
NVP.” In ICLR. 2017.
[11] Diederik P. Kingma and Prafulla Dhariwal. “Glow: Generative Flow with Invertible 1×1
Convolutions.” In NeurIPS. 2018.
[12] Tero Karras et al. “Progressive Growing of GANs for Improved Quality, Stability, and Varia-
tion.” In ICLR. 2018.
[13] George Papamakarios, Theo Pavlakou, and Iain Murray. “Masked Autoregressive Flow for
Density Estimation.” In NeurIPS. 2017.
[14] Diederik P. Kingma et al. “Improved Variational Inference with Inverse Autoregressive Flow.”
In NeurIPS. 2016.
[15] Mathieu Germain et al. “MADE: Masked Autoencoder for Distribution Estimation.” In ICML.
2015.
[16] Yann LeCun et al. “Gradient-based learning applied to document recognition.” Proceedings of
the IEEE 86.11 (1998), pp. 2278–2324.