
CS 182/282A Designing, Visualizing and Understanding Deep Neural Networks

Spring 2021 Sergey Levine Discussion 10

This discussion focuses on autoregressive models, autoencoders and latent variable models.

1 Generative Models
Generative models represent the full joint distribution p(x, y) where x is an input sample and y is an output.
In deep learning, we are interested in generating new samples that generalize a given dataset. In particular,
this process may not require labeled data, as the goal is to understand some representation of the dataset
distribution.
Broadly speaking, deep generative models include autoregressive models, autoencoder models and generative
adversarial networks. In this discussion, we focus on deep generative models in the context of autoregressive models and autoencoder models.

2 Autoregressive Generative Models


An autoregressive model is a generative model that generates one dimension of a sample at a time, conditioned on its prior predictions.
In general, to train an autoregressive generative model we divide x up into dimensions x1, . . . , xn, discretize each dimension into k values, and model p(x) via the chain rule. For example, when we try to generate one pixel at a time conditioned on pixel values from prior predictions, consider an n × n image with pixels in some order (x1, . . . , xn²). Then, we can define the model as,

pθ(x1, . . . , xn²) = ∏_{i=1}^{n²} pθ(xi | x1, . . . , xi−1)

Since the distribution p(xi | x1, . . . , xi−1) is complex, we model it with a neural network, fθ. Using the same example of generating pixels, we can begin at one corner, proceed diagonally through the 2D spatial map, and use fθ to sample pixels one at a time by conditioning on previously generated pixels.
Unfortunately, generating and sampling each pixel conditioned on the previously generated pixels in this way is expensive. In practice, we instead use a PixelRNN, PixelCNN or Pixel Transformer.
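To make the cost concrete, here is a minimal sketch (not from the original discussion) of the naive sampling loop, assuming a hypothetical `model` that maps a partially generated image to per-pixel logits over k discrete values; note the full forward pass inside the inner loop.

```python
# Hypothetical sketch of naive autoregressive sampling in PyTorch; `model` is assumed
# to map a partially generated (1, 1, n, n) image to (1, k, n, n) logits.
import torch

def sample_image(model, n=28, k=256):
    x = torch.zeros(1, 1, n, n)
    for i in range(n):
        for j in range(n):
            logits = model(x)                                   # full network pass per pixel
            probs = torch.softmax(logits[0, :, i, j], dim=-1)   # p(x_ij | previous pixels)
            x[0, 0, i, j] = torch.multinomial(probs, 1).item() / (k - 1)
    return x
```

This requires n² forward passes for a single image, which is the expense that the architectures below try to mitigate.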

PixelCNN PixelCNN uses a CNN to model the probability of a pixel given the previous pixels. PixelCNN is slow at generating images, because a pass through the entire network is needed for each pixel, but it is fast to train because there is no recurrence (only a single pass over the image), since the spatial maps are known in advance. During training, the convolutions must be masked to ignore pixels at the same or later position in the pixel generation order.
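To illustrate the masking, here is a minimal sketch of a masked convolution, assuming PyTorch and a raster-scan generation order (details not specified in the discussion). A type 'A' mask hides the current pixel and everything after it; a type 'B' mask (used in later layers) allows the current pixel.

```python
# Sketch of a masked 2D convolution for PixelCNN-style training.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedConv2d(nn.Conv2d):
    def __init__(self, mask_type, *args, **kwargs):
        super().__init__(*args, **kwargs)
        assert mask_type in ("A", "B")
        mask = torch.ones_like(self.weight)
        _, _, kh, kw = self.weight.shape
        mask[:, :, kh // 2, kw // 2 + (mask_type == "B"):] = 0   # same row: at/after the center
        mask[:, :, kh // 2 + 1:, :] = 0                          # all rows below the center
        self.register_buffer("mask", mask)

    def forward(self, x):
        return F.conv2d(x, self.weight * self.mask, self.bias,
                        self.stride, self.padding, self.dilation, self.groups)
```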

PixelRNN PixelRNN uses an RNN (or LSTM) to generate images. PixelRNN remembers information from more distant pixels using its recurrent state. In particular, PixelRNN uses recurrence instead of 3 × 3 convolutions to allow long-range dependencies, and can generate a full row of pixels in one pass.



Figure 1: PixelCNN. Images are generated from the corner, and the dependency on previous pixels is modeled using a CNN over the context region.

Figure 2: PixelRNN. This figure illustrates the conditional probability: xi depends on the previous pixels (blue region) and is unrelated to the pixels after it (white region).

Training is generally slower than for PixelCNN, though, due to the recurrence involved, since each row has its own hidden state in the LSTM layers.

PixelTransformer Pixel Transformers are similar to PixelCNN and PixelRNN, but they use a multi-
headed attention network.
Problem 1: PixelCNN vs. Pixel RNN

What is the main difference between PixelCNN and PixelRNN? In particular, comment on:

• Run-time of PixelCNN and PixelRNN at training time


• Run-time of PixelCNN and PixelRNN at test time
• Generation of Pixels



Solution 1: PixelCNN vs. Pixel RNN

Aside from the main difference that PixelCNNs are modelled using CNNs over context regions while PixelRNNs are modelled using an LSTM, in general we have:

• PixelCNN is faster during training time because it can output the entire image in parallel with a single pass. We know the spatial maps / context regions in advance, so the convolutions can be parallelized over them. PixelRNNs, on the other hand, have recurrent states that must be calculated row-by-row.

• PixelRNN is faster at test time, since more pixels are generated per pass: PixelCNNs generate 1 pixel per pass, whereas PixelRNNs generate an entire row of pixels at a time.

• PixelCNN uses a CNN to generate the next pixel from nearby pixels through masking, whereas PixelRNN uses row-LSTMs to generate one row at a time recurrently from the previous row of pixels. It is also possible to use a Diagonal BiLSTM.



3 Autoencoders
Autoencoders train a network to encode an image into some hidden state and then decode that image as accurately as possible from the hidden state. During this process, we force the autoencoder to learn a structured representation. Generally, autoencoders comprise an encoder and a decoder with a hidden state z. They are typically implemented as neural networks that compress the input into a smaller hidden state, which is then decompressed by the decoder. Once training is done, we can discard the second part of the network (the decoder) and use z as the useful features for the original data.
These learned latent representations can be used for downstream tasks, like classification. For example, the VAE paper shows that VAEs achieve adversarial robustness in downstream tasks on the colorMNIST and CelebA datasets.
In particular, there are several key mechanisms to force the autoencoder to learn a structured representation of the data,

1. Dimensionality: Force the hidden state to be smaller than the input/output, so the network must compress information
2. Sparsity: Force the hidden state to be sparse, so the network must compress information into a few active units
3. Denoising: Corrupt the input with noise, and force the autoencoder to learn to distinguish noise from the signal
4. Probabilistic Modeling: Force the hidden state to match a prior distribution

In practice, autoencoders are far less used today, since there exist better alternatives for both representation learning (VAEs, contrastive learning) and generation (GANs, VAEs, autoregressive models).

Bottleneck Autoencoder A Bottleneck Autoencoder can be viewed as non-linear dimensionality reduction, and is useful because the resulting dimensionality is lower and many algorithms are tractable in low-dimensional spaces. This design is antiquated and rarely used. The idea is simple to implement, but reducing dimensionality often fails to provide the structure we want. When the number of hidden dimensions is larger than the input/output, we call the autoencoder overcomplete, and it may simply learn the identity function.

Figure 3: Classical bottleneck architecture reducing 10000 dimensions to 128 dimensions
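For concreteness, here is a minimal sketch of such a bottleneck architecture in PyTorch; the layer sizes are illustrative assumptions chosen to match Figure 3 and are not specified in the discussion.

```python
# Sketch of a bottleneck autoencoder: a 10000-dim input compressed to a 128-dim hidden state.
import torch.nn as nn

class BottleneckAutoencoder(nn.Module):
    def __init__(self, in_dim=10000, hidden_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 1024), nn.ReLU(),
                                     nn.Linear(1024, hidden_dim))
        self.decoder = nn.Sequential(nn.Linear(hidden_dim, 1024), nn.ReLU(),
                                     nn.Linear(1024, in_dim))

    def forward(self, x):
        z = self.encoder(x)          # compressed hidden state (the useful features)
        return self.decoder(z), z    # reconstruction and latent code
```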

Denoising Autoencoder Denoising Autoencoders corrupt the input with noise and then run a Bottleneck Autoencoder, training it to reconstruct the clean input from the corrupted one. There are many variants on this idea. In practice, it is unclear which layer to choose for the bottleneck, and there are some ad-hoc choices (e.g., how much noise to add).



Figure 4: Denoising architecture with a Bottleneck
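A minimal sketch of the corresponding training step, assuming additive Gaussian noise (the noise type and level are among the ad-hoc choices mentioned above) and reusing the BottleneckAutoencoder sketched earlier:

```python
# Sketch of a denoising-autoencoder training step with additive Gaussian noise.
import torch
import torch.nn.functional as F

def denoising_step(autoencoder, x, noise_std=0.3):
    x_noisy = x + noise_std * torch.randn_like(x)   # corrupt the input
    x_hat, _ = autoencoder(x_noisy)                 # e.g. the BottleneckAutoencoder above
    return F.mse_loss(x_hat, x)                     # reconstruct the *clean* input
```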

Sparse Autoencoder The Sparse Autoencoder originates from sparse coding theory in the brain, and attempts to describe the input with a sparse representation by letting most hidden values be zero. In this autoencoder, the dimensionality of the hidden state may be very large, and a sparsity loss, Σ_{j=1}^{D} |hj|, is added to the objective. In practice, choosing the regularizer and adjusting hyperparameters can be very hard.
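A minimal sketch of the corresponding training loss, where the regularization weight lam is a hypothetical hyperparameter:

```python
# Sketch of a sparsity-regularized autoencoder loss: reconstruction + L1 penalty on h.
import torch.nn.functional as F

def sparse_ae_loss(x_hat, x, h, lam=1e-3):
    recon = F.mse_loss(x_hat, x)
    sparsity = h.abs().sum(dim=1).mean()   # sum_j |h_j|, averaged over the batch
    return recon + lam * sparsity
```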



4 Latent Variable Models
Formally, a latent variable model p is a probability distribution pθ(x, z) over observed variables x and latent variables z (variables that are not directly observed but must be inferred). Because z is unobserved, the supervised learning methods covered so far in class are unsuitable.
Indeed, our learning problem of maximizing the log-likelihood of the data turns from,

θ ← arg max_θ (1/N) Σ_{i=1}^{N} log pθ(xi)

to the following,

θ ← arg max_θ (1/N) Σ_{i=1}^{N} log ∫ pθ(xi | z) p(z) dz

where we recognize p(x) = ∫ p(x|z) p(z) dz. Unfortunately, the integral is intractable, but we will discuss ways to find a tractable lower bound.

4.1 Variational Autoencoders (VAE)


The VAE uses the autoencoder framework to generate new images. For the following description of the encoder and decoder of the VAE, let us assume our input x is a 28 × 28 black-and-white photo of a handwritten digit, and we wish to encode this information into a latent representation z.

Encoder The encoder takes a high-dimensional input x (like the pixels of an image) and (most often) outputs the parameters of a Gaussian distribution over the hidden variable z. In other words, it outputs µz|x and Σz|x. We implement this as a deep neural network, parameterized by φ, which computes the distribution qφ(z|x). We can sample from this distribution to get noisy values of the representation z.

Decoder The decoder maps the latent representation back to a high-dimensional reconstruction, denoted x̂, by outputting the parameters of the probability distribution of the data. We implement this as another neural network, parametrized by θ, which computes the distribution pθ(x|z). Following the digit example, if we represent each pixel as a 0 (black) or 1 (white), the probability distribution of a single pixel can then be represented with a Bernoulli distribution. Indeed, the decoder takes as input the latent representation of a digit z and outputs 784 Bernoulli parameters, one for each of the 784 pixels in the image.

Training VAEs To train VAEs, we find parameters that maximize the likelihood of the data,
θ ← arg max_θ (1/N) Σ_{i=1}^{N} log ∫ pθ(xi | z) p(z) dz

This integral is intractable. However, we can show that it is possible to optimize a tractable lower bound on
the data likelihood, called the Evidence Lower Bound (ELBO),

Li = E_{z∼qφ(z|xi)}[log pθ(xi | z)] − DKL(qφ(z|xi) || p(z))
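A minimal sketch of this objective for the Bernoulli-decoder digit example, assuming the encoder outputs (µ, log σ²) of a diagonal Gaussian qφ(z|x) and a standard normal prior p(z), so the KL term has a closed form:

```python
# Sketch of the (negative) ELBO for a VAE with a Bernoulli decoder and N(0, I) prior.
import torch
import torch.nn.functional as F

def negative_elbo(x, x_logits, mu, log_var):
    # E_{q(z|x)}[log p(x|z)], approximated with one reparametrized sample of z
    recon = -F.binary_cross_entropy_with_logits(x_logits, x, reduction="sum")
    # D_KL(q_phi(z|x) || N(0, I)), closed form for a diagonal Gaussian
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return -(recon - kl)   # minimize the negative of L_i
```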

Problem 2: Blurry Images

Why do VAEs typically produce blurry images?



Solution 2: Blurry Images

VAEs learn an explicit distribution of the data by fitting it with a multivariate Gaussian, and the output is blurry because of the conditional independence assumption on the output dimensions (e.g., pixels) given the latent variables. This assumption does not hold for most realistic distributions, like natural images. Indeed, since the real posterior is often far from a multivariate Gaussian, a lot of variance/noise is added to the model, and this causes blurriness when we decode, since maximum likelihood training will distribute the probability mass diffusely over the data space.

4.2 Variational Inference


In this subsection, we will derive the variational approximation in discrete form, and discuss the re-parametrization
trick.
Problem 3: Latent Variable Model

Write out the log-likelihood objective of a discrete latent variable model.

Solution 3: Latent Variable Model

arg max_θ Σ_{i=1}^{N} log pθ(xi) = arg max_θ Σ_{i=1}^{N} log ( Σ_z pZ(z) pθ(xi | z) )

Problem 4: Variational Approximation

Show that
Σ_{i=1}^{N} log pθ(xi) ≥ Σ_{i=1}^{N} E_{q(z|xi)}[ log pZ(z) − log q(z|xi) + log pθ(xi | z) ]

Hint: Use Jensen’s Inequality, which states, log E[X] ≥ E[log X]

Solution 4: Variational Approximation

Σ_{i=1}^{N} log pθ(xi) = Σ_{i=1}^{N} log ( Σ_z pZ(z) pθ(xi | z) )

= Σ_{i=1}^{N} log ( Σ_z q(z|xi) · [ pZ(z) pθ(xi | z) / q(z|xi) ] )

= Σ_{i=1}^{N} log E_{q(z|xi)}[ pZ(z) pθ(xi | z) / q(z|xi) ]

≥ Σ_{i=1}^{N} E_{q(z|xi)}[ log pZ(z) − log q(z|xi) + log pθ(xi | z) ]

Problem 5: Variational Approximation Optimization

To optimize the Variational Lower Bound derived in the previous problem, which distribution do we
sample z from?



Solution 5: Variational Approximation Optimization

Sample from q(z|xi )

Combining it with Entropy Recall the entropy function,

H(p) = −E[log p(x)] = − ∫_X p(x) log p(x) dx

and also recall the KL-Divergence,

DKL(q || p) = E_q[ log( q(x) / p(x) ) ] = −E_q[log p(x)] − H(q)

We can show that our derived approximation can be reformulated as the Evidence Lower Bound (ELBO),

Li = E_{z∼qφ(z|xi)}[log pθ(xi | z)] − DKL(qφ(z|xi) || p(z))

Re-Parametrization Trick The re-parametrization trick allows us to break qφ(z|x) into a deterministic and a stochastic portion, and re-parametrize from qφ(z|x) to gφ(x, ε). In fact, we can let z = gφ(x, ε) = g0(x) + ε · g1(x), where ε ∼ p(ε). This re-parametrization trick is simple to implement and has low variance.¹
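For the Gaussian encoder described above, the trick amounts to the following sketch (assuming the encoder outputs µ and log σ²):

```python
# Sketch of the re-parametrization trick: z = g_phi(x, eps) = mu(x) + sigma(x) * eps.
import torch

def reparametrize(mu, log_var):
    std = torch.exp(0.5 * log_var)
    eps = torch.randn_like(std)   # stochastic part, independent of phi
    return mu + std * eps         # deterministic, differentiable transform of (x, eps)
```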
Problem 6: Reparametrization Example

Let us get an intuition for how we might use re-parametrization in practice. Assume we have a normal distribution q, parametrized by θ, such that qθ = N(θ, 1), and we would like to solve,

min_θ E_{x∼qθ}[x²]

Use the re-parametrization trick on x to derive the gradient.

Solution 6: Reparametrization Example

We can first make the stochastic element in q independent of θ, and rewrite x as

x = θ + ε,   ε ∼ N(0, 1)

Then,

E_q[x²] = E_p[(θ + ε)²]

where p = N(0, 1). We can then write the derivative of E_q[x²] as,

∇θ E_q[x²] = ∇θ E_p[(θ + ε)²] = E_p[2(θ + ε)]
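As a quick sanity check (not part of the original discussion), the reparametrized gradient can be estimated by Monte Carlo with autograd; for an illustrative θ = 1.5, the estimate should be close to 2θ = 3.

```python
# Monte Carlo check of the reparametrized gradient: d/dtheta E[(theta + eps)^2] = 2*theta.
import torch

theta = torch.tensor(1.5, requires_grad=True)
eps = torch.randn(100_000)
loss = ((theta + eps) ** 2).mean()   # estimate of E_q[x^2] under the reparametrization
loss.backward()
print(theta.grad)                    # approximately 2 * theta = 3.0
```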

4.3 Normalizing Flows


In Flows, we wish to map simple distributions (easy to sample from and whose densities are easy to evaluate) to complex ones (learned from data); flows describe the transformation of a probability density through a sequence of invertible mappings. Let us consider a directed, latent-variable model over observed variables X and latent variables Z.
¹ See the appendix of this paper for more information.



In practice, we learn an invertible mapping fθ : Rn → Rn from z to x.

x = fθ (z)
z = fθ−1 (x)

We then maximize the likelihood of the data under p(x).


Hence, we can interpret a Normalizing Flow as follows: (1) we would like the change of variables to give a normalized density (e.g., N(0, 1)) after applying an invertible transformation, and (2) invertible transformations can be composed with each other to create more complex invertible transformations.
Problem 7: Flow Objective

In Flows, our training objective is to maximize the log-likelihood of x, where we have,

x = fθ (z)
z = fθ−1 (x)

Write out the training objective explicitly, then use change of variables to derive the Normalizing
Flow objective. Also derive the properties that fθ must satisfy for practical flows.

Solution 7: Flow Objective

Our training objective is,

max_θ (1/N) Σ_{i=1}^{N} log p(xi)

The change of variables from x to z shows that,

p(x) = p(z) |det( df(z)/dz )|^{−1}

Then, our training objective is,

max_θ (1/N) Σ_{i=1}^{N} log ( p(zi) |det( df(zi)/dzi )|^{−1} )

Rewriting in terms of x, given that zi = fθ−1(xi),

max_θ (1/N) Σ_{i=1}^{N} log ( p(fθ−1(xi)) |det( df(zi)/dzi )|^{−1} )

which, by the product rule of logarithms, is equivalent to,

max_θ (1/N) Σ_{i=1}^{N} [ log p(fθ−1(xi)) + log |det( df(zi)/dzi )|^{−1} ]

and so,

max_θ (1/N) Σ_{i=1}^{N} [ log p(fθ−1(xi)) − log |det( df(zi)/dzi )| ]

In particular, we need that fθ should be (1) differentiable, (2) invertible and (3) have a tractable
Jacobian determinant.
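A minimal sketch of this objective, assuming a hypothetical f_inv that returns both z = fθ−1(x) and log |det(df(z)/dz)| per example, with a standard normal base density p(z):

```python
# Sketch of the Normalizing Flow training loss (negative log-likelihood).
import math
import torch

def flow_nll(f_inv, x):
    z, log_det = f_inv(x)   # z: (batch, n), log_det: (batch,)
    log_pz = -0.5 * (z ** 2).sum(dim=1) - 0.5 * z.shape[1] * math.log(2 * math.pi)
    return -(log_pz - log_det).mean()   # -[log p(f^{-1}(x)) - log|det df(z)/dz|]
```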

There are two main flow models we discuss in class,



1. Nonlinear Independent Components Estimation (NICE)
2. Real Non-Volume Preserving Transformation (Real-NVP)

NICE The NICE model composes two kinds of invertible transformations: additive coupling layers and rescaling layers. The coupling layer in NICE partitions a variable z into two disjoint subsets, z1:d and zd+1:n. It then applies the following forward mapping,

1. x1:d = z1:d (identity mapping)

2. xd+1:n = zd+1:n + gθ(z1:d), where gθ is a neural net

and the following inverse mapping,

1. z1:d = x1:d (identity mapping)

2. zd+1:n = xd+1:n − gθ(x1:d), where gθ is the same neural net

Here, notice that the Jacobian of the forward map is lower triangular, so its determinant is simply the product of the elements on the diagonal, which is 1. This means the mapping is volume preserving: the transformed distribution px will have the same “volume” as the original one pz.
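A minimal sketch of such an additive coupling layer in PyTorch (the MLP used for gθ is an illustrative assumption):

```python
# Sketch of a NICE additive coupling layer; its forward Jacobian has determinant 1.
import torch
import torch.nn as nn

class AdditiveCoupling(nn.Module):
    def __init__(self, dim, d):
        super().__init__()
        self.d = d
        self.g = nn.Sequential(nn.Linear(d, 128), nn.ReLU(), nn.Linear(128, dim - d))

    def forward(self, z):                                   # z -> x
        z1, z2 = z[:, :self.d], z[:, self.d:]
        return torch.cat([z1, z2 + self.g(z1)], dim=1)

    def inverse(self, x):                                   # x -> z
        x1, x2 = x[:, :self.d], x[:, self.d:]
        return torch.cat([x1, x2 - self.g(x1)], dim=1)
```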

Real-NVP Real-NVP adds scaling factors to the transformation,

xd+1:n = exp(hθ(z1:d)) ⊙ zd+1:n + hθ(z1:d)

where ⊙ represents the element-wise product. This results in a non-volume preserving transformation.
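A minimal sketch of this coupling layer, following the formula above in which a single network hθ provides both the log-scale and the shift (published Real-NVP typically uses separate scale and translation networks; the single-network form here simply matches the discussion's notation):

```python
# Sketch of the coupling x_{d+1:n} = exp(h(z_{1:d})) ⊙ z_{d+1:n} + h(z_{1:d}).
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    def __init__(self, dim, d):
        super().__init__()
        self.d = d
        self.h = nn.Sequential(nn.Linear(d, 128), nn.ReLU(), nn.Linear(128, dim - d))

    def forward(self, z):
        z1, z2 = z[:, :self.d], z[:, self.d:]
        s = self.h(z1)                          # h_theta(z_{1:d}): log-scale and shift
        x2 = torch.exp(s) * z2 + s
        log_det = s.sum(dim=1)                  # sum_i (h_theta(z_{1:d}))_i, see Problem 8
        return torch.cat([z1, x2], dim=1), log_det
```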


Problem 8: Real-NVP Determinant

Determine the determinant of the Jacobian of the forward map of the Real-NVP. In other words,
find,

|det( df(z)/dz )|



Solution 8: Real-NVP Determinant

We note that,

df(z)/dz = [ dx1:d/dz1:d      dx1:d/dzd+1:n
             dxd+1:n/dz1:d    dxd+1:n/dzd+1:n ]

We know dx1:d/dz1:d is the identity, because we let x1:d = z1:d by the identity mapping. Likewise, dx1:d/dzd+1:n is 0, because x1:d does not depend on zd+1:n.
We must then calculate dxd+1:n/dzd+1:n. Recall,

xd+1:n = exp(hθ(z1:d)) ⊙ zd+1:n + hθ(z1:d)

Since exp(hθ(z1:d)) and hθ(z1:d) do not depend on zd+1:n, we have,

dxd+1:n/dzd+1:n = diag( exp(hθ(z1:d)) )

Hence,

df(z)/dz = [ I                  0
             dxd+1:n/dz1:d      diag( exp(hθ(z1:d)) ) ]

We now find the log determinant. Since the Jacobian is block lower triangular, its determinant is the product of its diagonal entries,

log |det( df(z)/dz )| = log ∏_{i=1}^{n−d} | exp(hθ(z1:d))_i | = Σ_{i=1}^{n−d} (hθ(z1:d))_i

and so,

|det( df(z)/dz )| = ∏_{i=1}^{n−d} exp( (hθ(z1:d))_i )
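As a sanity check (not part of the original solution), the result can be verified numerically with autograd on the AffineCoupling sketch from the Real-NVP section, using illustrative sizes n = 4, d = 2:

```python
# Numerical check: log|det(df(z)/dz)| from autograd matches sum_i (h_theta(z_{1:d}))_i.
import torch

flow = AffineCoupling(dim=4, d=2)                # from the earlier sketch
z0 = torch.randn(4)

def f(z):                                        # single-example wrapper around the layer
    x, _ = flow(z.unsqueeze(0))
    return x.squeeze(0)

J = torch.autograd.functional.jacobian(f, z0)    # 4 x 4 Jacobian df(z)/dz
_, log_det = flow(z0.unsqueeze(0))
print(torch.logdet(J), log_det)                  # the two values should agree
```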

