Dis10 Sol
This discussion focuses on autoregressive models, autoencoders and latent variable models.
1 Generative Models
Generative models represent the full joint distribution p(x, y) where x is an input sample and y is an output.
In deep learning, we are interested in generating new samples that generalize a given dataset. In particular,
this process may not require labeled data, as the goal is to understand some representation of the dataset
distribution.
Broadly speaking, deep generative models include autoregressive models, autoencoder models, and generative adversarial networks. This discussion focuses on autoregressive models and autoencoder models.
Autoregressive models factor the joint distribution with the chain rule, $p(x) = \prod_i p(x_i \mid x_1, \dots, x_{i-1})$, so generation reduces to sampling one dimension at a time. Since each conditional distribution $p(x_i \mid x_1, \dots, x_{i-1})$ is complex, we model it using a neural network, $f_\theta$. Using the same example of generating pixels, we can begin at one corner, proceed diagonally through the 2D spatial map, and use $f_\theta$ to sample pixels one at a time by conditioning on previously generated pixels. Unfortunately, generating and sampling each pixel conditioned on prior generated pixels is expensive. Instead, in practice, we use PixelRNN, PixelCNN, or a Pixel Transformer.
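As a rough sketch of this sequential sampling loop, assuming (hypothetically) that `f_theta` maps the partially generated image to 256-way logits per pixel, and using raster order for simplicity:

```python
import torch

def sample_image(f_theta, height=28, width=28, n_values=256):
    """Sample pixels one at a time, conditioning on previously generated pixels."""
    img = torch.zeros(1, 1, height, width)
    for i in range(height):
        for j in range(width):
            logits = f_theta(img)                       # assumed shape: (1, n_values, H, W)
            probs = logits[0, :, i, j].softmax(dim=0)   # distribution over pixel intensities
            value = torch.multinomial(probs, 1).item()  # sample the next pixel
            img[0, 0, i, j] = value / (n_values - 1)    # write it back, then continue
    return img
```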
PixelCNN Uses a CNN to model the probability of a pixel given previous pixels. PixelCNN is slow at generating images, because there is a pass through the entire network for each pixel. But it is fast to train, because there is no recurrence (only a single pass over the image), since the spatial maps are known in advance. During training, the convolutions must be masked to ignore pixels at the same or later position in the pixel generation order.
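One way to implement the masking is to zero out convolution weights at the current and later positions in raster order; a minimal sketch (mask type "A" excludes the center pixel, as in the first layer of a PixelCNN-style model):

```python
import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    """Conv2d whose kernel ignores the current and all later pixels in raster order."""
    def __init__(self, *args, mask_type="A", **kwargs):
        super().__init__(*args, **kwargs)
        kH, kW = self.kernel_size
        mask = torch.ones(kH, kW)
        mask[kH // 2, kW // 2 + (mask_type == "B"):] = 0   # center (type A) and later columns
        mask[kH // 2 + 1:, :] = 0                          # all later rows
        self.register_buffer("mask", mask[None, None])     # broadcast over (out, in) channels

    def forward(self, x):
        return nn.functional.conv2d(x, self.weight * self.mask, self.bias,
                                    self.stride, self.padding)
```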
PixelRNN PixelRNN uses an RNN (or LSTM) to generate images. PixelRNN remembers the state from more distant pixels using its recurrent states. In particular, PixelRNN uses recurrence instead of the 3 × 3 convolutions to allow long-range dependencies, and can generate a full row of pixels in one pass.
Figure 2: PixelRNN. The figure illustrates the conditional probability: $x_i$ depends on the previous pixels (blue region) and is independent of the pixels after it (white region).
Training generally takes longer, though, due to the recurrence involved, since each row has its own hidden state in the LSTM layers.
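As a rough sketch of the row-wise recurrence (a simplification of the actual Row LSTM, which also uses convolutions within rows; the hidden size is an arbitrary choice):

```python
import torch
import torch.nn as nn

class RowRecurrence(nn.Module):
    """Treat each image row as one LSTM step, so state flows down the image."""
    def __init__(self, width=28, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(input_size=width, hidden_size=hidden, batch_first=True)
        self.out = nn.Linear(hidden, width)          # per-row pixel predictions

    def forward(self, img):                          # img: (batch, height, width)
        # shift rows down by one so row i is predicted only from rows < i
        context = torch.cat([torch.zeros_like(img[:, :1]), img[:, :-1]], dim=1)
        h, _ = self.lstm(context)                    # recurrent state carries earlier rows
        return self.out(h)                           # (batch, height, width) logits
```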
PixelTransformer Pixel Transformers are similar to PixelCNN and PixelRNN, but they use a multi-
headed attention network.
Problem 1: PixelCNN vs. PixelRNN
What is the main difference between PixelCNN and PixelRNN? In particular, comment on training speed, generation (test-time) speed, and how each model conditions on previously generated pixels.
Aside from the main architectural difference, that PixelCNN models the conditionals with masked CNNs over context regions while PixelRNN models them with LSTMs, in general we have:
• PixelCNN is faster during training time because it can output the entire image in parallel with
a single pass. We know the spatial maps / context regions, and as a result, convolutions can
be parallelized over context regions. On the other hand, PixelRNNs have recurrent states that
must be calculated row-by-row.
• PixelRNN is faster at test time, since more pixels are generated per pass. PixelCNNs generate 1 pixel per pass, whereas PixelRNNs generate an entire row of pixels at a time.
• PixelCNN uses a CNN to generate the next pixel from nearby pixels through masking, whereas PixelRNN uses row-LSTMs to generate one row at a time recurrently, using the previous rows of pixels. It is also possible to use a Diagonal BiLSTM.
Autoencoders learn a compressed hidden representation of the input; several strategies force this hidden state to be useful:
1. Dimensionality: Force the hidden state to be smaller than the input/output, so the network must compress information (a minimal sketch follows this list).
2. Sparsity: Force the hidden state to be sparse, so the network must compress information into a few active units.
3. Denoising: Corrupt the input with noise, and force the autoencoder to learn to distinguish the noise from the signal.
4. Probabilistic Modeling: Force the hidden state to follow a prior distribution.
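As a rough illustration of strategy 1, a bottleneck autoencoder can be as simple as the following (layer sizes are illustrative assumptions):

```python
import torch.nn as nn

class BottleneckAE(nn.Module):
    """Autoencoder whose hidden state is much smaller than its input."""
    def __init__(self, input_dim=784, hidden_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        self.decoder = nn.Linear(hidden_dim, input_dim)

    def forward(self, x):
        h = self.encoder(x)        # compressed hidden state
        return self.decoder(h)     # reconstruction, trained with e.g. an MSE loss
```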
In practice, autoencoders are far less used today, since there exist better alternatives for both representation learning (VAEs, contrastive learning) and generation (GANs, VAEs, autoregressive models).
Denoising Autoencoder Denoising autoencoders corrupt the input with noise and then run a bottleneck autoencoder, training it to reconstruct the clean input. However, there are many variants on this idea. In practice, it is unclear which layer to choose for the bottleneck, and there are some ad-hoc choices (e.g., how much noise to add).
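A sketch of one denoising training step; the Gaussian noise scale here is exactly the kind of ad-hoc choice mentioned above:

```python
import torch
import torch.nn.functional as F

def denoising_step(model, x, optimizer, noise_std=0.3):
    """Corrupt the input, then train the autoencoder to reconstruct the clean input."""
    x_noisy = x + noise_std * torch.randn_like(x)
    loss = F.mse_loss(model(x_noisy), x)   # target is the clean input, not x_noisy
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```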
Sparse Autoencoder Sparse autoencoders originate from sparse coding theory in the brain, and attempt to describe the input with a sparse representation by pushing most hidden values to zero. In this autoencoder, the dimensionality of the hidden state may be very large, and we add a sparsity loss, $\sum_{j=1}^{D} |h_j|$. In practice, choosing the regularizer and adjusting its hyperparameters can be very hard.
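For concreteness, the sparsity penalty can be added to the reconstruction loss as follows; the weight `lam` is a hypothetical hyperparameter:

```python
import torch.nn.functional as F

def sparse_ae_loss(x, x_hat, h, lam=1e-3):
    """Reconstruction loss plus an L1 penalty on the hidden activations h."""
    recon = F.mse_loss(x_hat, x)
    sparsity = h.abs().sum(dim=1).mean()   # sum_j |h_j|, averaged over the batch
    return recon + lam * sparsity
```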
For a latent variable model, maximizing the likelihood of the data leads to the following objective,
$$\theta \leftarrow \arg\max_\theta \frac{1}{N} \sum_{i=1}^N \log \int p_\theta(x_i \mid z)\, p(z)\, dz,$$
where we recognize $p(x) = \int p(x \mid z)\, p(z)\, dz$. Unfortunately, the integral is intractable, but we will discuss ways to find a tractable lower bound.
Encoder The encoder maps a high-dimensional input x (like the pixels of an image) to, most often, the parameters of a Gaussian distribution over the hidden variable z. In other words, it outputs $\mu_{z\mid x}$ and $\Sigma_{z\mid x}$. We implement this as a deep neural network, parameterized by $\phi$, which computes the distribution $q_\phi(z \mid x)$. We can sample from this distribution to get noisy values of the representation z.
Decoder The decoder maps the latent representation back to a high-dimensional reconstruction, denoted $\hat{x}$, by outputting the parameters of the probability distribution of the data. We implement this as another neural network, parameterized by $\theta$, which computes the probability $p_\theta(x \mid z)$. Following the digit example, if we represent each pixel as a 0 (black) or 1 (white), the probability distribution of a single pixel can then be represented using a Bernoulli distribution. Indeed, the decoder takes as input the latent representation of a digit z and outputs 784 Bernoulli parameters, one for each of the 784 pixels in the image.
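A minimal sketch of such an encoder/decoder pair, assuming 28 × 28 binary images (784 Bernoulli parameters) and a hypothetical latent dimension of 20:

```python
import torch.nn as nn

class Encoder(nn.Module):
    """Maps x to the parameters (mu, log-variance) of a diagonal Gaussian q_phi(z|x)."""
    def __init__(self, latent_dim=20):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(784, 400), nn.ReLU())
        self.mu = nn.Linear(400, latent_dim)        # mu_{z|x}
        self.log_var = nn.Linear(400, latent_dim)   # log of the diagonal of Sigma_{z|x}

    def forward(self, x):
        h = self.body(x)
        return self.mu(h), self.log_var(h)

class Decoder(nn.Module):
    """Maps z to 784 Bernoulli parameters, one per pixel."""
    def __init__(self, latent_dim=20):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(latent_dim, 400), nn.ReLU(),
                                 nn.Linear(400, 784), nn.Sigmoid())

    def forward(self, z):
        return self.net(z)
```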
Training VAEs To train VAEs, we find parameters that maximize the likelihood of the data,
$$\theta \leftarrow \arg\max_\theta \frac{1}{N} \sum_{i=1}^N \log \int p_\theta(x_i \mid z)\, p(z)\, dz.$$
This integral is intractable. However, we can show that it is possible to optimize a tractable lower bound on the data likelihood, called the Evidence Lower Bound (ELBO),
$$\log p_\theta(x) \;\geq\; \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right] - D_{\mathrm{KL}}\!\left(q_\phi(z \mid x)\,\|\,p(z)\right).$$
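As a sketch, the negated ELBO can be computed as a training loss, assuming a Bernoulli decoder and a diagonal-Gaussian encoder with a standard-normal prior (the closed-form KL term below holds for that choice):

```python
import torch
import torch.nn.functional as F

def negative_elbo(x, x_hat, mu, log_var):
    """-ELBO = reconstruction term (one-sample estimate) + KL(q_phi(z|x) || N(0, I))."""
    recon = F.binary_cross_entropy(x_hat, x, reduction="sum")       # -E_q[log p_theta(x|z)]
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())  # closed-form KL
    return recon + kl
```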
VAEs learn an explicit distribution of the data by fitting it with a multivariate Gaussian, and the output is blurry because of the conditional independence assumption on the samples given the latent variables. This assumption does not hold for most realistic distributions, like natural images. Indeed, since the real posterior is often far from a multivariate Gaussian, a lot of variance/noise is added to the model, and this causes blurriness when we decode, since maximum likelihood training will distribute the probability mass diffusely over the data space.
Recall that maximum likelihood training of a latent variable model with a discrete latent variable z solves
$$\arg\max_\theta \sum_{i=1}^N \log p_\theta(x_i) = \arg\max_\theta \sum_{i=1}^N \log\left(\sum_z p_Z(z)\, p_\theta(x_i \mid z)\right).$$
Show that
$$\sum_{i=1}^N \log p_\theta(x_i) \;\geq\; \sum_{i=1}^N \mathbb{E}_{q(z \mid x_i)}\!\left[\log p_Z(z) - \log q(z \mid x_i) + \log p_\theta(x_i \mid z)\right].$$
We have
$$\begin{aligned}
\sum_{i=1}^N \log p_\theta(x_i) &= \sum_{i=1}^N \log\left(\sum_z p_Z(z)\, p_\theta(x_i \mid z)\right) \\
&= \sum_{i=1}^N \log\left(\sum_z \frac{q(z \mid x_i)}{q(z \mid x_i)}\, p_Z(z)\, p_\theta(x_i \mid z)\right) \\
&= \sum_{i=1}^N \log \mathbb{E}_{q(z \mid x_i)}\!\left[\frac{1}{q(z \mid x_i)}\, p_Z(z)\, p_\theta(x_i \mid z)\right] \\
&\geq \sum_{i=1}^N \mathbb{E}_{q(z \mid x_i)}\!\left[\log p_Z(z) - \log q(z \mid x_i) + \log p_\theta(x_i \mid z)\right],
\end{aligned}$$
where the last step applies Jensen's inequality, since $\log$ is concave.
To optimize the Variational Lower Bound derived in the previous problem, which distribution do we sample z from?
We sample z from the approximate posterior $q(z \mid x_i)$ produced by the encoder, since the expectation in the bound is taken with respect to $q(z \mid x_i)$.
We can show that our derived bound can be reformulated as the Evidence Lower Bound (ELBO),
$$\mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right] - D_{\mathrm{KL}}\!\left(q_\phi(z \mid x)\,\|\,p(z)\right).$$
Re-Parametrization Trick The re-parametrization trick allows us to break $q_\phi(z \mid x)$ into a deterministic and a stochastic portion, and re-parametrize from $q_\phi(z \mid x)$ to $g_\phi(x, \epsilon)$. In fact, we can let $z = g_\phi(x, \epsilon) = g_0(x) + \epsilon \cdot g_1(x)$, where $\epsilon \sim p(\epsilon)$. This re-parametrization trick is simple to implement, and has low variance.
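For a diagonal-Gaussian $q_\phi(z \mid x)$, a minimal sketch looks like the following (the function name and the `log_var` parameterization are illustrative):

```python
import torch

def reparameterize(mu, log_var):
    """z = g_0(x) + eps * g_1(x), with eps ~ N(0, I) carrying all the randomness."""
    std = torch.exp(0.5 * log_var)   # g_1(x): deterministic scale
    eps = torch.randn_like(std)      # eps ~ p(eps)
    return mu + eps * std            # gradients flow through mu and std to phi
```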
Problem 6: Reparametrization Example
Let us get an intuition for how we might use re-parametrization in practice. Assume we have a normal distribution q, parametrized by $\theta$, such that $q_\theta = \mathcal{N}(\theta, 1)$, and we would like to solve
$$\min_\theta \mathbb{E}_q[x^2].$$
Using the re-parametrization trick, we write
$$x = \theta + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, 1).$$
Then,
$$\mathbb{E}_q[x^2] = \mathbb{E}_p[(\theta + \epsilon)^2],$$
where $p = \mathcal{N}(0, 1)$. Then, we can write the derivative of $\mathbb{E}_q[x^2]$ as
$$\nabla_\theta\, \mathbb{E}_q[x^2] = \mathbb{E}_p\!\left[2(\theta + \epsilon)\right] = 2\theta,$$
which we can estimate with Monte Carlo samples of $\epsilon$.
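A quick numerical check of this example with PyTorch autograd (the sample count is arbitrary); the reparametrized gradient estimate should be close to $2\theta$:

```python
import torch

theta = torch.tensor(1.5, requires_grad=True)
eps = torch.randn(100_000)          # eps ~ N(0, 1)
x = theta + eps                     # x ~ N(theta, 1) via re-parametrization
loss = (x ** 2).mean()              # Monte Carlo estimate of E_q[x^2]
loss.backward()
print(theta.grad.item())            # approximately 2 * theta = 3.0
```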
In Flows, our training objective is to maximize the log-likelihood of x, where we have
$$x = f_\theta(z), \qquad z = f_\theta^{-1}(x).$$
Write out the training objective explicitly, then use change of variables to derive the Normalizing
Flow objective. Also derive the properties that fθ must satisfy for practical flows.
By the change of variables formula, $p_x(x) = p_z\!\left(f_\theta^{-1}(x)\right) \left|\det \frac{df_\theta(z)}{dz}\right|^{-1}$ evaluated at $z = f_\theta^{-1}(x)$, and so,
$$\max_\theta \frac{1}{N} \sum_{i=1}^N \left[\log p\!\left(f^{-1}(x_i)\right) - \log \left|\det \frac{df(z_i)}{dz_i}\right|\right], \qquad z_i = f^{-1}(x_i).$$
In particular, we need $f_\theta$ to be (1) differentiable, (2) invertible, and (3) have a tractable Jacobian determinant.
NICE The NICE model composes two types of invertible transformations: additive coupling layers and rescaling layers. The coupling layer in NICE partitions a variable z into two disjoint subsets, $z_{1:d}$ and $z_{d+1:n}$. Then it applies the following forward mapping,
$$x_{1:d} = z_{1:d}, \qquad x_{d+1:n} = z_{d+1:n} + m_\theta(z_{1:d}),$$
where $m_\theta$ is an arbitrary neural network.
Here, notice that the Jacobian of the forward map is lower triangular, so its determinant is simply the product of the elements on the diagonal, which is 1. Note that the mapping is therefore volume preserving, meaning that the transformed distribution $p_x$ will have the same "volume" as the original distribution $p_z$.
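A minimal sketch of such an additive coupling layer; the network `net` stands in for $m_\theta$ and its hidden width is an arbitrary choice:

```python
import torch
import torch.nn as nn

class AdditiveCoupling(nn.Module):
    """NICE-style additive coupling: identity on z_{1:d}, shift z_{d+1:n} by m_theta(z_{1:d})."""
    def __init__(self, n, d):
        super().__init__()
        self.d = d
        self.net = nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, n - d))

    def forward(self, z):
        z1, z2 = z[:, :self.d], z[:, self.d:]
        return torch.cat([z1, z2 + self.net(z1)], dim=1)   # log|det J| = 0 (volume preserving)

    def inverse(self, x):
        x1, x2 = x[:, :self.d], x[:, self.d:]
        return torch.cat([x1, x2 - self.net(x1)], dim=1)
```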
Determine the determinant of the Jacobian of the forward map of the Real-NVP. In other words,
find,
$$\left|\det \frac{df(z)}{dz}\right|.$$
We note that,
$$\frac{df(z)}{dz} = \begin{bmatrix} \partial x_{1:d}/\partial z_{1:d} & \partial x_{1:d}/\partial z_{d+1:n} \\[2pt] \partial x_{d+1:n}/\partial z_{1:d} & \partial x_{d+1:n}/\partial z_{d+1:n} \end{bmatrix}.$$
We know $\partial x_{1:d}/\partial z_{1:d}$ is the identity, because we let $x_{1:d} = z_{1:d}$ by the identity mapping. Likewise, $\partial x_{1:d}/\partial z_{d+1:n}$ is $0$, because $x_{1:d}$ does not depend on $z_{d+1:n}$. We must then calculate $\partial x_{d+1:n}/\partial z_{d+1:n}$. Recall,
$$x_{d+1:n} = \exp\!\left(h_\theta(z_{1:d})\right) \odot z_{d+1:n} + h_\theta(z_{1:d}).$$
Let $\operatorname{diag}(Z)$ represent the diagonal matrix formed from $z_{d+1:n}$. Then, by the product rule,
$$\begin{aligned}
\frac{\partial x_{d+1:n}}{\partial z_{d+1:n}} &= \frac{\partial}{\partial z_{d+1:n}}\left[\exp\!\left(h_\theta(z_{1:d})\right) \odot z_{d+1:n}\right] \\
&= \operatorname{diag}\!\left(\exp\!\left(h_\theta(z_{1:d})\right)\right) + \operatorname{diag}(Z)\, \frac{\partial}{\partial z_{d+1:n}} \exp\!\left(h_\theta(z_{1:d})\right) \\
&= \operatorname{diag}\!\left(\exp\!\left(h_\theta(z_{1:d})\right)\right),
\end{aligned}$$
since $\exp\!\left(h_\theta(z_{1:d})\right)$ does not depend on $z_{d+1:n}$.
Hence,
$$\frac{df(z)}{dz} = \begin{bmatrix} I & 0 \\[2pt] \partial x_{d+1:n}/\partial z_{1:d} & \operatorname{diag}\!\left(\exp\!\left(h_\theta(z_{1:d})\right)\right) \end{bmatrix},$$
which is lower triangular, so its determinant is the product of its diagonal entries,
$$\left|\det \frac{df(z)}{dz}\right| = \prod_{j=1}^{n-d} \exp\!\left(h_\theta(z_{1:d})\right)_j = \exp\!\left(\sum_{j=1}^{n-d} h_\theta(z_{1:d})_j\right).$$
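As a sanity check, the analytic determinant can be compared against an autograd Jacobian on a toy example (the small MLP `h_theta` and the dimensions here are illustrative assumptions):

```python
import torch

n, d = 6, 3
# h_theta: a toy network standing in for the coupling network (illustrative sizes)
h_theta = torch.nn.Sequential(torch.nn.Linear(d, n - d), torch.nn.Tanh(),
                              torch.nn.Linear(n - d, n - d))

def forward_map(z):
    z1, z2 = z[:d], z[d:]
    h = h_theta(z1)
    x2 = torch.exp(h) * z2 + h        # x_{d+1:n} = exp(h_theta(z_{1:d})) * z_{d+1:n} + h_theta(z_{1:d})
    return torch.cat([z1, x2])        # x_{1:d} = z_{1:d}

z = torch.randn(n)
J = torch.autograd.functional.jacobian(forward_map, z)
analytic = torch.exp(h_theta(z[:d]).sum())       # prod_j exp(h_j) = exp(sum_j h_j)
print(torch.det(J).item(), analytic.item())      # should agree up to numerical precision
```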