
Lecture 15: Autoregressive and Reversible Models

Roger Grosse

In this lecture, we'll cover two kinds of deep generative model architectures
which can be trained using maximum likelihood. The first kind is reversible
architectures, where the network’s computations can be inverted in order to
recover the input which maps to a given output. We’ll see that this makes
the likelihood computation tractable.
The second kind of architecture is autoregressive models. This isn’t new:
we’ve already covered neural language models and RNN language models,
both of which are examples of autoregressive models. In this lecture, we’ll
introduce two tricks for making them much more scalable, so that we can
apply them to high-dimensional data modalities like high-resolution images
and audio waveforms.

1 Reversible Models
Mathematically, reversible models are based on the change-of-variables
formula for probability density functions. Suppose we have a bijective,
differentiable mapping f : Z → X . (“Bijective” means the mapping must be
1–1 and cover all of X .) Since f is bijective, we can think of it as representing
a change-of-variables transformation. For instance, x = f (z) = 12z could
represent a conversion of units from feet to inches. If we have a density
pZ(z), the change-of-variables formula gives us the density pX(x):

pX(x) = pZ(z) |det(∂x/∂z)|^{-1},          (1)

where z = f^{-1}(x). Let's unpack this. First, ∂x/∂z is the Jacobian of f, which is the linearization of f around z. Then we take the absolute value of the matrix determinant. Recall that the absolute value of the determinant of a matrix gives the factor by which the associated linear transformation expands or contracts the volume of a set. (If we consider the linear transformation x ↦ Ax for some matrix A and apply it to a set with volume V, we'll get a set with volume V |det A|.) So the determinant of the Jacobian determines how much f is expanding or contracting the volume locally around z. We then take the inverse of the determinant, which means that if f expands the volume, then the density pX(x) shrinks, and vice versa. Heuristically, if f stretches a small region around z, the same probability mass gets spread over a larger region of x-space, so the density there must drop by the corresponding factor.
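To make Eqn. 1 concrete, here is a small NumPy check (a sketch, not from the lecture) for the one-dimensional unit-conversion example x = f(z) = 12z with a standard Gaussian pZ; the Jacobian is just the constant 12:

    import numpy as np

    def gaussian_pdf(z):
        # standard Gaussian density p_Z(z)
        return np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)

    x = 30.0                                  # a measurement in inches
    z = x / 12.0                              # z = f^{-1}(x): convert back to feet
    jac = 12.0                                # dx/dz for the linear map f(z) = 12 z

    p_x_formula = gaussian_pdf(z) / abs(jac)  # Eqn. 1: p_X(x) = p_Z(z) |dx/dz|^{-1}

    # Monte Carlo check: empirical density of x = 12 z near the query point
    samples = 12.0 * np.random.randn(2_000_000)
    eps = 0.05
    p_x_empirical = np.mean(np.abs(samples - x) < eps) / (2 * eps)
    print(p_x_formula, p_x_empirical)         # the two numbers should roughly agree

Stretching by a factor of 12 spreads the same probability mass over 12 times the length, which is exactly the |det(∂x/∂z)|^{-1} = 1/12 factor in the formula.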

Now suppose the mapping f is the function computed by a generator
network (i.e. its outputs as a function of its inputs). It’s tempting to apply
the change-of-variables formula in order to compute pX (x). But in order
for this to work, three things need to be true:

1. The mapping f needs to be differentiable, so that the Jacobian ∂x/∂z is defined.

2. We need to be able to compute z = f^{-1}(x), which means f needs to be invertible, with an easy-to-compute inverse.

3. We need to be able to compute the (log) determinant of the Jacobian.

With regards to (1), networks with ReLU nonlinearities technically aren't differentiable because ReLU is nondifferentiable at 0. In practice, we can ignore this issue because the inputs to the activation function are very unlikely to be exactly zero, so with high probability, the Jacobian will be defined. Or, if we're still worried, we could just pick a differentiable activation function. But the other two points are much harder to deal with.
Fortunately, there’s a simple and elegant kind of network architecture
called a reversible architecture which is efficiently invertible and for
which we can compute the log determinant efficiently. (In fact, the de-
terminant turns out to be 1.) This architecture is based on the reversible
block, which is very similar to the residual block from Lecture 17. Recall
that residual blocks implement the following equation:

y = x + F(x), (2)

where F is some function, such as a shallow network. Reversible blocks are similar, except that we divide the units into two groups; the residual
function for the first group depends only on the other group, and the second
group is left unchanged. Mathematically,

y1 = x1 + F(x2)
y2 = x2          (3)

This is shown schematically in Figure 1. The reversible block is easily inverted, i.e. if we're given y1 and y2, we can recover x1 and x2:

x2 = y2
x1 = y1 − F(x2)          (4)

Here's what happens when we compose two reversible blocks, with the roles of x1 and x2 swapped:

y1 = x1 + F(x2)
y2 = x2 + G(y1)          (5)

This is shown schematically in Figure 1. To invert the composition of two blocks:
x2 = y2 − G(y1)
x1 = y1 − F(x2)          (6)

Figure 1: (a) A residual block. (b) A reversible block. (c) A composition of two reversible blocks.
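Here is a minimal NumPy sketch (illustrative, not the lecture's code) of Eqns. 5 and 6: the composition of two reversible blocks and its exact inverse, where F and G stand for arbitrary functions (here, tiny random one-layer nets):

    import numpy as np

    rng = np.random.default_rng(0)
    D = 4                                    # dimension of each half of the input
    W_F = rng.normal(size=(D, D))
    W_G = rng.normal(size=(D, D))
    F = lambda h: np.tanh(h @ W_F)           # residual function of the first block
    G = lambda h: np.tanh(h @ W_G)           # residual function of the second block

    def forward(x1, x2):
        y1 = x1 + F(x2)                      # Eqn. 5: first block updates the first half
        y2 = x2 + G(y1)                      #         second block updates the second half
        return y1, y2

    def inverse(y1, y2):
        x2 = y2 - G(y1)                      # Eqn. 6: undo the blocks in reverse order
        x1 = y1 - F(x2)
        return x1, x2

    x1, x2 = rng.normal(size=D), rng.normal(size=D)
    y1, y2 = forward(x1, x2)
    x1_rec, x2_rec = inverse(y1, y2)
    assert np.allclose(x1, x1_rec) and np.allclose(x2, x2_rec)

Note that F and G never need to be inverted themselves; only additions are undone, which is why they can be arbitrary networks.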

So we've shown how to invert a reversible block. What about the determinant of the Jacobian? Here is the formula for the Jacobian, which we get by differentiating Eqn. 3 and putting the result into the form of a block matrix:

∂y/∂x = ( I   ∂F/∂x2 )
        ( 0   I      )

(Recall that I denotes the identity matrix.)

This is an upper triangular matrix. Think back to linear algebra class: the determinant of an upper triangular matrix is simply the product of
the diagonal entries. In this case, the diagonal entries are all 1’s, so the
determinant is 1. How convenient! Since the determinant is 1, the mapping
is volume preserving, i.e. it maps any given set to another set of the same
volume. In our context, this just means the determinant term disappears
from the change-of-variables formula (Eqn. 1).
All this analysis so far was for a single reversible block. What if we build
a reversible network by chaining together lots of reversible blocks?

Fortunately, inversion of the whole network is still easy, since we just invert each block from top to bottom. Mathematically,

f^{-1} = f1^{-1} ◦ · · · ◦ fk^{-1}.          (7)
For the determinant, we can apply the chain rule for derivatives, followed by the product rule for determinants:

det(∂xk/∂z) = det((∂xk/∂xk−1) · · · (∂x2/∂x1)(∂x1/∂z))
            = det(∂xk/∂xk−1) · · · det(∂x2/∂x1) det(∂x1/∂z)          (8)
            = 1 · 1 · · · 1
            = 1
Hence, the full reversible network is also volume preserving.
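To check this numerically, here is a short PyTorch sketch (an illustration under assumed choices of residual functions, not part of the lecture) that builds a chain of coupling blocks and verifies with autograd that the determinant of the Jacobian is 1:

    import torch

    torch.manual_seed(0)
    D = 3                                             # half-dimension; the input has 2*D units
    Fs = [torch.nn.Linear(D, D) for _ in range(4)]    # one residual function per coupling block

    def net(x):
        x1, x2 = x[:D], x[D:]
        for k, F in enumerate(Fs):
            if k % 2 == 0:                            # alternate which half gets updated
                x1 = x1 + torch.tanh(F(x2))
            else:
                x2 = x2 + torch.tanh(F(x1))
        return torch.cat([x1, x2])

    x = torch.randn(2 * D)
    J = torch.autograd.functional.jacobian(net, x)    # (2D x 2D) Jacobian at x
    print(torch.linalg.det(J))                        # prints 1.0 up to floating-point error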
Because we can compute inverses and determinants, we can train a reversible generative model with maximum likelihood, using the change-of-variables formula. This is the idea behind nonlinear independent com-
ponents estimation (NICE)1 . (This paper introduced the idea of training
reversible architectures with maximum likelihood.) The change-of-variables
formula gives us:

pX(x) = pZ(z) |det(∂x/∂z)|^{-1}
       = pZ(z)          (9)
Hence, the maximum likelihood objective over the whole dataset is:
∏_{i=1}^N pX(x^(i)) = ∏_{i=1}^N pZ(f^{-1}(x^(i)))          (10)
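In code, this objective is just the log density of f^{-1}(x^(i)) under the prior pZ. Here is a minimal PyTorch sketch (the name f_inv is hypothetical; it stands for whatever inverse network you have built, such as the coupling blocks above):

    import math
    import torch

    def log_pZ(z):
        # log density of independent standard Gaussians (the fixed prior p_Z)
        return (-0.5 * z ** 2 - 0.5 * math.log(2 * math.pi)).sum(dim=-1)

    def log_likelihood(x, f_inv):
        # Eqn. 10: the network is volume preserving, so there is no Jacobian term
        # and log p_X(x) = log p_Z(f^{-1}(x)).
        return log_pZ(f_inv(x))

    # Training maximizes log_likelihood(x_batch, f_inv).mean() over the parameters
    # of the coupling functions; the identity below is just a placeholder f.
    print(log_likelihood(torch.zeros(2, 4), lambda x: x))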

Remember, pZ is a simple, fixed distribution (e.g. independent Gaussians), so pZ(z) is easy to evaluate. Note that this objective only makes sense because of the volume constraint: if f weren't constrained to be volume preserving, then f^{-1} could map every training example very close to 0, and hence pZ(f^{-1}(x^(i))) would be large for every training example. The volume preservation constraint prevents this trivial solution.

1 Dinh et al., 2014. NICE: Non-linear independent components estimation.

2 Autoregressive Models

Figure 2: Examples of sequence modeling tasks with very long contexts. Left: Modeling images as sequences using raster scan order. Right: Modeling an audio waveform (e.g. speech signal).

Autoregressive models are another kind of deep generative model with tractable likelihoods. We've already seen two examples in this course: the neural language model (Lecture 5) and RNNs (Lectures 13-14). Here, the observations were given as sequences (x^(1), . . . , x^(T)), and we decomposed the likelihood into a product of conditional distributions:

p(x^(1), . . . , x^(T)) = ∏_{t=1}^T p(x^(t) | x^(1), . . . , x^(t−1)).          (11)

So the maximum likelihood objective decomposes as a sequence of prediction problems for each term in the sequence given the previous terms. Assuming
the observations were discrete (as they are in all the autoregressive models
considered in this course), the prediction at each time step can be made
using a neural network which outputs a probability distribution using a
softmax activation function.
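Concretely, each term in Eqn. 11 is an ordinary softmax cross-entropy, and the per-step losses are summed over time. A small PyTorch sketch (hypothetical, not from the lecture), where logits[t] is the model's output distribution over x^(t) computed only from x^(1), . . . , x^(t−1):

    import torch
    import torch.nn.functional as F

    def autoregressive_nll(logits, targets):
        # logits: (T, vocab_size), targets: (T,) integer symbols x^(1), ..., x^(T)
        log_probs = F.log_softmax(logits, dim=-1)
        # negative log p(x^(t) | x^(1), ..., x^(t-1)), summed over time steps
        return -log_probs[torch.arange(len(targets)), targets].sum()

    logits = torch.randn(5, 10)               # e.g. 5 time steps, vocabulary of size 10
    targets = torch.randint(0, 10, (5,))
    print(autoregressive_nll(logits, targets))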
So far, we’ve mostly considered using short sequences or short context
lengths. The neural language model from Assignment 1 used context win-
dows of length 3, though the architecture could work better with contexts of
length 10 or so. Machine translation at the word level involves outputting
sequences of length 20 or so (the typical number of words in a sentence).
But what if accurately modeling the distribution requires much longer
term dependencies? One example is autoregressive models of images. A
grayscale image is typically represented with pixel values which are integers
from 0 to 255 (i.e. one byte each). We can treat this as a sequence using
the raster scan order (Figure 2). But even a fairly small image would corre-
spond to a very long sequence, e.g. a 100 × 100 image would correspond to
a sequence of length 10,000. Clearly, images have a lot of global structure,
so the predictions would need to take into account all of the pixels which
were already generated. As another example, consider learning a genera-
tive model of audio waveforms. An audio waveform is stored as a sequence
of integer-valued samples, with a sampling rate of at least 16,000 Hz (cy-
cles/second) in order to have reasonably good sound quality. This means
that to predict the next term in the sequence, if we want to account for
even 1 second of context, this requires a context of length 16,000.
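The numbers above are easy to reproduce; here is a tiny NumPy illustration (with random data standing in for a real image and waveform) of how both modalities become very long sequences:

    import numpy as np

    # A random 100 x 100 grayscale "image": one byte per pixel.
    image = np.random.randint(0, 256, size=(100, 100), dtype=np.uint8)
    pixel_sequence = image.reshape(-1)        # raster scan order: row by row
    print(pixel_sequence.shape)               # (10000,): a length-10,000 sequence

    # One second of 16 kHz audio: 16,000 integer-valued samples of context.
    audio = np.random.randint(-2 ** 15, 2 ** 15, size=16000, dtype=np.int16)
    print(audio.shape)                        # (16000,)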
One way to account for such a long context is to use an RNN, which
(through its hidden units) accounts for the entire sequence that was gener-
ated so far. The problem is that computing the hidden units for each time
step depends on the hidden units from the previous time step, so the for-
ward pass of backprop requires a for-loop over time steps. (The backward

Figure 3: Top: a causal CNN applied to sequential data (such as an audio
waveform). Source: van den Oord et al., 2016, “WaveNet: a generative
model for raw audio”. Bottom: applying causal convolution to model-
ing images. Source: van den Oord et al., 2016, “Pixel recurrent neural
networks”.

pass requires a for-loop as well.) With thousands of time steps, this can
get very expensive. But think about the neural language model architecture
from Lecture 7. At training time, the predictions at each time step are done
independently of each other, so all the time steps can be processed simulta-
neously with vectorized computations. This implies that training with very
long sequences could be done much more efficiently if we could somehow
get rid of the recurrent connections.
Causal convolution is an elegant solution to this problem. Observe
that, in order to apply the chain rule for conditional probability (Eqn. 11),
it’s important that information never leak backwards in time, i.e. that each
prediction be made only using observations from earlier in the sequence.
A model with this property is called causal. We can design convolutional
neural nets (CNNs) to have a causal structure by masking their connections,
i.e. constraining certain of their weights to be zero, as shown in Figure 3. At
training time, the predictions can be computed for the entire sequence with
a single forward pass through the CNN. Causal convolution is a particularly
elegant architecture in that it allows computations to be shared between the
predictions for different time steps, e.g. a given unit in the first layer will
affect the predictions at multiple different time steps.
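One common way to implement causal convolution is sketched below in PyTorch (using left padding rather than explicit weight masks, which has the same effect for 1-D sequences; the names are illustrative, not WaveNet's actual code):

    import torch
    import torch.nn.functional as F

    def causal_conv1d(x, weight, dilation=1):
        # x: (batch, in_channels, T); weight: (out_channels, in_channels, kernel_size)
        k = weight.shape[-1]
        pad = (k - 1) * dilation              # pad only on the left, so output t
        x = F.pad(x, (pad, 0))                # never sees inputs later than t
        return F.conv1d(x, weight, dilation=dilation)

    # A single forward pass produces the outputs for every time step at once.
    x = torch.randn(1, 8, 100)                # one sequence, 8 channels, 100 steps
    w = torch.randn(16, 8, 2)                 # kernel size 2, as in WaveNet
    print(causal_conv1d(x, w).shape)          # torch.Size([1, 16, 100])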
It’s interesting to contrast a causal convolution architecture with an
RNN. We could turn the causal CNN into an RNN by adding recurrent

Figure 4: The dilated convolution architecture used in WaveNet. Source:
van den Oord et al., 2016, “WaveNet: a generative model for raw audio”.

connections between the hidden units. This would have the advantage that,
because of its memory, the model could use information from all previous
time steps to make its predictions. But training would be very slow, since
it would require a for-loop over time steps. A very influential recent pa-
per2 showed that both strategies are actually highly effective for modeling
images. Take a moment to look at the examples in that paper.
The problem with a straightforward CNN architecture is that the pre-
dictions are made using a relatively short context because the output units
have a small receptive field. Fortunately, there’s a clever fix for this prob-
lem, which you’ve already seen in Programming Assignment 2: dilated
convolution. Recall that this means that each unit receives connections
from units in the previous layer with a spacing larger than 1. Figure 4 shows
part of the dilated convolution architecture for WaveNet3 , an autoregres-
sive model for audio. The first layer has a dilation of 1, so each unit has a
receptive field of size 2. The next layer has a dilation of 2, so each unit has a receptive field of size 4. The dilation factors are spaced by factors of 2,
i.e., {1, 2, . . . , 512}, so that the 10th layer has receptive fields of size 1024.
Hence, it gets exponentially large receptive fields with only a linear number
of connections. This 10-layer architecture is repeated 5 times, so that the
receptive fields are approximately of size 5000, or about 300 milliseconds.
This is a large enough context to generate impressively good audio. You
can find some neat examples here:

https://deepmind.com/blog/wavenet-generative-model-raw-audio/
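The receptive-field bookkeeping described above is easy to verify; here is a small Python sketch (assuming kernel size 2, dilations 1 through 512, and the 10-layer stack repeated 5 times, as stated in the text):

    # Kernel size 2: each layer extends the receptive field by its dilation.
    dilations = [2 ** i for i in range(10)] * 5     # 1, 2, ..., 512, repeated 5 times
    receptive_field = 1 + sum(dilations)
    print(receptive_field)                          # 5116 samples
    print(receptive_field / 16000)                  # ~0.32 seconds of context at 16 kHz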

Compared with other autoregressive models, causal, dilated CNNs are quite efficient at training time, despite their large context. However, all
autoregressive models, including both CNNs and RNNs, share a common
disadvantage: they are very slow to generate from, since the model’s own
samples need to be fed in as inputs, which means it requires a for-loop over
time steps. So if efficiency of generation is a big concern, then GANs or
reversible models would be much preferred. Learning a generative model of
audio which is of similar quality to WaveNet, yet also efficient to generate
from, is an active area of research.
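To see where the cost comes from, here is a sketch of the sampling loop (with a hypothetical `model` that maps a prefix of symbols to a distribution over the next one); unlike training, it cannot be vectorized across time steps:

    import torch

    def generate(model, length, first_symbol=0):
        seq = [first_symbol]
        for t in range(length - 1):                 # unavoidable for-loop over time steps
            prefix = torch.tensor(seq).unsqueeze(0) # the model's own samples are fed back in
            probs = model(prefix)                   # distribution over the next symbol
            seq.append(torch.multinomial(probs, 1).item())
        return seq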
2 van den Oord et al., 2016, "Pixel recurrent neural networks". https://arxiv.org/abs/1601.06759
3 van den Oord et al., 2016, "WaveNet: a generative model for raw audio". https://arxiv.org/abs/1609.03499
