L15 Autoregressive and Reversible Models
Roger Grosse
In this lecture, we’ll cover two kinds of deep generative model architectures
which can be trained using maximum likelihood. The first kind is reversible
architectures, where the network’s computations can be inverted in order to
recover the input which maps to a given output. We’ll see that this makes
the likelihood computation tractable.
The second kind of architecture is autoregressive models. This isn’t new:
we’ve already covered neural language models and RNN language models,
both of which are examples of autoregressive models. In this lecture, we’ll
introduce two tricks for making them much more scalable, so that we can
apply them to high-dimensional data modalities like high-resolution images
and audio waveforms.
1 Reversible Models
Mathematically, reversible models are based on the change-of-variables
formula for probability density functions. Suppose we have a bijective,
differentiable mapping f : Z → X . (“Bijective” means the mapping must be
1–1 and cover all of X .) Since f is bijective, we can think of it as representing
a change-of-variables transformation. For instance, x = f (z) = 12z could
represent a conversion of units from feet to inches. If we have a density
pZ (z), the change-of-variables formula gives us the density pX (x):
p_X(x) = p_Z(z) \left| \det \frac{\partial x}{\partial z} \right|^{-1}.   (1)
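As a quick sanity check of Eqn. 1, here is a short numpy sketch for the feet-to-inches example x = 12z, taking pZ to be a standard Gaussian (that choice is just for illustration); the density predicted by the change-of-variables formula is compared against a histogram of samples.

```python
import numpy as np

# Sanity check of Eqn. 1 for the 1-D mapping x = f(z) = 12 z (feet to inches),
# with z drawn from a standard Gaussian. Here dx/dz = 12, so
# p_X(x) = p_Z(x / 12) * (1 / 12).

def p_z(z):
    return np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)

def p_x(x):
    z = x / 12.0              # invert the mapping
    return p_z(z) / 12.0      # multiply by |det dx/dz|^{-1} = 1/12

# Compare against a histogram of samples of x = 12 z.
rng = np.random.default_rng(0)
x_samples = 12.0 * rng.standard_normal(1_000_000)
hist, edges = np.histogram(x_samples, bins=100, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
print(np.max(np.abs(hist - p_x(centers))))   # small (sampling noise only)
```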
Now suppose the mapping f is the function computed by a generator
network (i.e. its outputs as a function of its inputs). It’s tempting to apply
the change-of-variables formula in order to compute pX(x). But in order
for this to work, three things need to be true: the mapping must be bijective,
we must be able to compute its inverse, and we must be able to compute the
determinant of its Jacobian. Reversible architectures are designed so that all
three are tractable. The starting point is the residual block, which computes
y = x + F(x),   (2)
where F is some function (e.g. a small neural network). A reversible block
instead partitions its units into two groups, x1 and x2, and updates only the
first group:
y_1 = x_1 + F(x_2)
y_2 = x_2   (3)
Unlike an ordinary residual block, this is easy to invert, because x2 is passed
through unchanged:
x_2 = y_2
x_1 = y_1 - F(x_2)   (4)
Here’s what happens when we compose two reversible blocks, with the
roles of x1 and x2 swapped:
y_1 = x_1 + F(x_2)
y_2 = x_2 + G(y_1)   (5)
Since each block leaves one half of the units unchanged, composing blocks
with the roles swapped lets every unit be updated, while the composition
remains easy to invert by undoing the blocks in reverse order.
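These blocks are straightforward to implement. The sketch below is a minimal numpy illustration, with F and G chosen as arbitrary toy functions (a tiny random MLP and a sine, purely for illustration); it checks that the inverse in Eqn. 4 exactly recovers the input, and likewise for the two-block composition in Eqn. 5.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 4                                   # dimensionality of each half
W1 = rng.standard_normal((8, D))
W2 = rng.standard_normal((D, 8))

def F(h):
    """An arbitrary, non-invertible function (a tiny random MLP)."""
    return W2 @ np.tanh(W1 @ h)

def G(h):
    """Another arbitrary function, used by the second block."""
    return 0.5 * np.sin(h)

# One reversible block (Eqn. 3) and its inverse (Eqn. 4).
def block_forward(x1, x2):
    return x1 + F(x2), x2

def block_inverse(y1, y2):
    x2 = y2
    x1 = y1 - F(x2)
    return x1, x2

# Two blocks composed with the roles of x1 and x2 swapped (Eqn. 5).
def composed_forward(x1, x2):
    y1 = x1 + F(x2)
    y2 = x2 + G(y1)
    return y1, y2

def composed_inverse(y1, y2):
    x2 = y2 - G(y1)
    x1 = y1 - F(x2)
    return x1, x2

x1, x2 = rng.standard_normal(D), rng.standard_normal(D)
assert np.allclose(block_inverse(*block_forward(x1, x2)), (x1, x2))
assert np.allclose(composed_inverse(*composed_forward(x1, x2)), (x1, x2))
print("inverses recover the inputs exactly (up to floating point)")
```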
So we’ve shown how to invert a reversible block. What about the de-
terminant of the Jacobian? Here is the formula for the Jacobian, which we
get by differentiating Eqn. 3 and putting the result into the form of a block
matrix:
\frac{\partial y}{\partial x} = \begin{pmatrix} I & \frac{\partial F}{\partial x_2} \\ 0 & I \end{pmatrix}   (6)
(Recall that I denotes the identity matrix.) This matrix is upper triangular,
and all of its diagonal entries are 1. The determinant of an upper triangular
matrix is the product of its diagonal entries, so the determinant is 1; in other
words, a single reversible block is volume preserving.
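This structure is easy to verify numerically. The sketch below builds the Jacobian of a single coupling block by finite differences (with an arbitrary toy choice of F, just for illustration) and checks that it matches Eqn. 6: identity diagonal blocks, a zero lower-left block, and determinant 1.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 3
A = rng.standard_normal((D, D))

def F(h):
    """An arbitrary smooth function, standing in for the block's F."""
    return np.tanh(A @ h)

def block(x):
    """Coupling block acting on the concatenated vector [x1, x2]."""
    x1, x2 = x[:D], x[D:]
    return np.concatenate([x1 + F(x2), x2])

def jacobian(f, x, eps=1e-6):
    """Finite-difference Jacobian of f at x."""
    J = np.zeros((x.size, x.size))
    for j in range(x.size):
        e = np.zeros(x.size)
        e[j] = eps
        J[:, j] = (f(x + e) - f(x - e)) / (2.0 * eps)
    return J

x = rng.standard_normal(2 * D)
J = jacobian(block, x)
print(np.round(J, 3))           # top-left and bottom-right blocks are I,
                                # bottom-left block is 0, as in Eqn. 6
print(np.linalg.det(J))         # ~ 1.0, since J is triangular with unit diagonal
```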
In practice, we compose many reversible blocks into a deep network,
x = f(z) = f_k \circ \cdots \circ f_1(z), with x_1, . . . , x_k denoting the outputs
of the successive blocks (so x_k = x).
Fortunately, inversion of the whole network is still easy, since we just invert
each block from top to bottom. Mathematically,
f^{-1} = f_1^{-1} \circ \cdots \circ f_k^{-1}.   (7)
For the determinant, we can apply the chain rule for derivatives, followed
by the product rule for determinants:
\det \frac{\partial x_k}{\partial z}
  = \det \left( \frac{\partial x_k}{\partial x_{k-1}} \cdots \frac{\partial x_2}{\partial x_1} \, \frac{\partial x_1}{\partial z} \right)
  = \det \frac{\partial x_k}{\partial x_{k-1}} \cdots \det \frac{\partial x_2}{\partial x_1} \, \det \frac{\partial x_1}{\partial z}   (8)
  = 1 \cdot 1 \cdots 1
  = 1
Hence, the full reversible network is also volume preserving.
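The same finite-difference check works end to end. The sketch below stacks several coupling blocks, alternating which half gets updated (again with arbitrary toy functions in place of learned networks), and confirms that the determinant of the whole network’s Jacobian is 1, as Eqn. 8 predicts.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 3
weights = [rng.standard_normal((D, D)) for _ in range(4)]

def net(z):
    """Four coupling blocks; odd-numbered blocks update the second half."""
    x1, x2 = z[:D].copy(), z[D:].copy()
    for i, A in enumerate(weights):
        if i % 2 == 0:
            x1 = x1 + np.tanh(A @ x2)
        else:
            x2 = x2 + np.tanh(A @ x1)
    return np.concatenate([x1, x2])

def jacobian(f, x, eps=1e-6):
    """Finite-difference Jacobian of f at x."""
    J = np.zeros((x.size, x.size))
    for j in range(x.size):
        e = np.zeros(x.size)
        e[j] = eps
        J[:, j] = (f(x + e) - f(x - e)) / (2.0 * eps)
    return J

z = rng.standard_normal(2 * D)
print(np.linalg.det(jacobian(net, z)))   # ~ 1.0: the whole network is volume preserving
```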
Because we can compute inverses and determinants, we can train a
reversible generative model with maximum likelihood, using the change-of-
variables formula. This is the idea behind nonlinear independent com-
ponents estimation (NICE)1 . (This paper introduced the idea of training
reversible architectures with maximum likelihood.) Since the determinant
of the Jacobian is 1, the change-of-variables formula gives us:
p_X(x) = p_Z(z) \left| \det \frac{\partial x}{\partial z} \right|^{-1} = p_Z(z)   (9)
Hence, the maximum likelihood objective over the whole dataset is:
\prod_{i=1}^N p_X(x^{(i)}) = \prod_{i=1}^N p_Z(f^{-1}(x^{(i)}))   (10)
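In code, the maximum likelihood objective is just the sum of log pZ(f^{-1}(x^{(i)})) over the training cases. The sketch below computes this for the same kind of toy volume-preserving network as above, with a standard Gaussian prior; in a real model like NICE, the functions F would be neural networks whose parameters are adjusted to increase this objective.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 3
weights = [rng.standard_normal((D, D)) for _ in range(4)]   # toy stand-ins for the F's

def f_inverse(x):
    """Invert the stack of coupling blocks by undoing each block, last to first."""
    x1, x2 = x[:D].copy(), x[D:].copy()
    for i, A in reversed(list(enumerate(weights))):
        if i % 2 == 0:
            x1 = x1 - np.tanh(A @ x2)
        else:
            x2 = x2 - np.tanh(A @ x1)
    return np.concatenate([x1, x2])

def log_prior(z):
    """Log-density of a standard Gaussian prior p_Z."""
    return -0.5 * np.sum(z ** 2) - 0.5 * z.size * np.log(2.0 * np.pi)

def log_likelihood(x):
    # By Eqn. 9 the Jacobian determinant is 1, so log p_X(x) = log p_Z(f^{-1}(x)).
    return log_prior(f_inverse(x))

data = rng.standard_normal((5, 2 * D))              # a toy "dataset"
print(sum(log_likelihood(x) for x in data))         # log of the product in Eqn. 10
```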
Figure 2: Examples of sequence modeling tasks with very long contexts.
Left: Modeling images as sequences using raster scan order. Right: Mod-
eling an audio waveform (e.g. speech signal).
2 Autoregressive Models
We’ve already seen two examples of autoregressive models: the neural
language model (Lecture 5) and RNNs (Lectures 13-14). Here, the
observations were given as sequences (x^{(1)}, . . . , x^{(T)}), and we decomposed
the likelihood into a product of conditional distributions:
p(x^{(1)}, \ldots, x^{(T)}) = \prod_{t=1}^T p(x^{(t)} \mid x^{(1)}, \ldots, x^{(t-1)}).   (11)
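Eqn. 11 translates directly into code: to score a sequence, we sum the log-probability of each observation under its conditional distribution. The sketch below uses a made-up conditional (a stand-in for whatever model produces p(x^{(t)} | x^{(1)}, . . . , x^{(t-1)})) just to show the pattern; an RNN or causal CNN would plug in the same way.

```python
import numpy as np

def conditional(prefix):
    """Stand-in for p(x^(t) | x^(1), ..., x^(t-1)): any model mapping a prefix
    to a distribution over the next symbol (here, a 3-symbol vocabulary)."""
    logits = np.array([len(prefix) % 2, 1.0, sum(prefix) % 3], dtype=float)
    p = np.exp(logits - logits.max())
    return p / p.sum()

def log_likelihood(x):
    # Eqn. 11: log p(x^(1), ..., x^(T)) = sum_t log p(x^(t) | x^(1), ..., x^(t-1))
    return sum(np.log(conditional(x[:t])[x[t]]) for t in range(len(x)))

print(log_likelihood([0, 2, 1, 1, 0]))
```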
Figure 3: Top: a causal CNN applied to sequential data (such as an audio
waveform). Source: van den Oord et al., 2016, “WaveNet: a generative
model for raw audio”. Bottom: applying causal convolution to model-
ing images. Source: van den Oord et al., 2016, “Pixel recurrent neural
networks”.
Recall that training an RNN requires a for-loop over time steps, since each
hidden state depends on the previous one. (The backward pass requires a
for-loop as well.) With thousands of time steps, this can
get very expensive. But think about the neural language model architecture
from Lecture 7. At training time, the predictions at each time step are done
independently of each other, so all the time steps can be processed simulta-
neously with vectorized computations. This implies that training with very
long sequences could be done much more efficiently if we could somehow
get rid of the recurrent connections.
Causal convolution is an elegant solution to this problem. Observe
that, in order to apply the chain rule for conditional probability (Eqn. 11),
it’s important that information never leak backwards in time, i.e. that each
prediction be made only using observations from earlier in the sequence.
A model with this property is called causal. We can design convolutional
neural nets (CNNs) to have a causal structure by masking their connections,
i.e. constraining certain of their weights to be zero, as shown in Figure 3. At
training time, the predictions can be computed for the entire sequence with
a single forward pass through the CNN. Causal convolution is a particularly
elegant architecture in that it allows computations to be shared between the
predictions for different time steps, e.g. a given unit in the first layer will
affect the predictions at multiple different time steps.
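Concretely, a causal convolution with filter width 2 can be implemented by zero-padding the input on the left and applying an ordinary convolution, which gives the same connectivity as masking out connections to future time steps. The minimal numpy sketch below checks the causality property: perturbing the input at time t leaves every output before time t unchanged.

```python
import numpy as np

def causal_conv1d(x, w):
    """1-D causal convolution: output[t] depends only on x[0..t].
    Implemented by zero-padding on the left by (filter width - 1)."""
    k = len(w)
    x_pad = np.concatenate([np.zeros(k - 1), x])
    return np.array([np.dot(w, x_pad[t:t + k]) for t in range(len(x))])

rng = np.random.default_rng(0)
x = rng.standard_normal(16)
w = rng.standard_normal(2)              # filter width 2, as in Figure 3

y = causal_conv1d(x, w)

# Causality check: change the input at time t and compare outputs.
t = 10
x_perturbed = x.copy()
x_perturbed[t] += 1.0
y_perturbed = causal_conv1d(x_perturbed, w)
assert np.allclose(y[:t], y_perturbed[:t])      # the past is unaffected
assert not np.allclose(y[t:], y_perturbed[t:])  # the present and future change
print("no information leaks backwards in time")
```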
It’s interesting to contrast a causal convolution architecture with an
RNN. We could turn the causal CNN into an RNN by adding recurrent
Figure 4: The dilated convolution architecture used in WaveNet. Source:
van den Oord et al., 2016, “WaveNet: a generative model for raw audio”.
connections between the hidden units. This would have the advantage that,
because of its memory, the model could use information from all previous
time steps to make its predictions. But training would be very slow, since
it would require a for-loop over time steps. A very influential recent pa-
per2 showed that both strategies are actually highly effective for modeling
images. Take a moment to look at the examples in that paper.
The problem with a straightforward CNN architecture is that the pre-
dictions are made using a relatively short context because the output units
have a small receptive field. Fortunately, there’s a clever fix for this prob-
lem, which you’ve already seen in Programming Assignment 2: dilated
convolution. Recall that this means that each unit receives connections
from units in the previous layer with a spacing larger than 1. Figure 4 shows
part of the dilated convolution architecture for WaveNet3 , an autoregres-
sive model for audio. The first layer has a dilation of 1, so each unit has a
receptive field of size 2. The next layer has a dilation of 2, so each unit has
a receptive field of size 4. The dilation factors are spaced by factors of 2,
i.e., {1, 2, . . . , 512}, so that the 10th layer has receptive fields of size 1024.
Hence, it gets exponentially large receptive fields with only a linear number
of connections. This 10-layer architecture is repeated 5 times, so that the
receptive fields are approximately of size 5000, or about 300 milliseconds.
This is a large enough context to generate impressively good audio. You
can find some neat examples here:
https://deepmind.com/blog/wavenet-generative-model-raw-audio/
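The receptive-field arithmetic above is easy to verify: with filter width 2, a layer with dilation d adds d samples to the receptive field, so dilations 1, 2, . . . , 512 give a receptive field of 1024, and repeating the stack 5 times gives roughly 5000 samples, or about 300 ms at a 16 kHz sampling rate (the rate is assumed here just for the conversion). A short sketch:

```python
# Receptive field of stacked dilated causal convolutions with filter width 2:
# each layer with dilation d extends the receptive field by d samples.
dilations = [2 ** i for i in range(10)] * 5   # {1, 2, ..., 512}, repeated 5 times
receptive_field = 1 + sum(dilations)          # 1 + 5 * 1023 = 5116 samples
print(receptive_field, receptive_field / 16000.0)   # 5116 samples ~ 0.32 s at 16 kHz
```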