L15 Autoregressive and Reversible Models
Roger Grosse
In this lecture, we’ll cover two kinds of deep generative model architectures
which can be trained using maximum likelihood. The first kind is reversible
architectures, where the network’s computations can be inverted in order to
recover the input which maps to a given output. We’ll see that this makes
the likelihood computation tractable.
The second kind of architecture is autoregressive models. This isn’t new:
we’ve already covered neural language models and RNN language models,
both of which are examples of autoregressive models. In this lecture, we’ll
introduce two tricks for making them much more scalable, so that we can
apply them to high-dimensional data modalities like high-resolution images
and audio waveforms.
1 Reversible Models
Mathematically, reversible models are based on the change-of-variables
formula for probability density functions. Suppose we have a bijective,
differentiable mapping f : Z → X . (“Bijective” means the mapping must be
1–1 and cover all of X .) Since f is bijective, we can think of it as representing
a change-of-variables transformation. For instance, x = f (z) = 12z could
represent a conversion of units from feet to inches. If we have a density
pZ (z), the change-of-variables formula gives us the density pX (x):
p_X(x) = p_Z(z) \left| \det \frac{\partial x}{\partial z} \right|^{-1}.   (1)
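As a quick sanity check of Eqn. 1, here is a short numpy sketch for the feet-to-inches example x = 12z, taking pZ to be a standard Gaussian (that choice is just for illustration); the density predicted by the change-of-variables formula is compared against a histogram of samples.

```python
import numpy as np

# Sanity check of Eqn. 1 for the 1-D mapping x = f(z) = 12 z (feet to inches),
# with z drawn from a standard Gaussian. Here dx/dz = 12, so
# p_X(x) = p_Z(x / 12) * (1 / 12).

def p_z(z):
    return np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)

def p_x(x):
    z = x / 12.0              # invert the mapping
    return p_z(z) / 12.0      # multiply by |det dx/dz|^{-1} = 1/12

# Compare against a histogram of samples of x = 12 z.
rng = np.random.default_rng(0)
x_samples = 12.0 * rng.standard_normal(1_000_000)
hist, edges = np.histogram(x_samples, bins=100, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
print(np.max(np.abs(hist - p_x(centers))))   # small (sampling noise only)
```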
Now suppose the mapping f is the function computed by a generator
network (i.e. its outputs as a function of its inputs). It’s tempting to apply
the change-of-variables formula in order to compute pX(x). But in order
for this to work, three things need to be true: the mapping must be bijective,
we must be able to compute its inverse, and we must be able to compute the
determinant of its Jacobian. Reversible architectures are designed so that all
three are tractable. The starting point is the residual block, which computes
y = x + F(x),   (2)
where F is some function (e.g. a small neural network). A reversible block
instead partitions its units into two groups, x1 and x2, and updates only the
first group:
y_1 = x_1 + F(x_2)
y_2 = x_2   (3)
Unlike an ordinary residual block, this is easy to invert, because x2 is passed
through unchanged:
x_2 = y_2
x_1 = y_1 - F(x_2)   (4)
Here’s what happens when we compose two reversible blocks, with the
roles of x1 and x2 swapped:
y_1 = x_1 + F(x_2)
y_2 = x_2 + G(y_1)   (5)
Since each block leaves one half of the units unchanged, composing blocks
with the roles swapped lets every unit be updated, while the composition
remains easy to invert by undoing the blocks in reverse order.
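These blocks are straightforward to implement. The sketch below is a minimal numpy illustration, with F and G chosen as arbitrary toy functions (a tiny random MLP and a sine, purely for illustration); it checks that the inverse in Eqn. 4 exactly recovers the input, and likewise for the two-block composition in Eqn. 5.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 4                                   # dimensionality of each half
W1 = rng.standard_normal((8, D))
W2 = rng.standard_normal((D, 8))

def F(h):
    """An arbitrary, non-invertible function (a tiny random MLP)."""
    return W2 @ np.tanh(W1 @ h)

def G(h):
    """Another arbitrary function, used by the second block."""
    return 0.5 * np.sin(h)

# One reversible block (Eqn. 3) and its inverse (Eqn. 4).
def block_forward(x1, x2):
    return x1 + F(x2), x2

def block_inverse(y1, y2):
    x2 = y2
    x1 = y1 - F(x2)
    return x1, x2

# Two blocks composed with the roles of x1 and x2 swapped (Eqn. 5).
def composed_forward(x1, x2):
    y1 = x1 + F(x2)
    y2 = x2 + G(y1)
    return y1, y2

def composed_inverse(y1, y2):
    x2 = y2 - G(y1)
    x1 = y1 - F(x2)
    return x1, x2

x1, x2 = rng.standard_normal(D), rng.standard_normal(D)
assert np.allclose(block_inverse(*block_forward(x1, x2)), (x1, x2))
assert np.allclose(composed_inverse(*composed_forward(x1, x2)), (x1, x2))
print("inverses recover the inputs exactly (up to floating point)")
```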
So we’ve shown how to invert a reversible block. What about the de-
terminant of the Jacobian? Here is the formula for the Jacobian, which we
get by differentiating Eqn. 3 and putting the result into the form of a block
matrix:
\frac{\partial y}{\partial x} = \begin{pmatrix} I & \frac{\partial F}{\partial x_2} \\ 0 & I \end{pmatrix}   (6)
(Recall that I denotes the identity matrix.) This matrix is upper triangular,
and all of its diagonal entries are 1. The determinant of an upper triangular
matrix is the product of its diagonal entries, so the determinant is 1; in other
words, a single reversible block is volume preserving.
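This structure is easy to verify numerically. The sketch below builds the Jacobian of a single coupling block by finite differences (with an arbitrary toy choice of F, just for illustration) and checks that it matches Eqn. 6: identity diagonal blocks, a zero lower-left block, and determinant 1.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 3
A = rng.standard_normal((D, D))

def F(h):
    """An arbitrary smooth function, standing in for the block's F."""
    return np.tanh(A @ h)

def block(x):
    """Coupling block acting on the concatenated vector [x1, x2]."""
    x1, x2 = x[:D], x[D:]
    return np.concatenate([x1 + F(x2), x2])

def jacobian(f, x, eps=1e-6):
    """Finite-difference Jacobian of f at x."""
    J = np.zeros((x.size, x.size))
    for j in range(x.size):
        e = np.zeros(x.size)
        e[j] = eps
        J[:, j] = (f(x + e) - f(x - e)) / (2.0 * eps)
    return J

x = rng.standard_normal(2 * D)
J = jacobian(block, x)
print(np.round(J, 3))           # top-left and bottom-right blocks are I,
                                # bottom-left block is 0, as in Eqn. 6
print(np.linalg.det(J))         # ~ 1.0, since J is triangular with unit diagonal
```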
In practice, we compose many reversible blocks into a deep network,
x = f(z) = f_k \circ \cdots \circ f_1(z), with x_1, . . . , x_k denoting the outputs
of the successive blocks (so x_k = x).
Fortunately, inversion of the whole network is still easy, since we just invert
each block from top to bottom. Mathematically,
f^{-1} = f_1^{-1} \circ \cdots \circ f_k^{-1}.   (7)
For the determinant, we can apply the chain rule for derivatives, followed
by the product rule for determinants:
\det \frac{\partial x_k}{\partial z}
  = \det \left( \frac{\partial x_k}{\partial x_{k-1}} \cdots \frac{\partial x_2}{\partial x_1} \, \frac{\partial x_1}{\partial z} \right)
  = \det \frac{\partial x_k}{\partial x_{k-1}} \cdots \det \frac{\partial x_2}{\partial x_1} \, \det \frac{\partial x_1}{\partial z}   (8)
  = 1 \cdot 1 \cdots 1
  = 1
Hence, the full reversible network is also volume preserving.
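The same finite-difference check works end to end. The sketch below stacks several coupling blocks, alternating which half gets updated (again with arbitrary toy functions in place of learned networks), and confirms that the determinant of the whole network’s Jacobian is 1, as Eqn. 8 predicts.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 3
weights = [rng.standard_normal((D, D)) for _ in range(4)]

def net(z):
    """Four coupling blocks; odd-numbered blocks update the second half."""
    x1, x2 = z[:D].copy(), z[D:].copy()
    for i, A in enumerate(weights):
        if i % 2 == 0:
            x1 = x1 + np.tanh(A @ x2)
        else:
            x2 = x2 + np.tanh(A @ x1)
    return np.concatenate([x1, x2])

def jacobian(f, x, eps=1e-6):
    """Finite-difference Jacobian of f at x."""
    J = np.zeros((x.size, x.size))
    for j in range(x.size):
        e = np.zeros(x.size)
        e[j] = eps
        J[:, j] = (f(x + e) - f(x - e)) / (2.0 * eps)
    return J

z = rng.standard_normal(2 * D)
print(np.linalg.det(jacobian(net, z)))   # ~ 1.0: the whole network is volume preserving
```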
Because we can compute inverses and determinants, we can train a
reversible generative model with maximum likelihood, using the change-of-
variables formula. This is the idea behind nonlinear independent com-
ponents estimation (NICE)1 . (This paper introduced the idea of training
reversible architectures with maximum likelihood.) Since the determinant
of the Jacobian is 1, the change-of-variables formula gives us:
p_X(x) = p_Z(z) \left| \det \frac{\partial x}{\partial z} \right|^{-1} = p_Z(z)   (9)
Hence, the maximum likelihood objective over the whole dataset is:
\prod_{i=1}^N p_X(x^{(i)}) = \prod_{i=1}^N p_Z(f^{-1}(x^{(i)}))   (10)
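In code, the maximum likelihood objective is just the sum of log pZ(f^{-1}(x^{(i)})) over the training cases. The sketch below computes this for the same kind of toy volume-preserving network as above, with a standard Gaussian prior; in a real model like NICE, the functions F would be neural networks whose parameters are adjusted to increase this objective.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 3
weights = [rng.standard_normal((D, D)) for _ in range(4)]   # toy stand-ins for the F's

def f_inverse(x):
    """Invert the stack of coupling blocks by undoing each block, last to first."""
    x1, x2 = x[:D].copy(), x[D:].copy()
    for i, A in reversed(list(enumerate(weights))):
        if i % 2 == 0:
            x1 = x1 - np.tanh(A @ x2)
        else:
            x2 = x2 - np.tanh(A @ x1)
    return np.concatenate([x1, x2])

def log_prior(z):
    """Log-density of a standard Gaussian prior p_Z."""
    return -0.5 * np.sum(z ** 2) - 0.5 * z.size * np.log(2.0 * np.pi)

def log_likelihood(x):
    # By Eqn. 9 the Jacobian determinant is 1, so log p_X(x) = log p_Z(f^{-1}(x)).
    return log_prior(f_inverse(x))

data = rng.standard_normal((5, 2 * D))              # a toy "dataset"
print(sum(log_likelihood(x) for x in data))         # log of the product in Eqn. 10
```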
Figure 2: Examples of sequence modeling tasks with very long contexts.
Left: Modeling images as sequences using raster scan order. Right: Mod-
eling an audio waveform (e.g. speech signal).
2 Autoregressive Models
We’ve already seen two examples of autoregressive models: the neural
language model (Lecture 5) and RNNs (Lectures 13-14). Here, the
observations were given as sequences (x^{(1)}, . . . , x^{(T)}), and we decomposed
the likelihood into a product of conditional distributions:
p(x^{(1)}, \ldots, x^{(T)}) = \prod_{t=1}^T p(x^{(t)} \mid x^{(1)}, \ldots, x^{(t-1)}).   (11)
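Eqn. 11 translates directly into code: to score a sequence, we sum the log-probability of each observation under its conditional distribution. The sketch below uses a made-up conditional (a stand-in for whatever model produces p(x^{(t)} | x^{(1)}, . . . , x^{(t-1)})) just to show the pattern; an RNN or causal CNN would plug in the same way.

```python
import numpy as np

def conditional(prefix):
    """Stand-in for p(x^(t) | x^(1), ..., x^(t-1)): any model mapping a prefix
    to a distribution over the next symbol (here, a 3-symbol vocabulary)."""
    logits = np.array([len(prefix) % 2, 1.0, sum(prefix) % 3], dtype=float)
    p = np.exp(logits - logits.max())
    return p / p.sum()

def log_likelihood(x):
    # Eqn. 11: log p(x^(1), ..., x^(T)) = sum_t log p(x^(t) | x^(1), ..., x^(t-1))
    return sum(np.log(conditional(x[:t])[x[t]]) for t in range(len(x)))

print(log_likelihood([0, 2, 1, 1, 0]))
```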
Figure 3: Top: a causal CNN applied to sequential data (such as an audio
waveform). Source: van den Oord et al., 2016, “WaveNet: a generative
model for raw audio”. Bottom: applying causal convolution to model-
ing images. Source: van den Oord et al., 2016, “Pixel recurrent neural
networks”.
Recall that training an RNN requires a for-loop over time steps, since each
hidden state depends on the previous one. (The backward pass requires a
for-loop as well.) With thousands of time steps, this can
get very expensive. But think about the neural language model architecture
from Lecture 7. At training time, the predictions at each time step are done
independently of each other, so all the time steps can be processed simulta-
neously with vectorized computations. This implies that training with very
long sequences could be done much more efficiently if we could somehow
get rid of the recurrent connections.
Causal convolution is an elegant solution to this problem. Observe
that, in order to apply the chain rule for conditional probability (Eqn. 11),
it’s important that information never leak backwards in time, i.e. that each
prediction be made only using observations from earlier in the sequence.
A model with this property is called causal. We can design convolutional
neural nets (CNNs) to have a causal structure by masking their connections,
i.e. constraining certain of their weights to be zero, as shown in Figure 3. At
training time, the predictions can be computed for the entire sequence with
a single forward pass through the CNN. Causal convolution is a particularly
elegant architecture in that it allows computations to be shared between the
predictions for different time steps, e.g. a given unit in the first layer will
affect the predictions at multiple different time steps.
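Concretely, a causal convolution with filter width 2 can be implemented by zero-padding the input on the left and applying an ordinary convolution, which gives the same connectivity as masking out connections to future time steps. The minimal numpy sketch below checks the causality property: perturbing the input at time t leaves every output before time t unchanged.

```python
import numpy as np

def causal_conv1d(x, w):
    """1-D causal convolution: output[t] depends only on x[0..t].
    Implemented by zero-padding on the left by (filter width - 1)."""
    k = len(w)
    x_pad = np.concatenate([np.zeros(k - 1), x])
    return np.array([np.dot(w, x_pad[t:t + k]) for t in range(len(x))])

rng = np.random.default_rng(0)
x = rng.standard_normal(16)
w = rng.standard_normal(2)              # filter width 2, as in Figure 3

y = causal_conv1d(x, w)

# Causality check: change the input at time t and compare outputs.
t = 10
x_perturbed = x.copy()
x_perturbed[t] += 1.0
y_perturbed = causal_conv1d(x_perturbed, w)
assert np.allclose(y[:t], y_perturbed[:t])      # the past is unaffected
assert not np.allclose(y[t:], y_perturbed[t:])  # the present and future change
print("no information leaks backwards in time")
```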
It’s interesting to contrast a causal convolution architecture with an
RNN. We could turn the causal CNN into an RNN by adding recurrent
Figure 4: The dilated convolution architecture used in WaveNet. Source:
van den Oord et al., 2016, “WaveNet: a generative model for raw audio”.
connections between the hidden units. This would have the advantage that,
because of its memory, the model could use information from all previous
time steps to make its predictions. But training would be very slow, since
it would require a for-loop over time steps. A very influential recent pa-
per2 showed that both strategies are actually highly effective for modeling
images. Take a moment to look at the examples in that paper.
The problem with a straightforward CNN architecture is that the pre-
dictions are made using a relatively short context because the output units
have a small receptive field. Fortunately, there’s a clever fix for this prob-
lem, which you’ve already seen in Programming Assignment 2: dilated
convolution. Recall that this means that each unit receives connections
from units in the previous layer with a spacing larger than 1. Figure 4 shows
part of the dilated convolution architecture for WaveNet3 , an autoregres-
sive model for audio. The first layer has a dilation of 1, so each unit has a
receptive field of size 2. The next layer has a dilation of 2, so each unit has
a receptive field of size 4. The dilation factors are spaced by factors of 2,
i.e., {1, 2, . . . , 512}, so that the 10th layer has receptive fields of size 1024.
Hence, it gets exponentially large receptive fields with only a linear number
of connections. This 10-layer architecture is repeated 5 times, so that the
receptive fields are approximately of size 5000, or about 300 milliseconds.
This is a large enough context to generate impressively good audio. You
can find some neat examples here:
https://deepmind.com/blog/wavenet-generative-model-raw-audio/
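The receptive-field arithmetic above is easy to verify: with filter width 2, a layer with dilation d adds d samples to the receptive field, so dilations 1, 2, . . . , 512 give a receptive field of 1024, and repeating the stack 5 times gives roughly 5000 samples, or about 300 ms at a 16 kHz sampling rate (the rate is assumed here just for the conversion). A short sketch:

```python
# Receptive field of stacked dilated causal convolutions with filter width 2:
# each layer with dilation d extends the receptive field by d samples.
dilations = [2 ** i for i in range(10)] * 5   # {1, 2, ..., 512}, repeated 5 times
receptive_field = 1 + sum(dilations)          # 1 + 5 * 1023 = 5116 samples
print(receptive_field, receptive_field / 16000.0)   # 5116 samples ~ 0.32 s at 16 kHz
```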