Understanding Deep Learning
© 2023 Massachusetts Institute of Technology
This work is subject to a Creative Commons CC BY-NC-ND license.
Subject to such license, all rights are reserved.
The MIT Press would like to thank the anonymous peer reviewers who
provided comments on drafts of this book. The generous work of academic
experts is essential for establishing the authority and quality of our
publications. We acknowledge with gratitude the contributions of these
otherwise uncredited readers.
This book was set in Latin Modern Roman by B. Jackowski and J. M.
Nowacki (on behalf of TeX users groups).
This book is dedicated to Blair, Calvert, Coppola, Ellison, Faulkner,
Kerpatenko, Morris, Robinson, Sträussler, Wallace, Waymon, Wojnarowicz,
and all the others whose work is even more important and interesting than
deep learning.
Contents
Preface
Acknowledgments
Chapter 1: Introduction
Chapter 2: Supervised learning
Chapter 3: Shallow neural networks
Chapter 4: Deep neural networks
Chapter 5: Loss functions
Chapter 6: Fitting models
Chapter 7: Gradients and initialization
Chapter 8: Measuring performance
Chapter 9: Regularization
Chapter 10: Convolutional networks
Chapter 11: Residual networks
Chapter 12: Transformers
Chapter 13: Graph neural networks
Chapter 14: Unsupervised learning
Chapter 15: Generative Adversarial Networks
Chapter 16: Normalizing flows
Chapter 17: Variational autoencoders
Chapter 18: Diffusion models
Chapter 19: Reinforcement learning
Chapter 20: Why does deep learning work?
Chapter 21: Deep learning and ethics
Appendix A: Notation
Appendix B: Mathematics
Appendix C: Probability
Bibliography
Index
Preface
Acknowledgments
Writing this book would not have been possible without the generous help
and advice of these individuals: Kathryn Hume, Kevin Murphy, Christopher
Bishop, Peng Xu, Yann Dubois, Justin Domke, Chris Fletcher, Yanshuai
Cao, Wendy Tay, Corey Toler-Franklin, Dmytro Mishkin, Guy McCusker,
Daniel Worrall, Paul McIlroy, Roy Amoyal, Austin Anderson, Romero
Barata de Morais, Gabriel Harrison, Peter Ball, Alf Muir, David Bryson,
Vedika Parulkar, Patryk Lietzau, Jessica Nicholson, Alexa Huxley, Oisin
Mac Aodha, Giuseppe Castiglione, Josh Akylbekov, Alex Gougoulaki,
Joshua Omilabu, Alister Guenther, Joe Goodier, Logan Wade, Joshua
Guenther, Kylan Tobin, Benedict Ellett, Jad Araj, Andrew Glennerster,
Giorgos Sfikas, Diya Vibhakar, Sam Mansat-Bhattacharyya, Ben Ross, Ivor
Simpson, Gaurang Aggarwal, Shakeel Sheikh, Jacob Horton, Felix
Rammell, Sasha Luccioni, Akshil Patel, Alessandro Gentilini, Kevin
Mercier, Krzysztof Lichocki, Chuck Krapf, Brian Ha, Chris Kang,
Leonardo Viotti, Kai Li, Himan Abdollahpouri, Ari Pakman, Giuseppe
Antonio Di Luna, Dan Oneață, Conrad Whiteley, Joseph Santarcangelo,
Brad Shook, Gabriel Brostow, Lei He, Ali Satvaty, Romain Sabathé, Qiang
Zhou, Prasanna Vigneswaran, Siqi Zheng, Stephan Grein, Jonas Klesen,
Giovanni Stilo, Huang Bokai, Bernhard Pfahringer, Joseph Santarcangelo,
Kevin McGuinness, Qiang Sun, Zakaria Lotfi, Yifei Lin, Sylvain Bouix,
Alex Pitt, Stephane Chretien, Robin Liu, Bian Li, Adam Jones, Marcin
Świerkot, Tommy Löfstedt, Eugen Hotaj, Fernando Flores-Mangas, Tony
Polichroniadis, Pietro Monticone, Rohan Deepak Ajwani, Menashe Yarden
Einy, Robert Gevorgyan, Thilo Stadelmann, Gui JieMiao, Botao Zhu,
Mohamed Elabbas, Satya Krishna Gorti, James Elder, Helio Perroni Filho,
Xiaochao Qu, Jaekang Shin, Joshua Evans, Robert Dobson, Shibo Wang,
Edoardo Zorzi, Joseph Santarcangelo, Stanisław Jastrzębski, Pieris
Kalligeros, Matt Hewitt, Zvika Haramaty, Ted Mavroidis, Nikolaj Kuntner,
Amir Yorav, Masoud Mokhtari, Xavier Gabaix, Marco Garosi, Vincent
Schönbach, Avishek Mondal, Victor S.C. Lui, Sumit Bhatia, Julian Asilis,
Hengchao Chen, Siavash Khallaghi, Csaba Szepesvári, and Mike Singer.
I'm particularly grateful to Daniyar Turmukhambetov, Amedeo
Buonanno, Andrea Panizza, Mark Hudson, and Bernhard Pfahringer, who
provided detailed comments on multiple chapters of the book. I'd like to
especially thank Andrew Fitzgibbon, Konstantinos Derpanis, and Tyler
Mills, who read the whole book and whose enthusiasm helped me complete
this project. I'd also like to thank Neill Campbell and Özgür Şimşek, who
hosted me at the University of Bath, where I taught a course based on this
material for the first time. Finally, I'm extremely grateful to my editor
Elizabeth Swayze for her frank advice throughout this process.
Chapter 12 (transformers) and chapter 17 (variational autoencoders)
were first published as blogs for Borealis AI, and adapted versions are
reproduced with permission of Royal Bank of Canada along with Borealis
AI. I am grateful for their support in this endeavor. Chapter 16 (normalizing
flows) is loosely based on the review article by Kobyzev et al. (2020), on
which I was a co-author. I was very fortunate to be able to collaborate on
Chapter 21 with Travis LaCroix from Dalhousie University, who was both
easy and fun to work with, and who did the lion's share of the work.
Chapter 1
Introduction
The model in figure 1.2a predicts the price of a house based on input
characteristics such as the square footage and the number of bedrooms. This
is a regression problem because the model returns a continuous number
(rather than a category assignment). In contrast, the model in figure 1.2b takes the
chemical structure of a molecule as an input and predicts both the melting
and boiling points. This is a multivariate regression problem since it
predicts more than one number.
The model in figure 1.2c receives a text string containing a restaurant
review as input and predicts whether the review is positive or negative. This
is a binary classification problem because the model attempts to assign the
input to one of two categories. The output vector contains the probabilities
that the input belongs to each category. Figures 1.2d and 1.2e depict
multiclass classification problems. Here, the model assigns the input to one
of N > 2 categories. In the first case, the input is an audio file, and the
model predicts which genre of music it contains. In the second case, the
input is an image, and the model predicts which object it contains. In each
case, the model returns a vector of size N that contains the probabilities of
the N categories.
1.1.2 Inputs
The input data in figure 1.2 varies widely. In the house pricing example, the
input is a fixed-length vector containing values that characterize the
property. This is an example of tabular data because it has no internal
structure; if we change the order of the inputs and build a new model, then
we expect the model prediction to remain the same.
Conversely, the input in the restaurant review example is a body of text.
This may be of variable length depending on the number of words in the
review, and here input order is important; my wife ate the chicken is not the
same as the chicken ate my wife. The text must be encoded into numerical
form before passing it to the model. Here, we use a fixed vocabulary of size
10,000 and simply concatenate the word indices.
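As a concrete illustration, here is a minimal sketch of this encoding in Python. The tiny vocabulary and word-to-index mapping are invented for illustration; a real system would use the full 10,000-word vocabulary:

```python
# Minimal sketch of encoding text as word indices. The vocabulary here is
# invented; a real system would use a learned 10,000-word vocabulary.
vocab = {"<unk>": 0, "my": 1, "wife": 2, "ate": 3, "the": 4, "chicken": 5}

def encode(text):
    # Map each word to its index, falling back to <unk> for unknown words.
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]

print(encode("My wife ate the chicken"))  # [1, 2, 3, 4, 5]
print(encode("The chicken ate my wife"))  # [4, 5, 3, 1, 2] -- order matters
```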
For the music classification example, the input vector might be of fixed
size (perhaps a 10-second clip) but is very high-dimensional. Digital audio
is usually sampled at 44.1 kHz and represented by 16-bit integers, so a
ten-second clip consists of 441,000 integers. Clearly, supervised learning
models will have to be able to process sizeable inputs. The input in the
image classification example (which consists of the concatenated RGB
values at every pixel) is also enormous. Moreover, its structure is naturally
two-dimensional; two pixels above and below one another are closely
related, even if they are not adjacent in the input vector.
Finally, consider the input for the model that predicts the melting and
boiling points of the molecule. A molecule may contain varying numbers of
atoms that can be connected in different ways. In this case, the model must
ingest both the geometric structure of the molecule and the constituent
atoms.
Figures 1.4c–e depict three models where the output has a complex
structure that is not so closely tied to the input. Figure 1.4c shows a model
where the input is an audio file and the output is the transcribed words from
that file. Figure 1.4d is a translation model in which the input is a body of
text in English, and the output contains the French translation. Figure 1.4e
depicts a very challenging task in which the input is descriptive text, and
the model must produce an image that matches this description.
In principle, the latter three tasks can be tackled in the standard
supervised learning framework, but they are more difficult for two reasons.
First, the output may genuinely be ambiguous; there are multiple valid
translations from an English sentence to a French one and multiple images
that are compatible with any caption. Second, the output contains
considerable structure; not all strings of words make valid English and
French sentences, and not all collections of RGB values make plausible
images. In addition to learning the mapping, we also have to respect the
“grammar” of the output.
Fortunately, this “grammar” can be learned without the need for output
labels. For example, we can learn how to form valid English sentences by
learning the statistics of a large corpus of text data. This provides a
connection with the next section of the book, which considers unsupervised
learning models.
Figure 1.5 Generative models for images. Left: two images were generated from a
model trained on pictures of cats. These are not real cats, but samples from a
probability model. Right: two images generated from a model trained on images of
buildings. Adapted from Karras et al. (2020b).
Figure 1.6 Short story synthesized from a generative model of text data. The model
describes a probability distribution that assigns a probability to every output string.
Sampling from the model creates strings that follow the statistics of the training data
(here, short stories) but have never been seen before.
Figure 1.7 Inpainting. In the original image (left), the boy is obscured by metal
cables. These undesirable regions (center) are removed and the generative model
synthesizes a new image (right) under the constraint that the remaining pixels must
stay the same. Adapted from Saharia et al. (2022a).
Figure 1.8 Conditional text synthesis. Given an initial body of text (in black),
generative models of text can continue the string plausibly by synthesizing the
“missing” remaining part of the string. Generated by GPT3 (Brown et al., 2020).
This leads to the idea that we can describe each data example using a
smaller number of underlying latent variables. Here, the role of deep
learning is to describe the mapping between these latent variables and the
data. The latent variables typically have a simple probability distribution by
design. By sampling from this distribution and passing the result through
the deep learning model, we can create new samples (figure 1.10).
Figure 1.10 Latent variables. Many generative models use a deep learning model
to describe the relationship between a low-dimensional “latent” variable and the
observed high-dimensional data. The latent variables have a simple probability
distribution by design. Hence, new examples can be generated by sampling from the
simple distribution over the latent variables and then using the deep learning model
to map the sample to the observed data space.
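The following sketch illustrates this sampling procedure. The decoder here is a fixed random linear map standing in for a trained deep network, and the dimensions are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((784, 2))  # stand-in for a learned deep network

def decoder(z):
    # A real generative model would be a deep network; a linear map keeps
    # this sketch runnable while showing the latent -> data mapping.
    return W @ z

z = rng.standard_normal(2)  # sample from the simple latent distribution
x = decoder(z)              # map the sample to the observed data space
print(x.shape)              # (784,)
```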
These models lead to new methods for manipulating real data. For
example, consider finding the latent variables that underpin two real
examples. We can interpolate between these examples by interpolating
between their latent representations and mapping the intermediate positions
back into the data space (figure 1.11).
Figure 1.11 Image interpolation. In each row the left and right images are real and
the three images in between represent a sequence of interpolations created by a
generative model. The generative models that underpin these interpolations have
learned that all images can be created by a set of underlying latent variables. By
finding these variables for the two real images, interpolating their values, and then
using these intermediate variables to create new images, we can generate
intermediate results that are both visually plausible and mix the characteristics of the
two original images. Top row adapted from Sauer et al. (2022). Bottom row adapted
from Ramesh et al. (2022).
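A minimal sketch of this interpolation, assuming a decoder like the one in the previous sketch that maps latent vectors back to data space:

```python
import numpy as np

def interpolate(z1, z2, n_steps=5):
    # Linearly interpolate between two latent vectors; mapping each
    # intermediate z through the (learned) decoder yields the images.
    ts = np.linspace(0.0, 1.0, n_steps)
    return [(1 - t) * z1 + t * z2 for t in ts]

z1, z2 = np.array([0.0, 1.0]), np.array([1.0, -1.0])
for z in interpolate(z1, z2):
    print(z)  # pass each z through decoder(z) to get an intermediate image
```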
Figure 1.13 Policy networks for reinforcement learning. One way to incorporate
deep neural networks into reinforcement learning is to use them to define a mapping
from the state (here position on chessboard) to the actions (possible moves). This
mapping is known as a policy. Adapted from Pablok (2017).
1.4 Ethics
It would be irresponsible to write this book without discussing the ethical
implications of artificial intelligence. This potent technology will change
the world to at least the same extent as electricity, the internal combustion
engine, the transistor, or the internet. The potential benefits in healthcare,
design, entertainment, transport, education, and almost every area of
commerce are enormous. However, scientists and engineers are often
unrealistically optimistic about the outcomes of their work, and the
potential for harm is just as great. The following paragraphs highlight five
concerns.
Existential risk: The major existential risks to the human race all result
from technology. Climate change has been driven by industrialization.
Nuclear weapons derive from the study of physics. Pandemics are more
probable and spread faster because innovations in transport, agriculture, and
construction have allowed a larger, denser, and more interconnected
population. Artificial intelligence brings new existential risks. We should be
very cautious about building systems that are more capable and extensible
than human beings. In the most optimistic case, it will put vast power in the
hands of the owners. In the most pessimistic case, we will be unable to
control it or even understand its motives (see Tegmark, 2018).
Chapter 2
Supervised learning
When we compute the prediction y from the input x, we call this inference.
The model is just a mathematical equation with a fixed form. It
represents a family of different relations between the input and the output.
The model also contains parameters ϕ. The choice of parameters
determines the particular relation between input and output, so we should
really write:
y = f[x, ϕ]. (2.2)
To train the model, we search for the parameters that minimize the loss over the training set:
ϕ̂ = argmin_ϕ [L[ϕ]]. (2.3)
If the loss is small after this minimization, we have found model parameters
that accurately predict the training outputs yi from the training inputs xi.
After training a model, we must now assess its performance; we run the
model on separate test data to see how well it generalizes to examples that
it didn't observe during training. If the performance is adequate, then we are
ready to deploy the model.
y = f[x, ϕ] = ϕ0 + ϕ1x. (2.4)
This model has two parameters ϕ = [ϕ0, ϕ1]T, where ϕ0 is the y-intercept of
the line and ϕ1 is the slope. Different choices for the y-intercept and slope
result in different relations between input and output (figure 2.1). Hence,
equation 2.4 defines a family of possible input-output relations (all possible
lines), and the choice of parameters determines the member of this family
(the particular line).
Figure 2.1 Linear regression model. For a given choice of parameters ϕ = [ϕ0, ϕ1]T,
the model makes a prediction for the output (y-axis) based on the input (x-axis).
Different choices for the y-intercept ϕ0 and the slope ϕ1 change these predictions
(cyan, orange, and gray lines). The linear regression model (equation 2.4) defines a
family of input/output relations (lines) and the parameters determine the member of
the family (the particular line).
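In code, this model is a one-line function (a minimal sketch; the example input and parameter values are arbitrary):

```python
def f(x, phi):
    # Equation 2.4: a line with y-intercept phi[0] and slope phi[1].
    return phi[0] + phi[1] * x

print(f(2.0, [1.0, 0.5]))  # 1.0 + 0.5 * 2.0 = 2.0
```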
2.2.2 Loss
For this model, the training dataset (figure 2.2a) consists of I input/output
pairs {xi, yi}. Figures 2.2b–d show three lines defined by three sets of
parameters. The green line in figure 2.2d describes the data more accurately
than the other two since it is much closer to the data points. However, we
need a principled approach for deciding which parameters ϕ are better than
others. To this end, we assign a numerical value to each choice of
parameters that quantifies the degree of mismatch between the model and
the data. We term this value the loss; a lower loss means a better fit.
Figure 2.2 Linear regression training data, model, and loss. a) The training data
(orange points) consist of I = 12 input/output pairs {xi, yi}. b–d) Each panel shows
the linear regression model with different parameters. Depending on the choice of
y-intercept and slope parameters ϕ = [ϕ0, ϕ1]T, the model errors (orange dashed lines)
may be larger or smaller. The loss L is the sum of the squares of these errors. The
parameters that define the lines in panels (b) and (c) have large losses L = 7.07 and
L = 10.28, respectively, because the models fit badly. The loss L = 0.20 in panel (d) is
smaller because the model fits well; in fact, this has the smallest loss of all possible
lines, so these are the optimal parameters.
The mismatch is captured by the deviation between the model
predictions f[xi, ϕ] (height of the line at xi) and the ground truth outputs yi.
These deviations are depicted as orange dashed lines in figures 2.2b–d. We
quantify the total mismatch, training error, or loss as the sum of the squares
of these deviations for all I training pairs:
L[ϕ] = Σi (f[xi, ϕ] − yi)^2 = Σi (ϕ0 + ϕ1xi − yi)^2. (2.5)
Since the best parameters minimize this expression, we call this a least-squares
loss. The squaring operation means that the direction of the
deviation (i.e., whether the line is above or below the data) is unimportant.
There are also theoretical reasons for this choice which we return to in
chapter 5.
The loss L is a function of the parameters ϕ; it will be larger when the
model fit is poor (figure 2.2b,c) and smaller when it is good (figure 2.2d).
Considered in this light, we term L[ϕ] the loss function or cost function. The
goal is to find the parameters that minimize this quantity:
ϕ̂ = argmin_ϕ [L[ϕ]] = argmin_ϕ [Σi (ϕ0 + ϕ1xi − yi)^2]. (2.6)
Notebook 2.1
Supervised learning
There are only two parameters (the y-intercept ϕ0 and slope ϕ1), so we can
calculate the loss for every combination of values and visualize the loss
function as a surface (figure 2.3). The “best” parameters are at the minimum
of this surface.
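A minimal sketch of this grid evaluation, using invented toy data in place of the dataset from figure 2.2a:

```python
import numpy as np

def loss(phi, x, y):
    # Least-squares loss (equation 2.5): sum of squared deviations between
    # the model predictions and the training outputs.
    pred = phi[0] + phi[1] * x
    return np.sum((pred - y) ** 2)

# Invented toy data standing in for the I = 12 pairs in figure 2.2a.
x = np.array([0.1, 0.4, 0.7, 1.0])
y = np.array([0.3, 0.5, 0.6, 0.9])

# Evaluate the loss on a grid of (intercept, slope) values; the surface in
# figure 2.3 is exactly this, computed densely.
intercepts = np.linspace(-1, 1, 5)
slopes = np.linspace(-1, 2, 7)
surface = np.array([[loss([p0, p1], x, y) for p1 in slopes] for p0 in intercepts])
print(surface.shape)  # (5, 7)
print(surface.min())  # smallest loss on this coarse grid
```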
Problems 2.1–2.2
Figure 2.3 Loss function for linear regression model with the dataset in figure 2.2a.
a) Each combination of parameters ϕ = [ϕ0, ϕ1] has an associated loss. The resulting
loss function L[ϕ] can be visualized as a surface. The three circles represent the three
lines from figure 2.2b–d. b) The loss can also be visualized as a heatmap, where
brighter regions represent larger losses; here we are looking straight down at the
surface in (a) from above and gray ellipses represent isocontours. The best fitting line
(figure 2.2d) has the parameters with the smallest loss (green circle).
2.2.3 Training
The process of finding parameters that minimize the loss is termed model
fitting, training, or learning. The basic method is to choose the initial
parameters randomly and then improve them by “walking down” the loss
function until we reach the bottom (figure 2.4). One way to do this is to
measure the gradient of the surface at the current position and take a step in
the direction that is most steeply downhill. Then we repeat this process until
the gradient is flat and we can improve no further.2
Figure 2.4 Linear regression training. The goal is to find the y-intercept and slope
parameters that correspond to the smallest loss. a) Iterative training algorithms
initialize the parameters randomly and then improve them by “walking downhill”
until no further improvement can be made. Here, we start at position 0 and move a
certain distance downhill (perpendicular to the contours) to position 1. Then we
recalculate the downhill direction and move to position 2. Eventually, we reach the
minimum of the function (position 4). b) Each position 0–4 from panel (a)
corresponds to a different y-intercept and slope and so represents a different line. As
the loss decreases, the lines fit the data more closely.
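A minimal sketch of this procedure for the linear regression loss, using the analytic gradients from problem 2.1 and invented toy data; the step size and iteration count are arbitrary choices:

```python
import numpy as np

def grad(phi, x, y):
    # Analytic gradients of the least-squares loss (see problem 2.1).
    err = phi[0] + phi[1] * x - y
    return np.array([2 * np.sum(err), 2 * np.sum(err * x)])

x = np.array([0.1, 0.4, 0.7, 1.0])  # invented toy data
y = np.array([0.3, 0.5, 0.6, 0.9])

phi = np.array([0.0, 0.0])  # initialization (here simply zeros)
alpha = 0.1                 # step size, chosen by hand
for _ in range(100):
    phi = phi - alpha * grad(phi, x, y)  # step in the steepest-descent direction
print(phi)  # converges toward the best-fitting line
```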
2.2.4 Testing
Having trained the model, we want to know how it will perform in the real
world. We do this by computing the loss on a separate set of test data. The
degree to which the prediction accuracy generalizes to the test data depends
in part on how representative and complete the training data is. However, it
also depends on how expressive the model is. A simple model like a line
might not be able to capture the true relationship between input and output.
This is known as underfitting. Conversely, a very expressive model may
describe statistical peculiarities of the training data that are atypical and
lead to unusual predictions. This is known as overfitting.
2.3 Summary
A supervised learning model is a function y = f[x, ϕ] that relates inputs x to
outputs y. The particular relationship is determined by parameters ϕ. To
train the model, we define a loss function L[ϕ] over a training dataset {xi,
yi}. This quantifies the mismatch between the model predictions f[xi, ϕ] and
observed outputs yi as a function of the parameters ϕ. Then we search for
the parameters that minimize the loss. We evaluate the model on a different
set of test data to see how well it generalizes to new inputs.
Chapters 3–9 expand on these ideas. First, we tackle the model itself;
linear regression has the obvious drawback that it can only describe the
relationship between the input and output as a straight line. Shallow neural
networks (chapter 3) are only slightly more complex than linear regression
but describe a much larger family of input/output relationships. Deep neural
networks (chapter 4) are just as expressive but can describe complex
functions with fewer parameters and work better in practice.
Chapter 5 investigates loss functions for different tasks and reveals the
theoretical underpinnings of the least-squares loss. Chapters 6 and 7 discuss
the training process. Chapter 8 discusses how to measure model
performance. Chapter 9 considers regularization techniques, which aim to
improve that performance.
Notes
Loss functions vs. cost functions: In much of machine learning and in
this book, the terms loss function and cost function are used
interchangeably. However, more properly, a loss function is the individual
term associated with a data point (i.e., each of the squared terms on the
right-hand side of equation 2.5), and the cost function is the overall quantity
that is minimized (i.e., the entire right-hand side of equation 2.5). A cost
function can contain additional terms that are not associated with individual
data points (see section 9.1). More generally, an objective function is any
function that is to be maximized or minimized.
Generative vs. discriminative models: The models y = f[x, ϕ] in this
chapter are discriminative models. These make an output prediction y from
real-world measurements x. Another approach is to build a generative
model x = g[y, ϕ], in which the real-world measurements x are computed as
a function of the output y.
Problem 2.3
The generative approach has the disadvantage that it doesn't directly predict
y. To perform inference, we must invert the generative equation as y =
g−1[x, ϕ], and this may be difficult. However, generative models have the
advantage that we can build in prior knowledge about how the data were
created. For example, if we wanted to predict the 3D position and
orientation y of a car in an image x, then we could build knowledge about
car shape, 3D geometry, and light transport into the function x = g[y, ϕ].
This seems like a good idea, but in fact, discriminative models dominate
modern machine learning; the advantage gained from exploiting prior
knowledge in generative models is usually trumped by learning very
flexible discriminative models with large amounts of training data.
Problems
Problem 2.1 To walk “downhill” on the loss function (equation 2.5), we
measure its gradient with respect to the parameters ϕ0 and ϕ1. Calculate
expressions for the slopes ∂L/∂ϕ0 and ∂L/∂ϕ1.
Problem 2.2 Show that we can find the minimum of the loss function in
closed form by setting the expression for the derivatives from problem 2.1
to zero and solving for ϕ0 and ϕ1. Note that this works for linear regression
but not for more complex models; this is why we use iterative model fitting
methods like gradient descent (figure 2.4).
Problem 2.3* Consider reformulating linear regression as a generative
model, so we have x = g[y, ϕ] = ϕ0 + ϕ1y. What is the new loss function?
Find an expression for the inverse function y = g−1[x, ϕ] that we would use
to perform inference. Will this model make the same predictions as the
discriminative version for a given training dataset {xi, yi}? One way to
establish this is to write code that fits a line to three data points using both
methods and see if the result is the same.
1 More properly, the loss function also depends on the training data {xi, yi}, so we should write
L[{xi, yi}, ϕ], but this is rather cumbersome.
2 This iterative approach is not actually necessary for the linear regression model. Here, it's
possible to find closed-form expressions for the parameters. However, this gradient descent approach
works for more complex models where there is no closed-form solution and where there are too
many parameters to evaluate the loss for every combination of values.
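A minimal sketch of that closed-form least-squares solution, obtained by setting the derivatives from problem 2.1 to zero (the toy data are invented):

```python
import numpy as np

def fit_closed_form(x, y):
    # Setting the derivatives of the least-squares loss to zero and solving
    # gives the usual closed-form slope and intercept.
    x_bar, y_bar = x.mean(), y.mean()
    phi1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    phi0 = y_bar - phi1 * x_bar
    return np.array([phi0, phi1])

x = np.array([0.1, 0.4, 0.7, 1.0])
y = np.array([0.3, 0.5, 0.6, 0.9])
print(fit_closed_form(x, y))  # matches the minimum found by gradient descent
```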
Chapter 3
Shallow neural networks
y = f[x, ϕ] = ϕ0 + ϕ1 a[θ10 + θ11x] + ϕ2 a[θ20 + θ21x] + ϕ3 a[θ30 + θ31x]. (3.1)
We can break down this calculation into three parts: first, we compute three
linear functions of the input data (θ10 + θ11x, θ20 + θ21x, and θ30 + θ31x).
Second, we pass the three results through an activation function a[•].
Finally, we weight the three resulting activations with ϕ1, ϕ2, and ϕ3, sum
them, and add an offset ϕ0.
To complete the description, we must define the activation function a[•].
There are many possibilities, but the most common choice is the rectified
linear unit or ReLU:
a[z] = ReLU[z] = { 0 if z < 0;  z if z ≥ 0 }. (3.2)
This returns the input when it is positive and zero otherwise (figure 3.1).
Figure 3.1 Rectified linear unit (ReLU). This activation function returns zero if the
input is less than zero and returns the input unchanged otherwise. In other words, it
clips negative values to zero. Note that there are many other possible choices for the
activation function (see figure 3.13), but the ReLU is the most commonly used and
the easiest to understand.
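A minimal sketch of equations 3.1 and 3.2 in code, using the parameter values quoted in the problems section at the end of this chapter; the test input is arbitrary:

```python
import numpy as np

def relu(z):
    # Equation 3.2: clip negative values to zero.
    return np.maximum(0.0, z)

def shallow(x, phi, theta):
    # Equation 3.1: three linear functions, ReLU activations, then a
    # weighted sum plus an offset.
    h = relu(theta[:, 0] + theta[:, 1] * x)  # hidden units h1, h2, h3
    return phi[0] + np.dot(phi[1:], h)

# Parameter values quoted later in this chapter's problems.
phi = np.array([-0.23, -1.3, 1.3, 0.66])
theta = np.array([[-0.2, 0.4], [-0.9, 0.9], [1.1, -0.7]])
print(shallow(0.5, phi, theta))  # 0.265
```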
h1 = a[θ10 + θ11x]
h2 = a[θ20 + θ21x]
h3 = a[θ30 + θ31x], (3.3)
Figure 3.2 Family of functions defined by equation 3.1. a–c) Functions for three
different choices of the ten parameters ϕ. In each case, the input/output relation is
piecewise linear. However, the positions of the joints, the slopes of the linear regions
between them, and the overall height vary.
where we refer to h1, h2, and h3 as hidden units. Second, we compute the
output by combining these hidden units with a linear function:1
y = ϕ0 + ϕ1h1 + ϕ2h2 + ϕ3h3. (3.4)
Figure 3.3 shows the flow of computation that creates the function in
figure 3.2a. Each hidden unit contains a linear function θ•0 + θ•1x of the
input, and that line is clipped by the ReLU function a[•] below zero. The
positions where the three lines cross zero become the three “joints” in the
final output. The three clipped lines are then weighted by ϕ1, ϕ2, and ϕ3,
respectively. Finally, the offset ϕ0 is added, which controls the overall
height of the final function.
Problems 3.1–3.8
Figure 3.3 Computation for function in figure 3.2a. a–c) The input x is passed
through three linear functions, each with a different y-intercept θ•0 and slope θ•1. d–f)
Each line is passed through the ReLU activation function, which clips negative
values to zero. g–i) The three clipped lines are then weighted (scaled) by ϕ1, ϕ2, and
ϕ3, respectively. j) Finally, the clipped and weighted functions are summed, and an
offset ϕ0 that controls the height is added. Each of the four linear regions corresponds
to a different activation pattern in the hidden units. In the shaded region, h2 is
inactive (clipped), but h1 and h3 are both active.
Problem 3.9
h1 = a[θ10 + θ11x], h2 = a[θ20 + θ21x], h3 = a[θ30 + θ31x], h4 = a[θ40 + θ41x], (3.6)
y1 = ϕ10 + ϕ11h1 + ϕ12h2 + ϕ13h3 + ϕ14h4, (3.7)
y2 = ϕ20 + ϕ21h1 + ϕ22h2 + ϕ23h3 + ϕ24h4. (3.8)
The two outputs are two different linear functions of the hidden units.
As we saw in figure 3.3, the “joints” in the piecewise functions depend
on where the initial linear functions θ•0 + θ•1x are clipped by the ReLU
functions a[•] at the hidden units. Since both outputs y1 and y2 are different
linear functions of the same four hidden units, the four “joints” in each must
be in the same places. However, the slopes of the linear regions and the
overall vertical offset can differ (figure 3.6).
Problem 3.11
Figure 3.6 Network with one input, four hidden units, and two outputs. a)
Visualization of network structure. b) This network produces two piecewise linear
functions, y1[x] and y2[x]. The four “joints” of these functions (at vertical dotted
lines) are constrained to be in the same places since they share the same hidden units,
but the slopes and overall height may differ.
h1 = a[θ10 + θ11x1 + θ12x2]
h2 = a[θ20 + θ21x1 + θ22x2]
h3 = a[θ30 + θ31x1 + θ32x2], (3.9)
Figure 3.7 Visualization of neural network with 2D multivariate input x = [x1, x2]T
and scalar output y.
where there is now one slope parameter for each input. The hidden units are
combined to form the output in the usual way:
y = ϕ0 + ϕ1h1 + ϕ2h2 + ϕ3h3. (3.10)
Figure 3.8 illustrates the processing of this network. Each hidden unit
receives a linear combination of the two inputs, which forms an oriented
plane in the 3D input/output space. The activation function clips the
negative values of these planes to zero. The clipped planes are then
recombined in a second linear function (equation 3.10) to create a
continuous piecewise linear surface consisting of convex polygonal regions
(figure 3.8j). Each region corresponds to a different activation pattern. For
example, in the central triangular region, the first and third hidden units are
active, and the second is inactive.
Problems 3.12–3.13
Notebook 3.2
Shallow networks II
Appendix B.1.2
Convex region
Figure 3.8 Processing in network with two inputs x = [x1, x2]T, three hidden units
h1, h2, h3, and one output y. a–c) The input to each hidden unit is a linear function of
the two inputs, which corresponds to an oriented plane. Brightness indicates function
output. For example, in panel (a), the brightness represents θ10 + θ11x1 + θ12x2. Thin
lines are contours. d–f) Each plane is clipped by the ReLU activation function (cyan
lines are equivalent to “joints” in figures 3.3d–f). g–i) The clipped planes are then
weighted, and j) summed together with an offset that determines the overall height of
the surface. The result is a continuous surface made up of convex piecewise linear
polygonal regions.
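One way to make this concrete is to sample the input plane densely and count the distinct activation patterns; each pattern corresponds to one linear region. A minimal sketch with randomly chosen (invented) parameters:

```python
import numpy as np

rng = np.random.default_rng(1)
theta = rng.standard_normal((3, 3))  # rows hold [theta_d0, theta_d1, theta_d2]

# Sample the input square on a dense grid.
xs = np.linspace(-2, 2, 200)
x1, x2 = np.meshgrid(xs, xs)

# Pre-activations of all three hidden units at every grid point.
pre = (theta[:, 0, None, None]
       + theta[:, 1, None, None] * x1
       + theta[:, 2, None, None] * x2)

# Each distinct on/off pattern of the hidden units is one linear region.
patterns = {tuple(p) for p in (pre > 0).reshape(3, -1).T}
print(len(patterns))  # at most 1 + 3 + 3 = 7 regions for three lines in 2D
```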
When there are more than two inputs to the model, it becomes difficult
to visualize. However, the interpretation is similar. The output will be a
continuous piecewise linear function of the input, where the linear regions
are now convex polytopes in the multi-dimensional input space.
Note that as the input dimensions grow, the number of linear regions
increases rapidly (figure 3.9). To get a feeling for how rapidly, consider that
each hidden unit defines a hyperplane that delineates the part of space
where this unit is active from the part where it is not (cyan lines in 3.8d–f).
If we had the same number of hidden units as input dimensions Di, we
could align each hyperplane with one of the coordinate axes (figure 3.10).
For two input dimensions, this would divide the space into four quadrants.
For three dimensions, this would create eight octants, and for Di
dimensions, this would create 2^Di orthants. Shallow neural networks usually
have more hidden units than input dimensions, so they typically create more
than 2^Di linear regions.
Notebook 3.3
Shallow network regions
Figure 3.9 Linear regions vs. hidden units. a) Maximum possible regions as a
function of the number of hidden units for five different input dimensions Di = {1, 5,
10, 50, 100}. The number of regions increases rapidly in high dimensions; with D =
500 units and input size Di = 100, there can be greater than 10107 regions (solid
circle). b) The same data are plotted as a function of the number of parameters. The
solid circle represents the same model as in panel (a) with D = 500 hidden units. This
network has 51,001 parameters and would be considered very small by modern
standards.
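The counts in this figure follow a standard hyperplane-arrangement argument: D hyperplanes in Di dimensions create at most Σj=0..Di (D choose j) regions. A minimal sketch of this count (the formula is the standard result, not quoted from this excerpt):

```python
from math import comb

def max_regions_shallow(D, Di):
    # Maximum number of linear regions for a shallow network with D hidden
    # units and Di-dimensional input: sum_{j=0}^{Di} C(D, j).
    return sum(comb(D, j) for j in range(Di + 1))

print(max_regions_shallow(3, 1))      # 4: three joints on a line
print(max_regions_shallow(3, 2))      # 7: three lines in 2D
print(max_regions_shallow(500, 100))  # astronomically large (cf. figure 3.9)
```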
Figure 3.10 Number of linear regions vs. input dimensions. a) With a single input
dimension, a model with one hidden unit creates one joint, which divides the axis
into two linear regions. b) With two input dimensions, a model with two hidden units
can divide the input space using two lines (here aligned with axes) to create four
regions. c) With three input dimensions, a model with three hidden units can divide
the input space using three planes (again aligned with axes) to create eight regions.
Continuing this argument, it follows that a model with Di input dimensions and Di
hidden units can divide the input space with Di hyperplanes to create 2^Di linear
regions.
hd = a[θd0 + Σi=1..Di θdi xi] for d = 1, …, D, (3.11)
yj = ϕj0 + Σd=1..D ϕjd hd for j = 1, …, Do. (3.12)
Figure 3.11 Visualization of neural network with three inputs and two outputs. This
network has twenty parameters. There are fifteen slopes (indicated by arrows) and
five offsets (not shown).
3.5 Terminology
We conclude this chapter by introducing some terminology. Regrettably,
neural networks have a lot of associated jargon. They are often referred to
in terms of layers. The left of figure 3.12 is the input layer, the center is the
hidden layer, and to the right is the output layer. We would say that the
network in figure 3.12 has one hidden layer containing four hidden units.
The hidden units themselves are sometimes referred to as neurons. When
we pass data through the network, the values of the inputs to the hidden
layer (i.e., before the ReLU functions are applied) are termed pre-activations.
The values at the hidden layer (i.e., after the ReLU functions)
are termed activations.
Figure 3.12 Terminology. A shallow network consists of an input layer, a hidden
layer, and an output layer. Each layer is connected to the next by forward connections
(arrows). For this reason, these models are referred to as feed-forward networks.
When every variable in one layer connects to every variable in the next, we call this a
fully connected network. Each connection represents a slope parameter in the
underlying equation, and these parameters are termed weights. The variables in the
hidden layer are termed neurons or hidden units. The values feeding into the hidden
units are termed pre-activations, and the values at the hidden units (i.e., after the
ReLU function is applied) are termed activations.
For historical reasons, any neural network with at least one hidden layer
is also called a multi-layer perceptron, or MLP for short. Networks with one
hidden layer (as described in this chapter) are sometimes referred to as
shallow neural networks. Networks with multiple hidden layers (as
described in the next chapter) are referred to as deep neural networks.
Neural networks in which the connections form an acyclic graph (i.e., a
graph with no loops, as in all the examples in this chapter) are referred to as
feed-forward networks. If every element in one layer connects to every
element in the next (as in all the examples in this chapter), the network is
fully connected. These connections represent slope parameters in the
underlying equations and are referred to as network weights. The offset
parameters (not shown in figure 3.12) are called biases.
3.6 Summary
Shallow neural networks have one hidden layer. They (i) compute several
linear functions of the input, (ii) pass each result through an activation
function, and then (iii) take a linear combination of these activations to
form the outputs. Shallow neural networks make predictions y based on
inputs x by dividing the input space into a continuous surface of piecewise
linear regions. With enough hidden units (neurons), shallow neural
networks can approximate any continuous function to arbitrary precision.
Chapter 4 discusses deep neural networks, which extend the models
from this chapter by adding more hidden layers. Chapters 5–7 describe how
to train these models.
Notes
“Neural” networks: If the models in this chapter are just functions, why
are they called “neural networks”? The connection is, unfortunately,
tenuous. Visualizations like figure 3.12 consist of nodes (inputs, hidden
units, and outputs) that are densely connected to one another. This bears a
superficial similarity to neurons in the mammalian brain, which also have
dense connections. However, there is scant evidence that brain computation
works in the same way as neural networks, and it is unhelpful to think about
biology going forward.
History of neural networks: McCulloch & Pitts (1943) first came up
with the notion of an artificial neuron that combined inputs to produce an
output, but this model did not have a practical learning algorithm.
Rosenblatt (1958) developed the perceptron, which linearly combined
inputs and then thresholded them to make a yes/no decision. He also
provided an algorithm to learn the weights from data. Minsky & Papert
(1969) argued that the linear function was inadequate for general
classification problems but that adding hidden layers with nonlinear
activation functions (hence the term multi-layer perceptron) could allow the
learning of more general input/output relations. However, they concluded
that Rosenblatt's algorithm could not learn the parameters of such models. It
was not until the 1980s that a practical algorithm (backpropagation, see
chapter 7) was developed, and significant work on neural networks
resumed. The history of neural networks is chronicled by Kurenkov (2020),
Sejnowski (2018), and Schmidhuber (2022).
Activation functions: The ReLU function has been used as far back as
Fukushima (1969). However, in the early days of neural networks, it was
more common to use the logistic sigmoid or tanh activation functions
(figure 3.13a). The ReLU was re-popularized by Jarrett et al. (2009), Nair &
Hinton (2010), and Glorot et al. (2011) and is an important part of the
success story of modern neural networks. It has the nice property that the
derivative of the output with respect to the input is always one for inputs
greater than zero. This contributes to the stability and efficiency of training
(see chapter 7) and contrasts with the derivatives of sigmoid activation
functions, which saturate (become close to zero) for large positive and large
negative inputs.
Figure 3.13 Activation functions. a) Logistic sigmoid and tanh functions. b) Leaky
ReLU and parametric ReLU with parameter 0.25. c) SoftPlus, Gaussian error linear
unit, and sigmoid linear unit. d) Exponential linear unit with parameters 0.5 and 1.0,
e) Scaled exponential linear unit. f) Swish with parameters 0.4, 1.0, and 1.4.
However, the ReLU function has the disadvantage that its derivative is zero
for negative inputs. If all the training examples produce negative inputs to a
given ReLU function, then we cannot improve the parameters feeding into
this ReLU during training. The gradient with respect to the incoming
weights is locally flat, so we cannot “walk downhill.” This is known as the
dying ReLU problem. Many variations on the ReLU have been proposed to
resolve this problem (figure 3.13b), including (i) the leaky ReLU (Maas et
al., 2013), which also has a linear output for negative values with a smaller
slope of 0.1, (ii) the parametric ReLU (He et al., 2015), which treats the
slope of the negative portion as an unknown parameter, and (iii) the
concatenated ReLU (Shang et al., 2016), which produces two outputs, one
of which clips below zero (i.e., like a typical ReLU) and one of which clips
above zero.
A variety of smooth functions have also been investigated (figure 3.13c–d),
including the softplus function (Glorot et al., 2011), Gaussian error linear
unit (Hendrycks & Gimpel, 2016), sigmoid linear unit (Hendrycks &
Gimpel, 2016), and exponential linear unit (Clevert et al., 2015). Most of
these are attempts to avoid the dying neuron problem while limiting the
gradient for negative values. Klambauer et al. (2017) introduced the scaled
exponential linear unit (figure 3.13e), which is particularly interesting as it
helps stabilize the variance of the activations when the input variance has a
limited range (see section 7.5). Ramachandran et al. (2017) adopted an
empirical approach to choosing an activation function. They searched the
space of possible functions to find the one that performed best over a
variety of supervised learning tasks. The optimal function was found to be
a[x] = x/(1 + exp[−βx]), where β is a learned parameter (figure 3.13f). They
termed this function Swish. Interestingly, this was a rediscovery of
activation functions previously proposed by Hendrycks & Gimpel (2016)
and Elfwing et al. (2018). Howard et al. (2019) approximated Swish by the
HardSwish function, which has a very similar shape but is faster to
compute:
HardSwish[z] = { 0 if z < −3;  z(z + 3)/6 if −3 ≤ z ≤ 3;  z if z > 3 }. (3.13)
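A minimal sketch of a few of these activation functions; the parameter values follow those quoted in the text, and everything else is an arbitrary choice for illustration:

```python
import numpy as np

# Sketches of activation functions from figure 3.13.
def relu(z):        return np.maximum(0.0, z)
def leaky_relu(z):  return np.where(z > 0, z, 0.1 * z)    # negative slope 0.1
def softplus(z):    return np.log(1.0 + np.exp(z))
def swish(z, beta): return z / (1.0 + np.exp(-beta * z))  # a[x] = x/(1+exp[-bx])

def hard_swish(z):
    # Piecewise approximation to Swish (equation 3.13).
    return np.where(z < -3, 0.0, np.where(z > 3, z, z * (z + 3) / 6))

z = np.linspace(-4, 4, 9)
print(np.allclose(hard_swish(z)[z > 3], z[z > 3]))  # True: identity for z > 3
```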
Problem 3.18
Figure 3.14 Processing in network with one input, three hidden units, and one
output for problem 3.4. a–c) The input to each hidden unit is a linear function of the
inputs. The first two are the same as in figure 3.3, but the last one differs.
Problem 3.5 Prove that the following property holds for α ∈ ℝ+:
ReLU[α · z] = α · ReLU[z]. (3.14)
(3.15)
Redraw a version of figure 3.3 for each of these functions. The original
parameters were: ϕ = {ϕ0, ϕ1, ϕ2, ϕ3, θ10, θ11, θ20, θ21, θ30, θ31} = {−0.23,
−1.3, 1.3, 0.66, −0.2, 0.4, −0.9, 0.9, 1.1, −0.7}. Provide an informal
description of the family of functions that can be created by neural
networks with one input, three hidden units, and one output for each
activation function.
Problem 3.9* Show that the third linear region in figure 3.3 has a slope that
is the sum of the slopes of the first and fourth linear regions.
Problem 3.10 Consider a neural network with one input, one output, and
three hidden units. The construction in figure 3.3 shows how this creates
four linear regions. Under what circumstances could this network produce a
function with fewer than four linear regions?
Problem 3.11* How many parameters does the model in figure 3.6 have?
Problem 3.12 How many parameters does the model in figure 3.7 have?
Problem 3.13 What is the activation pattern for each of the seven regions in
figure 3.8? In other words, which hidden units are active (pass the input)
and which are inactive (clip the input) for each region?
Problem 3.14 Write out the equations that define the network in figure
3.11. There should be three equations to compute the three hidden units
from the inputs and two equations to compute the outputs from the hidden
units.
Problem 3.15* What is the maximum possible number of 3D linear regions
that can be created by the network in figure 3.11?
Problem 3.16 Write out the equations for a network with two inputs, four
hidden units, and three outputs. Draw this model in the style of figure 3.11.
Problem 3.17* Equations 3.11 and 3.12 define a general neural network
with Di inputs, one hidden layer containing D hidden units, and Do outputs.
Find an expression for the number of parameters in the model in terms of
Di, D, and Do.
1 For the purposes of this book, a linear function has the form z′ = ϕ0 + Σi ϕizi. Any other type of
function is nonlinear. For instance, the ReLU function (equation 3.2) and the example neural network
that contains it (equation 3.1) are both nonlinear. See notes at end of chapter for further clarification.
Chapter 4
Deep neural networks
The last chapter described shallow neural networks, which have a single
hidden layer. This chapter introduces deep neural networks, which have
more than one hidden layer. With ReLU activation functions, both shallow
and deep networks describe piecewise linear mappings from input to output.
As the number of hidden units increases, shallow neural networks
improve their descriptive power. Indeed, with enough hidden units, shallow
networks can describe arbitrarily complex functions in high dimensions.
However, it turns out that for some functions, the required number of
hidden units is impractically large. Deep networks can produce many more
linear regions than shallow networks for a given number of parameters.
Hence, from a practical standpoint, they can be used to describe a broader
family of functions.
h1 = a[θ10 + θ11x]
h2 = a[θ20 + θ21x]
h3 = a[θ30 + θ31x], (4.1)
Figure 4.1 Composing two single-layer networks with three hidden units each. a)
The output y of the first network constitutes the input to the second network. b) The
first network maps inputs x ∈ [−1, 1] to outputs y ∈ [−1, 1] using a function
comprised of three linear regions that are chosen so that they alternate the sign of
their slope. Multiple inputs x (gray circles) now map to the same output y (cyan
circle). c) The second network defines a function comprising three linear regions that
takes y and returns y′ (i.e., the cyan circle is mapped to the brown circle). d) The
combined effect of these two functions when composed is that (i) three different
inputs x are mapped to any given value of y by the first network and (ii) are
processed in the same way by the second network; the result is that the function
defined by the second network in panel (c) is duplicated three times, variously
flipped and rescaled according to the slope of the regions of panel (b).
and
y = ϕ0 + ϕ1h1 + ϕ2h2 + ϕ3h3. (4.2)
The second network takes y as input and returns y′ and is defined by:
h′1 = a[θ′10 + θ′11y]
h′2 = a[θ′20 + θ′21y]
h′3 = a[θ′30 + θ′31y], (4.3)
and
y′ = ϕ′0 + ϕ′1h′1 + ϕ′2h′2 + ϕ′3h′3. (4.4)
Notebook 4.1
Composing networks
Figure 4.2 Composing neural networks with a 2D input. a) The first network (from
figure 3.8) has three hidden units and takes two inputs x1 and x2 and returns a scalar
output y. This is passed into a second network with two hidden units to produce y′. b)
The first network produces a function consisting of seven linear regions, one of
which is flat. c) The second network defines a function comprising two linear regions
in y ∈ [−1, 1]. d) When these networks are composed, each of the six non-flat
regions from the first network is divided into two new regions by the second network
to create a total of 13 linear regions.
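A minimal sketch of this composition; the parameters of the two networks are invented (the values used to make figure 4.1 are not given in this excerpt), so the output is purely illustrative:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def shallow(x, phi, theta):
    # One-input/one-output shallow network (equations 4.1-4.2 and 4.3-4.4).
    h = relu(theta[:, 0] + theta[:, 1] * x)
    return phi[0] + np.dot(phi[1:], h)

# Invented parameters for the two networks.
phi1 = np.array([0.0, 1.0, -2.0, 2.0])
theta1 = np.array([[0.5, 1.0], [0.0, 1.0], [-0.5, 1.0]])
phi2 = np.array([-0.5, 1.0, -1.0, 0.5])
theta2 = np.array([[0.2, 1.0], [-0.1, 1.0], [0.4, -1.0]])

y_prime = shallow(shallow(0.3, phi1, theta1), phi2, theta2)  # x -> y -> y'
print(y_prime)  # -0.1
```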
Substituting the output of the first network (equation 4.2) into the hidden units of the second network (equation 4.3) gives:
h′1 = a[θ′10 + θ′11ϕ0 + θ′11ϕ1h1 + θ′11ϕ2h2 + θ′11ϕ3h3]
h′2 = a[θ′20 + θ′21ϕ0 + θ′21ϕ1h1 + θ′21ϕ2h2 + θ′21ϕ3h3]
h′3 = a[θ′30 + θ′31ϕ0 + θ′31ϕ1h1 + θ′31ϕ2h2 + θ′31ϕ3h3], (4.5)
or equivalently:
h′1 = a[ψ10 + ψ11h1 + ψ12h2 + ψ13h3]
h′2 = a[ψ20 + ψ21h1 + ψ22h2 + ψ23h3]
h′3 = a[ψ30 + ψ31h1 + ψ32h2 + ψ33h3]. (4.6)
It follows that a network with two layers can represent the family of
functions created by passing the output of one single-layer network into
another. In fact, it represents a broader family because in equation 4.6, the
nine slope parameters ψ11, ψ21, …, ψ33 can take arbitrary values, whereas,
in equation 4.5, these parameters are constrained to be the outer product
of [θ′11, θ′21, θ′31]T and [ϕ1, ϕ2, ϕ3].
The combined network is therefore described by:
h1 = a[θ10 + θ11x]
h2 = a[θ20 + θ21x]
h3 = a[θ30 + θ31x], (4.7)
h′1 = a[ψ10 + ψ11h1 + ψ12h2 + ψ13h3]
h′2 = a[ψ20 + ψ21h1 + ψ22h2 + ψ23h3]
h′3 = a[ψ30 + ψ31h1 + ψ32h2 + ψ33h3], (4.8)
and
y′ = ϕ′0 + ϕ′1h′1 + ϕ′2h′2 + ϕ′3h′3. (4.9)
Considering these equations leads to another way to think about how the
network constructs an increasingly complicated function (figure 4.5):
Notebook 4.2
Clipping functions
1. The three hidden units h1, h2, and h3 in the first layer are computed as
usual by forming linear functions of the input and passing these
through ReLU activation functions (equation 4.7).
2. The pre-activations at the second layer are computed by taking three
new linear functions of these hidden units (arguments of the activation
functions in equation 4.8). At this point, we effectively have a shallow
network with three outputs; we have computed three piecewise linear
functions with the “joints” between linear regions in the same places
(see figure 3.6).
3. At the second hidden layer, another ReLU function a[•] is applied to
each function (equation 4.8), which clips them and adds new “joints”
to each.
4. The final output is a linear combination of these hidden units
(equation 4.9). A code sketch of these four steps appears after figure 4.5.
Figure 4.5 Computation for the deep network in figure 4.4. a–c) The inputs to the
second hidden layer (i.e., the pre-activations) are three piecewise linear functions
where the “joints” between the linear regions are at the same places (see figure 3.6).
d–f) Each piecewise linear function is clipped to zero by the ReLU activation
function. g–i) These clipped functions are then weighted with parameters ϕ′1, ϕ′2,
and ϕ′3, respectively. j) Finally, the clipped and weighted functions are summed and
an offset that controls the overall height is added.
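A minimal sketch of these four steps with invented parameters; the variable names mirror equations 4.7–4.9:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

# Invented parameters for a 1-input, two-layer network with three hidden
# units per layer (theta: layer 1, psi: layer 2, phi_prime: output).
theta = np.array([[0.3, -1.0], [-0.2, 1.0], [0.1, 2.0]])
psi = np.array([[0.1, 1.0, -0.5, 0.4],
                [-0.3, 0.5, 0.9, -1.0],
                [0.2, -1.0, 0.4, 0.6]])
phi_prime = np.array([0.1, 2.0, -1.0, 0.5])

def deep(x):
    h = relu(theta[:, 0] + theta[:, 1] * x)        # step 1 (equation 4.7)
    pre = psi[:, 0] + psi[:, 1:] @ h               # step 2: new linear functions
    h_prime = relu(pre)                            # step 3: second ReLU adds joints
    return phi_prime[0] + phi_prime[1:] @ h_prime  # step 4 (equation 4.9)

print(deep(0.5))  # 1.37
```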
(4.10)
4.3.1 Hyperparameters
We can extend the deep network construction to more than two hidden
layers; modern networks might have more than a hundred layers with
thousands of hidden units at each layer. The number of hidden units in each
layer is referred to as the width of the network, and the number of hidden
layers as the depth. The total number of hidden units is a measure of the
network's capacity.
We denote the number of layers as K and the number of hidden units in
each layer as D1, D2, …, DK. These are examples of hyperparameters. They
are quantities chosen before we learn the model parameters (i.e., the slope
and intercept terms). For fixed hyperparameters (e.g., K = 2 layers with Dk
= 3 hidden units in each), the model describes a family of functions, and the
parameters determine the particular function. Hence, when we also consider
the hyperparameters, we can think of neural networks as representing a
family of families of functions relating input to output.
Problem 4.2
h1 = a[β0 + Ω0x]
h2 = a[β1 + Ω1h1]
h3 = a[β2 + Ω2h2], (4.12)
Appendix B.3
Matrices
and
y = β3 + Ω3h3. (4.13)
Equivalently, the whole network can be written as a single function:
y = β3 + Ω3 a[β2 + Ω2 a[β1 + Ω1 a[β0 + Ω0x]]], (4.14)
where, in each case, the function a[•] applies the activation function
separately to every element of its vector input.
The parameters ϕ of this model comprise all of these weight matrices and
bias vectors .
If the kth layer has Dk hidden units, then the bias vector βk−1 will be of
size Dk. The last bias vector βK has the size Do of the output. The first
weight matrix Ω0 has size D1 × Di where Di is the size of the input. The last
weight matrix ΩK is Do × DK, and the remaining matrices Ωk are Dk+1 × Dk
(figure 4.6).
Notebook 4.3
Deep networks
Figure 4.6 Matrix notation for network with Di = 3-dimensional input x, Do = 2-
dimensional output y, and K = 3 hidden layers h1, h2, and h3 of dimensions D1 = 4,
D2 = 2, and D3 = 3 respectively. The weights are stored in matrices Ωk that pre-
multiply the activations from the preceding layer to create the pre-activations at the
subsequent layer. For example, the weight matrix Ω1 that computes the pre-
activations at h2 from the activations at h1 has dimension 2 × 4. It is applied to the
four hidden units in layer one and creates the inputs to the two hidden units at layer
two. The biases are stored in vectors βk and have the dimension of the layer into
which they feed. For example, the bias vector β2 is length three because layer h3
contains three hidden units.
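A minimal sketch of this matrix notation using the dimensions from figure 4.6; the weights and biases are random stand-ins for learned parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

# Dimensions from figure 4.6: Di = 3, D1 = 4, D2 = 2, D3 = 3, Do = 2.
sizes = [3, 4, 2, 3, 2]
Omegas = [rng.standard_normal((sizes[k + 1], sizes[k])) for k in range(4)]
betas = [rng.standard_normal(sizes[k + 1]) for k in range(4)]

def forward(x):
    # Pre-multiply the activations by Omega_k and add beta_k at each layer.
    h = x
    for Omega, beta in zip(Omegas[:-1], betas[:-1]):
        h = np.maximum(0.0, beta + Omega @ h)  # hidden layers use ReLU
    return betas[-1] + Omegas[-1] @ h          # output layer is linear

print(Omegas[1].shape)            # (2, 4), as described in the caption
print(forward(np.ones(3)).shape)  # (2,)
```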
y = βK + ΩK a[βK−1 + ΩK−1 a[… a[β1 + Ω1 a[β0 + Ω0x]] …]]. (4.16)
Problems 4.3–4.6
Figure 4.7a shows how the maximum number of linear regions increases
as a function of the number of parameters for networks mapping scalar
input x to scalar output y. Deep neural networks create much more complex
functions for a fixed parameter budget. This effect is magnified as the
number of input dimensions Di increases (figure 4.7b), although computing
the maximum number of regions is less straightforward.
Figure 4.7 The maximum number of linear regions for neural networks increases
rapidly with the network depth. a) Network with Di = 1 input. Each curve represents
a fixed number of hidden layers K, as we vary the number of hidden units D per
layer. For a fixed parameter budget (horizontal position), deeper networks produce
more linear regions than shallower ones. A network with K = 5 layers and D = 10
hidden units per layer has 471 parameters (highlighted point) and can produce
161,051 regions. b) Network with Di = 10 inputs. Each subsequent point along a
curve represents ten hidden units. Here, a model with K = 5 layers and D = 50 hidden
units per layer has 10,801 parameters (highlighted point) and can create more than
10^40 linear regions.
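The quoted numbers can be checked against a direct parameter count and the region formula of equation 4.17 from the notes below; a minimal sketch (the helper names are invented):

```python
from math import comb

def n_params(Di, D, K, Do=1):
    # Weights + biases: input layer, K-1 hidden-to-hidden layers, output layer.
    return (Di * D + D) + (K - 1) * (D * D + D) + (D * Do + Do)

def max_regions(Di, D, K):
    # Equation 4.17 (valid when D is an integer multiple of Di).
    return (D // Di + 1) ** (Di * (K - 1)) * sum(comb(D, j) for j in range(Di + 1))

print(n_params(1, 10, 5), max_regions(1, 10, 5))  # 471 and 161051, panel (a)
print(n_params(10, 50, 5))                        # 10801, panel (b)
print(max_regions(10, 50, 5) > 10 ** 40)          # True: more than 10^40 regions
```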
This seems attractive, but the flexibility of the functions is still limited
by the number of parameters. Deep networks can create extremely large
numbers of linear regions, but these contain complex dependencies and
symmetries. We saw some of these when we considered deep networks as
“folding” the input space (figure 4.3). So, it's not clear that the greater
number of regions is an advantage unless (i) there are similar symmetries in
the real-world functions that we wish to approximate or (ii) we have reason
to believe that the mapping from input to output really does involve a
composition of simpler functions.
4.6 Summary
In this chapter, we first considered what happens when we compose two
shallow networks. We argued that the first network “folds” the input space,
and the second network then applies a piecewise linear function. The effects
of the second network are duplicated where the input space is folded onto
itself.
We then showed that this composition of shallow networks is a special
case of a deep network with two layers. We interpreted the ReLU functions
in each layer as clipping the input functions in multiple places and creating
more “joints” in the output function. We introduced the idea of
hyperparameters, which for the networks we've seen so far, comprise the
number of hidden layers and the number of hidden units in each.
Finally, we compared shallow and deep networks. We saw that (i) both
networks can approximate any function given enough capacity, (ii) deep
networks produce many more linear regions per parameter, (iii) some
functions can be approximated much more efficiently by deep networks,
(iv) large, structured inputs like images are best processed in multiple
stages, and (v) in practice, the best results for most tasks are achieved using
deep networks with many layers.
Now that we understand deep and shallow network models, we turn our
attention to training them. In the next chapter, we discuss loss functions.
For any given parameter values ϕ, the loss function returns a single number
that indicates the mismatch between the model outputs and the ground truth
predictions for a training dataset. In chapters 6 and 7, we deal with the
training process itself, in which we seek the parameter values that minimize
this loss.
Notes
Deep learning: It has long been understood that it is possible to build
more complex functions by composing shallow neural networks or
developing networks with more than one hidden layer. Indeed, the term
“deep learning” was first used by Dechter (1986). However, interest was
limited due to practical concerns; it was not possible to train such networks
well. The modern era of deep learning was kick-started by startling
improvements in image classification reported by Krizhevsky et al. (2012).
This sudden progress was arguably due to the confluence of four factors:
larger training datasets, improved processing power for training, the use of
the ReLU activation function, and the use of stochastic gradient descent
(see chapter 6). LeCun et al. (2015) present an overview of early advances
in the modern era of deep learning.
Number of linear regions: For deep networks using a total of D hidden
units with ReLU activations, the upper bound on the number of regions is
2^D (Montufar et al., 2014). The same authors show that a deep ReLU
network with Di-dimensional input and K layers, each containing D ≥ Di
hidden units, has at least ⌊D/Di⌋^(Di(K−1)) · Σj=0..Di (D choose j) linear regions. Montúfar (2017),
Arora et al. (2016), and Serra et al. (2018) all provide tighter upper bounds
that consider the possibility that each layer has different numbers of hidden
units. Serra et al. (2018) provide an algorithm that counts the number of
linear regions in a neural network, although it is only practical for very
small networks.
If the number of hidden units D in each of the K layers is the same, and D is
an integer multiple of the input dimensionality Di, then the maximum
number of linear regions Nr can be computed exactly and is:
Nr = (D/Di + 1)^(Di(K−1)) · Σj=0..Di (D choose j). (4.17)
The first term in this expression corresponds to the first K − 1 layers of the
network, which can be thought of as repeatedly folding the input space.
However, we now need to devote D/Di hidden units to each input dimension
to create these folds. The last term in this equation (a sum of binomial
coefficients) is the number of regions that a shallow network can create and
is attributable to the last layer. For further information, consult Montufar et
al. (2014), Pascanu et al. (2013), and Montúfar (2017).
Appendix B.2
Binomial coefficient
Problems
Problem 4.1* Consider composing the two neural networks in figure 4.8.
Draw a plot of the relationship between the input x and output y′ for x ∈
[−1, 1].
Figure 4.8 Composition of two networks for problem 4.1. a) The output y of the
first network becomes the input to the second. b) The first network computes this
function with output values y ∈ [−1, 1]. c) The second network computes this
function on the input range y ∈ [−1, 1].
(4.18)
where λ0 and λ1 are non-negative scalars. From this, we see that the weight
matrices can be rescaled by any magnitude as long as the biases are also
adjusted, and the scale factors can be re-applied at the end of the network.
Problem 4.4 Write out the equations for a deep neural network that takes Di
= 5 inputs, Do = 4 outputs and has three hidden layers of sizes D1 = 20, D2
= 10, and D3 = 7, respectively, in both the forms of equations 4.15 and 4.16.
What are the sizes of each weight matrix Ω• and bias vector β•?
Figure 4.9 Hidden unit activations for problem 4.8. a) First hidden unit has a joint
at position x = 1/6 and a slope of one in the active region. b) Second hidden unit has a
joint at position x = 2/6 and a slope of one in the active region. c) Third hidden unit
has a joint at position x = 4/6 and a slope of minus one in the active region.
Chapter 5
Loss functions
The last three chapters described linear regression, shallow neural networks,
and deep neural networks. Each represents a family of functions that map
input to output, where the particular member of the family is determined by
the model parameters ϕ. When we train these models, we seek the
parameters that produce the best possible mapping from input to output for
the task we are considering. This chapter defines what is meant by the “best
possible” mapping.
That definition requires a training dataset {xi, yi} of input/output pairs. A
loss function or cost function L[ϕ] returns a single number that describes the
mismatch between the model predictions f[xi, ϕ] and their corresponding
ground-truth outputs yi. During training, we seek parameter values ϕ that
minimize the loss and hence map the training inputs to the outputs as
closely as possible. We saw one example of a loss function in chapter 2; the
least squares loss function is suitable for univariate regression problems for
which the target is a real number y ∈ ℝ. It computes the sum of the squares
of the deviations between the model predictions f[xi, ϕ] and the true values
yi.
Appendix A
Number sets
This chapter provides a framework that both justifies the choice of the
least squares criterion for real-valued outputs and allows us to build loss
functions for other prediction types. We consider binary classification,
where the prediction y ∈ {0, 1} is one of two categories, multiclass
classification, where the prediction y ∈ {1, 2, …, K} is one of K categories,
and more complex cases. In the following two chapters, we address model
training, where the goal is to find the parameter values that minimize these
loss functions.
(5.1)
Appendix C.1.5
Independence
(5.3)
(5.4)
5.1.5 Inference
The network no longer directly predicts the outputs y but instead determines
a probability distribution over y. When we perform inference, we often
want a point estimate rather than a distribution, so we return the maximum
of the distribution:
(5.5)
ŷ = argmax_y [ Pr(y|f[x, ϕ̂]) ]
It is usually possible to find an expression for this in terms of the
distribution parameters θ predicted by the model. For example, in the
univariate normal distribution, the maximum occurs at the mean μ.
1. Choose a suitable probability distribution Pr(y|θ) defined over the domain of the predictions y, with distribution parameters θ.
2. Set the machine learning model f[x, ϕ] to predict these parameters, so θ = f[x, ϕ] and Pr(y|θ) = Pr(y|f[x, ϕ]).
3. To train the model, find the network parameters ϕ̂ that minimize the negative log-likelihood loss function over the training dataset {xi, yi}:
(5.6)
ϕ̂ = argmin_ϕ [ L[ϕ] ] = argmin_ϕ [ −Σi log Pr(yi|f[xi, ϕ]) ]
4. To perform inference for a new test example x, return either the full
distribution Pr(y|f[x, ϕ̂]) or the maximum of this distribution.
We devote most of the rest of this chapter to constructing loss functions for
common prediction types using this recipe.
Figure 5.3 The univariate normal distribution (also known as the Gaussian
distribution) is defined on the real line z ∈ ℝ and has parameters μ and σ2. The mean
μ determines the position of the peak. The positive root of the variance σ2 (the
standard deviation) determines the width of the distribution. Since the
probability density integrates to one, the peak becomes higher as the variance decreases
and the distribution becomes narrower.
Second, we set the machine learning model f[x, ϕ] to compute one or more
of the parameters of this distribution. Here, we just compute the mean so μ
= f[x, ϕ]:
(5.8)
Pr(y|f[x, ϕ], σ²) = (1/√(2πσ²)) · exp[ −(y − f[x, ϕ])² / (2σ²) ]
We aim to find the parameters ϕ that make the training data {xi, yi} most
probable under this distribution (figure 5.4). To accomplish this, we choose
a loss function L[ϕ] based on the negative log-likelihood:
(5.9)
L[ϕ] = −Σi log [ Pr(yi|f[xi, ϕ], σ²) ]
Figure 5.4 Equivalence of least squares and maximum likelihood loss for the
normal distribution. a) Consider the linear model from figure 2.2. The least squares
criterion minimizes the sum of the squares of the deviations (dashed lines) between
the model prediction f[xi, ϕ] (green line) and the true output values yi (orange points).
Here the fit is good, so these deviations are small (e.g., for the two highlighted
points). b) For these parameters, the fit is bad, and the squared deviations are large. c)
The least squares criterion follows from the assumption that the model predicts the
mean of a normal distribution over the outputs and that we maximize the probability.
For the first case, the model fits well, so the probability Pr(yi∣xi) of the data
(horizontal orange dashed lines) is large (and the negative log probability is small). d)
For the second case, the model fits badly, so the probability is small and the negative
log probability is large.
When we train the model, we seek parameters that minimize this loss.
5.3.1 Least squares loss function
Now let's perform some algebraic manipulations on the loss function. We
seek:
(5.10)
ϕ̂ = argmin_ϕ [ −Σi log[ (1/√(2πσ²)) · exp[−(yi − f[xi, ϕ])²/(2σ²)] ] ]
   = argmin_ϕ [ −Σi ( log[1/√(2πσ²)] − (yi − f[xi, ϕ])²/(2σ²) ) ]
   = argmin_ϕ [ Σi (yi − f[xi, ϕ])²/(2σ²) ]
   = argmin_ϕ [ Σi (yi − f[xi, ϕ])² ]
where we have removed the first term between the second and third lines
because it does not depend on ϕ. We have removed the denominator
between the third and fourth lines, as this is just a constant scaling factor
that does not affect the position of the minimum.
The result of these manipulations is the least squares loss function that
we originally introduced when we discussed linear regression in chapter 2:
(5.11)
L[ϕ] = Σi (yi − f[xi, ϕ])²
We see that the least squares loss function follows naturally from the
assumptions that the prediction errors are (i) independent and (ii) drawn
from a normal distribution with mean μ = f[xi, ϕ] (figure 5.4).
Notebook 5.1
Least squares loss
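As a concrete illustration of this equivalence, the following minimal NumPy sketch (not from the accompanying notebooks; the data values are hypothetical, and σ is fixed to one) shows that the normal negative log-likelihood and the least squares criterion differ only by an offset and scale that do not depend on ϕ, so they share the same minimizer:

import numpy as np

def normal_nll(y, mu, sigma=1.0):
    # Negative log-likelihood of targets y under Normal(mu, sigma^2)
    return np.sum(0.5 * np.log(2 * np.pi * sigma**2)
                  + 0.5 * (y - mu)**2 / sigma**2)

def least_squares(y, mu):
    # Sum of squared deviations (equation 5.11)
    return np.sum((y - mu)**2)

y  = np.array([0.7, 1.3, 2.1])    # ground-truth outputs y_i
mu = np.array([0.8, 1.1, 2.4])    # model predictions f[x_i, phi]
print(normal_nll(y, mu), least_squares(y, mu))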
5.3.2 Inference
The network no longer directly predicts y but instead predicts the mean μ =
f[x, ϕ] of the normal distribution over y. When we perform inference, we
usually want a single “best” point estimate ŷ, so we take the maximum of
the predicted distribution:
(5.12)
ŷ = argmax_y [ Pr(y|f[x, ϕ̂]) ]
For the univariate normal, the maximum position is determined by the mean
parameter μ (figure 5.3). This is precisely what the model computed, so ŷ =
f[x, ϕ̂].
(5.13)
In inference, the model predicts the mean μ = f[x, ϕ̂] from the input, and we
learned the variance during the training process. The former is the best
prediction. The latter tells us about the uncertainty of the prediction.
(5.14)
(5.15)
(5.17)
Second, we set the machine learning model f[x, ϕ] to predict the single
distribution parameter λ. However, λ can only take values in the range [0,
1], and we cannot guarantee that the network output will lie in this range.
Consequently, we pass the network output through a function that maps the
real numbers ℝ to [0, 1]. A suitable function is the logistic sigmoid (figure
5.7):
(5.18)
sig[z] = 1 / (1 + exp[−z])
Problem 5.1
Figure 5.7 Logistic sigmoid function. This function maps the real line z ∈ ℝ to
numbers between zero and one, so sig[z] ∈ [0, 1]. An input of 0 is mapped to 0.5.
Negative inputs are mapped to numbers below 0.5, and positive inputs to numbers
above 0.5.
(5.19)
Pr(y|x) = (1 − sig[f[x, ϕ]])^(1−y) · sig[f[x, ϕ]]^y
This is depicted in figure 5.8 for a shallow neural network model. The loss
function is the negative log-likelihood of the training set:
(5.20)
L[ϕ] = −Σi [ (1 − yi) · log[1 − sig[f[xi, ϕ]]] + yi · log[sig[f[xi, ϕ]]] ]
Figure 5.8 Binary classification model. a) The network output is a piecewise linear
function that can take arbitrary real values. b) This is transformed by the logistic
sigmoid function, which compresses these values to the range [0, 1]. c) The
transformed output predicts the probability λ that y = 1 (solid line). The probability
that y = 0 is hence 1 − λ (dashed line). For any fixed x (vertical slice), we retrieve the
two values of a Bernoulli distribution similar to that in figure 5.6. The loss function
favors model parameters that produce large values of λ at positions xi that are
associated with positive examples yi = 1 and small values of λ at positions associated
with negative examples yi = 0.
For reasons to be explained in section 5.7, this is known as the binary cross-
entropy loss.
Notebook 5.2
Binary cross-entropy loss
The transformed model output sig[f[x, ϕ]] predicts the parameter λ of the
Bernoulli distribution. This represents the probability that y = 1, and it
follows that 1 − λ represents the probability that y = 0. When we perform
inference, we may want a point estimate of y, so we set y = 1 if λ > 0.5 and
y = 0 otherwise.
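The following short NumPy sketch (hypothetical network outputs and labels) assembles these pieces: the sigmoid maps the network output to λ, the loss is the negative log-likelihood of equation 5.20, and inference thresholds λ at 0.5:

import numpy as np

def sig(z):
    # Logistic sigmoid (equation 5.18)
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss(f, y):
    # Binary cross-entropy: negative log-likelihood of labels y in {0, 1}
    # under a Bernoulli distribution with lambda = sig(f) (equation 5.20)
    lam = sig(f)
    return -np.sum(y * np.log(lam) + (1 - y) * np.log(1 - lam))

f = np.array([2.0, -1.0, 0.5])     # raw network outputs
y = np.array([1.0, 0.0, 1.0])      # ground-truth labels
print(bce_loss(f, y))
print((sig(f) > 0.5).astype(int))  # point estimates for inference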
Problem 5.2
The parameters are constrained to take values between zero and one, and
they must collectively sum to one to ensure a valid probability distribution.
Then we use a network f[x, ϕ] with K outputs to compute these K
parameters from the input x. Unfortunately, the network outputs will not
necessarily obey the aforementioned constraints. Consequently, we pass the
K outputs of the network through a function that ensures these constraints
are respected. A suitable choice is the softmax function (figure 5.10). This
takes an arbitrary vector of length K and returns a vector of the same length
but where the elements are now in the range [0, 1] and sum to one. The kth
output of the softmax function is:
(5.22)
softmax_k[z] = exp[z_k] / Σ_{k′=1}^{K} exp[z_{k′}]
Figure 5.10 Multiclass classification for K = 3 classes. a) The network has three
piecewise linear outputs, which can take arbitrary values. b) After the softmax
function, these outputs are constrained to be non-negative and sum to one. Hence, for
a given input x, we compute valid parameters for the categorical distribution: any
vertical slice of this plot produces three values that sum to one and would form the
heights of the bars in a categorical distribution similar to figure 5.9.
where the exponential functions ensure positivity, and the sum in the
denominator ensures that the K numbers sum to one.
Appendix B.1.3
Exponential function
(5.23)
Pr(y = k|x) = softmax_k[ f[x, ϕ] ]
The loss function is the negative log-likelihood of the training data:
(5.24)
L[ϕ] = −Σi log[ softmax_{yi}[ f[xi, ϕ] ] ]
where fk[x, ϕ] denotes the kth output of the neural network. For reasons that
will be explained in section 5.7, this is known as the multiclass cross-
entropy loss.
The transformed model output represents a categorical distribution over
possible classes y ∈ {1, 2, …, K}. For a point estimate, we take the most
probable category ŷ = argmax_k[Pr(y = k|f[x, ϕ̂])]. This corresponds to
whichever curve is highest for that value of x in figure 5.10.
Notebook 5.3
Multiclass cross-entropy
loss
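A minimal sketch of equations 5.22 and 5.24 (hypothetical values; subtracting the maximum inside the exponent improves numerical stability without changing the result):

import numpy as np

def softmax(z):
    # Softmax function (equation 5.22)
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def multiclass_ce(f, y):
    # Negative log-likelihood of class label y under Categorical(softmax(f))
    return -np.log(softmax(f)[y])

f = np.array([1.0, -0.5, 2.0])    # K = 3 network outputs
y = 2                             # true class index
print(multiclass_ce(f, y))
print(np.argmax(softmax(f)))      # point estimate for inference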
(5.25)
Pr(y|f[x, ϕ]) = Πd Pr(yd|fd[x, ϕ])
where fd[xi, ϕ] is the dth set of network outputs, which describe the
parameters of the distribution over yd. For example, to predict multiple
continuous variables yd ∈ ℝ, we use a normal distribution for each yd, and
the network outputs fd[xi, ϕ] predict the means of these distributions. To
predict multiple discrete variables yd ∈ {1, 2, …, K}, we use a categorical
distribution for each yd. Here, each set of network outputs fd[xi, ϕ] predicts
the K values that contribute to the categorical distribution for yd.
Appendix C.1.5
Independence
(5.26)
L[ϕ] = −Σi Σd log[ Pr(yid|fd[xi, ϕ]) ]
where yid is the dth output from the ith training example.
To make two or more prediction types simultaneously, we similarly
assume the errors in each are independent. For example, to predict wind
direction and strength, we might choose the von Mises distribution (defined
on circular domains) for the direction and the exponential distribution
(defined on positive real numbers) for the strength. The independence
assumption implies that the joint likelihood of the two predictions is the
product of individual likelihoods. These terms will become additive when
we compute the negative log-likelihood.
Problems 5.7–5.10
5.7 Cross-entropy loss
In this chapter, we developed loss functions that minimize negative log-
likelihood. However, the term cross-entropy loss is also commonplace. In
this section, we describe the cross-entropy loss and show that it is
equivalent to using negative log-likelihood.
The cross-entropy loss is based on the idea of finding parameters θ that
minimize the distance between the empirical distribution q(y) of the
observed data y and a model distribution Pr(y|θ) (figure 5.12). The distance
between two probability distributions q(z) and p(z) can be evaluated using
the Kullback-Leibler (KL) divergence:
(5.27)
D_KL[ q(z) ∥ p(z) ] = ∫ q(z) · log[ q(z) / p(z) ] dz
Appendix C.5.1
KL Divergence
(5.29)
q(y) = (1/I) · Σi δ[y − yi]
Appendix B.1.3
Dirac delta function
(5.30)
θ̂ = argmin_θ [ −∫ q(y) · log[Pr(y|θ)] dy ]
   = argmin_θ [ −(1/I) Σi log[Pr(yi|θ)] ]
   = argmin_θ [ −Σi log[Pr(yi|θ)] ]
The product of the two terms in the first line corresponds to pointwise
multiplying the point masses in figure 5.12a with the logarithm of the
distribution in figure 5.12b. We are left with a finite set of weighted
probability masses centered on the data points. In the last line, we have
eliminated the constant scaling factor 1/I, as this does not affect the position
of the minimum.
In machine learning, the distribution parameters θ are computed by the
model f[xi, ϕ], so we have:
(5.31)
ϕ̂ = argmin_ϕ [ −Σi log[ Pr(yi|f[xi, ϕ]) ] ]
5.8 Summary
We previously considered neural networks as directly predicting outputs y
from data x. In this chapter, we shifted perspective to think about neural
networks as computing the parameters θ of probability distributions
Pr(y∣θ) over the output space. This led to a principled approach to building
loss functions. We selected model parameters ϕ that maximized the
likelihood of the observed data under these distributions. We saw that this is
equivalent to minimizing the negative log-likelihood.
The least squares criterion for regression is a natural consequence of this
approach; it follows from the assumption that y is normally distributed and
that we are predicting the mean. We also saw how the regression model
could be (i) extended to estimate the uncertainty over the prediction and (ii)
extended to make that uncertainty dependent on the input (the
heteroscedastic model). We applied the same approach to both binary and
multiclass classification and derived loss functions for each. We discussed
how to tackle more complex data types and how to deal with multiple
outputs. Finally, we argued that cross-entropy is an equivalent way to think
about fitting models.
In previous chapters, we developed neural network models. In this
chapter, we developed loss functions for deciding how well a model
describes the training data for a given set of parameters. The next chapter
considers model training, in which we aim to find the model parameters that
minimize this loss.
Notes
Losses based on the normal distribution: Nix & Weigend (1994) and
Williams (1996) investigated heteroscedastic nonlinear regression in which
both the mean and the variance of the output are functions of the input. In
the context of unsupervised learning, Burda et al. (2016) use a loss function
based on a multivariate normal distribution with diagonal covariance, and
Dorta et al. (2018) use a loss function based on a normal distribution with
full covariance.
Robust regression: Qi et al. (2020) investigate the properties of
regression models that minimize mean absolute error rather than mean
squared error. This loss function follows from assuming a Laplace
distribution over the outputs and estimates the median output for a given
input rather than the mean. Barron (2019) presents a loss function that
parameterizes the degree of robustness. When interpreted in a probabilistic
context, it yields a family of univariate probability distributions that
includes the normal and Cauchy distributions as special cases.
Estimating quantiles: Sometimes, we may not want to estimate the mean
or median in a regression task but may instead want to predict a quantile.
For example, this is useful for risk models, where we want to know that the
true value will be less than the predicted value 90% of the time. This is
known as quantile regression (Koenker & Hallock, 2001). This could be
done by fitting a heteroscedastic regression model and then estimating the
quantile based on the predicted normal distribution. Alternatively, the
quantiles can be estimated directly using quantile loss (also known as
pinball loss). In practice, this minimizes the absolute deviations of the data
from the model but weights the deviations in one direction more than the
other. Recent work has investigated simultaneously predicting multiple
quantiles to get an idea of the overall distribution shape (Rodrigues &
Pereira, 2020).
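As a sketch of this idea (not from the book's notebooks; the data and the target quantile τ = 0.9 are hypothetical), deviations on either side of the prediction are weighted asymmetrically, so the minimizer approaches the τ-quantile:

import numpy as np

def pinball_loss(y, yhat, tau=0.9):
    # Deviations below the prediction are weighted by (1 - tau),
    # deviations above by tau, so the minimizer is the tau-quantile
    d = y - yhat
    return np.mean(np.where(d >= 0, tau * d, (tau - 1) * d))

y = np.random.normal(size=1000)
cands = np.linspace(-3, 3, 601)
best = cands[np.argmin([pinball_loss(y, c) for c in cands])]
print(best, np.quantile(y, 0.9))  # these should roughly agree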
Class imbalance and focal loss: Lin et al. (2017c) address data
imbalance in classification problems. If the number of examples for some
classes is much greater than for others, then the standard maximum
likelihood loss does not work well; the model may concentrate on becoming
more confident about well-classified examples from the dominant classes
and classify less well-represented classes poorly. Lin et al. (2017c)
introduce focal loss, which adds a single extra parameter that down-weights
the effect of well-classified examples to improve performance.
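A sketch of the binary form of this loss (the formulation follows Lin et al., 2017c; the predictions here are hypothetical): the factor (1 − pt)^γ shrinks the contribution of examples that are already classified confidently, and γ = 0 recovers the standard cross-entropy:

import numpy as np

def focal_loss(p, y, gamma=2.0):
    # p is the predicted probability of class 1; y is the label in {0, 1}
    p_t = np.where(y == 1, p, 1 - p)   # probability of the true class
    return -np.sum((1 - p_t)**gamma * np.log(p_t))

p = np.array([0.95, 0.6, 0.1])
y = np.array([1, 1, 1])
print(focal_loss(p, y), focal_loss(p, y, gamma=0.0))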
Learning to rank: Cao et al. (2007), Xia et al. (2008), and Chen et al.
(2009) all used the Plackett-Luce model in loss functions for learning to
rank data. This is the listwise approach to learning to rank as the model
ingests an entire list of objects to be ranked at once. Alternative approaches
are the pointwise approach, in which the model ingests a single object, and
the pairwise approach, where the model ingests pairs of objects. Chen et al.
(2009) summarize different approaches for learning to rank.
Other data types: Fan et al. (2020) use a loss based on the beta
distribution for predicting values between zero and one. Jacobs et al. (1991)
and Bishop (1994) investigated mixture density networks for multimodal
data. These model the output as a mixture of Gaussians (see figure 5.14)
that is conditional on the input. Prokudin et al. (2018) used the von Mises
distribution to predict direction (see figure 5.13). Fallah et al. (2009)
constructed loss functions for prediction counts using the Poisson
distribution (see figure 5.15). Ng et al. (2017) used loss functions based on
the gamma distribution to predict duration.
Figure 5.13 The von Mises distribution is defined over the circular domain (−π, π].
It has two parameters. The mean μ determines the position of the peak. The
concentration κ > 0 acts like the inverse of the variance. Hence 1/√κ is roughly
equivalent to the standard deviation in a normal distribution.
Problems
Problem 5.1 Show that the logistic sigmoid function sig[z] maps z = −∞ to
0, z = 0 to 0.5 and z = ∞ to 1 where:
(5.32)
sig[z] = 1 / (1 + exp[−z])
Problem 5.2 The loss L for binary classification for a single training pair
{x, y} is:
(5.33)
L = −(1 − y) · log[1 − sig[f[x, ϕ]]] − y · log[sig[f[x, ϕ]]]
where sig[•] is defined in equation 5.32. Plot this loss as a function of the
transformed network output sig[f[x, ϕ]] ∈ [0, 1] (i) when the training label
y = 0 and (ii) when y = 1.
Problem 5.3* Suppose we want to build a model that predicts the direction
y in radians of the prevailing wind based on local measurements of
barometric pressure x. A suitable distribution over circular domains is the
von Mises distribution (figure 5.13):
(5.34)
Pr(y|μ, κ) = exp[κ · cos(y − μ)] / (2π · Bessel0[κ])
where μ is the mean direction and the concentration κ > 0 acts like the inverse of the variance. Use the recipe from section 5.2 to develop a loss function for this model.
Problem 5.4* Sometimes, the output y is best described by a distribution with more than one peak. Here, we might use a weighted sum of two normal components, known as a mixture of two Gaussians:
(5.35)
Pr(y|λ, μ1, μ2, σ1², σ2²) = λ · Norm_y[μ1, σ1²] + (1 − λ) · Norm_y[μ2, σ2²]
where λ ∈ [0, 1] controls the relative weight of the two components, which
have means μ1, μ2 and variances σ1², σ2², respectively. This model can
represent a distribution with two peaks (figure 5.14b) or a distribution with
one peak but a more complex shape (figure 5.14c).
Use the recipe from section 5.2 to construct a loss function for training a
model f[x, ϕ] that takes input x, has parameters ϕ, and predicts a mixture of
two Gaussians. The loss should be based on I training data pairs {xi, yi}.
What problems do you foresee when performing inference?
Problem 5.5 Consider extending the model from problem 5.3 to predict the
wind direction using a mixture of two von Mises distributions. Write an
expression for the likelihood Pr(y∣θ) for this model. How many outputs
will the network need to produce?
Problem 5.6 Consider building a model to predict the number of
pedestrians y ∈ {0, 1, 2,…} that will pass a given point in the city in the
next minute, based on data x that contains information about the time of
day, the longitude and latitude, and the type of neighborhood. A suitable
distribution for modeling counts is the Poisson distribution (figure 5.15).
This has a single parameter λ > 0 called the rate that represents the mean of
the distribution. The distribution has probability mass function:
(5.36)
Pr(y = k) = λ^k · exp[−λ] / k!
Design a loss function for this model assuming we have access to I training
pairs {xi, yi}.
Figure 5.15 Poisson distribution. This discrete distribution is defined over non-
negative integers z ∈ {0, 1, 2, …}. It has a single parameter λ ∈ ℝ+, which is known
as the rate and is the mean of the distribution. a–c) Poisson distributions with rates of
1.4, 2.8, and 6.0, respectively.
Chapter 6
Fitting models
(6.2)
∂L/∂ϕ = [ ∂L/∂ϕ0 , ∂L/∂ϕ1 , … ]ᵀ
(6.3)
ϕ ← ϕ − α · ∂L/∂ϕ
At the minimum of the loss function, the surface must be flat (or we
could improve further by going downhill). Hence, the gradient will be zero,
and the parameters will stop changing. In practice, we monitor the gradient
magnitude and terminate the algorithm when it becomes too small.
(6.4)
f[x, ϕ] = ϕ0 + ϕ1x
Given a dataset {xi, yi} containing I input/output pairs, we choose the least
squares loss function:
(6.5)
L[ϕ] = Σi (ϕ0 + ϕ1xi − yi)²
where the term ℓi = (ϕ0 + ϕ1xi − yi)2 is the individual contribution to the loss
from the ith training example.
The derivative of the loss function with respect to the parameters can be
decomposed into the sum of the derivatives of the individual contributions:
(6.6)
∂L/∂ϕ = Σi ∂ℓi/∂ϕ
where the individual derivatives are:
(6.7)
∂ℓi/∂ϕ = [ ∂ℓi/∂ϕ0 , ∂ℓi/∂ϕ1 ]ᵀ = [ 2(ϕ0 + ϕ1xi − yi) , 2xi(ϕ0 + ϕ1xi − yi) ]ᵀ
Problem 6.1
Figure 6.1 shows the progression of this algorithm as we iteratively
compute the derivatives according to equations 6.6 and 6.7 and then update
the parameters using the rule in equation 6.3. In this case, we have used a
line search procedure to find the value of α that decreases the loss the most
at each iteration.
Notebook 6.2
Gradient descent
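A minimal NumPy sketch of this procedure (hypothetical data, and a fixed learning rate in place of the line search used in figure 6.1):

import numpy as np

def grads(phi0, phi1, x, y):
    # Derivatives of the least squares loss (equations 6.6-6.7)
    r = phi0 + phi1 * x - y                   # residuals
    return 2 * np.sum(r), 2 * np.sum(r * x)   # dL/dphi0, dL/dphi1

x = np.array([0.1, 0.4, 0.5, 0.8, 1.0])
y = np.array([0.2, 0.7, 0.8, 1.3, 1.7])
phi0, phi1, alpha = 0.0, 0.0, 0.05
for _ in range(100):
    g0, g1 = grads(phi0, phi1, x, y)
    phi0, phi1 = phi0 - alpha * g0, phi1 - alpha * g1   # equation 6.3
print(phi0, phi1)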
Figure 6.1 Gradient descent for the linear regression model. a) Training set of I =
12 input/output pairs {xi, yi}. b) Loss function showing iterations of gradient descent.
We start at point 0 and move in the steepest downhill direction until we can improve
no further to arrive at point 1. We then repeat this procedure. We measure the
gradient at point 1 and move downhill to point 2 and so on. c) This can be visualized
better as a heatmap, where the brightness represents the loss. After only four
iterations, we are already close to the minimum. d) The model with the parameters at
point 0 (lightest line) describes the data very badly, but each successive iteration
improves the fit. The model with the parameters at point 4 (darkest line) is already a
reasonable description of the training data.
(6.8)
This Gabor model maps scalar input x to scalar output y and consists of a
sinusoidal component (creating an oscillatory function) multiplied by a
negative exponential component (causing the amplitude to decrease as we
move from the center). It has two parameters ϕ = [ϕ0, ϕ1]T, where ϕ0 ∈ ℝ
determines the mean position of the function and ϕ1 ∈ ℝ+ stretches or
squeezes it along the x-axis (figure 6.2).
Problems 6.3–6.5
Figure 6.2 Gabor model. This nonlinear model maps scalar input x to scalar output
y and has parameters ϕ = [ϕ0, ϕ1]T. It describes a sinusoidal function that decreases in
amplitude with distance from its center. Parameter ϕ0 ∈ ℝ determines the position of
the center. As ϕ0 increases, the function moves left. Parameter ϕ1 ∈ ℝ+ squeezes the
function along the x-axis relative to the center. As ϕ1 increases, the function narrows.
a–c) Model with different parameters.
Consider a training set of I examples {xi, yi} (figure 6.3). The least
squares loss function for I training examples is defined as:
(6.9)
L[ϕ] = Σi (f[xi, ϕ] − yi)²
Figure 6.3 Training data for fitting the Gabor model. The training dataset contains
28 input/output examples {xi, yi}. These data were created by uniformly sampling xi
∈ [−15, 15], passing the samples through a Gabor model with parameters ϕ = [0.0,
16.6]T, and adding normally distributed noise.
Once more, the goal is to find the parameters that minimize this loss.
Figure 6.5 Gradient descent vs. stochastic gradient descent. a) Gradient descent
with line search. As long as the gradient descent algorithm is initialized in the right
“valley” of the loss function (e.g., points 1 and 3), the parameter estimate will move
steadily toward the global minimum. However, if it is initialized outside this valley
(e.g., point 2), it will descend toward one of the local minima. b) Stochastic gradient
descent adds noise to the optimization process, so it is possible to escape from the
wrong valley (e.g., point 2) and still reach the global minimum.
In addition, the loss function contains saddle points (e.g., the blue cross
in figure 6.4). Here, the gradient is zero, but the function increases in some
directions and decreases in others. If the current parameters are not exactly
at the saddle point, then gradient descent can escape by moving downhill.
However, the surface near the saddle point is flat, so it's hard to be sure that
training hasn't converged; if we terminate the algorithm when the gradient
is small, we may erroneously stop near a saddle point.
(6.10)
ϕt+1 ← ϕt − α · Σ_{i∈Bt} ∂ℓi[ϕt]/∂ϕ
6.3 Momentum
A common modification to stochastic gradient descent is to add a
momentum term. We update the parameters with a weighted combination of
the gradient computed from the current batch and the direction moved in
the previous step:
(6.11)
mt+1 ← β · mt + (1 − β) · Σ_{i∈Bt} ∂ℓi[ϕt]/∂ϕ
ϕt+1 ← ϕt − α · mt+1
where mt is the momentum (which drives the update at iteration t), β ∈ [0,
1) controls the degree to which the gradient is smoothed over time, and α is
the learning rate.
The recursive formulation of the momentum calculation means that the
gradient step is an infinite weighted sum of all the previous gradients,
where the weights get smaller as we move back in time. The effective
learning rate increases if all these gradients are aligned over multiple
iterations but decreases if the gradient direction repeatedly changes as the
terms in the sum cancel out. The overall effect is a smoother trajectory and
reduced oscillatory behavior in valleys (figure 6.7).
Problem 6.10
where now the gradients are evaluated at ϕt −α · mt. One way to think
about this is that the gradient term now corrects the path provided by
momentum alone.
Notebook 6.4
Momentum
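A sketch of the update in equation 6.11, applied to a hypothetical quadratic loss with a narrow valley:

import numpy as np

def sgd_momentum(grad_fn, phi, alpha=0.05, beta=0.9, steps=100):
    # The update direction m is an exponentially weighted average
    # of the current and past gradients (equation 6.11)
    m = np.zeros_like(phi)
    for _ in range(steps):
        m = beta * m + (1 - beta) * grad_fn(phi)
        phi = phi - alpha * m
    return phi

# Hypothetical loss L = phi_0^2 + 10 * phi_1^2 (a narrow valley)
grad = lambda p: np.array([2 * p[0], 20 * p[1]])
print(sgd_momentum(grad, np.array([3.0, 1.0])))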
Figure 6.8 Nesterov accelerated momentum. The solution has traveled along the
dashed line to arrive at point 1. A traditional momentum update measures the
gradient at point 1, moves some distance in this direction to point 2, and then adds
the momentum term from the previous iteration (i.e., in the same direction as the
dashed line), arriving at point 3. The Nesterov momentum update first applies the
momentum term (moving from point 1 to point 4) and then measures the gradient and
applies an update to arrive at point 5.
6.4 Adam
Gradient descent with a fixed step size has the following undesirable
property: it makes large adjustments to parameters associated with large
gradients (where perhaps we should be more cautious) and small
adjustments to parameters associated with small gradients (where perhaps
we should explore further). When the gradient of the loss surface is much
steeper in one direction than another, it is difficult to choose a learning rate
that (i) makes good progress in both directions and (ii) is stable (figures
6.9a–b).
Figure 6.9 Adaptive moment estimation (Adam). a) This loss function changes
quickly in the vertical direction but slowly in the horizontal direction. If we run full-
batch gradient descent with a learning rate that makes good progress in the vertical
direction, then the algorithm takes a long time to reach the final horizontal position.
b) If the learning rate is chosen so that the algorithm makes good progress in the
horizontal direction, it overshoots in the vertical direction and becomes unstable. c) A
straightforward approach is to move a fixed distance along each axis at each step so
that we move downhill in both directions. This is accomplished by normalizing the
gradient magnitude and retaining only the sign. However, this does not usually
converge to the exact minimum but instead oscillates back and forth around it (here
between the last two points). d) The Adam algorithm uses momentum in both the
estimated gradient and the normalization term, which creates a smoother path.
(6.14)
ϕt+1 ← ϕt − α · mt+1 / (√vt+1 + ϵ)
where the square root and division are both pointwise, α is the learning rate,
and ϵ is a small constant that prevents division by zero when the gradient
magnitude is zero. The term vt+1 is the squared gradient, and the positive
root of this is used to normalize the gradient itself, so all that remains is the
sign in each coordinate direction. The result is that the algorithm moves a
fixed distance α along each coordinate, where the direction is determined by
whichever way is downhill (figure 6.9c). This simple algorithm makes good
progress in both directions but will not converge unless it happens to land
exactly at the minimum. Instead, it will bounce back and forth around the
minimum.
Adaptive moment estimation, or Adam, takes this idea and adds
momentum to both the estimate of the gradient and the squared gradient:
(6.15)
mt+1 ← β · mt + (1 − β) · ∂L[ϕt]/∂ϕ
vt+1 ← γ · vt + (1 − γ) · (∂L[ϕt]/∂ϕ)²
where β and γ are the momentum coefficients for the two statistics.
Using momentum is equivalent to taking a weighted average over the
history of each of these statistics. At the start of the procedure, all the
previous measurements are effectively zero, resulting in unrealistically
small estimates. Consequently, we modify these statistics using the rule:
(6.16)
m̃t+1 ← mt+1 / (1 − β^(t+1))
ṽt+1 ← vt+1 / (1 − γ^(t+1))
Since β and γ are in the range [0, 1), the terms with exponents t + 1 become
smaller with each time step, the denominators become closer to one, and
this modification has a diminishing effect.
Finally, we update the parameters as before, but with the modified terms:
(6.17)
ϕt+1 ← ϕt − α · m̃t+1 / (√ṽt+1 + ϵ)
The result is an algorithm that can converge to the overall minimum and
makes good progress in every direction in the parameter space. Note that
Adam is usually used in a stochastic setting where the gradients and their
squares are computed from mini-batches:
(6.18)
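Gathering equations 6.15-6.17, a minimal NumPy sketch of the full-batch version (the ill-conditioned quadratic loss here is hypothetical):

import numpy as np

def adam(grad_fn, phi, alpha=0.1, beta=0.9, gamma=0.999, eps=1e-8, steps=200):
    m = np.zeros_like(phi)
    v = np.zeros_like(phi)
    for t in range(steps):
        g = grad_fn(phi)
        m = beta * m + (1 - beta) * g              # equation 6.15
        v = gamma * v + (1 - gamma) * g**2
        m_hat = m / (1 - beta**(t + 1))            # equation 6.16
        v_hat = v / (1 - gamma**(t + 1))
        phi = phi - alpha * m_hat / (np.sqrt(v_hat) + eps)   # equation 6.17
    return phi

grad = lambda p: np.array([2 * p[0], 20 * p[1]])
print(adam(grad, np.array([3.0, 1.0])))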
6.6 Summary
This chapter discussed model training. This problem was framed as finding
parameters ϕ that corresponded to the minimum of a loss function L[ϕ]. The
gradient descent method measures the gradient of the loss function for the
current parameters (i.e., how the loss changes when we make a small
change to the parameters). Then it moves the parameters in the direction
that decreases the loss fastest. This is repeated until convergence.
For nonlinear functions, the loss function may have both local minima
(where gradient descent gets trapped) and saddle points (where gradient
descent may appear to have converged but has not). Stochastic gradient
descent helps mitigate these problems.1 At each iteration, we use a different
random subset of the data (a batch) to compute the gradient. This adds noise
to the process and helps prevent the algorithm from getting trapped in a
sub-optimal region of parameter space. Each iteration is also
computationally cheaper since it only uses a subset of the data. We saw that
adding a momentum term makes convergence more efficient. Finally, we
introduced the Adam algorithm.
The ideas in this chapter apply to optimizing any model. The next
chapter tackles two aspects of training specific to neural networks. First, we
address how to compute the gradients of the loss with respect to the
parameters of a neural network. This is accomplished using the famous
backpropagation algorithm. Second, we discuss how to initialize the
network parameters before optimization begins. Without careful
initialization, the gradients used by the optimization can become extremely
large or extremely small, which can hinder the training process.
Notes
Optimization algorithms: Optimization algorithms are used extensively
throughout engineering, and it is generally more typical to use the term
objective function rather than loss function or cost function. Gradient
descent was invented by Cauchy (1847), and stochastic gradient descent
dates back to at least Robbins & Monro (1951). A modern compromise
between the two is stochastic variance-reduced descent (Johnson & Zhang,
2013), in which the full gradient is computed periodically, with stochastic
updates interspersed. Reviews of optimization algorithms for neural
networks can be found in Ruder (2016), Bottou et al. (2018), and Sun
(2020). Bottou (2012) discusses best practice for SGD, including shuffling
without replacement.
Convexity, minima, and saddle points: A function is convex if no chord
(line segment between two points on the surface) intersects the function.
This can be tested algebraically by considering the Hessian matrix (the
matrix of second derivatives):
(6.19)
H[ϕ] = [ ∂²L/∂ϕ0²    ∂²L/∂ϕ0∂ϕ1   ⋯
         ∂²L/∂ϕ1∂ϕ0  ∂²L/∂ϕ1²     ⋯
         ⋮            ⋮             ⋱  ]
If the Hessian matrix is positive definite (has positive eigenvalues) for all
possible parameter values, then the function is convex; the loss function
will look like a smooth bowl (as in figure 6.1c), so training will be
relatively easy. There will be a single global minimum and no local minima
or saddle points.
Appendix B.3.7
Eigenvalues
For any loss function, the eigenvalues of the Hessian matrix at places where
the gradient is zero allow us to classify this position as (i) a minimum (the
eigenvalues are all positive), (ii) a maximum (the eigenvalues are all
negative), or (iii) a saddle point (positive eigenvalues are associated with
directions in which we are at a minimum and negative ones with directions
where we are at a maximum).
Line search: Gradient descent with a fixed step size is inefficient because
the distance moved depends entirely on the magnitude of the gradient. It
moves a long distance when the function is changing fast (where perhaps it
should be more cautious) but a short distance when the function is changing
slowly (where perhaps it should explore further). For this reason, gradient
descent methods are usually combined with a line search procedure in
which we sample the function along the desired direction to try to find the
optimal step size. One such approach is bracketing (figure 6.10). Another
problem with gradient descent is that it tends to lead to inefficient
oscillatory behavior when descending valleys (e.g., path 1 in figure 6.5a).
Figure 6.10 Line search using the bracketing approach. a) The current solution is at
position a (orange point), and we wish to search the region [a, d] (gray shaded area).
We define two points b, c interior to the search region and evaluate the loss function
at these points. Here L[b] > L[c], so we eliminate the range [a, b]. b) We now repeat
this procedure in the refined search region and find that L[b] < L[c], so we eliminate
the range [c, d]. c) We repeat this process until this minimum is closely bracketed.
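A sketch of this bracketing procedure for a generic unimodal 1D loss (the interior points are chosen at thirds of the interval, and the example function is hypothetical):

def bracket_search(loss, a, d, tol=1e-6):
    # Maintain an interval [a, d] known to contain the minimum and
    # shrink it by comparing the loss at two interior points b and c
    while d - a > tol:
        b = a + (d - a) / 3.0
        c = a + 2.0 * (d - a) / 3.0
        if loss(b) > loss(c):
            a = b   # minimum cannot lie in [a, b]
        else:
            d = c   # minimum cannot lie in [c, d]
    return (a + d) / 2.0

# Hypothetical 1D slice of a loss along a search direction
print(bracket_search(lambda s: (s - 0.7)**2, 0.0, 2.0))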
Problem 6.2* Calculate the Hessian matrix:
(6.20)
H[ϕ] = [ ∂²L/∂ϕ0²    ∂²L/∂ϕ0∂ϕ1
         ∂²L/∂ϕ1∂ϕ0  ∂²L/∂ϕ1²   ]
for the linear regression model (equation 6.5). Prove that this function is
convex by showing that the eigenvalues are always positive. This can be
done by showing that both the trace and the determinant of the matrix are
positive.
Appendix B.3.7
Eigenvalues
Appendix B.3.8
Trace
Appendix B.3.8
Determinant
Problem 6.3 Compute the derivatives of the least squares loss L[ϕ] with
respect to the parameters ϕ0 and ϕ1 for the Gabor model (equation 6.8).
Problem 6.4* The logistic regression model uses a linear function to assign
an input x to one of two classes y ∈ {0, 1}. For a 1D input and a 1D output,
it has two parameters, ϕ0 and ϕ1, and is defined by:
(6.21)
y = sig[ϕ0 + ϕ1x]
where sig[•] is the logistic sigmoid function:
(6.22)
sig[z] = 1 / (1 + exp[−z])
(i) Plot y against x for this model for different values of ϕ0 and ϕ1 and
explain the qualitative meaning of each parameter. (ii) What is a suitable
loss function for this model? (iii) Compute the derivatives of this loss
function with respect to the parameters. (iv) Generate ten data points from a
normal distribution with mean -1 and standard deviation 1 and assign them
the label y = 0. Generate another ten data points from a normal distribution
with mean 1 and standard deviation 1 and assign these the label y = 1. Plot
the loss as a heatmap in terms of the two parameters ϕ0 and ϕ1. (v) Is this
loss function convex? How could you prove this?
Problem 6.5* Compute the derivatives of the least squares loss with respect
to the ten parameters of the simple neural network model introduced in
equation 3.1:
(6.23)
f[x, ϕ] = ϕ0 + ϕ1·a[θ10 + θ11x] + ϕ2·a[θ20 + θ21x] + ϕ3·a[θ30 + θ31x]
Think carefully about what the derivative of the ReLU function a[•] will be.
Problem 6.6 Which of the functions in figure 6.11 is convex? Justify your
answer. Characterize each of the points 1–7 as (i) a local minimum, (ii) the
global minimum, or (iii) neither.
Problem 6.7* The gradient descent trajectory for path 1 in figure 6.5a
oscillates back and forth inefficiently as it moves down the valley toward
the minimum. It's also notable that it turns at right angles to the previous
direction at each step. Provide a qualitative explanation for these
phenomena. Propose a solution that might help prevent this behavior.
Problem 6.8* Can (non-stochastic) gradient descent with a fixed learning
rate escape local minima?
Problem 6.9 We run the stochastic gradient descent algorithm for 1,000
iterations on a dataset of size 100 with a batch size of 20. For how many
epochs did we train the model?
Problem 6.10 Show that the momentum term mt (equation 6.11) is an
infinite weighted sum of the gradients at the previous iterations and derive
an expression for the coefficients (weights) of that sum.
Problem 6.11 What dimensions will the Hessian have if the model has one
million parameters?
1 Chapter 20 discusses the extent to which saddle points and local minima really are problems in
deep learning. In practice, deep networks are surprisingly easy to train.
Chapter 7
(7.1)
f[x, ϕ] = β3 + Ω3·a[β2 + Ω2·a[β1 + Ω1·a[β0 + Ω0·x]]]
where the function a[•] applies the activation function separately to every
element of the input. The model parameters ϕ = {β0, Ω0, β1, Ω1, β2, Ω2, β3,
Ω3} consist of the bias vectors βk and weight matrices Ωk between every
layer (figure 7.1).
Figure 7.1 Backpropagation forward pass. The goal is to compute the derivatives
of the loss ℓ with respect to each of the weights (arrows) and biases (not shown). In
other words, we want to know how a small change to each parameter will affect the
loss. Each weight multiplies the hidden unit at its source and contributes the result to
the hidden unit at its destination. Consequently, the effects of any small change to the
weight will be scaled by the activation of the source hidden unit. For example, the
blue weight is applied to the second hidden unit at layer 1; if the activation of this
unit doubles, then the effect of a small change to the blue weight will double too.
Hence, to compute the derivatives of the weights, we need to calculate and store the
activations at the hidden layers. This is known as the forward pass since it involves
running the network equations sequentially.
We also have individual loss terms ℓ i, which return the negative log-
likelihood of the ground truth label yi given the model prediction f[xi, ϕ] for
training input xi. For example, this might be the least squares loss ℓi = (f[xi,
ϕ] − yi)2. The total loss is the sum of these terms over the training data:
(7.2)
L[ϕ] = Σi ℓi
To train the network, we use stochastic gradient descent:
(7.3)
ϕt+1 ← ϕt − α · Σ_{i∈Bt} ∂ℓi/∂ϕ
where α is the learning rate, and Bt contains the batch indices at iteration t.
To compute this update, we need to calculate the derivatives:
(7.4)
∂ℓi/∂βk and ∂ℓi/∂Ωk
for the parameters {βk, Ωk} at every layer k ∈ {0, 1, …, K} and for each
index i in the batch. The first part of this chapter describes the
backpropagation algorithm, which computes these derivatives efficiently.
Problem 7.1
As we move backward through the network, we see that most of the terms
we need were already calculated in the previous step, so we do not need to
re-compute them. Proceeding backward through the network in this way to
compute the derivatives is known as the backward pass.
The ideas behind backpropagation are relatively easy to understand.
However, the derivation requires matrix calculus because the bias and
weight terms are vectors and matrices, respectively. To help grasp the
underlying mechanics, the following section derives backpropagation for a
simpler toy model with scalar parameters. We then apply the same approach
to a deep neural network in section 7.4.
(7.5)
f[x, ϕ] = β3 + ω3 · cos[β2 + ω2 · exp[β1 + ω1 · sin[β0 + ω0 · x]]]
and a least squares loss for each example:
(7.6)
ℓi = (f[xi, ϕ] − yi)²
where, as usual, xi is the ith training input, and yi is the ith training output.
You can think of this as a simple neural network with one input, one output,
one hidden unit at each layer, and different activation functions sin[•],
exp[•], and cos[•] between each layer.
We aim to compute the derivatives:
(7.7)
Such expressions are awkward to derive and code without mistakes and do
not exploit the inherent redundancy; notice that the three exponential terms
are the same.
The backpropagation algorithm is an efficient method for computing all
of these derivatives at once. It consists of (i) a forward pass, in which we
compute and store a series of intermediate values and the network output,
and (ii) a backward pass, in which we calculate the derivatives of each
parameter, starting at the end of the network, and reusing previous
calculations as we move toward the start.
(7.8)
f0 = β0 + ω0 · x
h1 = sin[f0]
f1 = β1 + ω1 · h1
h2 = exp[f1]
f2 = β2 + ω2 · h2
h3 = cos[f2]
f3 = β3 + ω3 · h3
ℓi = (f3 − yi)²
Figure 7.3 Backpropagation forward pass. We compute and store each of the
intermediate variables in turn until we finally calculate the loss.
(7.9)
∂ℓi/∂f3, ∂ℓi/∂h3, ∂ℓi/∂f2, ∂ℓi/∂h2, ∂ℓi/∂f1, ∂ℓi/∂h1, and ∂ℓi/∂f0
The first of these derivatives is straightforward:
(7.10)
∂ℓi/∂f3 = 2(f3 − yi)
The next derivative can be calculated using the chain rule:
(7.11)
∂ℓi/∂h3 = (∂f3/∂h3) · (∂ℓi/∂f3)
The left-hand side asks how ℓ i changes when h3 changes. The right-hand
side says we can decompose this into (i) how f3 changes when h3 changes
and (ii) how ℓ i changes when f3 changes. In the original equations, h3
changes f3, which changes ℓi, and the derivatives represent the effects of this
chain. Notice that we already computed the second of these derivatives, and
the other is the derivative of β3 + ω3 · h3 with respect to h3, which is ω3.
We continue in this way, computing the derivatives of the output with
respect to these intermediate quantities (figure 7.4):
(7.12)
∂ℓi/∂f2 = (∂h3/∂f2) (∂f3/∂h3 · ∂ℓi/∂f3)
∂ℓi/∂h2 = (∂f2/∂h2) (∂h3/∂f2 · ∂f3/∂h3 · ∂ℓi/∂f3)
∂ℓi/∂f1 = (∂h2/∂f1) (∂f2/∂h2 · ∂h3/∂f2 · ∂f3/∂h3 · ∂ℓi/∂f3)
∂ℓi/∂h1 = (∂f1/∂h1) (∂h2/∂f1 · ∂f2/∂h2 · ∂h3/∂f2 · ∂f3/∂h3 · ∂ℓi/∂f3)
∂ℓi/∂f0 = (∂h1/∂f0) (∂f1/∂h1 · ∂h2/∂f1 · ∂f2/∂h2 · ∂h3/∂f2 · ∂f3/∂h3 · ∂ℓi/∂f3)
Figure 7.4 Backpropagation backward pass #1. We work backward from the end of
the function computing the derivatives ∂ℓ i/∂f• and ∂ℓ i/∂h• of the loss with respect to
the intermediate quantities. Each derivative is computed from the previous one by
multiplying by terms of the form ∂fk/∂hk or ∂hk/∂fk−1.
In each case, we have already computed the quantities in the brackets in the
previous step, and the last term has a simple expression. These equations
embody Observation 2 from the previous section (figure 7.2); we can reuse
the previously computed derivatives if we calculate them in reverse order.
Problem 7.2
Backward pass #2: Finally, we consider how the loss ℓ i changes when
we change the parameters β• and ω•. Once more, we apply the chain rule
(figure 7.5):
(7.13)
∂ℓi/∂βk = (∂fk/∂βk) · (∂ℓi/∂fk)
∂ℓi/∂ωk = (∂fk/∂ωk) · (∂ℓi/∂fk)
Figure 7.5 Backpropagation backward pass #2. Finally, we compute the derivatives
∂ℓ i/∂β• and ∂ℓ i/∂ω•. Each derivative is computed by multiplying the term ∂ℓ i/∂fk by
∂fk/∂βk or ∂fk/∂ωk as appropriate.
In each case, the second term on the right-hand side was computed in
equation 7.12. When k > 0, we have fk = βk + ωk · hk, so:
(7.14)
∂fk/∂βk = 1 and ∂fk/∂ωk = hk
This is consistent with Observation 1 from the previous section; the effect
of a change in the weight ωk is proportional to the value of the source
variable hk (which was stored in the forward pass). The final derivatives
from the term f0 = β0 + ω0 · xi are:
(7.15)
∂f0/∂β0 = 1 and ∂f0/∂ω0 = xi
Notebook 7.1
Backpropagation in toy
model
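The toy backward pass can be written in a few lines of NumPy. This sketch assumes the composition order sin, exp, cos used above, and the input, target, and parameter values are hypothetical:

import numpy as np

def toy_backprop(x, y, beta, omega):
    # Forward pass: compute and store all intermediate values
    f0 = beta[0] + omega[0] * x;   h1 = np.sin(f0)
    f1 = beta[1] + omega[1] * h1;  h2 = np.exp(f1)
    f2 = beta[2] + omega[2] * h2;  h3 = np.cos(f2)
    f3 = beta[3] + omega[3] * h3
    # Backward pass #1: derivatives w.r.t. intermediate quantities
    dl_df3 = 2 * (f3 - y)            # equation 7.10
    dl_dh3 = omega[3] * dl_df3       # equation 7.11
    dl_df2 = -np.sin(f2) * dl_dh3    # equation 7.12
    dl_dh2 = omega[2] * dl_df2
    dl_df1 = np.exp(f1) * dl_dh2
    dl_dh1 = omega[1] * dl_df1
    dl_df0 = np.cos(f0) * dl_dh1
    # Backward pass #2: derivatives w.r.t. the parameters (eqs 7.13-7.15)
    dl_dbeta  = [dl_df0, dl_df1, dl_df2, dl_df3]
    dl_domega = [dl_df0 * x, dl_df1 * h1, dl_df2 * h2, dl_df3 * h3]
    return dl_dbeta, dl_domega

print(toy_backprop(1.0, 0.5, [0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]))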
(7.16)
f0 = β0 + Ω0 · xi
h1 = a[f0]
f1 = β1 + Ω1 · h1
h2 = a[f1]
f2 = β2 + Ω2 · h2
h3 = a[f2]
f3 = β3 + Ω3 · h3
ℓi = l[f3, yi]
where fk−1 represents the pre-activations at the kth hidden layer (i.e., the
values before the ReLU function a[•]) and hk contains the activations at the
kth hidden layer (i.e., after the ReLU function). The term l[f3, yi] represents
the loss function (e.g., least squares or binary cross-entropy loss). In the
forward pass, we work through these calculations and store all the
intermediate quantities.
Backward pass #1: Now let's consider how the loss changes when we
modify the pre-activations f0, f1, f2. Applying the chain rule, the expression
for the derivative of the loss ℓi with respect to f2 is:
(7.17)
∂ℓi/∂f2 = (∂h3/∂f2) · (∂f3/∂h3) · (∂ℓi/∂f3)
Appendix B.5
Matrix calculus
The three terms on the right-hand side have sizes D3 × D3, D3 × Df, and Df
× 1, respectively, where D3 is the number of hidden units in the third layer,
and Df is the dimensionality of the model output f3.
Similarly, we can compute how the loss changes when we change f1 and
f0:
(7.18)
∂ℓi/∂f1 = (∂h2/∂f1) · (∂f2/∂h2) · (∂h3/∂f2 · ∂f3/∂h3 · ∂ℓi/∂f3)
(7.19)
∂ℓi/∂f0 = (∂h1/∂f0) · (∂f1/∂h1) · (∂h2/∂f1 · ∂f2/∂h2 · ∂h3/∂f2 · ∂f3/∂h3 · ∂ℓi/∂f3)
Note that in each case, the term in brackets was computed in the previous
step. By working backward through the network, we can reuse the previous
computations.
Problem 7.3
The derivative ∂ℓ i/∂f3 of the loss ℓ i with respect to the network output
f3 will depend on the loss function but usually has a simple form.
The derivative of the linear transform fk = βk + Ωk·hk with respect to its input hk is the transposed weight matrix:
(7.20)
∂fk/∂hk = Ωkᵀ
If you are unfamiliar with matrix calculus, this result is not obvious. It
is explored in problem 7.6.
Problem 7.6
Problems 7.7–7.8
Figure 7.6 Derivative of rectified linear unit. The rectified linear unit (orange
curve) returns zero when the input is less than zero and returns the input otherwise.
Its derivative (cyan curve) returns zero when the input is less than zero (since the
slope here is zero) and one when the input is greater than zero (since the slope here is
one).
The terms on the right-hand side of equations 7.18 and 7.19 have similar
forms. As we progress back through the network, we alternately (i) multiply
by the transpose of the weight matrices and (ii) threshold based on the
inputs fk−1 to the hidden layer. These inputs were stored during the forward
pass.
Backward pass #2: Now that we know how to compute ∂ℓ i/∂fk, we can
focus on calculating the derivatives of the loss with respect to the weights
and biases. To calculate the derivatives of the loss with respect to the biases
βk, we again use the chain rule:
(7.21)
∂ℓi/∂βk = (∂fk/∂βk) · (∂ℓi/∂fk) = ∂ℓi/∂fk
The derivative with respect to the weights Ωk follows similarly:
(7.22)
∂ℓi/∂Ωk = (∂fk/∂Ωk) · (∂ℓi/∂fk)
        = (∂ℓi/∂fk) · hkᵀ
Again, the progression from line two to line three is not obvious and is
explored in problem 7.9. However, the result makes sense. The final line is
a matrix of the same size as Ωk. It depends linearly on hk, which was
multiplied by Ωk in the original expression. This is also consistent with the
initial intuition that the derivative of the weights in Ωk will be proportional
to the values of the hidden units hk that they multiply. Recall that we
already computed these during the forward pass.
(7.23)
Backward pass: We start with the derivative ∂ℓ i/∂fK of the loss function
ℓ i with respect to the network output fK and work backward through the
network:
(7.24)
∂ℓi/∂βk = ∂ℓi/∂fk
∂ℓi/∂Ωk = (∂ℓi/∂fk) · hkᵀ
(7.25)
∂ℓi/∂fk−1 = 𝕀[fk−1 > 0] ⊙ (Ωkᵀ · ∂ℓi/∂fk)
We calculate these derivatives for every training example in the batch and
sum them together to retrieve the gradient for the SGD update.
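The following NumPy sketch assembles the matrix backward pass for ReLU activations (equations 7.21-7.25). The storage convention (f_pre[k] holds fk, and h_act[k] holds a[fk−1]) and the demonstration shapes are illustrative assumptions, and the bias vectors in the demo are zero:

import numpy as np

def backprop(x, dl_dfK, omegas, f_pre, h_act):
    K = len(omegas) - 1
    dl_df = dl_dfK                     # derivative w.r.t. network output
    d_beta, d_omega = [None] * (K + 1), [None] * (K + 1)
    for k in range(K, -1, -1):
        d_beta[k] = dl_df                        # equation 7.21
        h = h_act[k] if k > 0 else x
        d_omega[k] = np.outer(dl_df, h)          # equation 7.22
        if k > 0:
            # Multiply by the transposed weights, then threshold on the
            # stored pre-activations (ReLU derivative, equation 7.25)
            dl_df = (f_pre[k - 1] > 0) * (omegas[k].T @ dl_df)
    return d_beta, d_omega

# Tiny demo with one hidden layer (hypothetical shapes, least squares loss)
rng = np.random.default_rng(0)
x = rng.standard_normal(3)
omegas = [rng.standard_normal((4, 3)), rng.standard_normal((2, 4))]
f0 = omegas[0] @ x
h1 = np.maximum(f0, 0)
f1 = omegas[1] @ h1
d_beta, d_omega = backprop(x, 2 * f1, omegas, [f0, f1], [None, h1])
print([g.shape for g in d_omega])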
Problem 7.10
Since the training algorithm now processes the entire batch in parallel,
the input becomes a multi-dimensional tensor. In this context, a tensor can
be considered the generalization of a matrix to arbitrary dimensions. Hence,
a vector is a 1D tensor, a matrix is a 2D tensor, and a 3D tensor is a 3D grid
of numbers. Until now, the training data have been 1D, so the input for
backpropagation would be a 2D tensor where the first dimension indexes
the batch element and the second indexes the data dimension. In subsequent
chapters, we will encounter more complex structured input data. For
example, in models where the input is an RGB image, the original data
examples are 3D (height × width × channel). Here, the input to the learning
framework would be a 4D tensor, where the extra dimension indexes the
batch element.
(7.26)
fk = βk + Ωk · hk
hk+1 = a[fk]
where a[•] applies the ReLU functions and Ωk and βk are the weights and
biases, respectively. Imagine that we initialize all the biases to zero and the
elements of Ωk according to a normal distribution with mean zero and
variance σ2. Consider two scenarios:
(7.27)
where Dh is the dimensionality of the input layer h. We have used the rules
for manipulating expectations, and we have assumed that the distributions
over the hidden units hj and the network weights Ωij are independent
between the second and third lines.
Appendix C.2.1
Expectation rules
Using this result, we see that the variance of the pre-activations is:
(7.29)
σ²f′ = ½ · Dh · σ²Ω · σ²f
(7.30)
Problem 7.14
This, in turn, implies that if we want the variance of the subsequent pre-
activations f′ to be the same as the variance of the original pre-
activations f during the forward pass, we should set:
(7.31)
σ²Ω = 2 / Dh
where Dh is the dimension of the original layer to which the weights were
applied. This is known as He initialization.
(7.32)
σ²Ω = 2 / Dh′
where Dh′ is the dimension of the layer that the weights feed into.
7.5.3 Initialization for both forward and backward pass
If the weight matrix Ω is not square (i.e., there are different numbers of
hidden units in the two adjacent layers, so Dh and Dh′ differ), then it is not
possible to choose the variance to satisfy both equations 7.31 and 7.32
simultaneously. One possible compromise is to use the mean (Dh + Dh′)/2 as
a proxy for the number of terms, which gives:
(7.33)
σ²Ω = 4 / (Dh + Dh′)
Figure 7.7 shows empirically that both the variance of the hidden units in
the forward pass and the variance of the gradients in the backward pass
remain stable when the parameters are initialized appropriately.
Problem 7.15
Notebook 7.3
Initialization
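A sketch of this experiment (a simplified version of the setup in figure 7.7, with zero biases and no loss function): propagate a random input through many ReLU layers and track the variance of the pre-activations for different initialization variances:

import numpy as np

def forward_variance(sigma2, K=50, D=100, seed=0):
    rng = np.random.default_rng(seed)
    h = rng.standard_normal(D)
    for _ in range(K):
        omega = rng.normal(0.0, np.sqrt(sigma2), size=(D, D))
        f = omega @ h                # biases initialized to zero
        h = np.maximum(f, 0.0)       # ReLU activation
    return np.var(f)                 # variance at the last layer

# He initialization (2/D) is stable; smaller shrinks, larger explodes
for sigma2 in [1.0 / 100, 2.0 / 100, 4.0 / 100]:
    print(sigma2, forward_variance(sigma2))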
Figure 7.7 Weight initialization. Consider a deep network with 50 hidden layers
and Dh = 100 hidden units per layer. The network has a 100-dimensional input x
initialized from a standard normal distribution, a single fixed target y = 0, and a least
squares loss function. The bias vectors βk are initialized to zero, and the weight
matrices Ωk are initialized with a normal distribution with mean zero and five
different variances σ²Ω. a) Variance of hidden unit activations computed in forward
pass as a function of the network layer. For He initialization (σ²Ω = 2/Dh = 0.02), the
variance is stable. However, for larger values, it increases rapidly, and for smaller
values, it decreases rapidly (note log scale). b) The variance of the gradients in the
backward pass (solid lines) continues this trend; if we initialize with a value larger
than 0.02, the magnitude of the gradients increases rapidly as we pass back through
the network. If we initialize with a value smaller, then the magnitude decreases.
These are known as the exploding gradient and vanishing gradient problems,
respectively.
The takeaway is that although the underlying ideas in deep learning are
quite complex, implementation is relatively simple. For example, all of the
details of the back-propagation are hidden in the single line of code:
loss.backward().
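For example, in a minimal PyTorch sketch (the shapes and data here are hypothetical), the forward pass is written explicitly, and the single call to loss.backward() computes every derivative by reverse-mode algorithmic differentiation:

import torch

x = torch.randn(8, 4)                 # batch of 8, input dimension 4
y = torch.randn(8, 1)
omega = torch.randn(1, 4, requires_grad=True)
beta = torch.zeros(1, requires_grad=True)
loss = ((x @ omega.T + beta - y) ** 2).sum()   # least squares loss
loss.backward()                       # fills omega.grad and beta.grad
print(omega.grad, beta.grad)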
7.7 Summary
The previous chapter introduced stochastic gradient descent (SGD), an
iterative optimization algorithm that aims to find the minimum of a
function. In the context of neural networks, this algorithm finds the
parameters that minimize the loss function. SGD relies on the gradient of
the loss function with respect to the parameters, which must be initialized
before optimization. This chapter has addressed these two problems for
deep neural networks.
The gradients must be evaluated for a very large number of parameters,
for each member of the batch, and at each SGD iteration. It is hence
imperative that the gradient computation is efficient, and to this end, the
backpropagation algorithm was introduced. Careful parameter initialization
is also critical. The magnitudes of the hidden unit activations can either
decrease or increase exponentially in the forward pass. The same is true of
the gradient magnitudes in the backward pass, where these behaviors are
known as the vanishing gradient and exploding gradient problems. Both
impede training but can be avoided with appropriate initialization.
We've now defined the model and the loss function, and we can train a
model for a given task. The next chapter discusses how to measure the
model performance.
Notes
Backpropagation: Efficient reuse of partial computations while
calculating gradients in computational graphs has been repeatedly
discovered, including by Werbos (1974), Bryson et al. (1979), LeCun
(1985), and Parker (1985). However, the most celebrated description of this
idea was by Rumelhart et al. (1985) and Rumelhart et al. (1986), who also
coined the term “backpropagation.” This latter work kick-started a new
phase of neural network research in the eighties and nineties; for the first
time, it was practical to train networks with hidden layers. However,
progress stalled due (in retrospect) to a lack of training data, limited
computational power, and the use of sigmoid activations. Areas such as
natural language processing and computer vision did not rely on neural
network models until the remarkable image classification results of
Krizhevsky et al. (2012) ushered in the modern era of deep learning.
The implementation of backpropagation in modern deep learning
frameworks such as PyTorch and TensorFlow is an example of reverse-
mode algorithmic differentiation. This is distinguished from forward-mode
algorithmic differentiation in which the derivatives from the chain rule are
accumulated while moving forward through the computational graph (see
problem 7.13). Further information about algorithmic differentiation can be
found in Griewank & Walther (2008) and Baydin et al. (2018).
Initialization: He initialization was first introduced by He et al. (2015). It
follows closely from Glorot or Xavier initialization (Glorot & Bengio,
2010), which is very similar but does not consider the effect of the ReLU
layer and so differs by a factor of two. Essentially the same method was
proposed much earlier by LeCun et al. (2012) but with a slightly different
motivation; in this case, sigmoidal activation functions were used, which
naturally normalize the range of outputs at each layer, and hence help
prevent an exponential increase in the magnitudes of the hidden units.
However, if the pre-activations are too large, they fall into the flat regions
of the sigmoid function and result in very small gradients. Hence, it is still
important to initialize the weights sensibly. Klambauer et al. (2017)
introduce the scaled exponential linear unit (SeLU) and show that, within a
certain range of inputs, this activation function tends to make the
activations in network layers automatically converge to mean zero and unit
variance.
A completely different approach is to pass data through the network and
then normalize by the empirically observed variance. Layer-sequential unit
variance initialization (Mishkin & Matas, 2016) is an example of this kind
of method, in which the weight matrices are initialized as orthonormal.
GradInit (Zhu et al., 2021) randomizes the initial weights and temporarily
fixes them while it learns non-negative scaling factors for each weight
matrix. These factors are selected to maximize the decrease in the loss for a
fixed learning rate subject to a constraint on the maximum gradient norm.
Activation normalization or ActNorm adds a learnable scaling and offset
parameter after each network layer at each hidden unit. They run an initial
batch through the network and then choose the offset and scale so that the
mean of the activations is zero and the variance one. After this, these extra
parameters are learned as part of the model.
Closely related to these methods are schemes such as BatchNorm (Ioffe &
Szegedy, 2015), in which the network normalizes the variance of each batch
as part of its processing at every step. BatchNorm and its variants are
discussed in chapter 11. Other initialization schemes have been proposed
for specific architectures, including the ConvolutionOrthogonal initializer
(Xiao et al., 2018a) for convolutional networks, Fixup (Zhang et al., 2019a)
for residual networks, and TFixup (Huang et al., 2020a) and DTFixup (Xu
et al., 2021b) for transformers.
Reducing memory requirements: Training neural networks is memory
intensive. We must store both the model parameters and the pre-activations
at the hidden units for every member of the batch during the forward pass.
Two methods that decrease memory requirements are gradient
checkpointing (Chen et al., 2016a) and micro-batching (Huang et al., 2019).
In gradient checkpointing, the activations are only stored every N layers
during the forward pass. During the backward pass, the intermediate
missing activations are recalculated from the nearest check-point. In this
manner, we can drastically reduce the memory requirements at the
computational cost of performing the forward pass twice (problem 7.11). In
micro-batching, the batch is subdivided into smaller parts, and the gradient
updates are aggregated from each sub-batch before being applied to the
network. A completely different approach is to build a reversible network
(e.g., Gomez et al., 2017), in which the activations at the previous layer can
be computed from the activations at the current one, so there is no need to
cache anything during the forward pass (see chapter 16). Sohoni et al.
(2019) review approaches to reducing memory requirements.
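As an illustration, PyTorch exposes gradient checkpointing through torch.utils.checkpoint. In this minimal sketch (the sizes are hypothetical), the activations inside the checkpointed block are not stored during the forward pass but are recomputed from the block's input during the backward pass:

import torch
from torch.utils.checkpoint import checkpoint

block = torch.nn.Sequential(
    torch.nn.Linear(100, 100), torch.nn.ReLU(),
    torch.nn.Linear(100, 100), torch.nn.ReLU(),
)
x = torch.randn(32, 100, requires_grad=True)
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()   # intermediate activations recomputed here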
Distributed training: For sufficiently large models, the memory
requirements or total required time may be too much for a single processor.
In this case, we must use distributed training, in which training takes place
in parallel across multiple processors. There are several approaches to
parallelism. In data parallelism, each processor or node contains a full copy
of the model but runs a subset of the batch (see Xing et al., 2015; Li et al.,
2020b). The gradients from each node are aggregated centrally and then
redistributed back to each node to ensure that the models remain consistent.
This is known as synchronous training. The synchronization required to
aggregate and redistribute the gradients can be a performance bottleneck,
and this leads to the idea of asynchronous training. For example, in the
Hogwild! algorithm (Recht et al., 2011), the gradient from a node is used to
update a central model whenever it is ready. The updated model is then
redistributed to the node. This means that each node may have a slightly
different version of the model at any given time, so the gradient updates
may be stale; however, it works well in practice. Other decentralized
schemes have also been developed. For example, in Zhang et al. (2016a),
the individual nodes update one another in a ring structure.
Data parallelism methods still assume that the entire model can be held in
the memory of a single node. Pipeline model parallelism stores different
layers of the network on different nodes and hence does not have this
requirement. In a naïve implementation, the first node runs the forward pass
for the batch on the first few layers and passes the result to the next node,
which runs the forward pass on the next few layers and so on. In the
backward pass, the gradients are updated in the opposite order. The obvious
disadvantage of this approach is that each machine lies idle for most of the
cycle. Various schemes revolving around each node processing micro-
batches sequentially have been proposed to reduce this inefficiency (e.g.,
Huang et al., 2019; Narayanan et al., 2021a). Finally, in tensor model
parallelism, computation at a single network layer is distributed across
nodes (e.g., Shoeybi et al., 2019). A good overview of distributed training
methods can be found in Narayanan et al. (2021b), who combine tensor,
pipeline, and data parallelism to train a language model with one trillion
parameters on 3072 GPUs.
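The synchronous data-parallel pattern can be simulated in a few lines; the linear least squares model, shard count, and learning rate below are arbitrary illustrative choices:

```python
import numpy as np

# Minimal simulation of synchronous data parallelism for linear least squares.
# Each "node" holds a full copy of the parameters and a shard of the batch.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(128, 5)), rng.normal(size=128)
phi = np.zeros(5)                        # model copy (identical on all nodes)

def node_gradient(X_shard, y_shard, phi):
    residual = X_shard @ phi - y_shard   # forward pass on this node's shard
    return 2 * X_shard.T @ residual / len(y_shard)

for step in range(100):
    shards = zip(np.array_split(X, 4), np.array_split(y, 4))
    grads = [node_gradient(Xs, ys, phi) for Xs, ys in shards]
    phi -= 0.1 * np.mean(grads, axis=0)  # aggregate centrally, redistribute
```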
Problems
Problem 7.1 A two-layer network with two hidden units in each layer can
be defined as:
(7.34)
where the functions a[•] are ReLU functions. Compute the derivatives of
the output y with respect to each of the 13 parameters ϕ•, θ••, and ψ••
directly (i.e., not using the backpropagation algorithm). The derivative of
the ReLU function with respect to its input ∂a[z]/∂z is the indicator function
𝕀[z > 0], which returns one if the argument is greater than zero and zero
otherwise (figure 7.6).
Problem 7.2 Find an expression for the final term in each of the five chains
of derivatives in equation 7.12.
Problem 7.3 What size are each of the terms in equation 7.19?
Problem 7.4 Calculate the derivative ∂ℓi/∂f[xi, ϕ] for the least squares loss
function:
(7.35)
(7.36)
where the function sig[•] is the logistic sigmoid and is defined as:
(7.37)
where ∂z/∂h is a matrix containing the term ∂zi/∂hj in its ith column and jth
row. To do this, first find an expression for the constituent elements ∂zi/∂hj,
and then consider the form that the matrix ∂z/∂h must take.
Problem 7.7 Consider the case where we use the logistic sigmoid (see
equation 7.37) as an activation function, so h = sig[f]. Compute the
derivative ∂h/∂f for this activation function. What happens to the derivative
when the input takes (i) a large positive value and (ii) a large negative
value?
Problem 7.8 Consider using (i) the Heaviside function and (ii) the
rectangular function as activation functions:
(7.38)
and
(7.39)
Discuss why these functions are problematic for neural network training
with gradient-based optimization methods.
Problem 7.9* Consider a loss function ℓ[f], where f = β + Ωh. We want to
find how the loss ℓ changes when we change Ω, which we'll express with a
matrix that contains the derivative ∂ℓ/∂Ωij at the ith row and jth column. Find
an expression for ∂fi/∂Ωij and, using the chain rule, show that:
(7.40)
Problem 7.10* Derive the equations for the backward pass of the
backpropagation algorithm for a network that uses leaky ReLU activations,
which are defined as:
(7.41)
(7.42)
(7.43)
(7.44)
using the chain rule in each case to make use of the derivatives already
computed.
Figure 7.9 Computational graph for problem 7.12 and problem 7.13. Adapted from
Domke (2010).
Problem 7.13* For the same function as in problem 7.12, compute the
derivative ∂y/∂x by forward-mode differentiation. In other words, compute
in order:
(7.45)
using the chain rule in each case to make use of the derivatives already
computed. Why do we not use forward-mode differentiation when we
calculate the parameter gradients for deep networks?
(7.46)
1 Note that we did not actually need the derivatives ∂ℓi/∂hk of the loss with respect to the
activations. In the final backpropagation algorithm, we will not compute these explicitly.
Chapter 8
Measuring performance
However, this doesn't imply that the classifier is perfect; the model might
have memorized the training set but be unable to predict new examples. To
estimate the true performance, we need a separate test set of input/output
pairs {xi, yi}. To this end, we generate 1000 more examples using the same
process. Figure 8.2a also shows the errors for this test data as a function of
the training step. These decrease as training proceeds, but only to around
40%. This is better than the chance error rate of 90% but far
worse than for the training set; the model has not generalized well to the
test data.
The test loss (figure 8.2b) decreases for the first 1500 training steps but
then increases again. At this point, the test error rate is fairly constant; the
model makes the same mistakes but with increasing confidence. This
decreases the probability of the correct answers and thus increases the
negative log-likelihood. This increasing confidence is a side-effect of the
softmax function; the pre-softmax activations are driven to increasingly
extreme values to make the probability of the training data approach one
(see figure 5.10).
Notebook 8.1
MNIST-1D performance
8.2 Sources of error
We now consider the sources of the errors that occur when a model fails to
generalize. To make this easier to visualize, we revert to a 1D linear least
squares regression problem where we know exactly how the ground truth
data were generated. Figure 8.3 shows a quasi-sinusoidal function; both
training and test data are generated by sampling input values in the range
[0, 1], passing them through this function, and adding Gaussian noise with a
fixed variance.
Figure 8.3 Regression function. Solid black line shows ground truth function. To
generate I training examples {xi, yi}, the input space x ∈ [0, 1] is divided into I equal
segments and one sample xi is drawn from a uniform distribution within each
segment. The corresponding value yi is created by evaluating the function at xi and
adding Gaussian noise (gray region shows ±2 standard deviations). The test data are
generated in the same way.
We fit a simplified shallow neural net to this data (figure 8.4). The
weights and biases that connect the input layer to the hidden layer are
chosen so that the “joints” of the function are evenly spaced across the
interval. If there are D hidden units, then these joints will be at 0, 1/D, 2/D,
…, (D − 1)/D. This model can represent any piecewise linear function with
D equally sized regions in the range [0, 1]. As well as being easy to
understand, this model also has the advantage that it can be fit in closed
form without the need for stochastic optimization algorithms (see problem
8.3). Consequently, we can guarantee to find the global minimum of the
loss function during training.
Problems 8.2–8.3
Figure 8.4 Simplified neural network with three hidden units. a) The weights and
biases between the input and hidden layer are fixed (dashed arrows). b–d) They are
chosen so that the hidden unit activations have slope one, and their joints are equally
spaced across the interval, with joints at x = 0, x = 1/3, and x = 2/3, respectively.
Modifying the remaining parameters ϕ = {β, ω1, ω2, ω3} can create any piecewise
linear function over x ∈ [0, 1] with joints at 1/3 and 2/3. e–g) Three example
functions with different values of the parameters ϕ.
Noise The data generation process includes the addition of noise, so there
are multiple possible valid outputs y for each input x (figure 8.5a). This
source of error is insurmountable for the test data. Note that it does not
necessarily limit the training performance; we will likely never see the same
input x twice during training, so it is still possible to fit the training data
perfectly.
Noise may arise because there is a genuine stochastic element to the data
generation process, because some of the data are mislabeled, or because
there are further explanatory variables that were not observed. In rare cases,
noise may be absent; for example, a network might approximate a function
that is deterministic but requires significant computation to evaluate.
However, noise is usually a fundamental limitation on the possible test
performance.
Bias A second potential source of error may occur because the model is
not flexible enough to fit the true function perfectly. For example, the three-
region neural network model cannot exactly describe the quasi-sinusoidal
function, even when the parameters are chosen optimally (figure 8.5b). This
is known as bias.
Variance We have limited training examples, and there is no way to
distinguish systematic changes in the underlying function from noise in the
underlying data. When we fit a model, we do not get the closest possible
approximation to the true underlying function. Indeed, for different training
datasets, the result will be slightly different each time. This additional
source of variability in the fitted function is termed variance (figure 8.5c).
In practice, there might also be additional variance due to the stochastic
learning algorithm, which does not necessarily converge to the same
solution each time.
(8.1)
and fixed noise σ² = 𝔼y[(μ[x] − y[x])²]. Here we have used the notation y[x]
to specify that we are considering the output y at a given input position x.
Appendix C.2
Expectation
Now consider a least squares loss between the model prediction f[x, ϕ] at
position x and the observed value y[x] at that position:
(8.2)
where we have both added and subtracted the mean μ[x] of the underlying
function in the second line and have expanded out the squared term in the
third line.
The underlying function is stochastic, so this loss depends on the
particular y[x] we observe. The expected loss is:
(8.3)
where we have made use of the rules for manipulating expectations. In the
second line, we have distributed the expectation operator and removed it
from terms with no dependence on y[x], and in the third line, we note that
the second term is zero since 𝔼y[y[x]] = μ[x] by definition. Finally, in the
fourth line, we have substituted in the definition of the noise σ². We can see
that the expected loss has been broken down into two terms; the first term is
the squared deviation between the model and the true function mean, and
the second term is the noise.
Appendix C.2.1
Expectation rules
The first term can be further partitioned into bias and variance. The
parameters ϕ of the model f[x, ϕ] depend on the training dataset 𝒟 = {xi,
yi}, so more properly, we should write f[x, ϕ[𝒟]]. The training dataset is a
random sample from the data generation process; with a different sample of
training data, we would learn different parameter values. The expected
model output fμ[x] with respect to all possible datasets 𝒟 is hence:
(8.4)
Returning to the first term of equation 8.3, we add and subtract fμ[x] and
expand:
(8.5)
We then take the expectation with respect to the training dataset 𝒟:
(8.6)
where we have simplified using similar steps as for equation 8.3. Finally,
we substitute this result into equation 8.3:
(8.7)
This equation says that the expected loss after considering the uncertainty in
the training data 𝒟 and the test data y consists of three additive
components. The variance is uncertainty in the fitted model due to the
particular training dataset we sample. The bias is the systematic deviation of
the model from the mean of the function we are modeling. The noise is the
inherent uncertainty in the true mapping from input to output. These three
sources of error will be present for any task. They combine additively for
linear regression with a least squares loss. However, their interaction can be
more complex for other types of problems.
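These quantities can be estimated numerically by Monte Carlo simulation. The sketch below (a hypothetical setup, not the book's toy model) fits polynomials of increasing degree to many resampled training sets drawn from a noisy quasi-sinusoidal function and reports the bias and variance terms of equation 8.7:

```python
import numpy as np

# Monte Carlo estimate of the bias/variance/noise decomposition (equation
# 8.7) for polynomial regression; all settings here are illustrative.
rng = np.random.default_rng(1)
mu = lambda x: np.sin(2 * np.pi * x)           # true function mean
sigma = 0.2                                    # noise standard deviation
x_test = np.linspace(0, 1, 50)

for degree in [1, 3, 9]:                       # increasing model capacity
    preds = []
    for _ in range(500):                       # sample many training sets D
        x = rng.uniform(0, 1, 15)
        y = mu(x) + rng.normal(0, sigma, 15)
        coeffs = np.polyfit(x, y, degree)
        preds.append(np.polyval(coeffs, x_test))
    preds = np.array(preds)
    f_mu = preds.mean(axis=0)                  # expected model output f_mu[x]
    bias2 = np.mean((f_mu - mu(x_test)) ** 2)
    variance = np.mean(preds.var(axis=0))
    print(degree, bias2, variance, sigma**2)   # bias falls, variance rises
```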
Figure 8.7 Bias and variance as a function of model capacity. a–c) As we increase
the number of hidden units of the toy model, the number of linear regions increases,
and the model becomes able to fit the true function closely; the bias (gray region)
decreases. d–f) Unfortunately, increasing the model capacity has the side-effect of
increasing the variance term (gray region). This is known as the bias-variance trade-
off.
Figure 8.8 Overfitting. a–c) A model with three regions is fit to three different
datasets of fifteen points each. The result is similar in all three cases (i.e., the
variance is low). d–f) A model with ten regions is fit to the same datasets. The
additional flexibility does not necessarily produce better predictions. While these
three models each describe the training data better, they are not necessarily closer to
the true underlying function (black curve). Instead, they overfit the data and describe
the noise, and the variance (difference between fitted curves) is larger.
We've seen that as we add capacity to the model, the bias will decrease,
but the variance will increase for a fixed-size training dataset. This suggests
that there is an optimal capacity where the bias is not too large and the
variance is still relatively small. Figure 8.9 shows how these terms vary
numerically for the toy model as we increase the capacity, using the data
from figure 8.8. For regression models, the total expected error is the sum
of the bias and the variance, and this sum is minimized when the model
capacity is four (i.e., with four hidden units and four linear regions).
Notebook 8.2
Bias-variance trade-off
Figure 8.9 Bias-variance trade-off. The bias and variance terms from equation 8.7
are plotted as a function of the model capacity (number of hidden units / linear
regions) in the simplified model using training data from figure 8.8. As the capacity
increases, the bias (solid orange line) decreases, but the variance (solid cyan line)
increases. The sum of these two terms (dashed gray line) is minimized when the
capacity is four.
8.4.1 Explanation
The discovery of double descent is recent, unexpected, and somewhat
puzzling. It results from an interaction of two phenomena. First, the test
performance becomes temporarily worse when the model has just enough
capacity to memorize the data. Second, the test performance continues to
improve with capacity even after the training performance is perfect. The
first phenomenon is exactly as predicted by the bias-variance trade-off. The
second phenomenon is more confusing; it's unclear why performance
should be better in the over-parameterized regime, given that there are now
not even enough training data points to constrain the model parameters
uniquely.
To understand why performance continues to improve as we add more
parameters, note that once the model has enough capacity to drive the
training loss to near zero, the model fits the training data almost perfectly.
This implies that further capacity cannot help the model fit the training data
any better; any change must occur between the training points. The
tendency of a model to prioritize one solution over another as it extrapolates
between data points is known as its inductive bias.
Problems 8.4–8.5
The model's behavior between data points is critical because, in high-
dimensional space, the training data are extremely sparse. The MNIST-1D
dataset has 40 dimensions, and we trained with 10,000 examples. If this
seems like plenty of data, consider what would happen if we quantized each
input dimension into 10 bins. There would be 10⁴⁰ bins in total, constrained
by only 10⁴ examples. Even with this coarse quantization, there will only be
one data point in every 10³⁶ bins! The tendency of the volume of high-
dimensional space to overwhelm the number of training points is termed the
curse of dimensionality.
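This back-of-envelope arithmetic is trivial to verify:

```python
# Back-of-envelope check of the bin-counting argument above.
dims, bins_per_dim, examples = 40, 10, 10_000
total_bins = bins_per_dim ** dims         # 10^40 bins
print(total_bins // examples)             # 10^36 bins per training example
```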
The implication is that problems in high dimensions might look more
like figure 8.11a; there are small regions of the input space where we
observe data with significant gaps between them. The putative explanation
for double descent is that as we add capacity to the model, it interpolates
between the nearest data points increasingly smoothly. In the absence of
information about what happens between the training points, assuming
smoothness is sensible and will probably generalize reasonably to new data.
Figure 8.11 Increasing capacity (hidden units) allows smoother interpolation
between sparse data points. a) Consider this situation where the training data (orange
circles) are sparse; there is a large region in the center with no data examples to
constrain the model to mimic the true function (black curve). b) If we fit a model
with just enough capacity to fit the training data (cyan curve), then it has to contort
itself to pass through the training data, and the output predictions will not be smooth.
c–f) However, as we add more hidden units, the model has the ability to interpolate
between the points more smoothly (smoothest possible curve plotted in each case).
However, unlike in this figure, it is not obliged to.
Figure 8.12 Regularization. a–c) Each of the three fitted curves passes through the
data points exactly, so the training loss for each is zero. However, we might expect
the smooth curve in panel (a) to generalize much better to new data than the erratic
curves in panels (b) and (c). Any factor that biases a model toward a subset of the
solutions with a similar training loss is known as a regularizer. It is thought that the
initialization and/or fitting of neural networks have an implicit regularizing effect.
Consequently, in the over-parameterized regime, more reasonable solutions, such as
that in panel (a), are encouraged.
The answer to this question is uncertain, but there are two likely
possibilities. First, the network initialization may encourage smoothness,
and the model never departs from the sub-domain of smooth functions
during the training process. Second, the training algorithm may somehow
“prefer” to converge to smooth functions. Any factor that biases a solution
toward a subset of equivalent solutions is known as a regularizer, so one
possibility is that the training algorithm acts as an implicit regularizer (see
section 9.2).
Notes
Bias-variance trade-off: We showed that the test error for regression
problems with least squares loss decomposes into the sum of noise, bias,
and variance terms. These factors are all present for models with other
losses, but their interaction is typically more complicated (Friedman, 1997;
Domingos, 2000). For classification problems, there are some counter-
intuitive predictions; for example, if the model is biased toward selecting
the wrong class in a region of the input space, then increasing the variance
can improve the classification rate as this pushes some of the predictions
over the threshold to be classified correctly.
Cross-validation: We saw that it is typical to divide the data into three
parts: training data (which is used to learn the model parameters), validation
data (which is used to choose the hyperparameters), and test data (which is
used to estimate the final performance). This approach is known as cross-
validation. However, this division may cause problems where the total
number of data examples is limited; if the number of training examples is
comparable to the model capacity, then the variance will be large.
One way to mitigate this problem is to use k-fold cross-validation. The
training and validation data are partitioned into K disjoint subsets. For
example, we might divide these data into five parts. We train with four and
validate with the fifth for each of the five permutations and choose the
hyperparameters based on the average validation performance. The final
test performance is assessed using the average of the predictions from the
five models with the best hyperparameters on an entirely different test set.
There are many variations of this idea, but all share the general goal of
using a larger proportion of the data to train the model, thereby reducing
variance.
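A minimal sketch of the k-fold partitioning logic, where fit_and_score is a hypothetical stand-in for training a model and returning its validation performance:

```python
import numpy as np

# Minimal k-fold cross-validation loop (default k = 5).
def k_fold_scores(X, y, fit_and_score, k=5, seed=0):
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        val = folds[i]                                    # held-out fold
        train = np.concatenate(folds[:i] + folds[i+1:])   # remaining folds
        scores.append(fit_and_score(X[train], y[train], X[val], y[val]))
    return np.mean(scores)    # average validation performance
```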
Capacity: We have used the term capacity informally to mean the
number of parameters or hidden units in the model (and hence indirectly,
the ability of the model to fit functions of increasing complexity). The
representational capacity of a model describes the space of possible
functions it can construct when we consider all possible parameter values.
When we take into account the fact that an optimization algorithm may not
be able to reach all of these solutions, what is left is the effective capacity.
The Vapnik-Chervonenkis (VC) dimension (Vapnik & Chervonenkis, 1971)
is a more formal measure of capacity. It is the largest number of training
examples that a binary classifier can label arbitrarily. Bartlett et al. (2019)
derive upper and lower bounds for the VC dimension in terms of the
number of layers and weights. An alternative measure of capacity is the
Rademacher complexity, which is the expected empirical performance of a
classification model (with optimal parameters) for data with random labels.
Neyshabur et al. (2017) derive a lower bound on the generalization error in
terms of the Rademacher complexity.
Double descent: The term “double descent” was coined by Belkin et al.
(2019), who demonstrated that the test error decreases again in the over-
parameterized regime for two-layer neural networks and random features.
They also claimed that this occurs in decision trees, although Buschjäger &
Morik (2021) subsequently provided evidence to the contrary. Nakkiran et
al. (2021) show that double descent occurs for various modern datasets
(CIFAR-10, CIFAR-100, IWSLT’14 de-en), architectures (CNNs, ResNets,
transformers), and optimizers (SGD, Adam). The phenomenon is more
pronounced when noise is added to the target labels (Nakkiran et al., 2021)
and when some regularization techniques are used (Ishida et al., 2020).
Nakkiran et al. (2021) also provide empirical evidence that test performance
depends on effective model capacity (the largest number of samples for
which a given model and training method can achieve zero training error).
At this point, the model starts to devote its efforts to interpolating smoothly.
As such, the test performance depends not just on the model but also on the
training algorithm and length of training. They observe the same pattern
when they study a model with fixed capacity and increase the number of
training iterations. They term this epoch-wise double descent. This
phenomenon has been modeled by Pezeshki et al. (2022) in terms of
different features in the model being learned at different speeds.
Double descent makes the rather strange prediction that adding training data
can sometimes worsen test performance. Consider an over-parameterized
model in the second descending part of the curve. If we increase the
training data to match the model capacity, we will now be in the critical
region of the new test error curve, and the test loss may increase.
Bubeck & Sellke (2021) prove that overparameterization is necessary to
interpolate data smoothly in high dimensions. They demonstrate a trade-off
between the number of parameters and the Lipschitz constant of a model
(the fastest the output can change for a small input change). A review of the
theory of over-parameterized machine learning can be found in Dar et al.
(2021).
Appendix B.1.1
Lipschitz constant
Notebook 8.4
High-dimensional spaces
Problems
Problem 8.1 Will the multiclass cross-entropy training loss in figure 8.2
ever reach zero? Explain your reasoning.
Problem 8.2 What values should we choose for the three weights and
biases in the first layer of the model in figure 8.4a so that the hidden units'
responses are as depicted in figures 8.4b–d?
Problem 8.3* Given a training dataset consisting of I input/output pairs {xi,
yi}, show how the parameters {β, ω1, ω2, ω3} for the model in figure 8.4a
can be found in closed form using the least squares loss function.
Problem 8.4 Consider the curve in figure 8.10b at the point where we train
a model with a hidden layer of size 200, which would have 50,410
parameters. What do you predict will happen to the training and test
performance if we increase the number of training examples from 10,000 to
50,410?
Problem 8.5 Consider the case where the model capacity exceeds the
number of training data points, and the model is flexible enough to reduce
the training loss to zero. What are the implications of this for fitting a
heteroscedastic model? Propose a method to resolve any problems that you
identify.
Problem 8.6 Show that two random points drawn from a 1000-dimensional
standard Gaussian distribution are orthogonal relative to the origin with
high probability.
Problem 8.7 The volume of a hypersphere with radius r in D dimensions is:
(8.8)
where Γ[•] is the Gamma function. Show using Stirling's formula that the
volume of a hypersphere of diameter one (radius r = 0.5) becomes zero as
the dimension increases.
Appendix B.1.3
Gamma function
Appendix B.1.4
Stirling's formula
Chapter 9
Regularization
where the individual terms ℓi[xi, yi] measure the mismatch between the
network predictions f[xi, ϕ] and output targets yi for each training pair. To
bias this minimization toward certain solutions, we include an additional
term:
(9.2)
where g[ϕ] is a function that returns a scalar that takes a larger value when
the parameters are less preferred. The term λ is a positive scalar that
controls the relative contribution of the original loss function and the
regularization term. The minima of the regularized loss function usually
differ from those in the original, so the training procedure converges to
different parameter values (figure 9.1).
Figure 9.1 Explicit regularization. a) Loss function for Gabor model (see section
6.1.2). Cyan circles represent local minima. Gray circle represents the global
minimum. b) The regularization term favors parameters close to the center of the plot
by adding an increasing penalty as we move away from this point. c) The final loss
function is the sum of the original loss function plus the regularization term. This
surface has fewer local minima, and the global minimum has moved to a different
position (arrow shows change).
(9.3)
(9.4)
Moving back to the negative log-likelihood loss function by taking the log
and multiplying by minus one, we see that λ · g[ϕ] = −log[Pr(ϕ)].
9.1.2 L2 regularization
This discussion has sidestepped the question of which solutions the
regularization term should penalize (or equivalently that the prior should
favor). Since neural networks are used in an extremely broad range of
applications, these can only be very generic preferences. The most
commonly used regularization term is the L2 norm, which penalizes the
sum of the squares of the parameter values:
(9.5)
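In code, this amounts to adding λ times the sum of squared parameters to the data loss. A minimal PyTorch sketch with an arbitrary small network and an illustrative value of λ:

```python
import torch
import torch.nn as nn

# Sketch of explicit L2 regularization (equation 9.5): the penalty
# lambda * (sum of squared parameters) is added to the data loss.
model = nn.Sequential(nn.Linear(1, 20), nn.ReLU(), nn.Linear(20, 1))
lam = 1e-3                                    # regularization coefficient

def regularized_loss(x, y):
    data_loss = nn.functional.mse_loss(model(x), y)
    l2 = sum((p ** 2).sum() for p in model.parameters())
    return data_loss + lam * l2

# usage: loss = regularized_loss(x, y); loss.backward()
```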
Figure 9.2 shows the effect of fitting the simplified network from figure
8.4 with weight decay and different values of the regularization coefficient
λ. When λ is small, it has little effect. However, as λ increases, the fit to the
data becomes less accurate, and the function becomes smoother. This might
improve the test performance for two reasons:
If the network is overfitting, then adding the regularization term means
that the network must trade off slavish adherence to the data against
the desire to be smooth. One way to think about this is that the error
due to variance reduces (the model no longer needs to pass through
every data point) at the cost of increased bias (the model can only
describe smooth functions).
When the network is over-parameterized, some of the extra model
capacity describes areas with no training data. Here, the regularization
term will favor functions that smoothly interpolate between the nearby
points. This is reasonable behavior in the absence of knowledge about
the true function.
Figure 9.2 L2 regularization in simplified network (see figure 8.4). a–f) Fitted
functions as we increase the regularization coefficient λ. The black curve is the true
function, the orange circles are the noisy training data, and the cyan curve is the fitted
model. For small λ (panels a–b), the fitted function passes exactly through the data
points. For intermediate λ (panels c–d), the function is smoother and more similar to
the ground truth. For large λ (panels e–f), the fitted function is smoother than the
ground truth, so the fit is worse.
(9.6)
(9.7)
The discretization causes a deviation from the continuous path (figure 9.3).
Figure 9.3 Implicit regularization in gradient descent. a) Loss function with family
of global minima on horizontal line ϕ1 = 0.61. Dashed blue line shows continuous
gradient descent path starting in bottom-left. Cyan trajectory shows discrete gradient
descent with step size 0.1 (first few steps shown explicitly as arrows). The finite step
size causes the paths to diverge and reach a different final position. b) This disparity
can be approximated by adding a regularization term to the continuous gradient
descent loss function that penalizes the squared gradient magnitude. c) After adding
this term, the continuous gradient descent path converges to the same place that the
discrete one did on the original function.
(9.8)
In other words, the discrete trajectory is repelled from places where the
gradient norm is large (the surface is steep). This doesn't change the
position of the minima where the gradients are zero anyway. However, it
changes the effective loss function elsewhere and modifies the optimization
trajectory, which potentially converges to a different minimum. Implicit
regularization due to gradient descent may be responsible for the
observation that full batch gradient descent generalizes better with larger
step sizes (figure 9.5a).
9.2.2 Implicit regularization in stochastic gradient descent
A similar analysis can be applied to stochastic gradient descent. Now we
seek a modified loss function such that the continuous version reaches the
same place as the average of the possible random SGD updates. This can be
shown to be:
(9.9)
Here, Lb is the loss for the bth of the B batches in an epoch, and both L and
Lb now represent the means of the I individual losses in the full dataset and
the |Bb| individual losses in the batch, respectively:
(9.10)
SGD generalizes better than gradient descent, and smaller batch sizes
generally perform better than larger ones (figure 9.5b). One possible
explanation is that the inherent randomness allows the algorithm to reach
different parts of the loss function. However, it's also possible that some or
all of this performance increase is due to implicit regularization; this
encourages solutions where all the data fits well (so the batch variance is
small) rather than solutions where some of the data fit extremely well and
other data less well (perhaps with the same overall loss, but with larger
batch variance). The former solutions are likely to generalize better.
Notebook 9.2
Implicit regularization
Figure 9.5 Effect of learning rate and batch size for 4000 training and 4000 test
examples from MNIST-1D (see figure 8.1) for a neural network with two hidden
layers. a) Performance is better for large learning rates than for intermediate or small
ones. In each case, the number of iterations is 6000 divided by the learning rate, so
each solution has the opportunity to move the same distance. b) Performance is
superior for smaller batch sizes. In each case, the number of iterations was chosen so
that the training data were memorized at roughly the same time.
9.3.2 Ensembling
Another approach to reducing the generalization gap between training and
test data is to build several models and average their predictions. A group of
such models is known as an ensemble. This technique reliably improves test
performance at the cost of training and storing multiple models and
performing inference multiple times.
The models can be combined by taking the mean of the outputs (for
regression problems) or the mean of the pre-softmax activations (for
classification problems). The assumption is that model errors are
independent and will cancel out. Alternatively, we can take the median of
the outputs (for regression problems) or the most frequent predicted class
(for classification problems) to make the predictions more robust.
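A sketch of these combination rules for a hypothetical three-model ensemble:

```python
import numpy as np

# Combining an ensemble's outputs: mean/median for regression, and the mean
# of pre-softmax activations (or a majority vote) for classification.
reg_outputs = np.array([[2.1], [1.9], [2.4]])       # three models, one input
print(reg_outputs.mean(axis=0), np.median(reg_outputs, axis=0))

logits = np.array([[1.2, 0.3], [0.8, 0.9], [1.5, 0.1]])   # pre-softmax
print(logits.mean(axis=0).argmax())                 # averaged-logits class
votes = logits.argmax(axis=1)
print(np.bincount(votes).argmax())                  # most frequent class
```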
One way to train different models is just to use different random
initializations. This may help in regions of input space far from the training
data. Here, the fitted function is relatively unconstrained, and different
models may produce different predictions, so the average of several models
may generalize better than any single model.
Notebook 9.3
Ensembling
9.3.3 Dropout
Dropout randomly clamps a subset (typically 50%) of hidden units to zero
at each iteration of SGD (figure 9.8). This makes the network less
dependent on any given hidden unit and encourages the weights to have
smaller magnitudes so that the change in the function due to the presence or
absence of the hidden unit is reduced.
Figure 9.8 Dropout. a) Original network. b–d) At each training iteration, a random
subset of hidden units is clamped to zero (gray nodes). The result is that the incoming
and outgoing weights from these units have no effect, so we are training with a
slightly different network each time.
This technique has the additional benefit that it can eliminate undesirable
“kinks” in the function that are far from the training data and don't affect
the loss. For example, consider three hidden units that become active
sequentially as we move along the curve (figure 9.9a). The first hidden unit
causes a large increase in the slope. A second hidden unit decreases the
slope, so the function goes back down. Finally, the third unit cancels out
this decrease and returns the curve to its original trajectory. These three
units conspire to make an undesirable local change in the function. This will
not change the training loss but is unlikely to generalize well.
When several units conspire in this way, eliminating one (as would
happen in dropout) causes a considerable change to the output function that
is propagated to the half-space where that unit was active (figure 9.9b). A
subsequent gradient descent step will attempt to compensate for the change
that this induces, and such dependencies will be eliminated over time. The
overall effect is that large unnecessary changes between training data points
are gradually removed even though they contribute nothing to the loss
(figure 9.9).
Figure 9.9 Dropout mechanism. a) An undesirable kink in the curve is caused by a
sequential increase in the slope, decrease in the slope (at circled joint), and then
another increase to return the curve to its original trajectory. Here we are using full-
batch gradient descent, and the model already fits the data as well as possible, so
further training won't remove the kink. b) Consider what happens if we remove the
hidden unit that produced the circled joint in panel (a), as might happen using
dropout. Without the decrease in the slope, the right-hand side of the function takes
an upwards trajectory, and a subsequent gradient descent step will aim to compensate
for this change. c) Curve after 2000 iterations of (i) randomly removing one of the
three hidden units that cause the kink and (ii) performing a gradient descent step. The
kink does not affect the loss but is nonetheless removed by this approximation of the
dropout mechanism.
At test time, we can run the network as usual with all the hidden units
active; however, the network now has more hidden units than it was trained
with at any given iteration, so we multiply the weights by one minus the
dropout probability to compensate. This is known as the weight scaling
inference rule. A different approach to inference is to use Monte Carlo
dropout, in which we run the network multiple times with different random
subsets of units clamped to zero (as in training) and combine the results.
This is closely related to ensembling in that every random version of the
network is a different model; however, we do not have to train or store
multiple networks here.
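A sketch of these inference options for a single layer of activations, assuming a dropout probability of 0.5 (scaling the activations by one minus the dropout probability is equivalent to scaling the outgoing weights):

```python
import numpy as np

rng = np.random.default_rng(0)
p_drop = 0.5

def dropout_train(h):
    # During training, clamp a random subset of hidden units to zero.
    return h * (rng.uniform(size=h.shape) > p_drop)

def dropout_test(h):
    # Weight scaling inference rule: keep all units but scale by (1 - p_drop)
    # so the expected input to the next layer matches training conditions.
    return h * (1 - p_drop)

def mc_dropout(h, n_samples=20):
    # Monte Carlo dropout: average over random subnetworks at test time.
    return np.mean([dropout_train(h) for _ in range(n_samples)], axis=0)
```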
Figure 9.10 Adding noise to inputs. At each step of SGD, random noise of fixed
variance is added to the batch data. a–c) Fitted model with different noise levels
(small dots represent ten samples). Adding more noise smooths out the fitted function
(cyan line).
(9.11)
where Pr(ϕ) is the prior probability of the parameters, and the denominator
is a normalizing term. Hence, every parameter choice is assigned a
probability (figure 9.11).
Appendix C.1.4
Bayes’ rule
Figure 9.11 Bayesian approach for simplified network model (see figure 8.4). The
parameters are treated as uncertain. The posterior probability Pr(ϕ|{xi, yi}) for a set
of parameters is determined by their compatibility with the data {xi, yi} and a prior
distribution Pr(ϕ). a–c) Two sets of parameters (cyan curves) sampled from the
posterior using normally distributed priors with mean zero and three variances. When
the prior variance is small, the parameters also tend to be small, and the functions
smoother. d–f) Inference proceeds by taking a weighted sum over all possible
parameter values where the weights are the posterior probabilities. This produces
both a prediction of the mean (cyan curves) and the associated uncertainty (gray
region is two standard deviations).
(9.12)
The principle is that the network will build a good internal representation
of the data from the secondary task, which can subsequently be exploited
for the original task. Equivalently, transfer learning can be viewed as
initializing most of the parameters of the final network in a sensible part of
the space that is likely to produce a good solution.
Multi-task learning (figure 9.12b) is a related technique in which the
network is trained to solve several problems concurrently. For example, the
network might take an image and simultaneously learn to segment the
scene, estimate the pixel-wise depth, and predict a caption describing the
image. All of these tasks require some understanding of the image and,
when learned simultaneously, the model performance for each may
improve.
9.3.8 Augmentation
Transfer learning improves performance by exploiting a different dataset.
Multi-task learning improves performance using additional labels. A third
option is to expand the dataset. We can often transform each input data
example in such a way that the label stays the same. For example, we might
aim to determine if there is a bird in an image (figure 9.13). Here, we could
rotate, flip, blur, or manipulate the color balance of the image, and the label
“bird” remains valid. Similarly, for tasks where the input is text, we can
substitute synonyms or translate to another language and back again. For
tasks where the input is audio, we can amplify or attenuate different
frequency bands.
Notebook 9.5
Augmentation
Figure 9.13 Data augmentation. For some problems, each data example can be
transformed to augment the dataset. a) Original image. b–h) Various geometric and
photometric transformations of this image. For image classification, all these images
still have the same label, “bird.” Adapted from Wu et al. (2015a).
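For images, such label-preserving transformations are typically composed into a random pipeline. A sketch using torchvision; the particular transformations and parameter values are illustrative choices, not those used for figure 9.13:

```python
from torchvision import transforms

# A label-preserving augmentation pipeline for image classification;
# each transformation is applied with random parameters at every epoch.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=15),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.GaussianBlur(kernel_size=3),
])
# augmented = augment(image)   # the label ("bird") remains valid
```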
9.4 Summary
Explicit regularization involves adding an extra term to the loss function
that changes the position of the minimum. The term can be interpreted as a
prior probability over the parameters. Stochastic gradient descent with a
finite step size does not neutrally descend to the minimum of the loss
function. This bias can be interpreted as adding additional terms to the loss
function, and this is known as implicit regularization.
There are also many heuristics for improving generalization, including
early stopping, dropout, ensembling, the Bayesian approach, adding noise,
transfer learning, multi-task learning, and data augmentation. There are four
main principles behind these methods (figure 9.14). We can (i) encourage
the function to be smoother (e.g., L2 regularization), (ii) increase the
amount of data (e.g., data augmentation), (iii) combine models (e.g.,
ensembling), or (iv) search for wider minima (e.g., applying noise to
network weights).
Notes
An overview and taxonomy of regularization techniques in deep learning
can be found in Kukačka et al. (2017). Notably missing from the discussion
in this chapter is BatchNorm (Ioffe & Szegedy, 2015) and its variants, which are
described in chapter 11.
Regularization: L2 regularization penalizes the sum of squares of the
network weights. This encourages the output function to change slowly
(i.e., become smoother) and is the most commonly used regularization term. It is
sometimes referred to as Frobenius norm regularization as it penalizes the
Frobenius norms of the weight matrices. It is often also mistakenly referred
to as “weight decay,” although this is a separate technique devised by
Hanson & Pratt (1988) in which the parameters ϕ are updated as:
(9.13)
where, as usual, α is the learning rate, and L is the loss. This is identical to
gradient descent, except that the weights are reduced by a factor of 1 − λ′
before the gradient update. For standard SGD, weight decay is equivalent to
L2 regularization (equation 9.5) with coefficient λ = λ′/2α. However, for
Adam, the learning rate α is different for each parameter, so L2
regularization and weight decay differ. Loshchilov & Hutter (2019) present
AdamW, which modifies Adam to implement weight decay correctly and
show that this improves performance.
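The distinction can be made concrete in a few lines; the learning rate and decay rate below are arbitrary, and the equivalence for standard SGD follows from expanding the L2-regularized gradient step:

```python
# Decoupled weight decay (equation 9.13) vs. plain gradient descent: the
# parameters shrink by a factor (1 - lambda') before the gradient step.
alpha, lam = 0.1, 0.01          # learning rate and decay rate (illustrative)

def weight_decay_step(phi, grad):
    return (1 - lam) * phi - alpha * grad

# For standard SGD this matches L2 regularization with coefficient
# lambda = lambda' / (2 * alpha); for Adam the two differ, which is what
# torch.optim.AdamW corrects.
def l2_step(phi, grad):
    lam_l2 = lam / (2 * alpha)
    return phi - alpha * (grad + 2 * lam_l2 * phi)   # same update as above
```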
Problem 9.5
Appendix B.3.2
Spectral norm
(9.14)
where g[ϕ0] is the negative of the gradient of the loss function, and α is the
step size. As α → 0, the gradient descent process can be described by a
differential equation:
(9.15)
For typical step sizes α, the discrete and continuous versions converge to
different solutions. We can use backward error analysis to find a correction
g1[ϕ] to the continuous version:
(9.16)
(9.17)
where in the second line, we have introduced the correction term (equation
9.16), and in the final line, we have removed terms of greater order than α2.
Note that the first two terms on the right-hand side ϕ0 + αg[ϕ0] are the same
as the discrete update (equation 9.14). Hence, to make the continuous and
discrete versions arrive at the same place, the third term on the right-hand
side must equal zero, allowing us to solve for g1[ϕ]:
(9.18)
During training, the evolution function g[ϕ] is the negative of the gradient
of the loss:
(9.19)
(9.20)
Problems
Problem 9.1 Consider a model where the prior distribution over the
parameters is a normal distribution with mean zero and variance σ²ϕ so that
(9.21)
Problem 9.5 Show that the weight decay parameter update with decay rate
λ:
(9.22)
on the original loss function is equivalent to a standard gradient descent
update on a loss function that includes an L2 regularization term:
(9.23)
Problem 9.6 Consider a model with parameters ϕ = [ϕ0, ϕ1]T. Draw the L0,
L0.5, and L1 regularization terms in a similar form to figure 9.1b. The LP
regularization term is (Σd |ϕd|^P)^(1/P).
Chapter 10
Convolutional networks
Chapters 2–9 introduced the supervised learning pipeline for deep neural
networks. However, these chapters only considered fully connected
networks with a single path from input to output. Chapters 10–13 introduce
more specialized network components with sparser connections, shared
weights, and parallel processing paths. This chapter describes convolutional
layers, which are mainly used for processing image data.
Images have three properties that suggest the need for specialized model
architecture. First, they are high-dimensional. A typical image for a
classification task contains 224×224 RGB values (i.e., 150,528 input
dimensions). Hidden layers in fully connected networks are generally larger
than the input size, so even for a shallow network, the number of weights
would exceed 150,528², or 22 billion. This poses obvious practical
problems in terms of the required training data, memory, and computation.
Second, nearby image pixels are statistically related. However, fully
connected networks have no notion of “nearby” and treat the relationship
between every input equally. If the pixels of the training and test images
were randomly permuted in the same way, the network could still be trained
with no practical difference. Third, the interpretation of an image is stable
under geometric transformations. An image of a tree is still an image of a
tree if we shift it leftwards by a few pixels. However, this shift changes
every input to the network. Hence, a fully connected model must learn the
patterns of pixels that signify a tree separately at every position, which is
clearly inefficient.
Convolutional layers process each local image region independently,
using parameters shared across the whole image. They use fewer
parameters than fully connected layers, exploit the spatial relationships
between nearby pixels, and don't have to re-learn the interpretation of the
pixels at every position. A network predominantly consisting of
convolutional layers is known as a convolutional neural network or CNN.
(10.1)
In other words, the output of the function f[x] is the same regardless of the
transformation t[x]. Networks for image classification should be invariant
to geometric transformations of the image (figure 10.1a–b). The network
f[x] should identify an image as containing the same object, even if it has
been translated, rotated, flipped, or warped.
Figure 10.1 Invariance and equivariance for translation. a–b) In image
classification, the goal is to categorize both images as “mountain” regardless of the
horizontal shift that has occurred. In other words, we require the network prediction
to be invariant to translation. c,e) The goal of semantic segmentation is to associate a
label with each pixel. d,f) When the input image is translated, we want the output
(colored overlay) to translate in the same way. In other words, we require the output
to be equivariant with respect to translation. Panels c–f) adapted from Bousselham et
al. (2021).
(10.2)
(10.3)
where ω = [ω1, ω2, ω3]T is the kernel (figure 10.2).1 Notice that the
convolution operation is equivariant with respect to translation. If we
translate the input x, then the corresponding output z is translated in the
same way.
Problem 10.1
Figure 10.2 1D convolution with kernel size three. Each output zi is a weighted
sum of the nearest three inputs xi−1, xi, and xi+1, where the weights are ω = [ω1, ω2,
ω3]. a) Output z2 is computed as z2 = ω1x1 + ω2x2 + ω3x3. b) Output z3 is computed
as z3 = ω1x2 + ω2x3 + ω3x4. c) At position z1, the kernel extends beyond the first
input x1. This can be handled by zero padding, in which we assume values outside
the input are zero. The final output is treated similarly. d) Alternatively, we could
only compute outputs where the kernel fits within the input range (“valid”
convolution); now, the output will be smaller than the input.
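A sketch of equation 10.3 with zero padding and, for comparison, a “valid” convolution; the input and kernel values are arbitrary:

```python
import numpy as np

# 1D convolution with kernel size three and zero padding (figure 10.2):
# each output is w1*x[i-1] + w2*x[i] + w3*x[i+1].
def conv1d_zero_pad(x, w):
    x_pad = np.concatenate([[0.0], x, [0.0]])        # zero padding
    return np.array([w @ x_pad[i:i+3] for i in range(len(x))])

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
w = np.array([-1.0, 2.0, -1.0])
print(conv1d_zero_pad(x, w))         # same length as the input

# A "valid" convolution instead discards positions where the kernel
# overhangs the input, so the output is shorter:
valid = np.array([w @ x[i:i+3] for i in range(len(x) - 2)])
```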
10.2.2 Padding
Equation 10.3 shows that each output is computed by taking a weighted
sum of the previous, current, and subsequent positions in the input. This
begs the question of how to deal with the first output (where there is no
previous input) and the final output (where there is no subsequent input).
There are two common approaches. The first is to pad the edges of the
inputs with new values and proceed as usual. Zero padding assumes the
input is zero outside its valid range (figure 10.2c). Other possibilities
include treating the input as circular or reflecting it at the boundaries. The
second approach is to discard the output positions where the kernel exceeds
the range of input positions. These valid convolutions have the advantage of
introducing no extra information at the edges of the input. However, they
have the disadvantage that the representation decreases in size.
Figure 10.3 Stride, kernel size, and dilation. a) With a stride of two, we evaluate
the kernel at every other position, so the first output z1 is computed from a weighted
sum centered at x1, and b) the second output z2 is computed from a weighted sum
centered at x3 and so on. c) The kernel size can also be changed. With a kernel size of
five, we take a weighted sum of the nearest five inputs. d) In dilated or atrous
convolution, we intersperse zeros in the weight vector to allow us to combine
information over a large area using fewer weights.
The kernel size can be increased to integrate over a larger area (figure
10.3c). However, it typically remains an odd number so that it can be
centered around the current position. Increasing the kernel size has the
disadvantage of requiring more weights. This leads to the idea of dilated or
atrous convolutions, in which the kernel values are interspersed with zeros.
For example, we can turn a kernel of size five into a dilated kernel of size
three by setting the second and fourth elements to zero. We still integrate
information from a larger input region but only require three weights to do
this (figure 10.3d). The number of zeros we intersperse between the weights
is termed the dilation rate.
Problems 10.2–10.4
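These options map directly onto standard convolution routines. A sketch using PyTorch's conv1d; note that PyTorch's dilation argument counts the spacing between kernel taps, so dilation=2 intersperses one zero between the weights:

```python
import torch
import torch.nn.functional as F

# Stride and dilation with conv1d; shapes are (batch, channels, positions)
# for the input and (out_channels, in_channels, kernel) for the weights.
x = torch.arange(8, dtype=torch.float32).reshape(1, 1, 8)
w = torch.ones(1, 1, 3)

print(F.conv1d(x, w, padding=1).shape)               # stride 1: 8 outputs
print(F.conv1d(x, w, padding=1, stride=2).shape)     # every other position
print(F.conv1d(x, w, padding=2, dilation=2).shape)   # kernel spans 5 inputs
```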
10.2.4 Convolutional layers
A convolutional layer computes its output by convolving the input, adding a
bias β, and passing each result through an activation function a[•]. With
kernel size three, stride one, and dilation rate zero, the ith hidden unit hi
would be computed as:
(10.4)
where the bias β and kernel weights ω1, ω2, ω3 are trainable parameters,
and (with zero padding) we treat the input x as zero when it is out of the
valid range. This is a special case of a fully connected layer that computes
the ith hidden unit as:
(10.5)
If there are D inputs x• and D hidden units h•, this fully connected layer
would have D2 weights ω•• and D biases β•. The convolutional layer only
uses three weights and one bias. A fully connected layer can reproduce this
exactly if most weights are set to zero and others are constrained to be
identical (figure 10.4).
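This equivalence is easy to verify numerically by building the sparse weight matrix of figure 10.4 from an arbitrary kernel:

```python
import numpy as np

# Build the fully connected weight matrix that reproduces a convolutional
# layer: most entries are zero, and the kernel weights repeat along diagonals.
D, w = 6, np.array([0.5, 1.0, -0.5])         # six units, kernel size three
Omega = np.zeros((D, D))
for i in range(D):
    for k in range(3):
        j = i + k - 1                        # input index (zero padding)
        if 0 <= j < D:
            Omega[i, j] = w[k]

x = np.random.default_rng(0).normal(size=D)
x_pad = np.concatenate([[0.0], x, [0.0]])
conv_out = np.array([w @ x_pad[i:i+3] for i in range(D)])
print(np.allclose(Omega @ x, conv_out))      # True: the same computation
```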
Figure 10.4 Fully connected vs. convolutional layers. a) A fully connected layer
has a weight connecting each input x to each hidden unit h (colored arrows) and a
bias for each hidden unit (not shown). b) Hence, the associated weight matrix Ω
contains 36 weights relating the six inputs to the six hidden units. c) A convolutional
layer with kernel size three computes each hidden unit as the same weighted sum of
the three neighboring inputs (arrows) plus a bias (not shown). d) The weight matrix is
a special case of the fully connected matrix where many weights are zero and others
are repeated (same colors indicate same value, white indicates zero weight). e) A
convolutional layer with kernel size three and stride two computes a weighted sum at
every other position. f) This is also a special case of a fully connected network with a
different sparse weight structure.
Problem 10.5
10.2.5 Channels
If we only apply a single convolution, information will inevitably be lost;
we are averaging nearby inputs, and the ReLU activation function clips
results that are less than zero. Hence, it is usual to compute several
convolutions in parallel. Each convolution produces a new set of hidden
variables, termed a feature map or channel.
Figure 10.5a–b illustrates this with two convolution kernels of size three
and with zero padding. The first kernel computes a weighted sum of the
nearest three pixels, adds a bias, and passes the results through the
activation function to produce hidden units h1 to h6. These comprise the
first channel. The second kernel computes a different weighted sum of the
nearest three pixels, adds a different bias, and passes the results through the
activation function to create hidden units h7 to h12. These comprise the
second channel.
Figure 10.5 Channels. Typically, multiple convolutions are applied to the input x
and stored in channels. a) A convolution is applied to create hidden units h1 to h6,
which form the first channel. b) A second convolution operation is applied to create
hidden units h7 to h12, which form the second channel. The channels are stored in a
2D array H1 that contains all the hidden units in the first hidden layer. c) If we add a
further convolutional layer, there are now two channels at each input position. Here,
the 1D convolution defines a weighted sum over both input channels at the three
closest positions to create each new output channel.
In general, the input and the hidden layers all have multiple channels
(figure 10.5c). If the incoming layer has Ci channels and kernel size K, the
hidden units in each output channel are computed as a weighted sum over
all Ci channels and K kernel positions using a weight matrix Ω ∈ ℝ^(Ci×K)
and one bias. Hence, if there are Co channels in the next layer, then we need
Ω ∈ ℝ^(Ci×Co×K) weights and β ∈ ℝ^(Co) biases.
Problems 10.6–10.8
Notebook 10.1
1D convolution
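The parameter count Ci × Co × K + Co can be confirmed directly; for example, with Ci = Co = 15 and K = 3 (the settings used in the network described next):

```python
import torch.nn as nn

# Parameter count for a 1D convolutional layer: Ci*Co*K weights plus Co
# biases; here 15*15*3 + 15 = 690.
layer = nn.Conv1d(in_channels=15, out_channels=15, kernel_size=3)
print(sum(p.numel() for p in layer.parameters()))   # 690
```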
Figure 10.7 Convolutional network for classifying MNIST-1D data (see figure
8.1). The MNIST-1D input has dimension Di = 40. The first convolutional layer has
fifteen channels, kernel size three, stride two, and only retains “valid” positions to
make a representation with nineteen positions and fifteen channels. The following
two convolutional layers have the same settings, gradually reducing the
representation size. Finally, a fully connected layer takes all sixty hidden units from
the third hidden layer. It outputs ten activations that are subsequently passed through
a softmax layer to produce the ten class probabilities.
This network was trained for 100,000 steps using SGD without
momentum, a learning rate of 0.01, and a batch size of 100 on a dataset of
4,000 examples. We compare this to a fully connected network with the
same number of layers and hidden units (i.e., three hidden layers with 285,
135, and 60 hidden units, respectively). The convolutional network has
2,050 parameters, and the fully connected network has 150,185 parameters.
By the logic of figure 10.4, the convolutional network is a special case of
the fully connected one. The latter has enough flexibility to replicate the
former exactly. Figure 10.8 shows both models fit the training data
perfectly. However, the test error for the convolutional network is much less
than for the fully connected network.
Problem 10.12
Notebook 10.2
Convolution for MNIST-1D
Figure 10.8 MNIST-1D results. a) The convolutional network from figure 10.7
eventually fits the training data perfectly and has ∼17% test error. b) A fully
connected network with the same number of hidden layers and hidden units in
each learns the training data faster but fails to generalize well with ∼40%
test error. The latter model can reproduce the convolutional model but fails to do so.
The convolutional structure restricts the possible mappings to those that process
every position similarly, and this restriction improves performance.
(10.6)
where ωmn are the entries of the convolutional kernel. This is simply a
weighted sum over a square 3×3 input region. The kernel is translated both
horizontally and vertically across the 2D input (figure 10.9) to create an
output at each position.
Problem 10.13
Figure 10.9 2D convolutional layer. Each output hij computes a weighted sum of
the 3×3 nearest inputs, adds a bias, and passes the result through an activation
function. a) Here, the output h23 (shaded output) is a weighted sum of the nine
positions from x12 to x34 (shaded inputs). b) Different outputs are computed by
translating the kernel across the image grid in two dimensions. c–d) With zero
padding, positions beyond the image's edge are considered to be zero.
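A direct sketch of equation 10.6 with zero padding; the input and kernel are arbitrary:

```python
import numpy as np

# 2D convolution (equation 10.6): each output is a weighted sum over a 3x3
# input region; zero padding keeps the output the same size as the input.
def conv2d_zero_pad(X, W):
    H, Wid = X.shape
    X_pad = np.pad(X, 1)                      # zero padding (figure 10.9c-d)
    out = np.zeros_like(X)
    for i in range(H):
        for j in range(Wid):
            out[i, j] = np.sum(W * X_pad[i:i+3, j:j+3])
    return out

X = np.arange(16.0).reshape(4, 4)
W = np.full((3, 3), 1.0 / 9.0)                # 3x3 averaging kernel
print(conv2d_zero_pad(X, W))
```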
Appendix B.3
Tensors
Second, max pooling retains the maximum of the 2×2 input values
(figure 10.11b). This induces some invariance to translation; if the input is
shifted by one pixel, many of these maximum values remain the same.
Finally, mean pooling or average pooling averages the inputs. For all
approaches, we apply downsampling separately to each channel, so the
output has half the width and height but the same number of channels.
10.4.2 Upsampling
The simplest way to scale up a network layer to double the resolution is to
duplicate all the channels at each spatial position four times (figure 10.12a).
A second method is max unpooling; this is used where we have previously
used a max pooling operation for downsampling, and we distribute the
values to the positions they originated from (figure 10.12b). A third
approach uses bilinear interpolation to fill in the missing values between the
points where we have samples (figure 10.12c).
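The first two of these methods can be sketched in a few lines of NumPy; the 0/1 mask recording where each maximum came from is assumed to have been stored during the earlier max pooling step:

```python
import numpy as np

def duplicate_upsample(x):
    """Duplicate each position into a 2x2 block (figure 10.12a).
    x: (channels, H, W) -> (channels, 2H, 2W)."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def max_unpool_2x2(pooled, argmax_mask):
    """Return each pooled value to the position its maximum came from
    (figure 10.12b); argmax_mask is 1 at those positions, 0 elsewhere."""
    return duplicate_upsample(pooled) * argmax_mask
```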
Figure 10.12 Methods for scaling up representation size (upsampling). a) The
simplest way to double the size of a 2D layer is to duplicate each input four times. b)
In networks where we have previously used a max pooling operation (figure 10.11b),
we can redistribute the values to the same positions they originally came from (i.e.,
where the maxima were). This is known as max unpooling. c) A third option is
bilinear interpolation between the input values.
Figure 10.14 1×1 convolution. To change the number of channels without spatial
pooling, we apply a 1×1 kernel. Each output channel is computed by taking a
weighted sum of all of the channels at the same position, adding a bias, and passing
through an activation function. Multiple output channels are created by repeating this
operation with different weights and biases.
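Since a 1×1 convolution is just a position-wise linear map across channels, it reduces to a single matrix product. In this NumPy sketch, the bias is added and the activation function would follow:

```python
import numpy as np

def conv_1x1(x, Omega, beta):
    """1x1 convolution: each output channel is a weighted sum of all input
    channels at the same position. x: (C_in, H, W), Omega: (C_out, C_in),
    beta: (C_out,)."""
    return np.einsum('oc,chw->ohw', Omega, x) + beta[:, None, None]
```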
10.5 Applications
We conclude by describing three computer vision applications. We describe
convolutional networks for image classification where the goal is to assign
the image to one of a predetermined set of categories. Then we consider
object detection, where the goal is to identify multiple objects in an image
and find the bounding box around each. Finally, we describe an early
system for semantic segmentation where the goal is to assign a label to each
pixel according to which object is present.
Figure 10.15 Example ImageNet classification images. The model aims to assign
an input image to one of 1000 classes. This task is challenging because the images
vary widely along different attributes (columns). These include rigidity (monkey <
canoe), number of instances in image (lizard < strawberry), clutter (compass < steel
drum), size (candle < spiderweb), texture (screwdriver < leopard), distinctiveness of
color (mug < red wine), and distinctiveness of shape (headland < bell). Adapted from
Russakovsky et al. (2015).
Figure 10.16 AlexNet (Krizhevsky et al., 2012). The network maps a 224×224
color image to a 1000-dimensional vector representing class probabilities. The
network first convolves with 11×11 kernels and stride 4 to create 96 channels. It
decreases the resolution again using a max pool operation and applies a 5×5
convolutional layer. Another max pooling layer follows, and three 3×3 convolutional
layers are applied. After a final max pooling operation, the result is vectorized and
passed through three fully connected (FC) layers and finally the softmax layer.
The dataset size was augmented by a factor of 2048 using (i) spatial
transformations and (ii) modifications of the input intensities. At test time,
five different cropped and mirrored versions of the image were run through
the network, and their predictions averaged. The system was learned using
SGD with a momentum coefficient of 0.9 and a batch size of 128. Dropout
was applied in the fully connected layers, and an L2 (weight decay)
regularizer was used. This system achieved a 16.4% top-5 error rate and a
38.1% top-1 error rate. At the time, this was an enormous leap forward in
performance at a task considered far beyond the capabilities of
contemporary methods. This result revealed the potential of deep learning
and kick-started the modern era of AI research.
Notebook 10.5
Convolution for MNIST
Although there were various minor differences in the training regime, the
most important change between AlexNet and VGG was the depth of the
network. The latter used 19 hidden layers and 144 million parameters. The
networks in figures 10.16 and 10.17 are depicted at the same scale for
comparison. There was a general trend for several years for performance on
this task to improve as the depth of the networks increased, and this is
evidence that depth is important in neural networks.
Problem 10.18
The first part of the network is a smaller version of VGG (figure 10.17)
that contains thirteen rather than fifteen convolutional layers and downsizes
the representation to size 14×14. There is then one more max pooling
operation, followed by two fully connected layers that map to two 1D
representations of size 4096. These layers do not represent spatial position
but instead combine information from across the whole image.
Here, the architecture diverges from VGG. Another fully connected layer
reconstitutes the representation into 7×7 spatial positions and 512 channels.
This is followed by a series of max unpooling layers (see figure 10.12b) and
deconvolution layers. These are transposed convolutions (see figure 10.13)
but in 2D and without the upsampling. Finally, there is a 1×1 convolution to
create 21 channels representing the possible classes and a softmax operation
at each spatial position to map the activations to class probabilities. The
downsampling side of the network is sometimes referred to as an encoder,
and the upsampling side as a decoder, so networks of this type are
sometimes called encoder-decoder networks or hourglass networks due to
their shape.
The final segmentation is generated using a heuristic method that
greedily searches for the class that is most represented and infers its region,
taking into account the probabilities but also encouraging connectedness.
Then the next most-represented class is added where it dominates at the
remaining unlabeled pixels. This continues until there is insufficient
evidence to add more (figure 10.20).
Figure 10.20 Semantic segmentation results. The final result is created from the 21
probability maps by greedily selecting the best class and using a heuristic method to
find a sensible binary map based on the probabilities and their spatial proximity. If
there is enough evidence, subsequent classes are added, and their segmentation maps
are combined. Adapted from Noh et al. (2015).
10.6 Summary
In convolutional layers, each hidden unit is computed by taking a weighted
sum of the nearby inputs, adding a bias, and applying an activation
function. The weights and the bias are the same at every spatial position, so
there are far fewer parameters than in a fully connected network, and the
parameters don't increase with the input image size. To ensure that
information is not lost, this operation is repeated with different weights and
biases to create multiple channels at each spatial position.
Typical convolutional networks consist of convolutional layers
interspersed with layers that downsample by a factor of two. As the network
progresses, the spatial dimensions usually decrease by factors of two, and
the number of channels increases by factors of two. At the end of the
network, there are typically one or more fully connected layers that
integrate information from across the entire input and create the desired
output. If the output is an image, a mirrored “decoder” upsamples back to
the original size.
The translational equivariance of convolutional layers imposes a useful
inductive bias that increases performance for image-based tasks relative to
fully connected networks. We described image classification, object
detection, and semantic segmentation networks. Image classification
performance was shown to improve as the network became deeper.
However, subsequent experiments showed that increasing the network
depth indefinitely doesn't continue to help; after a certain depth, the system
becomes difficult to train. This is the motivation for residual connections,
which are the topic of the next chapter.
Notes
Dumoulin & Visin (2016) present an overview of the mathematics of
convolutions that expands on the brief treatment in this chapter.
Convolutional networks: Early convolutional networks were developed
by Fukushima & Miyake (1982), LeCun et al. (1989a), and LeCun et al.
(1989b). Initial applications included handwriting recognition (LeCun et al.,
1989a; Martin, 1993), face recognition (Lawrence et al., 1997), phoneme
recognition (Waibel et al., 1989), spoken word recognition (Bottou et al.,
1990), and signature verification (Bromley et al., 1993). However,
convolutional networks were popularized by LeCun et al. (1998), who built
a system called LeNet for classifying 28×28 grayscale images of
handwritten digits. This is immediately recognizable as a precursor of
modern networks; it uses a series of convolutional layers, followed by fully
connected layers, sigmoid activations rather than ReLUs, and average
pooling rather than max pooling. AlexNet (Krizhevsky et al., 2012) is
widely considered the starting point for modern deep convolutional
networks.
ImageNet Challenge: Deng et al. (2009) collated the ImageNet database, and
the associated classification challenge drove progress in deep learning
for several years after AlexNet. Notable subsequent winners of this
challenge include the network-in-network architecture (Lin et al., 2014),
which alternated convolutions with fully connected layers that operated
independently on all of the channels at each position (i.e., 1×1
convolutions). Zeiler & Fergus (2014) and Simonyan & Zisserman (2014)
trained larger and deeper architectures that were fundamentally similar to
AlexNet. Szegedy et al. (2017) developed an architecture called
GoogLeNet, which introduced inception blocks. These use several parallel
paths with different filter sizes, which are then recombined. This effectively
allowed the system to learn the filter size.
The trend was for performance to improve with increasing depth. However,
it ultimately became difficult to train deeper networks without
modifications; these include residual connections and normalization layers,
both of which are described in the next chapter. Progress in the ImageNet
challenges is summarized in Russakovsky et al. (2015). A more general
survey of image classification using convolutional networks can be found in
Rawat & Wang (2017). The improvement of image classification networks
over time is visualized in figure 10.21.
Problems
Problem 10.1* Show that the operation in equation 10.4 is equivariant with
respect to translation.
Problem 10.2 Equation 10.3 defines 1D convolution with a kernel size of
three, stride of one, and dilation zero. Write out the equivalent equation for
the 1D convolution with a kernel size of three and a stride of two as
pictured in figure 10.3a–b.
Problem 10.3 Write out the equation for the 1D dilated convolution with a
kernel size of three and a dilation rate of one, as pictured in figure 10.3d.
Problem 10.4 Write out the equation for a 1D convolution with kernel size
of seven, a dilation rate of two, and a stride of three.
Problem 10.5 Draw weight matrices in the style of figure 10.4d for (i) the
strided convolution in figure 10.3a–b, (ii) the convolution with kernel size 5
in figure 10.3c, and (iii) the dilated convolution in figure 10.3d.
Problem 10.6* Draw a 6×12 weight matrix in the style of figure 10.4d
relating the inputs x1, …, x6 to the outputs h1, …, h12 in the multi-channel
convolution as depicted in figures 10.5a–b.
Problem 10.7* Draw a 12×6 weight matrix in the style of figure 10.4d
relating the inputs h1, …, h12 to the outputs in the multi-channel
convolution in figure 10.5c.
Problem 10.8 Consider a 1D convolutional network where the input has
three channels. The first hidden layer is computed using a kernel size of
three and has four channels. The second hidden layer is computed using a
kernel size of five and has ten channels. How many biases and how many
weights are needed for each of these two convolutional layers?
Problem 10.9 A network consists of three 1D convolutional layers. At each
layer, a zero-padded convolution with kernel size three, stride one, and
dilation zero is applied. What size is the receptive field of the hidden units
in the third layer?
Problem 10.10 A network consists of three 1D convolutional layers. At
each layer, a zero-padded convolution with kernel size seven, stride one,
and dilation zero is applied. What size is the receptive field of hidden units
in the third layer?
Problem 10.11 Consider a convolutional network with 1D input x. The first
hidden layer H1 is computed using a convolution with kernel size five,
stride two, and a dilation rate of zero. The second hidden layer H2 is
computed using a convolution with kernel size three, stride one, and a
dilation rate of zero. The third hidden layer H3 is computed using a
convolution with kernel size five, stride one, and a dilation rate of one.
What are the receptive field sizes at each hidden layer?
Problem 10.12 The 1D convolutional network in figure 10.7 was trained
using stochastic gradient descent with a learning rate of 0.01 and a batch
size of 100 on a training dataset of 4,000 examples for 100,000 steps. How
many epochs was the network trained for?
Problem 10.13 Draw a weight matrix in the style of figure 10.4d that shows
the relationship between the 24 inputs and the 24 outputs in figure 10.9.
Problem 10.14 Consider a 2D convolutional layer with kernel size 5×5 that
takes 3 input channels and returns 10 output channels. How many
convolutional weights are there? How many biases?
Problem 10.15 Draw a weight matrix in the style of figure 10.4d that
samples every other variable in a 1D input (i.e., the 1D analog of figure
10.11a). Show that the weight matrix for 1D convolution with kernel size three
and stride two is equivalent to composing the matrices for 1D convolution
with kernel size three (and stride one) and this sampling matrix.
Problem 10.16* Consider the AlexNet network (figure 10.16). How many
parameters are used in each convolutional and fully connected layer? What
is the total number of parameters?
Problem 10.17 What is the receptive field size at each of the first three
layers of AlexNet (figure 10.16)?
Problem 10.18 How many weights and biases are there at each
convolutional layer and fully connected layer in the VGG architecture
(figure 10.17)?
Problem 10.19* Consider two hidden layers of size 224×224 with C1 and
C2 channels, respectively, connected by a 3×3 convolutional layer. Describe
how to initialize the weights using He initialization.
1 Strictly speaking, this is a cross-correlation and not a convolution, in which the weights would
be flipped relative to the input (so we would switch xi−1 with xi+1). Regardless, this (incorrect)
definition is the usual convention in machine learning.
OceanofPDF.com
Chapter 11
Residual networks
A standard sequential network with three hidden layers computes:

h1 = f1[x, ϕ1]
h2 = f2[h1, ϕ2]
h3 = f3[h2, ϕ3]
y = f4[h3, ϕ4],    (11.1)
where h1, h2, and h3 denote the intermediate hidden layers, x is the network
input, y is the output, and the functions fk[•, ϕk] perform the processing.
Figure 11.1 Sequential processing. Standard neural networks pass the output of
each layer directly into the next layer.
Substituting each expression into the next, we can equivalently describe the
network as a single composed function:

y = f4[f3[f2[f1[x, ϕ1], ϕ2], ϕ3], ϕ4].    (11.2)
Figure 11.3 Shattered gradients. a) Consider a shallow network with 200 hidden
units and Glorot initialization (He initialization without the factor of two) for both the
weights and biases. The gradient ∂y/∂x of the scalar network output y with respect to
the scalar input x changes relatively slowly as we change the input x. b) For a deep
network with 24 layers and 200 hidden units per layer, this gradient changes very
quickly and unpredictably. c) The autocorrelation function of the gradient shows that
nearby gradients become unrelated (have autocorrelation close to zero) for deep
networks. This shattered gradients phenomenon may explain why it is hard to train
deep networks. Gradient descent algorithms rely on the loss surface being relatively
smooth, so the gradients should be related before and after each update step. Adapted
from Balduzzi et al. (2017).
For example, the derivative of the output y with respect to the output of the
first layer is given by the chain rule:

∂y/∂f1 = (∂f4/∂f3)·(∂f3/∂f2)·(∂f2/∂f1).    (11.3)
When we change the parameters that determine f1, all of the derivatives in
this sequence can change since layers f2, f3, and f4 are themselves computed
from f1. Consequently, the updated gradient at each training example may
be completely different, and the loss function becomes badly behaved.1
Residual or skip connections branch the computation, adding the input of each
function back to its output:

h1 = x + f1[x, ϕ1]
h2 = h1 + f2[h1, ϕ2]
h3 = h2 + f3[h2, ϕ3]
y = h3 + f4[h3, ϕ4],    (11.4)
where the first term on the right-hand side of each line is the residual
connection. Each function fk learns an additive change to the current
representation. It follows that their outputs must be the same size as their
inputs. Each additive combination of the input and the processed output is
known as a residual block or residual layer.
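In code, a residual block simply wraps an arbitrary function in a skip connection. This PyTorch sketch is one illustrative way to express equation 11.4:

```python
import torch.nn as nn

class ResidualLayer(nn.Module):
    """Adds the output of a function f back to its input (equation 11.4);
    f must return an output of the same size as its input."""
    def __init__(self, f):
        super().__init__()
        self.f = f

    def forward(self, h):
        return h + self.f(h)  # skip connection plus additive change
```

Chaining four such layers reproduces the four lines of equation 11.4.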
Figure 11.4 Residual connections. a) The output of each function fk[x, ϕk] is added
back to its input, which is passed via a parallel computational path called a residual
or skip connection. Hence, the function computes an additive change to the
representation. b) Upon expanding (unraveling) the network equations, we find that
the output is the sum of the input plus four smaller networks (depicted in white,
orange, gray, and cyan, respectively, and corresponding to terms in equation 11.5);
we can think of this as an ensemble of networks. Moreover, the output from the cyan
network is itself a transformation f4[•, ϕ4] of another ensemble, and so on.
Alternatively, we can consider the network as a combination of 16 different paths
through the computational graph. One example is the dashed path from input x to
output y, which is the same in panels (a) and (b).
Once more, we can write this as a single function by substituting in the
expressions for the intermediate quantities hk:
y = x + f1[x]
  + f2[x + f1[x]]
  + f3[x + f1[x] + f2[x + f1[x]]]
  + f4[x + f1[x] + f2[x + f1[x]] + f3[x + f1[x] + f2[x + f1[x]]]],    (11.5)
where we have omitted the parameters ϕ• for clarity. We can think of this
equation as “unraveling” the network (figure 11.4b). We see that the final
network output is a sum of the input and four smaller networks,
corresponding to each line of the equation; one interpretation is that residual
connections turn the original network into an ensemble of these smaller
networks whose outputs are summed to compute the result.
Problem 11.1
The derivative of the output with respect to the first function f1 now contains
one term for every path through the unraveled network:

∂y/∂f1 = 1 + ∂f2/∂f1 + ∂f3/∂f1 + ∂f4/∂f1 + (∂f3/∂f2)(∂f2/∂f1)
       + (∂f4/∂f2)(∂f2/∂f1) + (∂f4/∂f3)(∂f3/∂f1) + (∂f4/∂f3)(∂f3/∂f2)(∂f2/∂f1),    (11.6)
Problem 11.2
Problem 11.3
where there is one term for each of the eight paths. The identity term on the
right-hand side shows that changes in the parameters ϕ1 in the first layer
f1[x, ϕ1] contribute directly to changes in the network output y. They also
contribute indirectly through the other chains of derivatives of varying
lengths. In general, gradients through shorter paths will be better behaved.
Since both the identity term and various short chains of derivatives will
contribute to the derivative for each layer, networks with residual links
suffer less from shattered gradients.
Notebook 11.2
Residual networks
Figure 11.5 Order of operations in residual blocks. a) The usual order of linear
transformation or convolution followed by a ReLU nonlinearity means that each
residual block can only add non-negative quantities. b) With the reverse order, both
positive and negative quantities can be added. However, we must add a linear
transformation at the start of the network in case the input is all negative. c) In
practice, it's common for a residual block to contain several network layers.
Batch normalization or BatchNorm shifts and rescales each hidden unit activation
based on statistics computed across the current batch of I examples. First, the
empirical mean mh and standard deviation sh are computed:

mh = (1/I) Σi hi
sh = √[(1/I) Σi (hi − mh)²],    (11.7)
where all quantities are scalars. Then we use these statistics to standardize
the batch activations to have mean zero and unit variance:
hi ← (hi − mh) / (sh + ϵ),    (11.8)
where ϵ is a small number that prevents division by zero if hi is the same for
every member of the batch and sh = 0.
Appendix C.2.4
Standardization
The standardized activations are then rescaled by a learned scale γ and shifted
by a learned offset δ:

hi ← γhi + δ.    (11.9)

After this operation, the activations have mean δ and standard deviation γ
across all members of the batch. Both of these quantities are learned during
training.
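A NumPy sketch of the whole forward pass (gamma, delta, and eps stand for γ, δ, and ϵ):

```python
import numpy as np

def batchnorm_scalar(h, gamma, delta, eps=1e-5):
    """BatchNorm for a batch of scalar activations h (equations 11.7-11.9)."""
    m_h = h.mean()                     # batch mean (equation 11.7)
    s_h = h.std()                      # batch standard deviation
    h = (h - m_h) / (s_h + eps)        # standardize (equation 11.8)
    return gamma * h + delta           # learned scale and shift (equation 11.9)
```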
Problem 11.5
Notebook 11.3
BatchNorm
Higher learning rates: Empirical studies and theory both show that batch
normalization makes the loss surface and its gradient change more
smoothly (i.e., reduces shattered gradients). This means we can use higher
learning rates as the surface is more predictable. We saw in section 9.2 that
higher learning rates improve test performance.
11.5.1 ResNet
Residual blocks were first used in convolutional networks for image
classification. The resulting networks are known as residual networks, or
ResNets for short. In ResNets, each residual block contains a batch
normalization operation, a ReLU activation function, and a convolutional
layer. This is followed by the same sequence again before being added back
to the input (figure 11.7a). Trial and error has shown that this order of
operations works well for image classification.
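A PyTorch sketch of the residual branch in figure 11.7a (the channel count is an assumption for illustration; the block output is the input plus this branch):

```python
import torch.nn as nn

def resnet_branch(channels):
    """BatchNorm, ReLU, 3x3 convolution, repeated twice (figure 11.7a)."""
    return nn.Sequential(
        nn.BatchNorm2d(channels), nn.ReLU(),
        nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        nn.BatchNorm2d(channels), nn.ReLU(),
        nn.Conv2d(channels, channels, kernel_size=3, padding=1),
    )

# The full residual block computes x + resnet_branch(channels)(x).
```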
Problem 11.7
Figure 11.7 ResNet blocks. a) A standard block in the ResNet architecture contains
a batch normalization operation, followed by an activation function, and a 3×3
convolutional layer. Then, this sequence is repeated. b) A bottleneck ResNet block
still integrates information over a 3×3 region but uses fewer parameters. It contains
three convolutions. The first 1×1 convolution reduces the number of channels. The
second 3×3 convolution is applied to the smaller representation. A final 1×1
convolution increases the number of channels again so that it can be added back to
the input.
The ResNet-200 model (figure 11.8) contains 200 layers and was used
for image classification on the ImageNet database (figure 10.15). The
architecture resembles AlexNet and VGG but uses bottleneck residual
blocks instead of vanilla convolutional layers. As with AlexNet and VGG,
these are periodically interspersed with decreases in spatial resolution and
simultaneous increases in the number of channels. Here, the resolution is
decreased by downsampling using convolutions with stride two. The
number of channels is increased either by appending zeros to the
representation or by using an extra 1×1 convolution. At the start of the
network is a 7×7 convolutional layer, followed by a downsampling
operation. At the end, a fully connected layer maps the block to a vector of
length 1000. This is passed through a softmax layer to generate class
probabilities.
Figure 11.8 ResNet-200 model. A standard 7×7 convolutional layer with stride two
is applied, followed by a MaxPool operation. A series of bottleneck residual blocks
follow (number in brackets is channels after first 1×1 convolution), with periodic
downsampling and accompanying increases in the number of channels. The network
concludes with average pooling across all spatial positions and a fully connected
layer that maps to pre-softmax activations.
The ResNet-200 model achieved a remarkable 4.8% top-5 error rate and a 20.1%
top-1 error rate. This compared favorably with AlexNet (16.4%, 38.1%) and
VGG (6.8%, 23.7%) and was one of the first networks to exceed human
performance (5.1% top-5 error). However, this model was conceived in 2016
and is far from state-of-the-art. At the time of writing, the best-performing
model on this task has a 9.0% top-1 error (see figure 10.21). This and all the other
current top-performing models for image classification are now based on
transformers (see chapter 12).
11.5.2 DenseNet
Residual blocks receive the output from the previous layer, modify it by
passing it through some network layers, and add it back to the original
input. An alternative is to concatenate the modified and original signals.
This increases the representation size (in terms of channels for a
convolutional network), but an optional subsequent linear transformation
can map back to the original size (a 1×1 convolution for a convolutional
network). This allows the model to add the representations together, take a
weighted sum, or combine them in a more complex way.
The DenseNet architecture uses concatenation so that the input to a layer
comprises the concatenated outputs from all previous layers (figure 11.9).
These are processed to create a new representation that is itself
concatenated with the previous representation and passed to the next layer.
This concatenation means there is a direct contribution from earlier layers
to the output, so the loss surface behaves reasonably.
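A minimal PyTorch sketch of this concatenation pattern; it assumes each supplied layer accepts the growing number of input channels:

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Each layer receives the channel-wise concatenation of the original
    input and the outputs of all previous layers (figure 11.9)."""
    def __init__(self, layers):
        super().__init__()
        self.layers = nn.ModuleList(layers)

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            features.append(layer(torch.cat(features, dim=1)))
        return torch.cat(features, dim=1)
```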
In practice, this can only be sustained for a few layers because the
number of channels (and hence the number of parameters required to
process them) becomes increasingly large. This problem can be alleviated
by applying a 1×1 convolution to reduce the number of channels before the
next 3×3 convolution is applied. In a convolutional network, the input is
periodically downsampled. Concatenation across the downsampling makes
no sense since the representations have different sizes. Consequently, the
chain of concatenation is broken at this point, and a smaller representation
starts a new chain. In addition, another bottleneck 1×1 convolution can be
applied when the downsampling occurs to control the representation size
further.
This network performs competitively with ResNet models on image
classification (see figure 10.21); indeed, it can perform better for a
comparable parameter count. This is presumably because it can reuse
processing from earlier layers more flexibly.
The U-Net was intended for segmenting medical images (figure 11.11)
but has found many other uses in computer graphics and vision. Hourglass
networks are similar but apply further convolutional layers in the skip
connections and add the result back to the decoder rather than concatenating
it. A series of these models form a stacked hourglass network that alternates
between considering the image at local and global levels. Such networks are
used for pose estimation (figure 11.12). The system is trained to predict one
“heatmap” for each joint, and the estimated position is the maximum of
each heatmap.
Figure 11.11 Segmentation using U-Net in 3D. a) Three slices through a 3D
volume of mouse cortex taken by scanning electron microscope. b) A single U-Net is
used to classify voxels as being inside or outside neurites. Connected regions are
identified with different colors. c) For a better result, an ensemble of five U-Nets is
trained, and a voxel is only classified as belonging to the cell if all five networks
agree. Adapted from Falk et al. (2019).
Figure 11.12 Stacked hourglass networks for pose estimation. a) The network input
is an image containing a person, and the output is a set of heatmaps, with one
heatmap for each joint. This is formulated as a regression problem where the targets
are heatmap images with small, highlighted regions at the ground-truth joint
positions. The peak of the estimated heatmap is used to establish each final joint
position. b) The architecture consists of initial convolutional and residual layers
followed by a series of hourglass blocks. c) Each hourglass block consists of an
encoder-decoder network similar to the U-Net except that the convolutions use zero
padding, some further processing is done in the residual links, and these links add
this processed representation rather than concatenate it. Each blue cuboid is itself a
bottleneck residual block (figure 11.7b). Adapted from Newell et al. (2016).
11.7 Summary
Increasing network depth indefinitely causes both training and test
performance for image classification to decrease. This may be because the
gradient of the loss with respect to parameters early in the network changes
quickly and unpredictably relative to the update step size. Residual
connections add the processed representation back to their own input. Now
each layer contributes directly to the output as well as indirectly, so
propagating gradients through many layers is not mandatory, and the loss
surface is smoother.
Residual networks don't suffer from vanishing gradients but introduce an
exponential increase in the variance of the activations during forward
propagation and corresponding problems with exploding gradients. This is
usually handled by adding batch normalization, which compensates for the
empirical mean and variance of the batch and then shifts and rescales using
learned parameters. If these parameters are initialized judiciously, very deep
networks can be trained. There is evidence that both residual links and
batch normalization make the loss surface smoother, which permits larger
learning rates. Moreover, the variability in the batch statistics adds a source
of regularization.
Residual blocks have been incorporated into convolutional networks.
They allow deeper networks to be trained with commensurate increases in
image classification performance. Variations of residual networks include
the DenseNet architecture, which concatenates outputs of all prior layers to
feed into the current layer, and U-Nets, which incorporate residual
connections into encoder-decoder models.
Notes
Residual connections: Residual connections were introduced by He et al.
(2016a), who built a network with 152 layers, which was eight times larger
than VGG (figure 10.17), and achieved state-of-the-art performance on the
ImageNet classification task. Each residual block consisted of a
convolutional layer followed by batch normalization, a ReLU activation, a
second convolutional layer, and second batch normalization. A second
ReLU function was applied after this block was added back to the main
representation. This architecture was termed ResNet v1. He et al. (2016b)
investigated different variations of residual architectures, in which either (i)
processing could also be applied along the skip connection or (ii) after the
two branches had recombined. They concluded neither was necessary,
leading to the architecture in figure 11.7, which is sometimes termed a pre-
activation residual block and is the backbone of ResNet v2. They trained a
network with 200 layers that improved further on the ImageNet
classification task (see figure 11.8). Since this time, new methods for
regularization, optimization, and data augmentation have been developed,
and Wightman et al. (2021) exploit these to present a more modern training
pipeline for the ResNet architecture.
Why residual connections help: Residual networks certainly allow
deeper networks to be trained. Presumably, this is related to reducing
shattered gradients (Balduzzi et al., 2017) at the start of training and the
smoother loss surface near the minima as depicted in figure 11.13 (Li et al.,
2018b). Residual connections alone (i.e., without batch normalization)
increase the trainable depth of a network by roughly a factor of two
(Sankararaman et al., 2020). With batch normalization, very deep networks
can be trained, but it is unclear that depth is critical for performance.
Zagoruyko & Komodakis (2016) showed that wide residual networks with
only 16 layers outperformed all residual networks of the time for image
classification. Orhan & Pitkow (2017) propose a different explanation for
why residual connections improve learning in terms of eliminating
singularities (places on the loss surface where the Hessian is degenerate).
Related architectures: Residual connections are a special case of
highway networks (Srivastava et al., 2015) which also split the computation
into two branches and additively recombine. Highway networks use a
gating function that weights the inputs to the two branches in a way that
depends on the data itself, whereas residual networks send the data down
both branches in a straightforward manner. Xie et al. (2017) introduced the
ResNeXt architecture, which places a residual connection around multiple
parallel convolutional branches.
Residual networks as ensembles: Veit et al. (2016) characterized
residual networks as ensembles of shorter networks and depicted the
“unraveled network” interpretation (figure 11.4b). They provide evidence
that this interpretation is valid by showing that deleting layers in a trained
network (and hence a subset of paths) only has a modest effect on
performance. Conversely, removing a layer in a purely sequential network
like VGG is catastrophic. They also looked at the gradient magnitudes
along paths of different lengths and showed that the gradient vanishes in
longer paths. In a residual network consisting of 54 blocks, almost all of the
gradient updates during training were from paths of length 5 to 17 blocks
long, even though these only constitute 0.45% of the total paths. It seems
that adding more blocks effectively adds more parallel shorter paths rather
than creating a network that is truly deeper.
Regularization for residual networks: L2 regularization of the weights
has a fundamentally different effect in vanilla networks and residual
networks without BatchNorm. In the former, it encourages the output of the
layer to be a constant function determined by the biases. In the latter, it
encourages the residual block to compute the identity plus a constant
determined by the biases.
Several regularization methods have been developed that are targeted
specifically at residual architectures. ResDrop (Yamada et al., 2016),
stochastic depth (Huang et al., 2016), and RandomDrop (Yamada et al.,
2019) all regularize residual networks by randomly dropping residual
blocks during the training process. In the latter case, the propensity for
dropping a block is determined by a Bernoulli variable, whose parameter is
linearly decreased during training. At test time, the residual blocks are
added back in with their expected probability. These methods are effectively
versions of dropout, in which all the hidden units in a block are
simultaneously dropped in concert. In the multiple paths view of residual
networks (figure 11.4b), they simply remove some of the paths at each
training step. Wu et al. (2018b) developed BlockDrop, which analyzes an
existing network and decides which residual blocks to use at runtime with
the goal of improving the efficiency of inference.
Other regularization methods have been developed for networks with
multiple paths inside the residual block. Shake-shake (Gastaldi, 2017a,b)
randomly re-weights the paths during the forward and backward passes. In
the forward pass, this can be viewed as synthesizing random data, and in the
backward pass, as injecting another form of noise into the training method.
ShakeDrop (Yamada et al., 2019) draws a Bernoulli variable that decides
whether each block will be subject to Shake-Shake or behave like a
standard residual unit on this training step.
Batch normalization: Batch normalization was introduced by Ioffe &
Szegedy (2015) outside of the context of residual networks. They showed
empirically that it allowed higher learning rates, increased convergence
speed, and made sigmoid activation functions more practical (since the
distribution of outputs is controlled, so examples are less likely to fall in the
saturated extremes of the sigmoid). Balduzzi et al. (2017) investigated the
activation of hidden units in later layers of deep networks with ReLU
functions at initialization. They showed that many such hidden units were
always active or always inactive regardless of the input but that BatchNorm
reduced this tendency.
Although batch normalization helps stabilize the forward propagation of
signals through a network, Yang et al. (2019) showed that it causes gradient
explosion in ReLU networks without skip connections, with each layer
increasing the magnitude of the gradients by a factor of √(π/(π−1)) ≈ 1.21. This
argument is summarized by Luther (2020). Since a residual network can be
seen as a combination of paths of different lengths (figure 11.4), this effect
must also be present in residual networks. Presumably, however, the benefit
of removing the 2^K increase in magnitude in the forward pass of a network
with K layers outweighs the harm done by increasing the gradients by a factor
of 1.21^K in the backward pass, so overall BatchNorm makes training more stable.
Variations of batch normalization: Several variants of BatchNorm have
been proposed (figure 11.14). BatchNorm normalizes each channel
separately based on statistics gathered across the batch. Ghost batch
normalization or GhostNorm (Hoffer et al., 2017) uses only part of the
batch to compute the normalization statistics, which makes them noisier and
increases the amount of regularization when the batch size is very large
(figure 11.14b).
When the batch size is very small or the fluctuations within a batch are very
large (as is often the case in natural language processing), the statistics in
BatchNorm may become unreliable. Ioffe (2017) proposed batch
renormalization, which keeps a running average of the batch statistics and
modifies the normalization of any batch to ensure that it is more
representative. Another problem is that batch normalization is unsuitable for
use in recurrent neural networks (networks for processing sequences, in
which the previous output is fed back as an additional input as we move
through the sequence (see figure 12.19). Here, the statistics must be stored
at each step in the sequence, and it's unclear what to do if a test sequence is
longer than the training sequences. A third problem is that batch
normalization needs access to the whole batch. However, this may not be
easily available when training is distributed across several machines.
Layer normalization or LayerNorm (Ba et al., 2016) avoids using batch
statistics by normalizing each data example separately, using statistics
gathered across the channels and spatial position (figure 11.14c). However,
there is still a separate learned scale γ and offset δ per channel. Group
normalization or GroupNorm (Wu & He, 2018) is similar to LayerNorm but
divides the channels into groups and computes the statistics for each group
separately across the within-group channels and the spatial positions (figure
11.14d). Again, there are still separate scale and offset parameters per
channel. Instance normalization or InstanceNorm (Ulyanov et al., 2016)
takes this to the extreme where the number of groups is the same as the
number of channels, so each channel is normalized separately (figure
11.14e), using statistics gathered across spatial position alone. Salimans &
Kingma (2016) investigated normalizing the network weights rather than
the activations, but this has been less empirically successful. Teye et al.
(2018) introduced Monte Carlo batch normalization, which can provide
meaningful estimates of uncertainty in the predictions of neural networks. A
recent comparison of the properties of different normalization schemes can
be found in Lubana et al. (2021).
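The variants are easiest to compare by the axes over which their statistics are computed. This PyTorch sketch (with assumed tensor sizes) shows the mean for each scheme; the standard deviations are taken over the same axes:

```python
import torch

x = torch.randn(8, 64, 32, 32)   # (batch, channels, height, width)

mu_batch    = x.mean(dim=(0, 2, 3), keepdim=True)  # BatchNorm: per channel, across batch
mu_layer    = x.mean(dim=(1, 2, 3), keepdim=True)  # LayerNorm: per example
mu_instance = x.mean(dim=(2, 3), keepdim=True)     # InstanceNorm: per example and channel

g = x.reshape(8, 8, 8, 32, 32)                     # GroupNorm: 8 groups of 8 channels
mu_group = g.mean(dim=(2, 3, 4), keepdim=True)
```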
Why BatchNorm helps: BatchNorm helps control the initial gradients in a
residual network (figure 11.6c). However, the mechanism by which
BatchNorm improves performance is not well understood. The stated goal
of Ioffe & Szegedy (2015) was to reduce problems caused by internal
covariate shift, which is the change in the distribution of inputs to a layer
caused by updating preceding layers during the backpropagation update.
However, Santurkar et al. (2018) provided evidence against this view by
artificially inducing covariate shift and showing that networks with and
without BatchNorm performed equally well.
Motivated by this, they searched for another explanation for why
BatchNorm should improve performance. They showed empirically for the
VGG network that adding batch normalization decreases the variation in
both the loss and its gradient as we move in the gradient direction. In other
words, the loss surface is both smoother and changes more slowly, which is
why larger learning rates are possible. They also provide theoretical proofs
for both these phenomena and show that for any parameter initialization,
the distance to the nearest optimum is less for networks with batch
normalization. Bjorck et al. (2018) also argue that BatchNorm improves the
properties of the loss landscape and allows larger learning rates.
Other explanations of why BatchNorm improves performance include
decreasing the importance of tuning the learning rate (Ioffe & Szegedy,
2015; Arora et al., 2018). Indeed Li & Arora (2019) show that using an
exponentially increasing learning rate schedule is possible with batch
normalization. Ultimately, this is because batch normalization makes the
network invariant to the scales of the weight matrices (see Huszár, 2019, for
an intuitive visualization).
Hoffer et al. (2017) identified that BatchNorm has a regularizing effect due
to statistical fluctuations from the random composition of the batch. They
proposed using a ghost batch size, in which the mean and standard
deviation statistics are computed from a subset of the batch. Large batches
can now be used without losing the regularizing effect of the extra noise in
smaller batch sizes. Luo et al. (2018) investigate the regularization effects
of batch normalization.
Alternatives to batch normalization: Although BatchNorm is widely
used, it is not strictly necessary to train deep residual nets; there are other
ways of making the loss surface tractable. Balduzzi et al. (2017) proposed
the rescaling by 1/√2 in figure 11.6b; they argued that it prevents gradient
explosion but does not resolve the problem of shattered gradients.
Other work has investigated rescaling the function's output in the residual
block before adding it back to the input. For example, De & Smith (2020)
introduce SkipInit, in which a learnable scalar multiplier is placed at the end
of each residual branch. This helps if this multiplier is initialized to less
than 1/√K, where K is the number of residual blocks. In practice, they
suggest initializing this to zero. Similarly, Hayou et al. (2021) introduce
Stable ResNet, which rescales the output of the function in the kth residual
block (before addition to the main branch) by a constant λk. They prove that
in the limit of infinite width, the expected gradient norm of the weights in
the first layer is lower bounded by the sum of squares of the scalings λk.
They investigate setting these to a constant λk = 1/√K, where K is the number of
residual blocks and show that it is possible to train networks with up to
1000 blocks.
Zhang et al. (2019a) introduce FixUp, in which every layer is initialized
using He normalization, but the last linear/convolutional layer of every
residual block is set to zero. Now the initial forward pass is stable (since
each residual block contributes nothing), and the gradients do not explode
in the backward pass (for the same reason). They also rescale the branches
so that the magnitude of the total expected change in the parameters is
constant regardless of the number of residual blocks. These methods allow
training of deep residual networks but don't usually achieve the same test
performance as when using BatchNorm. This is probably because they do
not benefit from the regularization induced by the noisy batch statistics. De
& Smith (2020) modify their method to induce regularization via dropout,
which helps close this gap.
DenseNet and U-Net: DenseNet was first introduced by Huang et al.
(2017b), U-Net was developed by Ronneberger et al. (2015), and stacked
hourglass networks by Newell et al. (2016). Of these architectures, U-Net
has been the most extensively adapted. Çiçek et al. (2016) introduced 3D
U-Net, and Milletari et al. (2016) introduced V-Net, both of which extend
U-Net to process 3D data. Zhou et al. (2018) combine the ideas of
DenseNet and U-Net in an architecture that downsamples and re-upsamples
the image but also repeatedly uses intermediate representations. U-Nets are
commonly used in medical image segmentation (see Siddique et al., 2021,
for a review). However, they have been applied to other areas, including
depth estimation (Garg et al., 2016), semantic segmentation (Iglovikov &
Shvets, 2018), inpainting (Zeng et al., 2019), pansharpening (Yao et al.,
2018), and image-to-image translation (Isola et al., 2017). U-Nets are also a
key component in diffusion models (chapter 18).
Problems
Problem 11.1 Derive equation 11.5 from the network definition in equation
11.4.
Problem 11.2 Unraveling the four-block network in figure 11.4a produces
one path of length zero, four paths of length one, six paths of length two,
four paths of length three, and one path of length four. How many paths of
each length would there be with (i) three residual blocks and (ii) five
residual blocks? Deduce the rule for K residual blocks.
Problem 11.3 Show that the derivative of the network in equation 11.5 with
respect to the first layer f1[x] is given by equation 11.6.
Problem 11.4* Explain why the values in the two branches of the residual
blocks in figure 11.6a are uncorrelated. Show that the variance of the sum
of uncorrelated variables is the sum of their individual variances.
Problem 11.5* The forward pass for batch normalization given a batch of
scalar values consists of the following operations (figure 11.15):
mh = (1/I) Σi hi
sh = √[(1/I) Σi (hi − mh)²]
h̃i = (hi − mh) / (sh + ϵ)
h′i = γ·h̃i + δ    (11.10)
Figure 11.15 Computational graph for batch normalization (see problem 11.5).
Problem 11.6 Consider a fully connected neural network with one input,
one output, and ten hidden layers, each of which contains twenty hidden
units. How many parameters does this network have? How many
parameters will it have if we place a batch normalization operation between
each linear transformation and ReLU?
Problem 11.7* Consider applying an L2 regularization penalty to the
weights in the convolutional layers in figure 11.7a, but not to the scaling
parameters of the subsequent BatchNorm layers. What do you expect will
happen as training proceeds?
Problem 11.8 Consider a convolutional residual block that contains a batch
normalization operation, followed by a ReLU activation function, and then
a 3×3 convolutional layer. If the input and output both have 512 channels,
how many parameters are needed to define this block? Now consider a
bottleneck residual block that contains three batch
normalization/ReLU/convolution sequences. The first uses a 1×1
convolution to reduce the number of channels from 512 to 128. The second
uses a 3×3 convolution with the same number of input and output channels.
The third uses a 1×1 convolution to increase the number of channels from
128 to 512 (see figure 11.7b). How many parameters are needed to define
this block?
Problem 11.9 The U-Net is completely convolutional and can be run with
any sized image after training. Why do we not train with a collection of
arbitrarily-sized images?
1 In equations 11.3 and 11.6, we overload notation to define fk as the output of the function fk[•].
OceanofPDF.com
Chapter 12
Transformers
A self-attention block takes N inputs x1, …, xN, each of dimension D, and
returns N outputs of the same dimension. A value is first computed from each
input:

vm = βv + Ωvxm,    (12.2)

where βv and Ωv are learned parameters. The nth output san[x1, …, xN] is then
a weighted sum of all of the values:

san[x1, …, xN] = Σm=1..N a[xm, xn]·vm.    (12.3)
The scalar weight a[xm, xn] is the attention that the nth output pays to input
xm. The N weights a[•, xn] are non-negative and sum to one. Hence, self-
attention can be thought of as routing the values in different proportions to
create each output (figure 12.1).
Figure 12.1 Self-attention as routing. The self-attention mechanism takes N inputs
x1, …, xN ∈ ℝD (here N = 3 and D = 4) and processes each separately to compute N
value vectors. The nth output san[x1, … xN] (written as san[x•] for short) is then
computed as a weighted sum of the N value vectors, where the weights are positive
and sum to one. a) Output sa1[x•] is computed as a[x1, x1] = 0.1 times the first value
vector, a[x2, x1] = 0.3 times the second value vector, and a[x3, x1] = 0.6 times the
third value vector. b) Output sa2[x•] is computed in the same way, but this time with
weights of 0.5, 0.2, and 0.3. c) The weighting for output sa3[x•] is different again.
Each output can hence be thought of as a different routing of the N values.
The attention weights a[xm, xn] combine the values from different inputs.
They are also sparse since there is only one weight for each ordered pair of
inputs (xm, xn), regardless of the size of these inputs (figure 12.2c). It
follows that the number of attention weights has a quadratic dependence on
the sequence length N, but is independent of the length D of each input xn.
Problem 12.1
The attention weights are computed from the inputs themselves. First, two more
linear transformations of the inputs are computed:

qn = βq + Ωqxn
km = βk + Ωkxm,    (12.4)
where {qn} and {km} are termed queries and keys, respectively. Then we
compute dot products between the queries and keys and pass the results
through a softmax function:
a[xm, xn] = softmaxm[kmᵀqn] = exp[kmᵀqn] / Σm′=1..N exp[km′ᵀqn],    (12.5)
so for each xn, they are positive and sum to one (figure 12.3). For obvious
reasons, this is known as dot-product self-attention.
Appendix B.3.4
Dot product
Figure 12.3 Computing attention weights. a) Query vectors qn = βq + Ωqxn and key
vectors kn = βk + Ωkxn are computed for each input xn. b) The dot products between
each query and the three keys are passed through a softmax function to form non-
negative attentions that sum to one. c) These route the value vectors (figure 12.1) via
the sparse matrix from figure 12.2c.
The names “queries” and “keys” were inherited from the field of
information retrieval and have the following interpretation: the dot product
operation returns a measure of similarity between its inputs, so the weights
a[x•, xn] depend on the relative similarities between the nth query and all of
the keys. The softmax function means that the key vectors “compete” with
one another to contribute to the final result. The queries and keys must have
the same dimensions. However, these can differ from the dimension of the
values, which is usually the same size as the input, so the representation
doesn't change size.
Problem 12.2
12.2.3 Self-attention summary
The nth output is a weighted sum of the same linear transformation v• = βv +
Ωvx• applied to all of the inputs, where these attention weights are positive
and sum to one. The weights depend on a measure of similarity between
input xn and the other inputs. There is no activation function, but the
mechanism is nonlinear due to the dot-product and a softmax operation
used to compute the attention weights.
Note that this mechanism fulfills the initial requirements. First, there is a
single shared set of parameters ϕ = {βv, Ωv, βq, Ωq, βk, Ωk}. This is
independent of the number of inputs N, so the network can be applied to
different sequence lengths. Second, there are connections between the
inputs (words), and the strength of these connections depends on the inputs
themselves via the attention weights.
By storing the N inputs xn in the columns of the D × N matrix X, the values,
queries, and keys for the whole sequence can be computed compactly as:

V[X] = βv1ᵀ + ΩvX
Q[X] = βq1ᵀ + ΩqX
K[X] = βk1ᵀ + ΩkX,    (12.6)

where 1 is an N × 1 vector containing ones. The full self-attention computation
is then:

Sa[X] = V[X] · Softmax[K[X]ᵀQ[X]],    (12.7)
where the function Softmax[•] takes a matrix and performs the softmax
operation independently on each of its columns (figure 12.4). In this
formulation, we have explicitly included the dependence of the values,
queries, and keys on the input X to emphasize that self-attention computes a
kind of triple product based on the inputs. However, from now on, we will
drop this dependence and just write:
Sa[X] = V · Softmax[KᵀQ].    (12.8)
Notebook 12.1
Self-attention
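The whole mechanism is compact in NumPy. In this sketch (parameter shapes are assumptions: each Ω is D×D and each β is a D×1 column), the N inputs are the columns of X:

```python
import numpy as np

def softmax_cols(A):
    """Softmax applied independently to each column of A."""
    e = np.exp(A - A.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def self_attention(X, beta_v, Omega_v, beta_q, Omega_q, beta_k, Omega_k):
    """Dot-product self-attention (equation 12.8)."""
    V = beta_v + Omega_v @ X          # values (equation 12.2)
    Q = beta_q + Omega_q @ X          # queries
    K = beta_k + Omega_k @ X          # keys (equation 12.4)
    return V @ softmax_cols(K.T @ Q)  # route the values by attention
```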
The magnitudes of the dot products grow with the dimension Dq of the queries
and keys, which can move the inputs of the softmax function into a region where
the largest value dominates and the gradients become small. Hence, the dot
products are usually scaled, giving scaled dot-product self-attention:

Sa[X] = V · Softmax[KᵀQ / √Dq].    (12.9)
Problem 12.4
In multi-head self-attention, the computation is replicated H times, with the
hth head computing:

Sah[X] = Vh · Softmax[KhᵀQh],    (12.10)
where we have different parameters {βvh, Ωvh}, {βqh, Ωqh}, and {βkh, Ωkh}
for each head. Typically, if the dimension of the inputs xm is D and there are
H heads, the values, queries, and keys will all be of size D/H, as this allows
for an efficient implementation. The outputs of these self-attention
mechanisms are vertically concatenated, and another linear transform Ωc is
applied to combine them (figure 12.6):
MhSa[X] = Ωc[Sa1[X]ᵀ, Sa2[X]ᵀ, …, SaH[X]ᵀ]ᵀ.    (12.12)
Problem 12.5
Figure 12.6 Multi-head self-attention. Self-attention occurs in parallel across
multiple “heads.” Each has its own queries, keys, and values. Here two heads are
depicted, in the cyan and orange boxes, respectively. The outputs are vertically
concatenated, and another linear transformation Ωc is used to recombine them.
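Continuing the earlier NumPy sketch, multi-head self-attention takes only a few more lines (head_params is assumed to be a list of per-head parameter tuples as accepted by self_attention above):

```python
import numpy as np

def multihead_self_attention(X, head_params, Omega_c):
    """Run H heads, vertically concatenate their outputs, and recombine
    with the linear transformation Omega_c (equation 12.12)."""
    outputs = [self_attention(X, *params) for params in head_params]
    return Omega_c @ np.concatenate(outputs, axis=0)
```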
The multi-head self-attention mechanism is one component of the transformer
layer, which also contains a fully connected network mlp[•] applied separately
to each position, residual connections around both, and two LayerNorm operations:

X ← X + MhSa[X]
X ← LayerNorm[X]
xn ← xn + mlp[xn]    for n = 1, …, N
X ← LayerNorm[X],    (12.13)
where the column vectors xn are separately taken from the full data matrix
X. In a real network, the data passes through a series of these transformers.
Figure 12.7 The transformer. The input consists of a D × N matrix containing the
D-dimensional word embeddings for each of the N input tokens. The output is a
matrix of the same size. The transformer consists of a series of operations. First, there
is a multi-head attention block, allowing the word embeddings to interact with one
another. This forms the processing of a residual block, so the inputs are added back to
the output. Second, a LayerNorm operation is applied. Third, there is a second
residual layer where the same fully connected neural network is applied separately to
each of the N word representations (columns). Finally, LayerNorm is applied again.
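Combining these pieces, one transformer layer can be sketched as follows; the two-layer ReLU network standing in for the column-wise mlp[•] and the parameter shapes are assumptions of this sketch:

```python
import numpy as np

def layer_norm_cols(X, gamma, delta, eps=1e-5):
    """Standardize each column (token) separately, then apply a per-channel
    scale gamma and offset delta (both D x 1 columns)."""
    m = X.mean(axis=0, keepdims=True)
    s = X.std(axis=0, keepdims=True)
    return gamma * (X - m) / (s + eps) + delta

def transformer_layer(X, head_params, Omega_c, W1, b1, W2, b2, ln1, ln2):
    """One layer of equation 12.13; ln1 and ln2 are (gamma, delta) pairs."""
    X = layer_norm_cols(X + multihead_self_attention(X, head_params, Omega_c), *ln1)
    H = np.maximum(0.0, b1 + W1 @ X)            # same ReLU MLP for every column
    X = layer_norm_cols(X + b2 + W2 @ H, *ln2)  # residual connection, then LayerNorm
    return X
```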
12.5.1 Tokenization
A text processing pipeline begins with a tokenizer. This splits the text into
smaller constituent units (tokens) from a vocabulary of possible tokens. In
the discussion above, we have implied that these tokens represent words,
but there are several difficulties:
• Inevitably, some words (e.g., names) will not be in the vocabulary.
• It's unclear how to handle punctuation, but this is important. If a
sentence ends in a question mark, we must encode this information.
• The vocabulary would need different tokens for versions of the same
word with different suffixes (e.g., walk, walks, walked, walking), and
there is no way to clarify that these variations are related.
One approach would be to use letters and punctuation marks as the
vocabulary, but this would mean splitting text into very small parts and
requiring the subsequent network to re-learn the relations between them.
In practice, a compromise between letters and full words is used, and the
final vocabulary includes both common words and word fragments from
which larger and less frequent words can be composed. The vocabulary is
computed using a sub-word tokenizer such as byte pair encoding (figure
12.8) that greedily merges commonly occurring sub-strings based on their
frequency.
Notebook 12.3
Tokenization
Figure 12.8 Sub-word tokenization. a) A passage of text from a nursery rhyme. The
tokens are initially just the characters and whitespace (represented by an underscore),
and their frequencies are displayed in the table. b) At each iteration, the sub-word
tokenizer looks for the most commonly occurring adjacent pair of characters (in this
case, se) and merges them. This creates a new token and decreases the counts for the
original tokens s and e. c) At the second iteration, the algorithm merges e and the
whitespace character _. Note that the last character of the first token to be merged
cannot be whitespace, which prevents merging across words. d) After 22 iterations,
the tokens consist of a mix of letters, word fragments, and commonly occurring
words. e) If we continue this process indefinitely, the tokens eventually represent the
full words. f) Over time, the number of tokens increases as we add word fragments to
the letters and then decreases again as we merge these fragments. In a real situation,
there would be a very large number of words, and the algorithm would terminate
when the vocabulary size (number of tokens) reached a predetermined value.
Punctuation and capital letters would also be treated as separate input characters.
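The greedy merging at the heart of byte pair encoding fits in a few lines of Python. This toy sketch omits the rule preventing merges across word boundaries and simply stops after a fixed number of merges rather than at a target vocabulary size:

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Find the most commonly occurring adjacent pair of tokens."""
    return Counter(zip(tokens, tokens[1:])).most_common(1)[0][0]

def merge_pair(tokens, pair):
    """Replace every occurrence of the pair with a single merged token."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1]); i += 2
        else:
            out.append(tokens[i]); i += 1
    return out

tokens = list("a_sailor_went_to_sea_to_see_what_he_could_see")
for _ in range(8):   # eight greedy merges
    tokens = merge_pair(tokens, most_frequent_pair(tokens))
print(tokens)
```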
12.5.2 Embeddings
Each token in the vocabulary 𝒱 is mapped to a unique word embedding, and
the embeddings for the whole vocabulary are stored in a matrix Ωe ∈ ℝD×|𝒱|.
To accomplish this, the N input tokens are first encoded in the matrix T
∈ ℝ|𝒱|×N, where the nth column corresponds to the nth token and is a |𝒱| × 1
one-hot vector (i.e., a vector where every entry is zero except for the entry
corresponding to the token, which is set to one). The input embeddings are
computed as X = ΩeT, and Ωe is learned like any other network parameter
(figure 12.9). A typical embedding size D is 1024, and a typical total
vocabulary size |𝒱| is 30,000, so even before the main network, there are
many parameters in Ωe to learn.
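Because the columns of T are one-hot, the matrix product X = ΩeT reduces to looking up columns of Ωe, as this NumPy sketch shows (the token indices are hypothetical):

```python
import numpy as np

vocab_size, D, N = 30000, 1024, 6
Omega_e = 0.02 * np.random.randn(D, vocab_size)    # learned embedding matrix
token_ids = np.array([4, 871, 42, 13, 42, 29999])  # hypothetical token indices

T = np.zeros((vocab_size, N))                      # one-hot encoding of the tokens
T[token_ids, np.arange(N)] = 1.0
X = Omega_e @ T                                    # D x N input embeddings

assert np.allclose(X, Omega_e[:, token_ids])       # equivalent direct lookup
```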
12.6.1 Pre-training
In the pre-training stage, the network is trained using self-supervision. This
allows the use of enormous amounts of data without the need for manual
labels. For BERT, the self-supervision task consists of predicting missing
words from sentences from a large internet corpus (figure 12.10).1 During
training, the maximum input length is 512 tokens, and the batch size is 256.
The system is trained for a million steps, corresponding to roughly 50
epochs of the 3.3-billion word corpus.
Problem 12.6
Figure 12.10 Pre-training for BERT-like encoder. The input tokens (and a special
<cls> token denoting the start of the sequence) are converted to word embeddings.
Here, these are represented as rows rather than columns, so the box labeled “word
embeddings” is XT. These embeddings are passed through a series of transformers
(orange connections indicate that every token attends to every other token in these
layers) to create a set of output embeddings. A small fraction of the input tokens is
randomly replaced with a generic <mask> token. In pre-training, the goal is to predict
the missing word from the associated output embedding. As such, the output
embeddings are passed through a softmax function, and the multiclass classification
loss (section 5.5) is used. This task has the advantage that it uses both the left and
right context to predict the missing word but has the disadvantage that it does not
make efficient use of data; here, seven tokens need to be processed to add two terms
to the loss function.
12.6.2 Fine-tuning
In the fine-tuning stage, the model parameters are adjusted to specialize the
network to a particular task. An extra layer is appended onto the transformer
network to convert the output vectors to the desired output format.
Examples include:
Text classification: In BERT, a special token known as the classification
or <cls> token is placed at the start of each string during pre-training. For
text classification tasks like sentiment analysis (in which the passage is
labeled as having a positive or negative emotional tone), the vector
associated with the <cls> token is mapped to a single number and passed
through a logistic sigmoid (figure 12.11a). This contributes to a standard
binary cross-entropy loss (section 5.4).
Figure 12.11 After pre-training, the encoder is fine-tuned using manually labeled
data to solve a particular task. Usually, a linear transformation or a multi-layer
perceptron (MLP) is appended to the encoder to produce whatever output is required.
a) Example text classification task. In this sentiment classification task, the <cls>
token embedding is used to predict the probability that the review is positive. b)
Example word classification task. In this named entity recognition problem, the
embedding for each word is used to predict whether the word corresponds to a
person, place, or organization, or is not an entity.
Text span prediction: In the SQuAD 1.1 question answering task, the
question and a passage from Wikipedia containing the answer are
concatenated and tokenized. BERT is then used to predict the text span in
the passage that contains the answer. Each token maps to two numbers
indicating how likely it is that the text span begins and ends at this location.
The resulting two sets of numbers are put through two softmax functions.
The likelihood of any text span being the answer can be derived by
combining the probability of starting and ending at the appropriate places.
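A sketch of this span-scoring step, assuming the two softmax outputs p_start and p_end are already computed (the maximum span length is an illustrative assumption):

```python
import numpy as np

def best_span(p_start, p_end, max_len=15):
    # Score every candidate span by the product of its start and end
    # probabilities and return the most likely one.
    best, best_score = (0, 0), -1.0
    for s in range(len(p_start)):
        for e in range(s, min(s + max_len, len(p_end))):
            score = p_start[s] * p_end[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best, best_score
```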
The decoder learns an autoregressive model over the token sequence, factorizing the joint probability as a product of conditional distributions:

Pr(t_1, t_2, …, t_N) = Pr(t_1) ∏_{n=2}^{N} Pr(t_n | t_1, …, t_{n−1}).   (12.15)
Figure 12.12 Training GPT3-type decoder network. The tokens are mapped to
word embeddings with a special <start> token at the beginning of the sequence. The
embeddings are passed through a series of transformers that use masked self-
attention. Here, each position in the sentence can only attend to its own embedding
and the embeddings of tokens earlier in the sequence (orange connections). The goal
at each position is to maximize the probability of the following ground truth token in
the sequence. In other words, at position one, we want to maximize the probability of
the token It; at position two, we want to maximize the probability of the token takes;
and so on. Masked self-attention ensures the system cannot cheat by looking at
subsequent inputs. The autoregressive task has the advantage of making efficient use
of the data since every word contributes a term to the loss function. However, it only
exploits the left context of each word.
In practice, many strategies can make the output text more coherent. For
example, beam search keeps track of multiple possible sentence
completions to find the overall most likely (which is not necessarily found
by greedily choosing the most likely next word at each step). Top-k
sampling randomly draws the next word from only the top-K most likely
possibilities to prevent the system from accidentally choosing from the long
tail of low-probability tokens and leading to an unnecessary linguistic dead
end.
Notebook 12.4
Decoding strategies
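For example, top-k sampling can be sketched in a few lines (all names are illustrative; production decoders combine this with temperature scaling and other heuristics):

```python
import numpy as np

def top_k_sample(logits, k=40, rng=None):
    rng = rng or np.random.default_rng()
    top = np.argsort(logits)[-k:]                 # the k most likely tokens
    probs = np.exp(logits[top] - logits[top].max())
    probs /= probs.sum()                          # softmax over the top k only
    return rng.choice(top, p=probs)               # never samples the long tail
```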
Here, the text containing the paired examples in orange was provided as
context for GPT3, and the system then generated the correct answer in
cyan. This phenomenon extends to many situations, including generating
code snippets based on natural language descriptions, arithmetic, translating
between languages, and answering questions about text passages.
Consequently, it is argued that enormous language models are few-shot
learners; they can learn to do novel tasks based on just a few examples.
However, performance is erratic in practice, and the extent to which it is
extrapolating from learned examples rather than merely interpolating or
copying verbatim is unclear.
12.10.1 ImageGPT
ImageGPT is a transformer decoder; it builds an autoregressive model of
image pixels that ingests a partial image and predicts the subsequent pixel
value. The quadratic complexity of the transformer network means that the
largest model (which contained 6.8 billion parameters) could still only
operate on 64×64 images. Moreover, to make this tractable, the original 24-
bit RGB color space had to be quantized into a nine-bit color space, so the
system ingests (and predicts) one of 512 possible tokens at each position.
Images are naturally 2D objects, but ImageGPT simply learns a different
positional encoding at each pixel. Hence it must learn that each pixel has a
close relationship with its preceding neighbors and also with nearby pixels
in the row above. Figure 12.16 shows example generation results.
Figure 12.16 ImageGPT. a) Images generated from the autoregressive ImageGPT
model. The top-left pixel is drawn from the estimated empirical distribution at this
position. Subsequent pixels are generated in turn, conditioned on the previous ones,
working along the rows until the bottom-right of the image is reached. For each
pixel, the transformer decoder generates a conditional distribution as in equation
12.15, and a sample is drawn. The extended sequence is then fed back into the
network to generate the next pixel, and so on. b) Image completion. In each case, the
lower half of the image is removed (top row), and ImageGPT completes the
remaining part pixel by pixel (three different completions shown). Adapted from
https://openai.com/blog/image-gpt/.
The internal representation of this decoder was used as a basis for image
classification. The final pixel embeddings are averaged, and a linear layer
maps these to activations which are passed through a softmax layer to
predict class probabilities. The system is pre-trained on a large corpus of
web images and then fine-tuned on the ImageNet database resized to 48×48
pixels using a loss function that contains both a cross-entropy term for
image classification and a generative loss term for predicting the pixels.
Despite using a large amount of external training data, the system achieved
a 27.4% top-1 error rate on ImageNet (figure 10.15). This was worse than
contemporary convolutional architectures (see figure 10.21) but is still
impressive given the small input image size; unsurprisingly, it fails to
classify images where the target object is small or thin.
Figure 12.17 Vision transformer. The Vision Transformer (ViT) breaks the image
into a grid of patches (16×16 in the original implementation). Each of these is
projected via a learned linear transformation to become a patch embedding. These
patch embeddings are fed into a transformer encoder network, and the <cls> token is
used to predict the class probabilities.
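A numpy sketch of the patch-embedding step described in the caption, with the input size, 16×16 patches, and embedding dimension as illustrative values and a random matrix standing in for the learned projection:

```python
import numpy as np

H = W = 224; P = 16; D = 768
image = np.random.rand(H, W, 3)

# Cut the image into a grid of P x P patches and flatten each one.
patches = image.reshape(H // P, P, W // P, P, 3).transpose(0, 2, 1, 3, 4)
patches = patches.reshape(-1, P * P * 3)          # (196, 768) flattened patches

omega = np.random.randn(P * P * 3, D) * 0.02      # learned in practice
patch_embeddings = patches @ omega                # one embedding per patch
```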
Notes
Natural language processing: Transformers were developed for natural
language processing (NLP) tasks. This is an enormous area that deals with
text analysis, categorization, generation, and manipulation. Example tasks
include part of speech tagging, translation, text classification, entity
recognition (people, places, companies, etc.), text summarization, question
answering, word sense disambiguation, and document clustering. NLP was
initially tackled by rule-based methods that exploited the structure and
statistics of grammar. See Manning & Schutze (1999) and Jurafsky &
Martin (2000) for early approaches.
Recurrent neural networks: Before the introduction of transformers,
many state-of-the-art NLP applications used recurrent neural networks, or
RNNs for short (figure 12.19). The term “recurrent” was introduced by
Rumelhart et al. (1985), but the main idea dates to at least Minsky & Papert
(1969). RNNs ingest a sequence of inputs (words in NLP) one at a time. At
each step, the network receives both the new input and a hidden
representation computed from the previous time step (the recurrent
connection). The final output contains information about the whole input.
This representation can then support NLP tasks like classification or
translation. They have also been used in a decoding context in which
generated tokens are fed back into the model to form the next input to the
sequence. For example, the PixelRNN (Van den Oord et al., 2016c) used
RNNs to build an autoregressive model of images.
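A minimal vanilla RNN cell in this spirit (a sketch; a trained model would learn the weights W_h, W_x, and the bias β):

```python
import numpy as np

def rnn(embeddings, W_h, W_x, beta):
    h = np.zeros(W_h.shape[0])            # hidden state
    outputs = []
    for x in embeddings:                  # one word embedding at a time
        # Combine the new input with the recurrent connection.
        h = np.tanh(W_h @ h + W_x @ x + beta)
        outputs.append(h)
    return outputs                        # outputs[-1] summarizes the sequence
```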
Figure 12.19 Recurrent neural networks (RNNs). The word embeddings are passed
sequentially through a series of identical neural networks. Each network has two
outputs; one is the output embedding, and the other (orange arrows) feeds back into
the next neural network, along with the next word embedding. Each output
embedding contains information about the word itself and its context in the preceding
sentence fragment. In principle, the final output contains information about the entire
sentence and could be used to support classification tasks similarly to the <cls> token
in a transformer encoder model. However, RNNs sometimes gradually “forget” about
tokens that are further back in time.
When absolute position embeddings π• are added to the inputs, the numerator of the attention computation contains the quadratic term:

exp[(x_m + π_m)ᵀ Ω_kᵀ Ω_q (x_n + π_n)].   (12.16)
(12.17)
This has led to the idea of multiplying out the quadratic component in the
numerator of equation 12.16 and retaining only some of the terms. For
example, Ke et al. (2021) decouple or untie the content and position
information by retaining only the content-content and position-position
terms and using different projection matrices Ω• for each.
Another modification is to inject information directly about the relative
position. This is more important than absolute position since a batch of text
can start at an arbitrary place in a document. Shaw et al. (2018), Raffel et al.
(2020), and Huang et al. (2020b) all developed systems where a single term
was learned for each relative position offset, and the attention matrix was
modified in various ways using these relative positional encodings. Wei et
al. (2019) investigated relative positional encodings based on predefined
sinusoidal embeddings rather than learned values. DeBERTa (He et al.,
2021) combines these ideas; they retain only a subset of terms from the
quadratic expansion, apply different projection matrices to them, and use
relative positional encodings. Other work has explored sinusoidal
embeddings that encode absolute and relative position information in more
complex ways (Su et al., 2021).
Wang et al. (2020a) compare the performance of transformers in BERT with
different positional encodings. They found that relative positional
encodings perform better than absolute positional encodings, but there was
little difference between using sinusoidal and learned embeddings. A survey
of positional encodings can be found in Dufter et al. (2021).
Extending transformers to longer sequences: The complexity of the
self-attention mechanism increases quadratically with the sequence length.
Some tasks like summarization or question answering may require long
inputs, so this quadratic dependence limits performance. Three lines of
work have attempted to address this problem. The first decreases the size of
the attention matrix, the second makes the attention sparse, and the third
modifies the attention mechanism to make it more efficient.
To decrease the size of the attention matrix, Liu et al. (2018b) introduced
memory-compressed attention. This applies strided convolution to the keys
and values, which reduces the number of positions in a very similar way to
downsampling in a convolutional network. Attention is now applied
between weighted combinations of neighboring positions, where the
weights are learned. Along similar lines, Wang et al. (2020b) observed that
the quantities in the attention mechanism are often low rank in practice and
developed the LinFormer, which projects the keys and values onto a
smaller subspace before computing the attention matrix.
To make attention sparse, Liu et al. (2018b) proposed local attention, in
which tokens attend only to other tokens within the same local block. This creates
a block diagonal interaction matrix (see figure 12.15). Information cannot
pass from block to block, so such layers are typically alternated with full
attention. Along the same lines, GPT3 (Brown et al., 2020) uses a
convolutional interaction matrix and alternates this with full attention. Child
et al. (2019) and Beltagy et al. (2020) experimented with various interaction
matrices, including convolutional structures with different dilation rates but
allowing some queries to interact with every other key. Ainslie et al. (2020)
introduced the extended transformer construction (figure 12.15h), which
uses a set of global embeddings that interact with every other token. This
can only be done in the encoder version; otherwise, the global embeddings would
implicitly allow the system to “look ahead.” When combined with relative position encoding,
this scheme requires special encodings for mapping to, from, and between
these global embeddings. BigBird (Ainslie et al., 2020) combined global
embeddings and a convolutional structure with a random sampling of
possible connections. Other work has investigated learning the sparsity
pattern of the attention matrix (Roy et al., 2021; Kitaev et al., 2020; Tay et
al., 2020).
Finally, it has been noted that the terms in the numerator and denominator
of the softmax operation that computes attention have the form exp[kᵀq].
This can be treated as a kernel function and, as such, can be expressed as
the dot product g[k]ᵀg[q] where g[•] is a nonlinear transformation. This
formulation decouples the queries and keys, making the attention
computation more efficient. Unfortunately, to replicate the form of the
exponential terms, the transformation g[•] must map the inputs to an
infinite-dimensional space. The linear transformer (Katharopoulos et al., 2020)
recognizes this and replaces the exponential term with a different similarity
measure. The Performer (Choromanski et al., 2020) approximates this
infinite mapping with a finite-dimensional one. More details about
extending transformers to longer sequences can be found in Tay et al.
(2023) and Prince (2021a).
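A numpy sketch of this kernelized attention using the elu(x)+1 feature map of the linear transformer; rows of Q, K, V are positions, and the key point is that the sums over the keys are computed once, so the cost is linear rather than quadratic in N:

```python
import numpy as np

def linear_attention(Q, K, V):
    g = lambda x: np.where(x > 0, x + 1.0, np.exp(x))   # elu(x) + 1, always > 0
    Qg, Kg = g(Q), g(K)
    KV = Kg.T @ V                 # sum over keys, independent of the queries
    Z = Qg @ Kg.sum(axis=0)       # per-query normalizer (softmax denominator)
    return (Qg @ KV) / Z[:, None]
```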
Problem 12.10
Problems
Problem 12.1 Consider a self-attention mechanism that processes N inputs
of length D to produce N outputs of the same size. How many weights and
biases are used to compute the queries, keys, and values? How many
attention weights a[•, •] will there be? How many weights and biases would
there be in a fully connected network relating all DN inputs to all DN
outputs?
Problem 12.2 Why might we want to ensure that the input to the self-
attention mechanism is the same size as the output?
Problem 12.3* Show that the self-attention mechanism (equation 12.8) is
equivariant to a permutation XP of the data X, where P is a permutation
matrix. In other words, show that:
Sa[XP] = Sa[X]P,   (12.18)

where Sa[•] denotes the output of the self-attention mechanism in equation 12.8.
Appendix B.4.4
Permutation matrix
Problem 12.4 Consider the softmax operation:
y_i = softmax_i[z] = exp[z_i] / Σ_j exp[z_j],   (12.19)
in the case where there are five inputs with values: z1 = −3, z2 = 1, z3 = 100,
z4 = 5, z5 = −1. Compute the 25 derivatives, ∂yi/∂zj for all i, j ∈ {1, 2, 3, 4,
5}. What do you conclude?
Problem 12.5 Why is implementation more efficient if the values, queries,
and keys in each of the H heads each have dimension D/H where D is the
original dimension of the data?
Problem 12.6 BERT was pre-trained using two tasks. The first task requires
the system to predict missing (masked) words. The second task requires the
system to classify pairs of sentences as being adjacent or not in the original
text. Identify whether each of these tasks is generative or contrastive (see
section 9.3.6). Why do you think they used two tasks? Propose two novel
contrastive tasks that could be used to pre-train a language model.
Problem 12.7 Consider adding a new token to a precomputed masked self-
attention mechanism with N tokens. Describe the extra computation that
must be done to incorporate this new token.
Problem 12.8 Computation in vision transformers expands quadratically
with the number of patches. Devise two methods to reduce the computation
using the principles from figure 12.15.
Problem 12.9 Consider representing an image with a grid of 16 × 16
patches, each represented by a patch embedding of length 512. Compare the
amount of computation required in the DaViT transformer to perform
attention (i) between the patches, using all of the channels, and (ii) between
the channels, using all of the patches.
Problem 12.10* Attention weights are usually computed as:
a[x_m, x_n] = exp[k_mᵀ q_n] / Σ_{m′} exp[k_{m′}ᵀ q_n].   (12.20)

Consider replacing exp[k_mᵀ q_n] with the dot product g[k_m]ᵀ g[q_n], where g[•]
is a nonlinear transformation. Show how this makes the computation of the
attention weights more efficient.
1 BERT also uses a secondary task that predicts whether two sentences were originally adjacent in
the text or not, but this only marginally improves performance.
Chapter 13
Figure 13.2c depicts a knowledge graph that encodes a set of facts about
objects by defining relations between them. Technically, this is a directed
heterogeneous multigraph. It is heterogeneous because the nodes can
represent different types of entities (e.g., people, countries, companies). It is
a multigraph because there can be multiple edges of different types between
any two nodes.
The point set representing the airplane in figure 13.2d can be converted
into a graph by connecting each point to its K nearest neighbors. The result
is a geometric graph where each point is associated with a position in 3D
space. Figure 13.2e represents a hierarchical graph. The table, light, and
room are each described by graphs representing the adjacency of their
respective components. These three graphs are themselves nodes in another
graph that represents the topology of the objects in a larger model.
All types of graphs can be processed using deep learning. However, this
chapter focuses on undirected graphs like the social network in figure 13.2a.
The nth node has an associated node embedding x(n) of length D. These
embeddings are concatenated and stored in the D×N node data matrix X.
Similarly, the eth edge has an associated edge embedding e(e) of length DE.
These edge embeddings are collected into the DE × E matrix E. For
simplicity, we initially consider graphs that only have node embeddings and
return to edge embeddings in section 13.9.
Figure 13.4 Properties of the adjacency matrix. a) Example graph. b) Position (m,
n) of the adjacency matrix A contains the number of walks of length one from node
m to node n. c) Position (m, n) of the squared adjacency matrix A2 contains the
number of walks of length two from node n to node m. d) One hot vector
representing node six, which was highlighted in panel (a). e) When we pre-multiply
this vector by A, the result contains the number of walks of length one from node six
to each node; we can reach nodes five, seven, and eight in one move. f) When we
pre-multiply this vector by A2, the resulting vector contains the number of walks of
length two from node six to each node; we can reach nodes two, three, four, five, and
eight in two moves, and we can return to the original node in three different ways
(via nodes five, seven, and eight).
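These properties are easy to verify numerically on a toy graph (the triangle below is an illustrative example):

```python
import numpy as np

A = np.array([[0, 1, 1],
              [1, 0, 1],
              [1, 1, 0]])           # adjacency matrix of a triangle graph
one_hot = np.array([1, 0, 0])       # start at node one

print(A @ one_hot)                  # walks of length one: [0, 1, 1]
print(A @ A @ one_hot)              # walks of length two: [2, 1, 1]
```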
(13.2)
(13.3)
Edge prediction tasks: The network predicts whether or not there should
be an edge between nodes n and m. For example, in the social network
setting, the network might predict whether two people know and like each
other and suggest that they connect if that is the case. This is a binary
classification task where the two node embeddings must be mapped to a
single number representing the probability that the edge is present. One
possibility is to take the dot product of the node embeddings and pass the
result through a sigmoid function to create the probability:
Pr(edge between m and n) = sig[h_mᵀ h_n].   (13.4)
(13.5)
(13.6)
For node classification and edge prediction tasks, the output should also
be equivariant with respect to permutations of the node indices. However,
for graph-level tasks, the final layer aggregates information from across the
graph, so the output is invariant to the node order. In fact, the output layer
from equation 13.2 achieves this because:
(13.7)
agg[n] = Σ_{m∈ne[n]} h_m,   (13.8)
where ne[n] returns the set of indices of the neighbors of node n. Then we
apply a linear transformation Ωk to the embedding at the current node
and to this aggregated value, add a bias term βk, and pass the result through
a nonlinear activation function a[•], which is applied independently to every
member of its vector argument:
h′_n = a[β_k + Ω_k h_n + Ω_k · agg[n]].   (13.9)
Figure 13.7 Simple Graph CNN layer. a) Input graph consists of structure
(embodied in graph adjacency matrix A, not shown) and node embeddings (stored in
columns of X). b) Each node in the first hidden layer is updated by (i) aggregating
the neighboring nodes to form a single vector, (ii) applying a linear transformation
Ω0 to the aggregated nodes, (iii) applying the same linear transformation Ω0 to the
original node, (iv) adding these together with a bias β0, and finally (v) applying a
nonlinear activation function a[•] like a ReLU. c) This process is repeated at
subsequent layers (but with different parameters for each layer) until we produce the
final embeddings at the end of the network.
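These steps amount to a few lines of numpy (a sketch of equation 13.9 with a ReLU activation; H stores the node embeddings in columns, as in the text):

```python
import numpy as np

def gcn_layer(H, A, Omega, beta):
    # H: D x N node embeddings, A: N x N adjacency matrix.
    agg = H @ (A + np.eye(A.shape[0]))  # each node plus the sum of its neighbors
    return np.maximum(0.0, beta[:, None] + Omega @ agg)
```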
(13.10)
(13.11)
where the network output f[X, A, Φ] is a single value that determines the
probability that the molecule is toxic (see equation 13.2).
Graph-level tasks only occur in the inductive setting where there are
training and test graphs. However, node-level tasks and edge prediction
tasks can occur in either setting. In the transductive case, the loss function
minimizes the mismatch between the model output and the ground truth
where this is known. New predictions are computed by running the forward
pass and retrieving the results where the ground truth is unknown.
(13.12)
Neighborhood sampling: The full graph that feeds into the batch of nodes
is sampled, thereby reducing the connections at each network layer (figure
13.10). For example, we might start with the batch nodes and randomly
sample a fixed number of their neighbors in the previous layer. Then, we
randomly sample a fixed number of their neighbors in the layer before, and
so on. The graph still increases in size with each layer but in a much more
controlled way. This is done anew for each batch, so the contributing
neighbors differ even if the same batch is drawn twice. This is also
reminiscent of dropout (section 9.3.3) and adds some regularization.
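A sketch of the sampling itself, assuming the graph is stored as a dictionary from each node to a list of its neighbors (k and the function name are illustrative):

```python
import random

def sample_neighborhood(batch_nodes, neighbors, num_layers, k=3):
    layers = [set(batch_nodes)]
    for _ in range(num_layers):
        prev = set()
        for n in layers[-1]:
            # Keep at most k randomly chosen neighbors of each node.
            prev.update(random.sample(neighbors[n], min(k, len(neighbors[n]))))
        layers.append(prev | layers[-1])
    return layers          # nodes required at each layer for this batch
```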
Notebook 13.3
Neighborhood sampling
Figure 13.10 Neighborhood sampling. a) One way of forming batches on large
graphs is to choose a subset of labeled nodes in the output layer (here, just one node
in layer two, right) and then work back to find all of the nodes in the K-hop
neighborhood (receptive field). Only this subgraph is needed to train this batch.
Unfortunately, if the graph is densely connected, this may retain a large proportion of
the graph. b) One solution is neighborhood sampling. As we work back from the final
layer, we select a subset of neighbors (here, three) in the layer before and a subset of
the neighbors of these in the layer before that. This restricts the size of the graph for
training the batch. In all panels, the brightness represents the distance from the
original node.
Figure 13.11 Graph partitioning. a) Input graph. b) The input graph is partitioned
into smaller subgraphs using a principled method that removes the fewest edges. c-d)
We can now use these subgraphs as batches to train in a transductive setting, so here,
there are four possible batches. e) Alternatively, we can use combinations of the
subgraphs as batches, reinstating the edges between them. If we use pairs of
subgraphs, there would be six possible batches here.
Given one of the above methods to form batches, we can now train the
network parameters in the same way as for the inductive setting, dividing
the labeled nodes into train, test, and validation sets as desired; we have
effectively converted a transductive problem to an inductive one. To
perform inference, we compute predictions for the unknown nodes based on
their k-hop neighborhood. Unlike training, this does not require storing the
intermediate representations, so it is much more memory efficient.
13.8 Layers for graph convolutional networks
In the previous examples, we combined messages from adjacent nodes by
summing them together with the transformed current node. This was
accomplished by post-multiplying the node embedding matrix H by the
adjacency matrix plus the identity A + I. We now consider different
approaches to both (i) the combination of the current embedding with the
aggregated neighbors and (ii) the aggregation process itself.
(13.13)
(13.14)
(13.15)
(13.16)
Problem 13.8
agg[n] = (1/|ne[n]|) Σ_{m∈ne[n]} h_m,   (13.17)
where as before, ne[n] denotes a set containing the indices of the neighbors
of the nth node. Equation 13.17 can be computed neatly in matrix form by
introducing the diagonal N × N degree matrix D. Each non-zero element of
this matrix contains the number of neighbors for the associated node. It
follows that each diagonal element in the inverse matrix D−1 contains the
denominator that we need to compute the average. The new GCN layer can
be written as:
H_{k+1} = a[β_k 1ᵀ + Ω_k H_k (AD^{−1} + I)].   (13.18)
agg[n] = Σ_{m∈ne[n]} h_m / √(|ne[m]| · |ne[n]|),   (13.19)
with the logic that information coming from nodes with a very large number
of neighbors should be down-weighted since there are many connections
and they provide less unique information. This can also be expressed in
matrix form using the degree matrix:
H_{k+1} = a[β_k 1ᵀ + Ω_k H_k (D^{−1/2} A D^{−1/2} + I)].   (13.20)
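Both normalizations reduce to reweighting the adjacency matrix, as in this sketch (here A is normalized directly; which side the result is applied on depends on whether embeddings are stored as rows or columns, and every node is assumed to have at least one neighbor):

```python
import numpy as np

def normalized_adjacency(A, kind="kipf"):
    deg = A.sum(axis=1)                       # number of neighbors per node
    if kind == "mean":
        return A / deg[:, None]               # average over neighbors
    d = 1.0 / np.sqrt(deg)
    return d[:, None] * A * d[None, :]        # down-weight high-degree nodes
```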
Problem 13.9
(13.21)
(13.22)
(13.23)
(13.24)
Problem 13.10
Figure 13.12 Comparison of graph convolutional network, dot product attention,
and graph attention network. In each case, the mechanism maps N embeddings of
size D stored in a D × N matrix X to an output of the same size. a) The graph
convolutional network applies a linear transformation X′ = ΩX to the data matrix. It
then computes a weighted sum of the transformed data, where the weighting is based
on the adjacency matrix. A bias β is added, and the result is passed through an
activation function. b) The outputs of the self-attention mechanism are also weighted
sums of the transformed inputs, but this time the weights depend on the data itself via
the attention matrix. c) The graph attention network combines both of these
mechanisms; the weights are both computed from the data and based on the
adjacency matrix.
13.10 Summary
Graphs consist of a set of nodes, where pairs of these nodes are connected
by edges. Both nodes and edges can have data attached, and these are
referred to as node embeddings and edge embeddings, respectively. Many
real-world problems can be framed in terms of graphs, where the goal is to
establish a property of the entire graph, properties of each node or edge, or
the presence of additional edges in the graph.
Graph neural networks are deep learning models that are applied to
graphs. Since the node order in graphs is arbitrary, the layers of graph
neural networks must be equivariant to permutations of the node indices.
Spatial-based convolutional networks are a family of graph neural networks
that aggregate information from the neighbors of a node and then use this to
update the node embeddings.
One challenge of processing graphs is that they often occur in the
transductive setting, where there is only one partially labeled graph rather
than sets of training and test graphs. This graph can be extremely large,
which adds further challenges in terms of training and has led to sampling
and partitioning algorithms. The edge graph has a node for every edge in
the original graph. By converting to this representation, graph neural
networks can be used to update the edge embeddings.
Notes
Sanchez-Lengeling et al. (2021) and Daigavane et al. (2021) present good
introductory articles on graph processing using neural networks. Recent
surveys of research in graph neural networks can be found in articles by
Zhou et al. (2020a), Wu et al. (2020c), and Veličković (2023), and the books
of Hamilton (2020) and Ma & Tang (2021). GraphEDM (Chami et al.,
2020) unifies many existing graph algorithms into a single framework. In
this chapter, we have related graphs to convolutional networks following
Bruna et al. (2013), but there are also strong connections with belief
propagation (Dai et al., 2016) and graph isomorphism tests (Hamilton et al.,
2017a). Zhang et al. (2019c) provide a review focusing specifically on
graph convolutional networks. Bronstein et al. (2021) provide a general
overview of geometric deep learning, including learning on graphs. Loukas
(2020) discusses what types of functions graph neural networks can learn.
Applications: Applications include graph classification (e.g., Zhang et
al., 2018b), node classification (e.g., Kipf & Welling, 2017), edge
prediction (e.g., Zhang & Chen, 2018), graph clustering (e.g., Tsitsulin et
al., 2020), and recommender systems (e.g., Wu et al., 2023). Methods for
node classification are reviewed by Xiao et al. (2022a), methods for graph
classification by Errica et al. (2019), and methods for edge prediction by
Mutlu et al. (2020) and Kumar et al. (2020a).
Graph neural networks: Graph neural networks were introduced by
Gori et al. (2005) and Scarselli et al. (2008), who formulated them as a
generalization of recursive neural networks. The latter model used the
iterative update:
(13.25)
in which each node embedding h_n is updated from the initial embedding x_n,
the initial embeddings x_m of the adjacent nodes m ∈ ne[n], the initial
embeddings e_e of the adjacent edges e ∈ ne_e[n], and the current embeddings
h_m of the adjacent nodes.
For convergence, the function f[•, •, •, •, ϕ] must be a contraction mapping
(see figure 16.9). If we unroll this equation in time for K steps and allow
different parameters ϕk at each step k, then equation 13.25 becomes similar
to the graph convolutional network. Subsequent work extended graph
neural networks to use gated recurrent units (Li et al., 2016b) and long
short-term memory networks (Selsam et al., 2019).
Spectral methods: Bruna et al. (2013) applied the convolution operation
in the Fourier domain. The Fourier basis vectors can be found by taking the
eigendecomposition of the graph Laplacian matrix, L = D − A where D is
the degree matrix and A is the adjacency matrix. This has disadvantages:
the filters are not localized, and the decomposition is prohibitively
expensive for large graphs. Henaff et al. (2015) tackled the first problem by
forcing the Fourier representation to be smooth (and hence the spatial
domain to be localized). Defferrard et al. (2016) introduced ChebNet, which
approximates the filters efficiently by using the recursive properties of
Chebyshev polynomials. This both provides spatially localized filters and
reduces the computation. Kipf & Welling (2017) simplified this further to
construct filters that use only a 1-hop neighborhood, resulting in a
formulation similar to the spatial methods described in this chapter and
providing a bridge between spectral and spatial methods.
Spatial methods: Spectral methods are ultimately based on the Graph
Laplacian, so if the graph changes, the model must be retrained. This
problem spurred the development of spatial methods. Duvenaud et al.
(2015) defined convolutions in the spatial domain, using a different weight
matrix to combine the adjacent embeddings for each node degree. This has
the disadvantage that it becomes impractical if some nodes have a very
large number of connections. Diffusion convolutional neural networks
(Atwood & Towsley, 2016) use powers of the normalized adjacency matrix
to blend features across different scales, sum these, pointwise multiply by
weights, and pass through an activation function to create the node
embeddings. Gilmer et al. (2017) introduced message-passing neural
networks, which defined convolutions on the graph as propagating
messages from spatial neighbors. The “aggregate and combine” formulation
of GraphSAGE (Hamilton et al., 2017a) fits into this framework.
Aggregate and combine: Graph convolutional networks (Kipf & Welling,
2017) take a weighted average of the neighbors and current node and then
apply a linear mapping and ReLU. GraphSAGE (Hamilton et al., 2017a)
applies a neural network layer to each neighbor, taking the elementwise
maximum to aggregate. Chiang et al. (2019) propose diagonal enhancement
in which the previous embedding is weighted more than the neighbors. Kipf
& Welling (2017) introduced Kipf normalization, which normalizes the sum
of the neighboring embeddings based on the degrees of the current node
and its neighbors (see equation 13.19).
The mixture model network or MoNet (Monti et al., 2017) takes this one
step further by learning a weighting based on the degrees of the current
node and the neighbor. They associate a pseudo-coordinate system with
each node, where the positions of the neighbors depend on these two
quantities. They then learn a continuous function based on a mixture of
Gaussians and sample this at the pseudo-coordinates of the neighbors to get
the weights. In this way, they can learn the weightings for nodes and
neighbors with arbitrary degrees. Pham et al. (2017) use a linear
interpolation of the node embedding and neighbors with a different
weighted combination for each dimension. The weight of this gating
mechanism is generated as a function of the data.
Higher-order convolutional layers: Zhou & Li (2017) used higher-order
convolutions by replacing the adjacency matrix A with Ã = min[A^L + I, 1],
where L is the maximum walk-length, 1 is a matrix containing only ones,
and min[•, •] takes the pointwise minimum of its two matrix arguments; the
updates now sum together contributions from any nodes where there is at
least one walk of length L. Abu-El-Haija et al. (2019) proposed MixHop,
which computes node updates from the neighbors (using the adjacency
matrix A), the neighbors of the neighbors (using A2), and so on. They
concatenate these updates at each layer. Lee et al. (2018) combined
information from nodes beyond the immediate neighbors using geometric
motifs, which are small local geometric patterns in the graph (e.g., a fully
connected clique of five nodes).
Residual connections: Kipf & Welling (2017) proposed a residual
connection in which the original embeddings are added to the updated ones.
Hamilton et al. (2017b) concatenate the previous embedding to the output
of the next layer (see equation 13.16). Rossi et al. (2020) present an
inception-style network where the node embedding is concatenated to not
only the aggregation of its neighbors but also the aggregation of all
neighbors within a walk of two (via computing powers of the adjacency
matrix). Xu et al. (2018) introduced jump knowledge connections in which
the final output at each node consists of the concatenated node embeddings
throughout the network. Zhang & Meng (2019) present a general
formulation of residual embeddings called GResNet and investigate several
variations in which the embeddings from the previous layer are added, the
input embeddings are added, or versions of these that aggregate information
from their neighbors (without further transformation) are added.
Attention in graph neural networks: Veličković et al. (2019) developed
the graph attention network (figure 13.12c). Their formulation uses
multiple heads whose outputs are combined symmetrically. Gated Attention
Networks (Zhang et al., 2018a) weight the output of the different heads in a
way that depends on the data itself. Graph-BERT (Zhang et al., 2020)
performs node classification using self-attention alone; the graph's structure
is captured by adding position embeddings to the data, similarly to how the
absolute or relative position of words is captured in the transformer (chapter
12). For example, they add positional information that depends on the
number of hops between nodes in the graph.
Permutation invariance: In DeepSets, Zaheer et al. (2017) presented a
general permutation invariant operator for processing sets. Janossy pooling
(Murphy et al., 2018) accepts that many functions are not permutation
equivariant and instead uses a permutation-sensitive function and averages
the results across many permutations.
Edge graphs: The notion of the edge graph, line graph, or adjoint
graph dates to Whitney (1932). The idea of “weaving” layers that update
node embeddings from node embeddings, node embeddings from edge
embeddings, edge embeddings from edge embeddings, and edge
embeddings from node embeddings was proposed by Kearnes et al. (2016).
However, here the node-node and edge-edge updates do not involve the
neighbors. Monti et al. (2018) introduced the dual-primal graph CNN, a
modern formulation in a CNN framework that alternates between updates in
the original and edge graphs.
Power of graph neural networks: Xu et al. (2019) argue that a neural
network should be able to distinguish different graph structures; it is
undesirable to map two graphs to the same output if they have the same
initial node embeddings but different adjacency matrices. They identified
graph structures that could not be distinguished by previous approaches
such as GCNs (Kipf & Welling, 2017) and GraphSAGE (Hamilton et al.,
2017a). They developed a more powerful architecture with the same
discriminative power as the Weisfeiler-Lehman graph isomorphism test
(Weisfeiler & Leman, 1968), which is known to discriminate a broad class
of graphs. This resulting graph isomorphism network was based on the
aggregation operation:
h_n ← MLP[(1 + ε) h_n + Σ_{m∈ne[n]} h_m].   (13.26)
Problem 13.3* Consider the two graphs in figure 13.14. How many ways
are there to walk from node one to node two in (i) three steps and (ii) seven
steps?
(13.27)
(13.28)
(13.29)
(13.30)
Show how this operation can be computed simultaneously for all node
embeddings in the D × N embedding matrix H using linear algebra. You
will need to use both the adjacency matrix A and the degree matrix D.
Problem 13.10* Devise a graph attention mechanism based on dot-product
self-attention and draw its mechanism in the style of figure 13.12.
Problem 13.11* Draw the edge graph associated with the graph in figure
13.15a.
Problem 13.12* Draw the node graph corresponding to the edge graph in
figure 13.15b.
Problem 13.13 For a general undirected graph, describe how the adjacency
matrix of the node graph relates to the adjacency matrix of the
corresponding edge graph.
Problem 13.14* Design a layer that updates a node embedding hn based on
its neighboring node embeddings and neighboring edge
embeddings . You should consider the possibility that the edge
embeddings are not the same size as the node embeddings.
Chapter 14
Unsupervised learning
The four models in chapters 15 to 18 are all generative models that use
latent variables. Generative adversarial networks (chapter 15) learn to
generate data examples x* from latent variables z, using a loss that
encourages the generated samples to be indistinguishable from real
examples (figure 14.2a).
Figure 14.2 Fitting generative models. a) Generative adversarial models provide a
mechanism for generating samples (orange points). As training proceeds (left to
right), the loss function encourages these samples to become progressively less
distinguishable from real examples (cyan points). b) Probabilistic models (including
variational autoencoders, normalizing flows, and diffusion models) learn a
probability distribution over the training data. As training proceeds (left to right), the
likelihood of the real examples increases under this distribution, which can be used to
draw new samples and assess the probability of new data points.
Probabilistic generative models are fit by maximizing the likelihood of the training data:

θ̂ = argmax_θ [ Σ_i log Pr(x_i | θ) ].   (14.1)
Since probability distributions must sum to one, this implicitly reduces the
probability of examples that lie far from the observed data. As well as
providing a training criterion, assigning probabilities is useful in its own
right; the probability on a test set can be used to compare two models
quantitatively, and the probability for an example can be thresholded to
determine if it belongs to the same dataset or is an outlier.2
(14.2)
Appendix C.5.1
KL divergence
IS = exp( 𝔼_{x*}[ D_KL[Pr(y|x*) ‖ Pr(y)] ] ).   (14.3)
Figure 14.4 Inception score. a) A pretrained network classifies the generated
images. If the images are realistic, the resulting class probabilities should
be peaked at the correct class. b) If the model generates all classes equally frequently,
the marginal (average) class probabilities should be flat. The inception score
measures the average distance between the distributions in (a) and the distribution in
(b). Images from Deng et al. (2009).
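Given a matrix of per-image class probabilities from the pretrained classifier, the score itself is a short computation; this sketch uses the KL divergence as the distance, with eps guarding the logarithms:

```python
import numpy as np

def inception_score(p_y_given_x, eps=1e-12):
    # p_y_given_x: one row of class probabilities per generated image.
    p_y = p_y_given_x.mean(axis=0)                    # marginal distribution
    kl = (p_y_given_x * (np.log(p_y_given_x + eps)
                         - np.log(p_y + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))                   # higher is better
```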
However, it does not model the distance with respect to the original data
but rather the activations in the deepest layer of the inception classification
network. These hidden units are the ones most associated with object
classes, so the comparison occurs at a semantic level, ignoring the more
fine-grained details of the images. This metric does take account of
diversity within classes but relies heavily on the information retained by the
features in the inception network; any information discarded by the network
does not contribute to the result. Some of this discarded information may
still be important to generate realistic samples.
Notes
Popular generative models include generative adversarial networks
(Goodfellow et al., 2014), variational autoencoders (Kingma & Welling,
2014), normalizing flows (Rezende & Mohamed, 2015), diffusion models
(Sohl-Dickstein et al., 2015; Ho et al., 2020), autoregressive models
(Bengio et al., 2000; Van den Oord et al., 2016b), and energy-based models
(LeCun et al., 2006). All except energy models are discussed in this book.
Bond-Taylor et al. (2022) provide a recent survey of generative models.
Evaluation: Salimans et al. (2016) introduced the inception score, and
Heusel et al. (2017) introduced the Fréchet inception distance, both of
which are based on the Pool-3 layer of the Inception V3 model (Szegedy et
al., 2016). Nash et al. (2021) used earlier layers of the same network that
retain more spatial information to ensure that the spatial statistics of images
are also replicated. Kynkäänniemi et al. (2019) introduced the manifold
precision/recall method. Barratt & Sharma (2018) discuss the inception
score in detail and point out its weaknesses. Borji (2022) discusses the pros
and cons of different methods for assessing generative models.
1 Until this point, almost all of the relevant math has been embedded in the text. However, the
following four chapters require a solid knowledge of probability. Appendix C covers the relevant
material.
2 Note that not all probabilistic generative models rely on latent variables. The transformer
decoder (section 12.7) was learned without labels, can generate new examples, and can assign a
probability to these examples but is based on an autoregressive formulation (equation 12.15).
Chapter 15
We aim to generate new samples that are drawn from the same
distribution as a set of real training data {xi}. A single new sample is
generated by (i) choosing a latent variable zj from a simple base
distribution (e.g., a standard normal) and then (ii) passing this data through
a network x* = g[zj, θ] with parameters θ. This network is known as the
generator. During the learning process, the goal is to find parameters θ so
that the samples look “similar” to the real data {xi} (see figure 14.2a).
Similarity can be defined in many ways, but the GAN uses the principle
that the samples should be statistically indistinguishable from the true data.
To this end, a second network f[•, ϕ] with parameters ϕ called the
discriminator is introduced. This network aims to classify its input as being
a real example or a generated sample. If this proves impossible, the
generated samples are indistinguishable from the real examples, and we
have succeeded. If it is possible, the discriminator provides a signal that can
be used to improve the generation process.
Figure 15.1 illustrates this scheme. We start with a training set {xi} of
real 1D examples. A different batch of ten of these examples is
shown in each panel (cyan arrows). To create a batch of samples , we
use the simple generator:
x*_j = z_j + θ,   (15.1)
where latent variables {zj} are drawn from a standard normal distribution,
and the parameter θ translates the generated samples along the x-axis
(figure 15.1).
Figure 15.1 GAN mechanism. a) Given a parameterized function (a generator) that
synthesizes samples (orange arrows) and a batch of real examples (cyan arrows), we
train a discriminator to distinguish the real examples from the generated samples
(sigmoid curve indicates the estimated probability that the data point is real). b) The
generator is trained by modifying its parameters so that the discriminator becomes
less confident the samples were synthetic (in this case, by moving the orange samples
to the right). The discriminator is then updated. c) Alternating updates to the
generator and discriminator cause the generated samples to become indistinguishable
from real examples and the impetus to change the generator (i.e., the slope of the
sigmoid function) to diminish.
L[ϕ] = Σ_i −(1 − y_i) log[1 − sig[f[x_i, ϕ]]] − y_i log[sig[f[x_i, ϕ]]],   (15.2)
where yi ∈ {0, 1} is the label, and sig[•] is the logistic sigmoid function
(figure 5.7).
In this case, we assume that the real examples x have label y = 1 and the
generated samples x* have label y = 0 so that:
L[ϕ] = Σ_j −log[1 − sig[f[x*_j, ϕ]]] + Σ_i −log[sig[f[x_i, ϕ]]],   (15.3)
where i and j index the real examples and generated samples, respectively.
Now we substitute the definition x*_j = z_j + θ for the generator (equation 15.1) and note
that we must maximize with respect to θ since we want the generated
samples to be misclassified (i.e., have low likelihood of being synthetic or
high negative log-likelihood):
(15.4)
(15.5)
(15.6)
where Pr(x*) is the probability distribution over the generated samples, and
Pr(x) is the true probability distribution over the real examples.
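The toy 1D example of figure 15.1 can be trained in a few lines. This sketch uses a logistic-regression discriminator and the widely used non-saturating generator update; all constants (learning rates, batch sizes, the real data distribution) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = -3.0                       # generator offset (equation 15.1)
phi = np.array([1.0, 0.0])         # discriminator slope and intercept
sig = lambda a: 1.0 / (1.0 + np.exp(-a))

for step in range(2000):
    x_real = rng.normal(2.0, 0.5, 10)            # batch of real examples
    x_gen = rng.normal(0.0, 1.0, 10) + theta     # batch of generated samples
    # Discriminator step: ascend the log-likelihood of the correct labels.
    for x, y in [(x_real, 1.0), (x_gen, 0.0)]:
        p = sig(phi[0] * x + phi[1])
        phi += 0.01 * np.array([((y - p) * x).mean(), (y - p).mean()])
    # Generator step: move the samples so they are classified as more real.
    p = sig(phi[0] * x_gen + phi[1])
    theta += 0.01 * ((1.0 - p) * phi[0]).mean()

print(theta)   # drifts toward the mean of the real data (2.0 here)
```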
The optimal discriminator depends on the underlying probabilities:
sig[f[x, ϕ]] = Pr(x) / (Pr(x*) + Pr(x)).   (15.7)
D_JS[Pr(x*) ‖ Pr(x)] = (1/2) D_KL[Pr(x*) ‖ (Pr(x*) + Pr(x))/2] + (1/2) D_KL[Pr(x) ‖ (Pr(x*) + Pr(x))/2].   (15.9)
Appendix C.5.2
Jensen-Shannon divergence
The first term indicates the distance will be small if, wherever the
sample density Pr(x*) is high, the mixture (Pr(x*) + Pr(x))/2 has high
probability. In other words, it penalizes regions with samples x* but no real
examples x; it enforces quality. The second term says that the distance will
be small if, wherever the true density Pr(x) is high, the mixture (Pr(x*) +
Pr(x))/2 has high probability. In other words, it penalizes regions with real
examples but no samples. It enforces coverage. Referring to equation 15.6,
we see that the second term does not depend on the generator, which
consequently doesn't care about coverage; it is happy to generate a subset of
possible examples accurately. This is the putative reason for mode
dropping.
Appendix C.5.1
Kullback-Leibler
divergence
15.2.2 Vanishing gradients
In the previous section, we saw that when the discriminator is optimal, the
loss function minimizes a measure of the distance between the generated
and real samples. However, there is a potential problem with using this
distance between probability distributions as the criterion for optimizing
GANs. If the probability distributions are completely disjoint, this distance
is infinite, and any small change to the generator will not decrease the loss.
The same phenomenon can be seen when we consider the original
formulation; if the discriminator can perfectly separate the generated and
real samples, no small change to the generated data will change the
classification score (figure 15.6).
Figure 15.6 Problem with GAN loss function. If the generated samples (orange
arrows) are easy to distinguish from the real examples (cyan arrows), then the
discriminator (sigmoid) may have a very shallow slope at the positions of the
samples; hence, the gradient to update the parameter of the generator may be tiny.
The earth mover's (Wasserstein) distance between the two discrete distributions can be computed as the linear program:

D_W = min_p [cᵀp]  subject to  Ap = b and p ≥ 0,   (15.10)
where p contains the vectorized elements Pij that determine the amount of
mass moved, c contains the distances, Ap = b contains the initial
distribution constraints, and p ≥ 0 ensures the masses moved are non-
negative.1
Problem 15.3
D_W = max_f [ Σ_i f_i (Pr(x = i) − Pr(x* = i)) ]  subject to  |f_{i+1} − f_i| ≤ 1.   (15.12)
Notebook 15.2
Wasserstein distance
In other words, we optimize over a new set of variables {fi} where adjacent
values cannot change by more than one.
(15.14)
Problems 15.4–15.5
D_W[Pr(x), Pr(x*)] = max_f [ 𝔼_{x∼Pr(x)}[f[x]] − 𝔼_{x∼Pr(x*)}[f[x]] ],   (15.15)
subject to the constraint that the Lipschitz constant of the function f[x] is
less than one (i.e., the absolute gradient of the function is less than one).
Appendix B.1.1
Lipschitz constant
max_ϕ [ Σ_i f[x_i, ϕ] − Σ_j f[x*_j, ϕ] ],   (15.16)
where we must constrain the neural network discriminator f[xi, ϕ] to have
an absolute gradient norm of less than one at every position x:
‖∂f[x, ϕ]/∂x‖ < 1 for all x.   (15.17)
One way to achieve this is to clip the discriminator weights to a small range
(e.g., [−0.01, 0.01]). An alternative is the gradient penalty Wasserstein GAN
or WGAN-GP, which adds a regularization term that increases as the
gradient norm deviates from unity.
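A PyTorch sketch of that penalty for data of shape (batch, features); critic is any module mapping data to a scalar score, and the weight of 10 is a commonly used but illustrative choice:

```python
import torch

def gradient_penalty(critic, x_real, x_gen, weight=10.0):
    # Evaluate the gradient norm at points between real and generated data.
    eps = torch.rand(x_real.shape[0], 1, device=x_real.device)
    x_hat = (eps * x_real + (1 - eps) * x_gen).requires_grad_(True)
    grads, = torch.autograd.grad(critic(x_hat).sum(), x_hat, create_graph=True)
    return weight * ((grads.norm(2, dim=1) - 1.0) ** 2).mean()
```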
Figure 15.11 Progressive growing. This method generates realistic images of faces
when trained on the CELEBA-HQ dataset and more complex, variable objects when
trained on LSUN categories. Adapted from Karras et al. (2018).
Figure 15.12 Traversing latent space of progressive GAN trained on LSUN cars.
Moving in the latent space produces car images that change smoothly. This usually
only works for short trajectories; eventually, the latent variable moves to somewhere
that produces unrealistic images. Adapted from Karras et al. (2018).
For the generator, the attribute c can be appended to the latent vector z.
For the discriminator, it may be appended to the input if the data are 1D. If
the data comprise images, the attribute can be linearly transformed to a 2D
representation and appended as an extra channel to the discriminator input
or to one of its intermediate hidden layers.
Figure 15.14 Auxiliary classifier GAN. The generator takes a class label as well as
the latent vector. The discriminator must both identify if the data point is real and
predict the class label. This model was trained on ten ImageNet classes. Left to right:
generated examples of monarch butterflies, goldfinches, daisies, redshanks, and gray
whales. Adapted from Odena et al. (2017).
15.4.3 InfoGAN
The conditional GAN and ACGAN both generate samples that have
predetermined attributes. By contrast, InfoGAN (figure 15.13c) attempts to
identify important attributes automatically. The generator takes a vector
consisting of random noise variables z and random attribute variables c.
The discriminator both predicts whether the image is real or synthesized
and estimates the attribute variables.
The insight is that interpretable real-world characteristics should be
easiest to predict and hence will be represented in the attribute variables c.
The attributes in c may be discrete (and a binary or multiclass cross-entropy
loss would be used) or continuous (and a least squares loss would be used).
The discrete variables identify categories in the data, and the continuous
ones identify gradual modes of variation (figure 15.15).
Figure 15.15 InfoGAN for MNIST. a) Training examples from the MNIST
database, which consists of 28×28 pixel images of handwritten digits. b) The first
attribute c1 is categorical with 10 categories; each column shows samples generated
with one of these categories. The InfoGAN recovers the ten digits. The attribute
vectors c2 and c3 are continuous. c) Moving from left to right, each column
represents a different value of c2 while keeping the other latent variables constant.
This attribute seems to correspond to the orientation of the character. d) The third
attribute seems to correspond to the thickness of the stroke. Adapted from Chen et al.
(2016b).
15.5.1 Pix2Pix
The Pix2Pix model (figure 15.16) is a network x = g[c, θ] that maps one
image c to a different style image x using a U-Net (figure 11.10) with
parameters θ. A typical use case would be colorization, where the input is
grayscale, and the output is color. The output should be structurally similar to
the input, and a content loss encourages this by penalizing the ℓ1 norm ||x −
g[c, θ]||1 between the prediction g[c, θ] and the corresponding ground-truth image x.
Appendix B.3.2
ℓ1 norm
Figure 15.16 Pix2Pix model. a) The model translates an input image to a prediction
in a different style using a U-Net (see figure 11.10). In this case, it maps a grayscale
image to a plausibly colored version. The U-Net is trained with two losses. First, the
content loss encourages the output image to have a similar structure to the input
image. Second, the adversarial loss encourages the grayscale/color image pair to be
indistinguishable from a real pair in each local region of these images. This
framework can be adapted to many tasks, including b) translating maps to satellite
imagery, c) converting sketches of bags to photo-realistic examples, d) colorization,
and e) converting label maps to photorealistic building facades. Adapted from Isola et
al. (2017).
However, the output image should also look like a realistic conversion of
the input. This is encouraged by using an adversarial discriminator f[c, x,
ϕ], which ingests the before and after images c and x. At each step, the
discriminator tries to distinguish between a real before/after pair and a
before/synthesized pair. To the extent that these can be distinguished
successfully, a feedback signal is provided to modify the U-Net to make its
output more realistic. Since the content loss ensures that the large-scale
image structure is correct, the discriminator is mainly needed to ensure that
the local texture is plausible. To this end, the PatchGAN loss is based on a
purely convolutional classifier. At the last layer, each hidden unit indicates
whether the region within its receptive field is real or synthesized. These
responses are averaged to provide the final output.
One way to think of this model is that it is a conditional GAN where the
U-Net is the generator and is conditioned on an image rather than a label.
Notice, though, that the U-Net input does not include noise and so is not
really a “generator” in the conventional sense. Interestingly, the original
authors experimented with adding noise z to the U-Net in addition to the
input image c. However, the network just learned to ignore it.
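The two Pix2Pix losses combine as in this PyTorch sketch, where unet and patch_discriminator are assumed modules and the weighting of 100 on the content term follows the original paper:

```python
import torch
import torch.nn.functional as F

def pix2pix_generator_loss(unet, patch_discriminator, c, x, lambda_l1=100.0):
    x_pred = unet(c)                               # translated image
    patch_scores = patch_discriminator(c, x_pred)  # one logit per local region
    adv = F.binary_cross_entropy_with_logits(
        patch_scores, torch.ones_like(patch_scores))   # "look real" everywhere
    content = F.l1_loss(x_pred, x)                 # match the ground truth
    return adv + lambda_l1 * content
```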
15.6 StyleGAN
StyleGAN is a more contemporary GAN that partitions the variation in a
dataset into meaningful components, each of which is controlled by a subset
of the latent variables. In particular, StyleGAN controls the output image at
different scales and separates style from noise. For face images, large-scale
changes include face shape and head pose, medium-scale changes include
the shape and details of facial features, and fine-scale changes include hair
and skin color. The style components represent aspects of the image that are
salient to human beings, and the noise aspects represent unimportant
variation such as the exact placement of hairs, stubble, freckles, or skin
pores.
The GANs that we have seen until now started from a latent variable z
which is drawn from a standard base distribution. This was passed through
a series of convolutional layers to produce the output image. However, the
latent variable inputs to the generator can (i) be introduced at various points
in the architecture and (ii) modify the current representation at these points
in different ways. StyleGAN makes these choices judiciously to control
scale and to separate style from noise (figure 15.19).
Figure 15.19 StyleGAN. The main pipeline (center row) starts with a constant
learned representation (gray box). This is passed through a series of convolutional
layers and gradually upsampled to create the output. Noise (top row) is added at
different scales by periodically adding Gaussian variables z• with per-channel scaling
ψ•. The Gaussian style variable z is passed through a fully connected network to
create intermediate variable w (bottom row). This is used to set the mean and
variance of each channel at various points in the pipeline.
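The per-channel modulation in the bottom row of figure 15.19 can be sketched as an AdaIN-like operation (numpy; the affine maps A_s and A_b stand in for the learned layers that map w to per-channel statistics):

```python
import numpy as np

def style_modulate(features, w, A_s, A_b, eps=1e-5):
    # features: (channels, height, width); w: intermediate style vector.
    mu = features.mean(axis=(1, 2), keepdims=True)
    sd = features.std(axis=(1, 2), keepdims=True)
    normalized = (features - mu) / (sd + eps)     # strip existing statistics
    scale = (A_s @ w)[:, None, None]              # new per-channel deviation
    shift = (A_b @ w)[:, None, None]              # new per-channel mean
    return scale * normalized + shift
```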
15.7 Summary
GANs learn a generator network that transforms random noise into data that
is indistinguishable from a training set. To this end, the generator is trained
using a discriminator network that tries to distinguish real examples from
generated samples. The generator is then updated so that the data that it
creates is identified as being more “real” by the discriminator. The original
formulation of this idea has the flaw that the training signal is weak when
it's easy to determine if the samples are real or generated. This led to the
Wasserstein GAN, which provides a more consistent training signal.
We reviewed convolutional GANs for generating images and a series of
tricks that improve the quality of the generated images, including
progressive growing, mini-batch discrimination, and truncation. Conditional
GAN architectures introduce an auxiliary vector that allows control over the
output (e.g., the choice of object class). Image translation tasks retain this
conditional information in the form of an image but dispense with the
random noise. The GAN discriminator now works as an additional loss term
that favors “realistic” looking images. Finally, we described StyleGAN,
which injects noise into the generator strategically to control the style and
noise at different scales.
Notes
Goodfellow et al. (2014) introduced generative adversarial networks. An
early review of progress can be found in Goodfellow (2016). More recent
overviews include Creswell et al. (2018) and Gui et al. (2021). Park et al.
(2021) present a review of GAN models that focuses on computer vision
applications. Hindupur (2022) maintains a list of named GAN models
(numbering 501 at the time of writing) from ABC-GAN (Susmelj et al.,
2017) right through to ZipNet-GAN (Zhang et al., 2017b). Odena (2019)
lists open problems concerning GANs.
Data: GANs have primarily been developed for image data. Examples
include the deep convolutional GAN (Radford et al., 2015), progressive
GAN (Karras et al., 2018), and StyleGAN (Karras et al., 2019) models
presented in this chapter. For this reason, most GANs are based on
convolutional layers, although more recently, GANs that exploit
transformers in the generator and discriminator to capture long-range
correlations have been developed (e.g., SAGAN, Zhang et al., 2019b).
However, GANs have also been used to generate molecular graphs (De Cao
& Kipf, 2018), voice data (Saito et al., 2017; Donahue et al., 2018b;
Kaneko & Kameoka, 2017; Fang et al., 2018), EEG data (Hartmann et al.,
2018), text (Lin et al., 2017a; Fedus et al., 2018), music (Mogren, 2016;
Guimaraes et al., 2017; Yu et al., 2017), 3D models (Wu et al., 2016), DNA
(Killoran et al., 2017), and video data (Vondrick et al., 2016; Wang et al.,
2018a).
GAN loss functions: It was originally claimed that GANs converged to
Nash equilibria during training. However, more recent evidence suggests
that this isn't always the case (Farnia & Ozdaglar, 2020; Jin et al., 2020;
Berard et al., 2019). Arjovsky et al. (2017), Metz et al. (2017), and Qi (2020)
identified that the original GAN loss function was unstable, and this led to
different formulations. Mao et al. (2017) introduced the least squares GAN.
For some parameter choices, this implicitly minimizes the Pearson χ²
divergence. Nowozin et al. (2016) argue that the Jensen-Shannon
divergence is a special case of a larger family of f-divergences and show
that any f-divergence can be used for training GANs. Jolicoeur-Martineau
(2019) introduces the relativistic GAN in which the discriminator estimates
the probability that a real data example is more realistic than a generated
one rather than the absolute probability that it is real. Zhao et al. (2017a)
reformulate the GAN into a general energy-based framework in which the
discriminator is a function that attributes low energies to real data and
higher energies elsewhere. As an example, they use an autoencoder and
base the energy on reconstruction error.
Arjovsky & Bottou (2017) analyzed vanishing gradients in GANs, and this
led to the Wasserstein GAN (Arjovsky et al., 2017), which is based on earth
mover's distance/optimal transport. The Wasserstein formulation requires
that the Lipschitz constant of the discriminator is less than one; the original
paper proposed to clip the weights in the discriminator, but subsequent
work imposed a gradient penalty (Gulrajani et al., 2016) or applied spectral
normalization (Miyato et al., 2018) to limit the Lipschitz constant. Other
variations of the Wasserstein GAN were introduced by Wu et al. (2018a),
Bellemare et al. (2017b), and Adler & Lunz (2018). Hermann (2017)
presents an excellent blog post discussing duality and the Wasserstein GAN.
For more information about optimal transport, consult the book by Peyré et
al. (2019). Lucic et al. (2018) present an empirical comparison of the GAN loss
functions in use at the time.
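As a concrete example of limiting the Lipschitz constant, here is a minimal sketch of a gradient penalty in the spirit of Gulrajani et al., which penalizes deviations of the discriminator's gradient norm from one at random interpolates of real and generated data. It assumes 4D image tensors and is illustrative rather than a faithful reproduction of the original implementation.

```python
import torch

def gradient_penalty(discriminator, x_real, x_fake):
    # Evaluate the discriminator at random interpolates of real and
    # generated data and penalize gradient norms that deviate from one,
    # softly encouraging the discriminator to be 1-Lipschitz.
    alpha = torch.rand(x_real.shape[0], 1, 1, 1, device=x_real.device)
    x_hat = (alpha * x_real + (1 - alpha) * x_fake).requires_grad_(True)
    d_out = discriminator(x_hat)
    grads = torch.autograd.grad(d_out.sum(), x_hat, create_graph=True)[0]
    grad_norm = grads.flatten(1).norm(dim=1)
    return ((grad_norm - 1) ** 2).mean()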
Tricks for training GANs: Many heuristics improve the stability of
training GANs and the quality of the final results. Marchesi (2017) first
used the truncation trick (figure 15.10) to trade off the variability of GAN
outputs relative to their quality. This was also proposed by Pieters &
Wiering (2018) and Brock et al. (2019), who added a regularizer that
encourages the weight matrices in the generator to be orthogonal. This
means that truncating the latent variable has a closer relationship to
truncating the output variance and improves sample quality.
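One common variant of the truncation trick simply redraws any latent component whose magnitude exceeds a threshold. A minimal sketch (the threshold value is an illustrative assumption):

```python
import torch

def truncated_latents(n, dim, threshold=1.0):
    # Redraw any latent component whose magnitude exceeds the threshold,
    # trading diversity of the outputs for sample quality.
    z = torch.randn(n, dim)
    mask = z.abs() > threshold
    while mask.any():
        z = torch.where(mask, torch.randn_like(z), z)
        mask = z.abs() > threshold
    return z
```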
Other tricks include only using the gradients from the top K most realistic
images (Sinha et al., 2020), label smoothing in the discriminator (Salimans
et al., 2016), updating the discriminator using a history of generated images
rather than the ones produced by the latest generator to avoid model
“oscillation” (Salimans et al., 2016), and adding noise to the discriminator
input (Arjovsky & Bottou, 2017). Kurach et al. (2019) present an overview
of normalization and regularization in GANs. Chintala et al. (2020) provide
further suggestions for training GANs.
Sample diversity: The original GAN paper (Goodfellow et al., 2014)
argued that given enough capacity, training samples, and computation time,
a GAN can learn to minimize the Jensen-Shannon divergence between the
generated samples and the true distribution. However, subsequent work has
cast doubt on whether this happens in practice. Arora et al. (2017) suggest
that the finite capacity of the discriminator means that the GAN training
objective can approach its optimum value even when the variation in the
output distribution is limited. Wu et al. (2017) approximated the log-
likelihoods of the distributions produced by GANs using annealed
importance sampling and found a mismatch between the generated and real
distributions. Arora & Zhang (2017) ask human observers to identify GAN
samples that are (near-)duplicates and infer the diversity of images from the
frequency of these duplicates. They found that for DCGAN, a duplicate
occurs with probability >50% with 400 samples; this implies that the
support size was ~400,000, which is smaller than the training set. They also
showed that the diversity increased as a function of the discriminator size.
Bau et al. (2019) take a different approach and investigate the parts of the
data space that GANs cannot generate.
Increasing diversity and preventing mode collapse: The extreme case
of lack of diversity is mode collapse, in which the network repeatedly
produces the same image (Salimans et al., 2016). This is a particular
problem for conditional GANs, where the latent variable is sometimes
completely ignored, and the output depends only on the conditional
information. Mao et al. (2019) introduce a regularization term to help
prevent mode collapse in conditional GANs, which maximizes the ratio of
the distance between generated images with respect to the corresponding
latent variables and hence encourages diversity in the outputs. Other work
that aims to reduce mode collapse includes VEEGAN (Srivastava et al.,
2017), which introduces a reconstruction network that maps the generated
image back to the original noise and hence discourages many-to-one
mappings from noise to images.
Salimans et al. (2016) suggested computing statistics across the mini-batch
and using the discriminator to ensure that these are indistinguishable from
the statistics of batches of real images. This is known as mini-batch
discrimination and is implemented by adding a layer toward the end of the
discriminator that learns a tensor for each image that captures the statistics
of the batch. This was simplified by Karras et al. (2018), who computed a
standard deviation for each feature in each spatial location over the mini-
batch. Then they average over spatial locations and features to get a single
estimate. This is replicated to get a single feature map, which is appended to
a layer near the end of the discriminator network. Lin et al. (2018) pass
concatenated (real or generated) samples to the discriminator and provide a
theoretical analysis of how presenting multiple samples to the discriminator
increases diversity. MAD-GAN (Ghosh et al., 2018) increases the diversity
of GAN samples by using multiple generators and requiring the single
discriminator to identify which generator created the samples, thus
providing a signal to help push the generators to create different samples
from one another.
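The simplified mini-batch statistic of Karras et al. (2018) described above can be sketched as follows (shapes and the stabilizing epsilon are illustrative assumptions):

```python
import torch

def minibatch_stddev(h, eps=1e-8):
    # h has shape (batch, channels, height, width). Compute the standard
    # deviation of each feature at each spatial location over the batch,
    # average to a single number, and append it as an extra feature map.
    std = torch.sqrt(h.var(dim=0, unbiased=False) + eps)   # (C, H, W)
    mean_std = std.mean()                                  # scalar
    feat = mean_std.expand(h.shape[0], 1, h.shape[2], h.shape[3])
    return torch.cat([h, feat], dim=1)
```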
Multiple scales: Wang et al. (2018b) used multiple discriminators at
different scales to help ensure that image quality is high in all frequency
bands. Other work defined both generators and discriminators at different
resolutions (Denton et al., 2015; Zhang et al., 2017d; Huang et al., 2017c).
Karras et al. (2018) introduced the progressive growing method (figure
15.9), which is somewhat simpler and faster to train.
StyleGAN: Karras et al. (2019) introduced the StyleGAN framework
(section 15.6). In subsequent work (Karras et al., 2020b), they improved the
quality of generated images by (i) redesigning the normalization layers in
the generator to remove “water droplet” artifacts and (ii) reducing artifacts
where fine details do not follow the coarse details by changing the
progressive growing framework. Further improvements include developing
methods to train GANs with limited data (Karras et al., 2020a) and fixing
aliasing artifacts (Karras et al., 2021). A large body of work finds and
manipulates the latent variables in the StyleGAN to edit images (e.g., Abdal
et al., 2021; Collins et al., 2020; Härkönen et al., 2020; Patashnik et al.,
2021; Shen et al., 2020b; Tewari et al., 2020; Wu et al., 2021; Roich et al.,
2022).
Conditional GANs: The conditional GAN was developed by Mirza &
Osindero (2014), the auxiliary classifier GAN by Odena et al. (2017), and
the InfoGAN by Chen et al. (2016b). The discriminators of these models
usually append the conditional information to the discriminator input
(Mirza & Osindero, 2014; Denton et al., 2015; Saito et al., 2017) or to an
intermediate hidden layer in the discriminator (Reed et al., 2016a; Zhang et
al., 2017d; Perarnau et al., 2016). However, Miyato & Koyama (2018)
experimented with taking the inner product between embedded conditional
information with a layer of the discriminator, motivated by the role of the
class information in the underlying probabilistic model. Images generated
by GANs have variously been conditioned on classes (e.g., Odena et al.,
2017), input text (Reed et al., 2016a; Zhang et al., 2017d), attributes (Yan et
al., 2016; Donahue et al., 2018a; Xiao et al., 2018b), bounding boxes and
keypoints (Reed et al., 2016b), and images (e.g., Isola et al., 2017).
Image translation: Isola et al. (2017) developed the Pix2Pix algorithm
(figure 15.16), and a similar system with higher-resolution results was
subsequently developed by Wang et al. (2018b). StarGAN (Choi et al.,
2018) performs image-to-image translation across multiple domains using
only a single model. The idea of cycle consistency loss was introduced by
Zhou et al. (2016b) in DiscoGAN and Zhu et al. (2017) in CycleGAN
(figure 15.18).
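The cycle consistency idea can be written as a simple loss term. In this sketch, G_ab and G_ba are hypothetical generators mapping between domains a and b, and an L1 penalty encourages each round trip to reproduce the original image:

```python
def cycle_consistency_loss(G_ab, G_ba, x_a, x_b):
    # a -> b -> a and b -> a -> b should both return to the start point.
    loss_a = (G_ba(G_ab(x_a)) - x_a).abs().mean()
    loss_b = (G_ab(G_ba(x_b)) - x_b).abs().mean()
    return loss_a + loss_b
```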
Adversarial loss: In many image translation tasks, there is no
“generator”; such models can be considered supervised learning tasks with
an adversarial loss that encourages realism. The super-resolution algorithm
of Ledig et al. (2017) is a good example of this (figure 15.17). Esser et al.
(2021) used an autoencoder with an adversarial loss. This network takes an
image, reduces the representation size to create a “bottleneck,” and then
reconstructs the image from this reduced data space. In practice, the
architecture is similar to encoder-decoder networks (e.g., figure 10.19).
After training, the autoencoder produces an image that is both close to
the original and highly realistic. They vector quantize (discretize) the
bottleneck of the autoencoder and then learn a probability distribution over
the discrete variables using a transformer decoder. By sampling from this
transformer decoder, they can produce extremely large high-quality images.
Inverting GANs: One way to edit real images is to project them to the
latent space, manipulate the latent variable, and then re-project them to
image space. This process is known as resynthesis. Unfortunately, GANs
only map from the latent variable to the observed data, not vice versa. This
has led to methods to invert GANs (i.e., find the latent variable that
corresponds as closely as possible to an observed image). These methods
fall into two classes. The first learns a network that maps in the opposite
direction (Donahue et al., 2018b; Luo et al., 2017a; Perarnau et al., 2016;
Dumoulin et al., 2017; Guan et al., 2020). This is known as an encoder. The
second approach is to start with some latent variable z and optimize it until
it reconstructs the image as closely as possible (Creswell & Bharath, 2018;
Karras et al., 2020b; Abdal et al., 2019; Lipton & Tripathi, 2017). Zhu et al.
(2020a) combine both approaches.
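The second (optimization-based) approach can be sketched in a few lines; the generator interface, step count, and learning rate are illustrative assumptions:

```python
import torch

def invert_gan(generator, x_target, latent_dim, steps=500, lr=0.05):
    # Optimize a latent variable z so that generator(z) reconstructs
    # the target image as closely as possible in a least-squares sense.
    z = torch.randn(1, latent_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = ((generator(z) - x_target) ** 2).mean()
        loss.backward()
        opt.step()
    return z.detach()
```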
There has been particular interest in inversion for StyleGAN because it
produces excellent results and can control the image at different scales.
Unfortunately, Abdal et al. (2020) showed that it is not possible to invert
StyleGAN without artifacts and proposed inverting to an extended style
space, and Richardson et al. (2021) trained an encoder that reliably maps to
this space. Even after inverting to the extended space, editing images that
are out of domain may still not work well. Roich et al. (2022) address this
issue by fine-tuning the generator of StyleGAN so that it reconstructs the
image exactly and show that the result can be edited well. They also add
extra terms that reconstruct nearby points exactly so that the modification is
local. This technique is known as pivotal tuning. A survey of GAN
inversion techniques can be found in Xia et al. (2022).
Editing images with GANs: The iGAN (Zhu et al., 2016) allows users to
make interactive edits by scribbling or warping parts of an existing image.
The tool then adjusts the output image to be both realistic and to fit these
new constraints. It does this by finding a latent vector that produces an
image that is similar to the edited image and obeys the edge map of any
added lines. It is typical also to add a mask so that only parts of the image
close to the edits are changed. EditGAN (Ling et al., 2021) jointly models
images and their semantic segmentation masks and allows edits to that
mask.
Problems
Problem 15.1 What will the loss be in equation 15.8 when q(x) = Pr(x)?
Problem 15.2* Write an equation relating the loss L in equation 15.8 to the
Jensen-Shannon distance DJS [q(x) || Pr(x)] in equation 15.9.
Problem 15.3 Consider computing the earth mover's distance using linear
programming in the primal form. The discrete distributions Pr(x = i) and
q(x = j) are defined on x = 1, 2, 3, 4 and:
(15.18)
Write out the contents of the 8×16 matrix A. You may assume that the
contents of P have been vectorized into p column-first.
Problem 15.4* Calculate (i) the KL divergence, (ii) the reverse KL
divergence, (iii) the Jensen-Shannon divergence, and (iv) the Wasserstein
distance between the distributions:
(15.19)
for the range a ∈ [−3, 3]. To get a formula for the Wasserstein distance for
this special case, consider the total “earth” (i.e., probability mass) that must
be moved and multiply this by the squared distance it must move.
Problem 15.5 The KL distance and Wasserstein distances between
univariate Gaussian distributions are given by:
DKL[ Normx[μ1, σ1²] ‖ Normx[μ2, σ2²] ] = log[σ2/σ1] + (σ1² + (μ1 − μ2)²) / (2σ2²) − 1/2   (15.20)

and

dW[ Normx[μ1, σ1²], Normx[μ2, σ2²] ] = (μ1 − μ2)² + (σ1 − σ2)².   (15.21)
Chapter 16
Normalizing flows
16.1 1D example
Normalizing flows are probabilistic generative models: they fit a probability
distribution to training data (figure 14.2b). Consider modeling a 1D
distribution Pr(x). Normalizing flows start with a simple tractable base
distribution Pr(z) over a latent variable z and apply a function x = f[z, ϕ],
where the parameters ϕ are chosen so that Pr(x) has the desired distribution
(figure 16.1). Generating a new example x* is easy; we draw z* from the
base density and pass this through the function so that x* = f[z*, ϕ].
Figure 16.1 Transforming probability distributions. a) The base density is a
standard normal defined on a latent variable z. b) This variable is transformed by a
function x = f[z, ϕ] to a new variable x, which c) has a new distribution. To sample
from this model, we draw values z from the base density (green and brown arrows in
panel (a) show two examples). We pass these through the function f[z, ϕ] as shown
by dotted arrows in panel (b) to generate the values of x, which are indicated as
arrows in panel (c).
The probability of a data point x under this model is found with the change-of-variables relation:

Pr(x|ϕ) = |∂f[z, ϕ]/∂z|⁻¹ · Pr(z),   (16.1)
where z = f−1[x, ϕ] is the latent variable that created x. The term Pr(z) is the
original probability of this latent variable under the base density. This is
moderated according to the magnitude of the derivative of the function. If
this is greater than one, then the probability decreases. If it is smaller, the
probability increases.
Notebook 16.1
1D normalizing flows
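Equation 16.1 can be checked numerically. The sketch below (not the notebook's code) uses the monotone function f[z] = z³ + z with a standard normal base density and inverts f on a fine grid:

```python
import numpy as np
from scipy import stats

def f(z):
    return z ** 3 + z          # monotone, hence invertible

def df_dz(z):
    return 3 * z ** 2 + 1

def log_prob_x(x):
    # Invert f numerically (f is monotone increasing), then apply
    # equation 16.1: Pr(x) = |df/dz|^{-1} Pr(z).
    z_grid = np.linspace(-10, 10, 200_001)
    z = np.interp(x, f(z_grid), z_grid)
    return stats.norm.logpdf(z) - np.log(np.abs(df_dz(z)))
```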
16.1.3 Learning
To learn the distribution, we find parameters ϕ that maximize the likelihood
of the training data or equivalently minimize the negative log-
likelihood:
(16.2)
where we have assumed that the data are independent and identically
distributed in the first line and used the likelihood definition from equation
16.1 in the third line.
The same idea applies to multivariate distributions:

Pr(x|ϕ) = |∂f[z, ϕ]/∂z|⁻¹ · Pr(z),   (16.3)
where z = f−1[x, ϕ] is the latent variable z that created x. The first term is
the inverse of the determinant of the D × D Jacobian matrix ∂f[z, ϕ]/∂z,
which contains terms ∂fi[z, ϕ]/∂zj at position (i, j). Just as the absolute
derivative measured the change of area at a point on a 1D function when the
function was applied, the absolute determinant measures the change in
volume at a point in the multivariate function. The second term is the
probability of the latent variable under the base density.
Appendix B.3.8
Determinant
Appendix B.5
Jacobian
In general, normalizing flows chain together a series of invertible functions f1, f2, …, fK so that:

x = fK[ fK−1[ … f1[z, ϕ1] …, ϕK−1 ], ϕK ],   (16.4)

with inverse:

z = f1⁻¹[ f2⁻¹[ … fK⁻¹[x, ϕK] …, ϕ2 ], ϕ1 ].   (16.5)

By the chain rule, the Jacobian of the composition is the product of the Jacobians of the individual functions:

∂f[z, ϕ]/∂z = (∂fK/∂fK−1) · (∂fK−1/∂fK−2) ⋯ (∂f1/∂z),   (16.6)
where we have overloaded the notation to make fk the output of the function
fk[•, ϕk]. The absolute determinant of this Jacobian can be computed by
taking the product of the individual absolute determinants:
|∂f[z, ϕ]/∂z| = Π_{k=1}^{K} |∂fk/∂fk−1|.   (16.7)

The model is trained by minimizing the negative log-likelihood of the training data {xi}:

ϕ̂ = argmin_ϕ [ −Σ_i ( log |∂f[zi, ϕ]/∂zi|⁻¹ + log Pr(zi) ) ],   (16.8)
where zi = f−1[xi, ϕ], Pr(zi) is measured under the base distribution, and the
absolute determinant |∂f[zi, ϕ]/∂zi| is given by equation 16.7.
Appendix B.1
Bijection
Problem 16.4
One way to make a linear flow that is general, efficient to invert, and for
which the Jacobian can be computed efficiently is to parameterize it directly
in terms of the LU decomposition. In other words, we use:
f[h, ϕ] = PL(U + D)h,   (16.9)

where P is a predetermined permutation matrix, L is a lower-triangular matrix with ones on the diagonal, U is an upper-triangular matrix with zeros on the diagonal, and D is a diagonal matrix. The inverse and the determinant (the product of the diagonal entries of D, up to sign) can then be computed efficiently.
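A numerical sketch of this parameterization (assuming the PL(U + D) form, with a fixed permutation and randomly initialized parameters) confirms that the log absolute determinant is just the sum of the log absolute diagonal entries:

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 4
P = np.eye(dim)[rng.permutation(dim)]                  # fixed permutation
L = np.tril(rng.standard_normal((dim, dim)), -1) + np.eye(dim)
U = np.triu(rng.standard_normal((dim, dim)), 1)
d = rng.standard_normal(dim)                           # diagonal entries

W = P @ L @ (U + np.diag(d))
log_abs_det = np.log(np.abs(d)).sum()    # |det W| = product of |d_i|
assert np.isclose(log_abs_det, np.log(np.abs(np.linalg.det(W))))
```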
Elementwise flows instead transform each element of the input separately with a scalar invertible function:

f[h, ϕ] = [ f[h1, ϕ1], f[h2, ϕ2], …, f[hD, ϕD] ]ᵀ.   (16.10)
The Jacobian ∂f[h]/∂h is diagonal since the dth input to f[h] only affects the
dth output. Its determinant is the product of the entries on the diagonal, so:
|∂f[h]/∂h| = Π_{d=1}^{D} | ∂f[hd, ϕd] / ∂hd |.   (16.11)
The function f[•, ϕ] could be a fixed invertible nonlinearity like the leaky
ReLU (figure 3.13), in which case there are no parameters, or it may be any
parameterized invertible one-to-one mapping. A simple example is a
piecewise linear function with K regions (figure 16.5) which maps [0, 1] to
[0, 1] as:
(16.12)
Problem 16.7
where the parameters ϕ1, ϕ2, …, ϕK are positive and sum to K, and b = ⌊Kh⌋
is the index of the bin that contains h. The first term is the sum of all the
preceding bins, and the second term represents the proportion of the way
through the current bin that h lies. This function is easy to invert, and its
gradient can be calculated almost everywhere. There are many similar
schemes for creating smooth functions, often using splines with parameters
that ensure the function is monotonic and hence invertible.
Problems 16.8–16.9
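The piecewise linear function and its inverse can be sketched as below. Here the bin weights are rescaled to sum to one (equivalent to the text's convention up to a constant factor), which makes their cumulative sums the output bin boundaries:

```python
import numpy as np

def piecewise_linear(h, phi):
    # Monotone map of [0,1] onto [0,1]; the slope in bin b is K * phi[b].
    K = len(phi)
    b = np.minimum((h * K).astype(int), K - 1)       # bin index
    cum = np.concatenate([[0.0], np.cumsum(phi)])    # output bin boundaries
    return cum[b] + (h * K - b) * phi[b]

def piecewise_linear_inverse(y, phi):
    # Locate the output bin, then undo the linear segment.
    K = len(phi)
    cum = np.concatenate([[0.0], np.cumsum(phi)])
    b = np.minimum(np.searchsorted(cum, y, side='right') - 1, K - 1)
    return (b + (y - cum[b]) / phi[b]) / K

phi = np.array([0.1, 0.2, 0.5, 0.1, 0.1])
h = np.linspace(0.0, 0.999, 5)
assert np.allclose(piecewise_linear_inverse(piecewise_linear(h, phi), phi), h)
```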
Elementwise flows are nonlinear but don't mix input dimensions, so they
can't create correlations between variables. When alternated with linear
flows (which do mix dimensions), more complex transformations can be
modeled. However, in practice, elementwise flows are used as components
of more complex layers like coupling flows.
Coupling flows divide the input h into two halves h1 and h2. The first half passes through unchanged, and the second half is transformed by an invertible function g[•, ϕ] whose parameters are computed from the first half:

h′1 = h1
h′2 = g[ h2, ϕ[h1] ].   (16.14)
Figure 16.6 Coupling flows. a) The input (orange vector) is divided into h1 and h2.
The first part of the output h′1 (cyan vector) is a copy of h1. The second part h′2 is
created by applying an invertible transformation g[•, ϕ] to h2, where the parameters ϕ
are themselves a (not necessarily invertible) function of h1. b) In the inverse mapping,
h1 is recovered as a copy of h′1. This allows us to calculate the parameters ϕ[h1] and
then apply the inverse g⁻¹[•, ϕ[h1]] to retrieve h2.
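A minimal affine coupling layer (in PyTorch, assuming an even input dimension) makes the construction concrete; the conditioner is a small fully connected network, and the log determinant is just the sum of the log scales:

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    def __init__(self, dim, hidden=64):
        super().__init__()
        # Conditioner: computes a log-scale and offset for the second
        # half of the input from the first half (need not be invertible).
        self.net = nn.Sequential(
            nn.Linear(dim // 2, hidden), nn.ReLU(),
            nn.Linear(hidden, dim))

    def forward(self, h):
        h1, h2 = h.chunk(2, dim=1)
        log_s, t = self.net(h1).chunk(2, dim=1)
        h2_out = h2 * torch.exp(log_s) + t         # invertible transformer
        log_det = log_s.sum(dim=1)                 # log |Jacobian|
        return torch.cat([h1, h2_out], dim=1), log_det

    def inverse(self, h_out):
        h1, h2_out = h_out.chunk(2, dim=1)
        log_s, t = self.net(h1).chunk(2, dim=1)    # h1 passes unchanged
        h2 = (h2_out - t) * torch.exp(-log_s)
        return torch.cat([h1, h2], dim=1)
```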
Autoregressive flows generalize this idea by treating each dimension of the input in turn; the dth output is an invertible function of the dth input whose parameters depend on all of the preceding inputs:

h′d = g[ hd, ϕ[h1, h2, …, hd−1] ].   (16.15)
Figure 16.7 Autoregressive flows. The input h (orange column) and output h′
(cyan column) are split into their constituent dimensions (here four dimensions). a)
Output h′1 is an invertible transformation of input h1. Output h′2 is an invertible
function of input h2 where the parameters depend on h1. Output h′3 is an invertible
function of input h3 where the parameters depend on previous inputs h1 and h2, and
so on. None of the outputs depend on one another, so they can be computed in
parallel. b) The inverse of the autoregressive flow is computed using a similar
method as for coupling flows. However, notice that to compute h2 we must already
know h1, to compute h3, we must already know h1 and h2, and so on. Consequently,
the inverse cannot be computed in parallel.
The function g[•, •] is termed the transformer, and the parameters ϕ, ϕ[h1],
ϕ[h1, h2], … are termed conditioners. As for coupling flows, the transformer
g[•, ϕ] must be invertible, but the conditioners ϕ[•] can take any form and
are usually neural networks. If the transformer and conditioner are
sufficiently flexible, autoregressive flows are universal approximators in
that they can represent any probability distribution.
It's possible to compute all of the entries of the output h′ in parallel using
a network with appropriate masks so that the parameters ϕ at position d
only depend on previous positions. This is known as a masked
autoregressive flow. The principle is very similar to masked self-attention
(section 12.7.2); connections that relate inputs to previous outputs are
pruned.
Inverting the transformation is less efficient. Consider the forward
mapping:
h′d = g[ hd, ϕ[h1, h2, …, hd−1] ].   (16.16)

Here, h1 can be recovered from h′1, but h2 can only be recovered once h1 is known, and so on; the inverse must be computed sequentially rather than in parallel (figure 16.7b).

Residual flows (figure 16.8) take a different approach; they split the input into two halves h1 and h2 and apply two residual layers:

h′1 = h1 + f1[h2, ϕ1]   (16.17)
h′2 = h2 + f2[h′1, ϕ2],   (16.18)
where f1[•, ϕ1] and f2[•, ϕ2] are two functions that do not necessarily have
to be invertible (figure 16.8). The inverse can be computed by reversing the
order of computation:
h2 = h′2 − f2[h′1, ϕ2]
h1 = h′1 − f1[h2, ϕ1].   (16.19)
Figure 16.8 Residual flows. a) An invertible function is computed by splitting the
input into h1 and h2 and creating two residual layers. In the first, h2 is processed and
h1 is added. In the second, the result is processed, and h2 is added. b) In the reverse
mechanism the functions are computed in the opposite order, and the addition
operation becomes subtraction.
As for coupling flows, the division into blocks restricts the family of
transformations that can be represented. Hence, the inputs are permuted
between layers so that the variables can mix in arbitrary ways.
This formulation can be inverted easily, but for general functions f1[•,
ϕ1] and f2[•, ϕ2], there is no efficient way to compute the Jacobian. This
formulation is sometimes used to save memory when training residual
networks; because the network is invertible, storing the activations at each
layer in the forward pass is unnecessary.
Problem 16.10
A contraction mapping is a function f[•] for which:

dist[ f[z1], f[z2] ] < β · dist[ z1, z2 ]   (16.20)

for every pair of points z1 and z2,
where dist[•, •] is a distance function and 0 < β < 1. When a function with
this property is iterated (i.e., the output is repeatedly passed back in as an
input), the result converges to a fixed point where f[z] = z (figure 16.9). To
understand this, consider applying the function to both the fixed point and
the current position; the fixed point remains static, but the distance between
the two must become smaller, so the current position must get closer to the
fixed point.
Notebook 16.3
Contraction mappings
Figure 16.9 Contraction mappings. If a function has an absolute slope of less than
one everywhere, iterating the function converges to a fixed point f[z] = z. a) Starting
at z0, we evaluate z1 = f[z0]. We then pass z1 back into the function and iterate.
Eventually, the process converges to the point where f[z] = z (i.e., where the function
crosses the dashed diagonal identity line). b) This can be used to invert equations of
the form y = z + f[z] for a value y* by noticing that the fixed point of y* − f[z] (where
the orange line crosses the dashed identity line) is at the same position as where y* =
z + f[z].
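The inversion procedure in figure 16.9b is easy to verify numerically. A minimal sketch with a contractive f (Lipschitz constant 0.5):

```python
import numpy as np

def invert_residual(y, f, iters=100):
    # Solve y = z + f(z) by iterating z <- y - f(z); this converges
    # because f is a contraction mapping.
    z = y.copy()
    for _ in range(iters):
        z = y - f(z)
    return z

f = lambda z: 0.5 * np.tanh(z)       # Lipschitz constant 0.5 < 1
z_true = np.array([0.3, -1.2, 2.0])
y = z_true + f(z_true)
assert np.allclose(invert_residual(y, f), z_true)
```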
Appendix B.3.7
Eigenvalues
log[ |I + ∂f[z, ϕ]/∂z| ] = trace[ log[ I + ∂f[z, ϕ]/∂z ] ]
                         = Σ_{k=1}^{∞} ((−1)^{k+1} / k) · trace[ (∂f[z, ϕ]/∂z)^k ],   (16.22)
where we have used the identity log[|A|] = trace[log[A]] in the first line and
expanded this into a power series in the second line.
Even when we truncate this series, it's still computationally expensive to
compute the trace of the constituent terms. Hence, we approximate this
using Hutchinson's trace estimator. Consider a normal random variable ϵ
with mean 0 and variance I. The trace of a matrix A can be estimated as:
trace[A] = trace[ A · 𝔼[ϵϵᵀ] ]
         = trace[ 𝔼[Aϵϵᵀ] ]
         = 𝔼[ trace[Aϵϵᵀ] ]
         = 𝔼[ trace[ϵᵀAϵ] ]
         = 𝔼[ ϵᵀAϵ ],   (16.23)
Appendix B.3.8
Trace
where the first line is true because 𝔼[ϵϵT] = I. The second line derives from
the properties of the expectation operator. The third line comes from the
linearity of the trace operator. The fourth line is due to the invariance of the
trace to cyclic permutation. The final line is true because the argument in
the fourth line is now a scalar. We estimate the trace by drawing samples ϵi
from Pr(ϵ):
trace[A] ≈ (1/I) Σ_{i=1}^{I} ϵiᵀ A ϵi.   (16.24)
In this way, we can approximate the trace of the powers of the Taylor
expansion (equation 16.22) and evaluate the log probability.
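A short numerical sketch of Hutchinson's estimator (equation 16.24); the matrix and sample count are illustrative:

```python
import numpy as np

def hutchinson_trace(A, n_samples=10_000, seed=0):
    # Estimate trace[A] as the average of eps^T A eps over standard
    # normal vectors eps.
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal((n_samples, A.shape[0]))
    return np.einsum('ni,ij,nj->n', eps, A, eps).mean()

A = np.array([[2.0, 0.3], [0.1, -1.0]])
print(hutchinson_trace(A), np.trace(A))   # estimate is close to 1.0
```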
16.5 Applications
We now describe three applications of normalizing flows. First, we consider
modeling probability densities. Second, we consider the GLOW model for
synthesizing images. Finally, we discuss using normalizing flows to
approximate other distributions.
16.5.1 Modeling densities
Of the four families of generative models discussed in this book, normalizing
flows are the only one that can compute the exact log-likelihood of a new sample.
Generative adversarial networks are not probabilistic, and both variational
autoencoders and diffusion models can only return a lower bound on the
likelihood. Figure 16.11 depicts the estimated probability distributions in
two toy problems using i-ResNet. One application of density estimation is
anomaly detection; the data distribution of a clean dataset is described using
a normalizing flow model. New examples with low probability are flagged
as outliers. However, caution must be used as there may exist outliers with
high probability that don't fall in the typical set (see figure 8.13).
16.5.2 Synthesis
Generative flows, or GLOW, is a normalizing flow model that can create
high-fidelity images (figure 16.12) and uses many of the ideas from this
chapter. It is most easily understood in the normalizing direction. GLOW starts
with a 256×256×3 tensor containing an RGB image. It uses coupling layers,
in which the channels are partitioned into two halves. The second half is
subject to a different affine transform at each spatial position, where the
parameters of the affine transformation are computed by a 2D convolutional
neural network run on the other half of the channels. The coupling layers
are alternated with 1 × 1 convolutions, parameterized as LU decompositions
which mix the channels.
Figure 16.12 Samples from GLOW trained on the CelebA HQ dataset (Karras et
al., 2018). The samples are of reasonable quality, although GANs and diffusion
models produce superior results. Adapted from Kingma & Dhariwal (2018).
To make the flow match a given target density q(x), we can minimize the reverse KL divergence between the flow density Pr(x|ϕ) and the target:

ϕ̂ = argmin_ϕ [ DKL[ Pr(x|ϕ) ‖ q(x) ] ].   (16.25)
Problem 16.11
Figure 16.14 Approximating density models. a) Training data. b) Usually, we
modify the flow model parameters to minimize the KL divergence from the training
data to the flow model. This is equivalent to maximum likelihood fitting (section
5.7). c) Alternatively, we can modify the flow parameters ϕ to minimize the KL
divergence from the flow samples xi = f[zi, ϕ] to d) a target density.
(16.26)
Normalizing flows can model the posterior in VAEs using this trick (see
chapter 17).
16.6 Summary
Normalizing flows transform a base distribution (usually a normal
distribution) to create a new density. They have the advantage that they can
both evaluate the likelihood of samples exactly and generate new samples.
However, they have the architectural constraint that each layer must be
invertible; we need the forward transformation to generate samples and the
backward transformation to evaluate the likelihoods.
It's also important that the Jacobian can be estimated efficiently to
evaluate the likelihood; this must be done repeatedly to learn the density.
However, invertible layers are still useful in their own right even when the
Jacobian cannot be estimated efficiently; they reduce the memory
requirements of training a K-layer network from 𝒪[K] to 𝒪[1].
This chapter reviewed invertible network layers or flows. We considered
linear flows and elementwise flows, which are simple but insufficiently
expressive. Then we described more complex flows, such as coupling,
autoregressive, and residual flows. Finally, we showed how normalizing
flows can be used to estimate likelihoods, generate and interpolate between
images, and approximate other distributions.
Notes
Normalizing flows were first introduced by Rezende & Mohamed (2015)
but had intellectual antecedents in the work of Tabak & Vanden-Eijnden
(2010), Tabak & Turner (2013), and Rippel & Adams (2013). Reviews of
normalizing flows can be found in Kobyzev et al. (2020) and Papamakarios
et al. (2021). Kobyzev et al. (2020) presented a quantitative comparison of
many normalizing flow approaches. They concluded that the Flow++ model
(a coupling flow with a novel elementwise transformation and other
innovations) performed best at the time.
Invertible network layers: Invertible layers decrease the memory
requirements of the back-propagation algorithm; the activations in the
forward pass no longer need to be stored since they can be recomputed in
the backward pass. In addition to the regular network layers and residual
layers (Gomez et al., 2017; Jacobsen et al., 2018) discussed in this chapter,
invertible layers have been developed for graph neural networks (Li et al.,
2021a), recurrent neural networks (MacKay et al., 2018), masked
convolutions (Song et al., 2019), U-Nets (Brügger et al., 2019; Etmann et
al., 2020), and transformers (Mangalam et al., 2022).
Radial and planar flows: The original normalizing flows paper (Rezende
& Mohamed, 2015) used planar flows (which contract or expand the
distribution along certain dimensions) and radial flows (which expand or
contract around a certain point). Inverses for these flows can't be computed
easily, but they are useful for approximating distributions where sampling is
slow or where the likelihood can only be evaluated up to an unknown
scaling factor (figure 16.14).
Applications: Applications include image generation (Ho et al., 2019;
Kingma & Dhariwal, 2018), noise modeling (Abdelhamed et al., 2019),
video generation (Kumar et al., 2019b), audio generation (Esling et al.,
2019; Kim et al., 2018; Prenger et al., 2019), graph generation (Madhawa et
al., 2019), image classification (Kim et al., 2021; Mackowiak et al., 2021),
image steganography (Lu et al., 2021), super-resolution (Yu et al., 2020;
Wolf et al., 2021; Liang et al., 2021), style transfer (An et al., 2021), motion
style transfer (Wen et al., 2021), 3D shape modeling (Paschalidou et al.,
2021), compression (Zhang et al., 2021b), sRGB to RAW image conversion
(Xing et al., 2021), denoising (Liu et al., 2021b), anomaly detection (Yu et
al., 2021), image-to-image translation (Ardizzone et al., 2020), synthesizing
cell microscopy images under different molecular interventions (Yang et al.,
2021), and light transport simulation (Müller et al., 2019b). For applications
using image data, noise must be added before learning since the inputs are
quantized and hence discrete (see Theis et al., 2016).
Rezende & Mohamed (2015) used normalizing flows to model the posterior
in VAEs. Abdal et al. (2021) used normalizing flows to model the
distribution of attributes in the latent space of StyleGAN and then used
these distributions to change specified attributes in real images. Wolf et al.
(2021) use normalizing flows to learn the conditional image of a noisy input
image given a clean one and hence simulate noisy data that can be used to
train denoising or super-resolution models.
Normalizing flows have also found diverse uses in physics (Kanwar et al.,
2020; Köhler et al., 2020; Noé et al., 2019; Wirnsberger et al., 2020; Wong
et al., 2020), natural language processing (Tran et al., 2019; Ziegler &
Rush, 2019; Zhou et al., 2019; He et al., 2018; Jin et al., 2019), and
reinforcement learning (Schroecker et al., 2019; Haarnoja et al., 2018a;
Mazoure et al., 2020; Ward et al., 2019; Touati et al., 2020).
Linear flows: Diagonal linear flows can represent normalization
transformations like Batch-Norm (Dinh et al., 2016) and ActNorm (Kingma
& Dhariwal, 2018). Tomczak & Welling (2016) investigated combining
triangular matrices and using orthogonal transformations parameterized by
the Householder transform. Kingma & Dhariwal (2018) proposed the LU
parameterization described in section 16.5.2. Hoogeboom et al. (2019b)
proposed using the QR decomposition instead, which does not require
predetermined permutation matrices. Convolutions are linear
transformations (figure 10.4) that are widely used in deep learning, but their
inverse and determinant are not straightforward to compute. Kingma &
Dhariwal (2018) used 1×1 convolutions, which is effectively a full linear
transformation applied separately at each position. Zheng et al. (2017)
introduced ConvFlow, which was restricted to 1D convolutions.
Hoogeboom et al. (2019b) provided more general solutions for modeling
2D convolutions either by stacking together masked autoregressive
convolutions or by operating in the Fourier domain.
Elementwise flows and coupling functions: Elementwise flows
transform each variable independently using the same function (but with
different parameters for each variable). The same flows can be used to form
the coupling functions in coupling and autoregressive flows, in which case
their parameters depend on the preceding variables. To be invertible, these
functions must be monotone.
An additive coupling function (Dinh et al., 2015) just adds an offset to the
variable. Affine coupling functions scale the variable and add an offset and
were used by Dinh et al. (2015), Dinh et al. (2016), Kingma & Dhariwal
(2018), Kingma et al. (2016), and Papamakarios et al. (2017). Ziegler &
Rush (2019) propose the nonlinear squared flow, which is an invertible ratio
of polynomials with five parameters. Continuous mixture CDFs (Ho et al.,
2019) apply a monotone transformation based on the cumulative density
function (CDF) of a mixture of K logistics, post-composed by an inverse
logistic sigmoid, scaled, and offset.
The piecewise linear coupling function (figure 16.5) was developed by
Müller et al. (2019b). Since then, systems based on cubic splines (Durkan et
al., 2019a) and rational quadratic splines (Durkan et al., 2019b) have been
proposed. Huang et al. (2018a) introduced neural autoregressive flows, in
which the function is represented by a neural network that produces a
monotonic function. A sufficient condition is that the weights are all
positive and the activation functions are monotone. It is hard to train a
network with the constraint that the weights are positive, so this led to
unconstrained monotone neural networks (Wehenkel & Louppe, 2019),
which model strictly positive functions and then integrate them numerically
to get a monotone function. Jaini et al. (2019) construct positive functions
that can be integrated in closed form based on a classic result that all
positive single-variable polynomials are the sum of squares of polynomials.
Finally, Dinh et al. (2019) investigated piecewise monotonic coupling
functions.
Coupling flows: Dinh et al. (2015) introduced coupling flows in which
the dimensions were split in half (figure 16.6). Dinh et al. (2016) introduced
RealNVP, which partitioned the image input by taking alternating pixels or
blocks of channels. Das et al. (2019) proposed selecting features for the
propagated part based on the magnitude of the derivatives. Dinh et al.
(2016) interpreted multi-scale flows (in which dimensions are gradually
introduced) as coupling flows in which the parameters ϕ have no
dependence on the other half of the data. Kruse et al. (2021) introduce a
hierarchical formulation of coupling flows in which each partition is
recursively divided into two. GLOW (figures 16.12–16.13) was designed by
Kingma & Dhariwal (2018) and uses coupling flows, as do NICE (Dinh et
al., 2015), RealNVP (Dinh et al., 2016), FloWaveNet (Kim et al., 2018),
WaveGlow (Prenger et al., 2019), and Flow++ (Ho et al., 2019).
Autoregressive flows: Kingma et al. (2016) used autoregressive models
for normalizing flows. Germain et al. (2015) developed a general method
for masking previous variables. This was exploited by Papamakarios et al.
(2017) to compute all of the outputs in the forward direction simultaneously
in masked autoregressive flows. Kingma et al. (2016) introduced the
inverse autoregressive flow. Parallel WaveNet (Van den Oord et al., 2018)
distilled WaveNet (Van den Oord et al., 2016a), which is a different type of
generative model for audio, into an inverse autoregressive flow so that
sampling would be fast (see figure 16.14c–d).
Residual flows: Residual flows are based on residual networks (He et al.,
2016a). RevNets (Gomez et al., 2017) and iRevNets (Jacobsen et al., 2018)
divide the input into two sections (figure 16.8), each of which passes
through a residual network. These networks are invertible, but the
determinant of the Jacobian cannot be computed easily. The residual
connection can be interpreted as the discretization of an ordinary
differential equation, and this perspective led to different invertible
architectures (Chang et al., 2018, 2019a). However, the Jacobian of these
networks could still not be computed efficiently. Behrmann et al. (2019)
noted that the network can be inverted using fixed point iterations if its
Lipschitz constant is less than one. This led to iResNet, in which the log
determinant of the Jacobian can be estimated using Hutchinson's trace
estimator (Hutchinson, 1989). Chen et al. (2019) removed the bias induced
by the truncation of the power series in equation 16.22 by using the Russian
Roulette estimator.
Infinitesimal flows: If residual networks can be viewed as a
discretization of an ordinary differential equation (ODE), then the next
logical step is to represent the change in the variables directly by an ODE.
The neural ODE was explored by Chen et al. (2018e) and exploits standard
methods for forward and backward propagation in ODEs. The Jacobian is
no longer required to compute the likelihood; this is represented by a
different ODE in which the change in log probability is related to the trace
of the derivative of the forward propagation. Grathwohl et al. (2019) used
the Hutchinson estimator to estimate the trace and simplified this further.
Finlay et al. (2020) added regularization terms to the loss function that
make training easier, and Dupont et al. (2019) augmented the representation
to allow the neural ODE to represent a broader class of diffeomorphisms.
Tzen & Raginsky (2019) and Peluchetti & Favaro (2020) replaced the
ODEs with stochastic differential equations.
Universality: The universality property refers to the ability of a
normalizing flow to model any probability distribution arbitrarily well.
Some flows (e.g., planar, elementwise) do not have this property.
Autoregressive flows can be shown to have the universality property when
the coupling function is a neural monotone network (Huang et al., 2018a),
based on monotone polynomials (Jaini et al., 2020) or based on splines
(Kobyzev et al., 2020). For dimension D, a series of D coupling flows can
form an autoregressive flow. To understand why, note that the partitioning
into two parts h1 and h2 means that at any given layer h2 depends only on
the previous variables (figure 16.6). Hence, if we increase the size of h1 by
one at every layer, we can reproduce an autoregressive flow, and the result
is universal. It is not known whether coupling flows can be universal with
fewer than D layers. However, they work well in practice (e.g., GLOW)
without the need for this induced autoregressive structure.
Other work: Active areas of research in normalizing flows include the
investigation of discrete flows (Hoogeboom et al., 2019a; Tran et al., 2019),
normalizing flows on non-Euclidean manifolds (Gemici et al., 2016; Wang
& Wang, 2019), and equivariant flows (Köhler et al., 2020; Rezende et al.,
2019) which aim to create densities that are invariant to families of
transformations.
Problems
Problem 16.1 Consider transforming a uniform base density defined on z ∈
[0, 1] using the function x = f[z] = z². Find an expression for the
transformed distribution Pr(x).
Problem 16.2* Consider transforming a standard normal distribution:
(16.27)
(16.28)
(16.29)
(16.30)
(16.31)
Write an expression for the inverse of the leaky ReLU. Write an expression
for the inverse absolute determinant of the Jacobian |∂f[z]/∂z|−1 for an
elementwise transformation x = f[z] of the multivariate variable z where:
(16.32)
(16.33)
where b = ⌊Kh⌋ is the bin that h falls into, and the parameters ϕk are
positive and sum to one. Consider the case where K = 5 and ϕ1 = 0.1, ϕ2 =
0.2, ϕ3 = 0.5, ϕ4 = 0.1, ϕ5 = 0.1. Draw the function f[h, ϕ]. Draw the inverse
function f−1[h′, ϕ].
Problem 16.10 Draw the structure of the Jacobian (indicating which
elements are zero) for the forward mapping of the residual flow in figure
16.8 for the cases where f1[•, ϕ1] and f2[•, ϕ2] are (i) a fully connected
neural network, (ii) an elementwise flow.
Problem 16.11* Write out the expression for the KL divergence in equation
16.25. Why does it not matter if we can only evaluate the probability q(x)
up to a scaling factor κ? Does the network have to be invertible to minimize
this loss function? Explain your reasoning.
Chapter 17
Variational autoencoders
Latent variable models describe the data x using a second, unobserved latent variable z; the probability of the data alone is recovered by marginalizing the joint distribution over z:

Pr(x) = ∫ Pr(x, z) dz.   (17.1)
Appendix C.1.2
Marginalization
Typically, the joint probability Pr(x, z) is broken down using the rules of
conditional probability into the likelihood of the data with respect to the
latent variables term Pr(x|z) and the prior Pr(z):
Pr(x, z) = Pr(x|z) · Pr(z).   (17.2)
Appendix C.1.3
Conditional probability
(17.3)
Problem 17.1
Figure 17.1 Mixture of Gaussians (MoG). a) The MoG describes a complex
probability distribution (cyan curve) as a weighted sum of Gaussian components
(dashed curves). b) This sum is the marginalization of the joint density Pr(x, z)
between the continuous observed data x and a discrete latent variable z.
(17.4)
Appendix C.3.2
Multivariate normal
The nonlinear latent variable model uses a spherical Gaussian likelihood whose mean is a nonlinear function f[•, ϕ] of the latent variable:

Pr(x|z, ϕ) = Normx[ f[z, ϕ], σ²I ].   (17.6)

The likelihood of a data point is again found by marginalizing over the latent variable:

Pr(x|ϕ) = ∫ Normx[ f[z, ϕ], σ²I ] · Normz[0, I] dz.   (17.7)
Notebook 17.1
Latent variable models
17.2.1 Generation
A new example x* can be generated using ancestral sampling (figure 17.3).
We draw z* from the prior Pr(z) and pass this through the network f[z*, ϕ]
to compute the mean of the likelihood Pr(x|z*, ϕ) (equation 17.6), from
which we draw x*. Both the prior and likelihood are normal distributions,
so this is straightforward.
Appendix C.4.2
Ancestral sampling
Figure 17.3 Generation from nonlinear latent variable model. a) We draw a sample
z* from the prior probability Pr(z) over the latent variable. b) A sample x* is then
drawn from Pr(x|z*, ϕ). This is a spherical Gaussian with a mean that is a nonlinear
function f[•, ϕ] of z* and a fixed variance σ²I. c) If we repeat this process many
times, we recover the density Pr(x|ϕ).
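A minimal sketch of this ancestral sampling procedure (the decoder network and noise level σ are illustrative stand-ins):

```python
import torch

def generate(decoder, latent_dim, sigma=0.1):
    # Draw z* from the standard normal prior, compute the mean f[z*, phi]
    # with the decoder, then add spherical Gaussian noise.
    z = torch.randn(1, latent_dim)
    mean = decoder(z)
    return mean + sigma * torch.randn_like(mean)
```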
17.3 Training
To train the model, we maximize the log-likelihood of a training dataset
{xi} with respect to the model parameters. For simplicity, we assume that
the variance term σ² in the likelihood expression is known and concentrate
on learning ϕ:

ϕ̂ = argmax_ϕ [ Σ_{i=1}^{I} log[Pr(xi|ϕ)] ],   (17.8)

where:

Pr(xi|ϕ) = ∫ Normxi[ f[z, ϕ], σ²I ] · Normz[0, I] dz.   (17.9)
(17.10)
Appendix B.1.2
Concave functions
(17.11)
Problems 17.2–17.3
log[ 𝔼[y] ] ≥ 𝔼[ log[y] ].   (17.12)
(17.13)
(17.14)
We then use Jensen's inequality for the logarithm (equation 17.12) to find a
lower bound:
log[Pr(x|ϕ)] ≥ ∫ q(z) log[ Pr(x, z|ϕ) / q(z) ] dz,   (17.15)
where the right-hand side is termed the evidence lower bound or ELBO. It
gets this name because Pr(x|ϕ) is called the evidence in the context of
Bayes’ rule (equation 17.19).
In practice, the distribution q(z) has parameters θ, so the ELBO can be
written as:
ELBO[θ, ϕ] = ∫ q(z|θ) log[ Pr(x, z|ϕ) / q(z|θ) ] dz.   (17.16)
Figure 17.6 Evidence lower bound (ELBO). The goal is to maximize the log-
likelihood log[Pr(x|ϕ)] (black curve) with respect to the parameters ϕ. The ELBO is
a function that lies everywhere below the log-likelihood. It is a function of both ϕ
and a second set of parameters θ. For fixed θ, we get a function of ϕ (two colored
curves for different values of θ). Consequently, we can increase the log-likelihood by
either improving the ELBO with respect to a) the new parameters θ (moving from
colored curve to colored curve) or b) the original parameters ϕ (moving along the
current colored curve).
Appendix C.1.3
Conditional probability
Here, the first integral disappears between lines three and four since
log[Pr(x|ϕ)] does not depend on z, and the integral of the probability
distribution q(z|θ) is one. In the last line, we have just used the definition of
the Kullback-Leibler (KL) divergence.
Appendix C.5.1
KL divergence
This equation shows that the ELBO is the original log-likelihood minus
the KL divergence DKL[q(z|θ)‖Pr(z|x, ϕ)]. The KL divergence measures the
“distance” between distributions and can only take non-negative values. It
follows that the ELBO is a lower bound on log[Pr(x|ϕ)]. The KL distance will
be zero, and the bound will be tight when q(z|θ) = Pr(z|x, ϕ). This is the
posterior distribution over the latent variables z given observed data x; it
indicates which values of the latent variable could have been responsible for
the data point (figure 17.7).
Figure 17.7 Posterior distribution over latent variable. a) The posterior distribution
Pr(z|x*, ϕ) is the distribution over the values of the latent variable z that could be
responsible for a data point x*. We calculate this via Bayes’ rule Pr(z|x*, ϕ) ∝
Pr(x*|z, ϕ)Pr(z). b) We compute the first term on the righthand side (the likelihood)
by assessing the probability of x* against the symmetric Gaussian associated with
each value of z. Here, it was more likely to have been created from z1 than z2. The
second term is the prior probability Pr(z) over the latent variable. Combining these
two factors and normalizing so the distribution sums to one gives us the posterior
Pr(z|x*, ϕ).
ELBO[θ, ϕ] = ∫ q(z|θ) log[ Pr(x|z, ϕ) · Pr(z) / q(z|θ) ] dz
           = ∫ q(z|θ) log[Pr(x|z, ϕ)] dz − DKL[ q(z|θ) ‖ Pr(z) ],   (17.18)
where the joint distribution Pr(x, z|ϕ) has been factored into conditional
probability Pr(x|z, ϕ)Pr(z) between the first and second lines, and the
definition of KL divergence is used again in the last line.
Problem 17.4
Pr(z|x, ϕ) = Pr(x|z, ϕ) · Pr(z) / Pr(x|ϕ).   (17.19)
Since the optimal choice for q(z|θ) was the posterior Pr(z|x), and this
depends on the data example x, the variational approximation should do the
same, so we choose:
q(z|x, θ) = Normz[ μ, Σ ],   (17.20)
where g[x, θ] is a second neural network with parameters θ that predicts the
mean μ and variance Σ of the normal variational approximation.
ELBO[θ, ϕ] = ∫ q(z|x, θ) log[Pr(x|z, ϕ)] dz − DKL[ q(z|x, θ) ‖ Pr(z) ].   (17.21)

The first term is still intractable for a neural network f[•, ϕ], but it is an expectation and so can be approximated with samples:

∫ q(z|x, θ) log[Pr(x|z, ϕ)] dz ≈ (1/N) Σ_{n=1}^{N} log[Pr(x|z*n, ϕ)],   (17.22)

where z*n is the nth sample from q(z|x, θ). This is known as a Monte Carlo
estimate. For a very approximate estimate, we can just use a single sample
z* from q(z|x, θ):
ELBO[θ, ϕ] ≈ log[Pr(x|z*, ϕ)] − DKL[ q(z|x, θ) ‖ Pr(z) ].   (17.23)
Appendix C.2
Expectation
(17.24)
Figure 17.9 Variational autoencoder. The encoder g[x, θ] takes a training example
x and predicts the parameters μ, Σ of the variational distribution q(z|x, θ). We sample
from this distribution and then use the decoder f[z, ϕ] to predict the data x. The loss
function is the negative ELBO, which depends on how accurate this prediction is and
how similar the variational distribution q(z|x, θ) is to the prior Pr(z) (equation 17.21).
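A minimal sketch of the negative ELBO for a spherical Gaussian likelihood and a diagonal Gaussian posterior (single Monte Carlo sample, closed-form KL to a standard normal prior; vectorized data of shape (batch, dim) is an assumption):

```python
import torch

def negative_elbo(x, x_recon, mu, log_var, sigma=1.0):
    # Gaussian reconstruction term (up to an additive constant) plus the
    # closed-form KL between N(mu, diag(exp(log_var))) and N(0, I).
    recon = ((x - x_recon) ** 2).sum(dim=1) / (2 * sigma ** 2)
    kl = 0.5 * (torch.exp(log_var) + mu ** 2 - 1 - log_var).sum(dim=1)
    return (recon + kl).mean()
```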
Figure 17.10 The VAE updates both factors that determine the lower bound at each
iteration. Both the parameters ϕ of the decoder and the parameters θ of the encoder
are manipulated to increase this lower bound.
Instead of sampling z directly, we draw a sample ϵ* from a standard normal and compute:

z* = μ + Σ^{1/2} ϵ*   (17.25)
to draw from the intended Gaussian. Now we can compute the derivatives
as usual because the backpropagation algorithm does not need to pass down
the stochastic branch. This is known as the reparameterization trick (figure
17.11).
Problem 17.5
Notebook 17.2
Reparameterization trick
Figure 17.11 Reparameterization trick. With the original architecture (figure 17.9),
we cannot easily backpropagate through the sampling step. The reparameterization
trick removes the sampling step from the main pipeline; we draw from a standard
normal and combine this with the predicted mean and covariance to get a sample
from the variational distribution.
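In code, the trick is a one-liner; for a diagonal covariance parameterized by its log, a sketch looks like:

```python
import torch

def sample_latent(mu, log_var):
    # Draw eps from a standard normal and shift/scale it so that the
    # result is a sample from N(mu, diag(exp(log_var))); gradients now
    # flow to mu and log_var.
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * log_var) * eps
```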
17.8 Applications
Variational autoencoders have many uses, including denoising, anomaly
detection, and compression. This section reviews several applications for
image data.
To evaluate the probability of a new data example x under the model, we must compute:

Pr(x) = ∫ Pr(x|z) · Pr(z) dz.   (17.26)
In principle, we could approximate this probability using equation 17.22 by
drawing samples from Pr(z) = Normz[0, I] and computing:
Pr(x) ≈ (1/N) Σ_{n=1}^{N} Pr(x|zn).   (17.27)

Unfortunately, most samples zn drawn from the prior have a negligible likelihood Pr(x|zn), so this estimate is inefficient. Importance sampling instead draws the samples from an auxiliary distribution q(z) and reweights them:

Pr(x) = ∫ ( Pr(x|z) · Pr(z) / q(z) ) · q(z) dz ≈ (1/N) Σ_{n=1}^{N} Pr(x|zn) · Pr(zn) / q(zn),   (17.28)
where now we draw the samples from q(z). If q(z) is close to the region of z
where the Pr(x|z) has high likelihood, then we will focus the sampling on
the relevant area of space and estimate Pr(x) much more efficiently.
Notebook 17.3
Importance sampling
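A sketch of this importance-sampling estimate in log space (for numerical stability); decoder_mean, enc_mu, and enc_std are hypothetical stand-ins for the trained networks' outputs:

```python
import torch
from torch.distributions import Normal

def log_prob_estimate(x, decoder_mean, enc_mu, enc_std, sigma=1.0, n=128):
    # Estimate log Pr(x) by drawing z from q(z|x) and averaging the
    # importance weights Pr(x|z) Pr(z) / q(z|x) (equation 17.28).
    q = Normal(enc_mu, enc_std)                       # q(z|x)
    prior = Normal(torch.zeros_like(enc_mu), 1.0)     # Pr(z)
    z = q.sample((n,))                                # (n, latent_dim)
    log_lik = Normal(decoder_mean(z), sigma).log_prob(x).sum(-1)
    log_w = log_lik + prior.log_prob(z).sum(-1) - q.log_prob(z).sum(-1)
    return torch.logsumexp(log_w, dim=0) - torch.log(torch.tensor(float(n)))
```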
17.8.2 Generation
VAEs build a probabilistic model, and it's easy to sample from this model
by drawing from the prior Pr(z) over the latent variable, passing this result
through the decoder f[z, ϕ], and adding noise according to Pr(x|f[z, ϕ]).
Unfortunately, samples from vanilla VAEs are generally low-quality (figure
17.12a–c). This is partly because of the naïve spherical Gaussian noise
model and partly because of the Gaussian models used for the prior and
variational posterior. One trick to improve generation quality is to sample
from the aggregated posterior q(z|θ) = (1/I)Σiq(z|xi, θ) rather than the prior;
this is the average posterior over all samples and is a mixture of Gaussians
that is more representative of the true distribution in latent space.
Figure 17.12 Sampling from a standard VAE trained on CELEBA. In each column,
a latent variable z* is drawn and passed through the model to predict the mean f[z*,
ϕ] before adding independent Gaussian noise (see figure 17.3). a) A set of samples
that are the sum of b) the predicted means and c) spherical Gaussian noise vectors.
The images look too smooth before we add the noise and too noisy afterward. This is
typical, and usually, the noise-free version is shown since the noise is considered to
represent aspects of the image that are not modeled. Adapted from Dorta et al.
(2018). d) It is now possible to generate high-quality images from VAEs using
hierarchical priors, specialized architecture, and careful regularization. Adapted from
Vahdat & Kautz (2020).
17.8.3 Resynthesis
VAEs can also be used to modify real data. A data point x can be projected
into the latent space by either (i) taking the mean of the distribution
predicted by the encoder or (ii) by using an optimization procedure to find
the latent variable z that maximizes the posterior probability, which Bayes’
rule tells us is proportional to Pr(x|z)Pr(z).
In figure 17.13, multiple images labeled as “neutral” or “smiling” are
projected into latent space. The vector representing this change is estimated
by taking the difference in latent space between the means of these two
groups. A second vector is estimated to represent “mouth closed” versus
“mouth open.”
Figure 17.13 Resynthesis. The original image on the left is projected into the latent
space using the encoder, and the mean of the predicted Gaussian is chosen to
represent the image. The center-left image in the grid is the reconstruction of the
input. The other images are reconstructions after manipulating the latent space in
directions representing smiling/neutral (horizontal) and mouth open/closed (vertical).
Adapted from White (2016).
Now the image of interest is projected into the latent space, and then the
representation is modified by adding or subtracting these vectors. To
generate intermediate images, spherical linear interpolation or Slerp is used
rather than linear interpolation. In 3D, this would be the difference between
interpolating along the surface of a sphere versus digging a straight tunnel
through its body.
Problem 17.6
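A minimal sketch of Slerp for two latent vectors (assumes the vectors are not parallel, so the angle between them is nonzero):

```python
import numpy as np

def slerp(z0, z1, t):
    # Interpolate along the great-circle arc between z0 and z1 rather
    # than the straight chord; t in [0, 1].
    cos_theta = np.dot(z0, z1) / (np.linalg.norm(z0) * np.linalg.norm(z1))
    theta = np.arccos(np.clip(cos_theta, -1.0, 1.0))
    return (np.sin((1 - t) * theta) * z0 + np.sin(t * theta) * z1) / np.sin(theta)
```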
17.8.4 Disentanglement
In the resynthesis example above, the directions in space representing
interpretable properties had to be estimated using labeled training data.
Other work attempts to improve the characteristics of the latent space so
that its coordinate directions correspond to real-world properties. When each
dimension represents an independent real-world factor, the latent space is
described as disentangled. For example, when modeling face images, we
might hope to uncover head pose or hair color as independent factors.
Methods to encourage disentanglement typically add regularization
terms to the loss function based on either (i) the posterior q(z|x, θ) over the
latent variables z, or (ii) the aggregated posterior q(z|θ) = (1/I)Σiq(z|xi, θ):
(17.29)
ELBO[θ, ϕ] = ∫ q(z|x, θ) log[Pr(x|z, ϕ)] dz − β · DKL[ q(z|x, θ) ‖ Pr(z) ],   (17.30)
where β > 1 determines how much more the deviation from the prior Pr(z)
is weighted relative to the reconstruction error. Since the prior is usually a
multivariate normal with a spherical covariance matrix, its dimensions are
independent. Hence, up-weighting this term encourages the posterior
distributions to be less correlated. Another variant is the total correlation
VAE, which adds a term to decrease the total correlation between variables
in the latent space (figure 17.14) and maximizes the mutual information
between a small subset of the latent variables and the observations.
Figure 17.14 Disentanglement in the total correlation VAE. The VAE model is
modified so that the loss function encourages the total correlation of the latent
variables to be minimized and hence encourages disentanglement. When trained on a
dataset of images of chairs, several of the latent dimensions have clear real-world
interpretations, including a) rotation, b) overall size, and c) legs (swivel chair versus
normal). In each case, the central column depicts samples from the model, and as we
move left to right, we are subtracting or adding a coordinate vector in latent space.
Adapted from Chen et al. (2018d).
17.9 Summary
The VAE is an architecture that helps to learn a nonlinear latent variable
model over x. This model can generate new examples by sampling from the
latent variable, passing the result through a deep network, and then adding
independent Gaussian noise.
It is not possible to compute the likelihood of a data point in closed
form, and this poses problems for training with maximum likelihood.
However, we can define a lower bound on the likelihood and maximize this
bound. Unfortunately, for the bound to be tight, we need to compute the
posterior probability of the latent variable given the observed data, which is
also intractable. The solution is to make a variational approximation. This is
a simpler distribution (usually a Gaussian) that approximates the posterior
and whose parameters are computed by a second encoder network.
To create high-quality samples from the VAE, it seems to be necessary to
model the latent space with more sophisticated probability distributions
than the Gaussian prior and posterior. One option is to use hierarchical
priors (in which one latent variable generates another). The next chapter
discusses diffusion models, which produce very high-quality examples and
can be viewed as hierarchical VAEs.
Notes
The VAE was originally introduced by Kingma & Welling (2014). A
comprehensive introduction to variational autoencoders can be found in
Kingma et al. (2019).
Applications: The VAE and variants thereof have been applied to images
(Kingma & Welling, 2014; Gregor et al., 2016; Gulrajani et al., 2016;
Akuzawa et al., 2018), speech (Hsu et al., 2017b), text (Bowman et al.,
2015; Hu et al., 2017; Xu et al., 2020), molecules (Gómez-Bombarelli et al.,
2018; Sultan et al., 2018), graphs (Kipf & Welling, 2016; Simonovsky &
Komodakis, 2018), robotics (Hernández et al., 2018; Inoue et al., 2018;
Park et al., 2018), reinforcement learning (Heess et al., 2015; Van Hoof et
al., 2016), 3D scenes (Eslami et al., 2016, 2018; Rezende Jimenez et al.,
2016), and handwriting (Chung et al., 2015).
Applications include resynthesis and interpolation (White, 2016; Bowman
et al., 2015), collaborative filtering (Liang et al., 2018), and compression
(Gregor et al., 2016). Gómez-Bombarelli et al. (2018) use the VAE to
construct a continuous representation of chemical structures that can then
be optimized for desirable properties. Ravanbakhsh et al. (2017) simulate
astronomical observations for calibrating measurements.
Relation to other models: The autoencoder (Rumelhart et al., 1985;
Hinton & Salakhutdinov, 2006) passes data through an encoder to a
bottleneck layer and then reconstructs it using a decoder. The bottleneck is
similar to latent variables in the VAE, but the motivation differs. Here, the
goal is not to learn a probability distribution but to create a low-dimensional
representation that captures the essence of the data. Autoencoders also have
various applications, including denoising (Vincent et al., 2008) and
anomaly detection (Zong et al., 2018).
If the encoder and decoder are linear transformations, the autoencoder is
just principal component analysis (PCA). Hence, the nonlinear autoencoder
is a generalization of PCA. There are also probabilistic forms of PCA.
Probabilistic PCA (Tipping & Bishop, 1999) adds spherical Gaussian noise
to the reconstruction to create a probability model, and factor analysis adds
diagonal Gaussian noise (see Rubin & Thayer, 1982). If we make the
encoder and decoder of these probabilistic variants nonlinear, we return to
the variational autoencoder.
Architectural variations: The conditional VAE (Sohn et al., 2015)
passes class information c into both the encoder and decoder. The result is
that the latent space does not need to encode the class information. For
example, when MNIST data are conditioned on the digit label, the latent
variables might encode the orientation and width of the digit rather than the
digit category itself. Sønderby et al. (2016a) introduced ladder variational
autoencoders, which recursively correct the generative distribution with a
data-dependent approximate likelihood term.
Modifying likelihood: Other work investigates more sophisticated
likelihood models Pr(x|z). The PixelVAE (Gulrajani et al., 2016) used an
autoregressive model over the output variables. Dorta et al. (2018) modeled
the covariance of the decoder output as well as the mean. Lamb et al.
(2016) improved the quality of reconstruction by adding extra
regularization terms that encourage the reconstruction to be similar to the
original image in the space of activations of a layer of an image
classification model. This model encourages semantic information to be
retained and was used to generate the results in figure 17.13. Larsen et al.
(2016) use an adversarial loss for reconstruction, which also improves
results.
Latent space, prior, and posterior: Many different forms for the
variational approximation to the posterior have been investigated, including
normalizing flows (Rezende & Mohamed, 2015; Kingma et al., 2016),
directed graphical models (Maaløe et al., 2016), undirected models (Vahdat
et al., 2020), and recursive models for temporal data (Gregor et al., 2016,
2019).
Other authors have investigated using a discrete latent space (Van Den Oord et al., 2017; Razavi et al., 2019b; Rolfe, 2017; Vahdat et al., 2018a,b). For
example, Razavi et al. (2019b) use a vector quantized latent space and
model the prior with an autoregressive model (equation 12.15). This is slow
to sample from but can describe very complex distributions.
Jiang et al. (2016) use a mixture of Gaussians for the posterior, allowing
clustering. This is a hierarchical latent variable model that adds a discrete
latent variable to improve the flexibility of the posterior. Other authors
(Salimans et al., 2015; Ranganath et al., 2016; Maaløe et al., 2016; Vahdat
& Kautz, 2020) have experimented with hierarchical models that use
continuous variables. These have a close connection with diffusion models
(chapter 18).
Combination with other models: Gulrajani et al. (2016) combined VAEs
with an autoregressive model to produce more realistic images. Chung et al.
(2015) combine the VAE with recurrent neural networks to model time-
varying measurements.
As discussed above, adversarial losses have been used to inform the
likelihood term directly. However, other models have combined ideas from
generative adversarial networks (GANs) with VAEs in different ways.
Makhzani et al. (2015) use an adversarial loss in the latent space; the idea is
that the discriminator will ensure that the aggregated posterior distribution
q(z) is indistinguishable from the prior distribution Pr(z). Tolstikhin et al.
(2018) generalize this to a broader family of distances between the prior
and aggregated posterior. Dumoulin et al. (2017) introduced adversarially learned inference, which uses an adversarial loss to distinguish two pairs of
latent/observed data points. In one case, the latent variable is drawn from
the latent posterior distribution and, in the other, from the prior. Other
hybrids of VAEs and GANs were proposed by Larsen et al. (2016), Brock et
al. (2016), and Hsu et al. (2017a).
Posterior collapse: One potential problem in training is posterior
collapse, in which the encoder always predicts the prior distribution. This
was identified by Bowman et al. (2015) and can be mitigated by gradually increasing the weight of the term that encourages the KL distance between the posterior and the prior to be small during training. Several other methods have been
proposed to prevent posterior collapse (Razavi et al., 2019a; Lucas et al.,
2019b,a), and this is also part of the motivation for using a discrete latent
space (Van Den Oord et al., 2017).
Blurry reconstructions: Zhao et al. (2017c) provide evidence that blurry reconstructions are caused partly by the Gaussian noise model and partly by the sub-optimal posterior distributions induced by the variational approximation. It is perhaps not coincidental that some of the best synthesis
results have come from using a discrete latent space modeled by a
sophisticated autoregressive model (Razavi et al., 2019b) or from using
hierarchical latent spaces (Vahdat & Kautz, 2020; see figure 17.12d). Figure
17.12a-c used a VAE that was trained on the CELEBA database (Liu et al.,
2015). Figure 17.12d uses a hierarchical VAE that was trained on the
CELEBA HQ dataset (Karras et al., 2018).
Other problems: Chen et al. (2017) noted that when more complex
likelihood terms are used, such as the PixelCNN (Van den Oord et al.,
2016c), the output can cease to depend on the latent variables at all. They
term this the information preference problem. This was addressed by Zhao
et al. (2017b) in the InfoVAE, which added an extra term that maximized
the mutual information between the latent and observed distributions.
Another problem with the VAE is that there can be “holes” in the latent
space that do not correspond to any realistic sample. Xu et al. (2020)
introduce the constrained posterior VAE, which helps prevent these vacant
regions in latent space by adding a regularization term. This allows for
better interpolation from real samples.
Disentangling latent representation: Methods to “disentangle” the latent
representation include the beta VAE (Higgins et al., 2017) and others (e.g.,
Kim & Mnih, 2018; Kumar et al., 2018). Chen et al. (2018d) further
decomposed the ELBO to show the existence of a term measuring the total
correlation between the latent variables (i.e., the distance between the
aggregate posterior and the product of its marginals). They use this to
motivate the total correlation VAE, which attempts to minimize this
quantity. The Factor VAE (Kim & Mnih, 2018) uses a different approach to
minimize the total correlation. Mathieu et al. (2019) discuss the factors that
are important in disentangling representations.
Reparameterization trick: Consider computing an expectation of some
function, where the probability distribution with which the expectation is
taken depends on some parameters. The reparameterization trick computes
the derivative of this expectation with respect to these parameters. This
chapter introduced this as a method to differentiate through the sampling
procedure approximating the expectation; there are alternative approaches
(see problem 17.5), but the reparameterization trick gives an estimator that
(usually) has low variance. This issue is discussed in Rezende et al. (2014),
Kingma et al. (2015), and Roeder et al. (2017).
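As a concrete illustration, here is a minimal PyTorch sketch of the trick; the function f and the Gaussian parameters are arbitrary choices for this example. Drawing ϵ from a fixed standard normal and computing z = μ + σ·ϵ moves the randomness off the path between the parameters and the objective, so automatic differentiation can propagate gradients through the sample.

    import torch

    # Reparameterized estimate of d/dmu E_{z~Norm[mu,sigma^2]}[f(z)]; f is arbitrary.
    torch.manual_seed(0)
    f = lambda z: (z ** 2).mean()

    mu = torch.tensor(1.0, requires_grad=True)
    log_sigma = torch.tensor(0.0, requires_grad=True)

    eps = torch.randn(10000)                 # sample from a fixed standard normal
    z = mu + torch.exp(log_sigma) * eps      # z ~ Norm[mu, sigma^2], differentiably
    f(z).backward()

    # For f(z) = z^2, the true expectation is mu^2 + sigma^2, so dE/dmu = 2mu = 2.
    print(mu.grad, log_sigma.grad)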
Lower bound and the EM algorithm: VAE training is based on
optimizing the evidence lower bound (sometimes also referred to as the
ELBO, variational lower bound, or negative variational free energy).
Hoffman & Johnson (2016) and Lücke et al. (2020) re-express this lower
bound in several ways that elucidate its properties. Other work has aimed to
make this bound tighter (Burda et al., 2016; Li & Turner, 2016; Bornschein
et al., 2016; Masrani et al., 2019). For example, Burda et al. (2016) use a
modified bound based on using multiple importance-weighted samples
from the approximate posterior to form the objective function.
The ELBO is tight when the distribution q(z|θ) matches the posterior
Pr(z|x, ϕ). This is the basis of the expectation maximization (EM) algorithm
(Dempster et al., 1977). Here, we alternately (i) choose θ so that q(z|θ)
equals the posterior Pr(z|x, ϕ) and (ii) change ϕ to maximize the lower
bound (figure 17.15). This is viable for models like the mixture of
Gaussians, where we can compute the posterior distribution in closed form.
Unfortunately, this is not the case for the nonlinear latent variable model, so
this method cannot be used.
Problem 17.7
Chapter 18
Diffusion models
18.1 Overview
A diffusion model consists of an encoder and a decoder. The encoder takes
a data sample x and maps it through a series of intermediate latent variables
z1 … zT. The decoder reverses this process; it starts with zT and maps back
through zT−1, …, z1 until it finally (re-)creates a data point x. In both
encoder and decoder, the mappings are stochastic rather than deterministic.
The encoder is prespecified; it gradually blends the input with samples
of white noise (figure 18.1). With enough steps, the conditional distribution
q(zT|x) and marginal distribution q(zT) of the final latent variable both
become the standard normal distribution. Since this process is prespecified,
all the learned parameters are in the decoder.
Figure 18.1 Diffusion models. The encoder (forward, or diffusion process) maps
the input x through a series of latent variables z1 … zT. This process is prespecified
and gradually mixes the data with noise until only noise remains. The decoder
(reverse process) is learned and passes the data back through the latent variables,
removing noise at each stage. After training, new examples are generated by
sampling noise vectors zT and passing them through the decoder.
18.2 Encoder (forward process)
The encoder maps the data example x through a series of latent variables z1, …, zT with the same dimension as x. At each step, the current variable is attenuated and mixed with a small quantity of noise:

z1 = √(1 − β1) · x + √β1 · ϵ1
zt = √(1 − βt) · zt−1 + √βt · ϵt    ∀ t ∈ {2, …, T},    (18.1)
where ϵt is noise drawn from a standard normal distribution. The first term
attenuates the data plus any noise added so far, and the second adds more
noise. The hyperparameters βt ∈ [0, 1] determine how quickly the noise is
blended and are collectively known as the noise schedule. The forward
process can equivalently be written as:
q(z1|x) = Normz1[√(1 − β1) · x, β1 · I]
q(zt|zt−1) = Normzt[√(1 − βt) · zt−1, βt · I]    ∀ t ∈ {2, …, T}.    (18.2)
Figure 18.2 Forward process. a) We consider one-dimensional data x with T = 100
latent variables z1, …, z100 and β = 0.03 at all steps. Three values of x (gray, cyan,
and orange) are initialized (top row). These are propagated through z1, …, z100. At each step, the variable is updated by attenuating its value by √(1 − β) and adding noise with mean zero and variance β (equation 18.1). Accordingly, the three
examples noisily propagate through the variables with a tendency to move toward
zero. b) The conditional probabilities Pr(z1|x) and Pr(zt|zt−1) are normal distributions
with a mean that is slightly closer to zero than the current point and a fixed variance
βt (equation 18.2).
This is a Markov chain because the probability of zt depends only on the value of the immediately preceding variable zt−1. With sufficient steps T, all traces of the original data are removed, and q(zT|x) = q(zT) becomes a standard normal distribution.2
Problem 18.1
The joint distribution of all of the latent variables z1, z2, …, zT given
input x is:
q(z1…T|x) = q(z1|x) · ∏t=2…T q(zt|zt−1).    (18.3)
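A minimal sketch of this forward process in Python; the constant noise schedule β = 0.03 and T = 100 match figure 18.2 but are otherwise arbitrary.

    import numpy as np

    # A sketch of the forward process (equation 18.1) with a constant schedule.
    rng = np.random.default_rng(0)
    T, beta = 100, 0.03

    def forward_process(x):
        """Generate latents z_1, ..., z_T sequentially from data x."""
        z, latents = x, []
        for _ in range(T):
            z = np.sqrt(1 - beta) * z + np.sqrt(beta) * rng.standard_normal(x.shape)
            latents.append(z)
        return latents

    x = np.array([2.0])
    zs = forward_process(x)
    print(zs[0], zs[-1])   # z_1 stays near x; z_100 is close to pure noise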
18.2.1 Diffusion kernel q(zt|x)
To train the decoder to invert this process, we use multiple samples zt at
time t for the same example x. However, generating these sequentially using
equation 18.1 is time-consuming when t is large. Fortunately, there is a
closed-form expression for q(zt|x), which allows us to directly draw
samples zt given initial data point x without computing the intermediate
variables z1 … zt−1. This is known as the diffusion kernel (figure 18.3).
Figure 18.3 Diffusion kernel. a) The point x* = 2.0 is propagated through the latent
variables using equation 18.1 (five paths shown in gray). The diffusion kernel q(zt|x*)
is the probability distribution over variable zt given that we started from x*. It can be
computed in closed-form and is a normal distribution whose mean moves toward
zero and whose variance increases as t increases. Heatmap shows q(zt|x*) for each
variable. Cyan lines show ±2 standard deviations from the mean. b) The diffusion
kernel q(zt|x*) is shown explicitly for t = 20, 40, 80. In practice, the diffusion kernel
allows us to sample a latent variable zt corresponding to a given x* without
computing the intermediate variables z1, …, zt−1. When t becomes very large, the
diffusion kernel becomes a standard normal.
To derive an expression for q(zt|x), consider the first two steps of the forward process:

z1 = √(1 − β1) · x + √β1 · ϵ1    (18.4)

z2 = √(1 − β2) · z1 + √β2 · ϵ2
   = √(1 − β2) · (√(1 − β1) · x + √β1 · ϵ1) + √β2 · ϵ2
   = √((1 − β2)(1 − β1)) · x + √(1 − β2 − (1 − β2)(1 − β1)) · ϵ1 + √β2 · ϵ2.    (18.5)

The last two terms are independent samples from mean-zero normal distributions with variances 1 − β2 − (1 − β2)(1 − β1) and β2, respectively. The mean of this sum is zero, and its variance is the sum of the component variances (see problem 18.2), so:

z2 = √((1 − β2)(1 − β1)) · x + √(1 − (1 − β2)(1 − β1)) · ϵ,    (18.6)

where ϵ is a new draw from a standard normal distribution. Continuing this process through subsequent steps shows that:

zt = √αt · x + √(1 − αt) · ϵ,  where αt = ∏s=1…t (1 − βs),    (18.7)

and hence the diffusion kernel is the normal distribution:

q(zt|x) = Normzt[√αt · x, (1 − αt) · I].    (18.8)
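The practical consequence is easy to demonstrate; a hedged sketch (again with an assumed constant schedule) draws zt in one step using equation 18.8:

    import numpy as np

    # A sketch of sampling directly from the diffusion kernel (equation 18.8),
    # avoiding the sequential loop; the schedule is an assumed constant beta.
    rng = np.random.default_rng(0)
    beta = 0.03 * np.ones(100)
    alpha = np.cumprod(1.0 - beta)          # alpha_t = prod_{s=1}^{t} (1 - beta_s)

    def sample_kernel(x, t):
        """Draw z_t ~ Norm[sqrt(alpha_t) * x, (1 - alpha_t) * I] in one step."""
        eps = rng.standard_normal(x.shape)
        return np.sqrt(alpha[t - 1]) * x + np.sqrt(1.0 - alpha[t - 1]) * eps

    x_star = np.array([2.0])                # the point x* = 2.0 from figure 18.3
    print(sample_kernel(x_star, 20), sample_kernel(x_star, 80))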
The marginal distribution q(zt) and the reversed conditional q(zt−1|zt) cannot be computed in closed form because they depend on the unknown data distribution Pr(x). However, the conditional distribution q(zt−1|zt, x), in which we also condition on the starting data point, is tractable. Applying Bayes' rule (appendix C.1.4):

q(zt−1|zt, x) = q(zt|zt−1, x) · q(zt−1|x) / q(zt|x)
             = q(zt|zt−1) · q(zt−1|x) / q(zt|x)
             ∝ Normzt[√(1 − βt) · zt−1, βt · I] · Normzt−1[√αt−1 · x, (1 − αt−1) · I],    (18.12)

where between the first two lines, we have used the fact that q(zt|zt−1, x) = q(zt|zt−1) because the diffusion process is Markov, and all information about zt is captured by zt−1. In the last line, we have substituted the definitions of the diffusion step (equation 18.2) and the diffusion kernel (equation 18.8; problem 18.3) and dropped the denominator q(zt|x), which is constant with respect to zt−1. The first term on the right is a normal distribution in zt; we use the Gaussian change of variables identity (appendix C.3.4):

Normv[Aw, B] ∝ Normw[(AᵀB⁻¹A)⁻¹AᵀB⁻¹v, (AᵀB⁻¹A)⁻¹]    (18.13)

to rewrite it as a normal distribution in zt−1:

Normzt[√(1 − βt) · zt−1, βt · I] ∝ Normzt−1[zt/√(1 − βt), (βt/(1 − βt)) · I].    (18.14)

Multiplying the two normal distributions in zt−1 together (problems 18.4–18.5) yields:

q(zt−1|zt, x) = Normzt−1[ ((1 − αt−1)/(1 − αt)) · √(1 − βt) · zt + (√αt−1 · βt/(1 − αt)) · x, (βt · (1 − αt−1)/(1 − αt)) · I ].    (18.15)

18.3 Decoder model (reverse process)
The decoder learns to reverse the diffusion process. Each reverse mapping from zt to zt−1 is approximated by a normal distribution whose mean is predicted by a network (problem 18.6):

Pr(zt−1|zt, ϕt) = Normzt−1[ft[zt, ϕt], σt² · I],    (18.16)
where ft[zt, ϕt] is a neural network that computes the mean of the normal
distribution in the estimated mapping from zt to the preceding latent
variable zt−1. The variances σt² are predetermined. If the hyperparameters βt in
the diffusion process are close to zero (and the number of time steps T is
large), then this normal approximation will be reasonable.
We generate new examples from Pr(x) using ancestral sampling. We
start by drawing zT from Pr(zT). Then we sample zT−1 from Pr(zT−1|zT, ϕT),
sample zT−2 from Pr(zT−2|zT−1, ϕT−1) and so on until we finally generate x
from Pr(x|z1, ϕ1).
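A minimal sketch of ancestral sampling, with a placeholder standing in for the trained network ft[zt, ϕt] and assumed fixed standard deviations σt:

    import numpy as np

    # A sketch of ancestral sampling; `predict_mean` is a placeholder for the
    # trained decoder network, and the sigma values are assumed constants.
    rng = np.random.default_rng(1)
    T = 100
    sigma = 0.05 * np.ones(T + 1)

    def predict_mean(z, t):
        """Placeholder for the learned decoder network f_t[z_t, phi_t]."""
        return 0.9 * z

    z = rng.standard_normal(2)              # draw z_T from Pr(z_T) = Norm[0, I]
    for t in range(T, 1, -1):               # sample z_{T-1}, ..., z_1 in turn
        z = predict_mean(z, t) + sigma[t] * rng.standard_normal(z.shape)
    x = predict_mean(z, 1)                  # finally generate x from Pr(x|z_1)
    print(x)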
18.4 Training
The joint distribution of the observed variable x and the latent variables {zt} is:

Pr(x, z1…T|ϕ1…T) = Pr(x|z1, ϕ1) · [ ∏t=2…T Pr(zt−1|zt, ϕt) ] · Pr(zT).    (18.17)

The likelihood of a data point is obtained by marginalizing over the latent variables (appendix C.1.2):

Pr(x|ϕ1…T) = ∫ Pr(x, z1…T|ϕ1…T) dz1…T.    (18.18)

This integral is intractable, so (as for the VAE) we define and maximize a lower bound. Writing the log-likelihood as an integral over q(z1…T|x):

log[Pr(x|ϕ1…T)] = log[ ∫ q(z1…T|x) · ( Pr(x, z1…T|ϕ1…T) / q(z1…T|x) ) dz1…T ],    (18.19)

Jensen's inequality gives the evidence lower bound:

log[Pr(x|ϕ1…T)] ≥ ∫ q(z1…T|x) · log[ Pr(x, z1…T|ϕ1…T) / q(z1…T|x) ] dz1…T = ELBO[ϕ1…T].    (18.20)

To simplify this bound, each factor q(zt|zt−1) in the definition of q(z1…T|x) (equation 18.3) is rewritten as:

q(zt|zt−1) = q(zt|zt−1, x)    (18.21)
           = q(zt−1|zt, x) · q(zt|x) / q(zt−1|x),    (18.22)
where the first equality follows because all of the information about
variable zt is encompassed in zt−1, so the extra conditioning on the data x is
irrelevant. The second equality is a straightforward application of Bayes’
rule.
Substituting these expressions into the ELBO and simplifying gives:

ELBO[ϕ1…T] ≈ ∫ q(z1…T|x) · [ log[Pr(x|z1, ϕ1)] + Σt=2…T log[ Pr(zt−1|zt, ϕt) / q(zt−1|zt, x) ] ] dz1…T,    (18.24)

where all but two of the terms in the product of the ratios q(zt−1|x)/q(zt|x) cancel out, leaving only q(z1|x) and q(zT|x). The term involving q(zT|x) is approximately log[1] = 0 since the result of the forward process q(zT|x) is a standard normal distribution, and so is equal to the prior Pr(zT).
The simplified ELBO is hence:

ELBO[ϕ1…T] ≈ 𝔼q(z1|x)[ log[Pr(x|z1, ϕ1)] ] − Σt=2…T 𝔼q(zt|x)[ DKL[ q(zt−1|zt, x) ‖ Pr(zt−1|zt, ϕt) ] ],    (18.25)

where we have marginalized over the irrelevant variables in q(z1…T|x) and used the definition of KL divergence (appendix C.5.1; see problem 18.7).

The first term of equation 18.25 measures how accurately the model reconstructs the data from the first latent variable:

𝔼q(z1|x)[ log[Pr(x|z1, ϕ1)] ]    (18.26)
and is equivalent to the reconstruction term in the VAE. The ELBO will be
larger if the model prediction matches the observed data. As for the VAE,
we will approximate the expectation over the log of this quantity using a
Monte Carlo estimate (see equations 17.22–17.23), in which we estimate
the expectation with a sample from q(z1|x).
The KL divergence terms in the ELBO measure the distance between
Pr(zt−1|zt, ϕt) and q(zt−1|zt, x), which were defined in equations 18.16 and
18.15, respectively:
DKL[ q(zt−1|zt, x) ‖ Pr(zt−1|zt, ϕt) ].    (18.27)

Since both distributions are normal, and the covariances do not depend on the parameters, the KL divergence has a closed form (appendix C.5.4) in which only the squared difference of the means matters:

DKL[ q(zt−1|zt, x) ‖ Pr(zt−1|zt, ϕt) ] = (1/(2σt²)) · ‖ ((1 − αt−1)/(1 − αt)) · √(1 − βt) · zt + (√αt−1 · βt/(1 − αt)) · x − ft[zt, ϕt] ‖² + C.    (18.28)

Combining the reconstruction and KL terms and approximating the expectations with samples gives the overall loss function over the dataset (problem 18.8):

L[ϕ1…T] = Σi [ −log[ Normxi[f1[zi1, ϕ1], σ1² · I] ] + Σt=2…T (1/(2σt²)) · ‖ ((1 − αt−1)/(1 − αt)) · √(1 − βt) · zit + (√αt−1 · βt/(1 − αt)) · xi − ft[zit, ϕt] ‖² ],    (18.29)
where xi is the ith data point, and zit is the associated latent variable at
diffusion step t.
Figure 18.8 Fitted model results. Cyan and brown curves are original and
estimated densities and correspond to the top rows of figures 18.4 and 18.7,
respectively. Vertical bars are binned samples from the model, generated by sampling
from Pr(zT) and propagating back through the variables zT−1, zT−2, … as shown for
the five paths in figure 18.7.
18.5 Reparameterization of loss function
Results improve if the loss is reparameterized so that the network predicts the noise rather than the mean of the previous latent variable. The diffusion kernel (equation 18.8) lets us write zt in terms of the original data and a single noise draw:

zt = √αt · x + √(1 − αt) · ϵ.    (18.30)
It follows that the data term x in equation 18.28 can be expressed as the
diffused image minus the noise that was added to it:
x = zt/√αt − (√(1 − αt)/√αt) · ϵ.    (18.31)
Substituting this into the target terms from equation 18.29 gives:
((1 − αt−1)/(1 − αt)) · √(1 − βt) · zt + (√αt−1 · βt/(1 − αt)) · x = zt/√(1 − βt) − (βt/(√(1 − αt) · √(1 − βt))) · ϵ,    (18.32)

where we have substituted equation 18.31 for x, used the relation √αt = √(1 − βt) · √αt−1, multiplied out the terms, and simplified (problems 18.9–18.10).
Substituting this back into the loss function (equation 18.29), we have:
L[ϕ1…T] = Σi [ −log[ Normxi[f1[zi1, ϕ1], σ1² · I] ] + Σt=2…T (1/(2σt²)) · ‖ zit/√(1 − βt) − (βt/(√(1 − αt) · √(1 − βt))) · ϵit − ft[zit, ϕt] ‖² ].    (18.34)

This suggests reparameterizing the model so that a network ĝt[zt, ϕt] predicts the noise ϵ that was mixed with the data:

ft[zt, ϕt] = zt/√(1 − βt) − (βt/(√(1 − αt) · √(1 − βt))) · ĝt[zt, ϕt].    (18.35)

Substituting the new model into equation 18.34 produces the criterion:

L[ϕ1…T] = Σi [ −log[ Normxi[f1[zi1, ϕ1], σ1² · I] ] + Σt=2…T (βt²/((1 − αt)(1 − βt) · 2σt²)) · ‖ ĝt[zit, ϕt] − ϵit ‖² ].    (18.36)

The log normal can be written as a least squares loss plus a constant Ci (section 5.3.1). Substituting in the definitions of x and f1[z1, ϕ1] from equations 18.31 and 18.35, respectively, the first term simplifies to:

(β1/((1 − β1) · 2σ1²)) · ‖ ĝ1[zi1, ϕ1] − ϵi1 ‖² + Ci,    (18.37)

which has the same form as the other terms (problem 18.11). In practice, the weighting of the individual terms is ignored (equal weighting produces better sample quality; see notes), giving the simple final criterion:

L[ϕ1…T] = Σi Σt=1…T ‖ ĝt[√αt · xi + √(1 − αt) · ϵit, ϕt] − ϵit ‖².    (18.39)
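The final criterion is simple to implement. The following PyTorch sketch computes one Monte Carlo sample of this loss for a batch; the network `model`, the schedule, and the data are placeholders, not a definitive implementation.

    import torch

    # One Monte Carlo sample of the simplified diffusion loss (equation 18.39);
    # `model(z_t, t)` stands in for the noise-prediction network g_t[z_t, phi].
    def diffusion_loss(model, x, alpha):
        T = len(alpha)
        t = torch.randint(1, T + 1, (x.shape[0],))          # random timestep per item
        a = alpha[t - 1].view(-1, *([1] * (x.dim() - 1)))   # alpha_t, broadcastable
        eps = torch.randn_like(x)                           # noise to be predicted
        z_t = torch.sqrt(a) * x + torch.sqrt(1 - a) * eps   # diffusion kernel (18.30)
        return ((model(z_t, t) - eps) ** 2).mean()          # least-squares loss

    beta = 0.02 * torch.ones(100)
    alpha = torch.cumprod(1 - beta, dim=0)
    toy_model = lambda z, t: z                              # placeholder network
    x = torch.randn(8, 2)
    print(diffusion_loss(toy_model, x, alpha))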
Figure 18.9 U-Net as used in diffusion models for images. The network aims to
predict the noise that was added to the image. It consists of an encoder which reduces
the scale and increases the number of channels and a decoder which increases the
scale and reduces the number of channels. The encoder representations are
concatenated to their partner in the decoder. Connections between adjacent
representations consist of residual blocks, and periodic global self-attention in which
every spatial position interacts with every other spatial position. A single network is
used for all time steps, by passing a sinusoidal time embedding (figure 12.5) through
a shallow neural network and adding the result to the channels at every spatial
position at every stage of the U-Net.
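A minimal sketch of such a sinusoidal time embedding; the dimension and frequency range are illustrative choices, not the exact values used in practice.

    import torch

    # Sinusoidal time embedding (cf. figure 12.5) for conditioning on step t.
    def time_embedding(t, dim=64):
        """Map integer timesteps t (shape [batch]) to embeddings [batch, dim]."""
        freqs = torch.exp(torch.linspace(0, -8, dim // 2))     # geometric frequencies
        angles = t.float().unsqueeze(1) * freqs.unsqueeze(0)   # [batch, dim/2]
        return torch.cat([torch.sin(angles), torch.cos(angles)], dim=1)

    emb = time_embedding(torch.tensor([1, 50, 100]))
    print(emb.shape)   # torch.Size([3, 64])
    # In a U-Net, this would be passed through a small MLP and added to the
    # channels at every spatial position of each stage.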
Song et al. (2021a) showed that a whole family of diffusion processes is compatible with the same training objective (see notes). Among this family are denoising diffusion implicit models, which are no longer stochastic after the first step from x to z1, and accelerated sampling models, where the forward process is defined only on a sub-sequence of
time steps. This allows a reverse process that skips time steps and hence
makes sampling much more efficient; good samples can be created with 50
time steps when the forward process is no longer stochastic. This is much
faster than before but still slower than most other generative models.
Notebook 18.4
Families of diffusion
models
18.6.3 Conditional generation
If the data has associated labels c, these can be exploited to control the
generation. Sometimes this can improve generation results in GANs, and
we might expect this to be the case in diffusion models as well; it's easier to
denoise an image if you have some information about what that image
contains. One approach to conditional synthesis in diffusion models is
classifier guidance. This modifies the denoising update from zt to zt−1 to
take into account class information c. In practice, this means adding an
extra term into the final update step in algorithm 18.2 to yield:
zt−1 = ft[zt, ϕt] + σt² · ∂log[Pr(c|zt)]/∂zt + σt · ϵ,    (18.40)

where the new middle term pushes the update toward latent variables that are more compatible with the class c, as judged by a separately trained classifier Pr(c|zt).
Figure 18.13 Conditional generation using text prompts. Synthesized images from
a cascaded generation framework, conditioned on a text prompt encoded by a large
language model. The stochastic model can produce many different images
compatible with the prompt. The model can count objects and incorporate text into
images. Adapted from Saharia et al. (2022b).
18.7 Summary
Diffusion models map the data examples through a series of latent variables
by repeatedly blending the current representation with random noise. After
sufficient steps, the representation becomes indistinguishable from white
noise. Since these steps are small, the reverse denoising process at each step
can be approximated with a normal distribution and predicted by a deep
learning model. The loss function is based on the evidence lower bound
(ELBO) and ultimately results in a simple least-squares formulation.
For image generation, each denoising step is implemented using a U-
Net, so sampling is slow compared to other generative models. To improve
generation speed, it's possible to change the diffusion model to a
deterministic formulation, and here sampling with fewer steps works well.
Several methods have been proposed to condition generation on class
information, images, and text information. Combining these methods
produces impressive text-to-image synthesis results.
Notes
Denoising diffusion models were introduced by Sohl-Dickstein et al.
(2015), and early related work based on score-matching was carried out by
Song & Ermon (2019). Ho et al. (2020) produced image samples that were
competitive with GANs and kick-started a wave of interest in this area.
Most of the exposition in this chapter, including the original formulation
and the reparameterization, is derived from this paper. Dhariwal & Nichol
(2021) improved the quality of these results and showed for the first time
that images from diffusion models were quantitatively superior to GAN
models in terms of Fréchet Inception Distance. At the time of writing, the
state-of-the-art results for conditional image synthesis have been achieved
by Karras et al. (2022). Surveys of denoising diffusion models can be found
in Croitoru et al. (2022), Cao et al. (2022), Luo (2022), and Yang et al.
(2022).
Applications for images: Applications of diffusion models include text-
to-image generation (Nichol et al., 2022; Ramesh et al., 2022; Saharia et al.,
2022b), image-to-image tasks such as colorization, inpainting, uncropping
and restoration (Saharia et al., 2022a), super-resolution (Saharia et al.,
2022c), image editing (Hertz et al., 2022; Meng et al., 2021), removing
adversarial perturbations (Nie et al., 2022), semantic segmentation
(Baranchuk et al., 2022), and medical imaging (Song et al., 2021b; Chung
& Ye, 2022; Chung et al., 2022; Peng et al., 2022; Xie & Li, 2022; Luo et
al., 2022) where the diffusion model is sometimes used as a prior.
Different data types: Diffusion models have also been applied to video
data (Ho et al., 2022b; Harvey et al., 2022; Yang et al., 2022; Höppe et al.,
2022; Voleti et al., 2022) for generation, past and future frame prediction,
and interpolation. They have been used for 3D shape generation (Zhou et
al., 2021; Luo & Hu, 2021), and recently a technique has been introduced to
generate 3D models using only a 2D text-to-image diffusion model (Poole
et al., 2023). Austin et al. (2021) and Hoogeboom et al. (2021) investigated
diffusion models for discrete data. Kong et al. (2021) and Chen et al.
(2021d) applied diffusion models to audio data.
Alternatives to denoising: The diffusion models in this chapter mix
noise with the data and build a model to gradually denoise the result.
However, degrading the image using noise is not necessary. Rissanen et al.
(2022) devised a method that progressively blurred the image and Bansal et
al. (2022) showed that the same ideas work with a large family of
degradations that do not have to be stochastic. These include masking,
morphing, blurring, and pixelating.
Comparison to other generative models: Diffusion models synthesize
higher quality images than other generative models and are simple to train.
They can be thought of as a special case of a hierarchical VAE (Vahdat &
Kautz, 2020; Sønderby et al., 2016b) where the encoder is fixed, and the
latent space is the same size as the data. They are probabilistic, but in their
basic form, they can only compute a lower bound on the likelihood of a data
point. However, Kingma et al. (2021) show that this lower bound improves
on the exact log-likelihoods for test data from normalizing flows and
autoregressive models. The likelihood for diffusion models can be
computed by converting to an ordinary differential equation (Song et al.,
2021c) or by training a continuous normalizing flow model with a
diffusion-based criterion (Lipman et al., 2022). The main disadvantages of
diffusion models are that they are slow and that the latent space has no
semantic interpretation.
Improving quality: Many techniques have been proposed to improve
image quality. These include the reparameterization of the network
described in section 18.5 and the equal weighting of the subsequent terms
(Ho et al., 2020). Choi et al. (2022) subsequently investigated different
weightings of terms in the loss function.
Kingma et al. (2021) improved the test log-likelihood of the model by
learning the denoising weights βt. Conversely, Nichol & Dhariwal (2021) improved performance by learning separate variances σt² of the denoising estimate at each time step in addition to the mean. Bao et al. (2022) show
how to learn the variances after training the model.
Ho et al. (2022a) developed the cascaded method for producing very high-
resolution images (figure 18.11). To prevent artifacts in lower-resolution
images from being propagated to higher resolutions, they introduced noise
conditioning augmentation; here, the lower-resolution conditioning image is
degraded by adding noise at each training step. This reduces the reliance on
the exact details of the lower-resolution image during training. It is also
done during inference, where the best noise level is chosen by sweeping
over different values.
Improving speed: One of the major drawbacks of diffusion models is
that they take a long time to train and sample from. Stable diffusion
(Rombach et al., 2022) projects the original data to a smaller latent space
using a conventional autoencoder and then runs the diffusion process in this
smaller space. This has the advantages of reducing the dimensionality of the
training data for the diffusion process and allowing other data types (text,
graphs, etc.) to be described by diffusion models. Vahdat et al. (2021)
applied a similar approach.
Song et al. (2021a) showed that an entire family of diffusion processes is
compatible with the training objective. Most of these processes are non-Markovian (i.e., the diffusion step depends on more than just the result of the previous step). One of these models is the denoising diffusion implicit
model (DDIM), in which the updates are not stochastic (figure 18.10b).
This model is amenable to taking larger steps (figure 18.10b) without
inducing large errors. It effectively converts the model into an ordinary
differential equation (ODE) in which the trajectories have low curvature
and allows efficient numerical methods for solving ODEs to be applied.
Song et al. (2021c) propose converting the underlying stochastic differential
equations into a probability flow ODE which has the same marginal
distributions as the original process. Vahdat et al. (2021), Xiao et al.
(2022b), and Karras et al. (2022) all exploit techniques for solving ODEs to
speed up synthesis. Karras et al. (2022) identified the best-performing time
discretization for sampling and evaluated different sampler schedules. The
result of these and other improvements has been a significant drop in steps
required during synthesis.
Sampling is slow because many small diffusion steps are required to ensure
that the posterior distribution q(zt−1|zt) is close to Gaussian (figure 18.5), so
the Gaussian distribution in the decoder is appropriate. If we use a model
that describes a more complex distribution at each denoising step, then we
can use fewer diffusion steps in the first place. To this end, Xiao et al.
(2022b) have investigated using conditional GAN models, and Gao et al.
(2021) investigated using conditional energy-based models. Although these
models cannot describe the original data distribution, they suffice to predict
the (much simpler) reverse diffusion step.
Salimans & Ho (2022) distilled adjacent steps of the denoising process into
a single step to speed up synthesis. Dockhorn et al. (2022) introduced
momentum into the diffusion process. This makes the trajectories smoother
and so more amenable to coarse sampling.
Conditional generation: Dhariwal & Nichol (2021) introduced classifier
guidance, in which a classifier learns to identify the category of object
being synthesized at each step, and this is used to bias the denoising update
toward that class. This works well, but training a separate classifier is
expensive. Classifier-free guidance (Ho & Salimans, 2022) concurrently
trains conditional and unconditional denoising models by dropping the class
information some proportion of the time in a process akin to dropout. This
technique allows control of the relative contributions of the conditional and
unconditional components. Over-weighting the conditional component
causes the model to produce more typical and realistic samples.
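A hedged sketch of the sampling-time combination; the network interface and the guidance scale w are illustrative (w = 0 recovers the purely conditional estimate).

    import torch

    # Classifier-free guidance: combine conditional and unconditional noise
    # estimates, over-weighting the conditional component by a scale w.
    def guided_noise_estimate(model, z_t, t, c, w=3.0):
        eps_cond = model(z_t, t, c)         # conditional noise estimate
        eps_uncond = model(z_t, t, None)    # unconditional noise estimate
        return (1 + w) * eps_cond - w * eps_uncond

    # Placeholder for a network trained with conditioning dropped some
    # proportion of the time, so both modes are learned.
    toy_model = lambda z, t, c: z if c is None else 0.9 * z
    print(guided_noise_estimate(toy_model, torch.randn(1, 2), 10, c=1))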
The standard technique for conditioning on images is to append the
(resized) image to the different layers of the U-Net. For example, this was
used in the cascaded generation process for super-resolution (Ho et al.,
2022a). Choi et al. (2021) provide a method for conditioning on images in
an unconditional diffusion model by matching the latent variables with
those of a conditioning image. The standard technique for conditioning on
text is to linearly transform the text embedding to the same size as the U-
Net layer and then add it to the representation in the same way that the time
embedding is introduced (figure 18.9).
Existing diffusion models can also be fine-tuned to be conditioned on edge
maps, joint positions, segmentation, depth maps, etc., using a neural
network structure called a control network (Zhang & Agrawala, 2023).
Text-to-image: Before diffusion models, state-of-the-art text-to-image
systems were based on transformers (e.g., Ramesh et al., 2021). GLIDE
(Nichol et al., 2022) and Dall·E 2 (Ramesh et al., 2022) are both
conditioned on embeddings from the CLIP model (Radford et al., 2021),
which generates joint embeddings for text and image data. Imagen (Saharia
et al., 2022b) showed that text embeddings from a large language model
could produce even better results (see figure 18.13). The same authors
introduced a benchmark (DrawBench) which is designed to evaluate the
ability of a model to render colors, numbers of objects, spatial relations, and
other characteristics. Feng et al. (2022) developed a Chinese text-to-image
model.
Connections to other models: This chapter described diffusion models
as hierarchical variational autoencoders because this approach connects
most closely with the other parts of this book. However, diffusion models
also have close connections with stochastic differential equations (consider
the paths in figure 18.5) and with score matching (Song & Ermon, 2019,
2020). Song et al. (2021c) presented a framework based on stochastic
differential equations that encompasses both the denoising and score
matching interpretations. Diffusion models also have close connections to
normalizing flows (Zhang & Chen, 2021). Yang et al. (2022) present an
overview of the relationship between diffusion models and other generative
approaches.
Problems
Problem 18.1 Show that if Var[xt−1] = I and we use the update:

xt = √(1 − βt) · xt−1 + √βt · ϵt,    (18.41)

then Var[xt] = I, so the variance stays the same through the diffusion process.

Substitute the definitions from equation 18.27 into this expression and show
that the only term that depends on the parameters ϕ is the first term from
equation 18.28.
1 Note that this is the opposite nomenclature to normalizing flows, where the inverse mapping moves from the data to the latent variable, and the forward mapping moves back again.
2 We use q(zt|zt−1) rather than Pr(zt|zt−1) to match the notation in the description of the VAE
encoder in the previous chapter.
OceanofPDF.com
Chapter 19
Reinforcement learning
Figure 19.3 Markov decision process. a) The agent (penguin) can perform one of a
set of actions in each state. The action influences both the probability of moving to
the successor state and the probability of receiving rewards. b) Here, the four actions
correspond to moving up, right, down, and left. c) For any state (here, state 6), the
action changes the probability of moving to the next state. The penguin moves in the
intended direction with 50% probability, but the ice is slippery, so it may slide to one
of the other adjacent positions with equal probability. Accordingly, in panel (a), the
action taken (gray arrows) doesn't always line up with the trajectory (orange line).
Here, the action does not affect the reward, so Pr(rt+1|st, at) = Pr(rt+1|st). The
trajectory τ from an MDP consists of a sequence s1, a1, r2, s2, a2, r3, s3, a3, r4 … of alternating states st, actions at, and rewards rt+1. Note that here the penguin receives
the reward when it leaves a state with a fish (i.e., the reward is received for passing
through the fish square, regardless of whether the penguin arrived there intentionally
or not).
19.1.5 Policy
The rules that determine the agent's action for each state are known as the
policy (figure 19.5). This may be stochastic (the policy defines a
distribution over actions for each state) or deterministic (the agent always
takes the same action in a given state). A stochastic policy π[a|s] returns a
probability distribution over each possible action a for state s, from which a
new action is sampled. A deterministic policy π[a|s] returns one for the
action a that is chosen for state s and zero otherwise. A stationary policy
depends only on the current state. A non-stationary policy also depends on
the time step.
Figure 19.5 Policies. a) A deterministic policy always chooses the same action in
each state (indicated by arrow). Some policies are better than others. This policy is
not optimal but still generally steers the penguin from top-left to bottom-right where
the reward lies. b) This policy is more random. c) A stochastic policy has a
probability distribution over actions for each state (probability indicated by size of
arrows). This has the advantage that the agent explores the states more thoroughly
and can be necessary for optimal performance in partially observable Markov
decision processes.
The environment and the agent form a loop (figure 19.6). The agent
receives the state st and reward rt from the last time step. Based on this, it
can modify the policy π[at|st] if desired and choose the next action at. The
environment then assigns the next state according to Pr(st+1|st, at) and the
reward according to Pr(rt+1|st, at).
Notebook 19.1
Markov decision processes
Figure 19.6 Reinforcement learning loop. The agent takes an action at at time t
based on the state st, according to the policy π[at|st]. This triggers the generation of a
new state st+1 (via the state transition function) and a reward rt+1 (via the reward
function). Both are passed back to the agent, which then chooses a new action.
The value v[st|π] of a state st under policy π is the expected return (appendix C.2) if we start in that state and follow the policy thereafter:

v[st|π] = 𝔼[ Σj=0…∞ γ^j · rt+j+1 | st, π ],    (19.2)

where γ is the discount factor.
Figure 19.7 State and action values. a) The value v[st|π] of a state st (number at
each position) is the expected return for this state for a given policy π (gray arrows).
It is the average sum of discounted rewards received over many trajectories started
from this state. Here, states closer to the fish are more valuable. b) The value q[st, at,
π] of an action at in state st (four numbers at each position/state corresponding to
four actions) is the expected return given that this particular action is taken in this
state. In this case, it gets larger as we get closer to the fish and is larger for actions
that head in the direction of the fish. c) If we know the action values at a state, then
the policy can be modified so that it chooses the maximum of these values (red
numbers in panel b).
Informally, the state value tells us the long-term reward we can expect on
average if we start in this state and follow the specified policy thereafter. It
is highest for states where it's probable that subsequent transitions will
bring large rewards soon (assuming the discount factor γ is less than one).
Similarly, the action value or state-action value function q[st, at|π] is the
expected return from executing action at in state st (figure 19.7b):
q[st, at|π] = 𝔼[ Σj=0…∞ γ^j · rt+j+1 | st, at, π ].    (19.3)
The action value tells us the long-term reward we can expect on average if
we start in this state, take this action, and follow the specified policy
thereafter. Through this quantity, reinforcement learning algorithms connect
future rewards to current actions (i.e., resolve the temporal credit
assignment problem).
19.2.2 Optimal policy
We want a policy that maximizes the expected return. For MDPs (but not
POMDPs), there is always a deterministic, stationary policy that maximizes
the value of every state. If we know this optimal policy, then we get the
optimal state-value function v*[st]:
v*[st] = maxπ [ v[st|π] ],    (19.4)

and likewise the optimal action-value function:

q*[st, at] = maxπ [ q[st, at|π] ].    (19.5)
Turning this on its head, if we knew the optimal action-values q*[st, at],
then we can derive the optimal policy by choosing the action at with the
highest value (figure 19.7c):1
π[at|st] ← argmaxat [ q*[st, at] ].    (19.6)

The state values and action values are closely related. The value of a state is the weighted sum of the action values in that state, where the weights are the policy probabilities (figure 19.8):

v[st] = Σat π[at|st] · q[st, at].    (19.7)
Figure 19.8 Relationship between state values and action values. The value of state
six v[st = 6] is a weighted sum of the action values q[st = 6, at] at state six, where the
weights are the policy probabilities π[at|st = 6] of taking that action.
Similarly, the value of an action is the immediate reward rt+1 = r[st, at]
generated by taking the action, plus the value v[st+1] of being in the
subsequent state st+1 discounted by γ (figure 19.9).3 Since the assignment of
st+1 is not deterministic, we weight the values v[st+1] according to the
transition probabilities Pr(st+1|st, at):
q[st, at] = r[st, at] + γ · Σst+1 Pr(st+1|st, at) · v[st+1].    (19.8)
Figure 19.9 Relationship between action values and state values. The value q[st =
6, at = 2] of taking action two in state six is the reward r[st = 6, at = 2] from taking
that action plus a weighted sum of the discounted values v[st+1] of being in successor
states, where the weights are the transition probabilities Pr(st+1|st = 6, at = 2). The
Bellman equations chain this relation with that of figure 19.8 to link the current and
next (i) state values and (ii) action values.
Substituting each of these relations into the other links the current and next state values and the current and next action values:

v[st] = Σat π[at|st] · ( r[st, at] + γ · Σst+1 Pr(st+1|st, at) · v[st+1] )    (19.9)

q[st, at] = r[st, at] + γ · Σst+1 Pr(st+1|st, at) · Σat+1 π[at+1|st+1] · q[st+1, at+1].    (19.10)
The latter two relations are the Bellman equations and are the backbone
of many RL methods. In short, they say that the state (action) values have to
be self-consistent. Consequently, when we update an estimate of one state
(action) value, this will have a ripple effect that causes modifications to all
the others.
Policy evaluation: We sweep through the states st, updating their values:
v[st] ← Σat π[at|st] · ( r[st, at] + γ · Σst+1 Pr(st+1|st, at) · v[st+1] ),    (19.11)
where st+1 is the successor state and Pr(st+1|st, at) is the state transition
probability. Each update makes v[st] consistent with the value at the
successor state st+1 using the Bellman equation for state values (equation
19.9). This is termed bootstrapping.
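A minimal sketch of tabular policy evaluation for a small random MDP; the transition tensor, rewards, and policy here are illustrative placeholders.

    import numpy as np

    # Sweeps of the update in equation 19.11 for a tiny MDP.
    n_states, n_actions = 4, 2
    rng = np.random.default_rng(0)
    P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # Pr(s'|s,a)
    r = rng.standard_normal((n_states, n_actions))                    # r[s,a]
    pi = np.full((n_states, n_actions), 1.0 / n_actions)              # policy
    gamma = 0.9

    v = np.zeros(n_states)
    for _ in range(100):                      # sweep until values converge
        for s in range(n_states):
            v[s] = sum(pi[s, a] * (r[s, a] + gamma * P[s, a] @ v)
                       for a in range(n_actions))
    print(v)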
Policy improvement: We then update the policy by greedily choosing the action that maximizes the expected return at each state:

π[at|st] ← argmaxat [ r[st, at] + γ · Σst+1 Pr(st+1|st, at) · v[st+1] ],    (19.12)

or, where the action values are stored directly, by taking the maximum of the action values:

π[at|st] ← argmaxat [ q[st, at] ].    (19.13)

Alternating policy evaluation and policy improvement in this way is known as policy iteration.
Figure 19.11 Monte Carlo methods. a) The policy (arrows) is initialized randomly.
The MDP is repeatedly simulated, and the trajectories of these episodes are stored
(orange and brown paths represent two trajectories). b) The action values are
empirically estimated based on the observed returns averaged over these trajectories.
In this case, the action values were all initially zero and have been updated where an
action was observed. c) The policy can then be updated according to the action which
received the best (or least bad) reward.
This is an on-policy method; the current best policy is used to guide the
agent through the environment. This policy is based on the observed action
values in every state, but of course, it's not possible to estimate the value of
actions that haven't been used, and there is nothing to encourage the
algorithm to explore these. One solution is to use exploring starts. Here,
episodes with all possible state-action pairs are initiated, so every
combination is observed at least once. However, this is impractical if the
number of states is large or the starting point cannot be controlled. A
different approach is to use an epsilon greedy policy, in which a random
action is taken with probability ϵ, and the optimal action is allotted the
remaining probability. The choice of ϵ trades off exploitation and
exploration. Here, an on-policy method will seek the best policy from this
epsilon-greedy family, which will not generally be the best overall policy.
Problem 19.4
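A minimal sketch of an epsilon-greedy choice over a tabular action-value array (toy values; ϵ = 0.1):

    import numpy as np

    # Epsilon-greedy policy over tabular action values q[s, a].
    rng = np.random.default_rng(0)

    def epsilon_greedy(q, s, epsilon=0.1):
        """With probability epsilon take a random action; otherwise the greedy one."""
        n_actions = q.shape[1]
        if rng.random() < epsilon:
            return int(rng.integers(n_actions))   # explore
        return int(np.argmax(q[s]))               # exploit

    q = rng.standard_normal((4, 2))               # toy action-value table
    print([epsilon_greedy(q, s) for s in range(4)])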
Conversely, in off-policy methods, the optimal policy π (the target
policy) is learned based on episodes generated by a different behavior
policy π′. Typically, the target policy is deterministic, and the behavior
policy is stochastic (e.g., an epsilon-greedy policy). Hence, the behavior
policy can explore the environment, but the learned target policy remains
efficient. Some off-policy methods explicitly use importance sampling
(section 17.8.1) to estimate the action value under policy π using samples
from π′. Others, such as Q-learning (described in the next section), estimate
the values based on the greedy action, even though this is not necessarily
what was chosen.
Notebook 19.3
Monte Carlo methods
SARSA is an on-policy temporal difference method. After each step, it updates the action value using the action actually chosen at the next state:

q[st, at] ← q[st, at] + α · ( r[st, at] + γ · q[st+1, at+1] − q[st, at] ),    (19.14)
where α ∈ ℝ+ is the learning rate. The bracketed term is called the TD error
and measures the consistency between the estimated action value q[st, at]
and the estimate r[st, at]+γ · q[st+1, at+1] after taking a single step.
By contrast, Q-Learning is an off-policy algorithm with update (figure
19.12):
q[st, at] ← q[st, at] + α · ( r[st, at] + γ · maxa [ q[st+1, a] ] − q[st, at] ),    (19.15)
where now the choice of action at each step is derived from a different
behavior policy π′.
Notebook 19.4
Temporal difference
methods
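A hedged sketch of tabular Q-learning with an epsilon-greedy behavior policy; the environment `step` function is a stand-in, not a real task.

    import numpy as np

    # Tabular Q-learning (equation 19.15) with an epsilon-greedy behavior policy.
    rng = np.random.default_rng(0)
    n_states, n_actions = 16, 4
    q = np.zeros((n_states, n_actions))
    alpha, gamma, epsilon = 0.1, 0.9, 0.1

    def step(s, a):
        """Placeholder environment: returns (reward, next_state)."""
        return float(s == 14 and a == 1), (s + 1) % n_states

    s = 0
    for _ in range(1000):
        a = rng.integers(n_actions) if rng.random() < epsilon else int(np.argmax(q[s]))
        r, s_next = step(s, a)
        # Off-policy target uses the maximum action value at the next state:
        q[s, a] += alpha * (r + gamma * np.max(q[s_next]) - q[s, a])
        s = s_next
    print(q.max())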
Figure 19.12 Q-learning. a) The agent starts in state st and takes action at = 2
according to the policy. It does not slip on the ice and moves downward, receiving
reward r[st, at] = 0 for leaving the original state. b) The maximum action value at the
new state is found (here 0.43). c) The action value for action 2 in the original state is
updated to 1.12 based on the current estimate of the maximum action value at the
subsequent state, the reward, discount factor γ = 0.9, and learning rate α = 0.1. This
changes the highest action value at the original state, so the policy changes.
In both cases, the policy is updated by taking the maximum of the action
values at each state (equation 19.13). It can be shown that these updates are
contraction mappings (see equation 16.20); the action values will eventually
converge, assuming that every state-action pair is visited an infinite number
of times.
Problem 19.5
When the state space is large or continuous, the table of action values is replaced by a deep network q[st, at, ϕ] with parameters ϕ:

q[st, at] ≈ q[st, at, ϕ].    (19.16)

The parameters are chosen to minimize the squared temporal difference error:

L[ϕ] = ( r[st, at] + γ · maxa [ q[st+1, a, ϕ] ] − q[st, at, ϕ] )².    (19.17)
Figure 19.14 Deep Q-network architecture. The input st consists of four adjacent
frames of the ATARI game. Each is resized to 84×84 and converted to grayscale.
These frames are represented as four channels and processed by an 8×8 convolution
with stride four, followed by a 4×4 convolution with stride 2, followed by two fully
connected layers. The final output predicts the action value q[st, at] for each of the 18
actions in this state.
Several modifications were made to the standard training procedure.
First, the rewards (which were driven by the score in the game) were
clipped to −1 for a negative change and +1 for a positive change. This
compensates for the wide variation in scores between different games and
allows the same learning rate to be used. Second, the system exploited
experience replay. Rather than update the network based on the tuple <st, at,
rt+1, st+1 > at the current step or with a batch of the last I tuples, all recent
tuples were stored in a buffer. This buffer was sampled randomly to
generate a batch at each step. This approach reuses data samples many
times and reduces correlations between the samples in the batch that arise
due to the similarity of adjacent frames.
Finally, the issue of convergence in fitted Q-Networks was tackled by
fixing the target parameters to values ϕ− and only updating them
periodically. This gives the update:
ϕ ← ϕ + α · ( r[st, at] + γ · maxa [ q[st+1, a, ϕ−] ] − q[st, at, ϕ] ) · ∂q[st, at, ϕ]/∂ϕ.    (19.18)
Now the network no longer chases a moving target and is less prone to
oscillation.
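A minimal PyTorch sketch of this fixed-target update; network sizes and the batch of transitions are illustrative.

    import copy
    import torch
    import torch.nn as nn

    # Deep Q-learning step with a periodically updated target network (eq. 19.18).
    q_net = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))
    target_net = copy.deepcopy(q_net)        # parameters phi-, updated periodically
    opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)
    gamma = 0.99

    def dqn_step(s, a, r, s_next):
        with torch.no_grad():                # target does not receive gradients
            target = r + gamma * target_net(s_next).max(dim=1).values
        pred = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
        loss = ((target - pred) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
        return loss.item()

    # Toy batch of transitions; every few hundred steps one would copy
    # q_net's parameters into target_net.
    s, a = torch.randn(8, 4), torch.randint(2, (8,))
    print(dqn_step(s, a, torch.ones(8), torch.randn(8, 4)))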
Using these and other heuristics and with an ϵ-greedy policy, Deep Q-
Networks performed at a level comparable to a professional game tester
across a set of 49 games using the same network (trained separately for
each game). It should be noted that the training process was data-intensive.
It took around 38 full days of experience to learn each game. In some
games, the algorithm exceeded human performance. On other games like
“Montezuma's Revenge,” it barely made any progress. This game features
sparse rewards and multiple screens with quite different appearances.
The maximization over actions in the Q-learning target:

maxa [ q[st+1, a, ϕ] ]    (19.19)

leads to a systematic bias in the estimated action values q[st, at]. Consider
two actions that provide the same average reward, but one is stochastic and
the other deterministic. The stochastic reward will exceed the average
roughly half of the time and be chosen by the maximum operation, causing
the corresponding action value q[st, at] to be over-estimated. A similar
argument can be made about random inaccuracies in the output of the
network q[st, at, ϕ] or random initializations of the q-function.
The underlying problem is that the same network both selects the target
(by the maximization operation) and updates the value. Double Q-Learning
tackles this problem by training two models q1[st, at, π1] and q2[st, at, π2]
simultaneously:
q1[st, at] ← q1[st, at] + α · ( r[st, at] + γ · q2[st+1, argmaxa q1[st+1, a]] − q1[st, at] )
q2[st, at] ← q2[st, at] + α · ( r[st, at] + γ · q1[st+1, argmaxa q2[st+1, a]] − q2[st, at] ).    (19.20)
Now the choice of the target and the target itself are decoupled, which
helps prevent these biases. In practice, new tuples <s, a, r, s′ > are randomly
assigned to update one model or another. This is known as double Q-
learning. Double deep Q-networks or double DQNs use deep networks q[st,
at, ϕ1] and q[st, at, ϕ2] to estimate the action values, and the update
becomes:
ϕ1 ← ϕ1 + α · ( r[st, at] + γ · q[st+1, argmaxa q[st+1, a, ϕ1], ϕ2] − q[st, at, ϕ1] ) · ∂q[st, at, ϕ1]/∂ϕ1.    (19.21)

19.5 Policy gradient methods
Policy gradient methods parameterize the policy π[at|st, θ] directly with a network with parameters θ. Running this policy in the environment produces trajectories:

τ = [s1, a1, s2, a2, …, sT, aT].    (19.22)
Policy gradient algorithms aim to maximize the expected return r[τ] over
many such trajectories:
θ̂ = argmaxθ [ 𝔼τ[ r[τ] ] ] = argmaxθ [ ∫ Pr(τ|θ) · r[τ] dτ ],    (19.23)
where the return is the sum of all the rewards received along the trajectory.
To maximize this quantity, we use the gradient ascent update:
θ ← θ + α · ∂/∂θ [ ∫ Pr(τ|θ) · r[τ] dτ ].    (19.24)

Using the identity ∂Pr(τ|θ)/∂θ = Pr(τ|θ) · ∂log[Pr(τ|θ)]/∂θ, the gradient can be written as an expectation over trajectories:

∂/∂θ ∫ Pr(τ|θ) · r[τ] dτ = ∫ Pr(τ|θ) · ∂log[Pr(τ|θ)]/∂θ · r[τ] dτ = 𝔼τ[ ∂log[Pr(τ|θ)]/∂θ · r[τ] ],    (19.25–19.26)

which can be approximated with a Monte Carlo estimate over I sampled trajectories τi:

θ ← θ + α · Σi=1…I ∂log[Pr(τi|θ)]/∂θ · r[τi].    (19.27)
The probability of a trajectory decomposes into terms involving the unknown state evolution and terms involving the policy:

Pr(τ|θ) = Pr(s1) · ∏t=1…T π[at|st, θ] · Pr(st+1|st, at),    (19.28)
and noting that only the center term depends on θ, we can rewrite the
update from equation 19.27 as:
θ ← θ + α · Σi=1…I Σt=1…T ∂log[π[ait|sit, θ]]/∂θ · r[τi],    (19.29)
where sit is the state at time t in episode i, and ait is the action taken at time t
in episode i. Note that since the terms relating to the state evolution
Pr(st+1|st, at) disappear, this parameter update does not assume a Markov
time evolution process.
We can further simplify this by noting that:
r[τi] = Σk=1…t−1 rik + Σk=t…T rik,    (19.30)
where rit is the reward at time t in the ith episode. The first term (the
rewards before time t) does not affect the update from time t, so we can
write:
θ ← θ + α · Σi=1…I Σt=1…T ∂log[π[ait|sit, θ]]/∂θ · Σk=t…T rik,    (19.31–19.32)

where the inner sum Σk=t…T rik is the return from time t onward in episode i,
and then we update the parameters for each time step t in each trajectory:
θ ← θ + α · ∂log[π[ait|sit, θ]]/∂θ · Σk=t…T rik.    (19.33)
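A minimal PyTorch sketch of this per-episode REINFORCE update; the policy network and episode data are illustrative placeholders.

    import torch
    import torch.nn as nn

    # REINFORCE update (equation 19.33) for one toy episode.
    policy = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))
    opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

    def reinforce_update(states, actions, rewards):
        """states: [T,4], actions: [T], rewards: [T] for one episode."""
        # Reward-to-go: sum of rewards from each time step to the episode end.
        returns = torch.flip(torch.cumsum(torch.flip(rewards, [0]), 0), [0])
        log_probs = torch.log_softmax(policy(states), dim=1)
        chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
        loss = -(chosen * returns).mean()     # gradient ascent on expected return
        opt.zero_grad(); loss.backward(); opt.step()
        return loss.item()

    T = 10
    print(reinforce_update(torch.randn(T, 4), torch.randint(2, (T,)), torch.ones(T)))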
19.5.3 Baselines
Policy gradient methods have the drawback that they exhibit high variance;
many episodes may be needed to get stable updates of the derivatives. One
way to reduce this variance is to subtract a baseline b from the trajectory returns r[τ]:

θ ← θ + α · Σi=1…I Σt=1…T ∂log[π[ait|sit, θ]]/∂θ · ( r[τi] − b ).    (19.34)

As long as the baseline does not depend on the actions, the subtracted term has expectation zero:

𝔼[ ∂log[Pr(τ|θ)]/∂θ · b ] = b · ∂/∂θ [ ∫ Pr(τ|θ) dτ ] = 0,    (19.35)
and the expected value will not change. However, if the baseline co-varies
with irrelevant factors that add uncertainty, then subtracting it reduces the
variance (figure 19.16). This is a special case of the method of control
variates (see problem 19.7).
Notebook 19.5
Control variates
Problem 19.7
Figure 19.16 Decreasing variance of estimates using control variates. a) Consider
trying to estimate 𝔼[a] from a small number of samples. The estimate (the mean of
the samples) will vary based on the number of samples and the variance of those
samples. b) Now consider observing another variable b that co-varies with a and has
𝔼[b] = 0 and the same variance as a. c) The variance of the samples of a − b is much
less than that of a, but the expected value 𝔼[a − b] = 𝔼[a], so we get an estimator
with lower variance.
This raises the question of how we should choose b. We can find the
value of b that minimizes the variance by writing an expression for the
variance, taking the derivative with respect to b, setting the result to zero,
and solving to yield:
b = 𝔼[ (∂log[Pr(τ|θ)]/∂θ)² · r[τ] ] / 𝔼[ (∂log[Pr(τ|θ)]/∂θ)² ],    (19.36)
Problem 19.8
which in practice is approximated with a Monte Carlo estimate over the observed trajectories:

b ≈ Σi (∂log[Pr(τi|θ)]/∂θ)² · r[τi] / Σi (∂log[Pr(τi|θ)]/∂θ)².    (19.37)
Subtracting this baseline factors out variance that might occur when the
returns r[τi] from all trajectories are greater than is typical but only because
they happen to pass through states with higher than average returns
whatever actions are taken.
19.5.4 State-dependent baselines
A better option is to use a baseline b[sit] that depends on the current state sit.
θ ← θ + α · Σi=1…I Σt=1…T ∂log[π[ait|sit, θ]]/∂θ · ( Σk=t…T rik − b[sit] ).    (19.38)

A natural choice is the state value:

b[sit] = v[sit, ϕ],    (19.39)

since this measures the return we would expect from state sit under the current policy; the bracketed term then estimates the advantage of the observed action over the average. The return from time t can itself be approximated by bootstrapping with the observed reward and the value of the next state:

Σk=t…T rik ≈ rit + γ · v[si,t+1, ϕ].    (19.40)
Here the value v[si,t+1, ϕ] is estimated by a second neural network with
parameters ϕ.
Substituting this into equation 19.38 gives the update:
θ ← θ + α · Σi=1…I Σt=1…T ∂log[π[ait|sit, θ]]/∂θ · ( rit + γ · v[si,t+1, ϕ] − v[sit, ϕ] ).    (19.41)

The value network parameters ϕ are trained simultaneously by minimizing the squared bootstrap error:

L[ϕ] = Σi Σt ( rit + γ · v[si,t+1, ϕ] − v[sit, ϕ] )².    (19.42)
The policy network π[st, θ] that predicts Pr(a|st) is termed the actor. The
value network v[st, ϕ] is termed the critic. Often the same network represents both the actor and the critic, with two sets of outputs that predict the
policy and the values, respectively. Note that although actor-critic methods
can update the policy parameters at each step, this is rarely done in practice.
The agent typically collects a batch of experience over many time steps
before the policy is updated.
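A hedged PyTorch sketch of a combined actor-critic network and its update (equations 19.41–19.42); the shared-body design and all sizes are illustrative.

    import torch
    import torch.nn as nn

    # One actor-critic update over a batch of transitions.
    class ActorCritic(nn.Module):
        def __init__(self, n_in=4, n_actions=2):
            super().__init__()
            self.body = nn.Sequential(nn.Linear(n_in, 32), nn.ReLU())
            self.policy_head = nn.Linear(32, n_actions)   # actor
            self.value_head = nn.Linear(32, 1)            # critic

        def forward(self, s):
            h = self.body(s)
            return torch.log_softmax(self.policy_head(h), dim=1), self.value_head(h).squeeze(1)

    net, gamma = ActorCritic(), 0.9
    opt = torch.optim.Adam(net.parameters(), lr=1e-3)

    def actor_critic_update(s, a, r, s_next):
        log_pi, v = net(s)
        with torch.no_grad():
            _, v_next = net(s_next)
        td_error = r + gamma * v_next - v                  # bootstrapped advantage
        actor_loss = -(log_pi.gather(1, a.unsqueeze(1)).squeeze(1) * td_error.detach()).mean()
        critic_loss = (td_error ** 2).mean()
        loss = actor_loss + critic_loss
        opt.zero_grad(); loss.backward(); opt.step()
        return loss.item()

    print(actor_critic_update(torch.randn(8, 4), torch.randint(2, (8,)), torch.ones(8), torch.randn(8, 4)))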
19.8 Summary
Reinforcement learning is a sequential decision-making framework for
Markov decision processes and similar systems. This chapter reviewed
tabular approaches to RL, including dynamic programming (in which the
environment model is known), Monte Carlo methods (in which multiple
episodes are run and the action values and policy subsequently changed
based on the rewards received), and temporal difference methods (in which
these values are updated while the episode is ongoing).
Deep Q-Learning is a temporal difference method where deep neural
networks are used to predict the action value for every state. It can train
agents to perform Atari 2600 games at a level similar to humans. Policy
gradient methods directly optimize the policy rather than assigning values
to actions. They produce stochastic policies, which are important when the
environment is partially observable. The updates are noisy, and many
refinements have been introduced to reduce their variance.
Offline reinforcement learning is used when we cannot interact with the
environment but must learn from historical data. The decision transformer
leverages recent progress in deep learning to build a model of the state-
action-reward sequence and predict the actions that will maximize the
rewards.
Notes
Sutton & Barto (2018) cover tabular reinforcement learning methods in
depth. Li (2017), Arulkumaran et al. (2017), François-Lavet et al. (2018),
and Wang et al. (2022c) all provide overviews of deep reinforcement
learning. Graesser & Keng (2019) is an excellent introductory resource that
includes Python code.
Landmarks in deep reinforcement learning: Most landmark
achievements of reinforcement learning have been in either video games or
real-world games since these provide constrained environments with
limited actions and fixed rules. Deep Q-Learning (Mnih et al., 2015)
achieved human-level performance across a benchmark of ATARI games.
AlphaGo (Silver et al., 2016) beat the world champion at Go. This game
was previously considered very difficult for computers to play. Berner et al.
(2019) built a system that beat the world champion team in the five vs. five-
player game Defense of the Ancients 2, which requires cooperation across
players. Ye et al. (2021) built a system that could beat humans on Atari
games with limited data (in contrast to previous systems, which need much
more experience than humans). More recently, the Cicero system
demonstrated human-level performance in the game Diplomacy which
requires natural language negotiations and coordination between players
(FAIR, 2022).
RL has also been applied successfully to combinatorial optimization
problems (see Mazyavkina et al., 2021). For example, Kool et al. (2019)
learned a model that performed similarly to the best heuristics for the
traveling salesman problem. Recently, AlphaTensor (Fawzi et al., 2022)
treated matrix multiplication as a game and learned faster ways to multiply
matrices using fewer multiplication operations. Since deep learning relies
heavily on matrix multiplication, this is one of the first examples of self-
improvement in AI.
Classical reinforcement learning methods: Very early contributions to
the theory of MDPs were made by Thompson (1933) and Thompson
(1935). The Bellman recursions were introduced by Bellman (1966).
Howard (1960) introduced policy iteration. Sutton & Barto (2018) identify
the work of Andreae (1969) as being the first to describe RL using the MDP
formalism.
The modern era of reinforcement learning arguably originated in the Ph.D.
theses of Sutton (1984) and Watkins (1989). Sutton (1988) introduced the
term temporal difference learning. Watkins (1989) and Watkins & Dayan
(1992) introduced Q-Learning and showed that it converges to a fixed point
by Banach's theorem because the Bellman operator is a contraction
mapping. Watkins (1989) made the first explicit connection between
dynamic programming and reinforcement learning. SARSA was developed
by Rummery & Niranjan (1994). Gordon (1995) introduced fitted Q-
learning in which a machine learning model is used to predict the action
value for each state-action pair. Riedmiller (2005) introduced neural-fitted
Q-learning, which used a neural network to predict all the action values at
once from a state. Early work on Monte Carlo methods was carried out by
Singh & Sutton (1996), and the exploring starts algorithm was introduced
by Sutton & Barto (1999). Note that this is an extremely cursory summary
of more than fifty years of work. A much more thorough treatment can be
found in Sutton & Barto (2018).
Deep Q-Networks: Deep Q-Learning was devised by Mnih et al. (2015)
and is an intellectual descendent of neural-fitted Q-learning. It exploited the
then-recent successes of convolutional networks to develop a fitted Q-
Learning method that could achieve human-level performance on a
benchmark of ATARI games. Deep Q-Learning suffers from the deadly
triad issue (Sutton & Barto, 2018): training can be unstable in any scheme
that incorporates (i) bootstrapping, (ii) off-policy learning, and (iii) function
approximation. Much subsequent work has aimed to make training more
stable. Mnih et al. (2015) introduced the experience replay buffer (Lin,
1992), which was subsequently improved by Schaul et al. (2016) to favor
more important tuples and hence increase learning speed. This is termed
prioritized experience replay.
The original deep Q-learning paper concatenated four frames so the network
could observe the velocities of objects and make the underlying process
closer to fully observable. Hausknecht & Stone (2015) introduced deep
recurrent Q-learning, which used a recurrent network architecture that only
ingested a single frame at a time because it could “remember” the previous
states. Van Hasselt (2010) identified the systematic overestimation of the
state values due to the max operation and proposed double Q-Learning in
which two models are trained simultaneously to remedy this. This was
subsequently applied in the context of deep Q-learning (Van Hasselt et al.,
2016), although its efficacy has since been questioned (Hessel et al., 2018).
Wang et al. (2016) introduced deep dueling networks in which two heads of
the same network predict (i) the state value and (ii) the advantage (relative
value) of each action. The intuition here is that sometimes it is the state
value that is important, and it doesn't matter much which action is taken,
and decoupling these estimates improves stability.
Fortunato et al. (2018) introduced noisy deep Q-Networks, in which some
weights in the Q-Network are multiplied by noise to add stochasticity to the
predictions and encourage exploration. The network can learn to decrease
the magnitudes of the noise over time as it converges to a sensible policy.
Distributional DQN (Bellemare et al., 2017a; Dabney et al., 2018 following
Morimura et al., 2010) aims to estimate more complete information about
the distribution of returns than just the expectation. This potentially allows
the network to mitigate against worst-case outcomes and can also improve
performance, as predicting higher moments provides a richer training
signal. Rainbow (Hessel et al., 2018) combined six improvements to the
original deep Q-learning algorithm, including dueling networks,
distributional DQN, and noisy DQN, to improve both the training speed and
the final performance on the ATARI benchmark.
Policy gradients: Williams (1992) introduced the REINFORCE
algorithm. The term “policy gradient method” dates to Sutton et al. (1999).
Konda & Tsitsiklis (1999) introduced the actor-critic algorithm. Decreasing
the variance by using different baselines is discussed in Greensmith et al.
(2004) and Peters & Schaal (2008). It has since been argued that the value
baseline primarily reduces the aggressiveness of the updates rather than
their variance (Mei et al., 2022).
Policy gradients have been adapted to produce deterministic policies (Silver
et al., 2014; Lillicrap et al., 2016; Fujimoto et al., 2018). The most direct
approach is to maximize over the possible actions, but if the action space is
continuous, this requires an optimization procedure at each step. The deep
deterministic policy gradient algorithm (Lillicrap et al., 2016) moves the
policy in the direction of the gradient of the action value (implying the use
of an actor-critic method).
Modern policy gradients: We introduced policy gradients in terms of the
parameter update. However, they can also be viewed as optimizing a
surrogate loss based on importance sampling of the expected rewards, using
trajectories from the current policy parameters. This view allows us to take
multiple optimization steps validly. However, this can cause very large
policy updates. Overstepping is a minor problem in supervised learning, as
the trajectory can be corrected later. However, in RL, it affects future data
collection and can be extremely destructive.
Several methods have been proposed to moderate these updates. Natural
policy gradients (Kakade, 2001) are based on natural gradients (Amari,
1998), which modify the descent direction by the Fisher information matrix.
This provides a better update which is less likely to get stuck in local
plateaus. However, the Fisher matrix is impractical to compute in models
with many parameters. In trust-region policy optimization or TRPO
(Schulman et al., 2015), the surrogate objective is maximized subject to a
constraint on the KL divergence between the old and new policies.
Schulman et al. (2017) propose a simpler formulation in which this KL
divergence appears as a regularization term. The regularization weight is
adapted based on the distance between the KL divergence and a target
indicating how much we want the policy to change. Proximal policy
optimization or PPO (Schulman et al., 2017) is an even simpler approach in
which the loss is clipped to ensure smaller updates.
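A sketch of the clipped surrogate loss (PyTorch-style; sign conventions and the value of ε vary between implementations):

```python
import torch

def ppo_clipped_loss(log_prob_new, log_prob_old, advantage, epsilon=0.2):
    # Importance ratio between new and old policies for the sampled actions
    ratio = torch.exp(log_prob_new - log_prob_old)
    # Clipping removes any incentive to move the ratio outside [1-eps, 1+eps]
    clipped = torch.clamp(ratio, 1 - epsilon, 1 + epsilon)
    return -torch.minimum(ratio * advantage, clipped * advantage).mean()
```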
Actor-critic: In the actor-critic algorithm (Konda & Tsitsiklis, 1999)
described in section 19.6, the critic used a 1-step estimator. It's also possible
to use k-step estimators (in which we observe k discounted rewards and
approximate subsequent rewards with an estimate of the state value). As k
increases, the variance of the estimate increases, but the bias decreases.
Generalized advantage estimation (Schulman et al., 2016) weights together
estimates from many steps and parameterizes the weighting by a single term
that trades off the bias and the variance. Mnih et al. (2016) introduced
asynchronous actor-critic or A3C in which multiple agents are run
independently in parallel environments and update the same parameters.
Both the policy and value function are updated every T time steps using a
mix of k-step returns. Wang et al. (2017) introduced several methods
designed to make asynchronous actor-critic more efficient. Soft actor-critic
(Haarnoja et al., 2018b) adds an entropy term to the cost function, which
encourages exploration and reduces overfitting as the policy is encouraged
to be less confident.
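The following sketch computes generalized advantage estimates for a single trajectory (NumPy; it assumes that values contains one extra entry for the state after the final reward and, for simplicity, ignores episode termination):

```python
import numpy as np

def generalized_advantage_estimate(rewards, values, gamma=0.99, lam=0.95):
    # lam = 0 recovers the 1-step estimator; lam = 1 recovers the full return.
    advantages = np.zeros(len(rewards))
    gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error
        gae = delta + gamma * lam * gae  # exponentially weighted k-step mix
        advantages[t] = gae
    return advantages
```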
Offline RL: In offline reinforcement learning, the policy is learned by
observing the behavior of other agents, including the rewards they receive,
without the ability to change the policy. It is related to imitation learning,
where the goal is to copy the behavior of another agent without access to
rewards (see Hussein et al., 2017). One approach is to treat offline RL in the
same way as off-policy reinforcement learning. However, in practice, the
distributional shift between the observed and applied policy manifests in
overly optimistic estimates of the action value and poor performance (see
Fujimoto et al., 2019; Kumar et al., 2019a; Agarwal et al., 2020).
Conservative Q-learning (Kumar et al., 2020b) learns conservative, lower-
bound estimates of the value function by regularizing the Q-values. The
decision transformer (Chen et al., 2021c) is a simple approach to offline
learning that takes advantage of the well-studied self-attention architecture.
It can subsequently be fine-tuned with online training (Zheng et al., 2022).
Reinforcement learning and chatbots: Chatbots can be trained using a
technique known as reinforcement learning with human feedback or RLHF
(Christiano et al., 2018; Stiennon et al., 2020). For example, InstructGPT
(the forerunner of ChatGPT, Ouyang et al., 2022) starts with a standard
transformer decoder model. This is then fine-tuned based on prompt-
response pairs where the response was written by human annotators. During
this training step, the model is optimized to predict the next word in the
ground truth response.
Unfortunately, such training data are expensive to produce in sufficient
quantities to support high-quality performance. To resolve this problem,
human annotators then indicate which of several model responses they
prefer. These (much cheaper) data are used to train a reward model. This is
a second transformer network that ingests the prompt and model response
and returns a scalar indicating how good the response is. Finally, the fine-
tuned chatbot model is further trained to produce high rewards using the
reward model as supervision. Here, standard gradient descent cannot be
used as it's not possible to compute derivatives through the sampling
procedure in the chatbot output. Hence, the model is trained with proximal
policy optimization (a policy gradient method where the derivatives are
tractable) to generate higher rewards.
Other areas of RL: Reinforcement learning is an enormous area, which
easily justifies its own book, and this literature review is extremely
superficial. Other notable areas of RL that we have not discussed include
model-based RL, in which the state transition probabilities and reward
functions are modeled (see Moerland et al., 2023). This allows forward
planning and has the advantage that the same model can be reused for
different reward structures. Hybrid methods such as AlphaGo (Silver et al.,
2016) and MuZero (Schrittwieser et al., 2020) have separate models for the
dynamics of the states, the policy, and the value of future positions.
This chapter has only discussed simple methods for exploration, like the
epsilon-greedy approach, noisy Q-learning, and adding an entropy term to
penalize overconfident policies. Intrinsic motivation refers to methods that
add rewards for exploration and thus imbue the agent with “curiosity” (see
Barto, 2013; Aubret et al., 2019). Hierarchical reinforcement learning (see
Pateria et al., 2021) refers to methods that break down the final objective
into sub-tasks. Multiagent reinforcement learning (see Zhang et al., 2021a)
considers the case where multiple agents coexist in a shared environment.
This may be in either a competitive or cooperative context.
Problems
Problem 19.1 Figure 19.18 shows a single trajectory through the example
MDP. Calculate the return for each step in the trajectory given that the
discount factor γ is 0.9.
Figure 19.18 One trajectory through an MDP. The penguin receives a reward of +1
when it reaches the first fish tile, −2 when it falls in the hole, and +1 for reaching the
second fish tile. The discount factor γ is 0.9.
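Returns of this kind can be computed with a single backward pass over the rewards; the reward list below is a hypothetical placeholder, not the trajectory of figure 19.18:

```python
def discounted_returns(rewards, gamma=0.9):
    # Work backward: g[t] = rewards[t] + gamma * g[t+1]
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return returns[::-1]

# Hypothetical rewards; substitute those observed along the trajectory
print(discounted_returns([0, 0, 1, 0, -2, 0, 1]))
```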
Problem 19.2 Consider a new policy π′ that differs from the current policy π
only at time step t, where it takes the action that maximizes the action value:
(19.43)
and for all other states, the policies are the same. Show that the value v[st|π]
for the original policy must be less than or equal to v[st|π′] = q[st, π′[st]|π]
for the new policy:
(19.44)
Hint: Start by writing the term v[st+1|π] in terms of the new policy.
Problem 19.3 Show that when the state values and policy are initialized as
in figure 19.10a, they become those in figure 19.10b after two iterations of
(i) policy evaluation (in which all states are updated based on their current
values and then replace the previous ones) and (ii) policy improvement. The
state transition allots half the probability to the direction the policy indicates
and divides the remaining probability equally between the other valid
actions. The reward function returns -2 irrespective of the action when the
penguin leaves a hole. The reward function returns +3 regardless of the
action when the penguin leaves the fish tile and the episode ends, so the fish
tile has a value of +3.
Problem 19.4 The Boltzmann policy strikes a balance between exploration
and exploitation by basing the action probabilities π[a|s] on the current
state-action reward function q[s, a]:
$$\pi[a|s] = \frac{\exp\bigl[q[s,a]/\tau\bigr]}{\sum_{a'}\exp\bigl[q[s,a']/\tau\bigr]}, \tag{19.45}$$
where the temperature τ controls the trade-off: high temperatures give near-uniform
exploration, and low temperatures favor the action with the highest reward.
(19.46)
(19.47)
where ||•||∞ represents the ℓ∞ norm. It follows that a fixed point will exist by
Banach's theorem and that the updates will eventually converge.
Appendix B.3.2
Vector norms
(19.48)
and so adding a baseline update doesn't change the expected policy gradient
update.
Problem 19.7* Suppose that we want to estimate a quantity 𝔼[a] from
samples a1, a2, …, aI. Suppose that we also have paired samples b1, b2, …, bI
that co-vary with a and where 𝔼[b] = μb. We define a new
variable:
$$a' = a - c\cdot(b - \mu_b). \tag{19.49}$$
Show that Var[a′] ≤ Var[a] when the constant c is chosen judiciously. Find
an expression for the optimal value of c.
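A quick numerical check of this construction (a NumPy sketch with synthetic data; the coefficient used is the standard control-variates choice c = Cov[a, b]/Var[b]):

```python
import numpy as np

rng = np.random.default_rng(0)
b = rng.normal(0.0, 1.0, 100000)             # samples of b with known mean mu_b = 0
a = 2.0 * b + rng.normal(0.0, 1.0, 100000)   # a co-varies with b

# c = Cov[a, b] / Var[b] minimizes Var[a - c*(b - mu_b)]
c = np.cov(a, b)[0, 1] / np.var(b)
a_prime = a - c * (b - 0.0)
print(np.var(a), np.var(a_prime))  # variance drops from ~5 to ~1
```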
Problem 19.8 The estimate of the gradient in equation 19.34 can be written
as:
(19.50)
where
(19.51)
and
(19.52)
Show that the value of b that minimizes the variance of the gradient
estimate is given by:
(19.53)
1 The notation π[at|st] ← a in equations 19.6, 19.12, and 19.13 means set π[a|st] to one for action
a and π[a′|st] to zero for all other actions a′.
2 For simplicity, we will just write v[st] and q[st, at] instead of v[st|π] and q[st, at|π] from now on.
3 We also assume from now on that the rewards are deterministic and can be written as r[st, at].
4 In RL, a trajectory is an observed sequence of states, rewards, and actions. A rollout is a
simulated trajectory. An episode is a trajectory that starts in an initial state and ends in a terminal
state (e.g., a full game of chess starting from the standard opening position and ending in a win, lose,
or draw.)
OceanofPDF.com
Chapter 20
This chapter differs from those that precede it. Instead of presenting
established results, it poses questions about how and why deep learning
works so well. These questions are rarely discussed in textbooks. However,
it's important to realize that (despite the title of this book) understanding of
deep learning is still limited.
We argue that it is surprising that deep networks are easy to train and
also surprising that they generalize. Then we consider each of these topics
in turn. We enumerate the factors that influence training success and discuss
what is known about loss functions for deep networks. Then we consider
the factors that influence generalization. We conclude with a discussion of
whether networks need to be overparameterized and deep.
20.1.2 Generalization
If the efficient fitting of neural networks is startling, their generalization to
new data is dumbfounding. First, it's not obvious a priori that typical
datasets are sufficient to characterize the input/output mapping. The curse
of dimensionality implies that the training dataset is tiny compared to the
possible inputs; if each of the 40 inputs of the MNIST-1D data were
quantized into 10 possible values, there would be 10^40 possible inputs,
which is a factor of 10^35 more than the number of training examples.
Problem 20.1
20.2.1 Dataset
It's important to realize that we can't learn any function. Consider a
completely random mapping from every possible 28×28 binary image to
one of ten categories. Since there is no structure to this function, the only
recourse is to memorize the 2^784 assignments. However, it's easy to train a
model on the MNIST dataset (figures 8.10 and 15.15), which contains
60,000 examples of 28×28 images labeled with one of ten categories. One
explanation for this contradiction could be that it is easy to find global
minima because the real-world functions that we approximate are relatively
simple.1
This hypothesis was investigated by Zhang et al. (2017a), who trained
AlexNet on the CIFAR-10 image classification dataset when (i) each image
was replaced with Gaussian noise and (ii) the labels of the ten classes were
randomly permuted (figure 20.1). These changes slowed down learning, but
the network could still fit this finite dataset well. This suggests that the
properties of the dataset aren't critical.
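A sketch of the label-randomization arm of this experiment (using torchvision's CIFAR-10 loader; the training loop itself is omitted):

```python
import torch
from torchvision import datasets, transforms

# Replace every training label with a uniformly random class and check
# whether a standard classifier can still drive the training loss to zero.
train_set = datasets.CIFAR10("data", train=True, download=True,
                             transform=transforms.ToTensor())
train_set.targets = torch.randint(0, 10, (len(train_set.targets),)).tolist()
# ...train any standard classifier on train_set as usual...
```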
Notebook 20.1
Random data
Problem 20.2
Figure 20.1 Fitting random data. Losses for AlexNet architecture trained on
CIFAR-10 dataset with SGD. When the pixels are drawn from a Gaussian random
distribution with the same mean and variance as the original data, the model can still
be fit (albeit more slowly). When the labels are randomized, the model can still be fit
(albeit even more slowly). Adapted from Zhang et al. (2017a).
20.2.2 Regularization
Another possible explanation for the ease with which models are trained is
that regularization makes the loss surface flatter and more convex.
However, Zhang et al. (2017a) found that neither explicit regularization nor
Dropout was required to fit random data. This does not eliminate implicit
regularization due to the finite step size of the fitting algorithms (section
9.2). However, this effect increases with the learning rate (equation 9.9),
and model-fitting does not get easier with larger learning rates.
Problem 20.3
Figure 20.2 MNIST-1D training. Four fully connected networks were fit to 4000
MNIST-1D examples with random labels using full batch gradient descent, He
initialization, no momentum or regularization, and learning rate 0.0025. Models with
1, 2, 3, and 4 layers had 298, 100, 75, and 63 hidden units per layer and 15,208, 15,210,
15,235, and 15,139 parameters, respectively. All models train successfully, but deeper
models require fewer epochs.
20.2.4 Overparameterization
Overparameterization is almost certainly an important factor that
contributes to ease of training. It implies that there is a large family of
degenerate solutions, so there may always be a direction in which the
parameters can be modified to decrease the loss. Sejnowski (2020) suggests
that “… the degeneracy of solutions changes the nature of the problem from
finding a needle in a haystack to a haystack of needles.”
In practice, networks are frequently overparameterized by one or two
orders of magnitude (figure 20.3). However, data augmentation makes it
difficult to make precise statements. Augmentation may increase the data by
several orders of magnitude, but these are manipulations of existing
examples rather than independent new data points. Moreover, figure 8.10
shows that neural networks can sometimes fit the training data well when
there are the same number or fewer parameters than data points. This is
presumably due to redundancy in training examples from the same
underlying function.
20.2.6 Initialization
Another potential explanation is that Xavier/He initialization sets the
parameters to values that are easy to optimize. Of course, for deeper
networks, such initialization is necessary to avoid exploding and vanishing
gradients, so in a trivial sense, initialization is critical to training success.
However, for shallower networks, the initial variance of the weights is less
important. Liu et al. (2023c) trained a 3-layer fully connected network with
200 hidden units per layer on 1000 MNIST data points. They found that
more iterations were required to fit the training data as the variance
increased from that proposed by He (figure 20.4), but this did not ultimately
impede fitting. Hence, initialization doesn't shed much light on why fitting
neural networks is easy, although exploding/vanishing gradients do reveal
initializations that make training difficult with finite precision arithmetic.
Figure 20.4 Initialization and fitting. A three-layer fully connected network with
200 hidden units per layer was trained on 1000 MNIST examples with AdamW using
one-hot targets and mean-squared error loss. It takes longer to fit networks when
larger multiples of He initialization are used, but this doesn't change the outcome.
This may simply reflect the extra distance that the weights must move. Adapted from
Liu et al. (2023c).
Figure 20.5 Linear slices through loss function. a) A two-layer fully connected
ReLU network is trained on MNIST. The loss along a straight line starting at the
initial parameters (δ=0) and finishing at the trained parameters (δ=1) descends
monotonically. b) However, in this two-layer fully connected MaxOut network on
MNIST, there is an increase in the loss along a straight line between one solution
(δ=0) and another (δ=1). Adapted from Goodfellow et al. (2015b).
Li & Liang (2018) show that the relative change in the parameters during
training decreases as network width increases; for larger widths, the
parameters start at smaller values, change by a smaller proportion of those
values, and converge in fewer steps.
Figure 20.10 Batch size to learning rate ratio. Generalization of two models on the
CIFAR-10 database depends on the ratio of batch size to the learning rate. As the
batch size increases, generalization decreases. As the learning rate increases,
generalization increases. Adapted from He et al. (2019).
These observations are aligned with the discovery that SGD implicitly
adds regularization terms to the loss function (section 9.2), whose
magnitude depends on the learning rate. The trajectory of the parameters is
changed by this regularization, and they converge to a part of the loss
function that generalizes well.
Flatness can be measured by (i) the size of the connected region around
the minimum for which training loss is similar (Hochreiter & Schmidhuber,
1997a), (ii) the second-order curvature around the minimum (Chaudhari et
al., 2019), or (iii) the maximum loss within a neighborhood of the minimum
(Keskar et al., 2017). However, caution is required; estimated flatness can
be affected by trivial reparameterizations of the network due to the non-
negative homogeneity property of the ReLU function (Dinh et al., 2017).
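As an illustration, criterion (iii) can be approximated crudely by random perturbation (a PyTorch sketch; the radius, trial count, and use of a single batch are all simplifications):

```python
import torch

def sharpness_estimate(model, loss_fn, batch, radius=0.01, n_trials=10):
    # Estimate flatness criterion (iii): the largest loss found within a
    # ball of the given radius around the current parameters.
    x, y = batch
    originals = [p.detach().clone() for p in model.parameters()]
    worst = -float("inf")
    with torch.no_grad():
        for _ in range(n_trials):
            for p, p0 in zip(model.parameters(), originals):
                p.copy_(p0 + radius * torch.randn_like(p0))
            worst = max(worst, loss_fn(model(x), y).item())
        for p, p0 in zip(model.parameters(), originals):  # restore weights
            p.copy_(p0)
    return worst
```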
Nonetheless, Keskar et al. (2017) varied the batch size and learning rate
and showed that flatness correlates with generalization. Izmailov et al.
(2018) average together weights from multiple points in a learning
trajectory. This results in both flatter test and training loss surfaces at the
minimum and improved generalization. Other regularization techniques can
also be viewed through this lens. For example, averaging model outputs
(ensembling) may also make the test loss surface flatter. Kleinberg et al.
(2018) showed that large gradient variance during training helps avoid
sharp regions. This may explain why reducing the batch size and adding
noise helps generalization.
The above studies consider flatness for a single model and training set.
However, sharpness is not a good criterion to predict generalization
between datasets; when the labels in the CIFAR dataset are randomized
(making generalization impossible), there is no commensurate decrease in
the flatness of the minimum (Neyshabur et al., 2017).
20.4.3 Architecture
The inductive bias of a network is determined by its architecture, and
judicious choices of model can drastically improve generalization. Chapter
10 introduced convolutional networks, which are designed to process data
on regular grids; they implicitly assume that the input statistics are the same
across the input, so they share parameters across position. Similarly,
transformers are suited for modeling data that is invariant to permutations,
and graph neural networks are suited to data represented on irregular
graphs. Matching the architecture to the properties of the data improves
generalization over generic, fully connected architectures (see figure 10.8).
This finding was used by Liu et al. (2023c) to explain the phenomenon
of grokking (Power et al., 2022), in which a sudden improvement in
generalization can occur many epochs after the training error is already zero
(figure 20.13). It is proposed that grokking occurs when the norm of the
weights is initially too large; the training data fits well, but the variation of
the model between the data points is large. Over time, implicit or explicit
regularization decreases the norm of the weights until they reach the
Goldilocks zone, and generalization suddenly improves.
Figure 20.13 Grokking. When the parameters are initialized so that their ℓ2 norm
(radius) is considerably larger than is specified by He initialization, training takes
longer (dashed lines), and generalization takes much longer (solid lines). The lag in
generalization is attributed to the time taken for the norm of the weights to decrease
back to the Goldilocks zone. Adapted from Liu et al. (2023c).
20.4.5 Overparameterization
Figure 8.10 showed that generalization performance tends to improve with
the degree of overparameterization. When combined with the bias/variance
trade-off curve, this results in double descent. The putative explanation for
this improvement is that the network has more latitude to become smoother
between the training data points when the model is overparameterized.
It follows that the norm of the weights can also be used to explain double
descent. The norm of the weights increases when the number of parameters
is similar to the number of data points (as the model contorts itself to fit
these points exactly), causing generalization to degrade. As the network
becomes wider and the number of weights increases, the overall norm of
these weights decreases; the weights are initialized with a variance that is
inversely proportional to the width (i.e., with He or Glorot initialization),
and the weights change very little from their original values.
Figure 20.14 Adversarial examples. In each case, the left image is correctly
classified by AlexNet. By considering the gradients of the network output with
respect to the input, it's possible to find a small perturbation (center, magnified by 10
for visibility) that, when added to the original image (right), causes the network to
misclassify it as an ostrich. This is despite the fact that the original and perturbed
images are almost indistinguishable to humans. Adapted from Szegedy et al. (2014).
The conclusion is that there are positions that are close to but not on the
data manifold that are misclassified. These are known as adversarial
examples. Their existence is surprising; how can such a small change to the
network input make such a drastic change to the output? The best current
explanation is that adversarial examples aren't due to a lack of robustness to
data from outside the training data manifold. Instead, they are exploiting a
source of information that is in the training distribution but which has a
small norm and is imperceptible to humans (Ilyas et al., 2019).
20.5.1 Pruning
Pruning trained models reduces their size and hence storage requirements
(figure 20.15). The simplest approach is to remove individual weights. This
can be done based on the second derivatives of the loss function (LeCun et
al., 1990; Hassibi & Stork, 1993) or (more practically) based on the
absolute value of the weight (Han et al., 2016, 2015). Other work prunes
hidden units (Zhou et al., 2016a; Alvarez & Salzmann, 2016), channels in
convolutional networks (Li et al., 2017a; Luo et al., 2017b; He et al., 2017;
Liu et al., 2019a), or entire layers in residual nets (Huang & Wang, 2018).
Often, the network is fine-tuned after pruning, and sometimes this process
is repeated.
Figure 20.15 Pruning neural networks. The goal is to remove as many weights as
possible without decreasing performance. This is often done just based on the
magnitude of the weights. Typically, the network is fine-tuned after pruning. a)
Example fully connected network. b) After pruning.
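A minimal sketch of global magnitude pruning (PyTorch; real pipelines typically prune layer-by-layer, keep explicit masks, and fine-tune afterward):

```python
import torch

def magnitude_prune(model, fraction=0.9):
    # Zero out the given fraction of weights with the smallest absolute
    # values, pooled across all weight matrices (biases are left untouched).
    weights = [p for name, p in model.named_parameters() if "weight" in name]
    all_vals = torch.cat([w.detach().abs().flatten() for w in weights])
    threshold = torch.quantile(all_vals, fraction)
    with torch.no_grad():
        for w in weights:
            w.mul_((w.abs() >= threshold).float())
```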
In general, the larger the model, the greater the proportion of weights that can
be pruned without significantly damaging performance. For example, Han
et al. (2016) maintained good performance for the VGG network on
ImageNet classification when only 8% of the weights were retained. This
significantly decreases the model size but isn't enough to show that
overparameterization is not required; the VGG network has ~100 times as
many parameters as there are training examples in ImageNet (disregarding
augmentation).
Pruning is a form of architecture search. In their work on lottery tickets
(see section 20.2.7), Frankle & Carbin (2019) (i) trained a network, (ii)
pruned the weights with the smallest magnitudes, and (iii) retrained the
remaining network from the same initial weights. By iterating this
procedure, they reduced the size of the VGG-19 network (originally 138
million parameters) by 98.5% on the CIFAR-10 database (60,000 examples)
while maintaining good performance. For ResNet-50 (25.6 million
parameters), they reduced the parameters by 80% without reducing the
performance on ImageNet (1.28 million examples). These demonstrations
are impressive but (disregarding data augmentation) these networks are still
over-parameterized after pruning.
20.7 Summary
This chapter has made the case that the success of deep learning is
surprising. We discussed the challenges of optimizing high-dimensional
loss functions and argued that overparameterization and the choice of
activation function are the two most important factors that make this
tractable in deep networks. We saw that, during training, the parameters
move through a low-dimensional subspace to one of a family of connected
global minima and that local minima are not apparent.
Generalization of neural networks also improves with
overparameterization, although other factors, such as the flatness of the
minimum and the inductive bias of the architecture, are also important. It
appears that both a large number of parameters and multiple network layers
are required for good generalization, although we do not yet know why.
Many questions remain unanswered. We do not currently have any
prescriptive theory that will allow us to predict the circumstances in which
training and generalization will succeed or fail. We do not know the limits
of learning in deep networks or whether much more efficient models are
possible. We do not know if there are parameters that would generalize
better within the same model. The study of deep learning is still driven by
empirical demonstrations. These are undeniably impressive, but they are not
yet matched by our understanding of deep learning mechanisms.
Problems
Problem 20.1 Consider the ImageNet image classification task in which the
input images contain 224×224×3 RGB values. Consider coarsely quantizing
these inputs into ten bins per RGB value and training with ~10^7 training
examples. How many possible inputs are there per training data point?
Problem 20.2 Consider figure 20.1. Why do you think that the algorithm
fits the data faster when the pixels are randomized relative to when the
labels are randomized?
Problem 20.3 Figure 20.2 shows a non-stochastic fitting process with a
fixed learning rate successfully fitting random data. Does this imply that the
loss function has no local minima? Does this imply that the function is
convex? Justify your answer and give a counter-example if you think either
statement is false.
1 In this chapter, we use the term “global minimum” loosely to mean any solution where all data
are classified correctly. We have no way of knowing if there are solutions with a lower loss
elsewhere.
OceanofPDF.com
Chapter 21
This chapter was written by Travis LaCroix and Simon J.D. Prince.
AI is poised to change society for better or worse. These technologies
have enormous potential for social good (Taddeo & Floridi, 2018; Tomašev
et al., 2020), including important roles in healthcare (Rajpurkar et al., 2022)
and the fight against climate change (Rolnick et al., 2023). However, they
also have the potential for misuse and unintended harm. This has led to the
emergence of the field of AI ethics.
The modern era of deep learning started in 2012 with AlexNet, but
sustained interest in AI ethics did not follow immediately. Indeed, a
workshop on fairness in machine learning was rejected from NeurIPS 2013
for want of material. It wasn't until 2016 that AI Ethics had its “AlexNet”
moment, with ProPublica's exposé on bias in the COMPAS recidivism-
prediction model (Angwin et al., 2016) and Cathy O’Neil's book Weapons
of Math Destruction (O’Neil, 2016). Interest has swelled ever since;
submissions to the Conference on Fairness, Accountability, and
Transparency (FAccT) have increased nearly ten-fold in the five years since
its inception in 2018.
In parallel, many organizations have proposed policy recommendations
for responsible AI. Jobin et al. (2019) found 84 documents containing AI
ethics principles, with 88% released since 2016. This proliferation of non-
legislative policy agreements, which depend on voluntary, non-binding
cooperation, calls into question their efficacy (McNamara et al., 2018;
Hagendorff, 2020; LaCroix & Mohseni, 2022). In short, AI Ethics is in its
infancy, and ethical considerations are often reactive rather than proactive.
This chapter considers potential harms arising from the design and use of
AI systems. These include algorithmic bias, lack of explainability, data
privacy violations, militarization, fraud, and environmental concerns. The
aim is not to provide advice on being more ethical. Instead, the goal is to
express ideas and start conversations in key areas that have received
attention in philosophy, political science, and the broader social sciences.
In a machine learning model, the loss function is a proxy for our true
objectives, and a misalignment between the two is termed the outer
alignment problem (Hubinger et al., 2019). To the extent that this proxy is
inadequate, there will be “loopholes” that the system can exploit to
minimize its loss function while failing to satisfy the intended objective.
For example, consider training an RL agent to play chess. If the agent is
rewarded for capturing pieces, this may result in many drawn games rather
than the desired behavior (to win the game). In contrast, the inner alignment
problem is to ensure that the behavior of an AI system does not diverge
from the intended objectives even when the loss function is well specified.
If the learning algorithm fails to find the global minimum or the training
data are unrepresentative, training can converge to a solution that is
misaligned with the true objective, resulting in undesirable behavior
(Goldberg, 1987; Mitchell et al., 1992; Lehman & Stanley, 2008).
Problem 21.2
Gabriel (2020) divides the value alignment problem into technical and
normative components. The technical component concerns how we encode
values into the models so that they reliably do what they should. Some
concrete problems, such as avoiding reward hacking and safe exploration,
may have purely technical solutions (Amodei et al., 2016). In contrast, the
normative component concerns what the correct values are in the first place.
There may be no single answer to this question, given the range of things
that different cultures and societies value. It's important that the encoded
values are representative of everyone and not just culturally dominant
subsets of society.
Another way to think about value alignment is as a structural problem
that arises when a human principal delegates tasks to an artificial agent
(LaCroix, 2022). This is similar to the principal-agent problem in
economics (Laffont & Martimort, 2002), which recognizes that there are
competing incentives inherent in any relationship where one party is
expected to act in another's best interests. In the AI context, such conflicts
of interest can arise when either (i) the objectives are misspecified or (ii)
there is an informational asymmetry between the principal and the agent
(figure 21.1).
Problem 21.3
Figure 21.2 Bias mitigation. Methods have been proposed to compensate for bias
at all stages of the training pipeline, from data collection to post-processing of
already trained models. See Barocas et al. (2023) and Mehrabi et al. (2022).
Figure 21.3 LIME. Output functions of deep networks are complex; in high
dimensions, it's hard to know why a decision was made or how to modify the inputs
to change it without access to the model. a) Consider trying to understand why Pr(y =
1|x) is low at the white cross. LIME probes the network at nearby points to see if it
identifies these as Pr(y = 1|x) < 0.5 (cyan points) or Pr(y = 1|x) ≥ 0.5 (gray points). It
weights these points by proximity to the point of interest (weight indicated by circle
size). b) The weighted points are used to train a simpler model (here, logistic
regression — a linear function passed through a sigmoid). c) Near the white cross,
this approximation is close to d) the original function. Even though we did not have
access to the original model, we can deduce from the parameters of this approximate
model that if we increase x1 or decrease x2, Pr(y = 1|x) will increase, and the output
class will change. Adapted from Prince (2022).
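A compact sketch of this procedure for a two-dimensional input (NumPy with scikit-learn; the black-box function, sampling scale, and kernel width below are all illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def lime_explain(model_fn, x0, n_samples=500, scale=1.0, width=1.0):
    # Probe the black box near x0, weight probes by proximity, and fit an
    # interpretable local model whose coefficients explain the decision.
    rng = np.random.default_rng(0)
    X = x0 + scale * rng.standard_normal((n_samples, len(x0)))
    y = (model_fn(X) >= 0.5).astype(int)                 # black-box decisions
    w = np.exp(-np.sum((X - x0) ** 2, axis=1) / (2 * width ** 2))  # proximity
    local = LogisticRegression().fit(X, y, sample_weight=w)
    return local.coef_[0]  # signs show how each input moves the decision

# Hypothetical black box; in practice this is the network's Pr(y = 1|x)
black_box = lambda X: 1 / (1 + np.exp(-(X[:, 0] - X[:, 1])))
print(lime_explain(black_box, np.array([0.0, 0.0])))
```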
21.2.3 Fraud
Unfortunately, AI is a useful tool for automating fraudulent activities (e.g.,
sending mass emails or text messages that trick people into revealing
sensitive information or sending money). Generative AI can be used to
deceive people into thinking they are interacting with a legitimate entity or
generate fake documents that mislead or deceive people. Additionally, AI
could increase the sophistication of cyber-attacks, such as by generating
more convincing phishing emails or adapting to the defenses of targeted
organizations.
This highlights the downside of calls for transparency in machine
learning systems: the more open and transparent these systems are, the
more vulnerable they may be to security risks or use by bad-faith actors.
For example, generative language models, like ChatGPT, have been used to
write software and emails that could be used for espionage, ransomware,
and other malware (Goodin, 2023).
Problem 21.7
The tendency to anthropomorphize computer behaviors and particularly
the projection of meaning onto strings of symbols is termed the ELIZA
effect (Hofstadter, 1995). This leads to a false sense of security when
interacting with sophisticated chatbots, making people more susceptible to
text-based fraud such as romance scams or business email compromise
schemes (Abrahams, 2023). Véliz (2023) highlights how emoji use in some
chatbots is inherently manipulative, exploiting instinctual responses to
emotive images.
Problem 21.13
Where does this leave the average scientist? Perhaps with the following
imperative: it is necessary to reflect upon the moral and social dimensions
of one's work. This might require actively engaging those communities that
are likely to be most affected by new technologies, thus cultivating
relationships between researchers and communities and empowering those
communities. Likewise, it might involve engagement with the literature
beyond one's own discipline. For philosophical questions, the Stanford
Encyclopedia of Philosophy is an invaluable resource. Interdisciplinary
conferences are also useful in this regard. Leading work is published at both
the Conference on Fairness, Accountability, and Transparency (FAccT) and
the Conference on AI and Society (AIES).
21.8 Summary
This chapter considered the ethical implications of deep learning and AI.
The value alignment problem is the task of ensuring that the objectives of
AI systems are aligned with human objectives. Bias, explainability,
artificial moral agency, and other topics can be viewed through this lens. AI
can be intentionally misused, and this chapter detailed some ways this can
happen. Progress in AI has further implications in areas as diverse as IP law
and climate change.
Ethical AI is a collective action problem, and the chapter concludes with
an appeal to scientists to consider the moral and ethical implications of their
work. Not every ethical issue is within the control of every individual
computer scientist. However, this does not imply that researchers have no
responsibility whatsoever to consider—and mitigate where they can—the
potential for misuse of the systems they create.
Problems
Problem 21.1 It was suggested that the most common specification of the
value alignment problem for AI is “the problem of ensuring that the values
of AI systems are aligned with the values of humanity.” Discuss the ways in
which this statement of the problem is underspecified. Discussion Resource:
LaCroix (2023).
Problem 21.2 Goodhart's law states that “when a measure becomes a target,
it ceases to be a good measure.” Consider how this law might be
reformulated to apply to value alignment for artificial intelligence, given
that the loss function is a mere proxy for our true objectives.
Problem 21.3 Suppose a university uses data from past students to build
models for predicting “student success,” where those models can support
informed changes in policies and practices. Consider how biases might
affect each of the four stages of the development and deployment of this
model.
Discussion Resource: Fazelpour & Danks (2021).
Problem 21.4 We might think of functional transparency, structural
transparency, and run transparency as orthogonal. Provide an example of
how an increase in one form of transparency may not lead to a concomitant
increase in another form of transparency.
Discussion Resource: Creel (2020).
Problem 21.5 If a computer scientist writes a research paper on AI or
pushes code to a public repository, do you consider them responsible for
future misuse of their work?
Problem 21.6 To what extent do you think the militarization of AI is
inevitable?
Problem 21.7 In light of the possible misuse of AI highlighted in section
21.2, make arguments both for and against the open-source culture of
research in deep learning.
Problem 21.8 Some have suggested that personal data is a source of power
for those who own it. Discuss the ways personal data is valuable to
companies that utilize deep learning and consider the claim that losses to
privacy are experienced collectively rather than individually.
Discussion Resource: Véliz (2020).
Problem 21.9 What are the implications of generative AI for the creative
industries? How do you think IP laws should be modified to cope with this
new development?
Problem 21.10 A good forecast must (i) be specific enough to know when it
is wrong, (ii) account for possible cognitive biases, and (iii) allow for
rationally updating beliefs. Consider any claim in the recent media about
future AI and discuss whether it satisfies these criteria.
Discussion Resource: Tetlock & Gardner (2016).
Problem 21.11 Some critics have argued that calls to democratize AI have
focused too heavily on the participatory aspects of democracy, which can
increase risks of errors in collective perception, reasoning, and agency,
leading to morally-bad outcomes. Reflect on each of the following: What
aspects of AI should be democratized? Why should AI be democratized?
How should AI be democratized?
Discussion Resource: Himmelreich (2022).
Problem 21.12 In March 2023, the Future of Life Institute published a
letter, “Pause Giant AI Experiments,” in which they called on all AI labs to
immediately pause for at least six months the training of AI systems more
powerful than GPT-4. Discuss the motivations of the authors in writing this
letter, the public reaction, and the implications of such a pause. Relate this
episode to the view that AI ethics can be considered a collective action
problem (section 21.6).
Discussion Resource: Gebru et al. (2023).
Problem 21.13 Discuss the merits of the four points in section 21.7. Do you
agree with them?
1 Whether Article 22 actually mandates such a right is debatable (see Wachter et al., 2017).
2 As a baseline, it is estimated that the average human is responsible for around 5 tonnes of CO2
per year, with individuals from major oil-producing countries responsible for three times this amount.
See https://ourworldindata.org/co2-emissions.
OceanofPDF.com
Appendix A
Notation
This appendix details the notation used in this book. This mostly adheres to
standard conventions in computer science, but deep learning is applicable to
many different areas, so it is explained in full. In addition, there are several
notational conventions that are unique to this book, including notation for
functions and the systematic distinction between parameters and variables.
Sets
Sets are denoted by curly brackets, so {0, 1, 2} denotes the numbers 0, 1,
and 2. The notation {0, 1, 2, …} denotes the set of non-negative integers.
Sometimes, we want to specify a set of variables, and the notation $\{x_i\}_{i=1}^{I}$ denotes the I
variables x1, … xI. When it's not necessary to specify how many items are
in the set, this is shortened to {xi}. The notation $\{x_i, y_i\}_{i=1}^{I}$ denotes the set
of I pairs xi, yi. The convention for naming sets is to use calligraphic letters.
Notably, 𝓑t is used to denote the set of indices in a batch at iteration t during
training. The number of elements in a set 𝒮 is denoted by|𝒮|.
The set ℝ denotes the set of real numbers. The set ℝ+ denotes the set of
non-negative real numbers. The notation ℝD denotes the set of D-
dimensional vectors containing real numbers. The notation $\mathbb{R}^{D_1\times D_2}$ denotes
the set of matrices of dimension D1 × D2. The notation $\mathbb{R}^{D_1\times D_2\times D_3}$ denotes
the set of tensors of size D1 × D2 × D3 and so on.
The notation [a, b] denotes the real numbers from a to b, including a and
b themselves. When the square brackets are replaced by round brackets, this
means that the adjacent value is not included in the set. For example, the set
(−π, π] denotes the real numbers from −π to π, but excluding −π.
Membership of sets is denoted by the symbol ∈, so x ∈ ℝ+ means that
the variable x is a non-negative real number, and the notation Σ ∈ ℝD×D
denotes that Σ is a matrix of size D × D. Sometimes, we want to work
through each element of a set systematically, and the notation ∀k ∈ {1, …, K}
means “for all” the integers from 1 to K.
Functions
Functions are expressed as a name, followed by square brackets that contain
the arguments of the function. For example, log[x] returns the logarithm of
the variable x. When the function returns a vector, it is written in bold and
starts with a small letter. For example, the function y = mlp[x, ϕ] returns a
vector y and has vector arguments x and ϕ. When a function returns a
matrix or tensor, it is written in bold and starts with a capital letter. For
example, the function Y = Sa[X, ϕ] returns a matrix Y and has arguments X
and ϕ. When we want to leave the arguments of a function deliberately
ambiguous, we use the bullet symbol (e.g., mlp[•, ϕ]).
Probability distributions
Probability distributions should be written as Pr(x = a), denoting that the
random variable x takes the value of a. However, this notation is
cumbersome. Hence, we usually simplify this and just write Pr(x), where x
denotes either the random variable or the value it takes according to the
sense of the equation. The conditional probability of y given x is written as
Pr(y|x). The joint probability of y and x is written as Pr(y, x). These two
forms can be combined, so Pr(y|x, ϕ) denotes the probability of the variable
y, given that we know x and ϕ. Similarly, Pr(y, x|ϕ) denotes the probability
of variables y and x given that we know ϕ. When we need two probability
distributions over the same variable, we write Pr(x) for the first distribution
and q(x) for the second. More information about probability distributions
can be found in appendix C.
Asymptotic notation
Asymptotic notation is used to compare the amount of work done by
different algorithms as the size D of the input increases. This can be done in
various ways, but this book only uses big-O notation, which represents a
tight upper bound on the growth of computation in an algorithm. A function
f[n] is 𝒪[g[n]] if there exists a constant c > 0 and an integer n0 such that f[n] <
c · g[n] for all n > n0.
This notation provides a bound on the worst-case running time of an
algorithm. For example, when we say that inversion of a D × D matrix is
𝒪[D^3], we mean that the computation will increase no faster than some
constant times D^3 once D is large enough. This gives us an idea of how
feasible it is to invert matrices of different sizes. If D = 10^3, then it may take
on the order of 10^9 operations to invert it.
Miscellaneous
A small dot in a mathematical equation is intended to improve ease of
reading and has no real meaning (or just implies multiplication). For
example, α ⸱ f[x] is the same as αf[x] but is easier to read. To avoid
ambiguity, dot products are written as aTb (see appendix B.3.4). A left
arrow symbol ← denotes assignment, so x ← x + 2 means that we are
adding two to the current value of x.
OceanofPDF.com
Appendix B
Mathematics
This appendix reviews mathematical concepts that are used in the main text.
B.1 Functions
A function defines a mapping from a set 𝒳 (e.g., the set of real numbers) to
another set 𝒴. An injection is a one-to-one function where every element in
the first set maps to a unique position in the second set (but there may be
elements of the second set that are not mapped to). A surjection is a
function where every element in the second set receives a mapping from the
first (but two or more elements of the first set may map to the same element of the second). A
bijection or bijective mapping is a function that is both injective and
surjective. It provides a one-to-one correspondence between all members of
the two sets. A diffeomorphism is a special case of a bijection where both
the forward and reverse mapping are differentiable.
(B.1)
B.1.2 Convexity
A function is convex if we can draw a straight line between any two points
on the function, and this line always lies above the function. Similarly, a
function is concave if a straight line between any two points always lies
below the function. By definition, convex (concave) functions have at most
one minimum (maximum).
A region of ℝD is convex if we can draw a straight line between any two
points on the boundary of the region without intersecting the boundary in
another place. Gradient descent guarantees to find the global minimum of
any function that is both convex and defined on a convex region.
(B.2)
(B.3)
(B.4)
Figure B.2 Stirling's formula. The factorial function x! can be approximated by
Stirling's formula Stir[x] which is defined for every real value.
(B.5)
B.2.1 Autocorrelation
The autocorrelation r[τ] of a continuous function f[z] is defined as:
$$r[\tau] = \int f[z]\, f[z+\tau]\, dz, \tag{B.6}$$
where τ is the time lag. Sometimes, this is normalized by r[0] so that the
autocorrelation at time lag zero is one. The autocorrelation function is a
measure of the correlation of the function with itself as a function of an
offset (i.e., the time lag). If a function changes slowly and predictably, then
the autocorrelation function will decrease slowly as the time lag increases
from zero. If the function changes fast and unpredictably, then it will
decrease quickly to zero.
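A discrete sketch of this computation in NumPy (np.correlate sums products of the sequence with shifted copies of itself, mirroring equation B.6):

```python
import numpy as np

def autocorrelation(f, normalize=True):
    # Sum of f[z] * f[z + tau] over the sequence, for non-negative lags tau
    r = np.correlate(f, f, mode="full")[len(f) - 1:]
    return r / r[0] if normalize else r

t = np.linspace(0, 20, 500)
slow = autocorrelation(np.sin(t))  # smooth signal: decays slowly
fast = autocorrelation(np.random.default_rng(0).standard_normal(500))  # noise: drops fast
print(slow[:3], fast[:3])
```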
B.3 Vectors, matrices, and tensors
In machine learning, a vector x ∈ ℝD is a one-dimensional array of D
numbers, which we will assume are organized in a column. Similarly, a
matrix is a two-dimensional array of numbers with D1 rows
and D2 columns. A tensor is an N-dimensional array of
numbers. Confusingly, all three of these quantities are stored in objects
known as “tensors” in deep learning APIs such as PyTorch and TensorFlow.
B.3.1 Transpose
The transpose of a matrix is formed by reflecting it
around the principal diagonal so that the kth column becomes the kth row
and vice-versa. If we take the transpose of a matrix product AB, then we
take the transpose of the original matrices but reverse the order:
$$(\mathbf{A}\mathbf{B})^T = \mathbf{B}^T\mathbf{A}^T. \tag{B.7}$$
B.3.2 Vector norms
The ℓp norm of a vector z ∈ ℝD is defined as:
$$\|\mathbf{z}\|_p = \left(\sum_{d=1}^{D} |z_d|^p\right)^{1/p}. \tag{B.8}$$
When p = 2, this returns the length of the vector, and this is known as the
Euclidean norm. It is this case that is most commonly used in deep learning,
and often the exponent p is omitted, and the Euclidean norm is just written
as ‖z‖. When p = ∞, the operator returns the maximum absolute value in the
vector.
Norms can be computed in a similar way for matrices. For example, the
ℓ2 norm of a matrix Z (known as the Frobenius norm) is calculated as:
$$\|\mathbf{Z}\|_F = \sqrt{\sum_{i}\sum_{j} z_{ij}^2}. \tag{B.9}$$
(B.10)
(B.11)
It can be shown that the dot product is equal to the Euclidean norm
of the first vector times the Euclidean norm of the second vector times the
cosine of the angle θ between them:
$$\mathbf{a}^T\mathbf{b} = \|\mathbf{a}\|\cdot\|\mathbf{b}\|\cdot\cos[\theta]. \tag{B.12}$$
B.3.5 Inverse
A square matrix A may or may not have an inverse A^{−1} such that A^{−1}A =
AA^{−1} = I. If a matrix does not have an inverse, it is called singular. If we
take the inverse of a matrix product AB, then we can equivalently take the
inverse of each matrix individually and reverse the order of multiplication:
$$(\mathbf{A}\mathbf{B})^{-1} = \mathbf{B}^{-1}\mathbf{A}^{-1}. \tag{B.13}$$
B.3.7 Eigenspectrum
If we multiply the set of 2D points on a unit circle by a 2×2 matrix A, they
map to an ellipse (figure B.3). The radii of the major and minor axes of this
ellipse (i.e., the longest and shortest directions) correspond to the magnitude
of the eigenvalues λ1 and λ2 of the matrix. The eigenvalues also have a sign,
which relates to whether the matrix reflects the inputs about the origin. The
same idea applies in higher dimensions. A D–dimensional spheroid is
mapped by a D × D matrix A to a D-dimensional ellipsoid. The radii of the
D principal axes of this ellipsoid determine the magnitude of the
eigenvalues.
Figure B.3 Eigenvalues. When the points {xi} on the unit circle are transformed to
points x′i = Axi by a linear transformation A, they are mapped to an ellipse. For
example, the light blue point on the unit circle is mapped to the light blue point on
the ellipse. The length of the major (longest) axis of the ellipse (long gray arrow) is
the magnitude of the first eigenvalue of the matrix, and the length of the minor
(shortest) axis of the ellipse (short gray arrow) is the magnitude of the second
eigenvalue.
The trace of a square matrix is the sum of the diagonal values (the matrix
itself need not be diagonal) or the sum of the eigenvalues. Traces obey these
rules:
$$\operatorname{trace}[\mathbf{A}^T] = \operatorname{trace}[\mathbf{A}], \quad \operatorname{trace}[\mathbf{A}+\mathbf{B}] = \operatorname{trace}[\mathbf{A}] + \operatorname{trace}[\mathbf{B}], \quad \operatorname{trace}[\mathbf{ABC}] = \operatorname{trace}[\mathbf{BCA}] = \operatorname{trace}[\mathbf{CAB}], \tag{B.15}$$
where in the last relation, the trace is invariant for cyclic permutations only,
so in general, trace[ABC] ≠ trace[BAC].
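These rules are easy to verify numerically (a NumPy sketch with random matrices):

```python
import numpy as np

rng = np.random.default_rng(1)
A, B, C = (rng.standard_normal((4, 4)) for _ in range(3))

# The trace is invariant to cyclic permutations of a product...
print(np.isclose(np.trace(A @ B @ C), np.trace(B @ C @ A)))  # True
print(np.isclose(np.trace(A @ B @ C), np.trace(C @ A @ B)))  # True
# ...but not to arbitrary permutations.
print(np.isclose(np.trace(A @ B @ C), np.trace(B @ A @ C)))  # generally False
```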
A linear function f[x] of a vector x ∈ ℝD can be written as:
$$f[\mathbf{x}] = \phi_1 x_1 + \phi_2 x_2 + \cdots + \phi_D x_D = \boldsymbol{\phi}^T\mathbf{x}, \tag{B.16}$$
where ϕ1, …, ϕD are parameters that define the function. We often add a
constant term ϕ0 to the right-hand side. This is technically an affine function
but is commonly referred to as linear in machine learning. We adopt this
convention throughout.
(B.19)
and
(B.21)
OceanofPDF.com
Appendix C
Probability
(C.1)
Figure C.1 Joint and marginal distributions. a) The joint distribution Pr(x, y)
captures the propensity of variables x and y to take different combinations of values.
Here, the probability density is represented by the color map, so brighter positions
are more probable. For example, the combination x = 6, y = 6 is much less likely to
be observed than the combination x = 5, y = 0. b) The marginal distribution Pr(x) of
variable x can be recovered by integrating over y. c) The marginal distribution Pr(y)
of variable y can be recovered by integrating over x.
This idea extends to more than two variables, so the joint density of x, y,
and z is written as Pr(x, y, z). Sometimes, we store multiple random
variables in a vector x, and we write their joint density as Pr(x). Extending
this, we can write the joint density of all of the variables in two vectors x
and y as Pr(x, y).
C.1.2 Marginalization
If we know the joint distribution Pr(x, y) over two variables, we can recover
the marginal distributions Pr(x) and Pr(y) by integrating over the other
variable (figure C.1b-c):
$$\Pr(x) = \int \Pr(x, y)\, dy, \qquad \Pr(y) = \int \Pr(x, y)\, dx. \tag{C.2}$$
This process is called marginalization and has the interpretation that we are
computing the distribution of one variable regardless of the value the other
one took. The idea of marginalization extends to higher dimensions, so if
we have a joint distribution Pr(x, y, z), we can recover the joint distribution
Pr(x, z) by integrating over y.
C.1.3 Conditional probability
The conditional probability Pr(x|y) of x given y is the joint probability
divided by the marginal probability of y:
$$\Pr(x|y) = \frac{\Pr(x, y)}{\Pr(y)}. \tag{C.3}$$
Figure C.2 Conditional distributions. a) Joint distribution Pr(x, y) of variables x
and y. b) The conditional probability Pr(x|y = 3.0) of variable x, given that y takes the
value 3.0, is found by taking the horizontal “slice” Pr(x, y = 3.0) of the joint
probability (top cyan line in panel a), and dividing this by the total area Pr(y = 3.0) in
that slice so that it forms a valid probability distribution that integrates to one. c) The
conditional probability Pr(x|y = −1.0) is found similarly using the slice at y = −1.0.
Similarly,
$$\Pr(y|x) = \frac{\Pr(x, y)}{\Pr(x)}. \tag{C.4}$$
C.1.4 Bayes’ rule
Rearranging equations C.3 and C.4 gives two expressions for the joint probability:
$$\Pr(x, y) = \Pr(x|y)\Pr(y) = \Pr(y|x)\Pr(x), \tag{C.5}$$
and equating these and dividing by Pr(y) gives:
$$\Pr(x|y) = \frac{\Pr(y|x)\Pr(x)}{\Pr(y)}. \tag{C.6}$$
This expression relates the conditional probability Pr(x|y) of x given y to the
conditional probability Pr(y|x) of y given x and is known as Bayes’ rule.
Each term in this Bayes’ rule has a name. The term Pr(y|x) is the
likelihood of y given x, and the term Pr(x) is the prior probability of x. The
denominator Pr(y) is known as the evidence, and the left-hand side Pr(x|y)
is termed the posterior probability of x given y. The equation maps from the
prior Pr(x) (what we know about x before observing y) to the posterior
Pr(x|y) (what we know about x after observing y).
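A small numerical sanity check of these relations on a discrete joint distribution (NumPy; the numbers are arbitrary):

```python
import numpy as np

joint = np.array([[0.1, 0.3],   # Pr(x, y), rows index x, columns index y
                  [0.2, 0.4]])
pr_x = joint.sum(axis=1)                       # marginal Pr(x) (prior)
pr_y = joint.sum(axis=0)                       # marginal Pr(y) (evidence)
pr_y_given_x = joint / pr_x[:, None]           # likelihood Pr(y|x)
posterior = pr_y_given_x[:, 0] * pr_x / pr_y[0]  # Bayes' rule: Pr(x|y=0)
print(posterior, joint[:, 0] / pr_y[0])          # the two should agree
```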
C.1.5 Independence
If the value of the random variable y tells us nothing about x and vice-versa,
we say that x and y are independent, and we can write Pr(x|y) = Pr(x) and
Pr(y|x) = Pr(y). It follows that all of the conditional distributions Pr(y|x = •)
are identical, as are the conditional distributions Pr(x|y = •).
Starting from the first expression for the joint probability in equation
C.5, we see that the joint distribution becomes the product of the marginal
distributions:
$$\Pr(x, y) = \Pr(x)\Pr(y). \tag{C.7}$$
C.2 Expectation
Consider a function f[x] and a probability distribution Pr(x) defined over x.
The expected value of a function f[•] of a random variable x with respect to
the probability distribution Pr(x) is defined as:
$$\mathbb{E}\bigl[f[x]\bigr] = \int f[x] \Pr(x)\, dx. \tag{C.8}$$
As the name suggests, this is the expected or average value of f[x] after
taking into account the probability of seeing different values of x. This idea
generalizes to functions f[•, •] of more than one random variable:
$$\mathbb{E}\bigl[f[x, y]\bigr] = \iint f[x, y] \Pr(x, y)\, dx\, dy. \tag{C.9}$$
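When this integral has no closed form, it can be approximated by averaging the function over samples from the distribution (a minimal Monte Carlo sketch):

```python
import numpy as np

# Monte Carlo estimate of E[f[x]] for f[x] = x**2 with x ~ Normal(0, 1).
# The true value is the second moment, which equals 1 here.
rng = np.random.default_rng(0)
samples = rng.normal(0.0, 1.0, 1_000_000)
print(np.mean(samples ** 2))  # approximately 1.0
```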
(C.10)
(C.11)
where k is an arbitrary constant. These are proven below for the continuous
case.
where now A is a constant matrix and f[x] is a function of the vector x that
returns a vector, and g[y] is a function of the vector y that also returns a
vector.
The variance Var[x] of a random variable x with mean μ = 𝔼[x] can be written as:
$$\operatorname{Var}[x] = \mathbb{E}\bigl[(x-\mu)^2\bigr] = \mathbb{E}[x^2] - \mathbb{E}[x]^2. \tag{C.13}$$
Proof:
$$\begin{aligned} \mathbb{E}\bigl[(x-\mu)^2\bigr] &= \mathbb{E}\bigl[x^2 - 2\mu x + \mu^2\bigr] \\ &= \mathbb{E}[x^2] - \mathbb{E}[2\mu x] + \mathbb{E}[\mu^2] \\ &= \mathbb{E}[x^2] - 2\mu\mathbb{E}[x] + \mu^2 \\ &= \mathbb{E}[x^2] - 2\mu^2 + \mu^2 \\ &= \mathbb{E}[x^2] - \mathbb{E}[x]^2, \end{aligned} \tag{C.14}$$
where we have used rule 3 between lines 1 and 2, rules 1 and 2 between
lines 2 and 3, and the definition μ = 𝔼[x] in the remaining two lines.
C.2.4 Standardization
Setting the mean of a random variable to zero and the variance to one is
known as standardization. This is achieved using the transformation:
$$z = \frac{x - \mu}{\sigma}, \tag{C.15}$$
where μ = 𝔼[x] and σ is the standard deviation. The mean of the new distribution is:
$$\mathbb{E}[z] = \mathbb{E}\left[\frac{x-\mu}{\sigma}\right] = \frac{\mathbb{E}[x] - \mu}{\sigma} = 0, \tag{C.16}$$
where again, we have used the four rules for manipulating expectations.
The variance of the new distribution is given by:
$$\mathbb{E}\bigl[(z - \mathbb{E}[z])^2\bigr] = \mathbb{E}\left[\frac{(x-\mu)^2}{\sigma^2}\right] = \frac{\sigma^2}{\sigma^2} = 1. \tag{C.17}$$
(C.18)
For a multivariate variable x with mean μ and covariance matrix Σ, standardization generalizes to:
$$\mathbf{z} = \boldsymbol{\Sigma}^{-1/2}(\mathbf{x} - \boldsymbol{\mu}). \tag{C.19}$$
The result will have a mean 𝔼[z] = 0 and an identity covariance matrix 𝔼[(z
− 𝔼[z])(z − 𝔼[z])T] = I. To reverse this process, we use:
$$\mathbf{x} = \boldsymbol{\Sigma}^{1/2}\mathbf{z} + \boldsymbol{\mu}. \tag{C.20}$$
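A NumPy sketch of this multivariate case, computing Σ^{−1/2} via an eigendecomposition of the estimated covariance (the mean and covariance of the synthetic data are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.multivariate_normal([1.0, -2.0], [[2.0, 0.8], [0.8, 1.0]], 10000)

mu, sigma = x.mean(axis=0), np.cov(x.T)
# Matrix square root of the inverse covariance via its eigendecomposition
vals, vecs = np.linalg.eigh(sigma)
sigma_inv_sqrt = vecs @ np.diag(vals ** -0.5) @ vecs.T

z = (x - mu) @ sigma_inv_sqrt.T  # standardized: zero mean, identity covariance
print(z.mean(axis=0).round(2), np.cov(z.T).round(2))
```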
C.3 Normal probability distribution
Probability distributions used in this book include the Bernoulli distribution
(figure 5.6), categorical distribution (figure 5.9), Poisson distribution
(figure 5.15), von Mises distribution (figure 5.13), and mixture of
Gaussians (figures 5.14 and 17.1). However, the most common distribution
in machine learning is the normal or Gaussian distribution.
The multivariate normal distribution over a D-dimensional variable x with
mean μ and covariance matrix Σ is defined as:
$$\Pr(\mathbf{x}) = \frac{1}{(2\pi)^{D/2}|\boldsymbol{\Sigma}|^{1/2}} \exp\left[-\frac{(\mathbf{x}-\boldsymbol{\mu})^T\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})}{2}\right]. \tag{C.21}$$
The interpretation is similar to the univariate case. The quadratic term −(x −
μ)TΣ−1(x − μ)/2 returns a scalar that decreases as x grows further from the
mean μ, at a rate that depends on the matrix Σ. This is turned into a bell-
curve shape by the exponential, and dividing by (2π)D/2|Σ|1/2 ensures that
the distribution integrates to one.
The covariance matrix can take spherical, diagonal, and full forms:
(C.23)
(C.24)
Figure C.4 Bivariate normal distribution. a–b) When the covariance matrix is a
multiple of the diagonal matrix, the isocontours are circles, and we refer to this as
spherical covariance. c–d) When the covariance is an arbitrary diagonal matrix, the
isocontours are axis-aligned ellipses, and we refer to this as diagonal covariance. e–f)
When the covariance is an arbitrary symmetric positive definite matrix, the iso-
contours are general ellipses, and we refer to this as full covariance.
(C.25)
At first sight, this relation is rather opaque, but figure C.5 shows the case
for scalar x and y, which is easy to understand. As for the previous relation,
this can be proved by expanding the quadratic product in the exponential
term and completing the square to make this a distribution in y (see
problem 18.4).
C.4 Sampling
To sample from a univariate distribution Pr(x), we first compute the
cumulative distribution F[x] (the integral of Pr(x)). Then we draw a sample
z* from a uniform distribution over the range [0, 1] and evaluate this
against the inverse of the cumulative distribution, so the sample x* is
created as:
$$x^* = F^{-1}\bigl[z^*\bigr]. \tag{C.26}$$
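For example, for the exponential distribution Pr(x) = exp[−x], the cumulative distribution F[x] = 1 − exp[−x] can be inverted in closed form (a NumPy sketch):

```python
import numpy as np

# Inverse-transform sampling: F[x] = 1 - exp(-x), so F^{-1}[z] = -log(1 - z)
rng = np.random.default_rng(0)
z = rng.uniform(0.0, 1.0, 100000)
x = -np.log(1.0 - z)
print(x.mean())  # close to 1, the mean of this exponential distribution
```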
Ancestral sampling extends this idea to joint distributions that factorize as
Pr(x, y, z) = Pr(x)Pr(y|x)Pr(z|y). To sample from this joint distribution, we first draw a sample x* from Pr(x).
Then we draw a sample y* from Pr(y|x*). Finally, we draw a sample z*
from Pr(z|y*).
The Kullback-Leibler (KL) divergence measures the difference between two
probability distributions p(x) and q(x):
$$D_{KL}\bigl[p(x)\,\|\,q(x)\bigr] = \int p(x)\log\left[\frac{p(x)}{q(x)}\right] dx. \tag{C.28}$$
(C.29)
Figure C.6 Lower bound on negative logarithm. The function 1 − y is always less
than the function −log[y]. This relation is used to show that the Kullback-Leibler
divergence is always greater than or equal to zero.
The KL divergence is infinite if there are places where q(x) is zero but p(x)
is non-zero. This can lead to problems when we are minimizing a function
based on this distance.
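A discrete sketch that also illustrates this failure mode (NumPy; following the usual convention, terms where p(x) = 0 contribute zero):

```python
import numpy as np

def kl_divergence(p, q):
    # Discrete KL divergence; infinite wherever q is zero but p is not.
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    if np.any(q[mask] == 0):
        return np.inf
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

print(kl_divergence([0.5, 0.5], [0.9, 0.1]))   # positive and asymmetric
print(kl_divergence([0.5, 0.5], [1.0, 0.0]))   # inf: q is zero where p is not
```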
The Jensen-Shannon divergence is a symmetric measure defined as:
$$D_{JS}\bigl[p(x)\,\|\,q(x)\bigr] = \frac{1}{2}D_{KL}\left[p(x)\,\Big\|\,\frac{p(x)+q(x)}{2}\right] + \frac{1}{2}D_{KL}\left[q(x)\,\Big\|\,\frac{p(x)+q(x)}{2}\right]. \tag{C.30}$$
It is the mean of the divergences of p(x) and q(x) from the average of the two
distributions.
(C.31)
and is a measure of the maximum distance between the cumulative
probability curves.
(C.32)
where tr[•] is the trace of the matrix argument. The Fréchet and Wasserstein
distances are both given by:
(C.33)
OceanofPDF.com
Bibliography
Abdal, R., Qin, Y., & Wonka, P. (2019). Image2StyleGAN: How to embed images into the StyleGAN
latent space? IEEE/CVF International Conference on Computer Vision, 4432–4441. 301
Abdal, R., Qin, Y., & Wonka, P. (2020). Image2StyleGAN++: How to edit the embedded images?
IEEE/CVF Computer Vision & Pattern Recognition, 8296–8305. 301
Abdal, R., Zhu, P., Mitra, N. J., & Wonka, P. (2021). StyleFlow: Attribute-conditioned exploration of
StyleGAN-generated images using conditional continuous normalizing flows. ACM Transactions
on Graphics (ToG), 40(3), 1–21. 300, 322
Abdalla, M., & Abdalla, M. (2021). The grey hoodie project: Big tobacco, big tech, and the threat on
academic integrity. AAAI/ACM Conference on AI, Ethics, and Society, 287–297. 434
Abdel-Hamid, O., Mohamed, A.-r., Jiang, H., & Penn, G. (2012). Applying convolutional neural
networks concepts to hybrid NN-HMM model for speech recognition. IEEE International
Conference on Acoustics, Speech and Signal Processing, 4277–4280. 182
Abdelhamed, A., Brubaker, M. A., & Brown, M. S. (2019). Noise flow: Noise modeling with
conditional normalizing flows. IEEE/CVF International Conference on Computer Vision, 3165–
3173. 322
Abeßer, J., Mimilakis, S. I., Gräfe, R., Lukashevich, H., & Fraunhofer, I. (2017). Acoustic scene
classification by combining autoencoder-based dimensionality reduction and convolutional
neural networks. Workshop on Detection and Classification of Acoustic Scenes and Events, 7–11.
160
Abrahams, D. (2023). Let's talk about generative AI and fraud. Forter Blog, March 27, 2023. https://www.forter.com/blog/letstalk-about-generative-ai-and-fraud/. 428
Abu-El-Haija, S., Perozzi, B., Kapoor, A., Alipourfard, N., Lerman, K., Harutyunyan, H., Ver Steeg,
G., & Galstyan, A. (2019). MixHop: Higher-order graph convolutional architectures via
sparsified neighborhood mixing. International Conference on Machine Learning, 21–29. 263
Adler, J., & Lunz, S. (2018). Banach Wasserstein GAN. Neural Information Processing Systems, 31,
6755–6764. 299
Agarwal, R., Schuurmans, D., & Norouzi, M. (2020). An optimistic perspective on offline
reinforcement learning. International Conference on Machine Learning, 104–114. 398
Aggarwal, C. C., Hinneburg, A., & Keim, D. A. (2001). On the surprising behavior of distance
metrics in high dimensional space. International Conference on Database Theory, 420–434. 135
Agüera y Arcas, B., Todorov, A., & Mitchell, M. (2018). Do algorithms reveal sexual orientation or
just expose our stereotypes? Medium, Jan 11, 2018. https://medium.com/@blaisea/do-algorithms-reveal-sexual-orientation-or-just-expose-our-stereotypes-d998fafdf477. 431
Ahmed, N., & Wahed, M. (2020). The de-democratization of AI: Deep learning and the compute divide in artificial intelligence research. arXiv:2010.15581. 430
Ahmed, S., Mula, R. S., & Dhavala, S. S. (2020). A framework for democratizing AI.
arXiv:2001.00818. 430
Ahmed, T. (2017). AI can tell if you're gay: Artificial intelligence predicts sexuality from one photo
with startling accuracy. Newsweek, 8 Sept 2017. https://www.newsweek.com/ai-can-tell-if-youre-gay-artificial-intelligence-predicts-sexuality-one-photo-661643. 430
Aiken, M., & Park, M. (2010). The efficacy of round-trip translation for MT evaluation. Translation
Journal, 14(1). 160
Ainslie, J., Ontañón, S., Alberti, C., Cvicek, V., Fisher, Z., Pham, P., Ravula, A., Sanghai, S., Wang,
Q., & Yang, L. (2020). ETC: Encoding long and structured inputs in transformers. ACL
Empirical Methods in Natural Language Processing, 268–284. 237
Akers, J., Bansal, G., Cadamuro, G., Chen, C., Chen, Q., Lin, L., Mulcaire, P., Nandakumar, R.,
Rockett, M., Simko, L., Toman, J., Wu, T., Zeng, E., Zorn, B., & Roesner, F. (2018). Technology-
enabled disinformation: Summary, lessons, and recommendations. arXiv:1812.09383. 427
Akuzawa, K., Iwasawa, Y., & Matsuo, Y. (2018). Expressive speech synthesis via modeling
expressions with variational autoencoder. INTERSPEECH, 3067–3071. 343
Ali, A., Touvron, H., Caron, M., Bojanowski, P., Douze, M., Joulin, A., Laptev, I., Neverova, N.,
Synnaeve, G., Verbeek, J., et al. (2021). XCiT: Cross-covariance image transformers. Neural
Information Processing Systems, 34, 20014–20027. 238
Allen, C., Smit, I., & Wallach, W. (2005). Artificial morality: Top-down, bottom-up, and hybrid
approaches. Ethics and Information Technology, 7, 149–155. 424
Allen-Zhu, Z., Li, Y., & Song, Z. (2019). A convergence theory for deep learning via over-
parameterization. International Conference on Machine Learning, 97, 242–252. 404
Alon, U., & Yahav, E. (2021). On the bottleneck of graph neural networks and its practical
implications. International Conference on Learning Representations. 265
Alvarez, J. M., & Salzmann, M. (2016). Learning the number of neurons in deep networks. Neural
Information Processing Systems, 29, 2262–2270. 414
Amari, S.-I. (1998). Natural gradient works efficiently in learning. Neural Computation, 10(2), 251–
276. 397
Amodei, D., Olah, C., Steinhardt, J., Christiano, P., Schulman, J., & Mané, D. (2016). Concrete
problems in AI safety. arXiv:1606.06565. 421
An, G. (1996). The effects of adding noise during backpropagation training on a generalization
performance. Neural Computation, 8(3), 643–674. 158
An, J., Huang, S., Song, Y., Dou, D., Liu, W., & Luo, J. (2021). ArtFlow: Unbiased image style
transfer via reversible neural flows. IEEE/CVF Computer Vision & Pattern Recognition, 862–
871. 322
Anderson, M., & Anderson, S. L. (2008). Ethical healthcare agents. Advanced Computational
Intelligence Paradigms in Healthcare 3. Studies in Computational Intelligence, vol. 107, 233–
257. 424
Andreae, J. (1969). Learning machines: A unified view. Encyclopaedia of Linguistics, Information
and Control, 261–270. 396
Angwin, J., Larson, J., Mattu, S., & Kirchner, L. (2016). Machine bias: There's software used across
the country to predict future criminals. And it's biased against blacks. ProPublica, May 23, 2016. https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing. 420
Ardizzone, L., Kruse, J., Lüth, C., Bracher, N., Rother, C., & Köthe, U. (2020). Conditional
invertible neural networks for diverse image-to-image translation. DAGM German Conference on
Pattern Recognition, 373–387. 322
Arjovsky, M., & Bottou, L. (2017). Towards principled methods for training generative adversarial
networks. International Conference on Learning Representations. 283, 299
Arjovsky, M., Chintala, S., & Bottou, L. (2017). Wasserstein generative adversarial networks.
International Conference on Machine Learning, 214–223. 280, 299
Arkin, R. C. (2008a). Governing lethal behavior: Embedding ethics in a hybrid deliberative/reactive
robot architecture—Part I: Motivation and philosophy. ACM/IEEE International Conference on
Human Robot Interaction, 121–128. 424
Arkin, R. C. (2008b). Governing lethal behavior: Embedding ethics in a hybrid deliberative/reactive
robot architecture—Part II: Formalization for ethical control. Conference on Artificial General
Intelligence, 51–62. 424
Arnab, A., Dehghani, M., Heigold, G., Sun, C., Lučić, M., & Schmid, C. (2021). ViVit: A video
vision transformer. IEEE/CVF International Conference on Computer Vision, 6836–6846. 238
Arora, R., Basu, A., Mianjy, P., & Mukherjee, A. (2016). Understanding deep neural networks with
rectified linear units. arXiv:1611.01491. 52
Arora, S., Ge, R., Liang, Y., Ma, T., & Zhang, Y. (2017). Generalization and equilibrium in generative
adversarial nets (GANs). International Conference on Machine Learning, 224–232. 300
Arora, S., Li, Z., & Lyu, K. (2018). Theoretical analysis of auto rate-tuning by batch normalization.
arXiv:1812.03981. 204
Arora, S., & Zhang, Y. (2017). Do GANs actually learn the distribution? An empirical study.
arXiv:1706.08224. 300
Arulkumaran, K., Deisenroth, M. P., Brundage, M., & Bharath, A. A. (2017). Deep reinforcement
learning: A brief survey. IEEE Signal Processing Magazine, 34(6), 26–38. 396
Asaro, P. (2012). On banning autonomous weapon systems: human rights, automation, and the
dehumanization of lethal decision-making. International Review of the Red Cross, 94(886), 687–
709. 429
Atwood, J., & Towsley, D. (2016). Diffusion-convolutional neural networks. Neural Information
Processing Systems, 29, 1993–2001. 262
Aubret, A., Matignon, L., & Hassas, S. (2019). A survey on intrinsic motivation in reinforcement
learning. arXiv:1908.06976. 398
Austin, J., Johnson, D. D., Ho, J., Tarlow, D., & van den Berg, R. (2021). Structured denoising
diffusion models in discrete state-spaces. Neural Information Processing Systems, 34, 17981–
17993. 369
Awad, E., Dsouza, S., Kim, R., Schulz, J., Henrich, J., Shariff, A., Bonnefon, J.-F., & Rahwan, I.
(2018). The moral machine experiment. Nature, 563, 59–64. 424
Ba, J. L., Kiros, J. R., & Hinton, G. E. (2016). Layer normalization. arXiv:1607.06450. 203
Bachlechner, T., Majumder, B. P., Mao, H., Cottrell, G., & McAuley, J. (2021). ReZero is all you
need: Fast convergence at large depth. Uncertainty in Artificial Intelligence, 1352–1361. 238
Bahdanau, D., Cho, K., & Bengio, Y. (2015). Neural machine translation by jointly learning to align
and translate. International Conference on Learning Representations. 233, 235
Bahri, Y., Kadmon, J., Pennington, J., Schoenholz, S. S., Sohl-Dickstein, J., & Ganguli, S. (2020).
Statistical mechanics of deep learning. Annual Review of Condensed Matter Physics, 11, 501–
528. 409, 410
Baldi, P., & Hornik, K. (1989). Neural networks and principal component analysis: Learning from
examples without local minima. Neural networks, 2(1), 53–58. 410
Balduzzi, D., Frean, M., Leary, L., Lewis, J., Ma, K. W.-D., & McWilliams, B. (2017). The shattered
gradients problem: If ResNets are the answer, then what is the question? International
Conference on Machine Learning, 342–350. 188, 202, 203, 205
Bansal, A., Borgnia, E., Chu, H.-M., Li, J. S., Kazemi, H., Huang, F., Goldblum, M., Geiping, J., &
Goldstein, T. (2022). Cold diffusion: Inverting arbitrary image transforms without noise.
arXiv:2208.09392. 369
Bao, F., Li, C., Zhu, J., & Zhang, B. (2022). Analytic-DPM: An analytic estimate of the optimal
reverse variance in diffusion probabilistic models. International Conference on Learning
Representations. 369
Baranchuk, D., Rubachev, I., Voynov, A., Khrulkov, V., & Babenko, A. (2022). Labelefficient
semantic segmentation with diffusion models. International Conference on Learning
Representations. 369
Barber, D., & Bishop, C. (1997). Ensemble learning for multi-layer networks. Neural Information
Processing Systems, 10, 395–401. 159
Barocas, S., Hardt, M., & Narayanan, A. (2023). Fairness and Machine Learning: Limitations and
Opportunities. MIT Press. 423
Barratt, S., & Sharma, R. (2018). A note on the inception score. Workshop on Theoretical
Foundations and Applications of Deep Generative Models. 274
Barrett, D. G. T., & Dherin, B. (2021). Implicit gradient regularization. International Conference on
Learning Representations. 157
Barrett, L. (2020). Ban facial recognition technologies for children — and for everyone else. Boston
University Journal of Science and Technology Law, 26(2), 223–285. 427
Barron, J. T. (2019). A general and adaptive robust loss function. IEEE/CVF Computer Vision &
Pattern Recognition, 4331–4339. 73
Bartlett, P. L., Foster, D. J., & Telgarsky, M. J. (2017). Spectrally-normalized margin bounds for
neural networks. Neural Information Processing Systems, vol. 30, 6240–6249. 156
Bartlett, P. L., Harvey, N., Liaw, C., & Mehrabian, A. (2019). Nearly-tight VC-dimension and
pseudodimension bounds for piecewise linear neural networks. Journal of Machine Learning
Research, 20(1), 2285–2301. 134
Barto, A. G. (2013). Intrinsic motivation and reinforcement learning. Intrinsically Motivated
Learning in Natural and Artificial Systems, 17–47. 398
Bau, D., Zhou, B., Khosla, A., Oliva, A., & Torralba, A. (2017). Network dissection: Quantifying
interpretability of deep visual representations. IEEE/CVF Computer Vision & Pattern
Recognition, 6541–6549. 184
Bau, D., Zhu, J.-Y., Wulff, J., Peebles, W., Strobelt, H., Zhou, B., & Torralba, A. (2019). Seeing what
a GAN cannot generate. IEEE/CVF International Conference on Computer Vision, 4502–4511.
300
Baydin, A. G., Pearlmutter, B. A., Radul, A. A., & Siskind, J. M. (2018). Automatic differentiation in
machine learning: A survey. Journal of Machine Learning Research, 18, 1–43. 113
Bayer, M., Kaufhold, M.-A., & Reuter, C. (2022). A survey on data augmentation for text
classification. ACM Computing Surveys, 55(7), 1–39. 160
Behrmann, J., Grathwohl, W., Chen, R. T., Duvenaud, D., & Jacobsen, J.-H. (2019). Invertible
residual networks. International Conference on Machine Learning, 573–582. 318, 323
Belinkov, Y., & Bisk, Y. (2018). Synthetic and natural noise both break neural machine translation.
International Conference on Learning Representations. 160
Belkin, M., Hsu, D., Ma, S., & Mandal, S. (2019). Reconciling modern machine-learning practice
and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences,
116(32), 15849–15854. 130, 134
Bellemare, M. G., Dabney, W., & Munos, R. (2017a). A distributional perspective on reinforcement
learning. International Conference on Machine Learning, 449–458. 397
Bellemare, M. G., Danihelka, I., Dabney, W., Mohamed, S., Lakshminarayanan, B., Hoyer, S., &
Munos, R. (2017b). The Cramer distance as a solution to biased Wasserstein gradients.
arXiv:1705.10743. 299
Bellman, R. (1966). Dynamic programming. Science, 153(3731), 34–37. 396
Beltagy, I., Peters, M. E., & Cohan, A. (2020). Longformer: The long-document transformer.
arXiv:2004.05150. 237
Bender, E. M., & Koller, A. (2020). Climbing towards NLU: On meaning, form, and understanding
in the age of data. Meeting of the Association for Computational Linguistics, 5185–5198. 234
Bengio, Y., Ducharme, R., & Vincent, P. (2000). A neural probabilistic language model. Neural
Information Processing Systems, 13, 932–938. 274
Benjamin, R. (2019). Race After Technology: Abolitionist Tools for the New Jim Code. Polity. 433
Berard, H., Gidel, G., Almahairi, A., Vincent, P., & Lacoste-Julien, S. (2019). A closer look at the
optimization landscapes of generative adversarial networks. arXiv:1906.04848. 299
Berger, P. (2019). MTA's initial foray into facial recognition at high speed is a bust. The Wall Street Journal, April 07, 2019. https://www.wsj.com/articles/mtasinitial-foray-into-facial-recognition-at-high-speed-is-a-bust-11554642000. 427
Bergstra, J., & Bengio, Y. (2012). Random search for hyper-parameter optimization. Journal of
Machine Learning Research, 13(10), 281–305. 136
Bergstra, J. S., Bardenet, R., Bengio, Y., & Kégl, B. (2011). Algorithms for hyper-parameter
optimization. Neural Information Processing Systems, vol. 24, 2546–2554. 136
Berk, R., Heidari, H., Jabbari, S., Kearns, M., & Roth, A. (2017). Fairness in criminal justice risk
assessments: the state of the art. Sociological Methods & Research, 50(1), 3–44. 422
Berner, C., Brockman, G., Chan, B., Cheung, V., Dębiak, P., Dennison, C., Farhi, D., Fischer, Q.,
Hashme, S., Hesse, C., et al. (2019). DOTA 2 with large scale deep reinforcement learning.
arXiv:1912.06680. 396
Bertasius, G., Wang, H., & Torresani, L. (2021). Is space-time attention all you need for video
understanding? International Conference on Machine Learning, 3, 813–824. 238
Beyer, K., Goldstein, J., Ramakrishnan, R., & Shaft, U. (1999). When is “nearest neighbor”
meaningful? International Conference on Database Theory, 217–235. 135
Binns, R. (2018). Algorithmic accountability and public reason. Philosophy & Technology, 31(4),
543–556. 13
Birhane, A., Isaac, W., Prabhakaran, V., Diaz, M., Elish, M. C., Gabriel, I., & Mohamed, S. (2022a).
Power to the people? Opportunities and challenges for participatory AI. Equity and Access in
Algorithms, Mechanisms, and Optimization. 433
Birhane, A., Kalluri, P., Card, D., Agnew, W., Dotan, R., & Bao, M. (2022b). The values encoded in
machine learning research. ACM Conference on Fairness, Accountability, and Transparency,
173–184. 431
Bishop, C. (1995). Regularization and complexity control in feed-forward networks. International
Conference on Artificial Neural Networks, 141–148. 157, 158
Bishop, C. M. (1994). Mixture density networks. Aston University Technical Report. 73
Bishop, C. M. (2006). Pattern recognition and machine learning. Springer. 15, 159
Bjorck, N., Gomes, C. P., Selman, B., & Weinberger, K. Q. (2018). Understanding batch
normalization. Neural Information Processing Systems, 31, 7705–7716. 204
Blum, A. L., & Rivest, R. L. (1992). Training a 3-node neural network is NP-complete. Neural
Networks, 5(1), 117–127. 401
Blundell, C., Cornebise, J., Kavukcuoglu, K., & Wierstra, D. (2015). Weight uncertainty in neural
network. International Conference on Machine Learning, 1613–1622. 159
Bond-Taylor, S., Leach, A., Long, Y., & Willcocks, C. G. (2022). Deep generative modelling: A
comparative review of VAEs, GANs, normalizing flows, energy-based and autoregressive
models. IEEE Transactions on Pattern Analysis & Machine Intelligence, 44(11), 7327–7347. 274
Bontridder, N., & Poullet, Y. (2021). The role of artificial intelligence in disinformation. Data &
Policy, 3, E32. 427
Borji, A. (2022). Pros and cons of GAN evaluation measures: New developments. Computer Vision
& Image Understanding, 215, 103329. 274
Bornschein, J., Shabanian, S., Fischer, A., & Bengio, Y. (2016). Bidirectional Helmholtz machines.
International Conference on Machine Learning, 2511–2519. 346
Boscaini, D., Masci, J., Rodolà, E., & Bronstein, M. (2016). Learning shape correspondence with
anisotropic convolutional neural networks. Neural Information Processing Systems, 29, 3189–
3197. 265
Bottou, L. (2012). Stochastic gradient descent tricks. Neural Networks: Tricks of the Trade: Second
Edition, 421–436. 91
Bottou, L., Curtis, F. E., & Nocedal, J. (2018). Optimization methods for large-scale machine
learning. SIAM Review, 60(2), 223–311. 91
Bottou, L., Soulié, F. F., Blanchet, P., & Liénard, J.-S. (1990). Speaker-independent isolated digit
recognition: Multilayer perceptrons vs. dynamic time warping. Neural Networks, 3(4), 453–465.
181
Boulemtafes, A., Derhab, A., & Challal, Y. (2020). A review of privacy-preserving techniques for
deep learning. Neurocomputing, 384, 21–45. 428
Bousselham, W., Thibault, G., Pagano, L., Machireddy, A., Gray, J., Chang, Y. H., & Song, X.
(2021). Efficient self-ensemble framework for semantic segmentation. arXiv:2111.13280. 162
Bowman, S. R., & Dahl, G. E. (2021). What will it take to fix benchmarking in natural language
understanding? ACL Human Language Technologies, 4843–4855. 234
Bowman, S. R., Vilnis, L., Vinyals, O., Dai, A. M., Jozefowicz, R., & Bengio, S. (2015). Generating
sentences from a continuous space. ACL Conference on Computational Natural Language
Learning, 10–21. 343, 344, 345
Braverman, H. (1974). Labor and monopoly capital: the degradation of work in the twentieth
century. Monthly Review Press. 429
Brock, A., Donahue, J., & Simonyan, K. (2019). Large scale GAN training for high fidelity natural
image synthesis. International Conference on Learning Representations. 287, 299
Brock, A., Lim, T., Ritchie, J. M., & Weston, N. (2016). Neural photo editing with introspective
adversarial networks. International Conference on Learning Representations. 345
Bromley, J., Guyon, I., LeCun, Y., Säckinger, E., & Shah, R. (1993). Signature verification using a
“Siamese” time delay neural network. Neural Information Processing Systems, 6, 737–744. 181
Bronstein, M. M., Bruna, J., Cohen, T., & Veličković, P. (2021). Geometric deep learning: Grids,
groups, graphs, geodesics, and gauges. arXiv:2104.13478. 262
Broussard, M. (2018). Artificial Unintelligence: How Computers Misunderstand the World. The MIT
Press. 433
Broussard, M. (2023). More than a Glitch: Confronting Race, Gender, and Ability Bias in Tech. The
MIT Press. 433
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam,
P., Sastry, G., Askell, A., et al. (2020). Language models are few-shot learners. Neural
Information Processing Systems, 33, 1877–1901. 9, 159, 234, 237, 422, 425
Brügger, R., Baumgartner, C. F., & Konukoglu, E. (2019). A partially reversible U-Net for memory-
efficient volumetric image segmentation. International Conference on Medical Image Computing
and Computer-Assisted Intervention, 429–437. 322
Bruna, J., Zaremba, W., Szlam, A., & LeCun, Y. (2013). Spectral networks and locally connected
networks on graphs. International Conference on Learning Representations. 262
Brynjolfsson, E., & McAfee, A. (2016). The Second Machine Age: Work, Progress, and Prosperity in
a Time of Brilliant Technologies. W. W. Norton. 430
Bryson, A., Ho, Y.-C., & Siouris, G. (1979). Applied optimal control: Optimization, estimation, and
control. IEEE Transactions on Systems, Man & Cybernetics, 9, 366–367. 113
Bubeck, S., & Sellke, M. (2021). A universal law of robustness via isoperimetry. Neural Information
Processing Systems, 34, 28811–28822. 135, 416
Buciluǎ, C., Caruana, R., & Niculescu-Mizil, A. (2006). Model compression. ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining, 535–541. 415
Bughin, J., Seong, J., Manyika, J., Chui, M., & Joshi, R. (2018). Notes from the AI Frontier:
Modelling the Impact of AI on the World Economy. McKinsey Global Institute, Sept 4, 2018. 429
Buolamwini, J., & Gebru, T. (2018). Gender shades: Intersectional accuracy disparities in
commercial gender classification. Proceedings of Machine Learning Research, 81. 423
Burda, Y., Grosse, R. B., & Salakhutdinov, R. (2016). Importance weighted autoencoders.
International Conference on Learning Representations. 73, 346
Buschjäger, S., & Morik, K. (2021). There is no double-descent in random forests. arXiv:2111.04409.
134
Cai, T., Luo, S., Xu, K., He, D., Liu, T.-y., & Wang, L. (2021). GraphNorm: A principled approach to
accelerating graph neural network training. International Conference on Machine Learning,
1204–1215. 265
Calimeri, F., Marzullo, A., Stamile, C., & Terracina, G. (2017). Biomedical data augmentation using
adversarial neural networks. International Conference on Artificial Neural Networks, 626–634.
159
Calo, R. (2018). Artificial intelligence policy: A primer and roadmap. University of Bologna Law
Review, 3(2), 180–218. 430
Cao, H., Tan, C., Gao, Z., Chen, G., Heng, P.-A., & Li, S. Z. (2022). A survey on generative diffusion
model. arXiv:2209.02646. 369
Cao, Z., Qin, T., Liu, T.-Y., Tsai, M.-F., & Li, H. (2007). Learning to rank: From pairwise approach to
listwise approach. International Conference on Machine Learning, 129–136. 73
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., & Zagoruyko, S. (2020). End-to-end
object detection with transformers. European Conference on Computer Vision, 213–229. 238
Carlini, N., Hayes, J., Nasr, M., Jagielski, M., Sehwag, V., Tramèr, F., Balle, B., Ippolito, D., &
Wallace, E. (2023). Extracting training data from diffusion models. arXiv:2301.13188. 428
Carlini, N., Ippolito, D., Jagielski, M., Lee, K., Tramèr, F., & Zhang, C. (2022). Quantifying
memorization across neural language models. arXiv:2202.07646. 428
Cauchy, A. (1847). Méthode générale pour la résolution des systèmes d'équations simultanées.
Comptes Rendus de l’Académie des Sciences, 25. 91
Cervantes, J.-A., López, S., Rodríguez, L.-F., Cervantes, S., Cervantes, F., & Ramos, F. (2019).
Artificial moral agents: A survey of the current status. Science and Engineering Ethics, 26, 501–
532. 424
Ceylan, G., Anderson, I. A., & Wood, W. (2023). Sharing of misinformation is habitual, not just lazy
or biased. Proceedings of the National Academy of Sciences of the United States of America,
120(4). 432
Chami, I., Abu-El-Haija, S., Perozzi, B., Ré, C., & Murphy, K. (2020). Machine learning on graphs:
A model and comprehensive taxonomy. arXiv:2005.03675. 261
Chang, B., Chen, M., Haber, E., & Chi, E. H. (2019a). AntisymmetricRNN: A dynamical system
view on recurrent neural networks. International Conference on Learning Representations. 323
Chang, B., Meng, L., Haber, E., Ruthotto, L., Begert, D., & Holtham, E. (2018). Reversible
architectures for arbitrarily deep residual neural networks. AAAI Conference on Artificial
Intelligence, 2811–2818. 323
Chang, Y.-L., Liu, Z. Y., Lee, K.-Y., & Hsu, W. (2019b). Free-form video inpainting with 3D gated
convolution and temporal Patch-GAN. IEEE/CVF International Conference on Computer Vision,
9066–9075. 181
Chaudhari, P., Choromanska, A., Soatto, S., Le-Cun, Y., Baldassi, C., Borgs, C., Chayes, J., Sagun,
L., & Zecchina, R. (2019). Entropy-SGD: Biasing gradient descent into wide valleys. Journal of
Statistical Mechanics: Theory and Experiment, 12, 124018. 158, 411
Chen, D., Mei, J.-P., Zhang, Y., Wang, C., Wang, Z., Feng, Y., & Chen, C. (2021a). Cross-layer
distillation with semantic calibration. AAAI Conference on Artificial Intelligence, 7028–7036.
416
Chen, H., Wang, Y., Guo, T., Xu, C., Deng, Y., Liu, Z., Ma, S., Xu, C., Xu, C., & Gao, W. (2021b).
Pre-trained image processing transformer. IEEE/CVF Computer Vision & Pattern Recognition,
12299–12310. 238
Chen, J., Ma, T., & Xiao, C. (2018a). FastGCN: Fast learning with graph convolutional networks via
importance sampling. International Conference on Learning Representations. 264, 265
Chen, J., Zhu, J., & Song, L. (2018b). Stochastic training of graph convolutional networks with
variance reduction. International Conference on Machine Learning, 941–949. 264
Chen, L., Lu, K., Rajeswaran, A., Lee, K., Grover, A., Laskin, M., Abbeel, P., Srinivas, A., &
Mordatch, I. (2021c). Decision transformer: Reinforcement learning via sequence modeling.
Neural Information Processing Systems, 34, 15084–15097. 398
Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K., & Yuille, A. L. (2018c). DeepLab: Semantic
image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs.
IEEE Transactions on Pattern Analysis & Machine Intelligence, 40(4), 834–848. 181
Chen, M., Radford, A., Child, R., Wu, J., Jun, H., Luan, D., & Sutskever, I. (2020a). Generative
pretraining from pixels. International Conference on Machine Learning, 1691–1703. 238
Chen, M., Wei, Z., Huang, Z., Ding, B., & Li, Y. (2020b). Simple and deep graph convolutional
networks. International Conference on Machine Learning, 1725–1735. 266
Chen, N., Zhang, Y., Zen, H., Weiss, R. J., Norouzi, M., Dehak, N., & Chan, W. (2021d). WaveGrad
2: Iterative refinement for text-to-speech synthesis. INTERSPEECH, 3765–3769. 369
Chen, R. T., Behrmann, J., Duvenaud, D. K., & Jacobsen, J.-H. (2019). Residual flows for invertible
generative modeling. Neural Information Processing Systems, 32, 9913–9923. 324
Chen, R. T., Li, X., Grosse, R. B., & Duvenaud, D. K. (2018d). Isolating sources of disentanglement
in variational autoencoders. Neural Information Processing Systems, 31, 2615–2625. 343, 346
Chen, R. T., Rubanova, Y., Bettencourt, J., & Duvenaud, D. K. (2018e). Neural ordinary differential
equations. Neural Information Processing Systems, 31, 6572–6583. 324
Chen, T., Fox, E., & Guestrin, C. (2014). Stochastic gradient Hamiltonian Monte Carlo. International
Conference on Machine Learning, 1683–1691. 159
Chen, T., Kornblith, S., Norouzi, M., & Hinton, G. (2020c). A simple framework for contrastive
learning of visual representations. International Conference on Machine Learning, 1597–1607.
159
Chen, T., Xu, B., Zhang, C., & Guestrin, C. (2016a). Training deep nets with sublinear memory cost.
arXiv:1604.06174. 114
Chen, W., Liu, T.-Y., Lan, Y., Ma, Z.-M., & Li, H. (2009). Ranking measures and loss functions in
learning to rank. Neural Information Processing Systems, 22, 315–323. 73
Chen, X., Duan, Y., Houthooft, R., Schulman, J., Sutskever, I., & Abbeel, P. (2016b). Info-GAN:
Interpretable representation learning by information maximizing generative adversarial nets.
Neural Information Processing Systems, 29, 2172–2180. 291, 301
Chen, X., Kingma, D. P., Salimans, T., Duan, Y., Dhariwal, P., Schulman, J., Sutskever, I., & Abbeel,
P. (2017). Variational lossy autoencoder. International Conference on Learning Representations.
345
Chen, Y.-C., Li, L., Yu, L., El Kholy, A., Ahmed, F., Gan, Z., Cheng, Y., & Liu, J. (2020d). UNITER:
Universal image-text representation learning. European Conference on Computer Vision, 104–
120. 238
Chiang, W.-L., Liu, X., Si, S., Li, Y., Bengio, S., & Hsieh, C.-J. (2019). Cluster-GCN: An efficient
algorithm for training deep and large graph convolutional networks. ACM SIGKDD International
Conference on Knowledge Discovery & Data Mining, 257–266. 263, 264, 265
Child, R., Gray, S., Radford, A., & Sutskever, I. (2019). Generating long sequences with sparse
transformers. arXiv:1904.10509. 237
Chintala, S., Denton, E., Arjovsky, M., & Mathieu, M. (2020). How to train a GAN? Tips and tricks to make GANs work. https://github.com/soumith/ganhacks. 299
Cho, K., van Merrienboer, B., Bahdanau, D., & Bengio, Y. (2014). On the properties of neural
machine translation: Encoder-decoder approaches. ACL Workshop on Syntax, Semantics and
Structure in Statistical Translation, 103–111. 233
Choi, D., Shallue, C. J., Nado, Z., Lee, J., Maddison, C. J., & Dahl, G. E. (2019). On empirical
comparisons of optimizers for deep learning. arXiv:1910.05446. 94, 410
Choi, J., Kim, S., Jeong, Y., Gwon, Y., & Yoon, S. (2021). ILVR: Conditioning method for denoising
diffusion probabilistic models. IEEE/CVF International Conference on Computer Vision, 14347–
14356. 370
Choi, J., Lee, J., Shin, C., Kim, S., Kim, H., & Yoon, S. (2022). Perception prioritized training of
diffusion models. IEEE/CVF Computer Vision & Pattern Recognition, 11472–11481. 369
Choi, Y., Choi, M., Kim, M., Ha, J.-W., Kim, S., & Choo, J. (2018). StarGAN: Unified generative
adversarial networks for multi-domain image-to-image translation. IEEE/CVF Computer Vision
& Pattern Recognition, 8789–8797. 301
Chollet, F. (2017). Xception: Deep learning with depthwise separable convolutions. IEEE/CVF
Computer Vision & Pattern Recognition, 1251–1258. 405
Choromanska, A., Henaff, M., Mathieu, M., Arous, G. B., & LeCun, Y. (2015). The loss surfaces of
multilayer networks. International Conference on Artificial Intelligence and Statistics. 405
Choromanski, K., Likhosherstov, V., Dohan, D., Song, X., Gane, A., Sarlos, T., Hawkins, P., Davis,
J., Mohiuddin, A., Kaiser, L., et al. (2020). Rethinking attention with Performers. International
Conference on Learning Representations. 236, 237
Chorowski, J., & Jaitly, N. (2017). Towards better decoding and language model integration in
sequence to sequence models. INTERSPEECH, 523–527. 158
Chouldechova, A. (2017). Fair prediction with disparate impact: A study of bias in recidivism
prediction instruments. Big data, 5(2), 153–163. 422
Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H.
W., Sutton, C., Gehrmann, S., et al. (2022). PaLM: Scaling language modeling with pathways.
arXiv:2204.02311. 234
Christian, B. (2020). The Alignment Problem: Machine Learning and Human Values. W. W. Norton.
421
Christiano, P., Shlegeris, B., & Amodei, D. (2018). Supervising strong learners by amplifying weak
experts. arXiv:1810.08575. 398
Chu, X., Tian, Z., Wang, Y., Zhang, B., Ren, H., Wei, X., Xia, H., & Shen, C. (2021). Twins:
Revisiting the design of spatial attention in vision transformers. Neural Information Processing
Systems, 34, 9355–9366. 238
Chung, H., Sim, B., & Ye, J. C. (2022). Come-closer-diffuse-faster: Accelerating conditional
diffusion models for inverse problems through stochastic contraction. IEEE/CVF Computer
Vision & Pattern Recognition, 12413–12422. 369
Chung, H., & Ye, J. C. (2022). Score-based diffusion models for accelerated MRI. Medical Image
Analysis, 80, 102479. 369
Chung, J., Gulcehre, C., Cho, K., & Bengio, Y. (2014). Empirical evaluation of gated recurrent neural
networks on sequence modeling. Deep Learning and Representation Workshop. 233
Chung, J., Kastner, K., Dinh, L., Goel, K., Courville, A. C., & Bengio, Y. (2015). A recurrent latent
variable model for sequential data. Neural Information Processing Systems, 28, 2980–2988. 344,
345
Çiçek, Ö., Abdulkadir, A., Lienkamp, S. S., Brox, T., & Ronneberger, O. (2016). 3D U-Net: Learning
dense volumetric segmentation from sparse annotation. International Conference on Medical
Image Computing and Computer-Assisted Intervention, 424–432. 205
Clark, M. (2022). The engineer who claimed a Google AI is sentient has been fired. The Verge, July
22, 2022. https://www.theverge.com/2022/7/22/23274958/google-ai-engineer-blake-lemoine-chatbot-lamda-2-sentience. 234
Clevert, D.-A., Unterthiner, T., & Hochreiter, S. (2015). Fast and accurate deep network learning by
exponential linear units (ELUs). arXiv:1511.07289. 38
Cohen, J. M., Kaur, S., Li, Y., Kolter, J. Z., & Talwalkar, A. (2021). Gradient descent on neural
networks typically occurs at the edge of stability. International Conference on Learning
Representations. 157
Cohen, N., Sharir, O., & Shashua, A. (2016). On the expressive power of deep learning: A tensor
analysis. PMLR Conference on Learning Theory, 698–728. 53
Cohen, T., & Welling, M. (2016). Group equivariant convolutional networks. International
Conference on Machine Learning, 2990–2999. 183
Collins, E., Bala, R., Price, B., & Susstrunk, S. (2020). Editing in style: Uncovering the local
semantics of GANs. IEEE/CVF Computer Vision & Pattern Recognition, 5771–5780. 300
Conneau, A., Schwenk, H., Barrault, L., & Lecun, Y. (2017). Very deep convolutional networks for
text classification. Meeting of the Association for Computational Linguistics, 1107–1116. 182
Costanza-Chock, S. (2020). Design Justice: Community-Led Practices to Build the Worlds We Need.
Cambridge, MA: The MIT Press. 433
Cordonnier, J.-B., Loukas, A., & Jaggi, M. (2020). On the relationship between self-attention and
convolutional layers. International Conference on Learning Representations. 236
Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., &
Schiele, B. (2016). The Cityscapes dataset for semantic urban scene understanding. IEEE/CVF
Computer Vision & Pattern Recognition, 1877–1901. 6, 153
Coulombe, C. (2018). Text data augmentation made simple by leveraging NLP cloud APIs.
arXiv:1812.04718. 160
Creel, K. A. (2020). Transparency in complex computational systems. Philosophy of Science, 87 (4),
568–589. 425, 435
Crenshaw, K. (1991). Mapping the margins: Intersectionality, identity politics, and violence against
women of color. Stanford Law Review, 43(6), 1241–1299. 423
Creswell, A., & Bharath, A. A. (2018). Inverting the generator of a generative adversarial network.
IEEE Transactions on Neural Networks and Learning Systems, 30(7), 1967–1974. 301
Creswell, A., White, T., Dumoulin, V., Arulkumaran, K., Sengupta, B., & Bharath, A. A. (2018).
Generative adversarial networks: An overview. IEEE Signal Processing Magazine, 35(1), 53–65.
298
Cristianini, N., & Shawe-Taylor, J. (2000). An introduction to support vector machines. Cambridge University Press. 74
Croitoru, F.-A., Hondru, V., Ionescu, R. T., & Shah, M. (2022). Diffusion models in vision: A survey.
arXiv:2209.04747. 369
Cubuk, E. D., Zoph, B., Mané, D., Vasudevan, V., & Le, Q. V. (2019). Autoaugment: Learning
augmentation strategies from data. IEEE/CVF Computer Vision & Pattern Recognition, 113–123.
405
Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of
Control, Signals and Systems, 2(4), 303–314. 38
Dabney, W., Rowland, M., Bellemare, M., & Munos, R. (2018). Distributional reinforcement learning
with quantile regression. AAAI Conference on Artificial Intelligence. 397
Dai, H., Dai, B., & Song, L. (2016). Discriminative embeddings of latent variable models for
structured data. International Conference on Machine Learning, 2702–2711. 262
Dai, J., Qi, H., Xiong, Y., Li, Y., Zhang, G., Hu, H., & Wei, Y. (2017). Deformable convolutional
networks. IEEE/CVF International Conference on Computer Vision, 764–773. 183
Daigavane, A., Balaraman, R., & Aggarwal, G. (2021). Understanding convolutions on graphs.
Distill, https://distill.pub/2021/understanding-gnns/. 261
Danaher, J. (2019). Automation and Utopia: Human Flourishing in a World without Work. Harvard
University Press. 430
Daniluk, M., Rocktäschel, T., Welbl, J., & Riedel, S. (2017). Frustratingly short attention spans in
neural language modeling. International Conference on Learning Representations. 235
Danks, D., & London, A. J. (2017). Algorithmic bias in autonomous systems. International Joint
Conference on Artificial Intelligence, 4691–4697. 422
Dao, D. (2021). Awful AI. Github. Retrieved January 17, 2023. https://github.com/daviddao/awful-ai.
14
Dar, Y., Muthukumar, V., & Baraniuk, R. G. (2021). A farewell to the bias-variance trade-off? An
overview of the theory of overparameterized machine learning. arXiv:2109.02355. 135
Das, H. P., Abbeel, P., & Spanos, C. J. (2019). Likelihood contribution based multi-scale architecture
for generative flows. arXiv:1908.01686. 323
Dauphin, Y. N., Pascanu, R., Gülçehre, Ç., Cho, K., Ganguli, S., & Bengio, Y. (2014). Identifying and
attacking the saddle point problem in high-dimensional non-convex optimization. Neural
Information Processing Systems, vol. 27, 2933–2941. 409, 410
David, H. (2015). Why are there still so many jobs? The history and future of workplace automation.
Journal of Economic Perspectives, 29(3), 3–30. 14
De, S., & Smith, S. (2020). Batch normalization biases residual blocks towards the identity function
in deep networks. Neural Information Processing Systems, 33, 19964–19975. 205
De Cao, N., & Kipf, T. (2018). MolGAN: An implicit generative model for small molecular graphs.
ICML Workshop on Theoretical Foundations and Applications of Deep Generative Models. 299
Dechter, R. (1986). Learning while searching in constraint-satisfaction-problems. AAAI Conference
on Artificial Intelligence, 178–183. 52
Defferrard, M., Bresson, X., & Vandergheynst, P. (2016). Convolutional neural networks on graphs
with fast localized spectral filtering. Neural Information Processing Systems, 29, 3837–3845. 262
Dehghani, M., Tay, Y., Gritsenko, A. A., Zhao, Z., Houlsby, N., Diaz, F., Metzler, D., & Vinyals, O.
(2021). The benchmark lottery. arXiv:2107.07002. 234
Deisenroth, M. P., Faisal, A. A., & Ong, C. S. (2020). Mathematics for machine learning. Cambridge
University Press. 15
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via
the EM algorithm. Journal of the Royal Statistical Society: Series B, 39(1), 1–22. 346
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). ImageNet: A large-scale
hierarchical image database. IEEE Computer Vision & Pattern Recognition, 248–255. 181, 272
Denton, E. L., Chintala, S., Fergus, R., et al. (2015). Deep generative image models using a
Laplacian pyramid of adversarial networks. Neural Information Processing Systems, 28, 1486–
1494. 300, 301
Devlin, J., Chang, M., Lee, K., & Toutanova, K. (2019). BERT: pre-training of deep bidirectional
transformers for language understanding. ACL Human Language Technologies, 4171–4186. 159,
234
DeVries, T., & Taylor, G. W. (2017a). Dataset augmentation in feature space. arXiv:1702.05538. 158
DeVries, T., & Taylor, G. W. (2017b). Improved regularization of convolutional neural networks with
Cutout. arXiv:1708.04552. 183
Dhariwal, P., & Nichol, A. (2021). Diffusion models beat GANs on image synthesis. Neural
Information Processing Systems, 34, 8780–8794. 367, 368, 370
Ding, M., Xiao, B., Codella, N., Luo, P., Wang, J., & Yuan, L. (2022). DaViT: Dual attention vision
transformers. European Conference on Computer Vision, 74–92. 238
Dinh, L., Krueger, D., & Bengio, Y. (2015). NICE: Non-linear independent components estimation.
International Conference on Learning Representations Workshop. 323
Dinh, L., Pascanu, R., Bengio, S., & Bengio, Y. (2017). Sharp minima can generalize for deep nets.
International Conference on Machine Learning, 1019–1028. 411
Dinh, L., Sohl-Dickstein, J., & Bengio, S. (2016). Density estimation using Real NVP. International
Conference on Learning Representations. 322, 323
Dinh, L., Sohl-Dickstein, J., Larochelle, H., & Pascanu, R. (2019). A RAD approach to deep mixture
models. ICLR Workshop on Deep Generative Models for Highly Structured Data. 323
Dockhorn, T., Vahdat, A., & Kreis, K. (2022). Score-based generative modeling with critically-
damped Langevin diffusion. International Conference on Learning Representations. 370
Doersch, C., Gupta, A., & Efros, A. A. (2015). Unsupervised visual representation learning by
context prediction. IEEE International Conference on Computer Vision, 1422–1430. 159
Domingos, P. (2000). A unified bias-variance decomposition. International Conference on Machine
Learning, 231–238. 133
Domke, J. (2010). Statistical machine learning. https://people.cs.umass.edu/~domke/. 116
Donahue, C., Lipton, Z. C., Balsubramani, A., & McAuley, J. (2018a). Semantically decomposing the
latent spaces of generative adversarial networks. International Conference on Learning
Representations. 301
Donahue, C., McAuley, J., & Puckette, M. (2018b). Adversarial audio synthesis. International
Conference on Learning Representations. 299, 301
Dong, X., Bao, J., Chen, D., Zhang, W., Yu, N., Yuan, L., Chen, D., & Guo, B. (2022). CSWin
transformer: A general vision transformer backbone with cross-shaped windows. IEEE/CVF
Computer Vision & Pattern Recognition, 12124–12134. 238
Dorta, G., Vicente, S., Agapito, L., Campbell, N. D., & Simpson, I. (2018). Structured uncertainty
prediction networks. IEEE/CVF Computer Vision & Pattern Recognition, 5477–5485. 73, 340,
344
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M.,
Minderer, M., Heigold, G., Gelly, S., et al. (2021). An image is worth 16x16 words: Transformers
for image recognition at scale. International Conference on Learning Representations. 234, 238
Dozat, T. (2016). Incorporating Nesterov momentum into Adam. International Conference on
Learning Representations — Workshop track. 94
Draxler, F., Veschgini, K., Salmhofer, M., & Hamprecht, F. A. (2018). Essentially no barriers in
neural network energy landscape. International Conference on Machine Learning, 1308–1317.
408, 409
Du, N., Huang, Y., Dai, A. M., Tong, S., Lepikhin, D., Xu, Y., Krikun, M., Zhou, Y., Yu, A. W., Firat,
O., et al. (2022). GLaM: Efficient scaling of language models with mixture-of-experts.
International Conference on Machine Learning, 5547–5569. 234
Du, S. S., Lee, J. D., Li, H., Wang, L., & Zhai, X. (2019a). Gradient descent finds global minima of
deep neural networks. International Conference on Machine Learning, 1675–1685. 404, 405
Du, S. S., Zhai, X., Poczos, B., & Singh, A. (2019b). Gradient descent provably optimizes over-
parameterized neural networks. International Conference on Learning Representations. 404
Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive subgradient methods for online learning and
stochastic optimization. Journal of Machine Learning Research, 12, 2121–2159. 93
Dufter, P., Schmitt, M., & Schütze, H. (2021). Position information in transformers: An overview.
Computational Linguistics, 1–31. 236
Dumoulin, V., Belghazi, I., Poole, B., Mastropietro, O., Lamb, A., Arjovsky, M., & Courville, A.
(2017). Adversarially learned inference. International Conference on Learning Representations.
301, 345
Dumoulin, V., & Visin, F. (2016). A guide to convolution arithmetic for deep learning.
arXiv:1603.07285. 180
Dupont, E., Doucet, A., & Teh, Y. W. (2019). Augmented neural ODEs. Neural Information
Processing Systems, 32, 3134–3144. 324
Durkan, C., Bekasov, A., Murray, I., & Papamakarios, G. (2019a). Cubic-spline flows. ICML
Invertible Neural Networks and Normalizing Flows. 323
Durkan, C., Bekasov, A., Murray, I., & Papamakarios, G. (2019b). Neural spline flows. Neural
Information Processing Systems, 32, 7509–7520. 323
Duvenaud, D. K., Maclaurin, D., Iparraguirre, J., Bombarell, R., Hirzel, T., Aspuru-Guzik, A., &
Adams, R. P. (2015). Convolutional networks on graphs for learning molecular fingerprints.
Neural Information Processing Systems, 28, 2224–2232. 262
D’Amour, A., Heller, K., Moldovan, D., Adlam, B., Alipanahi, B., Beutel, A., Chen, C., Deaton, J.,
Eisenstein, J., Hoffman, M. D., et al. (2020). Underspecification presents challenges for
credibility in modern machine learning. Journal of Machine Learning Research, 1–61. 413
Ebrahimi, J., Rao, A., Lowd, D., & Dou, D. (2018). HotFlip: White-box adversarial examples for text
classification. Meeting of the Association for Computational Linguistics, 31–36. 160
El Asri, L., & Prince, S. J. D. (2020). Tutorial #6: Neural natural language generation – decoding algorithms. https://www.borealisai.com/research-blogs/tutorial-6-neural-natural-language-generation-decoding-algorithms/. 235
Eldan, R., & Shamir, O. (2016). The power of depth for feedforward neural networks. PMLR
Conference on Learning Theory, 907–940. 53, 417
Elfwing, S., Uchibe, E., & Doya, K. (2018). Sigmoid-weighted linear units for neural network
function approximation in reinforcement learning. Neural Networks, 107, 3–11. 38
Erasmus, A., Brunet, T. D. P., & Fisher, E. (2021). What is interpretability? Philosophy &
Technology, 34, 833–862. 425
Eren, L., Ince, T., & Kiranyaz, S. (2019). A generic intelligent bearing fault diagnosis system using
compact adaptive 1D CNN classifier. Journal of Signal Processing Systems, 91(2), 179–189. 182
Erhan, D., Bengio, Y., Courville, A., & Vincent, P. (2009). Visualizing higher-layer features of a deep
network. Technical Report, University of Montreal, 1341(3). 184
Errica, F., Podda, M., Bacciu, D., & Micheli, A. (2019). A fair comparison of graph neural networks
for graph classification. International Conference on Learning Representations. 262
Eslami, S., Heess, N., Weber, T., Tassa, Y., Szepesvari, D., Hinton, G. E., et al. (2016). Attend, infer,
repeat: Fast scene understanding with generative models. Neural Information Processing
Systems, 29, 3225–3233. 344
Eslami, S. A., Jimenez Rezende, D., Besse, F., Viola, F., Morcos, A. S., Garnelo, M., Ruderman, A.,
Rusu, A. A., Danihelka, I., Gregor, K., et al. (2018). Neural scene representation and rendering.
Science, 360(6394), 1204–1210. 344
Esling, P., Masuda, N., Bardet, A., Despres, R., et al. (2019). Universal audio synthesizer control with
normalizing flows. International Conference on Digital Audio Effects. 322
Esser, P., Rombach, R., & Ommer, B. (2021). Taming transformers for high-resolution image
synthesis. IEEE/CVF Computer Vision & Pattern Recognition, 12873–12883. 301
Esteves, C., Allen-Blanchette, C., Zhou, X., & Daniilidis, K. (2018). Polar transformer networks.
International Conference on Learning Representations. 183
Etmann, C., Ke, R., & Schönlieb, C.-B. (2020). iunets: Fully invertible U-Nets with learnable up-and
downsampling. IEEE International Workshop on Machine Learning for Signal Processing. 322
Eubanks, V. (2018). Automating Inequality: How High-Tech Tools Profile, Police, and Punish the
Poor. New York: St. Martin's Press. 433
Evans, K., de Moura, N., Chauvier, S., Chatila, R., & Dogan, E. (2020). Ethical decision making in
autonomous vehicles: the AV ethics project. Science and Engineering Ethics, 26(6), 3285–3312.
424
FAIR (2022). Human-level play in the game of Diplomacy by combining language models with
strategic reasoning. Science, 378(6624), 1067–1074. 396
Falbo, A., & LaCroix, T. (2022). Est-ce que vous compute? Code-switching, cultural identity, and AI.
Feminist Philosophy Quarterly, 8(3/4). 423
Falk, T., Mai, D., Bensch, R., Çiçek, Ö., Abdulkadir, A., Marrakchi, Y., Böhm, A., Deubner, J.,
Jäckel, Z., Seiwald, K., et al. (2019). U-Net: Deep learning for cell counting, detection, and
morphometry. Nature Methods, 16(1), 67–70. 199
Falkner, S., Klein, A., & Hutter, F. (2018). BOHB: Robust and efficient hyperparameter optimization
at scale. International Conference on Machine Learning, 1437–1446. 136
Fallah, N., Gu, H., Mohammad, K., Seyyedsalehi, S. A., Nourijelyani, K., & Eshraghian, M. R.
(2009). Nonlinear Poisson regression using neural networks: A simulation study. Neural
Computing and Applications, 18(8), 939–943. 74
Fan, A., Lewis, M., & Dauphin, Y. N. (2018). Hierarchical neural story generation. Meeting of the
Association for Computational Linguistics, 889–898. 235
Fan, H., Xiong, B., Mangalam, K., Li, Y., Yan, Z., Malik, J., & Feichtenhofer, C. (2021). Multiscale
vision transformers. IEEE/CVF International Conference on Computer Vision, 6824–6835. 238
Fan, K., Li, B., Wang, J., Zhang, S., Chen, B., Ge, N., & Yan, Z. (2020). Neural zero-inflated quality
estimation model for automatic speech recognition system. Interspeech, 606–610. 73
Fang, F., Yamagishi, J., Echizen, I., & Lorenzo-Trueba, J. (2018). High-quality nonparallel voice
conversion based on cycle-consistent adversarial network. International Conference on
Acoustics, Speech and Signal Processing, 5279–5283. 299
Fang, Y., Liao, B., Wang, X., Fang, J., Qi, J., Wu, R., Niu, J., & Liu, W. (2021). You only look at one
sequence: Rethinking transformer in vision through object detection. Neural Information
Processing Systems, 34, 26183–26197. 238
Farnia, F., & Ozdaglar, A. (2020). Do GANs always have Nash equilibria? International Conference
on Machine Learning, 3029–3039. 299
Fawzi, A., Balog, M., Huang, A., Hubert, T., Romera-Paredes, B., Barekatain, M., Novikov, A., R. Ruiz, F. J., Schrittwieser, J., Swirszcz, G., et al. (2022). Discovering faster matrix multiplication
algorithms with reinforcement learning. Nature, 610(7930), 47–53. 396
Fazelpour, S., & Danks, D. (2021). Algorithmic bias: Senses, sources, solutions. Philosophy
Compass, 16. 421, 422, 435
Fedus, W., Goodfellow, I., & Dai, A. M. (2018). MaskGAN: Better text generation via filling in the_.
International Conference on Learning Representations. 299
Feng, S. Y., Gangal, V., Kang, D., Mitamura, T., & Hovy, E. (2020). GenAug: Data augmentation for
finetuning text generators. ACL Deep Learning Inside Out, 29–42. 160
Feng, Z., Zhang, Z., Yu, X., Fang, Y., Li, L., Chen, X., Lu, Y., Liu, J., Yin, W., Feng, S., et al. (2022).
ERNIEViLG 2.0: Improving text-to-image diffusion model with knowledge-enhanced mixture-
of-denoising-experts. arXiv:2210.15257. 371
Fernandez, C. (2017). Can a computer tell if you're gay? Artificial intelligence system guesses your
sexuality with 91% accuracy just by looking at a photo of your face. Daily Mail, 7 Sept, 2017.
https://www.dailymail.co.uk/sciencetech/article-4862676/Artificial-intelligence-tell-gay.html.
430
Fernández-Madrigal, J.-A., & González, J. (2002). Multihierarchical graph search. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 24(1), 103–113. 242
Fetscherin, M., Tantleff-Dunn, S., & Klumb, A. (2020). Effects of facial features and styling elements
on perceptions of competence, warmth, and hireability of male professionals. The Journal of
Social Psychology, 160(3), 332–345. 427
Finlay, C., Jacobsen, J., Nurbekyan, L., & Oberman, A. M. (2020). How to train your neural ODE:
The world of Jacobian and kinetic regularization. International Conference on Machine
Learning, 3154–3164. 324
Fort, S., Hu, H., & Lakshminarayanan, B. (2019). Deep ensembles: A loss landscape perspective.
arXiv:1912.02757. 158
Fort, S., & Jastrzębski, S. (2019). Large scale structure of neural network loss landscapes. Neural
Information Processing Systems, vol. 32, 6706–6714. 408
Fort, S., & Scherlis, A. (2019). The Goldilocks zone: Towards better understanding of neural network
loss landscapes. AAAI Conference on Artificial Intelligence, 3574–3581. 409, 410, 412, 413
Fortunato, M., Azar, M. G., Piot, B., Menick, J., Osband, I., Graves, A., Mnih, V., Munos, R.,
Hassabis, D., Pietquin, O., et al. (2018). Noisy networks for exploration. International
Conference on Learning Representations. 397
François-Lavet, V., Henderson, P., Islam, R., Bellemare, M. G., Pineau, J., et al. (2018). An
introduction to deep reinforcement learning. Foundations and Trends in Machine Learning, 11(3-
4), 219–354. 396
Frankle, J., & Carbin, M. (2019). The lottery ticket hypothesis: Finding sparse, trainable neural
networks. International Conference on Learning Representations. 406, 415
Frankle, J., Dziugaite, G. K., Roy, D. M., & Carbin, M. (2020). Linear mode connectivity and the
lottery ticket hypothesis. International Conference on Machine Learning, 3259–3269. 158, 408
Frankle, J., Schwab, D. J., & Morcos, A. S. (2021). Training BatchNorm and only BatchNorm: On
the expressive power of random features in CNNs. International Conference on Learning
Representations. 418
Freund, Y., & Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an
application to boosting. Journal of Computer and System Sciences, 55(1), 119–139. 74
Frey, C. B. (2019). The Technology Trap: Capital, Labour, and Power in the Age of Automation.
Princeton University Press. 430
Frey, C. B., & Osborne, M. A. (2017). The future of employment: How susceptible are jobs to
computerisation? Technological forecasting and social change, 114, 254–280. 430
Friedman, J. H. (1997). On bias, variance, 0/1— loss, and the curse-of-dimensionality. Data Mining
and Knowledge Discovery, 1(1), 55–77. 133
Fujimoto, S., Hoof, H., & Meger, D. (2018). Addressing function approximation error in actorcritic
methods. International Conference on Machine Learning, 1587–1596. 397
Fujimoto, S., Meger, D., & Precup, D. (2019). Off-policy deep reinforcement learning without
exploration. International Conference on Machine Learning, 2052–2062. 398
Fukushima, K. (1969). Visual feature extraction by a multilayered network of analog threshold
elements. IEEE Transactions on Systems Science and Cybernetics, 5(4), 322–333. 37
Fukushima, K., & Miyake, S. (1982). Neocognitron: A self-organizing neural network model for a
mechanism of visual pattern recognition. Competition and Cooperation in Neural Nets, 267–285.
180
Gabriel, I. (2020). Artificial intelligence, values, and alignment. Minds and Machines, 30, 411–437.
421
Gal, Y., & Ghahramani, Z. (2016). Dropout as a Bayesian approximation: Representing model
uncertainty in deep learning. International Conference on Machine Learning, 1050–1059. 158
Gales, M. J. (1998). Maximum likelihood linear transformations for HMM-based speech recognition.
Computer Speech & Language, 12(2), 75–98. 160
Gales, M. J., Ragni, A., AlDamarki, H., & Gautier, C. (2009). Support vector machines for noise
robust ASR. 2009 IEEE Workshop on Automatic Speech Recognition & Understanding, 205–210.
160
Ganaie, M., Hu, M., Malik, A., Tanveer, M., & Suganthan, P. (2022). Ensemble deep learning: A
review. Engineering Applications of Artificial Intelligence, 115. 158
Gao, H., & Ji, S. (2019). Graph U-Nets. International Conference on Machine Learning, 2083–2092.
265
Gao, R., Song, Y., Poole, B., Wu, Y. N., & Kingma, D. P. (2021). Learning energy-based models by
diffusion recovery likelihood. International Conference on Learning Representations. 370
Garg, R., Bg, V. K., Carneiro, G., & Reid, I. (2016). Unsupervised CNN for single view depth
estimation: Geometry to the rescue. European Conference on Computer Vision, 740–756. 205
Garipov, T., Izmailov, P., Podoprikhin, D., Vetrov, D., & Wilson, A. G. (2018). Loss surfaces, mode
connectivity, and fast ensembling of DNNs. Neural Information Processing Systems, vol. 31,
8803–8812. 158, 408
Gastaldi, X. (2017a). Shake-shake regularization. arXiv:1705.07485. 203
Gastaldi, X. (2017b). Shake-shake regularization of 3-branch residual networks. 203
Gebru, T., Bender, E. M., McMillan-Major, A., & Mitchell, M. (2023). Statement from the listed
authors of stochastic parrots on the “AI pause” letter. https://www.dair-institute.org/blog/letter-statement-March2023. 435
Gemici, M. C., Rezende, D., & Mohamed, S. (2016). Normalizing flows on Riemannian manifolds.
NIPS Workshop on Bayesian Deep Learning. 324
Germain, M., Gregor, K., Murray, I., & Larochelle, H. (2015). MADE: Masked autoencoder for
distribution estimation. International Conference on Machine Learning, 881–889. 323
Ghosh, A., Kulharia, V., Namboodiri, V. P., Torr, P. H., & Dokania, P. K. (2018). Multi-agent diverse
generative adversarial networks. IEEE/CVF Computer Vision & Pattern Recognition, 8513–
8521. 300
Gidaris, S., Singh, P., & Komodakis, N. (2018). Unsupervised representation learning by predicting
image rotations. International Conference on Learning Representations. 159
Gilmer, J., Schoenholz, S. S., Riley, P. F., Vinyals, O., & Dahl, G. E. (2017). Neural message passing
for quantum chemistry. International Conference on Machine Learning, 1263–1272. 262
Girdhar, R., Carreira, J., Doersch, C., & Zisserman, A. (2019). Video action transformer network.
IEEE/CVF Computer Vision & Pattern Recognition, 244–253. 238
Girshick, R. (2015). Fast R-CNN. IEEE International Conference on Computer Vision, 1440–1448.
183
Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2014). Rich feature hierarchies for accurate object
detection and semantic segmentation. IEEE Computer Vision & Pattern Recognition, 580–587.
183
Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural
networks. International Conference on Artificial Intelligence and Statistics, 9, 249–256. 113, 183
Glorot, X., Bordes, A., & Bengio, Y. (2011). Deep sparse rectifier neural networks. International
Conference on Artificial Intelligence and Statistics, 315–323. 37, 38
Goh, G. (2017). Why momentum really works. Distill, http://distill.pub/2017/momentum. 92
Goldberg, D. E. (1987). Simple genetic algorithms and the minimal deceptive problem. Genetic
Algorithms and Simulated Annealing, 74–88. Morgan Kaufmann. 421
Gomez, A. N., Ren, M., Urtasun, R., & Grosse, R. B. (2017). The reversible residual network:
Backpropagation without storing activations. Neural Information Processing Systems, 30, 2214–
2224. 114, 322, 323
Gómez-Bombarelli, R., Wei, J. N., Duvenaud, D., Hernández-Lobato, J. M., Sánchez-Lengeling, B.,
Sheberla, D., Aguilera-Iparraguirre, J., Hirzel, T. D., Adams, R. P., & Aspuru-Guzik, A. (2018).
Automatic chemical design using a data-driven continuous representation of molecules. ACS
Central Science, 4(2), 268–276. 343, 344
Gong, S., Bahri, M., Bronstein, M. M., & Zafeiriou, S. (2020). Geometrically principled connections
in graph neural networks. IEEE/CVF Computer Vision & Pattern Recognition, 11415–11424. 266
Goodfellow, I. (2016). Generative adversarial networks. NIPS 2016 Tutorial. 298
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. MIT Press. 15, 157
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., &
Bengio, Y. (2014). Generative adversarial networks. Communications of the ACM, 63(11), 139–
144. 273, 298, 300
Goodfellow, I. J., Shlens, J., & Szegedy, C. (2015a). Explaining and harnessing adversarial examples.
International Conference on Learning Representations. 159, 413
Goodfellow, I. J., Vinyals, O., & Saxe, A. M. (2015b). Qualitatively characterizing neural network
optimization problems. International Conference on Learning Representations. 407, 408
Goodin, D. (2023). ChatGPT is enabling script kiddies to write functional malware. Ars Technica, January 2023. https://arstechnica.com/information-technology/2023/01/chatgpt-is-enabling-script-kiddies-to-write-functional-malware/. 428
Gordon, G. J. (1995). Stable fitted reinforcement learning. Neural Information Processing Systems, 8,
1052–1058. 396
Gori, M., Monfardini, G., & Scarselli, F. (2005). A new model for learning in graph domains. IEEE
International Joint Conference on Neural Networks, 2005, 729–734. 262
Gouk, H., Frank, E., Pfahringer, B., & Cree, M. J. (2021). Regularisation of neural networks by
enforcing Lipschitz continuity. Machine Learning, 110(2), 393–416. 156
Goyal, A., Bochkovskiy, A., Deng, J., & Koltun, V. (2021). Non-deep networks. arXiv:2110.07641.
417
Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., &
He, K. (2018). Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv:1706.02677.
92, 93, 237, 410
Graesser, L., & Keng, W. L. (2019). Foundations of deep reinforcement learning. Addison-Wesley
Professional. 16, 396
Grathwohl, W., Chen, R. T., Bettencourt, J., Sutskever, I., & Duvenaud, D. (2019). FFJORD: Free-form
continuous dynamics for scalable reversible generative models. International Conference on
Learning Representations. 324
Grattarola, D., Zambon, D., Bianchi, F. M., & Alippi, C. (2022). Understanding pooling in graph
neural networks. IEEE Transactions on Neural Networks and Learning Systems. 265
Green, B. (2019). “Good” isn't good enough. NeurIPS Workshop on AI for Social Good. 433
Green, B. (2022). Escaping the impossibility of fairness: From formal to substantive algorithmic
fairness. Philosophy & Technology, 35(90). 422
Greensmith, E., Bartlett, P. L., & Baxter, J. (2004). Variance reduction techniques for gradient
estimates in reinforcement learning. Journal of Machine Learning Research, 5(9), 1471–1530.
397
Gregor, K., Besse, F., Jimenez Rezende, D., Danihelka, I., & Wierstra, D. (2016). Towards
conceptual compression. Neural Information Processing Systems, 29, 3549–3557. 343, 344
Gregor, K., Papamakarios, G., Besse, F., Buesing, L., & Weber, T. (2019). Temporal difference
variational auto-encoder. International Conference on Learning Representations. 344
Grennan, L., Kremer, A., Singla, A., & Zipparo, P. (2022). Why businesses need explainable AI—and
how to deliver it. McKinsey, September 29, 2022.
https://www.mckinsey.com/capabilities/quantumblack/our-insights/why-businesses-need-explainable-ai-and-how-to-deliver-it/. 13
Greydanus, S. (2020). Scaling down deep learning. arXiv:2011.14439. 119
Griewank, A., & Walther, A. (2008). Evaluating derivatives: Principles and techniques of
algorithmic differentiation. SIAM. 113
Gu, J., Kwon, H., Wang, D., Ye, W., Li, M., Chen, Y.-H., Lai, L., Chandra, V., & Pan, D. Z. (2022).
Multi-scale high-resolution vision transformer for semantic segmentation. IEEE/CVF Computer
Vision & Pattern Recognition, 12094–12103. 238
Guan, S., Tai, Y., Ni, B., Zhu, F., Huang, F., & Yang, X. (2020). Collaborative learning for faster
StyleGAN embedding. arXiv:2007.01758. 301
Gui, J., Sun, Z., Wen, Y., Tao, D., & Ye, J. (2021). A review on generative adversarial networks:
Algorithms, theory, and applications. IEEE Transactions on Knowledge and Data Engineering.
299
Guimaraes, G. L., Sanchez-Lengeling, B., Outeiral, C., Farias, P. L. C., & Aspuru-Guzik, A. (2017).
Objective-reinforced generative adversarial networks (ORGAN) for sequence generation models.
arXiv:1705.10843. 299
Gulrajani, I., Kumar, K., Ahmed, F., Taiga, A. A., Visin, F., Vazquez, D., & Courville, A. (2016).
PixelVAE: A latent variable model for natural images. International Conference on Learning
Representations. 299, 343, 344, 345
Ha, D., Dai, A., & Le, Q. V. (2017). Hypernetworks. International Conference on Learning
Representations. 235
Haarnoja, T., Hartikainen, K., Abbeel, P., & Levine, S. (2018a). Latent space policies for hierarchical
reinforcement learning. International Conference on Machine Learning, 1851–1860. 322
Haarnoja, T., Zhou, A., Abbeel, P., & Levine, S. (2018b). Soft actor-critic: Off-policy maximum
entropy deep reinforcement learning with a stochastic actor. International Conference on
Machine Learning, 1861–1870. 398
Hagendorff, T. (2020). The ethics of AI ethics: An evaluation of guidelines. Minds and Machines,
30(1), 99–120. 420
Hamilton, W., Ying, Z., & Leskovec, J. (2017a). Inductive representation learning on large graphs.
Neural Information Processing Systems, 30, 1024–1034. 262, 263, 264, 265, 267
Hamilton, W. L. (2020). Graph representation learning. Synthesis Lectures on Artificial Intelligence
and Machine Learning, 14(3), 1–159. 15, 261
Hamilton, W. L., Ying, R., & Leskovec, J. (2017b). Representation learning on graphs: Methods and
applications. IEEE Data Engineering Bulletin, 40(3), 52–74. 263
Han, S., Mao, H., & Dally, W. J. (2016). Deep compression: Compressing deep neural networks with
pruning, trained quantization and Huffman coding. International Conference on Learning
Representations. 414, 415
Han, S., Pool, J., Tran, J., & Dally, W. (2015). Learning both weights and connections for efficient
neural network. Neural Information Processing Systems, vol. 28, 1135–1143. 414
Hannun, A. Y., Case, C., Casper, J., Catanzaro, B., Diamos, G., Elsen, E., Prenger, R., Satheesh, S.,
Sengupta, S., Coates, A., & Ng, A. Y. (2014). Deep speech: Scaling up end-to-end speech
recognition. arXiv:1412.5567. 160
Hanson, S. J., & Pratt, L. Y. (1988). Comparing biases for minimal network construction with back-
propagation. Neural Information Processing Systems, vol. 2, 177–185. 155
Harding, S. (1986). The Science Question in Feminism. Cornell University Press. 433
Härkönen, E., Hertzmann, A., Lehtinen, J., & Paris, S. (2020). GANSpace: Discovering interpretable
GAN controls. Neural Information Processing Systems, 33, 9841–9850. 300
Hartmann, K. G., Schirrmeister, R. T., & Ball, T. (2018). EEG-GAN: Generative adversarial
networks for electroencephalographic (EEG) brain signals. arXiv:1806.01875. 299
Harvey, W., Naderiparizi, S., Masrani, V., Weilbach, C., & Wood, F. (2022). Flexible diffusion
modeling of long videos. Neural Information Processing Systems, 35. 369
Hasanzadeh, A., Hajiramezanali, E., Boluki, S., Zhou, M., Duffield, N., Narayanan, K., & Qian, X.
(2020). Bayesian graph neural networks with adaptive connection sampling. International
Conference on Machine Learning, 4094–4104. 265
Hassibi, B., & Stork, D. G. (1993). Second order derivatives for network pruning: Optimal brain
surgeon. Neural Information Processing Systems, vol. 6, 164–171. 414
Hausknecht, M., & Stone, P. (2015). Deep recurrent Q-learning for partially observable MDPs. AAAI
Fall Symposia, 29–37. 397
Hayou, S., Clerico, E., He, B., Deligiannidis, G., Doucet, A., & Rousseau, J. (2021). Stable ResNet.
International Conference on Artificial Intelligence and Statistics, 1324–1332. 205
He, F., Liu, T., & Tao, D. (2019). Control batch size and learning rate to generalize well: Theoretical
and empirical evidence. Neural Information Processing Systems, 32, 1143–1152. 92, 410, 411
He, J., Neubig, G., & Berg-Kirkpatrick, T. (2018). Unsupervised learning of syntactic structure with
invertible neural projections. ACL Empirical Methods in Natural Language Processing, 1292–
1302. 322
He, K., Zhang, X., Ren, S., & Sun, J. (2015). Delving deep into rectifiers: Surpassing human-level
performance on ImageNet classification. IEEE International Conference on Computer Vision,
1026–1034. 38, 113, 183
He, K., Zhang, X., Ren, S., & Sun, J. (2016a). Deep residual learning for image recognition.
IEEE/CVF Computer Vision & Pattern Recognition, 770–778. 188, 201, 323, 405
He, K., Zhang, X., Ren, S., & Sun, J. (2016b). Identity mappings in deep residual networks.
European Conference on Computer Vision, 630–645. 202, 405
He, P., Liu, X., Gao, J., & Chen, W. (2021). DeBERTa: Decoding-enhanced BERT with disentangled
attention. International Conference on Learning Representations. 236
He, X., Haffari, G., & Norouzi, M. (2020). Dynamic programming encoding for subword
segmentation in neural machine translation. Meeting of the Association for Computational
Linguistics, 3042–3051. 234
He, Y., Zhang, X., & Sun, J. (2017). Channel pruning for accelerating very deep neural networks.
IEEE/CVF International Conference on Computer Vision, 1389–1397. 414
Heess, N., Wayne, G., Silver, D., Lillicrap, T., Erez, T., & Tassa, Y. (2015). Learning continuous
control policies by stochastic value gradients. Neural Information Processing Systems, 28, 2944–
2952. 344
Heikkilä, M. (2022). Why business is booming for military AI startups. MIT Technology Review, July 7, 2022. https://www.technologyreview.com/2022/07/07/1055526/why-business-is-booming-for-military-ai-startups/. 13, 427
Henaff, M., Bruna, J., & LeCun, Y. (2015). Deep convolutional networks on graph-structured data.
arXiv:1506.05163. 262
Henderson, P., Li, X., Jurafsky, D., Hashimoto, T., Lemley, M. A., & Liang, P. (2023). Foundation
models and fair use. arXiv:2303.15715. 428
Hendrycks, D., & Gimpel, K. (2016). Gaussian error linear units (GELUs). arXiv:1606.08415. 38
Herrmann, V. (2017). Wasserstein GAN and the Kantorovich-Rubinstein duality. https://vincentherrmann.github.io/blog/wasserstein/. 284, 299
Hernández, C. X., Wayment-Steele, H. K., Sultan, M. M., Husic, B. E., & Pande, V. S. (2018).
Variational encoding of complex dynamics. Physical Review E, 97(6), 062412. 344
Hertz, A., Mokady, R., Tenenbaum, J., Aberman, K., Pritch, Y., & Cohen-Or, D. (2022). Prompt-to-
prompt image editing with cross attention control. arXiv:2208.01626. 369
Hessel, M., Modayil, J., van Hasselt, H., Schaul, T., Ostrovski, G., Dabney, W., Horgan, D., Piot, B.,
Azar, M., & Silver, D. (2018). Rainbow: Combining improvements in deep reinforcement learning. AAAI Conference on Artificial Intelligence, 3215–3222. 397
Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., & Hochreiter, S. (2017). GANs trained by a
two time-scale update rule converge to a local Nash equilibrium. Neural Information Processing
Systems, 30, 6626–6637. 274
Heyns, C. (2017). Autonomous weapons in armed conflict and the right to a dignified life: An
African perspective. South African Journal of Human Rights, 33(1), 46–71. 429
Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., Mohamed, S., & Lerchner,
A. (2017). Beta-VAE: Learning basic visual concepts with a constrained variational framework.
International Conference on Learning Representations. 345
Himmelreich, J. (2022). Against ‘democratizing AI’. AI & Society. 435
Hindupur, A. (2022). The GAN zoo. GitHub. Retrieved January 17, 2023. https://github.com/hindupuravinash/the-gan-zoo. 299
Hinton, G., Srivastava, N., & Swersky, K. (2012a). Neural networks for machine learning: Lecture 6a
– Overview of mini-batch gradient descent.
https://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf. 93
Hinton, G., & van Camp, D. (1993). Keeping neural networks simple by minimising the description
length of weights. Computational Learning Theory, 5–13. 159
Hinton, G., Vinyals, O., Dean, J., et al. (2015). Distilling the knowledge in a neural network.
arXiv:1503.02531. 415
Hinton, G. E., & Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural
networks. Science, 313(5786), 504–507. 344
Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. R. (2012b).
Improving neural networks by preventing co-adaptation of feature detectors. arXiv:1207.0580.
158
Ho, J., Chen, X., Srinivas, A., Duan, Y., & Abbeel, P. (2019). Flow++: Improving flow-based
generative models with variational dequantization and architecture design. International
Conference on Machine Learning, 2722–2730. 322, 323
Ho, J., Jain, A., & Abbeel, P. (2020). Denoising diffusion probabilistic models. Neural Information
Processing Systems, 33, 6840–6851. 274, 367, 369
Ho, J., Saharia, C., Chan, W., Fleet, D. J., Norouzi, M., & Salimans, T. (2022a). Cascaded diffusion
models for high fidelity image generation. Journal of Machine Learning Research, 23(47), 1–33.
369, 370
Ho, J., & Salimans, T. (2022). Classifier-free diffusion guidance. NeurIPS Workshop on Deep
Generative Models and Downstream Applications. 370
Ho, J., Salimans, T., Gritsenko, A., Chan, W., Norouzi, M., & Fleet, D. J. (2022b). Video diffusion
models. International Conference on Learning Representations. 369
Hochreiter, S., & Schmidhuber, J. (1997a). Flat minima. Neural Computation, 9(1), 1–42. 411
Hochreiter, S., & Schmidhuber, J. (1997b). Long short-term memory. Neural Computation, 9(8),
1735–1780. 233
Hoffer, E., Hubara, I., & Soudry, D. (2017). Train longer, generalize better: Closing the
generalization gap in large batch training of neural networks. Neural Information Processing
Systems, 30, 1731–1741. 203, 204
Hoffman, M. D., & Johnson, M. J. (2016). ELBO surgery: Yet another way to carve up the
variational evidence lower bound. NIPS Workshop on Advances in Approximate Bayesian
Inference, 2. 346
Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., Casas, D. d. L.,
Hendricks, L. A., Welbl, J., Clark, A., et al. (2023). Training compute-optimal large language
models. arXiv:2203.15556. 234
Hofstadter, D. R. (1995). The ineradicable Eliza effect and its dangers (preface 4). Fluid Concepts and Creative Analogies: Computer Models of the Fundamental Mechanisms of Thought, 155–168. Basic Books. 428
Holland, C. A., Ebner, N. C., Lin, T., & Samanez-Larkin, G. R. (2019). Emotion identification across
adulthood using the dynamic faces database of emotional expressions in younger, middle aged,
and older adults. Cognition and Emotion, 33(2), 245–257. 9
Holtzman, A., Buys, J., Du, L., Forbes, M., & Choi, Y. (2020). The curious case of neural text
degeneration. International Conference on Learning Representations. 235
Hoogeboom, E., Nielsen, D., Jaini, P., Forré, P., & Welling, M. (2021). Argmax flows and
multinomial diffusion: Learning categorical distributions. Neural Information Processing
Systems, 34, 12454–12465. 369
Hoogeboom, E., Peters, J., Van Den Berg, R., & Welling, M. (2019a). Integer discrete flows and
lossless compression. Neural Information Processing Systems, 32, 12134–12144. 324
Hoogeboom, E., Van Den Berg, R., & Welling, M. (2019b). Emerging convolutions for generative
normalizing flows. International Conference on Machine Learning, 2771–2780. 322
Höppe, T., Mehrjou, A., Bauer, S., Nielsen, D., & Dittadi, A. (2022). Diffusion models for video
prediction and infilling. ECCV Workshop on AI for Creative Video Editing and Understanding.
369
Hornik, K. (1991). Approximation capabilities of multilayer feedforward networks. Neural Networks,
4(2), 251–257. 38
Howard, A., Sandler, M., Chu, G., Chen, L.-C., Chen, B., Tan, M., Wang, W., Zhu, Y., Pang, R.,
Vasudevan, V., et al. (2019). Searching for MobileNetV3. IEEE/CVF International Conference
on Computer Vision, 1314–1324. 38
Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., &
Adam, H. (2017). MobileNets: Efficient convolutional neural networks for mobile vision
applications. arXiv:1704.04861. 181
Howard, R. A. (1960). Dynamic programming and Markov processes. Wiley. 396
Hsu, C.-C., Hwang, H.-T., Wu, Y.-C., Tsao, Y., & Wang, H.-M. (2017a). Voice conversion from
unaligned corpora using variational autoencoding Wasserstein generative adversarial networks.
INTERSPEECH, 3364–3368. 345
Hsu, W.-N., Zhang, Y., & Glass, J. (2017b). Learning latent representations for speech generation and
transformation. INTERSPEECH, 1273–1277. 343
Hu, H., Gu, J., Zhang, Z., Dai, J., & Wei, Y. (2018a). Relation networks for object detection.
IEEE/CVF Computer Vision & Pattern Recognition, 3588–3597. 238
Hu, H., Zhang, Z., Xie, Z., & Lin, S. (2019). Local relation networks for image recognition.
IEEE/CVF International Conference on Computer Vision, 3464–3473. 238
Hu, J., Shen, L., & Sun, G. (2018b). Squeeze-and-excitation networks. IEEE/CVF Computer Vision
& Pattern Recognition, 7132–7141. 181, 235
Hu, W., Pang, J., Liu, X., Tian, D., Lin, C.-W., & Vetro, A. (2022). Graph signal processing for
geometric data and beyond: Theory and applications. IEEE Transactions on Multimedia, 24,
3961–3977. 242
Hu, Z., Yang, Z., Liang, X., Salakhutdinov, R., & Xing, E. P. (2017). Toward controlled generation of
text. International Conference on Machine Learning, 1587–1596. 343
Huang, C.-W., Krueger, D., Lacoste, A., & Courville, A. (2018a). Neural autoregressive flows.
International Conference on Machine Learning, 2078–2087. 323, 324
Huang, G., Li, Y., Pleiss, G., Liu, Z., Hopcroft, J. E., & Weinberger, K. Q. (2017a). Snapshot
ensembles: Train 1, get M for free. International Conference on Learning Representations. 158
Huang, G., Liu, Z., Van Der Maaten, L., & Weinberger, K. Q. (2017b). Densely connected
convolutional networks. IEEE/CVF Computer Vision & Pattern Recognition, 4700–4708. 205,
405
Huang, G., Sun, Y., Liu, Z., Sedra, D., & Weinberger, K. Q. (2016). Deep networks with stochastic
depth. European Conference on Computer Vision, 646–661. 202
Huang, W., Zhang, T., Rong, Y., & Huang, J. (2018b). Adaptive sampling towards fast graph
representation learning. Neural Information Processing Systems, 31, 4563–4572. 264, 265
Huang, X., Li, Y., Poursaeed, O., Hopcroft, J., & Belongie, S. (2017c). Stacked generative adversarial
networks. IEEE/CVF Computer Vision & Pattern Recognition, 5077–5086. 300
Huang, X. S., Perez, F., Ba, J., & Volkovs, M. (2020a). Improving transformer optimization through
better initialization. International Conference on Machine Learning, 4475–4483. 114, 237
Huang, Y., Cheng, Y., Bapna, A., Firat, O., Chen, D., Chen, M., Lee, H., Ngiam, J., Le, Q. V., Wu, Y.,
et al. (2019). GPipe: Efficient training of giant neural networks using pipeline parallelism. Neural
Information Processing Systems, 32, 103–112. 114
Huang, Z., Liang, D., Xu, P., & Xiang, B. (2020b). Improve transformer models with better relative
position embeddings. Empirical Methods in Natural Language Processing. 236
Huang, Z., & Wang, N. (2018). Data-driven sparse structure selection for deep neural networks.
European Conference on Computer Vision, 304–320. 414
Hubinger, E., van Merwijk, C., Mikulik, V., Skalse, J., & Garrabrant, S. (2019). Risks from learned
optimization in advanced machine learning systems. arXiv:1906.01820. 421
Hussein, A., Gaber, M. M., Elyan, E., & Jayne, C. (2017). Imitation learning: A survey of learning
methods. ACM Computing Surveys, 50(2), 1–35. 398
Huszár, F. (2019). Exponentially growing learning rate? Implications of scale invariance induced by
batch normalization. https://www.inference.vc/exponentially-growing-learning-rate-implications-of-scale-invariance-induced-by-BatchNorm/. 204
Hutchinson, M. F. (1989). A stochastic estimator of the trace of the influence matrix for Laplacian
smoothing splines. Communications in Statistics-Simulation and Computation, 18(3), 1059–
1076. 324
Hutter, F., Hoos, H. H., & Leyton-Brown, K. (2011). Sequential model-based optimization for
general algorithm configuration. International Conference on Learning and Intelligent
Optimization, 507–523. 136
Iglovikov, V., & Shvets, A. (2018). TernausNet: U-Net with VGG11 encoder pre-trained on ImageNet
for image segmentation. arXiv:1801.05746. 205
Ilyas, A., Santurkar, S., Tsipras, D., Engstrom, L., Tran, B., & Madry, A. (2019). Adversarial
examples are not bugs, they are features. Neural Information Processing Systems, 32, 125–136.
414
Inoue, H. (2018). Data augmentation by pairing samples for images classification. arXiv:1801.02929.
159
Inoue, T., Choudhury, S., De Magistris, G., & Dasgupta, S. (2018). Transfer learning from synthetic
to real images using variational autoencoders for precise position detection. IEEE International
Conference on Image Processing, 2725–2729. 344
Ioffe, S. (2017). Batch renormalization: Towards reducing minibatch dependence in batch-
normalized models. Neural Information Processing Systems, 30, 1945–1953. 203
Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing
internal covariate shift. International Conference on Machine Learning, 448–456. 114, 203, 204
Ishida, T., Yamane, I., Sakai, T., Niu, G., & Sugiyama, M. (2020). Do we need zero training loss after
achieving zero training error? International Conference on Machine Learning, 4604–4614. 134,
159
Isola, P., Zhu, J.-Y., Zhou, T., & Efros, A. A. (2017). Image-to-image translation with conditional
adversarial networks. IEEE/CVF Computer Vision & Pattern Recognition, 1125–1134. 205, 293,
301
Izmailov, P., Podoprikhin, D., Garipov, T., Vetrov, D., & Wilson, A. G. (2018). Averaging weights
leads to wider optima and better generalization. Uncertainty in Artificial Intelligence, 876–885.
158, 411
Jackson, P. T., Abarghouei, A. A., Bonner, S., Breckon, T. P., & Obara, B. (2019). Style
augmentation: Data augmentation via style randomization. IEEE Computer Vision and Pattern
Recognition Workshops, 10–11. 159
Jacobs, R. A., Jordan, M. I., Nowlan, S. J., & Hinton, G. E. (1991). Adaptive mixtures of local
experts. Neural Computation, 3(1), 79–87. 73
Jacobsen, J.-H., Smeulders, A., & Oyallon, E. (2018). i-RevNet: Deep invertible networks.
International Conference on Learning Representations. 322, 323
Jaini, P., Kobyzev, I., Yu, Y., & Brubaker, M. A. (2020). Tails of Lipschitz triangular flows.
International Conference on Machine Learning, 4673–4681. 324
Jaini, P., Selby, K. A., & Yu, Y. (2019). Sum-of-squares polynomial flow. International Conference
on Machine Learning, 3009–3018. 323
Jaitly, N., & Hinton, G. E. (2013). Vocal tract length perturbation (VTLP) improves speech
recognition. ICML Workshop on Deep Learning for Audio, Speech and Language. 160
Jarrett, K., Kavukcuoglu, K., Ranzato, M., & LeCun, Y. (2009). What is the best multi-stage
architecture for object recognition? IEEE International Conference on Computer Vision, 2146–
2153. 37
Jastrzębski, S., Arpit, D., Astrand, O., Kerg, G. B., Wang, H., Xiong, C., Socher, R., Cho, K., &
Geras, K. J. (2021). Catastrophic Fisher explosion: Early phase Fisher matrix impacts generalization. International Conference on Machine Learning, 4772–4784. 157
Jastrzębski, S., Kenton, Z., Arpit, D., Ballas, N., Fischer, A., Bengio, Y., & Storkey, A. (2018). Three
factors influencing minima in SGD. arXiv:1711.04623. 92, 410
Ji, S., Xu, W., Yang, M., & Yu, K. (2012). 3D convolutional neural networks for human action
recognition. IEEE Transactions on Pattern Analysis & Machine Intelligence, 35(1), 221–231.
182
Jia, X., De Brabandere, B., Tuytelaars, T., & Gool, L. V. (2016). Dynamic filter networks. Neural
Information Processing Systems, 29. 183
Jiang, Z., Zheng, Y., Tan, H., Tang, B., & Zhou, H. (2016). Variational deep embedding: An
unsupervised and generative approach to clustering. International Joint Conference on Artificial
Intelligence, 1965–1972. 344
Jin, C., Netrapalli, P., & Jordan, M. (2020). What is local optimality in nonconvex-nonconcave
minimax optimization? International Conference on Machine Learning, 4880–4889. 299
Jin, L., Doshi-Velez, F., Miller, T., Schwartz, L., & Schuler, W. (2019). Unsupervised learning of
PCFGs with normalizing flow. Meeting of the Association for Computational Linguistics, 2442–
2452. 322
Jing, L., & Tian, Y. (2020). Self-supervised visual feature learning with deep neural networks: A
survey. IEEE Transactions on Pattern Analysis & Machine Intelligence, 43(11), 4037–4058. 159
Jobin, A., Ienca, M., & Vayena, E. (2019). The global landscape of AI ethics guidelines. Nature
Machine Intelligence, 1, 389–399. 420
Johnson, G. M. (2022). Are algorithms value-free? Feminist theoretical virtues in machine learning. 198. 432
Johnson, R., & Zhang, T. (2013). Accelerating stochastic gradient descent using predictive variance
reduction. Neural Information Processing Systems, 26, 315–323. 91
Jolicoeur-Martineau, A. (2019). The relativistic discriminator: A key element missing from standard
GAN. International Conference on Learning Representations. 299
Jurafsky, D., & Martin, J. H. (2000). Speech and Language Processing, 2nd Edition. Pearson. 233
Kakade, S. M. (2001). A natural policy gradient. Neural Information Processing Systems, 14, 1531–
1538. 397
Kanazawa, A., Sharma, A., & Jacobs, D. (2014). Locally scale-invariant convolutional neural
networks. Neural Information Processing Systems Workshop. 183
Kanda, N., Takeda, R., & Obuchi, Y. (2013). Elastic spectral distortion for low resource speech
recognition with deep neural networks. IEEE Workshop on Automatic Speech Recognition and
Understanding, 309–314. 160
Kaneko, T., & Kameoka, H. (2017). Parallel-data-free voice conversion using cycle-consistent
adversarial networks. arXiv:1711.11293. 299
Kang, G., Dong, X., Zheng, L., & Yang, Y. (2017). PatchShuffle regularization. arXiv:1707.07103.
159
Kanwar, G., Albergo, M. S., Boyda, D., Cranmer, K., Hackett, D. C., Racaniere, S., Rezende, D. J., &
Shanahan, P. E. (2020). Equivariant flow-based sampling for lattice gauge theory. Physical
Review Letters, 125(12), 121601. 322
Karras, T., Aila, T., Laine, S., & Lehtinen, J. (2018). Progressive growing of GANs for improved
quality, stability, and variation. International Conference on Learning Representations. 286, 287,
299, 300, 319, 345
Karras, T., Aittala, M., Aila, T., & Laine, S. (2022). Elucidating the design space of diffusion-based
generative models. Neural Information Processing Systems. 369, 370
Karras, T., Aittala, M., Hellsten, J., Laine, S., Lehtinen, J., & Aila, T. (2020a). Training generative
adversarial networks with limited data. Neural Information Processing Systems, 33, 12104–
12114. 300
Karras, T., Aittala, M., Laine, S., Härkönen, E., Hellsten, J., Lehtinen, J., & Aila, T. (2021). Alias-free
generative adversarial networks. Neural Information Processing Systems, 34, 852–863. 300
Karras, T., Laine, S., & Aila, T. (2019). A style-based generator architecture for generative
adversarial networks. IEEE/CVF Computer Vision & Pattern Recognition, 4401–4410. 299, 300
Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J., & Aila, T. (2020b). Analyzing and
improving the image quality of StyleGAN. IEEE/CVF Computer Vision & Pattern Recognition,
8110–8119. 8, 300, 301
Katharopoulos, A., Vyas, A., Pappas, N., & Fleuret, F. (2020). Transformers are RNNs: Fast
autoregressive transformers with linear attention. International Conference on Machine
Learning, 5156–5165. 237
Kawaguchi, K., Huang, J., & Kaelbling, L. P. (2019). Effect of depth and width on local minima in
deep learning. Neural Computation, 31(7), 1462–1498. 405
Ke, G., He, D., & Liu, T.-Y. (2021). Rethinking positional encoding in language pre-training.
International Conference on Learning Representations. 236
Kearnes, S., McCloskey, K., Berndl, M., Pande, V., & Riley, P. (2016). Molecular graph
convolutions: Moving beyond fingerprints. Journal of Computer-Aided Molecular Design, 30(8),
595–608. 264
Kendall, A., & Gal, Y. (2017). What uncertainties do we need in Bayesian deep learning for computer
vision? Neural Information Processing Systems, 30, 5574–5584. 158
Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M., & Tang, P. T. P. (2017). On large-batch training for deep learning: Generalization gap and sharp minima. International Conference on
Learning Representations. 158, 403, 410, 411
Keskar, N. S., & Socher, R. (2017). Improving generalization performance by switching from Adam
to SGD. arXiv:1712.07628. 94, 410
Keynes, J. M. (2010). Economic possibilities for our grandchildren. Essays in Persuasion, 321–332.
Palgrave Macmillan. 430
Khan, S., Naseer, M., Hayat, M., Zamir, S. W., Khan, F. S., & Shah, M. (2022). Transformers in
vision: A survey. ACM Computing Surveys, 54(10), 200:1–200:41. 238
Killoran, N., Lee, L. J., Delong, A., Duvenaud, D., & Frey, B. J. (2017). Generating and designing
DNA with deep generative models. NIPS 2017 Workshop on Computational Biology. 299
Kim, H., & Mnih, A. (2018). Disentangling by factorising. International Conference on Machine
Learning, 2649–2658. 345, 346
Kim, I., Han, S., Baek, J.-w., Park, S.-J., Han, J.-J., & Shin, J. (2021). Quality-agnostic image
recognition via invertible decoder. IEEE/CVF Computer Vision & Pattern Recognition, 12257–
12266. 322
Kim, S., Lee, S.-g., Song, J., Kim, J., & Yoon, S. (2018). FloWaveNet: A generative flow for raw
audio. International Conference on Machine Learning, 3370–3378. 322, 323
Kingma, D., Salimans, T., Poole, B., & Ho, J. (2021). Variational diffusion models. Neural
Information Processing Systems, 34, 21696–21707. 369
Kingma, D. P., & Ba, J. (2015). Adam: A method for stochastic optimization. International
Conference on Learning Representations. 93, 237
Kingma, D. P., & Dhariwal, P. (2018). Glow: Generative flow with invertible 1x1 convolutions.
Neural Information Processing Systems, 31, 10236–10245. 319, 322, 323
Kingma, D. P., Salimans, T., Jozefowicz, R., Chen, X., Sutskever, I., & Welling, M. (2016). Improved
variational inference with inverse autoregressive flow. Neural Information Processing Systems,
29, 4736–4744. 323, 344
Kingma, D. P., Salimans, T., & Welling, M. (2015). Variational dropout and the local
reparameterization trick. Neural Information Processing Systems, 28, 2575–2583.
346
Kingma, D. P., & Welling, M. (2014). Auto-encoding variational Bayes. International Conference on
Learning Representations. 273, 343
Kingma, D. P., Welling, M., et al. (2019). An introduction to variational autoencoders. Foundations
and Trends in Machine Learning, 12(4), 307–392. 343
Kipf, T. N., & Welling, M. (2016). Variational graph auto-encoders. NIPS Bayesian Deep Learning
Workshop. 159, 344
Kipf, T. N., & Welling, M. (2017). Semi-supervised classification with graph convolutional networks.
International Conference on Learning Representations. 262, 263, 264, 265
Kiranyaz, S., Avci, O., Abdeljaber, O., Ince, T., Gabbouj, M., & Inman, D. J. (2021). 1D
convolutional neural networks and applications: A survey. Mechanical Systems and Signal
Processing, 151, 107398. 182
Kiranyaz, S., Ince, T., Hamila, R., & Gabbouj, M. (2015). Convolutional neural networks for patient-
specific ECG classification. International Conference of the IEEE Engineering in Medicine and
Biology Society, vol. 37, 2608–2611. 182
Kitaev, N., Kaiser, Ł., & Levskaya, A. (2020). Reformer: The efficient transformer. International
Conference on Learning Representations. 237
Kitcher, P. (2011a). The Ethical Project. Harvard University Press. 432
Kitcher, P. (2011b). Science in a Democratic Society. Prometheus Books. 432
Klambauer, G., Unterthiner, T., Mayr, A., & Hochreiter, S. (2017). Self-normalizing neural networks.
Neural Information Processing Systems, vol. 30, 972–981. 38, 113
Kleinberg, J., Mullainathan, S., & Raghavan, M. (2017). Inherent trade-offs in the fair determination
of risk scores. Innovations in Theoretical Computer Science Conference, vol. 67, 1–23. 422
Kleinberg, R., Li, Y., & Yuan, Y. (2018). An alternative view: When does SGD escape local minima?
International Conference on Machine Learning, 2703–2712. 411
Knight, W. (2018). One of the fathers of AI is worried about its future. MIT Technology Review, November 17, 2018. https://www.technologyreview.com/2018/11/17/66372/one-of-the-fathers-of-ai-is-worried-about-its-future/. 430
Kobyzev, I., Prince, S. J., & Brubaker, M. A. (2020). Normalizing flows: An introduction and review
of current methods. IEEE Transactions on Pattern Analysis & Machine Intelligence, 43(11),
3964–3979. xv, 321, 324
Koenker, R., & Hallock, K. F. (2001). Quantile regression. Journal of Economic Perspectives, 15(4),
143–156. 73
Köhler, J., Klein, L., & Noé, F. (2020). Equivariant flows: Exact likelihood generative learning for
symmetric densities. International Conference on Machine Learning, 5361–5370. 322, 324
Koller, D., & Friedman, N. (2009). Probabilistic graphical models: Principles and techniques. MIT
Press. 15
Kolomiyets, O., Bethard, S., & Moens, M.-F. (2011). Model-portability experiments for textual
temporal analysis. Meeting of the Association for Computational Linguistics, 271–276. 160
Konda, V., & Tsitsiklis, J. (1999). Actor-critic algorithms. Neural Information Processing Systems,
12, 1008–1014. 397
Kong, Z., Ping, W., Huang, J., Zhao, K., & Catanzaro, B. (2021). DiffWave: A versatile diffusion
model for audio synthesis. International Conference on Learning Representations. 369
Kool, W., van Hoof, H., & Welling, M. (2019). Attention, learn to solve routing problems!
International Conference on Learning Representations. 396
Kosinski, M., Stillwell, D., & Graepel, T. (2013). Private traits and attributes are predictable from
digital records of human behavior. Proceedings of the National Academy of Sciences of the
United States of America, 110(15), 5802–5805. 427
Kratsios, M. (2019). The national artificial intelligence research and development strategic plan:
2019 update. Tech. rep., Networking and Information Technology Research and Development.
https://www.nitrd.gov/pubs/National-AI-RD-Strategy-2019.pdf. 430
Krizhevsky, A., & Hinton, G. (2009). Learning multiple layers of features from tiny images.
Technical Report, University of Toronto. 188
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep
convolutional neural networks. Neural Information Processing Systems, 25, 1097–1105. 52, 113,
159, 176, 181
Kruse, J., Detommaso, G., Köthe, U., & Scheichl, R. (2021). HINT: Hierarchical invertible neural
transport for density estimation and Bayesian inference. AAAI Conference on Artificial
Intelligence, 8191–8199. 323
Kudo, T. (2018). Subword regularization: Improving neural network translation models with multiple
subword candidates. Meeting of the Association for Computational Linguistics, 66–75. 234
Kudo, T., & Richardson, J. (2018). SentencePiece: A simple and language independent subword
tokenizer and detokenizer for neural text processing. Empirical Methods in Natural Language
Processing, 66–71. 234
Kukačka, J., Golkov, V., & Cremers, D. (2017). Regularization for deep learning: A taxonomy.
arXiv:1710.10686. 155
Kulikov, I., Miller, A. H., Cho, K., & Weston, J. (2018). Importance of search and evaluation
strategies in neural dialogue modeling. ACL International Conference on Natural Language
Generation, 76–87. 235
Kumar, A., Fu, J., Soh, M., Tucker, G., & Levine, S. (2019a). Stabilizing off-policy Q-learning via
bootstrapping error reduction. Neural Information Processing Systems, 32, 11761–11771. 398
Kumar, A., Sattigeri, P., & Balakrishnan, A. (2018). Variational inference of disentangled latent
concepts from unlabeled observations. International Conference on Learning Representations.
345
Kumar, A., Singh, S. S., Singh, K., & Biswas, B. (2020a). Link prediction techniques, applications,
and performance: A survey. Physica A: Statistical Mechanics and its Applications, 553, 124289.
262
Kumar, A., Zhou, A., Tucker, G., & Levine, S. (2020b). Conservative Q-learning for offline
reinforcement learning. Neural Information Processing Systems, 33, 1179–1191. 398
Kumar, M., Babaeizadeh, M., Erhan, D., Finn, C., Levine, S., Dinh, L., & Kingma, D. (2019b).
VideoFlow: A flow-based generative model for video. ICML Workshop on Invertible Neural
Networks and Normalizing Flows. 322
Kumar, M., Weissenborn, D., & Kalchbrenner, N. (2021). Colorization transformer. International
Conference on Learning Representations. 238
Kurach, K., Lučić, M., Zhai, X., Michalski, M., & Gelly, S. (2019). A large-scale study on
regularization and normalization in GANs. International Conference on Machine Learning,
3581–3590. 299
Kurenkov, A. (2020). A Brief History of Neural Nets and Deep Learning.
https://www.skynettoday.com/overviews/neural-net-history. 37
Kynkäänniemi, T., Karras, T., Laine, S., Lehtinen, J., & Aila, T. (2019). Improved precision and recall
metric for assessing generative models. Neural Information Processing Systems, 32, 3929–3938.
274
LaCroix, T. (2022). The linguistic blind spot of value-aligned agency, natural and artificial.
arXiv:2207.00868. 421
LaCroix, T. (2023). Artificial Intelligence and the Value-Alignment Problem: A Philosophical
Introduction. https://value-alignment.github.io. 422, 435
LaCroix, T., Geil, A., & O’Connor, C. (2021). The dynamics of retraction in epistemic networks.
Philosophy of Science, 88(3), 415–438. 432
LaCroix, T., & Mohseni, A. (2022). The tragedy of the AI commons. Synthese, 200(289). 420
Laffont, J.-J., & Martimort, D. (2002). The Theory of Incentives: The Principal-Agent Model.
Princeton University Press. 421
Lakshminarayanan, B., Pritzel, A., & Blundell, C. (2017). Simple and scalable predictive uncertainty
estimation using deep ensembles. Neural Information Processing Systems, 30, 6402–6413. 158
Lamb, A., Dumoulin, V., & Courville, A. (2016). Discriminative regularization for generative
models. arXiv:1602.03220. 344
Lample, G., & Charton, F. (2020). Deep learning for symbolic mathematics. International
Conference on Learning Representations. 234
Larsen, A. B. L., Sønderby, S. K., Larochelle, H., & Winther, O. (2016). Autoencoding beyond pixels
using a learned similarity metric. International Conference on Machine Learning, 1558–1566.
344, 345
Lasseck, M. (2018). Acoustic bird detection with deep convolutional neural networks. Detection and
Classification of Acoustic Scenes and Events, 143–147. 160
Lattimore, T., & Szepesvári, C. (2020). Bandit algorithms. Cambridge University Press. 136
Lawrence, S., Giles, C. L., Tsoi, A. C., & Back, A. D. (1997). Face recognition: A convolutional
neural-network approach. IEEE Transactions on Neural Networks, 8(1), 98–113. 181
LeCun, Y. (1985). Une procédure d’apprentissage pour réseau à seuil asymétrique. Proceedings of
Cognitiva, 599–604. 113
LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436–444. 52
LeCun, Y., Boser, B., Denker, J., Henderson, D., Howard, R., Hubbard, W., & Jackel, L. (1989a).
Handwritten digit recognition with a back-propagation network. Neural Information Processing
Systems, 2, 396–404. 180, 181
LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., & Jackel, L. D.
(1989b). Backpropagation applied to handwritten zip code recognition. Neural Computation,
1(4), 541–551. 180
LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document
recognition. Proceedings of the IEEE, 86(11), 2278–2324. 159, 181
LeCun, Y., Chopra, S., Hadsell, R., Ranzato, M., & Huang, F. (2006). A tutorial on energy-based learning. Predicting Structured Data, 1(0). 274
LeCun, Y., Denker, J. S., & Solla, S. A. (1990). Optimal brain damage. Neural Information
Processing Systems, vol. 3, 598–605. 414
LeCun, Y. A., Bottou, L., Orr, G. B., & Müller, K.-R. (2012). Efficient backprop. Neural Networks:
Tricks of the trade, 9–48. Springer. 113, 410
Ledig, C., Theis, L., Huszár, F., Caballero, J., Cunningham, A., Acosta, A., Aitken, A., Tejani, A.,
Totz, J., Wang, Z., et al. (2017). Photo-realistic single image super-resolution using a generative
adversarial network. IEEE/CVF Computer Vision & Pattern Recognition, 4681–4690. 294, 301
Lee, J., Lee, I., & Kang, J. (2019). Self-attention graph pooling. International Conference on
Machine Learning, 3734–3743. 265
Lee, J. B., Rossi, R. A., Kong, X., Kim, S., Koh, E., & Rao, A. (2018). Higher-order graph
convolutional networks. arXiv:1809.07697. 263
Lehman, J., & Stanley, K. O. (2008). Exploiting open-endedness to solve problems through the
search for novelty. International Conference on Artificial Life, 329–336. 421
Leuner, J. (2019). A replication study: Machine learning models are capable of predicting sexual
orientation from facial images. arXiv:1902.10739. 427
Li, C., Chen, C., Carlson, D., & Carin, L. (2016a). Preconditioned stochastic gradient Langevin
dynamics for deep neural networks. AAAI Conference on Artificial Intelligence, 1788–1794. 159
Li, C., Farkhoor, H., Liu, R., & Yosinski, J. (2018a). Measuring the intrinsic dimension of objective
landscapes. International Conference on Learning Representations. 407, 408
Li, F.-F. (2018). How to make A.I. that's good for people. The New York Times, March 7, 2018.
https://www.nytimes.com/2018/03/07/opinion/artificial-intelligence-human.html. 430
Li, G., Müller, M., Ghanem, B., & Koltun, V. (2021a). Training graph neural networks with 1000
layers. International Conference on Machine Learning, 6437–6449. 266, 322
Li, G., Müller, M., Qian, G., Perez, I. C. D., Abualshour, A., Thabet, A. K., & Ghanem, B. (2021b).
DeepGCNs: Making GCNs go as deep as CNNs. IEEE Transactions on Pattern Analysis and
Machine Intelligence. 266
Li, G., Xiong, C., Thabet, A., & Ghanem, B. (2020a). DeeperGCN: All you need to train deeper
GCNs. arXiv:2006.07739. 266
Li, H., Kadav, A., Durdanovic, I., Samet, H., & Graf, H. P. (2017a). Pruning filters for efficient
ConvNets. International Conference on Learning Representations. 414
Li, H., Xu, Z., Taylor, G., Studer, C., & Goldstein, T. (2018b). Visualizing the loss landscape of
neural nets. Neural Information Processing Systems, 31, 6391–6401. 201, 202, 407
Li, L., Jamieson, K., DeSalvo, G., Rostamizadeh, A., & Talwalkar, A. (2017b). Hyperband: A novel
bandit-based approach to hyperparameter optimization. Journal of Machine Learning Research,
18(1), 6765–6816. 136
Li, L. H., Yatskar, M., Yin, D., Hsieh, C.-J., & Chang, K.-W. (2019). VisualBERT: A simple and
performant baseline for vision and language. arXiv:1908.03557. 238
Li, Q., Han, Z., & Wu, X.-M. (2018c). Deeper insights into graph convolutional networks for semi-
supervised learning. AAAI Conference on Artificial Intelligence, 3538–3545. 265
Li, S., Zhao, Y., Varma, R., Salpekar, O., Noordhuis, P., Li, T., Paszke, A., Smith, J., Vaughan, B.,
Damania, P., & Chintala, S. (2020b). Pytorch distributed: Experiences on accelerating data
parallel training. International Conference on Very Large Databases. 114
Li, W., Lin, Z., Zhou, K., Qi, L., Wang, Y., & Jia, J. (2022). MAT: Mask-aware transformer for large
hole image inpainting. IEEE/CVF Computer Vision & Pattern Recognition, 10758–10768. 238
Li, Y. (2017). Deep reinforcement learning: An overview. arXiv:1701.07274. 396
Li, Y., Cohn, T., & Baldwin, T. (2017c). Robust training under linguistic adversity. Meeting of the
Association for Computational Linguistics, 21–27. 160
Li, Y., & Liang, Y. (2018). Learning overparameterized neural networks via stochastic gradient
descent on structured data. Neural Information Processing Systems, 31, 8168–8177. 407
Li, Y., Tarlow, D., Brockschmidt, M., & Zemel, R. (2016b). Gated graph sequence neural networks.
International Conference on Learning Representations. 262
Li, Y., & Turner, R. E. (2016). Rényi divergence variational inference. Neural Information
Processing Systems, 29, 1073–1081. 346
Li, Z., & Arora, S. (2019). An exponential learning rate schedule for deep learning. International
Conference on Learning Representations. 204
Liang, D., Krishnan, R. G., Hoffman, M. D., & Jebara, T. (2018). Variational autoencoders for
collaborative filtering. World Wide Web Conference, 689–698. 344
Liang, J., Zhang, K., Gu, S., Van Gool, L., & Timofte, R. (2021). Flow-based kernel prior with
application to blind super-resolution. IEEE/CVF Computer Vision & Pattern Recognition,
10601–10610. 322
Liang, S., & Srikant, R. (2016). Why deep neural networks for function approximation? International
Conference on Learning Representations. 53, 417
Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., & Wierstra, D. (2016).
Continuous control with deep reinforcement learning. International Conference on Learning
Representations. 397
Lin, K., Li, D., He, X., Zhang, Z., & Sun, M.-T. (2017a). Adversarial ranking for language
generation. Neural Information Processing Systems, 30, 3155–3165. 299
Lin, L.-J. (1992). Self-improving reactive agents based on reinforcement learning, planning and
teaching. Machine learning, 8, 293–321. 396
Lin, M., Chen, Q., & Yan, S. (2014). Network in network. International Conference on Learning
Representations. 181
Lin, T., Wang, Y., Liu, X., & Qiu, X. (2022). A survey of transformers. AI Open, 3, 111–132. 233
Lin, T.-Y., Dollár, P., Girshick, R., He, K., Hariharan, B., & Belongie, S. (2017b). Feature pyramid
networks for object detection. IEEE Computer Vision & Pattern Recognition, 2117–2125. 184
Lin, T.-Y., Goyal, P., Girshick, R., He, K., & Dollár, P. (2017c). Focal loss for dense object detection.
IEEE/CVF International Conference on Computer Vision, 2980–2988. 73
Lin, Z., Khetan, A., Fanti, G., & Oh, S. (2018). PacGAN: The power of two samples in generative
adversarial networks. Neural Information Processing Systems, 31, 1505–1514. 300
Ling, H., Kreis, K., Li, D., Kim, S. W., Torralba, A., & Fidler, S. (2021). EditGAN: High-precision
semantic image editing. Neural Information Processing Systems, 34, 16331–16345. 302
Lipman, Y., Chen, R. T., Ben-Hamu, H., Nickel, M., & Le, M. (2022). Flow matching for generative
modeling. arXiv:2210.02747. 369
Lipton, Z. C., & Tripathi, S. (2017). Precise recovery of latent vectors from generative adversarial
networks. International Conference on Learning Representations. 301
Liu, G., Reda, F. A., Shih, K. J., Wang, T.-C., Tao, A., & Catanzaro, B. (2018a). Image inpainting for
irregular holes using partial convolutions. European Conference on Computer Vision, 85–100.
181
Liu, H., Simonyan, K., & Yang, Y. (2019a). DARTS: Differentiable architecture search. International
Conference on Learning Representations. 414
Liu, L., Jiang, H., He, P., Chen, W., Liu, X., Gao, J., & Han, J. (2021a). On the variance of the
adaptive learning rate and beyond. International Conference on Learning Representations. 93
Liu, L., Liu, X., Gao, J., Chen, W., & Han, J. (2020). Understanding the difficulty of training
transformers. Empirical Methods in Natural Language Processing, 5747–5763. 237, 238
Liu, L., Luo, Y., Shen, X., Sun, M., & Li, B. (2019b). Beta-dropout: A unified dropout. IEEE Access,
7, 36140–36153. 158
Liu, P. J., Saleh, M., Pot, E., Goodrich, B., Sepassi, R., Kaiser, L., & Shazeer, N. (2018b). Generating
Wikipedia by summarizing long sequences. International Conference on Learning
Representations. 237
Liu, X., Zhang, F., Hou, Z., Mian, L., Wang, Z., Zhang, J., & Tang, J. (2023a). Self-supervised
learning: Generative or contrastive. IEEE Transactions on Knowledge and Data Engineering,
35(1), 857–876. 159
Liu, Y., Qin, Z., Anwar, S., Ji, P., Kim, D., Caldwell, S., & Gedeon, T. (2021b). Invertible denoising
network: A light solution for real noise removal. IEEE/CVF Computer Vision & Pattern
Recognition, 13365–13374. 322
Liu, Y., Zhang, Y., Wang, Y., Hou, F., Yuan, J., Tian, J., Zhang, Y., Shi, Z., Fan, J., & He, Z. (2023b).
A survey of visual transformers. IEEE Transactions on Neural Networks and Learning Systems.
238
Liu, Z., Hu, H., Lin, Y., Yao, Z., Xie, Z., Wei, Y., Ning, J., Cao, Y., Zhang, Z., Dong, L., Wei, F., &
Guo, B. (2022). Swin transformer V2: Scaling up capacity and resolution. IEEE/CVF Computer
Vision & Pattern Recognition, 12009–12019. 238
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., & Guo, B. (2021c). Swin transformer:
Hierarchical vision transformer using shifted windows. IEEE/CVF International Conference on
Computer Vision, 10012–10022. 231, 238
Liu, Z., Luo, P., Wang, X., & Tang, X. (2015). Deep learning face attributes in the wild. IEEE
International Conference on Computer Vision, 3730–3738. 345
Liu, Z., Michaud, E. J., & Tegmark, M. (2023c). Omnigrok: Grokking beyond algorithmic data.
International Conference on Learning Representations. 405, 406, 412, 413
Liu, Z., Sun, M., Zhou, T., Huang, G., & Darrell, T. (2019c). Rethinking the value of network
pruning. International Conference on Learning Representations. 235
Livni, R., Shalev-Shwartz, S., & Shamir, O. (2014). On the computational efficiency of training
neural networks. Neural Information Processing Systems, 27, 855–863. 405
Locatello, F., Weissenborn, D., Unterthiner, T., Mahendran, A., Heigold, G., Uszkoreit, J.,
Dosovitskiy, A., & Kipf, T. (2020). Object-centric learning with slot attention. Neural
Information Processing Systems, 33, 11525–11538. 238
Long, J., Shelhamer, E., & Darrell, T. (2015). Fully convolutional networks for semantic
segmentation. IEEE/CVF Computer Vision & Pattern Recognition, 3431–3440. 181
Longino, H. E. (1990). Science as Social Knowledge: Values and Objectivity in Scientific Inquiry.
Princeton University Press. 432
Longino, H. E. (1996). Cognitive and non-cognitive values in science: Rethinking the dichotomy.
Feminism, Science, and the Philosophy of Science, 39–58. 432
Loshchilov, I., & Hutter, F. (2019). Decoupled weight decay regularization. International Conference
on Learning Representations. 94, 156
Louizos, C., Welling, M., & Kingma, D. P. (2018). Learning sparse neural networks through l0
regularization. International Conference on Learning Representations. 156
Loukas, A. (2020). What graph neural networks cannot learn: Depth vs width. International
Conference on Learning Representations. 262
Lu, J., Batra, D., Parikh, D., & Lee, S. (2019). ViLBERT: Pretraining task-agnostic visiolinguistic
representations for vision-and-language tasks. Neural Information Processing Systems, 32, 13–
23. 238
Lu, S.-P., Wang, R., Zhong, T., & Rosin, P. L. (2021). Large-capacity image steganography based on
invertible neural networks. IEEE/CVF Computer Vision & Pattern Recognition, 10816–10825.
322
Lu, Z., Pu, H., Wang, F., Hu, Z., & Wang, L. (2017). The expressive power of neural networks: A
view from the width. Neural Information Processing Systems, 30, 6231–6239. 53
Lubana, E. S., Dick, R., & Tanaka, H. (2021). Beyond BatchNorm: Towards a unified understanding
of normalization in deep learning. Neural Information Processing Systems, 34, 4778–4791. 204
Lucas, J., Tucker, G., Grosse, R., & Norouzi, M. (2019a). Understanding posterior collapse in
generative latent variable models. ICLR Workshop on Deep Generative Models for Highly
Structured Data. 345
Lucas, J., Tucker, G., Grosse, R. B., & Norouzi, M. (2019b). Don't blame the ELBO! A linear VAE
perspective on posterior collapse. Neural Information Processing Systems, 32, 9403–9413. 345
Luccioni, A. S. (2023). The mounting human and environmental costs of generative AI. Ars Technica, April 12, 2023. https://arstechnica.com/gadgets/2023/04/generative-ai-is-cool-but-lets-not-forget-its-human-and-environmental-costs. 429
Luccioni, A. S., Viguier, S., & Ligozat, A.-L. (2022). Estimating the carbon footprint of BLOOM, a 176B parameter language model. arXiv:2211.02001. 429
Lucic, M., Kurach, K., Michalski, M., Gelly, S., & Bousquet, O. (2018). Are GANs created equal? A
large-scale study. Neural Information Processing Systems, 31, 698–707. 299
Lücke, J., Forster, D., & Dai, Z. (2020). The evidence lower bound of variational autoencoders
converges to a sum of three entropies. arXiv:2010.14860. 346
Luo, C. (2022). Understanding diffusion models: A unified perspective. arXiv:2208.11970. 369
Luo, G., Heide, M., & Uecker, M. (2022). MRI reconstruction via data driven Markov chain with
joint uncertainty estimation. arXiv:2202.01479. 369
Luo, J., Xu, Y., Tang, C., & Lv, J. (2017a). Learning inverse mapping by autoencoder based
generative adversarial nets. Neural Information Processing Systems, vol. 30, 207–216. 301
Luo, J.-H., Wu, J., & Lin, W. (2017b). ThiNet: A filter level pruning method for deep neural network
compression. IEEE/CVF International Conference on Computer Vision, 5058–5066. 414
Luo, P., Wang, X., Shao, W., & Peng, Z. (2018). Towards understanding regularization in batch
normalization. International Conference on Learning Representations. 205
Luo, S., & Hu, W. (2021). Diffusion probabilistic models for 3D point cloud generation. IEEE/CVF
Computer Vision & Pattern Recognition, 2837–2845. 369
Luong, M.-T., Pham, H., & Manning, C. D. (2015). Effective approaches to attention-based neural
machine translation. Empirical Methods in Natural Language Processing, 1412–1421. 235
Luther, K. (2020). Why BatchNorm causes exploding gradients.
https://kyleluther.github.io/2020/02/18/BatchNorm-exploding-gradients.html. 203
Ma, Y., & Tang, J. (2021). Deep learning on graphs. Cambridge University Press. 261
Ma, Y.-A., Chen, T., & Fox, E. (2015). A complete recipe for stochastic gradient MCMC. Neural
Information Processing Systems, 28, 2917–2925. 159
Maaløe, L., Sønderby, C. K., Sønderby, S. K., & Winther, O. (2016). Auxiliary deep generative
models. International Conference on Machine Learning, 1445–1453. 344, 345
Maas, A. L., Hannun, A. Y., & Ng, A. Y. (2013). Rectifier nonlinearities improve neural network
acoustic models. ICML Workshop on Deep Learning for Audio, Speech, and Language
Processing. 38
MacKay, D. J. (1995). Ensemble learning and evidence maximization. Neural Information
Processing Systems, vol. 8, 4083–4090. 159
MacKay, M., Vicol, P., Ba, J., & Grosse, R. B. (2018). Reversible recurrent neural networks. Neural
Information Processing Systems, 31, 9043–9054. 322
Mackowiak, R., Ardizzone, L., Köthe, U., & Rother, C. (2021). Generative classifiers as a basis for
trustworthy image classification. IEEE/CVF Computer Vision & Pattern Recognition, 2971–
2981. 322
Madhawa, K., Ishiguro, K., Nakago, K., & Abe, M. (2019). GraphNVP: An invertible flow model for
generating molecular graphs. arXiv:1905.11600. 322
Mahendran, A., & Vedaldi, A. (2015). Understanding deep image representations by inverting them.
IEEE/CVF Computer Vision & Pattern Recognition, 5188–5196. 184
Makhzani, A., Shlens, J., Jaitly, N., Goodfellow, I., & Frey, B. (2015). Adversarial autoencoders.
arXiv:1511.05644. 345
Mangalam, K., Fan, H., Li, Y., Wu, C.-Y., Xiong, B., Feichtenhofer, C., & Malik, J. (2022).
Reversible vision transformers. IEEE/CVF Computer Vision & Pattern Recognition, 10830–
10840. 322
Manning, C., & Schütze, H. (1999). Foundations of statistical natural language processing. MIT
Press. 233
Manyika, J., Lund, S., Chui, M., Bughin, J., Woetzel, J., Batra, P., Ko, R., & Sanghvi, S. (2017). Jobs
Lost, Jobs Gained: Workforce Transitions in a Time of Automation. McKinsey Global Institute.
429
Manyika, J., & Sneader, K. (2018). AI, automation, and the future of work: Ten things to solve for.
McKinsey Global Institute. 429
Mao, Q., Lee, H.-Y., Tseng, H.-Y., Ma, S., & Yang, M.-H. (2019). Mode seeking generative
adversarial networks for diverse image synthesis. IEEE/CVF Computer Vision & Pattern
Recognition, 1429–1437. 300
Mao, X., Li, Q., Xie, H., Lau, R. Y., Wang, Z., & Paul Smolley, S. (2017). Least squares generative
adversarial networks. IEEE/CVF International Conference on Computer Vision, 2794–2802. 299
Marchesi, M. (2017). Megapixel size image creation using generative adversarial networks.
arXiv:1706.00082. 299
Martin, G. L. (1993). Centered-object integrated segmentation and recognition of overlapping
handprinted characters. Neural Computation, 5(3), 419–429. 181
Masci, J., Boscaini, D., Bronstein, M., & Vandergheynst, P. (2015). Geodesic convolutional neural
networks on Riemannian manifolds. IEEE International Conference on Computer Vision
Workshop, 832–840. 265
Masrani, V., Le, T. A., & Wood, F. (2019). The thermodynamic variational objective. Neural
Information Processing Systems, 32, 11521–11530. 346
Mathieu, E., Rainforth, T., Siddharth, N., & Teh, Y. W. (2019). Disentangling disentanglement in
variational autoencoders. International Conference on Machine Learning, 4402–4412. 346
Matsakis, L. (2017). A frightening AI can determine whether a person is gay with 91 percent accuracy. Vice, Sept 8, 2017. https://www.vice.com/en/article/a33xb4/a-frightening-ai-can-determine-a-persons-sexuality-with-91-accuracy. 430
Maturana, D., & Scherer, S. (2015). VoxNet: A 3D convolutional neural network for real-time object
recognition. IEEE/RSJ International Conference on Intelligent Robots and Systems, 922–928.
182
Mayson, S. G. (2018). Bias in, bias out. Yale Law Journal, 128, 2122–2473. 422
Mazoure, B., Doan, T., Durand, A., Pineau, J., & Hjelm, R. D. (2020). Leveraging exploration in off-
policy algorithms via normalizing flows. Conference on Robot Learning, 430–444. 322
Mazyavkina, N., Sviridov, S., Ivanov, S., & Burnaev, E. (2021). Reinforcement learning for
combinatorial optimization: A survey. Computers & Operations Research, 134, 105400. 396
McCoy, R. T., Pavlick, E., & Linzen, T. (2019). Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. Meeting of the Association for Computational Linguistics, 3428–3448. 234
McCulloch, W. S., & Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity.
The Bulletin of Mathematical Biophysics, 5(4), 115–133. 37
McNamara, A., Smith, J., & Murphy-Hill, E. (2018). Does ACM's code of ethics change ethical
decision making in software development? ACM Joint Meeting on European Software
Engineering Conference and Symposium on the Foundations of Software Engineering, 729–733.
420
Mehrabi, N., Morstatter, F., Saxena, N., Lerman, K., & Galstyan, A. (2022). A survey on bias and
fairness in machine learning. ACM Computing Surveys, 54(6), 1–35. 423
Mei, J., Chung, W., Thomas, V., Dai, B., Szepesvári, C., & Schuurmans, D. (2022). The role of
baselines in policy gradient optimization. Neural Information Processing Systems, vol. 35,
17818–17830. 397
Meng, C., Song, Y., Song, J., Wu, J., Zhu, J.-Y., & Ermon, S. (2021). SDEdit: Image synthesis and
editing with stochastic differential equations. International Conference on Learning
Representations. 369
Menon, S., Damian, A., Hu, S., Ravi, N., & Rudin, C. (2020). PULSE: self-supervised photo
upsampling via latent space exploration of generative models. IEEE/CVF Computer Vision &
Pattern Recognition, 2434–2442. 422
Metcalf, J., Keller, E. F., & Boyd, D. (2016). Perspectives on big data, ethics, and society. Council for Big Data, Ethics, and Society. https://bdes.datasociety.net/council-output/perspectives-on-big-data-ethics-and-society/. 430
Metz, L., Poole, B., Pfau, D., & Sohl-Dickstein, J. (2017). Unrolled generative adversarial networks.
International Conference on Learning Representations. 299
Mézard, M., & Mora, T. (2009). Constraint satisfaction problems and neural networks: A statistical
physics perspective. Journal of Physiology-Paris, 103(1-2), 107–113. 94
Miceli, M., Posada, J., & Yang, T. (2022). Studying up machine learning data: Why talk about bias when we mean power? Proceedings of ACM on Human-Computer Interaction, 6. 423
Milletari, F., Navab, N., & Ahmadi, S.-A. (2016). V-Net: Fully convolutional neural networks for
volumetric medical image segmentation. International Conference on 3D Vision, 565–571. 205
Min, J., McCoy, R. T., Das, D., Pitler, E., & Linzen, T. (2020). Syntactic data augmentation increases
robustness to inference heuristics. Meeting of the Association for Computational Linguistics,
2339–2352. 160
Minaee, S., Boykov, Y. Y., Porikli, F., Plaza, A. J., Kehtarnavaz, N., & Terzopoulos, D. (2021). Image
segmentation using deep learning: A survey. IEEE Transactions on Pattern Analysis & Machine
Intelligence, 44(7), 3523–3542. 184
Minsky, M., & Papert, S. A. (1969). Perceptrons: An introduction to computational geometry. MIT
Press. 37, 233
Mireshghallah, F., Taram, M., Vepakomma, P., Singh, A., Raskar, R., & Esmaeilzadeh, H. (2020).
Privacy in deep learning: A survey. arXiv:2004.12254. 428
Mirza, M., & Osindero, S. (2014). Conditional generative adversarial nets. arXiv:1411.1784. 301
Mishkin, D., & Matas, J. (2016). All you need is a good init. International Conference on Learning
Representations. 113
Mitchell, M., Forrest, S., & Holland, J. H. (1992). The royal road for genetic algorithms: Fitness
landscapes and GA performance. European Conference on Artificial Life. 421
Mitchell, S., Potash, E., Barocas, S., D’Amour, A., & Lum, K. (2021). Algorithmic fairness: Choices,
assumptions, and definitions. Annual Review of Statistics and Its Application, 8, 141–163. 422
Miyato, T., Kataoka, T., Koyama, M., & Yoshida, Y. (2018). Spectral normalization for generative
adversarial networks. International Conference on Learning Representations. 299
Miyato, T., & Koyama, M. (2018). cGANs with projection discriminator. International Conference
on Learning Representations. 301
Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., & Kavukcuoglu, K.
(2016). Asynchronous methods for deep reinforcement learning. International Conference on
Machine Learning, 1928–1937. 398
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A.,
Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. (2015). Human-level control through deep
reinforcement learning. Nature, 518(7540), 529–533. 396
Moerland, T. M., Broekens, J., Plaat, A., Jonker, C. M., et al. (2023). Model-based reinforcement
learning: A survey. Foundations and Trends in Machine Learning, 16(1), 1–118. 398
Mogren, O. (2016). C-RNN-GAN: Continuous recurrent neural networks with adversarial training.
NIPS 2016 Constructive Machine Learning Workshop. 299
Molnar, C. (2022). Interpretable Machine Learning: A Guide for Making Black Box Models Explainable. https://christophm.github.io/interpretable-ml-book. 425
Monti, F., Boscaini, D., Masci, J., Rodola, E., Svoboda, J., & Bronstein, M. M. (2017). Geometric
deep learning on graphs and manifolds using mixture model CNNs. IEEE/CVF Computer Vision
& Pattern Recognition, 5115–5124. 263, 265
Monti, F., Shchur, O., Bojchevski, A., Litany, O., Günnemann, S., & Bronstein, M. M. (2018). Dual-
primal graph convolutional networks. arXiv:1806.00770. 264
Montúfar, G. (2017). Notes on the number of linear regions of deep neural networks. 52, 53
Montufar, G. F., Pascanu, R., Cho, K., & Bengio, Y. (2014). On the number of linear regions of deep
neural networks. Neural Information Processing Systems, 27, 2924–2932. 52, 53
Moor, J. (2006). The nature, importance, and difficulty of machine ethics. IEEE Intelligent Systems, 21(4), 18–21. 424
Moore, A., & Himma, K. (2022). Intellectual Property. The Stanford Encyclopedia of Philosophy.
428
Moreno-Torres, J. G., Raeder, T., Alaiz-Rodríguez, R., Chawla, N. V., & Herrera, F. (2012). A
unifying view on dataset shift in classification. Pattern Recognition, 45(1), 521–530. 135
Morimura, T., Sugiyama, M., Kashima, H., Hachiya, H., & Tanaka, T. (2010). Nonparametric return
distribution approximation for reinforcement learning. International Conference on Machine
Learning, 799–806. 397
Müller, R., Kornblith, S., & Hinton, G. E. (2019a). When does label smoothing help? Neural
Information Processing Systems, 32, 4696–4705. 158
Müller, T., McWilliams, B., Rousselle, F., Gross, M., & Novák, J. (2019b). Neural importance
sampling. ACM Transactions on Graphics (TOG), 38(5), 1–19. 322, 323
Mun, S., Shon, S., Kim, W., Han, D. K., & Ko, H. (2017). Deep neural network based learning and
transferring mid-level audio features for acoustic scene classification. IEEE International
Conference on Acoustics, Speech and Signal Processing, 796–800. 160
Murphy, K. P. (2022). Probabilistic machine learning: An introduction. MIT Press. 15
Murphy, K. P. (2023). Probabilistic machine learning: Advanced topics. MIT Press. 15
Murphy, R. L., Srinivasan, B., Rao, V., & Ribeiro, B. (2018). Janossy pooling: Learning deep
permutation-invariant functions for variable-size inputs. International Conference on Learning
Representations. 263
Murty, K. G., & Kabadi, S. N. (1987). Some NP-complete problems in quadratic and nonlinear
programming. Mathematical Programming, 39(2), 117–129. 401
Mutlu, E. C., Oghaz, T., Rajabi, A., & Garibay, I. (2020). Review on learning and extracting graph
features for link prediction. Machine Learning and Knowledge Extraction, 2(4), 672–704. 262
Nair, V., & Hinton, G. E. (2010). Rectified linear units improve restricted Boltzmann machines.
International Conference on Machine Learning, 807–814. 37
Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., & Sutskever, I. (2021). Deep double
descent: Where bigger models and more data hurt. Journal of Statistical Mechanics: Theory and
Experiment, 2021(12), 124003. 130, 134
Narang, S., Chung, H. W., Tay, Y., Fedus, W., Fevry, T., Matena, M., Malkan, K., Fiedel, N., Shazeer,
N., Lan, Z., et al. (2021). Do transformer modifications transfer across implementations and
applications? Empirical Methods in Natural Language Processing, 5758–5773. 233
Narayanan, A., & Shmatikov, V. (2008). Robust de-anonymization of large sparse datasets. IEEE
Symposium on Security and Privacy, 111–125. 428
Narayanan, D., Phanishayee, A., Shi, K., Chen, X., & Zaharia, M. (2021a). Memory-efficient
pipeline-parallel DNN training. International Conference on Machine Learning, 7937–7947. 114
Narayanan, D., Shoeybi, M., Casper, J., LeGresley, P., Patwary, M., Korthikanti, V., Vainbrand, D.,
Kashinkunti, P., Bernauer, J., Catanzaro, B., et al. (2021b). Efficient large-scale language model
training on GPU clusters using Megatron-LM. International Conference for High Performance
Computing, Networking, Storage and Analysis, 1–15. 114
Nash, C., Menick, J., Dieleman, S., & Battaglia, P. W. (2021). Generating images with sparse
representations. International Conference on Machine Learning, 7958–7968. 238, 274
Neal, R. M. (1995). Bayesian learning for neural networks. Springer. 159
Neimark, D., Bar, O., Zohar, M., & Asselmann, D. (2021). Video transformer network. IEEE/CVF
International Conference on Computer Vision, 3163–3172. 238
Nesterov, Y. E. (1983). A method for solving the convex programming problem with convergence rate O(1/k²). Doklady Akademii Nauk SSSR, vol. 269, 543–547. 93
Newell, A., Yang, K., & Deng, J. (2016). Stacked hourglass networks for human pose estimation.
European Conference on Computer Vision, 483–499. 200, 205
Neyshabur, B., Bhojanapalli, S., McAllester, D., & Srebro, N. (2017). Exploring generalization in
deep learning. Neural Information Processing Systems, 30, 5947–5956. 134, 412
Neyshabur, B., Bhojanapalli, S., & Srebro, N. (2018). A PAC-Bayesian approach to spectrally-
normalized margin bounds for neural networks. International Conference on Learning
Representations. 156
Ng, N. H., Gabriel, R. A., McAuley, J., Elkan, C., & Lipton, Z. C. (2017). Predicting surgery duration
with neural heteroscedastic regression. PMLR Machine Learning for Healthcare Conference,
100–111. 74
Nguyen, Q., & Hein, M. (2017). The loss surface of deep and wide neural networks. International
Conference on Machine Learning, 2603–2612. 405
Nguyen, Q., & Hein, M. (2018). Optimization landscape and expressivity of deep CNNs.
International Conference on Machine Learning, 3730–3739. 405
Nichol, A. Q., & Dhariwal, P. (2021). Improved denoising diffusion probabilistic models.
International Conference on Machine Learning, 8162–8171. 369
Nichol, A. Q., Dhariwal, P., Ramesh, A., Shyam, P., Mishkin, P., McGrew, B., Sutskever, I., & Chen,
M. (2022). GLIDE: towards photorealistic image generation and editing with text-guided
diffusion models. International Conference on Machine Learning, 16784–16804. 369, 370
Nie, W., Guo, B., Huang, Y., Xiao, C., Vahdat, A., & Anandkumar, A. (2022). Diffusion models for
adversarial purification. International Conference on Machine Learning, 16805–16827. 369
Nix, D. A., & Weigend, A. S. (1994). Estimating the mean and variance of the target probability
distribution. IEEE International Conference on Neural Networks, 55–60. 73
Noble, S. (2018). Algorithms of Oppression. New York: NYU Press. 433
Noci, L., Roth, K., Bachmann, G., Nowozin, S., & Hofmann, T. (2021). Disentangling the roles of
curation, data-augmentation and the prior in the cold posterior effect. Neural Information
Processing Systems, 34, 12738–12748. 159
Noé, F., Olsson, S., Köhler, J., & Wu, H. (2019). Boltzmann generators: Sampling equilibrium states
of many-body systems with deep learning. Science, 365(6457). 322
Noh, H., Hong, S., & Han, B. (2015). Learning deconvolution network for semantic segmentation.
IEEE International Conference on Computer Vision, 1520–1528. 6, 179, 180, 184
Noothigattu, R., Gaikwad, S. N., Awad, E., Dsouza, S., Rahwan, I., Ravikumar, P., & Procaccia, A.
D. (2018). A voting-based system for ethical decision making. AAAI Conference on Artificial Intelligence, 1587–1594. 424
Noroozi, M., & Favaro, P. (2016). Unsupervised learning of visual representations by solving jigsaw
puzzles. European Conference on Computer Vision, 69–84. 159
Nowozin, S., Cseke, B., & Tomioka, R. (2016). f-GAN: Training generative neural samplers using
variational divergence minimization. Neural Information Processing Systems, 29, 271–279. 299
Nye, M., & Saxe, A. (2018). Are efficient deep representations learnable? International Conference
on Learning Representations (Workshop). 417
O’Connor, C., & Bruner, J. (2019). Dynamics and diversity in epistemic communities. Erkenntnis,
84, 101–119. 433
Odena, A. (2019). Open questions about generative adversarial networks. Distill, https://distill.pub/2019/gan-open-problems. 299
Odena, A., Dumoulin, V., & Olah, C. (2016). Deconvolution and checkerboard artifacts. Distill, https://distill.pub/2016/deconv-checkerboard/. 181
Odena, A., Olah, C., & Shlens, J. (2017). Conditional image synthesis with auxiliary classifier
GANs. International Conference on Machine Learning, 2642–2651. 290, 301
O’Neil, C. (2016). Weapons of Math Destruction. Crown. 420, 422
Oono, K., & Suzuki, T. (2019). Graph neural networks exponentially lose expressive power for node
classification. International Conference on Learning Representations. 265
Orhan, A. E., & Pitkow, X. (2017). Skip connections eliminate singularities. International
Conference on Learning Representations. 202
Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S.,
Slama, K., Ray, A., et al. (2022). Training language models to follow instructions with human
feedback. Neural Information Processing Systems, 35, 27730–27744. 398
Pablok, J. (2017). Chess pieces and board improved. Wikimedia Commons. Retrieved January 17, 2023. https://commons.wikimedia.org/wiki/File:Chess_pieces_and_board_improved.svg. 13
Papamakarios, G., Nalisnick, E. T., Rezende, D. J., Mohamed, S., & Lakshminarayanan, B. (2021).
Normalizing flows for probabilistic modeling and inference. Journal of Machine Learning
Research, 22(57), 1–64. 321
Papamakarios, G., Pavlakou, T., & Murray, I. (2017). Masked autoregressive flow for density
estimation. Neural Information Processing Systems, 30, 2338–2347. 323
Park, D., Hoshi, Y., & Kemp, C. C. (2018). A multimodal anomaly detector for robot-assisted feeding
using an LSTM-based variational autoencoder. IEEE Robotics and Automation Letters, 3(3),
1544–1551. 344
Park, D. S., Chan, W., Zhang, Y., Chiu, C.-C., Zoph, B., Cubuk, E. D., & Le, Q. V. (2019).
SpecAugment: A simple data augmentation method for automatic speech recognition.
INTERSPEECH. 160
Park, S., & Kwak, N. (2016). Analysis on the dropout effect in convolutional neural networks. Asian
Conference on Computer Vision, 189–204. 183
Park, S.-W., Ko, J.-S., Huh, J.-H., & Kim, J.-C. (2021). Review on generative adversarial networks:
Focusing on computer vision and its applications. Electronics, 10(10), 1216. 299
Parker, D. B. (1985). Learning-logic: Casting the cortex of the human brain in silicon. Alfred P.
Sloan School of Management, MIT. 113
Parmar, N., Ramachandran, P., Vaswani, A., Bello, I., Levskaya, A., & Shlens, J. (2019). Stand-alone
self-attention in vision models. Neural Information Processing Systems, 32, 68–80. 238
Parmar, N., Vaswani, A., Uszkoreit, J., Kaiser, L., Shazeer, N., Ku, A., & Tran, D. (2018). Image
transformer. International Conference on Machine Learning, 4055–4064. 238
Pascanu, R., Dauphin, Y. N., Ganguli, S., & Bengio, Y. (2014). On the saddle point problem for non-
convex optimization. arXiv:1405.4604. 405
Pascanu, R., Montufar, G., & Bengio, Y. (2013). On the number of response regions of deep feed
forward networks with piece-wise linear activations. arXiv:1312.6098. 53
Paschalidou, D., Katharopoulos, A., Geiger, A., & Fidler, S. (2021). Neural parts: Learning
expressive 3D shape abstractions with invertible neural networks. IEEE/CVF Computer Vision &
Pattern Recognition, 3204–3215. 322
Patashnik, O., Wu, Z., Shechtman, E., Cohen-Or, D., & Lischinski, D. (2021). StyleCLIP: Text-
driven manipulation of StyleGAN imagery. IEEE/CVF International Conference on Computer
Vision, 2085–2094. 300
Pateria, S., Subagdja, B., Tan, A.-h., & Quek, C. (2021). Hierarchical reinforcement learning: A
comprehensive survey. ACM Computing Surveys, 54(5), 1–35. 398
Pathak, D., Krahenbuhl, P., Donahue, J., Darrell, T., & Efros, A. A. (2016). Context encoders: Feature
learning by inpainting. IEEE/CVF Computer Vision & Pattern Recognition, 2536–2544. 159
Patrick, M., Campbell, D., Asano, Y., Misra, I., Metze, F., Feichtenhofer, C., Vedaldi, A., &
Henriques, J. F. (2021). Keeping your eye on the ball: Trajectory attention in video transformers.
Neural Information Processing Systems, 34, 12493–12506. 238
Peluchetti, S., & Favaro, S. (2020). Infinitely deep neural networks as diffusion processes.
International Conference on Artificial Intelligence and Statistics, 1126–1136. 324
Peng, C., Guo, P., Zhou, S. K., Patel, V., & Chellappa, R. (2022). Towards performant and reliable
undersampled MR reconstruction via diffusion model sampling. Medical Image Computing and
Computer Assisted Intervention, 13436, 623–633. 369
Pennington, J., & Bahri, Y. (2017). Geometry of neural network loss surfaces via random matrix
theory. International Conference on Machine Learning, 2798–2806. 405
Perarnau, G., Van De Weijer, J., Raducanu, B., & Álvarez, J. M. (2016). Invertible conditional GANs
for image editing. NIPS 2016 Workshop on Adversarial Training. 301
Pereyra, G., Tucker, G., Chorowski, J., Kaiser, Ł., & Hinton, G. (2017). Regularizing neural networks
by penalizing confident output distributions. International Conference on Learning
Representations Workshop. 158
Peters, J., & Schaal, S. (2008). Reinforcement learning of motor skills with policy gradients. Neural
Networks, 21(4), 682–697. 397
Peyré, G., Cuturi, M., et al. (2019). Computational optimal transport with applications to data
science. Foundations and Trends in Machine Learning, 11(5-6), 355–607. 299
Pezeshki, M., Mitra, A., Bengio, Y., & Lajoie, G. (2022). Multi-scale feature learning dynamics:
Insights for double descent. International Conference on Machine Learning, 17669–17690. 134
Pham, T., Tran, T., Phung, D., & Venkatesh, S. (2017). Column networks for collective classification.
AAAI Conference on Artificial Intelligence, 2485–2491. 263
Phuong, M., & Hutter, M. (2022). Formal algorithms for transformers. Technical Report, DeepMind.
233
Pieters, M., & Wiering, M. (2018). Comparing generative adversarial network techniques for image
creation and modification. arXiv:1803.09093. 299
Pintea, S. L., Tömen, N., Goes, S. F., Loog, M., & van Gemert, J. C. (2021). Resolution learning in
deep convolutional networks using scale-space theory. IEEE Transactions on Image Processing,
30, 8342–8353. 183
Poggio, T., Mhaskar, H., Rosasco, L., Miranda, B., & Liao, Q. (2017). Why and when can deep-but
not shallow-networks avoid the curse of dimensionality: A review. International Journal of
Automation and Computing, 14(5), 503–519. 53
Polyak, B. T. (1964). Some methods of speeding up the convergence of iteration methods. USSR
Computational Mathematics and Mathematical Physics, 4(5), 1–17. 92
Poole, B., Jain, A., Barron, J. T., & Mildenhall, B. (2023). DreamFusion: Text-to-3D using 2D
diffusion. International Conference on Learning Representations. 369
Power, A., Burda, Y., Edwards, H., Babuschkin, I., & Misra, V. (2022). Grokking: Generalization
beyond overfitting on small algorithmic datasets. arXiv:2201.02177. 412
Prenger, R., Valle, R., & Catanzaro, B. (2019). WaveGlow: A flow-based generative network for
speech synthesis. IEEE International Conference on Acoustics, Speech and Signal Processing,
3617–3621. 322, 323
Prince, S. J. D. (2012). Computer vision: Models, learning, and inference. Cambridge University
Press. 15, 159
Prince, S. J. D. (2021a). Transformers II: Extensions. https://www.borealisai.com/en/blog/tutorial-16-transformers-ii-extensions/. 236, 237
Prince, S. J. D. (2021b). Transformers III: Training. https://www.borealisai.com/en/blog/tutorial-17-transformers-iii-training/. 238
Prince, S. J. D. (2022). Explainability I: local post-hoc explanations. https://www.borealisai.com/research-blogs/explainability-i-local-post-hoc-explanations/. 426
Prokudin, S., Gehler, P., & Nowozin, S. (2018). Deep directional statistics: Pose estimation with
uncertainty quantification. European Conference on Computer Vision, 534–551. 74
Provilkov, I., Emelianenko, D., & Voita, E. (2020). BPE-Dropout: Simple and effective subword
regularization. Meeting of the Association for Computational Linguistics, 1882–1892. 234
Qi, G.-J. (2020). Loss-sensitive generative adversarial networks on Lipschitz densities. International
Journal of Computer Vision, 128(5), 1118–1140. 299
Qi, J., Du, J., Siniscalchi, S. M., Ma, X., & Lee, C.-H. (2020). On mean absolute error for deep
neural network based vector-to-vector regression. IEEE Signal Processing Letters, 27, 1485–1489. 73
Qin, Z., Yu, F., Liu, C., & Chen, X. (2018). How convolutional neural networks see the world — A
survey of convolutional neural network visualization methods. arXiv:1804.11191. 184
Qiu, S., Xu, B., Zhang, J., Wang, Y., Shen, X., De Melo, G., Long, C., & Li, X. (2020). EasyAug: An
automatic textual data augmentation platform for classification tasks. Companion Proceedings of
the Web Conference 2020, 249–252. 160
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A.,
Mishkin, P., Clark, J., et al. (2021). Learning transferable visual models from natural language
supervision. International Conference on Machine Learning, 8748–8763. 238, 370
Radford, A., Metz, L., & Chintala, S. (2015). Unsupervised representation learning with deep
convolutional generative adversarial networks. International Conference on Learning
Representations. 280, 299
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al. (2019). Language models
are unsupervised multitask learners. OpenAI Blog, 1(8), 9. 159, 234
Rae, J. W., Borgeaud, S., Cai, T., Millican, K., Hoffmann, J., Song, F., Aslanides, J., Henderson, S.,
Ring, R., Young, S., et al. (2021). Scaling language models: Methods, analysis & insights from
training Gopher. arXiv:2112.11446. 234
Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., Liu, P. J., et al.
(2020). Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of
Machine Learning Research, 21(140), 1–67. 236
Raji, I. D., & Buolamwini, J. (2019). Actionable auditing: Investigating the impact of publicly
naming biased performance results of commercial AI products. AAAI/ACM Conference on AI,
Ethics, and Society, 429–435. 423
Raji, I. D., & Fried, G. (2020). About face: A survey of facial recognition evaluation. AAAI Workshop
on AI Evaluation. 427
Raji, I. D., Kumar, I. E., Horowitz, A., & Selbst, A. (2022). The fallacy of AI functionality. ACM
Conference on Fairness, Accountability, and Transparency, 959–972. 423, 427
Rajpurkar, P., Chen, E., Banerjee, O., & Topol, E. J. (2022). AI in health and medicine. Nature
Medicine, 28(1), 31–38. 420
Rajpurkar, P., Zhang, J., Lopyrev, K., & Liang, P. (2016). SQuAD: 100,000+ questions for machine
comprehension of text. Empirical Methods in Natural Language Processing, 2383–2392. 234
Ramachandran, P., Zoph, B., & Le, Q. V. (2017). Searching for activation functions.
arXiv:1710.05941. 38
Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., & Chen, M. (2022). Hierarchical text-conditional
image generation with CLIP latents. arXiv:2204.06125. 10, 11, 238, 369, 370
Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C., Radford, A., Chen, M., & Sutskever, I. (2021).
Zero-shot text-to-image generation. International Conference on Machine Learning, 8821–8831.
238, 370
Ramsauer, H., Schäfl, B., Lehner, J., Seidl, P., Widrich, M., Adler, T., Gruber, L., Holzleitner, M.,
Pavlović, M., Sandve, G. K., et al. (2021). Hopfield networks is all you need. International
Conference on Learning Representations. 236
Ranganath, R., Tran, D., & Blei, D. (2016). Hierarchical variational models. International
Conference on Machine Learning, 324–333. 345
Ravanbakhsh, S., Lanusse, F., Mandelbaum, R., Schneider, J., & Poczos, B. (2017). Enabling dark
energy science with deep generative models of galaxy images. AAAI Conference on Artificial
Intelligence, 1488–1494. 344
Rawat, W., & Wang, Z. (2017). Deep convolutional neural networks for image classification: A
comprehensive review. Neural Computation, 29(9), 2352–2449. 181
Rawls, J. (1971). A Theory of Justice. Belknap Press. 430
Razavi, A., Oord, A. v. d., Poole, B., & Vinyals, O. (2019a). Preventing posterior collapse with delta-
VAEs. International Conference on Learning Representations. 345
Razavi, A., Van den Oord, A., & Vinyals, O. (2019b). Generating diverse high-fidelity images with
VQ-VAE-2. Neural Information Processing Systems, 32, 14837–14847. 344, 345
Recht, B., Re, C., Wright, S., & Niu, F. (2011). Hogwild!: A lock-free approach to parallelizing
stochastic gradient descent. Neural Information Processing Systems, 24, 693–701. 114
Reddi, S. J., Kale, S., & Kumar, S. (2018). On the convergence of Adam and beyond. International
Conference on Learning Representations. 93
Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2016). You only look once: Unified, real-time
object detection. IEEE/CVF Computer Vision & Pattern Recognition, 779–788. 178, 184
Reed, S., Akata, Z., Yan, X., Logeswaran, L., Schiele, B., & Lee, H. (2016a). Generative adversarial
text to image synthesis. International Conference on Machine Learning, 1060–1069. 301
Reed, S. E., Akata, Z., Mohan, S., Tenka, S., Schiele, B., & Lee, H. (2016b). Learning what and
where to draw. Neural Information Processing Systems, 29, 217–225. 301
Reiss, J., & Sprenger, J. (2017). Scientific Objectivity. The Stanford Encyclopedia of Philosophy. 431
Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster R-CNN: Towards real-time object detection
with region proposal networks. Neural Information Processing Systems, 28. 183
Rezende, D. J., & Mohamed, S. (2015). Variational inference with normalizing flows. International
Conference on Machine Learning, 1530–1538. 273, 321, 322, 344
Rezende, D. J., Mohamed, S., & Wierstra, D. (2014). Stochastic backpropagation and approximate
inference in deep generative models. International Conference on Machine Learning, 1278–
1286. 346
Rezende, D. J., Racanière, S., Higgins, I., & Toth, P. (2019). Equivariant Hamiltonian flows.
arXiv:1909.13739. 324
Rezende Jimenez, D., Eslami, S., Mohamed, S., Battaglia, P., Jaderberg, M., & Heess, N. (2016).
Unsupervised learning of 3D structure from images. Neural Information Processing Systems, 29,
4997–5005. 344
Riad, R., Teboul, O., Grangier, D., & Zeghidour, N. (2022). Learning strides in convolutional neural
networks. International Conference on Learning Representations. 183
Ribeiro, M., Singh, S., & Guestrin, C. (2016). “Why should I trust you?”: Explaining the predictions
of any classifier. Meeting of the Association for Computational Linguistics, 97–101. 425
Ribeiro, M. T., Wu, T., Guestrin, C., & Singh, S. (2021). Beyond accuracy: Behavioral testing of
NLP models with CheckList. 4824–4828. 234
Richardson, E., Alaluf, Y., Patashnik, O., Nitzan, Y., Azar, Y., Shapiro, S., & Cohen-Or, D. (2021).
Encoding in style: A Style-GAN encoder for image-to-image translation. IEEE/CVF Computer
Vision & Pattern Recognition, 2287–2296. 301
Riedl, M. (2020). AI democratization in the era of GPT-3. The Gradient, Sept 25, 2020. https://thegradient.pub/ai-democratization-in-the-era-of-gpt-3/. 430
Riedmiller, M. (2005). Neural fitted Q iteration — first experiences with a data efficient neural
reinforcement learning method. European Conference on Machine Learning, 317–328. 396
Rippel, O., & Adams, R. P. (2013). High-dimensional probability estimation with deep density
models. arXiv:1302.5125. 321
Rissanen, J. (1983). A universal prior for integers and estimation by minimum description length. The
Annals of Statistics, 11(2), 416–431. 411
Rissanen, S., Heinonen, M., & Solin, A. (2022). Generative modelling with inverse heat dissipation.
arXiv:2206.13397. 369
Rives, A., Meier, J., Sercu, T., Goyal, S., Lin, Z., Liu, J., Guo, D., Ott, M., Zitnick, C. L., Ma, J., et al.
(2021). Biological structure and function emerge from scaling unsupervised learning to 250
million protein sequences. Proceedings of the National Academy of Sciences, 118(15). 234
Robbins, H., & Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical
Statistics, 22(3), 400–407. 91
Rodrigues, F., & Pereira, F. C. (2020). Beyond expectation: Deep joint mean and quantile regression
for spatiotemporal problems. IEEE Transactions on Neural Networks and Learning Systems,
31(12), 5377–5389. 73
Roeder, G., Wu, Y., & Duvenaud, D. K. (2017). Sticking the landing: Simple, lower-variance gradient
estimators for variational inference. Neural Information Processing Systems, 30, 6925–6934. 346
Roich, D., Mokady, R., Bermano, A. H., & Cohen-Or, D. (2022). Pivotal tuning for latent-based
editing of real images. ACM Transactions on Graphics (TOG), 42(1), 1–13. 300, 301
Rolfe, J. T. (2017). Discrete variational autoencoders. International Conference on Learning
Representations. 344
Rolnick, D., Donti, P. L., Kaack, L. H., Kochanski, K., Lacoste, A., Sankaran, K., Ross, A. S.,
Milojevic-Dupont, N., Jaques, N., Waldman-Brown, A., Luccioni, A. S., Maharaj, T., Sherwin, E.
D., Mukkavilli, S. K., Kording, K. P., Gomes, C. P., Ng, A. Y., Hassabis, D., Platt, J. C., Creutzig,
F., Chayes, J. T., & Bengio, Y. (2023). Tackling climate change with machine learning. ACM
Computing Surveys, 55(2), 1–42. 420
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022). High-resolution image
synthesis with latent diffusion models. IEEE/CVF Computer Vision & Pattern Recognition,
10684–10695. 370
Romero, D. W., Bruintjes, R.-J., Tomczak, J. M., Bekkers, E. J., Hoogendoorn, M., & van Gemert, J.
C. (2021). FlexConv: Continuous kernel convolutions with differentiable kernel sizes.
International Conference on Learning Representations. 183
Rong, Y., Huang, W., Xu, T., & Huang, J. (2020). DropEdge: Towards deep graph convolutional
networks on node classification. International Conference on Learning Representations. 264
Ronneberger, O., Fischer, P., & Brox, T. (2015). U-Net: Convolutional networks for biomedical
image segmentation. International Conference on Medical Image Computing and Computer-
Assisted Intervention, 234–241. 184, 198, 205
Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organization
in the brain. Psychological Review, 65(6), 386. 37
Rossi, E., Frasca, F., Chamberlain, B., Eynard, D., Bronstein, M., & Monti, F. (2020). SIGN:
Scalable inception graph neural networks. ICML Graph Representation Learning and Beyond
Workshop, 7, 15. 263
Roy, A., Saffar, M., Vaswani, A., & Grangier, D. (2021). Efficient content-based sparse attention with
routing transformers. Transactions of the Association for Computational Linguistics, 9, 53–68.
237
Rozemberczki, B., Kiss, O., & Sarkar, R. (2020). Little ball of fur: A Python library for graph
sampling. ACM International Conference on Information & Knowledge Management, 3133–
3140. 264
Rubin, D. B., & Thayer, D. T. (1982). EM algorithms for ML factor analysis. Psychometrika, 47(1),
69–76. 344
Ruder, S. (2016). An overview of gradient descent optimization algorithms. arXiv:1609.04747. 91
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1985). Learning internal representations by error
propagation. Techical Report, La Jolla Institute for Cognitive Science, UCSD. 113, 233, 344
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-
propagating errors. Nature, 323(6088), 533–536. 113
Rummery, G. A., & Niranjan, M. (1994). On-line Q-learning using connectionist systems. Technical
Report, University of Cambridge. 396
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla,
A., Bernstein, M., et al. (2015). ImageNet large scale visual recognition challenge. International
Journal of Computer Vision, 115(3), 211–252. 175, 181
Russell, S. (2019). Human Compatible: Artificial Intelligence and the Problem of Control. Viking.
420
Sabour, S., Frosst, N., & Hinton, G. E. (2017). Dynamic routing between capsules. Neural
Information Processing Systems, 30, 3856–3866. 235
Safran, I., & Shamir, O. (2017). Depth-width tradeoffs in approximating natural functions with neural
networks. International Conference on Machine Learning, 2979–2987. 53
Saha, S., Singh, G., Sapienza, M., Torr, P. H., & Cuzzolin, F. (2016). Deep learning for detecting
multiple space-time action tubes in videos. British Machine Vision Conference. 182
Saharia, C., Chan, W., Chang, H., Lee, C., Ho, J., Salimans, T., Fleet, D., & Norouzi, M. (2022a).
Palette: Image-to-image diffusion models. ACM SIGGRAPH. 8, 369
Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E., Ghasemipour, S. K. S., Ayan, B. K.,
Mahdavi, S. S., Lopes, R. G., et al. (2022b). Photorealistic text-to-image diffusion models with
deep language understanding. arXiv:2205.11487. 366, 368, 369, 371
Saharia, C., Ho, J., Chan, W., Salimans, T., Fleet, D. J., & Norouzi, M. (2022c). Image super-
resolution via iterative refinement. IEEE Transactions on Pattern Analysis & Machine
Intelligence, 1–14. 369
Sainath, T. N., Kingsbury, B., Mohamed, A.-r., Dahl, G. E., Saon, G., Soltau, H., Beran, T., Aravkin,
A. Y., & Ramabhadran, B. (2013). Improvements to deep convolutional neural networks for
LVCSR. IEEE Workshop on Automatic Speech Recognition and Understanding, 315–320. 182
Saito, Y., Takamichi, S., & Saruwatari, H. (2017). Statistical parametric speech synthesis
incorporating generative adversarial networks. IEEE/ACM Transactions on Audio, Speech, and
Language Processing, 26(1), 84–96. 299, 301
Salamon, J., & Bello, J. P. (2017). Deep convolutional neural networks and data augmentation for
environmental sound classification. IEEE Signal Processing Letters, 24(3), 279–283. 160
Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., & Chen, X. (2016). Improved
techniques for training GANs. Neural Information Processing Systems, 29, 2226–2234. 274, 299,
300
Salimans, T., & Ho, J. (2022). Progressive distillation for fast sampling of diffusion models.
International Conference on Learning Representations. 370
Salimans, T., Kingma, D., & Welling, M. (2015). Markov chain Monte Carlo and variational
inference: Bridging the gap. International Conference on Machine Learning, 1218–1226. 345
Salimans, T., & Kingma, D. P. (2016). Weight normalization: A simple reparameterization to
accelerate training of deep neural networks. Neural Information Processing Systems, 29, 901–
909. 204
Sanchez-Lengeling, B., Reif, E., Pearce, A., & Wiltschko, A. B. (2021). A gentle introduction to graph neural networks. Distill, https://distill.pub/2021/gnn-intro/. 261
Sankararaman, K. A., De, S., Xu, Z., Huang, W. R., & Goldstein, T. (2020). The impact of neural
network overparameterization on gradient confusion and stochastic gradient descent.
International Conference on Machine Learning, 8469–8479. 202
Santurkar, S., Tsipras, D., Ilyas, A., & Madry, A. (2018). How does batch normalization help
optimization? Neural Information Processing Systems, 31, 2488–2498. 204
Sauer, A., Schwarz, K., & Geiger, A. (2022). StyleGAN-XL: Scaling StyleGAN to large diverse
datasets. ACM SIGGRAPH. 10
Scarselli, F., Gori, M., Tsoi, A. C., Hagenbuchner, M., & Monfardini, G. (2008). The graph neural
network model. IEEE Transactions on Neural Networks, 20(1), 61–80. 262
Schaul, T., Quan, J., Antonoglou, I., & Silver, D. (2016). Prioritized experience replay. International
Conference on Learning Representations. 396
Scherer, D., Müller, A., & Behnke, S. (2010). Evaluation of pooling operations in convolutional
architectures for object recognition. International Conference on Artificial Neural Networks, 92–
101. 181
Schlag, I., Irie, K., & Schmidhuber, J. (2021). Linear transformers are secretly fast weight
programmers. International Conference on Machine Learning, 9355–9366. 235
Schlichtkrull, M., Kipf, T. N., Bloem, P., Berg, R. v. d., Titov, I., & Welling, M. (2018). Modeling
relational data with graph convolutional networks. European Semantic Web Conference, 593–
607. 265
Schmidhuber, J. (2022). Annotated history of modern AI and deep learning. arXiv:2212.11279. 37
Schneider, S., Baevski, A., Collobert, R., & Auli, M. (2019). wav2vec: Unsupervised pre-training for
speech recognition. INTERSPEECH, 3465–3469. 159
Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart,
E., Hassabis, D., Graepel, T., et al. (2020). Mastering Atari, Go, chess and shogi by planning with
a learned model. Nature, 588(7839), 604–609. 398
Schroecker, Y., Vecerik, M., & Scholz, J. (2019). Generative predecessor models for sample-efficient
imitation learning. International Conference on Learning Representations. 322
Schuhmann, C., Vencu, R., Beaumont, R., Kaczmarczyk, R., Mullis, C., Katta, A., Coombes, T.,
Jitsev, J., & Komatsuzaki, A. (2021). LAION-400M: Open dataset of CLIP-filtered 400 million
image-text pairs. NeurIPS Workshop on Data-centric AI. 238
Schulman, J., Levine, S., Abbeel, P., Jordan, M., & Moritz, P. (2015). Trust region policy
optimization. International Conference on Machine Learning, 1889–1897. 397
Schulman, J., Moritz, P., Levine, S., Jordan, M., & Abbeel, P. (2016). High-dimensional continuous
control using generalized advantage estimation. International Conference on Learning
Representations. 398
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal policy
optimization algorithms. arXiv:1707.06347. 397
Schuster, M., & Nakajima, K. (2012). Japanese and Korean voice search. IEEE International
Conference on Acoustics, Speech and Signal Processing, 5149–5152. 234
Schwarz, J., Jayakumar, S., Pascanu, R., Latham, P., & Teh, Y. (2021). Powerpropagation: A sparsity
inducing weight reparameterisation. Neural Information Processing Systems, 34, 28889–28903.
156
Sejnowski, T. J. (2018). The deep learning revolution. MIT Press. 37
Sejnowski, T. J. (2020). The unreasonable effectiveness of deep learning in artificial intelligence.
Proceedings of the National Academy of Sciences, 117(48), 30033–30038. 404
Selsam, D., Lamm, M., Bünz, B., Liang, P., de Moura, L., & Dill, D. L. (2019). Learning a SAT
solver from single-bit supervision. International Conference on Learning Representations. 262
Selva, J., Johansen, A. S., Escalera, S., Nasrollahi, K., Moeslund, T. B., & Clapés, A. (2022). Video
transformers: A survey. arXiv:2201.05991. 238
Sennrich, R., Haddow, B., & Birch, A. (2015). Neural machine translation of rare words with
subword units. Meeting of the Association for Computational Linguistics. 234
Serra, T., Tjandraatmadja, C., & Ramalingam, S. (2018). Bounding and counting linear regions of
deep neural networks. International Conference on Machine Learning, 4558–4566. 52
Shang, W., Sohn, K., Almeida, D., & Lee, H. (2016). Understanding and improving convolutional
neural networks via concatenated rectified linear units. International Conference on Machine
Learning, 2217–2225. 38
Sharif Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: An
astounding baseline for recognition. IEEE Conference on Computer Vision and Pattern
Recognition Workshop, 806–813. 159
Sharkey, A., & Sharkey, N. (2012). Granny and the robots: Ethical issues in robot care for the elderly.
Ethics and Information Technology, 14(1), 27–40. 424
Shaw, P., Uszkoreit, J., & Vaswani, A. (2018). Self-attention with relative position representations.
ACL Human Language Technologies, 464–468. 236
Shen, S., Yao, Z., Gholami, A., Mahoney, M., & Keutzer, K. (2020a). PowerNorm: Rethinking batch
normalization in transformers. International Conference on Machine Learning, 8741–8751. 237
Shen, X., Tian, X., Liu, T., Xu, F., & Tao, D. (2017). Continuous dropout. IEEE Transactions on
Neural Networks and Learning Systems, 29(9), 3926–3937. 158
Shen, Y., Gu, J., Tang, X., & Zhou, B. (2020b). Interpreting the latent space of GANs for semantic
face editing. IEEE/CVF Computer Vision & Pattern Recognition, 9243–9252. 300
Shi, W., Caballero, J., Huszár, F., Totz, J., Aitken, A. P., Bishop, R., Rueckert, D., & Wang, Z. (2016).
Real-time single image and video super-resolution using an efficient sub-pixel convolutional
neural network. IEEE/CVF Computer Vision & Pattern Recognition, 1874–1883. 182
Shoeybi, M., Patwary, M., Puri, R., LeGresley, P., Casper, J., & Catanzaro, B. (2019). Megatron-LM:
Training multi-billion parameter language models using model parallelism. arXiv:1909.08053.
114
Shorten, C., & Khoshgoftaar, T. M. (2019). A survey on image data augmentation for deep learning.
Journal of Big Data, 6(1), 1–48. 159
Siddique, N., Paheding, S., Elkin, C. P., & Devabhaktuni, V. (2021). U-Net and its variants for
medical image segmentation: A review of theory and applications. IEEE Access, 9, 82031–82057.
205
Sifre, L., & Mallat, S. (2013). Rotation, scaling and deformation invariant scattering for texture
discrimination. IEEE/CVF Computer Vision & Pattern Recognition, 1233–1240. 183
Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J.,
Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. (2016). Mastering the game of Go with
deep neural networks and tree search. Nature, 529(7587), 484–489. 396, 398
Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., & Riedmiller, M. (2014). Deterministic
policy gradient algorithms. International Conference on Machine Learning, 387–395. 397
Simonovsky, M., & Komodakis, N. (2018). GraphVAE: Towards generation of small graphs using
variational autoencoders. International Conference on Artificial Neural Networks, 412–422. 344
Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image
recognition. International Conference on Learning Representations. 177, 181
Singh, S. P., & Sutton, R. S. (1996). Reinforcement learning with replacing eligibility traces.
Machine Learning, 22(1), 123–158. 396
Sinha, S., Zhao, Z., Goyal, A., Raffel, C., & Odena, A. (2020). Top-k training of GANs: Improving
GAN performance by throwing away bad samples. Neural Information Processing Systems, 33,
14638–14649. 299
Sisson, M., Spindel, J., Scharre, P., & Kozyulin, V. (2020). The militarization of artificial
intelligence. United Nations Office for Disarmament Affairs. 427
Sjöberg, J., & Ljung, L. (1995). Overtraining, regularization and searching for a minimum, with
application to neural networks. International Journal of Control, 62(6), 1391–1407. 157
Smith, M., & Miller, S. (2022). The ethical application of biometric facial recognition technology. AI
& Society, 37, 167–175. 426
Smith, S., Elsen, E., & De, S. (2020). On the generalization benefit of noise in stochastic gradient
descent. International Conference on Machine Learning, 9058–9067. 157
Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye,
S., Zerveas, G., Korthikanti, V., et al. (2022). Using DeepSpeed and Megatron to train Megatron-
Turing NLG 530B, a large-scale generative language model. arXiv:2201.11990. 234
Smith, S. L., Dherin, B., Barrett, D. G. T., & De, S. (2021). On the origin of implicit regularization in
stochastic gradient descent. International Conference on Learning Representations. 157
Smith, S. L., Kindermans, P., Ying, C., & Le, Q. V. (2018). Don't decay the learning rate, increase the
batch size. International Conference on Learning Representations. 92
Snoek, J., Larochelle, H., & Adams, R. P. (2012). Practical Bayesian optimization of machine
learning algorithms. Neural Information Processing Systems, vol. 25, 2951–2959. 136
Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., & Ganguli, S. (2015). Deep unsupervised
learning using nonequilibrium thermodynamics. International Conference on Machine Learning,
2256–2265. 274, 367
Sohn, K., Lee, H., & Yan, X. (2015). Learning structured output representation using deep
conditional generative models. Neural Information Processing Systems, 28, 3483–3491. 344
Sohoni, N. S., Aberger, C. R., Leszczynski, M., Zhang, J., & Ré, C. (2019). Low-memory neural
network training: A technical report. arXiv:1904.10631. 114
Sønderby, C. K., Raiko, T., Maaløe, L., Sønderby, S. K., & Winther, O. (2016a). How to train deep
variational autoencoders and probabilistic ladder networks. arXiv:1602.02282. 344
Sønderby, C. K., Raiko, T., Maaløe, L., Sønderby, S. K., & Winther, O. (2016b). Ladder variational
autoencoders. Neural Information Processing Systems, 29, 3738–3746. 369
Song, J., Meng, C., & Ermon, S. (2021a). Denoising diffusion implicit models. International
Conference on Learning Representations. 370
Song, Y., & Ermon, S. (2019). Generative modeling by estimating gradients of the data distribution.
Neural Information Processing Systems, 32, 11895–11907. 367, 371
Song, Y., & Ermon, S. (2020). Improved techniques for training score-based generative models.
Neural Information Processing Systems, 33, 12438–12448. 371
Song, Y., Meng, C., & Ermon, S. (2019). MintNet: Building invertible neural networks with masked
convolutions. Neural Information Processing Systems, 32, 11002–11012. 322
Song, Y., Shen, L., Xing, L., & Ermon, S. (2021b). Solving inverse problems in medical imaging
with score-based generative models. International Conference on Learning Representations. 369
Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., & Poole, B. (2021c). Score-based
generative modeling through stochastic differential equations. International Conference on
Learning Representations. 369, 370, 371
Springenberg, J. T., Dosovitskiy, A., Brox, T., & Riedmiller, M. (2015). Striving for simplicity: The
all convolutional net. International Conference on Learning Representations. 182
Srivastava, A., Rastogi, A., Rao, A., Shoeb, A. A. M., Abid, A., Fisch, A., Brown, A. R., Santoro, A.,
Gupta, A., Garriga-Alonso, A., et al. (2022). Beyond the imitation game: Quantifying and
extrapolating the capabilities of language models. arXiv:2206.04615. 234
Srivastava, A., Valkov, L., Russell, C., Gutmann, M. U., & Sutton, C. (2017). VEEGAN: Reducing
mode collapse in GANs using implicit variational learning. Neural Information Processing
Systems, 30, 3308–3318. 300
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A
simple way to prevent neural networks from overfitting. Journal of Machine Learning Research,
15(1), 1929–1958. 158
Srivastava, R. K., Greff, K., & Schmidhuber, J. (2015). Highway networks. arXiv:1505.00387. 202
Stark, L., & Hoey, J. (2021). The ethics of emotions in artificial intelligence systems. ACM
Conference on Fairness, Accountability, and Transparency, 782–793. 427
Stark, L., & Hutson, J. (2022). Physiognomic artificial intelligence. Fordham Intellectual Property,
Media & Entertainment Law Journal, XXXII(4), 922–978. 427, 432
Stiennon, N., Ouyang, L., Wu, J., Ziegler, D., Lowe, R., Voss, C., Radford, A., Amodei, D., &
Christiano, P. F. (2020). Learning to summarize with human feedback. Neural Information
Processing Systems, 33, 3008–3021. 398
Strubell, E., Ganesh, A., & McCallum, A. (2019). Energy and policy considerations for deep learning
in NLP. Meeting of the Association for Computational Linguistics, 3645–3650. 429
Strubell, E., Ganesh, A., & McCallum, A. (2020). Energy and policy considerations for modern deep
learning research. Meeting of the Association for Computational Linguistics, 13693–13696. 429
Su, H., Jampani, V., Sun, D., Gallo, O., Learned-Miller, E., & Kautz, J. (2019a). Pixel-adaptive
convolutional neural networks. IEEE/CVF Computer Vision & Pattern Recognition, 11166–
11175. 183
Su, J., Lu, Y., Pan, S., Wen, B., & Liu, Y. (2021). RoFormer: Enhanced transformer with rotary
position embedding. arXiv:2104.09864. 236
Su, W., Zhu, X., Cao, Y., Li, B., Lu, L., Wei, F., & Dai, J. (2019b). VL-BERT: Pre-training of generic
visual-linguistic representations. International Conference on Learning Representations. 238
Sultan, M. M., Wayment-Steele, H. K., & Pande, V. S. (2018). Transferable neural networks for
enhanced sampling of protein dynamics. Journal of Chemical Theory and Computation, 14(4),
1887–1894. 344
Summers, C., & Dinneen, M. J. (2019). Improved mixed-example data augmentation. Winter
Conference on Applications of Computer Vision, 1262–1270. 159
Sun, C., Myers, A., Vondrick, C., Murphy, K., & Schmid, C. (2019). VideoBERT: A joint model for
video and language representation learning. IEEE/CVF International Conference on Computer
Vision, 7464–7473. 238
Sun, C., Shrivastava, A., Singh, S., & Gupta, A. (2017). Revisiting unreasonable effectiveness of data
in deep learning era. IEEE/CVF International Conference on Computer Vision, 843–852. 238
Sun, R.-Y. (2020). Optimization for deep learning: An overview. Journal of the Operations Research
Society of China, 8(2), 249–294. 91
Susmelj, I., Agustsson, E., & Timofte, R. (2017). ABC-GAN: Adaptive blur and control for improved
training stability of generative adversarial networks. ICML Workshop on Implicit Models. 299
Sutskever, I., Martens, J., Dahl, G., & Hinton, G. (2013). On the importance of initialization and
momentum in deep learning. International Conference on Machine Learning, 1139–1147. 93
Sutton, R. S. (1984). Temporal credit assignment in reinforcement learning. Ph.D. thesis, University of Massachusetts Amherst. 396
Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3(1), 9–44. 396
Sutton, R. S., & Barto, A. G. (1999). Reinforcement learning: An introduction. MIT Press. 396
Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction, 2nd Edition. MIT
Press. 16, 396
Sutton, R. S., McAllester, D., Singh, S., & Mansour, Y. (1999). Policy gradient methods for
reinforcement learning with function approximation. Neural Information Processing Systems, 12,
1057–1063. 397
Szegedy, C., Ioffe, S., Vanhoucke, V., & Alemi, A. A. (2017). Inception-v4, Inception-ResNet and the
impact of residual connections on learning. AAAI Conference on Artificial Intelligence, 4278–
4284. 181, 183, 405
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2016). Rethinking the Inception
architecture for computer vision. IEEE/CVF Computer Vision & Pattern Recognition, 2818–
2826. 155, 158, 274
Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., & Fergus, R. (2014).
Intriguing properties of neural networks. International Conference on Learning Representations.
414
Szeliski, R. (2022). Computer vision: Algorithms and applications, 2nd Edition. Springer. 15
Tabak, E. G., & Turner, C. V. (2013). A family of nonparametric density estimation algorithms.
Communications on Pure and Applied Mathematics, 66(2), 145–164. 321
Tabak, E. G., & Vanden-Eijnden, E. (2010). Density estimation by dual ascent of the log-likelihood.
Communications in Mathematical Sciences, 8(1), 217–233. 321
Taddeo, M., & Floridi, L. (2018). How AI can be a force for good. Science, 361(6404), 751–752. 420
Tan, H., & Bansal, M. (2019). LXMERT: Learning cross-modality encoder representations from
transformers. Empirical Methods in Natural Language Processing, 5099–5110. 238
Tan, M., & Le, Q. (2019). EfficientNet: Rethinking model scaling for convolutional neural networks.
International Conference on Machine Learning, 6105–6114. 405
Tay, Y., Bahri, D., Metzler, D., Juan, D.-C., Zhao, Z., & Zheng, C. (2021). Synthesizer: Rethinking
self-attention for transformer models. International Conference on Machine Learning, 10183–
10192. 235
Tay, Y., Bahri, D., Yang, L., Metzler, D., & Juan, D.-C. (2020). Sparse Sinkhorn attention.
International Conference on Machine Learning, 9438–9447. 237
Tay, Y., Dehghani, M., Bahri, D., & Metzler, D. (2023). Efficient transformers: A survey. ACM
Computing Surveys, 55(6), 109:1–109:28. 237
Tegmark, M. (2018). Life 3.0: Being human in the age of artificial intelligence. Vintage. 14
Telgarsky, M. (2016). Benefits of depth in neural networks. PMLR Conference on Learning Theory,
1517–1539. 53, 417
Teru, K., Denis, E., & Hamilton, W. (2020). Inductive relation prediction by subgraph reasoning.
International Conference on Machine Learning, 9448–9457. 265
Tetlock, P. E., & Gardner, D. (2016). Superforecasting: The Art and Science of Prediction. Toronto: Signal, McClelland & Stewart. 435
Tewari, A., Elgharib, M., Bharaj, G., Bernard, F., Seidel, H.-P., Pérez, P., Zollhofer, M., & Theobalt,
C. (2020). StyleRig: Rigging StyleGAN for 3D control over portrait images. IEEE/CVF
Computer Vision & Pattern Recognition, 6142–6151. 300
Teye, M., Azizpour, H., & Smith, K. (2018). Bayesian uncertainty estimation for batch normalized
deep networks. International Conference on Machine Learning, 4907–4916. 204
Theis, L., Oord, A. v. d., & Bethge, M. (2016). A note on the evaluation of generative models.
International Conference on Learning Representations. 322
Thompson, W. R. (1933). On the likelihood that one unknown probability exceeds another in view of
the evidence of two samples. Biometrika, 25(3-4), 285–294. 396
Thompson, W. R. (1935). On the theory of apportionment. American Journal of Mathematics, 57(2),
450–456. 396
Thoppilan, R., De Freitas, D., Hall, J., Shazeer, N., Kulshreshtha, A., Cheng, H.-T., Jin, A., Bos, T.,
Baker, L., Du, Y., et al. (2022). LaMDA: Language models for dialog applications.
arXiv:2201.08239. 234
Tipping, M. E., & Bishop, C. M. (1999). Probabilistic principal component analysis. Journal of the
Royal Statistical Society: Series B, 61(3), 611–622. 344
Tolmeijer, S., Kneer, M., Sarasua, C., Christen, M., & Bernstein, A. (2020). Implementations in
machine ethics: A survey. ACM Computing Surveys, 53(6), 1–38. 424
Tolstikhin, I., Bousquet, O., Gelly, S., & Schoelkopf, B. (2018). Wasserstein autoencoders.
International Conference on Learning Representations. 345
Tomašev, N., Cornebise, J., Hutter, F., Mohamed, S., Picciariello, A., Connelly, B., Belgrave, D. C.,
Ezer, D., Haert, F. C. v. d., Mugisha, F., et al. (2020). AI for social good: Unlocking the
opportunity for positive impact. Nature Communications, 11(1), 2468. 420
Tomasev, N., McKee, K. R., Kay, J., & Mohamed, S. (2021). Fairness for unobserved characteristics:
Insights from technological impacts on queer communities. AAAI/ACM Conference on AI, Ethics,
and Society, 254–265. 424
Tomczak, J. M., & Welling, M. (2016). Improving variational auto-encoders using Householder flow.
NIPS Workshop on Bayesian Deep Learning. 322
Tompson, J., Goroshin, R., Jain, A., LeCun, Y., & Bregler, C. (2015). Efficient object localization
using convolutional networks. IEEE/CVF Computer Vision & Pattern Recognition, 648–656. 183
Torralba, A., Freeman, W., & Isola, P. (2024). Foundations of Computer Vision. MIT Press. 15
Touati, A., Satija, H., Romoff, J., Pineau, J., & Vincent, P. (2020). Randomized value functions via
multiplicative normalizing flows. Uncertainty in Artificial Intelligence, 422–432. 322
Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., & Jégou, H. (2021). Training data-
efficient image transformers & distillation through attention. International Conference on
Machine Learning, 10347–10357. 238
Tran, D., Bourdev, L., Fergus, R., Torresani, L., & Paluri, M. (2015). Learning spatiotemporal
features with 3D convolutional networks. IEEE International Conference on Computer Vision,
4489–4497. 182
Tran, D., Vafa, K., Agrawal, K., Dinh, L., & Poole, B. (2019). Discrete flows: Invertible generative
models of discrete data. Neural Information Processing Systems, 32, 14692–14701. 322, 324
Tran, D., Wang, H., Torresani, L., Ray, J., LeCun, Y., & Paluri, M. (2018). A closer look at
spatiotemporal convolutions for action recognition. IEEE/CVF Computer Vision & Pattern
Recognition, 6450–6459. 181
Tsitsulin, A., Palowitch, J., Perozzi, B., & Müller, E. (2020). Graph clustering with graph neural
networks. arXiv:2006.16904. 262
Tzen, B., & Raginsky, M. (2019). Neural stochastic differential equations: Deep latent Gaussian
models in the diffusion limit. arXiv:1905.09883. 324
Ulku, I., & Akagündüz, E. (2022). A survey on deep learning-based architectures for semantic
segmentation on 2D images. Applied Artificial Intelligence, 36(1). 184
Ulyanov, D., Vedaldi, A., & Lempitsky, V. (2016). Instance normalization: The missing ingredient for
fast stylization. arXiv:1607.08022. 203
Ulyanov, D., Vedaldi, A., & Lempitsky, V. (2018). Deep image prior. IEEE/CVF Computer Vision &
Pattern Recognition, 9446–9454. 418
Urban, G., Geras, K. J., Kahou, S. E., Aslan, O., Wang, S., Caruana, R., Mohamed, A., Philipose, M.,
& Richardson, M. (2017). Do deep convolutional nets really need to be deep and convolutional?
International Conference on Learning Representations. 417, 418
Vahdat, A., Andriyash, E., & Macready, W. (2018a). DVAE#: Discrete variational autoencoders with
relaxed Boltzmann priors. Neural Information Processing Systems, 31, 1869–1878. 344
Vahdat, A., Andriyash, E., & Macready, W. (2020). Undirected graphical models as approximate
posteriors. International Conference on Machine Learning, 9680–9689. 344
Vahdat, A., & Kautz, J. (2020). NVAE: A deep hierarchical variational autoencoder. Neural
Information Processing Systems, 33, 19667–19679. 340, 345, 369
Vahdat, A., Kreis, K., & Kautz, J. (2021). Score-based generative modeling in latent space. Neural
Information Processing Systems, 34, 11287–11302. 370
Vahdat, A., Macready, W., Bian, Z., Khoshaman, A., & Andriyash, E. (2018b). DVAE++: Discrete
variational autoencoders with overlapping transformations. International Conference on Machine
Learning, 5035–5044. 344
Vallor, S. (2011). Carebots and caregivers: Sustaining the ethical ideal of care in the 21st century.
Philosophy & Technology, 24(3), 251–268. 429
Vallor, S. (2015). Moral deskilling and upskilling in a new machine age: Reflections on the
ambiguous future of character. Philosophy & Technology, 28, 107–124. 429
Van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N.,
Senior, A., & Kavukcuoglu, K. (2016a). WaveNet: A generative model for raw audio. ISCA
Speech Synthesis Workshop. 323
Van den Oord, A., Kalchbrenner, N., Espeholt, L., Vinyals, O., Graves, A., et al. (2016b). Conditional
image generation with PixelCNN decoders. Neural Information Processing Systems, 29, 4790–
4798. 274
Van den Oord, A., Kalchbrenner, N., & Kavukcuoglu, K. (2016c). Pixel recurrent neural networks.
International Conference on Machine Learning, 1747–1756. 233, 345
Van den Oord, A., Li, Y., Babuschkin, I., Simonyan, K., Vinyals, O., Kavukcuoglu, K., Driessche, G.,
Lockhart, E., Cobo, L., Stimberg, F., et al. (2018). Parallel WaveNet: Fast high-fidelity speech
synthesis. International Conference on Machine Learning, 3918–3926. 323
Van den Oord, A., Vinyals, O., et al. (2017). Neural discrete representation learning. Neural
Information Processing Systems, 30, 6306–6315. 344, 345
Van Hasselt, H. (2010). Double Q-learning. Neural Information Processing Systems, 23, 2613–2621.
397
Van Hasselt, H., Guez, A., & Silver, D. (2016). Deep reinforcement learning with double Q-learning.
AAAI Conference on Artificial Intelligence, 2094–2100. 397
Van Hoof, H., Chen, N., Karl, M., van der Smagt, P., & Peters, J. (2016). Stable reinforcement
learning with autoencoders for tactile and visual data. IEEE/RSJ International Conference on
Intelligent Robots and Systems, 3928–3934. IEEE. 344
van Wynsberghe, A., & Robbins, S. (2019). Critiquing the reasons for making artificial moral agents.
Science and Engineering Ethics, 25, 719–735. 424
Vapnik, V. (1995). The nature of statistical learning theory. New York: Springer Verlag. 74
Vapnik, V. N., & Chervonenkis, A. Y. (1971). On the uniform convergence of relative frequencies of
events to their probabilities. Measures of Complexity, 11–30. 134
Vardi, G., Yehudai, G., & Shamir, O. (2022). Width is less important than depth in ReLU neural
networks. PMLR Conference on Learning Theory, 1–33. 53
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., &
Polosukhin, I. (2017). Attention is all you need. Neural Information Processing Systems, 30,
5998–6008. 158, 233, 234, 235, 236, 237
Veit, A., Wilber, M. J., & Belongie, S. (2016). Residual networks behave like ensembles of relatively
shallow networks. Neural Information Processing Systems, 29, 550–558. 202, 417
Veličković, P. (2023). Everything is connected: Graph neural networks. Current Opinion in
Structural Biology, 79, 102538. 261
Veličković, P., Cucurull, G., Casanova, A., Romero, A., Lio, P., & Bengio, Y. (2019). Graph attention
networks. International Conference on Learning Representations. 234, 263, 265
Véliz, C. (2020). Privacy is Power: Why and How You Should Take Back Control of Your Data.
Bantam Press. 435
Véliz, C. (2023). Chatbots shouldn't use emojis. Nature, 615, 375. 428
Vijayakumar, A. K., Cogswell, M., Selvaraju, R. R., Sun, Q., Lee, S., Crandall, D., & Batra, D.
(2016). Diverse beam search: Decoding diverse solutions from neural sequence models.
arXiv:1610.02424. 235
Vincent, J. (2020). What a machine learning tool that turns Obama white can (and can't) tell us about
AI bias / a striking image that only hints at a much bigger problem. The Verge, June 23, 2020.
https://www.theverge.com/21298762/face-depixelizer-ai-machine-learning-tool-pulse-stylegan-
obama-bias. 422
Vincent, P., Larochelle, H., Bengio, Y., & Manzagol, P.-A. (2008). Extracting and composing robust
features with denoising autoencoders. International Conference on Machine Learning, 1096–
1103. 344
Voita, E., Talbot, D., Moiseev, F., Sennrich, R., & Titov, I. (2019). Analyzing multi-head self-
attention: Specialized heads do the heavy lifting, the rest can be pruned. Meeting of the
Association for Computational Linguistics, 5797–5808. 235
Voleti, V., Jolicoeur-Martineau, A., & Pal, C. (2022). MCVD: Masked conditional video diffusion for
prediction, generation, and interpolation. Neural Information Processing Systems, 35. 369
Vondrick, C., Pirsiavash, H., & Torralba, A. (2016). Generating videos with scene dynamics. Neural
Information Processing Systems, 29, 613–621. 299
Wachter, S., Mittelstadt, B., & Floridi, L. (2017). Why a right to explanation of automated decision-
making does not exist in the general data protection regulation. International Data Privacy Law,
7(2), 76–99. 425
Waibel, A., Hanazawa, T., Hinton, G., Shikano, K., & Lang, K. J. (1989). Phoneme recognition using
time-delay neural networks. IEEE Transactions on Acoustics, Speech, and Signal Processing, 37
(3), 328–339. 181
Wallach, W., Allen, C., & Smit, I. (2008). Machine morality: Bottom-up and top-down approaches
for modeling human moral faculties. AI & Society, 22(4), 565–582. 424
Wan, L., Zeiler, M., Zhang, S., Le Cun, Y., & Fergus, R. (2013). Regularization of neural networks
using DropConnect. International Conference on Machine Learning, 1058–1066. 158
Wan, Z., Zhang, J., Chen, D., & Liao, J. (2021). High-fidelity pluralistic image completion with
transformers. IEEE/CVF International Conference on Computer Vision, 4692–4701. 238
Wang, A., Pruksachatkun, Y., Nangia, N., Singh, A., Michael, J., Hill, F., Levy, O., & Bowman, S.
(2019a). SuperGLUE: A stickier benchmark for general-purpose language understanding
systems. Neural Information Processing Systems, 32, 3261–3275. 234
Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., & Bowman, S. R. (2019b). GLUE: A multitask
benchmark and analysis platform for natural language understanding. International Conference
on Learning Representations. 234
Wang, B., Shang, L., Lioma, C., Jiang, X., Yang, H., Liu, Q., & Simonsen, J. G. (2020a). On position
embeddings in BERT. International Conference on Learning Representations. 236
Wang, C.-Y., Bochkovskiy, A., & Liao, H.-Y. M. (2022a). YOLOv7: Trainable bag-of-freebies sets new
state-of-the-art for real-time object detectors. arXiv:2207.02696. 184
Wang, P. Z., & Wang, W. Y. (2019). Riemannian normalizing flow on variational Wasserstein
autoencoder for text modeling. ACL Human Language Technologies, 284–294. 324
Wang, S., Li, B. Z., Khabsa, M., Fang, H., & Ma, H. (2020b). Linformer: Self-attention with linear
complexity. arXiv:2006.04768. 237
Wang, T., Liu, M., Zhu, J., Yakovenko, N., Tao, A., Kautz, J., & Catanzaro, B. (2018a). Video-to-
video synthesis. Neural Information Processing Systems, 31, 1152–1164. 299
Wang, T.-C., Liu, M.-Y., Zhu, J.-Y., Tao, A., Kautz, J., & Catanzaro, B. (2018b). High-resolution
image synthesis and semantic manipulation with conditional GANs. IEEE/CVF Computer Vision
& Pattern Recognition, 8798–8807. 300, 301
Wang, W., Xie, E., Li, X., Fan, D.-P., Song, K., Liang, D., Lu, T., Luo, P., & Shao, L. (2021).
Pyramid vision transformer: A versatile backbone for dense prediction without convolutions.
IEEE/CVF International Conference on Computer Vision, 568–578. 238
Wang, W., Yao, L., Chen, L., Lin, B., Cai, D., He, X., & Liu, W. (2022b). Crossformer: A versatile
vision transformer hinging on cross-scale attention. International Conference on Learning
Representations. 238
Wang, X., Girshick, R., Gupta, A., & He, K. (2018c). Non-local neural networks. IEEE/CVF
Computer Vision & Pattern Recognition, 7794–7803. 238
Wang, X., Wang, S., Liang, X., Zhao, D., Huang, J., Xu, X., Dai, B., & Miao, Q. (2022c). Deep
reinforcement learning: A survey. IEEE Transactions on Neural Networks and Learning Systems.
396
Wang, Y., & Kosinski, M. (2018). Deep neural networks are more accurate than humans at detecting
sexual orientation from facial images. Journal of Personality and Social Psychology, 114(2),
246–257. 430
Wang, Y., Mohamed, A., Le, D., Liu, C., Xiao, A., Mahadeokar, J., Huang, H., Tjandra, A., Zhang,
X., Zhang, F., et al. (2020c). Transformer-based acoustic modeling for hybrid speech recognition.
IEEE International Conference on Acoustics, Speech and Signal Processing, 6874–6878. 234
Wang, Z., Bapst, V., Heess, N., Mnih, V., Munos, R., Kavukcuoglu, K., & de Freitas, N. (2017).
Sample efficient actor-critic with experience replay. International Conference on Learning
Representations. 398
Wang, Z., Schaul, T., Hessel, M., van Hasselt, H., Lanctot, M., & Freitas, N. (2016). Dueling network
architectures for deep reinforcement learning. International Conference on Machine Learning,
1995–2003. 397
Ward, P. N., Smofsky, A., & Bose, A. J. (2019). Improving exploration in soft-actor-critic with
normalizing flows policies. ICML Workshop on Invertible Neural Networks and Normalizing
Flows. 322
Watkins, C. J., & Dayan, P. (1992). Q-learning. Machine Learning, 8(3-4), 279–292. 396
Watkins, C. J. C. H. (1989). Learning from delayed rewards. Ph.D. dissertation, University of Cambridge. 396
Wehenkel, A., & Louppe, G. (2019). Unconstrained monotonic neural networks. Neural Information
Processing Systems, 32, 1543–1553. 323
Wei, J., Ren, X., Li, X., Huang, W., Liao, Y., Wang, Y., Lin, J., Jiang, X., Chen, X., & Liu, Q. (2019).
NEZHA: Neural contextualized representation for Chinese language understanding.
arXiv:1909.00204. 236
Wei, J., & Zou, K. (2019). EDA: Easy data augmentation techniques for boosting performance on
text classification tasks. ACL Empirical Methods in Natural Language Processing, 6382–6388.
160
Weidinger, L., Uesato, J., Rauh, M., Griffin, C., Huang, P.-S., Mellor, J., Glaese, A., Cheng, M.,
Balle, B., Kasirzadeh, A., Biles, C., Brown, S., Kenton, Z., Hawkins, W., Stepleton, T., Birhane,
A., Hendricks, L. A., Rimell, L., Isaac, W., Haas, J., Legassick, S., Irving, G., & Gabriel, I.
(2022). Taxonomy of risks posed by language models. ACM Conference on Fairness,
Accountability, and Transparency, 214–229. 428
Weisfeiler, B., & Leman, A. (1968). The reduction of a graph to canonical form and the algebra
which appears therein. NTI, Series, 2(9), 12–16. 264
Welling, M., & Teh, Y. W. (2011). Bayesian learning via stochastic gradient Langevin dynamics.
International Conference on Machine Learning, 681–688. 159
Wen, Y.-H., Yang, Z., Fu, H., Gao, L., Sun, Y., & Liu, Y.-J. (2021). Autoregressive stylized motion
synthesis with generative flow. IEEE/CVF Computer Vision & Pattern Recognition, 13612–
13621. 322
Wenzel, F., Roth, K., Veeling, B. S., Świątkowski, J., Tran, L., Mandt, S., Snoek, J., Salimans, T.,
Jenatton, R., & Nowozin, S. (2020a). How good is the Bayes posterior in deep neural networks
really? International Conference on Machine Learning, 10248–10259. 159
Wenzel, F., Snoek, J., Tran, D., & Jenatton, R. (2020b). Hyperparameter ensembles for robustness
and uncertainty quantification. Neural Information Processing Systems, 33, 6514–6527. 158
Werbos, P. (1974). Beyond regression: New tools for prediction and analysis in the behavioral
sciences. Ph.D. dissertation, Harvard University. 113
White, T. (2016). Sampling generative networks. arXiv:1609.04468. 342, 344
Whitney, H. (1932). Congruent graphs and the connectivity of graphs. Hassler Whitney Collected
Papers, 61–79. 264
Wightman, R., Touvron, H., & Jégou, H. (2021). ResNet strikes back: An improved training
procedure in timm. Neural Information Processing Systems Workshop. 202
Williams, C. K., & Rasmussen, C. E. (2006). Gaussian processes for machine learning. MIT Press.
15
Williams, P. M. (1996). Using neural networks to model conditional multivariate densities. Neural
Computation, 8(4), 843–854. 73
Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist
reinforcement learning. Machine Learning, 8(3), 229–256. 397
Wilson, A. C., Roelofs, R., Stern, M., Srebro, N., & Recht, B. (2017). The marginal value of adaptive
gradient methods in machine learning. Neural Information Processing Systems, 30, 4148–4158.
94, 410
Wirnsberger, P., Ballard, A. J., Papamakarios, G., Abercrombie, S., Racanière, S., Pritzel, A., Jimenez
Rezende, D., & Blundell, C. (2020). Targeted free energy estimation via learned mappings. The
Journal of Chemical Physics, 153(14), 144112. 322
Wolf, S. (2021). ProGAN: How NVIDIA generated images of unprecedented quality.
https://towardsdatascience.com/progan-how-nvidia-generated-images-of-unprecedented-quality-
51c98ec2cbd2. 286
Wolf, V., Lugmayr, A., Danelljan, M., Van Gool, L., & Timofte, R. (2021). DeFlow: Learning
complex image degradations from unpaired data with conditional flows. IEEE/CVF Computer
Vision & Pattern Recognition, 94–103. 322
Wolfe, C. R., Yang, J., Chowdhury, A., Dun, C., Bayer, A., Segarra, S., & Kyrillidis, A. (2021). GIST:
Distributed training for large-scale graph convolutional networks. NeurIPS Workshop on New
Frontiers in Graph Learning. 264
Wolpert, D. H. (1992). Stacked generalization. Neural Networks, 5(2), 241–259. 158
Wong, K. W., Contardo, G., & Ho, S. (2020). Gravitational-wave population inference with deep
flow-based generative network. Physical Review D, 101(12), 123005. 322
Worrall, D. E., Garbin, S. J., Turmukhambetov, D., & Brostow, G. J. (2017). Harmonic networks:
Deep translation and rotation equivariance. IEEE/CVF Computer Vision & Pattern Recognition,
5028–5037. 183
Wu, B., Xu, C., Dai, X., Wan, A., Zhang, P., Yan, Z., Tomizuka, M., Gonzalez, J., Keutzer, K., &
Vajda, P. (2020a). Visual transformers: Token-based image representation and processing for
computer vision. arXiv:2006.03677. 238
Wu, F., Fan, A., Baevski, A., Dauphin, Y. N., & Auli, M. (2019). Pay less attention with lightweight
and dynamic convolutions. International Conference on Learning Representations. 235
Wu, H., & Gu, X. (2015). Max-pooling dropout for regularization of convolutional neural networks.
Neural Information Processing Systems, 18, 46–54. 183
Wu, J., Huang, Z., Thoma, J., Acharya, D., & Van Gool, L. (2018a). Wasserstein divergence for
GANs. European Conference on Computer Vision, 653–668. 299
Wu, J., Zhang, C., Xue, T., Freeman, B., & Tenenbaum, J. (2016). Learning a probabilistic latent
space of object shapes via 3D generative-adversarial modeling. Neural Information Processing
Systems, 29, 82–90. 299
Wu, N., Green, B., Ben, X., & O’Banion, S. (2020b). Deep transformer models for time series
forecasting: The influenza prevalence case. arXiv:2001.08317. 234
Wu, R., Yan, S., Shan, Y., Dang, Q., & Sun, G. (2015a). Deep image: Scaling up image recognition.
arXiv:1501.02876, 7(8). 154
Wu, S., Sun, F., Zhang, W., Xie, X., & Cui, B. (2023). Graph neural networks in recommender
systems: A survey. ACM Computing Surveys, 55(5), 97:1–97:37. 262
Wu, X., & Zhang, X. (2016). Automated inference on criminality using face images.
arXiv:1611.04135. 427
Wu, Y., Burda, Y., Salakhutdinov, R., & Grosse, R. (2017). On the quantitative analysis of decoder-
based generative models. International Conference on Learning Representations. 300
Wu, Y., & He, K. (2018). Group normalization. European Conference on Computer Vision, 3–19.
203, 204
Wu, Z., Lischinski, D., & Shechtman, E. (2021). StyleSpace analysis: Disentangled controls for
StyleGAN image generation. IEEE/CVF Computer Vision & Pattern Recognition, 12863–12872.
300
Wu, Z., Nagarajan, T., Kumar, A., Rennie, S., Davis, L. S., Grauman, K., & Feris, R. (2018b).
BlockDrop: Dynamic inference paths in residual networks. IEEE/CVF Computer Vision &
Pattern Recognition, 8817–8826. 203
Wu, Z., Pan, S., Chen, F., Long, G., Zhang, C., & Philip, S. Y. (2020c). A comprehensive survey on
graph neural networks. IEEE Transactions on Neural Networks and Learning Systems, 32(1), 4–
24. 261
Wu, Z., Song, S., Khosla, A., Yu, F., Zhang, L., Tang, X., & Xiao, J. (2015b). 3D ShapeNets: A deep
representation for volumetric shapes. IEEE/CVF Computer Vision & Pattern Recognition, 1912–
1920. 182
Xia, F., Liu, T.-Y., Wang, J., Zhang, W., & Li, H. (2008). Listwise approach to learning to rank:
theory and algorithm. International Conference on Machine Learning, 1192–1199. 73
Xia, W., Zhang, Y., Yang, Y., Xue, J.-H., Zhou, B., & Yang, M.-H. (2022). GAN inversion: A survey.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 1–17. 301
Xiao, L., Bahri, Y., Sohl-Dickstein, J., Schoenholz, S., & Pennington, J. (2018a). Dynamical isometry
and a mean field theory of CNNs: How to train 10,000-layer vanilla convolutional neural
networks. International Conference on Machine Learning, 5393–5402. 114, 183
Xiao, S., Wang, S., Dai, Y., & Guo, W. (2022a). Graph neural networks in node classification: Survey
and evaluation. Machine Vision and Applications, 33(1), 1–19. 262
Xiao, T., Hong, J., & Ma, J. (2018b). DNA-GAN: Learning disentangled representations from multi-
attribute images. International Conference on Learning Representations. 301
Xiao, Z., Kreis, K., & Vahdat, A. (2022b). Tackling the generative learning trilemma with denoising
diffusion GANs. International Conference on Learning Representations. 370
Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J. M., & Luo, P. (2021). SegFormer: Simple and
efficient design for semantic segmentation with transformers. Neural Information Processing
Systems, 34, 12077–12090. 238
Xie, L., Wang, J., Wei, Z., Wang, M., & Tian, Q. (2016). DisturbLabel: Regularizing CNN on the loss
layer. IEEE/CVF Computer Vision & Pattern Recognition, 4753–4762. 158
Xie, S., Girshick, R., Dollár, P., Tu, Z., & He, K. (2017). Aggregated residual transformations for
deep neural networks. IEEE/CVF Computer Vision & Pattern Recognition, 1492–1500. 181, 202,
405
Xie, Y., & Li, Q. (2022). Measurement-conditioned denoising diffusion probabilistic model for
under-sampled medical image reconstruction. Medical Image Computing and Computer Assisted
Intervention, 13436, 655–664. 369
Xing, E. P., Ho, Q., Dai, W., Kim, J. K., Wei, J., Lee, S., Zheng, X., Xie, P., Kumar, A., & Yu, Y.
(2015). Petuum: A new platform for distributed machine learning on big data. IEEE Transactions
on Big Data, 1(2), 49–67. 114
Xing, Y., Qian, Z., & Chen, Q. (2021). Invertible image signal processing. IEEE/CVF Computer
Vision & Pattern Recognition, 6287–6296. 322
Xiong, R., Yang, Y., He, D., Zheng, K., Zheng, S., Xing, C., Zhang, H., Lan, Y., Wang, L., & Liu, T.
(2020a). On layer normalization in the transformer architecture. International Conference on
Machine Learning, 10524–10533. 237
Xiong, Z., Yuan, Y., Guo, N., & Wang, Q. (2020b). Variational context-deformable convnets for
indoor scene parsing. IEEE/CVF Computer Vision & Pattern Recognition, 3992–4002. 183
Xu, B., Wang, N., Chen, T., & Li, M. (2015). Empirical evaluation of rectified activations in
convolutional network. arXiv:1505.00853. 158, 160
Xu, K., Hu, W., Leskovec, J., & Jegelka, S. (2019). How powerful are graph neural networks?
International Conference on Learning Representations. 264
Xu, K., Li, C., Tian, Y., Sonobe, T., Kawarabayashi, K.-i., & Jegelka, S. (2018). Representation
learning on graphs with jumping knowledge networks. International Conference on Machine
Learning, 5453–5462. 263, 265, 266
Xu, K., Zhang, M., Jegelka, S., & Kawaguchi, K. (2021a). Optimization of graph neural networks:
Implicit acceleration by skip connections and more depth. International Conference on Machine
Learning, 11592–11602. 266
Xu, P., Cheung, J. C. K., & Cao, Y. (2020). On variational learning of controllable representations for
text without supervision. International Conference on Machine Learning, 10534–10543. 343,
345
Xu, P., Kumar, D., Yang, W., Zi, W., Tang, K., Huang, C., Cheung, J. C. K., Prince, S. J. D., & Cao,
Y. (2021b). Optimizing deeper transformers on small datasets. Meeting of the Association for
Computational Linguistics. 114, 234, 238
Yamada, Y., Iwamura, M., Akiba, T., & Kise, K. (2019). Shakedrop regularization for deep residual
learning. IEEE Access, 7, 186126–186136. 202, 203
Yamada, Y., Iwamura, M., & Kise, K. (2016). Deep pyramidal residual networks with separated
stochastic depth. arXiv:1612.01230. 202
Yan, X., Yang, J., Sohn, K., & Lee, H. (2016). Attribute2Image: Conditional image generation from
visual attributes. European Conference on Computer Vision, 776–791. 301
Yang, F., Yang, H., Fu, J., Lu, H., & Guo, B. (2020a). Learning texture transformer network for
image super-resolution. IEEE/CVF Computer Vision & Pattern Recognition, 5791–5800. 238
Yang, G., Pennington, J., Rao, V., Sohl-Dickstein, J., & Schoenholz, S. S. (2019). A mean field theory
of batch normalization. International Conference on Learning Representations. 203
Yang, K., Goldman, S., Jin, W., Lu, A. X., Barzilay, R., Jaakkola, T., & Uhler, C. (2021). Mol2Image:
Improved conditional flow models for molecule to image synthesis. IEEE/CVF Computer Vision
& Pattern Recognition, 6688–6698. 322
Yang, Q., Zhang, Y., Dai, W., & Pan, S. J. (2020b). Transfer learning. Cambridge University Press.
159
Yang, R., Srivastava, P., & Mandt, S. (2022). Diffusion probabilistic modeling for video generation.
arXiv:2203.09481. 369, 371
Yao, W., Zeng, Z., Lian, C., & Tang, H. (2018). Pixel-wise regression using U-Net and its application
on pansharpening. Neurocomputing, 312, 364–371. 205
Ye, H., & Young, S. (2004). High quality voice morphing. IEEE International Conference on
Acoustics, Speech, and Signal Processing, 1–9. 160
Ye, L., Rochan, M., Liu, Z., & Wang, Y. (2019). Cross-modal self-attention network for referring
image segmentation. IEEE/CVF Computer Vision & Pattern Recognition, 10502–10511. 238
Ye, W., Liu, S., Kurutach, T., Abbeel, P., & Gao, Y. (2021). Mastering Atari games with limited data.
Neural Information Processing Systems, 34, 25476–25488. 396
Ying, R., He, R., Chen, K., Eksombatchai, P., Hamilton, W. L., & Leskovec, J. (2018a). Graph
convolutional neural networks for web-scale recommender systems. ACM SIGKDD International
Conference on Knowledge Discovery & Data Mining, 974–983. 264, 265
Ying, Z., You, J., Morris, C., Ren, X., Hamilton, W., & Leskovec, J. (2018b). Hierarchical graph
representation learning with differentiable pooling. Neural Information Processing Systems, 31,
4805–4815. 265
Yoshida, Y., & Miyato, T. (2017). Spectral norm regularization for improving the generalizability of
deep learning. arXiv:1705.10941. 156
You, Y., Chen, T., Wang, Z., & Shen, Y. (2020). When does self-supervision help graph convolutional
networks? International Conference on Machine Learning, 10871–10880. 159
Yu, F., & Koltun, V. (2015). Multi-scale context aggregation by dilated convolutions. International
Conference on Learning Representations. 181
Yu, J., Lin, Z., Yang, J., Shen, X., Lu, X., & Huang, T. S. (2019). Free-form image inpainting with
gated convolution. IEEE/CVF International Conference on Computer Vision, 4471–4480. 181
Yu, J., Zheng, Y., Wang, X., Li, W., Wu, Y., Zhao, R., & Wu, L. (2021). FastFlow: Unsupervised
anomaly detection and localization via 2D normalizing flows. arXiv:2111.07677. 322
Yu, J. J., Derpanis, K. G., & Brubaker, M. A. (2020). Wavelet flow: Fast training of high resolution
normalizing flows. Neural Information Processing Systems, 33, 6184–6196. 322
Yu, L., Zhang, W., Wang, J., & Yu, Y. (2017). SeqGAN: Sequence generative adversarial nets with
policy gradient. AAAI Conference on Artificial Intelligence, 2852–2858. 299
Yun, S., Han, D., Oh, S. J., Chun, S., Choe, J., & Yoo, Y. (2019). CutMix: Regularization strategy to
train strong classifiers with localizable features. IEEE/CVF International Conference on
Computer Vision, 6023–6032. 160
Zagoruyko, S., & Komodakis, N. (2016). Wide residual networks. British Machine Vision
Conference. 202, 417
Zagoruyko, S., & Komodakis, N. (2017). Paying more attention to attention: Improving the
performance of convolutional neural networks via attention transfer. International Conference on
Learning Representations. 415
Zaheer, M., Kottur, S., Ravanbakhsh, S., Poczos, B., Salakhutdinov, R. R., & Smola, A. J. (2017).
Deep sets. Neural Information Processing Systems, 30, 3391–3401. 263
Zaheer, M., Reddi, S., Sachan, D., Kale, S., & Kumar, S. (2018). Adaptive methods for nonconvex
optimization. Neural Information Processing Systems, 31, 9815–9825. 93
Zaslavsky, T. (1975). Facing up to arrangements: Face-count formulas for partitions of space by
hyperplanes. Memoirs of the American Mathematical Society. 38, 40
Zeiler, M. D. (2012). ADADELTA: An adaptive learning rate method. arXiv:1212.5701. 93
Zeiler, M. D., & Fergus, R. (2014). Visualizing and understanding convolutional networks. European
Conference on Computer Vision, 818–833. 181, 184
Zeiler, M. D., Taylor, G. W., & Fergus, R. (2011). Adaptive deconvolutional networks for mid and
high level feature learning. IEEE International Conference on Computer Vision, 2018–2025. 181
Zeng, H., Zhou, H., Srivastava, A., Kannan, R., & Prasanna, V. (2020). GraphSAINT: Graph
sampling based inductive learning method. International Conference on Learning
Representations. 264
Zeng, Y., Fu, J., Chao, H., & Guo, B. (2019). Learning pyramid-context encoder network for high-
quality image inpainting. IEEE/CVF Computer Vision & Pattern Recognition, 1486–1494. 205
Zhai, S., Talbott, W., Srivastava, N., Huang, C., Goh, H., Zhang, R., & Susskind, J. (2021). An
attention free transformer. 235
Zhang, A., Lipton, Z. C., Li, M., & Smola, A. J. (2023). Dive into deep learning. Cambridge
University Press. 15
Zhang, C., Bengio, S., Hardt, M., Recht, B., & Vinyals, O. (2017a). Understanding deep learning
requires rethinking generalization. International Conference on Learning Representations. 156,
403, 418
Zhang, C., Ouyang, X., & Patras, P. (2017b). ZipNet-GAN: Inferring fine-grained mobile traffic
patterns via a generative adversarial neural network. International Conference on emerging
Networking EXperiments and Technologies, 363–375. 299
Zhang, H., Cisse, M., Dauphin, Y. N., & Lopez-Paz, D. (2017c). mixup: Beyond empirical risk
minimization. International Conference on Learning Representations. 160
Zhang, H., Dauphin, Y. N., & Ma, T. (2019a). Fixup initialization: Residual learning without
normalization. International Conference on Learning Representations. 114, 205
Zhang, H., Goodfellow, I., Metaxas, D., & Odena, A. (2019b). Self-attention generative adversarial
networks. International Conference on Machine Learning, 7354–7363. 299
Zhang, H., Hsieh, C.-J., & Akella, V. (2016a). Hogwild++: A new mechanism for decentralized
asynchronous stochastic gradient descent. IEEE International Conference on Data Mining, 629–
638. 114
Zhang, H., Xu, T., Li, H., Zhang, S., Wang, X., Huang, X., & Metaxas, D. N. (2017d). StackGAN:
Text to photo-realistic image synthesis with stacked generative adversarial networks. IEEE/CVF
International Conference on Computer Vision, 5907–5915. 300, 301
Zhang, J., & Meng, L. (2019). GResNet: Graph residual network for reviving deep GNNs from
suspended animation. arXiv:1909.05729. 263
Zhang, J., Shi, X., Xie, J., Ma, H., King, I., & Yeung, D.-Y. (2018a). GaAN: Gated attention
networks for learning on large and spatiotemporal graphs. Uncertainty in Artificial Intelligence,
339–349. 263
Zhang, J., Zhang, H., Xia, C., & Sun, L. (2020). Graph-Bert: Only attention is needed for learning
graph representations. arXiv:2001.05140. 263
Zhang, K., Yang, Z., & Başar, T. (2021a). Multi-agent reinforcement learning: A selective overview
of theories and algorithms. Handbook of Reinforcement Learning and Control, 321–384. 398
Zhang, L., & Agrawala, M. (2023). Adding conditional control to text-to-image diffusion models.
arXiv:2302.05543. 370
Zhang, M., & Chen, Y. (2018). Link prediction based on graph neural networks. Neural Information
Processing Systems, 31, 5171–5181. 262
Zhang, M., Cui, Z., Neumann, M., & Chen, Y. (2018b). An end-to-end deep learning architecture for
graph classification. AAAI Conference on Artificial Intelligence, 4438–4445. 262, 265
Zhang, Q., & Chen, Y. (2021). Diffusion normalizing flow. Neural Information Processing Systems,
34, 16280–16291. 371
Zhang, R. (2019). Making convolutional networks shift-invariant again. International Conference on
Machine Learning, 7324–7334. 182, 183
Zhang, R., Isola, P., & Efros, A. A. (2016b). Colorful image colorization. European Conference on
Computer Vision, 649–666. 159
Zhang, S., Tong, H., Xu, J., & Maciejewski, R. (2019c). Graph convolutional networks: A
comprehensive review. Computational Social Networks, 6(1), 1–23. 262
Zhang, S., Zhang, C., Kang, N., & Li, Z. (2021b). iVPF: Numerical invertible volume preserving
flow for efficient lossless compression. IEEE/CVF Computer Vision & Pattern Recognition, 620–
629. 322
Zhang, X., Zhao, J., & LeCun, Y. (2015). Character-level convolutional networks for text
classification. Neural Information Processing Systems, 28, 649–657. 182
Zhao, H., Jia, J., & Koltun, V. (2020a). Exploring self-attention for image recognition. IEEE/CVF
Computer Vision & Pattern Recognition, 10076–10085. 238
Zhao, J., Mathieu, M., & LeCun, Y. (2017a). Energy-based generative adversarial network.
International Conference on Learning Representations. 299
Zhao, L., & Akoglu, L. (2020). PairNorm: Tackling oversmoothing in GNNs. International
Conference on Learning Representations. 265
Zhao, L., Mo, Q., Lin, S., Wang, Z., Zuo, Z., Chen, H., Xing, W., & Lu, D. (2020b). UCTGAN:
Diverse image inpainting based on unsupervised cross-space translation. IEEE/CVF Computer
Vision & Pattern Recognition, 5741–5750. 238
Zhao, S., Song, J., & Ermon, S. (2017b). InfoVAE: Balancing learning and inference in variational
autoencoders. AAAI Conference on Artificial Intelligence, 5885–5892. 345
Zhao, S., Song, J., & Ermon, S. (2017c). Towards deeper understanding of variational autoencoding
models. arXiv:1702.08658. 345
Zheng, C., Cham, T.-J., & Cai, J. (2021). TFill: Image completion via a transformer-based
architecture. arXiv:2104.00845. 238
Zheng, G., Yang, Y., & Carbonell, J. (2017). Convolutional normalizing flows. arXiv:1711.02255.
322
Zheng, Q., Zhang, A., & Grover, A. (2022). Online decision transformer. International Conference on
Machine Learning, 162, 27042–27059. 398
Zhong, Z., Zheng, L., Kang, G., Li, S., & Yang, Y. (2020). Random erasing data augmentation. AAAI
Conference on Artificial Intelligence, 13001–13008. 159
Zhou, C., Ma, X., Wang, D., & Neubig, G. (2019). Density matching for bilingual word embedding.
ACL Human Language Technologies, 1588–1598. 322
Zhou, H., Alvarez, J. M., & Porikli, F. (2016a). Less is more: Towards compact CNNs. European
Conference on Computer Vision, 662–677. 414
Zhou, J., Cui, G., Hu, S., Zhang, Z., Yang, C., Liu, Z., Wang, L., Li, C., & Sun, M. (2020a). Graph
neural networks: A review of methods and applications. AI Open, 1, 57–81. 261
Zhou, K., Huang, X., Li, Y., Zha, D., Chen, R., & Hu, X. (2020b). Towards deeper graph neural
networks with differentiable group normalization. Neural Information Processing Systems, 33,
4917–4928. 265
Zhou, L., Du, Y., & Wu, J. (2021). 3D shape generation and completion through point-voxel
diffusion. IEEE/CVF International Conference on Computer Vision, 5826–5835. 369
Zhou, T., Krahenbuhl, P., Aubry, M., Huang, Q., & Efros, A. A. (2016b). Learning dense
correspondence via 3D-guided cycle consistency. IEEE/CVF Computer Vision & Pattern
Recognition, 117–126. 301
Zhou, Y.-T., & Chellappa, R. (1988). Computation of optical flow using a neural network. IEEE
International Conference on Neural Networks, 71–78. 181
Zhou, Z., & Li, X. (2017). Graph convolution: A high-order and adaptive approach.
arXiv:1706.09916. 263
Zhou, Z., Rahman Siddiquee, M. M., Tajbakhsh, N., & Liang, J. (2018). UNet++: A nested U-Net
architecture for medical image segmentation. Deep Learning in Medical Image Analysis
Workshop, 3–11. 205
Zhu, C., Ni, R., Xu, Z., Kong, K., Huang, W. R., & Goldstein, T. (2021). GradInit: Learning to
initialize neural networks for stable and efficient training. Neural Information Processing
Systems, 34, 16410–16422. 113
Zhu, J., Krähenbühl, P., Shechtman, E., & Efros, A. A. (2016). Generative visual manipulation on the
natural image manifold. European Conference on Computer Vision, 597–613. 301
Zhu, J., Shen, Y., Zhao, D., & Zhou, B. (2020a). In-domain GAN inversion for real image editing.
European Conference on Computer Vision, 592–608. 301
Zhu, J.-Y., Park, T., Isola, P., & Efros, A. A. (2017). Unpaired image-to-image translation using
cycle-consistent adversarial networks. IEEE/CVF International Conference on Computer Vision,
2223–2232. 296, 301
Zhu, X., Su, W., Lu, L., Li, B., Wang, X., & Dai, J. (2020b). Deformable DETR: Deformable
transformers for end-to-end object detection. International Conference on Learning
Representations. 238
Zhuang, F., Qi, Z., Duan, K., Xi, D., Zhu, Y., Zhu, H., Xiong, H., & He, Q. (2020). A comprehensive
survey on transfer learning. Proceedings of the IEEE, 109(1), 43–76. 159
Ziegler, Z., & Rush, A. (2019). Latent normalizing flows for discrete sequences. International
Conference on Machine Learning, 7673–7682. 322, 323
Zong, B., Song, Q., Min, M. R., Cheng, W., Lumezanu, C., Cho, D., & Chen, H. (2018). Deep
autoencoding Gaussian mixture model for unsupervised anomaly detection. International
Conference on Learning Representations. 344
Zou, D., Cao, Y., Zhou, D., & Gu, Q. (2020). Gradient descent optimizes over-parameterized deep
ReLU networks. Machine Learning, 109, 467–492. 404
Zou, D., Hu, Z., Wang, Y., Jiang, S., Sun, Y., & Gu, Q. (2019). Layer-dependent importance sampling
for training deep and large graph convolutional networks. Neural Information Processing
Systems, 32, 11247–11256. 264
Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the
Royal Statistical Society: Series B, 67(2), 301–320. 156
Zou, Z., Chen, K., Shi, Z., Guo, Y., & Ye, J. (2023). Object detection in 20 years: A survey.
Proceedings of the IEEE. 184
Index
ℓ2 norm, 442
ℓ∞ norm, 442
ℓp norm, 442
<cls> token, 221
1×1 convolution, 174, 181
1D convolution, 163
1D convolutional network, 162–170, 182
2D convolutional network, 170–174
3D U-Net, 205
3D convolutional network, 182
ACGAN, 288
action value, 377
activation, 35
activation function, 25, 38
concatenated ReLU, 38
ELU, 38
GeLU, 38
HardSwish, 38
leaky ReLU, 38
logistic sigmoid, 38
parametric ReLU, 38
ReLU, 25, 38
scaled exponential linear unit, 113
SiLU, 38
Softplus, 38
Swish, 38
tanh, 38
activation normalization, 113
activation pattern, 27
ActNorm, 113
actor-critic method, 393
AdaDelta, 93
AdaGrad, 93
Adam, 88, 93
rectified, 93
AdamW, 94, 155
adaptive kernels, 183
adaptive moment estimation, 93
adaptive training methods, 93
adjacency matrix, 243–245
adjoint graph, 260
advantage estimate, 391
advantage function, 393
adversarial attack, 413
adversarial loss, 292, 301
adversarial training, 149
affine function, 446
aggregated posterior, 340, 341
AlexNet, 174
algorithmic differentiation, 106
AMSGrad, 93
ancestral sampling, 459
argmax function, 437
argmin function, 437
artificial moral agency, 424
ethical impact agent, 424
explicit ethical agent, 424
full ethical agent, 424
implicit ethical agent, 424
asynchronous data parallelism, 114
ATARI 2600 benchmark, 386
atrous convolution, 181
attention
additive, 235
as routing, 235
graph attention network, 258
key-value, 235
local, 237
memory-compressed, 235, 237
multiplicative, 235
squeeze-and-excitation network, 235
synthesizer, 235
augmentation, 152–154, 159
in graph neural networks, 264
autocorrelation function, 441
autoencoder, 344
variational, 326–347
automatic translation, 226
automation bias, 429
automation of jobs, 13
autoregressive flow, 311–313, 323
auxiliary classifier GAN, 288
average pooling, 171, 181
asymptotic notation, 438
Gabor model, 80
gamma function, 440
GAN, see generative adversarial network
gated convolution, 181
gated multi-layer perceptron, 235
Gaussian distribution, see normal distribution
GeLU, 38
generalization, 118, 402
factors that determine, 410–414
generative adversarial network, 275–302
ACGAN, 288
adversarial loss, 292
conditional, 288, 300
conditional generation, 288–290
CycleGAN, 292–295
DCGAN, 278
difficulty of training, 279
discriminator, 275
editing images with, 301
generator, 275
image translation, 290–295
InfoGAN, 290
inverting, 301
least squares, 299
loss function, 276, 299
mini-batch discrimination, 288, 300
mode collapse, 279, 300
mode dropping, 279
multiple scales, 300
PatchGAN, 291
Pix2Pix, 291
progressive growing, 286, 300
SRGAN, 292
StyleGAN, 295–297, 300
tricks for training, 299
truncation trick, 288
VEEGAN, 300
Wasserstein, 280–285, 299
gradient penalty, 285
weight clipping, 285
generative model, 7, 23, 223, 269
desirable properties, 269
quantifying performance, 271
generator, 275
geodesic CNN, 265
geometric graph
example, 241
geodesic CNN, 265
MoNet, 265
ghost batch normalization, 203
GLIDE, 370
global minimum, 81
Glorot initialization, 113
GLOW, 318–320, 323
Goldilocks zone, 410, 412
GoogLeNet, 181
GPT3, 222–227
decoding, 223
few-shot learning, 224
GPU, 107
gradient checkpointing, 114
gradient descent, 77–78, 91
GradInit, 113
graph
adjacency matrix, 243–245
adjoint, 260
edge, 240, 260
directed, 241
embedding, 243
undirected, 241
examples, 240
expansion problem, 254
geometric, 241
heterogeneous, 241
hierarchical, 241
knowledge, 241
line, 260
max pooling aggregation, 258
neighborhood sampling, 254
node, 240
embedding, 243, 244
partitioning, 254
real world, 240
tasks, 246
types, 241
graph attention network, 258, 263
graph isomorphism network, 264
graph Laplacian, 262
graph neural network, 240–267
augmentation, 264
batches, 264
dual-primal graph CNN, 264
graph attention network, 258
GraphSAGE, 262
higher-order convolutional layer, 263
MixHop, 263
MoNet, 262
normalization, 265
over-smoothing, 265
pooling, 265
regularization, 264
residual connection, 263, 266
spectral methods, 262
suspended animation, 265–266
graphics processing unit, 107
GraphNorm, 265
GraphSAGE, 262
GraphSAINT, 264
GResNet, 263
grokking, 412
group normalization, 203
grouped convolution, 181
guided convolution, 183
HardSwish, 38
He initialization, 110, 113
Heaviside function, 104
heteroscedastic regression, 64, 73
hidden layer, 35
hidden unit, 27, 35
hidden variable, see latent variable
hierarchical graph, 241
highway network, 202
Hogwild!, 114
homoscedastic regression, 64
hourglass network, 179, 197, 205
stacked, 198
Hutchinson trace estimator, 316
hyperband, 136
hypernetwork, 235
hyperparameter, 46
model, 46
training algorithm, 91
hyperparameter search, 132, 133, 135
Bayesian optimization, 135
beta-Bernoulli bandit, 135
BOHB, 136
hyperband, 136
random sampling, 135
SMAC, 136
Tree-Parzen estimators, 136
i.i.d., see independent and identically distributed
identity matrix, 445
image interpolation, 11
image translation, 290–295, 301
ImageGPT, 229
Imagen, 370
ImageNet classification, 174–176, 181
implicit ethical agent, 424
implicit regularization, 144, 156–157
importance sampling, 339
inception block, 181
inception score, 271
independence, 451
independent and identically distributed, 58
inductive bias, 129
convolutional, 170
relational, 248
inductive model, 252
inference, 17
infinitesimal flows, 324
InfoGAN, 290
information preference problem, 345
initialization, 107–111
ActNorm, 113
convolutional layers, 183
ConvolutionOrthogonal, 183
Fixup, 205
Glorot, 113
GradInit, 113
He, 113
layer-sequential unit variance, 113
LeCun, 113
SkipInit, 205
TFixup, 237
Xavier, 113
injection, 439
inner alignment problem, 421
inpainting, 8
instance normalization, 203
InstructGPT, 398
intellectual property, 428
internal covariate shift, 203
interpretability, see explainability
intersectionality, 423
invariance, 161
permutation, 162, 213, 249
rotation, 182
scale, 182
translation, 182
inverse autoregressive flow, 313
inverse of a matrix, 443
invertible layer, 308
autoregressive flow, 311–313, 323
coupling flow, 310–311
elementwise flow, 309–310
linear flow, 308–309
residual flow, 313–316, 323
iResNet, 314–316, 323
iRevNet, 313–314, 323
Jacobian, 447
Janossy pooling, 263
Jensen's inequality, 330
Jensen-Shannon divergence, 460
joint probability, 448
Lk pooling, 181
L-infinity norm, 442
L0 regularization, 155
L1 regularization, 156
L2 norm, 442
L2 regularization, 140
label, 64
label smoothing, 149, 158
language model, 222, 234
few-shot learning, 224
GPT3, 222–227
large language model, 224, 234
LASSO, 155, 156
latent space
disentangled, 270
latent variable, 7, 268
latent variable model, 326
mixture of Gaussians, 327
nonlinear, 327
layer, 35
convolutional, 161, 165
hidden, 35
input, 35
invertible, 308
autoregressive flow, 311–313
coupling flow, 310–311
elementwise flow, 309–310
linear flow, 308–309
residual flow, 313–316, 323
output, 35
residual, 189
layer normalization, 203
layer-sequential unit variance initialization, 113
layer-wise DropEdge, 264
leaky ReLU, 38
learning, 18
learning rate, 78
schedule, 86
warmup, 93
learning to rank, 73
least squares GAN, 299
least squares loss, 19, 62
LeCun initialization, 113
LeNet, 180
likelihood, 58, 450, 451
likelihood ratio identity, 389
LIME, 426
line graph, 260
line search, 92
linear algebra, 446
linear flow, 308–309, 322
linear function, 27, 446
linear programming, 284
linear regression, 18
LinFormer, 237
Lipschitz constant, 439
local attention, 237
local minimum, 81
in real loss functions, 408
log-likelihood, 59
logarithm, 440
logistic regression, 94
logistic sigmoid, 66
loss, 19–21
adversarial, 292
perceptual, 292
VGG, 292
loss function, 21, 56–76
binary cross-entropy, 66
convex, 80
cross-entropy, 71–72
focal, 73
global minimum, 81
least squares, 19, 62
local minimum, 81
multiclass cross-entropy, 69
negative log-likelihood, 60
non-convex, 80
pinball, 73
properties of, 406–410
quantile, 73
ranking, 73
recipe for computing, 60
saddle point, 81
vs. cost function, 23
vs. objective function, 23
lottery ticket, 406, 415
lower triangular matrix, 445
LP norm, 442
manifold, 273
manifold precision/recall, 273
marginalization, 449
Markov chain, 350
Markov decision process, 377
Markov process, 373
Markov reward process, 373
masked autoregressive flow, 312
masked self-attention, 223
matrix, 436, 442
calculus, 447
column space, 443
determinant, 444
eigenvalue, 444
inverse, 443
Jacobian, 447
permutation, 245
product, 443
singular, 443
special types, 445
diagonal, 445
identity, 445
lower triangular, 445
orthogonal, 446
permutation, 446
upper triangular, 445
trace, 444
transpose, 442
max function, 437
max pooling, 171, 181
max pooling aggregation, 258
max unpooling, 172, 181
MaxBlurPool, 182
maximum likelihood, 56–59
mean, 454
mean pooling, 171, 246
measuring performance, 118–137
median estimation, 73
memory-compressed attention, 237
micro-batching, 114
militarization, 427
min function, 437
mini-batch, 85
discrimination, 288, 300
minimax game, 277
minimum, 81
connections between, 407
family of, 407
global, 81
local, 81
route to, 407
misuse, 426
data privacy, 428
face recognition, 426
fraud, 427
militarization, 427
political interference, 427
MixHop, 263
mixture density network, 74
mixture model network, 262, 265
mixture of Gaussians, 75, 327
MLP, 35
MNIST, 291
MNIST-1D, 118
mode collapse, 279, 300
mode dropping, 279
model, 17
capacity, 134
effective, 134
representational, 134
inductive, 252
machine learning, 4
parameter, 18
testing, 22
transductive, 252
modern regime, 129
momentum, 86, 92
Nesterov, 86, 92
MoNet, 262, 265
Monte Carlo batch normalization, 203
Monte Carlo dropout, 158
Monte Carlo method, 381, 383
moral deskilling, 429
multi-head self-attention, 214
multi-layer perceptron, 35
multi-scale flow, 316–317
multi-scale vision transformer, 230, 238
multi-task learning, 151
multiclass classification, 67–69
multiclass cross-entropy loss, 69
multigraph, 241
multivariate normal, 456
multivariate regression, 69
MViT, 238
NAdam, 92
named entity recognition, 221
Nash equilibrium, 277
natural language processing, 207, 216
automatic translation, 226
benchmarks, 234
BERT, 219–222
embedding, 218
GPT3, 222–227
named entity recognition, 221
question answering, 222
sentiment analysis, 221
tasks, 232
text classification, 221
tokenization, 218
natural policy gradients, 397
negative log-likelihood, 60
neighborhood sampling, 254
Nesterov accelerated momentum, 86, 92
network, see neural network
network dissection, 184
network inversion, 184
network-in-network, 181
neural architecture search, 132
neural network
shallow, 25–40
bias, 36
capacity, 29, 46
capsule, 235
composing, 41
convolutional, 161–185
deep, 41–55
deep vs. shallow, 49–51
depth, 46
depth efficiency, 50, 53
encoder-decoder, 179
feed-forward, 35
fully connected, 36
graph, 240–267
highway, 202
history, 37
hourglass, 179, 197
hyperparameter, 46
layer, 35
matrix notation, 49
recurrent, 233
residual, 186, 206
stacked hourglass, 198
transformer, 207–239
U-Net, 197
weights, 36
width, 46
width efficiency, 53
neural ODE, 202
neuron, see hidden unit
Newton method, 92
NLP, see natural language processing
node, 240
embedding, 243, 244
noise, 122
adding to inputs, 149
adding to weights, 149
noise conditioning augmentation, 369
noise schedule, 349
noisy deep Q-network, 397
non-convex function, 80
non-negative homogeneity, 39
nonlinear function, 27
nonlinear latent variable model, 327
norm
ℓp, 442
Euclidean, 442
spectral, 442
vector, 442
norm of weights, 412
normal distribution, 61, 456–458
change of variable, 458
distance between, 461
Fréchet distance between, 461
KL divergence between, 461
multivariate, 456
product of two normals, 458
sampling, 459
standard, 456
univariate, 456
Wasserstein distance between, 461
normalization
batch, 192–194
Monte Carlo, 203
batch renormalization, 203
ghost batch, 203
group, 203
in graph neural networks, 265
instance, 203
Kipf, 258, 262
layer, 203
normalizing flows, 303–325
applications, 322
autoregressive, 311–313, 323
coupling, 323
coupling flow, 310–311
coupling functions, 322
elementwise, 309–310, 322
generative direction, 305
GLOW, 318–320, 323
in variational inference, 320
infinitesimal, 324
inverse autoregressive, 313
linear, 308–309, 322
masked autoregressive, 312
multi-scale, 316–317
normalizing direction, 305
planar, 322
radial, 322
residual, 313–316, 323
iResNet, 314–316
iRevNet, 313–314
universality, 324
notation, 436–438
nullspace, 443
number set, 436
padding, 164
PairNorm, 265
parameter, 18, 436
parametric ReLU, 38
partial convolution, 181
partially observable MDP, 377
PatchGAN, 291
PCA, 344
perceptron, 37
perceptual loss, 292
performance, 118–137
Performer, 237
permutation invariance, 162, 213, 249
permutation matrix, 245, 446
pinball loss, 73
pipeline model parallelism, 114
pivotal tuning, 301
Pix2Pix, 291
PixelShuffle, 182
PixelVAE, 344
planar flow, 322
Poisson distribution, 76
policy, 377
behavior, 384
Boltzmann, 399
epsilon-greedy, 384
target, 384
policy gradient method, 388
PPO, 397
REINFORCE, 391
TRPO, 397
policy network, 12
political interference, 427
POMDP, 377
pooling
ℓk, 181
average, 181
in graph neural networks, 265
Janossy, 263
max, 181
max-blur, 181
positional encoding, 213, 236
posterior, 451
posterior collapse, 345
PPO, 397
pre-activation, 35
pre-activation residual block, 201
pre-training, 151
transformer encoder, 219
precision, 273
principal agent problem, 421
prior, 139, 140, 451
prior shift, 135
prioritized experience replay, 396
probabilistic generative model, 269
probabilistic PCA, 344
probability, 448–461
Bayes’ rule, 450
conditional, 449
density function, 448
distribution, 437, 448
Bernoulli, 65
categorical, 67
continuous, 448
discrete, 448
distance between, 459–461
mixture of Gaussians, 327
multivariate normal, 456
normal, 456–458
Poisson, 76
sampling from, 459
univariate normal, 61, 456
von Mises, 74
joint, 448
marginalization, 449
notation, 437
random variable, 448
probability flow ODE, 370
progressive growing, 286, 300
protected attribute, 423
proximal policy optimization, 397
pruning, 414
pyramid vision transformer, 238
PyTorch, 106
Q-learning, 384
deep, 385–387
double, 387
double deep, 387
fitted, 384
noisy deep Q-network, 397
quantile loss, 73
quantile regression, 73
query, 210
question answering, 222
R-CNN, 183
Rademacher complexity, 134
radial flow, 322
Rainbow, 397
random synthesizer, 235
random variable, 448
RandomDrop, 202
ranking, 73
recall, 273
receptive field, 167
in graph neural networks, 254
reconstruction loss, 334
rectified Adam, 93
rectified linear unit, 25
derivative of, 104
dying ReLU problem, 38
non-negative homogeneity, 39
recurrent dropout, 158
recurrent neural network, 233
regression, 2
heteroscedastic, 73
multivariate, 2, 69
quantile, 73
robust, 73
univariate, 61
regularization, 131, 138–160
AdamW, 155
adding noise, 158
adding noise to inputs, 149
adding noise to weights, 149
adversarial training, 149
augmentation, 152–154
bagging, 146
Bayesian approaches, 150
data augmentation, 159
DropConnect, 265
DropEdge, 264
dropout, 147
early stopping, 145, 157
elastic net, 156
ensemble, 145, 157
flooding, 159
Frobenius norm, 140, 155
implicit, 141, 144, 156–157
in graph neural networks, 264
L0, 155
L1, 156
L2, 140
label smoothing, 149, 158
LASSO, 156
multi-task learning, 151
probabilistic interpretation, 139
RandomDrop, 202
ResDrop, 202
ridge regression, 140
self-supervised learning, 159
shake drop, 203
shake-shake, 203
stochastic depth, 202
Tikhonov, 140
transfer learning, 151, 159
weight decay, 140
vs. L2, 155
REINFORCE, 391
reinforcement learning, 373–400
action value, 377
advantage function, 393
baseline, 391
batch, 394
Bellman equations, 379
classical, 396
deadly triad issue, 396
deep dueling network, 397
discount factor, 374
dynamic programming methods, 382
episode, 381, 383
experience replay, 386
exploration-exploitation trade-off, 373
for combinatorial optimization, 396
introduction, 11–12
Markov decision process, 377
Monte Carlo method, 381, 383
natural policy gradients, 397
offline, 394
policy, 377
behavior, 384
Boltzmann, 399
epsilon-greedy, 384
optimal, 378
target, 384
policy gradient method, 388
PPO, 397
REINFORCE, 391
TRPO, 397
policy network, 12
POMDP, 377
prioritized experience replay, 396
Q-learning, 384
deep Q-network, 385–387, 396
double DQN, 387
double Q-learning, 387
fitted, 384
noisy deep Q-network, 397
Rainbow, 397
return, 374
reward, 374
rollout, 381
SARSA, 384
state value, 377
state-action value function, 378
state-value function, 378
tabular, 381–384
temporal difference method, 384
trajectory, 381
value, 374
with human feedback, 398
relational inductive bias, 248
ReLU, see rectified linear unit
reparameterization trick, 338, 346
representational capacity, 134
ResDrop, 202
residual block, 189
order of operations, 191
residual connection, 189
in graph neural networks, 263, 266
why improves performance, 202
residual flow, 313–316, 323
iResNet, 314–316
iRevNet, 313–314
residual network, 186–206
as ensemble, 202
performance, 198
stable ResNet, 205
unraveling, 189
ResNet v1, 201
ResNet v2, 201
ResNet-200, 195
ResNeXt, 202
resynthesis, 341
return, 374
reverse-mode differentiation, 116
reward, 374
ridge regression, 140
RL, see reinforcement learning
RMSProp, 93
RNN, 233
robust regression, 73
rollout, 381
rotation equivariance, 182
rotation invariance, 182
run transparency, 425
tabular data, 2, 17
tabular reinforcement learning, 381–384
target policy, 384
teacher forcing, 234
technochauvinism, 433
technological unemployment, 429
temporal difference method, 381, 384
tensor, 107, 436, 442
tensor model parallelism, 114
TensorFlow, 106
test data, 22
test error bias, 122
double descent, 127
noise, 122
variance, 122
test set, 118
text classification, 221
text synthesis, 8, 224–225
conditional, 9
text-to-image, 367, 370
TFixup, 237
Tikhonov regularization, 140
tokenization, 218, 234
BPE dropout, 234
Byte pair encoding, 234
SentencePiece, 234
sub-word, 218
WordPiece, 234
top-k sampling, 224
total correlation VAE, 345
trace, 444
Hutchinson estimator, 316
training, 5, 18, 77–117
batch, 85
epoch, 85
error, 19
factors that determine success, 402–406
gradient checkpointing, 114
micro-batching, 114
reducing memory requirements, 114
stochastic gradient descent, 83–86
tractability, 401
trajectory, 381
transductive model, 252
transfer learning, 151, 159, 219
fine-tuning, 151
pre-training, 151
transformer, 207–239
applications, 233
applied to images, 228–232
BERT, 219–222
BigBird, 237
CLIP, 238
combined with CNNs, 238
combining images and text, 238
cross covariance image transformer, 238
Crossformer, 238
DaViT, 232, 238
decoding algorithm, 234
definition, 215
encoder model, 219–222
encoder-decoder model, 226
extended construction, 237
for NLP, 216
for video processing, 238
for vision, 238
ImageGPT, 229
LinFormer, 237
long sequences, 227
multi-head self-attention, 214
multi-scale, 238
multi-scale vision, 230
Performer, 237
positional encoding, 213, 236
pyramid vision, 238
scaled dot-product attention, 214
SWin, 230, 238
SWin V2, 238
synthesizer, 235
TFixup, 237
ViT, 229
translation (automatic), 226
translation equivariance, 182
translation invariance, 182
transparency, 424–425
functional, 425
run, 425
structural, 425
transport plan, 283
transpose, 442
transposed convolution, 172, 181
Tree-Parzen estimators, 136
triangular matrix, 445
TRPO, 397
truncation trick, 288
trust region policy optimization, 397
V-Net, 205
VAE, see variational autoencoder
valid convolution, 164
value, 208, 374
value alignment, 420–426
inner alignment problem, 421
outer alignment problem, 421
principal agent problem, 421
value of action, 377
value of state, 377
value-free ideal of science, 431
vanishing gradient problem, 108
in residual networks, 192
Vapnik-Chervonenkis dimension, 134
variable, 436
variance, 122, 454
identity, 454
variational approximation, 335
with normalizing flows, 320
variational autoencoder, 326–347
adversarially learned inference, 345
aggregated posterior, 340, 341
applications, 343
beta VAE, 342, 345
combination with other models, 345
conditional, 344
decoder, 337
disentanglement, 341
encoder, 337
estimating probability, 339
factor VAE, 346
generation, 340
hierarchical model for posterior, 344
holes in latent space, 345
information preference problem, 345
modifying likelihood term, 344
normalizing flows, 344
PixelVAE, 344
posterior collapse, 345
relation to EM algorithm, 346
relation to other models, 344
reparameterization trick, 338, 346
resynthesis, 341
total correlation VAE, 345
VC dimension, 134
vector, 436, 442
dot product, 443
norm, 442
VEEGAN, 300
VGG, 176
VGG loss, 292
vision transformer
DaViT, 232
ImageGPT, 229
multi-scale, 230
SWin, 230
ViT, 229
visualizing activations, 184
ViT, 229
von Mises distribution, 74
YOGI, 93
YOLO, 177, 184