The Little Book of Deep Learning
François Fleuret
François Fleuret is professor of computer science
at the University of Geneva, Switzerland.
beta-2023.04.23
Contents
List of figures 7
Foreword 8
I Foundations 10
1 Machine Learning 11
1.1 Learning from data . . . . . . . 12
1.2 Basis function regression . . . . 14
1.3 Categories of models . . . . . . 15
1.4 Under and over-fitting . . . . . 17
2 Efficient computation 19
2.1 GPUs, TPUs, and batches . . . . 20
2.2 Tensors . . . . . . . . . . . . . . 22
3 Training 24
3.1 Losses . . . . . . . . . . . . . . 25
3.2 Autoregressive models . . . . . 28
3.3 Gradient descent . . . . . . . . 30
3.4 Backpropagation . . . . . . . . 34
3.5 Training protocols . . . . . . . 38
3.6 Training data . . . . . . . . . . 41
II Deep models 43
4 Model components 44
4.1 The notion of layer . . . . . . . 45
4.2 Linear layers . . . . . . . . . . . 47
4.3 Activation functions . . . . . . 56
4.4 Pooling . . . . . . . . . . . . . . 59
4.5 Dropout . . . . . . . . . . . . . 62
4.6 Normalizing layers . . . . . . . 64
4.7 Skip connections . . . . . . . . 68
4.8 Attention layers . . . . . . . . . 71
4.9 Token embedding . . . . . . . . 78
4.10 Positional encoding . . . . . . . 79
5 Architectures 81
5.1 Multi-Layer Perceptrons . . . . 82
5.2 Convolutional networks . . . . 84
5.3 Attention models . . . . . . . . 91
III Applications 98
6 Prediction 99
6.1 Image denoising . . . . . . . . . 100
6.2 Image classification . . . . . . . 102
6.3 Object detection . . . . . . . . . 103
6.4 Semantic segmentation . . . . . 108
6.5 Speech recognition . . . . . . . 111
6.6 Text-image representations . . . 113
7 Synthesis 116
7.1 Text generation . . . . . . . . . 117
7.2 Image generation . . . . . . . . 119
Afterword 125
Bibliography 126
Index 133
List of Figures
4.1 1d convolution . . . . . . . . . . . . 49
4.2 2d convolution . . . . . . . . . . . . 50
4.3 Stride, padding, and dilation . . . . 51
4.4 Receptive field . . . . . . . . . . . . 53
4.5 Activation functions . . . . . . . . . 57
4.6 Max pooling . . . . . . . . . . . . . 60
4.7 Dropout . . . . . . . . . . . . . . . . 63
4.8 Batch normalization . . . . . . . . . 65
4.9 Skip connections . . . . . . . . . . . 69
4.10 Attention operator . . . . . . . . . . 72
4.11 Interpretation of the attention operator 73
4.12 Multi-Head Attention layer . . . . . 75
5.1 Multi-Layer Perceptron . . . . . . . 82
5.2 LeNet-like convolutional model . . 85
5.3 Residual block . . . . . . . . . . . . 86
5.4 Downscaling residual block . . . . . 87
5.5 ResNet-50 . . . . . . . . . . . . . . . 88
5.6 Self and cross-attention blocks . . . 92
5.7 Transformer . . . . . . . . . . . . . 93
5.8 GPT model . . . . . . . . . . . . . . 95
5.9 ViT model . . . . . . . . . . . . . . 96
Foreword
If you did not get this book from its official URL
https://fleuret.org/public/lbdl.pdf
François Fleuret
April 23, 2023
Part I
Foundations
Chapter 1
Machine Learning
1.1 Learning from data
The simplest use case for a model trained from
data is when a signal x is accessible, for instance
the picture of a license plate, from which one
wants to predict a quantity y, such as the string
of characters written on the plate.
often referred to as weights, by analogy with
the synaptic weights of biological neural net-
works. In addition to these parameters, models
usually depend on meta-parameters, which are set according to domain prior knowledge, best practices, or resource constraints. They may also be optimized in some way, but with techniques different from those used to optimize w.
1.2 Basis function regression
We can illustrate the training of a model in a simple case where x_n and y_n are two real numbers, and the loss is the mean squared error
ℒ(w) = (1/N) ∑_{n=1}^{N} (y_n − f(x_n; w))²,   (1.1)
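For instance, with Gaussian basis functions and synthetic data (both arbitrary choices for this illustration), minimizing this loss is a least-squares problem that can be solved directly in PyTorch:

import torch

# Synthetic data: y = sin(x) + noise (an arbitrary choice for the example)
x = torch.linspace(0, 2 * torch.pi, 100)
y = torch.sin(x) + 0.1 * torch.randn(100)

# Gaussian basis functions centered on a regular grid (an arbitrary basis)
centers = torch.linspace(0, 2 * torch.pi, 10)
phi = torch.exp(-(x[:, None] - centers[None, :]) ** 2)   # 100x10 design matrix

# Minimizing the MSE of equation 1.1 w.r.t. w is linear least squares
w = torch.linalg.lstsq(phi, y[:, None]).solution

mse = ((phi @ w).squeeze() - y).pow(2).mean()
print(mse)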
Both regression and classification are generally
referred to as supervised learning since the value
to predict, which is required as a target during
training, has to be produced, for instance by hu-
man experts. On the contrary, density modeling
is usually seen as unsupervised learning since
it is sufficient to take existing data, without the
need for producing an associated ground-truth.
1.4 Under and over-fitting
A key element is the interplay between the capacity of the model, that is, its flexibility and
ability to fit diverse data, and the amount and
quality of the training data. When the capacity
is insufficient, the model cannot fit the data and
the error during training is high. This is referred
to as under-fitting.
So a large part of the art of applied machine
learning is to design models which are not too
flexible, but still able to fit the data. This is done
by crafting the right inductive bias in a model,
which means that its structure corresponds to
the underlying structure of the data at hand.
Chapter 2
Efficient computation
2.1 GPUs, TPUs, and batches
Graphical Processing Units were originally de-
signed for real-time image synthesis, which re-
quires highly parallel architectures that happen to be well suited to deep models. As their usage for AI has increased, GPUs have been equipped with dedicated sub-components referred to as tensor cores, and deep-learning specialized chips such as Google's Tensor Processing Units (TPUs) have been produced.
2.2 Tensors
GPUs and deep learning frameworks such as PyTorch or JAX manipulate the quantities to process by organizing them as tensors, which are series of scalars arranged along several discrete axes. They are elements of ℝ^{N_1×⋯×N_D} that generalize the notion of vector and matrix.
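For instance, in PyTorch (the shapes below are illustrative):

import torch

x = torch.zeros(16, 3, 224, 224)   # a batch of 16 RGB images of size 224x224
print(x.shape)                     # torch.Size([16, 3, 224, 224])
print(x.dtype, x.device)           # e.g. torch.float32 cpu

y = x.permute(0, 2, 3, 1)          # re-order the axes without copying the data
print(y.shape)                     # torch.Size([16, 224, 224, 3])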
Chapter 3
Training
3.1 Losses
The example of the mean squared error of equa-
tion 1.1 is a standard loss for predicting a con-
tinuous value.
3.2 Autoregressive models
Many spectacular applications in computer vi-
sion and natural language processing have been
tackled by modeling the distribution of a high-
dimension discrete vector with the chain rule:
the chain rule states that one can sample a full se-
quence of length T by sampling the xt s one after
another, each according to the predicted poste-
rior distribution, given the x1 ,...,xt−1 already
sampled. This is an autoregressive generative
model.
Training such a model reduces to minimizing, for each t, the cross-entropy

ℒ_ce(x_t, f(x_1, …, x_{t−1}, 0, …, 0; w)),   (3.6)
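A minimal sketch of the sampling procedure, assuming a hypothetical model f that maps a zero-padded integer sequence to per-position logits over the vocabulary:

import torch

def sample_autoregressive(f, T):
    # Assumed interface: f takes a 1 x T zero-padded integer tensor and returns
    # 1 x T x vocab_size logits, the entry at position t being the predicted
    # posterior for x_t given x_1, ..., x_{t-1}.
    x = torch.zeros(1, T, dtype=torch.long)
    for t in range(T):
        probs = torch.softmax(f(x)[0, t], dim=-1)   # posterior over the vocabulary
        x[0, t] = torch.multinomial(probs, 1)       # sample x_t, then move to x_{t+1}
    return x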
3.3 Gradient descent
Except in specific cases like the linear regres-
sion we saw in the previous chapter, the optimal
parameters w∗ do not have a closed form expres-
sion. In the general case the tool of choice to
minimize a function is gradient descent. It con-
sists of initializing the parameters with a random
w0 , and then improving this estimate by iterat-
ing gradient steps, each consisting of computing
the gradient of the loss with respect to the pa-
rameters, and subtracting a fraction of it
wn+1 = wn −η∇ℒ |w (wn ). (3.7)
This procedure corresponds to moving the cur-
rent estimate a bit in the direction corresponding
locally to the maximum decrease of ℒ (w), as
illustrated on Figure 3.1.
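A minimal sketch of this iteration in PyTorch, with an arbitrary differentiable loss and an illustrative learning rate:

import torch

w = torch.randn(10, requires_grad=True)   # random initial w_0
eta = 0.01                                # learning rate (illustrative value)

def loss(w):
    return (w ** 2).sum()                 # a stand-in for the true loss, chosen arbitrarily

for _ in range(100):
    grad, = torch.autograd.grad(loss(w), w)   # gradient of the loss w.r.t. the parameters
    with torch.no_grad():
        w -= eta * grad                       # w_{n+1} = w_n - eta * gradient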
as an average of a per-sample loss
ℒ(w) = (1/N) ∑_{n=1}^{N} 𝓁_n(w),   (3.8)
3.4 Backpropagation
Using gradient descent requires a technical
means to compute ∇𝓁n |w (w). Given that f and
L are both compositions of standard tensor oper-
ations, as for any mathematical expression, the
chain rule allows us to get an expression of it.
f = f_1 ∘ f_2 ∘ ⋯ ∘ f_D.

[Diagram: in the forward pass, each f_d(·; w_d) maps x_{d−1} to x_d; in the backward pass, ∇𝓁|_{x_{d−1}} is obtained from ∇𝓁|_{x_d} through the Jacobian J_{f_d}|_x, and ∇𝓁|_{w_d} through J_{f_d}|_w.]
convenient algorithm is autograd [Baydin et al.,
2015], which tracks tensor operations, and builds
on the fly the combination of operators for gra-
dients. Thanks to this, a piece of imperative
programming that manipulates tensors can auto-
matically compute the gradient of any quantity
with respect to any other.
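For instance, a small illustrative snippet with PyTorch's autograd:

import torch

w = torch.randn(5, requires_grad=True)   # track the operations involving w
x = torch.randn(5)

y = (w * x).sum().tanh()                 # an arbitrary imperative computation
y.backward()                             # build and run the backward pass

print(w.grad)                            # gradient of y with respect to w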
work is that when the gradient propagates back-
wards through many operators it may decrease
or increase exponentially. When it decreases
exponentially, this is called the vanishing gradient, and it may make training impossible or, in its milder form, cause different parts of the model to be updated at different speeds, degrading their co-adaptation [Glorot and Bengio,
2010]. As we will see, several techniques have
been developed to prevent this from happening.
3.5 Training protocols
Training a deep network requires defining a pro-
tocol to make the most of computation and data,
and ensure that performance will be good on
new data.
[Figure: evolution of the training and validation losses as a function of the number of epochs.]
the computing device’s memory.
Part II
Deep models
Chapter 4
Model components
4.1 The notion of layer
We call layers standard complex compounded
tensor operations that have been designed and
empirically identified as being generic and effi-
cient. They often incorporate trainable param-
eters, and correspond to a convenient level of
granularity to design and describe large deep
models. The term is inherited from the simple
multi-layer neural networks, even though a mod-
ern model may take the form of a complex graph of such modules and incorporate multiple parallel pathways.
[Figure: example of the graphical notation used to depict models, with an input X of size 32×32, a layer f repeated ×K, a layer g with meta-parameter n=4, and an output Y of size 4×4.]
• non-default valued meta-parameters are
added in blue on their right,
4.2 Linear layers
Linear layers are the most important modules
in terms of computation and number of parame-
ters. They benefit from decades of research and
engineering in algorithmic and chip design for
matrix operations.
∀ d_1, …, d_K,  Y[d_1, …, d_K] = W X[d_1, …, d_K] + b.   (4.1)
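For instance, in PyTorch, with illustrative dimensions:

import torch, torch.nn as nn

lin = nn.Linear(in_features=32, out_features=64)   # W of size 64x32 and bias b of size 64

X = torch.randn(8, 100, 32)   # any leading axes d_1, ..., d_K; the last axis holds the features
Y = lin(X)                    # computes W X[d_1, ..., d_K] + b for every such vector
print(Y.shape)                # torch.Size([8, 100, 64])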
Convolutional layers
A linear layer can take as input an arbitrarily
shaped tensor by reshaping it into a vector, as
long as it has the right number of coefficients.
However such a layer is poorly adapted to deal-
ing with large tensors since the number of pa-
rameters and number of operations are propor-
tional to the product of the input and output
dimensions. For instance, to process an RGB image of size 256×256 as input and compute a result of the same size, it would require ≃ 4×10¹⁰ parameters and multiplications.
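This order of magnitude can be checked with a quick back-of-the-envelope computation:

in_dim = 3 * 256 * 256       # flattened RGB input: 196,608 coefficients
out_dim = 3 * 256 * 256      # output of the same size

weights = in_dim * out_dim   # one weight per (input, output) pair
print(weights)               # 38,654,705,664, that is roughly 4e10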
Figure 4.1: A 1d convolution (left) takes as input
a D×T tensor X, applies the same affine mapping
ϕ(·;w) to every sub-tensor of shape D×K, and stores
the resulting D′ ×1 tensors into Y . A 1d transposed
convolution (right) takes as input a D×T tensor, ap-
plies the same affine mapping ψ(·;w) to every sub-
tensor of shape D×1, and sums the shifted resulting
D′ ×K tensors. Both can process inputs of different
size.
Figure 4.2: A 2d convolution (left) takes as input a
D×H ×W tensor X, applies the same affine map-
ping ϕ(·;w) to every sub-tensor of shape D×K ×L,
and stores the resulting D′ ×1×1 tensors into Y . A
2d transposed convolution (right) takes as input a
D×H ×W tensor, applies the same affine mapping
ψ(·;w) to every D×1×1 sub-tensor, and sums the
shifted resulting D′ ×K ×L tensors into Y .
Figure 4.3: Besides its kernel size and number of input/output channels, a convolution admits three meta-parameters: the stride s (left) modulates the step size when going through the input tensor, the padding p (top right) specifies how many zero entries are added around the input tensor before processing it, and the dilation d (bottom right) parameterizes the index count between coefficients of the filter.
A 1d convolution is mainly defined by three meta-parameters: its kernel size K, its number of input channels D, its number of output channels D′, and by the trainable parameters w of an affine mapping ϕ(·; w) : ℝ^{D×K} → ℝ^{D′×1}.
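For instance, in PyTorch, with illustrative values of D, D′, and K:

import torch, torch.nn as nn

conv = nn.Conv1d(in_channels=16, out_channels=32, kernel_size=5)   # D=16, D'=32, K=5

X = torch.randn(1, 16, 100)   # a D x T signal, with a leading batch axis
Y = conv(X)
print(Y.shape)                # torch.Size([1, 32, 96]): T shrinks to T-K+1 without padding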
Figure 4.4: Given an activation in a series of convolu-
tion layers, here in red, its receptive field is the area in
the input signal, in blue, that modulates its value. Each
intermediate convolutional layer increases its size by
roughly half its kernel size.
A transposed convolution applies, for instance in the 1d case, an affine mapping ψ(·; w) : ℝ^{D×1} → ℝ^{D′×K} to every D×1 sub-tensor of the input, and sums the shifted D′×K resulting tensors to compute its output. Such an operator increases the size of the signal and can be understood intuitively as a synthesis process. See Figure 4.1 (right) and Figure 4.2 (right).
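And a matching illustration with a 1d transposed convolution, using the same illustrative channel sizes:

import torch, torch.nn as nn

tconv = nn.ConvTranspose1d(in_channels=32, out_channels=16, kernel_size=5)

X = torch.randn(1, 32, 96)    # a D x T signal, with a leading batch axis
Y = tconv(X)
print(Y.shape)                # torch.Size([1, 16, 100]): the signal grows to T+K-1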
4.3 Activation functions
If a network were combining only linear components, it would itself be a linear operator, so it is essential to have non-linear operations. These are implemented in particular with activation functions, which are layers that transform every component of the input tensor individually through a mapping, resulting in a tensor of the same shape.
4.5 (bottom left).
leakyrelu(x) = ax if x < 0, and x otherwise.   (4.3)
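A minimal sketch of this mapping, with an illustrative slope a:

import torch

def leakyrelu(x, a=0.1):
    # Component-wise mapping of equation 4.3; a=0.1 is an arbitrary slope
    return torch.where(x < 0, a * x, x)

x = torch.tensor([-2.0, -0.5, 0.0, 1.0, 3.0])
print(leakyrelu(x))   # tensor([-0.2000, -0.0500, 0.0000, 1.0000, 3.0000])
print(torch.relu(x))  # ReLU instead sets the negative components to zero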
4.4 Pooling
A classical strategy to reduce the signal size is to
use a pooling operation that combines multiple
activations into one that ideally summarizes the
information. The most standard operation of this
class is the max pooling layer which, similarly
to convolution, can operate in 1d and 2d, and is
defined by a kernel size.
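For instance, in PyTorch, with an illustrative kernel size of 2:

import torch, torch.nn as nn

pool = nn.MaxPool1d(kernel_size=2)   # keeps the max of every non-overlapping pair

X = torch.randn(1, 8, 100)           # a D x T signal, with a leading batch axis
Y = pool(X)
print(Y.shape)                       # torch.Size([1, 8, 50]): T is divided by the kernel size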
[Figure 4.6: 1d max pooling.]
The average pooling layer computes the average instead of the max over the sub-tensors. This is a linear operation, whereas max pooling is not.
4.5 Dropout
Some layers have been designed to explicitly
facilitate training, or improve the quality of the
learned representations.
Figure 4.7: Dropout can process a tensor of arbitrary
shape. During training (left), it sets activations at ran-
dom to zero with probability p and applies a multiply-
ing factor to keep the expected values unchanged. Dur-
ing test (right), it keeps all the activations unchanged.
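PyTorch's nn.Dropout follows this scheme, with the compensating factor 1/(1−p) applied at training time:

import torch, torch.nn as nn

drop = nn.Dropout(p=0.5)
X = torch.ones(2, 8)

drop.train()
print(drop(X))   # zeros at random, the surviving activations scaled by 1/(1-p)

drop.eval()
print(drop(X))   # at test time the input is passed through unchanged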
4.6 Normalizing layers
An important class of operators to facilitate the
training of deep architectures are the
normalizing layers
which force the empirical mean and
variance of groups of activations.
[Figure 4.8: batchnorm and layernorm. Both compute (x − m̂)/√(v̂ + ϵ) followed by x ⊙ γ + β, but estimate the statistics over different groups of activations.]
viation γ_d:

z_{b,d} = (x_{b,d} − m̂_d) / √(v̂_d + ϵ),   (4.7)

y_{b,d} = γ_d z_{b,d} + β_d.   (4.8)
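A sketch of these two equations on a B×D batch, leaving aside the running statistics used at test time:

import torch

def batchnorm(x, gamma, beta, eps=1e-5):
    # x is a B x D batch; statistics are computed per component, across the batch
    m_hat = x.mean(dim=0)                        # empirical mean, of size D
    v_hat = x.var(dim=0, unbiased=False)         # empirical variance, of size D
    z = (x - m_hat) / torch.sqrt(v_hat + eps)    # equation 4.7
    return gamma * z + beta                      # equation 4.8

x = torch.randn(32, 8)
y = batchnorm(x, gamma=torch.ones(8), beta=torch.zeros(8))
print(y.mean(dim=0), y.var(dim=0, unbiased=False))   # roughly 0 and 1 per component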
4.7 Skip connections
Another technique that mitigates the vanishing gradient and allows the training of deep architectures are the skip connections [Long et al., 2014; Ronneberger et al., 2015]. They are not layers per se, but an architectural design in which outputs of some layers are transported as-is to other layers further in the model, bypassing processing in-between. This unmodified signal can be concatenated or added to the input of the layer the connection branches into. See Figure 4.9. A particular type of skip connection is the residual connection, which combines the signal with a sum, and usually skips only a few layers. See Figure 4.9, right.
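A minimal sketch of the two variants, with arbitrary layer choices:

import torch, torch.nn as nn

f = nn.Linear(64, 64)
g = nn.Linear(64, 64)
x = torch.randn(1, 64)

y_residual = x + g(torch.relu(f(x)))     # residual connection: the signal is added back
y_concat = torch.cat([x, f(x)], dim=1)   # concatenation of the transported signal
print(y_residual.shape, y_concat.shape)  # torch.Size([1, 64]) torch.Size([1, 128])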
[Figure 4.9: skip connections: activations are transported unchanged to layers further in the model, either concatenated or added (residual connections).]
Their role can also be to facilitate multi-scale rea-
soning in models that reduce the signal size be-
fore re-expanding it, by connecting layers with compatible sizes. In the case of residual connections, they may also facilitate learning by simplifying the task to finding a differential improvement instead of a full update.
4.8 Attention layers
In many applications there is a need for a pro-
cessing able to combine local information at lo-
cations far apart in a tensor. This can be for
instance distant details for coherent and realistic
image synthesis, or words at different positions
in a paragraph to make a grammatical or seman-
tic decision in natural language processing.
Attention layers address this need, and are the core components of Transformers, the dominant architecture for large language models. See § 5.3 and § 7.1.
Attention operator
Given query, key, and value tensors Q, K, and V, the attention operator computes

Y = att(K, Q, V )
[Figure 4.10: the attention operator.]
attention scores, from which the output is computed as

Y_n = ∑_m A_{n,m} V_m.   (4.10)
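A sketch of the operator, with the attention scores A obtained by a softmax of QK^⊤/√(D_QK) as in Vaswani et al. [2017]:

import torch

def att(K, Q, V):
    # Q: N^Q x D_QK, K: N^KV x D_QK, V: N^KV x D_V
    A = torch.softmax(Q @ K.T / K.shape[-1] ** 0.5, dim=-1)   # attention scores A_{n,m}
    return A @ V                                              # Y_n = sum_m A_{n,m} V_m

Q, K, V = torch.randn(5, 16), torch.randn(9, 16), torch.randn(9, 32)
print(att(K, Q, V).shape)   # torch.Size([5, 32]): one output per query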
Figure 4.12: The Multi-head Attention Layer applies
for each of its h = 1,...,H heads a parametrized lin-
ear transformation to individual elements of the input
sequences X Q ,X K ,X V to get sequences Q,K,V that
are processed by the attention operator to compute Yh .
These H sequences are concatenated along features,
and individual elements are passed through one last
linear operator to get the final result sequence Y .
• W^Q of size H×D×D_QK,
• W^K of size H×D×D_QK,
• W^V of size H×D×D_V,
• X^Q of size N^Q×D,
• X^K of size N^KV×D,
• X^V of size N^KV×D,

Y_h = att(X^Q W_h^Q, X^K W_h^K, X^V W_h^V).   (4.12)
4.9 Token embedding
In many situations, we need to convert discrete
tokens into vectors. This can be done with an
embedding layer
which consists of a lookup table
that directly maps integers to vectors.
∀ d_1, …, d_K,  Y[d_1, …, d_K] = M[X[d_1, …, d_K]].   (4.14)
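For instance, in PyTorch, with an illustrative vocabulary size and embedding dimension:

import torch, torch.nn as nn

embed = nn.Embedding(num_embeddings=1000, embedding_dim=64)   # the lookup table M, of size 1000x64

X = torch.randint(0, 1000, (2, 10))   # integer tokens, with any shape d_1, ..., d_K
Y = embed(X)                          # Y[d_1, ..., d_K] = M[X[d_1, ..., d_K]]
print(Y.shape)                        # torch.Size([2, 10, 64])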
4.10 Positional encoding
While the processing of a fully connected layer
is specific to both the positions of the features
in the input tensor, and to the position of the
resulting activation in the output tensor, convo-
lutional layers and multi-head attention layers
are oblivious to the absolute position in the ten-
sor. This is key to their strong invariance and
inductive bias, which is beneficial for dealing with a stationary signal.
D, Vaswani et al. [2017] add
pos-enc[t, d] = sin(t / T^{d/D}) if d ∈ 2ℕ, and cos(t / T^{(d−1)/D}) otherwise,   (4.15)

with T = 10⁴.
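A sketch of this encoding:

import torch

def pos_enc(t_max, D, T=1e4):
    # Equation 4.15: sin(t / T^(d/D)) for even d, cos(t / T^((d-1)/D)) for odd d
    t = torch.arange(t_max, dtype=torch.float)[:, None]
    d = torch.arange(D, dtype=torch.float)[None, :]
    angles = t / T ** (2 * torch.floor(d / 2) / D)
    return torch.where(d.long() % 2 == 0, torch.sin(angles), torch.cos(angles))

print(pos_enc(50, 64).shape)   # torch.Size([50, 64]): one encoding per position t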
Chapter 5
Architectures
5.1 Multi-Layer Perceptrons
The simplest deep architecture is the
Multi-Layer Perceptron
(MLP), which takes the form
of a succession of fully connected layers sepa-
rated by activation functions. See an example
on Figure 5.1. For historical reasons, in such a
model, the number of hidden layers refers to the
number of linear layers, excluding the last one.
Figure 5.1: This multi-layer perceptron takes as input a one-dimensional tensor of size 50. It is composed of three fully connected layers with outputs of dimensions 25, 10, and 2 respectively, the first two followed by ReLU layers.
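For instance, in PyTorch, this model corresponds to:

import torch, torch.nn as nn

# The MLP of Figure 5.1: input of size 50, hidden layers of sizes 25 and 10, output of size 2
mlp = nn.Sequential(
    nn.Linear(50, 25), nn.ReLU(),
    nn.Linear(25, 10), nn.ReLU(),
    nn.Linear(10, 2),
)

print(mlp(torch.randn(1, 50)).shape)   # torch.Size([1, 2])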
arbitrarily well uniformly on a compact by a
model of the form l_2 ∘ σ ∘ l_1 where l_1 and l_2 are affine. Such a model is an MLP with a single hidden layer, and this result implies that it can approximate anything of practical value. However, this approximation holds only if the dimension of the first linear layer's output can be arbitrarily large.
5.2 Convolutional networks
The standard architecture for processing images
is a convolutional network, or convnet, that
combines multiple convolutional layers, either
to reduce the signal size before it can be pro-
cessed by fully connected layers, or to output a
2d signal also of large size.
LeNet-like
The original LeNet model for image classifica-
tion [LeCun et al., 1998] combines a series of 2d
convolutional layers and max pooling layers that
play the role of feature extractor, with a series of
fully connected layers which act like an MLP and perform the classification per se. See Figure 5.2
for an example.
Residual networks
Standard convolutional neural networks that fol-
low the architecture of the LeNet family are not
easily extended to deep architectures and suffer
from the vanishing gradient problem.

Figure 5.2: A LeNet-like network for classifying 28×28 grayscale images of handwritten digits [LeCun et al., 1998]. Its first half is convolutional, and alternates convolutional layers per se and max pooling layers, reducing the signal dimension from 28×28 scalars to 256. Its second half processes this 256-dimensional feature vector through a one hidden layer perceptron to compute 10 logit scores corresponding to the ten possible digits.
Figure 5.3: A residual block.
The residual networks, or resnets, proposed by He et al.
[2015] explicitly address the issue of the van-
ishing gradient with residual connections (see
§ 4.7), which allow hundreds of layers. They have
become standard architectures for computer vi-
sion applications, and exist in multiple versions
depending on the number of layers. We are go-
ing to look in detail at the architecture of the
ResNet-50 for classification.
Figure 5.4: A downscaling residual block. It admits a
meta-parameter S, the stride of the first convolution
layer, which modulates the reduction of the tensor size.
Figure 5.5: Structure of the ResNet-50 [He et al., 2015].
The number of parameters of a convolutional layer, and its computational cost, are
quadratic with the number of channels. This
residual block mitigates this problem by first re-
ducing the number of channels with a 1×1 con-
volution, then operating spatially with a 3×3
convolution on this reduced number of chan-
nels, and then up-scaling the number of chan-
nels, again with a 1×1 convolution.
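A sketch of such a bottleneck residual block; the reduction factor of 2 and the exact placement of the normalizations and activations are illustrative choices:

import torch, torch.nn as nn

class BottleneckResidualBlock(nn.Module):
    def __init__(self, C):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(C, C // 2, kernel_size=1), nn.BatchNorm2d(C // 2), nn.ReLU(),
            nn.Conv2d(C // 2, C // 2, kernel_size=3, padding=1), nn.BatchNorm2d(C // 2), nn.ReLU(),
            nn.Conv2d(C // 2, C, kernel_size=1), nn.BatchNorm2d(C),
        )

    def forward(self, x):
        return torch.relu(x + self.f(x))   # residual connection followed by a final ReLU

x = torch.randn(1, 256, 14, 14)
print(BottleneckResidualBlock(256)(x).shape)   # torch.Size([1, 256, 14, 14])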
In the first series of residual blocks, there is no downscaling, only an increase of the number of channels by a factor of 4. The output of
the last residual block is 2048×7×7, which is
converted to a vector of dimension 2048 by an
average pooling of kernel size 7×7, and then
processed through a fully connected layer to get
the final logits, here for 1000 classes.
5.3 Attention models
As stated in § 4.8, many applications, in partic-
ular from natural language processing, greatly
benefit from models that include attention mech-
anisms. The architecture of choice for such tasks,
which has been instrumental in recent advances
in deep learning, is the Transformer proposed
by Vaswani et al. [2017].
Transformer
The original Transformer, pictured on Figure 5.7,
was designed for sequence-to-sequence trans-
lation. It combines an encoder that processes
the input sequence to get a refined representa-
tion, and an auto-regressive decoder that gener-
ates each token of the result sequence, given the
encoder’s representation of the input sequence,
and the output tokens generated so far. Like the
residual convolutional networks of § 5.2, both
the encoder and the decoder of the Transformer
are sequences of compounded blocks built with
residual connections.
Figure 5.6: Self-attention block (left) and cross-attention block (right). These specific structures, proposed by Radford et al. [2018], differ slightly from the original architecture of Vaswani et al. [2017], in particular by having the layernorm as the first layer of the residual blocks.
Figure 5.7: Original encoder-decoder
Transformer model
for sequence-to-sequence translation [Vaswani
et al., 2017].
and one the keys and values.
Figure 5.8: GPT model [Radford et al., 2018].
Vision Transformer
Transformers have been put to use for image
classification with the Vision Transformer (ViT)
model [Dosovitskiy et al., 2020], see Figure 5.9.
[Figure 5.9: the ViT model [Dosovitskiy et al., 2020].]
The first element Z0 in the result sequence, which corresponds to E0 and is not associated with any part of the image, is finally processed by a two-hidden-layer MLP to get the final C logits. Such a token, added for a readout of a class prediction, was introduced by Devlin et al. [2018] in the BERT model and is referred to as a CLS token.
Part III
Applications
Chapter 6
Prediction
6.1 Image denoising
A direct application of deep models to image pro-
cessing is to recover from degradation by using
the redundancy in the statistical structure of im-
ages. The petals of a sunflower on a grayscale
picture can be colored with high confidence, and
the texture of a geometric shape such as a table
on a low-light grainy picture can be corrected
by averaging it over a large area likely to be
uniform.
6.2 Image classification
Image classification is the simplest strategy to
extract semantics from an image, and consists
of predicting a class among a finite, predefined
number of classes, given an input image.
6.3 Object detection
A more complex task for image understanding
is object detection, in which case the objective
is, given an input image, to predict the classes
and positions of objects of interest.
Figure 6.2: Examples of object detection with the Single-
Shot Detector [Liu et al., 2015].
any bounding box (x_1, x_2, y_1, y_2) to an s, h, w, determined respectively by max(x_2 − x_1, y_2 − y_1), (y_1 + y_2)/2, and (x_1 + x_2)/2.
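As a small illustrative helper, with the quantization to the detector's grids left aside:

def box_to_shw(x1, x2, y1, y2):
    # Hypothetical helper mapping a bounding box to the scale s and location (h, w) above
    s = max(x2 - x1, y2 - y1)   # scale: the largest side of the box
    h = (y1 + y2) / 2           # vertical center
    w = (x1 + x2) / 2           # horizontal center
    return s, h, w

print(box_to_shw(10, 50, 20, 40))   # (40, 30.0, 30.0)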
During training, every ground-truth bounding box is associated with its s, h, w, and induces a loss term composed of a cross-entropy loss for the logits, and a regression loss such as MSE for the bounding box coordinates. Every other s, h, w that is not matched to a bounding box induces a cross-entropy-only penalty to predict the class "no object".
6.4 Semantic segmentation
The finest grain prediction task for image under-
standing is semantic segmentation, which con-
sists of predicting for every pixel the class of the
object it belongs to. This can be achieved with
a standard convolutional neural network that outputs a convolutional map with as many channels as classes, carrying the estimated logits for every pixel.
scaling, before making the final per-pixel predic-
tion [Zhao et al., 2016].
6.5 Speech recognition
Speech recognition consists of converting a
sound sample into a sequence of words. There
have been plenty of approaches to this problem
historically, but a conceptually simple and re-
cent one consists of casting it as a sequence-to-
sequence translation and then solving it with a
standard attention-based Transformer.
6.6 Text-image representations
A powerful approach to image understanding
consists of learning consistent image and text
representations. The Contrastive Language Im-
age Pre-training (CLIP) proposed by Radford
et al. [2021] combines an image encoder f , which
can be a ResNet-50, see § 5.2, and a text encoder
g, which is a GPT, see § 5.3. To use a GPT as a
text encoder, instead of a standard autoregres-
sive model, they add to the input sequence an
“end of sentence” token, and use the representa-
tion of this token in the last layer as the embed-
ding. Both embeddings have the same dimen-
sion which, depending on the configuration, is
between 512 and 1024.
Figure 6.4: The CLIP text-image embedding [Radford et al., 2021] allows zero-shot prediction by predicting which class description embedding is the most consistent with the image embedding.
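A sketch of this zero-shot prediction, assuming the image embedding and the class-description embeddings have already been computed with f and g:

import torch

def zero_shot_classify(image_embedding, class_embeddings):
    # Hypothetical helper: cosine similarity between the image embedding and
    # each class description embedding, returning the most consistent class
    im = image_embedding / image_embedding.norm()
    txt = class_embeddings / class_embeddings.norm(dim=1, keepdim=True)
    return (txt @ im).argmax()

image_embedding = torch.randn(512)
class_embeddings = torch.randn(10, 512)   # e.g. embeddings of "a photo of a <class>"
print(zero_shot_classify(image_embedding, class_embeddings))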
Chapter 7
Synthesis
7.1 Text generation
The standard approach to text synthesis is to use
an attention-based autoregressive model. The
most successful model in this domain is the GPT [Radford et al., 2018], which we described in § 5.3.
7.2 Image generation
Multiple deep methods have been developed to
model and sample from a high-dimensional density.
A powerful one for image synthesis relies on
inverting a diffusion process.
The diffusion process gradually reduces the importance of x0, and xt's density can rapidly be approximated with a normal.
which are modulated dynamically.
discriminator’s loss. It can be shown that at the
equilibrium the generator produces samples in-
distinguishable from real data. In practice, when
the gradient flows through the discriminator to
the generator, it informs the latter about the cues that the discriminator uses, which should be fixed.
Afterword
Bibliography
D. Hendrycks and K. Gimpel. Gaussian Error
Linear Units (GELUs). CoRR, abs/1606.08415,
2016. [pdf]. 58
D. Hendrycks, K. Zhao, S. Basart, et al. Natural
Adversarial Examples. CoRR, abs/1907.07174,
2019. [pdf]. 114
J. Ho, A. Jain, and P. Abbeel. Denoising Diffusion
Probabilistic Models. CoRR, abs/2006.11239,
2020. [pdf]. 119, 120, 121
S. Hochreiter and J. Schmidhuber. Long Short-
Term Memory. Neural Computation, 9(8):1735–
1780, 1997. [pdf]. 122
S. Ioffe and C. Szegedy. Batch Normalization: Ac-
celerating Deep Network Training by Reduc-
ing Internal Covariate Shift. In International
Conference on Machine Learning (ICML), 2015.
[pdf]. 64
D. Kingma and J. Ba. Adam: A Method for
Stochastic Optimization. CoRR, abs/1412.6980,
2014. [pdf]. 33
D. P. Kingma and M. Welling. Auto-Encoding
Variational Bayes. CoRR, abs/1312.6114, 2013.
[pdf]. 123
A. Krizhevsky, I. Sutskever, and G. Hinton. Ima-
geNet Classification with Deep Convolutional
Neural Networks. In Neural Information Pro-
cessing Systems (NIPS), 2012. [pdf]. 8, 84
V. Mnih, K. Kavukcuoglu, D. Silver, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, February 2015. [pdf]. 124
K. Simonyan and A. Zisserman. Very Deep Con-
volutional Networks for Large-Scale Image
Recognition. CoRR, abs/1409.1556, 2014. [pdf].
84
Index
1d convolution, 52
2d convolution, 52
activation, 22, 35
activation function, 56, 82
activation map, 54
Adam, 33
artificial neural network, 8
attention layer, 71
attention operator, 72
autoencoder, 123
autograd, 36
autoregressive model, 29, 117
average pooling, 59
backpropagation, 35
backward pass, 35
basis function regression, 14
batch, 20, 32
batch normalization, 64
bias vector, 47, 52
cache memory, 20
capacity, 17
causal, 72
causal model, 29, 74, 94, 117
channel, 22
classification, 15
CLIP, 113
CLS token, 97
computational cost, 36
contrastive loss, 26
convnet, 84
convolutional layer, 50, 84
convolutional network, 84
cross-attention block, 92
cross-entropy, 25
deep learning, 8
denoising autoencoder, 100
density modeling, 15
diffusion process, 119
dilation, 53, 59
discriminator, 123
downscaling residual block, 89
dropout, 62
embedding layer, 78
epoch, 38
filter, 52
fine tuning, 124
flops, 21
forward pass, 35
FP16, 21
FP32, 21
framework, 22
fully connected layer, 47, 82, 84
GAN, 123
GELU, 58
Generative Adversarial Networks, 123
generator, 123
GPT, 94, 113, 117
GPU, 8, 19
gradient descent, 30, 32, 34
gradient step, 30
Graphical Processing Unit, 8, 19
ground truth, 15
hidden layer, 82
hidden state, 122
max pooling, 59
mean squared error, 14, 25
memory requirement, 36
memory speed, 20
meta parameter, 13, 38
metric learning, 26
MLP, 82
model, 12
multi-head attention layer, 74
multi-layer perceptron, 82
padding, 52, 59
parameter, 12
parametric model, 12
peak performance, 21
pooling, 59
positional encoding, 79
posterior probability, 25
pre-trained model, 106, 110
query, 72
random initialization, 48
receptive field, 53, 54, 103
rectified linear unit, 56, 122
recurrent neural network, 122
regression, 15
reinforcement learning, 124
ReLU, 56
residual block, 87
residual connection, 68, 86
residual network, 68, 86
resnet, 68, 86
ResNet-50, 86, 113
RL, 124
RNN, 122
tanh, 57
tensor, 22
tensor cores, 20
Tensor Processing Units, 20
test set, 38
text synthesis, 117
tokens, 28
TPU, 20
trainable parameter, 12
training, 12
training set, 12, 24, 38, 41
Transformer, 68, 72, 91, 93, 111
transposed convolution, 54
under-fitting, 17
universal approximation theorem, 82
unsupervised learning, 16
VAE, 123
validation set, 38
value, 72
vanishing gradient, 37
variational autoencoder, 123
Vision Transformer, 95
ViT, 95
vocabulary, 28
weight, 13
weight decay, 26
weight matrix, 47
This book is licensed under the Creative Com-
mons BY-NC-SA 4.0 International License.