LBDL A5 Booklet
LBDL A5 Booklet
of
Deep Learning
François Fleuret
This book is licensed under the Creative Commons
BY-NC-SA 4.0 International License.
V1.2–May 19, 2024
179
The Little Book of Deep Learning
transposed convolution, 69, 121
underfitting, 18
universal approximation theorem, 96
unsupervised learning, 21
VAE, see variational, autoencoder
validation set, 48
value, 86
vanishing gradient, 45, 59
variational
autoencoder, 153
bound, 136
Vision Transformer, 110, 123
ViT, see Vision Transformer
François Fleuret is a professor of computer science vocabulary, 33
at the University of Geneva, Switzerland.
weight, 17
The cover illustration is a schematic of the Neocog- decay, 32
nitron by Fukushima [1980], a key ancestor of deep matrix, 61
neural networks.
zero-shot prediction, 124
177
scaling laws, 52
self-attention block, 91, 104, 106
self-supervised learning, 155
semantic segmentation, 84, 119
SGD, see stochastic gradient descent
Single Shot Detector, 118
skip connection, 83, 121, 152
Contents
softargmax, 30, 86
softmax, 30
speech recognition, 122
SSD, see Single Shot Detector Contents 7
stochastic gradient descent, 40, 46, 52
stride, 67, 73 List of figures 10
supervised learning, 21
Foreword 11
Tanh, see hyperbolic tangent
Task Arithmetic, 148
I Foundations 13
tensor, 25
tensor cores, 24 1 Machine Learning 15
Tensor Processing Unit, 24 1.1 Learning from data . . . . . . . . 16
test set, 48 1.2 Basis function regression . . . . . 17
text synthesis, 131 1.3 Under and overfitting . . . . . . . 18
token, 33 1.4 Categories of models . . . . . . . 20
tokenizer, 36, 122
TPU, see Tensor Processing Unit 2 Efficient Computation 23
trainable parameter, 16, 25, 52 2.1 GPUs, TPUs, and batches . . . . . 23
training, 16 2.2 Tensors . . . . . . . . . . . . . . . 25
training set, 16, 29, 48 3 Training 29
Transformer, 46, 83, 85, 93, 103, 105, 122 3.1 Losses . . . . . . . . . . . . . . . 29
transformer, 146
176 5
3.2 Autoregressive models . . . . . . 32 pre-trained model, see model, pre-trained
3.3 Gradient descent . . . . . . . . . 37 prompt, 132, 133
3.4 Backpropagation . . . . . . . . . 41 engineering, 140
3.5 The value of depth . . . . . . . . 45
quantization, 143
3.6 Training protocols . . . . . . . . 48
Quantization-Aware Training, 145
3.7 The benefits of scale . . . . . . . 52 query, 85
II Deep Models 57 RAG, see Retrieval-Augmented Generation
4 Model Components 59 random initialization, 62
4.1 The notion of layer . . . . . . . . 60 receptive field, 68, 69, 118
rectified linear unit, 71, 151
4.2 Linear layers . . . . . . . . . . . . 61
recurrent neural network, 151
4.3 Activation functions . . . . . . . 70
regression, 20
4.4 Pooling . . . . . . . . . . . . . . . 73 Reinforcement Learning, 127, 134
4.5 Dropout . . . . . . . . . . . . . . 75 Reinforcement Learning from Human Feedback,
4.6 Normalizing layers . . . . . . . . 78 133
4.7 Skip connections . . . . . . . . . 83 ReLU, see rectified linear unit
4.8 Attention layers . . . . . . . . . . 84 residual
4.9 Token embedding . . . . . . . . . 91 block, 102
4.10 Positional encoding . . . . . . . . 92 connection, 83, 99
network, 46, 83, 99
5 Architectures 95
ResNet-50, 99
5.1 Multi-Layer Perceptrons . . . . . 95
Retrieval-Augmented Generation, 142
5.2 Convolutional networks . . . . . 97
return, 126
5.3 Attention models . . . . . . . . . 103 reversible layer, see layer, reversible
RL, see Reinforcement Learning
III Applications 111 RLHF, see Reinforcement Learning from Human
6 Prediction 113 Feeback
6.1 Image denoising . . . . . . . . . . 113 RNN, see recurrent neural network
6 175
metric learning, 31 6.2 Image classification . . . . . . . . 114
MLP, see multi-layer perceptron, 146 6.3 Object detection . . . . . . . . . . 115
model, 16 6.4 Semantic segmentation . . . . . . 119
autoregressive, 33, 34, 131 6.5 Speech recognition . . . . . . . . 122
causal, 35, 88, 107 6.6 Text-image representations . . . . 123
parametric, 16 6.7 Reinforcement learning . . . . . . 126
pre-trained, 51, 119, 121
model merging, 148 7 Synthesis 131
multi-layer perceptron, 46, 95–97, 106 7.1 Text generation . . . . . . . . . . 131
7.2 Image generation . . . . . . . . . 134
Natural Language Processing, 84
NLP, see Natural Language Processing 8 The Compute Schism 139
non-linearity, 70 8.1 Prompt Engineering . . . . . . . 140
normalizing layer, see layer, normalizing 8.2 Quantization . . . . . . . . . . . . 143
8.3 Adapters . . . . . . . . . . . . . . 145
object detection, 115 8.4 Model merging . . . . . . . . . . 148
overfitting, 19, 50
174 7
convolutional, 63, 75, 84, 92, 97, 102, 118, 121,
122
embedding, 91, 107
fully connected, 61, 84, 92, 95, 97
hidden, 95
linear, 61
Multi-Head Attention, 89, 92, 106
normalizing, 78
reversible, 44
layer normalization, 81, 106
Leaky ReLU, 72
learning rate, 37, 50
learning rate schedule, 50
LeNet, 97, 98
linear layer, see layer, linear
LLM, see Large Language Model
local minimum, 37
logit, 30, 33
LoRA, see Low-Rank Adaptation
loss, 16
Low-Rank Adaptation, 146, 147
machine learning, 15, 19, 20
Markovian Decision Process, 126
Markovian property, 126
max pooling, 73, 97
MDP, see Markovian, Decision Process
mean squared error, 18, 29
memory requirement, 44
memory speed, 24
173
Generative Adversarial Networks, 153
Generative Pre-trained Transformer, 108, 123, 131,
154
generator, 153
GNN, see Graph Neural Network
GPT, see Generative Pre-trained Transformer
GPU, see Graphical Processing Unit
List of Figures
gradient descent, 37, 39, 41, 45
gradient norm clipping, 45
gradient step, 37
Graph Neural Network, 154 1.1 Kernel regression . . . . . . . . . . . 17
Graphical Processing Unit, 11, 23 1.2 Overfitting of kernel regression . . . 19
ground truth, 20
3.1 Causal autoregressive model . . . . . 35
hidden layer, see layer, hidden 3.2 Gradient descent . . . . . . . . . . . . 38
hidden state, 151 3.3 Backpropagation . . . . . . . . . . . . 42
hyper parameter, see parameter, hyper 3.4 Feature warping . . . . . . . . . . . . 47
hyperbolic tangent, 72 3.5 Training and validation losses . . . . 49
3.6 Scaling laws . . . . . . . . . . . . . . 53
image processing, 97 3.7 Model training costs . . . . . . . . . . 55
image synthesis, 84, 134
inductive bias, 19, 50, 63, 67, 92 4.1 1D convolution . . . . . . . . . . . . . 64
invariance, 75, 91, 92, 155 4.2 2D convolution . . . . . . . . . . . . . 65
4.3 Stride, padding, and dilation . . . . . 66
kernel size, 65, 73 4.4 Receptive field . . . . . . . . . . . . . 68
key, 86 4.5 Activation functions . . . . . . . . . . 71
4.6 Max pooling . . . . . . . . . . . . . . 74
Large Language Model, 51, 54, 85, 132, 139, 154 4.7 Dropout . . . . . . . . . . . . . . . . . 76
layer, 42, 60 4.8 Dropout 2D . . . . . . . . . . . . . . . 77
attention, 84 4.9 Batch normalization . . . . . . . . . . 79
172 9
4.10 Skip connections . . . . . . . . . . . . 82 data augmentation, 115
4.11 Attention operator interpretation . . 85 deep learning, 11, 15
4.12 Complete attention operator . . . . . 87 Deep Q-Network, 127
4.13 Multi-Head Attention layer . . . . . . 90 denoising autoencoder, see autoencoder, denoising
density modeling, 20
5.1 Multi-Layer Perceptron . . . . . . . . 96 depth, 42
5.2 LeNet-like convolutional model . . . 98 diffusion model, 134
5.3 Residual block . . . . . . . . . . . . . 99 dilation, 68, 73
5.4 Downscaling residual block . . . . . . 100 discriminator, 153
5.5 ResNet-50 . . . . . . . . . . . . . . . . 101 downscaling residual block, 102
5.6 Transformer components . . . . . . . 104 downstream task, 51
5.7 Transformer . . . . . . . . . . . . . . 105 DQN, see Deep Q-Network
5.8 GPT model . . . . . . . . . . . . . . . 108 dropout, 75, 88
5.9 ViT model . . . . . . . . . . . . . . . 109
embedding layer, see layer, embedding
6.1 Convolutional object detector . . . . 116 epoch, 48
6.2 Object detection with SSD . . . . . . 117 equivariance, 67, 91
6.3 Semantic segmentation with PSP . . . 120
6.4 CLIP zero-shot prediction . . . . . . . 125 feed-forward block, 104, 106
6.5 DQN state value evolution . . . . . . 129 few-shot prediction, 133
filter, 67
7.1 Few-shot prediction with a GPT . . . 132 fine-tune, 119
7.2 Denoising diffusion . . . . . . . . . . 135 fine-tuning, 51, 133
flops, 25
8.1 Chain-of-thought . . . . . . . . . . . 141 forward pass, 42
8.2 Quantization . . . . . . . . . . . . . . 144 foundation model, 133
FP32, 25
framework, 25
GAN, see Generative Adversarial Networks
GELU, 72
10 171
batch normalization, 78, 102
Bellman equation, 127
bias vector, 61, 67
BPE, see Byte Pair Encoding
Byte Pair Encoding, 36, 122
cache memory, 24
Foreword
capacity, 18
causal, 35, 87, 106
model, see model, causal
chain rule (derivative), 41
chain rule (probability), 33 The current period of progress in artificial intelli-
chain-of-thought, 133, 142 gence was triggered when Krizhevsky et al. [2012]
channel, 26 demonstrated that an artificial neural network de-
checkpointing, 44 signed twenty years earlier [LeCun et al., 1989]
classification, 20, 30, 97, 114 could outperform complex state-of-the-art image
CLIP, see Contrastive Language-Image recognition methods by a huge margin, simply
Pre-training by being a hundred times larger and trained on a
CLS token, 110 dataset similarly scaled up.
computational cost, 44, 88 This breakthrough was made possible thanks to
context size, 140 Graphical Processing Units (GPUs), highly paral-
Contrastive Language-Image Pre-training, 123, lel consumer-grade computing devices developed
148 for real-time image synthesis and repurposed for
contrastive loss, 31, 124 artificial neural networks.
convnet, see convolutional network
convolution, 65, 67 Since then, under the umbrella term of “
convolutional layer, see layer, convolutional deep learning
,” innovations in the structures of these net-
convolutional network, 97 works, the strategies to train them, and dedicated
cross-attention block, 91, 104, 106 hardware have allowed for an exponential increase
cross-entropy, 31, 34, 46 in both their size and the quantity of training data
170 11
they take advantage of [Sevilla et al., 2022]. This
has resulted in a wave of successful applications
across technical domains, from computer vision
and robotics to speech processing, and since 2020
in the development of Large Language Models with
general proto-reasoning capabilities [Chowdhery Index
et al., 2022].
Although the bulk of deep learning is not difficult
to understand, it combines diverse components
such as linear algebra, calculus, probabilities, op- 1D convolution, 65
timization, signal processing, programming, algo- 2D convolution, 67
rithmics, and high-performance computing, mak-
ing it complicated to learn. activation, 25, 41
function, 70, 95
Instead of trying to be exhaustive, this little book map, 69
is limited to the background necessary to under- Adam, 40, 147
stand a few important models. This proved to be a adapter, 146
popular approach, resulting in more than 500,000 affine operation, 61
downloads of the PDF file in the 12 months follow- artificial neural network, 11, 15
ing its announcement on Twitter. attention operator, 86
autoencoder, 152
You can download a phone-formatted PDF of this denoising, 113
book from Autograd, 43
autoregressive model, see model, autoregressive
https://fleuret.org/public/lbdl.pdf
average pooling, 75
François Fleuret, backpropagation, 43
May 19, 2024 backward pass, 43, 147
basis function regression, 17
batch, 24, 40
12 169
Part I
Foundations
tinguished Experts. CoRR, abs/2305.14688, 2023.
140
P. Yadav, D. Tam, L. Choshen, et al. TIES-Merging:
Resolving Interference When Merging Models.
CoRR, abs/2306.01708, 2023. 148
L. Yu, B. Yu, H. Yu, et al. Language Models are Super
Mario: Absorbing Abilities from Homologous
Models as a Free Lunch. CoRR, abs/2311.03099,
2023. 148
J. Zbontar, L. Jing, I. Misra, et al. Barlow Twins: Self-
Supervised Learning via Redundancy Reduction.
CoRR, abs/2103.03230, 2021. 155
M. D. Zeiler and R. Fergus. Visualizing and Under-
standing Convolutional Networks. In European
Conference on Computer Vision (ECCV), 2014. 69
H. Zhao, J. Shi, X. Qi, et al. Pyramid Scene Parsing
Network. CoRR, abs/1612.01105, 2016. 120, 121
J. Zhou, C. Wei, H. Wang, et al. iBOT: Image
BERT Pre-Training with Online Tokenizer. CoRR,
abs/2111.07832, 2021. 155
167
J. Sevilla, P. Villalobos, J. F. Cerón, et al. Parameter,
Compute and Data Trends in Machine Learning,
May 2023. [web]. 55
166 15
1.1 Learning from data A. Radford, K. Narasimhan, T. Salimans, and
I. Sutskever. Improving Language Understand-
The simplest use case for a model trained from data ing by Generative Pre-Training, 2018. 104, 108,
is when a signal x is accessible, for instance, the 131
picture of a license plate, from which one wants to
predict a quantity y, such as the string of characters A. Radford, J. Wu, R. Child, et al. Language Models
written on the plate. are Unsupervised Multitask Learners, 2019. 108,
155
In many real-world situations where x is a high-
dimensional signal captured in an uncontrolled O. Ronneberger, P. Fischer, and T. Brox. U-Net:
environment, it is too complicated to come up with Convolutional Networks for Biomedical Image
an analytical recipe that relates x and y. Segmentation. In Medical Image Computing and
Computer-Assisted Intervention, 2015. 82, 83, 121
What one can do is to collect a large training set 𝒟
of pairs (xn , yn ), and devise a parametric model f . P. Sahoo, A. Singh, S. Saha, et al. A Systematic Sur-
This is a piece of computer code that incorporates vey of Prompt Engineering in Large Language
trainable parameters w that modulate its behavior, Models: Techniques and Applications. CoRR,
and such that, with the proper values w∗ , it is a abs/2402.07927, 2024. 140
good predictor. “Good” here means that if an x is F. Scarselli, M. Gori, A. C. Tsoi, et al. The Graph
given to this piece of code, the value ŷ = f (x; w∗ ) Neural Network Model. IEEE Transactions on
it computes is a good estimate of the y that would Neural Networks (TNN), 20(1):61–80, 2009. 154
have been associated with x in the training set had
it been there. R. Sennrich, B. Haddow, and A. Birch. Neural Ma-
chine Translation of Rare Words with Subword
This notion of goodness is usually formalized with Units. CoRR, abs/1508.07909, 2015. 36
a loss ℒ (w) which is small when f ( · ; w) is good
on 𝒟 . Then, training the model consists of com- J. Sevilla, L. Heim, A. Ho, et al. Compute Trends
puting a value w∗ that minimizes ℒ (w∗ ). Across Three Eras of Machine Learning. CoRR,
abs/2202.05924, 2022. 12, 52
Most of the content of this book is about the defini-
tion of f , which, in realistic scenarios, is a complex
16 165
Deep Learning for Audio, Speech and Language combination of pre-defined sub-modules.
Processing, 2013. 72
The trainable parameters that compose w are of-
V. Mnih, K. Kavukcuoglu, D. Silver, et al. Human- ten called weights, by analogy with the synaptic
level control through deep reinforcement learn- weights of biological neural networks. In addition
ing. Nature, 518(7540):529–533, February 2015. to these parameters, models usually depend on
127, 128, 129 ,hyper-parameters
which are set according to domain
prior knowledge, best practices, or resource con-
A. Nichol, P. Dhariwal, A. Ramesh, et al. GLIDE: To- straints. They may also be optimized in some way,
wards Photorealistic Image Generation and Edit- but with techniques different from those used to
ing with Text-Guided Diffusion Models. CoRR, optimize w.
abs/2112.10741, 2021. 137
L. Ouyang, J. Wu, X. Jiang, et al. Training language 1.2 Basis function regression
models to follow instructions with human feed-
back. CoRR, abs/2203.02155, 2022. 133 We can illustrate the training of a model in a simple
case where xn and yn are two real numbers, the
R. Pascanu, T. Mikolov, and Y. Bengio. On the
difficulty of training recurrent neural networks.
In International Conference on Machine Learning
(ICML), 2013. 45
164 17
loss is the mean squared error: Y. LeCun, B. Boser, J. S. Denker, et al. Backpropaga-
N
tion applied to handwritten zip code recognition.
ℒ (w) = (1.1) Neural Computation, 1(4):541–551, 1989. 11
N
(yn − f (xn ; w))2 ,
n=1
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner.
1 X
and f ( · ; w) is a linear combination of a prede- Gradient-based learning applied to document
fined basis of functions f1 , . . . , fK , with w = recognition. Proceedings of the IEEE, 86(11):2278–
(w1 , . . . , wK ): 2324, 1998. 97, 98
K
P. Lewis, E. Perez, A. Piktus, et al. Retrieval-
f (x; w) = wk fk (x). Augmented Generation for Knowledge-
k=1
Intensive NLP Tasks. CoRR, abs/2005.11401,
X
2020. 142
Since f (xn ; w) is linear with respect to the wk s and
ℒ (w) is quadratic with respect to f (xn ; w), the W. Liu, D. Anguelov, D. Erhan, et al. SSD: Single
loss ℒ (w) is quadratic with respect to the wk s, and Shot MultiBox Detector. CoRR, abs/1512.02325,
finding w∗ that minimizes it boils down to solving 2015. 117, 118
a linear system. See Figure 1.1 for an example with
Llama.cpp. Llama.cpp git repository, June 2023.
Gaussian kernels as fk .
[web]. 143, 144
1.3 Under and overfitting J. Long, E. Shelhamer, and T. Darrell. Fully Convo-
lutional Networks for Semantic Segmentation.
A key element is the interplay between the CoRR, abs/1411.4038, 2014. 82, 83, 121
of the model, that is its flexibility and ability to
capacity
fit diverse data, and the amount and quality of the S. Ma, H. Wang, L. Ma, et al. The Era of 1-bit
training data. When the capacity is insufficient, the LLMs: All Large Language Models are in 1.58
model cannot fit the data, resulting in a high error Bits. CoRR, abs/2402.17764, 2024. 145
during training. This is referred to as underfitting. A. L. Maas, A. Y. Hannun, and A. Y. Ng. Rectifier
On the contrary, when the amount of data is in- nonlinearities improve neural network acoustic
sufficient, as illustrated in Figure 1.2, the model models. In proceedings of the ICML Workshop on
18 163
S. Ioffe and C. Szegedy. Batch Normalization: Ac-
celerating Deep Network Training by Reducing
Internal Covariate Shift. In International Confer-
ence on Machine Learning (ICML), 2015. 78
A. Jiang, A. Sablayrolles, A. Mensch, et al. Mistral
7B. CoRR, abs/2310.06825, 2023. 149
J. Kaplan, S. McCandlish, T. Henighan, et al. Scal-
ing Laws for Neural Language Models. CoRR, Figure 1.2: If the amount of training data (black
abs/2001.08361, 2020. 52, 53 dots) is small compared to the capacity of the model,
the empirical performance of the fitted model during
A. Katharopoulos, A. Vyas, N. Pappas, and training (red curve) reflects poorly its actual fit to
F. Fleuret. Transformers are RNNs: Fast Au- the underlying data structure (thin black curve), and
toregressive Transformers with Linear Atten- consequently its usefulness for prediction.
tion. In Proceedings of the International Confer-
ence on Machine Learning (ICML), pages 5294–
5303, 2020. 89 will often learn characteristics specific to the train-
ing examples, resulting in excellent performance
D. Kingma and J. Ba. Adam: A Method for Stochas- during training, at the cost of a worse fit to the
tic Optimization. CoRR, abs/1412.6980, 2014. 40 global structure of the data, and poor performance
D. P. Kingma and M. Welling. Auto-Encoding Vari- on new inputs. This phenomenon is referred to as
ational Bayes. CoRR, abs/1312.6114, 2013. 153 overfitting.
T. Kojima, S. Gu, M. Reid, et al. Large Lan- So, a large part of the art of applied
guage Models are Zero-Shot Reasoners. CoRR, machine learning
is to design models that are not too flexible yet
abs/2205.11916, 2022. 142 still able to fit the data. This is done by crafting
the right inductive bias in a model, which means
A. Krizhevsky, I. Sutskever, and G. Hinton. Ima- that its structure corresponds to the underlying
geNet Classification with Deep Convolutional structure of the data at hand.
Neural Networks. In Neural Information Process-
ing Systems (NIPS), 2012. 11, 97 Even though this classical perspective is relevant
162 19
for reasonably-sized deep models, things get con- K. He, X. Zhang, S. Ren, and J. Sun. Deep Resid-
fusing with large ones that have a very large num- ual Learning for Image Recognition. CoRR,
ber of trainable parameters and extreme capacity abs/1512.03385, 2015. 52, 82, 83, 99, 101
yet still perform well on prediction. We will come
back to this in § 3.6 and § 3.7. D. Hendrycks and K. Gimpel. Gaussian Error Lin-
ear Units (GELUs). CoRR, abs/1606.08415, 2016.
72
1.4 Categories of models
D. Hendrycks, K. Zhao, S. Basart, et al. Natural
We can organize the use of machine learning mod- Adversarial Examples. CoRR, abs/1907.07174,
els into three broad categories: 2019. 126
Regression consists of predicting a continuous- J. Ho, A. Jain, and P. Abbeel. Denoising Diffu-
valued vector y ∈ RK , for instance, a geometrical sion Probabilistic Models. CoRR, abs/2006.11239,
position of an object, given an input signal X. This 2020. 134, 135, 136
is a multi-dimensional generalization of the setup
we saw in § 1.2. The training set is composed of S. Hochreiter and J. Schmidhuber. Long Short-Term
pairs of an input signal and a ground-truth value. Memory. Neural Computation, 9(8):1735–1780,
1997. 151
Classification aims at predicting a value from a
finite set {1, . . . , C}, for instance, the label Y of N. Houlsby, A. Giurgiu, S. Jastrzebski, et al.
an image X. As with regression, the training set Parameter-Efficient Transfer Learning for NLP.
is composed of pairs of input signal, and ground- CoRR, abs/1902.00751, 2019. 146
truth quantity, here a label from that set. The stan- E. Hu, Y. Shen, P. Wallis, et al. LoRA: Low-Rank
dard way of tackling this is to predict one score Adaptation of Large Language Models. CoRR,
per potential class, such that the correct class has abs/2106.09685, 2021. 146
the maximum score.
G. Ilharco, M. Ribeiro, M. Wortsman, et al. Edit-
Density modeling has as its objective to model ing Models with Task Arithmetic. CoRR,
the probability density function of the data µX it- abs/2212.04089, 2022. 148
self, for instance, images. In that case, the training
20 161
Y. Gal and Z. Ghahramani. Dropout as a Bayesian set is composed of values xn without associated
Approximation: Representing Model Uncer- quantities to predict, and the trained model should
tainty in Deep Learning. CoRR, abs/1506.02142, allow for the evaluation of the probability den-
2015. 78 sity function, or sampling from the distribution, or
both.
X. Glorot and Y. Bengio. Understanding the diffi-
culty of training deep feedforward neural net- Both regression and classification are generally re-
works. In International Conference on Artificial ferred to as supervised learning, since the value to
Intelligence and Statistics (AISTATS), 2010. 45, 62 be predicted, which is required as a target during
training, has to be provided, for instance, by hu-
X. Glorot, A. Bordes, and Y. Bengio. Deep Sparse man experts. On the contrary, density modeling
Rectifier Neural Networks. In International Con- is usually seen as unsupervised learning, since it
ference on Artificial Intelligence and Statistics is sufficient to take existing data without the need
(AISTATS), 2011. 71 for producing an associated ground-truth.
A. Gomez, M. Ren, R. Urtasun, and R. Grosse. These three categories are not disjoint; for instance,
The Reversible Residual Network: Backprop- classification can be cast as class-score regression,
agation Without Storing Activations. CoRR, or discrete sequence density modeling as iterated
abs/1707.04585, 2017. 44 classification. Furthermore, they do not cover all
I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, cases. One may want to predict compounded quan-
et al. Generative Adversarial Networks. CoRR, tities, or multiple classes, or model a density con-
abs/1406.2661, 2014. 153 ditional on a signal.
160 21
G. Cybenko. Approximation by superpositions of
a sigmoidal function. Mathematics of Control,
Signals, and Systems, 2(4):303–314, December
1989. 96
J. Deng, W. Dong, R. Socher, et al. ImageNet:
A Large-Scale Hierarchical Image Database.
In Conference on Computer Vision and Pattern
Recognition (CVPR), 2009. 51
T. Dettmers, A. Pagnoni, A. Holtzman, and
L. Zettlemoyer. QLoRA: Efficient Finetuning
of Quantized LLMs. CoRR, abs/2305.14314, 2023.
147
J. Devlin, M. Chang, K. Lee, and K. Toutanova.
BERT: Pre-training of Deep Bidirectional Trans-
formers for Language Understanding. CoRR,
abs/1810.04805, 2018. 52, 110, 155
A. Dosovitskiy, L. Beyer, A. Kolesnikov, et al.
An Image is Worth 16x16 Words: Transform-
ers for Image Recognition at Scale. CoRR,
abs/2010.11929, 2020. 109, 110
K. Fukushima. Neocognitron: A self-organizing
neural network model for a mechanism of pat-
tern recognition unaffected by shift in position.
Biological Cybernetics, 36(4):193–202, April 1980.
4
159
I. Beltagy, M. Peters, and A. Cohan. Longformer:
The Long-Document Transformer. CoRR,
abs/2004.05150, 2020. 88
R. Bommasani, D. Hudson, E. Adeli, et al. On the
Opportunities and Risks of Foundation Models. Chapter 2
CoRR, abs/2108.07258, 2021. 133
J. Bradbury, S. Merity, C. Xiong, and R. Socher. Efficient Computation
Quasi-Recurrent Neural Networks. CoRR,
abs/1611.01576, 2016. 152
T. Brown, B. Mann, N. Ryder, et al. Language Mod-
els are Few-Shot Learners. CoRR, abs/2005.14165, From an implementation standpoint, deep learning
2020. 52, 108, 131 is about executing heavy computations with large
S. Bubeck, V. Chandrasekaran, R. Eldan, et al. amounts of data. The Graphical Processing Units
Sparks of Artificial General Intelligence: Early (GPUs) have been instrumental in the success of
experiments with GPT-4. CoRR, abs/2303.12712, the field by allowing such computations to be run
2023. 133 on affordable hardware.
T. Chen, B. Xu, C. Zhang, and C. Guestrin. Training The importance of their use, and the resulting tech-
Deep Nets with Sublinear Memory Cost. CoRR, nical constraints on the computations that can be
abs/1604.06174, 2016. 44 done efficiently, force the research in the field to
constantly balance mathematical soundness and
K. Cho, B. van Merrienboer, Ç. Gülçehre, et al. implementability of novel methods.
Learning Phrase Representations using RNN
Encoder-Decoder for Statistical Machine Trans-
lation. CoRR, abs/1406.1078, 2014. 151
2.1 GPUs, TPUs, and batches
A. Chowdhery, S. Narang, J. Devlin, et al. PaLM: Graphical Processing Units were originally de-
Scaling Language Modeling with Pathways. signed for real-time image synthesis, which re-
CoRR, abs/2204.02311, 2022. 12, 52, 133 quires highly parallel architectures that happen
158 23
to be well suited for deep models. As their usage
for AI has increased, GPUs have been equipped
with dedicated tensor cores, and deep-learning spe-
cialized chips such as Google’s
Tensor Processing Units
(TPUs) have been developed.
A GPU possesses several thousand parallel units
Bibliography
and its own fast memory. The limiting factor is
usually not the number of computing units, but
the read-write operations to memory. The slow-
est link is between the CPU memory and the GPU
memory, and consequently one should avoid copy- T. Akiba, M. Shing, Y. Tang, et al. Evolutionary
ing data across devices. Moreover, the structure Optimization of Model Merging Recipes. CoRR,
of the GPU itself involves multiple levels of abs/2403.13187, 2024. 149
,cache memory
which are smaller but faster, and compu- J. L. Ba, J. R. Kiros, and G. E. Hinton. Layer Nor-
tation should be organized to avoid copies between malization. CoRR, abs/1607.06450, 2016. 81
these different caches.
R. Balestriero, M. Ibrahim, V. Sobal, et al. A
This is achieved, in particular, by organizing the Cookbook of Self-Supervised Learning. CoRR,
computation in batches of samples that can fit en- abs/2304.12210, 2023. 155
tirely in the GPU memory and are processed in
parallel. When an operator combines a sample A. Baydin, B. Pearlmutter, A. Radul, and J. Siskind.
and model parameters, both have to be moved Automatic differentiation in machine learning:
to the cache memory near the actual computing a survey. CoRR, abs/1502.05767, 2015. 43
units. Proceeding by batches allows for copying
the model parameters only once, instead of doing M. Belkin, D. Hsu, S. Ma, and S. Mandal. Recon-
it for each sample. In practice, a GPU processes a ciling modern machine learning and the bias-
batch that fits in memory almost as quickly as it variance trade-off. CoRR, abs/1812.11118, 2018.
would process a single sample. 50
24 157
A standard GPU has a theoretical
peak performance
of 1013 –1014 floating-point operations
(FLOPs) per second, and its memory typically
ranges from 8 to 80 gigabytes. The standard FP32
encoding of float numbers is on 32 bits, but empir-
ical results show that using encoding on 16 bits,
or even less for some operands, does not degrade
performance.
2.2 Tensors
GPUs and deep learning frameworks such as Py-
Torch or JAX manipulate the quantities to be pro-
cessed by organizing them as tensors, which are
series of scalars arranged along several discrete
axes. They are elements of RN1 ×···×ND that gen-
eralize the notion of vector and matrix.
25
is the dimension of the feature representation at swering questions, or even translating from one
every time step, often referred to as the number of language to another [Radford et al., 2019].
channels. Similarly, a 2D-structured signal can be
represented as a D × H × W tensor, where H and Such models constitute one category of a larger
W are its height and width. An RGB image would class of methods that fall under the name of
correspond to D = 3, but the number of channels self-supervised learning
, and try to take advantage of
can grow up to several thousands in large models. unlabeled datasets [Balestriero et al., 2023].
Adding more dimensions allows for the represen- The key principle of these methods is to define a
tation of series of objects. For example, fifty RGB task that does not require labels but necessitates
images of resolution 32 × 24 can be encoded as a feature representations which are useful for the
50 × 3 × 24 × 32 tensor. real task of interest, for which a small labeled
dataset exists. In computer vision, for instance,
Deep learning libraries provide a large number of image features can be optimized so that they are
operations that encompass standard linear alge- to data transformations that do not change
invariant
bra, complex reshaping and extraction, and deep- the semantic content of the image, while being
learning specific operations, some of which we will statistically uncorrelated [Zbontar et al., 2021].
see in Chapter 4. The implementation of tensors
separates the shape representation from the stor- In both NLP and computer vision, a powerful
age layout of the coefficients in memory, which al- generic strategy is to train a model to recover parts
lows many reshaping, transposing, and extraction of the signal that have been masked [Devlin et al.,
operations to be done without coefficient copying, 2018; Zhou et al., 2021].
hence extremely rapidly.
In practice, virtually any computation can be
decomposed into elementary tensor operations,
which avoids non-parallel loops at the language
level and poor memory management.
Besides being convenient tools, tensors are instru-
mental in achieving computational efficiency. All
26 155
cues that the discriminator uses that need to be the people involved in the development of an op-
addressed. erational deep model, from the designers of the
drivers, libraries, and models to those of the com-
Graph Neural Networks puters and chips, know that the data will be ma-
nipulated as tensors. The resulting constraints on
Many applications require processing signals locality and block decomposability enable all the
which are not organized regularly on a grid. For in- actors in this chain to come up with optimal de-
stance, proteins, 3D meshes, geographic locations, signs.
or social interactions are more naturally structured
as graphs. Standard convolutional networks or
even attention models are poorly adapted to pro-
cess such data, and the tool of choice for such a
task is Graph Neural Networks (GNN) [Scarselli
et al., 2009].
Self-supervised training
As stated in § 7.1, even though they are trained only
to predict the next word, Large Language Models
trained on large unlabeled datasets such as GPT
(see § 5.3) are able to solve various tasks, such as
identifying the grammatical role of a word, an-
154 27
manifold.
The Variational Autoencoder (VAE) proposed by
Kingma and Welling [2013] is a generative model
with a similar structure. It imposes, through the
loss, a pre-defined distribution on the latent rep-
resentation. This allows, after training, the gener-
ation of new samples by sampling the latent rep-
resentation according to this imposed distribution
and then mapping back through the decoder.
Generative Adversarial Networks
Another approach to density modeling is the
Generative Adversarial Networks
(GAN) introduced
by Goodfellow et al. [2014]. This method combines
a generator, which takes a random input follow-
ing a fixed distribution as input and produces a
structured signal such as an image, and a
discriminator
, which takes a sample as input and predicts
whether it comes from the training set or if it was
generated by the generator.
Training optimizes the discriminator to minimize
a standard cross-entropy loss, and the generator to
maximize the discriminator’s loss. It can be shown
that, at equilibrium, the generator produces sam-
ples indistinguishable from real data. In practice,
when the gradient flows through the discriminator
to the generator, it informs the latter about the
153
of skip connections which are modulated dynami-
cally.
152 29
hood of the data. If f (x; w) is to be interpreted as a
normalized log-probability or log-density, the loss
is the opposite of the sum of its values over train-
ing samples, which corresponds to the likelihood
of the data-set.
Cross-entropy
The Missing Bits
For classification, the usual strategy is that the out-
put of the model is a vector with one component
f (x; w)y per class y, interpreted as the logarithm
of a non-normalized probability, or logit. For the sake of concision, this volume skips many
important topics, in particular:
With X the input signal and Y the class to predict,
we can then compute from f an estimate of the Recurrent Neural Networks
posterior probabilities:
Before attention models showed greater perfor-
exp f (x; w)y mance, Recurrent Neural Networks (RNN) were
.
z exp f (x; w)z
the standard approach for dealing with temporal se-
quences such as text or sound samples. These archi-
P̂ (Y = y | X = x) = P
This expression is generally called the softmax, or tectures possess an internal hidden state that gets
more adequately, the softargmax, of the logits. updated each time a component of the sequence is
processed. Their main components are layers such
To be consistent with this interpretation, the model
as LSTM [Hochreiter and Schmidhuber, 1997] or
should be trained to maximize the probability of
GRU [Cho et al., 2014].
the true classes, hence to minimize the
Training a recurrent architecture amounts to un-
folding it in time, which results in a long composi-
tion of operators. This has historically prompted
the design of key techniques now used for deep
architectures such as rectifiers and gating, a form
151
metric learning, where the ob-
jective is to learn a measure of distance between
samples such that a sample xa from a certain se-
mantic class is closer to any sample xb of the same
class than to any sample xc from another class. For
instance, xa and xb can be two pictures of a certain
person, and xc a picture of someone else.
31
Engineering the loss space is to recombine their layers. Akiba et al.
[2024] combine merging the parameters and re-
Usually, the loss minimized during training is not combining layers, and rely on a stochastic op-
the actual quantity one wants to optimize ulti- timization to deal with the combinatorial explo-
mately, but a proxy for which finding the best sion. Experiments with three fine-tuned versions
model parameters is easier. For instance, cross- of Mistral-7B [Jiang et al., 2023] show that combin-
entropy is the standard loss for classification, even ing these two merging strategies outperforms both
though the actual performance measure is a classi- of them.
fication error rate, because the latter has no infor-
mative gradient, a key requirement as we will see
in § 3.3.
It is also possible to add terms to the loss that
depend on the trainable parameters of the model
themselves to favor certain configurations.
The weight decay regularization, for instance, con-
sists of adding to the loss a term proportional to
the sum of the squared parameters. This can be
interpreted as having a Gaussian Bayesian prior
on the parameters, which favors smaller values
and thereby reduces the influence of the data. This
degrades performance on the training set, but re-
duces the gap between the performance in training
and that on new, unseen data.
3.2 Autoregressive models
A key class of methods, particularly for dealing
with discrete sequences in natural language pro-
32 149
8.4 Model merging cessing and computer vision, are the
,autoregressive models
An alternative to the fine-tuning and prompting
methods seen in the previous sections consists of The chain rule for probabilities
combining multiple models with diverse capabili-
ties into a single one, without additional training. Such models put to use the chain rule from proba-
bility theory:
Model merging relies on the compatibility between
P (X1 = x1 , X2 = x2 , . . . , XT = xT ) =
multiple fine-tuned versions of a base model.
P (X1 = x1 )
Ilharco et al. [2022] showed that models obtained × P (X2 = x2 | X1 = x1 )
by fine-tuning a CLIP base model on several image
...
classification data-sets can be combined in the pa-
rameter space, where they exhibit Task Arithmetic × P (XT = xT | X1 = x1 , . . . , XT −1 = xT −1 ).
properties.
Although this decomposition is valid for a random
Formally, let θ be the parameter vector of a pre- sequence of any type, it is particularly efficient
trained model, and for t = 1, . . . , T , let θt and when the signal of interest is a sequence of tokens
τt = θt − θ be respectively the parameters af- from a finite vocabulary {1, . . . K}.
ter fine-tuning on task t and the corresponding
residual. Experiments show that the model with With the convention that the additional token ∅
parameters θ + τ1 + · · · + τT exhibits multi-task stands for an “unknown” quantity, we can repre-
capabilities. Similarly, subtracting a τt degrades sent the event {X1 = x1 , . . . , Xt = xt } as the
the performance on the corresponding task. vector (x1 , . . . , xt , ∅, . . . , ∅).
148 33
allows to sample one token given the previous sion denoising models by fine-tuning the attention
ones. blocks responsible for the text-based conditioning.
The chain rule ensures that by sampling T tokens Since fine-tuning with LoRA adapters drastically
xt , one at a time given the previously sampled reduces the number of trainable parameters, it re-
x1 , . . . , xt−1 , we get a sequence that follows the duces the memory footprint required by optimiz-
joint distribution. This is an autoregressive gener- ers such as Adam, which generally store two run-
ative model. ning average per parameter to optimize. Also, it
reduces slightly the computation during the
Training such a model can be done by minimizing .backward pass
the sum across training sequences and time steps
of the cross-entropy loss For commercial applications that require a large
number of fine-tuned models, the AB pairs can be
Lce f (x1 , . . . , xt−1 , ∅, . . . , ∅; w), xt , stored separately from the original model, which
has to be stored only once. And finally, contrary
which is formally equivalent to maximizing the
likelihood of the true xt s. to other type of adapters, the modifications can be
integrated into the original architecture, simply by
The value that is classically monitored is not the adding AB to W , resulting in an architecture and
cross-entropy itself, but the perplexity, which is parameter count for inference strictly identical to
defined as the exponential of the cross-entropy. that of the base model.
It corresponds to the number of values of a uni-
form distribution with the same entropy, which is We saw that quantization degrade models’ accu-
generally more interpretable. racy only marginally. However, gradient descent
requires high precision in both the gradient and the
Causal models trained parameters, to allow the accumulation of
small changes. The QLoRA approach combines a
The training procedure we just described requires quantized base model and unquantized
a different input for each t, and the bulk of the Low-Rank Adaptation
to reduce the memory requirement
computation done for t < t′ is repeated for t′ . This even more [Dettmers et al., 2023].
is extremely inefficient since T is often of the order
of hundreds or thousands.
34 147
with few parameters, referred to as adapters, to the l1 l2 l3 ... lT −1 lT
pre-trained architecture, and freeze all the original
parameters [Houlsby et al., 2019].
f
The current dominant method is the
Low-Rank Adaptation
(LoRA), which adds low-rank correc-
tions to some of the model’s weight matrices [Hu x1 x2 ... xT −2 xT −1
et al., 2021].
Figure 3.1: An autoregressive model f , is causal if
Formally, given a linear operation of the form
a time step xt of the input sequence modulates the
XW T , where X is a N ×D tensor of activations for
predicted logits ls only if s > t, as depicted by the
a batch of N samples, and W is a C ×D weight ma-
blue arrows. This allows computing the distributions
trix, the LoRA adapter replaces this operation with
at all the time steps in one pass during training. Dur-
X(W + BA)T , where A and B are two trainable
ing sampling, however, the lt and xt are computed
matrices of size R × D and C × R respectively,
sequentially, the latter sampled with the former, as
with R ≪ min(C, D), and the matrix W is re-
depicted by the red arrows.
moved from the trainable parameters. The matrix
A is initialized with random Gaussian values, and
B is set to zero, so that the fine-tuning starts with The standard strategy to address this issue is to
a model that computes an output identical to that design a model f that predicts all the vectors of
of the original one. logits l1 , . . . , lT at once, that is:
146 35
The consequence is that the output at every posi- It quantizes individually sub-blocks of 32 entries
tion is the one that would be obtained if the input of the original weight matrix by storing for each a
were only available up to before that position. Dur- scaling factor d and a bias m in the original FP16
ing training, it allows one to compute the output for encoding, and encoding each entry x with 4 bits
a full sequence and to maximize the predicted prob- as a value q ∈ {0, . . . , 24 − 1}. The resulting de-
abilities of all the tokens of that same sequence, quantized value being x̃ = dq + m.
which again boils down to minimizing the sum of
the per-token cross-entropy. Such a block was encoded originally as 32 values in
FP16, hence 64 bytes, while the quantized version
Note that, for the sake of simplicity, we have de- needs 4 bytes for q and m and 32 · 4 bits = 16 bytes
fined f as operating on sequences of a fixed length for the entries, hence a total of 20 bytes.
T . However, models used in practice, such as the
transformers we will see in § 5.3, are able to process Such an aggressive quantization surprisingly de-
sequences of arbitrary length. grades only marginally the performance of the
models, as illustrated on Figure 8.2.
Tokenizer An alternative to Post-Training Quantization is
One important technical detail when dealing with Quantization-Aware Training that applies quanti-
natural languages is that the representation as to- zation during the forward pass but keeps high-
kens can be done in multiple ways, ranging from precision encoding of parameters and gradients,
the finest granularity of individual symbols to en- and propagates the gradients during the backward
tire words. The conversion to and from the token pass as if there was no quantization [Ma et al.,
representation is carried out by a separate algo- 2024].
rithm called a tokenizer.
8.3 Adapters
A standard method is the Byte Pair Encoding (BPE)
[Sennrich et al., 2015] that constructs tokens by As we saw in § 3.6, fine-tuning is a key strategy to
hierarchically merging groups of characters, trying reuse pre-trained models. Since it aims at making
to get tokens that represent fragments of words of only minor changes to an existing model, tech-
various lengths but of similar frequencies, allocat- niques have been developed that add components
36 145
ing tokens to long frequent fragments as well as to
rare individual symbols.
6.5
144 37
8.2 Quantization
Although training or generating multiple streams
can benefit from high-end parallel computing de-
vices, deployment of a Large Language Model for
individual use requires generally single-stream in-
ference, which is bounded by memory size and
speed far more than by computation.
As stated in § 2.1, parameters, activations, and gra-
dients are usually encoded with 32 or 16 bits. The
precision it provides is necessary for training, to
allow gradual changes to accumulate.
w
However, since activations are the sums of many
terms, quantization during inference is mitigated
by an averaging effect. This is even more true with
large architectures, and models quantized down
to 6 or 4 bits per parameter exhibit remarkable
performance. Additionally to reducing the mem-
ℒ (w) ory footprint, quantization also improves inference
speed significantly.
This has motivated the development of software
w to quantize existing models with
,Post-Training Quantization
and run them in single-stream in-
ference on consumer hardware, such as llama.cpp
Figure 3.2: At every point w, the gradient ∇ℒ |w (w)
is in the direction that maximizes the increase of ℒ ,
[Llama.cpp, 2023]. This framework implements
orthogonal to the level curves (top). The gradient
multiple formats, that apply specific quantization
descent minimizes ℒ (w) iteratively by subtracting
levels for the different weight matrices of a lan-
a fraction of the gradient at every step, resulting in a
trajectory that follows the steepest descent (bottom).
38 143
Chain of Thought around a good minimum and never descend into
it. As we will see in § 3.6, it can depend on the
A remarkable type of prompting aims at making iteration number n.
the model generate intermediate steps before gen-
erating the response itself.
Stochastic Gradient Descent
Such a chain-of-thought is composed of succes- All the losses used in practice can be expressed as
sive steps that are simpler, hence have been better an average of a loss per small group of samples, or
modeled during training, and are predicted more per sample such as:
deterministically [Wei et al., 2022; Kojima et al.,
2022]. See Figure 8.1 for an example. N
1 X
ℒ (w) = 𝓁n (w),
N
Retrieval-Augmented Generation n=1
142 39
gradient. Due to the redundancy in the data, this
happens to be a far more efficient strategy.
Q: Gina has 105 beans, she gives 23 beans to Bob, and
We saw in § 2.1 that processing a batch of samples prepares a soup with 53 beans. How many beans are left?
small enough to fit in the computing device’s mem- A: There are 29 beans left.
ory is generally as fast as processing a single one.
Hence, the standard approach is to split the full Q: I prepare 53 pancakes, eat 5 of them and give 7 to Gina.
I then prepare 26 more. How many pancakes are left? A:
set 𝒟 into batches, and to update the parameters 27 pancakes are left.
from the estimate of the gradient computed from
Q: Gina has 105 beans, she gives 23 beans to Bob, and
each. This is called mini-batch stochastic gradient prepares a soup with 53 beans. How many beans are left?
descent, or stochastic gradient descent (SGD) for A: Let’s proceed step by step: Gina has 105 beans, she
short. gives 23 beans to Bob (82 left), and prepares a soup with
53 beans (29 left). So there are 29 beans left.
It is important to note that this process is extremely
gradual, and that the number of mini-batches and Q: I prepare 53 pancakes, eat 5 of them and give 7 to Gina.
I then prepare 26 more. How many pancakes are left? A:
gradient steps are typically of the order of several Let’s proceed step by step: 53 pancakes, eat 5 of them
million. (48 left), give 7 to Gina (41 left), prepare 26 more (67
left). So there are 67 pancakes left.
As with many algorithms, intuition breaks down
in high dimensions, and although it may seem that
Figure 8.1: Example of a chain-of-thought to im-
this procedure would be easily trapped in a local
prove the response of the Llama-3-8B base model. In
minimum, in reality, due to the number of parame-
the two examples, the beginning of the text in normal
ters, the design of the models, and the stochasticity
font is the prompt, and the generated part is indicated
of the data, its efficiency is far greater than one
in bold. The generation without chain-of-thought
might expect.
(top) leads to an incorrect answer, while the gener-
Plenty of variations of this standard strategy have ation with it (bottom) generates a correct answer,
been proposed. The most popular one is Adam by explicitly producing multiple simple arithmetic
[Kingma and Ba, 2014], which keeps running esti- operations.
mates of the mean and variance of each component
of the gradient, and normalizes them automati-
40 141
8.1 Prompt Engineering cally, avoiding scaling issues and different training
speeds in different parts of a model.
The simplest strategy to specialize or improve a
Large Language Model with a limited computa- 3.4 Backpropagation
tional budget is to use prompt engineering, that
is, to carefully craft the beginning of the text se- Using gradient descent requires a technical means
quence to bias the autoregressive process [Sahoo to compute ∇𝓁 |w (w) where 𝓁 = L(f (x; w); y).
et al., 2024]. This approach moves a part of the Given that f and L are both compositions of stan-
information traditionally encoded in the model’s dard tensor operations, as for any mathematical
parameters to the input. expression, the chain rule from differential calcu-
lus allows us to get an expression of it.
We saw in § 7.1 a simple example of few-shot pre-
diction, to use an LLM for a text classification For the sake of making notation lighter, we will
task without fine-tuning. A long and sophisticated not specify at which point gradients are computed,
prompt allows generalizing this strategy to com- since the context makes it clear.
plex tasks.
The context size of a language model, that is, the The output of f (x; w) can be computed by starting
number of tokens it can operate on, directly mod- with x(0) = x and applying iteratively:
ulates the quantity of information that can be pro- (d)
x =f (d)
x (d−1)
; wd ,
vided in the prompt. This is mostly constrained
by the computational cost of standard attention with x(D) as the final value.
models, which is quadratic with the context size
(see § 4.8). The individual scalar values of these intermediate
results x(d) are traditionally called activations in
140 41
f (d) ( · ; wd )
x(d−1) x(d)
×Jf (d) |x
∇𝓁 |x(d−1) ∇𝓁 |x(d)
×Jf (d) |w Chapter 8
∇𝓁 |wd The Compute Schism
Figure 3.3: Given a model f = f (D) ◦ · · · ◦ f (1) , the
forward pass computes the outputs x(d) of the f (d) in
order (top, black). The backward pass computes the
gradients of the loss with respect to the activations The scale of deep architectures is critical to their
x(d) (bottom, blue) and the parameters wd (bottom, performance and, as we saw in § 3.7,
red) backward by multiplying them by the Jacobians. Large Language Models
in particular may require amounts
of memory and computation that greatly exceed
those of consumer hardware.
reference to neuron activations, the value D is the
depth of the model, the individual mappings f (d) While training such a model from scratch requires
are referred to as layers, as we will see in § 4.1, and resources available only to large corporations or
their sequential evaluation is the forward pass (see public bodies, techniques have been developed to
Figure 3.3, top). allow inference and adaptation to specific tasks
under strong resource constraints. Allowing to
run models locally instead of through a provider
Conversely, the gradient ∇𝓁 |x(d−1) of the loss with
respect to the output x(d−1) of f (d−1) is the prod- may be highly desirable for cost or confidentiality
uct of the gradient ∇𝓁 |x(d) with respect to the out- reasons.
put of f (d) multiplied by the Jacobian Jf (d−1) |x of
f (d−1) with respect to its variable x. Thus, the gra-
dients with respect to the outputs of all the f (d) s
can be computed recursively backward, starting
42 139
with ∇𝓁 |x(D) = ∇L|x .
43
Resource usage where σt is defined analytically.
Regarding the computational cost, as we will see, In practice, such a model initially hallucinates
the bulk of the computation goes into linear oper- structures by pure luck in the random noise, and
ations, each requiring one matrix product for the then gradually builds more elements that emerge
forward pass and two for the products by the Ja- from the noise by reinforcing the most likely con-
cobians for the backward pass, making the latter tinuation of the image obtained thus far.
roughly twice as costly as the former.
This approach can be extended to text-conditioned
The memory requirement during inference is synthesis, to generate images that match a descrip-
roughly equal to that of the most demanding indi- tion. For instance, Nichol et al. [2021] add to the
vidual layer. For training, however, the backward mean of the denoising distribution of Equation 7.1
pass requires keeping the activations computed a bias that goes in the direction of increasing the
during the forward pass to compute the Jacobians, CLIP matching score (see § 6.6) between the pro-
which results in a memory usage that grows pro- duced image and the conditioning text description.
portionally to the model’s depth. Techniques exist
to trade the memory usage for computation by
either relying on reversible layers [Gomez et al.,
2017], or using checkpointing, which consists of
storing activations for some layers only and recom-
puting the others on the fly with partial forward
passes during the backward pass [Chen et al., 2016].
Vanishing gradient
A key historical issue when training a large net-
work is that when the gradient propagates back-
wards through an operator, it may be scaled by a
multiplicative factor, and consequently decrease
or increase exponentially when it traverses many
44 137
setup should degrade the signal so much that the layers. A standard method to prevent it from ex-
distribution p(xT ) has a known analytical form ploding is gradient norm clipping, which consists
which can be sampled. of re-scaling the gradient to set its norm to a fixed
threshold if it is above it [Pascanu et al., 2013].
For instance, Ho et al. [2020] normalize the data
to have a mean of 0 and a variance of 1, and their When the gradient decreases exponentially, this is
diffusion process consists of adding a bit of white called the vanishing gradient, and it may make the
noise and re-normalizing the variance to 1. This training impossible, or, in its milder form, cause
process exponentially reduces the importance of different parts of the model to be updated at differ-
x0 , and xt ’s density can rapidly be approximated ent speeds, degrading their co-adaptation [Glorot
with a normal. and Bengio, 2010].
The denoiser f is a deep architecture that As we will see in Chapter 4, multiple techniques
should model and allow sampling from have been developed to prevent this from happen-
f (xt−1 , xt , t; w) ≃ p(xt−1 | xt ). It can be shown, ing, reflecting a change in perspective that was
thanks to a variational bound, that if this one-step crucial to the success of deep-learning: instead of
reverse process is accurate enough, sampling trying to improve generic optimization methods,
xT ∼ p(xT ) and denoising T steps with f results the effort shifted to engineering the models them-
in x0 that follows p(x0 ). selves to make them optimizable.
in each, and maximizing As the term “deep learning” indicates, useful mod-
X els are generally compositions of long series of
(n) (n)
log f xtn −1 , xtn , tn ; w .
mappings. Training them with gradient descent
n
results in a sophisticated co-adaptation of the map-
pings, even though this procedure is gradual and
Given their diffusion process, Ho et al. [2020] have local.
a denoising of the form:
We can illustrate this behavior with a simple model
xt−1 | xt ∼ 𝒩 (xt + f (xt , t; w); σt ), (7.1)
136 45
R2 → R2 that combines eight layers, each multiplying its input by a 2×2 matrix and applying Tanh per component, with a final linear classifier. This is a simplified version of the standard Multi-Layer Perceptron that we will see in § 5.1.

If we train this model with SGD and cross-entropy on a toy binary classification task (Figure 3.4, top left), the matrices co-adapt to deform the space until the classification is correct, which implies that the data have been made linearly separable before the final affine operation (Figure 3.4, bottom right).
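A minimal PyTorch sketch of this toy model (the training data below is a random placeholder, not the dataset of Figure 3.4):

import torch
from torch import nn

# Eight 2x2 linear maps with Tanh, followed by a linear classifier with two logits.
layers = []
for _ in range(8):
    layers += [nn.Linear(2, 2, bias=False), nn.Tanh()]
model = nn.Sequential(*layers, nn.Linear(2, 2))

optimizer = torch.optim.SGD(model.parameters(), lr=1e-1)
criterion = nn.CrossEntropyLoss()

x, y = torch.randn(1000, 2), torch.randint(0, 2, (1000,))   # toy 2D points and binary labels
for _ in range(100):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()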
Such an example gives a glimpse of what a deep model can achieve; however, it is partially misleading due to the low dimension of both the signal to process and the internal representations. Everything is kept in 2D here for the sake of visualization, while real models take advantage of representations in high dimensions, which, in particular, facilitates the optimization by providing many degrees of freedom.

Empirical evidence accumulated over twenty years demonstrates that state-of-the-art performance across application domains necessitates models with tens of layers, such as residual networks (see § 5.2) or Transformers (see § 5.3).

Theoretical results show that, for a fixed computa-

Figure 7.2: Image synthesis with denoising diffusion [Ho et al., 2020]. Each sample starts as a white noise xT (top), and is gradually de-noised by sampling iteratively xt−1 | xt ∼ 𝒩(xt + f(xt, t; w), σt).
be used as-is to fine-tune the language model, and the latter can be used to train a reward network that predicts the rating and use it as a target to fine-tune the language model with a standard Reinforcement Learning approach.
This results in particular in the ability to solve few-shot prediction, where only a handful of training examples are available, as illustrated in Figure 7.1. More surprisingly, when given a carefully crafted prompt, it can exhibit abilities for question answering, problem solving, and chain-of-thought that appear eerily close to high-level reasoning [Chowdhery et al., 2022; Bubeck et al., 2023].

Due to these remarkable capabilities, these models are sometimes called foundation models [Bommasani et al., 2021].

However, even though it integrates a very large body of knowledge, such a model may be inadequate for practical applications, in particular when interacting with human users. In many situations, one needs responses that follow the statistics of a helpful dialog with an assistant. This differs from the statistics of available large training sets, which combine novels, encyclopedias, forum messages, and blog posts.

This discrepancy is addressed by fine-tuning such a language model (see § 3.6). The current dominant strategy is Reinforcement Learning from Human Feedback (RLHF) [Ouyang et al., 2022], which consists of creating small labeled training sets by asking users to either write responses or provide ratings of generated responses. The former can

tional budget or number of parameters, increasing the depth leads to a greater complexity of the resulting mapping [Telgarsky, 2016].

3.6 Training protocols

Training a deep network requires defining a protocol to make the most of computation and data, and to ensure that performance will be good on new data.

As we saw in § 1.3, the performance on the training samples may be misleading, so in the simplest setup one needs at least two sets of samples: one is a training set, used to optimize the model parameters, and the other is a test set, to evaluate the performance of the trained model.

Additionally, there are usually hyper-parameters to adapt, in particular those related to the model architecture, the learning rate, and the regularization terms in the loss. In that case, one needs a validation set that is disjoint from both the training and test sets to assess the best configuration.

The full training is usually decomposed into epochs, each of which corresponds to going through all the training examples once. The usual dynamic of the losses is that the training loss decreases as long as the optimization runs, while the
I: I love apples, O: positive, I: music is my passion, O: positive, I: my job is boring, O: negative, I: frozen pizzas are awesome, O: positive,

I: I love apples, O: positive, I: music is my passion, O: positive, I: my job is boring, O: negative, I: frozen pizzas taste like cardboard, O: negative,

I: water boils at 100 degrees, O: physics, I: the square root of two is irrational, O: mathematics, I: the set of prime numbers is infinite, O: mathematics, I: gravity is proportional to the mass, O: physics,

I: water boils at 100 degrees, O: physics, I: the square root of two is irrational, O: mathematics, I: the set of prime numbers is infinite, O: mathematics, I: squares are rectangles, O: mathematics,

Figure 7.1: Examples of few-shot prediction with a 120 million parameter GPT model from Hugging Face. In each example, the beginning of the sentence was given as a prompt, and the model generated the part in bold.

Figure 3.5: As training progresses, a model's performance is usually monitored through losses. The training loss is the one driving the optimization process and goes down, while the validation loss is estimated on another set of examples to assess the overfitting of the model. Overfitting appears when the model starts to take into account random structures specific to the training set at hand, resulting in the validation loss starting to increase.

When such a model is trained on a very large dataset, it results in a Large Language Model (LLM), which exhibits extremely powerful properties. Besides the syntactic and grammatical structure of the language, it has to integrate very diverse knowledge, e.g. to predict the word following "The capital of Japan is", "if water is heated to 100 Celsius degrees it turns into", or "because her puppy was sick, Jane was".
validation loss may reach a minimum after a certain number of epochs and then start to increase, reflecting an overfitting regime, as introduced in § 1.3 and illustrated in Figure 3.5.

Paradoxically, although they should suffer from severe overfitting due to their capacity, large models usually continue to improve as training progresses. This may be due to the inductive bias of the model becoming the main driver of optimization when performance is near perfect on the training set [Belkin et al., 2018].

An important design choice is the learning rate schedule during training, that is, the specification of the value of the learning rate at each iteration of the gradient descent. The general policy is that the learning rate should be initially large to avoid having the optimization trapped in a bad local minimum early, and that it should get smaller so that the optimized parameter values do not bounce around and reach a good minimum in a narrow valley of the loss landscape.

The training of very large models may take months on thousands of powerful GPUs and have a financial cost of several million dollars. At this scale, the training may involve many manual interventions, informed, in particular, by the dynamics of the loss evolution.

Chapter 7

Synthesis

A second category of applications, distinct from prediction, is synthesis. It consists of fitting a density model to training samples and providing means to sample from this model.

7.1 Text generation

The standard approach to text synthesis is to use an attention-based, autoregressive model. A very successful model proposed by Radford et al. [2018] is the GPT, which we described in § 5.3.

This architecture has been used for very large models, such as OpenAI's 175-billion-parameter GPT-3 [Brown et al., 2020]. It is composed of 96 self-attention blocks, each with 96 heads, and processes tokens of dimension 12,288, with a hidden dimension of 49,152 in the MLPs of the attention blocks.
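To make the autoregressive generation procedure concrete, here is a minimal sampling loop (a sketch, not the GPT implementation); it assumes a hypothetical model that maps a sequence of token indices to next-token logits:

import torch

def sample_autoregressive(model, prompt, n_tokens, temperature=1.0):
    # prompt: (1, T) tensor of token indices; model returns logits of shape (1, T, V).
    tokens = prompt
    for _ in range(n_tokens):
        logits = model(tokens)[:, -1] / temperature   # logits for the next token
        probs = torch.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        tokens = torch.cat([tokens, next_token], dim=1)
    return tokens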
Fine-tuning

It is often beneficial to adapt an already trained model to a new task, referred to as a downstream task.
3.7 The benefits of scale

There is an accumulation of empirical results showing that performance, for instance, estimated through the loss on test data, improves with the amount of data according to remarkable scaling laws, as long as the model size increases correspondingly [Kaplan et al., 2020] (see Figure 3.6).

Benefiting from these scaling laws in the multi-billion sample regime is possible in part thanks to the structure of deep models, which can be scaled up arbitrarily, as we will see, by increasing the number of layers or feature dimensions. But it is also made possible by the distributed nature of the computation they implement, and by stochastic gradient descent, which requires only a fraction of the data at a time and can operate with datasets whose size is orders of magnitude greater than that of the computing device's memory. This has resulted in an exponential growth of the models, as illustrated in Figure 3.7.

Typical vision models have 10–100 million trainable parameters and require 10^18–10^19 FLOPs for training [He et al., 2015; Sevilla et al., 2022]. Language models have from 100 million to hundreds of billions of trainable parameters and require 10^20–10^23 FLOPs for training [Devlin et al., 2018; Brown et al., 2020; Chowdhery et al., 2022; Sevilla et al.,

Figure 6.5: This graph shows the evolution of the state value V(St) = max_a Q(St, a) during a game of Breakout. The spikes at time points (1) and (2) correspond to clearing a brick, at time point (3) it is about to break through to the top line, and at (4) it does, which ensures a high future reward [Mnih et al., 2015].
[Figure 3.6: plots of the test loss, in particular as a function of the dataset size in tokens.]

mizing

ℒ(w) = (1/N) ∑_{n=1}^{N} (Q(sn, an; w) − yn)²,    (6.2)

sary since the target value in Equation 6.1 is the expectation of yn, while it is yn itself which is used in Equation 6.2. Fixing w in yn results in a better approximation of the desirable gradient.

A key issue is the policy used to collect episodes. Mnih et al. [2015] simply use the ϵ-greedy strategy, which consists of taking an action completely at random with probability ϵ, and the optimal action argmax_a Q(s, a) otherwise. Injecting a bit of randomness is necessary to favor exploration.
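A minimal sketch of the ϵ-greedy policy described above, assuming a hypothetical q_model that maps a state tensor to one estimated value per action:

import torch

def epsilon_greedy(q_model, state, epsilon, nb_actions):
    # With probability epsilon take a uniformly random action,
    # otherwise the action with the highest estimated value Q(s, a).
    if torch.rand(1).item() < epsilon:
        return torch.randint(nb_actions, (1,)).item()
    with torch.no_grad():
        q_values = q_model(state.unsqueeze(0))  # shape (1, nb_actions)
    return q_values.argmax(dim=1).item()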
This is the standard setup of Reinforcement Learning (RL), and it can be worked out by introducing the optimal state-action value function Q(s, a), which is the expected return if we execute action a in state s and then follow the optimal policy. It provides a means to compute the optimal policy as π(s) = argmax_a Q(s, a), and, thanks to the Markovian assumption, it verifies the Bellman equation:

Q(s, a) = E[ Rt + γ max_{a′} Q(St+1, a′) | St = s, At = a ],    (6.1)

from which we can design a procedure to train a parametric model Q(·, ·; w).

To apply this framework to play classical Atari video games, Mnih et al. [2015] use for St the concatenation of the frame at time t and the three that precede, so that the Markovian assumption is reasonable, and use for Q a model dubbed the Deep Q-Network (DQN), composed of two convolutional layers and one fully connected layer with one output value per action, following the classical structure of a LeNet (see § 5.2).

Training is achieved by alternatively playing and recording episodes, and building mini-batches of tuples (sn, an, rn, s′n) ∼ (St, At, Rt, St+1) taken across stored episodes and time steps, and mini-

2022]. These latter models require machines with multiple high-end GPUs.

Training these large models is impossible using datasets with a detailed ground truth, which are costly to produce and can only be of moderate size. Instead, it is done with datasets automatically produced by combining data available on the internet with minimal curation, if any. These sets may combine multiple modalities, such as text and images from web pages, or sound and images from videos, which can be used for large-scale supervised training.

As of 2024, the most powerful models are the so-called Large Language Models (LLMs), which we will see in § 5.3 and § 7.1, trained on extremely large text datasets (see Table 3.1).

Table 3.1: Some examples of publicly available datasets. The equivalent number of books is an indicative estimate for 250 pages of 2000 characters per book.

Dataset        Year  Nb. of images  Size
ImageNet       2012  1.2M           150Gb
Cityscape      2016  25K            60Gb
LAION-5B       2022  5.8B           240Tb

Dataset        Year  Nb. of books   Size
WMT-18-de-en   2018  14M            8Gb
The Pile       2020  1.6B           825Gb
OSCAR          2020  12B            6Tb
[Figure 3.7: growth of training compute across landmark models (Transformer, BERT, AlphaZero, Whisper, GPT-3, LaMDA, PaLM), on a scale of roughly 10^21 to 10^24 FLOPs, i.e. on the order of 1 GWh.]

Additionally, since the textual descriptions are often detailed, such a model has to capture a richer representation of images and pick up cues beyond what is necessary, for instance, for classification.

This translates to excellent performance on challenging datasets such as ImageNet Adversarial [Hendrycks et al., 2019], which was specifically de-

6.7 Reinforcement learning
Figure 6.4: The CLIP text-image embedding [Rad-
ford et al., 2021] allows for zero-shot prediction by
predicting which class description embedding is the
most consistent with the image embedding.
such as background music or ambient noise.
This approach allows leveraging extremely large
datasets that combine multiple types of sound
sources with diverse ground truths.
It is noteworthy that even though the ultimate
goal of this approach is to produce a translation
as deterministic as possible given the input signal,
it is formally the sampling of a text distribution
conditioned on a sound sample, hence a synthesis
process. The decoder is, in fact, extremely similar
to the generative model of § 7.1.
6.6 Text-image representations
A powerful approach to image understanding con-
sists of learning consistent image and text represen-
tations, such that an image, or a textual description
of it, would be mapped to the same feature vector.
The Contrastive Language-Image Pre-training
(CLIP) proposed by Radford et al. [2021] combines
an image encoder f , which is a ViT, and a text
encoder g, which is a GPT. See § 5.3 for both.
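A sketch of zero-shot prediction with such a pair of encoders (f and g are hypothetical callables here, not the actual CLIP API): each class is described by a sentence, and the predicted class is the one whose text embedding is most similar to the image embedding, as in Figure 6.4.

import torch

def zero_shot_classify(f, g, image, class_descriptions):
    # f: image encoder, g: text encoder, both mapping into the same embedding space.
    image_embedding = torch.nn.functional.normalize(f(image.unsqueeze(0)), dim=-1)
    text_embeddings = torch.nn.functional.normalize(
        torch.stack([g(desc) for desc in class_descriptions]), dim=-1
    )
    scores = image_embedding @ text_embeddings.T   # cosine similarities
    return scores.argmax(dim=-1).item()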
To repurpose a GPT as a text encoder, instead of a
standard autoregressive model, they add an “end
of sentence” token to the input sequence, and use
the representation of this token in the last layer as
the embedding. Its dimension is between 512 and 1024, depending on the configuration.
requires operating at multiple scales. This is necessary so that any object, or sufficiently informative sub-part, regardless of its size, is captured somewhere in the model by the feature representation at a single tensor position. Hence, standard architectures for this task downscale the image with a series of convolutional layers to increase the receptive field of the activations, and re-upscale it with a series of transposed convolutional layers, or other upscaling methods such as bilinear interpolation, to make the prediction at high resolution.

However, a strict downscaling-upscaling architecture does not allow for operating at a fine grain when making the final prediction, since all the signal has been transmitted through a low-resolution representation at some point. Models that apply such downscaling-upscaling serially mitigate these issues with skip connections from layers at a certain resolution, before downscaling, to layers at the same resolution, after upscaling [Long et al., 2014; Ronneberger et al., 2015]. Models that do it in parallel, after a convolutional backbone, concatenate the resulting multi-scale representation after upscaling, before making the final per-pixel prediction [Zhao et al., 2016].

Training is achieved with a standard cross-entropy summed over all the pixels. As for object detection, training can start from a network pre-trained on a large-scale image classification dataset to compensate for the limited availability of segmentation ground truth.

4.1 The notion of layer

We call layers standard complex compounded tensor operations that have been designed and empirically identified as being generic and efficient. They often incorporate trainable parameters and correspond to a convenient level of granularity for designing and describing large deep models. The term is inherited from simple multi-layer neural networks, even though modern models may take the form of a complex graph of such modules, incorporating multiple parallel pathways.

[Diagram: an example model depiction, with an input X of size 32×32, a block f replicated ×K, a block g with hyper-parameter n=4, and an output Y of size 4×4.]

In the following pages, I try to stick to the convention for model depiction illustrated above:

• operators / layers are depicted as boxes,

• darker coloring indicates that they embed trainable parameters,

• non-default valued hyper-parameters are added in blue on their right,
• a dashed outer frame with a multiplicative factor
indicates that a group of layers is replicated in se-
ries, each with its own set of trainable parameters,
if any, and
While a standard residual network, for instance, can generate a dense output of the same resolution as its input, as for object detection, this task

The standard approach to solve this task, for instance by the Single Shot Detector (SSD) [Liu et al., 2015], is to use a convolutional neural network that produces a sequence of image representations Zs of size Ds × Hs × Ws, s = 1, . . . , S, with decreasing spatial resolution Hs × Ws down to 1 × 1 for s = S (see Figure 6.1). Each of these tensors covers the input image in full, so the h, w indices correspond to a partitioning of the image lattice into regular squares that gets coarser when s increases.

As seen in § 4.2, and illustrated in Figure 4.4, due to the succession of convolutional layers, a feature vector (Zs[0, h, w], . . . , Zs[Ds − 1, h, w]) is a descriptor of an area of the image, called its receptive field, that is larger than this square but centered on it. This results in a non-ambiguous matching of any bounding box (x1, x2, y1, y2) to a s, h, w, determined respectively by max(x2 − x1, y2 − y1), (y1 + y2)/2, and (x1 + x2)/2.

Detection is achieved by adding S convolutional layers, each processing a Zs and computing, for every tensor index h, w, the coordinates of a bounding box and the associated logits. If there are C object classes, there are C + 1 logits, the additional one standing for "no object." Hence, each additional convolution layer has 4 + C + 1 output channels. The SSD algorithm in particular generates several bounding boxes per s, h, w, each dedicated to a hard-coded range of aspect ratios.

Training sets for object detection are costly to create, since the labeling with bounding boxes requires a slow human intervention. To mitigate this issue, the standard approach is to fine-tune a convolutional model that has been pre-trained on a large classification dataset such as VGG-16 for the original SSD, and to replace its final fully-connected layers with additional convolutional ones. Surprisingly, models trained for classification only learn feature representations that can be repurposed for object detection, even though that task involves the regression of geometric quantities.

During training, every ground-truth bounding box is associated with its s, h, w, and induces a loss term composed of a cross-entropy loss for the logits, and a regression loss such as MSE for the bounding box coordinates. Every other s, h, w free of bounding-box match induces a cross-entropy only penalty to predict the class "no object".

6.4 Semantic segmentation

The finest-grain prediction task for image understanding is semantic segmentation, which consists of predicting, for each pixel, the class of the object

4.2 Linear layers

The most basic linear layer is the fully connected layer, parameterized by a trainable weight matrix W of size D′ × D and bias vector b of dimension D′. It implements an affine transformation generalized to arbitrary tensor shapes, where the supplementary dimensions are interpreted as vector indexes. Formally, given an input X of dimension D1 × · · · × DK × D, it computes an output Y of dimension D1 × · · · × DK × D′ with

∀d1, . . . , dK, Y[d1, . . . , dK] = W X[d1, . . . , dK] + b.

While at first sight such an affine operation seems limited to geometric transformations such as rotations, symmetries, and translations, it can in fact do more than that. In particular, projections for dimension reduction or signal filtering, but also, from the perspective of the dot product being a measure of similarity, a matrix-vector product can be interpreted as computing matching scores between the queries, as encoded by the input vectors, and keys, as encoded by the matrix rows.

As we saw in § 3.3, the gradient descent starts with the parameters' random initialization. If this is done too naively, as seen in § 3.4, the network may suffer from exploding or vanishing activations and gradients [Glorot and Bengio, 2010]. Deep learning frameworks implement initialization methods that in particular scale the random parameters according to the dimension of the input to keep the variance of the activations constant and prevent pathological behaviors.

Convolutional layers

A linear layer can take as input an arbitrarily-shaped tensor by reshaping it into a vector, as long as it has the correct number of coefficients. However, such a layer is poorly adapted to dealing with large tensors, since the number of parameters and number of operations are proportional to the product of the input and output dimensions. For instance, to process an RGB image of size 256 × 256 as input and compute a result of the same size, it would require approximately 4 × 10^10 parameters and multiplications.

Besides these practical issues, most of the high-dimension signals are strongly structured. For instance, images exhibit short-term correlations and statistical stationarity with respect to translation, scaling, and certain symmetries. This is not reflected in the inductive bias of a fully connected layer, which completely ignores the signal structure.

To leverage these regularities, the tool of choice is convolutional layers, which are also affine, but process time-series or 2D signals locally, with the same operator everywhere.
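As a small illustration (a sketch, not taken from the text), a 2D convolutional layer in PyTorch applies the same bank of filters at every spatial location of a D × H × W input:

import torch
from torch import nn

# A 2D convolution with D=3 input channels, D'=16 output channels, and a 5x5 kernel.
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=5)

x = torch.randn(1, 3, 64, 64)   # a batch with one 3-channel 64x64 image
y = conv(x)
print(y.shape)                  # torch.Size([1, 16, 60, 60]) since 64 - 5 + 1 = 60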
Figure 4.1: A 1D convolution (left) takes as input a D × T tensor X, applies the same affine mapping ϕ(·; w) to every sub-tensor of shape D × K, and stores the resulting D′ × 1 tensors into Y. A 1D transposed convolution (right) takes as input a D × T tensor, applies the same affine mapping ψ(·; w) to every sub-tensor of shape D × 1, and sums the shifted resulting D′ × K tensors. Both can process inputs of different sizes.

Figure 6.2: Examples of object detection with the Single-Shot Detector [Liu et al., 2015].
[Figure 4.2: a 2D convolution (left) and a 2D transposed convolution (right), applying the same mappings ϕ and ψ at every location of a D × H × W tensor X to compute Y.]

[Figure 6.1: a convolutional network computes from the input image X a sequence of representations Z1, Z2, . . . , ZS−1, ZS of decreasing spatial resolution.]
timate of the original signal X. For images, it is a convolutional network that may integrate skip-connections, in particular to combine representations at the same resolution obtained early and late in the model, as well as attention layers to facilitate taking into account elements that are far away from each other.

Such a model is trained by collecting a large number of clean samples paired with their degraded inputs. The latter can be captured in degraded conditions, such as low-light or inadequate focus, or generated algorithmically, for instance, by converting the clean sample to grayscale, reducing its size, or aggressively compressing it with a lossy compression method.

The standard training procedure for denoising autoencoders uses the MSE loss summed across all pixels, in which case the model aims at computing the best average clean picture, given the degraded one, that is E[X | X̃]. This quantity may be problematic when X is not completely determined by X̃, in which case some parts of the generated signal may be an unrealistic, blurry average.

6.2 Image classification

Image classification is the simplest strategy for extracting semantics from an image and consists of predicting a class from a finite, predefined number of classes, given an input image.

The standard models for this task are convolutional networks, such as ResNets (see § 5.2), and attention-based models such as ViT (see § 5.3). These models generate a vector of logits with as many dimensions as there are classes.

The training procedure simply minimizes the cross-entropy loss (see § 3.1). Usually, performance can be improved with data augmentation, which consists of modifying the training samples with hand-designed random transformations that do not change the semantic content of the image, such as cropping, scaling, mirroring, or color changes.

6.3 Object detection

A more complex task for image understanding is object detection, in which the objective is, given an input image, to predict the classes and positions of objects of interest.

An object position is formalized as the four coordinates (x1, y1, x2, y2) of a rectangular bounding box, and the ground truth associated with each training image is a list of such bounding boxes, each labeled with the class of the object contained therein.

Figure 4.3: Beside its kernel size and number of input / output channels, a convolution admits three hyper-parameters: the stride s (left) modulates the step size when going through the input tensor, the padding p (top right) specifies how many zero entries are added around the input tensor before processing it, and the dilation d (bottom right) parameterizes the index count between coefficients of the filter.

of size D × K of X, storing the results in a tensor Y of size D′ × (T − K + 1), as pictured in Figure 4.1 (left).

A 2D convolution is similar but has a K × L kernel and takes as input a D × H × W tensor (see Figure 4.2, left).

Both operators have for trainable parameters those of ϕ that can be envisioned as D′ filters of size D × K or D × K × L respectively, and a bias vector of dimension D′.

Such a layer is equivariant to translation, meaning that if the input signal is translated, the output is similarly transformed. This property results in a desirable inductive bias when dealing with a signal whose distribution is invariant to translation.

They also admit three additional hyper-parameters, illustrated on Figure 4.3:

• The padding specifies how many zero coefficients should be added around the input tensor before processing it, particularly to maintain the tensor size when the kernel size is greater than one. Its default value is 0.

• The stride specifies the step size used when going through the input, allowing one to reduce the output size geometrically by using large steps. Its
default value is 1.

• The dilation specifies the index count between the filter coefficients of the local affine operator. Its default value is 1, and greater values correspond to inserting zeros between the coefficients, which increases the filter / kernel size while keeping the number of trainable parameters unchanged.

Figure 4.4: Given an activation in a series of convolution layers, here in red, its receptive field is the area in the input signal, in blue, that modulates its value. Each intermediate convolutional layer increases the width and height of that area by roughly those of the kernel.

Chapter 6

Prediction

A first category of applications, such as face recognition, sentiment analysis, object detection, or speech recognition, requires predicting an unknown value from an available signal.

6.1 Image denoising

A direct application of deep models to image processing is to recover from degradation by utilizing the redundancy in the statistical structure of images. The petals of a sunflower in a grayscale picture can be colored with high confidence, and the texture of a geometric shape such as a table on a low-light, grainy picture can be corrected by averaging it over a large area likely to be uniform.

A denoising autoencoder is a model that takes a degraded signal X̃ as input and computes an es-

Except for the number of channels, a convolution's output is usually smaller than its input. In the 1D case without padding nor dilation, if the input is of size T, the kernel of size K, and the stride is S, the output is of size T′ = (T − K)/S + 1.
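A quick check of this output-size formula (a sketch, not from the text), using arbitrary values of T, K, and S:

import torch
from torch import nn

# Output length of a 1D convolution without padding or dilation: T' = (T - K) / S + 1.
T, K, S = 100, 5, 3
conv = nn.Conv1d(in_channels=8, out_channels=16, kernel_size=K, stride=S)
x = torch.randn(1, 8, T)
print(conv(x).shape[-1], (T - K) // S + 1)   # both print 32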
stance, in the 1D case, applies an affine mapping ψ(·; w): R^(D×1) → R^(D′×K) to every D × 1 sub-tensor of the input, and sums the shifted D′ × K resulting tensors to compute its output. Such an operator increases the size of the signal and can be understood intuitively as a synthesis process (see Figure 4.1, right, and Figure 4.2, right).

A series of convolutional layers is the usual architecture for mapping a large-dimension signal, such as an image or a sound sample, to a low-dimension tensor. This can be used, for instance, to get class scores for classification or a compressed representation. Transposed convolution layers are used the opposite way to build a large-dimension signal from a compressed representation, either to assess that the compressed representation contains enough information to reconstruct the signal or for synthesis, as it is easier to learn a density model over a low-dimension representation. We will revisit this in § 5.2.

4.3 Activation functions

If a network were combining only linear components, it would itself be a linear operator, so it is essential to have non-linear operations. These are implemented in particular with activation functions, which are layers that transform each component of the input tensor individually through a

Part III

Applications
Vision Transformer
Transformers have been put to use for image classi-
fication with the Vision Transformer (ViT) model
[Dosovitskiy et al., 2020] (see Figure 5.9).
Figure 5.9: Vision Transformer model [Dosovitskiy et al., 2020].

learning relies on the gradient, it may seem problematic to have a mapping that is not differentiable at zero and constant on half the real line. However, the main property gradient descent requires is that the gradient is informative on average. Parameter initialization and data normalization make half of the activations positive when the training starts, ensuring that this is the case.

Before the generalization of ReLU, the standard activation function was the hyperbolic tangent (Tanh, see Figure 4.5, top left) which saturates exponentially fast on both the negative and positive sides, aggravating the vanishing gradient.

Other popular activation functions follow the same idea of keeping positive values unchanged and squashing the negative values. Leaky ReLU [Maas et al., 2013] applies a small positive multiplying factor to the negative values (see Figure 4.5, bottom left):

leakyrelu(x) = ax if x < 0, and x otherwise.

And GELU [Hendrycks and Gimpel, 2016] is defined using the cumulative distribution function of the Gaussian distribution, that is:

gelu(x) = x P(Z ≤ x),
where Z ∼ 𝒩(0, 1). It roughly behaves like a smooth ReLU (see Figure 4.5, bottom right).

The choice of an activation function, in particular among the variants of ReLU, is generally driven by empirical performance.

4.4 Pooling

A classical strategy to reduce the signal size is to use a pooling operation that combines multiple activations into one that ideally summarizes the information. The most standard operation of this class is the max pooling layer, which, similarly to convolution, can operate in 1D and 2D and is defined by a kernel size.

In its standard form, this layer computes the maximum activation per channel, over non-overlapping sub-tensors of spatial size equal to the kernel size. These values are stored in a result tensor with the same number of channels as the input, and whose spatial size is divided by the kernel size. As with the convolution, this operator has three hyper-parameters: padding, stride, and dilation, with the stride being equal to the kernel size by default. A smaller stride results in a larger resulting tensor, following the same formula as for convolutions (see § 4.2).

Figure 5.8: GPT model [Radford et al., 2018].

Generative Pre-trained Transformer

The Generative Pre-trained Transformer (GPT) [Radford et al., 2018, 2019], pictured in Figure 5.8, is a pure autoregressive model that consists of a succession of causal self-attention blocks, hence a causal version of the original Transformer encoder.

This class of models scales extremely well, up to hundreds of billions of trainable parameters [Brown et al., 2020]. We will come back to their use for text generation in § 7.1.
tom right of Figure 5.6, is similar except that it takes as input two sequences, one to compute the queries and one to compute the keys and values.

The encoder of the Transformer (see Figure 5.7, bottom), recodes the input sequence of discrete tokens X1, . . . , XT with an embedding layer (see § 4.9), and adds a positional encoding (see § 4.10), before processing it with several self-attention blocks to generate a refined representation Z1, . . . , ZT.

The decoder (see Figure 5.7, top), takes as input the sequence Y1, . . . , YS−1 of result tokens produced so far, similarly recodes them through an embedding layer, adds a positional encoding, and processes it through alternating causal self-attention blocks and cross-attention blocks to produce the logits predicting the next tokens. These cross-attention blocks compute their keys and values from the encoder's result representation Z1, . . . , ZT, which allows the resulting sequence to be a function of the original sequence X1, . . . , XT.

As we saw in § 3.2, being causal ensures that such a model can be trained by minimizing the cross-entropy summed across the full sequence.

Figure 4.6: A 1D max pooling takes as input a D × T tensor X, computes the max over non-overlapping 1 × L sub-tensors (in blue) and stores the resulting values (in red) in a D × (T/L) tensor Y.
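A minimal PyTorch sketch of the 1D max pooling of Figure 4.6, with an arbitrary kernel size:

import torch
from torch import nn

# 1D max pooling with kernel size L=4: each channel keeps the maximum
# over non-overlapping windows of 4 consecutive values.
pool = nn.MaxPool1d(kernel_size=4)

x = torch.randn(1, 8, 100)   # a D=8 by T=100 signal (with a batch dimension)
y = pool(x)
print(y.shape)               # torch.Size([1, 8, 25]) since T/L = 100/4 = 25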
Transformer

The original Transformer, pictured in Figure 5.7, was designed for sequence-to-sequence translation. It combines an encoder that processes the input sequence to get a refined representation, and an autoregressive decoder that generates each token of the result sequence, given the encoder's representation of the input sequence and the output tokens generated so far.

Like the residual convolutional networks of § 5.2, both the encoder and the decoder of the Transformer are sequences of compounded blocks built with residual connections.

• The feed-forward block, pictured at the top of Figure 5.6, is a one hidden layer MLP, preceded by a layer normalization. It can update representations at every position separately.

• The self-attention block, pictured on the bottom left of Figure 5.6, is a Multi-Head Attention layer (see § 4.8), that recombines information globally, allowing any position to collect information from any other positions, preceded by a layer normalization. This block can be made causal by using an adequate mask in the attention layer, as described in § 4.8.

• The cross-attention block, pictured on the bot-

Figure 5.7: Original encoder-decoder Transformer model for sequence-to-sequence translation [Vaswani et al., 2017].

The max operation can be intuitively interpreted as a logical disjunction, or, when it follows a series of convolutional layers that compute local scores for the presence of parts, as a way of encoding that at least one instance of a part is present. It loses precise location, making it invariant to local deformations.

A standard alternative is the average pooling layer that computes the average instead of the maximum over the sub-tensors. This is a linear operation, whereas max pooling is not.

4.5 Dropout

Some layers have been designed to explicitly facilitate training or improve the learned representations.

One of the main contributions of that sort was dropout [Srivastava et al., 2014]. Such a layer has no trainable parameters, but one hyper-parameter, p, and takes as input a tensor of arbitrary shape.

It is usually switched off during testing, in which case its output is equal to its input. When it is active, it has a probability p of setting to zero each activation of the input tensor independently, and it re-scales all the activations by a factor of 1/(1−p) to maintain the expected value unchanged (see Figure 4.7).

Figure 4.7: Dropout can process a tensor of arbitrary shape. During training (left), it sets activations at random to zero with probability p and applies a multiplying factor to keep the expected values unchanged. During test (right), it keeps all the activations unchanged.

The motivation behind dropout is to favor meaningful individual activation and discourage group representation. Since the probability that a group of k activations remains intact through a dropout layer is (1 − p)^k, joint representations become unreliable, making the training procedure avoid them. It can also be seen as a noise injection that makes the training more robust.
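A minimal sketch of the dropout computation described above (not the PyTorch layer itself), showing the 1/(1−p) re-scaling:

import torch

def dropout(x, p, train):
    # During training, zero each activation independently with probability p
    # and rescale by 1/(1-p) to keep the expected value unchanged.
    if not train:
        return x
    mask = (torch.rand_like(x) > p).float()
    return mask * x / (1 - p)

x = torch.ones(3, 4)
print(dropout(x, p=0.5, train=True))    # a mix of zeros and 2s
print(dropout(x, p=0.5, train=False))   # unchanged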
When dealing with images and 2D tensors, the short-term correlation of the signals and the resulting redundancy negate the effect of dropout, since activations set to zero can be inferred from their neighbors. Hence, dropout for 2D tensors sets entire channels to zero instead of individual activations (see Figure 4.8).

Figure 4.8: 2D signals such as images generally exhibit strong short-term correlation and individual activations can be inferred from their neighbors. This redundancy nullifies the effect of the standard unstructured dropout, so the usual dropout layer for 2D tensors drops entire channels instead of individual values.

Figure 5.6: Feed-forward block (top), self-attention block (bottom left) and cross-attention block (bottom right). These specific structures proposed by Radford et al. [2018] differ slightly from the original architecture of Vaswani et al. [2017], in particular by having the layer normalization first in the residual blocks.
requires a residual connection that changes the tensor shape. This is achieved with a 1×1 convolution with a stride of two (see Figure 5.4).

The overall structure of the ResNet-50 is presented in Figure 5.5. It starts with a 7 × 7 convolutional layer that converts the three-channel input image to a 64-channel image of half the size, followed by four sections of residual blocks. Surprisingly, in the first section, there is no downscaling, only an increase of the number of channels by a factor of 4. The output of the last residual block is 2048 × 7 × 7, which is converted to a vector of dimension 2048 by an average pooling of kernel size 7 × 7, and then processed through a fully-connected layer to get the final logits, here for 1000 classes.

5.3 Attention models

As stated in § 4.8, many applications, particularly from natural language processing, benefit greatly from models that include attention mechanisms. The architecture of choice for such tasks, which has been instrumental in recent advances in deep learning, is the Transformer proposed by Vaswani et al. [2017].

training and is inactive during inference, it can be used in certain setups as a randomization strategy, for instance, to estimate empirically confidence scores [Gal and Ghahramani, 2015].

4.6 Normalizing layers

An important class of operators to facilitate the training of deep architectures are the normalizing layers, which force the empirical mean and variance of groups of activations.

The main layer in that family is batch normalization [Ioffe and Szegedy, 2015], which is the only standard layer to process batches instead of individual samples. It is parameterized by a hyper-parameter D and two series of trainable scalar parameters β1, . . . , βD and γ1, . . . , γD.

Given a batch of B samples x1, . . . , xB of dimension D, it first computes for each of the D components an empirical mean m̂d and variance v̂d across the batch:

m̂d = (1/B) ∑_{b=1}^{B} xb,d

v̂d = (1/B) ∑_{b=1}^{B} (xb,d − m̂d)²,

from which it computes for every component xb,d
a normalized value zb,d, with empirical mean 0 and variance 1, and from it the final result value yb,d with mean βd and standard deviation γd:

∀b, zb,d = (xb,d − m̂d) / √(v̂d + ϵ)

yb,d = γd zb,d + βd.

Figure 5.5: Structure of the ResNet-50 [He et al., 2015].

Because this normalization is defined across a batch, it is done only during training. During testing, the layer transforms individual samples according to the m̂d s and v̂d s estimated with a moving average over the full training set, which boils down to a fixed affine transformation per component.

The motivation behind batch normalization was to avoid that a change in scaling in an early layer of the network during training impacts all the layers that follow, which then have to adapt their trainable parameters accordingly. Although the actual mode of action may be more complicated than this initial motivation, this layer considerably facilitates the training of deep models.

In the case of 2D tensors, to follow the principle of convolutional layers of processing all locations similarly, the normalization is done per-channel across all 2D positions, and β and γ remain vectors of dimension D so that the scaling/shift does not depend on the 2D position. Hence, if the tensor
to be processed is of shape B × D × H × W, the layer computes (m̂d, v̂d), for d = 1, . . . , D, from the corresponding B × H × W slice, normalizes it accordingly, and finally scales and shifts its components with the trainable parameters βd and γd.

So, given a B × D tensor, batch normalization normalizes it across b and scales/shifts it according to d, which can be implemented as a component-wise product by γ and a sum with β. Given a B × D × H × W tensor, it normalizes across b, h, w

[Figure 5.4: a downscaling residual block, with a 1×1 convolution of stride S followed by a batch normalization on the skip branch.]
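Coming back to batch normalization, a minimal sketch of its training-time computation for a B × D tensor (the moving-average statistics used at test time are omitted):

import torch

def batchnorm_train(x, beta, gamma, eps=1e-5):
    # x: (B, D) batch; beta, gamma: (D,) trainable parameters.
    m_hat = x.mean(dim=0)                        # empirical mean per component
    v_hat = x.var(dim=0, unbiased=False)         # empirical variance per component
    z = (x - m_hat) / torch.sqrt(v_hat + eps)    # mean 0, variance 1
    return gamma * z + beta                      # mean beta, standard deviation gamma

x = torch.randn(32, 10) * 3 + 1
y = batchnorm_train(x, beta=torch.zeros(10), gamma=torch.ones(10))
print(y.mean(dim=0), y.std(dim=0))               # approximately 0 and 1 per component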
Figure 4.10: Skip connections, highlighted in red on this figure, transport the signal unchanged across multiple layers. Some architectures (center) that downscale and re-upscale the representation size to operate at multiple scales, have skip connections to feed outputs from the early parts of the network to later layers operating at the same scales [Long et al., 2014; Ronneberger et al., 2015]. The residual connections (right) are a special type of skip connections that sum the original signal to the transformed one, and usually bypass at most a handful of layers [He et al., 2015].

Figure 5.3: A residual block.

easily extended to deep architectures and suffer from the vanishing gradient problem. The residual networks, or ResNets, proposed by He et al. [2015] explicitly address the issue of the vanishing gradient with residual connections (see § 4.7), which allow hundreds of layers. They have become standard architectures for computer vision applications, and exist in multiple versions depending on the number of layers. We are going to look in detail at the architecture of the ResNet-50 for classification.
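A minimal sketch of the residual-block pattern (a generic block, not the exact ResNet-50 bottleneck block of Figure 5.3): the input is added back to the output of a small convolutional branch, so the block only has to learn a correction.

import torch
from torch import nn

class ResBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.branch = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
        )
    def forward(self, x):
        # Residual connection: sum the unchanged input with the branch output.
        return torch.relu(x + self.branch(x))

x = torch.randn(1, 64, 56, 56)
print(ResBlock(64)(x).shape)   # torch.Size([1, 64, 56, 56])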
Figure 5.2: Example of a small LeNet-like network for classifying 28 × 28 grayscale images of handwritten digits [LeCun et al., 1998]. Its first half is convolutional, and alternates convolutional layers per se and max pooling layers, reducing the signal dimension from 28 × 28 scalars to 256. Its second half processes this 256-dimensional feature vector through a one hidden layer perceptron to compute 10 logit scores corresponding to the ten possible digits.

4.7 Skip connections

Another technique that mitigates the vanishing gradient and allows the training of deep architectures are skip connections [Long et al., 2014; Ronneberger et al., 2015]. They are not layers per se, but an architectural design in which outputs of some layers are transported as-is to other layers further in the model, bypassing processing in between. This unmodified signal can be concatenated or added to the input of the layer the connection branches into (see Figure 4.10). A particular type of skip connections are the residual connections, which combine the signal with a sum, and usually skip only a few layers (see Figure 4.10, right).

The most desirable property of this design is to ensure that, even in the case of gradient-killing processing at a certain stage, the gradient will still propagate through the skip connections. Residual connections, in particular, allow for the building of deep models with up to several hundred layers, and key models, such as the residual networks [He et al., 2015] in computer vision (see § 5.2), and the Transformers [Vaswani et al., 2017] in natural language processing (see § 5.3), are entirely composed of blocks of layers with residual connections.

Their role can also be to facilitate multi-scale reasoning in models that reduce the signal size before
re-expanding it, by connecting layers with compatible sizes, for instance for semantic segmentation (see § 6.4). In the case of residual connections, they may also facilitate learning by simplifying the task to finding a differential improvement instead of a full update.

4.8 Attention layers

In many applications, there is a need for an operation able to combine local information at locations far apart in a tensor. For instance, this could be distant details for coherent and realistic image synthesis, or words at different positions in a paragraph to make a grammatical or semantic decision in Natural Language Processing.

Fully connected layers cannot process large-dimension signals, nor signals of variable size, and convolutional layers are not able to propagate information quickly. Strategies that aggregate the results of convolutions, for instance, by averaging them over large spatial areas, suffer from mixing multiple signals into a limited number of dimensions.

Attention layers specifically address this problem by computing an attention score for each component of the resulting tensor to each component of the input tensor, without locality constraints,

Figure 5.1: This multi-layer perceptron takes as input a one-dimensional tensor of size 50, is composed of three fully connected layers with outputs of dimensions respectively 25, 10, and 2, the two first followed by ReLU layers.

universal approximation theorem [Cybenko, 1989], which states that, if the activation function σ is continuous and not polynomial, any continuous function f can be approximated arbitrarily well uniformly on a compact domain, which is bounded and contains its boundary, by a model of the form l2 ◦ σ ◦ l1 where l1 and l2 are affine. Such a model is a MLP with a single hidden layer, and this result implies that it can approximate anything of practical value. However, this approximation holds if the dimension of the first linear layer's output can be arbitrarily large.

In spite of their simplicity, MLPs remain an important tool when the dimension of the signal to be processed is not too large.

5.2 Convolutional networks

The standard architecture for processing images is a convolutional network, or convnet, that combines multiple convolutional layers, either to reduce the signal size before it can be processed by fully connected layers, or to output a 2D signal also of large size.

LeNet-like

The original LeNet model for image classification [LeCun et al., 1998] combines a series of 2D convolutional layers and max pooling layers that play the role of feature extractor, with a series of fully connected layers which act as a MLP and perform the classification per se (see Figure 5.2).

This architecture was the blueprint for many models that share its structure and are simply larger, such as AlexNet [Krizhevsky et al., 2012] or the VGG family [Simonyan and Zisserman, 2014].

Residual networks

Standard convolutional neural networks that follow the architecture of the LeNet family are not

and averaging the features across the full tensor accordingly [Vaswani et al., 2017].

Even though they are substantially more complicated than other layers, they have become a standard element in many recent models. They are, in particular, the key building block of Transformers, the dominant architecture for Large Language Models. See § 5.3 and § 7.1.

Figure 4.11: The attention operator can be interpreted as matching every query Qq with all the keys K1, . . . , KN^KV to get normalized attention scores Aq,1, . . . , Aq,N^KV (left, and Equation 4.1), and then averaging the values V1, . . . , VN^KV with these scores to compute the resulting Yq (right, and Equation 4.2).

Attention operator

Given

• a tensor Q of queries of size N^Q × D^QK,
• a tensor K of keys of size N^KV × D^QK, and

• a tensor V of values of size N^KV × D^V,

the attention operator computes a tensor

Y = att(Q, K, V)

of dimension N^Q × D^V. To do so, it first computes for every query index q and every key index k an attention score Aq,k as the softargmax of the dot products between the query Qq and the keys:

Aq,k = exp(Qq · Kk / √D^QK) / ∑l exp(Qq · Kl / √D^QK),    (4.1)

where the scaling factor 1/√D^QK keeps the range of values roughly unchanged even for large D^QK.

Then a retrieved value is computed for each query by averaging the values according to the attention scores (see Figure 4.11):

Yq = ∑k Aq,k Vk.    (4.2)

So if a query Qn matches one key Km far more than all the others, the corresponding attention score An,m will be close to one, and the retrieved value Yn will be the value Vm associated to that key. But, if it matches several keys equally, then Yn will be the average of the associated values.

Chapter 5

Architectures

The field of deep learning has developed over the years, for each application domain, multiple deep architectures that exhibit good trade-offs with respect to multiple criteria of interest: e.g. ease of training, accuracy of prediction, memory footprint, computational cost, scalability.

5.1 Multi-Layer Perceptrons

The simplest deep architecture is the Multi-Layer Perceptron (MLP), which takes the form of a succession of fully connected layers separated by activation functions. See an example in Figure 5.1. For historical reasons, in such a model, the number of hidden layers refers to the number of linear layers, excluding the last one.

A key theoretical result is the
[Figure 4.12: the attention operator extended with masking and dropout: the exponentiated dot products of Q and K are multiplied component-wise by a mask M, normalized by their sum over k, passed through a dropout layer to form A, and multiplied by V to produce Y.]
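A minimal sketch of the attention operator of Equations 4.1 and 4.2, with the optional masking discussed below (PyTorch's softmax implements what the text calls softargmax; the dimensions are arbitrary):

import torch

def attention(Q, K, V, mask=None):
    # Q: (NQ, DQK), K: (NKV, DQK), V: (NKV, DV)
    A = Q @ K.T / K.shape[1] ** 0.5          # scaled dot products (Equation 4.1, before normalization)
    if mask is not None:
        A = A.masked_fill(mask == 0, float("-inf"))
    A = torch.softmax(A, dim=1)              # normalized attention scores
    return A @ V                             # values averaged with the scores (Equation 4.2)

Q, K, V = torch.randn(5, 16), torch.randn(7, 16), torch.randn(7, 32)
print(attention(Q, K, V).shape)              # torch.Size([5, 32])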
This can be implemented as

att(Q, K, V) = softargmax(QK^T / √D^QK) V,

where the softargmax factor is the attention matrix A.

This operator is usually extended in two ways, as depicted in Figure 4.12. First, the attention matrix can be masked by multiplying it before the softargmax normalization by a Boolean matrix M. This allows, for instance, to make the operator causal by taking M full of 1s below the diagonal and zero above, preventing Yq from depending on keys and values of indices k greater than q. Second, the attention matrix is processed by a dropout layer (see § 4.5) before being multiplied by V, providing the usual benefits during training.

Since a dot product is computed for every query/key pair, the computational cost of the attention operator is quadratic with the sequence length. This happens to be problematic, as some of the applications of these methods require processing sequences of tens of thousands of tokens, or more. Multiple attempts have been made at reducing this cost, for instance by combining a dense attention to a local window with a long-range sparse attention [Beltagy et al., 2020], or linearizing the operator to benefit from the associativity of the matrix product and compute the key-value product before

feature vector that depends on the position in the tensor. This positional encoding can be learned as other layer parameters, or defined analytically.

For instance, in the original Transformer model, for a series of vectors of dimension D, Vaswani et al. [2017] add an encoding of the sequence index as a series of sines and cosines at various frequencies:

pos-enc[t, d] = sin(t / T^(d/D)) if d ∈ 2ℕ, and cos(t / T^((d−1)/D)) otherwise,

with T = 10^4.
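A minimal sketch of this sinusoidal positional encoding (not the original implementation; the exponent 2⌊d/2⌋/D below covers both the even and odd cases of the formula):

import torch

def pos_enc(t_max, D, T=1e4):
    # pos_enc[t, d] = sin(t / T^(d/D)) for even d, cos(t / T^((d-1)/D)) for odd d.
    t = torch.arange(t_max).float().unsqueeze(1)       # (t_max, 1)
    d = torch.arange(D).float().unsqueeze(0)           # (1, D)
    angles = t / T ** (2 * torch.div(d, 2, rounding_mode="floor") / D)
    return torch.where(d % 2 == 0, torch.sin(angles), torch.cos(angles))

print(pos_enc(10, 16).shape)   # torch.Size([10, 16])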
multiplying with the queries [Katharopoulos et al., 2020].

Multi-head Attention Layer
along the feature dimension and each individual element of the resulting sequence is multiplied by W^O to get the final result:

Y = (Y1 | · · · | YH) W^O.

As we will see in § 5.3 and in Figure 5.6, this layer is used to build two model sub-structures: self-attention blocks, in which the three input sequences X^Q, X^K, and X^V are the same, and cross-attention blocks, where X^K and X^V are the same.

It is noteworthy that the attention operator, and consequently the multi-head attention layer when there is no masking, is invariant to a permutation of the keys and values, and equivariant to a permutation of the queries, as it would permute the resulting tensor similarly.

Figure 4.13: The Multi-head Attention layer applies for each of its h = 1, . . . , H heads a parametrized linear transformation to individual elements of the input sequences X^Q, X^K, X^V to get sequences Q, K, V that are processed by the attention operator to compute Yh. These H sequences are concatenated along features, and individual elements are passed through one last linear operator to get the final result sequence Y.

4.9 Token embedding

In many situations, we need to convert discrete tokens into vectors. This can be done with an embedding layer, which consists of a lookup table that directly maps integers to vectors.

Such a layer is defined by two hyper-parameters: the number N of possible token values, and the dimension D of the output vectors, and one trainable N × D weight matrix M.

Given as input an integer tensor X of dimension D1 × · · · × DK and values in {0, . . . , N − 1}, such a layer returns a real-valued tensor Y of dimension D1 × · · · × DK × D with
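A minimal sketch of such an embedding layer in PyTorch, with arbitrary values of N and D:

import torch
from torch import nn

# A lookup table mapping each of N possible token values to a trainable D-dimensional vector.
N, D = 1000, 64
embed = nn.Embedding(N, D)

X = torch.randint(N, (2, 7))    # an integer tensor of token indices, shape (2, 7)
Y = embed(X)
print(Y.shape)                  # torch.Size([2, 7, 64]), i.e. D1 x ... x DK x D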