LBDL A5 Booklet
LBDL A5 Booklet
of
Deep Learning
François Fleuret
This book is licensed under the Creative Commons
BY-NC-SA 4.0 International License.
V1.2–May 19, 2024
179
The Little Book of Deep Learning
transposed convolution, 69, 121
underfitting, 18
universal approximation theorem, 96
unsupervised learning, 21
VAE, see variational, autoencoder
validation set, 48
value, 86
vanishing gradient, 45, 59
variational
autoencoder, 153
bound, 136
Vision Transformer, 110, 123
ViT, see Vision Transformer
François Fleuret is a professor of computer science vocabulary, 33
at the University of Geneva, Switzerland.
weight, 17
The cover illustration is a schematic of the Neocog- decay, 32
nitron by Fukushima [1980], a key ancestor of deep matrix, 61
neural networks.
zero-shot prediction, 124
177
scaling laws, 52
self-attention block, 91, 104, 106
self-supervised learning, 155
semantic segmentation, 84, 119
SGD, see stochastic gradient descent
Single Shot Detector, 118
skip connection, 83, 121, 152
Contents
softargmax, 30, 86
softmax, 30
speech recognition, 122
SSD, see Single Shot Detector Contents 7
stochastic gradient descent, 40, 46, 52
stride, 67, 73 List of figures 10
supervised learning, 21
Foreword 11
Tanh, see hyperbolic tangent
Task Arithmetic, 148
I Foundations 13
tensor, 25
tensor cores, 24 1 Machine Learning 15
Tensor Processing Unit, 24 1.1 Learning from data . . . . . . . . 16
test set, 48 1.2 Basis function regression . . . . . 17
text synthesis, 131 1.3 Under and overfitting . . . . . . . 18
token, 33 1.4 Categories of models . . . . . . . 20
tokenizer, 36, 122
TPU, see Tensor Processing Unit 2 Efficient Computation 23
trainable parameter, 16, 25, 52 2.1 GPUs, TPUs, and batches . . . . . 23
training, 16 2.2 Tensors . . . . . . . . . . . . . . . 25
training set, 16, 29, 48 3 Training 29
Transformer, 46, 83, 85, 93, 103, 105, 122 3.1 Losses . . . . . . . . . . . . . . . 29
transformer, 146
176 5
3.2 Autoregressive models . . . . . . 32 pre-trained model, see model, pre-trained
3.3 Gradient descent . . . . . . . . . 37 prompt, 132, 133
3.4 Backpropagation . . . . . . . . . 41 engineering, 140
3.5 The value of depth . . . . . . . . 45
quantization, 143
3.6 Training protocols . . . . . . . . 48
Quantization-Aware Training, 145
3.7 The benefits of scale . . . . . . . 52 query, 85
II Deep Models 57 RAG, see Retrieval-Augmented Generation
4 Model Components 59 random initialization, 62
4.1 The notion of layer . . . . . . . . 60 receptive field, 68, 69, 118
rectified linear unit, 71, 151
4.2 Linear layers . . . . . . . . . . . . 61
recurrent neural network, 151
4.3 Activation functions . . . . . . . 70
regression, 20
4.4 Pooling . . . . . . . . . . . . . . . 73 Reinforcement Learning, 127, 134
4.5 Dropout . . . . . . . . . . . . . . 75 Reinforcement Learning from Human Feedback,
4.6 Normalizing layers . . . . . . . . 78 133
4.7 Skip connections . . . . . . . . . 83 ReLU, see rectified linear unit
4.8 Attention layers . . . . . . . . . . 84 residual
4.9 Token embedding . . . . . . . . . 91 block, 102
4.10 Positional encoding . . . . . . . . 92 connection, 83, 99
network, 46, 83, 99
5 Architectures 95
ResNet-50, 99
5.1 Multi-Layer Perceptrons . . . . . 95
Retrieval-Augmented Generation, 142
5.2 Convolutional networks . . . . . 97
return, 126
5.3 Attention models . . . . . . . . . 103 reversible layer, see layer, reversible
RL, see Reinforcement Learning
III Applications 111 RLHF, see Reinforcement Learning from Human
6 Prediction 113 Feeback
6.1 Image denoising . . . . . . . . . . 113 RNN, see recurrent neural network
6 175
metric learning, 31 6.2 Image classification . . . . . . . . 114
MLP, see multi-layer perceptron, 146 6.3 Object detection . . . . . . . . . . 115
model, 16 6.4 Semantic segmentation . . . . . . 119
autoregressive, 33, 34, 131 6.5 Speech recognition . . . . . . . . 122
causal, 35, 88, 107 6.6 Text-image representations . . . . 123
parametric, 16 6.7 Reinforcement learning . . . . . . 126
pre-trained, 51, 119, 121
model merging, 148 7 Synthesis 131
multi-layer perceptron, 46, 95–97, 106 7.1 Text generation . . . . . . . . . . 131
7.2 Image generation . . . . . . . . . 134
Natural Language Processing, 84
NLP, see Natural Language Processing 8 The Compute Schism 139
non-linearity, 70 8.1 Prompt Engineering . . . . . . . 140
normalizing layer, see layer, normalizing 8.2 Quantization . . . . . . . . . . . . 143
8.3 Adapters . . . . . . . . . . . . . . 145
object detection, 115 8.4 Model merging . . . . . . . . . . 148
overfitting, 19, 50
174 7
convolutional, 63, 75, 84, 92, 97, 102, 118, 121,
122
embedding, 91, 107
fully connected, 61, 84, 92, 95, 97
hidden, 95
linear, 61
Multi-Head Attention, 89, 92, 106
normalizing, 78
reversible, 44
layer normalization, 81, 106
Leaky ReLU, 72
learning rate, 37, 50
learning rate schedule, 50
LeNet, 97, 98
linear layer, see layer, linear
LLM, see Large Language Model
local minimum, 37
logit, 30, 33
LoRA, see Low-Rank Adaptation
loss, 16
Low-Rank Adaptation, 146, 147
machine learning, 15, 19, 20
Markovian Decision Process, 126
Markovian property, 126
max pooling, 73, 97
MDP, see Markovian, Decision Process
mean squared error, 18, 29
memory requirement, 44
memory speed, 24
173
Generative Adversarial Networks, 153
Generative Pre-trained Transformer, 108, 123, 131,
154
generator, 153
GNN, see Graph Neural Network
GPT, see Generative Pre-trained Transformer
GPU, see Graphical Processing Unit
List of Figures
gradient descent, 37, 39, 41, 45
gradient norm clipping, 45
gradient step, 37
Graph Neural Network, 154 1.1 Kernel regression . . . . . . . . . . . 17
Graphical Processing Unit, 11, 23 1.2 Overfitting of kernel regression . . . 19
ground truth, 20
3.1 Causal autoregressive model . . . . . 35
hidden layer, see layer, hidden 3.2 Gradient descent . . . . . . . . . . . . 38
hidden state, 151 3.3 Backpropagation . . . . . . . . . . . . 42
hyper parameter, see parameter, hyper 3.4 Feature warping . . . . . . . . . . . . 47
hyperbolic tangent, 72 3.5 Training and validation losses . . . . 49
3.6 Scaling laws . . . . . . . . . . . . . . 53
image processing, 97 3.7 Model training costs . . . . . . . . . . 55
image synthesis, 84, 134
inductive bias, 19, 50, 63, 67, 92 4.1 1D convolution . . . . . . . . . . . . . 64
invariance, 75, 91, 92, 155 4.2 2D convolution . . . . . . . . . . . . . 65
4.3 Stride, padding, and dilation . . . . . 66
kernel size, 65, 73 4.4 Receptive field . . . . . . . . . . . . . 68
key, 86 4.5 Activation functions . . . . . . . . . . 71
4.6 Max pooling . . . . . . . . . . . . . . 74
Large Language Model, 51, 54, 85, 132, 139, 154 4.7 Dropout . . . . . . . . . . . . . . . . . 76
layer, 42, 60 4.8 Dropout 2D . . . . . . . . . . . . . . . 77
attention, 84 4.9 Batch normalization . . . . . . . . . . 79
172 9
4.10 Skip connections . . . . . . . . . . . . 82 data augmentation, 115
4.11 Attention operator interpretation . . 85 deep learning, 11, 15
4.12 Complete attention operator . . . . . 87 Deep Q-Network, 127
4.13 Multi-Head Attention layer . . . . . . 90 denoising autoencoder, see autoencoder, denoising
density modeling, 20
5.1 Multi-Layer Perceptron . . . . . . . . 96 depth, 42
5.2 LeNet-like convolutional model . . . 98 diffusion model, 134
5.3 Residual block . . . . . . . . . . . . . 99 dilation, 68, 73
5.4 Downscaling residual block . . . . . . 100 discriminator, 153
5.5 ResNet-50 . . . . . . . . . . . . . . . . 101 downscaling residual block, 102
5.6 Transformer components . . . . . . . 104 downstream task, 51
5.7 Transformer . . . . . . . . . . . . . . 105 DQN, see Deep Q-Network
5.8 GPT model . . . . . . . . . . . . . . . 108 dropout, 75, 88
5.9 ViT model . . . . . . . . . . . . . . . 109
embedding layer, see layer, embedding
6.1 Convolutional object detector . . . . 116 epoch, 48
6.2 Object detection with SSD . . . . . . 117 equivariance, 67, 91
6.3 Semantic segmentation with PSP . . . 120
6.4 CLIP zero-shot prediction . . . . . . . 125 feed-forward block, 104, 106
6.5 DQN state value evolution . . . . . . 129 few-shot prediction, 133
filter, 67
7.1 Few-shot prediction with a GPT . . . 132 fine-tune, 119
7.2 Denoising diffusion . . . . . . . . . . 135 fine-tuning, 51, 133
flops, 25
8.1 Chain-of-thought . . . . . . . . . . . 141 forward pass, 42
8.2 Quantization . . . . . . . . . . . . . . 144 foundation model, 133
FP32, 25
framework, 25
GAN, see Generative Adversarial Networks
GELU, 72
10 171
batch normalization, 78, 102
Bellman equation, 127
bias vector, 61, 67
BPE, see Byte Pair Encoding
Byte Pair Encoding, 36, 122
cache memory, 24
Foreword
capacity, 18
causal, 35, 87, 106
model, see model, causal
chain rule (derivative), 41
chain rule (probability), 33 The current period of progress in artificial intelli-
chain-of-thought, 133, 142 gence was triggered when Krizhevsky et al. [2012]
channel, 26 demonstrated that an artificial neural network de-
checkpointing, 44 signed twenty years earlier [LeCun et al., 1989]
classification, 20, 30, 97, 114 could outperform complex state-of-the-art image
CLIP, see Contrastive Language-Image recognition methods by a huge margin, simply
Pre-training by being a hundred times larger and trained on a
CLS token, 110 dataset similarly scaled up.
computational cost, 44, 88 This breakthrough was made possible thanks to
context size, 140 Graphical Processing Units (GPUs), highly paral-
Contrastive Language-Image Pre-training, 123, lel consumer-grade computing devices developed
148 for real-time image synthesis and repurposed for
contrastive loss, 31, 124 artificial neural networks.
convnet, see convolutional network
convolution, 65, 67 Since then, under the umbrella term of “
convolutional layer, see layer, convolutional deep learning
,” innovations in the structures of these net-
convolutional network, 97 works, the strategies to train them, and dedicated
cross-attention block, 91, 104, 106 hardware have allowed for an exponential increase
cross-entropy, 31, 34, 46 in both their size and the quantity of training data
170 11
they take advantage of [Sevilla et al., 2022]. This
has resulted in a wave of successful applications
across technical domains, from computer vision
and robotics to speech processing, and since 2020
in the development of Large Language Models with
general proto-reasoning capabilities [Chowdhery Index
et al., 2022].
Although the bulk of deep learning is not difficult
to understand, it combines diverse components
such as linear algebra, calculus, probabilities, op- 1D convolution, 65
timization, signal processing, programming, algo- 2D convolution, 67
rithmics, and high-performance computing, mak-
ing it complicated to learn. activation, 25, 41
function, 70, 95
Instead of trying to be exhaustive, this little book map, 69
is limited to the background necessary to under- Adam, 40, 147
stand a few important models. This proved to be a adapter, 146
popular approach, resulting in more than 500,000 affine operation, 61
downloads of the PDF file in the 12 months follow- artificial neural network, 11, 15
ing its announcement on Twitter. attention operator, 86
autoencoder, 152
You can download a phone-formatted PDF of this denoising, 113
book from Autograd, 43
autoregressive model, see model, autoregressive
https://fleuret.org/public/lbdl.pdf
average pooling, 75
François Fleuret, backpropagation, 43
May 19, 2024 backward pass, 43, 147
basis function regression, 17
batch, 24, 40
12 169
Part I
Foundations
tinguished Experts. CoRR, abs/2305.14688, 2023.
140
P. Yadav, D. Tam, L. Choshen, et al. TIES-Merging:
Resolving Interference When Merging Models.
CoRR, abs/2306.01708, 2023. 148
L. Yu, B. Yu, H. Yu, et al. Language Models are Super
Mario: Absorbing Abilities from Homologous
Models as a Free Lunch. CoRR, abs/2311.03099,
2023. 148
J. Zbontar, L. Jing, I. Misra, et al. Barlow Twins: Self-
Supervised Learning via Redundancy Reduction.
CoRR, abs/2103.03230, 2021. 155
M. D. Zeiler and R. Fergus. Visualizing and Under-
standing Convolutional Networks. In European
Conference on Computer Vision (ECCV), 2014. 69
H. Zhao, J. Shi, X. Qi, et al. Pyramid Scene Parsing
Network. CoRR, abs/1612.01105, 2016. 120, 121
J. Zhou, C. Wei, H. Wang, et al. iBOT: Image
BERT Pre-Training with Online Tokenizer. CoRR,
abs/2111.07832, 2021. 155
167
J. Sevilla, P. Villalobos, J. F. Cerón, et al. Parameter,
Compute and Data Trends in Machine Learning,
May 2023. [web]. 55
166 15
1.1 Learning from data A. Radford, K. Narasimhan, T. Salimans, and
I. Sutskever. Improving Language Understand-
The simplest use case for a model trained from data ing by Generative Pre-Training, 2018. 104, 108,
is when a signal x is accessible, for instance, the 131
picture of a license plate, from which one wants to
predict a quantity y, such as the string of characters A. Radford, J. Wu, R. Child, et al. Language Models
written on the plate. are Unsupervised Multitask Learners, 2019. 108,
155
In many real-world situations where x is a high-
dimensional signal captured in an uncontrolled O. Ronneberger, P. Fischer, and T. Brox. U-Net:
environment, it is too complicated to come up with Convolutional Networks for Biomedical Image
an analytical recipe that relates x and y. Segmentation. In Medical Image Computing and
Computer-Assisted Intervention, 2015. 82, 83, 121
What one can do is to collect a large training set 𝒟
of pairs (xn , yn ), and devise a parametric model f . P. Sahoo, A. Singh, S. Saha, et al. A Systematic Sur-
This is a piece of computer code that incorporates vey of Prompt Engineering in Large Language
trainable parameters w that modulate its behavior, Models: Techniques and Applications. CoRR,
and such that, with the proper values w∗ , it is a abs/2402.07927, 2024. 140
good predictor. “Good” here means that if an x is F. Scarselli, M. Gori, A. C. Tsoi, et al. The Graph
given to this piece of code, the value ŷ = f (x; w∗ ) Neural Network Model. IEEE Transactions on
it computes is a good estimate of the y that would Neural Networks (TNN), 20(1):61–80, 2009. 154
have been associated with x in the training set had
it been there. R. Sennrich, B. Haddow, and A. Birch. Neural Ma-
chine Translation of Rare Words with Subword
This notion of goodness is usually formalized with Units. CoRR, abs/1508.07909, 2015. 36
a loss ℒ (w) which is small when f ( · ; w) is good
on 𝒟 . Then, training the model consists of com- J. Sevilla, L. Heim, A. Ho, et al. Compute Trends
puting a value w∗ that minimizes ℒ (w∗ ). Across Three Eras of Machine Learning. CoRR,
abs/2202.05924, 2022. 12, 52
Most of the content of this book is about the defini-
tion of f , which, in realistic scenarios, is a complex
16 165
Deep Learning for Audio, Speech and Language combination of pre-defined sub-modules.
Processing, 2013. 72
The trainable parameters that compose w are of-
V. Mnih, K. Kavukcuoglu, D. Silver, et al. Human- ten called weights, by analogy with the synaptic
level control through deep reinforcement learn- weights of biological neural networks. In addition
ing. Nature, 518(7540):529–533, February 2015. to these parameters, models usually depend on
127, 128, 129 ,hyper-parameters
which are set according to domain
prior knowledge, best practices, or resource con-
A. Nichol, P. Dhariwal, A. Ramesh, et al. GLIDE: To- straints. They may also be optimized in some way,
wards Photorealistic Image Generation and Edit- but with techniques different from those used to
ing with Text-Guided Diffusion Models. CoRR, optimize w.
abs/2112.10741, 2021. 137
L. Ouyang, J. Wu, X. Jiang, et al. Training language 1.2 Basis function regression
models to follow instructions with human feed-
back. CoRR, abs/2203.02155, 2022. 133 We can illustrate the training of a model in a simple
case where xn and yn are two real numbers, the
R. Pascanu, T. Mikolov, and Y. Bengio. On the
difficulty of training recurrent neural networks.
In International Conference on Machine Learning
(ICML), 2013. 45
164 17
loss is the mean squared error: Y. LeCun, B. Boser, J. S. Denker, et al. Backpropaga-
N
tion applied to handwritten zip code recognition.
ℒ (w) = (1.1) Neural Computation, 1(4):541–551, 1989. 11
N
(yn − f (xn ; w))2 ,
n=1
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner.
1 X
and f ( · ; w) is a linear combination of a prede- Gradient-based learning applied to document
fined basis of functions f1 , . . . , fK , with w = recognition. Proceedings of the IEEE, 86(11):2278–
(w1 , . . . , wK ): 2324, 1998. 97, 98
K
P. Lewis, E. Perez, A. Piktus, et al. Retrieval-
f (x; w) = wk fk (x). Augmented Generation for Knowledge-
k=1
Intensive NLP Tasks. CoRR, abs/2005.11401,
X
2020. 142
Since f (xn ; w) is linear with respect to the wk s and
ℒ (w) is quadratic with respect to f (xn ; w), the W. Liu, D. Anguelov, D. Erhan, et al. SSD: Single
loss ℒ (w) is quadratic with respect to the wk s, and Shot MultiBox Detector. CoRR, abs/1512.02325,
finding w∗ that minimizes it boils down to solving 2015. 117, 118
a linear system. See Figure 1.1 for an example with
Llama.cpp. Llama.cpp git repository, June 2023.
Gaussian kernels as fk .
[web]. 143, 144
1.3 Under and overfitting J. Long, E. Shelhamer, and T. Darrell. Fully Convo-
lutional Networks for Semantic Segmentation.
A key element is the interplay between the CoRR, abs/1411.4038, 2014. 82, 83, 121
of the model, that is its flexibility and ability to
capacity
fit diverse data, and the amount and quality of the S. Ma, H. Wang, L. Ma, et al. The Era of 1-bit
training data. When the capacity is insufficient, the LLMs: All Large Language Models are in 1.58
model cannot fit the data, resulting in a high error Bits. CoRR, abs/2402.17764, 2024. 145
during training. This is referred to as underfitting. A. L. Maas, A. Y. Hannun, and A. Y. Ng. Rectifier
On the contrary, when the amount of data is in- nonlinearities improve neural network acoustic
sufficient, as illustrated in Figure 1.2, the model models. In proceedings of the ICML Workshop on
18 163
S. Ioffe and C. Szegedy. Batch Normalization: Ac-
celerating Deep Network Training by Reducing
Internal Covariate Shift. In International Confer-
ence on Machine Learning (ICML), 2015. 78
A. Jiang, A. Sablayrolles, A. Mensch, et al. Mistral
7B. CoRR, abs/2310.06825, 2023. 149
J. Kaplan, S. McCandlish, T. Henighan, et al. Scal-
ing Laws for Neural Language Models. CoRR, Figure 1.2: If the amount of training data (black
abs/2001.08361, 2020. 52, 53 dots) is small compared to the capacity of the model,
the empirical performance of the fitted model during
A. Katharopoulos, A. Vyas, N. Pappas, and training (red curve) reflects poorly its actual fit to
F. Fleuret. Transformers are RNNs: Fast Au- the underlying data structure (thin black curve), and
toregressive Transformers with Linear Atten- consequently its usefulness for prediction.
tion. In Proceedings of the International Confer-
ence on Machine Learning (ICML), pages 5294–
5303, 2020. 89 will often learn characteristics specific to the train-
ing examples, resulting in excellent performance
D. Kingma and J. Ba. Adam: A Method for Stochas- during training, at the cost of a worse fit to the
tic Optimization. CoRR, abs/1412.6980, 2014. 40 global structure of the data, and poor performance
D. P. Kingma and M. Welling. Auto-Encoding Vari- on new inputs. This phenomenon is referred to as
ational Bayes. CoRR, abs/1312.6114, 2013. 153 overfitting.
T. Kojima, S. Gu, M. Reid, et al. Large Lan- So, a large part of the art of applied
guage Models are Zero-Shot Reasoners. CoRR, machine learning
is to design models that are not too flexible yet
abs/2205.11916, 2022. 142 still able to fit the data. This is done by crafting
the right inductive bias in a model, which means
A. Krizhevsky, I. Sutskever, and G. Hinton. Ima- that its structure corresponds to the underlying
geNet Classification with Deep Convolutional structure of the data at hand.
Neural Networks. In Neural Information Process-
ing Systems (NIPS), 2012. 11, 97 Even though this classical perspective is relevant
162 19
for reasonably-sized deep models, things get con- K. He, X. Zhang, S. Ren, and J. Sun. Deep Resid-
fusing with large ones that have a very large num- ual Learning for Image Recognition. CoRR,
ber of trainable parameters and extreme capacity abs/1512.03385, 2015. 52, 82, 83, 99, 101
yet still perform well on prediction. We will come
back to this in § 3.6 and § 3.7. D. Hendrycks and K. Gimpel. Gaussian Error Lin-
ear Units (GELUs). CoRR, abs/1606.08415, 2016.
72
1.4 Categories of models
D. Hendrycks, K. Zhao, S. Basart, et al. Natural
We can organize the use of machine learning mod- Adversarial Examples. CoRR, abs/1907.07174,
els into three broad categories: 2019. 126
Regression consists of predicting a continuous- J. Ho, A. Jain, and P. Abbeel. Denoising Diffu-
valued vector y ∈ RK , for instance, a geometrical sion Probabilistic Models. CoRR, abs/2006.11239,
position of an object, given an input signal X. This 2020. 134, 135, 136
is a multi-dimensional generalization of the setup
we saw in § 1.2. The training set is composed of S. Hochreiter and J. Schmidhuber. Long Short-Term
pairs of an input signal and a ground-truth value. Memory. Neural Computation, 9(8):1735–1780,
1997. 151
Classification aims at predicting a value from a
finite set {1, . . . , C}, for instance, the label Y of N. Houlsby, A. Giurgiu, S. Jastrzebski, et al.
an image X. As with regression, the training set Parameter-Efficient Transfer Learning for NLP.
is composed of pairs of input signal, and ground- CoRR, abs/1902.00751, 2019. 146
truth quantity, here a label from that set. The stan- E. Hu, Y. Shen, P. Wallis, et al. LoRA: Low-Rank
dard way of tackling this is to predict one score Adaptation of Large Language Models. CoRR,
per potential class, such that the correct class has abs/2106.09685, 2021. 146
the maximum score.
G. Ilharco, M. Ribeiro, M. Wortsman, et al. Edit-
Density modeling has as its objective to model ing Models with Task Arithmetic. CoRR,
the probability density function of the data µX it- abs/2212.04089, 2022. 148
self, for instance, images. In that case, the training
20 161
Y. Gal and Z. Ghahramani. Dropout as a Bayesian set is composed of values xn without associated
Approximation: Representing Model Uncer- quantities to predict, and the trained model should
tainty in Deep Learning. CoRR, abs/1506.02142, allow for the evaluation of the probability den-
2015. 78 sity function, or sampling from the distribution, or
both.
X. Glorot and Y. Bengio. Understanding the diffi-
culty of training deep feedforward neural net- Both regression and classification are generally re-
works. In International Conference on Artificial ferred to as supervised learning, since the value to
Intelligence and Statistics (AISTATS), 2010. 45, 62 be predicted, which is required as a target during
training, has to be provided, for instance, by hu-
X. Glorot, A. Bordes, and Y. Bengio. Deep Sparse man experts. On the contrary, density modeling
Rectifier Neural Networks. In International Con- is usually seen as unsupervised learning, since it
ference on Artificial Intelligence and Statistics is sufficient to take existing data without the need
(AISTATS), 2011. 71 for producing an associated ground-truth.
A. Gomez, M. Ren, R. Urtasun, and R. Grosse. These three categories are not disjoint; for instance,
The Reversible Residual Network: Backprop- classification can be cast as class-score regression,
agation Without Storing Activations. CoRR, or discrete sequence density modeling as iterated
abs/1707.04585, 2017. 44 classification. Furthermore, they do not cover all
I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, cases. One may want to predict compounded quan-
et al. Generative Adversarial Networks. CoRR, tities, or multiple classes, or model a density con-
abs/1406.2661, 2014. 153 ditional on a signal.
160 21
G. Cybenko. Approximation by superpositions of
a sigmoidal function. Mathematics of Control,
Signals, and Systems, 2(4):303–314, December
1989. 96
J. Deng, W. Dong, R. Socher, et al. ImageNet:
A Large-Scale Hierarchical Image Database.
In Conference on Computer Vision and Pattern
Recognition (CVPR), 2009. 51
T. Dettmers, A. Pagnoni, A. Holtzman, and
L. Zettlemoyer. QLoRA: Efficient Finetuning
of Quantized LLMs. CoRR, abs/2305.14314, 2023.
147
J. Devlin, M. Chang, K. Lee, and K. Toutanova.
BERT: Pre-training of Deep Bidirectional Trans-
formers for Language Understanding. CoRR,
abs/1810.04805, 2018. 52, 110, 155
A. Dosovitskiy, L. Beyer, A. Kolesnikov, et al.
An Image is Worth 16x16 Words: Transform-
ers for Image Recognition at Scale. CoRR,
abs/2010.11929, 2020. 109, 110
K. Fukushima. Neocognitron: A self-organizing
neural network model for a mechanism of pat-
tern recognition unaffected by shift in position.
Biological Cybernetics, 36(4):193–202, April 1980.
4
159
I. Beltagy, M. Peters, and A. Cohan. Longformer:
The Long-Document Transformer. CoRR,
abs/2004.05150, 2020. 88
R. Bommasani, D. Hudson, E. Adeli, et al. On the
Opportunities and Risks of Foundation Models. Chapter 2
CoRR, abs/2108.07258, 2021. 133
J. Bradbury, S. Merity, C. Xiong, and R. Socher. Efficient Computation
Quasi-Recurrent Neural Networks. CoRR,
abs/1611.01576, 2016. 152
T. Brown, B. Mann, N. Ryder, et al. Language Mod-
els are Few-Shot Learners. CoRR, abs/2005.14165, From an implementation standpoint, deep learning
2020. 52, 108, 131 is about executing heavy computations with large
S. Bubeck, V. Chandrasekaran, R. Eldan, et al. amounts of data. The Graphical Processing Units
Sparks of Artificial General Intelligence: Early (GPUs) have been instrumental in the success of
experiments with GPT-4. CoRR, abs/2303.12712, the field by allowing such computations to be run
2023. 133 on affordable hardware.
T. Chen, B. Xu, C. Zhang, and C. Guestrin. Training The importance of their use, and the resulting tech-
Deep Nets with Sublinear Memory Cost. CoRR, nical constraints on the computations that can be
abs/1604.06174, 2016. 44 done efficiently, force the research in the field to
constantly balance mathematical soundness and
K. Cho, B. van Merrienboer, Ç. Gülçehre, et al. implementability of novel methods.
Learning Phrase Representations using RNN
Encoder-Decoder for Statistical Machine Trans-
lation. CoRR, abs/1406.1078, 2014. 151
2.1 GPUs, TPUs, and batches
A. Chowdhery, S. Narang, J. Devlin, et al. PaLM: Graphical Processing Units were originally de-
Scaling Language Modeling with Pathways. signed for real-time image synthesis, which re-
CoRR, abs/2204.02311, 2022. 12, 52, 133 quires highly parallel architectures that happen
158 23
to be well suited for deep models. As their usage
for AI has increased, GPUs have been equipped
with dedicated tensor cores, and deep-learning spe-
cialized chips such as Google’s
Tensor Processing Units
(TPUs) have been developed.
A GPU possesses several thousand parallel units
Bibliography
and its own fast memory. The limiting factor is
usually not the number of computing units, but
the read-write operations to memory. The slow-
est link is between the CPU memory and the GPU
memory, and consequently one should avoid copy- T. Akiba, M. Shing, Y. Tang, et al. Evolutionary
ing data across devices. Moreover, the structure Optimization of Model Merging Recipes. CoRR,
of the GPU itself involves multiple levels of abs/2403.13187, 2024. 149
,cache memory
which are smaller but faster, and compu- J. L. Ba, J. R. Kiros, and G. E. Hinton. Layer Nor-
tation should be organized to avoid copies between malization. CoRR, abs/1607.06450, 2016. 81
these different caches.
R. Balestriero, M. Ibrahim, V. Sobal, et al. A
This is achieved, in particular, by organizing the Cookbook of Self-Supervised Learning. CoRR,
computation in batches of samples that can fit en- abs/2304.12210, 2023. 155
tirely in the GPU memory and are processed in
parallel. When an operator combines a sample A. Baydin, B. Pearlmutter, A. Radul, and J. Siskind.
and model parameters, both have to be moved Automatic differentiation in machine learning:
to the cache memory near the actual computing a survey. CoRR, abs/1502.05767, 2015. 43
units. Proceeding by batches allows for copying
the model parameters only once, instead of doing M. Belkin, D. Hsu, S. Ma, and S. Mandal. Recon-
it for each sample. In practice, a GPU processes a ciling modern machine learning and the bias-
batch that fits in memory almost as quickly as it variance trade-off. CoRR, abs/1812.11118, 2018.
would process a single sample. 50
24 157
A standard GPU has a theoretical
peak performance
of 1013 –1014 floating-point operations
(FLOPs) per second, and its memory typically
ranges from 8 to 80 gigabytes. The standard FP32
encoding of float numbers is on 32 bits, but empir-
ical results show that using encoding on 16 bits,
or even less for some operands, does not degrade
performance.
2.2 Tensors
GPUs and deep learning frameworks such as Py-
Torch or JAX manipulate the quantities to be pro-
cessed by organizing them as tensors, which are
series of scalars arranged along several discrete
axes. They are elements of RN1 ×···×ND that gen-
eralize the notion of vector and matrix.
25
is the dimension of the feature representation at swering questions, or even translating from one
every time step, often referred to as the number of language to another [Radford et al., 2019].
channels. Similarly, a 2D-structured signal can be
represented as a D × H × W tensor, where H and Such models constitute one category of a larger
W are its height and width. An RGB image would class of methods that fall under the name of
correspond to D = 3, but the number of channels self-supervised learning
, and try to take advantage of
can grow up to several thousands in large models. unlabeled datasets [Balestriero et al., 2023].
Adding more dimensions allows for the represen- The key principle of these methods is to define a
tation of series of objects. For example, fifty RGB task that does not require labels but necessitates
images of resolution 32 × 24 can be encoded as a feature representations which are useful for the
50 × 3 × 24 × 32 tensor. real task of interest, for which a small labeled
dataset exists. In computer vision, for instance,
Deep learning libraries provide a large number of image features can be optimized so that they are
operations that encompass standard linear alge- to data transformations that do not change
invariant
bra, complex reshaping and extraction, and deep- the semantic content of the image, while being
learning specific operations, some of which we will statistically uncorrelated [Zbontar et al., 2021].
see in Chapter 4. The implementation of tensors
separates the shape representation from the stor- In both NLP and computer vision, a powerful
age layout of the coefficients in memory, which al- generic strategy is to train a model to recover parts
lows many reshaping, transposing, and extraction of the signal that have been masked [Devlin et al.,
operations to be done without coefficient copying, 2018; Zhou et al., 2021].
hence extremely rapidly.
In practice, virtually any computation can be
decomposed into elementary tensor operations,
which avoids non-parallel loops at the language
level and poor memory management.
Besides being convenient tools, tensors are instru-
mental in achieving computational efficiency. All
26 155
cues that the discriminator uses that need to be the people involved in the development of an op-
addressed. erational deep model, from the designers of the
drivers, libraries, and models to those of the com-
Graph Neural Networks puters and chips, know that the data will be ma-
nipulated as tensors. The resulting constraints on
Many applications require processing signals locality and block decomposability enable all the
which are not organized regularly on a grid. For in- actors in this chain to come up with optimal de-
stance, proteins, 3D meshes, geographic locations, signs.
or social interactions are more naturally structured
as graphs. Standard convolutional networks or
even attention models are poorly adapted to pro-
cess such data, and the tool of choice for such a
task is Graph Neural Networks (GNN) [Scarselli
et al., 2009].
Self-supervised training
As stated in § 7.1, even though they are trained only
to predict the next word, Large Language Models
trained on large unlabeled datasets such as GPT
(see § 5.3) are able to solve various tasks, such as
identifying the grammatical role of a word, an-
154 27
manifold.
The Variational Autoencoder (VAE) proposed by
Kingma and Welling [2013] is a generative model
with a similar structure. It imposes, through the
loss, a pre-defined distribution on the latent rep-
resentation. This allows, after training, the gener-
ation of new samples by sampling the latent rep-
resentation according to this imposed distribution
and then mapping back through the decoder.
Generative Adversarial Networks
Another approach to density modeling is the
Generative Adversarial Networks
(GAN) introduced
by Goodfellow et al. [2014]. This method combines
a generator, which takes a random input follow-
ing a fixed distribution as input and produces a
structured signal such as an image, and a
discriminator
, which takes a sample as input and predicts
whether it comes from the training set or if it was
generated by the generator.
Training optimizes the discriminator to minimize
a standard cross-entropy loss, and the generator to
maximize the discriminator’s loss. It can be shown
that, at equilibrium, the generator produces sam-
ples indistinguishable from real data. In practice,
when the gradient flows through the discriminator
to the generator, it informs the latter about the
153
of skip connections which are modulated dynami-
cally.
152 29
hood of the data. If f (x; w) is to be interpreted as a
normalized log-probability or log-density, the loss
is the opposite of the sum of its values over train-
ing samples, which corresponds to the likelihood
of the data-set.
Cross-entropy
The Missing Bits
For classification, the usual strategy is that the out-
put of the model is a vector with one component
f (x; w)y per class y, interpreted as the logarithm
of a non-normalized probability, or logit. For the sake of concision, this volume skips many
important topics, in particular:
With X the input signal and Y the class to predict,
we can then compute from f an estimate of the Recurrent Neural Networks
posterior probabilities:
Before attention models showed greater perfor-
exp f (x; w)y mance, Recurrent Neural Networks (RNN) were
.
z exp f (x; w)z
the standard approach for dealing with temporal se-
quences such as text or sound samples. These archi-
P̂ (Y = y | X = x) = P
This expression is generally called the softmax, or tectures possess an internal hidden state that gets
more adequately, the softargmax, of the logits. updated each time a component of the sequence is
processed. Their main components are layers such
To be consistent with this interpretation, the model
as LSTM [Hochreiter and Schmidhuber, 1997] or
should be trained to maximize the probability of
GRU [Cho et al., 2014].
the true classes, hence to minimize the
Training a recurrent architecture amounts to un-
folding it in time, which results in a long composi-
tion of operators. This has historically prompted
the design of key techniques now used for deep
architectures such as rectifiers and gating, a form
151
metric learning, where the ob-
jective is to learn a measure of distance between
samples such that a sample xa from a certain se-
mantic class is closer to any sample xb of the same
class than to any sample xc from another class. For
instance, xa and xb can be two pictures of a certain
person, and xc a picture of someone else.
31
Engineering the loss space is to recombine their layers. Akiba et al.
[2024] combine merging the parameters and re-
Usually, the loss minimized during training is not combining layers, and rely on a stochastic op-
the actual quantity one wants to optimize ulti- timization to deal with the combinatorial explo-
mately, but a proxy for which finding the best sion. Experiments with three fine-tuned versions
model parameters is easier. For instance, cross- of Mistral-7B [Jiang et al., 2023] show that combin-
entropy is the standard loss for classification, even ing these two merging strategies outperforms both
though the actual performance measure is a classi- of them.
fication error rate, because the latter has no infor-
mative gradient, a key requirement as we will see
in § 3.3.
It is also possible to add terms to the loss that
depend on the trainable parameters of the model
themselves to favor certain configurations.
The weight decay regularization, for instance, con-
sists of adding to the loss a term proportional to
the sum of the squared parameters. This can be
interpreted as having a Gaussian Bayesian prior
on the parameters, which favors smaller values
and thereby reduces the influence of the data. This
degrades performance on the training set, but re-
duces the gap between the performance in training
and that on new, unseen data.
3.2 Autoregressive models
A key class of methods, particularly for dealing
with discrete sequences in natural language pro-
32 149
8.4 Model merging cessing and computer vision, are the
,autoregressive models
An alternative to the fine-tuning and prompting
methods seen in the previous sections consists of The chain rule for probabilities
combining multiple models with diverse capabili-
ties into a single one, without additional training. Such models put to use the chain rule from proba-
bility theory:
Model merging relies on the compatibility between
P (X1 = x1 , X2 = x2 , . . . , XT = xT ) =
multiple fine-tuned versions of a base model.
P (X1 = x1 )
Ilharco et al. [2022] showed that models obtained × P (X2 = x2 | X1 = x1 )
by fine-tuning a CLIP base model on several image
...
classification data-sets can be combined in the pa-
rameter space, where they exhibit Task Arithmetic × P (XT = xT | X1 = x1 , . . . , XT −1 = xT −1 ).
properties.
Although this decomposition is valid for a random
Formally, let θ be the parameter vector of a pre- sequence of any type, it is particularly efficient
trained model, and for t = 1, . . . , T , let θt and when the signal of interest is a sequence of tokens
τt = θt − θ be respectively the parameters af- from a finite vocabulary {1, . . . K}.
ter fine-tuning on task t and the corresponding
residual. Experiments show that the model with With the convention that the additional token ∅
parameters θ + τ1 + · · · + τT exhibits multi-task stands for an “unknown” quantity, we can repre-
capabilities. Similarly, subtracting a τt degrades sent the event {X1 = x1 , . . . , Xt = xt } as the
the performance on the corresponding task. vector (x1 , . . . , xt , ∅, . . . , ∅).
148 33
allows to sample one token given the previous sion denoising models by fine-tuning the attention
ones. blocks responsible for the text-based conditioning.
The chain rule ensures that by sampling T tokens Since fine-tuning with LoRA adapters drastically
xt , one at a time given the previously sampled reduces the number of trainable parameters, it re-
x1 , . . . , xt−1 , we get a sequence that follows the duces the memory footprint required by optimiz-
joint distribution. This is an autoregressive gener- ers such as Adam, which generally store two run-
ative model. ning average per parameter to optimize. Also, it
reduces slightly the computation during the
Training such a model can be done by minimizing .backward pass
the sum across training sequences and time steps
of the cross-entropy loss For commercial applications that require a large
number of fine-tuned models, the AB pairs can be
Lce f (x1 , . . . , xt−1 , ∅, . . . , ∅; w), xt , stored separately from the original model, which
has to be stored only once. And finally, contrary
which is formally equivalent to maximizing the
likelihood of the true xt s. to other type of adapters, the modifications can be
integrated into the original architecture, simply by
The value that is classically monitored is not the adding AB to W , resulting in an architecture and
cross-entropy itself, but the perplexity, which is parameter count for inference strictly identical to
defined as the exponential of the cross-entropy. that of the base model.
It corresponds to the number of values of a uni-
form distribution with the same entropy, which is We saw that quantization degrade models’ accu-
generally more interpretable. racy only marginally. However, gradient descent
requires high precision in both the gradient and the
Causal models trained parameters, to allow the accumulation of
small changes. The QLoRA approach combines a
The training procedure we just described requires quantized base model and unquantized
a different input for each t, and the bulk of the Low-Rank Adaptation
to reduce the memory requirement
computation done for t < t′ is repeated for t′ . This even more [Dettmers et al., 2023].
is extremely inefficient since T is often of the order
of hundreds or thousands.
34 147
with few parameters, referred to as adapters, to the l1 l2 l3 ... lT −1 lT
pre-trained architecture, and freeze all the original
parameters [Houlsby et al., 2019].
f
The current dominant method is the
Low-Rank Adaptation
(LoRA), which adds low-rank correc-
tions to some of the model’s weight matrices [Hu x1 x2 ... xT −2 xT −1
et al., 2021].
Figure 3.1: An autoregressive model f , is causal if
Formally, given a linear operation of the form
a time step xt of the input sequence modulates the
XW T , where X is a N ×D tensor of activations for
predicted logits ls only if s > t, as depicted by the
a batch of N samples, and W is a C ×D weight ma-
blue arrows. This allows computing the distributions
trix, the LoRA adapter replaces this operation with
at all the time steps in one pass during training. Dur-
X(W + BA)T , where A and B are two trainable
ing sampling, however, the lt and xt are computed
matrices of size R × D and C × R respectively,
sequentially, the latter sampled with the former, as
with R ≪ min(C, D), and the matrix W is re-
depicted by the red arrows.
moved from the trainable parameters. The matrix
A is initialized with random Gaussian values, and
B is set to zero, so that the fine-tuning starts with The standard strategy to address this issue is to
a model that computes an output identical to that design a model f that predicts all the vectors of
of the original one. logits l1 , . . . , lT at once, that is:
146 35
The consequence is that the output at every posi- It quantizes individually sub-blocks of 32 entries
tion is the one that would be obtained if the input of the original weight matrix by storing for each a
were only available up to before that position. Dur- scaling factor d and a bias m in the original FP16
ing training, it allows one to compute the output for encoding, and encoding each entry x with 4 bits
a full sequence and to maximize the predicted prob- as a value q ∈ {0, . . . , 24 − 1}. The resulting de-
abilities of all the tokens of that same sequence, quantized value being x̃ = dq + m.
which again boils down to minimizing the sum of
the per-token cross-entropy. Such a block was encoded originally as 32 values in
FP16, hence 64 bytes, while the quantized version
Note that, for the sake of simplicity, we have de- needs 4 bytes for q and m and 32 · 4 bits = 16 bytes
fined f as operating on sequences of a fixed length for the entries, hence a total of 20 bytes.
T . However, models used in practice, such as the
transformers we will see in § 5.3, are able to process Such an aggressive quantization surprisingly de-
sequences of arbitrary length. grades only marginally the performance of the
models, as illustrated on Figure 8.2.
Tokenizer An alternative to Post-Training Quantization is
One important technical detail when dealing with Quantization-Aware Training that applies quanti-
natural languages is that the representation as to- zation during the forward pass but keeps high-
kens can be done in multiple ways, ranging from precision encoding of parameters and gradients,
the finest granularity of individual symbols to en- and propagates the gradients during the backward
tire words. The conversion to and from the token pass as if there was no quantization [Ma et al.,
representation is carried out by a separate algo- 2024].
rithm called a tokenizer.
8.3 Adapters
A standard method is the Byte Pair Encoding (BPE)
[Sennrich et al., 2015] that constructs tokens by As we saw in § 3.6, fine-tuning is a key strategy to
hierarchically merging groups of characters, trying reuse pre-trained models. Since it aims at making
to get tokens that represent fragments of words of only minor changes to an existing model, tech-
various lengths but of similar frequencies, allocat- niques have been developed that add components
36 145
ing tokens to long frequent fragments as well as to
rare individual symbols.
6.5
144 37
8.2 Quantization
Although training or generating multiple streams
can benefit from high-end parallel computing de-
vices, deployment of a Large Language Model for
individual use requires generally single-stream in-
ference, which is bounded by memory size and
speed far more than by computation.
As stated in § 2.1, parameters, activations, and gra-
dients are usually encoded with 32 or 16 bits. The
precision it provides is necessary for training, to
allow gradual changes to accumulate.
w
However, since activations are the sums of many
terms, quantization during inference is mitigated
by an averaging effect. This is even more true with
large architectures, and models quantized down
to 6 or 4 bits per parameter exhibit remarkable
performance. Additionally to reducing the mem-
ℒ (w) ory footprint, quantization also improves inference
speed significantly.
This has motivated the development of software
w to quantize existing models with
,Post-Training Quantization
and run them in single-stream in-
ference on consumer hardware, such as llama.cpp
Figure 3.2: At every point w, the gradient ∇ℒ |w (w)
is in the direction that maximizes the increase of ℒ ,
[Llama.cpp, 2023]. This framework implements
orthogonal to the level curves (top). The gradient
multiple formats, that apply specific quantization
descent minimizes ℒ (w) iteratively by subtracting
levels for the different weight matrices of a lan-
a fraction of the gradient at every step, resulting in a
trajectory that follows the steepest descent (bottom).
38 143
Chain of Thought around a good minimum and never descend into
it. As we will see in § 3.6, it can depend on the
A remarkable type of prompting aims at making iteration number n.
the model generate intermediate steps before gen-
erating the response itself.
Stochastic Gradient Descent
Such a chain-of-thought is composed of succes- All the losses used in practice can be expressed as
sive steps that are simpler, hence have been better an average of a loss per small group of samples, or
modeled during training, and are predicted more per sample such as:
deterministically [Wei et al., 2022; Kojima et al.,
2022]. See Figure 8.1 for an example. N
1 X
ℒ (w) = 𝓁n (w),
N
Retrieval-Augmented Generation n=1
142 39
gradient. Due to the redundancy in the data, this
happens to be a far more efficient strategy.
Q: Gina has 105 beans, she gives 23 beans to Bob, and
We saw in § 2.1 that processing a batch of samples prepares a soup with 53 beans. How many beans are left?
small enough to fit in the computing device’s mem- A: There are 29 beans left.
ory is generally as fast as processing a single one.
Hence, the standard approach is to split the full Q: I prepare 53 pancakes, eat 5 of them and give 7 to Gina.
I then prepare 26 more. How many pancakes are left? A:
set 𝒟 into batches, and to update the parameters 27 pancakes are left.
from the estimate of the gradient computed from
Q: Gina has 105 beans, she gives 23 beans to Bob, and
each. This is called mini-batch stochastic gradient prepares a soup with 53 beans. How many beans are left?
descent, or stochastic gradient descent (SGD) for A: Let’s proceed step by step: Gina has 105 beans, she
short. gives 23 beans to Bob (82 left), and prepares a soup with
53 beans (29 left). So there are 29 beans left.
It is important to note that this process is extremely
gradual, and that the number of mini-batches and Q: I prepare 53 pancakes, eat 5 of them and give 7 to Gina.
I then prepare 26 more. How many pancakes are left? A:
gradient steps are typically of the order of several Let’s proceed step by step: 53 pancakes, eat 5 of them
million. (48 left), give 7 to Gina (41 left), prepare 26 more (67
left). So there are 67 pancakes left.
As with many algorithms, intuition breaks down
in high dimensions, and although it may seem that
Figure 8.1: Example of a chain-of-thought to im-
this procedure would be easily trapped in a local
prove the response of the Llama-3-8B base model. In
minimum, in reality, due to the number of parame-
the two examples, the beginning of the text in normal
ters, the design of the models, and the stochasticity
font is the prompt, and the generated part is indicated
of the data, its efficiency is far greater than one
in bold. The generation without chain-of-thought
might expect.
(top) leads to an incorrect answer, while the gener-
Plenty of variations of this standard strategy have ation with it (bottom) generates a correct answer,
been proposed. The most popular one is Adam by explicitly producing multiple simple arithmetic
[Kingma and Ba, 2014], which keeps running esti- operations.
mates of the mean and variance of each component
of the gradient, and normalizes them automati-
40 141
8.1 Prompt Engineering cally, avoiding scaling issues and different training
speeds in different parts of a model.
The simplest strategy to specialize or improve a
Large Language Model with a limited computa- 3.4 Backpropagation
tional budget is to use prompt engineering, that
is, to carefully craft the beginning of the text se- Using gradient descent requires a technical means
quence to bias the autoregressive process [Sahoo to compute ∇𝓁 |w (w) where 𝓁 = L(f (x; w); y).
et al., 2024]. This approach moves a part of the Given that f and L are both compositions of stan-
information traditionally encoded in the model’s dard tensor operations, as for any mathematical
parameters to the input. expression, the chain rule from differential calcu-
lus allows us to get an expression of it.
We saw in § 7.1 a simple example of few-shot pre-
diction, to use an LLM for a text classification For the sake of making notation lighter, we will
task without fine-tuning. A long and sophisticated not specify at which point gradients are computed,
prompt allows generalizing this strategy to com- since the context makes it clear.
plex tasks.
The context size of a language model, that is, the The output of f (x; w) can be computed by starting
number of tokens it can operate on, directly mod- with x(0) = x and applying iteratively:
ulates the quantity of information that can be pro- (d)
x =f (d)
x (d−1)
; wd ,
vided in the prompt. This is mostly constrained
by the computational cost of standard attention with x(D) as the final value.
models, which is quadratic with the context size
(see § 4.8). The individual scalar values of these intermediate
results x(d) are traditionally called activations in
140 41
f (d) ( · ; wd )
x(d−1) x(d)
×Jf (d) |x
∇𝓁 |x(d−1) ∇𝓁 |x(d)
×Jf (d) |w Chapter 8
∇𝓁 |wd The Compute Schism
Figure 3.3: Given a model f = f (D) ◦ · · · ◦ f (1) , the
forward pass computes the outputs x(d) of the f (d) in
order (top, black). The backward pass computes the
gradients of the loss with respect to the activations The scale of deep architectures is critical to their
x(d) (bottom, blue) and the parameters wd (bottom, performance and, as we saw in § 3.7,
red) backward by multiplying them by the Jacobians. Large Language Models
in particular may require amounts
of memory and computation that greatly exceed
those of consumer hardware.
reference to neuron activations, the value D is the
depth of the model, the individual mappings f (d) While training such a model from scratch requires
are referred to as layers, as we will see in § 4.1, and resources available only to large corporations or
their sequential evaluation is the forward pass (see public bodies, techniques have been developed to
Figure 3.3, top). allow inference and adaptation to specific tasks
under strong resource constraints. Allowing to
run models locally instead of through a provider
Conversely, the gradient ∇𝓁 |x(d−1) of the loss with
respect to the output x(d−1) of f (d−1) is the prod- may be highly desirable for cost or confidentiality
uct of the gradient ∇𝓁 |x(d) with respect to the out- reasons.
put of f (d) multiplied by the Jacobian Jf (d−1) |x of
f (d−1) with respect to its variable x. Thus, the gra-
dients with respect to the outputs of all the f (d) s
can be computed recursively backward, starting
42 139
with ∇𝓁 |x(D) = ∇L|x .
43
Resource usage where σt is defined analytically.
Regarding the computational cost, as we will see, In practice, such a model initially hallucinates
the bulk of the computation goes into linear oper- structures by pure luck in the random noise, and
ations, each requiring one matrix product for the then gradually builds more elements that emerge
forward pass and two for the products by the Ja- from the noise by reinforcing the most likely con-
cobians for the backward pass, making the latter tinuation of the image obtained thus far.
roughly twice as costly as the former.
This approach can be extended to text-conditioned
The memory requirement during inference is synthesis, to generate images that match a descrip-
roughly equal to that of the most demanding indi- tion. For instance, Nichol et al. [2021] add to the
vidual layer. For training, however, the backward mean of the denoising distribution of Equation 7.1
pass requires keeping the activations computed a bias that goes in the direction of increasing the
during the forward pass to compute the Jacobians, CLIP matching score (see § 6.6) between the pro-
which results in a memory usage that grows pro- duced image and the conditioning text description.
portionally to the model’s depth. Techniques exist
to trade the memory usage for computation by
either relying on reversible layers [Gomez et al.,
2017], or using checkpointing, which consists of
storing activations for some layers only and recom-
puting the others on the fly with partial forward
passes during the backward pass [Chen et al., 2016].
Vanishing gradient
A key historical issue when training a large net-
work is that when the gradient propagates back-
wards through an operator, it may be scaled by a
multiplicative factor, and consequently decrease
or increase exponentially when it traverses many
44 137
setup should degrade the signal so much that the layers. A standard method to prevent it from ex-
distribution p(xT ) has a known analytical form ploding is gradient norm clipping, which consists
which can be sampled. of re-scaling the gradient to set its norm to a fixed
threshold if it is above it [Pascanu et al., 2013].
For instance, Ho et al. [2020] normalize the data
to have a mean of 0 and a variance of 1, and their When the gradient decreases exponentially, this is
diffusion process consists of adding a bit of white called the vanishing gradient, and it may make the
noise and re-normalizing the variance to 1. This training impossible, or, in its milder form, cause
process exponentially reduces the importance of different parts of the model to be updated at differ-
x0 , and xt ’s density can rapidly be approximated ent speeds, degrading their co-adaptation [Glorot
with a normal. and Bengio, 2010].
The denoiser f is a deep architecture that As we will see in Chapter 4, multiple techniques
should model and allow sampling from have been developed to prevent this from happen-
f (xt−1 , xt , t; w) ≃ p(xt−1 | xt ). It can be shown, ing, reflecting a change in perspective that was
thanks to a variational bound, that if this one-step crucial to the success of deep-learning: instead of
reverse process is accurate enough, sampling trying to improve generic optimization methods,
xT ∼ p(xT ) and denoising T steps with f results the effort shifted to engineering the models them-
in x0 that follows p(x0 ). selves to make them optimizable.
in each, and maximizing As the term “deep learning” indicates, useful mod-
X els are generally compositions of long series of
(n) (n)
log f xtn −1 , xtn , tn ; w .
mappings. Training them with gradient descent
n
results in a sophisticated co-adaptation of the map-
pings, even though this procedure is gradual and
Given their diffusion process, Ho et al. [2020] have local.
a denoising of the form:
We can illustrate this behavior with a simple model
xt−1 | xt ∼ 𝒩 (xt + f (xt , t; w); σt ), (7.1)
136 45
R2 → R2 that combines eight layers, each multiplying its input by a 2×2 matrix and applying Tanh per component, with a final linear classifier. This is a simplified version of the standard Multi-Layer Perceptron that we will see in § 5.1.

If we train this model with SGD and cross-entropy on a toy binary classification task (Figure 3.4, top left), the matrices co-adapt to deform the space until the classification is correct, which implies that the data have been made linearly separable before the final affine operation (Figure 3.4, bottom right).
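A minimal PyTorch sketch of this toy model (the training data below is a random placeholder, not the dataset of Figure 3.4):

import torch
from torch import nn

# Eight 2x2 linear maps with Tanh, followed by a linear classifier with two logits.
layers = []
for _ in range(8):
    layers += [nn.Linear(2, 2, bias=False), nn.Tanh()]
model = nn.Sequential(*layers, nn.Linear(2, 2))

optimizer = torch.optim.SGD(model.parameters(), lr=1e-1)
criterion = nn.CrossEntropyLoss()

x, y = torch.randn(1000, 2), torch.randint(0, 2, (1000,))   # toy 2D points and binary labels
for _ in range(100):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()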
Such an example gives a glimpse of what a deep model can achieve; however, it is partially misleading due to the low dimension of both the signal to process and the internal representations. Everything is kept in 2D here for the sake of visualization, while real models take advantage of representations in high dimensions, which, in particular, facilitates the optimization by providing many degrees of freedom.

Empirical evidence accumulated over twenty years demonstrates that state-of-the-art performance across application domains necessitates models with tens of layers, such as residual networks (see § 5.2) or Transformers (see § 5.3).

Theoretical results show that, for a fixed computa-

Figure 7.2: Image synthesis with denoising diffusion [Ho et al., 2020]. Each sample starts as a white noise xT (top), and is gradually de-noised by sampling iteratively xt−1 | xt ∼ 𝒩(xt + f(xt, t; w), σt).
be used as-is to fine-tune the language model, and the latter can be used to train a reward network that predicts the rating and use it as a target to fine-tune the language model with a standard Reinforcement Learning approach.
This results in particular in the ability to solve few-shot prediction, where only a handful of training examples are available, as illustrated in Figure 7.1. More surprisingly, when given a carefully crafted prompt, it can exhibit abilities for question answering, problem solving, and chain-of-thought that appear eerily close to high-level reasoning [Chowdhery et al., 2022; Bubeck et al., 2023].

Due to these remarkable capabilities, these models are sometimes called foundation models [Bommasani et al., 2021].

However, even though it integrates a very large body of knowledge, such a model may be inadequate for practical applications, in particular when interacting with human users. In many situations, one needs responses that follow the statistics of a helpful dialog with an assistant. This differs from the statistics of available large training sets, which combine novels, encyclopedias, forum messages, and blog posts.

This discrepancy is addressed by fine-tuning such a language model (see § 3.6). The current dominant strategy is Reinforcement Learning from Human Feedback (RLHF) [Ouyang et al., 2022], which consists of creating small labeled training sets by asking users to either write responses or provide ratings of generated responses. The former can

tional budget or number of parameters, increasing the depth leads to a greater complexity of the resulting mapping [Telgarsky, 2016].

3.6 Training protocols

Training a deep network requires defining a protocol to make the most of computation and data, and to ensure that performance will be good on new data.

As we saw in § 1.3, the performance on the training samples may be misleading, so in the simplest setup one needs at least two sets of samples: one is a training set, used to optimize the model parameters, and the other is a test set, to evaluate the performance of the trained model.

Additionally, there are usually hyper-parameters to adapt, in particular those related to the model architecture, the learning rate, and the regularization terms in the loss. In that case, one needs a validation set that is disjoint from both the training and test sets to assess the best configuration.

The full training is usually decomposed into epochs, each of which corresponds to going through all the training examples once. The usual dynamic of the losses is that the training loss decreases as long as the optimization runs, while the
I: I love apples, O: positive, I: music is my passion, O: positive, I: my job is boring, O: negative, I: frozen pizzas are awesome, O: positive,

I: I love apples, O: positive, I: music is my passion, O: positive, I: my job is boring, O: negative, I: frozen pizzas taste like cardboard, O: negative,

I: water boils at 100 degrees, O: physics, I: the square root of two is irrational, O: mathematics, I: the set of prime numbers is infinite, O: mathematics, I: gravity is proportional to the mass, O: physics,

I: water boils at 100 degrees, O: physics, I: the square root of two is irrational, O: mathematics, I: the set of prime numbers is infinite, O: mathematics, I: squares are rectangles, O: mathematics,

Figure 7.1: Examples of few-shot prediction with a 120 million parameter GPT model from Hugging Face. In each example, the beginning of the sentence was given as a prompt, and the model generated the part in bold.

Figure 3.5: As training progresses, a model's performance is usually monitored through losses. The training loss is the one driving the optimization process and goes down, while the validation loss is estimated on another set of examples to assess the overfitting of the model. Overfitting appears when the model starts to take into account random structures specific to the training set at hand, resulting in the validation loss starting to increase.

When such a model is trained on a very large dataset, it results in a Large Language Model (LLM), which exhibits extremely powerful properties. Besides the syntactic and grammatical structure of the language, it has to integrate very diverse knowledge, e.g. to predict the word following "The capital of Japan is", "if water is heated to 100 Celsius degrees it turns into", or "because her puppy was sick, Jane was".
validation loss may reach a minimum after a certain number of epochs and then start to increase, reflecting an overfitting regime, as introduced in § 1.3 and illustrated in Figure 3.5.

Paradoxically, although they should suffer from severe overfitting due to their capacity, large models usually continue to improve as training progresses. This may be due to the inductive bias of the model becoming the main driver of optimization when performance is near perfect on the training set [Belkin et al., 2018].

An important design choice is the learning rate schedule during training, that is, the specification of the value of the learning rate at each iteration of the gradient descent. The general policy is that the learning rate should be initially large to avoid having the optimization trapped in a bad local minimum early, and that it should get smaller so that the optimized parameter values do not bounce around and reach a good minimum in a narrow valley of the loss landscape.

The training of very large models may take months on thousands of powerful GPUs and have a financial cost of several million dollars. At this scale, the training may involve many manual interventions, informed, in particular, by the dynamics of the loss evolution.

Chapter 7

Synthesis

A second category of applications, distinct from prediction, is synthesis. It consists of fitting a density model to training samples and providing means to sample from this model.

7.1 Text generation

The standard approach to text synthesis is to use an attention-based, autoregressive model. A very successful model proposed by Radford et al. [2018] is the GPT, which we described in § 5.3.

This architecture has been used for very large models, such as OpenAI's 175-billion-parameter GPT-3 [Brown et al., 2020]. It is composed of 96 self-attention blocks, each with 96 heads, and processes tokens of dimension 12,288, with a hidden dimension of 49,152 in the MLPs of the attention blocks.
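To make the autoregressive generation procedure concrete, here is a minimal sampling loop (a sketch, not the GPT implementation); it assumes a hypothetical model that maps a sequence of token indices to next-token logits:

import torch

def sample_autoregressive(model, prompt, n_tokens, temperature=1.0):
    # prompt: (1, T) tensor of token indices; model returns logits of shape (1, T, V).
    tokens = prompt
    for _ in range(n_tokens):
        logits = model(tokens)[:, -1] / temperature   # logits for the next token
        probs = torch.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        tokens = torch.cat([tokens, next_token], dim=1)
    return tokens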
Fine-tuning

It is often beneficial to adapt an already trained model to a new task, referred to as a downstream task.
3.7 The benefits of scale

There is an accumulation of empirical results showing that performance, for instance, estimated through the loss on test data, improves with the amount of data according to remarkable scaling laws, as long as the model size increases correspondingly [Kaplan et al., 2020] (see Figure 3.6).

Benefiting from these scaling laws in the multi-billion sample regime is possible in part thanks to the structure of deep models, which can be scaled up arbitrarily, as we will see, by increasing the number of layers or feature dimensions. But it is also made possible by the distributed nature of the computation they implement, and by stochastic gradient descent, which requires only a fraction of the data at a time and can operate with datasets whose size is orders of magnitude greater than that of the computing device's memory. This has resulted in an exponential growth of the models, as illustrated in Figure 3.7.

Typical vision models have 10–100 million trainable parameters and require 10^18–10^19 FLOPs for training [He et al., 2015; Sevilla et al., 2022]. Language models have from 100 million to hundreds of billions of trainable parameters and require 10^20–10^23 FLOPs for training [Devlin et al., 2018; Brown et al., 2020; Chowdhery et al., 2022; Sevilla et al.,

Figure 6.5: This graph shows the evolution of the state value V(St) = max_a Q(St, a) during a game of Breakout. The spikes at time points (1) and (2) correspond to clearing a brick, at time point (3) it is about to break through to the top line, and at (4) it does, which ensures a high future reward [Mnih et al., 2015].
[Figure 3.6: plots of the test loss, in particular as a function of the dataset size in tokens.]

mizing

ℒ(w) = (1/N) ∑_{n=1}^{N} (Q(sn, an; w) − yn)²,    (6.2)

sary since the target value in Equation 6.1 is the expectation of yn, while it is yn itself which is used in Equation 6.2. Fixing w in yn results in a better approximation of the desirable gradient.

A key issue is the policy used to collect episodes. Mnih et al. [2015] simply use the ϵ-greedy strategy, which consists of taking an action completely at random with probability ϵ, and the optimal action argmax_a Q(s, a) otherwise. Injecting a bit of randomness is necessary to favor exploration.
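A minimal sketch of the ϵ-greedy policy described above, assuming a hypothetical q_model that maps a state tensor to one estimated value per action:

import torch

def epsilon_greedy(q_model, state, epsilon, nb_actions):
    # With probability epsilon take a uniformly random action,
    # otherwise the action with the highest estimated value Q(s, a).
    if torch.rand(1).item() < epsilon:
        return torch.randint(nb_actions, (1,)).item()
    with torch.no_grad():
        q_values = q_model(state.unsqueeze(0))  # shape (1, nb_actions)
    return q_values.argmax(dim=1).item()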
This is the standard setup of Reinforcement Learning (RL), and it can be worked out by introducing the optimal state-action value function Q(s, a), which is the expected return if we execute action a in state s and then follow the optimal policy. It provides a means to compute the optimal policy as π(s) = argmax_a Q(s, a), and, thanks to the Markovian assumption, it verifies the Bellman equation:

Q(s, a) = E[ Rt + γ max_{a′} Q(St+1, a′) | St = s, At = a ],    (6.1)

from which we can design a procedure to train a parametric model Q(·, ·; w).

To apply this framework to play classical Atari video games, Mnih et al. [2015] use for St the concatenation of the frame at time t and the three that precede, so that the Markovian assumption is reasonable, and use for Q a model dubbed the Deep Q-Network (DQN), composed of two convolutional layers and one fully connected layer with one output value per action, following the classical structure of a LeNet (see § 5.2).

Training is achieved by alternatively playing and recording episodes, and building mini-batches of tuples (sn, an, rn, s′n) ∼ (St, At, Rt, St+1) taken across stored episodes and time steps, and mini-

2022]. These latter models require machines with multiple high-end GPUs.

Training these large models is impossible using datasets with a detailed ground truth, which are costly to produce and can only be of moderate size. Instead, it is done with datasets automatically produced by combining data available on the internet with minimal curation, if any. These sets may combine multiple modalities, such as text and images from web pages, or sound and images from videos, which can be used for large-scale supervised training.

As of 2024, the most powerful models are the so-called Large Language Models (LLMs), which we will see in § 5.3 and § 7.1, trained on extremely large text datasets (see Table 3.1).

Table 3.1: Some examples of publicly available datasets. The equivalent number of books is an indicative estimate for 250 pages of 2000 characters per book.

Dataset        Year  Nb. of images  Size
ImageNet       2012  1.2M           150Gb
Cityscape      2016  25K            60Gb
LAION-5B       2022  5.8B           240Tb

Dataset        Year  Nb. of books   Size
WMT-18-de-en   2018  14M            8Gb
The Pile       2020  1.6B           825Gb
OSCAR          2020  12B            6Tb
[Figure 3.7: growth of training compute across landmark models (Transformer, BERT, AlphaZero, Whisper, GPT-3, LaMDA, PaLM), on a scale of roughly 10^21 to 10^24 FLOPs, i.e. on the order of 1 GWh.]

Additionally, since the textual descriptions are often detailed, such a model has to capture a richer representation of images and pick up cues beyond what is necessary, for instance, for classification.

This translates to excellent performance on challenging datasets such as ImageNet Adversarial [Hendrycks et al., 2019], which was specifically de-

6.7 Reinforcement learning
Figure 6.4: The CLIP text-image embedding [Rad-
ford et al., 2021] allows for zero-shot prediction by
predicting which class description embedding is the
most consistent with the image embedding.
such as background music or ambient noise.
This approach allows leveraging extremely large
datasets that combine multiple types of sound
sources with diverse ground truths.
It is noteworthy that even though the ultimate
goal of this approach is to produce a translation
as deterministic as possible given the input signal,
it is formally the sampling of a text distribution
conditioned on a sound sample, hence a synthesis
process. The decoder is, in fact, extremely similar
to the generative model of § 7.1.
6.6 Text-image representations
A powerful approach to image understanding con-
sists of learning consistent image and text represen-
tations, such that an image, or a textual description
of it, would be mapped to the same feature vector.
The Contrastive Language-Image Pre-training
(CLIP) proposed by Radford et al. [2021] combines
an image encoder f , which is a ViT, and a text
encoder g, which is a GPT. See § 5.3 for both.
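A sketch of zero-shot prediction with such a pair of encoders (f and g are hypothetical callables here, not the actual CLIP API): each class is described by a sentence, and the predicted class is the one whose text embedding is most similar to the image embedding, as in Figure 6.4.

import torch

def zero_shot_classify(f, g, image, class_descriptions):
    # f: image encoder, g: text encoder, both mapping into the same embedding space.
    image_embedding = torch.nn.functional.normalize(f(image.unsqueeze(0)), dim=-1)
    text_embeddings = torch.nn.functional.normalize(
        torch.stack([g(desc) for desc in class_descriptions]), dim=-1
    )
    scores = image_embedding @ text_embeddings.T   # cosine similarities
    return scores.argmax(dim=-1).item()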
To repurpose a GPT as a text encoder, instead of a
standard autoregressive model, they add an “end
of sentence” token to the input sequence, and use
the representation of this token in the last layer as
the embedding. Its dimension is between 512 and 1024, depending on the configuration.
requires operating at multiple scales. This is necessary so that any object, or sufficiently informative sub-part, regardless of its size, is captured somewhere in the model by the feature representation at a single tensor position. Hence, standard architectures for this task downscale the image with a series of convolutional layers to increase the receptive field of the activations, and re-upscale it with a series of transposed convolutional layers, or other upscaling methods such as bilinear interpolation, to make the prediction at high resolution.

However, a strict downscaling-upscaling architecture does not allow for operating at a fine grain when making the final prediction, since all the signal has been transmitted through a low-resolution representation at some point. Models that apply such downscaling-upscaling serially mitigate these issues with skip connections from layers at a certain resolution, before downscaling, to layers at the same resolution, after upscaling [Long et al., 2014; Ronneberger et al., 2015]. Models that do it in parallel, after a convolutional backbone, concatenate the resulting multi-scale representation after upscaling, before making the final per-pixel prediction [Zhao et al., 2016].

Training is achieved with a standard cross-entropy summed over all the pixels. As for object detection, training can start from a network pre-trained on a large-scale image classification dataset to compensate for the limited availability of segmentation ground truth.

4.1 The notion of layer

We call layers standard complex compounded tensor operations that have been designed and empirically identified as being generic and efficient. They often incorporate trainable parameters and correspond to a convenient level of granularity for designing and describing large deep models. The term is inherited from simple multi-layer neural networks, even though modern models may take the form of a complex graph of such modules, incorporating multiple parallel pathways.

[Diagram: an example model depiction, with an input X of size 32×32, a block f replicated ×K, a block g with hyper-parameter n=4, and an output Y of size 4×4.]

In the following pages, I try to stick to the convention for model depiction illustrated above:

• operators / layers are depicted as boxes,

• darker coloring indicates that they embed trainable parameters,

• non-default valued hyper-parameters are added in blue on their right,
• a dashed outer frame with a multiplicative factor
indicates that a group of layers is replicated in se-
ries, each with its own set of trainable parameters,
if any, and
While a standard residual network, for instance, can generate a dense output of the same resolution as its input, as for object detection, this task

The standard approach to solve this task, for instance by the Single Shot Detector (SSD) [Liu et al., 2015], is to use a convolutional neural network that produces a sequence of image representations Zs of size Ds × Hs × Ws, s = 1, . . . , S, with decreasing spatial resolution Hs × Ws down to 1 × 1 for s = S (see Figure 6.1). Each of these tensors covers the input image in full, so the h, w indices correspond to a partitioning of the image lattice into regular squares that gets coarser when s increases.

As seen in § 4.2, and illustrated in Figure 4.4, due to the succession of convolutional layers, a feature vector (Zs[0, h, w], . . . , Zs[Ds − 1, h, w]) is a descriptor of an area of the image, called its receptive field, that is larger than this square but centered on it. This results in a non-ambiguous matching of any bounding box (x1, x2, y1, y2) to a s, h, w, determined respectively by max(x2 − x1, y2 − y1), (y1 + y2)/2, and (x1 + x2)/2.

Detection is achieved by adding S convolutional layers, each processing a Zs and computing, for every tensor index h, w, the coordinates of a bounding box and the associated logits. If there are C object classes, there are C + 1 logits, the additional one standing for "no object." Hence, each additional convolution layer has 4 + C + 1 output channels. The SSD algorithm in particular generates several bounding boxes per s, h, w, each dedicated to a hard-coded range of aspect ratios.

Training sets for object detection are costly to create, since the labeling with bounding boxes requires a slow human intervention. To mitigate this issue, the standard approach is to fine-tune a convolutional model that has been pre-trained on a large classification dataset such as VGG-16 for the original SSD, and to replace its final fully-connected layers with additional convolutional ones. Surprisingly, models trained for classification only learn feature representations that can be repurposed for object detection, even though that task involves the regression of geometric quantities.

During training, every ground-truth bounding box is associated with its s, h, w, and induces a loss term composed of a cross-entropy loss for the logits, and a regression loss such as MSE for the bounding box coordinates. Every other s, h, w free of bounding-box match induces a cross-entropy only penalty to predict the class "no object".

6.4 Semantic segmentation

The finest-grain prediction task for image understanding is semantic segmentation, which consists of predicting, for each pixel, the class of the object

4.2 Linear layers

The most basic linear layer is the fully connected layer, parameterized by a trainable weight matrix W of size D′ × D and bias vector b of dimension D′. It implements an affine transformation generalized to arbitrary tensor shapes, where the supplementary dimensions are interpreted as vector indexes. Formally, given an input X of dimension D1 × · · · × DK × D, it computes an output Y of dimension D1 × · · · × DK × D′ with

∀d1, . . . , dK, Y[d1, . . . , dK] = W X[d1, . . . , dK] + b.

While at first sight such an affine operation seems limited to geometric transformations such as rotations, symmetries, and translations, it can in fact do more than that. In particular, projections for dimension reduction or signal filtering, but also, from the perspective of the dot product being a measure of similarity, a matrix-vector product can be interpreted as computing matching scores between the queries, as encoded by the input vectors, and keys, as encoded by the matrix rows.

As we saw in § 3.3, the gradient descent starts with the parameters' random initialization. If this is done too naively, as seen in § 3.4, the network may suffer from exploding or vanishing activations and gradients [Glorot and Bengio, 2010]. Deep learning frameworks implement initialization methods that in particular scale the random parameters according to the dimension of the input to keep the variance of the activations constant and prevent pathological behaviors.

Convolutional layers

A linear layer can take as input an arbitrarily-shaped tensor by reshaping it into a vector, as long as it has the correct number of coefficients. However, such a layer is poorly adapted to dealing with large tensors, since the number of parameters and number of operations are proportional to the product of the input and output dimensions. For instance, to process an RGB image of size 256 × 256 as input and compute a result of the same size, it would require approximately 4 × 10^10 parameters and multiplications.

Besides these practical issues, most of the high-dimension signals are strongly structured. For instance, images exhibit short-term correlations and statistical stationarity with respect to translation, scaling, and certain symmetries. This is not reflected in the inductive bias of a fully connected layer, which completely ignores the signal structure.

To leverage these regularities, the tool of choice is convolutional layers, which are also affine, but process time-series or 2D signals locally, with the same operator everywhere.
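As a small illustration (a sketch, not taken from the text), a 2D convolutional layer in PyTorch applies the same bank of filters at every spatial location of a D × H × W input:

import torch
from torch import nn

# A 2D convolution with D=3 input channels, D'=16 output channels, and a 5x5 kernel.
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=5)

x = torch.randn(1, 3, 64, 64)   # a batch with one 3-channel 64x64 image
y = conv(x)
print(y.shape)                  # torch.Size([1, 16, 60, 60]) since 64 - 5 + 1 = 60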
Figure 4.1: A 1D convolution (left) takes as input a D × T tensor X, applies the same affine mapping ϕ(·; w) to every sub-tensor of shape D × K, and stores the resulting D′ × 1 tensors into Y. A 1D transposed convolution (right) takes as input a D × T tensor, applies the same affine mapping ψ(·; w) to every sub-tensor of shape D × 1, and sums the shifted resulting D′ × K tensors. Both can process inputs of different sizes.

Figure 6.2: Examples of object detection with the Single-Shot Detector [Liu et al., 2015].
[Figure 4.2: a 2D convolution (left) and a 2D transposed convolution (right), applying the same mappings ϕ and ψ at every location of a D × H × W tensor X to compute Y.]

[Figure 6.1: a convolutional network computes from the input image X a sequence of representations Z1, Z2, . . . , ZS−1, ZS of decreasing spatial resolution.]
timate of the original signal X. For images, it is a convolutional network that may integrate skip-connections, in particular to combine representations at the same resolution obtained early and late in the model, as well as attention layers to facilitate taking into account elements that are far away from each other.

Such a model is trained by collecting a large number of clean samples paired with their degraded inputs. The latter can be captured in degraded conditions, such as low-light or inadequate focus, or generated algorithmically, for instance, by converting the clean sample to grayscale, reducing its size, or aggressively compressing it with a lossy compression method.

The standard training procedure for denoising autoencoders uses the MSE loss summed across all pixels, in which case the model aims at computing the best average clean picture, given the degraded one, that is E[X | X̃]. This quantity may be problematic when X is not completely determined by X̃, in which case some parts of the generated signal may be an unrealistic, blurry average.

6.2 Image classification

Image classification is the simplest strategy for extracting semantics from an image and consists of predicting a class from a finite, predefined number of classes, given an input image.

The standard models for this task are convolutional networks, such as ResNets (see § 5.2), and attention-based models such as ViT (see § 5.3). These models generate a vector of logits with as many dimensions as there are classes.

The training procedure simply minimizes the cross-entropy loss (see § 3.1). Usually, performance can be improved with data augmentation, which consists of modifying the training samples with hand-designed random transformations that do not change the semantic content of the image, such as cropping, scaling, mirroring, or color changes.

6.3 Object detection

A more complex task for image understanding is object detection, in which the objective is, given an input image, to predict the classes and positions of objects of interest.

An object position is formalized as the four coordinates (x1, y1, x2, y2) of a rectangular bounding box, and the ground truth associated with each training image is a list of such bounding boxes, each labeled with the class of the object contained therein.

Figure 4.3: Beside its kernel size and number of input / output channels, a convolution admits three hyper-parameters: the stride s (left) modulates the step size when going through the input tensor, the padding p (top right) specifies how many zero entries are added around the input tensor before processing it, and the dilation d (bottom right) parameterizes the index count between coefficients of the filter.

of size D × K of X, storing the results in a tensor Y of size D′ × (T − K + 1), as pictured in Figure 4.1 (left).

A 2D convolution is similar but has a K × L kernel and takes as input a D × H × W tensor (see Figure 4.2, left).

Both operators have for trainable parameters those of ϕ that can be envisioned as D′ filters of size D × K or D × K × L respectively, and a bias vector of dimension D′.

Such a layer is equivariant to translation, meaning that if the input signal is translated, the output is similarly transformed. This property results in a desirable inductive bias when dealing with a signal whose distribution is invariant to translation.

They also admit three additional hyper-parameters, illustrated on Figure 4.3:

• The padding specifies how many zero coefficients should be added around the input tensor before processing it, particularly to maintain the tensor size when the kernel size is greater than one. Its default value is 0.

• The stride specifies the step size used when going through the input, allowing one to reduce the output size geometrically by using large steps. Its
default value is 1.

• The dilation specifies the index count between the filter coefficients of the local affine operator. Its default value is 1, and greater values correspond to inserting zeros between the coefficients, which increases the filter / kernel size while keeping the number of trainable parameters unchanged.

Figure 4.4: Given an activation in a series of convolution layers, here in red, its receptive field is the area in the input signal, in blue, that modulates its value. Each intermediate convolutional layer increases the width and height of that area by roughly those of the kernel.

Chapter 6

Prediction

A first category of applications, such as face recognition, sentiment analysis, object detection, or speech recognition, requires predicting an unknown value from an available signal.

6.1 Image denoising

A direct application of deep models to image processing is to recover from degradation by utilizing the redundancy in the statistical structure of images. The petals of a sunflower in a grayscale picture can be colored with high confidence, and the texture of a geometric shape such as a table on a low-light, grainy picture can be corrected by averaging it over a large area likely to be uniform.

A denoising autoencoder is a model that takes a degraded signal X̃ as input and computes an es-

Except for the number of channels, a convolution's output is usually smaller than its input. In the 1D case without padding nor dilation, if the input is of size T, the kernel of size K, and the stride is S, the output is of size T′ = (T − K)/S + 1.
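A quick check of this output-size formula (a sketch, not from the text), using arbitrary values of T, K, and S:

import torch
from torch import nn

# Output length of a 1D convolution without padding or dilation: T' = (T - K) / S + 1.
T, K, S = 100, 5, 3
conv = nn.Conv1d(in_channels=8, out_channels=16, kernel_size=K, stride=S)
x = torch.randn(1, 8, T)
print(conv(x).shape[-1], (T - K) // S + 1)   # both print 32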
stance, in the 1D case, applies an affine mapping ψ(·; w): R^(D×1) → R^(D′×K) to every D × 1 sub-tensor of the input, and sums the shifted D′ × K resulting tensors to compute its output. Such an operator increases the size of the signal and can be understood intuitively as a synthesis process (see Figure 4.1, right, and Figure 4.2, right).

A series of convolutional layers is the usual architecture for mapping a large-dimension signal, such as an image or a sound sample, to a low-dimension tensor. This can be used, for instance, to get class scores for classification or a compressed representation. Transposed convolution layers are used the opposite way to build a large-dimension signal from a compressed representation, either to assess that the compressed representation contains enough information to reconstruct the signal or for synthesis, as it is easier to learn a density model over a low-dimension representation. We will revisit this in § 5.2.

4.3 Activation functions

If a network were combining only linear components, it would itself be a linear operator, so it is essential to have non-linear operations. These are implemented in particular with activation functions, which are layers that transform each component of the input tensor individually through a

Part III

Applications
Vision Transformer
Transformers have been put to use for image classi-
fication with the Vision Transformer (ViT) model
[Dosovitskiy et al., 2020] (see Figure 5.9).
Figure 5.9: Vision Transformer model [Dosovitskiy et al., 2020].

learning relies on the gradient, it may seem problematic to have a mapping that is not differentiable at zero and constant on half the real line. However, the main property gradient descent requires is that the gradient is informative on average. Parameter initialization and data normalization make half of the activations positive when the training starts, ensuring that this is the case.

Before the generalization of ReLU, the standard activation function was the hyperbolic tangent (Tanh, see Figure 4.5, top left) which saturates exponentially fast on both the negative and positive sides, aggravating the vanishing gradient.

Other popular activation functions follow the same idea of keeping positive values unchanged and squashing the negative values. Leaky ReLU [Maas et al., 2013] applies a small positive multiplying factor to the negative values (see Figure 4.5, bottom left):

leakyrelu(x) = ax if x < 0, and x otherwise.

And GELU [Hendrycks and Gimpel, 2016] is defined using the cumulative distribution function of the Gaussian distribution, that is:

gelu(x) = x P(Z ≤ x),
where Z ∼ 𝒩(0, 1). It roughly behaves like a smooth ReLU (see Figure 4.5, bottom right).

The choice of an activation function, in particular among the variants of ReLU, is generally driven by empirical performance.

4.4 Pooling

A classical strategy to reduce the signal size is to use a pooling operation that combines multiple activations into one that ideally summarizes the information. The most standard operation of this class is the max pooling layer, which, similarly to convolution, can operate in 1D and 2D and is defined by a kernel size.

In its standard form, this layer computes the maximum activation per channel, over non-overlapping sub-tensors of spatial size equal to the kernel size. These values are stored in a result tensor with the same number of channels as the input, and whose spatial size is divided by the kernel size. As with the convolution, this operator has three hyper-parameters: padding, stride, and dilation, with the stride being equal to the kernel size by default. A smaller stride results in a larger resulting tensor, following the same formula as for convolutions (see § 4.2).

Figure 5.8: GPT model [Radford et al., 2018].

Generative Pre-trained Transformer

The Generative Pre-trained Transformer (GPT) [Radford et al., 2018, 2019], pictured in Figure 5.8, is a pure autoregressive model that consists of a succession of causal self-attention blocks, hence a causal version of the original Transformer encoder.

This class of models scales extremely well, up to hundreds of billions of trainable parameters [Brown et al., 2020]. We will come back to their use for text generation in § 7.1.
tom right of Figure 5.6, is similar except that it takes as input two sequences, one to compute the queries and one to compute the keys and values.

The encoder of the Transformer (see Figure 5.7, bottom), recodes the input sequence of discrete tokens X1, . . . , XT with an embedding layer (see § 4.9), and adds a positional encoding (see § 4.10), before processing it with several self-attention blocks to generate a refined representation Z1, . . . , ZT.

The decoder (see Figure 5.7, top), takes as input the sequence Y1, . . . , YS−1 of result tokens produced so far, similarly recodes them through an embedding layer, adds a positional encoding, and processes it through alternating causal self-attention blocks and cross-attention blocks to produce the logits predicting the next tokens. These cross-attention blocks compute their keys and values from the encoder's result representation Z1, . . . , ZT, which allows the resulting sequence to be a function of the original sequence X1, . . . , XT.

As we saw in § 3.2, being causal ensures that such a model can be trained by minimizing the cross-entropy summed across the full sequence.

Figure 4.6: A 1D max pooling takes as input a D × T tensor X, computes the max over non-overlapping 1 × L sub-tensors (in blue) and stores the resulting values (in red) in a D × (T/L) tensor Y.
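A minimal PyTorch sketch of the 1D max pooling of Figure 4.6, with an arbitrary kernel size:

import torch
from torch import nn

# 1D max pooling with kernel size L=4: each channel keeps the maximum
# over non-overlapping windows of 4 consecutive values.
pool = nn.MaxPool1d(kernel_size=4)

x = torch.randn(1, 8, 100)   # a D=8 by T=100 signal (with a batch dimension)
y = pool(x)
print(y.shape)               # torch.Size([1, 8, 25]) since T/L = 100/4 = 25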
Transformer

The original Transformer, pictured in Figure 5.7, was designed for sequence-to-sequence translation. It combines an encoder that processes the input sequence to get a refined representation, and an autoregressive decoder that generates each token of the result sequence, given the encoder's representation of the input sequence and the output tokens generated so far.

Like the residual convolutional networks of § 5.2, both the encoder and the decoder of the Transformer are sequences of compounded blocks built with residual connections.

• The feed-forward block, pictured at the top of Figure 5.6, is a one hidden layer MLP, preceded by a layer normalization. It can update representations at every position separately.

• The self-attention block, pictured on the bottom left of Figure 5.6, is a Multi-Head Attention layer (see § 4.8), that recombines information globally, allowing any position to collect information from any other positions, preceded by a layer normalization. This block can be made causal by using an adequate mask in the attention layer, as described in § 4.8.

• The cross-attention block, pictured on the bot-

Figure 5.7: Original encoder-decoder Transformer model for sequence-to-sequence translation [Vaswani et al., 2017].

The max operation can be intuitively interpreted as a logical disjunction, or, when it follows a series of convolutional layers that compute local scores for the presence of parts, as a way of encoding that at least one instance of a part is present. It loses precise location, making it invariant to local deformations.

A standard alternative is the average pooling layer that computes the average instead of the maximum over the sub-tensors. This is a linear operation, whereas max pooling is not.

4.5 Dropout

Some layers have been designed to explicitly facilitate training or improve the learned representations.

One of the main contributions of that sort was dropout [Srivastava et al., 2014]. Such a layer has no trainable parameters, but one hyper-parameter, p, and takes as input a tensor of arbitrary shape.

It is usually switched off during testing, in which case its output is equal to its input. When it is active, it has a probability p of setting to zero each activation of the input tensor independently, and it re-scales all the activations by a factor of 1/(1−p) to maintain the expected value unchanged (see Figure 4.7).

Figure 4.7: Dropout can process a tensor of arbitrary shape. During training (left), it sets activations at random to zero with probability p and applies a multiplying factor to keep the expected values unchanged. During test (right), it keeps all the activations unchanged.

The motivation behind dropout is to favor meaningful individual activation and discourage group representation. Since the probability that a group of k activations remains intact through a dropout layer is (1 − p)^k, joint representations become unreliable, making the training procedure avoid them. It can also be seen as a noise injection that makes the training more robust.
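A minimal sketch of the dropout computation described above (not the PyTorch layer itself), showing the 1/(1−p) re-scaling:

import torch

def dropout(x, p, train):
    # During training, zero each activation independently with probability p
    # and rescale by 1/(1-p) to keep the expected value unchanged.
    if not train:
        return x
    mask = (torch.rand_like(x) > p).float()
    return mask * x / (1 - p)

x = torch.ones(3, 4)
print(dropout(x, p=0.5, train=True))    # a mix of zeros and 2s
print(dropout(x, p=0.5, train=False))   # unchanged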
When dealing with images and 2D tensors, the short-term correlation of the signals and the resulting redundancy negate the effect of dropout, since activations set to zero can be inferred from their neighbors. Hence, dropout for 2D tensors sets entire channels to zero instead of individual activations (see Figure 4.8).

Figure 4.8: 2D signals such as images generally exhibit strong short-term correlation and individual activations can be inferred from their neighbors. This redundancy nullifies the effect of the standard unstructured dropout, so the usual dropout layer for 2D tensors drops entire channels instead of individual values.

Figure 5.6: Feed-forward block (top), self-attention block (bottom left) and cross-attention block (bottom right). These specific structures proposed by Radford et al. [2018] differ slightly from the original architecture of Vaswani et al. [2017], in particular by having the layer normalization first in the residual blocks.
requires a residual connection that changes the tensor shape. This is achieved with a 1×1 convolution with a stride of two (see Figure 5.4).

The overall structure of the ResNet-50 is presented in Figure 5.5. It starts with a 7 × 7 convolutional layer that converts the three-channel input image to a 64-channel image of half the size, followed by four sections of residual blocks. Surprisingly, in the first section, there is no downscaling, only an increase of the number of channels by a factor of 4. The output of the last residual block is 2048 × 7 × 7, which is converted to a vector of dimension 2048 by an average pooling of kernel size 7 × 7, and then processed through a fully-connected layer to get the final logits, here for 1000 classes.

5.3 Attention models

As stated in § 4.8, many applications, particularly from natural language processing, benefit greatly from models that include attention mechanisms. The architecture of choice for such tasks, which has been instrumental in recent advances in deep learning, is the Transformer proposed by Vaswani et al. [2017].

training and is inactive during inference, it can be used in certain setups as a randomization strategy, for instance, to estimate empirically confidence scores [Gal and Ghahramani, 2015].

4.6 Normalizing layers

An important class of operators to facilitate the training of deep architectures are the normalizing layers, which force the empirical mean and variance of groups of activations.

The main layer in that family is batch normalization [Ioffe and Szegedy, 2015], which is the only standard layer to process batches instead of individual samples. It is parameterized by a hyper-parameter D and two series of trainable scalar parameters β1, . . . , βD and γ1, . . . , γD.

Given a batch of B samples x1, . . . , xB of dimension D, it first computes for each of the D components an empirical mean m̂d and variance v̂d across the batch:

m̂d = (1/B) ∑_{b=1}^{B} xb,d

v̂d = (1/B) ∑_{b=1}^{B} (xb,d − m̂d)²,

from which it computes for every component xb,d
a normalized value zb,d, with empirical mean 0 and variance 1, and from it the final result value yb,d with mean βd and standard deviation γd:

∀b, zb,d = (xb,d − m̂d) / √(v̂d + ϵ)

yb,d = γd zb,d + βd.

Figure 5.5: Structure of the ResNet-50 [He et al., 2015].

Because this normalization is defined across a batch, it is done only during training. During testing, the layer transforms individual samples according to the m̂d s and v̂d s estimated with a moving average over the full training set, which boils down to a fixed affine transformation per component.

The motivation behind batch normalization was to avoid that a change in scaling in an early layer of the network during training impacts all the layers that follow, which then have to adapt their trainable parameters accordingly. Although the actual mode of action may be more complicated than this initial motivation, this layer considerably facilitates the training of deep models.

In the case of 2D tensors, to follow the principle of convolutional layers of processing all locations similarly, the normalization is done per-channel across all 2D positions, and β and γ remain vectors of dimension D so that the scaling/shift does not depend on the 2D position. Hence, if the tensor
to be processed is of shape B × D × H × W, the layer computes (m̂d, v̂d), for d = 1, . . . , D, from the corresponding B × H × W slice, normalizes it accordingly, and finally scales and shifts its components with the trainable parameters βd and γd.

So, given a B × D tensor, batch normalization normalizes it across b and scales/shifts it according to d, which can be implemented as a component-wise product by γ and a sum with β. Given a B × D × H × W tensor, it normalizes across b, h, w

[Figure 5.4: a downscaling residual block, with a 1×1 convolution of stride S followed by a batch normalization on the skip branch.]
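Coming back to batch normalization, a minimal sketch of its training-time computation for a B × D tensor (the moving-average statistics used at test time are omitted):

import torch

def batchnorm_train(x, beta, gamma, eps=1e-5):
    # x: (B, D) batch; beta, gamma: (D,) trainable parameters.
    m_hat = x.mean(dim=0)                        # empirical mean per component
    v_hat = x.var(dim=0, unbiased=False)         # empirical variance per component
    z = (x - m_hat) / torch.sqrt(v_hat + eps)    # mean 0, variance 1
    return gamma * z + beta                      # mean beta, standard deviation gamma

x = torch.randn(32, 10) * 3 + 1
y = batchnorm_train(x, beta=torch.zeros(10), gamma=torch.ones(10))
print(y.mean(dim=0), y.std(dim=0))               # approximately 0 and 1 per component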
Figure 4.10: Skip connections, highlighted in red on this figure, transport the signal unchanged across multiple layers. Some architectures (center) that downscale and re-upscale the representation size to operate at multiple scales, have skip connections to feed outputs from the early parts of the network to later layers operating at the same scales [Long et al., 2014; Ronneberger et al., 2015]. The residual connections (right) are a special type of skip connections that sum the original signal to the transformed one, and usually bypass at most a handful of layers [He et al., 2015].

Figure 5.3: A residual block.

easily extended to deep architectures and suffer from the vanishing gradient problem. The residual networks, or ResNets, proposed by He et al. [2015] explicitly address the issue of the vanishing gradient with residual connections (see § 4.7), which allow hundreds of layers. They have become standard architectures for computer vision applications, and exist in multiple versions depending on the number of layers. We are going to look in detail at the architecture of the ResNet-50 for classification.
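A minimal sketch of the residual-block pattern (a generic block, not the exact ResNet-50 bottleneck block of Figure 5.3): the input is added back to the output of a small convolutional branch, so the block only has to learn a correction.

import torch
from torch import nn

class ResBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.branch = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
        )
    def forward(self, x):
        # Residual connection: sum the unchanged input with the branch output.
        return torch.relu(x + self.branch(x))

x = torch.randn(1, 64, 56, 56)
print(ResBlock(64)(x).shape)   # torch.Size([1, 64, 56, 56])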
Figure 5.2: Example of a small LeNet-like network for classifying 28 × 28 grayscale images of handwritten digits [LeCun et al., 1998]. Its first half is convolutional, and alternates convolutional layers per se and max pooling layers, reducing the signal dimension from 28 × 28 scalars to 256. Its second half processes this 256-dimensional feature vector through a one hidden layer perceptron to compute 10 logit scores corresponding to the ten possible digits.

4.7 Skip connections

Another technique that mitigates the vanishing gradient and allows the training of deep architectures are skip connections [Long et al., 2014; Ronneberger et al., 2015]. They are not layers per se, but an architectural design in which outputs of some layers are transported as-is to other layers further in the model, bypassing processing in between. This unmodified signal can be concatenated or added to the input of the layer the connection branches into (see Figure 4.10). A particular type of skip connections are the residual connections, which combine the signal with a sum, and usually skip only a few layers (see Figure 4.10, right).

The most desirable property of this design is to ensure that, even in the case of gradient-killing processing at a certain stage, the gradient will still propagate through the skip connections. Residual connections, in particular, allow for the building of deep models with up to several hundred layers, and key models, such as the residual networks [He et al., 2015] in computer vision (see § 5.2), and the Transformers [Vaswani et al., 2017] in natural language processing (see § 5.3), are entirely composed of blocks of layers with residual connections.

Their role can also be to facilitate multi-scale reasoning in models that reduce the signal size before
re-expanding it, by connecting layers with compatible sizes, for instance for semantic segmentation (see § 6.4). In the case of residual connections, they may also facilitate learning by simplifying the task to finding a differential improvement instead of a full update.

4.8 Attention layers

In many applications, there is a need for an operation able to combine local information at locations far apart in a tensor. For instance, this could be distant details for coherent and realistic image synthesis, or words at different positions in a paragraph to make a grammatical or semantic decision in Natural Language Processing.

Fully connected layers cannot process large-dimension signals, nor signals of variable size, and convolutional layers are not able to propagate information quickly. Strategies that aggregate the results of convolutions, for instance, by averaging them over large spatial areas, suffer from mixing multiple signals into a limited number of dimensions.

Attention layers specifically address this problem by computing an attention score for each component of the resulting tensor to each component of the input tensor, without locality constraints,

Figure 5.1: This multi-layer perceptron takes as input a one-dimensional tensor of size 50, is composed of three fully connected layers with outputs of dimensions respectively 25, 10, and 2, the two first followed by ReLU layers.

universal approximation theorem [Cybenko, 1989], which states that, if the activation function σ is continuous and not polynomial, any continuous function f can be approximated arbitrarily well uniformly on a compact domain, which is bounded and contains its boundary, by a model of the form l2 ◦ σ ◦ l1 where l1 and l2 are affine. Such a model is a MLP with a single hidden layer, and this result implies that it can approximate anything of practical value. However, this approximation holds if the dimension of the first linear layer's output can be arbitrarily large.

In spite of their simplicity, MLPs remain an important tool when the dimension of the signal to be processed is not too large.

5.2 Convolutional networks

The standard architecture for processing images is a convolutional network, or convnet, that combines multiple convolutional layers, either to reduce the signal size before it can be processed by fully connected layers, or to output a 2D signal also of large size.

LeNet-like

The original LeNet model for image classification [LeCun et al., 1998] combines a series of 2D convolutional layers and max pooling layers that play the role of feature extractor, with a series of fully connected layers which act as a MLP and perform the classification per se (see Figure 5.2).

This architecture was the blueprint for many models that share its structure and are simply larger, such as AlexNet [Krizhevsky et al., 2012] or the VGG family [Simonyan and Zisserman, 2014].

Residual networks

Standard convolutional neural networks that follow the architecture of the LeNet family are not

and averaging the features across the full tensor accordingly [Vaswani et al., 2017].

Even though they are substantially more complicated than other layers, they have become a standard element in many recent models. They are, in particular, the key building block of Transformers, the dominant architecture for Large Language Models. See § 5.3 and § 7.1.

Figure 4.11: The attention operator can be interpreted as matching every query Qq with all the keys K1, . . . , KN^KV to get normalized attention scores Aq,1, . . . , Aq,N^KV (left, and Equation 4.1), and then averaging the values V1, . . . , VN^KV with these scores to compute the resulting Yq (right, and Equation 4.2).

Attention operator

Given

• a tensor Q of queries of size N^Q × D^QK,
• a tensor K of keys of size N^KV × D^QK, and

• a tensor V of values of size N^KV × D^V,

the attention operator computes a tensor

Y = att(Q, K, V)

of dimension N^Q × D^V. To do so, it first computes for every query index q and every key index k an attention score Aq,k as the softargmax of the dot products between the query Qq and the keys:

Aq,k = exp(Qq · Kk / √D^QK) / ∑l exp(Qq · Kl / √D^QK),    (4.1)

where the scaling factor 1/√D^QK keeps the range of values roughly unchanged even for large D^QK.

Then a retrieved value is computed for each query by averaging the values according to the attention scores (see Figure 4.11):

Yq = ∑k Aq,k Vk.    (4.2)

So if a query Qn matches one key Km far more than all the others, the corresponding attention score An,m will be close to one, and the retrieved value Yn will be the value Vm associated to that key. But, if it matches several keys equally, then Yn will be the average of the associated values.

Chapter 5

Architectures

The field of deep learning has developed over the years, for each application domain, multiple deep architectures that exhibit good trade-offs with respect to multiple criteria of interest: e.g. ease of training, accuracy of prediction, memory footprint, computational cost, scalability.

5.1 Multi-Layer Perceptrons

The simplest deep architecture is the Multi-Layer Perceptron (MLP), which takes the form of a succession of fully connected layers separated by activation functions. See an example in Figure 5.1. For historical reasons, in such a model, the number of hidden layers refers to the number of linear layers, excluding the last one.

A key theoretical result is the
[Figure 4.12: the attention operator extended with masking and dropout: the exponentiated dot products of Q and K are multiplied component-wise by a mask M, normalized by their sum over k, passed through a dropout layer to form A, and multiplied by V to produce Y.]
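A minimal sketch of the attention operator of Equations 4.1 and 4.2, with the optional masking discussed below (PyTorch's softmax implements what the text calls softargmax; the dimensions are arbitrary):

import torch

def attention(Q, K, V, mask=None):
    # Q: (NQ, DQK), K: (NKV, DQK), V: (NKV, DV)
    A = Q @ K.T / K.shape[1] ** 0.5          # scaled dot products (Equation 4.1, before normalization)
    if mask is not None:
        A = A.masked_fill(mask == 0, float("-inf"))
    A = torch.softmax(A, dim=1)              # normalized attention scores
    return A @ V                             # values averaged with the scores (Equation 4.2)

Q, K, V = torch.randn(5, 16), torch.randn(7, 16), torch.randn(7, 32)
print(attention(Q, K, V).shape)              # torch.Size([5, 32])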
This can be implemented as

att(Q, K, V) = softargmax(QK^T / √D^QK) V,

where the softargmax factor is the attention matrix A.

This operator is usually extended in two ways, as depicted in Figure 4.12. First, the attention matrix can be masked by multiplying it before the softargmax normalization by a Boolean matrix M. This allows, for instance, to make the operator causal by taking M full of 1s below the diagonal and zero above, preventing Yq from depending on keys and values of indices k greater than q. Second, the attention matrix is processed by a dropout layer (see § 4.5) before being multiplied by V, providing the usual benefits during training.

Since a dot product is computed for every query/key pair, the computational cost of the attention operator is quadratic with the sequence length. This happens to be problematic, as some of the applications of these methods require processing sequences of tens of thousands of tokens, or more. Multiple attempts have been made at reducing this cost, for instance by combining a dense attention to a local window with a long-range sparse attention [Beltagy et al., 2020], or linearizing the operator to benefit from the associativity of the matrix product and compute the key-value product before

feature vector that depends on the position in the tensor. This positional encoding can be learned as other layer parameters, or defined analytically.

For instance, in the original Transformer model, for a series of vectors of dimension D, Vaswani et al. [2017] add an encoding of the sequence index as a series of sines and cosines at various frequencies:

pos-enc[t, d] = sin(t / T^(d/D)) if d ∈ 2ℕ, and cos(t / T^((d−1)/D)) otherwise,

with T = 10^4.
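A minimal sketch of this sinusoidal positional encoding (not the original implementation; the exponent 2⌊d/2⌋/D below covers both the even and odd cases of the formula):

import torch

def pos_enc(t_max, D, T=1e4):
    # pos_enc[t, d] = sin(t / T^(d/D)) for even d, cos(t / T^((d-1)/D)) for odd d.
    t = torch.arange(t_max).float().unsqueeze(1)       # (t_max, 1)
    d = torch.arange(D).float().unsqueeze(0)           # (1, D)
    angles = t / T ** (2 * torch.div(d, 2, rounding_mode="floor") / D)
    return torch.where(d % 2 == 0, torch.sin(angles), torch.cos(angles))

print(pos_enc(10, 16).shape)   # torch.Size([10, 16])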
multiplying with the queries [Katharopoulos et al., 2020].

Multi-head Attention Layer
along the feature dimension and each individual element of the resulting sequence is multiplied by W^O to get the final result:

Y = (Y1 | · · · | YH) W^O.

As we will see in § 5.3 and in Figure 5.6, this layer is used to build two model sub-structures: self-attention blocks, in which the three input sequences X^Q, X^K, and X^V are the same, and cross-attention blocks, where X^K and X^V are the same.

It is noteworthy that the attention operator, and consequently the multi-head attention layer when there is no masking, is invariant to a permutation of the keys and values, and equivariant to a permutation of the queries, as it would permute the resulting tensor similarly.

Figure 4.13: The Multi-head Attention layer applies for each of its h = 1, . . . , H heads a parametrized linear transformation to individual elements of the input sequences X^Q, X^K, X^V to get sequences Q, K, V that are processed by the attention operator to compute Yh. These H sequences are concatenated along features, and individual elements are passed through one last linear operator to get the final result sequence Y.

4.9 Token embedding

In many situations, we need to convert discrete tokens into vectors. This can be done with an embedding layer, which consists of a lookup table that directly maps integers to vectors.

Such a layer is defined by two hyper-parameters: the number N of possible token values, and the dimension D of the output vectors, and one trainable N × D weight matrix M.

Given as input an integer tensor X of dimension D1 × · · · × DK and values in {0, . . . , N − 1}, such a layer returns a real-valued tensor Y of dimension D1 × · · · × DK × D with
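A minimal sketch of such an embedding layer in PyTorch, with arbitrary values of N and D:

import torch
from torch import nn

# A lookup table mapping each of N possible token values to a trainable D-dimensional vector.
N, D = 1000, 64
embed = nn.Embedding(N, D)

X = torch.randint(N, (2, 7))    # an integer tensor of token indices, shape (2, 7)
Y = embed(X)
print(Y.shape)                  # torch.Size([2, 7, 64]), i.e. D1 x ... x DK x D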