Chapter 4. Transformer
Transformer models have greatly advanced NLP. They overcome RNNs’ limitations in managing
long-range dependencies and enable parallel processing of input sequences. There are three main
Transformer architectures: encoder-decoder, initially formulated for machine translation;
encoder-only, typically used for classification; and decoder-only, commonly found in chat LMs.
In this chapter, we’ll explore the decoder-only Transformer architecture in detail, as it is the most
widely used approach for training autoregressive language models.
The transformer architecture introduces two key innovations: self-attention and positional
encoding. Self-attention enables the model to assess how each word relates to all others during
prediction, while positional encoding captures word order and sequential patterns. Unlike RNNs,
transformers process all tokens simultaneously, using positional encoding to maintain sequential
context despite parallel processing. This chapter explores these fundamental elements in detail.
A decoder-only Transformer (referred to simply as “decoder” from here on) is made up of multiple
identical7 layers, known as decoder blocks, stacked vertically as shown below:
As you can see, training a decoder involves pairing each input sequence with a target sequence that
is identical to the input but shifted forward by one token. This approach mirrors the training
method used for RNN-based language models.
7 Decoder blocks share the same architecture but have distinct trainable parameters unique to each block.
The illustration simplifies certain aspects to avoid introducing too many new concepts at once.
We’ll introduce the missing details step by step.
Let’s take a closer look at what happens in a decoder block, starting with the first one:
The first decoder block processes input token embeddings. For this example, we use 6-dimensional
input and output embeddings, though in practice these dimensions grow larger with parameter
count and token vocabulary. The self-attention layer transforms each input embedding vector 𝐱𝑡 into a new vector 𝐠𝑡 for every token 𝑡, from 1 to 𝐿, where 𝐿 represents the input length.
After self-attention, the position-wise MLP processes each vector 𝐠𝑡 independently. Each decoder block has its own MLP with unique parameters, and within a block, this same MLP is applied to each position’s vector, taking one 𝐠𝑡 as input and producing one 𝐳𝑡 as output. Once the MLP has processed every position, the number of output vectors 𝐳𝑡 equals the number of input tokens 𝐱𝑡.
The output vectors 𝐳𝑡 then serve as inputs to the next decoder block. This process repeats through
each decoder block, preserving the same number of vectors as the input tokens 𝐱𝑡 .
4.2. Self-Attention
To see how self-attention works, let’s start with an intuitive comparison. Transforming 𝐠 𝑡 into 𝐳𝑡
is straightforward: a position-wise MLP takes an input vector and outputs a new vector by applying
a learned transformation. This is what feedforward networks are designed to do. However, self-
attention can seem more complex.
Consider a 5-token example: [“we,” “train,” “a,” “transformer,” “model”], and assume a decoder with
a maximum input sequence length of 4.
In each decoder block, the self-attention function relies on three tensors of trainable parameters:
𝐖𝑄 , 𝐖𝐾 , and 𝐖𝑉 . Here, 𝑄 stands for “query,” 𝐾 for “key,” and 𝑉 for “value.”
Let’s assume these tensors are 6 × 6. This means each of the four 6-dimensional input vectors will
be transformed into four 6-dimensional output vectors. Let’s use the second token, 𝐱2 , representing
the word “train,” as our illustrative example. To compute the output 𝐠 2 for 𝐱2 , the self-attention
layer works in six steps.
In the illustration, we combined the four input embeddings 𝐱1 , 𝐱2 , 𝐱3 , and 𝐱4 into a matrix 𝐗. Then,
we multiplied 𝐗 by the weight matrices 𝐖𝑄 , 𝐖𝐾 , and 𝐖𝑉 to create matrices 𝐐, 𝐊, and 𝐕. These
matrices hold 6-dimensional query, key, and value vectors, respectively. Since the process
generates the same number of query, key, and value vectors as input embeddings, each input
embedding 𝐱𝑡 corresponds to a query vector 𝐪𝑡 , a key vector 𝐤𝑡 , and a value vector 𝐯𝑡 .
Taking the second token 𝐱2 as our example, we compute attention scores by taking the dot product
of its query vector 𝐪2 with each key vector 𝐤𝑡 . Let’s assume the resulting scores are:
In vector format:
𝐬𝐜𝐨𝐫𝐞𝐬2 = [4.90, 17.15, 9.80, 12.25]⊤
We now divide each score by the square root of the key vector’s dimensionality to obtain the scaled scores. In our example, the key vector has a dimensionality of 6, so we divide each score by √6 ≈ 2.45, yielding:

𝐬𝐜𝐚𝐥𝐞𝐝_𝐬𝐜𝐨𝐫𝐞𝐬2 = [2.00, 7.00, 4.00, 5.00]⊤
We apply the causal mask to the scaled scores. (If the reason for using the causal mask isn’t clear
yet, it will be explained in detail soon.) For the second input position, the causal mask is:
𝐜𝐚𝐮𝐬𝐚𝐥_𝐦𝐚𝐬𝐤2 ≝ [0, 0, −∞, −∞]⊤
We add the scaled scores to the causal mask, resulting in the masked scores:

𝐦𝐚𝐬𝐤𝐞𝐝_𝐬𝐜𝐨𝐫𝐞𝐬2 = [2.00, 7.00, −∞, −∞]⊤
We apply the softmax function to the masked scores to produce the attention weights:
Since scores of −∞ become zero after applying the exponential function, the attention weights for
the third and fourth positions will be zero. The remaining two weights are calculated as:
𝐚𝐭𝐭𝐞𝐧𝐭𝐢𝐨𝐧_𝐰𝐞𝐢𝐠𝐡𝐭𝐬2 = [𝑒²/(𝑒² + 𝑒⁷), 𝑒⁷/(𝑒² + 𝑒⁷), 0, 0]⊤ ≈ [0.0067, 0.9933, 0, 0]⊤
Dividing attention scores by the square root of the key dimensionality helps prevent
the dot products from growing too large in magnitude as the dimensionality increases,
which could lead to extremely small gradients after applying softmax (due to very
large negative or positive values pushing the softmax outputs to 0 or 1).
We compute the output vector 𝐠 2 for the input embedding 𝐱2 by taking a weighted sum of the value
vectors 𝐯1, 𝐯2, 𝐯3, and 𝐯4 using the attention weights from the previous step:
𝐠2 ≈ 0.0067 ⋅ 𝐯1 + 0.9933 ⋅ 𝐯2 + 0 ⋅ 𝐯3 + 0 ⋅ 𝐯4
As you can see, the decoder’s output for position 2 depends only on (or, we can say “attends only
to”) the inputs at positions 1 and 2, with position 2 having a much stronger influence. This effect
comes from the causal mask, which restricts the model from attending to future positions when
generating an output for a given position. This property is essential for maintaining the
autoregressive nature of language models, ensuring that predictions for each position rely solely
on previous and current inputs, not future ones.
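To make this worked example concrete, the following short snippet (a sketch of mine, not code from the book’s notebook) reproduces the attention weights for position 2 from the masked scores above:

import torch

# Masked, scaled scores for position 2 from the worked example
masked_scores_2 = torch.tensor([2.0, 7.0, float("-inf"), float("-inf")])

# Softmax turns the masked scores into attention weights
attention_weights_2 = torch.softmax(masked_scores_2, dim=0)
print(attention_weights_2)  # tensor([0.0067, 0.9933, 0.0000, 0.0000])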
While in our example the token primarily attends to itself, this is not a universal
pattern. Attention distributions vary based on context, meaning, and token
relationships. A token may attend strongly to other tokens that provide relevant
semantic or syntactic information, depending on the sentence structure, learned
parameters, and specific transformer layer.
The vectors 𝐪𝑡 , 𝐤𝑡 , and 𝐯𝑡 can be interpreted as follows: each input position (token or embedding)
seeks information about other positions. For example, a token like “I” might look for a name in
another position, allowing the model to process “I” and the name in a similar way. To enable this,
each position 𝑡 is assigned a query 𝐪𝑡 .
The self-attention mechanism calculates a dot-product between 𝐪𝑡 and every key 𝐤𝑝 across all
positions 𝑝. A larger dot-product indicates greater similarity between the vectors. If position 𝑝’s
key 𝐤𝑝 aligns closely with position 𝑡’s query 𝐪𝑡 , then position 𝑝’s value 𝐯𝑝 contributes more
significantly to the final result.
The concept of attention emerged before the Transformer. In 2014, Dzmitry Bahdanau,
while studying under Yoshua Bengio, addressed a fundamental challenge in machine
translation: enabling an RNN to focus on the most relevant parts of a sentence. Drawing
from his own experience learning English—where he moved his focus between
different parts of the text—Bahdanau developed a mechanism8 for the RNN to “decide”
which input words were most important at each translation step. This mechanism,
which Bengio termed attention, became a cornerstone of modern neural networks.
8 Bahdanau et al., “Neural Machine Translation by Jointly Learning to Align and Translate,” arXiv preprint, 2014.
The process used to calculate 𝐠 2 is repeated for each position in the input sequence, resulting in a
set of output vectors: 𝐠1 , 𝐠 2 , 𝐠 3 , and 𝐠 4 . Each position has its own causal mask, so when calculating
𝐠1 , 𝐠 3 , and 𝐠 4 , a different causal mask is applied for each position. The full causal mask for all
positions is shown below:
𝐌 ≝ [  0   −∞  −∞  −∞
       0    0  −∞  −∞
       0    0   0  −∞
       0    0   0   0  ]
As you can see, the first token attends only to itself, the second to itself and the first, the third to
itself and the first two, and the last to itself and all preceding tokens.
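In code, this additive mask can be built in a single line. The following tiny sketch (my own, using PyTorch) constructs the 4 × 4 matrix 𝐌 shown above:

import torch

L = 4  # sequence length
# Entries above the main diagonal are set to -inf, all others to 0
M = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
print(M)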
The general formula for computing attention for all positions is:
𝐆 ≝ attention(𝐐, 𝐊, 𝐕) = softmax(𝐐𝐊⊤ / √𝑑𝑘 + 𝐌) 𝐕
Here, 𝐐 and 𝐕 are 𝐿 × 𝑑𝑘 query and value matrices. 𝐊⊤ is the 𝑑𝑘 × 𝐿 transposed key matrix. 𝑑𝑘 is
the dimensionality of the key, query, and value vectors, and 𝐿 is the sequence length.
While we computed the attention scores explicitly for 𝐱2 earlier, the matrix multiplication 𝐐𝐊⊤
calculates the scores for all positions at once. This method makes the process much faster.
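The formula translates almost directly into code. Here is a minimal sketch (my own illustration, assuming 𝐐, 𝐊, and 𝐕 have already been computed and reusing the additive mask 𝐌 from the earlier snippet), not the book’s implementation:

import math
import torch

def attention(Q, K, V, M):
    # Q, K, V: (L, d_k) query, key, and value matrices; M: (L, L) additive causal mask
    d_k = K.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # (L, L) scaled attention scores
    weights = torch.softmax(scores + M, dim=-1)        # mask, then softmax row by row
    return weights @ V                                 # (L, d_k) output matrix G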
After self-attention, the position-wise MLP transforms each vector 𝐠𝑡 as follows:

𝐳𝑡 = 𝐖2 (ReLU(𝐖1 𝐠𝑡 + 𝐛1)) + 𝐛2
Here, 𝐖1, 𝐖2, 𝐛1 , and 𝐛2 are learned parameters. The resulting vector 𝐳𝑡 is then either passed to
the next decoder block or, if it’s the final decoder block, used to generate the output vector.
This component is a position-wise multilayer perceptron, which is why I use that term.
The literature may refer to it as a feedforward network, dense layer, or fully connected
layer, but these names can be misleading. The entire Transformer is a feedforward
neural network. Additionally, dense or fully connected layers typically incorporate one
weight matrix, one bias vector, and an output non-linearity. The position-wise MLP in
a Transformer, however, utilizes two weight matrices, two bias vectors, and omits an
output non-linearity.
To handle word order, Transformers need to incorporate positional information. A widely used
method for this is rotary position embedding (RoPE), which applies position-dependent
rotations to the query and key vectors in the attention mechanism. One key benefit of RoPE is its
ability to generalize effectively to sequences longer than those seen during training. This allows
models to be trained on shorter sequences—saving time and computational resources—while still
supporting much longer contexts at inference.
RoPE encodes positional information by rotating the query and key vectors. This rotation occurs
before the attention computation. Here’s a simple illustration of how it works in 2D:
The black arrow labeled “Original” shows a position-less key or query vector in self-attention. RoPE
embeds positional information by rotating this vector according to the token’s position. 9 The
colored arrows show the resulting rotated vectors for positions 1, 3, 5, and 7.
9In practice, RoPE operates by rotating pairs of adjacent dimensions within query and key vectors, rather than
rotating the entire vectors themselves, as we will explore shortly.
A key property of RoPE is that the angle between any two rotated vectors encodes the distance
between their positions in the sequence. For example, the angle between positions 1 and 3 is the
same as the angle between positions 5 and 7, since both pairs are two positions apart.
So, how do we rotate vectors? We use matrix multiplication. Rotation matrices are widely used in fields like computer graphics to rotate 3D scenes, one of the original purposes of GPUs (the “G” in GPU stands for graphics), before they were applied to neural network training. In two dimensions, the rotation matrix for an angle 𝜃 is:

𝐑𝜃 = [ cos(𝜃)  −sin(𝜃)
       sin(𝜃)   cos(𝜃) ]
Let’s rotate the two-dimensional vector 𝐪 = [2,1]⊤ . To do this, we multiply 𝐪 by the rotation matrix
𝐑𝜃 . The result is a new vector, representing 𝐪 rotated counterclockwise by an angle 𝜃.
When 𝜃 = 45∘ (or 𝜋/4 radians, since trigonometric functions in PyTorch use radians), we know that cos(𝜃) = sin(𝜃) = √2/2. Substituting these values, the rotation matrix becomes:

𝐑45∘ = [ √2/2  −√2/2
         √2/2   √2/2 ]
Now, multiplying 𝐑45∘ by 𝐪 = [2, 1]⊤ produces the rotated vector 𝐪rotated:

𝐪rotated = 𝐑45∘ ⋅ 𝐪 = [√2/2 ⋅ 2 − √2/2 ⋅ 1, √2/2 ⋅ 2 + √2/2 ⋅ 1]⊤ = [√2/2 ⋅ (2 − 1), √2/2 ⋅ (2 + 1)]⊤ = [√2/2, 3√2/2]⊤
The figure below illustrates 𝐪 and its rotated version for 𝜃 = 45∘ .
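As a quick check, the same rotation can be reproduced in a few lines of PyTorch (a sketch of mine, not code from the book):

import math
import torch

theta = math.pi / 4  # 45 degrees in radians
R = torch.tensor([
    [math.cos(theta), -math.sin(theta)],
    [math.sin(theta),  math.cos(theta)],
])
q = torch.tensor([2.0, 1.0])
q_rotated = R @ q
print(q_rotated)  # tensor([0.7071, 2.1213]), i.e., [√2/2, 3√2/2]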
For a position 𝑡, RoPE rotates each pair of dimensions in the query and key vectors defined as:
𝐪𝑡 = [𝑞𝑡^(1), 𝑞𝑡^(2), …, 𝑞𝑡^(𝑑𝑞−1), 𝑞𝑡^(𝑑𝑞)]⊤
𝐤𝑡 = [𝑘𝑡^(1), 𝑘𝑡^(2), …, 𝑘𝑡^(𝑑𝑘−1), 𝑘𝑡^(𝑑𝑘)]⊤
Here, 𝑑𝑞 and 𝑑𝑘 are the (even) dimensionality of the query and key vectors. RoPE rotates pairs of
dimensions indexed as (2𝑝 − 1, 2𝑝), where 𝑝 ranges from 1 to 𝑑𝑞 /2 and represents each pair’s
index.
When we write 𝐪𝑡(𝑝), it represents the pair [𝑞𝑡^(2𝑝−1), 𝑞𝑡^(2𝑝)]. For example, 𝐪𝑡(3) corresponds to:

[𝑞𝑡^(2⋅3−1), 𝑞𝑡^(2⋅3)] = [𝑞𝑡^(5), 𝑞𝑡^(6)]
Each pair 𝑝 undergoes a rotation based on the token position 𝑡 and a rotation frequency 𝜃𝑝:

RoPE(𝐪𝑡(𝑝)) ≝ 𝐑𝜃𝑝𝑡 ⋅ 𝐪𝑡(𝑝)

Applying the matrix-vector multiplication rule, the rotation results in the following 2D vector:

RoPE(𝐪𝑡(𝑝)) = [𝑞𝑡^(2𝑝−1) cos(𝜃𝑝𝑡) − 𝑞𝑡^(2𝑝) sin(𝜃𝑝𝑡), 𝑞𝑡^(2𝑝−1) sin(𝜃𝑝𝑡) + 𝑞𝑡^(2𝑝) cos(𝜃𝑝𝑡)]⊤,
where 𝜃𝑝 is the rotation frequency for the 𝑝th pair. It is defined as:
𝜃𝑝 = 1 / 𝛩^(2(𝑝−1)/𝑑𝑞)
Here, 𝛩 is a constant. It was initially set to 10,000, but later experiments demonstrated that higher values of 𝛩, such as 500,000 (used in the Llama 2 and 3 series of models) or 1,000,000 (in the Qwen 2 and 2.5 series), enable support for larger context sizes (hundreds of thousands of tokens).
The full rotated embedding RoPE(𝐪𝑡 ) is constructed by concatenating all the rotated pairs:
RoPE(𝐪𝑡) ≝ concat[RoPE(𝐪𝑡(1)), RoPE(𝐪𝑡(2)), …, RoPE(𝐪𝑡(𝑑𝑞/2))]
Note how the rotation frequency 𝜃𝑝 decreases quickly for each subsequent pair because of the
exponential term in the denominator. This enables RoPE to capture fine-grained local position
information in the early dimensions, where rotations are more frequent, and coarse-grained global
position information in the later dimensions, where rotations slow down. This combination creates
richer positional encoding, allowing the model to differentiate token positions in a sequence more
effectively than using a single rotation frequency across all dimensions.
To illustrate the process, consider a 6-dimensional query vector at position 𝑡 and 𝛩 = 10,000:
𝐪𝑡 = [𝑞𝑡^(1), 𝑞𝑡^(2), 𝑞𝑡^(3), 𝑞𝑡^(4), 𝑞𝑡^(5), 𝑞𝑡^(6)]⊤ ≝ [0.8, 0.6, 0.7, 0.3, 0.5, 0.4]⊤

𝜃𝑝 = 1 / 10000^(2(𝑝−1)/𝑑𝑞)
Let the position 𝑡 be 100. First, we calculate the rotation angles for each pair (in radians):
𝜃1 = 1 / 10000^(2(1−1)/6) = 1 / 10000^(0/6) = 1.0000, therefore: 𝜃1𝑡 = 100.00
𝜃2 = 1 / 10000^(2(2−1)/6) = 1 / 10000^(2/6) ≈ 0.0464, therefore: 𝜃2𝑡 = 4.64
𝜃3 = 1 / 10000^(2(3−1)/6) = 1 / 10000^(4/6) ≈ 0.0022, therefore: 𝜃3𝑡 = 0.22
The rotated pair 1 is:

RoPE(𝐪𝑡(1)) = [0.8 cos(100) − 0.6 sin(100), 0.8 sin(100) + 0.6 cos(100)]⊤ ≈ [0.99, 0.11]⊤

This is what the original and rotated pairs look like when plotted:
The math for RoPE(𝐤𝑡 ) is the same as for RoPE(𝐪𝑡 ). In each decoder block, RoPE is applied to each
row of the query (𝐐) and key (𝐊) matrices within the self-attention mechanism.
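The chapter skips the rope implementation, so here is a minimal sketch of what such a function could look like (my own illustration, assuming an even last dimension and positions numbered 1 to 𝐿, to match the formulas above):

import torch

def rope(x, Theta=10000.0):
    # x: (..., seq_len, dim) query or key tensor; dim must be even
    seq_len, dim = x.shape[-2], x.shape[-1]
    pair = torch.arange(1, dim // 2 + 1, dtype=torch.float32)  # pair index p = 1 .. dim/2
    theta = 1.0 / Theta ** (2 * (pair - 1) / dim)              # rotation frequencies
    t = torch.arange(1, seq_len + 1, dtype=torch.float32)      # positions 1 .. L
    angles = torch.outer(t, theta)                             # (seq_len, dim/2) rotation angles
    cos, sin = torch.cos(angles), torch.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]                        # first and second element of each pair
    rotated = torch.stack(
        (x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1
    ).flatten(-2)                                              # interleave the rotated pairs back
    return rotated

A function of this shape is what the rope call in the AttentionHead class below is assumed to refer to.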
Value vectors only provide the information that is selected and combined after the
attention weights are determined. Since the positional relationships are already
captured in the query-key alignment, value vectors don’t need their own rotary
embeddings. In other words, the value vectors simply “deliver” the content once the
positional-aware attention has identified where to look.
Recall that 𝐐 and 𝐊 are generated by multiplying the decoder block inputs by weight matrices 𝐖𝑄
and 𝐖𝐾 , as illustrated in Figure 4.1. RoPE is applied immediately after obtaining 𝐐 and 𝐊, and
before the attention scores are calculated.
Applying RoPE in all decoder blocks helps retain positional information throughout the network’s
depth. The illustration below shows two decoder blocks and where RoPE is applied:
In this illustration, the outputs of the second decoder block are used to compute logits for each
position. This is achieved by multiplying the outputs of the final decoder block by a matrix of shape
(embedding dimensionality, vocabulary size), which is shared across all positions. We will
implement the decoder model in Python soon, where this detail will become clearer.
The self-attention mechanism we’ve described would work as is. However, transformers typically
employ an enhanced version called multi-head attention. This allows the model to focus on
multiple aspects of information simultaneously. For example, one attention head might capture
syntactic relationships, another might emphasize semantic similarities, and a third could detect
long-range dependencies between tokens.
In multi-head attention, each head ℎ (for ℎ from 1 to 𝐻) has its own triplet of weight matrices 𝐖𝑄,ℎ, 𝐖𝐾,ℎ, and 𝐖𝑉,ℎ. Each triplet is applied to the input vectors 𝐱1, …, 𝐱4, producing 𝐻 matrices 𝐆ℎ. For each head ℎ, this gives four vectors 𝐠ℎ,1, …, 𝐠ℎ,4, as shown in Figure 4.2 for three heads (𝐻 = 3):
As you can see, the multi-head self-attention mechanism processes an input sequence through
multiple self-attention “heads.” For instance, with 3 heads, each head calculates self-attention
scores for the input tokens independently. RoPE is applied separately in each head.
All input tokens 𝐱1 , … , 𝐱4 are processed by all three heads, producing output matrices 𝐆1 , 𝐆2 , and
𝐆3 . Each matrix 𝐆ℎ has as many rows as there are input tokens, meaning each head generates an
embedding for every token. The embedding dimensionality of each 𝐆ℎ is reduced to one-third of
the total embedding dimensionality. As a result, each head outputs lower-dimensional embeddings
compared to the original embedding size.
The outputs from the three heads are concatenated along the embedding dimension in the
concatenation and projection layer, creating a single matrix that integrates information from all
heads. This matrix is then transformed by the projection matrix 𝐖𝑂 , resulting in the final output
matrix 𝐆. This output is passed to the position-wise MLP:
Concatenating the matrices 𝐆1 , 𝐆2 , and 𝐆3 restores the original embedding dimensionality (e.g., 6
in this case). However, applying the trainable parameter matrix 𝐖𝑂 enables the model to combine
the heads’ information more effectively than mere concatenation.
The original Transformer paper tested various numbers of attention heads and found
the optimal range to be around 8–16 heads. In contrast, modern large language models
often use 16 to 64 heads.
At this stage, the reader understands the Transformer model architecture at a high level. Two key
technical details remain to explore: layer normalization and residual connections, both essential
components that enable the Transformer’s effectiveness. Let’s begin with residual connections.
A network containing more than two layers is called a deep neural network. The process of
training these models is known as deep learning. Prior to the development of ReLU and residual
connections, training deep networks posed significant challenges due to the vanishing gradient
problem. Let’s examine this phenomenon.
Remember that in the gradient descent algorithm, we calculate partial derivatives for all
parameters and update them by taking a small step in the direction opposite to the gradient. As
networks grow deeper, this step becomes progressively smaller in the earlier layers (those closer
to the input), resulting in minimal (vanishing) parameter updates in these layers. Residual
connections strengthen these updates by creating pathways for the gradient to “bypass” certain
layers, hence the term skip connections.
To understand the vanishing gradient problem more thoroughly, let’s analyze a 3-layer neural network expressed as a composite function 𝑓(𝑥) = 𝑓3(𝑓2(𝑓1(𝑥))),
where 𝑓1 represents the first layer, 𝑓2 represents the second layer, and 𝑓3 represents the third
(output) layer. Let these functions be defined as follows:
𝑧 ← 𝑓1(𝑥) ≝ 𝑤1𝑥 + 𝑏1
𝑟 ← 𝑓2(𝑧) ≝ 𝑤2𝑧 + 𝑏2
𝑦 ← 𝑓3(𝑟) ≝ 𝑤3𝑟 + 𝑏3
Here, 𝑤𝑙 and 𝑏𝑙 are scalar weights and biases for each layer 𝑙 ∈ {1,2,3} and the notation 𝑧 ← 𝑓1 (𝑥)
means 𝑓1 (𝑥) takes 𝑥 as input and returns 𝑧.
Let’s define the loss function 𝐿 in terms of the network output 𝑓(𝑥) and the true label 𝑦 as 𝐿(𝑓(𝑥), 𝑦). The gradient of the loss 𝐿 with respect to 𝑤1, denoted as ∂𝐿/∂𝑤1, is given by the chain rule:

∂𝐿/∂𝑤1 = ∂𝐿/∂𝑓3 ⋅ ∂𝑓3/∂𝑓2 ⋅ ∂𝑓2/∂𝑓1 ⋅ ∂𝑓1/∂𝑤1

where:

∂𝑓3/∂𝑓2 = 𝑤3,   ∂𝑓2/∂𝑓1 = 𝑤2,   ∂𝑓1/∂𝑤1 = 𝑥

Substituting these values gives:

∂𝐿/∂𝑤1 = ∂𝐿/∂𝑓3 ⋅ 𝑤3 ⋅ 𝑤2 ⋅ 𝑥
The vanishing gradient problem occurs when weights like 𝑤2 and 𝑤3 are small (less than 1). When
multiplied together, they produce even smaller values, causing the gradient for earlier weights such
as 𝑤1 to approach zero. This issue becomes particularly severe in deep networks with many layers.
Take large language models as an example. These networks often include 32 or more decoder
blocks. To simplify, assume all blocks are fully connected layers. If the average weight value is
around 0.5, the gradient for the input layer parameters becomes 0.5³² ≈ 0.0000000002. This is
extremely small. After multiplying by the learning rate, updates to the early layers are negligible.
As a result, the network stops learning effectively.
Residual connections offer a solution to the vanishing gradient problem by creating shortcuts in
the gradient computation path. The basic idea is simple: instead of passing only the output of a layer
to the next one, the layer’s input is added to its output. Mathematically, this is written as:
𝑦 = 𝑓(𝑥) + 𝑥,
where 𝑥 is the input, 𝑓(𝑥) is the layer’s computed function, and 𝑦 is the output. This addition forms
the residual connection. Graphically, it looks like this:
In this illustration, the input 𝑥 is processed both through the layer (represented as 𝑓(𝑥)) and added
directly to the layer’s output.
Now let’s introduce residual connections into our 3-layer network, where inputs and outputs are
scalars. We’ll see how this changes gradient computation and mitigates the vanishing gradient
issue. Starting with the original network 𝑓(𝑥) = 𝑓3 (𝑓2 (𝑓1 (𝑥))), let’s add residual connections to
layers 2 and 3:
𝑧 ← 𝑓1(𝑥) ≝ 𝑤1𝑥 + 𝑏1
𝑟 ← 𝑓2(𝑧) ≝ 𝑤2𝑧 + 𝑏2 + 𝑧
𝑦 ← 𝑓3(𝑟) ≝ 𝑤3𝑟 + 𝑏3 + 𝑟
The gradient of the network output with respect to 𝑤1 now becomes:

∂𝑓/∂𝑤1 = ∂/∂𝑤1 [(𝑤3(𝑤2(𝑤1𝑥 + 𝑏1) + 𝑏2 + (𝑤1𝑥 + 𝑏1)) + 𝑏3) + (𝑤2(𝑤1𝑥 + 𝑏1) + 𝑏2 + (𝑤1𝑥 + 𝑏1))]
        = (𝑤3𝑤2 + 𝑤3 + 𝑤2 + 1) ⋅ 𝑥
We observe that residual connections introduce three additional terms: 𝑤3 , 𝑤2 , and 1. This
guarantees that the gradient will not vanish completely, even when 𝑤2 and 𝑤3 are small, due to the
added constant term 1.
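The effect is easy to verify numerically with autograd. The sketch below (my own, with arbitrary weights of 0.5) compares ∂𝑦/∂𝑤1 for the plain 3-layer network and for the version with residual connections:

import torch

w1, w2, w3 = (torch.tensor(v, requires_grad=True) for v in (0.5, 0.5, 0.5))
b1 = b2 = b3 = 0.0
x = torch.tensor(1.0)

# Plain 3-layer network: y = f3(f2(f1(x)))
z = w1 * x + b1
r = w2 * z + b2
y = w3 * r + b3
y.backward()
print(w1.grad)  # w3 * w2 * x = 0.25

w1.grad = None  # reset the gradient before the second pass

# The same network with residual connections on layers 2 and 3
z = w1 * x + b1
r = w2 * z + b2 + z
y = w3 * r + b3 + r
y.backward()
print(w1.grad)  # (w3*w2 + w3 + w2 + 1) * x = 2.25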
As shown, each decoder block includes two residual connections. The layers are now named like
Python objects, which we will implement shortly. Additionally, two RMSNorm layers have been
added. Let’s discuss their purpose.
Suppose we have a vector 𝐱 = [𝑥^(1), 𝑥^(2), 𝑥^(3)]⊤. To apply RMS normalization, we first calculate the root mean square (RMS) of the vector:

RMS(𝐱) = √((1/3) ∑ᵢ₌₁³ (𝑥^(𝑖))²) = √((1/3)[(𝑥^(1))² + (𝑥^(2))² + (𝑥^(3))²])

Then, we normalize the vector by dividing each component by the RMS value to obtain 𝐱̃:

𝐱̃ = 𝐱 / RMS(𝐱) = [𝑥^(1)/RMS(𝐱), 𝑥^(2)/RMS(𝐱), 𝑥^(3)/RMS(𝐱)]⊤
Finally, the normalized vector is scaled element-wise by a vector 𝛄:

𝐲 = 𝛄 ⊙ 𝐱̃,

where ⊙ denotes the element-wise product. The vector 𝛄 is a trainable parameter, and each RMSNorm layer has its own independent 𝛄.
The primary purpose of RMSNorm is to stabilize training by keeping the scale of the input to each
layer consistent. This improves numerical stability, helping to prevent gradient updates that are
excessively large or small.
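The RMSNorm implementation is skipped later in the chapter, so here is a minimal sketch consistent with the description above (my own illustration; the small eps constant added for numerical stability is an assumption):

import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, emb_dim, eps=1e-8):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(emb_dim))  # trainable scale, one value per dimension
        self.eps = eps  # small constant to avoid division by zero (an assumption)

    def forward(self, x):
        # Root mean square over the last (embedding) dimension
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.gamma * (x / rms)  # element-wise scaling by gamma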
Now that we’ve covered the key components of the Transformer architecture, one more practical detail deserves attention: key-value caching, which speeds up autoregressive text generation. When generating text token by token, a naive implementation would, for each new token:
1. Recalculate the key and value matrices for all previous tokens.
2. Merge these with the new token’s key and value vectors to compute self-attention.
Key-value caching prevents this redundant recomputation. Since 𝐖𝐾 and 𝐖𝑉 stay fixed after training, the key and value vectors for earlier tokens remain unchanged during decoding. This allows us to store (“cache”) these vectors after computing them once. For each new token, we compute only its own query, key, and value vectors, append the new key and value vectors to the cache, and compute attention between the new token’s query and all cached keys and values.
This approach ensures that the rest of the sequence does not need to be reprocessed, which reduces
computation significantly—especially for long sequences. For each decoder block, the cached keys
and values are stored per attention head, with shapes (𝐿 × 𝑑ℎ ) for both matrices (where 𝐿 increases
by one with each new token and 𝑑ℎ is the dimensionality of the query, key, and value vectors for
this attention head). Consequently, for a model with 𝐻 attention heads, the combined key and value
caches in each decoder block have shapes (𝐻 × 𝐿 × 𝑑ℎ ).
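To make the bookkeeping concrete, here is a simplified sketch (my own, not the book’s code) of how a single attention head could maintain its cache during generation; k_new and v_new stand for the key and value vectors of the newly generated token:

import torch

def update_kv_cache(cache, k_new, v_new):
    # cache: dict with "K" and "V" tensors of shape (L, d_h), or None before the first token
    # k_new, v_new: (1, d_h) key and value vectors of the new token
    if cache is None:
        return {"K": k_new, "V": v_new}
    return {
        "K": torch.cat([cache["K"], k_new], dim=0),  # now (L + 1, d_h)
        "V": torch.cat([cache["V"], v_new], dim=0),
    }

The new token’s query then attends over cache["K"] and cache["V"] without recomputing anything for earlier positions; with 𝐻 heads, a model keeps 𝐻 such caches in every decoder block.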
Now that we understand how the Transformer operates, we’re ready to start coding.
class AttentionHead(nn.Module):
    def __init__(self, emb_dim, d_h):
        super().__init__()
        self.W_Q = nn.Parameter(torch.empty(emb_dim, d_h))
        self.W_K = nn.Parameter(torch.empty(emb_dim, d_h))
        self.W_V = nn.Parameter(torch.empty(emb_dim, d_h))
        self.d_h = d_h

    def forward(self, x, mask):
        Q = x @ self.W_Q ➊
        K, V = x @ self.W_K, x @ self.W_V ➋
        Q, K = rope(Q), rope(K) ➌
        scores = Q @ K.transpose(-2, -1) / self.d_h ** 0.5 ➍
        scores = scores.masked_fill(mask == 0, float("-inf")) ➎
        weights = torch.softmax(scores, dim=-1) ➏
        return weights @ V ➐
This class implements a single attention head in the multi-head attention mechanism. In the
constructor, we initialize three trainable weight matrices: the query matrix W_Q, the key matrix W_K,
and the value matrix W_V. Each of these is a Parameter tensor of shape (emb_dim, d_h), where
emb_dim is the input embedding dimension and d_h is the dimensionality of the query, key, and
value vectors for this attention head.
• Lines ➊ and ➋ compute the query, key, and value matrices by multiplying the input
vector x with the respective weight matrices. Given that x has shape (batch_size,
seq_len, emb_dim), Q, K, and V each have shape (batch_size, seq_len, d_h).
• Line ➌ applies the rotary positional encoding to Q and K. After the query and key vectors
are rotated, line ➍ computes the attention scores. Here’s a breakdown:
When the matrix multiplication operator @ is applied to tensors with more than two
dimensions, PyTorch uses broadcasting. This technique handles dimensions that
aren’t directly compatible with the @ operator, which is normally defined only for two-
dimensional tensors (matrices). In this case, PyTorch treats the first dimension as the
batch dimension, performing the matrix multiplication separately for each example in
the batch. This process is known as batch matrix multiplication.
• Line ➎ applies the causal mask. The mask tensor has the shape (seq_len, seq_len) and
contains 0s and 1s. The masked_fill function replaces all cells in the input matrix
where mask == 0 with negative infinity. This prevents attention to future tokens. Since
the mask lacks the batch dimension while scores includes it, PyTorch uses broadcasting
to apply the mask to the scores of each sequence in the batch.
• Line ➏ applies softmax to the scores along the last dimension, turning them into
attention weights. Then, line ➐ computes the output by multiplying these attention
weights with V. The resulting output has the shape (batch_size, seq_len, d_h).
Given the attention head class, we can now define the MultiHeadAttention class:
class MultiHeadAttention(nn.Module):
    def __init__(self, emb_dim, num_heads):
        super().__init__()
        d_h = emb_dim // num_heads ➊
        self.heads = nn.ModuleList([
            AttentionHead(emb_dim, d_h)
            for _ in range(num_heads)
        ]) ➋
        self.W_O = nn.Parameter(torch.empty(emb_dim, emb_dim)) ➌

    def forward(self, x, mask):
        head_outputs = [head(x, mask) for head in self.heads] ➍
        x = torch.cat(head_outputs, dim=-1) ➎
        return x @ self.W_O ➏
In the constructor:
• Line ➊ calculates d_h, the dimensionality of each attention head, by dividing the model’s
embedding dimensionality emb_dim by the number of heads.
• Line ➋ creates a ModuleList containing num_heads instances of AttentionHead. Each
head takes the input dimensionality emb_dim and outputs a vector of size d_h.
• Line ➌ initializes W_O, a learnable projection matrix with shape (emb_dim, emb_dim) to
combine the outputs from all attention heads.
• Line ➍ applies each attention head to the input x of shape (batch_size, seq_len,
emb_dim). Each head’s output has shape (batch_size, seq_len, d_h).
• Line ➎ concatenates all heads’ outputs along the last dimension. The resulting x has
shape (batch_size, seq_len, emb_dim) since num_heads * d_h = emb_dim.
• Line ➏ multiplies the concatenated output by the projection matrix W_O. The output has
the same shape as input.
Now that we have multi-head attention, the last piece needed for the decoder block is the position-
wise multilayer perceptron. Let’s define it:
class MLP(nn.Module):
    def __init__(self, emb_dim):
        super().__init__()
        self.W_1 = nn.Parameter(torch.empty(emb_dim, emb_dim * 4))
        self.B_1 = nn.Parameter(torch.empty(emb_dim * 4))
        self.W_2 = nn.Parameter(torch.empty(emb_dim * 4, emb_dim))
        self.B_2 = nn.Parameter(torch.empty(emb_dim))

    def forward(self, x):
        x = x @ self.W_1 + self.B_1 ➊
        x = torch.relu(x) ➋
        return x @ self.W_2 + self.B_2 ➌
• Line ➊ multiplies the input x by the weight matrix W_1 and adds the bias vector B_1. The
input has shape (batch_size, seq_len, emb_dim), so the result has shape
(batch_size, seq_len, emb_dim * 4).
• Line ➋ applies the ReLU activation function element-wise, adding non-linearity.
• Line ➌ multiplies the result by the second weight matrix W_2 and adds the bias vector
B_2, reducing the dimensionality back to (batch_size, seq_len, emb_dim).
The first linear transformation expands to 4 times the embedding dimensionality (emb_dim * 4)
to provide the network with greater capacity for learning complex patterns and relationships
between variables. The 4x factor balances expressiveness and efficiency.
After expanding the dimensionality, it’s compressed back to the original embedding dimensionality
(emb_dim). This ensures compatibility with residual connections, which require matching
dimensionalities. Empirical results support this expand-and-compress approach as an effective
trade-off between computational cost and performance.
With all components defined, we’re ready to set up the complete decoder block:
class DecoderBlock(nn.Module):
    def __init__(self, emb_dim, num_heads):
        super().__init__()
        self.norm1 = RMSNorm(emb_dim)
        self.attn = MultiHeadAttention(emb_dim, num_heads)
        self.norm2 = RMSNorm(emb_dim)
        self.mlp = MLP(emb_dim)

    def forward(self, x, mask):
        attn_out = self.attn(self.norm1(x), mask) ➊
        x = x + attn_out ➋
        mlp_out = self.mlp(self.norm2(x)) ➌
        return x + mlp_out ➍
The DecoderBlock class represents a single decoder block in a Transformer model. In the
constructor, we set up the necessary layers: two RMSNorm layers, a MultiHeadAttention instance
(configured with the embedding dimensionality and number of heads), and an MLP layer.
• Line ➊ applies RMSNorm to the input x, which has shape (batch_size, seq_len,
emb_dim). The output of RMSNorm keeps this shape. This normalized tensor is then
passed to the multi-head attention layer, which outputs a tensor of the same shape.
• Line ➋ adds a residual connection by combining the attention output attn_out with
the original input x. The shape doesn’t change.
• Line ➌ applies the second RMSNorm to the result from the residual connection, retaining
the same shape. This normalized tensor is then passed through the MLP, which outputs
another tensor with shape (batch_size, seq_len, emb_dim).
• Line ➍ adds a second residual connection, combining mlp_out with its unnormalized
input. The decoder block’s final output shape is (batch_size, seq_len, emb_dim), ready
for the next decoder block or the final output layer.
With the decoder block defined, we can now build the decoder transformer language model by
stacking multiple decoder blocks sequentially:
class DecoderLanguageModel(nn.Module):
    def __init__(
        self, vocab_size, emb_dim,
        num_heads, num_blocks, pad_idx
    ):
        super().__init__()
        self.embedding = nn.Embedding(
            vocab_size, emb_dim,
            padding_idx=pad_idx
        ) ➊
        self.layers = nn.ModuleList([
            DecoderBlock(emb_dim, num_heads) for _ in range(num_blocks)
        ]) ➋
        self.output = nn.Parameter(torch.rand(emb_dim, vocab_size)) ➌

    def forward(self, x):
        x = self.embedding(x) ➍
        seq_len = x.size(1)
        mask = torch.tril(
            torch.ones(seq_len, seq_len, device=x.device)
        ) ➎
        for layer in self.layers: ➏
            x = layer(x, mask)
        return x @ self.output ➐
• Line ➊ creates an embedding layer that converts input token indices to dense vectors.
The padding_idx specifies the ID of the padding token, ensuring that padding tokens
are mapped to zero vectors.
• Line ➋ creates a ModuleList with num_blocks DecoderBlock instances, forming the
stack of decoder layers.
• Line ➌ defines a matrix to project the last decoder block’s output to logits over the
vocabulary, enabling next token prediction.
• Line ➍ converts the input token indices to embeddings. The input tensor x has shape
(batch_size, seq_len); the output has shape (batch_size, seq_len, emb_dim).
• Line ➎ creates the causal mask.
• Line ➏ applies each decoder block to the input tensor x with shape (batch_size,
seq_len, emb_dim), producing an output tensor of the same shape. Each block refines
the sequence and passes it to the next until the final block.
• Line ➐ projects the output of the final decoder block to vocabulary-sized logits by
multiplying it with the self.output matrix, which has shape (emb_dim, vocab_size).
After this batched matrix multiplication, the final output has shape (batch_size,
seq_len, vocab_size), providing scores for each token in the vocabulary at each
position in the input sequence. This output can then be used to generate the model’s
predictions as we will discuss in the next chapter.
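As a quick sanity check, the model can be instantiated and run on a dummy batch as follows (a sketch with arbitrary small hyperparameters, assuming rope and RMSNorm are defined as sketched earlier; in the notebook the parameters are also properly initialized before training):

model = DecoderLanguageModel(
    vocab_size=1000, emb_dim=48,
    num_heads=3, num_blocks=2, pad_idx=0
)
tokens = torch.randint(1, 1000, (2, 5))  # batch of 2 sequences, 5 token IDs each
logits = model(tokens)
print(logits.shape)  # torch.Size([2, 5, 1000])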
The training loop for DecoderLanguageModel is the same as for the RNN (Section 3.6), so it is not
repeated here for brevity. Implementations of RMSNorm and RoPE are also skipped. Training data is
prepared just like for the RNN: the target sequence is offset by one position relative to the input
sequence, as described in Section 3.7. The complete code for training the decoder language model
is available in the thelmbook.com/nb/4.1 notebook.
Let’s look at some continuations of the prompt “The President” generated by the decoder model at
later training steps:
The President has been in the process of a new deal to make a decision on the issue .
The President 's office said the government had `` no intention of making any mistakes '' .
The President of the United States has been a key figure for the first time in the past ## years .
The “#” characters in the training data represent individual digits. For example, “##” likely stands for a two-digit number of years.
If you’ve made it this far, well done! You now understand the mechanics of language models. But
understanding the mechanics alone won’t help you fully appreciate what modern language models
are capable of. To truly understand, you need to work with one.
In the next chapter, we’ll explore large language models (LLMs). We’ll discuss why they’re called
large and what’s so special about the size. Then, we’ll cover how to finetune an existing LLM for
practical tasks like question answering and document classification, as well as how to use LLMs to
address a variety of real-world problems.