
Chapter 2

Generative Models
One of the most signicant advances in NLP in recent years might be the development of large
language models (LLMs). This has helped create systems that can understand and generate nat-
ural languages like humans. These systems have even been found to be able to reason, which
is considered a very challenging AI problem. With these achievements, NLP made big strides
and entered a new era of research in which difcult problems are being solved, such as building
conversational systems that can communicate with humans smoothly.
The concept of language modeling, or probabilistic language modeling, dates back to early experiments
conducted by Shannon [1951]. In his work, a language model was designed to estimate the predictability
of English: how well the next letter of a text can be predicted when the preceding N letters are known.
Although Shannon's experiments were preliminary, the fundamental goals and methods of language modeling
have remained largely unchanged over the decades since then. For quite a long period, particularly
before 2010, the dominant approach to language modeling was the n-gram approach [Jurafsky and Martin, 2008].
In n-gram language modeling, we estimate the probability of a word given its preceding n − 1 words, and
thus the probability of a sequence can be approximated by the product of a series of n-gram probabilities.
These probabilities are typically estimated from smoothed relative counts of n-grams in text. Although
simple and straightforward, this approach has been used extensively in NLP. For example, the success of
modern statistical speech recognition and machine translation systems has largely depended on the use of
n-gram language models [Jelinek, 1998; Koehn, 2010].
Applying neural networks to language modeling has long been attractive, but a real break-
through appeared as deep learning techniques advanced. A widely cited study is Bengio et al.
[2003]’s work where n-gram probabilities are modeled via a feed-forward network and learned
by training the network in an end-to-end fashion. A by-product of this neural language model
is the distributed representations of words, known as word embeddings. Rather than represent-
ing words as discrete variables, word embeddings map words into low-dimensional real-valued
vectors, making it possible to compute the meanings of words and word n-grams in a continu-
ous representation space. As a result, language models are no longer burdened with the curse of
dimensionality, but can represent exponentially many n-grams via a compact and dense neural
model.
The idea of learning word representations through neural language models inspired subsequent
research in representation learning in NLP. However, this approach did not attract significant
interest in developing NLP systems in the first few years after its proposal. Starting in about 2012,
though, advances were made in learning word embeddings from large-scale text via simple word
prediction tasks. Several methods, such as Word2Vec, were proposed to effectively learn such
embeddings, which were then successfully applied in a variety of NLP systems [Mikolov et al.,
2013a;b]. As a result of these advances, researchers began to think of learning representations of
sequences using more powerful language models, such as LSTM-based models [Sutskever et al.,
2014; Peters et al., 2018]. Further progress and interest in sequence representation exploded
after the Transformer was proposed. Alongside the rise of the Transformer, the concept of language
modeling was generalized to encompass models that learn to predict words in various ways. Many
powerful Transformer-based models were pre-trained using these word prediction tasks, and suc-
cessfully applied to a variety of downstream tasks [Devlin et al., 2019].
Indeed, training language models on large-scale data has led NLP research into exciting times.
While language modeling had long been seen as a foundational technique with no direct link to
the goals of artificial intelligence that researchers had hoped for, we now see the emergence of
intelligent systems that can learn a certain degree of general knowledge from repeatedly predicting
words in text. Recent research demonstrates that a single, well-trained LLM can handle a large
number of tasks and generalize to perform new tasks with a small adaptation effort [Bubeck et al.,
2023]. This suggests a step towards more advanced forms of artificial intelligence, and inspires
further exploration into developing more powerful language models as foundation models.
In this chapter, we consider the basic concepts of generative LLMs. For simplicity, we use the
terms large language models or LLMs to refer to generative models like GPT, though this term
can broadly cover other types of models like BERT. We begin by giving a general introduction
to LLMs, including the key steps of building such models. We then discuss two scaling issues of
LLMs: how LLMs are trained at scale, and how LLMs can be improved to handle very long texts.
Finally, we give a summary of these discussions.

2.1 A Brief Introduction to LLMs

In this section we give an introduction to the basic ideas of LLMs as required for the rest of this
chapter and the following chapters. We will use the terms word and token interchangeably. Both
of them refer to the basic units used in language modeling, though their original meanings are
different.
Before presenting details, let us first consider how language models work. The goal of language
modeling is to predict the probability of a sequence of tokens occurring. Let {x_0, x_1, ..., x_m}
be a sequence of tokens, where x_0 is the start symbol ⟨s⟩ (or SOS; it can also be [CLS], following
BERT models). The probability of this sequence can be defined using the chain rule

\[
\Pr(x_0, ..., x_m) = \Pr(x_0) \cdot \Pr(x_1|x_0) \cdot \Pr(x_2|x_0, x_1) \cdots \Pr(x_m|x_0, ..., x_{m-1}) = \prod_{i=0}^{m} \Pr(x_i|x_0, ..., x_{i-1}) \tag{2.1}
\]

or alternatively in logarithmic form

\[
\log \Pr(x_0, ..., x_m) = \sum_{i=0}^{m} \log \Pr(x_i|x_0, ..., x_{i-1}) \tag{2.2}
\]

Here Pr(x_i|x_0, ..., x_{i-1}) is the probability of the token x_i given all its previous tokens
{x_0, ..., x_{i-1}}. Note that we assume Pr(x_i|x_0, ..., x_{i-1}) = Pr(x_0) = 1 when i = 0, and
hence Pr(x_0, ..., x_m) = Pr(x_0) Pr(x_1, ..., x_m|x_0) = Pr(x_1, ..., x_m|x_0). In the era of deep
learning, a typical approach to language modeling is to estimate this probability using a deep neural
network. Neural networks trained to accomplish this task receive a sequence of tokens x_0, ..., x_{i-1}
and produce a distribution over the vocabulary V (denoted by Pr(·|x_0, ..., x_{i-1})). The probability
Pr(x_i|x_0, ..., x_{i-1}) is the value of the entry of Pr(·|x_0, ..., x_{i-1}) that corresponds to the
token x_i.

Context | Predict | Decision Rule | Sequence Probability
⟨s⟩ a | b | argmax_{x2∈V} Pr(x2 | ⟨s⟩ a) | Pr(⟨s⟩) · Pr(a|⟨s⟩) · Pr(b|⟨s⟩ a)
⟨s⟩ a b | c | argmax_{x3∈V} Pr(x3 | ⟨s⟩ a b) | Pr(⟨s⟩) · Pr(a|⟨s⟩) · Pr(b|⟨s⟩ a) · Pr(c|⟨s⟩ a b)
⟨s⟩ a b c | d | argmax_{x4∈V} Pr(x4 | ⟨s⟩ a b c) | Pr(⟨s⟩) · Pr(a|⟨s⟩) · Pr(b|⟨s⟩ a) · Pr(c|⟨s⟩ a b) · Pr(d|⟨s⟩ a b c)

Table 2.1: Illustration of generating the three tokens b c d given the prefix ⟨s⟩ a via a language model. In each step,
the model picks a token x_i from V so that Pr(x_i|x_0, ..., x_{i-1}) is maximized. This token is then appended to the end
of the context sequence. In the next step, we repeat the same process, but based on the new context.

When applying a trained language model, a common task is to find the most likely token given
its previous context tokens. This token prediction task can be described as

\[
\hat{x}_i = \underset{x_i \in V}{\arg\max} \, \Pr(x_i|x_0, ..., x_{i-1}) \tag{2.3}
\]

We can perform this token prediction multiple times to generate a continuation of text: each time we
predict the best token x̂_i, and then add this predicted token to the context for predicting the next
token x̂_{i+1}. This results in a left-to-right generation process that implements Eqs. (2.1) and (2.2). To
illustrate, consider the generation of the following three tokens given the prefix ‘⟨s⟩ a’, as shown
in Table 2.1. Now we discuss how LLMs are constructed, trained, and applied.
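Before moving on, the greedy generation loop described above can be sketched in a few lines of Python. This is only an illustration: next_token_distribution is a hypothetical stand-in for a trained language model that returns the distribution Pr(·|x_0, ..., x_{i-1}) over the vocabulary as a dictionary.

# A minimal sketch of greedy (arg max) generation. The function
# next_token_distribution(context) is a hypothetical model wrapper that maps
# each token in the vocabulary V to Pr(token | context).
def greedy_generate(next_token_distribution, prefix, max_new_tokens, eos="</s>"):
    context = list(prefix)                     # e.g., ["<s>", "a"]
    for _ in range(max_new_tokens):
        dist = next_token_distribution(context)
        # Pick the token that maximizes Pr(x_i | x_0, ..., x_{i-1}), as in Eq. (2.3).
        next_token = max(dist, key=dist.get)
        context.append(next_token)             # the new token becomes part of the context
        if next_token == eos:
            break
    return context

Other decoding strategies (for example, sampling from the distribution instead of taking the arg max) follow the same loop and differ only in how the next token is chosen.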

2.1.1 Decoder-only Transformers

As is standard practice, the input of a language model is a sequence of tokens (denoted by
{x_0, ..., x_{m-1}}). For each step, an output token is generated, shifting the sequence one
position forward for the next prediction. To do this, the language model outputs a distribution
Pr(·|x_0, ..., x_{i-1}) at each position i, and the token x_i is selected according to this distribution.
This model is trained by maximizing the log-likelihood \sum_{i=1}^{m} \log \Pr(x_i|x_0, ..., x_{i-1}).
Note that \sum_{i=1}^{m} \log \Pr(x_i|x_0, ..., x_{i-1}) = \sum_{i=0}^{m} \log \Pr(x_i|x_0, ..., x_{i-1})
since \log \Pr(x_0) = 0.

Here, we focus on the decoder-only Transformer architecture, as it is one of the most popular
model architectures used in LLMs. The input sequence of tokens is represented by a sequence
of de -dimensional vectors {e0 , ..., em−1 }. ei is the sum of the token embedding of xi and the
positional embedding of i. The major body of the model is a stack of Transformer blocks (or
layers). Each Transformer block has two stacked sub-layers, one for self-attention modeling and
one for FFN modeling. These sub-layers can be dened using the post-norm architecture

output = LNorm(F (input) + input) (2.4)


or the pre-norm architecture

output = LNorm(F (input)) + input (2.5)

where input and output denote the input and output, both being an m × d matrix. The i-th rows
of input and output can be seen as contextual representations of the i-th token in the sequence.
F (·) is the core function of a sub-layer. For FFN sub-layers, F (·) is a multi-layer FFN. For
self-attention sub-layers, F(·) is a multi-head self-attention function. In general, self-attention is
expressed in the form of QKV attention

\[
\mathrm{Att}_{\mathrm{qkv}}(Q, K, V) = \mathrm{Softmax}\Big(\frac{Q K^{\mathrm{T}}}{\sqrt{d}} + \mathrm{Mask}\Big) V \tag{2.6}
\]

where Q, K and V ∈ R^{m×d} are the queries, keys, and values, respectively. It is important to
note that only previous tokens are considered when predicting a token, so a masking variable
Mask ∈ R^{m×m} is incorporated into self-attention to achieve this. The entry (i, k) of Mask has
a value of 0 if k ≤ i, and a value of −∞ otherwise.
Given a representation H ∈ R^{m×d}, the multi-head self-attention function can be defined as

\[
F(H) = \mathrm{Merge}(\mathrm{head}_1, ..., \mathrm{head}_\tau) W_{\mathrm{head}} \tag{2.7}
\]

where Merge(·) represents a concatenation of its inputs, and W_head ∈ R^{d×d} is a parameter
matrix. head_j is the output of QKV attention on a sub-space of the representation

\[
\mathrm{head}_j = \mathrm{Att}_{\mathrm{qkv}}(Q^{[j]}, K^{[j]}, V^{[j]}) \tag{2.8}
\]

Q^{[j]}, K^{[j]}, and V^{[j]} are the queries, keys, and values projected onto the j-th sub-space via linear
transformations

\[
Q^{[j]} = H W_q^j \tag{2.9}
\]
\[
K^{[j]} = H W_k^j \tag{2.10}
\]
\[
V^{[j]} = H W_v^j \tag{2.11}
\]

where W_q^j, W_k^j, and W_v^j ∈ R^{d×d/τ} are the parameter matrices of the transformations.
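To make the shapes and the masking concrete, here is a small sketch of masked multi-head self-attention in PyTorch, following Eqs. (2.6)-(2.11). Passing the per-head projection matrices as Python lists is an illustrative choice rather than how any particular LLM stores its parameters, and the attention scores are scaled by the per-head dimension d/τ, as is standard when each head works in a sub-space.

# A sketch of masked multi-head self-attention for a single sequence.
# H: (m, d) token representations; Wq, Wk, Wv: lists of tau matrices of shape (d, d/tau);
# W_head: (d, d) output projection.
import math
import torch

def masked_multi_head_attention(H, Wq, Wk, Wv, W_head, tau):
    m, d = H.shape
    d_head = d // tau
    # Mask[i, k] = 0 if k <= i, and -inf otherwise, so each position only attends
    # to itself and the tokens on its left.
    mask = torch.triu(torch.full((m, m), float("-inf")), diagonal=1)
    heads = []
    for j in range(tau):
        Q = H @ Wq[j]                                    # (m, d/tau), Eq. (2.9)
        K = H @ Wk[j]                                    # Eq. (2.10)
        V = H @ Wv[j]                                    # Eq. (2.11)
        scores = Q @ K.T / math.sqrt(d_head) + mask      # masked, scaled dot-product
        heads.append(torch.softmax(scores, dim=-1) @ V)  # Eq. (2.6)
    # Merge (concatenate) the heads and apply the output projection, Eq. (2.7).
    return torch.cat(heads, dim=-1) @ W_head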
Suppose we have L Transformer blocks. A Softmax layer is built on top of the output of the
last block. The Softmax layer outputs a sequence of m distributions over the vocabulary

\[
\begin{bmatrix} \Pr(\cdot|x_0, ..., x_{m-1}) \\ \vdots \\ \Pr(\cdot|x_0, x_1) \\ \Pr(\cdot|x_0) \end{bmatrix} = \mathrm{Softmax}(H^L W_o) \tag{2.12}
\]

where H^L is the output of the last Transformer block, and W_o ∈ R^{d×|V|} is the parameter matrix.
Figure 2.1 shows the Transformer architecture for language modeling.

Fig. 2.1: The Transformer-decoder architecture for language modeling. The central components are L stacked Transformer blocks, each comprising a self-attention sub-layer and an FFN sub-layer. To prevent the model from accessing the right-context, a masking variable is incorporated into self-attention. The output layer uses a Softmax function to generate a probability distribution for the next token, given the sequence of previous tokens. During inference, the model takes the previously predicted token to predict the next one, repeating this process until the end of the sequence is reached. {z_0, ..., z_{m-1}} denote the inputs of a Transformer block, and {h^L_0, ..., h^L_{m-1}} denote the outputs of the last Transformer block.

Applying this language model follows an autoregressive process. Each time the language model takes a token x_{i-1} as
input and predicts a token xi that maximizes the probability Pr(xi |x0 , ..., xi−1 ). It is important
to note that, despite different implementation details, many LLMs share the same architecture
described above. These models are called large because both their depth and width are significant.
Table 2.2 shows the model sizes for a few LLMs, as well as their model setups.

2.1.2 Training LLMs

Now suppose that we are given a training set D comprising K sequences. The log-likelihood of
each sequence x = x_0 ... x_m in D can be calculated using a language model

\[
L_\theta(x) = \sum_{i=1}^{m} \log \Pr_\theta(x_i|x_0, ..., x_{i-1}) \tag{2.13}
\]

Here the subscript θ afxed to L(·) and Pr(·) denotes the parameters of the language model. Then,
the objective of maximum likelihood training is dened as

θ̂ = arg max Lθ (x) (2.14)
θ x∈D
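As an illustration of this objective, the following PyTorch sketch computes the log-likelihood of a single sequence from the model's output logits. The tensor shapes are assumptions for this example: logits has one row per position, with row i−1 holding the scores for predicting x_i given x_0, ..., x_{i-1}.

# A minimal sketch of the per-sequence log-likelihood in Eq. (2.13).
import torch
import torch.nn.functional as F

def sequence_log_likelihood(logits, token_ids):
    # logits:    (m, |V|) scores, where row i-1 predicts token x_i
    # token_ids: [x_1, ..., x_m]; x_0 is the start symbol and is not predicted
    log_probs = F.log_softmax(logits, dim=-1)                # (m, |V|)
    targets = torch.tensor(token_ids)                        # (m,)
    # L_theta(x) = sum_i log Pr(x_i | x_0, ..., x_{i-1})
    return log_probs[torch.arange(len(token_ids)), targets].sum()

Maximizing Eq. (2.14) then amounts to minimizing the negative of this quantity, summed over all sequences in D, with gradient descent.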

Training Transformer-based language models with the above objective is commonly viewed
as a standard optimization process for neural networks. This can be achieved using gradient descent
algorithms, which are widely supported by off-the-shelf deep learning toolkits.

LLM | # of Parameters | Depth L | Width d | # of Heads (Q/KV)
GPT-1 [Radford et al., 2018] | 0.117B | 12 | 768 | 12/12
GPT-2 [Radford et al., 2019] | 1.5B | 48 | 1,600 | 25/25
GPT-3 [Brown et al., 2020] | 175B | 96 | 12,288 | 96/96
LLaMA2 [Touvron et al., 2023b] | 7B | 32 | 4,096 | 32/32
LLaMA2 [Touvron et al., 2023b] | 13B | 40 | 5,120 | 40/40
LLaMA2 [Touvron et al., 2023b] | 70B | 80 | 8,192 | 64/64
LLaMA3/3.1 [Dubey et al., 2024] | 8B | 32 | 4,096 | 32/8
LLaMA3/3.1 [Dubey et al., 2024] | 70B | 80 | 8,192 | 64/8
LLaMA3/3.1 [Dubey et al., 2024] | 405B | 126 | 16,384 | 128/8
Gemma2 [Team et al., 2024] | 2B | 26 | 2,304 | 8/4
Gemma2 [Team et al., 2024] | 9B | 42 | 3,584 | 16/8
Gemma2 [Team et al., 2024] | 27B | 46 | 4,608 | 32/16
Qwen2.5 [Yang et al., 2024] | 0.5B | 24 | 896 | 14/2
Qwen2.5 [Yang et al., 2024] | 7B | 28 | 3,584 | 28/4
Qwen2.5 [Yang et al., 2024] | 72B | 80 | 8,192 | 64/8
DeepSeek-V3 [Liu et al., 2024a] | 671B | 61 | 7,168 | 128/128
Falcon [Penedo et al., 2023] | 7B | 32 | 4,544 | 71/71
Falcon [Penedo et al., 2023] | 40B | 60 | 8,192 | 128/128
Falcon [Penedo et al., 2023] | 180B | 80 | 14,848 | 232/232
Mistral [Jiang et al., 2023a] | 7B | 32 | 4,096 | 32/32

Table 2.2: Comparison of some LLMs in terms of model size, model depth, model width, and number of heads (a/b
means a heads for queries and b heads for both keys and values).

Somewhat surprisingly, results continued to improve as language models evolved into more
computationally intensive models trained on larger datasets [Kaplan et al., 2020]. These successes
have led NLP researchers to continue increasing both the training data and model size in
order to build more powerful language models.
However, as language models become larger, we confront new training challenges, which
signicantly change the problem compared to training relatively small models. One of these
challenges arises from the need for large-scale distributed systems to manage the data, model
parameters, training routines, and so on. Developing and maintaining such systems requires a
signicant amount of work in both software and hardware engineering, as well as expertise in deep
learning. A related issue is that when the training is scaled up, we need more computing resources
to ensure the training process can be completed in an acceptable time. For example, it generally
requires hundreds or thousands of GPUs to train an LLM with tens of billions of parameters
from scratch. This requirement drastically increases the cost of training such models, especially
considering that many training runs are needed as these models are developed. Also, from the
perspective of deep learning, the training process can become unstable if the neural networks are
very deep and/or the model size is very large. In response, we typically need to modify the model
architecture to adapt LLMs to large-scale training. In Section 2.2 we will present more discussions
on these issues.

2.1.3 Fine-tuning LLMs

Once we have pre-trained an LLM, we can then apply it to perform various NLP tasks. Traditionally,
language models have been used as components of other systems; for example, they are widely
applied to score translations in statistical machine translation systems. By contrast, in generative
AI, LLMs are considered complete systems and are employed to address NLP problems by making
use of their generative nature. A common approach is to describe the task we want to address
in text and then prompt LLMs to generate text based on this description. This is a standard text
generation task in which we continue or complete the text starting from a given context.
More formally, let x = x_0 ... x_m denote a token sequence of context given by users, and
y = y_1 ... y_n denote a token sequence following the context. Then, the inference of LLMs can be
defined as the problem of finding the most likely sequence y given x:

\[
\hat{y} = \underset{y}{\arg\max} \, \log \Pr(y|x) = \underset{y}{\arg\max} \sum_{i=1}^{n} \log \Pr(y_i|x_0, ..., x_m, y_1, ..., y_{i-1}) \tag{2.15}
\]

Here \sum_{i=1}^{n} \log \Pr(y_i|x_0, ..., x_m, y_1, ..., y_{i-1}) essentially expresses the same thing as the right-hand
side of Eq. (2.2). It models the log probability of predicting tokens from position m + 1,
rather than position 0. Throughout this chapter and subsequent ones, we will employ separate
variables x and y to distinguish the input and output of an LLM, though they can be seen as sub-
sequences from the same sequence. By adopting such notation, we see that the form of the above
equation closely resembles those used in other text generation models in NLP, such as neural
machine translation models.
To illustrate how LLMs are applied, consider the problem of determining the grammaticality
of a given sentence. We can define a template like this

{*sentence*}
Question: Is this sentence grammatically correct?
Answer:

Here the blank after “Answer:” represents the text we intend to generate. {*sentence*} is a placeholder variable that
will be replaced by the actual sentence provided by the user. For example, suppose we have the
sentence “John seems happy today.”. We can replace {*sentence*} in the template with this
sentence to obtain an input to the language model

John seems happy today.


Question: Is this sentence grammatically correct?
Answer:

To perform the task, the language model is given the context x = “John seems happy today .\n
Question : Is this sentence grammatically correct?\n Answer :”, where \n is a special character
used for line breaks. It then generates the following
text as the answer, based on the context. For example, the language model may output “Yes” (i.e.,
y = “Yes”) if this text is the one with the maximum probability of prediction given this context.
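As a small illustration, the template above can be filled and sent to a model with a few lines of Python. The generate function here is a hypothetical wrapper around an LLM that returns its continuation of the prompt.

# A sketch of filling the grammaticality template and prompting a model.
TEMPLATE = "{sentence}\nQuestion: Is this sentence grammatically correct?\nAnswer:"

def check_grammar(sentence, generate):
    prompt = TEMPLATE.format(sentence=sentence)
    return generate(prompt)     # e.g., "Yes" for "John seems happy today."

The same pattern applies to the other templates shown below: only the text of the template changes, not the way the model is called.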
Likewise, we can dene more templates to address other tasks. For example, we can translate
an English sentence into Chinese using the following template

{*sentence*}
Question: What is the Chinese translation of this English sentence?
Answer:

or using an instruction-like template

{*sentence*}
Translate this sentence from English into Chinese.

or using a code-like template.

[src-lang] = English
[tgt-lang] = Chinese
[input] = {*sentence*}
[output] =

The above templates provide a simple but effective method to “prompt” a single LLM to per-
form various tasks without adapting the structure of the model. However, this approach requires
that the LLM can recognize and follow the instructions or questions. One way to do this is to incor-
porate training samples with instructions and their corresponding responses into the pre-training
dataset. While this method is straightforward, building and training LLMs from scratch is com-
putationally expensive. Moreover, making instruction-following data effective for pre-training
requires a signicant amount of such data, but collecting large-scale labeled data for all tasks of
interest is very difcult.
A second method, which has become a de facto standard in recent research, is to adapt LLMs
via fine-tuning. In this way, the token prediction ability learned in the pre-training phase can be
generalized to accomplish new tasks. The idea behind fine-tuning is that some general knowledge
of language has been acquired in pre-training, but we need a mechanism to activate this knowledge
for applying it to new tasks. To achieve this, we can slightly fine-tune the model parameters using
instruction-following data. This approach is called instruction fine-tuning.
An instruction ne-tuning sample, which is represented by a sequence of tokens, can be seen
as a tuple consisting of an input and the desired output. Here, the input includes instructions,
system information (or system prex), and any other user-provided information5 . To illustrate,
consider the following examples (blue text = input and underlined text = output).

5
System information refers to a sequence of tokens added at the beginning of an input in order to guide the behavior
of an LLM, such as, you are a helpful assistant and should not output toxic content.

Read the following sentence, and answer whether it is grammatically correct.


LLMs are powerful models but are expensive to build.
Yes

LLMs are powerful models but are expensive to build.


Does this sentence make sense grammatically?
Answer Option
- Yes
- No
Yes

Yes or No? Yes = the following sentence is grammatically correct. No = it contains grammatical errors.
LLMs are powerful models but are expensive to build.
Yes

All these samples describe the same binary classification task, but with different instructions.
To increase the diversity in the fine-tuning data and make LLMs generalize, we can define more
tasks using instructions.

Write a short story about three characters.


Character 1: A boy whose name is Andy.
Character 2: Andy’s dog.
Character 3: Andy’s friend, Jane.
In the heart of a bustling city, Andy, his loyal dog Rocket, and his friend
Jane discovered a hidden garden beneath the glow of twilight. Surrounded by
ancient trees and whispering winds, they shared dreams and secrets, forging an
unbreakable bond. As the night deepened, they made a pact under the stars to
always nd magic in the mundane. Together, they realized that true adventure lay
in the moments spent with friends, and every day was a treasure waiting to be
discovered.

Show me a recipe for making ice cream.


Ingredients:
2 cups heavy cream; 1 cup milk; 3/4 cup sugar; 2 tsp vanilla extract;
Pinch of salt.
Steps:
1. Whisk all ingredients until sugar dissolves.
2. Chill mixture for 2 hours in the fridge.
3. Freeze in a shallow dish, stirring every 30 minutes, 3-4 times.
4. Freeze solid for 2 hours.
5. Serve your homemade ice cream.

If you buy 5 apples and each apple costs $1.20, how much do you spend in total?
$6.00

Write a Python program to calculate the sum of squares of the following numbers.
1 , 2 , 10 , -9 , 78
numbers = [1, 2, 10, -9, 78]
sum_of_squares = sum(x**2 for x in numbers)
print(sum_of_squares)

To acquire instruction-following abilities, a certain amount of fine-tuning data is required. This
data may include diverse instructions and possible responses. It has been found that scaling the
number of fine-tuning tasks is beneficial for improving the performance of LLMs [Chung et al.,
2022]. Note that although more fine-tuning data is favorable, the amount of this data is generally
orders of magnitude smaller than that of the pre-training data. For example, LLMs can be fine-tuned
with tens or hundreds of thousands of samples, or even fewer if these samples are of high
quality [Zhou et al., 2023a; Chen et al., 2023b], whereas pre-training such models may require
billions or trillions of tokens, resulting in significantly larger computational demands and longer
training times [Touvron et al., 2023a].
It is also worth noting that we should not expect the fine-tuning data to cover all the downstream
tasks to which we intend to apply LLMs. A common understanding of how the pre-training
+ fine-tuning approach works is that LLMs have gained knowledge for understanding instructions
and generating responses in the pre-training phase. However, these abilities are not fully activated
until we introduce some form of supervision. The general instruction-following behavior emerges
as we fine-tune the models with a relatively small amount of labeled data. As a result, we can
achieve some level of zero-shot learning: the fine-tuned models can handle new tasks that they
have not been explicitly trained or fine-tuned for [Sanh et al., 2022; Wei et al., 2022a]. This zero-shot
learning ability distinguishes generative LLMs from earlier pre-trained models like BERT,
which are primarily fine-tuned for specific tasks.
Once we have prepared a collection of instruction-described data, the fine-tuning process is
relatively simple. It can be viewed as a standard training process, like pre-training, but on
a much smaller training dataset. Let D_tune be the fine-tuning dataset and θ̂ be the model parameters
optimized via pre-training. We can modify Eq. (2.14) to obtain the objective of fine-tuning

\[
\tilde{\theta} = \underset{\hat{\theta}^{+}}{\arg\max} \sum_{\mathrm{sample} \in D_{\mathrm{tune}}} L_{\hat{\theta}^{+}}(\mathrm{sample}) \tag{2.16}
\]

Here θ̃ denotes the optimal parameters. The notation θ̂^+ means that the fine-tuning starts
with the pre-trained parameters θ̂.
For each sample ∈ D_tune, we divide it into an input segment x_sample and an output segment
y_sample, that is,

\[
\mathrm{sample} = [x_{\mathrm{sample}}, y_{\mathrm{sample}}] \tag{2.17}
\]

We then define the loss function to be

\[
L_{\hat{\theta}^{+}}(\mathrm{sample}) = -\log \Pr_{\hat{\theta}^{+}}(y_{\mathrm{sample}} | x_{\mathrm{sample}}) \tag{2.18}
\]

In other words, we compute the loss over the sub-sequence y_sample, rather than the entire sequence.
In a practical implementation of back-propagation for this equation, the sequence [x_sample, y_sample]
is constructed in the forward pass as usual. However, in the backward pass, error gradients are
propagated back only through the parts of the network that correspond to y_sample, leaving the rest
of the network unchanged. As an example, consider the sequence

⟨s⟩ Square this number . 2 .   |   The result is 4 .
(Context / Input)                  (Prediction / Output)

The loss is calculated and back-propagated only for “The result is 4 .”.
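The following PyTorch sketch shows one way to implement this loss masking. The shapes are assumptions for illustration: the model has already produced one row of logits per position, and only the positions that predict tokens of y_sample contribute to the loss.

# A sketch of the instruction fine-tuning loss of Eq. (2.18) with loss masking.
import torch
import torch.nn.functional as F

def fine_tuning_loss(logits, token_ids, num_input_tokens):
    # logits:           (T-1, |V|); row i predicts token_ids[i+1] given token_ids[0..i]
    # token_ids:        the full sequence [x_sample, y_sample] of length T
    # num_input_tokens: length of x_sample (including the start symbol)
    targets = torch.tensor(token_ids[1:])                         # tokens to be predicted
    per_token_loss = F.cross_entropy(logits, targets, reduction="none")
    # Zero out the loss on positions that predict the context x_sample.
    mask = torch.zeros_like(per_token_loss)
    mask[num_input_tokens - 1:] = 1.0                             # positions predicting y_sample
    return (per_token_loss * mask).sum()

In the example above, only the positions that predict “The result is 4 .” would have a mask value of 1.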
Instruction ne-tuning also requires substantial engineering work. In order to achieve satis-
factory results, one may experiment with different settings of the learning rate, batch size, number
of ne-tuning steps, and so on. This typically requires many ne-tuning runs and evaluations. The
cost and experimental effort of ne-tuning remain critical and should not be overlooked, though
they are much lower than those of the pre-training phase.
While we focus on instruction ne-tuning for an illustrative example here, ne-tuning tech-
niques play an important role in developing various LLMs and are more widely used. Examples
include ne-tuning LLMs as chatbots using dialog data, and adapting these models to handle very
long sequences. The wide application of ne-tuning has led researchers to improve these tech-
niques, such as designing more efcient ne-tuning algorithms. While the research on ne-tuning
is fruitful, in this section we just give a avour of the key steps involved. We will see more detailed
discussions on this topic in the following chapters.

2.1.4 Aligning LLMs with the World

Instruction ne-tuning provides a simple way to adapt LLMs to tasks that can be well dened. This
problem can broadly be categorized as an alignment problem. Here, alignment is referred to as a
process of guiding LLMs to behave in ways that align with human intentions. The guidance can
come from labeled data, human feedback, or any other form of human preferences. For example,
we want LLMs not only to be accurate in following instructions, but also to be unbiased, truthful,
and harmless. So we need to supervise the models towards human values and expectations. A
common example is that when we ask an LLM how to build a weapon, it may provide a list of key
steps to do so if it is not carefully aligned. However, a responsible model should recognize and
avoid responding to requests for harmful or illegal information. Alignment in this case is crucial
for ensuring that LLMs act responsibly and in accordance with ethical guidelines.
A concept related to alignment is AI safety. One ultimate goal of AI is to build intelligent
systems that are safe and socially beneficial. To achieve this goal we should keep these systems
robust, secure, and objective in any conditions of real-world use, even in conditions of misuse
or adverse use. For LLMs, safety can be increased by aligning them with appropriate human
guidance, such as human-labeled data and interactions with users during application.
Alignment is difcult as human values and expectations are diverse and shifting. Sometimes,
it is hard to describe precisely what humans want, unless we see the response of LLMs to user
requests. This makes alignment no longer a problem of tuning LLMs on predened tasks, but a
bigger problem of training them with the interactions with the real world.
As a result of the concerns with controlling AI systems, there has been a surge in research
on the alignment issue for LLMs. Typically, two alignment steps are adopted after LLMs are
pre-trained on large-scale unlabeled data.

• Supervised Fine-tuning (SFT). This involves continuing the training of pre-trained LLMs
on new, task-oriented, labelled data. A commonly used SFT technique is instruction fine-tuning.
As described in the previous subsection, by learning from instruction-response
annotated data, LLMs can align with the intended behaviors for following instructions,
thereby becoming capable of performing various instruction-described tasks. Supervised
fine-tuning can be seen as following the pre-training + fine-tuning paradigm, and offers a
relatively straightforward method to adapt LLMs.

• Learning from Human Feedback. After an LLM finishes pre-training and supervised fine-tuning,
it can be used to respond to user requests if appropriately prompted. But this model
may still generate content that is factually incorrect, biased, or harmful. To make the LLM better aligned
with users, one simple approach is to learn directly from human feedback. For example,
given some instructions and inputs provided by the users, experts are asked to evaluate how
well the model responds in accordance with their preferences and interests. This feedback
is then used to further train the LLM for better alignment.

A typical method for learning from human feedback is to consider it as a reinforcement learn-
ing (RL) problem, known as reinforcement learning from human feedback (RLHF) [Ouyang et al.,
2022]. The RLHF method was initially proposed to address general sequential decision-making
problems [Christiano et al., 2017], and was later successfully employed in the development of
the GPT series models [Stiennon et al., 2020]. As a reinforcement learning approach, the goal of
RLHF is to learn a policy by maximizing some reward from the environment. Specifically, two
components are built in RLHF:

• Agent. An agent, also called an LM agent, is the LLM that we want to train. This agent
operates by interacting with its environment: it receives a text from the environment and
outputs another text that is sent back to the environment. The policy of the agent is the
function dened by the LLM, that is, Pr(y|x).

• Reward Model. A reward model is a proxy of the environment. Each time the agent
produces an output sequence, the reward model assigns this output sequence a numerical
score (i.e., the reward). This score tells the agent how good the output sequence is.

In RLHF, we need to perform two learning tasks: 1) reward model learning, which involves
training a reward model using human feedback on the output of the agent, and 2) policy learning,
which involves optimizing a policy guided by the reward model using reinforcement learning
algorithms. Here is a brief outline of the key steps involved in RLHF.

• Build an initial policy using pre-training and instruction ne-tuning.

• Use the policy to generate multiple outputs for each input, and then collect human feedback
on these outputs (e.g., comparisons of the outputs).

• Learn a reward model from the human feedback.

• Fine-tune the policy with the supervision from the reward model.

Figure 2.2 shows an overview of RLHF. Given that this section serves only as a brief intro-
duction to concepts of LLMs, a detailed discussion of RLHF techniques will not be included. We
instead illustrate the basic ideas behind RLHF using a simple example.
Suppose we have trained an LLM via pre-training and instruction fine-tuning. This LLM is
deployed to respond to requests from users. For example, a user may input

How can I live a more environmentally friendly life?

We use the LLM to generate 4 different outputs (denoted by {y1 , ..., y4 }) by sampling the
output space

Output 1 (y1 ): Consider switching to an electric vehicle or bicycle instead of


traditional cars to reduce carbon emissions and protect our planet.
Output 2 (y2 ): Adopt a minimalist lifestyle. Own fewer possessions to reduce
consumption and the environmental impact of manufacturing and
disposal.
Output 3 (y3 ): Go off-grid. Generate your own renewable energy and collect
rainwater to become completely self-sufficient and reduce reliance
on non-renewable resources.
Output 4 (y4 ): Support local farm products to reduce the carbon footprint of
transporting food, while enjoying fresh, healthy food.

Fig. 2.2: An overview of RLHF. There are 4 key steps involved: a) training an initial LLM (i.e., policy) using pre-training and supervised fine-tuning; b) collecting human preference data by ranking the outputs of the LLM; c) training a reward model using the ranking results; d) RL fine-tuning of the policy based on the reward model. Double line arrows mean training or fine-tuning.

We then ask annotators to evaluate these outputs. One straightforward way is to assign a rating
score to each output. In this case, the reward model learning problem can be framed as a task of
training a regression model. But giving numerical scores to LLM outputs is not an easy task for
annotators. It is usually difcult to design an annotation standard that all annotators can agree on
and easily follow. An alternative method, which is more popular in the development of LLMs, is
to rank these outputs. For example, a possible ranking of the above outputs is

y1 ≻ y4 ≻ y2 ≻ y3

A reward model is then trained using this ranking result. In general, a reward model in RLHF
is a language model that shares the same architecture as the target LLM, but with a smaller model
size. Given the input x and output y_k, we concatenate them to form a sequence seq_k = [x, y_k].
This sequence is processed from left to right using forced decoding. Since each position can
only access its left context in language modeling, the output of the top-most Transformer layer at
the first position cannot be used as the representation of the sequence. Instead, a special symbol
(e.g., ⟨/s⟩) is added to the end of the sequence, and the corresponding output of the Transformer
layer stack is taken as the representation of the entire sequence. An output layer, such as a
linear transformation layer, is built on top of this representation to generate the reward, denoted
by R(seq_k) or R(x, y_k).
We train this reward model using a ranking loss. For example, a pair-wise ranking loss function
can be written in the form

\[
\mathrm{Loss}_\omega(D_r) = -\mathbb{E}_{(x, y_{k_1}, y_{k_2}) \sim D_r} \log\big(\mathrm{Sigmoid}(R_\omega(x, y_{k_1}) - R_\omega(x, y_{k_2}))\big) \tag{2.19}
\]

where ω represents the parameters of the reward model, and D_r represents a set of tuples of an
input and a pair of outputs. (x, y_{k_1}, y_{k_2}) ∼ D_r is a sampling operation which draws a sample
(x, y_{k_1}, y_{k_2}) from D_r with some probability. As an example, suppose we first draw a model
input x with a uniform distribution and then draw a pair of model outputs with a probability of
y_{k_1} ≻ y_{k_2} given x (denoted by Pr(y_{k_1} ≻ y_{k_2}|x)). The corresponding loss function is given by

\[
\begin{aligned}
\mathrm{Loss}_\omega(D_r) &= -\sum_{x} \sum_{(y_{k_1}, y_{k_2})} \Pr(x) \cdot \Pr(y_{k_1} \succ y_{k_2}|x) \cdot \log\big(\mathrm{Sigmoid}(R_\omega(x, y_{k_1}) - R_\omega(x, y_{k_2}))\big) \\
&= -\frac{1}{K} \sum_{x} \sum_{(y_{k_1}, y_{k_2})} \Pr(y_{k_1} \succ y_{k_2}|x) \cdot \log\big(\mathrm{Sigmoid}(R_\omega(x, y_{k_1}) - R_\omega(x, y_{k_2}))\big)
\end{aligned} \tag{2.20}
\]

where K represents the number of model inputs involved in sampling. While the form of these
functions may seem complex, their idea is simple: we penalize the model if the predicted ranking
of two outputs differs from the human-labeled ranking. By contrast, the model receives a bonus
if the predicted ranking matches the human-labeled ranking.
We can train the reward model by minimizing the above ranking loss

\[
\hat{\omega} = \underset{\omega}{\arg\min} \, \mathrm{Loss}_\omega(D_r) \tag{2.21}
\]

The resulting model R_{\hat{\omega}}(·) can be employed to evaluate any given pair of input and output. Note
that although the reward model is trained using a ranking-based objective, it is used for scoring.
This allows it to provide continuous supervision signals, which is very beneficial for training other
models.
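For concreteness, the pairwise ranking loss above can be written in a few lines of PyTorch. Here reward_model(x, y) is a hypothetical function that returns the scalar R_ω(x, y) as a tensor; how the reward model itself is built on top of the Transformer is as described above.

# A sketch of the pairwise ranking loss of Eq. (2.19) for one labeled comparison.
import torch
import torch.nn.functional as F

def ranking_loss(reward_model, x, y_preferred, y_rejected):
    # The loss is small when R(x, y_preferred) is much larger than R(x, y_rejected),
    # i.e., when the predicted ranking agrees with the human-labeled ranking.
    margin = reward_model(x, y_preferred) - reward_model(x, y_rejected)
    return -F.logsigmoid(margin)

In training, this quantity is averaged over comparisons (x, y_{k_1} ≻ y_{k_2}) drawn from D_r and minimized with gradient descent, as in Eq. (2.21).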
We now turn to the policy learning problem. A commonly adopted objective is to maximize
the reward on a set of input-output pairs. Following an analogous form of Eq. (2.16), we obtain a
simple training objective for RL fine-tuning

\[
\tilde{\theta} = \underset{\hat{\theta}^{+}}{\arg\max} \, \mathbb{E}_{(x, y_{\hat{\theta}^{+}}) \sim D_{\mathrm{rlft}}} R_{\hat{\omega}}(x, y_{\hat{\theta}^{+}}) \tag{2.22}
\]

where the optimal parameters θ̃ are obtained by fine-tuning the pre-trained parameters θ̂. D_rlft is
the RL ne-tuning dataset. For each sample (x, yθ̂+ ), x is sampled from a prepared dataset of
input sequences, and yθ̂+ is sampled from the distribution Prθ̂+ (y|x) given by the policy.
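As a rough illustration of this objective, the sketch below performs one reward-driven update using a plain policy-gradient (REINFORCE-style) step rather than PPO. The helper functions sample_from_policy and log_prob_of are hypothetical wrappers around the LLM; they are not part of any specific library.

# A highly simplified sketch of one RL fine-tuning step for the objective in Eq. (2.22).
import torch

def rl_fine_tuning_step(policy, reward_model, x, optimizer,
                        sample_from_policy, log_prob_of):
    y = sample_from_policy(policy, x)          # y ~ Pr_theta(y | x)
    reward = reward_model(x, y)                # scalar tensor R(x, y) from the reward model
    log_prob = log_prob_of(policy, x, y)       # log Pr_theta(y | x), differentiable w.r.t. theta
    # Maximizing the expected reward is approximated by minimizing -R * log Pr(y | x).
    loss = -reward.detach() * log_prob
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(reward)

This is only meant to convey the structure of the update; as noted next, practical systems rely on more advanced algorithms.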
In practice, more advanced reinforcement learning algorithms, such as proximal policy opti-
mization (PPO), are often used for achieving more stable training, as well as better performance.
We leave the detailed discussion of reinforcement learning algorithms to the following parts of
this book where RLHF is extensively used for alignment.
An interesting question arises here: why not consider learning from human preferences as
a standard supervised learning problem? This question is closely related to our aforementioned
discussion on the difculty of data annotation. Often, describing human values and goals is chal-
lenging, and it is even more difcult for humans to provide outputs that are well aligned. As an
alternative, annotating the preferences of a given list of model outputs offers a simpler task. By
doing so, we can create a model that understands human preferences, which can then be used as
a reward model for training policies. From the perspective of machine learning, RLHF is par-
ticularly useful for scenarios where the desired behavior of an agent is difficult to demonstrate
but can be easily recognized by humans. Another advantage of RLHF is its ability to explore the
sample space. By employing sampling techniques, models trained with reinforcement learning
can venture beyond the annotated data set to explore additional samples. This exploratory ability
allows RLHF to discover potentially beneficial policies that are not immediately apparent from
the labeled data alone.

2.1.5 Prompting LLMs

We have so far shown that LLMs can be used to perform various tasks by giving them appropriate
prompts. There are no restrictions on these prompts, which can include any information we wish
to ask or communicate with LLMs, such as natural language instructions and the context of con-
versations. Since this approach requires no additional training or tuning, adapting LLMs becomes
highly efcient once they are developed. This somewhat inuences the paradigms in NLP: we
no longer need to develop specic systems for individual tasks but can instead use a single, well-
trained LLM to perform different tasks by prompting it. An appealing aspect of LLM prompting
arises as a result: users can easily have “customized” systems by designing their own prompts
for LLMs. Given the important role played by prompting in LLMs, prompt engineering has
become a very active area of research in NLP.
The term prompt is used in many different ways in the literature. In this chapter, this term
refers to the entire input to LLMs, and so we use the terms prompt and model input interchange-
ably. Before discussing prompting further, let us first see a few examples where the prompts
are more complex than those presented in the previous subsections. Note that this subsection is
not aimed at writing high-quality prompts but rather at highlighting some interesting issues in
prompting LLMs.
One of the popular ways to use LLMs is to assign them a “role” to play in generating responses.
For example, LLMs can act as a psychologist when answering questions.

Please explain what delayed gratification is?

Note, you are a researcher with a deep background in psychology. Your own
field of study is developmental psychology in children, with a focus on aspects
of control and intelligence. You are very familiar with the concept of delayed
gratification and have applied it to your own research. You have published papers
on delayed gratification and conducted related experiments.
Delayed gratification is the process of resisting an immediate reward in
anticipation of receiving a more valuable reward in the future. It is a significant
concept in developmental psychology, especially in relation to how children learn
self-control and develop cognitive abilities. Research shows that the ability to
delay gratification is associated with positive outcomes in academic performance,
social competence, and emotional well-being. Studies and experiments in this
area highlight the importance of cultivating this skill early in life to support
long-term success and resilience.

Another example is the use of LLMs in detecting and correcting errors such as syntactic or
semantic mistakes in text. For an LLM which is trained on both code and natural language data,
we may use it for code debugging (in this example, the code is not tokenized, for easier reading).

Fix the bugs in this C language program.


#include <stdio.h>
int main() {
printg("Hello, World!")
return 0;
}
1. The function name ’printg’ is incorrect. It should be ’printf’.
2. There is a missing semicolon at the end of the ’printf’ function call.
Here is the corrected version of the program:
#include <stdio.h>
int main() {
printf("Hello, World!");
return 0;
}

Also, LLMs can be ne-tuned to act as a conversation partner 7 .

6
In this example, the code is not tokenized for easier reading.
7
To ne-tune an LLM for multi-turn dialogue, one needs to consider conversation history in the context for pre-
dicting the response in the current round of conversation. This makes the actual prompt used in response generation
relatively longer than that used in single-turn dialogue.

Hi! I’m an assistant. How can I help you?


Who won the FIFA World Cup 2022?
Argentina won the FIFA World Cup 2022.
Where was it held?
The 2022 FIFA World Cup was held in Qatar.
How many times has Argentina won the World Cup?
Argentina has won the FIFA World Cup three times.
...

These examples and previous ones have shown that appropriate responses can be generated
via prompts involving clear instructions and questions. However, when problem solving requires
knowledge that is not explicitly specified, LLMs may make mistakes, even though the instructions
are sufficiently clear and precise. A family of challenging tasks for LLMs involves arithmetic
reasoning and commonsense reasoning. For example, we can ask an LLM to solve primary school
math problems presented in natural language.

Jack has 7 apples. He ate 2 of them for dinner, but then his mom gave him 5 more
apples. The next day, Jack gave 3 apples to his friend John. How many apples
does Jack have left in the end?
The answer is 10.

The correct answer should be 7, so the model output is incorrect.


One approach to addressing such issues is to incorporate learning into prompts, called in-context
learning (ICL). The idea of ICL is to demonstrate ways of solving problems in prompts, and to
condition predictions on these demonstrations. Here is an example where a similar
problem and the corresponding answer are presented in the prompt (green = demonstrations).

Tom has 12 marbles. He wins 7 more marbles in a game with his friend but then
loses 5 marbles the next day. His brother gives him another 3 marbles as a gift.
How many marbles does Tom have now?
The answer is 17.
Jack has 7 apples. He ate 2 of them for dinner, but then his mom gave him 5 more
apples. The next day, Jack gave 3 apples to his friend John. How many apples
does Jack have left in the end?
The answer is 12.

But the LLM still made a mistake this time. A reason for this might be that solving math
problems involves not only problem-answer mappings but also, to a larger extent, the underlying
logical inference in multiple steps. A method to improve the inference abilities of LLMs
is chain-of-thought prompting (COT prompting) [Wei et al., 2022c]. In COT prompting, we
decompose complex reasoning problems into multiple problem-solving intermediate steps. These
steps are demonstrated in prompts so that LLMs can be prompted to learn to mimic the underlying
thought processes in solving the problems. COT prompting has been found to be very useful for
achieving good results on challenging tasks, such as the GSM8K mathematical reasoning bench-
mark.
Consider the above demonstration example: Tom has 12 marbles. He wins 7 more marbles in
a game with his friend but then loses 5 marbles the next day. His brother gives him another 3
marbles as a gift. How many marbles does Tom have now? The corresponding logical reasoning
steps can be structured as follows:

1. Tom starts with 12 marbles.

2. He wins 7 more marbles in a game, so we add those to his initial count: 12 marbles + 7
marbles = 19 marbles.

3. The next day, Tom loses 5 marbles, so we subtract those from his current total: 19 marbles
- 5 marbles = 14 marbles.

4. His brother gives him 3 more marbles, so we add those to his current total: 14 marbles + 3
marbles = 17 marbles.

5. After all these events, Tom has 17 marbles.

We can add these reasoning steps into the prompt, and get the correct answer.

Tom has 12 marbles. He wins 7 more marbles in a game with his friend but then
loses 5 marbles the next day. His brother gives him another 3 marbles as a gift.
How many marbles does Tom have now?
Tom initially has 12 marbles. After winning a game, he gains 7 more, increasing
his total to 19 marbles. He loses 5 marbles the next day, bringing his total down
to 14 marbles. His brother gifts him 3 more marbles, increasing his total to 17
marbles. Therefore, Tom now has 17 marbles. So the answer is 17.
Jack has 7 apples. He ate 2 of them for dinner, but then his mom gave him 5 more
apples. The next day, Jack gave 3 apples to his friend John. How many apples
does Jack have left in the end?
Jack starts with 7 apples. He eats 2 apples for dinner, so we subtract 2 from 7,
leaving him with 5 apples. His mom gives him 5 more apples, adding 5 to his
current total. Now he has 5 + 5 = 10 apples. The next day, Jack gives 3 apples to
his friend John. Subtracting 3 from 10, Jack is left with 7 apples. Therefore, Jack
has 7 apples left. So the answer is 7.

This approach is also called one-shot COT prompting. Likewise, we can add more demon-
strations to have a few-shot COT prompting approach.
Furthermore, a special kind of COT, zero-shot COT, introduces instructions that provoke
LLMs to think, without presenting any intermediate reasoning steps in the prompts [Kojima et al.,
2022]. An interesting example is that by appending a simple sentence “Let’s think step by step.”
to the end of a prompt, LLMs can generate reasoning steps to reach the final answer.

Jack has 7 apples. He ate 2 of them for dinner, but then his mom gave him 5 more
apples. The next day, Jack gave 3 apples to his friend John. How many apples
does Jack have left in the end?
Let’s think step by step.
1. Initial Quantity: Jack starts with 7 apples.
2. After Dinner: He eats 2 apples, so 7 − 2 = 5 apples remain.
3. His Mom Gives More: His mom gives him 5 more apples, so 5 + 5 = 10
apples are now with him.
4. Giving to John: The next day, Jack gives 3 apples to his friend John,
so 10 − 3 = 7 apples are left.
In the end, Jack has 7 apples left.

Zero-shot, one-shot, and few-shot learning are common concepts in the area of in-context
learning for LLMs and are not restricted to COT prompting. Broadly speaking, any prompting
that involves only simple instructions without any demonstrations can be considered a form of
zero-shot learning. This zero-shot learning ability emerges as LLMs are pre-trained and/or fine-tuned.
Also, one-shot and few-shot learning methods are more often considered when LLMs do
not acquire the corresponding zero-shot learning ability. These methods are therefore important
for in-context learning when addressing new tasks. Examples include those for performing various
NLP tasks by demonstrating task-formatted samples. See the following examples for sentiment
sentence classification and phrase translation via few-shot learning.

Given the following text snippets, classify their sentiment as Positive, Negative,
or Neutral.
Example 1: “I had an amazing day at the park!”
Sentiment: Positive
Example 2: “The service at the restaurant was terrible.”
Sentiment: Negative
Example 3: “I think it’s going to rain today.”
Sentiment: Neutral
Text: “This movie was a fantastic journey through imagination.”
Sentiment: Positive

Translate the following Chinese phrases into English.


Example 1: “你好”
Translation: “Hello”
Example 2: “谢谢你”
Translation: “Thank you”
Phrase to translate: “早上好”
Translation: “Good Morning”

Above, we have presented examples to illustrate the fundamental in-context learning capa-
bilities of prompting LLMs. This section, however, does not include more advanced prompting
techniques in order to keep the content concise and compact. More discussions on prompting can
be found in Chapter 3.

2.2 Training at Scale

As a rst step in developing LLMs, we need to train these models on large amounts of data.
The training task is itself standard: the objective is to maximize the likelihood, which can be
achieved via gradient descent. However, as we scale up both the model size and the amount
of data, the problem becomes very challenging, for example, large models generally make the
training unstable. In this section, we discuss several issues of large-scale training for LLMs,
including data preparation, model modication, and distributed training. We also discuss the
scaling laws for LLMs, which help us understand their training efficiency and effectiveness.

2.2.1 Data Preparation

The importance of data cannot be overstated in NLP. As larger neural networks are developed,
the demand for data continues to increase. For example, developing LLMs may require trillions
of tokens in pre-training (see Table 2.3), orders of magnitude larger than those used in training
conventional NLP models. In general, we may want to gather as much training data as possible.
However, larger training datasets do not mean better training results, and the development of
LLMs raises new issues in creating or collecting these datasets.
A rst issue is the quality of data. High-quality data has long been seen as crucial for training
data-driven NLP systems. Directly using raw text from various sources is in general undesirable.
For example, a signicant portion of the data used to train recent LLMs comes from web scraping,
which may contain errors and inappropriate content, such as toxic information and fabricated
facts. Also, the internet is ooded with machine-generated content due to the widespread use of
AI, presenting further challenges for processing and using web-scraped data. Researchers have
found that training LLMs on unltered data is harmful [Raffel et al., 2020]. Improving data quality
typically involves incorporating ltering and cleaning steps in the data processing workow. For
example, Penedo et al. [2023] show that by adopting a number of data processing techniques, 90%
of their web-scraped data can be removed for LLM training. In addition to large-scale web-scraped
data, LLM training data often includes books, papers, user-generated data on social media, and
so on. Most of the latest LLMs are trained on such combined datasets, which are found to be
important for the strong performance of the resulting models.

LLM | # of Tokens | Data
GPT3-175B [Brown et al., 2020] | 0.5T | Webpages, Books, Wikipedia
Falcon-180B [Almazrouei et al., 2023] | 3.5T | Webpages, Books, Conversations, Code, Technical Articles
LLaMA-65B [Touvron et al., 2023a] | 1.0T ∼ 1.4T | Webpages, Code, Wikipedia, Books, Papers, Q&As
PaLM-540B [Chowdhery et al., 2022] | 0.78T | Webpages, Books, Conversations, Code, Wikipedia, News
Gemma-7B [Gemma Team, 2024] | 6T | Webpages, Mathematics, Code

Table 2.3: Amounts of training data used in some LLMs in terms of the number of tokens.

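To give a flavour of what the filtering and cleaning steps mentioned above can look like, here is a minimal sketch of heuristic document filtering for web-scraped text. The thresholds and rules are illustrative assumptions only; real pipelines combine many more signals (language identification, deduplication, toxicity classifiers, and so on).

# A minimal sketch of heuristic quality filtering for web-scraped documents.
def keep_document(text, min_words=50, max_words=100_000,
                  min_alpha_ratio=0.7, banned_terms=("lorem ipsum",)):
    words = text.split()
    if not (min_words <= len(words) <= max_words):
        return False                                  # too short or too long
    alpha_chars = sum(ch.isalpha() for ch in text)
    if alpha_chars / max(len(text), 1) < min_alpha_ratio:
        return False                                  # mostly markup, symbols, or numbers
    lowered = text.lower()
    if any(term in lowered for term in banned_terms):
        return False                                  # known boilerplate or unwanted content
    return True

# cleaned = [doc for doc in raw_documents if keep_document(doc)]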


A second issue is the diversity of data. We want the training data to cover as many types of
data as possible, so that the trained models can adapt to different downstream tasks easily. It has
been widely recognized that the quality and diversity of training data both play very important
roles in LLMs. An interesting example is that incorporating programming code into training data
has been found to be benecial for LLMs. The benets are demonstrated not only in enhancing the
programming abilities of LLMs, but also in improving reasoning for complex problems, especially
those requiring COT prompting. The concept “diversity” can be extended to include language
diversity as well. For example, many LLMs are trained on multi-lingual data, and therefore we
can handle multiple languages using a single model. While this approach shows strong abilities
in multi-lingual and cross-lingual tasks, its performance on specic languages largely depends on
the volume and quality of the data for those languages. It has been shown in some cases to provide
poor results for low-resource languages.
A third issue is the bias in training data. This is not a problem that is specific to LLMs but exists in many NLP systems. A common example is gender bias, where LLMs show a preference for one gender over another. This can partly be attributed to class imbalance in the training data; for example, the term nurse is more often associated with women. In order to debias the data, it is common practice to balance the categories of different language phenomena, such as gender, ethnicity, and dialects. The bias in data is also related to the diversity issue mentioned above. For example, since many LLMs are trained and aligned with English-centric data, they are biased towards the cultural values and perspectives prevalent among English-speaking populations. Increasing language diversity in training data can somewhat mitigate the bias.
Another issue with collecting large-scale data is the privacy concern. If LLMs are trained on data from extensive sources, this potentially leads to risks regarding the exposure of sensitive information, such as intellectual property and personal data. This is particularly concerning given the capacity of LLMs to represent patterns from the data they are trained on, which might inadvertently involve memorizing and reproducing specific details. A simple approach to privacy protection is to remove or anonymize sensitive information. For example, anonymization techniques can be applied to remove personally identifiable information from training data to prevent LLMs from learning from such data. However, in practice, erasing or redacting all sensitive data is difficult. Therefore, many LLMs, particularly those launched for public service, typically work with systems that can detect the potential exposure of sensitive data, or are fine-tuned to reject certain requests that could lead to information leakage.

2.2.2 Model Modifications

Training LLMs is difficult. A commonly encountered problem is that the training process becomes more unstable as LLMs get bigger. For example, one needs to choose a small learning rate to achieve stable training with gradient descent, but this in turn results in much longer training times. Sometimes, even when the training configuration is carefully designed, training may diverge at certain points during optimization. The training of LLMs is generally influenced by many factors, such as parameter initialization, batching, and regularization. Here, we focus on common modifications and improvements to the standard Transformer architecture, which are considered important in developing trainable LLMs.

2.2.2.1 Layer Normalization with Residual Connections

Layer normalization is used to stabilize training for deep neural networks. It is a process of subtracting the mean and dividing by the standard deviation. By normalizing layer output in this way, we can effectively reduce the covariate shift problem and improve the training stability. In Transformers, layer normalization is typically used together with residual connections. As described in Section 2.1.1, a sub-layer can be based on either the post-norm architecture, in which layer normalization is performed right after a residual block, or the pre-norm architecture, in which layer normalization is performed inside a residual block. While both of these architectures are widely used in Transformer-based systems [Wang et al., 2019], the pre-norm architecture has proven to be especially useful in training deep Transformers. Given this, most LLMs are based on the pre-norm architecture, expressed as output = F(LNorm(input)) + input.
A widely-used form of the layer normalization function is given by

LNorm(h) = α · (h − µ)/(σ + ϵ) + β    (2.23)

where h is a d-dimensional real-valued vector, µ is the mean of all the entries of h, and σ is the corresponding standard deviation. ϵ is introduced for the sake of numerical stability. α ∈ R^d and β ∈ R^d are the gain and bias terms.
A variant of layer normalization, called root mean square (RMS) layer normalization, only re-scales the input vector but does not re-center it [Zhang and Sennrich, 2019]. The RMS layer normalization function is given by

LNorm(h) = α · h/(σrms + ϵ) + β    (2.24)

where σrms is the root mean square of h, that is, σrms = (1/d · Σ_{k=1}^{d} h_k^2)^{1/2}. This layer normalization function is used in LLMs like the LLaMA series.
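As a concrete illustration, below is a minimal NumPy sketch of the RMS layer normalization in Eq. (2.24); the gain and bias vectors are set to ones and zeros purely for demonstration.

```python
import numpy as np

def rms_norm(h: np.ndarray, alpha: np.ndarray, beta: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """RMS layer normalization (Eq. 2.24): re-scale h by its root mean square
    (no re-centering), then apply the gain alpha and bias beta."""
    sigma_rms = np.sqrt(np.mean(h ** 2, axis=-1, keepdims=True))
    return alpha * h / (sigma_rms + eps) + beta

d = 8
h = np.random.randn(d)
print(rms_norm(h, alpha=np.ones(d), beta=np.zeros(d)).shape)   # (8,)
```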

2.2.2.2 Activation Functions in FFNs

In Transformers, FFN sub-layers are designed to introduce non-linearities into representation learning, and are found to be useful for preventing the representations learned by self-attention from degeneration^8 [Dong et al., 2021]. A standard form of the FFNs used in these sub-layers can be expressed as

FFN(h) = σ(hWh + bh)Wf + bf    (2.25)

where Wh ∈ R^{d×dh}, bh ∈ R^{dh}, Wf ∈ R^{dh×d}, and bf ∈ R^d are the parameters, and dh is the hidden size. σ(·) is the activation function of the hidden layer. A common choice for σ(·) is the rectified linear unit (ReLU), given by

σrelu(h) = max(0, h)    (2.26)

In practical implementations, increasing dh is helpful, and thus it is often set to a large number in LLMs. But a very large hidden size poses challenges for both training and deployment. In this case, the design of the activation function plays a relatively more important role in wide FFNs. There are several alternatives to the ReLU in LLMs. One of these is the Gaussian error linear unit (GeLU), which can be seen as a smoothed version of the ReLU. Rather than controlling the output by the sign of the input, the GeLU function weights its input by the percentile Pr(h ≤ h). Here h is a d-dimensional vector whose entries are drawn from the standard normal distribution Gaussian(0, 1)^9. Specifically, the GeLU function is defined to be

σgelu(h) = h Pr(h ≤ h) = h Φ(h)    (2.27)

where Φ(h) is the cumulative distribution function of Gaussian(0, 1), which can be implemented in convenient ways [Hendrycks and Gimpel, 2016]. The GeLU function has been adopted in several LLMs, such as BERT, GPT-3, and BLOOM.
Another family of activation functions which is popular in LLMs is gated linear unit (GLU)-based functions. The basic form of GLUs is given by

σglu(h) = σ(hW1 + b1) ⊙ (hW2 + b2)    (2.28)

where W1 ∈ R^{d×d}, b1 ∈ R^d, W2 ∈ R^{d×d}, and b2 ∈ R^d are model parameters. Different choices of σ(·) result in different versions of GLU functions. For example, if σ(·) is defined to be the GeLU function, we will have the GeGLU function

σgeglu(h) = σgelu(hW1 + b1) ⊙ (hW2 + b2)    (2.29)

This activation function has been successfully applied in LLMs like Gemma.
As another example, consider σ(·) to be the Swish function σswish(h) = h ⊙ Sigmoid(c · h) [Ramachandran et al., 2017]. Then, the SwiGLU function is given by

σswiglu(h) = σswish(hW1 + b1) ⊙ (hW2 + b2)    (2.30)

Both the PaLM and LLaMA series are based on the SwiGLU function. For more discussions of GLUs, the reader can refer to Shazeer [2020]'s work.

8 Here degeneration refers to the phenomenon in which the rank of a matrix is reduced after some processing.
9 Pr(h ≤ h) is an informal notation. It refers to a vector, with each entry representing the percentile for the corresponding entry of h.
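To make these GLU variants concrete, here is a minimal NumPy sketch of the SwiGLU function in Eq. (2.30); the weight matrices are randomly initialized for illustration, and the Swish constant c is fixed to 1, which is a common but not required choice.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def swish(x, c: float = 1.0):
    # Swish activation: x * Sigmoid(c * x), applied elementwise.
    return x * sigmoid(c * x)

def swiglu(h, W1, b1, W2, b2):
    # SwiGLU (Eq. 2.30): Swish(h W1 + b1) elementwise-multiplied with (h W2 + b2).
    return swish(h @ W1 + b1) * (h @ W2 + b2)

d = 16
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(d, d)), np.zeros(d)
W2, b2 = rng.normal(size=(d, d)), np.zeros(d)
h = rng.normal(size=d)
print(swiglu(h, W1, b1, W2, b2).shape)   # (16,)
```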

2.2.2.3 Removing Bias Terms

Another popular model design is to remove the bias terms in affine transformations used in LLMs. This treatment can be applied to layer normalization, transformations of the inputs to QKV attention, and FFNs. For example, we can modify Eq. (2.25) to obtain an FFN with no bias terms

FFN(h) = σ(hWh)Wf    (2.31)

Chowdhery et al. [2022] report that removing bias terms helps improve the training stability of LLMs. This method has been used in several recent LLMs, such as LLaMA and Gemma.

2.2.2.4 Other Issues

Many LLMs also involve modifications to their positional embedding models. For example, one can replace sinusoidal positional encodings with rotary position embeddings so that the learned LLMs can handle long sequences better. These models will be discussed in Section 2.3.
Note that while model modifications are common in training LLMs, the stability of training can be improved in many different ways. For example, increasing the batch size as the training proceeds has been found to be useful for some LLMs. In general, achieving stable and efficient large-scale LLM training requires carefully designed setups, including learning schedules, optimizer choices, training parallelism, mixed precision training, and so on. Some of these issues are highly engineered, and therefore, we typically need a number of training runs to obtain satisfactory LLMs.

2.2.3 Distributed Training

Training LLMs requires significant amounts of computational resources. A common approach to improving training efficiency is to use large-scale distributed systems. Fortunately, alongside the rise of neural networks in AI, deep learning-oriented software and hardware have been developed, making it easier to implement LLMs and perform computations. For example, one can now easily fine-tune an LLM using deep learning software frameworks and a machine with multiple GPUs. However, scaling up the training of LLMs is still challenging, and requires significant efforts in developing hardware and software systems for stable and efficient distributed training.
An important consideration of distributed training is parallelism. There are several forms of parallelism: data parallelism, model parallelism, tensor parallelism, and pipeline parallelism. Despite different ways to distribute computations across devices, these parallelism methods are based on a similar idea: the training problem can be divided into smaller tasks that can be executed simultaneously. The issue of parallelism in training LLMs has been extensively studied [Narayanan et al., 2021; Fedus et al., 2022]. Here we sketch the basic concepts.

• Data Parallelism. This method is one of the most widely used parallelism methods for training neural networks. To illustrate, consider the simplest case where the standard delta rule is used in gradient descent

  θt+1 = θt − lr · ∂Lθt(Dmini)/∂θt    (2.32)

  where the new parameters θt+1 are obtained by updating the latest parameters θt with a small step lr in the direction of the negative loss gradient. ∂Lθt(Dmini)/∂θt is the gradient of the loss with respect to the parameters θt, and is computed on a minibatch of training samples Dmini. In data parallelism, we divide Dmini into N smaller batches, denoted by {D1, ..., DN}. Then, we distribute these batches to N workers, each with a corresponding batch. Once the data is distributed, these workers can work at the same time. The gradient of the entire minibatch is obtained by aggregating the gradients computed by the workers, like this

  ∂Lθt(Dmini)/∂θt = ∂Lθt(D1)/∂θt + ∂Lθt(D2)/∂θt + ··· + ∂Lθt(DN)/∂θt    (2.33)

  where the n-th term on the right-hand side is computed by worker n. In ideal cases where the workers coordinate well and the communication overhead is small, data parallelism can achieve nearly an N-fold speed-up for training.

• Model Parallelism. Although data parallelism is simple and effective, it requires each worker to run the entire LLM and perform the complete forward and backward process. As LLMs grow larger, it sometimes becomes infeasible to load and execute an LLM on a single device. In this case, we can decouple the LLM into smaller components and run these components on different devices. One simple way to do this is to group consecutive layers in the layer stack and assign each group to a worker. The workers operate in the order of the layers in the stack, that is, in the forward pass we process the input from lower-level to upper-level layers, and in the backward pass we propagate the error gradients from upper-level to lower-level layers. Consider, for example, a Transformer decoder with L stacked blocks. To distribute the computation load, each block is assigned to a worker. See the following illustration for a single run of the forward and backward passes of this model.

Worker L BL (↑) BL (↓)

... ... ...

Worker 2 B2 (↑) B2 (↓)

Worker 1 B1 (↑) B1 (↓)

Here Bl denotes the computation of block l, and the symbols ↑ and ↓ denote the forward and backward passes, respectively. Note that this parallelism method forces the workers to run in sequence, so a worker has to wait for the previous worker to finish its job. This results in the devices being idle for most of the time. In practical systems, model parallelism is generally used together with other parallelism mechanisms to maximize the use of devices.

• Tensor Parallelism. Parallelism can also be performed in a single computation step. A common example is splitting a large parameter matrix into chunks, multiplying an input tensor with each of these chunks separately, and then concatenating the results of these multiplications to form the output (see the sketch after this list). For example, consider the multiplication of the representation h ∈ R^d with the parameter matrix Wh ∈ R^{d×dh} in an FFN sub-layer (see Eq. (2.25)). We can slice the matrix Wh vertically into a sequence of M sub-matrices

  Wh = [Wh^1 Wh^2 ... Wh^M]    (2.34)

  where each sub-matrix Wh^k has a shape of d × dh/M. The multiplication of h with Wh can then be expressed as

  hWh = h[Wh^1 Wh^2 ... Wh^M] = [hWh^1 hWh^2 ... hWh^M]    (2.35)

  We can perform the matrix multiplications {hWh^1, hWh^2, ..., hWh^M} on M devices separately. As a result, we distribute a large matrix multiplication across multiple devices, each of which may have relatively small memory. From the perspective of the design of modern GPUs, tensor parallelism over GPUs provides a two-level, tile-based approach to parallel computing. First, at a higher level, we decompose a matrix multiplication into sub-matrix multiplications that can directly fit into the memory of GPUs. Then, at a lower level, we execute these sub-matrix multiplications on GPUs using tile-based parallel algorithms that are specifically optimized for GPUs.

• Pipeline Parallelism. Above, in model parallelism, we have described a simple approach to spreading groups of model components across multiple devices. But this method is inefficient because only one device is activated at a time during processing. Pipeline parallelism addresses this issue by introducing overlaps between computations on different devices [Harlap et al., 2018; Huang et al., 2019]. To do this, a batch of samples is divided into a number of micro-batches, and then these micro-batches are processed by each worker as usual. Once a micro-batch is processed by a worker and passed to the next one, the following micro-batch immediately occupies the same worker. In other words, we create a pipeline in which different computation steps can overlap if multiple jobs are given to the pipeline. The following shows an illustration of pipeline parallelism for processing 3 micro-batches.

Worker L BL,1 BL,2 BL,3 BL,1 BL,2 BL,3

... ... ...

Worker 2 B2,1 B2,2 B2,3 B2,1 B2,2 B2,3

Worker 1 B1,1 B1,2 B1,3 B1,1 B1,2 B1,3

Here Bl,k represents the processing of the k-th micro-batch by the l-th worker. Ideally we would like to maximize the number of micro-batches, and thus minimize the idle time of the workers. However, in practice, using small micro-batches often reduces GPU utilization and increases task-switching costs. This may, in turn, decrease the overall system throughput.
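The column-wise split used in tensor parallelism (Eq. (2.35)) can be checked with a few lines of NumPy. The sketch below runs on a single process, with each matrix chunk standing in for the computation of one device; the sizes are illustrative.

```python
import numpy as np

# Single-process sketch of tensor parallelism for hW_h (Eq. 2.35): the parameter
# matrix is split column-wise into M chunks (one per device); concatenating the
# partial products recovers the output of the unsplit multiplication.
rng = np.random.default_rng(0)
d, d_h, M = 8, 32, 4
h = rng.normal(size=(1, d))            # input representation
W_h = rng.normal(size=(d, d_h))        # FFN weight matrix

chunks = np.split(W_h, M, axis=1)                 # W_h^1, ..., W_h^M, each (d, d_h/M)
partial_outputs = [h @ W for W in chunks]         # run on M devices in parallel
output = np.concatenate(partial_outputs, axis=1)  # gather and concatenate

assert np.allclose(output, h @ W_h)
print(output.shape)   # (1, 32)
```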

The ultimate goal of parallel processing is to achieve linear growth in efficiency, that is, the number of samples that can be processed per unit of time increases linearly with the number of devices. However, distributed training is complicated, and influenced by many factors in addition to the parallelism method we choose. One problem, which is often associated with distributed systems, is the cost of communication. We can think of a distributed system as a group of networked nodes. Each of these nodes can perform local computation or pass data to other nodes. If there are a large number of such nodes, it will be expensive to distribute and collect data across them. Sometimes, the time savings brought about by parallelism are offset by the communication overhead of a large network. Another problem with large-scale distributed systems is that the synchronization of nodes introduces additional costs. As is often the case, some nodes may take longer to work, causing others to wait for the slowest ones. While we can use asynchronous training to handle heterogeneity in computational resources, this may lead to stale gradients and non-guaranteed convergence. Moreover, as more nodes are added to the network, there is more chance to have crashed nodes during training. In this case, we need to ensure that the whole system is fault tolerant. In many practical settings, to increase scalability, one needs to take into account additional issues, including architecture design, data transfer and computation overlap, load balancing, memory bandwidth, and so on.
Training LLMs is so computationally expensive that, even though distributed training is already in use, researchers and engineers often still employ various model compression and speed-up methods to improve training efficiency [Weng, 2021]. One example is mixed precision training, in which low precision data (such as FP16 and FP8 data) is used for gradient computation on each individual node, and single or double precision data (such as FP32/FP64 data) is used for updating the model [Micikevicius et al., 2018]. A key operation in this approach is gradient accumulation, where gradients need to be accumulated and synchronized across nodes. However, due to the non-associativity of floating-point addition, this can lead to slight numerical differences in accumulated gradients on different nodes, which may affect model convergence and final performance. This problem is more obvious if there are a large number of nodes involved in distributed training, especially given that low-precision numerical computations may encounter overflow and underflow issues, as well as inconsistencies across different hardware devices. Therefore, the design of distributed systems needs to consider these numerical computation issues to ensure satisfactory results and convergence.

2.2.4 Scaling Laws

The success of LLMs reveals that training larger language models using more resources can lead to improved model performance. Researchers have explained this as the scaling laws of LLMs. More specifically, scaling laws describe the relationships between the performance of LLMs and the attributes of LLM training, such as the model size, the amount of computation used for training, and the amount of training data. For example, Hestness et al. [2017] show that the performance of deep neural networks is a power-law-like function of the training data size. In the beginning, when the amount of training data is not large, the performance of the model improves slowly. Afterward, when more training data is used, the model enters a phase of rapid performance improvement, and the performance curve resembles a power-law curve. Ultimately, the improvement in performance becomes slow again, and more data does not lead to significant gains. Figure 2.3 shows an example of such curves.

[Figure: number of test errors (log-scale) plotted against training dataset size (log-scale), divided into a slow reduction phase, a power-law reduction phase, and a convergence phase, with an irreducible error floor.]

Fig. 2.3: A scaling law of test error against a variable of interest (e.g., training dataset size) [Hestness et al., 2017]. The curve of the scaling law can be divided into three phases. At the beginning, the number of test errors decreases slowly when more training data is used, but this only lasts for a short period. In the second phase, the number of test errors decreases drastically, and the curve becomes a power-law curve. After that, the error reduction slows down again in the third phase. Note that there are irreducible errors that cannot be eliminated, regardless of the amount of training data.
In NLP, a traditional view holds that the performance gains will disappear at a certain point as the training is scaled up. However, recent results show that, if we consider the problem on a larger scale, scaling up training is still a very effective method for obtaining stronger LLMs. For example, both closed-source and open-source LLMs can benefit from more data, even though trillions of tokens have already been used for training.
With the increase in the scale of model training, LLMs exhibit new capabilities, known as the emergent abilities of LLMs. For example, Wei et al. [2022b] studied the scaling properties of LLMs across different model sizes and amounts of computational resources. Their work shows that some abilities emerge when we scale the model size to a certain level. The appearance of emergent abilities has demonstrated the role of scaled training in enhancing the performance of LLMs, and it has also, to some extent, motivated researchers to continuously attempt to train larger models. As larger and stronger LLMs continue to appear, our understanding of the scaling laws continues to mature. This helps researchers predict the performance of LLMs during training and estimate the minimal computational resources required to achieve a given level of performance.
To understand how model performance scales with various factors considered during training, it is common to express the model performance as a function of these factors. For example, in the simplest case, we can express the loss or error of an LLM as a function of a single variable of interest. However, there are no universal scaling laws that can describe this relationship. Instead, different functions are proposed to fit the learning curves of LLMs.
Let x be the variable of interest (such as the number of model parameters) and L(x) be the loss of the model given x (such as the cross-entropy loss on test data). The simplest form of L(x) is a power law

L(x) = a · x^b    (2.36)


[Figure: test loss (log-scale) plotted against the number of parameters and against the dataset size.]

Fig. 2.4: Test loss against model size (N) and training dataset size (D) (data points are plotted for illustrative purposes). We plot test loss as a function of N, defined as L(N) = (N/(8.8 × 10^13))^−0.076, and as a function of D, defined as L(D) = (D/(5.4 × 10^13))^−0.095 [Kaplan et al., 2020].

where a and b are parameters that are estimated empirically. Despite its simplicity, this function has successfully interpreted the scaling ability of language models and machine translation systems in terms of model size (denoted by N) and training dataset size (denoted by D) [Gordon et al., 2021; Hestness et al., 2017]. For example, Kaplan et al. [2020] found that the performance of their language model improves as a power law of either N or D after an initial transient period, and expressed these relationships using L(N) = (N/(8.8 × 10^13))^−0.076 and L(D) = (D/(5.4 × 10^13))^−0.095 (see Figure 2.4).
An improvement to this scaling law is to add an irreducible error term to the power law. The form of L(x) is then given by

L(x) = a · x^b + ϵ∞    (2.37)

where ϵ∞ is the irreducible error that accounts for the error due to unknown variables, which is present even as x → ∞. Eq. (2.37) is one of the most widely used forms for designing scaling laws of LLMs. For example, Rosenfeld et al. [2020] developed a scaling law that involves both model scaling and dataset scaling, like this

L(N, D) = a · N^b + c · D^d + ϵ∞    (2.38)

An example of such a formulation is the Chinchilla scaling law. It states that the test loss per token is the sum of inverse proportion functions of N and D, with an additional irreducible error term. Hoffmann et al. [2022] express this scaling law as

L(N, D) = 406.4/N^0.34 + 410.7/D^0.28 + 1.69    (2.39)

where the first term accounts for model scaling, the second term for dataset scaling, and the constant 1.69 is the irreducible error.
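As a small illustration, the function below evaluates the Chinchilla scaling law in Eq. (2.39) for a few hypothetical (N, D) settings; the model sizes and token counts are arbitrary values chosen only to show how the formula is used.

```python
def chinchilla_loss(N: float, D: float) -> float:
    """Chinchilla scaling law (Eq. 2.39): test loss per token as a function of the
    number of parameters N and the number of training tokens D."""
    return 406.4 / N ** 0.34 + 410.7 / D ** 0.28 + 1.69

# Illustrative settings: 1B and 70B parameters, each trained on 1.4T tokens.
for N in (1e9, 7e10):
    print(f"N = {N:.0e}, D = 1.4e12 -> predicted loss {chinchilla_loss(N, 1.4e12):.3f}")
```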

All the scaling laws mentioned above are based on monotonic functions. So they cannot cover functions with inflection points, such as double descent curves. In response, researchers have explored more sophisticated functions to fit the learning curves. Examples of such functions can be found in Alabdulmohsin et al. [2022] and Caballero et al. [2023]'s work.


The signicance of scaling laws lies in providing directional guidance for LLM research: if
we are still in the region of the power law curve, using more resources to train larger models is a
very promising direction. While this result “forces” big research groups and companies to invest
more in computational resources to train larger models, which is very expensive, scaling laws
continuously push the boundaries of AI further away. On the other hand, understanding scaling
laws helps researchers make decisions in training LLMs. For example, given the computational
resources at hand, the performance of LLMs may be predicted.
One last note on scaling laws in this section. For LLMs, a lower test loss does not always
imply better performance on all downstream tasks. To adapt LLMs, there are several steps such
as ne-tuning and prompting that may inuence the nal result. Therefore, the scaling laws for
different downstream tasks might be different in practice.

2.3 Long Sequence Modeling

We have already seen that, in large-scale training, larger language models can be developed by using more data and computational resources. However, scaling up can also occur in other directions. For instance, in many applications, LLMs are adapted to process significantly longer sequences. An interesting example is that we pre-train an LLM on extensive texts of normal length and then apply it to deal with very long token sequences, far beyond the length encountered in pre-training. Here we use Pr(y|x) to denote the text generation probability, where x is the context and y is the generated text. There are broadly three types of long sequence modeling problems.

• Text generation based on long context (i.e., x is a long sequence). For example, we
generate a short summary for a very long text.

• Long text generation (i.e., y is a long sequence). For example, we generate a long story
based on a few keywords.

• Long text generation based on long context (i.e., both x and y are long sequences). For
example, we translate a long document from Chinese to English.

Recently, NLP researchers have become more interested in applying and evaluating LLMs on tasks where extremely long input texts are involved. Imagine an LLM which reads a C++ source file containing tens of thousands of lines, and outlines the functionality of the program corresponding to the source file. Such models, capable of handling extensive textual contexts, are sometimes called long-context LLMs. In this section we will restrict ourselves to long-context LLMs, but the methods discussed here can be applicable to other problems.
For Transformers, dealing with long sequences is computationally expensive, as the computa-
tional cost of self-attention grows quadratically with the sequence length. This makes it infeasible
to train and deploy such models for very long inputs. Two strands of research have tried to adapt
Transformers to long-context language modeling.

• The rst explores efcient training methods and model architectures to learn self-attention
models from long-sequence data.

• The other adapts pre-trained LLMs to handle long sequences with modest or no fine-tuning efforts.

Here, we will discuss the former briefly since it can be found in general discussions of efficient Transformer architectures [Tay et al., 2020; Xiao and Zhu, 2023]. We will focus on the latter,
highlighting popular methods in recent LLMs. We will also discuss the strengths and limitations
of these long-sequence models.

2.3.1 Optimization from HPC Perspectives

We begin our discussion by considering improvements to standard Transformer models from the perspectives of high-performance computing. Most of these improvements, though not specifically designed for LLMs, have been widely applied across various deep learning models [Kim et al., 2023]. A commonly used approach is to adopt a low-precision implementation of Transformers. For example, we can use 8-bit or 16-bit fixed-point data types for arithmetic operations, instead of 32-bit or 64-bit floating-point data types. Using these low-precision data types can increase the efficiency and memory throughput, so that longer sequences can be processed more easily. An alternative approach is to improve Transformers by using hardware-aware techniques. For example, on modern GPUs, the efficiency of Transformers can be improved by using IO-aware implementations of the self-attention function [Dao et al., 2022; Kwon et al., 2023].
Another way to handle long sequences is through sequence parallelism [Li et al., 2023b;
Korthikanti et al., 2023]. Specifically, consider the general problem of attending the query qi
at the position i to the keys K and values V. We can divide K by rows and obtain a set of sub-
matrices {K[1] , ..., K[nu ] }, each corresponding to a segment of the sequence. Similarly, we can
obtain the sub-matrices of V, denoted by {V[1] , ..., V[nu ] }. Then, we assign each pair of K[u] and
V[u] to a computing node (e.g., a GPU of a GPU cluster). The assigned nodes can run in parallel,
thereby parallelizing the attention operation.
Recall that the output of the self-attention model can be written as

Attqkv(qi, K, V) = Σ_{j=0}^{m−1} αi,j vj    (2.40)

where αi,j is the attention weight between positions i and j. In Transformers, αi,j is obtained
by normalizing the rescaled version of the dot product between qi and kj . Let βi,j denote the
attention score between qi and kj . We have

βi,j = (qi · kj)/√d + Mask(i, j)    (2.41)

where Mask(i, j) is the masking variable for (i, j). Then, we define the attention weight αi,j to be

αi,j = Softmax(βi,j) = exp(βi,j) / Σ_{j′} exp(βi,j′)    (2.42)

On each computing node, we need to implement these equations. Given the keys and values assigned to this node, computing the numerator of the right-hand side of Eq. (2.42) (i.e., exp(βi,j)) is straightforward, as all the required information is stored on the node. However, computing the denominator of the right-hand side of Eq. (2.42) involves a sum of exp(βi,j′) over all j′s, which requires transferring data to and from other nodes. To illustrate, suppose that vj and kj are placed on node u. We can rewrite Eq. (2.42) as

αi,j = exp(βi,j) / ( Σ_{kj′ ∈ K[1]} exp(βi,j′) + ··· + Σ_{kj′ ∈ K[u]} exp(βi,j′) + ··· + Σ_{kj′ ∈ K[nu]} exp(βi,j′) )    (2.43)

where αi,j and the u-th summation in the denominator are computed on node u, and the remaining summations are computed on nodes 1, ..., nu. The notation kj′ ∈ K[u] represents that kj′ is a row vector of K[u]. In a straightforward implementation, we first perform the summations {Σ_{kj′ ∈ K[u]} exp(βi,j′)} separately on the corresponding nodes. Then, we collect these summation results from different nodes to combine them into a final result. This corresponds to a collective operation in the context of parallel processing. There are many efficient implementations of such operations, such as the all-reduce algorithms. Hence the sum of all exp(βi,j) values can be computed using optimized routines in collective communication toolkits.
Given the attention weights {αi,j}, we then compute the attention results using Eq. (2.40). The problem can be re-expressed as

Attqkv(qi, K, V) = Σ_{vj′ ∈ V[1]} αi,j′ vj′ + ··· + Σ_{vj′ ∈ V[u]} αi,j′ vj′ + ··· + Σ_{vj′ ∈ V[nu]} αi,j′ vj′    (2.44)

where the u-th term is computed on node u. Like Eq. (2.43), Eq. (2.44) can be implemented as a summation program in parallel processing. First, we perform the weighted summations of values on different nodes simultaneously. Then, we collect the results from these nodes via collective operations.
Note that, although this section primarily focuses on long sequence modeling, much of the mo-
tivation for sequence parallelism comes from the distributed training methods of deep networks,
as discussed in Section 2.2.3. As a result, the implementation of these methods can be based on
the same parallel processing library.
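The chunk-wise computation in Eqs. (2.43) and (2.44) can be verified with a short NumPy sketch. Each chunk of keys and values below stands in for one node, and the final division plays the role of the collective (all-reduce) step; masking and the usual numerical-stability tricks (e.g., subtracting the maximum score) are omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, n_nodes = 16, 64, 4
q = rng.normal(size=d)
K = rng.normal(size=(m, d))
V = rng.normal(size=(m, d))

def attend_full(q, K, V):
    # Reference: standard (unparallelized) attention for a single query.
    scores = K @ q / np.sqrt(d)
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights @ V

num, den = np.zeros(d), 0.0
for K_u, V_u in zip(np.split(K, n_nodes), np.split(V, n_nodes)):
    scores_u = K_u @ q / np.sqrt(d)    # local attention scores on node u
    num += np.exp(scores_u) @ V_u      # local weighted sum of values (Eq. 2.44)
    den += np.exp(scores_u).sum()      # local softmax denominator (Eq. 2.43)
output = num / den                     # combined after the collective step

assert np.allclose(output, attend_full(q, K, V))
print(output.shape)   # (16,)
```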

2.3.2 Efficient Architectures

One difculty of applying Transformers to long sequences is that self-attention has a quadratic
time complexity with respect to the sequence length. Moreover, a key-value cache (or KV cache
for short) is maintained during inference, and its size increases as more tokens are processed. Al-
though the KV cache grows linearly with the sequence length, for extremely long input sequences,
the memory footprint becomes signicant and it is even infeasible to deploy LLMs for such tasks.
As a result, the model architecture of long-context LLMs generally moves away from the standard
2.3 Long Sequence Modeling 69

Transformer, turning instead to the development of more efcient variants and alternatives.
One approach is to use sparse attention instead of standard self-attention. This family of
models is based on the idea that only a small number of tokens are considered important when
attending to a given token, and so most of the attention weights between tokens are close to zero.
As a consequence, we can prune most of the attention weights and represent the attention model
in a compressed form. To illustrate, consider the self-attention model

Attqkv (Q, K, V) = α(Q, K)V (2.45)

where the attention weight matrix α(Q, K) ∈ R^{m×m} is obtained by

α(Q, K) = Softmax(QK^T/√d + Mask)
        = [ α0,0      0        0        ...   0
            α1,0     α1,1      0        ...   0
            α2,0     α2,1     α2,2      ...   0
            ...      ...      ...       ...   ...
            αm−1,0   αm−1,1   αm−1,2    ...   αm−1,m−1 ]    (2.46)

 
Each row vector [αi,0 ... αi,i 0 ... 0] corresponds to a distribution of attending the i-th token to every token of the sequence. Since language models predict next tokens only based on their left-context, we normally write the output of the attention model at position i as

Attqkv(qi, K≤i, V≤i) = [αi,0 ... αi,i] [v0; ...; vi] = Σ_{j=0}^{i} αi,j vj    (2.47)

   
where K≤i = [k0; ...; ki] and V≤i = [v0; ...; vi] are the keys and values up to position i, stacked as rows.
In the original version of self-attention, [αi,0 ... αi,i] is assumed to be dense, that is, most of the values are non-zero. In sparse attention, only some of the entries of [αi,0 ... αi,i] are considered non-zero, and the remaining entries are simply ignored in computation. Suppose G ⊆ {0, ..., i} is the set of indices of the non-zero entries. For language models, the output of the sparse attention model at position i is given by

Attsparse(qi, K≤i, V≤i) = Σ_{j∈G} α′i,j vj    (2.48)

Here {α′i,j} are normalized over G. Hence their values are different from the original attention weights (in fact we have α′i,j > αi,j). The sparsity of the model is determined by how large G is. Sparse attention models differ in the way we define G. One simple approach is to define G based on heuristically designed patterns. For example, a widely-used pattern involves having G cover a window of tokens located near position i [Parmar et al., 2018].
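A minimal sketch of such a window-based pattern is given below: the index set G covers only the w most recent positions, and the attention weights are re-normalized over G as in Eq. (2.48); the sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, i, w = 16, 63, 8
K = rng.normal(size=(i + 1, d))   # keys k_0, ..., k_i
V = rng.normal(size=(i + 1, d))   # values v_0, ..., v_i
q = rng.normal(size=d)            # query at position i

G = list(range(max(0, i - w + 1), i + 1))          # window of positions near i
scores = np.array([q @ K[j] / np.sqrt(d) for j in G])
alpha = np.exp(scores) / np.exp(scores).sum()      # weights normalized over G only
output = alpha @ V[G]                              # Eq. (2.48)
print(output.shape)   # (16,)
```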
While sparse attention reduces the computation through the use of sparse operations, such models still have significant limitations, as we must keep the entire KV cache (i.e., K≤i and V≤i) during inference. If the sequence is very long, storing this cache will become highly memory-intensive. To address this, we can consider a different form of attention models where the KV cache is not explicitly retained. Linear attention is one such approach [Katharopoulos et al., 2020]. It uses a kernel function φ(·) to project each query and key onto points qi′ = φ(qi) and ki′ = φ(ki), respectively. By removing the Softmax function under such transformations^10, the form of the resulting attention model is given by

Attqkv(qi, K≤i, V≤i) ≈ Attlinear(qi′, K′≤i, V≤i) = (qi′ µi) / (qi′ νi)    (2.49)

where µi and νi are variables that are computed in the recurrent forms

µi = µi−1 + ki′^T vi    (2.50)
νi = νi−1 + ki′^T    (2.51)

µi and νi can be seen as representations of the history up to position i. A benefit of this model is that we need not keep all past queries and values. Instead only the latest representations µi and νi are used. So the computational cost of each step is a constant, and the model can be easily extended to deal with long sequences.
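The recurrence in Eqs. (2.49)-(2.51) can be sketched in a few lines of NumPy. The feature map φ below (elu(x) + 1, which keeps features positive) is one common choice but is an assumption here; the point is that only the fixed-size statistics µ and ν are carried across steps.

```python
import numpy as np

def phi(x):
    # Kernel feature map: elu(x) + 1, which keeps all features positive.
    return np.where(x > 0, x + 1.0, np.exp(x))

rng = np.random.default_rng(0)
d, m = 16, 32
Q, K, V = (rng.normal(size=(m, d)) for _ in range(3))

mu = np.zeros((d, d))   # accumulates phi(k_i)^T v_i   (Eq. 2.50)
nu = np.zeros(d)        # accumulates phi(k_i)^T       (Eq. 2.51)
outputs = []
for i in range(m):
    k_prime, q_prime = phi(K[i]), phi(Q[i])
    mu += np.outer(k_prime, V[i])
    nu += k_prime
    outputs.append(q_prime @ mu / (q_prime @ nu))   # Eq. (2.49)
print(np.stack(outputs).shape)   # (32, 16)
```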
In fact, this sequential approach to long sequence modeling arises naturally when we adopt a viewpoint of recurrent models. Such models read one token (or a small number of tokens) at a time, update the recurrent state using these inputs, and then discard them before the next token arrives. The output at each step is generated based only on the recurrent state, rather than on all the previous states. The memory footprint is determined by the recurrent state, which has a fixed size. Recurrent models can be used in real-time learning scenarios where data arrives in a stream and predictions can be made at any time step. In NLP, applying recurrent models to language modeling is one of the earliest successful attempts to learn representations of sequences. Although the Transformer has been used as the foundational architecture in LLMs, recurrent models are still powerful models, especially for developing efficient LLMs. More recently, recurrent models have started their resurgence in language modeling and have been reconsidered as a promising alternative to Transformers [Gu and Dao, 2023]. Figure 2.5 shows a comparison of the models discussed in this subsection.

10 In the new space after this transformation, the Softmax normalization can be transformed into the simple scaling normalization.

[Figure: schematic comparison of (a) standard self-attention, which attends qi to all cached keys and values; (b) sparse attention, which attends to only a subset of them; (c) linear attention, which summarizes the history in the recurrent statistics µi and νi; and (d) recurrent models, which update a hidden state hi = f(hi−1, inputi).]

Fig. 2.5: Illustrations of self-attention, sparse attention, linear attention and recurrent models. Blue boxes = cached states for producing the output at position i. f(·) = a recurrent cell.

2.3.3 Cache and Memory

LLMs based on the standard Transformer architecture are global models. The inference for these models involves storing the entire left-context in order to make predictions for future tokens. This requires a KV cache where the representations (i.e., keys and values) of all previously-generated tokens are kept, and the cost of caching grows as the inference proceeds. Above, we have discussed methods for optimizing this cache via efficient attention approaches, such as sparse attention and linear attention. Another idea, which may have overlap with the previous discussion, is to explicitly encode the context via an additional memory model.

2.3.3.1 Fixed-size KV Cache

A straightforward approach is to represent the keys and values using a fixed-size memory model. Suppose we have a memory Mem which retains the contextual information. We can write the attention operation at position i in a general form

Att(qi, Mem) = Attqkv(qi, K≤i, V≤i)    (2.52)

In this model, Mem is simply the KV cache, i.e., Mem = (K≤i, V≤i). Thus the size of Mem is determined by i. If we define Mem as a fixed-size variable, then the cost of performing Att(qi, Mem) will be fixed. There are several alternative ways to design Mem.

• One of the simplest methods is to consider a fixed-size window of previous keys and values. Mem is therefore given by

  Mem = (K[i−nc+1,i], V[i−nc+1,i])    (2.53)

  where nc denotes the size of the window. The notation K[i−nc+1,i] and V[i−nc+1,i] denotes the keys and values over positions from i − nc + 1 to i.^11 This model can be seen as a type of local attention model.

• It is also possible to define Mem as a pair of summary vectors, which leads to a more compressed representation of the history. A simple way to summarize the previous keys and values is to use their moving average. For example, Mem can be defined as the unweighted moving average of the previous nc keys and values

  Mem = ( Σ_{j=i−nc+1}^{i} kj / nc , Σ_{j=i−nc+1}^{i} vj / nc )    (2.54)

  Alternatively, we can use a weighted version of the moving average

  Mem = ( Σ_{j=i−nc+1}^{i} βj−i+nc kj / Σ_{j=1}^{nc} βj , Σ_{j=i−nc+1}^{i} βj−i+nc vj / Σ_{j=1}^{nc} βj )    (2.55)

  Here {β1, ..., βnc} are the coefficients, which can be either learned as model parameters or determined via heuristics. For example, they can be set to increasing coefficients (i.e., β1 < β2 < ... < βnc−1 < βnc) in order to give larger weight to positions that are closer to i. We can extend the moving average to include all the positions up to i. This leads to the cumulative average of the keys and values, given in the form

  Mem = ( Σ_{j=0}^{i} kj / (i + 1) , Σ_{j=0}^{i} vj / (i + 1) )    (2.56)

  In general, the cumulative average can be written using a recursive formula

  Memi = ( (ki, vi) + i · Memi−1 ) / (i + 1)    (2.57)

  where Memi and Memi−1 denote the cumulative averages of the current and previous positions, respectively. An advantage of this model is that we only need to store a single key-value pair during inference, rather than storing all the key-value pairs (a small sketch of this recursive update appears after this list). Note that the above memory models are related to recurrent models, and more advanced techniques have been used to develop alternatives to self-attention mechanisms in Transformers [Ma et al., 2023].

  11 More formally, we write K[i−nc+1,i] = [ki−nc+1; ...; ki] and V[i−nc+1,i] = [vi−nc+1; ...; vi]. Sometimes we denote K[i−nc+1,i] by {ki−nc+1, ..., ki} and V[i−nc+1,i] by {vi−nc+1, ..., vi} for notation simplicity.

• The memory Mem can also be a neural network. At each step, it takes both the previous output of the memory and the current states of the model as input, and produces the new output of the memory. This neural network can be formulated as the function

  Mem = Update(Skv, Mempre)    (2.58)

  Here Mem and Mempre represent the outputs of the memory at the current step and the previous step, respectively. Skv is a set of key-value pairs, representing the recent states of the model. This formulation is general and allows us to develop various memory models by selecting different Update(·) and Skv configurations. For example, if Skv only contains the latest key-value pair (ki, vi) and Update(·) is defined as a recurrent cell, then Eq. (2.58) can be expressed as an RNN-like model

  Mem = f((ki, vi), Mempre)    (2.59)

  where f(·) is a recurrent cell. Recurrence can also be applied to segment-level modeling for efficiency considerations. A simple approach is that we can divide the sequence into segments, and treat Skv as a segment. Applying recurrent models to Update(·) will result in memory models that operate on segments. A special example is that we define Update(·) as an FIFO function that adds Skv into the memory and removes the oldest key-value segment from the memory, given by

  Mem = FIFO(Skv, Mempre)    (2.60)

  Consider a memory which includes two segments, one for the current segment, and one for the previous segment. In the attention operation, each position can access the history key-value pairs in the two closest consecutive segments. This essentially defines a local memory, but it and its variants have been widely used in segment-level recurrent models [Dai et al., 2019; Hutchins et al., 2022; Bulatov et al., 2022].

• The above memory models can be extended to involve multiple memories. An example of this approach is the compressive Transformer [Rae et al., 2019]. It employs two distinct fixed-size memories: one for modeling local context (denoted by Mem), and the other for modeling and compressing long-term history (denoted by CMem). The KV cache in this model is the combination of Mem and CMem. The attention function can be written as

  Attcom(qi, Mem, CMem) = Attqkv(qi, [Mem, CMem])    (2.61)

  where [Mem, CMem] is a combined memory of Mem and CMem. As with other segment-level models, the compressive Transformer model operates on segments of the sequence. Each segment is a sequence of ns consecutive tokens, and we denote Skv^k as the key-value pairs corresponding to the tokens of the k-th segment. When a new segment arrives, Mem is updated in an FIFO fashion: we append the key-value pairs in Skv^k to Mem, and then pop the ns oldest key-value pairs from Mem, which is given by

  Mem = FIFO(Skv^k, Mempre)    (2.62)

  The popped key-value pairs are then used to update the compressive memory CMem. These ns key-value pairs are compressed into ncs key-value pairs via a compression network. CMem is an FIFO queue which appends the compressed ncs key-value pairs to its tail and drops the first ncs key-value pairs of the queue. It is given by

  CMem = FIFO(Ckv^k, CMempre)    (2.63)

  where Ckv^k represents the set of compressed key-value pairs. Implicit in the compressive Transformer model is that local context should be represented explicitly with minimal information loss, while long-range context can be more compressed.

• We have already seen that both global and local contexts are useful and can be modeled using attention models. This view motivates the extension to attention models for combining both local and long-term memories [Ainslie et al., 2020; Zaheer et al., 2020; Gupta and Berant, 2020]. A simple but widely-used approach is to involve the first few tokens of the sequence in attention, serving as global tokens. This approach is usually applied along with other sparse attention models. An advantage of incorporating global tokens of the sequence is that it helps smooth the output distribution of the Softmax function used in attention weight computation, and thus stabilizes model performance when the context size is very large [Xiao et al., 2024]. One drawback, however, is that using a fixed-size global memory may result in information loss. When dealing with long sequences, we need to enlarge the KV cache for sufficient representations of the context, but this in turn increases the computational cost.
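As referenced in the list above, here is a minimal NumPy sketch of the cumulative-average memory of Eqs. (2.56)-(2.57): a single key-value pair summarizes the whole history, and the recursive update reproduces the full average; all vectors are random and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 16, 100
keys = rng.normal(size=(m, d))
values = rng.normal(size=(m, d))

mem_k, mem_v = np.zeros(d), np.zeros(d)
for i in range(m):
    # Recursive update (Eq. 2.57): Mem_i = ((k_i, v_i) + i * Mem_{i-1}) / (i + 1)
    mem_k = (keys[i] + i * mem_k) / (i + 1)
    mem_v = (values[i] + i * mem_v) / (i + 1)

# The fixed-size memory equals the cumulative average of all keys and values (Eq. 2.56).
assert np.allclose(mem_k, keys.mean(axis=0))
assert np.allclose(mem_v, values.mean(axis=0))
```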

Figure 2.6 shows illustrations of the above approaches. Note that, while we focus on optimization of the KV cache here, this issue is closely related to those discussed in the previous section. All of the methods we have mentioned so far can broadly be categorized as efficient attention approaches, which are widely used in various Transformer variants.

[Figure: four fixed-size KV cache designs — (a) a window-based cache storing the keys and values of the most recent positions, (b) a moving average-based cache storing a single averaged key-value pair, (c) a recurrent network as cache updated via Mem = Update(Skv, Mempre), and (d) a hybrid cache combining a compressed memory with a local memory.]

Fig. 2.6: Illustrations of fixed-size KV caches in LLMs. Blue boxes represent the keys and values generated during LLM inference, green boxes represent the keys and values stored or encoded in the primary memory, and orange boxes represent the keys and values stored or encoded in the compressed memory.

2.3.3.2 Memory-based Models

The modeling of memories discussed above was based on updates to the KV cache, and the resulting models are typically referred to as internal memories. We now consider another family of models, called external memories, which operate as independent models to access large-scale contexts for LLMs. Many such models are based on memory-based methods, which have been extensively discussed in machine learning [Bishop, 2006]. A common example is nearest neighbor algorithms: we store context representations in a datastore, and try to find the stored representations most similar to a given query. The retrieved context representations are then used to improve attention for this query.
Here, we consider the k-nearest neighbors (k-NN) method, which is one of the most popular memory-based methods. Since our focus is language modeling in this section, we define a sample in the datastore as a key-value pair corresponding to some context state. Note that “context” is a broad concept here, not just a sequence prefix in text generation. One might, for example, view the entire dataset as the context for predicting tokens. This allows us to retrieve the closest context situation in a set of sequences, rather than a given sequence prefix. Although we will restrict ourselves to context modeling for a single sequence in this subsection, we first discuss a relatively more general case.
Suppose we have a set of keys {kj} with corresponding values {vj}, and suppose we store these key-value pairs in a vector database^12. For each query qi, we find its k nearest neighbors by growing the radius of a sphere centered at qi until it contains k data points in {kj}. This results in a set of k keys along with their corresponding values, denoted by Memknn. As before, we denote Mem as the local memory for the query, such as the KV cache of neighboring tokens. Our goal is to attend query qi to both the local memory Mem and the long-term memory Memknn. There are, of course, several ways to incorporate Mem and Memknn into the attention model. For example, we might simply combine them to form a single KV cache [Mem, Memknn], and attend qi to [Mem, Memknn] via standard QKV attention. Or we might use Mem and Memknn in separate attention steps. An example of such approaches is the model developed by Wu et al. [2021]. It linearly combines the two types of attention, given by

Att(qi, Mem, Memknn) = g ⊙ Attlocal + (1 − g) ⊙ Attknn    (2.64)

Attlocal = Att(qi, Mem)    (2.65)
Attknn = Att(qi, Memknn)    (2.66)

Here g ∈ R^d is the coefficient vector, which can be the output of a learned gate.
Given the k-NN-based memory model described above, the remaining task is to determine
which key-value pairs are retained in the datastore. For standard language modeling tasks, we
consider the previously seen tokens in a sequence as the context, so we can add the keys and
values of all these tokens into the datastore. In this case, the resulting k-NN-based attention
model is essentially equivalent to a sparse attention model [Gupta et al., 2021].
Alternatively, we can extend the context from one sequence to a collection of sequences. For example, we might collect all key-value pairs across the sequences in a training dataset and add them to the datastore to model a larger context. Thus, LLMs can predict tokens based on a generalized context. A problem with this approach is that the computational cost would be large if many sequences are involved. Since these sequences are part of our training data, we can build and optimize an index for the vectors in the datastore before running the LLMs. As a result, the retrieval of similar vectors can be very efficient, as in most vector databases.

12 A vector database, or vector store, is a database that provides highly optimized retrieval interfaces for finding stored vectors that closely match a query vector.
In fact, all the above-mentioned methods can be viewed as instances of a retrieval-based approach. Instead of using retrieval results to improve attention, we can apply this approach in other ways as well. One application of k-NN-based search is k-NN language modeling (or k-NN LM) [Khandelwal et al., 2020]. Rather than attempting to extend the context used in self-attention by incorporating nearest neighbors in representation learning, the idea is based on the observation that, in practice, similar hidden states in Transformers are often highly predictive of similar tokens in subsequent positions. In k-NN LM, each item in the datastore is a key-value tuple (z, w), where z represents a hidden state of the LLM at a position, and w represents the corresponding prediction. A typical way to create the datastore is to collect the output vector of the Transformer layer stack and the corresponding next token for each position of each sequence in a training dataset. During inference, we have a representation hi given a prefix. Given this representation, we first search the datastore for the k closest matching data items {(z1, w1), ..., (zk, wk)}. Here {w1, ..., wk} are thought of as reference tokens for prediction, and thus can be used to guide the token prediction based on hi. One common way to make use of reference tokens is to define a distribution over the vocabulary V,

Prknn(·|hi) = Softmax([−d0 ··· −d|V|])    (2.67)

where dv equals the distance between hi and zj if wj equals the v-th entry of V, and equals 0 otherwise. We use a linear function with a coefficient λ that interpolates between the retrieval-based distribution Prknn(·|hi) and the LLM output distribution Prlm(·|hi)

Pr(·|hi) = λ · Prknn(·|hi) + (1 − λ) · Prlm(·|hi)    (2.68)

Then, as usual, we can choose the next token y by maximizing the probability Pr(y|hi).
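A minimal sketch of this interpolation is given below, following the simplified definition of Prknn in Eq. (2.67). The datastore, the toy vocabulary, and the stand-in LLM distribution are all synthetic; in a real system the stored vectors would be Transformer hidden states collected from training data.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d, vocab_size, store_size, k, lam = 16, 10, 200, 4, 0.25
datastore_z = rng.normal(size=(store_size, d))               # stored hidden states
datastore_w = rng.integers(0, vocab_size, size=store_size)   # corresponding next tokens

h = rng.normal(size=d)                          # hidden state for the current prefix
p_lm = softmax(rng.normal(size=vocab_size))     # stand-in for the LLM distribution

dist = np.linalg.norm(datastore_z - h, axis=1)
nearest = np.argsort(dist)[:k]                  # indices of the k closest stored states

neg_dist = np.zeros(vocab_size)
for j in nearest:
    neg_dist[datastore_w[j]] = -dist[j]         # -d_v for retrieved tokens, 0 otherwise
p_knn = softmax(neg_dist)                       # Eq. (2.67)

p = lam * p_knn + (1 - lam) * p_lm              # Eq. (2.68)
print("predicted token id:", int(np.argmax(p)))
```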
As with information retrieval (IR) systems, the datastore can also manage texts and provide access to relevant texts for a query. For example, we can store a collection of text documents in a search engine with full-text indexing, and then search it for documents that match a given text-based query. Applying IR techniques to LLMs leads to a general framework called retrieval-augmented generation (RAG). The RAG framework works as follows. We use the context x as the query and find the k most relevant document pieces {c1, ..., ck} from the datastore via efficient IR techniques^13. These search results are combined with the original context via a prompting template g(·)^14, resulting in an augmented input for the LLM

x′ = g(c1, ..., ck, x)    (2.69)

Then, we use x′ as the context and predict the following text using the model Pr(y|x′). One advantage of RAG is that we need not modify the architecture of LLMs, but instead augment the input to LLMs via an additional IR system. Figure 2.7 shows a comparison of the use of different external memories in LLMs.

13 In practical applications, queries are typically generated using a query generation system, which may expand the original query with variations of tokens and query intent.
14 For example, the template could be:

  message = {*c1*} ... {*ck*}

  input: {*x*}
  output:
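To illustrate, the following sketch builds the augmented input x′ of Eq. (2.69) with a template in the style of the footnote above; the exact wording of the template and the example documents are assumptions for demonstration only.

```python
def build_rag_input(chunks, x):
    # Prompting template g(.): prepend the retrieved pieces, then restate the query.
    message = " ".join("{*" + c + "*}" for c in chunks)
    return f"message = {message}\n\ninput: {{*{x}*}}\noutput:"

chunks = ["Deep networks are neural networks with many layers.",
          "Machine learning studies algorithms that learn from data."]
print(build_rag_input(chunks, "What is deep learning?"))
```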

2.3.3.3 Memory Capacity

A memory model in LLMs, in the form of a simple key-value cache or a datastore, can broadly be seen as an encoder of contextual information. Ideally, before we say that a memory model is representative of the entire context in token prediction, we need to make sure that the model can accurately represent any part of the context. The standard KV cache is one such model that completely stores all past history. In this case, the model is said to have adequate capacity for memorizing the context. In many practical applications, however, complete memorization is not required. Instead, the goal is to enable LLMs to access important contextual information. As a result, efficient and compressed memory models are developed, as described in this section. Note that the longer the sequence, the more difficult it becomes for a low-capacity memory model to capture important contextual information. It is therefore common practice to simply increase the model capacity when processing long contexts.
While high-capacity models are generally favorable, they are difficult to train and deploy. A challenging scenario is that the tokens arrive in a stream and the context continuously grows. Developing LLMs for such tasks is difficult as we need to train Transformers on extremely long sequences. A possible way to address this difficulty is to use non-parametric methods, such as retrieval-based methods. For example, as discussed above, we can use a vector database to store previously generated key-value pairs, and thus represent the context by this external memory model. Although this approach side-steps the challenge of representing long context in Transformers, building and updating external memory models are computationally expensive. These models are more often used in problems where the context is given in advance and fixed during inference, and hence are unsuitable for streaming context modeling.
In cases where the size of the context continuously grows, applying fixed-size memory models is a commonly used approach. For example, in recurrent models, a sequence of arbitrary length can be summarized into a set of hidden states, by which we have a fixed computational cost per step. While recurrent models were initially found to be not very good at handling long-distance dependencies in sequence modeling in early applications of deep learning to NLP, recent advancements have shown that their variants are now effective in modeling extremely long sequences [Bulatov et al., 2022; Hutchins et al., 2022; Munkhdalai et al., 2024; Ma et al., 2024].
14
For example, the template could be:

message = {*c1 *} ... {*ck *}


input: {*x*}
output:

[Figure: (a) k-NN Search Augmented Attention: the attention output combines Att(qi , Mem) over the LLM's own KV cache with Att(qi , Memknn ) over the k nearest keys/values retrieved from a datastore, gated as g ⊙ Att(qi , Mem) + (1 − g) ⊙ Att(qi , Memknn ). (b) k-NN Language Modeling: the output distribution interpolates Prknn (·), obtained by searching keys in a datastore of predicted tokens, with the LLM distribution Pr(·). (c) Retrieval-augmented Generation: the k most relevant text pieces are retrieved from the datastore and combined with the input context x.]
Fig. 2.7: Illustrations of external memories (or datastores) for language modeling.

There is no general denition of memory capacity in LLMs. A simple approach might consider
how much storage is used to retain contextual information. For example, memory capacity could
be dened by the size of the KV cache in Transformers or the vector database used in retrieval-
based methods. A related concept is model complexity. In machine learning, there are several
ways to dene the model complexity of a model. One of the simplest methods is by counting the
number of parameters. However, it should be emphasized that the memory models discussed here
primarily serve to store information, rather than add trainable parameters. Therefore, a model with
a large memory capacity is not necessarily more complex. Nevertheless, in practice determining
the capacity of a memory model is not straightforward. In general, we need to control the trade-off
between maximizing the performance and controlling the memory footprint.

2.3.4 Sharing across Heads and Layers

In Transformers, the KV cache is a data structure that can be dynamically adjusted along multiple
dimensions, such as heads, layers, and sequence length. For example, consider an LLM with L
layers. Each layer has τ attention heads, and each head produces a dh -dimensional output. During
inference, we store the keys and values for up to m tokens. The space complexity of this caching
mechanism is O(L · τ · dh · m). As we have seen previously, this complexity can be reduced by
caching the keys and values for fewer tokens. For example, in sliding window attention, a fixed-
size window is used to cache the keys and values in local context. This model then has a space
complexity of O(L · τ · dh · mw ), with mw being the size of the window.
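As a rough illustration of these space complexities, the following Python sketch estimates the memory footprint of a full KV cache versus a sliding-window cache. The model sizes and the assumption of 16-bit storage are made up for the example and do not refer to any specific LLM.

def kv_cache_bytes(num_layers, num_heads, head_dim, num_tokens, bytes_per_value=2):
    """Approximate KV-cache size: keys and values for every layer, head, and token.

    The leading factor 2 accounts for storing both keys and values;
    bytes_per_value=2 corresponds to 16-bit storage.
    """
    return 2 * num_layers * num_heads * head_dim * num_tokens * bytes_per_value

# Full cache for a hypothetical 32-layer model with 32 heads of dimension 128:
full = kv_cache_bytes(32, 32, 128, num_tokens=32_768)
# Sliding-window cache that only keeps the most recent 4,096 tokens:
window = kv_cache_bytes(32, 32, 128, num_tokens=4_096)
print(f"full: {full / 2**30:.1f} GiB, window: {window / 2**30:.1f} GiB")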
In addition to reducing m, we can also decrease the size of the KV cache along other di-
mensions. A widely-used approach is to enable sharing across heads in multi-head self-attention.
Recall from Section 2.1.1 that multi-head self-attention uses multiple sets of queries, keys, and
values (each set is called a head), each performing the QKV attention mechanism as usual. This
can be expressed as

Output = Merge(head1 , ..., headτ )Whead (2.70)

where headj ∈ Rdh is computed using the standard QKV attention function

headj = Attqkv (qi^[j] , K≤i^[j] , V≤i^[j] )    (2.71)

Here, qi^[j] , K≤i^[j] , and V≤i^[j] are the query, keys, and values that are projected onto the j-th feature
sub-space. So this model can be interpreted as performing attention on a group of feature sub-
spaces in parallel (see Figure 2.8 (b)). The KV cache needs to retain the keys and values for all
these heads, that is, {(K≤i^[1] , V≤i^[1] ), ..., (K≤i^[τ] , V≤i^[τ] )}.
One renement to the multi-head attention model, called multi-query attention (MQA), is to
share keys and values across heads, while allowing queries to be unique for each head [Shazeer,
2019]. In MQA, there is a single set of keys and values (K≤i , V≤i ). In addition, there are τ
[1] [τ ]
queries {qi , ..., qi }, each corresponding to a different head. For each head, we have
[j]
headj = Attqkv (qi , K≤i , V≤i ) (2.72)

Figure 2.8 (c) illustrates this model. By sharing keys and values, the size of the KV cache would
2.3 Long Sequence Modeling 81

value key query value key query

(a) Single-head Attention (b) Multi-head Attention

value key query value key query

(c) Multi-query Attention (d) Grouped Query Attention

value key query

Layer l
Sharing

Layer l − 1

(e) Cross-layer Multi-head Attention

Fig. 2.8: Illustration of QKV attention based on different multi-head and sharing mechanisms. (a) = single-head
attention, and (b-e) = attention with multiple heads.

be O(L · dh · m).
Grouped query attention (GQA) is a natural extension to multi-head attention and MQA
[Ainslie et al., 2023]. In GQA, heads are divided into ng groups, each corresponding to a shared
set of keys and values. Hence we have ng sets of keys and values {(K≤i^[1] , V≤i^[1] ), ..., (K≤i^[ng] , V≤i^[ng] )}.
See Figure 2.8 (d) for an illustration. Let g(j) be the group id for the j-th head. The GQA model
can be expressed as

headj = Attqkv (qi^[j] , K≤i^[g(j)] , V≤i^[g(j)] )    (2.73)

The size of the KV cache of GQA is O(L · ng · dh · m). One benefit of GQA is that we can trade off
between computational efficiency and model expressiveness by adjusting ng . When ng = τ , the
model becomes the standard multi-head attention model. By contrast, when ng = 1, it becomes
the MQA model.
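The following Python sketch illustrates how GQA shares keys and values within groups (Eq. (2.73)). It is a minimal illustration rather than a reference implementation: the head-to-group mapping g(j) = j // (τ / ng) and the NumPy-based layout are assumptions made here.

import numpy as np

def grouped_query_attention(q, K, V, n_groups):
    """One step of grouped-query attention (Eq. 2.73) for a single position.

    q: queries for all heads, shape [n_heads, d_h].
    K, V: cached keys/values shared within groups, shape [n_groups, seq_len, d_h].
    Head j is mapped to group g(j) = j // (n_heads // n_groups).
    """
    n_heads, d_h = q.shape
    heads_per_group = n_heads // n_groups
    outputs = []
    for j in range(n_heads):
        g = j // heads_per_group              # group id g(j) for head j
        scores = K[g] @ q[j] / np.sqrt(d_h)   # [seq_len]
        alpha = np.exp(scores - scores.max())
        alpha /= alpha.sum()                  # softmax over cached positions
        outputs.append(alpha @ V[g])          # weighted sum of values, [d_h]
    return np.concatenate(outputs)            # merged heads, [n_heads * d_h]

# n_groups = n_heads recovers standard multi-head attention;
# n_groups = 1 recovers multi-query attention (one shared K/V set).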


Sharing can also be performed across layers. Such a method falls into the family of shared
weight and shared activation methods, which have been extensively used in Transformers [Dehghani et al.,
2018; Lan et al., 2020]. For example, one can share KV activations or attention weights across
layers to reduce both computation and memory footprints [Xiao et al., 2019; Brandon et al., 2024].
Figure 2.8 (e) shows an illustration of this method, where a query in a layer directly accesses the
KV cache of a lower-level layer.

2.3.5 Position Extrapolation and Interpolation

Since Transformer layers are order-insensitive to input, we need some way to encode positional
information in the input tokens. To do this, it is common to add positional embeddings to token
embeddings, and then feed these combined embeddings into the Transformer layer stack as input.
In this case, the embedding at position i can be expressed as

ei = xi + PE(i) (2.74)

where xi ∈ Rd denotes the token embedding, and PE(i) ∈ Rd denotes the positional embedding.
In general, the token embedding xi is a position-independent vector, and so the positional embed-
ding PE(i) is used to encode the positional context. A straightforward approach is to treat PE(i)
as a learnable variable and train it alongside other model parameters. In this way, we can learn
a unique representation for each position, and thus distinguish the tokens appearing at different
positions of a sequence.
Representations of positions using learned vectors can work well in tasks where the sequences
at training and test times are of similar lengths. In practice, however, we often impose length
restrictions on sequences during training to prevent excessive computational costs, but wish to
apply the trained models to much longer sequences during inference. In this case, using learned
positional embeddings has obvious drawbacks, as there are no trained embeddings for positions
that are not observed in the training phase.
An alternative approach to modeling positional information is to develop positional embed-
dings that can generalize: once trained, the embedding model can be used to handle longer se-
quences. Suppose that we train a positional embedding model on sequences with a maximum
length of ml , and we wish to apply the trained model to a sequence of length m (m >> ml ). If
the embedding model is limited in the range of positions that we can observe from training data,
then this model will simply fail to deal with new data outside that range. See Figure 2.9 (a) for
an illustration where the learned embedding model cannot model data points outside the training
domain if it lacks the ability to extrapolate.
There are several approaches to making positional embedding models generalize. They can
be grouped into two classes.

• Extrapolation. The model learned on observed data points (i.e., positions) can be directly
employed to assign meaningful values to data points beyond the original range. For ex-
ample, suppose we have a series of numbers 1, 2, ..., 10, and we want to understand the
meaning of a new number, 15. Knowing that these numbers are natural numbers used for
ordering, we can easily infer that 15 is a number that follows 10, even though 15 has not
been observed before. Figure 2.9 (b) shows an example of this approach, where a function
is learned to fit the data points within a specific range and then applied to estimate the values
of data points outside that range.

• Interpolation. This approach maps a larger range of data points into the original obser-
vation range. For example, suppose we have a model designed for numbers in the range
[1, 10]. When given a new range of [1, 20], we can scale this down by dividing every num-
ber by 2, thereby fitting all numbers into [1, 10]. This scaling allows us to use the model
trained on the range [1, 10] to describe data points in the expanded range of [1, 20]. See
Figure 2.9 (c) for an illustration of this approach.

[Figure: three panels plotting embedding value against sequence length (0 to 2,048): (a) Encoding with No Generalization; (b) Extrapolation; (c) Interpolation]
Fig. 2.9: Illustrations of different positional embedding methods for a range of positions. Blue points represent the
positions that have been observed during training, and red points represent the positions that are newly observed at test
time. In sub-figure (a), the encoding model only memorizes the points seen during training, and cannot generalize. In
sub-figures (b) and (c), the model can generalize through extrapolation and interpolation.

In fact, positional embeddings in many systems have achieved some level of generalization.
For example, sinusoidal encoding, the most common positional embedding method, employs sine
and cosine functions that can naturally extend to sequences of any length. Although this approach
might seem direct and simple, it does not perform well when we significantly extend the sequences
for processing. In this subsection, we will discuss several alternative methods based on either
extrapolation or interpolation.

2.3.5.1 Attention with Learnable Biases

One problem with Eq. (2.74) is that the embedding model treats each token independently and
therefore ignores the distance between different tokens. A common improvement to this model,
called relative positional embedding, is to consider the pairwise relationship between tokens
[Shaw et al., 2018]. The general idea behind this is to obtain the offset between any pair of posi-
tions and incorporate it into the self-attention model. One of the simplest forms of self-attention
with relative positional embedding is given by
Attqkv (qi , K≤i , V≤i ) = Σ_{j=0}^{i} α(i, j) vj    (2.75)

α(i, j) = Softmax((qi kjT + PE(i, j)) / √d + Mask(i, j))    (2.76)

The only difference between this model and the original self-attention model is that a bias term
PE(i, j) is added to the query-key product in this new model. Intuitively, PE(i, j) can be inter-
preted as a distance penalty for the pair of positions i and j. As i moves away from j, the value of
PE(i, j) decreases.
PE(i, j) can be dened in several different ways. Here, we consider the T5 version of relative
positional embedding, called the T5 bias [Raffel et al., 2020]. For each pair of query qi and key
kj , the offset between them is dened to be15

d(i, j) = i − j (2.77)

A simple design for the bias PE(i, j) is to share the same learnable variable for all query-key
pairs with the same offset, i.e., PE(i, j) = ui−j , where ui−j is the variable corresponding to
the offset i − j. However, simply assigning a unique value to each offset will restrict this model
to observed offsets. When i − j is larger than the maximum trained offset, the model cannot
generalize.
The T5 bias instead adopts a generalization of this model. Rather than assigning each query-
key offset a unique bias term, it groups different offsets into “buckets”, each corresponding to
one learnable parameter. More specifically, the bias terms for nb + 1 buckets are given as follows.

• For buckets 0 to (nb + 1)/2 − 1, each bucket corresponds to one offset, that is, bucket 0 ↔ offset
0, bucket 1 ↔ offset 1, bucket 2 ↔ offset 2, and so on. We express this as b(i − j) = i − j.

• For buckets (nb + 1)/2 to nb , the size of each bucket increases logarithmically. For example, the
bucket number for a given offset i − j ≥ (nb + 1)/2 can be defined as

b(i − j) = (nb + 1)/2 + ⌊ (log(i − j) − log((nb + 1)/2)) / (log(distmax ) − log((nb + 1)/2)) · (nb + 1)/2 ⌋    (2.78)

where the parameter distmax is typically set to a relatively large number to indicate the
maximum offset we may encounter.
• When i − j > distmax , we place i − j in the last bucket. In other words, bucket nb contains
all the offsets that are not assigned to the previous buckets.

15
For language modeling, a query is only allowed to attend to its left-context, and so we have i − j ≥ 0. In the more
general case of self-attention, where a token can attend to all tokens in the sequence, we may have negative offsets
when i < j.

Together, these can be expressed as the function

b(i − j) = i − j,    if 0 ≤ i − j < (nb + 1)/2
b(i − j) = min( nb , (nb + 1)/2 + ⌊ (log(i − j) − log((nb + 1)/2)) / (log(distmax ) − log((nb + 1)/2)) · (nb + 1)/2 ⌋ ),    if i − j ≥ (nb + 1)/2    (2.79)

[Figure: buckets 0 to 15 each cover a single offset (fixed bucket size); buckets 16 to 31 cover logarithmically growing ranges of offsets (16∼20, 21∼26, 27∼33, ...); the last bucket (32) covers all remaining offsets up to ∞]
Fig. 2.10: Illustration of distributing query-key offsets into buckets in the T5 model (nb = 32 and distmax = 1024).
Boxes represent buckets. In the first half of the buckets, we use a fixed bucket size. In the second half of the buckets,
we increase the bucket size logarithmically. The last bucket contains all the query-key offsets that are not covered by
previous buckets.

Figure 2.10 shows an illustration of these buckets. We see that in the first half of the buckets,
each bucket is associated with only one value of i − j, while in the second half, the bucket size
increases as i − j grows. The last bucket is designed to handle sequences of arbitrarily long
lengths.
All PE(i, j)s in a bucket share the same bias term ub(i−j) . Substituting PE(i, j) = ub(i−j)
into Eq. (2.76), the attention weight for qi and kj becomes16

α(i, j) = Softmax((qi kjT + ub(i−j) ) / √d + Mask(i, j))    (2.81)

The parameters {u0 , ..., unb } are learned as common parameters during training. It should
be emphasized that this model can generalize to long sequences. This is because PE(i, j)s with
similar query-key offsets share the same parameter, and this sharing strategy is particularly im-
portant for achieving good generalization, given that large query-key offsets are rare in training.
In practice, we often set nb to a moderate number, which also helps control the overfitting of
positional embedding models.

16
Note that, in Raffel et al. [2020]’s T5 model, the rescaling operation for the query-key product is removed. The
attention weight α(i, j) is then given by

α(i, j) = Softmax(qi kjT + ub(i−j) + Mask(i, j)) (2.80)
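The bucket assignment in Eq. (2.79) can be sketched in Python as follows. This is only an approximate illustration: exact boundary handling (for example, how (nb + 1)/2 is rounded) varies across implementations, and the default values here simply follow the setting shown in Figure 2.10.

import math

def t5_bucket(offset, n_buckets=32, dist_max=1024):
    """Map a non-negative query-key offset i - j to a bucket id (cf. Eq. 2.79)."""
    half = (n_buckets + 1) // 2
    if offset < half:
        # Small offsets: one bucket per offset.
        return offset
    # Larger offsets: bucket width grows logarithmically up to dist_max.
    bucket = half + int(
        math.log(offset / half) / math.log(dist_max / half) * half
    )
    # Offsets beyond dist_max all fall into the last bucket.
    return min(bucket, n_buckets)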



2.3.5.2 Attention with Non-learned Biases

Relative positional embedding models are based on a set of learned biases for the query-key prod-
uct in self-attention. An alternative approach is to give these biases fixed values via heuristics,
rather than training them on a particular dataset. One benefit of this heuristics-based approach is
that it does not rely on a training process and thus can be directly applied to any sequences once
the biases are set.
One example of such an approach is Press et al. [2022]'s attention with linear biases, or
ALiBi for short. In the ALiBi approach, the bias term is defined as the negative scaled query-key
offset

PE(i, j) = −β · (i − j)
= β · (j − i) (2.82)

where β is the scaling factor. Adding this term to the query-key product, we obtain a new form of
attention weights

α(i, j) = Softmax((qi kjT + β · (j − i)) / √d + Mask(i, j))    (2.83)

This model can be interpreted as adding a fixed penalty to qi kjT whenever j moves one step
away from i. So we do not need to adapt it to a range of sequence lengths, and can employ it to
model arbitrarily long sequences. See Figure 2.11 for a comparison of the T5 bias and the ALiBi
bias.
bias.
In general, the scalar β should be tuned on a validation dataset. However, Press et al. [2022]
found that, for multi-head attention, setting β to values that decrease geometrically across heads
performs well on a variety of tasks. Specifically, for a self-attention sub-layer involving nhead
heads, the scalar for the k-th head is given by

βk = 1 / 2^(8k/nhead)    (2.84)
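The following sketch builds the ALiBi biases for all heads following Eqs. (2.82)-(2.84). The tensor layout is an assumption made for illustration, and the causal mask of Eq. (2.83) is assumed to be applied separately.

import numpy as np

def alibi_biases(seq_len, n_heads):
    """ALiBi bias tensor for causal attention, shape [n_heads, seq_len, seq_len].

    Each head k uses a slope beta_k that decreases geometrically (Eq. 2.84),
    and the position pair (i, j) receives the bias beta_k * (j - i) (Eq. 2.82).
    """
    i = np.arange(seq_len)[:, None]           # query positions
    j = np.arange(seq_len)[None, :]           # key positions
    offsets = j - i                           # (j - i) <= 0 in the causal part
    slopes = np.array([1.0 / 2 ** (8.0 * k / n_heads) for k in range(1, n_heads + 1)])
    biases = slopes[:, None, None] * offsets[None, :, :]
    # Future positions (j > i) are removed later by the causal mask of Eq. (2.83).
    return biases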

The ALiBi approach provides a simple form of relative positional embeddings. There are
other similar methods for designing query-key biases using the offset i − j. Table 2.4 shows a
comparison of such biases. As an aside, it is worth noting that the form of the right-hand side
of Eq. (2.82) is very similar to length features used in conventional feature-based systems. For
example, in statistical machine translation systems, such features are widely used to model word
reordering problems, resulting in models that can generalize well across different translation tasks
[Koehn, 2010].

Entry                          Query-Key Bias (PE(i, j))
T5 [Raffel et al., 2020]       ub(i−j)
ALiBi [Press et al., 2022]     −β · (i − j)
Kerple [Chi et al., 2022]      −β1 (i − j)^β2 (power)
                               −β1 log(1 + β2 (i − j)) (logarithmic)
Sandwich [Chi et al., 2023]    Σ_{k=1}^{d̄/2} cos((i − j)/10000^(2k/d̄))
FIRE [Li et al., 2024]         f(ψ(i − j)/ψ(max(mlen , i)))

Table 2.4: Query-key biases as relative positional embeddings. β, β1 , β2 , d̄, and mlen are hyper-parameters. In the T5
model, b(i − j) denotes the bucket assigned to i − j. In the FIRE model, ψ(·) is a monotonically increasing function
such as ψ(x) = log(cx + 1), and f (·) is an FFN.

2.3.5.3 Rotary Positional Embedding

As with sinusoidal embeddings, rotary positional embeddings are based on hard-coded values for
all dimensions of an embedding [Su et al., 2024]. Recall that in the sinusoidal embedding model,
positions are represented as combinations of sine and cosine functions with different frequencies.
These embeddings are then added to token embeddings to form the inputs to the Transformer
layer stack.

[Figure: lower-triangular query-key product matrices with added bias terms: (a) The T5 bias (nb = 3 and distmax = 5); (b) The ALiBi bias]
Fig. 2.11: Query-key products with biases (above = the T5 bias and below = the ALiBi bias). The color scale of the
biases ranges from light blue denoting small absolute values to deep blue denoting large absolute values.

Rotary positional embeddings instead model positional context as rotations to token
embeddings in a complex space. This leads to a model expressed in the form of multiplicative
embeddings

ei = xi R(i) (2.85)

where R(i) ∈ Rd×d is the rotation matrix representing the rotations performed on the token
embedding xi ∈ Rd .
For simplicity, we will rst consider embeddings with only two dimensions and return to a
discussion of the more
 general formulation later. Suppose we have a 2-dimensional token embed-
ding x = x1 x2 . We can represent it as a vector in a plane, originating at the origin (0, 0)
and terminating at (x1 , x2 ). A counterclockwise rotation of this vector refers to an operation of
88 Generative Models

Entry Query-Key Bias (PE(i, j))


T5 [Raffel et al., 2020] ub(i−j)
ALiBi [Press et al., 2022] −β · ( i − j )
Kerple [Chi et al., 2022] −β1 ( i − j )β2 (power)
−β1 log(1 + β2 ( i − j )) (logarithmic)
d/2
¯  ¯ 
Sandwich [Chi et al., 2023] k=1 cos (i−j )/100002k/d
 
FIRE [Li et al., 2024] f ψ( i − j )/ψ(max(mlen , i))
¯ and mlen are hyper-parameters. In the T5
Table 2.4: Query-key biases as relative positional embeddings. β, β1 , β2 , d,
model, b(i − j) denotes the bucket assigned to i − j. In the FIRE model, ψ(·) is a monotonically increasing function
such as ψ(x) = log(cx + 1), and f (·) is an FFN.

moving the vector around the origin while maintaining its magnitude, as shown in Figure 2.12 (a).
The degree of rotation is usually dened by a specic angle, denoted by θ. The rotation can be
expressed mathematically in the form

Ro(x, θ) = xRθ
         = [x1  x2 ] [ cos θ   sin θ ; −sin θ   cos θ ]
         = [cos θ · x1 − sin θ · x2    sin θ · x1 + cos θ · x2 ]    (2.86)

where Rθ = [ cos θ   sin θ ; −sin θ   cos θ ] is the rotation matrix. If two or more rotations are performed on the
same vector, we can rotate the vector further. This follows from the fact that the composition of
successive rotations is itself a rotation. More formally, rotating a vector by an angle θ for t times
can be expressed as

Ro(x, tθ) = xRtθ
          = [cos tθ · x1 − sin tθ · x2    sin tθ · x1 + cos tθ · x2 ]    (2.87)

If we interpret t as the position of a token represented by x in a sequence, then we will find
that the above equation defines a simple positional embedding model. As shown in Figure 2.12
(b), we start moving the token from position 0. Each time we move one step forward, the vector
is rotated by the angle θ. Upon arriving at the position t, the representation of the token with
positional context is given by Ro(x, tθ). As the rotations do not change the magnitude of the
embedding, the original “meaning” of the token is retained. The positional information is injected
into the embedding when it gets rotated.

[Figure: (a) Single-step Rotation; (b) Multi-step Rotation; (c) Angles between embeddings of two tokens at different positions]
Fig. 2.12: Illustrations of vector rotations in a plane. Sub-figures (a) and (b) show rotations of a vector in a single
step and multiple steps, respectively. Sub-figure (c) shows the embeddings of tokens cat and sleeping in two different
sentences. We show these sentences with a subscript affixed to each token to indicate its position. If we represent
tokens as vectors, we can add positional information by rotating these vectors. This rotation preserves the “distances”
between the vectors. For example, given that the distance between cat and sleeping is the same in both sentences, the
angle between their embeddings also remains the same during rotation.
A popular way to understand vector rotation is to define it in complex spaces. It is easy
to transform each vector x = [x1  x2 ] in the 2D Euclidean space R2 to a complex number
x′ = x1 + ix2 in the complex space C via a bijective linear map. Then, the rotation of x with the
angle tθ corresponds to the multiplication by eitθ . Given that eitθ = cos tθ + i sin tθ, the rotation
operation can be re-expressed in the form

xRtθ → x′ eitθ
= (x1 + ix2 )(cos tθ + i sin tθ)
= cos tθ · x1 − sin tθ · x2 + i(sin tθ · x1 + cos tθ · x2 ) (2.88)

Here we denote the token representation x′ eitθ by C(x, tθ). The inner product of the representa-
tions of the tokens at positions t and s can be written as

⟨C(x, tθ), C(y, sθ)⟩ = (x′ ȳ′ ) e^(i(t−s)θ)    (2.89)

where ȳ′ is the complex conjugate of y′ . As can be seen, the result of this inner product involves
a term t − s, and so it can model the offset between the two tokens.

Now we go back to representations in the 2D Euclidean space. The dot-product of Ro(x, tθ)
and Ro(y, sθ) can be written as a function of (t − s)θ

Ro(x, tθ)[Ro(y, sθ)]T = xRtθ [yRsθ ]T
                      = xRtθ [Rsθ ]T yT
                      = xR(t−s)θ yT    (2.90)

Given this result, if we consider Ro(x, tθ) and Ro(y, sθ) as the query and the key, then the self-
attention operation will implicitly involve the modeling of relative positional context.
This rotary positional embedding can be extended to multi-dimensional embeddings. For
a d-dimensional token embedding x = [x1  x2  ...  xd ], we can treat it as a d/2-dimensional
complex vector x′ = [x′1  x′2  ...  x′d/2 ] = [x1 + ix2   x3 + ix4   ...   xd−1 + ixd ], where
each consecutive pair of items forms a complex number. Then, the rotary positional embedding in
the complex space is given by

C(x, tθ) = Σ_{k=1}^{d/2} x′k e^(itθk) ek    (2.91)

where ek is the standard basis vector with a single non-zero value in the k-th coordinate and 0’s
elsewhere [Biderman et al., 2021].
Although this formula involves a complicated expression, its equivalent form in the d-dimensional
Euclidean space is relatively easy to understand. We can write it as

Ro(x, tθ) = [x1  x2  ...  xd ] diag(Rtθ1 , Rtθ2 , ..., Rtθd/2 )    (2.92)

where diag(Rtθ1 , ..., Rtθd/2 ) is the block-diagonal matrix whose diagonal blocks are the 2 × 2 rotation
matrices Rtθk = [ cos tθk   sin tθk ; −sin tθk   cos tθk ]. θ = [θ1 , ..., θd/2 ] are the parameters for controlling the
angles of rotations in different dimensions. Typically, θk is set to 10000^(−2(k−1)/d), which is analogous
to the setting in sinusoidal embeddings.
In a practical implementation, Eq. (2.92) can be rewritten into a form that relies solely on the
element-wise product and addition of vectors:

Ro(x, tθ) = [x1  x2  ...  xd−1  xd ] ⊙ [cos tθ1  cos tθ1  ...  cos tθd/2  cos tθd/2 ]
          + [−x2  x1  ...  −xd  xd−1 ] ⊙ [sin tθ1  sin tθ1  ...  sin tθd/2  sin tθd/2 ]    (2.93)
Finally, we rewrite Eq. (2.85) to obtain the form of the embedding at position i

ei = Ro(xi , iθ) (2.94)
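A minimal Python sketch of the element-wise form in Eq. (2.93) is given below; it applies the rotation to a single vector at position t. The pairing of adjacent dimensions and the NumPy-based implementation are illustrative choices rather than a reference implementation.

import numpy as np

def rope(x, t, base=10000.0):
    """Rotate a d-dimensional vector x for position t (element-wise form of Eq. 2.93)."""
    d = x.shape[-1]
    # theta_k = base^(-2(k-1)/d) for k = 1..d/2, one angle per 2-D pair.
    theta = base ** (-2.0 * np.arange(d // 2) / d)
    angles = t * theta
    cos = np.repeat(np.cos(angles), 2)        # [cos t*th1, cos t*th1, cos t*th2, ...]
    sin = np.repeat(np.sin(angles), 2)
    # Pair-wise "rotate half": (x1, x2) -> (-x2, x1), (x3, x4) -> (-x4, x3), ...
    x_rot = np.stack([-x[1::2], x[0::2]], axis=-1).reshape(x.shape)
    return x * cos + x_rot * sin

# As implied by Eq. (2.90), the dot-product of two rotated vectors depends only on
# the offset: rope(q, t) @ rope(k, s) equals rope(q, t + 1) @ rope(k, s + 1),
# up to numerical error.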

2.3.5.4 Position Interpolation

In position interpolation, our goal is to map the positions in the new sequence to match the ob-
served range in training. Suppose the sequence length for training ranges from 0 to ml . When
m > ml at test time, we represent the positions in [0, m] such that our representations fit [0, ml ].
To illustrate, consider the rotary positional embedding model described above. The embedding
of each token is described by a model Ro(xi , iθ) in which θ = [θ1 , ..., θd/2 ] are the parameters.
Ro(xi , iθ) can be cast in the form of a linear combination of two periodic functions (see Eq.
(2.93))

cos iθ = [cos iθ1  ...  cos iθd/2 ]    (2.95)
sin iθ = [sin iθ1  ...  sin iθd/2 ]    (2.96)

θk is an exponential function of k and takes the form

θk = b^(−2(k−1)/d)    (2.97)

where b is the base. The period of cos iθk and sin iθk is

Tk = 2π · b^(2(k−1)/d)    (2.98)

The key idea behind position interpolation is to adjust this period so that the new positions can
be encoded within the range [0, ml ]. One way to achieve this is to scale up Tk by m/ml , given by

Tk′ = (m/ml ) · 2π · b^(2(k−1)/d)    (2.99)

Hence all points in [0, m] are compressed into [0, ml ]. This linear scaling can be easily realized
by modifying the input to the embedding model [Chen et al., 2023c]. The new model with linear
positional interpolation is given by

Ro′ (xi , iθ) = Ro(xi , (ml /m) · iθ)    (2.100)

Another method of positional interpolation is to scale the base17 . Suppose that the base b is
scaled by λ. We wish the period of this new model in the last dimension of θ (i.e., dimension d/2)
to be equal to that of the linear positional interpolation model. This can be expressed as

2π · (λb)^(2(d/2−1)/d) = (m/ml ) · 2π · b^(2(d/2−1)/d)    (2.101)

17
This method was first proposed in https://www.reddit.com/r/LocalLLaMA/comments/14lz7j5/ntkaware_scaled_rope_allows_llama_models_to_have/

Solving this equation, we obtain

λ = (m/ml )^(d/(2(d/2−1)))
  = (m/ml )^(d/(d−2))    (2.102)

This gives an embedding model

Ro′ (xi , iθ) = Ro(xi , iθ ′ )    (2.103)

where

θ ′ = [(λb)^(−0/d) , (λb)^(−2/d) , ..., (λb)^(−(d−2)/d) ]    (2.104)

Note that scaling the base provides a non-uniform method for scaling the periods across dif-
ferent dimensions of θ. This method has been found to be helpful for extending LLMs to longer
sequences, and several improvements have been developed [Peng et al., 2024; Ding et al., 2024].
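Both strategies can be sketched on top of the rope function from the earlier RoPE example (reusing that function is an assumption of this illustration): linear interpolation rescales the position index as in Eq. (2.100), while base scaling enlarges the base by the factor λ of Eq. (2.102).

def rope_linear_interpolation(x, t, m_train, m_test, base=10000.0):
    """Linear position interpolation (Eq. 2.100): shrink the position index."""
    return rope(x, t * m_train / m_test, base=base)

def rope_base_scaling(x, t, m_train, m_test, base=10000.0):
    """Base scaling (Eqs. 2.102-2.104): enlarge the base b by a factor lambda."""
    d = x.shape[-1]
    lam = (m_test / m_train) ** (d / (d - 2))
    return rope(x, t, base=lam * base)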

2.3.6 Remarks

In this section, we have presented a variety of methods for long-context language modeling. We
close this section by discussing some interesting issues related to these methods.

2.3.6.1 Need for Long Context

One of the ultimate goals of long-context LLMs is that these models can precisely encode infinite
context. The so-called infinite context refers more to the fact that an LLM can continuously read
words. This motivates LLMs that can handle extremely long contexts or streaming data. As discussed
in Section 2.3.3, it is common to use fixed-size memory models to process continuously expanding
context. Many such systems are based on recurrent architectures or their variants, because they
are inherently suited to model time series problems where the effects of past inputs continue
indefinitely. Another way to achieve infinite memory is to develop alternatives to self-attention
models; for example, one can use continuous-space attention models to encode context, which
removes the dependency on context length [Martins et al., 2022].
When studying long-context LLMs, it is natural to wonder what mechanisms may explain the
use of long context in language modeling. Can we compress the representation of infinite context
into a relatively small-sized model? Are all context tokens useful for predicting next tokens? How
do LLMs prepare for token prediction when they see the context? Can we know in advance which
contextual information will be critical for prediction? General answers to all these questions
are not obvious, but they inspire follow-on research on explainable models, and some interesting
results have been found. For example, Deletang et al. [2024] conducted extensive experiments
to show that LLMs are powerful in-context compressors. Although viewing predictive models
as compression models has long been studied in machine learning, it also provides insights into
our understanding of the LLM scaling laws. Pal et al. [2023] and Wu et al. [2024] investigated
whether the features learned up to the current step, though not intentionally, are already sufficient
for predicting tokens at the following steps. Note that the need for long-context in language
modeling is highly dependent on the problem that we address. A related issue is where to apply
LLMs and how to evaluate them. For example, in summarization tasks we may only need to distill
and focus on a few key aspects of the text, while in retrieval-like tasks we need to “memorize”
the entire context so that the relevant information can be accessed. We will discuss the evaluation
issue later in this subsection.

2.3.6.2 Pre-training or Adapting LLMs?

Training LLMs requires signicant computational costs. Although it is straightforward to train


LLMs on long sequence data, the training becomes computationally unwieldy for large data sets. It
is common practice to pre-train LLMs on general datasets, and then adapt them with modest ne-
tuning effort. For example, LLMs with relative or rotary positional embeddings can be directly
trained on large-scale data in the pre-training phase. While the resulting models may exhibit some
abilities to extrapolate lengths in the inference phase, it may be more effective to ne-tune them
on longer sequences.
Ideally, we would like to pre-train LLMs with standard Transformer architectures and adapt
them to new tasks. This allows us to use many off-the-shelf LLMs and efciently adapt them to
handle long sequences. However, when new architectures are adopted, it seems inevitable that
we need to train these models from scratch. This poses practical difculties for developing long-
context LLMs, as we cannot leverage well-developed, pre-trained models and must instead train
them ourselves. On the other hand, ne-tuning is still an effective way to adapt LLMs with certain
architectures that are different from those in pre-training. An example is models augmented with
external memories. In these models, the pre-trained LLMs are xed, and the focus is on how
to make these LLMs collaborate with the memory models. In RAG, for instance, it is common
to ne-tune LLMs to improve their use of retrieval-augmented inputs. Another example of ne-
tuning LLMs for long-context modeling is that we train an LLM with full attention models, and
then replace them with sparse attention models in the ne-tuning phase. The pre-trained LLM
provides initial values of model parameters used in a different model, and this model is then ne-
tuned as usual.

2.3.6.3 Evaluating Long-context LLMs

Evaluating long-context LLMs is important, but it is a new issue in NLP. The general idea is that,
if we input a long context to an LLM, then we can check from the output of the LLM whether it
understands the entire context and makes use of it in predicting following tokens. In conventional
research of NLP, such evaluations are often aimed at examining the ability of NLP models in
handling long-range dependencies. However, the size of contexts used in recent LLMs is much
larger than that used in NLP systems a few years ago. This motivates researchers to develop new
evaluation benchmarks and metrics for long-context LLMs.
One approach is to use the perplexity metric. However, in spite of its apparent simplicity, this
method tends to reflect more on the LLMs' ability to make use of local context rather than global
context. It is therefore tempting to develop evaluation methods that are specific to long-context
LLMs. Popular methods include various synthetic tasks where artificially generated or modified
data is used to evaluate specic capabilities of long-context LLMs. In needle-in-a-haystack18 and


passkey retrieval tasks [Mohtashami and Jaggi, 2024; Chen et al., 2023c], for instance, LLMs are
required to identify and extract a small, relevant piece of information from a large volume of given
text. The assumption here is that an LLM with sufcient memory should remember earlier parts
of the text as it processes new information. This LLM can thus pick out the relevant details, which
might be sparse and hidden among much irrelevant information, from the text. Alternatively,
in copy memory tasks (or copy tasks for short), LLMs are used to repeat the input text or a
specic segment multiple times. These tasks were initially proposed to test the extent to which
recurrent models can retain and recall previously seen tokens [Hochreiter and Schmidhuber, 1997;
Arjovsky et al., 2016], and have been adopted in evaluating recent LLMs [Bulatov et al., 2022;
Gu and Dao, 2023].
Another approach to evaluating long-context LLMs is to test them on NLP tasks that involve
very long input sequences. Examples include long-document or multi-document summarization,
long-document question answering, code completion, and so on. A benefit of this approach is that
it can align evaluations with user expectations.
Although many methods have been developed, there is still no general way to evaluate long-
context LLMs [Liu et al., 2024c]. One problem is that most of these methods focus on specific
aspects of LLMs, rather than their fundamental ability to model very long contexts. Even though
an LLM can pick out the appropriate piece of text from the input, we cannot say that it truly un-
derstands the entire context. Instead, it might just remember some important parts of the context,
or even simply recall the answer via the model learned in pre-training. Moreover, the data used
in many tasks is small-scale and relatively preliminary, leading to discrepancies between evalu-
ation results and actual application performance. A more interesting issue is that the results of
LLMs are inuenced by many other factors and experimental setups, for example, using different
prompts can lead to very different outcomes. This makes evaluation even more challenging be-
cause improvements may not solely result from better modeling of long contexts, and there is a
risk of overclaiming our results. Nevertheless, many open questions remain in the development
and evaluation of long-context LLMs. For example, these models still suffer from limitations
such as restricted context length and high latency. Studying these issues is likely to prove valuable
future directions.

2.4 Summary

In this chapter, we have discussed the concept of LLMs and related techniques. This can be consid-
ered a general, though not comprehensive, introduction to LLMs, laying the foundation for further
discussions on more advanced topics in subsequent chapters. Furthermore, we have explored two
ways to scale up LLMs. The rst focuses on the large-scale pre-training of LLMs, which is cru-
cial for developing state-of-the-art models. The second focuses on methods for adapting LLMs to
long inputs, including optimizing attention models, designing more efficient and compressed KV
caches, incorporating memory models, and exploring better positional embeddings.
The strength of LLMs lies in their ability to break the constraints of training NLP models for
a limited number of specic tasks. Instead, LLMs learn from large amounts of text through the
simple task of token prediction — we predict the next token in a sentence given its prior tokens.
18
https://github.com/gkamradt/LLMTest_NeedleInAHaystack

A general view is that, by repeating this token prediction task a large number of times, LLMs can
acquire some knowledge of the world and language, which can then be applied to new tasks. As a
result, LLMs can be prompted to perform any task by framing it as a task of predicting subsequent
tokens given prompts. This emergent ability in language models comes from several dimensions,
such as scaling up training, model size, and context size. It is undeniable that scaling laws are
currently the fundamental principle adopted in developing large language models, although sim-
ply increasing model size has yet to prove sufficient for achieving AGI. These continuously scaled
LLMs have been found to show capabilities in general-purpose language understanding, genera-
tion, and reasoning. More recently, it has been found that scaling up the compute at inference time
can also lead to signicant improvements in complex reasoning tasks [OpenAI, 2024].
Given their amazing power, LLMs have attracted considerable interest, both in terms of tech-
niques and applications. As a result, the explosion of research interest in LLMs has also led to a
vast number of new techniques and models. However, we do not attempt to provide a comprehen-
sive literature review on all aspects of LLMs, given the rapid evolution of the field. Nevertheless,
one can still gain knowledge about LLMs from general reviews [Zhao et al., 2023; Minaee et al.,
2024] or more focused discussions on specic topics [Ruan et al., 2024].
