2 Generative Models
One of the most significant advances in NLP in recent years might be the development of large language models (LLMs). This has helped create systems that can understand and generate natural languages like humans. These systems have even been found to be able to reason, which is considered a very challenging AI problem. With these achievements, NLP made big strides and entered a new era of research in which difficult problems are being solved, such as building conversational systems that can communicate with humans smoothly.
The concept of language modeling or probabilistic language modeling dates back to early ex-
periments conducted by Shannon [1951]. In his work, a language model was designed to estimate
the predictability of English — how well can the next letter of a text be predicted when the pre-
ceding N letters are known. Although Shannon’s experiments were preliminary, the fundamental
goals and methods of language modeling have remained largely unchanged over the decades since
then. For quite a long period, particularly before 2010, the dominant approach to language mod-
eling was the n-gram approach [Jurafsky and Martin, 2008]. In n-gram language modeling, we
estimate the probability of a word given its preceding n − 1 words, and thus the probability of a
sequence can be approximated by the product of a series of n-gram probabilities. These proba-
bilities are typically estimated by collecting smoothed relative counts of n-grams in text. Though straightforward and simple, this approach has been extensively used in NLP. For example,
the success of modern statistical speech recognition and machine translation systems has largely
depended on the utilization of n-gram language models [Jelinek, 1998; Koehn, 2010].
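To make the n-gram formulation concrete, the following is a minimal sketch (not from the original text) of a bigram language model estimated from relative counts with simple add-one smoothing; the toy corpus and function names are illustrative assumptions.

from collections import Counter

def train_bigram_lm(corpus, vocab):
    # Count bigrams and unigram contexts over the corpus.
    bigram_counts = Counter()
    context_counts = Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence
        for prev, curr in zip(tokens, tokens[1:]):
            bigram_counts[(prev, curr)] += 1
            context_counts[prev] += 1

    def prob(curr, prev):
        # Add-one (Laplace) smoothed relative count: Pr(curr | prev).
        return (bigram_counts[(prev, curr)] + 1) / (context_counts[prev] + len(vocab))

    return prob

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
vocab = {"the", "cat", "dog", "sat"}
prob = train_bigram_lm(corpus, vocab)

# Probability of a sequence = product of bigram probabilities.
p = prob("the", "<s>") * prob("cat", "the") * prob("sat", "cat")
print(p)

The probability of a sequence is then approximated by multiplying such conditional probabilities, as described above.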
Applying neural networks to language modeling has long been attractive, but a real break-
through appeared as deep learning techniques advanced. A widely cited study is Bengio et al.
[2003]’s work where n-gram probabilities are modeled via a feed-forward network and learned
by training the network in an end-to-end fashion. A by-product of this neural language model
is the distributed representations of words, known as word embeddings. Rather than represent-
ing words as discrete variables, word embeddings map words into low-dimensional real-valued
vectors, making it possible to compute the meanings of words and word n-grams in a continu-
ous representation space. As a result, language models are no longer burdened with the curse of
dimensionality, but can represent exponentially many n-grams via a compact and dense neural
model.
The idea of learning word representations through neural language models inspired subsequent
research in representation learning in NLP. However, this approach did not attract significant interest in developing NLP systems in the first few years after its proposal. Starting in about 2012,
though, advances were made in learning word embeddings from large-scale text via simple word
prediction tasks. Several methods, such as Word2Vec, were proposed to effectively learn such
embeddings, which were then successfully applied in a variety of NLP systems [Mikolov et al.,
2013a;b]. As a result of these advances, researchers began to think of learning representations of
sequences using more powerful language models, such as LSTM-based models [Sutskever et al.,
2014; Peters et al., 2018]. And further progress and interest in sequence representation exploded
after Transformer was proposed. Alongside the rise of Transformer, the concept of language mod-
eling was generalized to encompass models that learn to predict words in various ways. Many
powerful Transformer-based models were pre-trained using these word prediction tasks, and suc-
cessfully applied to a variety of downstream tasks [Devlin et al., 2019].
Indeed, training language models on large-scale data has led NLP research to exciting times.
While language modeling has long been seen as a foundational technique with no direct link to the goals of artificial intelligence that researchers had hoped for, it helps us see the emergence of intelligent systems that can learn a certain degree of general knowledge from repeatedly predicting words in text. Recent research demonstrates that a single, well-trained LLM can handle a large number of tasks and generalize to perform new tasks with a small adaptation effort [Bubeck et al., 2023]. This suggests a step towards more advanced forms of artificial intelligence, and inspires further exploration into developing more powerful language models as foundation models.
In this chapter, we consider the basic concepts of generative LLMs. For simplicity, we use the
terms large language models or LLMs to refer to generative models like GPT, though this term
can broadly cover other types of models like BERT. We begin by giving a general introduction
to LLMs, including the key steps of building such models. We then discuss two scaling issues of
LLMs: how LLMs are trained at scale, and how LLMs can be improved to handle very long texts.
Finally, we give a summary of these discussions.
2.1 A Brief Introduction to LLMs

In this section we give an introduction to the basic ideas of LLMs as required for the rest of this chapter and the following chapters. We will use the terms word and token interchangeably. Both of them refer to the basic units used in language modeling, though their original meanings are different.
Before presenting details, let us first consider how language models work. The goal of language modeling is to predict the probability of a sequence of tokens occurring. Let {x_0, x_1, ..., x_m} be a sequence of tokens, where x_0 is the start symbol ⟨s⟩ (or SOS)¹. The probability of this sequence can be defined using the chain rule
Pr(x_0, ..., x_m) = Pr(x_0) · Pr(x_1 | x_0) · Pr(x_2 | x_0, x_1) · · · Pr(x_m | x_0, ..., x_{m-1})
                  = \prod_{i=0}^{m} Pr(x_i | x_0, ..., x_{i-1})    (2.1)

log Pr(x_0, ..., x_m) = \sum_{i=0}^{m} log Pr(x_i | x_0, ..., x_{i-1})    (2.2)
Here Pr(x_i | x_0, ..., x_{i-1}) is the probability of the token x_i given all its previous tokens {x_0, ..., x_{i-1}}². In the era of deep learning, a typical approach to language modeling is to estimate this probability using a deep neural network. A neural network trained to accomplish this task receives a sequence of tokens x_0, ..., x_{i-1} and produces a distribution over the vocabulary V (denoted by Pr(· | x_0, ..., x_{i-1})). The probability Pr(x_i | x_0, ..., x_{i-1}) is the value of the entry of Pr(· | x_0, ..., x_{i-1}) that corresponds to x_i.

¹ The start symbol can also be [CLS], following BERT models.
² We assume that when i = 0, Pr(x_i | x_0, ..., x_{i-1}) = Pr(x_0) = 1. Hence Pr(x_0, ..., x_m) = Pr(x_0) Pr(x_1, ..., x_m | x_0) = Pr(x_1, ..., x_m | x_0).

Table 2.1: Illustration of generating the three tokens b c d given the prefix ⟨s⟩ a via a language model. In each step, the model picks the token x_i from V so that Pr(x_i | x_0, ..., x_{i-1}) is maximized. This token is then appended to the end of the context sequence. In the next step, we repeat the same process, but based on the new context.
When applying a trained language model, a common task is to find the most likely token given its previous context tokens. This token prediction task can be described as

x̂_i = argmax_{x_i ∈ V} Pr(x_i | x_0, ..., x_{i-1})    (2.3)
We can perform word prediction multiple times to generate a continuation of a text: each time we predict the best token x̂_i, and then add this predicted token to the context for predicting the next token x̂_{i+1}. This results in a left-to-right generation process that implements Eqs. (2.1) and (2.2). To illustrate, consider the generation of the following three tokens given the prefix ‘⟨s⟩ a’, as shown in Table 2.1 (a minimal code sketch of this greedy procedure follows below). Now we discuss how LLMs are constructed, trained, and applied.
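As referenced above, here is a minimal sketch of the greedy, left-to-right generation loop. The next_token_distribution function is a stand-in (an assumption for illustration, not part of the original text) for a trained language model that returns Pr(· | x_0, ..., x_{i-1}) over a vocabulary.

def greedy_generate(next_token_distribution, vocab, prefix, max_len=10, eos="</s>"):
    """Greedily extend `prefix` one token at a time (cf. Eqs. 2.1-2.3)."""
    context = list(prefix)
    for _ in range(max_len):
        dist = next_token_distribution(context)       # Pr(. | context), a dict over vocab
        best = max(vocab, key=lambda tok: dist[tok])  # argmax_{x_i} Pr(x_i | context)
        context.append(best)
        if best == eos:
            break
    return context

# Toy model: always prefers the token that follows the last one alphabetically.
def toy_distribution(context):
    vocab = ["a", "b", "c", "d", "</s>"]
    nxt = {"a": "b", "b": "c", "c": "d", "d": "</s>", "</s>": "</s>"}
    return {tok: (1.0 if tok == nxt[context[-1]] else 0.0) for tok in vocab}

print(greedy_generate(toy_distribution, ["a", "b", "c", "d", "</s>"], ["<s>", "a"]))
# ['<s>', 'a', 'b', 'c', 'd', '</s>'], mirroring the example in Table 2.1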
Here, we focus on the decoder-only Transformer architecture, as it is one of the most popular
model architectures used in LLMs. The input sequence of tokens is represented by a sequence
of d_e-dimensional vectors {e_0, ..., e_{m-1}}. e_i is the sum of the token embedding of x_i and the
positional embedding of i. The major body of the model is a stack of Transformer blocks (or
layers). Each Transformer block has two stacked sub-layers, one for self-attention modeling and
one for FFN modeling. These sub-layers can be defined using the post-norm architecture

output = LNorm(F(input) + input)    (2.4)

or the pre-norm architecture

output = F(LNorm(input)) + input    (2.5)

³ Note that \sum_{i=1}^{m} log Pr(x_i | x_0, ..., x_{i-1}) = \sum_{i=0}^{m} log Pr(x_i | x_0, ..., x_{i-1}) since log Pr(x_0) = 0.
where input and output denote the input and output, both being an m × d matrix. The i-th rows
of input and output can be seen as contextual representations of the i-th token in the sequence.
F (·) is the core function of a sub-layer. For FFN sub-layers, F (·) is a multi-layer FFN. For
self-attention sub-layers, F (·) is a multi-head self-attention function. In general, self-attention is
expressed in a form of QKV attention
Att_qkv(Q, K, V) = Softmax(Q K^T / \sqrt{d} + Mask) V    (2.6)
where Q, K and V ∈ Rm×d are the queries, keys, and values, respectively. It is important to
note that only previous tokens are considered when predicting a token. So a masking variable Mask ∈ R^{m×m} is incorporated into self-attention to achieve this. The entry (i, k) of Mask has a value of 0 if k ≤ i, and a value of −∞ otherwise.
Given a representation H ∈ R^{m×d}, the multi-head self-attention function can be defined as

Att_mhead(H) = Merge(head_1, ..., head_τ) W_head    (2.7)

where Merge(·) represents a concatenation of its inputs, and W_head ∈ R^{d×d} represents a parameter matrix. head_j is the output of QKV attention on a sub-space of representation

head_j = Att_qkv(Q^{[j]}, K^{[j]}, V^{[j]})    (2.8)

Q^{[j]}, K^{[j]}, and V^{[j]} are the queries, keys, and values projected onto the j-th sub-space via linear transformations

Q^{[j]} = H W_q^j    (2.9)
K^{[j]} = H W_k^j    (2.10)
V^{[j]} = H W_v^j    (2.11)

where W_q^j, W_k^j, and W_v^j ∈ R^{d×(d/τ)} are the parameter matrices of the transformations, and τ is the number of heads.
Suppose we have L Transformer blocks. A Softmax layer is built on top of the output of the last block. The Softmax layer outputs a sequence of m distributions over the vocabulary, like this

\begin{bmatrix} Pr(· | x_0) \\ Pr(· | x_0, x_1) \\ \vdots \\ Pr(· | x_0, ..., x_{m-1}) \end{bmatrix} = Softmax(H^L W_o)    (2.12)

where H^L is the output of the last Transformer block, and W_o ∈ R^{d×|V|} is the parameter matrix.
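To make Eq. (2.6) concrete, below is a minimal NumPy sketch (an illustrative assumption, not the book's code) of single-head causal self-attention with the additive mask described above.

import numpy as np

def causal_self_attention(H, Wq, Wk, Wv):
    """Single-head QKV attention with a causal mask (cf. Eq. 2.6)."""
    m, d = H.shape
    Q, K, V = H @ Wq, H @ Wk, H @ Wv

    # Mask[i, k] = 0 if k <= i (past or current position), -inf otherwise.
    mask = np.where(np.tril(np.ones((m, m))) == 1, 0.0, -np.inf)

    scores = Q @ K.T / np.sqrt(d) + mask
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # row-wise Softmax
    return weights @ V

rng = np.random.default_rng(0)
m, d = 4, 8
H = rng.normal(size=(m, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = causal_self_attention(H, Wq, Wk, Wv)
print(out.shape)  # (4, 8): one contextual vector per position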
Figure 2.1 shows the Transformer architecture for language modeling.

Fig. 2.1: The Transformer-decoder architecture for language modeling. The central components are L stacked Transformer blocks, each comprising a self-attention sub-layer and an FFN sub-layer. To prevent the model from accessing the right-context, a masking variable is incorporated into self-attention. The output layer uses a Softmax function to generate a probability distribution for the next token, given the sequence of previous tokens. During inference, the model takes the previously predicted token to predict the next one, repeating this process until the end of the sequence is reached. {z_0, ..., z_{m-1}} denote the inputs of a Transformer block, and {h^L_0, ..., h^L_{m-1}} denote the outputs of the last Transformer block.

Applying this language
model follows an autoregressive process. Each time the language model takes a token xi−1 as
input and predicts a token xi that maximizes the probability Pr(xi |x0 , ..., xi−1 ). It is important
to note that, despite different implementation details, many LLMs share the same architecture
described above. These models are called large because both their depth and width are significant.
Table 2.2 shows the model sizes for a few LLMs, as well as their model setups.
Now suppose that we are given a training set D comprising K sequences. The log-likelihood of
each sequence x = x_0 ... x_m in D can be calculated using a language model

L_θ(x) = \sum_{i=1}^{m} log Pr_θ(x_i | x_0, ..., x_{i-1})    (2.13)

Here the subscript θ affixed to L(·) and Pr(·) denotes the parameters of the language model. Then, the objective of maximum likelihood training is defined as

θ̂ = argmax_θ \sum_{x ∈ D} L_θ(x)    (2.14)
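The following sketch (illustrative only; the probability function is a toy stand-in for a real model) shows how the log-likelihood in Eq. (2.13) is typically computed as a sum of log-probabilities of the observed next tokens.

import math

def sequence_log_likelihood(log_prob_fn, tokens):
    """Compute L(x) = sum_i log Pr(x_i | x_0, ..., x_{i-1}) for i = 1..m (Eq. 2.13)."""
    total = 0.0
    for i in range(1, len(tokens)):
        context, target = tokens[:i], tokens[i]
        total += log_prob_fn(target, context)
    return total

# Toy stand-in model: a uniform distribution over a 4-token vocabulary.
def uniform_log_prob(target, context):
    return math.log(1.0 / 4.0)

x = ["<s>", "a", "b", "c"]
print(sequence_log_likelihood(uniform_log_prob, x))  # 3 * log(0.25)

Maximizing the sum of such log-likelihoods over a dataset corresponds to Eq. (2.14); in practice this is done by minibatch gradient descent on the negative of this quantity, i.e., the cross-entropy loss.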
Training Transformer-based language models with the above objective is commonly viewed
as a standard optimization process for neural networks. This can be achieved using gradient de-
scent algorithms, which are widely supported by off-the-shelf deep learning toolkits.

Table 2.2: Comparison of some LLMs in terms of model size, model depth, model width, and number of heads (a/b means a heads for queries and b heads for both keys and values).

Somewhat surprisingly, better results were continuously yielded as language models evolved into more
computationally intensive models and trained on larger datasets [Kaplan et al., 2020]. These suc-
cesses have led NLP researchers to continue increasing both the training data and model size in
order to build more powerful language models.
However, as language models become larger, we confront new training challenges, which significantly change the problem compared to training relatively small models. One of these challenges arises from the need for large-scale distributed systems to manage the data, model parameters, training routines, and so on. Developing and maintaining such systems requires a significant amount of work in both software and hardware engineering, as well as expertise in deep
learning. A related issue is that when the training is scaled up, we need more computing resources
to ensure the training process can be completed in an acceptable time. For example, it generally
requires hundreds or thousands of GPUs to train an LLM with tens of billions of parameters
from scratch. This requirement drastically increases the cost of training such models, especially
considering that many training runs are needed as these models are developed. Also, from the
perspective of deep learning, the training process can become unstable if the neural networks are
very deep and/or the model size is very large. In response, we typically need to modify the model
architecture to adapt LLMs to large-scale training. In Section 2.2 we will present more discussions
on these issues.
Once we have pre-trained an LLM, we can then apply it to perform various NLP tasks. Traditionally, language models are used as components of other systems; for example, they are widely
applied to score translations in statistical machine translation systems. By contrast, in generative
AI, LLMs are considered complete systems and are employed to address NLP problems by mak-
ing use of their generation nature. A common approach is to describe the task we want to address
in text and then prompt LLMs to generate text based on this description. This is a standard text
generation task where we continue or complete the text starting from a given context.
More formally, let x = x_0 ... x_m denote a token sequence of context given by users, and y = y_1 ... y_n denote a token sequence following the context. Then, the inference of LLMs can be defined as a problem of finding the most likely sequence y based on x:

ŷ = argmax_y \sum_{i=1}^{n} log Pr(y_i | x_0, ..., x_m, y_1, ..., y_{i-1})    (2.15)

Here \sum_{i=1}^{n} log Pr(y_i | x_0, ..., x_m, y_1, ..., y_{i-1}) essentially expresses the same thing as the right-hand side of Eq. (2.2). It models the log probability of predicting tokens from position m + 1,
rather than position 0. Throughout this chapter and subsequent ones, we will employ separate
variables x and y to distinguish the input and output of an LLM, though they can be seen as sub-
sequences from the same sequence. By adopting such notation, we see that the form of the above
equation closely resembles those used in other text generation models in NLP, such as neural
machine translation models.
To illustrate how LLMs are applied, consider the problem of determining the grammaticality of a given sentence. We can define a template like this

{*sentence*}
Question: Is this sentence grammatically correct?
Answer:

Here, the blank following “Answer:” represents the text we intend to generate. {*sentence*} is a placeholder variable that will be replaced by the actual sentence provided by the users. For example, suppose we have the sentence “John seems happy today.” We can replace the {*sentence*} in the template with this sentence to obtain an input to the language model.
To perform the task, the language model is given the context x = “John seems happy today.\n Question: Is this sentence grammatically correct?\n Answer:”⁴. It then generates the following text as the answer, based on the context. For example, the language model may output “Yes” (i.e., y = “Yes”) if this text is the one with the maximum probability of prediction given this context.

⁴ \n is a special character used for line breaks.
Likewise, we can define more templates to address other tasks. For example, we can translate an English sentence into Chinese using the following template

{*sentence*}
Question: What is the Chinese translation of this English sentence?
Answer:

or, alternatively, a template that phrases the task as a direct instruction

{*sentence*}
Translate this sentence from English into Chinese.
The above templates provide a simple but effective method to “prompt” a single LLM to perform various tasks without adapting the structure of the model. However, this approach requires that the LLM can recognize and follow the instructions or questions. One way to do this is to incorporate training samples with instructions and their corresponding responses into the pre-training dataset. While this method is straightforward, building and training LLMs from scratch is computationally expensive. Moreover, making instruction-following data effective for pre-training requires a significant amount of such data, but collecting large-scale labeled data for all tasks of interest is very difficult.
A second method, which has become a de facto standard in recent research, is to adapt LLMs via fine-tuning. As such, the token prediction ability learned in the pre-training phase can be generalized to accomplish new tasks. The idea behind fine-tuning is that some general knowledge of language has been acquired in pre-training, but we need a mechanism to activate this knowledge for applying it to new tasks. To achieve this, we can slightly fine-tune the model parameters using instruction-following data. This approach is called instruction fine-tuning.
An instruction fine-tuning sample, which is represented by a sequence of tokens, can be seen as a tuple consisting of an input and the desired output. Here, the input includes instructions, system information (or system prefix), and any other user-provided information⁵. To illustrate, consider the following examples (blue text = input and underlined text = output).

⁵ System information refers to a sequence of tokens added at the beginning of an input in order to guide the behavior of an LLM, such as “you are a helpful assistant and should not output toxic content”.

All these samples describe the same binary classification task, but with different instructions. To increase the diversity in the fine-tuning data and make LLMs generalize, we can define more tasks using instructions.
If you buy 5 apples and each apple costs $1.20, how much do you spend in total?
$6.00
Write a Python program to calculate the sum of squares of the following numbers.
1 , 2 , 10 , -9 , 78
numbers = [1, 2, 10, -9, 78]
sum_of_squares = sum(x**2 for x in numbers)
print(sum_of_squares)
Let D_tune denote a set of instruction fine-tuning samples, and θ̂ denote the model parameters optimized via pre-training. We can modify Eq. (2.14) to obtain the objective of fine-tuning

θ̃ = argmax_{θ̂⁺} \sum_{sample ∈ D_tune} L_{θ̂⁺}(sample)    (2.16)
Here θ̃ denotes the optimal parameters. The use of the notation θ̂⁺ means that the fine-tuning starts with the pre-trained parameters θ̂.
For each sample ∈ D_tune, we divide it into an input segment x_sample and an output segment y_sample, that is,

sample = [x_sample, y_sample]    (2.17)

and the loss of the sample is computed only over the output segment

L_{θ̂⁺}(sample) = log Pr_{θ̂⁺}(y_sample | x_sample)    (2.18)

In other words, we compute the loss over the sub-sequence y_sample, rather than the entire sequence. In a practical implementation of back-propagation for this equation, the sequence [x_sample, y_sample] is processed in the forward pass as usual. However, in the backward pass, error gradients are propagated back only through the parts of the network that correspond to y_sample, leaving the rest of the network unchanged. As an example, consider a sequence whose output segment is “The result is 4.”. The loss is calculated and back-propagated only for “The result is 4.”.
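A minimal sketch of this loss masking is shown below (an illustration under assumed inputs, not the book's implementation): positions belonging to the input segment are excluded from the loss, so gradients flow only from the output-segment predictions.

import numpy as np

def masked_nll_loss(log_probs, targets, loss_mask):
    """Average negative log-likelihood over masked (output-segment) positions only.

    log_probs: [seq_len, vocab_size] log Pr(. | previous tokens)
    targets:   [seq_len] the next-token ids to predict
    loss_mask: [seq_len] 1 for output-segment positions, 0 for input-segment positions
    """
    token_nll = -log_probs[np.arange(len(targets)), targets]
    return (token_nll * loss_mask).sum() / loss_mask.sum()

# Toy example: a 6-token sequence where the last 3 tokens are the desired output.
rng = np.random.default_rng(0)
vocab_size, seq_len = 10, 6
logits = rng.normal(size=(seq_len, vocab_size))
log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
targets = np.array([3, 1, 4, 1, 5, 9])
loss_mask = np.array([0, 0, 0, 1, 1, 1])  # only the output segment contributes

print(masked_nll_loss(log_probs, targets, loss_mask))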
Instruction fine-tuning also requires substantial engineering work. In order to achieve satisfactory results, one may experiment with different settings of the learning rate, batch size, number of fine-tuning steps, and so on. This typically requires many fine-tuning runs and evaluations. The cost and experimental effort of fine-tuning remain critical and should not be overlooked, though they are much lower than those of the pre-training phase.
While we focus on instruction fine-tuning as an illustrative example here, fine-tuning techniques play an important role in developing various LLMs and are more widely used. Examples include fine-tuning LLMs as chatbots using dialog data, and adapting these models to handle very long sequences. The wide application of fine-tuning has led researchers to improve these techniques, such as by designing more efficient fine-tuning algorithms. While the research on fine-tuning is fruitful, in this section we just give a flavour of the key steps involved. We will see more detailed discussions on this topic in the following chapters.
Instruction fine-tuning provides a simple way to adapt LLMs to tasks that can be well defined. This problem can broadly be categorized as an alignment problem. Here, alignment refers to the process of guiding LLMs to behave in ways that align with human intentions. The guidance can come from labeled data, human feedback, or any other form of human preferences. For example, we want LLMs not only to be accurate in following instructions, but also to be unbiased, truthful, and harmless. So we need to supervise the models towards human values and expectations. A common example is that when we ask an LLM how to build a weapon, it may provide a list of key steps to do so if it is not carefully aligned. However, a responsible model should recognize and avoid responding to requests for harmful or illegal information. Alignment in this case is crucial for ensuring that LLMs act responsibly and in accordance with ethical guidelines.
A related concept to alignment is AI safety. One ultimate goal of AI is to build intelligent
systems that are safe and socially beneficial. To achieve this goal we should keep these systems
robust and secure under all conditions of real-world use, even in conditions of misuse or adverse use. For LLMs, safety can be increased by aligning them with appropriate human guidance, such as human-labeled data and interactions with users during application.
Alignment is difficult as human values and expectations are diverse and shifting. Sometimes, it is hard to describe precisely what humans want, unless we see the response of LLMs to user requests. This makes alignment no longer a problem of tuning LLMs on predefined tasks, but a bigger problem of training them through their interactions with the real world.
As a result of the concerns with controlling AI systems, there has been a surge in research
on the alignment issue for LLMs. Typically, two alignment steps are adopted after LLMs are
pre-trained on large-scale unlabeled data.
• Supervised Fine-tuning (SFT). This involves continuing the training of pre-trained LLMs on new, task-oriented, labelled data. A commonly used SFT technique is instruction fine-tuning. As described in the previous subsection, by learning from instruction-response annotated data, LLMs can align with the intended behaviors for following instructions, thereby becoming capable of performing various instruction-described tasks. Supervised fine-tuning can be seen as following the pre-training + fine-tuning paradigm, and offers a relatively straightforward method to adapt LLMs.
• Learning from Human Feedback. After an LLM finishes pre-training and supervised fine-tuning, it can be used to respond to user requests if appropriately prompted. But this model may generate content that is factually incorrect, biased, or harmful. To make the LLM more aligned with the users, one simple approach is to directly learn from human feedback. For example, given some instructions and inputs provided by the users, experts are asked to evaluate how well the model responds in accordance with their preferences and interests. This feedback is then used to further train the LLM for better alignment.
A typical method for learning from human feedback is to consider it as a reinforcement learn-
ing (RL) problem, known as reinforcement learning from human feedback (RLHF) [Ouyang et al.,
2022]. The RLHF method was initially proposed to address general sequential decision-making
problems [Christiano et al., 2017], and was later successfully employed in the development of
the GPT series models [Stiennon et al., 2020]. As a reinforcement learning approach, the goal of
RLHF is to learn a policy by maximizing some reward from the environment. Specifically, two components are built in RLHF:
• Agent. An agent, also called an LM agent, is the LLM that we want to train. This agent
operates by interacting with its environment: it receives a text from the environment and
outputs another text that is sent back to the environment. The policy of the agent is the function defined by the LLM, that is, Pr(y|x).
• Reward Model. A reward model is a proxy of the environment. Each time the agent
produces an output sequence, the reward model assigns this output sequence a numerical
score (i.e., the reward). This score tells the agent how good the output sequence is.
In RLHF, we need to perform two learning tasks: 1) reward model learning, which involves
training a reward model using human feedback on the output of the agent, and 2) policy learning,
which involves optimizing a policy guided by the reward model using reinforcement learning
algorithms. Here is a brief outline of the key steps involved in RLHF.
• Train an initial policy (i.e., the LLM) using pre-training and supervised fine-tuning.
• Use the policy to generate multiple outputs for each input, and then collect human feedback on these outputs (e.g., comparisons of the outputs).
• Train a reward model on this human feedback.
• Fine-tune the policy with the supervision from the reward model.
Figure 2.2 shows an overview of RLHF. Given that this section serves only as a brief intro-
duction to concepts of LLMs, a detailed discussion of RLHF techniques will not be included. We
instead illustrate the basic ideas behind RLHF using a simple example.
Suppose we have trained an LLM via pre-training and instruction fine-tuning. This LLM is
deployed to respond to requests from users. For example, a user may input

Write a poem about the weather in London.

We use the LLM to generate 4 different outputs (denoted by {y_1, ..., y_4}) by sampling the output space.

Fig. 2.2: An overview of RLHF. There are 4 key steps involved: a) training an initial LLM (i.e., policy) using pre-training and supervised fine-tuning; b) collecting human preference data by ranking the outputs of the LLM; c) training a reward model using the ranking results; d) RL fine-tuning of the policy based on the reward model. Double line arrows mean training or fine-tuning.
We then ask annotators to evaluate these outputs. One straightforward way is to assign a rating
score to each output. In this case, the reward model learning problem can be framed as a task of
training a regression model. But giving numerical scores to LLM outputs is not an easy task for annotators. It is usually difficult to design an annotation standard that all annotators can agree on and easily follow. An alternative method, which is more popular in the development of LLMs, is
to rank these outputs. For example, a possible ranking of the above outputs is
y1 ≻ y4 ≻ y2 ≻ y3
A reward model is then trained using this ranking result. In general, a reward model in RLHF
is a language model that shares the same architecture as the target LLM, but with a smaller model
size. Given the input x and output yk , we concatenate them to form a sequence seq k = [x, yk ].
This sequence is processed from left to right using forced decoding. Since each position can
only access its left context in language modeling, the output of the top-most Transformer layer at
the first position cannot be used as the representation of the sequence. Instead, a special symbol
(e.g., \s) is added to the end of the sequence, and the corresponding output of the Transformer
layer stack is considered as the representation of the entire sequence. An output layer, such as a
linear transformation layer, is built on top of this representation to generate the reward, denoted
by R(seq k ) or R(x, yk ).
We train this reward model using ranking loss. For example, a pair-wise ranking loss function
can be written in the form

Loss_ω(D_r) = −E_{(x, y_{k1}, y_{k2}) ∼ D_r} log(Sigmoid(R_ω(x, y_{k1}) − R_ω(x, y_{k2})))    (2.19)
where ω represents the parameters of the reward model, and D_r represents a set of tuples of an input and a pair of outputs. (x, y_{k1}, y_{k2}) ∼ D_r is a sampling operation which draws a sample (x, y_{k1}, y_{k2}) from D_r with some probability. As an example, suppose we first draw a model input x with a uniform distribution and then draw a pair of model outputs with a probability of y_{k1} ≻ y_{k2} given x (denoted by Pr(y_{k1} ≻ y_{k2} | x)). The corresponding loss function is given by

Loss_ω(D_r) = −\sum_{(x, y_{k1}, y_{k2})} Pr(x) · Pr(y_{k1} ≻ y_{k2} | x) · log(Sigmoid(R_ω(x, y_{k1}) − R_ω(x, y_{k2})))
            = −\frac{1}{K} \sum_{(x, y_{k1}, y_{k2})} Pr(y_{k1} ≻ y_{k2} | x) · log(Sigmoid(R_ω(x, y_{k1}) − R_ω(x, y_{k2})))    (2.20)
where K represents the number of model inputs involved in sampling. While the form of these
functions may seem complex, their idea is simple: we penalize the model if the predicted ranking
of two outputs differs from the human-labeled ranking. By contrast, the model receives a bonus if the predicted ranking matches the human-labeled ranking.
We can train the reward model by minimizing the above ranking loss

ω̂ = argmin_ω Loss_ω(D_r)    (2.21)
The resulting model Rω̂ (·) can be employed to evaluate any given pair of input and output. Note
that although the reward model is trained using a ranking-based objective, it is used for scoring.
This allows it to provide continuous supervision signals, which is very beneficial for training other models.
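The pair-wise ranking loss in Eq. (2.19) can be sketched as follows (an illustrative stand-in; the reward values here are arbitrary numbers rather than outputs of a real reward model):

import numpy as np

def pairwise_ranking_loss(reward_preferred, reward_rejected):
    """-log Sigmoid(R(x, y_win) - R(x, y_lose)), averaged over labeled pairs."""
    margin = np.asarray(reward_preferred) - np.asarray(reward_rejected)
    return float(np.mean(np.log1p(np.exp(-margin))))  # = -mean(log(sigmoid(margin)))

# Rewards assigned by a (hypothetical) reward model to preferred and rejected outputs.
r_preferred = [2.1, 0.3, 1.5]
r_rejected = [1.0, 0.8, -0.2]
print(pairwise_ranking_loss(r_preferred, r_rejected))

The loss decreases as the reward of the human-preferred output exceeds that of the rejected one, which is exactly the behavior described above.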
We now turn to the policy learning problem. A commonly adopted objective is to maximize
the reward on a set of input-output pairs. Following an analogous form of Eq. (2.16), we obtain a
simple training objective for RL fine-tuning

θ̃ = argmax_{θ̂⁺} \sum_{(x, y_{θ̂⁺}) ∈ D_rlft} R_ω̂(x, y_{θ̂⁺})    (2.22)

where the optimal parameters θ̃ are obtained by fine-tuning the pre-trained parameters θ̂. D_rlft is the RL fine-tuning dataset. For each sample (x, y_{θ̂⁺}), x is sampled from a prepared dataset of input sequences, and y_{θ̂⁺} is sampled from the distribution Pr_{θ̂⁺}(y|x) given by the policy.
In practice, more advanced reinforcement learning algorithms, such as proximal policy opti-
mization (PPO), are often used for achieving more stable training, as well as better performance.
We leave the detailed discussion of reinforcement learning algorithms to the following parts of
this book where RLHF is extensively used for alignment.
An interesting question arises here: why not consider learning from human preferences as a standard supervised learning problem? This question is closely related to our aforementioned discussion on the difficulty of data annotation. Often, describing human values and goals is challenging, and it is even more difficult for humans to provide outputs that are well aligned. As an alternative, annotating the preferences among a given list of model outputs is a simpler task. By doing so, we can create a model that understands human preferences, which can then be used as a reward model for training policies. From the perspective of machine learning, RLHF is particularly useful for scenarios where the desired behavior of an agent is difficult to demonstrate but can be easily recognized by humans. Another advantage of RLHF is its ability to explore the sample space. By employing sampling techniques, models trained with reinforcement learning can venture beyond the annotated dataset to explore additional samples. This exploratory ability allows RLHF to discover potentially beneficial policies that are not immediately apparent from the labeled data alone.
We have so far shown that LLMs can be used to perform various tasks by giving them appropriate prompts. There are no restrictions on these prompts, which can include any information we wish to ask or communicate with LLMs, such as natural language instructions and the context of conversations. Since this approach requires no additional training or tuning, adapting LLMs becomes highly efficient once they are developed. This somewhat influences the paradigms in NLP: we no longer need to develop specific systems for individual tasks but can instead use a single, well-trained LLM to perform different tasks by prompting it. An appealing aspect of LLM prompting arises as a result: users can easily have “customized” systems by designing their own prompts for LLMs. Given the important role played by prompting in LLMs, prompt engineering has become a very active area of research in NLP.
The term prompt is used in many different ways in the literature. In this chapter, this term refers to the entire input to LLMs, and so we use the terms prompt and model input interchangeably. Before discussing prompting further, let us first see a few examples where the prompts are more complex than those presented in the previous subsections. Note that this subsection is not aimed at writing high-quality prompts but rather at highlighting some interesting issues in prompting LLMs.
One of the popular ways to use LLMs is to assign them a “role” to play in generating responses. For example, an LLM can act as a psychologist when answering questions.
Another example is the use of LLMs in detecting and correcting errors, such as syntactic or semantic mistakes in text. For an LLM which is trained on both code and natural language data, we may use it for code debugging⁶.

⁶ In this example, the code is not tokenized for easier reading.
⁷ To fine-tune an LLM for multi-turn dialogue, one needs to consider the conversation history in the context for predicting the response in the current round of conversation. This makes the actual prompt used in response generation relatively longer than that used in single-turn dialogue.
These examples and previous ones have shown that appropriate responses can be generated
via prompts involving clear instructions and questions. However, when problem solving requires knowledge that is not explicitly specified, LLMs may make mistakes, even though the instructions are sufficiently clear and precise. A family of challenging tasks for LLMs involves arithmetic
reasoning and commonsense reasoning. For example, we can ask an LLM to solve primary school
math problems presented in natural language.
Jack has 7 apples. He ate 2 of them for dinner, but then his mom gave him 5 more
apples. The next day, Jack gave 3 apples to his friend John. How many apples
does Jack have left in the end?
The answer is 10.

The correct answer is 7, so the LLM fails on this problem. One way to help the model is to add to the prompt a demonstration of a similar problem together with its answer:
Tom has 12 marbles. He wins 7 more marbles in a game with his friend but then
loses 5 marbles the next day. His brother gives him another 3 marbles as a gift.
How many marbles does Tom have now?
The answer is 17.
Jack has 7 apples. He ate 2 of them for dinner, but then his mom gave him 5 more
apples. The next day, Jack gave 3 apples to his friend John. How many apples
does Jack have left in the end?
The answer is 12.
But the LLM still made mistakes this time. A reason for this might be that solving math
problems does not only involve problem-answer mappings but also, to a larger extent, the under-
lying logical inference in multiple steps. A method to improve the inference abilities of LLMs
is chain-of-thought prompting (COT prompting) [Wei et al., 2022c]. In COT prompting, we
decompose complex reasoning problems into multiple problem-solving intermediate steps. These
steps are demonstrated in prompts so that LLMs can be prompted to learn to mimic the underlying
thought processes in solving the problems. COT prompting has been found to be very useful for
achieving good results on challenging tasks, such as the GSM8K mathematical reasoning bench-
mark.
Consider the above demonstration example Tom has 12 marbles. He wins 7 more marbles in
a game with his friend but then loses 5 marbles the next day. His brother gives him another 3
marbles as a gift. How many marbles does Tom have now?. The corresponding logical reasoning steps can be structured:
1. Tom initially has 12 marbles.
2. He wins 7 more marbles in a game, so we add those to his initial count: 12 marbles + 7
marbles = 19 marbles.
3. The next day, Tom loses 5 marbles, so we subtract those from his current total: 19 marbles
- 5 marbles = 14 marbles.
4. His brother gives him 3 more marbles, so we add those to his current total: 14 marbles + 3
marbles = 17 marbles.
We can add these reasoning steps into the prompt, and get the correct answer.
Tom has 12 marbles. He wins 7 more marbles in a game with his friend but then
loses 5 marbles the next day. His brother gives him another 3 marbles as a gift.
How many marbles does Tom have now?
Tom initially has 12 marbles. After winning a game, he gains 7 more, increasing
his total to 19 marbles. He loses 5 marbles the next day, bringing his total down
to 14 marbles. His brother gifts him 3 more marbles, increasing his total to 17
marbles. Therefore, Tom now has 17 marbles. So the answer is 17.
Jack has 7 apples. He ate 2 of them for dinner, but then his mom gave him 5 more
apples. The next day, Jack gave 3 apples to his friend John. How many apples
does Jack have left in the end?
Jack starts with 7 apples. He eats 2 apples for dinner, so we subtract 2 from 7,
leaving him with 5 apples. His mom gives him 5 more apples, adding 5 to his
current total. Now he has 5 + 5 = 10 apples. The next day, Jack gives 3 apples to
his friend John. Subtracting 3 from 10, Jack is left with 7 apples. Therefore, Jack
has 7 apples left. So the answer is 7.
This approach is also called one-shot COT prompting. Likewise, we can add more demon-
strations to have a few-shot COT prompting approach.
Furthermore, a special kind of COT, zero-shot COT, introduces instructions that provoke
LLMs to think, without presenting any intermediate reasoning steps in the prompts [Kojima et al.,
2022]. An interesting example is that by appending a simple sentence “Let’s think step by step.”
to the end of a prompt, LLMs can generate reasoning steps to reach the final answer.
Jack has 7 apples. He ate 2 of them for dinner, but then his mom gave him 5 more
apples. The next day, Jack gave 3 apples to his friend John. How many apples
does Jack have left in the end?
Let’s think step by step.
1. Initial Quantity: Jack starts with 7 apples.
2. After Dinner: He eats 2 apples, so 7 − 2 = 5 apples remain.
3. His Mom Gives More: His mom gives him 5 more apples, so 5 + 5 = 10
apples are now with him.
4. Giving to John: The next day, Jack gives 3 apples to his friend John,
so 10 − 3 = 7 apples are left.
In the end, Jack has 7 apples left.
Zero-shot, one-shot, and few-shot learning are common concepts in the area of in-context
learning for LLMs and are not restricted to COT prompting. Broadly speaking, any prompting
that involves only simple instructions without any demonstrations can be considered a form of zero-shot learning. This zero-shot learning ability emerges as LLMs are pre-trained and/or fine-tuned. Also, one-shot and few-shot learning methods are more often considered when LLMs do not acquire the corresponding zero-shot learning ability. These methods are therefore important for in-context learning when addressing new tasks. Examples include those for performing various NLP tasks by demonstrating task-formatted samples. See the following examples for sentence sentiment classification and phrase translation via few-shot learning.
Given the following text snippets, classify their sentiment as Positive, Negative,
or Neutral.
Example 1: “I had an amazing day at the park!”
Sentiment: Positive
Example 2: “The service at the restaurant was terrible.”
Sentiment: Negative
Example 3: “I think it’s going to rain today.”
Sentiment: Neutral
Text: “This movie was a fantastic journey through imagination.”
Sentiment: Positive
Above, we have presented examples to illustrate the fundamental in-context learning capa-
bilities of prompting LLMs. This section, however, does not include more advanced prompting
techniques in order to keep the content concise and compact. More discussions on prompting can
be found in Chapter 3.
2.2 Training at Scale

As a first step in developing LLMs, we need to train these models on large amounts of data. The training task is itself standard: the objective is to maximize the likelihood, which can be achieved via gradient descent. However, as we scale up both the model size and the amount of data, the problem becomes very challenging; for example, large models generally make the training unstable. In this section, we discuss several issues of large-scale training for LLMs, including data preparation, model modification, and distributed training. We also discuss the scaling laws for LLMs, which help us understand their training efficiency and effectiveness.
The importance of data cannot be overstated in NLP. As larger neural networks are developed,
the demand for data continues to increase. For example, developing LLMs may require trillions
of tokens in pre-training (see Table 2.3), orders of magnitude larger than those used in training
conventional NLP models. In general, we may want to gather as much training data as possible.
However, larger training datasets do not mean better training results, and the development of
LLMs raises new issues in creating or collecting these datasets.
A first issue is the quality of data. High-quality data has long been seen as crucial for training data-driven NLP systems. Directly using raw text from various sources is in general undesirable. For example, a significant portion of the data used to train recent LLMs comes from web scraping, which may contain errors and inappropriate content, such as toxic information and fabricated facts. Also, the internet is flooded with machine-generated content due to the widespread use of AI, presenting further challenges for processing and using web-scraped data. Researchers have found that training LLMs on unfiltered data is harmful [Raffel et al., 2020]. Improving data quality typically involves incorporating filtering and cleaning steps in the data processing workflow. For example, Penedo et al. [2023] show that by adopting a number of data processing techniques, 90% of their web-scraped data can be removed for LLM training. In addition to large-scale web-scraped data, LLM training data often includes books, papers, user-generated data on social media, and
so on. Most of the latest LLMs are trained on such combined datasets, which are found to be effective.

Table 2.3: Amounts of training data used in some LLMs in terms of the number of tokens.
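As a simple illustration of the kind of filtering and cleaning steps mentioned above (the specific heuristics and thresholds here are assumptions made for the example, not recommendations from the text), a data-cleaning pass might look like this:

import re

def clean_corpus(documents, min_words=20, max_repeat_ratio=0.3, blocklist=("lorem ipsum",)):
    """Apply simple quality filters to a list of raw text documents."""
    kept, seen = [], set()
    for doc in documents:
        text = re.sub(r"\s+", " ", doc).strip()      # normalize whitespace
        words = text.lower().split()
        if len(words) < min_words:                   # drop very short documents
            continue
        if any(phrase in text.lower() for phrase in blocklist):
            continue                                 # drop documents with blocked phrases
        if 1 - len(set(words)) / len(words) > max_repeat_ratio:
            continue                                 # drop highly repetitive documents
        if text in seen:                             # exact-match deduplication
            continue
        seen.add(text)
        kept.append(text)
    return kept

Real pipelines combine many such heuristics with model-based quality classifiers and near-duplicate detection at much larger scale.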
Training LLMs is difficult. A commonly encountered problem is that the training process becomes more unstable as LLMs get bigger. For example, one needs to choose a small learning rate to achieve stable training with gradient descent, but this in turn results in much longer training times. Sometimes, even when the training configuration is carefully designed, training may diverge at certain points during optimization. The training of LLMs is generally influenced by many factors, such as parameter initialization, batching, and regularization. Here, we focus on common modifications and improvements to the standard Transformer architecture, which are considered important in developing trainable LLMs.
Layer normalization is used to stabilize training for deep neural networks. It is a process of
subtracting the mean and dividing by the standard deviation. By normalizing layer output in
this way, we can effectively reduce the covariate shift problem and improve the training stability.
In Transformers, layer normalization is typically used together with residual connections. As
described in Section 2.1.1, a sub-layer can be based on either the post-norm architecture, in which layer normalization is performed right after a residual block, or the pre-norm architecture, in which layer normalization is performed inside a residual block. While both of these architectures are widely used in Transformer-based systems [Wang et al., 2019], the pre-norm architecture has proven to be especially useful in training deep Transformers. Given this, most LLMs are based on the pre-norm architecture, expressed as output = F(LNorm(input)) + input.
A widely-used form of the layer normalization function is given by
LNorm(h) = α · (h − µ)/(σ + ε) + β    (2.23)

where h is a d-dimensional real-valued vector, µ is the mean of all the entries of h, and σ is the corresponding standard deviation. ε is introduced for the sake of numerical stability. α ∈ R^d and β ∈ R^d are the gain and bias terms.
A variant of layer normalization, called root mean square (RMS) layer normalization, only
re-scales the input vector but does not re-center it [Zhang and Sennrich, 2019]. The RMS layer
normalization function is given by

LNorm(h) = α · h/(σ_rms + ε) + β    (2.24)

where σ_rms is the root mean square of h, that is, σ_rms = (\frac{1}{d} \sum_{k=1}^{d} h_k^2)^{1/2}. This layer normalization function is used in LLMs like the LLaMA series.
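A minimal NumPy sketch of the two normalization functions in Eqs. (2.23) and (2.24) is given below (an illustration, not a production implementation; the gain and bias are taken as plain vectors rather than learned parameters).

import numpy as np

def layer_norm(h, alpha, beta, eps=1e-6):
    """Standard layer normalization (Eq. 2.23): re-center and re-scale h."""
    mu = h.mean()
    sigma = h.std()
    return alpha * (h - mu) / (sigma + eps) + beta

def rms_norm(h, alpha, beta, eps=1e-6):
    """RMS layer normalization (Eq. 2.24): re-scale h without re-centering it."""
    rms = np.sqrt(np.mean(h ** 2))
    return alpha * h / (rms + eps) + beta

d = 8
h = np.arange(d, dtype=float)
alpha, beta = np.ones(d), np.zeros(d)
print(layer_norm(h, alpha, beta))
print(rms_norm(h, alpha, beta))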
Another common modification concerns the FFN sub-layers. A standard FFN with a single hidden layer can be written as

FFN(h) = σ(h W_h + b_h) W_f + b_f    (2.25)

where W_h ∈ R^{d×d_h}, b_h ∈ R^{d_h}, W_f ∈ R^{d_h×d}, and b_f ∈ R^d are the parameters, and d_h is the hidden size. σ(·) is the activation function of the hidden layer. A common choice for σ(·) is the rectified linear unit (ReLU), given by

σ_relu(h) = max(0, h)
In practical implementations, increasing d_h is helpful and thus it is often set to a larger number
in LLMs. But a very large hidden size poses challenges for both training and deployment. In this
case, the design of the activation function plays a relatively more important role in wide FFNs.
There are several alternatives to the ReLU in LLMs. One of these is the Gaussian error linear unit (GeLU), which can be seen as a smoothed version of the ReLU. Rather than controlling the output by the sign of the input, the GeLU function weights its input by the percentile Pr(h ≤ h). Here h is a d-dimensional vector whose entries are drawn from the standard normal distribution Gaussian(0, 1)⁹. Specifically, the GeLU function is defined to be

σ_gelu(h) = h ⊙ Φ(h)

where Φ(h) is the cumulative distribution function of Gaussian(0, 1), which can be implemented
in convenient ways [Hendrycks and Gimpel, 2016]. The GeLU function has been adopted in
several LLMs, such as BERT, GPT-3, and BLOOM.
Another family of activation functions which is popular in LLMs is gated linear unit (GLU)-based functions. The basic form of GLUs is given by

σ_glu(h) = σ(h W) ⊙ (h V)

where W and V are parameter matrices and σ(·) is a gating (activation) function. Choosing σ(·) to be the GeLU function gives the GeGLU activation. This activation function has been successfully applied in LLMs like Gemma.
As another example, consider σ(·) to be the Swish function σ_swish(h) = h ⊙ Sigmoid(c · h), where c is a scalar parameter. Substituting this function into the GLU form gives the SwiGLU function.

⁸ Here degeneration refers to the phenomenon in which the rank of a matrix is reduced after some processing.
⁹ Pr(h ≤ h) is an informal notation. It refers to a vector, with each entry representing the percentile for the corresponding entry of h.

Both the PaLM and LLaMA series are based on the SwiGLU function. For more discussions of
GLUs, the reader can refer to Shazeer [2020]’s work.
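The activation functions above can be sketched in a few lines (illustrative definitions using the erf-based form of Φ; parameter shapes are chosen arbitrarily for the example):

import numpy as np
from math import erf

def relu(h):
    return np.maximum(0.0, h)

def gelu(h):
    # GeLU: weight each entry by the standard normal CDF Phi(h).
    phi = np.vectorize(lambda v: 0.5 * (1.0 + erf(v / np.sqrt(2.0))))
    return h * phi(h)

def swish(h, c=1.0):
    return h / (1.0 + np.exp(-c * h))

def swiglu(h, W, V, c=1.0):
    # Gated linear unit with a Swish gate: swish(hW) * (hV), elementwise.
    return swish(h @ W, c) * (h @ V)

rng = np.random.default_rng(0)
h = rng.normal(size=4)
W, V = rng.normal(size=(4, 6)), rng.normal(size=(4, 6))
print(relu(h), gelu(h), swiglu(h, W, V), sep="\n")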
Another popular model design is to remove the bias terms in the affine transformations used in LLMs. This treatment can be applied to layer normalization, the transformations of the inputs to QKV attention, and FFNs. For example, we can modify Eq. (2.25) to obtain an FFN with no bias terms

FFN(h) = σ(h W_h) W_f
Chowdhery et al. [2022] report that removing bias terms helps improve the training stability
of LLMs. This method has been used in several recent LLMs, such as LLaMA and Gemma.
Many LLMs also involve modifications to their positional embedding models. For example, one can replace sinusoidal positional encodings with rotary position embeddings so that the learned LLMs can handle long sequences better. These models will be discussed in Section 2.3.
Note that while model modifications are common in training LLMs, the stability of training can be improved in many different ways. For example, increasing the batch size as the training proceeds has been found to be useful for some LLMs. In general, achieving stable and efficient large-scale LLM training requires carefully designed setups, including learning schedules, optimizer choices, training parallelism, mixed precision training, and so on. Some of these issues are highly engineered, and therefore, we typically need a number of training runs to obtain satisfactory LLMs.
Training LLMs at scale typically relies on a combination of parallelism strategies [Narayanan et al., 2021; Fedus et al., 2022]. Here we sketch the basic concepts.
• Data Parallelism. This method is one of the most widely used parallelism methods for
training neural networks. To illustrate, consider the simplest case where the standard delta
rule is used in gradient descent

θ_{t+1} = θ_t − lr · \frac{∂L_{θ_t}(D_mini)}{∂θ_t}    (2.32)

where the new parameters θ_{t+1} are obtained by updating the latest parameters θ_t with a small step lr in the direction of the negative loss gradient. \frac{∂L_{θ_t}(D_mini)}{∂θ_t} is the gradient of the loss with respect to the parameters θ_t, and is computed on a minibatch of training samples D_mini. In data parallelism, we divide D_mini into N smaller batches, denoted by {D^1, ..., D^N}. Then, we distribute these batches to N workers, each with a corresponding batch. Once the data is distributed, these workers can work at the same time. The gradient of the entire minibatch is obtained by aggregating the gradients computed by the workers, like this

\frac{∂L_{θ_t}(D_mini)}{∂θ_t} = \sum_{n=1}^{N} \frac{∂L_{θ_t}(D^n)}{∂θ_t}    (2.33)

In ideal cases where the workers coordinate well and the communication overhead is small, data parallelism can achieve nearly an N-fold speed-up for training.
• Model Parallelism. Although data parallelism is simple and effective, it requires each
worker to run the entire LLM and perform the complete forward and backward process.
As LLMs grow larger, it sometimes becomes unfeasible to load and execute an LLM on a
single device. In this case, we can decouple the LLM into smaller components and run these
components on different devices. One simple way to do this is to group consecutive layers
in the layer stack and assign each group to a worker. The workers operate in the order of
the layers in the stack, that is, in the forward pass we process the input from lower-level to
upper-level layers, and in the backward pass we propagate the error gradients from upper-
level to lower-level layers. Consider, for example, a Transformer decoder with L stacked
blocks. To distribute the computation load, each block is assigned to a worker. See the
following illustration for a single run of the forward and backward passes of this model.

B_1↑ → B_2↑ → · · · → B_L↑ → B_L↓ → B_{L−1}↓ → · · · → B_1↓

Here B_l denotes the computation of block l, and the symbols ↑ and ↓ denote the forward and backward passes, respectively. Note that this parallelism method forces the workers to run in sequence, so a worker has to wait for the previous worker to finish its job. This results
in the devices being idle for most of the time. In practical systems, model parallelism is
generally used together with other parallelism mechanisms to maximize the use of devices.
• Tensor Parallelism. Another way to distribute the computation is to split individual parameter matrices across devices. Consider, for example, the hidden-layer transformation h W_h in an FFN. We can divide the parameter matrix W_h column-wise into M sub-matrices

W_h = [W_h^1  W_h^2  ...  W_h^M]    (2.34)

where each sub-matrix W_h^k has a shape of d × (d_h/M). The multiplication of h with W_h can then be expressed as

h W_h = h [W_h^1  W_h^2  ...  W_h^M]
      = [h W_h^1  h W_h^2  ...  h W_h^M]    (2.35)
We can perform the matrix multiplications {h W_h^1, h W_h^2, ..., h W_h^M} on M devices separately.
As a result, we distribute a large matrix multiplication across multiple devices, each of
which may have relatively small memory. From the perspective of the design of modern
GPUs, tensor parallelism over GPUs provides a two-level, tile-based approach to parallel
computing. First, at a higher level, we decompose a matrix multiplication into sub-matrix
multiplications that can directly fit into the memory of GPUs. Then, at a lower level, we execute these sub-matrix multiplications on GPUs using tile-based parallel algorithms that are specifically optimized for GPUs (a small code sketch of this column-wise splitting is given after this list).
• Pipeline Parallelism. A further refinement of model parallelism is to split each minibatch into several micro-batches and feed them through the workers in a pipelined fashion, so that different workers can process different micro-batches at the same time. Here B_{l,k} represents the processing of the k-th micro-batch by the l-th worker. Ideally we would like to maximize the number of micro-batches, and thus minimize the idle time of the workers. However, in practice, using small micro-batches often reduces GPU utilization and increases task-switching costs. This may, in turn, decrease the overall system throughput.
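As referenced in the tensor parallelism item above, the following sketch (a toy single-process illustration, not a real multi-GPU implementation) splits W_h column-wise as in Eqs. (2.34) and (2.35), computes the shards independently, and checks that concatenating the results reproduces the full product.

import numpy as np

def tensor_parallel_matmul(h, W, num_shards):
    """Compute h @ W by splitting W column-wise into `num_shards` sub-matrices."""
    shards = np.split(W, num_shards, axis=1)          # W = [W^1 W^2 ... W^M]
    partial_results = [h @ Wk for Wk in shards]       # in practice, one per device
    return np.concatenate(partial_results, axis=-1)   # [hW^1 hW^2 ... hW^M]

rng = np.random.default_rng(0)
d, d_h, M = 8, 32, 4
h = rng.normal(size=(d,))
W_h = rng.normal(size=(d, d_h))

out_parallel = tensor_parallel_matmul(h, W_h, M)
print(np.allclose(out_parallel, h @ W_h))  # True: sharded and full results match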
The ultimate goal of parallel processing is to achieve linear growth in efficiency, that is, the number of samples that can be processed per unit of time increases linearly with the number of devices. However, distributed training is complicated, and influenced by many factors in addition
to the parallelism method we choose. One problem, which is often associated with distributed
systems, is the cost of communication. We can think of a distributed system as a group of net-
worked nodes. Each of these nodes can perform local computation or pass data to other nodes. If
there are a large number of such nodes, it will be expensive to distribute and collect data across
them. Sometimes, the time savings brought about by parallelism are offset by the communica-
tion overhead of a large network. Another problem with large-scale distributed systems is that
the synchronization of nodes introduces additional costs. As is often the case, some nodes may
take longer to work, causing others to wait for the slowest ones. While we can use asynchronous
training to handle heterogeneity in computational resources, this may lead to stale gradients and
non-guaranteed convergence. Moreover, as more nodes are added to the network, there is more
chance to have crashed nodes during training. In this case, we need to ensure that the whole
system is fault tolerant. In many practical settings, to increase scalability, one needs to take into
account additional issues, including architecture design, data transfer and computation overlap,
load balancing, memory bandwidth and so on.
Training LLMs is so computationally expensive that, even though distributed training is al-
ready in use, researchers and engineers often still employ various model compression and speed-
up methods to improve training efficiency [Weng, 2021]. One example is mixed precision training,
in which low precision data (such as FP16 and FP8 data) is used for gradient computation on each
individual node, and single or double precision data (such as FP32/FP64 data) is used for updating
the model [Micikevicius et al., 2018]. A key operation in this approach is gradient accumulation
where gradients need to be accumulated and synchronized across nodes. However, due to the non-associativity of floating-point addition, this can lead to slight numerical differences in accumulated gradients on different nodes, which may affect model convergence and final performance. This problem is more obvious if there are a large number of nodes involved in distributed training, especially given that low-precision numerical computations may encounter overflow and underflow issues, as well as inconsistencies across different hardware devices. Therefore, the design of
distributed systems needs to consider these numerical computation issues to ensure satisfactory
results and convergence.
The success of LLMs reveals that training larger language models using more resources can lead
to improved model performance. Researchers have explained this as scaling laws of LLMs. More
specically, scaling laws describe the relationships between the performance of LLMs and the
attributes of LLM training, such as the model size, the amount of computation used for training,
and the amount of training data. For example, Hestness et al. [2017] show that the performance of
deep neural networks is a power-law-like function of the training data size. In the beginning, when
the amount of training data is not large, the performance of the model improves slowly. Afterward,
when more training data is used, the model enters a phase of rapid performance improvement, and
the performance curve resembles a power-law curve. Ultimately, the improvement in performance
becomes slow again, and more data does not lead to significant gains. Figure 2.3 shows an example of such curves.

Fig. 2.3: A scaling law of test error against a variable of interest (e.g., training dataset size) [Hestness et al., 2017]. The curve of the scaling law can be divided into three phases. At the beginning, the number of test errors decreases slowly when more training data is used, but this only lasts for a short period. In the second phase, the number of test errors decreases drastically, and the curve becomes a power-law curve. After that, the error reduction slows down again in the third phase. Note that there are irreducible errors that cannot be eliminated, regardless of the amount of training data.
In NLP, a traditional view holds that the performance gains will disappear at a certain point
as the training is scaled up. However, recent results show that, if we consider the problem on
a larger scale, scaling up training is still a very effective method for obtaining stronger LLMs.
For example, both closed-source and open-source LLMs can benefit from more data, even though
trillions of tokens have already been used for training.
With the increase in the scale of model training, LLMs exhibit new capabilities, known as the
emergent abilities of LLMs. For example, Wei et al. [2022b] studied the scaling properties of
LLMs across different model sizes and amounts of computational resources. Their work shows
that some abilities emerge when we scale the model size to a certain level. The appearance of
emergent abilities has demonstrated the role of scaled training in enhancing the performance of
LLMs, and it has also, to some extent, motivated researchers to continuously attempt to train larger
models. As larger and stronger LMs continue to appear, our understanding of the scaling laws
continues to mature. This helps researchers predict the performance of LLMs during training and
estimate the minimal computational resources required to achieve a given level of performance.
To understand how model performance scales with various factors considered during training,
it is common to express the model performance as a function of these factors. For example, in
the simplest case, we can express the loss or error of an LLM as a function of a single variable of
interest. However, there are no universal scaling laws that can describe this relationship. Instead,
different functions are proposed to fit the learning curves of LLMs.
Let x be the variable of interest (such as the number of model parameters) and L(x) be the
loss of the model given x (such as the cross-entropy loss on test data). The simplest form of L(x)
is a power law

    L(x) = \left(\frac{x}{a}\right)^{b}    (2.36)
where a and b are parameters that are estimated empirically. Despite its simplicity, this function
has successfully described the scaling behavior of language models and machine translation sys-
tems in terms of model size (denoted by N) and training dataset size (denoted by D) [Gordon et al.,
2021; Hestness et al., 2017]. For example, Kaplan et al. [2020] found that the performance of their
language model improves as a power law of either N or D after an initial transient period, and
expressed these relationships as L(N) = \left(\frac{N}{8.8 \times 10^{13}}\right)^{-0.076} and
L(D) = \left(\frac{D}{5.4 \times 10^{13}}\right)^{-0.095} (see Figure 2.4).

Fig. 2.4: Test loss against model size (N) and training dataset size (D) (data points are plotted for illustrative purposes).
We plot the test loss as a function of N, defined as L(N) = \left(\frac{N}{8.8 \times 10^{13}}\right)^{-0.076}, and as a
function of D, defined as L(D) = \left(\frac{D}{5.4 \times 10^{13}}\right)^{-0.095} [Kaplan et al., 2020].
An improvement to this scaling law is to add an irreducible error term to the power law. The
form of L(x) is then given by

    L(x) = \left(\frac{x}{a}\right)^{b} + \epsilon_{\infty}    (2.37)

where \epsilon_{\infty} is the irreducible error that accounts for the error due to unknown variables, which is
present even as x → ∞. Eq. (2.37) is one of the most widely used forms for designing scaling
laws of LLMs. For example, Rosenfeld et al. [2020] developed a scaling law that involves both
model scaling and dataset scaling, like this
    L(N, D) = aN^{b} + cD^{d} + \epsilon_{\infty}    (2.38)
An example of such formulation is the Chinchilla scaling law. It states that the test loss per
token is the sum of the inverse proportion functions of N and D, with an additional irreducible
error term. Hoffmann et al. [2022] express this scaling law as
    L(N, D) = \underbrace{\frac{406.4}{N^{0.34}}}_{\text{model scaling}} + \underbrace{\frac{410.7}{D^{0.28}}}_{\text{dataset scaling}} + \underbrace{1.69}_{\text{irreducible error}}    (2.39)
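As a simple illustration, the following sketch evaluates the loss predicted by Eq. (2.39) for a given model size and data size; the example values of N and D are arbitrary placeholders.

```python
def chinchilla_loss(N, D, a=406.4, alpha=0.34, b=410.7, beta=0.28, e_inf=1.69):
    """Test loss per token as a function of model size N (parameters) and
    training data size D (tokens), following Eq. (2.39)."""
    return a / N**alpha + b / D**beta + e_inf

# Example: a 70B-parameter model trained on 1.4T tokens
print(chinchilla_loss(N=70e9, D=1.4e12))
```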
All the scaling laws mentioned above are based on monotonic functions. So they cannot cover
functions with inflection points, such as double descent curves. In response, researchers have
explored more sophisticated functions to fit the learning curves; examples of such functions can
be found in the related literature on neural scaling laws.

2.3 Long Sequence Modeling
We have already seen that, in large-scale training, larger language models can be developed by us-
ing more data and computational resources. However, scaling up can also occur in other directions.
For instance, in many applications, LLMs are adapted to process very long sequences. An
interesting example is that we pre-train an LLM on extensive texts of normal length and then ap-
ply it to deal with very long token sequences, far beyond the length encountered in pre-training.
Here we use Pr(y|x) to denote the text generation probability where x is the context and y is the
generated text. There are broadly three types of long sequence modeling problems.
• Text generation based on long context (i.e., x is a long sequence). For example, we
generate a short summary for a very long text.
• Long text generation (i.e., y is a long sequence). For example, we generate a long story
based on a few keywords.
• Long text generation based on long context (i.e., both x and y are long sequences). For
example, we translate a long document from Chinese to English.
Recently, NLP researchers have been more interested in applying and evaluating LLMs on
tasks where extremely long input texts are involved. Imagine an LLM that reads a C++ source
file containing tens of thousands of lines and outlines the functionality of the corresponding
program. Such models, capable of handling extensive textual contexts, are sometimes called
long-context LLMs. In this section we will restrict ourselves to long-context LLMs, but the
methods discussed here can also be applied to other problems.
For Transformers, dealing with long sequences is computationally expensive, as the computa-
tional cost of self-attention grows quadratically with the sequence length. This makes it infeasible
to train and deploy such models for very long inputs. Two strands of research have tried to adapt
Transformers to long-context language modeling.
• The first explores efficient training methods and model architectures to learn self-attention
models from long-sequence data.
• The other adapts pre-trained LLMs to handle long sequences with modest or no fine-tuning
efforts.
Here, we will discuss the former briefly since it can be found in general discussions of efficient
Transformer architectures [Tay et al., 2020; Xiao and Zhu, 2023]. We will focus on the latter,
highlighting popular methods in recent LLMs. We will also discuss the strengths and limitations
of these long-sequence models.
We begin our discussion by considering improvements to standard Transformer models from the
perspectives of high-performance computing. Most of these improvements, though not specifi-
cally designed for LLMs, have been widely applied across various deep learning models [Kim et al.,
2023]. A commonly used approach is to adopt a low-precision implementation of Transformers.
For example, we can use 8-bit or 16-bit fixed-point data types for arithmetic operations, instead
of 32-bit or 64-bit floating-point data types. Using these low-precision data types can increase
the efficiency and memory throughput, so that longer sequences can be processed more easily.
An alternative approach is to improve Transformers by using hardware-aware techniques. For
example, on modern GPUs, the efficiency of Transformers can be improved by using IO-aware
implementations of the self-attention function [Dao et al., 2022; Kwon et al., 2023].
Another way to handle long sequences is through sequence parallelism [Li et al., 2023b;
Korthikanti et al., 2023]. Specifically, consider the general problem of attending the query qi
at the position i to the keys K and values V. We can divide K by rows and obtain a set of sub-
matrices {K[1] , ..., K[nu ] }, each corresponding to a segment of the sequence. Similarly, we can
obtain the sub-matrices of V, denoted by {V[1] , ..., V[nu ] }. Then, we assign each pair of K[u] and
V[u] to a computing node (e.g., a GPU of a GPU cluster). The assigned nodes can run in parallel,
thereby parallelizing the attention operation.
Recall that the output of the self-attention model can be written as
    Att_{qkv}(q_i, K, V) = \sum_{j=0}^{m-1} \alpha_{i,j} v_j    (2.40)
where αi,j is the attention weight between positions i and j. In Transformers, αi,j is obtained
by normalizing the rescaled version of the dot product between qi and kj . Let βi,j denote the
attention score between qi and kj . We have
    \beta_{i,j} = \frac{q_i \cdot k_j}{\sqrt{d}} + \mathrm{Mask}(i, j)    (2.41)
where Mask(i, j) is the masking variable for (i, j). Then, we define the attention weight \alpha_{i,j} to be

    \alpha_{i,j} = \mathrm{Softmax}(\beta_{i,j}) = \frac{\exp(\beta_{i,j})}{\sum_{j'} \exp(\beta_{i,j'})}    (2.42)
On each computing node, we need to implement these equations. Given the keys and values
assigned to this node, computing the numerator of the right-hand side of Eq. (2.42) (i.e., exp(βi,j ))
is straightforward, as all the required information is stored on the node. However, computing the
denominator of the right-hand side of Eq. (2.42) involves a sum of exp(βi,j ′ ) over all j ′ s, which
requires transferring data to and from other nodes. To illustrate, suppose that vj and kj are placed
on node u. We can rewrite Eq. (2.42) as
    \underbrace{\alpha_{i,j}}_{\text{node } u} = \frac{\exp(\beta_{i,j})}{\underbrace{\sum_{k_{j'} \in K_{[1]}} \exp(\beta_{i,j'})}_{\text{node } 1} + \cdots + \underbrace{\sum_{k_{j'} \in K_{[u]}} \exp(\beta_{i,j'})}_{\text{node } u} + \cdots + \underbrace{\sum_{k_{j'} \in K_{[n_u]}} \exp(\beta_{i,j'})}_{\text{node } n_u}}    (2.43)
where the notation k_{j'} \in K_{[u]} represents that k_{j'} is a row vector of K_{[u]}. In a straightforward
implementation, we first perform the summations \{\sum_{k_{j'} \in K_{[u]}} \exp(\beta_{i,j'})\} separately on the corre-
sponding nodes. Then, we collect these summation results from different nodes to combine them
into a final result. This corresponds to a collective operation in the context of parallel processing.
There are many efficient implementations of such operations, such as the all-reduce algorithms.
Hence the sum of all \exp(\beta_{i,j}) values can be computed using optimized routines in collective
communication toolkits.
Given the attention weights {αi,j }, we then compute the attention results using Eq. (2.40).
The problem can be re-expressed as
    Att_{qkv}(q_i, K, V) = \underbrace{\sum_{v_{j'} \in V_{[1]}} \alpha_{i,j'} v_{j'}}_{\text{node } 1} + \cdots + \underbrace{\sum_{v_{j'} \in V_{[u]}} \alpha_{i,j'} v_{j'}}_{\text{node } u} + \cdots + \underbrace{\sum_{v_{j'} \in V_{[n_u]}} \alpha_{i,j'} v_{j'}}_{\text{node } n_u}    (2.44)
Like Eq. (2.43), Eq. (2.44) can be implemented as a summation program in parallel processing:
we first perform the weighted summations of values on different nodes simultaneously, and then
collect the results from these nodes via collective operations.
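The following is a minimal single-process sketch of this scheme using NumPy. The loop over key/value blocks simulates the per-node computation, and summing the partial numerators and denominators stands in for the all-reduce step of a real multi-node implementation.

```python
import numpy as np

def attention_seq_parallel(q, K, V, num_nodes=4):
    """Single-process simulation of sequence-parallel attention (Eqs. 2.43-2.44).
    K and V are split by rows across hypothetical nodes. Each node computes a
    partial softmax denominator and a partial weighted sum of values; combining
    them corresponds to an all-reduce over nodes in a real distributed setup."""
    d = q.shape[-1]
    K_parts = np.array_split(K, num_nodes, axis=0)
    V_parts = np.array_split(V, num_nodes, axis=0)

    den = 0.0                       # sum of exp-scores over all nodes
    num = np.zeros_like(q)          # weighted sum of values over all nodes
    for K_u, V_u in zip(K_parts, V_parts):
        scores = K_u @ q / np.sqrt(d)        # beta_{i,j} for keys held by node u
        e = np.exp(scores)                   # exp(beta_{i,j})
        den += e.sum()                       # partial denominator ("all-reduce" sum)
        num += e @ V_u                       # partial weighted value sum ("all-reduce" sum)
    return num / den

# Reference check against standard attention
rng = np.random.default_rng(0)
q, K, V = rng.normal(size=(16,)), rng.normal(size=(64, 16)), rng.normal(size=(64, 16))
w = np.exp(K @ q / 4.0)
ref = (w / w.sum()) @ V
assert np.allclose(attention_seq_parallel(q, K, V), ref)
```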
Note that, although this section primarily focuses on long sequence modeling, much of the mo-
tivation for sequence parallelism comes from the distributed training methods of deep networks,
as discussed in Section 2.2.3. As a result, the implementation of these methods can be based on
the same parallel processing library.
One difficulty of applying Transformers to long sequences is that self-attention has a quadratic
time complexity with respect to the sequence length. Moreover, a key-value cache (or KV cache
for short) is maintained during inference, and its size increases as more tokens are processed. Al-
though the KV cache grows linearly with the sequence length, for extremely long input sequences,
the memory footprint becomes significant and it is even infeasible to deploy LLMs for such tasks.
As a result, the model architecture of long-context LLMs generally moves away from the standard
Transformer, turning instead to the development of more efficient variants and alternatives.
One approach is to use sparse attention instead of standard self-attention. This family of
models is based on the idea that only a small number of tokens are considered important when
attending to a given token, and so most of the attention weights between tokens are close to zero.
As a consequence, we can prune most of the attention weights and represent the attention model
in a compressed form. To illustrate, consider the self-attention model
    \alpha(Q, K) = \mathrm{Softmax}\Big(\frac{QK^{T}}{\sqrt{d}} + \mathrm{Mask}\Big)
                 = \begin{bmatrix} \alpha_{0,0} & 0 & 0 & \cdots & 0 \\ \alpha_{1,0} & \alpha_{1,1} & 0 & \cdots & 0 \\ \alpha_{2,0} & \alpha_{2,1} & \alpha_{2,2} & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \alpha_{m-1,0} & \alpha_{m-1,1} & \alpha_{m-1,2} & \cdots & \alpha_{m-1,m-1} \end{bmatrix}    (2.46)
Each row vector [\alpha_{i,0} \ \cdots \ \alpha_{i,i} \ 0 \ \cdots \ 0] corresponds to a distribution of attending the i-th
token to every token of the sequence. Since language models predict next tokens only based on
their left-context, we normally write the output of the attention model at position i as

    Att_{qkv}(q_i, K_{\le i}, V_{\le i}) = [\alpha_{i,0} \ \cdots \ \alpha_{i,i}] \begin{bmatrix} v_0 \\ \vdots \\ v_i \end{bmatrix} = \sum_{j=0}^{i} \alpha_{i,j} v_j    (2.47)

where K_{\le i} = \begin{bmatrix} k_0 \\ \vdots \\ k_i \end{bmatrix} and V_{\le i} = \begin{bmatrix} v_0 \\ \vdots \\ v_i \end{bmatrix} are the keys and values up to position i.
In the original version of self-attention, [\alpha_{i,0} \ \cdots \ \alpha_{i,i}] is assumed to be dense, that is, most of
the values are non-zero. In sparse attention, only some of the entries of [\alpha_{i,0} \ \cdots \ \alpha_{i,i}] are considered
non-zero, and the remaining entries are simply ignored in computation. Suppose G \subseteq \{0, ..., i\} is
the set of indices of the non-zero entries. For language models, the output of the sparse attention
model at position i is given by

    Att_{sparse}(q_i, K_{\le i}, V_{\le i}) = \sum_{j \in G} \alpha'_{i,j} v_j    (2.48)

Here \{\alpha'_{i,j}\} are normalized over G. Hence their values are different from the original attention
weights (in fact we have \alpha'_{i,j} > \alpha_{i,j}). The sparsity of the model is determined by how large G is.
Sparse attention models differ in the way we define G. One simple approach is to define G based
on heuristically designed patterns. For example, a widely-used pattern involves having G cover a
window of tokens located near position i [Parmar et al., 2018].
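A minimal sketch of such a pattern-based sparse attention model is given below, assuming the widely-used sliding-window pattern in which G covers the most recent tokens; the window size and function names are illustrative.

```python
import numpy as np

def sliding_window_attention(q_i, K_prefix, V_prefix, i, window=128):
    """Sparse attention with a heuristic pattern (Eq. 2.48): the set G of
    attended positions is a window of tokens ending at position i."""
    start = max(0, i - window + 1)
    G = np.arange(start, i + 1)                  # indices of the non-zero entries
    d = q_i.shape[-1]
    scores = K_prefix[G] @ q_i / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                     # alpha'_{i,j}, normalized over G
    return weights @ V_prefix[G]
```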
While sparse attention reduces the computation through the use of sparse operations, such
models still have significant limitations as we must keep the entire KV cache (i.e., K≤i and V≤i )
during inference. If the sequence is very long, storing this cache will become highly memory-
intensive. To address this, we can consider a different form of attention models where the KV
cache is not explicitly retained. Linear attention is one such approach [Katharopoulos et al., 2020].
It uses a kernel function φ(·) to project each query and key onto points qi′ = φ(qi ) and ki′ = φ(ki ),
respectively. By removing the Softmax function under such transformations10 , the form of the
resulting attention model is given by

    Att_{linear}(q_i, K_{\le i}, V_{\le i}) = \frac{q'_i \mu_i}{q'_i \nu_i}    (2.49)

where \mu_i and \nu_i are variables that are computed in the recurrent forms

    \mu_i = \mu_{i-1} + k'^{T}_i v_i    (2.50)
    \nu_i = \nu_{i-1} + k'^{T}_i    (2.51)
\mu_i and \nu_i can be seen as representations of the history up to position i. A benefit of this model is
that we need not keep all past keys and values. Instead, only the latest representations \mu_i and
\nu_i are used. So the computational cost of each step is constant, and the model can be easily
extended to deal with long sequences.
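Below is a minimal sketch of this recurrent form of linear attention. The kernel function φ(·) is assumed to be elu(x) + 1, a common choice in this line of work, and the variable names follow Eqs. (2.49)-(2.51).

```python
import numpy as np

def phi(x):
    """Kernel feature map; elu(x) + 1 is a common choice for linear attention."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention_stream(queries, keys, values):
    """Recurrent form of linear attention: the history is summarized by mu
    (a d'-by-d_v matrix) and nu (a d'-vector), so per-step cost and memory
    are constant regardless of sequence length."""
    d_prime = phi(keys[0]).shape[-1]
    mu = np.zeros((d_prime, values.shape[-1]))
    nu = np.zeros(d_prime)
    outputs = []
    for q, k, v in zip(queries, keys, values):
        k_prime = phi(k)
        mu += np.outer(k_prime, v)          # mu_i = mu_{i-1} + k'_i^T v_i
        nu += k_prime                       # nu_i = nu_{i-1} + k'_i^T
        q_prime = phi(q)
        outputs.append((q_prime @ mu) / (q_prime @ nu))
    return np.stack(outputs)
```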
In fact, this sequential approach to long sequence modeling arises naturally when we adopt a
viewpoint of recurrent models. Such models read one token (or a small number of tokens) at a
time, update the recurrent state using these inputs, and then discard them before the next token
arrives. The output at each step is generated based only on the recurrent state, rather than on all the
previous states. The memory footprint is determined by the recurrent state which has a fixed size.
Recurrent models can be used in real-time learning scenarios where data arrives in a stream and
predictions can be made at any time step. In NLP, applying recurrent models to language mod-
eling is one of the earliest successful attempts to learn representations of sequences. Although
Transformer has been used as the foundational architecture in LLMs, recurrent models are still
powerful models, especially for developing efficient LLMs. More recently, recurrent models have
started their resurgence in language modeling and have been reconsidered as a promising alterna-
tive to Transformers [Gu and Dao, 2023]. Figure 2.5 shows a comparison of the models discussed
in this subsection.
10 In the new space after this transformation, the Softmax normalization can be transformed into the simple scaling
normalization.
Fig. 2.5: Illustrations of self-attention, sparse attention, linear attention and recurrent models. Blue boxes = cached
states for producing the output at position i. f (·) = a recurrent cell.
LLMs based on the standard Transformer architecture are global models. The inference for these
models involves storing the entire left-context in order to make predictions for future tokens. This
requires a KV cache where the representations (i.e., keys and values) of all previously-generated
tokens are kept, and the cost of caching grows as the inference proceeds. Above, we have dis-
cussed methods for optimizing this cache via efficient attention approaches, such as sparse atten-
tion and linear attention. Another idea, which may have overlap with the previous discussion, is
to explicitly encode the context via an additional memory model.
A straightforward approach is to represent the keys and values using a fixed-size memory model.
Suppose we have a memory Mem which retains the contextual information. We can write the
attention operation at position i in a general form

    Att(q_i, Mem)    (2.52)

In this model, Mem is simply the KV cache, i.e., Mem = (K_{\le i}, V_{\le i}). Thus the size of
Mem is determined by i. If we define Mem as a fixed-size variable, then the cost of performing
Att(q_i, Mem) will be fixed. There are several alternative ways to design Mem.
• One of the simplest methods is to consider a fixed-size window of previous keys and values.
Mem is therefore given by

    Mem = (K_{[i-n_c+1, i]}, V_{[i-n_c+1, i]})    (2.53)

where n_c denotes the size of the window. The notation K_{[i-n_c+1,i]} and V_{[i-n_c+1,i]} denote
the keys and values over positions from i − n_c + 1 to i.11 This model can be seen as a type
of local attention model.
• It is also possible to define Mem as a pair of summary vectors, which leads to a more
compressed representation of the history. A simple way to summarize the previous keys
and values is to use their moving average. For example, Mem can be defined as the
unweighted moving average of the previous n_c keys and values

    Mem = \Big( \frac{\sum_{j=i-n_c+1}^{i} k_j}{n_c}, \ \frac{\sum_{j=i-n_c+1}^{i} v_j}{n_c} \Big)    (2.54)
A weighted version of this moving average is given by

    Mem = \Big( \sum_{j=1}^{n_c} \beta_j k_{i-n_c+j}, \ \sum_{j=1}^{n_c} \beta_j v_{i-n_c+j} \Big)    (2.55)

Here \{\beta_1, ..., \beta_{n_c}\} are the coefficients, which can be either learned as model parameters
or determined via heuristics. For example, they can be set to increasing values (i.e.,
\beta_1 < \beta_2 < ... < \beta_{n_c-1} < \beta_{n_c}) in order to give larger weight to positions that are closer to i.

11 More formally, we write K_{[i-n_c+1,i]} = \begin{bmatrix} k_{i-n_c+1} \\ \vdots \\ k_i \end{bmatrix} and V_{[i-n_c+1,i]} = \begin{bmatrix} v_{i-n_c+1} \\ \vdots \\ v_i \end{bmatrix}. Sometimes we denote
K_{[i-n_c+1,i]} by \{k_{i-n_c+1}, ..., k_i\} and V_{[i-n_c+1,i]} by \{v_{i-n_c+1}, ..., v_i\} for notational simplicity.

We can extend the moving average to include all the positions up to i. This leads to the
cumulative average of the keys and values, given in the form

    Mem = \Big( \frac{\sum_{j=0}^{i} k_j}{i+1}, \ \frac{\sum_{j=0}^{i} v_j}{i+1} \Big)    (2.56)

This cumulative average can be computed incrementally as

    Mem_i = \frac{(k_i, v_i) + i \cdot Mem_{i-1}}{i+1}    (2.57)
where Memi and Memi−1 denote the cumulative averages of the current and previous po-
sitions, respectively. An advantage of this model is that we only need to store a single
key-value pair during inference, rather than storing all the key-value pairs. Note that the
above memory models are related to recurrent models, and more advanced techniques have
been used to develop alternatives to self-attention mechanisms in Transformers [Ma et al.,
2023].
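As a small illustration of such fixed-size memories, the sketch below maintains the cumulative average of keys and values incrementally, following Eqs. (2.56)-(2.57); the class and method names are only for exposition.

```python
import numpy as np

class CumulativeAverageMemory:
    """Fixed-size memory storing the cumulative average of keys and values.
    Only one key-value pair is kept, regardless of the sequence length."""
    def __init__(self, d):
        self.k_avg = np.zeros(d)
        self.v_avg = np.zeros(d)
        self.i = -1                      # last position seen

    def update(self, k_i, v_i):
        self.i += 1
        # Mem_i = ((k_i, v_i) + i * Mem_{i-1}) / (i + 1)
        self.k_avg = (k_i + self.i * self.k_avg) / (self.i + 1)
        self.v_avg = (v_i + self.i * self.v_avg) / (self.i + 1)
        return self.k_avg, self.v_avg
```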
• The memory Mem can also be a neural network. At each step, it takes both the previous
output of the memory and the current states of the model as input, and produces the new
output of the memory. This neural network can be formulated as the function

    Mem = Update(S_{kv}, Mem_{pre})    (2.58)
Here Mem and Mempre represent the outputs of the memory at the current step and the
previous step, respectively. Skv is a set of key-value pairs, representing the recent states of
the model. This formulation is general and allows us to develop various memory models by
selecting different Update(·) and S_{kv} configurations. For example, if S_{kv} only contains the
latest key-value pair (k_i, v_i) and Update(·) is defined as a recurrent cell, then Eq. (2.58)
can be expressed as an RNN-like model

    Mem = f(Mem_{pre}, (k_i, v_i))    (2.59)
where f(·) is a recurrent cell. Recurrence can also be applied to segment-level modeling
for efficiency considerations. A simple approach is to divide the sequence into
segments, and treat S_{kv} as a segment. Applying recurrent models to Update(·) will result in
memory models that operate on segments. A special example is to define Update(·) as
an FIFO function that adds S_{kv} into the memory and removes the oldest key-value segment
from the memory, given by

    Mem = FIFO(S_{kv}, Mem_{pre})    (2.60)
Consider a memory which includes two segments, one for the current segment and one for the
previous segment. In the attention operation, each position can access the historical key-value
pairs in the two closest consecutive segments. This essentially defines a local memory, but it
and its variants have been widely used in segment-level recurrent models [Dai et al., 2019;
Hutchins et al., 2022; Bulatov et al., 2022].
• The above memory models can be extended to involve multiple memories. An example
of this approach is the compressive Transformer [Rae et al., 2019]. It employs two distinct
fixed-size memories: one for modeling local context (denoted by Mem), and the other for
modeling and compressing long-term history (denoted by CMem). The KV cache in this
model is the combination of Mem and CMem. The attention function can be written as

    Att(q_i, [Mem, CMem])    (2.61)
where [Mem, CMem] is a combined memory of Mem and CMem. As with other segment-
level models, the compressive Transformer model operates on segments of the sequence.
Each segment is a sequence of n_s consecutive tokens, and we denote S_{kv}^{k} as the key-value
pairs corresponding to the tokens of the k-th segment. When a new segment arrives, Mem
is updated in an FIFO fashion: we append the n_s key-value pairs in S_{kv}^{k} to Mem, and then
pop the n_s oldest key-value pairs from Mem, which is given by

    Mem = FIFO(S_{kv}^{k}, Mem_{pre})    (2.62)
The popped key-value pairs are then used to update the compressive memory CMem. These
n_s key-value pairs are compressed into n_{cs} key-value pairs (denoted by C_{kv}^{k}) via a compression
network. CMem is an FIFO which appends the compressed n_{cs} key-value pairs to the tail of the
queue, and drops the first n_{cs} key-value pairs of the queue. It is given by

    CMem = FIFO(C_{kv}^{k}, CMem_{pre})    (2.63)
• We have already seen that both global and local contexts are useful and can be mod-
eled using attention models. This view motivates the extension of attention models to
combine both local and long-term memories [Ainslie et al., 2020; Zaheer et al., 2020;
Gupta and Berant, 2020]. A simple but widely-used approach is to involve the first few to-
kens of the sequence in attention, serving as global tokens. This approach is usually applied
along with other sparse attention models. An advantage of incorporating global tokens of
the sequence is that it helps smooth the output distribution of the Softmax function used in
attention weight computation, and thus stabilizes model performance when the context size
is very large [Xiao et al., 2024]. One drawback, however, is that using a fixed-size global
memory may result in information loss. When dealing with long sequences, we need to
enlarge the KV cache for sufficient representations of the context, but this in turn increases
the computational cost.
Figure 2.6 shows illustrations of the above approaches. Note that, while we focus on optimiza-
tion of the KV cache here, this issue is closely related to those discussed in the previous section.
All of the methods we have mentioned so far can broadly be categorized as efficient attention
approaches, which are widely used in various Transformer variants.
[Figure 2.6 shows four fixed-size KV cache designs: (a) Window-based Cache, (b) Moving Average-based Cache, (c) Recurrent Network as Cache, and (d) Hybrid Cache (Compressed Memory + Local Memory).]
Fig. 2.6: Illustrations of fixed-size KV caches in LLMs. Blue boxes represent the keys and values generated during
LLM inference, green boxes represent the keys and values stored or encoded in the primary memory, and orange boxes
represent the keys and values stored or encoded in the compressed memory.
The modeling of memories discussed above was based on updates to the KV cache, and the re-
sulting models are typically referred to as internal memories. We now consider another family
of models, called external memories, which operate as independent models to access large-scale
contexts for LLMs. Many such models are based on memory-based methods which have been
extensively discussed in machine learning [Bishop, 2006]. A common example is nearest neigh-
bor algorithms: we store context representations in a datastore, and try to find the most similar
stored representations to match a given query. The retrieved context representations are then used
to improve attention for this query.
Here, we consider the k-nearest neighbors (k-NN) method which is one of the most popular
memory-based methods. Since our focus is language modeling in this section, we define a sample
in the datastore as a key-value pair corresponding to some context state. Note that “context” is a
broad concept here, not just a sequence prex in text generation. One might, for example, view
the entire dataset as the context for predicting tokens. This allows us to retrieve the closest context
situation in a set of sequences, rather than a given sequence prefix. Although we will restrict
ourselves to context modeling for a single sequence, in this subsection we discuss a relatively
more general case.
Suppose we have a set of keys {kj } with corresponding values {vj }, and suppose we store
these key-value pairs in a vector database12. For each query q_i, we find its k nearest neighbors by
growing the radius of a sphere centered at q_i until it contains k data points in {k_j}. This results
in a set of k keys along with their corresponding values, denoted by Memknn . As before, we
denote Mem as the local memory for the query, such as the KV cache of neighboring tokens. Our
goal is to attend query qi to both the local memory Mem and the long-term memory Memknn .
There are, of course, several ways to incorporate Mem and Memknn into the attention model.
For example, we might simply combine them to form a single KV cache [Mem, Memknn ], and
attend qi to [Mem, Memknn ] via standard QKV attention. Or we might use Mem and Memknn
in separate attention steps. An example of such approaches is the model developed by Wu et al.
[2021]. It linearly combines the two types of attention, given by

    Att(q_i) = g \odot Att(q_i, Mem) + (1 - g) \odot Att(q_i, Mem_{knn})

Here g \in R^d is the coefficient vector, which can be the output of a learned gate.
Given the k-NN-based memory model described above, the remaining task is to determine
which key-value pairs are retained in the datastore. For standard language modeling tasks, we
consider the previously seen tokens in a sequence as the context, so we can add the keys and
values of all these tokens into the datastore. In this case, the resulting k-NN-based attention
model is essentially equivalent to a sparse attention model [Gupta et al., 2021].
Alternatively, we can extend the context from one sequence to a collection of sequences.
For example, we might collect all key-value pairs across the sequences in a training dataset and
add them to the datastore to model a larger context. Thus, LLMs can predict tokens based on a
12 A vector database, or vector store, is a database that provides highly optimized retrieval interfaces for finding stored
vectors that closely match a query vector.
generalized context. A problem with this approach is that the computational cost would be large
if many sequences are involved. Since these sequences are part of our training data, we can build
and optimize an index for the vectors in the datastore before running the LLMs. As a result, the
retrieval of similar vectors can be very efficient, as in most vector databases.
In fact, all the above-mentioned methods can be viewed as instances of a retrieval-based ap-
proach. Instead of using retrieval results to improve attention, we can apply this approach in other
ways as well. One application of k-NN-based search is k-NN language modeling (or k-NN LM)
[Khandelwal et al., 2020]. The idea rests on the observation that similar hidden states in Trans-
formers are often highly predictive of similar tokens at subsequent positions; rather than extending
the context used in self-attention by incorporating nearest neighbors in representation learning,
k-NN LM applies retrieval directly to token prediction.
In k-NN LM, each item in the datastore is a key-value tuple (z, w), where z represents a hidden
state of the LLM at a position, and w represents the corresponding prediction. A typical way to
create the datastore is to collect the output vector of the Transformer layer stack and the corre-
sponding next token for each position of each sequence in a training dataset. During inference,
we have a representation h_i given a prefix. Given this representation, we first search the datastore
for k closest matching data items {(z1 , w1 ), ..., (zk , wk )}. Here {w1 , ..., wk } are thought of as
reference tokens for prediction, and thus can be used to guide the token prediction based on hi .
One common way to make use of reference tokens is to define a distribution over the vocabulary V,

    Pr_{knn}(\cdot|h_i) = \mathrm{Softmax}([-d_1 \ \cdots \ -d_{|V|}])    (2.67)

where d_v equals the distance between h_i and z_j if w_j equals the v-th entry of V, and equals 0
otherwise. We then use a linear function with a coefficient \lambda that interpolates between the retrieval-
based distribution Pr_{knn}(\cdot|h_i) and the LLM output distribution Pr_{lm}(\cdot|h_i)

    Pr(\cdot|h_i) = \lambda \cdot Pr_{knn}(\cdot|h_i) + (1 - \lambda) \cdot Pr_{lm}(\cdot|h_i)    (2.68)

Then, as usual, we can choose the next token y by maximizing the probability Pr(y|h_i).
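The following sketch illustrates this procedure. For simplicity it aggregates the softmax-normalized weights of the k retrieved items per reference token rather than materializing the full per-vocabulary distance vector of Eq. (2.67); all names and the interpolation weight are illustrative.

```python
import numpy as np

def knn_lm_distribution(h_i, datastore_keys, datastore_next_tokens,
                        p_lm, k=8, lam=0.25, vocab_size=32000):
    """Minimal sketch of k-NN language modeling: retrieve the k nearest stored
    hidden states, turn their distances into a distribution over reference
    tokens, and interpolate with the LLM output distribution (Eq. 2.68)."""
    dists = np.linalg.norm(datastore_keys - h_i, axis=1)
    nn = np.argsort(dists)[:k]                       # indices of the k nearest items
    weights = np.exp(-dists[nn])
    weights /= weights.sum()
    p_knn = np.zeros(vocab_size)
    for w, tok in zip(weights, datastore_next_tokens[nn]):
        p_knn[tok] += w                              # aggregate weight per reference token
    return lam * p_knn + (1.0 - lam) * p_lm          # interpolated next-token distribution
```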
As with information retrieval (IR) systems, the datastore can also manage texts and provide
access to relevant texts for a query. For example, we can store a collection of text documents
in a search engine with full-text indexing, and then search it for documents that match a given
text-based query. Applying IR techniques to LLMs leads to a general framework called retrieval-
augmented generation (RAG). The RAG framework works as follows. We use the context x as
the query and find the k most relevant document pieces {c_1, ..., c_k} from the datastore via efficient
IR techniques13. These search results are then combined with the original context via a prompting
template, yielding a new context x'.

13 In practical applications, queries are typically generated using a query generation system, which may expand the
query with variations of tokens and query intent.
Then, we use x′ as the context and predict the following text using the model Pr(y|x′ ). One
advantage of RAG is that we need not modify the architecture of LLMs, but instead augment the
input to LLMs via an additional IR system. Figure 2.7 shows a comparison of the use of different
external memories in LLMs.
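A minimal sketch of this RAG-style input construction is shown below; the prompting template is hypothetical, since the exact template is system-specific.

```python
def rag_prompt(context_x, retrieved_chunks):
    """Minimal sketch of retrieval-augmented generation: the retrieved document
    pieces {c_1, ..., c_k} are concatenated with the original context x via a
    simple (hypothetical) prompting template, producing the augmented context x'
    that is then fed to the LLM."""
    evidence = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(retrieved_chunks))
    return (
        "Answer based on the following retrieved passages.\n"
        f"{evidence}\n\n"
        f"Context: {context_x}\n"
    )

# x_prime = rag_prompt(x, [c1, ..., ck]);  then generate y from Pr(y | x_prime)
```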
A memory model in LLMs, in the form of a simple key-value cache or a datastore, can broadly
be seen as an encoder of contextual information. Ideally, before we say that a memory model
is representative of the entire context in token prediction, we need to make sure that the model
can accurately represent any part of the context. The standard KV cache is one such model that
completely stores all past history. In this case, the model is said to have adequate capacity for
memorizing the context. In many practical applications, however, complete memorization is not
required. Instead, the goal is to enable LLMs to access important contextual information. As a
result, efficient and compressed memory models are developed, as described in this section. Note
that, the longer the sequence, the more difficult it becomes for a low-capacity memory model to
capture important contextual information. It is therefore common practice to simply increase the
model capacity when processing long contexts.
While high-capacity models are generally favorable, they are difficult to train and deploy. A
challenging scenario is that the tokens arrive in a stream and the context continuously grows.
Developing LLMs for such tasks is difficult as we need to train Transformers on extremely long
sequences. A possible way to address this difficulty is to use non-parametric methods, such as
retrieval-based methods. For example, as discussed above, we can use a vector database to store
previously generated key-value pairs, and thus represent the context by this external memory
model. Although this approach side-steps the challenge of representing long context in Trans-
formers, building and updating external memory models are computationally expensive. These
models are more often used in problems where the context is given in advance and fixed during
inference, and are hence unsuitable for streaming context modeling.
In cases where the size of the context continuously grows, applying fixed-size memory models
is a commonly used approach. For example, in recurrent models, a sequence of arbitrary length
can be summarized into a set of hidden states, by which we have a fixed computational cost per
step. While recurrent models were initially found to be not very good at handling long-distance
dependencies in sequence modeling in early applications of deep learning to NLP, recent advance-
ments have shown that their variants are now effective in modeling extremely long sequences
[Bulatov et al., 2022; Hutchins et al., 2022; Munkhdalai et al., 2024; Ma et al., 2024].
Fig. 2.7: Illustrations of external memories (or datastores) for language modeling.
There is no general definition of memory capacity in LLMs. A simple approach might consider
how much storage is used to retain contextual information. For example, memory capacity could
be defined by the size of the KV cache in Transformers or the vector database used in retrieval-
based methods. A related concept is model complexity. In machine learning, there are several
ways to define the complexity of a model. One of the simplest methods is by counting the
number of parameters. However, it should be emphasized that the memory models discussed here
primarily serve to store information, rather than add trainable parameters. Therefore, a model with
a large memory capacity is not necessarily more complex. Nevertheless, in practice, determining
the capacity of a memory model is not straightforward. In general, we need to balance maximizing
the performance against controlling the memory footprint.
In Transformers, the KV cache is a data structure that can be dynamically adjusted along multiple
dimensions, such as heads, layers, and sequence length. For example, consider an LLM with L
layers. Each layer has τ attention heads, and each head produces a dh -dimensional output. During
inference, we store the keys and values for up to m tokens. The space complexity of this caching
mechanism is O(L · τ · dh · m). As we have seen previously, this complexity can be reduced by
caching the keys and values for fewer tokens. For example, in sliding window attention, a fixed-
size window is used to cache the keys and values in local context. And this model has a space
complexity of O(L · τ · d_h · m_w), with m_w being the size of the window.
In addition to reducing m, we can also decrease the size of the KV cache along other di-
mensions. A widely-used approach is to enable sharing across heads in multi-head self-attention.
Recall from Section 2.1.1 that multi-head self-attention uses multiple sets of queries, keys, and
values (each set is called a head), each performing the QKV attention mechanism as usual. This
can be expressed as

    Att_{head}(q_i, K_{\le i}, V_{\le i}) = Merge(head_1, ..., head_\tau) W_{head}    (2.70)

where head_j \in R^{d_h} is computed using the standard QKV attention function

    head_j = Att_{qkv}(q_i^{[j]}, K_{\le i}^{[j]}, V_{\le i}^{[j]})    (2.71)
A widely-used way to reduce the cache size is multi-query attention (MQA), in which all heads
share a single set of keys and values. Figure 2.8 (c) illustrates this model. By sharing keys and
values, the size of the KV cache would be O(L · d_h · m).

Fig. 2.8: Illustration of QKV attention based on different multi-head and sharing mechanisms. (a) = single-head
attention, and (b-e) = attention with multiple heads.
Grouped query attention (GQA) is a natural extension to multi-head attention and MQA
[Ainslie et al., 2023]. In GQA, heads are divided into ng groups, each corresponding to a shared
set of keys and values. Hence we have n_g sets of keys and values \{(K_{\le i}^{[1]}, V_{\le i}^{[1]}), ..., (K_{\le i}^{[n_g]}, V_{\le i}^{[n_g]})\}.
See Figure 2.8 (d) for an illustration. Let g(j) be the group id for the j-th head. The GQA model
can be expressed as

    head_j = Att_{qkv}(q_i^{[j]}, K_{\le i}^{[g(j)]}, V_{\le i}^{[g(j)]})    (2.73)
The size of the KV cache of GQA is O(L · n_g · d_h · m). One benefit of GQA is that we can trade off
between computational efficiency and model expressiveness by adjusting n_g. When n_g = τ, the
model becomes the standard multi-head attention model. By contrast, when n_g = 1, it becomes
the MQA model.
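To see how these sharing schemes affect memory, the small sketch below computes the KV cache size for multi-head attention, GQA, and MQA under the complexity analysis above; the example model configuration is hypothetical.

```python
def kv_cache_bytes(L, n_kv_heads, d_h, m, bytes_per_elem=2):
    """Size of the KV cache (keys + values) in bytes: proportional to
    L * n_kv_heads * d_h * m. n_kv_heads is tau for standard multi-head
    attention, n_g for GQA, and 1 for MQA."""
    return 2 * L * n_kv_heads * d_h * m * bytes_per_elem

# Hypothetical 32-layer model, 32 heads of size 128, FP16 cache, 32K-token context:
print(kv_cache_bytes(32, 32, 128, 32768) / 2**30, "GiB")   # full multi-head attention
print(kv_cache_bytes(32, 8, 128, 32768) / 2**30, "GiB")    # GQA with n_g = 8
print(kv_cache_bytes(32, 1, 128, 32768) / 2**30, "GiB")    # MQA
```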
Since Transformer layers are order-insensitive to input, we need some way to encode positional
information in the input tokens. To do this, it is common to add positional embeddings to token
embeddings, and then feed these combined embeddings into the Transformer layer stack as input.
In this case, the embedding at position i can be expressed as
ei = xi + PE(i) (2.74)
where xi ∈ Rd denotes the token embedding, and PE(i) ∈ Rd denotes the positional embedding.
In general, the token embedding xi is a position-independent vector, and so the positional embed-
ding PE(i) is used to encode the positional context. A straightforward approach is to treat PE(i)
as a learnable variable and train it alongside other model parameters. In this way, we can learn
a unique representation for each position, and thus distinguish the tokens appearing at different
positions of a sequence.
Representations of positions using learned vectors can work well in tasks where the sequences
at training and test times are of similar lengths. In practice, however, we often impose length
restrictions on sequences during training to prevent excessive computational costs, but wish to
apply the trained models to much longer sequences during inference. In this case, using learned
positional embeddings has obvious drawbacks, as there are no trained embeddings for positions
that are not observed in the training phase.
An alternative approach to modeling positional information is to develop positional embed-
dings that can generalize: once trained, the embedding model can be used to handle longer se-
quences. Suppose that we train a positional embedding model on sequences with a maximum
length of ml , and we wish to apply the trained model to a sequence of length m (m >> ml ). If
the embedding model is limited in the range of positions that we can observe from training data,
then this model will simply fail to deal with new data outside that range. See Figure 2.9 (a) for
an illustration where the learned embedding model cannot model data points outside the training
domain if it lacks the ability to extrapolate.
There are several approaches to making positional embedding models generalize. They can
be grouped into two classes.
• Extrapolation. The model learned on observed data points (i.e., positions) can be directly
employed to assign meaningful values to data points beyond the original range. For ex-
ample, suppose we have a series of numbers 1, 2, ..., 10, and we want to understand the
meaning of a new number, 15. Knowing that these numbers are natural numbers used for
ordering, we can easily infer that 15 is a number that follows 10, even though 15 has not
[Figure 2.9 plots an encoded value against sequence length (0 to 2,048) for three cases: (a) Encoding with No Generalization, (b) Extrapolation, and (c) Interpolation.]
Fig. 2.9: Illustrations of different positional embedding methods for a range of positions. Blue points represent the
positions that have been observed during training, and red points represent the positions that are newly observed at test
time. In sub-figure (a), the encoding model only memorizes the points seen during training, and cannot generalize. In
sub-figures (b) and (c), the model can generalize through extrapolation and interpolation.
been observed before. Figure 2.9 (b) shows an example of this approach, where a function
is learned to fit the data points within a specific range and then applied to estimate the values
of data points outside that range.
• Interpolation. This approach maps a larger range of data points into the original obser-
vation range. For example, suppose we have a model designed for numbers in the range
[1, 10]. When given a new range of [1, 20], we can scale this down by dividing every num-
ber by 2, thereby tting all numbers into [1, 10]. This scaling allows us to use the model
trained on the range [1, 10] to describe data points in the expanded range of [1, 20]. See
Figure 2.9 (c) for an illustration of this approach.
In fact, positional embeddings in many systems have achieved some level of generalization.
For example, sinusoidal encoding, the most common positional embedding method, employs sine
and cosine functions that can naturally extend to sequences of any length. Although this approach
might seem direct and simple, it does not perform well when we significantly extend the sequences
for processing. In this subsection, we will discuss several alternative methods based on either
extrapolation or interpolation.
One problem with Eq. (2.74) is that the embedding model treats each token independently and
therefore ignores the distance between different tokens. A common improvement to this model,
called relative positional embedding, is to consider the pairwise relationship between tokens
[Shaw et al., 2018]. The general idea behind this is to obtain the offset between any pair of posi-
tions and incorporate it into the self-attention model. One of the simplest forms of self-attention
with relative positional embedding is given by
    Att_{qkv}(q_i, K_{\le i}, V_{\le i}) = \sum_{j=0}^{i} \alpha(i, j) v_j    (2.75)

    \alpha(i, j) = \mathrm{Softmax}\Big(\frac{q_i k_j^{T} + \mathrm{PE}(i, j)}{\sqrt{d}} + \mathrm{Mask}(i, j)\Big)    (2.76)
The only difference between this model and the original self-attention model is that a bias term
PE(i, j) is added to the query-key product in this new model. Intuitively, PE(i, j) can be inter-
preted as a distance penalty for the pair of positions i and j. As i moves away from j, the value of
PE(i, j) decreases.
PE(i, j) can be defined in several different ways. Here, we consider the T5 version of relative
positional embedding, called the T5 bias [Raffel et al., 2020]. For each pair of query q_i and key
k_j, the offset between them is defined to be15

    d(i, j) = i - j    (2.77)
A simple design for the bias PE(i, j) is to share the same learnable variable for all query-key
pairs with the same offset, i.e., PE(i, j) = ui−j , where ui−j is the variable corresponding to
the offset i − j. However, simply assigning a unique value to each offset will restrict this model
to observed offsets. When i − j is larger than the maximum trained offset, the model cannot
generalize.
The T5 bias instead adopts a generalization of this model. Rather than assigning each query-
key offset a unique bias term, it groups different offsets into “buckets”, each corresponding to
one learnable parameter. More specifically, the bias terms for n_b + 1 buckets are given as follows.
• For buckets 0 to \frac{n_b+1}{2} - 1, each bucket corresponds to one offset, that is, bucket 0 ↔ offset
0, bucket 1 ↔ offset 1, bucket 2 ↔ offset 2, and so on. We express this as b(i - j) = i - j.
• For buckets \frac{n_b+1}{2} to n_b, the size of each bucket increases logarithmically. For example, the
bucket number for a given offset i - j \ge \frac{n_b+1}{2} can be defined as

    b(i - j) = \frac{n_b+1}{2} + \Big\lfloor \frac{\log(i-j) - \log(\frac{n_b+1}{2})}{\log(dist_{max}) - \log(\frac{n_b+1}{2})} \cdot \frac{n_b+1}{2} \Big\rfloor    (2.78)

where the parameter dist_{max} is typically set to a relatively large number, indicating the
maximum offset that is distinguished explicitly.

15 For language modeling, a query is only allowed to attend to its left-context, and so we have i - j \ge 0. In the more
general case of self-attention, where a token can attend to all tokens in the sequence, we may have negative offsets
when i < j.
Fig. 2.10: Illustration of distributing query-key offsets (i − j) into buckets in the T5 model (n_b = 32 and dist_max = 1024).
Boxes represent buckets. In the first half of the buckets, we use a fixed bucket size. In the second half of the buckets,
we increase the bucket size logarithmically. The last bucket contains all the query-key offsets that are not covered by
previous buckets.
• When i − j > distmax , we place i − j in the last bucket. In other words, bucket nb contains
all the offsets that are not assigned to the previous buckets.
    b(i - j) = \begin{cases} i - j & 0 \le i - j < \frac{n_b+1}{2} \\ \min\Big(n_b, \ \frac{n_b+1}{2} + \Big\lfloor \frac{\log(i-j) - \log(\frac{n_b+1}{2})}{\log(dist_{max}) - \log(\frac{n_b+1}{2})} \cdot \frac{n_b+1}{2} \Big\rfloor \Big) & i - j \ge \frac{n_b+1}{2} \end{cases}    (2.79)
Figure 2.10 shows an illustration of these buckets. We see that in the first half of the buckets,
each bucket is associated with only one value of i − j, while in the second half, the bucket size
increases as i − j grows. The last bucket is designed to handle sequences of arbitrarily long
lengths.
All PE(i, j)s in a bucket share the same bias term u_{b(i-j)}. Substituting PE(i, j) = u_{b(i-j)}
into Eq. (2.76), the attention weight for q_i and k_j becomes16

    \alpha(i, j) = \mathrm{Softmax}\Big(\frac{q_i k_j^{T} + u_{b(i-j)}}{\sqrt{d}} + \mathrm{Mask}(i, j)\Big)    (2.81)
The parameters \{u_0, ..., u_{n_b}\} are learned as common parameters during training. It should
be emphasized that this model can generalize to long sequences. This is because PE(i, j)s with
similar query-key offsets share the same parameter, and this sharing strategy is particularly im-
portant for achieving good generalization, given that large query-key offsets are rare in training.
In practice, we often set n_b to a moderate number, and thus it can help control the overfitting of
positional embedding models.
16 Note that, in Raffel et al. [2020]'s T5 model, the rescaling operation for the query-key product is removed. The
attention weight \alpha(i, j) is then given by \alpha(i, j) = \mathrm{Softmax}\big(q_i k_j^{T} + u_{b(i-j)} + \mathrm{Mask}(i, j)\big)    (2.80)
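A minimal sketch of the bucketing function in Eq. (2.79), assuming non-negative offsets as in language modeling, is given below; the default values follow the n_b = 32 and dist_max = 1024 setting shown in Figure 2.10, with the first half of the buckets handled by integer division.

```python
import math

def t5_bucket(offset, n_buckets=32, dist_max=1024):
    """Sketch of T5-style offset bucketing for offsets i - j >= 0: the first
    half of the buckets maps one offset per bucket; the second half grows
    logarithmically; offsets beyond dist_max fall into the last bucket."""
    half = (n_buckets + 1) // 2
    if offset < half:
        return offset
    log_ratio = (math.log(offset) - math.log(half)) / (math.log(dist_max) - math.log(half))
    return min(n_buckets, half + int(log_ratio * half))

# Nearby offsets get their own bucket; distant offsets share increasingly wide buckets.
print([t5_bucket(o) for o in [0, 1, 15, 16, 20, 21, 100, 1000, 5000]])
```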
Relative positional embedding models are based on a set of learned biases for the query-key prod-
uct in self-attention. An alternative approach is to give these biases fixed values via heuristics,
rather than training them on a particular dataset. One benefit of this heuristics-based approach is
that it does not rely on a training process and thus can be directly applied to any sequences once
the biases are set.
One example of such an approach is Press et al. [2022]’s approach, called attention with
linear biases, or ALiBi for short. In the ALiBi approach, the bias term is defined as the negative
scaled query-key offset

    PE(i, j) = -\beta \cdot (i - j) = \beta \cdot (j - i)    (2.82)
where β is the scaling factor. Adding this term to the query-key product, we obtain a new form of
attention weights
    \alpha(i, j) = \mathrm{Softmax}\Big(\frac{q_i k_j^{T} + \beta \cdot (j - i)}{\sqrt{d}} + \mathrm{Mask}(i, j)\Big)    (2.83)
This model can be interpreted as adding a fixed penalty to q_i k_j^{T} whenever j moves one step
away from i. So we do not need to adapt it to a range of sequence lengths, and can employ it to
model arbitrarily long sequences. See Figure 2.11 for a comparison of the T5 bias and the ALiBi
bias.
In general, the scalar β should be tuned on a validation dataset. However, Press et al. [2022]
found that setting β to values decreasing geometrically by a factor of \frac{1}{2^a} for multi-head attention
performs well on a variety of tasks. Specifically, for a self-attention sub-layer involving n_{head}
heads, the scalar for the k-th head is given by

    \beta_k = \frac{1}{2^{8k/n_{head}}}    (2.84)
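The sketch below constructs the ALiBi bias matrices for all heads, following Eqs. (2.82)-(2.84); masking future positions with −∞ plays the role of Mask(i, j), and the head count is a placeholder.

```python
import numpy as np

def alibi_bias(seq_len, n_head=8):
    """Sketch of ALiBi biases: for each head k, a linear penalty beta_k * (j - i)
    is added to the query-key scores for allowed (j <= i) positions."""
    betas = np.array([1.0 / 2 ** (8.0 * k / n_head) for k in range(1, n_head + 1)])
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    bias = (j - i).astype(float)                  # negative (or zero) below the diagonal
    bias = np.where(j <= i, bias, -np.inf)        # causal mask for future positions
    return betas[:, None, None] * bias            # shape: (n_head, seq_len, seq_len)
```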
The ALiBi approach provides a simple form of relative positional embeddings. There are
other similar methods for designing query-key biases using the offset i − j. Table 2.4 shows a
comparison of such biases. As an aside it is worth noting that the form of the right-hand side
of Eq. (2.82) is very similar to length features used in conventional feature-based systems. For
example, in statistical machine translation systems, such features are widely used to model word
reordering problems, resulting in models that can generalize well across different translation tasks
[Koehn, 2010].
As with sinusoidal embeddings, rotary positional embeddings are based on hard-coded values for
all dimensions of an embedding [Su et al., 2024]. Recall that in the sinusoidal embedding model,
positions are represented as combinations of sine and cosine functions with different frequencies.
These embeddings are then added to token embeddings to form the inputs to the Transformer
layer stack.

Fig. 2.11: Query-key products with biases (above = the T5 bias and below = the ALiBi bias). The color scale of the
biases ranges from light blue denoting small absolute values to deep blue denoting large absolute values.

Rotary positional embeddings instead model positional context as rotations to token
embeddings in a complex space. This leads to a model expressed in the form of multiplicative
embeddings
ei = xi R(i) (2.85)
where R(i) ∈ Rd×d is the rotation matrix representing the rotations performed on the token
embedding xi ∈ Rd .
For simplicity, we will first consider embeddings with only two dimensions and return to a
discussion of the more general formulation later. Suppose we have a 2-dimensional token embed-
ding x = [x_1 \ x_2]. We can represent it as a vector in a plane, originating at the origin (0, 0)
and terminating at (x_1, x_2). A counterclockwise rotation of this vector refers to an operation of
moving the vector around the origin while maintaining its magnitude, as shown in Figure 2.12 (a).
The degree of rotation is usually defined by a specific angle, denoted by θ. The rotation can be
expressed mathematically in the form

    Ro(x, \theta) = x R_\theta = [x_1 \ x_2] \begin{bmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{bmatrix} = [\cos\theta \cdot x_1 - \sin\theta \cdot x_2 \quad \sin\theta \cdot x_1 + \cos\theta \cdot x_2]    (2.86)

where R_\theta = \begin{bmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{bmatrix} is the rotation matrix. If two or more rotations are performed on the
same vector, we can rotate the vector further. This follows from the fact that the composition of
successive rotations is itself a rotation. More formally, rotating a vector by an angle θ for t times
can be expressed as

    Ro(x, t\theta) = x R_{\theta}^{t} = x R_{t\theta}    (2.87)
Fig. 2.12: Illustrations of vector rotations in a plane. Sub-figures (a) and (b) show rotations of a vector in a single
step and multiple steps, respectively. Sub-figure (c) shows the embeddings of the tokens cat and sleeping in two different
sentences. We show these sentences with a subscript affixed to each token to indicate its position. If we represent
tokens as vectors, we can add positional information by rotating these vectors. This rotation preserves the “distances”
between the vectors. For example, given that the distance between cat and sleeping is the same in both sentences, the
angle between their embeddings also remains the same during rotation.
Viewing the 2-dimensional vector x as a complex number x' = x_1 + i x_2, the rotation can equivalently
be written as a multiplication in the complex plane

    x R_{t\theta} \ \rightarrow \ x' e^{it\theta} = (x_1 + i x_2)(\cos t\theta + i \sin t\theta) = \cos t\theta \cdot x_1 - \sin t\theta \cdot x_2 + i(\sin t\theta \cdot x_1 + \cos t\theta \cdot x_2)    (2.88)

Here we denote the token representation x' e^{it\theta} by C(x, t\theta). The inner product of the representa-
tions of the tokens at positions t and s can be written as

    \langle C(x, t\theta), C(y, s\theta) \rangle = (x' e^{it\theta}) \cdot \overline{y' e^{is\theta}} = x' \overline{y'} e^{i(t-s)\theta}    (2.89)

where \overline{y'} is the complex conjugate of y'. As can be seen, the result of this inner product involves
a term t - s, and so it can model the offset between the two tokens.
Now we go back to representations in the 2D Euclidean space. The dot-product of Ro(x, t\theta)
and Ro(y, s\theta) can be written as a function of (t - s)\theta

    Ro(x, t\theta) \cdot Ro(y, s\theta) = x R_{t\theta} R_{s\theta}^{T} y^{T} = x R_{(t-s)\theta} y^{T}    (2.90)
Given this result, if we consider Ro(x, tθ) and Ro(y, sθ) as the query and the key, then the self-
attention operation will implicitly involve the modeling of relative positional context.
This rotary positional embedding can be extended to multi-dimensional embeddings. For
a d-dimensional token embedding x = [x_1 \ x_2 \ ... \ x_d], we can treat it as a \frac{d}{2}-dimensional
complex vector x' = [x'_1 \ x'_2 \ ... \ x'_{d/2}] = [x_1 + i x_2 \ \ x_3 + i x_4 \ \ ... \ \ x_{d-1} + i x_d], where
each consecutive pair of items forms a complex number. Then, the rotary positional embedding in
the complex space is given by

    C(x, t\theta) = \sum_{k=1}^{d/2} x'_k e^{it\theta_k} e_k    (2.91)

where e_k is the standard basis vector with a single non-zero value in the k-th coordinate and 0's
elsewhere [Biderman et al., 2021].
Although this formula involves a complicated expression, its equivalent form in the d-dimensional
Euclidean space is relatively easy to understand. We can write it as

    Ro(x, t\theta) = [x_1 \ x_2 \ ... \ x_d] \begin{bmatrix} R_{t\theta_1} & & & \\ & R_{t\theta_2} & & \\ & & \ddots & \\ & & & R_{t\theta_{d/2}} \end{bmatrix}    (2.92)

where R_{t\theta_k} = \begin{bmatrix} \cos t\theta_k & \sin t\theta_k \\ -\sin t\theta_k & \cos t\theta_k \end{bmatrix}. \theta = [\theta_1, ..., \theta_{d/2}] are the parameters for controlling the an-
gles of rotations in different dimensions. Typically, \theta_k is set to 10000^{-\frac{2(k-1)}{d}}, which is analogous
to the setting in sinusoidal embeddings.
In a practical implementation, Eq. (2.92) can be rewritten into a form that relies solely on the
element-wise product and addition of vectors:

    Ro(x, t\theta) = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_{d-1} \\ x_d \end{bmatrix}^{T} \odot \begin{bmatrix} \cos t\theta_1 \\ \cos t\theta_1 \\ \vdots \\ \cos t\theta_{d/2} \\ \cos t\theta_{d/2} \end{bmatrix}^{T} + \begin{bmatrix} -x_2 \\ x_1 \\ \vdots \\ -x_d \\ x_{d-1} \end{bmatrix}^{T} \odot \begin{bmatrix} \sin t\theta_1 \\ \sin t\theta_1 \\ \vdots \\ \sin t\theta_{d/2} \\ \sin t\theta_{d/2} \end{bmatrix}^{T}    (2.93)
Finally, we rewrite Eq. (2.85) to obtain the form of the embedding at position i

    e_i = Ro(x_i, i\theta)    (2.94)
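A minimal NumPy sketch of this embedding, following the element-wise form in Eq. (2.93), is given below; the final assertion checks the relative-position property of Eq. (2.90).

```python
import numpy as np

def rope(x, t, base=10000.0):
    """Sketch of rotary positional embedding: rotate each pair of dimensions
    (x_{2k-1}, x_{2k}) of a token embedding x by the angle t * theta_k."""
    d = x.shape[-1]
    theta = base ** (-2.0 * np.arange(d // 2) / d)       # theta_k = base^{-2(k-1)/d}
    cos = np.repeat(np.cos(t * theta), 2)                # [cos t*th1, cos t*th1, cos t*th2, ...]
    sin = np.repeat(np.sin(t * theta), 2)
    x_rot = np.empty_like(x)
    x_rot[0::2] = -x[1::2]                               # (-x2, x1, -x4, x3, ...)
    x_rot[1::2] = x[0::2]
    return x * cos + x_rot * sin

# The dot product depends only on the relative offset t - s:
q, k = np.random.default_rng(0).normal(size=(2, 8))
a = rope(q, t=7) @ rope(k, t=3)
b = rope(q, t=14) @ rope(k, t=10)
assert np.allclose(a, b)
```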
In position interpolation, our goal is to map the positions in the new sequence to match the ob-
served range in training. Suppose the sequence length for training ranges from 0 to ml . When
m > m_l at test time, we represent the positions in [0, m] such that our representations fit [0, m_l].
To illustrate, consider the rotary positional embedding model described above. The embedding
of each token is described by a model Ro(x_i, i\theta) in which \theta = [\theta_1, ..., \theta_{d/2}] are the parameters.
Ro(x_i, i\theta) can be cast in the form of a linear combination of two periodic functions (see Eq.
(2.93))

    \cos i\theta = [\cos i\theta_1 \ ... \ \cos i\theta_{d/2}]    (2.95)
    \sin i\theta = [\sin i\theta_1 \ ... \ \sin i\theta_{d/2}]    (2.96)

where each

    \theta_k = b^{-\frac{2(k-1)}{d}}    (2.97)

and b is the base. The period of \cos i\theta_k and \sin i\theta_k is

    T_k = 2\pi \cdot b^{\frac{2(k-1)}{d}}    (2.98)
The key idea behind position interpolation is to adjust this period so that the new positions can
be encoded within the range [0, m_l]. One way to achieve this is to scale up T_k by \frac{m}{m_l}, given by

    T'_k = \frac{m}{m_l} \cdot 2\pi \cdot b^{\frac{2(k-1)}{d}}    (2.99)
Hence all points in [0, m] are compressed into [0, m_l]. This linear scaling can be easily realized
by modifying the input to the embedding model [Chen et al., 2023c]. The new model with linear
positional interpolation is given by

    Ro'(x_i, i\theta) = Ro\Big(x_i, \frac{m_l}{m} i\theta\Big)    (2.100)
Another method of positional interpolation is to scale the base17 . Suppose that the base b is
scaled by λ. We wish the period of this new model in the last dimension of \theta (i.e., dimension \frac{d}{2})
to be equal to that of the linear positional interpolation model. This can be expressed as

    2\pi \cdot (\lambda b)^{\frac{2(\frac{d}{2}-1)}{d}} = \frac{m}{m_l} \cdot 2\pi \cdot b^{\frac{2(\frac{d}{2}-1)}{d}}    (2.101)

Solving for \lambda gives

    \lambda = \Big(\frac{m}{m_l}\Big)^{\frac{d}{2(\frac{d}{2}-1)}} = \Big(\frac{m}{m_l}\Big)^{\frac{d}{d-2}}    (2.102)

17 This method was first proposed in https://www.reddit.com/r/LocalLLaMA/comments/14lz7j5/
ntkaware_scaled_rope_allows_llama_models_to_have/
The embedding model with the scaled base is then given by

    Ro(x_i, i\theta')    (2.103)

where

    \theta' = [(\lambda b)^{-\frac{0}{d}}, (\lambda b)^{-\frac{2}{d}}, ..., (\lambda b)^{-\frac{d-2}{d}}]    (2.104)
Note that scaling the base provides a non-uniform method for scaling the periods across dif-
ferent dimensions of θ. This method has been found to be helpful for extending LLMs to longer
sequences, and several improvements have been developed [Peng et al., 2024; Ding et al., 2024].
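Under the same assumed settings as in the previous sketch, the following lines compute the scaling factor λ and the new parameters θ', and verify that the period in the last dimension matches that of linear interpolation:

import numpy as np

d, b = 64, 10000.0                               # embedding size and base (assumed)
m_l, m = 2048, 8192                              # trained and extended context lengths

lam = (m / m_l) ** (d / (d - 2))                 # Eq. (2.102)
theta_new = (lam * b) ** (-2.0 * np.arange(d // 2) / d)   # theta'_k, Eq. (2.104)

# In the last dimension (k = d/2), the period of the base-scaled model equals
# that of the linear interpolation model, which is the condition in Eq. (2.101).
k = d // 2
period_scaled = 2 * np.pi * (lam * b) ** (2 * (k - 1) / d)
period_linear = (m / m_l) * 2 * np.pi * b ** (2 * (k - 1) / d)
assert np.isclose(period_scaled, period_linear)

This also makes the non-uniform behavior noted above visible: dimensions with small k (high-frequency rotations) are barely rescaled, while the last dimension is rescaled by exactly m/m_l.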
2.3.6 Remarks
In this section, we have presented a variety of methods for long-context language modeling. We
close this section by discussing some interesting issues related to these methods.
One of the ultimate goals of long-context LLMs is to precisely encode infinite context. The so-called infinite context refers more to the fact that an LLM can continuously read words as they arrive than to a context of any fixed, bounded size. This motivates LLMs that can handle extremely long contexts or streaming data. As discussed in Section 2.3.3, it is common to use fixed-size memory models to process continuously expanding context. Many such systems are based on recurrent architectures or their variants, because these are inherently suited to modeling time-series problems in which the effects of past inputs continue indefinitely. Another way to achieve infinite memory is to develop alternatives to self-attention models. For example, one can use continuous-space attention models to encode context, which removes the dependency on context length [Martins et al., 2022].
When studying long-context LLMs, it is natural to wonder what mechanisms may explain the use of long context in language modeling. Can we compress the representation of infinite context into a relatively small-sized model? Are all context tokens useful for predicting next tokens? How do LLMs prepare for token prediction when they see the context? Can we know in advance which contextual information will be critical for prediction? General answers to these questions are not obvious, but they inspire follow-on research on explainable models, and some interesting results have been found. For example, Deletang et al. [2024] conducted extensive experiments showing that LLMs are powerful in-context compressors. Although viewing predictive models as compression models has long been studied in machine learning, this view also provides insights into our understanding of LLM scaling laws. Pal et al. [2023] and Wu et al. [2024] investigated whether the features learned up to the current step, though not trained for this purpose, are already sufficient for predicting tokens at the following steps. Note that the need for long context in language modeling is highly dependent on the problem being addressed. A related issue is where to apply LLMs and how to evaluate them. For example, in summarization tasks we may only need to distill and focus on a few key aspects of the text, while in retrieval-like tasks we need to “memorize” the entire context so that the relevant information can be accessed. We will discuss the evaluation issue later in this subsection.
Evaluating long-context LLMs is important, but it is a relatively new issue in NLP. The general idea is that, if we input a long context to an LLM, we can then check from the output of the LLM whether it understands the entire context and makes use of it in predicting subsequent tokens. In conventional NLP research, such evaluations were often aimed at examining the ability of NLP models to handle long-range dependencies. However, the size of the contexts used in recent LLMs is much larger than that used in NLP systems a few years ago. This motivates researchers to develop new evaluation benchmarks and metrics for long-context LLMs.
One approach is to use the perplexity metric. However, in spite of its apparent simplicity, this method tends to reflect the LLMs' ability to make use of local context rather than global context. It is therefore tempting to develop evaluation methods that are specific to long-context LLMs. Popular methods include various synthetic tasks in which artificially generated or modified text is inserted into a long context and the LLM is then asked to find or make use of it. A well-known example is the needle-in-a-haystack test18, where a short fact (the “needle”) is placed at some position in a long document (the “haystack”) and the model is asked to retrieve it.

18 https://github.com/gkamradt/LLMTest_NeedleInAHaystack
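As a toy illustration of how such a test case might be constructed (the filler text, needle, and question below are purely hypothetical, and real benchmarks such as the one linked in the footnote are considerably more careful), one can insert a short statement at a chosen depth of a long context and then query the model for it:

def make_needle_test(filler_sentences, needle, question, depth=0.5):
    # Insert the needle at a relative depth of the haystack and form a prompt.
    pos = int(len(filler_sentences) * depth)
    context = " ".join(filler_sentences[:pos] + [needle] + filler_sentences[pos:])
    return f"{context}\n\nQuestion: {question}\nAnswer:"

prompt = make_needle_test(
    filler_sentences=["The weather was mild that day."] * 1000,
    needle="The secret code is 7342.",
    question="What is the secret code?",
    depth=0.25,
)
# The LLM's answer to this prompt is then checked against the known needle ("7342"),
# and the test is repeated at different depths and context lengths.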
2.4 Summary
In this chapter, we have discussed the concept of LLMs and related techniques. This can be considered a general, though not comprehensive, introduction to LLMs, laying the foundation for further discussions on more advanced topics in subsequent chapters. Furthermore, we have explored two ways to scale up LLMs. The first focuses on the large-scale pre-training of LLMs, which is crucial for developing state-of-the-art models. The second focuses on methods for adapting LLMs to long inputs, including optimizing attention models, designing more efficient and compressed KV caches, incorporating memory models, and exploring better positional embeddings.
The strength of LLMs lies in their ability to break the constraints of training NLP models for
a limited number of specific tasks. Instead, LLMs learn from large amounts of text through the
simple task of token prediction — we predict the next token in a sentence given its prior tokens.
A general view is that, by repeating this token prediction task a large number of times, LLMs can
acquire some knowledge of the world and language, which can then be applied to new tasks. As a
result, LLMs can be prompted to perform any task by framing it as a task of predicting subsequent
tokens given prompts. This emergent ability of language models comes from scaling up several dimensions, such as training, model size, and context size. It is undeniable that scaling laws are currently the fundamental principle adopted in developing large language models, although simply increasing model size has yet to prove sufficient for achieving AGI. These continuously scaled LLMs have been found to show capabilities in general-purpose language understanding, generation, and reasoning. More recently, it has been found that scaling up the compute at inference time can also lead to significant improvements in complex reasoning tasks [OpenAI, 2024].
Given their amazing power, LLMs have attracted considerable interest in terms of both techniques and applications, and the resulting explosion of research has led to a vast number of new techniques and models. However, we do not attempt to provide a comprehensive literature review on all aspects of LLMs, given the rapid evolution of the field. Nevertheless, one can still gain knowledge about LLMs from general reviews [Zhao et al., 2023; Minaee et al., 2024] or more focused discussions on specific topics [Ruan et al., 2024].