Chapter 5. Large Language Model
Large language models have transformed NLP through their remarkable capabilities in text
generation, translation, and question-answering. But how can a model trained solely to predict the
next word achieve these results? The answer lies in two factors: scale and supervised finetuning.
“As with all text generated by language models, the sample does not make sense beyond the
level of short phrases. The realism could perhaps be improved with a larger network and/or
more data. However, it seems futile to expect meaningful language from a machine that has
never been exposed to the sensory world to which language refers.” (Alex Graves,
“Generating Sequences With RNNs,” 2014)
GPT-3 showed some ability to continue relatively complex patterns. But only with GPT-3.5—able to
handle multi-stage dialogue and follow elaborate instructions—did it become clear that something
unexpected happens when a language model surpasses a certain parameter scale and is pretrained
on a sufficiently large corpus.
Scale is fundamental to building a capable LLM. Let’s look at the core features that make LLMs “large”
and how these features contribute to their capabilities.
Open-weight models are models with publicly accessible trained parameters. These can
be downloaded and used for tasks like text generation or finetuned for specific
applications. However, while the weights are open, the model’s license governs its
permitted uses, including whether commercial use is allowed. Licenses like Apache 2.0
and MIT permit unrestricted commercial use, but you should always review the license
to confirm your intended use aligns with the creators’ terms.
The table below shows key features of several open-weight LLMs compared to our tiny model:
Model              num_blocks   emb_dim   num_heads   vocab_size
Our model                   2       128           8       32,011
Llama 3.1 8B               32     4,096          32      128,000
Gemma 2 9B                 42     3,584          16      256,128
Gemma 2 27B                46     4,608          32      256,128
Llama 3.1 70B              80     8,192          64      128,000
Llama 3.1 405B            126    16,384         128      128,000
By convention, the number before “B” in the name of an open-weight model indicates its total number
of parameters in billions.
If you were to store each parameter of a 70B model as a 32-bit float, it would require about
280 GB of RAM—over 30 million times more memory than the Apollo 11 guidance computer had.
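The arithmetic behind this estimate is straightforward; a quick check in Python (the 70B figure is rounded):

params = 70e9                            # 70 billion parameters
bytes_per_param = 4                      # one 32-bit float per parameter
print(params * bytes_per_param / 1e9)    # 280.0 (GB)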
This massive number of parameters allows LLMs to learn and represent a vast amount of information
about grammar, semantics, and world knowledge, and to exhibit reasoning capabilities.
The key challenge with processing long texts in transformer models lies in the self-attention
mechanism’s computational complexity. For a sequence of length n, self-attention requires
computing attention scores between every pair of tokens, resulting in quadratic O(n²) time and space
complexity. This means that doubling the input length quadruples both the memory requirements
and computational cost. This quadratic scaling becomes particularly problematic for long
documents—for instance, a 10,000-token input would require computing and storing 100 million
attention scores for each attention layer.
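To make the quadratic growth concrete, here is a rough estimate for the 10,000-token example; counting a single attention head and storing each score as a 32-bit float are illustrative assumptions:

n = 10_000                        # input length in tokens
scores = n * n                    # pairwise attention scores in one head of one layer
print(scores)                     # 100000000 (100 million)
print(scores * 4 / 1e6, "MB")     # 400.0 MB if each score is a 32-bit float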
The increased context size is made possible through architectural improvements and optimizations
in attention computation. Techniques like grouped-query attention and FlashAttention (which are
beyond the scope of this book) enable efficient memory management, allowing LLMs to handle much
larger contexts without excessive computational costs.
LLMs typically undergo pretraining on shorter contexts around 4K-8K tokens, as the attention
mechanism’s quadratic complexity makes training on long sequences computationally intensive.
Additionally, most training data naturally consists of shorter sequences.
Long-context capabilities emerge through long-context pretraining, a specialized stage following
initial training. This process involves:
1. Incremental training for longer contexts: The model's context window gradually
expands from 4,000-8,000 tokens to 128,000-256,000 tokens through a series of
incremental stages. Each stage increases the context length and continues training until the
model meets two key criteria: restoring its performance on short-context tasks while
successfully handling longer-context challenges like “needle in a haystack” evaluations.
A needle in a haystack test evaluates a model’s ability to identify and utilize relevant
information buried within a very long context, typically by placing a crucial piece of
information early in the sequence and asking a question that requires retrieving that
specific detail from among thousands of tokens of unrelated text.
The accompanying illustration depicts the composition of LLM training datasets, using the open
Dolma dataset as an example. Segments represent different document types, with sizes scaled
logarithmically to prevent web pages—the largest category—from overwhelming the visualization.
Each segment shows both token count (in billions) and percentage of the corpus. While Dolma’s 3
trillion tokens are substantial, they fall short of more recent datasets like Qwen 2.5’s 18 trillion
tokens, a number likely to grow in future iterations.
It would take approximately 51,000 years for a human to read the entire Dolma dataset,
reading 8 hours every day at 250 words per minute.
Since neural language models train on such vast corpora, they typically process the data just once.
This single-epoch training approach prevents overfitting while reducing computational demands.
Processing these enormous datasets multiple times would be extremely time-consuming and may not
yield significant additional benefits.
Each of these four parallelism dimensions could merit its own chapter, and thus a full
exploration of them lies beyond this book’s scope.
Training large language models can cost tens to hundreds of millions of dollars. These expenses
include hardware, electricity, cooling, and engineering expertise. Such costs limit the development of
state-of-the-art LLMs to large tech companies and well-funded research labs. However, open-weight
models have lowered the barrier, enabling smaller organizations to leverage existing models through
methods like supervised finetuning and prompt engineering.
5.2. Supervised Finetuning
During pretraining, the model learns most of its capabilities. However, since it is trained only to
predict the next word, its default behavior is to continue the input. For instance, if you input “Explain
how machine learning works,” the pretrained model might respond with something like “and also
name three most popular algorithms.” This is not what users would expect. The model’s ability to
follow instructions, answer questions, and hold conversations is developed through a process called
supervised finetuning.
Let’s compare the behavior of a pretrained model and the same model finetuned to follow
instructions and answer questions.
We’ll use two models: google/gemma-2-2b, pretrained for next-token prediction, and
google/gemma-2-2b-it, a finetuned version for instruction following.
Models on the Hugging Face Hub follow this naming convention: “creator/model” with
no spaces. The “model” part typically includes information about the model’s version,
number of parameters, and whether it was finetuned for conversation or instruction-
following. In the name google/gemma-2-2b-it, we see that the creator is Google, the
model has version 2, 2 billion parameters, and it was finetuned to follow instructions
(with “it” standing for “instruction-tuned”).
This is the output of the pretrained-only google/gemma-2-2b given the above prompt:
The list of fruits and vegetables that are good for you is long. But there ar
e some that are better than others.
The best fruits and vegetables are those that are high in fiber, low in sugar
, and high in vitamins and minerals.
The best fruits and vegetables are those that are high in fiber, low in sugar
, and high in vitamins and minerals.
...
The output isn’t complete—the model keeps repeating the same sentence endlessly. As you can see,
the output is quite similar to what we observed with our decoder model. While google/gemma-2-2b,
being larger, produces more coherent sentence structures, the text still fails to align with the
context, which clearly requests a list of fruits.
Now, let’s apply the finetuned google/gemma-2-2b-it to the same input. The output is:
Here are a few more fruits to continue the list:
* **Banana**
* **Grapefruit**
* **Strawberry**
* **Pineapple**
* **Blueberry**
As you can see, the model with the same number of parameters now follows the instruction. This
change is achieved through supervised finetuning.
Supervised finetuning, or simply finetuning, modifies a pretrained model’s parameters to
specialize it for specific tasks. The goal isn’t to train the model to answer every question or follow
every instruction. Instead, finetuning “unlocks” the knowledge and skills the model already learned
during pretraining. Without finetuning, this knowledge remains “hidden” and is used mainly for
predicting the next token, not for problem-solving.
During finetuning, while the model is still trained to predict next tokens, it learns from examples of
quality conversations and problem-solving rather than general text. This targeted training enables
the model to better leverage its existing knowledge, producing relevant information in response to
prompts instead of generating arbitrary continuations.
PyTorch supports model parallelism with methods like Fully Sharded Data Parallel
(FSDP). FSDP enables efficient distribution of model parameters across GPUs by
sharding the model—splitting it into smaller parts. This way, each GPU processes only a
portion of the model.
Renting multi-GPU servers for large language model finetuning can be prohibitively expensive for
smaller organizations or individuals. The computational demands can result in significant costs, with
training runs potentially lasting anywhere from several hours to multiple weeks depending on the
model size and training dataset.
Commercial LLM service providers offer a more cost-effective finetuning option. They charge based
on the number of tokens in the training data and use various techniques to lower costs. Though these
methods aren’t covered in this book, you can find an up-to-date list of LLM finetuning services with
pay-per-token pricing on the book’s wiki.
Let’s finetune a pretrained LLM to generate an emotion. Our dataset has the following structure:
{"text": "i slammed the door and screamed in rage", "label": "anger"}
{"text": "i danced and laughed under the bright sun", "label": "joy"}
{"text": "tears rolled down my face in silence today", "label": "sadness"}
...
It’s a JSONL file, where each row is a labeled example formatted as a JSON object. The text key
contains a text expressing one of six emotions; the label key is the corresponding emotion. The label
can be one of six values: sadness, joy, love, anger, fear, and surprise. Thus, we have a document
classification problem with six classes.
We'll finetune GPT-2, a pretrained model licensed under the MIT license, which permits unrestricted
commercial use. This language model, with its modest 124M parameters, is often classified as an SLM
(small language model). Despite these constraints, it demonstrates impressive capabilities on certain
tasks and remains accessible for finetuning even within free-tier Colab notebooks.
Before training a complex model, it’s wise to establish baseline performance. A baseline is a simple,
easy-to-implement solution that sets the minimum acceptable performance level. Without it, we can’t
determine if a complex model’s performance justifies its added complexity.
We'll use logistic regression with bag of words as our baseline. This pairing has proven effective
for document classification. Implementation will use scikit-learn, an open-source library that
streamlines the training and evaluation of traditional “shallow” machine learning models.
Now, let’s load the data and prepare it for machine learning:7
import random

random.seed(42) ➊
data_url = "https://www.thelmbook.com/data/emotions"
X_train_text, y_train, X_test_text, y_test = download_and_split_data(
    data_url, test_ratio=0.1
) ➋
7 We will load the data from the book’s website to ensure it remains accessible. The dataset’s original source is
https://huggingface.co/datasets/dair-ai/emotion. It was first used in Saravia et al., “CARER: Contextualized Affect
Representations for Emotion Recognition,” Proceedings of the 2018 Conference on Empirical Methods in Natural
Language Processing, 2018.
With the data loaded and split into training and test sets, we transform it into a bag-of-words:
from sklearn.feature_extraction.text import CountVectorizer
CountVectorizer’s fit_transform method converts training data into the bag-of-words format.
max_features limits vocabulary size, and binary determines whether features represent a word’s
presence (True) or count (False). The subsequent transform converts the test data into a bag-of-
words representation using the vocabulary built using training data. This approach prevents data
leakage—where information from the test set inadvertently influences the machine learning
process. Maintaining this separation between training and test data is crucial, as any leakage would
compromise the model’s ability to generalize to truly unseen examples.
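Based on this description, the transformation step might look like the following sketch; the max_features value and binary=False are illustrative assumptions, and X_train_text/X_test_text come from the loading code above:

vectorizer = CountVectorizer(max_features=10_000, binary=False)
X_train = vectorizer.fit_transform(X_train_text)  # learn the vocabulary on training texts only
X_test = vectorizer.transform(X_test_text)        # reuse that vocabulary for the test texts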
The logistic regression implementation in scikit-learn accepts labels as strings, so there is no need to
convert them to numbers. The library handles the conversion automatically.
Now, let’s train a logistic regression model:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

model = LogisticRegression(random_state=42, max_iter=1000)
model.fit(X_train, y_train)

y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

print(f"Training accuracy: {accuracy_score(y_train, y_train_pred):.4f}")
print(f"Test accuracy: {accuracy_score(y_test, y_test_pred):.4f}")
Output:
Training accuracy: 0.9854
Test accuracy: 0.8855
The LogisticRegression object is first created. Its fit method, called next, trains the model8 on
the training data. Afterward, the model predicts outcomes for both the training and test sets, and the
accuracy for each is calculated.
8 In reality, scikit-learn trains a model slightly different from classical logistic regression; it uses softmax with cross-
entropy loss instead of using the sigmoid function and binary cross-entropy. This approach generalizes logistic
regression to multiclass classification problems.
The random_state parameter in LogisticRegression sets the seed for the random number
generator. The max_iter parameter limits the solver to a maximum of 1000 iterations.
A solver is the algorithm that optimizes a model’s parameters. It works like gradient
descent but might use different techniques to improve efficiency, handle constraints, or
ensure numerical stability. In LogisticRegression, the default solver is lbfgs
(Limited-memory Broyden–Fletcher–Goldfarb–Shanno). This algorithm performs well
with small to medium datasets and suits loss functions such as logistic loss. Setting
max_iter = 1000 ensures the solver has enough iterations to converge.
The accuracy metric calculates the proportion of correct predictions out of all predictions:

Accuracy = Number of correct predictions / Total number of predictions
As you can see, the model overfits: it performs almost perfectly on the training data but significantly
worse on the test data. To address this, we can adjust the hyperparameters of our algorithm. Let’s
try incorporating bigrams and increasing the vocabulary size to 20,000:
vectorizer = CountVectorizer(max_features=20_000, ngram_range=(1, 2))
This adjustment leads to slight improvement on the test set, but it still falls short compared to the
training set performance:
Training accuracy: 0.9962
Test accuracy: 0.8910
Now that we see a simple approach achieves a test accuracy of 0.8910, any more complex solution
must outperform this baseline. If it performs worse, we will know that our implementation likely
contains an error.
Let’s finetune GPT-2 to generate emotion labels as text. This approach is easy to implement since no
additional classification output layer is needed. Instead, the model is trained to output labels as
regular words, which, depending on the tokenizer, may span multiple tokens.
from transformers import AutoTokenizer, AutoModelForCausalLM, set_seed
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

set_seed(42)
data_url = "https://www.thelmbook.com/data/emotions"
model_name = "openai-community/gpt2"

tokenizer = AutoTokenizer.from_pretrained(model_name) ➊
tokenizer.pad_token = tokenizer.eos_token ➋
model = AutoModelForCausalLM.from_pretrained(model_name).to(device) ➌
The AutoModelForCausalLM class from the transformers library, used in line ➌, automatically
loads a pretrained autoregressive language model. Line ➊ loads the pretrained tokenizer. The
tokenizer used in GPT-2 does not include a padding token. Therefore, in line ➋, we set the padding
token by reusing the end-of-sequence token.
Now, we set up the training loop:
for epoch in range(num_epochs):
    for input_ids, attention_mask, labels in train_loader:
        input_ids = input_ids.to(device)
        attention_mask = attention_mask.to(device) ➊
        labels = labels.to(device)

        outputs = model(
            input_ids=input_ids,
            labels=labels,
            attention_mask=attention_mask
        )

        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
The attention_mask in line ➊ is a binary tensor showing which tokens in the input are actual data
and which are padding. It has 1s for real tokens and 0s for padding tokens. This mask is different from
the causal mask, which blocks positions from attending to future tokens.
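The loop above assumes that an optimizer, the number of epochs, and a train_loader yielding (input_ids, attention_mask, labels) batches were created beforehand. A minimal sketch of that setup, using the hyperparameter values reported later in this section (the choice of AdamW is an assumption):

from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=0.00005)
num_epochs = 2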
Let’s illustrate input_ids, labels, and attention_mask for a batch of two simple examples:
Text                  Emotion
I feel very happy     joy
So sad today          sadness
We convert these examples into text completion tasks by adding a task definition and solution:
Table 5.1: Text completion template.
Task                                              Solution
Predict emotion: I feel very happy\nEmotion:      joy
Predict emotion: So sad today\nEmotion:           sadness
In the table above, “\n” denotes a new line character, while “\nEmotion:” marks the boundary
between the task description and the solution. This format, while optional, helps the model use its
pretrained understanding of text. The sole new ability to be learned during finetuning is generating
one of six outputs: sadness, joy, love, anger, fear, or surprise—no other outputs.
LLMs gained emotion classification skills during pretraining partly because of the
widespread use of emojis online. Emojis acted as labels for the text around them.
Assuming a simple tokenizer that splits strings by spaces and assigns unique IDs to each token, here’s
a hypothetical token-to-ID mapping:
Token          ID      Token       ID
Predict         1      So           8
emotion:        2      sad          9
I               3      today       10
feel            4      joy         11
very            5      sadness     12
happy           6      [EOS]        0
\nEmotion:      7      [PAD]       −1
The special [EOS] token indicates the end of generation, while [PAD] serves as a padding token. The
following examples show how texts are converted to token IDs:
We then concatenate the input tokens with the completion tokens and append the [EOS] token so
the model learns to stop generating once the emotion label generation is completed. The input_ids
tensor contains these concatenated token IDs. The labels tensor is made by replacing all input text
tokens with −100 (a special masking value), while keeping the actual token IDs for the completion
and [EOS] tokens. This ensures the model only computes loss on predicting the completion tokens,
not on reproducing the input text.
The value −100 is the default ignore_index of PyTorch’s cross-entropy loss (similar frameworks
use the same convention), so positions labeled with it are excluded from loss computation. When
finetuning language models, this ensures the model concentrates on predicting tokens for the
desired output (the “solution”) rather than the tokens in the input (the “task”).
Text: Predict emotion: I feel very happy\nEmotion: joy
input_ids: [1, 2, 3, 4, 5, 6, 7, 11, 0]
labels:    [-100, -100, -100, -100, -100, -100, -100, 11, 0]

Text: Predict emotion: So sad today\nEmotion: sadness
input_ids: [1, 2, 8, 9, 10, 7, 12, 0]
labels:    [-100, -100, -100, -100, -100, -100, 12, 0]
To form a batch, all sequences must have the same length. The longest sequence has 9 tokens (from
the first example), so we pad the shorter sequences to match that length. Here’s the final table
showing how the input_ids, labels, and attention_mask are adjusted after padding:
In input_ids, all sequences have a length of 9 tokens. The second example is padded with the [PAD]
token (ID −1). In the attention_mask, real tokens are marked as 1, while padding tokens are
marked as 0.
This padded batch is now ready for the model to handle.
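A minimal sketch of this preparation, using the hypothetical token IDs from the tables above (the helper names, right-side padding, and the use of −100 for padded label positions are assumptions):

def build_example(task_ids, solution_ids, eos_id=0):
    input_ids = task_ids + solution_ids + [eos_id]              # task + solution + [EOS]
    labels = [-100] * len(task_ids) + solution_ids + [eos_id]   # loss only on solution and [EOS]
    attention_mask = [1] * len(input_ids)                       # all tokens are real so far
    return input_ids, labels, attention_mask

def pad_batch(examples, pad_id=-1):
    max_len = max(len(input_ids) for input_ids, _, _ in examples)
    padded = []
    for input_ids, labels, attention_mask in examples:
        pad = max_len - len(input_ids)
        padded.append((input_ids + [pad_id] * pad,
                       labels + [-100] * pad,
                       attention_mask + [0] * pad))
    return padded

ex1 = build_example([1, 2, 3, 4, 5, 6, 7], [11])  # "Predict emotion: I feel very happy\nEmotion:" -> "joy"
ex2 = build_example([1, 2, 8, 9, 10, 7], [12])    # "Predict emotion: So sad today\nEmotion:" -> "sadness"
print(pad_batch([ex1, ex2]))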
After finetuning the model with num_epochs = 2, batch_size = 16, and learning_rate =
0.00005, it achieves a test accuracy of 0.9415. This is more than 5 percentage points higher than the
baseline result of 0.8910 obtained with logistic regression.
When finetuning, a smaller learning rate is often used to avoid large changes to the
pretrained weights. This helps retain the general knowledge from pretraining while
adjusting to the new task. A common choice is 0.00005 (5 × 10⁻⁵), as it often works well
in practice. However, the best value depends on the specific task and model.
The full code for supervised finetuning of an LLM is available in the thelmbook.com/nb/5.2 notebook.
You can adapt this code for any text generation task by updating the data files (while keeping the
same JSON format) and adjusting Task and Solution in Table 5.1 with text relevant to the specific
business problem.
Let’s see how this code can be adapted for finetuning for a general instruction-following task.
This format allows the LLM to see where the Task part ends (“\nEmotion:”) and the Solution starts.
When we finetune for general-purpose instruction following, we cannot use “\nEmotion:” as a
separator. We need a more general format. Since the first open-weight models were introduced, many
prompting formats have been used by various people and organizations. Below are just two of them,
named after well-known LLMs that use these formats:
Vicuna:
USER: {instruction}
ASSISTANT: {solution}
Alpaca:
### Instruction:
{instruction}
### Response:
{solution}
ChatML (chat markup language) is a prompting format used in many popular finetuned LLMs. It
provides a standardized way to encode chat messages, including the role of the speaker and the
content of the message.
The format uses two tags: <|im_start|> to indicate the start of a message and <|im_end|> to mark
its end. A basic ChatML message structure looks like this:
<|im_start|>{role}
{message}
<|im_end|>
The message is either an instruction (question) or a solution (answer). The role is usually one of the
following: system, user, and assistant. For example:
<|im_start|>system
You are a helpful assistant.
<|im_end|>
<|im_start|>user
What is the capital of France?
<|im_end|>
<|im_start|>assistant
The capital of France is Paris.
<|im_end|>
The user role is the person who asks questions or gives instructions. The assistant role is the chat
LM providing responses. The system role specifies instructions or context for the model’s behavior.
The system message, known as the system prompt, can include private details about the user, like
their name, age, or other information useful for the LLM-based application.
The prompting format has little impact on the quality of a finetuned model itself. However, when
working with a model finetuned by someone else, you need to know the format used during
finetuning. Using the wrong format could affect the quality of the model’s outputs.
After transforming the training data into the chosen prompting format, the training process uses the
same code as the emotion generation model. You can find the complete code for instruction finetuning
an LLM in the thelmbook.com/nb/5.3 notebook.
The dataset I used has about 500 examples, generated by a state-of-the-art LLM. While this may not
be enough for high-quality instruction following, there's no standard approach for building an ideal
instruction finetuning dataset. Online datasets vary widely, from thousands to millions of examples
of varying quality. Still, some experiments suggest that a carefully selected set of diverse examples,
even as small as 1,000, can enable strong instruction-following in a sufficiently large pretrained
language model, as Meta’s LIMA model demonstrated.
A consensus among practitioners is that the quality, not quantity, of examples is crucial for
achieving state-of-the-art results in instruction finetuning.
The training examples can be found in this file:
data_url = "https://www.thelmbook.com/data/instruct"
The instructions and examples used during finetuning fundamentally shape a model's
behavior. Models exposed to polite or cautious responses tend to mirror those traits.
Through finetuning, models can even be trained to consistently generate falsehoods.
Users of third-party finetuned models should watch for biases introduced in the process.
“Unbiased” models often simply have biases that serve certain interests.
To understand the impact of instruction finetuning, let’s first see how a pretrained model handles
instructions without any special training, using a pretrained GPT-2:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2").to(device)

inputs = tokenizer(
    "Who is the President of the United States?", return_tensors="pt"
).to(device)
outputs = model.generate(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    max_new_tokens=32,
    pad_token_id=tokenizer.pad_token_id
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Output:
Who is the President of the United States?
The President of the United States is the President of the United States.
The President of the United States is the President of the United States.
Again, like google/gemma-2-2b, the model exhibits sentence repetition. Now, let’s look at the output
after finetuning on our instruction dataset. The inference code for an instruction-finetuned model
must follow the prompting format used during finetuning. The build_prompt method applies the
ChatML prompting format to our instruction:
def build_prompt(instruction, solution=None):
    wrapped_solution = ""
    if solution:
        wrapped_solution = f"\n{solution}\n<|im_end|>"
    return f"""<|im_start|>system
You are a helpful assistant.
<|im_end|>
<|im_start|>user
{instruction}
<|im_end|>
<|im_start|>assistant""" + wrapped_solution
The same build_prompt function is used for both training and testing. During training, it takes both
instruction and solution as input. During testing, it only receives instruction.
Now, let’s define the function that generates text:
def generate_text(model, tokenizer, prompt, max_new_tokens=100):
    input_ids = tokenizer(prompt, return_tensors="pt").to(model.device)
    end_tokens = tokenizer.encode("<|im_end|>") ➊
    stopping = StoppingCriteriaList(
        [EndTokenStoppingCriteria(end_tokens, model.device)]
    ) ➋
    output_ids = model.generate(
        input_ids=input_ids["input_ids"],
        attention_mask=input_ids["attention_mask"],
        max_new_tokens=max_new_tokens,
        pad_token_id=tokenizer.pad_token_id,
        stopping_criteria=stopping
    )[0]
    generated_ids = output_ids[input_ids["input_ids"].shape[1]:] ➌
    generated_text = tokenizer.decode(generated_ids).strip()
    return generated_text
Line ➊ encodes the <|im_end|> tag into token IDs which will be used to indicate the end of
generation. Line ➋ sets up a stopping criterion using the EndTokenStoppingCriteria class
(defined below), ensuring the generation halts when end_tokens appear. Line ➌ slices the generated
tokens to remove the input prompt, leaving only the newly generated text.
The EndTokenStoppingCriteria class defines the signal to stop generating tokens:
from transformers import StoppingCriteria, StoppingCriteriaList

class EndTokenStoppingCriteria(StoppingCriteria):
    def __init__(self, end_tokens, device):
        self.end_tokens = torch.tensor(end_tokens).to(device) ➊
In the constructor:
• Line ➊ converts the end_tokens list into a PyTorch tensor and moves it to the specified
device. This ensures the tensor is on the same device as the model.
In the __call__ method, line ➋ loops through the generated sequences in the batch. For each:
• Line ➌ takes the last len(end_tokens) tokens and stores them in last_tokens.
• Line ➍ checks if last_tokens match end_tokens. If they do, True is added to the
do_stop list, which tracks whether to stop generation for each sequence in the batch.
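The body of __call__ is not reproduced above; a minimal sketch consistent with this description (the decision to stop only when every sequence in the batch ends with the tag is an assumption):

    def __call__(self, input_ids, scores, **kwargs):
        do_stop = []
        for seq in input_ids:                                           # ➋ loop over sequences in the batch
            last_tokens = seq[-len(self.end_tokens):]                   # ➌ last len(end_tokens) tokens
            do_stop.append(torch.equal(last_tokens, self.end_tokens))   # ➍ compare with end_tokens
        return all(do_stop)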
This is how we call the inference for a new instruction:
input_text = "Who is the President of the United States?"
prompt = build_prompt(input_text)
generated_text = generate_text(model, tokenizer, prompt)
print(generated_text.replace("<|im_end|>", "").strip())
Output:
George W. Bush
Since GPT-2 is a relatively small language model and wasn’t finetuned on recent facts, this confusion
about presidents isn’t surprising. What matters here is that the finetuned model now interprets the
instruction as a question and responds accordingly.
T       Probabilities            Comment
0.5     [0.98, 0.02, 0.00]⊤      More focused on “cat”
1.0     [0.87, 0.12, 0.02]⊤      Standard softmax
2.0     [0.67, 0.24, 0.09]⊤      More evenly distributed
Temperature controls the balance between creativity and determinism. Low values (0.1–0.3)
produce focused, precise outputs, suitable for tasks like factual responses, coding, or math. Moderate
values (around 0.7–0.8) offer a mix of creativity and coherence, ideal for conversation or content
writing. High values (1.5–2.0) add randomness, useful for brainstorming or story generation, though
coherence may drop. Extreme values (near 0 or above 2) are rarely used.
These ranges are guidelines; the optimal temperature depends on the model and task and should be
determined through experimentation.
Given the vocabulary and probabilities, this Python function returns the sampled token:
import numpy as np
The function performs two checks before sampling. Line ➊ ensures there is one probability for each
token in the vocabulary. Line ➋ confirms the probabilities sum to 1, allowing for a small tolerance
due to floating-point precision. Once these validations pass, line ➌ handles the sampling. It selects a
token from the vocabulary based on the probabilities, so a token with a 0.7 probability is chosen
roughly 70% of the time when the function is run repeatedly.
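The body of the sampling function is not reproduced above; a minimal sketch consistent with the description (the function name and exact tolerance are assumptions):

def sample_token(vocabulary, probabilities):
    assert len(probabilities) == len(vocabulary)            # ➊ one probability per vocabulary token
    assert abs(sum(probabilities) - 1.0) < 1e-6             # ➋ probabilities sum to 1, within tolerance
    return np.random.choice(vocabulary, p=probabilities)    # ➌ sample according to the probabilities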
The function begins by validating inputs: ensuring logits match the vocabulary size, temperature is
positive, top-k is at least 1, and top-𝑘 does not exceed the vocabulary size. Line ➊ scales the logits by
the temperature. Line ➋ determines the top-𝑘 cutoff by sorting the logits and selecting the 𝑘 th largest
value. Line ➌ discards less likely tokens by setting logits below the cutoff to negative infinity. Line ➍
converts the remaining logits into probabilities using a numerically stable softmax. Line ➎ ensures
the probabilities sum to 1.
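The top-k sampling function itself is not reproduced above; a minimal sketch following the numbered steps just described (the function name and default values are assumptions):

def sample_token_top_k(vocabulary, logits, temperature=1.0, top_k=50):
    logits = np.asarray(logits, dtype=np.float64)
    assert len(logits) == len(vocabulary)
    assert temperature > 0
    assert 1 <= top_k <= len(vocabulary)
    logits = logits / temperature                         # ➊ scale the logits by the temperature
    cutoff = np.sort(logits)[-top_k]                      # ➋ the k-th largest logit
    logits = np.where(logits < cutoff, -np.inf, logits)   # ➌ discard less likely tokens
    shifted = logits - np.max(logits)                     # ➍ numerically stable softmax
    probs = np.exp(shifted) / np.sum(np.exp(shifted))
    probs = probs / probs.sum()                           # ➎ make sure the probabilities sum to 1
    return np.random.choice(vocabulary, p=probs)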
The value of 𝑘 depends on the task. Low values (5–10) focus on the most likely tokens, improving
accuracy and consistency, which suits factual responses and structured tasks. Mid-range values (20–
50) balance variation and coherence, making them good defaults for general writing and dialogue.
High values (100–500) allow more diversity, useful for creative tasks. These ranges are practical
guidelines, but the best 𝑘 depends on the model, vocabulary size, and application. Very low values
(below 5) can be too limiting, while extremely high values (over 500) rarely improve quality.
Experimentation is necessary to find the best setting.
5.4.4. Penalties
Modern language models use penalty parameters alongside temperature and filtering methods to
manage text diversity and quality. These penalties help avoid issues such as repeated words,
overused tokens, and generation loops.
The frequency penalty adjusts token probabilities based on how often they’ve appeared in the
generated text so far. When a token appears multiple times, its probability is reduced proportionally
to its appearance count. The penalty is applied by subtracting a scaled version of the token’s count
from its logits before the softmax:
𝑜(𝑗) ← 𝑜(𝑗) − 𝛼 ⋅ count(𝑗),

where 𝛼 is the frequency penalty parameter. Higher values (0.8–1.0) decrease the model’s likelihood
of repeating the same line verbatim or getting stuck in a loop.
The presence penalty modifies token probabilities based on whether they appear anywhere in the
generated text, regardless of count:
𝑜(𝑗) ← 𝑜(𝑗) − 𝛾   if token 𝑗 is in the generated text,
𝑜(𝑗) ← 𝑜(𝑗)       otherwise.

Here, 𝛾 is the presence penalty parameter. Higher values of 𝛾 (0.7–1.0) increase the model’s likelihood
of talking about new topics.
The optimal values depend on the specific task. For creative writing, higher penalties encourage
novelty. For technical documentation, lower penalties maintain precision and consistency.
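A minimal sketch of how both penalties can be applied to the logits before sampling (the function and parameter names are assumptions):

import numpy as np
from collections import Counter

def apply_penalties(logits, generated_ids, alpha=0.0, gamma=0.0):
    logits = np.asarray(logits, dtype=np.float64).copy()
    for token_id, count in Counter(generated_ids).items():
        logits[token_id] -= alpha * count   # frequency penalty: grows with the token's count
        logits[token_id] -= gamma           # presence penalty: applied once per distinct token
    return logits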
The complete implementation of sample_token that combines temperature, top-𝑘, top-𝑝, and the
two penalties can be found in the thelmbook.com/nb/5.4 notebook.
The matrices 𝐀 and 𝐁, together, are called a LoRA adapter. Their product, 𝛥𝐖, acts as an update
matrix that adjusts the original weights 𝐖0 to enhance performance on a new task. Since 𝐀 and 𝐁 are
much smaller than 𝐖0, this method significantly reduces the number of trainable parameters.
For example, if 𝐖0 has dimensions 1024 × 1024, it would contain over a million parameters to
finetune directly (1,048,576 parameters). With LoRA, we introduce 𝐀 with dimensions 1024 × 8
(8,192 parameters) and 𝐁 with dimensions 8 × 1024 (8,192 parameters). This setup requires only
8,192 + 8,192 = 16,384 parameters to be trained.
The adapted weight matrix 𝐖 is used in the layers of the finetuned transformer, replacing the original
matrix 𝐖0 to alter the token embeddings as they pass through the transformer blocks. The creation
of 𝐖 can be written as 𝐖 = 𝐖0 + (𝛼/𝑟)𝛥𝐖, where 𝛥𝐖 = 𝐀𝐁.

The scaling factor 𝛼/𝑟 controls the size of the weight updates introduced by LoRA during finetuning.
Both 𝑟 and 𝛼 are hyperparameters, with 𝛼 typically set as a multiple of 𝑟. For example, if 𝑟 = 8, 𝛼
might be 16, resulting in a scaling factor of 2. The optimal values for 𝑟 and 𝛼 are found experimentally
by assessing the finetuned LLM’s performance on the test set.
LoRA is usually applied to the weight matrices in the self-attention layers—specifically the query,
key, and value weight matrices 𝐖𝑄 , 𝐖𝐾 , 𝐖𝑉 , and the projection matrix 𝐖𝑂 . It can also be applied to
the weight matrices 𝐖1 and 𝐖2 in the position-wise MLP layers.
Finetuning LLMs with LoRA is faster than a full model finetune and uses less memory for gradients,
enabling the finetuning of very large models on limited hardware.
We can modify our previous code by incorporating the PEFT library to apply LoRA:
from peft import get_peft_model, LoraConfig, TaskType

peft_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,  # Specify the task type
    inference_mode=False,          # Set to False for training
    r=8,                           # Set the rank r
    lora_alpha=16                  # LoRA alpha
)
model = get_peft_model(model, peft_config)  # wrap the pretrained model with LoRA adapters
In PyTorch, the requires_grad attribute controls whether a tensor tracks operations for automatic
differentiation. When requires_grad=True, PyTorch keeps track of all operations on the tensor,
enabling gradient computation during the backward pass. To freeze a model parameter (preventing
updates during training), set its requires_grad to False:
import torch.nn as nn

model = nn.Linear(4, 2)  # a small layer with weight and bias parameters

print(model.weight.requires_grad)
print(model.bias.requires_grad)
model.bias.requires_grad = False
print(model.bias.requires_grad)
Output:
True
True
False
The PEFT library ensures that only the LoRA adapter parameters have requires_grad=True,
keeping all other model parameters frozen.
After wrapping the model with get_peft_model, the training loop stays the same. For instance,
finetuning GPT-2 on an emotion generation task using LoRA with r=16 and lora_alpha=32 achieves
a test accuracy of 0.9420. This is marginally better than the 0.9415 from full finetuning. Generally,
LoRA tends to perform slightly worse than full finetuning. However, the outcome depends on the
choice of hyperparameters, dataset size, base model, and task.
The full code for GPT-2 finetuning with LoRA is available in the thelmbook.com/nb/5.5 notebook.
You can customize it for your own tasks by modifying the dataset and LoRA settings.
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    model_path, num_labels=6
)
For pretrained autoregressive language models, the class maps the embedding of the final (right-
most) non-padding token from the last decoder block to a vector with dimensionality matching the
number of classes (6 in this case). The structure of this modification is as follows:
As you can see, once the final decoder block processes the input (the second block in our example),
the output embedding 𝐳4,2 of the last token is passed through the classification head’s weight matrix,
𝐖𝐶 . This projection converts the embedding into logits, one per class.
The parameter tensor 𝐖𝐶 is initialized with random values and trained on the labeled emotions
dataset. Training relies on cross-entropy to measure the loss between the predicted probability
distribution and the one-hot encoded true class label. This error is backpropagated, updating the
weights in both the classification head and the rest of the model. This can be combined with LoRA.
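A minimal sketch of what the classification head computes for a single input, using GPT-2’s 768-dimensional embeddings (the random tensors stand in for the real model’s outputs and weights):

import torch
import torch.nn.functional as F

emb_dim, num_classes = 768, 6
W_C = torch.randn(emb_dim, num_classes, requires_grad=True)  # randomly initialized classification head

z_last = torch.randn(1, emb_dim)                    # embedding of the last non-padding token
logits = z_last @ W_C                               # one logit per emotion class
loss = F.cross_entropy(logits, torch.tensor([2]))   # cross-entropy against the true class index
loss.backward()                                     # gradients flow into W_C (and, in the full model, further back)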
After finetuning with num_epochs = 8, batch_size = 16, and learning_rate = 0.00005, the
model reaches a test accuracy of 0.9460. This is slightly better than the 0.9415 accuracy from
finetuning the unmodified model to generate class labels as text. The improvement might be more
noticeable with a different base model or dataset.
The code for finetuning GPT-2 as an emotion classifier is available on the wiki in the
thelmbook.com/nb/5.6 notebook. It can be easily adapted for any text classification task by replacing
the data in the file while keeping the same JSON format.
Despite its simplicity, the conversational interface allows solving various practical problems. This
section explores best practices for using chat LMs to address such problems; these practices are
known as prompt engineering techniques.
Your role: Act as a seasoned insurance claims analyst familiar with industry-
standard classifications.
Task: Identify the type of incident, the primary cause, and the significant d
amages described in the report.
<examples>
<example>
<input>
Observed two-vehicle accident at an intersection. Insured's car was h
it after the other driver ran a red light. Witnesses confirm. The vehicle has
severe front-end damage, airbags deployed, and was towed from the scene.
</input>
<output>
{
"type": "collision",
"cause": "failure to stop at signal",
"damage": ["front-end damage", "airbag deployment"]
}
</output>
</example>
<example>
...
</example>
</examples>
I used XML tags for few-shot examples because they clearly define example boundaries
and are familiar to LLMs from pretraining on structured data. Furthermore, chat LM
models are often finetuned using conversational examples with XML structures. Using
XML isn’t mandatory, but it can be helpful.
When working with the same chat LM for follow-ups, especially in tasks like coding or handling
complex structured outputs, it’s generally a good idea to start fresh after three to five exchanges. This
recommendation comes from two key observations:
1. Chat LMs are typically finetuned using examples of short conversations. Creating long,
high-quality conversations for finetuning is both difficult and costly, so the training data
often lacks examples of long interactions focused on problem solving. As a result, the
model performs better with shorter exchanges.
2. Long contexts can cause errors to accumulate. In the self-attention mechanism, the softmax
is applied over many positions to compute weights for combining value vectors. As the
context length increases, inaccuracies build up, and the model’s “focus” may shift to
irrelevant details or earlier mistakes.
When starting fresh, it’s important to update the initial prompt with key details from earlier follow-
ups. This helps the model avoid repeating previous mistakes. By consolidating the relevant
information into a clear, concise starting point, you ensure the model has the context it needs without
relying on the long and noisy history of the prior conversation.
Args:
numbers: List of integers to search through. Can be empty.
target: Integer sum to find.
Returns:
Tuple of two distinct indices whose values sum to target,
or None if no solution exists.
Examples:
>>> find_target_sum([2, 7, 11, 15], 9)
(0, 1)
>>> find_target_sum([3, 3], 6)
(0, 1)
>>> find_target_sum([1], 5)
None
>>> find_target_sum([], 0)
None
Requirements:
- Time complexity: O(n)
- Space complexity: O(n)
- Each index can only be used once
- If multiple solutions exist, return any valid solution
- All numbers and target can be any valid integer
- Return None if no solution exists
"""
Providing a highly detailed docstring can sometimes feel as time-consuming as coding the function
itself. A less detailed description might seem more practical, but this increases the likelihood of the
generated code not fully meeting user needs. In such cases, users can review the output and refine
their instructions with additional requests or constraints.
By the way, the book’s official website, thelmbook.com, was created entirely through
collaboration with an LLM. While it wasn’t generated perfectly on the first try, through
iterative feedback, multiple conversation restarts, and switching between different chat
LLMs when needed, I refined every element you see—from the graphics to the
animations—until they met my vision.
Language models can generate functions, classes, or even entire applications. However, the chance of
success decreases as the level of abstraction increases. If the problem resembles the model’s training
data, the model performs well with minimal input. However, for novel or unique business or
engineering problems, detailed instructions are crucial for good results.
If you decide to use a brief prompt to save time, ask the model to pose clarifying questions.
You can also request it to describe the code it plans to generate first. This allows you to
adjust or add details to the instructions before code is created.
As an example, consider a pipeline that runs when code changes are staged for commit. It:
1. Uses an LLM to analyze the staged differences and identify affected documentation files in
the project’s documentation directory. The model examines code changes and determines
which documentation files might need updates.
2. Both the existing documentation content and staged code changes are then passed to
another LLM call. This second step generates updated documentation that reflects the code
modifications while maintaining the existing documentation’s style and structure.
3. Places the updated documentation in the staging area alongside code changes. This allows
developers to review both code and documentation updates together before committing,
ensuring accuracy and maintaining a single source of truth.
This approach treats documentation as a first-class citizen in the development process, ensuring it
evolves alongside the code.
While LLMs can help maintain documentation alignment, they should not operate
autonomously. Human review remains crucial to verify the accuracy of generated
documentation updates and ensure they align with the team’s communication standards.
This pipeline is especially useful for keeping API documentation, architectural descriptions, and
implementation guides up to date. However, like other LLM-based systems, it must include
safeguards against hallucinations. We discuss this next.
5.8. Hallucinations
A major challenge with LLMs is their tendency to produce content that seems plausible but is factually
incorrect. These inaccuracies, called hallucinations, create problems for using LLMs in production
systems where reliability and accuracy are required.
As you can imagine, “Blockchain Quantum Neural Network (BQNN)” is not a real concept.
The LLM’s two-page explanation, including detailed descriptions of how it works, is
entirely fabricated.
Low quality of training data also contributes to hallucinations. During pretraining on large volumes
of internet text, models are exposed to both accurate and inaccurate information. They learn these
inaccuracies but lack the ability to differentiate between truth and falsehood.
Finally, LLMs generate text one token at a time. This approach means that errors in earlier tokens can
cascade, leading to increasingly incoherent outputs.
information. Similarly, in a code generation system, the model might generate code, but automated
tests and human review should always occur before deployment.
The potential for hallucinations was notably demonstrated when Air Canada’s customer
service chatbot provided incorrect information about bereavement travel rates to a
passenger. The chatbot falsely claimed that customers could book full-price tickets and
later apply for reduced fares, contradicting the airline’s actual policy. When the passenger
tried to claim the fare reduction, Air Canada’s denial led to a small claims court case,
resulting in an $812 CAD (near $565 USD) compensation order. This case highlights the
tangible business consequences of AI inaccuracies, including financial losses, customer
frustration, and reputational damage.
Success with LLMs lies in recognizing that hallucinations are an inherent limitation of the technology.
However, this issue can be managed through thoughtful system design, safeguards, and a clear
understanding of when and where these models should be applied.
Meta’s decision to withhold its multimodal Llama model from the European Union in July
2024 exemplifies the growing tension between AI development and regulatory
compliance. Citing concerns over the region’s “unpredictable” regulatory environment,
particularly regarding the use of copyrighted and personal data for training, Meta joined
other tech giants like Apple in limiting AI deployments in European markets. This
restriction highlights the challenges companies face in balancing innovation with
regional regulations.
When selecting models for commercial use, companies should review the training documentation and
license terms. Models trained primarily on public domain or properly licensed materials involve
lower legal risks. However, the massive datasets required for effective LLMs make it nearly
impossible to avoid copyrighted material entirely. Businesses need to understand these risks and
factor them into their development strategies.

9 Fair use is a U.S. legal doctrine. Other regions handle copyright exceptions differently. The EU relies on “fair dealing”
and specific statutory exceptions, Japan has distinct copyright limitations, and other countries apply unique rules for
permitted uses. This variation complicates global LLM deployment, as training data allowed under U.S. fair use might
violate copyright laws elsewhere.
Beyond legal issues, training LLMs on copyrighted material raises ethical concerns. Even when legally
permissible, using copyrighted works without consent may appear exploitative, especially if the
model outputs compete with the creators’ work. Transparency about training data sources and
proactive engagement with creators can help address these concerns. Ethical practices should also
involve compensating creators whose contributions significantly improve the model, fostering a
more equitable system.
However, this strategy often limits the effectiveness of the models, as the smaller, restricted datasets
typically lead to reduced performance.
As laws around LLMs evolve, businesses must stay flexible. They may need to adjust workflows as
courts define legal boundaries or revise policies as AI-specific legislation appears. Consulting
intellectual property lawyers with AI expertise can help manage these risks.