Sinan Ozdemir - Quick Start Guide To Large Language Models, Second Edition-Addison-Wesley (2024)
Sinan Ozdemir - Quick Start Guide To Large Language Models, Second Edition-Addison-Wesley (2024)
Preface
Acknowledgments
About the Author
Preface
Acknowledgments
About the Author
Introduction
In 2017, a team at Google Brain introduced an advanced
artificial intelligence (AI) deep learning model called the
Transformer. Since then, the Transformer has become the
standard for tackling various natural language processing
(NLP) tasks in academia and industry. It is likely that you
have interacted with the Transformer model in recent years
without even realizing it, as Google uses BERT to enhance
its search engine by better understanding users’ search
queries. The GPT family of models from OpenAI have also
received attention for their ability to generate human-like
text and images.
Note
We cannot fit all of the ever-shifting code for this book
within these pages so to get the always free and update
to code, check out our github repo at
https://github.com/sinanuozdemir/quick-start-guide-to-
llms.
These Transformers now power applications such as
GitHub’s Copilot (developed by OpenAI in collaboration with
Microsoft), which can convert comments and snippets of
code into fully functioning source code that can even call
upon other large language models (LLMs) (as in Listing 1.1)
to perform NLP tasks.
def classify_text(email):
“””
Use Facebook’s BART model to classify an email
Args:
email (str): The email to classify
Returns:
str: The classification of the email
“””
# COPILOT START. EVERYTHING BEFORE THIS COMMENT
classifier = pipeline(
‘zero-shot-classification’, model=’facebook/bar
labels = [‘spam’, ‘not spam’]
hypothesis_template = ‘This email is {}.’
results = classifier(
email, labels, hypothesis_template=hypothesis_t
return results[‘labels’][0]
# COPILOT END
In Listing 1.1, I used Copilot to take in only a Python function
definition and some comments I wrote, and I wrote all of the
code to make the function do what I wrote. There’s no
cherry-picking here, just a fully working Python function that
I can call like this:
classify_text(‘hi I am spam’) # spam
Note
I will use the term understand a fair amount in this text.
In this context, I am usually referring to “natural
language understanding” (NLU)—a research branch of
NLP that focuses on developing algorithms and models
that can accurately interpret human language. As we
will see, NLU models excel at tasks such as
classification, sentiment analysis, and named entity
recognition. However, it is important to note that while
these models can perform complex language tasks, they
do not possess true understanding in the same way that
humans do.
Definition of LLMs
To back up only slightly, we should talk first about the
specific NLP task that LLMs and Transformers are being used
to solve, which provides the foundation layer for their ability
to solve a multitude of tasks. Language modeling is a
subfield of NLP that involves the creation of statistical/deep
learning models for predicting the likelihood of a sequence
of tokens in a specified vocabulary (a limited and known
set of tokens). There are generally two kinds of language
modeling tasks out there: autoencoding tasks and
autoregressive tasks (Figure 1.2).
Figure 1.2 Both the autoencoding and autoregressive
language modeling tasks involve filling in a missing
token, but only the autoencoding task allows for context
to be seen on both sides of the missing token.
Note
A token is the smallest unit of semantic meaning, which
is created by breaking down a sentence or piece of text
into smaller units; it is the basic input for an LLM. Tokens
can be words but also can be “sub-words,” as we will
see in more depth throughout this book. Some readers
may be familiar with the term “n-gram,” which refers to
a sequence of n consecutive tokens.
BERT
BERT (Figure 1.3) is an autoencoding model that uses
attention to build a bidirectional representation of a
sentence. This approach makes it ideal for sentence
classification and token classification tasks.
Figure 1.3 BERT was one of the first LLMs and
continues to be popular for many NLP tasks that involve
fast processing of large amounts of text.
T5
T5 is a pure encoder/decoder Transformer model that was
designed to perform several NLP tasks, from text
classification to text summarization and generation, right off
the shelf. It is one of the first popular models to be able to
boast of such a feat, in fact. Before T5, LLMs like BERT and
GPT-2 generally had to be fine-tuned using labeled data
before they could be relied on to perform such specific
tasks.
T5 uses both the encoder and the decoder of the
Transformer, so it is highly versatile in both processing and
generating text. T5-based models can perform a wide range
of NLP tasks, from text classification to text generation, due
to their ability to build representations of the input text
using the encoder and generate text using the decoder
(Figure 1.5). T5-derived architectures are ideal for
applications that “require both the ability to process and
understand text and the ability to generate text freely.”
Nearly all LLMs are highly versatile and are used for various
NLP tasks, such as text classification, text generation,
machine translation, and sentiment analysis, among others.
These LLMs, along with flavors (variants) of them, will be the
main focus of this book and our applications.
Table 1.1 shows the disk size, memory usage, number of
parameters – the internal numbers that make up the
matrices of the deep learning architecture itself, and
approximate size of the pre-training data for several popular
LLMs. Note that these sizes are approximate and may vary
depending on the specific implementation and hardware
used.
Note
The pre-training process for an LLM can evolve over
time as researchers find better ways of training LLMs
and phase out methods that don’t help as much. For
example, within a year of the original Google BERT
release that used the NSP pre-training task, a BERT
variant called RoBERTa (yes, most of these LLM names
will be fun) by Facebook AI was shown to not require the
NSP task to match and even beat the original BERT
model’s performance in several areas.
Transfer Learning
Transfer learning is a technique used in machine learning to
leverage the knowledge gained from one task to improve
performance on another related task. Transfer learning for
LLMs involves taking an LLM that has been pre-trained on
one corpus of text data and then fine-tuning it for a specific
“downstream” task, such as text classification or text
generation, by updating the model’s parameters with task-
specific data.
The idea behind transfer learning is that the pre-trained
model has already learned a lot of information about the
language and relationships between words, and this
information can be used as a starting point to improve
performance on a new task. Transfer learning allows LLMs to
be fine-tuned for specific tasks with much smaller amounts
of task-specific data than would be required if the model
were trained from scratch. This greatly reduces the amount
of time and resources needed to train LLMs. Figure 1.12
provides a visual representation of this relationship.
Figure 1.12 The general transfer learning loop involves
pre-training a model on a generic dataset on some
generic self-supervised task and then fine-tuning the
model on a task-specific dataset.
Fine-Tuning
Once an LLM has been pre-trained, it can be fine-tuned for
specific tasks. Fine-tuning involves training the LLM on a
smaller, task-specific dataset to adjust its parameters for
the specific task at hand. This allows the LLM to leverage its
pre-trained knowledge of the language to improve its
accuracy for the specific task. Fine-tuning has been shown
to drastically improve performance on domain-specific and
task-specific tasks and lets LLMs adapt quickly to a wide
variety of NLP applications.
Figure 1.13 shows the basic fine-tuning loop that we will use
for our models in later chapters. Whether they are open-
source or closed-source, the loop is more or less the same:
1. We define the model we want to fine-tune as well as
any fine-tuning parameters (e.g., learning rate).
2. We aggregate some training data (the format and
other characteristics depend on the model we are
updating).
3. We compute losses (a measure of error) and gradients
(information about how to change the model to
minimize error).
4. We update the model through backpropagation—a
mechanism to update model parameters to minimize
errors.
Figure 1.13 The Transformers package from Hugging
Face provides a neat and clean interface for training and
fine-tuning LLMs.
Note
You will not need a Hugging Face account or key to
follow along and use any of the code in this book, apart
from the very specific advanced exercises where I will
call it out.
Attention
The title of the original paper that introduced the
Transformer was “Attention Is All You Need.” Attention is a
mechanism used in deep learning models (not just
Transformers) that assigns different weights to different
parts of the input, allowing the model to prioritize and
emphasize the most important information while performing
tasks like translation or summarization. Essentially,
attention allows a model to “focus” on different parts of the
input dynamically, leading to improved performance and
more accurate results. Before the popularization of
attention, most neural networks processed all inputs equally
and the models relied on a fixed representation of the input
to make predictions. Modern LLMs that rely on attention can
dynamically focus on different parts of input sequences,
allowing them to weigh the importance of each part in
making predictions.
To recap, LLMs are pre-trained on large corpora and
sometimes fine-tuned on smaller datasets for specific tasks.
Recall that one of the factors behind the Transformer’s
effectiveness as a language model is that it is highly
parallelizable, allowing for faster training and efficient
processing of text. What really sets the Transformer apart
from other deep learning architectures is its ability to
capture long-range dependencies and relationships between
tokens using attention. In other words, attention is a crucial
component of Transformer-based LLMs, and it enables them
to effectively retain information between training loops and
tasks (i.e., transfer learning), while being able to process
lengthy swatches of text with ease.
Attention is considered the aspect most responsible for
helping LLMs learn (or at least recognize) internal world
models and human-identifiable rules. A Stanford University
study conducted in 2019 showed that certain attention
calculations in BERT corresponded to linguistic notions of
syntax and grammar rules. For example, the researchers
noticed that BERT was able to notice direct objects of verbs,
determiners of nouns, and objects of prepositions with
remarkably high accuracy from only its pre-training. These
relationships are presented visually in Figure 1.14.
Figure 1.14 Research has probed into LLMs and
revealed that they seem to be recognizing grammatical
rules even when they were never explicitly told these
rules.
Embeddings
Embeddings are the mathematical representations of
words, phrases, or tokens in a large-dimensional space. In
NLP, embeddings are used to represent the words, phrases,
or tokens in a way that captures their semantic meaning
and relationships with other words. Several types of
embeddings are possible, including position embeddings,
which encode the position of a token in a sentence, and
token embeddings, which encode the semantic meaning of
a token (Figure 1.16).
Tokenization
Tokenization, as mentioned previously, involves breaking
text down into the smallest unit of understanding—tokens.
These tokens are the pieces of information that are
embedded into semantic meaning and act as inputs to the
attention calculations, which leads to . . . well, the LLM
actually learning and working. Tokens make up an LLM’s
static vocabulary and don’t always represent entire words.
For example, tokens can represent punctuation, individual
characters, or even a sub-word if a word is not known to the
LLM. Nearly all LLMs also have special tokens that have
specific meaning to the model. For example, the BERT
model has the special [CLS] token, which BERT
automatically injects as the first token of every input and is
meant to represent an encoded semantic meaning for the
entire input sequence.
Readers may be familiar with techniques like stop-words
removal, stemming, and truncation that are used in
traditional NLP. These techniques are not used, nor are they
necessary, for LLMs. LLMs are designed to handle the
inherent complexity and variability of human language,
including the usage of stop words like “the” and “an,” and
variations in word forms like tenses and misspellings.
Altering the input text to an LLM using these techniques
could potentially harm the model’s performance by reducing
the contextual information and altering the original meaning
of the text.
Tokenization can also involve preprocessing steps like
casing, which refers to the capitalization of the tokens. Two
types of casing are distinguished: uncased and cased. In
uncased tokenization, all the tokens are lowercase, and
usually accents are stripped from letters. In cased
tokenization, the capitalization of the tokens is preserved.
The choice of casing can impact the model’s performance,
as capitalization can provide important information about
the meaning of a token. Figure 1.17 provides an example.
Note
Even the concept of casing carries some bias,
depending on the model. To uncase a text—that is, to
implement lowercasing and stripping of accents—is
generally a Western-style preprocessing step. I speak
Turkish, so I know that the umlaut (e.g., the “Ö” in my
last name) matters and can actually help the LLM
understand the word being said in Turkish. Any language
model that has not been sufficiently trained on diverse
corpora may have trouble parsing and utilizing these
bits of context.
Domain-Specific LLMs
Domain-specific LLMs are LLMs that are trained in a
particular subject area, such as biology or finance. Unlike
general-purpose LLMs, these models are designed to
understand the specific language and concepts used within
the domain they were trained on.
One example of a domain-specific LLM is BioGPT (Figure
1.20), a domain-specific LLM that was pre-trained on large-
scale biomedical literature. This model was developed by an
AI healthcare company, Owkin, in collaboration with
Hugging Face. The model was trained on a dataset of more
than 2 million biomedical research articles, making it highly
effective for a wide range of biomedical NLP tasks such as
named entity recognition, relationship extraction, and
question-answering. BioGPT, whose pre-training encoded
biomedical knowledge and domain-specific jargon into the
LLM, can be fine-tuned on smaller datasets, making it
adaptable for specific biomedical tasks and reducing the
need for large amounts of labeled data.
Figure 1.20 BioGPT is a domain-specific Transformer
model that was pre-trained on large-scale biomedical
literature. BioGPT’s success in the biomedical domain
has inspired other domain-specific LLMs such as SciBERT
and BlueBERT.
Text Classification
The text classification task assigns a label to a given piece
of text. This task is commonly used in sentiment analysis,
where the goal is to classify a piece of text as positive,
negative, or neutral, or in topic classification, where the goal
is to classify a piece of text into one or more predefined
categories. Models like BERT can be fine-tuned to perform
classification with relatively little labeled data, as seen in
Figure 1.21.
Figure 1.21 A peek at the architecture of using BERT to
achieve fast and accurate text classification results.
Classification layers usually act on the special [CLS]
token that BERT uses to encode the semantic meaning
of the entire input sequence.
Free-Text Generation
What first caught the world’s eye in terms of modern LLMs
like ChatGPT was their ability to freely write blogs, emails,
and even academic papers. This notion of text generation is
why many LLMs are affectionately referred to as “generative
AI,” although that term is a bit reductive and imprecise. I
will not often use the term “generative AI,” as the word
“generative” has its own meaning in machine learning as
the analogous way of learning to a “discriminative” model.
(For more on that, check out my other book, The Principles
of Data Science 3rd Edition, published by Packt Publishing.)
We could, for example, prompt (ask) ChatGPT to help plan
out a blog post, as shown in Figure 1.24. Even if you don’t
agree with the results, this can help humans with the
“tabula rasa” problem and give us something to at least edit
and start from rather than staring at a blank page for too
long.
Figure 1.24 ChatGPT can help ideate, scaffold, and
even write entire blog posts.
Note
I would be remiss if I didn’t mention the controversy
that LLMs’ free-text generation ability can cause at the
academic level. Just because an LLM can write entire
blogs or even essays, that doesn’t mean we should let
them do so. Just as the expansion of the internet caused
some to believe that we’d never need books again,
some argue that ChatGPT means that we’ll never need
to write anything again. As long as institutions are
aware of how to use this technology and proper
regulations and rules are put in place, students and
teachers alike can use ChatGPT and other text-
generation-focused AIs safely and ethically.
Information Retrieval/Neural
Semantic Search
LLMs encode information directly into their parameters via
pre-training and fine-tuning, but keeping them up to date
with new information is tricky. We either have to further fine-
tune the model on new data or run the pre-training steps
again from scratch. To dynamically keep information fresh,
we will architect our own information retrieval system with a
vector database (don’t worry—we’ll go into more details on
all of this in Chapter 2). Figure 1.25 shows an outline of the
architecture we will build.
Figure 1.25 Our neural semantic search system will be
able to take in new information dynamically and to
retrieve relevant documents quickly and accurately
given a user’s query using LLMs.
Chatbots
Everyone loves a good chatbot, right? Well, whether you
love them or hate them, LLMs’ capacity for holding a
conversation is evident through systems like ChatGPT and
even older models like gpt-3.5-turbo-instruct (as seen in
Figure 1.26). The way we architect chatbots using LLMs will
be quite different from the traditional way of designing
chatbots through intents, entities, and tree-based
conversation flows. These concepts will be replaced by
system prompts, context, and personas—all of which we will
dive into in the coming chapters.
Figure 1.26 ChatGPT isn’t the only LLM that can hold a
conversation. We can use gpt-3.5-turbo-instruct to
construct a simple conversational chatbot. The text
highlighted in green represents gpt-3.5-turbo-instruct’s
output. Note that before the chat even begins, I inject
context into the prompt that would not be shown to the
end user but that the LLM needs to provide accurate
responses.
We have our work cut out for us. I’m excited to be on this
journey with you, and I’m excited to get started!
Summary
LLMs are advanced AI models that have revolutionized the
field of NLP. LLMs are highly versatile and are used for a
variety of NLP tasks, including text classification, text
generation, and machine translation. They are pre-trained
on large corpora of text data and can then be fine-tuned for
specific tasks.
Using LLMs in this fashion has become a standard step in
the development of NLP models. In our first case study, we
will explore the process of launching an application with
both proprietary models like ChatGPT as well as open source
models. We will get a hands-on look at the practical aspects
of using LLMs for real-world NLP tasks, from model selection
and fine-tuning to deployment and maintenance.
2. Semantic Search with
LLMs
Introduction
In Chapter 1, we explored the inner workings of language
models and the impact that modern LLMs have had on NLP
tasks like text classification, generation, and machine
translation. Another powerful application of LLMs has also
been gaining traction in recent years: semantic search.
Now, you might be thinking that it’s time to finally learn the
best ways to talk to ChatGPT and GPT-4 to get the optimal
results—and we’ll start to do that in the next chapter, I
promise. In the meantime, I want to show you what else we
can build on top of this novel Transformer architecture.
While text-to-text generative models like GPT are extremely
impressive in their own right, one of the most versatile
solutions that AI companies offer is the ability to generate
text embeddings based on powerful LLMs.
Text embeddings are a way to represent words or phrases as
machine-readable numerical vectors in a multidimensional
space, generally based on their contextual meaning. The
idea is that if two phrases are similar (we will explore the
word “similar” in more detail later on in this chapter), then
the vectors that represent those phrases should be close
together by some measure (like Euclidean distance), and
vice versa. Figure 2.1 shows an example of a simple search
algorithm. When a user searches for an item to buy—say, a
Magic: The Gathering trading card—they might simply
search for “a vintage magic card.” The system should then
embed this query such that if two text embeddings are near
each other, that should indicate the phrases that were used
to generate them are similar.
The Task
A traditional search engine generally takes what you type in
and then gives you a bunch of links to websites or items
that contain those words or permutations of the characters
that you typed in. So, if you typed in “vintage magic the
gathering cards” on a marketplace, that search would return
items with a title/description containing combinations of
those words. That’s a pretty standard way to search, but it’s
not always the best way. For example I might get vintage
magic sets to help me learn how to pull a rabbit out of a hat.
Fun, but not what I asked for.
The terms you input into a search engine may not always
align with the exact words used in the items you want to
see. It could be that the words in the query are too general,
resulting in a slew of unrelated findings. This issue often
extends beyond just differing words in the results; the same
words might carry different meanings than what was
searched for. This is where semantic search comes into play,
as exemplified by the earlier-mentioned Magic: The
Gathering cards scenario.
The Components
Let’s go over each of our components in more detail to
understand the choices we’re making and which
considerations we need to take into account.
Text Embedder
At the heart of any semantic search system is the text
embedder. This component takes in a text document, or a
single word or phrase, and converts it into a vector. The
vector is unique to that text and should capture the
contextual meaning of the phrase.
The choice of the text embedder is critical, as it determines
the quality of the vector representation of the text. We have
many options for how we vectorize with LLMs, both open
and closed source. To get off of the ground more quickly, we
will use OpenAI’s closed-source “Embeddings” product for
our purposes here. In a later section, I’ll go over some open-
source options.
OpenAI’s “Embeddings” is a powerful tool that can quickly
provide high-quality vectors, but it is a closed-source
product, which means we have limited control over its
implementation and potential biases. In particular, when
using closed-source products, we may not have access to
the underlying algorithms, which can make it difficult to
troubleshoot any issues that arise.
Document Chunking
Once we have our text embedding engine set up, we need
to consider the challenge of embedding large documents. It
is often not practical to embed entire documents as a single
vector, particularly when we’re dealing with long documents
such as books or research papers. One solution to this
problem is to use document chunking, which involves
dividing a large document into smaller, more manageable
chunks for embedding.
return chunks
[(‘ ‘, 82259),
(‘\n’, 9220),
(‘ ‘, 1592),
(‘\n\n’, 333),
(‘\n ‘, 250)]
Cluster 0: 2 embeddings
Cluster 1: 3 embeddings
Cluster 2: 4 embeddings
...
API
We now need a place to put all of these components so that
users can access the documents in a fast, secure, and easy
way. To do this, let’s create an API.
FastAPI
FastAPI is a web framework for building APIs with Python
quickly. It is designed to be both fast and easy to set up,
making it an excellent choice for our semantic search API.
FastAPI uses the Pydantic data validation library to validate
request and response data; it also uses the high-
performance ASGI server, uvicorn.
Setting up a FastAPI project is straightforward and requires
minimal configuration. FastAPI provides automatic
documentation generation with the OpenAPI standard,
which makes it easy to build API documentation and client
libraries. Listing 2.7 is a skeleton of what that file would look
like.
app = FastAPI()
openai.api_key = os.environ.get(‘OPENAI_API_KEY’
pinecone_key = os.environ.get(‘PINECONE_KEY’, ‘’
def my_hash(s):
# Return the MD5 hash of the input string as a
return hashlib.md5(s.encode()).hexdigest()
class DocumentInputRequest(BaseModel):
# Define input to /document/ingest
class DocumentInputResponse(BaseModel):
# Define output from /document/ingest
class DocumentRetrieveRequest(BaseModel):
# Define input to /document/retrieve
class DocumentRetrieveResponse(BaseModel):
# Define output from /document/retrieve
if __name__ == “__main__”:
uvicorn.run(“api:app”, host=”0.0.0.0”, port=800
For the full file, be sure to check out the code repository for
this book.
Performance
I’ve outlined a solution to the problem of semantic search,
but I also want to talk about how to test how these different
components work together. For this purpose, let’s use a
well-known benchmark to run the tests against: the
XTREME benchmark—a multi-task question-answering
dataset for yes/no questions containing about 12,000
English examples. This dataset contains (question, passage)
pairs that indicate, for a given question, whether that
passage would be the best passage to answer the question.
Listing 2.8 shows a code snippet of loading up the dataset
print(f”Context: {dataset['train'][0]['context']
print(f”Question: {dataset['train'][0]['question
print(f”Answers: {dataset['train'][0]['answers']
Table 2.2 outlines a few trials I ran and coded for this
experiment. I used combinations of embedders, re-ranking
solutions, and a bit of fine-tuning to see how well the
system performed as indicated by the top result
accuracy. For each known pair of (question, passage) in our
XTREME validation set, we test if the system’s top result is
the intended passage. If we are not using a cross-encoder,
the top result is simply the passage with the highest cosine
similarity to the query given the embedding engine. If we
are using a cross-encoder, I retrieved 50 results from the
vector database and re-ranked them using the cross-
encoder and used its final ranking as opposed to the
embedding engine’s ranking.
Note that the models I used for the cross-encoder and the
bi-encoder were both specifically pre-trained on data in a
way similar to asymmetric semantic search. This is
important because we want the embedder to produce
vectors for both short queries and long documents, and to
place them near each other when they are related. I should
also note that it will not always be the case that the open
source embedder underperforms a closed source model. We
should be comparing models’ performances on a test set by
test set basis. In the first edition of this book, we used a
different benchmark (BoolQ) and in that edition, the open
source embedder performed slightly better than OpenAI!
Let’s assume we want to keep things simple to get our
project off the ground, so we’ll use only the OpenAI
embedder and do no re-ranking (row 1) in our application.
We should now consider the costs associated with using
FastAPI, Pinecone, and OpenAI for text embeddings.
FastAPI cost = $7
Summary
With all these components accounted for, our pennies
added up, and alternatives available at every step of the
way, I’ll leave you to it. Enjoy setting up your new semantic
search system, and be sure to check out the complete code
for this—including a fully working FastAPI app with
instructions on how to deploy it—on the book’s code
repository. You can experiment to your heart’s content to
make this solution work as well as possible for your domain-
specific data.
Stay tuned for our next chapter, where we will build on this
API with a chatbot based on GPT-4 and our retrieval system.
3. First Steps with Prompt
Engineering
Introduction
In Chapter 2, we built an asymmetric semantic search
system that leveraged the power of large language models
(LLMs) to quickly and efficiently find relevant documents
based on natural language queries using LLM-based
embedding engines. The system was able to understand the
meaning behind the queries and retrieve accurate results,
thanks to the pre-training of the LLMs on vast amounts of
text.
However, building an effective LLM-based application can
require more than just plugging in a pre-trained model and
retrieving results—what if we want to parse them for a
better user experience? We might also want to lean on the
learnings of massively large language models to help
complete the loop and create a useful end-to-end LLM-
based application. This is where prompt engineering comes
into the picture.
Prompt Engineering
Prompt engineering involves crafting inputs to LLMs
(prompts) that effectively communicate the task at hand to
the LLM, leading it to return accurate and useful outputs
(Figure 3.1). Prompt engineering is a skill that requires an
understanding of the nuances of language, the specific
domain being worked on, and the capabilities and
limitations of the LLM being used.
Just Ask
The first and most important rule of prompt engineering for
instruction-aligned language models is to be clear and direct
about what you are asking for. When we give an LLM a task
to complete, we want to ensure that we are communicating
that task as clearly as possible. This is especially true for
simple tasks that are straightforward for the LLM to
accomplish.
In the case of asking GPT-3 to correct the grammar of a
sentence, a direct instruction of “Correct the grammar of
this sentence” is all you need to get a clear and accurate
response. The prompt should also clearly indicate the
phrase to be corrected (Figure 3.3).
Figure 3.3 The best way to get started with an LLM
aligned to answer queries from humans is to simply ask.
Note
Many figures in this chapter are screenshots of an LLM’s
playground. Experimenting with prompt formats in the
playground or via an online interface can help identify
effective approaches, which can then be tested more
rigorously using larger data batches and the code/API
for optimal output.
Few-Shot Learning
When it comes to more complex tasks that require a deeper
understanding of a task, giving an LLM a few examples can
go a long way toward helping the LLM produce accurate and
consistent outputs. Few-shot learning is a powerful
technique that involves providing an LLM with a few
examples of a task to help it understand the context and
nuances of the problem.
Few-shot learning has been a major focus of research in the
field of LLMs. The creators of GPT-3 even recognized the
potential of this technique, which is evident from the fact
that the original GPT-3 research paper was titled “Language
Models Are Few-Shot Learners.”
Few-shot learning is particularly useful for tasks that require
a certain tone, syntax, or style, and for fields where the
language used is specific to a particular domain. Figure 3.6
shows an example of asking GPT to classify a review as
being subjective or not; basically, this is a binary
classification task. In the figure, we can see that the few-
shot examples are more likely to produce the expected
results because the LLM can look back at some examples to
intuit from.
Output Formatting
LLMs can generate text in a variety of formats—sometimes
too much variety, in fact. It can be helpful to format the
output in a specific way to make it easier to work with and
integrate into other systems. We saw this kind of formatting
at work earlier in this chapter when we asked GPT-3 to give
us an answer in a numbered list. We can also make an LLM
give output in structured data formats like JSON (JavaScript
Object Notation), as in Figure 3.8.
Figure 3.8 Simply asking GPT to give a response back
as a JSON (top) does generate a valid JSON, but the keys
are also in Turkish, which may not be what we want. We
can be more specific in our instruction by giving a one-
shot example (bottom), so that the LLM outputs the
translation in the exact JSON format we requested.
Prompting Personas
Specific word choices in our prompts can greatly influence
the output of the model. Even small changes to the prompt
can lead to vastly different results. For example, adding or
removing a single word can cause the LLM to shift its focus
or change its interpretation of the task. In some cases, this
may result in incorrect or irrelevant responses; in other
cases, it may produce the exact output desired.
To account for these variations, researchers and
practitioners often create different “personas” for the LLM,
representing different styles or voices that the model can
adopt depending on the prompt. These personas can be
based on specific topics, genres, or even fictional
characters, and are designed to elicit specific types of
responses from the LLM (Figure 3.9). By taking advantage of
personas, LLM developers can better control the output of
the model and end users of the system can get a more
unique and tailored experience.
Figure 3.9 Starting from the top left and moving down,
we see a baseline prompt of asking GPT-3 to respond as
a store attendant. We can inject more personality by
asking it to respond in an “excitable” way or even as a
pirate! We can also abuse this system by asking the LLM
to respond in a rude manner or even horribly as an anti-
Semite. Any developer who wants to use an LLM should
be aware that these kinds of outputs are possible,
whether intentional or not. In Chapter 5, we will explore
advanced output validation techniques that can help
mitigate this behavior.
Chain-of-Thought Prompting
Chain-of-thought prompting is a method that forces
LLMs to reason through a series of steps, resulting in more
structured, transparent, and precise outputs. The goal is to
break down complex tasks into smaller, interconnected
subtasks, allowing the LLM to address each subtask in a
step-by-step manner. This not only helps the model to
“focus” on specific aspects of the problem, but also
encourages it to generate intermediate outputs, making it
easier to identify and debug potential issues along the way.
Another significant advantage of chain-of-thought prompting
is the improved interpretability and transparency of the
LLM-generated response. By offering insights into the
model’s reasoning process, we, as users, can better
understand and qualify how the final output was derived,
which promotes trust in the model’s decision-making
abilities.
Summary
Prompt engineering—the process of designing and
optimizing prompts to improve the performance of language
models—can be fun, iterative, and sometimes tricky. We saw
many tips and tricks for how to get started, such as
understanding alignment, just asking, few-shot learning,
output structuring, prompting personas, and working with
prompts across models.
There is a strong correlation between proficient prompt
engineering and effective writing. A well-crafted prompt
provides the model with clear instructions, resulting in an
output that closely aligns with the desired response. When a
human can comprehend and create the expected output
from a given prompt, that outcome is indicative of a well-
structured and useful prompt for the LLM. However, if a
prompt allows for multiple responses or is in general vague,
then it is likely too ambiguous for an LLM. This parallel
between prompt engineering and writing highlights that the
art of writing effective prompts is more like crafting data
annotation guidelines or engaging in skillful writing than it is
similar to traditional engineering practices.
Prompt engineering is an important process for improving
the performance of language models. By designing and
optimizing prompts, you can ensure that your language
models will better understand and respond to user inputs. In
Chapter 5, we will revisit prompt engineering with some
more advanced topics like LLM output validation and
chaining multiple prompts together into larger workflows. In
our next chapter, we will build our own retrieval augmented
generation (RAG) chatbot using GPT-4’s prompt interface,
which is able to utilize the API we built in the Chapter 2.
4. The AI Ecosystem—
Putting the Pieces
Together
Introduction
Whether you’re a product manager, machine learning
engineer, CEO, or even just someone who has the urge to
build things, by the time you get to the part of actually
designing an AI-enabled product or feature, you run into a
question that everyone faces: How in the world do I turn
raw AI power into a usable, delightful experience?
The past few chapters have focused on individual
components of that makes most AI features great including:
An understanding of the different types of LLMs (auto-
encoding vs auto-regressive) and what kinds of tasks
they excel at.
Seeing how closed and open source LLMs can work
together in applications like semantic search.
Getting the most out of LLMs using structured prompt
engineering and how that lead to more agnostic
deployments of prompts and models.
We have even hinted at the idea of starting to put these
ideas together into comprehensive AI-enabled features and
that’s exactly what this chapter is about. To that end, We
will walk through two currently popular applications of LLMs
both because their popularity signals that many of you are
considering building something similar and because they
both offer evergreen techniques and considerations that
future AI applications will come up against.
If there was a moral to the first section of this book that I
hope you take away from reading this, it is that the best AI
applications do NOT simply rely on the raw power of
an AI model - fine-tuned or not but rather it’s the
ecosystem of AI models and tools that make the application
shine and persist for a long period of time.
We can see that even with only a three month gap, the GPT-
4 model got much worse at this task whereas the GPT-3.5
model got better! This is not a reason to boycott OpenAI or
their models by any means but it’s simply a consequence of
frequent training for the purpose of trying to force their
models to be good at as many things as possible for as
many people as possible. Inevitably there will be swings in
downstream task-specific performance that affect the
individual.
Deliberate and structured prompting with a decently sized
testing suite can be enough to get away with smaller AI
features but it is often not enough when we want to tackle
the larger, more complex applications. One of the main
reasons we see this delta in difficulty is that current LLM
architectures excel much more at reasoning through given
context than they do at recalling information and thinking
for itself.
To dig into one step deeper, figure x shows how this will
work at the prompt level, step by step:
Figure 4.6 Starting from the top left and reading left to
right, these four states represent how our bot is
architected. Everytime a user says something that
surfaces a confident document from our knowledge
base, that document is inserted directly into the system
prompt where we tell GPT-4 to only use documents from
our knowledge base.
Let’s wrap all of this logic into a python class that will have
a skeleton like in Listing 4.1,
class ChatLLM(BaseModel):
model: str = 'gpt-3.5-turbo'
temperature: float = 0.0
[START]
User Input: the input question you must answer
Context: retrieved context from the database
Context Score : a score from 0 - 1 of how strong
Assistant Thought: This context has sufficient i
Assistant Response: your final answer to the ori
don't have sufficient information to answer the
[END]
[START]
User Input: another input question you must answ
Context: more retrieved context from the databas
Context Score : another score from 0 - 1 of how
Assistant Thought: This context does not have su
question.
Assistant Response: your final answer to the sec
don't have sufficient information to answer the
[END]
Begin:
{running_convo}
"""
class RagBot(BaseModel):
llm: ChatLLM
prompt_template: str = PROMPT_TEMPLATE
stop_pattern: List[str] = [STOP]
user_inputs: List[str] = []
ai_responses: List[str] = []
contexts: List[Tuple[str, float]] = []
@property
def running_convo(self):
convo = ''
for index in range(len(self.user_inputs)
convo += f'[START]\nUser Input: {sel
convo += f'Context: {self.contexts[i
{self.contexts[index][1]}\n'
if len(self.ai_responses) > index:
convo += self.ai_responses[index
convo += '\n[END]\n'
return convo.strip()
prompt = self.prompt_template.format(
today = datetime.date.today(),
running_convo=self.running_convo
)
generated = self.llm.generate(prompt, st
self.ai_responses.append(generated)
return generated
Our bot has prefix notation, chain of thought (by asking for
the thought before the response) and an example of how a
conversation should go (1-shot example). A full
implementation of this code is in the book’s repository and
Figure 4,7 shows a sample conversation we can have with it.
Figure 4.7 Talking to our chatbot yields cohesive and
conversational answers about the Gabonese president
(note this is actually not true as of 2023 which
highlights a data staleness issue) whereas when I ask
about Barack Obama’s age (which is not in the
database) the AI politely declines to answer even
though that is general knowledge it would try to use
otherwise.
Not bad at all if I may say so. Of course these are singular
examples of our bot and we should look at some more
rigorous testing of our RAG system.
{tool_description}
Begin:
{previous_responses}
"""
class PythonREPLTool(ToolInterface):
"""A tool for running python code in a REPL.
Once again, please check out the repository for the full
commented code for these case studies. We can’t fit all of it
in this book and frankly most people do not like reading
code on paper. I get it. Figure 4.13 visualizes this toolbox full
of actual usable tools.
Figure 4.13 Our agent chooses which tool to use at
every turn before responding to the user
Evaluating an AI Agent
Similar to evaluating our RAG system, evaluating our agent
boils down to evaluating its ability to pick the right tool and
create a decent response. Because our prompt involves
more chain of thought we could even begin to diagnose
each individual thought process like in Figure 4.14.
Figure 4.14 Evaluation of an AI Agent can be as
granular as dissecting and correcting each chain of
thought in the series of steps.
Conclusion
As we wrap up the first part of this book I want to do a quick
debrief on what we have covered so far because in this
chapter, we will begin to transition from the basics of using
Large Language Models to actual applications,
considerations, nuances, and challenges of deploying these
models as prototypes, MVPs, and at scale.
The exploration of RAG systems and AI Agents underscores
a pivotal theme: the importance of context, adaptability,
and a deep understanding of the tools at our disposal.
Whether it's leveraging a database for grounding responses
or orchestrating a symphony of digital tools to address user
queries, the success of these applications hinges on a
nuanced balance between the generative capabilities of
LLMs and the specificity and reliability of external data
sources and tools.
As we stand on this juncture, looking ahead to the next
frontier of AI application, it's crucial to recognize that the
journey is ongoing. The landscape of AI is perpetually
evolving, with new challenges and opportunities emerging
at the crossroads of technology and human needs. The
insights garnered from the development and evaluation of
RAG systems and AI Agents are not merely endpoints but
stepping stones toward more sophisticated, empathetic, and
effective AI applications.
In the chapters to come, we will delve deeper into the
ethical considerations, the technical hurdles, and the
uncharted territories of AI application. The goal is not just to
build AI systems that work but to create experiences that
enhance human capabilities, foster understanding, and,
ultimately, enrich lives.
The AI Ecosystem is vast and varied, filled with potential
and pitfalls. Yet, with a thoughtful approach and a clear
vision, the pieces come together to form solutions that are
not just technically proficient but also meaningful and
impactful. This is the essence of AI application - a journey of
discovery, creativity, and continuous improvement.
Part II
Getting the Most Out of
LLMs
5. Optimizing LLMs with
Customized Fine-Tuning
[This content is currently
in development.]
Introduction
The past few chapters have dealt mostly with teaching AI
models to solve tasks on our behalf through fine-tuning with
labeled data and some more advanced prompting
techniques like grabbing dynamic few-shot examples with
semantic search, and as we wrap up the second part of this
book, it’s time we stepped back and took a look at a modern
AI paradigm that’s actually not so much of a modern idea,
alignment.
Alignment doesn’t have a strict technical definition, nor is
it an algorithm that we can simply implement. In broad
terms, alignment refers to any process whose goal is to
instill/encode behavior of an AI that is in line with the
human user’s expectations. Wow, that’s broad right? It’s
supposed to be. Some definitions will use words like “value”,
“helpfulness”, “harmlessness” and frankly these can all be a
big part of alignment but as we will see through several
examples in this chapter, that’s just scratching the surface
of alignment. Should AI’s have the general sense of being
helpful? sure of course, but the nature of humanity is such
that what might be helpful to one person may be harmful to
another so it isn’t enough to simply say an AI “must be as
helpful and harmless as possible” because that strips away
the question of, “to whom and to what end?”
Instructional Alignment
Probably the most common form of alignment at the time of
writing is, at its core, about ensuring that an AI's responses
and actions are not just accurate but also relevant and
conversational to the queries posed by users. While
instructional alignment begins with the basic ability to recall
facts learned during its pre-training phase, it is also about
interpreting the intent behind a question and providing
answers that satisfy the underlying curiosity or need. It's the
difference between a cold, factual response and one that
anticipates follow-up questions, addresses implicit concerns,
and even offers related insights. This form of alignment
ensures that AI not only understands our questions but also
our reasons for asking them.
Figure 8.1 shows the difference before and after
instructional alignment for LLama-2-7b when asking it a
very basic factual question.
Figure 8.1 Before and after instructional alignment of
llama 2 (the non-chat version versus chat version)
Behavior Alignment
Moving away from the more “obvious” forms of alignment,
we begin with the idea of behavioral alignment. The line
between helpfulness and harmlessness is often blurred in
the AI world. While an AI might be programmed to provide
the most efficient solution to a problem, efficiency does not
always equate to ethical or harmless outcomes. Behavior
alignment pushes us to consider the broader implications of
AI's actions. For instance, an AI designed to optimize energy
use in a building might find the most efficient solution
involves shutting down essential services, which could
endanger lives. Here, alignment means finding a balance—
ensuring AI actions contribute positively without causing
harm, even in pursuit of efficiency or other goals.
Figure 8.2 (content warning for text about harm) is the
result of me asking two currently available models on
OpenAI (as of April 2024) to do something heinous. One of
the models was happy to comply, even if it came with a
brief warning.
Figure 8.2 Asking a deprecated but still available GPT-
3.5-Instruct model and GPT-4 to do something awful
yielded in one of the models giving me a literal list of
real ideas and then after the fact, the system flagged
the content.
Style Alignment
Communication is not just about what is said but how it's
said. Style alignment focuses on the manner in which AI
communicates. For example, a company might aim for their
AI’s tone to be neutral while others might aim for a more
“funny” chatbot. This might seem superficial at first glance,
but the impact of communication style is profound. A pun-
riddled response can confuse more than clarify, and a tone
that's too casual or too formal can alienate some users and
companies striving for universal AI usage struggle with this
balance. For example, Grok (X’s AI) has two modes:
“regular” and “fun”. The fun mode often is shorter and more
casual where the regular mode is more factual and neutral
and while very early Grok responses showed much more
variety in tone, even after many updates, the differences in
length, tone, and word choice can be evident as seen in
Figure 8.3.
Figure 8.3 Grok’s two modes show a wide difference in
tone, word choice, and length
Neither answer is wrong per se, but the fun mode’s answer
can be a bit off putting and just a touch condescending if
you were expecting legitimate help. Through Style
Alignment, we can ensure that AI's mode of communication
enhances understanding and accessibility, making
technology an inclusive tool for all.
Now when a company provides two modes of the same AI,
to me that’s an invitation to check out the differences
between them. For example, Figure 8.4 shows me asking
Grok about Sam Altman, who notably has had some
legal/financial disagreements with the owner of Grok, Elon
Musk and fun mode got a bit less … fun.
Figure 8.4 Asking Grok’s “fun mode” about Sam
Altman always led to discussion on controversies
whereas regular mode did not.
Value Alignment
Perhaps the most ambitious form of alignment is value
alignment, where AI's actions and responses are not just
technically sound but also in harmony with a set of ethical
values. This goes beyond mere compliance with legal
standards or societal norms; it's about embedding a moral
compass within AI. But who’s moral compass? And where do
these morals come from? Well, simply put, they come from
data. As we will see in a later section, alignment can come
in many forms: pre-training, supervised fine-tuning (what we
have been doing for a few chapters now), and even from
more advanced topics like reinforcement learning (more on
that later). No matter where it’s coming from, values
undeniably are derived from the data we use to train AIs.
Figure 8.5 comes from a wonderful paper entitled “The
Ghost in the Machine has an American accent” where the
authors make the point that AIs who are being developed
with the express purpose of helping “the world” should
consider and exemplify multiple value systems and not just
value systems of the creators - in this case Western and in
English.
There is a term for what the authors are striving for. Value
pluralism refers to the idea that there are many different
value systems that are equally correct and fundamental and
while they can co-exist, they can also conflict with each
other. While this paper explored GPT-3’s training data we
can see the evolution of value pluralism in GPT-4 by asking
it what to think about when considering a new job
opportunity both without a system prompt (the default) and
one where I ask it to consider the question from the
perspective of Eastern philosophies (Figure 8.6).
headers = {
"Accept" : "application/json",
"Authorization": "Bearer "+ userdata.get
"Content-Type": "application/json"
}
men_jobs = []
for i in tqdm(range(1_000)):
for _ in query({
"inputs": "Men tend to enjoy and excel a
"parameters": {
"max_new_tokens": 20,
"temperature": 1.,
"do_sample": True
}
}, url=NON_INSTRUCT_API_URL)[0]['generated_t
men_jobs.append(_.strip())
Should we blame Google for this? Yes and no. I won’t blame
them for genuinely trying to remove biases from their AI
models but there is something to be said about the balance
of performance and diversity and throwing money and
compute resources at a problem isn’t always the right way
to address an issue.
So how helpful is too helpful? How instructional is too
instructional? Whose tone and value system makes it into
the model? These are all questions that speak to some core
pillars of alignment.
Data
At the foundation of the principles of alignment lies Data.
Data is the bedrock that informs how models interpret and
interact with the world. Human preference data, in
particular, serves as a critical guide. By integrating data
that reflects a broad range of human preferences and
behaviors, we can train models that are more attuned to the
nuanced expectations of users. This is not a matter of
collecting the most data, but rather the right data—data
that is representative, diverse, and sensitive to the
multitude of human experiences and perspectives.
However, sourcing such data presents its own set of
challenges. It involves not only a careful curation process to
ensure quality but also a conscious effort to avoid biases
that may already be present in the data sources.
Furthermore, it requires a deep understanding of the
context in which the data was generated to ensure that it
aligns with the intended use of the AI model. Companies like
OpenAI have delved into this with databases of
conversational exchanges aimed at mirroring a plethora of
interactions AI might encounter, thereby striving for a form
of democratic representation in the digital realm.
Training/Tuning Models
The purpose of the data we create is usually either to
evaluate a model (our next section) or, more commonly, to
train and tune LLMs to follow the examples provided. There
are two main methods to train models to follow alignment
and each come with nuances, caveats, tricks, techniques,
and another synonym for the difficult work domain-specific
ML engineers face every day:
SFT – Supervised Fine-Tuning - Letting an LLM read
and update its parameter’s weights based on annotated
examples of alignment (this is standard deep
learning/language modeling for t he most part).
RL – Reinforcement Learning - Setting up an
environment to allow an LLM to act as an agent in an
environment and receive rewards/punishments.
Let’s take a closer look at each of these techniques.
Supervised Fine-Tuning
Supervised Fine-Tuning stands as one of the cornerstone
techniques in the world of machine learning and AI
alignment. In this approach, a pre-trained language model is
further trained — or fine-tuned — using a dataset
specifically annotated for alignment. This dataset consists of
examples that embody the desired behaviors, values, or
responses that align with human expectations and ethical
considerations. Each example in this dataset is paired with
annotations that might include correct responses,
preference rankings, or indications of ethical
appropriateness.
The process of SFT involves adjusting the model's internal
parameters so that its outputs more closely match these
annotated examples. This requires a delicate balance; the
model must learn from the new examples without losing the
general capabilities it acquired during its initial pre-training
phase. The objective is to enhance the model's ability to
generate responses that are not only contextually relevant
and accurate but also ethically aligned and sensitive to the
nuances of human values.
One of the key challenges in SFT is ensuring that the fine-
tuning dataset is diverse and representative enough to
cover a broad spectrum of scenarios, including edge cases
and nuanced ethical dilemmas. This diversity is crucial for
preventing the model from developing biases or blind spots
that could lead to misalignment in real-world interactions.
Reinforcement Learning
Reinforcement Learning represents a more dynamic and
interactive approach to aligning AI models with human
values and expectations. Unlike the static nature of SFT, RL
involves creating an environment where the model, acting
as an agent, learns from the consequences of its actions.
The model receives feedback in the form of rewards or
punishments based on the appropriateness or alignment of
its responses. This feedback loop enables the model to
iteratively adjust its behavior towards more desirable
outcomes.
Prompt Engineering
Arguably the easiest and least effective way to instill some
kind of alignment is in prompting itself. As mentioned
previously, LLMs are much better at reasoning using given
context than they are at thinking for themselves. To that
end, if we include rubrics and examples and allow the LLMs
to think through responses before giving a final output, we
can inject alignment principles through proper structured
prompting and in-context learning.
Examples of alignment prompting would include:
Writing out in the prompt “do not answer anything that
isn’t in this topic”, etc
Including a set of principles to follow with every use of
the AI
Clearly outlining acceptable sources of information and
referencing guidelines to ensure the AI uses reliable
data in its reasoning process.
Including examples of edge cases to show the AI how to
handle conversations that go off the rails
This course adds to our costs by injecting alignment in every
prompt but it also forces us, the users of AI, to think through
the possible alignment vectors and fathom the universe of
malicious intent.
No matter how you decide to train or tune a model to be
more aligned with your expectations, the only true way to
know if it’s working is to set up proper evaluation pipelines
and channels.
Evaluation
Evaluation acts as the arbiter of alignment success. It
involves a continuous cycle of testing, feedback, and
adjustment. LLM Evaluation takes on a quantitative
approach, measuring the AI's performance against a set of
predefined tasks or benchmarks. This is complemented by
Human Evaluation—wherein the AI's outputs are assessed
by people to gauge how well they match human
expectations in practice.
Furthermore, Interpretability and Transparency are integral
to this principle. They ensure that we can understand and
trust the decisions made by the AI. This is not simply a
technical requirement but a societal one, ensuring that as AI
becomes a more integral part of our lives, we maintain
oversight and understanding of how and why it makes its
choices.
# transform_score({'answer_1_score': 3, 'answer_
# transform_score({'answer_1_score': 10, 'answer
# transform_score({'answer_1_score': 0, 'answer_
To better visualize this, after running several thousand pairs
through the model, we ended up with Figure 8.,17, showing
on the left the simulated human score from 1-9 (using the
formula in Figure 8.16) and on the right, the AI given scores.
They are not the same. The human scores have a massive
mode at the 5 mark which makes sense considering that
most responses were given a 9 or a 10 originally so
selecting pairs at random would yield mostly similarly rated
responses. The AI-scores were much more polarizing. There
are very few 5s and mostly scores on the fringes.
Conclusion
In the coming chapters, many of our examples will come
back to the idea of alignment and will borrow from the ideas
laid out in this chapter. We will curate data, train models,
and evaluate them - sometimes manually, and sometimes
automatically. In any case, the world of alignment is not as
simple as the “best” algorithm for the job nor is it
quantifiable and objective across value systems.
Truthfully, alignment is a discussion as much as it is a
philosophical quandary as much as it is a technical
challenge and I encourage anyone reading this to treat
alignment with the utmost respect.
Part III
Advanced LLM Usage
9. Moving Beyond
Foundation Models [This
content is currently in
development.]
Introduction
Admittedly we’ve spent a vast majority of this text building,
thinking, iterating, and not as much time establishing
rigorous and structured tests against our LLM systems. That
being said, we heave seen evaluation at play throughout
this entire book in bits and pieces. We evaluated our fine-
tuned recommendation engine by judging the
recommendations it gave out, we tested our classifiers
against metrics like accuracy and precision, we validated
our chat-aligned SAWYER and T5 models against our reward
mechanisms and even on some benchmarks.
This chapter will serve to aggregate all of these evaluation
techniques while adding on to the list because at the end of
the day, no matter how well we think our AI applications are
working, nothing can compare against good old fashioned
testing. Evaluating LLMs and AI applications is, in general, a
nebulous task that demands attention and proper context.
There is no one way to evaluate a model or a system but we
can work to bucket the types of tasks we build such that
each category of tasks has specific goals. If we can bucket
our tasks this way, we can begin to consider different
methods of evaluation for each category, providing a
scaffold of LLM testing we can re-use and iterate on.
Figure 12.1 walks through the main two task categories in
this chapter, with each of them having two sub-categories:
Generative Tasks - Relying on an LLM’s language
modeling head to generate free tokens in response to a
question.
Multiple Choice - Reasoning through a question and a
set of predefined choices to pick 1 or more correct
answers.
Free Text Response - Allowing the model to generate
free text responses to a query without being bounded
by a predefined set of options.
Understanding Tasks - Tasks which force a model to
exploit patterns in input data, generally for some
predictive or encoding task.
Embedding Tasks - Any task where an LLM encodes
data to vectors for clustering, recommendations, etc.
Classification - Fine-tuning a model specifically to
classify between predefined classes. This fine-tuning
can be done at the language modeling level or through
classical feed forward classification layers.
Figure 12.1 A high level and non-comprehensive view
of the four most common tasks we have to evaluate
with LLMs
Both Figures 12.3 and 12.4 show the exact same prompt,
LLM, token distributions but depending on which way you
choose to evaluate the answer, one ends up correct, and the
other incorrect. The code in Listing 12.1 has a python
function that will take in a prompt, ground truth letter
answer, and the number of options and return a suite of
data:
'model': The version of the model used.
'answer': The correct answer.
'top_tokens': The top token predictions and their
probabilities.
'token_probs': The probabilities of the tokens
representing the answer options.
'token_prob_correct': Boolean indicating if the top
probability token matches the correct answer.
'generated_output': The direct output text generated
by the model.
'generated_output_correct': Boolean indicating if the
generated output matches the correct answer.
top_tokens = sorted(zip(mistral_vocabulary,
reverse=True)[:20]
generated_output = mistral_tokenizer.decode(
skip_special_tokens=True).split('[/INST]')[-1]
generated_output_correct = generated_output.
answer.lower().strip()
Benchmarking
At its simplest, a benchmark is a standardized test that
assesses the capabilities of LLMs on some generally agreed
upon task. A benchmark dataset is itself simply a collection
of examples paired with an acceptable answer. When a
model is applied to a benchmark, they are given a score and
often placed on some leaderboard, gamifying the entire
experience. Figure 12.6 shows a very popular leaderboard -
the Open LLM Leaderboard - for open source models created
and maintained by HuggingFace.
Figure 12.6 The Open LLM Leaderboard is a popular
and standardized gamified leaderboard of open source
LLMs
Source: Hugging Face.
https://huggingface.co/spaces/HuggingFaceH4/open_llm
_leaderboard
Drilling down into the specifics, the dataset has two main
components we will utilize:
A multiple-choice section that tests a model's ability
to “identify true statements”. Given a question and
choices, the model must select the only correct answer.
A free response section where a model must
generate a 1-2 sentence answer to a question with the
overall goal of answering truthfully.
There are more facets to this benchmark then we will go
into so for more, I recommend checking out the paper. For
now, let’s run some models against these two main
components of our benchmark.
bi_encoder = SentenceTransformer("sentence-trans
client = OpenAI(
api_key=userdata.get('OPENAI_API_KEY')
)
ENGINE = 'text-embedding-3-large' # has size 30
Task-Specific Benchmarks
If standard benchmarks are a test of general intelligence
then a gap exists of benchmarks for specific domain
knowledge. These gaps provide an opportunity for people to
create novel reference evaluation data and can act as a
springboard for a new kind of AI race - smaller but more
dramatic within a vertical. Take the SWE-benchmark -
2,294 software engineering problems from GitHub, designed
to test LLMs on complex coding tasks that require deep
understanding and extensive code modifications across
multiple components (https://arxiv.org/abs/2310.06770).
This benchmark was made in conjunction with Princeton
University and U Chichago and it enables companies to
make bold claims like ones made by Cognition Lab’s “Devin,
the first AI software engineer” (https://www.cognition-
labs.com/introducing-devin). They use the SWE-benchmark
and the techniques in this chapter to make the claim that
they were the world’s greatest AI when it came to software
engineering (Figure 12.13). Could it tell me if I can safely eat
a watermelon? Who cares, said the hypothetical Engineering
Manager buying his entire team an annual license to boost
efficiency.
Embeddings
Embeddings are often used as a foundation for downstream
tasks. Recall our recommendation case study from a few
chapters ago where we trained our LLMs to embed animes
that were co-liked by users had a higher cosine similarity
and not only did we see an increase in embedding similarity
for co-liked animes, we also measured it’s business impact
based on the diversity of animes recommended (our fine-
tuned embedder recommended a larger number of animes
to users overall) and higher NPS (recommendations from our
fine-tuned embedder scored a higher NPS on the validation
data - see Figure 12.14)
Figure 12.14 We evaluated our fine-tuned embedders
in a previous chapter by scoring the recommendations
they gave out on our testing set. We are using the
performance of the downstream task to evaluate the
upstream LLM process.
Calibrated Classification
A tale as old as time: given this input data, categorize into
one or more of the following predefined categories.
Welcome to the world of text classification. Is this email
spam or not? What intent label should we give this customer
support interaction? Is this tweet political in nature or not?
The innate human desire to classify and categorize bleeds
into the artificial world through classification.
To separate this category from generative multiple choice
(which isa form of classification where the options are
simply our labels), This category will encompass only LLMs
specifically fine-tuned to output fine-tuned probabilities on
labels learned from a pre-labeled dataset. This would
include BOTH fine-tuning a specific classifying layer on top
of an LLM (either auto-regressive or auto-encoding) AND
fine-tuning a generative LLM to generate a specific class
label - effectively fine-tuned multiple choice)
Important metrics from multiple choice still hold true here
like accuracy, precision, and recall. The difference here is
that fine-tuned models are specifically looking for patterns
to exploit from a foundational knowledge base from its pre-
training (see the next section on probing) whereas
generative multiple choice is more of a test of the model’s
internal knowledge and its ability to transfer it to a task
definition. The same metrics can be applied to both, but
probabilities will be much more calibrated.
Model calibration measures the alignment of the
predictions of a classifier with the true label probabilities
with an aim of making sure that the predictions of a model
are reliable and accurate - for example if we asked a well
calibrated model to make some predictions and looked only
predictions of lets say 60%, we would expect that around
60% of those examples actually belonged to that label,
otherwise it would have predicted something different. To
measure this, we can use the Expected Calibration Error
- the weighted average error of the estimated probabilities.
Figure 12.16 shows an example of a calculation of ECE
against a toy 10 datapoint dataset.
Even though our fine-tuned GPT 3.5 model has the best
accuracy, recall that it was about 40-80x more expensive to
train and evaluate than DistilBERT and had a much lower
throughput. Whether it’s a fine-tuned DistilBERT or a fine-
tuned GPT 3.5, classifiers whose weights have been
purposefully altered to adjust to the task of classification
show a much higher degree of calibration than a non-fine
tuned model with no tuning to the task. A further case study
could explore the calibration of a non fine tuned GPT 3.5
model with few shot learning to attempt to induce some
calibration, but perhaps we will save that for a future
edition.
Probing Results
For every model we are going to probe (check the repository
for the full code) we probe the first, middle, and ending
layer to predict our four columns. Figure 12.19 shows an
example of probing Llama 13b’s middle layer. Our birth year
and death year probes perform surprisingly strongly; an
RMSE of 80 years and R2 of over .5 is not the worst linear
regressor I’ve trained, especially considering the scale of
our data.
Figure 12.19 An example of probing the middle layer
of a Llama 13b model with a constructed prompt. Our
birth (top left) and death (top right) probes perform
relatively well (R2 of above .5) while readership (bottom
left) performs less well (R2 of .32) and our gibberish
regression model performs poorly as expected (R2 of 0).
Conclusion
Choosing the right model for the task at hand is hard
enough and to get the most confidence out of our models,
proper evaluation is crucial. Figure 12.23 sums up the main
methods of evaluation among the four categories of tasks
outlined in this chapter.
Keep Going!
As you venture forth, stay curious, stay creative, and stay
kind. Remember that your work touches other people, and
make sure it reaches them with empathy and with fairness.
The landscape of LLMs is vast and uncharted, waiting for
explorers like you to illuminate the way. So, here’s to you,
the trailblazers of the next generation of language models.
Happy coding!