Advanced Text Generation Techniques and Tools
In this chapter, we will explore several such methods and concepts for
improving the quality of the generated text:
Model I/O
Chains
Memory
Agents
These methods are all integrated with the LangChain framework that will
help us easily use these advanced techniques throughout this chapter.
LangChain is one of the earlier frameworks that simplify working with
LLMs through useful abstractions. Newer frameworks of note are DSPy
and Haystack. Some of these abstractions are illustrated in Figure 7-1.
Note that retrieval will be discussed in the next chapter.
Figure 7-1. LangChain is a complete framework for using LLMs. It has modular components that
can be chained together to allow for complex LLM systems.
Figure 7-2. Attempting to represent pi with float 32-bit and float 16-bit representations. Notice the
lowered accuracy when we halve the number of bits.
To illustrate quantization, consider this analogy. If asked what time it is,
you might say "14:16," which is correct but not the most precise answer
possible. You could have said "14:16 and 12 seconds" instead, which would
have been more accurate. However, mentioning seconds is seldom helpful, so
we usually round to whole minutes. Quantization is a similar process: it
reduces the precision of a value (e.g., removing the seconds) without
removing vital information (e.g., retaining the hours and minutes).
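To see what this loss of precision looks like in code, here is a quick sketch mirroring the pi example from Figure 7-2 (NumPy is assumed to be available):

import numpy as np

# Represent pi with 32-bit and 16-bit floating-point precision
pi_fp32 = np.float32(np.pi)
pi_fp16 = np.float16(np.pi)

print(f"float32: {pi_fp32:.10f}")  # 3.1415927410
print(f"float16: {pi_fp16:.10f}")  # 3.1406250000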
In Chapter 12, we will further discuss how quantization works under the
hood. You can also see a full visual guide to quantization in “A Visual
Guide to Quantization” by Maarten Grootendorst. For now, it is important
to know that we will use an 8-bit variant of Phi-3 compared to the original
16-bit variant, cutting the memory requirements almost in half.
TIP
As a rule of thumb, look for at least 4-bit quantized models. These models have a
good balance between compression and accuracy. Although it is possible to use 3-
bit or even 2-bit quantized models, the performance degradation becomes notice‐
able and it would instead be preferable to choose a smaller model with a higher
precision.
First, we will need to download the model. Note that the link contains
multiple files with different bit-variants. FP16, the model we choose, rep‐
resents the 16-bit variant:
!wget https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf/resolve/main/Phi-3-mini-4k-instruct-fp16.gguf
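With the file downloaded, we can load it through LangChain's llama-cpp-python integration. A minimal sketch, assuming llama-cpp-python is installed and the filename matches the download above (the generation parameters are illustrative):

from langchain_community.llms import LlamaCpp

# Load the downloaded Phi-3 GGUF model
llm = LlamaCpp(
    model_path="Phi-3-mini-4k-instruct-fp16.gguf",
    n_gpu_layers=-1,   # offload all layers to the GPU if one is available
    max_tokens=500,
    n_ctx=2048,        # context window size
    seed=42,
    verbose=False
)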
TIP
All examples in this chapter can be run with any LLM. This means that you can
choose whether to use Phi-3, ChatGPT, Llama 3 or anything else when going
through the examples. We will use Phi-3 as a default throughout, but the state-of-
the-art changes quickly, so consider using a newer model instead. You can use the
Open LLM Leaderboard (a ranking of open source LLMs) to choose whichever
works best for your use case.
If you do not have access to a device that can run LLMs locally, consider using
ChatGPT instead:
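A minimal sketch using LangChain's OpenAI integration (the model name and the API key placeholder are illustrative); the resulting llm variable can then be used in place of the local model throughout the examples:

from langchain_openai import ChatOpenAI

# Create a chat-based LLM through the OpenAI API instead of a local model
llm = ChatOpenAI(model="gpt-3.5-turbo", openai_api_key="MY_KEY")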
Figure 7-3. A single chain connects some modular component, like a prompt template or external
memory, to the LLM.
In practice, chains can become complex quite quickly. We can extend the
prompt template however we want and we can even combine several
separate chains together to create intricate systems. In order to thorough‐
ly understand what is happening in a chain, let’s explore how we can add
Phi-3’s prompt template to the LLM.
We start with creating our first chain, namely the prompt template that
Phi-3 expects. In the previous chapter, we explored how
transformers.pipeline applies the chat template automatically. This
is not always the case with other packages and they might need the
prompt template to be explicitly defined. With LangChain, we will use
chains to create and use a default prompt template. It also serves as a
nice hands-on experience with using prompt templates.
The idea, as illustrated in Figure 7-4, is that we chain the prompt template
together with the LLM to get the output we are looking for. Instead of
having to copy-paste the prompt template each time we use the LLM, we
would only need to define the user and system prompts.
Figure 7-4. By chaining a prompt template with an LLM, we only need to define the input prompts.
The template will be constructed for you.
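A sketch of that prompt template, assuming it is stored in a variable named prompt with a single input_prompt variable (both names reappear later in this chapter):

from langchain.prompts import PromptTemplate

# Phi-3's chat template with a single "input_prompt" variable
template = """<s><|user|>
{input_prompt}<|end|>
<|assistant|>"""
prompt = PromptTemplate(
    template=template,
    input_variables=["input_prompt"]
)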
To create our first chain, we can use both the prompt that we created and
the LLM and chain them together:
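One way to do this is LangChain's pipe operator; the basic_chain name below is illustrative:

# Chain the prompt template and the LLM together
basic_chain = prompt | llm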
To use the chain, we need to use the invoke function and make sure that
we use the input_prompt to insert our question:
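For example, asking the arithmetic question whose output is shown below:

# Invoke the chain with a value for the "input_prompt" variable
basic_chain.invoke({"input_prompt": "What is 1 + 1?"})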
The answer to 1 + 1 is 2. It's a basic arithmetic operation where you add one unit to an
The output gives us the response without any unnecessary tokens. Now
that we have created this chain, we do not have to create the prompt tem‐
plate from scratch each time we use the LLM. Note that we did not disable
sampling as before, so your output might differ. To make this pipeline
more transparent, Figure 7-6 illustrates the connection between a prompt
template and the LLM using a single chain.
Figure 7-6. An example of a single chain using Phi-3’s template.
NOTE
The example assumes that the LLM needs a specific template. This is not always
the case. With OpenAI’s GPT-3.5, its API handles the underlying template.
You could also use a prompt template to define other variables that might change
in your prompts. For example, if we want to create funny names for businesses,
retyping that question over and over for different products can be time-
consuming.
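For instance, a sketch of such a reusable template (the product variable and the example value are illustrative):

# A reusable prompt template with a "product" variable
template = """<s><|user|>
Create a funny name for a business that sells {product}.<|end|>
<|assistant|>"""
name_prompt = PromptTemplate(
    template=template,
    input_variables=["product"]
)
name_chain = name_prompt | llm
name_chain.invoke({"product": "mechanical keyboards"})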
Adding a prompt template to the chain is just the very first step you need
to enhance the capabilities of your LLM. Throughout this chapter, we will
see many ways in which we can add additional modular components to
existing chains, starting with memory.
Instead, we could break this complex prompt into smaller subtasks that
can be run sequentially. This would require multiple calls to the LLM but
with smaller prompts and intermediate outputs as shown in Figure 7-7.
Figure 7-7. With sequential chains, the output of a prompt is used as the input for the next prompt.
Suppose we want the LLM to generate a story, using only a short summary provided by the user, that consists of three components:
A title
A description of the main character
A summary of the story
Instead of generating everything in one go, we create a chain that only re‐
quires a single input by the user and then sequentially generates the
three components. This process is illustrated in Figure 7-8.
Figure 7-8. The output of the title prompt is used as the input of the character prompt. To generate
the story, the output of all previous prompts is used.
We ask the LLM to “Create a title for a story about {summary}” where
“{summary}” will be our input:
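A sketch of this first chain; the "Only return the title" instruction and the example summary are illustrative, while the LLMChain and output_key pattern matches the character and story chains shown next:

from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

# Create a chain for the title of our story
template = """<s><|user|>
Create a title for a story about {summary}. Only return the title.<|end|>
<|assistant|>"""
title_prompt = PromptTemplate(template=template, input_variables=["summary"])
title = LLMChain(llm=llm, prompt=title_prompt, output_key="title")

# Example input; the summary is illustrative
title.invoke({"summary": "a girl that lost her mother"})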
This already gives us a great title for the story! Note that we can see both
the input ( "summary" ) as well as the output ( "title" ).
Let’s generate the next component, namely the description of the charac‐
ter. We generate this component using both the summary as well as the
previously generated title. Making sure that the chain uses those compo‐
nents, we create a new prompt with the {summary} and {title} tags:
# Create a chain for the character description using the summary and title
template = """<s><|user|>
Describe the main character of a story about {summary} with the title {title}. Use only two sentences.<|end|>
<|assistant|>"""
character_prompt = PromptTemplate(
    template=template, input_variables=["summary", "title"]
)
character = LLMChain(llm=llm, prompt=character_prompt, output_key="character")
Although we could now use the character variable to generate our char‐
acter description manually, it will be used as part of the automated chain
instead.
Let’s create the final component, which uses the summary, title, and char‐
acter description to generate a short description of the story:
# Create a chain for the story using the summary, title, and character description
template = """<s><|user|>
Create a story about {summary} with the title {title}. The main character is: {character}.<|end|>
<|assistant|>"""
story_prompt = PromptTemplate(
    template=template, input_variables=["summary", "title", "character"]
)
story = LLMChain(llm=llm, prompt=story_prompt, output_key="story")
Now that we have generated all three components, we can link them to‐
gether to create our full chain:
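One way to link them is LangChain's pipe operator, which passes each chain's output dictionary on to the next chain (a sketch; the llm_chain name is illustrative):

# Combine the three chains into a single sequential chain
llm_chain = title | character | story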
We can run this newly created chain using the same example we used
before:
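Assuming the same illustrative summary as before:

# Generate the title, character description, and story in one call
llm_chain.invoke({"summary": "a girl that lost her mother"})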
Running this chain gives us all three components. This only required us
to input a single short prompt, the summary. Another advantage of divid‐
ing the problem into smaller tasks is that we now have access to these in‐
dividual components. We can easily extract the title; that might not have
been the case if we were to use a single prompt.
Memory: Helping LLMs to Remember Conversations
When we are using LLMs out of the box, they will not remember what
was being said in a conversation. You can share your name in one prompt
but it will have forgotten it by the next prompt.
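For example, reusing the basic_chain from earlier (the prompts are illustrative):

# Share our name in one prompt
basic_chain.invoke({"input_prompt": "Hi! My name is Maarten. What is 1 + 1?"})

# Ask for it again in a separate, independent prompt
basic_chain.invoke({"input_prompt": "What is my name?"})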
I'm sorry, but as a language model, I don't have the ability to know personal informatio
Unfortunately, the LLM does not know the name we gave it. The reason
for this forgetful behavior is that these models are stateless—they have
no memory of any previous conversation!
As illustrated in Figure 7-9, conversing with an LLM that does not have
any memory is not the greatest experience.
There are several ways to give LLMs memory. In this chapter, we will explore two common methods:
Conversation buffer
Conversation summary
Figure 7-9. An example of a conversation between an LLM with memory and without memory.
Conversation Buffer
One of the most intuitive forms of giving LLMs memory is simply remind‐
ing them exactly what has happened in the past. As illustrated in
Figure 7-10, we can achieve this by copying the full conversation history
and pasting that into our prompt.
Figure 7-10. We can remind an LLM of what previously happened by simply appending the entire
conversation history to the input prompt.
# Create an updated prompt template that includes the chat history
template = """<s><|user|>
{chat_history}

{input_prompt}<|end|>
<|assistant|>"""
prompt = PromptTemplate(
    template=template,
    input_variables=["input_prompt", "chat_history"]
)
We put everything together and chain the LLM, memory, and prompt
template:
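A sketch of this setup using LangChain's ConversationBufferMemory (the memory_key matches the chat_history variable in the prompt template), followed by a first question:

from langchain.memory import ConversationBufferMemory
from langchain.chains import LLMChain

# Define the type of memory we will use
memory = ConversationBufferMemory(memory_key="chat_history")

# Chain the LLM, prompt template, and memory together
llm_chain = LLMChain(
    prompt=prompt,
    llm=llm,
    memory=memory
)

# Share our name in the first interaction
llm_chain.invoke({"input_prompt": "Hi! My name is Maarten. What is 1 + 1?"})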
You can find the generated text in the 'text' key, the input prompt in
'input_prompt' , and the chat history in 'chat_history' . Note that
since this is the first time we used this specific chain, there is no chat
history.
Next, let’s follow up by asking the LLM if it remembers the name we used:
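A sketch of that follow-up:

# Ask a question that can only be answered with the chat history
llm_chain.invoke({"input_prompt": "What is my name?"})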
By extending the chain with memory, the LLM was able to use the chat
history to find the name we gave it previously. This more complex chain
is illustrated in Figure 7-11 to give an overview of this additional
functionality.
Figure 7-11. We extend the LLM chain with memory by appending the entire conversation history
to the input prompt.
One method of minimizing the context window is to use the last k conver‐
sations instead of maintaining the full chat history. In LangChain, we can
use ConversationBufferWindowMemory to decide how many conversa‐
tions are passed to the input prompt:
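A sketch with a window of two interactions (the value k=2, the age, and the example prompts are illustrative):

from langchain.memory import ConversationBufferWindowMemory
from langchain.chains import LLMChain

# Retain only the last 2 interactions in memory
memory = ConversationBufferWindowMemory(k=2, memory_key="chat_history")

# Chain the LLM, prompt template, and memory together
llm_chain = LLMChain(
    prompt=prompt,
    llm=llm,
    memory=memory
)

# Two interactions: share a name and age, then ask an unrelated question
llm_chain.invoke({"input_prompt": "Hi! My name is Maarten and I am 33 years old. What is 1 + 1?"})
llm_chain.invoke({"input_prompt": "What is 3 + 3?"})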
Next, we can check whether the model indeed knows the name we gave
it:
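A sketch of that check, followed by a question about the age, which by then has dropped out of the two-interaction window:

# The name was mentioned within the last two interactions, so it is remembered
llm_chain.invoke({"input_prompt": "What is my name?"})

# The first interaction has now dropped out of the window, so the age is lost
llm_chain.invoke({"input_prompt": "What is my age?"})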
The LLM indeed has no access to our age since that was not retained in
the chat history.
Although this method reduces the size of the chat history, it can only re‐
tain the last few conversations, which is not ideal for lengthy conversa‐
tions. Let’s explore how we can summarize the chat history instead.
Conversation Summary
Figure 7-12. Instead of passing the conversation history directly to the prompt, we use another
LLM to summarize it first.
This means that whenever we ask the LLM a question, there are two calls: one for the user prompt and one to summarize the chat history so far. The summarization relies on a prompt template of its own:
# Create a summary prompt template
summary_prompt_template = """<s><|user|>Summarize the conversations and update with the new lines.

Current summary:
{summary}

New lines of conversation:
{new_lines}

New summary:<|end|>
<|assistant|>"""
summary_prompt = PromptTemplate(
    input_variables=["new_lines", "summary"],
    template=summary_prompt_template
)
Using ConversationSummaryMemory in LangChain is similar to what
we did with the previous examples. The main difference is that we addi‐
tionally need to supply it with an LLM that performs the summarization
task. Although we use the same LLM for both summarizing and user
prompting, you could use a smaller LLM for the summarization task to
speed up computation:
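A sketch of this setup, reusing the summary prompt defined earlier:

from langchain.memory import ConversationSummaryMemory
from langchain.chains import LLMChain

# Use the LLM itself to summarize the conversation with our summary prompt
memory = ConversationSummaryMemory(
    llm=llm,
    memory_key="chat_history",
    prompt=summary_prompt
)

# Chain the LLM, prompt template, and memory together
llm_chain = LLMChain(
    prompt=prompt,
    llm=llm,
    memory=memory
)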
Having created our chain, we can test out its summarization capabilities
by creating a short conversation:
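For example (the prompts are illustrative):

# Share our name, then ask for it in a follow-up interaction
llm_chain.invoke({"input_prompt": "Hi! My name is Maarten. What is 1 + 1?"})
llm_chain.invoke({"input_prompt": "What is my name?"})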
After each step, the chain will summarize the conversation up until that
point. Note how the first conversation was summarized in
'chat_history' by creating a description of the conversation.
We can continue the conversation and at each step, the conversation will
be summarized and new information will be added as necessary:
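For example, asking about something from the start of the conversation:

# The summarized history should still allow the LLM to answer this
llm_chain.invoke({"input_prompt": "What was the first question I asked?"})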
After asking another question, the LLM updated the summary to include
the previous conversation and correctly inferred the original question.
To get the most recent summary, we can access the memory variable we
created previously:
# Check what the summary is thus far
memory.load_memory_variables({})
{'chat_history': ' Maarten, identified in this conversation, initially asked about the s
Figure 7-13. We extend the LLM chain with memory by summarizing the entire conversation his‐
tory before giving it to the input prompt.
This summarization helps keep the chat history relatively small without
using too many tokens during inference. However, since the original
question was not explicitly saved in the chat history, the model needed to
infer it based on the context. This is a disadvantage if specific information
needs to be stored in the chat history. Moreover, multiple calls to the
same LLM are needed, one for the prompt and one for the summariza‐
tion. This can slow down computing time.
Conversation Buffer
Advantages: Easiest implementation; ensures no information loss within the context window.
Disadvantages: Slower generation speed as more tokens are needed; only suitable for large-context LLMs; larger chat histories make information retrieval difficult.

Windowed Conversation Buffer
Advantages: Large-context LLMs are not needed unless the chat history is large; no information loss over the last k interactions.
Disadvantages: Only captures the last k interactions; no compression of the last k interactions.

Conversation Summary
Advantages: Captures the full history; enables long conversations; reduces the tokens needed to capture the full history.
Disadvantages: An additional call is necessary for each interaction; quality is reliant on the LLM's summarization capabilities.
Thus far, we have created systems that follow a user-defined set of steps
to take. One of the most promising concepts in LLMs is their ability to de‐
termine the actions they can take. This idea is often called agents, systems
that leverage a language model to determine which actions they should
take and in what order.
Agents can make use of everything we have seen thus far, such as model
I/O, chains, and memory, and extend it further with two vital
components:
Tools that the agent can use to do things it could not do itself
The agent type, which plans the actions to take or tools to use
Unlike the chains we have seen thus far, agents are able to show more ad‐
vanced behavior like creating and self-correcting a roadmap to achieve a
goal. They can interact with the real world through the use of tools. As a
result, these agents can perform a variety of tasks that go beyond what an
LLM is capable of in isolation.
For example, LLMs are notoriously bad at mathematical problems and of‐
ten fail at solving simple math-based tasks but they could do much more
if we provide access to a calculator. As illustrated in Figure 7-14, the un‐
derlying idea of agents is that they utilize LLMs not only to understand
our query but also to decide which tool to use and when.
Figure 7-14. Giving LLMs the ability to choose which tools they use for a particular problem results
in more complex and accurate behavior.
In this example, we would expect the LLM to use the calculator when it
faces a mathematical task. Now imagine we extend this with dozens of
other tools, like a search engine or a weather API. Suddenly, the capabili‐
ties of LLMs increase significantly.
In other words, agents that make use of LLMs can be powerful general
problem solvers. Although the tools they use are important, the driving
force of many agent-based systems is the use of a framework called
Reasoning and Acting (ReAct).1
Acting is a bit of a different story. LLMs are not able to act like you and I
do. To give them the ability to act, we could tell an LLM that it can use
certain tools, like a weather forecasting API. However, since LLMs can
only generate text, they would need to be instructed to use specific
queries to trigger the forecasting API.
ReAct merges these two concepts and allows reasoning to affect acting
and actions to affect reasoning. In practice, the framework consists of it‐
eratively following these three steps:
Thought
Action
Observation
Illustrated in Figure 7-15, the LLM is asked to create a “thought” about the
input prompt. This is similar to asking the LLM what it thinks it should do
next and why. Then, based on the thought, an “action” is triggered. The
action is generally an external tool, like a calculator or a search engine.
Finally, after the results of the “action” are returned to the LLM it “ob‐
serves” the output, which is often a summary of whatever result it
retrieved.
As illustrated in Figure 7-16, the agent will first search the web for cur‐
rent prices. It might find one or more prices depending on the search en‐
gine. After retrieving the price, it will use a calculator to convert USD to
EUR assuming we know the exchange rate.
Figure 7-15. An example of a ReAct prompt template.
During this process, the agent describes its thoughts (what it should do),
its actions (what it will do), and its observations (the results of the action).
It is a cycle of thoughts, actions, and observations that results in the
agent’s output.
ReAct in LangChain
The LLM that we used thus far is relatively small and not sufficient to run
these examples. Instead, we will be using OpenAI’s GPT-3.5 model as it
follows these complex instructions more closely:
import os
from langchain_openai import ChatOpenAI

os.environ["OPENAI_API_KEY"] = "MY_KEY"  # replace with your own API key
openai_llm = ChatOpenAI(model="gpt-3.5-turbo")
Although the LLM we used throughout this chapter is insufficient for this example, that does not mean only OpenAI's LLMs are suitable. Larger LLMs that can follow these complex instructions exist, but they require significantly more compute and VRAM. For instance, local LLMs often come in different sizes within a family of models, and increasing a model's size generally leads to better performance. To keep the necessary compute at a minimum, we chose a smaller LLM for the examples throughout this chapter.
After doing so, we will define the template for our agent. As we have
shown before, it describes the ReAct steps it needs to follow:
# Create the ReAct prompt template
react_template = """Answer the following questions as best you can. You have access to the following tools:

{tools}

Use the following format:

Question: the input question you must answer
Thought: you should always think about what to do
Action: the action to take, should be one of [{tool_names}]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can repeat N times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question

Begin!

Question: {input}
Thought:{agent_scratchpad}"""

prompt = PromptTemplate(
    template=react_template,
    input_variables=["tools", "tool_names", "input", "agent_scratchpad"]
)
This template illustrates the process of starting with a question and gen‐
erating intermediate thoughts, actions, and observations.
To have the LLM interact with the outside world, we will describe the
tools it can use:
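The code that follows expects a search_tool to already be defined. A sketch of one using LangChain's DuckDuckGo integration (the tool name and description are illustrative, and the duckduckgo-search package is assumed to be installed):

from langchain.agents import load_tools, Tool
from langchain_community.tools import DuckDuckGoSearchResults

# Wrap DuckDuckGo search as a tool the agent can call
search = DuckDuckGoSearchResults()
search_tool = Tool(
    name="duckduck",
    description="A web search engine. Use this as a search engine for general queries.",
    func=search.run,
)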
# Prepare tools
tools = load_tools(["llm-math"], llm=openai_llm)
tools.append(search_tool)
The tools include the DuckDuckGo search engine and a math tool that al‐
lows it to access a basic calculator.
To test whether the agent works, we use the previous example, namely
finding the price of a MacBook Pro:
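A sketch of constructing and running the agent with LangChain's create_react_agent and AgentExecutor (the exchange rate in the question is illustrative):

from langchain.agents import AgentExecutor, create_react_agent

# Construct the ReAct agent from the LLM, tools, and prompt template
agent = create_react_agent(openai_llm, tools, prompt)
agent_executor = AgentExecutor(
    agent=agent, tools=tools, verbose=True, handle_parsing_errors=True
)

# Ask a question that requires both a web search and a calculation
agent_executor.invoke(
    {
        "input": "What is the current price of a MacBook Pro in USD? "
                 "How much would it cost in EUR if the exchange rate is 0.85 EUR for 1 USD?"
    }
)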
These intermediate steps illustrate how the model processes the ReAct
template and what tools it accesses. This allows us to debug issues and ex‐
plore whether the agent uses the tools correctly.
{'input': 'What is the current price of a MacBook Pro in USD? How much would it cost in
'output': 'The current price of a MacBook Pro in USD is $2,249.00. It would cost approx
Considering the limited tools the agent has, this is quite impressive! Using
just a search engine and a calculator, the agent could give us an answer.
We then delved into the world of agents that leverage LLMs to determine
their actions and make decisions. We explored the ReAct framework,
which uses an intuitive prompting framework that allows agents to rea‐
son about their thoughts, take actions, and observe the results. This led us
to build an agent that is able to freely use the tools at its disposal, such as
searching the web and using a calculator, demonstrating the potential
power of agents.
1. Shunyu Yao et al., "ReAct: Synergizing reasoning and acting in language models," arXiv preprint arXiv:2210.03629 (2022).