Advanced Text Generation Techniques and Tools
In this chapter, we will explore several such methods and concepts for
improving the quality of the generated text:
Model I/O
Chains
Memory
Agents
These methods are all integrated with the LangChain framework that will
help us easily use these advanced techniques throughout this chapter.
LangChain is one of the earlier frameworks that simplify working with
LLMs through useful abstractions. Newer frameworks of note are DSPy
and Haystack. Some of these abstractions are illustrated in Figure 7-1.
Note that retrieval will be discussed in the next chapter.
Figure 7-1. LangChain is a complete framework for using LLMs. It has modular components that
can be chained together to allow for complex LLM systems.
Figure 7-2. Attempting to represent pi with float 32-bit and float 16-bit representations. Notice the
lowered accuracy when we halve the number of bits.
To illustrate quantization, consider this analogy. If asked what time it is,
you might say "14:16," which is correct but not the most precise answer
possible. You could have said "14:16 and 12 seconds" instead, which would
have been more accurate. However, mentioning seconds is seldom helpful, so
we usually round to whole minutes. Quantization is a similar process: it
reduces the precision of a value (e.g., removing the seconds) without
removing vital information (e.g., retaining the hours and minutes).
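To see what this loss of precision looks like in code, here is a quick sketch mirroring the pi example from Figure 7-2 (NumPy is assumed to be available):

import numpy as np

# Represent pi with 32-bit and 16-bit floating-point precision
pi_fp32 = np.float32(np.pi)
pi_fp16 = np.float16(np.pi)

print(f"float32: {pi_fp32:.10f}")  # 3.1415927410
print(f"float16: {pi_fp16:.10f}")  # 3.1406250000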
In Chapter 12, we will further discuss how quantization works under the
hood. You can also see a full visual guide to quantization in “A Visual
Guide to Quantization” by Maarten Grootendorst. For now, it is important
to know that we will use an 8-bit variant of Phi-3 compared to the original
16-bit variant, cutting the memory requirements almost in half.
TIP
As a rule of thumb, look for at least 4-bit quantized models. These models have a
good balance between compression and accuracy. Although it is possible to use 3-
bit or even 2-bit quantized models, the performance degradation becomes notice‐
able and it would instead be preferable to choose a smaller model with a higher
precision.
First, we will need to download the model. Note that the link contains
multiple files with different bit-variants. FP16, the model we choose, rep‐
resents the 16-bit variant:
!wget https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf/resolve/main/Phi-3-mini-4k-instruct-fp16.gguf
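With the file downloaded, we can load it through LangChain's llama-cpp-python integration. A minimal sketch, assuming llama-cpp-python is installed and the filename matches the download above (the generation parameters are illustrative):

from langchain_community.llms import LlamaCpp

# Load the downloaded Phi-3 GGUF model
llm = LlamaCpp(
    model_path="Phi-3-mini-4k-instruct-fp16.gguf",
    n_gpu_layers=-1,   # offload all layers to the GPU if one is available
    max_tokens=500,
    n_ctx=2048,        # context window size
    seed=42,
    verbose=False
)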
TIP
All examples in this chapter can be run with any LLM. This means that you can
choose whether to use Phi-3, ChatGPT, Llama 3 or anything else when going
through the examples. We will use Phi-3 as a default throughout, but the state-of-
the-art changes quickly, so consider using a newer model instead. You can use the
Open LLM Leaderboard (a ranking of open source LLMs) to choose whichever
works best for your use case.
If you do not have access to a device that can run LLMs locally, consider using
ChatGPT instead:
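A minimal sketch using LangChain's OpenAI integration (the model name and the API key placeholder are illustrative); the resulting llm variable can then be used in place of the local model throughout the examples:

from langchain_openai import ChatOpenAI

# Create a chat-based LLM through the OpenAI API instead of a local model
llm = ChatOpenAI(model="gpt-3.5-turbo", openai_api_key="MY_KEY")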
Figure 7-3. A single chain connects some modular component, like a prompt template or external
memory, to the LLM.
In practice, chains can become complex quite quickly. We can extend the
prompt template however we want and we can even combine several
separate chains together to create intricate systems. In order to thorough‐
ly understand what is happening in a chain, let’s explore how we can add
Phi-3’s prompt template to the LLM.
We start with creating our first chain, namely the prompt template that
Phi-3 expects. In the previous chapter, we explored how
transformers.pipeline applies the chat template automatically. This
is not always the case with other packages and they might need the
prompt template to be explicitly defined. With LangChain, we will use
chains to create and use a default prompt template. It also serves as a
nice hands-on experience with using prompt templates.
The idea, as illustrated in Figure 7-4, is that we chain the prompt template
together with the LLM to get the output we are looking for. Instead of
having to copy-paste the prompt template each time we use the LLM, we
would only need to define the user and system prompts.
Figure 7-4. By chaining a prompt template with an LLM, we only need to define the input prompts.
The template will be constructed for you.
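A sketch of that prompt template, assuming it is stored in a variable named prompt with a single input_prompt variable (both names reappear later in this chapter):

from langchain.prompts import PromptTemplate

# Phi-3's chat template with a single "input_prompt" variable
template = """<s><|user|>
{input_prompt}<|end|>
<|assistant|>"""
prompt = PromptTemplate(
    template=template,
    input_variables=["input_prompt"]
)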
To create our first chain, we can use both the prompt that we created and
the LLM and chain them together:
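One way to do this is LangChain's pipe operator; the basic_chain name below is illustrative:

# Chain the prompt template and the LLM together
basic_chain = prompt | llm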
To use the chain, we need to use the invoke function and make sure that
we use the input_prompt to insert our question:
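For example, asking the arithmetic question whose output is shown below:

# Invoke the chain with a value for the "input_prompt" variable
basic_chain.invoke({"input_prompt": "What is 1 + 1?"})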
The answer to 1 + 1 is 2. It's a basic arithmetic operation where you add one unit to an
The output gives us the response without any unnecessary tokens. Now
that we have created this chain, we do not have to create the prompt tem‐
plate from scratch each time we use the LLM. Note that we did not disable
sampling as before, so your output might differ. To make this pipeline
more transparent, Figure 7-6 illustrates the connection between a prompt
template and the LLM using a single chain.
Figure 7-6. An example of a single chain using Phi-3’s template.
NOTE
The example assumes that the LLM needs a specific template. This is not always
the case. With OpenAI’s GPT-3.5, its API handles the underlying template.
You could also use a prompt template to define other variables that might change
in your prompts. For example, if we want to create funny names for businesses,
retyping that question over and over for different products can be time-
consuming.
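For instance, a sketch of such a reusable template (the product variable and the example value are illustrative):

# A reusable prompt template with a "product" variable
template = """<s><|user|>
Create a funny name for a business that sells {product}.<|end|>
<|assistant|>"""
name_prompt = PromptTemplate(
    template=template,
    input_variables=["product"]
)
name_chain = name_prompt | llm
name_chain.invoke({"product": "mechanical keyboards"})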
Adding a prompt template to the chain is just the very first step you need
to enhance the capabilities of your LLM. Throughout this chapter, we will
see many ways in which we can add additional modular components to
existing chains, starting with memory.
Instead, we could break this complex prompt into smaller subtasks that
can be run sequentially. This would require multiple calls to the LLM but
with smaller prompts and intermediate outputs as shown in Figure 7-7.
Figure 7-7. With sequential chains, the output of a prompt is used as the input for the next prompt.
Suppose we want the LLM to generate a story, using only a short summary provided by the user, that consists of three components:
A title
A description of the main character
A summary of the story
Instead of generating everything in one go, we create a chain that only re‐
quires a single input by the user and then sequentially generates the
three components. This process is illustrated in Figure 7-8.
Figure 7-8. The output of the title prompt is used as the input of the character prompt. To generate
the story, the output of all previous prompts is used.
We ask the LLM to “Create a title for a story about {summary}” where
“{summary}” will be our input:
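A sketch of this first chain; the "Only return the title" instruction and the example summary are illustrative, while the LLMChain and output_key pattern matches the character and story chains shown next:

from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

# Create a chain for the title of our story
template = """<s><|user|>
Create a title for a story about {summary}. Only return the title.<|end|>
<|assistant|>"""
title_prompt = PromptTemplate(template=template, input_variables=["summary"])
title = LLMChain(llm=llm, prompt=title_prompt, output_key="title")

# Example input; the summary is illustrative
title.invoke({"summary": "a girl that lost her mother"})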
This already gives us a great title for the story! Note that we can see both
the input ( "summary" ) as well as the output ( "title" ).
Let’s generate the next component, namely the description of the charac‐
ter. We generate this component using both the summary as well as the
previously generated title. Making sure that the chain uses those compo‐
nents, we create a new prompt with the {summary} and {title} tags:
# Create a chain for the character description using the summary and title
template = """<s><|user|>
Describe the main character of a story about {summary} with the title {title}. Use only two sentences.<|end|>
<|assistant|>"""
character_prompt = PromptTemplate(
    template=template, input_variables=["summary", "title"]
)
character = LLMChain(llm=llm, prompt=character_prompt, output_key="character")
Although we could now use the character variable to generate our char‐
acter description manually, it will be used as part of the automated chain
instead.
Let’s create the final component, which uses the summary, title, and char‐
acter description to generate a short description of the story:
# Create a chain for the story using the summary, title, and character description
template = """<s><|user|>
Create a story about {summary} with the title {title}. The main character is: {character}.<|end|>
<|assistant|>"""
story_prompt = PromptTemplate(
    template=template, input_variables=["summary", "title", "character"]
)
story = LLMChain(llm=llm, prompt=story_prompt, output_key="story")
Now that we have generated all three components, we can link them to‐
gether to create our full chain:
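One way to link them is LangChain's pipe operator, which passes each chain's output dictionary on to the next chain (a sketch; the llm_chain name is illustrative):

# Combine the three chains into a single sequential chain
llm_chain = title | character | story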
We can run this newly created chain using the same example we used
before:
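Assuming the same illustrative summary as before:

# Generate the title, character description, and story in one call
llm_chain.invoke({"summary": "a girl that lost her mother"})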
Running this chain gives us all three components. This only required us
to input a single short prompt, the summary. Another advantage of divid‐
ing the problem into smaller tasks is that we now have access to these in‐
dividual components. We can easily extract the title; that might not have
been the case if we were to use a single prompt.
Memory: Helping LLMs to Remember Conversations
When we are using LLMs out of the box, they will not remember what
was being said in a conversation. You can share your name in one prompt
but it will have forgotten it by the next prompt.
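For example, reusing the basic_chain from earlier (the prompts are illustrative):

# Share our name in one prompt
basic_chain.invoke({"input_prompt": "Hi! My name is Maarten. What is 1 + 1?"})

# Ask for it again in a separate, independent prompt
basic_chain.invoke({"input_prompt": "What is my name?"})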
I'm sorry, but as a language model, I don't have the ability to know personal informatio
Unfortunately, the LLM does not know the name we gave it. The reason
for this forgetful behavior is that these models are stateless—they have
no memory of any previous conversation!
As illustrated in Figure 7-9, conversing with an LLM that does not have
any memory is not the greatest experience.
There are several ways to give LLMs memory. In this chapter, we will explore two common methods:
Conversation buffer
Conversation summary
Figure 7-9. An example of a conversation between an LLM with memory and without memory.
Conversation Buffer
One of the most intuitive forms of giving LLMs memory is simply remind‐
ing them exactly what has happened in the past. As illustrated in
Figure 7-10, we can achieve this by copying the full conversation history
and pasting that into our prompt.
Figure 7-10. We can remind an LLM of what previously happened by simply appending the entire
conversation history to the input prompt.
# Create an updated prompt template that includes the chat history
template = """<s><|user|>
{chat_history}

{input_prompt}<|end|>
<|assistant|>"""
prompt = PromptTemplate(
    template=template,
    input_variables=["input_prompt", "chat_history"]
)
We put everything together and chain the LLM, memory, and prompt
template:
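A sketch of this setup using LangChain's ConversationBufferMemory (the memory_key matches the chat_history variable in the prompt template), followed by a first question:

from langchain.memory import ConversationBufferMemory
from langchain.chains import LLMChain

# Define the type of memory we will use
memory = ConversationBufferMemory(memory_key="chat_history")

# Chain the LLM, prompt template, and memory together
llm_chain = LLMChain(
    prompt=prompt,
    llm=llm,
    memory=memory
)

# Share our name in the first interaction
llm_chain.invoke({"input_prompt": "Hi! My name is Maarten. What is 1 + 1?"})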
You can find the generated text in the 'text' key, the input prompt in
'input_prompt' , and the chat history in 'chat_history' . Note that
since this is the first time we used this specific chain, there is no chat
history.
Next, let’s follow up by asking the LLM if it remembers the name we used:
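A sketch of that follow-up:

# Ask a question that can only be answered with the chat history
llm_chain.invoke({"input_prompt": "What is my name?"})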
By extending the chain with memory, the LLM was able to use the chat
history to find the name we gave it previously. This more complex chain
is illustrated in Figure 7-11 to give an overview of this additional
functionality.
Figure 7-11. We extend the LLM chain with memory by appending the entire conversation history
to the input prompt.
One method of minimizing the context window is to use the last k conver‐
sations instead of maintaining the full chat history. In LangChain, we can
use ConversationBufferWindowMemory to decide how many conversa‐
tions are passed to the input prompt:
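A sketch with a window of two interactions (the value k=2, the age, and the example prompts are illustrative):

from langchain.memory import ConversationBufferWindowMemory
from langchain.chains import LLMChain

# Retain only the last 2 interactions in memory
memory = ConversationBufferWindowMemory(k=2, memory_key="chat_history")

# Chain the LLM, prompt template, and memory together
llm_chain = LLMChain(
    prompt=prompt,
    llm=llm,
    memory=memory
)

# Two interactions: share a name and age, then ask an unrelated question
llm_chain.invoke({"input_prompt": "Hi! My name is Maarten and I am 33 years old. What is 1 + 1?"})
llm_chain.invoke({"input_prompt": "What is 3 + 3?"})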
Next, we can check whether the model indeed knows the name we gave
it:
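A sketch of that check, followed by a question about the age, which by then has dropped out of the two-interaction window:

# The name was mentioned within the last two interactions, so it is remembered
llm_chain.invoke({"input_prompt": "What is my name?"})

# The first interaction has now dropped out of the window, so the age is lost
llm_chain.invoke({"input_prompt": "What is my age?"})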
The LLM indeed has no access to our age since that was not retained in
the chat history.
Although this method reduces the size of the chat history, it can only re‐
tain the last few conversations, which is not ideal for lengthy conversa‐
tions. Let’s explore how we can summarize the chat history instead.
Conversation Summary
Figure 7-12. Instead of passing the conversation history directly to the prompt, we use another
LLM to summarize it first.
This means that whenever we ask the LLM a question, there are two calls: one for the user prompt and one to summarize the chat history so far. The summarization relies on a prompt template of its own:
# Create a summary prompt template
summary_prompt_template = """<s><|user|>Summarize the conversations and update with the new lines.

Current summary:
{summary}

New lines of conversation:
{new_lines}

New summary:<|end|>
<|assistant|>"""
summary_prompt = PromptTemplate(
    input_variables=["new_lines", "summary"],
    template=summary_prompt_template
)
Using ConversationSummaryMemory in LangChain is similar to what
we did with the previous examples. The main difference is that we addi‐
tionally need to supply it with an LLM that performs the summarization
task. Although we use the same LLM for both summarizing and user
prompting, you could use a smaller LLM for the summarization task to
speed up computation:
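A sketch of this setup, reusing the summary prompt defined earlier:

from langchain.memory import ConversationSummaryMemory
from langchain.chains import LLMChain

# Use the LLM itself to summarize the conversation with our summary prompt
memory = ConversationSummaryMemory(
    llm=llm,
    memory_key="chat_history",
    prompt=summary_prompt
)

# Chain the LLM, prompt template, and memory together
llm_chain = LLMChain(
    prompt=prompt,
    llm=llm,
    memory=memory
)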
Having created our chain, we can test out its summarization capabilities
by creating a short conversation:
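For example (the prompts are illustrative):

# Share our name, then ask for it in a follow-up interaction
llm_chain.invoke({"input_prompt": "Hi! My name is Maarten. What is 1 + 1?"})
llm_chain.invoke({"input_prompt": "What is my name?"})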
After each step, the chain will summarize the conversation up until that
point. Note how the first conversation was summarized in
'chat_history' by creating a description of the conversation.
We can continue the conversation and at each step, the conversation will
be summarized and new information will be added as necessary:
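For example, asking about something from the start of the conversation:

# The summarized history should still allow the LLM to answer this
llm_chain.invoke({"input_prompt": "What was the first question I asked?"})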
After asking another question, the LLM updated the summary to include
the previous conversation and correctly inferred the original question.
To get the most recent summary, we can access the memory variable we
created previously:
# Check what the summary is thus far
memory.load_memory_variables({})
{'chat_history': ' Maarten, identified in this conversation, initially asked about the s
Figure 7-13. We extend the LLM chain with memory by summarizing the entire conversation his‐
tory before giving it to the input prompt.
This summarization helps keep the chat history relatively small without
using too many tokens during inference. However, since the original
question was not explicitly saved in the chat history, the model needed to
infer it based on the context. This is a disadvantage if specific information
needs to be stored in the chat history. Moreover, multiple calls to the
same LLM are needed, one for the prompt and one for the summariza‐
tion. This can slow down computing time.
Conversation Buffer
Advantages: Easiest implementation; ensures no information loss within the context window.
Disadvantages: Slower generation speed as more tokens are needed; only suitable for large-context LLMs; larger chat histories make information retrieval difficult.

Windowed Conversation Buffer
Advantages: Large-context LLMs are not needed unless the chat history is large; no information loss over the last k interactions.
Disadvantages: Only captures the last k interactions; no compression of the last k interactions.

Conversation Summary
Advantages: Captures the full history; enables long conversations; reduces the tokens needed to capture the full history.
Disadvantages: An additional call is necessary for each interaction; quality is reliant on the LLM's summarization capabilities.
Thus far, we have created systems that follow a user-defined set of steps
to take. One of the most promising concepts in LLMs is their ability to de‐
termine the actions they can take. This idea is often called agents, systems
that leverage a language model to determine which actions they should
take and in what order.
Agents can make use of everything we have seen thus far, such as model
I/O, chains, and memory, and extend it further with two vital
components:
Tools that the agent can use to do things it could not do itself
The agent type, which plans the actions to take or tools to use
Unlike the chains we have seen thus far, agents are able to show more ad‐
vanced behavior like creating and self-correcting a roadmap to achieve a
goal. They can interact with the real world through the use of tools. As a
result, these agents can perform a variety of tasks that go beyond what an
LLM is capable of in isolation.
For example, LLMs are notoriously bad at mathematical problems and of‐
ten fail at solving simple math-based tasks but they could do much more
if we provide access to a calculator. As illustrated in Figure 7-14, the un‐
derlying idea of agents is that they utilize LLMs not only to understand
our query but also to decide which tool to use and when.
Figure 7-14. Giving LLMs the ability to choose which tools they use for a particular problem results
in more complex and accurate behavior.
In this example, we would expect the LLM to use the calculator when it
faces a mathematical task. Now imagine we extend this with dozens of
other tools, like a search engine or a weather API. Suddenly, the capabili‐
ties of LLMs increase significantly.
In other words, agents that make use of LLMs can be powerful general
problem solvers. Although the tools they use are important, the driving
force of many agent-based systems is the use of a framework called
Reasoning and Acting (ReAct).1
Acting is a bit of a different story. LLMs are not able to act like you and I
do. To give them the ability to act, we could tell an LLM that it can use
certain tools, like a weather forecasting API. However, since LLMs can
only generate text, they would need to be instructed to use specific
queries to trigger the forecasting API.
ReAct merges these two concepts and allows reasoning to affect acting
and actions to affect reasoning. In practice, the framework consists of it‐
eratively following these three steps:
Thought
Action
Observation
Illustrated in Figure 7-15, the LLM is asked to create a “thought” about the
input prompt. This is similar to asking the LLM what it thinks it should do
next and why. Then, based on the thought, an “action” is triggered. The
action is generally an external tool, like a calculator or a search engine.
Finally, after the results of the “action” are returned to the LLM it “ob‐
serves” the output, which is often a summary of whatever result it
retrieved.
As illustrated in Figure 7-16, the agent will first search the web for cur‐
rent prices. It might find one or more prices depending on the search en‐
gine. After retrieving the price, it will use a calculator to convert USD to
EUR assuming we know the exchange rate.
Figure 7-15. An example of a ReAct prompt template.
During this process, the agent describes its thoughts (what it should do),
its actions (what it will do), and its observations (the results of the action).
It is a cycle of thoughts, actions, and observations that results in the
agent’s output.
ReAct in LangChain
The LLM that we used thus far is relatively small and not sufficient to run
these examples. Instead, we will be using OpenAI’s GPT-3.5 model as it
follows these complex instructions more closely:
import os
from langchain_openai import ChatOpenAI

os.environ["OPENAI_API_KEY"] = "MY_KEY"  # replace with your own API key
openai_llm = ChatOpenAI(model="gpt-3.5-turbo")
Although the LLM we used throughout this chapter is insufficient for this example, that does not mean only OpenAI's LLMs are suitable. Larger LLMs that can follow these complex instructions exist, but they require significantly more compute and VRAM. For instance, local LLMs often come in different sizes within a family of models, and increasing a model's size generally leads to better performance. To keep the necessary compute at a minimum, we chose a smaller LLM for the examples throughout this chapter.
After doing so, we will define the template for our agent. As we have
shown before, it describes the ReAct steps it needs to follow:
# Create the ReAct prompt template
react_template = """Answer the following questions as best you can. You have access to the following tools:

{tools}

Use the following format:

Question: the input question you must answer
Thought: you should always think about what to do
Action: the action to take, should be one of [{tool_names}]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can repeat N times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question

Begin!

Question: {input}
Thought:{agent_scratchpad}"""

prompt = PromptTemplate(
    template=react_template,
    input_variables=["tools", "tool_names", "input", "agent_scratchpad"]
)
This template illustrates the process of starting with a question and gen‐
erating intermediate thoughts, actions, and observations.
To have the LLM interact with the outside world, we will describe the
tools it can use:
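The code that follows expects a search_tool to already be defined. A sketch of one using LangChain's DuckDuckGo integration (the tool name and description are illustrative, and the duckduckgo-search package is assumed to be installed):

from langchain.agents import load_tools, Tool
from langchain_community.tools import DuckDuckGoSearchResults

# Wrap DuckDuckGo search as a tool the agent can call
search = DuckDuckGoSearchResults()
search_tool = Tool(
    name="duckduck",
    description="A web search engine. Use this as a search engine for general queries.",
    func=search.run,
)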
# Prepare tools
tools = load_tools(["llm-math"], llm=openai_llm)
tools.append(search_tool)
The tools include the DuckDuckGo search engine and a math tool that al‐
lows it to access a basic calculator.
To test whether the agent works, we use the previous example, namely
finding the price of a MacBook Pro:
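A sketch of constructing and running the agent with LangChain's create_react_agent and AgentExecutor (the exchange rate in the question is illustrative):

from langchain.agents import AgentExecutor, create_react_agent

# Construct the ReAct agent from the LLM, tools, and prompt template
agent = create_react_agent(openai_llm, tools, prompt)
agent_executor = AgentExecutor(
    agent=agent, tools=tools, verbose=True, handle_parsing_errors=True
)

# Ask a question that requires both a web search and a calculation
agent_executor.invoke(
    {
        "input": "What is the current price of a MacBook Pro in USD? "
                 "How much would it cost in EUR if the exchange rate is 0.85 EUR for 1 USD?"
    }
)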
These intermediate steps illustrate how the model processes the ReAct
template and what tools it accesses. This allows us to debug issues and ex‐
plore whether the agent uses the tools correctly.
{'input': 'What is the current price of a MacBook Pro in USD? How much would it cost in
'output': 'The current price of a MacBook Pro in USD is $2,249.00. It would cost approx
Considering the limited tools the agent has, this is quite impressive! Using
just a search engine and a calculator, the agent could give us an answer.
We then delved into the world of agents that leverage LLMs to determine
their actions and make decisions. We explored the ReAct framework,
which uses an intuitive prompting framework that allows agents to rea‐
son about their thoughts, take actions, and observe the results. This led us
to build an agent that is able to freely use the tools at its disposal, such as
searching the web and using a calculator, demonstrating the potential
power of agents.
1. Shunyu Yao et al., "ReAct: Synergizing reasoning and acting in language models," arXiv preprint arXiv:2210.03629 (2022).