Building RAG Apps
If you're considering making a personalized bot for your documents or website that
responds to you, you're in the right spot. I'm here to help you create a bot using
Langchain and RAG strategies for this purpose.
Imagine asking a model something about your company policies. ChatGPT and other
large language models were never trained on your company's data, so they can't
provide factual responses in such scenarios. Instead, they may generate
nonsensical, irrelevant, or unhelpful responses. How can we ensure that an LLM
understands our specific data and generates responses accordingly? This is where
techniques like Retrieval-Augmented Generation (RAG) come to the rescue.
What is RAG?
RAG, or Retrieval-Augmented Generation, relies on three main steps to generate
a better response:
Information Retrieval: When a user asks a question, the AI system retrieves the
relevant data from a well-maintained knowledge library or external sources
like databases, articles, APIs, or document repositories. This is achieved by
converting the query into a numerical format or vector that machines can
understand.
LLM: The retrieved data is then presented to the LLM, or Large Language Model,
along with the user's query. The LLM combines this new knowledge with its
training data to generate the response.
Response: Finally, the LLM generates a more accurate and relevant response,
since it has been augmented with the retrieved information. Because we handed
the LLM additional information from our knowledge library, it can produce
contextually relevant, factual answers instead of hallucinating or giving
irrelevant ones. A short sketch of this flow follows below.
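Put together, the flow looks roughly like this pseudocode outline (the helper names here are hypothetical placeholders, not the actual code we'll write later):

def answer_with_rag(query, knowledge_base, llm):
    # `embed`, `search`, and `generate` are hypothetical placeholders
    # 1. Retrieval: embed the query and pull the most relevant chunks
    relevant_chunks = knowledge_base.search(embed(query), k=3)
    # 2. Augmentation: combine the retrieved context with the user's question
    prompt = f"Context:\n{relevant_chunks}\n\nQuestion: {query}\nAnswer:"
    # 3. Generation: the LLM answers grounded in the retrieved context
    return llm.generate(prompt)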
Let's take the example of company policies again. Suppose you have an HR bot that
handles queries related to your Company policies. Now, if someone asks anything
specific to the policies, the bot can pull the most recent policy documents from the
knowledge library, pass the relevant context to a well-crafted prompt, and then
pass the prompt further to the LLM for generating the response.
Think of it this way: suppose you ask a distant friend a few personal questions
about yourself. Chances are, your friend wouldn't be able to answer them. Most of
the time, no. But let's say this distant friend becomes closer to you over time; he
comes to your place regularly, knows your parents very well, you both hang out
pretty often, you go on outings, blah blah blah.. You get the point.
I mean, he is gaining access to personal and insider information about you. Now,
when you pose the same questions, he can answer them much more relevantly,
because he now has real insight into your personal life.
Similarly, when provided with additional information or access to your data, an LLM
won't have to guess or hallucinate. Instead, it can leverage that data to provide
more relevant and accurate answers.
4. Create a prompt template which will be fed to the LLM with the query and the
context.
5. Convert the query to its relevant embedding using the same embedding model.
6. Fetch k number of relevant documents related to the query from the vector
database.
7. Pass the relevant documents to the LLM and get the response.
FAQs
1. We will be using Langchain for this task. Basically, it's a wrapper that lets you
talk to and manage your LLM operations more easily. Note that Langchain is updating
very fast, and some functions and classes might move to different modules. So, if
something doesn't work, just check that you are importing the libraries from the
right sources!
2. We will also use Hugging Face, an open-source library for building, training, and
deploying state-of-the-art machine learning models, especially for NLP. To use
Hugging Face, we need an access token, which you can get here.
3. We'll need two critical components for our models: an LLM (Large Language
Model) and an embedding model. While paid sources like OpenAI offer these, we'll
use open-source models to ensure accessibility for everyone.
4. Now, we need a vector database to store our embeddings. We've got LanceDB for
that task – it's like a super-smart data lake for handling lots of information. It's a
top-notch vector database, making it the go-to choice for dealing with complex
data like vector embeddings... And the best part? It won't make a dent in your
pocket because it's open-source and free to use!!
5. Our data ingestion process will use a URL and some PDFs to keep things simple.
While you can incorporate additional data sources if needed, we'll concentrate
solely on these two.
With Langchain as the interface and Hugging Face for fetching the models, along
with open-source components, we're all set to go! This way, we will save some bucks
while still having everything we need. Let's move to the next steps.
Environment Setup
I am using a MacBook Air M1, and it's important to note that specific dependencies
and configurations may vary depending on your system type. Now open your
favorite editor, create a Python environment, and install the relevant dependencies.
python3 -m venv env
source env/bin/activate
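The original post doesn't list the exact packages, so take this as a rough guide based on what we use below (names are assumptions, pin versions as needed):

pip install langchain langchain-community lancedb pypdf beautifulsoup4 sentence-transformers python-dotenv prettytable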
Create a .env file in the same directory to hold your Hugging Face API credentials,
like this:

HUGGINGFACEHUB_API_TOKEN=hf_........

Alternatively, you can set the token directly in Python:

HF_TOKEN = "hf_........."
os.environ["HUGGINGFACEHUB_API_TOKEN"] = HF_TOKEN
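If you go the .env route, a small helper like python-dotenv can load the token for you (a minimal sketch, assuming python-dotenv is installed):

import os
from dotenv import load_dotenv

load_dotenv()  # reads the .env file from the current directory
HF_TOKEN = os.environ["HUGGINGFACEHUB_API_TOKEN"]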
Finally, create a data folder in the project's root directory; it will serve as the
central repository for storing PDF documents. You can add some sample PDFs for
testing purposes; for instance, I am using the YOLO V7 and Transformers papers for
demonstration. It's important to note that this designated folder will function as
our primary source for data ingestion.
import os
from langchain_community.document_loaders import WebBaseLoader, PyPDFDirectoryLoader

HF_TOKEN = "hf_......"
os.environ["HUGGINGFACEHUB_API_TOKEN"] = HF_TOKEN

# URL loader
url_loader = WebBaseLoader("https://gameofthrones.fandom.com/wiki/Jon_Snow")
url_docs = url_loader.load()

# PDF loader (PyPDFDirectoryLoader assumed, pointed at the data folder)
documents_loader = PyPDFDirectoryLoader("data")
data_docs = documents_loader.load()

# Combine both sources for downstream chunking
docs = url_docs + data_docs
This will ingest all the data from the URL link and the PDFs.
Think of it like this: If you're tasked with digesting a 100-page book all at once and
then asked a specific question about it, it would be challenging to retrieve the
necessary information from the entire book to provide an answer. However, if
you're permitted to break the book into smaller, manageable chunks—say ten
pages each—and each chunk is labeled with an index from 0 to 9, the process
becomes much more straightforward. When the same question is posed after this
breakdown, you can easily locate the relevant chunk based on its index and then
extract the information needed to answer the question accurately.
Picture the book as your extracted information, with each 10-page segment
representing a small chunk of data and the index as the embedding. We'll apply an
embedding model to these chunks to transform the information into their
respective embeddings. As humans, we may not directly comprehend these
embeddings, but they serve as numeric representations of our chunks for the
application. Here's how you can do this in Python:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=256, chunk_overlap=20)  # splitter class assumed; sizes match the later code
chunks = text_splitter.split_documents(docs)
This approach helps prevent important information from being split across two
chunks, ensuring that each chunk carries sufficient contextual information from its
neighboring chunks for subsequent processing or analysis.
Opting for the latter approach lets us use one of Hugging Face's hosted embedding
models. With this method, we simply send our text chunks to the chosen model,
saving us from resource-intensive computations on our local machines.
Hugging Face's model hub provides numerous options for embedding models, and
you can explore the leaderboard to select the most suitable one for your
requirements. For now, we'll proceed with "sentence-transformers/all-MiniLM-L6-
v2". This model is fast and highly efficient for our task!!
from langchain_community.embeddings import HuggingFaceInferenceAPIEmbeddings

embedding_model_name = 'sentence-transformers/all-MiniLM-L6-v2'
embeddings = HuggingFaceInferenceAPIEmbeddings(api_key=HF_TOKEN, model_name=embedding_model_name)  # Inference API class assumed

query = "What is a Transformer?"  # hypothetical sample query
print(len(embeddings.embed_documents([query])[0]))
384
We have the embeddings for our chunks; now, we need a vector database to store
them.
When it comes to vector databases, there are plenty of options out there to suit
various needs. Databases like Pinecone offer adequate performance and advanced
features but come with a hefty price tag. On the other hand, open-source
alternatives like FAISS or Chroma may lack some extras but are more than sufficient
for those who don't require extensive scalability.
But wait, I am dropping a bomb here. I've recently come across LanceDB, an open-
source vector database similar to FAISS and Chroma. What makes LanceDB stand
out is not just its open-source nature but also its unparalleled scalability. In fact,
after a closer look, I realized that I hadn't done justice to highlighting LanceDB's
true value propositions earlier!!
db = lancedb.connect("lance_database")
So, the text "Hello World" is first converted to its numeric representation (a fancy
name for embeddings) and then mapped to `id` number 1, like a key-value pair.
Lastly, the mode="overwrite" parameter ensures that if the table "rag_sample"
already exists, it will be overwritten with the new data.
This happens with all the text chunks, and it's pretty straightforward.
import lancedb

db = lancedb.connect("lance_database")
table = db.create_table(
    "rag_sample",
    data=[
        {
            "id": "1",
            "text": "Hello World",
            "vector": embeddings.embed_query("Hello World"),  # the numeric representation of the text
        }
    ],  # field names ("id", "text", "vector") are an assumed schema
    mode="overwrite",
)
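The post doesn't show the step that actually indexes our chunks; a minimal sketch using LangChain's LanceDB integration could look like this (the `connection=table` argument reflects older versions of that integration, so treat it as an assumption and check the current docs):

from langchain_community.vectorstores import LanceDB

# Embed every chunk with the model from earlier and store it in the LanceDB table
vectorstore = LanceDB.from_documents(chunks, embeddings, connection=table)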
For example, if you are working with "Mistral 7B instruct" and you want the optimal
results, it's recommended to use the following chat template:
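The template itself didn't survive the formatting here; per the Mistral 7B Instruct model card, it looks like this:

<s>[INST] Instruction [/INST] Model answer</s>[INST] Follow-up instruction [/INST]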
Note that <s> and </s> are special tokens that represent the beginning of the string
(BOS) and the end of the string (EOS), while [INST] and [/INST] are regular strings.
It's just that Mistral 7B Instruct is trained so that the model looks for those unique
tokens to understand the question better. Different types of LLMs expect different
kinds of instruction prompts.
For our project, we'll use the Zephyr-7B-α model. Just to make it clear, Zephyr-7B-α
has not been aligned to human preferences with techniques like RLHF
(Reinforcement Learning with Human Feedback) or deployed with in-the-loop
filtering of responses like ChatGPT, so the model can produce problematic outputs
(especially when prompted to do so).
Instead of writing a Prompt of our own, I will use the ChatPromptTemplate class,
which creates a prompt template for the chat models. In layman's terms, instead of
writing a specified prompt, I am letting ChatPromptTemplate do it for me. Here is an
example prompt template generated from the manual messages.
chat_template = ChatPromptTemplate.from_messages(
    [("human", "{user_input}")]
)
If you don't want to write the manual messages, you can use the from_template
function to generate the more generic prompt template I used for this project.
Here it is:

# the template simply pairs the retrieved {context} with the user's {query}; wording is illustrative
template = """Answer the query using the context below.
{context}
Query: {query}
"""
prompt = ChatPromptTemplate.from_template(template)
Our prompt is set! We've crafted a single message, assuming it's from a human xD.
If you're not using the from_messages function, the ChatPromptTemplate will
ensure your prompt works seamlessly with the language model by reserving some
additional system messages. There's always room for improvement with more
generic prompts to achieve better results. For now, this setup should work.
To understand it better, imagine you and your friend speak different languages, like
English and Hindi, and you need to understand each other's writings. If your friend
hands you a page in Hindi, you won't understand it directly. So, your friend will
translate it first, turning Hindi into English for you. Now, if your friend asks you a
question in Hindi, you can translate that question into English first and then look up
the answer in the translated page. The same thing happens here: the query is
converted into an embedding using the same embedding model we used for the
chunks, and the retriever looks up the closest matches in the vector database.
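The retriever snippet isn't shown in full above; here is a minimal sketch of how it could look, assuming the vector store we built earlier (the sample query is just a placeholder):

# Retriever that returns the 3 most similar chunks for a query
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

query = "Who is Jon Snow?"  # hypothetical sample query
docs = retriever.get_relevant_documents(query)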
print(docs)
When you run this code, the retriever will fetch the three most relevant documents
from the vector database. These documents will be the contexts for our LLM model
to generate the response for our query.
Now, let's dive into implementing the language model (LLM) aspect of our RAG
setup. We'll be using the Zephyr model architecture from the Hugging Face Hub.
Here's how we do it in Python:
from langchain_community.llms import HuggingFaceHub

# Model architecture: Zephyr 7B alpha served through the Hugging Face Hub
llm_repo_id = "huggingfaceh4/zephyr-7b-alpha"
model_kwargs = {"temperature": 0.5, "max_length": 512}  # max_length value assumed
model = HuggingFaceHub(repo_id=llm_repo_id, model_kwargs=model_kwargs)
In this code excerpt, we initialize our language model using the Hugging Face Hub.
Specifically, we select the Zephyr 7 billion model, which lives under the repository
ID huggingfaceh4/zephyr-7b-alpha. Choosing this model isn't arbitrary; as I said
before, it's based on its suitability for our specific task and requirements. Since we
have committed to open-source components only, Zephyr 7 billion works well
enough to generate a useful response with minimal overhead and low latency.
This model comes with some additional parameters to fine-tune its behavior. We've
set the temperature to 0.5, which controls the randomness of the generated text. A
lower temperature tends to result in more conservative and predictable outputs,
while at the maximum value of 1 the model tries to be as creative as it can, so you
can tweak this parameter based on the type of output your use case needs. For
simplicity and demonstration purposes, I set it to 0.5 to ensure we get decent
results. Next is the max_length parameter, which defines the maximum length of
the generated text and includes the size of your prompt and the response.
For now, we just need a chain that can combine our retrieved contexts with the
query and pass them to the LLM.

rag_chain = (
    {"context": retriever, "query": RunnablePassthrough()}  # retrieved docs fill {context}; the raw query passes through
    | prompt
    | model
    | StrOutputParser()
)
We have our Prompt, model, context, and the query! All of them are combined into
a single chain. It's what all the chains do! Now, before running the final code, I want
to give a quick check on these two helper functions:
1. RunnablePassthrough(): passes its input through unchanged, which is how the raw query reaches the {query} slot of the prompt while the retriever fills in {context}.
2. StrOutputParser(): converts the model's output into a plain string, so the chain ends with readable text rather than a message object.
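A tiny standalone sketch of what each of them does:

from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

print(RunnablePassthrough().invoke("Who is Jon Snow?"))  # -> "Who is Jon Snow?" (input forwarded unchanged)
print(StrOutputParser().invoke("The final answer"))      # -> "The final answer" (output coerced to a plain string)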
D-Day
To ensure we get the entire idea even if the response gets cut off, I've implemented
a function called get_complete_sentence(). This function helps extract the last
complete sentence from the text. So, even if the response hits the maximum token
limit that we set and gets truncated midway, we will still get a coherent
understanding of the message.
For testing, I suggest storing some low-sized PDFs in your project's data folder. You
can choose PDFs related to various topics or domains you want the chatbot to
interact with. Additionally, providing a URL as a reference for the chatbot can be
helpful for testing. For example, you could use a Wikipedia page, a research paper,
or any other online document relevant to your testing goals. During my testing, I
used a URL containing information about Jon Snow from Game of Thrones, PDFs of
Transformers paper, and the YOLO V7 paper to evaluate the bot's performance. Let's
see how our bot performs in varied content.
import os
import time

import lancedb
from prettytable import PrettyTable
from langchain_community.document_loaders import WebBaseLoader, PyPDFDirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceInferenceAPIEmbeddings
from langchain_community.vectorstores import LanceDB
from langchain_community.llms import HuggingFaceHub
from langchain.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

HF_TOKEN = "hf*********"
os.environ["HUGGINGFACEHUB_API_TOKEN"] = HF_TOKEN

# Loading the web URL and the PDFs, then breaking down the information into chunks
start_time = time.time()

# URL loader
loader = WebBaseLoader("https://gameofthrones.fandom.com/wiki/Jon_Snow")
url_docs = loader.load()

# Document loader (PyPDFDirectoryLoader assumed, pointed at the data folder)
documents_loader = PyPDFDirectoryLoader("data")
data_docs = documents_loader.load()

docs = url_docs + data_docs

chunk_size = 256
chunk_overlap = 20
text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
chunks = text_splitter.split_documents(docs)

# Embeddings served through the Hugging Face Inference API (class assumed)
embedding_model_name = 'sentence-transformers/all-MiniLM-L6-v2'
embeddings = HuggingFaceInferenceAPIEmbeddings(api_key=HF_TOKEN, model_name=embedding_model_name)

# Vector store (table schema and `connection=` argument follow older LanceDB integrations)
vectorstore_start_time = time.time()
database_name = "LanceDB"
db = lancedb.connect("src/lance_database")
table = db.create_table(
    "rag_sample",
    data=[{"id": "1", "text": "Hello World", "vector": embeddings.embed_query("Hello World")}],
    mode="overwrite",
)
vectorstore = LanceDB.from_documents(chunks, embeddings, connection=table)
vectorstore_end_time = time.time()

# Retriever fetching the top-k chunks
search_kwargs = {"k": 3}
retriever = vectorstore.as_retriever(search_kwargs=search_kwargs)

# LLM (max_length value assumed)
llm_repo_id = "huggingfaceh4/zephyr-7b-alpha"
model_kwargs = {"temperature": 0.5, "max_length": 512}
model = HuggingFaceHub(repo_id=llm_repo_id, model_kwargs=model_kwargs)

# Prompt template (wording illustrative: it pairs the retrieved context with the query)
template = """Answer the query using the context below.
{context}
Query: {query}
"""
prompt = ChatPromptTemplate.from_template(template)

rag_chain_start_time = time.time()
rag_chain = (
    {"context": retriever, "query": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)
rag_chain_end_time = time.time()

def get_complete_sentence(response):
    last_period_index = response.rfind('.')
    if last_period_index != -1:
        return response[:last_period_index + 1]
    else:
        return response

query = "Who is Jon Snow?"  # hypothetical sample query
rag_invoke_start_time = time.time()
response = rag_chain.invoke(query)
rag_invoke_end_time = time.time()

complete_sentence_start_time = time.time()
complete_sentence = get_complete_sentence(response)
complete_sentence_end_time = time.time()

# Create a table summarising the run (field names assumed)
table = PrettyTable()
table.field_names = ["Parameter", "Value"]
table.add_row(["Temperature", model_kwargs["temperature"]])
table.add_row(["Chunk Overlap", chunk_overlap])
table.add_row(["Number of Documents", len(docs)])
table.add_row(["Total Time (s)", round(time.time() - start_time, 2)])

print("\nComplete Sentence:")
print(complete_sentence)
print("\nExecution Timings:")
print(table)
+------------------------------+----------------------------------------+
| Temperature                  | 0.5                                    |
| Chunk Overlap                | 20                                     |
| Number of Documents          | 39                                     |
+------------------------------+----------------------------------------+
So this is the response I received in under a minute, which is quite reasonable for
starters. The time it takes can vary depending on your system's configuration, but
you'll get decent results in just a few minutes. So, please be patient if it's taking a
bit longer.
Human:
Answer:
Have fun experimenting with various data sources! You can try changing the
website addresses, adding new PDF files, or changing the template a bit. LLMs are
fun; you never know what you'll get!
Google Colab
What's next?
There are plenty of things we can adjust here. We could switch to a more effective
embedding model for better indexing, try different search techniques for the
retriever, add a reranker to improve document ranking, or use a more advanced
LLM with a larger context window and faster response times. Every RAG application
is just an enhanced version along one or more of these axes; the fundamental
concept of how a RAG application works remains the same.