
Create LLM apps using RAG

If you're considering building a personalized bot that answers questions about your documents or website, you're in the right spot. I'm here to help you create one using Langchain and RAG techniques.

BLOG 18 MIN READ MAR 20, 2024

Understanding the Limitations of ChatGPT and LLMs


ChatGPT and other Large Language Models (LLMs) are extensively trained on text corpora to learn language semantics and coherence. Despite their impressive capabilities, these models have limitations that require careful consideration for particular use cases. One significant challenge is the potential for hallucinations, where the model might generate inaccurate or contextually irrelevant information.

Imagine asking the model to improve your company policies. ChatGPT and other large language models were never trained on your company's data, so they cannot provide factual responses in such scenarios. Instead, they may generate nonsensical, irrelevant, or unhelpful responses. How can we ensure that an LLM understands our specific data and generates responses accordingly? This is where techniques like Retrieval Augmented Generation (RAG) come to the rescue.


What is RAG?
RAG, or Retrieval Augmented Generation, relies on three main stages to generate a better response:
Information Retrieval: When a user asks a question, the AI system retrieves the
relevant data from a well-maintained knowledge library or external sources
like databases, articles, APIs, or document repositories. This is achieved by
converting the query into a numerical format or vector that machines can
understand.

LLM: The retrieved data is then presented to the LLM or Large Language Model,
along with the user's query. The LLM uses this new knowledge and training
data to generate the response.

Response: Finally, the LLM generates a more accurate and relevant response since it has been augmented with the retrieved information. Because we gave the LLM additional information from our knowledge library, it can provide more contextually relevant and factual answers instead of hallucinating or responding with irrelevant content.

Let's take the example of company policies again. Suppose you have an HR bot that handles queries related to your company policies. Now, if someone asks anything specific about those policies, the bot can pull the most recent policy documents from the knowledge library, pass the relevant context to a well-crafted prompt, and then pass the prompt to the LLM to generate the response.

To make it easier, imagine an LLM as your knowledgeable friend who seems to know everything, from Geography to Computer Science, from Politics to Philosophy. Now, picture yourself asking this friend a few questions:

"Who handles my laundry on weekends?"

"Who lives next door to me?"


"What brand of peanut butter do I prefer?"

Chances are, your friend wouldn't be able to answer these questions. Most of the time, no. But let's say this distant friend becomes closer to you over time; he comes to your place regularly, knows your parents very well, you both hang out pretty often, you go on outings, and so on. You get the point.

He is gaining access to personal and insider information about you. Now, when you pose the same questions, he can answer them much more relevantly because he has better insight into your life.

Similarly, when provided with additional information or access to your data, an LLM won't have to guess or hallucinate. Instead, it can leverage that data to provide more relevant and accurate answers.

Here are the steps to create any RAG application...


1. Extract the relevant information from your data sources.

2. Break the information into small chunks.

3. Store the chunks as their embeddings into a vector database.

4. Create a prompt template which will be fed to the LLM with the query and the
context.

5. Convert the query to its relevant embedding using the same embedding model.

6. Fetch k number of relevant documents related to the query from the vector
database.
7. Pass the relevant documents to the LLM and get the response.

FAQs
1. We will be using Langchain for this task. Basically, it's a wrapper that lets you talk to and manage your LLM operations more easily. Note that Langchain is updated very fast, and some functions and classes might move to different modules. So, if something doesn't work, just check that you are importing the libraries from the right sources!

2. We will also use Hugging Face, an open-source library for building, training, and deploying state-of-the-art machine learning models, especially for NLP. To use Hugging Face, we need an access token, which you can get here.

3. We'll need two critical components for our models: an LLM (Large Language Model) and an embedding model. While paid sources like OpenAI offer these, we'll use open-source models to ensure accessibility for everyone.

4. Now, we need a vector database to store our embeddings. We've got LanceDB for that task – it's like a super-smart data lake for handling lots of information. It's a top-notch vector database, making it the go-to choice for dealing with complex data like vector embeddings. And the best part? It won't put a dent in your wallet because it's open-source and free to use!

5. Our data ingestion process will use a URL and some PDFs to keep things simple.
While you can incorporate additional data sources if needed, we'll concentrate
solely on these two.

With Langchain for the interface, Hugging Face for fetching the models, and open-source components throughout, we're all set! This way, we save some bucks while still having everything we need. Let's move on to the next steps.

Environment Setup
I am using a MacBook Air M1, and it's important to note that specific dependencies
and configurations may vary depending on your system type. Now open your
favorite editor, create a Python environment, and install the relevant dependencies.

# Create a virtual environment

python3 -m venv env

# Activate the virtual environment

source env/bin/activate

# Upgrade pip in the virtual environment

pip install --upgrade pip

# Install required dependencies

pip3 install lancedb langchain langchain_community prettytable sentence-transformers huggingface_hub

# This is optional, I did it for removing a warning

pip3 uninstall urllib3

pip3 install 'urllib3<2.0'

Create a .env file in the same directory to place your Hugging Face API credentials
like this.
HUGGINGFACEHUB_API_TOKEN = hf_........

Ensure the name HUGGINGFACEHUB_API_TOKEN remains unchanged, which is crucial for authentication purposes.
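If you go the .env route, one common way to load it is with the python-dotenv package (a small sketch; note that python-dotenv is not in the pip install command above, so you'd add it yourself):

from dotenv import load_dotenv

# Reads HUGGINGFACEHUB_API_TOKEN from the .env file into the process environment
load_dotenv()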

If you prefer a straightforward approach without relying on external packages or file loading, you can directly configure the environment variable within your code like this.

HF_TOKEN = "hf_........."

os.environ["HUGGINGFACEHUB_API_TOKEN"] = HF_TOKEN

Finally, create a data folder in the project's root directory, designated as the central repository for storing PDF documents. You can add some sample PDFs for testing purposes; for instance, I am using the YOLOv7 and Transformers papers for demonstration. Note that this designated folder will function as our primary source for data ingestion.
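At this point, the project layout looks roughly like this (the PDF file names and the main.py script name are just placeholders; use whatever documents and file names you like):

project/
├── .env          # holds HUGGINGFACEHUB_API_TOKEN
├── data/
│   ├── yolov7.pdf
│   └── transformers.pdf
└── main.py       # the code we write below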

Everything is in order, and we're all set!

Step 1: Extracting the relevant information


To get your RAG application running, the first thing we need to do is extract the relevant information from the various data sources. Whether it's a website page, a PDF file, a Notion page, or a Google Doc, it needs to be extracted from its original source first.

import os

from langchain_community.document_loaders import WebBaseLoader, PyPDFLoader, DirectoryLoader

# Put the token value inside the double quotes
HF_TOKEN = "hf_......"
os.environ["HUGGINGFACEHUB_API_TOKEN"] = HF_TOKEN

# Loading the web URL and the PDF data
url_loader = WebBaseLoader("https://gameofthrones.fandom.com/wiki/Jon_Snow")
documents_loader = DirectoryLoader('data', glob="./*.pdf", loader_cls=PyPDFLoader)

# Loading the documents from both sources
url_docs = url_loader.load()
data_docs = documents_loader.load()

# Combining all the data that we ingested
docs = url_docs + data_docs

This will ingest all the data from the URL link and the PDFs.

Step 2: Breaking the information into smaller chunks


We have all the necessary data for developing our RAG application. Now, it's time to
break down this information into smaller chunks. Later, we'll utilize an embedding
model to convert these chunks into their respective embeddings. But why is this
important?

Think of it like this: if you're tasked with digesting a 100-page book all at once and then asked a specific question about it, it would be challenging to retrieve the necessary information from the entire book to provide an answer. However, if you're permitted to break the book into smaller, manageable chunks of, say, ten pages each, and each chunk is labeled with an index from 0 to 9, the process becomes much more straightforward. When the same question is posed after this breakdown, you can easily locate the relevant chunk based on its index and then extract the information needed to answer the question accurately.

Picture the book as your extracted information, each 10-page segment as a small chunk of data, and the index as the embedding. We'll apply an embedding model to these chunks to transform the information into their respective embeddings. While we, as humans, may not directly comprehend these embeddings, they serve as numeric representations of the chunks for our application. Here is how you can do this in Python:

from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size = 1000, chunk_overlap = 50)

chunks = text_splitter.split_documents(docs)

Now, the chunk_size parameter specifies the maximum number of characters a chunk can contain, while the chunk_overlap parameter specifies the number of characters that should overlap between two adjacent chunks. With chunk_overlap set to 50, the last 50 characters of each chunk are shared with the next one.

This approach helps prevent important information from being split across two chunks, ensuring that each chunk carries sufficient contextual information from its neighboring chunks for subsequent processing or analysis.

Shared information at the boundary of neighboring chunks enables a more seamless transition and understanding of the text's content. The best strategy for choosing the chunk_size and chunk_overlap parameters largely depends on the nature of your documents and the purpose of your application.
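To see these two parameters in action on something tiny, you can split a short made-up string with a small chunk_size and watch the overlap appear at the chunk boundaries. This is just an illustrative sketch; the sample text and sizes are arbitrary.

from langchain.text_splitter import RecursiveCharacterTextSplitter

sample_text = ("RAG applications retrieve relevant chunks from a knowledge library "
               "and pass them to the LLM as extra context for the answer.")

# Tiny chunks so the overlap is easy to spot in the printed output
demo_splitter = RecursiveCharacterTextSplitter(chunk_size=60, chunk_overlap=15)

for i, piece in enumerate(demo_splitter.split_text(sample_text)):
    print(i, repr(piece))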

Step 3: Create the embeddings and store them in a vector database
There are two primary methods to generate embeddings for our text chunks. The
first involves downloading a model, managing preprocessing, and conducting
computations independently. Alternatively, we can leverage Hugging Face's model
hub, which offers a variety of pre-trained models for various NLP tasks, including
embedding generation.

Opting for the latter approach allows us to utilize one of Hugging Face's embedding
models. With this method, we simply provide our text chunks to the chosen model,
saving us from the resource-intensive computations on our local machines.

Hugging Face's model hub provides numerous options for embedding models, and you can explore the leaderboard to select the most suitable one for your requirements. For now, we'll proceed with "sentence-transformers/all-MiniLM-L6-v2". This model is fast and highly efficient for our task!

from langchain_community.embeddings import HuggingFaceEmbeddings

embedding_model_name = 'sentence-transformers/all-MiniLM-L6-v2'

embeddings = HuggingFaceEmbeddings(model_name=embedding_model_name, model_kwargs={'device': 'cpu'})

Here's a way to see the length of the embedding vector for a piece of text:


query = "Hello I want to see the length of the embeddings for this document."

print(len(embeddings.embed_documents([query])[0]))

384

We have the embeddings for our chunks; now, we need a vector database to store
them.

When it comes to vector databases, there are plenty of options out there to suit various needs. Databases like Pinecone offer adequate performance and advanced features but come with a hefty price tag. On the other hand, open-source alternatives like FAISS or Chroma may lack some extras but are more than sufficient for those who don't require extensive scalability.

But wait, I am dropping a bomb here. I've recently come across LanceDB, an open-source vector database similar to FAISS and Chroma. What makes LanceDB stand out is not just its open-source nature but also its unparalleled scalability. In fact, after a closer look, I realized that I hadn't done justice to highlighting LanceDB's true value propositions earlier!

Surprisingly, LanceDB is the most scalable vector database available, outperforming even the likes of Pinecone, Chroma, Qdrant, and others. Scaling up to a billion vectors locally on your laptop is only achievable with LanceDB. This capability is a game-changer, especially when you compare it to other vector databases that struggle even with a hundred million vectors. LanceDB manages to offer this unprecedented scalability at a fraction of the cost, providing its utilities and database tools at much cheaper rates than its closest counterparts.

So now, we'll create an instance of the LanceDB vector database by calling

lancedb.connect("lance_database")

This line essentially sets up a connection to a LanceDB database named "lance_database". Next, we create a table within the database named "rag_sample" using the create_table function, and we initialize this table with a single data entry that includes a numeric vector generated by the embed_query function.

So, the text "Hello World" is first converted to its numeric representation (a fancy name for embeddings) and then mapped to `id` number 1, like a key-value pair. Lastly, the mode="overwrite" parameter ensures that if the table "rag_sample" already exists, it will be overwritten with the new data.

This happens with all the text chunks, and it's pretty straightforward.

import lancedb
from langchain_community.vectorstores import LanceDB

db = lancedb.connect("lance_database")

table = db.create_table(
    "rag_sample",
    data=[
        {
            "vector": embeddings.embed_query("Hello World"),
            "text": "Hello World",
            "id": "1",
        }
    ],
    mode="overwrite",
)

docsearch = LanceDB.from_documents(chunks, embeddings, connection=table)

NO ROCKET SCIENCE, HA!

Step 4: Create a prompt template, which will be fed to the LLM
Okay, now comes the prompt template. When you write a question to ChatGPT and it answers, you are providing a prompt to the model so that it can understand what the question is. When companies train their models, they decide what kind of prompt format should be used to invoke the model and ask questions.

For example, if you are working with "Mistral 7B instruct" and you want the optimal
results, it's recommended to use the following chat template:

<s>[INST] Instruction [/INST] Model answer</s>[INST] Follow-up instruction [/INST]

Note that <s> and </s> are special tokens for the beginning of the string (BOS) and the end of the string (EOS), while [INST] and [/INST] are regular strings. Mistral 7B Instruct is simply trained so that the model looks for those tokens to understand the question better. Different LLMs expect different kinds of instruction prompts.

Now, for our case, we will use huggingfaceh4/zephyr-7b-alpha, a text generation model. Just to make it clear, Zephyr-7B-α has not been aligned to human preferences with techniques like RLHF (Reinforcement Learning from Human Feedback) or deployed with in-the-loop filtering of responses like ChatGPT, so the model can produce problematic outputs (especially when prompted to do so).

Instead of writing a Prompt of our own, I will use the ChatPromptTemplate class,
which creates a prompt template for the chat models. In layman's terms, instead of
writing a specified prompt, I am letting ChatPromptTemplate do it for me. Here is an
example prompt template generated from the manual messages.

from langchain_core.prompts import ChatPromptTemplate

chat_template = ChatPromptTemplate.from_messages(
    [
        ("system", "You are a helpful AI bot. Your name is {name}."),
        ("human", "Hello, how are you doing?"),
        ("ai", "I'm doing well, thanks!"),
        ("human", "{user_input}"),
    ]
)

messages = chat_template.format_messages(name="Bob", user_input="What is your name?")

If you don't want to write the manual instructions, you can use the
*from_template* function to generate a more generic prompt template I used for
this project. Here it is.

from langchain_core.prompts import ChatPromptTemplate


template = """
Answer the question using the following context.

Context: {context}

Question: {query}
"""

prompt = ChatPromptTemplate.from_template(template)

Our prompt is set! We've crafted a single message, assuming it's from a human xD. Even when you don't use the from_messages function, ChatPromptTemplate ensures your prompt works seamlessly with the language model by wrapping your text in the proper message format. There's always room for improvement with more carefully designed prompts, but for now, this setup should work.
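A quick way to see exactly what this template sends to the model (and why the final output later starts with "Human:") is to render it with some placeholder values. This is just an illustrative check using the prompt object defined above; the context and question strings are made up.

# Render the prompt into the final string the LLM will receive
preview = prompt.invoke({
    "context": "Jon Snow is a character in Game of Thrones.",
    "query": "Who is Jon Snow?",
})
print(preview.to_string())
# The chat template marks the message as coming from a human, e.g. "Human: ..."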

Step 5: Convert the query to its relevant embedding using the same embedding model
Now, let's discuss the query or question we want to ask our RAG application. We
can't just pass the query to our model and expect information in return. Instead, we
need to pass the query through the same embedding model used for the chunks
earlier. Why is this important? By embedding queries, we allow models to compare
them efficiently with previously processed chunks of text. This enables tasks like
finding similar documents or generating relevant responses.

To understand it better, imagine you and your friend speak different languages, like English and Hindi, and you need to understand each other's writings. If your friend hands you a page in Hindi, you won't understand it directly, so your friend translates it first, turning Hindi into English for you. Now, if your friend asks you a question in Hindi, you can translate that question into English first and look for the relevant answers in the translated English text.

Similarly, we initially transformed our textual information into its corresponding embeddings. Now, when you pose a query, it undergoes the same conversion into numeric form using the same embedding model applied previously to our textual chunks. This consistent approach allows for efficient retrieval of relevant responses.
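"Relevant" here simply means "close in embedding space." As a small illustration, you can embed the query with the same model and compare it to one of the chunk embeddings using cosine similarity. This sketch assumes the embeddings model and chunks list from the earlier steps; numpy is already pulled in by sentence-transformers.

import numpy as np

# Embed the query with the same model used for the chunks
query_vec = np.array(embeddings.embed_query("Who killed Jon Snow?"))

# Embed the first chunk and measure how close the two vectors are
chunk_vec = np.array(embeddings.embed_documents([chunks[0].page_content])[0])
cosine = np.dot(query_vec, chunk_vec) / (np.linalg.norm(query_vec) * np.linalg.norm(chunk_vec))
print(f"Cosine similarity with chunk 0: {cosine:.3f}")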

Step 6: Fetch K number of documents.


Now, let's talk about the retriever. Its job is to dive into the vector database and
search for relevant documents. It returns a set number, let's call it "k", of these
documents, which are ranked based on their contextual relevance to the query or
question you asked. You can set "k" as a parameter, indicating how many relevant
documents you want - whether it's 2, 5, or 10. Generally, if you have a smaller
amount of data, it's best to stick with a lower "k", around 2. For longer documents
or larger datasets, a "k" between 10 and 20 is recommended.

Different search techniques can be employed to fetch relevant documents more effectively and quickly from a vector database. The choice depends on various factors, such as your specific use case, the amount of data you have, the kind of vector database you use, and the context of your problem.

retriever = docsearch.as_retriever(search_kwargs={"k": 3})

docs = retriever.get_relevant_documents("what did you know about Yolo V7?")

print(docs)

When you run this code, the retriever will fetch the three most relevant documents
from the vector database. These documents will be the contexts for our LLM model
to generate the response for our query.
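The retriever returns LangChain Document objects, so you can inspect what was actually fetched before handing it to the LLM. A small sketch using the docs variable from the snippet above:

# Each Document carries the chunk text plus metadata such as its source file or URL
for i, doc in enumerate(docs, start=1):
    source = doc.metadata.get("source", "unknown")
    print(f"--- Document {i} (source: {source}) ---")
    print(doc.page_content[:200])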

Step 7: Pass the relevant documents to the LLM and get the response
So far, we've asked our retriever to fetch a set number of relevant documents from the database. Now we need a language model (LLM) to generate an appropriate response based on that context. Remember that at the beginning of this blog, I mentioned that LLMs like ChatGPT can sometimes generate irrelevant responses, especially when asked about specific use cases or contexts. This time, however, we're providing the context from our data to the LLM as a reference, so both this reference and the model's general capabilities will be considered when answering the question. That's the whole idea behind using RAG!

Now, let's dive into implementing the language model (LLM) aspect of our RAG
setup. We'll be using the Zephyr model architecture from the Hugging Face Hub.
Here's how we do it in Python:

from langchain_community.llms import HuggingFaceHub

# Model architecture

llm_repo_id = "huggingfaceh4/zephyr-7b-alpha"

model_kwargs = {"temperature": 0.5, "max_length": 4096, "max_new_tokens": 2048}

model = HuggingFaceHub(repo_id=llm_repo_id, model_kwargs=model_kwargs)

In this code excerpt, we initialize our language model using the Hugging Face Hub. Specifically, we select the Zephyr 7-billion-parameter model, which lives under the repository ID huggingfaceh4/zephyr-7b-alpha. Choosing this model isn't arbitrary; as I said before, it's based on its suitability for our specific task and requirements. Since we're sticking to open-source components only, Zephyr 7B works well enough to generate a useful response with minimal overhead and low latency.

This model comes with some additional parameters to fine-tune its behavior. We've set the temperature to 0.5, which controls the randomness of the generated text. A lower temperature tends to produce more conservative and predictable outputs, while the maximum temperature of 1 pushes the model to be as creative as it can; you can tweak this parameter based on what type of output your use case needs. For simplicity and demonstration purposes, I set it to 0.5 to ensure we get decent results. Next is the max_length parameter, which defines the maximum length of the generated text, including both your prompt and the response.

max_new_tokens sets the threshold on the maximum number of new tokens generated. As a general rule of thumb, max_new_tokens should always be less than or equal to the max_length parameter. Why? Think about it.
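Before wiring the model into a chain, it can be handy to call it once on its own to confirm that the token and repository ID work. A quick sanity check using the model object defined above (the question itself is arbitrary):

# A one-off call to the hosted model, independent of any retrieval
print(model.invoke("Explain retrieval augmented generation in one sentence."))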

Step 8: Create a chain for invoking the LLM.


We have everything we need for our RAG application. The last thing to do is create a chain for invoking the LLM on our query to generate the response. There are different types of chains for different use cases. If you'd like your LLM to remember the context of the chat over time, like ChatGPT does, you would need a memory instance that can be shared among multiple conversation turns. For such cases, there are conversational chains available.

For now, we just need a chain that can combine our retrieved context and pass it with the query to the LLM to generate the response.

from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

rag_chain = (
    {"context": retriever, "query": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)

response = rag_chain.invoke("Who killed Jon Snow?")

We have our prompt, model, context, and query, all combined into a single chain. That's what chains do! Now, before running the final code, I want to take a quick look at these two helper classes:

1. RunnablePassthrough()

2. StrOutputParser()

The RunnablePassthrough class in LangChain serves to pass inputs through unchanged or with additional keys. In our chain, the prompt expects its input as a map with the keys "context" and "query", but the user input only includes the "query". Here, RunnablePassthrough is used to pass the user's question through under the "query" key while the context is fetched by the retriever. It simply ensures that the input to the prompt conforms to the expected format.
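To see this in action, you can invoke just that first stage on its own. LCEL coerces the plain dict in our chain into a RunnableParallel behind the scenes, so this small sketch (assuming the retriever from Step 6) builds one explicitly:

from langchain_core.runnables import RunnableParallel, RunnablePassthrough

# The query string fans out: the retriever uses it to fetch context,
# while RunnablePassthrough forwards it unchanged under the "query" key.
setup = RunnableParallel(context=retriever, query=RunnablePassthrough())
stage_output = setup.invoke("Who killed Jon Snow?")

print(stage_output["query"])         # the original question
print(len(stage_output["context"]))  # the k retrieved documents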

Second, StrOutputParser is typically employed in RAG chains to parse the model's output into a plain, human-readable string. In layman's terms, it takes whatever the model returns and hands it back as simple text that's easy to read and work with. That's it!

D-Day
To ensure we get the entire idea even if the response gets cut off, I've implemented
a function called get_complete_sentence(). This function helps extract the last
complete sentence from the text. So, even if the response hits the maximum token
limit that we set and gets truncated midway, we will still get a coherent
understanding of the message.

For testing, I suggest storing some low-sized PDFs in your project's data folder. You
can choose PDFs related to various topics or domains you want the chatbot to
interact with. Additionally, providing a URL as a reference for the chatbot can be
helpful for testing. For example, you could use a Wikipedia page, a research paper,
or any other online document relevant to your testing goals. During my testing, I used a URL containing information about Jon Snow from Game of Thrones, plus PDFs of the Transformers paper and the YOLOv7 paper, to evaluate the bot's performance. Let's see how our bot performs on this varied content.

import os
import time

import lancedb
from prettytable import PrettyTable

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import WebBaseLoader, PyPDFLoader, DirectoryLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.llms import HuggingFaceHub
from langchain_community.vectorstores import LanceDB
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough

HF_TOKEN = "hf*********"
os.environ["HUGGINGFACEHUB_API_TOKEN"] = HF_TOKEN

# Loading the web URL and the PDFs, then breaking down the information into chunks
start_time = time.time()

loader = WebBaseLoader("https://gameofthrones.fandom.com/wiki/Jon_Snow")
documents_loader = DirectoryLoader('data', glob="./*.pdf", loader_cls=PyPDFLoader)

# URL loader
url_docs = loader.load()

# Document loader
data_docs = documents_loader.load()

# Combining all the information into a single variable
docs = url_docs + data_docs

# Specify chunk size and overlap
chunk_size = 256
chunk_overlap = 20

text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
chunks = text_splitter.split_documents(docs)

# Specify Embedding Model
embedding_model_name = 'sentence-transformers/all-MiniLM-L6-v2'
embeddings = HuggingFaceEmbeddings(model_name=embedding_model_name, model_kwargs={'device': 'cpu'})

# Specify Vector Database
vectorstore_start_time = time.time()
database_name = "LanceDB"

db = lancedb.connect("src/lance_database")
table = db.create_table(
    "rag_sample",
    data=[
        {
            "vector": embeddings.embed_query("Hello World"),
            "text": "Hello World",
            "id": "1",
        }
    ],
    mode="overwrite",
)
docsearch = LanceDB.from_documents(chunks, embeddings, connection=table)
vectorstore_end_time = time.time()

# Specify Retrieval Information
search_kwargs = {"k": 3}
retriever = docsearch.as_retriever(search_kwargs=search_kwargs)

# Specify Model Architecture
llm_repo_id = "huggingfaceh4/zephyr-7b-alpha"
model_kwargs = {"temperature": 0.5, "max_length": 4096, "max_new_tokens": 2048}
model = HuggingFaceHub(repo_id=llm_repo_id, model_kwargs=model_kwargs)

template = """
Answer the question using the following context.

Context: {context}

Question: {query}
"""
prompt = ChatPromptTemplate.from_template(template)

rag_chain_start_time = time.time()
rag_chain = (
    {"context": retriever, "query": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)
rag_chain_end_time = time.time()

def get_complete_sentence(response):
    # Trim the response back to the last full stop so truncated output still reads cleanly
    last_period_index = response.rfind('.')
    if last_period_index != -1:
        return response[:last_period_index + 1]
    else:
        return response

# Invoke the RAG chain and retrieve the response
rag_invoke_start_time = time.time()
response = rag_chain.invoke("Who killed Jon Snow?")
rag_invoke_end_time = time.time()

# Get the complete sentence
complete_sentence_start_time = time.time()
complete_sentence = get_complete_sentence(response)
complete_sentence_end_time = time.time()

# Create a table for the execution report
table = PrettyTable()
table.field_names = ["Task", "Time Taken (Seconds)"]

# Add timing rows to the table
table.add_row(["Vectorstore Creation", round(vectorstore_end_time - vectorstore_start_time, 2)])
table.add_row(["RAG Chain Setup", round(rag_chain_end_time - rag_chain_start_time, 2)])
table.add_row(["RAG Chain Invocation", round(rag_invoke_end_time - rag_invoke_start_time, 2)])
table.add_row(["Complete Sentence Extraction", round(complete_sentence_end_time - complete_sentence_start_time, 2)])

# Additional information in the table
table.add_row(["Embedding Model", embedding_model_name])
table.add_row(["LLM (Language Model) Repo ID", llm_repo_id])
table.add_row(["Vector Database", database_name])
table.add_row(["Temperature", model_kwargs["temperature"]])
table.add_row(["Max Length Tokens", model_kwargs["max_length"]])
table.add_row(["Max New Tokens", model_kwargs["max_new_tokens"]])
table.add_row(["Chunk Size", chunk_size])
table.add_row(["Chunk Overlap", chunk_overlap])
table.add_row(["Number of Documents", len(docs)])

print("\nComplete Sentence:")
print(complete_sentence)

# Print the table
print("\nExecution Timings:")
print(table)

+------------------------------+----------------------------------------+
|             Task             |          Time Taken (Seconds)          |
+------------------------------+----------------------------------------+
| Vectorstore Creation         | 16.21                                  |
| RAG Chain Setup              | 0.03                                   |
| RAG Chain Invocation         | 2.06                                   |
| Complete Sentence Extraction | 0.0                                    |
| Embedding Model              | sentence-transformers/all-MiniLM-L6-v2 |
| LLM (Language Model) Repo ID | huggingfaceh4/zephyr-7b-alpha          |
| Vector Database              | LanceDB                                |
| Temperature                  | 0.5                                    |
| Max Length Tokens            | 4096                                   |
| Max New Tokens               | 2048                                   |
| Chunk Size                   | 256                                    |
| Chunk Overlap                | 20                                     |
| Number of Documents          | 39                                     |
+------------------------------+----------------------------------------+

To enhance readability and present the execution information in a structured tabular format, I have used the PrettyTable library. You can add it to your virtual environment using the command pip3 install prettytable.

So this is the response I received in less than a minute, which is quite reasonable for starters. The time it takes can vary depending on your system's configuration, but you'll get decent results in just a few minutes. So, please be patient if it's taking a bit longer.

Human:

Question: Who killed Jon Snow?

Answer:

In the TV series Game of Thrones, Jon Snow was stabbed by his
fellow Night's Watch members in season 5, episode 9,
"The Dance of Dragons." However, he was later resurrected by Melisandre
in season 6, episode 3, "Oathbreaker." So, technically,
no one killed Jon Snow in the show.

Have fun experimenting with various data sources! You can try changing the
website addresses, adding new PDF files, or changing the template a bit. LLMs are
fun; you never know what you'll get!

Here is the Google Colab link for reference.

What's next?
There are plenty of things we can adjust here. We could switch to a more effective embedding model for better indexing, try different search techniques for the retriever, add a reranker to improve document ranking, or use a more advanced LLM with a larger context window and faster response times. Along these lines, every RAG application is just an enhanced variation on the same pipeline; the fundamental concept of how a RAG application works remains the same.
