RAG: Retrieval Augmented Generation
A Simple Introduction
Abhinav Kimothi
What is RAG?
Retrieval Augmented Generation
30th November 2022 will be remembered as a watershed moment in artificial intelligence. OpenAI released ChatGPT and the world was mesmerised. Interest in previously obscure terms like Generative AI and Large Language Models (LLMs) surged over the following 12 months.
[Figure: Google Trends interest over time for "Generative AI" and "Large Language Models", Nov 2022 to Nov 2023]
The Challenge
Make LLMs respond with up-to-date information
Make LLMs not respond with factually inaccurate information
Make LLMs aware of proprietary information
Provide LLMs with information not in their memory
Providing Context
While model re-training, fine-tuning and reinforcement learning are options that solve the aforementioned challenges, these approaches are time-consuming and costly. In the majority of use cases, these costs are prohibitive.
[Figure: RAG flow - the user's prompt goes to a Retriever, which searches an external source and fetches the relevant information; the prompt augmented with this context is then passed to the LLM]
[Figure: the Retriever fetches from knowledge sources such as document repositories, databases and other sources]
Parametric Memory: an LLM has knowledge only of the data it has been trained on, i.e. information stored in the model parameters.
Non-Parametric Memory: the Retriever searches and fetches information that the LLM has not necessarily been trained on; this adds to the LLM's memory and is passed as context in the prompt, i.e. information available outside the model parameters. It is expandable to all sources, easier to update and maintain, and much cheaper than retraining or fine-tuning. The effort lies in the creation of the knowledge base.
Confidence in Responses
With the context (the extra information that is retrieved) made available to the LLM, confidence in the LLM's responses increases.
Conversational agents
LLMs can be customised to product/service manuals, domain
knowledge, guidelines, etc. using RAG. The agent can also route users to
more specialised agents depending on their query. SearchUnify has an
LLM+RAG powered conversational agent for their users.
Content Generation
The widest use of LLMs has probably been in content generation. Using
RAG, the generation can be personalised to readers, incorporate real-time trends and be contextually appropriate. Yarnit is an AI-based
content marketing platform that uses RAG for multiple tasks.
Personalised Recommendation
Recommendation engines have been a game changer in the digital
economy. LLMs are capable of powering the next evolution in content
recommendations. Check out Aman’s blog on the utility of LLMs in
recommendation systems.
Virtual Assistants
Virtual personal assistants like Siri, Alexa and others plan to use
LLMs to enhance the experience. Coupled with more context on user
behaviour, these assistants can become highly personalised.
RAG Architecture
Let’s revisit the five high-level steps of a RAG-enabled system.
[Figure: RAG system - the prompt is used to search for relevant information in the knowledge sources; the relevant context is combined with the prompt and sent to the LLM endpoint, which returns the generated response]
The user writes a prompt or a query that is passed to an orchestrator.
The orchestrator sends the query to the retriever.
The retriever fetches the relevant information from the knowledge sources and sends it back.
The orchestrator augments the prompt with the context and sends it to the LLM.
The LLM responds with the generated text, which is displayed to the user via the orchestrator.
Two pipelines become important in setting up the RAG system: the first sets up the knowledge sources for efficient search and retrieval, and the second executes the generation steps at run time.
Indexing Pipeline
Data for the knowledge base is ingested from the source and indexed. This involves steps like splitting the documents, creating embeddings and storing the data.
RAG Pipeline
This is the actual RAG process: it takes the user query at run time, retrieves the relevant data from the index and passes it to the model.
Indexing Pipeline
The indexing pipeline sets up the knowledge source for the RAG system. It is
generally considered an offline process. However, information can also be
fetched in real time. It involves four primary steps.
1. Loading: extracting information from different knowledge sources and loading it into documents.
2. Splitting: splitting documents into smaller, manageable chunks. Smaller chunks are easier to search and to use in LLM context windows.
3. Embedding: converting text documents into numerical vectors. ML models are mathematical models and therefore require numerical data.
4. Storing: storing the embedding vectors. Vectors are typically stored in Vector Databases, which are best suited for searching.
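As an illustration, here is a minimal sketch of these four steps using LangChain and a FAISS index. The file name, chunk sizes and embedding model are assumptions, and import paths vary slightly across LangChain versions.

```python
# Minimal indexing-pipeline sketch (assumes an OpenAI API key and a local transcript file).
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

# 1. Loading: extract the raw text into Document objects
documents = TextLoader("karpathy_transcript.txt").load()

# 2. Splitting: break documents into smaller, overlapping chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_documents(documents)

# 3. Embedding + 4. Storing: convert chunks to vectors and persist them in a FAISS index
embeddings = OpenAIEmbeddings()
vector_store = FAISS.from_documents(chunks, embeddings)
vector_store.save_local("faiss_index")
```

The same pattern applies to other loaders (PDFs, web pages, databases) and other vector stores.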
[Figure: when the context is fixed and known in advance, no search is needed - the fixed context is fetched directly and passed to the LLM with the user prompt]
Storing
We are at the last step of creating the indexing pipeline. We have loaded and split
the data, and created the embeddings. Now, for us to be able to use the
information repeatedly, we need to store it so that it can be accessed on demand.
For this we use a special kind of database called the Vector Database.
A stripped-down variant of a Vector Database is a Vector Index like FAISS (Facebook AI Similarity Search). It is the vector index that improves the search and retrieval of vector embeddings. Vector Databases augment the indexing with typical database features like data management, metadata storage, scalability, integrations, security, etc.
Evaluate data durability and integrity requirements vs the need for fast query performance. Additional persistence safeguards can reduce speed.
Assess trade-offs between local storage speed and access vs cloud storage benefits like security, redundancy and scalability.
Cost considerations: while you may incur a regular cost with a fully managed solution, a self-hosted one might prove costlier if not managed well.
There are many more Vector DBs. For a comprehensive understanding of the pros and cons of each, this blog is highly recommended.
Now that our knowledge base is ready, let’s quickly see it in action by performing a search on the FAISS index we’ve just created.
Similarity search
In the YouTube video, for which we have indexed the transcript, Andrej Karpathy
talks about the idea of LLM as an operating system. Let’s perform a search on this.
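A minimal sketch of that search, assuming the vector_store object from the indexing sketch above; the query string is illustrative and similarity_search returns the chunks closest to the query in embedding space.

```python
# Similarity search against the FAISS index built earlier
query = "What does Karpathy mean by LLM as an operating system?"
results = vector_store.similarity_search(query, k=2)  # top-2 most similar chunks

for doc in results:
    print(doc.page_content[:300])  # preview the retrieved chunks
```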
We can see here that, out of the entire text, we have been able to retrieve the specific chunk talking about the LLM OS. We’ll look at this in detail again in the RAG pipeline.
All LangChain VectorDB integrations
RAG Pipeline
Now that the knowledge base has been created in the indexing pipeline, the main generation (RAG) pipeline has to be set up to receive the input and generate the output.
[Figure: RAG system - the prompt is used to search for relevant information in the knowledge sources; the relevant context is combined with the prompt and sent to the LLM endpoint, which returns the generated response]
Generation Steps
1. The user writes a prompt or a query that is passed to an orchestrator.
2. The orchestrator sends the query to the retriever.
3. The retriever fetches the relevant information from the knowledge sources and returns it.
4. The orchestrator augments the prompt with the retrieved context and sends it to the LLM.
5. The LLM responds with the generated text, which is displayed to the user via the orchestrator.
The knowledge sources highlighted above have been set up using the indexing pipeline. These sources can also be served using “on-the-fly” indexing.
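As a sketch of these generation steps, assuming the FAISS vector_store from the indexing pipeline and an OpenAI chat model (the model name and prompt wording are assumptions, not the only way to orchestrate this):

```python
# Retrieve context, augment the prompt, and generate a response
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-3.5-turbo")
retriever = vector_store.as_retriever(search_kwargs={"k": 4})

def answer(query: str) -> str:
    docs = retriever.invoke(query)                       # retriever fetches relevant chunks
    context = "\n\n".join(doc.page_content for doc in docs)
    prompt = (                                           # orchestrator augments the prompt
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    return llm.invoke(prompt).content                    # LLM generates the response

print(answer("What is the LLM operating system idea?"))
```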
Retrieval
Perhaps, the most critical step in the entire RAG value chain is searching and
retrieving the relevant pieces of information (known as documents). When the
user enters a query or a prompt, it is this system (Retriever) that is responsible
for accurately fetching the correct snippet of information that is used in
responding to the user query.
Multi-query Retrieval
Multi-query Retrieval automates prompt tuning using a language
model to generate diverse queries for a user input, retrieving
relevant documents from each query and combining them to
overcome limitations and obtain a more comprehensive set of
results. This approach aims to enhance retrieval performance by
considering multiple perspectives on the same query.
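A sketch of this approach using LangChain's MultiQueryRetriever, assuming the vector_store and llm objects from the earlier sketches; the LLM generates several rephrasings of the query and the retrieved documents are merged.

```python
from langchain.retrievers.multi_query import MultiQueryRetriever

mq_retriever = MultiQueryRetriever.from_llm(
    retriever=vector_store.as_retriever(),  # base vector retriever
    llm=llm,                                # LLM used to generate query variants
)
docs = mq_retriever.invoke("How does Karpathy describe the LLM OS?")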
Contextual compression
Sometimes, relevant info is hidden in long documents with a lot of
extra stuff. Contextual Compression helps with this by squeezing
down the documents to only the important parts that match your
search.
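A sketch using LangChain's contextual compression retriever, assuming the vector_store and llm from the earlier sketches; an LLM-based extractor trims each retrieved document down to the parts relevant to the query.

```python
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

compressor = LLMChainExtractor.from_llm(llm)             # extracts only query-relevant passages
cc_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vector_store.as_retriever(),
)
compressed_docs = cc_retriever.invoke("LLM as an operating system")
```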
Self Query
A self-querying retriever is a system that can ask itself questions.
When you give it a question in normal language, it uses a special
process to turn that question into a structured query. Then, it uses
this structured query to search through its stored information. This
way, it doesn't just compare your question with the documents; it
also looks for specific details in the documents based on your
question, making the search more efficient and accurate.
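A sketch of a self-querying retriever in LangChain (it additionally requires the lark package); the metadata fields here are hypothetical examples, and the vector_store and llm come from the earlier sketches.

```python
from langchain.chains.query_constructor.base import AttributeInfo
from langchain.retrievers.self_query.base import SelfQueryRetriever

# Describe the metadata the retriever is allowed to filter on (hypothetical fields)
metadata_field_info = [
    AttributeInfo(name="speaker", description="Who is speaking in the chunk", type="string"),
    AttributeInfo(name="year", description="Year of the talk", type="integer"),
]

sq_retriever = SelfQueryRetriever.from_llm(
    llm, vector_store, "Transcript of a talk about LLMs", metadata_field_info
)
docs = sq_retriever.invoke("What did Karpathy say about LLMs in 2023?")
```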
Time-weighted Retrieval
This method supplements semantic similarity search with a time decay, giving more weight to documents that are fresher or more recently used than to older ones.
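A sketch of a time-weighted retriever in LangChain, which combines the similarity score with a recency score controlled by a decay rate; the embedding size, decay rate and sample document below are assumptions, and the embeddings object comes from the indexing sketch.

```python
import faiss
from langchain_community.docstore.in_memory import InMemoryDocstore
from langchain_community.vectorstores import FAISS
from langchain.retrievers import TimeWeightedVectorStoreRetriever
from langchain_core.documents import Document

embedding_size = 1536  # dimension of the embedding model (assumption)
store = FAISS(embeddings, faiss.IndexFlatL2(embedding_size), InMemoryDocstore({}), {})

tw_retriever = TimeWeightedVectorStoreRetriever(vectorstore=store, decay_rate=0.01, k=4)
tw_retriever.add_documents([Document(page_content="LLMs can be seen as the kernel of a new OS")])
docs = tw_retriever.invoke("LLM operating system")  # fresher / recently used chunks rank higher
```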
Ensemble Techniques
As the term suggests, multiple retrieval methods can be used in conjunction with each other. There are many ways of implementing ensemble techniques, and the use case will define the structure of the retriever.
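One common ensemble fuses keyword (BM25) and vector search results; here is a sketch with LangChain's EnsembleRetriever, assuming the chunks and vector_store from the indexing sketch (BM25Retriever needs the rank_bm25 package).

```python
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever

bm25_retriever = BM25Retriever.from_documents(chunks)   # lexical / keyword search
vector_retriever = vector_store.as_retriever()          # semantic vector search

ensemble = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.5, 0.5],                                  # relative weight of each retriever
)
docs = ensemble.invoke("LLM operating system")
```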
Similarity Vector Search differs from plain Similarity Search in that the query is also converted from regular text into a vector embedding before matching.
Evaluation
Building a PoC RAG pipeline is not overly complex; LangChain and LlamaIndex have made it quite simple. Developing impressive Large Language Model (LLM) applications is achievable through brief training and verification on a limited set of examples. However, to enhance robustness, thorough testing on a dataset that accurately mirrors the production distribution is imperative.
https://github.com/explodinggradients/ragas
Evaluation Data
To evaluate RAG pipelines, the following four data points are recommended: the Prompt (user question), the Retrieved Context, the generated Response, and the Ground Truth answer.
Evaluation Metrics
Evaluating Generation
Faithfulness: Is the Response faithful to the Retrieved Context?
Retrieval Evaluation
Context Relevance: Is the Retrieved Context relevant to the Prompt?
Overall Evaluation
Answer Semantic Similarity: Is the Response semantically similar to the Ground Truth?
Answer Correctness: Is the Response semantically and factually similar to the Ground Truth?
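A hedged sketch of computing these metrics with Ragas; the sample record is made up, and column names and metric imports vary between Ragas versions.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, answer_correctness

# One illustrative record with the four recommended data points
eval_data = Dataset.from_dict({
    "question": ["What is the LLM OS idea?"],
    "contexts": [["...retrieved chunk about the LLM operating system..."]],
    "answer": ["Karpathy frames the LLM as the kernel of a new kind of operating system."],
    "ground_truth": ["The LLM acts like an operating system orchestrating tools and memory."],
})

# Ragas uses an LLM under the hood, so an API key must be configured
result = evaluate(
    eval_data,
    metrics=[faithfulness, answer_relevancy, context_precision, answer_correctness],
)
print(result)
```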
[Figure: the RAG triad - Context Relevance between the query and the retrieved context, Groundedness between the retrieved context and the response, and Answer Relevance between the query and the response]
Context Relevance:
Verify quality by ensuring each context chunk is relevant to the input query
Groundedness:
Verify groundedness by breaking down the response into individual claims.
Independently search for evidence supporting each claim in the retrieved
context.
Answer Relevance:
Ensure the response effectively addresses the original question.
Verify by evaluating the relevance of the final response to user input.
TruLens Documentation
[Figure: RAG vs SFT decision matrix - depending on the use case, RAG may be preferred over SFT, SFT may be preferred over RAG, or a hybrid SFT + RAG approach may be appropriate; caveats noted include working well only with very large foundation models and possibly not addressing the problem of hallucinations]
RAG should be implemented (with or without SFT) if the use case requires:
Access to an external data source, especially if the data is dynamic
Resolving hallucinations
Other Considerations
Latency
RAG pipelines require an additional step of searching and retrieving context
which introduces an inherent latency in the system
Scalability
RAG pipelines are modular and therefore can be scaled relatively easily when
compared to SFT. SFT will require retraining the model with each additional data
source
Cost
Both RAG and SFT warrant upfront investment. Training cost for SFT can vary
depending on the technique and the choice of foundation model. Setting up the
knowledge base and integration can be costly for RAG
Expertise
Creating RAG pipelines has become moderately simple with frameworks like
LangChain and LlamaIndex. Fine-tuning on the other hand requires deep
understanding of the techniques and creation of training data
[Figure: layers of a RAG application stack - data, model, prompt, evaluation, application/orchestration, deployment, app hosting and monitoring]
Data Layer
The foundation of RAG applications is the data layer. This involves -
Data preparation - Sourcing, Cleaning, Loading & Chunking
Creation of Embeddings
Storing the embeddings in a vector store
We’ve seen this process in the creation of the indexing pipeline
Model Layer
2023 can be considered a year of LLM wars. Almost every other week in the
second half of the year a new model was released. Like there is no RAG without
data, there is no RAG without an LLM. There are four broad categories of LLMs
that can be a part of a RAG application
There are many vendors that provide access to open-source models (e.g. Falcon, Phi-2 by Microsoft) and also facilitate easy fine-tuning of these models.
Note: for open-source models it is important to check the license type; some open-source models are not available for commercial use.
Prompt Layer
Prompt Engineering is more than writing questions in natural language. There are
several prompting techniques and developers need to create prompts tailored
to the use cases. This process often involves experimentation: the developer
creates a prompt, observes the results and then iterates on the prompts to
improve the effectiveness of the app. This requires tracking and collaboration.
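For example, a RAG prompt is often maintained as a reusable, versioned template rather than an ad-hoc string; here is a minimal sketch with LangChain's prompt templates (the wording is an assumption to iterate on, not a recommended final prompt).

```python
from langchain_core.prompts import ChatPromptTemplate

rag_prompt = ChatPromptTemplate.from_messages([
    ("system",
     "Answer using only the provided context. "
     "If the answer is not in the context, say you don't know."),
    ("human", "Context:\n{context}\n\nQuestion: {question}"),
])

# Fill the template with retrieved context and the user question
messages = rag_prompt.format_messages(
    context="...retrieved chunks...",
    question="What is RAG?",
)
```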
Evaluation
It is easy to build a RAG pipeline, but getting it ready for production requires robust evaluation of the pipeline’s performance. Several frameworks and tools have emerged for checking hallucinations, relevance and accuracy.
Ragas is among the popular RAG evaluation frameworks and tools (non-exhaustive).
App Orchestration
A RAG application involves interaction of multiple tools and services. To run
the RAG pipeline, a solid orchestration framework is required that invokes these
different processes.
Deployment Layer
Deployment of the RAG application can be done on any of the available cloud providers and platforms. Some important factors to consider during deployment are:
Security and Governance
Logging
Inference costs and latency
Application Layer
The application finally needs to be hosted for the intended users or systems to
interact with it. You can create your own application layer or use the available
platforms.
Monitoring
The deployed application needs to be continuously monitored for accuracy and relevance as well as for cost and latency.
Other Considerations
LLM Cache - to reduce costs by saving responses for popular queries (a small sketch follows below)
LLM Guardrails - to add an additional layer of scrutiny on generations
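A sketch of an in-memory LLM cache in LangChain, assuming the llm object from the earlier generation sketch; identical prompts are served from the cache instead of re-calling the model, which saves cost and latency.

```python
from langchain_core.globals import set_llm_cache
from langchain_core.caches import InMemoryCache

set_llm_cache(InMemoryCache())      # enable a process-wide in-memory cache

llm.invoke("What is RAG?")          # first call hits the model
llm.invoke("What is RAG?")          # identical call is answered from the cache
```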
Multimodal RAG
Up until now, most AI models have been limited to a single modality (a single type
of data like text or images or video). Recently, there has been significant progress
in AI models being able to handle multiple modalities (mainly text and images).
With the emergence of these Large Multimodal Models (LMMs) a multimodal RAG
system becomes possible.
Approaches
[Figure: at query time, the prompt and the retrieved multimodal context are passed to an LMM, which generates the response]
[Figure: in the indexing pipeline, an LMM generates image captions and text summaries; text embeddings are created and stored in a vector store, alongside the images]
Naive RAG
At its most basic, Retrieval Augmented Generation can be summarized in three
steps -
1. Indexing of the documents
2. Retrieval of the context with respect to an input query
3. Generation of the response using the input query and retrieved context
[Figure: Naive RAG - documents are indexed; the user query is used to retrieve relevant context, and the LLM generates a response from the prompt plus retrieved context]
Advanced RAG
To address the inefficiencies of the Naive RAG approach, Advanced RAG
approaches implement strategies focussed on three processes -
Indexing: Chunk Optimisation, Metadata Integration, Indexing Structure, Alignment
Retrieval: query rewriting, query routing and related techniques (covered below)
Post Retrieval: Information Compression, Re-ranking
[Figure: Advanced RAG - the prompt passes through these indexing, retrieval and post-retrieval optimisations before the LLM generates the response]
Metadata Integration
Information like dates, purpose, chapter summaries, etc. can be embedded into
chunks. This improves the retriever efficiency by not only searching the
documents but also by assessing the similarity to the metadata.
Indexing Structure
Introduction of graph structures can greatly enhance retrieval by leveraging
nodes and their relationships. Multi-index paths can be created aimed at
increasing efficiency.
Alignment
Understanding complex data, like tables, can be tricky for RAG. One way to
improve the indexing is by using counterfactual training, where we create
hypothetical (what-if) questions. This increases the alignment and reduces
disparity between documents.
Query Rewriting
To bring better alignment between the user query and the documents, several rewriting approaches exist. LLMs are sometimes used to create pseudo documents from the query for better matching with existing documents, or to perform abstract reasoning over the query. Multi-querying is employed to solve complex user queries.
Sub Queries
Sub-querying involves breaking down a complex query into sub-questions for each relevant data source, then gathering all the intermediate responses and synthesizing a final response.
Query Routing
A query router identifies a downstream task and decides the subsequent action
that the RAG system should take. During retrieval, the query router also identifies
the most appropriate data source for resolving the query.
Iterative Retrieval
Documents are collected repeatedly based on the query and the generated
response to create a more comprehensive knowledge base.
Recursive Retrieval
Recursive retrieval also iteratively retrieves documents. However, it also refines
the search queries depending on the results obtained from the previous retrieval.
It is like a continuous learning process.
Adaptive Retrieval
Enhance the RAG framework by empowering Language Models (LLMs) to
proactively identify the most suitable moments and content for retrieval. This
refinement aims to improve the efficiency and relevance of the information
obtained, allowing the models to dynamically choose when and what to retrieve,
leading to more precise and effective results
Fine-tuned Embeddings
This process involves tailoring embedding models to improve retrieval accuracy,
particularly in specialized domains dealing with uncommon or evolving terms. The
fine-tuning process utilizes training data generated with language models where
questions grounded in document chunks are generated.
Information Compression
While the retriever is proficient in extracting relevant information from extensive
knowledge bases, managing the vast amount of information within retrieval
documents poses a challenge. The retrieved information is compressed to extract
the most relevant points before passing it to the LLM.
Reranking
The re-ranking model plays a crucial role in optimizing the document set retrieved
by the retriever. The main idea is to rearrange document records to prioritize the
most relevant ones at the top, effectively managing the total number of
documents. This not only resolves challenges related to context window
expansion during retrieval but also improves efficiency and responsiveness.
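A hedged re-ranking sketch using a cross-encoder from the sentence-transformers library; the model name and candidate chunks are illustrative, and managed re-rankers (e.g. Cohere Rerank) follow the same pattern.

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "What is the LLM operating system idea?"
candidates = [
    "chunk about the LLM acting as the kernel of an operating system",
    "chunk about GPU pricing",
    "chunk about long context windows",
]

# Score each (query, document) pair and re-order the candidates, best first
scores = reranker.predict([(query, doc) for doc in candidates])
reranked = [doc for _, doc in sorted(zip(scores, candidates), reverse=True)]
```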
Modular RAG
The state of the art in Retrieval Augmented Generation is a modular approach, which allows modules like search, memory and re-ranking to be configured.
[Figure: Modular RAG - configurable modules such as Routing, Search, Predict, Retrieve, Read, Demonstrate, Fusion and Memory]
Naive RAG is essentially a Retrieve -> Read approach which focusses on retrieving
information and comprehending it.
Advanced RAG adds Rewrite and Rerank components to the Retrieve -> Read approach to improve relevance and groundedness.
Modular RAG takes everything a notch ahead by providing flexibility and adding
modules like Search, Routing, etc.
Naive, Advanced and Modular RAG are not exclusive approaches but a progression: Naive RAG is a special case of Advanced RAG which, in turn, is a special case of Modular RAG.
Memory
This module leverages the parametric memory capabilities of the Language Model
(LLM) to guide retrieval. The module may use a retrieval-enhanced generator to
create an unbounded memory pool iteratively, combining the "original question"
and "dual question." By employing a retrieval-enhanced generative model that
improves itself using its own outputs, the text becomes more aligned with the
data distribution during the reasoning process.
Fusion
RAG-Fusion improves traditional search systems by overcoming their limitations
through a multi-query approach. It expands user queries into multiple diverse
perspectives using a Language Model (LLM). This strategy goes beyond capturing
explicit information and delves into uncovering deeper, transformative
knowledge. The fusion process involves conducting parallel vector searches for
both the original and expanded queries, intelligently re-ranking to optimize
results, and pairing the best outcomes with new queries.
Extra Generation
Rather than directly fetching information from a data source, this module
employs the Language Model (LLM) to generate the required context. The content
produced by the LLM is more likely to contain pertinent information, addressing
issues related to repetition and irrelevant details in the retrieved content.
Acknowledgements
Retrieval Augmented Generation continues to be a pivotal approach for any
Generative AI led application and it is only going to grow. There are several
individuals and organisations that have provided learning resources and made
understanding RAG fun.
Resources
Official documentation: Ragas, LangChain, LlamaIndex, TruLens
Research Papers
Hello!
I’m Abhinav...
A data science and AI professional with over 15
years in the industry. Passionate about AI
advancements, I constantly explore emerging
technologies to push the boundaries and create
positive impacts in the world. Let’s build the future,
together!
Keep Calm & Build AI.