THE RISE OF CONTEXTUAL AI

A Taxonomy of Retrieval Augmented Generation
Components, Concepts, Use Cases & more...

Abhinav Kimothi, October 2024

200+ terms to know, build and improve RAG systems

Introduction
Retrieval Augmented Generation, or RAG, stands as a pivotal
technique shaping the landscape of applied generative AI. A novel
concept introduced by Lewis et al. in their seminal paper
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks,
RAG has swiftly emerged as a cornerstone, enhancing reliability and
trustworthiness in the outputs from Large Language Models (LLMs).

In 2024, RAG is one of the most widely used techniques in
generative AI applications and, as per Databricks, at least 60% of
LLM applications utilise some form of RAG. RAG's acceptance is
also propelled by the simplicity of the concept. Simply put, a RAG
system searches for information from a knowledge base and sends
it along with the query to the LLM for the response.

Figure 1: Retrieval Augmented Generation enhances the reliability and the trustworthiness
in LLM responses (Source: A Simple Guide to Retrieval Augmented Generations)

A Taxonomy of Retrieval Augmented Generation Page 2 of 56


RAG today encompasses a wide array of techniques, models,
and approaches. It can get a little overwhelming for newcomers. As
RAG continues to evolve, it is crucial to create a shared language
framework for researchers, practitioners, developers and business
leaders.

This taxonomy is an attempt to clarify the components of RAG,
serve as a guide for understanding key building blocks and provide
a roadmap to navigate through this somewhat complex, evolving
RAG ecosystem.

Figure 2: Themes for A Taxonomy of Retrieval Augmented Generations

Table of Contents

RAG Basics
Terms associated with limitations of LLMs and introduction to RAG

Core Components
Terms associated with the indexing and generation pipelines

Evaluation
Terms associated with metrics, frameworks and benchmarks used in RAG
evaluation

Pipeline Design
Terms associated with the naive, advanced and modular RAG pipelines

Operations Stack
Terms associated with the emerging RAGOps stack

Emerging Patterns
Terms associated with patterns like multimodal RAG, agentic RAG and
KG powered RAG

Technology Providers
List of solutions and service providers to operationalise the RAG stack

Applied RAG
Applied RAG patterns, application areas and current set of challenges


RAG Basics
LLM Limitations

Knowledge Cut-off Date: Training an LLM is an expensive and
time-consuming process. It takes massive volumes of data and
several weeks, or even months, to train an LLM. The data that LLMs
are trained on is therefore not current. For example, GPT-4o has
knowledge only up to October 2023. Any event that happened after
this knowledge cut-off date is not available to the model.

Training Data Limitation: LLMs are trained on large volumes of
data from a variety of public sources. For example, Llama 3 has been
trained on about 15 trillion tokens (about 7 times more than
Llama 2). However, LLMs do not have any knowledge of information that
is not public. Publicly available LLMs have not been trained on
information like internal company documents, customer
information, product documents, etc. So, LLMs cannot be expected
to respond to queries about such information.


Hallucinations: LLMs are next-word predictors. They are not
trained to verify the facts in their responses. Thus, it is observed
that LLMs sometimes provide responses that are factually incorrect,
and despite being incorrect, these responses sound extremely
confident and legitimate. This characteristic of “lying with
confidence,” called hallucination, has proved to be one of the
biggest criticisms of LLMs.

Context Window: Every LLM, by the nature of its architecture,
can process up to a maximum number of tokens. This maximum
number of tokens is referred to as the context window of the model.
If the prompt is longer than the context window, then the portion of
the prompt beyond the limit is discarded.
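The truncation described above can be illustrated with a minimal sketch. It assumes whitespace-separated words stand in for tokens (real models use their own tokenizers) and the function name is hypothetical. Some serving stacks reject an over-long prompt with an error instead of silently truncating it.

```python
def truncate_to_context_window(tokens: list[str], context_window: int) -> list[str]:
    """Keep the first `context_window` tokens; everything beyond is discarded."""
    if context_window < 0:
        raise ValueError("context_window must be non-negative")
    return tokens[:context_window]


# Whitespace splitting is only a stand-in for a real tokenizer.
prompt_tokens = "Summarise the attached quarterly report in three bullet points".split()
kept = truncate_to_context_window(prompt_tokens, context_window=5)
print(kept)  # ['Summarise', 'the', 'attached', 'quarterly', 'report']
```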

Model Name      Developer         Knowledge Cut-off Date   Context Window (Tokens)
Claude 3        Anthropic         August 2023              200k tokens
GPT-4o          OpenAI            October 2023             128k tokens
LLaMA 3.1       Meta              December 2023            128k tokens
PaLM 2          Google DeepMind   April 2023               32k tokens
Gemini 1.5 Pro  Google DeepMind   Early 2024               256k tokens
Claude 2        Anthropic         Early 2023               100k tokens
Mistral         Mistral AI        2023                     8k tokens
Falcon 40B      TII               March 2023               2k tokens
BLOOM           BigScience        Early 2022               2k tokens
GPT-NeoX-20B    EleutherAI        April 2022               2k tokens

Table 1: Popular LLMs with their cut-off date and context window

RAG Concepts

Parametric Memory: The ability of an LLM to retain information
that it has been trained on is solely reliant on its parameters. It can
therefore be said that LLMs store factual information in their
parameters. This memory that is internally present in the LLM can
be referred to as the parametric memory. This parametric memory
is limited. It depends upon the number of parameters and on the
data on which the LLM has been trained.

Non-parametric Memory: Information that LLMs do not have in
their internal parameters but can access via external retrieval
mechanisms, like a search engine or database. RAG provides the
LLM with access to this non-parametric memory.

Knowledge Base: The non-parametric memory that has been
created for the RAG application. This contains information from a
variety of pre-determined sources which is processed and stored in
persistent memory.

User Query: The prompt that a user (or a system) wants to send to
an LLM for a response.

Retrieval: The process by which information pertinent to the user
query is searched for and fetched from the knowledge base.

Augmentation: The process of adding the retrieved information to
the user query.

Generation: The process of generating results by the LLM when
provided with an augmented prompt.
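The three steps of retrieval, augmentation and generation can be sketched end to end as follows. This is a toy illustration under stated assumptions: retrieval is a simple word-overlap ranking rather than a real search index, the prompt layout is one of many possible, and `fake_llm` is a hypothetical stand-in for a call to an actual model.

```python
def retrieve(query: str, knowledge_base: list[str], top_k: int = 2) -> list[str]:
    """Retrieval: rank documents by the number of words shared with the query."""
    query_words = set(query.lower().split())
    scored = [(len(query_words & set(doc.lower().split())), doc) for doc in knowledge_base]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:top_k] if score > 0]


def augment(query: str, retrieved: list[str]) -> str:
    """Augmentation: add the retrieved information to the user query."""
    context = "\n".join(f"- {doc}" for doc in retrieved)
    return f"Context:\n{context}\n\nQuestion: {query}"


def generate(augmented_prompt: str, llm) -> str:
    """Generation: send the augmented prompt to the LLM."""
    return llm(augmented_prompt)


def fake_llm(prompt: str) -> str:
    # Hypothetical placeholder; a real system would call a model here.
    return f"[model response to a prompt of {len(prompt)} characters]"


knowledge_base = [
    "The refund window for online orders is 30 days.",
    "Our support desk is open on weekdays from 9 to 5.",
    "Gift cards cannot be exchanged for cash.",
]
query = "What is the refund window for online orders?"
retrieved = retrieve(query, knowledge_base)
prompt = augment(query, retrieved)
answer = generate(prompt, fake_llm)
print(prompt)
print(answer)
```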

Source Citation: The ability of a RAG system to point to the
information from the knowledge base that was used to generate the
response.

Unlimited Memory: The notion that any number of documents can be
added to the RAG knowledge base.

Figure 3: How does RAG help?



Core Components
Indexing Pipeline: The set of processes that is employed to create
the knowledge base for RAG applications. It is a non-real-time
pipeline that updates the knowledge base at periodic intervals.

Source Systems: The original locations where the data that is
desired for the RAG application is stored. These can be data lakes,
file systems, CMSs, SQL & NoSQL databases, 3rd party data stores,
etc.


Figure 4: Examples of source systems that can be connected to


Data Loading: The first step of the indexing pipeline that connects
to source systems to extract and parse files for data to be used in
the RAG knowledge base.

Metadata: A common way of defining metadata is “data about
data”. Metadata describes other data. It can provide information
like a description of the data, time of creation, author, etc. While
metadata is useful for managing and organising data, in the
context of RAG, metadata enhances the searchability of data.

Data Masking: Obfuscation of sensitive information like PII or
confidential data.

Chunking: The process of breaking down long pieces of text into
smaller manageable sizes or “chunks”. Chunking is crucial to the
efficient creation of the knowledge base for RAG systems. Chunking
increases the ease of search and overcomes the context window
limits of LLMs.

Lost in the middle problem: Even in those LLMs which have a
long context window (Claude 3 by Anthropic has a context window
of up to 200,000 tokens), an issue with accurately reading the
information has been observed. It has been noticed that accuracy
declines dramatically if the relevant information is somewhere in
the middle of the prompt. This problem can be addressed by
passing only the relevant information to the LLM instead of the
entire document.

Fixed Size Chunking: A very common approach is to pre-determine
the size of the chunk and the amount of overlap between
the chunks. There are several chunking methods that follow a fixed
size chunking approach. Chunks are created based on a fixed
number of characters, tokens, sentences or paragraphs.
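As a rough illustration of fixed size chunking, the sketch below splits text into chunks of a fixed number of characters with a fixed overlap. The function name and parameters are assumptions made for this example; the same idea applies when the unit is tokens, sentences or paragraphs.

```python
def fixed_size_chunks(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into character chunks of `chunk_size` that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")
    step = chunk_size - overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


print(fixed_size_chunks("abcdefghij", chunk_size=4, overlap=1))
# ['abcd', 'defg', 'ghij']
```

Each chunk repeats the last character of the previous one, which is the overlap that helps preserve context across chunk boundaries.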

Structure-Based Chunking: The aim of chunking is to keep
meaningful data together. If we are dealing with data in the form of
HTML, Markdown, JSON or even computer code, it makes more
sense to split the data based on the structure rather than a fixed
size.
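A minimal sketch of structure-based chunking for Markdown is shown below. It assumes that every line beginning with "#" is a heading and starts a new chunk, which is a simplification: a "#" line inside a code fence, for instance, would be split incorrectly. The function name is hypothetical.

```python
def split_markdown_by_heading(markdown: str) -> list[str]:
    """Start a new chunk at every Markdown heading so each section stays together."""
    chunks: list[str] = []
    current: list[str] = []
    for line in markdown.splitlines():
        if line.startswith("#") and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [chunk for chunk in chunks if chunk]


document = "# Setup\nInstall the package.\n\n## Usage\nRun the tool.\n"
for chunk in split_markdown_by_heading(document):
    print(repr(chunk))
```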

Context-Enriched Chunking: This method adds the summary of
the larger document to each chunk to enrich the context of the
smaller chunk. This makes more context available to the LLM
without adding too much noise. It also improves the retrieval
accuracy and maintains semantic coherence across chunks. This is
particularly useful in scenarios where a more holistic view of the
information is crucial. While this approach enhances the
understanding of the broader context, it adds a level of complexity
and comes at the cost of higher computational requirements,
increased storage needs and possible latency in retrieval.

Agentic Chunking: In agentic chunking, chunks from the text are
created based on a goal or a task. Consider an e-commerce
platform wanting to analyse customer reviews. The best way to
chunk the reviews is to put the reviews pertaining to a particular
topic in the same chunk. Similarly, the critical reviews and
positive reviews may be put in different chunks. To achieve this kind
of chunking, we will need to do sentiment analysis, entity extraction
and some kind of clustering. This can be achieved by a multi-agent
system. Agentic chunking is still an active area of research and
improvement.

Semantic Chunking: This method looks at the semantic similarity (or
similarity in meaning) between sentences. It first creates groups
of three sentences and then merges groups that are similar in
meaning. To measure the similarity in meaning, this method uses
embeddings. This is still an experimental chunking technique.

Small to big chunking: A hierarchical chunking method where the
text is first broken down into very small units (e.g., sentences,
paragraphs), and the small chunks are merged into larger ones
until the target chunk size is reached. Sliding window chunking uses
overlap between chunks to maintain context across chunk
boundaries.


Figure 5: Big to Small Sliding Window Chunking


Chunk Size: The size of the chunks, which is measured in the
number of tokens in the chunk, can have a significant impact on the
quality of the RAG system. While large chunks provide better
context, they also carry a lot of noise. Smaller chunks, on the other
hand, have precise information but they might miss important
information.

Metadata Filtering: Adding metadata like timestamp, author,
category, etc. can enhance the chunks. While retrieving, chunks can
first be filtered by relevant metadata information before doing a
similarity search. This improves retrieval efficiency and reduces
noise in the system. For example, using the timestamp filters can
help avoid outdated information in the knowledge base.
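The "filter first, then search" idea can be sketched as below. The chunk records, the `year` field and the two-dimensional vectors are made-up illustrative data, and the function names are hypothetical. A vector database would normally apply such filters internally.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


# Toy chunks with made-up embeddings and a timestamp-like metadata field.
chunks = [
    {"text": "2022 pricing policy", "year": 2022, "vector": [0.9, 0.1]},
    {"text": "2024 pricing policy", "year": 2024, "vector": [0.8, 0.2]},
    {"text": "2024 holiday calendar", "year": 2024, "vector": [0.1, 0.9]},
]


def filtered_search(query_vector: list[float], chunks: list[dict], min_year: int, top_k: int = 1) -> list[dict]:
    """Filter chunks by metadata first, then rank the survivors by similarity."""
    candidates = [chunk for chunk in chunks if chunk["year"] >= min_year]
    candidates.sort(key=lambda chunk: cosine(query_vector, chunk["vector"]), reverse=True)
    return candidates[:top_k]


results = filtered_search([1.0, 0.0], chunks, min_year=2024)
print(results[0]["text"])  # 2024 pricing policy
```

Without the filter, the outdated 2022 chunk would rank first for this query vector, which is the kind of noise the metadata filter removes.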

Metadata Enhancement: Metadata like chunk summary,
sentiment, category, etc. that can be inferred from the content, going
beyond tags like source, timestamp and author, can be used to
enhance retrieval.

Parent Child Indexing: A document structure where documents
are organised hierarchically. The parent document contains
overarching themes or summaries, while child documents delve into
specific details. During retrieval, the system can first locate the
most relevant child documents and then refer to the parent
documents for additional context if needed. This approach
enhances the precision of retrieval while maintaining the broader
context. At the same time, this hierarchical structure can present
challenges in terms of memory requirements and computational
load.

Embeddings: Computers, at the very core, do mathematical
calculations. Mathematical calculations are done on numbers.
Therefore, for a computer to process any kind of non-numeric data
like text or images, it must first be converted into a numerical form.
Embeddings are a design pattern that is extremely useful for RAG.
Embeddings are vector representations of data. As a general
definition, embeddings are data that has been transformed into
n-dimensional vectors. A word embedding is a vector representation
of a word.

Figure 6: Word Embeddings are vector representation of words

Cosine Similarity: Embeddings are popular because they help in
establishing semantic relationships between words, phrases, and
documents. Cosine similarity is calculated as the cosine of the
angle between two vectors. The cosine of the angle between parallel
vectors (0 degrees) is 1, the cosine of a right angle (90 degrees)
is 0, and the cosine of the angle between opposite vectors
(180 degrees) is -1. Therefore, the cosine similarity lies between
-1 and 1, where unrelated terms have a value close to 0 and related
terms have a value close to 1.
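For two vectors A and B, the cosine similarity is cos(theta) = (A . B) / (|A| |B|), i.e. the dot product divided by the product of the vector lengths. A small sketch of this calculation follows; the function name is an assumption, and the three example pairs reproduce the parallel, perpendicular and opposite cases described above.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between vectors a and b."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


print(cosine_similarity([1, 0], [2, 0]))   # parallel vectors
print(cosine_similarity([1, 0], [0, 3]))   # perpendicular vectors
print(cosine_similarity([1, 0], [-1, 0]))  # opposite vectors
```

Note that the length of the vectors does not matter, only their direction: [1, 0] and [2, 0] are treated as identical in meaning.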

Figure 7: Cosine Similarity between vectors

Word2Vec: Word2Vec is a shallow neural network-based model for
learning word embeddings developed by researchers at Google. It
is one of the earliest embedding techniques.


GloVe: Global Vectors for Word Representations is an unsupervised
learning technique developed by researchers at Stanford University.

FastText: FastText is an extension of Word2Vec developed by
Facebook AI Research. It is particularly useful for handling
misspellings and rare words.

ELMo: Embeddings from Language Models was developed by
researchers at Allen Institute for AI. ELMo embeddings have been
shown to improve performance on question answering and
sentiment analysis tasks.

BERT: Bidirectional Encoder Representations from Transformers,
developed by researchers at Google, is a Transformer
architecture-based model. It provides contextualized word
embeddings by considering bidirectional context, achieving
state-of-the-art performance on various natural language processing
tasks.

Pre-trained Embeddings Models: Embeddings models that have
been trained on a large corpus of data can also generalise well
across a number of tasks and domains. There are a variety of
proprietary and open-source pre-trained embeddings models that
are available to use. This is also one of the reasons why the usage
of embeddings has exploded in popularity across machine learning
applications.

Vector Databases: Vector Databases are built to handle
high-dimensional vectors like embeddings. These databases specialise in
indexing and storing vector embeddings for fast semantic search
and retrieval.

Vector Indices: These are libraries that focus on the core features
of indexing and search. They do not support data management,
query processing, interfaces, etc. They can be considered a
bare-bones vector database. Examples of vector indices are Facebook
AI Similarity Search (FAISS), Non-Metric Space Library (NMSLIB),
Approximate Nearest Neighbors Oh Yeah (ANNOY), etc.

Figure 8: Creation and maintenance of non-parametric database via the indexing pipeline

Generation Pipeline: The set of processes that is employed to
search and retrieve information from the knowledge base to
generate responses to user queries. It facilitates real-time
interaction with users.


Information Retrieval: IR is the science of searching. Whether you
are searching for information in a document, or searching for
documents themselves, it falls under the gamut of Information
Retrieval.

Retriever: The component of the generation pipeline that uses an
algorithm to search and retrieve relevant information from the
knowledge base.

Boolean retrieval: A simple keyword-based search where Boolean
logic is used to match documents with queries based on absence
or presence of the words. Documents are retrieved if they contain
the exact terms in the query, often combined with AND, NOT and
OR operators.
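Boolean retrieval is commonly implemented with an inverted index that maps each term to the set of documents containing it, so that AND, OR and NOT become set operations. The sketch below uses a tiny made-up corpus, and the helper names are assumptions for illustration.

```python
def build_inverted_index(docs: dict[int, str]) -> dict[str, set[int]]:
    """Map each lowercase term to the set of document ids that contain it."""
    index: dict[str, set[int]] = {}
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index.setdefault(term, set()).add(doc_id)
    return index


def postings(index: dict[str, set[int]], term: str) -> set[int]:
    """Documents containing the term (empty set if the term is unknown)."""
    return index.get(term.lower(), set())


docs = {
    1: "retrieval augmented generation",
    2: "dense retrieval with embeddings",
    3: "prompt engineering for generation",
}
index = build_inverted_index(docs)

and_result = postings(index, "retrieval") & postings(index, "generation")  # retrieval AND generation
or_result = postings(index, "retrieval") | postings(index, "generation")   # retrieval OR generation
not_result = postings(index, "retrieval") - postings(index, "embeddings")  # retrieval AND NOT embeddings
print(and_result, or_result, not_result)
```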

TF-IDF: Term Frequency-Inverse Document Frequency is a statistical
measure used to evaluate the importance of a word in a document
relative to a collection of documents (corpus). It assigns higher
weights to words that appear frequently in a document but
infrequently across the corpus.
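One common textbook form of the measure is tf-idf(t, d) = tf(t, d) x log(N / df(t)), where tf is the share of the document made up of term t, N is the number of documents and df(t) is the number of documents containing t. The sketch below implements that variant on a toy corpus. Libraries often apply additional smoothing or normalisation, so their numbers will differ.

```python
import math
from collections import Counter


def tf_idf(corpus: list[list[str]]) -> list[dict[str, float]]:
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    n_docs = len(corpus)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for doc in corpus for term in set(doc))
    weights = []
    for doc in corpus:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
]
weights = tf_idf(corpus)
print(weights[0])
```

In the first document, "the" occurs in every document and gets a weight of zero, while "mat" occurs only there and gets the highest weight.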

BM25: Best Match 25 is an advanced probabilistic model used to
rank documents based on the query terms appearing in each
document. It is part of the family of probabilistic information
retrieval models and is considered an advancement over the
classic TF-IDF model. The improvement that BM25 brings is that it
adjusts for the length of the documents so that longer documents
do not unfairly get higher scores.
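The length adjustment can be seen in a small sketch of the Okapi BM25 scoring function. The parameter defaults (k1 = 1.5, b = 0.75) and the "+1" inside the logarithm of the IDF term are common choices assumed here, not values taken from this taxonomy. The corpus is made up.

```python
import math
from collections import Counter


def bm25_scores(query: list[str], corpus: list[list[str]], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every document in the corpus against the query terms."""
    n_docs = len(corpus)
    avg_len = sum(len(doc) for doc in corpus) / n_docs
    doc_freq = Counter(term for doc in corpus for term in set(doc))
    scores = []
    for doc in corpus:
        counts = Counter(doc)
        score = 0.0
        for term in query:
            if term not in counts:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            tf = counts[term]
            # Longer-than-average documents get a larger denominator.
            length_norm = 1 - b + b * len(doc) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


corpus = [
    ["bm25", "ranking"],
    ["bm25", "ranking", "is", "used", "by", "many", "search", "engines", "today", "too"],
    ["vector", "search"],
]
scores = bm25_scores(["bm25"], corpus)
print(scores)
```

Both of the first two documents contain "bm25" exactly once, but the shorter one scores higher because of the length normalisation. The third document does not contain the term and scores zero.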

Static Word Embeddings: Static embeddings like Word2Vec and
GloVe represent words as dense vectors in a continuous vector
space, capturing semantic relationships based on context. For
instance, "king" - "man" + "woman" approximates "queen". These
embeddings can capture nuances like similarity and analogy, which
bag-of-words (BoW), TF-IDF and BM25 miss.

Contextual Embeddings: Contextual embeddings, generated by
models such as BERT or OpenAI's text embeddings, produce
high-dimensional, context-aware representations for queries and
documents. These models, based on transformers, capture deep
semantic meanings and relationships. For example, a query about
"apple" will retrieve documents discussing apple the fruit or Apple
the technology company, depending on the context of the input query.


Figure 9: Static Embeddings vs Contextual Embeddings.


Learned Sparse Retrieval: Generate sparse, interpretable
representations using neural networks. Examples: SPLADE, DeepCT,
DocT5Query

Dense Retrieval: Encode queries and documents as dense vectors
for semantic matching. Examples: DPR (Dense Passage Retriever),
ANCE, RepBERT

Hybrid Retrieval: Combine sparse and dense methods for
balanced efficiency and effectiveness. Examples: ColBERT, COIL

Cross-Encoder Retrieval: Directly compare query-document pairs
using transformer models. Example: BERT-based rerankers

Graph-based Retrieval: Leverage graph structures to model
relationships between documents. Examples: TextGraphs, Graph
Neural Networks for IR

Quantum-inspired Retrieval: Apply quantum computing
principles to information retrieval. Example: Quantum Language
Models (QLM)

Neural IR models: Encompass various neural network-based
approaches to information retrieval. Examples: NPRF (Neural PRF),
KNRM (Kernel-based Neural Ranking Model)

Augmentation: The process of combining the user query and the
retrieved documents from the knowledge base.

Prompt Engineering: The technique of giving instructions to an
LLM to attain a desired outcome. The goal of Prompt Engineering is
to construct the prompts to achieve accuracy and relevance in the
LLM responses with respect to the desired outcome(s).


Contextual Prompting: Adding an instruction like “Answer only
based on the context provided.” to set up the generation to
rely only on the provided information and not on the LLM’s internal
knowledge (or parametric knowledge).

Controlled Generation Prompting: Adding an instruction like, “If
the question cannot be answered based on the provided context,
say I don’t know.” to ensure that the model's responses are
grounded in the retrieved information.
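Both instructions can be combined in a single prompt template. The sketch below shows one possible layout: the instruction wording is taken from the two definitions above, while the function name and the "Context / Question / Answer" structure are assumptions for illustration.

```python
def build_rag_prompt(query: str, retrieved_chunks: list[str]) -> str:
    """Assemble a prompt with contextual and controlled generation instructions."""
    context = "\n\n".join(retrieved_chunks)
    return (
        # Contextual prompting: restrict the model to the supplied context.
        "Answer only based on the context provided.\n"
        # Controlled generation prompting: give the model a way to decline.
        "If the question cannot be answered based on the provided context, "
        "say I don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n"
        "Answer:"
    )


prompt = build_rag_prompt(
    "How long is the refund window?",
    ["The refund window for online orders is 30 days."],
)
print(prompt)
```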

Few Shot Prompting: LLMs adhere quite well to examples
provided in the prompt. While providing the retrieved information in
the prompt, a few examples can be specified to guide the
generation in the way the retrieved information needs to be used.

Chain of Thought Prompting: It has been observed that the
introduction of intermediate “reasoning” steps improves the
performance of LLMs in tasks that require complex reasoning like
arithmetic, common sense, and symbolic reasoning.

Self Consistency: While Chain of Thought (CoT) prompting uses a
single reasoning chain, Self-Consistency aims to sample
multiple diverse reasoning paths and use their respective
generations to arrive at the most consistent answer.
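A simple way to pick the "most consistent" answer is a majority vote over the final answers of the sampled reasoning paths. The sketch below assumes exactly that. `fake_sampler` is a hypothetical stand-in that returns canned answers in place of real LLM samples.

```python
from collections import Counter


def self_consistent_answer(sample_fn, question: str, n_samples: int = 5) -> str:
    """Sample several answers and return the one that occurs most often."""
    answers = [sample_fn(question) for _ in range(n_samples)]
    most_common_answer, _count = Counter(answers).most_common(1)[0]
    return most_common_answer


# Canned final answers standing in for five sampled reasoning paths.
canned_answers = iter(["18", "18", "20", "18", "17"])


def fake_sampler(question: str) -> str:
    # Hypothetical placeholder for an LLM call with sampling enabled.
    return next(canned_answers)


answer = self_consistent_answer(
    fake_sampler, "A farmer has 20 eggs and sells 2. How many are left?", n_samples=5
)
print(answer)  # 18
```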

Generated Knowledge Prompting: This technique explores the
idea of prompt-based knowledge generation by dynamically
constructing relevant knowledge chains, leveraging models' latent
knowledge to strengthen reasoning.


Tree of Thoughts Prompting: This technique maintains an
explorable tree structure of coherent intermediate thought steps
aimed at solving problems.

Automatic Reasoning and Tool-use (ART): The ART framework
automatically interleaves model generations with tool use for
complex reasoning tasks. ART leverages demonstrations to
decompose problems and integrate tools without task-specific
scripting.

Automatic Prompt Engineer (APE): The APE framework
automatically generates and selects optimal instructions to guide
models. It leverages a large language model to synthesise
candidate prompt solutions for a task based on output
demonstrations.

Active Prompt: Active-Prompt improves Chain-of-thought
methods by dynamically adapting Language Models to
task-specific prompts through a process involving query, uncertainty
analysis, human annotation, and enhanced inference.

ReAct Prompting: ReAct prompts LLMs to generate reasoning
traces and task-specific actions together, improving performance by
interacting with external tools for information retrieval. When
combined with CoT, it makes use of both internal knowledge and
external information, enhancing the interpretability and
trustworthiness of LLMs.

Recursive Prompting: Recursive prompting breaks down complex
problems into sub-problems, solving them sequentially using
prompts. This method aids compositional generalisation in tasks like
math problems or question answering, with the model building on
solutions from previous steps.

Foundation Models: Pre-trained large language models that have
been trained on massive data. Some LLM developers make the
model weights public to foster collaboration and community-driven
innovation, while others have kept the models closed, offering
support, managed services and a better user experience.

Supervised Fine-Tuning (SFT): Supervised fine-tuning is a process
used to adapt a pre-trained language model for specific tasks or
behaviours like question-answering or chat. It involves further
training a pre-trained foundation model on a labeled dataset,
where the model learns to map inputs to specific desired outputs.


Figure 10: Supervised fine tuning is a classification model training process


Small Language Models (SLMs): Smaller models with parameter
sizes in millions or a few billion offer benefits such as faster
inference times, lower resource usage and easier deployment on
edge devices or resource-constrained environments.



Evaluation
Evaluation Metrics: Quantitative measures used to evaluate the
performance of retrieval, generation and the overall RAG system.

Accuracy: The proportion of correct predictions (both true
positives and true negatives) among the total number of cases
examined. Even though accuracy is a simple, intuitive metric, it is
not the primary metric for retrieval. In a large knowledge base, the
majority of documents are usually irrelevant to any given query,
which can lead to misleadingly high accuracy scores. It does not
consider the ranking of the retrieved results.

Precision: It measures the proportion of retrieved documents that
are relevant to the user query. It answers the question, “Of all the
documents that were retrieved, how many were actually relevant?”


Precision@k: A variation of precision that measures the proportion
of relevant documents amongst the top ‘k’ retrieved results. It is
particularly important because it focuses on the top results rather
than all the retrieved documents.

Recall: It measures the proportion of the relevant documents
retrieved from all the relevant documents in the corpus. It answers
the question, “Of all the relevant documents, how many were
actually retrieved?”. Like precision, recall also doesn’t consider the
ranking of the retrieved documents. It can also be misleading as
retrieving all documents in the knowledge base will result in a
perfect recall value.

F1-score: The harmonic mean of precision and recall. It provides a
single metric that balances both the quality and coverage of the
retriever.
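The set-based metrics above can be computed as in the following sketch. It assumes that retrieved results are a ranked list of unique document ids and that the relevant documents are known as a set. The ids are made up for illustration.

```python
def precision(retrieved: list[str], relevant: set[str]) -> float:
    """Share of retrieved documents that are relevant."""
    if not retrieved:
        return 0.0
    return sum(1 for doc in retrieved if doc in relevant) / len(retrieved)


def recall(retrieved: list[str], relevant: set[str]) -> float:
    """Share of relevant documents that were retrieved."""
    if not relevant:
        return 0.0
    return sum(1 for doc in retrieved if doc in relevant) / len(relevant)


def f1_score(retrieved: list[str], relevant: set[str]) -> float:
    """Harmonic mean of precision and recall."""
    p, r = precision(retrieved, relevant), recall(retrieved, relevant)
    return 2 * p * r / (p + r) if p + r > 0 else 0.0


def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Precision computed over the top k results only."""
    return precision(retrieved[:k], relevant)


retrieved = ["d1", "d4", "d2", "d7"]  # ranked results, best first
relevant = {"d1", "d2", "d3"}         # documents judged relevant

print(precision(retrieved, relevant))          # 2 of 4 retrieved are relevant
print(recall(retrieved, relevant))             # 2 of 3 relevant were retrieved
print(f1_score(retrieved, relevant))
print(precision_at_k(retrieved, relevant, 1))  # only the top result is considered
```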


Figure 11: Precision and Recall


Mean Reciprocal Rank (MRR): Useful in evaluating the rank of
the relevant document, it measures the reciprocal of the rank of
the first relevant document in the list of results. MRR is the mean
of this reciprocal rank over a set of queries.

Mean Average Precision (MAP): A metric that combines
precision and recall at different cut-off levels of ‘k’ i.e. the cut-off
number for the top results. It calculates a measure called Average
Precision and then averages it across all queries.
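Both rank-aware metrics can be sketched as follows. Average Precision is computed here in its usual form: precision@k is taken at each rank k where a relevant document appears, and the sum is divided by the number of relevant documents. The two example queries and their document ids are made up.

```python
def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant document, or 0 if none is retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1 / rank
    return 0.0


def average_precision(ranked: list[str], relevant: set[str]) -> float:
    """Mean of precision@k over the ranks k at which relevant documents appear."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank  # precision at this cut-off
    return total / len(relevant)


# Each entry: (ranked results for a query, set of relevant documents).
runs = [
    (["d3", "d1", "d2"], {"d1", "d2"}),  # first relevant result at rank 2
    (["d5", "d6", "d4"], {"d5"}),        # first relevant result at rank 1
]
mrr = sum(reciprocal_rank(ranked, rel) for ranked, rel in runs) / len(runs)
map_score = sum(average_precision(ranked, rel) for ranked, rel in runs) / len(runs)
print(mrr, map_score)
```

For the first query the Average Precision is (1/2 + 2/3) / 2 = 7/12, and for the second it is 1, giving a MAP of 19/24. The MRR is (1/2 + 1) / 2 = 0.75.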

Normalised Discounted Cumulative Gain (nDCG): Evaluates
the ranking quality by considering the position of relevant
documents in the result list and assigning higher scores to relevant
documents appearing earlier. It is particularly effective for
scenarios where documents have varying degrees of relevance.
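A compact sketch of nDCG is given below, using the common formulation DCG = sum of rel_i / log2(i + 1) over ranks i. As a simplifying assumption, the ideal ordering is derived from the retrieved list itself. A stricter evaluation would build the ideal ranking from all judged documents for the query.

```python
import math


def dcg(relevances: list[float]) -> float:
    """Discounted cumulative gain: graded relevance discounted by log2 of the rank."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances, start=1))


def ndcg(relevances: list[float]) -> float:
    """DCG divided by the DCG of the ideal (descending relevance) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


# Graded relevance of the results in ranked order (3 = highly relevant, 0 = irrelevant).
print(ndcg([3, 2, 1]))  # already in the ideal order
print(ndcg([1, 2, 3]))  # best document ranked last, so the score drops
```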

Context relevance: Evaluates how well the retrieved documents
relate to the original query. The key aspects are topical alignment,
information usefulness and redundancy. The retrieved context
should contain only information relevant to the query or the prompt.
For context relevance, a metric ‘S’ is estimated. ‘S’ is the number of
sentences in the retrieved context that are relevant for responding
to the query or the prompt. The context relevance score is then
commonly expressed as the ratio of ‘S’ to the total number of
sentences in the retrieved context.

Figure 12: Context relevance evaluates the degree to which the retrieved information is
relevant to the query

Answer Faithfulness: A measure of the extent to which the
response is factually grounded in the retrieved context. Faithfulness
ensures that the facts in the response do not contradict the context
and can be traced back to the source. It also ensures that the LLM
is not hallucinating. Faithfulness first identifies the number of
“claims” made in the response and calculates the proportion of
those “claims” present in the context.

Hallucination Rate: Calculates the proportion of generated claims
in the response that are not present in the retrieved context.

Coverage: Measures the number of relevant claims in the context
and calculates the proportion of relevant claims present in the
generated response. This measures how much of the relevant
information from the retrieved passages is included in the
generated answer.
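The three claim-based metrics (faithfulness, hallucination rate and coverage) can be related to one another with a small sketch. It assumes that claims have already been extracted and can be compared as exact strings. In practice, claim extraction and matching are usually done with an LLM or an entailment model. The claims below are made up.

```python
def faithfulness(response_claims: set[str], context_claims: set[str]) -> float:
    """Share of claims in the response that are present in the context."""
    if not response_claims:
        return 0.0
    return len(response_claims & context_claims) / len(response_claims)


def hallucination_rate(response_claims: set[str], context_claims: set[str]) -> float:
    """Share of claims in the response that are NOT present in the context."""
    if not response_claims:
        return 0.0
    return len(response_claims - context_claims) / len(response_claims)


def coverage(response_claims: set[str], relevant_context_claims: set[str]) -> float:
    """Share of relevant context claims that appear in the response."""
    if not relevant_context_claims:
        return 0.0
    return len(relevant_context_claims & response_claims) / len(relevant_context_claims)


context_claims = {"refunds within 30 days", "receipt required", "store credit only"}
response_claims = {"refunds within 30 days", "receipt required", "free shipping on returns"}

print(faithfulness(response_claims, context_claims))        # 2 of 3 response claims are grounded
print(hallucination_rate(response_claims, context_claims))  # 1 of 3 response claims is unsupported
print(coverage(response_claims, context_claims))            # 2 of 3 context claims are covered
```

With this definition, faithfulness and hallucination rate always add up to 1 for a non-empty response.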

Answer Relevance: A measure of the extent to which the response
is relevant to the query. This metric focuses on key aspects such
as the system’s ability to comprehend the query, the response being
pertinent to the query and the completeness of the response.

Ground truth: Information that is known to be real or true. In RAG,
and the Generative AI domain in general, Ground Truth is a prepared
set of Prompt-Context-Response or Question-Context-Response
examples, akin to labelled data in Supervised Machine Learning
parlance. Ground truth data that is created for your knowledge
base can be used for evaluation of your RAG system.

Human Evaluation: A subject matter expert reviews the
documents and determines the relevance and accuracy of the
outputs.

Noise Robustness: It is impractical to assume that the information
stored in the knowledge base for RAG systems is perfectly curated
to answer the questions that can be potentially asked of the
system. It is very probable that a document is related to the user
query but does not have any meaningful information to answer the
query. The ability of the RAG system to separate these noisy
documents from the relevant ones is Noise Robustness.

Negative Rejection: By nature, Large Language Models always
generate text. It is possible that there is absolutely no information
about the user query in the documents in the knowledge base. The
ability of the RAG system to not give an answer when there is no
relevant information is Negative Rejection.
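One simple way to approximate negative rejection is a score threshold in front of the generator. The sketch below assumes the retriever returns (chunk, score) pairs; the threshold value, the refusal message and the `generate` callable are all hypothetical.

```python
from typing import Callable

REFUSAL = "I could not find an answer in the knowledge base."

def answer_or_reject(scored_chunks: list[tuple[str, float]],
                     generate: Callable[[list[str]], str],
                     min_score: float = 0.5) -> str:
    # Keep only chunks whose retrieval score clears the (assumed) threshold.
    relevant = [chunk for chunk, score in scored_chunks if score >= min_score]
    if not relevant:
        return REFUSAL  # negative rejection: decline instead of guessing
    return generate(relevant)

# Toy generator that just joins the chunks it is given.
join_chunks = lambda chunks: " ".join(chunks)

print(answer_or_reject([("RAG uses retrieval.", 0.9), ("Noise.", 0.1)], join_chunks))
print(answer_or_reject([("Unrelated text.", 0.2)], join_chunks))
```

The same filter also gives a crude form of noise robustness, since low-scoring chunks never reach the prompt.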

Information Integration: It is also very likely that to answer a user
query comprehensively the information must be retrieved from
multiple documents. This ability of the system to assimilate
information from multiple documents is Information Integration.

Counterfactual Robustness: Sometimes the information in the
knowledge base might itself be inaccurate. A high-quality RAG
system should be able to address this and reject known
inaccuracies in the retrieved information. This ability is
Counterfactual Robustness.

Frameworks: Tools designed to facilitate evaluation by automating
the evaluation process and data generation. They are used to
streamline the evaluation process by providing a structured
environment for testing different aspects of a RAG system. They
are flexible and can be adapted to different datasets and metrics.

RAGAs: Retrieval Augmented Generation Assessment, or RAGAs, is
a framework developed by Exploding Gradients that assesses the
retrieval and generation components of RAG systems without
relying on extensive human annotations. RAGAs also helps in
synthetically generating a test dataset that can be used to
evaluate a RAG pipeline.

Synthetic Test Dataset Generation: Using models like LLMs to
automatically generate ground truth data from the knowledge
base.

LLM as a judge: Using an LLM to evaluate the output of a task.
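As an illustration of the LLM-as-a-judge idea, the sketch below builds a grading prompt and parses a YES/NO verdict. The prompt wording and the `llm` callable (any function that maps a prompt string to a reply string) are assumptions made for this example; it is not the prompt used by RAGAs or ARES.

```python
from typing import Callable

JUDGE_PROMPT = (
    "You are grading a RAG system.\n"
    "Context: {context}\n"
    "Question: {question}\n"
    "Answer: {answer}\n"
    "Is every claim in the answer supported by the context? Reply YES or NO."
)

def judge_faithfulness(question: str, context: str, answer: str,
                       llm: Callable[[str], str]) -> bool:
    # Fill the template, ask the judge model, and read the first word of its reply.
    prompt = JUDGE_PROMPT.format(context=context, question=question, answer=answer)
    verdict = llm(prompt).strip().upper()
    return verdict.startswith("YES")

# Stand-in judge for demonstration; a real system would call a model here.
fake_judge = lambda prompt: "Yes, the answer is supported."

print(judge_faithfulness("Where is Paris?", "Paris is in France.",
                         "Paris is in France.", fake_judge))
```

Constraining the judge to a fixed reply format keeps the parsing step trivial and makes the verdict easy to aggregate over a test dataset.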


Figure 13: Synthetic Test Dataset Generation


ARES: Automated RAG Evaluation System, or ARES, is a framework
developed by researchers at Stanford University and Databricks.
Like RAGAs, ARES uses an LLM-as-a-judge approach for
evaluations.

Benchmarks: Standardised datasets and their evaluation metrics
used to measure the performance of RAG systems. Benchmarks
provide a common ground for comparing different RAG
approaches. Benchmarks ensure consistency across the
evaluations by considering fixed tasks and evaluation criteria.

BEIR or Benchmarking Information Retrieval: A comprehensive,
heterogeneous benchmark that is based on 9 IR tasks and 19
Question-Answer datasets.

Figure 14: BEIR – 9 tasks and 18 (of 19) datasets (Source: BEIR: A Heterogeneous
Benchmark for Zero-shot Evaluation of Information Retrieval Models)

Retrieval Augmented Generation Benchmark (RGB):
Introduced in Dec 2023, it comprises 600 base questions and 400
additional questions, evenly split between English and Chinese. It is
a benchmark that focusses on four key abilities of a RAG system:
Noise Robustness, Negative Rejection, Information Integration and
Counterfactual Robustness.

Multihop RAG: Curated by researchers at HKUST, Multihop RAG
contains 2556 queries, with evidence for each query distributed
across 2 to 4 documents. The queries also involve document
metadata, reflecting complex scenarios commonly found in
real-world RAG applications.

Comprehensive RAG (CRAG): Curated by Meta and HKUST, CRAG is a
factual question answering benchmark of 4,409 question-answer
pairs and mock APIs to simulate web and Knowledge Graph (KG)
search. It contains 8 types of queries across 5 domains.

Pipeline Design
Naive RAG: A basic linear approach to RAG with a sequential
indexing, retrieval, augmentation and generation process.

Retrieve-Read: A framework in which a retriever retrieves
information and the LLM reads this information to generate the
results.


Figure 15: Naïve RAG is a sequential “Retrieve-Read” process.
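The sequential process can be sketched end to end in a few lines. This is a toy illustration only: word overlap stands in for embedding similarity, and the `llm` argument is a hypothetical callable that maps a prompt to a response.

```python
import re
from collections import Counter
from typing import Callable

def tokens(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())

def index(docs: list[str]) -> list[tuple[str, Counter]]:
    # Indexing: store each document next to its bag of words.
    return [(doc, Counter(tokens(doc))) for doc in docs]

def retrieve(query: str, idx: list[tuple[str, Counter]], k: int = 1) -> list[str]:
    # Retrieval: rank documents by the number of words shared with the query.
    q = Counter(tokens(query))
    ranked = sorted(idx, key=lambda item: sum((q & item[1]).values()), reverse=True)
    return [doc for doc, _ in ranked[:k]]

def augment(query: str, chunks: list[str]) -> str:
    # Augmentation: place the retrieved context in front of the question.
    return "Context:\n" + "\n".join(chunks) + f"\n\nQuestion: {query}"

def naive_rag(query: str, docs: list[str], llm: Callable[[str], str]) -> str:
    # Generation: the LLM reads the augmented prompt.
    return llm(augment(query, retrieve(query, index(docs))))

docs = ["RAG combines retrieval with generation.",
        "Bananas are rich in potassium."]
echo_llm = lambda prompt: prompt  # stand-in that returns the prompt it receives
print(naive_rag("What does RAG combine?", docs, echo_llm))
```

Each function maps to one stage of the pipeline, which makes it easy to see where the Advanced RAG interventions described later would slot in.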

RAG Failure Points: A RAG system can misfire if the retriever
fails to retrieve the entire context or retrieves irrelevant context,
or if the LLM, despite being provided the context, does not consider
it or, instead of answering the query, picks irrelevant information
from the context.

Disjointed Context: When information is retrieved from different
source documents, the transition between two chunks becomes
abrupt.

Over-reliance on Context: It is noticed, sometimes, that the LLM
becomes over-reliant on the retrieved context and forgets to draw
from its own parametric memory.

Advanced RAG: Pipeline with interventions at pre-retrieval,
retrieval and post-retrieval stages to overcome the limitations of
Naive RAG.

Figure 16: Advanced RAG is a Rewrite-Retrieve-Rerank-Read process as compared to a
Retrieve-Read Naïve RAG process

Rewrite-Retrieve-Rerank-Read: An improvement upon the
Retrieve-Read framework that adds rewriting and reranking
components.

Index Optimisation: Employed in the indexing pipeline, the
objective of index optimisation is to set up the knowledge base for
better retrieval.

Query Optimisation: The objective of query optimisation is to align
the input user query in a manner that makes it better suited for the
retrieval tasks.

Query Expansion: The original user query is enriched with the aim
of retrieving more relevant information. This helps in increasing the
recall of the system and overcomes the challenge of incomplete or
very brief user queries.

Multi-query expansion: In this approach, multiple variations of
the original query are generated using an LLM and each variant
query is used to search and retrieve chunks from the knowledge
base.

Sub-query expansion: In this approach, instead of generating
variations of the original query, a complex query is broken down
into simpler sub-queries.

Step back expansion: The term comes from the step-back
prompting approach where the original query is abstracted to a
higher-level conceptual query.
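Multi-query expansion can be sketched as a loop over query variants with de-duplication of the retrieved chunks. Both `make_variants` (which would normally call an LLM) and `retrieve` are hypothetical callables supplied by the caller; the same loop works for sub-queries or a step-back query if `make_variants` returns those instead.

```python
from typing import Callable

def multi_query_retrieve(query: str,
                         make_variants: Callable[[str], list[str]],
                         retrieve: Callable[[str, int], list[str]],
                         k: int = 3) -> list[str]:
    seen: set[str] = set()
    merged: list[str] = []
    # Search with the original query first, then with every variant.
    for q in [query, *make_variants(query)]:
        for chunk in retrieve(q, k):
            if chunk not in seen:  # keep the first occurrence only
                seen.add(chunk)
                merged.append(chunk)
    return merged

# Toy knowledge base keyed by exact query text, for demonstration only.
fake_index = {
    "rag evaluation": ["c1", "c2"],
    "how to evaluate rag": ["c2", "c3"],
    "rag metrics": ["c4"],
}
toy_retrieve = lambda q, k: fake_index.get(q, [])[:k]
toy_variants = lambda q: ["how to evaluate rag", "rag metrics"]

print(multi_query_retrieve("rag evaluation", toy_variants, toy_retrieve))
```

The variants surface chunks (c3, c4) that the original query alone would have missed, which is the recall gain the expansion aims for.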

Query Transformation: In query transformation, instead of the
original user query, retrieval happens on a transformed query which
is more suitable for the retriever.

Query Rewriting: Queries are rewritten from the input. The input in
quite a few real-world applications may not be a direct query or a
query suited for retrieval. Based on the input, a language model can
be trained to transform the input into a query which can be used
for retrieval.

Hypothetical document embedding: HyDE is a technique where
the language model first generates a hypothetical answer to the
user's query without accessing the knowledge base. This generated
answer is then used to perform a similarity search against the
document embeddings in the knowledge base, effectively retrieving
documents that are similar to the hypothetical answer rather than
the query itself.
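A toy sketch of the HyDE flow described above. The bag-of-words `embed` function is a deliberately crude stand-in for a real embedding model, and `llm` is a hypothetical callable that drafts the hypothetical answer.

```python
import math
import re
from collections import Counter
from typing import Callable

def embed(text: str) -> Counter:
    # Toy "embedding": word counts. A real system would use a dense model.
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def hyde_retrieve(query: str, docs: list[str],
                  llm: Callable[[str], str], k: int = 1) -> list[str]:
    # 1. Draft a hypothetical answer without touching the knowledge base.
    hypothetical = llm(f"Write a short passage that answers: {query}")
    # 2. Search with the embedding of that answer instead of the query.
    h = embed(hypothetical)
    return sorted(docs, key=lambda d: cosine(h, embed(d)), reverse=True)[:k]

docs = ["Insulin lowers blood glucose levels.",
        "The Eiffel Tower is in Paris."]
fake_llm = lambda prompt: "Insulin lowers blood glucose."
print(hyde_retrieve("How does the body control sugar?", docs, fake_llm))
```

The query shares no words with the first document, but the hypothetical answer does, which is why searching with it finds the right passage in this toy setup.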

Query Routing: Optimising the user query by routing it to the
appropriate workflow based on criteria like intent, domain,
language, complexity, source of information etc.
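A rule-based router is the simplest form of query routing. The route names and keyword lists below are invented for illustration; production systems often use a classifier or an LLM for the same decision.

```python
# Hypothetical routes: workflow name -> trigger keywords.
ROUTES = {
    "billing": ["invoice", "refund", "payment"],
    "technical": ["error", "install", "crash"],
}

def route(query: str, default: str = "general") -> str:
    words = set(query.lower().split())
    for name, keywords in ROUTES.items():
        if words & set(keywords):  # any keyword present in the query
            return name
    return default  # fall back when no rule matches

print(route("I need a refund"))
print(route("The app shows an error on install"))
print(route("Hello there"))
```

Each returned name would map to a separate workflow, for example a different knowledge base or prompt template.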

Hybrid Retrieval: Hybrid retrieval strategy is an essential
component of production-grade RAG systems. It involves
combining retrieval methods for improved retrieval accuracy. This
can mean simply using a keyword-based search along with
semantic similarity. It can also mean combining sparse
embedding, dense embedding and knowledge graph-based
search.

Figure 17: Hybrid retriever employs multiple querying techniques and combines the results
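The text does not say how the results of the different retrievers are combined. One widely used option is Reciprocal Rank Fusion (RRF), sketched here as an assumption rather than as the author's method; the constant k=60 is the value conventionally used with RRF.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every document it returns.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits = ["d1", "d2", "d3"]    # e.g. from a keyword-based search
semantic_hits = ["d3", "d1", "d4"]   # e.g. from a dense vector search

print(reciprocal_rank_fusion([keyword_hits, semantic_hits]))
```

Documents returned by both retrievers (d1, d3) rise above those returned by only one, without needing the raw scores of the two systems to be comparable.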

Iterative Retrieval: In this approach the retrieval happens ’n’
number of times and the generated response is used to further
retrieve documents each time.

Recursive Retrieval: This approach improves upon the iterative
approach by transforming the retrieval query after each
generation.

Adaptive Retrieval: This is a more intelligent approach where an
LLM determines the most appropriate moment and the most
appropriate content for retrieval.
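The iterative loop can be sketched as follows, assuming hypothetical `retrieve` and `generate` callables. Here the previous response is simply appended to the query for the next retrieval round; a recursive variant would rewrite the query instead.

```python
from typing import Callable

def iterative_rag(query: str,
                  retrieve: Callable[[str], list[str]],
                  generate: Callable[[str, list[str]], str],
                  n: int = 2) -> tuple[str, list[str]]:
    search_text, context, answer = query, [], ""
    for _ in range(n):
        for chunk in retrieve(search_text):
            if chunk not in context:
                context.append(chunk)
        answer = generate(query, context)
        # The generated response guides the next retrieval round.
        search_text = f"{query} {answer}"
    return answer, context

# Toy knowledge base: a chunk is returned when its key appears in the search text.
kb = {"einstein": "Einstein was born in Ulm.", "ulm": "Ulm is a city in Germany."}
toy_retrieve = lambda text: [v for key, v in kb.items() if key in text.lower()]
toy_generate = lambda query, context: " ".join(context)

answer, context = iterative_rag("Where was Einstein born?", toy_retrieve, toy_generate)
print(answer)
```

The second chunk is only reachable after the first response mentions Ulm, which is the multi-step behaviour iterative retrieval is meant to provide.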

Contextual Compression: Reduce the length of the retrieved
information by extracting only the parts that are relevant and
important to the query. This also has a positive effect on the cost
and efficiency of the system.
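A minimal extractive sketch of contextual compression: keep only the sentences of a chunk that share a content word with the query. The word-overlap test is an assumption for illustration; real compressors often use an LLM or a trained extractor.

```python
import re

def content_words(text: str) -> set[str]:
    # Words longer than three characters, as a crude stop-word filter.
    return {w for w in re.findall(r"[a-z]+", text.lower()) if len(w) > 3}

def compress(chunk: str, query: str) -> str:
    query_terms = content_words(query)
    kept = [sentence
            for sentence in re.split(r"(?<=[.!?])\s+", chunk)
            if content_words(sentence) & query_terms]
    return " ".join(kept)

chunk = ("Our store opened in 1990. Refund requests are accepted within 30 days. "
         "We also sell gift cards.")
print(compress(chunk, "What is the refund window?"))
```

Only one of the three sentences is passed on to the LLM, which shortens the prompt and therefore reduces token cost.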

Reranking: Retrieved information from different sources and
techniques can further be ranked to determine the most relevant
documents. Reranking, like hybrid retrieval, is increasingly becoming
a necessity in production RAG systems. To this end, commonly
available rerankers like multi-vector, Learning to Rank (LTR), BERT
based and even hybrid rerankers are gaining prominence.
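Reranking can be sketched as a second scoring pass over the retrieved candidates. The Jaccard overlap score below is only a placeholder for a real reranker such as a BERT-based cross-encoder; the function names are illustrative.

```python
from typing import Callable

def overlap_score(query: str, doc: str) -> float:
    # Jaccard similarity between the word sets of the query and the document.
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d) if q | d else 0.0

def rerank(query: str, candidates: list[str],
           score: Callable[[str, str], float], top_n: int = 2) -> list[str]:
    # Re-score every candidate against the query and keep the best top_n.
    ranked = sorted(candidates, key=lambda doc: score(query, doc), reverse=True)
    return ranked[:top_n]

candidates = ["cats sleep a lot", "python list sorting", "sorting a list in python"]
print(rerank("sorting a python list", candidates, overlap_score))
```

Because the scoring function is passed in, the same wrapper can be reused when the placeholder is swapped for a stronger model.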

Modular RAG: Modular RAG breaks down the traditional
monolithic RAG structure into interchangeable components. This
allows for tailoring of the system to specific use cases. The modular
approach brings modularity to RAG components like retrievers,
indexing, and generation, while also adding more modules like
search, memory, and fusion.

Search Module: Performs search across different data sources
and is customised to each of them.

RAG-Fusion: Improves traditional search systems by overcoming
their limitations through a multi-query approach.

Memory Module: The Memory Module leverages the inherent
'memory' of the LLM, meaning the knowledge encoded within its
parameters from pre-training.

Routing: Routing in the RAG system navigates through diverse
data sources, selecting the optimal pathway for a query, whether it
involves summarization, specific database searches, or merging
different information streams.

Task Adapter: This module makes RAG adaptable to various
downstream tasks, allowing the development of task-specific
end-to-end retrievers with minimal examples, demonstrating
flexibility in handling different tasks. The Task Adapter Module
allows the RAG system to be fine-tuned for specific tasks like
summarisation, translation, or sentiment analysis.



Operations Stack
Critical Layers: Fundamental components for the operation of a
RAG system. A RAG system is likely to fail if any of these layers are
missing or incomplete.

Data Layer: The data layer serves the critical role of creating and
storing the knowledge base for RAG. It is responsible for collecting
data from source systems, transforming it into a usable format and
storing it for efficient retrieval.

Model Layer: Predictive models enable generative AI applications.
Some models are provided by third parties and some need to be
custom trained or fine-tuned. Generating quick and cost-effective
model responses is also an important aspect of leveraging
predictive models. This layer holds the model library, training &
fine-tuning and the inference optimisation components.

Fully managed deployment: Deployment provided by service
providers where all infrastructure for model deployment, serving,
and scaling is managed and optimised by these providers.

Self-hosted deployment: In this scenario, models are deployed in
private clouds or on-premises, and the infrastructure is managed
by the application developer. Tools like Kubernetes and Docker are
widely used for containerisation and orchestration of models, while
Nvidia Triton Inference Server can optimise inference on Nvidia
hardware.

Local/edge deployment: Running optimised versions of models
on local hardware or edge devices, ensuring data privacy, reduced
latency, and offline functionality.

Application Orchestration Layer: Layer responsible for
managing the interactions amongst the other layers in the system.
It is a central coordinator that enables communication between
data, retrieval systems, generation models and other services.

Figure 18: Core RAGOps stack where Data, Model, Model Deployment and App
Orchestration layers interact with source systems and managed service providers and
co-ordinate with the application layer to interface with the user.

Essential Layers: Layers focussing on performance, reliability and
safety of the system. These essential components bring the system
to a standard that provides value to the user.

Prompt Layer: Manages the augmentation and other LLM
prompts.

Evaluation Layer: Manages regular evaluation of the retrieval
accuracy, context relevance, faithfulness and answer relevance of
the system, which is necessary to ensure the quality of responses.

Monitoring Layer: Continuous monitoring ensures the long-term
health of the RAG system. Understanding system behaviour and
identifying points of failure, assessing the relevance & adequacy of
information, and tracking regular system metrics like resource
utilisation, latency and error rates form part of the monitoring
layer.

LLM Security & Privacy Layer: RAG systems rely on large
knowledge bases stored in vector databases, which can contain
sensitive information. They need to follow all data privacy
regulations and apply data protection strategies like anonymisation,
encryption, differential privacy, query validation & sanitisation, and
output filtering to protect against attacks. Implementing
guardrails, access controls, monitoring and auditing are also
components of the security and privacy layer.


Caching Layer: Caching is becoming a very important component
of any LLM based application. This is because of the high costs and
inherent latency of generative AI models. With the addition of the
retrieval layer, the costs and latency increase further in RAG
systems.
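The simplest form of such a cache is an exact-match lookup on a normalised query, sketched below. The class and method names are invented for illustration; semantic caches that match similar (not identical) queries are a common extension but are not shown here.

```python
import hashlib
from typing import Callable

class ResponseCache:
    def __init__(self) -> None:
        self._store: dict[str, str] = {}

    @staticmethod
    def _key(query: str) -> str:
        # Normalise case and whitespace so trivially different queries collide.
        normalised = " ".join(query.lower().split())
        return hashlib.sha256(normalised.encode("utf-8")).hexdigest()

    def get_or_compute(self, query: str, compute: Callable[[str], str]) -> str:
        key = self._key(query)
        if key not in self._store:
            self._store[key] = compute(query)  # the expensive RAG call runs once
        return self._store[key]

calls = []
def expensive_pipeline(query: str) -> str:
    calls.append(query)  # record how often the pipeline really runs
    return query.upper()

cache = ResponseCache()
first = cache.get_or_compute("What is RAG?", expensive_pipeline)
second = cache.get_or_compute("  what is   rag? ", expensive_pipeline)
print(first, second, len(calls))
```

The second request is served from the cache, so the retrieval and generation steps are skipped entirely for it.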

Enhancement Layers: Layers improving the efficiency, scalability
and usability of the system. These components are used to make
the RAG system better and are decided based on the end
requirements.

Human-in-the-loop Layer: Provides critical oversight where
human judgment is necessary, especially for use-cases requiring
higher accuracy or ethical considerations.

Cost Optimisation Layer: Helps manage resources efficiently,
which is particularly important for large-scale systems.

Explainability and Interpretability Layer: Provides transparency
for system decisions, especially important for domains requiring
accountability.

Collaboration and Experimentation Layer: Useful for teams
working on development and experimentation but non-critical for
system operation.


Emerging Patterns
Knowledge Graph powered RAG: Using knowledge graph
structures not only increases the contextual understanding but also
equips the system with enhanced reasoning capabilities and
improved explainability.

Knowledge Graphs: Knowledge graphs organise data in a
structured manner as entities and relationships.
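Entities and relationships are often stored as (subject, relation, object) triples. The sketch below uses invented example triples and a plain dictionary; it illustrates the data structure only and does not reflect how any particular graph database stores its data.

```python
from collections import defaultdict

# Example triples (subject, relation, object), chosen for illustration.
triples = [
    ("Marie Curie", "born_in", "Warsaw"),
    ("Marie Curie", "won", "Nobel Prize in Physics"),
    ("Warsaw", "capital_of", "Poland"),
]

# Adjacency list: entity -> list of (relation, neighbouring entity).
graph: dict[str, list[tuple[str, str]]] = defaultdict(list)
for subject, relation, obj in triples:
    graph[subject].append((relation, obj))

def neighbours(entity: str) -> list[tuple[str, str]]:
    # Facts directly attached to an entity; empty if the entity is unknown.
    return graph.get(entity, [])

def two_hop(entity: str) -> list[str]:
    # Follow relationships twice, e.g. person -> city -> country.
    return [obj2 for _, obj1 in neighbours(entity) for _, obj2 in neighbours(obj1)]

print(neighbours("Marie Curie"))
print(two_hop("Marie Curie"))
```

Following two relationships links Marie Curie to Poland, the kind of connected fact that is hard to recover from isolated text chunks.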

GraphRAG: An open-source framework developed by Microsoft
that facilitates automatic creation of knowledge graphs from
source documents and then uses the knowledge graph for retrieval.

Graph Communities: Partitioning entities & relationships into
groups.

Community Summaries: LLM generated summaries for
communities, providing insights into topical structure and
semantics.

Local Search: Identifying a set of entities from the knowledge
graph that are semantically related to the user input.

Global Search: Similarity based search on community summaries.

Ontology: A formal representation of knowledge as a set of
concepts within a domain, and the relationships between those
concepts.

Multimodal RAG: Using other modalities like images, audio, video
etc. in addition to text in both retrieval and generation.

Modality: A specific type of input data, such as text, image, video,
or audio. Multimodal systems can handle multiple modalities
simultaneously.

Multimodal Embeddings: A unified vector representation that
encodes multiple data types (e.g., text and image embeddings
combined) to enable retrieval across different modalities.

CLIP (Contrastive Language-Image Pre-training): A model
developed by OpenAI that learns visual concepts from natural
language supervision, often used for cross-modal retrieval and
generation.

Contrastive Learning: A learning method used to align data
across different modalities by bringing semantically similar data
points closer in the shared embedding space.
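To make the idea concrete, here is a small sketch of one common contrastive objective, an InfoNCE-style loss. This is an illustrative assumption: the text does not name a specific loss, and the two-dimensional vectors and temperature value are invented for the example.

```python
import math

def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def info_nce(anchor: list[float], positive: list[float],
             negatives: list[list[float]], temperature: float = 0.1) -> float:
    # Low loss when the anchor is close to its positive and far from the negatives.
    pos = math.exp(dot(anchor, positive) / temperature)
    neg = sum(math.exp(dot(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))

text_vec = [1.0, 0.0]        # e.g. embedding of a caption
matching_image = [1.0, 0.0]  # embedding of the image it describes
other_image = [0.0, 1.0]     # embedding of an unrelated image

aligned = info_nce(text_vec, matching_image, [other_image])
misaligned = info_nce(text_vec, other_image, [matching_image])
print(aligned < misaligned)
```

Minimising this loss during training pulls matching text and image embeddings together and pushes mismatched pairs apart in the shared space.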



Figure 19: Mapping data of different modalities into a shared embedding space

Agentic RAG: Leverages LLM based agents for adapting the RAG
workflow to query types and the type of documents in the
knowledge base.

Adaptive Frameworks: Dynamic systems that adjust retrieval and
generation strategies based on the evolving context and data,
ensuring relevant responses.

Routing Agents: Agents responsible for directing user queries to
the most appropriate sources or sub-systems for efficient
processing.

Query Planning Agents: Agents that break down complex queries
into sub-queries and manage their execution across different
retrieval pipelines.

Multiple Vectors per Document: A technique where multiple
vector representations are generated for each document to
capture different aspects of its content.




Technology Providers
Model Access, Training & Fine-Tuning: OpenAI, HuggingFace, Google
Vertex AI, Anthropic, AWS Bedrock, AWS Sagemaker, Cohere, Azure
Machine Learning, IBM Watson AI, Mistral AI, Salesforce Einstein,
Databricks Dolly, NVIDIA NeMo, EleutherAI

Vector DB and Indexing: Pinecone, Milvus, Chroma, Weaviate, Deep
Lake, Qdrant, Elasticsearch, Vespa, Redis (Vector Search Support),
Vald, Zilliz, Marqo, PGVector (PostgreSQL extension), MongoDB
(with vector capabilities), SingleStore

Data Loading: Snorkel AI, LlamaIndex, LangChain, Scale AI, Labelbox,
Superb AI, Explorium, Roboflow, Datature, V7 Labs, Clarifai

Application Framework: LangChain, LlamaIndex, Haystack, CrewAI
(Agentic Orchestration), AutoGen (Agentic Orchestration),
LangGraph (Agentic Orchestration), Rasa (Conversational AI), Flyte,
Prefect, Airflow, Metaflow

Prompt Engineering: W&B (Weights & Biases), PromptLayer, TruLens,
TruEra, PromptHero, TextSynth

Monitoring: HoneyHive, TruEra, Fiddler AI, Arize AI, Aporia, WhyLabs,
Evidently AI, Superwise, Monte Carlo, Datadog

Deployment Frameworks: vLLM, TensorRT-LLM, ONNX Runtime,
KubeFlow, MLflow, Ray Serve, Triton Inference Server, Seldon Deploy

Proprietary LLMs/VLMs: GPT series by OpenAI, Gemini series by
Google, Claude series by Anthropic, Command series by Cohere,
Jurassic by AI21 Labs, PaLM by Google, LaMDA by Google

Deployment & Inferencing: AWS, GCP, OpenAI API, Azure, IBM Cloud,
Oracle Cloud Infrastructure, Heroku, Kubernetes, DigitalOcean, Vercel

Open Source LLMs: Llama series by Meta, Mixtral by Mistral, Falcon
by TII, Vicuna by LMSYS, GPT-NeoX by EleutherAI, Pythia by
EleutherAI, Dolly 2.0 by Databricks, Phi by Microsoft

Small Language Models: Phi by Microsoft, GPT-Neo by EleutherAI,
DistilBERT by HuggingFace, TinyBERT, ALBERT (A Lite BERT) by
Google, MiniLM by Microsoft, DistilGPT2 by HuggingFace, Reformer
by Google, T5-Base by Google

Synthetic Data: Mostly AI, Tonic.ai, Synthesis AI

Others: Cohere reranker, Unstructured.io

Managed RAG solutions: OpenAI File Search, Amazon Bedrock
Knowledge Bases, Azure AI File Search, Claude Projects, Vectorize.io

Knowledge Graph and Ontology: Neo4j, Stardog, TerminusDB,
TigerGraph

Security and Privacy: Hazy, Duality, BigID



Applied RAG
Other RAG Patterns

Corrective RAG: In this approach, real-time information is
retrieved to check for the factual accuracy of the LLM generated
answer. Particularly useful in fact-checking, medical & legal
domains.

Contrastive RAG: Integrates contrastive learning techniques to
enhance the retrieval process by distinguishing between relevant
and irrelevant documents. https://arxiv.org/abs/2406.06577

Selective RAG: Optimises the retrieval phase by determining when
it is beneficial to retrieve external information. This method aims to
improve the overall performance of language models, particularly in
contexts where retrieval may not add value.

RAG with Active Learning: User feedback on generated content
is used to fine-tune or adapt the retrieval process over time. Useful
in continuous improvement systems, like recommendation engines
or educational tutoring systems, where the goal is to enhance
performance with each interaction.

Personalised RAG: User preferences, behaviour, and historical
interactions are used to personalise the retrieval process. It’s used
in personalisation-heavy domains like recommendation engines or
customer service, where the system tailors responses to individual
users.

Self-RAG: An adaptive retrieval mechanism that selectively decides
when to retrieve knowledge based on the query's context.

RAFT: Retrieval-Augmented Fine-Tuning combines retrieval
mechanisms with traditional fine-tuning techniques to help models
access and utilise external knowledge dynamically during their
fine-tuning process.

RAPTOR: Recursive Abstractive Processing for Tree-Organised
Retrieval focusses on creating a recursive, tree-like structure from
documents to improve context-aware information retrieval. It is
beneficial for question-answering tasks, especially when dealing
with extensive documents or information that requires multi-step
reasoning.

Application Areas

Search Engine: Conventional search results are shown as a list of
page links ordered by relevance. More recently, Google Search,
Perplexity and You.com have used RAG to present a coherent piece
of text, in natural language, with source citation. As a matter of
fact, search engine companies are now building LLM-first search
engines where RAG is the cornerstone of the algorithm.

Personalised Marketing Content Generation: The widest use of
LLMs has probably been in content generation. Using RAG, the
content can be personalised to readers, incorporate real-time
trends and be contextually appropriate. Yarnit, Jasper and
Simplified are some of the platforms that assist in marketing
content generation like blogs, emails, social media content, digital
advertisements etc.

Personalised Learning Plans: In the field of education, and
learning & development, RAG is used extensively to create
personalised learning paths based on past trends and for
automated evaluation and feedback.

Real-time Event Commentary: Imagine an event like a sport or a
news event. A retriever can connect to real-time updates/data via
APIs and pass this information to the LLM to create a virtual
commentator. These can further be augmented with Text-To-
Speech models. IBM leveraged such technology for commentary
during the 2023 US Open Tennis tournament.

Conversational agents: LLMs can be customised to
product/service manuals, domain knowledge, guidelines, etc. using
RAG and serve as support agents resolving user complaints and
issues. These agents can also route users to more specialised
agents depending on the nature of the query. Almost all LLM based
chatbots on websites or as internal tools use RAG. These are being
used in industries like Travel & Hospitality, Fintech and e-commerce.

Document Question Answering Systems: With access to
proprietary documents, a RAG enabled system becomes an
intelligent AI system that can answer questions about the
organisation.

Virtual Assistants: Virtual personal assistants like Siri, Alexa and
others are planning to use LLMs to enhance the user’s experience.
Coupled with more context on user behaviour using RAG, these
assistants are set to become more personalised.

Applied RAG Challenges

Relevance Mismatch: Difficulty retrieving the most relevant
documents or passages from a large dataset due to suboptimal
ranking or search mechanisms.

Over-Retrieval: Retrieving too many documents, leading to
unnecessary noise and irrelevant content in the final generation.

Sparse vs Dense Retrieval Trade-off: Balancing between sparse
retrieval (TF-IDF, BM25) and dense retrieval (using embeddings) to
maximise relevance without losing performance.

Latency: Retrieval from large or distributed knowledge bases can
introduce significant delays, affecting real-time applications.

Cost of Storage: Maintaining and managing massive vector
databases for dense retrieval can be expensive and resource-
intensive.

Narrow Retrieval Focus: The challenge of retrieving diverse
perspectives or sources to ensure that multiple viewpoints or
alternative pieces of evidence are surfaced.

Bias in Retrieval: Biases in retrieval results based on the structure
of the indexed data or retrieval algorithms.

Context Loss in Long Queries: Loss of context when handling
long, multi-turn queries, leading to disjointed or incomplete answers.

Incoherent Summarisation: Generating inconsistent or disjointed
summaries from multiple retrieved documents, leading to poor user
experience.

Over-Generation: Generating overly verbose or redundant
responses based on retrieved data that fail to condense the key
points effectively.

Inconsistent Modal Alignment: Challenges integrating
multimodal data (e.g., text, images, videos) where retrieved content
across different modalities may not be aligned properly, affecting
the quality of the generated response.

Data Silos: When knowledge is fragmented across multiple
sources or platforms, retrieval becomes challenging due to
inconsistent indexing or lack of interoperability between knowledge
bases.

Processing Large-Scale Data: Difficulty in maintaining high
throughput and low latency as the amount of indexed data grows.

Multi-Agent Coordination: When multiple agents are used for
task orchestration (as in agentic RAG), coordination among agents
can become complex and resource-heavy, impacting system
efficiency.

Inefficient Query Routing: Routing queries to the wrong sources
or retrieval pipelines can reduce efficiency.

Data Poisoning Attacks: External retrieval sources could be
manipulated to feed biased or poisoned data into the generation
pipeline, leading to compromised outputs.

Adversarial Attacks: Security vulnerabilities where attackers may
influence retrieval or generation results by exploiting weaknesses in
retrieval pipelines.

Knowledge Base Updating: Maintaining an up-to-date
knowledge base while keeping the retrieval process fast and
accurate can be difficult, especially in dynamic fields like news or
finance.

Memory Retention: Ensuring that the system can store and
retrieve long-term memory across interactions, allowing for a more
personalised and context-aware response.

