RAG Evaluations - A Simple Guide to RAG
Chapter 5 of A Simple Guide to Retrieval Augmented Generation: RAG Evaluations - Accuracy, Relevance, Faithfulness
Evaluation of RAG Pipelines
Building a proof-of-concept (PoC) RAG pipeline is not overly complex; LangChain and
LlamaIndex have made it quite simple. A highly impressive Large Language Model (LLM)
application can be put together quickly and verified on a limited set of examples.
However, to make the system robust, thorough testing that accurately mirrors the
production use case is imperative.
The RAG Triad of quality scores proposed by TruLens evaluates a RAG pipeline by asking three questions:

- Context Relevance: Is the retrieved context relevant to the query/prompt?
- Answer Relevance: Is the response relevant to the query/prompt?
- Answer Faithfulness: Is the response grounded in the retrieved context?
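In practice, each of the three triad checks is usually scored by an LLM acting as a judge. The sketch below is only an illustration of that idea, not the TruLens implementation; the llm_judge callable and the 0-to-1 scoring scale are assumptions made for the example.

```python
# Illustrative sketch of scoring the RAG Triad with an LLM-as-judge.
# `llm_judge` is a hypothetical callable that sends a prompt to an LLM
# and returns its text reply; substitute whichever LLM client you use.

def triad_scores(query: str, context: str, response: str, llm_judge) -> dict:
    prompts = {
        "context_relevance": (
            f"Rate from 0 to 1 how relevant the CONTEXT is to the QUERY.\n"
            f"QUERY: {query}\nCONTEXT: {context}\nReply with only the number."
        ),
        "answer_relevance": (
            f"Rate from 0 to 1 how relevant the RESPONSE is to the QUERY.\n"
            f"QUERY: {query}\nRESPONSE: {response}\nReply with only the number."
        ),
        "answer_faithfulness": (
            f"Rate from 0 to 1 how well the RESPONSE is grounded in the CONTEXT.\n"
            f"CONTEXT: {context}\nRESPONSE: {response}\nReply with only the number."
        ),
    }
    # One score per corner of the triad.
    return {name: float(llm_judge(prompt)) for name, prompt in prompts.items()}
```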
Beyond these quality scores, evaluations also probe four abilities of a RAG system:

- The ability of the RAG system to separate the noisy documents from the relevant ones
- The ability of the RAG system to assimilate information from multiple documents
- The ability of the RAG system to not give an answer when there is no relevant information
- The ability of the RAG system to reject known inaccuracies in the retrieved information
FRAMEWORKS

Frameworks are tools designed to facilitate evaluation, offering automation of the evaluation process and of data generation. They streamline evaluation by providing a structured environment for testing different aspects of RAG systems, and they are flexible enough to be adapted to different datasets and metrics.

BENCHMARKS

Benchmarks are standardized datasets and their associated evaluation metrics, used to measure the performance of RAG systems. Benchmarks provide a common ground for comparing different RAG approaches and ensure consistency across evaluations by fixing the set of tasks and their evaluation criteria. For example, HotpotQA focuses on multi-hop reasoning and retrieval capabilities using metrics like Exact Match and F1 scores. Benchmarks are used to establish a performance baseline and to identify strengths and weaknesses in specific tasks or domains.

METRICS

Frameworks and benchmarks both calculate metrics that focus on retrieval and on the RAG quality scores. Metrics quantify the assessment of RAG system performance in two broad groups:

- Retrieval metrics that are commonly used in information retrieval tasks
- RAG-specific metrics that have evolved as RAG has found more application

It is noteworthy that there are also natural language generation specific metrics like BLEU, ROUGE, METEOR, etc. that focus on fluency and measure relevance and semantic similarity. They play an important role in analyzing and benchmarking the performance of Large Language Models.
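To make the benchmark-style metrics mentioned above concrete, here is a minimal sketch of Exact Match and token-level F1, the two scores commonly reported on HotpotQA. The simple lowercase/whitespace normalization is an assumption for illustration; real benchmark scripts normalize answers more carefully.

```python
from collections import Counter

def exact_match(prediction: str, ground_truth: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(prediction.strip().lower() == ground_truth.strip().lower())

def token_f1(prediction: str, ground_truth: str) -> float:
    """Token-level F1: harmonic mean of precision and recall over shared tokens."""
    pred_tokens = prediction.lower().split()
    gt_tokens = ground_truth.lower().split()
    common = Counter(pred_tokens) & Counter(gt_tokens)  # multiset intersection
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gt_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("a fixed set of tasks", "a fixed set of tasks"))  # 1.0
print(token_f1("india won the cup", "india won on 19 november 2023"))  # 0.4
```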
Not all retrieval metrics are equally popular for evaluations. Often, the more
complex metrics are overlooked for the sake of explainability. Which of these
metrics you use depends on how far along the evolution of your system's
performance you are. For example, to start with you may just be trying to improve
precision, while at a more evolved stage you may be looking for better ranking.
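To illustrate that precision-versus-ranking distinction, the sketch below computes two common retrieval metrics, precision@k and Mean Reciprocal Rank (MRR), over ranked document IDs. The document IDs and relevance sets are invented purely for the example.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant (order within top-k ignored)."""
    return sum(doc in relevant for doc in retrieved[:k]) / k

def mean_reciprocal_rank(all_retrieved: list[list[str]], all_relevant: list[set[str]]) -> float:
    """Average of 1/rank of the first relevant document per query (rank-sensitive)."""
    reciprocal_ranks = []
    for retrieved, relevant in zip(all_retrieved, all_relevant):
        rr = 0.0
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                rr = 1.0 / rank
                break
        reciprocal_ranks.append(rr)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

# Invented example: two queries with ranked results and their relevant sets.
retrieved = [["d3", "d1", "d7"], ["d2", "d9", "d4"]]
relevant = [{"d1", "d5"}, {"d4"}]
print(precision_at_k(retrieved[0], relevant[0], k=3))  # 1 relevant doc in top 3 -> 0.33
print(mean_reciprocal_rank(retrieved, relevant))       # (1/2 + 1/3) / 2 -> 0.42
```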
Context Relevance
Context relevance evaluates how well the retrieved documents relate
to the original query. The key aspects are topical alignment,
information usefulness, and redundancy. Both human evaluation methods
and semantic similarity measures can be used to calculate context
relevance.
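One way of scoring context relevance automatically is cosine similarity between embeddings of the query and each retrieved chunk. The sketch below assumes the sentence-transformers package and the all-MiniLM-L6-v2 model; both are illustrative choices, not anything prescribed by this chapter.

```python
from sentence_transformers import SentenceTransformer, util

# Illustrative embedding model; any sentence embedding model could be substituted.
model = SentenceTransformer("all-MiniLM-L6-v2")

query = "Who won the 2023 ODI Cricket World Cup and when?"
retrieved_chunks = [
    "Australia beat India in the final on 19 November 2023 to win the World Cup.",
    "The Cricket World Cup is held once every four years.",
]

query_emb = model.encode(query, convert_to_tensor=True)
chunk_embs = model.encode(retrieved_chunks, convert_to_tensor=True)

# Cosine similarity of each chunk to the query, used as a simple context relevance proxy.
scores = util.cos_sim(query_emb, chunk_embs)[0]
for chunk, score in zip(retrieved_chunks, scores):
    print(f"{score.item():.2f}  {chunk}")
```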
Illustrative Example
Query: Who won the 2023 ODI Cricket World Cup and when?
Statement 1: India won on 19 November 2023.
Statement 2: The Cricket World Cup is held once every four years.
The first statement is relevant to both parts of the query; the second is on topic but does not answer what was asked.
Note
Answer Relevance is not a measure of truthfulness but only of relevance.
A response may be relevant without being factually accurate; in the example
above, the first statement is relevant to the query even though Australia,
not India, won the 2023 final.
Ground Truth
Ground truth is information that is known to be real or true. In RAG,
or the Generative AI domain in general, ground truth is a prepared set of
Question-Context-Answer examples. It is akin to labelled data in
Supervised Learning parlance.
Calculation of certain metrics necessitates the availability of
ground truth data.
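For illustration, a single ground truth record might look like the following; the field names are an assumption and vary from one evaluation framework to another.

```python
# A hypothetical ground truth record for RAG evaluation.
# Field names (question / contexts / ground_truth_answer) are illustrative;
# each framework expects its own schema.
ground_truth_example = {
    "question": "Who won the 2023 ODI Cricket World Cup and when?",
    "contexts": [
        "Australia won the 2023 ODI Cricket World Cup, beating India in the final on 19 November 2023."
    ],
    "ground_truth_answer": "Australia won the 2023 ODI Cricket World Cup on 19 November 2023.",
}
```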
RAGAs
Retrieval Augmented Generation Assessment, or RAGAs, is a framework
developed by Exploding Gradients that assesses the retrieval and
generation components of RAG systems without relying on extensive
human annotations. RAGAs helps in:
- Synthetically generating a test dataset that can be used to evaluate a RAG pipeline
- Using metrics to measure the performance of the pipeline
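A minimal sketch of scoring a pipeline's outputs with RAGAs might look like the following. It assumes the ragas and datasets Python packages and the metric names of recent ragas releases; exact import paths and expected column names differ between versions.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

# One evaluated sample: the question, the pipeline's answer, the retrieved
# contexts, and a reference (ground truth) answer.
samples = {
    "question": ["Who won the 2023 ODI Cricket World Cup and when?"],
    "answer": ["Australia won the 2023 ODI Cricket World Cup on 19 November 2023."],
    "contexts": [[
        "Australia beat India in the final on 19 November 2023 to win the World Cup.",
        "The Cricket World Cup is held once every four years.",
    ]],
    "ground_truth": ["Australia won the 2023 ODI Cricket World Cup on 19 November 2023."],
}

# RAGAs uses an LLM as a judge under the hood, so an API key for the
# configured LLM provider is expected in the environment.
result = evaluate(
    Dataset.from_dict(samples),
    metrics=[faithfulness, answer_relevancy, context_precision],
)
print(result)
```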
TruLens
TruLens was initially developed by researchers at TruEra. It provides a
structured evaluation framework with a strong focus on domain-specific
accuracy.
DeepEval
DeepEval is another user-friendly, open-source evaluation framework,
developed by Confident AI. It allows you to create your own test cases
and custom metrics.
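As a sketch of how a test case might be evaluated with DeepEval, the snippet below uses its LLMTestCase together with two built-in RAG metrics. The metric names and constructor arguments reflect recent deepeval releases and should be treated as assumptions that may change between versions.

```python
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from deepeval.test_case import LLMTestCase

# A single RAG test case: the user input, the generated answer, and the
# retrieved context the answer should be grounded in.
test_case = LLMTestCase(
    input="Who won the 2023 ODI Cricket World Cup and when?",
    actual_output="Australia won the 2023 ODI Cricket World Cup on 19 November 2023.",
    retrieval_context=[
        "Australia beat India in the final on 19 November 2023 to win the World Cup."
    ],
)

# Both metrics use an LLM judge, so an LLM API key is expected to be configured.
evaluate(
    test_cases=[test_case],
    metrics=[AnswerRelevancyMetric(threshold=0.7), FaithfulnessMetric(threshold=0.7)],
)
```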
RAGChecker
Developed by Amazon Science, RAGChecker also provides metrics focused
on noise and LLM self-knowledge.