RAG Evaluations - A Simple Guide to RAG

Chapter 5 of 'A Simple Guide to Retrieval Augmented Generation' discusses the evaluation of RAG pipelines, highlighting the importance of accuracy, relevance, and faithfulness in assessing performance. It outlines key failure points in RAG systems, introduces quality scores and metrics for evaluation, and presents frameworks and benchmarks for systematic assessment. The chapter also addresses limitations in current evaluation methodologies and the need for standardized metrics and adaptability to real-time information.


A SIMPLE GUIDE TO RETRIEVAL AUGMENTED GENERATION

CHAPTER 5

RAG EVALUATIONS: ACCURACY, RELEVANCE, FAITHFULNESS
Evaluation of RAG Pipelines
Building a PoC RAG pipeline is not overly complex. LangChain and
LlamaIndex have made it quite simple. Impressive Large Language
Model (LLM) applications can be developed with brief experimentation
and verification on a limited set of examples. However, to make the
pipeline robust, thorough testing that accurately mirrors the
production use case is imperative.

RAG is a great technique to address the memory limitations and hallucinations in LLMs, but even RAG systems can fail to meet the desired outcomes.

KEY POINTS OF FAILURE


- The retriever fails to retrieve relevant context or retrieves irrelevant context
- The LLM, despite being provided the context, does not consider it
- The LLM, instead of answering the query, picks irrelevant information from the context

RETRIEVAL QUALITY
How good is the retrieval of the context from the Vector Database?
- Is it relevant to the query?
- How much noise (irrelevant information) is present?

GENERATION QUALITY
How good is the generated response?
- Is the response grounded in the provided context?
- Is the response relevant to the query?



Quality Scores & Abilities
Contemporary research has identified certain scores to assess
the quality and abilities of a RAG system.

CONTEXT RELEVANCE: Is the Retrieved Context relevant to the Query/Prompt?
ANSWER RELEVANCE: Is the Response relevant to the Query/Prompt?
ANSWER FAITHFULNESS: Is the Response grounded in the Retrieved Context?

Quality scores: The RAG Triad proposed by TruLens

NOISE ROBUSTNESS: The ability of the RAG system to separate the noisy documents from the relevant ones
NEGATIVE REJECTION: The ability of the RAG system to not give an answer when there is no relevant information
INFORMATION INTEGRATION: The ability of the RAG system to assimilate information from multiple documents
COUNTERFACTUAL ROBUSTNESS: The ability of the RAG system to reject known inaccuracies in the retrieved information

Abilities of a RAG system discussed by the CRAG paper

LATENCY: The delay between the input prompt and the response
BIAS & TOXICITY: While not specific to RAG, bias and toxicity evaluation is critical in AI apps
QUERY ROBUSTNESS: The ability of the RAG system to handle different types of queries



Instruments of RAG Evaluation
The quality scores and the abilities need to be measured and
benchmarked. There are three critical enablers of RAG evaluations –
Metrics, Frameworks and Benchmarks.

FRAMEWORKS
Frameworks are tools designed to facilitate evaluation, offering automation of the evaluation process and data generation. They streamline evaluation by providing a structured environment for testing different aspects of a RAG system. They are flexible and can be adapted to different datasets and metrics.

METRICS
The frameworks and benchmarks both calculate metrics that focus on retrieval and the RAG quality scores. Metrics quantify the assessment of RAG system performance in two broad groups:
- Retrieval metrics that are commonly used in information retrieval tasks
- RAG specific metrics that have evolved as RAG has found more application

BENCHMARKS
Benchmarks are standardized datasets and their evaluation metrics used to measure the performance of RAG systems. Benchmarks provide a common ground for comparing different RAG approaches. They ensure consistency across evaluations by considering a fixed set of tasks and their evaluation criteria. For example, HotpotQA focuses on multi-hop reasoning and retrieval capabilities using metrics like Exact Match and F1 scores. Benchmarks are used to establish a baseline for performance and identify strengths/weaknesses in specific tasks or domains.

It is noteworthy that there are natural language generation specific metrics like BLEU, ROUGE, METEOR, etc. that focus on fluency and measure relevance and semantic similarity. They play an important role in analyzing and benchmarking the performance of Large Language Models.



RETRIEVAL METRICS
The retrieval component of RAG can be evaluated independently to
determine how well the retrievers are satisfying the user query

Not all retrieval metrics are popular for evaluations. Often, the more
complex metrics are overlooked for the sake of explainability. Which of
these metrics you use depends on where you are in the evolution of
system performance. For example, to start with you may just be trying
to improve precision, while at a more evolved stage you may be looking
for better ranking.


Precision, Recall and F1-score
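The original page presents these as formulas; as a minimal illustrative sketch (not the book's own code), precision@k, recall@k and F1 for a single query can be computed as below, assuming each retrieved chunk can be labelled relevant or not against ground truth.

```python
def precision_recall_f1_at_k(retrieved_ids, relevant_ids, k):
    """Precision@k, Recall@k and F1 for one query.

    retrieved_ids: ranked list of retrieved chunk/document ids
    relevant_ids:  set of ids known to be relevant (ground truth)
    """
    top_k = retrieved_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    precision = hits / k if k else 0.0
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1

# Example: 2 of the top-3 retrieved chunks are relevant
print(precision_recall_f1_at_k(["d1", "d7", "d3"], {"d1", "d3", "d9"}, k=3))
```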

Mean Reciprocal Rank
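Mean Reciprocal Rank averages, over a set of queries, the reciprocal of the rank at which the first relevant result appears. A small sketch under the same assumptions as above:

```python
def mean_reciprocal_rank(ranked_results, relevant_sets):
    """ranked_results: one ranked list of ids per query
    relevant_sets:  one set of relevant ids per query"""
    reciprocal_ranks = []
    for retrieved, relevant in zip(ranked_results, relevant_sets):
        rr = 0.0
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                rr = 1.0 / rank  # first relevant hit determines the score
                break
        reciprocal_ranks.append(rr)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)
```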


Mean Average Precision
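Mean Average Precision averages, over queries, the precision computed at each rank where a relevant document is retrieved. A minimal sketch:

```python
def average_precision(retrieved_ids, relevant_ids):
    """Mean of precision@rank at every rank where a relevant document appears."""
    hits, precisions = 0, []
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant_ids) if relevant_ids else 0.0

def mean_average_precision(ranked_results, relevant_sets):
    scores = [average_precision(r, s) for r, s in zip(ranked_results, relevant_sets)]
    return sum(scores) / len(scores)
```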

normalized Discounted Cumulative Gain
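nDCG compares the discounted cumulative gain of the retrieved ordering against the ideal ordering, so graded (not just binary) relevance can be rewarded. A sketch assuming each retrieved item carries a relevance grade:

```python
import math

def ndcg_at_k(relevance_grades, k):
    """relevance_grades: graded relevance of retrieved items, in retrieved order
    (e.g. 0 = irrelevant, 1 = partially relevant, 2 = highly relevant)."""
    def dcg(grades):
        return sum(g / math.log2(rank + 1) for rank, g in enumerate(grades, start=1))
    actual = dcg(relevance_grades[:k])
    ideal = dcg(sorted(relevance_grades, reverse=True)[:k])
    return actual / ideal if ideal else 0.0

print(ndcg_at_k([2, 0, 1], k=3))  # the ideal ordering would be [2, 1, 0]
```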



RAG SPECIFIC METRICS
The three quality scores that are used to evaluate RAG applications are
context relevance, answer relevance and answer faithfulness. These
scores specifically answer three questions:

- Is the retrieved information relevant to the user query?
- Is the generated answer grounded in the retrieved information?
- Is the generated answer relevant to the user query?

Context Relevance
Context relevance evaluates how well the retrieved documents relate
to the original query. The key aspects are topical alignment,
information usefulness and redundancy. There are human evaluation
methods as well as semantic similarity measures to calculate context
relevance.
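As a rough illustration of the semantic-similarity route (not the book's code), context relevance can be approximated by embedding the query and each retrieved chunk and averaging their cosine similarities; the embed callable below is a hypothetical stand-in for whichever embedding model you use.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def context_relevance(query, retrieved_chunks, embed):
    """embed: any callable mapping text -> vector (placeholder for your
    embedding model). Returns the mean query-chunk similarity."""
    q_vec = embed(query)
    sims = [cosine(q_vec, embed(chunk)) for chunk in retrieved_chunks]
    return sum(sims) / len(sims) if sims else 0.0
```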



ANSWER FAITHFULNESS
Faithfulness is the measure of the extent to which the response is
factually grounded in the retrieved context. Faithfulness ensures that
the facts in the response do not contradict the context and can be
traced back to the source. It also ensures that the LLM is not
hallucinating.

An inverse metric for faithfulness is the Hallucination Rate, which
calculates the proportion of generated claims in the response that are
not present in the retrieved context.

Another metric related to faithfulness is Coverage. Coverage counts the
relevant claims in the context and calculates the proportion of those
claims present in the generated response. This measures how much of the
relevant information from the retrieved passages is included in the
generated answer.
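Once the response and the context have been broken into individual claims (in practice, frameworks typically use an LLM to extract and verify claims), the three scores reduce to simple ratios. The sketch below only shows that arithmetic; claim extraction and support-checking are assumed to happen elsewhere.

```python
def faithfulness_scores(response_claims, supported_claims, context_claims, covered_claims):
    """response_claims:  claims extracted from the generated answer
    supported_claims: the subset of response claims grounded in the retrieved context
    context_claims:   relevant claims present in the retrieved context
    covered_claims:   the subset of context claims that appear in the answer"""
    faithfulness = len(supported_claims) / len(response_claims) if response_claims else 0.0
    hallucination_rate = 1.0 - faithfulness  # share of claims not grounded in the context
    coverage = len(covered_claims) / len(context_claims) if context_claims else 0.0
    return faithfulness, hallucination_rate, coverage
```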



ANSWER RELEVANCE
Just as context relevance measures the relevance of the retrieved
context to the query, answer relevance is the measure of the extent to
which the response is relevant to the query. This metric focuses on key
aspects such as the system's ability to comprehend the query, the
pertinence of the response to the query and the completeness of the response.

Illustrative Example

Query: Who won the 2023 ODI Cricket World Cup and when?

Response 1 (High Answer Relevance): India won on 19 November 2023
Response 2 (Low Answer Relevance): The Cricket World Cup is held once every four years

Note
Answer Relevance is not a measure of truthfulness but only of
relevance. The response may or may not be factually accurate and
yet still be relevant.

Ground Truth
Ground truth is information that is known to be real or true. In RAG,
or Generative AI domain in general, Ground Truth is a prepared set of
Question-Context-Answer examples. It is akin to labelled data in
Supervised Learning parlance.
Calculation of certain metrics necessitates the availability of
Ground Truth data.
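In code, a ground truth record is simply a question paired with the context that answers it and a reference answer. The example below is hypothetical and the field names vary by framework.

```python
# A single Question-Context-Answer ground truth record (illustrative only)
ground_truth_example = {
    "question": "What does answer faithfulness measure in a RAG system?",
    "contexts": [
        "Faithfulness is the extent to which the response is factually grounded in the retrieved context."
    ],
    "ground_truth": "It measures how well the generated response is grounded in the retrieved context.",
}
```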





FRAMEWORKS
Frameworks provide a structured approach to RAG evaluations. They
can be used to automate the evaluation process. Some go beyond and
assist in the synthetic ground truth data generation. While new
evaluation frameworks will continue to be introduced, there are a few
popular ones.

RAGAs
Retrieval Augmented Generation Assessment or RAGAs is a framework
developed by Exploding Gradients that assesses the retrieval and
generation components of RAG systems without relying on extensive
human annotations. RAGAs helps to:
- Synthetically generate a test dataset that can be used to evaluate a RAG pipeline
- Measure the performance of the pipeline using metrics

Synthetic Dataset Generation in RAGAs

Check out the RAGAs implementation code on a RAG pipeline in the
official source code repository.
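For orientation, a minimal RAGAs-style evaluation call is sketched below. Exact module and metric names can differ between RAGAs versions, so treat this as an outline rather than the book's official example (which lives in the repository mentioned above).

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness

# RAGAs uses an LLM as a judge, so an LLM/embedding backend must be
# configured (by default via OPENAI_API_KEY in the environment).
eval_data = Dataset.from_dict({
    "question": ["What does answer faithfulness measure?"],
    "answer": ["It measures how well the response is grounded in the retrieved context."],
    "contexts": [["Faithfulness is the extent to which the response is grounded in the retrieved context."]],
})

result = evaluate(eval_data, metrics=[faithfulness, answer_relevancy])
print(result)  # per-metric scores for the RAG pipeline
```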



FRAMEWORKS
ARES
Automated RAG Evaluation System, or ARES, is a framework developed
by researchers at Stanford University and Databricks. Like RAGAs,
ARES uses an LLM-as-a-judge approach for evaluations. Both ask a
language model to classify answer relevance, context relevance and
faithfulness for a given query. However, there are some differences:
- RAGAs relies on heuristically written prompts that are sent to the
  LLM for evaluation. ARES, on the other hand, trains a classifier
  using a language model.
- RAGAs aggregates the responses from the LLM to arrive at a score.
  ARES provides confidence intervals for the scores, leveraging a
  framework called Prediction-Powered Inference (PPI).
- RAGAs generates a simple synthetic question-context-answer dataset
  for evaluation from the documents. ARES generates synthetic datasets
  comprising both positive and negative examples of query-passage-answer triples.

TruLens
TruLens was initially developed by researchers at TruEra. TruLens
provides a structured evaluation framework with a strong focus on
domain-specific accuracy.

DeepEval
DeepEval is another user-friendly, open-source evaluation framework
developed by Confident AI. It allows you to create your own test cases
and custom metrics.

RAGChecker
Developed by Amazon Science, RAGChecker also has metrics focused
on noise and LLM self-knowledge.



BENCHMARKS
Benchmarks provide a standard point of reference to evaluate the
quality and performance of a system. RAG benchmarks are a set of
standardised tasks and datasets used to compare the efficiency of
different RAG systems in retrieving relevant information and generating
accurate responses. There has been a surge in benchmark creation
since 2023, when RAG started gaining popularity, but benchmarks on
question answering tasks had been introduced before that.
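The Exact Match and token-level F1 scores used by QA benchmarks such as HotpotQA are straightforward to compute. The sketch below is simplified: it only lowercases, whereas the official evaluation scripts also strip punctuation and articles before comparing.

```python
from collections import Counter

def exact_match(prediction: str, reference: str) -> float:
    return float(prediction.strip().lower() == reference.strip().lower())

def token_f1(prediction: str, reference: str) -> float:
    pred_tokens, ref_tokens = prediction.lower().split(), reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)  # token overlap
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```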



LIMITATIONS
As with any evolving field, there are limitations and challenges to
consider. In the next section, we'll examine these limitations and
discuss best practices that have emerged to address them, ensuring a
more holistic and nuanced approach to RAG evaluation.

Lack of Standardized Metrics


There’s no consensus on what the best metrics are to evaluate RAG
systems.

Over-reliance on LLM as a Judge


The evaluation of RAG specific metrics (in RAGAs, ARES, etc.) relies on
using an LLM as a judge. An LLM is prompted or fine-tuned to classify a
response as relevant or not.

Lack of use-case subjectivity


Most frameworks have a generalized approach toward evaluation. They
may not capture the subjective nature of the task relevant to your use-
case.

Benchmarks are static


Most benchmarks are static and do not account for the evolving nature
of information. RAG systems need to adapt to real-time information
changes, which is not currently tested effectively.

Scalability and Cost


Evaluating large-scale RAG systems is more complex than evaluating
basic RAG pipelines. It requires significant computational resources.
Benchmarks and frameworks also do not, generally, account for
metrics like latency and efficiency which are critical for real world
applications.



ARE YOU INTERESTED IN LEARNING MORE ABOUT RAG?

THE FIRST FIVE CHAPTERS OF A SIMPLE GUIDE TO RETRIEVAL AUGMENTED GENERATION ARE NOW AVAILABLE FOR EARLY ACCESS

SUBSCRIBE NOW (Link in post)
SOURCE CODE IS NOW PUBLICLY AVAILABLE (Check out on GitHub)
