
Comprehensive Guide to Evaluating LLM and RAG Systems

Dipanjan (DJ) Sarkar


Building an LLM Evaluation Framework

● An LLM Evaluation Framework is designed to evaluate and test the outputs of LLM systems on a range of different criteria
● Key components of this framework would be:
○ An evaluation dataset, typically of (context,) input and output pairs or triplets
○ Evaluation metrics we want to use to measure the performance of our system

Source: https://www.confident-ai.com/blog/llm-evaluation-metrics-everything-you-need-for-llm-evaluation
Typical components of a Test Case data point

LLM Evaluation Metrics

● LLM evaluation metrics such as answer correctness, semantic similarity, and relevancy are measures that help score your LLM system’s output based on criteria you care about
● They help quantify the performance of your LLM or RAG system

Most Popular LLM Evaluation Metrics

What makes an Evaluation Metric Good to Use?

Common Evaluation Metric Scorers

Statistical Metric Scorers

Model-based Non-LLM Scorers

Model-based LLM Scorers - G-Eval

G-Eval first generates a series of evaluation steps using chains of thought (CoT), then uses the generated steps to determine the final score via a form-filling paradigm
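The form-filling scoring step can be sketched as a probability-weighted sum over candidate score tokens. This is a minimal illustration of the idea, not DeepEval's actual API; the 1–5 scale and the token probabilities below are assumed values.

```python
def geval_score(score_token_probs):
    """Probability-weighted score over candidate score tokens,
    in the spirit of G-Eval's form-filling paradigm: sum of s * p(s)."""
    return sum(score * prob for score, prob in score_token_probs.items())

# Hypothetical probabilities an LLM might assign to score tokens on a 1-5 scale.
probs = {3: 0.1, 4: 0.6, 5: 0.3}
final = geval_score(probs)  # 0.3 + 2.4 + 1.5 = 4.2
```

Weighting by token probabilities yields a finer-grained score than taking the single most likely score token.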

G-Eval with DeepEval

Combining Statistical and Model-based Scorers

Ragas for evaluating LLM and RAG Systems

● Ragas is a framework that helps you evaluate your Retrieval Augmented Generation (RAG) pipelines
● Very useful, as it provides a suite of metrics to evaluate each stage of a RAG pipeline

Source: https://docs.ragas.io/en/stable/index.html
Key Metrics for RAG Evaluation

● Metrics to evaluate specific components of a RAG pipeline include:
○ Faithfulness
○ Answer Relevance
○ Context Precision
○ Context Relevancy
○ Context Recall
○ Context Entities Recall
● Metrics to evaluate the entire LLM or RAG system include:
○ Answer Semantic Similarity
○ Answer Correctness

Faithfulness

● Measures the factual consistency of the generated answer against the given context
● Calculated from the answer and the retrieved context
● The score is scaled to the (0, 1) range; higher is better

Faithfulness

● The generated answer is regarded as faithful if all the claims made in the answer can be inferred from the given context
● Computed as: Faithfulness = (number of claims in the answer supported by the context) / (total number of claims in the answer)

Faithfulness
● Let’s examine how faithfulness was calculated using the low-faithfulness answer:
○ Step 1: Break the generated answer into individual statements.
■ Statement 1: “Einstein was born in Germany.”
■ Statement 2: “Einstein was born on 20th March 1879.”
○ Step 2: For each of the generated statements, verify whether it can be inferred from the given context.
■ Statement 1: Yes
■ Statement 2: No
○ Step 3: Apply the faithfulness formula: 1 supported statement out of 2 gives a score of 0.5.
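The steps above reduce to a simple ratio. A minimal sketch, assuming the per-statement verdicts are given (in practice an LLM breaks the answer into statements and judges each one):

```python
def faithfulness_score(verdicts):
    """Fraction of answer statements that can be inferred from the context."""
    if not verdicts:
        return 0.0
    return sum(verdicts) / len(verdicts)

# Worked example: statement 1 is supported, statement 2 is not.
score = faithfulness_score([True, False])  # 1 / 2 = 0.5
```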

Faithfulness in Ragas

Answer Relevance

● Answer Relevancy focuses on assessing how pertinent or relevant the generated answer is to the given prompt
● Lower scores are assigned to answers that are incomplete or contain redundant information, while higher scores indicate better relevancy
● The metric is computed using the question, the context, and the answer

Answer Relevance
● To calculate this score, the LLM is
prompted to generate an
appropriate question for the
generated answer multiple times
● The mean cosine similarity
between these generated
questions and the original
question is measured
● If the generated answer accurately
addresses the initial question, the
LLM should be able to generate
questions from the answer that
align with the original question

Answer Relevance
● To calculate the relevance of the answer to the given question, we follow two steps:
○ Step 1: Reverse-engineer ‘n’ variants of the question from the generated answer using a Large Language Model (LLM). For instance, for the first answer, the LLM might generate the following possible questions:
■ Question 1: “In which part of Europe is France located?”
■ Question 2: “What is the geographical location of France within Europe?”
■ Question 3: “Can you identify the region of Europe where France is situated?”
○ Step 2: Calculate the mean cosine similarity between the generated questions and the actual question.
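Step 2 can be sketched in plain Python. The 2-d vectors below are toy stand-ins; a real system would embed the questions with an embedding model before comparing them:

```python
import math

def cosine_sim(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def answer_relevancy(original_q_emb, generated_q_embs):
    """Mean cosine similarity between the original question's embedding
    and the embeddings of the LLM-generated questions."""
    sims = [cosine_sim(original_q_emb, g) for g in generated_q_embs]
    return sum(sims) / len(sims)

# Toy embeddings: one generated question identical, one orthogonal.
score = answer_relevancy([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])  # (1 + 0) / 2 = 0.5
```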

Answer Relevance in Ragas

Context Precision

● A metric that evaluates whether all of the ground-truth relevant items present in the contexts are ranked at the top
● Ideally, all the relevant chunks should appear at the top ranks
● This metric is computed using the question, the ground_truth, and the contexts
● A higher score is better

Context Precision

● To calculate context precision in the low context precision example:
○ Step 1: For each chunk in the retrieved context, check whether it is relevant to the ground truth for the given question.
○ Step 2: Calculate precision@k for each chunk in the context.
○ Step 3: Calculate the mean of precision@k to arrive at the final context precision score.
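The three steps can be sketched as follows, assuming the per-chunk 0/1 relevance verdicts are already given (an LLM produces them in practice), and averaging precision@k over the relevant ranks:

```python
def context_precision(relevance_flags):
    """Mean of precision@k, taken at each rank k where the chunk is relevant.
    relevance_flags: 0/1 per retrieved chunk, in rank order."""
    hits, precisions = 0, []
    for k, relevant in enumerate(relevance_flags, start=1):
        if relevant:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(precisions) if precisions else 0.0

# Relevant chunk at rank 1, irrelevant at rank 2, relevant at rank 3:
score = context_precision([1, 0, 1])  # mean of 1/1 and 2/3, about 0.833
```

Because later relevant chunks contribute lower precision@k values, the score rewards retrievers that rank relevant chunks first.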

Context Precision in Ragas

Context Relevancy
● This metric gauges the relevancy of the retrieved context, calculated based on both the question and the contexts
● The retrieved context should exclusively contain essential information to address the provided query
● We first estimate |S| by identifying sentences within the retrieved context that are relevant for answering the given question, then divide by the total number of sentences in the context
● Higher scores indicate better relevancy

Context Relevancy

● To calculate context relevancy in the given example:
○ Step 1: For the context document, find the number of sentences |S| that are relevant to the question
○ Step 2: Count the total number of sentences |T| in the retrieved context
○ Step 3: Score is |S| / |T|
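A minimal sketch of the ratio, assuming the sentence counts have already been determined (an LLM identifies the relevant sentences in practice):

```python
def context_relevancy(num_relevant_sentences, num_total_sentences):
    """|S| / |T|: relevant sentences over total sentences in the context."""
    if num_total_sentences == 0:
        return 0.0
    return num_relevant_sentences / num_total_sentences

# Example: 2 of the 4 sentences in the context are relevant.
score = context_relevancy(2, 4)  # 0.5
```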

Context Relevancy in Ragas

Context Recall

● Measures the extent to which the retrieved context aligns with the annotated answer, treated as the ground truth
● Computed based on the ground truth and the retrieved context
● Each sentence in the ground truth answer is analyzed to determine whether it can be attributed to the retrieved context or not
● A higher score is better

Context Recall
● To calculate context recall in the low context recall example:
○ Step 1: Break the ground truth answer into individual statements.
■ Statement 1: “France is in Western Europe.”
■ Statement 2: “Its capital is Paris.”
○ Step 2: For each of the ground truth statements, verify whether it can be attributed to the retrieved context.
■ Statement 1: Yes
■ Statement 2: No
○ Step 3: Apply the context recall formula: 1 attributed statement out of 2 gives a score of 0.5.
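The steps above again reduce to a ratio over per-statement verdicts. A minimal sketch, assuming the attribution verdicts are given:

```python
def context_recall(attributed):
    """Fraction of ground-truth statements attributable to the retrieved context."""
    if not attributed:
        return 0.0
    return sum(attributed) / len(attributed)

# Worked example: "France is in Western Europe" -> Yes; "Its capital is Paris" -> No.
score = context_recall([True, False])  # 1 / 2 = 0.5
```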

Context Recall in Ragas

Context Entities Recall
● This metric measures the recall of the retrieved context based on the number of entities present in both the ground_truths and the contexts, relative to the number of entities present in the ground_truths alone
● Simply put, it is a measure of what fraction of entities from the ground_truths are recalled in the contexts
● It can help evaluate the retrieval mechanism for entities, based on comparison with the entities present in the ground_truths
● A higher score is better
Context Entities Recall
● To calculate context entity recall in the given examples:
○ Step 1: Find the entities present in the ground truths.
■ Entities in ground truth (GE): [‘Taj Mahal’, ‘Yamuna’, ‘Agra’, ‘1631’, ‘Shah Jahan’, ‘Mumtaz Mahal’]
○ Step 2: Find the entities present in each context.
■ Entities in context (CE1): [‘Taj Mahal’, ‘Agra’, ‘Shah Jahan’, ‘Mumtaz Mahal’, ‘India’]
■ Entities in context (CE2): [‘Taj Mahal’, ‘UNESCO’, ‘India’]
○ Step 3: Calculate context entity recall as |GE ∩ CE| / |GE|.
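The steps above can be sketched directly with Python sets, using the entity lists from the example (entity extraction itself is done by an LLM in practice):

```python
def context_entity_recall(ground_truth_entities, context_entities):
    """|GE intersect CE| / |GE|: fraction of ground-truth entities in the context."""
    ge, ce = set(ground_truth_entities), set(context_entities)
    return len(ge & ce) / len(ge)

GE = ['Taj Mahal', 'Yamuna', 'Agra', '1631', 'Shah Jahan', 'Mumtaz Mahal']
CE1 = ['Taj Mahal', 'Agra', 'Shah Jahan', 'Mumtaz Mahal', 'India']
CE2 = ['Taj Mahal', 'UNESCO', 'India']

score1 = context_entity_recall(GE, CE1)  # 4 shared entities / 6 = 0.667
score2 = context_entity_recall(GE, CE2)  # 1 shared entity / 6 = 0.167
```

The first context recalls far more of the ground-truth entities, so it scores much higher.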

Context Entities Recall in Ragas

Answer Semantic Similarity
● Answer Semantic Similarity pertains to
the assessment of the semantic
resemblance between the generated
answer and the ground truth
● This evaluation is based on the ground
truth and the answer, with values
falling within the range of 0 to 1
● A higher score signifies a better
alignment between the generated
answer and the ground truth
● Ragas utilizes a cross-encoder model
to calculate the semantic similarity
score using cosine similarity of the text
embeddings
Answer Semantic Similarity in Ragas

Answer Correctness
● Answer Correctness involves gauging
the accuracy of the generated answer
when compared to the ground truth
● This evaluation relies on the ground
truth and the answer, with scores
ranging from 0 to 1
● Answer correctness encompasses two
critical aspects
○ Factual similarity
○ Semantic similarity between the
generated answer and the ground
truth
● Final score is a weighted average
Answer Correctness
● It is computed as the weighted average of factual correctness and the semantic similarity between the given answer and the ground truth
● Factual correctness quantifies the factual overlap between the generated answer and the ground truth answer
● In the second example:
○ TP: [Einstein was born in 1879]
○ FP: [Einstein was born in Spain]
○ FN: [Einstein was born in Germany]
● We then calculate the F1 score using the formula mentioned before
● Then we calculate the cosine similarity between the generated answer and the ground truth
● Answer correctness is the weighted average of the above two metrics
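The computation can be sketched as below. The 0.75/0.25 weight split and the 0.8 semantic similarity are illustrative assumptions, not fixed constants:

```python
def f1_score(tp, fp, fn):
    """F1 over claim-level true positives, false positives, and false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def answer_correctness(factual_f1, semantic_similarity, factual_weight=0.75):
    """Weighted average of factual F1 and semantic similarity.
    factual_weight is an assumed default for this sketch."""
    return factual_weight * factual_f1 + (1 - factual_weight) * semantic_similarity

# Second example: TP=1, FP=1, FN=1 -> precision = recall = F1 = 0.5.
factual = f1_score(1, 1, 1)  # 0.5
# 0.8 is a made-up semantic similarity value for illustration.
score = answer_correctness(factual, 0.8)  # 0.75 * 0.5 + 0.25 * 0.8 = 0.575
```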
Answer Correctness in Ragas

Aspect Critique
● This is designed to assess
submissions based on predefined
aspects such as harmlessness and
correctness
● Users have the flexibility to define their
own aspects for evaluating
submissions according to their specific
criteria
● The output of aspect critiques is binary,
indicating whether the submission
aligns with the defined aspect or not.
● This evaluation is performed using the
‘answer’ as input.

Aspect Critique in Ragas

