
Comprehensive Guide to Evaluating LLM and RAG Systems

Dipanjan (DJ) Sarkar


Building an LLM Evaluation Framework

● An LLM Evaluation Framework is designed to evaluate and test the outputs of LLM systems on a range of different criteria
● Key components of this framework would be:
○ An evaluation dataset, typically of (context,) input and output pairs or triplets
○ Evaluation metrics we want to use to measure the performance of our system

Source: https://www.confident-ai.com/blog/llm-evaluation-metrics-everything-you-need-for-llm-evaluation
Typical components of a Test Case data point

LLM Evaluation Metrics

● LLM evaluation metrics such as answer correctness, semantic similarity, and relevancy are measures that help score your LLM system’s output based on criteria you care about
● They help quantify the performance of your LLM or RAG system

Most Popular LLM Evaluation Metrics

What makes an Evaluation Metric Good to Use?

Common Evaluation Metric Scorers

Statistical Metric Scorers

Model-based Non-LLM Scorers

Model-based LLM Scorers - G-Eval

G-Eval first generates a series of evaluation steps using chains of thought (CoT), then uses the generated steps to determine the final score via a form-filling paradigm
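The form-filling scoring step can be sketched as a probability-weighted sum over candidate score tokens. This is a minimal illustration of the idea, not DeepEval's actual API; the 1–5 scale and the token probabilities below are assumed values.

```python
def geval_score(score_token_probs):
    """Probability-weighted score over candidate score tokens,
    in the spirit of G-Eval's form-filling paradigm: sum of s * p(s)."""
    return sum(score * prob for score, prob in score_token_probs.items())

# Hypothetical probabilities an LLM might assign to score tokens on a 1-5 scale.
probs = {3: 0.1, 4: 0.6, 5: 0.3}
final = geval_score(probs)  # 0.3 + 2.4 + 1.5 = 4.2
```

Weighting by token probabilities yields a finer-grained score than taking the single most likely score token.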

G-Eval with DeepEval

Combining Statistical and Model-based Scorers

Ragas for evaluating LLM and RAG Systems

● Ragas is a framework that helps you evaluate your Retrieval Augmented Generation (RAG) pipelines
● Very useful, as it provides a suite of metrics to evaluate each stage of a RAG pipeline

Source: https://docs.ragas.io/en/stable/index.html
Key Metrics for RAG Evaluation

● Metrics to evaluate specific components of a RAG pipeline include:
○ Faithfulness
○ Answer Relevance
○ Context Precision
○ Context Relevancy
○ Context Recall
○ Context Entities Recall
● Metrics to evaluate the entire LLM or RAG system include:
○ Answer Semantic Similarity
○ Answer Correctness

Faithfulness

● Measures the factual consistency of the generated answer against the given context
● Calculated from the answer and the retrieved context
● The score is scaled to the (0, 1) range; higher is better

Faithfulness

● The generated answer is regarded as faithful if all the claims made in the answer can be inferred from the given context
● Computed as: Faithfulness = (number of claims in the answer supported by the context) / (total number of claims in the answer)

Faithfulness
● Let’s examine how faithfulness was calculated using the low-faithfulness answer:
○ Step 1: Break the generated answer into individual statements.
■ Statement 1: “Einstein was born in Germany.”
■ Statement 2: “Einstein was born on 20th March 1879.”
○ Step 2: For each of the generated statements, verify whether it can be inferred from the given context.
■ Statement 1: Yes
■ Statement 2: No
○ Step 3: Apply the faithfulness formula: 1 supported statement out of 2 gives a score of 0.5.
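The steps above reduce to a simple ratio. A minimal sketch, assuming the per-statement verdicts are given (in practice an LLM breaks the answer into statements and judges each one):

```python
def faithfulness_score(verdicts):
    """Fraction of answer statements that can be inferred from the context."""
    if not verdicts:
        return 0.0
    return sum(verdicts) / len(verdicts)

# Worked example: statement 1 is supported, statement 2 is not.
score = faithfulness_score([True, False])  # 1 / 2 = 0.5
```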

Faithfulness in Ragas

Answer Relevance

● Answer Relevancy focuses on assessing how pertinent or relevant the generated answer is to the given prompt
● Lower scores are assigned to answers that are incomplete or contain redundant information, while higher scores indicate better relevancy
● The metric is computed using the question, the context, and the answer

Answer Relevance
● To calculate this score, the LLM is
prompted to generate an
appropriate question for the
generated answer multiple times
● The mean cosine similarity
between these generated
questions and the original
question is measured
● If the generated answer accurately
addresses the initial question, the
LLM should be able to generate
questions from the answer that
align with the original question

Answer Relevance
● To calculate the relevance of the answer to the given question, we follow two steps:
○ Step 1: Reverse-engineer ‘n’ variants of the question from the generated answer using a Large Language Model (LLM). For instance, for the first answer, the LLM might generate the following possible questions:
■ Question 1: “In which part of Europe is France located?”
■ Question 2: “What is the geographical location of France within Europe?”
■ Question 3: “Can you identify the region of Europe where France is situated?”
○ Step 2: Calculate the mean cosine similarity between the generated questions and the actual question.
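Step 2 can be sketched in plain Python. The 2-d vectors below are toy stand-ins; a real system would embed the questions with an embedding model before comparing them:

```python
import math

def cosine_sim(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def answer_relevancy(original_q_emb, generated_q_embs):
    """Mean cosine similarity between the original question's embedding
    and the embeddings of the LLM-generated questions."""
    sims = [cosine_sim(original_q_emb, g) for g in generated_q_embs]
    return sum(sims) / len(sims)

# Toy embeddings: one generated question identical, one orthogonal.
score = answer_relevancy([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])  # (1 + 0) / 2 = 0.5
```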

Answer Relevance in Ragas

Context Precision

● A metric that evaluates whether all of the ground-truth relevant items present in the contexts are ranked at the top
● Ideally, all the relevant chunks should appear at the top ranks
● This metric is computed using the question, the ground_truth, and the contexts
● A higher score is better

Context Precision

● To calculate context precision in the low context precision example:
○ Step 1: For each chunk in the retrieved context, check whether it is relevant to the ground truth for the given question.
○ Step 2: Calculate precision@k for each chunk in the context.
○ Step 3: Calculate the mean of precision@k to arrive at the final context precision score.
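The three steps can be sketched as follows, assuming the per-chunk 0/1 relevance verdicts are already given (an LLM produces them in practice), and averaging precision@k over the relevant ranks:

```python
def context_precision(relevance_flags):
    """Mean of precision@k, taken at each rank k where the chunk is relevant.
    relevance_flags: 0/1 per retrieved chunk, in rank order."""
    hits, precisions = 0, []
    for k, relevant in enumerate(relevance_flags, start=1):
        if relevant:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(precisions) if precisions else 0.0

# Relevant chunk at rank 1, irrelevant at rank 2, relevant at rank 3:
score = context_precision([1, 0, 1])  # mean of 1/1 and 2/3, about 0.833
```

Because later relevant chunks contribute lower precision@k values, the score rewards retrievers that rank relevant chunks first.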

Context Precision in Ragas

Context Relevancy
● This metric gauges the relevancy of the retrieved context, calculated based on both the question and the contexts
● The retrieved context should exclusively contain essential information to address the provided query
● We first estimate |S| by identifying sentences within the retrieved context that are relevant for answering the given question, then divide by the total number of sentences in the context
● Higher scores indicate better relevancy

Context Relevancy

● To calculate context relevancy in the given example:
○ Step 1: For the context document, find the number of sentences |S| that are relevant to the question
○ Step 2: Count the total number of sentences |T| in the retrieved context
○ Step 3: Score is |S| / |T|
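A minimal sketch of the ratio, assuming the sentence counts have already been determined (an LLM identifies the relevant sentences in practice):

```python
def context_relevancy(num_relevant_sentences, num_total_sentences):
    """|S| / |T|: relevant sentences over total sentences in the context."""
    if num_total_sentences == 0:
        return 0.0
    return num_relevant_sentences / num_total_sentences

# Example: 2 of the 4 sentences in the context are relevant.
score = context_relevancy(2, 4)  # 0.5
```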

Context Relevancy in Ragas

Context Recall

● Measures the extent to which the retrieved context aligns with the annotated answer, treated as the ground truth
● Computed based on the ground truth and the retrieved context
● Each sentence in the ground truth answer is analyzed to determine whether it can be attributed to the retrieved context or not
● A higher score is better

Context Recall
● To calculate context recall in the low context recall example:
○ Step 1: Break the ground truth answer into individual statements.
■ Statement 1: “France is in Western Europe.”
■ Statement 2: “Its capital is Paris.”
○ Step 2: For each of the ground truth statements, verify whether it can be attributed to the retrieved context.
■ Statement 1: Yes
■ Statement 2: No
○ Step 3: Apply the context recall formula: 1 attributed statement out of 2 gives a score of 0.5.
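The steps above again reduce to a ratio over per-statement verdicts. A minimal sketch, assuming the attribution verdicts are given:

```python
def context_recall(attributed):
    """Fraction of ground-truth statements attributable to the retrieved context."""
    if not attributed:
        return 0.0
    return sum(attributed) / len(attributed)

# Worked example: "France is in Western Europe" -> Yes; "Its capital is Paris" -> No.
score = context_recall([True, False])  # 1 / 2 = 0.5
```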

Context Recall in Ragas

Context Entities Recall
● This metric measures the recall of the retrieved context based on the number of entities present in both the ground_truths and the contexts, relative to the number of entities present in the ground_truths alone
● Simply put, it is a measure of what fraction of entities from the ground_truths are recalled in the contexts
● It can help evaluate the retrieval mechanism for entities, based on comparison with the entities present in the ground_truths
● A higher score is better
Context Entities Recall
● To calculate context entity recall in the given examples:
○ Step 1: Find the entities present in the ground truths.
■ Entities in ground truth (GE): [‘Taj Mahal’, ‘Yamuna’, ‘Agra’, ‘1631’, ‘Shah Jahan’, ‘Mumtaz Mahal’]
○ Step 2: Find the entities present in each context.
■ Entities in context (CE1): [‘Taj Mahal’, ‘Agra’, ‘Shah Jahan’, ‘Mumtaz Mahal’, ‘India’]
■ Entities in context (CE2): [‘Taj Mahal’, ‘UNESCO’, ‘India’]
○ Step 3: Calculate context entity recall as |GE ∩ CE| / |GE|.
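The steps above can be sketched directly with Python sets, using the entity lists from the example (entity extraction itself is done by an LLM in practice):

```python
def context_entity_recall(ground_truth_entities, context_entities):
    """|GE intersect CE| / |GE|: fraction of ground-truth entities in the context."""
    ge, ce = set(ground_truth_entities), set(context_entities)
    return len(ge & ce) / len(ge)

GE = ['Taj Mahal', 'Yamuna', 'Agra', '1631', 'Shah Jahan', 'Mumtaz Mahal']
CE1 = ['Taj Mahal', 'Agra', 'Shah Jahan', 'Mumtaz Mahal', 'India']
CE2 = ['Taj Mahal', 'UNESCO', 'India']

score1 = context_entity_recall(GE, CE1)  # 4 shared entities / 6 = 0.667
score2 = context_entity_recall(GE, CE2)  # 1 shared entity / 6 = 0.167
```

The first context recalls far more of the ground-truth entities, so it scores much higher.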

Context Entities Recall in Ragas

Answer Semantic Similarity
● Answer Semantic Similarity pertains to
the assessment of the semantic
resemblance between the generated
answer and the ground truth
● This evaluation is based on the ground
truth and the answer, with values
falling within the range of 0 to 1
● A higher score signifies a better
alignment between the generated
answer and the ground truth
● Ragas utilizes a cross-encoder model
to calculate the semantic similarity
score using cosine similarity of the text
embeddings
Answer Semantic Similarity in Ragas

Answer Correctness
● Answer Correctness involves gauging
the accuracy of the generated answer
when compared to the ground truth
● This evaluation relies on the ground
truth and the answer, with scores
ranging from 0 to 1
● Answer correctness encompasses two
critical aspects
○ Factual similarity
○ Semantic similarity between the
generated answer and the ground
truth
● Final score is a weighted average
Answer Correctness
● It is computed as the weighted average of factual correctness and the semantic similarity between the given answer and the ground truth
● Factual correctness quantifies the factual overlap between the generated answer and the ground truth answer
● In the second example:
○ TP: [Einstein was born in 1879]
○ FP: [Einstein was born in Spain]
○ FN: [Einstein was born in Germany]
● We then calculate the F1 score using the formula mentioned before
● Then we calculate the cosine similarity between the generated answer and the ground truth
● Answer correctness is the weighted average of the above two metrics
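The computation can be sketched as below. The 0.75/0.25 weight split and the 0.8 semantic similarity are illustrative assumptions, not fixed constants:

```python
def f1_score(tp, fp, fn):
    """F1 over claim-level true positives, false positives, and false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def answer_correctness(factual_f1, semantic_similarity, factual_weight=0.75):
    """Weighted average of factual F1 and semantic similarity.
    factual_weight is an assumed default for this sketch."""
    return factual_weight * factual_f1 + (1 - factual_weight) * semantic_similarity

# Second example: TP=1, FP=1, FN=1 -> precision = recall = F1 = 0.5.
factual = f1_score(1, 1, 1)  # 0.5
# 0.8 is a made-up semantic similarity value for illustration.
score = answer_correctness(factual, 0.8)  # 0.75 * 0.5 + 0.25 * 0.8 = 0.575
```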
Answer Correctness in Ragas

Aspect Critique
● This is designed to assess
submissions based on predefined
aspects such as harmlessness and
correctness
● Users have the flexibility to define their
own aspects for evaluating
submissions according to their specific
criteria
● The output of aspect critiques is binary,
indicating whether the submission
aligns with the defined aspect or not.
● This evaluation is performed using the
‘answer’ as input.

Aspect Critique in Ragas

