Guide To End-to-End RAG Systems Evaluation
Dipanjan (DJ)
Standard RAG System Evaluation Metrics
Source: https://www.anthropic.com/news/contextual-retrieval
RAG System with Sources
Here we take the input query of each golden reference data sample, pass it to our RAG system, and capture the retrieved context and the LLM response as outputs. We then append these outputs to the corresponding golden reference sample to create a test case (see the sketch after the field list below).
Each test case will consist of the following:
Input Query: the input question posed to the RAG system
Expected Output: the ground-truth answer expected from the LLM generator
Context: the ground-truth context that should be retrieved
Actual Output: the actual response from the RAG system's LLM generator
Retrieved Context: the context actually retrieved from the RAG system's vector DB
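A minimal sketch of this test-case construction using DeepEval's LLMTestCase follows; the RagSystemStub class, its retrieve() and generate() methods, and the golden sample contents are all placeholder assumptions standing in for your own pipeline and dataset:

```python
from deepeval.test_case import LLMTestCase


class RagSystemStub:
    """Placeholder for your real RAG pipeline (an assumption, not a real API)."""

    def retrieve(self, query: str) -> list[str]:
        # Your vector DB lookup goes here.
        return ["Contextual Retrieval prepends chunk-specific explanatory "
                "context to each chunk before embedding."]

    def generate(self, query: str, context: list[str]) -> str:
        # Your LLM generation call goes here.
        return ("It prepends chunk-specific explanatory context to each "
                "chunk before embedding.")


rag_system = RagSystemStub()

# Hypothetical golden reference dataset: each sample holds the input query,
# the ground-truth answer, and the ground-truth context.
golden_samples = [
    {
        "query": "What is contextual retrieval?",
        "expected_output": "It prepends chunk-specific explanatory context "
                           "to each chunk before embedding.",
        "context": ["Contextual Retrieval prepends chunk-specific explanatory "
                    "context to each chunk before embedding."],
    },
]

test_cases = []
for sample in golden_samples:
    retrieved_context = rag_system.retrieve(sample["query"])
    actual_output = rag_system.generate(sample["query"], retrieved_context)

    test_cases.append(
        LLMTestCase(
            input=sample["query"],                      # Input Query
            expected_output=sample["expected_output"],  # Expected Output (ground truth)
            context=sample["context"],                  # Context (ground-truth retrieval)
            actual_output=actual_output,                # Actual Output from the generator
            retrieval_context=retrieved_context,        # Retrieved Context from the vector DB
        )
    )
```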
Define the RAG metrics you want to evaluate each test case on, in terms of (sketched after this list):
Metric definition
Pass/fail threshold
Specific evaluation instructions, in the case of custom metrics
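For example, with DeepEval you can instantiate built-in RAG metrics with a pass/fail threshold, and express a custom metric via GEval with your own evaluation instructions. The "Conciseness" metric, its criteria text, and the threshold values below are illustrative assumptions, not prescribed by the source:

```python
from deepeval.metrics import (
    AnswerRelevancyMetric,
    ContextualPrecisionMetric,
    ContextualRecallMetric,
    FaithfulnessMetric,
    GEval,
)
from deepeval.test_case import LLMTestCaseParams

# Built-in RAG metrics: each takes a pass/fail threshold on a [0, 1] score.
metrics = [
    AnswerRelevancyMetric(threshold=0.7),
    FaithfulnessMetric(threshold=0.7),
    ContextualPrecisionMetric(threshold=0.7),
    ContextualRecallMetric(threshold=0.7),
    # Custom metric via GEval: the name and criteria are illustrative only.
    GEval(
        name="Conciseness",
        criteria="Assess whether the actual output answers the input question "
        "without unnecessary detail.",
        evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
        threshold=0.5,
    ),
]
```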
Evaluate each test case and store the metric results, then visualize them on your dashboard as needed and improve the system over time (one storage approach is sketched below).
Source: Dipanjan (DJ)
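One way to evaluate each test case and store the scores for dashboarding is to call each metric's measure() method and collect the results. This sketch reuses the test_cases and metrics from the snippets above; pandas and the CSV filename are incidental choices, not part of the source's workflow:

```python
import pandas as pd

records = []
for test_case in test_cases:
    for metric in metrics:
        metric.measure(test_case)  # runs the metric (LLM-as-judge) on this test case
        records.append(
            {
                "query": test_case.input,
                "metric": type(metric).__name__,
                "score": metric.score,     # numeric score in [0, 1]
                "passed": metric.success,  # True if score >= threshold
                "reason": metric.reason,   # judge's explanation, where provided
            }
        )

results_df = pd.DataFrame(records)
results_df.to_csv("rag_eval_results.csv", index=False)  # feed this to your dashboard
print(results_df.groupby("metric")["score"].mean())     # quick per-metric summary
```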
RAG Evaluation Example with DeepEval
You can leverage libraries like DeepEval and Ragas to simplify this workflow, or even create your own custom evaluation metrics. A minimal batch-evaluation sketch with DeepEval follows.
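DeepEval's evaluate() runs every metric against every test case in one call and prints a pass/fail summary, reusing the test_cases and metrics defined in the sketches above:

```python
from deepeval import evaluate

# Batch-evaluate: applies each metric to each test case and reports
# which test cases pass their thresholds.
evaluate(test_cases=test_cases, metrics=metrics)
```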