Guide To Evaluating LLM and RAG Systems
Source: https://www.confident-ai.com/blog/llm-evaluation-metrics-everything-you-need-for-llm-evaluation
Typical components of a Test Case data point
LLM Evaluation Metrics
Most Popular LLM Evaluation Metrics
What makes an Evaluation Metric Good to Use?
Common Evaluation Metric Scorers
Statistical Metric Scorers
Model-based Non-LLM Scorers
Model-based LLM Scorers - G-Eval
G-Eval with DeepEval
Combining Statistical and Model-based Scorers
Ragas for evaluating LLM and RAG Systems
Source: https://docs.ragas.io/en/stable/index.html
Key Metrics for RAG Evaluation
Faithfulness
● Let’s examine how faithfulness is calculated for the low-faithfulness answer:
○ Step 1: Break the generated answer into individual statements.
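The slide shows only Step 1. As a minimal sketch of how the score could then be formed, assume (following the Ragas documentation, not this slide) that each statement is judged as supported or unsupported by the retrieved context and that faithfulness is the supported fraction. The verdicts below are hypothetical placeholders for the judge's output.

```python
def faithfulness_score(verdicts):
    """Return the fraction of answer statements judged as supported by the context."""
    if not verdicts:
        raise ValueError("at least one statement is required")
    supported = sum(1 for is_supported in verdicts if is_supported)
    return supported / len(verdicts)


# Hypothetical verdicts for an answer split into two statements:
# the first is supported by the context, the second is not.
verdicts = [True, False]
score = faithfulness_score(verdicts)
print(score)
```

In practice the verdicts would come from an LLM judge; only the final ratio is sketched here.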
Faithfulness in Ragas
Answer Relevance
● To calculate this score, the LLM is prompted multiple times to generate an appropriate question for the generated answer.
● The mean cosine similarity between these generated questions and the original question is then measured.
● If the generated answer accurately addresses the initial question, the LLM should be able to generate questions from the answer that align with the original question.
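The mean cosine similarity described above can be sketched in plain Python. The two-dimensional vectors are toy stand-ins for real text embeddings, and the function names are illustrative, not part of the Ragas API.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def answer_relevance(question_vec, generated_question_vecs):
    """Mean cosine similarity between the original question and each generated question."""
    sims = [cosine_similarity(question_vec, g) for g in generated_question_vecs]
    return sum(sims) / len(sims)


# Toy embeddings (assumed): one generated question matches the original exactly,
# the other points in a somewhat different direction.
original = [1.0, 0.0]
generated = [[1.0, 0.0], [3.0, 4.0]]
relevance = answer_relevance(original, generated)
print(round(relevance, 3))
```

An answer that drifts off topic yields generated questions whose embeddings diverge from the original, pulling the mean down.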
● To calculate the relevance of the answer to the given question, we follow two steps:
○ Step 1: Reverse-engineer ‘n’ variants of the question from the generated answer using a Large Language Model (LLM). For instance, for the first answer, the LLM might generate the following possible questions:
Answer Relevance in Ragas
Context Precision
Context Precision in Ragas
Context Relevancy
● This metric gauges the relevancy of the retrieved context, calculated from both the question and the contexts.
● The retrieved context should contain only the information essential to address the provided query.
● We first estimate |S|, the number of sentences in the retrieved context that are relevant for answering the given question, and then compute: context relevancy = |S| / (total number of sentences in the retrieved context).
● Higher scores indicate better relevancy.
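A minimal sketch of this ratio, assuming (per the Ragas documentation) that the score is |S| divided by the total number of sentences in the retrieved context. The sentences and the set of relevant indices are hypothetical; in practice an LLM decides which sentences are relevant.

```python
def context_relevancy(context_sentences, relevant_indices):
    """Return |S| / total sentences, where S is the set of relevant sentences."""
    if not context_sentences:
        raise ValueError("retrieved context has no sentences")
    return len(set(relevant_indices)) / len(context_sentences)


# Hypothetical retrieved context for the question "What is the capital of France?"
context_sentences = [
    "France is a country in Western Europe.",
    "Its capital is Paris.",
    "French cuisine is known for its cheeses.",
    "The Tour de France is held every summer.",
]
# Assumed judgement: only the second sentence is needed to answer the question.
relevant_indices = [1]
relevancy = context_relevancy(context_sentences, relevant_indices)
print(relevancy)
```

A retriever that returns a lot of surrounding filler is penalised even when the needed sentence is present.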
Context Relevancy in Ragas
Context Recall
● To calculate context recall in the low context recall example:
○ Step 1: Break the ground truth answer into individual statements.
■ Statement 1: Yes
■ Statement 2: No
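As a rough sketch, assume the Yes/No marks above record whether each ground-truth statement can be attributed to the retrieved context (an interpretation based on the Ragas documentation, since the slide does not show that step). Context recall is then the fraction of statements marked Yes.

```python
def context_recall(attributions):
    """Return the fraction of ground-truth statements attributable to the retrieved context."""
    if not attributions:
        raise ValueError("at least one ground-truth statement is required")
    attributed = sum(1 for verdict in attributions.values() if verdict == "Yes")
    return attributed / len(attributions)


# Verdicts as listed on the slide; their meaning is the assumption stated above.
attributions = {"Statement 1": "Yes", "Statement 2": "No"}
recall = context_recall(attributions)
print(recall)
```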
Context Recall in Ragas
Context Entities Recall
● This metric measures the recall of the retrieved context, based on the number of entities present in both ground_truths and contexts relative to the number of entities present in ground_truths alone.
● Put simply, it measures what fraction of the entities in ground_truths are recalled.
● It can help evaluate the retrieval mechanism for entities, based on comparison with the entities present in ground_truths.
● Higher scores are better.
● To calculate context entity recall in the given examples:
○ Step 1: Find entities present in the ground truths.
■ Entities in ground truth (GE) - [‘Taj Mahal’, ‘Yamuna’, ‘Agra’, ‘1631’, ‘Shah Jahan’, ‘Mumtaz Mahal’]
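The remaining arithmetic can be sketched with Python sets: recall is the number of entities shared by the context and the ground truth, divided by the number of ground-truth entities. GE is taken from the slide; the context entity list below is a hypothetical example, not data from the source.

```python
def context_entity_recall(ground_truth_entities, context_entities):
    """Return |CE ∩ GE| / |GE| for two collections of entity strings."""
    ge = set(ground_truth_entities)
    ce = set(context_entities)
    if not ge:
        raise ValueError("ground truth has no entities")
    return len(ce & ge) / len(ge)


# Entities in the ground truth (GE), as listed above.
ground_truth_entities = ["Taj Mahal", "Yamuna", "Agra", "1631", "Shah Jahan", "Mumtaz Mahal"]
# Hypothetical entities extracted from one retrieved context.
context_entities = ["Taj Mahal", "Agra", "Shah Jahan", "Mumtaz Mahal", "India"]

entity_recall = context_entity_recall(ground_truth_entities, context_entities)
print(round(entity_recall, 3))
```

Entities that appear only in the context (here "India") do not affect the score, since the metric measures recall rather than precision.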
Context Entities Recall in Ragas
Answer Semantic Similarity
● Answer Semantic Similarity assesses the semantic resemblance between the generated answer and the ground truth.
● The evaluation is based on the ground truth and the answer, with values in the range 0 to 1.
● A higher score signifies better alignment between the generated answer and the ground truth.
● Ragas utilizes a cross-encoder model to calculate the semantic similarity score using cosine similarity of the text embeddings.
Answer Semantic Similarity in Ragas
Answer Correctness
● Answer Correctness gauges the accuracy of the generated answer when compared to the ground truth.
● The evaluation relies on the ground truth and the answer, with scores ranging from 0 to 1.
● Answer correctness encompasses two critical aspects:
○ Factual similarity
○ Semantic similarity between the generated answer and the ground truth
● The final score is a weighted average of the two.
● It is computed as the weighted average of factual correctness and the semantic similarity between the given answer and the ground truth.
● Factual correctness quantifies the factual overlap between the generated answer and the ground truth answer.
● In the second example:
○ TP: [Einstein was born in 1879]
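A sketch of the weighted average, assuming factual correctness is an F1-style score over true positives (TP), false positives (FP), and false negatives (FN), as in the Ragas documentation. Only the single TP statement comes from the slide; the FP and FN counts, the semantic similarity value, and the weights are hypothetical.

```python
def factual_correctness(tp, fp, fn):
    """F1-style overlap: TP / (TP + 0.5 * (FP + FN))."""
    denominator = tp + 0.5 * (fp + fn)
    return tp / denominator if denominator else 0.0


def answer_correctness(factual, semantic, weights=(0.75, 0.25)):
    """Weighted average of factual correctness and semantic similarity."""
    w_factual, w_semantic = weights
    return (w_factual * factual + w_semantic * semantic) / (w_factual + w_semantic)


# One TP statement (from the slide); one FP and one FN are assumed for illustration.
factual = factual_correctness(tp=1, fp=1, fn=1)
# Assumed semantic similarity between the answer and the ground truth.
semantic = 0.8
correctness = answer_correctness(factual, semantic)
print(round(factual, 3), round(correctness, 3))
```

Shifting weight toward the semantic term makes the metric more forgiving of paraphrase and less sensitive to individual factual errors.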
Aspect Critique
● Aspect Critique is designed to assess submissions based on predefined aspects such as harmlessness and correctness.
● Users can also define their own aspects for evaluating submissions according to their specific criteria.
● The output of an aspect critique is binary, indicating whether the submission aligns with the defined aspect or not.
● The evaluation is performed using the ‘answer’ as input.
Aspect Critique in Ragas