
Evaluation Of LLMs Using Automatic Metrics

MTech Project Mid-term Presentation

CS22M104 Beena
(under the guidance of Dr Srinivas Padmanabhuni)

Indian Institute of Technology, Tirupati

October 13, 2023

Introduction

Problem Statement
Evaluating Large Language Model (LLM) Outputs Using Automatic Metrics in
Comparison to Human References for General Fact-based Questions.

The rapid proliferation of language models in applications ranging from chatbots to content generators necessitates a comprehensive evaluation of their outputs, especially when these outputs are answers to fact-based questions.

The core issue this project aims to address is: "How closely do automatic evaluation metrics correlate with human judgments when assessing the outputs of different Language Models (LLMs) for general fact-based questions?"


Motivation

The motivation for this project can be distilled into several key points:
Proliferation of Language Models in Digital Ecosystems.
Standardizing Evaluation Mechanisms.
Enhancing User Trust and Dependability on models.
Bridging the Gap Between Automatic and Human Evaluations.
Enabling Efficient and Scalable Model Selection.

By understanding the strengths and weaknesses of current evaluation metrics, the AI community can drive innovations in model assessment techniques.

Motivation
This, in turn, will push the boundaries of what LLMs can achieve, ensuring
that their outputs are not just technically impressive but also practically
useful.

For instance, the method described in Paper [5] offers a way to evaluate automatic machine translation evaluation metrics themselves automatically, with no human involvement beyond a set of reference translations.

Figure: Summary of ORANGE scores for 6 automatic evaluation metrics

Foundational Concepts To Check Metrics

First, two foundational concepts from information retrieval and machine learning, used especially when evaluating classification models:

Precision: Out of all the items that we identified as positive, how many of them are actually positive?

\[ \text{Precision} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Positives (FP)}} \tag{1} \]

Recall: Out of all the actual positive items, how many of them did we correctly identify as positive?

\[ \text{Recall} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Negatives (FN)}} \tag{2} \]
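To make these two definitions concrete, here is a minimal Python sketch (not part of the original slides; function and variable names are illustrative) that computes precision and recall by comparing a set of predicted items against a gold set:

```python
def precision_recall(predicted, gold):
    """Compute precision and recall for predicted vs. gold item sets."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)   # true positives: predicted and actually positive
    fp = len(predicted - gold)   # false positives: predicted but not actually positive
    fn = len(gold - predicted)   # false negatives: positive items we missed

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Example: 3 of 4 predicted items are correct, and 3 of 5 gold items were found.
print(precision_recall(["a", "b", "c", "d"], ["a", "b", "c", "e", "f"]))  # (0.75, 0.6)
```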



Different Metrics Read So Far I

BLEU (Bilingual Evaluation Understudy) [2]: The primary programming task for a BLEU implementor is to compare n-grams of the candidate with the n-grams of the reference translation and count the number of matches. These matches are position-independent. The more matches, the better the candidate translation is. The BLEU score is calculated as:
\[ \text{BLEU} = \text{BP} \times \exp\left( \sum_{n=1}^{N} w_n \log p_n \right) \]



Different Metrics Read So Far II

Figure: BLEU vs Bilingual and Monolingual Judgments

Where:
BP is the brevity penalty.
N is the maximum order of n-grams considered.
$w_n$ are the weights for each n-gram order (typically, when N = 4, $w_1 = w_2 = w_3 = w_4 = 0.25$).
$p_n$ is the precision for n-grams.
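To illustrate how these pieces fit together, the following is a rough, self-contained Python sketch of sentence-level BLEU with clipped n-gram precisions and the brevity penalty. It is a simplification (single reference, no smoothing, sentence level rather than corpus level), and all names are illustrative rather than taken from the BLEU paper:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Counter of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU against a single reference."""
    weights = [1.0 / max_n] * max_n          # uniform w_n, e.g. 0.25 when N = 4
    log_precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        # Clipped matches: a candidate n-gram counts at most as often as it occurs in the reference.
        matches = sum(min(count, ref[gram]) for gram, count in cand.items())
        p_n = matches / max(sum(cand.values()), 1)
        if p_n == 0:                          # avoid log(0); real BLEU uses smoothing instead
            return 0.0
        log_precisions.append(math.log(p_n))

    # Brevity penalty: penalise candidates shorter than the reference.
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / max(c, 1))
    return bp * math.exp(sum(w * lp for w, lp in zip(weights, log_precisions)))

print(sentence_bleu("the quick brown fox jumps over the dog".split(),
                    "the quick brown fox jumps over the lazy dog".split()))
```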



Different Metrics Read So Far III

ROUGE: ROUGE [4] stands for Recall-Oriented Understudy for Gisting Evaluation. It includes measures to automatically determine the quality of a summary by comparing it to other (ideal) summaries created by humans.
There are four main ROUGE types:
ROUGE-N: N-gram Co-Occurrence Statistics. The ROUGE-N score is given by:
\[ \text{ROUGE-N} = \frac{\sum_{S \in \text{Reference Summaries}} \sum_{\text{n-gram} \in S} \text{Count}_{\text{match}}(\text{n-gram})}{\sum_{S \in \text{Reference Summaries}} \sum_{\text{n-gram} \in S} \text{Count}(\text{n-gram})} \tag{3} \]

ROUGE-W: Weighted Longest Common Subsequence. The ROUGE-W score is given by:
\[ \text{ROUGE-W} = \frac{(1 + \beta^2)\, P\, R}{R + \beta^2 P} \tag{4} \]



Different Metrics Read So Far IV

Figure: Rouge-W example

ROUGE-L: Longest Common Subsequence. The ROUGE-L score is defined by:
\[ \text{ROUGE-L} = \frac{LCS(X, Y)}{\max(|X|, |Y|)} \tag{5} \]
where LCS(X, Y) is the length of the Longest Common Subsequence between the system summary X and reference summary Y.



Different Metrics Read So Far V

ROUGE-S: Skip-Bigram Co-Occurrence Statistics. The ROUGE-S score is then:
\[ \text{ROUGE-S} = \frac{2 \times P \times R}{P + R} \tag{6} \]
For ROUGE-S recall (R) and precision (P):
\[ R = \frac{\text{number of matching skip-bigrams}}{\text{total number of skip-bigrams in reference}} \tag{7} \]
\[ P = \frac{\text{number of matching skip-bigrams}}{\text{total number of skip-bigrams in system}} \tag{8} \]

Figure: ROUGE-L, ROUGE-S example
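For intuition, below is a minimal Python sketch of ROUGE-N recall and the LCS-based ROUGE-L exactly as defined in equations (3) and (5). This is not the official ROUGE package, and the function names are illustrative:

```python
from collections import Counter

def rouge_n_recall(candidate, reference, n=2):
    """ROUGE-N: fraction of reference n-grams that also occur in the candidate."""
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    matches = sum(min(count, cand[gram]) for gram, count in ref.items())
    return matches / max(sum(ref.values()), 1)

def lcs_length(x, y):
    """Length of the longest common subsequence, via dynamic programming."""
    table = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, xi in enumerate(x, 1):
        for j, yj in enumerate(y, 1):
            table[i][j] = table[i - 1][j - 1] + 1 if xi == yj else max(table[i - 1][j], table[i][j - 1])
    return table[len(x)][len(y)]

def rouge_l(candidate, reference):
    """ROUGE-L as in Eq. (5): LCS length divided by the longer of the two lengths."""
    return lcs_length(candidate, reference) / max(len(candidate), len(reference), 1)

cand = "the cat was found under the bed".split()
ref = "the cat was under the bed".split()
print(rouge_n_recall(cand, ref, n=2), rouge_l(cand, ref))  # 0.8  0.857...
```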


Different Metrics Read So Far VI

PARAEVAL: The ParaEval metric [3] uses a large collection of paraphrases, automatically extracted from parallel corpora, to evaluate MT performance.

Figure: A detailed look at the scores assigned by lexical and paraphrase/synonym comparisons



Different Metrics Read So Far VII
METEOR (Metric for Evaluation of Translation with Explicit ORdering): Given a pair of strings to compare (a system translation
and a reference translation), METEOR (Banerjee and Lavie, 2005)
first creates a word alignment between the two strings. Based on the
number of word or unigram matches and the amount of string
fragmentation represented by the alignment, METEOR calculates a
score for the pair of strings. The METEOR score is computed as:

\[ \text{METEOR} = (1 - \text{Penalty}) \times F_{mean} \tag{9} \]

Where the harmonic mean $F_{mean}$ is given by:

\[ F_{mean} = \frac{P \times R}{\alpha P + (1 - \alpha) R} \tag{10} \]

The penalty is:

\[ \text{Penalty} = \gamma \left( \frac{\text{chunks}}{m} \right)^{\beta} \tag{11} \]
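Putting equations (9)-(11) together, here is a rough Python sketch of the METEOR score composition using exact unigram matches only; real METEOR also matches stems and synonyms and chooses an optimal alignment, so this is only an approximation, and the default-looking parameter values (alpha = 0.9, beta = 3, gamma = 0.5) are assumptions for illustration:

```python
def meteor_sketch(candidate, reference, alpha=0.9, beta=3.0, gamma=0.5):
    """METEOR-style score for two token lists, exact surface matches only."""
    # Greedy one-to-one alignment on exact matches (a stand-in for METEOR's alignment).
    used = [False] * len(reference)
    alignment = []                              # (candidate index, reference index) pairs
    for i, tok in enumerate(candidate):
        for j, ref_tok in enumerate(reference):
            if not used[j] and tok == ref_tok:
                used[j] = True
                alignment.append((i, j))
                break

    m = len(alignment)                          # number of matched unigrams
    if m == 0:
        return 0.0
    precision, recall = m / len(candidate), m / len(reference)
    f_mean = precision * recall / (alpha * precision + (1 - alpha) * recall)   # Eq. (10)

    # Chunks: maximal runs of matches that are contiguous in both strings.
    chunks = 1
    for (ci, ri), (cj, rj) in zip(alignment, alignment[1:]):
        if cj != ci + 1 or rj != ri + 1:
            chunks += 1

    penalty = gamma * (chunks / m) ** beta      # Eq. (11)
    return (1 - penalty) * f_mean               # Eq. (9)

print(meteor_sketch("the cat sat on the mat".split(),
                    "the cat is on the mat".split()))
```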
Different Metrics Read So Far VIII

MAXSIM: Maximum Similarity [1], which is based on precision and recall, allows for synonyms, and weights the matches found. The similarity score sim-score for the entire system corpus is given by:

\[ \text{sim-score} = \frac{1}{|S|} \sum_{s=1}^{|S|} \left( \frac{1}{N} \sum_{n=1}^{N} F_{mean_{s,n}} \right) \tag{12} \]

Where:
|S| is the number of sentence pairs.
N is set to 3, representing the calculation of unigram, bigram, and trigram scores.
$F_{mean_{s,n}}$ is the harmonic mean of precision and recall for sentence pair $s$ and n-gram order $n$.
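As a toy illustration of equation (12), the sketch below averages per-sentence F-means over n-gram orders 1 to 3 and over the corpus. Unlike real MAXSIM, it uses exact n-gram overlap instead of a maximum-weight bipartite matching with synonym and similarity weights, so it only mirrors the aggregation step; all names are illustrative:

```python
from collections import Counter

def ngram_fmean(candidate, reference, n):
    """Harmonic mean of exact n-gram precision and recall (stand-in for F_mean_{s,n})."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    matches = sum(min(count, ref[gram]) for gram, count in cand.items())
    p = matches / max(sum(cand.values()), 1)
    r = matches / max(sum(ref.values()), 1)
    return 2 * p * r / (p + r) if (p + r) else 0.0

def sim_score(pairs, max_n=3):
    """Corpus-level score of Eq. (12): mean F-mean over sentence pairs and n-gram orders."""
    per_sentence = [sum(ngram_fmean(cand, ref, n) for n in range(1, max_n + 1)) / max_n
                    for cand, ref in pairs]
    return sum(per_sentence) / len(per_sentence)

pairs = [("the cat sat on the mat".split(), "the cat is on the mat".split()),
         ("paris is the capital of france".split(), "the capital of france is paris".split())]
print(sim_score(pairs))
```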
Limitations in Metrics

BLEU: It is largely based on exact word matching, so it might not capture semantic nuances effectively, especially when answers are factually accurate but phrased differently.
ROUGE: It provides separate scores such as precision, recall, and F1 measure, which can give a more detailed evaluation than BLEU, but it still relies on surface overlap with the references.
ParaEval: Perhaps a lesser-known metric; its specific design and objectives need to be considered before applying it to this task.



Metrics That Correlate Most for LLM Evaluation

Given the nature of this work, which is to evaluate the factual accuracy of LLM outputs:
METEOR: Designed to consider synonymy, stemming, and word order, METEOR is a strong candidate. It allows us to evaluate answers that might be phrased differently but still retain the same factual content.
MAXSIM: Due to its emphasis on semantic understanding, MaxSim is another recommended metric. For fact-based questions, the semantic content of the answer (i.e., its meaning) is often more crucial than the exact phrasing.

Timeline
Phase 1 (Midterm): The main goal for Phase 1 (midterm) was to learn about the metrics in use for different LLM evaluation tasks and identify the best two of them. The second part has just started and is still in progress.
Phase 1 (End Term): The main goal for Phase 1 (end term) is to go through different language models, implement the papers studied, and understand how these metrics have been applied in previous work in different fields. It is also beneficial to use a combination of metrics to get a comprehensive view; for the best results, we will run preliminary tests using all the metrics and check which two correlate best with human judgments for our specific dataset and task (see the sketch below).
Phase 2: The main goal for Phase 2 is the full evaluation of different language models on suitable datasets, presenting results based on the performance of the different LLMs and reporting which language model works best for these queries.
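To make the planned preliminary tests concrete, here is a minimal sketch of how metric scores could be compared against human judgments using Pearson and Spearman correlation from scipy. The scores and ratings below are hypothetical placeholder values, not project results:

```python
from scipy.stats import pearsonr, spearmanr

# Hypothetical per-answer scores: each metric scores the same set of LLM answers,
# and humans rate the same answers for factual accuracy (e.g. on a 1-5 scale).
metric_scores = {
    "BLEU":    [0.12, 0.30, 0.45, 0.20, 0.60],
    "ROUGE-L": [0.35, 0.50, 0.62, 0.41, 0.77],
    "METEOR":  [0.40, 0.55, 0.70, 0.52, 0.81],
}
human_ratings = [2, 3, 4, 3, 5]

for name, scores in metric_scores.items():
    pearson, _ = pearsonr(scores, human_ratings)     # linear correlation
    spearman, _ = spearmanr(scores, human_ratings)   # rank correlation
    print(f"{name}: Pearson={pearson:.2f}, Spearman={spearman:.2f}")
```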
Conclusion

Previous Work: While many metrics are available, there may not be clear guidance on which metrics are best suited for evaluating the factual accuracy of LLM outputs. For users looking for the best language models to rely on, comparing them is an important phase.
My Project: By highlighting which LLM performs best in answering fact-based questions, we offer a shortcut for users looking to employ language models in knowledge-based systems, ensuring they get the best results without conducting their own evaluations.



References I

Yee Seng Chan and Hwee Tou Ng.
MaxSim: A maximum similarity metric for machine translation evaluation.
In Proceedings of ACL-08: HLT, pages 55–62, 2008.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu.
BLEU: a method for automatic evaluation of machine translation.
In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, pages 311–318, 2002.
Liang Zhou, Chin-Yew Lin, Dragos Stefan Munteanu, and Eduard Hovy.
ParaEval: Using paraphrases to evaluate summaries automatically.
In Proceedings of the Human Language Technology Conference of the NAACL, Main Conference, pages 447–454, 2006.



References II

Chin-Yew Lin.
ROUGE: A package for automatic evaluation of summaries.
In Text Summarization Branches Out, Barcelona, Spain, Association for Computational Linguistics, pages 74–81, 2004.
Chin-Yew Lin and Franz Josef Och.
ORANGE: a method for evaluating automatic evaluation metrics for machine translation.
In Proceedings of the 20th International Conference on Computational Linguistics, 2004.



Thank you
