
Evaluation Of LLMs Using Automatic Metrics

MTech Project Mid-term Presentation

CS22M104 Beena
(under the guidance of Dr Srinivas Padmanabhuni)

Indian Institute of Technology, Tirupati

October 13, 2023

Introduction

Problem Statement
Evaluating Large Language Model (LLM) Outputs Using Automatic Metrics in
Comparison to Human References for General Fact-based Questions.

The rapid proliferation of language models in applications ranging from chatbots to content generators necessitates a comprehensive evaluation of their outputs, especially when these outputs are answers to fact-based questions.

The core issue this project aims to address is: "How closely do automatic evaluation metrics correlate with human judgments when assessing the outputs of different Language Models (LLMs) for general fact-based questions?"


Motivation

The motivation for this project can be distilled into several key points:
Proliferation of Language Models in Digital Ecosystems.
Standardizing Evaluation Mechanisms.
Enhancing User Trust and Dependability on models.
Bridging the Gap Between Automatic and Human Evaluations.
Enabling Efficient and Scalable Model Selection.

By understanding the strengths and weaknesses of current evaluation metrics, the AI community can drive innovations in model assessment techniques.

Motivation
This, in turn, will push the boundaries of what LLMs can achieve, ensuring
that their outputs are not just technically impressive but also practically
useful.

For instance, the method described in Paper [5] offers a way to evaluate automatic machine translation evaluation metrics themselves automatically, with no human involvement beyond a set of reference translations.

Figure: Summary of ORANGE scores for 6 automatic evaluation metrics

Foundational Concepts To Check Metrics

First, two foundational concepts from information retrieval and machine learning, used especially when evaluating classification models:

Precision: Out of all the items that we identified as positive, how many of them are actually positive?

\[ \text{Precision} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Positives (FP)}} \tag{1} \]

Recall: Out of all the actual positive items, how many of them did we correctly identify as positive?

\[ \text{Recall} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Negatives (FN)}} \tag{2} \]
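To make these two definitions concrete, here is a minimal Python sketch (not part of the original slides; function and variable names are illustrative) that computes precision and recall by comparing a set of predicted items against a gold set:

```python
def precision_recall(predicted, gold):
    """Compute precision and recall for predicted vs. gold item sets."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)   # true positives: predicted and actually positive
    fp = len(predicted - gold)   # false positives: predicted but not actually positive
    fn = len(gold - predicted)   # false negatives: positive items we missed

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Example: 3 of 4 predicted items are correct, and 3 of 5 gold items were found.
print(precision_recall(["a", "b", "c", "d"], ["a", "b", "c", "e", "f"]))  # (0.75, 0.6)
```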



Different Metrics Read So Far I

BLEU (Bilingual Evaluation Understudy) [2]: The primary programming task for a BLEU implementor is to compare n-grams of the candidate with the n-grams of the reference translation and count the number of matches. These matches are position-independent. The more matches, the better the candidate translation is. The BLEU score is calculated as:
\[ \text{BLEU} = \text{BP} \times \exp\left( \sum_{n=1}^{N} w_n \log p_n \right) \]



Different Metrics Read So Far II

Figure: BLEU vs Bilingual and Monolingual Judgments

Where:
BP is the brevity penalty.
N is the maximum order of n-grams considered.
$w_n$ are the weights for each n-gram order (typically, when N = 4, $w_1 = w_2 = w_3 = w_4 = 0.25$).
$p_n$ is the precision for n-grams.
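To illustrate how these pieces fit together, the following is a rough, self-contained Python sketch of sentence-level BLEU with clipped n-gram precisions and the brevity penalty. It is a simplification (single reference, no smoothing, sentence level rather than corpus level), and all names are illustrative rather than taken from the BLEU paper:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Counter of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU against a single reference."""
    weights = [1.0 / max_n] * max_n          # uniform w_n, e.g. 0.25 when N = 4
    log_precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        # Clipped matches: a candidate n-gram counts at most as often as it occurs in the reference.
        matches = sum(min(count, ref[gram]) for gram, count in cand.items())
        p_n = matches / max(sum(cand.values()), 1)
        if p_n == 0:                          # avoid log(0); real BLEU uses smoothing instead
            return 0.0
        log_precisions.append(math.log(p_n))

    # Brevity penalty: penalise candidates shorter than the reference.
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / max(c, 1))
    return bp * math.exp(sum(w * lp for w, lp in zip(weights, log_precisions)))

print(sentence_bleu("the quick brown fox jumps over the dog".split(),
                    "the quick brown fox jumps over the lazy dog".split()))
```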



Different Metrics Read So Far III

ROUGE: ROUGE [4] stands for Recall-Oriented Understudy for Gisting Evaluation. It includes measures to automatically determine the quality of a summary by comparing it to other (ideal) summaries created by humans.
There are four main ROUGE types:
ROUGE-N: N-gram Co-Occurrence Statistics. The ROUGE-N score is given by:
\[ \text{ROUGE-N} = \frac{\sum_{S \in \text{Reference Summaries}} \sum_{\text{n-gram} \in S} \text{Count}_{\text{match}}(\text{n-gram})}{\sum_{S \in \text{Reference Summaries}} \sum_{\text{n-gram} \in S} \text{Count}(\text{n-gram})} \tag{3} \]

ROUGE-W: Weighted Longest Common Subsequence. The ROUGE-W score is given by:
\[ \text{ROUGE-W} = \frac{(1 + \beta^2)\, P\, R}{R + \beta^2 P} \tag{4} \]



Different Metrics Read So Far IV

Figure: Rouge-W example

ROUGE-L: Longest Common Subsequence. The ROUGE-L score is defined by:
\[ \text{ROUGE-L} = \frac{LCS(X, Y)}{\max(|X|, |Y|)} \tag{5} \]
where LCS(X, Y) is the length of the Longest Common Subsequence between the system summary X and reference summary Y.



Different Metrics Read So Far V

ROUGE-S: Skip-Bigram Co-Occurrence Statistics. The ROUGE-S score is then:
\[ \text{ROUGE-S} = \frac{2 \times P \times R}{P + R} \tag{6} \]
For ROUGE-S recall (R) and precision (P):
\[ R = \frac{\text{number of matching skip-bigrams}}{\text{total number of skip-bigrams in reference}} \tag{7} \]
\[ P = \frac{\text{number of matching skip-bigrams}}{\text{total number of skip-bigrams in system}} \tag{8} \]

Figure: ROUGE-L, ROUGE-S example
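For intuition, below is a minimal Python sketch of ROUGE-N recall and the LCS-based ROUGE-L exactly as defined in equations (3) and (5). This is not the official ROUGE package, and the function names are illustrative:

```python
from collections import Counter

def rouge_n_recall(candidate, reference, n=2):
    """ROUGE-N: fraction of reference n-grams that also occur in the candidate."""
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    matches = sum(min(count, cand[gram]) for gram, count in ref.items())
    return matches / max(sum(ref.values()), 1)

def lcs_length(x, y):
    """Length of the longest common subsequence, via dynamic programming."""
    table = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, xi in enumerate(x, 1):
        for j, yj in enumerate(y, 1):
            table[i][j] = table[i - 1][j - 1] + 1 if xi == yj else max(table[i - 1][j], table[i][j - 1])
    return table[len(x)][len(y)]

def rouge_l(candidate, reference):
    """ROUGE-L as in Eq. (5): LCS length divided by the longer of the two lengths."""
    return lcs_length(candidate, reference) / max(len(candidate), len(reference), 1)

cand = "the cat was found under the bed".split()
ref = "the cat was under the bed".split()
print(rouge_n_recall(cand, ref, n=2), rouge_l(cand, ref))  # 0.8  0.857...
```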


Different Metrics Read So Far VI

PARAEVAL: The ParaEval metric [3] uses a large collection of paraphrases, automatically extracted from parallel corpora, to evaluate MT performance.

Figure: A detailed look at the scores assigned by lexical and paraphrase/synonym comparisons



Different Metrics Read So Far VII
METEOR (Metric for Evaluation of Translation with Explicit ORdering): Given a pair of strings to compare (a system translation
and a reference translation), METEOR (Banerjee and Lavie, 2005)
first creates a word alignment between the two strings. Based on the
number of word or unigram matches and the amount of string
fragmentation represented by the alignment, METEOR calculates a
score for the pair of strings. The METEOR score is computed as:

\[ \text{METEOR} = (1 - \text{Penalty}) \times F_{mean} \tag{9} \]

Where the harmonic mean $F_{mean}$ is given by:

\[ F_{mean} = \frac{P \times R}{\alpha P + (1 - \alpha) R} \tag{10} \]

The penalty is:

\[ \text{Penalty} = \gamma \left( \frac{\text{chunks}}{m} \right)^{\beta} \tag{11} \]
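Putting equations (9)-(11) together, here is a rough Python sketch of the METEOR score composition using exact unigram matches only; real METEOR also matches stems and synonyms and chooses an optimal alignment, so this is only an approximation, and the default-looking parameter values (alpha = 0.9, beta = 3, gamma = 0.5) are assumptions for illustration:

```python
def meteor_sketch(candidate, reference, alpha=0.9, beta=3.0, gamma=0.5):
    """METEOR-style score for two token lists, exact surface matches only."""
    # Greedy one-to-one alignment on exact matches (a stand-in for METEOR's alignment).
    used = [False] * len(reference)
    alignment = []                              # (candidate index, reference index) pairs
    for i, tok in enumerate(candidate):
        for j, ref_tok in enumerate(reference):
            if not used[j] and tok == ref_tok:
                used[j] = True
                alignment.append((i, j))
                break

    m = len(alignment)                          # number of matched unigrams
    if m == 0:
        return 0.0
    precision, recall = m / len(candidate), m / len(reference)
    f_mean = precision * recall / (alpha * precision + (1 - alpha) * recall)   # Eq. (10)

    # Chunks: maximal runs of matches that are contiguous in both strings.
    chunks = 1
    for (ci, ri), (cj, rj) in zip(alignment, alignment[1:]):
        if cj != ci + 1 or rj != ri + 1:
            chunks += 1

    penalty = gamma * (chunks / m) ** beta      # Eq. (11)
    return (1 - penalty) * f_mean               # Eq. (9)

print(meteor_sketch("the cat sat on the mat".split(),
                    "the cat is on the mat".split()))
```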
Different Metrics Read So Far VIII

MAXSIM: Maximum Similarity [1], which is based on precision and recall, allows for synonyms, and weights the matches found. The similarity score sim-score for the entire system corpus is given by:

\[ \text{sim-score} = \frac{1}{|S|} \sum_{s=1}^{|S|} \left( \frac{1}{N} \sum_{n=1}^{N} F_{mean_{s,n}} \right) \tag{12} \]

Where:
|S| is the number of sentence pairs.
N is set to 3, representing the calculation of unigram, bigram, and trigram scores.
$F_{mean_{s,n}}$ is the harmonic mean of precision and recall for sentence pair $s$ and n-gram order $n$.
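As a toy illustration of equation (12), the sketch below averages per-sentence F-means over n-gram orders 1 to 3 and over the corpus. Unlike real MAXSIM, it uses exact n-gram overlap instead of a maximum-weight bipartite matching with synonym and similarity weights, so it only mirrors the aggregation step; all names are illustrative:

```python
from collections import Counter

def ngram_fmean(candidate, reference, n):
    """Harmonic mean of exact n-gram precision and recall (stand-in for F_mean_{s,n})."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    matches = sum(min(count, ref[gram]) for gram, count in cand.items())
    p = matches / max(sum(cand.values()), 1)
    r = matches / max(sum(ref.values()), 1)
    return 2 * p * r / (p + r) if (p + r) else 0.0

def sim_score(pairs, max_n=3):
    """Corpus-level score of Eq. (12): mean F-mean over sentence pairs and n-gram orders."""
    per_sentence = [sum(ngram_fmean(cand, ref, n) for n in range(1, max_n + 1)) / max_n
                    for cand, ref in pairs]
    return sum(per_sentence) / len(per_sentence)

pairs = [("the cat sat on the mat".split(), "the cat is on the mat".split()),
         ("paris is the capital of france".split(), "the capital of france is paris".split())]
print(sim_score(pairs))
```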
Limitations in Metrics

BLEU: It is largely based on exact word matching, so it might not capture semantic nuances effectively, especially when answers are factually accurate but phrased differently.
ROUGE: It provides separate scores such as precision, recall, and F1 measure, which can give a more detailed evaluation than BLEU, but it still relies on surface overlap with the references.
ParaEval: Perhaps a lesser-known metric; its specific design and objectives need to be considered before applying it to this task.



Metrics That Correlate Most for LLM Evaluation

Given the nature of this work, which is to evaluate the factual accuracy of LLM outputs:
METEOR: Designed to consider synonymy, stemming, and word order, METEOR is a strong candidate. It allows us to evaluate answers that might be phrased differently but still retain the same factual content.
MAXSIM: Due to its emphasis on semantic understanding, MaxSim is another recommended metric. For fact-based questions, the semantic content of the answer (i.e., its meaning) is often more crucial than the exact phrasing.

Timeline
Phase 1 (Midterm): The main goal for Phase 1 (midterm) was to learn about the metrics in use for different LLM evaluation tasks and identify the best two of them. The second part has just started and is still in progress.
Phase 1 (End Term): The main goal for Phase 1 (end term) is to go through different language models, implement the papers studied, and understand how these metrics have been applied in previous work in different fields. It is also beneficial to use a combination of metrics to get a comprehensive view; for the best results, we will run preliminary tests using all the metrics and check which two correlate best with human judgments for our specific dataset and task (see the sketch below).
Phase 2: The main goal for Phase 2 is the full evaluation of different language models on suitable datasets, presenting results based on the performance of the different LLMs and reporting which language model works best for these queries.
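To make the planned preliminary tests concrete, here is a minimal sketch of how metric scores could be compared against human judgments using Pearson and Spearman correlation from scipy. The scores and ratings below are hypothetical placeholder values, not project results:

```python
from scipy.stats import pearsonr, spearmanr

# Hypothetical per-answer scores: each metric scores the same set of LLM answers,
# and humans rate the same answers for factual accuracy (e.g. on a 1-5 scale).
metric_scores = {
    "BLEU":    [0.12, 0.30, 0.45, 0.20, 0.60],
    "ROUGE-L": [0.35, 0.50, 0.62, 0.41, 0.77],
    "METEOR":  [0.40, 0.55, 0.70, 0.52, 0.81],
}
human_ratings = [2, 3, 4, 3, 5]

for name, scores in metric_scores.items():
    pearson, _ = pearsonr(scores, human_ratings)     # linear correlation
    spearman, _ = spearmanr(scores, human_ratings)   # rank correlation
    print(f"{name}: Pearson={pearson:.2f}, Spearman={spearman:.2f}")
```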
Conclusion

Previous Work: While many metrics are available, there may not be clear guidance on which metrics are best suited for evaluating the factual accuracy of LLM outputs. For users looking for the best language models to rely on, comparing them is an important phase.
My Project: By highlighting which LLM performs best in answering fact-based questions, we offer a shortcut for users looking to employ language models in knowledge-based systems, ensuring they get the best results without conducting their own evaluations.



References I

Yee Seng Chan and Hwee Tou Ng.
MaxSim: A maximum similarity metric for machine translation evaluation.
In Proceedings of ACL-08: HLT, pages 55–62, 2008.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu.
BLEU: a method for automatic evaluation of machine translation.
In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, pages 311–318, 2002.
Liang Zhou, Chin-Yew Lin, Dragos Stefan Munteanu, and Eduard Hovy.
ParaEval: Using paraphrases to evaluate summaries automatically.
In Proceedings of the Human Language Technology Conference of the NAACL, Main Conference, pages 447–454, 2006.



References II

Chin-Yew Lin.
ROUGE: A package for automatic evaluation of summaries.
In Text Summarization Branches Out, Barcelona, Spain, Association for Computational Linguistics, pages 74–81, 2004.
Chin-Yew Lin and Franz Josef Och.
ORANGE: a method for evaluating automatic evaluation metrics for machine translation.
In Proceedings of the 20th International Conference on Computational Linguistics, 2004.



Thank you
