Evaluation of LLM Using Automatic Metrics
CS22M104 Beena
(under the guidance of Dr Srinivas Padmanabhuni)
Problem Statement
Evaluating Large Language Model (LLM) Outputs Using Automatic Metrics in
Comparison to Human References for General Fact-based Questions.
The motivation for this project can be distilled into several key points:
Proliferation of Language Models in Digital Ecosystems.
Standardizing Evaluation Mechanisms.
Enhancing User Trust and Dependability on Models.
Bridging the Gap Between Automatic and Human Evaluations.
Enabling Efficient and Scalable Model Selection.
Recall: Out of all the actual positive items, how many did we correctly
identify as positive?
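For concreteness, recall can be written in the standard form below (this is the usual definition, not a formula taken from the slides):
\[
\text{Recall} = \frac{TP}{TP + FN}
\]
For example, if 10 facts in the reference answer should be recovered and the model's answer contains 8 of them, recall is $8/10 = 0.8$.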
Where:
BP is the brevity penalty.
N is the maximum order of n-grams considered.
$w_n$ are the weights for each n-gram (typically, when $N = 4$,
$w_1 = w_2 = w_3 = w_4 = 0.25$).
$p_n$ is the precision for n-grams of order $n$.
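As a minimal sketch of how these quantities combine in practice, assuming the standard formulation $\mathrm{BLEU} = BP \cdot \exp\!\left(\sum_{n=1}^{N} w_n \log p_n\right)$ and using NLTK's sentence_bleu (the reference and candidate sentences are hypothetical examples, not project data):

# Illustrative only: computing BLEU with NLTK's sentence_bleu.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["the", "eiffel", "tower", "is", "in", "paris"]]           # human reference (tokenized)
candidate = ["the", "eiffel", "tower", "is", "located", "in", "paris"]  # LLM output (tokenized)

# Uniform weights w_1 = ... = w_4 = 0.25 as above; smoothing keeps short answers
# from collapsing to zero when a higher-order n-gram has no match.
score = sentence_bleu(reference, candidate,
                      weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=SmoothingFunction().method1)
print(round(score, 3))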
\[
\text{sim-score} = \frac{1}{|S|} \sum_{s=1}^{|S|} \left( \frac{1}{N} \sum_{n=1}^{N} F_{\text{mean}_{s,n}} \right) \qquad (12)
\]
Where:
|S| is the number of sentence pairs.
N is set to 3, representing the calculation of unigram, bigram, and
trigram scores.
$F_{\text{mean}_{s,n}}$ is the harmonic mean of precision and recall for sentence pair $s$
and n-gram order $n$.
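A minimal sketch of how equation (12) could be computed, assuming $F_{\text{mean}_{s,n}}$ is the balanced harmonic mean of n-gram precision and recall; the helper functions and the example sentence pair below are hypothetical:

from collections import Counter

def ngrams(tokens, n):
    # All contiguous n-grams of the token list, with counts.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def f_mean(candidate, reference, n):
    # Harmonic mean of n-gram precision and recall for one sentence pair.
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def sim_score(pairs, N=3):
    # Average over sentence pairs of the mean F over n-gram orders 1..N (eq. 12).
    return sum(
        sum(f_mean(c, r, n) for n in range(1, N + 1)) / N
        for c, r in pairs
    ) / len(pairs)

# Hypothetical candidate/reference pair for illustration.
pairs = [(["water", "boils", "at", "100", "degrees"],
          ["water", "boils", "at", "100", "degrees", "celsius"])]
print(round(sim_score(pairs), 3))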
Limitations in Metrics
Given the nature of our work, which is to evaluate the factual accuracy of
LLM outputs:
METEOR: Designed to consider synonymy, stemming, and word order,
METEOR is a strong candidate. It allows us to evaluate answers that may be
phrased differently but still retain the same factual content (see the sketch
after this list).
MaxSim: Due to its emphasis on semantic understanding, MaxSim is another
recommended metric. For fact-based questions, the semantic content of the
answer (i.e., its meaning) is often more crucial than the exact phrasing.
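As an illustration of how METEOR could be applied in this setting, using NLTK's implementation (the sentences are hypothetical; depending on the NLTK version the inputs must be pre-tokenized and the WordNet data downloaded first):

# Illustrative only: METEOR scoring with NLTK.
import nltk
from nltk.translate.meteor_score import meteor_score

nltk.download("wordnet", quiet=True)

reference = "the capital of france is paris".split()        # human reference
hypothesis = "paris is the capital city of france".split()  # hypothetical LLM output

# METEOR aligns exact matches, stems, and WordNet synonyms, then penalises
# fragmented word order, so a correct but re-phrased answer can still score well.
print(round(meteor_score([reference], hypothesis), 3))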
Previous Work: While many metrics are available, there is little clear
guidance on which are best suited for evaluating the factual accuracy of
LLM outputs, even though comparing language models is an important step
for users looking for the most reliable one.
My Project: By highlighting which LLM performs best in answering
fact-based questions, we offer a shortcut for users looking to employ
language models in knowledge-based systems, ensuring they get the
best results without conducting their own evaluations.