Benchmarking Large Language Models With A Unified Performance Ranking Metric
Maikel Leon
Department of Business Technology, Miami Herbert Business School,
University of Miami, Florida, USA
Abstract. The rapid advancements in Large Language Models (LLMs), such as OpenAI’s GPT, Meta’s LLaMA, and Google’s PaLM, have revolutionized natural language processing and various AI-driven applications. Despite their transformative impact, the absence of a standardized metric for comparing these models poses a significant challenge for researchers and practitioners. This paper addresses the urgent need for a comprehensive evaluation framework by proposing a novel performance ranking metric. Our metric integrates both
qualitative and quantitative assessments to provide a holistic comparison of LLM capabilities. Through
rigorous benchmarking, we analyze the strengths and limitations of leading LLMs, offering valuable insights
into their relative performance. This study aims to facilitate informed decision-making in model selection
and promote advances in developing more robust and efficient language models.
1 Introduction
Artificial intelligence (AI) has evolved significantly over the past several decades, rev-
olutionizing various industries and transforming how we interact with technology. The
journey from early AI systems to modern LLMs is marked by machine learning (ML)
and deep learning advancements. Initially, AI focused on rule-based systems and symbolic
reasoning, which laid the groundwork for more sophisticated approaches [1]. The advent
of ML introduced data-driven techniques that enabled systems to learn and improve from
experience. Deep learning further accelerated this paradigm shift by leveraging neural
networks to model complex patterns and achieve unprecedented performance levels in
tasks such as image and speech recognition. The development of LLMs, such as GPT-3
and beyond, represents the latest frontier in this evolution, harnessing vast amounts of
data and computational power to generate human-like text and perform a wide array of
language-related tasks. This paper explores the progression from traditional AI to ML,
deep learning, and the emergence of LLMs, highlighting key milestones, technological ad-
vancements, and their implications for the future of AI.
LLMs have emerged as transformative tools in Natural Language Processing (NLP),
demonstrating unparalleled capabilities in understanding and generating human language.
Models such as OpenAI’s GPT, Meta’s LLaMA, and Google’s PaLM have set new bench-
marks in tasks ranging from text completion to sentiment analysis. These advancements
have expanded the horizons of what is possible with AI and underscored the critical need
for robust evaluation frameworks that can comprehensively assess and compare the effec-
tiveness of these models. LLMs represent a culmination of advancements in deep learning,
leveraging vast amounts of data and computational power to achieve remarkable linguistic
capabilities [2]. Each iteration, from GPT-3 with its 175 billion parameters to the latest GPT-4, has pushed the boundaries of language understanding and generation. Meta’s
LLaMA, optimized for efficiency with 65 billion parameters, excels in multilingual applica-
tions, while Google’s PaLM, with its 540 billion parameters, tackles complex multitasking
scenarios [3].
DOI: 10.5121/ijfcst.2024.14302
International Journal on Foundations of Computer Science & Technology (IJFCST) Vol.4, No.4, July 2024
The following are some key advancements:
– GPT Series: Known for its versatility in generating coherent text across various
domains.
– LLaMA: Notable for its efficiency and performance in real-time applications and mul-
tilingual contexts.
– PaLM: Designed to handle complex question-answering and multitasking challenges
with high accuracy.
These models have revolutionized industries such as healthcare, finance, and education, enhancing customer interactions, automating tasks, and enabling personalized learning experiences [4]. Despite these advancements, the evaluation of LLMs remains fragmented and
lacks a unified methodology. Current evaluation metrics often focus on specific aspects of
model performance, such as perplexity scores or accuracy rates in predefined tasks. How-
ever, these metrics do not provide a comprehensive view of overall model effectiveness,
leading to challenges in comparing different models directly.
These limitations, narrow task-specific metrics and the resulting lack of direct comparability across models, underscore the need for a standardized evaluation framework that integrates qualitative assessments with quantitative benchmarks. To address these challenges,
this paper proposes a novel performance ranking metric to assess LLM capabilities compre-
hensively. Our approach integrates qualitative insights, such as model interpretability and
coherence in generated text, with quantitative metrics, including computational efficiency
and performance across standardized NLP benchmarks. By synthesizing these dimensions,
our metric offers a holistic perspective on LLM performance that facilitates meaningful
comparisons and supports informed decision-making in model selection [5].
The following are the objectives of the study:
– Develop a standardized evaluation framework for LLMs that captures qualitative and
quantitative aspects.
– Conduct a comparative analysis of leading models (GPT-4, LLaMA, PaLM) to high-
light strengths and limitations.
– Propose guidelines for selecting the most suitable LLM for specific NLP applications
based on comprehensive evaluation criteria.
AI encompasses diverse methodologies and approaches tailored for specific tasks and ap-
plications. The distinction between regular AI and Generative AI, such as Large Language
Models (LLMs), lies in their fundamental approach to data processing and task execution:
– Regular AI (Symbolic AI): Traditional AI models rely on explicit programming
and predefined rules to process structured data and execute tasks. They excel in tasks
with clear rules and well-defined inputs and outputs, such as rule-based systems in
chess-playing or automated decision-making processes [7].
– Generative AI (LLMs): Generative AI, exemplified by LLMs, operates differently
by learning from vast amounts of unstructured data to generate outputs. These models
use deep learning techniques to understand and produce human-like text, exhibiting
creativity and adaptability in language tasks.
Generative AI represents a paradigm shift in AI and Natural Language Processing
(NLP), enabling machines to perform tasks that require understanding and generation
of natural language in a way that closely mimics human capabilities. Particularly, LLMs
have demonstrated remarkable capabilities across various applications:
– Text Generation: LLMs like OpenAI’s GPT series can generate coherent and con-
textually relevant text, from short sentences to entire articles, based on prompts or
input text.
– Translation: Models such as Google’s T5 have shown effective translation capabilities,
converting text between multiple languages with high accuracy and fluency.
– Question Answering: LLMs are proficient in answering natural language questions
based on their understanding of context and information retrieval from large datasets.
– Creative Writing: Some LLMs have been trained to generate creative content such
as poems, stories, and even music compositions, showcasing their versatility and creativity.
– Chatbots and Virtual Assistants: AI-powered chatbots and virtual assistants lever-
age LLMs to engage in natural conversations, provide customer support, and perform
tasks such as scheduling appointments or making reservations.
These examples illustrate how Generative AI, specifically LLMs, extends beyond tra-
ditional AI applications by enabling machines to understand and generate human-like
text with contextually appropriate responses and creative outputs [8]. LLMs are a promi-
nent example of Generative AI, distinguished by their ability to process and generate
human-like text based on vast amounts of data. These models, particularly those based
on Transformer architectures, have revolutionized NLP by:
– Scale: LLMs are trained on massive datasets comprising billions of words or sentences
from diverse sources such as books, articles, and websites.
– Contextual Understanding: They exhibit a strong capability to understand and
generate text in context, allowing them to produce coherent and contextually relevant
responses.
– Generativity: LLMs can generate human-like text, including completing sentences,
answering questions, and producing creative content such as poems or stories.
– Transfer Learning: They benefit from transfer learning, where models pre-trained on
large datasets can be fine-tuned on specific tasks with smaller, task-specific datasets.
LLMs exemplify the power of Generative AI in harnessing deep learning to achieve
remarkable capabilities in understanding and generating natural language. Their ability to generate text that is indistinguishable from human-written content marks a significant advancement in AI research and applications. LLMs leverage advanced machine learning
techniques, primarily deep learning architectures, to achieve their impressive capabilities
in NLP. These models are typically based on Transformer architectures, which have be-
come the cornerstone of modern NLP tasks due to their ability to process sequential data
efficiently.
The Transformer architecture, introduced by Vaswani et al. (2017), revolutionized NLP
by replacing recurrent neural networks (RNNs) and convolutional neural networks (CNNs)
with a self-attention mechanism [9]. Key components of the Transformer include:
– Self-Attention Mechanism: The model can weigh the significance of different words
in a sentence, capturing long-range dependencies efficiently.
– Multi-head Attention: Enhances the model’s ability to focus on different positions
and learn diverse input representations.
– Feedforward Neural Networks: Process the outputs of the attention mechanism to
generate context-aware representations [10].
– Layer Normalization and Residual Connections: Aid in stabilizing training and
facilitating the flow of gradients through deep networks.
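To make the mechanism concrete, the scaled dot-product attention described above can be sketched in a few lines of NumPy; the function name and toy dimensions are illustrative and not drawn from any particular model.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Core self-attention: weight each position's value vector by the
    softmax-normalized similarity between its query and all keys."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # pairwise similarity scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V, weights

# Toy example: a sequence of 3 tokens with embedding dimension 4
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))
out, attn = scaled_dot_product_attention(x, x, x)  # self-attention: Q = K = V
print(attn.sum(axis=-1))  # each row of attention weights sums to 1
```

Multi-head attention runs several such computations in parallel over learned projections of Q, K, and V and concatenates the results.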
LLMs employ Transformer-based architectures with more layers, parameters, and com-
putational resources to handle larger datasets and achieve state-of-the-art performance in
various NLP tasks. Training LLMs involves several stages and techniques to optimize
performance and efficiency:
– Pre-training: Initial training on large-scale datasets (e.g., books, articles, web text)
to learn general language patterns and representations. Models like GPT-3 are pre-
trained on massive corpora to capture broad linguistic knowledge [11].
– Fine-tuning: Further training on task-specific datasets (e.g., question answering, text
completion) to adapt the model’s parameters to specific applications. Fine-tuning en-
hances model performance and ensures applicability to real-world tasks.
– Regularization Techniques: Methods such as dropout and weight decay prevent
overfitting and improve generalization capabilities, which are crucial for robust perfor-
mance across different datasets.
– Tokenizers: Convert raw text into tokens (words, subwords) suitable for model input.
Tokenization methods vary, with models like BERT using WordPiece and Byte-Pair
Encoding (BPE) to effectively handle rare words and subword units.
– Embeddings: Represent words or tokens as dense vectors in a continuous vector space.
Embeddings capture semantic relationships and contextual information, enhancing the
model’s ability to understand and generate coherent text.
– Attention Matrices: Store attention weights computed during self-attention opera-
tions. These matrices enable the model to effectively focus on relevant parts of input
sequences and learn contextual dependencies.
– Cached Computations: Optimize inference speed by caching intermediate compu-
tations during attention and feedforward operations, reducing redundant calculations
and improving efficiency [12].
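A minimal sketch of the tokenizer-to-embedding path described above, assuming a hypothetical five-entry vocabulary and an 8-dimensional embedding table; production models instead learn subword vocabularies (WordPiece, BPE) with tens of thousands of entries.

```python
import numpy as np

# Hypothetical toy vocabulary; real models learn subword vocabularies
vocab = {"<unk>": 0, "language": 1, "models": 2, "generate": 3, "text": 4}

def tokenize(text):
    """Naive whitespace tokenizer mapping words to integer token ids."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]

# Embedding table: one dense vector per vocabulary entry
dim = 8
embeddings = np.random.default_rng(1).normal(size=(len(vocab), dim))

ids = tokenize("Language models generate text")
vectors = embeddings[ids]     # token ids -> dense context-free vectors
print(ids)                    # [1, 2, 3, 4]
print(vectors.shape)          # (4, 8)
```

The Transformer layers then refine these context-free vectors into the context-aware representations discussed above.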
These data structures play a critical role in LLMs’ performance and scalability, en-
abling them to handle large-scale datasets and achieve state-of-the-art results in various
NLP benchmarks. Integrating advanced machine learning techniques, such as Transformer
architectures and sophisticated data structures, is fundamental to the development and success of Large Language Models (LLMs). These models represent a significant advancement in
natural language processing, enabling machines to understand and generate human-like
text with unprecedented accuracy and complexity. By leveraging scalable architectures and
efficient data handling mechanisms, LLMs continue to push the boundaries of AI research
and application, paving the way for transformative innovations in language understanding
and generation [13].
LLMs have undergone a remarkable evolution over the past decades, driven by advance-
ments in deep learning, computational resources, and the availability of large-scale datasets.
This section provides a comprehensive overview of the evolution of LLMs from their early
conception to their current capabilities, highlighting key milestones and technological
breakthroughs that have shaped their development. The concept of LLMs emerged from
early efforts in statistical language modeling and neural networks, aiming to improve the
understanding and generation of human language. Traditional approaches such as n-gram
models and Hidden Markov Models (HMMs) provided foundational insights into language
patterns but were limited in capturing semantic nuances and context. The shift towards
neural network-based approaches in the early 2000s marked a significant milestone, laying
the groundwork for more sophisticated language models capable of learning hierarchical
representations of text.
Key milestones in this trajectory include the shift from n-gram and HMM-based approaches to neural language models, the introduction of the Transformer architecture, and the scaling of pre-trained models such as GPT-3, LLaMA, and PaLM into the tens and hundreds of billions of parameters.
Fig. 1. Comparison of LLMs Across Key Parameters: Model Size, Multilingual Support, Training Data,
Text Generation Capabilities, and Ease of Integration. The Y-axis represents relative proportion.
The evaluation criteria are summarized below:
– Quantitative Metrics: Performance on standardized NLP benchmarks (e.g., accuracy, F1 score) across diverse tasks.
– Computational Efficiency: Evaluation of model inference speed, memory footprint, and energy efficiency.
– Robustness and Generalization: Assessment of model performance under varying conditions and ability to generalize.
1. Accuracy (ACC):
– Definition: Measures the factual and grammatical correctness of the responses.
– Methodology: Compare LLM outputs against a curated dataset of questions and
expert answers.
– Calculation: Percentage of correct answers (factually and grammatically) over the
total number of responses.
2. Contextual Understanding (CON):
– Definition: Assesses the model’s ability to understand and integrate context from
the conversation or document history.
– Methodology: Use context-heavy dialogue or document samples to test if the
LLM maintains topic relevance and effectively utilizes the provided historical in-
formation.
– Calculation: Scoring responses for relevance and context integration on a scale
from 0 (no context used) to 5 (excellent use of context).
3. Coherence (COH):
– Definition: Evaluates how logically connected and structurally sound the responses
are.
– Methodology: Analysis of response sequences to ensure logical flow and connec-
tion of ideas.
– Calculation: Human or automated scoring of response sequences on a scale from
0 (incoherent) to 5 (highly coherent).
4. Fluency (FLU):
– Definition: Measures the linguistic smoothness and readability of the text.
– Methodology: Responses are analyzed for natural language use, grammatical cor-
rectness, and stylistic fluency.
– Calculation: Rate responses on a scale from 0 (not fluent) to 5 (very fluent).
5. Resource Efficiency (EFF):
– Definition: Assesses the computational resources (like time and memory) used by
the LLM for tasks.
– Methodology: Measure the average time and system resources consumed for gen-
erating responses.
– Calculation: Efficiency score calculated by
EFF = 1 / (Time Taken (seconds) + Memory Used (MB) / 100)
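Of the five metrics, ACC and EFF lend themselves to direct computation; the sketch below assumes exact-match grading as a stand-in for the expert factual and grammatical judgment behind ACC, and uses hypothetical timing and memory figures for EFF.

```python
def accuracy(model_answers, reference_answers):
    """ACC: percentage of responses matching the curated expert answers."""
    correct = sum(m == r for m, r in zip(model_answers, reference_answers))
    return 100.0 * correct / len(reference_answers)

def eff_score(time_s, memory_mb):
    """EFF = 1 / (time in seconds + memory in MB / 100); higher is better."""
    return 1.0 / (time_s + memory_mb / 100.0)

# Hypothetical evaluation run: 4 of 5 answers correct; 1.5 s and 120 MB per response
acc = accuracy(["Paris", "H2O", "1945", "O(n log n)", "Mars"],
               ["Paris", "H2O", "1945", "O(n log n)", "Venus"])
eff = eff_score(1.5, 120.0)
print(acc)             # 80.0
print(round(eff, 3))   # 0.37
```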
The CLMPI score would be an aggregate, weighted sum of the individual metrics:
CLMPI = (w1 × ACC) + (w2 × CON) + (w3 × COH) + (w4 × FLU) + (w5 × EFF)
where wi are the weights assigned to each metric based on the priority of aspects. These
weights are determined based on the specific needs and usage context of the LLM.
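The aggregation itself reduces to a weighted sum; the weights below are one illustrative assignment, chosen here for an accuracy-leaning profile rather than prescribed by the metric.

```python
def clmpi(acc, con, coh, flu, eff,
          weights=(0.25, 0.20, 0.20, 0.20, 0.15)):
    """CLMPI as the weighted sum w1*ACC + w2*CON + w3*COH + w4*FLU + w5*EFF."""
    return sum(w * s for w, s in zip(weights, (acc, con, coh, flu, eff)))

# Hypothetical normalized scores for a single model
score = clmpi(acc=0.85, con=4.2, coh=4.1, flu=4.4, eff=0.37)
```

A deployment targeting real-time use would shift weight from ACC toward EFF; the function signature stays the same.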
Imagine we are evaluating LLMs designed for academic research assistance, each scored on the five metrics above (for instance, ACC = 85% for one candidate). With w1 = 0.25, w2 = w3 = w4 = 0.20, and w5 = 0.15, and the aggregate scaled by a factor of 25 toward a 0–100 range, two of the candidate models, LLM-B and LLM-C, score:
CLMPI-B = [(0.82 × 0.25) + (4.0 × 0.20) + (4.0 × 0.20) + (4.3 × 0.20) + (0.30 × 0.15)] × 25
CLMPI-C = [(0.88 × 0.25) + (4.8 × 0.20) + (4.5 × 0.20) + (4.7 × 0.20) + (0.45 × 0.15)] × 25
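Evaluating the two expressions directly (reading the trailing × 25 as a factor applied to the whole weighted sum) confirms the ranking:

```python
weights = (0.25, 0.20, 0.20, 0.20, 0.15)

def clmpi_scaled(scores, scale=25):
    """Weighted sum of the (ACC, CON, COH, FLU, EFF) scores, scaled by 25."""
    return scale * sum(w * s for w, s in zip(weights, scores))

clmpi_b = clmpi_scaled((0.82, 4.0, 4.0, 4.3, 0.30))
clmpi_c = clmpi_scaled((0.88, 4.8, 4.5, 4.7, 0.45))
print(round(clmpi_b, 2), round(clmpi_c, 2))  # 67.75 77.19
```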
LLM-C outperforms LLM-A and LLM-B across all metrics, notably in resource effi-
ciency and contextual understanding, which are critical for performance in dynamic and
resource-constrained environments. Such a side-by-side comparison illustrates how different models
can be evaluated against important characteristics, providing insight into their strengths
and weaknesses. Using a weighted metric system (CLMPI) allows for balanced considera-
tion of various aspects crucial for the practical deployment of LLMs.
The rapid advancement of LLMs has transformed NLP, offering unprecedented capabilities
in tasks such as text generation, translation, and sentiment analysis. Models like OpenAI’s
GPT series, Meta’s LLaMA, and Google’s PaLM have demonstrated remarkable profi-
ciency in understanding and generating human language, paving the way for applications
across diverse domains. However, the absence of a standardized framework for comparing
LLMs poses significant challenges in evaluating their performance comprehensively. The
landscape lacks a unified index integrating qualitative insights and quantitative metrics to
assess LLMs across various dimensions. Evaluation methodologies often focus on specific
tasks or datasets, resulting in fragmented assessments that do not provide a holistic view
of model capabilities. This fragmentation hinders researchers, developers, and industry
stakeholders from making informed decisions regarding model selection and deployment.
Addressing these challenges requires the development of a robust evaluation frame-
work that considers factors such as model accuracy, computational efficiency, and robust-
ness across different domains and languages. Such a framework would facilitate meaning-
ful comparisons between LLMs, enabling researchers to identify each model’s strengths,
weaknesses, and optimal use cases. The need to compare LLMs accurately is paramount
for advancing NLP and maximizing the potential of AI-driven technologies in real-world
applications.
By establishing a standardized evaluation framework, stakeholders in academia, industry, and policy-making can benefit in several ways: more informed model selection, reproducible and meaningful comparisons across models, and clearer targets for developing more robust and efficient LLMs.
References
1. G. Nápoles, Y. Salgueiro, I. Grau, and M. Leon, “Recurrence-aware long-term cognitive network for
explainable pattern classification,” IEEE Transactions on Cybernetics, vol. 53, no. 10, pp. 6083–6094,
2023.
2. A. Upadhyay, E. Farahmand, I. Muñoz, M. Akber Khan, and N. Witte, “Influence of llms on learning and teaching in higher education,” SSRN Electronic Journal, 2024.
3. J. Yang, H. Jin, R. Tang, X. Han, Q. Feng, H. Jiang, S. Zhong, B. Yin, and X. Hu, “Harnessing
the power of llms in practice: A survey on chatgpt and beyond,” ACM Trans. Knowl. Discov. Data,
vol. 18, Apr. 2024.
4. M. Leon, “Business technology and innovation through problem-based learning,” in Canada Interna-
tional Conference on Education (CICE-2023) and World Congress on Education (WCE-2023), CICE-
2023, Infonomics Society, July 2023.
5. N. Capodieci, C. Sanchez-Adames, J. Harris, and U. Tatar, “The impact of generative ai and llms
on the cybersecurity profession,” in 2024 Systems and Information Engineering Design Symposium
(SIEDS), pp. 448–453, 2024.
6. G. Nápoles, J. L. Salmeron, W. Froelich, R. Falcon, M. Leon, F. Vanhoenshoven, R. Bello, and K. Van-
hoof, Fuzzy Cognitive Modeling: Theoretical and Practical Considerations, p. 77–87. Springer Singa-
pore, July 2019.
7. G. Nápoles, M. Leon, I. Grau, and K. Vanhoof, “FCM expert: Software tool for scenario analysis and
pattern classification based on fuzzy cognitive maps,” International Journal on Artificial Intelligence
Tools, vol. 27, no. 07, p. 1860010, 2018.
8. A. R. Asadi, “Llms in design thinking: Autoethnographic insights and design implications,” in Proceed-
ings of the 2023 5th World Symposium on Software Engineering, WSSE ’23, (New York, NY, USA),
p. 55–60, Association for Computing Machinery, 2023.
9. E. Struble, M. Leon, and E. Skordilis, “Intelligent prevention of ddos attacks using reinforcement
learning and smart contracts,” The International FLAIRS Conference Proceedings, vol. 37, May 2024.
10. G. Nápoles, M. L. Espinosa, I. Grau, K. Vanhoof, and R. Bello, Fuzzy cognitive maps based models for
pattern classification: Advances and challenges, vol. 360, pp. 83–98. Springer Verlag, 2018.
11. R. D. Pesl, M. Stötzner, I. Georgievski, and M. Aiello, “Uncovering llms for service-composition:
Challenges and opportunities,” in Service-Oriented Computing – ICSOC 2023 Workshops (F. Monti,
P. Plebani, N. Moha, H.-y. Paik, J. Barzen, G. Ramachandran, D. Bianchini, D. A. Tamburri, and
M. Mecella, eds.), (Singapore), pp. 39–48, Springer Nature Singapore, 2024.
12. M. Leon, L. Mkrtchyan, B. Depaire, D. Ruan, and K. Vanhoof, “Learning and clustering of fuzzy
cognitive maps for travel behaviour analysis,” Knowledge and Information Systems, vol. 39, no. 2,
pp. 435–462, 2013.
13. T. Han, L. C. Adams, K. Bressem, F. Busch, L. Huck, S. Nebelung, and D. Truhn, “Comparative
analysis of gpt-4vision, gpt-4 and open source llms in clinical diagnostic accuracy: A benchmark against
human expertise,” medRxiv, 2023.
14. M. Leon, “Aggregating procedure for fuzzy cognitive maps,” The International FLAIRS Conference
Proceedings, vol. 36, no. 1, 2023.
15. M. Leon, N. Martinez, Z. Garcia, and R. Bello, “Concept maps combined with case-based reasoning
in order to elaborate intelligent teaching/learning systems,” in Seventh International Conference on
Intelligent Systems Design and Applications (ISDA 2007), pp. 205–210, 2007.
16. N. R. Rydzewski, D. Dinakaran, S. G. Zhao, E. Ruppin, B. Turkbey, D. E. Citrin, and K. R. Patel,
“Comparative evaluation of llms in clinical oncology,” NEJM AI, vol. 1, Apr. 2024.
17. H. DeSimone and M. Leon, “Explainable ai: The quest for transparency in business and beyond,” in
2024 7th International Conference on Information and Computer Technologies (ICICT), IEEE, Mar.
2024.
18. M. Leon, L. Mkrtchyan, B. Depaire, D. Ruan, R. Bello, and K. Vanhoof, “Learning method inspired on
swarm intelligence for fuzzy cognitive maps: Travel behaviour modelling,” in Artificial Neural Networks
and Machine Learning – ICANN 2012 (A. E. P. Villa, W. Duch, P. Érdi, F. Masulli, and G. Palm,
eds.), (Berlin, Heidelberg), pp. 718–725, Springer Berlin Heidelberg, 2012.
19. J. Sallou, T. Durieux, and A. Panichella, “Breaking the silence: the threats of using llms in soft-
ware engineering,” in Proceedings of the 2024 ACM/IEEE 44th International Conference on Software
Engineering: New Ideas and Emerging Results, ICSE-NIER’24, (New York, NY, USA), p. 102–106,
Association for Computing Machinery, 2024.
20. J. Chen, X. Lu, Y. Du, M. Rejtig, R. Bagley, M. Horn, and U. Wilensky, “Learning agent-based
modeling with llm companions: Experiences of novices and experts using chatgpt & netlogo chat,” in
Proceedings of the CHI Conference on Human Factors in Computing Systems, CHI ’24, (New York,
NY, USA), Association for Computing Machinery, 2024.
21. G. Nápoles, F. Hoitsma, A. Knoben, A. Jastrzebska, and M. Leon, “Prolog-based agnostic explanation
module for structured pattern classification,” Information Sciences, vol. 622, p. 1196–1227, Apr. 2023.