10 Important LLM Benchmarks That You Should Know
Bhavishya Pandit
WHAT ARE BENCHMARKS?
Benchmarks are standardized tests used to measure and compare LLM capabilities on specific tasks. OpenAI released o3 and Google released Gemini 2.0 Flash this week, and whenever a new model arrives, its maker publishes benchmark scores comparing it against peer models.
1. MMLU
MMLU stands for Massive Multitask Language Understanding. It measures a model's breadth of knowledge with multiple-choice questions spanning 57 subjects, from STEM to the humanities.
The current best performers on MMLU are Claude 3.5 Sonnet and GPT-4o, with an average of 88.7%.
So if you are looking for a model that can solve multiple-choice questions most reliably, Claude 3.5 Sonnet is a suitable choice. A minimal scoring loop is sketched below.
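To make the setup concrete, here is a minimal sketch of MMLU-style multiple-choice scoring in Python. The ask_model() helper and the sample question are hypothetical stand-ins, not part of the official MMLU harness.

# Minimal sketch of MMLU-style multiple-choice scoring.
# ask_model() is a hypothetical stand-in for a real LLM API call.

def ask_model(prompt: str) -> str:
    return "B"  # replace with a real LLM call

# Illustrative item in the MMLU format: a question plus four lettered choices.
questions = [
    {
        "question": "Which planet is known as the Red Planet?",
        "choices": {"A": "Venus", "B": "Mars", "C": "Jupiter", "D": "Mercury"},
        "answer": "B",
    },
]

correct = 0
for q in questions:
    options = "\n".join(f"{k}. {v}" for k, v in q["choices"].items())
    prompt = f"{q['question']}\n{options}\nAnswer with a single letter."
    prediction = ask_model(prompt).strip()[:1].upper()
    correct += prediction == q["answer"]

print(f"Accuracy: {correct / len(questions):.1%}")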
2. GSM-8K
LLMs are expected to perform well on mathematical tasks, and the GSM-8K dataset is used to measure their competency in this domain.
The dataset consists of 8,500 grade-school math word problems; a sample problem and the usual scoring convention are sketched below.
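Reference answers in GSM-8K end with a "#### <number>" marker, so graders typically extract the final number from the model's output and compare it against the reference. A minimal sketch, with a hypothetical model_output:

# GSM-8K-style scoring: extract the final number after "####" and compare.
import re

sample = {
    "question": (
        "Natalia sold clips to 48 of her friends in April, and then she sold "
        "half as many clips in May. How many clips did Natalia sell altogether?"
    ),
    "answer": "In May she sold 48 / 2 = 24 clips. 48 + 24 = 72. #### 72",
}

def extract_final_number(text: str):
    # The "#### <answer>" convention comes from the GSM-8K dataset itself.
    match = re.search(r"####\s*(-?[\d,]+)", text)
    return match.group(1).replace(",", "") if match else None

model_output = "Half of 48 is 24, so 48 + 24 = 72. #### 72"  # hypothetical output
print(extract_final_number(model_output) == extract_final_number(sample["answer"]))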
3. BBH
Big-Bench Hard (BBH) is a subset of BIG-Bench (a suite of 200+ text-based tasks), focusing on 23 tasks where models previously fell short of average human performance. The tasks cover areas such as:
a. Logical Reasoning
b. Common Sense Reasoning
c. Knowledge Application, etc.
[Figure: a movie-knowledge question with responses from models of different parameter counts]
It may seem obvious, but a lot of models fail common-sense questions simply because they lack common sense.
Qwen2.5-72B is the best performer on this benchmark, making it a strong choice for common-sense questions. A minimal exact-match evaluation is sketched below.
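BBH tasks are posed as plain-text prompts and are commonly scored by exact match against a target string. The task below is illustrative rather than an actual BBH item, and ask_model() stands in for a real model call:

# Exact-match scoring of a BBH-style logical reasoning task.

def ask_model(prompt: str) -> str:
    return "no"  # replace with a real LLM call

task = {
    "input": (
        "Alice, Bob and Carol each hold one ball: red, green, or blue. "
        "Alice does not hold red. Bob holds green. "
        "Question: Does Carol hold the blue ball? Answer yes or no."
    ),
    "target": "no",  # Alice must hold blue, so Carol holds red
}

prediction = ask_model(task["input"]).strip().lower()
print("correct" if prediction == task["target"] else "wrong")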
4. HUMANEVAL
HumanEval tests a model on its coding abilities across 164 hand-written programming problems.
Each problem includes a function signature, a docstring, a canonical solution, and unit tests.
GPT-4o-based systems (LDB, AgentCoder) and Claude 3.5 Sonnet are the top performers on this metric.
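Results on HumanEval are usually reported as pass@k: the probability that at least one of k sampled completions passes all unit tests. The HumanEval paper gives an unbiased estimator for it, which looks like this in Python:

# Unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021).
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """n = total samples, c = samples that passed all tests, k = budget."""
    if n - c < k:
        return 1.0  # every size-k draw contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=20, c=3, k=1))  # 0.15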
5. HellaSWAG
HellaSwag evaluates a model's commonsense inference: given a context, the model must pick the most plausible ending, a task that is still especially hard for state-of-the-art models.
The top performer on this metric is CompassMTL 567M (never heard of it? :`). Our famous GPT-4 is in 4th place, followed by LLaMA3 in 5th. A sketch of the usual likelihood-based scoring follows.
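HellaSwag is typically scored by having the model rate each candidate ending and taking the highest-likelihood one as its answer. In this sketch, log_likelihood() is a hypothetical stand-in for real log-probability scoring, and the item is invented:

# Likelihood-based scoring of a HellaSwag-style completion item.

def log_likelihood(context: str, ending: str) -> float:
    # Replace with a real model's summed token log-probs for `ending`.
    return -float(len(ending))  # dummy: shorter endings score higher

context = "A man is standing on a ladder. He"
endings = [
    "paints the ceiling with a roller.",
    "dives into a swimming pool.",
    "starts baking a cake on the top rung.",
    "folds the ladder while standing on it.",
]
label = 0  # index of the correct ending

prediction = max(range(len(endings)), key=lambda i: log_likelihood(context, endings[i]))
print("correct" if prediction == label else "wrong")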
6. BFCL
BFCL stands for Berkeley Function Calling Leaderboard. It evaluates an LLM's ability to call functions (tools) accurately.
BFCL is built from real-world data and is updated periodically.
[Figure: BFCL performance comparison of Gemini-1.5-Pro and Claude 3.5 Sonnet]
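At its simplest, function-calling evaluation means comparing the model's emitted call against a ground-truth call by name and arguments. BFCL itself uses stricter AST-based and executable checks; the sketch below, with a hypothetical model output, only illustrates the idea:

# Compare a model's function call against the expected call.
import json

ground_truth = {"name": "get_weather", "arguments": {"city": "Paris", "unit": "celsius"}}

# Hypothetical model output, serialized as JSON.
model_output = '{"name": "get_weather", "arguments": {"city": "Paris", "unit": "celsius"}}'

def call_matches(output: str, truth: dict) -> bool:
    try:
        call = json.loads(output)
    except json.JSONDecodeError:
        return False
    return call.get("name") == truth["name"] and call.get("arguments") == truth["arguments"]

print(call_matches(model_output, ground_truth))  # True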
7. MMMU
Massive Multimodal Multidiscipline Understanding (MMMU) is a benchmark for
evaluating multimodal models on complex tasks requiring advanced reasoning. Key
features include:
a. 11.5K multimodal questions across six disciplines, 30 subjects, and 183 subfields.
b. Diverse image types, including charts, diagrams, and maps.
c. Focus on reasoning and perception to assess model capabilities.
d. Performance gap: even GPT-4V achieved only 56% accuracy, highlighting room for improvement in multimodal AI.
8. AgentHarm
AgentHarm assesses the ability of LLM agents to execute multi-step tasks effectively while fulfilling user requests.
Effective models must reject harmful requests while preserving their capability to complete benign multi-step tasks successfully, even after an attack.
Source: https://arxiv.org/abs/2410.09024
9. SWE-bench
SWE-bench (Software Engineering Benchmark) evaluates LLMs' ability to address
real-world software issues sourced from GitHub.
It includes over 2,200 issues paired with corresponding pull requests from 12 popular
Python repositories.
Given a codebase and an issue, a model must generate an effective patch. Success
requires interacting with execution environments, handling long contexts, and
demonstrating advanced reasoning skills—surpassing standard code generation
tasks.
Source: https://arxiv.org/abs/2310.06770
GPT-4-powered CodeR is the top-performing model, resolving 28.33% of issues (assisted).
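Conceptually, the evaluation loop applies the model's patch to the repository and re-runs the issue's tests; an instance counts as resolved only if the tests pass. This is a simplified sketch with illustrative paths and commands, not the official SWE-bench harness:

# Apply a model-generated patch and re-run the failing tests.
import subprocess

def evaluate_patch(repo_dir: str, patch_file: str, test_cmd: list[str]) -> bool:
    applied = subprocess.run(
        ["git", "-C", repo_dir, "apply", patch_file],
        capture_output=True,
    )
    if applied.returncode != 0:
        return False  # the patch did not even apply cleanly
    result = subprocess.run(test_cmd, cwd=repo_dir, capture_output=True)
    return result.returncode == 0  # resolved only if the tests now pass

# Hypothetical usage:
# evaluate_patch("astropy", "model_patch.diff", ["pytest", "tests/test_issue.py"])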
10. MT-Bench
MT-bench evaluates an LLM's ability to sustain multi-turn conversations. It includes
80 multi-turn questions across 8 categories: writing, roleplay, extraction, reasoning,
math, coding, STEM, and social science.
Each interaction consists of two turns: an open-ended question (1st turn) followed by a follow-up question (2nd turn), as sketched below.
Source: https://arxiv.org/abs/2306.05685
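A minimal sketch of that two-turn flow, where chat() and judge() are hypothetical stand-ins for a model API and the LLM judge MT-bench uses for scoring, and the question pair is illustrative:

# Two-turn MT-Bench-style interaction followed by judge scoring.

def chat(messages: list[dict]) -> str:
    return "..."  # replace with a real model call

def judge(conversation: list[dict]) -> int:
    return 7  # MT-bench uses an LLM judge that scores each turn from 1 to 10

question = {
    "turn1": "Compose an engaging travel blog post about a trip to Hawaii.",
    "turn2": "Rewrite your previous response as a haiku.",
}

messages = [{"role": "user", "content": question["turn1"]}]
messages.append({"role": "assistant", "content": chat(messages)})
messages.append({"role": "user", "content": question["turn2"]})
messages.append({"role": "assistant", "content": chat(messages)})

print("judge score:", judge(messages))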
Follow to stay updated on Generative AI