10 Important LLM Benchmarks That You Should Know
& why they are important

Bhavishya Pandit
WHAT ARE BENCHMARKS?
OpenAI released o3 and Google released Gemini 2.0 Flash this week. Whenever a new model arrives, companies publish its benchmark scores alongside those of its peers.

Benchmarks provide a standardized method to evaluate LLMs across tasks like coding, reasoning, math, truthfulness, and more.

Below is a comparison of OpenAI's o1, o1-preview, and o3 models on SWE-bench and competitive programming benchmarks.

Let’s look at the different benchmarks used in LLM evaluation:

1. MMLU
MMLU stands for Massive Multitask Language Understanding.

It is used to test a model's accuracy across multiple fields.

The test covers 57 tasks ranging from elementary mathematics to advanced professional level. Topics include subjects across STEM, humanities, social sciences, etc. (Below is an example)

The current best performers on MMLU are Claude 3.5 Sonnet and GPT-4o, with an average score of 88.7%.

So if you are looking for a model that can answer multiple-choice questions most accurately, Claude 3.5 Sonnet is a suitable choice.
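To make the evaluation concrete, here is a minimal sketch of how MMLU-style multiple-choice scoring works: format each question with its four options, ask the model for a letter, and compute accuracy. The ask_model function and the item fields are hypothetical placeholders, not the official evaluation harness:

# Minimal sketch of MMLU-style multiple-choice scoring (illustrative only).
# ask_model is a hypothetical stand-in for whatever model API you use.

def ask_model(prompt: str) -> str:
    raise NotImplementedError  # return the model's chosen letter, e.g. "B"

def format_question(item: dict) -> str:
    options = "\n".join(f"{letter}. {text}"
                        for letter, text in zip("ABCD", item["choices"]))
    return (f"{item['question']}\n{options}\n"
            "Answer with a single letter: A, B, C, or D.")

def mmlu_accuracy(items: list[dict]) -> float:
    correct = 0
    for item in items:
        prediction = ask_model(format_question(item)).strip().upper()[:1]
        correct += prediction == item["answer"]  # gold answer stored here as a letter
    return correct / len(items)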

2. GSM-8K
LLMs are expected to perform well on mathematical tasks, and the GSM-8K dataset is used to measure their competency in this domain.

The GSM-8K dataset consists of 8,500 grade-school math word problems. (Below are a few examples)

Qwen2-Math-72B-Instruct excels in this benchmark, followed by SFT-Mistral-7B and OpenMath2-Llama3.1-70B.
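Scoring on GSM-8K usually comes down to exact match on the final number: each reference solution in the dataset ends with a line of the form "#### <answer>", so you extract the last number from the model's reasoning and compare. A rough sketch (the regex and normalization are illustrative, not the official scoring script):

import re

# Illustrative GSM-8K-style exact-match check: take the last number in the
# model's output and compare it with the reference answer, which in GSM-8K
# ends with a line like "#### 18".

NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")

def extract_final_number(text: str) -> str | None:
    matches = NUMBER.findall(text)
    return matches[-1].replace(",", "") if matches else None

def is_correct(model_output: str, reference_answer: str) -> bool:
    gold = reference_answer.split("####")[-1].strip().replace(",", "")
    pred = extract_final_number(model_output)
    return pred is not None and pred == gold

# Example: is_correct("9 eggs at $2 each is 9 * 2 = 18 dollars.", "... #### 18") -> True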

3. BBH (Big-Bench Hard)
Big-Bench Hard is a subset of 23 particularly challenging tasks from Big Bench (a suite of 200+ text-based tasks).

BBH is primarily used to evaluate a model on categories like:

a. Logical Reasoning
b. Common Sense Reasoning
c. Knowledge Application, etc.

(Below: a movie-knowledge question and the responses of models with different parameter counts)

It may seem obvious, but a lot of models fail to answer common-sense questions due to a lack of common-sense knowledge.

Qwen2.5-72B is the best performer on this benchmark, making it a strong choice for common-sense questions.

4. HumanEval
HumanEval tests a model on its coding abilities.

HumanEval is a dataset consisting of 164 hand-written coding problems to assess the model. (Below is an example problem)

Each problem includes a function signature, docstring, body and unit tests.

GPT-4o-based models (LDB, AgentCoder) and Claude 3.5 Sonnet are the top performers on this benchmark.
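HumanEval results are usually reported as pass@k: generate n completions per problem, run the unit tests, count how many pass (c), and estimate the chance that at least one of k sampled completions is correct. The unbiased estimator from the HumanEval paper is 1 - C(n-c, k) / C(n, k); a small sketch:

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased pass@k estimate: probability that at least one of k samples,
    # drawn from n generations of which c pass the unit tests, is correct.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples per problem, 37 passing -> pass@1 = 1 - 163/200 = 0.185
print(pass_at_k(n=200, c=37, k=1))   # 0.185
print(pass_at_k(n=200, c=37, k=10))  # noticeably higher: any of 10 tries may pass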

5. HellaSwag
HellaSwag evaluates a model's commonsense inference, a task that remains especially hard for state-of-the-art models.

HellaSwag itself is a dataset consisting of common-sense reasoning questions.

It includes questions like:

The top performer on this metric is CompassMTL 567M (never heard of it :'). Our famous GPT-4 is in 4th place, followed by LLaMA-3 in 5th.
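For base language models, HellaSwag is typically scored by ranking the four candidate endings by the model's (length-normalized) log-likelihood and picking the highest. A sketch of that idea, where ending_logprob is a hypothetical hook into your model's scoring API and the item fields are simplified:

# Sketch of HellaSwag-style scoring: choose the ending the model finds most likely.
# ending_logprob is a hypothetical hook into your model's scoring API.

def ending_logprob(context: str, ending: str) -> float:
    # Placeholder: sum of token log-probs of `ending` given `context`,
    # divided by the ending's token count (length normalization).
    raise NotImplementedError

def predict_ending(context: str, endings: list[str]) -> int:
    scores = [ending_logprob(context, e) for e in endings]
    return max(range(len(endings)), key=scores.__getitem__)

def hellaswag_accuracy(items: list[dict]) -> float:
    hits = sum(predict_ending(it["context"], it["endings"]) == it["label"]
               for it in items)
    return hits / len(items)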

6. BFCL
BFCL stands for Berkeley Function Calling Leaderboard. It evaluates an LLM's ability to call functions accurately.
BFCL consists of real-world data that is updated periodically.

Key features of BFCL include:

Extensive Case Library: 100 Java, 50 JavaScript, 70 REST API, 100 SQL, and 1,680 Python cases.
Versatile Scenarios: Support for simple, parallel, and multiple function calls.
Intelligent Function Mapping: Function relevance detection ensures optimal function selection.

The diagram below compares the performance of Gemini 1.5 Pro and Claude 3.5 Sonnet on BFCL.
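At its simplest, function-calling evaluation checks whether the call the model emits matches the expected function name and arguments. The sketch below assumes the model returns a JSON object with "name" and "arguments" fields; BFCL's real checks (AST matching, executable evaluation) are more elaborate, so treat this as illustrative only:

import json

# Simplified function-call check in the spirit of BFCL (illustrative only).

def call_matches(model_output: str, expected: dict) -> bool:
    try:
        call = json.loads(model_output)
    except json.JSONDecodeError:
        return False
    return (call.get("name") == expected["name"]
            and call.get("arguments") == expected["arguments"])

expected = {"name": "get_weather",
            "arguments": {"city": "Paris", "unit": "celsius"}}
output = '{"name": "get_weather", "arguments": {"city": "Paris", "unit": "celsius"}}'
print(call_matches(output, expected))  # True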

7. MMMU
Massive Multi-discipline Multimodal Understanding (MMMU) is a benchmark for evaluating multimodal models on complex tasks requiring advanced reasoning. Key features include:

11.5K multimodal questions across six disciplines, 30 subjects, and 183 subfields.
Diverse image types, including charts, diagrams, and maps.
Focus on reasoning and perception to assess model capabilities.
Performance gap: Even GPT-4V achieved only 56% accuracy, highlighting room for improvement in multimodal AI.

MMMU includes questions like the following:

OpenAI's o1 is the highest performer, with an overall score of 78.1.


8. AgentHarm
The AgentHarm benchmark is designed to advance research on preventing LLM
agent misuse. It features 110 explicitly malicious tasks spanning 11 harm categories,
such as fraud, cybercrime, and harassment.

Effective models must reject harmful requests while preserving their ability to complete multi-step tasks successfully, even after an attack.

AgentHarm assesses the ability of LLM agents to execute multi-step tasks effectively
while fulfilling user requests.

Source: https://arxiv.org/abs/2410.09024
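A benchmark like this boils down to two numbers per model: how often the agent refuses the malicious tasks, and how well it still completes the multi-step tasks it attempts. The aggregation below is a hypothetical sketch; the field names are made up for illustration and are not AgentHarm's actual schema:

# Hypothetical aggregation of per-task agent results into the two quantities
# an AgentHarm-style evaluation cares about: refusal rate and task capability.

def summarize(results: list[dict]) -> dict:
    refused = [r for r in results if r["refused"]]
    attempted = [r for r in results if not r["refused"]]
    avg_score = (sum(r["task_score"] for r in attempted) / len(attempted)
                 if attempted else 0.0)  # task_score in [0, 1], per multi-step task
    return {"refusal_rate": len(refused) / len(results),
            "avg_task_score": avg_score}

print(summarize([{"refused": True,  "task_score": 0.0},
                 {"refused": False, "task_score": 0.6}]))
# {'refusal_rate': 0.5, 'avg_task_score': 0.6}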

9. SWE-bench
SWE-bench (Software Engineering Benchmark) evaluates LLMs' ability to address
real-world software issues sourced from GitHub.

It includes over 2,200 issues paired with corresponding pull requests from 12 popular
Python repositories.

Given a codebase and an issue, a model must generate an effective patch. Success
requires interacting with execution environments, handling long contexts, and
demonstrating advanced reasoning skills—surpassing standard code generation
tasks.

Source: https://arxiv.org/abs/2310.06770

GPT-4-powered CodeR is the top-performing model, resolving 28.33% of issues (assisted).
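Conceptually, the check is simple: apply the model's patch to a checkout of the repository and see whether the tests associated with the issue now pass. The sketch below is a stripped-down illustration; the real SWE-bench harness pins a separate environment per repository, and the paths and test ids here are placeholders:

import subprocess

# Stripped-down illustration of a SWE-bench-style check: the issue counts as
# resolved only if the model's patch applies cleanly and the target tests pass.

def resolves_issue(repo_dir: str, patch_file: str, tests: list[str]) -> bool:
    applied = subprocess.run(["git", "apply", patch_file],
                             cwd=repo_dir, capture_output=True)
    if applied.returncode != 0:
        return False  # patch does not even apply
    result = subprocess.run(["python", "-m", "pytest", "-q", *tests],
                            cwd=repo_dir, capture_output=True)
    return result.returncode == 0

# resolves_issue("/path/to/repo", "model_patch.diff",
#                ["tests/test_example.py::test_expected_behavior"])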

10. MT-Bench
MT-bench evaluates an LLM's ability to sustain multi-turn conversations. It includes
80 multi-turn questions across 8 categories: writing, roleplay, extraction, reasoning,
math, coding, STEM, and social science.

Each interaction consists of two turns—an open-ended question (1st turn) followed
by a follow-up question (2nd turn).

The evaluation is automated using an LLM-as-a-judge system, which scores responses on a scale from 1 to 10.

Source: https://arxiv.org/abs/2306.05685

FuseChat-7B-VaRM is the top performer in this benchmark with a score of 8.22.
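A minimal sketch of the LLM-as-a-judge step: wrap the question and answer in a judging prompt, ask a strong judge model (the paper used GPT-4) for a rating, and parse it out. judge_model and the prompt template here are hypothetical simplifications of the real judge prompts:

import re

# Sketch of MT-Bench-style judging. judge_model is a hypothetical call to a
# strong judge LLM; the template is a simplified stand-in for the real prompts.

def judge_model(prompt: str) -> str:
    raise NotImplementedError

JUDGE_TEMPLATE = (
    "Rate the assistant's answer to the user's question on a scale of 1 to 10.\n"
    "Question: {question}\nAnswer: {answer}\n"
    "Reply in the form 'Rating: [[x]]' where x is an integer from 1 to 10."
)

def score_turn(question: str, answer: str) -> int | None:
    verdict = judge_model(JUDGE_TEMPLATE.format(question=question, answer=answer))
    match = re.search(r"\[\[(\d+)\]\]", verdict)
    return int(match.group(1)) if match else None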

Follow to stay updated on Generative AI


Bhavishya Pandit
