RewardBench:
Evaluating Reward Models for Language Modeling
Abstract
Reward models (RMs) are at the crux of successfully using RLHF to align pretrained models to human preferences, yet there has been relatively little study that focuses on evaluation of those models. Evaluating reward models presents an opportunity to understand the opaque technologies used for alignment of language models and which values are embedded in them. Resources for reward model training and understanding are sparse in the nascent open-source community around them. To enhance scientific understanding of reward models, we present RewardBench, a benchmark dataset and code-base for evaluation. The RewardBench dataset is a collection of prompt-chosen-rejected trios spanning chat, reasoning, and safety, to benchmark how reward models perform on challenging, structured and out-of-distribution queries. We create specific comparison datasets for RMs that have subtle, but verifiable reasons (e.g. bugs, incorrect facts) why one answer should be preferred to another. On the RewardBench leaderboard, we evaluate reward models trained with a variety of methods, such as the direct MLE training of classifiers and the implicit reward modeling of Direct Preference Optimization (DPO). We present many findings on propensity for refusals, reasoning limitations, and instruction following shortcomings of various reward models towards a better understanding of the RLHF process.
1 Introduction
Reinforcement learning from human feedback (RLHF) is a necessary but opaque tool underlying the success of popular language models (LMs) such as OpenAI’s ChatGPT (Schulman et al., 2022) and Anthropic’s Claude (Bai et al., 2022a). The prevalence of RLHF stems from its efficacy at circumventing one of the greatest difficulties in integrating human preferences into language models: specifying an explicit reward (Christiano et al., 2017). Reward models (RMs) are central to this process. They are created by copying the original language model and training it on labeled preference data, producing a model that can predict whether one piece of text is likely to be preferred over another. A reinforcement learning optimizer then uses this reward model signal to update the parameters of the original model, improving performance on a variety of tasks (Ouyang et al., 2022; Touvron et al., 2023).
While the post-RLHF model (known as the policy) and even the pretrained model are extensively documented and evaluated, the basic properties of the RLHF process like the RMs receive far less attention. Recent work on training reward models (Zhu et al., 2023a; Jiang et al., 2023c) has begun to fill this gap, but utilizes validation sets from previous RLHF training processes, such as Anthropic’s Helpful and Harmless data (Bai et al., 2022a) or OpenAI’s Learning to Summarize (Stiennon et al., 2020), which are known to have ceilings on accuracy between 60 and 70% due to inter-annotator disagreement (Wang et al., 2024). Moreover, newly released preference data aiming to expand the diversity of preference training datasets such as UltraFeedback (Cui et al., 2023), UltraInteract (Yuan et al., 2024a) and Nectar (Zhu et al., 2023a), do not have test sets, necessitating a new style of evaluation for RMs.
We begin to rectify the lack of evaluation methods by introducing RewardBench, the first toolkit for benchmarking reward models. RLHF is a broadly applicable process used to enhance specific capabilities of LMs such as safety (Dai et al., 2023) or reasoning (Lightman et al., 2023; Havrilla et al., 2024a) as well as general capabilities such as instruction following (Ouyang et al., 2022) or “steerability” (Askell et al., 2021; Bai et al., 2022a). Thorough evaluations of RMs will also cover these categories. In this work, we curate data to create structured comparisons across a variety of reward model properties. Each sample is formatted as a prompt with a human-verified chosen and rejected completion. We design subsets so as to vary in difficulty and coverage. Some subsets are solved by small RMs, reaching 100% accuracy, but others are harder to differentiate and still have state-of-the-art performance around 75%, with many models around the random baseline.
We aim to map the current landscape of openly available reward models via a leaderboard for RewardBench. We have evaluated over 80 models, such as those trained as classifiers, including UltraRM (Cui et al., 2023), Starling (Zhu et al., 2023a), PairRM (Jiang et al., 2023c), SteamSHP (Ethayarajh et al., 2022), models from Reward rAnked FineTuning (RAFT) (Dong et al., 2023), and others. We also evaluate popular chat models trained with Direct Preference Optimization (DPO) (Rafailov et al., 2023), for example, Zephyr (Tunstall et al., 2023), Qwen-Chat (Bai et al., 2023), StableLM (Bellagente et al., 2024), and Tülu 2 (Ivison et al., 2023), to ground recent debates on RLHF methods and showcase specific datasets where they fall short.
With these models, we compare scaling, test reasoning capabilities, highlight three buckets of refusal behavior, and share more details on the inner workings of RMs. The accompanying code-base provides a common inference stack for many variations of models and we release many text-score pairs to analyze their performance. With RewardBench, we:
1. Release a common framework for evaluating the many different architectures of reward models, along with tools for visualization, training, and other analysis. We also release all data used in the evaluation, composed of text-score pairs for all inputs, to enable further data analysis on the properties of reward models. (Data is here: https://huggingface.co/datasets/allenai/reward-bench-results.)
2. Illustrate the differences between DPO and classifier-based reward models across a variety of datasets. DPO models, while more plentiful due to the method’s simplicity, fail to generalize to popular preference data test sets and present a higher variance in performance.
3. Chart the landscape of current state-of-the-art reward models. We showcase the scaling laws, the propensity to refuse (or not), the reasoning capabilities, and more for popular RMs.
4. Show the limitations of existing preference data test sets for evaluating these models, showcasing common pitfalls of RMs on subtle, but challenging instruction pairs (e.g. intentionally modified rejected responses, which superficially look high quality but answer the wrong prompt).
Category | Subset | N | Short Description |
---|---|---|---|
Chat (358 total) | AlpacaEval Easy | 100 | GPT4-Turbo vs. Alpaca 7B from Li et al. (2023b) |
Chat | AlpacaEval Length | 95 | Llama 2 Chat 70B vs. Guanaco 13B completions |
Chat | AlpacaEval Hard | 95 | Tulu 2 DPO 70B vs. Davinci003 completions |
Chat | MT Bench Easy | 28 | MT Bench ratings 10s vs. 1s from Zheng et al. (2023) |
Chat | MT Bench Medium | 40 | MT Bench completions rated 9s vs. 2-5s |
Chat Hard (456 total) | MT Bench Hard | 37 | MT Bench completions rated 7-8s vs. 5-6s |
Chat Hard | LLMBar Natural | 100 | LLMBar chat comparisons from Zeng et al. (2023) |
Chat Hard | LLMBar Adver. Neighbor | 134 | LLMBar challenge comparisons via similar prompts |
Chat Hard | LLMBar Adver. GPTInst | 92 | LLMBar comparisons via GPT4 similar prompts |
Chat Hard | LLMBar Adver. GPTOut | 47 | LLMBar comparisons via GPT4 unhelpful response |
Chat Hard | LLMBar Adver. Manual | 46 | LLMBar manually curated challenge completions |
Safety (740 total) | Refusals Dangerous | 100 | Preferring refusal to elicit dangerous responses |
Safety | Refusals Offensive | 100 | Preferring refusal to elicit offensive responses |
Safety | XSTest Should Refuse | 154 | Prompts that should be refused (Röttger et al., 2023) |
Safety | XSTest Should Respond | 250 | Preferring responses to queries with trigger words |
Safety | Do Not Answer | 136 | Questions that LLMs should refuse (Wang et al., 2023) |
Reasoning (1431 total) | PRM Math | 447 | Human vs. buggy LLM answers (Lightman et al., 2023) |
Reasoning | HumanEvalPack CPP | 164 | Correct CPP vs. buggy code (Muennighoff et al., 2023) |
Reasoning | HumanEvalPack Go | 164 | Correct Go code vs. buggy code |
Reasoning | HumanEvalPack Javascript | 164 | Correct Javascript code vs. buggy code |
Reasoning | HumanEvalPack Java | 164 | Correct Java code vs. buggy code |
Reasoning | HumanEvalPack Python | 164 | Correct Python code vs. buggy code |
Reasoning | HumanEvalPack Rust | 164 | Correct Rust code vs. buggy code |
Prior Sets (17.2k total) | Anthropic Helpful | 6192 | Helpful split from test set of Bai et al. (2022a) |
Prior Sets | Anthropic HHH | 221 | HHH validation data (Askell et al., 2021) |
Prior Sets | SHP | 1741 | Partial test set from Ethayarajh et al. (2022) |
Prior Sets | Summarize | 9000 | Test set from Stiennon et al. (2020) |
2 Related Works
Reinforcement Learning from Human Feedback
Using Reinforcement Learning to align language models with human feedback or preferences (Christiano et al., 2017; Ziegler et al., 2019) has led to improved chat models such as ChatGPT (Schulman et al., 2022) and Llama2 (Touvron et al., 2023). Incorporating human feedback into models in this way has been used to improve summarization (Stiennon et al., 2020; Wu et al., 2021), question answering (Nakano et al., 2021), image models (Lee et al., 2023) and instruction following in general (Ouyang et al., 2022).
RLHF often focuses on aspects of preference, where aspects could be more general concepts like helpfulness or harmlessness (Bai et al., 2022a), or more fine-grained ones (Wu et al., 2023), among others. In general, RLHF involves training a reward model on preference data collected from crowdworkers (Wang et al., 2024) (or from LM selected responses (Bai et al., 2022b)). Given a reward model, a policy can be learned using RL algorithms like PPO (Schulman et al., 2017), which has been shown to work well for language policies (Ramamurthy et al., 2022). Another option is to directly optimize a policy with chosen and rejected pairs, using DPO (Rafailov et al., 2023). Some reward modeling extensions include process reward models (Luo et al., 2023; Lightman et al., 2023) and step-wise reward models (Havrilla et al., 2024b), which are primarily used for reasoning tasks.
Reward Model & RLHF Evaluation
Preference tuned models can be evaluated using downstream evaluations, for example using AlpacaFarm (Dubois et al., 2024), where LMs are used to simulate human preferences by comparing a model generated output with that of a reference model. The reported metric is the win-rate of the model over the reference model. Similarly, MT-Bench (Zheng et al., 2023) evaluates chatbots on multi-turn conversations that are judged by LMs as a proxy for human judgments, and Chatbot Arena (Zheng et al., 2023) crowdsources the preferences between two different model outputs. These types of setups only indirectly evaluate the reward model. Other works directly analyze the reward model, such as Singhal et al. (2023), who found a strong correlation between output length and rewards by looking at the training dynamics of RMs. Another analysis looked at reward inconsistencies by creating a benchmark of contrasting instructions (Shen et al., 2023). Clymer et al. (2023) study reward model performance under distribution shift.
3 Background
Reward Modeling
The first step of training a reward model, and therefore doing RLHF, is collecting preference data from a group of human labelers. Individuals are presented with prompts, $x$, akin to a question or task, and asked to choose between a set of completions, $y_i$, answering the request. The most common case is for only two completions to be shown with measurement of preference, such as win-loss-tie or a Likert scale indicating the magnitude of preference between completions (Bai et al., 2022a), though other methods for labeling exist, such as ranking in a batch of 4+ answers (Ouyang et al., 2022). The resulting data is transformed into a set of prompt-chosen-rejected trios, where the chosen completion is preferred over the rejected completion for training. Training a reward model involves training a classifier to predict the human preference probability, $p^*$, between two answers, as modeled by a Bradley-Terry model (Bradley and Terry, 1952):

$$p^*(y_1 \succ y_2 \mid x) = \frac{\exp(r^*(x, y_1))}{\exp(r^*(x, y_1)) + \exp(r^*(x, y_2))}. \qquad (1)$$

Then, estimate the parameters of the RM by optimizing the maximum likelihood loss as follows:

$$\mathcal{L}(\theta, \mathcal{D}) = \mathbb{E}_{(x, y_{\text{chosen}}, y_{\text{rejected}}) \sim \mathcal{D}} \left[ \log\left(1 + e^{r_\theta(x, y_{\text{rejected}}) - r_\theta(x, y_{\text{chosen}})}\right) \right].$$

For language models, the RM is often implemented by appending a linear layer to predict one logit or removing the final decoding layers and replacing them with a linear layer. At inference time, a trained reward model returns a scalar, such that $P(y_1 \succ y_2 \mid x) \propto e^{r(x, y_1)}$ (which intuitively is the probability that the completion would be a preferred response, but is trained indirectly via the pairwise loss). Thus, a win between completions $y_1$ and $y_2$ is achieved when $r(x, y_1) > r(x, y_2)$.
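As a small numerical illustration of the Bradley-Terry preference probability and the pairwise maximum likelihood loss described above, the following sketch operates directly on scalar rewards. The function names are illustrative assumptions and are not taken from the RewardBench code-base; in practice the rewards would be produced by a trained RM.

```python
import math


def bt_preference_probability(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry probability that the chosen completion is preferred.

    exp(r_c) / (exp(r_c) + exp(r_r)) is rewritten as a sigmoid of the margin.
    """
    return 1.0 / (1.0 + math.exp(-(r_chosen - r_rejected)))


def pairwise_loss(r_chosen: float, r_rejected: float) -> float:
    """Negative log-likelihood of one preference: log(1 + exp(r_r - r_c))."""
    margin = r_chosen - r_rejected
    # Numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Hypothetical rewards for one prompt-chosen-rejected trio.
print(bt_preference_probability(1.5, 0.5))
print(pairwise_loss(1.5, 0.5))
```

The loss is the negative logarithm of the Bradley-Terry probability, so it shrinks as the chosen reward pulls ahead of the rejected one.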
Direct Preference Optimization
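Direct Preference Optimization solves the RLHF problem without needing to learn a separate reward model. It achieves this by reparameterizing the preference-based reward function using only the policy models (Rafailov et al., 2023). The implicit reward used in DPO is a function of the policy model probabilities (i.e. the model being trained), $\pi(y \mid x)$, a regularization constant, $\beta$, the base model probabilities, $\pi_{\text{ref}}(y \mid x)$, and a partition function $Z(x)$:

$$r(x, y) = \beta \log \frac{\pi(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x). \qquad (2)$$

Given two completions to a prompt, we compare the rewards $r(x, y_1)$ and $r(x, y_2)$ as follows, where the score is computed via the log ratios of $\pi$:

$$\log \frac{\pi(y_1 \mid x)}{\pi_{\text{ref}}(y_1 \mid x)} > \log \frac{\pi(y_2 \mid x)}{\pi_{\text{ref}}(y_2 \mid x)}.$$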
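The comparison of DPO implicit rewards can be sketched as follows. This is a minimal example under stated assumptions: the per-token log-probabilities are made-up numbers, the helper names are hypothetical, and the partition-function term is dropped because it depends only on the prompt and therefore cancels when two completions of the same prompt are compared (for a positive regularization constant).

```python
from typing import Sequence


def sequence_logprob(token_logprobs: Sequence[float]) -> float:
    """Log-probability of a completion as the sum of its token log-probs."""
    return sum(token_logprobs)


def dpo_implicit_reward(logp_policy: float, logp_ref: float, beta: float = 1.0) -> float:
    """beta * (log pi(y|x) - log pi_ref(y|x)), omitting the prompt-only term."""
    return beta * (logp_policy - logp_ref)


def dpo_prefers_first(logp_policy_1: float, logp_ref_1: float,
                      logp_policy_2: float, logp_ref_2: float,
                      beta: float = 1.0) -> bool:
    """True when completion 1 has the higher implicit reward (beta > 0 assumed)."""
    return (dpo_implicit_reward(logp_policy_1, logp_ref_1, beta)
            > dpo_implicit_reward(logp_policy_2, logp_ref_2, beta))


# Hypothetical token log-probs under the policy and the reference model.
chosen_policy = sequence_logprob([-1.0, -2.0, -0.5])
chosen_ref = sequence_logprob([-1.5, -2.5, -1.0])
rejected_policy = sequence_logprob([-2.0, -2.0])
rejected_ref = sequence_logprob([-2.0, -1.5])
print(dpo_prefers_first(chosen_policy, chosen_ref, rejected_policy, rejected_ref))
```

Note that this requires a forward pass through both the trained policy and its reference model, which is why DPO models can only be scored this way when the reference model is available.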
4 The RewardBench Benchmark
In this section, we detail the design philosophy and construction of the evaluation dataset. The dataset is designed to provide a broad set of basic evaluations for reward models, covering chat, instruction following, coding, safety, and other important metrics for fine-tuned language models. The RewardBench dataset contains a combination of existing evaluation prompt-completion pairs, and those curated for this project.
A good reward function, and therefore a good RM broadly, is one that stably assigns credit to the classes of good or bad content. Given one verified answer that is better than another for factual or clear qualitative reasons (e.g. typos), a good reward model will choose the correct one 100% of the time. To evaluate this, each datapoint consists of a prompt and two completions, chosen and rejected. For each prompt, the reward model scores both the prompt-chosen and the prompt-rejected pair. The prompt is then categorized as a win if the score of the prompt with the verified chosen completion is higher than that of the verified rejected completion, as shown in Fig. 1. Finally, we report accuracy for each subset as the percentage of wins. For all the section scores of RewardBench (e.g. Chat or Safety) except Prior Sets, the average score is weighted per-prompt in the requisite subsets.
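The win criterion and per-subset accuracy can be sketched in a few lines. The record layout (`subset`, `score_chosen`, `score_rejected`) is a hypothetical format chosen for illustration, not the schema of the released data, and treating ties as losses is an assumption that follows from requiring the chosen score to be strictly higher.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def subset_accuracy(records: Iterable[Mapping]) -> Dict[str, float]:
    """Percentage of wins per subset; a win means score_chosen > score_rejected."""
    wins = defaultdict(int)
    totals = defaultdict(int)
    for rec in records:
        totals[rec["subset"]] += 1
        if rec["score_chosen"] > rec["score_rejected"]:
            wins[rec["subset"]] += 1
    return {name: 100.0 * wins[name] / totals[name] for name in totals}


# Made-up scores for four prompts across two subsets.
toy_records = [
    {"subset": "mt-bench-easy", "score_chosen": 2.1, "score_rejected": -0.3},
    {"subset": "mt-bench-easy", "score_chosen": 0.4, "score_rejected": 0.9},
    {"subset": "llmbar-natural", "score_chosen": 1.2, "score_rejected": 1.0},
    {"subset": "llmbar-natural", "score_chosen": 0.7, "score_rejected": 0.1},
]
print(subset_accuracy(toy_records))
```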
4.1 RewardBench Dataset
The benchmark is broken down into five sections from different subsets – the first four compose the RewardBench dataset described in this section. We have broken down the dataset into these subsections to create one final RewardBench score in order to reasonably weigh different aspects of an RM’s performance. The RewardBench dataset is released under the ODC-BY license (https://opendatacommons.org/licenses/by/1-0/) and the code is released under Apache 2.0 (https://www.apache.org/licenses/LICENSE-2.0). The summary of the dataset is shown in Tab. 1 (see Appendix F for full details). At a high level, the subsets consist of the following:
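1. Chat: Prompts and chosen, rejected pairs are drawn from the AlpacaEval and MT Bench subsets listed in Tab. 1.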
2. Chat Hard: Testing a reward model’s abilities to understand trick questions and subtly different instruction responses. Prompts and chosen, rejected pairs are selected from MT Bench examples with similar ratings and adversarial data specifically for fooling LLM-as-a-judge tools from LLMBar’s evaluation set (Zeng et al., 2023) (reformatted for RMs).
3. Safety: Testing the models’ tendencies to refuse dangerous content and to avoid incorrect refusals to similar trigger words. Prompts and chosen, rejected pairs are selected from custom versions of the datasets XSTest (Röttger et al., 2023), Do-Not-Answer (Wang et al., 2023), and examples from an in-development refusals dataset at AI2, where the chosen response is a refusal and the rejected is harmful text of either dangerous or offensive nature.
4. Reasoning: Evaluating the models’ code and reasoning abilities. Code prompts are created by reformatting HumanEvalPack examples with correct code as chosen and rejected as one with bugs (Muennighoff et al., 2023). Reasoning prompts pair reference answers with incorrect model generations from the PRM800k dataset (Lightman et al., 2023).
5. Prior Sets: For consistency with recent work on training reward models, we average performance over test sets from existing preference datasets. We use the Anthropic Helpful split (Bai et al., 2022a) (the only multi-turn data), the Anthropic HHH subset of BIG-Bench (Askell et al., 2021), a curated subset of the test set from the Stanford Human Preferences (SHP) Dataset (Ethayarajh et al., 2022), and OpenAI’s Learning to Summarize Dataset (Stiennon et al., 2020). For the final RewardBench score, we weigh the Prior Sets category at 0.5 weight of the others due to multiple factors: noise, lack of clearly defined tasks, etc. The dataset is found here: https://huggingface.co/datasets/allenai/preference-test-sets

Table 2: Top-20 open models on RewardBench. Evaluating many RMs shows that there is still large variance in RM training and potential for future improvement across the more challenging instruction and reasoning tasks. Model types on the leaderboard are Sequence Classifier, Direct Preference Optimization, Custom Classifier, Generative Model, and a random model.

Reward Model | Score | Chat | Chat Hard | Safety | Reason | Prior Sets |
---|---|---|---|---|---|---|
RLHFlow/ArmoRM-Llama3-8B-v0.1 | 89.0 | 96.9 | 76.8 | 92.2 | 97.3 | 74.3 |
RLHFlow/pair-preference-model-LLaMA3-8B | 85.7 | 98.3 | 65.8 | 89.7 | 94.7 | 74.6 |
sfairXC/FsfairX-LLaMA3-RM-v0.1 | 83.6 | 99.4 | 65.1 | 87.8 | 86.4 | 74.9 |
openbmb/Eurus-RM-7b | 81.6 | 98.0 | 65.6 | 81.2 | 86.3 | 71.7 |
Nexusflow/Starling-RM-34B | 81.4 | 96.9 | 57.2 | 88.2 | 88.5 | 71.4 |
weqweasdas/RM-Mistral-7B | 79.3 | 96.9 | 58.1 | 87.1 | 77.0 | 75.3 |
hendrydong/Mistral-RM-for-RAFT-GSHF-v0 | 78.7 | 98.3 | 57.9 | 86.3 | 74.3 | 75.1 |
stabilityai/stablelm-2-12b-chat | 77.4 | 96.6 | 55.5 | 82.6 | 89.4 | 48.4 |
Ray2333/reward-model-Mistral-7B-instruct… | 76.9 | 97.8 | 50.7 | 86.7 | 73.9 | 74.3 |
allenai/tulu-2-dpo-70b | 76.1 | 97.5 | 60.5 | 83.9 | 74.1 | 52.8 |
meta-llama/Meta-Llama-3-70B-Instruct | 75.4 | 97.6 | 58.9 | 69.2 | 78.5 | 70.4 |
prometheus-eval/prometheus-8x7b-v2.0 | 75.3 | 93.0 | 47.1 | 83.5 | 77.4 | - |
NousResearch/Nous-Hermes-2-Mistral-7B-DPO | 74.8 | 92.2 | 60.5 | 82.3 | 73.8 | 55.5 |
mistralai/Mixtral-8x7B-Instruct-v0.1 | 74.7 | 95.0 | 64.0 | 73.4 | 78.7 | 50.3 |
upstage/SOLAR-10.7B-Instruct-v1.0 | 74.0 | 81.6 | 68.6 | 85.5 | 72.5 | 49.5 |
HuggingFaceH4/zephyr-7b-alpha | 73.4 | 91.6 | 62.5 | 74.3 | 75.1 | 53.5 |
allenai/tulu-2-dpo-13b | 73.4 | 95.8 | 58.3 | 78.2 | 73.2 | 49.5 |
0-hero/Matter-0.1-7B-boost-DPO-preview | 73.4 | 91.1 | 61.0 | 66.3 | 83.9 | 55.7 |
prometheus-eval/prometheus-7b-v2.0 | 72.4 | 85.5 | 49.1 | 78.7 | 76.5 | - |
HuggingFaceH4/starchat2-15b-v0.1 | 72.1 | 93.9 | 55.5 | 65.8 | 81.6 | 55.2 |
4.2 RewardBench Scoring
RewardBench is scored via accuracy. For each prompt-chosen-rejected trio, we infer the score the RM assigns for the prompt-chosen and prompt-rejected pairs, then assign a true classification label when the chosen score is higher than rejected, as highlighted in Fig. 1. Details on computing scores for classifiers and DPO models are in Sec. 3. Given the binary classification task, a random model achieves a result of 50%. In order to create a representative, single evaluation score, we perform a mixture of averaging across results. For the sections detailed in Sec. 4.1 except for Reasoning and Prior Sets, we perform per-prompt weighted averaging across the subsets to get the normalized section scores. For example, in Chat we take a weighted average of the AlpacaEval and MT Bench sets based on the number of prompts. For Reasoning, we increase the weight of the PRM-Math subset so code and math abilities are weighed equally in the final number. For Prior Sets, we take an unweighted average over the subsets due to the large disparity in dataset sizes. Once all subset averages are computed, the final RewardBench score is the weighted average across the section scores (Prior Sets at 0.5 weight).
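The aggregation steps can be sketched as below. The function names are hypothetical, and `reasoning_score` encodes one plausible reading of "weighed equally" (a plain mean of the math accuracy and the mean code accuracy); the released code-base may implement this differently. The worked example uses the published section scores of ArmoRM-Llama3-8B-v0.1 from Tab. 2.

```python
from typing import Dict, Sequence


def weighted_section_score(subset_acc: Dict[str, float],
                           subset_sizes: Dict[str, int]) -> float:
    """Per-prompt weighted average of subset accuracies within one section."""
    total = sum(subset_sizes[name] for name in subset_acc)
    return sum(subset_acc[name] * subset_sizes[name] for name in subset_acc) / total


def reasoning_score(math_acc: float, code_accs: Sequence[float]) -> float:
    """Assumed reading: math and code each contribute half of the section score."""
    code_mean = sum(code_accs) / len(code_accs)
    return 0.5 * math_acc + 0.5 * code_mean


def final_score(section_scores: Dict[str, float],
                prior_sets_weight: float = 0.5) -> float:
    """Weighted average over sections, with Prior Sets down-weighted."""
    weights = {name: (prior_sets_weight if name == "Prior Sets" else 1.0)
               for name in section_scores}
    weighted_sum = sum(section_scores[name] * weights[name] for name in section_scores)
    return weighted_sum / sum(weights.values())


# Section scores of RLHFlow/ArmoRM-Llama3-8B-v0.1 as reported in Tab. 2.
armo = {"Chat": 96.9, "Chat Hard": 76.8, "Safety": 92.2,
        "Reasoning": 97.3, "Prior Sets": 74.3}
print(round(final_score(armo), 1))  # 400.35 / 4.5, which rounds to 89.0
```

With the 0.5 weight on Prior Sets, the denominator is 4.5 rather than 5, and the result reproduces the Score column for this model.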
5 Evaluation Results
RewardBench includes evaluation of many public reward models, ranging in parameter count from 400 million (PairRM) to 70 billion (Tülu 2), trained as classifiers or with DPO (when the reference model is available). In this section, we detail the core findings of RewardBench, and more results are available in Appendix E. In particular, we study the state-of-the-art reward models (Tab. 2), results of similar-size models at 7B (Tab. 4), and a demonstration of the impact of scaling DPO reward models on performance in Tab. 3. We further study the limits of current reward models (Section 5.2) and prior test sets (Section 5.3).
Reward Model | Score | Chat | Chat Hard | Safety | Reason. | Prior Sets |
---|---|---|---|---|---|---|
allenai/tulu-2-dpo-70b | 76.1 | 97.5 | 60.5 | 83.9 | 74.1 | 52.8 |
allenai/tulu-2-dpo-13b | 73.4 | 95.8 | 58.3 | 78.2 | 73.2 | 49.5 |
allenai/tulu-2-dpo-7b | 71.7 | 97.5 | 56.1 | 73.3 | 71.8 | 47.7 |
Qwen/Qwen1.5-72B-Chat | 68.2 | 62.3 | 66.0 | 72.0 | 85.5 | 42.3 |
Qwen/Qwen1.5-14B-Chat | 69.8 | 57.3 | 70.2 | 76.3 | 89.6 | 41.2 |
Qwen/Qwen1.5-7B-Chat | 68.7 | 53.6 | 69.1 | 74.8 | 90.4 | 42.9 |
5.1 Comparing State-of-the-art Reward Models
Tab. 2 shows the results for the top 20 models across different model sizes and types. Large models and those trained on Llama 3 are the only models capable of high performance on the Chat Hard and Reasoning sections, with the model ArmoRM-Llama3-8B-v0.1 (89.0) being state-of-the-art. Across different base models, scale is a crucial property, with Starling-RM-34B (81.4) trained on Yi 34B and Tulu-2-DPO-70B (76.1) on Llama 2 being top models. The best open-weight models for LLM-as-a-judge are Meta-Llama-3-70B-Instruct (75.4) and prometheus-8x7b-v2.0 (75.3) (Kim et al., 2024), though they still fall well below classifier-based RMs. The final category comprises the small, most accessible models, where the leading models are StableLM-zephyr-3b (70.6) and oasst-rm-2.1-pythia-1.4b-epoch-2.5 (69.5), but there is substantial room for progress.
The Impacts of Different Base Models
In our evaluation there are multiple models trained either with the same or very similar fine-tuning approaches on different base models. We show the impact of scaling across different sizes of Llama 2, via Tulu 2 (Ivison et al., 2023), and of Qwen 1.5 in Tab. 3. In general, Llama 2 shows a clear improvement with scaling across most sections of RewardBench, but Qwen 1.5 shows less monotonic improvement, likely due to out-of-distribution generalization challenges. Tab. 4 compares the impact of different base models and subtle changes of fine-tuning methods via the Zephyr-class models (Tunstall et al., 2023). Each of these models is fine-tuned on the UltraFeedback dataset via DPO as the final stage, with different base models and instruction-tuning before. zephyr-7b-alpha and zephyr-7b-beta differ only by filtering of the UltraFeedback preference dataset, and this is reflected in zephyr-7b-alpha’s higher score on Safety (as refusals were removed from the dataset) and lower score on Chat. tulu-2-dpo-7b highlights the difference from the Mistral 7B to the Llama 2 7B base models and a different supervised fine-tuning dataset before DPO, which shows up as regressions on Chat Hard and Reasoning, but improvements on Safety.
Reward Model | Score | Chat | Chat Hard | Safety | Reason | Prior Sets |
---|---|---|---|---|---|---|
HuggingFaceH4/zephyr-7b-alpha | 73.4 | 91.6 | 62.5 | 74.3 | 75.1 | 53.5 |
HuggingFaceH4/zephyr-7b-beta | 71.8 | 95.3 | 62.7 | 61.0 | 77.9 | 52.2 |
allenai/tulu-2-dpo-7b | 71.7 | 97.5 | 56.1 | 73.3 | 71.8 | 47.7 |
allenai/OLMo-7B-Instruct | 66.7 | 89.7 | 50.7 | 62.3 | 71.7 | 51.7 |
HuggingFaceH4/zephyr-7b-gemma-v0.1 | 66.4 | 95.8 | 49.6 | 52.9 | 74.6 | 51.7 |
RLHFlow/ArmoRM-Llama3-8B-v0.1 | 89.0 | 96.9 | 76.8 | 92.2 | 97.3 | 74.3 |
RLHFlow/pair-preference-model-LLaMA3-8B | 85.7 | 98.3 | 65.8 | 89.7 | 94.7 | 74.6 |
sfairXC/FsfairX-LLaMA3-RM-v0.1 | 83.6 | 99.4 | 65.1 | 87.8 | 86.4 | 74.9 |
openbmb/Eurus-RM-7b | 81.6 | 98.0 | 65.6 | 81.2 | 86.3 | 71.7 |
weqweasdas/RM-Mistral-7B | 79.3 | 96.9 | 58.1 | 87.1 | 77.0 | 75.3 |
Different Shapes of Reward Functions
The per-prompt scores demonstrate the different magnitudes and distributions of rewards assigned by each reward model over the RewardBench evaluation dataset. Results shown in Appendix E.1, such as Fig. 7, show these distributions for some RMs trained as a classifier. Few RMs are Gaussian in their scores across the RewardBench datasets, fewer RMs are centered around 0 reward, and none we tested are centered Gaussians. Future work should identify a preferred RM output distribution for downstream RL training.
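One simple way to inspect the shape of a reward distribution is to summarize the per-prompt scores with a few moments. The sketch below is an illustrative assumption about how such an analysis could be done, not the procedure used for the figures in the appendix; the scores are made-up values. A centered Gaussian would have mean, skewness, and excess kurtosis all close to zero.

```python
import statistics
from typing import Dict, Sequence


def reward_distribution_summary(scores: Sequence[float]) -> Dict[str, float]:
    """Mean, population std, skewness and excess kurtosis of RM scores."""
    n = len(scores)
    mean = statistics.fmean(scores)
    std = statistics.pstdev(scores)
    if std == 0.0:
        # Degenerate case: all scores identical, higher moments are undefined.
        return {"mean": mean, "std": 0.0, "skewness": 0.0, "excess_kurtosis": 0.0}
    standardized = [(s - mean) / std for s in scores]
    skewness = sum(z ** 3 for z in standardized) / n
    excess_kurtosis = sum(z ** 4 for z in standardized) / n - 3.0
    return {"mean": mean, "std": std,
            "skewness": skewness, "excess_kurtosis": excess_kurtosis}


# Hypothetical per-prompt scores from a single reward model.
toy_scores = [-2.0, -1.0, 0.0, 1.0, 2.0]
print(reward_distribution_summary(toy_scores))
```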
5.2 Limits of Current Reward Models
Current reward models can solve some subsets of RewardBench reliably, approaching 100% accuracy, but many subsets exhibit low ceilings on performance, high variance of performance, or both. The subsets with low ceilings, mostly in the Chat Hard and Reasoning sections, indicate areas where preference datasets and reward modeling methods can be extended to improve performance, and subsets with high variability, such as many of the Safety subsets, indicate areas where best practices can be converged upon.
Evaluating across Chat Hard Categories
Tab. 5 compares different reward models across Chat Hard categories (full results are shown in the Appendix). The adversarial subsets from LLMBar (Zeng et al., 2023) are crucial to understanding RMs because they show examples where two answers are written in a similar style (e.g. the same GPT-4 model version), but with slightly different subjects. The difference between asking a factual question about a related but different object or slightly changing the context of a prompt is hard to pick up with most reward models. The Chat Hard section (and to some extent Reasoning) is largely correlated with final performance, but some DPO models excel at it and not overall – even Qwen Chat and others with low average performance overall. The models scoring highly are largely trained on recent base models and preference datasets, showcasing recent progress on RM training.
Evaluating across Reasoning Categories
The Reasoning section of RewardBench has the widest, smooth variation in performance – e.g. models populate many levels, from 35% accuracy (well below random) all the way to 97% accuracy. The reasoning data largely relies on code examples where just one or two tokens are different between the chosen and rejected samples, showcasing precise classification abilities of the best RMs. Full reasoning results are included in the Appendix.
Evaluating across Safety Metrics
Tab. 6 (full results in the Appendix) compares different reward models across different safety categories, indicating challenges in striking a balance between refusing too much and not refusing enough. Models such as UltraRM-13b and zephyr-7b-gemma-v0.1 show how a model focused on helpfulness without a strong notion of safety will score poorly on the should-refuse subsets of the safety section, but highly on XSTest Should Respond. Other models, namely those at the top of the overall leaderboard, clearly include safety information in the training process and maintain strong performance on trick questions that could induce false refusals (XSTest Should Respond). Finally, the mirrored behavior is also present: models that score highly on prompts they should refuse and poorly on those they should not, indicating a model that is likely to falsely refuse queries (e.g. the Qwen chat models). These three behavior modes indicate that RewardBench can be used as a quick check of the safety behavior of a candidate model, especially when trained with DPO (as it will not need further RL training like the classifiers).
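The three behavior modes can be turned into a rough heuristic check. The sketch below is purely illustrative: the bucket labels, the function name, and the default threshold (set to the 50% random baseline) are assumptions introduced here, not definitions from the benchmark, and real analyses should look at the per-subset numbers directly.

```python
def refusal_profile(should_refuse_acc: float,
                    should_respond_acc: float,
                    threshold: float = 50.0) -> str:
    """Bucket a model by accuracy on should-refuse vs. should-respond prompts.

    `threshold` is a hypothetical cut-off, defaulting to the random baseline.
    """
    refuses_when_needed = should_refuse_acc > threshold
    responds_when_safe = should_respond_acc > threshold
    if refuses_when_needed and responds_when_safe:
        return "balanced"
    if responds_when_safe:
        return "under-refuses"  # helpfulness-focused, weak notion of safety
    if refuses_when_needed:
        return "over-refuses"   # likely to falsely refuse benign queries
    return "neither"


# Made-up accuracies for three hypothetical models.
print(refusal_profile(92.0, 95.0))
print(refusal_profile(30.0, 96.0))
print(refusal_profile(94.0, 35.0))
```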
5.3 Limitations of Prior Test Sets
Many popular models trained with RLHF use new preference datasets such as UltraFeedback (Cui et al., 2023) or Nectar (Zhu et al., 2023a), which don’t have publicly available validation sets. Given this, when training reward models, common practice is to compare model agreement with a variety of existing test sets from earlier work in RLHF. Some models scoring strongly on the Prior Sets section of RewardBench, such as UltraRM-13b and PairRM-hf, were trained on the training splits of Anthropic HH, Stanford Human Preferences (SHP), and OpenAI’s Learning to Summarize, but other top classifier models, such as the Starling models, were not. Combining this with the very low average score of DPO models on these test sets indicates that substantial research is needed to understand the full limitations of these previous datasets. Full results are detailed in the Appendix.
6 Conclusion
We present RewardBench and show the variety of performance characteristics of current reward models in order to improve understanding of RLHF. While we covered a variety of topics important to alignment of LMs, a crucial next step is to correlate performance in RewardBench with RLHF usefulness. Initial experiments with ranking RMs with best-of-N sampling and downstream training with PPO are underway. We have taken a first step to understanding which values are embedded in the RLHF training across many base models and preference datasets. The toolkit we have released can easily be expanded to include custom data to specifically audit a certain property of the RLHF process. Scores of RMs from private LM providers are on the public leaderboard, but are not in the paper because they are not reproducible. RewardBench is one of many tools which will help us understand the science of whose and what values are embedded in our language models.
Acknowledgements
The authors would like to thank Thomas Gilbert for early discussions that helped motivate this project. Thanks to Prasann Singhal for discussing similar and complementary concurrent work when building this project. Thanks to Hamish Ivison for helping with the math data filtering code. Thanks to Matt Latzke for help with the logo and design artifacts.
References
- Abdulhai et al. (2023) Marwa Abdulhai, Gregory Serapio-Garcia, Clément Crepy, Daria Valter, John Canny, and Natasha Jaques. Moral foundations of large language models. arXiv preprint arXiv:2310.15337, 2023.
- Achiam et al. (2023) Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 Technical Report. arXiv preprint arXiv:2303.08774, 2023.
- Askell et al. (2021) Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy Jones, Nicholas Joseph, Ben Mann, Nova DasSarma, et al. A general language assistant as a laboratory for alignment. arXiv preprint arXiv:2112.00861, 2021.
- Bai et al. (2023) Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023.
- Bai et al. (2022a) Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022a.
- Bai et al. (2022b) Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional AI: Harmlessness from AI Feedback. arXiv preprint arXiv:2212.08073, 2022b.
- Bellagente et al. (2024) Marco Bellagente, Jonathan Tow, Dakota Mahan, Duy Phung, Maksym Zhuravinskyi, Reshinth Adithyan, James Baicoianu, Ben Brooks, Nathan Cooper, Ashish Datta, et al. Stable LM 2 1.6B Technical Report. arXiv preprint arXiv:2402.17834, 2024.
- Bradley and Terry (1952) Ralph Allan Bradley and Milton E. Terry. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39(3/4):324–345, 1952. ISSN 00063444. URL http://www.jstor.org/stable/2334029.
- Christiano et al. (2017) Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. Advances in Neural Information Processing Systems, 30, 2017.
- Clymer et al. (2023) Joshua Clymer, Garrett Baker, Rohan Subramani, and Sam Wang. Generalization analogies (genies): A testbed for generalizing ai oversight to hard-to-measure domains. arXiv preprint arXiv:2311.07723, 2023.
- Cui et al. (2023) Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao, Wei Zhu, Yuan Ni, Guotong Xie, Zhiyuan Liu, and Maosong Sun. UltraFeedback: Boosting Language Models with High-quality Feedback. arXiv preprint arXiv:2310.01377, 2023.
- Dai et al. (2023) Josef Dai, Xuehai Pan, Ruiyang Sun, Jiaming Ji, Xinbo Xu, Mickel Liu, Yizhou Wang, and Yaodong Yang. Safe RLHF: Safe reinforcement learning from human feedback. arXiv preprint arXiv:2310.12773, 2023.
- Dettmers et al. (2023) Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. QLoRA: Efficient Finetuning of Quantized LLMs. arXiv preprint arXiv:2305.14314, 2023.
- Dong et al. (2023) Hanze Dong, Wei Xiong, Deepanshu Goyal, Rui Pan, Shizhe Diao, Jipeng Zhang, Kashun Shum, and Tong Zhang. RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment. arXiv preprint arXiv:2304.06767, 2023.
- Dubois et al. (2024) Yann Dubois, Chen Xuechen Li, Rohan Taori, Tianyi Zhang, Ishaan Gulrajani, Jimmy Ba, Carlos Guestrin, Percy S Liang, and Tatsunori B Hashimoto. Alpacafarm: A simulation framework for methods that learn from human feedback. Advances in Neural Information Processing Systems, 36, 2024.
- Durmus et al. (2023) Esin Durmus, Karina Nguyen, Thomas I Liao, Nicholas Schiefer, Amanda Askell, Anton Bakhtin, Carol Chen, Zac Hatfield-Dodds, Danny Hernandez, Nicholas Joseph, et al. Towards measuring the representation of subjective global opinions in language models. arXiv preprint arXiv:2306.16388, 2023.
- Ethayarajh et al. (2022) Kawin Ethayarajh, Yejin Choi, and Swabha Swayamdipta. Understanding dataset difficulty with $\mathcal{V}$-usable information. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato, editors, Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 5988–6008. PMLR, 17–23 Jul 2022.
- Havrilla et al. (2024a) Alex Havrilla, Yuqing Du, Sharath Chandra Raparthy, Christoforos Nalmpantis, Jane Dwivedi-Yu, Maksym Zhuravinskyi, Eric Hambro, Sainbayar Sukhbaatar, and Roberta Raileanu. Teaching large language models to reason with reinforcement learning. arXiv preprint arXiv:2403.04642, 2024a.
- Havrilla et al. (2024b) Alex Havrilla, Sharath Raparthy, Christoforos Nalmpantis, Jane Dwivedi-Yu, Maksym Zhuravinskyi, Eric Hambro, and Roberta Raileanu. Glore: When, where, and how to improve llm reasoning via global and local refinements. arXiv preprint arXiv:2402.10963, 2024b.
- Hendrycks et al. (2021) Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring Mathematical Problem Solving With the MATH Dataset. NeurIPS, 2021.
- Ivison et al. (2023) Hamish Ivison, Yizhong Wang, Valentina Pyatkin, Nathan Lambert, Matthew Peters, Pradeep Dasigi, Joel Jang, David Wadden, Noah A Smith, Iz Beltagy, et al. Camels in a Changing Climate: Enhancing LM Adaptation with Tülu 2. arXiv preprint arXiv:2311.10702, 2023.
- Jiang et al. (2023a) Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023a.
- Jiang et al. (2023b) Dongfu Jiang, Yishan Li, Ge Zhang, Wenhao Huang, Bill Yuchen Lin, and Wenhu Chen. Tigerscore: Towards building explainable metric for all text generation tasks. ArXiv, abs/2310.00752, 2023b. URL https://api.semanticscholar.org/CorpusID:263334281.
- Jiang et al. (2023c) Dongfu Jiang, Xiang Ren, and Bill Yuchen Lin. Llm-blender: Ensembling large language models with pairwise comparison and generative fusion. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL 2023), 2023c.
- Kim et al. (2023) Seungone Kim, Jamin Shin, Yejin Cho, Joel Jang, Shayne Longpre, Hwaran Lee, Sangdoo Yun, Seongjin Shin, Sungdong Kim, James Thorne, et al. Prometheus: Inducing fine-grained evaluation capability in language models. arXiv preprint arXiv:2310.08491, 2023.
- Kim et al. (2024) Seungone Kim, Juyoung Suk, Shayne Longpre, Bill Yuchen Lin, Jamin Shin, Sean Welleck, Graham Neubig, Moontae Lee, Kyungjae Lee, and Minjoon Seo. Prometheus 2: An open source language model specialized in evaluating other language models. arXiv preprint arXiv:2405.01535, 2024.
- Lambert et al. (2023) Nathan Lambert, Thomas Krendl Gilbert, and Tom Zick. The history and risks of reinforcement learning and human feedback. arXiv e-prints, pages arXiv–2310, 2023.
- Lee et al. (2023) Kimin Lee, Hao Liu, Moonkyung Ryu, Olivia Watkins, Yuqing Du, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, and Shixiang Shane Gu. Aligning text-to-image models using human feedback. arXiv preprint arXiv:2302.12192, 2023.
- Li et al. (2023a) Junlong Li, Shichao Sun, Weizhe Yuan, Run-Ze Fan, Hai Zhao, and Pengfei Liu. Generative judge for evaluating alignment. arXiv preprint arXiv:2310.05470, 2023a.
- Li et al. (2023b) Xuechen Li, Tianyi Zhang, Yann Dubois, Rohan Taori, Ishaan Gulrajani, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. AlpacaEval: An Automatic Evaluator of Instruction-following Models. https://github.com/tatsu-lab/alpaca_eval, 2023b.
- Lightman et al. (2023) Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s Verify Step by Step. arXiv preprint arXiv:2305.20050, 2023.
- Luo et al. (2023) Haipeng Luo, Qingfeng Sun, Can Xu, Pu Zhao, Jianguang Lou, Chongyang Tao, Xiubo Geng, Qingwei Lin, Shifeng Chen, and Dongmei Zhang. Wizardmath: Empowering mathematical reasoning for large language models via reinforced evol-instruct. arXiv preprint arXiv:2308.09583, 2023.
- Mozes et al. (2023) Maximilian Mozes, Jessica Hoffmann, Katrin Tomanek, Muhamed Kouate, Nithum Thain, Ann Yuan, Tolga Bolukbasi, and Lucas Dixon. Towards agile text classifiers for everyone, 2023.
- Muennighoff et al. (2023) Niklas Muennighoff, Qian Liu, Armel Zebaze, Qinkai Zheng, Binyuan Hui, Terry Yue Zhuo, Swayam Singh, Xiangru Tang, Leandro von Werra, and Shayne Longpre. OctoPack: Instruction Tuning Code Large Language Models. arXiv preprint arXiv:2308.07124, 2023.
- Nakano et al. (2021) Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. WebGPT: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332, 2021.
- Ng and Jordan (2001) Andrew Ng and Michael Jordan. On discriminative vs. generative classifiers: A comparison of logistic regression and naive bayes. Advances in neural information processing systems, 14, 2001.
- Ouyang et al. (2022) Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155, 2022.
- Rafailov et al. (2023) Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. arXiv preprint arXiv:2305.18290, 2023.
- Ramamurthy et al. (2022) Rajkumar Ramamurthy, Prithviraj Ammanabrolu, Kianté Brantley, Jack Hessel, Rafet Sifa, Christian Bauckhage, Hannaneh Hajishirzi, and Yejin Choi. Is reinforcement learning (not) for natural language processing?: Benchmarks, baselines, and building blocks for natural language policy optimization. arXiv preprint arXiv:2210.01241, 2022. URL https://arxiv.org/abs/2210.01241.
- Röttger et al. (2023) Paul Röttger, Hannah Rose Kirk, Bertie Vidgen, Giuseppe Attanasio, Federico Bianchi, and Dirk Hovy. XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models. arXiv preprint arXiv:2308.01263, 2023.
- Ryan et al. (2024) Michael J. Ryan, William Held, and Diyi Yang. Unintended impacts of llm alignment on global representation, 2024.
- Schulman et al. (2017) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
- Schulman et al. (2022) John Schulman, Barret Zoph, Christina Kim, and more. ChatGPT: Optimizing Language Models for Dialogue. https://openai.com/blog/chatgpt/, 2022. Accessed: 2023-02-12.
- Shen et al. (2023) Lingfeng Shen, Sihao Chen, Linfeng Song, Lifeng Jin, Baolin Peng, Haitao Mi, Daniel Khashabi, and Dong Yu. The trickle-down impact of reward (in-) consistency on rlhf. arXiv preprint arXiv:2309.16155, 2023.
- Singhal et al. (2023) Prasann Singhal, Tanya Goyal, Jiacheng Xu, and Greg Durrett. A long way to go: Investigating length correlations in rlhf. arXiv preprint arXiv:2310.03716, 2023.
- Stiennon et al. (2020) Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F Christiano. Learning to summarize with human feedback. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 3008–3021. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/1f89885d556929e98d3ef9b86448f951-Paper.pdf.
- Taori et al. (2023) Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford Alpaca: An Instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca, 2023.
- Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv preprint arXiv:2307.09288, 2023.
- Tunstall et al. (2023) Lewis Tunstall, Edward Beeching, Nathan Lambert, Nazneen Rajani, Kashif Rasul, Younes Belkada, Shengyi Huang, Leandro von Werra, Clémentine Fourrier, Nathan Habib, Nathan Sarrazin, Omar Sanseviero, Alexander M. Rush, and Thomas Wolf. Zephyr: Direct Distillation of LM Alignment. arXiv preprint arXiv:2310.16944, 2023.
- Wang et al. (2024) Binghai Wang, Rui Zheng, Lu Chen, Yan Liu, Shihan Dou, Caishuang Huang, Wei Shen, Senjie Jin, Enyu Zhou, Chenyu Shi, Songyang Gao, Nuo Xu, Yuhao Zhou, Xiaoran Fan, Zhiheng Xi, Jun Zhao, Xiao Wang, Tao Ji, Hang Yan, Lixing Shen, Zhan Chen, Tao Gui, Qi Zhang, Xipeng Qiu, Xuanjing Huang, Zuxuan Wu, and Yu-Gang Jiang. Secrets of rlhf in large language models part ii: Reward modeling, 2024.
- Wang et al. (2023) Yuxia Wang, Haonan Li, Xudong Han, Preslav Nakov, and Timothy Baldwin. Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs. arXiv preprint arXiv:2308.13387, 2023.
- Wu et al. (2021) Jeff Wu, Long Ouyang, Daniel M Ziegler, Nisan Stiennon, Ryan Lowe, Jan Leike, and Paul Christiano. Recursively summarizing books with human feedback. arXiv preprint arXiv:2109.10862, 2021.
- Wu et al. (2023) Zeqiu Wu, Yushi Hu, Weijia Shi, Nouha Dziri, Alane Suhr, Prithviraj Ammanabrolu, Noah A Smith, Mari Ostendorf, and Hannaneh Hajishirzi. Fine-grained human feedback gives better rewards for language model training. arXiv preprint arXiv:2306.01693, 2023.
- Wyllie et al. (2024) Sierra Wyllie, Ilia Shumailov, and Nicolas Papernot. Fairness feedback loops: Training on synthetic data amplifies bias, 2024.
- Yuan et al. (2024a) Lifan Yuan, Ganqu Cui, Hanbin Wang, Ning Ding, Xingyao Wang, Jia Deng, Boji Shan, Huimin Chen, Ruobing Xie, Yankai Lin, Zhenghao Liu, Bowen Zhou, Hao Peng, Zhiyuan Liu, and Maosong Sun. Advancing llm reasoning generalists with preference trees, 2024a.
- Yuan et al. (2024b) Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Sainbayar Sukhbaatar, Jing Xu, and Jason Weston. Self-rewarding language models. arXiv preprint arXiv:2401.10020, 2024b.
- Zeng et al. (2023) Zhiyuan Zeng, Jiatong Yu, Tianyu Gao, Yu Meng, Tanya Goyal, and Danqi Chen. Evaluating Large Language Models at Evaluating Instruction Following. arXiv preprint arXiv:2310.07641, 2023.
- Zheng et al. (2023) Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. arXiv preprint arXiv:2306.05685, 2023.
- Zhu et al. (2023a) Banghua Zhu, Evan Frick, Tianhao Wu, Hanlin Zhu, and Jiantao Jiao. Starling-7B: Improving LLM Helpfulness & Harmlessness with RLAIF, November 2023a. URL https://starling.cs.berkeley.edu/.
- Zhu et al. (2023b) Lianghui Zhu, Xinggang Wang, and Xinlong Wang. Judgelm: Fine-tuned large language models are scalable judges. arXiv preprint arXiv:2310.17631, 2023b.
- Ziegler et al. (2019) Daniel M Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593, 2019.
Checklist
1. For all authors…
  (a) Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope? [Yes]
  (b) Did you describe the limitations of your work? [Yes] See Appendix A.
  (c) Did you discuss any potential negative societal impacts of your work? [Yes] See Appendix A.
  (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results…
  (a) Did you state the full set of assumptions of all theoretical results? [N/A]
  (b) Did you include complete proofs of all theoretical results? [N/A]
3. If you ran experiments (e.g. for benchmarks)…
  (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] Available on the first page, and also here: https://github.com/allenai/reward-bench.
  (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [N/A]
  (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [N/A] Evaluating reward models can introduce a small amount of variability, though the temperature should be set to 0 and the variance is substantially lower than in training experiments.
  (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] See Appendix C.
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets…
  (a) If your work uses existing assets, did you cite the creators? [Yes] We cite all the datasets we built upon in this work, primarily in Sec. 4.1. The code is almost entirely new, but in-line comments on GitHub credit borrowed code, e.g. the source code of models for inference.
  (b) Did you mention the license of the assets? [Yes] See Section 4.1 for datasets, which are all permissively licensed. The code copied was either released with no license (e.g. in a model card) or with a license that does not require noting it (Apache / MIT).
  (c) Did you include any new assets either in the supplemental material or as a URL? [Yes] We have included a substantial number of assets via URL, all of which should be in the main text. For example, the leaderboard (https://huggingface.co/spaces/allenai/reward-bench) is only useful as an online artifact. Other artifacts, such as the full results from evaluation and the evaluation datasets themselves, are linked externally.
  (d) Did you discuss whether and how consent was obtained from people whose data you’re using/curating? [N/A] The data was either generated by an LLM, by the team, or from previously released narrow benchmarks.
  (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [Yes] It is low risk, but discussed in Appendix A, particularly for the Safety section of the benchmark.
5. If you used crowdsourcing or conducted research with human subjects…
  (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A] The authors did, however, have explicit instructions for data collection, which are detailed in Appendix I. We did not use any additional crowdsourcing.
  (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
  (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]
Appendix A Limitations & Broader Impacts
Limitations
The RewardBench benchmark is limited by several factors. First, we lack human preference data and instead, except for specific subsets, have to rely on semi-automatic ways of obtaining chosen-rejected pairs, which we then manually validate. Second, the formats in certain domains, such as the reasoning domain, might include spurious correlations that bias both humans and models. Third, it remains unresolved whether and how the benchmark results correlate with downstream training. Lastly, data contamination is possible in cases where models are (wrongly) trained directly on AlpacaEval or MT Bench data.
Broader Impacts
This work does expose potentially offensive and/or sensitive text to users through the rejected samples of the Safety section of the benchmark. Users should therefore use this data at their own risk. Because the prompts are taken from preexisting benchmarks, we are not worried about eliciting personally identifiable information.
Appendix B Discussions
Evaluating Length Bias
Given the results showing length bias in RLHF and reward models (Singhal et al., 2023), we designed RewardBench so that the chosen responses are either of similar length to or shorter than the rejected responses. For example, the AlpacaEval Length subset is designed to differ from the other Chat subsets by pairing models of notably different capabilities whose responses have the same average length (results are in the Chat subset table in Appendix E). On this subset, accuracies are lower than on the other easy chat subsets, but over 10 models reach 90% or higher, and most models score far above random. More detailed statistical tests are still needed to fully understand this, as the subset only tests the reward models’ ability to discern information without the help of length as a proxy. More details on the length distributions of RewardBench are found in Appendix H.2.
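The design constraint above (chosen responses no longer than rejected ones, on average) can be checked with a few lines of code. The following is a minimal sketch, not the benchmark's own tooling: the function name `length_stats` is hypothetical, and whitespace-separated word counts are assumed as a stand-in for whatever tokenizer a real analysis would use.

```python
from statistics import mean
from typing import Dict, List, Tuple


def length_stats(pairs: List[Tuple[str, str]]) -> Dict[str, float]:
    """Summarize lengths for (chosen, rejected) response pairs.

    Length is measured in whitespace-separated words (an assumption;
    a tokenizer-based count would be a drop-in replacement).
    """
    chosen_lens = [len(chosen.split()) for chosen, _ in pairs]
    rejected_lens = [len(rejected.split()) for _, rejected in pairs]
    # Fraction of pairs where length alone could not favor the chosen response.
    not_longer = [
        1.0 if c <= r else 0.0 for c, r in zip(chosen_lens, rejected_lens)
    ]
    return {
        "mean_chosen": mean(chosen_lens),
        "mean_rejected": mean(rejected_lens),
        "frac_chosen_not_longer": mean(not_longer),
    }


# Toy data for illustration only.
toy_pairs = [("a b", "a b c"), ("a b c d", "a b")]
stats = length_stats(toy_pairs)
```

A subset where `mean_chosen` is close to `mean_rejected` is one where a reward model cannot score well by preferring longer text.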
DPO Models vs Classifiers
Since DPO-trained LLMs are implicit reward models largely used for their generative abilities, the question of how they compare to RMs trained as classifiers is unstudied. There are currently more DPO models released to the public, partially because DPO requires notably fewer computational resources, among other factors such as existing implementations and relevant datasets. We see that the results on RewardBench favor the recent DPO methods, except for the Prior Sets section. For how the DPO reward is computed, see Sec. 3.
The same inference code of popular DPO training implementations can easily be used for evaluation as an RM by not propagating gradients through the models. The simplest implementations require more GPU memory to evaluate DPO-trained models, given the two models needed to compute the reward, but this can be avoided by computing the probabilities over the policy and base models sequentially. However, some of the released DPO models do not clearly document which reference model is used in training (e.g. whether it is a base model or a model obtained via supervised fine-tuning), which can result in unclear benchmarking. (Examples include Mixtral-8x7B-Instruct-v0.1 or the Qwen chat models, which just say “trained with DPO,” yet they achieve solid performance.) When a reference model is unavailable or compute is constrained, an alternative approach would be to obtain a reference-free reward: $\pi(y_1 \mid x) > \pi(y_2 \mid x)$, which could be normalized using different approaches. Without normalization, this reward carries a length penalty, because it sums the per-token log-probabilities, which are all negative numbers. We will explore the impacts of reference-free inference in future work.
We also experimented with using the “wrong” reference model, i.e. a similar but different base model, and found that this reduced the DPO-trained RM performance to levels similar to the random baseline.
There is still a lot that is unknown about the best practices of training RMs: RMs trained with DPO are regularized by a KL distance to the reference model, but the classifiers are not. Additionally, a common practice for training RMs via classification is to train for 1 epoch (Ouyang et al., 2022), while DPO models are usually trained for more than 1 epoch (Tunstall et al., 2023; Ivison et al., 2023). Other future work ideas therefore include analyzing the role of the training hyperparameters in DPO training and RM classification performance (such as the $\beta$ coefficient of the KL regularization on generated text, the number of training epochs, etc.).
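To make the two scoring modes concrete, here is a minimal sketch that works on precomputed per-token log-probabilities. It assumes the standard implicit DPO reward of Rafailov et al. (2023), $\beta \left( \log \pi(y \mid x) - \log \pi_{\text{ref}}(y \mid x) \right)$ up to a prompt-only term that cancels when comparing two completions of the same prompt. The function names, the default `beta`, and the toy numbers are illustrative assumptions, not the RewardBench implementation.

```python
from typing import Sequence


def sequence_logprob(token_logprobs: Sequence[float]) -> float:
    """log pi(y | x) as the sum of per-token log-probabilities."""
    return sum(token_logprobs)


def dpo_reward(
    policy_logprobs: Sequence[float],
    reference_logprobs: Sequence[float],
    beta: float = 1.0,  # assumed value; use the beta the model was trained with
) -> float:
    """Implicit DPO reward, up to a term that depends only on the prompt."""
    return beta * (
        sequence_logprob(policy_logprobs) - sequence_logprob(reference_logprobs)
    )


def reference_free_reward(
    policy_logprobs: Sequence[float], length_normalize: bool = False
) -> float:
    """Policy log-likelihood only; optionally averaged over tokens.

    Averaging is one possible normalization (an assumption here).
    """
    total = sequence_logprob(policy_logprobs)
    return total / len(policy_logprobs) if length_normalize else total


# Toy log-probabilities: the chosen answer is longer (4 tokens vs 2).
chosen_policy, chosen_ref = [-0.5] * 4, [-0.75] * 4
rejected_policy, rejected_ref = [-0.9] * 2, [-0.5] * 2

with_reference = dpo_reward(chosen_policy, chosen_ref) > dpo_reward(
    rejected_policy, rejected_ref
)
raw = reference_free_reward(chosen_policy) > reference_free_reward(rejected_policy)
normalized = reference_free_reward(chosen_policy, True) > reference_free_reward(
    rejected_policy, True
)
```

In this toy case the unnormalized reference-free comparison prefers the shorter rejected answer purely because fewer negative terms are summed, while the reference-based and length-normalized comparisons prefer the chosen answer. Because only summed log-probabilities are needed, the policy and reference passes can be run one after the other rather than holding both models in memory.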
Reward Model | Score | Chat | Chat Hard | Safety | Reason | Prior Sets |
---|---|---|---|---|---|---|
google/gemini-1.5-pro-0514 | 88.1 | 92.3 | 80.6 | 87.5 | 92.0 | - |
openai/gpt-4-0125-preview | 84.3 | 95.3 | 74.3 | 87.2 | 86.9 | 70.9 |
openai/gpt-4-turbo-2024-04-09 | 83.9 | 95.3 | 75.4 | 87.1 | 82.7 | 73.6 |
openai/gpt-4o-2024-05-13 | 83.3 | 96.6 | 70.4 | 86.7 | 84.9 | 72.6 |
google/gemini-1.5-pro-0514 | 80.7 | 92.2 | 63.5 | 87.7 | 85.1 | 69.4 |
Anthropic/claude-3-opus-20240229 | 80.7 | 94.7 | 60.3 | 89.1 | 78.7 | - |
[O] meta-llama/Meta-Llama-3-70B-Instruct | 75.4 | 97.6 | 58.9 | 69.2 | 78.5 | 70.4 |
[O] prometheus-eval/prometheus-8x7b-v2.0 | 75.3 | 93.0 | 47.1 | 83.5 | 77.4 | - |
Anthropic/claude-3-sonnet-20240229 | 75.0 | 93.4 | 56.6 | 83.7 | 69.1 | 69.6 |
Anthropic/claude-3-haiku-20240307 | 73.5 | 92.7 | 52.0 | 82.1 | 70.6 | 66.3 |
[O] prometheus-eval/prometheus-7b-v2.0 | 72.4 | 85.5 | 49.1 | 78.7 | 76.5 | - |
[O] CohereForAI/c4ai-command-r-plus | 69.6 | 95.1 | 57.6 | 55.6 | 70.4 | 69.2 |
Generative Reward Modeling
An alternative to classifier-based reward models, which are discriminative (Ng and Jordan, 2001), is to use generations from a language model to create a judgement between two answers (Zheng et al., 2023). (We believe that using generations should be called generative reward modeling when the judgements are used to curate a reward signal for training; the general application of this technology is LLM-as-a-judge.) Given LLM-as-a-judge’s prevalent use for evaluation, recent works have emerged using LLMs as feedback mechanisms very similar to reward models. Some works have fine-tuned models specifically for the task of rating or choosing responses from LLMs (Jiang et al., 2023b; Kim et al., 2023; Zhu et al., 2023b). Others use the policy LM itself as a generative reward model by prompting it to behave as a judge (Yuan et al., 2024b; Li et al., 2023a). While similar to the reward computation of DPO models, this mode of score calculation often involves model-specific prompting and more computation per sample, such as explaining reasoning before or after the score. Results are shown in Tab. 8, where there is substantial variation among existing open and closed models. Note that the best classifier RMs outperform the best generative reward models.
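As an illustration of how a generative judgement can be turned into a chosen/rejected decision, the sketch below prompts a judge, parses a verdict tag, and maps it back to the pair. Everything here is assumed for illustration: the prompt wording, the `[[A]]`/`[[B]]` verdict format, and the `judge` callable (any function from prompt text to generated text) are hypothetical and do not reproduce the prompts used for the results in this paper.

```python
import re
from typing import Callable, Optional

# Hypothetical prompt; real judges typically need model-specific templates.
PROMPT_TEMPLATE = (
    "Compare the two responses to the user question and explain your reasoning.\n"
    "Finish with [[A]] if Response A is better or [[B]] if Response B is better.\n\n"
    "Question: {prompt}\n\nResponse A: {a}\n\nResponse B: {b}\n"
)


def parse_verdict(generation: str) -> Optional[str]:
    """Return 'A' or 'B' if the generation contains a verdict tag, else None."""
    match = re.search(r"\[\[(A|B)\]\]", generation)
    return match.group(1) if match else None


def judge_prefers_chosen(
    judge: Callable[[str], str],
    prompt: str,
    chosen: str,
    rejected: str,
    chosen_first: bool = True,
) -> Optional[bool]:
    """True if the judge picks the chosen response, None if unparseable.

    `chosen_first` controls the presentation order, so that callers can
    vary it to reduce position effects.
    """
    a, b = (chosen, rejected) if chosen_first else (rejected, chosen)
    verdict = parse_verdict(judge(PROMPT_TEMPLATE.format(prompt=prompt, a=a, b=b)))
    if verdict is None:
        return None
    return (verdict == "A") == chosen_first


def always_a(_: str) -> str:
    """Stub judge that always answers A, standing in for a real LLM call."""
    return "Response A is more accurate. [[A]]"
```

With the stub, `judge_prefers_chosen(always_a, "q", "good", "bad")` counts as a win, while the same call with `chosen_first=False` counts as a loss, which is why the presentation order matters when aggregating accuracy.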
Values Represented in Reward Models
Reward models occupy an important normative role in the RLHF process, being the primary artifact through which human preferences or values are encoded in the final policy. The RewardBench infrastructure enables asking basic questions when studying reward models, such as whose or which values are embedded as the sense of reward (Lambert et al., 2023). Initial work is studying this question for LLMs broadly, such as measuring representation (Durmus et al., 2023; Ryan et al., 2024) or moral foundations of LMs (Abdulhai et al., 2023), but this work should be extended to reward models. This can involve studying the different base models from which RMs are trained, the fine-tuning techniques, the datasets, and whether synthetic datasets amplify bias in RMs as well (Wyllie et al., 2024).
Safety In or After RLHF
An emerging trend in LLMs is the shift from chat systems being only a model to being a system of models, with small models used as classifiers for tasks such as safety (Mozes et al., 2023). If some LLMs or RMs are designed to be used with additional safety classifiers after the fact, evaluating them on RewardBench may not be a fair comparison. For systems such as this, each classifier for a specific task should be evaluated on the sections it controls. The most common area where this is handled is safety, where a small reward model can be used to permit or block all outputs from a larger generating model.
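The "system of models" setup described above can be sketched as a generator whose outputs are gated by a separate safety scorer. This is a hypothetical illustration: the `generate` and `safety_score` callables, the threshold, and the refusal text are all assumptions, not components of any system evaluated in this paper.

```python
from typing import Callable


def guarded_generate(
    generate: Callable[[str], str],
    safety_score: Callable[[str, str], float],
    prompt: str,
    threshold: float = 0.5,  # assumed cut-off; higher score means safer
    refusal: str = "I can't help with that.",
) -> str:
    """Return the generator's output only if the safety scorer permits it."""
    response = generate(prompt)
    if safety_score(prompt, response) >= threshold:
        return response
    return refusal


# Stubs standing in for a large generator and a small safety classifier.
def echo_model(prompt: str) -> str:
    return f"Answer to: {prompt}"


def keyword_scorer(prompt: str, response: str) -> float:
    return 0.0 if "forbidden" in prompt else 1.0


allowed = guarded_generate(echo_model, keyword_scorer, "hello")
blocked = guarded_generate(echo_model, keyword_scorer, "forbidden request")
```

In such a pipeline, the generator and the safety scorer would each be evaluated only on the benchmark sections they are responsible for.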
Appendix C Compute Usage
This work primarily evaluates models on NVIDIA A100 GPUs hosted by Cirrascale (per-model batch sizes and settings are available online: https://github.com/allenai/reward-bench/blob/main/scripts/configs/eval_configs.yaml). Each model, of which we evaluated 75, takes about 12 hours to run with 16-bit quantization. Re-running the entire evaluation suite of RewardBench would take approximately 1000 A100 hours to complete.
Appendix D Codebase Discussion
Additional data is included in the code-base, but not included in the evaluation score due to noisy results or lack of clear use instructions (e.g. it could easily lead to unintentional test-set contamination). In this vein, results on SafeRLHF (Dai et al., 2023) data and MT Bench labels (from humans and GPT-4; https://huggingface.co/datasets/lmsys/mt_bench_human_judgments) are supported within the methodology, but not included in this analysis.
Appendix E Additional Results
The full results table in this appendix shows the results for the first reward models we collected in this work. In addition, the per-section tables, from the Chat subsets through the prior preference sets, provide the performance breakdown per category.
Reward Model | Average | AlpacaEval Easy | AlpacaEval Length | AlpacaEval Hard | MT Bench Easy | MT Bench Medium |
---|---|---|---|---|---|---|
sfairXC/FsfairX-LLaMA3-RM-v0.1 | 99.4 | 100.0 | 98.9 | 98.9 | 100.0 | 100.0 |
RLHFlow/pair-preference-model-LLaMA3-8B | 98.3 | 98.0 | 97.9 | 97.9 | 100.0 | 100.0 |
hendrydong/Mistral-RM-for-RAFT-GSHF-v0 | 98.3 | 100.0 | 95.8 | 100.0 | 100.0 | 95.0 |
berkeley-nest/Starling-RM-7B-alpha | 98.0 | 99.0 | 97.9 | 100.0 | 96.4 | 92.5 |
openbmb/Eurus-RM-7b | 98.0 | 97.0 | 97.9 | 100.0 | 96.4 | 97.5 |
Ray2333/reward-model-Mistral-7B-instruct-Unified… | 97.8 | 98.0 | 95.8 | 98.9 | 100.0 | 97.5 |
meta-llama/Meta-Llama-3-70B-Instruct | 97.6 | 100.0 | 92.1 | 100.0 | 100.0 | 97.5 |
allenai/tulu-2-dpo-70b | 97.5 | 98.0 | 98.9 | 100.0 | 85.7 | 95.0 |
allenai/tulu-2-dpo-7b | 97.5 | 99.0 | 96.8 | 98.9 | 92.9 | 95.0 |
Nexusflow/Starling-RM-34B | 96.9 | 99.0 | 92.6 | 100.0 | 96.4 | 95.0 |
weqweasdas/RM-Gemma-7B | 96.9 | 98.0 | 93.7 | 98.9 | 100.0 | 95.0 |
weqweasdas/RM-Mistral-7B | 96.9 | 98.0 | 93.7 | 97.9 | 100.0 | 97.5 |
RLHFlow/ArmoRM-Llama3-8B-v0.1 | 96.9 | 97.0 | 96.8 | 94.7 | 100.0 | 100.0 |
stabilityai/stablelm-2-zephyr-1_6b | 96.6 | 97.0 | 98.9 | 96.8 | 100.0 | 87.5 |
openai/gpt-4o-2024-05-13 | 96.6 | 100.0 | 89.5 | 97.9 | 100.0 | 100.0 |
stabilityai/stablelm-2-12b-chat | 96.6 | 99.0 | 100.0 | 93.7 | 96.4 | 90.0 |
openbmb/UltraRM-13b | 96.4 | 97.0 | 90.5 | 98.9 | 100.0 | 100.0 |
HuggingFaceH4/zephyr-7b-gemma-v0.1 | 95.8 | 98.0 | 93.7 | 97.9 | 89.3 | 95.0 |
allenai/tulu-2-dpo-13b | 95.8 | 96.0 | 97.9 | 100.0 | 89.3 | 85.0 |
mightbe/Better-PairRM | 95.5 | 99.0 | 86.3 | 100.0 | 92.9 | 100.0 |
PoLL/gpt-3.5-turbo-0125_claude-3-sonnet-20240229… | 95.3 | 99.0 | 86.3 | 98.9 | 96.4 | 97.5 |
HuggingFaceH4/zephyr-7b-beta | 95.3 | 95.0 | 94.7 | 96.8 | 89.3 | 97.5 |
openbmb/Eurus-7b-kto | 95.3 | 98.0 | 95.8 | 96.8 | 89.3 | 87.5 |
openai/gpt-4-0125-preview | 95.3 | 98.0 | 87.4 | 96.8 | 100.0 | 100.0 |
openai/gpt-4-turbo-2024-04-09 | 95.3 | 97.0 | 88.4 | 96.8 | 100.0 | 100.0 |
CohereForAI/c4ai-command-r-plus | 95.1 | 99.0 | 90.0 | 97.9 | 96.4 | 90.0 |
mistralai/Mixtral-8x7B-Instruct-v0.1 | 95.0 | 95.0 | 100.0 | 90.5 | 92.9 | 95.0 |
weqweasdas/RM-Gemma-7B-4096 | 95.0 | 98.0 | 90.5 | 94.7 | 96.4 | 97.5 |
Anthropic/claude-3-opus-20240229 | 94.7 | 99.0 | 84.2 | 98.9 | 96.4 | 97.5 |
weqweasdas/RM-Gemma-2B | 94.4 | 96.0 | 90.5 | 97.9 | 96.4 | 90.0 |
HuggingFaceH4/starchat2-15b-v0.1 | 93.9 | 95.0 | 92.6 | 95.8 | 96.4 | 87.5 |
jondurbin/bagel-dpo-34b-v0.5 | 93.9 | 97.0 | 93.7 | 94.7 | 85.7 | 90.0 |
Anthropic/claude-3-sonnet-20240229 | 93.4 | 98.5 | 80.5 | 99.5 | 96.4 | 95.0 |
prometheus-eval/prometheus-8x7b-v2.0 | 93.0 | 96.0 | 87.4 | 92.6 | 92.9 | 100.0 |
Anthropic/claude-3-haiku-20240307 | 92.7 | 99.0 | 80.0 | 100.0 | 92.9 | 90.0 |
OpenAssistant/oasst-rm-2-pythia-6.9b-epoch-1 | 92.5 | 97.0 | 91.6 | 98.9 | 82.1 | 75.0 |
google/gemini-1.5-pro-0514 | 92.3 | 95.0 | 84.2 | 93.7 | 98.2 | 97.5 |
NousResearch/Nous-Hermes-2-Mistral-7B-DPO | 92.2 | 96.0 | 83.2 | 95.8 | 92.9 | 95.0 |
openai/gpt-3.5-turbo-0125 | 92.2 | 95.5 | 82.1 | 98.9 | 94.6 | 90.0 |
HuggingFaceH4/zephyr-7b-alpha | 91.6 | 99.0 | 78.9 | 95.8 | 92.9 | 92.5 |
NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO | 91.6 | 98.0 | 87.4 | 96.8 | 75.0 | 85.0 |
0-hero/Matter-0.1-7B-boost-DPO-preview | 91.1 | 98.0 | 88.4 | 90.5 | 89.3 | 82.5 |
llm-blender/PairRM-hf | 90.2 | 96.0 | 75.8 | 97.9 | 92.9 | 90.0 |
allenai/OLMo-7B-Instruct | 89.7 | 90.0 | 91.6 | 92.6 | 85.7 | 80.0 |
0-hero/Matter-0.1-7B-DPO-preview | 89.4 | 100.0 | 84.2 | 95.8 | 67.9 | 75.0 |
openbmb/MiniCPM-2B-dpo-fp32 | 89.1 | 95.0 | 92.6 | 88.4 | 85.7 | 70.0 |
OpenAssistant/oasst-rm-2.1-pythia-1.4b-epoch-2.5 | 88.5 | 95.0 | 78.9 | 93.7 | 85.7 | 85.0 |
RLHFlow/RewardModel-Mistral-7B-for-DPA-v1 | 88.0 | 91.0 | 73.7 | 95.8 | 89.3 | 95.0 |
IDEA-CCNL/Ziya-LLaMA-7B-Reward | 86.9 | 85.0 | 84.2 | 92.6 | 92.9 | 80.0 |
stabilityai/stablelm-zephyr-3b | 86.3 | 72.0 | 95.8 | 89.5 | 96.4 | 85.0 |
stanfordnlp/SteamSHP-flan-t5-large | 85.8 | 94.0 | 72.6 | 97.9 | 75.0 | 75.0 |
prometheus-eval/prometheus-7b-v2.0 | 85.5 | 92.0 | 81.1 | 86.8 | 73.2 | 85.0 |
meta-llama/Meta-Llama-3-8B-Instruct | 85.5 | 91.0 | 72.6 | 90.5 | 94.6 | 83.8 |
stanfordnlp/SteamSHP-flan-t5-xl | 85.5 | 93.0 | 69.5 | 98.9 | 78.6 | 77.5 |
ContextualAI/archangel_sft-kto_llama30b | 84.4 | 93.0 | 76.8 | 88.4 | 82.1 | 72.5 |
ContextualAI/archangel_sft-kto_llama13b | 84.1 | 96.0 | 76.8 | 87.4 | 71.4 | 72.5 |
OpenAssistant/reward-model-deberta-v3-large-v2 | 83.2 | 99.0 | 41.1 | 96.8 | 100.0 | 100.0 |
PKU-Alignment/beaver-7b-v1.0-reward | 81.8 | 98.0 | 63.2 | 100.0 | 67.9 | 52.5 |
weqweasdas/hh_rlhf_rm_open_llama_3b | 81.8 | 95.0 | 64.2 | 96.8 | 64.3 | 67.5 |
upstage/SOLAR-10.7B-Instruct-v1.0 | 81.6 | 92.0 | 74.7 | 75.8 | 89.3 | 80.0 |
ContextualAI/archangel_sft-dpo_pythia2-8b | 80.7 | 96.0 | 58.9 | 92.6 | 67.9 | 75.0 |
ContextualAI/archangel_sft-kto_pythia6-9b | 77.7 | 88.0 | 64.2 | 90.5 | 57.1 | 67.5 |
ContextualAI/archangel_sft-kto_pythia2-8b | 75.7 | 92.0 | 55.8 | 80.0 | 67.9 | 77.5 |
ContextualAI/archangel_sft-kto_pythia12-0b | 74.9 | 79.0 | 69.5 | 82.1 | 67.9 | 65.0 |
ContextualAI/archangel_sft-dpo_pythia6-9b | 74.9 | 89.0 | 58.9 | 87.4 | 57.1 | 60.0 |
Qwen/Qwen1.5-MoE-A2.7B-Chat | 72.9 | 77.0 | 82.1 | 58.9 | 60.7 | 82.5 |
ContextualAI/archangel_sft-dpo_llama13b | 71.2 | 80.0 | 62.1 | 69.5 | 78.6 | 70.0 |
ContextualAI/archangel_sft-dpo_llama30b | 69.3 | 78.0 | 61.1 | 74.7 | 67.9 | 55.0 |
ContextualAI/archangel_sft-kto_pythia1-4b | 68.4 | 79.0 | 52.6 | 75.8 | 57.1 | 70.0 |
ContextualAI/archangel_sft-dpo_pythia12-0b | 66.8 | 71.0 | 62.1 | 70.5 | 60.7 | 62.5 |
ContextualAI/archangel_sft-dpo_pythia1-4b | 64.0 | 73.0 | 49.5 | 75.8 | 35.7 | 67.5 |
Qwen/Qwen1.5-72B-Chat | 62.3 | 73.0 | 70.5 | 38.9 | 60.7 | 72.5 |
PKU-Alignment/beaver-7b-v1.0-cost | 61.7 | 43.0 | 67.4 | 74.7 | 57.1 | 67.5 |
stabilityai/stable-code-instruct-3b | 57.8 | 27.0 | 81.1 | 57.9 | 75.0 | 67.5 |
ContextualAI/archangel_sft-dpo_llama7b | 57.8 | 65.0 | 48.4 | 66.3 | 35.7 | 57.5 |
Qwen/Qwen1.5-14B-Chat | 57.3 | 64.0 | 70.5 | 32.6 | 60.7 | 65.0 |
Qwen/Qwen1.5-1.8B-Chat | 56.1 | 30.0 | 89.5 | 51.6 | 57.1 | 52.5 |
ContextualAI/archangel_sft-kto_llama7b | 55.9 | 60.0 | 51.6 | 57.9 | 50.0 | 55.0 |
Qwen/Qwen1.5-7B-Chat | 53.6 | 50.0 | 73.7 | 32.6 | 57.1 | 62.5 |
random | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 |
Qwen/Qwen1.5-4B-Chat | 38.8 | 8.0 | 71.6 | 35.8 | 53.6 | 35.0 |
Qwen/Qwen1.5-0.5B-Chat | 35.5 | 9.0 | 65.3 | 25.3 | 57.1 | 40.0 |
Reward Model | Avg. | MTBench Hard | LLMBar Natural | LLMBar Adversarial Neighbor | LLMBar Adversarial GPTInst | LLMBar Adversarial GPTOut | LLMBar Adversarial Manual |
---|---|---|---|---|---|---|---|
google/gemini-1.5-pro-0514 | 80.6 | 81.1 | 94.0 | 75.4 | 79.3 | 70.2 | 79.3 |
RLHFlow/ArmoRM-Llama3-8B-v0.1 | 76.8 | 86.5 | 93.0 | 67.9 | 77.2 | 66.0 | 69.6 |
openai/gpt-4-turbo-2024-04-09 | 75.4 | 86.5 | 97.0 | 53.0 | 80.4 | 74.5 | 76.1 |
openai/gpt-4-0125-preview | 74.3 | 83.8 | 91.0 | 56.7 | 70.7 | 87.2 | 76.1 |
openai/gpt-4o-2024-05-13 | 70.4 | 78.4 | 91.0 | 50.7 | 71.7 | 74.5 | 69.6 |
Qwen/Qwen1.5-14B-Chat | 70.2 | 67.6 | 71.0 | 83.6 | 62.0 | 46.8 | 71.7 |
Qwen/Qwen1.5-7B-Chat | 69.1 | 64.9 | 65.0 | 81.3 | 59.8 | 53.2 | 80.4 |
upstage/SOLAR-10.7B-Instruct-v1.0 | 68.6 | 59.5 | 75.0 | 80.6 | 57.6 | 51.1 | 67.4 |
Qwen/Qwen1.5-72B-Chat | 66.0 | 59.5 | 68.0 | 81.3 | 45.7 | 51.1 | 78.3 |
RLHFlow/pair-preference-model-LLaMA3-8B | 65.8 | 75.7 | 89.0 | 53.0 | 62.0 | 68.1 | 50.0 |
openbmb/Eurus-RM-7b | 65.6 | 78.4 | 93.0 | 53.0 | 55.4 | 63.8 | 54.3 |
sfairXC/FsfairX-LLaMA3-RM-v0.1 | 65.1 | 78.4 | 91.0 | 52.2 | 57.6 | 63.8 | 52.2 |
mistralai/Mixtral-8x7B-Instruct-v0.1 | 64.0 | 75.7 | 77.0 | 67.9 | 41.3 | 55.3 | 69.6 |
Qwen/Qwen1.5-MoE-A2.7B-Chat | 63.2 | 54.1 | 59.0 | 72.4 | 53.3 | 57.4 | 78.3 |
Qwen/Qwen1.5-0.5B-Chat | 62.9 | 45.9 | 58.0 | 75.4 | 65.2 | 48.9 | 60.9 |
HuggingFaceH4/zephyr-7b-beta | 62.7 | 83.8 | 83.0 | 70.9 | 27.2 | 51.1 | 60.9 |
Qwen/Qwen1.5-4B-Chat | 62.7 | 51.4 | 55.0 | 75.4 | 67.4 | 42.6 | 63.0 |
HuggingFaceH4/zephyr-7b-alpha | 62.5 | 83.8 | 76.0 | 66.4 | 35.9 | 63.8 | 56.5 |
0-hero/Matter-0.1-7B-boost-DPO-preview | 61.0 | 75.7 | 78.0 | 62.7 | 40.2 | 57.4 | 52.2 |
NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO | 60.5 | 64.9 | 72.0 | 63.4 | 39.1 | 66.0 | 60.9 |
NousResearch/Nous-Hermes-2-Mistral-7B-DPO | 60.5 | 75.7 | 80.0 | 55.2 | 45.7 | 55.3 | 56.5 |
allenai/tulu-2-dpo-70b | 60.5 | 64.9 | 72.0 | 70.9 | 34.8 | 51.1 | 63.0 |
Anthropic/claude-3-opus-20240229 | 60.3 | 78.4 | 90.0 | 32.8 | 55.4 | 76.6 | 54.3 |
Qwen/Qwen1.5-1.8B-Chat | 60.3 | 54.1 | 63.0 | 74.6 | 43.5 | 44.7 | 67.4 |
stabilityai/stablelm-zephyr-3b | 60.1 | 86.5 | 74.0 | 81.3 | 18.5 | 36.2 | 54.3 |
meta-llama/Meta-Llama-3-70B-Instruct | 58.9 | 81.1 | 83.0 | 32.5 | 57.6 | 71.3 | 55.4 |
stabilityai/stable-code-instruct-3b | 58.6 | 51.4 | 53.0 | 79.9 | 38.0 | 48.9 | 65.2 |
allenai/tulu-2-dpo-13b | 58.3 | 70.3 | 75.0 | 71.6 | 25.0 | 51.1 | 47.8 |
weqweasdas/RM-Mistral-7B | 58.1 | 78.4 | 88.0 | 44.0 | 43.5 | 61.7 | 43.5 |
hendrydong/Mistral-RM-for-RAFT-GSHF-v0 | 57.9 | 81.1 | 91.0 | 46.3 | 40.2 | 59.6 | 34.8 |
0-hero/Matter-0.1-7B-DPO-preview | 57.7 | 64.9 | 75.0 | 57.5 | 39.1 | 68.1 | 41.3 |
CohereForAI/c4ai-command-r-plus | 57.6 | 74.3 | 84.0 | 26.9 | 63.0 | 70.2 | 52.2 |
Nexusflow/Starling-RM-34B | 57.2 | 91.9 | 91.0 | 31.3 | 39.1 | 76.6 | 47.8 |
Anthropic/claude-3-sonnet-20240229 | 56.6 | 75.7 | 86.0 | 28.7 | 57.1 | 66.0 | 47.8 |
allenai/tulu-2-dpo-7b | 56.1 | 67.6 | 70.0 | 70.9 | 25.0 | 40.4 | 52.2 |
openbmb/UltraRM-13b | 55.5 | 75.7 | 82.0 | 42.5 | 43.5 | 51.1 | 47.8 |
HuggingFaceH4/starchat2-15b-v0.1 | 55.5 | 59.5 | 82.0 | 53.7 | 27.2 | 53.2 | 58.7 |
stabilityai/stablelm-2-12b-chat | 55.5 | 64.9 | 70.0 | 73.1 | 18.5 | 44.7 | 50.0 |
jondurbin/bagel-dpo-34b-v0.5 | 55.0 | 48.6 | 69.0 | 73.9 | 25.0 | 34.0 | 56.5 |
PoLL/gpt-3.5-turbo-0125_claude-3-sonnet-20240229… | 54.1 | 78.4 | 89.0 | 26.1 | 47.3 | 66.0 | 41.3 |
openbmb/Eurus-7b-kto | 53.7 | 64.9 | 73.0 | 60.4 | 27.2 | 44.7 | 45.7 |
llm-blender/PairRM-hf | 52.2 | 64.9 | 78.0 | 42.5 | 31.5 | 57.4 | 50.0 |
Anthropic/claude-3-haiku-20240307 | 52.0 | 67.6 | 77.0 | 33.6 | 46.7 | 61.7 | 39.1 |
allenai/OLMo-7B-Instruct | 50.7 | 64.9 | 67.0 | 58.2 | 25.0 | 40.4 | 43.5 |
Ray2333/reward-model-Mistral-7B-instruct-Unified… | 50.7 | 78.4 | 90.0 | 32.8 | 29.3 | 57.4 | 30.4 |
weqweasdas/RM-Gemma-7B-4096 | 50.2 | 70.3 | 83.0 | 42.5 | 22.8 | 55.3 | 34.8 |
random | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 |
RLHFlow/RewardModel-Mistral-7B-for-DPA-v1 | 49.8 | 51.4 | 74.0 | 39.6 | 33.7 | 61.7 | 45.7 |
weqweasdas/RM-Gemma-7B | 49.8 | 67.6 | 82.0 | 39.6 | 27.2 | 61.7 | 28.3 |
HuggingFaceH4/zephyr-7b-gemma-v0.1 | 49.6 | 83.8 | 74.0 | 44.0 | 17.4 | 53.2 | 45.7 |
openbmb/MiniCPM-2B-dpo-fp32 | 49.3 | 62.2 | 68.0 | 62.7 | 17.4 | 29.8 | 43.5 |
prometheus-eval/prometheus-7b-v2.0 | 49.1 | 67.6 | 77.5 | 26.9 | 36.4 | 54.3 | 57.6 |
OpenAssistant/oasst-rm-2.1-pythia-1.4b-epoch-2.5 | 48.7 | 73.0 | 67.0 | 33.6 | 42.4 | 53.2 | 41.3 |
prometheus-eval/prometheus-8x7b-v2.0 | 47.1 | 64.9 | 75.0 | 29.1 | 32.6 | 59.6 | 41.3 |
stabilityai/stablelm-2-zephyr-1_6b | 46.7 | 73.0 | 70.0 | 49.3 | 12.0 | 46.8 | 37.0 |
IDEA-CCNL/Ziya-LLaMA-7B-Reward | 46.1 | 62.2 | 77.0 | 36.6 | 32.6 | 40.4 | 26.1 |
berkeley-nest/Starling-RM-7B-alpha | 45.6 | 75.7 | 80.0 | 31.3 | 23.9 | 48.9 | 28.3 |
ContextualAI/archangel_sft-dpo_llama30b | 44.7 | 40.5 | 55.0 | 45.5 | 34.8 | 42.6 | 45.7 |
openai/gpt-3.5-turbo-0125 | 44.5 | 67.6 | 82.5 | 14.9 | 34.8 | 60.6 | 32.6 |
ContextualAI/archangel_sft-dpo_llama7b | 44.5 | 67.6 | 53.0 | 36.6 | 39.1 | 53.2 | 32.6 |
ContextualAI/archangel_sft-kto_llama7b | 43.6 | 51.4 | 53.0 | 41.0 | 40.2 | 42.6 | 32.6 |
ContextualAI/archangel_sft-dpo_llama13b | 43.0 | 54.1 | 52.0 | 38.8 | 43.5 | 38.3 | 30.4 |
PKU-Alignment/beaver-7b-v1.0-cost | 42.3 | 48.6 | 48.0 | 35.8 | 41.3 | 59.6 | 28.3 |
meta-llama/Meta-Llama-3-8B-Instruct | 41.6 | 70.3 | 69.0 | 22.0 | 21.7 | 61.7 | 34.8 |
weqweasdas/RM-Gemma-2B | 40.8 | 73.0 | 76.0 | 29.9 | 15.2 | 40.4 | 21.7 |
ContextualAI/archangel_sft-kto_llama30b | 40.6 | 54.1 | 57.0 | 39.6 | 19.6 | 42.6 | 37.0 |
mightbe/Better-PairRM | 39.3 | 70.3 | 71.0 | 27.6 | 14.1 | 42.6 | 26.1 |
ContextualAI/archangel_sft-kto_pythia1-4b | 37.9 | 56.8 | 52.0 | 23.1 | 33.7 | 51.1 | 30.4 |
ContextualAI/archangel_sft-kto_llama13b | 37.7 | 67.6 | 63.0 | 20.1 | 22.8 | 51.1 | 26.1 |
ContextualAI/archangel_sft-dpo_pythia1-4b | 37.3 | 45.9 | 50.0 | 24.6 | 34.8 | 51.1 | 30.4 |
weqweasdas/hh_rlhf_rm_open_llama_3b | 37.3 | 56.8 | 62.0 | 27.6 | 20.7 | 44.7 | 21.7 |
OpenAssistant/oasst-rm-2-pythia-6.9b-epoch-1 | 37.3 | 54.1 | 70.0 | 20.9 | 21.7 | 44.7 | 23.9 |
stanfordnlp/SteamSHP-flan-t5-xl | 36.8 | 51.4 | 65.0 | 21.6 | 27.2 | 36.2 | 28.3 |
ContextualAI/archangel_sft-dpo_pythia12-0b | 36.4 | 48.6 | 46.0 | 29.1 | 26.1 | 48.9 | 34.8 |
ContextualAI/archangel_sft-kto_pythia6-9b | 36.2 | 48.6 | 51.0 | 22.4 | 26.1 | 57.4 | 32.6 |
ContextualAI/archangel_sft-kto_pythia12-0b | 36.2 | 45.9 | 60.0 | 22.4 | 27.2 | 40.4 | 30.4 |
ContextualAI/archangel_sft-kto_pythia2-8b | 34.2 | 48.6 | 48.0 | 22.4 | 28.3 | 51.1 | 21.7 |
ContextualAI/archangel_sft-dpo_pythia6-9b | 34.2 | 35.1 | 49.0 | 21.6 | 26.1 | 51.1 | 37.0 |
ContextualAI/archangel_sft-dpo_pythia2-8b | 33.6 | 56.8 | 56.0 | 18.7 | 23.9 | 42.6 | 19.6 |
stanfordnlp/SteamSHP-flan-t5-large | 33.1 | 56.8 | 56.0 | 17.9 | 19.6 | 42.6 | 26.1 |
PKU-Alignment/beaver-7b-v1.0-reward | 28.7 | 56.8 | 53.0 | 10.4 | 18.5 | 36.2 | 19.6 |
OpenAssistant/reward-model-deberta-v3-large-v2 | 22.8 | 100.0 | 53.0 | 5.2 | 7.6 | 0.0 | 0.0 |
Reward Model | Avg. | Refusals Dangerous | Refusals Offensive | XSTest Should Refuse | XSTest Should Respond | Do Not Answer |
---|---|---|---|---|---|---|
RLHFlow/ArmoRM-Llama3-8B-v0.1 | 92.2 | 93.0 | 97.0 | 100.0 | 87.2 | 79.4 |
RLHFlow/pair-preference-model-LLaMA3-8B | 89.7 | 93.0 | 97.0 | 96.1 | 96.4 | 62.5 |
Anthropic/claude-3-opus-20240229 | 89.1 | 95.5 | 99.5 | 96.8 | 78.0 | 75.0 |
Nexusflow/Starling-RM-34B | 88.2 | 84.0 | 97.0 | 97.4 | 93.6 | 61.8 |
sfairXC/FsfairX-LLaMA3-RM-v0.1 | 87.8 | 89.0 | 96.0 | 97.4 | 89.2 | 61.8 |
google/gemini-1.5-pro-0514 | 87.5 | 85.0 | 91.0 | 93.8 | 96.8 | 64.7 |
openai/gpt-4-0125-preview | 87.2 | 83.0 | 97.0 | 93.5 | 96.4 | 61.0 |
weqweasdas/RM-Mistral-7B | 87.1 | 81.0 | 95.0 | 98.1 | 92.0 | 60.3 |
openai/gpt-4-turbo-2024-04-09 | 87.1 | 79.0 | 96.0 | 94.2 | 97.6 | 61.8 |
Ray2333/reward-model-Mistral-7B-instruct-Unified… | 86.7 | 82.0 | 99.0 | 97.4 | 86.4 | 61.8 |
openai/gpt-4o-2024-05-13 | 86.7 | 81.0 | 93.0 | 96.8 | 95.2 | 58.1 |
hendrydong/Mistral-RM-for-RAFT-GSHF-v0 | 86.3 | 74.0 | 96.0 | 98.1 | 88.4 | 64.0 |
berkeley-nest/Starling-RM-7B-alpha | 85.8 | 87.0 | 99.0 | 96.1 | 85.6 | 56.6 |
upstage/SOLAR-10.7B-Instruct-v1.0 | 85.5 | 65.0 | 76.0 | 94.2 | 91.6 | 84.6 |
allenai/tulu-2-dpo-70b | 83.9 | 82.0 | 89.0 | 85.7 | 90.4 | 70.6 |
Anthropic/claude-3-sonnet-20240229 | 83.7 | 95.0 | 96.5 | 92.5 | 77.2 | 57.0 |
prometheus-eval/prometheus-8x7b-v2.0 | 83.5 | 92.0 | 100.0 | 94.2 | 70.6 | 60.3 |
mightbe/Better-PairRM | 83.2 | 73.0 | 94.0 | 96.8 | 87.6 | 52.9 |
stabilityai/stablelm-2-12b-chat | 82.6 | 93.0 | 95.0 | 91.6 | 56.8 | 78.7 |
NousResearch/Nous-Hermes-2-Mistral-7B-DPO | 82.3 | 86.0 | 88.0 | 82.5 | 83.6 | 73.5 |
Anthropic/claude-3-haiku-20240307 | 82.1 | 93.0 | 92.5 | 95.5 | 75.6 | 49.3 |
PKU-Alignment/beaver-7b-v1.0-cost | 81.8 | 99.0 | 100.0 | 99.4 | 35.2 | 76.5 |
openbmb/Eurus-RM-7b | 81.2 | 70.0 | 72.0 | 93.5 | 94.8 | 58.1 |
NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO | 80.6 | 82.0 | 84.0 | 79.9 | 86.4 | 72.1 |
PoLL/gpt-3.5-turbo-0125_claude-3-sonnet-20240229… | 79.5 | 73.0 | 92.5 | 86.4 | 92.6 | 47.4 |
prometheus-eval/prometheus-7b-v2.0 | 78.7 | 88.0 | 90.0 | 83.4 | 71.2 | 63.2 |
allenai/tulu-2-dpo-13b | 78.2 | 65.0 | 80.0 | 81.2 | 91.2 | 66.2 |
Qwen/Qwen1.5-14B-Chat | 76.3 | 93.0 | 83.0 | 80.5 | 41.6 | 90.4 |
OpenAssistant/reward-model-deberta-v3-large-v2 | 75.1 | 82.0 | 99.0 | 76.6 | 83.2 | 40.4 |
Qwen/Qwen1.5-7B-Chat | 74.8 | 87.0 | 81.0 | 82.5 | 39.2 | 87.5 |
HuggingFaceH4/zephyr-7b-alpha | 74.3 | 48.0 | 58.0 | 79.2 | 96.8 | 71.3 |
mistralai/Mixtral-8x7B-Instruct-v0.1 | 73.4 | 82.0 | 86.0 | 76.6 | 70.0 | 55.9 |
allenai/tulu-2-dpo-7b | 73.3 | 70.0 | 76.0 | 73.4 | 88.8 | 55.9 |
RLHFlow/RewardModel-Mistral-7B-for-DPA-v1 | 72.5 | 90.0 | 97.0 | 75.3 | 61.6 | 48.5 |
Qwen/Qwen1.5-72B-Chat | 72.0 | 91.0 | 73.0 | 76.0 | 42.0 | 83.8 |
stabilityai/stablelm-zephyr-3b | 70.3 | 93.0 | 78.0 | 54.5 | 83.2 | 62.5 |
stabilityai/stable-code-instruct-3b | 69.2 | 91.0 | 93.0 | 70.8 | 42.4 | 63.2 |
meta-llama/Meta-Llama-3-70B-Instruct | 69.2 | 64.0 | 66.5 | 67.9 | 97.2 | 45.6 |
Qwen/Qwen1.5-MoE-A2.7B-Chat | 67.8 | 79.0 | 60.0 | 76.0 | 38.0 | 83.8 |
ContextualAI/archangel_sft-dpo_llama30b | 67.7 | 82.0 | 59.0 | 81.8 | 44.4 | 64.0 |
meta-llama/Meta-Llama-3-8B-Instruct | 67.5 | 72.0 | 75.0 | 69.8 | 73.6 | 47.4 |
0-hero/Matter-0.1-7B-boost-DPO-preview | 66.3 | 63.0 | 53.0 | 57.8 | 96.8 | 59.6 |
Qwen/Qwen1.5-0.5B-Chat | 66.1 | 76.0 | 91.0 | 87.0 | 16.8 | 58.1 |
HuggingFaceH4/starchat2-15b-v0.1 | 65.8 | 96.0 | 90.0 | 46.8 | 86.4 | 37.5 |
OpenAssistant/oasst-rm-2.1-pythia-1.4b-epoch-2.5 | 65.3 | 51.0 | 57.0 | 86.4 | 69.6 | 38.2 |
openai/gpt-3.5-turbo-0125 | 62.3 | 36.0 | 81.0 | 65.9 | 90.4 | 29.4 |
allenai/OLMo-7B-Instruct | 62.3 | 57.0 | 68.0 | 57.1 | 77.2 | 54.4 |
Qwen/Qwen1.5-4B-Chat | 61.8 | 63.0 | 75.0 | 76.6 | 29.2 | 61.0 |
jondurbin/bagel-dpo-34b-v0.5 | 61.5 | 40.0 | 48.0 | 59.1 | 81.6 | 69.1 |
HuggingFaceH4/zephyr-7b-beta | 61.0 | 30.0 | 32.0 | 61.7 | 97.6 | 62.5 |
IDEA-CCNL/Ziya-LLaMA-7B-Reward | 60.2 | 39.0 | 69.0 | 61.0 | 90.4 | 33.8 |
ContextualAI/archangel_sft-kto_llama30b | 60.2 | 48.0 | 77.0 | 65.6 | 68.0 | 38.2 |
stabilityai/stablelm-2-zephyr-1_6b | 58.3 | 48.0 | 65.0 | 59.1 | 74.4 | 41.2 |
0-hero/Matter-0.1-7B-DPO-preview | 58.0 | 59.0 | 47.0 | 44.2 | 88.8 | 55.9 |
OpenAssistant/oasst-rm-2-pythia-6.9b-epoch-1 | 57.7 | 11.0 | 76.0 | 84.4 | 59.2 | 27.9 |
openbmb/Eurus-7b-kto | 57.5 | 35.0 | 38.0 | 64.3 | 88.0 | 41.2 |
openbmb/UltraRM-13b | 56.0 | 30.0 | 28.0 | 64.9 | 94.4 | 36.0 |
CohereForAI/c4ai-command-r-plus | 55.6 | 38.0 | 43.0 | 59.1 | 92.0 | 30.1 |
Qwen/Qwen1.5-1.8B-Chat | 53.6 | 41.0 | 50.0 | 70.8 | 30.4 | 60.3 |
HuggingFaceH4/zephyr-7b-gemma-v0.1 | 52.9 | 25.0 | 61.0 | 51.3 | 92.4 | 25.7 |
weqweasdas/RM-Gemma-7B | 52.7 | 23.0 | 35.0 | 54.5 | 94.0 | 37.5 |
ContextualAI/archangel_sft-dpo_pythia12-0b | 52.7 | 47.0 | 70.0 | 48.7 | 61.2 | 41.9 |
openbmb/MiniCPM-2B-dpo-fp32 | 52.5 | 22.0 | 41.0 | 56.5 | 93.2 | 30.1 |
weqweasdas/RM-Gemma-7B-4096 | 51.2 | 19.0 | 40.0 | 53.9 | 91.6 | 32.4 |
ContextualAI/archangel_sft-dpo_llama13b | 50.9 | 51.0 | 82.0 | 32.5 | 75.6 | 33.8 |
random | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 |
ContextualAI/archangel_sft-kto_pythia6-9b | 48.4 | 30.0 | 56.0 | 42.9 | 83.2 | 27.2 |
ContextualAI/archangel_sft-dpo_llama7b | 46.9 | 34.0 | 38.0 | 41.6 | 80.8 | 34.6 |
ContextualAI/archangel_sft-dpo_pythia6-9b | 45.9 | 29.0 | 52.0 | 38.3 | 83.2 | 25.7 |
ContextualAI/archangel_sft-kto_pythia12-0b | 44.6 | 28.0 | 58.0 | 41.6 | 64.4 | 30.1 |
ContextualAI/archangel_sft-kto_pythia1-4b | 44.5 | 39.0 | 53.0 | 27.3 | 89.6 | 22.8 |
ContextualAI/archangel_sft-dpo_pythia1-4b | 44.2 | 32.0 | 53.0 | 35.1 | 82.8 | 19.9 |
weqweasdas/RM-Gemma-2B | 44.0 | 7.0 | 23.0 | 46.8 | 92.0 | 27.2 |
ContextualAI/archangel_sft-kto_pythia2-8b | 43.1 | 26.0 | 40.0 | 40.3 | 73.6 | 28.7 |
ContextualAI/archangel_sft-dpo_pythia2-8b | 40.5 | 20.0 | 45.0 | 37.7 | 70.0 | 24.3 |
llm-blender/PairRM-hf | 40.1 | 9.0 | 1.0 | 36.4 | 95.2 | 36.0 |
ContextualAI/archangel_sft-kto_llama13b | 39.1 | 21.0 | 38.0 | 28.6 | 85.6 | 19.9 |
ContextualAI/archangel_sft-kto_llama7b | 37.8 | 24.0 | 22.0 | 26.6 | 87.6 | 23.5 |
weqweasdas/hh_rlhf_rm_open_llama_3b | 35.1 | 6.0 | 32.0 | 29.2 | 78.8 | 19.9 |
PKU-Alignment/beaver-7b-v1.0-reward | 29.4 | 3.0 | 28.0 | 15.6 | 78.8 | 19.1 |
stanfordnlp/SteamSHP-flan-t5-xl | 29.0 | 3.0 | 3.0 | 20.1 | 88.0 | 16.9 |
stanfordnlp/SteamSHP-flan-t5-large | 28.1 | 8.0 | 2.0 | 17.5 | 89.2 | 12.5 |
Reward Model | Avg. | PRM Math | HumanEvalPack C++ | HumanEvalPack Go | HumanEvalPack Java | HumanEvalPack JS | HumanEvalPack Python | HumanEvalPack Rust |
---|---|---|---|---|---|---|---|---|
RLHFlow/ArmoRM-Llama3-8B-v0.1 | 97.3 | 98.7 | 95.1 | 97.0 | 98.2 | 97.6 | 96.3 | 92.1 |
RLHFlow/pair-preference-model-LLaMA3-8B | 94.7 | 94.9 | 92.7 | 95.7 | 97.0 | 95.1 | 97.0 | 90.2 |
google/gemini-1.5-pro-0514 | 92.0 | 88.5 | 94.8 | 96.3 | 95.4 | 95.1 | 97.6 | 93.3 |
Qwen/Qwen1.5-7B-Chat | 90.4 | 93.7 | 84.1 | 86.0 | 93.9 | 84.1 | 90.2 | 84.1 |
Qwen/Qwen1.5-14B-Chat | 89.6 | 91.7 | 82.9 | 88.4 | 92.1 | 90.9 | 89.0 | 81.7 |
stabilityai/stablelm-2-12b-chat | 89.4 | 91.5 | 89.6 | 84.1 | 90.9 | 89.0 | 89.6 | 81.1 |
jondurbin/bagel-dpo-34b-v0.5 | 88.9 | 94.9 | 78.7 | 82.9 | 90.2 | 82.3 | 84.8 | 78.7 |
0-hero/Matter-0.1-7B-DPO-preview | 88.5 | 88.4 | 87.8 | 91.5 | 89.6 | 90.2 | 87.2 | 86.0 |
Nexusflow/Starling-RM-34B | 88.5 | 85.2 | 89.6 | 92.7 | 94.5 | 95.1 | 91.5 | 86.6 |
openai/gpt-4-0125-preview | 86.9 | 76.3 | 97.3 | 97.9 | 97.9 | 97.6 | 98.2 | 96.6 |
sfairXC/FsfairX-LLaMA3-RM-v0.1 | 86.4 | 77.9 | 92.7 | 95.7 | 97.0 | 97.6 | 95.7 | 91.5 |
openbmb/Eurus-RM-7b | 86.3 | 79.9 | 92.7 | 94.5 | 93.3 | 93.9 | 93.3 | 89.0 |
Qwen/Qwen1.5-72B-Chat | 85.5 | 82.8 | 87.2 | 87.2 | 93.9 | 89.6 | 88.4 | 83.5 |
openai/gpt-4o-2024-05-13 | 84.9 | 72.5 | 97.6 | 97.6 | 98.2 | 98.2 | 98.2 | 93.9 |
0-hero/Matter-0.1-7B-boost-DPO-preview | 83.9 | 80.1 | 89.6 | 87.2 | 93.9 | 85.4 | 86.0 | 84.8 |
openai/gpt-4-turbo-2024-04-09 | 82.7 | 67.3 | 97.0 | 99.1 | 97.9 | 99.1 | 99.4 | 96.0 |
openbmb/MiniCPM-2B-dpo-fp32 | 82.3 | 88.1 | 73.8 | 82.9 | 78.0 | 73.8 | 80.5 | 70.1 |
HuggingFaceH4/starchat2-15b-v0.1 | 81.6 | 66.2 | 96.3 | 96.3 | 98.8 | 98.2 | 98.2 | 93.9 |
mistralai/Mixtral-8x7B-Instruct-v0.1 | 78.7 | 63.5 | 95.7 | 93.3 | 95.1 | 95.7 | 92.1 | 91.5 |
Anthropic/claude-3-opus-20240229 | 78.7 | 61.1 | 94.5 | 95.7 | 98.2 | 96.6 | 97.0 | 95.7 |
meta-llama/Meta-Llama-3-70B-Instruct | 78.5 | 66.2 | 91.8 | 89.9 | 91.2 | 92.1 | 91.5 | 88.7 |
Qwen/Qwen1.5-1.8B-Chat | 77.9 | 86.4 | 62.2 | 68.3 | 76.8 | 76.8 | 68.3 | 64.6 |
HuggingFaceH4/zephyr-7b-beta | 77.9 | 62.2 | 90.2 | 94.5 | 94.5 | 93.9 | 93.9 | 94.5 |
OpenAssistant/oasst-rm-2.1-pythia-1.4b-epoch-2.5 | 77.5 | 95.1 | 56.1 | 61.6 | 68.3 | 65.9 | 59.1 | 48.8 |
Qwen/Qwen1.5-MoE-A2.7B-Chat | 77.4 | 74.7 | 71.3 | 84.1 | 85.4 | 81.1 | 83.5 | 75.0 |
prometheus-eval/prometheus-8x7b-v2.0 | 77.4 | 69.7 | 86.6 | 87.5 | 84.5 | 85.4 | 85.7 | 81.1 |
weqweasdas/RM-Mistral-7B | 77.0 | 60.2 | 93.9 | 96.3 | 92.1 | 95.1 | 90.9 | 94.5 |
prometheus-eval/prometheus-7b-v2.0 | 76.5 | 86.2 | 67.1 | 62.2 | 65.9 | 68.3 | 68.6 | 68.3 |
weqweasdas/RM-Gemma-2B | 76.4 | 73.4 | 82.3 | 75.6 | 82.9 | 81.1 | 75.6 | 78.7 |
stabilityai/stablelm-zephyr-3b | 75.7 | 67.1 | 80.5 | 86.6 | 93.3 | 82.3 | 83.5 | 79.9 |
stabilityai/stable-code-instruct-3b | 75.3 | 60.6 | 90.9 | 91.5 | 89.6 | 88.4 | 92.7 | 86.6 |
HuggingFaceH4/zephyr-7b-alpha | 75.1 | 58.6 | 93.3 | 92.7 | 91.5 | 93.9 | 90.9 | 87.8 |
weqweasdas/RM-Gemma-7B-4096 | 75.1 | 57.9 | 89.6 | 92.7 | 96.3 | 92.1 | 93.3 | 89.6 |
openbmb/Eurus-7b-kto | 74.7 | 59.5 | 86.6 | 91.5 | 91.5 | 88.4 | 91.5 | 89.6 |
HuggingFaceH4/zephyr-7b-gemma-v0.1 | 74.6 | 68.7 | 79.3 | 81.1 | 81.1 | 78.0 | 86.0 | 78.0 |
hendrydong/Mistral-RM-for-RAFT-GSHF-v0 | 74.3 | 55.5 | 93.9 | 92.7 | 95.1 | 92.7 | 92.1 | 92.7 |
allenai/tulu-2-dpo-70b | 74.1 | 56.4 | 92.1 | 91.5 | 93.9 | 93.9 | 93.3 | 86.0 |
Ray2333/reward-model-Mistral-7B-instruct-Unified… | 73.9 | 55.7 | 91.5 | 94.5 | 92.1 | 92.7 | 90.2 | 91.5 |
NousResearch/Nous-Hermes-2-Mistral-7B-DPO | 73.8 | 72.7 | 79.9 | 79.3 | 76.2 | 75.0 | 68.9 | 69.5 |
weqweasdas/RM-Gemma-7B | 73.6 | 53.2 | 96.3 | 94.5 | 97.0 | 92.7 | 92.7 | 90.9 |
PoLL/gpt-3.5-turbo-0125_claude-3-sonnet-20240229… | 73.5 | 55.3 | 94.8 | 90.5 | 92.4 | 91.2 | 89.3 | 91.8 |
allenai/tulu-2-dpo-13b | 73.2 | 60.2 | 86.6 | 85.4 | 90.9 | 85.4 | 86.0 | 83.5 |
upstage/SOLAR-10.7B-Instruct-v1.0 | 72.5 | 52.3 | 92.1 | 90.2 | 93.9 | 95.7 | 92.1 | 92.1 |
allenai/tulu-2-dpo-7b | 71.8 | 63.5 | 78.7 | 79.9 | 84.1 | 81.1 | 82.9 | 73.2 |
allenai/OLMo-7B-Instruct | 71.7 | 65.1 | 76.2 | 74.4 | 81.1 | 82.9 | 75.6 | 79.3 |
ContextualAI/archangel_sft-kto_llama13b | 70.8 | 81.9 | 54.9 | 53.7 | 61.6 | 62.2 | 69.5 | 56.1 |
Anthropic/claude-3-haiku-20240307 | 70.6 | 57.7 | 84.8 | 82.9 | 84.1 | 86.3 | 81.1 | 81.7 |
CohereForAI/c4ai-command-r-plus | 70.4 | 55.6 | 86.6 | 83.8 | 83.5 | 88.7 | 85.7 | 82.9 |
ContextualAI/archangel_sft-kto_llama7b | 69.4 | 79.0 | 57.9 | 63.4 | 59.1 | 59.8 | 59.1 | 59.8 |
Anthropic/claude-3-sonnet-20240229 | 69.1 | 49.8 | 92.1 | 86.0 | 88.1 | 90.9 | 86.9 | 86.3 |
stabilityai/stablelm-2-zephyr-1_6b | 67.8 | 55.7 | 78.7 | 79.3 | 81.7 | 82.3 | 82.3 | 75.6 |
Qwen/Qwen1.5-4B-Chat | 66.9 | 77.2 | 47.6 | 51.8 | 62.2 | 67.7 | 46.3 | 64.0 |
meta-llama/Meta-Llama-3-8B-Instruct | 64.8 | 54.1 | 77.7 | 77.1 | 73.5 | 75.6 | 75.3 | 73.8 |
ContextualAI/archangel_sft-kto_pythia1-4b | 64.5 | 77.6 | 49.4 | 53.0 | 49.4 | 53.7 | 51.2 | 51.2 |
openbmb/UltraRM-13b | 62.4 | 45.4 | 78.7 | 79.3 | 80.5 | 78.0 | 78.7 | 81.7 |
ContextualAI/archangel_sft-kto_pythia2-8b | 62.2 | 75.8 | 43.3 | 48.2 | 45.1 | 52.4 | 49.4 | 52.4 |
NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO | 61.3 | 36.2 | 84.1 | 87.2 | 93.9 | 84.1 | 89.6 | 78.7 |
Qwen/Qwen1.5-0.5B-Chat | 59.8 | 70.7 | 53.0 | 47.6 | 49.4 | 46.3 | 47.6 | 50.0 |
RLHFlow/RewardModel-Mistral-7B-for-DPA-v1 | 59.7 | 43.4 | 72.6 | 74.4 | 79.9 | 77.4 | 78.7 | 73.2 |
openai/gpt-3.5-turbo-0125 | 59.1 | 40.6 | 83.2 | 72.3 | 75.6 | 77.4 | 79.9 | 77.4 |
OpenAssistant/oasst-rm-2-pythia-6.9b-epoch-1 | 58.6 | 44.7 | 72.0 | 72.0 | 72.0 | 72.6 | 73.2 | 72.6 |
berkeley-nest/Starling-RM-7B-alpha | 58.0 | 34.9 | 75.0 | 84.8 | 84.1 | 84.1 | 78.7 | 79.9 |
IDEA-CCNL/Ziya-LLaMA-7B-Reward | 57.7 | 38.3 | 76.2 | 81.1 | 76.2 | 73.8 | 79.3 | 76.8 |
ContextualAI/archangel_sft-dpo_pythia1-4b | 56.7 | 63.5 | 47.0 | 48.8 | 48.2 | 53.0 | 49.4 | 53.0 |
ContextualAI/archangel_sft-dpo_llama7b | 56.6 | 53.9 | 61.0 | 61.6 | 58.5 | 58.5 | 65.2 | 50.6 |
PKU-Alignment/beaver-7b-v1.0-cost | 54.8 | 46.5 | 67.1 | 61.0 | 67.7 | 56.7 | 64.6 | 61.6 |
ContextualAI/archangel_sft-kto_pythia6-9b | 54.2 | 57.5 | 46.3 | 50.6 | 50.0 | 55.5 | 52.4 | 50.0 |
ContextualAI/archangel_sft-dpo_pythia2-8b | 51.3 | 50.6 | 50.0 | 52.4 | 51.8 | 53.7 | 50.0 | 54.9 |
ContextualAI/archangel_sft-kto_llama30b | 50.8 | 40.9 | 54.9 | 61.6 | 61.0 | 57.3 | 72.0 | 56.7 |
random | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 |
mightbe/Better-PairRM | 49.8 | 29.5 | 64.6 | 72.0 | 69.5 | 72.0 | 71.3 | 71.3 |
llm-blender/PairRM-hf | 49.0 | 33.3 | 59.8 | 68.3 | 66.5 | 61.0 | 65.2 | 67.1 |
ContextualAI/archangel_sft-dpo_pythia6-9b | 48.5 | 48.8 | 46.3 | 45.7 | 50.0 | 46.3 | 51.8 | 48.8 |
ContextualAI/archangel_sft-dpo_llama30b | 47.4 | 33.1 | 56.7 | 64.6 | 64.0 | 57.9 | 65.2 | 62.2 |
ContextualAI/archangel_sft-dpo_llama13b | 44.0 | 30.2 | 57.9 | 55.5 | 62.8 | 59.8 | 57.3 | 53.7 |
ContextualAI/archangel_sft-dpo_pythia12-0b | 41.4 | 38.5 | 49.4 | 40.9 | 47.6 | 42.7 | 42.1 | 43.3 |
ContextualAI/archangel_sft-kto_pythia12-0b | 41.3 | 38.0 | 43.9 | 47.6 | 46.3 | 39.0 | 51.8 | 38.4 |
stanfordnlp/SteamSHP-flan-t5-xl | 38.4 | 23.3 | 50.0 | 57.3 | 52.4 | 52.4 | 55.5 | 53.7 |
stanfordnlp/SteamSHP-flan-t5-large | 35.6 | 22.4 | 54.9 | 43.9 | 47.0 | 50.6 | 45.1 | 51.8 |
PKU-Alignment/beaver-7b-v1.0-reward | 34.6 | 8.7 | 56.7 | 61.0 | 60.4 | 54.3 | 63.4 | 67.1 |
OpenAssistant/reward-model-deberta-v3-large-v2 | 34.0 | 4.3 | 0.6 | 82.3 | 50.6 | 90.9 | 100.0 | 57.9 |
weqweasdas/hh_rlhf_rm_open_llama_3b | 32.8 | 10.7 | 57.3 | 50.0 | 57.3 | 56.1 | 50.6 | 57.9 |
Reward Model | Avg. | Anthropic Harmless | Anthropic Helpful | Anthropic HHH | MT Bench GPT-4 | MT Bench Human | SHP | Summarize |
---|---|---|---|---|---|---|---|---|
Ray2333/reward-model-Mistral-7B-instruct-Unified… | 73.9 | 72.3 | 70.3 | 89.6 | 79.4 | 68.6 | 64.3 | 73.2 |
mightbe/Better-PairRM | 72.1 | 69.2 | 68.5 | 83.7 | 77.8 | 67.8 | 64.2 | 73.2 |
Nexusflow/Starling-RM-34B | 71.6 | 59.9 | 66.4 | 87.3 | 83.8 | 71.9 | 67.1 | 64.6 |
openai/gpt-4-turbo-2024-04-09 | 71.5 | 52.4 | 68.3 | 91.4 | 82.1 | 71.6 | 66.8 | 68.1 |
sfairXC/FsfairX-LLaMA3-RM-v0.1 | 71.4 | 48.4 | 71.7 | 86.0 | 80.8 | 71.2 | 79.7 | 62.3 |
openai/gpt-4o-2024-05-13 | 71.4 | 52.5 | 68.1 | 89.1 | 84.7 | 72.0 | 66.5 | 66.7 |
RLHFlow/pair-preference-model-LLaMA3-8B | 71.3 | 52.7 | 71.2 | 89.6 | 78.7 | 69.3 | 77.9 | 59.6 |
weqweasdas/RM-Mistral-7B | 71.1 | 50.9 | 72.0 | 87.8 | 77.4 | 68.0 | 80.9 | 60.5 |
RLHFlow/ArmoRM-Llama3-8B-v0.1 | 71.0 | 58.8 | 69.7 | 87.8 | 73.2 | 67.8 | 74.7 | 65.0 |
hendrydong/Mistral-RM-for-RAFT-GSHF-v0 | 71.0 | 49.6 | 72.0 | 86.4 | 77.8 | 68.9 | 80.8 | 61.2 |
openbmb/Eurus-RM-7b | 70.4 | 53.9 | 66.7 | 88.2 | 82.2 | 69.6 | 64.7 | 67.2 |
openai/gpt-4-0125-preview | 70.2 | 54.1 | 60.1 | 89.6 | 81.8 | 72.0 | 67.1 | 66.5 |
llms-as-a-jury/gpt-3.5-turbo-0125_claude-3-sonne… | 69.6 | 49.5 | 66.4 | 87.3 | 80.2 | 70.4 | 67.3 | 66.2 |
meta-llama/Meta-Llama-3-70B-Instruct | 69.4 | 47.2 | 66.7 | 84.2 | 84.7 | 72.5 | 66.4 | 64.2 |
berkeley-nest/Starling-RM-7B-alpha | 68.8 | 60.3 | 63.6 | 81.9 | 81.3 | 68.3 | 61.6 | 64.6 |
openbmb/UltraRM-13b | 67.9 | 44.2 | 66.9 | 79.6 | 72.9 | 66.4 | 75.8 | 69.4 |
OpenAssistant/oasst-rm-2-pythia-6.9b-epoch-1 | 67.3 | 59.8 | 63.7 | 70.1 | 73.2 | 66.2 | 74.8 | 63.5 |
OpenAssistant/oasst-rm-2.1-pythia-1.4b-epoch-2.5 | 67.1 | 64.5 | 62.1 | 69.7 | 76.2 | 67.9 | 68.2 | 61.3 |
CohereForAI/c4ai-command-r-plus | 66.8 | 41.5 | 65.7 | 83.5 | 79.7 | 69.7 | 66.4 | 61.4 |
weqweasdas/RM-Gemma-7B-4096 | 66.6 | 38.2 | 71.6 | 78.7 | 77.5 | 69.2 | 79.5 | 51.2 |
llm-blender/PairRM-hf | 66.4 | 49.2 | 64.8 | 83.7 | 72.4 | 65.0 | 58.7 | 71.2 |
weqweasdas/RM-Gemma-7B | 66.1 | 34.9 | 71.2 | 79.2 | 75.8 | 69.2 | 79.0 | 53.4 |
Anthropic/claude-3-sonnet-20240229 | 65.9 | 52.6 | 59.3 | 89.6 | 63.4 | 67.0 | 67.1 | 62.6 |
openai/gpt-3.5-turbo-0125 | 65.2 | 46.2 | 59.0 | 76.0 | 79.3 | 69.1 | 67.7 | 59.3 |
IDEA-CCNL/Ziya-LLaMA-7B-Reward | 64.2 | 47.3 | 60.4 | 76.9 | 75.4 | 68.1 | 61.1 | 60.0 |
weqweasdas/RM-Gemma-2B | 63.9 | 35.1 | 69.0 | 72.9 | 76.7 | 69.7 | 76.7 | 47.6 |
stanfordnlp/SteamSHP-flan-t5-xl | 62.8 | 38.4 | 63.3 | 63.8 | 76.8 | 64.9 | 79.6 | 53.2 |
meta-llama/Meta-Llama-3-8B-Instruct | 62.5 | 48.5 | 58.1 | 71.7 | 77.6 | 68.0 | 59.9 | 53.6 |
weqweasdas/hh_rlhf_rm_open_llama_3b | 62.1 | 41.8 | 75.7 | 65.6 | 68.5 | 61.8 | 63.1 | 58.1 |
stanfordnlp/SteamSHP-flan-t5-large | 61.5 | 37.9 | 62.9 | 55.7 | 76.1 | 65.8 | 79.1 | 53.3 |
Anthropic/claude-3-haiku-20240307 | 61.5 | 51.0 | 57.8 | 82.4 | 50.1 | 63.8 | 64.1 | 61.1 |
OpenAssistant/reward-model-deberta-v3-large-v2 | 60.8 | 56.4 | 70.9 | 52.0 | 72.1 | 63.7 | 33.8 | 76.7 |
RLHFlow/RewardModel-Mistral-7B-for-DPA-v1 | 60.3 | 48.2 | 56.3 | 67.9 | 72.0 | 59.0 | 60.8 | 57.7 |
PKU-Alignment/beaver-7b-v1.0-reward | 59.7 | 38.0 | 57.2 | 59.7 | 74.0 | 66.4 | 67.8 | 55.0 |
ContextualAI/archangel_sft-kto_llama30b | 59.6 | 55.0 | 55.6 | 61.1 | 64.8 | 62.6 | 68.4 | 49.4 |
openbmb/Eurus-7b-kto | 59.1 | 54.3 | 51.1 | 66.1 | 79.4 | 69.2 | 40.8 | 52.4 |
NousResearch/Nous-Hermes-2-Mistral-7B-DPO | 59.0 | 53.0 | 51.9 | 65.6 | 70.8 | 67.2 | 49.5 | 55.0 |
HuggingFaceH4/starchat2-15b-v0.1 | 58.3 | 45.6 | 58.6 | 69.7 | 73.6 | 67.6 | 42.8 | 49.8 |
0-hero/Matter-0.1-7B-boost-DPO-preview | 58.1 | 49.0 | 52.8 | 67.9 | 69.9 | 65.2 | 49.5 | 52.5 |
ContextualAI/archangel_sft-kto_pythia6-9b | 57.8 | 46.9 | 54.8 | 58.8 | 67.7 | 61.5 | 63.8 | 51.5 |
ContextualAI/archangel_sft-kto_llama13b | 57.8 | 46.8 | 53.9 | 56.1 | 65.9 | 61.8 | 67.2 | 53.2 |
HuggingFaceH4/zephyr-7b-alpha | 57.4 | 55.3 | 51.7 | 62.4 | 68.0 | 64.1 | 43.5 | 56.4 |
ContextualAI/archangel_sft-kto_pythia2-8b | 56.9 | 46.1 | 54.8 | 53.4 | 69.0 | 60.5 | 64.3 | 50.3 |
ContextualAI/archangel_sft-dpo_llama30b | 56.8 | 56.3 | 52.6 | 60.2 | 55.8 | 57.4 | 67.1 | 48.3 |
allenai/tulu-2-dpo-70b | 56.6 | 52.4 | 51.6 | 58.4 | 68.7 | 63.9 | 45.4 | 55.8 |
ContextualAI/archangel_sft-kto_pythia1-4b | 56.4 | 46.0 | 56.0 | 51.6 | 68.3 | 58.9 | 65.7 | 48.5 |
ContextualAI/archangel_sft-dpo_pythia2-8b | 56.4 | 45.4 | 54.3 | 53.4 | 69.0 | 60.5 | 62.6 | 49.8 |
ContextualAI/archangel_sft-dpo_pythia6-9b | 56.3 | 45.7 | 54.5 | 54.3 | 68.0 | 59.7 | 60.8 | 50.9 |
0-hero/Matter-0.1-7B-DPO-preview | 56.0 | 44.5 | 54.8 | 53.4 | 68.1 | 65.5 | 52.9 | 52.8 |
ContextualAI/archangel_sft-dpo_llama13b | 55.8 | 52.4 | 53.4 | 60.2 | 56.3 | 56.0 | 62.7 | 50.0 |
HuggingFaceH4/zephyr-7b-beta | 55.8 | 55.3 | 50.9 | 59.7 | 62.7 | 63.9 | 43.5 | 54.5 |
ContextualAI/archangel_sft-kto_pythia12-0b | 55.6 | 46.1 | 53.7 | 54.8 | 64.2 | 58.6 | 60.4 | 51.2 |
ContextualAI/archangel_sft-dpo_pythia1-4b | 55.4 | 47.0 | 53.8 | 50.7 | 65.2 | 58.4 | 63.9 | 48.7 |
PKU-Alignment/beaver-7b-v1.0-cost | 55.1 | 67.8 | 54.6 | 72.9 | 43.3 | 46.7 | 50.1 | 50.5 |
ContextualAI/archangel_sft-kto_llama7b | 54.9 | 46.0 | 54.8 | 50.7 | 57.6 | 57.8 | 66.7 | 50.8 |
ContextualAI/archangel_sft-dpo_llama7b | 54.9 | 47.0 | 54.3 | 47.5 | 58.4 | 57.0 | 67.9 | 52.0 |
openbmb/MiniCPM-2B-dpo-fp32 | 54.0 | 50.0 | 52.9 | 53.4 | 66.5 | 63.5 | 41.6 | 50.4 |
HuggingFaceH4/zephyr-7b-gemma-v0.1 | 53.9 | 50.9 | 53.0 | 53.8 | 58.0 | 61.3 | 45.0 | 55.0 |
stabilityai/stablelm-2-zephyr-1_6b | 53.9 | 53.1 | 51.9 | 52.0 | 64.8 | 64.4 | 36.2 | 54.5 |
stabilityai/stablelm-2-12b-chat | 53.7 | 57.8 | 48.4 | 51.6 | 61.9 | 62.6 | 37.4 | 56.2 |
ContextualAI/archangel_sft-dpo_pythia12-0b | 53.6 | 45.8 | 50.9 | 52.5 | 60.8 | 56.6 | 58.2 | 50.5 |
mistralai/Mixtral-8x7B-Instruct-v0.1 | 53.6 | 51.9 | 52.8 | 54.3 | 59.6 | 62.3 | 39.4 | 54.8 |
allenai/OLMo-7B-Instruct | 53.5 | 48.1 | 54.1 | 52.0 | 60.0 | 59.8 | 46.2 | 54.6 |
allenai/tulu-2-dpo-13b | 53.2 | 51.9 | 50.4 | 48.4 | 60.9 | 61.9 | 45.4 | 53.6 |
allenai/tulu-2-dpo-7b | 52.9 | 53.0 | 50.5 | 44.3 | 63.3 | 62.6 | 45.6 | 50.5 |
stabilityai/stablelm-zephyr-3b | 52.7 | 53.8 | 51.7 | 58.8 | 53.1 | 59.1 | 34.8 | 57.7 |
upstage/SOLAR-10.7B-Instruct-v1.0 | 52.3 | 56.0 | 50.2 | 55.7 | 55.5 | 56.6 | 36.3 | 55.8 |
random | 50.0 | 50.0 | 50.0 | - | - | - | 50.0 | 50.0 |
NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO | 49.9 | 45.9 | 49.6 | 52.9 | 47.7 | 45.4 | 61.0 | 47.1 |
jondurbin/bagel-dpo-34b-v0.5 | 49.6 | 52.6 | 47.9 | 38.5 | 57.0 | 58.1 | 43.8 | 49.3 |
Qwen/Qwen1.5-MoE-A2.7B-Chat | 46.3 | 51.6 | 48.4 | 43.0 | 40.5 | 50.2 | 36.9 | 53.1 |
Qwen/Qwen1.5-72B-Chat | 45.3 | 55.1 | 44.5 | 34.4 | 44.1 | 48.9 | 38.9 | 51.3 |
Qwen/Qwen1.5-7B-Chat | 44.6 | 56.3 | 46.2 | 40.7 | 39.2 | 45.3 | 34.8 | 49.8 |
Qwen/Qwen1.5-14B-Chat | 44.6 | 56.7 | 45.3 | 36.7 | 42.9 | 47.5 | 34.0 | 48.9 |
stabilityai/stable-code-instruct-3b | 44.5 | 40.7 | 47.7 | 49.3 | 42.4 | 47.9 | 34.5 | 48.7 |
Qwen/Qwen1.5-1.8B-Chat | 43.6 | 53.6 | 48.2 | 40.7 | 28.6 | 45.1 | 36.2 | 53.0 |
Qwen/Qwen1.5-4B-Chat | 43.4 | 54.4 | 50.4 | 43.0 | 26.6 | 44.3 | 33.7 | 51.8 |
Qwen/Qwen1.5-0.5B-Chat | 43.2 | 55.3 | 47.6 | 52.9 | 23.9 | 37.8 | 33.3 | 51.3 |
E.1 Subset Distributions
The full distributions of accuracies for models tested on RewardBench are shown in Fig. 2 for the core dataset and in Fig. 3 for existing preference sets. The subsets created for RewardBench show substantially higher variance and range than the existing test sets used to evaluate reward models. A wider range of evaluation signal makes it easier for the benchmark to differentiate between two similar models. The most important subsets of RewardBench are those where the maximum performance is below 100%, which indicates room for future work.
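As a rough illustration of what "range of evaluation signal" means, the sketch below computes the range and standard deviation of accuracies across models for one RewardBench subset and one existing preference set. The scores are copied from the tables above for five models; the choice of these five models and the helper name `spread` are assumptions made only for this example, so the numbers are illustrative and do not summarize the full leaderboard.

```python
from statistics import pstdev

# Accuracies (%) copied from the tables above for five models, in this order:
# gpt-4-turbo-2024-04-09, ArmoRM-Llama3-8B-v0.1, Starling-RM-34B,
# zephyr-7b-beta, Qwen1.5-72B-Chat.
llmbar_neighbor = [53.0, 67.9, 31.3, 70.9, 81.3]    # RewardBench subset
anthropic_helpful = [68.3, 69.7, 66.4, 50.9, 44.5]  # existing test set


def spread(scores):
    """Return (range, population standard deviation) of a list of accuracies."""
    return max(scores) - min(scores), pstdev(scores)


for name, scores in [("LLMBar Neighbor", llmbar_neighbor),
                     ("Anthropic Helpful", anthropic_helpful)]:
    score_range, std = spread(scores)
    print(f"{name}: range={score_range:.1f}, std={std:.1f}")
```

A subset with a larger spread separates models more clearly, which is the property the paragraph above attributes to the RewardBench subsets.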
E.2 Model Reward Distributions
An interesting detail that is not yet easy to apply to training better RLHF models is the shape of the score distribution that each reward model produces on the same input dataset. For all the datasets tested in RewardBench, we record the output scores for every prompt. The outputs of models trained with DPO are all large negative numbers because they are sums of log-probabilities across the generation. The outputs of reward models trained as a simple classifier should in principle be close to a unit Gaussian, given the desirable properties of a reward function for RL algorithms, but this is normally not the case. The distribution of the classifier models is shown for the core evaluation set in Fig. 7 and over the previous test sets in Fig. 6. The distributions for models trained with DPO are shown in Fig. 4 for classifiers and in Fig. 5 for models trained with DPO.
The custom classifiers, such as PairRM and SteamSHP, are omitted because their intended use is to take two responses in at once, so a score does not apply in the same way.
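The two score regimes described in this section can be sketched numerically. The example below is a minimal illustration, not the RewardBench code: `sequence_logprob` shows why a sum of per-token log-probabilities is a large negative number for long generations, and `standardize` shows what rescaling classifier scores to zero mean and unit variance would look like. Both function names and all input values are hypothetical.

```python
import math
from statistics import mean, pstdev


def sequence_logprob(token_probs):
    """Sum of per-token log-probabilities for one completion."""
    return sum(math.log(p) for p in token_probs)


def standardize(scores):
    """Shift and scale scores to zero mean and unit standard deviation."""
    mu, sigma = mean(scores), pstdev(scores)
    return [(s - mu) / sigma for s in scores]


# Hypothetical per-token probabilities for a 200-token completion.
token_probs = [0.6] * 200
print(sequence_logprob(token_probs))  # large negative: 200 * log(0.6)

# Hypothetical raw classifier scores for five prompts.
raw_scores = [2.1, 3.4, -0.5, 4.8, 1.2]
z_scores = standardize(raw_scores)
print(round(mean(z_scores), 6), round(pstdev(z_scores), 6))
```

Standardization only fixes the first two moments; it does not make a skewed or multi-modal score distribution Gaussian, which is why the shapes in the figures remain informative.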
Appendix F Dataset Details
Here, we detail the curation process of every subset. All subsets are either manually verified or are curated from previous evaluation datasets with manual verification. For detailed data processing notes, see Appendix I. In total there are 2958 prompts in RewardBench. All subsets in the primary dataset are single-turn instruction-following tasks.
F.0.1 Chat Subsets
This section is designed to evaluate a reward model's basic understanding of instruction following.
AlpacaEval (Easy, Length, Hard)
Manually verified prompt-chosen-rejected trios from AlpacaEval (Li et al., 2023b) where the chosen and rejected responses come from models of different capabilities.
For the AlpacaEval Easy subset with 100 prompts, the chosen completions are from the GPT4-Turbo responses (97.70% win rate) and the rejected come from a much weaker model, Alpaca 7B (Taori et al., 2023) (26.46% win rate).
For the AlpacaEval Length subset with 95 prompts, we seek two models with similar average completion length and a large delta in evaluated performance. The chosen completions come from Llama 2 Chat 70B (92.66% win rate, 1790 average character length) (Touvron et al., 2023) and the rejected come from Guanaco 13B (52.61% win rate, 1774 average character length) (Dettmers et al., 2023).
The AlpacaEval Hard subset contains 95 manually verified prompt-chosen-rejected trios where the chosen responses come from the Tülu 2 70B DPO responses (95.03% win rate) and the rejected come from a weaker model, Davinci003 (Ouyang et al., 2022) (50.00% win rate).
MT Bench (Easy, Medium)
The MT Bench Easy subset is composed of 28 manually verified prompt-chosen-rejected trios from MT-Bench (Zheng et al., 2023) where chosen and rejected correspond to judgements of score 10 and 1 respectively for the same prompt. (Data is available here: https://huggingface.co/spaces/lmsys/mt-bench/blob/main/data/mt_bench/model_judgment/gpt-4_single.jsonl) The MT Bench Medium subset is similar, with 40 manually verified prompt-chosen-rejected trios from MT-Bench (Zheng et al., 2023) where chosen and rejected correspond to judgements of score 9 and 2 to 5 respectively for the same prompt.
For all MT-Bench subsets, the second-turn data was not included, for two reasons. First, it is out-of-distribution for a reward model: the data would differ across the entire conversation and not just in the last turn after the prompt. Second, organizing by score is difficult because scores are assigned to both the first and second responses. Further MT-Bench filtering data, such as the models included and the distribution of scores, is included in Sec. I.2.
F.0.2 Chat Hard Subsets
This section is designed to challenge the instruction following abilities of a reward model with trick questions and minor factual or formatting issues.
MT Bench Hard
37 manually verified prompt-chosen-rejected trios from MT-Bench (Zheng et al., 2023) where chosen and rejected correspond to judgements of score 7 to 8 and 5 to 6 respectively for the same prompt.
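The score ranges that define the MT Bench Easy, Medium, and Hard subsets can be expressed as a small pairing routine. This is a sketch under stated assumptions, not the curation script: the record keys `prompt`, `answer`, and `score` are hypothetical, and the output is only a list of candidate trios that would still require the manual verification described above.

```python
from itertools import product

# (chosen scores, rejected scores) for each subset, as stated in the text.
TIERS = {
    "easy": ({10}, {1}),
    "medium": ({9}, {2, 3, 4, 5}),
    "hard": ({7, 8}, {5, 6}),
}


def build_trios(judgments, tier):
    """Pair answers to the same prompt whose scores fall in the tier's ranges.

    `judgments` is a list of dicts with hypothetical keys
    "prompt", "answer", and "score".
    """
    chosen_scores, rejected_scores = TIERS[tier]
    by_prompt = {}
    for j in judgments:
        by_prompt.setdefault(j["prompt"], []).append(j)

    trios = []
    for prompt, items in by_prompt.items():
        chosen = [j for j in items if j["score"] in chosen_scores]
        rejected = [j for j in items if j["score"] in rejected_scores]
        for c, r in product(chosen, rejected):
            trios.append(
                {"prompt": prompt, "chosen": c["answer"], "rejected": r["answer"]}
            )
    return trios


# Toy judgments for illustration only; these are not MT-Bench data.
toy = [
    {"prompt": "p1", "answer": "a", "score": 10},
    {"prompt": "p1", "answer": "b", "score": 1},
    {"prompt": "p1", "answer": "c", "score": 7},
    {"prompt": "p1", "answer": "d", "score": 5},
    {"prompt": "p2", "answer": "e", "score": 9},
]
easy_trios = build_trios(toy, "easy")
medium_trios = build_trios(toy, "medium")
hard_trios = build_trios(toy, "hard")
```

Note that a score of 5 is admissible as a rejected score in both the Medium and Hard ranges; the chosen score determines which subset a pair belongs to.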
LLMBar Natural
The 100 examples from the LLMBar Natural split have preferred completions from existing instruction-following benchmarks, which are manually verified in preference ranking (Zeng et al., 2023). This subset is similar to the AlpacaEval and MT-Bench subsets.
LLMBar Adversarial (Neighbor, GPTInst, GPTOut, Manual)
Human-curated trick instruction-following questions for LLM-as-a-judge applications from LLMBar (Zeng et al., 2023), reformatted as prompt-chosen-rejected trios. Neighbor creates a rejected completion from a closely related instruction in the dataset, GPTInst creates a rejected completion by asking GPT4 for an instruction similar to the original, which is then used to produce a generation, GPTOut creates a rejected completion by asking GPT4 to be unhelpful when following the same prompt, and Manual is a set of specifically curated trick pairs.
The counts per subset are 134 for Neighbor, 92 for GPTInst, 47 for GPTOut, and 46 for Manual.
F.0.3 Safety Subsets
This section is designed to evaluate the propensity for reward models to prefer refusals to sensitive questions or to prefer responses to questions which could trigger a false refusal.
Refusals (Dangerous, Offensive)
100 examples in each subset with prompts from GPT-3.5 and GPT-4, seeded with human-written prompts designed to elicit dangerous or offensive responses. The chosen completions are refusals from GPT-3.5, which we find to give more varied and detailed refusals than GPT-4. The rejected completions are responses that have been manually verified to contain dangerous or offensive content, sourced from Dolphin-2.0-mistral-7b (https://huggingface.co/cognitivecomputations/dolphin-2.0-mistral-7b), an uncensored fine-tune of Mistral 7B (Jiang et al., 2023a).
Do Not Answer
136 examples from the original 900 examples in the Do Not Answer dataset (Wang et al., 2023), designed to have questions that only responsible language models will not answer, where the rejected response is an answer flagged as harmful in their analysis and the chosen is a harmless answer for the same prompt. For this subset, when there are multiple options of models to select from, we prioritize selecting a non-GPT4 model to enhance dataset variety. Otherwise, the samples are drawn at random from the suitable choices.
XSTest Should Respond
250 manually selected triples from XSTest (Röttger et al., 2023). The prompts from this XSTest subset were designed to test for exaggerated safety behavior by containing certain, seemingly unsafe, trigger words while actually being safe to comply with. The chosen and rejected completions were selected from GPT-4 (Achiam et al., 2023), Llama-2-70b-chat-hf (Touvron et al., 2023) (with the original system prompt and without), Mistral-7B-Instruct-v0.1 (Jiang et al., 2023a) (with and without a guardrail system prompt), dolphin-2.0-mistral-7b and zephyr-7b-beta (Tunstall et al., 2023). For the should-respond subset of the queries, the rejected completion is always a refusal (or partial-refusal) and the chosen a compliance. To keep this test set challenging, we opted to keep the completion sources varied. Whenever none of the models generated a refusal, we used the following refusal template: “Sorry, but I can’t assist with that.”
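The fallback to a refusal template can be sketched as follows. This is an illustrative assumption about how such a trio could be assembled, not the selection procedure itself, which was manual: the keys `text` and `is_refusal` are hypothetical, `is_refusal` stands in for a human label, and taking the first available completion is a simplification of the varied, hand-picked sources.

```python
REFUSAL_TEMPLATE = "Sorry, but I can’t assist with that."


def should_respond_trio(prompt, completions):
    """Build a should-respond trio: the chosen complies, the rejected refuses.

    `completions` is a list of dicts with hypothetical keys "text" and
    "is_refusal" (a manual label; no automatic refusal detection is implied).
    Returns None when no completion complied with the prompt.
    """
    compliances = [c["text"] for c in completions if not c["is_refusal"]]
    refusals = [c["text"] for c in completions if c["is_refusal"]]
    if not compliances:
        return None
    # Fall back to the template only when no model produced a refusal.
    rejected = refusals[0] if refusals else REFUSAL_TEMPLATE
    return {"prompt": prompt, "chosen": compliances[0], "rejected": rejected}


# Toy completions for illustration only; these are not XSTest data.
only_compliance = should_respond_trio(
    "q", [{"text": "Sure, here is how.", "is_refusal": False}]
)
with_refusal = should_respond_trio(
    "q",
    [
        {"text": "Sure, here is how.", "is_refusal": False},
        {"text": "I cannot help with that.", "is_refusal": True},
    ],
)
no_compliance = should_respond_trio("q", [{"text": "No.", "is_refusal": True}])
```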
XSTest Should Refuse
154 (out of 200; for 46 prompts none of the models complied and it was not feasible to get human-written toxic content) manually selected triples from XSTest (Röttger et al., 2023). For the should-refuse subset of the queries, the rejected completion is always a compliance and the chosen a refusal (or partial-refusal). The completions were selected from the same set of models as mentioned above for XSTest should-respond and we applied the same design decisions. Additionally, when no compliance was available from our set of models and it seemed feasible, we also hand-wrote some of the completions.
F.0.4 Reasoning Subsets
This section is designed to evaluate specific reasoning abilities such as code and math.
HumanEvalPack (CPP, Go, Java, Javascript, Python, Rust)
For each programming language, there are 164 prompts with buggy and functional solutions in HumanEvalPack (HEP) (Muennighoff et al., 2023). We format these with the chosen answer as the correct solution and the buggy answer as rejected.
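A minimal sketch of this formatting step is shown below. The key names `prompt`, `canonical_solution`, and `buggy_solution` are assumptions about the record layout, and the example record is a toy written for illustration, not an item from HumanEvalPack.

```python
NUM_LANGUAGES = 6           # six language entries are listed for this subset
PROMPTS_PER_LANGUAGE = 164  # prompts per language, as stated in the text


def to_trio(record):
    """Map one record to a prompt-chosen-rejected trio.

    The key names used here are assumptions about the record layout.
    """
    return {
        "prompt": record["prompt"],
        "chosen": record["canonical_solution"],  # functional solution
        "rejected": record["buggy_solution"],    # solution containing a bug
    }


# Toy record for illustration only.
example = {
    "prompt": "def add(a, b):\n",
    "canonical_solution": "    return a + b\n",
    "buggy_solution": "    return a - b\n",
}
trio = to_trio(example)

# Total number of trios contributed by this group of subsets.
total_hep_trios = NUM_LANGUAGES * PROMPTS_PER_LANGUAGE
```

With 164 prompts in each of six languages, this group contributes 984 trios in total.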
PRM Math
We filter and select answers from the PRM800k (PRM: process reward model) reasoning dataset (Lightman et al., 2023) to construct pairings of reference answers with incorrect, generated answers from a GPT4 fine-tune used in the paper. We use the test set from phase 2 of the data for these rollouts, keeping only examples where the model generated an error (no doubly correct examples). The questions originate from the MATH dataset (Hendrycks et al., 2021).
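The filtering rule can be sketched as below, assuming each rollout has already been labeled as correct or incorrect. All key names (`question`, `reference_answer`, `generated_answer`, `generated_is_correct`) are hypothetical and do not reflect the actual PRM800k schema.

```python
def build_prm_pairs(rollouts):
    """Pair each reference answer with an incorrect generated answer.

    `rollouts` is a list of dicts with hypothetical keys; rollouts whose
    generated answer is correct ("doubly correct" examples) are dropped.
    """
    pairs = []
    for r in rollouts:
        if r["generated_is_correct"]:
            continue  # both answers correct, so there is no preference signal
        pairs.append(
            {
                "prompt": r["question"],
                "chosen": r["reference_answer"],
                "rejected": r["generated_answer"],
            }
        )
    return pairs


# Toy rollouts for illustration only; these are not PRM800k data.
toy_rollouts = [
    {
        "question": "What is 2 + 2?",
        "reference_answer": "4",
        "generated_answer": "5",
        "generated_is_correct": False,
    },
    {
        "question": "What is 3 + 3?",
        "reference_answer": "6",
        "generated_answer": "6",
        "generated_is_correct": True,
    },
]
prm_pairs = build_prm_pairs(toy_rollouts)
```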
Appendix G Discussion on Prior Test Sets
The goal in choosing the subsets for the Prior Sets section of the benchmark is to include results that are representative of past attempts in reward modeling and still useful to future work. Many of the datasets in this section differ from other popular preference datasets by being populated by human labels. We primarily chose the data for this section by a process of elimination after evaluating many models, in order to create a leaderboard ranking that is fair. For example, we decided that the Safety section better represented models’ abilities. The SHP data we include is a filtered version of their subset that increases the margin between ratings, so that the data should be easier for the RMs to discern. Full data for this section is shown in Tab. LABEL:table:pref_sets. The MT Bench data included in the table is interesting, but it is not formally released as a test set, so we are worried about potential contamination (and MT-Bench is already heavily covered by the benchmark). It does, though, show interesting correlations between the agreement of human and GPT4 judgements.
Appendix H Dataset Characteristics
The following subsections will discuss our analyses of some high-level characteristics of the evaluation dataset.
H.1 Source of chosen and rejected completions
Figure H.1 shows the sources of all completions in the evaluation set, and also the breakdown for both chosen and rejected completions. The unknown label applies to instances of LLMBar and PRM800k. For LLMBar, the authors manually filtered and modified each example to ensure their difficulty, resulting in instances that are neither fully human-generated nor fully model-generated. For PRM800k, all unknown instances are rejections because we only filtered on cases where the model generated an error.