HelpSteer2-Preference: Complementing Ratings with Preferences

Wang, Zhilin; Bukharin, Alexander; Delalleau, Olivier; Egert, Daniel; Shen, Gerald; Zeng, Jiaqi; Kuchaiev, Oleksii; Dong, Yi

arXiv:2410.01257 (cs)

[Submitted on 2 Oct 2024]

Abstract: Reward models are critical for aligning models to follow instructions, and are typically trained following one of two popular paradigms: Bradley-Terry style or Regression style. However, there is a lack of evidence that either approach is better than the other, when adequately matched for data. This is primarily because these approaches require data collected in different (but incompatible) formats, meaning that adequately matched data is not available in existing public datasets. To tackle this problem, we release preference annotations (designed for Bradley-Terry training) to complement existing ratings (designed for Regression style training) in the HelpSteer2 dataset. To improve data interpretability, preference annotations are accompanied with human-written justifications. Using this data, we conduct the first head-to-head comparison of Bradley-Terry and Regression models when adequately matched for data. Based on insights derived from such a comparison, we propose a novel approach to combine Bradley-Terry and Regression reward modeling. A Llama-3.1-70B-Instruct model tuned with this approach scores 94.1 on RewardBench, emerging top of more than 140 reward models as of 1 Oct 2024. We also demonstrate the effectiveness of this reward model at aligning models to follow instructions in RLHF. We open-source this dataset (CC-BY-4.0 license) at this https URL and openly release the trained Reward Model at this https URL.
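
For readers unfamiliar with the two paradigms named in the abstract, the sketch below contrasts the textbook per-example objectives: a Bradley-Terry loss over a (chosen, rejected) pair of scalar rewards, and a squared-error regression loss against a human rating. These are the generic forms only, not necessarily the formulation used in the paper, and the function and parameter names are hypothetical.

```python
import math


def bradley_terry_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood that the chosen response is preferred.

    Generic Bradley-Terry form: -log(sigmoid(reward_chosen - reward_rejected)).
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch on the sign of the
    # margin so that exp() is never called on a large positive number.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def regression_loss(predicted_rating: float, human_rating: float) -> float:
    """Squared error between a predicted score and an annotated rating."""
    return (predicted_rating - human_rating) ** 2


# Illustrative values only (not taken from the dataset).
pair_loss = bradley_terry_loss(reward_chosen=1.2, reward_rejected=0.3)
rating_loss = regression_loss(predicted_rating=3.5, human_rating=4.0)
```

The two losses consume differently shaped data: the first needs a pair of responses with a preference label, the second a single response with a numeric rating. That difference is the format incompatibility the abstract refers to.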

Comments:	26 pages, 3 figures
Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Cite as:	arXiv:2410.01257 [cs.LG]
	(or arXiv:2410.01257v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2410.01257

Submission history

From: Zhilin Wang
[v1] Wed, 2 Oct 2024 06:05:52 UTC (830 KB)
