HelpSteer2-Preference: Complementing Ratings with Preferences

Z Wang, A Bukharin, O Delalleau, D Egert… - arXiv preprint arXiv …, 2024 - arxiv.org
Reward models are critical for aligning models to follow instructions, and are typically
trained following one of two popular paradigms: Bradley-Terry style or Regression style.
However, there is a lack of evidence that either approach is better than the other, when
adequately matched for data. This is primarily because these approaches require data
collected in different (but incompatible) formats, meaning that adequately matched data is
not available in existing public datasets. To tackle this problem, we release preference …
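The abstract contrasts the two reward-model training paradigms by name only. As a point of reference, a conventional formulation of each (assumed here, not quoted from the paper) is the following, for a reward model r_theta, a prompt x, a chosen/rejected response pair (y_c, y_r), a single response y with a scalar human rating s, and sigma the logistic function:

    \mathcal{L}_{\mathrm{BT}} = -\log \sigma\!\left( r_\theta(x, y_c) - r_\theta(x, y_r) \right)

    \mathcal{L}_{\mathrm{Reg}} = \left( r_\theta(x, y) - s \right)^2

The incompatibility the abstract points to follows directly from these standard forms: Bradley-Terry training consumes paired preference labels, while regression training consumes per-response ratings, so data collected for one cannot be used as-is for the other.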
