HelpSteer2-Preference: Complementing Ratings with Preferences

Z Wang, A Bukharin, O Delalleau, D Egert… - arXiv preprint arXiv …, 2024 - arxiv.org
Reward models are critical for aligning models to follow instructions, and are typically
trained following one of two popular paradigms: Bradley-Terry style or Regression style.
However, there is a lack of evidence that either approach is better than the other, when
adequately matched for data. This is primarily because these approaches require data
collected in different (but incompatible) formats, meaning that adequately matched data is
not available in existing public datasets. To tackle this problem, we release preference …
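The abstract contrasts the two reward-model training paradigms by name only. As a point of reference, a conventional formulation of each (assumed here, not quoted from the paper) is the following, for a reward model r_theta, a prompt x, a chosen/rejected response pair (y_c, y_r), a single response y with a scalar human rating s, and sigma the logistic function:

    \mathcal{L}_{\mathrm{BT}} = -\log \sigma\!\left( r_\theta(x, y_c) - r_\theta(x, y_r) \right)

    \mathcal{L}_{\mathrm{Reg}} = \left( r_\theta(x, y) - s \right)^2

The incompatibility the abstract points to follows directly from these standard forms: Bradley-Terry training consumes paired preference labels, while regression training consumes per-response ratings, so data collected for one cannot be used as-is for the other.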
