Human Alignment of Large Language Models through Online Preference Optimisation

Calandriello, Daniele; Guo, Daniel; Munos, Remi; Rowland, Mark; Tang, Yunhao; Pires, Bernardo Avila; Richemond, Pierre Harvey; Lan, Charline Le; Valko, Michal; Liu, Tianqi; Joshi, Rishabh; Zheng, Zeyu; Piot, Bilal

Computer Science > Machine Learning

arXiv:2403.08635 (cs)

[Submitted on 13 Mar 2024]

Title:Human Alignment of Large Language Models through Online Preference Optimisation

Authors:Daniele Calandriello, Daniel Guo, Remi Munos, Mark Rowland, Yunhao Tang, Bernardo Avila Pires, Pierre Harvey Richemond, Charline Le Lan, Michal Valko, Tianqi Liu, Rishabh Joshi, Zeyu Zheng, Bilal Piot

View PDF HTML (experimental)

Abstract:Ensuring alignment of language models' outputs with human preferences is critical to guarantee a useful, safe, and pleasant user experience. Thus, human alignment has been extensively studied recently and several methods such as Reinforcement Learning from Human Feedback (RLHF), Direct Policy Optimisation (DPO) and Sequence Likelihood Calibration (SLiC) have emerged. In this paper, our contribution is two-fold. First, we show the equivalence between two recent alignment methods, namely Identity Policy Optimisation (IPO) and Nash Mirror Descent (Nash-MD). Second, we introduce a generalisation of IPO, named IPO-MD, that leverages the regularised sampling approach proposed by Nash-MD.
This equivalence may seem surprising at first sight, since IPO is an offline method whereas Nash-MD is an online method using a preference model. However, this equivalence can be proven when we consider the online version of IPO, that is when both generations are sampled by the online policy and annotated by a trained preference model. Optimising the IPO loss with such a stream of data becomes then equivalent to finding the Nash equilibrium of the preference model through self-play. Building on this equivalence, we introduce the IPO-MD algorithm that generates data with a mixture policy (between the online and reference policy) similarly as the general Nash-MD algorithm. We compare online-IPO and IPO-MD to different online versions of existing losses on preference data such as DPO and SLiC on a summarisation task.

Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Cite as:	arXiv:2403.08635 [cs.LG]
	(or arXiv:2403.08635v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2403.08635

Submission history

From: Daniele Calandriello [view email]
[v1] Wed, 13 Mar 2024 15:47:26 UTC (209 KB)

Computer Science > Machine Learning

Title:Human Alignment of Large Language Models through Online Preference Optimisation

Submission history

Access Paper:

References & Citations

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators

Computer Science > Machine Learning

Title:Human Alignment of Large Language Models through Online Preference Optimisation

Submission history

Access Paper:

References & Citations

BibTeX formatted citation

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators