Rule Based Rewards for Language Model Safety

Mu, Tong; Helyar, Alec; Heidecke, Johannes; Achiam, Joshua; Vallone, Andrea; Kivlichan, Ian; Lin, Molly; Beutel, Alex; Schulman, John; Weng, Lilian

Computer Science > Artificial Intelligence

arXiv:2411.01111 (cs)

[Submitted on 2 Nov 2024]

Abstract: Reinforcement learning based fine-tuning of large language models (LLMs) on human preferences has been shown to enhance both their capabilities and safety behavior. However, in cases related to safety, without precise instructions to human annotators, the data collected may cause the model to become overly cautious, or to respond in an undesirable style, such as being judgmental. Additionally, as model capabilities and usage patterns evolve, there may be a costly need to add or relabel data to modify safety behavior. We propose a novel preference modeling approach that utilizes AI feedback and only requires a small amount of human data. Our method, Rule Based Rewards (RBR), uses a collection of rules for desired or undesired behaviors (e.g. refusals should not be judgmental) along with an LLM grader. In contrast to prior methods using AI feedback, our method uses fine-grained, composable, LLM-graded few-shot prompts as reward directly in RL training, resulting in greater control, accuracy and ease of updating. We show that RBRs are an effective training method, achieving an F1 score of 97.1, compared to a human-feedback baseline of 91.7, resulting in much higher safety-behavior accuracy by better balancing usefulness and safety.
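
The idea of composing fine-grained rules into a single reward can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the rule names, the weights, the plain weighted sum, and especially `stub_grader`, a keyword matcher standing in for the LLM grader that the abstract describes.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Rule:
    name: str
    description: str  # natural-language rule a grader would be asked about
    weight: float     # hypothetical: positive = desired, negative = undesired


# Hypothetical rules; the abstract only gives "refusals should not be
# judgmental" as an example of a rule.
RULES = [
    Rule("brief_apology", "The refusal contains a brief apology.", 1.0),
    Rule("judgmental", "The refusal is judgmental toward the user.", -2.0),
]

# Cue phrases used only by the toy grader below.
_CUES = {
    "brief_apology": ("sorry",),
    "judgmental": ("shame", "you should know better"),
}

Grader = Callable[[Rule, str], float]


def stub_grader(rule: Rule, response: str) -> float:
    """Toy stand-in for an LLM grader: 1.0 if any cue phrase appears, else 0.0."""
    text = response.lower()
    return 1.0 if any(cue in text for cue in _CUES[rule.name]) else 0.0


def rule_based_reward(response: str, rules: Sequence[Rule], grader: Grader) -> float:
    """Combine per-rule grades into one scalar via an assumed weighted sum."""
    return sum(rule.weight * grader(rule, response) for rule in rules)


polite = "Sorry, I can't help with that."
judgy = "Shame on you for asking. I won't help."
print(rule_based_reward(polite, RULES, stub_grader))
print(rule_based_reward(judgy, RULES, stub_grader))
```

Because each rule is graded separately, changing the desired behavior in this sketch means editing one entry in `RULES` rather than relabeling data, which is the kind of ease of updating the abstract points to.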

Comments:	Accepted at NeurIPS 2024
Subjects:	Artificial Intelligence (cs.AI)
Cite as:	arXiv:2411.01111 [cs.AI]
	(or arXiv:2411.01111v1 [cs.AI] for this version)
	https://doi.org/10.48550/arXiv.2411.01111

Submission history

From: Tong Mu
[v1] Sat, 2 Nov 2024 02:22:21 UTC (2,144 KB)
