Showing 1–4 of 4 results for author: Siththaranjan, A

  1. arXiv:2405.17713

    cs.AI cs.LG

    AI Alignment with Changing and Influenceable Reward Functions

    Authors: Micah Carroll, Davis Foote, Anand Siththaranjan, Stuart Russell, Anca Dragan

    Abstract: Existing AI alignment approaches assume that preferences are static, which is unrealistic: our preferences change, and may even be influenced by our interactions with AI systems themselves. To clarify the consequences of incorrectly assuming static preferences, we introduce Dynamic Reward Markov Decision Processes (DR-MDPs), which explicitly model preference changes and the AI's influence on them.…

    Submitted 27 May, 2024; originally announced May 2024.

    Comments: Accepted to ICML 2024

  2. arXiv:2312.08358

    cs.LG cs.AI stat.ML

    Distributional Preference Learning: Understanding and Accounting for Hidden Context in RLHF

    Authors: Anand Siththaranjan, Cassidy Laidlaw, Dylan Hadfield-Menell

    Abstract: In practice, preference learning from human feedback depends on incomplete data with hidden context. Hidden context refers to data that affects the feedback received, but which is not represented in the data used to train a preference model. This captures common issues of data collection, such as having human annotators with varied preferences, cognitive processes that result in seemingly irration…

    Submitted 16 April, 2024; v1 submitted 13 December, 2023; originally announced December 2023.

    Comments: Presented at ICLR 2024

  3. arXiv:2307.15217

    cs.AI cs.CL cs.LG

    Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback

    Authors: Stephen Casper, Xander Davies, Claudia Shi, Thomas Krendl Gilbert, Jérémy Scheurer, Javier Rando, Rachel Freedman, Tomasz Korbak, David Lindner, Pedro Freire, Tony Wang, Samuel Marks, Charbel-Raphaël Segerie, Micah Carroll, Andi Peng, Phillip Christoffersen, Mehul Damani, Stewart Slocum, Usman Anwar, Anand Siththaranjan, Max Nadeau, Eric J. Michaud, Jacob Pfau, Dmitrii Krasheninnikov, Xin Chen, et al. (7 additional authors not shown)

    Abstract: Reinforcement learning from human feedback (RLHF) is a technique for training AI systems to align with human goals. RLHF has emerged as the central method used to finetune state-of-the-art large language models (LLMs). Despite this popularity, there has been relatively little public work systematizing its flaws. In this paper, we (1) survey open problems and fundamental limitations of RLHF and rel…

    Submitted 11 September, 2023; v1 submitted 27 July, 2023; originally announced July 2023.

  4. arXiv:2103.05746

    cs.RO cs.AI cs.HC eess.SY

    Analyzing Human Models that Adapt Online

    Authors: Andrea Bajcsy, Anand Siththaranjan, Claire J. Tomlin, Anca D. Dragan

    Abstract: Predictive human models often need to adapt their parameters online from human data. This raises previously ignored safety-related questions for robots relying on these models such as what the model could learn online and how quickly could it learn it. For instance, when will the robot have a confident estimate in a nearby human's goal? Or, what parameter initializations guarantee that the robot c…

    Submitted 30 September, 2021; v1 submitted 9 March, 2021; originally announced March 2021.

    Comments: ICRA 2021