Jason Weston Reasoning Alignment Berkeley Talk
Jason Weston
Meta & NYU
Goal: An AI that "trains" itself as much as possible
- Creates new tasks to train on (challenges itself)
- Evaluates whether it gets them right ("self-rewarding")
- Updates itself based on what it understood
Research question: can this help it become superhuman?
When self-improving: two types of reasoning to improve
System 1: reactive and relies on associations
• LLMs can be viewed as System 1
• Fixed compute per token
• Directly outputs answer
• Failures: learns spurious/unwanted correlations: hallucination, sycophancy, jailbreaking, ..
System 2: more deliberate and effortful
• Multiple "calls" to System 1 LLM
• Planning, search, verifying, reasoning etc.
• Dynamic computation (e.g. chain-of-thought, ToT, ..)
[Diagram: System 2 built from multiple System 1 (LLM) calls. A question is decomposed into sub-questions answered by separate System 1 calls, with attention and verification steps, before producing the final answer.]
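To make the "multiple calls to System 1" idea concrete, here is a minimal Python sketch, assuming a hypothetical `llm(prompt)` helper that wraps a single fixed-compute LLM call; the decomposition and verification prompts are illustrative, not from the talk.

```python
# Minimal sketch: System 2 behaviour built from repeated System 1 (LLM) calls.
# `llm(prompt)` is a hypothetical helper wrapping one call to the underlying model.

def llm(prompt: str) -> str:
    raise NotImplementedError("wrap your favourite LLM API here")

def system2_answer(question: str) -> str:
    # 1) Plan: ask the model to break the question into sub-questions.
    plan = llm(f"Break this question into short sub-questions, one per line:\n{question}")
    sub_questions = [line.strip() for line in plan.splitlines() if line.strip()]

    # 2) Solve: answer each sub-question with its own System 1 call.
    notes = []
    for sq in sub_questions:
        answer = llm(f"Answer concisely: {sq}")
        notes.append(f"{sq}\n{answer}")

    # 3) Verify + combine: a final call checks the notes and produces the answer.
    context = "\n\n".join(notes)
    return llm(
        f"Question: {question}\n\nNotes:\n{context}\n\n"
        "Check the notes for mistakes, then give the final answer."
    )
```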
First, some pre-history
(Pre-2020..)
Language modeling
Standard (pre-)training trains by predicting the next token, only on "positive examples" of language
2021 – Megatron-Turing NLG (NVIDIA & Microsoft) – Pretrained LLM, 530B parameters.
2023 – Claude 1 & 2 (Anthropic) – Pretrained LLM with RLHF, focused on safety.
2023 – GPT-4 (OpenAI) – Pretrained LLM with extensive RLHF for better accuracy.
2023 – Direct Preference Optimization (DPO) – preference training as an alternative to RLHF.
2022 – Instruction following (without explicit Chain-of-Thought reasoning)
Improving reasoning via System 2 (LLMs)
Prompting approaches
(First try! circa ancient 2022-2023)
System 1 failures: Factuality & hallucination
Chain-of-Verification Reduces Hallucination in Large Language Models
• Shehzaad Dhuliawala, Mojtaba Komeili, Jing Xu, Roberta Raileanu, Xian Li, Asli Celikyilmaz, Jason Weston
System 2 Attention (S2A)
Jason Weston, Sainbayar Sukhbaatar
S2A regenerates only the relevant context, so the model ignores irrelevant parts and gives a less biased answer.
Without time to think, humans make mistakes & are biased too.
We need more System 2 methods that use effortful thinking!
Branch-Solve-Merge for Evaluating and Improving Language Generation
Swarnadeep Saha, Xian Li, Omer Levy, Jason Weston, Asli Celikyilmaz
Problem:
- When the task is complex, the instruction is hard to follow, e.g. even GPT-4 fails.
Approach (sketched below):
- Given a task, generate a plan by branching into subproblems
- Solve the subproblems, one for each branch
- Given the partial solutions, merge them into a final answer
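A minimal sketch of the branch/solve/merge loop, again assuming a hypothetical `llm(prompt)` helper; the prompts are illustrative and not the paper's exact prompts.

```python
# Branch-Solve-Merge, sketched with a hypothetical llm(prompt) helper.

def llm(prompt: str) -> str:  # hypothetical single-call LLM helper
    raise NotImplementedError("wrap your favourite LLM API here")

def branch_solve_merge(task: str) -> str:
    # Branch: generate a plan that splits the task into subproblems.
    plan = llm(f"List the subproblems needed to solve this task, one per line:\n{task}")
    branches = [line.strip() for line in plan.splitlines() if line.strip()]

    # Solve: handle each subproblem independently, one call per branch.
    partials = [
        llm(f"Task: {task}\nSubproblem: {b}\nSolve this subproblem.") for b in branches
    ]

    # Merge: combine the partial solutions into one final solution.
    joined = "\n\n".join(partials)
    return llm(
        f"Task: {task}\n\nPartial solutions:\n{joined}\n\n"
        "Merge these into one final solution."
    )
```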
Better reasoning via Self-Improvement
(Self-)Training methods
Improve reasoning through optimization
Self-Rewarding LLMs 2024 (Jan)
• Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Xian Li, Sainbayar Sukhbaatar, Jing Xu, Jason Weston
Reference: https://openai.com/research/weak-to-strong-generalization
Humans need to read the responses carefully in order to make decisions.
E.g. "Imagine you have a basket of different fruits like apples, oranges, and bananas…"
Our approach
● Self-rewarding LMs
○ Key idea: train a self-rewarding language model that
■ 1) Has instruction following capability, i.e., given a user instruction, can respond to it
appropriately
■ 2) Has evaluation capability, i.e., given a user instruction, one or more responses, can
judge the quality of responses
Singleton case (example LLM-as-a-Judge output): <CoT reasoning process> Therefore, I would assign 3 out of 5 to this response.
○ Then this base model can go through an iterative process of
■ data creation/curation
■ training on new data
○ Hopefully, the model can get better in terms of both instruction following and
evaluation capabilities in each cycle
Iterate 2 steps:
Since OpenAssistant only provides ranking information for different responses, we collect EFT data using model-generated CoT reasoning and final scores.
Specifically, given an instruction and four responses, if the model-assigned scores for the four responses perfectly match the human rankings, then we keep those four samples; otherwise, we discard all of them.
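A small sketch of that filtering rule, assuming each candidate carries the judge's score and the human rank; the field names are made up for illustration.

```python
def keep_group(responses: list[dict]) -> bool:
    """Keep a group of four responses only if ordering them by the model's score
    reproduces the human ranking exactly; otherwise the whole group is dropped."""
    by_model = sorted(responses, key=lambda r: r["model_score"], reverse=True)
    return [r["human_rank"] for r in by_model] == sorted(r["human_rank"] for r in responses)

# Tiny illustrative group: here the judge's scores agree with the human ranking.
group = [
    {"model_score": 4.0, "human_rank": 1},
    {"model_score": 3.0, "human_rank": 2},
    {"model_score": 2.5, "human_rank": 3},
    {"model_score": 1.0, "human_rank": 4},
]
assert keep_group(group)
```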
Experiments
● We start from M0: pre-trained LLAMA-2-70B
● We multitask train M0 using seed IFT and EFT data to give M1
● We then go through iterative training
Assume we have a pool of prompts that represent user requirements.
In our experiment, we used the self-instruct technique (Wang et al.) to bootstrap instructions from OpenAssistant using ChatLLama-70B. Ideally, those prompts should come from real-world users interacting with LLMs.
Our model at the t-th iteration (Mt) generates k (we choose k=4) candidate responses for each new prompt; Mt also predicts a reward for each response via LLM-as-a-Judge prompting.
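A rough sketch of this generate-and-judge step; the `generate(model, prompt)` sampler is hypothetical, and the judge prompt and "X out of 5" format only loosely follow the LLM-as-a-Judge template.

```python
import re

def generate(model, prompt: str) -> str:  # hypothetical sampling helper
    raise NotImplementedError("wrap your model's sampling call here")

def judge_score(model, instruction: str, response: str) -> float:
    """Ask the same model to grade its own response out of 5 and parse the score."""
    judgement = generate(
        model,
        "Review the response to the instruction and conclude with 'Score: X out of 5'.\n\n"
        f"Instruction: {instruction}\n\nResponse: {response}",
    )
    match = re.search(r"(\d+(?:\.\d+)?)\s*out of 5", judgement)
    return float(match.group(1)) if match else 0.0

def generate_and_score(model, prompt: str, k: int = 4):
    """Sample k candidates for one prompt and self-reward each of them."""
    candidates = [generate(model, prompt) for _ in range(k)]
    scores = [judge_score(model, prompt, c) for c in candidates]
    return list(zip(candidates, scores))
```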
Given a prompt and k responses, we select the highest-scoring response as the winning one and the lowest-scoring response as the losing one to form a preference pair. Then we conduct DPO training on those pairs to get Mt+1, starting from Mt.
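The pair-construction step might then look like the following sketch; the `scored` input is assumed to be the (response, score) list from the previous step, and the resulting pairs are what a standard DPO trainer would consume to produce Mt+1 from Mt.

```python
def build_preference_pair(prompt: str, scored: list[tuple[str, float]]):
    """Turn k self-scored responses into one (chosen, rejected) DPO pair."""
    ranked = sorted(scored, key=lambda rs: rs[1], reverse=True)
    if ranked[0][1] == ranked[-1][1]:
        return None  # all scores tied: no usable preference signal
    return {"prompt": prompt, "chosen": ranked[0][0], "rejected": ranked[-1][0]}

# pairs = [p for p in (build_preference_pair(x, s) for x, s in scored_batches) if p]
# DPO training on `pairs` then yields M_{t+1}, starting from M_t.
```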
● We conducted two self-rewarding training loops (to give M2 and M3)
Evaluation Axes
● We evaluate the performance of our self-rewarding models along two axes:
○ Ability to follow instructions
○ Ability as a reward model (ability to evaluate responses)
Evaluation Results
● Ability to follow instructions
○ We have tested our models on
■ Our internal instruction following test set (256 prompts from diverse sources): our self-rewarding model continuously improves through iterative training, compared to a baseline obtained by training the pre-trained LLAMA-70b using only seed IFT data.
■ AlpacaEval 2.0: through two self-rewarding training loops, we can almost match the performance of GPT-4 0314.
■ MT-Bench (scores are on a scale of 10): our self-rewarding model continually improves on both types of tasks, but more on general writing tasks.
● Ability as a reward model
○ We tested our models on the OpenAssistant validation set
■ In particular, we use our self-rewarding models to assign a score to each (instruction, response) pair, and compare model judgements to human judgements.
One issue:
● How can we make it improve more on reasoning tasks?
Iterative Reasoning Preference Optimization 2024 (April)
Richard Yuanzhe Pang, Weizhe Yuan, Kyunghyun Cho, He He, Sainbayar Sukhbaatar, Jason Weston
Goal: use the same self-rewarding-style techniques, but for reasoning tasks..
Start with a base model & a fixed training set with labels.
- Generate multiple CoTs + answers per training example with the current model
- Build preference pairs based on whether the answer is correct or not
- Train with DPO + an NLL term (on the correct answers)
Repeat the steps with the new model
Key: extract the verifiable reward after "Final answer" (sketched below)
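A minimal sketch of that answer extraction and pair building, assuming generations end with a line like "Final answer: ..." and that exact string match against the gold label is sufficient for the dataset at hand; the function names are illustrative.

```python
def extract_final_answer(generation: str):
    """Pull out whatever follows the 'Final answer:' marker in a sampled CoT + answer."""
    marker = "Final answer:"
    if marker not in generation:
        return None
    return generation.split(marker, 1)[1].strip()

def build_reasoning_pairs(question: str, generations: list[str], gold: str):
    """Correct CoTs become 'chosen', incorrect ones 'rejected' (the verifiable reward)."""
    correct = [g for g in generations if extract_final_answer(g) == gold]
    wrong = [g for g in generations if extract_final_answer(g) != gold]
    return [
        {"prompt": question, "chosen": c, "rejected": w}
        for c in correct
        for w in wrong
    ]
```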
Negative examples are crucial
SFT assigns similar probability to chosen and rejected generations from DPO pairs
DPO+NLL fixes this, and beats SFT in task accuracy (73.1% on iteration 1 vs. 63.5%).
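For reference, a sketch of what the combined DPO+NLL objective looks like; the notation and length normalization are my reading of the approach, not copied from the slide. Here x is the question, (c^w, y^w) and (c^l, y^l) are the winning and losing CoT+answer, pi_theta is the model being trained, pi_ref the reference model, and alpha, beta are hyperparameters; the NLL term is applied only to the correct (winning) solutions.

```latex
% Sketch of the DPO + NLL objective on one preference pair.
\mathcal{L}_{\mathrm{DPO+NLL}}
  = -\log \sigma\!\left(
      \beta \log \frac{\pi_\theta(c^w, y^w \mid x)}{\pi_{\mathrm{ref}}(c^w, y^w \mid x)}
      - \beta \log \frac{\pi_\theta(c^l, y^l \mid x)}{\pi_{\mathrm{ref}}(c^l, y^l \mid x)}
    \right)
  \;+\; \alpha \left( - \frac{\log \pi_\theta(c^w, y^w \mid x)}{\lvert c^w \rvert + \lvert y^w \rvert} \right)
```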
2024 (September)
OpenAI's o1
(exact method: unknown)
2025 (Jan)
Thinking LLMs: General Instruction Following with Thought Generation
Tianhao Wu, Janice Lan, Weizhe Yuan, Jiantao Jiao, Jason Weston, Sainbayar Sukhbaatar
Trains LLMs to think & respond for *all* instruction following tasks, not just math
Initial CoT prompt doesn't give good performance – need lots of iterations to optimize CoT!
2025 (Jan)
Meta-Rewarding LLMs
• Tianhao Wu, Weizhe Yuan, Olga Golovneva, Jing Xu, Yuandong Tian, Jiantao Jiao, Jason Weston, Sainbayar Sukhbaatar