
Jason Weston Reasoning Alignment Berkeley Talk

The document discusses the development of self-improving AI systems, specifically focusing on large language models (LLMs) that can train themselves by creating new tasks, evaluating their performance, and updating their knowledge. It outlines the distinction between two reasoning systems: System 1, which is reactive, and System 2, which is more deliberate and effortful, and emphasizes the need for improved reasoning capabilities in LLMs to avoid issues like hallucination and bias. The proposed approach involves self-rewarding LLMs that can follow instructions and evaluate their responses, leading to iterative improvements in both instruction following and evaluation capabilities.

Learning to Self-Improve & Reason with LLMs

Jason Weston
Meta & NYU
Goal: An AI that "trains" itself as much as possible
- Creates new tasks to train on (challenges itself)
- Evaluates whether it gets them right ("self-rewarding")
- Updates itself based on what it understood
Research question: can this help it become superhuman?
When self-improving: two types of reasoning to improve

System 1: reactive, relies on associations. System 2: more deliberate and effortful.

LLMs can be viewed as System 1:
• Fixed compute per token
• Directly outputs the answer
• Failures: learns spurious/unwanted correlations: hallucination, sycophancy, jailbreaking, ..

System 2 = multiple "calls" to a System 1 LLM:
• Planning, search, verifying, reasoning, etc.
• Dynamic computation (e.g. chain-of-thought, ToT, ..)

[Diagram: System 2 built from multiple System 1 (LLM) calls: a Question is broken into sub-questions, processed by several System 1 calls with attention and verification steps, and combined into the Answer.]
First, some pre-history
(Pre-2020..)

Language modeling
Standard (pre-)training trains by predicting the next token, only on "positive examples" of language.

Images from https://lena-voita.github.io/nlp_course/language_modeling.html


2003
Most of the 2000s: Support Vector Machines
Unlike SVMs, neural nets can manipulate words + end-to-end
2014: the LLM attention mechanism is born 👶
Transformers (Vaswani et al., 2017), BERT (Devlin et al., 2018) … and so much more after
The "scaling hypothesis"
LLMs everywhere
2019 – GPT-2 (OpenAI) – Pretrained LLM.

2020 – T5 (Google) – Pretrained LLM, unified NLP tasks in a text-to-text format.

2020 – GPT-3 (OpenAI) – Pretrained LLM, 175B parameters.

2021 – Jurassic-1 (AI21 Labs) – Pretrained LLM with controllability features.

2021 – Megatron-Turing NLG (NVIDIA & Microsoft) – Pretrained LLM, 530B parameters.

2021 – Gopher (DeepMind) – Pretrained LLM with better factual knowledge.

2022 – Chinchilla (DeepMind) – Pretrained LLM optimized for efficient scaling.

2022 – PaLM (Google) – Pretrained LLM with strong reasoning abilities.

2022 – OPT (Meta) – Pretrained LLM, open-source alternative to GPT-3.

2022 – BLOOM (BigScience) – Pretrained LLM, multilingual, open-access model.

2022 – GPT-3.5 (OpenAI) – Pretrained LLM with post-training via RLHF.

2023 – Claude 1 & 2 (Anthropic) – Pretrained LLM with RLHF, focused on safety.

2023 – GPT-4 (OpenAI) – Pretrained LLM with extensive RLHF for better accuracy.

2023 – LLaMA (Meta) – Pretrained LLM, open-source, research-focused.

2023 – Mistral 7B (Mistral AI) – Pretrained LLM, efficient and competitive.


Is just language modeling enough?
(Answer: no)
2020 onwards..

2019-2020: Pretrain up to 9.4B-parameter LLMs + supervised fine-tuning (SFT) on dialogue data (human annotated)
2022: InstructGPT (SFT + RLHF on 175B GPT-3)
2022: Instruction following (without explicit Chain-of-Thought Reasoning)

LLM Post-training (pre-o1/r1):
● SFT: same as language modeling, but on user tasks
● or RLHF (Proximal Policy Optimization)
● or DPO (2023)
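For reference, a minimal sketch of the DPO objective over a batch of preference pairs, assuming PyTorch and summed per-response log-probabilities computed elsewhere; the helper name and `beta` value are illustrative, not taken from the slides.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO: push the policy's implicit reward for the chosen response above the
    rejected one, measured relative to a frozen reference model."""
    chosen_reward = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_reward = beta * (policy_rejected_logps - ref_rejected_logps)
    # Negative log-sigmoid of the reward margin, averaged over the batch.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```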
Improving reasoning via System 2 (LLMs)
Prompting approaches
(First try! circa ancient 2022-2023)
System 1 failures: Factuality & hallucination
Chain-of-Verification Reduces Hallucination in Large Language Models
• Shehzaad Dhuliawala, Mojtaba Komeili, Jing Xu, Roberta Raileanu, Xian Li, Asli Celikyilmaz, Jason Weston

CoVe answers in four steps: (1) draft a baseline response, (2) plan verification questions, (3) answer the verification questions, (4) generate the final verified response.

Chain-of-Verification (CoVe) variants:
- Joint: left-to-right generation of all four steps
- Factored: step (3) attends to (2) but not to step (1)
- Factor+Revise: extra "cross-check" step (see tickmarks to the right) where the LLM explicitly checks if two answers seem to match.
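As a rough illustration (not the paper's exact prompts), here is a sketch of the joint CoVe variant, where `llm` is an assumed text-in/text-out callable supplied by the caller:

```python
from typing import Callable

def chain_of_verification(query: str, llm: Callable[[str], str]) -> str:
    """Joint CoVe: draft -> plan verification questions -> answer them -> revise."""
    # (1) Draft a baseline response (may contain hallucinations).
    baseline = llm(f"Answer the question:\n{query}")
    # (2) Plan verification questions that probe the facts in the draft.
    questions = llm(
        f"Question: {query}\nDraft answer: {baseline}\n"
        "List short verification questions that check each fact in the draft."
    )
    # (3) Answer the verification questions. (The 'factored' variant answers each
    #     question in a separate call that does not see the original draft.)
    checks = llm(f"Answer each of these verification questions:\n{questions}")
    # (4) Generate the final, verified response.
    return llm(
        f"Question: {query}\nDraft answer: {baseline}\n"
        f"Verification Q&A:\n{checks}\n"
        "Rewrite the answer so it is consistent with the verification Q&A."
    )
```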
Sycophancy: agrees with user’s opinion (Sharma et al, '23)
More failure modes of System 1
Problem: the whole context affects LLM output, even irrelevant parts!

Hypothesis: soft attention inherently spreads attention thin over everything. Also, the LM objective favors correlations. (Gonen et al, '24)

LLMs learn spurious correlations


[Example (Weston & Sukhbaatar, '23): asked for Sam Liccardo's birthplace, the LLM answers "Saratoga"; with irrelevant facts about Sunnyvale added to the context, the same question incorrectly yields "Sunnyvale".]
System 2 Attention (S2A)
Jason Weston, Sainbayar Sukhbaatar

Decide what to attend to explicitly (System 2) by rewriting the input

Problem: the whole context affects LLM output, even irrelevant parts!

Hypothesis: soft attention inherently spreads attention thin over everything. Also, the LM objective favors correlations.

Solution: make attention more explicit & effortful → prompt the LLM to extract the relevant context.

Step 1: prompt "Rewrite while removing irrelevant/bias": the factual question + "I think it is correct" is rewritten into just the factual question.
Step 2: answer given the rewritten question.

[Figure: feeding the un-rewritten, opinionated input directly to the LLM yields an incorrect answer; the S2A path ignores irrelevant parts and gives a less biased answer.]
System 2 Attention (S2A)
Jason Weston, Sainbayar Sukhbaatar

Decide what to attend to explicitly (System 2) by rewriting the input

Without time to think, humans make mistakes & are biased too.
We need more System 2 methods that use effortful thinking!
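A minimal sketch of the two S2A steps, assuming a generic `llm(prompt) -> str` callable; the prompt wording is paraphrased, not the paper's exact prompt:

```python
from typing import Callable

def system2_attention(context_and_question: str, llm: Callable[[str], str]) -> str:
    """Two-step S2A: rewrite the input to keep only relevant text, then answer."""
    # Step 1: regenerate the input, dropping irrelevant or opinionated parts.
    rewritten = llm(
        "Rewrite the following text, keeping only the parts that are relevant "
        "to answering the question and removing irrelevant or biased content:\n"
        f"{context_and_question}"
    )
    # Step 2: answer using only the rewritten (de-noised) input.
    return llm(f"Answer the question given the following context:\n{rewritten}")
```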
Branch-Solve-Merge for Evaluating and Improving Language Generation
Swarnadeep Saha, Xian Li, Omer Levy, Jason Weston, Asli Celikyilmaz

Break down response evaluation into subproblems & fuse

Problem:
- When the task is complex, the instruction is hard (e.g. GPT-4 fails).

Approach (sketched below):
- Given the task, generate a plan that branches it into subproblems
- Solve the subproblems, one for each branch
- Given the partial solutions, merge them
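A hedged sketch of the branch-solve-merge flow with an assumed `llm` callable; the decomposition prompt and the one-subproblem-per-line format are illustrative choices, not from the paper:

```python
from typing import Callable

def branch_solve_merge(task: str, llm: Callable[[str], str]) -> str:
    """Branch a complex task into subproblems, solve each branch, then merge."""
    # Branch: plan a decomposition of the task, one subproblem per line.
    plan = llm(f"Task: {task}\nList the subproblems needed to solve it, one per line.")
    subproblems = [line.strip() for line in plan.splitlines() if line.strip()]
    # Solve: one LLM call per branch.
    partials = [llm(f"Task: {task}\nSolve this subproblem: {sp}") for sp in subproblems]
    # Merge: fuse the partial solutions into a single final output.
    return llm(
        f"Task: {task}\nPartial solutions:\n" + "\n".join(partials)
        + "\nMerge these partial solutions into one final solution."
    )
```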
Better reasoning via Self-Improvement
(Self-)Training methods
Improve reasoning through optimization
Self-Rewarding LLMs 2024 (Jan)
• Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Xian Li, Sainbayar Sukhbaatar, Jing Xu, Jason Weston

• LLM improves itself by assigning rewards to its own outputs and optimizing on them

● Current LLMs are approaching human-level performance on a variety of tasks.


● There is reason to believe that future LLMs will surpass human performance.
● The "Superalignment challenge"..?

Reference: https://openai.com/research/weak-to-strong-generalization

"A core challenge for aligning future superhuman AI systems (superalignment) is that humans will need to supervise AI systems much smarter than them." (OpenAI)
Standard RLHF alignment approach: use humans in the loop
- first to create (X, Y) data;
- then to collect judgments on (X, Y') data

Image from: https://arxiv.org/pdf/2009.01325.pdf


Standard RLHF alignment approach: use humans in the loop
- first to create (X, Y) data;
- then to collect judgments on (X, Y') data

Humans need to
read the responses
carefully in order to
make decisions

Image from: https://arxiv.org/pdf/2009.01325.pdf


Current alignment approach
● However, as LLMs write better and better responses…
○ It becomes harder and harder for humans to process them, especially those that are
lengthy and require domain expertise.

Images generated by GPT-4


Research Question 🤔
● How can we continue improving superhuman models?
Observations 🧐
● Observation 1
○ LLMs can continue improving if provided good judgements on response quality
■ Exemplified by the success of iterative RLHF
● Training a Helpful and Harmless Assistant with Reinforcement Learning from
Human Feedback
● Llama 2: Open Foundation and Fine-Tuned Chat Models
● Observation 2
○ LLMs can provide good judgements on model generation
■ Exemplified by the line of works that use GPT-4 for evaluation
● Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
● AlpacaEval: An Automatic Evaluator of Instruction-following Models

Then, how about combining the two?


Our approach
● Self-rewarding LMs
○ Key idea: train a self-rewarding language model that
■ 1) Has instruction following capability, i.e., given a user instruction, can respond to it
appropriately

Can you explain contrastive learning in machine learning in simple terms for someone new to the field of ML?

Here's a simple analogy to understand it:

Imagine you have a basket of different fruits like apples, oranges, and bananas…
Our approach
● Self-rewarding LMs
○ Key idea: train a self-rewarding language model that
■ 1) Has instruction following capability, i.e., given a user instruction, can respond to it
appropriately
■ 2) Has evaluation capability, i.e., given a user instruction, one or more responses, can
judge the quality of responses

Here is an instruction: Can you explain contrastive learning in machine learning in simple terms for someone new to the field of ML?

Here is the model response: <MODEL_RESPONSE>

Can you assign a score (0 to 5) to this response based on the following rubrics? <RUBRICS>

Singleton case:
<CoT reasoning process>
Therefore, I would assign 3 out of 5 to this response.
Our approach
● Self-rewarding LMs
○ Key idea: train a self-rewarding language model that
■ 1) Has instruction following capability, i.e., given a user instruction, can respond to it
appropriately
■ 2) Has evaluation capability, i.e., given a user instruction, one or more responses, can
judge the quality of responses
○ Then this base model can go through an iterative process of
■ data creation/curation
■ training on new data
Our approach
● Self-rewarding LMs
○ Key idea: train a self-rewarding language model that
■ 1) Has instruction following capability, i.e., given a user instruction, can respond to it
appropriately
■ 2) Has evaluation capability, i.e., given a user instruction, one or more responses, can
judge the quality of responses
○ Then this base model can go through an iterative process of
■ data creation/curation
■ training on new data
○ Hopefully, the model can get better in terms of both instruction following and
evaluation capabilities in each cycle

Empirically, we have shown that this is possible!


Our approach
Recipe: LM finetuned on small seed data.

Iterate 2 steps:

(1) Self-instruction creation: generate prompts, responses & self-rewards with LM

(2) Instruction-training: Train (DPO) on selected preference pairs

Iterations improve instruction following & reward modeling ability!


Experiments
● We start from M0: pre-trained LLAMA-2-70B
Experiments
● We start from M0: pre-trained LLAMA-2-70B
● We multitask train M0 using seed IFT and EFT data
○ Seed IFT data: instruction following data from OpenAssistant, we only take the first turn.
■ Format:
● Input: user instruction
● Output: response
Experiments
● We start from M0: pre-trained LLAMA-2-70B
● We multitask train M0 using seed IFT and EFT data
○ Seed IFT data: instruction following data from OpenAssistant, we only take the first turn.
■ Format:
● Input: user instruction
● Output: response
○ Seed EFT data: evaluation data from OpenAssistant
■ Format:
● Input: user instruction, model response, scoring rubrics
● Output: CoT reasoning, final score
Experiments
● LLM-as-a-Judge prompt
○ Instructs the LLM to evaluate the response using five additive criteria (relevance, coverage, usefulness, clarity and expertise)
○ Performs better than a multiple-choice format prompt
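The sketch below paraphrases an additive LLM-as-a-Judge prompt and a score parser; the wording and the "Score:" output convention are assumptions for illustration, not the exact prompt used in the paper.

```python
import re

# Paraphrased additive-scoring judge prompt (not the paper's exact wording).
JUDGE_PROMPT = """Review the user's instruction and the response below.
Award one point for each criterion the response satisfies:
relevance, coverage, usefulness, clarity, expert knowledge (max 5 points).

Instruction: {instruction}
Response: {response}

Explain your reasoning step by step, then end with a line "Score: <0-5>"."""

def parse_score(judgement: str) -> int:
    """Extract the final integer score from the judge's output."""
    matches = re.findall(r"Score:\s*([0-5])", judgement)
    if not matches:
        raise ValueError("judge output contained no score")
    return int(matches[-1])
```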
Experiments
● We start from M0: pre-trained LLAMA-2-70B
● We multitask train M0 using seed IFT and EFT data to give M1
○ Seed IFT data: instruction following data from OpenAssistant, we only take the first turn.
■ Format:
● Input: user instruction
● Output: response
○ Seed EFT data: evaluation data from OpenAssistant
■ Format:
● Input: user instruction, model response, scoring rubrics
● Output: CoT reasoning, final score

Since OpenAssistant only provides ranking information for different responses, we collect EFT data using model-generated CoT reasoning and final scores.

Specifically, given an instruction and four responses, if the model-assigned scores for the four responses perfectly match the human rankings, then we keep those four samples; otherwise, we discard all of them (a small filter sketched below).
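A small sketch of that filter, assuming we have the model's scores and the human ranks (1 = best) for one group of four responses:

```python
from typing import List

def keep_eft_group(model_scores: List[float], human_ranks: List[int]) -> bool:
    """Keep the group only if ordering responses by model score (best first)
    exactly reproduces the human ranking (rank 1 = best)."""
    by_model = sorted(range(len(model_scores)), key=lambda i: -model_scores[i])
    by_human = sorted(range(len(human_ranks)), key=lambda i: human_ranks[i])
    return by_model == by_human

# Example: scores (4, 2, 3, 1) agree with human ranks (1, 3, 2, 4) -> keep.
assert keep_eft_group([4.0, 2.0, 3.0, 1.0], [1, 3, 2, 4])
```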
Experiments
● We start from M0: pre-trained LLAMA-2-70B
● We multitask train M0 using seed IFT and EFT data to give M1
● We then go through iterative training
Experiments
● We start from M0: pre-trained LLAMA-2-70B
● We multitask train M0 using seed IFT and EFT data to give M1
● We then go through iterative training
Assume we have a pool of
prompts that represent
user requirements.

In our experiments, we used the self-instruct technique (Wang et al.) to bootstrap instructions from OpenAssistant using ChatLLama-70B. Ideally, those prompts should come from real-world users interacting with LLMs.
Experiments
● We start from M0: pre-trained LLAMA-2-70B
● We multitask train M0 using seed IFT and EFT data to give M1
● We then go through iterative training

Our model in the t-th iteration (Mt) generates k (we choose k=4) candidate responses for each
new prompt, Mt which also predicts reward for each response via LLM-as-a-Judge prompting.
Experiments
● We start from M0: pre-trained LLAMA-2-70B
● We multitask train M0 using seed IFT and EFT data to give M1
● We then go through iterative training

Given a prompt and k responses, we select the highest-scoring response as the winning one and the lowest-scoring response as the losing one to form a preference pair. Then we conduct DPO training on those pairs to get Mt+1, starting from Mt.
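Putting the pieces together, a schematic of one self-rewarding iteration; `generate`, `judge`, and `dpo_train` are assumed helpers standing in for sampling, LLM-as-a-Judge scoring, and the DPO update:

```python
from typing import Callable, List, Tuple

def self_rewarding_iteration(model,
                             prompts: List[str],
                             generate: Callable,   # (model, prompt) -> response
                             judge: Callable,      # (model, prompt, response) -> score 0-5
                             dpo_train: Callable,  # (model, pairs) -> new model
                             k: int = 4):
    """One iteration M_t -> M_{t+1}: self-instruction creation + DPO training."""
    pairs: List[Tuple[str, str, str]] = []   # (prompt, chosen, rejected)
    for x in prompts:
        candidates = [generate(model, x) for _ in range(k)]
        scored = sorted(candidates, key=lambda y: judge(model, x, y))
        best, worst = scored[-1], scored[0]
        if best != worst:                    # drop prompts with no preference signal
            pairs.append((x, best, worst))
    return dpo_train(model, pairs)
```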
Experiments
● We start from M0: pre-trained LLAMA-2-70B
● We multitask train M0 using seed IFT and EFT data to give M1
● We then go through iterative training
● We conducted two self-rewarding training loops (to give M2, and M3)
Evaluation Axes
● We evaluate the performance of our self-rewarding models in two axes:
○ Ability to follow instructions
○ Ability as a reward model (ability to evaluate responses)
Evaluation Results
● Ability to follow instructions
○ We have tested our models on
■ Our internal instruction following test set (256 prompts from diverse sources)
■ AlpacaEval 2.0
■ MT-Bench
Evaluation Results
● Ability to follow instructions
○ We have tested our models on
■ Our internal instruction following test set (256 prompts from diverse sources)
Our self-reward model is continuously improved through iterative training.

[Bar charts: head-to-head win rates under GPT-4 evaluation and human evaluation. The SFT baseline is obtained by training the pre-trained LLAMA-2-70B using only the seed IFT data.]
Evaluation Results
● Ability to follow instructions
○ We have tested our models on
■ Our internal instruction following test set (256 prompts from diverse sources)
■ AlpacaEval 2.0

Through two self-rewarding training loops, we can almost match the performance of GPT-4 0314
Evaluation Results
● Ability to follow instructions
○ We have tested our models on
■ Our internal instruction following test set (256 prompts from diverse sources)
■ AlpacaEval 2.0
■ MT-Bench
● Scores are on a scale of 10

Our self-reward model continually improves on both types of tasks, but more so on general writing tasks.
Evaluation Results
● Ability as a reward model
○ We tested our models on the OpenAssistant validation set
Evaluation Results
● Ability as a reward model
○ We tested our models on the OpenAssistant validation set
■ In particular, we use our self-rewarding models to assign a score to each (instruction, response) pair, and compare the model judgments to human judgments

Our self-reward model is continually improved in evaluation capabilities as well


Limitations

One issue:
● How can we make it improve more on reasoning tasks?
Iterative reasoning preference optimization 2024 (April)
Richard Yuanzhe Pang, Weizhe Yuan, Kyunghyun Cho, He He, Sainbayar Sukhbaatar, Jason Weston

Goal: use the same self-rewarding-style techniques, but for reasoning tasks.

Start with a base model & a fixed training set with labels.
- Generate multiple CoTs + answers per training example with the current model
- Build preference pairs based on whether the final answer is correct or not
- Train with DPO + an NLL term (on the correct answers)
Repeat the steps with the new model.

Key: extract the verifiable reward after "Final answer".
Negative examples are crucial: SFT assigns similar probability to the chosen and rejected generations from DPO pairs. DPO+NLL fixes this and beats SFT in task accuracy (73.1% on iteration 1 vs. 63.5%).
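A sketch of the combined objective, extending the earlier DPO sketch with an NLL term on the chosen (correct-answer) CoT sequences; the length normalization and `alpha` coefficient here are illustrative rather than the paper's exact choices.

```python
import torch
import torch.nn.functional as F

def dpo_nll_loss(policy_chosen_logps: torch.Tensor,
                 policy_rejected_logps: torch.Tensor,
                 ref_chosen_logps: torch.Tensor,
                 ref_rejected_logps: torch.Tensor,
                 chosen_lengths: torch.Tensor,
                 alpha: float = 1.0,
                 beta: float = 0.1) -> torch.Tensor:
    """DPO preference loss plus an NLL term that keeps the probability of the
    winning (correct-answer) CoT sequences high."""
    margin = beta * ((policy_chosen_logps - ref_chosen_logps)
                     - (policy_rejected_logps - ref_rejected_logps))
    dpo = -F.logsigmoid(margin).mean()
    nll = -(policy_chosen_logps / chosen_lengths).mean()  # length-normalized NLL
    return dpo + alpha * nll
```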
2024 (September)
OpenAI's o1
(exact method: unknown)

2025 (Jan)
DeepSeek-R1
& apply RL (GRPO - Group Relative Policy Optimization)
Thinking LLMs: General Instruction Following with Thought Generation
Tianhao Wu, Janice Lan, Weizhe Yuan, Jiantao Jiao, Jason Weston, Sainbayar Sukhbaatar
2024 (October)

Trains LLMs to think & respond for *all* instruction following tasks, not just math

- Introduces Thought Preference Optimization (TPO)
- Gives gains on AlpacaEval (beating GPT-4 & Llama3-70b) & ArenaHard

🥉 3rd on AlpacaEval leaderboard
🏆 Best 8B model on ArenaHard
Thinking LLMs: General Instruction Following with Thought Generation
Tianhao Wu, Janice Lan, Weizhe Yuan, Jiantao Jiao, Jason Weston, Sainbayar Sukhbaatar

Trains LLMs to think & respond for *all* instruction following tasks, not just math

Initial CoT prompt doesn't give good performance – need lots of iterations to optimize CoT!
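A schematic of one TPO-style preference step, with an assumed `llm` generator and `judge` scorer; the thought/response prompt format is paraphrased, not the exact template from the paper.

```python
from typing import Callable, List, Tuple

THINK_TEMPLATE = (
    "Respond to the instruction below. First write your internal thoughts after "
    "'Thought:', then write the final reply after 'Response:'.\n\n{instruction}"
)

def tpo_preference_pair(instruction: str,
                        llm: Callable[[str], str],
                        judge: Callable[[str, str], float],
                        k: int = 4) -> Tuple[str, str]:
    """Sample k thought+response outputs, score only the visible response with a
    judge, and keep the best/worst full outputs as a DPO (chosen, rejected) pair,
    so the hidden thoughts are optimized indirectly."""
    outputs: List[str] = [llm(THINK_TEMPLATE.format(instruction=instruction)) for _ in range(k)]

    def response_part(output: str) -> str:
        # Only the text after 'Response:' is shown to the judge (and to users).
        return output.split("Response:", 1)[-1].strip()

    ranked = sorted(outputs, key=lambda o: judge(instruction, response_part(o)))
    return ranked[-1], ranked[0]
```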
2025 (Jan)
Meta-Rewarding LLMs
• Tianhao Wu, Weizhe Yuan, Olga Golovneva, Jing Xu, Yuandong Tian, Jiantao Jiao, Jason Weston, Sainbayar Sukhbaatar

• LLM improves its own judgments by (meta-)judging them

Self-Rewarding focused on improving responses, not judgment capabilities
- Improvement rapidly saturated during iterative training

Meta-Rewarding: the LM is actor, judge & meta-judge
- The Meta-Judge is an extra step to judge the judgments
- Meta-rewards add a new training signal to train judgments
Recipe:
Iterate 3 steps:
(1) Create Actor data: generate responses & self-rewards (judgments) with LM
(2) Create Judge data: generate meta-rewards over judgments with LLM-as-a-Meta-Judge
(3) Train DPO on preference pairs to both learn to act (1) AND to judge (2)
How does an LLM judge judgments?

We use LLM-as-a-Meta-Judge (see prompt):
- Make N judgments for a given pair of responses & calculate pairwise meta-judgments
- Compute an Elo score for each judgment from this pairwise matrix
- Create LLM-as-a-Judge preference pairs via the Elo scores
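A toy stand-in for the Elo step, assuming the meta-judge's pairwise outcomes have already been collected into an N×N win matrix; the paper's exact Elo computation may differ.

```python
from typing import List

def elo_from_pairwise(wins: List[List[float]], k: float = 16.0, rounds: int = 100) -> List[float]:
    """Toy Elo rating of N judgments from an N x N matrix where wins[i][j] is 1.0
    if the meta-judge preferred judgment i over judgment j (0.5 for a tie)."""
    n = len(wins)
    ratings = [1000.0] * n
    for _ in range(rounds):
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                expected = 1.0 / (1.0 + 10 ** ((ratings[j] - ratings[i]) / 400.0))
                ratings[i] += k * (wins[i][j] - expected)
    return ratings

# The highest- and lowest-rated judgments can then form an LLM-as-a-Judge
# preference pair for DPO training of the judge.
```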
We also control response length with a new length-control (LC) method: select as the DPO chosen the shorter of two good responses when their scores are similar (a tiny helper is sketched below).

Our method outperforms Self-Rewarding (with same LC method).
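One way to read that rule, as a hypothetical helper (the score-similarity margin is an assumption):

```python
def length_controlled_chosen(resp_a: str, score_a: float,
                             resp_b: str, score_b: float,
                             margin: float = 0.0) -> str:
    """Pick the DPO 'chosen' response: on (near-)ties in score, prefer the
    shorter response so training does not reward verbosity."""
    if abs(score_a - score_b) <= margin:
        return resp_a if len(resp_a) <= len(resp_b) else resp_b
    return resp_a if score_a > score_b else resp_b
```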


Meta-rewarding also performs well compared to some production LLM models.
Meta-Rewarding has higher agreement with a GPT-4 judge: its better judgments
can explain its improved performance at acting compared to Self-Rewarding.
We can also push reasoning further for the evaluation task.

EvalPlanner – a method to train o1/r1-like chain-of-thought (CoT) for the evaluation / reward model task.

This "Thinking-LLM-as-a-Judge" learns to generate planning & reasoning CoTs for evaluation.
By synthetically creating high- & low-quality responses to a prompt, evaluation (which is better, A or B?) can be converted into a *verifiable task*.

Recipe for creating verifiable data:
- Generate a good response y to prompt x with the LLM
- Generate a similar prompt x', and a good response y' to it

Iterative training:
- Generate judgments as reward: y should be preferred over y' (for prompt x)
- Train the Thinking-LLM-as-a-Judge with this data and reward

How to make the similar (but different) prompt? …ask the LLM to do it! (Sketched below.)
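A rough sketch of that data-creation recipe with an assumed `llm` callable; the prompt wording is illustrative, not taken from the paper:

```python
from typing import Callable, Tuple

def make_verifiable_eval_pair(x: str, llm: Callable[[str], str]) -> Tuple[str, str, str]:
    """Create an evaluation example with a known label: a good response to x, and a
    plausible-but-worse-for-x response obtained by answering a similar prompt x'."""
    y = llm(f"Write a high-quality response to this instruction:\n{x}")
    # Ask the LLM itself to produce a similar (but meaningfully different) prompt.
    x_prime = llm(f"Write an instruction similar to, but meaningfully different from:\n{x}")
    y_prime = llm(f"Write a high-quality response to this instruction:\n{x_prime}")
    # Verifiable label: for prompt x, y should be judged better than y_prime.
    return x, y, y_prime
```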
EvalPlanner thoughts with plans are important for performance:
- Plans are superior to no thoughts
- But for training, plans should be unconstrained, not encouraged to be e.g. lists of
criteria or verification questions as in other works. Model should figure it out!
SOTA performance on RewardBench across LLM-as-a-Judge models, despite
using only a Llama 3.1 70B base.
EvalPlanner also performs very strongly on harder evaluation tasks with newer benchmarks
Summary
● Self-Rewarding models can train themselves to get better – path to superhuman AI?
● Verifiable rewards help to train CoT for better reasoning (Iterative Reasoning Preference
Optimization, DeepSeek, O1) & evaluation ability (Thinking-LLM-as-judge).
● Better judges (with CoT) can help train to think on non-verifiable tasks: Thinking LLMs.
● Models can even improve at meta-rewarding/reasoning (judging their own judgments).
Future Work - a different CoT direction..

Latent System 2 thoughts, not tokens? COCONUT (Hao et al., '24)


What else comes next? (So much more exciting research to be done!)

LGTM, but I would just add some more detail:


- (Self-)Evaluation bottlenecks performance → use more reasoning/compute. Related to being "self-aware"
- Learning from interaction (people + world/internet + itself). Related to agents + synthetic data.
- Improve "System 1" (better attention? world model? etc. Challenge: scalability?)
Thanks!!!
