
Scaling paradigms for large language models

Jason Wei
Research Scientist
OpenAI

(Opinions are my own and do not reflect my employer.)


2019: Can barely write a coherent paragraph; can’t do any reasoning.

2024: Can write an essay about almost anything; competition-level programmer and mathematician.

Scaling has been the engine of progress in AI and will continue to dictate how the field advances.
Outline

What is scaling and why do it?

Paradigm 1: Scaling next-word prediction

The challenge with next-word prediction

Paradigm 2: Scaling RL on chain-of-thought

How scaling changed AI culture & what’s next?


“Studying the past tells you what’s
special about the current moment.”

How we made progress, early 2010s to 2017
(pre-transformer deep learning)

[Figure: a benchmark like ImageNet maps inputs x to labels y; the recipe was to make that mapping as good as possible by stacking hand-designed improvements on a baseline: better architectures, inductive biases, optimization tweaks, and so on.]

Success looks like: “On the ImageNet dataset, our method outperformed the baseline by 5% while using half the compute.”
What is scaling?

Scaling is when you put yourself in a situation where you move along a continuous axis and expect sustained improvement.

[Figure: capability (bad → good) plotted against something you scale, usually compute, data, or model size.]
Scaling is everywhere

GPT-2 (2019), Scaling laws (2020), GPT-3 (2020), Chinchilla (2022), PaLM (2022)
Scaling is hard and was not obvious at the time

Technical & operational challenges:
(1) Distributed training requires a lot of expertise
(2) Loss divergences and hardware failures are hurdles
(3) Compute is expensive

Psychological challenges:
(1) Researchers like inductive biases
(2) Scaling is different from human learning
(3) Scientific research incentives (“novelty”) don’t match engineering work
Why scale?

Not scaling: each improvement in the model requires ingenuity on a new axis, and there are a lot of tasks that we want AI to do.

Scaling-centric AI: you can reliably improve capability (even if it’s expensive), and if your measure of capability is very general, extreme investment is justified.
The Bitter Lesson of AI

General methods that leverage compute are the most effective.

Things that scale will ultimately win out.
Paradigm 1: Scaling next-word prediction

Started in 2018, still ongoing

Get really, really good at predicting the next word.

Why do you get so much from “just” predicting the next word?
Next-word prediction is massively multi-task learning.
Review: next-word prediction

Prompt: “On weekends, Dartmouth students like to ___”

[Figure: the model assigns each candidate next word a probability between 0.0 and 1.0: a, aardvark, …, drink, …, study, …, zucchini.]

“Goodness” of the model is how close its prediction of the actual next word is to 1.0.
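To make the “goodness” criterion concrete, here is a minimal sketch of the scoring step, with made-up toy logits standing in for a real model’s outputs:

```python
import math

# Toy logits a model might assign to candidate next words for the prompt
# "On weekends, Dartmouth students like to ___". The numbers are invented
# for illustration; a real model produces one logit per vocabulary item.
logits = {"a": 0.1, "aardvark": -4.0, "drink": 3.2, "study": 2.8, "zucchini": -5.0}

def softmax(scores):
    # Subtract the max logit for numerical stability before exponentiating.
    m = max(scores.values())
    exps = {w: math.exp(s - m) for w, s in scores.items()}
    total = sum(exps.values())
    return {w: e / total for w, e in exps.items()}

probs = softmax(logits)
actual_next_word = "drink"

# The model is "good" to the extent p(actual next word) is close to 1.0;
# training minimizes -log p(actual next word), the cross-entropy loss.
print(f"p({actual_next_word}) = {probs[actual_next_word]:.3f}")
print(f"loss = {-math.log(probs[actual_next_word]):.3f}")
```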
Example “tasks” from next-word prediction

Task: Example sentence in pre-training that would teach that task
Grammar: In my free time, I like to {code, banana}
World knowledge: The capital of Azerbaijan is {Baku, London}
Sentiment analysis: Movie review: I was engaged and on the edge of my seat the whole time. The movie was {good, bad}
Translation: The word for “neural network” in Russian is {нейронная сеть, привет}
Spatial reasoning: Iroh went into the kitchen to make tea. Standing next to Iroh, Zuko pondered his destiny. Zuko left the {kitchen, store}
Math question: Arithmetic exam answer key: 3 + 8 + 4 = {15, 11}

[millions more]

Extreme multi-task learning!


Scaling predictably improves performance (“scaling laws”)

Kaplan et al., 2020: “Language modeling performance improves smoothly as we increase the model size, dataset size, and amount of compute for training.”

[Figure: next-word prediction capability vs. training compute (data × model size); the curve keeps rising and doesn’t saturate.]

Jason’s rephrase: You should expect to get a better language model if you scale up compute.
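A minimal sketch of what such a scaling law looks like as code, in the spirit of Kaplan et al.; the constants below are illustrative placeholders, not the paper’s fitted values:

```python
# Power-law scaling of loss with training compute, L(C) = (C_c / C)^alpha.
# Both constants are placeholders chosen for exposition only.

def loss_from_compute(compute, c_c=3.1e8, alpha=0.050):
    """Loss as a power law in training compute."""
    return (c_c / compute) ** alpha

for c in [1e0, 1e2, 1e4, 1e6]:
    print(f"compute = {c:.0e} -> loss ~ {loss_from_compute(c):.2f}")
# Each 100x increase in compute buys a steady multiplicative drop in loss:
# smooth, predictable improvement rather than saturation.
```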


Why does scaling work?
Hard to answer, but here is a hand-wavy explanation.

Small language model: memorization is costly, so it mostly learns first-order correlations.

Large language model: more generous with memorizing tail knowledge, and can afford complex heuristics.
If scaling was so predictable, why was the success of this paradigm so surprising?

Next-word prediction is secretly massively multi-task, and performance on different tasks improves at different rates.

Let’s take a closer look at next-word prediction accuracy. Consider that

Overall accuracy = 0.002 * accuracy_grammar +
                   0.005 * accuracy_knowledge +
                   0.000001 * accuracy_sentiment_analysis +
                   0.0001 * accuracy_math_ability +
                   0.000001 * accuracy_spatial_reasoning

🤔 If overall accuracy goes from 70% to 80%, do all tasks get better uniformly?

…probably not.
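A quick numeric illustration (all weights and accuracies invented): the overall number can climb substantially while a rare, heavily down-weighted task barely moves.

```python
# Made-up weights (share of pre-training attributable to each hidden
# "task") and per-task accuracies; all numbers invented for illustration.
weights = {"grammar": 0.002, "knowledge": 0.005, "math": 0.0001}

def overall(acc):
    # Weighted mix, normalized over just the tasks listed here.
    total = sum(weights.values())
    return sum(weights[t] * acc[t] for t in weights) / total

small_model = {"grammar": 0.60, "knowledge": 0.40, "math": 0.02}
large_model = {"grammar": 0.95, "knowledge": 0.75, "math": 0.05}

print(f"overall: {overall(small_model):.2f} -> {overall(large_model):.2f}")
# Overall accuracy climbs from ~0.45 to ~0.80, but nearly all of the gain
# comes from the heavily weighted grammar/knowledge terms; math barely
# moves, and its eventual jump would look "emergent".
```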
[Figure: capability vs. compute. “Easy” tasks (e.g., grammar) improve early; overall capability rises smoothly; “hard” tasks (e.g., math) stay flat for a long time and then jump: emergent abilities / phase transitions.]
Emergent ability example

Prompt:
Input (English): I like to play soccer and tennis
Target (Spanish): Me gusta jugar al fútbol y al tenis

[Figure: BLEU score vs. model scale. The smaller models “ada” and “babbage” just repeat the English input (“I like to play soccer and tennis”); the model “curie” suddenly figures out to translate and not repeat.]
[Figure: a “spectrum of possible tasks” from easy to hard: have correct grammar, give basic facts, write a summary, translate a sentence, write a coherent essay, do basic math problems, write a decent poem, help debug code, hard math problems, scientific research, write a novel. GPT-2 (2019), GPT-3 (2020), and GPT-4 (2023) each cover progressively more of the spectrum.]
🤔 If next-word prediction works so well, can we scale it to reach AGI?

Maybe (it would be hard), but there is a bottleneck:

Some words are super hard to predict and take a lot of work.
When next-word prediction works fine vs. when next-word prediction becomes very hard:
Pretend you’re ChatGPT. As soon as you see the prompt you have to immediately start typing… go!

Question: What is the square of ((8-2)*3+4)^3 / 8?
(A) 1,483,492
(B) 1,395,394
(C) 1,771,561

Tough, right?
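For the record, a two-line check confirms option (C):

```python
# Verify the arithmetic: (8-2)*3 + 4 = 22, 22^3 = 10648, 10648 / 8 = 1331,
# and 1331 squared is 1,771,561, so the answer is (C).
inner = ((8 - 2) * 3 + 4) ** 3 // 8
print(inner, inner ** 2)  # 1331 1771561
```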
[Figure: amount of compute used (tokens) vs. difficulty of task. Pure next-word prediction spends exactly 1 token regardless of difficulty (bad). Where we want to be: compute grows with difficulty, from giving the capital of the state of California to providing the answer to a multiple-choice competition math problem.]
An approach: chain-of-thought prompting

[Figure: a few-shot exemplar showing question → chain of thought → answer, followed by an unseen input that the model completes the same way.]

Chain-of-thought prompting elicits reasoning in large language models. Wei et al., 2022.
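A minimal sketch of what such a prompt looks like in practice, using the worked arithmetic exemplar from Wei et al. (2022); `send_to_model` is a hypothetical stand-in for whatever completion API you use:

```python
# Few-shot chain-of-thought prompt: a hand-written exemplar showing
# question -> chain of thought -> answer, followed by the unseen input.
PROMPT = """\
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is
6 tennis balls. 5 + 6 = 11. The answer is 11.

Q: The cafeteria had 23 apples. If they used 20 to make lunch and
bought 6 more, how many apples do they have?
A:"""

def send_to_model(prompt: str) -> str:
    # Hypothetical stand-in: call your LLM completion endpoint here. Given
    # the exemplar, the model should reason step by step before answering.
    raise NotImplementedError
```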
System 1: fast, intuitive thinking. Automatic, effortless, intuitive, emotional. Examples: recognizing faces, repeating basic facts, reacting to something. ↔ Next-word prediction.

System 2: slow, deliberate thinking. Conscious, effortful, controlled, logical. Examples: solving math problems, planning a detailed agenda, making a thoughtful decision. ↔ Chain of thought.
The limitation with CoT prompting

Most reasoning on the internet looks like a polished, final write-up. What we actually want is the inner “stream of thought”:

Hm, let me first see what approach we should take…
Actually this seems wrong.
No, that approach won’t work, let me try something else.
Let me try computing this way now.
OK, I think this is the right answer!
Paradigm 2: Scaling RL on chain-of-thought
Train language models to “think” before giving an answer.

In addition to scaling compute for training, there is a second axis here: scaling how long the language model can think at inference time.
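One simple way to spend variable inference-time compute is to sample several chains of thought and majority-vote over the final answers, in the spirit of self-consistency; this is a sketch of that idea, not a description of how o1 works (which is not public). `sample_chain_of_thought` is a hypothetical stand-in:

```python
from collections import Counter

def sample_chain_of_thought(question: str) -> str:
    # Hypothetical stand-in: sample one chain of thought at temperature > 0
    # from your model and return only the final answer string.
    raise NotImplementedError

def answer_with_votes(question: str, n_samples: int) -> str:
    """Majority-vote over n sampled chains of thought. More samples means
    more inference-time compute spent on the question."""
    answers = [sample_chain_of_thought(question) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

# A competition math problem might warrant n_samples=64; an easy factual
# question only n_samples=1.
```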
OpenAI o1 (work of most of the company)

Observation: some problems need more compute than others.

Maybe one forward pass has enough compute to solve hard problems, in principle. But in practice, you want to give the language model variable compute, and in a way that is somewhat similar to the model’s training distribution.
A chain of thought from OpenAI o1

[Figure: an example chain of thought produced by o1.]

Learning to reason with LLMs. OpenAI, September 2024.


CoT allows models to leverage asymmetry of verification

A class of problems has “asymmetry of verification”, which means it’s easier to verify a solution than to generate one.

For example: a crossword puzzle, sudoku, or writing a poem that fits constraints.

Learning to reason with LLMs. OpenAI, September 2024.
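Sudoku makes the asymmetry concrete: checking a completed grid is a few cheap set comparisons, while producing one generally requires search. A minimal sketch of the cheap direction:

```python
def is_valid_sudoku(grid):
    """Check a completed 9x9 grid: every row, column, and 3x3 box must
    contain the digits 1-9 exactly once. Verification is a few cheap set
    comparisons; generating a solution generally requires search."""
    target = set(range(1, 10))
    rows = [set(row) for row in grid]
    cols = [set(col) for col in zip(*grid)]
    boxes = [
        {grid[r + dr][c + dc] for dr in range(3) for dc in range(3)}
        for r in (0, 3, 6)
        for c in (0, 3, 6)
    ]
    return all(group == target for group in rows + cols + boxes)
```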


Scale RL on chain-of-thought

[Figure: performance improves smoothly as RL training compute is scaled up.]

Learning to reason with LLMs. OpenAI, September 2024.


Scale inference-time compute

[Figure: performance also improves the longer the model is allowed to think at test time.]

Learning to reason with LLMs. OpenAI, September 2024.


Why is this special: one day we may want AI to solve very challenging problems

Prompt: Write the code, documentation, and research paper for the best way to make AI safe.

Hypothetical response: Let me think very hard about this… [researches all the existing literature] [data analysis] [conducts new experiments] OK, here is a body of work on how to make AI safe.

[Figure: thinking time stretching along an axis from seconds to minutes, hours, days, weeks, and months.]
How has scaling changed the culture around doing AI research?
Changes in AI research culture: shift to data

2010-2017: make this as good as possible [the model mapping x → y].

Today: make this as good as possible [the data].
Changes in AI culture: we desperately need evals

“People ask me if I’m making an even harder version of GPQA… [well] we set out to make the hardest science benchmark that we could.” — David Rein
Changes in AI culture: highly multi-task models

Language models must be measured on many dimensions.

It is hard to say that one model is strictly better than another.

AI doesn’t need to be human-level at everything.

Intelligence != user experience.


Changes in AI culture: bigger working teams

[Figure: the contributor list from “Building OpenAI o1 (Extended Cut)”; some of many contributors.]
Where will AI continue to progress?

AI for science and healthcare: as an assistant in scientific and medical innovation.

Tool use: enable AI to interact with the world.

More factual AI: reduced hallucinations, citing sources, calibration.

AI applications: more ubiquitous use of AI.

Multimodality: AI to see, hear, and speak.
2019: Can barely write a coherent paragraph; can’t do any reasoning.

2024: Can write an essay about almost anything; competition-level programmer and mathematician.

2029: ?

Scaling has been the engine of progress in AI and will continue to dictate how the field advances.

X / Twitter: @_jasonwei
OpenAI roles: [email protected]

Feedback? https://tinyurl.com/jasonwei
