Scaling Paradigms for Large Language Models
Jason Wei
Research Scientist
OpenAI
A benchmark like ImageNet: make this as good as possible.
GPT-2 (2019), Scaling laws (2020), GPT-3 (2020), Chinchilla (2022), PaLM (2022)
Scaling is hard and was not obvious at the time
Technical & operational challenges; psychological challenges
Why do you get so much from “just” predicting the next word?
Next-word prediction is massively multi-task learning.
Review: next-word prediction
[Figure: a model's probability distribution (from 0.0 to 1.0) over possible next words for the prompt "On weekends, Dartmouth students like to ___", with candidates such as "a", "aardvark", "drink", "study", …, "zucchini". The "goodness" of the model is how close its predicted probability of the actual next word is to 1.0.]
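To make "goodness" concrete, here is a toy sketch (the numbers are made up for illustration, not taken from the talk): the model's score at one position is the probability it assigns to the word that actually came next, and pre-training minimizes the corresponding cross-entropy loss.

```python
# Toy illustration with made-up numbers: next-word prediction is scored by how
# much probability the model puts on the word that actually appeared next.
import math

# Model's predicted distribution for "On weekends, Dartmouth students like to ___"
# (the remaining probability mass sits on the rest of the vocabulary).
predicted = {"a": 0.01, "aardvark": 0.001, "drink": 0.45, "study": 0.30, "zucchini": 0.002}

actual_next_word = "drink"        # the word that actually followed in the training text
p = predicted[actual_next_word]   # "goodness": ideally close to 1.0
loss = -math.log(p)               # cross-entropy loss minimized during pre-training

print(f"p(actual next word) = {p:.2f}, loss = {loss:.3f}")
```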
Example “tasks” from next-word prediction
Task: example sentence in pre-training that would teach that task
Grammar: In my free time, I like to {code, banana}
World knowledge: The capital of Azerbaijan is {Baku, London}
Sentiment analysis: Movie review: I was engaged and on the edge of my seat the whole time. The movie was {good, bad}
Translation: The word for "neural network" in Russian is {нейронная сеть, привет}
Spatial reasoning: Iroh went into the kitchen to make tea. Standing next to Iroh, Zuko pondered his destiny. Zuko left the {kitchen, store}
Math question: Arithmetic exam answer key: 3 + 8 + 4 = {15, 11}
[millions more]
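A small sketch of how the implicit "tasks" in the table above can be probed in practice: compare the probability a pretrained language model assigns to each candidate continuation. The snippet below uses GPT-2 via Hugging Face transformers purely as an illustration (it is not a model or setup from the talk), and it handles the prompt/continuation token boundary only approximately.

```python
# Score candidate next words with a pretrained causal LM (illustrative only).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def continuation_logprob(prompt: str, continuation: str) -> float:
    """Sum of log-probabilities the model assigns to `continuation` after `prompt`."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        log_probs = torch.log_softmax(model(full_ids).logits, dim=-1)
    total = 0.0
    # Position t predicts token t+1; score only the continuation tokens.
    for t in range(prompt_len - 1, full_ids.shape[1] - 1):
        total += log_probs[0, t, full_ids[0, t + 1]].item()
    return total

prompt = "The capital of Azerbaijan is"
for candidate in [" Baku", " London"]:
    print(candidate, continuation_logprob(prompt, candidate))
# A well-trained model should put more probability on " Baku": world knowledge
# falls out of plain next-word prediction.
```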
[Plot: capability vs. compute. On "hard" tasks (e.g., math), capability stays flat and then jumps sharply with scale: emergent abilities / a phase transition.]
Emergent ability example
[Figure: an example prompt and performance across model sizes (ada, babbage, …).]
When next-word prediction works fine vs. when next-word prediction becomes very hard
Pretend you’re ChatGPT. As soon as you see the prompt you have to immediately start typing… go!
(A) 1,483,492
(B) 1,395,394
(C) 1,771,561
Tough, right?
[Figure: amount of compute used (tokens) per answer. Pure next-word prediction spends just 1 token, which is bad; "where we want to be" allows far more compute per answer.]
[Figure: chain-of-thought prompting. For an unseen input question, the model generates a chain of thought before the answer.]
Chain-of-thought prompting elicits reasoning in large language models. Wei et al., 2022.
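For concreteness, a minimal sketch of what a chain-of-thought prompt looks like: the few-shot exemplar shows intermediate reasoning before the answer, so the model tends to produce a chain of thought for the unseen input as well. The exemplar below is paraphrased from the paper's standard tennis-ball example; the unseen question is an arbitrary illustration.

```python
# Minimal chain-of-thought prompt template (exemplar paraphrased from Wei et al., 2022).
COT_PROMPT = """\
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11.

Q: {question}
A:"""

unseen_question = "A cafeteria had 23 apples. They used 20 for lunch and bought 6 more. How many apples do they have?"
prompt = COT_PROMPT.format(question=unseen_question)
print(prompt)  # send this to a language model; it should emit reasoning steps, then the final answer
```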
System 1 (fast, intuitive thinking): automatic, effortless, intuitive, emotional
System 2 (slow, deliberate thinking): conscious, effortful, controlled, logical
The limitation with CoT prompting
Most reasoning on the internet looks like this…
What we actually want is the inner "stream of thought":
"Hm, let me first see what approach we should take…"
"Actually this seems wrong."
"No, that approach won't work, let me try something else."
"Let me try computing this way now."
"OK, I think this is the right answer!"
Paradigm 2: Scaling RL on chain-of-thought
Train language models to “think” before giving an answer
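At a very high level, one training round of this paradigm might look like the skeleton below. This is a heavily simplified, REINFORCE-style sketch of the general idea (sample a thinking trace, reward only the verifiable final answer), not OpenAI's actual method; `sample_with_thinking` and `sequence_logprob` are hypothetical placeholders for model-specific code.

```python
# Heavily simplified sketch of RL on chain-of-thought (not OpenAI's actual setup).
# `model.sample_with_thinking` and `model.sequence_logprob` are hypothetical APIs.
def rl_on_cot_step(model, optimizer, problem: str, correct_answer: str) -> float:
    # 1. Sample a chain of thought followed by a final answer.
    chain_of_thought, answer = model.sample_with_thinking(problem)

    # 2. Outcome-based reward: grade only the final answer against a verifiable label.
    reward = 1.0 if answer.strip() == correct_answer.strip() else 0.0

    # 3. Reinforce: raise the log-probability of traces that ended in a correct answer.
    loss = -reward * model.sequence_logprob(problem, chain_of_thought + answer)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward
```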
A chain of thought from OpenAI o1
Hypothetical response
Changes in AI research culture: shift to data
Changes in AI culture: we desperately need evals
“People ask me if I’m making an even harder version of GPQA… [well] we set out to make the hardest science benchmark that we could”
- David Rein
Changes in AI culture: highly multi-task models
Where will AI continue to progress?
Multimodality: AI to see, hear, and speak
X / Twitter: @_jasonwei
OpenAI roles: [email protected]
Feedback? https://tinyurl.com/jasonwei