
Scaling paradigms for large language models

Jason Wei
Research Scientist
OpenAI

(Opinions are my own and do not reflect my employer.)


2019: Can barely write a coherent paragraph; can’t do any reasoning.

2024: Can write an essay about almost anything; competition-level programmer and mathematician.

Scaling has been the engine of progress in AI and will continue to dictate how the field advances.
Outline

What is scaling and why do it?

Paradigm 1: Scaling next-word prediction

The challenge with next-word prediction

Paradigm 2: Scaling RL on chain-of-thought

How scaling changed AI culture & what’s next?


“Studying the past tells you what’s
special about the current moment.”

How we made progress, early 2010s to 2017
(pre-transformer deep learning)

[Figure: a benchmark like ImageNet maps inputs x to labels y; the recipe was to make that mapping as good as possible by stacking hand-designed improvements on a baseline: better architectures, inductive biases, optimization tweaks, and so on.]

Success looks like: “On the ImageNet dataset, our method outperformed the baseline by 5% while using half the compute.”
What is scaling?

Scaling is when you put yourself in a situation where you move along a continuous axis and expect sustained improvement.

[Figure: capability (bad → good) plotted against something you scale, usually compute, data, or model size.]
Scaling is everywhere

GPT-2 (2019), Scaling laws (2020), GPT-3 (2020), Chinchilla (2022), PaLM (2022)
Scaling is hard and was not obvious at the time

Technical & operational challenges:
(1) Distributed training requires a lot of expertise
(2) Loss divergences and hardware failures are hurdles
(3) Compute is expensive

Psychological challenges:
(1) Researchers like inductive biases
(2) Scaling is different from human learning
(3) Scientific research incentives (“novelty”) don’t match engineering work
Why scale?

Not scaling: each improvement in the model requires ingenuity on a new axis, and there are a lot of tasks that we want AI to do.

Scaling-centric AI: you can reliably improve capability (even if it’s expensive), and if your measure of capability is very general, extreme investment is justified.
The Bitter Lesson of AI

General methods that leverage compute are the most effective.

Things that scale will ultimately win out.
Paradigm 1: Scaling next-word prediction

Started in 2018, still ongoing

Get really, really good at predicting the next word.

Why do you get so much from “just” predicting the next word?
Next-word prediction is massively multi-task learning.
Review: next-word prediction

Prompt: “On weekends, Dartmouth students like to ___”

[Figure: the model assigns each candidate next word a probability between 0.0 and 1.0: a, aardvark, …, drink, …, study, …, zucchini.]

“Goodness” of the model is how close its prediction of the actual next word is to 1.0.
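To make the “goodness” criterion concrete, here is a minimal sketch of the scoring step, with made-up toy logits standing in for a real model’s outputs:

```python
import math

# Toy logits a model might assign to candidate next words for the prompt
# "On weekends, Dartmouth students like to ___". The numbers are invented
# for illustration; a real model produces one logit per vocabulary item.
logits = {"a": 0.1, "aardvark": -4.0, "drink": 3.2, "study": 2.8, "zucchini": -5.0}

def softmax(scores):
    # Subtract the max logit for numerical stability before exponentiating.
    m = max(scores.values())
    exps = {w: math.exp(s - m) for w, s in scores.items()}
    total = sum(exps.values())
    return {w: e / total for w, e in exps.items()}

probs = softmax(logits)
actual_next_word = "drink"

# The model is "good" to the extent p(actual next word) is close to 1.0;
# training minimizes -log p(actual next word), the cross-entropy loss.
print(f"p({actual_next_word}) = {probs[actual_next_word]:.3f}")
print(f"loss = {-math.log(probs[actual_next_word]):.3f}")
```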
Example “tasks” from next-word prediction

Task: Example sentence in pre-training that would teach that task
Grammar: In my free time, I like to {code, banana}
World knowledge: The capital of Azerbaijan is {Baku, London}
Sentiment analysis: Movie review: I was engaged and on the edge of my seat the whole time. The movie was {good, bad}
Translation: The word for “neural network” in Russian is {нейронная сеть, привет}
Spatial reasoning: Iroh went into the kitchen to make tea. Standing next to Iroh, Zuko pondered his destiny. Zuko left the {kitchen, store}
Math question: Arithmetic exam answer key: 3 + 8 + 4 = {15, 11}

[millions more]

Extreme multi-task learning!


Scaling predictably improves performance (“scaling laws”)

Kaplan et al., 2020: “Language modeling performance improves smoothly as we increase the model size, dataset size, and amount of compute for training.”

[Figure: next-word prediction capability vs. training compute (data × model size); the curve keeps rising and doesn’t saturate.]

Jason’s rephrase: You should expect to get a better language model if you scale up compute.
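A minimal sketch of what such a scaling law looks like as code, in the spirit of Kaplan et al.; the constants below are illustrative placeholders, not the paper’s fitted values:

```python
# Power-law scaling of loss with training compute, L(C) = (C_c / C)^alpha.
# Both constants are placeholders chosen for exposition only.

def loss_from_compute(compute, c_c=3.1e8, alpha=0.050):
    """Loss as a power law in training compute."""
    return (c_c / compute) ** alpha

for c in [1e0, 1e2, 1e4, 1e6]:
    print(f"compute = {c:.0e} -> loss ~ {loss_from_compute(c):.2f}")
# Each 100x increase in compute buys a steady multiplicative drop in loss:
# smooth, predictable improvement rather than saturation.
```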


Why does scaling work?
Hard to answer, but here is a hand-wavy explanation.

Small language model: memorization is costly, so it mostly learns first-order correlations.

Large language model: more generous with memorizing tail knowledge, and can afford complex heuristics.
If scaling was so predictable, why was the success of this paradigm so surprising?

Next-word prediction is secretly massively multi-task, and performance on different tasks improves at different rates.

Let’s take a closer look at next-word prediction accuracy. Consider that

Overall accuracy = 0.002 * accuracy_grammar +
                   0.005 * accuracy_knowledge +
                   0.000001 * accuracy_sentiment_analysis +
                   0.0001 * accuracy_math_ability +
                   0.000001 * accuracy_spatial_reasoning

🤔 If overall accuracy goes from 70% to 80%, do all tasks get better uniformly?

…probably not.
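A quick numeric illustration (all weights and accuracies invented): the overall number can climb substantially while a rare, heavily down-weighted task barely moves.

```python
# Made-up weights (share of pre-training attributable to each hidden
# "task") and per-task accuracies; all numbers invented for illustration.
weights = {"grammar": 0.002, "knowledge": 0.005, "math": 0.0001}

def overall(acc):
    # Weighted mix, normalized over just the tasks listed here.
    total = sum(weights.values())
    return sum(weights[t] * acc[t] for t in weights) / total

small_model = {"grammar": 0.60, "knowledge": 0.40, "math": 0.02}
large_model = {"grammar": 0.95, "knowledge": 0.75, "math": 0.05}

print(f"overall: {overall(small_model):.2f} -> {overall(large_model):.2f}")
# Overall accuracy climbs from ~0.45 to ~0.80, but nearly all of the gain
# comes from the heavily weighted grammar/knowledge terms; math barely
# moves, and its eventual jump would look "emergent".
```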
[Figure: capability vs. compute. “Easy” tasks (e.g., grammar) improve early; overall capability rises smoothly; “hard” tasks (e.g., math) stay flat for a long time and then jump: emergent abilities / phase transitions.]
Emergent ability example

Prompt:
Input (English): I like to play soccer and tennis
Target (Spanish): Me gusta jugar al fútbol y al tenis

[Figure: BLEU score vs. model scale. The smaller models “ada” and “babbage” just repeat the English input (“I like to play soccer and tennis”); the model “curie” suddenly figures out to translate and not repeat.]
[Figure: a “spectrum of possible tasks” from easy to hard: have correct grammar, give basic facts, write a summary, translate a sentence, write a coherent essay, do basic math problems, write a decent poem, help debug code, hard math problems, scientific research, write a novel. GPT-2 (2019), GPT-3 (2020), and GPT-4 (2023) each cover progressively more of the spectrum.]
🤔 If next-word prediction works so well, can we scale it to reach AGI?

Maybe (it would be hard), but there is a bottleneck:

Some words are super hard to predict and take a lot of work.
When next-word prediction works fine vs. when next-word prediction becomes very hard:
Pretend you’re ChatGPT. As soon as you see the prompt you have to immediately start typing… go!

Question: What is the square of ((8-2)*3+4)^3 / 8?
(A) 1,483,492
(B) 1,395,394
(C) 1,771,561

Tough, right?
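For the record, a two-line check confirms option (C):

```python
# Verify the arithmetic: (8-2)*3 + 4 = 22, 22^3 = 10648, 10648 / 8 = 1331,
# and 1331 squared is 1,771,561, so the answer is (C).
inner = ((8 - 2) * 3 + 4) ** 3 // 8
print(inner, inner ** 2)  # 1331 1771561
```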
[Figure: amount of compute used (tokens) vs. difficulty of task. Pure next-word prediction spends exactly 1 token regardless of difficulty (bad). Where we want to be: compute grows with difficulty, from giving the capital of the state of California to providing the answer to a multiple-choice competition math problem.]
An approach: chain-of-thought prompting

[Figure: a few-shot exemplar showing question → chain of thought → answer, followed by an unseen input that the model completes the same way.]

Chain-of-thought prompting elicits reasoning in large language models. Wei et al., 2022.
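A minimal sketch of what such a prompt looks like in practice, using the worked arithmetic exemplar from Wei et al. (2022); `send_to_model` is a hypothetical stand-in for whatever completion API you use:

```python
# Few-shot chain-of-thought prompt: a hand-written exemplar showing
# question -> chain of thought -> answer, followed by the unseen input.
PROMPT = """\
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is
6 tennis balls. 5 + 6 = 11. The answer is 11.

Q: The cafeteria had 23 apples. If they used 20 to make lunch and
bought 6 more, how many apples do they have?
A:"""

def send_to_model(prompt: str) -> str:
    # Hypothetical stand-in: call your LLM completion endpoint here. Given
    # the exemplar, the model should reason step by step before answering.
    raise NotImplementedError
```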
System 1: fast, intuitive thinking. Automatic, effortless, intuitive, emotional. Examples: recognizing faces, repeating basic facts, reacting to something. ↔ Next-word prediction.

System 2: slow, deliberate thinking. Conscious, effortful, controlled, logical. Examples: solving math problems, planning a detailed agenda, making a thoughtful decision. ↔ Chain of thought.
The limitation with CoT prompting

Most reasoning on the internet looks like a polished, final write-up. What we actually want is the inner “stream of thought”:

Hm, let me first see what approach we should take…
Actually this seems wrong.
No, that approach won’t work, let me try something else.
Let me try computing this way now.
OK, I think this is the right answer!
Paradigm 2: Scaling RL on chain-of-thought
Train language models to “think” before giving an answer.

In addition to scaling compute for training, there is a second axis here: scaling how long the language model can think at inference time.
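One simple way to spend variable inference-time compute is to sample several chains of thought and majority-vote over the final answers, in the spirit of self-consistency; this is a sketch of that idea, not a description of how o1 works (which is not public). `sample_chain_of_thought` is a hypothetical stand-in:

```python
from collections import Counter

def sample_chain_of_thought(question: str) -> str:
    # Hypothetical stand-in: sample one chain of thought at temperature > 0
    # from your model and return only the final answer string.
    raise NotImplementedError

def answer_with_votes(question: str, n_samples: int) -> str:
    """Majority-vote over n sampled chains of thought. More samples means
    more inference-time compute spent on the question."""
    answers = [sample_chain_of_thought(question) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

# A competition math problem might warrant n_samples=64; an easy factual
# question only n_samples=1.
```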
OpenAI o1 (work of most of the company)

Observation: some problems need more compute than others.

Maybe one forward pass has enough compute to solve hard problems, in principle. But in practice, you want to give the language model variable compute, and in a way that is somewhat similar to the model’s training distribution.
A chain of thought from OpenAI o1

[Figure: an example chain of thought produced by o1.]

Learning to reason with LLMs. OpenAI, September 2024.


CoT allows models to leverage asymmetry of verification

A class of problems has “asymmetry of verification”, which means it’s easier to verify a solution than to generate one.

For example: a crossword puzzle, sudoku, or writing a poem that fits constraints.

Learning to reason with LLMs. OpenAI, September 2024.
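Sudoku makes the asymmetry concrete: checking a completed grid is a few cheap set comparisons, while producing one generally requires search. A minimal sketch of the cheap direction:

```python
def is_valid_sudoku(grid):
    """Check a completed 9x9 grid: every row, column, and 3x3 box must
    contain the digits 1-9 exactly once. Verification is a few cheap set
    comparisons; generating a solution generally requires search."""
    target = set(range(1, 10))
    rows = [set(row) for row in grid]
    cols = [set(col) for col in zip(*grid)]
    boxes = [
        {grid[r + dr][c + dc] for dr in range(3) for dc in range(3)}
        for r in (0, 3, 6)
        for c in (0, 3, 6)
    ]
    return all(group == target for group in rows + cols + boxes)
```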


Scale RL on chain-of-thought

[Figure: performance improves smoothly as RL training compute is scaled up.]

Learning to reason with LLMs. OpenAI, September 2024.


Scale inference-time compute

[Figure: performance also improves the longer the model is allowed to think at test time.]

Learning to reason with LLMs. OpenAI, September 2024.


Why is this special: one day we may want AI to solve very challenging problems

Prompt: Write the code, documentation, and research paper for the best way to make AI safe.

Hypothetical response: Let me think very hard about this… [researches all the existing literature] [data analysis] [conducts new experiments] OK, here is a body of work on how to make AI safe.

[Figure: thinking time stretching along an axis from seconds to minutes, hours, days, weeks, and months.]
How has scaling changed the culture around doing AI research?
Changes in AI research culture: shift to data

2010-2017: make this as good as possible [the model mapping x → y].

Today: make this as good as possible [the data].
Changes in AI culture: we desperately need evals

“People ask me if I’m making an even harder version of GPQA… [well] we set out to make the hardest science benchmark that we could.” — David Rein
Changes in AI culture: highly multi-task models

Language models must be measured on many dimensions.

It is hard to say that one model is strictly better than another.

AI doesn’t need to be human-level at everything.

Intelligence != user experience.


Changes in AI culture: bigger working teams

[Figure: the contributor list from “Building OpenAI o1 (Extended Cut)”; some of many contributors.]
Where will AI continue to progress?

AI for science and healthcare: as an assistant in scientific and medical innovation.

Tool use: enable AI to interact with the world.

More factual AI: reduced hallucinations, citing sources, calibration.

AI applications: more ubiquitous use of AI.

Multimodality: AI to see, hear, and speak.
2019: Can barely write a coherent paragraph; can’t do any reasoning.

2024: Can write an essay about almost anything; competition-level programmer and mathematician.

2029: ?

Scaling has been the engine of progress in AI and will continue to dictate how the field advances.

X / Twitter: @_jasonwei
OpenAI roles: [email protected]

Feedback? https://tinyurl.com/jasonwei
