2024 Stanford CS25 Guest Lecture
Jason Wei
OpenAI
Looking at data = training your biological neural net.
Review: language models
[Figure: the prompt "Dartmouth students like to ___" is fed to a Language Model, which assigns a probability to every possible next word]

Hypothetical next-word probabilities:

Word         Probability
a            0.00001
aardvark     0.000004
…
drink        0.5
…
study        0.23
…
zucchini     0.000002

Loss = -log P(next word | previous words)   (per word, on an unseen test set; pre-training only)

Example: if your loss is 3, then you have a 1/(e^3) probability of getting the next token right on average.
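As a quick sanity check of the example above, here is a minimal Python sketch (the loss value and the formula come from the slide; the script itself is just illustration):

    import math

    # Per-word loss on an unseen test set: loss = -log P(next word | previous words).
    loss = 3.0
    p_correct = math.exp(-loss)   # invert the loss: P(next word) = e^(-loss)
    print(p_correct)              # ~0.0498, i.e. roughly a 1-in-e^3 chance per token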
Intuition 1.
Next-word prediction (on large data) is massively
multi-task learning.
Example tasks from next-word prediction
Task                  Example sentence in pre-training that would teach that task
Grammar               In my free time, I like to {code, banana}
Lexical semantics     I went to the store to buy papaya, dragon fruit, and {durian, squirrel}
World knowledge       The capital of Azerbaijan is {Baku, London}
Sentiment analysis    Movie review: I was engaged and on the edge of my seat the whole time. The movie was {good, bad}
[millions more]
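To make the multi-task framing concrete, here is a minimal sketch that scores candidate continuations from two of the rows above with a single next-word objective. It assumes the Hugging Face transformers library and the public "gpt2" checkpoint, neither of which is part of the lecture; any causal LM would do.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    model.eval()

    def continuation_logprob(context: str, candidate: str) -> float:
        """Sum of log P(token | prefix) over the tokens of `candidate` after `context`."""
        context_ids = tokenizer(context, return_tensors="pt").input_ids
        candidate_ids = tokenizer(" " + candidate, return_tensors="pt").input_ids
        input_ids = torch.cat([context_ids, candidate_ids], dim=1)
        with torch.no_grad():
            logits = model(input_ids).logits
        log_probs = torch.log_softmax(logits, dim=-1)
        total = 0.0
        for i, token_id in enumerate(candidate_ids[0]):
            # The logit at position p predicts the token at position p + 1.
            position = context_ids.shape[1] - 1 + i
            total += log_probs[0, position, token_id].item()
        return total

    # Two "tasks" from the table, both expressed as plain next-word prediction.
    examples = [
        ("In my free time, I like to", ["code", "banana"]),    # grammar / plausibility
        ("The capital of Azerbaijan is", ["Baku", "London"]),  # world knowledge
    ]
    for context, candidates in examples:
        scores = {c: continuation_logprob(context, c) for c in candidates}
        print(context, "->", scores)

The same forward pass answers both "tasks"; nothing task-specific is trained.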
There are a lot of possible “tasks”, and they can be arbitrary
Being a language model is not easy! There are a lot of arbitrary words to predict, and the tasks can be weird and not clean.
Intuition 2.
Scaling language models (size * data = compute) reliably improves loss.
Scaling predictably improves performance (“scaling laws”)
Scaling Laws for Neural Language Models (Kaplan et al., 2020):

[Figure: as compute increases, loss goes down smoothly]

"Language modeling performance improves smoothly as we increase the model size, dataset size, and amount of compute for training."
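The quoted finding is a smooth trend, roughly a power law in compute. The sketch below only illustrates the shape; the functional form is the usual power-law-plus-constant, and every constant in it is made up rather than taken from Kaplan et al.

    # Toy scaling curve: loss falls smoothly as a power law in training compute C.
    # All constants are invented for illustration; only the smooth shape matters.
    def loss_vs_compute(C, a=2.5, alpha=0.05, irreducible=1.7):
        return a * C ** (-alpha) + irreducible

    for C in [1e18, 1e20, 1e22, 1e24]:  # hypothetical compute budgets, in FLOPs
        print(f"C = {C:.0e} FLOPs -> loss ~ {loss_vs_compute(C):.3f}")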
Why does scaling work? Hard to confirm, but here are some guesses
Intuition 3.
While overall loss scales smoothly, individual downstream tasks
may scale in an emergent fashion.
Take a closer look at loss. Consider:
[Figure: loss vs. compute, with the overall loss decomposed into "easily saturated tasks" (e.g., grammar) and "hard tasks" (e.g., math)]
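One toy way (my illustration, not from the lecture) to see how a smoothly falling loss can still look emergent downstream: if a hard task needs, say, 10 tokens in a row to be exactly right, and the per-token probability of being right is p = e^(-loss) as in the loss example earlier, then exact-match accuracy behaves like p^10 and stays near zero until the loss is quite low.

    import math

    def per_token_prob(loss: float) -> float:
        """Per-token probability of the correct token, given per-token loss."""
        return math.exp(-loss)

    # Loss improves smoothly, but the 10-token exact-match metric only takes off late.
    for loss in [1.0, 0.5, 0.2, 0.1, 0.05]:
        p = per_token_prob(loss)
        print(f"loss {loss:.2f}: per-token p = {p:.3f}, 10-token exact match ~ {p**10:.3f}")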
202 downstream tasks in BIG-Bench
[Chart: breakdown of task scaling behavior: smoothly increasing (29%), emergent abilities (33%), not correlated with scale (13%)]
Emergence in prompting: example
[Figure: an example prompt, with performance shown across model sizes such as ada and babbage]
Intuition 4.
Picking a clever set of tasks results in inverse or U-shaped
scaling.
Quote repetition
[Figure: the prompt "Repeat my sentences back to me.", with a sentence ending in "glib"; outputs from a small language model vs. a large language model]
[Figure: three panels of accuracy vs. model scale, one each for the subtasks "Repeat text", "Fix wrong quote", and "Follow instruction"]
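A toy way to see how those three subtask curves can combine into U-shaped scaling on the full task (the logistic curves, thresholds, and combination rule below are my invention for illustration, not Wei et al.'s fit): small models just copy text, medium models start "fixing" the wrong quote, which hurts, and only large models follow the instruction to repeat verbatim.

    import math

    def ability(scale: float, threshold: float, sharpness: float = 2.0) -> float:
        """Toy logistic curve: how well a model at a given (log) scale does a subtask."""
        return 1.0 / (1.0 + math.exp(-sharpness * (scale - threshold)))

    for scale in range(0, 11):                    # arbitrary log-scale units
        repeat = ability(scale, threshold=2)      # copying text is learned early
        fix    = ability(scale, threshold=5)      # "fixing" the wrong quote comes later
        follow = ability(scale, threshold=8)      # obeying the instruction comes last
        # The quote is repeated correctly if the model copies the text and either
        # does not try to "fix" the quote or is able to follow the instruction.
        accuracy = repeat * (1 - fix * (1 - follow))
        print(f"scale {scale:2d}: accuracy ~ {accuracy:.2f}")

Accuracy rises, dips as the "fix the quote" behavior kicks in, then recovers once instruction following does.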
[Recap table: "Large LM intuition" | "General idea"]
Thanks.
X / Twitter: @_jasonwei