
Intuitions on language models

Jason Wei

OpenAI

Stanford CS25 2024 Guest Lecture


1
Fundamental question: Why do large language models work so well?

Thing I've been thinking about recently: Manually inspecting data gives us clear intuitions about how the model works.

2
Looking at data = training your biological neural net.

Your biological neural net makes many observations about the data after reading it. These intuitions can be valuable.

(I once manually annotated an entire lung cancer image classification dataset. Several papers came out of intuitions from that process.)

3
Review: language models (hypothetical)

The prompt "Dartmouth students like to ___" goes into the language model, which assigns a probability to every possible next word:

Word        Probability
a           0.00001
aardvark    0.000004
...
drink       0.5
...
study       0.23
...
zucchini    0.000002

Pre-training only. Loss = -log P(next word | previous words), per word, on an unseen test set.

Example: If your loss is 3, then you have a 1/(e^3) probability of getting the next token right on average.

The best language model is the one that best predicts an unseen test set (i.e., best test loss).
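
Not from the slides, a minimal Python sketch of the loss definition above, showing how a per-token loss maps back to the probability assigned to the correct next word:

# Per-token loss is -log P(next word | previous words), so a loss of 3 means the
# model put probability e^(-3), about 0.05, on the correct next token on average.
import math

def per_token_loss(p_correct: float) -> float:
    """Cross-entropy for one token, given the probability the model assigned to the correct word."""
    return -math.log(p_correct)

print(per_token_loss(0.5))   # 0.693, e.g. predicting "drink" with probability 0.5
print(math.exp(-3))          # ~0.0498, the average P(correct token) when the loss is 3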

4
Intuition 1.
Next-word prediction (on large data) is massively
multi-task learning.

5
Example tasks from next-word prediction
Each row pairs a task with an example sentence from pre-training that would teach that task:

Grammar: In my free time, I like to {code, banana}
Lexical semantics: I went to the store to buy papaya, dragon fruit, and {durian, squirrel}
World knowledge: The capital of Azerbaijan is {Baku, London}
Sentiment analysis: Movie review: I was engaged and on the edge of my seat the whole time. The movie was {good, bad}
Translation: The word for "pretty" in Spanish is {bonita, hola}
Spatial reasoning: Iroh went into the kitchen to make tea. Standing next to Iroh, Zuko pondered his destiny. Zuko left the {kitchen, store}
Math question: Arithmetic exam answer key: 3 + 8 + 4 = {15, 11}

[millions more]

Extreme multi-task learning!
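
To make the "World knowledge" row concrete, here is a rough sketch of scoring the two candidate continuations with an off-the-shelf causal language model and preferring the lower-loss one. The choice of the Hugging Face transformers library and the small GPT-2 checkpoint is mine, not from the talk:

# Score candidate completions by the mean per-token cross-entropy the model assigns
# to the full sentence; the factually correct continuation should score lower.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def sentence_loss(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        return model(ids, labels=ids).loss.item()

for candidate in ["Baku", "London"]:
    print(candidate, round(sentence_loss(f"The capital of Azerbaijan is {candidate}"), 3))
# A reasonably trained model should give "Baku" the lower loss.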

6
There are a lot of possible “tasks”, and they can be arbitrary

Input → Target (Task)

Biden married → Neilia Hunter (world knowledge)
Biden married Neilia Hunter → , (comma prediction)
Biden married Neilia Hunter, → a (grammar)
Biden married Neilia Hunter, a → student (impossible?)

Source: https://en.wikipedia.org/wiki/Joe_Biden

Being a language model is not easy! There are a lot of arbitrary words to predict, and the tasks can be weird and not clean.

7
Intuition 2.
Scaling language models (size * data = compute) reliably improves loss.

8
Scaling predictably improves performance (“scaling laws”)
Scaling laws for neural language models. Kaplan et al., 2020.
Kaplan et al., 2020: "Language modeling performance improves smoothly as we increase the model size, dataset size, and amount of compute for training."

[Figure: test loss goes down smoothly as compute is increased, across seven orders of magnitude.]

Jason's rephrase: You should expect to get a better language model if you scale up compute.
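
A rough illustration of what a scaling law looks like numerically (synthetic numbers, not Kaplan et al.'s actual fits): if loss follows a power law in compute, it is a straight line in log-log space, and the exponent can be read off with a linear fit.

# Generate synthetic (compute, loss) points from an assumed power law
# L(C) = (C_c / C)^alpha and recover the exponent with a log-log linear fit.
import numpy as np

alpha_true, C_c = 0.05, 2.3e8               # made-up constants for the toy example
compute = np.logspace(0, 7, 20)             # seven orders of magnitude of compute
loss = (C_c / compute) ** alpha_true        # noiseless synthetic losses

# In log space: log L = -alpha * log C + alpha * log C_c, i.e. a straight line.
slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)
print(f"recovered exponent: {-slope:.3f}")  # ~0.050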

9
Why does scaling work? Hard to confirm, but just some guesses

Small language model:
  Memorization is costly. "Parameters are scarce, so I have to decide which facts are worth memorizing."
  First-order correlations. "Wow, that token was hard. It was hard enough for me to even get it in the top-10 predictions. Just trying to predict reasonable stuff, I'm not destined for greatness."

Large language model:
  More generous with memorizing tail knowledge. "I have a lot of parameters, so I'll just memorize all the facts, no worries."
  Complex heuristics. "Wow, I got that one wrong. Maybe there's something complicated going on here, let me try to figure it out. I want to be the GOAT."

10
Intuition 3.
While overall loss scales smoothly, individual downstream tasks
may scale in an emergent fashion.

11
Take a closer look at loss. Consider:

Overall loss = 1e-3 * loss_grammar
             + 1e-3 * loss_world_knowledge
             + 1e-6 * loss_sentiment_analysis
             + ...
             + 1e-4 * loss_math_ability
             + 1e-6 * loss_spatial_reasoning

If loss goes from 4 to 3, do all tasks get better uniformly? Probably not.
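
A toy numerical version of this decomposition (all numbers made up) shows how a one-point drop in overall loss can hide very uneven progress across tasks:

# Overall loss is a weighted sum of per-task losses, so the same aggregate drop can
# come almost entirely from the easy tasks while a hard task barely moves.
def overall_loss(task_losses, weights):
    return sum(weights[t] * task_losses[t] for t in task_losses)

weights = {"grammar": 0.4, "world_knowledge": 0.4, "math": 0.2}
before  = {"grammar": 4.0, "world_knowledge": 4.0, "math": 4.0}
after   = {"grammar": 2.6, "world_knowledge": 3.1, "math": 3.9}  # math barely improved

print(overall_loss(before, weights))  # 4.0
print(overall_loss(after, weights))   # ~3.06: overall loss fell by about one point,
                                      # but the hard task (math) is nearly unchanged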

12
[Figure: loss vs. compute, with separate curves for "hard tasks" (e.g., math), "easily saturated tasks" (e.g., grammar), and the overall loss.]

13
202 downstream tasks in BIG-Bench:
  Smoothly increasing: 29%
  Emergent abilities: 33%
  Flat: 22%
  Not correlated with scale: 13%
  Inverse scaling (performance decreases with scale): 2.5%

14
Emergence in prompting: example

Prompt:
  Input (English): I like to play soccer and tennis
  Target (Spanish): Me gusta jugar al fútbol y al tenis

[Figure: BLEU score vs. model scale for the models ada, babbage, and curie.]

ada and babbage output "I like to play soccer and tennis", just repeating the input.
Model "curie" suddenly figures out to translate and not repeat.

15
Intuition 4.
Picking a clever set of tasks results in inverse or U-shaped
scaling.

16
Quote repetition task:
  Repeat my sentences back to me.
  Input: All that glisters is not glib
  Output: All that glisters is not ___
  Correct answer = "glib"

Small language model → "glib"
Medium language model → "gold"
Large language model → "glib"

Inverse scaling can become U-shaped.

17
[Figure: three panels plotting accuracy vs. language model size (tiny, small, large) for the subtasks "repeat text", "fix wrong quote", and "follow instruction".]

18
Large LM intuition: Scaling model size and data is expected to continue improving loss.
General idea: Plot scaling curves to see if doing more of something will be a good strategy.

Large LM intuition: Overall loss improves smoothly, but individual tasks can improve suddenly.
General idea: To better understand aggregate metrics, decompose them into individual categories. Sometimes you'll find errors in the annotation set.

19
Thanks.
X / Twitter: @_jasonwei

I'd love your feedback on this talk: https://tinyurl.com/jasonwei

20
