2024 Stanford CS25 Guest Lecture
Jason Wei
OpenAI
Looking at data = training your biological neural net.
Review: language models
[Figure: the prompt "Dartmouth students like to ___" is fed to a Language Model, which assigns a probability to every possible next word]

Hypothetical next-word probabilities:

Word         Probability
a            0.00001
aardvark     0.000004
…
drink        0.5
…
study        0.23
…
zucchini     0.000002

Loss = -log P(next word | previous words)   (per word, on an unseen test set; pre-training only)

Example: if your loss is 3, then you have a 1/(e^3) probability of getting the next token right on average.
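As a quick sanity check of the example above, here is a minimal Python sketch (the loss value and the formula come from the slide; the script itself is just illustration):

    import math

    # Per-word loss on an unseen test set: loss = -log P(next word | previous words).
    loss = 3.0
    p_correct = math.exp(-loss)   # invert the loss: P(next word) = e^(-loss)
    print(p_correct)              # ~0.0498, i.e. roughly a 1-in-e^3 chance per token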
Intuition 1.
Next-word prediction (on large data) is massively
multi-task learning.
Example tasks from next-word prediction
Task                  Example sentence in pre-training that would teach that task
Grammar               In my free time, I like to {code, banana}
Lexical semantics     I went to the store to buy papaya, dragon fruit, and {durian, squirrel}
World knowledge       The capital of Azerbaijan is {Baku, London}
Sentiment analysis    Movie review: I was engaged and on the edge of my seat the whole time. The movie was {good, bad}
[millions more]
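To make the multi-task framing concrete, here is a minimal sketch that scores candidate continuations from two of the rows above with a single next-word objective. It assumes the Hugging Face transformers library and the public "gpt2" checkpoint, neither of which is part of the lecture; any causal LM would do.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    model.eval()

    def continuation_logprob(context: str, candidate: str) -> float:
        """Sum of log P(token | prefix) over the tokens of `candidate` after `context`."""
        context_ids = tokenizer(context, return_tensors="pt").input_ids
        candidate_ids = tokenizer(" " + candidate, return_tensors="pt").input_ids
        input_ids = torch.cat([context_ids, candidate_ids], dim=1)
        with torch.no_grad():
            logits = model(input_ids).logits
        log_probs = torch.log_softmax(logits, dim=-1)
        total = 0.0
        for i, token_id in enumerate(candidate_ids[0]):
            # The logit at position p predicts the token at position p + 1.
            position = context_ids.shape[1] - 1 + i
            total += log_probs[0, position, token_id].item()
        return total

    # Two "tasks" from the table, both expressed as plain next-word prediction.
    examples = [
        ("In my free time, I like to", ["code", "banana"]),    # grammar / plausibility
        ("The capital of Azerbaijan is", ["Baku", "London"]),  # world knowledge
    ]
    for context, candidates in examples:
        scores = {c: continuation_logprob(context, c) for c in candidates}
        print(context, "->", scores)

The same forward pass answers both "tasks"; nothing task-specific is trained.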
There are a lot of possible “tasks”, and they can be arbitrary
Being a language model is not easy! There are a lot of arbitrary words to predict, and the tasks can be weird and not clean.
Intuition 2.
Scaling language models (size * data = compute) reliably improves loss.
Scaling predictably improves performance (“scaling laws”)
Scaling Laws for Neural Language Models (Kaplan et al., 2020):

[Figure: as compute increases, loss goes down smoothly]

"Language modeling performance improves smoothly as we increase the model size, dataset size, and amount of compute for training."
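The quoted finding is a smooth trend, roughly a power law in compute. The sketch below only illustrates the shape; the functional form is the usual power-law-plus-constant, and every constant in it is made up rather than taken from Kaplan et al.

    # Toy scaling curve: loss falls smoothly as a power law in training compute C.
    # All constants are invented for illustration; only the smooth shape matters.
    def loss_vs_compute(C, a=2.5, alpha=0.05, irreducible=1.7):
        return a * C ** (-alpha) + irreducible

    for C in [1e18, 1e20, 1e22, 1e24]:  # hypothetical compute budgets, in FLOPs
        print(f"C = {C:.0e} FLOPs -> loss ~ {loss_vs_compute(C):.3f}")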
Why does scaling work? Hard to confirm, but here are some guesses
Intuition 3.
While overall loss scales smoothly, individual downstream tasks
may scale in an emergent fashion.
Take a closer look at loss. Consider:
[Figure: loss vs. compute, with the overall loss decomposed into "easily saturated tasks" (e.g., grammar) and "hard tasks" (e.g., math)]
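One toy way (my illustration, not from the lecture) to see how a smoothly falling loss can still look emergent downstream: if a hard task needs, say, 10 tokens in a row to be exactly right, and the per-token probability of being right is p = e^(-loss) as in the loss example earlier, then exact-match accuracy behaves like p^10 and stays near zero until the loss is quite low.

    import math

    def per_token_prob(loss: float) -> float:
        """Per-token probability of the correct token, given per-token loss."""
        return math.exp(-loss)

    # Loss improves smoothly, but the 10-token exact-match metric only takes off late.
    for loss in [1.0, 0.5, 0.2, 0.1, 0.05]:
        p = per_token_prob(loss)
        print(f"loss {loss:.2f}: per-token p = {p:.3f}, 10-token exact match ~ {p**10:.3f}")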
202 downstream tasks in BIG-Bench
[Chart: breakdown of task scaling behavior: smoothly increasing (29%), emergent abilities (33%), not correlated with scale (13%)]
Emergence in prompting: example
[Figure: an example prompt, with performance shown across model sizes such as ada and babbage]
Intuition 4.
Picking a clever set of tasks results in inverse or U-shaped
scaling.
Quote repetition
[Figure: the prompt "Repeat my sentences back to me.", with a sentence ending in "glib"; outputs from a small language model vs. a large language model]
[Figure: three panels of accuracy vs. model scale, one each for the subtasks "Repeat text", "Fix wrong quote", and "Follow instruction"]
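A toy way to see how those three subtask curves can combine into U-shaped scaling on the full task (the logistic curves, thresholds, and combination rule below are my invention for illustration, not Wei et al.'s fit): small models just copy text, medium models start "fixing" the wrong quote, which hurts, and only large models follow the instruction to repeat verbatim.

    import math

    def ability(scale: float, threshold: float, sharpness: float = 2.0) -> float:
        """Toy logistic curve: how well a model at a given (log) scale does a subtask."""
        return 1.0 / (1.0 + math.exp(-sharpness * (scale - threshold)))

    for scale in range(0, 11):                    # arbitrary log-scale units
        repeat = ability(scale, threshold=2)      # copying text is learned early
        fix    = ability(scale, threshold=5)      # "fixing" the wrong quote comes later
        follow = ability(scale, threshold=8)      # obeying the instruction comes last
        # The quote is repeated correctly if the model copies the text and either
        # does not try to "fix" the quote or is able to follow the instruction.
        accuracy = repeat * (1 - fix * (1 - follow))
        print(f"scale {scale:2d}: accuracy ~ {accuracy:.2f}")

Accuracy rises, dips as the "fix the quote" behavior kicks in, then recovers once instruction following does.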
[Recap table: "Large LM intuition" | "General idea"]
Thanks.
X / Twitter: @_jasonwei