
Don’t teach. Incentivize.

MIT EI seminar

Hyung Won Chung

OpenAI
Non-goal: share specific technical knowledge and experimental results

Goal: share how I think, using AI as a running example


Why?

We, the technical people, focus too much on problem solving itself

In my view, more attention should go to finding great problems to solve

Great researchers are good at finding impactful problems. I think this ability
comes from having the right perspective.

I hope this talk sparks interest in developing original perspectives, which in turn
help you find better problems to solve
Outline

Build the scale-first perspective for AI research in general

Interpret Large Language Models with this perspective


[Figure: compute over time, with brain-scale compute power marked. Roughly, 10x more compute every 5 years. Figure from Rich Sutton's WAIC keynote.]


Hardware is exponentially progressing

Software and algorithms should catch up

We need more scalable methods that can better leverage computation


The job of AI researchers is to teach machines how to “think”

One (unfortunately common) approach

Teach the machines how we think we think

But we don’t know how we think at the neuron level

So we are teaching something we don't fully understand, in the limited language of mathematics

This approach imposes structure on the problem, and that structure can become the limitation when
scaled up

Bitter lesson

Progress of AI in the past 70 years boils down to

● Develop progressively more general methods with less structure


● Add more data and computation (i.e. scale up)

http://www.incompleteideas.net/IncIdeas/BitterLesson.html
The more structure imposed by humans, the less scalable the method is

[Figure: Performance vs. Compute, comparing a method with less structure and a method with more structure; the less-structured method scales better with compute.]
Sobering observation

Clever structures imposed by human researchers typically become the bottleneck when scaled up

What is good in the long run almost necessarily looks bad in the short term

Compute is getting cheaper faster than we are becoming better researchers

Give machines more degrees of freedom. Let them choose how they learn
Why are these observations not so obvious?

Researchers want to add modeling ideas because that is academically more satisfying

Some people think "just scaling up" is not scientific or interesting


Ultimately what do we want to achieve with artificial intelligence?

We should focus on:

maximizing the value generated by AI while minimizing the downside

regardless of which academic discipline achieves the goal


HWC’s definition of scaling

Common definition: doing the same thing with more machines

Scaling implicitly involves identifying the modeling assumption that bottlenecks further scaling and replacing it with a more scalable one
Large Language Models (LLMs)
All LLMs so far use the Transformer architecture
Let’s take a “functional” viewpoint on the Transformer

Sequence-to-sequence mapping
with a bunch of matmuls

Input: [d, n]

Output: [d, n]
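To make the "sequence-to-sequence mapping with a bunch of matmuls" viewpoint concrete, here is a minimal sketch, not the actual Transformer: a single toy layer (single-head attention, no causal mask, no residuals or layer norm) that maps a [d, n] array to a [d, n] array. All names and sizes are illustrative.

```python
# A toy, single-head "Transformer-like" layer: a [d, n] array in, a [d, n] array out,
# built entirely out of matmuls plus a softmax and a ReLU. Illustrative only.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def toy_transformer_layer(X, Wq, Wk, Wv, W1, W2):
    """X: [d, n] activations (one column per token). Returns [d, n]."""
    d, n = X.shape
    Q, K, V = Wq @ X, Wk @ X, Wv @ X               # each [d, n]
    A = softmax((Q.T @ K) / np.sqrt(d), axis=-1)   # [n, n] attention weights
    attn_out = V @ A.T                             # [d, n]
    hidden = np.maximum(W1 @ attn_out, 0.0)        # [4d, n], ReLU MLP
    return W2 @ hidden                             # back to [d, n]

d, n = 8, 13
rng = np.random.default_rng(0)
X = rng.normal(size=(d, n))
params = [rng.normal(size=s) for s in [(d, d), (d, d), (d, d), (4 * d, d), (d, 4 * d)]]
Y = toy_transformer_layer(X, *params)
assert Y.shape == (d, n)  # same [d, n] shape in and out
```

The [d, n] convention (feature dimension by sequence length) matches the shapes on these slides; real implementations add residual connections, layer normalization, and multi-head attention so that many such layers can be stacked and trained stably.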
Process → Shape

"Many words don't map to one token: indivisible." → []

Tokenization
[7085, 2456, 836, 470, 3975, 284, 530, 11241, 25, 773, 452, 12843, 13] → [n]

Embedding
[a matrix of continuous vectors, one d-dimensional column per token] → [d, n]

N Transformer layers
[a transformed matrix of the same size] → [d, n]

Loss function (predict next token given previous)
2.6 → []
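A shape-level sketch of this pipeline, assuming toy sizes, random weights, and the toy_transformer_layer from the earlier sketch; the tokenizer step is skipped and the token ids are simply taken from the slide, so none of this is a real tokenizer or trained model.

```python
# Shape-level sketch of: token ids [n] -> embeddings [d, n] -> N layers [d, n] -> scalar loss.
# Assumptions: toy sizes, random weights, toy_transformer_layer from the earlier sketch.
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d, num_layers = 50257, 8, 4

token_ids = np.array([7085, 2456, 836, 470, 3975, 284, 530, 11241, 25, 773, 452, 12843, 13])
n = token_ids.shape[0]                                    # shape: [n]

embedding_table = rng.normal(size=(d, vocab_size))
X = embedding_table[:, token_ids]                         # shape: [d, n]

for _ in range(num_layers):
    params = [rng.normal(size=s) * 0.1 for s in [(d, d), (d, d), (d, d), (4 * d, d), (d, 4 * d)]]
    X = toy_transformer_layer(X, *params)                 # shape stays [d, n]

logits = embedding_table.T @ X                            # [vocab_size, n]
logits = logits - logits.max(axis=0, keepdims=True)       # stable log-softmax
log_probs = logits - np.log(np.exp(logits).sum(axis=0, keepdims=True))

# Next-token prediction: position t predicts token t+1; the loss is a single scalar.
loss = -log_probs[token_ids[1:], np.arange(n - 1)].mean()
print(X.shape, logits.shape, float(loss))                 # (8, 13) (50257, 13) <scalar>
```

Real models differ in every detail (tokenizer, vocabulary, width, depth), but the shapes flow the same way: [n] token ids in, [d, n] activations through every layer, and a single scalar loss out.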
Original sentence

Given "many", predict the next token:
apple: 0.01
don: 0.001
…
intelligence: 0.00001
words: 0.02

Given "many words", predict the next token:
apple: 0.00003
don: 0.03
…
intelligence: 0.00001
words: 0.0000001

The probability of a sentence is the product of these conditional probabilities. Maximize this.
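Written out in standard notation (a standard formulation, not copied from the slides; theta denotes the model parameters):

```latex
% Chain rule over tokens: the sentence probability factorizes into next-token
% conditionals; training maximizes this product, i.e. minimizes the negative
% log-likelihood (the per-token cross-entropy) summed over positions.
P_\theta(x_1, \dots, x_n) = \prod_{t=1}^{n} P_\theta(x_t \mid x_1, \dots, x_{t-1}),
\qquad
\mathcal{L}(\theta) = -\sum_{t=1}^{n} \log P_\theta(x_t \mid x_{<t})
```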


Feed web-scale text data to the Transformer

Sequence-to-sequence mapping
with a bunch of matmuls

Input: [d_model, length]

Output: [d_model, length]

Web-scale text data


Somehow the model learns to perform many, many tasks despite being trained only on
next-token prediction

Chowdhery et al (2022)
Some observations on the next-token prediction task

We don’t directly teach any linguistic concepts (e.g. verb, subject, whatever)

Simply by predicting next tokens over a large corpus, the model learns languages

Language is learned almost as a by-product of doing such a task

The model can do some “reasoning” (e.g. math, code)


Next token prediction as massive implicit multitask learning

This terrible movie was really boring

After the earnings call, the share price of Google went up by 5% from $1,000, ending at $1,050

인공지능 연구원들은 코딩을 잘 못합니다. ("AI researchers are not good at coding.")

The first law of thermodynamics is often called conservation of energy

BILLIONS of sentences

TRILLIONS of task types


Massive multitask learning hypothesis

Beyond some scale, the easiest way to do well on next-token prediction is for
the model to find a set of general skills that are applicable to many tasks.

For example, these skills include learning languages, understanding, and reasoning.

Crucially, we don't directly teach any of these skills to the model. We weakly
incentivize the model, and the abilities emerge.

Abilities that emerge are typically more general skill sets. In order for abilities to
emerge, they should be incentivized as opposed to being directly taught

Weakly incentivizing the model requires a lot more compute, i.e. it is a more
scalable teaching strategy
For a given dataset and a learning objective, there is an explicit learning signal
and a set of induced incentives

Next-token prediction with web-scale data

● explicit signal: predict the next token

● induced incentive: understand language and reasoning, etc.
Example 2: Playing chess with {0, 1} reward at the end of the game

Explicit signal: win the game

Induced incentive: learn what moves are good


Example 3: Hallucinations

Reward structure for a simple question-answering scenario:

● 1 if the answer is correct and unhedged
● 0.5 if the answer is correct but hedged
● 0 if the answer is "I don't know"
● -2 if the answer is hedged but wrong
● -4 if the answer is unhedged and wrong

Explicit signal: answer the question correctly

Induced incentive: know what you don’t know

Adapted from John Schulman’s talk https://www.youtube.com/watch?v=hhiLw5Q_UFg
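As a minimal sketch of that reward structure in code (the function name and the is_correct / is_hedged / says_idk flags are hypothetical labels a grader might assign; only the numeric rewards come from the slide):

```python
# Sketch of the QA reward structure above. The boolean flags are hypothetical
# grader labels; only the reward values are taken from the slide.
def qa_reward(is_correct: bool, is_hedged: bool, says_idk: bool) -> float:
    if says_idk:
        return 0.0                        # "I don't know"
    if is_correct:
        return 0.5 if is_hedged else 1.0  # correct answers; hedging costs a little
    return -2.0 if is_hedged else -4.0    # wrong answers; hedging softens the penalty
```

Under this scheme an unhedged guess with confidence p has expected reward p*1 + (1-p)*(-4) = 5p - 4, which beats the 0 reward of "I don't know" only when p > 0.8, so the model is incentivized to know what it doesn't know.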


Loose analogy

Give a man a fish, and you feed him for a day.

Teach a man to fish, and you feed him for a lifetime.

Teach him the taste of fish and make him hungry


[Figure: the three approaches ("Give a man a fish", "Teach him how to fish", "Teach him the taste of fish and make him hungry") arranged by increasing resource required: time for humans, compute for machines.]


Small specialist models vs large generalist model

The belief that small specialist models can win on a narrow domain assumes that
there is a tradeoff between being a generalist and a specialist
The specialist-generalist tradeoff doesn't apply to machines

Such a tradeoff exists because every human operates with the same
time budget. Machines do not.

One model gets to enjoy a lot more compute than others

It is akin to someone having access to the "Room of Spirit and Time" from Dragon Ball;
one year inside that room is a day outside
The importance of incentive structure is not new. Why now?

No amount of bananas can incentivize monkeys to do mathematical reasoning

A threshold level of intelligence is necessary for the incentive structure to work for a given
problem

I think we have crossed that threshold for many tasks


Whether the induced incentive structure works depends on the model size

What abilities emerge depends on the model size

If the model is too small, it might just give up on learning high-level skills
such as reasoning and instead rely on heuristics-based pattern recognition
Some abilities emerge with scale

Having the right perspective is crucial


Emergent Abilities of Large Language Models
Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph et al. (2022)
Perspective of “yet”

"This idea doesn't work" → "This idea doesn't work yet"


Why is the perspective of “yet” not so obvious?

We are used to operating in an environment where underlying axioms don’t change

You run an experiment for your new scientific idea. It doesn't work now. You know
that it will still not work if you run it 3 years later

For language models, the most capable model serves as an "axiom" for many
research experiments run on top of it
Need for constant unlearning

Many ideas get outdated and invalidated at larger scale

We need to constantly unlearn intuitions built on such invalidated ideas

With less to unlearn, newcomers can have advantages over more experienced
ones. This is an interesting neutralizing force
Highly simplified view of emergent abilities

[Figure: three panels, one per ability (Ability 1, Ability 2, Ability 3); each plots the ability against scale, with GPT-3 and GPT-4 marked on the scale axis. Different abilities emerge at different scales.]


Closing

Compute cost is decreasing exponentially

AI researchers should harness this by designing scalable methods

The current generation of LLMs relies on next-token prediction, which can be thought of as a
weak incentive structure for learning general skills such as reasoning

More generally, we should incentivize models instead of directly teaching specific skills

Emergent abilities necessitate having the right perspective, such as the perspective of "yet", and constant unlearning


Thank you!

Twitter: @hwchung27
Don’t teach. Incentivize.
MIT EI seminar

Hyung Won Chung

OpenAI
