Don't Teach. Incentivize.
MIT EI seminar
OpenAI
Non-goal: share specific technical knowledge and experimental results
We, the technical people, focus too much on problem solving itself
Great researchers are good at finding impactful problems. I think this ability
comes from having the right perspective.
I hope this talk sparks interest in developing original perspectives, which in turn
help you find better problems to solve
Outline
This approach imposes structure on the problem, which can become a limitation when
scaled up
Bitter lesson
http://www.incompleteideas.net/IncIdeas/BitterLesson.html
The more structure imposed by humans, the less scalable the method is
[Figure: schematic of performance vs. compute for methods with less structure vs. more structure]
Sobering observation
What is good in the long run almost necessarily looks bad in the short term
Give machines more degrees of freedom. Let them choose how they learn
Why are these observations not so obvious?
Sequence-to-sequence mapping
with a bunch of matmuls
Input: [d, n]
Output: [d, n]
Process and shape at each step:
Original sentence: “Many words don't map to one token: indivisible.”
Tokenization: [7085, 2456, 836, 470, 3975, 284, 530, 11241, 25, 773, 452, 12843, 13] (shape [n])
Embedding: each token id is mapped to a vector of d real numbers (shape [d, n])
N Transformer layers: each layer maps a [d, n] array to another [d, n] array (shape [d, n])
Loss: 2.6 (shape [], a scalar)
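A minimal NumPy sketch of this pipeline (my own, not the speaker's code; the sizes, the random embedding table, and the stand-in Transformer layer are assumptions) to make the shapes concrete:

```python
import numpy as np

vocab_size, d, n_layers = 50257, 768, 12         # assumed GPT-2-like sizes
token_ids = np.array([7085, 2456, 836, 470, 3975, 284, 530, 11241,
                      25, 773, 452, 12843, 13])  # shape [n]
n = token_ids.shape[0]

rng = np.random.default_rng(0)
embedding = rng.normal(size=(vocab_size, d))     # embedding lookup table

x = embedding[token_ids].T                       # shape [d, n]

def transformer_layer(x):
    # Stand-in for attention + MLP: any map from [d, n] to [d, n].
    return x

for _ in range(n_layers):                        # "N Transformer layers"
    x = transformer_layer(x)                     # still [d, n]

logits = embedding @ x                           # [vocab_size, n]: scores for the next
logits -= logits.max(axis=0, keepdims=True)      # token at every position (stable softmax)
probs = np.exp(logits) / np.exp(logits).sum(axis=0, keepdims=True)

# Loss: average negative log-probability of the token that actually comes next,
# a single scalar (shape []), like the 2.6 above.
targets = token_ids[1:]
loss = -np.log(probs[targets, np.arange(n - 1)]).mean()
print(loss)
```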
Original sentence: “Many words don't map to one token: indivisible.”
Given “many”, predict the next token:
apple: 0.01
don: 0.001
intelligence: 0.00001
…
words: 0.02
Given “many words”, predict the next token:
apple: 0.00003
don: 0.03
intelligence: 0.00001
…
words: 0.0000001
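To connect the distributions above to training, here is a toy calculation (probabilities copied from the slide, everything else assumed) of the next-token cross-entropy loss for these two steps of the sentence:

```python
import math

# Model's predicted next-token probabilities for two prefixes of the original sentence
# (only the values shown on the slide are included here).
p_after_many = {"apple": 0.01, "don": 0.001, "intelligence": 1e-5, "words": 0.02}
p_after_many_words = {"apple": 3e-5, "don": 0.03, "intelligence": 1e-5, "words": 1e-7}

# In the original sentence, "many" is followed by "words", and "many words" by "don"('t),
# so the loss is the average negative log-probability assigned to those actual next tokens.
loss = -(math.log(p_after_many["words"]) + math.log(p_after_many_words["don"])) / 2
print(loss)   # roughly 3.7; lower means the model predicts the original sentence better
```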
Sequence-to-sequence mapping
with a bunch of matmuls
Chowdhery et al. (2022)
Some observations on the next-token prediction task
We don’t directly teach any linguistic concepts (e.g. verb, subject, whatever)
Simply by predicting next tokens over a large corpus, the model learns languages
Next-token prediction as massive implicit multitask learning
“After the earnings call, the share price of Google went up by 5% from $1,000, ending at $1,050.”
BILLIONS of sentences
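A small illustration of the implicit multitask view (the task labels and the snippet are my own, not from the talk): predicting different tokens of this one sentence exercises very different skills, including a bit of arithmetic.

```python
# My own labeling of a few implicit tasks hidden in the single example sentence;
# a corpus with billions of sentences contains billions of such implicit tasks.
implicit_tasks = [
    ("... the share", "price", "grammar / collocation"),
    ("... went", "up", "world knowledge: strong earnings usually push the share price up"),
    ("... ending at", "$1,050", "arithmetic: a 5% increase on $1,000"),
]
for context, next_token, skill in implicit_tasks:
    print(f"predicting {next_token!r} after {context!r} requires {skill}")

# The arithmetic task the model is implicitly asked to solve:
price_after = 1000 + 1000 * 5 // 100
print(price_after)   # 1050
```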
Beyond some scale, the easiest way for the model to do well on next-token prediction is
to find a set of general skills that apply to many tasks.
Abilities that emerge are typically more general skill sets. In order for abilities to
emerge, they should be incentivized as opposed to being directly taught
Weakly incentivizing the model requires a lot more compute, but it is a more
scalable teaching strategy
For a given dataset and a learning objective, there is an explicit learning signal
and a set of induced incentives
[Figure: “Give a man a fish”, “Teach him how to fish”, and “Teach him the taste of fish and make him hungry”, compared by time required]
The belief that small specialist models can win on a narrow domain assumes that
there is a tradeoff between being a generalist and being a specialist
Specialist-generalist tradeoff doesn’t apply to machines
That tradeoff exists because every human being operates with the same time budget.
Machines do not.
It is akin to having access to the “Room of Spirit and Time” from Dragon Ball, where
one year inside the room is only a day outside
The importance of incentive structures is not new. Why now?
A threshold level of intelligence is necessary for the incentive structure to work on a
given problem
If the model is too small, it might just give up on learning high-level skills such as
reasoning and instead rely on heuristics-based pattern recognition
Some abilities emerge with scale
You run an experiment for your new scientific idea and it doesn't work now. Do you know
that it will still not work if you run it three years later?
For language models, the most capable model serves as an “axiom” for many
research experiments run on top of it
Need for constant unlearning
With less to unlearn, newcomers can have advantages over more experienced
ones. This is an interesting neutralizing force
Highly simplified view of emergent abilities
More generally, we should incentivize models instead of directly teaching specific skills
Twitter: @hwchung27
Don’t teach. Incentivize.
MIT EI seminar
OpenAI