Deep Network Notes
John Hewitt
Lecture 10: Pretraining
Lecture Plan
1. A brief note on subword modeling
2. Motivating model pretraining from word embeddings
3. Model pretraining three ways
1. Decoders
2. Encoders
3. Encoder-Decoders
4. Interlude: what do we think pretraining is teaching?
5. Very large models and in-context learning
Reminders:
Assignment 5 is out today! It covers lecture 9 (Tuesday) and lecture 10 (Today)!
It has ~pedagogically relevant math~ so get started!
Word structure and subword models
Let’s take a look at the assumptions we’ve made about a language’s vocabulary.
We assume a fixed vocab of tens of thousands of words, built from the training set.
All novel words seen at test time are mapped to a single UNK.
Word structure and subword models
Finite vocabulary assumptions make even less sense in many languages.
• Many languages exhibit complex morphology, or word structure.
• The effect is more word types, each occurring fewer times.
[Wiktionary]
The byte-pair encoding algorithm
Subword modeling in NLP encompasses a wide range of methods for reasoning about
structure below the word level. (Parts of words, characters, bytes.)
• The dominant modern paradigm is to learn a vocabulary of parts of words (subword tokens).
• At training and testing time, each word is split into a sequence of known subwords.
Byte-pair encoding was originally used in NLP for machine translation; now a similar method (WordPiece) is used in pretrained models.
In the worst case, words are split into as many subwords as they have characters.
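To make the merge loop concrete, here is a minimal Python sketch of BPE vocabulary learning (my own illustration, not the implementation of any particular toolkit; real BPE also adds an end-of-word symbol, which is omitted here): count adjacent symbol pairs across the corpus and repeatedly merge the most frequent pair into a new subword token.

```python
from collections import Counter

def learn_bpe_merges(corpus_words, num_merges):
    """Learn BPE merges from a {word: frequency} dict. Minimal sketch, not optimized."""
    # Represent each word as a tuple of symbols (characters to start).
    vocab = {tuple(word): freq for word, freq in corpus_words.items()}
    merges = []
    for _ in range(num_merges):
        # Count frequencies of adjacent symbol pairs across the corpus.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)  # most frequent adjacent pair
        merges.append(best)
        # Merge that pair everywhere it occurs.
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] = freq
        vocab = new_vocab
    return merges

# Example on a toy corpus of word frequencies.
print(learn_bpe_merges({"tasty": 10, "taste": 8, "tea": 12}, num_merges=5))
```

At test time, each word is greedily segmented using the learned merges, so frequent words stay whole while rare words fall apart into smaller known pieces.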
Outline
1. A brief note on subword modeling
2. Motivating model pretraining from word embeddings
3. Model pretraining three ways
1. Decoders
2. Encoders
3. Encoder-Decoders
4. Interlude: what do we think pretraining is teaching?
5. Very large models and in-context learning
Motivating word meaning and context
Recall the adage we mentioned at the beginning of the course:
“You shall know a word by the company it keeps” (J. R. Firth 1957: 11)
Consider I record the record: the two instances of record mean different things.
[Thanks to Yoav Goldberg on Twitter for pointing out the 1935 Firth quote.]
Where we were: pretrained word embeddings
Circa 2017:
• Start with pretrained word embeddings (no context!)
• Learn how to incorporate context in an LSTM or Transformer while training on the task.
[Figure: pretrained word embeddings feed an LSTM/Transformer that is not pretrained and produces the prediction ŷ]
Where we’re going: pretraining whole models
In modern NLP:
• All (or almost all) parameters in NLP networks are initialized via pretraining.
• Pretraining methods hide parts of the input from the model, and train the model to reconstruct those parts.
• This has been effective at building:
• representations of language
• parameter initializations for strong NLP models
• probability distributions over language that we can sample from
[Figure: the whole network is pretrained jointly. This model has learned how to represent entire sentences through pretraining.]
What can we learn from reconstructing the input?
I went to the ocean to see the fish, turtles, seals, and _____.
The Transformer Encoder-Decoder [Vaswani et al., 2017]
Looking back at the whole model, zooming in on an Encoder block:
[Figure: a stack of Transformer Encoder blocks beside a stack of Transformer Decoder blocks; the decoder attends to encoder states and produces the predictions over the output sequence. The blocks are built from Multi-Head Attention, Feed-Forward layers, and Residual + LayerNorm sublayers, with Multi-Head Cross-Attention in the decoder.]
The Transformer Encoder-Decoder [Vaswani et al., 2017]
The only new part is attention from the decoder to the encoder. Like we saw last week!
[Figure: inside the Transformer Decoder block, Multi-Head Cross-Attention over the encoder states is followed by Residual + LayerNorm and a Feed-Forward layer with its own Residual + LayerNorm, producing the predictions over the output sequence.]
Pretraining through language modeling [Dai and Le, 2015]
Recall the language modeling task:
• Model $p_\theta(w_t \mid w_{1:t-1})$, the probability distribution over words given their past contexts.
• There's lots of data for this! (In English.)
Pretraining through language modeling:
• Train a neural network to perform language modeling on a large amount of text.
• Save the network parameters.
[Figure: a Decoder (Transformer, LSTM, ++) reads "Iroh goes to make tasty tea" and predicts "goes to make tasty tea END"]
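As a concrete illustration, here is a hedged PyTorch-style sketch of this recipe: train a decoder on next-token prediction over a large corpus, then save the parameters. The `model` and `batches` objects are placeholders for whatever Transformer/LSTM and data pipeline you use.

```python
import torch
import torch.nn as nn

# Assumed toy setup: `model` maps token ids (batch, T) -> logits (batch, T, vocab_size)
# with causal (left-to-right) masking inside, e.g. a small Transformer decoder.
def pretrain_language_model(model, batches, vocab_size, lr=3e-4, device="cpu"):
    model.to(device).train()
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for tokens in batches:                                 # tokens: LongTensor (batch, T)
        tokens = tokens.to(device)
        inputs, targets = tokens[:, :-1], tokens[:, 1:]    # predict w_t from w_{1:t-1}
        logits = model(inputs)                             # (batch, T-1, vocab_size)
        loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # Save the network parameters for later finetuning.
    torch.save(model.state_dict(), "pretrained_lm.pt")
```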
The Pretraining / Finetuning Paradigm
Pretraining can improve NLP applications by serving as parameter initialization.
Step 1: Pretrain (on language modeling). Lots of text; learn general things!
Step 2: Finetune (on your task). Not many labels; adapt to the task!
[Figure: the same Decoder (Transformer, LSTM, ++) is first pretrained to predict "goes to make tasty tea END", then finetuned to predict a task label such as ☺/☹]
Stochastic gradient descent and pretrain/finetune
Why should pretraining and finetuning help, from a “training neural nets” perspective?
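One way to make the question precise (a sketch of the standard framing, using my own notation $\mathcal{L}_{\text{pretrain}}$, $\mathcal{L}_{\text{finetune}}$, $\hat{\theta}$):

```latex
% Pretraining (approximately) minimizes the pretraining loss over parameters:
\hat{\theta} \approx \arg\min_{\theta} \; \mathcal{L}_{\text{pretrain}}(\theta)
% Finetuning then runs SGD on the downstream loss, starting from \hat{\theta}:
\theta^{*} \approx \arg\min_{\theta} \; \mathcal{L}_{\text{finetune}}(\theta),
\quad \text{with SGD initialized at } \theta = \hat{\theta}
```

The empirical observation is that running SGD on the finetuning loss from the initialization $\hat{\theta}$, rather than from a random initialization, tends to reach solutions that generalize well on the downstream task, even with few labels.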
Lecture Plan
1. A brief note on subword modeling
2. Motivating model pretraining from word embeddings
3. Model pretraining three ways
1. Decoders
2. Encoders
3. Encoder-Decoders
4. Interlude: what do we think pretraining is teaching?
5. Very large models and in-context learning
Pretraining for three types of architectures
The neural architecture influences the type of pretraining, and natural use cases.
Pretraining decoders
When using language model pretrained decoders, we can ignore that they were trained to model $p(w_t \mid w_{1:t-1})$.
Gradients backpropagate through the whole network.
[Figure: a ☺/☹ classifier sits on top of the pretrained decoder. Note how the linear layer hasn't been pretrained and must be learned from scratch.]
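A hedged sketch of this setup in code: a randomly initialized linear layer on top of the pretrained decoder's last hidden state, with gradients flowing through everything. The decoder's interface here (token ids in, hidden states out) is an assumption for illustration.

```python
import torch
import torch.nn as nn

class DecoderClassifier(nn.Module):
    """Classifier on top of a pretrained decoder (sketch)."""
    def __init__(self, pretrained_decoder, hidden_dim, num_classes):
        super().__init__()
        self.decoder = pretrained_decoder                     # parameters come from pretraining
        self.classifier = nn.Linear(hidden_dim, num_classes)  # randomly initialized, not pretrained!

    def forward(self, tokens):
        hidden_states = self.decoder(tokens)       # (batch, T, hidden_dim), assumed API
        last_state = hidden_states[:, -1, :]       # the last word's hidden state h_T
        return self.classifier(last_state)         # task logits; gradients also reach the decoder
```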
Pretraining decoders
It's natural to pretrain decoders as language models and then use them as generators, finetuning their $p_\theta(w_t \mid w_{1:t-1})$!
$h_1, \dots, h_T = \text{Decoder}(w_1, \dots, w_T)$
$w_t \sim A h_{t-1} + b$
[Figure: the decoder generates $w_1, w_2, w_3, w_4, w_5$ one word at a time, each sampled from the distribution defined by the previous hidden state.]
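And a sketch of the generator use: sample one word at a time from the distribution that $A h_{t-1} + b$ defines. Here `model` is assumed to return next-token logits directly, and the END token id is a placeholder.

```python
import torch

@torch.no_grad()
def sample_from_decoder(model, prefix_ids, max_new_tokens, end_id):
    """Autoregressive sampling sketch: w_t ~ softmax(A h_{t-1} + b).
    `model` is assumed to map token ids (1, t) -> logits (1, t, vocab_size)."""
    tokens = prefix_ids.clone()                    # (1, t0), e.g. a dialogue history
    for _ in range(max_new_tokens):
        logits = model(tokens)[:, -1, :]           # logits for the next word
        probs = torch.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)   # sample w_t
        tokens = torch.cat([tokens, next_token], dim=1)
        if next_token.item() == end_id:            # stop at END
            break
    return tokens
```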
Generative Pretrained Transformer (GPT) [Radford et al., 2018]
For sentence-pair tasks such as natural language inference, the input is formatted as a single token sequence, e.g.:
[START] The man is in the doorway [DELIM] The person is near the door [EXTRACT]
[Table: GPT results on various natural language inference datasets]
Increasingly convincing generations (GPT-2) [Radford et al., 2019]
We mentioned how pretrained decoders can be used in their capacities as language models.
GPT-2, a larger version of GPT trained on more data, was shown to produce relatively
convincing samples of natural language.
Pretraining for three types of architectures
The neural architecture influences the type of pretraining, and natural use cases.
Pretraining encoders: what pretraining objective to use?
So far, we’ve looked at language model pretraining. But encoders get bidirectional
context, so we can’t do language modeling!
BERT: Bidirectional Encoder Representations from Transformers
Devlin et al., 2018 proposed the “Masked LM” objective and released the weights of a
pretrained Transformer, a model they labeled BERT.
• The pretraining input to BERT was two separate contiguous chunks of text:
• BERT was trained to predict whether one chunk follows the other or is randomly
sampled.
• Later work has argued this “next sentence prediction” is not necessary.
If your task involves generating sequences, consider using a pretrained decoder; BERT and other
pretrained encoders don’t naturally lead to nice autoregressive (1-word-at-a-time) generation
methods.
[Figure: masked LM — input "Iroh goes to [MASK] tasty tea"; the model reconstructs "Iroh goes to make tasty tea"]
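A hedged sketch of the masked-LM corruption step: choose a random subset of positions (roughly 15% in BERT), replace them with [MASK], and compute the loss only there. BERT's full recipe also sometimes substitutes a random token or leaves the original word in place; that detail is omitted here.

```python
import torch

def mask_tokens(tokens, mask_id, mask_prob=0.15, ignore_index=-100):
    """Masked-LM corruption sketch. tokens: LongTensor (batch, T).
    Returns (corrupted_inputs, labels); labels are ignore_index at unmasked positions."""
    labels = tokens.clone()
    mask = torch.rand(tokens.shape) < mask_prob        # choose ~15% of positions
    corrupted = tokens.clone()
    corrupted[mask] = mask_id                          # replace chosen tokens with [MASK]
    labels[~mask] = ignore_index                       # loss is computed only at masked positions
    return corrupted, labels

# The encoder then predicts the original word at each masked position, e.g.
# "Iroh goes to [MASK] tasty tea" -> predict "make" at the masked slot.
```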
Extensions of BERT
You’ll see a lot of BERT variants like RoBERTa, SpanBERT, +++
Some generally accepted improvements to the BERT pretraining formula:
• RoBERTa: mainly just train BERT for longer and remove next sentence prediction!
• SpanBERT: masking contiguous spans of words makes a harder, more useful pretraining task
[Figure: BERT masks individual subword tokens, while SpanBERT masks contiguous spans]
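For the SpanBERT-style variant, a minimal sketch that hides one contiguous span per sequence (the real method samples span lengths from a particular distribution and masks several spans; a single fixed-length span is used here only for illustration):

```python
import random
import torch

def mask_one_span(tokens, mask_id, span_length=3, ignore_index=-100):
    """Mask one contiguous span per sequence (span-masking sketch).
    tokens: LongTensor (batch, T)."""
    corrupted, labels = tokens.clone(), tokens.clone()
    batch, T = tokens.shape
    for b in range(batch):
        start = random.randrange(0, max(1, T - span_length))
        span = slice(start, start + span_length)
        corrupted[b, span] = mask_id            # hide a contiguous chunk of words
        keep = torch.ones(T, dtype=torch.bool)
        keep[span] = False
        labels[b, keep] = ignore_index          # loss only on the masked span
    return corrupted, labels
```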
Pretraining encoder-decoders: what pretraining objective to use?
For encoder-decoders, we could do something like language modeling, but where a
prefix of every input is provided to the encoder and is not predicted.
$h_1, \dots, h_T = \text{Encoder}(w_1, \dots, w_T)$
$h_{T+1}, \dots, h_{2T} = \text{Decoder}(w_{T+1}, \dots, w_{2T}, h_1, \dots, h_T)$
$y_i \sim A h_i + b, \; i > T$
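A small sketch of how a batch could be split for this objective, under the assumption that each training sequence has length 2T and its first half is the unpredicted prefix; names and shapes are placeholders.

```python
import torch

def prefix_lm_batch(tokens, prefix_len):
    """Split (batch, 2T)-shaped token ids into encoder input and decoder input/targets.
    The prefix w_{1:T} is seen bidirectionally by the encoder and never predicted."""
    enc_input = tokens[:, :prefix_len]            # w_1 ... w_T  -> Encoder
    dec_input = tokens[:, prefix_len - 1:-1]      # shifted suffix fed to the Decoder
    dec_target = tokens[:, prefix_len:]           # w_{T+1} ... w_{2T} are predicted
    return enc_input, dec_input, dec_target
```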
Pretraining encoder-decoders: what pretraining objective to use?
What Raffel et al., 2019 found to work best was span corruption. Their model: T5.
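A hedged sketch of span corruption on a tokenized string (the sentinel-token scheme and span sampling in T5 differ in details, e.g. a final sentinel appended to the targets): spans of the input are replaced by sentinel tokens, and the target reconstructs the dropped-out text.

```python
def span_corrupt(tokens, spans, sentinels):
    """Sketch of span corruption on a list of string tokens.
    spans: list of (start, end) index pairs to drop; sentinels: e.g. ["<X>", "<Y>", ...]."""
    corrupted, target, prev_end = [], [], 0
    for sentinel, (start, end) in zip(sentinels, spans):
        corrupted += tokens[prev_end:start] + [sentinel]   # input keeps a placeholder
        target += [sentinel] + tokens[start:end]           # output reconstructs the span
        prev_end = end
    corrupted += tokens[prev_end:]
    return corrupted, target

# Toy example:
tok = "Thank you for inviting me to your party last week".split()
inp, out = span_corrupt(tok, spans=[(2, 4), (8, 9)], sentinels=["<X>", "<Y>"])
# inp: Thank you <X> me to your party <Y> week
# out: <X> for inviting <Y> last
```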
Pretraining encoder-decoders: what pretraining objective to use?
Raffel et al., 2019 found encoder-decoders to work better than decoders for their tasks,
and span corruption (denoising) to work better than language modeling.
Pretraining encoder-decoders: what pretraining objective to use?
A fascinating property of T5: it can be finetuned to answer a wide range of questions, retrieving knowledge from its parameters.
What kinds of things does pretraining learn?
There’s increasing evidence that pretrained models learn a wide variety of things about
the statistical properties of language. Taking our examples from the start of class:
• Stanford University is located in __________, California. [Trivia]
• I put ___ fork down on the table. [syntax]
• The woman walked across the street, checking for traffic over ___ shoulder. [coreference]
• I went to the ocean to see the fish, turtles, seals, and _____. [lexical semantics/topic]
• Overall, the value I got from the two hours watching it was the sum total of the popcorn
and the drink. The movie was ___. [sentiment]
• Iroh went into the kitchen to make some tea. Standing next to Iroh, Zuko pondered his
destiny. Zuko left the ______. [some reasoning – this is harder]
• I was thinking about the sequence that goes 1, 1, 2, 3, 5, 8, 13, 21, ____ [some basic
arithmetic; they don't learn the Fibonacci sequence]
• Models also learn, and can exacerbate, racism, sexism, and all manner of bad biases.
• More on all this in the interpretability lecture!
Outline
1. A brief note on subword modeling
2. Motivating model pretraining from word embeddings
3. Model pretraining three ways
1. Decoders
2. Encoders
3. Encoder-Decoders
4. Interlude: what do we think pretraining is teaching?
5. Very large models and in-context learning
GPT-3, In-context learning, and very large models
So far, we’ve interacted with pretrained models in two ways:
• Sample from the distributions they define (maybe providing a prompt)
• Fine-tune them on a task we care about, and take their predictions.
Very large language models seem to perform some kind of learning without gradient
steps simply from examples you provide within their contexts.
GPT-3 is the canonical example of this. The largest T5 model had 11 billion parameters.
GPT-3 has 175 billion parameters.
GPT-3, In-context learning, and very large models
Very large language models seem to perform some kind of learning without gradient
steps simply from examples you provide within their contexts.
The in-context examples seem to specify the task to be performed, and the conditional
distribution mocks performing the task to a certain extent.
Input (prefix within a single Transformer decoder context):
“ thanks -> merci
hello -> bonjour
mint -> menthe
otter -> ”
Output (conditional generations):
loutre…”
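A sketch of what "specifying the task in context" amounts to programmatically: concatenate a few input -> output demonstrations with a new input into one prompt, and let the model's conditional distribution complete it. `generate` is a hypothetical stand-in for whatever sampling interface the model exposes.

```python
def few_shot_prompt(examples, query, sep=" -> "):
    """Build an in-context learning prompt from (input, output) demonstrations."""
    lines = [f"{x}{sep}{y}" for x, y in examples]
    lines.append(f"{query}{sep}")          # the model is asked to continue from here
    return "\n".join(lines)

demos = [("thanks", "merci"), ("hello", "bonjour"), ("mint", "menthe")]
prompt = few_shot_prompt(demos, "otter")
# "thanks -> merci\nhello -> bonjour\nmint -> menthe\notter -> "
# completion = generate(model, prompt)   # hypothetical call; a strong LM tends to continue "loutre"
```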
Parting remarks
These models are still not well-understood.
“Small” models like BERT have become general tools in a wide range of settings.
More on this in later lectures!
Assignment 5 is out today! It covers Tuesday's and today's lectures.