01 The Transformer
Source: The Mathematics of Statistical Machine Translation: Parameter Estimation. Brown et al., 1993
What is an alignment?
A matrix of matches between words, or a vector of matched word positions.
Greedy Decoding
Teacher Forcing
Socher, Manning. CS224n, 2017
Attention
Attention: at different steps, let the model focus on different (more relevant) parts of the source tokens.
Core idea: on each step of the decoder, use a direct connection to the encoder to
focus on a particular part of the source sequence.
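To make the core idea concrete, here is a minimal PyTorch sketch (not from the cited slides) of one decoder step attending over the encoder states. It uses a simple dot-product score for brevity; Bahdanau et al. (2015) use an additive scoring network, so treat the function and variable names as illustrative.

```python
import torch
import torch.nn.functional as F

def attend(decoder_state, encoder_states):
    """decoder_state: (hidden,); encoder_states: (src_len, hidden)."""
    # Score each source position against the current decoder state (dot product).
    scores = encoder_states @ decoder_state      # (src_len,)
    # Normalize scores into attention weights over the source tokens.
    weights = F.softmax(scores, dim=-1)          # (src_len,)
    # Weighted sum of encoder states = context vector for this decoding step.
    context = weights @ encoder_states           # (hidden,)
    return context, weights

# Toy usage: 5 source tokens, hidden size 8.
enc = torch.randn(5, 8)
dec = torch.randn(8)
context, weights = attend(dec, enc)  # weights show which source tokens this step focuses on
```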
Attention: Example
Attention
Better interpretability:
● Networks learn alignment as a byproduct of translation.
● We can look at which fragments of the source the model attended to when producing each target word.
Bahdanau et al. Neural Machine Translation by Jointly Learning to Align and Translate, 2015
Attention
• Attention is a new basic layer type (along with feed-forward,
convolutional and recurrent)
• Works with variable length inputs (texts)
● like RNNs, CNNs
● unlike feed-forward
• Used in SOTA models in lots of tasks (question answering, image
captioning, ...)
The Transformer
Get rid of RNNs in MT? Now: attention + attention + attention.
Lukasz Kaiser. 2017. Tensor2Tensor Transformers: New Deep Models for NLP. Lecture at Stanford University.
https://github.com/king-menin/mipt-nlp2022
Transformer: high-level
https://github.com/king-menin/mipt-nlp2022
Positional Encoding
Positional encoding provides order information to the model.
https://lena-voita.github.io/nlp_course/seq2seq_and_attention.html
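The original Transformer (Vaswani et al., 2017) uses fixed sinusoidal encodings added to the token embeddings; the sketch below assumes that scheme (and an even d_model) and is illustrative rather than any course's reference code.

```python
import math
import torch

def sinusoidal_positional_encoding(max_len, d_model):
    """Fixed sinusoidal encodings from Vaswani et al. (2017): (max_len, d_model)."""
    pos = torch.arange(max_len).unsqueeze(1).float()    # (max_len, 1) positions
    i = torch.arange(0, d_model, 2).float()             # even dimension indices
    div = torch.exp(-math.log(10000.0) * i / d_model)   # 1 / 10000^(2i / d_model)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)                  # sine on even dims
    pe[:, 1::2] = torch.cos(pos * div)                  # cosine on odd dims
    return pe

# Added to the token embeddings so the model can tell positions apart:
# x = token_embeddings + sinusoidal_positional_encoding(seq_len, d_model)
```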
Transformers. Self-attention
Previously: one decoder state looked at all encoder states. Now: each state looks at all the other states.
Self-attention:
• tokens interact with each other
• each token "looks" at the other tokens, gathers context, and updates the previous representation of "self"
https://lena-voita.github.io/nlp_course/seq2seq_and_attention.html
In Parallel!
Query, Key and Value vectors
Each token vector gets three representations:
• query - asking for information;
• key - saying that it has some information;
• value - giving the information.
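A minimal single-head sketch of how the query, key and value projections combine into scaled dot-product self-attention. The dimensions (d_model=512, d_head=64) follow the base Transformer; the class and variable names are illustrative, not the course's reference code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SingleHeadSelfAttention(nn.Module):
    """One attention head: every token attends to every token (illustrative sizes)."""
    def __init__(self, d_model=512, d_head=64):
        super().__init__()
        self.q = nn.Linear(d_model, d_head)   # query: "asking for information"
        self.k = nn.Linear(d_model, d_head)   # key: "saying that it has some information"
        self.v = nn.Linear(d_model, d_head)   # value: "giving the information"
        self.scale = d_head ** -0.5

    def forward(self, x):                     # x: (bs, m, d_model)
        Q, K, V = self.q(x), self.k(x), self.v(x)         # each (bs, m, d_head)
        scores = Q @ K.transpose(-2, -1) * self.scale     # (bs, m, m) token-to-token scores
        weights = F.softmax(scores, dim=-1)               # each token attends to all tokens
        return weights @ V                                # (bs, m, d_head)
```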
Masked self-attention
The decoder has a different self-attention: masked self-attention.
Future tokens are masked out (their scores are set to -inf) before the softmax; see the sketch below.
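A small sketch of the causal mask, assuming the common implementation where an upper-triangular boolean mask pushes scores for future positions to -inf before the softmax; the sizes and names are illustrative.

```python
import torch

# Causal (look-ahead) mask for decoder self-attention: position i may not
# attend to positions j > i. Applied to the score matrix before the softmax.
m = 5                                                    # sequence length
scores = torch.randn(m, m)                               # raw attention scores
mask = torch.triu(torch.ones(m, m, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(mask, float("-inf"))         # future tokens -> -inf
weights = torch.softmax(scores, dim=-1)                  # zero probability on future tokens
```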
Multi-head attention
• We need to know relationships between different tokens in a sentence: syntactic relationships, lexical preferences, order, grammar issues like case or gender agreement.
• Instead of having one attention mechanism, multi-head attention has several "heads" which work independently and focus on different things.
Multi-head attention
Ashish Vaswani and Anna Huang. Self-Attention For Generative Models
Multi-head attention implementation
Head outputs: (bs, m, 8, 64). Concat: view as (bs, m, 512). Result is (bs, m, 512). A code sketch follows below.
Rush. The Annotated Transformer. https://nlp.seas.harvard.edu/2018/04/03/attention.html
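Below is a hedged sketch that follows the shapes on the slide (8 heads × 64 dims = 512), not the exact code from The Annotated Transformer; the fused QKV projection and the names are implementation choices of this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    """Sketch with the slide's shapes: 8 heads x 64 dims = 512."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        assert d_model % n_heads == 0
        self.h, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)   # project to Q, K, V at once
        self.out = nn.Linear(d_model, d_model)       # final output projection

    def forward(self, x):                            # x: (bs, m, 512)
        bs, m, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)       # each (bs, m, 512)
        # Split into heads: (bs, m, 8, 64) -> (bs, 8, m, 64)
        split = lambda t: t.view(bs, m, self.h, self.d_head).transpose(1, 2)
        q, k, v = split(q), split(k), split(v)
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5   # (bs, 8, m, m)
        ctx = F.softmax(scores, dim=-1) @ v                     # (bs, 8, m, 64)
        # Concat heads: back to (bs, m, 8, 64), then view as (bs, m, 512)
        ctx = ctx.transpose(1, 2).contiguous().view(bs, m, self.h * self.d_head)
        return self.out(ctx)                                    # result is (bs, m, 512)
```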
Multi-head self-attention in encoder
Jay Alammar. The Illustrated Transformer. http://jalammar.github.io/illustrated-transformer/
Extra
Feed-forward blocks
Each layer has a feed-forward network block: two linear layers with a ReLU non-linearity between them.
Position-wise FFNN:
● Linear → ReLU → Linear
● Base: 512 → 2048 → 512
● Large: 1024 → 4096 → 1024
● Equivalent to two conv layers with kernel size 1
Rush. The Annotated Transformer. https://nlp.seas.harvard.edu/2018/04/03/attention.html
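A short sketch of the position-wise feed-forward block with the base dimensions (512 → 2048 → 512). The dropout placement after the ReLU follows The Annotated Transformer; otherwise the code is illustrative.

```python
import torch.nn as nn

class PositionwiseFFN(nn.Module):
    """Two linear layers with a ReLU in between, applied independently at every position.
    Base config: 512 -> 2048 -> 512 (Large: 1024 -> 4096 -> 1024)."""
    def __init__(self, d_model=512, d_ff=2048, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x):          # x: (bs, m, d_model) -> (bs, m, d_model)
        return self.net(x)
```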
Transformer layer (enc)
Rush. The Annotated Transformer. https://nlp.seas.harvard.edu/2018/04/03/attention.html
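As a rough sketch of how the pieces fit into one encoder layer, the snippet below reuses the MultiHeadSelfAttention and PositionwiseFFN classes sketched earlier and uses post-norm residual connections, as in the 2017 paper; it is illustrative, not the reference implementation.

```python
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One encoder layer: self-attention sublayer + FFN sublayer,
    each wrapped in a residual connection and layer norm (post-norm)."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.attn = MultiHeadSelfAttention(d_model, n_heads)   # sketched above
        self.ffn = PositionwiseFFN(d_model, d_ff, dropout)     # sketched above
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x):                              # x: (bs, m, d_model)
        x = self.norm1(x + self.drop(self.attn(x)))    # residual + dropout + layer norm
        x = self.norm2(x + self.drop(self.ffn(x)))
        return x
```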
Transformer layer (dec) unrolled
Jay Alammar. The Illustrated Transformer. http://jalammar.github.io/illustrated-transformer/
Transformers. One more time
Regularization
● Dropout:
- on residual connections (each sub-layer output, before it is added back)
- after the ReLU in the feed-forward block
- on the input embeddings
● Attention dropout (only for some experiments)
– dropout on attention weights (after the softmax)
● Label smoothing
Noam optimizer: Adam + a warmup-then-decay learning-rate schedule (see the sketch below).
Rush. The Annotated Transformer. https://nlp.seas.harvard.edu/2018/04/03/attention.html
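The learning-rate schedule referenced above is the "Noam" schedule as given in The Annotated Transformer / Vaswani et al.; a minimal sketch:

```python
def noam_lr(step, d_model=512, warmup=4000):
    """Noam schedule: lr = d_model^-0.5 * min(step^-0.5, step * warmup^-1.5).
    Linear warmup for `warmup` steps, then decay proportional to 1/sqrt(step)."""
    step = max(step, 1)  # avoid step = 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

# Typically combined with Adam (the paper uses betas=(0.9, 0.98), eps=1e-9),
# e.g. by wrapping noam_lr in a PyTorch LambdaLR scheduler with base lr = 1.0.
```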
Training
Source: Abigail See. CS224N/Ling284 slides: http://web.stanford.edu/class/cs224n
Language Modelling
Source: Abigail See. CS224N/Ling284 slides: http://web.stanford.edu/class/cs224n
n-gram Language Modeling
Source: Abigail See. CS224N/Ling284 slides: http://web.stanford.edu/class/cs224n/
Disadvantages of n-gram LMs
– they need smoothing
– they can't handle long histories
– they cannot generalize well over contexts of similar words
Source: https://web.stanford.edu/~jurafsky/slp3/7.pdf
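A toy count-based bigram model (illustrative, not from the cited chapter) makes the first two problems visible: any bigram unseen in training gets probability zero unless we smooth, and the model only conditions on a single previous word.

```python
from collections import Counter

corpus = "the cat sat on the mat . the dog sat on the rug .".split()
bigrams = Counter(zip(corpus, corpus[1:]))   # counts of adjacent word pairs
unigrams = Counter(corpus)                   # counts of single words

def p_mle(w, prev):
    """Maximum-likelihood bigram estimate P(w | prev) = count(prev, w) / count(prev)."""
    return bigrams[(prev, w)] / unigrams[prev]

print(p_mle("cat", "the"))   # seen bigram -> 0.25
print(p_mle("mat", "dog"))   # unseen bigram -> 0.0, hence the need for smoothing
```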
Neural Language Models: Motivation
(+) a neural language model has much higher predictive accuracy than
an n-gram language model!
(–) neural net language models are strikingly slower to train than
traditional language models
Source: https://web.stanford.edu/~jurafsky/slp3/7.pdf
Basic types of transformer architectures
BERT-like models: encoder only. In encoders, each token attends to every other token. They are good for understanding texts.
Example (masked LM): input [CLS] М ##ама [MASK] [MASK] р ##ам ##у . [SEP]; the model predicts the masked tokens мы ##ла.
GPT-like models: decoder only. They are good for generating texts.
Example (left-to-right LM): <s> М ##ама мы ##ла р ##ам ##у .
The classical transformer (also T5, BART, etc.): encoder and decoder.
Example (translation): encoder input <s> М ##ама мы ##ла р ##ам ##у . <e>; the decoder generates <s> Мo ##m was was ##hing the frame . <e>
(The Russian example "Мама мыла раму" means "Mom was washing the frame".)
Do you really want a transformer?
• Probably yes, if:
• There is a pretrained transformer for your language and task (or a similar one)
• You have a small training set and want to generalize from it
• Your task is natural language generation and you have no templates
• Probably no, if:
• Your problem can be solved with rules or keywords
• You need fast (around 1ms) inference without GPU
• You need fast training without GPU
• You have to process very long texts without splitting them
• Your task needs some specific architecture (e.g. memory or recursion)