
The Transformer: motivation, original architecture and attention mechanism
Outline
• Motivation for Transformer
• Attention (from RNNs to Attention is all you need)
• The original Transformer architecture
• Do we need a transformer for our problem?
Why transformers?
• Natural language processing ≈ understanding and generating texts
• One of the most important NLP problems: machine translation
• “Je m'appelle Maria.” → “My name is Maria.”
• What does one need to translate a text?
  1. Understand individual words → subword vocabulary + embeddings
  2. Understand interactions between words (syntax) → encoder self-attention
  3. Translate the words (given the context) → cross-attention
  4. Compose a meaningful and fluent text → decoder self-attention
• The transformer architecture offers a unified, powerful, and reusable way of understanding and generating text
Short history of machine translation.
Motivation for transformers
History of machine translation
● 1950s: Russian→English MT (Cold War)
– Rule-based systems using bilingual dictionaries
● 1990s-2010s: Statistical MT (SMT)
– Mainly Phrase-Based MT (PBMT)
– Learns from (large) sentence-aligned bilingual corpora
– Consists of many separate, very complex components, learnt separately
– Developed by large groups for decades; doesn’t generalize to new language pairs
● Since 2014: Neural MT (NMT)
– A single NN learnt end-to-end
– Learns from (large) sentence-aligned bilingual corpora
– A few (good) student-months to implement
– Better translation quality than SMT systems developed over decades
● Google Translate: NMT, Yandex Translate: NMT+PBMT
History of machine translation: IBM models

[Figure: the statistical MT setup — an original French sentence and its English translation, an English n-gram language model, a translation model and an alignment model; the key model component is the learned word alignment, translation proceeds phrase by phrase, and one of the probability terms can be discarded during decoding.]

Source: The Mathematics of Statistical Machine Translation: Parameter Estimation, Brown et al., 1993
What is an alignment?
Either a matrix of matches between words, or a vector of matched words*

*The vector representation is simpler, but it cannot describe one-to-many or many-to-many alignments

Images source: https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1162/syllabus.shtml


Neural machine translation: seq2seq

[Figure: encoder–decoder RNN seq2seq model; greedy decoding picks the most probable token at each step.]

Sequence to Sequence Learning with Neural Networks, Sutskever et al., 2014
Socher, Manning. CS224n, 2017
Neural machine translation: seq2seq loss

[Figure: seq2seq training loss with teacher forcing — during training the decoder is fed the ground-truth previous tokens.]

Sequence to Sequence Learning with Neural Networks, Sutskever et al., 2014
Socher, Manning. CS224n, 2017
Seq2seq bottleneck problem

[Figure: the entire source sentence has to be squeezed into a single fixed-size vector passed from the encoder to the decoder.]

Socher, Manning. CS224n, 2017
Attention
Attention:
At different steps, let the model focus on different (more relevant) parts of the source tokens.

Core idea:
On each step of the decoder, use a direct connection to the encoder to focus on a particular part of the source sequence.
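A minimal PyTorch sketch of this idea for a single decoder step (the dot-product scoring function and the tensor names are illustrative; the lecture does not commit to a specific attention variant):

```python
import torch

# One decoder step attends over all encoder states (a sketch, not the exact lecture model).
def attend(decoder_state, encoder_states):
    # decoder_state: (hidden,), encoder_states: (src_len, hidden)
    scores = encoder_states @ decoder_state      # (src_len,) similarity with each source position
    weights = torch.softmax(scores, dim=-1)      # attention distribution over source tokens
    context = weights @ encoder_states           # (hidden,) weighted sum of encoder states
    return context, weights

encoder_states = torch.randn(7, 256)             # 7 source tokens, hidden size 256
decoder_state = torch.randn(256)
context, weights = attend(decoder_state, encoder_states)
```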
Attention: Example

[Figures: step-by-step illustration of the decoder attending to different source tokens, one decoding step per slide.]

Socher, Manning. CS224n, 2017
Attention
Results:
– Solves the information bottleneck problem; acts like an additional memory
– Helps gradient propagation from encoder to decoder, especially in long sequences
– Better interpretability

Britz et al., 2017. Massive Exploration of Neural Machine Translation Architectures:

● “we found that the attention-based models exhibited significantly larger gradient updates to decoder states throughout training. This suggests that the attention mechanism acts more like a ‘weighted skip connection’ that optimizes gradient flow than like a ‘memory’ that allows the encoder to access source states, as is commonly stated in the literature”
Attention
Better interpretability
● Networks learn alignment as a byproduct of translation
● We can look at which fragments were translated into which ones

Bahdanau et al. Neural Machine Translation by Jointly Learning to Align and Translate, 2015
Attention
• Attention is a new basic layer type (along with feed-forward, convolutional and recurrent layers)
• Works with variable-length inputs (texts)
  ● like RNNs and CNNs
  ● unlike feed-forward layers
• Used in SOTA models for many tasks (question answering, image captioning, ...)
The Transformer
Get rid of RNNs in MT?

● RNNs are slow because they are not parallelizable across timesteps

Vaswani et al., 2017. Attention is all you need.


The Transformer
Attention is all you need =) 2017
Previously:

• RNN encoder + RNN decoder, interaction via a fixed-size vector

• RNN encoder + RNN decoder, interaction via attention

NOW:

• attention + attention + attention
attention + attention + attention

Lukasz Kaiser. 2017. Tensor2Tensor Transformers: New Deep Models for NLP. Lecture at Stanford University

Vaswani et al., 2017. Attention is all you need.

Transformer: high-level

[Figures: the encoder–decoder structure of the Transformer, built up block by block over several slides.]

https://github.com/king-menin/mipt-nlp2022
Positional Encoding
Positional encoding provides order information to the model.

[Figure: the fixed positional encodings used in the Transformer.]

https://lena-voita.github.io/nlp_course/seq2seq_and_attention.html
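A sketch of computing these fixed encodings, following the sin/cos formula from “Attention is all you need” (the sizes below are just example values):

```python
import math
import torch

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...)."""
    position = torch.arange(max_len).unsqueeze(1).float()                  # (max_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2).float()
                         * (-math.log(10000.0) / d_model))                 # (d_model/2,)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions
    return pe                                      # added to the token embeddings

pe = sinusoidal_positional_encoding(max_len=512, d_model=512)
```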
Transformer: high-level

https://github.com/king-menin/mipt-nlp2022
Transformers. Self-attention
Previously: one decoder state looked at all encoder states.
NOW: each state looks at all other states, in parallel!

Self-attention:
• tokens interact with each other
• each token "looks" at the other tokens
• gathers context
• updates the previous representation of "self"

https://lena-voita.github.io/nlp_course/seq2seq_and_attention.html
Query, Key and Value vectors
Each vector gets three representations:
• query - asking for information;
• key - saying that it has some information;
• value - giving the information.

These projection matrices allow different aspects of the x vectors to be used/emphasized in each of the three roles.

Attention matches queries against keys: the resulting attention weights determine which values the information is taken from.
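A minimal sketch of scaled dot-product attention with learned query/key/value projections (the 512→64 projection size matches one head of the base model; the helper names are ours):

```python
import torch

def scaled_dot_product_attention(Q, K, V, mask=None):
    # Q: (..., m, d_k), K: (..., n, d_k), V: (..., n, d_v)
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5        # (..., m, n): how well each query matches each key
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))
    weights = torch.softmax(scores, dim=-1)              # attention weights
    return weights @ V, weights                          # weighted sum of the values

x = torch.randn(1, 10, 512)                                    # 10 token vectors
W_q, W_k, W_v = (torch.nn.Linear(512, 64) for _ in range(3))   # three roles of the same x
out, w = scaled_dot_product_attention(W_q(x), W_k(x), W_v(x))  # self-attention: Q, K, V all come from x
```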
Masked self-attention
The decoder has a different self-attention => masked self-attention:

– we generate one token at a time => during generation, we don't know which tokens we will generate in the future;

– to enable parallelization, we forbid the decoder to look ahead: future tokens are masked out (set to -inf) before the softmax step in the self-attention calculation.
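A sketch of the causal mask: positions above the diagonal are set to -inf before the softmax, so each token can only attend to itself and to tokens on its left (tensor shapes are illustrative):

```python
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    # Lower-triangular boolean mask: position i may attend to positions <= i.
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

scores = torch.randn(1, 5, 5)                        # raw self-attention scores for 5 decoder tokens
scores = scores.masked_fill(~causal_mask(5), float('-inf'))
weights = torch.softmax(scores, dim=-1)              # future positions get zero attention weight
```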
Multi-head attention
• We need to know relationships between different tokens in a sentence: syntactic relationships, lexical preferences, order, grammar issues like case or gender agreement.
• Instead of having one attention mechanism, multi-head attention has several "heads" which work independently and focus on different things.
Multi-head attention

Ashish Vaswani and Anna Huang. Self-Attention For Generative Models
Multi-head attention implementation

Suppose the query is (bs, m, 512), and the key and value are (bs, n, 512):
1. View each as (bs, seqlen, 8, 64) → transpose to (bs, 8, seqlen, 64)
2. Attention per head → values (bs, 8, m, 64), attention weights (bs, 8, m, n)
3. Transpose back → (bs, m, 8, 64)
4. Concat: view as (bs, m, 512)
Result is (bs, m, 512).

Rush. The Annotated Transformer. https://nlp.seas.harvard.edu/2018/04/03/attention.html
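The same shape walkthrough as code (a sketch: the input/output linear projections and dropout of the full multi-head layer are omitted):

```python
import torch

bs, m, n, d_model, h = 2, 7, 9, 512, 8       # batch, target len, source len, model dim, number of heads
d_k = d_model // h                           # 64 dimensions per head

query = torch.randn(bs, m, d_model)          # (bs, m, 512)
key   = torch.randn(bs, n, d_model)          # (bs, n, 512)
value = torch.randn(bs, n, d_model)

# (bs, seqlen, 512) -> (bs, seqlen, 8, 64) -> (bs, 8, seqlen, 64)
q = query.view(bs, m, h, d_k).transpose(1, 2)
k = key.view(bs, n, h, d_k).transpose(1, 2)
v = value.view(bs, n, h, d_k).transpose(1, 2)

weights = torch.softmax(q @ k.transpose(-2, -1) / d_k ** 0.5, dim=-1)   # (bs, 8, m, n)
heads = weights @ v                                                     # (bs, 8, m, 64)

# concat the heads: (bs, 8, m, 64) -> (bs, m, 8, 64) -> (bs, m, 512)
out = heads.transpose(1, 2).contiguous().view(bs, m, d_model)
```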
Multi-head self-attention in encoder

Jay Alammar. The Illustrated Transformer. http://jalammar.github.io/illustrated-transformer/
Extra
Feed-forward blocks
Each layer has a feed-forward network block: two linear layers with a ReLU non-linearity between them.
Positionwise FFNN:
● Linear → ReLU → Linear
● Base: 512 → 2048 → 512
● Large: 1024 → 4096 → 1024
● Equivalent to two convolutional layers with kernel size 1

Rush. The Annotated Transformer. https://nlp.seas.harvard.edu/2018/04/03/attention.html
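A sketch of the position-wise feed-forward block (base-model sizes; the dropout placement follows the Annotated Transformer and may differ in other implementations):

```python
import torch.nn as nn

class PositionwiseFFN(nn.Module):
    """Two linear layers with a ReLU in between, applied independently at every position."""
    def __init__(self, d_model: int = 512, d_ff: int = 2048, dropout: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),   # 512 -> 2048 (base), 1024 -> 4096 (large)
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),   # back to d_model
        )

    def forward(self, x):               # x: (batch, seq_len, d_model)
        return self.net(x)
```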
Extra
Residual connections (train better)

Residual connections => add the input of a block to its output.
They ease the gradient flow through the network and allow stacking many layers.
Residuals:
LayerNorm(x + dropout(Sublayer(x)))

Rush. The Annotated Transformer. https://nlp.seas.harvard.edu/2018/04/03/attention.html
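The residual + normalization wrapper from the formula above, as a small module (a sketch; `sublayer` stands for either the attention block or the feed-forward block):

```python
import torch.nn as nn

class SublayerConnection(nn.Module):
    """Implements LayerNorm(x + dropout(Sublayer(x)))."""
    def __init__(self, d_model: int = 512, dropout: float = 0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        # `sublayer` is a callable, e.g. the self-attention or the feed-forward block
        return self.norm(x + self.dropout(sublayer(x)))
```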
Extra
Layer Normalization (train faster)

Improves convergence.
Idea: cut down on uninformative variation in the hidden vector values by normalizing them to zero mean and unit standard deviation within each layer.
Transformer layer (enc) unrolled

Jay Alammar. The Illustrated Transformer. http://jalammar.github.io/illustrated-transformer/
Transformer layer (enc)

Rush. The Annotated Transformer. https://nlp.seas.harvard.edu/2018/04/03/attention.html
Transformer layer (dec) unrolled

Jay Alammar. The Illustrated Transformer. http://jalammar.github.io/illustrated-transformer/
Transformers. One more time

Regularization
● Dropout:
  – residual (sub-layer outputs)
  – ReLU (inside the FFN)
  – input (embeddings)
● Attention dropout (only for some experiments)
  – dropout on attention weights (after softmax)
● Label smoothing

Label smoothing from:
Szegedy et al. Rethinking the Inception Architecture for Computer Vision, 2015
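A sketch of label smoothing as a loss: the smoothed target puts 1 − ε on the true token and spreads ε over the rest (the paper uses ε = 0.1; the function name below is ours):

```python
import torch
import torch.nn.functional as F

def label_smoothing_loss(logits, target, smoothing: float = 0.1):
    # logits: (batch, vocab), target: (batch,) of class indices
    n_classes = logits.size(-1)
    log_probs = F.log_softmax(logits, dim=-1)
    smooth = torch.full_like(log_probs, smoothing / (n_classes - 1))   # mass spread over wrong classes
    smooth.scatter_(-1, target.unsqueeze(-1), 1.0 - smoothing)         # 1 - eps on the true class
    return -(smooth * log_probs).sum(dim=-1).mean()                    # cross-entropy vs. smoothed targets

logits = torch.randn(4, 37000)                 # e.g. 4 positions over a 37K word-piece vocab
target = torch.randint(0, 37000, (4,))
loss = label_smoothing_loss(logits, target)
```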
Training
● Adam, betas=(0.9, 0.98), eps=1e-9
● Learning rate: linear warmup for 4K-8K steps (a warmup of 3-10% of total steps is common) + inverse square root decay

Noam optimizer:
Adam + this learning-rate schedule

Rush. The Annotated Transformer. https://nlp.seas.harvard.edu/2018/04/03/attention.html
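The schedule from the paper, lr = d_model^(-0.5) · min(step^(-0.5), step · warmup^(-1.5)), as a small helper (a sketch; it can be hooked up e.g. via torch.optim.lr_scheduler.LambdaLR on top of Adam(betas=(0.9, 0.98), eps=1e-9)):

```python
def noam_lr(step: int, d_model: int = 512, warmup: int = 4000) -> float:
    """Linear warmup for `warmup` steps, then inverse-square-root decay."""
    step = max(step, 1)                                   # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
```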
Training

● WMT2014 En→De / En→Fr: 4.5M / 36M sentence pairs
– word-piece vocabulary: 37K shared / 32K x2 separate
– Batches: sequences of approximately the same length; dynamic batch size: 25K source & 25K target tokens
– On 8 P100 GPUs (16GB), base/big: 0.5/3.5 days, 100K/300K steps, 0.4/1.0 s per step
– Average weights from the last 5/20 checkpoints
– Beam search with beam size 4, length penalty 0.6
– Dev set: newstest2013 En→De

Vaswani et al., 2017. Attention is all you need.


Results

Vaswani et al., 2017. Attention is all you need.


Transformers 6 years later
What happened after “Attention is all you need”?
• Mostly used for language modeling
• It turned out that pretrained transformers are easy to fine-tune for a new task
• Self-attention seems to learn a lot about the structure of language
• The architecture has enough capacity to generalize to new tasks
• A lot of pre-trained transformers have appeared
• They are now among the most important NLP resources
• Many of them are public (e.g. https://huggingface.co/models)
• For many tasks, it is easier (and often more effective) to fine-tune an existing transformer than to devise and train a model from scratch, as in the sketch below
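A minimal sketch of reusing a public checkpoint from the Hub with the transformers library (the checkpoint name and the classification head below are just examples, not a recommendation from the lecture):

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "bert-base-multilingual-cased"            # example checkpoint; pick one for your language/task
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)  # fresh head to fine-tune

inputs = tokenizer("Je m'appelle Maria.", return_tensors="pt")
outputs = model(**inputs)                        # logits from the (not yet fine-tuned) head
```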
Language Modelling
● Language modeling is the task of predicting what word comes next
– The student opened their ___ (books | laptops | exams | minds | …)
– Что дальше будет неизвестно ___ (никому | заранее | …)  ["What happens next is not known ___ (to anyone | in advance | …)"]

Source: Abigail See. CS224N/Ling284 slides: http://web.stanford.edu/class/cs224n
n-gram Language Modeling

Source: Abigail See. CS224N/Ling284 slides: http://web.stanford.edu/class/cs224n/
Disadvantages of n-gram LMs
– need smoothing
– can’t handle long histories
– cannot generalize well over contexts of similar words

Source: https://web.stanford.edu/~jurafsky/slp3/7.pdf
Neural Language Models: Motivation
(+) a neural language model has much higher predictive accuracy than an n-gram language model
(–) neural net language models are strikingly slower to train than traditional language models

Source: https://web.stanford.edu/~jurafsky/slp3/7.pdf
Basic types of transformer architectures
● BERT-like models: encoder only
  In encoders, each token attends to every other token.
  They are good for understanding texts.
● GPT-like models: decoder only
  In decoders, each token attends only to the tokens on its left (and to all tokens of the encoder, if there is one).
  They are good for generating texts.
● The classical transformer (also T5, BART, etc.): encoder and decoder

[Figures: the three attention patterns illustrated on the subword-tokenized sentence «Мама мыла раму» → "Mom was washing the frame".]
Do you really want a transformer?
• Probably yes, if:
• There is a pretrained transformer for your language and task (or a similar one)
• You have a small training set and want to generalize from it
• Your task is natural language generation and you have no templates
• Probably no, if:
• Your problem can be solved with rules or keywords
• You need fast (around 1ms) inference without GPU
• You need fast training without GPU
• You have to process very long texts without splitting them
• Your task needs some specific architecture (e.g. memory or recursion)
