
The Transformer: motivation, original architecture and attention mechanism
Outline
• Motivation for Transformer
• Attention (from RNNs to Attention is all you need)
• The original Transformer architecture
• Do we need a transformer for our problem?
Why transformers?
• Natural language processing ≈ understanding and generating texts
• One of the most important NLP problems: machine translation
• “Je m'appelle Maria.” → “My name is Maria.”
• What does one need to translate a text?
  1. Understand individual words → subword vocabulary + embeddings
  2. Understand interactions between words (syntax) → encoder self-attention
  3. Translate the words (given the context) → cross-attention
  4. Compose a meaningful and fluent text → decoder self-attention
• The transformer architecture offers a unified, powerful, and reusable way of understanding and generating text
Short history of machine translation.
Motivation for transformers
History of machine translation
● 1950s: Russian→English MT (Cold War)
– Rule-based systems using bilingual dictionaries
● 1990s-2010s: Statistical MT (SMT)
– Mainly Phrase-Based MT (PBMT)
– Learns from (large) sentence-aligned bilingual corpora
– Consists of many separate, very complex components, learnt separately
– Developed by large groups for decades; doesn’t generalize to new language pairs
● Since 2014: Neural MT (NMT)
– A single NN learnt end-to-end
– Learns from (large) sentence-aligned bilingual corpora
– A few (good) student-months to implement
– Better translation quality than SMT systems developed over decades
● Google Translate: NMT, Yandex Translate: NMT+PBMT
History of machine translation: IBM models

[Figure: the statistical MT setup — an original French sentence and its English translation, an English n-gram language model, a translation model and an alignment model; the key model component is the learned word alignment, translation proceeds phrase by phrase, and one of the probability terms can be discarded during decoding.]

Source: The Mathematics of Statistical Machine Translation: Parameter Estimation, Brown et al., 1993
What is an alignment?
Either a matrix of matches between words, or a vector of matched words*

*The vector representation is simpler, but it cannot describe one-to-many or many-to-many alignments

Images source: https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1162/syllabus.shtml


Neural machine translation: seq2seq

[Figure: encoder–decoder RNN seq2seq model; greedy decoding picks the most probable token at each step.]

Sequence to Sequence Learning with Neural Networks, Sutskever et al., 2014
Socher, Manning. CS224n, 2017
Neural machine translation: seq2seq loss

[Figure: seq2seq training loss with teacher forcing — during training the decoder is fed the ground-truth previous tokens.]

Sequence to Sequence Learning with Neural Networks, Sutskever et al., 2014
Socher, Manning. CS224n, 2017
Seq2seq bottleneck problem

[Figure: the entire source sentence has to be squeezed into a single fixed-size vector passed from the encoder to the decoder.]

Socher, Manning. CS224n, 2017
Attention
Attention:
At different steps, let the model focus on different (more relevant) parts of the source tokens.

Core idea:
On each step of the decoder, use a direct connection to the encoder to focus on a particular part of the source sequence.
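A minimal PyTorch sketch of this idea for a single decoder step (the dot-product scoring function and the tensor names are illustrative; the lecture does not commit to a specific attention variant):

```python
import torch

# One decoder step attends over all encoder states (a sketch, not the exact lecture model).
def attend(decoder_state, encoder_states):
    # decoder_state: (hidden,), encoder_states: (src_len, hidden)
    scores = encoder_states @ decoder_state      # (src_len,) similarity with each source position
    weights = torch.softmax(scores, dim=-1)      # attention distribution over source tokens
    context = weights @ encoder_states           # (hidden,) weighted sum of encoder states
    return context, weights

encoder_states = torch.randn(7, 256)             # 7 source tokens, hidden size 256
decoder_state = torch.randn(256)
context, weights = attend(decoder_state, encoder_states)
```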
Attention: Example

[Figures: step-by-step illustration of the decoder attending to different source tokens, one decoding step per slide.]

Socher, Manning. CS224n, 2017
Attention
Results:
– Solves the information bottleneck problem; acts like an additional memory
– Helps gradient propagation from encoder to decoder, especially in long sequences
– Better interpretability

Britz et al., 2017. Massive Exploration of Neural Machine Translation Architectures:

● “we found that the attention-based models exhibited significantly larger gradient updates to decoder states throughout training. This suggests that the attention mechanism acts more like a ‘weighted skip connection’ that optimizes gradient flow than like a ‘memory’ that allows the encoder to access source states, as is commonly stated in the literature”
Attention
Better interpretability
● Networks learn alignment as a byproduct of translation
● We can look at which fragments were translated into which ones

Bahdanau et al. Neural Machine Translation by Jointly Learning to Align and Translate, 2015
Attention
• Attention is a new basic layer type (along with feed-forward, convolutional and recurrent layers)
• Works with variable-length inputs (texts)
  ● like RNNs and CNNs
  ● unlike feed-forward layers
• Used in SOTA models for many tasks (question answering, image captioning, ...)
The Transformer
Get rid of RNNs in MT?

● RNNs are slow because they are not parallelizable across timesteps

Vaswani et al., 2017. Attention is all you need.


The Transformer
Attention is all you need =) 2017
Previously:

• RNN encoder + RNN decoder, interaction via a fixed-size vector

• RNN encoder + RNN decoder, interaction via attention

NOW:

• attention + attention + attention
attention + attention + attention

Lukasz Kaiser. 2017. Tensor2Tensor Transformers: New Deep Models for NLP. Lecture at Stanford University

Vaswani et al., 2017. Attention is all you need.

Transformer: high-level

[Figures: the encoder–decoder structure of the Transformer, built up block by block over several slides.]

https://github.com/king-menin/mipt-nlp2022
Positional Encoding
Positional encoding provides order information to the model.

[Figure: the fixed positional encodings used in the Transformer.]

https://lena-voita.github.io/nlp_course/seq2seq_and_attention.html
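A sketch of computing these fixed encodings, following the sin/cos formula from “Attention is all you need” (the sizes below are just example values):

```python
import math
import torch

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...)."""
    position = torch.arange(max_len).unsqueeze(1).float()                  # (max_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2).float()
                         * (-math.log(10000.0) / d_model))                 # (d_model/2,)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions
    return pe                                      # added to the token embeddings

pe = sinusoidal_positional_encoding(max_len=512, d_model=512)
```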
Transformer: high-level

https://github.com/king-menin/mipt-nlp2022
Transformers. Self-attention
Previously: one decoder state looked at all encoder states.
NOW: each state looks at all other states, in parallel!

Self-attention:
• tokens interact with each other
• each token "looks" at the other tokens
• gathers context
• updates the previous representation of "self"

https://lena-voita.github.io/nlp_course/seq2seq_and_attention.html
Query, Key and Value vectors
Each vector gets three representations:
• query - asking for information;
• key - saying that it has some information;
• value - giving the information.

These projection matrices allow different aspects of the x vectors to be used/emphasized in each of the three roles.

Attention matches queries against keys: the resulting attention weights determine which values the information is taken from.
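A minimal sketch of scaled dot-product attention with learned query/key/value projections (the 512→64 projection size matches one head of the base model; the helper names are ours):

```python
import torch

def scaled_dot_product_attention(Q, K, V, mask=None):
    # Q: (..., m, d_k), K: (..., n, d_k), V: (..., n, d_v)
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5        # (..., m, n): how well each query matches each key
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))
    weights = torch.softmax(scores, dim=-1)              # attention weights
    return weights @ V, weights                          # weighted sum of the values

x = torch.randn(1, 10, 512)                                    # 10 token vectors
W_q, W_k, W_v = (torch.nn.Linear(512, 64) for _ in range(3))   # three roles of the same x
out, w = scaled_dot_product_attention(W_q(x), W_k(x), W_v(x))  # self-attention: Q, K, V all come from x
```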
Masked self-attention
The decoder has a different self-attention => masked self-attention:

– we generate one token at a time => during generation, we don't know which tokens we will generate in the future;

– to enable parallelization, we forbid the decoder to look ahead: future tokens are masked out (set to -inf) before the softmax step in the self-attention calculation.
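A sketch of the causal mask: positions above the diagonal are set to -inf before the softmax, so each token can only attend to itself and to tokens on its left (tensor shapes are illustrative):

```python
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    # Lower-triangular boolean mask: position i may attend to positions <= i.
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

scores = torch.randn(1, 5, 5)                        # raw self-attention scores for 5 decoder tokens
scores = scores.masked_fill(~causal_mask(5), float('-inf'))
weights = torch.softmax(scores, dim=-1)              # future positions get zero attention weight
```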
Multi-head attention
• We need to know relationships between different tokens in a sentence: syntactic relationships, lexical preferences, order, grammar issues like case or gender agreement.
• Instead of having one attention mechanism, multi-head attention has several "heads" which work independently and focus on different things.
Multi-head attention

Ashish Vaswani and Anna Huang. Self-Attention For Generative Models
Multi-head attention implementation

Suppose the query is (bs, m, 512), and the key and value are (bs, n, 512):
1. View each as (bs, seqlen, 8, 64) → transpose to (bs, 8, seqlen, 64)
2. Attention per head → values (bs, 8, m, 64), attention weights (bs, 8, m, n)
3. Transpose back → (bs, m, 8, 64)
4. Concat: view as (bs, m, 512)
Result is (bs, m, 512).

Rush. The Annotated Transformer. https://nlp.seas.harvard.edu/2018/04/03/attention.html
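The same shape walkthrough as code (a sketch: the input/output linear projections and dropout of the full multi-head layer are omitted):

```python
import torch

bs, m, n, d_model, h = 2, 7, 9, 512, 8       # batch, target len, source len, model dim, number of heads
d_k = d_model // h                           # 64 dimensions per head

query = torch.randn(bs, m, d_model)          # (bs, m, 512)
key   = torch.randn(bs, n, d_model)          # (bs, n, 512)
value = torch.randn(bs, n, d_model)

# (bs, seqlen, 512) -> (bs, seqlen, 8, 64) -> (bs, 8, seqlen, 64)
q = query.view(bs, m, h, d_k).transpose(1, 2)
k = key.view(bs, n, h, d_k).transpose(1, 2)
v = value.view(bs, n, h, d_k).transpose(1, 2)

weights = torch.softmax(q @ k.transpose(-2, -1) / d_k ** 0.5, dim=-1)   # (bs, 8, m, n)
heads = weights @ v                                                     # (bs, 8, m, 64)

# concat the heads: (bs, 8, m, 64) -> (bs, m, 8, 64) -> (bs, m, 512)
out = heads.transpose(1, 2).contiguous().view(bs, m, d_model)
```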
Multi-head self-attention in encoder

Jay Alammar. The Illustrated Transformer. http://jalammar.github.io/illustrated-transformer/
Extra
Feed-forward blocks
Each layer has a feed-forward network block: two linear layers with a ReLU non-linearity between them.
Positionwise FFNN:
● Linear → ReLU → Linear
● Base: 512 → 2048 → 512
● Large: 1024 → 4096 → 1024
● Equivalent to two convolutional layers with kernel size 1

Rush. The Annotated Transformer. https://nlp.seas.harvard.edu/2018/04/03/attention.html
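A sketch of the position-wise feed-forward block (base-model sizes; the dropout placement follows the Annotated Transformer and may differ in other implementations):

```python
import torch.nn as nn

class PositionwiseFFN(nn.Module):
    """Two linear layers with a ReLU in between, applied independently at every position."""
    def __init__(self, d_model: int = 512, d_ff: int = 2048, dropout: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),   # 512 -> 2048 (base), 1024 -> 4096 (large)
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),   # back to d_model
        )

    def forward(self, x):               # x: (batch, seq_len, d_model)
        return self.net(x)
```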
Extra
Residual connections (train better)

Residual connections => add the input of a block to its output.
They ease the gradient flow through the network and allow stacking many layers.
Residuals:
LayerNorm(x + dropout(Sublayer(x)))

Rush. The Annotated Transformer. https://nlp.seas.harvard.edu/2018/04/03/attention.html
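The residual + normalization wrapper from the formula above, as a small module (a sketch; `sublayer` stands for either the attention block or the feed-forward block):

```python
import torch.nn as nn

class SublayerConnection(nn.Module):
    """Implements LayerNorm(x + dropout(Sublayer(x)))."""
    def __init__(self, d_model: int = 512, dropout: float = 0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        # `sublayer` is a callable, e.g. the self-attention or the feed-forward block
        return self.norm(x + self.dropout(sublayer(x)))
```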
Extra
Layer Normalization (train faster)

Improves convergence.
Idea: cut down on uninformative variation in the hidden vector values by normalizing them to zero mean and unit standard deviation within each layer.
Transformer layer (enc) unrolled

Jay Alammar. The Illustrated Transformer. http://jalammar.github.io/illustrated-transformer/
Transformer layer (enc)

Rush. The Annotated Transformer. https://nlp.seas.harvard.edu/2018/04/03/attention.html
Transformer layer (dec) unrolled

Jay Alammar. The Illustrated Transformer. http://jalammar.github.io/illustrated-transformer/
Transformers. One more time

Regularization
● Dropout:
  – residual (sub-layer outputs)
  – ReLU (inside the FFN)
  – input (embeddings)
● Attention dropout (only for some experiments)
  – dropout on attention weights (after softmax)
● Label smoothing

Label smoothing from:
Szegedy et al. Rethinking the Inception Architecture for Computer Vision, 2015
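A sketch of label smoothing as a loss: the smoothed target puts 1 − ε on the true token and spreads ε over the rest (the paper uses ε = 0.1; the function name below is ours):

```python
import torch
import torch.nn.functional as F

def label_smoothing_loss(logits, target, smoothing: float = 0.1):
    # logits: (batch, vocab), target: (batch,) of class indices
    n_classes = logits.size(-1)
    log_probs = F.log_softmax(logits, dim=-1)
    smooth = torch.full_like(log_probs, smoothing / (n_classes - 1))   # mass spread over wrong classes
    smooth.scatter_(-1, target.unsqueeze(-1), 1.0 - smoothing)         # 1 - eps on the true class
    return -(smooth * log_probs).sum(dim=-1).mean()                    # cross-entropy vs. smoothed targets

logits = torch.randn(4, 37000)                 # e.g. 4 positions over a 37K word-piece vocab
target = torch.randint(0, 37000, (4,))
loss = label_smoothing_loss(logits, target)
```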
Training
● Adam, betas=(0.9, 0.98), eps=1e-9
● Learning rate: linear warmup for 4K-8K steps (a warmup of 3-10% of total steps is common) + inverse square root decay

Noam optimizer:
Adam + this learning-rate schedule

Rush. The Annotated Transformer. https://nlp.seas.harvard.edu/2018/04/03/attention.html
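The schedule from the paper, lr = d_model^(-0.5) · min(step^(-0.5), step · warmup^(-1.5)), as a small helper (a sketch; it can be hooked up e.g. via torch.optim.lr_scheduler.LambdaLR on top of Adam(betas=(0.9, 0.98), eps=1e-9)):

```python
def noam_lr(step: int, d_model: int = 512, warmup: int = 4000) -> float:
    """Linear warmup for `warmup` steps, then inverse-square-root decay."""
    step = max(step, 1)                                   # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
```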
Training

● WMT2014 En→De / En→Fr: 4.5M / 36M sentence pairs
– word-piece vocabulary: 37K shared / 32K x2 separate
– Batches: sequences of approximately the same length; dynamic batch size: 25K source & 25K target tokens
– On 8 P100 GPUs (16GB), base/big: 0.5/3.5 days, 100K/300K steps, 0.4/1.0 s per step
– Average weights from the last 5/20 checkpoints
– Beam search with beam size 4, length penalty 0.6
– Dev set: newstest2013 En→De

Vaswani et al., 2017. Attention is all you need.


Results

Vaswani et al., 2017. Attention is all you need.


Transformers 6 years later
What happened after “Attention is all you need”?
• Mostly used for language modeling
• It turned out that pretrained transformers are easy to fine-tune for a new task
• Self-attention seems to learn a lot about the structure of language
• The architecture has enough capacity to generalize to new tasks
• A lot of pre-trained transformers have appeared
• They are now among the most important NLP resources
• Many of them are public (e.g. https://huggingface.co/models)
• For many tasks, it is easier (and often more effective) to fine-tune an existing transformer than to devise and train a model from scratch, as in the sketch below
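A minimal sketch of reusing a public checkpoint from the Hub with the transformers library (the checkpoint name and the classification head below are just examples, not a recommendation from the lecture):

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "bert-base-multilingual-cased"            # example checkpoint; pick one for your language/task
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)  # fresh head to fine-tune

inputs = tokenizer("Je m'appelle Maria.", return_tensors="pt")
outputs = model(**inputs)                        # logits from the (not yet fine-tuned) head
```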
Language Modelling
● Language modeling is the task of predicting what word comes next
– The student opened their ___ (books | laptops | exams | minds | …)
– Что дальше будет неизвестно ___ (никому | заранее | …)  ["What happens next is not known ___ (to anyone | in advance | …)"]

Source: Abigail See. CS224N/Ling284 slides: http://web.stanford.edu/class/cs224n
n-gram Language Modeling

Source: Abigail See. CS224N/Ling284 slides: http://web.stanford.edu/class/cs224n/
Disadvantages of n-gram LMs
– need smoothing
– can’t handle long histories
– cannot generalize well over contexts of similar words

Source: https://web.stanford.edu/~jurafsky/slp3/7.pdf
Neural Language Models: Motivation
(+) a neural language model has much higher predictive accuracy than an n-gram language model
(–) neural net language models are strikingly slower to train than traditional language models

Source: https://web.stanford.edu/~jurafsky/slp3/7.pdf
Basic types of transformer architectures
● BERT-like models: encoder only
  In encoders, each token attends to every other token.
  They are good for understanding texts.
● GPT-like models: decoder only
  In decoders, each token attends only to the tokens on its left (and to all tokens of the encoder, if there is one).
  They are good for generating texts.
● The classical transformer (also T5, BART, etc.): encoder and decoder

[Figures: the three attention patterns illustrated on the subword-tokenized sentence «Мама мыла раму» → "Mom was washing the frame".]
Do you really want a transformer?
• Probably yes, if:
• There is a pretrained transformer for your language and task (or a similar one)
• You have a small training set and want to generalize from it
• Your task is natural language generation and you have no templates
• Probably no, if:
• Your problem can be solved with rules or keywords
• You need fast (around 1ms) inference without GPU
• You need fast training without GPU
• You have to process very long texts without splitting them
• Your task needs some specific architecture (e.g. memory or recursion)
