Attention is All You Need: Explained
Martin Magill
March 31st 2023
Opportune Timing
https://futureoflife.org/open-letter/pause-giant-ai-experiments/
About Me
● Currently ML Researcher at Borealis AI in Toronto
○ Time series forecasting in the capital markets group
● PhD in mathematical modelling and computational science
○ Ontario Tech University in Prof. Hendrick de Haan’s cNAB.LAB
● Main research focus: Scientific machine learning
○ Mixing mathematical modelling with deep learning
○ More flexible than classical model-based methods
○ More accurate, reliable, and interpretable than purely data-driven methods
Suggested Rules of Engagement
● This is a large, semi-anonymous reading group
● The presentation aims to be accessible and interesting to anyone and everyone
● Planned pauses between sections for Q&A
Recap: Deep Learning and Natural Language Processing
Deep Learning for NLP
Input → Model → Output
● Text → More text
● Question → Answers
● Photo → Captions
● … → …
Deep Learning: Training
[Diagram: training as a student analogy. The exam is randomly generated; the student writes the exam, gets a grade, and learns from their mistakes.]
Training a deep neural network is a very-high-dimensional, nonlinear, nonconvex optimization problem. We almost always resort to gradient descent and its relatives.
But how do we “grade” NLP tasks?
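As a rough illustration of “learning from mistakes,” here is gradient descent on a made-up quadratic loss in NumPy; the loss, dimensions, and learning rate are arbitrary, and real training uses fancier relatives such as Adam.

```python
import numpy as np

# Toy gradient descent on a hypothetical quadratic-bowl loss
# (not the actual Transformer training objective).
def loss(w):
    return np.sum((w - 3.0) ** 2)

def grad(w):
    return 2.0 * (w - 3.0)

w = np.random.randn(5)                 # random initial parameters
learning_rate = 0.1
for step in range(100):
    w = w - learning_rate * grad(w)    # follow the negative gradient downhill

print(loss(w))                         # close to 0: the "student" improved
```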
Example Language Task
Try it yourself!
I wrote down a sentence and erased the last word. What is the missing word?
Once a liar, always a …
Once a liar, always a liar
Once a liar, always a trickster
Once a liar, always a toaster
Let’s Try Another One
Consider this photograph of the view of Toronto from the Borealis AI Toronto office.
This is a photograph of ________
● A city
● The sky
● A window pane
● A lake
● The horizon
What makes an answer “right”?
Be Careful What You Wish For
https://arxiv.org/pdf/2011.03395.pdf
Probabilistic NLP Paradigm
Model Input:
● Text: “Once a liar always a”
● Tokens: [“Once”, “a”, “li”, “ar”, “al”, “ways”, “a”] → [1875, 22, 658, 475, 32, 8889, 10]
● Embedding: [[0.11, 0.37, 0.002, ...], [0.07, 0.98, 1.55, ...], ...]
● + Encoding: [[1], [2], [3], ...]

Model Output:
● Task-dependent, but commonly probabilities over the vocabulary:
○ p(“liar”) = 0.82
○ p(“trickster”) = 0.17
○ p(“toaster”) = 0.01
○ …

The loss function is likelihood. The likelihood of what? Ask the dataset.
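To make the pipeline concrete, here is a minimal NumPy sketch of the probabilistic paradigm; the tiny vocabulary, the random weights, and the “mean-pool then project” model are all made up purely for illustration.

```python
import numpy as np

# Minimal sketch of the probabilistic NLP pipeline.
rng = np.random.default_rng(0)
vocab = ["once", "a", "liar", "always", "trickster", "toaster"]
token_ids = [0, 1, 2, 3, 1]                         # "once a liar always a"

d_model = 8
embedding = rng.normal(size=(len(vocab), d_model))  # learned lookup table
x = embedding[token_ids]                            # (seq_len, d_model)

# Pretend model: pool the sequence and produce one score (logit) per vocab word.
W_out = rng.normal(size=(d_model, len(vocab)))
logits = x.mean(axis=0) @ W_out

# Softmax turns scores into a probability distribution over the vocabulary.
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# The loss is the negative log-likelihood of the word the dataset says comes next.
target = vocab.index("liar")
nll = -np.log(probs[target])
print(dict(zip(vocab, probs.round(3))), round(nll, 3))
```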
Questions?
Before Transformers: Recurrent and Convolutional Models for Sequences
Basic RNN
[Diagram: a basic RNN processes the sequence one step at a time: Input 1 → Output 1, Input 2 → Output 2, Input 3 → Output 3, …]
The Vanishing Gradient Problem
● Naive recursion introduces problems when backpropagating through time
● The gradient features the product of the same Jacobian many times
● For basic RNN architectures, this product easily becomes extremely small
● However, modern RNN architectures like LSTMs or GRUs mostly resolve this problem
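The sketch below (random matrices, no real data) shows the mechanism: backpropagation through a basic RNN multiplies nearly the same Jacobian at every time step, so the gradient norm shrinks roughly geometrically with sequence length.

```python
import numpy as np

# Toy demonstration of the vanishing gradient in a basic RNN.
rng = np.random.default_rng(0)
d = 16
W = 0.5 * rng.normal(size=(d, d)) / np.sqrt(d)   # recurrent weights, modest scale

grad = np.eye(d)
for t in range(1, 51):                  # 50 steps back through time
    grad = W.T @ grad                   # ignoring the tanh factor, which only shrinks it further
    if t % 10 == 0:
        print(t, np.linalg.norm(grad))  # norm decays roughly geometrically
```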
Convolutional Sequence Models
● A more fundamental issue with RNNs is that outputs must be computed one at a time
○ Unavoidable serial computation
● An alternative is to use convolutional neural networks applied over the time axis
○ These can be computed in parallel across all inputs!
● However, convolutions are local, so tokens (words) in the input sequence only “interact” with
their neighbouring tokens
○ Hard/expensive to model long-range dependencies
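As a sketch of that locality, here is a 1-D convolution over the time axis in NumPy (kernel size, dimensions, and weights are arbitrary): every output position depends only on its immediate neighbours, but all positions can be computed at once.

```python
import numpy as np

# Toy 1-D convolution over the time axis.
rng = np.random.default_rng(0)
seq_len, d_in, d_out, kernel_size = 10, 4, 4, 3

x = rng.normal(size=(seq_len, d_in))
w = rng.normal(size=(kernel_size, d_in, d_out))

x_padded = np.pad(x, ((1, 1), (0, 0)))               # pad so output length matches input
out = np.stack([
    sum(x_padded[t + k] @ w[k] for k in range(kernel_size))
    for t in range(seq_len)                           # every t is independent -> parallelizable
])
print(out.shape)  # (10, 4); output t only "sees" inputs t-1, t, t+1
```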
Questions?
The Attention Mechanism
Queries, Keys, and Values
Motivated from information retrieval:
● I send the system a query Qi
● The system checks each query against its library of keys Kj
● It returns the value Vj

Example:
● I type “potato chips” in the search bar
● The system checks the metadata for all the products in the database
● The system returns a ranked list of the webpages for the most similar results

Given a query Qi and a library of keys Kj, we want to learn to pay attention to the most relevant values Vj.
General Attention Mechanisms
1. Compute the “similarity” of queries and keys
2. Convert similarities to weights with softmax
3. Weighted sum to “select” values
Flexible: If we parameterize a by a deep neural
network, it can be almost anything.
Clunky: If a is any arbitrary function, it will be difficult
to learn, expensive to evaluate, and hard to interpret.
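Here is a sketch of those three steps for one query, where the similarity function a is a tiny, randomly initialized MLP; all dimensions and weights are illustrative, and the point is only that a could be almost anything.

```python
import numpy as np

# General attention with an arbitrary learned similarity function a(q, k).
rng = np.random.default_rng(0)
d, n_keys = 4, 5
q = rng.normal(size=d)                       # one query
K = rng.normal(size=(n_keys, d))             # library of keys
V = rng.normal(size=(n_keys, d))             # corresponding values

W1 = rng.normal(size=(2 * d, 8))             # a(q, k) = tiny MLP (illustrative)
w2 = rng.normal(size=8)
def a(query, key):
    return np.tanh(np.concatenate([query, key]) @ W1) @ w2

scores = np.array([a(q, k) for k in K])      # 1. similarity of the query and each key
weights = np.exp(scores - scores.max())      # 2. softmax turns similarities into weights
weights /= weights.sum()
output = weights @ V                         # 3. weighted sum "selects" the values
print(weights.round(3), output.round(3))
```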
Dot-Product Similarity
● Recall dot-product similarity, a.k.a. cosine similarity
● When vectors are very similar (nearly parallel), the cosine similarity is nearly 1
● When they point in unrelated (roughly orthogonal) directions, it is nearly 0; when they point in opposite directions, it approaches -1
● Cosine similarity is normalized by the vector magnitudes, so it doesn’t depend on how long the vectors are
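In code, cosine similarity is just the dot product divided by the two magnitudes (a minimal sketch with made-up vectors):

```python
import numpy as np

# Cosine similarity: dot product normalized by the vector magnitudes.
def cosine_similarity(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

u = np.array([1.0, 2.0, 3.0])
print(cosine_similarity(u, 2 * u))                        #  1.0 (parallel)
print(cosine_similarity(u, np.array([3.0, 0.0, -1.0])))   #  0.0 (orthogonal)
print(cosine_similarity(u, -u))                           # -1.0 (opposite)
```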
Dot-Product Attention
● Basic idea: Use dot-product similarity in the attention mechanism
● We just need to project queries and keys into the same vector space
○ Transformers typically do this with learned linear transformations
○ Input X gets mapped to queries, keys, and values using three different matrices
● More efficient
○ Two simple linear transformations instead of one high-dimensional nonlinear function
Scaled Dot-Product Attention
● It’s important in deep models for every layer to preserve the scale of numerical information
● If input values are near 1 in magnitude, then output values should be too
● The dot product of random d_k-dimensional inputs with unit variance produces outputs with variance d_k
● Scaled dot-product attention simply divides the dot products by √d_k to encourage stability when d_k is large
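Putting the last two slides together, here is a sketch of scaled dot-product attention with learned Q/K/V projections; the shapes and random initializations are illustrative rather than the paper's exact setup.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Scaled dot-product attention with learned linear projections (sketch).
rng = np.random.default_rng(0)
seq_len, d_model, d_k = 6, 16, 8

X = rng.normal(size=(seq_len, d_model))      # input token representations
W_Q = rng.normal(size=(d_model, d_k))        # three learned matrices map X to
W_K = rng.normal(size=(d_model, d_k))        # queries, keys, and values
W_V = rng.normal(size=(d_model, d_k))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V
scores = Q @ K.T / np.sqrt(d_k)              # dividing by sqrt(d_k) keeps the variance near 1
attention = softmax(scores) @ V
print(attention.shape)                       # (6, 8)
```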
Questions?
The Transformer
The Transformer Architecture
Essentially, the transformer takes all the
pieces we’ve discussed so far and puts
them together.
There are a few more details to cover,
though, including some important tricks
that make it work.
Positional Encoding
● Attention layers are permutation equivariant
○ If you shuffle the order of the inputs, the outputs get shuffled in the same way
○ The inputs are treated as a set, not a sequence
● This is actually useful in some applications! But not for language.
○ I ate the salad you made for me
○ The salad I made for you ate me
● The standard solution is to encode the order of the tokens by adding sine waves of different frequencies to the token embeddings
○ Kind of weird, but it works
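Concretely, the sine waves follow the formulas from the original paper, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)); the sketch below uses placeholder embeddings and an arbitrary size.

```python
import numpy as np

# Sinusoidal positional encoding as in the original paper.
def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]              # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]          # even feature indices
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

embeddings = np.zeros((10, 16))                    # placeholder token embeddings
x = embeddings + positional_encoding(10, 16)       # the encoding is simply added
print(x.shape)                                     # (10, 16)
```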
Encoder-Decoder Mode
The original Transformer has two branches: an encoder and a decoder.
They are basically the same, but the decoder has special “encoder-decoder attention” blocks that take their keys and values from the output of the encoder.
Other Transformers are pure encoders (BERT) or pure decoders (GPT).
[Figure taken from the excellent blog post “The Illustrated Transformer” at https://jalammar.github.io/illustrated-transformer/]
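As a sketch of those encoder-decoder attention blocks (shapes and weights are illustrative): the queries come from the decoder's current state, while the keys and values come from the encoder's output.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Encoder-decoder ("cross") attention sketch.
rng = np.random.default_rng(0)
d_model, d_k = 16, 8
encoder_output = rng.normal(size=(7, d_model))     # e.g. 7 source-language tokens
decoder_state = rng.normal(size=(4, d_model))      # e.g. 4 target tokens generated so far

W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q = decoder_state @ W_Q                            # queries from the decoder
K, V = encoder_output @ W_K, encoder_output @ W_V  # keys and values from the encoder
out = softmax(Q @ K.T / np.sqrt(d_k)) @ V
print(out.shape)                                   # (4, 8): one output per decoder position
```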
Masked Scaled Dot-Product Attention
In various language tasks, we don’t want
the model to be able to look into the
future.
We can use a mask in the attention
mechanism to prevent this from
happening.
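A common way to do this (sketched below with random values) is a lower-triangular “causal” mask: set the scores for future positions to a very large negative number before the softmax, so they receive essentially zero attention weight.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Masked scaled dot-product attention: no peeking at future tokens.
rng = np.random.default_rng(0)
seq_len, d_k = 5, 8
Q = rng.normal(size=(seq_len, d_k))
K = rng.normal(size=(seq_len, d_k))
V = rng.normal(size=(seq_len, d_k))

scores = Q @ K.T / np.sqrt(d_k)
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
scores = np.where(causal_mask, scores, -1e9)       # future positions get ~zero weight
weights = softmax(scores)
print(weights.round(2))                            # lower-triangular attention pattern
```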
Multiple Attention Heads
Sometimes, a word can seemingly be
interpreted in multiple ways in a
sentence.
Other times, there are multiple
independent aspects of a word that are
worth representing.
Multiple attention heads enable this.
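The sketch below shows the idea with made-up dimensions: several small attention heads run in parallel on their own projections of the input, and their outputs are concatenated and mixed by one more learned matrix.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Multi-head attention sketch (dimensions and weights are illustrative).
rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 6, 16, 4
d_head = d_model // n_heads

X = rng.normal(size=(seq_len, d_model))
W_Q = rng.normal(size=(n_heads, d_model, d_head))
W_K = rng.normal(size=(n_heads, d_model, d_head))
W_V = rng.normal(size=(n_heads, d_model, d_head))
W_O = rng.normal(size=(d_model, d_model))          # mixes the concatenated heads

heads = []
for h in range(n_heads):                           # each head can focus on a different
    Q, K, V = X @ W_Q[h], X @ W_K[h], X @ W_V[h]   # aspect or interpretation of the input
    heads.append(softmax(Q @ K.T / np.sqrt(d_head)) @ V)

output = np.concatenate(heads, axis=-1) @ W_O
print(output.shape)                                # (6, 16)
```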
Layer Norm
We already discussed the importance of
keeping the numerical values close to 1
throughout the network.
Layer norm explicitly enforces this at
regular stops throughout the
architecture.
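A minimal sketch of what layer norm does to each token's feature vector (the learned scale and shift, gamma and beta, are set to their neutral values here):

```python
import numpy as np

# Layer norm: per token, rescale the features to zero mean and unit variance,
# then apply a learned scale (gamma) and shift (beta).
def layer_norm(x, gamma, beta, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

x = 100.0 * np.random.randn(6, 16)                 # badly scaled activations
y = layer_norm(x, gamma=np.ones(16), beta=np.zeros(16))
print(x.std(axis=-1).round(1))                     # around 100
print(y.std(axis=-1).round(2))                     # around 1
```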
Feedforward Layers and Skip Connections
Every transformer block ends with a
fully-connected feedforward/MLP layer.
Moreover, skip connections (as in
ResNet) are used liberally throughout
the architecture.
Without these, all the tokens end up
mapped to the exact same thing!
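Putting the pieces together, here is a sketch of one encoder block in the original post-norm arrangement, LayerNorm(x + Sublayer(x)); the random weights and small dimensions are placeholders, not the paper's configuration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

# One encoder block: attention and MLP sub-layers, each wrapped in a skip
# connection and layer norm (weights are random placeholders).
rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 6, 16, 64
X = rng.normal(size=(seq_len, d_model))

W_Q, W_K, W_V, W_O = (0.1 * rng.normal(size=(d_model, d_model)) for _ in range(4))
W_1 = 0.1 * rng.normal(size=(d_model, d_ff))
W_2 = 0.1 * rng.normal(size=(d_ff, d_model))

def attention(x):
    Q, K, V = x @ W_Q, x @ W_K, x @ W_V
    return softmax(Q @ K.T / np.sqrt(d_model)) @ V @ W_O

def feed_forward(x):
    return np.maximum(0.0, x @ W_1) @ W_2            # ReLU MLP applied to every token

x = layer_norm(X + attention(X))                     # skip connection around attention
x = layer_norm(x + feed_forward(x))                  # skip connection around the MLP
print(x.shape)                                       # (6, 16)
```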
Questions?
Training
Training
● The original Transformer was trained to translate (English-German, English-French)
● The loss function was likelihood
○ How was the data collected? Likelihood of what, exactly?
● Learning rate schedule: Slow, then fast, then slower
● Regularization
○ Dropout
○ Label smoothing
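The “slow, then fast, then slower” schedule is the one from the original paper: lr = d_model^(-0.5) · min(step^(-0.5), step · warmup_steps^(-1.5)), i.e. a linear warmup followed by inverse-square-root decay. A quick sketch:

```python
# Learning-rate schedule from the original Transformer paper.
def transformer_lr(step, d_model=512, warmup_steps=4000):
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

for step in [1, 1000, 4000, 20000, 100000]:
    print(step, round(transformer_lr(step), 6))   # ramps up to a peak at warmup, then decays
```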
Results
Interpretability of Attention
Also taken from the excellent blog post “The Illustrated Transformer” at https://jalammar.github.io/illustrated-transformer/. Be sure to check it out!
Today: Semi-Supervised, Human-Supervised
● There is a lot more unlabelled data than labelled data in the world today.
● Modern Transformers can train on mostly unlabelled data, which is a huge advantage.
● InstructGPT also uses reinforcement learning from human feedback (RLHF) after the main training phase.
● Essentially, this is a systematic methodology for fine-tuning a language model to catch
obvious errors, inappropriate or undesirable responses, etc.
○ Remember Tay’s Tweets?
● Could be good topics for future reading group sessions?
Questions?
Conclusions
Summary: Transformers Versus Predecessors
| | Recurrent Models | Sequence Convolution | Original Transformers |
| Main mechanism | Recurrence (+/- Attention) | Convolution (+/- Attention) | Attention |
| Computation | Sequential | Parallel | Parallel |
| Long sequences | Vanishing Gradient | Local Interactions | Dense Interactions |
| Position Dependence | Automatic | Automatic | Permutation Equivariant |
Further Reading
● Attention is Not All You Need
○ Without skip connections, layer norm, and fully-connected layers, all the tokens get mapped to the same representation very quickly
● Scaling Laws for Neural Language Models
○ Early evidence that Transformers just keep getting better as you add more data and more compute, which is more relevant now than ever before!
● There have been many iterations since the original Transformer:
○ Autoformer
○ Informer
○ Scaleformer
○ …
References
https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
https://jalammar.github.io/illustrated-transformer/
https://nlp.seas.harvard.edu/2018/04/03/attention.html
https://stats.stackexchange.com/questions/421935/what-exactly-are-keys-queries-and-values-in-attention-mechanisms
https://www.forbes.com/sites/robtoews/2022/02/13/language-is-the-next-great-frontier-in-ai/?sh=b84b9ee5c506
https://bootcamp.uxdesign.cc/how-chatgpt-really-works-explained-for-non-technical-people-71efb078a5c9
https://www.borealisai.com/research-blogs/tutorial-14-transformers-i-introduction/#Motivation
Thanks for
Listening!
[email protected]
linkedin.com/in/martin-magill/