
Attention is All You Need: Explained

Martin Magill
March 31st 2023
Opportune Timing

2
https://futureoflife.org/open-letter/pause-giant-ai-experiments/
About Me

● Currently ML Researcher at Borealis AI in Toronto


○ Time series forecasting in the capital markets group

● PhD in mathematical modelling and computational science


○ Ontario Tech University in Prof. Hendrick de Haan’s cNAB.LAB

● Main research focus: Scientific machine learning


○ Mixing mathematical modelling with deep learning
○ More flexible than classical model-based methods
○ More accurate, reliable, and interpretable than purely data-driven methods

3
Suggested Rules of Engagement

● This is a large, semi-anonymous reading group

● The presentation aims to be accessible and interesting to anyone and everyone

● Planned pauses between sections for Q&A

4
Recap:
Deep Learning and Natural
Language Processing

5
Deep Learning for NLP

Input → Model → Output

● Text → More text
● Question → Answers
● Photo → Captions
● … → …

6
Deep Learning: Training

Training a deep neural network is a very-high-dimensional, nonlinear, nonconvex optimization problem.

We almost always resort to gradient descent and its relatives.

But how do we “grade” NLP tasks?

[Diagram: the exam is randomly generated → the student writes the exam → the student gets a grade (C-) → the student learns from mistakes → … How?]

7
Example Language Task

Try it yourself!

I wrote down a sentence and erased the last word. What is the missing word?

Once a liar, always a …

Once a liar, always a liar

Once a liar, always a trickster

Once a liar, always a toaster


8
Let’s Try
Another One

Consider this photograph of the view of Toronto from the Borealis AI Toronto office.

This is a photograph of ________

● A city
● The sky
● A window pane
● A lake
● The horizon

What makes an answer “right”?

9
Be Careful What You Wish For

10
https://arxiv.org/pdf/2011.03395.pdf
Probabilistic NLP Paradigm

Model Input

● Text: “Once a liar always a”
● Tokens: [“Once”, “a”, “li”, “ar”, “al”, “ways”, “a”] → [1875, 22, 658, 475, 32, 8889, 10]
● Embedding: [[0.11, 0.37, 0.002, ...], [0.07, 0.98, 1.55, ...], ...]
● Encoding: + [[1], [2], [3], ...]

Model Output

● Task-dependent, but commonly probabilities over the vocabulary
○ p(“liar”) = 0.82
○ p(“trickster”) = 0.17
○ p(“toaster”) = 0.01

The loss function is likelihood. The likelihood of what? Ask the dataset.
11
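For concreteness, here is a minimal sketch of this pipeline in Python with NumPy. The vocabulary, weights, and numbers are made up for illustration; real systems use learned subword tokenizers, learned embeddings, and a trained network in the middle.

```python
import numpy as np

# Toy vocabulary and "tokenizer" (illustrative only; real models learn subword tokenizers).
vocab = ["Once", "a", "li", "ar", "al", "ways", "liar", "trickster", "toaster"]
tokens = ["Once", "a", "li", "ar", "al", "ways", "a"]
token_ids = np.array([vocab.index(t) for t in tokens])

rng = np.random.default_rng(0)
d_model = 8
embedding_table = rng.normal(size=(len(vocab), d_model))   # learned in a real model
positions = np.arange(len(token_ids))[:, None]             # stand-in for a positional encoding

x = embedding_table[token_ids] + positions                  # embeddings + position information

# Stand-in "model": map the last token's representation to a score per vocabulary entry.
output_proj = rng.normal(size=(d_model, len(vocab)))        # learned in a real model
logits = x[-1] @ output_proj

probs = np.exp(logits - logits.max())
probs /= probs.sum()                                         # softmax: probabilities over the vocabulary
print({word: round(float(p), 3) for word, p in zip(vocab, probs)})
```
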
Questions?

12
Before Transformers:
Recurrent and
Convolutional Models for
Sequences

13
Basic RNN

[Diagram of a basic RNN: Input 1 → Output 1, Input 2 → Output 2, Input 3 → Output 3, …, processed sequentially with a hidden state passed from one step to the next]

14
The Vanishing Gradient Problem

● Naive recursion introduces problems when backpropagating through time

● The gradient features the product of the same Jacobian many times

● For basic RNN architectures, this product easily becomes extremely small

● However, modern RNN architectures like LSTMs or GRUs mostly resolve this problem

15
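A tiny NumPy illustration of the effect (the numbers are purely illustrative): in the linear case, backpropagating through time multiplies the gradient by the same Jacobian at every step, so over a long sequence its norm shrinks geometrically.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16

# Recurrent weight matrix, rescaled so its largest singular value is 0.9 (< 1).
W = rng.normal(size=(d, d))
W *= 0.9 / np.linalg.svd(W, compute_uv=False)[0]

grad = np.ones(d)          # gradient arriving at the final time step
for _ in range(100):       # backpropagate through 100 steps of the linear recursion h_t = W h_{t-1}
    grad = W.T @ grad

print(np.linalg.norm(grad))   # vanishingly small (bounded by ~0.9**100 times the initial norm)
```
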
Convolutional Sequence Models

● A more fundamental issue with RNNs is that outputs must be computed one at a time
○ Unavoidable serial computation

● An alternative is to use convolutional neural networks applied over the time axis
○ These can be computed in parallel across all inputs!

● However, convolutions are local, so tokens (words) in the input sequence only “interact” with
their neighbouring tokens
○ Hard/expensive to model long-range dependencies

16
Questions?

17
The Attention Mechanism

18
Queries, Keys, and Values

Motivated from information retrieval:

● I send the system a query Qi
● The system checks each query against its library of keys Kj
● It returns the value Vj

Ex:

● I type “potato chips” in the search bar
● The system checks the metadata for all the products in the database
● The system returns a ranked list of the webpages for the most similar results

Given a query Qi and a library of keys Kj, we want to learn to pay attention to the most relevant values Vj.

19
General Attention Mechanisms

1. Compute the “similarity” of queries and keys

2. Convert similarities to weights with softmax

3. Weighted sum to “select” values

● Flexible: If we parameterize the similarity function a(q, k) with a deep neural network, it can be almost anything.

● Clunky: If a is an arbitrary function, it will be difficult to learn, expensive to evaluate, and hard to interpret.

20
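A sketch of these three steps in NumPy. The similarity function a(q, k) here is a small randomly-initialized MLP, just to emphasize that in this general form it could be almost anything; everything below is illustrative rather than any particular published mechanism.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d, n_queries, n_keys = 8, 3, 5
Q = rng.normal(size=(n_queries, d))   # queries
K = rng.normal(size=(n_keys, d))      # keys
V = rng.normal(size=(n_keys, d))      # values

# An arbitrary similarity function a(q, k): here a tiny MLP on the concatenation [q; k].
W1 = rng.normal(size=(2 * d, 16))
w2 = rng.normal(size=(16,))
def a(q, k):
    return np.tanh(np.concatenate([q, k]) @ W1) @ w2

# 1. Compute the similarity of every query with every key.
scores = np.array([[a(q, k) for k in K] for q in Q])    # shape (n_queries, n_keys)

# 2. Convert similarities to weights with softmax (each row sums to 1).
weights = softmax(scores, axis=-1)

# 3. Weighted sum to "select" values.
output = weights @ V                                     # shape (n_queries, d)
print(output.shape)
```
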
Dot-Product Similarity

● Recall the normalized dot product, a.k.a. cosine similarity

● When vectors are very similar (nearly parallel), the cosine similarity is nearly 1

● When they are unrelated (nearly orthogonal), the cosine similarity is nearly 0; when they point in opposite directions, it approaches -1

● Cosine similarity is normalized by the vector magnitudes, so it doesn’t depend on how large the vectors are, only on their directions

21
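For concreteness, a short NumPy sketch of cosine similarity:

```python
import numpy as np

def cosine_similarity(u, v):
    # Dot product divided by the product of the magnitudes: depends only on direction.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine_similarity(np.array([1.0, 0.0]), np.array([2.0, 0.0])))   #  1.0 (parallel)
print(cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 3.0])))   #  0.0 (orthogonal)
print(cosine_similarity(np.array([1.0, 0.0]), np.array([-1.0, 0.0])))  # -1.0 (opposite)
```
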
Dot-Product Attention

● Basic idea: Use dot-product similarity in the attention mechanism

● We just need to project queries and keys into the same vector space
○ Transformers typically do this with learned linear transformations
○ Input X gets mapped to queries, keys, and values using three different matrices

● More efficient
○ Two simple linear transformations instead of one high-dimensional nonlinear function

22
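A minimal sketch of those projections in NumPy. The matrices W_Q, W_K, and W_V stand in for the learned linear transformations; here they are random for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, d_k = 6, 32, 16

X = rng.normal(size=(n_tokens, d_model))   # one representation per input token

# Three learned linear maps (random stand-ins here) produce queries, keys, and values.
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V        # each of shape (n_tokens, d_k)

scores = Q @ K.T                            # dot-product similarity of every query with every key
print(scores.shape)                         # (n_tokens, n_tokens)
```
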
Scaled Dot-Product Attention

● It’s important in deep models for every layer to preserve the scale of the numerical information

● If input values are near 1 in magnitude, then output values should be too

● The dot product of random d_k-dimensional inputs with unit variance produces outputs with variance d_k

● Scaled dot-product attention simply divides the dot products by √d_k to keep things stable when d_k is large

23
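Putting the pieces together, a sketch of scaled dot-product attention in NumPy (shapes and weights are illustrative):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # dividing by sqrt(d_k) keeps the score variance near 1
    weights = softmax(scores, axis=-1)    # each query's weights sum to 1
    return weights @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(6, 16)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)   # (6, 16)
```
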
Questions?

24
The Transformer

25
The
Transformer
Architecture

Essentially, the transformer takes all the pieces we’ve discussed so far and puts them together.

There are a few more details to cover, though, including some important tricks that make it work.

26
Positional Encoding

● Attention layers are permutation equivariant


○ If you shuffle the order of the inputs, the outputs get shuffled in the same way
○ The inputs are treated as a set, not a sequence

● This is actually useful in some applications! But not for language.


○ I ate the salad you made for me
○ The salad I made for you ate me

● The standard solution is to encode the order of the tokens by adding sine waves
○ Kind of weird, but it works

27
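A sketch of the sinusoidal positional encoding used in the original paper: each position gets a vector of sines and cosines at different frequencies, which is added to the token embeddings.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000**(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000**(2i / d_model))
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = positional_encoding(seq_len=10, d_model=16)
print(pe.shape)   # (10, 16): gets added elementwise to the (10, 16) token embeddings
```
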
Encoder-
Decoder Mode

The original Transformer has two branches: an encoder and a decoder.

They are basically the same, but the decoder has special “encoder-decoder attention” blocks that take their keys and values from the output of the encoder.

Other Transformers are pure encoders (BERT) or pure decoders (GPT).

Taken from the excellent blog post “The Illustrated Transformer” at https://jalammar.github.io/illustrated-transformer/

29
Masked Scaled
Dot-Product
Attention

In various language tasks, we don’t want the model to be able to look into the future.

We can use a mask in the attention mechanism to prevent this from happening.

30
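A sketch of the mask in NumPy: positions in the future get a score of -inf before the softmax, so they receive exactly zero attention weight (a causal mask, as used in the decoder).

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def masked_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # Causal mask: token i may only attend to tokens j <= i (no looking into the future).
    n = scores.shape[0]
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    weights = softmax(scores, axis=-1)
    return weights @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 8)) for _ in range(3))
print(masked_attention(Q, K, V).shape)   # (5, 8); row i only mixes values from positions <= i
```
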
Multiple
Attention Heads

Sometimes, a word can seemingly be interpreted in multiple ways in a sentence.

Other times, there are multiple independent aspects of a word that are worth representing.

Multiple attention heads enable this.

31
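A sketch of multi-head attention in NumPy: several attention heads, each with its own (here random, normally learned) projections, run in parallel on lower-dimensional slices, and their outputs are concatenated and projected back.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1]), axis=-1) @ V

def multi_head_attention(X, n_heads, rng):
    n_tokens, d_model = X.shape
    d_head = d_model // n_heads
    heads = []
    for _ in range(n_heads):
        # Each head gets its own projection matrices (random stand-ins for learned weights).
        W_Q, W_K, W_V = (rng.normal(size=(d_model, d_head)) for _ in range(3))
        heads.append(attention(X @ W_Q, X @ W_K, X @ W_V))
    W_O = rng.normal(size=(d_model, d_model))   # final output projection
    return np.concatenate(heads, axis=-1) @ W_O

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 32))
print(multi_head_attention(X, n_heads=4, rng=rng).shape)   # (6, 32)
```
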

Layer Norm

We already discussed the importance of keeping the numerical values close to 1 throughout the network.

Layer norm explicitly enforces this at regular stops throughout the architecture.

33
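A sketch of layer norm in NumPy: each token’s feature vector is rescaled to zero mean and unit variance, then shifted and scaled by learned parameters (shown here at their usual initial values).

```python
import numpy as np

def layer_norm(x, gamma=None, beta=None, eps=1e-5):
    # Normalize over the feature dimension of each token independently.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    gamma = np.ones(x.shape[-1]) if gamma is None else gamma   # learned scale (init 1)
    beta = np.zeros(x.shape[-1]) if beta is None else beta     # learned shift (init 0)
    return gamma * x_hat + beta

x = np.random.default_rng(0).normal(loc=5.0, scale=3.0, size=(6, 32))
y = layer_norm(x)
print(y.mean(axis=-1).round(3), y.std(axis=-1).round(3))   # per-token means ~0, stds ~1
```
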
Feedforward
Layers and Skip
Connections

Every transformer block ends with a fully-connected feedforward/MLP layer.

Moreover, skip connections (as in ResNet) are used liberally throughout the architecture.

Without these, all the tokens end up mapped to the exact same thing!

34
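As a rough sketch of how the pieces fit together, here is one encoder-style block in NumPy: an attention sub-layer and a feedforward sub-layer, each wrapped with a skip connection and layer norm. The weights are random stand-ins and details (multiple heads, dropout, masking) are omitted, so treat this as a skeleton rather than the paper’s exact implementation.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def attention(X, W_Q, W_K, W_V):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1]), axis=-1) @ V

def transformer_block(X, params):
    # Attention sub-layer, with skip connection and layer norm.
    X = layer_norm(X + attention(X, *params["attn"]))
    # Position-wise feedforward (MLP) sub-layer, also with skip connection and layer norm.
    W1, b1, W2, b2 = params["ffn"]
    X = layer_norm(X + np.maximum(0.0, X @ W1 + b1) @ W2 + b2)
    return X

rng = np.random.default_rng(0)
n_tokens, d_model, d_ff = 6, 32, 64
params = {
    "attn": [rng.normal(size=(d_model, d_model)) for _ in range(3)],
    "ffn": (rng.normal(size=(d_model, d_ff)), np.zeros(d_ff),
            rng.normal(size=(d_ff, d_model)), np.zeros(d_model)),
}
print(transformer_block(rng.normal(size=(n_tokens, d_model)), params).shape)   # (6, 32)
```
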
Questions?

35
Training

36
Training

● The original Transformer was trained to translate (English-German, English-French)

● The loss function was likelihood


○ How was the data collected? Likelihood of what, exactly?

● Learning rate schedule: Slow, then fast, then slower

● Regularization
○ Dropout
○ Label smoothing

37
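For the learning rate schedule, here is a sketch in the spirit of the original paper’s warmup-then-decay rule, lrate = d_model^-0.5 · min(step^-0.5, step · warmup_steps^-1.5); the constants below (d_model = 512, warmup_steps = 4000) are illustrative defaults.

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    # Linear warmup for the first warmup_steps updates, then decay ~ 1/sqrt(step):
    # the learning rate starts small, ramps up, and then slowly shrinks again.
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

for s in [1, 1000, 4000, 10000, 100000]:
    print(s, round(transformer_lr(s), 6))
```
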
Results

38
Interpretability of Attention

39
Also taken from the excellent blog post “The Illustrated Transformer” at https://jalammar.github.io/illustrated-transformer/. Be sure to check it out!
Today: Semi-Supervised, Human-Supervised

● There is a lot more unlabelled data than labelled data in the world today.
● Modern Transformers can train on mostly unlabelled data, which is a huge advantage.

● InstructGPT also uses reinforcement learning from human feedback (RLHF) after its main training phase.
● Essentially, this is a systematic methodology for fine-tuning a language model to catch
obvious errors, inappropriate or undesirable responses, etc.
○ Remember Tay’s Tweets?

● Could be good topics for future reading group sessions?

40
Questions?

41
Conclusions

42
Summary: Transformers Versus Predecessors

                      Recurrent Models       Sequence Convolution    Original Transformers

Main mechanism        Recurrence             Convolution             Attention
                      (+/- Attention)        (+/- Attention)

Computation           Sequential             Parallel                Parallel

Long sequences        Vanishing Gradient     Local Interactions      Dense Interactions

Position Dependence   Automatic              Automatic               Permutation Equivariant

43
Further Reading

● Attention is Not All You Need
○ Without skip connections, layer norm, and fully-connected layers, all the tokens get mapped to the same representation very quickly

● Scaling Laws for Neural Language Models
○ Early evidence that Transformers just keep getting better as you add more data and more compute power (more relevant than ever before!)

● There have been many iterations since the original Transformer:


○ Autoformer
○ Informer
○ Scaleformer
○ …
44
References

https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf

https://jalammar.github.io/illustrated-transformer/

https://nlp.seas.harvard.edu/2018/04/03/attention.html

https://stats.stackexchange.com/questions/421935/what-exactly-are-keys-queries-and-values-in-attention-mechanisms

https://www.forbes.com/sites/robtoews/2022/02/13/language-is-the-next-great-frontier-in-ai/?sh=b84b9ee5c506

https://bootcamp.uxdesign.cc/how-chatgpt-really-works-explained-for-non-technical-people-71efb078a5c9

https://www.borealisai.com/research-blogs/tutorial-14-transformers-i-introduction/#Motivation

45
Thanks for
Listening!
[email protected]
linkedin.com/in/martin-magill/
