Attention is All You Need: Explained
Martin Magill
March 31st 2023
Opportune Timing
https://futureoflife.org/open-letter/pause-giant-ai-experiments/
About Me
● Currently ML Researcher at Borealis AI in Toronto
○ Time series forecasting in the capital markets group
● PhD in mathematical modelling and computational science
○ Ontario Tech University in Prof. Hendrick de Haan’s cNAB.LAB
● Main research focus: Scientific machine learning
○ Mixing mathematical modelling with deep learning
○ More flexible than classical model-based methods
○ More accurate, reliable, and interpretable than purely data-driven methods
Suggested Rules of Engagement
● This is a large, semi-anonymous reading group
● The presentation aims to be accessible and interesting to anyone and everyone
● Planned pauses between sections for Q&A
Recap: Deep Learning and Natural Language Processing
Deep Learning for NLP
Input → Model → Output
● Text → More text
● Question → Answers
● Photo → Captions
● … → …
Deep Learning: Training
[Diagram: training as a student analogy. The exam is randomly generated; the student writes the exam, gets a grade, and learns from their mistakes.]
Training a deep neural network is a very-high-dimensional, nonlinear, nonconvex optimization problem. We almost always resort to gradient descent and its relatives.
But how do we “grade” NLP tasks?
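As a rough illustration of “learning from mistakes,” here is gradient descent on a made-up quadratic loss in NumPy; the loss, dimensions, and learning rate are arbitrary, and real training uses fancier relatives such as Adam.

```python
import numpy as np

# Toy gradient descent on a hypothetical quadratic-bowl loss
# (not the actual Transformer training objective).
def loss(w):
    return np.sum((w - 3.0) ** 2)

def grad(w):
    return 2.0 * (w - 3.0)

w = np.random.randn(5)                 # random initial parameters
learning_rate = 0.1
for step in range(100):
    w = w - learning_rate * grad(w)    # follow the negative gradient downhill

print(loss(w))                         # close to 0: the "student" improved
```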
Example Language Task
Try it yourself!
I wrote down a sentence and erased the last word. What is the missing word?
Once a liar, always a …
Once a liar, always a liar
Once a liar, always a trickster
Once a liar, always a toaster
Let’s Try Another One
Consider this photograph of the view of Toronto from the Borealis AI Toronto office.
This is a photograph of ________
● A city
● The sky
● A window pane
● A lake
● The horizon
What makes an answer “right”?
Be Careful What You Wish For
https://arxiv.org/pdf/2011.03395.pdf
Probabilistic NLP Paradigm
Model Input:
● Text: “Once a liar always a”
● Tokens: [“Once”, “a”, “li”, “ar”, “al”, “ways”, “a”] → [1875, 22, 658, 475, 32, 8889, 10]
● Embedding: [[0.11, 0.37, 0.002, ...], [0.07, 0.98, 1.55, ...], ...]
● + Encoding: [[1], [2], [3], ...]

Model Output:
● Task-dependent, but commonly probabilities over the vocabulary:
○ p(“liar”) = 0.82
○ p(“trickster”) = 0.17
○ p(“toaster”) = 0.01
○ …

The loss function is likelihood. The likelihood of what? Ask the dataset.
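To make the pipeline concrete, here is a minimal NumPy sketch of the probabilistic paradigm; the tiny vocabulary, the random weights, and the “mean-pool then project” model are all made up purely for illustration.

```python
import numpy as np

# Minimal sketch of the probabilistic NLP pipeline.
rng = np.random.default_rng(0)
vocab = ["once", "a", "liar", "always", "trickster", "toaster"]
token_ids = [0, 1, 2, 3, 1]                         # "once a liar always a"

d_model = 8
embedding = rng.normal(size=(len(vocab), d_model))  # learned lookup table
x = embedding[token_ids]                            # (seq_len, d_model)

# Pretend model: pool the sequence and produce one score (logit) per vocab word.
W_out = rng.normal(size=(d_model, len(vocab)))
logits = x.mean(axis=0) @ W_out

# Softmax turns scores into a probability distribution over the vocabulary.
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# The loss is the negative log-likelihood of the word the dataset says comes next.
target = vocab.index("liar")
nll = -np.log(probs[target])
print(dict(zip(vocab, probs.round(3))), round(nll, 3))
```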
Questions?
Before Transformers: Recurrent and Convolutional Models for Sequences
Basic RNN
[Diagram: a basic RNN processes the sequence one step at a time: Input 1 → Output 1, Input 2 → Output 2, Input 3 → Output 3, …]
The Vanishing Gradient Problem
● Naive recursion introduces problems when backpropagating through time
● The gradient features the product of the same Jacobian many times
● For basic RNN architectures, this product easily becomes extremely small
● However, modern RNN architectures like LSTMs or GRUs mostly resolve this problem
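The sketch below (random matrices, no real data) shows the mechanism: backpropagation through a basic RNN multiplies nearly the same Jacobian at every time step, so the gradient norm shrinks roughly geometrically with sequence length.

```python
import numpy as np

# Toy demonstration of the vanishing gradient in a basic RNN.
rng = np.random.default_rng(0)
d = 16
W = 0.5 * rng.normal(size=(d, d)) / np.sqrt(d)   # recurrent weights, modest scale

grad = np.eye(d)
for t in range(1, 51):                  # 50 steps back through time
    grad = W.T @ grad                   # ignoring the tanh factor, which only shrinks it further
    if t % 10 == 0:
        print(t, np.linalg.norm(grad))  # norm decays roughly geometrically
```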
Convolutional Sequence Models
● A more fundamental issue with RNNs is that outputs must be computed one at a time
○ Unavoidable serial computation
● An alternative is to use convolutional neural networks applied over the time axis
○ These can be computed in parallel across all inputs!
● However, convolutions are local, so tokens (words) in the input sequence only “interact” with
their neighbouring tokens
○ Hard/expensive to model long-range dependencies
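As a sketch of that locality, here is a 1-D convolution over the time axis in NumPy (kernel size, dimensions, and weights are arbitrary): every output position depends only on its immediate neighbours, but all positions can be computed at once.

```python
import numpy as np

# Toy 1-D convolution over the time axis.
rng = np.random.default_rng(0)
seq_len, d_in, d_out, kernel_size = 10, 4, 4, 3

x = rng.normal(size=(seq_len, d_in))
w = rng.normal(size=(kernel_size, d_in, d_out))

x_padded = np.pad(x, ((1, 1), (0, 0)))               # pad so output length matches input
out = np.stack([
    sum(x_padded[t + k] @ w[k] for k in range(kernel_size))
    for t in range(seq_len)                           # every t is independent -> parallelizable
])
print(out.shape)  # (10, 4); output t only "sees" inputs t-1, t, t+1
```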
Questions?
The Attention Mechanism
Queries, Keys, and Values
Motivated from information retrieval:
● I send the system a query Qi
● The system checks each query against its library of keys Kj
● It returns the value Vj

Example:
● I type “potato chips” in the search bar
● The system checks the metadata for all the products in the database
● The system returns a ranked list of the webpages for the most similar results

Given a query Qi and a library of keys Kj, we want to learn to pay attention to the most relevant values Vj.
General Attention Mechanisms
1. Compute the “similarity” of queries and keys
2. Convert similarities to weights with softmax
3. Weighted sum to “select” values
Flexible: If we parameterize a by a deep neural
network, it can be almost anything.
Clunky: If a is any arbitrary function, it will be difficult
to learn, expensive to evaluate, and hard to interpret.
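Here is a sketch of those three steps for one query, where the similarity function a is a tiny, randomly initialized MLP; all dimensions and weights are illustrative, and the point is only that a could be almost anything.

```python
import numpy as np

# General attention with an arbitrary learned similarity function a(q, k).
rng = np.random.default_rng(0)
d, n_keys = 4, 5
q = rng.normal(size=d)                       # one query
K = rng.normal(size=(n_keys, d))             # library of keys
V = rng.normal(size=(n_keys, d))             # corresponding values

W1 = rng.normal(size=(2 * d, 8))             # a(q, k) = tiny MLP (illustrative)
w2 = rng.normal(size=8)
def a(query, key):
    return np.tanh(np.concatenate([query, key]) @ W1) @ w2

scores = np.array([a(q, k) for k in K])      # 1. similarity of the query and each key
weights = np.exp(scores - scores.max())      # 2. softmax turns similarities into weights
weights /= weights.sum()
output = weights @ V                         # 3. weighted sum "selects" the values
print(weights.round(3), output.round(3))
```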
Dot-Product Similarity
● Recall dot-product similarity, a.k.a. cosine similarity
● When vectors are very similar (nearly parallel), the cosine similarity is nearly 1
● When they point in unrelated (roughly orthogonal) directions, it is nearly 0; when they point in opposite directions, it approaches -1
● Cosine similarity is normalized by the vector magnitudes, so it doesn’t depend on how long the vectors are
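In code, cosine similarity is just the dot product divided by the two magnitudes (a minimal sketch with made-up vectors):

```python
import numpy as np

# Cosine similarity: dot product normalized by the vector magnitudes.
def cosine_similarity(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

u = np.array([1.0, 2.0, 3.0])
print(cosine_similarity(u, 2 * u))                        #  1.0 (parallel)
print(cosine_similarity(u, np.array([3.0, 0.0, -1.0])))   #  0.0 (orthogonal)
print(cosine_similarity(u, -u))                           # -1.0 (opposite)
```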
Dot-Product Attention
● Basic idea: Use dot-product similarity in the attention mechanism
● We just need to project queries and keys into the same vector space
○ Transformers typically do this with learned linear transformations
○ Input X gets mapped to queries, keys, and values using three different matrices
● More efficient
○ Two simple linear transformations instead of one high-dimensional nonlinear function
Scaled Dot-Product Attention
● It’s important in deep models for every layer to preserve the scale of numerical information
● If input values are near 1 in magnitude, then output values should be too
● The dot product of random d_k-dimensional inputs with unit variance produces outputs with variance d_k
● Scaled dot-product attention simply divides the dot products by √d_k to encourage stability when d_k is large
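Putting the last two slides together, here is a sketch of scaled dot-product attention with learned Q/K/V projections; the shapes and random initializations are illustrative rather than the paper's exact setup.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Scaled dot-product attention with learned linear projections (sketch).
rng = np.random.default_rng(0)
seq_len, d_model, d_k = 6, 16, 8

X = rng.normal(size=(seq_len, d_model))      # input token representations
W_Q = rng.normal(size=(d_model, d_k))        # three learned matrices map X to
W_K = rng.normal(size=(d_model, d_k))        # queries, keys, and values
W_V = rng.normal(size=(d_model, d_k))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V
scores = Q @ K.T / np.sqrt(d_k)              # dividing by sqrt(d_k) keeps the variance near 1
attention = softmax(scores) @ V
print(attention.shape)                       # (6, 8)
```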
Questions?
The Transformer
The Transformer Architecture
Essentially, the transformer takes all the
pieces we’ve discussed so far and puts
them together.
There are a few more details to cover,
though, including some important tricks
that make it work.
Positional Encoding
● Attention layers are permutation equivariant
○ If you shuffle the order of the inputs, the outputs get shuffled in the same way
○ The inputs are treated as a set, not a sequence
● This is actually useful in some applications! But not for language.
○ I ate the salad you made for me
○ The salad I made for you ate me
● The standard solution is to encode the order of the tokens by adding sine waves of different frequencies to the token embeddings
○ Kind of weird, but it works
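Concretely, the sine waves follow the formulas from the original paper, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)); the sketch below uses placeholder embeddings and an arbitrary size.

```python
import numpy as np

# Sinusoidal positional encoding as in the original paper.
def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]              # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]          # even feature indices
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

embeddings = np.zeros((10, 16))                    # placeholder token embeddings
x = embeddings + positional_encoding(10, 16)       # the encoding is simply added
print(x.shape)                                     # (10, 16)
```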
Encoder-Decoder Mode
The original Transformer has two branches: an encoder and a decoder.
They are basically the same, but the decoder has special “encoder-decoder attention” blocks that take their keys and values from the output of the encoder.
Other Transformers are pure encoders (BERT) or pure decoders (GPT).
[Figure taken from the excellent blog post “The Illustrated Transformer” at https://jalammar.github.io/illustrated-transformer/]
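As a sketch of those encoder-decoder attention blocks (shapes and weights are illustrative): the queries come from the decoder's current state, while the keys and values come from the encoder's output.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Encoder-decoder ("cross") attention sketch.
rng = np.random.default_rng(0)
d_model, d_k = 16, 8
encoder_output = rng.normal(size=(7, d_model))     # e.g. 7 source-language tokens
decoder_state = rng.normal(size=(4, d_model))      # e.g. 4 target tokens generated so far

W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q = decoder_state @ W_Q                            # queries from the decoder
K, V = encoder_output @ W_K, encoder_output @ W_V  # keys and values from the encoder
out = softmax(Q @ K.T / np.sqrt(d_k)) @ V
print(out.shape)                                   # (4, 8): one output per decoder position
```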
Masked Scaled Dot-Product Attention
In various language tasks, we don’t want
the model to be able to look into the
future.
We can use a mask in the attention
mechanism to prevent this from
happening.
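A common way to do this (sketched below with random values) is a lower-triangular “causal” mask: set the scores for future positions to a very large negative number before the softmax, so they receive essentially zero attention weight.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Masked scaled dot-product attention: no peeking at future tokens.
rng = np.random.default_rng(0)
seq_len, d_k = 5, 8
Q = rng.normal(size=(seq_len, d_k))
K = rng.normal(size=(seq_len, d_k))
V = rng.normal(size=(seq_len, d_k))

scores = Q @ K.T / np.sqrt(d_k)
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
scores = np.where(causal_mask, scores, -1e9)       # future positions get ~zero weight
weights = softmax(scores)
print(weights.round(2))                            # lower-triangular attention pattern
```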
Multiple Attention Heads
Sometimes, a word can seemingly be
interpreted in multiple ways in a
sentence.
Other times, there are multiple
independent aspects of a word that are
worth representing.
Multiple attention heads enable this.
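The sketch below shows the idea with made-up dimensions: several small attention heads run in parallel on their own projections of the input, and their outputs are concatenated and mixed by one more learned matrix.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Multi-head attention sketch (dimensions and weights are illustrative).
rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 6, 16, 4
d_head = d_model // n_heads

X = rng.normal(size=(seq_len, d_model))
W_Q = rng.normal(size=(n_heads, d_model, d_head))
W_K = rng.normal(size=(n_heads, d_model, d_head))
W_V = rng.normal(size=(n_heads, d_model, d_head))
W_O = rng.normal(size=(d_model, d_model))          # mixes the concatenated heads

heads = []
for h in range(n_heads):                           # each head can focus on a different
    Q, K, V = X @ W_Q[h], X @ W_K[h], X @ W_V[h]   # aspect or interpretation of the input
    heads.append(softmax(Q @ K.T / np.sqrt(d_head)) @ V)

output = np.concatenate(heads, axis=-1) @ W_O
print(output.shape)                                # (6, 16)
```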
Layer Norm
We already discussed the importance of
keeping the numerical values close to 1
throughout the network.
Layer norm explicitly enforces this at
regular stops throughout the
architecture.
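A minimal sketch of what layer norm does to each token's feature vector (the learned scale and shift, gamma and beta, are set to their neutral values here):

```python
import numpy as np

# Layer norm: per token, rescale the features to zero mean and unit variance,
# then apply a learned scale (gamma) and shift (beta).
def layer_norm(x, gamma, beta, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

x = 100.0 * np.random.randn(6, 16)                 # badly scaled activations
y = layer_norm(x, gamma=np.ones(16), beta=np.zeros(16))
print(x.std(axis=-1).round(1))                     # around 100
print(y.std(axis=-1).round(2))                     # around 1
```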
Feedforward Layers and Skip Connections
Every transformer block ends with a
fully-connected feedforward/MLP layer.
Moreover, skip connections (as in
ResNet) are used liberally throughout
the architecture.
Without these, all the tokens end up
mapped to the exact same thing!
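Putting the pieces together, here is a sketch of one encoder block in the original post-norm arrangement, LayerNorm(x + Sublayer(x)); the random weights and small dimensions are placeholders, not the paper's configuration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

# One encoder block: attention and MLP sub-layers, each wrapped in a skip
# connection and layer norm (weights are random placeholders).
rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 6, 16, 64
X = rng.normal(size=(seq_len, d_model))

W_Q, W_K, W_V, W_O = (0.1 * rng.normal(size=(d_model, d_model)) for _ in range(4))
W_1 = 0.1 * rng.normal(size=(d_model, d_ff))
W_2 = 0.1 * rng.normal(size=(d_ff, d_model))

def attention(x):
    Q, K, V = x @ W_Q, x @ W_K, x @ W_V
    return softmax(Q @ K.T / np.sqrt(d_model)) @ V @ W_O

def feed_forward(x):
    return np.maximum(0.0, x @ W_1) @ W_2            # ReLU MLP applied to every token

x = layer_norm(X + attention(X))                     # skip connection around attention
x = layer_norm(x + feed_forward(x))                  # skip connection around the MLP
print(x.shape)                                       # (6, 16)
```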
Questions?
Training
Training
● The original Transformer was trained to translate (English-German, English-French)
● The loss function was likelihood
○ How was the data collected? Likelihood of what, exactly?
● Learning rate schedule: Slow, then fast, then slower
● Regularization
○ Dropout
○ Label smoothing
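The “slow, then fast, then slower” schedule is the one from the original paper: lr = d_model^(-0.5) · min(step^(-0.5), step · warmup_steps^(-1.5)), i.e. a linear warmup followed by inverse-square-root decay. A quick sketch:

```python
# Learning-rate schedule from the original Transformer paper.
def transformer_lr(step, d_model=512, warmup_steps=4000):
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

for step in [1, 1000, 4000, 20000, 100000]:
    print(step, round(transformer_lr(step), 6))   # ramps up to a peak at warmup, then decays
```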
Results
Interpretability of Attention
Also taken from the excellent blog post “The Illustrated Transformer” at https://jalammar.github.io/illustrated-transformer/. Be sure to check it out!
Today: Semi-Supervised, Human-Supervised
● There is a lot more unlabelled data than labelled data in the world today.
● Modern Transformers can train on mostly unlabelled data, which is a huge advantage.
● InstructGPT also uses reinforcement learning from human feedback (RLHF) after the main training phase.
● Essentially, this is a systematic methodology for fine-tuning a language model to catch
obvious errors, inappropriate or undesirable responses, etc.
○ Remember Tay’s Tweets?
● Could be good topics for future reading group sessions?
Questions?
Conclusions
Summary: Transformers Versus Predecessors
| | Recurrent Models | Sequence Convolution | Original Transformers |
| Main mechanism | Recurrence (+/- Attention) | Convolution (+/- Attention) | Attention |
| Computation | Sequential | Parallel | Parallel |
| Long sequences | Vanishing Gradient | Local Interactions | Dense Interactions |
| Position Dependence | Automatic | Automatic | Permutation Equivariant |
Further Reading
● Attention is Not All You Need
○ Without skip connections, layer norm, and fully-connected layers, all the tokens get mapped to the same representation very quickly
● Scaling Laws for Neural Language Models
○ Early evidence that Transformers just keep getting better as you add more data and more compute, which is more relevant now than ever before!
● There have been many iterations since the original Transformer:
○ Autoformer
○ Informer
○ Scaleformer
○ …
References
https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
https://jalammar.github.io/illustrated-transformer/
https://nlp.seas.harvard.edu/2018/04/03/attention.html
https://stats.stackexchange.com/questions/421935/what-exactly-are-keys-queries-and-values-in-attention-mechanisms
https://www.forbes.com/sites/robtoews/2022/02/13/language-is-the-next-great-frontier-in-ai/?sh=b84b9ee5c506
https://bootcamp.uxdesign.cc/how-chatgpt-really-works-explained-for-non-technical-people-71efb078a5c9
https://www.borealisai.com/research-blogs/tutorial-14-transformers-i-introduction/#Motivation
Thanks for
Listening!
[email protected]
linkedin.com/in/martin-magill/